Thread: Multicolumn index creation never completes on 9.0.1/Solaris
Note: I have this running now on a test box.  If someone responds in the
next couple of hours, I can run whatever diagnostics you want on it.
Otherwise I'll kill it off and start over with debug logging turned on.

Version: 9.0.1
Platform: Solaris 10u8 / ZFS over DAS 7-drive array
Severity: Serious
Reproducible? Always, on this machine

Description:

I've been running tests for a client on their old production machine.
Part of the test setup involves restoring a 70GB table (part of a 400GB
database).  Restoring this table always fails to complete (we terminate
it after 3 days).  Watching it closer this time, I've noticed that it's
hanging on the creation of a multicolumn index*:

    "write_log_accounttime_idx" btree (account_id, event_time)

Table def*:

              Table "public.write_log"
    Column     |           Type           |        Modifiers
---------------+--------------------------+------------------------
 log_id        | character varying        | not null
 account_id    | character varying        |
 write_id      | character varying        |
 event_time    | timestamp with time zone | not null default now()
 event_loc     | character varying(1024)  |
 source        | character varying(1024)  |
 detail        | character varying(1024)  |
 source_ip     | inet                     |
 event_length  | integer                  |
 keyword       | character varying        |

account_id is usually a 32-character hex string (a UUID).

Trying to restore this index ... or, indeed, doing *anything* else with
it, including ANALYZE or REINDEX ... takes longer to complete than we're
willing to monitor (I once waited 30 hours).  (I was able to test
ANALYZE/REINDEX because the first time I converted the database using
pg_upgrade.)

Interestingly, on the production database (where the index was
successfully created on 8.4.4), that index has no entries in the stats
table; i.e., pg_stat_user_indexes does not have a row for that index.
However, this database *did* restore on the same platform in 8.4.4.

The PK for this table restores successfully.  I haven't been able to
test other single-column indexes.

Watching the process of creating this index, it's certainly busy:

read(224, " o02\0\0E0C9 ]D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 P\f ^D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 P88E7D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\010 . ^D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0B0 r ^D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\010 q ^D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0C8F9 ^D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 hBE aD101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0C8D6 ^D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 `F8 ^D101\0\0\0".., 8192)  = 8192
write(219, " e\0\0\0\0\0\0\08082D4 V".., 8192) = 8192
read(224, " o02\0\08085 xD101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0E8 ] _D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 ( _ _D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0B080 _D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0F8A2 _D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 xB5EAD101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 @E4 _D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 \b `D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\090 ) `D101\0\0\0".., 8192)  = 8192
write(219, "\0 OA197E1 !01\0 ,\0\0\0".., 8192) = 8192
read(224, " o02\0\0B0 K `D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\01002 bD101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0188E `D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 HB0 `D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 HD2 `D101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 H U\vD201\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 x14 aD101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0C8B9 aD101\0\0\0".., 8192)  = 8192
read(224, " o02\0\010BB aD101\0\0\0".., 8192)  = 8192
read(224, " o02\0\0 h v aD101\0\0\0".., 8192)  = 8192
write(219, " ,\0\0\0F4\0E8 *05\0 ( @".., 8192) = 8192

This is also the kind of activity I'd expect to see for an index build.
But ... it goes on forever.  The current index build run has been going
for almost 100 hours.

More detail:

maintenance_work_mem = 1GB (out of 32GB RAM available)
9.0.1 was built from source, using Sun CC
For the current test, I'm running with fsync off (but I've had the same
issue with fsync on).

Ideas?

(* object names have been changed to preserve confidentiality)

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
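For reference, the pg_stat_user_indexes check described above amounts to
something like the following (a sketch; the index name is the renamed one
given above):

    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE indexrelname = 'write_log_accounttime_idx';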
Josh Berkus <josh@agliodbs.com> writes:
> Note: I have this running now on a test box. If someone responds in the
> next couple hours, I can run whatever diagnostics you want on it.
> Otherwise I'll kill it off and start over with debug logging turned on.

trace_sort would be interesting.

			regards, tom lane
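For anyone following along, trace_sort is a developer GUC; a minimal way
to enable it for just the session doing the build (assuming the server
was compiled with TRACE_SORT, which is the default) is:

    SET trace_sort = on;
    SET client_min_messages = log;   -- optionally echo the LOG lines to the client too
    CREATE INDEX write_log_accounttime_idx ON write_log (account_id, event_time);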
On 1/25/11 1:21 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Note: I have this running now on a test box. If someone responds in the
>> next couple hours, I can run whatever diagnostics you want on it.
>> Otherwise I'll kill it off and start over with debug logging turned on.
>
> trace_sort would be interesting.

OK, I'll need to kill it and restart.  Will go ahead.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
> trace_sort would be interesting.

This is it so far:

LOG: begin index sort: unique = f, workMem = 1048576, randomAccess = f
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: switching to external sort with 3745 tapes: CPU 9.06s/6.65u sec elapsed 21.68 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 1 to tape 0: CPU 33.39s/149.02u sec elapsed 190.11 sec
LOG: finished writing run 2 to tape 1: CPU 62.72s/371.06u sec elapsed 443.16 sec
LOG: finished writing run 3 to tape 2: CPU 91.04s/599.43u sec elapsed 701.37 sec
LOG: finished writing run 4 to tape 3: CPU 120.95s/823.59u sec elapsed 956.67 sec

If it's going to take 3 minutes each to write each of 3745 tapes, that
means completing in around 9 days.

I wanted to see what this looks like with a single-column index, so I
did one on event_time, which was even *worse*:

LOG: begin index sort: unique = f, workMem = 1048576, randomAccess = f
STATEMENT: create index "write_log_time_idx" on write_log (event_time);
LOG: switching to external sort with 3745 tapes: CPU 14.45s/10.87u sec elapsed 26.19 sec
LOG: finished writing run 1 to tape 0: CPU 135.32s/302.18u sec elapsed 447.44 sec

... run 2 didn't complete in even 1/2 hour.

So the problem isn't multicolumn indexes, it's indexes on a table this
large in general (a VARCHAR index wasn't much faster to build).  I'd
guess that the PK index finished only because the table might have been
in that rough physical order.

For whatever reason, tape sort is being extra slow on this build on
Solaris 10; did we change anything between 8.4 and 9.0?  Or is this
possibly the build I used?  I'll try to run trace_sort on an 8.4.4 copy
of the database.

Oh, FWIW, the rough number of rows in the table: 1 486 530 000

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
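For context, a quick way to pull the heap size and the planner's row
estimate for the table (a sketch; reltuples will be stale or zero if
ANALYZE has never completed on this restored copy):

    SELECT pg_size_pretty(pg_relation_size('write_log')) AS heap_size,
           reltuples::bigint AS approx_rows
    FROM pg_class
    WHERE relname = 'write_log';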
On 1/25/11 1:21 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Note: I have this running now on a test box. If someone responds in the
>> next couple hours, I can run whatever diagnostics you want on it.
>> Otherwise I'll kill it off and start over with debug logging turned on.
>
> trace_sort would be interesting.

OK, either tape sort is broken or trace_sort is.  The larger my
maint_work_mem, the MORE tapes it wants to use:

LOG: begin index sort: unique = f, workMem = 4194304, randomAccess = f
LOG: switching to external sort with 14980 tapes: CPU 33.33s/24.23u sec elapsed 58.86 sec

LOG: begin index sort: unique = f, workMem = 131072, randomAccess = f
LOG: switching to external sort with 469 tapes: CPU 1.87s/1.48u sec elapsed 3.47 sec

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> LOG: finished writing run 1 to tape 0: CPU 33.39s/149.02u sec elapsed 190.11 sec
> LOG: finished writing run 2 to tape 1: CPU 62.72s/371.06u sec elapsed 443.16 sec
> LOG: finished writing run 3 to tape 2: CPU 91.04s/599.43u sec elapsed 701.37 sec
> LOG: finished writing run 4 to tape 3: CPU 120.95s/823.59u sec elapsed 956.67 sec

> If it's going to take 3 minutes each to write each of 3745 tapes, that
> means completing in around 9 days.

Your logic has nothing to do with what is actually happening.  Could we
have a little bit more patience to see what happens next?

> ... run 2 didn't complete in even 1/2 hour.

That's *good*, not bad.  Long runs mean fewer merge steps.

(Memo to -hackers: what we lack in this trace is any indication of how
long the runs are.  Maybe should add that.  Knuth claims the initial
runs should amount to about 2X maintenance_work_mem on average, but
maybe there's something unusual about the data distribution here.)

			regards, tom lane
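A back-of-the-envelope illustration of that rule of thumb, under two loud
assumptions (initial runs really average ~2x maintenance_work_mem, and the
sorted data footprint is roughly the 70GB heap size, even though index
tuples are narrower than heap tuples), so it is only an order-of-magnitude
sketch:

    SELECT ceil(70 * 1024 / (2 * 1024.0)) AS rough_expected_initial_runs;  -- ~35 runs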
>> If it's going to take 3 minutes each to write each of 3745 tapes, that
>> means completing in around 9 days.
>
> Your logic has nothing to do with what is actually happening.  Could we
> have a little bit more patience to see what happens next?

OK, I'll let it run overnight and give you the trace output.

> (Memo to -hackers: what we lack in this trace is any indication of how
> long the runs are.  Maybe should add that.  Knuth claims the initial
> runs should amount to about 2X maintenance_work_mem on average, but
> maybe there's something unusual about the data distribution here.)

Well, for the straight test on event_time, the rows in the table should
have been more-or-less physically already in order.  This table is
mostly inserts, and event_time mostly corresponds to current server
time.  For the composite index, rows would be more randomly distributed.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
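One way to quantify that physical-ordering hunch, if ANALYZE ever
completes on this copy, is the planner's correlation statistic (a sketch;
values near 1.0 mean the heap is close to the column's sort order):

    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 'write_log'
      AND attname IN ('event_time', 'account_id');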
Josh Berkus <josh@agliodbs.com> writes:
> OK, either tape sort is broken or trace_sort is.  The larger my
> maint_work_mem, the MORE tapes it wants to use:

Yeah, that's right.  More tapes = better, or at least that was the
conclusion we came to some years ago.  It doesn't necessarily use all
those tapes --- it's hoping it will find that the data sorts into fewer
runs than that, and it can go directly to the final merge without any
intermediate merge passes.

It's conceivable that there's an upper bound to the number of tapes we
should try to use, but you really haven't provided any data that would
either confirm or refute that idea.

			regards, tom lane
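The tape counts reported upthread do scale linearly with workMem:
dividing the workMem figures (in kB) by the tape counts gives roughly
280 kB of sort memory per tape slot in each case, which can be checked
with a trivial query:

    SELECT 1048576 / 3745.0  AS kb_per_tape_1gb,     -- ~280
           4194304 / 14980.0 AS kb_per_tape_4gb,     -- ~280
           131072  / 469.0   AS kb_per_tape_128mb;   -- ~279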
Josh Berkus <josh@agliodbs.com> writes:
> Well, for the straight test on event_time, the rows in the table should
> have been more-or-less physically already in order.  This table is
> mostly inserts, and event_time mostly corresponds to current server time.

That would result in a very long initial run (probably all or most of
the data), which fits with your observation so far.  But we don't know
what happens after that.

BTW, are you paying attention to vmstat or a similar measure of the
amount of I/O and CPU work getting done?  vmstat data ought to make it
possible to estimate the size of those runs, even though the trace_sort
logic isn't telling us.

			regards, tom lane
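Besides vmstat, the spill files themselves can be watched from inside the
database.  A rough sketch (superuser only; it assumes the sort's temp
files live under the default tablespace's base/pgsql_tmp directory, which
must already exist or pg_ls_dir will error out):

    SELECT count(*) AS temp_files,
           pg_size_pretty(sum((pg_stat_file('base/pgsql_tmp/' || f)).size)::bigint) AS temp_bytes
    FROM pg_ls_dir('base/pgsql_tmp') AS f;

Re-running that every minute or so gives a crude picture of how fast run
data is being written out.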
> Your logic has nothing to do with what is actually happening.  Could we
> have a little bit more patience to see what happens next?

So, after 15 hours:

LOG: finished writing run 20 to tape 19: CPU 588.86s/4484.35u sec elapsed 5160.97 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 21 to tape 20: CPU 620.46s/4712.51u sec elapsed 5435.63 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 22 to tape 21: CPU 654.01s/4953.74u sec elapsed 5726.87 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 23 to tape 22: CPU 688.84s/5204.24u sec elapsed 6029.41 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 24 to tape 23: CPU 721.71s/5442.40u sec elapsed 6316.64 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 25 to tape 24: CPU 757.74s/5708.86u sec elapsed 6636.64 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 26 to tape 25: CPU 790.87s/5946.83u sec elapsed 6924.41 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);
LOG: finished writing run 27 to tape 26: CPU 823.64s/6186.06u sec elapsed 7212.11 sec
STATEMENT: create index "write_log_accounttime_idx" on write_log (account_id, event_time);

... I'll point out that the elapsed times which trace_sort is giving me
are clearly not clock times; I started this index run at 7pm PST
yesterday and it's been 15 hours, not 2 as the elapsed time would
suggest.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> ... I'll point out that the elapsed times which trace_sort is giving me
> are clearly not clock times; I started this index run at 7pm PST
> yesterday and it's been 15 hours, not 2 as the elapsed time would suggest.

[ raised eyebrow... ]  Those numbers are the differences between two
gettimeofday() readings.  It's really really hard to believe that that's
wrong, unless there's something seriously wrong with your machine.
Now, the starting time is the time at which tuplesort_begin_common was
called, but that should be more or less immediately at the start of a
CREATE INDEX command.

FWIW, I did a test last night on the time to sort some random
(varchar, timestamptz) data using git HEAD.  I didn't have enough disk
space to sort 1.4 billion rows, but I tested with 200 million rows and
1GB maintenance_work_mem, and was able to create an index in 2424
seconds (~40 minutes) --- this on a pretty generic desktop machine.
I don't see a reason to think the runtime for 1.4 billion would have
been over 5 hours.  The log output for my test looked like

LOG: begin index sort: unique = f, workMem = 1048576, randomAccess = f
STATEMENT: create index ii on big (account_id, event_time);
LOG: switching to external sort with 3745 tapes: CPU 1.43s/4.85u sec elapsed 8.17 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 1 to tape 0: CPU 5.33s/127.78u sec elapsed 145.15 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 2 to tape 1: CPU 9.97s/273.53u sec elapsed 311.10 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 3 to tape 2: CPU 14.83s/422.90u sec elapsed 483.71 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 4 to tape 3: CPU 19.77s/572.60u sec elapsed 653.36 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 5 to tape 4: CPU 24.46s/722.20u sec elapsed 825.30 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 6 to tape 5: CPU 29.37s/871.97u sec elapsed 996.62 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 7 to tape 6: CPU 34.21s/1020.82u sec elapsed 1168.62 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 8 to tape 7: CPU 38.92s/1169.84u sec elapsed 1337.46 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 9 to tape 8: CPU 43.48s/1319.35u sec elapsed 1508.72 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 10 to tape 9: CPU 48.31s/1468.19u sec elapsed 1678.14 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 11 to tape 10: CPU 53.08s/1617.21u sec elapsed 1847.99 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: performsort starting: CPU 55.80s/1709.01u sec elapsed 1953.27 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing run 12 to tape 11: CPU 56.29s/1748.25u sec elapsed 1993.31 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: finished writing final run 13 to tape 12: CPU 56.59s/1768.89u sec elapsed 2014.48 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: performsort done (except 13-way final merge): CPU 57.02s/1771.57u sec elapsed 2018.39 sec
STATEMENT: create index ii on big (account_id, event_time);
LOG: external sort ended, 1073478 disk blocks used: CPU 81.28s/2104.60u sec elapsed 2423.59 sec
STATEMENT: create index ii on big (account_id, event_time);

Your machine seems to be dumping a run about once every 250-300 seconds,
which is about half the speed of mine, which is a bit odd if it's big
iron.  (I wonder whether you have a non-C locale selected ...)  But that
would still extrapolate to no worse than 10 hours to perform the sort,
assuming the runs are about the same size as mine, which they ought to
be given the same maintenance_work_mem setting.

It might be worth setting log_line_prefix = %m so you can check whether
(a) the "begin index sort" message comes out right away, and (b) whether
the trace_sort measurements really differ from clock time or not.

			regards, tom lane
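Tom's data-generation step isn't shown; a hypothetical way to build a
comparable 200-million-row (varchar, timestamptz) test table, with md5()
supplying 32-character hex strings like the account_id values described
upthread, would be something along these lines:

    -- Assumed table and index names ("big", "ii") taken from the log above.
    CREATE TABLE big (
        account_id  varchar,
        event_time  timestamptz
    );

    -- Fill with semi-random data: hex strings and timestamps spread over a year.
    INSERT INTO big
    SELECT md5(trunc(random() * 1000000)::text),
           now() - random() * interval '365 days'
    FROM generate_series(1, 200000000);

    SET maintenance_work_mem = '1GB';
    SET trace_sort = on;
    CREATE INDEX ii ON big (account_id, event_time);

Note that log_line_prefix = '%m' has to go into postgresql.conf and be
reloaded; it can't be set per-session.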
Tom,

> [ raised eyebrow... ]  Those numbers are the differences between two
> gettimeofday() readings.  It's really really hard to believe that that's
> wrong, unless there's something seriously wrong with your machine.

Actually, you're right ... I don't know at what time the run 27 message
was posted.  It's possible that it finished run 27 at 9pm last night ...

Oh!  Actually, it only *did* 27 runs.  So it actually completed building
the index.  I'd expected trace_sort to give me some kind of completion
message; apologies for not checking all screen windows.

So, why didn't the index build complete (after more than 36 hours) when
I ran it as part of pg_restore?  I guess this goes back to being a
pg_restore problem and not an index build problem.

> FWIW, I did a test last night on the time to sort some random
> (varchar, timestamptz) data using git HEAD.  I didn't have enough disk
> space to sort 1.4 billion rows, but I tested with 200 million rows and
> 1GB maintenance_work_mem, and was able to create an index in 2424
> seconds (~40 minutes) --- this on a pretty generic desktop machine.
> I don't see a reason to think the runtime for 1.4 billion would have
> been over 5 hours.  The log output for my test looked like

2+ hours is then still very slow considering the machine I'm on compared
to yours.  I wonder if we should maybe be using GCC instead of Sun CC.

Oh, that's why: it's a SPARC processor.  Slowness explained, then.

> Your machine seems to be dumping a run about once every 250-300 seconds,
> which is about half the speed of mine, which is a bit odd if it's big
> iron.  (I wonder whether you have a non-C locale selected ...)

See above.

So, as usual, I've completely mislocated the bug.  I'll need to rerun
the pg_restore and actually diagnose *that*.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> Oh!  Actually, it only *did* 27 runs.  So it actually completed building
> the index.  I'd expected trace_sort to give me some kind of completion
> message; apologies for not checking all screen windows.

Huh, there should be an "external sort ended" message, as well as some
other ones once it finishes writing the initial runs.

> So, why didn't the index build complete (after more than 36 hours) when
> I ran it as part of pg_restore?  I guess this goes back to being a
> pg_restore problem and not an index build problem.

First thing that comes to mind is to wonder what maintenance_work_mem
setting was in effect for the pg_restore run.

			regards, tom lane
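A quick way to see what value a session is actually running with, and
where that value came from (a sketch; run it in a session connected the
same way pg_restore connects, since per-role or per-database ALTER ...
SET overrides would show up in the source column):

    SELECT name, setting, unit, source
    FROM pg_settings
    WHERE name = 'maintenance_work_mem';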
On 1/26/11 12:13 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Oh!  Actually, it only *did* 27 runs.  So it actually completed building
>> the index.  I'd expected trace_sort to give me some kind of completion
>> message; apologies for not checking all screen windows.
>
> Huh, there should be an "external sort ended" message, as well as some
> other ones once it finishes writing the initial runs.

Oh, fooled by log rotation:

LOG: finished writing run 83 to tape 82: CPU 2600.37s/19449.04u sec elapsed 22402.36 sec
LOG: finished writing run 84 to tape 83: CPU 2630.31s/19674.93u sec elapsed 22659.40 sec
LOG: finished writing run 85 to tape 84: CPU 2658.46s/19905.44u sec elapsed 22919.25 sec
LOG: performsort starting: CPU 2686.35s/20121.46u sec elapsed 23164.61 sec
LOG: finished writing run 86 to tape 85: CPU 2686.48s/20127.46u sec elapsed 23170.74 sec
LOG: finished writing final run 87 to tape 86: CPU 2688.70s/20220.72u sec elapsed 23266.35 sec
LOG: performsort done (except 87-way final merge): CPU 2691.55s/20224.35u sec elapsed 23277.44 sec
LOG: external sort ended, 7991989 disk blocks used: CPU 3237.80s/27221.09u sec elapsed 31382.50 sec

So, a total of almost 9 hours.  Clearly there's something wrong here; if
I have time I'll try a recompile with GCC and see how it does.

Collation is 'C'.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
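For completeness, the collation claim can be verified from the server
itself (a sketch; on 8.4 and later the per-database settings are visible
in pg_database):

    SHOW lc_collate;

    SELECT datname, datcollate, datctype
    FROM pg_database
    WHERE datname = current_database();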