Thread: Multicolun index creation never completes on 9.0.1/solaris

Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
Note: I have this running now on a test box.  If someone responds in the
next couple hours, I can run whatever diagnostics you want on it.
Otherwise I'll kill it off and start over with debug logging turned on.

Version: 9.0.1
Platform: Solaris 10u8 / ZFS over DAS 7-drive array
Severity: Serious
Reproduceable? Always, on this machine
Description:

I've been running tests for a client on their old production machine.
Part of the test setup involves restoring a 70GB table (part of a 400GB
database).  Restoring this table always fails to complete (we terminate
it after 3 days).

Watching it closer this time, I've noticed that it's hanging on the
creation of a multicolumn index*:

"write_log_accounttime_idx" btree (account_id, event_time)

Table def*:

                   Table "public.write_log"
     Column     |           Type           |       Modifiers
----------------+--------------------------+------------------------
 log_id         | character varying        | not null
 account_id     | character varying        |
 write_id       | character varying        |
 event_time     | timestamp with time zone | not null default now()
 event_loc      | character varying(1024)  |
 source         | character varying(1024)  |
 detail         | character varying(1024)  |
 source_ip      | inet                     |
 event_length   | integer                  |
 keyword        | character varying        |

account_id is usually a 32-character hex string (a UUID).

When trying to restore this index ... or, indeed, do *anything* else
with in including ANALYZE or REINDEX ... takes longer than we're willing
to monitor in order to complete (I once waited 30 hours).  (I was able
to test Analyze/reindex because the first time I converted the database
using pg_upgrade).

Interestingly, on the production database (where the index was
successfully created on 8.4.4., that index has no entries in the stats
table, i.e. pg_stat_user_indexes does not have a row for that index.
However, this database *did* restore on the same platform in 8.4.4.

The PK for this table restores successfully.  I haven't been able to
test other single-column indexes.

Watching the process of creating this index, it's certainly busy:

read(224, " o02\0\0E0C9 ]D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 P\f ^D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 P88E7D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\010 . ^D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0B0 r ^D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\010 q ^D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0C8F9 ^D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 hBE aD101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0C8D6 ^D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 `F8 ^D101\0\0\0".., 8192)   = 8192
write(219, " e\0\0\0\0\0\0\08082D4 V".., 8192)  = 8192
read(224, " o02\0\08085 xD101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0E8 ] _D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 ( _ _D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0B080 _D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0F8A2 _D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 xB5EAD101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 @E4 _D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0  \b `D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\090 ) `D101\0\0\0".., 8192)   = 8192
write(219, "\0 OA197E1 !01\0 ,\0\0\0".., 8192)  = 8192
read(224, " o02\0\0B0 K `D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\01002 bD101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0188E `D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 HB0 `D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 HD2 `D101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 H U\vD201\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 x14 aD101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0C8B9 aD101\0\0\0".., 8192)   = 8192
read(224, " o02\0\010BB aD101\0\0\0".., 8192)   = 8192
read(224, " o02\0\0 h v aD101\0\0\0".., 8192)   = 8192
write(219, " ,\0\0\0F4\0E8 *05\0 ( @".., 8192)  = 8192

This is also the kind of activity I'd expect to see for an index build.
 But ... it goes on forever.  The current index build run has been going
for almost 100 hours.

More detail:
    mainentance_work_mem = 1GB (out of 32GB RAM available)
    9.0.1 was built from source, using Sun CC
    for the current test, I'm running with fsync off
        (but I've had the same issue with fsync on)

Ideas?

(* object names have been changed to preserve confidentiality)

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Note: I have this running now on a test box.  If someone responds in the
> next couple hours, I can run whatever diagnostics you want on it.
> Otherwise I'll kill it off and start over with debug logging turned on.

trace_sort would be interesting.

            regards, tom lane

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
On 1/25/11 1:21 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Note: I have this running now on a test box.  If someone responds in the
>> next couple hours, I can run whatever diagnostics you want on it.
>> Otherwise I'll kill it off and start over with debug logging turned on.
>
> trace_sort would be interesting.

OK, I'll need to kill it and restart.  Will go ahead.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
> trace_sort would be interesting.

This is it so far:

LOG:  begin index sort: unique = f, workMem = 1048576, randomAccess = f
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  switching to external sort with 3745 tapes: CPU 9.06s/6.65u sec
elapsed 21.68 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 1 to tape 0: CPU 33.39s/149.02u sec elapsed
190.11 sec

LOG:  finished writing run 2 to tape 1: CPU 62.72s/371.06u sec elapsed
443.16 sec

LOG:  finished writing run 3 to tape 2: CPU 91.04s/599.43u sec elapsed
701.37 sec

LOG:  finished writing run 4 to tape 3: CPU 120.95s/823.59u sec elapsed
956.67 sec

If it's going to take 3 minutes each to write each of 3745 tapes, that
means completing in around 9 days.

I wanted to see what this looks like with a single-column index, so I
did one on event_time, which was even *worse*:

LOG:  begin index sort: unique = f, workMem = 1048576, randomAccess = f
STATEMENT:  create index "write_log_time_idx" on write_log (event_time);

LOG:  switching to external sort with 3745 tapes: CPU 14.45s/10.87u sec
elapsed 26.19 sec

LOG:  finished writing run 1 to tape 0: CPU 135.32s/302.18u sec elapsed
447.44 sec

... run 2 didn't complete in even 1/2 hour.

So the problem isn't multicolumn indexes, it's indexes on a table this
large in general (a VARCHR index wasn't much faster to build).  I'd
guess that the PK index finished only because the table might have been
in that rough physical order.

For whatever reason, tape sort is being extra slow on this build on
Solaris10; did we change anything between 8.4 and 9.0?  Or is this
possibly the build I used?

I'll try to run a sort_trace on an 8.4.4 copy of the database.

Oh, FWIW, the rough number of rows in the table:
1 486 530 000

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
On 1/25/11 1:21 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Note: I have this running now on a test box.  If someone responds in the
>> next couple hours, I can run whatever diagnostics you want on it.
>> Otherwise I'll kill it off and start over with debug logging turned on.
>
> trace_sort would be interesting.

OK, either tape sort is broken or trace_sort is.  The larger my
maint_work_mem, the MORE tapes it wants to use:

LOG:  begin index sort: unique = f, workMem = 4194304, randomAccess = f
LOG:  switching to external sort with 14980 tapes: CPU 33.33s/24.23u sec
elapsed 58.86 sec

LOG:  begin index sort: unique = f, workMem = 131072, randomAccess = f
LOG:  switching to external sort with 469 tapes: CPU 1.87s/1.48u sec
elapsed 3.47 sec


--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> LOG:  finished writing run 1 to tape 0: CPU 33.39s/149.02u sec elapsed
> 190.11 sec

> LOG:  finished writing run 2 to tape 1: CPU 62.72s/371.06u sec elapsed
> 443.16 sec

> LOG:  finished writing run 3 to tape 2: CPU 91.04s/599.43u sec elapsed
> 701.37 sec

> LOG:  finished writing run 4 to tape 3: CPU 120.95s/823.59u sec elapsed
> 956.67 sec

> If it's going to take 3 minutes each to write each of 3745 tapes, that
> means completing in around 9 days.

Your logic has nothing to do with what is actually happening.  Could we
have a little bit more patience to see what happens next?

> ... run 2 didn't complete in even 1/2 hour.

That's *good*, not bad.  Long runs mean fewer merge steps.

(Memo to -hackers: what we lack in this trace is any indication of how
long the runs are.  Maybe should add that.  Knuth claims the initial
runs should amount to about 2X maintenance_work_mem on average, but
maybe there's something unusual about the data distribution here.)

            regards, tom lane

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
>> If it's going to take 3 minutes each to write each of 3745 tapes, that
>> means completing in around 9 days.
>
> Your logic has nothing to do with what is actually happening.  Could we
> have a little bit more patience to see what happens next?

OK, I'll let it run overnight and give you the trace output.


> (Memo to -hackers: what we lack in this trace is any indication of how
> long the runs are.  Maybe should add that.  Knuth claims the initial
> runs should amount to about 2X maintenance_work_mem on average, but
> maybe there's something unusual about the data distribution here.)

Well, for the straight test on event_time, the rows in the table should
have been more-or-less physically already in order.  This table is
most-insert, and event_time mostly corresponds to current server time.

For the composite index, rows would be more randomly distributed.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> OK, either tape sort is broken or trace_sort is.  The larger my
> maint_work_mem, the MORE tapes it wants to use:

Yeah, that's right.  More tapes = better, or at least that was the
conclusion we came to some years ago.  It doesn't necessarily use all
those tapes --- it's hoping it will find that the data sorts into fewer
runs than that, and it can go directly to the final merge without any
intermediate merge passes.

It's conceivable that there's an upper bound to the number of tapes we
should try to use, but you really haven't provided any data that would
either confirm or refute that idea.

            regards, tom lane

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Well, for the straight test on event_time, the rows in the table should
> have been more-or-less physically already in order.  This table is
> most-insert, and event_time mostly corresponds to current server time.

That would result in a very long initial run (probably all or most of
the data), which fits with your observation so far.  But we don't know
what happens after that.

BTW, are you paying attention to vmstat or similar measure of the amount
of I/O and CPU work getting done?  vmstat data ought to make it possible
to estimate the size of those runs, even though the trace_sort logic
isn't telling us.

            regards, tom lane

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
> Your logic has nothing to do with what is actually happening.  Could we
> have a little bit more patience to see what happens next?

So, after 15 hours:

LOG:  finished writing run 20 to tape 19: CPU 588.86s/4484.35u sec
elapsed 5160.97 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 21 to tape 20: CPU 620.46s/4712.51u sec
elapsed 5435.63 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 22 to tape 21: CPU 654.01s/4953.74u sec
elapsed 5726.87 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 23 to tape 22: CPU 688.84s/5204.24u sec
elapsed 6029.41 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 24 to tape 23: CPU 721.71s/5442.40u sec
elapsed 6316.64 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 25 to tape 24: CPU 757.74s/5708.86u sec
elapsed 6636.64 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 26 to tape 25: CPU 790.87s/5946.83u sec
elapsed 6924.41 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);
LOG:  finished writing run 27 to tape 26: CPU 823.64s/6186.06u sec
elapsed 7212.11 sec
STATEMENT:  create index "write_log_accounttime_idx" on write_log
(account_id, event_time);

... I'll point out that the elapsed times which trace_sort is giving me
are clearly not clock times; I started this index run at 7pm PST
yesterday and it's been 15 hours, not 2 as the elapsed time would suggest.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> ... I'll point out that the elapsed times which trace_sort is giving me
> are clearly not clock times; I started this index run at 7pm PST
> yesterday and it's been 15 hours, not 2 as the elapsed time would suggest.

[ raised eyebrow... ]  Those numbers are the differences between two
gettimeofday() readings.  It's really really hard to believe that that's
wrong, unless there's something seriously wrong with your machine.

Now, the starting time is the time at which tuplesort_begin_common was
called, but that should be more or less immediately at the start of a
CREATE INDEX command.

FWIW, I did a test last night on the time to sort some random
(varchar, timestamptz) data using git HEAD.  I didn't have enough disk
space to sort 1.4 billion rows, but I tested with 200 million rows and
1GB maintenance_work_mem, and was able to create an index in 2424
seconds (~40 minutes) --- this on a pretty generic desktop machine.
I don't see a reason to think the runtime for 1.4 billion would have
been over 5 hours.  The log output for my test looked like

LOG:  begin index sort: unique = f, workMem = 1048576, randomAccess = f
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  switching to external sort with 3745 tapes: CPU 1.43s/4.85u sec elapsed 8.17 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 1 to tape 0: CPU 5.33s/127.78u sec elapsed 145.15 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 2 to tape 1: CPU 9.97s/273.53u sec elapsed 311.10 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 3 to tape 2: CPU 14.83s/422.90u sec elapsed 483.71 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 4 to tape 3: CPU 19.77s/572.60u sec elapsed 653.36 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 5 to tape 4: CPU 24.46s/722.20u sec elapsed 825.30 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 6 to tape 5: CPU 29.37s/871.97u sec elapsed 996.62 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 7 to tape 6: CPU 34.21s/1020.82u sec elapsed 1168.62 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 8 to tape 7: CPU 38.92s/1169.84u sec elapsed 1337.46 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 9 to tape 8: CPU 43.48s/1319.35u sec elapsed 1508.72 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 10 to tape 9: CPU 48.31s/1468.19u sec elapsed 1678.14 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 11 to tape 10: CPU 53.08s/1617.21u sec elapsed 1847.99 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  performsort starting: CPU 55.80s/1709.01u sec elapsed 1953.27 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing run 12 to tape 11: CPU 56.29s/1748.25u sec elapsed 1993.31 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  finished writing final run 13 to tape 12: CPU 56.59s/1768.89u sec elapsed 2014.48 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  performsort done (except 13-way final merge): CPU 57.02s/1771.57u sec elapsed 2018.39 sec
STATEMENT:  create index ii on big (account_id, event_time);
LOG:  external sort ended, 1073478 disk blocks used: CPU 81.28s/2104.60u sec elapsed 2423.59 sec
STATEMENT:  create index ii on big (account_id, event_time);

Your machine seems to be dumping a run about once every 250-300 seconds,
which is about half the speed of mine, which is a bit odd if it's big
iron.  (I wonder whether you have a non-C locale selected ...)  But that
would still extrapolate to no worse than 10 hours to perform the sort,
assuming the runs are about the same size as mine, which they ought to
be given the same maintenance_work_mem setting.

It might be worth setting log_line_prefix = %m so you can check whether
(a) the "begin index sort" message comes out right away, and (b) whether
the trace_sort measurements really differ from clock time or not.

            regards, tom lane

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
Tom,

> [ raised eyebrow... ]  Those numbers are the differences between two
> gettimeofday() readings.  It's really really hard to believe that that's
> wrong, unless there's something seriously wrong with your machine.

Actually, you're right ... I don't know what time the run27 message was
posted.  It's possible that it finished run27 at 9pm last night ...

Oh!  Actually, it only *did* 27 runs.  So it actually completed building
the index.  I'd expected trace_sort to give me some kind of completion
message; apologies for not checking all screen windows.

So, why didn't the index build complete (after more than 36 hours) when
I ran it as part of pg_restore?  I guess this goes back to being a
pg_restore problem and not an index build problem.

> FWIW, I did a test last night on the time to sort some random
> (varchar, timestamptz) data using git HEAD.  I didn't have enough disk
> space to sort 1.4 billion rows, but I tested with 200 million rows and
> 1GB maintenance_work_mem, and was able to create an index in 2424
> seconds (~40 minutes) --- this on a pretty generic desktop machine.
> I don't see a reason to think the runtime for 1.4 billion would have
> been over 5 hours.  The log output for my test looked like

2+ hours is then still very slow considering the machine I'm on compared
to yours.   I wonder if we maybe should be using GCC instead of SunCC.

Oh, that's why: it's a SPARC processor.  Slowness explained then.

> Your machine seems to be dumping a run about once every 250-300 seconds,
> which is about half the speed of mine, which is a bit odd if it's big
> iron.  (I wonder whether you have a non-C locale selected ...)

See above.

So, as usual, I've completely mislocated the bug.  I'll need to rerun
the pg_restore and actually diagnose *that*.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Oh!  Actually, it only *did* 27 runs.  So it actually completed building
> the index.  I'd expected trace_sort to give me some kind of completion
> message; apologies for not checking all screen windows.

Huh, there should be an "external sort ended" message, as well as some
other ones once it finishes writing the initial runs.

> So, why didn't the index build complete (after more than 36 hours) when
> I ran it as part of pg_restore?  I guess this goes back to being a
> pg_restore problem and not an index build problem.

First thing that comes to mind is to wonder what maintenance_work_mem
setting was in effect for the pg_restore run.

            regards, tom lane

Re: Multicolun index creation never completes on 9.0.1/solaris

From
Josh Berkus
Date:
On 1/26/11 12:13 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Oh!  Actually, it only *did* 27 runs.  So it actually completed building
>> the index.  I'd expected trace_sort to give me some kind of completion
>> message; apologies for not checking all screen windows.
>
> Huh, there should be an "external sort ended" message, as well as some
> other ones once it finishes writing the initial runs.

Oh, fooled by log rotation:

LOG:  finished writing run 83 to tape 82: CPU 2600.37s/19449.04u sec
elapsed 22402.36 sec

LOG:  finished writing run 84 to tape 83: CPU 2630.31s/19674.93u sec
elapsed 22659.40 sec

LOG:  finished writing run 85 to tape 84: CPU 2658.46s/19905.44u sec
elapsed 22919.25 sec

LOG:  performsort starting: CPU 2686.35s/20121.46u sec elapsed 23164.61 sec

LOG:  finished writing run 86 to tape 85: CPU 2686.48s/20127.46u sec
elapsed 23170.74 sec

LOG:  finished writing final run 87 to tape 86: CPU 2688.70s/20220.72u
sec elapsed 23266.35 sec

LOG:  performsort done (except 87-way final merge): CPU
2691.55s/20224.35u sec elapsed 23277.44 sec

LOG:  external sort ended, 7991989 disk blocks used: CPU
3237.80s/27221.09u sec elapsed 31382.50 sec

So, a total of almost 9 hours.  Clearly there's something wrong here; if
I have time I'll try a recompile with GCC and see how it does.
Collation is 'C'.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com