Thread: problem with large maintenance_work_mem settings and CREATE INDEX

problem with large maintenance_work_mem settings and CREATE INDEX

From
Stefan Kaltenbrunner
Date:
Hi all!

while playing on a new box i noticed that postgresql does not seem to be
able to cope with very large settings for maintenance_work_mem.

For a test I created a single table with 5 integer columns containing
about 1,8B rows (about 300M distinct values in the column I want to index):


foo=# select count(*) from testtable;
   count
------------
 1800201755
(1 row)


I tried to create an index on one of the columns:

foo=# SET maintenance_work_mem to 4000000;
SET
foo=# CREATE INDEX a_idx ON testtable(a);
ERROR:  invalid memory alloc request size 1073741824

foo=# SET maintenance_work_mem to 3000000;
SET
foo=# CREATE INDEX a_idx ON testtable(a);
ERROR:  invalid memory alloc request size 1073741824

the error is generated pretty fast (a few seconds into the create index)

however:

foo=# SET maintenance_work_mem to 2000000;
SET
foo=# CREATE INDEX a_idx ON testtable(a);

is running now for about 10 hours with nearly no IO but pegging the
CPU-core it is running on at a constant 100%.

watching the process while this happens seems to indicate that the above
error occurs after the backend exceeds about 3,1GB in resident size.

The box in question is a Dual Opteron 275 (4 cores @2,2Ghz) with 16GB of
RAM and 24GB of swap. OS is Debian Sarge/AMD64 with a pure 64bit userland.


Stefan


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Stefan Kaltenbrunner wrote:
> Hi all!
> 
> while playing on a new box i noticed that postgresql does not seem to be
> able to cope with very large settings for maintenance_work_mem.
> 
> For a test I created a single table with 5 integer columns containing
> about 1,8B rows (about 300M distinct values in the column I want to index):
> 
> 
> foo=# select count(*) from testtable;
>    count
>  ------------
>   1800201755
>  (1 row)
> 
> 
> I tried to create an index on one of the columns:
> 
> foo=# SET maintenance_work_mem to 4000000;
> SET
> foo=# CREATE INDEX a_idx ON testtable(a);
> ERROR:  invalid memory alloc request size 1073741824
> 
> foo=# SET maintenance_work_mem to 3000000;
> SET
> foo=# CREATE INDEX a_idx ON testtable(a);
> ERROR:  invalid memory alloc request size 1073741824
> 
> the error is generated pretty fast (a few seconds into the create index)
> 
> however:
> 
> foo=# SET maintenance_work_mem to 2000000;
> SET
> foo=# CREATE INDEX a_idx ON testtable(a);
> 
> is running now for about 10 hours with nearly no IO but pegging the
> CPU-core it is running on at a constant 100%.
> 
> watching the process while this happens seems to indicate that the above
> error occurs after the backend exceeds about 3,1GB in resident size.
> 
> The box in question is a Dual Opteron 275 (4 cores @2,2Ghz) with 16GB of
>  RAM and 24GB of swap. OS is Debian Sarge/AMD64 with a pure 64bit userland.

forgot to mention that this is 8.1.3 compiled from source. Further
testing shows that not only CREATE INDEX has some issue with large
maintenance_work_mem settings :


foo=# set maintenance_work_mem to 2000000;
SET
foo=# VACUUM ANALYZE verbose;
INFO:  vacuuming "information_schema.sql_features"
ERROR:  invalid memory alloc request size 2047999998


Stefan


Re: problem with large maintenance_work_mem settings and

From
"hubert depesz lubaczewski"
Date:
On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> forgot to mention that this is 8.1.3 compiled from source. Further
> testing shows that not only CREATE INDEX has some issue with large
> maintenance_work_mem settings :

what does it show:
cat /proc/sys/kernel/shmmax
?

depesz


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
hubert depesz lubaczewski wrote:
> On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> 
>>forgot to mention that this is 8.1.3 compiled from source. Further
>>testing shows that not only CREATE INDEX has some issue with large
>>maintenance_work_mem settings :
> 
> 
> what does it show:
> cat /proc/sys/kernel/shmmax

1421326592

not that I think it is related to the problem at all. It looks like I'm
hitting the MaxAllocSize Limit in src/include/utils/memutils.h.
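For reference, the limit and the check that trips here look roughly like
this (paraphrased from the 8.1 sources, so details may differ slightly):

/* src/include/utils/memutils.h */
#define MaxAllocSize	((Size) 0x3fffffff)		/* 1 gigabyte - 1 */

#define AllocSizeIsValid(size)	((Size) (size) <= MaxAllocSize)

/* src/backend/utils/mmgr/mcxt.c, MemoryContextAlloc() */
if (!AllocSizeIsValid(size))
	elog(ERROR, "invalid memory alloc request size %lu",
		 (unsigned long) size);

The 1073741824 in the CREATE INDEX error is exactly MaxAllocSize + 1, so
it looks like the sort's in-memory tuple array is being grown just past
that limit.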


Stefan


Re: problem with large maintenance_work_mem settings and

From
"Michael Paesold"
Date:
Stefan Kaltenbrunner wrote:
> hubert depesz lubaczewski wrote:
>> On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>>
>>>forgot to mention that this is 8.1.3 compiled from source. Further
>>>testing shows that not only CREATE INDEX has some issue with large
>>>maintenance_work_mem settings :
>>
>> what does it show:
>> cat /proc/sys/kernel/shmmax
>
> 1421326592
>
> not that I think it is related to the problem at all.

I can second that. Maintenance work mem is not allocated in shared memory.

> It looks like I'm
> hitting the MaxAllocSize Limit in src/include/utils/memutils.h.

There are two issues you have mentioned. This one is more obvious: the 
limitation of the memory that can be allocated.

The other one is the very bad performance for index creation. I can only 
guess, but is sound like this is related to the recent discussion on hackers 
about issues with the qsort performance. If the theory is true, your index 
creation should be much faster with a much lower setting for 
maintenance_work_mem, so that it uses external sort.

See the discussion starting here:
http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php

Best Regards,
Michael Paesold 




Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Stefan Kaltenbrunner wrote:
> hubert depesz lubaczewski wrote:
> 
>>On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>>
>>
>>>forgot to mention that this is 8.1.3 compiled from source. Further
>>>testing shows that not only CREATE INDEX has some issue with large
>>>maintenance_work_mem settings :
>>
>>
>>what does it show:
>>cat /proc/sys/kernel/shmmax
> 
> 
> 1421326592
> 
> not that I think it is related to the problem at all. It looks like I'm
> hitting the MaxAllocSize Limit in src/include/utils/memutils.h.

just tried to increase this limit to 4GB (from the default 1GB) and this
seems to help a fair bit (ie CREATE INDEX and VACUUM do seem to work
with much higher maintenance_work_mem settings now).
BUT: VACUUM VERBOSE (or VACUUM ANALYZE VERBOSE) breaks (with a rather
gigantic allocation request ;-)):

foo=# VACUUM VERBOSE;
INFO:  vacuuming "information_schema.sql_features"
ERROR:  invalid memory alloc request size 18446744073709551615


Stefan


Re: problem with large maintenance_work_mem settings and

From
"Matthew T. O'Connor"
Date:
Stefan Kaltenbrunner wrote:
> foo=# set maintenance_work_mem to 2000000;
> SET
> foo=# VACUUM ANALYZE verbose;
> INFO:  vacuuming "information_schema.sql_features"
> ERROR:  invalid memory alloc request size 2047999998

Just an FYI, I reported a similar problem on my 8.0.0 database a few
weeks ago.  I upgraded to 8.1.3 and just recently got that message again
in a nightly maintenance email.

vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming of database "postgres" failed: ERROR:  out of memory
DETAIL:  Failed on request of size 167772156.

The interesting thing is that there aren't any tables in the postgres 
database.  This happened Thursday night, but vacuum ran fine on Friday 
night.  I'm on vacation right now, so I can't really look into it much 
deeper at the moment, but I thought I would mention it.

Matt



Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Michael Paesold wrote:
> Stefan Kaltenbrunner wrote:
> 
>> hubert depesz lubaczewski wrote:
>>
>>> On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>>>
>>>> forgot to mention that this is 8.1.3 compiled from source. Further
>>>> testing shows that not only CREATE INDEX has some issue with large
>>>> maintenance_work_mem settings :
>>>
>>>
>>> what does it show:
>>> cat /proc/sys/kernel/shmmax
>>
>>
>> 1421326592
>>
>> not that I think it is related to the problem at all.
> 
> 
> I can second that. Maintenance work mem is not allocated in shared memory.
> 
>> It looks like I'm
>> hitting the MaxAllocSize Limit in src/include/utils/memutils.h.
> 
> 
> There are two issues you have mentioned. This one is more obvious: the
> limitation of the memory that can be allocated.

yes

> 
> The other one is the very bad performance for index creation. I can only
> guess, but is sound like this is related to the recent discussion on
> hackers about issues with the qsort performance. If the theory is true,
> your index creation should be much faster with a much lower setting for
> maintenance_work_mem, so that it uses external sort.
> 
> See the discussion starting here:
> http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php

I was following this thread - and it was partly a reason why I'm playing
with that (the CREATE INDEX on that table finished after about 12 hours
with a bit less than 2GB for maintenance_work_mem; for comparison, it took
me only about 2,5 hours to create this table).
I'm currently testing how long it takes with very low settings to force
an external sort.


Stefan


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> not that I think it is related to the problem at all. It looks like I'm
>> hitting the MaxAllocSize Limit in src/include/utils/memutils.h.

> just tried to increase this limit to 4GB (from the default 1GB) and this
> seems to help a fair bit.

s/help a fair bit/break a whole lot of stuff/

There are reasons for that limit, and you can't just arbitrarily
rejigger it.

The sorting code probably needs a defense to keep it from trying to
exceed MaxAllocSize for the SortObject array; AFAIR there is no such
consideration there now, but it's easily added.  I'm not sure where your
VACUUM failure is coming from though --- can you get a back trace from
the errfinish call in that case?
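For the sort side, the defense could be as simple as this (untested
sketch --- the array and field names are only illustrative):

	/* when doubling the in-memory tuple array, never let the
	 * repalloc request exceed what palloc will accept */
	int		newmemtupsize = state->memtupsize * 2;

	if ((Size) newmemtupsize * sizeof(SortTuple) >= MaxAllocSize)
		newmemtupsize = (int) (MaxAllocSize / sizeof(SortTuple));

	if (newmemtupsize > state->memtupsize)
	{
		state->memtupsize = newmemtupsize;
		state->memtuples = (SortTuple *)
			repalloc(state->memtuples,
					 state->memtupsize * sizeof(SortTuple));
	}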
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> 
>>>not that I think it is related to the problem at all. It looks like I'm
>>>hitting the MaxAllocSize Limit in src/include/utils/memutils.h.
> 
> 
>>just tried to increase this limit to 4GB (from the default 1GB) and this
>>seems to help a fair bit.
> 
> 
> s/help a fair bit/break a whole lot of stuff/
> 
> There are reasons for that limit, and you can't just arbitrarily
> rejigger it.

heh - sure this is just a testbox so it was worth a try and I don't care
for the data anyway ...

> 
> The sorting code probably needs a defense to keep it from trying to
> exceed MaxAllocSize for the SortObject array; AFAIR there is no such
> consideration there now, but it's easily added.  I'm not sure where your
> VACUUM failure is coming from though --- can you get a back trace from
> the errfinish call in that case?

like this (with maintenance_work_mem set to 2000000):

(gdb) bt
#0  errfinish (dummy=0) at elog.c:310
#1  0x00000000005c6c93 in elog_finish (elevel=-4145840,
    fmt=0x84da50 "invalid memory alloc request size %lu") at elog.c:931
#2  0x00000000005d96a0 in MemoryContextAlloc (context=0x8d9c58,
    size=2047999998) at mcxt.c:505
#3  0x00000000004db947 in lazy_space_alloc (vacrelstats=0x8de5b0,
    relblocks=6) at vacuumlazy.c:963
#4  0x00000000004dab33 in lazy_scan_heap (onerel=0x2ad69a589cc8,
    vacrelstats=0x8de5b0, Irel=0x0, nindexes=0) at vacuumlazy.c:240
#5  0x00000000004da9d1 in lazy_vacuum_rel (onerel=0x2ad69a589cc8,
    vacstmt=0x8c0fd0) at vacuumlazy.c:157
#6  0x00000000004d7325 in vacuum_rel (relid=2589498568,
    vacstmt=0x8c0fd0, expected_relkind=-27 'å') at vacuum.c:1077
#7  0x00000000004d6990 in vacuum (vacstmt=0x8c0fd0, relids=0x0) at vacuum.c:449
#8  0x000000000055e871 in PortalRunUtility (portal=0x8e0360, query=0x8c0e00,
    dest=0x8c1050, completionTag=0x7fffffc0c410 "") at pquery.c:987
#9  0x000000000055eb07 in PortalRunMulti (portal=0x8e0360, dest=0x8c1050,
    altdest=0x8c1050, completionTag=0x7fffffc0c410 "") at pquery.c:1054
#10 0x000000000055e28f in PortalRun (portal=0x8e0360,
    count=9223372036854775807, dest=0x8c1050, altdest=0x8c1050,
    completionTag=0x7fffffc0c410 "") at pquery.c:665
#11 0x000000000055a3a1 in exec_simple_query (query_string=0x8c0cf0
    "VACUUM VERBOSE;") at postgres.c:1002
#12 0x000000000055cc2c in PostgresMain (argc=4, argv=0x84c078,
    username=0x84c040 "postgres") at postgres.c:3217
#13 0x0000000000538a71 in BackendRun (port=0x86add0) at postmaster.c:2853
#14 0x0000000000538550 in BackendStartup (port=0x86add0) at postmaster.c:2497
#15 0x0000000000536b4d in ServerLoop () at postmaster.c:1230
#16 0x0000000000535fcf in PostmasterMain (argc=3, argv=0x8493c0) at postmaster.c:941
#17 0x00000000004fcaa5 in main (argc=3, argv=0x8493c0) at main.c:265


Stefan


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> The sorting code probably needs a defense to keep it from trying to
>> exceed MaxAllocSize for the SortObject array; AFAIR there is no such
>> consideration there now, but it's easily added.  I'm not sure where your
>> VACUUM failure is coming from though --- can you get a back trace from
>> the errfinish call in that case?

> like this (with maintenance_work_mem set to 2000000):

> (gdb) bt
> #0  errfinish (dummy=0) at elog.c:310
> #1  0x00000000005c6c93 in elog_finish (elevel=-4145840, fmt=0x84da50
> "invalid memory alloc request size %lu")
>     at elog.c:931
> #2  0x00000000005d96a0 in MemoryContextAlloc (context=0x8d9c58,
> size=2047999998) at mcxt.c:505
> #3  0x00000000004db947 in lazy_space_alloc (vacrelstats=0x8de5b0,
> relblocks=6) at vacuumlazy.c:963

Oh, OK, it's lazy_space_alloc's fault --- we need to fix that to not try
to allocate chunks larger than MaxAllocSize.  Will do ... thanks for the
report.
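The arithmetic fits, too: lazy_space_alloc sizes its dead-tuple array
straight from maintenance_work_mem, roughly like this (paraphrased from
8.1, with the kind of clamp that's needed added in the middle; the
committed fix may differ in detail):

	/* vacuumlazy.c, lazy_space_alloc() --- paraphrased */
	maxtuples = (maintenance_work_mem * 1024L) / sizeof(ItemPointerData);

	/* needed defense: never request more than palloc() will accept */
	maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));

	vacrelstats->num_dead_tuples = 0;
	vacrelstats->max_dead_tuples = (int) maxtuples;
	vacrelstats->dead_tuples = (ItemPointerData *)
		palloc(maxtuples * sizeof(ItemPointerData));

With maintenance_work_mem = 2000000 and 6-byte item pointers the first
line works out to about 341 million entries, i.e. exactly the
2047999998-byte request shown in the backtrace.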
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> forgot to mention that this is 8.1.3 compiled from source.
>> 
>> See the discussion starting here:
>> http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php

> I was following this thread - and it was partly a reason why I'm playing
> with that (the CREATE INDEX on that table finished after about 12 hours
> with a bit less than 2GB for maintenance_work_mem; for comparison, it took
> me only about 2,5 hours to create this table).

It would be interesting to try the same test with CVS tip to see if the
sorting improvements Simon and I made over the past few weeks help much.
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> 
>>>>forgot to mention that this is 8.1.3 compiled from source.
>>>
>>>See the discussion starting here:
>>>http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php
> 
> 
>>I was following this thread - and it was partly a reason why I'm playing
>>with that (the CREATE INDEX on that table finished after about 12 hours
>>with a bit less than 2GB for maintenance_work_mem; for comparison, it took
>>me only about 2,5 hours to create this table).
> 
> 
> It would be interesting to try the same test with CVS tip to see if the
> sorting improvements Simon and I made over the past few weeks help much.

playing with CVS tip right now - it is a bit faster for both the initial
bulkloading (about 5%) and for the CREATE INDEX itself (11h30min vs
11h54min), though not a dramatic improvement.


Stefan


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 24915704 96.2170  ltsReleaseBlock
> 387265    1.4955  LogicalTapeRead
> 168725    0.6516  inlineApplySortFunction

Hmm ... the comment in ltsReleaseBlock sez
    /*
     * Insert blocknum into array, preserving decreasing order (so that
     * ltsGetFreeBlock returns the lowest available block number). This could
     * get fairly slow if there were many free blocks, but we don't expect
     * there to be very many at one time.
     */

We probably need to tweak things so this doesn't get called during the
"final merge" pass.  Looking at it now.
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Simon Riggs
Date:
On Sun, 2006-03-05 at 15:15 -0500, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > samples  %        symbol name
> > 24915704 96.2170  ltsReleaseBlock
> > 387265    1.4955  LogicalTapeRead
> > 168725    0.6516  inlineApplySortFunction
> 
> Hmm ... the comment in ltsReleaseBlock sez
> 
>     /*
>      * Insert blocknum into array, preserving decreasing order (so that
>      * ltsGetFreeBlock returns the lowest available block number). This could
>      * get fairly slow if there were many free blocks, but we don't expect
>      * there to be very many at one time.
>      */
> 
> We probably need to tweak things so this doesn't get called during the
> "final merge" pass.  Looking at it now.

OK. I also had a report of poor performance that was isolated to the final
merge pass and sucked quite badly; sounds like you've located the cause.

Best Regards, Simon Riggs



Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
I wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> samples  %        symbol name
>> 24915704 96.2170  ltsReleaseBlock

> We probably need to tweak things so this doesn't get called during the
> "final merge" pass.  Looking at it now.

I've committed a fix for this into CVS HEAD --- please try it out.
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> I wrote:
> 
>>Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>
>>>samples  %        symbol name
>>>24915704 96.2170  ltsReleaseBlock
> 
> 
>>We probably need to tweak things so this doesn't get called during the
>>"final merge" pass.  Looking at it now.
> 
> 
> I've committed a fix for this into CVS HEAD --- please try it out.

just tried that with CVS HEAD (includes the second fix too):


CREATE INDEX on a 1,8B row table (5 int columns - index created on the
first column, about 300M distinct values):

before: 11h 51min
after: 3h 11min(!)



Stefan


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> CREATE INDEX on a 1,8B row table (5 int columns - index created on the
> first column, about 300M distinct values):

> before: 11h 51min
> after: 3h 11min(!)

Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
over with oprofile turned on?
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Simon Riggs
Date:
On Wed, 2006-03-08 at 10:45 -0500, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > CREATE INDEX on a 1,8B row table (5 int columns - index created on the
> > first column, about 300M distinct values):
> 
> > before: 11h 51min
> > after: 3h 11min(!)
> 
> Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
> over with oprofile turned on?

Very.

Any chance of trying it with different maintenance_work_mem settings?

Did you try this with trace_sort=on? If so could we get the logs for
that?

[Results welcome from other hackers...particularly with regard to
queries with sort steps in rather than CREATE INDEX.]

Best Regards, Simon Riggs



Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> 
>>CREATE INDEX on a 1,8B row table (5 int columns - index created on the
>>first column, about 300M distinct values):
> 
> 
>>before: 11h 51min
>>after: 3h 11min(!)
> 
> 
> Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
> over with oprofile turned on?

while it now does a fair amount of IO during the whole operation, it is
not yet IO-bound afaics.

profile:

samples  %        symbol name
103520432 47.9018  inlineApplySortFunction
33382738 15.4471  comparetup_index
25296438 11.7054  tuplesort_heap_siftup
10089122  4.6685  btint4cmp
8395676   3.8849  LogicalTapeRead
2873556   1.3297  tuplesort_heap_insert
2796545   1.2940  tuplesort_gettuple_common
2752345   1.2736  AllocSetFree
2233889   1.0337  IndexBuildHeapScan
2035265   0.9418  heapgettup
1571035   0.7270  LWLockAcquire
1498800   0.6935  readtup_index
1213587   0.5616  index_form_tuple
1097172   0.5077  AllocSetAlloc
1056964   0.4891  heap_fill_tuple
1041172   0.4818  btbuildCallback
990005    0.4581  LWLockRelease
897662    0.4154  slot_deform_tuple
858527    0.3973  LogicalTapeWrite
806849    0.3734  PageAddItem
764136    0.3536  LockBuffer

trace_sort:

LOG:  begin index sort: unique = f, workMem = 2048000, randomAccess = f
LOG:  switching to external sort with 7315 tapes: CPU 4.07s/13.70u sec elapsed 17.79 sec
LOG:  finished writing run 1 to tape 0: CPU 240.07s/3926.66u sec elapsed 4498.49 sec
LOG:  performsort starting: CPU 535.66s/8138.92u sec elapsed 9435.11 sec
LOG:  finished writing final run 2 to tape 1: CPU 538.54s/8242.23u sec elapsed 9541.55 sec
LOG:  performsort done (except final merge): CPU 539.39s/8254.83u sec elapsed 9559.75 sec
LOG:  external sort ended, 4398827 disk blocks used: CPU 768.38s/10027.39u sec elapsed 11884.63 sec


Stefan


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Simon Riggs wrote:
> On Wed, 2006-03-08 at 10:45 -0500, Tom Lane wrote:
> 
>>Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>
>>>CREATE INDEX on a 1,8B row table (5 int columns - index created on the
>>>first column, about 300M distinct values):
>>
>>>before: 11h 51min
>>>after: 3h 11min(!)
>>
>>Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
>>over with oprofile turned on?
> 
> 
> Very.
> 
> Any chance of trying it with different maintenance_work_mem settings?


interesting:

with maintenance_work_mem = 2048000:

for 3 runs runtime between 3h11min and 3h25min

with maintenance_work_mem = 10240:

for 3 runs runtime between 2h40min and 2h50min


will try other settings as time permits


Stefan


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 103520432 47.9018  inlineApplySortFunction
> 33382738 15.4471  comparetup_index
> 25296438 11.7054  tuplesort_heap_siftup
> 10089122  4.6685  btint4cmp
> 8395676   3.8849  LogicalTapeRead
> 2873556   1.3297  tuplesort_heap_insert
> 2796545   1.2940  tuplesort_gettuple_common
> 2752345   1.2736  AllocSetFree
> 2233889   1.0337  IndexBuildHeapScan

Interesting.  What's the platform, and what compiler exactly?  For me,
gcc seems to inline inlineApplySortFunction successfully, but your
compiler evidently is not doing that.

I get a significantly different oprofile result from CVS HEAD:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
194439   37.8399  tuplesort_heap_siftup
70803    13.7790  comparetup_index
35030     6.8172  btint4cmp
25139     4.8923  AllocSetAlloc
23965     4.6638  writetup_index
18057     3.5141  GetMemoryChunkSpace
17384     3.3831  tuplesort_gettuple_common
14356     2.7938  mergepreread
12740     2.4793  LWLockAcquire
11175     2.1748  LWLockRelease
6986      1.3596  tuplesort_heap_insert
6659      1.2959  PageAddItem
6387      1.2430  index_form_tuple
5225      1.0168  LogicalTapeRead

LOG:  begin index sort: unique = f, workMem = 65536, randomAccess = f
LOG:  switching to external sort with 234 tapes: CPU 0.23s/0.81u sec elapsed 1.82 sec
LOG:  finished writing run 1 to tape 0: CPU 0.68s/6.80u sec elapsed 9.59 sec
LOG:  finished writing run 2 to tape 1: CPU 1.23s/13.54u sec elapsed 17.88 sec
LOG:  finished writing run 3 to tape 2: CPU 1.62s/20.43u sec elapsed 25.56 sec
LOG:  finished writing run 4 to tape 3: CPU 2.08s/27.48u sec elapsed 34.64 sec
LOG:  finished writing run 5 to tape 4: CPU 2.56s/34.50u sec elapsed 43.06 sec
LOG:  performsort starting: CPU 2.66s/35.81u sec elapsed 45.00 sec
LOG:  finished writing run 6 to tape 5: CPU 2.74s/38.11u sec elapsed 47.38 sec
LOG:  finished writing final run 7 to tape 6: CPU 2.74s/38.15u sec elapsed 47.43 sec
LOG:  performsort done (except 7-way final merge): CPU 2.74s/38.56u sec elapsed 47.84 sec
LOG:  external sort ended, 24434 disk blocks used: CPU 4.10s/51.98u sec elapsed 65.33 sec

This is creating an index on a single integer column, 10 million rows,
random data, with maintenance_work_mem set to 65536 (64MB).  Platform
is 64-bit Xeon, Fedora Core 4.  I tried a larger test case (100M rows,
640MB maintenance_work_mem) and got similar results, with if anything
a few more percent in tuplesort_heap_siftup.  What's even more
interesting is that an instruction-level dump of the oprofile counts
shows that almost all the tuplesort_heap_siftup time is spent at
the "((tup1)->tupindex != (tup2)->tupindex)" part of the first
HEAPCOMPARE macro.  I interpret this as memory fetch delay: that's the
first touch of the SortTuple array elements, and if that part of the
array wasn't in L2 cache this is where the CPU would stall waiting
for it.
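(For reference, HEAPCOMPARE is roughly

	#define HEAPCOMPARE(tup1,tup2) \
		(checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
		 ((tup1)->tupindex) - ((tup2)->tupindex) : \
		 COMPARETUP(state, tup1, tup2))

quoting from memory, so the exact details may be slightly off.)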

Your machine seems not to be subject to nearly the same amount of memory
delay.  Which is interesting because most of the argument for changing
sort algorithms seems to hinge on the assumption that main-memory delay
is the main thing we need to worry about.  That looks to be valid on the
Xeon I'm testing but not on your machine ...
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> 
>>samples  %        symbol name
>>103520432 47.9018  inlineApplySortFunction
>>33382738 15.4471  comparetup_index
>>25296438 11.7054  tuplesort_heap_siftup
>>10089122  4.6685  btint4cmp
>>8395676   3.8849  LogicalTapeRead
>>2873556   1.3297  tuplesort_heap_insert
>>2796545   1.2940  tuplesort_gettuple_common
>>2752345   1.2736  AllocSetFree
>>2233889   1.0337  IndexBuildHeapScan
> 
> 
> Interesting.  What's the platform, and what compiler exactly?  For me,
> gcc seems to inline inlineApplySortFunction successfully, but your
> compiler evidently is not doing that.

Debian Sarge/AMD64 with gcc version 3.3.5 (Debian 1:3.3.5-13) running on
a Dual AMD Opteron 280 (so 4 cores @2,4GHz) with 16GB of RAM and a very
recent Kernel.
Debian has gcc 3.4 as an optional package in Sarge too so I certainly
can try that one.


[...]

> Your machine seems not to be subject to nearly the same amount of memory
> delay.  Which is interesting because most of the argument for changing
> sort algorithms seems to hinge on the assumption that main-memory delay
> is the main thing we need to worry about.  That looks to be valid on the
> Xeon I'm testing but not on your machine ...

hmm very interesting, Opterons are known for their very high memory
bandwidth and some (rather limited) testing using various benchmark tools
against a 3,2Ghz DP Xeon with 2MB L2 cache shows that the Opteron box
really has a significant advantage here ...


Stefan


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> LOG:  begin index sort: unique = f, workMem = 8024000, randomAccess = f
> LOG:  switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec 
> elapsed 32.43 sec
> LOG:  finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed 
> 3814.82 sec
> LOG:  performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec
> LOG:  finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec 
> elapsed 7827.25 sec
> LOG:  performsort done (except 2-way final merge): CPU 348.25s/7132.99u 
> sec elapsed 7846.47 sec

> after that the postmaster is now consuming 99% CPU for about 22 hours(!) 

I'll look into it, but I was already wondering if we shouldn't bound the
number of tapes somehow.  It's a bit hard to believe that 28000 tapes is
a sane setting.
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> 
>>LOG:  begin index sort: unique = f, workMem = 8024000, randomAccess = f
>>LOG:  switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec 
>>elapsed 32.43 sec
>>LOG:  finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed 
>>3814.82 sec
>>LOG:  performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec
>>LOG:  finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec 
>>elapsed 7827.25 sec
>>LOG:  performsort done (except 2-way final merge): CPU 348.25s/7132.99u 
>>sec elapsed 7846.47 sec
> 
> 
>>after that the postmaster is now consuming 99% CPU for about 22 hours(!) 
> 
> 
> I'll look into it, but I was already wondering if we shouldn't bound the
> number of tapes somehow.  It's a bit hard to believe that 28000 tapes is
> a sane setting.


heh - don't think it is a sane setting either (and it doesn't look like
pg is using more than 2GB anyway).

If this testing helps with defining appropriate upper bounds to prevent
bad behaviour like this (not responding to signals any more and eating
CPU like mad) I'm more than happy.
And the ltsReleaseBlock fix already reduced dump&restore times for one
of our production databases by about 15%, which is already quite an
impressive improvement on its own ;-)


Stefan


Re: problem with large maintenance_work_mem settings and

From
Simon Riggs
Date:
On Fri, 2006-03-10 at 09:31 -0500, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > LOG:  begin index sort: unique = f, workMem = 8024000, randomAccess = f
> > LOG:  switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec 
> > elapsed 32.43 sec
> > LOG:  finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed 
> > 3814.82 sec
> > LOG:  performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec
> > LOG:  finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec 
> > elapsed 7827.25 sec
> > LOG:  performsort done (except 2-way final merge): CPU 348.25s/7132.99u 
> > sec elapsed 7846.47 sec
> 
> > after that the postmaster is now consuming 99% CPU for about 22 hours(!) 
> 
> I'll look into it, but I was already wondering if we shouldn't bound the
> number of tapes somehow.  It's a bit hard to believe that 28000 tapes is
> a sane setting.

Well, since they are not actually tapes, why not?

I thought you had changed the memory settings so that the 28658 was a
maximum, not the actual number it used. In the above example we are only
using 2 runs (tapes)... so it should just read in run 1 and then run 2
into memory and complete the final merge.

Seems reasonable to limit tapes to say 10,000. 28000 tapes allows you to
sort 448 TB without any merging...

Best Regards, Simon Riggs




Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2006-03-10 at 09:31 -0500, Tom Lane wrote:
>> I'll look into it, but I was already wondering if we shouldn't bound the
>> number of tapes somehow.  It's a bit hard to believe that 28000 tapes is
>> a sane setting.

> I thought you had changed the memory settings so that the 28658 was a
> maximum, not the actual number it used.

Well, it does set up the control structures with 28658 entries, but the
I/O buffers and so on are not allocated unless used (in this instance
only two will get used).  logtape.c itself does not look like it has any
serious problem with too many tapes, but maybe tuplesort.c does.  Or the
problem Stefan has stumbled across might be unrelated to number of
tapes, anyway --- we still need to dig.

> Seems reasonable to limit tapes to say 10,000. 28000 tapes allows you to
> sort 448 TB without any merging...

Yeah, I was thinking MAXTAPES = 10000 might not be a bad idea.
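Something as simple as clamping whatever the workMem-based computation
comes up with, e.g. (illustrative only --- the helper name is made up):

	#define MAXTAPES	10000

	maxTapes = tapes_wanted_for_workmem(allowedMem);	/* hypothetical */
	maxTapes = Min(maxTapes, MAXTAPES);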
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 350318533 98.8618  mergepreread
> 971822    0.2743  tuplesort_gettuple_common
> 413674    0.1167  tuplesort_heap_siftup

I'm not immediately seeing why mergepreread would have such a problem
with lots of tapes.  Could you reproduce this in a debug build, then
attach to the process with gdb and print out the contents of the "state"
struct?
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 350318533 98.8618  mergepreread
> 971822    0.2743  tuplesort_gettuple_common
> 413674    0.1167  tuplesort_heap_siftup

I don't have enough memory to really reproduce this, but I've come close
enough that I believe I see what's happening.  It's an artifact of the
code I added recently to prevent the SortTuple array from growing during
the merge phase, specifically the "mergeslotsfree" logic.  You can get
into a state where mergeslotsfree is at the lower limit of what the code
will allow, and then if there's a run of tuples that should come from a
single tape, mergepreread ends up sucking just one tuple per call from
that tape --- and with the outer loop over 28000 tapes that aren't doing
anything, each call is pretty expensive.  I had mistakenly assumed that
the mergeslotsfree limit would be a seldom-hit corner case, but it seems
it's not so hard to get into that mode after all.  The code really needs
to do a better job of sharing the available array slots among the tapes.
Probably the right answer is to allocate so many free array slots to each
tape, similar to the per-tape limit on memory usage --- I had thought
that the memory limit would cover matters but it doesn't.

Another thing I am wondering about is the code's habit of prereading
from all tapes when one goes empty.  This is clearly pretty pointless in
the final-merge-pass case: we might as well just reload from the one
that went empty, and not bother scanning the rest.  However, in the
scenario where we are rewriting the data to tape, I think we still need
the preread-from-all behavior in order to keep things efficient in
logtape.c.  logtape likes it if you alternate a lot of reads with a lot
of writes, so once you've started reading you really want to refill
memory completely.

It might also be worth remembering the index of the last active tape so
that we don't iterate over thousands of uninteresting tapes in
mergepreread.
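Something along these lines, perhaps (sketch only --- I'm going from
memory on the existing names, and lastActiveTape would be a new field):

	static void
	mergepreread(Tuplesortstate *state)
	{
		int		srcTape;

		/* walk only up to the highest tape that ever got a run,
		 * instead of all maxTapes slots on every call */
		for (srcTape = 0; srcTape <= state->lastActiveTape; srcTape++)
		{
			if (!state->mergeactive[srcTape])
				continue;
			mergeprereadone(state, srcTape);
		}
	}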
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Tom Lane
Date:
I wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> samples  %        symbol name
>> 350318533 98.8618  mergepreread
>> 971822    0.2743  tuplesort_gettuple_common
>> 413674    0.1167  tuplesort_heap_siftup

> I don't have enough memory to really reproduce this, but I've come close
> enough that I believe I see what's happening.  It's an artifact of the
> code I added recently to prevent the SortTuple array from growing during
> the merge phase, specifically the "mergeslotsfree" logic.

I've committed a fix for this; if you're still interested, please
retest.
        regards, tom lane


Re: problem with large maintenance_work_mem settings and

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> I wrote:
> 
>>Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>
>>>samples  %        symbol name
>>>350318533 98.8618  mergepreread
>>>971822    0.2743  tuplesort_gettuple_common
>>>413674    0.1167  tuplesort_heap_siftup
> 
> 
>>I don't have enough memory to really reproduce this, but I've come close
>>enough that I believe I see what's happening.  It's an artifact of the
>>code I added recently to prevent the SortTuple array from growing during
>>the merge phase, specifically the "mergeslotsfree" logic.
> 
> 
> I've committed a fix for this; if you're still interested, please
> retest.

This seems to have fixed the problem - mergepreread is gone from the
profile and the CREATE INDEX finished in about 2h 37 minutes:

LOG:  begin index sort: unique = f, workMem = 8024000, randomAccess = f
LOG:  switching to external sort with 28658 tapes: CPU 4.68s/11.67u sec elapsed 16.36 sec
LOG:  finished writing run 1 to tape 0: CPU 303.18s/3324.51u sec elapsed 3858.55 sec
LOG:  performsort starting: CPU 611.37s/6829.75u sec elapsed 7830.90 sec
LOG:  finished writing final run 2 to tape 1: CPU 614.15s/6928.92u sec elapsed 7933.38 sec
LOG:  performsort done (except 2-way final merge): CPU 615.26s/6940.77u sec elapsed 7951.58 sec
LOG:  external sort ended, 4398827 disk blocks used: CPU 827.45s/8519.86u sec elapsed 10046.31 sec


profile for this run looks like:

CPU: AMD64 processors, speed 2405.5 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
77645525 39.6040  comparetup_index
47024448 23.9854  btint4cmp
22508630 11.4808  tuplesort_heap_siftup
12657874  6.4563  LogicalTapeRead
4049912   2.0657  tuplesort_heap_insert
3025537   1.5432  tuplesort_gettuple_common
2922149   1.4905  AllocSetFree
2499669   1.2750  readtup_index
1725984   0.8804  AllocSetAlloc
1318344   0.6724  LWLockAcquire
1298245   0.6622  PageAddItem
1271657   0.6486  heapgettup


Stefan