Re: problem with large maintenance_work_mem settings and - Mailing list pgsql-hackers
| From | Tom Lane |
| --- | --- |
| Subject | Re: problem with large maintenance_work_mem settings and |
| Date | |
| Msg-id | 26014.1141925132@sss.pgh.pa.us |
| In response to | Re: problem with large maintenance_work_mem settings and (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>) |
| Responses | Re: problem with large maintenance_work_mem settings and |
| List | pgsql-hackers |
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples   %        symbol name
> 103520432 47.9018  inlineApplySortFunction
> 33382738  15.4471  comparetup_index
> 25296438  11.7054  tuplesort_heap_siftup
> 10089122   4.6685  btint4cmp
> 8395676    3.8849  LogicalTapeRead
> 2873556    1.3297  tuplesort_heap_insert
> 2796545    1.2940  tuplesort_gettuple_common
> 2752345    1.2736  AllocSetFree
> 2233889    1.0337  IndexBuildHeapScan

Interesting.  What's the platform, and what compiler exactly?  For me, gcc seems to inline inlineApplySortFunction successfully, but your compiler evidently is not doing that.

I get a significantly different oprofile result from CVS HEAD:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
194439   37.8399  tuplesort_heap_siftup
70803    13.7790  comparetup_index
35030     6.8172  btint4cmp
25139     4.8923  AllocSetAlloc
23965     4.6638  writetup_index
18057     3.5141  GetMemoryChunkSpace
17384     3.3831  tuplesort_gettuple_common
14356     2.7938  mergepreread
12740     2.4793  LWLockAcquire
11175     2.1748  LWLockRelease
6986      1.3596  tuplesort_heap_insert
6659      1.2959  PageAddItem
6387      1.2430  index_form_tuple
5225      1.0168  LogicalTapeRead

LOG:  begin index sort: unique = f, workMem = 65536, randomAccess = f
LOG:  switching to external sort with 234 tapes: CPU 0.23s/0.81u sec elapsed 1.82 sec
LOG:  finished writing run 1 to tape 0: CPU 0.68s/6.80u sec elapsed 9.59 sec
LOG:  finished writing run 2 to tape 1: CPU 1.23s/13.54u sec elapsed 17.88 sec
LOG:  finished writing run 3 to tape 2: CPU 1.62s/20.43u sec elapsed 25.56 sec
LOG:  finished writing run 4 to tape 3: CPU 2.08s/27.48u sec elapsed 34.64 sec
LOG:  finished writing run 5 to tape 4: CPU 2.56s/34.50u sec elapsed 43.06 sec
LOG:  performsort starting: CPU 2.66s/35.81u sec elapsed 45.00 sec
LOG:  finished writing run 6 to tape 5: CPU 2.74s/38.11u sec elapsed 47.38 sec
LOG:  finished writing final run 7 to tape 6: CPU 2.74s/38.15u sec elapsed 47.43 sec
LOG:  performsort done (except 7-way final merge): CPU 2.74s/38.56u sec elapsed 47.84 sec
LOG:  external sort ended, 24434 disk blocks used: CPU 4.10s/51.98u sec elapsed 65.33 sec

This is creating an index on a single integer column, 10 million rows, random data, with maintenance_work_mem set to 65536 (64MB).  Platform is 64-bit Xeon, Fedora Core 4.  I tried a larger test case (100M rows, 640MB maintenance_work_mem) and got similar results, with if anything a few more percent in tuplesort_heap_siftup.

What's even more interesting is that an instruction-level dump of the oprofile counts shows that almost all the tuplesort_heap_siftup time is spent at the "((tup1)->tupindex != (tup2)->tupindex)" part of the first HEAPCOMPARE macro.  I interpret this as memory fetch delay: that's the first touch of the SortTuple array elements, and if that part of the array wasn't in L2 cache, this is where the CPU would stall waiting for it.

Your machine seems not to be subject to nearly the same amount of memory delay.  Which is interesting, because most of the argument for changing sort algorithms seems to hinge on the assumption that main-memory delay is the main thing we need to worry about.  That looks to be valid on the Xeon I'm testing but not on your machine ...

			regards, tom lane
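For readers who want to try the test case described above, a minimal sketch follows. The table and column names are illustrative (the original message does not give them), and the LOG output quoted above assumes a build with TRACE_SORT enabled (the default), which makes the trace_sort setting available:

```sql
-- Hypothetical reproduction of the test case described above:
-- 10 million rows of random integer data, then an index build
-- with maintenance_work_mem at 64MB. Names are illustrative only.
CREATE TABLE sorttest (val integer);

INSERT INTO sorttest
SELECT (random() * 2000000000)::integer
FROM generate_series(1, 10000000);

SET maintenance_work_mem = 65536;  -- 65536 kB = 64MB, as in the test above
SET trace_sort = on;               -- emits the "begin index sort ..." LOG lines
                                   -- (requires a build with TRACE_SORT defined)

CREATE INDEX sorttest_val_idx ON sorttest (val);
```

Running CREATE INDEX under oprofile with this setup is the kind of measurement the profiles above came from; results will differ with platform, compiler, and maintenance_work_mem.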