Re: problem with large maintenance_work_mem settings and - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: problem with large maintenance_work_mem settings and
Date:
Msg-id: 26014.1141925132@sss.pgh.pa.us
In response to: Re: problem with large maintenance_work_mem settings and  (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>)
Responses: Re: problem with large maintenance_work_mem settings and
List: pgsql-hackers
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 103520432 47.9018  inlineApplySortFunction
> 33382738 15.4471  comparetup_index
> 25296438 11.7054  tuplesort_heap_siftup
> 10089122  4.6685  btint4cmp
> 8395676   3.8849  LogicalTapeRead
> 2873556   1.3297  tuplesort_heap_insert
> 2796545   1.2940  tuplesort_gettuple_common
> 2752345   1.2736  AllocSetFree
> 2233889   1.0337  IndexBuildHeapScan

Interesting.  What's the platform, and what compiler exactly?  For me,
gcc seems to inline inlineApplySortFunction successfully, but your
compiler evidently is not doing that.
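
(For illustration, a minimal sketch of the kind of per-comparison wrapper at
issue; the names and fields below are invented for the example and this is not
the actual tuplesort.c code.  If the compiler does not inline such a wrapper,
every comparison pays for a function call and argument setup on top of the
comparator itself, which would be consistent with it dominating the profile
quoted above.)

/* Illustrative sketch only, not PostgreSQL source.  A small wrapper
 * like this runs once per comparison; inlined, it costs almost nothing,
 * but as an out-of-line call it adds call overhead to every comparison
 * the sort performs. */
#include <stdint.h>

typedef int64_t Datum;

typedef struct
{
    int (*comparator) (Datum a, Datum b);   /* e.g. an int4 comparator */
    int  descending;                        /* reverse the result? */
} SortKeySketch;

static inline int
apply_sort_function_sketch(const SortKeySketch *key,
                           Datum datum1, int isnull1,
                           Datum datum2, int isnull2)
{
    int         result;

    /* NULLs sort after non-NULLs in this sketch */
    if (isnull1)
        return isnull2 ? 0 : 1;
    if (isnull2)
        return -1;

    result = key->comparator(datum1, datum2);
    return key->descending ? -result : result;
}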

I get a significantly different oprofile result from CVS HEAD:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
194439   37.8399  tuplesort_heap_siftup
70803    13.7790  comparetup_index
35030     6.8172  btint4cmp
25139     4.8923  AllocSetAlloc
23965     4.6638  writetup_index
18057     3.5141  GetMemoryChunkSpace
17384     3.3831  tuplesort_gettuple_common
14356     2.7938  mergepreread
12740     2.4793  LWLockAcquire
11175     2.1748  LWLockRelease
6986      1.3596  tuplesort_heap_insert
6659      1.2959  PageAddItem
6387      1.2430  index_form_tuple
5225      1.0168  LogicalTapeRead

LOG:  begin index sort: unique = f, workMem = 65536, randomAccess = f
LOG:  switching to external sort with 234 tapes: CPU 0.23s/0.81u sec elapsed 1.82 sec
LOG:  finished writing run 1 to tape 0: CPU 0.68s/6.80u sec elapsed 9.59 sec
LOG:  finished writing run 2 to tape 1: CPU 1.23s/13.54u sec elapsed 17.88 sec
LOG:  finished writing run 3 to tape 2: CPU 1.62s/20.43u sec elapsed 25.56 sec
LOG:  finished writing run 4 to tape 3: CPU 2.08s/27.48u sec elapsed 34.64 sec
LOG:  finished writing run 5 to tape 4: CPU 2.56s/34.50u sec elapsed 43.06 sec
LOG:  performsort starting: CPU 2.66s/35.81u sec elapsed 45.00 sec
LOG:  finished writing run 6 to tape 5: CPU 2.74s/38.11u sec elapsed 47.38 sec
LOG:  finished writing final run 7 to tape 6: CPU 2.74s/38.15u sec elapsed 47.43 sec
LOG:  performsort done (except 7-way final merge): CPU 2.74s/38.56u sec elapsed 47.84 sec
LOG:  external sort ended, 24434 disk blocks used: CPU 4.10s/51.98u sec elapsed 65.33 sec

This is creating an index on a single integer column, 10 million rows,
random data, with maintenance_work_mem set to 65536 (64MB).  Platform
is 64-bit Xeon, Fedora Core 4.  I tried a larger test case (100M rows,
640MB maintenance_work_mem) and got similar results, with if anything
a few more percent in tuplesort_heap_siftup.  What's even more
interesting is that an instruction-level dump of the oprofile counts
shows that almost all the tuplesort_heap_siftup time is spent at
the "((tup1)->tupindex != (tup2)->tupindex)" part of the first
HEAPCOMPARE macro.  I interpret this as memory fetch delay: that's the
first touch of the SortTuple array elements, and if that part of the
array wasn't in L2 cache this is where the CPU would stall waiting
for it.
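
(To make the memory-stall reading concrete, here is a minimal sketch of a
sift-down over an array of SortTuple-like structs; again, the code is invented
for illustration and is not the actual tuplesort.c routine.  The run-number
comparison is the first field read from each child element, so if that
element's cache line has to come from main memory, the resulting stall is what
the instruction-level counts attribute to that comparison.)

/* Illustrative sketch only, not PostgreSQL source.  The first field read
 * from memtuples[j] below is tupindex, so a cache miss on that array
 * element shows up as time charged to the tupindex comparison. */
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    void       *tuple;      /* the tuple proper */
    int64_t     datum1;     /* first key value */
    int         tupindex;   /* run number during the merge */
} SortTupleSketch;

/* lower run number wins; ties are broken on the key value */
static int
heap_compare(const SortTupleSketch *a, const SortTupleSketch *b)
{
    if (a->tupindex != b->tupindex)     /* first touch of *a and *b */
        return a->tupindex - b->tupindex;
    return (a->datum1 > b->datum1) - (a->datum1 < b->datum1);
}

/* Remove the root of a min-heap of *n elements and restore heap order. */
static void
heap_siftup_sketch(SortTupleSketch *memtuples, size_t *n)
{
    size_t          i = 0;
    SortTupleSketch tup;

    if (--(*n) == 0)
        return;
    tup = memtuples[*n];                /* former last element goes on top */
    for (;;)
    {
        size_t      j = 2 * i + 1;

        if (j >= *n)
            break;
        if (j + 1 < *n &&
            heap_compare(&memtuples[j + 1], &memtuples[j]) < 0)
            j++;                        /* pick the smaller child */
        if (heap_compare(&tup, &memtuples[j]) <= 0)
            break;
        memtuples[i] = memtuples[j];    /* promote child, keep descending */
        i = j;
    }
    memtuples[i] = tup;
}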

Your machine seems not to be subject to nearly the same amount of memory
delay.  Which is interesting because most of the argument for changing
sort algorithms seems to hinge on the assumption that main-memory delay
is the main thing we need to worry about.  That looks to be valid on the
Xeon I'm testing but not on your machine ...
        regards, tom lane

