Re: Minor performance improvement in transition to external sort - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Minor performance improvement in transition to external sort
Date
Msg-id CAMkU=1wc7_mtc_2QvGxoLZEGBkd=D0dnwNKZ4C+ifSKu9e+imQ@mail.gmail.com
Whole thread Raw
In response to Minor performance improvement in transition to external sort  (Jeremy Harris <jgh@wizmail.org>)
Responses Re: Minor performance improvement in transition to external sort  (Jeremy Harris <jgh@wizmail.org>)
Re: Minor performance improvement in transition to external sort  (Jeremy Harris <jgh@wizmail.org>)
Re: Minor performance improvement in transition to external sort  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Tue, Feb 4, 2014 at 2:22 PM, Jeremy Harris <jgh@wizmail.org> wrote:
The attached patch replaces the existing siftup method for heapify with
a siftdown method. Tested with random integers it does 18% fewer
compares and takes 10% less time for the heapify, over the work_mem
range 1024 to 1048576.

Thanks for working on this.  Several times I've wondered why we didn't use siftdown for heapifying, as it is a well known optimization.  How big of sets were you sorting in each case?  Was it always just slightly bigger than would fit in work_mem?  Did you try sorting already-sorted, reverse sorted, or pipe-organ shaped data sets?  We will also need to test it on strings.  I usually use md5(random()::text) to generate strings for such purposes, at least for a first pass.

The name of the attached patch is a bit unfortunate.  And I'm not sure what you are doing with tupindex, the change there doesn't seem to be necessary to the rest of your work, so some explanation on that would be helpful.  

The project coding style usually has comments in blocks before loops on which they comment, rather than interspersed throughout the clauses of the "for" statement.
 

Both algorithms appear to be O(n) (contradicting Wikipedia's claim
that a siftup heapify is O(n log n)).

Siftdown is linear in all cases.  Siftup may be linear in the typical case, but is n log n in the worst case.

Another optimization that is possible is to do only one comparison at each level (between the two children) all the way down the heap, and then compare the parent to the chain of smallest children in reverse order.  This can substantially reduce the number of comparisons on average, because most tuples sift most of the way down the heap (because most of the tuples are at the bottom of the heap).  Robert submitted a patch to do this in the main siftdown routine (which for some reason is named tuplesort_heap_siftup, rather than siftdown), and I found it a substantial improvement but it never got pushed forward.  I think Robert was concerned it might make some obscure cases worse rather than better, and no one put it through enough serious testing to overcome that concern.

(Also, I think you should make your siftdown code look more like the existing siftdown code.)

Cheers,

Jeff

pgsql-hackers by date:

Previous
From: Sawada Masahiko
Date:
Subject: 'dml' value for log_statement
Next
From: "David E. Wheeler"
Date:
Subject: Re: extension_control_path