Thread: Is tuplesort_heap_siftup() a misnomer?
Doesn't tuplesort_heap_siftup() actually shift-down?

The Wikipedia article on heaps [1] lists "shift-down" (never "sift down", FWIW) as a common operation on a heap:

"shift-down: move a node down in the tree, similar to shift-up; used to restore heap condition after deletion or replacement."

This seems to be what tuplesort_heap_siftup() actually does (among other things; its job is to compact the heap after the caller returns the root, where the caller leaves a "hole" at the root position 0). As it happens, the node that this tuplesort.c function "moves down in the tree" is the memtuple that was previously positioned physically last in the heap portion of the memtuples array. This node is then considered as a "candidate root" (i) for each subheap as we actually shift down, starting from i = 0. Conceptually, we immediately "fill the hole" that the caller left in the root with this other tuple/node (the physically last tuple), which is then shifted down as needed. (I say "conceptually" because we don't actually bother with an eager swap into the "hole" position, but that's just to avoid a potentially useless swap.)

Now, tuplesort_heap_insert() actually does sift (or shift up, if you prefer), which is necessary because it doesn't assume that there is an existing "hole" anywhere in the heap (it just assumes the availability of space to insert the caller's tuple, at the end of the heap's memtuple portion). Or, as Wikipedia describes it:

"shift-up: move a node up in the tree, as long as needed; used to restore heap condition after insertion. Called "sift" because node moves up the tree until it reaches the correct level, as in a sieve."

I think that tuplesort_heap_insert() is correct, because it doesn't claim to shift at all (or sift, or whatever). I'm just contrasting it with tuplesort_heap_siftup() to make a point.
I think that tuplesort_heap_siftup() should probably be renamed to describe its purpose at a higher level, rather than how it shifts, which in any case is an implementation detail. Specifically, I suggest we rename tuplesort_heap_siftup() to tuplesort_heap_compact(), which suggests that it's roughly the opposite of tuplesort_heap_insert(). I think that the contrast between these two functions adds clarity. The tuplesort_heap_siftup() comments will also need to be updated, since "sift up" is mentioned there.

Should I write a patch?

[1] https://en.wikipedia.org/wiki/Heap_(data_structure)

--
Peter Geoghegan
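For reference, the operation described above can be sketched on a plain min-heap of ints (hypothetical names, not the actual tuplesort.c code): the caller has already taken the root, leaving a "hole" at position 0, and the physically last node is the "candidate root" that gets shifted down, with no eager swap into the hole.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch only (heap_compact is a hypothetical name).
 * The caller has consumed a[0]; move the physically last node down
 * from the root position until the min-heap invariant holds again.
 */
static void
heap_compact(int *a, size_t *n)
{
    size_t  i = 0;
    int     cand;

    assert(*n > 0);
    cand = a[--(*n)];               /* physically last node */

    for (;;)
    {
        size_t  child = 2 * i + 1;

        if (child >= *n)
            break;
        if (child + 1 < *n && a[child + 1] < a[child])
            child++;                /* pick the smaller child */
        if (cand <= a[child])
            break;                  /* candidate reached its level */
        a[i] = a[child];            /* pull child up into the hole */
        i = child;
    }
    a[i] = cand;                    /* finally fill the hole */
}
```

Note how the hole "descends" the heap while the candidate is only stored once, at the end, which is the swap-avoidance Peter mentions.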
On Fri, Aug 12, 2016 at 4:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Doesn't tuplesort_heap_siftup() actually shift-down?
>
> The Wikipedia article on heaps [1] lists "shift-down" (never "sift down", FWIW) as a common operation on a heap:
>
> "shift-down: move a node down in the tree, similar to shift-up; used to restore heap condition after deletion or replacement."
>
> This seems to be what tuplesort_heap_siftup() actually does (among other things; its job is to compact the heap following caller returning the root, where caller leaves a "hole" at the root position 0).

Heikki (CC'd) and I discussed this privately today, and we were in agreement that this needs to be fixed, so I wrote a patch, which I attach here.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Fri, Aug 12, 2016 at 4:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Doesn't tuplesort_heap_siftup() actually shift-down?

The reason it's called siftup is that that's what Knuth calls it. See Algorithm 5.2.3H (Heapsort), pp 146-147 in the first edition of Volume 3; tuplesort_heap_siftup corresponds directly to steps H3-H8.

> Heikki (CC'd) and I discussed this privately today, and we were in agreement that this needs to be fixed, so I wrote a patch, which I attach here.

I think this patch is misguided, as it unnecessarily moves the code away from standard nomenclature.

			regards, tom lane
On Wed, Sep 7, 2016 at 2:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The reason it's called siftup is that that's what Knuth calls it. See Algorithm 5.2.3H (Heapsort), pp 146-147 in the first edition of Volume 3; tuplesort_heap_siftup corresponds directly to steps H3-H8.

I see that Knuth explains that these steps form the siftup procedure. What steps does tuplesort_heap_insert correspond to, if any?

5.2.3.H is about heapsort, and so the Wikipedia article on heapsort (not the one on heaps in general, which I referenced first) may be a useful reference here. It's also freely available. Consider the Wikipedia pseudocode for siftup [1], from classic heapsort. That version goes from child to parent each iteration (in the first iteration, "child" is initialized to "end" -- not root). So, it *ascends* the heap ("child" is assigned from "parent" for any subsequent iterations). OTOH, tuplesort_heap_siftup always starts from the root of tuplesort's top-level heap (or the "hole" at the root position, if you prefer), and *descends* the heap.

> I think this patch is misguided, as it unnecessarily moves the code away from standard nomenclature.

As I mentioned, the patch is guided by the descriptions of fundamental operations on heaps from Wikipedia (I didn't think much about heapsort until now). I'm not really proposing to change things in a way that is inconsistent with Knuth (regardless of how the term siftup is understood). I'm proposing to change things in a way that seems clearer overall, particularly given the way that these various routines are used in fairly distinct contexts.

The terminology in this area can be confusing. My Sedgewick/Wayne Algorithms book never uses the term shift or sift when discussing heaps, even once in passing. They call shift up "swim", and shift down "sink", possibly because they'd like to avoid any baggage that other terminology has.
I propose, as a compromise, to not introduce the term "shift down" at all in this patch, but to still rename tuplesort_heap_siftup to tuplesort_heap_compact. Even assuming that I'm wrong about siftup here, I think that that still represents an improvement. Would that be acceptable to you?

[1] https://en.wikipedia.org/wiki/Heapsort#Pseudocode

--
Peter Geoghegan
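For contrast with the shift-down above, insertion works by shift-up from the end of the array (the tuplesort_heap_insert() analogue). A minimal sketch on a min-heap of ints, with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative shift-up insertion (hypothetical names, not
 * tuplesort.c itself).  There is no pre-existing "hole": the new
 * node starts in the free slot at the end of the heap and ascends,
 * child to parent, until it reaches its correct level.
 */
static void
heap_insert(int *a, size_t *n, int newval)
{
    size_t  i = (*n)++;             /* free slot at the end */

    while (i > 0)
    {
        size_t  parent = (i - 1) / 2;

        if (a[parent] <= newval)
            break;                  /* correct level reached */
        a[i] = a[parent];           /* pull the larger parent down */
        i = parent;
    }
    a[i] = newval;
}
```

This is the direction the Wikipedia heapsort "siftup" pseudocode moves in, which is the crux of the ambiguity being discussed.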
On 09/08/2016 03:36 AM, Peter Geoghegan wrote:
> On Wed, Sep 7, 2016 at 2:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The reason it's called siftup is that that's what Knuth calls it. See Algorithm 5.2.3H (Heapsort), pp 146-147 in the first edition of Volume 3; tuplesort_heap_siftup corresponds directly to steps H3-H8.
>
> I see that Knuth explains that these steps form the siftup procedure. What steps does tuplesort_heap_insert correspond to, if any?

Hmm. The loop in tuplesort_heap_siftup() indeed looks the same as Knuth's steps H3-H8. But the start is different. Knuth's algorithm begins with the situation that the top node of the sub-heap is in the wrong place, and it pushes it downwards, until the whole sub-heap satisfies the heap condition again. But tuplesort_heap_siftup() begins by moving the last value in the heap to the top, and then it does the H3-H8 steps to move it to the right place.

Using Wikipedia's terms, the tuplesort_heap_siftup() function as a whole is the Extract operation. And the loop inside it corresponds to the Max-Heapify operation, which is the same as Knuth's "siftup".

I still think tuplesort_heap_siftup is a confusing name, although I'm not sure that Peter's "compact" is much better. I suggest that we rename it to tuplesort_heap_delete_top(). In comments within the function, explain that the *loop* corresponds to the "siftup" in Knuth's book.

Interestingly, as Knuth explains the siftup algorithm, it performs a "replace the top" operation, rather than "remove the top". But we use it to remove the top, by moving the last element to the top and running the algorithm after that. Peter's other patch, to change the way we currently replace the top node, to do it in one pass rather than as delete+insert, is actually Knuth's siftup algorithm.

Peter, looking at your "displace" patch in this light, I think tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm calling it now) should share a common subroutine.
Displace replaces the top node with the new node passed as argument, and then calls the subroutine to push it down to the right place. Delete_top replaces the top node with the last value in the heap, and then calls the subroutine. Or perhaps delete_top should remove the last value from the heap, and then call displace() with it, to re-insert it.

- Heikki
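A rough sketch of the sharing Heikki proposes, on an illustrative min-heap of ints (names like sift_down are placeholders, not committed code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Common subroutine: push the value destined for the root down to
 * its proper place.  This loop is what corresponds to Knuth's
 * "siftup" (steps H3-H8).
 */
static void
sift_down(int *a, size_t n, int val)
{
    size_t  i = 0;

    while (2 * i + 1 < n)
    {
        size_t  child = 2 * i + 1;

        if (child + 1 < n && a[child + 1] < a[child])
            child++;
        if (val <= a[child])
            break;
        a[i] = a[child];
        i = child;
    }
    a[i] = val;
}

/* "Displace": replace the root with a caller-supplied value, one pass. */
static void
heap_root_displace(int *a, size_t n, int newval)
{
    sift_down(a, n, newval);
}

/* "Delete top": the last value fills the root's hole; the heap shrinks. */
static void
heap_delete_top(int *a, size_t *n)
{
    (*n)--;
    sift_down(a, *n, a[*n]);
}
```

With this factoring, delete_top really is displace called with the last element, per Heikki's second suggestion.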
On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I still think tuplesort_heap_siftup is a confusing name, although I'm not sure that Peter's "compact" is much better. I suggest that we rename it to tuplesort_heap_delete_top(). In comments within the function, explain that the *loop* corresponds to the "siftup" in Knuth's book.

I'm also fine with that name.

> Interestingly, as Knuth explains the siftup algorithm, it performs a "replace the top" operation, rather than "remove the top". But we use it to remove the top, by moving the last element to the top and running the algorithm after that. Peter's other patch, to change the way we currently replace the top node, to do it in one pass rather than as delete+insert, is actually Knuth's siftup algorithm.

Knuth must have a strange sense of humor.

> Peter, looking at your "displace" patch in this light, I think tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm calling it now), should share a common subroutine. Displace replaces the top node with the new node passed as argument, and then calls the subroutine to push it down to the right place. Delete_top replaces the top node with the last value in the heap, and then calls the subroutine. Or perhaps delete_top should remove the last value from the heap, and then call displace() with it, to re-insert it.

I can see the value in that, but I'd still rather not. The code reuse win isn't all that large, and having a separate tuplesort_heap_root_displace() routine allows us to explain what's going on for merging, to assert what we're able to assert with tape numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex cruft that the existing routines need (if only barely) to support replacement selection. I wouldn't be surprised if, with a very tight inner loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it increases pipeline stalls).

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I still think tuplesort_heap_siftup is a confusing name, although I'm not sure that Peter's "compact" is much better. I suggest that we rename it to tuplesort_heap_delete_top(). In comments within the function, explain that the *loop* corresponds to the "siftup" in Knuth's book.

> I'm also fine with that name.

I can live with it too.

>> Interestingly, as Knuth explains the siftup algorithm, it performs a "replace the top" operation, rather than "remove the top". But we use it to remove the top, by moving the last element to the top and running the algorithm after that. Peter's other patch, to change the way we currently replace the top node, to do it in one pass rather than as delete+insert, is actually Knuth's siftup algorithm.

> Knuth must have a strange sense of humor.

I'm too lazy to get the book back down off the shelf to check, but I think that Knuth uses sift-up to describe two different algorithms that have essentially the same inner loop.

			regards, tom lane
On Thu, Sep 8, 2016 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I still think tuplesort_heap_siftup is a confusing name, although I'm not sure that Peter's "compact" is much better. I suggest that we rename it to tuplesort_heap_delete_top(). In comments within the function, explain that the *loop* corresponds to the "siftup" in Knuth's book.
>
>> I'm also fine with that name.
>
> I can live with it too.

Attached patch does it that way, then. I stuck with the reference to "shift down", though, since I think we all agree that that is unambiguous.

--
Peter Geoghegan
On Thu, Sep 8, 2016 at 4:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 8, 2016 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> I still think tuplesort_heap_siftup is a confusing name, although I'm not sure that Peter's "compact" is much better. I suggest that we rename it to tuplesort_heap_delete_top(). In comments within the function, explain that the *loop* corresponds to the "siftup" in Knuth's book.
>>
>>> I'm also fine with that name.
>>
>> I can live with it too.
>
> Attached patch does it that way, then. I stuck with the reference to "shift down", though, since I think we all agree that that is unambiguous.

I believe the term is "sift" not "shift"
Sift means shift up. There is no such thing as sift down, though, only shift down. That is my understanding, based on the Wikipedia article on heaps.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> Attached patch does it that way, then. I stuck with the reference to "shift down", though, since I think we all agree that that is unambiguous.

I dunno. What you've now got is

 /*
  * The tuple at state->memtuples[0] has been removed from the heap.
- * Decrement memtupcount, and sift up to maintain the heap invariant.
+ * Decrement memtupcount, and shift down to maintain the heap invariant.
  */
 static void
-tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
+tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)

I don't find this clearer at all, because

(1) As noted in the comment, the heap top has *already* been removed, we're just trying to re-establish the heap invariant. So maybe "delete_top" isn't the best name after all.

(2) "shift down" seems a bit weird, since we're moving data closer to the heap top, not further away from it; why isn't that direction "up"?

(3) "shift" makes it sound like it ought to be a simple memmove kind of operation, which it surely is not.

			regards, tom lane
On Thu, Sep 8, 2016 at 12:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  /*
>   * The tuple at state->memtuples[0] has been removed from the heap.
> - * Decrement memtupcount, and sift up to maintain the heap invariant.
> + * Decrement memtupcount, and shift down to maintain the heap invariant.
>   */
>  static void
> -tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
> +tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
>
> I don't find this clearer at all, because
>
> (1) As noted in the comment, the heap top has *already* been removed, we're just trying to re-establish the heap invariant. So maybe "delete_top" isn't the best name after all.

This is why I suggested tuplesort_heap_compact(). When merging, there is a comment associated with a call to the function, "compact the heap". I based my name off of that, thinking that that was really no different to any other caller.

> (2) "shift down" seems a bit weird, since we're moving data closer to the heap top, not further away from it; why isn't that direction "up"?

Well, the fact that the routine does that makes it a higher level thing than something like a sift or a shift. It might just as easily be some other tuple (e.g., from the caller), as far as "shifting down" goes. (I wrote a patch where for merging, it *is* from the caller, and the heap stays the same size, which Heikki mentioned on this thread.)

> (3) "shift" makes it sound like it ought to be a simple memmove kind of operation, which it surely is not.

Then, how about the Sedgewick terminology, "sink"? I'm really not attached to that aspect. I do think that "shift down" is unambiguous, but I'd just as soon use "sink", or even explicitly spell it out.

--
Peter Geoghegan
Hi,

Is this issue really worth keeping several hackers busy?

Andres
On Thu, Sep 8, 2016 at 1:14 PM, Andres Freund <andres@anarazel.de> wrote:
> Is this issue really worth keeping several hackers busy?

I don't think it's fair to put it to me specifically that I'm doing that. That said, I would like to see this resolved without further bikeshedding.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Sep 8, 2016 at 1:14 PM, Andres Freund <andres@anarazel.de> wrote:
>> Is this issue really worth keeping several hackers busy?

> I don't think it's fair to put it to me specifically that I'm doing that. That said, I would like to see this resolved without further bikeshedding.

Well, my vote is that it ain't broke and we shouldn't fix it.

			regards, tom lane
On Thu, Sep 8, 2016 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, my vote is that it ain't broke and we shouldn't fix it.

To take a step back, what prompted this whole discussion is the patch that I wrote that shifts down, replacing calls to tuplesort_heap_siftup() and tuplesort_heap_insert() with one new call to a function I've called tuplesort_heap_root_displace() (today, Claudio reports that it makes some of his test queries go 25% faster). This new function shifts down. It's not clear what I'm supposed to say about that, given the current understanding. So, in a sense, it's blocking on this.

--
Peter Geoghegan
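The merging context Peter refers to is roughly the following (an illustrative k-way merge of sorted int runs; hypothetical names, a sketch rather than the patch itself). Displacing the root refills the heap in a single shift-down pass, instead of a delete-top followed by an insert:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative heap entry: a value tagged with the run it came from. */
typedef struct
{
    int     val;                /* current value from this run */
    int     run;                /* index of the run it came from */
} HeapEnt;

/* Shift the entry at position i down until the min-heap invariant holds. */
static void
sift_down(HeapEnt *h, size_t n, size_t i)
{
    HeapEnt e = h[i];

    for (;;)
    {
        size_t  c = 2 * i + 1;

        if (c >= n)
            break;
        if (c + 1 < n && h[c + 1].val < h[c].val)
            c++;
        if (e.val <= h[c].val)
            break;
        h[i] = h[c];
        i = c;
    }
    h[i] = e;
}

/*
 * Merge nruns sorted runs into out[].  Each step emits the heap root,
 * then either "displaces" it with the next value from the same run
 * (one shift-down, heap size unchanged) or deletes the top once that
 * run is exhausted.
 */
static void
merge_runs(const int **runs, const size_t *lens, size_t nruns, int *out)
{
    HeapEnt h[16];              /* assumes nruns <= 16 for this sketch */
    size_t  pos[16];
    size_t  n = 0, outpos = 0, r;

    for (r = 0; r < nruns; r++)
    {
        pos[r] = 0;
        if (lens[r] > 0)
        {
            h[n].val = runs[r][0];
            h[n].run = (int) r;
            n++;
            pos[r] = 1;
        }
    }
    for (r = n / 2; r-- > 0;)   /* heapify bottom-up */
        sift_down(h, n, r);

    while (n > 0)
    {
        int     run = h[0].run;

        out[outpos++] = h[0].val;
        if (pos[run] < lens[run])
        {
            /* root_displace: refill from the same run, one pass */
            h[0].val = runs[run][pos[run]++];
            sift_down(h, n, 0);
        }
        else
        {
            /* delete_top: last entry fills the hole, heap shrinks */
            h[0] = h[--n];
            sift_down(h, n, 0);
        }
    }
}
```

The point of the displace path is that the heap size never shrinks and grows back within one merge step, which is where the savings over delete+insert come from.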
On Thu, Sep 8, 2016 at 2:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Peter, looking at your "displace" patch in this light, I think tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm calling it now), should share a common subroutine. Displace replaces the top node with the new node passed as argument, and then calls the subroutine to push it down to the right place. Delete_top replaces the top node with the last value in the heap, and then calls the subroutine. Or perhaps delete_top should remove the last value from the heap, and then call displace() with it, to re-insert it.
>
> I can see the value in that, but I'd still rather not. The code reuse win isn't all that large, and having a separate tuplesort_heap_root_displace() routine allows us to explain what's going on for merging, to assert what we're able to assert with tape numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex cruft that the existing routines need (if only barely) to support replacement selection. I wouldn't be surprised if, with a very tight inner loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it increases pipeline stalls).

BTW, regarding this: since I read in some other thread that it was OK to use static inline functions, I believe the compiler is smart enough to optimize out the constant true/false in checkIndex for inlined calls, so that may be viable (on modern compilers).