Thread: Memory usage during sorting
In tuplesort.c, tuples are initially read into memory until availMem is exhausted. It then switches to the tape sort algorithm, and allocates buffer space for each tape it will use. This substantially overruns allowedMem and drives availMem negative. It works off this deficit by writing tuples to tape and pfree-ing their spots in the heap, with the pfreed space being more or less randomly scattered. Once this is done it proceeds to alternate between freeing more space in the heap and adding things to the heap (in a nearly strict 1+1 alternation if the tuples are of constant size).

The space freed up by the initial round of pfrees, where it is working off the space deficit from inittapes, is never re-used. It also cannot be paged out by the VM system, because it is scattered among actively used memory. I don't think that small chunks can be reused from one memory context to another, but I haven't checked. Even if they can be, during a big sort like an index build the backend process doing the build may have no other contexts which need to use them. So having overrun workMem and stomped all over it to ensure no one else can re-use it, we then scrupulously refuse to benefit from that over-run amount ourselves.

The attached patch allows it to reuse that memory. On my meager system it reduced the time to build an index on an integer column in a skinny 200 million row, totally randomly ordered table by about 3%, from a baseline of 25 minutes.

I think it would be better to pre-deduct the tape overhead amount we will need if we decide to switch to tape sort from the availMem before we even start reading (and then add it back if we do indeed make that switch). That way we wouldn't over-run the memory in the first place. However, that would cause apparent regressions in which sorts that previously fit into maintenance_work_mem no longer do. Boosting maintenance_work_mem to a level that was actually being used previously would fix those regressions, but pointing out that the previous behavior was not optimal doesn't change the fact that people are used to it and perhaps tuned to it. So the attached patch seems more backwards-friendly.

Cheers,

Jeff
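(To make the accounting above concrete, here is a minimal toy model in C of the two behaviors being contrasted: charging the tape buffers only at the switch to tape sort, versus pre-deducting them before reading begins. The struct, constant, and function names are invented for illustration and are not the actual tuplesort.c symbols; the 3-blocks-per-tape figure is an assumption taken from later in the thread.)

/*
 * Toy model of the memory accounting described above; NOT the actual
 * tuplesort.c code.  Only the general shape -- availMem going negative at
 * the switch to tape sort, versus pre-deducting the tape overhead -- follows
 * the discussion.
 */
#include <stdio.h>

#define TAPE_OVERHEAD_PER_TAPE (3 * 8192)	/* assumed: ~3 blocks of 8 kB per tape */

typedef struct
{
	long	allowedMem;		/* work_mem, in bytes */
	long	availMem;		/* goes negative when we over-commit */
} ToySortState;

/* Current behavior: tape buffers are charged only when we switch. */
static void
switch_to_tape_sort(ToySortState *state, int ntapes)
{
	state->availMem -= (long) ntapes * TAPE_OVERHEAD_PER_TAPE;
	/* availMem is now typically negative; the deficit is worked off by
	 * dumping tuples, but the pfreed space is fragmented and hard to reuse. */
}

/* Alternative mentioned above: pre-deduct the overhead before reading. */
static void
start_reading_with_prededuction(ToySortState *state, int ntapes)
{
	state->availMem = state->allowedMem - (long) ntapes * TAPE_OVERHEAD_PER_TAPE;
	/* If the sort never spills, the deduction would simply be added back. */
}

int
main(void)
{
	ToySortState s = {.allowedMem = 16 * 1024 * 1024};

	s.availMem = 0;				/* pretend memory has just been exhausted */
	switch_to_tape_sort(&s, 7);
	printf("after switch, availMem = %ld\n", s.availMem);

	start_reading_with_prededuction(&s, 7);
	printf("with pre-deduction, availMem = %ld\n", s.availMem);
	return 0;
}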
On 16 January 2012 00:59, Jeff Janes <jeff.janes@gmail.com> wrote:
> I think it would be better to pre-deduct the tape overhead amount we will need if we decide to switch to tape sort from the availMem before we even start reading (and then add it back if we do indeed make that switch). That way we wouldn't over-run the memory in the first place. However, that would cause apparent regressions in which sorts that previously fit into maintenance_work_mem no longer do. Boosting maintenance_work_mem to a level that was actually being used previously would fix those regressions, but pointing out that the previous behavior was not optimal doesn't change the fact that people are used to it and perhaps tuned to it. So the attached patch seems more backwards-friendly.

Hmm. Are people really setting maintenance_work_mem such that it is exactly large enough to quicksort when building an index in one case but not another? Is the difference large enough to warrant avoiding pre-deduction?

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 16 January 2012 00:59, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I think it would be better to pre-deduct the tape overhead amount we will need if we decide to switch to tape sort from the availMem before we even start reading (and then add it back if we do indeed make that switch). That way we wouldn't over-run the memory in the first place. However, that would cause apparent regressions in which sorts that previously fit into maintenance_work_mem no longer do. Boosting maintenance_work_mem to a level that was actually being used previously would fix those regressions, but pointing out that the previous behavior was not optimal doesn't change the fact that people are used to it and perhaps tuned to it. So the attached patch seems more backwards-friendly.
>
> Hmm. Are people really setting maintenance_work_mem such that it is exactly large enough to quicksort when building an index in one case but not another?

It could also apply to work_mem in other situations. I'm just using maintenance_work_mem in my example because index creation is the thing that led me into this little diversion, and so that is what my test-bed is set up to use.

Some people do have very stable "ritualized" work-loads and so could have tuned exactly to them. I don't know how common that would be. More common might be synthetic benchmarks which had been tuned, either intentionally or accidentally, to just barely fit in the (overshot) memory, so it would be a perception problem that there was a regression when in fact the tuning merely became out of date.

> Is the difference large enough to warrant avoiding pre-deduction?

Switching to an external sort when you could have done it all by quicksort (by cheating just a little bit on memory usage) makes a pretty big difference, over 2-fold in my tests. If the fast-path optimization for qsort goes in, the difference might get even bigger. Of course, there will always be some point at which that switch-over must occur, so the real question is: do we need to keep that switch point historically consistent to avoid nasty surprises for people with excessively fine-tuned *work_mem settings based on old behavior? And do such people even exist outside of my imagination?

But a bigger question I now have is whether any of this matters. Since there is currently no way to know how many connections might be running how many concurrent sorts, there is really no way to tune work_mem with much precision. So, does it matter if we slightly lie about how much we will actually use? And is it worth worrying about whether every last drop of memory is used efficiently?

I was trying to compare the performance I was getting with a theoretical model of what I should get, and it was confusing to me that we were using an originally defined size of memory for the priority heap, and then later reducing that amount of memory. That is annoying because it complicates the theoretical model, and even more annoying when I realized that the "freed" memory that results from doing this is too fragmented to even be used for other purposes. But theoretical annoyances aside, it is hardly the worst thing about memory usage in sorting. It is just one that is standing in the way of my further study. So I don't think this patch I proposed is particularly important in its own right. I want to see if anyone will point out that I am all wet because my analysis failed to take <foo> into account.
I probably should have explained this when I submitted the patch.

The worst thing about the current memory usage is probably that big machines can't qsort more than 16,777,216 tuples no matter how much memory they have, because memtuples has to live entirely within a single allocation, which is currently limited to 1GB. It is probably too late to fix this problem for 9.2. I wish I had started there rather than here. This 16,777,216 tuple limitation will get even more unfortunate if the specializations that speed up qsort but not external sort get accepted.

Attached is a completely uncommittable proof of concept/work in progress patch to get around the limitation. It shows a 2-fold improvement when indexing an integer column on a 50,000,000 row randomly ordered table.

As for what to do to make it committable, it seems like maybe there should be two different MaxAllocSizes. One is determined by the fact that aset.c assumes it can compute twice an allocation's size without overflow; I would think that simply testing size_t at compile time would be enough. Allowing more than 1 GB allocations would probably be pointless on 32 bit machines, and 64 bit should be able to compute twice a much larger value than 1GB without overflow. For the varlena stuff, I don't know what to do. Is the "I swear to God I won't pass this pointer to one of those functions" approach sufficient? Or does each function need to test it? And I'm pretty sure that palloc, not just repalloc, would need an over-size alternative to make this a general-purpose improvement.

Cheers,

Jeff
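(A minimal sketch of the compile-time size_t test suggested above. The macro and function names here are hypothetical, not PostgreSQL's actual definitions, and the 64-bit ceiling chosen is arbitrary.)

/*
 * Illustrative only: one way to express "pick the allocation ceiling from
 * the width of size_t at compile time".
 */
#include <stdint.h>
#include <stddef.h>

#if SIZE_MAX > 0xFFFFFFFFu
/*
 * 64-bit: doubling a multi-gigabyte request cannot overflow size_t, so a
 * much larger ceiling could be offered to callers such as the memtuples
 * array (the value here is arbitrary).
 */
#define MaxSortArrayAllocSize	(((size_t) 1) << 40)
#else
/* 32-bit: keep roughly the existing 1 GB ceiling. */
#define MaxSortArrayAllocSize	((size_t) 0x3FFFFFFF)
#endif

/* Example guard a grow_memtuples-style routine might apply (slot size > 0). */
static inline int
allocation_would_exceed_limit(size_t ntuples, size_t tuple_slot_size)
{
	return ntuples > MaxSortArrayAllocSize / tuple_slot_size;
}

(Even with a larger allocation ceiling, individual varlena datums would still be capped by their headers, which is presumably why the varlena question is called out separately above.)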
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a completely uncommittable proof of concept/work in progress patch to get around the limitation. It shows a 2-fold improvement when indexing an integer column on a 50,000,000 row randomly ordered table.

Oops, forgot to say that this is with set maintenance_work_mem=4194304;
On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The attached patch allows it to reuse that memory. On my meager system it reduced the building of an index on an integer column in a skinny 200 million row totally randomly ordered table by about 3% from a baseline of 25 minutes.

Just to give a standard review: this patch is a one line change, applies cleanly, and builds ok.

I'm not quite sure what exactly you're trying to accomplish, but it seems to me that it's avoiding the first dumptuples cycle by forcing availMem = 0 even if it's negative. I read your comments as saying that it avoids alternating reads and writes back and forth over scattered memory, which defeats the memory cache during the merge phase; but actually it doesn't affect the merge phase, only the initial dump phase, does it? If so, I'm not so convinced your benchmark gave a 3% gain from this change. Correct me, as I'm probably wrong.

Anyway, it would be nice to modify the comment just above the change, since it's no longer true with it.

Thanks,
--
Hitoshi Harada
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The worst thing about the current memory usage is probably that big machines can't qsort more than 16,777,216 tuples no matter how much memory they have, because memtuples has to live entirely within a single allocation, which is currently limited to 1GB. It is probably too late to fix this problem for 9.2. I wish I had started there rather than here.
>
> This 16,777,216 tuple limitation will get even more unfortunate if the specializations that speed up qsort but not external sort get accepted.

I think it's a fair ask to extend our 1GB palloc limitation into the 64-bit space. I see a lot of applications that want more memory from a single palloc call in user-defined functions, aggregates, etc. As you may notice, however, the areas in postgres that would have to accommodate it need to be investigated deeply. I don't know where it's safe to allow it and where it is not. varlena types are unsafe, but it might be possible to extend the varlena header to 64 bits in a major release somehow.

> Attached is a completely uncommittable proof of concept/work in progress patch to get around the limitation. It shows a 2-fold improvement when indexing an integer column on a 50,000,000 row randomly ordered table.

In any case, we need a bird's-eye sketch of how to attack it, but I think it's worth doing and at some future point we definitely must do it, though I don't know whether that's the next release or the third release from now.

Thanks,
--
Hitoshi Harada
On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> The attached patch allows it to reuse that memory. On my meager system it reduced the building of an index on an integer column in a skinny 200 million row totally randomly ordered table by about 3% from a baseline of 25 minutes.
>
> Just to give a standard review: this patch is a one line change, applies cleanly, and builds ok.
>
> I'm not quite sure what exactly you're trying to accomplish, but it seems to me that it's avoiding the first dumptuples cycle by forcing availMem = 0 even if it's negative.

Yes. Currently when it switches to the TSS_BUILDRUNS part of a tape-sort, it starts by calling WRITETUP a large number of times consecutively, to work off the memory deficit incurred by the 3 blocks per tape of tape overhead, and then after that calls WRITETUP about once per puttuple. Under my patch, it would only call WRITETUP about once per puttuple, right from the beginning.

> I read your comments as saying that it avoids alternating reads and writes back and forth over scattered memory, which defeats the memory cache during the merge phase; but actually it doesn't affect the merge phase, only the initial dump phase, does it?

It affects the building of the runs. But this building of the runs is not a simple dump; it is itself a mini merge phase, in which it merges the existing in-memory priority queue against the still-incoming tuples from the node which invoked the sort. By using less memory than it could, this means that the resulting runs are smaller than they could be, and so will sometimes necessitate an additional layer of merging later on.

(This effect is particularly large for the very first run being built. Generally, by merging incoming tuples into the memory-tuples, you can create runs that are 1.7 times the size of what fits in memory. By wasting some memory, we are getting 1.7 times the size of a smaller starting point. But for the first run, it is worse than that. Most of the benefit that leads to that 1.7 multiplier comes at the very early stage of each run-build. But by initially using the full memory, then writing out a bunch of tuples without doing any merge of the incoming, we have truncated the part that gives the most benefit.)

My analysis that the "freed" memory is never reused (because we refuse to reuse it ourselves and it is too fragmented to be reused by anyone else, like the palloc or VM system) only applies to the run-building phase. So "never" was a bit of an overstatement. By the time the last initial run is completely written out to tape, the heap used for the priority queue should be totally empty. So at this point the allocator would have the chance to congeal all of the fragmented memory back into larger chunks, or maybe it parcels the allocations back out again in an order such that the unused space is contiguous and could be meaningfully paged out.

But is it worth worrying about how much we fragment memory and whether we overshoot our promises by 10 or 20%?

> If so, I'm not so convinced your benchmark gave a 3% gain from this change. Correct me, as I'm probably wrong.

I've now done more complete testing. Building an index on a 200,000,000 row table with an integer column populated in random order with integers from 1..500,000,000, non-unique, on a machine with 2GB of RAM and 600MB of shared_buffers.
It improves things by 6-7 percent at the low end of the maintenance_work_mem sizes; the rest are in the noise, except at 77936 kB, where it reproducibly makes things 4% worse, for reasons I haven't figured out. So maybe the best thing to do is, rather than micromanaging memory usage, simply don't set maintenance_work_mem way too low. (But that is the default.)

maintenance_work_mem (kB)    %Change with patch
16384                         6.2
19484                         7.8
23170                        -0.1
27554                         0.1
32768                         1.9
38968                        -1.9
46341                         0.4
55109                         0.4
65536                         0.6
77936                        -4.3
92682                        -0.3
110218                        0.1
131072                        1.7

> Anyway, it would be nice to modify the comment just above the change, since it's no longer true with it.

Much of that comment is referring to other things (some of which I don't understand). How about an additional comment just before my code, to the gist of:

/*
 * If we have already used, and thus fragmented, the memory, then we
 * might as well continue to make use of it, as no one else can.
 */

(That might not actually be true if the tuples we are sorting are very large, or perhaps if they arrive in already reverse-sorted order.)

Thanks,

Jeff
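(A toy model of the "mini merge" during run building described above, i.e. replacement selection. This is not the tuplesort.c code: it sorts plain ints, the priority queue is a linear scan over a tiny workspace, and deferred tuples are simply tagged with the next run number. With a workspace of 4 values, the sample input yields a first run of 7 values, illustrating how runs come out larger than the memory that built them -- and why effectively shrinking that memory shrinks the runs.)

/* Toy replacement-selection run builder; illustrative only. */
#include <stdio.h>

#define WORKSPACE 4				/* stand-in for work_mem, in tuples */

typedef struct { int value; int run; } Slot;	/* run == 0 means empty */

int
main(void)
{
	int		input[] = {5, 12, 3, 9, 14, 7, 2, 18, 11, 6, 1, 15};
	int		ninput = sizeof(input) / sizeof(input[0]);
	Slot	ws[WORKSPACE];
	int		next = 0, current_run = 1;

	for (int i = 0; i < WORKSPACE; i++)
		ws[i] = (Slot) {input[next++], 1};

	printf("run 1:");
	for (;;)
	{
		int		best = -1, live = 0;

		for (int i = 0; i < WORKSPACE; i++)
		{
			if (ws[i].run == 0)
				continue;		/* empty slot */
			live++;
			if (ws[i].run == current_run &&
				(best < 0 || ws[i].value < ws[best].value))
				best = i;
		}

		if (live == 0)
			break;				/* everything has been written out */

		if (best < 0)
		{
			/* No tuple belongs to the current run any more: start the next. */
			current_run++;
			printf("\nrun %d:", current_run);
			continue;
		}

		int		out = ws[best].value;
		printf(" %d", out);

		if (next < ninput)
		{
			ws[best].value = input[next++];
			/* A new tuple smaller than the one just written cannot go
			 * into the current run, so tag it for the next run. */
			ws[best].run = (ws[best].value < out) ? current_run + 1 : current_run;
		}
		else
			ws[best].run = 0;	/* no replacement: slot becomes empty */
	}
	printf("\n");
	return 0;
}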
On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: >> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> >>> The attached patch allows it to reuse that memory. On my meager >>> system it reduced the building of an index on an integer column in a >>> skinny 200 million row totally randomly ordered table by about 3% from >>> a baseline of 25 minutes. >>> >> >> Just to give a standard review, this patch is one line change and >> applies cleanly, builds ok. >> >> I'm not pretty sure what exactly you're trying to accomplish, but it >> seems to me that it's avoiding the first dumptuples cycle by forcing >> availMem = 0 even if it's negative. > > Yes. Currently when it switches to the TSS_BUILDRUNS part of a > tape-sort, it starts by calling WRITETUP a large number of time > consecutively, to work off the memory deficit incurred by the 3 blocks > per tape of tape overhead, and then after that calls WRITETUP about > once per puttuple.. Under my patch, it would only call WRITETUP > about once per puttuple, right from the beginning. > >> I read your comments as it'd be >> avoiding to alternate reading/writing back and force with scattered >> memory failing memory cache much during merge phase, but actually it >> doesn't affect merge phase but only init-dump phase, does it? > > It effects the building of the runs. But this building of the runs is > not a simple dump, it is itself a mini merge phase, in which it merges > the existing in-memory priority queue against the still-incoming > tuples from the node which invoked the sort. By using less memory > than it could, this means that the resulting runs are smaller than > they could be, and so will sometimes necessitate an additional layer > of merging later on. (This effect is particularly large for the very > first run being built. Generally by merging incoming tuples into the > memory-tuples, you can create runs that are 1.7 times the size of fits > in memory. By wasting some memory, we are getting 1.7 the size of a > smaller starting point. But for the first run, it is worse than that. > Most of the benefit that leads to that 1.7 multiplier comes at the > very early stage of each run-build. But by initially using the full > memory, then writing out a bunch of tuples without doing any merge of > the incoming, we have truncated the part that gives the most benefit.) > > My analysis that the "freed" memory is never reused (because we refuse > to reuse it ourselves and it is too fragmented to be reused by anyone > else, like the palloc or VM system) only applies to the run-building > phase. So never was a bit of an overstatement. By the time the last > initial run is completely written out to tape, the heap used for the > priority queue should be totally empty. So at this point the > allocator would have the chance to congeal all of the fragmented > memory back into larger chunks, or maybe it parcels the allocations > back out again in an order so that the unused space is contiguous and > could be meaningfully paged out. > > But, it is it worth worrying about how much we fragment memory and if > we overshoot our promises by 10 or 20%? > >> If so, >> I'm not so convinced your benchmark gave 3 % gain by this change. >> Correct me as I'm probably wrong. > > I've now done more complete testing. 
> Building an index on a 200,000,000 row table with an integer column populated in random order with integers from 1..500,000,000, non-unique, on a machine with 2GB of RAM and 600MB of shared_buffers.
>
> It improves things by 6-7 percent at the low end of the maintenance_work_mem sizes; the rest are in the noise, except at 77936 kB, where it reproducibly makes things 4% worse, for reasons I haven't figured out. So maybe the best thing to do is, rather than micromanaging memory usage, simply don't set maintenance_work_mem way too low. (But that is the default.)

I've tested here with only a million rows of mixed integer/text (table size is 80MB or so) with the default setting, running CREATE INDEX continuously, but couldn't find a performance improvement. The number varies from -2% to +2%, which I think is just noise.

While I agree with your point that avoiding the first dump would make sense, I guess it depends on the situation; if the dump covers lots of tuples (which should happen when availMem is big), writing many tuples at a time will be faster than writing them little by little later.

I'm not sure about the conclusion, but given this discussion, I'm inclined to mark this Returned with Feedback.

Thanks,
--
Hitoshi Harada
On Tue, Feb 14, 2012 at 1:44 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: > On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: >>> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>>> >>>> The attached patch allows it to reuse that memory. On my meager >>>> system it reduced the building of an index on an integer column in a >>>> skinny 200 million row totally randomly ordered table by about 3% from >>>> a baseline of 25 minutes. >>>> >>> >>> Just to give a standard review, this patch is one line change and >>> applies cleanly, builds ok. >>> >>> I'm not pretty sure what exactly you're trying to accomplish, but it >>> seems to me that it's avoiding the first dumptuples cycle by forcing >>> availMem = 0 even if it's negative. >> >> Yes. Currently when it switches to the TSS_BUILDRUNS part of a >> tape-sort, it starts by calling WRITETUP a large number of time >> consecutively, to work off the memory deficit incurred by the 3 blocks >> per tape of tape overhead, and then after that calls WRITETUP about >> once per puttuple.. Under my patch, it would only call WRITETUP >> about once per puttuple, right from the beginning. >> >>> I read your comments as it'd be >>> avoiding to alternate reading/writing back and force with scattered >>> memory failing memory cache much during merge phase, but actually it >>> doesn't affect merge phase but only init-dump phase, does it? >> >> It effects the building of the runs. But this building of the runs is >> not a simple dump, it is itself a mini merge phase, in which it merges >> the existing in-memory priority queue against the still-incoming >> tuples from the node which invoked the sort. By using less memory >> than it could, this means that the resulting runs are smaller than >> they could be, and so will sometimes necessitate an additional layer >> of merging later on. (This effect is particularly large for the very >> first run being built. Generally by merging incoming tuples into the >> memory-tuples, you can create runs that are 1.7 times the size of fits >> in memory. By wasting some memory, we are getting 1.7 the size of a >> smaller starting point. But for the first run, it is worse than that. >> Most of the benefit that leads to that 1.7 multiplier comes at the >> very early stage of each run-build. But by initially using the full >> memory, then writing out a bunch of tuples without doing any merge of >> the incoming, we have truncated the part that gives the most benefit.) >> >> My analysis that the "freed" memory is never reused (because we refuse >> to reuse it ourselves and it is too fragmented to be reused by anyone >> else, like the palloc or VM system) only applies to the run-building >> phase. So never was a bit of an overstatement. By the time the last >> initial run is completely written out to tape, the heap used for the >> priority queue should be totally empty. So at this point the >> allocator would have the chance to congeal all of the fragmented >> memory back into larger chunks, or maybe it parcels the allocations >> back out again in an order so that the unused space is contiguous and >> could be meaningfully paged out. >> >> But, it is it worth worrying about how much we fragment memory and if >> we overshoot our promises by 10 or 20%? >> >>> If so, >>> I'm not so convinced your benchmark gave 3 % gain by this change. >>> Correct me as I'm probably wrong. 
>> I've now done more complete testing. Building an index on a 200,000,000 row table with an integer column populated in random order with integers from 1..500,000,000, non-unique, on a machine with 2GB of RAM and 600MB of shared_buffers.
>>
>> It improves things by 6-7 percent at the low end of the maintenance_work_mem sizes; the rest are in the noise, except at 77936 kB, where it reproducibly makes things 4% worse, for reasons I haven't figured out. So maybe the best thing to do is, rather than micromanaging memory usage, simply don't set maintenance_work_mem way too low. (But that is the default.)

> I've tested here with only a million rows of mixed integer/text (table size is 80MB or so) with the default setting, running CREATE INDEX continuously, but couldn't find a performance improvement. The number varies from -2% to +2%, which I think is just noise.
>
> While I agree with your point that avoiding the first dump would make sense, I guess it depends on the situation; if the dump covers lots of tuples (which should happen when availMem is big), writing many tuples at a time will be faster than writing them little by little later.
>
> I'm not sure about the conclusion, but given this discussion, I'm inclined to mark this Returned with Feedback.

OK, thanks. Does anyone have additional feedback on how tightly we wish to manage memory usage? Is trying to make us use as much memory as we are allowed to, without going over, a worthwhile endeavor at all, or is it just academic nitpicking?

Also, since the default value of work_mem is quite low, should the docs be more aggressive in suggesting that people using any realistically modern hardware tune it upwards? If you have a few thousand large sorts running concurrently, I think that having all of them spill to disk intentionally is probably going to destroy your server's performance about as much as having them spill to disk via swapping will. If you find yourself in this situation, the answer is probably to use a connection pooler with admission control, not a work_mem of 1MB.

Cheers,

Jeff
On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I'm not sure about the conclusion, but given this discussion, I'm inclined to mark this Returned with Feedback.
>
> OK, thanks. Does anyone have additional feedback on how tightly we wish to manage memory usage? Is trying to make us use as much memory as we are allowed to, without going over, a worthwhile endeavor at all, or is it just academic nitpicking?

I'm not sure, either. It strikes me that, in general, it's hard to avoid a little bit of work_mem overrun, since we can't know whether the next tuple will fit until we've read it, and then if it turns out to be big, well, the best thing we can do is free it, but perhaps that's closing the barn door after the horse has gotten out already.

Having recently spent quite a bit of time looking at tuplesort.c as a result of Peter Geoghegan's work and some other concerns, I'm inclined to think that it needs more than minor surgery. That file is peppered with numerous references to Knuth which serve the dual function of impressing the reader with the idea that the code must be very good (since Knuth is a smart guy) and rendering it almost completely impenetrable (since the design is justified by reference to a textbook most of us probably do not have copies of).

A quick Google search for external sorting algorithms suggests that the typical way of doing an external sort is to read data until you fill your in-memory buffer, quicksort it, and dump it out as a run. Repeat until end-of-data; then, merge the runs (either in a single pass, or if there are too many, in multiple passes). I'm not sure whether that would be better than what we're doing now, but there seem to be enough other people doing it that we might want to try it out. Our current algorithm is to build a heap, bounded in size by work_mem, and dribble tuples in and out, but maintaining that heap is pretty expensive; there's a reason people use quicksort rather than heapsort for in-memory sorting.

As a desirable side effect, I think it would mean that we could dispense with retail palloc and pfree altogether. We could just allocate a big chunk of memory, copy tuples into it until it's full, using a pointer to keep track of the next unused byte, and then, after writing the run, reset the allocation pointer back to the beginning of the buffer. That would not only avoid the cost of going through palloc/pfree, but also the memory overhead imposed by bookkeeping and power-of-two rounding.

If we do want to stick with the current algorithm, there seem to be some newer techniques for cutting down on the heap maintenance overhead. Heikki's been investigating that a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
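(A minimal sketch of the "one big buffer plus a bump pointer" idea described above. The names and error handling are illustrative, not an actual PostgreSQL API; a real version would also keep each stored tuple suitably aligned.)

#include <stdlib.h>
#include <string.h>

typedef struct
{
	char   *base;		/* start of the big chunk */
	size_t	size;		/* total capacity */
	size_t	used;		/* bump pointer: next unused byte */
} RunArena;

static int
arena_init(RunArena *a, size_t size)
{
	a->base = malloc(size);
	a->size = size;
	a->used = 0;
	return a->base != NULL;
}

/* Copy one tuple into the arena; returns NULL when the run is "full". */
static void *
arena_store(RunArena *a, const void *tuple, size_t len)
{
	if (a->size - a->used < len)
		return NULL;			/* time to sort and write out the run */
	void   *dst = a->base + a->used;	/* note: no per-tuple header, no rounding */
	memcpy(dst, tuple, len);
	a->used += len;
	return dst;
}

/* After the run has been written to tape, recycle the whole buffer. */
static void
arena_reset(RunArena *a)
{
	a->used = 0;
}

(The array of pointers that qsort would still need, raised below in the thread, would simply record each value returned by arena_store.)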
Robert Haas <robertmhaas@gmail.com> writes:
> A quick Google search for external sorting algorithms suggests that the typical way of doing an external sort is to read data until you fill your in-memory buffer, quicksort it, and dump it out as a run. Repeat until end-of-data; then, merge the runs (either in a single pass, or if there are too many, in multiple passes). I'm not sure whether that would be better than what we're doing now, but there seem to be enough other people doing it that we might want to try it out. Our current algorithm is to build a heap, bounded in size by work_mem, and dribble tuples in and out, but maintaining that heap is pretty expensive; there's a reason people use quicksort rather than heapsort for in-memory sorting.

Well, the reason the heapsort approach is desirable here is that you end up with about half as many runs to merge, because the typical run length is significantly more than what will fit in work_mem. Don't get too excited about micro-optimization if you haven't understood the larger context.

			regards, tom lane
On Sun, Feb 26, 2012 at 7:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> I'm not sure about the conclusion, but given this discussion, I'm >>> inclined to mark this Returned with Feedback. >> >> OK, thanks. Does anyone have additional feed-back on how tightly we >> wish to manage memory usage? Is trying to make us use as much memory >> as we are allowed to without going over a worthwhile endeavor at all, >> or is it just academic nitpicking? > > I'm not sure, either. It strikes me that, in general, it's hard to > avoid a little bit of work_mem overrun, since we can't know whether > the next tuple will fit until we've read it, and then if it turns out > to be big, well, the best thing we can do is free it, but perhaps > that's closing the barn door after the horse has gotten out already. That type of overrun doesn't bother me much, because the size of the next tuple some else feeds us is mostly outside of this modules control. And also be cause the memory that is overrun should be reusable once the offending tuple is written out to tape. The type of overrun I'm more concerned with is that which is purely under this modules control, and which is then not re-used productively. The better solution would be to reduce the overhead in the first place. While building the initial runs, there is no reason to have 3 blocks worth of overhead for each tape, when only one tape is ever being used at a time. But that change seems much tougher to implement. > Having recently spent quite a bit of time looking at tuplesort.c as a > result of Peter Geoghegan's work and some other concerns, I'm inclined > to think that it needs more than minor surgery. That file is peppered > with numerous references to Knuth which serve the dual function of > impressing the reader with the idea that the code must be very good > (since Knuth is a smart guy) and rendering it almost completely > impenetrable (since the design is justified by reference to a textbook > most of us probably do not have copies of). Yes, I agree with that analysis. And getting the volume I want by inter-library-loan has been challenging--I keep getting the volume before or after the one I want. Maybe Knuth starts counting volumes at 0 and libraries start counting at 1 :) Anyway, I think the logtape could use redoing. When your tapes are actually physically tape drives, it is necessary to build up runs one after the other on physical tapes, because un-mounting a tape from a tape drive and remounting another tape is not very feasible at scale. Which then means you need to place your runs carefully, because if the final merge finds that two runs it needs are back-to-back on one tape, that is bad. But with disks pretending to be tapes, you could re-arrange the "runs" with just some book keeping. Maintaining the distinction between "tapes" and "runs" is pointless, which means the Fibonacci placement algorithm is pointless as well. > A quick Google search for external sorting algorithms suggest that the > typical way of doing an external sort is to read data until you fill > your in-memory buffer, quicksort it, and dump it out as a run. Repeat > until end-of-data; then, merge the runs (either in a single pass, or > if there are too many, in multiple passes). I'm not sure whether that > would be better than what we're doing now, but there seem to be enough > other people doing it that we might want to try it out. 
> Our current algorithm is to build a heap, bounded in size by work_mem, and dribble tuples in and out, but maintaining that heap is pretty expensive; there's a reason people use quicksort rather than heapsort for in-memory sorting.

But it would mean we have about 1.7x more runs that need to be merged (for initially random data). Considering the minimum merge order is 6, that increase in runs is likely not to lead to an additional level of merging, in which case the extra speed of building the runs would definitely win. But if it does cause an additional level of merge, it could end up being a loss.

Is there some broad corpus of sorting benchmarks which changes could be tested against? I usually end up testing just simple columns of integers or small strings, because they are easy to set up. That is not ideal.

> As a desirable side effect, I think it would mean that we could dispense with retail palloc and pfree altogether. We could just allocate a big chunk of memory, copy tuples into it until it's full, using a pointer to keep track of the next unused byte, and then, after writing the run, reset the allocation pointer back to the beginning of the buffer. That would not only avoid the cost of going through palloc/pfree, but also the memory overhead imposed by bookkeeping and power-of-two rounding.

Wouldn't we still need an array of pointers to the start of every tuple's location in the buffer? Or else, how would qsort know where to find them?

Also, to do this we would need to get around the 1GB allocation limit. It is bad enough that memtuples is limited to 1GB; it would be much worse if the entire arena was limited to that amount.

> If we do want to stick with the current algorithm, there seem to be some newer techniques for cutting down on the heap maintenance overhead. Heikki's been investigating that a bit.

Interesting. Is that investigation around the poor L1/L2 caching properties of large heaps? I was wondering if there might be a way to give tuplesort an initial estimate of how much data there was to sort, so that it could use a smaller amount of memory than the max if it decided that that would lead to better caching effects. Once you know you can't do an in-memory sort, there is no reason to use more memory than the amount that lets you merge all the runs in one go.

Cheers,

Jeff
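(A back-of-the-envelope helper for the "additional level of merging" question above: how many merge passes does a given number of initial runs need at a given merge order? This ignores the details of tuplesort's polyphase merge and is purely illustrative.)

#include <stdio.h>

static int
merge_passes(long runs, int order)
{
	int		passes = 0;

	while (runs > 1)
	{
		runs = (runs + order - 1) / order;	/* each pass merges 'order' runs */
		passes++;
	}
	return passes;
}

int
main(void)
{
	/* 1.7x more runs only sometimes costs an extra pass. */
	printf("100 runs, order 6: %d passes\n", merge_passes(100, 6));	/* 3 */
	printf("170 runs, order 6: %d passes\n", merge_passes(170, 6));	/* 3 */
	printf("200 runs, order 6: %d passes\n", merge_passes(200, 6));	/* 3 */
	printf("220 runs, order 6: %d passes\n", merge_passes(220, 6));	/* 4 */
	return 0;
}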
On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The better solution would be to reduce the overhead in the first place. While building the initial runs, there is no reason to have 3 blocks worth of overhead for each tape, when only one tape is ever being used at a time. But that change seems much tougher to implement.

Hmm, yeah.

> Anyway, I think the logtape could use redoing. When your tapes are actually physically tape drives, it is necessary to build up runs one after the other on physical tapes, because un-mounting a tape from a tape drive and remounting another tape is not very feasible at scale. Which then means you need to place your runs carefully, because if the final merge finds that two runs it needs are back-to-back on one tape, that is bad. But with disks pretending to be tapes, you could re-arrange the "runs" with just some book keeping. Maintaining the distinction between "tapes" and "runs" is pointless, which means the Fibonacci placement algorithm is pointless as well.

I think you're right. It seems to me that we could just write out an arbitrary number of runs, one per file, ending up with files numbered 1..N. If we can do a final merge of N runs without overrunning work_mem, fine. If not, we merge the first K runs (for the largest possible K) and write out a new run N+1. The range of valid run numbers is now K+1..N+1. If those can be merged in a single pass, we do so; otherwise we again merge the first K runs (K+1 through 2K) to create a new run N+2. And so on.

I am not clear, however, whether this would be any faster. It may not be worth tinkering with just for the reduction in code complexity.

> But it would mean we have about 1.7x more runs that need to be merged (for initially random data). Considering the minimum merge order is 6, that increase in runs is likely not to lead to an additional level of merging, in which case the extra speed of building the runs would definitely win. But if it does cause an additional level of merge, it could end up being a loss.

That's true, but the real hit to the run size should be quite a bit less than 1.7x, because we'd also be using memory more efficiently, and therefore more tuples should fit. A little debugging code suggests that on my MacBook Pro, things come out like this:

- Allocations of 8 bytes or less use 24 bytes.
- Allocations of 9-16 bytes use 32 bytes.
- Allocations of 17-32 bytes use 48 bytes.
- Allocations of 33-64 bytes use 80 bytes.
- Allocations of 65-128 bytes use 144 bytes.
- Allocations of 129-256 bytes use 272 bytes.

In other words, the palloc overhead seems to be: round up to the next power of two, and add 16 bytes. This means that if, say, all the rows are exactly 100 bytes, we're using 44% more memory than we would under this scheme, which means that the runs would be 44% longer than otherwise. Of course that's just an example - if you are sorting 132 byte rows, you'll be able to squeeze in 106% more of them, and if you're sorting 128 byte rows, you'll only be able to squeeze in 12.5% more of them. So there are certainly cases where the runs would get shorter, especially if you happened to have mostly-sorted data rather than randomly-ordered data, but not in every case. Still, it may be a terrible idea; I only mention it because I did find a lot of references to people using quicksort to build up the initial runs rather than our heapsort-based approach.

> Is there some broad corpus of sorting benchmarks which changes could be tested against? I usually end up testing just simple columns of integers or small strings, because they are easy to set up. That is not ideal.

I don't know of anything more formal, unfortunately.

>> As a desirable side effect, I think it would mean that we could dispense with retail palloc and pfree altogether. We could just allocate a big chunk of memory, copy tuples into it until it's full, using a pointer to keep track of the next unused byte, and then, after writing the run, reset the allocation pointer back to the beginning of the buffer. That would not only avoid the cost of going through palloc/pfree, but also the memory overhead imposed by bookkeeping and power-of-two rounding.
>
> Wouldn't we still need an array of pointers to the start of every tuple's location in the buffer? Or else, how would qsort know where to find them?

Yes, we'd still need that. I don't see any way to get around that; I just don't like the expense of palloc-ing so many little chunks in addition to that array.

> Also, to do this we would need to get around the 1GB allocation limit. It is bad enough that memtuples is limited to 1GB; it would be much worse if the entire arena was limited to that amount.

Well, we could always allocate multiple 1GB chunks and peel off pieces from each one in turn.

>> If we do want to stick with the current algorithm, there seem to be some newer techniques for cutting down on the heap maintenance overhead. Heikki's been investigating that a bit.
>
> Interesting. Is that investigation around the poor L1/L2 caching properties of large heaps? I was wondering if there might be a way to give tuplesort an initial estimate of how much data there was to sort, so that it could use a smaller amount of memory than the max if it decided that that would lead to better caching effects. Once you know you can't do an in-memory sort, there is no reason to use more memory than the amount that lets you merge all the runs in one go.

No, it's about reducing the number of comparisons needed to maintain the heap property.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
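(A worked example of the rule of thumb measured above: round the request up to the next power of two and add 16 bytes of header. The exact overhead is platform- and version-dependent; this just reproduces the arithmetic behind the 100-, 128-, and 132-byte examples.)

#include <stdio.h>

static size_t
next_power_of_two(size_t n)
{
	size_t	p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

static size_t
estimated_palloc_footprint(size_t request)
{
	return next_power_of_two(request) + 16;	/* rule of thumb from above */
}

int
main(void)
{
	size_t	sizes[] = {100, 128, 132};

	for (int i = 0; i < 3; i++)
	{
		size_t	footprint = estimated_palloc_footprint(sizes[i]);

		printf("%zu-byte tuple -> %zu bytes; %.1f%% more would fit "
			   "without the overhead\n",
			   sizes[i], footprint,
			   100.0 * ((double) footprint / sizes[i] - 1.0));
	}
	return 0;
}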
On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But it would mean we have about 1.7x more runs that need to be merged (for initially random data). Considering the minimum merge order is 6, that increase in runs is likely not to lead to an additional level of merging, in which case the extra speed of building the runs would definitely win. But if it does cause an additional level of merge, it could end up being a loss.
>
> That's true, but the real hit to the run size should be quite a bit less than 1.7x, because we'd also be using memory more efficiently, and therefore more tuples should fit.

I'm not sure I believe the 1.7x. Keep in mind that even after starting a new tape we can continue to read in new tuples that belong on the first tape. So even if you have tuples that are more than N positions out of place (where N is the size of your heap), as long as you don't have very many, you can keep writing out the first tape for quite a while.

I suspect analyzing truly random inputs is also a bit like saying no compression algorithm can work on average. Partly sorted data is quite common, and the tapesort algorithm would be able to do a lot of cases in a single merge that the quicksort-and-merge approach would generate a large number of merges for.

All that said, quicksort and merge would always do no more i/o in cases where the total number of tuples to be sorted is significantly less than N^2, since that would be guaranteed to be possible to process with a single merge pass. (Where "significantly less" has something to do with how many tuples you have to read in one i/o to be efficient.) That probably covers a lot of cases, and Postgres already has the stats to predict when it would happen, more or less.

Fwiw, when I was doing the top-k sorting I did a bunch of experiments and came up with a rule of thumb that our quicksort was about twice as fast as our heapsort. I'm not sure whether that's a big deal or not in this case. Keep in mind that the only win would be reducing the cpu time on a sort where every tuple was being written to disk and read back. For most users that won't run noticeably faster, just reduce cpu time consumption.

--
greg
On Sat, Mar 17, 2012 at 10:47 AM, Greg Stark <stark@mit.edu> wrote: > On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> But it would mean we have about 1.7x more runs that need to be merged >>> (for initially random data). Considering the minimum merge order is >>> 6, that increase in runs is likely not to lead to an additional level >>> of merging, in which case the extra speed of building the runs would >>> definitely win. But if it does cause an additional level of merge, it >>> could end up being a loss. >> >> That's true, but the real hit to the run size should be quite a bit >> less than 1.7x, because we'd also be using memory more efficiently, >> and therefore more tuples should fit. > > I'm not sure I believe the 1.7x. Keep in mind that even after > starting a new tape we can continue to read in new tuples that belong > on the first tape. So even if you have tuples that are more than N > positions out of place (where N is the size of your heap) as long as > you don't have very many you can keep writing out the first tape for > quite a while. > > I suspect analyzing truly random inputs is also a bit like saying no > compression algorithm can work on average. I'm not sure what you mean here. The "also" suggests the previous discussion was about something other than the assumption of randomness, but that discussion doesn't make much sense unless both paragraphs are about that. Anyway I think keeping best/worst cases in mind is certainly a good thing to do. I've certainly seen train wrecks unfold when people assumed anything could be compressed without verifying it. > Partly sorted data is quite > common and the tapesort algorithm would be able to do a lot of cases > in a single merge that the quicksort and merge would generate a large > number of merges for. This is where some common and credible corpus would be invaluable. > All that said, quicksort and merge would always do no more i/o in > cases where the total number of tuples to be sorted is significantly > less than N^2 since that would be guaranteed to be possible to process > with a single merge pass. (Where "significantly less" has something to > do with how many tuples you have to read in one i/o to be efficient). Unless the tuples are quite large, the number needed for efficient i/o is quite high. > That probably covers a lot of cases, and Postgres already has the > stats to predict when it would happen, more or less. Currently sort makes no attempt to use the planner stats. Would that be a good thing to work on? > Fwiw when I was doing the top-k sorting I did a bunch of experiements > and came up with a rule-of-thumb that our quicksort was about twice as > fast as our heapsort. I'm not sure whether that's a big deal or not in > this case. I've been seeing around 3 fold, and that might go up more with some of the work being done that speeds up qsort but not tape sort. > Keep in mind that the only win would be reducing the cpu > time on a sort where every tuple was being written to disk and read > back. For most users that won't run noticeably faster, just reduce cpu > time consumption. But in many (perhaps most) tape sorts CPU time is most of it. A lot of tape sorts are entirely memory backed. The tuples either never reach disk, or are partially written but never read, instead being served back from the file-systems cache. 
By changing to a tape sort, you are buying insurance against a bunch of other sorts also running simultaneously, but often the insurance doesn't pay off, because those numerous other sorts don't exist and the kernel manages to keep the few that do exist in RAM anyway.

One avenue for major surgery on the sorts would be an admission-control mechanism where jobs can negotiate over memory. Something like allocation by halves, but with the possibility that a job could decide it would rather wait for another sort to finish and its memory to be freed up than make do with what is currently available.

Cheers,

Jeff
On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Anyway, I think the logtape could use redoing. When your tapes are actually physically tape drives, it is necessary to build up runs one after the other on physical tapes, because un-mounting a tape from a tape drive and remounting another tape is not very feasible at scale. Which then means you need to place your runs carefully, because if the final merge finds that two runs it needs are back-to-back on one tape, that is bad. But with disks pretending to be tapes, you could re-arrange the "runs" with just some book keeping. Maintaining the distinction between "tapes" and "runs" is pointless, which means the Fibonacci placement algorithm is pointless as well.
>
> I think you're right. It seems to me that we could just write out an arbitrary number of runs, one per file, ending up with files numbered 1..N.

The problem there is that none of the files can be deleted until it has been entirely read, so you end up with all the data on disk twice. I don't know how often people run their databases so close to the edge on disk space that this matters, but someone felt that that extra storage was worth avoiding.

We could still get rid of the run/tape distinction but keep the block recycling method--they are conceptually distinct things. (But hopefully improve the block recycling so the locality doesn't degrade as much as it seems to now.)

> If we can do a final merge of N runs without overrunning work_mem, fine. If not, we merge the first K runs (for the largest possible K) and write out a new run N+1. The range of valid run numbers is now K+1..N+1. If those can be merged in a single pass, we do so; otherwise we again merge the first K runs (K+1 through 2K) to create a new run N+2. And so on.
>
> I am not clear, however, whether this would be any faster. It may not be worth tinkering with just for the reduction in code complexity.

Yeah, I was thinking it would only be worth redoing if the current implementation interferes with other improvements--which I think is likely, but I don't have any concrete ideas yet of where it would.

...

>>> As a desirable side effect, I think it would mean that we could dispense with retail palloc and pfree altogether. We could just allocate a big chunk of memory, copy tuples into it until it's full, using a pointer to keep track of the next unused byte, and then, after writing the run, reset the allocation pointer back to the beginning of the buffer. That would not only avoid the cost of going through palloc/pfree, but also the memory overhead imposed by bookkeeping and power-of-two rounding.
>>
>> Wouldn't we still need an array of pointers to the start of every tuple's location in the buffer? Or else, how would qsort know where to find them?
>
> Yes, we'd still need that. I don't see any way to get around that; I just don't like the expense of palloc-ing so many little chunks in addition to that array.

Right, OK. I had thought you meant the power-of-two rounding of memtuples; I overlooked the power-of-two rounding of palloc.

>> Also, to do this we would need to get around the 1GB allocation limit. It is bad enough that memtuples is limited to 1GB; it would be much worse if the entire arena was limited to that amount.
>
> Well, we could always allocate multiple 1GB chunks and peel off pieces from each one in turn.
I thought of that with mem_tuples itself, but I think it would be more difficult to do that than just fixing the allocation limit would be. Using multiple chunks might be easier with the buffer space as you don't need to do heap arithmetic on those. >>> If we do want to stick with the current algorithm, there seem to be >>> some newer techniques for cutting down on the heap maintenance >>> overhead. Heikki's been investigating that a bit. >> >> Interesting. Is that investigation around the poor L1/L2 caching >> properties of large heaps? I was wondering if there might be a way to >> give tuplesort an initial estimate of how much data there was to sort, >> so that it could use a smaller amount of memory than the max if it >> decided that that would lead to better caching effects. Once you know >> you can't do an in-memory sort, then there is no reason to use more >> memory than the amount that lets you merge all the runs in one go. > > No, it's about reducing the number of comparisons needed to maintain > the heap property. That sounds very interesting. I didn't know it was even theoretically possible to do that. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Anyway, I think the logtape could use redoing.

> The problem there is that none of the files can be deleted until it was entirely read, so you end up with all the data on disk twice. I don't know how often people run their databases so close to the edge on disk space that this matters, but someone felt that that extra storage was worth avoiding.

Yeah, that was me, and it came out of actual user complaints ten or more years back. (It's actually not 2X growth but more like 4X growth according to the comments in logtape.c, though I no longer remember the exact reasons why.) We knew when we put in the logtape logic that we were trading off speed for space, and we accepted that.

It's possible that with the growth of hard drive sizes, real-world applications would no longer care that much about whether the space required to sort is 4X data size rather than 1X. Or then again, maybe their data has grown just as fast and they still care.

> As a desirable side effect, I think it would mean that we could dispense with retail palloc and pfree altogether. We could just allocate a big chunk of memory, copy tuples into it until it's full, using a pointer to keep track of the next unused byte, and then, after writing the run, reset the allocation pointer back to the beginning of the buffer. That would not only avoid the cost of going through palloc/pfree, but also the memory overhead imposed by bookkeeping and power-of-two rounding.

That would be worthwhile, probably. The power-of-2 rounding in palloc is not nice at all for tuplesort's purposes. We've occasionally talked about inventing a different memory context type that is less willing to waste space ... but on the other hand, such an allocator would run slower, so you're right back to making a speed for space tradeoff. If the tuples could be deallocated in bulk then that catch-22 goes away.

>> No, it's about reducing the number of comparisons needed to maintain the heap property.
>
> That sounds very interesting. I didn't know it was even theoretically possible to do that.

So has somebody found a hole in the n log n lower bound on the cost of comparison-based sorting? I thought that had been proven pretty rigorously.

			regards, tom lane
On 2012-03-18 15:25, Tom Lane wrote: > Jeff Janes<jeff.janes@gmail.com> writes: >> On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas<robertmhaas@gmail.com> wrote: >> The problem there is that none of the files can be deleted until it >> was entirely read, so you end up with all the data on disk twice. I >> don't know how often people run their databases so close to the edge >> on disk space that this matters, but someone felt that that extra >> storage was worth avoiding. > > Yeah, that was me, and it came out of actual user complaints ten or more > years back. (It's actually not 2X growth but more like 4X growth > according to the comments in logtape.c, though I no longer remember the > exact reasons why.) We knew when we put in the logtape logic that we > were trading off speed for space, and we accepted that. How about a runtime check of disk-free? -- Jeremy
Jeremy Harris <jgh@wizmail.org> writes: > On 2012-03-18 15:25, Tom Lane wrote: >> Yeah, that was me, and it came out of actual user complaints ten or more >> years back. (It's actually not 2X growth but more like 4X growth >> according to the comments in logtape.c, though I no longer remember the >> exact reasons why.) We knew when we put in the logtape logic that we >> were trading off speed for space, and we accepted that. > How about a runtime check of disk-free? Not very workable, the foremost reason why not being that it assumes that the current sort is the only thing consuming disk space. I'm not sure there is any portable method of finding out how much disk space is available, anyway. (Think user quotas and such before supposing you know how to do that.) regards, tom lane
On Sun, Mar 18, 2012 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> No, it's about reducing the number of comparisons needed to maintain the heap property.
>
>> That sounds very interesting. I didn't know it was even theoretically possible to do that.
>
> So has somebody found a hole in the n log n lower bound on the cost of comparison-based sorting? I thought that had been proven pretty rigorously.

I think what he's describing is, for example, say you have a heap of 1,000 elements and that lets you do the entire sort in a single pass -- i.e. it generates a single sorted run after the first pass. Increasing the heap size to 2,000 will just mean doing an extra comparison for each tuple to sift up. And increasing the heap size further will just do more and more work for nothing. It's still n log n, but the constant factor is slightly higher. Of course, to optimize this you would need to be able to see the future.

I always thought the same behaviour held if the heap was larger than necessary to do N merge passes; that is, making the heap larger might generate fewer tapes, but still enough to require the same number of passes. However, trying to work out an example, I'm not too sure. If you generate fewer tapes then the heap you use to do the merge is smaller, so it might work out the same (or, more likely, you might come out ahead).

--
greg
On Sun, Mar 18, 2012 at 8:56 AM, Greg Stark <stark@mit.edu> wrote: > On Sun, Mar 18, 2012 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> No, it's about reducing the number of comparisons needed to maintain >>>> the heap property. >> >>> That sounds very interesting. I didn't know it was even theoretically >>> possible to do that. >> >> So has somebody found a hole in the n log n lower bound on the cost of >> comparison-based sorting? I thought that had been proven pretty >> rigorously. > > I think what he's describing is, for example, say you have a heap of > 1,000 elements and that lets you do the entire sort in a single pass > -- i.e. it generates a single sorted run after the first pass. > Increasing the heap size to 2,000 will just mean doing an extra > comparison for each tuple to sift up. And increasing the heap size > further will just do more and more work for nothing. It's still nlogn > but the constant factor is slightly higher. Of course to optimize this > you would need to be able to see the future. > > I always thought the same behaviour held for if the heap was larger > than necessary to do N merge passes. that is, making the heap larger > might generate fewer tapes but the still enough to require the same > number of passes. However trying to work out an example I'm not too > sure. If you generate fewer tapes then the heap you use to do the > merge is smaller so it might work out the same (or more likely, you > might come out ahead). Except for rounding effects, it does come out the same. Every extra layer on the tuple-heap during the initial run building causes a reduced layer on the tape-heap during the merge. So they wash. I haven't analyzed the exact rounding effects in detail, but by just instrumenting the code I found a huge tuple-heap with a two-tape merge at the end used less than one percent more comparisons than, but was >40% slower than, a lower-memory sort which used a modest sized tuple-heap followed by a modest-size tape-heap merge. My conclusion is that it isn't the number of comparisons that drove the difference, but the number of on-chip cache misses. Two modest sized heaps are more cache friendly than one giant heap and one tiny heap. Of course if the data is partially sorted in a way that is apparent over large ranges but not short ranges, the larger initial heap will capture that during the initial run construction while the smaller one will not. Cheers, Jeff
On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So has somebody found a hole in the n log n lower bound on the cost of > comparison-based sorting? I thought that had been proven pretty > rigorously. There's not much danger of anyone finding a way around that bound since the proof is quite trivial, but keep in mind that it's a worst-case bound. For example, if you have reason to believe that the input is likely to already be sorted, you can check for that case using just n-1 comparisons. If it turns out you're right, you can "sort" the data in linear time. Heck, you can also check for reverse-sorted input at the same time, if you like, and handle that special case too. Of course, when the data is unordered, such special-case checks hurt rather than help, even though it's still O(n lg n) overall; those pesky constant factors do matter quite a bit. Still, sometimes it's worth it: our quicksort algorithm checks for sorted input on each recursive invocation, since quicksort degrades rather badly in that scenario; but our heapsort doesn't contain a similar check, probably because heapsort typically works fine in that scenario. One thing that seems inefficient to me about our current algorithm is the use of the run number as a leading column in the sort key. There's no real reason why the tuples destined for the next run need to be maintained in heap order; we could just store them unordered and heapify the whole lot of them when it's time to start the next run. That ought to save comparison cycles for the current run, since the heap will be less deep, and there are other possible savings as well - in particular, an unordered array of tuples can be heapify'd in linear time, whereas our current algorithm is O(n lg n). However, when I actually implemented this, I found that doing it this way was a loser. In fact, excluding the next-run tuples from the heap for the current run was a BIG loser even before you add in the time required to re-heapify. This confused the daylights out of me for a while, because it's hard to understand how insert and siftup can be slower on a small heap than a large one. My working theory is that, even though we must necessarily do more comparisons when manipulating a larger heap, many of those comparisons are resolved by comparing run numbers, and that's a lot cheaper than invoking the real comparator. For example, if we insert into a heap whose rightmost child is in the next run, and the new tuple goes into the current run, and the current size of the heap is such that the initial position of the new element is a descendent of the right node, it will very quickly crash through all of the next-run tuples and only need one REAL comparison - against the root. Either the new element ends up as the root, or it ends up as the direct child of the root; now we remove the root and, perhaps, replace it with its rightmost child, meaning that the next element we read in can do the exact same thing all over again. If we exclude the next-run elements from the heap, then the average depth of the heap when inserting a new element is smaller, but all the elements are in the same run, and we have to invoke the real comparator every time. In other words, those next-run tuples act as dummies which effectively create a heap of uneven depth, and the average depth encountered when inserting tuples turns out to be less than what we get by pulling out the dummies and making the depth uniform. 
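For readers following along, the "run number as a leading column" behaviour described above comes down to a comparison of the following shape (a sketch with illustrative names; tuplesort.c's HEAPCOMPARE macro and SortTuple.tupindex field work along these lines). When the run numbers differ, a single integer test settles the order and the expensive datatype comparator is never reached, which is what makes the next-run "dummies" so cheap to crash through.

typedef struct SortTupleSketch
{
    void   *tuple;              /* the tuple proper */
    int     tupindex;           /* run number this tuple is destined for */
} SortTupleSketch;

/*
 * Hypothetical heap comparator: run number first, real comparator only
 * on a tie.  With next-run dummies in the heap, many sift steps are
 * decided by the integer test alone.
 */
static int
heap_compare_sketch(const SortTupleSketch *a, const SortTupleSketch *b,
                    int (*real_cmp) (const void *, const void *))
{
    if (a->tupindex != b->tupindex)
        return (a->tupindex < b->tupindex) ? -1 : 1;
    return real_cmp(a->tuple, b->tuple);
}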
This makes me kind of nervous, because I don't understand why things should work out so well as all that. If throwing some dummy elements into the heap makes things work better, then maybe we should throw in some more. Or maybe it would work better to take some but not all of them out. There certainly doesn't seem to be any indication in the code that this is an anticipated effect, and it seems an awfully providential accident. It may (or may not) be related to the effect that Greg Stark mentions in a nearby post: increasing work_mem makes the heap bigger, so heap maintenance gets more expensive, sometimes without any corresponding benefit. For example, if the input happens to already be sorted, then inserting into the heap is cheap, because each new element is already greater than its parent, and life is good. But removing from the heap is painfully expensive, because to remove an element from the heap we stick the last element in the heap into the hole left by the departing element and then sift it down. However, that requires 2*heap_depth comparisons, all of which involve really invoking the comparison function since everything's going to end up going into a single run, and that gets more and more expensive as work_mem increases. So a heap sort of already-sorted data seems to get slower and slower the more memory you give it to work with. For example, a sort of 5 million random strings of length 30, all in shared_buffers, takes 22.03 seconds on my MacBook Pro with work_mem = 128MB, but just 17.08 seconds with work_mem = 1MB. It turns out that it's possible to reduce the number of comparisons required to reinsert the last heap element from 2*heap_depth to heap_depth+lg(heap_depth). At each node, compare the two children, and then let the current node be the lesser of those. When you reach a leaf node, you've walked a path of length heap_depth which is known to be sorted (since we've got a heap), so you can find the correct reinsertion position by binary searching that path. In the case of the sorted data mentioned above, this reduces the time to 19.45 seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB. However, in other cases, it seems not to help at all, or even be a slight regression. An obvious weakness of this algorithm is that the linear search we use now can terminate early, if it turns out that the element we're reinserting doesn't need to percolate downward very far, whereas this can't: so the 2*heap_depth bound is worst-case, but the heap_depth+lg(heap_depth) bound is exact. The problem of some comparisons being much cheaper than others also acts to muddy the waters - if there's just one element in the heap destined for the next run, reinserting it will be only about half as expensive, since we'll compare the descendents of each node to each other, but the comparison against the next-run element will be very cheap, and if all the tuples are the same size we may end up reinserting that particular node very frequently. Heikki wrote a patch implementing another "smart" reinsertion algorithm (what we call tuplesort_heap_siftup), which I won't attempt to explain, but which seems to work even a little bit better than the one mentioned above; it seems to fall prey to some of the same issues - there are cases where it helps, but there are also cases where it doesn't do much or even regresses slightly, and I don't think either of us understand exactly how to tickle the various cases, let alone what causes them. 
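A minimal sketch of the heap_depth + lg(heap_depth) reinsertion described above, written against a plain int min-heap rather than SortTuples (illustrative only, not the actual patch): one comparison per level walks the hole down to a leaf along the smaller child, and a binary search over that recorded, already-sorted path places the displaced last element.

/*
 * Remove and return the minimum of the binary min-heap a[0..*n-1],
 * restoring the heap with about depth + lg(depth) comparisons instead
 * of the usual 2 * depth.
 */
static int
heap_delete_min_bsearch(int *a, int *n)
{
    int     result = a[0];
    int     x = a[--(*n)];      /* last element, to be reinserted */
    int     path[64];           /* hole positions; 64 levels is plenty */
    int     depth = 0;
    int     pos = 0;
    int     lo,
            hi;

    if (*n == 0)
        return result;

    /* Phase 1: follow the smaller child from the root down to a leaf. */
    for (;;)
    {
        int     child = 2 * pos + 1;

        if (child >= *n)
            break;
        if (child + 1 < *n && a[child + 1] < a[child])
            child++;
        path[depth++] = child;
        pos = child;
    }

    /*
     * Phase 2: the values along the path are sorted ascending (heap
     * property), so binary-search for the first one that is >= x.
     */
    lo = 0;
    hi = depth;
    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;

        if (a[path[mid]] < x)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Phase 3: shift the smaller prefix of the path up one level. */
    {
        int     hole = 0;
        int     i;

        for (i = 0; i < lo; i++)
        {
            a[hole] = a[path[i]];
            hole = path[i];
        }
        a[hole] = x;
    }
    return result;
}

As noted above, the comparison count here is exact, whereas the ordinary siftdown's 2 * depth is only a worst case, which presumably is why the win is uneven.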
What does seem clear is that the great bulk of the cost of a heap sort seems to be attributable to tuplesort_heap_siftup, and if we can find something that reduces the number of comparisons required for that operation we will get a more-or-less proportional speedup in heap-sorting overall. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: > There's no real reason why the tuples destined for the next run need > to be maintained in heap order; we could just store them unordered and > heapify the whole lot of them when it's time to start the next run. This sounded familiar.... http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e -- greg
Greg Stark <stark@mit.edu> writes: > On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> There's no real reason why the tuples destined for the next run need >> to be maintained in heap order; we could just store them unordered and >> heapify the whole lot of them when it's time to start the next run. > This sounded familiar.... > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e Yeah, see also the pgsql-hackers thread starting here: http://archives.postgresql.org/pgsql-hackers/1999-10/msg00384.php That was a long time ago, of course, but I have some vague recollection that keeping next-run tuples in the current heap achieves a net savings in the total number of comparisons needed to heapify both runs. Robert's point about integer comparisons being faster than data comparisons may or may not be relevant. Certainly they are faster, but there are never very many run numbers in the heap at once (possibly no more than 2, I forget; and in any case often only 1). So I'd expect most tuple comparisons to end up having to do a data comparison anyway. regards, tom lane
On Tue, Mar 20, 2012 at 1:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > That was a long time ago, of course, but I have some vague recollection > that keeping next-run tuples in the current heap achieves a net savings > in the total number of comparisons needed to heapify both runs. Offhand I wonder if this is all because we don't have the O(n) heapify implemented. -- greg
Greg Stark <stark@mit.edu> writes: > Offhand I wonder if this is all because we don't have the O(n) heapify > implemented. Robert muttered something about that before, but is it real? If you could do that, I'd think you'd have a less-than-n-log-n sorting solution. regards, tom lane
On Tue, Mar 20, 2012 at 6:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@mit.edu> writes: >> Offhand I wonder if this is all because we don't have the O(n) heapify >> implemented. I think we do already have it implemented. 1/2 the time the tuple stays where it is after one comparison, 1/4 it moves up one level with two comparisons, 1/8 it moves up two levels with three comparisons, etc. That series sums up to a constant. Maybe there is a worst-case that makes this fall apart, though. Heapifying something which is already reverse sorted, maybe? > Robert muttered something about that before, but is it real? If you > could do that, I'd think you'd have a less-than-n-log-n sorting > solution. Turning random tuples into a heap can be linear. Extracting them while maintaining the heap is N log N, though. You can't sort without the extraction step, so the law is preserved. Cheers, Jeff
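For reference, the construction Jeff is describing is the classic bottom-up heapify; a sketch on a plain int array (siftdown here is the ordinary two-comparisons-per-level version):

/* Sift a[i] down within the min-heap a[0..n-1]. */
static void
sift_down(int *a, int n, int i)
{
    for (;;)
    {
        int     child = 2 * i + 1;
        int     tmp;

        if (child >= n)
            break;
        if (child + 1 < n && a[child + 1] < a[child])
            child++;
        if (a[i] <= a[child])
            break;
        tmp = a[i];
        a[i] = a[child];
        a[child] = tmp;
        i = child;
    }
}

/*
 * Bottom-up heap construction: sift down each internal node, starting
 * from the last one.  Half the nodes are leaves and cost nothing, a
 * quarter sift at most one level, and so on, so the total is O(n).
 */
static void
heapify(int *a, int n)
{
    int     i;

    for (i = n / 2 - 1; i >= 0; i--)
        sift_down(a, n, i);
}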
Greg Stark <stark@mit.edu> writes: > http://en.wikipedia.org/wiki/Binary_heap#Building_a_heap Interesting. I'm pretty sure that idea appears nowhere in Knuth (which might mean it's new enough to have a live patent on it ... anybody know who invented this?). But it seems like that should buy back enough comparisons to justify leaving the next-run tuples out of the heap (unordered) until the heap becomes empty. You still want to insert new tuples into the heap if they can go to the current run, of course. regards, tom lane
On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark <stark@mit.edu> wrote: > On Tue, Mar 20, 2012 at 1:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That was a long time ago, of course, but I have some vague recollection >> that keeping next-run tuples in the current heap achieves a net savings >> in the total number of comparisons needed to heapify both runs. > > Offhand I wonder if this is all because we don't have the O(n) heapify > implemented. I'm pretty sure that's not the problem. Even though our heapify is not as efficient as it could be, it's plenty fast enough. I thought about writing a patch to implement the better algorithm, but it seems like a distraction at this point because the heapify step is such a small contributor to overall sort time. What's taking all the time is the repeated siftup operations as we pop things out of the heap. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 20, 2012 at 12:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@mit.edu> writes: >> http://en.wikipedia.org/wiki/Binary_heap#Building_a_heap > > Interesting. I'm pretty sure that idea appears nowhere in Knuth > (which might mean it's new enough to have a live patent on it ... > anybody know who invented this?). It's in every introductory algorithms textbook; I'd be shocked if anyone could make an IP claim on it. > But it seems like that should buy > back enough comparisons to justify leaving the next-run tuples out of > the heap (unordered) until the heap becomes empty. You still want to > insert new tuples into the heap if they can go to the current run, of > course. It seems like it should, but if you read (or reread) my long boring analysis upthread, you'll learn that it doesn't. It's slower even if the cost of building a heap is zero. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark <stark@mit.edu> wrote: >> Offhand I wonder if this is all because we don't have the O(n) heapify >> implemented. > I'm pretty sure that's not the problem. Even though our heapify is > not as efficient as it could be, it's plenty fast enough. I thought > about writing a patch to implement the better algorithm, but it seems > like a distraction at this point because the heapify step is such a > small contributor to overall sort time. What's taking all the time is > the repeated siftup operations as we pop things out of the heap. Right, but wouldn't getting rid of the run-number comparisons provide some marginal improvement in the speed of tuplesort_heap_siftup? BTW, there's a link at the bottom of the wikipedia page to a very interesting ACM Queue article, which argues that the binary-tree data structure isn't terribly well suited to virtual memory because it touches random locations in succession. I'm not sure I believe his particular solution, but I'm wondering about B+ trees, ie more than 2 children per node. regards, tom lane
On Tue, Mar 20, 2012 at 12:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark <stark@mit.edu> wrote: >>> Offhand I wonder if this is all because we don't have the O(n) heapify >>> implemented. > >> I'm pretty sure that's not the problem. Even though our heapify is >> not as efficient as it could be, it's plenty fast enough. I thought >> about writing a patch to implement the better algorithm, but it seems >> like a distraction at this point because the heapify step is such a >> small contributor to overall sort time. What's taking all the time is >> the repeated siftup operations as we pop things out of the heap. > > Right, but wouldn't getting rid of the run-number comparisons provide > some marginal improvement in the speed of tuplesort_heap_siftup? No. It does the opposite: it slows it down. This is a highly surprising result but it's quite repeatable: removing comparisons makes it slower. As previously pontificated, I think this is probably because the heap can fill up with next-run tuples that are cheap to compare against, and that spares us having to do "real" comparisons involving the actual datatype comparators. > BTW, there's a link at the bottom of the wikipedia page to a very > interesting ACM Queue article, which argues that the binary-tree > data structure isn't terribly well suited to virtual memory because > it touches random locations in succession. I'm not sure I believe > his particular solution, but I'm wondering about B+ trees, ie more > than 2 children per node. I don't think virtual memory locality is the problem. I read somewhere that a ternary heap is supposed to be about one-eighth faster than a binary heap, but that's because picking the smallest of three tuples requires two comparisons, whereas picking the smallest of four tuples requires three comparisons, which is better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 20, 2012 at 5:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > No. It does the opposite: it slows it down. This is a highly > surprising result but it's quite repeatable: removing comparisons > makes it slower. As previously pontificated, I think this is probably > because the heap can fill up with next-run tuples that are cheap to > compare against, and that spares us having to do "real" comparisons > involving the actual datatype comparators. Frankly that analysis didn't make any sense to me at the time. Comparing integers is fast, sure, but it's still slower than not having to do any comparison at all. It's just barely conceivable that it's a cache effect on the *code* but even that seems pretty darned unlikely to me. My money currently is on a measurement error but that's just a guess since I haven't tried to replicate any of your results. > >> BTW, there's a link at the bottom of the wikipedia page to a very >> interesting ACM Queue article, which argues that the binary-tree >> data structure isn't terribly well suited to virtual memory because >> it touches random locations in succession. I'm not sure I believe >> his particular solution, but I'm wondering about B+ trees, ie more >> than 2 children per node. > > I don't think virtual memory locality is the problem. I read > somewhere that a ternary heap is supposed to be about one-eighth > faster than a binary heap, but that's because picking the smallest of > three tuples requires two comparisons, whereas picking the smallest of > four tuples requires three comparisons, which is better. The things I was reading suggested 4-heap was more cache efficient which is a heap with 4-children per node and still representable as a flat array. There's also Fibonacci heap which makes insertions really fast but since we're doing exactly as many insertions as deletions the question is how much work it adds to deletions. It's still O(logn) amortized but it may be 2x the constant factor. Fwiw I think more interesting than improving tapesort would be implementing wholly different algorithms like radix sort or the repeated quicksort. Being able to handle different algorithms that require a different API would be the first step to being able to handle parallel algorithms using threads or GPUs. I also wonder if disabling the space reuse and just generating separate files for each run might not be a win on exotic filesystems like ZFS or WAFL where overwriting blocks is more expensive than allocating new blocks. -- greg
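To put some arithmetic behind the 3-heap/4-heap discussion, here is a sketch of siftdown in a d-ary min-heap stored as a flat array (illustrative, not proposed code). Each level costs at most d comparisons (d - 1 to pick the smallest child plus one against the sifted element) over log_d(n) levels, i.e. roughly (d / lg d) * lg n in total: 2.00 lg n for d = 2, about 1.89 lg n for d = 3, and 2.00 lg n again for d = 4, so any 4-heap advantage would indeed have to come from cache behaviour rather than comparison count.

/* Siftdown in a d-ary min-heap; children of node i are d*i+1 .. d*i+d. */
static void
dary_sift_down(int *a, int n, int d, int i)
{
    for (;;)
    {
        int     first = d * i + 1;  /* first child of node i */
        int     best;
        int     c;
        int     tmp;

        if (first >= n)
            break;
        best = first;
        for (c = first + 1; c < first + d && c < n; c++)
            if (a[c] < a[best])
                best = c;           /* d - 1 comparisons at most */
        if (a[i] <= a[best])        /* plus one against the sifted item */
            break;
        tmp = a[i];
        a[i] = a[best];
        a[best] = tmp;
        i = best;
    }
}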
On Tue, Mar 20, 2012 at 3:41 PM, Greg Stark <stark@mit.edu> wrote: > On Tue, Mar 20, 2012 at 5:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> No. It does the opposite: it slows it down. This is a highly >> surprising result but it's quite repeatable: removing comparisons >> makes it slower. As previously pontificated, I think this is probably >> because the heap can fill up with next-run tuples that are cheap to >> compare against, and that spares us having to do "real" comparisons >> involving the actual datatype comparators. > > Frankly that analysis didn't make any sense to me at the time. > Comparing integers is fast, sure, but it's still slower than not > having to do any comparison at all. I think you're underestimating how much it costs to call the datatype-specific comparator. My belief is that it's wicked expensive. The COMPARETUP() macro extracts a function pointer from the Tuplesortstate and calls it; we end up in comparetup_heap, which calls ApplySortComparator(), which pulls the comparator function out of the state and then calls that. Since I was sorting strings, which have no sortsupport, we then end up in comparison_shim(), which uses the FunctionCallInvoke method to extract the actual function pointer and jump into bttextcmp(), which unpacks its arguments and then calls text_cmp(), which unpacks its arguments some more and then calls varstr_cmp() where the actual work happens. That's not trivial either - we have to call lc_collate_is_c() and then memcmp(). I have no problem believing that 6 levels of function calls, 3 of which involve jumps through function pointers, followed by lc_collate_is_c() and memcmp() is 100x or more as expensive as the lone integer comparison that happens when the tupindex values don't match - that's a single instruction. > Fwiw I think more interesting than improving tapesort would be > implementing wholly different algorithms like radix sort or the > repeated quicksort. Being able to handle different algorithms that > require a different API would be the first step to being able to > handle parallel algorithms using threads or GPUs. Yeah, I think I'm going to try implementing the quicksort-the-whole-batch-and-dump-it-out-as-a-run algorithm, just to see how good or bad that is compared to what we have now. We may not end up doing anything that remotely resembles that, in the end, but I want to see the numbers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
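The quicksort-the-whole-batch-and-dump-it approach Robert proposes to measure amounts to a loop of the following shape (fill_memtuples, dump_run and cmp_slot are placeholders, not real tuplesort functions):

#include <stdlib.h>

/* Placeholders standing in for tuplesort's real machinery. */
extern int  fill_memtuples(void **memtuples, int max_tuples);
extern void dump_run(void **memtuples, int ntuples);
extern int  cmp_slot(const void *a, const void *b);     /* gets void ** */

/*
 * Hypothetical run builder: instead of replacement selection, fill the
 * in-memory array, qsort it, write it out as one run, and repeat until
 * the input is exhausted.  On random input the runs come out about
 * half as long as replacement selection would produce, so the merge
 * phase sees more of them.
 */
static void
build_runs_by_quicksort(void **memtuples, int max_tuples)
{
    for (;;)
    {
        int     ntuples = fill_memtuples(memtuples, max_tuples);

        if (ntuples == 0)
            break;
        qsort(memtuples, ntuples, sizeof(void *), cmp_slot);
        dump_run(memtuples, ntuples);
    }
}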
On 3/18/12 10:25 AM, Tom Lane wrote: > Jeff Janes<jeff.janes@gmail.com> writes: >> > On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas<robertmhaas@gmail.com> wrote: >>> >> On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes<jeff.janes@gmail.com> wrote: >>>> >>> Anyway, I think the logtape could use redoing. >> > The problem there is that none of the files can be deleted until it >> > was entirely read, so you end up with all the data on disk twice. I >> > don't know how often people run their databases so close to the edge >> > on disk space that this matters, but someone felt that that extra >> > storage was worth avoiding. > Yeah, that was me, and it came out of actual user complaints ten or more > years back. (It's actually not 2X growth but more like 4X growth > according to the comments in logtape.c, though I no longer remember the > exact reasons why.) We knew when we put in the logtape logic that we > were trading off speed for space, and we accepted that. It's possible > that with the growth of hard drive sizes, real-world applications would > no longer care that much about whether the space required to sort is 4X > data size rather than 1X. Or then again, maybe their data has grown > just as fast and they still care. > I believe the case of tape sorts that fit entirely in filesystem cache is a big one as well... doubling or worse the amount of data that needed to live "on disk" at once would likely suck in that case. Also, it's not uncommon to be IO-bound on a database server... so even if we're not worried about storing everything 2 or more times from a disk space standpoint, we should be concerned about the IO bandwidth. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Robert Haas <robertmhaas@gmail.com> writes: > Yeah, I think I'm going to try implementing > quicksort-the-whole-batch-and-dump-it-out-as-a-run algorithm, just to > see how good or bad that is compared to what we have now. We may not > end up doing anything that remotely resembles that, in the end, but I > want to see the numbers. The reason replacement selection is attractive is that the initial runs tend to be about twice as long as you can get if you just sort memory-loads independently. (And that's for random input, the win can be a lot more on partially ordered data.) It seems unlikely that you will get enough win from quicksorting to overcome that; especially not if your thesis is correct that all the cost is coming from comparison infrastructure. But don't let me discourage you from measuring it ... regards, tom lane
On Tue, Mar 20, 2012 at 8:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Frankly that analysis didn't make any sense to me at the time. >> Comparing integers is fast, sure, but it's still slower than not >> having to do any comparison at all. > > I think you're underestimating how much it costs to call the > datatype-specific comparator. My belief is that it's wicked > expensive. I'm totally with you on the datatype-specific comparator being expensive. But we must be talking about two different scenarios. I don't see why Tom's algorithm was slower than Knuth's unless there was a bug. It seems to me it should perform exactly as many comparator calls but save the integer comparisons and the extra space for them. In the current algorithm, Knuth's, it compares the new tuple against the most recently emitted tuple to set the run number then adds it to the bottom of the heap and sifts up. If it's from the current run it does a bunch of integer comparisons and skips past the next run quickly. If it's from the next run it sifts up to the right spot in the next run and if it hits the top of the next run it does a quick integer comparison and stops. In Tom's algorithm it would perform the same comparison against the recently emitted tuple to set the run number and either add it to the unsorted list or the bottom of the heap. If it's added to the unsorted list we're done, if it's added to the bottom of the heap it performs the same siftup it would have done above except that it skips the bottom few levels of the heap -- all of which were fast integer comparisons. They might be fast but they can't be faster than doing nothing at all. When the heap is empty then we do a heapify which currently would be exactly the same O(nlogn) comparisons that we do maintaining the bottom of the heap but with a O(n) heapify would be fewer. -- greg
On Tue, Mar 20, 2012 at 6:16 PM, Greg Stark <stark@mit.edu> wrote: > On Tue, Mar 20, 2012 at 8:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Frankly that analysis didn't make any sense to me at the time. >>> Comparing integers is fast, sure, but it's still slower than not >>> having to do any comparison at all. >> >> I think you're underestimating how much it costs to call the >> datatype-specific comparator. My belief is that it's wicked >> expensive. > > I'm totally with you on the datatype-specific comparator being expensive. > > But we must be talking about two different scenarios. I don't see why > Tom's algorithm was slower than Knuth's unless there was a bug. It > seems to me it should perform exactly as many comparator calls but > save the integer comparisons and the extra space for them. It seemed that way to me, too, but it wasn't. > In the current algorithm, Knuth's, it compares the new tuple against > the most recently emitted tuple to set the run number then adds it to > the bottom of the heap and sifts up. If it's from the current run it > does a bunch of integer comparisons and skips past the next run > quickly. If it's from the next run it sifts up to the right spot in > the next run and if it hits the top of the next run it does a quick > integer comparison and stops. Check. > In Tom's algorithm it would perform the same comparison against the > recently emitted tuple to set the run number and either add it to the > unsorted list or the bottom of the heap. If it's added to the unsorted > list we're done, Check. > if it's added to the bottom of the heap it performs > the same siftup it would have done above except that it skips the > bottom few levels of the heap -- all of which were fast integer > comparisons. I think this is where the wheels come off. It's excessively optimistic to suppose that we're going to skip the "bottom few levels" of the heap. Half of the elements in the heap have no children at all; and half of the remainder are parents but not grand-parents. If you assume, at a given point in time, that half of the tuples we're holding onto are for the next run, then the heap is ONE level shallower than it would otherwise have been, and you figure to save ONE integer comparison. If we're relatively early in the run and only one-tenth of the tuples we're holding onto are for the next run, then heap is barely shallower at all and we're scarcely avoiding anything. But having even a small number of next-run tuples scattered through the heap gives us the opportunity to "get lucky". If the heap size is N, even lg N next-run tuples in the heap is potentially enough for us to avoid most of the calls to a real comparator, if it so happens that all of those tuples are our ancestors. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 19, 2012 at 12:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: ... > > One thing that seems inefficient to me about our current algorithm is > the use of the run number as a leading column in the sort key. > There's no real reason why the tuples destined for the next run need > to be maintained in heap order; we could just store them unordered and > heapify the whole lot of them when it's time to start the next run. > That ought to save comparison cycles for the current run, since the > heap will be less deep, and there are other possible savings as well - > in particular, an unordered array of tuples can be heapify'd in linear > time, whereas our current algorithm is O(n lg n). However, when I > actually implemented this, I found that doing it this way was a loser. > In fact, excluding the next-run tuples from the heap for the current > run was a BIG loser even before you add in the time required to > re-heapify. This confused the daylights out of me for a while, > because it's hard to understand how insert and siftup can be slower on > a small heap than a large one. > > My working theory is that, even though we must necessarily do more > comparisons when manipulating a larger heap, many of those comparisons > are resolved by comparing run numbers, and that's a lot cheaper than > invoking the real comparator. For example, if we insert into a heap > whose rightmost child is in the next run, and the new tuple goes into > the current run, and the current size of the heap is such that the > initial position of the new element is a descendent of the right node, > it will very quickly crash through all of the next-run tuples and only > need one REAL comparison - against the root. Either the new element > ends up as the root, or it ends up as the direct child of the root; > now we remove the root and, perhaps, replace it with its rightmost > child, meaning that the next element we read in can do the exact same > thing all over again. If we exclude the next-run elements from the > heap, then the average depth of the heap when inserting a new element > is smaller, but all the elements are in the same-run, and we have to > invoke the real comparator every time. In other words, those next-run > tuples act as dummies which effectively create a heap of uneven depth, > and the average depth encountered when inserting tuples turns out to > be less than what we get by pulling out the dummies and making the > depth uniform. > > This makes me kind of nervous, because I don't understand why things > should work out so well as all that. If throwing some dummy elements > into the heap makes things work better, then maybe we should throw in > some more. Or maybe it would work better to take some but not all of > them out. There certainly doesn't seem to be any indication in the > code that this is an anticipated effect, and it seems an awfully > providential accident. Is there a patch or a git repo available for this change? I'd like to see if I can analyze that if I get some spare time. ... > It turns out that it's possible to reduce the number of comparisons > required to reinsert the last heap element from 2*heap_depth to > heap_depth+lg(heap_depth). At each node, compare the two children, > and then let the current node be the lesser of those. When you reach > a leaf node, you've walked a path of length heap_depth which is known > to be sorted (since we've got a heap), so you can find the correct > reinsertion position by binary searching that path. 
In the case of > the sorted data mentioned above, this reduces the time to 19.45 > seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB. > However, in other cases, it seems not to help at all, or even be a > slight regression. Clever. Rather than doing a binary search of the path, it might make sense to do a reverse search of it. The tuple is likely to end up somewhere at the bottom of the heap, both because that is where most of the data is, and because the tuple being reinserted just came from the bottom so it is likely to be biased that way. Cheers, Jeff
On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote: > Yeah, that was me, and it came out of actual user complaints ten or more > years back. (It's actually not 2X growth but more like 4X growth > according to the comments in logtape.c, though I no longer remember the > exact reasons why.) We knew when we put in the logtape logic that we > were trading off speed for space, and we accepted that. I skimmed through TAOCP, and I didn't find the 4X number you are referring to, and I can't think what would cause that, either. The exact wording in the comment in logtape.c is "4X the actual data volume", so maybe that's just referring to per-tuple overhead? However, I also noticed that section 5.4.4 (Vol 3 p299) starts discussing the idea of running the tapes backwards and forwards. That doesn't directly apply, because a disk seek is cheaper than rewinding a tape, but perhaps it could be adapted to get the block-freeing behavior we want. The comments in logtape.c say: "Few OSes allow arbitrary parts of a file to be released back to the OS, so we have to implement this space-recycling ourselves within a single logical file." But if we alternate between reading in forward and reverse order, we can make all of the deallocations at the end of the file, and then just truncate to free space. I would think that the OS could be more intelligent about block allocations and deallocations to avoid too much fragmentation, and it would probably be a net reduction in code complexity. Again, the comments in logtape.c have something to say about it: "...but the seeking involved should be comparable to what would happen if we kept each logical tape in a separate file" But I don't understand why that is the case. On another topic, quite a while ago Len Shapiro (PSU professor) suggested to me that we implement Algorithm F (Vol 3 p321). Right now, as tuplesort.c says: "In the current code we determine the number of tapes M on the basis of workMem: we want workMem/M to be large enough that we read a fair amount of data each time we preread from a tape" But with forecasting, we can be a little smarter about which tapes we preread from if the data in the runs is not random. That means we could potentially merge more runs at once with the same work_mem without sacrificing adequate buffers for prefetching. I'm not sure whether this is a practical problem today, and I'm also not sure what to do if we start merging a lot more runs and then determine that forecasting doesn't work as well as we'd hoped (e.g. that the data in the runs really is random). But I thought it was worth mentioning. Regards, Jeff Davis
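A minimal sketch of the truncate-to-free-space idea, assuming the merge has been arranged so that every just-consumed block sits at the tail of the temp file (the fd and block-count arguments here are stand-ins, not logtape.c's actual interfaces):

#include <sys/types.h>
#include <unistd.h>

/*
 * If reading runs alternately forward and backward leaves the freed
 * blocks contiguous at the end of the file, the space can be handed
 * back to the OS with a plain truncate rather than logtape.c's own
 * block-recycling machinery.  Returns 0 on success, -1 on error.
 */
static int
release_tail_blocks(int fd, long nblocks_remaining, long block_size)
{
    off_t   new_size = (off_t) nblocks_remaining * block_size;

    return ftruncate(fd, new_size);
}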
Jeff Davis <pgsql@j-davis.com> writes: > On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote: >> Yeah, that was me, and it came out of actual user complaints ten or more >> years back. (It's actually not 2X growth but more like 4X growth >> according to the comments in logtape.c, though I no longer remember the >> exact reasons why.) We knew when we put in the logtape logic that we >> were trading off speed for space, and we accepted that. > I skimmed through TAOCP, and I didn't find the 4X number you are > referring to, and I can't think what would cause that, either. The exact > wording in the comment in logtape.c is "4X the actual data volume", so > maybe that's just referring to per-tuple overhead? My recollection is that that was an empirical measurement using the previous generation of code. It's got nothing to do with per-tuple overhead IIRC, but with the fact that the same tuple can be sitting on multiple "tapes" during a polyphase merge, because some of the tapes can be lying fallow waiting for future use --- but data on them is still taking up space, if you do nothing to recycle it. The argument in the comment shows why 2X is the minimum space growth for a plain merge algorithm, but that's only a minimum. regards, tom lane
On Wed, Mar 21, 2012 at 8:57 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Mon, Mar 19, 2012 at 12:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: > ... >> >> One thing that seems inefficient to me about our current algorithm is >> the use of the run number as a leading column in the sort key. >> There's no real reason why the tuples destined for the next run need >> to be maintained in heap order; we could just store them unordered and >> heapify the whole lot of them when it's time to start the next run. >> That ought to save comparison cycles for the current run, since the >> heap will be less deep, and there are other possible savings as well - >> in particular, an unordered array of tuples can be heapify'd in linear >> time, whereas our current algorithm is O(n lg n). However, when I >> actually implemented this, I found that doing it this way was a loser. >> In fact, excluding the next-run tuples from the heap for the current >> run was a BIG loser even before you add in the time required to >> re-heapify. This confused the daylights out of me for a while, >> because it's hard to understand how insert and siftup can be slower on >> a small heap than a large one. >> >> My working theory is that, even though we must necessarily do more >> comparisons when manipulating a larger heap, many of those comparisons >> are resolved by comparing run numbers, and that's a lot cheaper than >> invoking the real comparator. For example, if we insert into a heap >> whose rightmost child is in the next run, and the new tuple goes into >> the current run, and the current size of the heap is such that the >> initial position of the new element is a descendent of the right node, >> it will very quickly crash through all of the next-run tuples and only >> need one REAL comparison - against the root. Either the new element >> ends up as the root, or it ends up as the direct child of the root; >> now we remove the root and, perhaps, replace it with its rightmost >> child, meaning that the next element we read in can do the exact same >> thing all over again. If we exclude the next-run elements from the >> heap, then the average depth of the heap when inserting a new element >> is smaller, but all the elements are in the same-run, and we have to >> invoke the real comparator every time. In other words, those next-run >> tuples act as dummies which effectively create a heap of uneven depth, >> and the average depth encountered when inserting tuples turns out to >> be less than what we get by pulling out the dummies and making the >> depth uniform. >> >> This makes me kind of nervous, because I don't understand why things >> should work out so well as all that. If throwing some dummy elements >> into the heap makes things work better, then maybe we should throw in >> some more. Or maybe it would work better to take some but not all of >> them out. There certainly doesn't seem to be any indication in the >> code that this is an anticipated effect, and it seems an awfully >> providential accident. > > Is there a patch or a git repo available for this change? I'd like to > see if I can analyze that if I get some spare time. Sorry for the slow response to this. Patch is attached (crummy-external-sort.patch). > Clever. Rather than doing a binary search of the path, it might make > sense to do a reverse search of it. 
The tuple is likely to send up > somewhere at the bottom of the heap, both because that is where most > of the data is, and because the tuple being reinserted just came from > the bottom so it is likely to be biased that way. Patches for these approaches also attached (siftup-using-bsearch.patch and siftup-reverse-linear.patch). Neither approach seemed terribly promising in my testing, IIRC. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So has somebody found a hole in the n log n lower bound on the cost of >> comparison-based sorting? I thought that had been proven pretty >> rigorously. > > There's not much danger of anyone finding a way around that bound > since the proof is quite trivial, but keep in mind that it's a > worst-case bound. Fwiw it also only holds for comparison sorts. If you can sort your items without actually comparing each item with the others then you aren't bound by it at all. Notably algorithms like radix sort and direct sort aren't bound by it and are O(n). I'm hoping to look at trying some of these for integer sorts when they apply using the new sort specialization code Peter added. -- greg
On 13 April 2012 16:03, Greg Stark <stark@mit.edu> wrote: > Fwiw it also only holds for comparison sorts. If you can sort your > items without actually comparing each item with the others then you > aren't bound by it at all. Notably algorithms like radix sort and > direct sort aren't bound by it and are O(n). I'm hoping to look at > trying some of these for integer sorts when they apply using the new > sort specialization code Peter added. Actually, though a later revision of the patch did nominally allow for user-defined sorting functions through the SortSupport infrastructure (I didn't intend that it would be complete/useful enough to be really worth documenting), the version that was finally committed did not. However, there is a fairly obvious entry point for a radix sort, which is here:

/*
 * We were able to accumulate all the tuples within the allowed
 * amount of memory.  Just qsort'em and we're done.
 */
if (state->memtupcount > 1)
{
    /* Can we use the single-key sort function? */
    if (state->onlyKey != NULL)
        qsort_ssup(state->memtuples, state->memtupcount,
                   state->onlyKey);
    else
        qsort_tuple(state->memtuples,
                    state->memtupcount,
                    state->comparetup, state);
}

The only specialisation that occurs here is between tuple sorts of a single key and multiple (type-specific specialisations were ultimately not judged to be worth the increase in binary size relative to their diminishing benefits). You'd probably set up type-specific positional notation machinery within the state's SortSupport struct (the type-specific abstraction through which we elide the SQL function call machinery for types that support it). One insight that I had at the time was that text comparisons where the c locale isn't used are really rather expensive, and I doubt that there is too much that can be done to address that directly. However, if we were to support timsort, that could substantially cut down on the number of comparisons for text columns, which could be a real win. Maybe that would happen through some explicit mechanism, or maybe the planner could actually infer that it was likely to be optimal to use timsort based on a high statistical correlation between physical row ordering and logical ordering of a key column's values. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
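For what a radix sort plugged in at that entry point might look like, here is a sketch of an LSD byte-wise radix sort over bare uint32 keys (datum1 extraction, sign handling for signed types, NULLs and the SortTuple bookkeeping are all glossed over; this is not SortSupport code):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * LSD radix sort on 32-bit unsigned keys: four stable counting passes,
 * one per byte, O(n) overall and no comparator calls at all.  Signed
 * integers would need the sign bit flipped before and after.
 */
static void
radix_sort_uint32(uint32_t *keys, int n)
{
    uint32_t   *scratch = malloc((size_t) n * sizeof(uint32_t));
    int         pass;

    if (scratch == NULL)
        return;                 /* a caller would fall back to qsort */

    for (pass = 0; pass < 4; pass++)
    {
        int         shift = pass * 8;
        int         count[256] = {0};
        int         i,
                    total = 0;

        for (i = 0; i < n; i++)
            count[(keys[i] >> shift) & 0xFF]++;
        for (i = 0; i < 256; i++)       /* prefix sums -> start offsets */
        {
            int     c = count[i];

            count[i] = total;
            total += c;
        }
        for (i = 0; i < n; i++)
            scratch[count[(keys[i] >> shift) & 0xFF]++] = keys[i];
        memcpy(keys, scratch, (size_t) n * sizeof(uint32_t));
    }
    free(scratch);
}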
On 13 April 2012 17:42, Peter Geoghegan <peter@2ndquadrant.com> wrote: > One insight that I had at the time was that text comparisons where the > c locale isn't used are really rather expensive, and I doubt that > there is too much that can be done to address that directly. However, > if we were to support timsort, that could substantially cut down on > the number of comparisons for text columns, which could be a real win. > Maybe that would happen through some explicit mechanism, or maybe the > planner could actually infer that it was likely to be optimal to use > timsort based on a high statistical correlation between physical row > ordering and logical ordering of a key column's values. Further thoughts: At the time we committed our own quicksort implementation, based on the NetBSD one, we eschewed the optimisation of using insertion sort when n is fairly low. This happens to be a very common optimisation, so I'm not really super-confident that that was a good decision. However, we also added our own optimisation, which is to attempt, regardless of the size of n, to ascertain that the array is pre-sorted, in which case we don't quicksort at all. So if we attempt to quicksort an array which is almost pre-sorted, but say only has its very last element out-of-order, we'll do fairly horribly, because we waste the effort of almost an entire "bubble sort iteration". So almost-sorted data would seem to be a compelling thing to optimise for that reason, as well as for the simple observation that it isn't exactly uncommon in a relational database. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Apr 13, 2012 at 1:15 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 13 April 2012 17:42, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> One insight that I had at the time was that text comparisons where the >> c locale isn't used are really rather expensive, and I doubt that >> there is too much that can be done to address that directly. However, >> if we were to support timsort, that could substantially cut down on >> the number of comparisons for text columns, which could be a real win. >> Maybe that would happen through some explicit mechanism, or maybe the >> planner could actually infer that it was likely to be optimal to use >> timsort based on a high statistical correlation between physical row >> ordering and logical ordering of a key column's values. > > Further thoughts: > > At the time we committed our own quicksort implementation, based on > the NetBSD one, we eschewed the optimisation of using insertion sort > when n is fairly low. This happens to be a very common optimisation, > so I'm not really super-confident that that was a good decision. > However, we also added our own optimisation, which is to attempt, > regardless of the size of n, to ascertain that the array is > pre-sorted, in which case we don't quicksort at all. We do use insertion sort for partitions of size < 7. I assume you are referring to something else. One thing that has struck me as odd is that we check for presorted arrays at every level of recursion. That is, if we start out with 100 elements, we'll check whether the input is presorted. Then, if we partition it into a block of 60 elements and a block of 40 elements, we'll check each of those partitions over again to see whether they are presorted. There is definitely some potential upside here, because if the input is mostly sorted it may contain runs where everything is in order, and we'll detect that and avoid some work. But it's also kind of expensive; I've wondered whether we should, for example, move the test for presorted input down a bit, so that it only happens in the n > 40 case. Another idea is to arrange things so that we remember the offset at which the last presort check failed. When we tail-recurse into the left half of the array, there's no need to recheck - if the presort check fell out after our partition boundary, it's presorted; if it fell out before the partition boundary, it isn't. > So if we attempt to quicksort an array which is almost pre-sorted, but > say only has its very last element out-of-order, we'll do fairly > horribly, because we waste the effort of almost an entire "bubble sort > iteration". So almost-sorted data would seem to be a compelling thing > to optimise for that reason, as well as for the simple observation > that it isn't exactly uncommon in a relational database. Heap-sorting could also benefit from some optimization in this area. Right now, if you heap-sort data that's already in order, it gets progressively slower as you crank up work_mem, because the heap gets deeper and so extraction and siftup get more expensive, for no real benefit. We could do something like check, every time we use up our available memory, whether the heap is already entirely in order. If so, we could dump all but one tuple to the current run and then start reading tuples again. Or maybe dump some percentage of the array, though that might require a bit more bookkeeping. Or maybe there's a smarter algorithm that would also cater to mostly-sorted data. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
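Both of the ideas above (remembering where the presort check failed, and noticing that the whole heap happens to be in order) lean on the same cheap linear scan; a sketch on ints:

/*
 * Return n if a[0..n-1] is already sorted; otherwise the index of the
 * first element that is smaller than its predecessor.  Costs at most
 * n - 1 comparisons and often bails out much earlier.  A quicksort
 * caller could remember this offset and skip re-checking the left
 * partition; a heapsort caller could compare it with the heap size to
 * decide whether to dump most of the array as a run.
 */
static int
first_unsorted(const int *a, int n)
{
    int     i;

    for (i = 1; i < n; i++)
        if (a[i] < a[i - 1])
            return i;
    return n;
}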
On 13 April 2012 18:33, Robert Haas <robertmhaas@gmail.com> wrote: > We do use insertion sort for partitions of size < 7. I assume you are > referring to something else. I stand corrected. My confusion came from the removal of a later *switch* to insertion sort in a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649, which also added the pre-sorted check where you'd expect to see the insertion switch. Of course, the n < 7 insertion sort thing is right beside the added check. So another, redundant copy of insertion sort was removed, and not the one that almost every real-world quicksort implementation has. > Heap-sorting could also benefit from some optimization in this area. > Right now, if you heap-sort data that's already in order, it gets > progressively slower as you crank up work_mem, because the heap gets > deeper and so extraction and siftup get more expensive, for no real > benefit. We could do something like check, every time we use up our > available memory, whether the heap is already entirely in order. If > so, we could dump all but one tuple to the current run and the start > reading tuples again. Or maybe dump some percentage of the array, > though that might require a bit more bookkeeping. Or maybe there's a > smarter algorithm that would also cater to mostly-sorted data. Well, timsort is specifically designed to take advantage of pre-sorted data. It does appear to have a lot of traction, as wikipedia points out: "Timsort has been Python's standard sorting algorithm since version 2.3. It is now also used to sort arrays in Java SE 7,[2] and on the Android platform.[3] " Somebody has produced a BSD-licensed C implementation that draws heavily on the Python/Java one here: http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17 It looks like it has exactly the same interface as a c stdlib qsort. So it'd be fairly easy to produce a timsort_arg() based on this, if anyone cares to. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
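The property timsort exploits is easiest to see in its run-detection step; a sketch on ints (the real timSort.c linked above is considerably more involved, with minimum run lengths and galloping merges):

/*
 * Find the natural run starting at a[lo] within a[lo..hi-1]: either a
 * non-descending run, kept as-is, or a strictly descending run, which
 * is reversed in place so it too becomes ascending (strictness keeps
 * the sort stable).  On already-sorted input the whole array is one
 * run, so the "sort" degenerates into a single O(n) scan.  Returns the
 * first index past the run.
 */
static int
find_and_make_run(int *a, int lo, int hi)
{
    int     i = lo + 1;

    if (i >= hi)
        return hi;

    if (a[i] < a[i - 1])
    {
        int     l,
                r,
                tmp;

        while (i < hi && a[i] < a[i - 1])
            i++;
        for (l = lo, r = i - 1; l < r; l++, r--)    /* reverse the run */
        {
            tmp = a[l];
            a[l] = a[r];
            a[r] = tmp;
        }
    }
    else
    {
        while (i < hi && a[i] >= a[i - 1])
            i++;
    }
    return i;
}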
On Fri, Apr 13, 2012 at 7:01 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > Well, timsort is specifically designed to take advantage of pre-sorted > data. It does appear to have a lot of traction, as wikipedia points > out: I hadn't heard of it. But reading up on it it does seem like a good fit for us. It trades some additional storage -- an array of pointers into the sort array where in our case the pointers would be much smaller than a whole SortTuple -- for fewer comparisons -- which in our case are often much slower than a simple integer comparison. -- greg
On 14 April 2012 13:32, Greg Stark <stark@mit.edu> wrote: > On Fri, Apr 13, 2012 at 7:01 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> Well, timsort is specifically designed to take advantage of pre-sorted >> data. It does appear to have a lot of traction, as wikipedia points >> out: > > I hadn't heard of it. But reading up on it it does seem like a good > fit for us. It trades some additional storage -- an array of pointers > into the sort array where in our case the pointers would be much > smaller than a whole SortTuple -- for fewer comparisons -- which in > our case are often much slower than a simple integer comparison. I wouldn't go so far as to suggest getting rid of quicksort, of course. Quicksort is generally faster than other average case O(n log n) algorithms in practice, for various reasons, principal among them that it takes advantage of memory locality so well. I don't think that it's a coincidence that timsort is popular in Python and Java land. Even the Python C implementation has to sort integers through all that PyObject reference indirection, I think. I'd now speculate that an appropriate use of this algorithm might be to simply use it with types that don't have a SortSupport function, that are largely passed by reference, and have expensive comparisons. FWIW, I started playing with adding timsort to Postgres last night: https://github.com/Peter2ndQuadrant/postgres/tree/timsort -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 14 April 2012 14:34, Peter Geoghegan <peter@2ndquadrant.com> wrote: > FWIW, I started playing with adding timsort to Postgres last night: > > https://github.com/Peter2ndQuadrant/postgres/tree/timsort I've fixed this feature-branch so that every qsort_arg call site (including the tuplesort specialisations thereof) call timsort_arg instead. All but 4 regression tests pass, but they don't really count as failures, since they're down to an assumption in the tests that the order certain tuples appear should be the same as our current quicksort implementation returns them, even though, in these problematic cases, that is partially dictated by implementation - our quicksort isn't stable, but timsort is. I think that the original tests are faulty, but I understand that they're only expected to provide smoke testing. I did have to fix one bug in the implementation that I found, which was an assumption that comparators would only return one of {-1, 0, 1}. The C standard requires only that they return negative, zero or positive values, as required. I also polished the code a bit, and added more comments. I haven't actually quantified the benefits yet, and have no numbers to show just yet, but I suspect it will prove to be a fairly compelling win in certain situations. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Apr 16, 2012 at 10:42 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > All but 4 regression tests pass, but they don't really count > as failures, since they're down to an assumption in the tests that the > order certain tuples appear should be the same as our current > quicksort implementation returns them, even though, in these > problematic cases, that is partially dictated by implementation - our > quicksort isn't stable, but timsort is. This is an interesting point. If we use a stable sort we'll probably be stuck with stable sorts indefinitely. People will start depending on the stability and then we'll break their apps if we find a faster sort that isn't stable. Notably though tapesort is not stable (because heapsort is not stable so neither the runs nor the merge steps are stable). So people's apps would appear to work when they're in memory and fail only on large data sets. It's easily possible for a user's query to never need to go to tape though. We don't have the luxury of having a separate sort and stable_sort though due to the ORDER BY clause. All in all I think it's handier to have a stable ORDER BY sort than an unstable one though. So I'm not necessarily opposed to it even if it means we're stuck using a stable sort indefinitely. -- greg
On 4/17/12 7:19 AM, Greg Stark wrote: > On Mon, Apr 16, 2012 at 10:42 PM, Peter Geoghegan<peter@2ndquadrant.com> wrote: >> > All but 4 regression tests pass, but they don't really count >> > as failures, since they're down to an assumption in the tests that the >> > order certain tuples appear should be the same as our current >> > quicksort implementation returns them, even though, in these >> > problematic cases, that is partially dictated by implementation - our >> > quicksort isn't stable, but timsort is. > This is an interesting point. If we use a stable sort we'll probably > be stuck with stable sorts indefinitely. People will start depending > on the stability and then we'll break their apps if we find a faster > sort that isn't stable. I have often wished that I could inject entropy into a test database to ferret out these kinds of issues. In particular I worry about things like users depending on specific values for serial types or depending on the order of data in the heap. I would find it useful if Postgres had an option to intentionally inject more randomness in areas at the cost of some performance. IE: have nextval() burn through a small, random number of values before returning one, and have scan operators do some re-ordering of tuples where appropriate. If we had such an option and encouraged users to use it in testing, it would reduce the risk of people depending on behavior that they shouldn't be. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net