Thread: Memory usage during sorting

Memory usage during sorting

From
Jeff Janes
Date:
In tuplesort.c, it initially reads tuples into memory until availMem
is exhausted.

It then switches to the tape sort algorithm, and allocates buffer
space for each tape it will use.  This substantially over-runs the
allowedMem, and drives availMem negative.

It works off this deficit by writing tuples to tape, and pfree-ing
their spot in the heap, which pfreed space is more or less randomly
scattered.  Once this is done it proceeds to alternate between freeing
more space in the heap and adding things to the heap (in a nearly
strict 1+1 alternation if the tuples are constant size).

The space freed up by the initial round of pfree where it is working
off the space deficit from inittapes is never re-used.  It also cannot
be paged out by the VM system, because it is scattered among actively
used memory.

I don't think that small chunks can be reused from one memory context
to another, but I haven't checked.  Even if it can be, during a big
sort like an index build the backend process doing the build may have
no other contexts which need to use it.

So having over-ran workMem and stomped all over it to ensure no one
else can re-use it, we then scrupulously refuse to benefit from that
over-run amount ourselves.

The attached patch allows it to reuse that memory.  On my meager
system it reduced the building of an index on an integer column in a
skinny 200 million row totally randomly ordered table by about 3% from
a baseline of 25 minutes.

I think it would be better to pre-deduct the tape overhead amount we
will need if we decide to switch to tape sort from the availMem before
we even start reading (and then add it back if we do indeed make that
switch).  That way we wouldn't over-run the memory in the first place.
 However, that would cause apparent regressions in which sorts that
previously fit into maintenance_work_mem no longer do.  Boosting
maintenance_work_mem to a level that was actually being used
previously would fix those regressions, but pointing out that the
previous behavior was not optimal doesn't change the fact that people
are used to it and perhaps tuned to it.  So the attached patch seems
more backwards-friendly.


Cheers,

Jeff

Attachment

Re: Memory usage during sorting

From
Peter Geoghegan
Date:
On 16 January 2012 00:59, Jeff Janes <jeff.janes@gmail.com> wrote:
> I think it would be better to pre-deduct the tape overhead amount we
> will need if we decide to switch to tape sort from the availMem before
> we even start reading (and then add it back if we do indeed make that
> switch).  That way we wouldn't over-run the memory in the first place.
>  However, that would cause apparent regressions in which sorts that
> previously fit into maintenance_work_mem no longer do.  Boosting
> maintenance_work_mem to a level that was actually being used
> previously would fix those regressions, but pointing out that the
> previous behavior was not optimal doesn't change the fact that people
> are used to it and perhaps tuned to it.  So the attached patch seems
> more backwards-friendly.

Hmm. Are people really setting maintenance_work_mem such that it is
exactly large enough to quicksort when building an index in one case
but not another? Is the difference large enough to warrant avoiding
pre-deduction?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 16 January 2012 00:59, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I think it would be better to pre-deduct the tape overhead amount we
>> will need if we decide to switch to tape sort from the availMem before
>> we even start reading (and then add it back if we do indeed make that
>> switch).  That way we wouldn't over-run the memory in the first place.
>>  However, that would cause apparent regressions in which sorts that
>> previously fit into maintenance_work_mem no longer do.  Boosting
>> maintenance_work_mem to a level that was actually being used
>> previously would fix those regressions, but pointing out that the
>> previous behavior was not optimal doesn't change the fact that people
>> are used to it and perhaps tuned to it.  So the attached patch seems
>> more backwards-friendly.
>
> Hmm. Are people really setting maintenance_work_mem such that it is
> exactly large enough to quicksort when building an index in one case
> but not another?

It could also apply to work_mem in other situations.  I'm just using
maintenance_work_mem in my example because index creation is the thing
that lead me into this little diversion and so that is what my
test-bed is set up to use.

Some people do have very stable "ritualized" work-loads and so could
have tuned exactly to them.  I don't know how common that would be.
More common might be synthetic benchmarks which had been tuned, either
intentionally or accidentally, to just barely fit in the (overshot)
memory, so it would be a perception problem that there was a
regression when in fact the tuning merely became out of date.

> Is the difference large enough to warrant avoiding
> pre-deduction?

Switching to an external sort when you could have done it all by quick
sort (by cheating just a little bit on memory usage) makes a pretty
big difference, over 2 fold in my tests.  If the fast-path
optimization for qsort goes in, the difference might get even bigger.
Of course, there will always be some point at which that switch over
must occur, so the real question is do we need to keep that switch
point historically consistent to avoid nasty surprises for people with
excessively fine-tuned *work_mem settings based on old behavior?  And
do such people even exist outside of my imagination?

But a bigger question I now have is if any of this matters.  Since
there is currently no way to know how many connections might be
running how many concurrent sorts, there is really no way to tune
work_mem with much precision.  So, does it matter if we slightly lie
about how much we will actually use?  And is it worth worrying about
whether every last drop of memory is used efficiently?

I was trying to compare the performance I was getting with a
theoretical model of what I should get, and it was confusing to me
that we were using a originally defined size of memory for the
priority heap, and then later reducing that amount of memory.  That is
annoying because it complicates the theoretical model, and even more
annoying when I realized that the "freed" memory that results from
doing this is too fragmented to even be used for other purposes.  But
theoretical annoyances aside, it is hardly the worst thing about
memory usage in sorting.  It is just one that is standing in the way
of my further study.

So I don't think this patch I proposed is particularly important in
its own right.  I want to see if anyone will point out that I am all
wet because my analysis failed to take <foo> into account.   I
probably should have explained this when I submitted the patch.


The worst thing about the current memory usage is probably that big
machines can't qsort more than 16,777,216 tuples no matter how much
memory they have, because memtuples has to live entirely within a
single allocation, which is currently limited to 1GB.  It is probably
too late to fix this problem for 9.2. I wish I had started there
rather than here.

This 16,777,216 tuple limitation will get even more unfortunate if the
specializations that speed up qsort but not external sort get
accepted.

Attached is a completely uncommitable proof of concept/work in
progress patch to get around the limitation.  It shows a 2 fold
improvement when indexing an integer column on a 50,000,000 row
randomly ordered table.

As for what to do to make it commitable, it seems like maybe there
should be two different MaxAllocSize.  One determined by the "aset.c
assumes they can compute twice an allocation's size without overflow".
 I would think that simply testing size_t at compile time would be
enough.  Allowing more than 1 GB allocations would probably be
pointless on 32 bit machines, and 64 bit should be able to compute
twice of a much larger value than 1GB without overflow.

For the varlena stuff, I don't know what to do.  Is the "I swear to
God I won't pass this pointer to one of those functions" sufficient?
Or does each function need to test it?

And I'm pretty sure that palloc, not just repalloc, would need an
over-size alternative to make this a general-purpose improvement.

Cheers,

Jeff

Attachment

Re: Memory usage during sorting

From
Jeff Janes
Date:
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

> Attached is a completely uncommitable proof of concept/work in
> progress patch to get around the limitation.  It shows a 2 fold
> improvement when indexing an integer column on a 50,000,000 row
> randomly ordered table.

Oops, forgot to say that is with
set maintenance_work_mem=4194304;


Re: Memory usage during sorting

From
Hitoshi Harada
Date:
On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> The attached patch allows it to reuse that memory.  On my meager
> system it reduced the building of an index on an integer column in a
> skinny 200 million row totally randomly ordered table by about 3% from
> a baseline of 25 minutes.
>

Just to give a standard review, this patch is one line change and
applies cleanly, builds ok.

I'm not pretty sure what exactly you're trying to accomplish, but it
seems to me that it's avoiding the first dumptuples cycle by forcing
availMem = 0 even if it's negative. I read your comments as it'd be
avoiding to alternate reading/writing back and force with scattered
memory failing memory cache much during merge phase, but actually it
doesn't affect merge phase but only init-dump phase, does it? If so,
I'm not so convinced your benchmark gave 3 % gain by this change.
Correct me as I'm probably wrong.

Anyway, it's nice to modify the comment just above the change, since
it's now true with it.

Thanks,
--
Hitoshi Harada


Re: Memory usage during sorting

From
Hitoshi Harada
Date:
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> The worst thing about the current memory usage is probably that big
> machines can't qsort more than 16,777,216 tuples no matter how much
> memory they have, because memtuples has to live entirely within a
> single allocation, which is currently limited to 1GB.  It is probably
> too late to fix this problem for 9.2. I wish I had started there
> rather than here.
>
> This 16,777,216 tuple limitation will get even more unfortunate if the
> specializations that speed up qsort but not external sort get
> accepted.
>

I think it's a fair ask to extend our palloc limitation of 1GB to
64bit space. I see there are a lot of applications that want more
memory by one palloc call in user-defined functions, aggregates, etc.
As you may notice, however, the area in postgres to accomplish it
needs to be investigated deeply. I don't know where it's safe to allow
it and where not. varlena types is unsafe, but it might be possible to
extend varlena header to 64 bit in major release somehow.

> Attached is a completely uncommitable proof of concept/work in
> progress patch to get around the limitation.  It shows a 2 fold
> improvement when indexing an integer column on a 50,000,000 row
> randomly ordered table.

In any case, we do need bird-eye sketch to attack it but I guess it's
worth and at some future point we definitely must do, though I don't
know if it's the next release or third next release from now.


Thanks,
--
Hitoshi Harada


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> The attached patch allows it to reuse that memory.  On my meager
>> system it reduced the building of an index on an integer column in a
>> skinny 200 million row totally randomly ordered table by about 3% from
>> a baseline of 25 minutes.
>>
>
> Just to give a standard review, this patch is one line change and
> applies cleanly, builds ok.
>
> I'm not pretty sure what exactly you're trying to accomplish, but it
> seems to me that it's avoiding the first dumptuples cycle by forcing
> availMem = 0 even if it's negative.

Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
tape-sort, it starts by calling WRITETUP a large number of time
consecutively, to work off the memory deficit incurred by the 3 blocks
per tape of tape overhead, and then after that calls WRITETUP about
once per puttuple..   Under my patch, it would only call WRITETUP
about once per puttuple, right from the beginning.

> I read your comments as it'd be
> avoiding to alternate reading/writing back and force with scattered
> memory failing memory cache much during merge phase, but actually it
> doesn't affect merge phase but only init-dump phase, does it?

It effects the building of the runs.  But this building of the runs is
not a simple dump, it is itself a mini merge phase, in which it merges
the existing in-memory priority queue against the still-incoming
tuples from the node which invoked the sort.  By using less memory
than it could, this means that the resulting runs are smaller than
they could be, and so will sometimes necessitate an additional layer
of merging later on.   (This effect is particularly large for the very
first run being built.  Generally by merging incoming tuples into the
memory-tuples, you can create runs that are 1.7 times the size of fits
in memory.  By wasting some memory, we are getting 1.7 the size of a
smaller starting point.  But for the first run, it is worse than that.Most of the benefit that leads to that 1.7
multipliercomes at the 
very early stage of each run-build.  But by initially using the full
memory, then writing out a bunch of tuples without doing any merge of
the incoming, we have truncated the part that gives the most benefit.)

My analysis that the "freed" memory is never reused (because we refuse
to reuse it ourselves and it is too fragmented to be reused by anyone
else, like the palloc or VM system) only applies to the run-building
phase.  So never was a bit of an overstatement.  By the time the last
initial run is completely written out to tape, the heap used for the
priority queue should be totally empty.  So at this point the
allocator would have the chance to congeal all of the fragmented
memory back into larger chunks, or maybe it parcels the allocations
back out again in an order so that the unused space is contiguous and
could be meaningfully paged out.

But, it is it worth worrying about how much we fragment memory and if
we overshoot our promises by 10 or 20%?

> If so,
> I'm not so convinced your benchmark gave 3 % gain by this change.
> Correct me as I'm probably wrong.

I've now done more complete testing.  Building an index on an
200,000,000 row table with an integer column populated in random order
with integers from 1..500,000,000, non-unique, on a machine with 2GB
of RAM and 600MB of shared_buffers.

It improves things by 6-7 percent at the end of working mem size, the
rest are in the noise except at 77936 KB, where it reproducibly makes
things 4% worse, for reasons I haven't figured out.  So maybe the best
thing to do is, rather than micromanaging memory usage, simply don't
set maintenance_work_mem way to low.  (But, it is the default).

maintenance_work_mem    %Change with patch
16384    6.2
19484    7.8
23170    -0.1
27554    0.1
32768    1.9
38968    -1.9
46341    0.4
55109    0.4
65536    0.6
77936    -4.3
92682    -0.3
110218    0.1
131072    1.7


> Anyway, it's nice to modify the comment just above the change, since
> it's now true with it.

Much of that comment is referring to other things (some of which I
don't understand).  How about an additional comment just before my
code to the gist of:

/* If we have already used, and thus fragmented, the memory then we* might as well continue to make use of it as no one
elsecan.*/ 

(That might not actually be true, if the tuples we are sorting are
very large, or perhaps if they arrive in already reverse sorted order)

Thanks,

Jeff


Re: Memory usage during sorting

From
Hitoshi Harada
Date:
On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> The attached patch allows it to reuse that memory.  On my meager
>>> system it reduced the building of an index on an integer column in a
>>> skinny 200 million row totally randomly ordered table by about 3% from
>>> a baseline of 25 minutes.
>>>
>>
>> Just to give a standard review, this patch is one line change and
>> applies cleanly, builds ok.
>>
>> I'm not pretty sure what exactly you're trying to accomplish, but it
>> seems to me that it's avoiding the first dumptuples cycle by forcing
>> availMem = 0 even if it's negative.
>
> Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
> tape-sort, it starts by calling WRITETUP a large number of time
> consecutively, to work off the memory deficit incurred by the 3 blocks
> per tape of tape overhead, and then after that calls WRITETUP about
> once per puttuple..   Under my patch, it would only call WRITETUP
> about once per puttuple, right from the beginning.
>
>> I read your comments as it'd be
>> avoiding to alternate reading/writing back and force with scattered
>> memory failing memory cache much during merge phase, but actually it
>> doesn't affect merge phase but only init-dump phase, does it?
>
> It effects the building of the runs.  But this building of the runs is
> not a simple dump, it is itself a mini merge phase, in which it merges
> the existing in-memory priority queue against the still-incoming
> tuples from the node which invoked the sort.  By using less memory
> than it could, this means that the resulting runs are smaller than
> they could be, and so will sometimes necessitate an additional layer
> of merging later on.   (This effect is particularly large for the very
> first run being built.  Generally by merging incoming tuples into the
> memory-tuples, you can create runs that are 1.7 times the size of fits
> in memory.  By wasting some memory, we are getting 1.7 the size of a
> smaller starting point.  But for the first run, it is worse than that.
>  Most of the benefit that leads to that 1.7 multiplier comes at the
> very early stage of each run-build.  But by initially using the full
> memory, then writing out a bunch of tuples without doing any merge of
> the incoming, we have truncated the part that gives the most benefit.)
>
> My analysis that the "freed" memory is never reused (because we refuse
> to reuse it ourselves and it is too fragmented to be reused by anyone
> else, like the palloc or VM system) only applies to the run-building
> phase.  So never was a bit of an overstatement.  By the time the last
> initial run is completely written out to tape, the heap used for the
> priority queue should be totally empty.  So at this point the
> allocator would have the chance to congeal all of the fragmented
> memory back into larger chunks, or maybe it parcels the allocations
> back out again in an order so that the unused space is contiguous and
> could be meaningfully paged out.
>
> But, it is it worth worrying about how much we fragment memory and if
> we overshoot our promises by 10 or 20%?
>
>> If so,
>> I'm not so convinced your benchmark gave 3 % gain by this change.
>> Correct me as I'm probably wrong.
>
> I've now done more complete testing.  Building an index on an
> 200,000,000 row table with an integer column populated in random order
> with integers from 1..500,000,000, non-unique, on a machine with 2GB
> of RAM and 600MB of shared_buffers.
>
> It improves things by 6-7 percent at the end of working mem size, the
> rest are in the noise except at 77936 KB, where it reproducibly makes
> things 4% worse, for reasons I haven't figured out.  So maybe the best
> thing to do is, rather than micromanaging memory usage, simply don't
> set maintenance_work_mem way to low.  (But, it is the default).

I've tested here with only a million rows mix of integer/text (table
size is 80MB or so) with default setting, running CREATE INDEX
continuously, but couldn't find performance improvement.  The number
varies from -2% to +2%, which I think is just error.

While I agree with your insist that avoiding the first dump would make
sense, I guess it depends on situations; if the dump goes with lots of
tuples (which should happen when availMem is big), writing tuples a
lot at a time will be faster than writing little by little later.

I'm not sure about the conclusion, but given this discussion, I'm
inclined to mark this Returned with Feedback.


Thanks,
--
Hitoshi Harada


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Tue, Feb 14, 2012 at 1:44 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>>> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>>
>>>> The attached patch allows it to reuse that memory.  On my meager
>>>> system it reduced the building of an index on an integer column in a
>>>> skinny 200 million row totally randomly ordered table by about 3% from
>>>> a baseline of 25 minutes.
>>>>
>>>
>>> Just to give a standard review, this patch is one line change and
>>> applies cleanly, builds ok.
>>>
>>> I'm not pretty sure what exactly you're trying to accomplish, but it
>>> seems to me that it's avoiding the first dumptuples cycle by forcing
>>> availMem = 0 even if it's negative.
>>
>> Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
>> tape-sort, it starts by calling WRITETUP a large number of time
>> consecutively, to work off the memory deficit incurred by the 3 blocks
>> per tape of tape overhead, and then after that calls WRITETUP about
>> once per puttuple..   Under my patch, it would only call WRITETUP
>> about once per puttuple, right from the beginning.
>>
>>> I read your comments as it'd be
>>> avoiding to alternate reading/writing back and force with scattered
>>> memory failing memory cache much during merge phase, but actually it
>>> doesn't affect merge phase but only init-dump phase, does it?
>>
>> It effects the building of the runs.  But this building of the runs is
>> not a simple dump, it is itself a mini merge phase, in which it merges
>> the existing in-memory priority queue against the still-incoming
>> tuples from the node which invoked the sort.  By using less memory
>> than it could, this means that the resulting runs are smaller than
>> they could be, and so will sometimes necessitate an additional layer
>> of merging later on.   (This effect is particularly large for the very
>> first run being built.  Generally by merging incoming tuples into the
>> memory-tuples, you can create runs that are 1.7 times the size of fits
>> in memory.  By wasting some memory, we are getting 1.7 the size of a
>> smaller starting point.  But for the first run, it is worse than that.
>>  Most of the benefit that leads to that 1.7 multiplier comes at the
>> very early stage of each run-build.  But by initially using the full
>> memory, then writing out a bunch of tuples without doing any merge of
>> the incoming, we have truncated the part that gives the most benefit.)
>>
>> My analysis that the "freed" memory is never reused (because we refuse
>> to reuse it ourselves and it is too fragmented to be reused by anyone
>> else, like the palloc or VM system) only applies to the run-building
>> phase.  So never was a bit of an overstatement.  By the time the last
>> initial run is completely written out to tape, the heap used for the
>> priority queue should be totally empty.  So at this point the
>> allocator would have the chance to congeal all of the fragmented
>> memory back into larger chunks, or maybe it parcels the allocations
>> back out again in an order so that the unused space is contiguous and
>> could be meaningfully paged out.
>>
>> But, it is it worth worrying about how much we fragment memory and if
>> we overshoot our promises by 10 or 20%?
>>
>>> If so,
>>> I'm not so convinced your benchmark gave 3 % gain by this change.
>>> Correct me as I'm probably wrong.
>>
>> I've now done more complete testing.  Building an index on an
>> 200,000,000 row table with an integer column populated in random order
>> with integers from 1..500,000,000, non-unique, on a machine with 2GB
>> of RAM and 600MB of shared_buffers.
>>
>> It improves things by 6-7 percent at the end of working mem size, the
>> rest are in the noise except at 77936 KB, where it reproducibly makes
>> things 4% worse, for reasons I haven't figured out.  So maybe the best
>> thing to do is, rather than micromanaging memory usage, simply don't
>> set maintenance_work_mem way to low.  (But, it is the default).
>
> I've tested here with only a million rows mix of integer/text (table
> size is 80MB or so) with default setting, running CREATE INDEX
> continuously, but couldn't find performance improvement.  The number
> varies from -2% to +2%, which I think is just error.
>
> While I agree with your insist that avoiding the first dump would make
> sense, I guess it depends on situations; if the dump goes with lots of
> tuples (which should happen when availMem is big), writing tuples a
> lot at a time will be faster than writing little by little later.
>
> I'm not sure about the conclusion, but given this discussion, I'm
> inclined to mark this Returned with Feedback.

OK, thanks.  Does anyone have additional feed-back on how tightly we
wish to manage memory usage?  Is trying to make us use as much memory
as we are allowed to without going over a worthwhile endeavor at all,
or is it just academic nitpicking?

Also, since the default value of work_mem is quite low, should the
docs be more aggressive in suggesting that people using any
realistically modern hardware to tune it upwards?  If you have a few
thousand large sorts running concurrently, I think that having all of
them spill to disk intentionally is probably going to destroy your
server's performance about as much as having them spill to disk via
swapping will.  If you find yourself in this situation, the answer is
probably to use a connection pooler with admission control, not a
work_mem of 1MB.

Cheers,

Jeff

Thanks,

Jeff


Re: Memory usage during sorting

From
Robert Haas
Date:
On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I'm not sure about the conclusion, but given this discussion, I'm
>> inclined to mark this Returned with Feedback.
>
> OK, thanks.  Does anyone have additional feed-back on how tightly we
> wish to manage memory usage?  Is trying to make us use as much memory
> as we are allowed to without going over a worthwhile endeavor at all,
> or is it just academic nitpicking?

I'm not sure, either.  It strikes me that, in general, it's hard to
avoid a little bit of work_mem overrun, since we can't know whether
the next tuple will fit until we've read it, and then if it turns out
to be big, well, the best thing we can do is free it, but perhaps
that's closing the barn door after the horse has gotten out already.

Having recently spent quite a bit of time looking at tuplesort.c as a
result of Peter Geoghegan's work and some other concerns, I'm inclined
to think that it needs more than minor surgery.  That file is peppered
with numerous references to Knuth which serve the dual function of
impressing the reader with the idea that the code must be very good
(since Knuth is a smart guy) and rendering it almost completely
impenetrable (since the design is justified by reference to a textbook
most of us probably do not have copies of).

A quick Google search for external sorting algorithms suggest that the
typical way of doing an external sort is to read data until you fill
your in-memory buffer, quicksort it, and dump it out as a run.  Repeat
until end-of-data; then, merge the runs (either in a single pass, or
if there are too many, in multiple passes).  I'm not sure whether that
would be better than what we're doing now, but there seem to be enough
other people doing it that we might want to try it out.  Our current
algorithm is to build a heap, bounded in size by work_mem, and dribble
tuples in and out, but maintaining that heap is pretty expensive;
there's a reason people use quicksort rather than heapsort for
in-memory sorting.  As a desirable side effect, I think it would mean
that we could dispense with retail palloc and pfree altogether.  We
could just allocate a big chunk of memory, copy tuples into it until
it's full, using a pointer to keep track of the next unused byte, and
then, after writing the run, reset the allocation pointer back to the
beginning of the buffer.  That would not only avoid the cost of going
through palloc/pfree, but also the memory overhead imposed by
bookkeeping and power-of-two rounding.

If we do want to stick with the current algorithm, there seem to be
some newer techniques for cutting down on the heap maintenance
overhead.  Heikki's been investigating that a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> A quick Google search for external sorting algorithms suggest that the
> typical way of doing an external sort is to read data until you fill
> your in-memory buffer, quicksort it, and dump it out as a run.  Repeat
> until end-of-data; then, merge the runs (either in a single pass, or
> if there are too many, in multiple passes).  I'm not sure whether that
> would be better than what we're doing now, but there seem to be enough
> other people doing it that we might want to try it out.  Our current
> algorithm is to build a heap, bounded in size by work_mem, and dribble
> tuples in and out, but maintaining that heap is pretty expensive;
> there's a reason people use quicksort rather than heapsort for
> in-memory sorting.

Well, the reason the heapsort approach is desirable here is that you end
up with about half as many runs to merge, because the typical run length
is significantly more than what will fit in work_mem.  Don't get too
excited about micro-optimization if you haven't understood the larger
context.
        regards, tom lane


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Sun, Feb 26, 2012 at 7:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> I'm not sure about the conclusion, but given this discussion, I'm
>>> inclined to mark this Returned with Feedback.
>>
>> OK, thanks.  Does anyone have additional feed-back on how tightly we
>> wish to manage memory usage?  Is trying to make us use as much memory
>> as we are allowed to without going over a worthwhile endeavor at all,
>> or is it just academic nitpicking?
>
> I'm not sure, either.  It strikes me that, in general, it's hard to
> avoid a little bit of work_mem overrun, since we can't know whether
> the next tuple will fit until we've read it, and then if it turns out
> to be big, well, the best thing we can do is free it, but perhaps
> that's closing the barn door after the horse has gotten out already.

That type of overrun doesn't bother me much, because the size of the
next tuple some else feeds us is mostly outside of this modules
control.  And also be cause the memory that is overrun should be
reusable once the offending tuple is written out to tape.  The type of
overrun I'm more concerned with is that which is purely under this
modules control, and which is then not re-used productively.

The better solution would be to reduce the overhead in the first
place.  While building the initial runs, there is no reason to have 3
blocks worth of overhead for each tape, when only one tape is ever
being used at a time.  But that change seems much tougher to
implement.

> Having recently spent quite a bit of time looking at tuplesort.c as a
> result of Peter Geoghegan's work and some other concerns, I'm inclined
> to think that it needs more than minor surgery.  That file is peppered
> with numerous references to Knuth which serve the dual function of
> impressing the reader with the idea that the code must be very good
> (since Knuth is a smart guy) and rendering it almost completely
> impenetrable (since the design is justified by reference to a textbook
> most of us probably do not have copies of).

Yes, I agree with that analysis.  And getting the volume I want by
inter-library-loan has been challenging--I keep getting the volume
before or after the one I want.  Maybe Knuth starts counting volumes
at 0 and libraries start counting at 1 :)

Anyway, I think the logtape could use redoing.  When your tapes are
actually physically tape drives, it is necessary to build up runs one
after the other on physical tapes, because un-mounting a tape from a
tape drive and remounting another tape is not very feasible at scale.
Which then means you need to place your runs carefully, because if the
final merge finds that two runs it needs are back-to-back on one tape,
that is bad.  But with disks pretending to be tapes, you could
re-arrange the "runs" with just some book keeping.  Maintaining the
distinction between "tapes" and "runs" is pointless, which means the
Fibonacci placement algorithm is pointless as well.

> A quick Google search for external sorting algorithms suggest that the
> typical way of doing an external sort is to read data until you fill
> your in-memory buffer, quicksort it, and dump it out as a run.  Repeat
> until end-of-data; then, merge the runs (either in a single pass, or
> if there are too many, in multiple passes).  I'm not sure whether that
> would be better than what we're doing now, but there seem to be enough
> other people doing it that we might want to try it out.  Our current
> algorithm is to build a heap, bounded in size by work_mem, and dribble
> tuples in and out, but maintaining that heap is pretty expensive;
> there's a reason people use quicksort rather than heapsort for
> in-memory sorting.

But it would mean we have about 1.7x  more runs that need to be merged
(for initially random data).  Considering the minimum merge order is
6, that increase in runs is likely not to lead to an additional level
of merging, in which case the extra speed of building the runs would
definitely win.  But if it does cause an additional level of merge, it
could end up being a loss.

Is there some broad corpus of sorting benchmarks which changes could
be tested against?  I usually end up testing just simple columns of
integers or small strings, because they are easy to set up.  That is
not ideal.

> As a desirable side effect, I think it would mean
> that we could dispense with retail palloc and pfree altogether.  We
> could just allocate a big chunk of memory, copy tuples into it until
> it's full, using a pointer to keep track of the next unused byte, and
> then, after writing the run, reset the allocation pointer back to the
> beginning of the buffer.  That would not only avoid the cost of going
> through palloc/pfree, but also the memory overhead imposed by
> bookkeeping and power-of-two rounding.

Wouldn't we still need an array of pointers to the start of every
tuple's location in the buffer?  Or else, how would qsort know where
to find them?

Also, to do this we would need to get around the 1GB allocation limit.It is bad enough that memtuples is limited to
1GB,it would be much 
worse if the entire arena was limited to that amount.



> If we do want to stick with the current algorithm, there seem to be
> some newer techniques for cutting down on the heap maintenance
> overhead.  Heikki's been investigating that a bit.

Interesting.  Is that investigation around the poor L1/L2 caching
properties of large heaps?  I was wondering if there might be a way to
give tuplesort an initial estimate of how much data there was to sort,
so that it could use a smaller amount of memory than the max if it
decided that that would lead to better caching effects.  Once you know
you can't do an in-memory sort, then there is no reason to use more
memory than the amount that lets you merge all the runs in one go.

Cheers,

Jeff


Re: Memory usage during sorting

From
Robert Haas
Date:
On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The better solution would be to reduce the overhead in the first
> place.  While building the initial runs, there is no reason to have 3
> blocks worth of overhead for each tape, when only one tape is ever
> being used at a time.  But that change seems much tougher to
> implement.

Hmm, yeah.

> Anyway, I think the logtape could use redoing.  When your tapes are
> actually physically tape drives, it is necessary to build up runs one
> after the other on physical tapes, because un-mounting a tape from a
> tape drive and remounting another tape is not very feasible at scale.
> Which then means you need to place your runs carefully, because if the
> final merge finds that two runs it needs are back-to-back on one tape,
> that is bad.  But with disks pretending to be tapes, you could
> re-arrange the "runs" with just some book keeping.  Maintaining the
> distinction between "tapes" and "runs" is pointless, which means the
> Fibonacci placement algorithm is pointless as well.

I think you're right.  It seems to me that we could just write out an
arbitrary number of runs, one per file, ending up with files number
1..N.  If we can do a final merge of N runs without overrunning
work_mem, fine.  If not, we merge the first K runs (for the largest
possible K) and write out a new run N+1.  The range of valid run
number is now K+1..N+1.  If those can be merged in a single pass, we
do so; otherwise we again merge the first K runs (K+1 through 2K) to
create a new run N+2.  And so on.

I am not clear, however, whether this would be any faster.  It may not
be worth tinkering with just for the reduction in code complexity.

> But it would mean we have about 1.7x  more runs that need to be merged
> (for initially random data).  Considering the minimum merge order is
> 6, that increase in runs is likely not to lead to an additional level
> of merging, in which case the extra speed of building the runs would
> definitely win.  But if it does cause an additional level of merge, it
> could end up being a loss.

That's true, but the real hit to the run size should be quite a bit
less than 1.7x, because we'd also be using memory more efficiently,
and therefore more tuples should fit. A little debugging code suggests
that on my MacBook Pro, things come out like this:

- Allocations of 8 bytes of less use 24 bytes.
- Allocations of 9-16 bytes use 32 bytes.
- Allocations of 17-32 bytes use 48 bytes.
- Allocations of 33-64 bytes use 80 bytes.
- Allocations of 65-128 bytes use 144 bytes.
- Allocations of 129-256 bytes use 272 bytes.

In other words, the palloc overhead seems to be: round up to the next
power of two, and add 16 bytes.  This means that if, say, all the rows
are exactly 100 bytes, we're using 44% more memory than we would under
this scheme, which means that the runs would be 44% longer than
otherwise.  Of course that's just an example - if you are sorting 132
byte rows, you'll be able to squeeze in 106% more of them, and if
you're sorting 128 byte rows, you'll only be able to squeeze in 12.5%
more of them.  So there are certainly cases where the runs would get
shorter, especially if you happened to have mostly-sorted data rather
than randomly-ordered data, but not in every case.  Still, it may be a
terrible idea; I only mention it because I did find a lot of
references to people using quicksort to build up the initial runs
rather than our heapsort-based approach.

> Is there some broad corpus of sorting benchmarks which changes could
> be tested against?  I usually end up testing just simple columns of
> integers or small strings, because they are easy to set up.  That is
> not ideal.

I don't know of anything more formal, unfortunately.

>> As a desirable side effect, I think it would mean
>> that we could dispense with retail palloc and pfree altogether.  We
>> could just allocate a big chunk of memory, copy tuples into it until
>> it's full, using a pointer to keep track of the next unused byte, and
>> then, after writing the run, reset the allocation pointer back to the
>> beginning of the buffer.  That would not only avoid the cost of going
>> through palloc/pfree, but also the memory overhead imposed by
>> bookkeeping and power-of-two rounding.
>
> Wouldn't we still need an array of pointers to the start of every
> tuple's location in the buffer?  Or else, how would qsort know where
> to find them?

Yes, we'd still need that.  I don't see any way to get around that; I
just don't like the expense of palloc-ing so many little chunks in
addition to that array.

> Also, to do this we would need to get around the 1GB allocation limit.
>  It is bad enough that memtuples is limited to 1GB, it would be much
> worse if the entire arena was limited to that amount.

Well, we could always allocate multiple 1GB chunks and peel off pieces
from each one in turn.

>> If we do want to stick with the current algorithm, there seem to be
>> some newer techniques for cutting down on the heap maintenance
>> overhead.  Heikki's been investigating that a bit.
>
> Interesting.  Is that investigation around the poor L1/L2 caching
> properties of large heaps?  I was wondering if there might be a way to
> give tuplesort an initial estimate of how much data there was to sort,
> so that it could use a smaller amount of memory than the max if it
> decided that that would lead to better caching effects.  Once you know
> you can't do an in-memory sort, then there is no reason to use more
> memory than the amount that lets you merge all the runs in one go.

No, it's about reducing the number of comparisons needed to maintain
the heap property.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Greg Stark
Date:
On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But it would mean we have about 1.7x  more runs that need to be merged
>> (for initially random data).  Considering the minimum merge order is
>> 6, that increase in runs is likely not to lead to an additional level
>> of merging, in which case the extra speed of building the runs would
>> definitely win.  But if it does cause an additional level of merge, it
>> could end up being a loss.
>
> That's true, but the real hit to the run size should be quite a bit
> less than 1.7x, because we'd also be using memory more efficiently,
> and therefore more tuples should fit.

I'm not sure I believe the 1.7x.  Keep in mind that even after
starting a new tape we can continue to read in new tuples that belong
on the first tape. So even if you have tuples that are more than N
positions out of place (where N is the size of your heap) as long as
you don't have very many you can keep writing out the first tape for
quite a while.

I suspect analyzing truly random inputs is also a bit like saying no
compression algorithm can work on average. Partly sorted data is quite
common and the tapesort algorithm would be able to do a lot of cases
in a single merge that the quicksort and merge would generate a large
number of merges for.

All that said, quicksort and merge would always do no more i/o in
cases where the total number of tuples to be sorted is significantly
less than N^2 since that would be guaranteed to be possible to process
with a single merge pass. (Where "significantly less" has something to
do with how many tuples you have to read in one i/o to be efficient).
That probably covers a lot of cases, and Postgres already has the
stats to predict when it would happen, more or less.

Fwiw when I was doing the top-k sorting I did a bunch of experiements
and came up with a rule-of-thumb that our quicksort was about twice as
fast as our heapsort. I'm not sure whether that's a big deal or not in
this case. Keep in mind that the only win would be reducing the cpu
time on a sort where every tuple was being written to disk and read
back. For most users that won't run noticeably faster, just reduce cpu
time consumption.

--
greg


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Sat, Mar 17, 2012 at 10:47 AM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> But it would mean we have about 1.7x  more runs that need to be merged
>>> (for initially random data).  Considering the minimum merge order is
>>> 6, that increase in runs is likely not to lead to an additional level
>>> of merging, in which case the extra speed of building the runs would
>>> definitely win.  But if it does cause an additional level of merge, it
>>> could end up being a loss.
>>
>> That's true, but the real hit to the run size should be quite a bit
>> less than 1.7x, because we'd also be using memory more efficiently,
>> and therefore more tuples should fit.
>
> I'm not sure I believe the 1.7x.  Keep in mind that even after
> starting a new tape we can continue to read in new tuples that belong
> on the first tape. So even if you have tuples that are more than N
> positions out of place (where N is the size of your heap) as long as
> you don't have very many you can keep writing out the first tape for
> quite a while.
>
> I suspect analyzing truly random inputs is also a bit like saying no
> compression algorithm can work on average.

I'm not sure what you mean here.  The "also" suggests the previous
discussion was about something other than the assumption of
randomness, but that discussion doesn't make much sense unless both
paragraphs are about that.

Anyway I think keeping best/worst cases in mind is certainly a good
thing to do.  I've certainly seen train wrecks unfold when people
assumed anything could be compressed without verifying it.

> Partly sorted data is quite
> common and the tapesort algorithm would be able to do a lot of cases
> in a single merge that the quicksort and merge would generate a large
> number of merges for.

This is where some common and credible corpus would be invaluable.

> All that said, quicksort and merge would always do no more i/o in
> cases where the total number of tuples to be sorted is significantly
> less than N^2 since that would be guaranteed to be possible to process
> with a single merge pass. (Where "significantly less" has something to
> do with how many tuples you have to read in one i/o to be efficient).

Unless the tuples are quite large, the number needed for efficient i/o
is quite high.

> That probably covers a lot of cases, and Postgres already has the
> stats to predict when it would happen, more or less.

Currently sort makes no attempt to use the planner stats.  Would that
be a good thing to work on?

> Fwiw when I was doing the top-k sorting I did a bunch of experiements
> and came up with a rule-of-thumb that our quicksort was about twice as
> fast as our heapsort. I'm not sure whether that's a big deal or not in
> this case.

I've been seeing around 3 fold, and that might go up more with some of
the work being done that speeds up qsort but not tape sort.


> Keep in mind that the only win would be reducing the cpu
> time on a sort where every tuple was being written to disk and read
> back. For most users that won't run noticeably faster, just reduce cpu
> time consumption.

But in many (perhaps most) tape sorts CPU time is most of it.   A lot
of tape sorts are entirely memory backed.  The tuples either never
reach disk, or are partially written but never read, instead being
served back from the file-systems cache.  By changing to a tape sort,
you are buying insurance against a bunch of other sorts also running
simultaneously, but often the insurance doesn't pay off because those
numerous other sorts don't exist and the kernel manages to keep the
few ones that do in RAM anyway.

One avenue for major surgery on the sorts would be for an entry
admission where jobs can negotiable over the memory.  Something like
allocation by halves, but with a possibility that a job could decide
it would rather wait for another sort to finish and its memory to be
freed up, than to make do with what is currently available.

Cheers,

Jeff


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> Anyway, I think the logtape could use redoing.  When your tapes are
>> actually physically tape drives, it is necessary to build up runs one
>> after the other on physical tapes, because un-mounting a tape from a
>> tape drive and remounting another tape is not very feasible at scale.
>> Which then means you need to place your runs carefully, because if the
>> final merge finds that two runs it needs are back-to-back on one tape,
>> that is bad.  But with disks pretending to be tapes, you could
>> re-arrange the "runs" with just some book keeping.  Maintaining the
>> distinction between "tapes" and "runs" is pointless, which means the
>> Fibonacci placement algorithm is pointless as well.
>
> I think you're right.  It seems to me that we could just write out an
> arbitrary number of runs, one per file, ending up with files number
> 1..N.

The problem there is that none of the files can be deleted until it
was entirely read, so you end up with all the data on disk twice.  I
don't know how often people run their databases so close to the edge
on disk space that this matters, but someone felt that that extra
storage was worth avoiding.  We could still get rid of the run/tape
distinction but keep the block recycling method--they are conceptually
distinct things.   (But hopefully improve the block recycling so the
locality doesn't degrade as much as it seems to now)

> If we can do a final merge of N runs without overrunning
> work_mem, fine.  If not, we merge the first K runs (for the largest
> possible K) and write out a new run N+1.  The range of valid run
> number is now K+1..N+1.  If those can be merged in a single pass, we
> do so; otherwise we again merge the first K runs (K+1 through 2K) to
> create a new run N+2.  And so on.
>
> I am not clear, however, whether this would be any faster.  It may not
> be worth tinkering with just for the reduction in code complexity.

Yeah, I was thinking it would only be worth redoing if the current
implementation interferes with other improvements--which I think is
likely, but don't have any concrete ideas yet where it would.

...

>>> As a desirable side effect, I think it would mean
>>> that we could dispense with retail palloc and pfree altogether.  We
>>> could just allocate a big chunk of memory, copy tuples into it until
>>> it's full, using a pointer to keep track of the next unused byte, and
>>> then, after writing the run, reset the allocation pointer back to the
>>> beginning of the buffer.  That would not only avoid the cost of going
>>> through palloc/pfree, but also the memory overhead imposed by
>>> bookkeeping and power-of-two rounding.
>>
>> Wouldn't we still need an array of pointers to the start of every
>> tuple's location in the buffer?  Or else, how would qsort know where
>> to find them?
>
> Yes, we'd still need that.  I don't see any way to get around that; I
> just don't like the expense of palloc-ing so many little chunks in
> addition to that array.

Right, OK, I had thought you meant the power of two rounding of
mem_tuples, I overlooked the power of two rounding of palloc.


>
>> Also, to do this we would need to get around the 1GB allocation limit.
>>  It is bad enough that memtuples is limited to 1GB, it would be much
>> worse if the entire arena was limited to that amount.
>
> Well, we could always allocate multiple 1GB chunks and peel off pieces
> from each one in turn.

I thought of that with mem_tuples itself, but I think it would be more
difficult to do that than just fixing the allocation limit would be.
Using multiple chunks might be easier with the buffer space as you
don't need to do heap arithmetic on those.

>>> If we do want to stick with the current algorithm, there seem to be
>>> some newer techniques for cutting down on the heap maintenance
>>> overhead.  Heikki's been investigating that a bit.
>>
>> Interesting.  Is that investigation around the poor L1/L2 caching
>> properties of large heaps?  I was wondering if there might be a way to
>> give tuplesort an initial estimate of how much data there was to sort,
>> so that it could use a smaller amount of memory than the max if it
>> decided that that would lead to better caching effects.  Once you know
>> you can't do an in-memory sort, then there is no reason to use more
>> memory than the amount that lets you merge all the runs in one go.
>
> No, it's about reducing the number of comparisons needed to maintain
> the heap property.

That sounds very interesting.  I didn't know it was even theoretically
possible to do that.

Cheers,

Jeff


Re: Memory usage during sorting

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Anyway, I think the logtape could use redoing.

> The problem there is that none of the files can be deleted until it
> was entirely read, so you end up with all the data on disk twice.  I
> don't know how often people run their databases so close to the edge
> on disk space that this matters, but someone felt that that extra
> storage was worth avoiding.

Yeah, that was me, and it came out of actual user complaints ten or more
years back.  (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.)  We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that.  It's possible
that with the growth of hard drive sizes, real-world applications would
no longer care that much about whether the space required to sort is 4X
data size rather than 1X.  Or then again, maybe their data has grown
just as fast and they still care.

> As a desirable side effect, I think it would mean
> that we could dispense with retail palloc and pfree altogether. �We
> could just allocate a big chunk of memory, copy tuples into it until
> it's full, using a pointer to keep track of the next unused byte, and
> then, after writing the run, reset the allocation pointer back to the
> beginning of the buffer. �That would not only avoid the cost of going
> through palloc/pfree, but also the memory overhead imposed by
> bookkeeping and power-of-two rounding.

That would be worthwhile, probably.  The power-of-2 rounding in palloc
is not nice at all for tuplesort's purposes.  We've occasionally talked
about inventing a different memory context type that is less willing to
waste space ... but on the other hand, such an allocator would run
slower, so you're right back to making a speed for space tradeoff.
If the tuples could be deallocated in bulk then that catch-22 goes away.

>> No, it's about reducing the number of comparisons needed to maintain
>> the heap property.

> That sounds very interesting.  I didn't know it was even theoretically
> possible to do that.

So has somebody found a hole in the n log n lower bound on the cost of
comparison-based sorting?  I thought that had been proven pretty
rigorously.
        regards, tom lane


Re: Memory usage during sorting

From
Jeremy Harris
Date:
On 2012-03-18 15:25, Tom Lane wrote:
> Jeff Janes<jeff.janes@gmail.com>  writes:
>> On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas<robertmhaas@gmail.com>  wrote:
>> The problem there is that none of the files can be deleted until it
>> was entirely read, so you end up with all the data on disk twice.  I
>> don't know how often people run their databases so close to the edge
>> on disk space that this matters, but someone felt that that extra
>> storage was worth avoiding.
>
> Yeah, that was me, and it came out of actual user complaints ten or more
> years back.  (It's actually not 2X growth but more like 4X growth
> according to the comments in logtape.c, though I no longer remember the
> exact reasons why.)  We knew when we put in the logtape logic that we
> were trading off speed for space, and we accepted that.

How about a runtime check of disk-free?
-- 
Jeremy



Re: Memory usage during sorting

From
Tom Lane
Date:
Jeremy Harris <jgh@wizmail.org> writes:
> On 2012-03-18 15:25, Tom Lane wrote:
>> Yeah, that was me, and it came out of actual user complaints ten or more
>> years back.  (It's actually not 2X growth but more like 4X growth
>> according to the comments in logtape.c, though I no longer remember the
>> exact reasons why.)  We knew when we put in the logtape logic that we
>> were trading off speed for space, and we accepted that.

> How about a runtime check of disk-free?

Not very workable, the foremost reason why not being that it assumes
that the current sort is the only thing consuming disk space.  I'm not
sure there is any portable method of finding out how much disk space
is available, anyway.  (Think user quotas and such before supposing
you know how to do that.)
        regards, tom lane


Re: Memory usage during sorting

From
Greg Stark
Date:
On Sun, Mar 18, 2012 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> No, it's about reducing the number of comparisons needed to maintain
>>> the heap property.
>
>> That sounds very interesting.  I didn't know it was even theoretically
>> possible to do that.
>
> So has somebody found a hole in the n log n lower bound on the cost of
> comparison-based sorting?  I thought that had been proven pretty
> rigorously.

I think what he's describing is, for example, say you have a heap of
1,000 elements and that lets you do the entire sort in a single pass
-- i.e. it generates a single sorted run after the first pass.
Increasing the heap size to 2,000 will just mean doing an extra
comparison for each tuple to sift up. And increasing the heap size
further will just do more and more work for nothing. It's still nlogn
but the constant factor is slightly higher. Of course to optimize this
you would need to be able to see the future.

I always thought the same behaviour held for if the heap was larger
than necessary to do N merge passes. that is, making the heap larger
might generate fewer tapes but the still enough to require the same
number of passes. However trying to work out an example I'm not too
sure. If you generate fewer tapes then the heap you use to do the
merge is smaller so it might work out the same (or more likely, you
might come out ahead).

--
greg


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Sun, Mar 18, 2012 at 8:56 AM, Greg Stark <stark@mit.edu> wrote:
> On Sun, Mar 18, 2012 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> No, it's about reducing the number of comparisons needed to maintain
>>>> the heap property.
>>
>>> That sounds very interesting.  I didn't know it was even theoretically
>>> possible to do that.
>>
>> So has somebody found a hole in the n log n lower bound on the cost of
>> comparison-based sorting?  I thought that had been proven pretty
>> rigorously.
>
> I think what he's describing is, for example, say you have a heap of
> 1,000 elements and that lets you do the entire sort in a single pass
> -- i.e. it generates a single sorted run after the first pass.
> Increasing the heap size to 2,000 will just mean doing an extra
> comparison for each tuple to sift up. And increasing the heap size
> further will just do more and more work for nothing. It's still nlogn
> but the constant factor is slightly higher. Of course to optimize this
> you would need to be able to see the future.
>
> I always thought the same behaviour held for if the heap was larger
> than necessary to do N merge passes. that is, making the heap larger
> might generate fewer tapes but the still enough to require the same
> number of passes. However trying to work out an example I'm not too
> sure. If you generate fewer tapes then the heap you use to do the
> merge is smaller so it might work out the same (or more likely, you
> might come out ahead).

Except for rounding effects, it does come out the same.  Every extra
layer on the tuple-heap during the initial run building causes a
reduced layer on the tape-heap during the merge.  So they wash.  I
haven't analyzed the exact rounding effects in detail, but by just
instrumenting the code I found a huge tuple-heap with a two-tape merge
at the end used less than one percent more comparisons, but was >40%
slower, then a lower-memory sort which used a modest sized tuple-heap
followed by a modest-size tape-heap merge.    My conclusion is that it
isn't the number of comparison that drove the difference, but the
number of on-chip cache misses.  Two modest sized heaps are more cache
friendly than one giant heap and one tiny heap.

Of course if the data is partially sorted in a way that is apparent
over large ranges but not short ranges, the larger initial heap will
capture that during the initial run construction while the smaller one
will not.

Cheers,

Jeff


Re: Memory usage during sorting

From
Robert Haas
Date:
On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So has somebody found a hole in the n log n lower bound on the cost of
> comparison-based sorting?  I thought that had been proven pretty
> rigorously.

There's not much danger of anyone finding a way around that bound
since the proof is quite trivial, but keep in mind that it's a
worst-case bound.  For example, if you have reason to believe that the
input is likely to already be sorted, you can check for that case in
using just n-1 comparisons.  If it turns out you're right, you can
"sort" the data in linear time.  Heck, you can also check for
reverse-sorted input at the same time, if you like, and handle that
special case too.  Of course, when the data is unordered, such
special-case checks hurt rather than help, even though it's still O(n
lg n) overall; those pesky constant factors do matter quite a bit.
Still, sometimes it's worth it: our quicksort algorithm checks for
sorted input on each recursive invocation, since quicksort degrades
rather badly in that scenario; but our heapsort doesn't contain a
similar check, probably because heapsort typically works fine in that
scenario.

One thing that seems inefficient to me about our current algorithm is
the use of the run number as a leading column in the sort key.
There's no real reason why the tuples destined for the next run need
to be maintained in heap order; we could just store them unordered and
heapify the whole lot of them when it's time to start the next run.
That ought to save comparison cycles for the current run, since the
heap will be less deep, and there are other possible savings as well -
in particular, an unordered array of tuples can be heapify'd in linear
time, whereas our current algorithm is O(n lg n).  However, when I
actually implemented this, I found that doing it this way was a loser.In fact, excluding the next-run tuples from the
heapfor the current 
run was a BIG loser even before you add in the time required to
re-heapify.  This confused the daylights out of me for a while,
because it's hard to understand how insert and siftup can be slower on
a small heap than a large one.

My working theory is that, even though we must necessarily do more
comparisons when manipulating a larger heap, many of those comparisons
are resolved by comparing run numbers, and that's a lot cheaper than
invoking the real comparator.  For example, if we insert into a heap
whose rightmost child is in the next run, and the new tuple goes into
the current run, and the current size of the heap is such that the
initial position of the new element is a descendent of the right node,
it will very quickly crash through all of the next-run tuples and only
need one REAL comparison - against the root.  Either the new element
ends up as the root, or it ends up as the direct child of the root;
now we remove the root and, perhaps, replace it with its rightmost
child, meaning that the next element we read in can do the exact same
thing all over again.  If we exclude the next-run elements from the
heap, then the average depth of the heap when inserting a new element
is smaller, but all the elements are in the same-run, and we have to
invoke the real comparator every time.  In other words, those next-run
tuples act as dummies which effectively create a heap of uneven depth,
and the average depth encountered when inserting tuples turns out to
be less than what we get by pulling out the dummies and making the
depth uniform.

This makes me kind of nervous, because I don't understand why things
should work out so well as all that.  If throwing some dummy elements
into the heap makes things work better, then maybe we should throw in
some more.  Or maybe it would work better to take some but not all of
them out.  There certainly doesn't seem to be any indication in the
code that this is an anticipated effect, and it seems an awfully
providential accident.

It may (or may not) be related to the effect that Greg Stark mentions
in a nearby post: increasing work_mem makes the heap bigger, so heap
maintenance gets more expensive, sometimes without any corresponding
benefit.  For example, if the input happens to already be sorted, then
inserting into the heap is cheap, because each new element is already
greater than its parent, and life is good.  But removing from the heap
is painfully expensive, because to remove element from the heap we
stick the last element in the heap into the whole left by the
departing element and then sift it down.  However, that requires
2*heap_depth comparisons, all of which involve really invoking the
comparison function since everything's going to end up going into a
single run, and that gets more and more expensive as work_mem
increases.  So a heap sort of already-sorted data seems to get slower
and slower the more memory you give it to work with.  For example, a
sort of 5 million random strings of length 30, all in shared_buffers,
takes 22.03 seconds on my MacBook Pro with work_mem = 128MB, but just
17.08 seconds with work_mem = 1MB.

It turns out that it's possible to reduce the number of comparisons
required to reinsert the last heap element from 2*heap_depth to
heap_depth+lg(heap_depth).  At each node, compare the two children,
and then let the current node be the lesser of those.  When you reach
a leaf node, you've walked a path of length heap_depth which is known
to be sorted (since we've got a heap), so you can find the correct
reinsertion position by binary searching that path.  In the case of
the sorted data mentioned above, this reduces the time to 19.45
seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB.
However, in other cases, it seems not to help at all, or even be a
slight regression.  An obvious weakness of this algorithm is that the
linear search we use now can terminate early, if it turns out that the
element we're reinserting doesn't need to percolate downward very far,
whereas this can't: so the 2*heap_depth bound is worst-case, but the
heap_depth+lg(heap_depth) bound is exact.  The problem of some
comparisons being much cheaper than other also acts to muddy the
waters - if there's just one element in the heap destined for the next
run, reinserting it will be only about half as expensive, since we'll
compare the descendents of each node to each other, but the comparison
against the next-run element will be very cheap, and if all the tuples
are the same size we may end up reinserting that particular node very
frequently.

Heikki wrote a patch implementing another "smart" reinsertion
algorithm (what we call tuplesort_heap_siftup) which I won't attempt
to explain that seems to work even a little bit better than the one
mentioned above, but it seems to fall prey to some of the same issues
- there are cases where it helps, but there are also cases where it
doesn't do much or even regresses slightly, and I don't think either
of us understand exactly how to tickle the various cases, let alone
what causes them.  What does seem clear is that the great bulk of the
cost of a heap sort seems to be attributable to tuplesort_heap_siftup,
and if we can find something that reduces the number of comparisons
required for that operation we will more-or-less proportional speedup
in heap-sorting overall.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Greg Stark
Date:
On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> There's no real reason why the tuples destined for the next run need
> to be maintained in heap order; we could just store them unordered and
> heapify the whole lot of them when it's time to start the next run.

This sounded familiar....

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e

-- 
greg


Re: Memory usage during sorting

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> There's no real reason why the tuples destined for the next run need
>> to be maintained in heap order; we could just store them unordered and
>> heapify the whole lot of them when it's time to start the next run.

> This sounded familiar....
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e

Yeah, see also the pgsql-hackers thread starting here:
http://archives.postgresql.org/pgsql-hackers/1999-10/msg00384.php

That was a long time ago, of course, but I have some vague recollection
that keeping next-run tuples in the current heap achieves a net savings
in the total number of comparisons needed to heapify both runs.
Robert's point about integer comparisons being faster than data
comparisons may or may not be relevant.  Certainly they are faster, but
there are never very many run numbers in the heap at once (possibly no
more than 2, I forget; and in any case often only 1).  So I'd expect
most tuple comparisons to end up having to do a data comparison anyway.
        regards, tom lane


Re: Memory usage during sorting

From
Greg Stark
Date:
On Tue, Mar 20, 2012 at 1:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That was a long time ago, of course, but I have some vague recollection
> that keeping next-run tuples in the current heap achieves a net savings
> in the total number of comparisons needed to heapify both runs.

Offhand I wonder if this is all because we don't have the O(n) heapify
implemented.

-- 
greg


Re: Memory usage during sorting

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Offhand I wonder if this is all because we don't have the O(n) heapify
> implemented.

Robert muttered something about that before, but is it real?  If you
could do that, I'd think you'd have a less-than-n-log-n sorting
solution.
        regards, tom lane


Re: Memory usage during sorting

From
Greg Stark
Date:

Re: Memory usage during sorting

From
Jeff Janes
Date:
On Tue, Mar 20, 2012 at 6:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> Offhand I wonder if this is all because we don't have the O(n) heapify
>> implemented.

I think we do already have it implemented.  1/2 the time the tuple
stays where it is after one comparison, 1/4 it moves up one level with
two comparisons, 1/8 it moves up two levels with 3 comparisons, etc.
That series sums up to a constant.  Maybe there is a worst-case that
makes this fall apart, though.  Heapifying something which is already
reverse sorted, maybe?

> Robert muttered something about that before, but is it real?  If you
> could do that, I'd think you'd have a less-than-n-log-n sorting
> solution.

Turning random tuples into heap can be linear.  Extracting them while
maintaining the heap is NlogN, though.  You can't sort without the
extraction step, so the law is preserved.

Cheers,

Jeff


Re: Memory usage during sorting

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> http://en.wikipedia.org/wiki/Binary_heap#Building_a_heap

Interesting.  I'm pretty sure that idea appears nowhere in Knuth
(which might mean it's new enough to have a live patent on it ...
anybody know who invented this?).  But it seems like that should buy
back enough comparisons to justify leaving the next-run tuples out of
the heap (unordered) until the heap becomes empty.  You still want to
insert new tuples into the heap if they can go to the current run, of
course.
        regards, tom lane


Re: Memory usage during sorting

From
Robert Haas
Date:
On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Mar 20, 2012 at 1:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That was a long time ago, of course, but I have some vague recollection
>> that keeping next-run tuples in the current heap achieves a net savings
>> in the total number of comparisons needed to heapify both runs.
>
> Offhand I wonder if this is all because we don't have the O(n) heapify
> implemented.

I'm pretty sure that's not the problem.  Even though our heapify is
not as efficient as it could be, it's plenty fast enough.  I thought
about writing a patch to implement the better algorithm, but it seems
like a distraction at this point because the heapify step is such a
small contributor to overall sort time.  What's taking all the time is
the repeated siftup operations as we pop things out of the heap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Robert Haas
Date:
On Tue, Mar 20, 2012 at 12:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> http://en.wikipedia.org/wiki/Binary_heap#Building_a_heap
>
> Interesting.  I'm pretty sure that idea appears nowhere in Knuth
> (which might mean it's new enough to have a live patent on it ...
> anybody know who invented this?).

It's in every introductory algorithms textbook; I'd be shocked if
anyone could make an IP claim on it.

> But it seems like that should buy
> back enough comparisons to justify leaving the next-run tuples out of
> the heap (unordered) until the heap becomes empty.  You still want to
> insert new tuples into the heap if they can go to the current run, of
> course.

It seems like it should, but if you read (or reread) my long boring
analysis upthread, you'll learn that it doesn't.  It's slower even if
the cost of building a heap is zero.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark <stark@mit.edu> wrote:
>> Offhand I wonder if this is all because we don't have the O(n) heapify
>> implemented.

> I'm pretty sure that's not the problem.  Even though our heapify is
> not as efficient as it could be, it's plenty fast enough.  I thought
> about writing a patch to implement the better algorithm, but it seems
> like a distraction at this point because the heapify step is such a
> small contributor to overall sort time.  What's taking all the time is
> the repeated siftup operations as we pop things out of the heap.

Right, but wouldn't getting rid of the run-number comparisons provide
some marginal improvement in the speed of tuplesort_heap_siftup?

BTW, there's a link at the bottom of the wikipedia page to a very
interesting ACM Queue article, which argues that the binary-tree
data structure isn't terribly well suited to virtual memory because
it touches random locations in succession.  I'm not sure I believe
his particular solution, but I'm wondering about B+ trees, ie more
than 2 children per node.
        regards, tom lane


Re: Memory usage during sorting

From
Robert Haas
Date:
On Tue, Mar 20, 2012 at 12:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark <stark@mit.edu> wrote:
>>> Offhand I wonder if this is all because we don't have the O(n) heapify
>>> implemented.
>
>> I'm pretty sure that's not the problem.  Even though our heapify is
>> not as efficient as it could be, it's plenty fast enough.  I thought
>> about writing a patch to implement the better algorithm, but it seems
>> like a distraction at this point because the heapify step is such a
>> small contributor to overall sort time.  What's taking all the time is
>> the repeated siftup operations as we pop things out of the heap.
>
> Right, but wouldn't getting rid of the run-number comparisons provide
> some marginal improvement in the speed of tuplesort_heap_siftup?

No.  It does the opposite: it slows it down.  This is a highly
surprising result but it's quite repeatable: removing comparisons
makes it slower.  As previously pontificated, I think this is probably
because the heap can fill up with next-run tuples that are cheap to
compare against, and that spares us having to do "real" comparisons
involving the actual datatype comparators.

> BTW, there's a link at the bottom of the wikipedia page to a very
> interesting ACM Queue article, which argues that the binary-tree
> data structure isn't terribly well suited to virtual memory because
> it touches random locations in succession.  I'm not sure I believe
> his particular solution, but I'm wondering about B+ trees, ie more
> than 2 children per node.

I don't think virtual memory locality is the problem.  I read
somewhere that a ternary heap is supposed to be about one-eighth
faster than a binary heap, but that's because picking the smallest of
three tuples requires two comparisons, whereas picking the smallest of
four tuples requires three comparisons, which is better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Greg Stark
Date:
On Tue, Mar 20, 2012 at 5:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> No.  It does the opposite: it slows it down.  This is a highly
> surprising result but it's quite repeatable: removing comparisons
> makes it slower.  As previously pontificated, I think this is probably
> because the heap can fill up with next-run tuples that are cheap to
> compare against, and that spares us having to do "real" comparisons
> involving the actual datatype comparators.

Frankly that analysis didn't make any sense to me at the time.
Comparing integers is fast, sure, but it's still slower than not
having to do any comparison at all.

It's just barely conceivable that it's a cache effect on the *code*
but even that seems pretty darned unlikely to me. My money currently
is on a measurement error but that's just a guess since I haven't
tried to replicate any of your results.

>
>> BTW, there's a link at the bottom of the wikipedia page to a very
>> interesting ACM Queue article, which argues that the binary-tree
>> data structure isn't terribly well suited to virtual memory because
>> it touches random locations in succession.  I'm not sure I believe
>> his particular solution, but I'm wondering about B+ trees, ie more
>> than 2 children per node.
>
> I don't think virtual memory locality is the problem.  I read
> somewhere that a ternary heap is supposed to be about one-eighth
> faster than a binary heap, but that's because picking the smallest of
> three tuples requires two comparisons, whereas picking the smallest of
> four tuples requires three comparisons, which is better.

The things I was reading suggested 4-heap was more cache efficient
which is a heap with 4-children per node and still representable as a
flat array. There's also Fibonacci heap which makes insertions really
fast but since we're doing exactly as many insertions as deletions the
question is how much work it adds to deletions. It's still O(logn)
amortized but it may be 2x the constant factor.

Fwiw I think more interesting than improving tapesort would be
implementing wholly different algorithms like radix sort or the
repeated quicksort. Being able to handle different algorithms that
require a different API would be the first step to being able to
handle parallel algorithms using threads or GPUs.

I also wonder if disabling the space reuse and just generating
separate files for each run might not be a win on exotic filesystems
like ZFS or WAFL where overwriting blocks is more expensive than
allocating new blocks.

--
greg


Re: Memory usage during sorting

From
Robert Haas
Date:
On Tue, Mar 20, 2012 at 3:41 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Mar 20, 2012 at 5:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> No.  It does the opposite: it slows it down.  This is a highly
>> surprising result but it's quite repeatable: removing comparisons
>> makes it slower.  As previously pontificated, I think this is probably
>> because the heap can fill up with next-run tuples that are cheap to
>> compare against, and that spares us having to do "real" comparisons
>> involving the actual datatype comparators.
>
> Frankly that analysis didn't make any sense to me at the time.
> Comparing integers is fast, sure, but it's still slower than not
> having to do any comparison at all.

I think you're underestimating how much it costs to call the
datatype-specific comparator.  My belief is that it's wicked
expensive.   The COMPARETUP() macro extracts a function pointer from
the Tuplesortstate and calls it; we end up comparetup_heap, which
calls ApplySortComparator(), which pulls the comparator function out
of the state and then calls that.  Since I was sorting strings, which
have no sortsupport, we then end up in comparison_shim(), which uses
the FunctionCallInvoke method to extract the actual function pointer
and jump into bttextcmp(), which unpacks its arguments and then calls
text_cmp(), which unpacks its arguments some more and then calls
varstr_cmp() where the actual work happens.  That's not trivial either
- we have to call lc_collate_is_c() and then memcmp().  I have no
problem believing that 6 levels of function calls, 3 of which involve
jumps through function pointers, followed by lc_collate_is_c() and
memcmp() is 100x or more as expensive than the lone integer comparison
that happens when the tupindex values don't match - that's a single
instruction.

> Fwiw I think more interesting than improving tapesort would be
> implementing wholly different algorithms like radix sort or the
> repeated quicksort. Being able to handle different algorithms that
> require a different API would be the first step to being able to
> handle parallel algorithms using threads or GPUs.

Yeah, I think I'm going to try implementing
quicksort-the-whole-batch-and-dump-it-out-as-a-run algorithm, just to
see how good or bad that is compared to what we have now.  We may not
end up doing anything that remotely resembles that, in the end, but I
want to see the numbers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Jim Nasby
Date:
On 3/18/12 10:25 AM, Tom Lane wrote:
> Jeff Janes<jeff.janes@gmail.com>  writes:
>> >  On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas<robertmhaas@gmail.com>  wrote:
>>> >>  On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes<jeff.janes@gmail.com>  wrote:
>>>> >>>  Anyway, I think the logtape could use redoing.
>> >  The problem there is that none of the files can be deleted until it
>> >  was entirely read, so you end up with all the data on disk twice.  I
>> >  don't know how often people run their databases so close to the edge
>> >  on disk space that this matters, but someone felt that that extra
>> >  storage was worth avoiding.
> Yeah, that was me, and it came out of actual user complaints ten or more
> years back.  (It's actually not 2X growth but more like 4X growth
> according to the comments in logtape.c, though I no longer remember the
> exact reasons why.)  We knew when we put in the logtape logic that we
> were trading off speed for space, and we accepted that.  It's possible
> that with the growth of hard drive sizes, real-world applications would
> no longer care that much about whether the space required to sort is 4X
> data size rather than 1X.  Or then again, maybe their data has grown
> just as fast and they still care.
>

I believe the case of tape sorts that fit entirely in filesystem cache is a big one as well... doubling or worse the
amountof data that needed to live "on disk" at once would likely suck in that case.
 

Also, it's not uncommon to be IO-bound on a database server... so even if we're not worried about storing everything 2
ormore times from a disk space standpoint, we should be concerned about the IO bandwidth.
 
-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


Re: Memory usage during sorting

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, I think I'm going to try implementing
> quicksort-the-whole-batch-and-dump-it-out-as-a-run algorithm, just to
> see how good or bad that is compared to what we have now.  We may not
> end up doing anything that remotely resembles that, in the end, but I
> want to see the numbers.

The reason replacement selection is attractive is that the initial runs
tend to be about twice as long as you can get if you just sort
memory-loads independently.  (And that's for random input, the win can
be a lot more on partially ordered data.)  It seems unlikely that you
will get enough win from quicksorting to overcome that; especially not
if your thesis is correct that all the cost is coming from comparison
infrastructure.

But don't let me discourage you from measuring it ...
        regards, tom lane


Re: Memory usage during sorting

From
Greg Stark
Date:
On Tue, Mar 20, 2012 at 8:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Frankly that analysis didn't make any sense to me at the time.
>> Comparing integers is fast, sure, but it's still slower than not
>> having to do any comparison at all.
>
> I think you're underestimating how much it costs to call the
> datatype-specific comparator.  My belief is that it's wicked
> expensive.

I'm totally with you on the datatype-specific comparator being expensive.

But we must be talking about two different scenarios. I don't see why
Tom's algorithm was slower than Knuth's unless there was a bug. It
seems to me it should perform exactly as many comparator calls but
save the integer comparisons and the extra space for them.

In the current algorithm, Knuth's, it compares the new tuple against
the most recently emitted tuple to set the run number then adds it to
the bottom of the heap and sifts up. If it's from the current run it
does a bunch of integer comparisons and skips past the next run
quickly. If it's from the next run it sifts up to the right spot in
the next run and if it hits the top of the next run it does a quick
integer comparison and stops.

In Tom's algorithm it would perform the same comparison against the
recently emitted tuple to set the run number and either add it to the
unsorted list or the bottom of the heap. If it's added to the unsorted
list we're done, if it's added to the bottom of the heap it performs
the same siftup it would have done above except that it skips the
bottom few levels of the heap -- all of which were fast integer
comparisons. They might be fast but they can't be faster than doing
nothing at all. When the heap is empty then we do a heapify which
currently would be exactly the same O(nlogn) comparisons that we do
maintaining the bottom of the heap but with a O(n) heapify would be
fewer.

--
greg


Re: Memory usage during sorting

From
Robert Haas
Date:
On Tue, Mar 20, 2012 at 6:16 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Mar 20, 2012 at 8:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Frankly that analysis didn't make any sense to me at the time.
>>> Comparing integers is fast, sure, but it's still slower than not
>>> having to do any comparison at all.
>>
>> I think you're underestimating how much it costs to call the
>> datatype-specific comparator.  My belief is that it's wicked
>> expensive.
>
> I'm totally with you on the datatype-specific comparator being expensive.
>
> But we must be talking about two different scenarios. I don't see why
> Tom's algorithm was slower than Knuth's unless there was a bug. It
> seems to me it should perform exactly as many comparator calls but
> save the integer comparisons and the extra space for them.

It seemed that way to me, too, but it wasn't.

> In the current algorithm, Knuth's, it compares the new tuple against
> the most recently emitted tuple to set the run number then adds it to
> the bottom of the heap and sifts up. If it's from the current run it
> does a bunch of integer comparisons and skips past the next run
> quickly. If it's from the next run it sifts up to the right spot in
> the next run and if it hits the top of the next run it does a quick
> integer comparison and stops.

Check.

> In Tom's algorithm it would perform the same comparison against the
> recently emitted tuple to set the run number and either add it to the
> unsorted list or the bottom of the heap. If it's added to the unsorted
> list we're done,

Check.

> if it's added to the bottom of the heap it performs
> the same siftup it would have done above except that it skips the
> bottom few levels of the heap -- all of which were fast integer
> comparisons.

I think this is where the wheels come off.  It's excessively
optimistic to suppose that we're going to skip the "bottom few levels"
of the heap.  Half of the elements in the heap have no children at
all; and half of the remainder are parents but not grand-parents.  If
you assume, at a given point in time, that half of the tuples we're
holding onto are for the next run, then the heap is ONE level
shallower than it would otherwise have been, and you figure to save
ONE integer comparison.  If we're relatively early in the run and only
one-tenth of the tuples we're holding onto are for the next run, then
heap is barely shallower at all and we're scarcely avoiding anything.

But having even a small number of next-run tuples scattered through
the heap gives us the opportunity to "get lucky".  If the heap size is
N, even lg N next-run tuples in the heap is potentially enough for us
to avoid most of the calls to a real comparator, if it so happens that
all of those tuples are our ancestors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Jeff Janes
Date:
On Mon, Mar 19, 2012 at 12:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
...
>
> One thing that seems inefficient to me about our current algorithm is
> the use of the run number as a leading column in the sort key.
> There's no real reason why the tuples destined for the next run need
> to be maintained in heap order; we could just store them unordered and
> heapify the whole lot of them when it's time to start the next run.
> That ought to save comparison cycles for the current run, since the
> heap will be less deep, and there are other possible savings as well -
> in particular, an unordered array of tuples can be heapify'd in linear
> time, whereas our current algorithm is O(n lg n).  However, when I
> actually implemented this, I found that doing it this way was a loser.
>  In fact, excluding the next-run tuples from the heap for the current
> run was a BIG loser even before you add in the time required to
> re-heapify.  This confused the daylights out of me for a while,
> because it's hard to understand how insert and siftup can be slower on
> a small heap than a large one.
>
> My working theory is that, even though we must necessarily do more
> comparisons when manipulating a larger heap, many of those comparisons
> are resolved by comparing run numbers, and that's a lot cheaper than
> invoking the real comparator.  For example, if we insert into a heap
> whose rightmost child is in the next run, and the new tuple goes into
> the current run, and the current size of the heap is such that the
> initial position of the new element is a descendent of the right node,
> it will very quickly crash through all of the next-run tuples and only
> need one REAL comparison - against the root.  Either the new element
> ends up as the root, or it ends up as the direct child of the root;
> now we remove the root and, perhaps, replace it with its rightmost
> child, meaning that the next element we read in can do the exact same
> thing all over again.  If we exclude the next-run elements from the
> heap, then the average depth of the heap when inserting a new element
> is smaller, but all the elements are in the same-run, and we have to
> invoke the real comparator every time.  In other words, those next-run
> tuples act as dummies which effectively create a heap of uneven depth,
> and the average depth encountered when inserting tuples turns out to
> be less than what we get by pulling out the dummies and making the
> depth uniform.
>
> This makes me kind of nervous, because I don't understand why things
> should work out so well as all that.  If throwing some dummy elements
> into the heap makes things work better, then maybe we should throw in
> some more.  Or maybe it would work better to take some but not all of
> them out.  There certainly doesn't seem to be any indication in the
> code that this is an anticipated effect, and it seems an awfully
> providential accident.

Is there a patch or a git repo available for this change?  I'd like to
see if I can analyze that if I get some spare time.

...

> It turns out that it's possible to reduce the number of comparisons
> required to reinsert the last heap element from 2*heap_depth to
> heap_depth+lg(heap_depth).  At each node, compare the two children,
> and then let the current node be the lesser of those.  When you reach
> a leaf node, you've walked a path of length heap_depth which is known
> to be sorted (since we've got a heap), so you can find the correct
> reinsertion position by binary searching that path.  In the case of
> the sorted data mentioned above, this reduces the time to 19.45
> seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB.
> However, in other cases, it seems not to help at all, or even be a
> slight regression.

Clever.  Rather than doing a binary search of the path, it might make
sense to do a reverse search of it.  The tuple is likely to send up
somewhere at the bottom of the heap, both because that is where most
of the data is, and because the tuple being reinserted just came from
the bottom so it is likely to be biased that way.

Cheers,

Jeff


Re: Memory usage during sorting

From
Jeff Davis
Date:
On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote:
> Yeah, that was me, and it came out of actual user complaints ten or more
> years back.  (It's actually not 2X growth but more like 4X growth
> according to the comments in logtape.c, though I no longer remember the
> exact reasons why.)  We knew when we put in the logtape logic that we
> were trading off speed for space, and we accepted that.

I skimmed through TAOCP, and I didn't find the 4X number you are
referring to, and I can't think what would cause that, either. The exact
wording in the comment in logtape.c is "4X the actual data volume", so
maybe that's just referring to per-tuple overhead?

However, I also noticed that section 5.4.4 (Vol 3 p299) starts
discussing the idea of running the tapes backwards and forwards. That
doesn't directly apply, because a disk seek is cheaper than rewinding a
tape, but perhaps it could be adapted to get the block-freeing behavior
we want. The comments in logtape.c say:
 "Few OSes allow arbitrary parts of a file to be released back to the
OS, so we have to implement this space-recycling ourselves within a
single logical file."

But if we alternate between reading in forward and reverse order, we can
make all of the deallocations at the end of the file, and then just
truncate to free space. I would think that the OS could be more
intelligent about block allocations and deallocations to avoid too much
fragmentation, and it would probably be a net reduction in code
complexity. Again, the comments in logtape.c have something to say about
it:
  "...but the seeking involved should be comparable to what would
happen if we kept each logical tape in a separate file"

But I don't understand why that is the case.

On another topic, quite a while ago Len Shapiro (PSU professor)
suggested to me that we implement Algorithm F (Vol 3 p321). Right now,
as tuplesort.c says:
  "In the current code we determine the number of tapes M on the basis
of workMem: we want workMem/M to be large enough that we read a fair
amount of data each time we preread from a tape"

But with forcasting, we can be a little smarter about which tapes we
preread from if the data in the runs is not random. That means we could
potentially merge more runs at once with the same work_mem without
sacrificing adequate buffers for prefetching. I'm not sure whether this
is a practical problem today, and I'm also not sure what to do if we
start merging a lot more runs and then determine that forcasting doesn't
work as well as we'd hoped (e.g. that the data in the runs really is
random). But I thought it was worth mentioning.

Regards,Jeff Davis



Re: Memory usage during sorting

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote:
>> Yeah, that was me, and it came out of actual user complaints ten or more
>> years back.  (It's actually not 2X growth but more like 4X growth
>> according to the comments in logtape.c, though I no longer remember the
>> exact reasons why.)  We knew when we put in the logtape logic that we
>> were trading off speed for space, and we accepted that.

> I skimmed through TAOCP, and I didn't find the 4X number you are
> referring to, and I can't think what would cause that, either. The exact
> wording in the comment in logtape.c is "4X the actual data volume", so
> maybe that's just referring to per-tuple overhead?

My recollection is that that was an empirical measurement using the
previous generation of code.  It's got nothing to do with per-tuple
overhead IIRC, but with the fact that the same tuple can be sitting on
multiple "tapes" during a polyphase merge, because some of the tapes can
be lying fallow waiting for future use --- but data on them is still
taking up space, if you do nothing to recycle it.  The argument in the
comment shows why 2X is the minimum space growth for a plain merge
algorithm, but that's only a minimum.
        regards, tom lane


Re: Memory usage during sorting

From
Robert Haas
Date:
On Wed, Mar 21, 2012 at 8:57 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Mar 19, 2012 at 12:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> ...
>>
>> One thing that seems inefficient to me about our current algorithm is
>> the use of the run number as a leading column in the sort key.
>> There's no real reason why the tuples destined for the next run need
>> to be maintained in heap order; we could just store them unordered and
>> heapify the whole lot of them when it's time to start the next run.
>> That ought to save comparison cycles for the current run, since the
>> heap will be less deep, and there are other possible savings as well -
>> in particular, an unordered array of tuples can be heapify'd in linear
>> time, whereas our current algorithm is O(n lg n).  However, when I
>> actually implemented this, I found that doing it this way was a loser.
>>  In fact, excluding the next-run tuples from the heap for the current
>> run was a BIG loser even before you add in the time required to
>> re-heapify.  This confused the daylights out of me for a while,
>> because it's hard to understand how insert and siftup can be slower on
>> a small heap than a large one.
>>
>> My working theory is that, even though we must necessarily do more
>> comparisons when manipulating a larger heap, many of those comparisons
>> are resolved by comparing run numbers, and that's a lot cheaper than
>> invoking the real comparator.  For example, if we insert into a heap
>> whose rightmost child is in the next run, and the new tuple goes into
>> the current run, and the current size of the heap is such that the
>> initial position of the new element is a descendent of the right node,
>> it will very quickly crash through all of the next-run tuples and only
>> need one REAL comparison - against the root.  Either the new element
>> ends up as the root, or it ends up as the direct child of the root;
>> now we remove the root and, perhaps, replace it with its rightmost
>> child, meaning that the next element we read in can do the exact same
>> thing all over again.  If we exclude the next-run elements from the
>> heap, then the average depth of the heap when inserting a new element
>> is smaller, but all the elements are in the same-run, and we have to
>> invoke the real comparator every time.  In other words, those next-run
>> tuples act as dummies which effectively create a heap of uneven depth,
>> and the average depth encountered when inserting tuples turns out to
>> be less than what we get by pulling out the dummies and making the
>> depth uniform.
>>
>> This makes me kind of nervous, because I don't understand why things
>> should work out so well as all that.  If throwing some dummy elements
>> into the heap makes things work better, then maybe we should throw in
>> some more.  Or maybe it would work better to take some but not all of
>> them out.  There certainly doesn't seem to be any indication in the
>> code that this is an anticipated effect, and it seems an awfully
>> providential accident.
>
> Is there a patch or a git repo available for this change?  I'd like to
> see if I can analyze that if I get some spare time.

Sorry for the slow response to this.  Patch is attached
(crummy-external-sort.patch).

> Clever.  Rather than doing a binary search of the path, it might make
> sense to do a reverse search of it.  The tuple is likely to send up
> somewhere at the bottom of the heap, both because that is where most
> of the data is, and because the tuple being reinserted just came from
> the bottom so it is likely to be biased that way.

Patches for these approaches also attached (siftup-using-bsearch.patch
and siftup-reverse-linear.patch).  Neither approach seemed terribly
promising in my testing, IIRC.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Memory usage during sorting

From
Greg Stark
Date:
On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So has somebody found a hole in the n log n lower bound on the cost of
>> comparison-based sorting?  I thought that had been proven pretty
>> rigorously.
>
> There's not much danger of anyone finding a way around that bound
> since the proof is quite trivial, but keep in mind that it's a
> worst-case bound.

Fwiw it also only holds for comparison sorts. If you can sort your
items without actually comparing each item with the others then you
aren't bound by it at all. Notably algorithms like radix sort and
direct sort aren't bound by it and are O(n). I'm hoping to look at
trying some of these for integer sorts when they apply using the new
sort specialization code Peter added.

--
greg


Re: Memory usage during sorting

From
Peter Geoghegan
Date:
On 13 April 2012 16:03, Greg Stark <stark@mit.edu> wrote:
> Fwiw it also only holds for comparison sorts. If you can sort your
> items without actually comparing each item with the others then you
> aren't bound by it at all. Notably algorithms like radix sort and
> direct sort aren't bound by it and are O(n). I'm hoping to look at
> trying some of these for integer sorts when they apply using the new
> sort specialization code Peter added.

Actually, though a later revision of the patch did nominally allow for
user-defined sorting functions through the SortSupport infrastructure
(I didn't intend that it would be complete/useful enough to be really
worth documenting), the version that was finally committed did not.
However, there is a fairly obvious entry point for a radix sort, which
is here:
        /*         * We were able to accumulate all the tuples within the allowed         * amount of memory.  Just
qsort'em and we're done.         */        if (state->memtupcount > 1)        {            /* Can we use the single-key
sortfunction? */            if (state->onlyKey != NULL)                qsort_ssup(state->memtuples, state->memtupcount,
                         state->onlyKey);            else                qsort_tuple(state->memtuples,
         state->memtupcount,                            state->comparetup,                            state);        } 

The only specialisation that occurs here is between tuple sorts of a
single key and multiple (type-specific specialisations were ultimately
not judged to be worth the increase in binary size relative to their
diminishing benefits).  You'd probably set-up a type-specific
positional notation machinery within the state's SortSupport struct
(the type-specific abstraction through which we elide the SQL function
call machinery for types that support it).

One insight that I had at the time was that text comparisons where the
c locale isn't used are really rather expensive, and I doubt that
there is too much that can be done to address that directly.  However,
if we were to support timsort, that could substantially cut down on
the number of comparisons for text columns, which could be a real win.
Maybe that would happen through some explicit mechanism, or maybe the
planner could actually infer that it was likely to be optimal to use
timsort based on a high statistical correlation between physical row
ordering and logical ordering of a key column's values.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Memory usage during sorting

From
Peter Geoghegan
Date:
On 13 April 2012 17:42, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> One insight that I had at the time was that text comparisons where the
> c locale isn't used are really rather expensive, and I doubt that
> there is too much that can be done to address that directly.  However,
> if we were to support timsort, that could substantially cut down on
> the number of comparisons for text columns, which could be a real win.
> Maybe that would happen through some explicit mechanism, or maybe the
> planner could actually infer that it was likely to be optimal to use
> timsort based on a high statistical correlation between physical row
> ordering and logical ordering of a key column's values.

Further thoughts:

At the time we committed our own quicksort implementation, based on
the NetBSD one, we eschewed the optimisation of using insertion sort
when n is fairly low. This happens to be a very common optimisation,
so I'm not really super-confident that that was a good decision.
However, we also added our own optimisation, which is to attempt,
regardless of the size of n, to ascertain that the array is
pre-sorted, in which case we don't quicksort at all.

So if we attempt to quicksort an array which is almost pre-sorted, but
say only has its very last element out-of-order, we'll do fairly
horribly, because we waste the effort of almost an entire "bubble sort
iteration". So almost-sorted data would seem to be a compelling thing
to optimise for that reason, as well as for the simple observation
that it isn't exactly uncommon in a relational database.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Memory usage during sorting

From
Robert Haas
Date:
On Fri, Apr 13, 2012 at 1:15 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 13 April 2012 17:42, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> One insight that I had at the time was that text comparisons where the
>> c locale isn't used are really rather expensive, and I doubt that
>> there is too much that can be done to address that directly.  However,
>> if we were to support timsort, that could substantially cut down on
>> the number of comparisons for text columns, which could be a real win.
>> Maybe that would happen through some explicit mechanism, or maybe the
>> planner could actually infer that it was likely to be optimal to use
>> timsort based on a high statistical correlation between physical row
>> ordering and logical ordering of a key column's values.
>
> Further thoughts:
>
> At the time we committed our own quicksort implementation, based on
> the NetBSD one, we eschewed the optimisation of using insertion sort
> when n is fairly low. This happens to be a very common optimisation,
> so I'm not really super-confident that that was a good decision.
> However, we also added our own optimisation, which is to attempt,
> regardless of the size of n, to ascertain that the array is
> pre-sorted, in which case we don't quicksort at all.

We do use insertion sort for partitions of size < 7.  I assume you are
referring to something else.

One thing that has struck me as odd is that we check for presorted
arrays at every level of recursion.  That is, if we start out with 100
elements, we'll check whether the input is presorted.  Then, if we
partition it into a block of 60 elements and a block of 40 elements,
we'll check each of those partitions over again to see whether they
are presorted.  There is definitely some potential upside here,
because if the input is mostly sorted it may contain runs where
everything is in order, and we'll detect that and avoid some work.
But it's also kind of expensive; I've wondered whether we should, for
example, move the test for presorted input down a bit, so that it only
happens in the n > 40 case.  Another idea is to arrange things so that
we remember the offset at which the last presort check failed.  When
we tail-recurse into the left half of the array, there's no need to
recheck - if the presort check fell out after our partition boundary,
it's presorted; if it fell out before the partition boundary, it
isn't.

> So if we attempt to quicksort an array which is almost pre-sorted, but
> say only has its very last element out-of-order, we'll do fairly
> horribly, because we waste the effort of almost an entire "bubble sort
> iteration". So almost-sorted data would seem to be a compelling thing
> to optimise for that reason, as well as for the simple observation
> that it isn't exactly uncommon in a relational database.

Heap-sorting could also benefit from some optimization in this area.
Right now, if you heap-sort data that's already in order, it gets
progressively slower as you crank up work_mem, because the heap gets
deeper and so extraction and siftup get more expensive, for no real
benefit.  We could do something like check, every time we use up our
available memory, whether the heap is already entirely in order.  If
so, we could dump all but one tuple to the current run and the start
reading tuples again.  Or maybe dump some percentage of the array,
though that might require a bit more bookkeeping.  Or maybe there's a
smarter algorithm that would also cater to mostly-sorted data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Memory usage during sorting

From
Peter Geoghegan
Date:
On 13 April 2012 18:33, Robert Haas <robertmhaas@gmail.com> wrote:
> We do use insertion sort for partitions of size < 7.  I assume you are
> referring to something else.

I stand corrected. My confusion came from the removal of a later
*switch* to insertion sort in
a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649, which also added the
pre-sorted check where you'd expect to see the insertion switch. Of
course, the n < 7 insertion sort thing is right beside the added
check. So another, redundant copy of insertion sort was removed, and
not the one that almost every real-world quicksort implementation has.

> Heap-sorting could also benefit from some optimization in this area.
> Right now, if you heap-sort data that's already in order, it gets
> progressively slower as you crank up work_mem, because the heap gets
> deeper and so extraction and siftup get more expensive, for no real
> benefit.  We could do something like check, every time we use up our
> available memory, whether the heap is already entirely in order.  If
> so, we could dump all but one tuple to the current run and the start
> reading tuples again.  Or maybe dump some percentage of the array,
> though that might require a bit more bookkeeping.  Or maybe there's a
> smarter algorithm that would also cater to mostly-sorted data.

Well, timsort is specifically designed to take advantage of pre-sorted
data. It does appear to have a lot of traction, as wikipedia points
out:

"Timsort has been Python's standard sorting algorithm since version
2.3. It is now also used to sort arrays in Java SE 7,[2] and on the
Android platform.[3] "

Somebody has produced a BSD-licensed C implementation that draws
heavily on the Python/Java one here:

http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17

It looks like it has exactly the same interface as a c stdlib qsort.
So it'd be fairly easy to produce a timsort_arg() based on this, if
anyone cares to.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Memory usage during sorting

From
Greg Stark
Date:
On Fri, Apr 13, 2012 at 7:01 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Well, timsort is specifically designed to take advantage of pre-sorted
> data. It does appear to have a lot of traction, as wikipedia points
> out:

I hadn't heard of it. But reading up on it it does seem like a good
fit for us. It trades some additional storage -- an array of pointers
into the sort array where in our case the pointers would be much
smaller than a whole SortTuple -- for fewer comparisons -- which in
our case are often much slower than a simple integer comparison.

-- 
greg


Re: Memory usage during sorting

From
Peter Geoghegan
Date:
On 14 April 2012 13:32, Greg Stark <stark@mit.edu> wrote:
> On Fri, Apr 13, 2012 at 7:01 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> Well, timsort is specifically designed to take advantage of pre-sorted
>> data. It does appear to have a lot of traction, as wikipedia points
>> out:
>
> I hadn't heard of it. But reading up on it it does seem like a good
> fit for us. It trades some additional storage -- an array of pointers
> into the sort array where in our case the pointers would be much
> smaller than a whole SortTuple -- for fewer comparisons -- which in
> our case are often much slower than a simple integer comparison.

I wouldn't go so far as to suggest getting rid of quicksort, of
course. Quicksort is generally faster than other average case O(n log
n) algorithms in practice, for various reasons, principal among them
that it takes advantage of memory locality so well. I don't think that
it's a coincidence that timsort is popular in Python and Java land.
Even the Python C implementation has to sort integers through all that
PyObject reference indirection, I think. I'd now speculate that an
appropriate use of this algorithm might be to simply use it with types
that don't have a SortSupport function, that are largely passed by
reference, and have expensive comparisons. FWIW, I started playing
with adding timsort to Postgres last night:

https://github.com/Peter2ndQuadrant/postgres/tree/timsort

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Memory usage during sorting

From
Peter Geoghegan
Date:
On 14 April 2012 14:34, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> FWIW, I started playing with adding timsort to Postgres last night:
>
> https://github.com/Peter2ndQuadrant/postgres/tree/timsort

I've fixed this feature-branch so that every qsort_arg call site
(including the tuplesort specialisations thereof) call timsort_arg
instead. All but 4 regression tests pass, but they don't really count
as failures, since they're down to an assumption in the tests that the
order certain tuples appear should be the same as our current
quicksort implementation returns them, even though, in these
problematic cases, that is partially dictated by implementation - our
quicksort isn't stable, but timsort is. I think that the original
tests are faulty, but I understand that they're only expected to
provide smoke testing.

I did have to fix one bug in the implementation that I found, which
was an assumption that comparators would only return one of {-1, 0,
1}. The C standard requires only that they return negative, zero or
positive values, as required. I also polished the code a bit, and
added more comments.

I haven't actually quantified the benefits yet, and have no numbers to
show just yet, but I suspect it will prove to be a fairly compelling
win in certain situations.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Memory usage during sorting

From
Greg Stark
Date:
On Mon, Apr 16, 2012 at 10:42 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> All but 4 regression tests pass, but they don't really count
> as failures, since they're down to an assumption in the tests that the
> order certain tuples appear should be the same as our current
> quicksort implementation returns them, even though, in these
> problematic cases, that is partially dictated by implementation - our
> quicksort isn't stable, but timsort is.

This is an interesting point. If we use a stable sort we'll probably
be stuck with stable sorts indefinitely. People will start depending
on the stability and then we'll break their apps if we find a faster
sort that isn't stable.

Notably though tapesort is not stable (because heapsort is not stable
so neither the runs nor the merge steps are stable). So people's apps
would appear to work when they're in memory and fail only on large
data sets. It's easily possible for a user's query to never need to go
to tape though.

We don't have the luxury of having a separate sort and stable_sort
though due to the ORDER BY clause.

All in all I think it's handier to have a stable ORDER BY sort than an
unstable one though. So I'm not necessarily opposed to it even if it
means we're stuck using a stable sort indefinitely.
-- 
greg


Re: Memory usage during sorting

From
Jim Nasby
Date:
On 4/17/12 7:19 AM, Greg Stark wrote:
> On Mon, Apr 16, 2012 at 10:42 PM, Peter Geoghegan<peter@2ndquadrant.com>  wrote:
>> >  All but 4 regression tests pass, but they don't really count
>> >  as failures, since they're down to an assumption in the tests that the
>> >  order certain tuples appear should be the same as our current
>> >  quicksort implementation returns them, even though, in these
>> >  problematic cases, that is partially dictated by implementation - our
>> >  quicksort isn't stable, but timsort is.
> This is an interesting point. If we use a stable sort we'll probably
> be stuck with stable sorts indefinitely. People will start depending
> on the stability and then we'll break their apps if we find a faster
> sort that isn't stable.

I have often wished that I could inject entropy into a test database to ferret out these kinds of issues. In particular
Iworry about things like users depending on specific values for serial types or depending on the order of data in the
heap.

I would find it useful if Postgres had an option to intentionally inject more randomness in areas at the cost of some
performance.IE: have nextval() burn through a small, random number of values before returning one, and have scan
operatorsdo some re-ordering of tuples where appropriate.
 

If we had such an option and encouraged users to use it in testing, it would reduce the risk of people depending on
behaviorthat they shouldn't be.
 
-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net