Re: Memory usage during sorting - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Memory usage during sorting
Date
Msg-id 8425.1332084300@sss.pgh.pa.us
Whole thread Raw
In response to Re: Memory usage during sorting  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: Memory usage during sorting  (Jeremy Harris <jgh@wizmail.org>)
Re: Memory usage during sorting  (Greg Stark <stark@mit.edu>)
Re: Memory usage during sorting  (Robert Haas <robertmhaas@gmail.com>)
Re: Memory usage during sorting  (Jim Nasby <jim@nasby.net>)
Re: Memory usage during sorting  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Anyway, I think the logtape could use redoing.

> The problem there is that none of the files can be deleted until it
> was entirely read, so you end up with all the data on disk twice.  I
> don't know how often people run their databases so close to the edge
> on disk space that this matters, but someone felt that that extra
> storage was worth avoiding.

Yeah, that was me, and it came out of actual user complaints ten or more
years back.  (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.)  We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that.  It's possible
that with the growth of hard drive sizes, real-world applications would
no longer care that much about whether the space required to sort is 4X
data size rather than 1X.  Or then again, maybe their data has grown
just as fast and they still care.

> As a desirable side effect, I think it would mean
> that we could dispense with retail palloc and pfree altogether. �We
> could just allocate a big chunk of memory, copy tuples into it until
> it's full, using a pointer to keep track of the next unused byte, and
> then, after writing the run, reset the allocation pointer back to the
> beginning of the buffer. �That would not only avoid the cost of going
> through palloc/pfree, but also the memory overhead imposed by
> bookkeeping and power-of-two rounding.

That would be worthwhile, probably.  The power-of-2 rounding in palloc
is not nice at all for tuplesort's purposes.  We've occasionally talked
about inventing a different memory context type that is less willing to
waste space ... but on the other hand, such an allocator would run
slower, so you're right back to making a speed for space tradeoff.
If the tuples could be deallocated in bulk then that catch-22 goes away.

>> No, it's about reducing the number of comparisons needed to maintain
>> the heap property.

> That sounds very interesting.  I didn't know it was even theoretically
> possible to do that.

So has somebody found a hole in the n log n lower bound on the cost of
comparison-based sorting?  I thought that had been proven pretty
rigorously.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Yeb Havinga
Date:
Subject: Recent MinGW postgres builds with -O2 do not pass regression tests
Next
From: Jeremy Harris
Date:
Subject: Re: Memory usage during sorting