Re: Memory usage during sorting - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Memory usage during sorting
Date
Msg-id CAMkU=1yfaWD1h9UY-es3_bpoVyMPm2b-3Pz2iGibWPZq_gnqJA@mail.gmail.com
In response to Re: Memory usage during sorting  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Memory usage during sorting
List pgsql-hackers
On Sun, Feb 26, 2012 at 7:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> I'm not sure about the conclusion, but given this discussion, I'm
>>> inclined to mark this Returned with Feedback.
>>
>> OK, thanks.  Does anyone have additional feed-back on how tightly we
>> wish to manage memory usage?  Is trying to make us use as much memory
>> as we are allowed to without going over a worthwhile endeavor at all,
>> or is it just academic nitpicking?
>
> I'm not sure, either.  It strikes me that, in general, it's hard to
> avoid a little bit of work_mem overrun, since we can't know whether
> the next tuple will fit until we've read it, and then if it turns out
> to be big, well, the best thing we can do is free it, but perhaps
> that's closing the barn door after the horse has gotten out already.

That type of overrun doesn't bother me much, because the size of the
next tuple someone else feeds us is mostly outside of this module's
control, and because the memory that is overrun should be reusable
once the offending tuple is written out to tape.  The type of overrun
I'm more concerned with is the kind that is purely under this module's
control, and which is then not re-used productively.

The better solution would be to reduce the overhead in the first
place.  While building the initial runs, there is no reason to have 3
blocks worth of overhead for each tape, when only one tape is ever
being used at a time.  But that change seems much tougher to
implement.
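
To put rough numbers on that (a sketch of my own; the block size and
the 3-blocks-per-tape figure are assumptions based on how the tape
buffers are usually accounted, not taken from a patch):

/*
 * Rough arithmetic sketch: buffer overhead charged per tape versus what
 * the single active tape actually needs while initial runs are built.
 * Treat BLCKSZ and the 3-block figure as assumptions.
 */
#include <stdio.h>

#define BLCKSZ 8192
#define TAPE_BUFFER_OVERHEAD (BLCKSZ * 3)

int
main(void)
{
    int tape_counts[] = {7, 100, 1000};   /* small and large merge orders */

    for (int i = 0; i < 3; i++)
        printf("%4d tapes: %8d bytes charged, %d needed by the one active tape\n",
               tape_counts[i],
               tape_counts[i] * TAPE_BUFFER_OVERHEAD,
               TAPE_BUFFER_OVERHEAD);
    return 0;
}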

> Having recently spent quite a bit of time looking at tuplesort.c as a
> result of Peter Geoghegan's work and some other concerns, I'm inclined
> to think that it needs more than minor surgery.  That file is peppered
> with numerous references to Knuth which serve the dual function of
> impressing the reader with the idea that the code must be very good
> (since Knuth is a smart guy) and rendering it almost completely
> impenetrable (since the design is justified by reference to a textbook
> most of us probably do not have copies of).

Yes, I agree with that analysis.  And getting the volume I want by
inter-library-loan has been challenging--I keep getting the volume
before or after the one I want.  Maybe Knuth starts counting volumes
at 0 and libraries start counting at 1 :)

Anyway, I think the logtape could use redoing.  When your tapes are
actually physical tape drives, it is necessary to build up runs one
after the other on physical tapes, because un-mounting a tape from a
tape drive and remounting another tape is not very feasible at scale.
That in turn means you need to place your runs carefully, because if
the final merge finds that two runs it needs are back-to-back on one
tape, that is bad.  But with disks pretending to be tapes, you could
re-arrange the "runs" with just some bookkeeping.  Maintaining the
distinction between "tapes" and "runs" is pointless, which means the
Fibonacci placement algorithm is pointless as well.
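
Just to illustrate the kind of bookkeeping I mean, here is a minimal
sketch (mine, not a patch) of tracking runs as plain regions of a
single temp file, so the final merge can open any run directly and no
Fibonacci distribution is needed.  The names are illustrative, not
from logtape.c:

/*
 * Sketch only: runs tracked as (start block, length) regions of one temp
 * file, so the merge can read any subset of runs directly instead of
 * caring which "tape" each run landed on.
 */
#include <stdlib.h>

typedef struct RunLocation
{
    long    startblock;    /* first block of the run in the temp file */
    long    nblocks;       /* number of blocks in the run */
} RunLocation;

typedef struct RunBook
{
    RunLocation *runs;     /* one entry per completed run */
    int          nruns;
    int          maxruns;
} RunBook;

/* Record a finished run; the caller supplies where it was written. */
static void
runbook_add(RunBook *book, long startblock, long nblocks)
{
    if (book->nruns >= book->maxruns)
    {
        book->maxruns = book->maxruns ? book->maxruns * 2 : 64;
        book->runs = realloc(book->runs,
                             book->maxruns * sizeof(RunLocation));
    }
    book->runs[book->nruns].startblock = startblock;
    book->runs[book->nruns].nblocks = nblocks;
    book->nruns++;
}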

> A quick Google search for external sorting algorithms suggest that the
> typical way of doing an external sort is to read data until you fill
> your in-memory buffer, quicksort it, and dump it out as a run.  Repeat
> until end-of-data; then, merge the runs (either in a single pass, or
> if there are too many, in multiple passes).  I'm not sure whether that
> would be better than what we're doing now, but there seem to be enough
> other people doing it that we might want to try it out.  Our current
> algorithm is to build a heap, bounded in size by work_mem, and dribble
> tuples in and out, but maintaining that heap is pretty expensive;
> there's a reason people use quicksort rather than heapsort for
> in-memory sorting.

But it would mean we have about 1.7x more runs that need to be merged
(for initially random data).  Considering that the minimum merge order
is 6, that increase in runs is not likely to lead to an additional
level of merging, in which case the extra speed of building the runs
would definitely win.  But if it does cause an additional merge level,
it could end up being a loss.
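
As a rough worked example (my own arithmetic; it assumes replacement
selection produces runs of roughly 2x work_mem on random input versus
1x for quicksorted runs, and approximates the number of merge passes
as ceil(log_order(nruns))):

/*
 * Back-of-the-envelope sketch: merge passes for a given run count and
 * merge order.  The run-size assumptions above are approximations.
 */
#include <math.h>
#include <stdio.h>

static int
merge_passes(double nruns, double order)
{
    if (nruns <= 1.0)
        return 0;
    return (int) ceil(log(nruns) / log(order));
}

int
main(void)
{
    double order = 6.0;     /* minimum merge order */

    /* 100 * work_mem of input: ~50 runs vs ~100 runs, both take 3 passes. */
    printf("heap:  %d passes\n", merge_passes(100.0 / 2.0, order));
    printf("qsort: %d passes\n", merge_passes(100.0, order));

    /* 40 * work_mem of input: the extra runs cost an extra pass (2 vs 3). */
    printf("heap:  %d passes\n", merge_passes(40.0 / 2.0, order));
    printf("qsort: %d passes\n", merge_passes(40.0, order));
    return 0;
}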

Is there some broad corpus of sorting benchmarks which changes could
be tested against?  I usually end up testing just simple columns of
integers or small strings, because they are easy to set up.  That is
not ideal.

> As a desirable side effect, I think it would mean
> that we could dispense with retail palloc and pfree altogether.  We
> could just allocate a big chunk of memory, copy tuples into it until
> it's full, using a pointer to keep track of the next unused byte, and
> then, after writing the run, reset the allocation pointer back to the
> beginning of the buffer.  That would not only avoid the cost of going
> through palloc/pfree, but also the memory overhead imposed by
> bookkeeping and power-of-two rounding.

Wouldn't we still need an array of pointers to the start of every
tuple's location in the buffer?  Or else, how would qsort know where
to find them?

Also, to do this we would need to get around the 1GB allocation
limit.  It is bad enough that memtuples is limited to 1GB; it would be
much worse if the entire arena were limited to that amount.
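
For concreteness, here is a sketch of how I picture that working:
tuples bump-allocated into large chunks, a separate pointer array for
qsort to work on, and chunks kept under 1GB so no single allocation
hits the limit.  All the names and sizes below are mine, not from any
patch:

/*
 * Sketch of a bump-allocated tuple arena.  Tuples are copied into large
 * chunks; qsort works on a separate array of pointers into the chunks.
 * Keeping each chunk under 1GB sidesteps the single-allocation limit.
 */
#include <stdlib.h>
#include <string.h>

#define ARENA_CHUNK_SIZE (512 * 1024 * 1024)   /* stay well under 1GB */

typedef struct TupleArena
{
    char   *chunk;          /* current chunk being filled */
    size_t  used;           /* bytes used in current chunk */
    char  **tuples;         /* pointer per tuple, for qsort */
    size_t  ntuples;
    size_t  maxtuples;
} TupleArena;

static int
arena_init(TupleArena *a)
{
    a->chunk = malloc(ARENA_CHUNK_SIZE);
    a->used = 0;
    a->maxtuples = 1024;
    a->ntuples = 0;
    a->tuples = malloc(a->maxtuples * sizeof(char *));
    return (a->chunk != NULL && a->tuples != NULL) ? 0 : -1;
}

/* Copy one tuple into the arena; returns NULL when the chunk is full. */
static char *
arena_store(TupleArena *a, const char *tuple, size_t len)
{
    char   *dest;

    if (a->used + len > ARENA_CHUNK_SIZE)
        return NULL;            /* caller dumps a run, then resets */

    dest = a->chunk + a->used;
    memcpy(dest, tuple, len);
    a->used += len;

    if (a->ntuples >= a->maxtuples)
    {
        a->maxtuples *= 2;
        a->tuples = realloc(a->tuples, a->maxtuples * sizeof(char *));
    }
    a->tuples[a->ntuples++] = dest;
    return dest;
}

/* After a run is written out, reuse the same memory for the next run. */
static void
arena_reset(TupleArena *a)
{
    a->used = 0;
    a->ntuples = 0;
}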



> If we do want to stick with the current algorithm, there seem to be
> some newer techniques for cutting down on the heap maintenance
> overhead.  Heikki's been investigating that a bit.

Interesting.  Is that investigation around the poor L1/L2 caching
properties of large heaps?  I was wondering if there might be a way to
give tuplesort an initial estimate of how much data there was to sort,
so that it could use a smaller amount of memory than the max if it
decided that that would lead to better caching effects.  Once you know
you can't do an in-memory sort, then there is no reason to use more
memory than the amount that lets you merge all the runs in one go.
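
A back-of-the-envelope version of that sizing rule (my own sketch;
the per-run buffer figure is an assumption):

/*
 * Sketch: once an in-memory sort is impossible, the smallest work_mem that
 * still allows a one-pass merge satisfies roughly
 *     nruns ~= input_size / work_mem  <=  fan_in ~= work_mem / tape_buf
 * i.e. work_mem ~= sqrt(input_size * tape_buf), taking tape_buf as about
 * 3 blocks of buffer per input run.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double input_size = 10.0 * 1024 * 1024 * 1024;   /* 10GB of tuples */
    double tape_buf   = 3.0 * 8192;                  /* assumed per-run buffer */
    double min_workmem = sqrt(input_size * tape_buf);

    /* prints roughly 15 MB: far less than a typical large work_mem */
    printf("~%.0f MB suffices for a one-pass merge of 10GB\n",
           min_workmem / (1024 * 1024));
    return 0;
}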

Cheers,

Jeff

