Re: Memory usage during sorting - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Memory usage during sorting
Date
Msg-id CAMkU=1y_4UOtJ4GkTec=ZEZZqu23R4VCCwVPJW0=s26-WJ=XCw@mail.gmail.com
In response to Re: Memory usage during sorting  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On Sat, Mar 17, 2012 at 10:47 AM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> But it would mean we have about 1.7x  more runs that need to be merged
>>> (for initially random data).  Considering the minimum merge order is
>>> 6, that increase in runs is likely not to lead to an additional level
>>> of merging, in which case the extra speed of building the runs would
>>> definitely win.  But if it does cause an additional level of merge, it
>>> could end up being a loss.
>>
>> That's true, but the real hit to the run size should be quite a bit
>> less than 1.7x, because we'd also be using memory more efficiently,
>> and therefore more tuples should fit.
>
> I'm not sure I believe the 1.7x.  Keep in mind that even after
> starting a new tape we can continue to read in new tuples that belong
> on the first tape. So even if you have tuples that are more than N
> positions out of place (where N is the size of your heap) as long as
> you don't have very many you can keep writing out the first tape for
> quite a while.
>
> I suspect analyzing truly random inputs is also a bit like saying no
> compression algorithm can work on average.

I'm not sure what you mean here.  The "also" suggests the previous
discussion was about something other than the assumption of
randomness, but that discussion doesn't make much sense unless both
paragraphs are about that.
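
Going back to the 1.7x run-count point quoted above, the extra-merge-level
worry is easy to put rough numbers on.  For a balanced merge, passes are
roughly ceil(log_order(runs)), so with merge order 6 a 1.7x jump in run
count only costs an extra pass when it happens to cross a power of 6.
The run counts in this little stand-alone program are made up purely for
illustration:

/*
 * Rough, stand-alone arithmetic (not PostgreSQL code): balanced-merge
 * passes are about ceil(log_order(runs)).  Run counts below are made up.
 */
#include <math.h>
#include <stdio.h>

static int merge_passes(double runs, double order)
{
    return (runs <= 1.0) ? 0 : (int) ceil(log(runs) / log(order));
}

int main(void)
{
    double order = 6.0;                     /* minimum merge order discussed above */
    double cases[] = { 20, 30, 200, 220 };  /* hypothetical initial run counts */

    for (int i = 0; i < 4; i++)
    {
        double r = cases[i];
        printf("%4.0f runs -> %d passes;  %4.0f runs (x1.7) -> %d passes\n",
               r, merge_passes(r, order),
               r * 1.7, merge_passes(r * 1.7, order));
    }
    return 0;
}

For most run counts the 1.7x increase stays within the same number of
passes; it only hurts when it crosses a boundary, which fits the "could
end up being a loss" caveat above.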

Anyway I think keeping best/worst cases in mind is certainly a good
thing to do.  I've certainly seen train wrecks unfold when people
assumed anything could be compressed without verifying it.

> Partly sorted data is quite
> common and the tapesort algorithm would be able to do a lot of cases
> in a single merge that the quicksort and merge would generate a large
> number of merges for.

This is where some common and credible corpus would be invaluable.
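
To make the partly-sorted case concrete, here is a little stand-alone
sketch of replacement selection (not tuplesort.c; the heap size and
input are made up).  Because an incoming tuple that sorts after the last
tuple written can still join the current run, a mostly-ascending input
comes out as a single run, where a quicksort of the same memory would
have produced several:

/*
 * Minimal replacement-selection sketch, not PostgreSQL's tuplesort.c;
 * the heap size and input below are made up for illustration.
 */
#include <stdio.h>

#define HEAP_SIZE 4

typedef struct { int key; int run; } Item;  /* run = which output run it belongs to */

static Item heap[HEAP_SIZE];
static int  heap_len;

/* order first by run number, then by key */
static int item_lt(Item a, Item b)
{
    return a.run < b.run || (a.run == b.run && a.key < b.key);
}

static void sift_down(int i)
{
    for (;;)
    {
        int c = 2 * i + 1;
        if (c >= heap_len)
            break;
        if (c + 1 < heap_len && item_lt(heap[c + 1], heap[c]))
            c++;
        if (!item_lt(heap[c], heap[i]))
            break;
        Item tmp = heap[i]; heap[i] = heap[c]; heap[c] = tmp;
        i = c;
    }
}

int main(void)
{
    /* "partly sorted" input: mostly ascending with small perturbations */
    int input[] = {3, 1, 4, 2, 7, 5, 8, 6, 11, 9, 12, 10};
    int n = (int) (sizeof(input) / sizeof(input[0]));
    int next = 0, cur_run = -1;

    /* prime the heap and heapify */
    while (heap_len < HEAP_SIZE && next < n)
        heap[heap_len++] = (Item) { input[next++], 0 };
    for (int i = heap_len / 2 - 1; i >= 0; i--)
        sift_down(i);

    while (heap_len > 0)
    {
        Item top = heap[0];
        if (top.run != cur_run)
        {
            cur_run = top.run;
            printf("%s--- run %d ---", cur_run ? "\n" : "", cur_run);
        }
        printf(" %d", top.key);

        if (next < n)
        {
            /* a tuple >= the one just written can still join the current run */
            int run = (input[next] >= top.key) ? cur_run : cur_run + 1;
            heap[0] = (Item) { input[next++], run };
        }
        else
            heap[0] = heap[--heap_len];
        sift_down(0);
    }
    printf("\n");
    return 0;
}

On that sample input every tuple joins run 0, so no merging at all would
be needed; on random input the runs instead average about twice the heap
size.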

> All that said, quicksort and merge would always do no more i/o in
> cases where the total number of tuples to be sorted is significantly
> less than N^2 since that would be guaranteed to be possible to process
> with a single merge pass. (Where "significantly less" has something to
> do with how many tuples you have to read in one i/o to be efficient).

Unless the tuples are quite large, the number of tuples needed per read
for efficient I/O is quite high.
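
Putting rough numbers on both points (the figures below are illustrative
assumptions, not PostgreSQL defaults): with N tuples fitting in the sort
memory, the single-pass capacity works out to roughly N^2 divided by the
number of tuples one efficient read fetches, and that per-read tuple
count is indeed large unless tuples are big:

/*
 * Back-of-the-envelope numbers for the single-merge-pass bound discussed
 * above.  All figures here (work_mem, tuple size, read size) are
 * illustrative assumptions, not PostgreSQL defaults.
 */
#include <stdio.h>

int main(void)
{
    double work_mem = 64.0 * 1024 * 1024;   /* memory for the sort, bytes   */
    double tuple_sz = 100.0;                /* average tuple, bytes         */
    double read_sz  = 256.0 * 1024;         /* "efficient" sequential read  */

    double n        = work_mem / tuple_sz;  /* tuples per in-memory run     */
    double per_read = read_sz / tuple_sz;   /* tuples fetched per read      */
    double fan_in   = work_mem / read_sz;   /* runs mergeable in one pass
                                               while keeping reads large    */
    double one_pass = fan_in * n;           /* tuples sortable with a
                                               single merge pass            */

    printf("tuples per run:        %.0f\n", n);
    printf("tuples per read:       %.0f\n", per_read);
    printf("merge fan-in:          %.0f\n", fan_in);
    printf("single-pass capacity:  %.0f tuples (~%.0f GB)\n",
           one_pass, one_pass * tuple_sz / (1024.0 * 1024 * 1024));
    return 0;
}

With those made-up figures a 64MB sort of 100-byte tuples can finish in
one merge pass up to roughly 170 million tuples (about 16GB), but each
"efficient" read already covers over 2,500 tuples.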

> That probably covers a lot of cases, and Postgres already has the
> stats to predict when it would happen, more or less.

Currently sort makes no attempt to use the planner stats.  Would that
be a good thing to work on?

> Fwiw when I was doing the top-k sorting I did a bunch of experiments
> and came up with a rule-of-thumb that our quicksort was about twice as
> fast as our heapsort. I'm not sure whether that's a big deal or not in
> this case.

I've been seeing around a 3-fold difference, and that gap might widen
with some of the work being done that speeds up quicksort but not the
tape sort.
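
For anyone wanting to reproduce that kind of number on their own machine,
a stand-alone comparison along these lines is easy to run (this is just
library qsort() against a plain binary-heap sort on random integers, not
the tuplesort code paths):

/*
 * Stand-alone micro-benchmark in the spirit of the 2x-3x observations
 * above: library qsort() vs. a plain binary-heap sort on random integers.
 * This is not PostgreSQL's tuplesort; it only illustrates how one might
 * measure the gap on a given machine.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

static void sift_down(int *a, int n, int i)
{
    for (;;)
    {
        int c = 2 * i + 1;
        if (c >= n)
            break;
        if (c + 1 < n && a[c + 1] > a[c])
            c++;
        if (a[c] <= a[i])
            break;
        int t = a[i]; a[i] = a[c]; a[c] = t;
        i = c;
    }
}

static void heap_sort(int *a, int n)
{
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(a, n, i);
    for (int i = n - 1; i > 0; i--)
    {
        int t = a[0]; a[0] = a[i]; a[i] = t;
        sift_down(a, i, 0);
    }
}

int main(void)
{
    int *a = malloc(N * sizeof(int));
    int *b = malloc(N * sizeof(int));
    srand(12345);
    for (int i = 0; i < N; i++)
        a[i] = b[i] = rand();

    clock_t t0 = clock();
    qsort(a, N, sizeof(int), cmp_int);
    clock_t t1 = clock();
    heap_sort(b, N);
    clock_t t2 = clock();

    printf("qsort:    %.3f s\n", (double) (t1 - t0) / CLOCKS_PER_SEC);
    printf("heapsort: %.3f s\n", (double) (t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    free(b);
    return 0;
}

The gap will of course move around with comparator cost, tuple width,
and cache behavior, which is presumably part of why the 2x and 3x
observations differ.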


> Keep in mind that the only win would be reducing the cpu
> time on a sort where every tuple was being written to disk and read
> back. For most users that won't run noticeably faster, just reduce cpu
> time consumption.

But in many (perhaps most) tape sorts, CPU time is most of the cost.  A
lot of tape sorts are entirely memory backed: the tuples either never
reach disk, or are partially written but never read, instead being
served back from the file system's cache.  By changing to a tape sort,
you are buying insurance against a bunch of other sorts running
simultaneously, but often the insurance doesn't pay off because those
other sorts don't exist and the kernel manages to keep the few that do
in RAM anyway.

One avenue for major surgery on the sorts would be an admission system
in which jobs can negotiate over memory.  Something like allocation by
halves, but with the possibility that a job could decide it would
rather wait for another sort to finish and free up its memory than
make do with what is currently available.  A rough sketch of that
negotiation is below.
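
Purely as a sketch of what that negotiation might look like (none of
these names or policies exist in PostgreSQL; they are hypothetical):

/*
 * Hypothetical sketch of the admission idea above: a shared pool that
 * offers each new sort half of what is currently free ("allocation by
 * halves"), and a caller that may prefer to wait for a bigger grant.
 * None of these names exist in PostgreSQL; this only makes the
 * negotiation concrete.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    size_t total;       /* pool size */
    size_t in_use;      /* currently handed out */
} SortMemPool;

/* Offer half of whatever is free right now. */
static size_t
pool_offer(const SortMemPool *pool)
{
    return (pool->total - pool->in_use) / 2;
}

/*
 * Caller policy: accept the offer if it covers a useful fraction of what
 * we wanted; otherwise signal that we would rather wait for other sorts
 * to finish and free their memory.
 */
static bool
accept_offer(size_t wanted, size_t offered)
{
    return offered * 2 >= wanted;   /* accept anything >= half our ask */
}

int main(void)
{
    SortMemPool pool = { 256 * 1024 * 1024, 192 * 1024 * 1024 };
    size_t wanted = 64 * 1024 * 1024;

    size_t offer = pool_offer(&pool);
    if (accept_offer(wanted, offer))
        printf("take %zu bytes now\n", offer);
    else
        printf("offered %zu of %zu wanted; wait for memory to be freed\n",
               offer, wanted);
    return 0;
}

The interesting policy question is the accept_offer() side: whether
waiting for a big grant beats starting now with a small one presumably
depends on how close the small grant gets the sort to a single merge
pass.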

Cheers,

Jeff

