Re: Sorting Improvements for 8.4 - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Sorting Improvements for 8.4
Date
Msg-id 1196722458.22428.492.camel@dogma.ljc.laika.com
Whole thread Raw
In response to Re: Sorting Improvements for 8.4  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:
> I'm not sure where the idea of keeping the current bounds of the input tapes
> comes into it. We only preread when we run out of tuples anyways and then we
> don't really have a choice about which tape we want to preread from. And it's
> a good thing too since maintaining such a list of bounds and finding the
> lowest or highest would mean maintaining a second heap which would basically
> double the cpu cost of sorting.
> 

You're only keeping track of the maximum value for each run, which
should be cheap to track. The only time it changes is when you're
reading more data from that run, in which case it increases.

The tradeoff that's happening right now is: we want to merge many runs
at once because it reduces the number of merge phases, but the problem
is that it increases the seeking because we read one block from one run,
then one block from another run, etc., especially if the input is
random.

If we reduce the number of runs, then we can preread more efficiently.
See:

tuplesort.c:

* as sorted runs, we can eliminate any repeated I/O at all.  In the
current
* code we determine the number of tapes M on the basis of workMem: we
want
* workMem/M to be large enough that we read a fair amount of data each
time
* we preread from a tape, so as to maintain the locality of access
described
* above.  Nonetheless, with large workMem we can have many tapes.

So, for workMem/M to be "large enough", M has to be small enough. But a
small M means we have to do more merge phases, which is expensive.

Forecasting improves this trade. Forecasting no longer _blindly_
prereads from each tape, it uses information that it already has (the
max value of the last block read from each run) to determine the runs
from which we need tuples the soonest.

Then, it prereads the _correct_ data.

Regards,Jeff Davis




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Is postgres.gif missing in cvs?
Next
From: Devrim GÜNDÜZ
Date:
Subject: Re: Is postgres.gif missing in cvs?