Home > mailing lists

Re: Sorting Improvements for 8.4 - Mailing list pgsql-hackers

From	Jeff Davis
Subject	Re: Sorting Improvements for 8.4
Date	December 3, 2007 18:54:57
Msg-id	1196722458.22428.492.camel@dogma.ljc.laika.com Whole thread Raw
In response to	Re: Sorting Improvements for 8.4 (Gregory Stark <stark@enterprisedb.com>)
List	pgsql-hackers

Tree view

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:
> I'm not sure where the idea of keeping the current bounds of the input tapes
> comes into it. We only preread when we run out of tuples anyways and then we
> don't really have a choice about which tape we want to preread from. And it's
> a good thing too since maintaining such a list of bounds and finding the
> lowest or highest would mean maintaining a second heap which would basically
> double the cpu cost of sorting.
> 

You're only keeping track of the maximum value for each run, which
should be cheap to track. The only time it changes is when you're
reading more data from that run, in which case it increases.

The tradeoff that's happening right now is: we want to merge many runs
at once because it reduces the number of merge phases, but the problem
is that it increases the seeking because we read one block from one run,
then one block from another run, etc., especially if the input is
random.

If we reduce the number of runs, then we can preread more efficiently.
See:

tuplesort.c:

* as sorted runs, we can eliminate any repeated I/O at all.  In the
current
* code we determine the number of tapes M on the basis of workMem: we
want
* workMem/M to be large enough that we read a fair amount of data each
time
* we preread from a tape, so as to maintain the locality of access
described
* above.  Nonetheless, with large workMem we can have many tapes.

So, for workMem/M to be "large enough", M has to be small enough. But a
small M means we have to do more merge phases, which is expensive.

Forecasting improves this trade. Forecasting no longer _blindly_
prereads from each tape, it uses information that it already has (the
max value of the last block read from each run) to determine the runs
from which we need tuples the soonest.

Then, it prereads the _correct_ data.

Regards,Jeff Davis

pgsql-hackers by date:

From: Tom Lane
Date: 03 December 2007, 18:51:05
Subject: Re: Is postgres.gif missing in cvs?

From: Devrim GÜNDÜZ
Date: 03 December 2007, 18:59:53
Subject: Re: Is postgres.gif missing in cvs?

Re: Sorting Improvements for 8.4 - Mailing list pgsql-hackers

Previous

Next