Re: Sorting Improvements for 8.4 - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Sorting Improvements for 8.4
Date
Msg-id 1196706749.22428.465.camel@dogma.ljc.laika.com
In response to Re: Sorting Improvements for 8.4  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: Sorting Improvements for 8.4  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
On Mon, 2007-12-03 at 11:51 +0000, Simon Riggs wrote:
> So Algorithm F is somewhat orthogonal to what I've proposed, though
> maybe still interesting. What we do now is fairly close, but you should
> look at the code in tuplesort.c and logtape.c to see how well it
> matches. That might lead to an increase in the limit of the number of
> concurrent runs mergeable at any one time.
> 

tuplesort.c:
 * When merging runs, we use a heap containing just the frontmost tuple from
 * each source run; we repeatedly output the smallest tuple and insert the
 * next tuple from its source tape (if any).  When the heap empties, the merge
 * is complete.  The basic merge algorithm thus needs very little memory ---
 * only M tuples for an M-way merge, and M is constrained to a small number.
 * However, we can still make good use of our full workMem allocation by
 * pre-reading additional tuples from each source tape.  Without prereading,
 * our access pattern to the temporary file would be very erratic; on average
 * we'd read one block from each of M source tapes during the same time that
 * we're writing M blocks to the output tape, so there is no sequentiality of
 * access at all, defeating the read-ahead methods used by most Unix kernels.
 * Worse, the output tape gets written into a very random sequence of blocks
 * of the temp file, ensuring that things will be even worse when it comes
 * time to read that tape.  A straightforward merge pass thus ends up doing a
 * lot of waiting for disk seeks.  We can improve matters by prereading from
 * each source tape sequentially, loading about workMem/M bytes from each tape
 * in turn.  Then we run the merge algorithm, writing but not reading until
 * one of the preloaded tuple series runs out.  Then we switch back to preread
 * mode, fill memory again, and repeat.  This approach helps to localize both
 * read and write accesses.

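To make the quoted description concrete, here is a minimal standalone sketch
of that heap-based M-way merge with run-local refills. This is not the actual
tuplesort.c code; the integer keys and the next_from_run()/emit() callbacks
are hypothetical stand-ins for the preread buffers and the output tape:

#include <stdlib.h>

typedef struct { int key; int run; } HeapItem;

static void sift_down(HeapItem *h, int n, int i)
{
    for (;;)
    {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < n && h[l].key < h[smallest].key) smallest = l;
        if (r < n && h[r].key < h[smallest].key) smallest = r;
        if (smallest == i) break;
        HeapItem tmp = h[i]; h[i] = h[smallest]; h[smallest] = tmp;
        i = smallest;
    }
}

/* next_from_run() returns 1 and sets *key while the run still has preread
 * tuples, else 0; emit() stands in for writing to the output tape. */
void merge_runs(int M, int (*next_from_run)(int run, int *key),
                void (*emit)(int key))
{
    HeapItem   *heap = malloc(M * sizeof(HeapItem));
    int         n = 0, key;

    /* the heap holds just the frontmost tuple of each run */
    for (int r = 0; r < M; r++)
        if (next_from_run(r, &key))
            heap[n++] = (HeapItem) { key, r };
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);

    while (n > 0)
    {
        emit(heap[0].key);              /* output the smallest tuple */
        if (next_from_run(heap[0].run, &key))
            heap[0].key = key;          /* next tuple from the same run */
        else
            heap[0] = heap[--n];        /* run exhausted: shrink the heap */
        sift_down(heap, n, 0);
    }
    free(heap);
}

The point the comment makes is that this loop only ever keeps M tuples in the
heap; the rest of workMem goes to the preread buffers that back
next_from_run().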
The idea of forecasting, as I understand it, is that we don't blindly
preread workMem/M bytes from each of the M tapes; instead we predict
which tape we will need tuples from next, and preread from that tape.

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.
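A minimal sketch of that selection rule, assuming each non-exhausted run
remembers the largest key in the block it most recently preread (the array
names are mine, not anything in tuplesort.c or logtape.c):

/* Pick the run to preread next: the one whose last preread block has the
 * lowest maximum key, since its buffered tuples will be drained first. */
int choose_run_to_preread(const int *last_block_max, const int *exhausted, int M)
{
    int best = -1;

    for (int r = 0; r < M; r++)
    {
        if (exhausted[r])
            continue;
        if (best < 0 || last_block_max[r] < last_block_max[best])
            best = r;
    }
    return best;                /* -1 once every run has been read in full */
}

The intuition is that the merge heap cannot need a tuple beyond a run's
remembered maximum until that run's buffer is refilled, so refilling the
lowest-maximum run first keeps the merge from stalling while still reading
each tape in large sequential chunks.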

It seems as if this would allow a variable number of runs to be merged
at once, but if the data really *is* random, we'd want it to degrade
gracefully into something resembling the current implementation.

I'm being somewhat vague here because I haven't taken the time to
really understand it. If you think this idea has potential I will look
into it in more detail.

Regards,
	Jeff Davis




