Re: Sorting Improvements for 8.4 - Mailing list pgsql-hackers
From: Jeff Davis
Subject: Re: Sorting Improvements for 8.4
Date:
Msg-id: 1196706749.22428.465.camel@dogma.ljc.laika.com
In response to: Re: Sorting Improvements for 8.4 (Simon Riggs <simon@2ndquadrant.com>)
Responses: Re: Sorting Improvements for 8.4 (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
On Mon, 2007-12-03 at 11:51 +0000, Simon Riggs wrote:
> So Algorithm F is somewhat orthogonal to what I've proposed, though
> maybe still interesting. What we do now is fairly close, but you should
> look at the code in tuplesort.c and logtape.c to see how well it
> matches. That might lead to an increase in the limit of the number of
> concurrent runs mergeable at any one time.

tuplesort.c:

 * When merging runs, we use a heap containing just the frontmost tuple from
 * each source run; we repeatedly output the smallest tuple and insert the
 * next tuple from its source tape (if any). When the heap empties, the merge
 * is complete. The basic merge algorithm thus needs very little memory ---
 * only M tuples for an M-way merge, and M is constrained to a small number.
 * However, we can still make good use of our full workMem allocation by
 * pre-reading additional tuples from each source tape. Without prereading,
 * our access pattern to the temporary file would be very erratic; on average
 * we'd read one block from each of M source tapes during the same time that
 * we're writing M blocks to the output tape, so there is no sequentiality of
 * access at all, defeating the read-ahead methods used by most Unix kernels.
 * Worse, the output tape gets written into a very random sequence of blocks
 * of the temp file, ensuring that things will be even worse when it comes
 * time to read that tape. A straightforward merge pass thus ends up doing a
 * lot of waiting for disk seeks. We can improve matters by prereading from
 * each source tape sequentially, loading about workMem/M bytes from each tape
 * in turn. Then we run the merge algorithm, writing but not reading until
 * one of the preloaded tuple series runs out. Then we switch back to preread
 * mode, fill memory again, and repeat. This approach helps to localize both
 * read and write accesses.
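For anyone following along, the comment above can be condensed into a toy sketch. This is just an illustration in Python, not the actual tuplesort.c code; the function name merge_runs and the per-run buffer cap preread_cap (a stand-in for workMem/M) are invented for the example. Runs are in-memory lists standing in for tapes:

```python
import heapq

def merge_runs(runs, preread_cap=4):
    """M-way merge via a heap of frontmost tuples, refilling each run's
    buffer in preread_cap-sized chunks (a toy stand-in for workMem/M)."""
    buffers = [list(run[:preread_cap]) for run in runs]    # prereaded slices
    positions = [min(preread_cap, len(run)) for run in runs]
    heap = []
    for i, buf in enumerate(buffers):
        if buf:
            heapq.heappush(heap, (buf.pop(0), i))          # frontmost tuple per run
    out = []
    while heap:
        val, i = heapq.heappop(heap)                       # emit smallest
        out.append(val)
        if not buffers[i] and positions[i] < len(runs[i]):
            # buffer exhausted: switch to "preread mode" for this run
            nxt = runs[i][positions[i]:positions[i] + preread_cap]
            buffers[i] = list(nxt)
            positions[i] += len(nxt)
        if buffers[i]:
            heapq.heappush(heap, (buffers[i].pop(0), i))   # next tuple from same run
    return out
```

The heap never holds more than M entries, matching the "only M tuples for an M-way merge" point; all the extra memory goes into the per-run buffers.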
The idea of prefetching, as I understand it, is that we don't blindly preread workMem/M bytes from each of M tapes; instead we predict, by forecasting, which tape we will need tuples from next. If I understand correctly, we just keep track of the maximum value of the last block read from each run, and then always read from the run in which the last block read has the lowest maximum. It seems as if this would allow a variable number of runs to be merged at once, but if the data really *is* random, we'd want it to degrade gracefully into something resembling the current implementation.

I'm being somewhat vague here because I haven't taken the time to really understand it. If you think this idea has potential I will look into it in more detail.

Regards,
Jeff Davis
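To make the forecasting rule concrete, here is a minimal Python sketch of just the tape-selection step, under my reading of it; the name next_tape_to_read and its parameters are invented for illustration, not anything in tuplesort.c. The rationale: the run whose last-read block has the lowest maximum key is the one whose buffered data the merge will exhaust first, so that is the tape to preread from next:

```python
def next_tape_to_read(last_block_max, finished):
    """Forecasting: among runs not yet exhausted (indices in `finished`
    have no blocks left on tape), pick the run whose last-read block
    has the lowest maximum key -- its buffer will run dry first."""
    live = [i for i in range(len(last_block_max)) if i not in finished]
    return min(live, key=lambda i: last_block_max[i])
```

For example, with last-block maxima [10, 3, 7], run 1 is read next; once run 1's tape is finished, run 2 becomes the forecast choice.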