Re: [HACKERS] sort on huge table - Mailing list pgsql-hackers

From: Bruce Momjian
Subject: Re: [HACKERS] sort on huge table
Date:
Msg-id: 199911011652.LAA17607@candle.pha.pa.us
In response to: Re: [HACKERS] sort on huge table  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
> Of course the flaw in this reasoning is that it assumes the OS isn't
> getting in the way.  On the HPUX system I've been testing on, the
> performance does seem to be about the same, but evidently it's much
> worse on your system.  (Exactly what OS are you running, anyway, and
> on what hardware?)  I speculate that your OS is applying some sort of
> read-ahead algorithm that is getting hopelessly confused by lots of
> seeks within a single file.  Perhaps it's reading the next block in
> sequence after every program-requested read, and then throwing away that
> work when it sees the program lseek the file instead of reading.
> 
> Next question is what to do about it.  I don't suppose we have any way
> of turning off the OS' read-ahead algorithm :-(.  We could forget about
> this space-recycling improvement and go back to separate temp files.
> The objection to that, of course, is that while sorting might be faster,
> it doesn't matter how fast the algorithm is if you don't have the disk
> space to execute it.

That is the key.  On BSDI, the kernel code is more sophisticated than
that.  When it services a read on an already-open file and finds that the
requested buffer is not in core, it assumes the read-ahead performed for
the previous read was useless, and scales back the read-ahead algorithm.
At least, that is my interpretation of the code and comments.

I suspect other OS's do something similar, but it is possible they do it
more simplistically: if a process does _any_ seek, they assume it must be
accessing the file non-sequentially, and turn read-ahead off.

Read-ahead on random file access is a terrible thing, and most OS's find
some way to turn it off in non-sequential cases.  Of course, the lack of
read-ahead during sequential access is also a problem.
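
To make the contrast concrete, the simplistic version might look roughly
like this (illustrative C only; this is not taken from BSDI or any other
kernel, and the names are made up):

#include <sys/types.h>

/* Illustrative per-file read-ahead bookkeeping; not from any real kernel. */
typedef struct
{
    off_t next_expected;   /* offset we expect the next read to start at */
    int   ra_blocks;       /* current read-ahead window, in blocks */
} file_ra_state;

#define RA_MAX_BLOCKS 32

static void
note_read(file_ra_state *ra, off_t offset, size_t len)
{
    if (offset == ra->next_expected)
    {
        /* Looks sequential: grow the read-ahead window, up to a cap. */
        ra->ra_blocks = (ra->ra_blocks == 0) ? 1 : ra->ra_blocks * 2;
        if (ra->ra_blocks > RA_MAX_BLOCKS)
            ra->ra_blocks = RA_MAX_BLOCKS;
    }
    else
    {
        /*
         * The simplistic policy: any seek means random access, so shut
         * read-ahead off completely.  A smarter policy would only scale
         * the window back when the previously prefetched blocks turn out
         * not to be in core, which is how I read the BSDI code.
         */
        ra->ra_blocks = 0;
    }
    ra->next_expected = offset + len;
}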

Tatsuo, what OS are you using?  Maybe I can check the kernel to see how
it is behaving.
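
For reference, the access pattern Tom describes boils down to something
like this from the kernel's point of view (a minimal userspace
illustration; the file name and block numbers are invented, and the real
tuplesort code is obviously more involved):

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Fetch one block of a run that lives somewhere in the single temp file. */
static void
read_block(int fd, off_t blockno, char *buf)
{
    lseek(fd, blockno * (off_t) BLCKSZ, SEEK_SET);
    read(fd, buf, BLCKSZ);
}

int
main(void)
{
    char buf[BLCKSZ];
    int  fd = open("pgsort_temp", O_RDONLY);

    if (fd < 0)
        return 1;

    /*
     * Merging several runs stored in one file: the next block comes from
     * whichever run's head tuple sorts first, so the kernel sees a seek
     * before nearly every read, and its read-ahead guess is wasted.
     */
    read_block(fd, 0, buf);      /* run 1 */
    read_block(fd, 5000, buf);   /* run 2 */
    read_block(fd, 1, buf);      /* run 1 again */
    read_block(fd, 10000, buf);  /* run 3 */

    close(fd);
    return 0;
}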

> 
> A possible compromise is to use separate temp files but drop the
> polyphase merge and go to a balanced merge, which'd still access each
> temp file sequentially but would have only a 2X space penalty instead of
> 4X (since all the data starts on one set of tapes and gets copied to the
> other set during a complete merge pass).  The balanced merge is a little
> slower than polyphase --- more merge passes --- but the space savings
> probably justify it.
> 
> One thing I'd like to know before we make any decisions is whether
> this problem is widespread.  Can anyone else run performance tests
> of the speed of large sorts, using current sources vs. 6.5.* ?

I may be able to test that today on BSDI, but I doubt BSDI is typical.
They are probably state-of-the-art in kernel algorithm design.
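
Regarding the balanced-merge idea above, the per-pass I/O pattern is easy
to see in a toy version (illustrative only: plain integers, a 2-way
fan-in, and stdio, whereas the real code merges tuples with a much larger
fan-in):

#include <stdio.h>

/*
 * Merge one sorted run from each input file into one run on the output
 * file.  Every file is touched strictly sequentially, so OS read-ahead
 * sees exactly the pattern it was designed for.  A complete pass copies
 * all the data from one set of files to the other, which is where the
 * roughly 2X peak disk footprint Tom mentions comes from.
 */
static void
merge_run(FILE *in_a, FILE *in_b, FILE *out)
{
    int a, b;
    int have_a = (fread(&a, sizeof(a), 1, in_a) == 1);
    int have_b = (fread(&b, sizeof(b), 1, in_b) == 1);

    while (have_a || have_b)
    {
        if (have_a && (!have_b || a <= b))
        {
            fwrite(&a, sizeof(a), 1, out);
            have_a = (fread(&a, sizeof(a), 1, in_a) == 1);
        }
        else
        {
            fwrite(&b, sizeof(b), 1, out);
            have_b = (fread(&b, sizeof(b), 1, in_b) == 1);
        }
    }
}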

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

