Re: [HACKERS] sort on huge table - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: [HACKERS] sort on huge table
Date:
Msg-id: 1680.941470920@sss.pgh.pa.us
In response to: Re: [HACKERS] sort on huge table (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses: Re: [HACKERS] sort on huge table
List: pgsql-hackers

Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>>>> It worked with 2GB+ table but was much slower than before.
>>>> Before(with 8MB sort memory): 22 minutes
>>>> After(with 8MB sort memory): 1 hour and 5 minutes
>>>> After(with 80MB sort memory): 42 minutes.
>> 
> It's getting better, but still slower than before.
> 52:50 (with 8MB sort memory)
> ps shows 7:15 was consumed by the backend.
> 32:06 (with 80MB sort memory)
> CPU time was 5:11.

OK, so it's basically all I/O time, which is what I suspected.

What's causing this is the changes I made to reduce disk space usage;
the price of that savings is more-random access to the temporary file.
Apparently your setup is not coping very well with that.

The original code used seven separate temp files, each of which was
written and read in a purely sequential fashion.  Only problem: as
the merge steps proceed, all the data is read from one temp file and
dumped into another, and because of the way the merges are overlapped,
you end up with total space usage around 4X the actual data volume.

What's in there right now is just the same seven-tape merge algorithm,
but all the "tapes" are stored in a single temp file.  As soon as any
block of a "tape" is read in, it's recycled to become available space
for the current "output tape" (since we know we won't need to read that
block again).  This is why the disk space usage is roughly actual data
volume and not four times as much.  However, the access pattern to this
single temp file looks a lot more random to the OS than the access
patterns for the original temp files.
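
To make the bookkeeping concrete, here's a rough sketch (the names and
structure are invented for illustration; this is not the actual backend
code): the single temp file is treated as an array of fixed-size blocks,
and a stack of freed block numbers lets the output tape reuse input-tape
blocks before the file has to grow.

#include <stdlib.h>

#define BLCKSZ 8192             /* assumed block size within the temp file */

typedef struct
{
    long   *freeBlocks;         /* stack of reusable block numbers */
    int     nFreeBlocks;        /* number of entries on the stack */
    int     freeBlocksLen;      /* allocated size of the stack */
    long    nFileBlocks;        /* current physical length of the file */
} BlockAllocator;

/* Called after a block of some input tape has been fully read. */
static void
releaseBlock(BlockAllocator *a, long blocknum)
{
    if (a->nFreeBlocks >= a->freeBlocksLen)
    {
        a->freeBlocksLen = a->freeBlocksLen ? a->freeBlocksLen * 2 : 32;
        a->freeBlocks = realloc(a->freeBlocks,
                                a->freeBlocksLen * sizeof(long));
    }
    a->freeBlocks[a->nFreeBlocks++] = blocknum;
}

/* Called when the output tape needs a place for its next block. */
static long
getFreeBlock(BlockAllocator *a)
{
    if (a->nFreeBlocks > 0)
        return a->freeBlocks[--a->nFreeBlocks];     /* recycle */
    return a->nFileBlocks++;                        /* grow the file */
}

The output tape only extends the file when the free list is empty, which
is why peak usage stays close to the actual data volume.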

I figured that I could get away with this from a performance standpoint
because, while the old code processed each temp file sequentially, the
read and write accesses were interleaved --- on average, you'd expect
a merge pass to read one block from each of the N source tapes in the
same time span that it is writing N blocks to the current output tape;
on average, no two successive block read or write requests will go to
the same temp file.  So it appears to me that the old code should cause
a disk seek for each block read or written.  The new code's behavior
can't be any worse than that; it's just doing those seeks within one
temp file instead of seven.
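
A toy trace of that claim, with an arbitrary number of source tapes,
shows what I mean --- no two consecutive requests touch the same file,
so the old scheme was already paying roughly one seek per block:

#include <stdio.h>

int
main(void)
{
    const int nsources = 6;     /* arbitrary number of source tapes */

    /*
     * One stretch of a merge pass: on average the pass reads one block
     * from each of the N source tapes for every N blocks it writes to
     * the output tape, and the reads and writes are interleaved, so
     * consecutive requests almost never hit the same file.
     */
    for (int i = 0; i < 2 * nsources; i++)
    {
        if (i % 2 == 0)
            printf("read  block from source temp file %d\n",
                   (i / 2) % nsources);
        else
            printf("write block to output temp file\n");
    }
    return 0;
}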

Of course the flaw in this reasoning is that it assumes the OS isn't
getting in the way.  On the HPUX system I've been testing on, the
performance does seem to be about the same, but evidently it's much
worse on your system.  (Exactly what OS are you running, anyway, and
on what hardware?)  I speculate that your OS is applying some sort of
read-ahead algorithm that is getting hopelessly confused by lots of
seeks within a single file.  Perhaps it's reading the next block in
sequence after every program-requested read, and then throwing away that
work when it sees the program lseek the file instead of reading.

Next question is what to do about it.  I don't suppose we have any way
of turning off the OS' read-ahead algorithm :-(.  We could forget about
this space-recycling improvement and go back to separate temp files.
The objection to that, of course, is that while sorting might be faster,
it doesn't matter how fast the algorithm is if you don't have the disk
space to execute it.
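
One possibility, assuming the platform provides posix_fadvise (I haven't
checked which ones do), would be to hint that accesses to the temp file
are random; systems that honor the hint typically stop doing read-ahead
on that file.  A minimal sketch, with a made-up temp file name:

#define _POSIX_C_SOURCE 200112L     /* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* "pg_sorttemp" is just a stand-in for the sort's temp file name. */
    int fd = open("pg_sorttemp", O_RDWR | O_CREAT, 0600);
    int err;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Tell the kernel access will be random, i.e. please skip read-ahead. */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (err != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    return 0;
}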

A possible compromise is to use separate temp files but drop the
polyphase merge and go to a balanced merge, which'd still access each
temp file sequentially but would have only a 2X space penalty instead of
4X (since all the data starts on one set of tapes and gets copied to the
other set during a complete merge pass).  The balanced merge is a little
slower than polyphase --- more merge passes --- but the space savings
probably justify it.
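
To make the pass-count and space tradeoff concrete, here is a toy
calculation (the run count, data volume, and merge order are all made
up): each complete pass copies every run from one set of files onto the
other set, so just before the source set can be released both sets hold
the data, and peak usage sits near 2X --- versus about 4X for the
overlapped polyphase scheme without block recycling.

#include <stdio.h>

int
main(void)
{
    long    data_blocks = 250000;   /* made-up total data volume, in blocks */
    int     nruns = 200;            /* made-up number of initial sorted runs */
    int     order = 3;              /* made-up merge order: 3 input + 3 output files */
    int     pass = 0;

    while (nruns > 1)
    {
        /*
         * A complete pass merges groups of 'order' runs from the input
         * set onto the output set; both sets coexist until the input
         * set is released, so peak disk usage is about twice the data.
         */
        int     newruns = (nruns + order - 1) / order;

        printf("pass %d: %3d runs -> %3d runs, peak disk ~ %ld blocks (2 x data)\n",
               ++pass, nruns, newruns, 2 * data_blocks);
        nruns = newruns;
    }
    return 0;
}

With those made-up numbers it takes five passes (200 -> 67 -> 23 -> 8 ->
3 -> 1); polyphase would get there with less total data movement, which
is the extra-passes cost mentioned above.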

One thing I'd like to know before we make any decisions is whether
this problem is widespread.  Can anyone else run performance tests
of the speed of large sorts, using current sources vs. 6.5.* ?
        regards, tom lane

