Re: [HACKERS] sort on huge table - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] sort on huge table |
Date | |
Msg-id | 1680.941470920@sss.pgh.pa.us |
In response to | Re: [HACKERS] sort on huge table (Tatsuo Ishii <t-ishii@sra.co.jp>) |
Responses | Re: [HACKERS] sort on huge table (4 replies) |
List | pgsql-hackers |
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>>>> It worked with 2GB+ table but was much slower than before.
>>>> Before (with 8MB sort memory): 22 minutes
>>>> After (with 8MB sort memory): 1 hour and 5 minutes
>>>> After (with 80MB sort memory): 42 minutes.
>>
> It's getting better, but still slower than before.
> 52:50 (with 8MB sort memory)
> ps shows 7:15 was consumed by the backend.
> 32:06 (with 80MB sort memory)
> CPU time was 5:11.

OK, so it's basically all I/O time, which is what I suspected.

What's causing this is the changes I made to reduce disk space usage; the price of that savings is more-random access to the temporary file. Apparently your setup is not coping very well with that.

The original code used seven separate temp files, each of which was written and read in a purely sequential fashion. Only problem: as the merge steps proceed, all the data is read from one temp file and dumped into another, and because of the way the merges are overlapped, you end up with total space usage around 4X the actual data volume.

What's in there right now is just the same seven-tape merge algorithm, but all the "tapes" are stored in a single temp file. As soon as any block of a "tape" is read in, it's recycled to become available space for the current "output tape" (since we know we won't need to read that block again). This is why the disk space usage is roughly the actual data volume and not four times as much. However, the access pattern to this single temp file looks a lot more random to the OS than the access patterns for the original temp files.

I figured that I could get away with this from a performance standpoint because, while the old code processed each temp file sequentially, the read and write accesses were interleaved --- on average, you'd expect a merge pass to read one block from each of the N source tapes in the same time span that it is writing N blocks to the current output tape, so on average no two successive block read or write requests will go to the same temp file. So it appears to me that the old code should cause a disk seek for each block read or written anyway. The new code's behavior can't be any worse than that; it's just doing those seeks within one temp file instead of seven.

Of course the flaw in this reasoning is that it assumes the OS isn't getting in the way. On the HPUX system I've been testing on, the performance does seem to be about the same, but evidently it's much worse on your system. (Exactly what OS are you running, anyway, and on what hardware?) I speculate that your OS is applying some sort of read-ahead algorithm that is getting hopelessly confused by lots of seeks within a single file. Perhaps it's reading the next block in sequence after every program-requested read, and then throwing away that work when it sees the program lseek the file instead of reading.

Next question is what to do about it. I don't suppose we have any way of turning off the OS' read-ahead algorithm :-(. We could forget about this space-recycling improvement and go back to separate temp files. The objection to that, of course, is that while sorting might be faster, it doesn't matter how fast the algorithm is if you don't have the disk space to execute it. A possible compromise is to use separate temp files but drop the polyphase merge and go to a balanced merge, which'd still access each temp file sequentially but would have only a 2X space penalty instead of 4X (since all the data starts on one set of tapes and gets copied to the other set during a complete merge pass).
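To make the single-file recycling scheme concrete, here is a minimal C sketch of the idea. The type and function names (LogicalTapeSet, release_block, get_free_block) are illustrative only, not the actual backend code: every block an input tape finishes reading goes onto a free list, and the output tape draws from that list before extending the file, which keeps the file size near the data volume at the cost of a seekier access pattern.

```c
/*
 * Illustrative sketch only (not the actual backend code): the merge
 * "tapes" share one temp file, and a block becomes reusable by the
 * output tape as soon as an input tape has read it.
 */
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 8192

typedef struct LogicalTapeSet
{
    FILE *file;        /* the single backing temp file */
    long  nblocks;     /* blocks allocated in the file so far */
    long *freelist;    /* block numbers available for reuse */
    int   nfree;       /* entries currently in freelist */
    int   freesize;    /* allocated size of freelist */
} LogicalTapeSet;

/* An input tape has consumed this block; make it reusable. */
static void
release_block(LogicalTapeSet *lts, long blkno)
{
    if (lts->nfree >= lts->freesize)
    {
        lts->freesize = lts->freesize ? lts->freesize * 2 : 64;
        lts->freelist = realloc(lts->freelist, lts->freesize * sizeof(long));
    }
    lts->freelist[lts->nfree++] = blkno;
}

/* The output tape needs a block; prefer recycled space over growing the file. */
static long
get_free_block(LogicalTapeSet *lts)
{
    if (lts->nfree > 0)
        return lts->freelist[--lts->nfree];
    return lts->nblocks++;
}

/* Read one block of an input tape, then recycle it (merge input is never re-read). */
static void
read_block(LogicalTapeSet *lts, long blkno, void *buf)
{
    fseek(lts->file, blkno * (long) BLOCK_SIZE, SEEK_SET);
    fread(buf, BLOCK_SIZE, 1, lts->file);
    release_block(lts, blkno);
}

/* Append one block to the output tape, reusing freed space when possible. */
static void
write_block(LogicalTapeSet *lts, const void *buf)
{
    long blkno = get_free_block(lts);

    fseek(lts->file, blkno * (long) BLOCK_SIZE, SEEK_SET);
    fwrite(buf, BLOCK_SIZE, 1, lts->file);
}

int
main(void)
{
    LogicalTapeSet lts = {0};
    char buf[BLOCK_SIZE] = {0};

    lts.file = tmpfile();
    if (lts.file == NULL)
        return 1;
    write_block(&lts, buf);      /* allocates block 0 */
    read_block(&lts, 0, buf);    /* block 0 goes on the free list */
    write_block(&lts, buf);      /* reuses block 0; file does not grow */
    printf("blocks allocated in file: %ld\n", lts.nblocks);  /* prints 1 */
    return 0;
}
```

In a real merge pass the reads cycle over the input tapes while the writes go to the output tape, so successive block requests land at scattered offsets within this one file; that is the seek pattern the message is describing.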
The balanced merge is a little slower than polyphase --- more merge passes --- but the space savings probably justify it.

One thing I'd like to know before we make any decisions is whether this problem is widespread. Can anyone else run performance tests of the speed of large sorts, using current sources vs. 6.5.*?

			regards, tom lane
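For a rough feel for the pass-count versus space tradeoff mentioned above, here is a back-of-the-envelope sketch. It assumes the seven tapes are split roughly in half for a balanced merge (a 3-way merge order) and treats polyphase as if it were a plain 6-way merge, which understates polyphase's advantage somewhat; the run count is made up for illustration, and the space figures are the ones quoted in the message.

```c
/*
 * Back-of-the-envelope comparison of merge passes for a hypothetical
 * number of initial runs.  Simplification: a balanced merge over 7 tapes
 * is treated as a 3-way merge (half the tapes are inputs), and polyphase
 * is approximated as a plain 6-way merge.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double nruns = 1000.0;   /* hypothetical number of initial sorted runs */

    int balanced_passes  = (int) ceil(log(nruns) / log(3));  /* 3-way merge */
    int polyphase_passes = (int) ceil(log(nruns) / log(6));  /* ~6-way merge */

    printf("balanced merge:  ~%d full passes, peak temp space ~2X data\n",
           balanced_passes);
    printf("polyphase merge: ~%d passes, old code's peak space ~4X data\n",
           polyphase_passes);
    return 0;
}
```

Each full balanced-merge pass rereads and rewrites all of the data, so the 2X space bound is paid for with extra sequential I/O, which matches the "a little slower --- more merge passes" tradeoff described above.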