proposal: be smarter about i/o patterns in index scan - Mailing list pgsql-hackers
| From | Jeffrey W. Baker |
|---|---|
| Subject | proposal: be smarter about i/o patterns in index scan |
| Date | |
| Msg-id | 1084851137.24232.3.camel@heat |
| Responses | Re: proposal: be smarter about i/o patterns in index scan |
| | Re: proposal: be smarter about i/o patterns in index scan |
| | Re: proposal: be smarter about i/o patterns in index scan |
| List | pgsql-hackers |
Greetings all,

We have noticed a way to make a major improvement in Pg's performance on our workload, and I would like to get your thoughts before I go off to work up a patch.

The basic problem is that Pg seeks far too much when performing an index scan. I have saved an strace of a backend which is selecting 170,000 rows from a 240,000,000-row table via index scan. The strace shows 134,000 seeks and reads, or almost one seek for every tuple in the result set. This would be fine normally, except the seeks are in a *very* suboptimal pattern, swinging back and forth across the device. The query requires 16 minutes, 30 seconds on our test hardware.

The proposal is to sort the block requests before they are issued. Because Pg only issues single seek/read pairs synchronously, the kernel has no chance to apply its elevator, and hence every seek/read Pg issues results in actual movement of the disk head. Pg's random pattern of seeking also makes the kernel's readahead fairly useless.

To prove the proposal I did a simulation, using the recorded seek positions from the strace. We noted that Pg issued 403 seek/read pairs for every 8192 bytes read from the index. So we performed four simulations: a straight playback of Pg's seek pattern, a playback where each list of 403 seeks is sorted by byte offset, a playback of all the seeks sorted by offset, and a sequential scan of the 13GB data file.

PostgreSQL 7.4.2:   16m30s
Sorted in chunks:   10m6s
Sorted altogether:   4m19s
Sequential scan:     6m7s

As you can see, there is a lot to be gained here. If Pg were able to issue its seeks in the optimal order, the query would return in a quarter of the time. Even the chunk-sorted scheme is a big win.

So the proposal is this: the offsets stored in the index ought to be sorted. Either

A) they can be stored sorted in the first place (sorted at VACUUM ANALYZE, or kept sorted always), or
B) the backend can sort each list of tuples it reads from the index, or
C) the backend can read the entire index result and sort that (this would be the best).

From just paging through the code, it doesn't seem terribly hard. B seems the easiest to implement but also gives the smallest improvement; even a seqscan beats B. One improvement to B would be to read much more than 8K from the index: reading e.g. 256KB would improve things dramatically. A might be easy but would either degrade over time or cause slower inserts. C is the best and the hardest to implement (I am guessing, because the size of the sorted index subset is unbounded).

Your thoughts are appreciated.

-jwb
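For concreteness, here is roughly what the "sorted in chunks" replay could look like as a standalone program. This is only an illustrative sketch, not the proposed patch: the input file names (`offsets.txt`, `datafile`), the CHUNK constant, and the program itself are assumptions made for the example.

```c
/*
 * Illustrative sketch only (not the proposed patch): replay a list of
 * byte offsets against a data file, sorting each chunk of 403 requests
 * by offset before issuing the 8K reads -- i.e. the "sorted in chunks"
 * variant of the simulation described above.  File names and the chunk
 * size are assumptions for the example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ  8192
#define CHUNK   403     /* seek/read pairs observed per 8K index page */

static int
cmp_off(const void *a, const void *b)
{
    off_t   x = *(const off_t *) a;
    off_t   y = *(const off_t *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    FILE   *list = fopen("offsets.txt", "r");   /* one byte offset per line */
    int     fd = open("datafile", O_RDONLY);
    off_t   chunk[CHUNK];
    char    buf[BLCKSZ];
    long long off;
    int     n = 0;

    if (list == NULL || fd < 0)
        return 1;

    while (fscanf(list, "%lld", &off) == 1)
    {
        chunk[n++] = (off_t) off;
        if (n == CHUNK)
        {
            /* sort this chunk so the head sweeps across the device once */
            qsort(chunk, n, sizeof(off_t), cmp_off);
            for (int i = 0; i < n; i++)
                (void) pread(fd, buf, BLCKSZ, chunk[i]);
            n = 0;
        }
    }

    /* replay any trailing partial chunk */
    qsort(chunk, n, sizeof(off_t), cmp_off);
    for (int i = 0; i < n; i++)
        (void) pread(fd, buf, BLCKSZ, chunk[i]);

    fclose(list);
    close(fd);
    return 0;
}
```

Sorting everything at once (the 4m19s case) would be the same program with CHUNK large enough to hold the whole offset list; inside the backend, the analogous step would presumably be sorting the heap block numbers taken from an index page before fetching them, which is essentially option B above.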