proposal: be smarter about i/o patterns in index scan

Greetings all,

We have noticed a way to make a major improvement in Pg's performance on
our workload, and I would like to get your thoughts before I go off to
work up a patch.

The basic problem is that Pg seeks far too much when performing an index
scan.  I have saved an strace of a backend which is selecting 170,000
rows from a 240,000,000 row table via index scan.  The strace shows
134,000 seeks and reads, or almost one seek for every tuple in the
result set.  This would be fine normally except the seeks are in a
*very* suboptimal pattern, swinging back and forth across the device. 
The query requires 16 minutes, 30 seconds on our test hardware.

The proposal is to sort the block requests before they are issued.
Because Pg issues each seek/read pair synchronously, one at a time, the
kernel never has more than one request queued and so gets no chance to
apply its elevator algorithm; every seek/read Pg issues therefore
results in actual movement of the disk head.  Pg's random seek pattern
also defeats the kernel's readahead.
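To make the idea concrete, here is a minimal sketch (ordinary C, not
backend code; the file path and block numbers are made up) of what
"sort, then read" looks like at the syscall level.  The only change
versus what Pg does today is the qsort before the read loop; the reads
themselves are still synchronous and 8K at a time:

/*
 * Sketch only: sort the heap block numbers an index scan is about to
 * visit, then read them in ascending order so the disk head sweeps the
 * file in one direction.  Path and block list are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

static int
cmp_blkno(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *) a;
    unsigned long y = *(const unsigned long *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    /* stand-in for the blocks an index scan wants, in index order */
    unsigned long blocks[] = {812345, 17, 504022, 99000, 3, 812346};
    size_t        nblocks = sizeof(blocks) / sizeof(blocks[0]);
    char          buf[BLCKSZ];
    int           fd = open("/path/to/heap/file", O_RDONLY);

    if (fd < 0)
        return 1;

    /* sort by block number before issuing any reads */
    qsort(blocks, nblocks, sizeof(blocks[0]), cmp_blkno);

    for (size_t i = 0; i < nblocks; i++)
        if (pread(fd, buf, BLCKSZ, (off_t) blocks[i] * BLCKSZ) != BLCKSZ)
            perror("pread");

    close(fd);
    return 0;
}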

To test the proposal I ran a simulation using the seek positions
recorded in the strace.  We noted that Pg issued 403 seek/read pairs
for every 8192 bytes read from the index.  So we performed four runs:
a straight playback of Pg's seek pattern, a playback where each group
of 403 seeks is sorted by byte offset, a playback with all of the seeks
sorted by offset, and a sequential scan of the 13GB data file (a sketch
of the playback harness follows the results).

Straight playback of Pg's pattern:  16m30s
Sorted in chunks of 403:            10m6s
All seeks sorted by offset:         4m19s
Sequential scan of the file:        6m7s

As the numbers show, there's a lot to be gained here.  If Pg were able
to issue its seeks in the optimal order, the query would return in
about a quarter of the time.  Even the chunk-sorted scheme is a big win.
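The playback harness can be as simple as the sketch below.  The
offsets-on-stdin format and the "sorted" switch are illustrative
choices for this example, not a description of the exact harness we
used:

/*
 * Playback sketch: read 8 KB at each byte offset listed on stdin (one
 * offset per line) against the file named in argv[1].  Pass any second
 * argument to sort the offsets before replaying them.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int
cmp_off(const void *a, const void *b)
{
    off_t x = *(const off_t *) a;
    off_t y = *(const off_t *) b;

    return (x > y) - (x < y);
}

int
main(int argc, char **argv)
{
    off_t    *offsets = NULL;
    size_t    n = 0, cap = 0;
    long long val;
    char      buf[8192];
    int       fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;

    while (scanf("%lld", &val) == 1)
    {
        if (n == cap)
        {
            cap = cap ? cap * 2 : 1024;
            offsets = realloc(offsets, cap * sizeof(off_t));
            if (offsets == NULL)
                return 1;
        }
        offsets[n++] = (off_t) val;
    }

    if (argc > 2)                 /* "sorted" playback */
        qsort(offsets, n, sizeof(off_t), cmp_off);

    for (size_t i = 0; i < n; i++)
    {
        lseek(fd, offsets[i], SEEK_SET);
        if (read(fd, buf, sizeof(buf)) < 0)
            perror("read");
    }

    close(fd);
    return 0;
}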

So the proposal is this: the offsets stored in the index ought to be
sorted.  Either A) they can be stored sorted in the first place (sorted
in VACUUM ANALYZE, or always sorted), or B) the backend can sort each
list of tuples it reads from the index, or C) the backend can read the
entire index result and sort that (this would be the best).

From just paging through the code, none of these looks terribly hard.
B seems the easiest to implement but gives the least improvement; even
a seqscan beats it in the numbers above.  One improvement to B would be
to read much more than 8K from the index at a time; reading e.g. 256KB
would improve things dramatically.  A might be easy but would either
degrade over time or slow down inserts.  C is the best and the hardest
to implement (I am guessing, because the size of the sorted index
subset is unbounded).  A rough sketch of what B could look like
follows.
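Here is option B in miniature, in plain C rather than executor code.
In the backend the tuple pointers would be ItemPointerData values and
the fetch would go through the heap access routines; the struct and
fetch function below are stand-ins for illustration:

/*
 * Sketch of option B: take the tuple pointers returned from one index
 * page, sort them by heap block number, and fetch in that order.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned int   block;       /* heap block number */
    unsigned short offnum;      /* line pointer within the block */
} TidStub;

static int
cmp_tid(const void *a, const void *b)
{
    const TidStub *x = a, *y = b;

    if (x->block != y->block)
        return (x->block > y->block) ? 1 : -1;
    return (x->offnum > y->offnum) - (x->offnum < y->offnum);
}

/* placeholder for the real heap-fetch path */
static void
fetch_heap_tuple(TidStub tid)
{
    printf("fetch block %u, offset %u\n", tid.block, tid.offnum);
}

int
main(void)
{
    /* stand-in for the TIDs read from one 8K index page */
    TidStub tids[] = {{812345, 3}, {17, 1}, {504022, 9}, {17, 4}};
    size_t  ntids = sizeof(tids) / sizeof(tids[0]);

    /* sort by block so heap fetches sweep the file in one direction */
    qsort(tids, ntids, sizeof(TidStub), cmp_tid);

    for (size_t i = 0; i < ntids; i++)
        fetch_heap_tuple(tids[i]);

    return 0;
}

Note that reading 256KB of index at a time, or the whole index result
as in C, is the same code with a bigger array, which is why B shades
into C as the batch grows.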

Your thoughts are appreciated.

-jwb


