Re: proposal: be smarter about i/o patterns in index scan - Mailing list pgsql-hackers

From Glen Parker
Subject Re: proposal: be smarter about i/o patterns in index scan
Date
Msg-id AJEKKAIECKNMBCEKADJPGENNCEAA.glenebob@nwlink.com
Whole thread Raw
In response to proposal: be smarter about i/o patterns in index scan  ("Jeffrey W. Baker" <jwbaker@acm.org>)
List pgsql-hackers
> The basic problem is that Pg seeks far too much when performing an index
> scan.  I have saved an strace of a backend which is selecting 170,000
> rows from a 240,000,000 row table via index scan.  The strace shows
> 134,000 seeks and reads, or almost one seek for every tuple in the
> result set.  This would be fine normally except the seeks are in a
> *very* suboptimal pattern, swinging back and forth across the device.
> The query requires 16 minutes, 30 seconds on our test hardware.

Yes yes yes *please* fix this :-)  This performance bottle neck in PG
effects us all the time.


> The proposal is to sort the block requests before they are issued.
> Because Pg only issues single seek/read pairs synchronously, the kernel
> has no chance to apply its elevator and hence every seek/read Pg issues
> results in actual movement of the disk head.  Pg's random pattern of
> seeking also makes the kernel's readahead fairly useless.
>
> To prove the proposal I did a simulation, using the recorded seek
> positions in the strace.  We noted that Pg issued 403 seek/read pairs
> for every 8192 bytes read from the index.  So we performed four
> simulations: a straight playback of Pg's seek pattern, a playback where
> each list of 403 seeks is sorted by byte offset, a playback of all the
> seeks sorted by offset, and a sequential scan of the 13GB data file.
>
> PostgreSQL 4.2.1:  16m30s
> Sorted in chunks:  10m6s
> Sorted altogether: 4m19s
> Sequential scan:   6m7s
>
> As you can see, there's a lot to be gained here.  If Pg was able to
> issue its seeks in the optimal order, the query would return in 1/4th
> the time.  Even the chunk-sorted scheme is a big win.

I think your worst case may not be as bad as it gets.  Nothing scientific,
but my experience tells me that it's common to get even worse performance
than that.  I've had queries that would take seemingly forever, but then
after a fresh cluster and cache flush, the same query would take almost no
time at all.

I also think that with a multi-mirrored RAID setup and your proposed IO
sorting, the mirroring would be more likely to overcome seek overhead.  With
the current implementation, it seems almost useless to throw more hardware
at the problem.

I think the improvement would be even 'huger' than your numbers show :-)


> So the proposal is this: the offsets stored in the index ought to be
> sorted.  Either A) they can be stored sorted in the first place (sorted
> in VACUUM ANALYZE, or always sorted), or B) the backend can sort each
> list of tuples it reads from the index, or C) the backend can read the
> entire index result and sort that (this would be the best).
>
> >From just paging through the code, it doesn't seem terribly hard.  B
> seems the easiest to implement but is also the least improvement.  Even
> seqscan is better than B.  One improvement to B would be to read much
> more than 8K from the index.  Reading e.g. 256KB would improve things
> dramatically.  A might be easy but would either degrade over time or
> cause slower inserts.  C is the best and hardest to implement (I am
> guessing, because the size of sorted index subset is unbounded).

IMO if you're going to do it, do it right.  That means A is pretty much out,
again, IMO.  It would seem that B (improved) would be the way to go because,
as you say, C would produce unbounded subsets.  I would propose to read very
large sections of the index subset though, more in the order of several MB
(configurable, of course).  With that much space to play with, most queries
would require only one pass at the index and therefore show performance
equal to C.  Larger queries would suffer a bit, but then we don't usually
have such speedy expectations of large expected result sets, and again, with
enough sort space to play with, the performance could approach that of C.

I would think that you could also sort the index on index order (during
vacuum) to improve index scan performance.  This could be a completely
seperate implementation task.  With clean sequential index access *and*
clean sequential heap access, it might just prove to be the single largest
performance boost PG has seen, well, ever.

My .02...

Glen Parker



pgsql-hackers by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: Call for 7.5 feature completion
Next
From: Peter Eisentraut
Date:
Subject: Re: Call for 7.5 feature completion