Home > mailing lists

Re: index prefetching - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: index prefetching
Date	August 14, 2025 23:44:14
Msg-id	kvyser45imw3xmisfvpeoshisswazlzw35el3fq5zg73zblpql@f56enfj45nf7 Whole thread
In response to	Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
Responses	Re: index prefetching
List	pgsql-hackers

Tree view

Hi,

On 2025-08-14 15:45:26 -0400, Peter Geoghegan wrote:
> On Thu, Aug 14, 2025 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Then why does the exact same pair of runs show "I/O Timings: shared
> > read=194.629" for the sequential table backwards scan (with total
> > execution time 1132.360 ms), versus "I/O Timings: shared read=352.88"
> > (with total execution time 697.681 ms) for the random table backwards
> > scan?
>
> If you're interested in trying this out for yourself, I've pushed my
> working branch here:
>
> https://github.com/petergeoghegan/postgres/tree/index-prefetch-batch-v1.2
>
> Note that the test case you'll run is added by the most recent commit:
>
> https://github.com/petergeoghegan/postgres/commit/c9ceb765f3b138f53b7f1fdf494ba7c816082aa1
>
> Run microbenchmarks/random_backwards_weird.sql to do an initial load
> of both of the tables. Then run
> microbenchmarks/queries_random_backwards_weird.sql to actually run the
> relevant queries. There are 4 such queries, but only the 2 backwards
> scan queries really seem relevant.

Interesting. In the sequential case I see some waits that are not attributed
in explain, due to the waits happening within WaitIO(), not WaitReadBuffers().
Which indicates that the read stream is trying to re-read a buffer that
previously started being read.

   read_stream_start_pending_read()
-> StartReadBuffers()
-> AsyncReadBuffers()
-> ReadBuffersCanStartIO()
-> StartBufferIO()
-> WaitIO()

There are far fewer cases of this in the random case.

From what I can tell the sequential case so often will re-read a buffer that
it is already in the process of reading - and thus wait for that IO before
continuing - that we don't actually keep enough IO in flight.

In your email with iostat output you can see that the slow case has
aqu-sz=5.18, while the fast case has aqu-sz=10.06, i.e. the fast case has
twice as much IO in flight. While both have IOs take the same amount of time
(r_await=0.20). Which certainly explains the performance difference...

We can optimize that by deferring the StartBufferIO() if we're encountering a
buffer that is undergoing IO, at the cost of some complexity.  I'm not sure
real-world queries will often encounter the pattern of the same block being
read in by a read stream multiple times in close proximity sufficiently often
to make that worth it.

Greetings,

Andres Freund

pgsql-hackers by date:

From: Masahiko Sawada
Date: 14 August 2025, 23:40:25
Subject: Re: POC: Parallel processing of indexes in autovacuum

From: Peter Geoghegan
Date: 15 August 2025, 00:06:07
Subject: Re: index prefetching

Re: index prefetching - Mailing list pgsql-hackers

Previous

Next