Hi,
On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote:
> On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > We can optimize that by deferring the StartBufferIO() if we're encountering a
> > > buffer that is undergoing IO, at the cost of some complexity.  I'm not sure
> > > real-world queries will often encounter the pattern of the same block being
> > > read in by a read stream multiple times in close proximity sufficiently often
> > > to make that worth it.
> >
> > We definitely need to be prepared for duplicate prefetch requests in
> > the context of index scans.
> 
> Can you (or anybody else) think of a quick and dirty way of working
> around the problem on the read stream side? I would like to prioritize
> getting the patch into a state where its overall performance profile
> "feels right". From there we can iterate on fixing the underlying
> issues in more principled ways.
I think I can see a way to fix the issue, below read stream. Basically,
whenever AsyncReadBuffers() finds a buffer that has ongoing IO, instead of
waiting, as we do today, copy the wref to the ReadBuffersOperation() and set a
new flag indicating that we are waiting for an IO that was not started by the
wref. Then, in WaitReadBuffers(), we wait for such foreign started IOs. That
has to be somewhat different code from today, because we have to deal with the
fact of the "foreign" IO potentially having failed.
I'll try writing a prototype for that tomorrow. I think to actually get that
into a committable shape we need a test harness (probably a read stream
controlled by an SQL function that gets an array of buffers).
> FWIW it wouldn't be that hard to require the callback (in our case
> index_scan_stream_read_next) to explicitly point out that it knows
> that the block number it's requesting has to be a duplicate. It might
> make sense to at least place that much of the burden on the
> callback/client side.
The problem actually exists outside of your case. E.g. if you have multiple
backends doing a synchronized seqscan on the same relation, performance
regresses, because we often end up synchronously waiting for IOs started by
another backend. I don't think it has quite as large an effect for that as it
has here, because the different scans basically desynchronize whenever it
happens due to the synchronous waits slowing down the waiting backend a lot),
limiting the impact somewhat.
Greetings,
Andres Freund