Re: index prefetching - Mailing list pgsql-hackers

From Andres Freund
Subject Re: index prefetching
Date
Msg-id vbmf25wqadwulu53aldab7yqpqypqoct5lix7q2wdaz2wjm7me@4tlah4v7siio
Whole thread Raw
In response to Re: index prefetching  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi,

On 2025-08-14 00:23:49 +0200, Tomas Vondra wrote:
> On 8/13/25 23:37, Andres Freund wrote:
> > On 2025-08-13 23:07:07 +0200, Tomas Vondra wrote:
> >> On 8/13/25 16:44, Andres Freund wrote:
> >>> On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote:
> >>>> In fact, I believe this is about io_method. I initially didn't see the
> >>>> difference you described, and then I realized I set io_method=sync to
> >>>> make it easier to track the block access. And if I change io_method to
> >>>> worker, I get different stats, that also change between runs.
> >>>>
> >>>> With "sync" I always get this (after a restart):
> >>>>
> >>>>    Buffers: shared hit=7435 read=52801
> >>>>
> >>>> while with "worker" I get this:
> >>>>
> >>>>    Buffers: shared hit=4879 read=52801
> >>>>    Buffers: shared hit=5151 read=52801
> >>>>    Buffers: shared hit=4978 read=52801
> >>>>
> >>>> So not only it changes run to tun, it also does not add up to 60236.
> >>>
> >>> This is reproducible on master? If so, how?
> >>>
> >>>
> >>>> I vaguely recall I ran into this some time ago during AIO benchmarking,
> >>>> and IIRC it's due to how StartReadBuffersImpl() may behave differently
> >>>> depending on I/O started earlier. It only calls PinBufferForBlock() in
> >>>> some cases, and PinBufferForBlock() is what updates the hits.
> >>>
> >>> Hm, I don't immediately see an issue there. The only case we don't call
> >>> PinBufferForBlock() is if we already have pinned the relevant buffer in a
> >>> prior call to StartReadBuffersImpl().
> >>>
> >>>
> >>> If this happens only with the prefetching patch applied, is is possible that
> >>> what happens here is that we occasionally re-request buffers that already in
> >>> the process of being read in? That would only happen with a read stream and
> >>> io_method != sync (since with sync we won't read ahead). If we have to start
> >>> reading in a buffer that's already undergoing IO we wait for the IO to
> >>> complete and count that access as a hit:
> >>>
> >>>     /*
> >>>      * Check if we can start IO on the first to-be-read buffer.
> >>>      *
> >>>      * If an I/O is already in progress in another backend, we want to wait
> >>>      * for the outcome: either done, or something went wrong and we will
> >>>      * retry.
> >>>      */
> >>>     if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
> >>>     {
> >>> ...
> >>>         /*
> >>>          * Report and track this as a 'hit' for this backend, even though it
> >>>          * must have started out as a miss in PinBufferForBlock(). The other
> >>>          * backend will track this as a 'read'.
> >>>          */
> >>> ...
> >>>         if (persistence == RELPERSISTENCE_TEMP)
> >>>             pgBufferUsage.local_blks_hit += 1;
> >>>         else
> >>>             pgBufferUsage.shared_blks_hit += 1;
> >>> ...
> >>>
> >>>
> >>
> >> I think it has to be this. It only happens with io_method != sync, and
> >> only with effective_io_concurrency > 1. At first I was wondering why I
> >> can't reproduce this for seqscan/bitmapscan, but then I realized those
> >> plans never visit the same block repeatedly - indexscans do that. It's
> >> also not surprising it's timing-sensitive, as it likely depends on how
> >> fast the worker happens to start/complete requests.
> >>
> >> What would be a good way to "prove" it really is this?
> > 
> > I'd just comment out those stats increments and then check if the stats are
> > stable afterwards.
> > 
> 
> I tried that, but it's not enough - the buffer hits gets lower, but
> remains variable. It stabilizes only if I comment out the increment in
> PinBufferForBlock() too. At which point it gets to 0, of course ...

Ah, right - that'll be the cases where IO completed before we access it a
second time. There's no good way that I can see that we can make that
deterministic - I mean, we could just search all in-progress IOs before
starting a new IO for a matching block number and wait for all IO to complete
if so. But that seems like an obviously bad idea.

I think there's just some fundamental indeterminisism here. I don't think we
gain anything by hiding it...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: index prefetching
Next
From: Rahila Syed
Date:
Subject: Re: Enhancing Memory Context Statistics Reporting