Re: index prefetching - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: index prefetching |
Date | |
Msg-id | vbmf25wqadwulu53aldab7yqpqypqoct5lix7q2wdaz2wjm7me@4tlah4v7siio |
In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
List | pgsql-hackers |
Hi,

On 2025-08-14 00:23:49 +0200, Tomas Vondra wrote:
> On 8/13/25 23:37, Andres Freund wrote:
> > On 2025-08-13 23:07:07 +0200, Tomas Vondra wrote:
> >> On 8/13/25 16:44, Andres Freund wrote:
> >>> On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote:
> >>>> In fact, I believe this is about io_method. I initially didn't see the
> >>>> difference you described, and then I realized I set io_method=sync to
> >>>> make it easier to track the block access. And if I change io_method to
> >>>> worker, I get different stats, that also change between runs.
> >>>>
> >>>> With "sync" I always get this (after a restart):
> >>>>
> >>>>    Buffers: shared hit=7435 read=52801
> >>>>
> >>>> while with "worker" I get this:
> >>>>
> >>>>    Buffers: shared hit=4879 read=52801
> >>>>    Buffers: shared hit=5151 read=52801
> >>>>    Buffers: shared hit=4978 read=52801
> >>>>
> >>>> So not only does it change from run to run, it also does not add up to 60236.
> >>>
> >>> This is reproducible on master? If so, how?
> >>>
> >>>
> >>>> I vaguely recall I ran into this some time ago during AIO benchmarking,
> >>>> and IIRC it's due to how StartReadBuffersImpl() may behave differently
> >>>> depending on I/O started earlier. It only calls PinBufferForBlock() in
> >>>> some cases, and PinBufferForBlock() is what updates the hits.
> >>>
> >>> Hm, I don't immediately see an issue there. The only case we don't call
> >>> PinBufferForBlock() is if we already have pinned the relevant buffer in a
> >>> prior call to StartReadBuffersImpl().
> >>>
> >>>
> >>> If this happens only with the prefetching patch applied, is it possible that
> >>> what happens here is that we occasionally re-request buffers that are already in
> >>> the process of being read in? That would only happen with a read stream and
> >>> io_method != sync (since with sync we won't read ahead). If we have to start
> >>> reading in a buffer that's already undergoing IO we wait for the IO to
> >>> complete and count that access as a hit:
> >>>
> >>>     /*
> >>>      * Check if we can start IO on the first to-be-read buffer.
> >>>      *
> >>>      * If an I/O is already in progress in another backend, we want to wait
> >>>      * for the outcome: either done, or something went wrong and we will
> >>>      * retry.
> >>>      */
> >>>     if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
> >>>     {
> >>>         ...
> >>>         /*
> >>>          * Report and track this as a 'hit' for this backend, even though it
> >>>          * must have started out as a miss in PinBufferForBlock(). The other
> >>>          * backend will track this as a 'read'.
> >>>          */
> >>>         ...
> >>>         if (persistence == RELPERSISTENCE_TEMP)
> >>>             pgBufferUsage.local_blks_hit += 1;
> >>>         else
> >>>             pgBufferUsage.shared_blks_hit += 1;
> >>>         ...
> >>>
> >>>
> >>
> >> I think it has to be this. It only happens with io_method != sync, and
> >> only with effective_io_concurrency > 1. At first I was wondering why I
> >> can't reproduce this for seqscan/bitmapscan, but then I realized those
> >> plans never visit the same block repeatedly - indexscans do that. It's
> >> also not surprising it's timing-sensitive, as it likely depends on how
> >> fast the worker happens to start/complete requests.
> >>
> >> What would be a good way to "prove" it really is this?
> >
> > I'd just comment out those stats increments and then check if the stats are
> > stable afterwards.
> >
>
> I tried that, but it's not enough - the buffer hits get lower, but
> remain variable. It stabilizes only if I comment out the increment in
> PinBufferForBlock() too. At which point it gets to 0, of course ...

Ah, right - that'll be the cases where IO completed before we access it a
second time. There's no good way that I can see that we can make that
deterministic - I mean, we could just search all in-progress IOs before
starting a new IO for a matching block number and wait for all IO to complete
if so. But that seems like an obviously bad idea.

I think there's just some fundamental indeterminism here. I don't think we
gain anything by hiding it...

Greetings,

Andres Freund
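
To make concrete the alternative dismissed above (searching all in-progress IOs for a matching block number before starting a new IO), here is a minimal toy sketch. It is not PostgreSQL code: the `InFlightIO` struct and `find_inflight_io()` are hypothetical names invented for illustration, and real AIO bookkeeping is considerably more involved. It only shows the shape of the lookup that would have to run before every new read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one tracked read; NOT a PostgreSQL struct. */
typedef struct InFlightIO
{
    uint32_t first_block;   /* first block number covered by this IO */
    uint32_t nblocks;       /* number of consecutive blocks it covers */
    bool     completed;     /* true once the IO has finished */
} InFlightIO;

/*
 * Return the index of an unfinished IO that covers 'block', or -1 if no
 * such IO exists. This is a linear scan over every tracked IO, so the
 * cost grows with the number of IOs in flight.
 */
int
find_inflight_io(const InFlightIO *ios, size_t nios, uint32_t block)
{
    assert(ios != NULL || nios == 0);

    for (size_t i = 0; i < nios; i++)
    {
        if (ios[i].completed)
            continue;           /* finished IOs no longer need a wait */

        if (block >= ios[i].first_block &&
            block - ios[i].first_block < ios[i].nblocks)
            return (int) i;     /* caller would have to wait on this IO */
    }

    return -1;                  /* no overlap, a new IO could be started */
}
```

In this toy model a caller would run the scan before issuing each read and stall whenever it returns a match. That is a per-read scan plus a wait in the read-ahead path, which is consistent with the email calling the approach an obviously bad idea.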