Re: index prefetching - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: index prefetching
Date
Msg-id 6ba8a743-2b58-4c5a-8a39-28ccc130a069@garret.ru
In response to Re: index prefetching  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On 29/12/2025 1:53 AM, Tomas Vondra wrote:
> It seems this is due to sending an extra SET (for the new GUC) in the
> pgbench script, which is recognized only on the v5+threshold build.
>
> That's a thinko on my side, I should have realized the extra command
> might affect this. It doesn't really affect the behavior, because 10 is
> the default value for read_stream_threshold. I've fixed the script, will
> check fresh results tomorrow.
>
> Still, I think most of what I said about heuristics when to initialize
> the read stream, and the risk/benefit tradeoff, still applies.

I did a lot of experiments this morning but could not find any 
noticeable difference in any configuration when the whole working set 
fits in shared buffers.
And frankly speaking, after more thought I do not see a good reason that 
could explain such a difference.
Initialization of the read stream by itself should not add much 
overhead - it does not seem to be an expensive operation.
What actually matters is async I/O. Without a read stream, Postgres reads 
heap pages synchronously: the backend just calls pread.
With a read stream, AIO is used. By default the "worker" AIO mode is used, 
which means the backend sends a request to one of the workers and waits for 
its completion. The worker receives the request, performs the I/O, and 
notifies the backend. Such interprocess communication adds significant 
overhead, and this is why, if we initialize the read stream from the very 
beginning, we get about 4x worse performance with LIMIT 1.

Please correct me if I am wrong (or if this is Mac-specific), but the 
slowdown is not caused by any overhead of the read stream itself, but by AIO.
I have not run this experiment, but it seems to me that if we made the read 
stream perform synchronous reads, there would be almost no difference in 
performance.
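
One way to check this hypothesis without touching the read stream code might 
be to switch the AIO subsystem itself to synchronous mode. This sketch 
assumes the `io_method` GUC (added in PostgreSQL 18, default `worker`), 
which can only be changed with a server restart:

```
# postgresql.conf -- requires restart; sketch assuming PG 18's io_method GUC
io_method = sync    # AIO subsystem falls back to plain synchronous reads
```

If the regression disappears with this setting, that would point at the 
worker-mode interprocess communication rather than at read stream 
initialization itself.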

When all data is cached in shared buffers, we do not perform any I/O at all.
That means it doesn't matter whether and when we initialize the read stream: 
we can do it after processing 10 items (the current default) or immediately 
- it should not affect performance.
And this is what I have tested: performance does not actually depend on 
`read_stream_threshold` (if the data fits in shared buffers).
At least the difference is within a few percent and may just be random 
fluctuation. There is certainly no 25% degradation.
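
For reference, such an experiment might look like this 
(`read_stream_threshold` exists only in the patched v5+threshold build, and 
the table and query here are just placeholders):

```
-- read_stream_threshold is the GUC from the v5+threshold build
SET read_stream_threshold = 0;    -- initialize the read stream immediately
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_tbl WHERE id > 100 LIMIT 1;

SET read_stream_threshold = 10;   -- current default: after 10 index items
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_tbl WHERE id > 100 LIMIT 1;
```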


It certainly doesn't mean that it is impossible to find a scenario where 
this approach of enabling prefetch after processing N items shows worse 
performance than master or v5 - we just need to choose the cache hit rate 
appropriately. But the same is true, IMHO, for v5 itself: it is possible to 
find a workload where it shows the same degradation compared with master.


A more precise heuristic should IMHO take into account the actual number of 
disk reads performed.
Please note that I do not want to predict the number of disk reads - i.e. 
check whether the prefetch candidates are present in shared buffers.
That really would add significant overhead. I think it is better to use the 
number of reads already performed as the threshold.

Unfortunately, it looks like it is not possible to accumulate such 
information without changing other Postgres code.
For example, if `ReadBuffer` could somehow inform the caller that it 
actually performed a read, then the number of reads could easily be counted 
in `heapam_index_fetch_tuple`:

```
static pg_attribute_always_inline Buffer
ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
                  ForkNumber forkNum, BlockNumber blockNum,
                  ReadBufferMode mode, BufferAccessStrategy strategy,
                  bool *fast_path)
{
    ...
    if (StartReadBuffer(&operation,
                        &buffer,
                        blockNum,
                        flags))
    {
        /* buffer was not cached: an actual read is performed */
        WaitReadBuffers(&operation);
        *fast_path = false;
    }
    else
        *fast_path = true;
    return buffer;
}
```
This could certainly be achieved without changing the ReadBuffer* family, 
just by calling StartReadBuffer directly from `heapam_index_fetch_tuple` 
instead of `ReadBuffer`.
That is not so nice because we would have to duplicate some bufmgr code - 
not much (the check for a local relation and filling the 
`ReadBuffersOperation` structure), but it is better to avoid it.





