Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Date
Msg-id d8e750ae-67cf-4e21-8ef6-e161656c90c5@vondra.me
In response to Re: index prefetching  (Konstantin Knizhnik <knizhnik@garret.ru>)
List pgsql-hackers

On 12/18/25 14:57, Konstantin Knizhnik wrote:
> 
> On 17/12/2025 9:54 PM, Tomas Vondra wrote:
>> On 12/17/25 20:30, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-12-17 13:49:43 -0500, Peter Geoghegan wrote:
>>>> On Wed, Dec 17, 2025 at 12:19 PM Konstantin Knizhnik
>>>> <knizhnik@garret.ru> wrote:
>>>>> Moreover with `enable_indexscan_prefetch=off` results are the same.
>>>> It's quite unlikely that the current heuristics that trigger
>>>> prefetching would have ever allowed any prefetching, for queries such
>>>> as these.
>>>>
>>>> The exact rule right now is that we don't even begin prefetching until
>>>> we've already read at least one index leaf page, and have to read
>>>> another one. So it's impossible to use prefetching with a LIMIT of 1,
>>>> with queries such as these. It's highly unlikely that you'd see any
>>>> benefits from prefetching even with LIMIT 100 (usually we wouldn't
>>>> even begin prefetching).
>>> Note that due to the tuple size and fillfactor in Konstantin's
>>> workload, there
>>> will be one tuple per page... That should allow for some prefetching.
>>>
>> Yes, but that's in the heap. The mechanism Peter described is about leaf
>> pages in the index, and the index has the usual fillfactor. So there'll
>> be many index entries per leaf.
>>
> I slightly changed my benchmark setup:
> 
> create table t (pk integer, sk integer, payload text default repeat('x',
> 1000)) with (fillfactor=10);
> insert into t values (generate_series(1,10000000),random()*10000000);
> create index on t(sk);
> 
> select.sql:
> 
> \set sk random(1, 10000000)
> select * from t where sk >= :sk order by sk limit N;
> 
> You are right. There is almost no effect of prefetch for limit=100, but
> ~2x improvement for limit=1000:
> 
> eio\limit       1     100    1000
>        10   11102     142      28
>         0   11419     137      14
> 
> master:
> limit           1     100    1000
>             11480     130      13
> 
> One of the motivations for my experiments was to check that there is no
> performance degradation because of batching.
> And it is nice that there is no performance penalty here.
> Still, it is not quite clear to me why there is no positive effect at
> all for LIMIT 100.

The technical reason is that batch_getnext() does this:

  /* Delay initializing stream until reading from scan's second batch */
  if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
      enable_indexscan_prefetch)
      scan->xs_heapfetch->rs =
          read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                     ....);

which means we create the read_stream (which is what enables the
prefetching) only when reading the scan's second batch. And with LIMIT
100 we likely read just a single leaf page (= a single batch) most of
the time, because a leaf at the default fillfactor holds hundreds of
index entries. So there's no read_stream and thus no prefetching.

You can try disabling this "priorbatch" condition, so that the
read_stream gets created right away.
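
For a quick experiment (just a sketch of the idea, not a proposed
patch), it should be enough to drop the priorbatch check from the
condition above, so the stream gets created before the first batch too:

  /* sketch: create the read stream before reading the first batch */
  if (!scan->xs_heapfetch->rs && !batchqueue->disabled &&
      enable_indexscan_prefetch)
      scan->xs_heapfetch->rs =
          read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                     ....);

That should show whether the LIMIT 100 case benefits from prefetching
once the stream actually exists.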

> Reading 100 random heap pages should definitely take advantage of AIO.
> We have also implemented prefetching for index-only scans in Neon, and
> there the effect for a similar query is quite noticeable (~3x).
> But in the Neon architecture the price of I/O is much higher, because
> it requires network communication with the page server.
> 

True, but only if the data is not already in memory / shared buffers.
IIRC this "priorbatch" logic mitigates regressions for cached workloads,
because the read_stream initialization is expensive enough to hurt small
queries when no I/O is needed.

Maybe the tradeoff is different for Neon, which probably can't rely on
cache that much? It's also true that tying this to the number of batches
is a bit coarse, because the batch size can vary a lot; it can easily be
a couple of items or hundreds of items.

I believe we're open to alternative ideas.
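
Just to illustrate the kind of alternative I have in mind (the counter
and threshold names below are made up, this is not existing code), the
trigger could be based on the number of heap TIDs the scan has returned
rather than on the number of batches:

  /* hypothetical: start prefetching once the scan returned enough TIDs */
  if (scan->xs_prefetch_tid_count >= PREFETCH_START_THRESHOLD &&
      !scan->xs_heapfetch->rs && !batchqueue->disabled &&
      enable_indexscan_prefetch)
      scan->xs_heapfetch->rs =
          read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                     ....);

That would make the decision independent of how many items happen to
fit into each batch.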


regards

-- 
Tomas Vondra



