Re: index prefetching - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: index prefetching
Date
Msg-id CAH2-Wzko86NwiENCJGtakJ=fOhWpr-Yz-F+1oxgv2Ku1mvXwvA@mail.gmail.com
In response to Re: index prefetching  (Tomas Vondra <tomas@vondra.me>)
Responses Re: index prefetching
List pgsql-hackers
On Tue, Aug 12, 2025 at 7:10 PM Tomas Vondra <tomas@vondra.me> wrote:
> Actually, this might be a consequence of how backwards scans work (at
> least in btree). I logged the block in index_scan_stream_read_next, and
> this is what I see in the forward scan (at the beginning):

Just to be clear: you did disable deduplication and then reindex,
right? You're accounting for the known issue with posting lists
returning their TIDs in the wrong order, relative to the scan
direction (when the scan direction is backwards)?

It won't be necessary to do this once I commit my patch that fixes the
issue directly, on the nbtree side, but for now deduplication messes
things up here. And so for now you have to work around it.

> But with the backwards scan we apparently scan the values backwards, but
> then the blocks for each value are accessed in forward direction. So we
> do a couple blocks "forward" and then jump to the preceding value - but
> that's a couple blocks *back*. And that breaks the lastBlock check.

I don't think that this should be happening. The read stream ought to
be seeing blocks in exactly the same order as everything else.
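
To make the quoted concern concrete, here is a toy sketch in Python of a
"skip the block if it repeats the previous one" filter, applied to three
heap block sequences. It is not code from the patch: the function name
stream_blocks, the variable names, and the block numbers are all
hypothetical, and the only thing taken from the thread is the idea of a
lastBlock check in the read stream callback. The sketch assumes two
adjacent index values whose TIDs share one heap block.

```python
def stream_blocks(blocks):
    """Yield each block unless it repeats the block yielded just before it.

    A stand-in for a lastBlock-style check; purely illustrative.
    """
    last_block = None
    for block in blocks:
        if block != last_block:
            yield block
            last_block = block


# Hypothetical layout: value 1 has TIDs on blocks 1, 1, 2 and
# value 2 has TIDs on blocks 2, 2, 3 (block 2 is shared).
forward = [1, 1, 2, 2, 2, 3]

# Backward scan that is an exact mirror image of the forward scan.
backward_exact = [3, 2, 2, 2, 1, 1]

# Backward over values, but forward within each value's TIDs.
backward_mixed = [2, 2, 3, 1, 1, 2]

forward_requests = list(stream_blocks(forward))
backward_exact_requests = list(stream_blocks(backward_exact))
backward_mixed_requests = list(stream_blocks(backward_mixed))
```

With these made-up inputs, the forward scan and the exact mirror image
each request three blocks, while the mixed order requests block 2 twice,
because the repeat is no longer adjacent to the earlier request.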

> I believe this applies both to master and the prefetching, except that
> master doesn't have read stream - so it only does sync I/O.

In what sense is it an issue on master?

On master, we simply access the TIDs in whatever order amgettuple
returns TIDs in. That should always be scan order/index key space
order, where heap TID counts as a tie-breaker/affects the key space in
the presence of duplicates (at least once that issue with posting
lists is fixed, or once deduplication has been disabled in a way that
leaves no posting list TIDs around via a reindex).
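
The ordering rule described above can be modeled in a few lines of
Python. This is a toy sketch, not nbtree code: the entries, the function
names, and the way a posting list is modeled (one group of TIDs per key,
always read in ascending TID order) are assumptions made for
illustration only.

```python
from itertools import groupby

# Hypothetical index entries: (key, (heap_block, offset)).
ENTRIES = [
    (1, (10, 1)), (1, (11, 1)), (1, (12, 1)),
    (2, (20, 1)), (2, (21, 1)), (2, (22, 1)),
]


def forward_scan(entries):
    """Key space order: key first, heap TID as the tie-breaker."""
    return sorted(entries)


def backward_scan(entries):
    """Exact mirror image of the forward scan."""
    return sorted(entries, reverse=True)


def backward_scan_posting_lists(entries):
    """Backward over keys, but each key's TIDs are read front to back.

    Models the assumed posting list behavior; illustrative only.
    """
    groups = [list(g) for _, g in groupby(sorted(entries), key=lambda e: e[0])]
    out = []
    for group in reversed(groups):
        out.extend(group)
    return out


def heap_blocks(scan):
    """Extract the heap block number from each returned TID."""
    return [tid[0] for _, tid in scan]
```

In this model, heap_blocks(backward_scan(ENTRIES)) descends steadily,
whereas heap_blocks(backward_scan_posting_lists(ENTRIES)) climbs within
each key and then jumps back to the preceding key.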

It is certainly not surprising that master does poorly on backwards
scans. And it isn't all that surprising that master does worse on
backwards scans when direct I/O is in use (per the explanation
Andres offered just now). But master should nevertheless always read
the TIDs in whatever order it gets them from amgettuple in.

It sounds like amgetbatch doesn't really behave analogously to master
here, at least with backwards scans. It sounds like you're saying that
we *won't* feed the TIDs' heap block numbers to the read stream in
exactly scan order (when we happen to be scanning backwards) -- which
seems wrong to me.

As you pointed out, a forwards scan of a DESC column index should feed
heap blocks to the read stream in a way that is very similar to an
equivalent backwards scan of a similar ASC column on the same table.
There might be some very minor differences, due to differences in the
precise leaf page boundaries among each of the indexes. But that
should hardly be noticeable at all.

> Could that hide the extra buffer accesses, somehow?

I think that you meant to ask about *missing* buffer hits with the
patch, for the forwards scan. The hit count for that case doesn't
agree with the backwards scan with the patch, nor does it agree with
master (with either the forwards or backwards scan). Note that the
heap accesses themselves appear to have sane/consistent numbers, since
we always see "read=49933" as expected for those, for all 4 query
executions that I showed.

The "missing buffer hits" issue seems like an issue with the
instrumentation itself. Possibly one that is totally unrelated to
everything else we're discussing.

--
Peter Geoghegan


