Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Date
Msg-id 32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me
Whole thread Raw
In response to Re: index prefetching  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi,

Here's a rebased version of the patch, addressing a couple bugs with
scrollable cursors that Peter reported to me off-list. The patch did not
handle that quite right, resulting either in incorrect results (when the
position happened to be off by one), or crashes (when it got out of sync
with the read stream).

But then there are some issues with array keys and mark/restore,
triggered by Peter's "dynamic SAOP advancement" tests in extra tests
(some of the tests use data files too large to post on hackers, it's
available in the github branch). The patch used to handle mark/restore
entirely in indexam.c, and for simple scans that works. But with array
keys the btree code needs to update the moreLeft/moreRight/needPrimScan
flags, so that after restoring it knows where to continue.

There's two "fix" patches trying to make this work - it does not crash,
and almost all the "incorrect" query results are actually stats about
buffer hits etc. And that is expected to change with prefetching, not a
bug. But then there are a bunch of explains where the number of index
scans changed, e.g. like

-         Index Searches: 5
+         Index Searches: 4

And that is almost certainly a bug.

I haven't figured this out yet, and I feel a bit lost again :-(

It made me think again whether it makes sense to make this fundamental
redesign of the index AM interface a prerequisite for prefetching. I
don't dispute the advantages of this new design, with indexam.c
responsible for more stuff (e.g. when a batch gets freed). It seems more
flexible and might make some stuff easier, and if we were designing it
now, we'd do it that way ...

Even if I eventually to fix this issue, will I ever be sufficiently
confident about correctness of the new code, enough to commit that?
Perhaps I'm too skeptical, but I'm not really sure about that anymore.

After thinking about this for a while, I decided to revisit the approach
used in the experimental patch I spoke about at pgconf.dev unconference
in 2023, and see if maybe it could be made to work.

That patch was pretty dumb - it simply initiated prefetches from the AM,
by calling PrefetchBuffer(). And the arguments against that doing this
from the AM seems like a layering violation, that every AM would need to
do a copy of this, because each AM has a different representation of the
internal scan state.

But after looking at it with fresh eyes, this seems fixable. It might
have been "more true" with the fadvise-based prefetching, but with the
ReadStream the amount of new AM code is *much* smaller. It doesn't need
to track the distance, or anything like that - that's handled by the
ReadStream. It just needs to respond to read_next callback. It also
doesn't feel like a layering violation, for the same reason.

I gave this a try last week, and I was surprised how easy it was to make
this work, and how small and simple the patches are - see the attached
simple-prefetch.tgz archive:

  infrastructure - 22kB
  btree          - 10kB
  hash           - 7kB
  gist           - 10kB
  spgist         - 16kB

That's a grand total of ~64kB (there might be some more improvements
necessary, esp. in the gist/spgist part).

Now compare that with the more complex patch, where we have

  infrastructure - 100kB
  nbtree         - 100kB

And that's just one index type. The other index types would probably
need a comparable amount of new code eventually ...

Sure, it can probably be made somewhat smaller (e.g. the nbtree code
copies a lot of stuff to support both the old and new approach, and that
might be reduced if we ditch the old one), and some of the diff are
comments. But even considering all that the size/complexity difference
will remain significant.

The one real limitation of the simpler approach is that prefetching is
limited to a single leaf page - we can't prefetch from the next one,
until the scan advances to it. But based on experiments comparing this
simpler and the "complex" approach, I don't think that really matters
that much. I haven't seen any difference for regular queries.

The one case where I think it might matter is queries with array keys,
where each array key matches a single tuple on a different leaf page.
The complex patch might prefetch tuples for later array values, while
the simpler patch won't be able to do that. If an array key matches
multiple tuples, the simple patch can prefetch those just fine, of
course. I don't know which case is more likely.


One argument for moving more stuff (including prefetching) to indexam.c
was it seems desirable to have one "component" aware of all the relevant
information, so that it can adjust prefetching in some way. I believe
that's still possible even with the simpler patch - nothing prevents
adding a "struct" to the scan descriptor, and using it from the
read_next callback or something like that.


regards


[1] https://github.com/tvondra/postgres/tree/index-prefetch-2025

-- 
Tomas Vondra

Attachment

pgsql-hackers by date:

Previous
From: Salvatore Dipietro
Date:
Subject: Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Next
From: Michael Paquier
Date:
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations