Hi,
Here's a rebased version of the patch, addressing a couple bugs with
scrollable cursors that Peter reported to me off-list. The patch did not
handle that quite right, resulting either in incorrect results (when the
position happened to be off by one), or crashes (when it got out of sync
with the read stream).
But then there are some issues with array keys and mark/restore,
triggered by Peter's "dynamic SAOP advancement" tests in extra tests
(some of the tests use data files too large to post on hackers, it's
available in the github branch). The patch used to handle mark/restore
entirely in indexam.c, and for simple scans that works. But with array
keys the btree code needs to update the moreLeft/moreRight/needPrimScan
flags, so that after restoring it knows where to continue.
There's two "fix" patches trying to make this work - it does not crash,
and almost all the "incorrect" query results are actually stats about
buffer hits etc. And that is expected to change with prefetching, not a
bug. But then there are a bunch of explains where the number of index
scans changed, e.g. like
- Index Searches: 5
+ Index Searches: 4
And that is almost certainly a bug.
I haven't figured this out yet, and I feel a bit lost again :-(
It made me think again whether it makes sense to make this fundamental
redesign of the index AM interface a prerequisite for prefetching. I
don't dispute the advantages of this new design, with indexam.c
responsible for more stuff (e.g. when a batch gets freed). It seems more
flexible and might make some stuff easier, and if we were designing it
now, we'd do it that way ...
Even if I eventually to fix this issue, will I ever be sufficiently
confident about correctness of the new code, enough to commit that?
Perhaps I'm too skeptical, but I'm not really sure about that anymore.
After thinking about this for a while, I decided to revisit the approach
used in the experimental patch I spoke about at pgconf.dev unconference
in 2023, and see if maybe it could be made to work.
That patch was pretty dumb - it simply initiated prefetches from the AM,
by calling PrefetchBuffer(). And the arguments against that doing this
from the AM seems like a layering violation, that every AM would need to
do a copy of this, because each AM has a different representation of the
internal scan state.
But after looking at it with fresh eyes, this seems fixable. It might
have been "more true" with the fadvise-based prefetching, but with the
ReadStream the amount of new AM code is *much* smaller. It doesn't need
to track the distance, or anything like that - that's handled by the
ReadStream. It just needs to respond to read_next callback. It also
doesn't feel like a layering violation, for the same reason.
I gave this a try last week, and I was surprised how easy it was to make
this work, and how small and simple the patches are - see the attached
simple-prefetch.tgz archive:
infrastructure - 22kB
btree - 10kB
hash - 7kB
gist - 10kB
spgist - 16kB
That's a grand total of ~64kB (there might be some more improvements
necessary, esp. in the gist/spgist part).
Now compare that with the more complex patch, where we have
infrastructure - 100kB
nbtree - 100kB
And that's just one index type. The other index types would probably
need a comparable amount of new code eventually ...
Sure, it can probably be made somewhat smaller (e.g. the nbtree code
copies a lot of stuff to support both the old and new approach, and that
might be reduced if we ditch the old one), and some of the diff are
comments. But even considering all that the size/complexity difference
will remain significant.
The one real limitation of the simpler approach is that prefetching is
limited to a single leaf page - we can't prefetch from the next one,
until the scan advances to it. But based on experiments comparing this
simpler and the "complex" approach, I don't think that really matters
that much. I haven't seen any difference for regular queries.
The one case where I think it might matter is queries with array keys,
where each array key matches a single tuple on a different leaf page.
The complex patch might prefetch tuples for later array values, while
the simpler patch won't be able to do that. If an array key matches
multiple tuples, the simple patch can prefetch those just fine, of
course. I don't know which case is more likely.
One argument for moving more stuff (including prefetching) to indexam.c
was it seems desirable to have one "component" aware of all the relevant
information, so that it can adjust prefetching in some way. I believe
that's still possible even with the simpler patch - nothing prevents
adding a "struct" to the scan descriptor, and using it from the
read_next callback or something like that.
regards
[1] https://github.com/tvondra/postgres/tree/index-prefetch-2025
--
Tomas Vondra