Hi,
Here is an experimental patch for read_stream.c. The basic idea is
that when read_stream_next_buffer() gives you a page P1, it should
also tell the CPU to prefetch the header of the next page P2, and so
on. However, I recognise that its lack of timing control may be a
fundamental flaw (see below). It just ... happens to work sometimes.
Can we do better?
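To make the shape of the change concrete, it amounts to something like
this when handing out a buffer (a simplified sketch rather than the
attached patch verbatim; the circular-queue wraparound is elided, and
__builtin_prefetch is the GCC/Clang builtin):

    Buffer      next_buffer;

    /*
     * Sketch of the idea: while returning P1, warm the cache line
     * holding the next queued buffer's page header, so that it's
     * (hopefully) still in cache by the time the caller comes back
     * for P2.  Queue index wraparound elided for clarity.
     */
    next_buffer = stream->buffers[stream->oldest_buffer_index + 1];
    if (BufferIsValid(next_buffer))
        __builtin_prefetch(BufferGetPage(next_buffer));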
One thought is that this might really belong in the AM, along with
David Rowley's work on intra-page memory prefetching[1]. Stepping
back a bit to explain why I wonder about that...
What I observe is that the attached patch sometimes makes a wide
variety of well-cached table scans measurably faster, unscientifically
something like 10-15%. Other times it doesn't help, but it doesn't
hurt.
Exactly which scans, when, and on which CPUs, I'm not yet sure. One
observation is that if you naively pg_prewarm an empty buffer pool, the
table ends up contiguous in memory and the hardware prefetcher[2] can
perhaps detect the sequential memory scan (?), but that's not like real
life: run a system for a while and its buffer pool turns into scrambled
eggs, at which point I guess the hardware prefetcher can't predict much
(we aren't passing around addresses for it to speculate about, just
buffer IDs, though maybe that's not beyond the realms of array index
detection, I dunno). I don't know if that is a factor, and I don't
know enough about modern CPUs to have specific ideas yet.
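For what it's worth, the contiguous-vs-scrambled effect is easy to play
with outside PostgreSQL. Here's a throwaway standalone program (not
from the patch; the sizes and names are arbitrary) that visits the
first cache line of fake 8KB "pages" through an array of pointers,
first in contiguous order and then shuffled to imitate the
scrambled-eggs buffer pool, optionally prefetching one page ahead.
Whether a one-step lookahead gives the prefetch enough lead time is
exactly the timing question discussed below:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define PAGE_SIZE 8192
    #define NPAGES (32 * 1024)              /* 256MB of fake "pages" */

    /* Visit each page's first cache line, optionally prefetching ahead. */
    static double
    walk(uint8_t **pages, int prefetch)
    {
        struct timespec t0, t1;
        uint64_t    sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NPAGES; i++)
        {
            if (prefetch && i + 1 < NPAGES)
                __builtin_prefetch(pages[i + 1]);   /* warm next "header" */
            sum += *(volatile uint8_t *) pages[i];  /* read this "header" */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void) sum;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int
    main(void)
    {
        uint8_t    *arena = malloc((size_t) NPAGES * PAGE_SIZE);
        uint8_t   **pages = malloc(NPAGES * sizeof(*pages));

        if (arena == NULL || pages == NULL)
            return 1;
        /* Fault everything in so the timed walks don't measure page faults. */
        memset(arena, 0x5a, (size_t) NPAGES * PAGE_SIZE);
        for (int i = 0; i < NPAGES; i++)
            pages[i] = arena + (size_t) i * PAGE_SIZE;

        printf("contiguous, no prefetch: %.3fs\n", walk(pages, 0));
        printf("contiguous, prefetch:    %.3fs\n", walk(pages, 1));

        /* Fisher-Yates shuffle: the scrambled-eggs buffer pool case. */
        srandom(42);
        for (int i = NPAGES - 1; i > 0; i--)
        {
            int         j = random() % (i + 1);
            uint8_t    *tmp = pages[i];

            pages[i] = pages[j];
            pages[j] = tmp;
        }

        printf("shuffled, no prefetch:   %.3fs\n", walk(pages, 0));
        printf("shuffled, prefetch:      %.3fs\n", walk(pages, 1));
        return 0;
    }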
The problem with all this is that it's obviously hard to reason about
timing down here in lower-level code, given our tuple-at-a-time Volcano
executor design. I've tangled with variants of this stuff before for
hash joins[3] (uncommitted, sadly), and that was actually quite
successful, but it tried to tackle the timing problem head on: it
created its own batching opportunities instead of letting the program
counter escape into the wilderness. Part of the reason I got stuck
with that project is that I started wanting general batch-mode support
and batch-mode tuple lifetimes instead of localised hacks, and then the
quicksand got me.
Concretely, if the caller does so much work with P1 that by the time
it fetches P2, P2's header has fallen out of [some level of] cache,
then the prefetch won't help [at that level, at least]. In theory it
has also committed the crime of cache pollution in that case, i.e.
possibly kicked something useful out of [some level of] cache. It is
hard for read_stream.c to know what the caller is doing between calls,
which might include running up and down a volcano 100 times. Perhaps
that means the whole problem should be punted to the caller: if heap
scans would like to prefetch the next page's memory, they could simply
call read_stream_next_buffer() slightly before they finish processing
P1, so they can prefetch whatever they want themselves and
micro-manage the timing better (sketched below). Or some more complex
reordering of work.
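In code, that caller-side version might look vaguely like this
(illustrative only, not the real heapam loop, and the "slightly
before" distance is exactly the part that would need tuning):

    Buffer      cur = read_stream_next_buffer(stream, NULL);

    while (BufferIsValid(cur))
    {
        Buffer      next;

        /* ... process most of the tuples on cur ... */

        /* a few tuples before the end, fetch P2 and warm its header */
        next = read_stream_next_buffer(stream, NULL);
        if (BufferIsValid(next))
            __builtin_prefetch(BufferGetPage(next));

        /* ... finish the remaining tuples on cur ... */

        ReleaseBuffer(cur);
        cur = next;
    }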
That leads to the idea that this might actually fit in better with
David's earlier work: can we treat the inter-page hop as just one of
the many hops in the general stream of memory accesses he was already
hacking on for prefetching fragments of heap pages? Or something.
Maybe the attached always leaves the header in at least L2 even in the
worst real cases, and so it's worth considering anyway! I dunno. But
I couldn't look at a >10% performance increase from two lines of code
and not share, hence this hand-wavy, nerd-snipey email.
[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com
[2] https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp18/cse502/slides/13-prefetch.pdf
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com