Experimental prefetching of buffer memory - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Experimental prefetching of buffer memory
Msg-id: CA+hUKGKNUMnqubrrz8pRBdEM8vHeSCZcNq7iqERmkt6zPtpA3g@mail.gmail.com
List: pgsql-hackers
Hi,

Here is an experimental patch for read_stream.c. The basic idea is that when read_stream_next_buffer() gives you a page P1, it should also tell the CPU to prefetch the header of the next page P2, and so on (see the sketch at the end of this mail). However, I recognise that its lack of timing control may be a fundamental flaw (see below). It just ... happens to work sometimes. Can we do better? One thought is that this might really belong in the AM, along with David Rowley's work on intra-page memory prefetching[1].

Stepping back a bit to explain why I wonder about that... What I observe is that sometimes the attached patch makes a wide variety of well-cached table scans measurably faster, unscientifically something like 10-15%ish. Other times it doesn't, but it doesn't hurt. Which ones and when on which CPUs, I'm not yet sure.

One observation is that if you naively pg_prewarm an empty buffer pool, the table is contiguous in memory and the hardware prefetcher[2] can perhaps detect a sequential memory scan (?), but that's not like real life: if you run a system for a while, its buffer pool turns into scrambled eggs, and then I guess the hardware prefetcher can't predict much (I dunno, we aren't passing around addresses for it to speculate about, just buffer IDs, but maybe it's not beyond the realms of array index detection (?), I dunno). I don't know if that is a factor, and I don't know enough about CPUs to have specific ideas yet.

The problem with all this is that it's obviously hard to reason about timing down here in lower-level code with our tuple-at-a-time volcano executor design. I've tangled with variants of this stuff before for hash joins[3] (uncommitted, sadly), and that was actually really quite successful! But that tried to tackle the timing problem head-on: it created its own batching opportunities, instead of letting the program counter escape into the wilderness. Part of the reason I got stuck with that project is that I started wanting general batch-mode support and batch-mode tuple lifetime instead of localised hacks, and then the quicksand got me.

Concretely, if the caller does so much work with P1 that by the time it fetches P2, P2's header has fallen out of [some level of] cache, then it won't help [at that level at least]. In theory it has also committed the crime of cache pollution in that case, i.e. possibly kicked something useful out of [some level of] cache. It is hard for read_stream.c to know what the caller is doing between calls, which might include running up and down a volcano 100 times.

Perhaps that means the whole problem should be punted to the caller -- if heap scans would like to prefetch the next page's memory, then they should simply call read_stream_next_buffer() slightly before they finish processing P1, so they can prefetch whatever they want themselves and micro-manage the timing better (second sketch below). Or some more complex reordering of work. That leads to the idea that it might somehow actually fit in better with David's earlier work: can we treat the inter-page hop as just one of the many hops in the general stream of memory accesses that he was already hacking on for prefetching fragments of heap pages? Or something.

Maybe the attached always leaves the header in at least L2 even in the worst real cases, and so it's worth considering! I dunno. But I couldn't look at a >10% performance increase from 2 lines of code and not share, hence this hand-wavy nerd-snipy email.
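To make the hand-waving slightly more concrete, here is a minimal sketch of the mechanism (not the attached patch itself; prefetch_page_header() is a name invented for this mail). The point is that a plain memory prefetch is only a hint: it can't fault, so a stale or useless address costs at most a wasted instruction and the cache pollution discussed above.

    /*
     * Sketch only: ask the CPU to start pulling the given page header
     * into cache.  On GCC/Clang, __builtin_prefetch(addr, rw, locality)
     * emits a non-faulting prefetch instruction; on other compilers this
     * is simply a no-op.
     */
    static inline void
    prefetch_page_header(const void *page)
    {
    #if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(page, 0 /* for reading */, 3 /* keep in cache */);
    #endif
    }

With something like that, read_stream_next_buffer() can prefetch the header of the next pinned buffer while handing out the current one, which is more or less the 2 lines mentioned above.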
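And here is a rough sketch of the punt-to-the-caller variant, with the prefetch distance micro-managed at the heap scan level. process_all_tuples_on_page() is a stand-in for whatever the caller really does between pages; per-buffer data and error handling are omitted.

    Buffer  next = read_stream_next_buffer(stream, NULL);

    while (BufferIsValid(next))
    {
        Buffer  cur = next;

        /*
         * Ask the stream for P2 before we start chewing on P1, so that
         * the caller controls the distance between the prefetch hint
         * and the eventual access to P2's header.
         */
        next = read_stream_next_buffer(stream, NULL);
        if (BufferIsValid(next))
            prefetch_page_header(BufferGetPage(next));

        process_all_tuples_on_page(cur);    /* the volcano runs here */
        ReleaseBuffer(cur);
    }

Whether that buys anything over doing it inside read_stream.c presumably depends on how long the volcano runs per page, which is exactly the timing question above.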
[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com
[2] https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp18/cse502/slides/13-prefetch.pdf
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com