Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
From: James Hunter
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id: CAJVSvF66dP-WpiVOnOV2Bj1YGKYXEz7w6Kns++Jv4caGyJ-8+A@mail.gmail.com
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Thanks for the comments!

On Tue, Apr 15, 2025 at 3:11 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-04-14 09:58:19 -0700, James Hunter wrote:
> > I see two orthogonal problems, in processing Bitmap Heap pages in
> > parallel: (1) we need to prefetch enough pages, far enough in advance,
> > to hide read latency; (2) later, every parallel worker needs to be
> > given a set of pages to process, in a way that minimizes contention.
> >
> > The easiest way to hand out work to parallel workers (and often the
> > best) is to maintain a single, shared, global work queue. Just put
> > whatever pages you prefetch into a FIFO queue, and let each worker
> > pull one piece of "work" off that queue. In this way, there's no
> > "ramp-down" problem.
>
> If you just issue prefetch requests separately you'll get no read
> combining - and it turns out that that is a really rather significant
> loss, both on the storage layer and just due to the syscall overhead.
> So you do need to perform batching when issuing IO. Which in turn
> requires a bit of ramp-up logic etc.

Right, so if you need to do batching anyway, contention on a shared queue
will be minimal, because it's amortized over the batch size. I agree
about ramp-*up* logic; I just don't see the need for ramp-*down* logic.

> > This is why a single shared queue is so nice, because it avoids
> > workers being idle. But I am confused by your proposal, which seems to
> > be trying to get the behavior of a single shared queue, but
> > implemented with the added complexity of multiple queues.
> >
> > Why not just use a single queue?
>
> Accessing buffers in a maximally interleaved way, which is what a single
> queue would give you, adds a good bit of overhead when you have a lot of
> memory, because e.g. TLB hit rate is minimized.

Well, that's a trade-off, right? As you point out, you need to do
batching when issuing reads, to allow for read combining.
The larger your batch, the more reads you can combine -- the more
efficient your I/O, etc. But the larger your batch, the less locality
you get in memory. You always have to choose a batch size large enough
to hide I/O latency, plus allow, I guess, for read combining. I suspect
that will blow out your TLB more than letting 8 parallel workers share
the same queue. Not to mention the complexity (as Thomas has described
very nicely, in this thread) of trying to partition+affinitize async
read requests to individual parallel workers.

(Consider "ramp-down" for a moment: the "problem" here is just that one
parallel worker issued a batch of async reads, near the end of the
query; and since the worker is affinitized to those async reads, all the
other workers pack up and go home, leaving a single worker to process
this last batch. If, instead, we just used a single queue, then there
would be no need for "ramp-down" logic, because async reads would go
into a single queue/pool, and not be affinitized to a single, "unlucky"
worker.)

> > It has never been clear to me why prefetching the exact blocks you'll
> > later consume is seen as a *benefit*, rather than a *cost*. I'm not
> > aware of any prefetch interface, other than PG's "ReadStream," that
> > insists on this. But that's a separate discussion...
>
> ...
>
> As I said above, that's not to say that we'll only ever want to do
> readahead via the read stream interface.

Well, that's my point: since, I believe, we'll ultimately want a
"heuristic" prefetch, which will be incompatible with the new read
stream interface, we'll end up writing and supporting two different
prefetch interfaces. It has never been clear to me that the advantages
of having this second, read-stream, prefetch interface outweigh the
costs of having to write and maintain two separate interfaces to do
pretty much the same thing. If we *didn't* need the "heuristic"
interface, then I could be convinced that the "read-stream" interface
was a good choice.
But since we'll (eventually) need the "heuristic" interface anyway, it's
not clear to me that the benefits outweigh the costs of implementing
this "read-stream" interface as well.

Thanks,
James