Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date:
Msg-id: rmiw7n4ijedmphzcs6zwm6qoyhnifkj2ivmfob4xbh77iuehyw@rcp3eiafgt7q
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (James Hunter <james.hunter.pg@gmail.com>)
Responses: Re: BitmapHeapScan streaming read user and prelim refactoring
List: pgsql-hackers
Hi,

On 2025-04-14 09:58:19 -0700, James Hunter wrote:
> Of course, late feedback is a burden, but I think our discussion here
> (and, in the future, if you try to "ReadStream" BTree index pages,
> themselves) illustrates my point.

FWIW, it's quite conceivable that we'll want to use non-readstream
prefetching for some cases where "heuristic" prefetching is easier. I think
prefetching of btree pages is one of those cases.

The read stream based API provides a single point for adjusting the readahead
logic, implementing read combining etc., without the user of the read stream
needing to know about any of that.
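To make that concrete: a read stream user essentially just supplies block
numbers from a callback and consumes pinned buffers; all decisions about
readahead distance, IO combining etc. stay behind the interface. Below is a
rough, untested sketch of a trivial user, written against the API in
src/include/storage/read_stream.h (check the header for the authoritative
signatures); the DemoScanState struct and function names are made up purely
for illustration:

/*
 * Rough, illustrative sketch of a minimal read stream user (not actual
 * Postgres code, untested).  The callback hands back block numbers one at a
 * time, the stream decides how far ahead to issue IO and how to combine
 * neighbouring blocks, and the consumer loop just pulls pinned buffers.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/read_stream.h"
#include "utils/rel.h"

typedef struct DemoScanState
{
	BlockNumber next_block;
	BlockNumber nblocks;
} DemoScanState;

/* Return the next block to read, or InvalidBlockNumber at end of stream. */
static BlockNumber
demo_scan_next_block(ReadStream *stream,
					 void *callback_private_data,
					 void *per_buffer_data)
{
	DemoScanState *scan = (DemoScanState *) callback_private_data;

	if (scan->next_block >= scan->nblocks)
		return InvalidBlockNumber;
	return scan->next_block++;
}

static void
demo_scan_relation(Relation rel)
{
	DemoScanState scan = {0, RelationGetNumberOfBlocks(rel)};
	ReadStream *stream;
	Buffer		buf;

	stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
										NULL,	/* default buffer access strategy */
										rel,
										MAIN_FORKNUM,
										demo_scan_next_block,
										&scan,
										0);		/* no per-buffer data */

	/* Buffers come back pinned, in the order the callback produced blocks. */
	while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
	{
		/* ... process the page ... */
		ReleaseBuffer(buf);
	}

	read_stream_end(stream);
}

Whether the stream ramps its lookahead distance up or down, or merges
neighbouring blocks into one larger IO, is entirely its own business; the
consumer code above doesn't need to change.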
> I see two orthogonal problems, in processing Bitmap Heap pages in
> parallel: (1) we need to prefetch enough pages, far enough in advance,
> to hide read latency; (2) later, every parallel worker needs to be
> given a set of pages to process, in a way that minimizes contention.
>
> The easiest way to hand out work to parallel workers (and often the
> best) is to maintain a single, shared, global work queue. Just put
> whatever pages you prefetch into a FIFO queue, and let each worker
> pull one piece of "work" off that queue. In this way, there's no
> "ramp-down" problem.

If you just issue prefetch requests separately you'll get no read combining -
and it turns out that that is a rather significant loss, both at the storage
layer and just due to the syscall overhead. So you do need to perform
batching when issuing IO, which in turn requires a bit of ramp-up logic etc.

> If you find that contention on this shared queue becomes a bottleneck,
> then you just pull *n* pieces of work, in a batch. Then the
> "ramp-down" problem is limited to "n", instead of just 1. Often, one
> can find a suitable value of n that simultaneously makes contention
> effectively zero, while avoiding "ramp-down" problems; say n = 10.
>
> So much for popping from the shared queue. Pushing to the shared queue
> is also easy, because you have async reads. Anytime a worker pops a
> (batch of) work item(s) off the shared queue, it checks to see if the
> queue is still large enough. If not, it issues the appropriate
> prefetch / "ReadStream" calls.
>
> A single shared queue is easiest, but sometimes there's no way to
> prevent it from becoming a bottleneck. In that case, one typically
> partitions the input at startup, gives each worker its own partition,
> and waits for all workers to complete. In this, second, model, workers
> are entirely independent, so there is no contention: we scale out
> perfectly. The problem, as you've pointed out, is that one worker
> might finish its own work long before another; and then the worker
> that finished its work is idle and therefore wasted.
>
> This is why a single shared queue is so nice, because it avoids
> workers being idle. But I am confused by your proposal, which seems to
> be trying to get the behavior of a single shared queue, but
> implemented with the added complexity of multiple queues.
>
> Why not just use a single queue?

Accessing buffers in a maximally interleaved way, which is what a single
queue would give you, adds a good bit of overhead when you have a lot of
memory, because e.g. the TLB hit rate is minimized.
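For concreteness, the "single shared queue, popped in batches of n" scheme
under discussion is roughly the following - a purely hypothetical sketch with
invented names, not actual Postgres code; initialization, ring-buffer
wraparound and the refill path are elided:

/*
 * Hypothetical sketch of a shared FIFO of block numbers, popped in batches,
 * only to make the scheme above concrete.  None of these names exist in
 * Postgres.
 */
#include "postgres.h"

#include "storage/block.h"
#include "storage/spin.h"

#define WORK_QUEUE_SIZE		1024
#define WORK_BATCH_SIZE		10		/* the "n" from above */
#define WORK_REFILL_LOW		128		/* ask for more prefetches below this */

typedef struct SharedBlockQueue
{
	slock_t		mutex;
	int			head;			/* next entry to hand out */
	int			tail;			/* next free slot */
	BlockNumber blocks[WORK_QUEUE_SIZE];
} SharedBlockQueue;

/*
 * Pop up to WORK_BATCH_SIZE block numbers for one worker.  Returns how many
 * were handed out and reports whether the queue has fallen below the refill
 * watermark, so the caller can issue more prefetches outside the lock.
 */
static int
shared_queue_pop_batch(SharedBlockQueue *q, BlockNumber *out, bool *need_refill)
{
	int			ntake;

	SpinLockAcquire(&q->mutex);
	ntake = Min(q->tail - q->head, WORK_BATCH_SIZE);
	for (int i = 0; i < ntake; i++)
		out[i] = q->blocks[q->head + i];
	q->head += ntake;
	*need_refill = (q->tail - q->head) < WORK_REFILL_LOW;
	SpinLockRelease(&q->mutex);

	return ntake;
}

Note that the refill step, where a worker notices the queue is running low
and issues more prefetches, is exactly where the batching and ramp-up
concerns above come in; and because every worker pops from the same FIFO,
each one ends up touching blocks interleaved with all the others'.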
> > These are now
> > real IOs running in the background and for the *exact* blocks you will
> > consume; posix_fadvise() was just a stepping stone towards AIO that
> > tolerated sloppy synchronisation including being entirely wrong.
>
> It has never been clear to me why prefetching the exact blocks you'll
> later consume is seen as a *benefit*, rather than a *cost*. I'm not
> aware of any prefetch interface, other than PG's "ReadStream," that
> insists on this. But that's a separate discussion...

It (randomly ordered):

a) reduces wasted IO

b) makes things like IO combining a lot easier

c) makes it a lot easier to adaptively control the readahead distance,
   because you have information about 1) the # in-flight IOs, 2) the #
   completed IOs, 3) the current position, 4) reada

d) provides *per-stream* control over readahead. You want to be only as
   aggressive about prefetching as you need to be, which very well can
   differ between different parts of a query tree.

e) provides one central point at which to implement readahead logic

As I said above, that's not to say that we'll only ever want to do readahead
via the read stream interface.

Greetings,

Andres Freund