Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From Andres Freund
Subject Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id rmiw7n4ijedmphzcs6zwm6qoyhnifkj2ivmfob4xbh77iuehyw@rcp3eiafgt7q
In response to Re: BitmapHeapScan streaming read user and prelim refactoring  (James Hunter <james.hunter.pg@gmail.com>)
List pgsql-hackers
Hi,

On 2025-04-14 09:58:19 -0700, James Hunter wrote:
> Of course, late feedback is a burden, but I think our discussion here
> (and, in the future, if you try to "ReadStream" BTree index pages,
> themselves) illustrates my point.

FWIW, it's quite conceivable that we'll want to use non-readstream prefetching
for some cases where "heuristic" prefetching is easier. I think prefetching of
btree pages is one of those cases.

The read-stream-based API provides a single point for adjusting the readahead
logic, implementing read combining, etc., without the user of the read stream
needing to know about any of that.
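
For reference, roughly what a read stream user looks like with the
PostgreSQL 17-era read_stream.h API (MyScanState and my_scan_next_block are
made-up names for this sketch; exact signatures may differ in your tree):

    #include "postgres.h"

    #include "storage/bufmgr.h"
    #include "storage/read_stream.h"
    #include "utils/rel.h"

    /* Hypothetical per-scan state, just for this sketch. */
    typedef struct MyScanState
    {
        BlockNumber next_block;
        BlockNumber nblocks;
    } MyScanState;

    /* Callback: hand the stream the next block we want; it decides when to read it. */
    static BlockNumber
    my_scan_next_block(ReadStream *stream, void *callback_private_data,
                       void *per_buffer_data)
    {
        MyScanState *scan = (MyScanState *) callback_private_data;

        if (scan->next_block >= scan->nblocks)
            return InvalidBlockNumber;  /* end of stream */
        return scan->next_block++;
    }

    static void
    my_scan(Relation rel, MyScanState *scan)
    {
        ReadStream *stream;
        Buffer      buf;

        stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                            NULL,   /* no buffer access strategy */
                                            rel,
                                            MAIN_FORKNUM,
                                            my_scan_next_block,
                                            scan,   /* callback private data */
                                            0);     /* no per-buffer data */

        /* Readahead distance, IO combining and ramp-up all live behind this. */
        while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
        {
            /* ... process the page in buf ... */
            ReleaseBuffer(buf);
        }

        read_stream_end(stream);
    }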


> I see two orthogonal problems, in processing Bitmap Heap pages in
> parallel: (1) we need to prefetch enough pages, far enough in advance,
> to hide read latency; (2) later, every parallel worker needs to be
> given a set of pages to process, in a way that minimizes contention.
> 
> The easiest way to hand out work to parallel workers (and often the
> best) is to maintain a single, shared, global work queue. Just put
> whatever pages you prefetch into a FIFO queue, and let each worker
> pull one piece of "work" off that queue. In this way, there's no
> "ramp-down" problem.

If you just issue prefetch requests separately you'll get no read combining -
and it turns out that that is a rather significant loss, both at the storage
layer and just due to the syscall overhead.  So you do need to perform batching
when issuing IO, which in turn requires a bit of ramp-up logic etc.
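
(A purely illustrative sketch of what such ramp-up logic could look like;
nothing here is actual PostgreSQL code:)

    #define MAX_IO_BATCH 32

    /*
     * Start with a single block per submission and double the batch as long
     * as the consumer keeps draining everything in flight, so a scan that
     * stops after a handful of rows never pays for a large speculative batch.
     */
    static int
    next_batch_size(int current_batch, int blocks_consumed_since_last_submit)
    {
        if (blocks_consumed_since_last_submit >= current_batch &&
            current_batch < MAX_IO_BATCH)
            return current_batch * 2;
        return current_batch;
    }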


> If you find that contention on this shared queue becomes a bottleneck,
> then you just pull *n* pieces of work, in a batch. Then the
> "ramp-down" problem is limited to "n", instead of just 1. Often, one
> can find a suitable value of n that simultaneously makes contention
> effectively zero, while avoiding "ramp-down" problems; say n = 10.
> 
> So much for popping from the shared queue. Pushing to the shared queue
> is also easy, because you have async reads. Anytime a worker pops a
> (batch of) work item(s) off the shared queue, it checks to see if the
> queue is still large enough. If not, it issues the appropriate
> prefetch / "ReadStream" calls.
> 
> A single shared queue is easiest, but sometimes there's no way to
> prevent it from becoming a bottleneck. In that case, one typically
> partitions the input at startup, gives each worker its own partition,
> and waits for all workers to complete. In this, second, model, workers
> are entirely independent, so there is no contention: we scale out
> perfectly. The problem, as you've pointed out, is that one worker
> might finish its own work long before another; and then the worker
> that finished its work is idle and therefore wasted.
> 
> This is why a single shared queue is so nice, because it avoids
> workers being idle. But I am confused by your proposal, which seems to
> be trying to get the behavior of a single shared queue, but
> implemented with the added complexity of multiple queues.
> 
> Why not just use a single queue?

Accessing buffers in a maximally interleaved way, which is what a single queue
would give you, adds a good bit of overhead when you have a lot of memory,
because e.g. TLB hit rate is minimized.
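
To illustrate the alternative (again just a sketch, not actual code): workers
can claim contiguous chunks of blocks from a shared counter, which keeps each
worker's accesses local while the chunk size bounds how uneven the end of the
scan can get:

    #define CHUNK_SIZE 16       /* trade-off: locality vs. ramp-down skew */

    static bool
    claim_chunk(pg_atomic_uint64 *next_block, uint64 nblocks,
                uint64 *chunk_start, uint64 *chunk_end)
    {
        uint64 start = pg_atomic_fetch_add_u64(next_block, CHUNK_SIZE);

        if (start >= nblocks)
            return false;       /* scan is finished */

        *chunk_start = start;
        *chunk_end = Min(start + CHUNK_SIZE, nblocks);
        return true;
    }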


> > These are now
> > real IOs running in the background and for the *exact* blocks you will
> > consume; posix_fadvise() was just a stepping towards AIO that
> > tolerated sloppy synchronisation including being entirely wrong.
> 
> It has never been clear to me why prefetching the exact blocks you'll
> later consume is seen as a *benefit*, rather than a *cost*. I'm not
> aware of any prefetch interface, other than PG's "ReadStream," that
> insists on this. But that's a separate discussion...

It (randomly ordered):

a) reduces wasted IO

b) makes things like IO combining a lot easier

c) makes it a lot easier to adaptively control readahead distance (see the
   sketch after this list), because you have information about
   1) the # of in-flight IOs
   2) the # of completed IOs
   3) the current position
   4) the current readahead distance

d) provides *per-stream* control over readahead. You want to be only as
   aggressive about prefetching as you need to be, which very well can differ
   between different parts of a query tree.

e) provides one central point to implement readahead logic
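
As a hypothetical illustration of c) (made-up function, not the actual read
stream code): with per-stream knowledge of the in-flight IO count and whether
recent reads were already cached, the stream can grow or shrink its own
lookahead distance:

    static int
    adjust_distance(int distance, int ios_in_flight,
                    bool last_read_was_cached, int max_distance)
    {
        if (last_read_was_cached)
            return Max(distance - 1, 1);    /* data is cached, back off */
        if (ios_in_flight >= distance && distance < max_distance)
            return Min(distance * 2, max_distance);     /* IO bound, look further ahead */
        return distance;
    }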


As I said above, that's not to say that we'll only ever want to do readahead
via the read stream interface.

Greetings,

Andres Freund


