Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: BitmapHeapScan streaming read user and prelim refactoring
Date
Msg-id CA+hUKGKz+SZh28S4edbbRv6+DEPDfDMiPTCbJxQsBGsKtL+Gaw@mail.gmail.com
In response to Re: BitmapHeapScan streaming read user and prelim refactoring  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: BitmapHeapScan streaming read user and prelim refactoring
List pgsql-hackers
On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Thomas mentioned this to me off-list, and I think he's right. We
> likely need to rethink the way parallel bitmap heap scan workers get
> block assignments for reading and prefetching to make it more similar
> to parallel sequential scan. The workers should probably get
> assignments of a range of blocks. On master, each worker does end up
> issuing reads/fadvises for a bunch of blocks in a row -- even though
> that isn't the intent of the parallel bitmap table scan
> implementation. We are losing some of that with the patch -- but only
> because it is behaving as you would expect given the implementation
> and design. I don't consider that a blocker, though.

I had a crack at this one a few weeks back, and wanted to share some
thoughts.  Getting sequential block range allocation working was easy
(0001), but ramp-down at this level as a parallel query fairness
strategy seems to be fundamentally incompatible with AIO.  0002 was an
attempt that I abandoned before getting lost for a while researching
the futility of it all.  See further down.  I gave up for v18.

Of course we already have this problem in v18, but 0001 would magnify
it.  Whether gaining I/O combining is worth losing some parallel
fairness, IDK, but it can wait...

With just the 0001 patch, io_workers=1 and two parallel processes, you
can trace the IO worker to see the ramp-up in each parallel worker and
the maximised I/O combining, but also the abrupt and unfair
end-of-scan:

tmunro@x1:~/projects/postgresql/build$ strace -p 377335 -e
trace=preadv,pread64 -s0
strace: Process 377335 attached
pread64(16, ""..., 8192, 720896)        = 8192
pread64(16, ""..., 8192, 729088)        = 8192
pread64(16, ""..., 16384, 737280)       = 16384
pread64(16, ""..., 16384, 753664)       = 16384
pread64(16, ""..., 32768, 802816)       = 32768
pread64(16, ""..., 32768, 770048)       = 32768
pread64(16, ""..., 65536, 835584)       = 65536
pread64(16, ""..., 65536, 901120)       = 65536
pread64(16, ""..., 131072, 966656)      = 131072
pread64(16, ""..., 131072, 1228800)     = 131072
pread64(16, ""..., 131072, 1097728)     = 131072
pread64(16, ""..., 131072, 1490944)     = 131072
pread64(16, ""..., 131072, 1359872)     = 131072
pread64(16, ""..., 131072, 1753088)     = 131072
pread64(16, ""..., 131072, 1884160)     = 131072
pread64(16, ""..., 131072, 1622016)     = 131072
pread64(16, ""..., 131072, 2015232)     = 131072
preadv(16, [...], 7, 2146304)           = 131072
preadv(16, [...], 7, 2277376)           = 131072
pread64(16, ""..., 131072, 2539520)     = 131072
pread64(16, ""..., 131072, 2408448)     = 131072
pread64(16, ""..., 98304, 2801664)      = 98304
pread64(16, ""..., 131072, 2670592)     = 131072

For real parallel query fairness, I wonder if we should think about a
work stealing approach.  Let me sketch out a simple idea first, and
explain why it's still not enough, and then sketch out a bigger idea:

Imagine a thing called WorkQueueSet in shared memory that contains a
WorkQueue for each worker, and a function
wqs_get_next_object(&state->tidbitmap_wqs, &object) that looks in your
own backend's queue first.  If there is nothing there, it calls a
registered callback that tries to fill your queue up (the current TBM
iterator logic becomes that callback in this case, and the per-backend
WorkQueue replaces the little array that I added in the attached 0001
patch).  If that doesn't work, it tries to steal from everyone else's
queue, and if that also fails then there really is no more work to be
done and your scan is finished.  That should be perfectly fair, not
just approximately fair like ramp-down schemes, which rely on
assumptions that are now false.  *If* we can figure out how to make the
interlocking so lightweight that the common case of consuming from
your own uncontended work queue is practically as fast as reading out
of a local array...  (I have a few ideas about that part but this is
already long enough...)
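
To make that concrete, here is a minimal standalone C sketch of just
that lookup order (own queue, then fill callback, then steal).  The
struct layouts and queue representation are invented for illustration,
the hard part (the lightweight interlocking) is omitted entirely, and
the real wqs_get_next_object(&state->tidbitmap_wqs, &object) would
presumably find the calling backend's queue itself rather than taking
an explicit my_index:

#include <stdbool.h>
#include <stddef.h>

#define WQ_CAPACITY 64

typedef struct WorkQueue
{
    void       *items[WQ_CAPACITY];
    int         nitems;
} WorkQueue;

typedef struct WorkQueueSet
{
    int         nqueues;
    WorkQueue  *queues;     /* one per worker, in shared memory */

    /* registered callback that tries to refill a worker's queue */
    bool      (*fill)(struct WorkQueueSet *wqs, int my_index);
} WorkQueueSet;

static bool
wq_pop(WorkQueue *q, void **object)
{
    if (q->nitems == 0)
        return false;
    *object = q->items[--q->nitems];
    return true;
}

/*
 * Try my own queue, then the fill callback, then everyone else's
 * queue.  Returning false means there is no more work anywhere and
 * the scan is finished.
 */
static bool
wqs_get_next_object(WorkQueueSet *wqs, int my_index, void **object)
{
    if (wq_pop(&wqs->queues[my_index], object))
        return true;
    if (wqs->fill(wqs, my_index) &&
        wq_pop(&wqs->queues[my_index], object))
        return true;
    for (int i = 0; i < wqs->nqueues; i++)
    {
        if (i != my_index && wq_pop(&wqs->queues[i], object))
            return true;
    }
    return false;
}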

But that only provides ramp-down for the block number ranges
*entering* each worker's ReadStream.  Our goal is to make the workers
run out of work as close together in time as possible, and their work
begins when buffers *exit* each worker's ReadStream.  Thought experiment:
what if we could also steal blocks/buffers that someone else's
ReadStream holds, whether already valid, referenced by an inflight
I/O, or waiting to be combined with more blocks in the pending read?
The buffer manager can already deal with any and all races at its
level and do the right thing, but we'd somehow need to make sure the
theft victim doesn't also process that block.  The ReadStream's buffer
queue could be integrated with your backend's WorkQueue in a
WorkQueueSet in shared memory, so that others can steal your stuff,
and read_stream_next_buffer() would have to discard anything that has
been stolen between entering and exiting the stream.  The common
non-stolen case would have to be extremely cheap, read barriers and
ordering tricks I guess.  (This is basically a new take on an idea I
have mentioned before, parallel-aware ReadStream, except that all
previous versions collapsed in a heap of lock contention and other
design problems.  Work stealing seems to offer a way to keep
everything almost as it is now, except in a few weird corner cases at
end-of-stream when new magic fires up only if required.  It also seems
to be such a broadly useful technique that I'd want a general purpose
thing, not just a ReadStream internal trick.)
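
As a tiny illustration of the semantics such a scheme would need, here
is one hypothetical per-slot claim protocol, using C11 atomics purely
for illustration (PostgreSQL itself would use pg_atomic_uint32 and
friends): the owner and a thief race to claim a slot, exactly one
wins, and the owner simply discards slots that were stolen.  A CAS in
the common path is still heavier than the read-barrier tricks I have
in mind, so treat this only as a statement of the required behaviour,
not a design:

#include <stdatomic.h>
#include <stdbool.h>

typedef enum SlotState
{
    SLOT_PENDING,               /* work is available in this slot */
    SLOT_OWNED,                 /* the owning backend claimed it */
    SLOT_STOLEN                 /* another backend stole it */
} SlotState;

typedef struct QueueSlot
{
    _Atomic int state;          /* SlotState */
    int         blocknum;       /* the work item itself, simplified */
} QueueSlot;

/* Owner path: returns false if the slot was stolen, so just skip it. */
static bool
slot_claim_as_owner(QueueSlot *slot)
{
    int         expected = SLOT_PENDING;

    return atomic_compare_exchange_strong(&slot->state, &expected,
                                          SLOT_OWNED);
}

/* Thief path: try to steal somebody else's pending slot. */
static bool
slot_claim_as_thief(QueueSlot *slot, int *blocknum)
{
    int         expected = SLOT_PENDING;

    if (!atomic_compare_exchange_strong(&slot->state, &expected,
                                        SLOT_STOLEN))
        return false;
    *blocknum = slot->blocknum;
    return true;
}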

Putting both things together, I'm wondering about *two* levels of
WorkQueueSet for stuff like Seq Scan and BHS.

1.  Preserve I/O combining opportunities:  The ReadStream's block
number callback uses WorkQueueSet to hand out block numbers as
described a couple of paragraphs above.  Perhaps
WORK_QUEUE_TYPE_BLOCK_RANGE would be a built-in mode with queue
elements [blockno, nblocks] that you consume from and dequeue once
depleted (sketched below, after point 2); that would implement
tableam.c's parallel seq block allocator, with the callback working
just as it does now except with the ramp-down part removed, and
tidbitmap.c would use WORK_QUEUE_TYPE_BLOB to traffic
TBMWorkQueueItem objects replacing the current TBMIteratorResult.

2.  Fairness:  The ReadStream is put into a mode that attaches to your
backend's WorkQueue in a second WorkQueueSet, and uses it for buffers
and per-buffer-data.  This allows other ReadStreams in other
processes attached to the same WorkQueueSet to steal stuff.  In the
common case there's no stealing, but at stream end you get perfect
fairness.
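
For point 1, here is a hypothetical sketch of what a
WORK_QUEUE_TYPE_BLOCK_RANGE element and its consumption might look
like; BlockRangeItem and block_range_next are invented names standing
in for whatever the real thing would be:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

typedef struct BlockRangeItem
{
    BlockNumber blockno;        /* next block to hand out */
    uint32_t    nblocks;        /* blocks remaining in this range */
} BlockRangeItem;

/*
 * Hand out the next block number from a range element.  When this
 * returns false the element is depleted: the caller dequeues it and
 * asks the WorkQueueSet for another element (via the fill callback, or
 * by stealing), which keeps each worker on long runs of consecutive
 * blocks and so preserves I/O combining.
 */
static bool
block_range_next(BlockRangeItem *item, BlockNumber *blockno)
{
    if (item->nblocks == 0)
        return false;
    *blockno = item->blockno++;
    item->nblocks--;
    return true;
}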

I suspect that a generic work stealing component would find many other
uses.  I can think of quite a few, but here's one: It might be able to
fix the PHJ unmatched scan unfairness problem, within the constraints
of our existing executor model.  Thought experiment: PHJ_BATCH_PROBE
and PHJ_BATCH_SCAN are merged into one phase.  Each worker emits
already joined tuples from a WorkQueueSet (IDK how they hold tuples
exactly, this is hand-waving level).  The callback that pulls those
tuples into your backend's WorkQueue does the actual joining and sets
the match bit, which has the crucial property that you can know when
all bits are set without waiting, and then the callback can switch
to feeding unmatched tuples into the work queue for FULL/RIGHT joins,
and every worker can participate.  No weird
detach-and-drop-to-single-process, no wait, no deadlock risk.  The
callback's batch size would become a new source of unfairness when
performing the join, so perhaps you need two levels of WorkQueueSet,
one for pre-joined tuples and one for post-joined tuples, and only
when both are completely exhausted do you have all match bits (there
is a small extra complication: I think you need a concept of 'dequeued
but busy' when moving items from pre-joined to post-joined, and a CV
so you can wait for that condition to clear, or something like that,
but that's a guaranteed-progress wait and only used in a very rare
edge case; if I'm making a mistake in all this it may be related to
this type of thing, but I'm not seeing anything unsolvable).  Note
that I didn't say you could take over other workers' scans, which we
definitely *can't* do with our current executor model, I'm just saying
that if there is no stealable work left, matching must be finished.
If that sounds expensive, one thought is that it also provides a tidy
solution to my old conundrum "how to do batch-oriented outer scan with
the timing and control needed to drive software memory prefetching",
ie without falling down the executor volcano by calling ExecProcNode()
between prefetches; I strongly suspect that would provide so much
speedup that it could easily pay for the cost of the extra steps,
again assuming we can make the uncontended case blindingly fast.
Maybe, IDK, I could be missing things...

Anyway, that was a long explanation for why I didn't come up with an
easy patch to teach PBHS not to destroy I/O combining opportunities
for v18 :-)  The short version is that 0001 does address the immediate
problem we identified, but also helped me see the bigger picture a
little bit more clearly and find a few interesting connections.
Parallel query and I/O streaming are just not the best of friends,
yet.  R&D needed.

