Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id: CA+hUKGKz+SZh28S4edbbRv6+DEPDfDMiPTCbJxQsBGsKtL+Gaw@mail.gmail.com
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> Thomas mentioned this to me off-list, and I think he's right. We
> likely need to rethink the way parallel bitmap heap scan workers get
> block assignments for reading and prefetching to make it more similar
> to parallel sequential scan. The workers should probably get
> assignments of a range of blocks. On master, each worker does end up
> issuing reads/fadvises for a bunch of blocks in a row -- even though
> that isn't the intent of the parallel bitmap table scan
> implementation. We are losing some of that with the patch -- but only
> because it is behaving as you would expect given the implementation
> and design. I don't consider that a blocker, though.

I had a crack at this one a few weeks back and wanted to share some thoughts. Getting sequential block range allocation to work was easy (0001), but ramp-down at this level, as a parallel query fairness strategy, seems to be fundamentally incompatible with AIO. 0002 was an abandoned attempt, written before I got lost for a while researching the futility of it all; see further down. I gave up for v18. Of course we already have this problem in v18, but 0001 would magnify it. Whether gaining I/O combining is worth losing some parallel fairness, IDK, but it can wait...
With just the 0001 patch, io_workers=1 and two parallel processes, you can trace the I/O worker to see the ramp-up in each parallel worker and the maximised I/O combining, but also the abrupt and unfair end-of-scan:

tmunro@x1:~/projects/postgresql/build$ strace -p 377335 -e trace=preadv,pread64 -s0
strace: Process 377335 attached
pread64(16, ""..., 8192, 720896) = 8192
pread64(16, ""..., 8192, 729088) = 8192
pread64(16, ""..., 16384, 737280) = 16384
pread64(16, ""..., 16384, 753664) = 16384
pread64(16, ""..., 32768, 802816) = 32768
pread64(16, ""..., 32768, 770048) = 32768
pread64(16, ""..., 65536, 835584) = 65536
pread64(16, ""..., 65536, 901120) = 65536
pread64(16, ""..., 131072, 966656) = 131072
pread64(16, ""..., 131072, 1228800) = 131072
pread64(16, ""..., 131072, 1097728) = 131072
pread64(16, ""..., 131072, 1490944) = 131072
pread64(16, ""..., 131072, 1359872) = 131072
pread64(16, ""..., 131072, 1753088) = 131072
pread64(16, ""..., 131072, 1884160) = 131072
pread64(16, ""..., 131072, 1622016) = 131072
pread64(16, ""..., 131072, 2015232) = 131072
preadv(16, [...], 7, 2146304) = 131072
preadv(16, [...], 7, 2277376) = 131072
pread64(16, ""..., 131072, 2539520) = 131072
pread64(16, ""..., 131072, 2408448) = 131072
pread64(16, ""..., 98304, 2801664) = 98304
pread64(16, ""..., 131072, 2670592) = 131072

For real parallel query fairness, I wonder if we should think about a work stealing approach.
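To make the 0001 idea concrete, here is a minimal sketch (not the actual patch; the ParallelBlockAllocator name and pba_next_range() function are hypothetical) of handing out contiguous block ranges from one shared atomic cursor, with no ramp-down:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Hypothetical shared-memory state: a single atomic cursor over the relation. */
typedef struct ParallelBlockAllocator
{
    _Atomic BlockNumber next_block;     /* next unassigned block */
    BlockNumber nblocks;                /* total blocks in relation */
    BlockNumber chunk_size;             /* fixed range handed to each worker */
} ParallelBlockAllocator;

/*
 * Claim the next contiguous range of blocks.  Sets *start and returns the
 * number of blocks claimed (0 at end of scan).  Handing out ranges rather
 * than single blocks preserves I/O combining inside each worker's
 * ReadStream; but note there is no ramp-down, so the final chunks can be
 * distributed unfairly across workers, which is the trade-off above.
 */
static BlockNumber
pba_next_range(ParallelBlockAllocator *pba, BlockNumber *start)
{
    BlockNumber first = atomic_fetch_add(&pba->next_block, pba->chunk_size);

    if (first >= pba->nblocks)
        return 0;                       /* scan finished */
    *start = first;
    if (first + pba->chunk_size > pba->nblocks)
        return pba->nblocks - first;    /* final partial chunk */
    return pba->chunk_size;
}
```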
Let me sketch out a simple idea first, explain why it's still not enough, and then sketch out a bigger idea:

Imagine a thing called WorkQueueSet in shared memory that contains a WorkQueue for each worker, and a function wqs_get_next_object(&state->tidbitmap_wqs, &object) that looks in your own backend's queue first; if there is nothing, it calls a registered callback that tries to fill your queue up (the current TBM iterator logic becomes that callback in this case, and the per-backend WorkQueue replaces the little array that I added in the attached 0001 patch); if that doesn't work, it tries to steal from everyone else's queues; and if that fails, then there really is no more work to be done and your scan is finished. That should be perfectly fair, not just approximately fair under certain assumptions (that are now false) like ramp-down schemes. *If* we can figure out how to make the interlocking so lightweight that the common case of consuming from your own uncontended work queue is practically as fast as reading out of a local array... (I have a few ideas about that part, but this is already long enough...)

But that only provides ramp-down for the block number ranges *entering* each worker's ReadStream. Our goal is to make the workers run out of work as close in time as possible, and their work begins when buffers *exit* each worker's ReadStream. Thought experiment: what if we could also steal blocks/buffers that someone else's ReadStream holds, whether already valid, referenced by an in-flight I/O, or waiting to be combined with more blocks in the pending read? The buffer manager can already deal with any and all races at its level and do the right thing, but we'd somehow need to make sure the theft victim doesn't also process that block.
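The lookup order described above (own queue, then refill callback, then steal, then done) can be sketched roughly like this. Everything here is hypothetical and single-threaded for clarity: the real thing would need the lightweight interlocking discussed above, and all names (WorkQueueSet, wqs_get_next_object, etc.) are placeholders, not an existing API:

```c
#include <assert.h>
#include <stdbool.h>

#define WQ_CAPACITY 8
#define WQS_MAX_WORKERS 4

typedef struct WorkQueue
{
    int items[WQ_CAPACITY];
    int head, tail;             /* consume at head, fill at tail */
} WorkQueue;

typedef struct WorkQueueSet
{
    WorkQueue queues[WQS_MAX_WORKERS];
    int nworkers;
    /* Refill callback, e.g. the TBM iterator; returns false when drained. */
    bool (*fill)(struct WorkQueueSet *wqs, WorkQueue *wq, void *arg);
    void *fill_arg;
} WorkQueueSet;

static bool
wq_pop(WorkQueue *wq, int *item)
{
    if (wq->head == wq->tail)
        return false;
    *item = wq->items[wq->head++];
    return true;
}

/*
 * Own queue first (the hot path that must stay nearly free), then try to
 * refill from the shared source, then steal from peers; only when all of
 * that fails is the scan truly finished.  Locking/barriers omitted.
 */
static bool
wqs_get_next_object(WorkQueueSet *wqs, int my_worker, int *item)
{
    WorkQueue *mine = &wqs->queues[my_worker];

    if (wq_pop(mine, item))
        return true;            /* fast path: own uncontended queue */
    if (wqs->fill(wqs, mine, wqs->fill_arg) && wq_pop(mine, item))
        return true;            /* refilled from the shared iterator */
    for (int w = 0; w < wqs->nworkers; w++)
        if (w != my_worker && wq_pop(&wqs->queues[w], item))
            return true;        /* stole from a peer */
    return false;               /* no work anywhere: scan finished */
}

/* Demo fill callback standing in for the TBM iterator: hands out the
 * numbers 0..limit-1, two at a time.  Capacity checks omitted. */
typedef struct DemoSource { int next, limit; } DemoSource;

static bool
demo_fill(WorkQueueSet *wqs, WorkQueue *wq, void *arg)
{
    DemoSource *src = arg;
    bool any = false;

    (void) wqs;
    for (int i = 0; i < 2 && src->next < src->limit; i++)
    {
        wq->items[wq->tail++] = src->next++;
        any = true;
    }
    return any;
}
```

Under this scheme, the end-of-scan condition ("wqs_get_next_object returned false") is exact: a worker only stops when its queue, the shared source, and every peer's queue are all empty.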
The ReadStream's buffer queue could be integrated with your backend's WorkQueue in a WorkQueueSet in shared memory, so that others can steal your stuff, and read_stream_next_buffer() would have to discard anything that has been stolen between entering and exiting the stream. The common non-stolen case would have to be extremely cheap; read barriers and ordering tricks, I guess.

(This is basically a new take on an idea I have mentioned before, a parallel-aware ReadStream, except that all previous versions collapsed in a heap of lock contention and other design problems. Work stealing seems to offer a way to keep everything almost as it is now, except in a few weird corner cases at end-of-stream when new magic fires up only if required. It also seems to be such a broadly useful technique that I'd want a general purpose thing, not just a ReadStream-internal trick.)

Putting both things together, I'm wondering about *two* levels of WorkQueueSet for stuff like Seq Scan and BHS:

1. Preserve I/O combining opportunities: The ReadStream's block number callback uses a WorkQueueSet to hand out block numbers as described a couple of paragraphs above. Perhaps WORK_QUEUE_TYPE_BLOCK_RANGE would be a built-in mode with queue elements [blockno, nblocks], so you can consume from them and dequeue each one once depleted, to implement tableam.c's parallel seq block allocator, with the callback working just as it does now except with the ramp-down part removed; and tidbitmap.c would use WORK_QUEUE_TYPE_BLOB to traffic TBMWorkQueueItem objects replacing the current TBMIteratorResult.

2. Fairness: The ReadStream is put into a mode that attaches to your backend's WorkQueue in a second WorkQueueSet, and uses it for buffers and per-buffer data. This allows other ReadStreams in other processes attached to the same WorkQueueSet to steal stuff. In the common case, there's no stealing, but at stream end you get perfect fairness.

I suspect that a generic work stealing component would find many other uses.
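For the first level, the [blockno, nblocks] element consumption might look like this (a sketch only; WORK_QUEUE_TYPE_BLOCK_RANGE, BlockRange, and block_range_consume() are hypothetical names from the idea above, not existing code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical WORK_QUEUE_TYPE_BLOCK_RANGE queue element: a run of blocks. */
typedef struct BlockRange
{
    BlockNumber blockno;        /* next block in the range */
    BlockNumber nblocks;        /* blocks remaining */
} BlockRange;

/*
 * Consume one block number from a range element, as the ReadStream's block
 * number callback might.  The element stays at the queue head until
 * depleted, so consecutive calls return consecutive block numbers and I/O
 * combining is preserved; the caller dequeues the element when this
 * returns InvalidBlockNumber.
 */
static BlockNumber
block_range_consume(BlockRange *range)
{
    if (range->nblocks == 0)
        return InvalidBlockNumber;      /* depleted: dequeue this element */
    range->nblocks--;
    return range->blockno++;
}
```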
I can think of quite a few, but here's one: it might be able to fix the PHJ unmatched-scan unfairness problem, within the constraints of our existing executor model.

Thought experiment: PHJ_BATCH_PROBE and PHJ_BATCH_SCAN are merged into one phase. Each worker emits already-joined tuples from a WorkQueueSet (IDK how they hold tuples exactly, this is hand-waving level). The callback that pulls those tuples into your backend's WorkQueue does the actual joining and sets the match bit, which has the crucial property that you can know when all bits are set without waiting, and then the callback can switch to feeding unmatched tuples into the work queue for FULL/RIGHT joins, and every worker can participate. No weird detach-and-drop-to-single-process, no wait, no deadlock risk.

The callback's batch size would become a new source of unfairness when performing the join, so perhaps you need two levels of WorkQueueSet, one for pre-joined tuples and one for post-joined tuples, and only when both are completely exhausted do you have all match bits. (There is a small extra complication: I think you need a concept of 'dequeued but busy' when moving items from pre-joined to post-joined, and a CV so you can wait for that condition to clear, or something like that; but that's a guaranteed-progress wait and only used in a very rare edge case. If I'm making a mistake in all this it may be related to this type of thing, but I'm not seeing anything unsolvable.) Note that I didn't say you could take over other workers' scans, which we definitely *can't* do with our current executor model; I'm just saying that if there is no stealable work left, matching must be finished.
If that sounds expensive, one thought is that it also provides a tidy solution to my old conundrum "how to do batch-oriented outer scan with the timing and control needed to drive software memory prefetching", i.e. without falling down the executor volcano by calling ExecProcNode() between prefetches. I strongly suspect that would provide so much speedup that it could easily pay for the cost of the extra steps, again assuming we can make the uncontended case blindingly fast. Maybe, IDK, I could be missing things...

Anyway, that was a long explanation for why I didn't come up with an easy patch to teach PBHS not to destroy I/O combining opportunities for v18 :-) The short version is that 0001 does address the immediate problem we identified, but it also helped me see the bigger picture a little bit more clearly and find a few interesting connections. Parallel query and I/O streaming are just not the best of friends, yet. R&D needed.