Re: BitmapHeapScan streaming read user and prelim refactoring
From: Thomas Munro
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id: CA+hUKGLe4_7+1+8LYTK7SEDXcj6ueAmdjwUojdCf2r9YAK4KDQ@mail.gmail.com
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra <tomas@vondra.me> wrote:
> > For the nvme RAID (device: raid-nvme), it looks almost exactly the
> > same, except that with parallel query (page 27) there's a clear area of
> > regression with eic=1 (look for "column" of red cells). That's a bit
> > unfortunate, because eic=1 is the default value.
>
> So, I feel pretty confident after even more analysis today with Thomas
> Munro that all of the parallel cases with effective_io_concurrency ==
> 1 are because master cheats effective_io_concurrency. Therefore, I'm
> not too worried about those for now. We probably will have to increase
> effective_io_concurrency before merging this patch, though.

Yeah.  Not only did we see it issuing up to 20 or so unauthorised
overlapping POSIX_FADV_WILLNEED calls, but in the case we studied they
didn't really do anything at all, because it was *postfetching* (!), ie
issuing advice for blocks it had already read.  The I/O was sequential
(reading all or most blocks in order), so the misdirected advice at
least kept out of the kernel's way, and the kernel did some really
effective readahead.  Meanwhile the stream version played by the rules
and respected eic=1, staying only one block ahead.  We know that issuing
advice for sequential blocks defeats kernel readahead, so that seems to
explain the red for that parameter/test permutation.  (Hypothesis based
on some quick reading: I guess kernel pages faulted in that way don't
get marked with the internal "readahead was here" flag, which would
normally cause the following pread() to trigger more asynchronous
readahead with ever larger physical size.)

But that code in master is also *trying* to do exactly what we're doing;
it's just failing because of some combination of bugs in the parallel
scan code (?).  It would be absurd to call it the winner: we're setting
out with the same goal, but for that particular set of parameters and
test setup, apparently two (?) bugs make a right.  Melanie mentioned
that some other losing cases also involved unauthorised amounts of
overlapping advice, except there it was random and beneficial and the
iterators didn't get out of sync with each other, so again it beat us in
a few red patches, seemingly with a different accidental behaviour.
Meanwhile we obediently trundled along with just 1 concurrent I/O or
whatever, and lost.

On top of that, read_stream.c is also *trying* to avoid issuing advice
for access patterns that look sequential, and *trying* to do I/O
combining, which all works OK in serial BHS, but it breaks down in
parallel because:

> Thomas mentioned this to me off-list, and I think he's right. We
> likely need to rethink the way parallel bitmap heap scan workers get
> block assignments for reading and prefetching to make it more similar
> to parallel sequential scan. The workers should probably get
> assignments of a range of blocks. On master, each worker does end up
> issuing reads/fadvises for a bunch of blocks in a row -- even though
> that isn't the intent of the parallel bitmap table scan
> implementation. We are losing some of that with the patch -- but only
> because it is behaving as you would expect given the implementation
> and design. I don't consider that a blocker, though.
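To make that seq-scan-style idea a bit more concrete, here's a toy
sketch of chunk-based block assignment, roughly the general shape of
what parallel seq scan does with its shared block counter.  This is
just an illustration with made-up names, not code from the patch: each
worker atomically claims a contiguous run of blocks, so its reads (and
any advice) stay sequential within the chunk, and read_stream.c's
sequential detection and I/O combining have something to work with.

    /*
     * Hypothetical illustration only: chunked block assignment among
     * parallel workers via a shared atomic counter.  All names invented.
     */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define CHUNK_SIZE 16

    typedef struct SharedScanState
    {
        _Atomic unsigned next_block;    /* next unclaimed block */
        unsigned    nblocks;            /* total blocks in the scan */
    } SharedScanState;

    /* Claim the next run of blocks; returns false when the scan is done. */
    static bool
    claim_chunk(SharedScanState *shared, unsigned *start, unsigned *end)
    {
        unsigned    first = atomic_fetch_add(&shared->next_block, CHUNK_SIZE);

        if (first >= shared->nblocks)
            return false;
        *start = first;
        *end = first + CHUNK_SIZE;
        if (*end > shared->nblocks)
            *end = shared->nblocks;
        return true;
    }

    int
    main(void)
    {
        SharedScanState shared = {.next_block = 0, .nblocks = 100};
        unsigned    start, end;

        /* One "worker" draining the scan; real workers would share 'shared'. */
        while (claim_chunk(&shared, &start, &end))
            printf("worker processes blocks [%u, %u)\n", start, end);
        return 0;
    }

A real version would of course need something like parallel seq scan's
ramp-down near the end of the scan, and would have to interact with the
shared TBM iterator rather than a plain block counter.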
+   /*
+    * The maximum number of tuples per page is not large (typically 256 with
+    * 8K pages, or 1024 with 32K pages).  So there's not much point in making
+    * the per-page bitmaps variable size.  We just legislate that the size is
+    * this:
+    */
+   OffsetNumber offsets[MaxHeapTuplesPerPage];
 } TBMIterateResult;

Seems to be 291?  So sizeof(TBMIterateResult) must be somewhere around
588 bytes?  There's one of those for every buffer, and we'll support
queues of up to effective_io_concurrency (1-1000) * io_combine_limit
(1-128?), which would require ~75MB of per-buffer data with maximum
settings.  That's a lot of memory to have to touch as you whizz around
the circular queue even when you don't really need it.  With more
typical numbers like 10 * 32 it's ~90kB, which I guess is OK.  I wonder
if it would be worth trying to put a small object in there instead that
could be expanded to the results later, cf f6bef362 for TIDStore, but
I'm not sure if it's a blocker, just an observation.
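For what it's worth, here's a quick back-of-the-envelope check of that
worst case, assuming 8KB pages (so MaxHeapTuplesPerPage = 291 and
OffsetNumber is 2 bytes) and guessing a handful of bytes for the other
struct members; the exact sizeof() obviously depends on the real layout
and padding:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* ~582 bytes of offsets[] plus a guessed ~6 bytes of other fields */
        size_t      per_result = 291 * sizeof(uint16_t) + 6;
        /* effective_io_concurrency max * io_combine_limit max */
        size_t      max_queue = 1000 * 128;

        printf("per TBMIterateResult: ~%zu bytes\n", per_result);
        printf("per-buffer data at max settings: ~%zu MB\n",
               per_result * max_queue / (1000 * 1000));
        return 0;
    }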