Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: BitmapHeapScan streaming read user and prelim refactoring
Date
Msg-id CA+hUKGLe4_7+1+8LYTK7SEDXcj6ueAmdjwUojdCf2r9YAK4KDQ@mail.gmail.com
In response to Re: BitmapHeapScan streaming read user and prelim refactoring  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: BitmapHeapScan streaming read user and prelim refactoring
List pgsql-hackers
On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra <tomas@vondra.me> wrote:
> > For the nvme RAID (device: raid-nvme), it looks almost exactly the
> > same, except that with parallel query (page 27) there's a clear area of
> > regression with eic=1 (look for "column" of red cells). That's a bit
> > unfortunate, because eic=1 is the default value.
>
> So, I feel pretty confident after even more analysis today with Thomas
> Munro that all of the parallel cases with effective_io_concurrency ==
> 1 are because master cheats effective_io_concurrency. Therefore, I'm
> not too worried about those for now. We probably will have to increase
> effective_io_concurrency before merging this patch, though.

Yeah.  Not only did we see master issuing up to 20 or so unauthorised
overlapping POSIX_FADV_WILLNEED calls, but in the case we studied they
didn't really do anything at all because it was *postfetching* (!), ie
issuing advice for blocks it had already read.  The I/O was sequential
(reading all or most blocks in order), so the misdirected advice at
least stayed out of the kernel's way, and the kernel did some really
effective readahead.  Meanwhile the stream version played by the rules
and respected eic=1, staying only one block ahead.  We know that
issuing advice for sequentially accessed blocks defeats readahead, so
that seems to explain the red for that parameter/test permutation.
(Hypothesis based on some quick reading: I guess kernel pages faulted
in that way don't get marked with the internal "readahead was here"
flag, which would normally cause the following pread() to trigger more
asynchronous readahead with ever larger physical size.)  But that code
in master is also *trying* to do exactly what we're doing; it's just
failing because of some combination of bugs in the parallel scan code
(?).  It would be absurd to call it the winner: we're setting out with
the same goal, but for that particular set of parameters and test
setup, apparently two (?) bugs make a right.  Melanie mentioned that
some other losing cases also involved unauthorised amounts of
overlapping advice, except there the access pattern was random, the
advice was beneficial, and the iterators didn't get out of sync with
each other, so again it beat us in a few red patches, seemingly through
a different accidental behaviour.  Meanwhile we obediently trundled
along with just 1 concurrent I/O or whatever, and lost.
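
For anyone following along who hasn't looked at this area: the "advice"
in question is posix_fadvise(POSIX_FADV_WILLNEED).  A minimal sketch of
what issuing such a hint looks like, with made-up helper names and not
taken from master or the patch:

    /* Hypothetical illustration only, not code from master or the patch. */
    #include <fcntl.h>
    #include <sys/types.h>

    /*
     * Advice-based prefetching: hint the kernel that we expect to read
     * this block soon.  If the block has in fact already been read (the
     * "postfetching" case above), the hint achieves nothing; and if the
     * blocks are being accessed sequentially, hinting them one by one can
     * defeat the kernel's own readahead heuristics.
     */
    static void
    hint_block(int fd, off_t blockno, off_t blocksz)
    {
        (void) posix_fadvise(fd, blockno * blocksz, blocksz,
                             POSIX_FADV_WILLNEED);
    }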

On top of that, read_stream.c is also *trying* to avoid issuing advice
for access patterns that look sequential, and *trying* to do I/O
combining, which all works OK in serial BHS, but it breaks down in
parallel because:

> Thomas mentioned this to me off-list, and I think he's right. We
> likely need to rethink the way parallel bitmap heap scan workers get
> block assignments for reading and prefetching to make it more similar
> to parallel sequential scan. The workers should probably get
> assignments of a range of blocks. On master, each worker does end up
> issuing reads/fadvises for a bunch of blocks in a row -- even though
> that isn't the intent of the parallel bitmap table scan
> implementation. We are losing some of that with the patch -- but only
> because it is behaving as you would expect given the implementation
> and design. I don't consider that a blocker, though.
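
Just to illustrate the shape of what range-based assignment could mean,
with entirely hypothetical names and numbers (not the patch; parallel
seq scan's atomic chunk allocation is the obvious model, and I'm
assuming PostgreSQL's port/atomics.h and the Min() macro):

    /*
     * Hypothetical sketch only: hand each worker a contiguous chunk of
     * the bitmap's block positions, so its reads stay sequential and
     * combinable, instead of interleaving single blocks across workers.
     */
    #define CHUNK_SIZE 16               /* positions handed out per request */

    typedef struct SharedChunkState
    {
        pg_atomic_uint64 next_block;    /* next unassigned bitmap position */
        uint64           nblocks;       /* total positions in the bitmap */
    } SharedChunkState;

    /* Returns the first position of the caller's next chunk, or UINT64_MAX. */
    static uint64
    next_chunk(SharedChunkState *state, uint64 *chunk_end)
    {
        uint64 start = pg_atomic_fetch_add_u64(&state->next_block, CHUNK_SIZE);

        if (start >= state->nblocks)
            return UINT64_MAX;          /* no more work */
        *chunk_end = Min(start + CHUNK_SIZE, state->nblocks);
        return start;
    }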

+    /*
+     * The maximum number of tuples per page is not large (typically 256 with
+     * 8K pages, or 1024 with 32K pages).  So there's not much point in making
+     * the per-page bitmaps variable size.  We just legislate that the size is
+     * this:
+     */
+    OffsetNumber offsets[MaxHeapTuplesPerPage];
 } TBMIterateResult;

MaxHeapTuplesPerPage seems to be 291 (with 8K pages)?  So
sizeof(TBMIterateResult) must be somewhere around 588 bytes?  There's
one of those for every buffer, and we'll support queues of up to
effective_io_concurrency (1-1000) * io_combine_limit (1-128?), which
would require ~75MB of per-buffer data with maximum settings.  That's a
lot of memory to have to touch as you whizz around the circular queue
even when you don't really need it.  With more typical numbers like
10 * 32 it's ~90kB, which I guess is OK.  I wonder if it would be worth
trying to put a small object in there instead that could be expanded to
the results later, cf f6bef362 for TIDStore, but I'm not sure if it's a
blocker, just an observation.
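
To sketch the "small object, expand later" idea (purely hypothetical,
not what f6bef362 does and not from the patch; field names are made up):

    /*
     * Hypothetical sketch: keep only a fixed-size header in the per-buffer
     * queue slot, and materialise the offsets array on demand when the
     * scan actually consumes the page, rather than carrying
     * MaxHeapTuplesPerPage OffsetNumbers for every queued buffer.
     */
    typedef struct TBMCompactResult
    {
        BlockNumber blockno;        /* heap block this entry describes */
        uint16      noffsets;       /* 0 means lossy: recheck whole page */
        bool        recheck;        /* qual recheck required? */
    } TBMCompactResult;             /* ~8 bytes instead of ~588 */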


