From: Melanie Plageman
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id: CAAKRu_a3uv0R4p+21NzJEwF0OyFqq7S4xJoVD8-HB8ya1=4MoQ@mail.gmail.com
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On Sun, Feb 16, 2025 at 7:29 AM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 2/16/25 02:15, Tomas Vondra wrote:
> >
> > ...
> >
> > OK, I've uploaded the results to the github repository as usual
> >
> >   https://github.com/tvondra/bitmapscan-tests/tree/main/20250214-184807
> >
> > and I've generated the same PDF reports, with the colored comparison.
> >
> > If you compare the pivot tables (I open the "same" PDF from the two
> > runs and flip between them using alt-tab, which makes the interesting
> > regions easy to spot), the change is very clear.
> >
> > Disabling the sequential detection greatly reduces the scope of
> > regressions. That looks pretty great, IMO.
> >
> > It also seems to lose some speedups, especially with io_combine_limit=1
> > and eic=1. I'm not sure why, or whether that's expected.

Yeah, I don't know why these would change with vs. without sequential
detection. Honestly, for serial bitmap heap scan, I would assume that
eic=1, io_combine_limit=1 would behave the same as master (i.e. still
issuing the one fadvise and introducing no additional latency from
trying to combine IOs). So both the speedup and the slowdown are a bit
surprising to me.
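
To spell out what I mean, the master-equivalent behavior I'd expect at
eic=1, io_combine_limit=1 is one WILLNEED hint ahead of each
block-sized read and no combining. A rough sketch, not the actual
heapam or read stream code (the function name, the raw fd, and the
block array are stand-ins for the buffer manager machinery):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Sketch: prefetch distance of one block, no IO combining. */
    static void
    scan_blocks(int fd, const unsigned *blocks, int nblocks)
    {
        char    buf[BLCKSZ];

        for (int i = 0; i < nblocks; i++)
        {
            /* hint the next block before reading the current one */
            if (i + 1 < nblocks)
                posix_fadvise(fd, (off_t) blocks[i + 1] * BLCKSZ,
                              BLCKSZ, POSIX_FADV_WILLNEED);
            pread(fd, buf, BLCKSZ, (off_t) blocks[i] * BLCKSZ);
        }
    }

If both sides really reduce to that same syscall pattern, neither the
speedup nor the slowdown has an obvious source.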

> > There still remain areas of regression, but most of them are for cases
> > that'd use index scan (tiny fraction of rows scanned), or with
> > read-ahead=4096 (and not for the lower settings).
> >
> > The read-ahead dependence is actually somewhat interesting, because I
> > realized the RAID array has this set to 8192 by default, i.e. even
> > higher than 4096 where it regresses. I suppose mdadm does that, or
> > something, I don't know how the default is calculated. But I assume it
> > depends on the number of devices, so larger arrays might have even
> > higher read-ahead values.

Are the readahead regressions you are talking about the ones with
io_combine_limit > 1 and effective_io_concurrency 0 (at higher
readahead values)? With those settings, we expect that no fadvises and
higher readahead values mean the OS will do good readahead for us. But
somehow the read combining seems to be interfering with this.
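
In syscall terms, all read combining changes there is that a run of
adjacent blocks goes down as one big vectored read instead of a
sequence of block-sized reads. Again just a sketch of the IO shape, not
the real streaming read code (IO_COMBINE_LIMIT here is a stand-in for
the GUC, and the caller is assumed to pass nblocks within that limit):

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLCKSZ 8192
    #define IO_COMBINE_LIMIT 16     /* blocks per combined IO */

    /* Sketch: read a run of adjacent blocks with one vectored call. */
    static ssize_t
    read_combined(int fd, unsigned first_block, int nblocks,
                  char bufs[][BLCKSZ])
    {
        struct iovec iov[IO_COMBINE_LIMIT];

        for (int i = 0; i < nblocks; i++)
        {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len = BLCKSZ;
        }
        /* one syscall covering the whole nblocks * BLCKSZ range */
        return preadv(fd, iov, nblocks, (off_t) first_block * BLCKSZ);
    }

So with io_combine_limit=16 the kernel sees one 128kB read where it
used to see sixteen sequential 8kB reads, and its readahead heuristics
may well size their windows differently for the two patterns.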

This is curious. It seems like it must have something to do with the
way the kernel calculates the readahead size. While the Linux kernel
readahead documentation [1] is not particularly easy to understand,
some parts stick out to me:

    * The size of the async tail is determined by subtracting the size that
    * was explicitly requested from the determined request size, unless
    * this would be less than zero - then zero is used.
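
If I'm reading that right, the sizing boils down to something like this
(my paraphrase of the comment, not the actual kernel code):

    /*
     * Sketch of the async-tail sizing described above: the async tail
     * is the determined request size minus what was explicitly
     * requested, clamped at zero.
     */
    static unsigned
    async_tail(unsigned ra_size, unsigned requested)
    {
        return (ra_size > requested) ? ra_size - requested : 0;
    }

On that reading, an 8kB explicit request against a 128kB window leaves
a 120kB async tail, while a 128kB combined request leaves a 0-byte
tail, i.e. nothing is read ahead of the scan anymore.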

I wonder if somehow the larger IO requests are foiling readahead.
Honestly, I don't know how far we can get trying to figure this out.
And explicitly reverse-engineering this may backfire in other ways.

- Melanie

[1] https://github.com/torvalds/linux/blob/master/mm/readahead.c


