Re: Limiting overshoot in nbtree's parallel SAOP index scans - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Limiting overshoot in nbtree's parallel SAOP index scans
Date
Msg-id CAEze2Wgge8=XtEh1zR7coX37azd9XdTZef85DWiU3eO0S-1Q4g@mail.gmail.com
In response to Re: Limiting overshoot in nbtree's parallel SAOP index scans  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, 14 May 2025 at 08:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 17, 2024 at 5:03 AM Matthias van de Meent
> <boekewurm@gmail.com> wrote:
> >
> > On Thu, 17 Oct 2024 at 00:33, Peter Geoghegan <pg@bowt.ie> wrote:
> > >
> > > On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent
> > > <boekewurm@gmail.com> wrote:
> > > > In v17 and the master branch you'll note 16 buffer hits for the test
> > > > query. However, when we use more expensive btree compare operations
> > > > (e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
> > > > buffer access count starts to vary a lot and skyrockets to 30+ on my
> > > > machine, in some cases reaching >100 buffer hits. After applying my
> > > > patch, the buffer access count is capped to a much more agreeable
> > > > 16-18 hits - it still shows signs of overshooting the serial bounds,
> > > > but the number of buffers we overshoot our target is capped and thus
> > > > significantly lower.
> > >
> > > It's not exactly capped, though. Since in any case you're always prone
> > > to getting extra leaf page reads at the end of each primitive index
> > > scan. That's not something that's new to Postgres 17, though.
> >
> > True, but the SAOP-enabled continued overshoot _is_ new: previously,
> > each backend would only get up to one additional buffer access for
> > every SAOP scan entry, while now it's only limited by outer SAOP
> > bounds and index size.
> >
>
> IIUC, the problem you are trying to solve is that we will miss
> primitive index scans in case of parallel index scans.

Correct - we can repeatedly miss the opportunity to start a new
primitive index scan when processing the contents of a page takes
longer than it takes for another backend to wake up and seize the
shared scan state.

> The problem can
> happen when the second parallel backend process (say b-2) has seized
> the page after the first parallel backend (say b-1) has released the
> current page and before b-1 could check whether it can schedule
> another primitive index scan.

Correct.

> And you want to fix it by delaying the
> release of current page by b-1 in some cases, if so, then it has some
> downsides as well as pointed out by Peter. Is my understanding
> correct?

Yes, exactly.

> If so, then we may also need to prove that releasing the page
> at a later point doesn't harm any other cases.

There is no way to prove it won't harm other cases, as this is based
on heuristics and assumptions.

The assumption behind the patch is that if you've missed the
opportunity to start a new primitive index scan twice in a row, then
it's likely you'll miss the same chance the next time too, and the one
after that.

Furthermore, the skip scan architecture itself is based on heuristics
where the assumption is that starting a new primitive scan is cheaper
than scanning the index to the next matching element.

Finally, missing the opportunity to start a new skip scan requires
that at least 1 other worker is in approximately the same state as we
are: the other backend that continued the parallel index scan, which
made us miss the opportunity to start a new primitive index scan.

In this case, we know the parallel scan has already spent 2 page
scans' worth of time waiting for a new primitive scan. If our first
assumption holds, then by holding on to the parallel scan we only
block the other backends from doing work that would have been
wasteful anyway, for at most the duration of processing one page. If
the assumption is false, then we'll have lost at most n_workers
backends' time for that same duration - workers that probably had
pages with data they still had to process.

> I am not sure if I understood the problem completely, so can you
> please explain it in a bit more layman's language, how it works before
> 17 and after 17?

Before PG17, every IN()-list item had its own primitive index scan
(after deduplication of the elements).
  This caused excess primitive scans for elements that were on the
same page (hence the change in PG17), but guaranteed separate
primitive scans for elements that were far apart, limiting pages
accessed to at most [SAOP scan count * (pages_with_values_per_primscan
+ n_workers)].
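
To make the pre-PG17 behaviour concrete, here is a minimal sketch;
the table, index and values are hypothetical and not taken from this
thread. On PG16 and earlier, each distinct array element gets its own
descent from the root to a leaf, even when all elements end up on the
same leaf page, and that is visible in the Buffers line of EXPLAIN:

  -- Hypothetical illustration; names and data are made up.
  CREATE TABLE saop_demo (id bigint PRIMARY KEY, payload text);
  INSERT INTO saop_demo
  SELECT g, 'row ' || g FROM generate_series(1, 1000000) g;
  VACUUM ANALYZE saop_demo;

  -- Pre-PG17: (up to) one primitive index scan per distinct array
  -- element.  PG17+: ideally a single primitive scan when the values
  -- share leaf pages, visible as fewer "Buffers: shared hit" entries.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM saop_demo WHERE id = ANY (ARRAY[10, 11, 12]);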

Starting with PG17, the index scan is (in principle) a single
primitive scan, unless 1.) a backend finds that a new primitive scan
is needed, and 2.) that backend can signal the shared state about this
before the scan state can be acquired by another worker.
  This guarantees we don't use a separate primitive scan for every
element in the scan when they're located on the same index pages, but
it could mean we scan the whole index when someone issues an index
scan for = ANY(INT_MIN, INT_MAX) and the 2nd condition for starting a
new primitive scan never holds.
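
As a hedged sketch of that worst case, reusing the hypothetical
saop_demo table from above and putting the two array values at the
extremes of the indexed key range:

  -- Hypothetical worst case: two values at opposite ends of the
  -- indexed key range.  Executed serially this is still two cheap
  -- primitive scans; the concern is the parallel case, where
  -- repeatedly missing the "start a new primitive scan" signal can
  -- leave workers walking the leaf pages between id = 1 and
  -- id = 1000000.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM saop_demo WHERE id = ANY (ARRAY[1, 1000000]);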

> BTW, I also want to clarify my understanding of primitive index scans
> and how it has changed in PG17, is it related to how we optimize SAOP
> scans by reducing the number of leaf page scans?

Exactly, yes.

I hope this clarified a few items.


Kind regards,

Matthias van de Meent
Databricks


