Thread: Re: Limiting overshoot in nbtree's parallel SAOP index scans

Re: Limiting overshoot in nbtree's parallel SAOP index scans

From: Peter Geoghegan
On Fri, Oct 11, 2024 at 10:27 AM Matthias van de Meent
<boekewurm@gmail.com> wrote:
> With the introduction of the new SAOP handling in PG17, however, the
> shared state has become a bit more muddied. Because the default has
> changed from "stop scanning at the end of a SAOP value's range" to
> "continue scanning, unless ...", and because it is relatively
> expensive to determine whether we need to stop or continue we release
> the parallel scan to continue before we know if a new primitive scan
> would be preferred.

It's not just relatively expensive. It's essential that _bt_readpage
be able to release the parallel scan at the earliest opportunity (just
as soon as the next sibling link is known). Whereas a parallel backend
worker will only decide that it's likely a good idea to do another
primitive index scan when _bt_readpage has finished executing.

Clearly it would not be sensible to wait until _bt_readpage had gotten
as far as deciding on the need for another descent of the index. To do
so would essentially serialize the scan. There are clear advantages to
doing as little coordination as possible among backends. (I don't
think that you disagree with any of this, but just want to establish
context for anybody else that's reading along -- you seem to want to
do some limited/conditional form of this behavior.)
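
To make the ordering concrete, the control flow is roughly the
following (a simplified sketch, not the actual code -- the real thing
lives in _bt_readpage and _bt_checkkeys):

/* one parallel worker's per-page work, heavily simplified */
_bt_readpage(...)
{
    /*
     * Publish the next sibling link and hand the scan off right away,
     * before doing any real work on this page.
     */
    _bt_parallel_release(scan, opaque->btpo_next);

    /*
     * Only now evaluate the scan keys against the page's tuples.
     * _bt_checkkeys() may conclude that another primitive index scan
     * is needed and call _bt_parallel_primscan_schedule() -- but by
     * then some other worker has typically already seized the scan
     * and advanced it to the next page.
     */
    for (each tuple on the page)
        _bt_checkkeys(...);
}

The handoff deliberately happens before any of the potentially
expensive key evaluation, which is what keeps workers from
serializing on each other.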

> Problem
> -------
> The new parallel index scan code can only get into a new primitive
> scan IFF no parallel worker has taken the next block in the time
> between the call to _bt_parallel_release at the start of _bt_readpage,
> and the call to _bt_parallel_primscan_schedule through _bt_checkkeys
> slightly later in _bt_readpage. This makes the code susceptible to
> race conditions

Do you have a test case?

I was aware of the issues in this area when designing the
_bt_parallel_primscan_schedule mechanism. I believe it's true that the
scan is only strictly guaranteed to be able to make forward progress
via naively scanning the next leaf page, again and again. This is a
fairly theoretical point, though.

I believe that it's practically impossible for this to cause us
problems. That would require a group of backends that all want to
start another primitive index scan, but that constantly block each
other out. Not just once, mind you -- again and again (enough to be
noticeable). This would have to happen in spite of the fact that we
only release one backend per _bt_parallel_release call
(_bt_parallel_release uses ConditionVariableSignal under the hood,
which only releases the oldest worker waiting on the CV, not all
backends).

There is no signalling of condition variables within
_bt_parallel_primscan_schedule, either. So you just have this one call
to ConditionVariableSignal. Why should it be possible to get a
stampede of parallel backends? (I'm not 100% sure if your concern is
even theoretically valid.)
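
For reference, _bt_parallel_release amounts to little more than this
(paraphrasing the v17 code from memory, so treat the details as
approximate):

void
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
{
    ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
    BTParallelScanDesc btscan;

    btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
                                                  parallel_scan->ps_offset);

    /* publish the next page and mark the scan as up for grabs */
    SpinLockAcquire(&btscan->btps_mutex);
    btscan->btps_scanPage = scan_page;
    btscan->btps_pageStatus = BTPARALLEL_IDLE;
    SpinLockRelease(&btscan->btps_mutex);

    /* wakes at most one waiting worker -- not a broadcast */
    ConditionVariableSignal(&btscan->btps_cv);
}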

> Given the above, a parallel index scan with ScanKey `id = ANY
> (minvalue, maxvalue)` and bad enough timing/luck could thus scan the
> whole index, while just 2 primitive index scans (the behaviour before
> PG17) are sufficient.
>
> In practice it is quite unlikely that parallel index scans have such
> bad luck, but I don't think we should have luck or timing as a big
> factor for the performance picture in parallel index scans.

It's also possible for a hash table to degenerate when there are
excessively many hash collisions. This is certainly possible with
adversarial inputs -- it's not hard to write a test case that'll
generate some. So hash tables generally also "have luck as a big
factor", in about the same sense. No?

I don't think that it's fundamentally unreasonable to have concerns
about things like this, of course. Safety-critical avionics systems
tend to just not use hash tables at all.

> I think I've found a reasonably elegant solution, which limits the
> number of buffers we can fail to start a new primitive scan to the
> number of parallel workers + 1:  If we detect that a parallel worker
> in the same primscan range thinks this is the right moment to start a
> new primscan (but couldn't due to concurrent advancement), we don't
> release the parallel scan immediately, but instead only release it
> after reading the pages contents, to find out if we really should
> start a new primitive scan.  If so, we do that, and if not, we instead
> mark that the primitive scan has reached a new primscan range, do some
> cleanup, and then continue business as usual.

I'm opposed to this patch. It'll introduce a behavior that's more
difficult to reason about and test than the one that it purports to
fix.

--
Peter Geoghegan



Re: Limiting overshoot in nbtree's parallel SAOP index scans

From: Matthias van de Meent
On Wed, 16 Oct 2024 at 20:52, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Oct 11, 2024 at 10:27 AM Matthias van de Meent
> <boekewurm@gmail.com> wrote:
> > With the introduction of the new SAOP handling in PG17, however, the
> > shared state has become a bit more muddied. Because the default has
> > changed from "stop scanning at the end of a SAOP value's range" to
> > "continue scanning, unless ...", and because it is relatively
> > expensive to determine whether we need to stop or continue we release
> > the parallel scan to continue before we know if a new primitive scan
> > would be preferred.
>
> It's not just relatively expensive. It's essential that _bt_readpage
> be able to release the parallel scan at the earliest opportunity (just
> as soon as the next sibling link is known).

Why is this essential? AFAIK, that's only for performance reasons -
and it's also only for performance reasons that we start a new
primitive scan, so in my view there is a trade-off between releasing
the scan early (and making better use of parallel resources) and
scheduling a new primitive scan (to reduce overall resource usage).

> Whereas a parallel backend
> worker will only decide that it's likely a good idea to do another
> primitive index scan when _bt_readpage has finished executing.

Correct.

> Clearly it would not be sensible to wait until _bt_readpage had gotten
> as far as deciding on the need for another descent of the index. To do
> so would essentially serialize the scan. There are clear advantages to
> doing as little coordination as possible among backends. (I don't
> think that you disagree with any of this, but just want to establish
> context for anybody else that's reading along -- you seem to want to
> do some limited/conditional form of this behavior.)

In general, yes, but if we see strong signals that it may be worth the
wait, then why not?

> > Problem
> > -------
> > The new parallel index scan code can only get into a new primitive
> > scan IFF no parallel worker has taken the next block in the time
> > between the call to _bt_parallel_release at the start of _bt_readpage,
> > and the call to _bt_parallel_primscan_schedule through _bt_checkkeys
> > slightly later in _bt_readpage. This makes the code susceptible to
> > race conditions
>
> Do you have a test case?

Yes, see below.

> I was aware of the issues in this area when designing the
> _bt_parallel_primscan_schedule mechanism. I believe it's true that the
> scan is only strictly guaranteed to be able to make forward progress
> via naively scanning the next leaf page, again and again. This is a
> fairly theoretical point, though.
>
> I believe that it's practically impossible for this to cause us
> problems. That would require a group of backends that all want to
> start another primitive index scan, but each block each other out,
> constantly. Not just once, mind you -- again and again (enough to be
> noticeable). This would have to happen in spite of the fact that we
> only release one backend per _bt_parallel_release call
> (_bt_parallel_release uses ConditionVariableSignal under the hood,
> which only releases the oldest worker waiting on the CV, not all
> backends).

I may not have communicated this well enough, but what happens in the
issue at hand is as follows:

0. backend 1 has seized the scan; backend 2 is now waiting for another
page to process.
1. backend 1: _release
2. backend 1: start _readpage
3. backend 2: wake up
4. backend 2: _seize
5. backend 1: finish _readpage;
6. backend 1: _primscan_schedule -> fails to schedule.
... swap roles of backend 1 and 2, and restart at 0.

This is easy to replicate with a somewhat expensive compare operator,
one which takes enough CPU time that a signaled backend can wake up
and seize the scan before a page is processed in _readpage. Because
the pages contain no interesting tuples, the scan infrastructure won't
ever need to visit HEAP VM pages or return any data from the scan in
this overshoot case, and thus won't easily get into a state that might
stop seizing the scan for long enough that _primscan_schedule actually
does something - after _readpage finds out a new primitive scan is
needed, it immediately returns and is ready to _seize the scan again.

A test case for this:

Schema and data population:
CREATE TABLE test (a int, b bigint);
INSERT INTO test (a, b) SELECT i % 101, i % 1001 FROM
generate_series(1, 10000000) i;
CREATE INDEX ON test (a, b);
DELETE FROM test WHERE a = 0 AND b = 1000;
DELETE FROM test WHERE a = 100 AND b = 0;
VACUUM (FREEZE, INDEX_CLEANUP on, ANALYZE) test;

This gets you an index of 9239 blocks.
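
To double-check the index size, something like this should do
(assuming the default index name; the VACUUM above keeps relpages up
to date):

SELECT relpages FROM pg_class WHERE relname = 'test_a_b_idx';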

Config to force an actual parallel scan (note: debug_parallel_query
doesn't exhibit the issue due to single-worker processing of data :-(
):
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 1;
SET min_parallel_index_scan_size = 1;

Test query:
SELECT count(*) FROM test WHERE a IN (0, 100) AND b IN (0, 1000); -- count = 196
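
One way to look at the buffer counts is with something like:

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT count(*) FROM test WHERE a IN (0, 100) AND b IN (0, 1000);

and reading the "Buffers:" counters in the output.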

In v17 and the master branch you'll note 16 buffer hits for the test
query. However, when we use more expensive btree compare operations
(e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
buffer access count starts to vary a lot and skyrockets to 30+ on my
machine, in some cases reaching >100 buffer hits. After applying my
patch, the buffer access count is capped to a much more agreeable
16-18 hits - it still shows signs of overshooting the serial bounds,
but the number of buffers by which we overshoot our target is capped and thus
significantly lower.
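
For reference, the modification I mean is just something along these
lines in nbtcompare.c (plus the same one-liner in btint8cmp, and
whatever #include is needed to get at pg_usleep()):

Datum
btint4cmp(PG_FUNCTION_ARGS)
{
    int32       a = PG_GETARG_INT32(0);
    int32       b = PG_GETARG_INT32(1);

    pg_usleep(1);   /* simulate a compare operator that isn't dirt cheap */

    if (a > b)
        PG_RETURN_INT32(1);
    else if (a == b)
        PG_RETURN_INT32(0);
    else
        PG_RETURN_INT32(-1);
}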

> There is no signalling of condition variables within
> _bt_parallel_primscan_schedule, either. So you just have this one call
> to ConditionVariableSignal. Why should it be possible to get a
> stampede of parallel backends? (I'm not 100% sure if your concern is
> even theoretically valid.)

See above: the concern isn't necessarily a stampede, but just that in
some cases it is likely that a parallel worker is ready to seize the
scan before this backend's page is read to a large enough extent to
decide to schedule a new primitive scan, thus starving the scan of
opportunities to start new primscans.

> > Given the above, a parallel index scan with ScanKey `id = ANY
> > (minvalue, maxvalue)` and bad enough timing/luck could thus scan the
> > whole index, while just 2 primitive index scans (the behaviour before
> > PG17) are sufficient.
> >
> > In practice it is quite unlikely that parallel index scans have such
> > bad luck, but I don't think we should have luck or timing as a big
> > factor for the performance picture in parallel index scans.
>
> It's also possible for a hash table to degenerate when there are
> excessively many hash collisions. This is certainly possible with
> adversarial inputs -- it's not hard to write a test case that'll
> generate some. So hash tables generally also "have luck as a big
> factor", in about the same sense. No?

For hash tables, the hash function is assumed to be really good or
perfect; and degeneration is expected when this assumption is wrong.
In btree parallel scans, we explicitly expect the scan to be seized
before we get to start a new primscan. And if that happens for one
page, why would we assume that it won't also happen on the next
page, or its next page? Presumably, it's a rare occasion, but if it
happened then the timing must have lined up, and if it lined up once, why
would we assume it is a one-in-a-million case?

> I'm opposed to this patch. It'll introduce a behavior that's more
> difficult to reason about and test than the one that it purports to
> fix.

I'm happy to clarify and document the updated behavior of the parallel
scan, to make reasoning easier. I'd prefer to discontinue the
reasoning of "we may scan the whole index in parallel SAOP scans" in
favor of "we may scan a limited number of index pages more in parallel
SAOP scans than in serial SAOP scans".

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Limiting overshoot in nbtree's parallel SAOP index scans

From: Peter Geoghegan
On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent
<boekewurm@gmail.com> wrote:
> In v17 and the master branch you'll note 16 buffer hits for the test
> query. However, when we use more expensive btree compare operations
> (e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
> buffer access count starts to vary a lot and skyrockets to 30+ on my
> machine, in some cases reaching >100 buffer hits. After applying my
> patch, the buffer access count is capped to a much more agreeable
> 16-18 hits - it still shows signs of overshooting the serial bounds,
> but the number of buffers by which we overshoot our target is capped and thus
> significantly lower.

It's not exactly capped, though, since in any case you're always prone
to getting extra leaf page reads at the end of each primitive index
scan. That's not something that's new to Postgres 17.

Anyway, I'm still not convinced. Your test case requires adding a one
second delay to each ORDER proc comparison, and so has an unrealistic
adversarial character. It uses an index-only scan that is drastically
faster if we don't use a parallel scan at all. The serial case is
0.046 ms for me, whereas the parallel case is 3.094 ms (obviously
that's without the addition of a 1 second delay). You've thrown
everything but the kitchen sink at the issue, and yet the impact on
buffer hits really isn't too bad.

Does anybody else have an opinion on this?

--
Peter Geoghegan



Re: Limiting overshoot in nbtree's parallel SAOP index scans

From: Matthias van de Meent
On Thu, 17 Oct 2024 at 00:33, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent
> <boekewurm@gmail.com> wrote:
> > In v17 and the master branch you'll note 16 buffer hits for the test
> > query. However, when we use more expensive btree compare operations
> > (e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
> > buffer access count starts to vary a lot and skyrockets to 30+ on my
> > machine, in some cases reaching >100 buffer hits. After applying my
> > patch, the buffer access count is capped to a much more agreeable
> > 16-18 hits - it still shows signs of overshooting the serial bounds,
> > but the number of buffers by which we overshoot our target is capped and thus
> > significantly lower.
>
> It's not exactly capped, though. Since in any case you're always prone
> to getting extra leaf page reads at the end of each primitive index
> scan. That's not something that's new to Postgres 17, though.

True, but the SAOP-enabled continued overshoot _is_ new: previously,
each backend would only get up to one additional buffer access for
every SAOP scan entry, while now it's only limited by outer SAOP
bounds and index size.

> Anyway, I'm still not convinced. Your test case requires adding a one
> second delay to each ORDER proc comparison,

Microsecond. pg_usleep() takes microseconds as its argument, and so
pg_usleep(1) should sleep for >=1 microsecond, but definitely much
less than 1 second.

> and so has an unrealistic adversarial character.

I don't think it's too adversarial to expect some compare operations
for a page to take one microsecond, especially when that compare
operation is related to SAOPs. I think complex comparators like those
for jsonb or collated text can certainly take much more CPU time than
your usual integer compare operation, which can definitely be long
enough for other workers to seize the scan. It's just that I didn't
want to spend a lot of time building a dataset that would expose the
issue when a single and obvious modification can do the same.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)