Thread: Can can I make an injection point wait occur no more than once?

Can can I make an injection point wait occur no more than once?

From
Peter Geoghegan
Date:
I'm working on adding test coverage to _bt_lock_and_validate_left,
which was enhanced by Postgres 18 commit 1bd4bc85ca. In particular,
coverage of its unhappy path: the path where multiple concurrent page
splits necessitate that the scan (which generally moves to the left)
moves to the right multiple times, until finally it gives up. When it
gives up it returns to the original lastcurrblkno to see what's up
with it -- it'll need to get that page's now-current left sibling
link, beginning the whole process anew (by looping back to the start
of _bt_lock_and_validate_left).

An isolation test that uses injection points seems like a natural
approach (actually, it's likely the *only* approach that can produce a
maintainable test). One session should perform a backwards scan that
is forced to wait at the top of _bt_lock_and_validate_left. Another
session then inserts enough index tuples to cause several leaf page
splits that'll make life harder for the backwards scan. Finally, we
wake the backwards scan session, and get the desired test coverage;
it'll reliably have to do things the hard way.

I have all this working already. However, there are certain aspects of
the isolation test (and the injection points themselves) that seem
unsatisfactory. I could really use a way to make the wait within
_bt_lock_and_validate_left happen no more than once, in a way that's
directly under the control of my isolation test.

Any test like this needs to account for various implementation
details. For example, if the test needs to work with non-standard
BLCKSZ (which seems like a good idea), then the number of page splits
required might be greater or fewer than with standard BLCKSZ. This
shouldn't really be a problem; it necessitates inserting more data
than is strictly necessary most of the time: there needs to be some
margin or error to account for these effects. But that shouldn't be
much of a problem.

However, as things stand, this does create a problem: accounting for
these implementation details in this manner makes the number of times
that the injection point is reached unpredictable/hard to control. I
only want the wait within _bt_lock_and_validate_left to happen once,
before the concurrent inserts take place from within the other
isolation test session. I don't want any possible future calls to
_bt_lock_and_validate_left (that come after the other session is done)
to wait at all -- that'll make the backwards scan test session wait
forever (since no other session will be around to wake it up a second
or a third time).

I have successfully simulated "wait no more than once" by adding C
code to nbtree that looks like this:

            if (likely(!P_ISDELETED(opaque) &&
                       opaque->btpo_next == lastcurrblkno))
            {
                /* Found desired page, return it */
#ifdef USE_INJECTION_POINTS
                if (IS_INJECTION_POINT_ATTACHED("lock-and-validate-left"))
                {
                    InjectionPointDetach("lock-and-validate-left");
                }
#endif

But that's pretty ugly and non-modular. There are multiple return
paths within _bt_lock_and_validate_left, and I'd probably need to
cover them all with similar code. That seems borderline unacceptable.

It would be far preferable if I could just use some built-in way of
waiting exactly once, that can be used directly from SQL, through the
injection_points extension. That would allow me to write the isolation
test without having to add any code to nbtsearch.c that knows all
about the requirements of one particular isolation test.

Thanks
-- 
Peter Geoghegan



On Tue, Jul 08, 2025 at 11:21:20AM -0400, Peter Geoghegan wrote:
> On Mon, Jul 7, 2025 at 9:53 PM Noah Misch <noah@leadboat.com> wrote:
> > If it continues to be a problem, consider sharing the patch that's behaving
> > this way for you.
> 
> Attached patch shows my current progress with the isolation test.

Nothing looks suspicious in that code.

> I also attach diff output of the FreeBSD failures. Notice that the
> line "backwards_scan_session: NOTICE:  notice triggered for injection
> point lock-and-validate-new-lastcurrblkno" is completely absent from
> the test output. This absence indicates that the desired test coverage
> is totally missing on FreeBSD -- so the test is completely broken on
> FreeBSD.
> 
> I ran "meson test --suite setup --suite nbtree -q --print-errorlogs"
> in a loop 500 times on my Debian workstation without seeing any
> failures. Seems stable there. Whereas the FreeBSD target hasn't even
> passed once out of more than a dozen attempts. Seems to be reliably
> broken on FreeBSD.

> -backwards_scan_session: NOTICE:  notice triggered for injection point lock-and-validate-new-lastcurrblkno
> +ERROR:  could not find injection point lock-and-validate-left to wake up

Agreed.  Perhaps it's getting a different plan type on FreeBSD, so it's not
even reaching the INJECTION_POINT() calls?  That would be consistent with
these output diffs having no ERROR from attach/detach.  Some things I'd try:

- Add a plain elog(WARNING) before each INJECTION_POINT()
- Use debug_print_plan or similar to confirm the plan type



Re: Can can I make an injection point wait occur no more than once?

From
Peter Geoghegan
Date:
On Tue, Jul 8, 2025 at 11:04 PM Noah Misch <noah@leadboat.com> wrote:
> > -backwards_scan_session: NOTICE:  notice triggered for injection point lock-and-validate-new-lastcurrblkno
> > +ERROR:  could not find injection point lock-and-validate-left to wake up
>
> Agreed.  Perhaps it's getting a different plan type on FreeBSD, so it's not
> even reaching the INJECTION_POINT() calls?  That would be consistent with
> these output diffs having no ERROR from attach/detach.  Some things I'd try:
>
> - Add a plain elog(WARNING) before each INJECTION_POINT()
> - Use debug_print_plan or similar to confirm the plan type

I added a pair of elog(WARNING) traces before each of the new
INJECTION_POINT() calls.

When I run the test against the FreeBSD CI target with this new
instrumentation, I see a WARNING that indicates that we've reached the
top of _bt_lock_and_validate_left as expected. I don't see any second
WARNING indicating that we've taken _bt_lock_and_validate_left's
unhappy path, though (and the test still fails). This doesn't look
like an issue with the planner.

I attach the relevant regression test output, that shows all this.

Thanks
--
Peter Geoghegan

Attachment