Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort
Date
Msg-id CAPpHfdujoGu6AbL2NCzqCwZjV3tTxqTc0-qAM_zT3x2YHJzR-Q@mail.gmail.com
Whole thread
In response to Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort  (Xuneng Zhou <xunengzhou@gmail.com>)
Responses Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort
List pgsql-hackers
On Wed, May 6, 2026 at 11:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> On Wed, May 6, 2026 at 3:18 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
> >
> > Hi,
> >
> > I found a backend crash in WAIT FOR LSN when it is interrupted inside a
> > savepoint and the session then waits again.
> >
> > I tried to find if it was already reported, but could not find it, so, posting it.
> >
> > While navigating I noticed  WAIT FOR LSN cleanup is incomplete on
> > subtransaction abort. An interrupt such as statement_timeout while
> > waiting inside a savepoint leaves stale per-backend wait state,
> > causing a later WAIT FOR LSN in the same backend to violate
> > the wait-heap invariant and crash an assertion-enabled build.
> >
> > A small reproducer is:
> >
> >     BEGIN;
> >     SAVEPOINT s;
> >     SET statement_timeout = '100ms';
> >     WAIT FOR LSN '<future-lsn>' WITH (MODE 'primary_flush');
> >     ROLLBACK TO s;
> >     SET statement_timeout = 0;
> >     WAIT FOR LSN '0/0' WITH (MODE 'primary_flush', TIMEOUT '10ms', NO_THROW);
> >     COMMIT;
> >
> > where <future-lsn> can be generated with:
> >
> >     SELECT pg_current_wal_insert_lsn() + 10000000000;
> >
> > TRAP: failed Assert("!procInfo->inHeap"), File: "xlogwait.c"
> >
> > The attached patch mirrors the top-level abort cleanup by calling
> > WaitLSNCleanup() from AbortSubTransaction(), after LWLockReleaseAll().  It
> > also adds a TAP test to verify that WAIT FOR LSN can be reused in the same
> > backend after a statement_timeout and ROLLBACK TO SAVEPOINT.
> >
> > Thoughts?
>
> Thanks for reporting this. I agree with your analysis.

+1, thank you for reporting this.

> We need to add
> this clean-up into AbortSubTransaction. I've some comments on the
> patch:
>
> 1) Update the comment of WaitLSNCleanup
>
> Now the comment of this function says, "Clean up LSN waiters for
> exiting process." After this patch, this description would be too
> narrow because after a ROLLBACK TO SAVEPOINT, the same backend
> continues running and may issue another WAIT FOR LSN. How about
> changing it to something like:
>
> /*
>  * Clean up any LSN wait state for the current process.
>  */

I agree with this change.  Under detailed consideration, "existing
process" sounds confusing in this context even if it's called just
from AbortTransaction().

> 2) How about turning the SET statement_timeout = '100ms'; into more
> deterministic machinery used in other sections of the test file to
> ensure that the registration actually occurs by starting the waiter in
> a background psql session and waits until itself reports: wait_event =
> 'WaitForWalFlush' for that backend.
>
> 3) For the validation, consider replacing the fast success of '0/0'
> with the timeout of the unreachable LSN used earlier. This would
> reduce assumptions about the order between registration and the fast
> check being fixed.

I'm OK with these changes too, as they improve the test stability.

I'm going to push this if no objections.

------
Regards,
Alexander Korotkov
Supabase



pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: Proposal: Conflict log history table for Logical Replication
Next
From: Andrey Borodin
Date:
Subject: Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race