Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort - Mailing list pgsql-hackers

From Ayush Tiwari
Subject Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort
Date
Msg-id CAJTYsWUWPM2Fpv=vYXG=Yj085jaDd36hOu1=g6zdvOvVNgm7GQ@mail.gmail.com
Whole thread
In response to Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
Hi,

On Wed, 6 May 2026 at 15:13, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, May 6, 2026 at 11:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> On Wed, May 6, 2026 at 3:18 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
> >
> > Hi,
> >
> > I found a backend crash in WAIT FOR LSN when it is interrupted inside a
> > savepoint and the session then waits again.
> >
> > I tried to find if it was already reported, but could not find it, so, posting it.
> >
> > While navigating I noticed  WAIT FOR LSN cleanup is incomplete on
> > subtransaction abort. An interrupt such as statement_timeout while
> > waiting inside a savepoint leaves stale per-backend wait state,
> > causing a later WAIT FOR LSN in the same backend to violate
> > the wait-heap invariant and crash an assertion-enabled build.
> >
> > A small reproducer is:
> >
> >     BEGIN;
> >     SAVEPOINT s;
> >     SET statement_timeout = '100ms';
> >     WAIT FOR LSN '<future-lsn>' WITH (MODE 'primary_flush');
> >     ROLLBACK TO s;
> >     SET statement_timeout = 0;
> >     WAIT FOR LSN '0/0' WITH (MODE 'primary_flush', TIMEOUT '10ms', NO_THROW);
> >     COMMIT;
> >
> > where <future-lsn> can be generated with:
> >
> >     SELECT pg_current_wal_insert_lsn() + 10000000000;
> >
> > TRAP: failed Assert("!procInfo->inHeap"), File: "xlogwait.c"
> >
> > The attached patch mirrors the top-level abort cleanup by calling
> > WaitLSNCleanup() from AbortSubTransaction(), after LWLockReleaseAll().  It
> > also adds a TAP test to verify that WAIT FOR LSN can be reused in the same
> > backend after a statement_timeout and ROLLBACK TO SAVEPOINT.
> >
> > Thoughts?
>
> Thanks for reporting this. I agree with your analysis.

+1, thank you for reporting this.

> We need to add
> this clean-up into AbortSubTransaction. I've some comments on the
> patch:
>
> 1) Update the comment of WaitLSNCleanup
>
> Now the comment of this function says, "Clean up LSN waiters for
> exiting process." After this patch, this description would be too
> narrow because after a ROLLBACK TO SAVEPOINT, the same backend
> continues running and may issue another WAIT FOR LSN. How about
> changing it to something like:
>
> /*
>  * Clean up any LSN wait state for the current process.
>  */

I agree with this change.  Under detailed consideration, "existing
process" sounds confusing in this context even if it's called just
from AbortTransaction().

> 2) How about turning the SET statement_timeout = '100ms'; into more
> deterministic machinery used in other sections of the test file to
> ensure that the registration actually occurs by starting the waiter in
> a background psql session and waits until itself reports: wait_event =
> 'WaitForWalFlush' for that backend.
>
> 3) For the validation, consider replacing the fast success of '0/0'
> with the timeout of the unreachable LSN used earlier. This would
> reduce assumptions about the order between registration and the fast
> check being fixed.

I'm OK with these changes too, as they improve the test stability.

I'm going to push this if no objections.


Looks good to me. Thanks for the update and review!

Regards,
Ayush 

pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race