Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber - Mailing list pgsql-hackers

From shveta malik
Subject Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
Date
Msg-id CAJpy0uAgu3Han=LBWtbE=3ftSAKDfWnCqeU_b2ErULO73cDz8w@mail.gmail.com
Whole thread Raw
In response to Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber  (shveta malik <shveta.malik@gmail.com>)
List pgsql-hackers
On Mon, Aug 12, 2024 at 3:36 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Aug 9, 2024 at 2:39 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Amit, Shveta, Hou,
> >
> > Thanks for giving many comments! I've updated the patch.
> >
> > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos)
> > >   }
> > >   PG_CATCH();
> > >   {
> > > + /*
> > > + * Reset the origin data to prevent the advancement of origin progress
> > > + * if the transaction failed to apply.
> > > + */
> > > + replorigin_session_origin = InvalidRepOriginId;
> > > + replorigin_session_origin_lsn = InvalidXLogRecPtr;
> > > + replorigin_session_origin_timestamp = 0;
> > >
> > > Can't we call replorigin_reset() instead here?
> >
> > I didn't use the function because arguments of calling function looked strange,
> > but ideally I can. Fixed.
> >
> > > + /*
> > > + * Register a callback to reset the origin state before aborting the
> > > + * transaction in ShutdownPostgres(). This is to prevent the advancement
> > > + * of origin progress if the transaction failed to apply.
> > > + */
> > > + before_shmem_exit(replorigin_reset, (Datum) 0);
> > >
> > > I think we need this despite resetting the origin-related variables in
> > > PG_CATCH block to handle FATAL error cases, right? If so, can we use
> > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?
> >
> > There are two reasons to add a shmem-exit callback. One is to support a FATAL,
> > another one is to support the case that user does the shutdown request while
> > applying changes. In this case, I think ShutdownPostgres() can be called so that
> > the session origin may advance.
>
> Agree that we need the 'reset' during shutdown flow as well. Details at [1]
>
> > However, I think we cannot use PG_ENSURE_ERROR_CLEANUP()/PG_END_ENSURE_ERROR_CLEANUP
> > macros here. According to codes, it assumes that any before-shmem callbacks are
> > not registered within the block because the cleanup function is registered and canceled
> > within the macro. LogicalRepApplyLoop() can register the function when
> > it handles COMMIT PREPARED message so it breaks the rule.
>
> Yes, on reanalyzing, we can not use PG_ENSURE_ERROR_CLEANUP in this
> flow due to the limitation of cancel_before_shmem_exit() that it can
> cancel only the last registered callback, while in our flow we have
> other callbacks also registered after we register our reset one.
>
> [1]
> Shutdown analysis:
>
> I did a test where we make apply worker wait for say 0sec right after

Correction here: 0sec -->10sec

> it updates 'replorigin_session_origin_lsn' in
> apply_handle_prepare_internal() (say it at code-point1). During this
> wait, we triggered a subscriber shutdown.Under normal circumstances,
> everything works fine: after the wait, the apply worker processes the
> SIGTERM (via LogicalRepApplyLoop-->ProcessInterrupts()) only after the
> prepare phase is complete, meaning the PREPARE LSN is flushed, and the
> origin LSN is correctly advanced in EndPrepare() before the worker
> shuts down. But, if we insert a LOG statement between code-point1 and
> EndPrepare(), the apply worker processes the SIGTERM during the LOG
> operation, as errfinish() triggers CHECK_FOR_INTERRUPTS at the end,
> which causes the origin LSN to be incorrectly advanced during
> shutdown. And thus the subsequent COMMIT PREPARED on the publisher
> results in ERROR on subscriber; as the 'PREPARE' is lost on the
> subscriber and is not resent by the publisher. ERROR:  prepared
> transaction with identifier "pg_gid_16403_757" does not exist
>
> A similar problem can also occur without introducing any additional
> LOG statements, but by simply setting log_min_messages=debug5. This
> causes the apply worker to output a few DEBUG messages upon receiving
> a shutdown signal (after code-point1) before it reaches EndPrepare().
> As a result, it ends up processing the SIGTERM (during logging)and
> invoking AbortOutOfAnyTransaction(), which incorrectly advances the
> origin LSN.
>
> thanks
> Shveta



pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
Next
From: Steven Niu
Date:
Subject: Re: [Patch] remove duplicated smgrclose