Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber |
Date | |
Msg-id | CAJpy0uAgu3Han=LBWtbE=3ftSAKDfWnCqeU_b2ErULO73cDz8w@mail.gmail.com Whole thread Raw |
In response to | Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber (shveta malik <shveta.malik@gmail.com>) |
List | pgsql-hackers |
On Mon, Aug 12, 2024 at 3:36 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Aug 9, 2024 at 2:39 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Amit, Shveta, Hou, > > > > Thanks for giving many comments! I've updated the patch. > > > > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos) > > > } > > > PG_CATCH(); > > > { > > > + /* > > > + * Reset the origin data to prevent the advancement of origin progress > > > + * if the transaction failed to apply. > > > + */ > > > + replorigin_session_origin = InvalidRepOriginId; > > > + replorigin_session_origin_lsn = InvalidXLogRecPtr; > > > + replorigin_session_origin_timestamp = 0; > > > > > > Can't we call replorigin_reset() instead here? > > > > I didn't use the function because arguments of calling function looked strange, > > but ideally I can. Fixed. > > > > > + /* > > > + * Register a callback to reset the origin state before aborting the > > > + * transaction in ShutdownPostgres(). This is to prevent the advancement > > > + * of origin progress if the transaction failed to apply. > > > + */ > > > + before_shmem_exit(replorigin_reset, (Datum) 0); > > > > > > I think we need this despite resetting the origin-related variables in > > > PG_CATCH block to handle FATAL error cases, right? If so, can we use > > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()? > > > > There are two reasons to add a shmem-exit callback. One is to support a FATAL, > > another one is to support the case that user does the shutdown request while > > applying changes. In this case, I think ShutdownPostgres() can be called so that > > the session origin may advance. > > Agree that we need the 'reset' during shutdown flow as well. Details at [1] > > > However, I think we cannot use PG_ENSURE_ERROR_CLEANUP()/PG_END_ENSURE_ERROR_CLEANUP > > macros here. According to codes, it assumes that any before-shmem callbacks are > > not registered within the block because the cleanup function is registered and canceled > > within the macro. LogicalRepApplyLoop() can register the function when > > it handles COMMIT PREPARED message so it breaks the rule. > > Yes, on reanalyzing, we can not use PG_ENSURE_ERROR_CLEANUP in this > flow due to the limitation of cancel_before_shmem_exit() that it can > cancel only the last registered callback, while in our flow we have > other callbacks also registered after we register our reset one. > > [1] > Shutdown analysis: > > I did a test where we make apply worker wait for say 0sec right after Correction here: 0sec -->10sec > it updates 'replorigin_session_origin_lsn' in > apply_handle_prepare_internal() (say it at code-point1). During this > wait, we triggered a subscriber shutdown.Under normal circumstances, > everything works fine: after the wait, the apply worker processes the > SIGTERM (via LogicalRepApplyLoop-->ProcessInterrupts()) only after the > prepare phase is complete, meaning the PREPARE LSN is flushed, and the > origin LSN is correctly advanced in EndPrepare() before the worker > shuts down. But, if we insert a LOG statement between code-point1 and > EndPrepare(), the apply worker processes the SIGTERM during the LOG > operation, as errfinish() triggers CHECK_FOR_INTERRUPTS at the end, > which causes the origin LSN to be incorrectly advanced during > shutdown. And thus the subsequent COMMIT PREPARED on the publisher > results in ERROR on subscriber; as the 'PREPARE' is lost on the > subscriber and is not resent by the publisher. ERROR: prepared > transaction with identifier "pg_gid_16403_757" does not exist > > A similar problem can also occur without introducing any additional > LOG statements, but by simply setting log_min_messages=debug5. This > causes the apply worker to output a few DEBUG messages upon receiving > a shutdown signal (after code-point1) before it reaches EndPrepare(). > As a result, it ends up processing the SIGTERM (during logging)and > invoking AbortOutOfAnyTransaction(), which incorrectly advances the > origin LSN. > > thanks > Shveta
pgsql-hackers by date: