RE: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
Date
Msg-id OS0PR01MB5716AB0BE911D79120DDE28D94B92@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber  (shveta malik <shveta.malik@gmail.com>)
Responses Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
List pgsql-hackers
On Thursday, August 8, 2024 6:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Thu, Aug 8, 2024 at 12:03 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Aug 8, 2024 at 10:37 AM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > ...
> > >
> > > An easiest fix is to reset session replication origin before calling
> > > the RecordTransactionAbort(). I think this can happen when 1)
> > > LogicalRepApplyLoop() raises an ERROR or 2) apply worker exits.
> Attached patch can fix the issue on HEAD.
> > >
> >
> > Few comments:
> > =============
> > *
> > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos)
> >   }
> >   PG_CATCH();
> >   {
> > + /*
> > + * Reset the origin data to prevent the advancement of origin
> > + progress
> > + * if the transaction failed to apply.
> > + */
> > + replorigin_session_origin = InvalidRepOriginId;
> > + replorigin_session_origin_lsn = InvalidXLogRecPtr;
> > + replorigin_session_origin_timestamp = 0;
> >
> > Can't we call replorigin_reset() instead here?
> >
> > *
> > + /*
> > + * Register a callback to reset the origin state before aborting the
> > + * transaction in ShutdownPostgres(). This is to prevent the
> > + advancement
> > + * of origin progress if the transaction failed to apply.
> > + */
> > + before_shmem_exit(replorigin_reset, (Datum) 0);
> >
> > I think we need this despite resetting the origin-related variables in
> > PG_CATCH block to handle FATAL error cases, right? If so, can we use
> > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?
> 
> +1
> 
> Basic tests work fine on this patch. Just thinking out loud,
> SetupApplyOrSyncWorker() is called for table-sync worker as well and IIUC
> tablesync worker does not deal with 2PC txns. So do we even need to register
> replorigin_reset() for tablesync worker as well? If we may hit such an issue in
> general, then perhaps we need it in table-sync worker  otherwise not.  It
> needs some investigation. Thoughts?

I think this is a general issue that can occur not only due to 2PC. IIUC, this
problem should arise if any ERROR arises after setting the
replorigin_session_origin_lsn but before the CommitTransactionCommand is
completed. If so, I think we should register it for tablesync worker as well.

Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: Kirill Reshke
Date:
Subject: Re: Incremental View Maintenance, take 2
Next
From: Peter Eisentraut
Date:
Subject: Re: Enable data checksums by default