RE: Excessive number of replication slots for 12->14 logical replication - Mailing list pgsql-bugs

From: houzj.fnst@fujitsu.com
Subject: RE: Excessive number of replication slots for 12->14 logical replication
Msg-id: OS0PR01MB5716E128E78C6CECD15C718394429@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to: Re: Excessive number of replication slots for 12->14 logical replication (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: RE: Excessive number of replication slots for 12->14 logical replication
List: pgsql-bugs
On Saturday, September 10, 2022 5:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 26, 2022 at 7:04 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > Thanks for the testing. I'll push this sometime early next week (by
> > > Tuesday) unless Sawada-San or someone else has any comments on it.
> > >
> >
> > Pushed.
> 
> Tom reported buildfarm failures[1] and I've investigated the cause and
> concluded this commit is relevant.
> 
> In process_syncing_tables_for_sync(), we have the following code:
> 
>         UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>                                    MyLogicalRepWorker->relid,
>                                    MyLogicalRepWorker->relstate,
>                                    MyLogicalRepWorker->relstate_lsn);
> 
>         ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
>                                           MyLogicalRepWorker->relid,
>                                           originname,
>                                           sizeof(originname));
>         replorigin_session_reset();
>         replorigin_session_origin = InvalidRepOriginId;
>         replorigin_session_origin_lsn = InvalidXLogRecPtr;
>         replorigin_session_origin_timestamp = 0;
> 
>         /*
>          * We expect that origin must be present. The concurrent operations
>          * that remove the origin, like a refresh for the subscription, take
>          * an access exclusive lock on pg_subscription, which prevents the
>          * previous operation that updates the rel state to
>          * SUBREL_STATE_SYNCDONE from succeeding.
>          */
>         replorigin_drop_by_name(originname, false, false);
> 
>         /*
>          * End streaming so that LogRepWorkerWalRcvConn can be used to drop
>          * the slot.
>          */
>         walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli);
> 
>         /*
>          * Cleanup the tablesync slot.
>          *
>          * This has to be done after the data changes because otherwise if
>          * there is an error while doing the database operations we won't be
>          * able to roll back the dropped slot.
>          */
>         ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
>                                         MyLogicalRepWorker->relid,
>                                         syncslotname,
>                                         sizeof(syncslotname));
> 
> If the table sync worker errors out at walrcv_endstreaming(), we assumed that
> both dropping the replication origin and updating the relstate are rolled
> back, but that assumption was wrong: the replication origin is not dropped,
> yet its in-memory state is reset. Therefore, after the tablesync worker
> restarts, it starts logical replication from start point 0/0 and consequently
> ends up applying transactions that have already been applied.

Thanks for the analysis!

I think you are right. replorigin_drop_by_name() clears the remote_lsn/local_lsn
in shared memory, and that reset won't be rolled back if we then fail to drop
the origin.
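
To illustrate the non-transactional part, here is a small standalone C model
(toy code only, not server code; origin_row_exists and shmem_remote_lsn are
made-up names standing in for the pg_replication_origin row and the
shared-memory progress). It shows why an abort restores the catalog row but
not the remembered LSN:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Transactional state: the origin's row in pg_replication_origin. */
static bool origin_row_exists = true;
/* Non-transactional state: replication progress kept in shared memory. */
static XLogRecPtr shmem_remote_lsn = 0x16B3748;

int
main(void)
{
    /* "Begin transaction": only catalog changes can be undone on abort. */
    bool saved_origin_row = origin_row_exists;

    /*
     * replorigin_drop_by_name(): the catalog delete is transactional, but
     * the in-memory progress is reset in place.
     */
    origin_row_exists = false;
    shmem_remote_lsn = InvalidXLogRecPtr;

    /* walrcv_endstreaming() raises an ERROR, so the transaction aborts. */
    origin_row_exists = saved_origin_row;   /* catalog delete rolled back... */

    /*
     * ...but the remembered progress stays cleared, so the restarted
     * tablesync worker asks the publisher to stream from 0/0 and re-applies
     * transactions it has already applied.
     */
    printf("origin exists: %d, remote_lsn: %llX\n",
           (int) origin_row_exists, (unsigned long long) shmem_remote_lsn);
    return 0;
}

Running it prints "origin exists: 1, remote_lsn: 0", which is exactly the
state the restarted tablesync worker finds itself in.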

Per off-list discussion with Amit, to fix this problem I think we need to drop
the origin in the table sync worker after committing the transaction that sets
the relstate to SYNCDONE. That guarantees the worker won't be restarted even if
we fail to drop the origin. Besides, we need to add the origin drop code back
in the apply worker, in case the table sync worker failed to drop the origin
before it exited (which seems a rare case). I will share a patch if we agree on
this fix.
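
To make the intended ordering concrete, here is a rough sketch of how the tail
of process_syncing_tables_for_sync() could be rearranged (my outline only, not
the actual patch; the exact placement of CommitTransactionCommand() and the
missing_ok = true argument to replorigin_drop_by_name() are assumptions):

        /* 1. Set the rel state to SUBREL_STATE_SYNCDONE... */
        UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                                   MyLogicalRepWorker->relid,
                                   MyLogicalRepWorker->relstate,
                                   MyLogicalRepWorker->relstate_lsn);

        /*
         * 2. ...and commit it before touching the origin, so that even if
         * the origin drop below fails, the rel state is already SYNCDONE and
         * the tablesync worker will not be restarted.
         */
        CommitTransactionCommand();

        /* 3. Only now reset the session origin and drop it. */
        ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
                                          MyLogicalRepWorker->relid,
                                          originname,
                                          sizeof(originname));
        replorigin_session_reset();
        replorigin_drop_by_name(originname, true /* missing_ok */, false);

        /*
         * 4. In the apply worker, when it sees the relation in SYNCDONE,
         * also try to drop the origin with missing_ok = true, to clean up
         * if the tablesync worker exited before it could drop it.
         */

With this ordering, a failure while dropping the origin leaves at worst a stray
origin behind (which the apply worker can remove later) instead of a worker
that restarts and replays from 0/0.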

Best regards,
Hou zj

