Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication - Mailing list pgsql-hackers

From Melih Mutlu
Subject Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date
Msg-id CAGPVpCQ8m0Jcy2gip=qLis7XPyO2bkb1ZMUBWSXQcyv3tEtz2Q@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Melih Mutlu <m.melihmutlu@gmail.com>)
List pgsql-hackers
Hi,

Melih Mutlu <m.melihmutlu@gmail.com>, 16 Şub 2023 Per, 14:37 tarihinde şunu yazdı:
I see that setting originname in the catalog before actually creating it causes issues. My concern with setting originname when setting the state to FINISHEDCOPY is that if worker waits until FINISHEDCOPY to update slot/origin name but it fails somewhere before reaching FINISHEDCOPY and after creating slot/origin, then this new created slot/origin will be left behind. It wouldn't be possible to find and drop them since their names are not stored in the catalog. Eventually, this might also cause hitting the max_replication_slots limit in case of such failures between origin creation and FINISHEDCOPY.

One fix I can think is to update the catalog right after creating a new origin. But this would also require commiting the current transaction to actually persist the originname. I guess this action of commiting the transaction in the middle of initial sync could hurt the copy process.

Here are more thoughts on this:
I still believe that updating originname when setting state to FINISHEDCOPY is not a good idea since any failure before FINISHEDCOPY prevent us to store originname in the catalog. If an origin or slot is not in the catalog, it's not easily possible to find and drop origins/slot that are left behind. And we definitely do not want to keep unnecessary origins/slots since we would hit max_replication_slots limit. 
It's better to be safe and update origin/slot names when setting state to DATASYNC. At this point, the worker must be sure that it writes correct origin/slot names into the catalog. 
Following part actually cleans up the catalog if a table is left behind in DATASYNC state and its slot and origin cannot be used for sync.

ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, prev_slotname, true);

StartTransactionCommand();
/* Replication origin might still exist. Try to drop */
replorigin_drop_by_name(originname, true, false);

/*
* Remove replication slot and origin name from the relation's
* catalog record
*/
UpdateSubscriptionRel(MyLogicalRepWorker->subid,
 MyLogicalRepWorker->relid,
 MyLogicalRepWorker->relstate,
 MyLogicalRepWorker->relstate_lsn,
 NULL,
 NULL);

The patch needs to refresh the origin name before it begins copying the table. It will try to read from the catalog but won't find any slot/origin since they are cleaned. Then, it will move on with the correct origin name which is the one created/will be created for the current sync worker.

I tested refetching originname and seems like it fixes the errors you reported.

Thanks,
--
Melih Mutlu
Microsoft

pgsql-hackers by date:

Previous
From: "wangw.fnst@fujitsu.com"
Date:
Subject: RE: Fix the description of GUC "max_locks_per_transaction" and "max_pred_locks_per_transaction" in guc_table.c
Next
From: Melih Mutlu
Date:
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication