Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date
Msg-id CAA4eK1Lo4A4Anyp07r+XF4zp7okfNtUiM7_=e_s7tUedUB3ejw@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Melih Mutlu <m.melihmutlu@gmail.com>)
Responses Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
List pgsql-hackers
On Wed, Jul 27, 2022 at 3:56 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>
> Hi Amit,
>
> I updated the patch in order to prevent the problems that might be caused by using different replication slots for
syncinga table.
 
> As suggested in previous emails, replication slot names are stored in the catalog. So slot names can be reached later
andit is ensured
 
> that same replication slot is used during tablesync step of a table.
>
> With the current version of the patch:
> -. "srrelslotname" column is introduced into pg_subscibtion_rel catalog. It stores the slot name for tablesync
>
> -. Tablesync worker logic is now as follows:
> 1. Tablesync worker is launched by apply worker for a table.
> 2. Worker generates a default replication slot name for itself. Slot name includes subid and worker pid for tracking
purposes.
> 3. If table has a slot name value in the catalog:
>
> i. If the table state is DATASYNC, drop the replication slot from the catalog and proceed tablesync with a new slot.
>
> ii. If the table state is FINISHEDCOPY, use the replicaton slot from the catalog, do not create a new slot.
>
> 4. Before worker moves to new table, drop any replication slot that are retrieved from the catalog and used.
>

Why after step 4, do you need to drop the replication slot? Won't just
clearing the required info from the catalog be sufficient?

> 5. In case of no table left to sync, drop the replication slot of that sync worker with worker pid if it exists.
(It'spossible that a sync worker do not create a replication slot for itself but uses slots read from the catalog in
eachiteration)
 
>
>
>> I think using worker_pid also has similar risks of mixing slots from
>> different workers because after restart same worker_pid could be
>> assigned to a totally different worker. Can we think of using a unique
>> 64-bit number instead? This will be allocated when each workers
>> started for the very first time and after that we can refer catalog to
>> find it as suggested in the idea below.
>
>
> I'm not sure how likely to have colliding pid's for different tablesync workers in the same subscription.
>

Hmm, I think even if there is an iota of a chance which I think is
there, we can't use worker_pid. Assume, that if the same worker_pid is
assigned to another worker once the worker using it got an error out,
the new worker will fail as soon as it will try to create a
replication slot.

> Though ,having pid in slot name makes it easier to track which slot belongs to which worker. That's why I kept using
pidin slot names.
 
> But I think it should be simple to switch to using a unique 64-bit number. So I can remove pid's from slot names, if
youthink that it would be better.
 
>

I feel it would be better or maybe we need to think of some other
identifier but one thing we need to think about before using a 64-bit
unique identifier here is how will we retrieve its last used value
after restart of server. We may need to store it in a persistent way
somewhere.
>>
>> We should use the same for the origin name as well.
>
>
> I did not really change anything related to origin names. Origin names are still the same and include relation id.
Whatdo you think would be an issue with origin names in this patch?
 
>

The problems will be similar to the slot name. The origin is used to
track the progress of replication, so, if we use the wrong origin name
after the restart, it can send the wrong start_streaming position to
the publisher.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Matthias van de Meent
Date:
Subject: Re: Maximize page freezing
Next
From: Zhang Mingli
Date:
Subject: Re: Patch to address creation of PgStat* contexts with null parent context