Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication - Mailing list pgsql-hackers
From: Melanie Plageman
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date:
Msg-id: CAAKRu_YKGyF+svRQqe1th-mG9xLdzneWgh9H1z1DtypBkawkkw@mail.gmail.com
In response to: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication (Melih Mutlu <m.melihmutlu@gmail.com>)
Responses:
  Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
  Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
List: pgsql-hackers
On Wed, Feb 22, 2023 at 8:04 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>
> Hi Wang,
>
> Thanks for reviewing.
> Please see updated patches. [1]

This is cool! Thanks for working on this. I had a chance to review your
patchset and I had some thoughts and questions.

I notice that you've added a new user-facing option to make a snapshot.
I think functionality to independently make a snapshot for use elsewhere
has been discussed in the past for the implementation of different
features (e.g. [1] for pg_dump, though I think they ended up using
replication slots for that?), but I'm not quite sure I understand all
the implications of providing a user-visible create snapshot command.
Where can it be used? When can the snapshot be used? In your patch's
case, you know that you can use the snapshot you are creating, but I
wonder whether any restrictions or caveats are needed for its general
use.

For the worker reuse portion of the code, could it be a separate patch
in the set? It could be independently committable and would be easier
to review separately from the replication slot reuse.

Given table sync worker reuse, I think it is worth considering a more
explicit structure for the table sync worker code now -- i.e. having a
TableSyncWorkerMain() function. Though they still run
LogicalRepApplyLoop(), much of what else table sync workers do is
different from what the apply leader does.

The apply worker leader does:

  ApplyWorkerMain()
    walrcv_startstreaming()
    LogicalRepApplyLoop()
      launch table sync workers
    walrcv_endstreaming()
    proc_exit()

Table sync workers do:

  ApplyWorkerMain()
    start_table_sync()
      walrcv_create_slot()
      copy_table()
      walrcv_startstreaming()
    start_apply()
      LogicalRepApplyLoop()
    walrcv_endstreaming()
    proc_exit()

Now table sync workers need to loop back and do start_table_sync()
again for their new table. You have done this in ApplyWorkerMain(). But
I think this could be a separate main function, since their main loop
is now effectively totally different from an apply worker leader's.
Something like:

  TableSyncWorkerMain()
    TableSyncWorkerLoop()
      start_table_sync()
      walrcv_startstreaming()
      LogicalRepApplyLoop()
      walrcv_endstreaming()
      wait_for_new_rel_assignment()
    proc_exit()

You mainly have this structure, but it is a bit hidden, and some of the
functions that previously made sense for table sync workers and apply
workers to share don't really make sense to share anymore. The main
thing table sync workers and apply workers still share is the logic in
LogicalRepApplyLoop() (which table sync workers use when they do
catchup), so perhaps we should separate out the rest?

Also on the topic of worker reuse, I was wondering whether having
workers find their own next table assignment (as you have done in
process_syncing_tables_for_sync()) makes sense. As I understand it, the
way the whole system works with your patch applied is that the apply
leader loops through the subscription rel states and launches workers,
up to max_sync_workers_per_subscription, for candidate tables needing
sync. The apply leader will keep doing this, but none of those workers
will exit unless they die unexpectedly, so once it reaches
max_sync_workers_per_subscription it won't launch any more workers.
When one of these sync workers is finished with a table (it is synced
and caught up), it searches through the subscription rel states itself,
looking for a candidate table to work on.
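As I read it, that claim step ends up looking roughly like the fragment
below. This is only my sketch, not the patch's actual code:
sync_worker_pick_next_table() is a made-up name, while
table_states_not_ready, SubscriptionRelState, logicalrep_worker_find(),
and the SUBREL_STATE_* values are from the existing sources.

    /* Sketch only -- my reading of the patch's self-assignment step. */
    static bool
    sync_worker_pick_next_table(void)
    {
        ListCell   *lc;

        foreach(lc, table_states_not_ready)
        {
            SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);

            /* Skip tables that are already past the copy phase. */
            if (rstate->state == SUBREL_STATE_SYNCDONE ||
                rstate->state == SUBREL_STATE_READY)
                continue;

            LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
            if (logicalrep_worker_find(MyLogicalRepWorker->subid,
                                       rstate->relid, false) == NULL)
            {
                /* No other worker appears to be on this table; claim it. */
                MyLogicalRepWorker->relid = rstate->relid;
                MyLogicalRepWorker->relstate = rstate->state;
                MyLogicalRepWorker->relstate_lsn = rstate->lsn;
                LWLockRelease(LogicalRepWorkerLock);
                return true;
            }
            LWLockRelease(LogicalRepWorkerLock);
        }

        return false;
    }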
It seems it would be common for workers to be looking through the
subscription rel states at the same time, and I don't really see how
you prevent races over who claims a relation to work on. Though you
take a shared lock on LogicalRepWorkerLock, what if, in between
logicalrep_worker_find() and updating MyLogicalRepWorker->relid,
someone else also updates their relid to that relid? I don't think you
can update LogicalRepWorker->relid with only a shared lock.

I wonder whether it would not be better to have the apply leader, in
process_syncing_tables_for_apply(), first check for an existing worker
for the rel, then check for an available worker without an assignment,
and then launch a worker. Workers could then sleep after finishing
their assignment and wait for the leader to give them a new one.

Given an exclusive lock on LogicalRepWorkerLock, it may be okay for
workers to find their own table assignments from the subscription rel
states -- and perhaps this will be much more efficient from a CPU
perspective. It just feels a bit weird to have the code doing that
buried in process_syncing_tables_for_sync(). It seems like it should at
least return out to a main table sync worker loop in which workers loop
through finding a table and assigning it to themselves, syncing the
table, and catching the table up.
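To make that concrete, the sort of shape I am imagining is below. Again
only a rough sketch: TableSyncWorkerLoop() and
sync_worker_claim_next_table() are invented names, while
start_table_sync(), start_apply(), and proc_exit() are the existing
routines in worker.c.

    /* Hypothetical outline only -- not proposing this exact code. */
    static void
    TableSyncWorkerLoop(void)
    {
        for (;;)
        {
            XLogRecPtr  origin_startpos = InvalidXLogRecPtr;
            char       *myslotname = NULL;
            bool        assigned;

            /*
             * Find and claim the next table needing sync.  Holding
             * LogicalRepWorkerLock exclusively means two idle workers
             * cannot both conclude that the same relid is unclaimed.
             */
            LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
            assigned = sync_worker_claim_next_table();  /* invented */
            LWLockRelease(LogicalRepWorkerLock);

            if (!assigned)
                break;      /* or sleep and wait for the apply leader to
                             * hand this worker a new table */

            /* Same steps a sync worker performs today, just in a loop. */
            start_table_sync(&origin_startpos, &myslotname);
            start_apply(origin_startpos);
        }

        proc_exit(0);
    }

The point is just that the claim happens under an exclusive lock and
that the worker returns to a top-level loop between tables, rather than
having that logic buried in process_syncing_tables_for_sync().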
- Melanie

[1] https://www.postgresql.org/message-id/flat/CA%2BU5nMLRjGtpskUkYSzZOEYZ_8OMc02k%2BO6FDi4una3mB4rS1w%40mail.gmail.com#45692f75a1e79d4ce2d4f6a0e3ccb853