Re: Logical Replication of sequences - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Logical Replication of sequences |
Date | |
Msg-id | CAA4eK1+y_q1kUMVpme-trpPYXEUTjRhKfJYb4G8xxi=FfpikSQ@mail.gmail.com Whole thread Raw |
In response to | Re: Logical Replication of sequences (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Logical Replication of sequences
|
List | pgsql-hackers |
On Wed, Jun 12, 2024 at 10:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jun 11, 2024 at 7:36 PM vignesh C <vignesh21@gmail.com> wrote: > > > > 1) CREATE PUBLICATION syntax enhancement: > > CREATE PUBLICATION ... FOR ALL SEQUENCES; > > The addition of a new column titled "all sequences" in the > > pg_publication system table will signify whether the publication is > > designated as all sequences publication or not. > > > > The first approach sounds like we don't create entries for sequences > in pg_subscription_rel. In this case, how do we know all sequences > that we need to refresh when executing the REFRESH PUBLICATION > SEQUENCES command you mentioned below? > As per my understanding, we should be creating entries for sequences in pg_subscription_rel similar to tables. The difference would be that we won't need all the sync_states (i = initialize, d = data is being copied, f = finished table copy, s = synchronized, r = ready) as we don't need any synchronization with apply workers. > > 2) CREATE SUBSCRIPTION -- no syntax change. > > Upon creation of a subscription, the following additional steps will > > be managed by the subscriber: > > i) The subscriber will retrieve the list of sequences associated with > > the subscription's publications. > > ii) For each sequence: a) Retrieve the sequence value from the > > publisher by invoking the pg_sequence_state function. b) Set the > > sequence with the value obtained from the publisher. iv) Once the > > subscription creation is completed, all sequence values will become > > visible at the subscriber's end. > > Sequence values are always copied from the publisher? or does it > happen only when copy_data = true? > It is better to do it when "copy_data = true" to keep it compatible with the table's behavior. > > > > An alternative design approach could involve retrieving the sequence > > list from the publisher during subscription creation and inserting the > > sequences with an "init" state into the pg_subscription_rel system > > table. These tasks could be executed by a single sequence sync worker, > > which would: > > i) Retrieve the list of sequences in the "init" state from the > > pg_subscription_rel system table. > > ii) Initiate a transaction. > > iii) For each sequence: a) Obtain the sequence value from the > > publisher by utilizing the pg_sequence_state function. b) Update the > > sequence with the value obtained from the publisher. > > iv) Commit the transaction. > > > > The benefit with the second approach is that if there are large number > > of sequences, the sequence sync can be enhanced to happen in parallel > > and also if there are any locks held on the sequences in the > > publisher, the sequence worker can wait to acquire the lock instead of > > blocking the whole create subscription command which will delay the > > initial copy of the tables too. > > I prefer to have separate workers to sync sequences. > +1. > Probably we can > start with a single worker and extend it to have multiple workers. Yeah, starting with a single worker sounds good for now. Do you think we should sync all the sequences in a single transaction or have some threshold value above which a different transaction would be required or maybe a different sequence sync worker altogether? Now, having multiple sequence-sync workers requires some synchronization so that only a single worker is allocated for one sequence. The simplest thing is to use a single sequence sync worker that syncs all sequences in one transaction but with a large number of sequences, it could be inefficient. OTOH, I am not sure if it would be a problem in reality. > > BTW > the sequence-sync worker will be taken from > max_sync_workers_per_subscription pool? > I think so. > Or yet another idea I came up with is that a tablesync worker will > synchronize both the table and sequences owned by the table. That is, > after the tablesync worker caught up with the apply worker, the > tablesync worker synchronizes sequences associated with the target > table as well. One benefit would be that at the time of initial table > sync being completed, the table and its sequence data are consistent. > As soon as new changes come to the table, it would become inconsistent > so it might not be helpful much, though. Also, sequences that are not > owned by any table will still need to be synchronized by someone. > The other thing to consider in this idea is that we somehow need to distinguish the sequences owned by the table. > > > > 3) Refreshing the sequence can be achieved through the existing > > command: ALTER SUBSCRIPTION ... REFRESH PUBLICATION(no syntax change > > here). > > The subscriber identifies stale sequences, meaning sequences present > > in pg_subscription_rel but absent from the publication, and removes > > them from the pg_subscription_rel system table. The subscriber also > > checks for newly added sequences in the publisher and synchronizes > > their values from the publisher using the steps outlined in the > > subscription creation process. It's worth noting that previously > > synchronized sequences won't be synchronized again; the sequence sync > > will occur solely for the newly added sequences. > > > > 4) Introducing a new command for refreshing all sequences: ALTER > > SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES. > > The subscriber will remove stale sequences and add newly added > > sequences from the publisher. Following this, it will re-synchronize > > the sequence values for all sequences in the updated list from the > > publisher, following the steps outlined in the subscription creation > > process. > > The difference between 3) and 4) is whether or not to re-synchronize > the previously synchronized sequences. Do we really want to introduce > a new command for 4)? I felt that we can invent an option say > copy_all_sequence for the REFRESH PUBLICATION command to cover the 4) > case. > Yeah, that is also an option but it could confuse along with copy_data option. Say the user has selected copy_data = false but copy_all_sequences = true then the first option indicates to *not* copy the data of table and sequences and the second option indicates to copy the sequences data which sounds contradictory. The other idea is to have an option copy_existing_sequences (which indicates to copy existing sequence values) but that also has somewhat the same drawback as copy_all_sequences but to a lesser degree. > > > > 5) Incorporate the pg_sequence_state function to fetch the sequence > > value from the publisher, along with the page LSN. Incorporate > > SetSequence function, which will procure a new relfilenode for the > > sequence and set the new relfilenode with the specified value. This > > will facilitate rollback in case of any failures. > > Does it mean that we create a new relfilenode for every update of the value? > We need it for initial sync so that if there is an error both the sequence state in pg_subscription_rel and sequence values can be rolled back together. However, it is unclear whether we need to create a new relfilenode while copying existing sequences (say during ALTER SUBSCRIPTION .. REFRESH PUBLICATION SEQUENCES, or whatever command we decide)? Probably the answer lies in how we want to implement this command. If we want to copy all sequence values during the command itself then it is probably okay but if we want to handover this task to the sequence-sync worker then we need some state management and a new relfilenode so that on error both state and sequence values are rolled back. -- With Regards, Amit Kapila.
pgsql-hackers by date: