Re: Logical Replication of sequences - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Logical Replication of sequences
Date
Msg-id CAD21AoDLq4KJfLcQXMuVmtqWdDrV9K6XxAQZc3pyzxE0izCegA@mail.gmail.com
Whole thread Raw
In response to Re: Logical Replication of sequences  (vignesh C <vignesh21@gmail.com>)
Responses Re: Logical Replication of sequences
List pgsql-hackers
On Tue, Jun 11, 2024 at 7:36 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 11 Jun 2024 at 12:38, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jun 11, 2024 at 12:25 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Mon, 10 Jun 2024 at 14:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 10, 2024 at 12:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jun 10, 2024 at 3:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Are you imagining the behavior for sequences associated with tables
> > > > > > > differently than the ones defined by the CREATE SEQUENCE .. command? I
> > > > > > > was thinking that users would associate sequences with publications
> > > > > > > similar to what we do for tables for both cases. For example, they
> > > > > > > need to explicitly mention the sequences they want to replicate by
> > > > > > > commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
> > > > > > > PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> > > > > > > SEQUENCES IN SCHEMA sch1;
> > > > > > >
> > > > > > > In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> > > > > > > should copy both the explicitly defined sequences and sequences
> > > > > > > defined with the tables. Do you think a different variant for just
> > > > > > > copying sequences implicitly associated with tables (say for identity
> > > > > > > columns)?
> > > > > >
> > > > > > Oh, I was thinking that your proposal was to copy literally all
> > > > > > sequences by REPLICA/REFRESH SEQUENCE command.
> > > > > >
> > > >
> > > > I am trying to keep the behavior as close to tables as possible.
> > > >
> > > > > > But it seems to make
> > > > > > sense to explicitly specify the sequences they want to replicate. It
> > > > > > also means that they can create a publication that has only sequences.
> > > > > > In this case, even if they create a subscription for that publication,
> > > > > > we don't launch any apply workers for that subscription. Right?
> > > > > >
> > > >
> > > > Right, good point. I had not thought about this.
> > > >
> > > > > > Also, given that the main use case (at least as the first step) is
> > > > > > version upgrade, do we really need to support SEQUENCES IN SCHEMA and
> > > > > > even FOR SEQUENCE?
> > > > >
> > > >
> > > > At the very least, we can split the patch to move these variants to a
> > > > separate patch. Once the main patch is finalized, we can try to
> > > > evaluate the remaining separately.
> > >
> > > I engaged in an offline discussion with Amit about strategizing the
> > > division of patches to facilitate the review process. We agreed on the
> > > following split: The first patch will encompass the setting and
> > > getting of sequence values (core sequence changes). The second patch
> > > will cover all changes on the publisher side related to "FOR ALL
> > > SEQUENCES." The third patch will address subscriber side changes aimed
> > > at synchronizing "FOR ALL SEQUENCES" publications. The fourth patch
> > > will focus on supporting "FOR SEQUENCE" publication. Lastly, the fifth
> > > patch will introduce support for  "FOR ALL SEQUENCES IN SCHEMA"
> > > publication.
> > >
> > > I will work on this and share an updated patch for the same soon.
> >
> > +1. Sounds like a good plan.
>
> Amit and I engaged in an offline discussion regarding the design and
> contemplated that it could be like below:
> 1) CREATE PUBLICATION syntax enhancement:
> CREATE PUBLICATION ... FOR ALL SEQUENCES;
> The addition of a new column titled "all sequences" in the
> pg_publication system table will signify whether the publication is
> designated as all sequences publication or not.
>

The first approach sounds like we don't create entries for sequences
in pg_subscription_rel. In this case, how do we know all sequences
that we need to refresh when executing the REFRESH PUBLICATION
SEQUENCES command you mentioned below?

> 2)  CREATE SUBSCRIPTION -- no syntax change.
> Upon creation of a subscription, the following additional steps will
> be managed by the subscriber:
> i) The subscriber will retrieve the list of sequences associated with
> the subscription's publications.
> ii) For each sequence: a) Retrieve the sequence value from the
> publisher by invoking the pg_sequence_state function. b) Set the
> sequence with the value obtained from the publisher. iv) Once the
> subscription creation is completed, all sequence values will become
> visible at the subscriber's end.

Sequence values are always copied from the publisher? or does it
happen only when copy_data = true?

>
> An alternative design approach could involve retrieving the sequence
> list from the publisher during subscription creation and inserting the
> sequences with an "init" state into the pg_subscription_rel system
> table. These tasks could be executed by a single sequence sync worker,
> which would:
> i) Retrieve the list of sequences in the "init" state from the
> pg_subscription_rel system table.
> ii) Initiate a transaction.
> iii) For each sequence: a) Obtain the sequence value from the
> publisher by utilizing the pg_sequence_state function. b) Update the
> sequence with the value obtained from the publisher.
> iv) Commit the transaction.
>
> The benefit with the second approach is that if there are large number
> of sequences, the sequence sync can be enhanced to happen in parallel
> and also if there are any locks held on the sequences in the
> publisher, the sequence worker can wait to acquire the lock instead of
> blocking the whole create subscription command which will delay the
> initial copy of the tables too.

I prefer to have separate workers to sync sequences. Probably we can
start with a single worker and extend it to have multiple workers. BTW
the sequence-sync worker will be taken from
max_sync_workers_per_subscription pool?

Or yet another idea I came up with is that a tablesync worker will
synchronize both the table and sequences owned by the table. That is,
after the tablesync worker caught up with the apply worker, the
tablesync worker synchronizes sequences associated with the target
table as well. One benefit would be that at the time of initial table
sync being completed, the table and its sequence data are consistent.
As soon as new changes come to the table, it would become inconsistent
so it might not be helpful much, though. Also, sequences that are not
owned by any table will still need to be synchronized by someone.

>
> 3) Refreshing the sequence can be achieved through the existing
> command: ALTER SUBSCRIPTION ... REFRESH PUBLICATION(no syntax change
> here).
> The subscriber identifies stale sequences, meaning sequences present
> in pg_subscription_rel but absent from the publication, and removes
> them from the pg_subscription_rel system table. The subscriber also
> checks for newly added sequences in the publisher and synchronizes
> their values from the publisher using the steps outlined in the
> subscription creation process. It's worth noting that previously
> synchronized sequences won't be synchronized again; the sequence sync
> will occur solely for the newly added sequences.
>
> 4) Introducing a new command for refreshing all sequences: ALTER
> SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES.
> The subscriber will remove stale sequences and add newly added
> sequences from the publisher. Following this, it will re-synchronize
> the sequence values for all sequences in the updated list from the
> publisher, following the steps outlined in the subscription creation
> process.

The difference between 3) and 4) is whether or not to re-synchronize
the previously synchronized sequences. Do we really want to introduce
a new command for 4)? I felt that we can invent an option say
copy_all_sequence for the REFRESH PUBLICATION command to cover the 4)
case.

>
> 5) Incorporate the pg_sequence_state function to fetch the sequence
> value from the publisher, along with the page LSN. Incorporate
> SetSequence function, which will procure a new relfilenode for the
> sequence and set the new relfilenode with the specified value. This
> will facilitate rollback in case of any failures.

Does it mean that we create a new relfilenode for every update of the value?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Conflict Detection and Resolution
Next
From: Dilip Kumar
Date:
Subject: Re: Logical Replication of sequences