Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Hsu, John
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id BF248F5F-013D-49B8-810D-14F620819869@amazon.com
In response to Re: Synchronizing slots from primary to standby  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
List pgsql-hackers
    > I might be missing something but isn't it okay even if the new primary
    > server is behind the subscribers? IOW, even if the two slots' LSNs (i.e.,
    > restart_lsn and confirm_flush_lsn) are behind the subscriber's remote
    > LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only
    > transactions that were committed after the remote_lsn. So the
    > subscriber can resume logical replication with the new primary without
    > any data loss.

    Maybe I'm misreading, but I thought the purpose of this was to make
    sure that the logical subscriber does not have data that has not been
    replicated to the new primary. The use-case I can think of is when
    synchronous_commit is enabled and a fail-over occurs. If we didn't
    have this set, isn't it possible that this logical subscriber has
    extra commits that aren't present on the newly promoted primary?
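
    A minimal way to check for that scenario after promotion (a sketch;
    assumes the subscriber tracks the old primary through a single
    replication origin):

        -- On the subscriber: the last commit LSN confirmed from the old primary
        SELECT external_id, remote_lsn FROM pg_replication_origin_status;

        -- On the promoted standby: how far WAL replay got before promotion
        SELECT pg_last_wal_replay_lsn();

        -- If remote_lsn is ahead of the replay LSN, the subscriber holds
        -- commits that never reached the new primary.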

    And sorry I accidentally started a new thread in my last reply. 
    Re-pasting some of my previous questions/comments:

    wait_for_standby_confirmation does not re-read standby_slot_names once
    it enters its loop, so a SIGHUP doesn't fix it. Similarly,
    synchronize_slot_names isn't updated once the worker is launched.
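
    For example, with the patch applied, I'd expect something like the
    following on the primary (a sketch using the patch's GUC; the slot
    names are hypothetical) to take effect on the next loop iteration,
    but it doesn't:

        -- replica_1 / replica_2 are hypothetical physical slot names
        ALTER SYSTEM SET standby_slot_names = 'replica_1, replica_2';
        SELECT pg_reload_conf();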

    If a logical slot was dropped on the writer, should the worker drop
    logical slots that it was previously synchronizing but that are no
    longer present? Or should we leave that to the user to manage? I'm
    trying to think of why users would want to sync logical slots to a
    reader but not have them dropped as well once they're no longer
    present.
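
    If it's left to the user, cleanup on the standby would presumably be a
    manual drop (a sketch; 'stale_slot' is a hypothetical slot name):

        -- On the standby, after noticing the slot no longer exists upstream
        SELECT pg_drop_replication_slot('stale_slot');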

    Is there a reason we're deciding to use one syncing worker per
    database instead of one general worker that syncs across all the
    databases? I imagine I'm missing something obvious here.

    As for how standby_slot_names should be configured, I'd prefer
    flexibility similar to what we have for synchronous_standby_names,
    since that seems the most analogous. It'd provide flexibility for
    failovers, which I imagine is the most common use-case.
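
    That is, something like the existing synchronous_standby_names
    grammar; the second statement below is only a hypothetical analogous
    spelling for standby_slot_names, not syntax from the patch:

        -- Existing syntax: wait for any one of the listed standbys
        ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3)';

        -- Hypothetical analogous form for slots
        -- ALTER SYSTEM SET standby_slot_names = 'ANY 1 (slot_a, slot_b, slot_c)';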

On 1/20/22, 9:34 PM, "Masahiko Sawada" <sawada.mshk@gmail.com> wrote:

    On Wed, Dec 15, 2021 at 7:13 AM Peter Eisentraut
    <peter.eisentraut@enterprisedb.com> wrote:
    >
    > On 31.10.21 11:08, Peter Eisentraut wrote:
    > > I want to reactivate $subject.  I took Petr Jelinek's patch from [0],
    > > rebased it, added a bit of testing.  It basically works, but as
    > > mentioned in [0], there are various issues to work out.
    > >
    > > The idea is that the standby runs a background worker to periodically
    > > fetch replication slot information from the primary.  On failover, a
    > > logical subscriber would then ideally find up-to-date replication slots
    > > on the new publisher and can just continue normally.
    >
    > > So, again, this isn't anywhere near ready, but there is already a lot
    > > here to gather feedback about how it works, how it should work, how to
    > > configure it, and how it fits into an overall replication and HA
    > > architecture.
    >
    > The second,
    > standby_slot_names, is set on the primary.  It holds back logical
    > replication until the listed physical standbys have caught up.  That
    > way, when failover is necessary, the promoted standby is not behind the
    > logical replication consumers.

    I might be missing something but isn't it okay even if the new primary
    server is behind the subscribers? IOW, even if the two slots' LSNs (i.e.,
    restart_lsn and confirm_flush_lsn) are behind the subscriber's remote
    LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only
    transactions that were committed after the remote_lsn. So the
    subscriber can resume logical replication with the new primary without
    any data loss.

    The new primary must not be ahead of the subscribers, because in that
    case it would forward the logical replication start LSN to the slot's
    confirm_flush_lsn. But that cannot happen, since the remote LSN of the
    subscriber's origin is always updated first, then the
    confirm_flush_lsn of the slot on the primary is updated, and then the
    confirm_flush_lsn of the slot on the standby is synchronized.

    Regards,

    --
    Masahiko Sawada
    EDB:  https://www.enterprisedb.com/



