On Wed, Dec 15, 2021 at 7:13 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 31.10.21 11:08, Peter Eisentraut wrote:
> > I want to reactivate $subject. I took Petr Jelinek's patch from [0],
> > rebased it, added a bit of testing. It basically works, but as
> > mentioned in [0], there are various issues to work out.
> >
> > The idea is that the standby runs a background worker to periodically
> > fetch replication slot information from the primary. On failover, a
> > logical subscriber would then ideally find up-to-date replication slots
> > on the new publisher and can just continue normally.
>
> > So, again, this isn't anywhere near ready, but there is already a lot
> > here to gather feedback about how it works, how it should work, how to
> > configure it, and how it fits into an overall replication and HA
> > architecture.
>
> The second,
> standby_slot_names, is set on the primary. It holds back logical
> replication until the listed physical standbys have caught up. That
> way, when failover is necessary, the promoted standby is not behind the
> logical replication consumers.

I might be missing something, but isn't it okay even if the new primary
server is behind the subscribers? IOW, even if the two slot LSNs (i.e.,
restart_lsn and confirmed_flush_lsn) are behind the subscriber's remote
LSN (i.e., pg_replication_origin_status.remote_lsn), the subscriber
requests to start streaming from its remote_lsn, so the new primary
sends only the transactions committed after that point. The subscriber
can therefore resume logical replication with the new primary without
any data loss.
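
For illustration, here is a minimal way to look at the two positions
involved (the slot name 'mysub' is just a placeholder):

-- On the subscriber: where the apply worker will resume from.
SELECT external_id, remote_lsn FROM pg_replication_origin_status;

-- On the promoted standby: the synchronized slot's positions.
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots WHERE slot_name = 'mysub';

As long as remote_lsn >= confirmed_flush_lsn, streaming simply resumes
at remote_lsn and nothing is skipped.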
The new primary must not be ahead of the subscribers, though: in that
case the walsender forwards the logical replication start LSN requested
by the subscriber to the slot's confirmed_flush_lsn, silently skipping
the changes in between. But that cannot happen, since the remote LSN of
the subscriber's origin is always updated first, then the
confirmed_flush_lsn of the slot on the primary is updated, and only
then is the confirmed_flush_lsn of the slot on the standby
synchronized.
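
That ordering implies the invariant remote_lsn (subscriber) >=
confirmed_flush_lsn (primary) >= confirmed_flush_lsn (standby). A rough
sanity check after failover might look like the following, where the
LSN literal is a placeholder for the remote_lsn fetched from the
subscriber:

-- Run on the new primary; '0/3000060' stands in for the value of
-- pg_replication_origin_status.remote_lsn read on the subscriber.
SELECT pg_wal_lsn_diff('0/3000060', confirmed_flush_lsn) >= 0
       AS slot_not_ahead
FROM pg_replication_slots WHERE slot_name = 'mysub';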
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/