Re: State of pg_createsubscriber - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: State of pg_createsubscriber
Msg-id CAA4eK1KcprYdxWwJMoX7HvXcsuPV4ZHUKstdRY0NEOx=VtEJTA@mail.gmail.com
In response to Re: State of pg_createsubscriber  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
Responses Re: State of pg_createsubscriber
List pgsql-hackers
On Wed, May 22, 2024 at 2:45 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> > Just to summarize, apart from BF failures for which we had some
> > discussion, I could recall the following open points:
> >
> > 1. After promotion, the pre-existing replication objects should be
> > removed (either optionally or always), otherwise, it can lead to a new
> > subscriber not being able to restart or getting some unwarranted data.
> > [1][2].
> >
> I tried to reproduce the case and found a case where pre-existing
> replication objects can cause unwanted scenario:
>
> Suppose we have a setup of nodes N1, N2 and N3.
> N1 and N2 are in streaming replication where N1 is primary and N2 is standby.
> N3 and N1 are in logical replication where N3 is publisher and N1 is subscriber.
> The subscription created on N1 is replicated to N2 due to streaming replication.
>
> Now, after we run pg_createsubscriber on N2 and start the N2 server,
> we get the following logs repetitively:
> 2024-05-22 11:37:18.619 IST [27344] ERROR:  could not start WAL
> streaming: ERROR:  replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:18.622 IST [27317] LOG:  background worker "logical
> replication apply worker" (PID 27344) exited with exit code 1
> 2024-05-22 11:37:23.610 IST [27349] LOG:  logical replication apply
> worker for subscription "test1" has started
> 2024-05-22 11:37:23.624 IST [27349] ERROR:  could not start WAL
> streaming: ERROR:  replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:23.627 IST [27317] LOG:  background worker "logical
> replication apply worker" (PID 27349) exited with exit code 1
> 2024-05-22 11:37:28.616 IST [27382] LOG:  logical replication apply
> worker for subscription "test1" has started
>
> Note: 'test1' is the name of the subscription created on N1 initially
> and by default, slot name is the same as the subscription name.
>
> Once the N2 server is started after running pg_createsubscriber, the
> subscription that was earlier replicated by streaming replication will
> now try to connect to the publisher. Since the subscription name in N2
> is the same as the subscription created in N1, it will not be able to
> start a replication slot as the slot with the same name is active for
> logical replication between N3 and N1.
>
> Also, there would be a case where N1 becomes down for some time. Then
> in that case subscription on N2 will connect to the publication on N3
> and now data from N3 will be replicated to N2 instead of N1. And once
> N1 is up again, subscription on N1 will not be able to connect to
> publication on N3 as it is already connected to N2. This can lead to
> data inconsistency.
>

So, what shall we do about such cases? I think we can either remove
all pre-existing subscriptions and publications on the promoted
standby by default, or remove them only when a switch is given. If we
go with this idea, we might need to distinguish between pre-existing
subscriptions and the ones created by this tool.
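
To make the idea concrete, here is a rough sketch (not a proposed
patch) of what the cleanup could look like. It assumes the tool
already knows the names of the subscriptions it created and treats
every other subscription on the promoted standby as pre-existing. The
function names are hypothetical. The slot is disassociated before the
drop so that the slot still used by the original subscriber (N1 in the
example above) is not dropped on the publisher.

```python
from typing import Iterable, List


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def cleanup_statements(all_subscriptions: Iterable[str],
                       created_by_tool: Iterable[str]) -> List[str]:
    """Return SQL that removes subscriptions not created by the tool.

    all_subscriptions: names found on the promoted standby
                       (assumed to come from pg_subscription.subname).
    created_by_tool:   names the tool itself created (assumption: the
                       tool tracks these while it runs).
    """
    keep = set(created_by_tool)
    statements: List[str] = []
    for name in sorted(all_subscriptions):
        if name in keep:
            continue
        ident = quote_ident(name)
        # Stop the apply worker so it no longer tries to connect.
        statements.append(f"ALTER SUBSCRIPTION {ident} DISABLE;")
        # Detach the remote slot; it still belongs to the old subscriber.
        statements.append(f"ALTER SUBSCRIPTION {ident} SET (slot_name = NONE);")
        # Now the drop only removes the local catalog entry.
        statements.append(f"DROP SUBSCRIPTION {ident};")
    return statements


if __name__ == "__main__":
    # "test1" was replicated from N1; "new_sub" stands in for a
    # subscription created by the tool (hypothetical name).
    for stmt in cleanup_statements(["test1", "new_sub"], ["new_sub"]):
        print(stmt)
```

Whether this runs always or only with a switch is the open question
above; the sketch only shows that the two kinds of subscriptions have
to be told apart before anything is dropped.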

The other open point I remember is adding an option to this tool so
that users need not specify slots, pubs, etc. for each database. See
[1]. We can probably leave it out if it is not important, but we never
reached a conclusion on it.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Br96SyHYHx7BaTtGX0eviqpbbkSu01MEzwV5b2VFXP6g%40mail.gmail.com

--
With Regards,
Amit Kapila.