Thread: State of pg_createsubscriber

State of pg_createsubscriber

From
Tom Lane
Date:
I'm fairly disturbed about the readiness of pg_createsubscriber.
The 040_pg_createsubscriber.pl TAP test is moderately unstable
in the buildfarm [1], and there are various unaddressed complaints
at the end of the patch thread (pretty much everything below [2]).
I guess this is good enough to start beta with, but it's far from
being good enough to ship, IMO.  If there were active work going
on to fix these things, I'd feel better, but neither the C code
nor the test script have been touched since 1 April.

            regards, tom lane

[1]
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&branch=HEAD&member=&stage=pg_basebackupCheck&filter=Submit
[2]
https://www.postgresql.org/message-id/flat/3fa9ef0f-b277-4c13-850a-8ccc04de1406%40eisentraut.org#152dacecfc8f0cf08cbd8ecb79d6d38f



Re: State of pg_createsubscriber

From
"Euler Taveira"
Date:
On Sun, May 19, 2024, at 2:30 PM, Tom Lane wrote:
I'm fairly disturbed about the readiness of pg_createsubscriber.
The 040_pg_createsubscriber.pl TAP test is moderately unstable
in the buildfarm [1], and there are various unaddressed complaints
at the end of the patch thread (pretty much everything below [2]).
I guess this is good enough to start beta with, but it's far from
being good enough to ship, IMO.  If there were active work going
on to fix these things, I'd feel better, but neither the C code
nor the test script have been touched since 1 April.

My bad. :( I'll post patches soon to address all of the points.


--
Euler Taveira

Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Sun, May 19, 2024 at 11:20 PM Euler Taveira <euler@eulerto.com> wrote:
>
> On Sun, May 19, 2024, at 2:30 PM, Tom Lane wrote:
>
> I'm fairly disturbed about the readiness of pg_createsubscriber.
> The 040_pg_createsubscriber.pl TAP test is moderately unstable
> in the buildfarm [1], and there are various unaddressed complaints
> at the end of the patch thread (pretty much everything below [2]).
> I guess this is good enough to start beta with, but it's far from
> being good enough to ship, IMO.  If there were active work going
> on to fix these things, I'd feel better, but neither the C code
> nor the test script have been touched since 1 April.
>
>
> My bad. :( I'll post patches soon to address all of the points.
>

Just to summarize, apart from BF failures for which we had some
discussion, I could recall the following open points:

1. After promotion, the pre-existing replication objects should be
removed (either optionally or always), otherwise, it can lead to a new
subscriber not being able to restart or getting some unwarranted data.
[1][2].

2. Retaining synced slots on new subscribers can lead to unnecessary
WAL retention and dead rows [3].

3. We need to consider whether some of the messages displayed in
--dry-run mode are useful or not [4].

[1] - https://www.postgresql.org/message-id/CAA4eK1L6HOhT1qifTyuestXkPpkRwY9bOqFd4wydKsN6C3hePA%40mail.gmail.com
[2] -
https://www.postgresql.org/message-id/flat/3fa9ef0f-b277-4c13-850a-8ccc04de1406%40eisentraut.org#152dacecfc8f0cf08cbd8ecb79d6d38f
[3] - https://www.postgresql.org/message-id/CAA4eK1KdCb%2B5sjYu6qCMXXdCX1y_ihr8kFzMozq0%3DP%3DauYxgog%40mail.gmail.com
[4] -
https://www.postgresql.org/message-id/flat/3fa9ef0f-b277-4c13-850a-8ccc04de1406%40eisentraut.org#152dacecfc8f0cf08cbd8ecb79d6d38f

--
With Regards,
Amit Kapila.



Re: State of pg_createsubscriber

From
Michael Paquier
Date:
On Sun, May 19, 2024 at 02:49:22PM -0300, Euler Taveira wrote:
> My bad. :( I'll post patches soon to address all of the points.

Please note that I have added an open item pointing at this thread.
--
Michael

Attachment

Re: State of pg_createsubscriber

From
Shlok Kyal
Date:
> Just to summarize, apart from BF failures for which we had some
> discussion, I could recall the following open points:
>
> 1. After promotion, the pre-existing replication objects should be
> removed (either optionally or always), otherwise, it can lead to a new
> subscriber not being able to restart or getting some unwarranted data.
> [1][2].
>
I tried to reproduce the case and found a case where pre-existing
replication objects can cause unwanted scenario:

Suppose we have a setup of nodes N1, N2 and N3.
N1 and N2 are in streaming replication where N1 is primary and N2 is standby.
N3 and N1 are in logical replication where N3 is publisher and N1 is subscriber.
The subscription created on N1 is replicated to N2 due to streaming replication.

Now, after we run pg_createsubscriber on N2 and start the N2 server,
we get the following logs repetitively:
2024-05-22 11:37:18.619 IST [27344] ERROR:  could not start WAL
streaming: ERROR:  replication slot "test1" is active for PID 27202
2024-05-22 11:37:18.622 IST [27317] LOG:  background worker "logical
replication apply worker" (PID 27344) exited with exit code 1
2024-05-22 11:37:23.610 IST [27349] LOG:  logical replication apply
worker for subscription "test1" has started
2024-05-22 11:37:23.624 IST [27349] ERROR:  could not start WAL
streaming: ERROR:  replication slot "test1" is active for PID 27202
2024-05-22 11:37:23.627 IST [27317] LOG:  background worker "logical
replication apply worker" (PID 27349) exited with exit code 1
2024-05-22 11:37:28.616 IST [27382] LOG:  logical replication apply
worker for subscription "test1" has started

Note: 'test1' is the name of the subscription created on N1 initially
and by default, slot name is the same as the subscription name.

Once the N2 server is started after running pg_createsubscriber, the
subscription that was earlier replicated by streaming replication will
now try to connect to the publisher. Since the subscription name in N2
is the same as the subscription created in N1, it will not be able to
start a replication slot as the slot with the same name is active for
logical replication between N3 and N1.

Also, there would be a case where N1 becomes down for some time. Then
in that case subscription on N2 will connect to the publication on N3
and now data from N3 will be replicated to N2 instead of N1. And once
N1 is up again, subscription on N1 will not be able to connect to
publication on N3 as it is already connected to N2. This can lead to
data inconsistency.

This error did not happen before running pg_createsubscriber on
standby node N2, because there is no 'logical replication launcher'
process on standby node.

Thanks and Regards,
Shlok Kyal



Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Wed, May 22, 2024 at 2:45 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> > Just to summarize, apart from BF failures for which we had some
> > discussion, I could recall the following open points:
> >
> > 1. After promotion, the pre-existing replication objects should be
> > removed (either optionally or always), otherwise, it can lead to a new
> > subscriber not being able to restart or getting some unwarranted data.
> > [1][2].
> >
> I tried to reproduce the case and found a case where pre-existing
> replication objects can cause unwanted scenario:
>
> Suppose we have a setup of nodes N1, N2 and N3.
> N1 and N2 are in streaming replication where N1 is primary and N2 is standby.
> N3 and N1 are in logical replication where N3 is publisher and N1 is subscriber.
> The subscription created on N1 is replicated to N2 due to streaming replication.
>
> Now, after we run pg_createsubscriber on N2 and start the N2 server,
> we get the following logs repetitively:
> 2024-05-22 11:37:18.619 IST [27344] ERROR:  could not start WAL
> streaming: ERROR:  replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:18.622 IST [27317] LOG:  background worker "logical
> replication apply worker" (PID 27344) exited with exit code 1
> 2024-05-22 11:37:23.610 IST [27349] LOG:  logical replication apply
> worker for subscription "test1" has started
> 2024-05-22 11:37:23.624 IST [27349] ERROR:  could not start WAL
> streaming: ERROR:  replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:23.627 IST [27317] LOG:  background worker "logical
> replication apply worker" (PID 27349) exited with exit code 1
> 2024-05-22 11:37:28.616 IST [27382] LOG:  logical replication apply
> worker for subscription "test1" has started
>
> Note: 'test1' is the name of the subscription created on N1 initially
> and by default, slot name is the same as the subscription name.
>
> Once the N2 server is started after running pg_createsubscriber, the
> subscription that was earlier replicated by streaming replication will
> now try to connect to the publisher. Since the subscription name in N2
> is the same as the subscription created in N1, it will not be able to
> start a replication slot as the slot with the same name is active for
> logical replication between N3 and N1.
>
> Also, there would be a case where N1 becomes down for some time. Then
> in that case subscription on N2 will connect to the publication on N3
> and now data from N3 will be replicated to N2 instead of N1. And once
> N1 is up again, subscription on N1 will not be able to connect to
> publication on N3 as it is already connected to N2. This can lead to
> data inconsistency.
>

So, what shall we do about such cases?  I think by default we can
remove all pre-existing subscriptions and publications on the promoted
standby or instead we can remove them based on some switch. If we want
to go with this idea then we might need to distinguish the between
pre-existing subscriptions and the ones created by this tool.

The other case I remember adding an option in this tool was to avoid
specifying slots, pubs, etc. for each database. See [1]. We can
probably leave if the same is not important but we never reached the
conclusion of same.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Br96SyHYHx7BaTtGX0eviqpbbkSu01MEzwV5b2VFXP6g%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: State of pg_createsubscriber

From
Robert Haas
Date:
On Wed, May 22, 2024 at 7:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> So, what shall we do about such cases?  I think by default we can
> remove all pre-existing subscriptions and publications on the promoted
> standby or instead we can remove them based on some switch. If we want
> to go with this idea then we might need to distinguish the between
> pre-existing subscriptions and the ones created by this tool.
>
> The other case I remember adding an option in this tool was to avoid
> specifying slots, pubs, etc. for each database. See [1]. We can
> probably leave if the same is not important but we never reached the
> conclusion of same.

Another option that we should at least consider is "do nothing". In a
case like the one Shlok describes, how are we supposed to know what
the right thing to do is? Is it unreasonable to say that if the user
doesn't want those publications or subscriptions to exist, the user
should drop them?

Maybe it is unreasonable to say that, but it seems to me we should at
least talk about that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: State of pg_createsubscriber

From
Robert Haas
Date:
On Wed, May 22, 2024 at 9:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Another option that we should at least consider is "do nothing". In a
> case like the one Shlok describes, how are we supposed to know what
> the right thing to do is? Is it unreasonable to say that if the user
> doesn't want those publications or subscriptions to exist, the user
> should drop them?
>
> Maybe it is unreasonable to say that, but it seems to me we should at
> least talk about that.

As another option, maybe we could disable subscriptions, so that
nothing happens when the server is first started, and then the user
could decide after that what they want to do.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: State of pg_createsubscriber

From
Robert Haas
Date:
On Mon, May 20, 2024 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Just to summarize, apart from BF failures for which we had some
> discussion, I could recall the following open points:
>
> 1. After promotion, the pre-existing replication objects should be
> removed (either optionally or always), otherwise, it can lead to a new
> subscriber not being able to restart or getting some unwarranted data.
> [1][2].
>
> 2. Retaining synced slots on new subscribers can lead to unnecessary
> WAL retention and dead rows [3].
>
> 3. We need to consider whether some of the messages displayed in
> --dry-run mode are useful or not [4].

Amit, thanks for summarzing your understanding of the situation. Tom,
is this list complete, to your knowledge? The original thread is quite
complex and it's hard to pick out what the open items actually are.
:-(

I would like to see this open item broken up into multiple open items,
one per issue.

Link [4] goes to a message that doesn't seem to relate to --dry-run.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Wed, May 22, 2024 at 8:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 22, 2024 at 9:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Another option that we should at least consider is "do nothing". In a
> > case like the one Shlok describes, how are we supposed to know what
> > the right thing to do is? Is it unreasonable to say that if the user
> > doesn't want those publications or subscriptions to exist, the user
> > should drop them?
> >

This tool's intended purpose is to speed up the creation of a
subscriber node and for that one won't need the subscriptions already
present on the existing primary/publisher node from which the new
subscriber node is going to get the data. Additionally, the ERRORs
shown by Shlok will occur even during the command (performed by
pg_createsubscriber) which will probably be confusing.

> > Maybe it is unreasonable to say that, but it seems to me we should at
> > least talk about that.
>
> As another option, maybe we could disable subscriptions, so that
> nothing happens when the server is first started, and then the user
> could decide after that what they want to do.
>

Yeah, this would be worth considering. Note that even if the user
wants to retain such pre-existing subscriptions and enable them, they
need more steps than just to enable these to avoid duplicate data
issues or ERRORs as shown in Shlok's test.

So, we have the following options: (a) by default drop the
pre-existing subscriptions, (b) by default disable the pre-existing
subscriptions, and add a Note in the docs that users can take
necessary actions to enable or drop them. Now, we can even think of
providing a switch to retain the pre-existing subscriptions or
publications as the user may have some use case where it can be
helpful for her. For example, retaining publications can help in
creating a bi-directional setup.

--
With Regards,
Amit Kapila.



RE: State of pg_createsubscriber

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit, Robert,

> So, we have the following options: (a) by default drop the
> pre-existing subscriptions, (b) by default disable the pre-existing
> subscriptions, and add a Note in the docs that users can take
> necessary actions to enable or drop them. Now, we can even think of
> providing a switch to retain the pre-existing subscriptions or
> publications as the user may have some use case where it can be
> helpful for her. For example, retaining publications can help in
> creating a bi-directional setup.

Another point we should consider is the replication slot. If standby server has
had slots and they were forgotten, WAL files won't be discarded so disk full
failure will happen. v2-0004 proposed in [1] drops replication slots when their
failover option is true. This can partially solve the issue, but what should be
for other slots?

[1]: https://www.postgresql.org/message-id/CANhcyEV6q1Vhd37i1axUeScLi0UAGVxta1LDa0BV0Eh--TcPMg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Wed, May 22, 2024 at 8:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Just to summarize, apart from BF failures for which we had some
> > discussion, I could recall the following open points:
> >
> > 1. After promotion, the pre-existing replication objects should be
> > removed (either optionally or always), otherwise, it can lead to a new
> > subscriber not being able to restart or getting some unwarranted data.
> > [1][2].
> >
> > 2. Retaining synced slots on new subscribers can lead to unnecessary
> > WAL retention and dead rows [3].
> >
> > 3. We need to consider whether some of the messages displayed in
> > --dry-run mode are useful or not [4].
>
> Amit, thanks for summarzing your understanding of the situation. Tom,
> is this list complete, to your knowledge? The original thread is quite
> complex and it's hard to pick out what the open items actually are.
> :-(
>
> I would like to see this open item broken up into multiple open items,
> one per issue.
>
> Link [4] goes to a message that doesn't seem to relate to --dry-run.
>

Sorry, for the wrong link. See [1] for the correct link for --dry-run
related suggestion:

[1] https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: State of pg_createsubscriber

From
Robert Haas
Date:
On Thu, May 23, 2024 at 4:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sorry, for the wrong link. See [1] for the correct link for --dry-run
> related suggestion:
>
> [1] https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com

Yeah, those should definitely be fixed. Seems simple enough.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Mon, May 20, 2024 at 12:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 19, 2024 at 11:20 PM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Sun, May 19, 2024, at 2:30 PM, Tom Lane wrote:
> >
> > I'm fairly disturbed about the readiness of pg_createsubscriber.
> > The 040_pg_createsubscriber.pl TAP test is moderately unstable
> > in the buildfarm [1], and there are various unaddressed complaints
> > at the end of the patch thread (pretty much everything below [2]).
> > I guess this is good enough to start beta with, but it's far from
> > being good enough to ship, IMO.  If there were active work going
> > on to fix these things, I'd feel better, but neither the C code
> > nor the test script have been touched since 1 April.
> >
> >
> > My bad. :( I'll post patches soon to address all of the points.
> >
>
> Just to summarize, apart from BF failures for which we had some
> discussion, I could recall the following open points:
>
> 1. After promotion, the pre-existing replication objects should be
> removed (either optionally or always), otherwise, it can lead to a new
> subscriber not being able to restart or getting some unwarranted data.
> [1][2].
>
> 2. Retaining synced slots on new subscribers can lead to unnecessary
> WAL retention and dead rows [3].
>
> 3. We need to consider whether some of the messages displayed in
> --dry-run mode are useful or not [4].
>

The recent commits should silence BF failures and resolve point number
2. But we haven't done anything yet for 1 and 3. For 3, we have a
patch in email [1] (v3-0005-Avoid*) which can be reviewed and
committed but point number 1 needs discussion. Apart from that
somewhat related to point 1, Kuroda-San has raised a point in an email
[2] for replication slots. Shlok has presented a case in this thread
[3] where the problem due to point 1 can cause ERRORs or can cause
data inconsistency.

Now, the easiest way out here is that we accept the issues with the
pre-existing subscriptions and replication slots cases and just
document them for now with the intention to work on those in the next
version. OTOH, if there are no major challenges, we can try to
implement a patch for them as well as see how it goes.

[1] https://www.postgresql.org/message-id/CANhcyEWGfp7_AGg2zZUgJF_VYTCix01yeY8ZX9woz%2B03WCMPRg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/OSBPR01MB25525C17E2EF5FC81152F6C8F5F42%40OSBPR01MB2552.jpnprd01.prod.outlook.com
[3]
https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: State of pg_createsubscriber

From
Peter Eisentraut
Date:
On 18.06.24 05:59, Amit Kapila wrote:
>> 1. After promotion, the pre-existing replication objects should be
>> removed (either optionally or always), otherwise, it can lead to a new
>> subscriber not being able to restart or getting some unwarranted data.
>> [1][2].
>>
>> 2. Retaining synced slots on new subscribers can lead to unnecessary
>> WAL retention and dead rows [3].
>>
>> 3. We need to consider whether some of the messages displayed in
>> --dry-run mode are useful or not [4].
>>
> 
> The recent commits should silence BF failures and resolve point number
> 2. But we haven't done anything yet for 1 and 3. For 3, we have a
> patch in email [1] (v3-0005-Avoid*) which can be reviewed and
> committed but point number 1 needs discussion. Apart from that
> somewhat related to point 1, Kuroda-San has raised a point in an email
> [2] for replication slots. Shlok has presented a case in this thread
> [3] where the problem due to point 1 can cause ERRORs or can cause
> data inconsistency.
> 
> Now, the easiest way out here is that we accept the issues with the
> pre-existing subscriptions and replication slots cases and just
> document them for now with the intention to work on those in the next
> version. OTOH, if there are no major challenges, we can try to
> implement a patch for them as well as see how it goes.

This has gotten much too confusing to keep track of.

I suggest, if anyone has anything they want considered for 
pg_createsubscriber for PG17 at this point, they start a new thread, one 
for each topic, ideally with a subject like "pg_createsubscriber: 
Improve this thing", provide a self-contained description of the issue, 
and include a patch if one is available.




Re: State of pg_createsubscriber

From
"Euler Taveira"
Date:
On Tue, Jun 18, 2024, at 12:59 AM, Amit Kapila wrote:
On Mon, May 20, 2024 at 12:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 19, 2024 at 11:20 PM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Sun, May 19, 2024, at 2:30 PM, Tom Lane wrote:
> >
> > I'm fairly disturbed about the readiness of pg_createsubscriber.
> > The 040_pg_createsubscriber.pl TAP test is moderately unstable
> > in the buildfarm [1], and there are various unaddressed complaints
> > at the end of the patch thread (pretty much everything below [2]).
> > I guess this is good enough to start beta with, but it's far from
> > being good enough to ship, IMO.  If there were active work going
> > on to fix these things, I'd feel better, but neither the C code
> > nor the test script have been touched since 1 April.
> >
> >
> > My bad. :( I'll post patches soon to address all of the points.
> >
>
> Just to summarize, apart from BF failures for which we had some
> discussion, I could recall the following open points:
>
> 1. After promotion, the pre-existing replication objects should be
> removed (either optionally or always), otherwise, it can lead to a new
> subscriber not being able to restart or getting some unwarranted data.
> [1][2].
>
> 2. Retaining synced slots on new subscribers can lead to unnecessary
> WAL retention and dead rows [3].
>
> 3. We need to consider whether some of the messages displayed in
> --dry-run mode are useful or not [4].
>

The recent commits should silence BF failures and resolve point number
2. But we haven't done anything yet for 1 and 3. For 3, we have a
patch in email [1] (v3-0005-Avoid*) which can be reviewed and
committed but point number 1 needs discussion. Apart from that
somewhat related to point 1, Kuroda-San has raised a point in an email
[2] for replication slots. Shlok has presented a case in this thread
[3] where the problem due to point 1 can cause ERRORs or can cause
data inconsistency.

I read v3-0005 and it seems to silence (almost) all "write" messages. Does it
intend to avoid the misinterpretation that the dry run mode is writing
something? It is dry run mode! If I run a tool in dry run mode, I expect it to
execute some verifications and print useful messages so I can evaluate if it is
ok to run it. Maybe it is not your expectation for dry run mode.

I think if it not clear, let's inform that it changes nothing in dry run mode.

pg_createsubscriber: no modifications are done

as a first message in dry run mode. I agree with you when you pointed out that
some messages are misleading.

pg_createsubscriber: hint: If pg_createsubscriber fails after this
point, you must recreate the physical replica before continuing.

Maybe for this one, we omit the fake information, like:

pg_createsubscriber: setting the replication progress on database "postgres"

I will post a patch to address the messages once we agree what needs to be
changed.

Regarding 3, publications and subscriptions are ok to remove. You are not
allowed to create them on standby, hence, all publications and subscriptions
are streamed from primary. However, I'm wondering if you want to remove the
publications. Replication slots on a standby server are "invalidated" despite
of the wal_status is saying "reserved" (I think it is an oversight in the
design that implements slot invalidation), however, required WAL files have
already been removed because of pg_resetwal (see modify_subscriber_sysid()).
The scenario is to promote a standby server, run pg_resetwal on it and check
pg_replication_slots.

Do you have any other scenarios in mind?

Now, the easiest way out here is that we accept the issues with the
pre-existing subscriptions and replication slots cases and just
document them for now with the intention to work on those in the next
version. OTOH, if there are no major challenges, we can try to
implement a patch for them as well as see how it goes.

Agree.


--
Euler Taveira

Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Tue, Jun 18, 2024 at 12:41 PM Euler Taveira <euler@eulerto.com> wrote:
>
> On Tue, Jun 18, 2024, at 12:59 AM, Amit Kapila wrote:
>
> On Mon, May 20, 2024 at 12:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, May 19, 2024 at 11:20 PM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > On Sun, May 19, 2024, at 2:30 PM, Tom Lane wrote:
> > >
> > > I'm fairly disturbed about the readiness of pg_createsubscriber.
> > > The 040_pg_createsubscriber.pl TAP test is moderately unstable
> > > in the buildfarm [1], and there are various unaddressed complaints
> > > at the end of the patch thread (pretty much everything below [2]).
> > > I guess this is good enough to start beta with, but it's far from
> > > being good enough to ship, IMO.  If there were active work going
> > > on to fix these things, I'd feel better, but neither the C code
> > > nor the test script have been touched since 1 April.
> > >
> > >
> > > My bad. :( I'll post patches soon to address all of the points.
> > >
> >
> > Just to summarize, apart from BF failures for which we had some
> > discussion, I could recall the following open points:
> >
> > 1. After promotion, the pre-existing replication objects should be
> > removed (either optionally or always), otherwise, it can lead to a new
> > subscriber not being able to restart or getting some unwarranted data.
> > [1][2].
> >
> > 2. Retaining synced slots on new subscribers can lead to unnecessary
> > WAL retention and dead rows [3].
> >
> > 3. We need to consider whether some of the messages displayed in
> > --dry-run mode are useful or not [4].
> >
>
> The recent commits should silence BF failures and resolve point number
> 2. But we haven't done anything yet for 1 and 3. For 3, we have a
> patch in email [1] (v3-0005-Avoid*) which can be reviewed and
> committed but point number 1 needs discussion. Apart from that
> somewhat related to point 1, Kuroda-San has raised a point in an email
> [2] for replication slots. Shlok has presented a case in this thread
> [3] where the problem due to point 1 can cause ERRORs or can cause
> data inconsistency.
>
>
> I read v3-0005 and it seems to silence (almost) all "write" messages. Does it
> intend to avoid the misinterpretation that the dry run mode is writing
> something? It is dry run mode! If I run a tool in dry run mode, I expect it to
> execute some verifications and print useful messages so I can evaluate if it is
> ok to run it. Maybe it is not your expectation for dry run mode.
>

I haven't studied the patch so can't comment but the intention was to
not print some unrelated write messages. I have shared my observation
in the email [1].

> I think if it not clear, let's inform that it changes nothing in dry run mode.
>
> pg_createsubscriber: no modifications are done
>
> as a first message in dry run mode. I agree with you when you pointed out that
> some messages are misleading.
>
> pg_createsubscriber: hint: If pg_createsubscriber fails after this
> point, you must recreate the physical replica before continuing.
>
> Maybe for this one, we omit the fake information, like:
>
> pg_createsubscriber: setting the replication progress on database "postgres"
>

I think we don't need to display this message as we are not going to
do anything for this in the --dry-run mode. We can even move the
related code in (!dry_run) check.

> I will post a patch to address the messages once we agree what needs to be
> changed.
>

I suggest we can start a new thread with the messages shared in the
email [1] and your response for each one of those.

> Regarding 3, publications and subscriptions are ok to remove. You are not
> allowed to create them on standby, hence, all publications and subscriptions
> are streamed from primary. However, I'm wondering if you want to remove the
> publications.
>

I am not so sure of publications but we should remove subscriptions as
there are clear problems with those as shown by Shlok in this thread.

> Replication slots on a standby server are "invalidated" despite
> of the wal_status is saying "reserved" (I think it is an oversight in the
> design that implements slot invalidation), however, required WAL files have
> already been removed because of pg_resetwal (see modify_subscriber_sysid()).
> The scenario is to promote a standby server, run pg_resetwal on it and check
> pg_replication_slots.
>

Ideally, invalidated slots shouldn't create any problems but it is
better that we discuss this also as a separate problem in new thread.

> Do you have any other scenarios in mind?
>

No, so we have three issues to discuss (a) some unwarranted messages
in --dry-run mode; (b) whether to remove pre-existing subscriptions
during conversion; (c) whether to remove pre-existing replication
slots.

Would you like to start three new threads for each of these or would
you like Kuroda-San or me to start some or all?

[1] - https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Tue, Jun 18, 2024 at 12:13 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 18.06.24 05:59, Amit Kapila wrote:
> >> 1. After promotion, the pre-existing replication objects should be
> >> removed (either optionally or always), otherwise, it can lead to a new
> >> subscriber not being able to restart or getting some unwarranted data.
> >> [1][2].
> >>
> >> 2. Retaining synced slots on new subscribers can lead to unnecessary
> >> WAL retention and dead rows [3].
> >>
> >> 3. We need to consider whether some of the messages displayed in
> >> --dry-run mode are useful or not [4].
> >>
> >
> > The recent commits should silence BF failures and resolve point number
> > 2. But we haven't done anything yet for 1 and 3. For 3, we have a
> > patch in email [1] (v3-0005-Avoid*) which can be reviewed and
> > committed but point number 1 needs discussion. Apart from that
> > somewhat related to point 1, Kuroda-San has raised a point in an email
> > [2] for replication slots. Shlok has presented a case in this thread
> > [3] where the problem due to point 1 can cause ERRORs or can cause
> > data inconsistency.
> >
> > Now, the easiest way out here is that we accept the issues with the
> > pre-existing subscriptions and replication slots cases and just
> > document them for now with the intention to work on those in the next
> > version. OTOH, if there are no major challenges, we can try to
> > implement a patch for them as well as see how it goes.
>
> This has gotten much too confusing to keep track of.
>
> I suggest, if anyone has anything they want considered for
> pg_createsubscriber for PG17 at this point, they start a new thread, one
> for each topic, ideally with a subject like "pg_createsubscriber:
> Improve this thing", provide a self-contained description of the issue,
> and include a patch if one is available.
>

Fair enough. In my mind, we have three pending issues to discuss and I
have responded to an email to see if Euler can start individual
threads for those, otherwise, I'll do it.

We can close the open item pointing to this thread.

--
With Regards,
Amit Kapila.



Re: State of pg_createsubscriber

From
Robert Haas
Date:
On Tue, Jun 18, 2024 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> We can close the open item pointing to this thread.

Done, and for the record I also asked for the thread to be split, back
on May 22.

IMHO, we shouldn't add open items pointing to general complaints like
the one that started this thread. Open items need to be specific and
actionable.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: State of pg_createsubscriber

From
"Euler Taveira"
Date:
On Wed, Jun 19, 2024, at 12:51 AM, Amit Kapila wrote:
Ideally, invalidated slots shouldn't create any problems but it is
better that we discuss this also as a separate problem in new thread.

Ok.

> Do you have any other scenarios in mind?
>

No, so we have three issues to discuss (a) some unwarranted messages
in --dry-run mode; (b) whether to remove pre-existing subscriptions
during conversion; (c) whether to remove pre-existing replication
slots.

Would you like to start three new threads for each of these or would
you like Kuroda-San or me to start some or all?

I will open new threads soon if you don't.


--
Euler Taveira

Re: State of pg_createsubscriber

From
Amit Kapila
Date:
On Thu, Jun 20, 2024 at 2:35 AM Euler Taveira <euler@eulerto.com> wrote:
>
> I will open new threads soon if you don't.
>

Okay, thanks. I'll wait for you to start new threads and then we can
discuss the respective problems in those threads.

--
With Regards,
Amit Kapila.