Thread: Why is subscription/t/031_column_list.pl failing so much?

Why is subscription/t/031_column_list.pl failing so much?

From
Tom Lane
Date:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23

There are some further back too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24

but this definitely got way more common in the last few days.

Digging down into the logs, these all look pretty similar.  Somehow
things get into a state where replication connections fail with
the publisher reporting "publication does not exist":

2024-02-02 19:42:23.187 UTC [1631708][not initialized][:0] LOG:  connection received: host=[local]
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG:  connection authenticated: user="bf" method=trust (/home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/031_column_list/data/t_031_column_list_publisher_data/pgdata/pg_hba.conf:117)
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG:  replication connection authorized: user=bf application_name=sub1
2024-02-02 19:42:23.214 UTC [1631708][walsender][4/288:0] LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  received replication command: IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  acquired logical replication slot "sub1"
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] LOG:  starting logical decoding for slot "sub1"
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] DETAIL:  Streaming transactions committing after 0/15BCDD0, reading WAL from 0/15BCDD0.
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] LOG:  logical decoding found consistent point at 0/15BCDD0
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] DETAIL:  There are no running transactions.
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] ERROR:  publication "pub7" does not exist
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] CONTEXT:  slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] LOG:  released logical replication slot "sub1"
2024-02-02 19:42:23.834 UTC [1631708][walsender][:0] LOG:  disconnection: session time: 0:00:00.647 user=bf database=postgres host=[local]

and then we just repeat that until the test times out.  It fails at
different points in the test script (hence, different publication
names), but the pattern looks about the same.

I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing?  And more to the point, what's going wrong exactly?

I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.

            regards, tom lane



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Noah Misch
Date:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> If you look at the buildfarm's failures page and filter down to
> just subscriptionCheck failures, what you find is that all of the
> last 6 such failures are in 031_column_list.pl:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
> 
> There are some further back too:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
> 
> but this definitely got way more common in the last few days.

> I don't see anything that 031_column_list.pl is doing that is much
> different from other subscription tests, so why is it the only one
> failing?  And more to the point, what's going wrong exactly?

I don't know, but
https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
reported a replacement BgWriterDelay value reproducing it.  That hasn't
reproduced it in ~10 runs on my machine, though.

> I am suspicious that this somehow represents a failure of the
> historical catalog decoding logic, but I don't see how that theory
> explains this only breaking in one test script.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Noah Misch
Date:
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > If you look at the buildfarm's failures page and filter down to
> > just subscriptionCheck failures, what you find is that all of the
> > last 6 such failures are in 031_column_list.pl:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
> > 
> > There are some further back too:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
> > 
> > but this definitely got way more common in the last few days.
> 
> > I don't see anything that 031_column_list.pl is doing that is much
> > different from other subscription tests, so why is it the only one
> > failing?  And more to the point, what's going wrong exactly?
> 
> I don't know, but
> https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> reported a replacement BgWriterDelay value reproducing it.

Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.

> That hasn't reproduced it in ~10 runs on my machine, though.

After 207 successes, it did fail once for me.

> > I am suspicious that this somehow represents a failure of the
> > historical catalog decoding logic, but I don't see how that theory
> > explains this only breaking in one test script.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
>> If you look at the buildfarm's failures page and filter down to
>> just subscriptionCheck failures, what you find is that all of the
>> last 6 such failures are in 031_column_list.pl:
>> ...
>> I don't see anything that 031_column_list.pl is doing that is much
>> different from other subscription tests, so why is it the only one
>> failing?  And more to the point, what's going wrong exactly?

> I don't know, but
> https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> reported a replacement BgWriterDelay value reproducing it.  That hasn't
> reproduced it in ~10 runs on my machine, though.

Ah, thanks for that link.  I like the theory proposed in that thread
that the walsender is starting up at an LSN somewhere before where
the publication is created.  I'm tempted to add some more queries to
the test script to see if that can be proven.
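
For instance, something along these lines (just a sketch of the sort of
instrumentation I have in mind; the exact comparison and the 'sub1' slot name
are assumptions, not queries the test runs today):

# Capture the publisher's insert LSN right after the CREATE PUBLICATION of
# interest, and compare it with how far the subscription's slot has confirmed.
my $pub_create_lsn = $node_publisher->safe_psql('postgres',
	"SELECT pg_current_wal_insert_lsn()");
my $slot_lsn = $node_publisher->safe_psql('postgres',
	"SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'sub1'");
note "publication created at $pub_create_lsn, slot confirmed_flush_lsn at $slot_lsn";

If the slot's confirmed position trails the publication's creation LSN at the
moment ALTER SUBSCRIPTION ... SET PUBLICATION fires, the restarted walsender
has to decode changes from before the publication existed.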

I'm still wondering how come the failure seems to have suddenly gotten
way more common.  The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.

            regards, tom lane



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Alexander Lakhin
Date:
Hello Tom and Noah,

03.02.2024 04:24, Tom Lane wrote:
> I'm still wondering how come the failure seems to have suddenly gotten
> way more common.  The only changes that are in vaguely-related places
> and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
> don't see a connection.

I think the failure rate increased because tamandua, calliphoridae,
flaviventris, and kestrel were switched from make to meson recently.
The last `make` builds for them:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-01-31%2016%3A51%3A31
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-31%2016%3A51%3A38
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-01-31%2016%3A52%3A37
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2016%3A51%3A53

and since that switch the 031_column_list duration has increased significantly;
e.g., on the last tamandua `make` run it executed for 7 seconds, while
successful `meson` runs show much longer durations:
280/283 postgresql:subscription / subscription/031_column_list                    OK 38.27s   36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list                    OK 126.13s   36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list                    OK 31.93s   36 subtests passed
279/283 postgresql:subscription / subscription/031_column_list                    OK 99.76s   36 subtests passed

So, looking at the tamandua's failure log, I see:
2024-02-02 19:41:19.750 UTC [1579219][postmaster][:0] LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-12.3.0, 64-bit
...
2024-02-02 19:42:19.973 UTC [1629333][client backend][4/213:0] LOG:  statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7
2024-02-02 19:42:20.131 UTC [1625765][logical replication apply worker][3/122:0] LOG:  logical replication worker for subscription "sub1" will restart because of a parameter change
2024-02-02 19:42:20.137 UTC [1629333][client backend][:0] LOG:  disconnection: session time: 0:00:00.212 user=bf database=postgres host=[local]
2024-02-02 19:42:20.191 UTC [1629535][logical replication apply worker][3/124:0] LOG:  logical replication apply worker for subscription "sub1" has started
...
2024-02-02 19:42:20.445 UTC [1629535][logical replication apply worker][3/0:0] ERROR:  could not receive data from WAL stream: ERROR:  publication "pub7" does not exist
     CONTEXT:  slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698

(The interval between subscriber start and the error is ~ 4 * 15 seconds.)

Thus it still may be explained by bgwriter activity, though perhaps
autovacuum/checkpointer can add something as well.

Best regards,
Alexander



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Sat, Feb 3, 2024 at 12:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 03.02.2024 04:24, Tom Lane wrote:
> > I'm still wondering how come the failure seems to have suddenly gotten
> > way more common.  The only changes that are in vaguely-related places
> > and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
> > don't see a connection.
>
> I think the failure rate increased due to tamandua, calliphoridae,
> flaviventris, and kestrel were switched from make to meson recently.
>

If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.

[1] -
https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Alexander Lakhin
Date:
Hello Amit,

05.02.2024 15:20, Amit Kapila wrote:
> If this can be reproduced frequently then we can even try to test the
> patch in that thread by Osumi-San [1] (I haven't tested that it
> applies cleanly but shouldn't be difficult to make it work) based on
> the theory that walsender is starting up at an LSN somewhere before
> where the publication is created.
>
> [1] -
https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com
>

Yes, with the aforementioned modification of bgwriter.c, when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ...; but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.
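
In essence, the patched test does something like this before each switch (a
rough sketch; the actual patch is in the thread linked above):

# Make sure the existing walsender has confirmed WAL past the point where the
# new publication was created, before pointing the subscription at it.
$node_publisher->wait_for_catchup('sub1');
$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7");

so the restarted walsender begins at a confirmed position that already lies
beyond the publication's creation.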

Best regards,
Alexander
Attachment

Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Tue, Feb 6, 2024 at 8:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 05.02.2024 15:20, Amit Kapila wrote:
> > If this can be reproduced frequently then we can even try to test the
> > patch in that thread by Osumi-San [1] (I haven't tested that it
> > applies cleanly but shouldn't be difficult to make it work) based on
> > the theory that walsender is starting up at an LSN somewhere before
> > where the publication is created.
> >
> > [1] -
https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com
> >
>
> Yes, with the aforementioned modification of bgwriter.c and when running
> 20 tests in parallel, I got failures on iterations 20. 3, 21 ..., but with the
> updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
> ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
> with no failures.
>

cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?

--
With Regards,
Amit Kapila.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Alexander Lakhin
Date:
06.02.2024 09:48, Amit Kapila wrote:
> cool, is it possible to see whether this patch changes the runtime of
> this test in any noticeable way?
>

Yes, unfortunately it does.
I've measured the duration of 100 test runs without the patch (with pristine
bgwriter and with NO_TEMP_INSTALL):
real    6m46,031s
real    6m52,406s
real    6m51,014s

But with the patched test, I've got:
real    9m39,872s
real    9m40,044s
real    9m38,236s
(nearly 2 seconds of increase per test run)

Under Valgrind, the original test run takes:
Files=1, Tests=36, 334 wallclock secs ( 0.02 usr  0.00 sys + 163.14 cusr  7.98 csys = 171.14 CPU)

But the patched one:
Files=1, Tests=36, 368 wallclock secs ( 0.02 usr  0.00 sys + 182.16 cusr  8.90 csys = 191.08 CPU)
(a 30-second increase)

Maybe the more CPU-efficient solution would be disabling the bgwriter, as was
proposed in another discussion of test instability:
https://www.postgresql.org/message-id/ZaTxhjnPygOdosJ4%40ip-10-97-1-34.eu-west-3.compute.internal

Though I think that devising a way to control the bgwriter may take more time
than we can afford given the current 031 failure rate on the buildfarm
(17 failures in the last 3 days).

Best regards,
Alexander



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Tue, Feb 6, 2024 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 06.02.2024 09:48, Amit Kapila wrote:
> > cool, is it possible to see whether this patch changes the runtime of
> > this test in any noticeable way?
> >
>
> Yes, unfortunately it does.
> I've measured duration of 100 tests runs without the patch (with pristine
> bgwriter and with NO_TEMP_INSTALL):
> real    6m46,031s
> real    6m52,406s
> real    6m51,014s
>
> But with the patched test, I've got:
> real    9m39,872s
> real    9m40,044s
> real    9m38,236s
> (nearly 2 seconds increase per one test run)
>
> Under Valgrind, the original test run takes:
> Files=1, Tests=36, 334 wallclock secs ( 0.02 usr  0.00 sys + 163.14 cusr  7.98 csys = 171.14 CPU)
>
> But the patched one:
> Files=1, Tests=36, 368 wallclock secs ( 0.02 usr  0.00 sys + 182.16 cusr  8.90 csys = 191.08 CPU)
> (30 seconds increase)
>

Yeah, I was worried about that. The other idea I had previously
thought of was to change ALTER SUBSCRIPTION to DROP+CREATE SUBSCRIPTION.
That should also help in bringing stability without losing any
functionality.

--
With Regards,
Amit Kapila.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> Yeah, I was worried about that. The other idea I have previously
> thought was to change Alter Subscription to Drop+Create Subscription.
> That should also help in bringing stability without losing any
> functionality.

Hm, why would that fix it?

More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.

            regards, tom lane



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Tom Lane
Date:
I wrote:
> More to the point, aren't these proposals just band-aids that
> would stabilize the test without fixing the actual problem?
> The same thing is likely to happen to people in the field,
> unless we do something drastic like removing ALTER SUBSCRIPTION.

I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.

While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!

I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice.  As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed.  (That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure.  I've not experimented though.)

BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
But "replication falls over if the walsender is slow" isn't
something I'd call acceptable.

            regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 77c8baa32a..c897ef4c60 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2699,6 +2699,8 @@ WalSndLoop(WalSndSendDataCallback send_data)
             !pq_is_send_pending())
             break;

+        pg_usleep(10000);
+
         /*
          * If we don't have any pending data in the output buffer, try to send
          * some more.  If there is some, we don't bother to call send_data

Re: Why is subscription/t/031_column_list.pl failing so much?

From
vignesh C
Date:
On Tue, 6 Feb 2024 at 08:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Amit,
>
> 05.02.2024 15:20, Amit Kapila wrote:
> > If this can be reproduced frequently then we can even try to test the
> > patch in that thread by Osumi-San [1] (I haven't tested that it
> > applies cleanly but shouldn't be difficult to make it work) based on
> > the theory that walsender is starting up at an LSN somewhere before
> > where the publication is created.
> >
> > [1] -
https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com
> >
>
> Yes, with the aforementioned modification of bgwriter.c and when running
> 20 tests in parallel, I got failures on iterations 20. 3, 21 ..., but with the
> updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
> ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
> with no failures.

I was able to reproduce the issue with the suggested bgwriter changes,
though for me it failed only on the 287th run.
I then ran the test 1000 times with the test changes shared at [1],
and it passed all 1000 times.

I measured the test execution time (average of 10 runs) and found
that it takes about 1.2 seconds longer with the changes:
Without patch:                 8.454 seconds
With test change patch:   9.672 seconds

For the comparison I used a machine with 755.536 GB of total memory,
120 CPUs, and RHEL 7 as the operating system.

[1] - https://www.postgresql.org/message-id/e6ce3cf7-4025-f129-e3ac-0f778469f720%40gmail.com

Regards,
Vignesh



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Yeah, I was worried about that. The other idea I have previously
> > thought was to change Alter Subscription to Drop+Create Subscription.
> > That should also help in bringing stability without losing any
> > functionality.
>
> Hm, why would that fix it?
>

Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher, which will be after the
point where the publication is created.

> More to the point, aren't these proposals just band-aids that
> would stabilize the test without fixing the actual problem?

Yes, but OTOH, this behavior has existed since the beginning of logical
replication. This particular test has just exposed it, so keeping the BF
failing on this particular test doesn't sound like the best way to
remember it.

--
With Regards,
Amit Kapila.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > More to the point, aren't these proposals just band-aids that
> > would stabilize the test without fixing the actual problem?
> > The same thing is likely to happen to people in the field,
> > unless we do something drastic like removing ALTER SUBSCRIPTION.
>
> I've been able to make the 031_column_list.pl failure pretty
> reproducible by adding a delay in walsender, as attached.
>
> While I'm not too familiar with this code, it definitely does appear
> that the new walsender is told to start up at an LSN before the
> creation of the publication, and then if it needs to decide whether
> to stream a particular data change before it's reached that creation,
> kaboom!
>
> I read and understood the upthread worries about it not being
> a great idea to ignore publication lookup failures, but I really
> don't see that we have much choice.  As an example, if a subscriber
> is humming along reading publication pub1, and then someone
> drops and then recreates pub1 on the publisher, I don't think that
> the subscriber will be able to advance through that gap if there
> are any operations within it that require deciding if they should
> be streamed.
>

Right. One idea to address those worries was to have a new
subscription option like ignore_nonexistant_pubs (or some better name
for such an option). A 'true' value for this new option means that we
will ignore publication lookup failures and continue replication;
'false' means raise an error, as we do now. If we agree that
such an option is useful, or at least saves us in some cases as
discussed in another thread [1], we can keep the default value as true
so that users don't face such errors by default while still having a way
to go back to the current behavior.

>
> (That is, contrary to Amit's expectation that
> DROP/CREATE would mask the problem, I suspect it will instead turn
> it into a hard failure.  I've not experimented though.)
>

This is not contrary, because I was suggesting DROP/CREATE of the
Subscription whereas you are talking of dropping and recreating the
Publication.

> BTW, this same change breaks two other subscription tests:
> 015_stream.pl and 022_twophase_cascade.pl.
> The symptoms are different (no "publication does not exist" errors),
> so maybe these are just test problems not fundamental weaknesses.
>

As per the initial analysis, this is because those cases have somewhat
larger transactions (more than 64kB) under test, so the test just times out
waiting for all the data to be replicated. We will do further analysis
and share the findings.

--
With Regards,
Amit Kapila.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
vignesh C
Date:
On Wed, 7 Feb 2024 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > Yeah, I was worried about that. The other idea I have previously
> > > thought was to change Alter Subscription to Drop+Create Subscription.
> > > That should also help in bringing stability without losing any
> > > functionality.
> >
> > Hm, why would that fix it?
> >
>
> Because for new subscriptions, we will start reading WAL from the
> latest WAL insert pointer on the publisher which will be after the
> point where publication is created.

I was able to reproduce the issue consistently with the changes shared
by Tom Lane at [1].
I have made changes to replace ALTER SUBSCRIPTION with DROP+CREATE
SUBSCRIPTION and verified that the test passed consistently for the
more than 50 runs that I ran. Also, the increase in test execution time
for this change is negligible:
Without patch:                 7.991 seconds
With test change patch:   8.121 seconds

The test changes for the same are attached.
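
Roughly, each publication switch in the test becomes something like the
following (a sketch only; the connection-string variable name is illustrative,
subscription options such as copy_data are elided, and the real changes are in
the attached patch):

$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub7");

With a freshly created subscription the new slot starts decoding from the
publisher's current WAL position, which is necessarily after the publication
was created.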

[1] - https://www.postgresql.org/message-id/631312.1707251789%40sss.pgh.pa.us

Regards,
Vignesh

Attachment

Re: Why is subscription/t/031_column_list.pl failing so much?

From
vignesh C
Date:
On Wed, 7 Feb 2024 at 15:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I wrote:
> > > More to the point, aren't these proposals just band-aids that
> > > would stabilize the test without fixing the actual problem?
> > > The same thing is likely to happen to people in the field,
> > > unless we do something drastic like removing ALTER SUBSCRIPTION.
> >
> > I've been able to make the 031_column_list.pl failure pretty
> > reproducible by adding a delay in walsender, as attached.
> >
> > While I'm not too familiar with this code, it definitely does appear
> > that the new walsender is told to start up at an LSN before the
> > creation of the publication, and then if it needs to decide whether
> > to stream a particular data change before it's reached that creation,
> > kaboom!
> >
> > I read and understood the upthread worries about it not being
> > a great idea to ignore publication lookup failures, but I really
> > don't see that we have much choice.  As an example, if a subscriber
> > is humming along reading publication pub1, and then someone
> > drops and then recreates pub1 on the publisher, I don't think that
> > the subscriber will be able to advance through that gap if there
> > are any operations within it that require deciding if they should
> > be streamed.
> >
>
> Right. One idea to address those worries was to have a new
> subscription option like ignore_nonexistant_pubs (or some better name
> for such an option). The 'true' value of this new option means that we
> will ignore the publication lookup failures and continue replication,
> the 'false' means give an error as we are doing now. If we agree that
> such an option is useful or at least saves us in some cases as
> discussed in another thread [1], we can keep the default value as true
> so that users don't face such errors by default and also have a way to
> go back to current behavior.
>
> >
> > (That is, contrary to Amit's expectation that
> > DROP/CREATE would mask the problem, I suspect it will instead turn
> > it into a hard failure.  I've not experimented though.)
> >
>
> This is not contrary because I was suggesting to DROP/CREATE
> Subscription whereas you are talking of drop and recreate of
> Publication.
>
> > BTW, this same change breaks two other subscription tests:
> > 015_stream.pl and 022_twophase_cascade.pl.
> > The symptoms are different (no "publication does not exist" errors),
> > so maybe these are just test problems not fundamental weaknesses.
> >
>
> As per the initial analysis, this is because those cases have somewhat
> larger transactions (more than 64kB) under test so it just times out
> waiting for all the data to be replicated. We will do further analysis
> and share the findings.

Yes, these tests fail while waiting for the larger transactions to be
replicated (catch up) within 180 seconds, as the transactions need more
time to replicate because of the sleep that was added. To verify this I
tried a couple of things: a) I increased the timeout to a higher value
and verified that both tests run successfully with an 1800-second
timeout; b) I reduced the sleep to 1000 microseconds and verified that
both tests run successfully.

So I feel that these tests, 015_stream.pl and 022_twophase_cascade.pl,
failing after the sleep is added can be ignored.

Regards,
Vignesh



035_standby_logical_decoding unbounded hang

From
Noah Misch
Date:
Coincidentally, one of my buildfarm animals hung for several weeks in a
different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:

On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > If you look at the buildfarm's failures page and filter down to
> > > just subscriptionCheck failures, what you find is that all of the
> > > last 6 such failures are in 031_column_list.pl:

> > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > reported a replacement BgWriterDelay value reproducing it.
> 
> Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> BgWriterDelay.

I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:

- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary.  Emits XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish.  Emits XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.

Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution.  I'm also attaching the hack that
makes it 100% reproducible.

Attachment

Re: Why is subscription/t/031_column_list.pl failing so much?

From
vignesh C
Date:
On Wed, 7 Feb 2024 at 16:27, vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 7 Feb 2024 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > > Yeah, I was worried about that. The other idea I have previously
> > > > thought was to change Alter Subscription to Drop+Create Subscription.
> > > > That should also help in bringing stability without losing any
> > > > functionality.
> > >
> > > Hm, why would that fix it?
> > >
> >
> > Because for new subscriptions, we will start reading WAL from the
> > latest WAL insert pointer on the publisher which will be after the
> > point where publication is created.
>
> I was able to reproduce the issue consistently with the changes shared
> by Tom Lane at [1].
> I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
> SUBSCRIPTION and verified that the test has passed consistently for
> >50 runs that I ran. Also the test execution time increased for this
> case is very negligibly:
> Without patch:                 7.991 seconds
> With test change patch:   8.121 seconds
>
> The test changes for the same are attached.

Alternatively, this could also be fixed along the lines of the changes
proposed by Amit at [1]. In that case we ignore publications that are not
found for the purpose of computing RelSyncEntry attributes. We won't mark
such an entry as valid till all the publications are loaded without
anything missing. This means we won't publish operations on tables
corresponding to that publication till we find such a publication, which
seems okay.

Tomas had raised a performance issue at [2]: this approach forces us to
reload the entry for every replicated change/row while the publications
are invalid. How about keeping the default behavior as it is and
providing a new option, skip_not_exist_publication, when creating/altering
a subscription? If skip_not_exist_publication is specified we will ignore
a missing publication; the publications will be kept invalid and get
validated later.

The attached patch has the changes for the same. Thoughts?
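
For illustration, usage would look roughly like this (the syntax is only a
sketch of the proposed option, and the connection-string variable is
illustrative):

$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub7
	 WITH (skip_not_exist_publication = true)");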

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com

Regards,
Vignesh

Attachment

Re: 035_standby_logical_decoding unbounded hang

From
Bertrand Drouvot
Date:
Hi,

On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> Coincidentally, one of my buildfarm animals hanged several weeks in a
> different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
> reduction was part of making it reproducible:
> 
> On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > > If you look at the buildfarm's failures page and filter down to
> > > > just subscriptionCheck failures, what you find is that all of the
> > > > last 6 such failures are in 031_column_list.pl:
> 
> > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > > reported a replacement BgWriterDelay value reproducing it.
> > 
> > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> > BgWriterDelay.
> 
> I'm reusing this thread just in case there's overlap with the
> 031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
> a race condition arising from an event sequence like this:
> 
> - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> - Test script calls pg_log_standby_snapshot() on primary.  Emits XLOG_RUNNING_XACTS.
> - checkpoint_timeout makes a primary checkpoint finish.  Emits XLOG_RUNNING_XACTS.
> - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
> - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> 
> Other test code already has a solution for this, so the attached patches add a
> timeout and copy the existing solution.  I'm also attaching the hack that
> makes it 100% reproducible.

Thanks!

I did a few tests and confirm that the proposed solution fixes the corner case.

standby-slot-test-1-timeout-v1.patch LGTM.

Regarding standby-slot-test-2-race-v1.patch:

> +# See corresponding create_logical_slot_on_standby() code.
> +$node_standby->poll_query_until(
> +    'postgres', qq[
> +        SELECT restart_lsn IS NOT NULL
> +        FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
> +    ]) or die "timed out waiting for logical slot to calculate its restart_lsn";
> +

What about creating a sub, say wait_for_restart_lsn_calculation(), in Cluster.pm
and then making use of it in create_logical_slot_on_standby() and above? (Something
like the attached wait_for_restart_lsn_calculation-v1.patch.)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Wed, Feb 14, 2024 at 12:58 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 7 Feb 2024 at 16:27, vignesh C <vignesh21@gmail.com> wrote:
> >
> > I was able to reproduce the issue consistently with the changes shared
> > by Tom Lane at [1].
> > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
> > SUBSCRIPTION and verified that the test has passed consistently for
> > >50 runs that I ran. Also the test execution time increased for this
> > case is very negligibly:
> > Without patch:                 7.991 seconds
> > With test change patch:   8.121 seconds
> >
> > The test changes for the same are attached.
>
> Alternative, this could also be fixed like the changes proposed by
> Amit at [1]. In this case we ignore publications that are not found
> for the purpose of computing RelSyncEntry attributes. We won't mark
> such an entry as valid till all the publications are loaded without
> anything missing. This means we won't publish operations on tables
> corresponding to that publication till we found such a publication and
> that seems okay.
>
> Tomas had raised a performance issue forcing us to reload it for every
> replicated change/row in case the publications are invalid at [2].
>

Did you measure the performance impact? I think it should impact the
performance only when there is a dropped/non-existent publication as
per the subscription definition. Also, in the same email [2], Tomas
raised another concern that in such cases it is better to break the
replication.

>
 How
> about keeping the default option as it is and providing a new option
> skip_not_exist_publication while creating/altering a subscription. In
> this case if skip_not_exist_publication  is specified we will ignore
> the case if publication is not present and publications will be kept
> in invalid and get validated later.
>

I don't know if this deserves a separate option or not, but I think it
is better to discuss this in a separate thread. To resolve the BF
failure, I still feel we should just recreate the subscription. This
is a pre-existing problem and we can track it via a separate patch
with a test case targeting such failures.

> The attached patch has the changes for the same. Thoughts?
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com
>

--
With Regards,
Amit Kapila.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
vignesh C
Date:
On Thu, 15 Feb 2024 at 08:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 14, 2024 at 12:58 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 7 Feb 2024 at 16:27, vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > I was able to reproduce the issue consistently with the changes shared
> > > by Tom Lane at [1].
> > > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
> > > SUBSCRIPTION and verified that the test has passed consistently for
> > > >50 runs that I ran. Also the test execution time increased for this
> > > case is very negligibly:
> > > Without patch:                 7.991 seconds
> > > With test change patch:   8.121 seconds
> > >
> > > The test changes for the same are attached.
> >
> > Alternative, this could also be fixed like the changes proposed by
> > Amit at [1]. In this case we ignore publications that are not found
> > for the purpose of computing RelSyncEntry attributes. We won't mark
> > such an entry as valid till all the publications are loaded without
> > anything missing. This means we won't publish operations on tables
> > corresponding to that publication till we found such a publication and
> > that seems okay.
> >
> > Tomas had raised a performance issue forcing us to reload it for every
> > replicated change/row in case the publications are invalid at [2].
> >
>
> Did you measure the performance impact? I think it should impact the
> performance only when there is a dropped/non-existent publication as
> per the subscription definition.

The performance was almost the same in this case:
Without patch:                                                 7.991 seconds
With skip_not_exist_publication option patch: 7.996 seconds

> Also, in the same email[2], Tomas
> raised another concern that in such cases it is better to break the
> replication.

I will handle this in the next version.

> >
> > How
> > about keeping the default option as it is and providing a new option
> > skip_not_exist_publication while creating/altering a subscription. In
> > this case if skip_not_exist_publication  is specified we will ignore
> > the case if publication is not present and publications will be kept
> > in invalid and get validated later.
> >
>
> I don't know if this deserves a separate option or not but I think it
> is better to discuss this in a separate thread.

I will start a separate thread for this.

> To resolve the BF
> failure, I still feel, we should just recreate the subscription. This
> is a pre-existing problem and we can track it via a separate patch
> with a test case targetting such failures.

+1 for going with recreation of the subscription, the changes for this
are available at [1].
[1] - https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com

Regards,
Vignesh



Re: 035_standby_logical_decoding unbounded hang

From
Noah Misch
Date:
On Wed, Feb 14, 2024 at 03:31:16PM +0000, Bertrand Drouvot wrote:
> On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> > The 035_standby_logical_decoding.pl hang is
> > a race condition arising from an event sequence like this:
> > 
> > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> > - Test script calls pg_log_standby_snapshot() on primary.  Emits XLOG_RUNNING_XACTS.
> > - checkpoint_timeout makes a primary checkpoint finish.  Emits XLOG_RUNNING_XACTS.
> > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
> > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> > 
> > Other test code already has a solution for this, so the attached patches add a
> > timeout and copy the existing solution.  I'm also attaching the hack that
> > makes it 100% reproducible.

> I did a few tests and confirm that the proposed solution fixes the corner case.

Thanks for reviewing.

> What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> and then make use of it in create_logical_slot_on_standby() and above? (something
> like wait_for_restart_lsn_calculation-v1.patch attached).

Waiting for restart_lsn is just a prerequisite for calling
pg_log_standby_snapshot(), so I wouldn't separate those two.  If we're
extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
and make the API like one of these:

  $standby->wait_for_subscription_starting_point($primary, $slot_name);
  $primary->log_standby_snapshot($standby, $slot_name);

Would you like to finish the patch in such a way?



Re: 035_standby_logical_decoding unbounded hang

From
Bertrand Drouvot
Date:
Hi,

On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote:
> On Wed, Feb 14, 2024 at 03:31:16PM +0000, Bertrand Drouvot wrote:
> > What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> > and then make use of it in create_logical_slot_on_standby() and above? (something
> > like wait_for_restart_lsn_calculation-v1.patch attached).
> 
> Waiting for restart_lsn is just a prerequisite for calling
> pg_log_standby_snapshot(), so I wouldn't separate those two.

Yeah, makes sense.

> If we're
> extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
> and make the API like one of these:
> 
>   $standby->wait_for_subscription_starting_point($primary, $slot_name);
>   $primary->log_standby_snapshot($standby, $slot_name);
> 
> Would you like to finish the patch in such a way?

Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used
log_standby_snapshot() as it seems more generic to me.
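
For reference, the helper ends up having roughly this shape (a sketch with
assumed parameter handling; the real code is in the attached patch):

# In Cluster.pm; called on the primary as
#   $primary->log_standby_snapshot($standby, $slot_name);
sub log_standby_snapshot
{
	my ($self, $standby, $slot_name) = @_;

	# Once the slot on the standby has computed its restart_lsn, emit a
	# running-xacts record on the primary so the slot can reach a
	# consistent point.
	$standby->poll_query_until('postgres',
		"SELECT restart_lsn IS NOT NULL FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'")
	  or die "timed out waiting for logical slot to calculate its restart_lsn";
	$self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');
}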

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Thu, Feb 15, 2024 at 10:46 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, 15 Feb 2024 at 08:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > To resolve the BF
> > failure, I still feel, we should just recreate the subscription. This
> > is a pre-existing problem and we can track it via a separate patch
> > with a test case targetting such failures.
>
> +1 for going with recreation of the subscription, the changes for this
> are available at [1].
> [1] - https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com
>

Tom, and others, does anyone still object to going ahead with the idea
of just changing the test to recreate the subscription, in order to
silence BF failures for this test?

Note that Vignesh has already started a thread [1] to add an option
(or otherwise) to ignore publication lookup failures.

[1] -
https://www.postgresql.org/message-id/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: 035_standby_logical_decoding unbounded hang

From
Noah Misch
Date:
On Fri, Feb 16, 2024 at 06:37:38AM +0000, Bertrand Drouvot wrote:
> On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote:
> > On Wed, Feb 14, 2024 at 03:31:16PM +0000, Bertrand Drouvot wrote:
> > > What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> > > and then make use of it in create_logical_slot_on_standby() and above? (something
> > > like wait_for_restart_lsn_calculation-v1.patch attached).
> > 
> > Waiting for restart_lsn is just a prerequisite for calling
> > pg_log_standby_snapshot(), so I wouldn't separate those two.
> 
> Yeah, makes sense.
> 
> > If we're
> > extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
> > and make the API like one of these:
> > 
> >   $standby->wait_for_subscription_starting_point($primary, $slot_name);
> >   $primary->log_standby_snapshot($standby, $slot_name);
> > 
> > Would you like to finish the patch in such a way?
> 
> Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used
> log_standby_snapshot() as it seems more generic to me.

Thanks.  Pushed at commit 0e16281.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Mon, Feb 19, 2024 at 2:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 15, 2024 at 10:46 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, 15 Feb 2024 at 08:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > To resolve the BF
> > > failure, I still feel, we should just recreate the subscription. This
> > > is a pre-existing problem and we can track it via a separate patch
> > > with a test case targetting such failures.
> >
> > +1 for going with recreation of the subscription, the changes for this
> > are available at [1].
> > [1] - https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com
> >
>
> Tom, and others, does anyone still object to going ahead with an idea
> by just changing the test to recreate the subscription to silence BF
> failures for this test?
>

Seeing no objections, I have pushed the required test changes to silence the BF.

--
With Regards,
Amit Kapila.



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> Seeing no objections, I have pushed the required test changes to silence the BF.

The farm is still unhappy in the v16 branch.

            regards, tom lane



Re: Why is subscription/t/031_column_list.pl failing so much?

From
Amit Kapila
Date:
On Sat, Feb 24, 2024 at 2:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Seeing no objections, I have pushed the required test changes to silence the BF.
>
> The farm is still unhappy in the v16 branch.
>

As this feature and tests were introduced in 15, I think we should
backpatch till 15. I'll do that early next week unless you or someone
else thinks otherwise.

--
With Regards,
Amit Kapila.



Re: 035_standby_logical_decoding unbounded hang

From
"Andrey M. Borodin"
Date:

> On 20 Feb 2024, at 04:09, Noah Misch <noah@leadboat.com> wrote:
>

I’m not sure if it is connected, but so far many patches in CFbot keep hanging in this test. For example [0].

I haven’t done root cause analysis yet, but hangs may be related to this thread. Maybe someone more familiar with
similar issues could take a look there?

Thanks!


Best regards, Andrey Borodin.

[0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292


Re: 035_standby_logical_decoding unbounded hang

From
Noah Misch
Date:
On Thu, Mar 07, 2024 at 02:46:55PM +0500, Andrey M. Borodin wrote:
> I’m not sure if it is connected, but so far many patches in CFbot keep hanging in this test. For example [0].

> [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292 

Relevant part:

[22:03:10.292] stderr:
[22:03:10.292] # poll_query_until timed out executing this query:
[22:03:10.292] # SELECT (SELECT catalog_xmin::text::int - 770 from pg_catalog.pg_replication_slots where slot_name = 'injection_activeslot') > 0
[22:03:10.292] # expecting this output:
[22:03:10.292] # t
[22:03:10.292] # last actual query output:
[22:03:10.292] # f
[22:03:10.292] # with stderr:
[22:03:10.292] # Tests were run but no plan was declared and done_testing() was not seen.
[22:03:10.292] # Looks like your test exited with 255 just after 57.

The injection_activeslot test got added after $SUBJECT, in 08a52ab151
(2024-03-06).  It's now reverted in 65db0cfb4c (2024-03-07).