Thread: Build-farm - intermittent error in 031_column_list.pl

Build-farm - intermittent error in 031_column_list.pl

From

Peter Smith

Date:

19 May 2022, 04:26:56

Hi hackers.

FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
[1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD

The error happened during "subscriptionCheck" phase in the TAP test
t/031_column_list.pl
This test file was added by this [2] commit.
[2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

~~

I checked the history of fails for that TAP test t/031_column_list.pl
and found that this same error seems to have been happening
intermittently for at least the last 50 days.

Details of similar previous errors from the BF are listed below.

~~~

1. Details for system "grison" failure at stage subscriptionCheck,
snapshot taken 2022-05-18 18:11:45
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45

[22:02:08] t/029_on_error.pl .................. ok    25475 ms ( 0.01
usr  0.00 sys + 15.39 cusr  5.59 csys = 20.99 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[22:09:25] t/031_column_list.pl ...............
...
[22:02:47.887](1.829s) ok 22 - partitions with different replica
identities not replicated correctly Waiting for replication conn
sub1's replay_lsn to pass 0/1530588 on publisher
[22:09:25.395](397.508s) # poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 728.
### Stopping node "publisher" using mode immediate

~~~

2. Details for system "xenodermus" failure at stage subscriptionCheck,
snapshot taken 2022-04-16 21:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04

[00:15:32] t/029_on_error.pl .................. ok     8278 ms ( 0.00
usr  0.00 sys +  1.33 cusr  0.55 csys =  1.88 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1543648' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 25.
[00:22:30] t/031_column_list.pl ...............
...
[00:16:04.100](0.901s) ok 25 - partitions with different replica
identities not replicated correctly Waiting for replication conn
sub1's replay_lsn to pass 0/1543648 on publisher
[00:22:29.923](385.823s) # poll_query_until timed out executing this query:
# SELECT '0/1543648' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 818.

~~~

3. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

# poll_query_until timed out executing this query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[20:50:25] t/031_column_list.pl ...............
...
ok 22 - partitions with different replica identities not replicated
correctly Waiting for replication conn sub1's replay_lsn to pass
0/1528640 on publisher # poll_query_until timed out executing this
query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 667.

~~~

4. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

[20:43:04] t/030_sequences.pl ................. ok    11108 ms ( 0.00
usr  0.00 sys +  1.49 cusr  0.40 csys =  1.89 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[20:50:25] t/031_column_list.pl ...............
...
ok 22 - partitions with different replica identities not replicated
correctly Waiting for replication conn sub1's replay_lsn to pass
0/1528640 on publisher # poll_query_until timed out executing this
query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 667.

~~~

5. Details for system "grison" failure at stage subscriptionCheck,
snapshot taken 2022-04-03 18:11:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-04-03%2018%3A11%3A39

[22:28:00] t/030_sequences.pl ................. ok    22970 ms ( 0.01
usr  0.00 sys + 14.93 cusr  5.14 csys = 20.08 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1528CF0' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[22:35:11] t/031_column_list.pl ...............
....
ok 22 - partitions with different replica identities not replicated
correctly Waiting for replication conn sub1's replay_lsn to pass
0/1528CF0 on publisher # poll_query_until timed out executing this
query:
# SELECT '0/1528CF0' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 667.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Build-farm - intermittent error in 031_column_list.pl

From

Kyotaro Horiguchi

Date:

19 May 2022, 06:58:04

At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in 
> Hi hackers.
> 
> FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> 
> The error happened during "subscriptionCheck" phase in the TAP test
> t/031_column_list.pl
> This test file was added by this [2] commit.
> [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

What is happening for all of them looks like that the name of a
publication created by CREATE PUBLICATION without a failure report is
missing for a walsender came later. It seems like CREATE PUBLICATION
can silently fail to create a publication, or walsender somehow failed
to find existing one.


> ~~
> 
> I checked the history of fails for that TAP test t/031_column_list.pl
> and found that this same error seems to have been happening
> intermittently for at least the last 50 days.
> 
> Details of similar previous errors from the BF are listed below.
> 
> ~~~
> 
> 1. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-05-18 18:11:45
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45
> 
> [22:02:08] t/029_on_error.pl .................. ok    25475 ms ( 0.01
> usr  0.00 sys + 15.39 cusr  5.59 csys = 20.99 CPU)
> # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> #          FROM pg_catalog.pg_stat_replication
> #          WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 22.
> [22:09:25] t/031_column_list.pl ...............
> ...
> [22:02:47.887](1.829s) ok 22 - partitions with different replica
> identities not replicated correctly Waiting for replication conn
> sub1's replay_lsn to pass 0/1530588 on publisher
> [22:09:25.395](397.508s) # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> #          FROM pg_catalog.pg_stat_replication
> #          WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> timed out waiting for catchup at t/031_column_list.pl line 728.
> ### Stopping node "publisher" using mode immediate

2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION
pub9FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
 
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG:  disconnection: session time:
0:00:00.002user=bf database=postgres host=[local]
 

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
the session has been disconnected. But the following request for the
same publication fails due to the absense of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL
0/153DB88(proto_version '3', publication_names '"pub9"')
 
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  publication "pub9" does not exist


> ~~~
> 
> 2. Details for system "xenodermus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-16 21:00:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04

The same. pub9 is missing after creation.

> ~~~
> 
> 3. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

The same happens for pub7..

> 4. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

Same. pub7 is missing.

> 5. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-04-03 18:11:39
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-04-03%2018%3A11%3A39

Same. pub7 is missing.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Build-farm - intermittent error in 031_column_list.pl

From

Amit Kapila

Date:

19 May 2022, 09:46:52

On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
> > Hi hackers.
> >
> > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> >
> > The error happened during "subscriptionCheck" phase in the TAP test
> > t/031_column_list.pl
> > This test file was added by this [2] commit.
> > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
>
> What is happening for all of them looks like that the name of a
> publication created by CREATE PUBLICATION without a failure report is
> missing for a walsender came later. It seems like CREATE PUBLICATION
> can silently fail to create a publication, or walsender somehow failed
> to find existing one.
>

Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?

>
> > ~~
> >
>
> 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE
PUBLICATIONpub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);

> 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG:  disconnection: session time:
0:00:00.002user=bf database=postgres host=[local]

>
> "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> the session has been disconnected. But the following request for the
> same publication fails due to the absense of the publication.
>
> 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL
0/153DB88(proto_version '3', publication_names '"pub9"')

> 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  publication "pub9" does not exist
>

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION will lead to restarting of
apply worker (which we can see in LOGS as well) and after the restart,
the apply worker will use the existing slot and replication origin
corresponding to the subscription. Now, it is possible that before
restart the origin has not been updated and the WAL start location
points to a location prior to where PUBLICATION pub9 exists which can
lead to such an error. Once this error occurs, apply worker will never
be able to proceed and will always return the same error. Does this
make sense?

Unless you or others see a different theory, this seems to be the
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing publication to point to
the new publication.

Note: Added Tomas to know his views as he has committed this test.

-- 
With Regards,
Amit Kapila.

Re: Build-farm - intermittent error in 031_column_list.pl

From

Amit Kapila

Date:

19 May 2022, 11:12:31

On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
> > > Hi hackers.
> > >
> > > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > >
> > > The error happened during "subscriptionCheck" phase in the TAP test
> > > t/031_column_list.pl
> > > This test file was added by this [2] commit.
> > > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> >
> > What is happening for all of them looks like that the name of a
> > publication created by CREATE PUBLICATION without a failure report is
> > missing for a walsender came later. It seems like CREATE PUBLICATION
> > can silently fail to create a publication, or walsender somehow failed
> > to find existing one.
> >
>
> Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?
>
> >
> > > ~~
> > >
> >
> > 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE
PUBLICATIONpub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
 
> > 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG:  disconnection: session time:
0:00:00.002user=bf database=postgres host=[local]
 
> >
> > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> > the session has been disconnected. But the following request for the
> > same publication fails due to the absense of the publication.
> >
> > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL
0/153DB88(proto_version '3', publication_names '"pub9"')
 
> > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  publication "pub9" does not exist
> >
>
> This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> apply worker (which we can see in LOGS as well) and after the restart,
> the apply worker will use the existing slot and replication origin
> corresponding to the subscription. Now, it is possible that before
> restart the origin has not been updated and the WAL start location
> points to a location prior to where PUBLICATION pub9 exists which can
> lead to such an error. Once this error occurs, apply worker will never
> be able to proceed and will always return the same error. Does this
> make sense?
>
> Unless you or others see a different theory, this seems to be the
> existing problem in logical replication which is manifested by this
> test. If we just want to fix these test failures, we can create a new
> subscription instead of altering the existing publication to point to
> the new publication.
>

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
both publisher and subscriber are in sync then the new publication
should be visible to WALSender.

-- 
With Regards,
Amit Kapila.

Re: Build-farm - intermittent error in 031_column_list.pl

From

Kyotaro Horiguchi

Date:

20 May 2022, 01:28:19

At Thu, 19 May 2022 16:42:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> > probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> > apply worker (which we can see in LOGS as well) and after the restart,

Yes.

> > the apply worker will use the existing slot and replication origin
> > corresponding to the subscription. Now, it is possible that before
> > restart the origin has not been updated and the WAL start location
> > points to a location prior to where PUBLICATION pub9 exists which can
> > lead to such an error. Once this error occurs, apply worker will never
> > be able to proceed and will always return the same error. Does this
> > make sense?

Wow. I didin't thought that line. That theory explains the silence and
makes sense even though I don't see LSN transistions that clearly
support it.  I dimly remember a similar kind of problem..

> > Unless you or others see a different theory, this seems to be the
> > existing problem in logical replication which is manifested by this
> > test. If we just want to fix these test failures, we can create a new
> > subscription instead of altering the existing publication to point to
> > the new publication.
> >
> 
> If the above theory is correct then I think allowing the publisher to
> catch up with "$node_publisher->wait_for_catchup('sub1');" before
> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
> both publisher and subscriber are in sync then the new publication
> should be visible to WALSender.

It looks right to me.  That timetravel seems inintuitive but it's the
(current) way it works.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Build-farm - intermittent error in 031_column_list.pl

From

Amit Kapila

Date:

20 May 2022, 03:58:29

On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> > > the apply worker will use the existing slot and replication origin
> > > corresponding to the subscription. Now, it is possible that before
> > > restart the origin has not been updated and the WAL start location
> > > points to a location prior to where PUBLICATION pub9 exists which can
> > > lead to such an error. Once this error occurs, apply worker will never
> > > be able to proceed and will always return the same error. Does this
> > > make sense?
>
> Wow. I didin't thought that line. That theory explains the silence and
> makes sense even though I don't see LSN transistions that clearly
> support it.  I dimly remember a similar kind of problem..
>
> > > Unless you or others see a different theory, this seems to be the
> > > existing problem in logical replication which is manifested by this
> > > test. If we just want to fix these test failures, we can create a new
> > > subscription instead of altering the existing publication to point to
> > > the new publication.
> > >
> >
> > If the above theory is correct then I think allowing the publisher to
> > catch up with "$node_publisher->wait_for_catchup('sub1');" before
> > ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
> > both publisher and subscriber are in sync then the new publication
> > should be visible to WALSender.
>
> It looks right to me.
>

Let's wait for Tomas or others working in this area to share their thoughts.

>  That timetravel seems inintuitive but it's the
> (current) way it works.
>

I have thought about it but couldn't come up with a good way to change
the way currently it works. Moreover, I think it is easy to hit this
in other ways as well. Say, you first create a subscription with a
non-existent publication and then do operation on any unrelated table
on the publisher before creating the required publication, we will hit
exactly this problem of "publication does not exist", so I think we
may need to live with this behavior and write tests carefully.

-- 
With Regards,
Amit Kapila.

Re: Build-farm - intermittent error in 031_column_list.pl

From

Tomas Vondra

Date:

20 May 2022, 10:31:17

On 5/20/22 05:58, Amit Kapila wrote:
> On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>>
>>>> the apply worker will use the existing slot and replication origin
>>>> corresponding to the subscription. Now, it is possible that before
>>>> restart the origin has not been updated and the WAL start location
>>>> points to a location prior to where PUBLICATION pub9 exists which can
>>>> lead to such an error. Once this error occurs, apply worker will never
>>>> be able to proceed and will always return the same error. Does this
>>>> make sense?
>>
>> Wow. I didin't thought that line. That theory explains the silence and
>> makes sense even though I don't see LSN transistions that clearly
>> support it.  I dimly remember a similar kind of problem..
>>
>>>> Unless you or others see a different theory, this seems to be the
>>>> existing problem in logical replication which is manifested by this
>>>> test. If we just want to fix these test failures, we can create a new
>>>> subscription instead of altering the existing publication to point to
>>>> the new publication.
>>>>
>>>
>>> If the above theory is correct then I think allowing the publisher to
>>> catch up with "$node_publisher->wait_for_catchup('sub1');" before
>>> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
>>> both publisher and subscriber are in sync then the new publication
>>> should be visible to WALSender.
>>
>> It looks right to me.
>>
> 
> Let's wait for Tomas or others working in this area to share their thoughts.
> 

Are we really querying the publications (in get_rel_sync_entry) using
the historical snapshot? I haven't really realized this, but yeah, that
might explain the issue.

The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
often than any other test (there are ~15 calls, 12 of which are in this
new test). That might be why we haven't seen failures before. Or maybe
the existing tests simply are not vulnerable to this, because they
either do wait_for_catchup late enough or don't do any DML right before
executing SET PUBLICATION.

>>  That timetravel seems inintuitive but it's the
>> (current) way it works.
>>
> 
> I have thought about it but couldn't come up with a good way to change
> the way currently it works. Moreover, I think it is easy to hit this
> in other ways as well. Say, you first create a subscription with a
> non-existent publication and then do operation on any unrelated table
> on the publisher before creating the required publication, we will hit
> exactly this problem of "publication does not exist", so I think we
> may need to live with this behavior and write tests carefully.
> 

Yeah, I think it pretty much requires ensuring the subscriber is fully
caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
replication in an unrecoverable way (actually, you can alter the
subscription and remove the publication again, right?).

But this is not just about tests, of course - the same issue applies to
regular replication. That's a bit unfortunate, so maybe we should think
about making this less fragile.

We might make sure the subscriber is not lagging (essentially the
wait_for_catchup) - which the users will have to do anyway (although
maybe they know the publisher is beyond the LSN where it was created).

The other option would be to detect such case, somehow - if you don't
see the publication yet, see if it exists in current snapshot, and then
maybe ignore this error. But that has other issues (the publication
might have been created and dropped, in which case you won't see it).
Also, we'd probably have to ignore RelationSyncEntry for a while, which
seems quite expensive.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

RE: Build-farm - intermittent error in 031_column_list.pl

From

"osumi.takamichi@fujitsu.com"

Date:

20 May 2022, 14:53:22

On Thursday, May 19, 2022 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith
> > > <smithpb2250@gmail.com> wrote in
> > > > Hi hackers.
> > > >
> > > > FYI, I saw that there was a recent Build-farm error on the
> > > > "grison" machine [1] [1]
> > > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison
> > > > &br=HEAD
> > > >
> > > > The error happened during "subscriptionCheck" phase in the TAP
> > > > test t/031_column_list.pl This test file was added by this [2]
> > > > commit.
> > > > [2]
> > > >
> https://github.com/postgres/postgres/commit/923def9a533a7d986acfb5
> > > > 24139d8b9e5466d0a5
> > >
> > > 2022-04-17 00:16:04.278 CEST [293659][client
> > > backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE
> > > PUBLICATION pub9 FOR TABLE test_part_d (a) WITH
> > > (publish_via_partition_root = true);
> > > 2022-04-17 00:16:04.279 CEST [293659][client
> > > backend][:0][031_column_list.pl] LOG:  disconnection: session time:
> > > 0:00:00.002 user=bf database=postgres host=[local]
> > >
> > > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> > > the session has been disconnected. But the following request for the
> > > same publication fails due to the absense of the publication.
> > >
> > > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1]
> > > STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88
> > > (proto_version '3', publication_names '"pub9"')
> > > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:
> > > publication "pub9" does not exist
> > >
> >
> > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9".
> The
> > probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> > apply worker (which we can see in LOGS as well) and after the restart,
> > the apply worker will use the existing slot and replication origin
> > corresponding to the subscription. Now, it is possible that before
> > restart the origin has not been updated and the WAL start location
> > points to a location prior to where PUBLICATION pub9 exists which can
> > lead to such an error. Once this error occurs, apply worker will never
> > be able to proceed and will always return the same error. Does this
> > make sense?
> >
> > Unless you or others see a different theory, this seems to be the
> > existing problem in logical replication which is manifested by this
> > test. If we just want to fix these test failures, we can create a new
> > subscription instead of altering the existing publication to point to
> > the new publication.
> >
> 
> If the above theory is correct then I think allowing the publisher to catch up with
> "$node_publisher->wait_for_catchup('sub1');" before ALTER SUBSCRIPTION
> should fix this problem. Because if before ALTER both publisher and
> subscriber are in sync then the new publication should be visible to
> WALSender.
Hi,


I've attached a patch for the fix proposed here.
First of all, thank you so much for helping me offlist, Amit-san.

I reproduced the failure like [1] by commenting out
WalSndWaitForWal's call of WalSndKeepalive and running the test.
This comment out intends to suppress the advance of confirmed_flush location
after creating a publication.

In short, my understanding how the bug happened is, 
1. we execute 'create publication pubX' and create one publication.
2. 'alter subscription subY set publication pubX' makes the apply worker exit
3. relaunched apply worker searches for pubX. But, the slot position(confirmed_flush)
   doesn't get updated and points to some location before create publication at the publisher node.

Applying the attached patch have made the test pass.


[1] the subscriber's log

2022-05-20 08:56:50.773 UTC [5153] 031_column_list.pl LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub6
2022-05-20 08:56:50.801 UTC [5156] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel
WHEREsrsubstate NOT IN ('r', 's');
 
2022-05-20 08:56:50.846 UTC [5112] LOG: logical replication apply worker for subscription "sub1" will restart because
ofa parameter change
 
2022-05-20 08:56:50.915 UTC [5158] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel
WHEREsrsubstate NOT IN ('r', 's');
 
...
2022-05-20 08:56:51.257 UTC [5164] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel
WHEREsrsubstate NOT IN ('r', 's');
 
2022-05-20 08:56:51.353 UTC [5166] LOG: logical replication apply worker for subscription "sub1" has started
2022-05-20 08:56:51.366 UTC [5168] LOG: logical replication table synchronization worker for subscription "sub1", table
"test_part_a"has started
 
2022-05-20 08:56:51.370 UTC [5171] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel
WHEREsrsubstate NOT IN ('r', 's');
 
2022-05-20 08:56:51.373 UTC [5166] ERROR: could not receive data from WAL stream: ERROR: publication "pub6" does not
exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C61B8
2022-05-20 08:56:51.374 UTC [4338] LOG: background worker "logical replication worker" (PID 5166) exited with exit code
1



Best Regards,
    Takamichi Osumi

Attachment

v1-0001-Fix-the-intermittent-Build-farm-failures-in-031_c.patch

Re: Build-farm - intermittent error in 031_column_list.pl

From

Amit Kapila

Date:

21 May 2022, 03:33:48

On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 5/20/22 05:58, Amit Kapila wrote:
>
> Are we really querying the publications (in get_rel_sync_entry) using
> the historical snapshot?
>

Yes.

> I haven't really realized this, but yeah, that
> might explain the issue.
>
> The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
> often than any other test (there are ~15 calls, 12 of which are in this
> new test). That might be why we haven't seen failures before. Or maybe
> the existing tests simply are not vulnerable to this,
>

Right, I have checked the other cases are not vulnerable to this,
otherwise, I think we would have seen intermittent failures till now.
They don't seem to be doing DMLs before the creation of a publication
or they create a subscription pointing to the same publication before.

> because they
> either do wait_for_catchup late enough or don't do any DML right before
> executing SET PUBLICATION.
>
> >>  That timetravel seems inintuitive but it's the
> >> (current) way it works.
> >>
> >
> > I have thought about it but couldn't come up with a good way to change
> > the way currently it works. Moreover, I think it is easy to hit this
> > in other ways as well. Say, you first create a subscription with a
> > non-existent publication and then do operation on any unrelated table
> > on the publisher before creating the required publication, we will hit
> > exactly this problem of "publication does not exist", so I think we
> > may need to live with this behavior and write tests carefully.
> >
>
> Yeah, I think it pretty much requires ensuring the subscriber is fully
> caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
> replication in an unrecoverable way (actually, you can alter the
> subscription and remove the publication again, right?).
>

Right.

> But this is not just about tests, of course - the same issue applies to
> regular replication. That's a bit unfortunate, so maybe we should think
> about making this less fragile.
>

Agreed, provided we find some reasonable solution.

> We might make sure the subscriber is not lagging (essentially the
> wait_for_catchup) - which the users will have to do anyway (although
> maybe they know the publisher is beyond the LSN where it was created).
>

This won't work for the case mentioned above where we create a
subscription with non-existent publications, then perform DML and then
'CREATE PUBLICATION'.

> The other option would be to detect such case, somehow - if you don't
> see the publication yet, see if it exists in current snapshot, and then
> maybe ignore this error. But that has other issues (the publication
> might have been created and dropped, in which case you won't see it).
>

True, the dropped case would again be tricky to deal with and I think
we will end up publishing some operations which are performed before
the publication is even created.

> Also, we'd probably have to ignore RelationSyncEntry for a while, which
> seems quite expensive.
>

Yet another option could be that we continue using a historic snapshot
but ignore publications that are not found for the purpose of
computing RelSyncEntry attributes. We won't mark such an entry as
valid till all the publications are loaded without anything missing. I
think such cases in practice won't be enough to matter. This means we
won't publish operations on tables corresponding to that publication
till we found such a publication and that seems okay.

-- 
With Regards,
Amit Kapila.

Re: Build-farm - intermittent error in 031_column_list.pl

From

Amit Kapila

Date:

24 May 2022, 12:49:45

On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
> > Also, we'd probably have to ignore RelationSyncEntry for a while, which
> > seems quite expensive.
> >
>
> Yet another option could be that we continue using a historic snapshot
> but ignore publications that are not found for the purpose of
> computing RelSyncEntry attributes. We won't mark such an entry as
> valid till all the publications are loaded without anything missing. I
> think such cases in practice won't be enough to matter. This means we
> won't publish operations on tables corresponding to that publication
> till we found such a publication and that seems okay.
>

Attached, find the patch to show what I have in mind for this. Today,
we have received a bug report with a similar symptom [1] and that
should also be fixed with this. The reported bug should also be fixed
with this.

Thoughts?


[1] -
https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

Attachment

skip_missing_pubs.1.patch

RE: Build-farm - intermittent error in 031_column_list.pl

From

"osumi.takamichi@fujitsu.com"

Date:

25 May 2022, 02:28:01

On Tuesday, May 24, 2022 9:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >
> > > Also, we'd probably have to ignore RelationSyncEntry for a while,
> > > which seems quite expensive.
> > >
> >
> > Yet another option could be that we continue using a historic snapshot
> > but ignore publications that are not found for the purpose of
> > computing RelSyncEntry attributes. We won't mark such an entry as
> > valid till all the publications are loaded without anything missing. I
> > think such cases in practice won't be enough to matter. This means we
> > won't publish operations on tables corresponding to that publication
> > till we found such a publication and that seems okay.
> >
> 
> Attached, find the patch to show what I have in mind for this. Today, we have
> received a bug report with a similar symptom [1] and that should also be fixed
> with this. The reported bug should also be fixed with this.
> 
> Thoughts?
Hi,

I agree with this direction.
I think this approach solves the issue fundamentally
and is better than the first approach to add several calls
of wait_for_catchup in the test, since taking the first one
means we need to care about avoiding the same issue,
whenever we write a new (similar) test, even after the modification.

I've used the patch to check below things.
1. The patch can be applied and make check-world has passed without failure.
2. HEAD applied with the patch passed all tests in src/test/subscription
   (including 031_column_list.pl), after commenting out of WalSndWaitForWal's WalSndKeepalive.
3. The new bug fix report in 'How is this possible "publication does not exist"' thread
   has been fixed. FYI, after I execute the script's function, I also conduct
   additional insert to the publisher, and this was correctly replicated on the subscriber.

Best Regards,
    Takamichi Osumi

Re: Build-farm - intermittent error in 031_column_list.pl

From

Kyotaro Horiguchi

Date:

25 May 2022, 02:46:06

At Tue, 24 May 2022 18:19:45 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >
> > > Also, we'd probably have to ignore RelationSyncEntry for a while, which
> > > seems quite expensive.
> > >
> >
> > Yet another option could be that we continue using a historic snapshot
> > but ignore publications that are not found for the purpose of
> > computing RelSyncEntry attributes. We won't mark such an entry as
> > valid till all the publications are loaded without anything missing. I
> > think such cases in practice won't be enough to matter. This means we
> > won't publish operations on tables corresponding to that publication
> > till we found such a publication and that seems okay.
> >
> 
> Attached, find the patch to show what I have in mind for this. Today,
> we have received a bug report with a similar symptom [1] and that
> should also be fixed with this. The reported bug should also be fixed
> with this.
> 
> Thoughts?
> 
> 
> [1] -
https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

It does "fix" the case of [1].  But AFAIS
RelationSyncEntry.replicate_valid is only used to inhibit repeated
loading in get_rel_sync_entry and the function doesn't seem to be
assumed to return a invalid entry. (Since the flag is not checked
nowhere else.)

For example pgoutput_change does not check for the flag of the entry
returned from the function before uses it, which is not seemingly
safe. (I didn't check further, though)

Don't we need to explicitly avoid using invalid entries outside the
function?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Build-farm - intermittent error in 031_column_list.pl

From

Amit Kapila

Date:

25 May 2022, 11:26:11

On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> It does "fix" the case of [1].  But AFAIS
> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> loading in get_rel_sync_entry and the function doesn't seem to be
> assumed to return a invalid entry. (Since the flag is not checked
> nowhere else.)
>
> For example pgoutput_change does not check for the flag of the entry
> returned from the function before uses it, which is not seemingly
> safe. (I didn't check further, though)
>
> Don't we need to explicitly avoid using invalid entries outside the
> function?
>

We decide that based on pubactions in the callers, so even if entry is
valid, it won't do anything.  Actually, we don't need to avoid setting
replication_valid flag as some of the publications for the table may
be already present. We can check if the publications_valid flag is set
while trying to validate the entry. Now, even if we don't find any
publications the replicate_valid flag will be set but none of the
actions will be set, so it won't do anything in the caller. Is this
better than the previous approach?

-- 
With Regards,
Amit Kapila.

Attachment

skip_missing_pubs.2.patch

Re: Build-farm - intermittent error in 031_column_list.pl

From

Tomas Vondra

Date:

25 May 2022, 13:24:33

On 5/25/22 13:26, Amit Kapila wrote:
> On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>>
>> It does "fix" the case of [1].  But AFAIS
>> RelationSyncEntry.replicate_valid is only used to inhibit repeated
>> loading in get_rel_sync_entry and the function doesn't seem to be
>> assumed to return a invalid entry. (Since the flag is not checked
>> nowhere else.)
>>
>> For example pgoutput_change does not check for the flag of the entry
>> returned from the function before uses it, which is not seemingly
>> safe. (I didn't check further, though)
>>
>> Don't we need to explicitly avoid using invalid entries outside the
>> function?
>>
> 
> We decide that based on pubactions in the callers, so even if entry is
> valid, it won't do anything.  Actually, we don't need to avoid setting
> replication_valid flag as some of the publications for the table may
> be already present. We can check if the publications_valid flag is set
> while trying to validate the entry. Now, even if we don't find any
> publications the replicate_valid flag will be set but none of the
> actions will be set, so it won't do anything in the caller. Is this
> better than the previous approach?
> 

For the record, I'm not convinced this is the right way to fix the
issue, as it may easily mask the real problem.

We do silently ignore missing objects in various places, but only when
either requested or when it's obvious it's expected and safe to ignore.
But I'm not sure that applies here, in a clean way.

Imagine you have a subscriber using two publications p1 and p2, and
someone comes around and drops p1 by mistake. With the proposed patch,
the subscription will notice this, but it'll continue sending data
ignoring the missing publication. Yes, it will continue working, but
it's quite possible this breaks the subscriber and it's be better to
fail and stop replicating.

The other aspect I dislike is that we just stop caching publication
info, forcing us to reload it for every replicated change/row. So even
if dropping the publication happens not to "break" the subscriber (i.e.
the data makes sense), this may easily cause performance issues, lag in
the replication, and so on. And the users will have no idea why and/or
how to fix it, because we just do this silently.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Build-farm - intermittent error in 031_column_list.pl

From

Tom Lane

Date:

25 May 2022, 14:28:05

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> The other aspect I dislike is that we just stop caching publication
> info, forcing us to reload it for every replicated change/row.

Ouch --- that seems likely to be completely horrid.  At the very least
I'd want to see some benchmark numbers before concluding we could
live with that.

            regards, tom lane

Re: Build-farm - intermittent error in 031_column_list.pl

From

Amit Kapila

Date:

26 May 2022, 02:37:27

On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 5/25/22 13:26, Amit Kapila wrote:
> > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >>
> >> It does "fix" the case of [1].  But AFAIS
> >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> >> loading in get_rel_sync_entry and the function doesn't seem to be
> >> assumed to return a invalid entry. (Since the flag is not checked
> >> nowhere else.)
> >>
> >> For example pgoutput_change does not check for the flag of the entry
> >> returned from the function before uses it, which is not seemingly
> >> safe. (I didn't check further, though)
> >>
> >> Don't we need to explicitly avoid using invalid entries outside the
> >> function?
> >>
> >
> > We decide that based on pubactions in the callers, so even if entry is
> > valid, it won't do anything.  Actually, we don't need to avoid setting
> > replication_valid flag as some of the publications for the table may
> > be already present. We can check if the publications_valid flag is set
> > while trying to validate the entry. Now, even if we don't find any
> > publications the replicate_valid flag will be set but none of the
> > actions will be set, so it won't do anything in the caller. Is this
> > better than the previous approach?
> >
>
> For the record, I'm not convinced this is the right way to fix the
> issue, as it may easily mask the real problem.
>
> We do silently ignore missing objects in various places, but only when
> either requested or when it's obvious it's expected and safe to ignore.
> But I'm not sure that applies here, in a clean way.
>
> Imagine you have a subscriber using two publications p1 and p2, and
> someone comes around and drops p1 by mistake. With the proposed patch,
> the subscription will notice this, but it'll continue sending data
> ignoring the missing publication. Yes, it will continue working, but
> it's quite possible this breaks the subscriber and it's be better to
> fail and stop replicating.
>

Ideally, shouldn't we disallow drop of publication in such cases where
it is part of some subscription? I know it will be tricky because some
subscriptions could be disabled.

> The other aspect I dislike is that we just stop caching publication
> info, forcing us to reload it for every replicated change/row. So even
> if dropping the publication happens not to "break" the subscriber (i.e.
> the data makes sense), this may easily cause performance issues, lag in
> the replication, and so on. And the users will have no idea why and/or
> how to fix it, because we just do this silently.
>

Yeah, this is true that if there are missing publications, it needs to
reload all the publications info again unless we build a mechanism to
change the existing cached entry by loading only required info. The
other thing we could do here is to LOG the info for missing
publications to make users aware of the fact. I think we can also
introduce a new option while defining/altering subscription to
indicate whether to continue on missing publication or not, that way
by default we will stop replication as we are doing now but users will
have a way to move replication.

BTW, what are the other options we have to fix the cases where
replication is broken (or the user has no clue on how to proceed) as
we are discussing the case here or the OP reported yet another case on
pgsql-bugs [1]?

[1] -
https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

RE: Build-farm - intermittent error in 031_column_list.pl

From

"osumi.takamichi@fujitsu.com"

Date:

01 June 2022, 02:06:11

On Thursday, May 26, 2022 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 5/25/22 13:26, Amit Kapila wrote:
> > > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > >>
> > >> It does "fix" the case of [1].  But AFAIS
> > >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> > >> loading in get_rel_sync_entry and the function doesn't seem to be
> > >> assumed to return a invalid entry. (Since the flag is not checked
> > >> nowhere else.)
> > >>
> > >> For example pgoutput_change does not check for the flag of the
> > >> entry returned from the function before uses it, which is not
> > >> seemingly safe. (I didn't check further, though)
> > >>
> > >> Don't we need to explicitly avoid using invalid entries outside the
> > >> function?
> > >>
> > >
> > > We decide that based on pubactions in the callers, so even if entry
> > > is valid, it won't do anything.  Actually, we don't need to avoid
> > > setting replication_valid flag as some of the publications for the
> > > table may be already present. We can check if the publications_valid
> > > flag is set while trying to validate the entry. Now, even if we
> > > don't find any publications the replicate_valid flag will be set but
> > > none of the actions will be set, so it won't do anything in the
> > > caller. Is this better than the previous approach?
> > >
> >
> > For the record, I'm not convinced this is the right way to fix the
> > issue, as it may easily mask the real problem.
> >
> > We do silently ignore missing objects in various places, but only when
> > either requested or when it's obvious it's expected and safe to ignore.
> > But I'm not sure that applies here, in a clean way.
> >
> > Imagine you have a subscriber using two publications p1 and p2, and
> > someone comes around and drops p1 by mistake. With the proposed patch,
> > the subscription will notice this, but it'll continue sending data
> > ignoring the missing publication. Yes, it will continue working, but
> > it's quite possible this breaks the subscriber and it's be better to
> > fail and stop replicating.
> >
> 
> Ideally, shouldn't we disallow drop of publication in such cases where it is part
> of some subscription? I know it will be tricky because some subscriptions
> could be disabled.
> 
> > The other aspect I dislike is that we just stop caching publication
> > info, forcing us to reload it for every replicated change/row. So even
> > if dropping the publication happens not to "break" the subscriber (i.e.
> > the data makes sense), this may easily cause performance issues, lag
> > in the replication, and so on. And the users will have no idea why
> > and/or how to fix it, because we just do this silently.
> >
> 
> Yeah, this is true that if there are missing publications, it needs to reload all the
> publications info again unless we build a mechanism to change the existing
> cached entry by loading only required info. The other thing we could do here is
> to LOG the info for missing publications to make users aware of the fact. I think
> we can also introduce a new option while defining/altering subscription to
> indicate whether to continue on missing publication or not, that way by default
> we will stop replication as we are doing now but users will have a way to move
> replication.
> 
> BTW, what are the other options we have to fix the cases where replication is
> broken (or the user has no clue on how to proceed) as we are discussing the
> case here or the OP reported yet another case on pgsql-bugs [1]?
Hi, 


FYI, I've noticed that after the last report by Peter-san
we've gotten the same errors on Build Farm.
We need to keep discussing to conclude this.


1. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-05-31 13:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04


2. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-05-26 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04


Best Regards,
    Takamichi Osumi

Re: Build-farm - intermittent error in 031_column_list.pl

From

Alexander Lakhin

Date:

16 January 2024, 18:00:01

Hello hackers,

01.06.2022 05:06, osumi.takamichi@fujitsu.com wrote:
> FYI, I've noticed that after the last report by Peter-san
> we've gotten the same errors on Build Farm.
> We need to keep discussing to conclude this.
>
>
> 1. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-05-31 13:00:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04
>
>
> 2. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-05-26 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04

I think, I've discovered what causes that test failure.
When playing with bgwriter during researching [1], I made it more
aggressive with a dirty hack:
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1

          rc = WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

With this modification, I ran `make check -C src/test/subscription` in a
loop and observed the same failure of 031_column_list as discussed here.
With log_min_messages = DEBUG2, I see that in a failed case there is no
'"sub1" has now caught up' message (as Amit and Tomas guessed upthread)
(See full logs from one successful and two failed runs attached (I added
some extra logging and enabled wal_debug to better understand what's going
on here).)

If we look at the failures occurred in the buildfarm:
The first two from the past:
1)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04

2022-05-26 20:39:23.828 CEST [276284][postmaster][:0][] LOG: starting PostgreSQL 15beta1 on x86_64-pc-linux-gnu, 
compiled by gcc-10 (Debian 10.3.0-15) 10.3.0, 64-bit
...
2022-05-26 20:39:39.768 CEST [277545][walsender][3/0:0][sub1] ERROR:  publication "pub6" does not exist

2)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04

2022-05-31 16:33:25.506 CEST [3223685][postmaster][:0][] LOG: starting PostgreSQL 15beta1 on x86_64-pc-linux-gnu, 
compiled by clang version 6.0.1 , 64-bit
...
2022-05-31 16:33:41.114 CEST [3224511][walsender][3/0:0][sub1] ERROR:  publication "pub6" does not exist

The other two from the present:
3)

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2023-12-14%2009%3A14%3A52&stg=subscription-check

2023-12-14 09:26:12.523 UTC [1144979][postmaster][:0] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled
by
 
Debian clang version 13.0.1-11+b2, 64-bit
...
2023-12-14 09:26:28.663 UTC [1157936][walsender][3/0:0] ERROR: publication "pub6" does not exist

4)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24

2023-11-17 18:31:13.594 UTC [200939] LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by clang-14.0.6,
64-bit
...
2023-11-17 18:31:29.292 UTC [222103] sub1 ERROR:  publication "pub6" does not exist

we can see that all the failures occurred approximately since 16 seconds
after the server start. And it's very close to predefined
LOG_SNAPSHOT_INTERVAL_MS.

[1] https://www.postgresql.org/message-id/6f85667e-5754-5d35-dbf1-c83fe08c1e48%40gmail.com

Best regards,
Alexander

Attachment

031_column_list-runs.7z