Thread: Build-farm - intermittent error in 031_column_list.pl
Hi hackers. FYI, I saw that there was a recent Build-farm error on the "grison" machine [1] [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD The error happened during "subscriptionCheck" phase in the TAP test t/031_column_list.pl This test file was added by this [2] commit. [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5 ~~ I checked the history of fails for that TAP test t/031_column_list.pl and found that this same error seems to have been happening intermittently for at least the last 50 days. Details of similar previous errors from the BF are listed below. ~~~ 1. Details for system "grison" failure at stage subscriptionCheck, snapshot taken 2022-05-18 18:11:45 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45 [22:02:08] t/029_on_error.pl .................. ok 25475 ms ( 0.01 usr 0.00 sys + 15.39 cusr 5.59 csys = 20.99 CPU) # poll_query_until timed out executing this query: # SELECT '0/1530588' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 29 just after 22. [22:09:25] t/031_column_list.pl ............... ... [22:02:47.887](1.829s) ok 22 - partitions with different replica identities not replicated correctly Waiting for replication conn sub1's replay_lsn to pass 0/1530588 on publisher [22:09:25.395](397.508s) # poll_query_until timed out executing this query: # SELECT '0/1530588' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: timed out waiting for catchup at t/031_column_list.pl line 728. ### Stopping node "publisher" using mode immediate ~~~ 2. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-04-16 21:00:04 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04 [00:15:32] t/029_on_error.pl .................. ok 8278 ms ( 0.00 usr 0.00 sys + 1.33 cusr 0.55 csys = 1.88 CPU) # poll_query_until timed out executing this query: # SELECT '0/1543648' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 29 just after 25. [00:22:30] t/031_column_list.pl ............... ... [00:16:04.100](0.901s) ok 25 - partitions with different replica identities not replicated correctly Waiting for replication conn sub1's replay_lsn to pass 0/1543648 on publisher [00:22:29.923](385.823s) # poll_query_until timed out executing this query: # SELECT '0/1543648' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: timed out waiting for catchup at t/031_column_list.pl line 818. ~~~ 3. 
Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-04-05 17:30:04 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04 # poll_query_until timed out executing this query: # SELECT '0/1528640' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 29 just after 22. [20:50:25] t/031_column_list.pl ............... ... ok 22 - partitions with different replica identities not replicated correctly Waiting for replication conn sub1's replay_lsn to pass 0/1528640 on publisher # poll_query_until timed out executing this query: # SELECT '0/1528640' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: timed out waiting for catchup at t/031_column_list.pl line 667. ~~~ 4. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-04-05 17:30:04 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04 [20:43:04] t/030_sequences.pl ................. ok 11108 ms ( 0.00 usr 0.00 sys + 1.49 cusr 0.40 csys = 1.89 CPU) # poll_query_until timed out executing this query: # SELECT '0/1528640' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 29 just after 22. [20:50:25] t/031_column_list.pl ............... ... ok 22 - partitions with different replica identities not replicated correctly Waiting for replication conn sub1's replay_lsn to pass 0/1528640 on publisher # poll_query_until timed out executing this query: # SELECT '0/1528640' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: timed out waiting for catchup at t/031_column_list.pl line 667. ~~~ 5. Details for system "grison" failure at stage subscriptionCheck, snapshot taken 2022-04-03 18:11:39 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-04-03%2018%3A11%3A39 [22:28:00] t/030_sequences.pl ................. ok 22970 ms ( 0.01 usr 0.00 sys + 14.93 cusr 5.14 csys = 20.08 CPU) # poll_query_until timed out executing this query: # SELECT '0/1528CF0' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 29 just after 22. [22:35:11] t/031_column_list.pl ............... .... 
ok 22 - partitions with different replica identities not replicated correctly Waiting for replication conn sub1's replay_lsn to pass 0/1528CF0 on publisher # poll_query_until timed out executing this query: # SELECT '0/1528CF0' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('sub1', 'walreceiver') # expecting this output: # t # last actual query output: # # with stderr: timed out waiting for catchup at t/031_column_list.pl line 667. ---- Kind Regards, Peter Smith. Fujitsu Australia
At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
> Hi hackers.
>
> FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
>
> The error happened during "subscriptionCheck" phase in the TAP test
> t/031_column_list.pl
> This test file was added by this [2] commit.
> [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

What is happening in all of them looks like this: a publication that was created by CREATE PUBLICATION without any reported failure is missing when a walsender later looks for it. It seems like CREATE PUBLICATION can silently fail to create a publication, or the walsender somehow failed to find the existing one.

> ~~
>
> I checked the history of fails for that TAP test t/031_column_list.pl
> and found that this same error seems to have been happening
> intermittently for at least the last 50 days.
>
> Details of similar previous errors from the BF are listed below.
>
> ~~~
>
> 1. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-05-18 18:11:45
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45
>
> [22:02:08] t/029_on_error.pl .................. ok 25475 ms ( 0.01
> usr 0.00 sys + 15.39 cusr 5.59 csys = 20.99 CPU)
> # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> # FROM pg_catalog.pg_stat_replication
> # WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 22.
> [22:09:25] t/031_column_list.pl ...............
> ...
> [22:02:47.887](1.829s) ok 22 - partitions with different replica
> identities not replicated correctly Waiting for replication conn
> sub1's replay_lsn to pass 0/1530588 on publisher
> [22:09:25.395](397.508s) # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> # FROM pg_catalog.pg_stat_replication
> # WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> timed out waiting for catchup at t/031_column_list.pl line 728.
> ### Stopping node "publisher" using mode immediate

2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659, and then the session is disconnected. But the following request for the same publication fails due to the absence of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist

> ~~~
>
> 2. Details for system "xenodermus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-16 21:00:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04

The same. pub9 is missing after creation.

> ~~~
>
> 3. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

The same happens for pub7.

> 4. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

Same. pub7 is missing.

> 5. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-04-03 18:11:39
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-04-03%2018%3A11%3A39

Same. pub7 is missing.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in > > Hi hackers. > > > > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1] > > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD > > > > The error happened during "subscriptionCheck" phase in the TAP test > > t/031_column_list.pl > > This test file was added by this [2] commit. > > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5 > > What is happening for all of them looks like that the name of a > publication created by CREATE PUBLICATION without a failure report is > missing for a walsender came later. It seems like CREATE PUBLICATION > can silently fail to create a publication, or walsender somehow failed > to find existing one. > Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed? > > > ~~ > > > > 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATIONpub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true); > 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002user=bf database=postgres host=[local] > > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then > the session has been disconnected. But the following request for the > same publication fails due to the absense of the publication. > > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88(proto_version '3', publication_names '"pub9"') > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The probable theory is that ALTER SUBSCRIPTION will lead to restarting of apply worker (which we can see in LOGS as well) and after the restart, the apply worker will use the existing slot and replication origin corresponding to the subscription. Now, it is possible that before restart the origin has not been updated and the WAL start location points to a location prior to where PUBLICATION pub9 exists which can lead to such an error. Once this error occurs, apply worker will never be able to proceed and will always return the same error. Does this make sense? Unless you or others see a different theory, this seems to be the existing problem in logical replication which is manifested by this test. If we just want to fix these test failures, we can create a new subscription instead of altering the existing publication to point to the new publication. Note: Added Tomas to know his views as he has committed this test. -- With Regards, Amit Kapila.
On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in > > > Hi hackers. > > > > > > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1] > > > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD > > > > > > The error happened during "subscriptionCheck" phase in the TAP test > > > t/031_column_list.pl > > > This test file was added by this [2] commit. > > > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5 > > > > What is happening for all of them looks like that the name of a > > publication created by CREATE PUBLICATION without a failure report is > > missing for a walsender came later. It seems like CREATE PUBLICATION > > can silently fail to create a publication, or walsender somehow failed > > to find existing one. > > > > Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed? > > > > > > ~~ > > > > > > > 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATIONpub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true); > > 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002user=bf database=postgres host=[local] > > > > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then > > the session has been disconnected. But the following request for the > > same publication fails due to the absense of the publication. > > > > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88(proto_version '3', publication_names '"pub9"') > > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist > > > > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The > probable theory is that ALTER SUBSCRIPTION will lead to restarting of > apply worker (which we can see in LOGS as well) and after the restart, > the apply worker will use the existing slot and replication origin > corresponding to the subscription. Now, it is possible that before > restart the origin has not been updated and the WAL start location > points to a location prior to where PUBLICATION pub9 exists which can > lead to such an error. Once this error occurs, apply worker will never > be able to proceed and will always return the same error. Does this > make sense? > > Unless you or others see a different theory, this seems to be the > existing problem in logical replication which is manifested by this > test. If we just want to fix these test failures, we can create a new > subscription instead of altering the existing publication to point to > the new publication. > If the above theory is correct then I think allowing the publisher to catch up with "$node_publisher->wait_for_catchup('sub1');" before ALTER SUBSCRIPTION should fix this problem. Because if before ALTER both publisher and subscriber are in sync then the new publication should be visible to WALSender. -- With Regards, Amit Kapila.
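To make the suggested ordering concrete, a minimal sketch in TAP-test style follows (the table and publication names below are placeholders, not the exact statements used in 031_column_list.pl):

```perl
# Sketch only: hypothetical object names. The point is the ordering --
# wait for the subscriber to catch up before switching publications.
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION pub_new FOR TABLE tab_x (a)");

# Ensure sub1 has replayed everything up to this point, so the apply
# worker's restart position is past the CREATE PUBLICATION above.
$node_publisher->wait_for_catchup('sub1');

$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION sub1 SET PUBLICATION pub_new");
```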
At Thu, 19 May 2022 16:42:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> > probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> > apply worker (which we can see in LOGS as well) and after the restart,

Yes.

> > the apply worker will use the existing slot and replication origin
> > corresponding to the subscription. Now, it is possible that before
> > restart the origin has not been updated and the WAL start location
> > points to a location prior to where PUBLICATION pub9 exists which can
> > lead to such an error. Once this error occurs, apply worker will never
> > be able to proceed and will always return the same error. Does this
> > make sense?

Wow. I hadn't thought along that line. That theory explains the silence and makes sense, even though I don't see LSN transitions that clearly support it. I dimly remember a similar kind of problem.

> > Unless you or others see a different theory, this seems to be the
> > existing problem in logical replication which is manifested by this
> > test. If we just want to fix these test failures, we can create a new
> > subscription instead of altering the existing publication to point to
> > the new publication.
> >
>
> If the above theory is correct then I think allowing the publisher to
> catch up with "$node_publisher->wait_for_catchup('sub1');" before
> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
> both publisher and subscriber are in sync then the new publication
> should be visible to WALSender.

It looks right to me. That time travel seems unintuitive, but it's the (current) way it works.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > the apply worker will use the existing slot and replication origin > > > corresponding to the subscription. Now, it is possible that before > > > restart the origin has not been updated and the WAL start location > > > points to a location prior to where PUBLICATION pub9 exists which can > > > lead to such an error. Once this error occurs, apply worker will never > > > be able to proceed and will always return the same error. Does this > > > make sense? > > Wow. I didin't thought that line. That theory explains the silence and > makes sense even though I don't see LSN transistions that clearly > support it. I dimly remember a similar kind of problem.. > > > > Unless you or others see a different theory, this seems to be the > > > existing problem in logical replication which is manifested by this > > > test. If we just want to fix these test failures, we can create a new > > > subscription instead of altering the existing publication to point to > > > the new publication. > > > > > > > If the above theory is correct then I think allowing the publisher to > > catch up with "$node_publisher->wait_for_catchup('sub1');" before > > ALTER SUBSCRIPTION should fix this problem. Because if before ALTER > > both publisher and subscriber are in sync then the new publication > > should be visible to WALSender. > > It looks right to me. > Let's wait for Tomas or others working in this area to share their thoughts. > That timetravel seems inintuitive but it's the > (current) way it works. > I have thought about it but couldn't come up with a good way to change the way currently it works. Moreover, I think it is easy to hit this in other ways as well. Say, you first create a subscription with a non-existent publication and then do operation on any unrelated table on the publisher before creating the required publication, we will hit exactly this problem of "publication does not exist", so I think we may need to live with this behavior and write tests carefully. -- With Regards, Amit Kapila.
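As an illustration of that second way to hit it, a rough TAP-style sketch (hypothetical object names; this assumes the subscription can be created while the publication does not exist yet):

```perl
# Sketch only: hypothetical names, illustrating the problematic ordering.
my $publisher_connstr = $node_publisher->connstr('postgres');

# The subscription references a publication that does not exist yet.
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub_x CONNECTION '$publisher_connstr' PUBLICATION pub_late");

# Activity on an unrelated table gives the walsender changes to decode
# from a point that precedes the publication's creation.
$node_publisher->safe_psql('postgres',
	"INSERT INTO unrelated_tab VALUES (1)");

# Creating the publication now is too late for those earlier changes: the
# walsender decodes them with a historic snapshot in which pub_late does
# not exist, so it keeps failing with "publication does not exist".
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION pub_late FOR TABLE tab_y");
```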
On 5/20/22 05:58, Amit Kapila wrote: > On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> >>>> the apply worker will use the existing slot and replication origin >>>> corresponding to the subscription. Now, it is possible that before >>>> restart the origin has not been updated and the WAL start location >>>> points to a location prior to where PUBLICATION pub9 exists which can >>>> lead to such an error. Once this error occurs, apply worker will never >>>> be able to proceed and will always return the same error. Does this >>>> make sense? >> >> Wow. I didin't thought that line. That theory explains the silence and >> makes sense even though I don't see LSN transistions that clearly >> support it. I dimly remember a similar kind of problem.. >> >>>> Unless you or others see a different theory, this seems to be the >>>> existing problem in logical replication which is manifested by this >>>> test. If we just want to fix these test failures, we can create a new >>>> subscription instead of altering the existing publication to point to >>>> the new publication. >>>> >>> >>> If the above theory is correct then I think allowing the publisher to >>> catch up with "$node_publisher->wait_for_catchup('sub1');" before >>> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER >>> both publisher and subscriber are in sync then the new publication >>> should be visible to WALSender. >> >> It looks right to me. >> > > Let's wait for Tomas or others working in this area to share their thoughts. > Are we really querying the publications (in get_rel_sync_entry) using the historical snapshot? I haven't really realized this, but yeah, that might explain the issue. The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more often than any other test (there are ~15 calls, 12 of which are in this new test). That might be why we haven't seen failures before. Or maybe the existing tests simply are not vulnerable to this, because they either do wait_for_catchup late enough or don't do any DML right before executing SET PUBLICATION. >> That timetravel seems inintuitive but it's the >> (current) way it works. >> > > I have thought about it but couldn't come up with a good way to change > the way currently it works. Moreover, I think it is easy to hit this > in other ways as well. Say, you first create a subscription with a > non-existent publication and then do operation on any unrelated table > on the publisher before creating the required publication, we will hit > exactly this problem of "publication does not exist", so I think we > may need to live with this behavior and write tests carefully. > Yeah, I think it pretty much requires ensuring the subscriber is fully caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the replication in an unrecoverable way (actually, you can alter the subscription and remove the publication again, right?). But this is not just about tests, of course - the same issue applies to regular replication. That's a bit unfortunate, so maybe we should think about making this less fragile. We might make sure the subscriber is not lagging (essentially the wait_for_catchup) - which the users will have to do anyway (although maybe they know the publisher is beyond the LSN where it was created). The other option would be to detect such case, somehow - if you don't see the publication yet, see if it exists in current snapshot, and then maybe ignore this error. 
But that has other issues (the publication might have been created and dropped, in which case you won't see it). Also, we'd probably have to ignore RelationSyncEntry for a while, which seems quite expensive. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thursday, May 19, 2022 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith > > > <smithpb2250@gmail.com> wrote in > > > > Hi hackers. > > > > > > > > FYI, I saw that there was a recent Build-farm error on the > > > > "grison" machine [1] [1] > > > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison > > > > &br=HEAD > > > > > > > > The error happened during "subscriptionCheck" phase in the TAP > > > > test t/031_column_list.pl This test file was added by this [2] > > > > commit. > > > > [2] > > > > > https://github.com/postgres/postgres/commit/923def9a533a7d986acfb5 > > > > 24139d8b9e5466d0a5 > > > > > > 2022-04-17 00:16:04.278 CEST [293659][client > > > backend][4/270:0][031_column_list.pl] LOG: statement: CREATE > > > PUBLICATION pub9 FOR TABLE test_part_d (a) WITH > > > (publish_via_partition_root = true); > > > 2022-04-17 00:16:04.279 CEST [293659][client > > > backend][:0][031_column_list.pl] LOG: disconnection: session time: > > > 0:00:00.002 user=bf database=postgres host=[local] > > > > > > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then > > > the session has been disconnected. But the following request for the > > > same publication fails due to the absense of the publication. > > > > > > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] > > > STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 > > > (proto_version '3', publication_names '"pub9"') > > > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: > > > publication "pub9" does not exist > > > > > > > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". > The > > probable theory is that ALTER SUBSCRIPTION will lead to restarting of > > apply worker (which we can see in LOGS as well) and after the restart, > > the apply worker will use the existing slot and replication origin > > corresponding to the subscription. Now, it is possible that before > > restart the origin has not been updated and the WAL start location > > points to a location prior to where PUBLICATION pub9 exists which can > > lead to such an error. Once this error occurs, apply worker will never > > be able to proceed and will always return the same error. Does this > > make sense? > > > > Unless you or others see a different theory, this seems to be the > > existing problem in logical replication which is manifested by this > > test. If we just want to fix these test failures, we can create a new > > subscription instead of altering the existing publication to point to > > the new publication. > > > > If the above theory is correct then I think allowing the publisher to catch up with > "$node_publisher->wait_for_catchup('sub1');" before ALTER SUBSCRIPTION > should fix this problem. Because if before ALTER both publisher and > subscriber are in sync then the new publication should be visible to > WALSender. Hi, I've attached a patch for the fix proposed here. First of all, thank you so much for helping me offlist, Amit-san. I reproduced the failure like [1] by commenting out WalSndWaitForWal's call of WalSndKeepalive and running the test. This comment out intends to suppress the advance of confirmed_flush location after creating a publication. In short, my understanding how the bug happened is, 1. 
we execute 'create publication pubX' and create one publication;
2. 'alter subscription subY set publication pubX' makes the apply worker exit;
3. the relaunched apply worker searches for pubX, but the slot position (confirmed_flush) doesn't get updated and still points to some location before the CREATE PUBLICATION on the publisher node.

Applying the attached patch has made the test pass.

[1] the subscriber's log

2022-05-20 08:56:50.773 UTC [5153] 031_column_list.pl LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub6
2022-05-20 08:56:50.801 UTC [5156] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:50.846 UTC [5112] LOG: logical replication apply worker for subscription "sub1" will restart because of a parameter change
2022-05-20 08:56:50.915 UTC [5158] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
...
2022-05-20 08:56:51.257 UTC [5164] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.353 UTC [5166] LOG: logical replication apply worker for subscription "sub1" has started
2022-05-20 08:56:51.366 UTC [5168] LOG: logical replication table synchronization worker for subscription "sub1", table "test_part_a" has started
2022-05-20 08:56:51.370 UTC [5171] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.373 UTC [5166] ERROR: could not receive data from WAL stream: ERROR: publication "pub6" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C61B8
2022-05-20 08:56:51.374 UTC [4338] LOG: background worker "logical replication worker" (PID 5166) exited with exit code 1

Best Regards,
	Takamichi Osumi
Attachment
On Fri, May 20, 2022 at 4:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 5/20/22 05:58, Amit Kapila wrote: > > Are we really querying the publications (in get_rel_sync_entry) using > the historical snapshot? > Yes. > I haven't really realized this, but yeah, that > might explain the issue. > > The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more > often than any other test (there are ~15 calls, 12 of which are in this > new test). That might be why we haven't seen failures before. Or maybe > the existing tests simply are not vulnerable to this, > Right, I have checked the other cases are not vulnerable to this, otherwise, I think we would have seen intermittent failures till now. They don't seem to be doing DMLs before the creation of a publication or they create a subscription pointing to the same publication before. > because they > either do wait_for_catchup late enough or don't do any DML right before > executing SET PUBLICATION. > > >> That timetravel seems inintuitive but it's the > >> (current) way it works. > >> > > > > I have thought about it but couldn't come up with a good way to change > > the way currently it works. Moreover, I think it is easy to hit this > > in other ways as well. Say, you first create a subscription with a > > non-existent publication and then do operation on any unrelated table > > on the publisher before creating the required publication, we will hit > > exactly this problem of "publication does not exist", so I think we > > may need to live with this behavior and write tests carefully. > > > > Yeah, I think it pretty much requires ensuring the subscriber is fully > caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the > replication in an unrecoverable way (actually, you can alter the > subscription and remove the publication again, right?). > Right. > But this is not just about tests, of course - the same issue applies to > regular replication. That's a bit unfortunate, so maybe we should think > about making this less fragile. > Agreed, provided we find some reasonable solution. > We might make sure the subscriber is not lagging (essentially the > wait_for_catchup) - which the users will have to do anyway (although > maybe they know the publisher is beyond the LSN where it was created). > This won't work for the case mentioned above where we create a subscription with non-existent publications, then perform DML and then 'CREATE PUBLICATION'. > The other option would be to detect such case, somehow - if you don't > see the publication yet, see if it exists in current snapshot, and then > maybe ignore this error. But that has other issues (the publication > might have been created and dropped, in which case you won't see it). > True, the dropped case would again be tricky to deal with and I think we will end up publishing some operations which are performed before the publication is even created. > Also, we'd probably have to ignore RelationSyncEntry for a while, which > seems quite expensive. > Yet another option could be that we continue using a historic snapshot but ignore publications that are not found for the purpose of computing RelSyncEntry attributes. We won't mark such an entry as valid till all the publications are loaded without anything missing. I think such cases in practice won't be enough to matter. This means we won't publish operations on tables corresponding to that publication till we found such a publication and that seems okay. -- With Regards, Amit Kapila.
On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
> > Also, we'd probably have to ignore RelationSyncEntry for a while, which
> > seems quite expensive.
> >
>
> Yet another option could be that we continue using a historic snapshot
> but ignore publications that are not found for the purpose of
> computing RelSyncEntry attributes. We won't mark such an entry as
> valid till all the publications are loaded without anything missing. I
> think such cases in practice won't be enough to matter. This means we
> won't publish operations on tables corresponding to that publication
> till we found such a publication and that seems okay.
>

Attached, find the patch to show what I have in mind for this. Today, we have received a bug report with a similar symptom [1], and that reported bug should also be fixed with this.

Thoughts?

[1] - https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

--
With Regards,
Amit Kapila.
Attachment
On Tuesday, May 24, 2022 9:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Fri, May 20, 2022 at 4:01 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > > > > > Also, we'd probably have to ignore RelationSyncEntry for a while, > > > which seems quite expensive. > > > > > > > Yet another option could be that we continue using a historic snapshot > > but ignore publications that are not found for the purpose of > > computing RelSyncEntry attributes. We won't mark such an entry as > > valid till all the publications are loaded without anything missing. I > > think such cases in practice won't be enough to matter. This means we > > won't publish operations on tables corresponding to that publication > > till we found such a publication and that seems okay. > > > > Attached, find the patch to show what I have in mind for this. Today, we have > received a bug report with a similar symptom [1] and that should also be fixed > with this. The reported bug should also be fixed with this. > > Thoughts? Hi, I agree with this direction. I think this approach solves the issue fundamentally and is better than the first approach to add several calls of wait_for_catchup in the test, since taking the first one means we need to care about avoiding the same issue, whenever we write a new (similar) test, even after the modification. I've used the patch to check below things. 1. The patch can be applied and make check-world has passed without failure. 2. HEAD applied with the patch passed all tests in src/test/subscription (including 031_column_list.pl), after commenting out of WalSndWaitForWal's WalSndKeepalive. 3. The new bug fix report in 'How is this possible "publication does not exist"' thread has been fixed. FYI, after I execute the script's function, I also conduct additional insert to the publisher, and this was correctly replicated on the subscriber. Best Regards, Takamichi Osumi
At Tue, 24 May 2022 18:19:45 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >
> > > Also, we'd probably have to ignore RelationSyncEntry for a while, which
> > > seems quite expensive.
> > >
> >
> > Yet another option could be that we continue using a historic snapshot
> > but ignore publications that are not found for the purpose of
> > computing RelSyncEntry attributes. We won't mark such an entry as
> > valid till all the publications are loaded without anything missing. I
> > think such cases in practice won't be enough to matter. This means we
> > won't publish operations on tables corresponding to that publication
> > till we found such a publication and that seems okay.
> >
>
> Attached, find the patch to show what I have in mind for this. Today,
> we have received a bug report with a similar symptom [1] and that
> should also be fixed with this. The reported bug should also be fixed
> with this.
>
> Thoughts?
>
>
> [1] - https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

It does "fix" the case of [1]. But AFAIS RelationSyncEntry.replicate_valid is only used to inhibit repeated loading in get_rel_sync_entry, and the function doesn't seem to be intended to return an invalid entry. (The flag is not checked anywhere else.)

For example, pgoutput_change does not check the flag of the entry returned from the function before using it, which does not seem safe. (I didn't check further, though.)

Don't we need to explicitly avoid using invalid entries outside the function?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > It does "fix" the case of [1]. But AFAIS > RelationSyncEntry.replicate_valid is only used to inhibit repeated > loading in get_rel_sync_entry and the function doesn't seem to be > assumed to return a invalid entry. (Since the flag is not checked > nowhere else.) > > For example pgoutput_change does not check for the flag of the entry > returned from the function before uses it, which is not seemingly > safe. (I didn't check further, though) > > Don't we need to explicitly avoid using invalid entries outside the > function? > We decide that based on pubactions in the callers, so even if entry is valid, it won't do anything. Actually, we don't need to avoid setting replication_valid flag as some of the publications for the table may be already present. We can check if the publications_valid flag is set while trying to validate the entry. Now, even if we don't find any publications the replicate_valid flag will be set but none of the actions will be set, so it won't do anything in the caller. Is this better than the previous approach? -- With Regards, Amit Kapila.
Attachment
On 5/25/22 13:26, Amit Kapila wrote: > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> >> It does "fix" the case of [1]. But AFAIS >> RelationSyncEntry.replicate_valid is only used to inhibit repeated >> loading in get_rel_sync_entry and the function doesn't seem to be >> assumed to return a invalid entry. (Since the flag is not checked >> nowhere else.) >> >> For example pgoutput_change does not check for the flag of the entry >> returned from the function before uses it, which is not seemingly >> safe. (I didn't check further, though) >> >> Don't we need to explicitly avoid using invalid entries outside the >> function? >> > > We decide that based on pubactions in the callers, so even if entry is > valid, it won't do anything. Actually, we don't need to avoid setting > replication_valid flag as some of the publications for the table may > be already present. We can check if the publications_valid flag is set > while trying to validate the entry. Now, even if we don't find any > publications the replicate_valid flag will be set but none of the > actions will be set, so it won't do anything in the caller. Is this > better than the previous approach? > For the record, I'm not convinced this is the right way to fix the issue, as it may easily mask the real problem. We do silently ignore missing objects in various places, but only when either requested or when it's obvious it's expected and safe to ignore. But I'm not sure that applies here, in a clean way. Imagine you have a subscriber using two publications p1 and p2, and someone comes around and drops p1 by mistake. With the proposed patch, the subscription will notice this, but it'll continue sending data ignoring the missing publication. Yes, it will continue working, but it's quite possible this breaks the subscriber and it's be better to fail and stop replicating. The other aspect I dislike is that we just stop caching publication info, forcing us to reload it for every replicated change/row. So even if dropping the publication happens not to "break" the subscriber (i.e. the data makes sense), this may easily cause performance issues, lag in the replication, and so on. And the users will have no idea why and/or how to fix it, because we just do this silently. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > The other aspect I dislike is that we just stop caching publication > info, forcing us to reload it for every replicated change/row. Ouch --- that seems likely to be completely horrid. At the very least I'd want to see some benchmark numbers before concluding we could live with that. regards, tom lane
On Wed, May 25, 2022 at 6:54 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 5/25/22 13:26, Amit Kapila wrote: > > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > >> > >> It does "fix" the case of [1]. But AFAIS > >> RelationSyncEntry.replicate_valid is only used to inhibit repeated > >> loading in get_rel_sync_entry and the function doesn't seem to be > >> assumed to return a invalid entry. (Since the flag is not checked > >> nowhere else.) > >> > >> For example pgoutput_change does not check for the flag of the entry > >> returned from the function before uses it, which is not seemingly > >> safe. (I didn't check further, though) > >> > >> Don't we need to explicitly avoid using invalid entries outside the > >> function? > >> > > > > We decide that based on pubactions in the callers, so even if entry is > > valid, it won't do anything. Actually, we don't need to avoid setting > > replication_valid flag as some of the publications for the table may > > be already present. We can check if the publications_valid flag is set > > while trying to validate the entry. Now, even if we don't find any > > publications the replicate_valid flag will be set but none of the > > actions will be set, so it won't do anything in the caller. Is this > > better than the previous approach? > > > > For the record, I'm not convinced this is the right way to fix the > issue, as it may easily mask the real problem. > > We do silently ignore missing objects in various places, but only when > either requested or when it's obvious it's expected and safe to ignore. > But I'm not sure that applies here, in a clean way. > > Imagine you have a subscriber using two publications p1 and p2, and > someone comes around and drops p1 by mistake. With the proposed patch, > the subscription will notice this, but it'll continue sending data > ignoring the missing publication. Yes, it will continue working, but > it's quite possible this breaks the subscriber and it's be better to > fail and stop replicating. > Ideally, shouldn't we disallow drop of publication in such cases where it is part of some subscription? I know it will be tricky because some subscriptions could be disabled. > The other aspect I dislike is that we just stop caching publication > info, forcing us to reload it for every replicated change/row. So even > if dropping the publication happens not to "break" the subscriber (i.e. > the data makes sense), this may easily cause performance issues, lag in > the replication, and so on. And the users will have no idea why and/or > how to fix it, because we just do this silently. > Yeah, this is true that if there are missing publications, it needs to reload all the publications info again unless we build a mechanism to change the existing cached entry by loading only required info. The other thing we could do here is to LOG the info for missing publications to make users aware of the fact. I think we can also introduce a new option while defining/altering subscription to indicate whether to continue on missing publication or not, that way by default we will stop replication as we are doing now but users will have a way to move replication. BTW, what are the other options we have to fix the cases where replication is broken (or the user has no clue on how to proceed) as we are discussing the case here or the OP reported yet another case on pgsql-bugs [1]? 
[1] - https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com -- With Regards, Amit Kapila.
On Thursday, May 26, 2022 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, May 25, 2022 at 6:54 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 5/25/22 13:26, Amit Kapila wrote: > > > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > >> > > >> It does "fix" the case of [1]. But AFAIS > > >> RelationSyncEntry.replicate_valid is only used to inhibit repeated > > >> loading in get_rel_sync_entry and the function doesn't seem to be > > >> assumed to return a invalid entry. (Since the flag is not checked > > >> nowhere else.) > > >> > > >> For example pgoutput_change does not check for the flag of the > > >> entry returned from the function before uses it, which is not > > >> seemingly safe. (I didn't check further, though) > > >> > > >> Don't we need to explicitly avoid using invalid entries outside the > > >> function? > > >> > > > > > > We decide that based on pubactions in the callers, so even if entry > > > is valid, it won't do anything. Actually, we don't need to avoid > > > setting replication_valid flag as some of the publications for the > > > table may be already present. We can check if the publications_valid > > > flag is set while trying to validate the entry. Now, even if we > > > don't find any publications the replicate_valid flag will be set but > > > none of the actions will be set, so it won't do anything in the > > > caller. Is this better than the previous approach? > > > > > > > For the record, I'm not convinced this is the right way to fix the > > issue, as it may easily mask the real problem. > > > > We do silently ignore missing objects in various places, but only when > > either requested or when it's obvious it's expected and safe to ignore. > > But I'm not sure that applies here, in a clean way. > > > > Imagine you have a subscriber using two publications p1 and p2, and > > someone comes around and drops p1 by mistake. With the proposed patch, > > the subscription will notice this, but it'll continue sending data > > ignoring the missing publication. Yes, it will continue working, but > > it's quite possible this breaks the subscriber and it's be better to > > fail and stop replicating. > > > > Ideally, shouldn't we disallow drop of publication in such cases where it is part > of some subscription? I know it will be tricky because some subscriptions > could be disabled. > > > The other aspect I dislike is that we just stop caching publication > > info, forcing us to reload it for every replicated change/row. So even > > if dropping the publication happens not to "break" the subscriber (i.e. > > the data makes sense), this may easily cause performance issues, lag > > in the replication, and so on. And the users will have no idea why > > and/or how to fix it, because we just do this silently. > > > > Yeah, this is true that if there are missing publications, it needs to reload all the > publications info again unless we build a mechanism to change the existing > cached entry by loading only required info. The other thing we could do here is > to LOG the info for missing publications to make users aware of the fact. I think > we can also introduce a new option while defining/altering subscription to > indicate whether to continue on missing publication or not, that way by default > we will stop replication as we are doing now but users will have a way to move > replication. 
>
> BTW, what are the other options we have to fix the cases where replication is
> broken (or the user has no clue on how to proceed) as we are discussing the
> case here or the OP reported yet another case on pgsql-bugs [1]?

Hi,

FYI, I've noticed that after the last report by Peter-san we've gotten the same errors on the Build Farm. We need to keep discussing this in order to reach a conclusion.

1. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-05-31 13:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04

2. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-05-26 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04

Best Regards,
	Takamichi Osumi
Hello hackers,

01.06.2022 05:06, osumi.takamichi@fujitsu.com wrote:
> FYI, I've noticed that after the last report by Peter-san
> we've gotten the same errors on Build Farm.
> We need to keep discussing to conclude this.
>
>
> 1. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-05-31 13:00:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04
>
>
> 2. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-05-26 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04

I think I've discovered what causes that test failure. When playing with bgwriter while researching [1], I made it more aggressive with a dirty hack:

-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1

 rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-               BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+               1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

With this modification, I ran `make check -C src/test/subscription` in a loop and observed the same failure of 031_column_list as discussed here. With log_min_messages = DEBUG2, I see that in a failed case there is no '"sub1" has now caught up' message (as Amit and Tomas guessed upthread). (See full logs from one successful and two failed runs attached; I added some extra logging and enabled wal_debug to better understand what's going on here.)

If we look at the failures that occurred in the buildfarm:

The first two from the past:
1) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04
2022-05-26 20:39:23.828 CEST [276284][postmaster][:0][] LOG: starting PostgreSQL 15beta1 on x86_64-pc-linux-gnu, compiled by gcc-10 (Debian 10.3.0-15) 10.3.0, 64-bit
...
2022-05-26 20:39:39.768 CEST [277545][walsender][3/0:0][sub1] ERROR: publication "pub6" does not exist

2) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04
2022-05-31 16:33:25.506 CEST [3223685][postmaster][:0][] LOG: starting PostgreSQL 15beta1 on x86_64-pc-linux-gnu, compiled by clang version 6.0.1 , 64-bit
...
2022-05-31 16:33:41.114 CEST [3224511][walsender][3/0:0][sub1] ERROR: publication "pub6" does not exist

The other two from the present:
3) https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2023-12-14%2009%3A14%3A52&stg=subscription-check
2023-12-14 09:26:12.523 UTC [1144979][postmaster][:0] LOG: starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by Debian clang version 13.0.1-11+b2, 64-bit
...
2023-12-14 09:26:28.663 UTC [1157936][walsender][3/0:0] ERROR: publication "pub6" does not exist

4) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
2023-11-17 18:31:13.594 UTC [200939] LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by clang-14.0.6, 64-bit
...
2023-11-17 18:31:29.292 UTC [222103] sub1 ERROR: publication "pub6" does not exist

we can see that all the failures occurred approximately 16 seconds after the server start, which is very close to the predefined LOG_SNAPSHOT_INTERVAL_MS.

[1] https://www.postgresql.org/message-id/6f85667e-5754-5d35-dbf1-c83fe08c1e48%40gmail.com

Best regards,
Alexander