Thread: BF animal dikkop reported a failure in 035_standby_logical_decoding
Hi hackers, I saw a buildfarm failure on "dikkop"[1]. It failed in 035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot and row_removal_activeslot are not invalidated after vacuum. regress_log_035_standby_logical_decoding: ``` [12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class [12:15:05.945](0.003s) [12:15:05.946](0.000s) # Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class' # at t/035_standby_logical_decoding.pl line 238. [12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class [12:15:05.949](0.001s) [12:15:05.950](0.000s) # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class' # at t/035_standby_logical_decoding.pl line 244. [13:38:26.977](5001.028s) # poll_query_until timed out executing this query: # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb' # expecting this output: # t # last actual query output: # f # with stderr: [13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated [13:38:26.982](0.002s) [13:38:26.982](0.000s) # Failed test 'confl_active_logicalslot updated' # at t/035_standby_logical_decoding.pl line 251. Timed out waiting confl_active_logicalslot to be updated at t/035_standby_logical_decoding.pl line 251. ``` 035_standby_logical_decoding.pl: ``` # This should trigger the conflict $node_primary->safe_psql( 'testdb', qq[ CREATE TABLE conflict_test(x integer, y text); DROP TABLE conflict_test; VACUUM pg_class; INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal ]); $node_primary->wait_for_replay_catchup($node_standby); # Check invalidation in the logfile and in pg_stat_database_conflicts check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class'); ``` Is it possible that the vacuum command didn't remove tuples and then the conflict was not triggered? It seems we can't confirm this because there is not enough information. Maybe "vacuum verbose" can be used to provide more information. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2023-05-24%2006%3A16%3A18 Regards, Shi Yu
Re: BF animal dikkop reported a failure in 035_standby_logical_decoding
From
"Drouvot, Bertrand"
Date:
Hi, On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote: > Hi hackers, > > I saw a buildfarm failure on "dikkop"[1]. It failed in > 035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot and > row_removal_activeslot are not invalidated after vacuum. Thanks for sharing! > > regress_log_035_standby_logical_decoding: > ``` > [12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class > [12:15:05.945](0.003s) > [12:15:05.946](0.000s) # Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class' > # at t/035_standby_logical_decoding.pl line 238. > [12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class > [12:15:05.949](0.001s) > [12:15:05.950](0.000s) # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class' > # at t/035_standby_logical_decoding.pl line 244. > [13:38:26.977](5001.028s) # poll_query_until timed out executing this query: > # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb' > # expecting this output: > # t > # last actual query output: > # f > # with stderr: > [13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated > [13:38:26.982](0.002s) > [13:38:26.982](0.000s) # Failed test 'confl_active_logicalslot updated' > # at t/035_standby_logical_decoding.pl line 251. > Timed out waiting confl_active_logicalslot to be updated at t/035_standby_logical_decoding.pl line 251. > ``` > > 035_standby_logical_decoding.pl: > ``` > # This should trigger the conflict > $node_primary->safe_psql( > 'testdb', qq[ > CREATE TABLE conflict_test(x integer, y text); > DROP TABLE conflict_test; > VACUUM pg_class; > INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal > ]); > > $node_primary->wait_for_replay_catchup($node_standby); > > # Check invalidation in the logfile and in pg_stat_database_conflicts > check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class'); > ``` > > Is it possible that the vacuum command didn't remove tuples and then the > conflict was not triggered? The flush_wal table added by Andres should guarantee that the WAL is flushed, so the only reason I can think about is indeed that the vacuum did not remove tuples ( but I don't get why/how that could be the case). > It seems we can't confirm this because there is not > enough information. Right, and looking at its status history most of the time the test is green (making it even more difficult to diagnose). > Maybe "vacuum verbose" can be used to provide more > information. I can see that dikkop "belongs" to Tomas (adding Tomas to this thread). Tomas, do you think it would be possible to run some 035_standby_logical_decoding.pl manually with "vacuum verbose" in the test mentioned above? (or any other way you can think about that would help diagnose this random failure?). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: BF animal dikkop reported a failure in 035_standby_logical_decoding
From
"Yu Shi (Fujitsu)"
Date:
On Monday, May 29, 2023 5:22 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote: > > Hi hackers, > > > > I saw a buildfarm failure on "dikkop"[1]. It failed in > > 035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot > and > > row_removal_activeslot are not invalidated after vacuum. > > Thanks for sharing! > > > > > regress_log_035_standby_logical_decoding: > > ``` > > [12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged with > vacuum on pg_class > > [12:15:05.945](0.003s) > > [12:15:05.946](0.000s) # Failed test 'inactiveslot slot invalidation is logged > with vacuum on pg_class' > > # at t/035_standby_logical_decoding.pl line 238. > > [12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged with > vacuum on pg_class > > [12:15:05.949](0.001s) > > [12:15:05.950](0.000s) # Failed test 'activeslot slot invalidation is logged with > vacuum on pg_class' > > # at t/035_standby_logical_decoding.pl line 244. > > [13:38:26.977](5001.028s) # poll_query_until timed out executing this query: > > # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where > datname = 'testdb' > > # expecting this output: > > # t > > # last actual query output: > > # f > > # with stderr: > > [13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated > > [13:38:26.982](0.002s) > > [13:38:26.982](0.000s) # Failed test 'confl_active_logicalslot updated' > > # at t/035_standby_logical_decoding.pl line 251. > > Timed out waiting confl_active_logicalslot to be updated at > t/035_standby_logical_decoding.pl line 251. > > ``` > > > > 035_standby_logical_decoding.pl: > > ``` > > # This should trigger the conflict > > $node_primary->safe_psql( > > 'testdb', qq[ > > CREATE TABLE conflict_test(x integer, y text); > > DROP TABLE conflict_test; > > VACUUM pg_class; > > INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal > > ]); > > > > $node_primary->wait_for_replay_catchup($node_standby); > > > > # Check invalidation in the logfile and in pg_stat_database_conflicts > > check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class'); > > ``` > > > > Is it possible that the vacuum command didn't remove tuples and then the > > conflict was not triggered? > > The flush_wal table added by Andres should guarantee that the WAL is flushed, > so > the only reason I can think about is indeed that the vacuum did not remove > tuples ( > but I don't get why/how that could be the case). > > > It seems we can't confirm this because there is not > > enough information. > > Right, and looking at its status history most of the time the test is green (making > it > even more difficult to diagnose). > > > Maybe "vacuum verbose" can be used to provide more > > information. > > I can see that dikkop "belongs" to Tomas (adding Tomas to this thread). > Tomas, do you think it would be possible to run some > 035_standby_logical_decoding.pl > manually with "vacuum verbose" in the test mentioned above? (or any other > way you can think > about that would help diagnose this random failure?). > Thanks for your reply. I saw another failure on "drongo" [1], which looks like a similar problem. Maybe a temporary patch can be committed to dump the result of "vacuum verbose". And we can check this when the test fails. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-05-26%2018%3A05%3A57 Regards, Shi Yu
"Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> writes: > On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote: >> Is it possible that the vacuum command didn't remove tuples and then the >> conflict was not triggered? > The flush_wal table added by Andres should guarantee that the WAL is flushed, so > the only reason I can think about is indeed that the vacuum did not remove tuples ( > but I don't get why/how that could be the case). This test is broken on its face: CREATE TABLE conflict_test(x integer, y text); DROP TABLE conflict_test; VACUUM full pg_class; There will be something VACUUM can remove only if there were no other transactions holding back global xmin --- and there's not even a delay here to give any such transaction a chance to finish. Background autovacuum is the most likely suspect for breaking that, but I wouldn't be surprised if something in the logical replication mechanism itself could be running a transaction at the wrong instant. Some of the other recovery tests set autovacuum = off to try to control such problems, but I'm not sure how much of a solution that really is. regards, tom lane
Re: BF animal dikkop reported a failure in 035_standby_logical_decoding
From
"Drouvot, Bertrand"
Date:
Hi, On 5/29/23 1:03 PM, Tom Lane wrote: > "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> writes: >> On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote: >>> Is it possible that the vacuum command didn't remove tuples and then the >>> conflict was not triggered? > >> The flush_wal table added by Andres should guarantee that the WAL is flushed, so >> the only reason I can think about is indeed that the vacuum did not remove tuples ( >> but I don't get why/how that could be the case). > > This test is broken on its face: > > CREATE TABLE conflict_test(x integer, y text); > DROP TABLE conflict_test; > VACUUM full pg_class; > > There will be something VACUUM can remove only if there were no other > transactions holding back global xmin --- and there's not even a delay > here to give any such transaction a chance to finish. > > Background autovacuum is the most likely suspect for breaking that, Oh right, I did not think autovacuum could start during this test, but yeah there is no reasons it could not. > but I wouldn't be surprised if something in the logical replication > mechanism itself could be running a transaction at the wrong instant. > > Some of the other recovery tests set > autovacuum = off > to try to control such problems, but I'm not sure how much of > a solution that really is. One option I can think of is to: 1) set autovacuum = off (as it looks like the usual suspect). 2) trigger the vacuum in verbose mode (as suggested by Shi-san) and depending of its output run the "invalidation" test or: re-launch the vacuum, re-check the output and so on.. (n times max). If n is reached, then skip this test. As this test is currently failing randomly (and it looks like there is more success that failures, even without autovacuum = off), then the test should still validate that the invalidation works as expected for the large majority of runs (and skipping the test should be pretty rare then). Would that make sense? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 2023-05-29 14:31:24 +0200, Drouvot, Bertrand wrote: > On 5/29/23 1:03 PM, Tom Lane wrote: > > but I wouldn't be surprised if something in the logical replication > > mechanism itself could be running a transaction at the wrong instant. > > > > Some of the other recovery tests set > > autovacuum = off > > to try to control such problems, but I'm not sure how much of > > a solution that really is. > > One option I can think of is to: > > 1) set autovacuum = off (as it looks like the usual suspect). > 2) trigger the vacuum in verbose mode (as suggested by Shi-san) and > depending of its output run the "invalidation" test or: re-launch the vacuum, re-check the output > and so on.. (n times max). If n is reached, then skip this test. I think the best fix would be to wait for a new snapshot that has a newer horizon, before doing the vacuum full. Greetings, Andres Freund
Re: BF animal dikkop reported a failure in 035_standby_logical_decoding
From
"Drouvot, Bertrand"
Date:
Hi, On 5/30/23 5:24 PM, Andres Freund wrote: > Hi, > > On 2023-05-29 14:31:24 +0200, Drouvot, Bertrand wrote: >> On 5/29/23 1:03 PM, Tom Lane wrote: >>> but I wouldn't be surprised if something in the logical replication >>> mechanism itself could be running a transaction at the wrong instant. >>> >>> Some of the other recovery tests set >>> autovacuum = off >>> to try to control such problems, but I'm not sure how much of >>> a solution that really is. >> >> One option I can think of is to: >> >> 1) set autovacuum = off (as it looks like the usual suspect). >> 2) trigger the vacuum in verbose mode (as suggested by Shi-san) and >> depending of its output run the "invalidation" test or: re-launch the vacuum, re-check the output >> and so on.. (n times max). If n is reached, then skip this test. > > I think the best fix would be to wait for a new snapshot that has a newer > horizon, before doing the vacuum full. > Thanks for the proposal! I think that's a great idea, I'll look at it and update the patch I've submitted in [1] accordingly. [1]: https://www.postgresql.org/message-id/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com