Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed |
Date | |
Msg-id | 7c025095-5763-17a6-8c80-899b35bd0459@gmail.com Whole thread Raw |
In response to | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
|
List | pgsql-hackers |
Hi, 10.01.2024 07:26, Michael Paquier wrote: > On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote: >> Thus just adding FREEZE is not enough, seemingly. It makes me wonder if >> 0174c2d21 should be superseded by a patch like discussed (or just have >> autovacuum = off added)... > Adding an extra FREEZE offers an extra insurance, so I don't see why > it would be a problem to add it to stabilize the horizon conflicts on > the standbys. As 0174c2d21 added FREEZE already, I meant to add "autovacuum = off" or apply a fix similar to what we're are discussing here. > >> Michael, it definitely increases stability of the test (tens of iterations >> with 20 tests in parallel performed successfully), although I've managed to >> see another interesting failure (twice): >> 13 # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class' >> 13 # at t/035_standby_logical_decoding.pl line 227. > Something I'd like to confirm here: you still see this failure with > the patch, but without an extra FREEZE, right? If we do both, the > test would get more stable, wouldn't it? Yes, I tested the patch as-is, without FREEZE, but it looks like it doesn't matter in that case. And sorry for misleading information about missing VACUUM records in my previous message, please ignore it. 10.01.2024 12:46, Bertrand Drouvot wrote: >> although I've managed to >> see another interesting failure (twice): >> 13 # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class' >> 13 # at t/035_standby_logical_decoding.pl line 227. >> > Looking at the attached log files and particularly 1/regress_log_035_standby_logical_decoding: > > " > [11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class > [11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class > " > > That seems weird, the inactive slot has been invalidated while the active one is not. > While it takes a bit longer to invalidate an active slot, I don't think the test can > move on until both are invalidated (then leading to the tests 24 and 25)). I can > see the tests are very slow to run (13.993s for 24) but still don't get how 24 could > succeed while 25 does not. > > ... > > Would it be possible to also send the standby logs? Yes, please look at the attached logs. This time I've build postgres with -DWAL_DEBUG and ran tests with TEMP_CONFIG as below: wal_keep_size=1GB wal_debug = on log_autovacuum_min_duration = 0 log_statement = 'all' log_min_messages = INFO The archive attached contains logs from four runs: recovery-1-ok -- an example of successful run for reference recovery-7-pruning and recovery-19-pruning -- failures with a failed subtest 'activeslot slot invalidation is logged with on-access pruning' recovery-15-vacuum_pg_class -- a failure with a failed subtest 'activeslot slot invalidation is logged with vacuum on pg_class' The distinction that I see in the failed run logs, for example in recovery-15-vacuum_pg_class, 035_standby_logical_decoding_standby.log: 2024-01-10 11:00:18.700 UTC [789618] LOG: REDO @ 0/4020220; LSN 0/4020250: prev 0/401FDE0; xid 753; len 20 - Transaction/COMMIT: 2024-01-10 11:00:18.471694+00 2024-01-10 11:00:18.797 UTC [789618] LOG: REDO @ 0/4020250; LSN 0/4020288: prev 0/4020220; xid 0; len 24 - Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid 753 oldestRunningXid 754 2024-01-10 11:00:19.013 UTC [789618] LOG: REDO @ 0/4020288; LSN 0/40202C8: prev 0/4020250; xid 0; len 9; blkref #0: rel 1663/16384/2610, blk 0 - Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: [] 2024-01-10 11:00:19.111 UTC [789618] LOG: invalidating obsolete replication slot "row_removal_inactiveslot" 2024-01-10 11:00:19.111 UTC [789618] DETAIL: The slot conflicted with xid horizon 752. 2024-01-10 11:00:19.111 UTC [789618] CONTEXT: WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0 2024-01-10 11:00:29.109 UTC [789618] LOG: terminating process 790377 to release replication slot "row_removal_activeslot" 2024-01-10 11:00:29.109 UTC [789618] DETAIL: The slot conflicted with xid horizon 752. 2024-01-10 11:00:29.109 UTC [789618] CONTEXT: WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0 2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl ERROR: canceling statement due to conflict with recovery 2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl DETAIL: User was using a logical replication slot that must be invalidated. 2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1') 2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl LOG: released logical replication slot "row_removal_activeslot" is an absent message 'obsolete replication slot "row_removal_activeslot"' and an additional record 'Standby/RUNNING_XACTS', which can be found in 035_standby_logical_decoding_primary.log: 2024-01-10 11:00:18.515 UTC [783410] LOG: xlog bg flush request write 0/4020250; flush: 0/4020250, current is write 0/4020220; flush 0/4020220 2024-01-10 11:00:18.646 UTC [783387] LOG: INSERT @ 0/4020288: - Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid 753 oldestRunningXid 754 2024-01-10 11:00:18.702 UTC [790526] 035_standby_logical_decoding.pl LOG: statement: SELECT (select txid_snapshot_xmin(txid_current_snapshot()) - 753) > 0 2024-01-10 11:00:18.724 UTC [783410] LOG: xlog bg flush request write 0/4020288; flush: 0/4020288, current is write 0/4020250; flush 0/4020250 So perhaps it can affect an active slot invalidation? Best regards, Alexander
Attachment
pgsql-hackers by date: