Re: A failure in 031_recovery_conflict.pl on Debian/s390x - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | Re: A failure in 031_recovery_conflict.pl on Debian/s390x |
Date | |
Msg-id | 8dfdb0bb-b913-28f6-0e1d-4904660471fc@gmail.com Whole thread Raw |
In response to | Re: A failure in 031_recovery_conflict.pl on Debian/s390x (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
Hi, 13.08.2023 00:00, Andres Freund wrote: > Hi, > > On 2023-08-12 15:50:24 +1200, Thomas Munro wrote: >> Thanks. I realised that it's easy enough to test that theory about >> cleanup locks by hacking ConditionalLockBufferForCleanup() to return >> false randomly. Then the test occasionally fails as described. Seems >> like we'll need to fix that test, but it's not evidence of a server >> bug, and my signal handler refactoring patch is in the clear. Thanks >> for testing it! > WRT fixing the test: I think just using VACUUM FREEZE ought to do the job? > After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl > passes even after I make ConditionalLockBufferForCleanup() fail 100%. > I happened to encounter another failure of 031_recovery_conflict, on rather slow ARMv7: t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query: # # SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted; # # expecting this output: # waiting # last actual query output: # # with stderr: t/031_recovery_conflict.pl ............ 14/? # Failed test 'startup deadlock: lock acquisition is waiting' # at t/031_recovery_conflict.pl line 261. # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 255 just after 15. t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00) Failed 1/15 subtests I've managed to reproduce it with the attached test patch, on a regular x86_64 VM with CPU slowed down to 2%, so that the modified test runs for 5+ minutes. ... t/031_recovery_conflict.pl .. 102/? # iteration 31 t/031_recovery_conflict.pl .. 104/? # iteration 32 t/031_recovery_conflict.pl .. 106/? <... waiting for something ...> 031_recovery_conflict_standby.log contains: 2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: BEGIN; 2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT a FROM test_recovery_conflict_table1; 2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor; 2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: SELECT * FROM test_recovery_conflict_table2; 2023-10-09 12:43:15.039 UTC [145513] LOG: recovery still waiting after 10.217 ms: recovery conflict on buffer pin 2023-10-09 12:43:15.039 UTC [145513] CONTEXT: WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0 2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR: canceling statement due to conflict with recovery at character 15 2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL: User transaction caused buffer deadlock with recovery. 2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT: SELECT * FROM test_recovery_conflict_table2; 2023-10-09 12:43:15.039 UTC [145513] LOG: recovery finished waiting after 10.382 ms: recovery conflict on buffer pin 2023-10-09 12:43:15.039 UTC [145513] CONTEXT: WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0 2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted; 2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted; 2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted; ... 2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted; 2023-10-09 12:51:25.969 UTC [145514] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/ 000000010000000000000003 contains: ... rmgr: Heap len (rec/tot): 59/ 59, tx: 972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0 rmgr: Heap len (rec/tot): 59/ 59, tx: 972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0 rmgr: Transaction len (rec/tot): 34/ 34, tx: 972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09 12:43:13.797513 UTC rmgr: Standby len (rec/tot): 42/ 42, tx: 973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db 16385 rel 16389 rmgr: Transaction len (rec/tot): 178/ 178, tx: 973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid lock: 2023-10-09 12:43:13.798048 UTC rmgr: Heap len (rec/tot): 59/ 59, tx: 974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0 rmgr: Transaction len (rec/tot): 34/ 34, tx: 974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT 2023-10-09 12:43:13.798434 UTC rmgr: Transaction len (rec/tot): 34/ 34, tx: 975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT 2023-10-09 12:43:13.900046 UTC rmgr: Heap2 len (rec/tot): 255/ 255, tx: 0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: [], blkref #0: rel 1663/16385/16489 blk 0 rmgr: Heap2 len (rec/tot): 248/ 248, tx: 0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused: 100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0 So it looks like "SELECT * FROM test_recovery_conflict_table2" in this case was interrupted due to conflict with replaying PRUNE, not VACUUM, and as a result, this query hadn't entered "waiting for lock" state. (I saw also other failures of the test, but probably they have the same root cause). Best regards, Alexander
Attachment
pgsql-hackers by date: