Re: A failure in 031_recovery_conflict.pl on Debian/s390x - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Date
Msg-id 8dfdb0bb-b913-28f6-0e1d-4904660471fc@gmail.com
Whole thread Raw
In response to Re: A failure in 031_recovery_conflict.pl on Debian/s390x  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks.  I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly.  Then the test occasionally fails as described.  Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear.  Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.
>

I happened to encounter another failure of 031_recovery_conflict, on
rather slow ARMv7:
t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
#   Failed test 'startup deadlock: lock acquisition is waiting'
#   at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/15 subtests

I've managed to reproduce it with the attached test patch, on a regular
x86_64 VM with CPU slowed down to 2%, so that the modified test runs for
5+ minutes.
...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>

031_recovery_conflict_standby.log contains:
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor 
CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM 
test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: SELECT * FROM
test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery still waiting after 10.217 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, 
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel

1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with recovery

at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with 
recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery finished waiting after 10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, 
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel

1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE 
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE 
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE 
locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE 
locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL:  could not receive data from WAL stream: server closed the connection 
unexpectedly
                 This probably means the server terminated abnormally
                 before or while processing the request.

Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/
000000010000000000000003 contains:
...
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off:
100,
 
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off:
101,
 
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT
2023-10-09
 
12:43:13.797513 UTC
rmgr: Standby     len (rec/tot):     42/    42, tx:        973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db

16385 rel 16389
rmgr: Transaction len (rec/tot):    178/   178, tx:        973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid 
lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap        len (rec/tot):     59/    59, tx:        974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off:
102,
 
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT 
2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot):     34/    34, tx:        975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT 
2023-10-09 12:43:13.900046 UTC
rmgr: Heap2       len (rec/tot):    255/   255, tx:          0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE 
snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100,
101],
 
unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2       len (rec/tot):    248/   248, tx:          0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused:

100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0

So it looks like "SELECT * FROM test_recovery_conflict_table2" in this
case was interrupted due to conflict with replaying PRUNE, not VACUUM,
and as a result, this query hadn't entered "waiting for lock" state.
(I saw also other failures of the test, but probably they have the same
root cause).

Best regards,
Alexander
Attachment

pgsql-hackers by date:

Previous
From: Amul Sul
Date:
Subject: Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
Next
From: Alexander Korotkov
Date:
Subject: Re: On login trigger: take three