failures in t/031_recovery_conflict.pl on CI - Mailing list pgsql-hackers

From Andres Freund
Subject failures in t/031_recovery_conflict.pl on CI
Date
Msg-id 20220409045515.35ypjzddp25v72ou@alap3.anarazel.de
Whole thread Raw
Responses Re: failures in t/031_recovery_conflict.pl on CI  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
interesting, because I ran it there dozens if not hundreds of times before
commit, with - I think - only cosmetic changes.

I've reproduced it in a private branch, with more logging. And the results are
sure interesting.

https://cirrus-ci.com/task/6448492666159104

https://api.cirrus-ci.com/v1/artifact/task/6448492666159104/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log

The primary is waiting for 0/343A000 to be applied, which requires a recovery
conflict to be detected and resolved. On the standby there's the following
sequence (some omitted):

prerequisite for recovery conflict:
2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG:  statement: BEGIN;
2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG:  statement: DECLARE
test_recovery_conflict_cursorCURSOR FOR SELECT b FROM test_recovery_conflict_table1;
 
2022-04-09 04:05:31.293 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG:  statement: FETCH FORWARD
FROMtest_recovery_conflict_cursor;
 

detecting the conflict:
2022-04-09 04:05:31.382 UTC [35038][startup] LOG:  recovery still waiting after 28.821 ms: recovery conflict on buffer
pin
2022-04-09 04:05:31.382 UTC [35038][startup] CONTEXT:  WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0
nredirected0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
 

and then nothing until the timeout:
2022-04-09 04:09:19.317 UTC [35035][postmaster] LOG:  received immediate shutdown request
2022-04-09 04:09:19.317 UTC [35035][postmaster] DEBUG:  sending signal 3 to process 35071
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  server process (PID 35071) exited with exit code 2

Afaics that has to mean something is broken around sending, receiving or
processing of recovery conflict interrupts.


All the failures so far were on freebsd, from what I can see. There were other
failures in other tests, but I think for reverted or fixed things.


Except for not previously triggering while the shmstats patch was in
development, it's hard to tell whether it's a regression or just a
longstanding bug - we never had tests for recovery conflicts...


I don't really see how recovery prefetching could play a role here, clearly
we've been trying to replay the record. So we're elsewhere...

Greetings,

Andres Freund

https://cirrus-ci.com/github/postgres/postgres/master



pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: Add parameter jit_warn_above_fraction
Next
From: Andres Freund
Date:
Subject: Re: failures in t/031_recovery_conflict.pl on CI