Unstable tests for recovery conflict handling - Mailing list pgsql-hackers

From Tom Lane
Subject Unstable tests for recovery conflict handling
Date
Msg-id 394950.1651077914@sss.pgh.pa.us
Whole thread Raw
Responses Re: Unstable tests for recovery conflict handling  (Mark Dilger <mark.dilger@enterprisedb.com>)
Re: Unstable tests for recovery conflict handling  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
[ starting a new thread cuz the shared-stats one is way too long ]

Andres Freund <andres@anarazel.de> writes:
> Add minimal tests for recovery conflict handling.

It's been kind of hidden by other buildfarm noise, but
031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].

Three of those failures look like

[11:08:46.806](105.129s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/33EF190 on primary
[12:01:49.614](3182.807s) # poll_query_until timed out executing this query:
# SELECT '0/33EF190' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('standby', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at t/031_recovery_conflict.pl line 123.

In each of these examples we can see in the standby's log that it
detected the expected buffer pin conflict:

2022-04-27 11:08:46.353 UTC [1961604][client backend][2/2:0] LOG:  statement: BEGIN;
2022-04-27 11:08:46.416 UTC [1961604][client backend][2/2:0] LOG:  statement: DECLARE test_recovery_conflict_cursor
CURSORFOR SELECT b FROM test_recovery_conflict_table1; 
2022-04-27 11:08:46.730 UTC [1961604][client backend][2/2:0] LOG:  statement: FETCH FORWARD FROM
test_recovery_conflict_cursor;
2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] LOG:  recovery still waiting after 13.367 ms: recovery conflict
onbuffer pin 
2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] CONTEXT:  WAL redo at 0/33E6E80 for Heap2/PRUNE: latestRemovedXid
0nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0 

but then nothing happens until the script times out and kills the test.
I think this is showing us a real bug, ie we sometimes fail to cancel
the conflicting query.

The other example [3] looks different:

[01:02:43.582](2.357s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/342C000 on primary
done
[01:02:43.747](0.165s) ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict
[01:02:43.804](0.057s) not ok 3 - buffer pin conflict: stats show conflict on standby
[01:02:43.805](0.000s)
[01:02:43.805](0.000s) #   Failed test 'buffer pin conflict: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 295.
[01:02:43.805](0.000s) #          got: '0'
#     expected: '1'

Not sure what to make of that --- could there be a race condition in the
reporting of the conflict?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-27%2007%3A16%3A51
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-04-21%2021%3A20%3A15
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-04-13%2022%3A45%3A30
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-11%2005%3A40%3A41



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: remove redundant check of item pointer
Next
From: Julien Rouhaud
Date:
Subject: Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.