A failure in 031_recovery_conflict.pl on Debian/s390x - Mailing list pgsql-hackers

From Christoph Berg
Subject A failure in 031_recovery_conflict.pl on Debian/s390x
Date
Msg-id ZMzycdG5IWCM73ll@msg.df7cb.de
Whole thread Raw
In response to Re: A failure in 031_recovery_conflict.pl on Cirrus CI Windows Server  (Noah Misch <noah@leadboat.com>)
Responses Re: A failure in 031_recovery_conflict.pl on Debian/s390x
List pgsql-hackers
Re: Noah Misch
> On Tue, Jul 25, 2023 at 01:56:41PM +0530, Bharath Rupireddy wrote:
> > I've observed the following failure once in one of my Cirrus CI runs
> > on Windows Server on HEAD:
> >
> > timed out waiting for match: (?^:User was holding shared buffer pin
> > for too long) at
> > C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
> > # Postmaster PID for node "primary" is 696
> >
> >
https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
> >
> > https://github.com/BRupireddy/postgres/runs/15296698158
>
> Known:
> https://postgr.es/m/flat/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
> https://postgr.es/m/flat/CA+hUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO=nEYKSg@mail.gmail.com
>
> A good next step would be patch "Fix recovery conflict SIGUSR1 handling" from
> https://postgr.es/m/CA+hUKG+Hi5P1j_8cVeqLLwNSVyJh4RntF04fYWkeNXfTrH2MYA@mail.gmail.com
> (near the end of that second thread).

I am also seeing problems with t/031_recovery_conflict.pl, on the
newly added s390x architecture on apt.postgresql.org. The failure is
flaky, but has been popping up in build logs often enough.

I managed to reproduce it on the shell by running the test in a loop a
few times. The failure looks like this:

echo "# +++ tap check in src/test/recovery +++" && rm -rf
'/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check&& /bin/mkdir -p
'/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check&& cd
/home/myon/postgresql/pg/postgresql/build/../src/test/recovery&&
TESTLOGDIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check/log'
TESTDATADIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check'
PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/postgresql/17/bin:/home/myon/postgresql/pg/postgresql/build/src/test/recovery:$PATH"
LD_LIBRARY_PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/s390x-linux-gnu" PGPORT='65432'
top_builddir='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../..'
PG_REGRESS='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../../src/test/regress/pg_regress'
/usr/bin/prove-I /home/myon/postgresql/pg/postgresql/build/../src/test/perl/ -I
/home/myon/postgresql/pg/postgresql/build/../src/test/recovery--verbose t/031_recovery_conflict.pl 
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl ..
# issuing query via background psql:
#     BEGIN;
#     DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#     FETCH FORWARD FROM test_recovery_conflict_cursor;
ok 1 - buffer pin conflict: cursor with conflicting pin established
ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict
ok 3 - buffer pin conflict: stats show conflict on standby
# issuing query via background psql:
#         BEGIN;
#         DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#         FETCH FORWARD FROM test_recovery_conflict_cursor;
#
ok 4 - snapshot conflict: cursor with conflicting snapshot established
ok 5 - snapshot conflict: logfile contains terminated connection due to recovery conflict
ok 6 - snapshot conflict: stats show conflict on standby
# issuing query via background psql:
#         BEGIN;
#         LOCK TABLE test_recovery_conflict_table1 IN ACCESS SHARE MODE;
#         SELECT 1;
#
ok 7 - lock conflict: conflicting lock acquired
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 7.
Dubious, test returned 255 (wstat 65280, 0xff00)
All 7 subtests passed

Test Summary Report
-------------------
t/031_recovery_conflict.pl (Wstat: 65280 Tests: 7 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output
Files=1, Tests=7, 186 wallclock secs ( 0.01 usr  0.00 sys +  0.74 cusr  0.71 csys =  1.46 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1

I.e. the test file just exits after test 7, without running the rest.

Is it dying because it's running into a deadlock instead of some other
expected error message?

timed out waiting for match: (?^:User was holding a relation lock for too long) at t/031_recovery_conflict.pl line 318.
2023-08-04 12:20:38.061 UTC [1920073] 031_recovery_conflict.pl FATAL:  terminating connection due to conflict with
recovery

Christoph

Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: remaining sql/json patches
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node