RE: failures in t/031_recovery_conflict.pl on CI - Mailing list pgsql-hackers

From Факеев Алексей
Subject RE: failures in t/031_recovery_conflict.pl on CI
Date
Msg-id 00cc01d8ffc4$f6169e60$e243db20$@postgrespro.ru
Whole thread Raw
In response to Re: failures in t/031_recovery_conflict.pl on CI  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hello.

Even after applying the patch, we are still facing an "ack Broken pipe"
problem. 
It occurs on the arm64 platform, presumably under high load. 
Here is a log snippet from buildfarm:
...
[19:08:12.150](0.394s) ok 13 - startup deadlock: cursor holding conflicting
pin, also waiting for lock, established
[19:08:12.208](0.057s) ok 14 - startup deadlock: lock acquisition is waiting
Waiting for replication conn standby's replay_lsn to pass 0/33F5FE0 on
primary
done
psql:<stdin>:8: ERROR:  canceling statement due to conflict with recovery
LINE 1: SELECT * FROM test_recovery_conflict_table2;
                      ^
DETAIL:  User transaction caused buffer deadlock with recovery.
[19:08:12.319](0.112s) ok 15 - startup deadlock: logfile contains terminated
connection due to recovery conflict
ack Broken pipe: write( 13, '\\q
' ) at /usr/share/perl5/IPC/Run/IO.pm line 550.
### Stopping node "primary" using mode immediate
# Running: pg_ctl -D
/home/bf/build/buildfarm-flaviventris/REL_15_STABLE/pgsql.build/src/test/rec
overy/tmp_check/t_031_recovery_conflict_primary_data/pgdata -m immediate
stop
waiting for server to shut down... done
server stopped
# No postmaster PID for node "primary"
### Stopping node "standby" using mode immediate
# Running: pg_ctl -D
/home/bf/build/buildfarm-flaviventris/REL_15_STABLE/pgsql.build/src/test/rec
overy/tmp_check/t_031_recovery_conflict_standby_data/pgdata -m immediate
stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby"
[19:08:12.450](0.131s) # Tests were run but no plan was declared and
done_testing() was not seen.
[19:08:12.450](0.000s) # Looks like your test exited with 32 just after 15.
...

Below is a test report:
... 
[20:46:35] t/030_stats_cleanup_replica.pl ....... ok     9956 ms ( 0.01 usr
0.00 sys +  3.49 cusr  2.49 csys =  5.99 CPU)
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 32 just after 15.
[20:46:43] t/031_recovery_conflict.pl ........... 
Dubious, test returned 32 (wstat 8192, 0x2000)
All 15 subtests passed 
[20:46:56] t/032_relfilenode_reuse.pl ........... ok    13625 ms ( 0.01 usr
0.00 sys +  3.53 cusr  2.39 csys =  5.93 CPU)
[20:47:03] t/110_wal_sender_check_crc.pl ........ ok     6421 ms ( 0.00 usr
0.00 sys +  3.20 cusr  1.87 csys =  5.07 CPU)
[20:47:03]

Test Summary Report
-------------------
t/031_recovery_conflict.pl         (Wstat: 8192 Tests: 15 Failed: 0)
  Non-zero exit status: 32
  Parse errors: No plan found in TAP output




-----Original Message-----
From: Andres Freund <andres@anarazel.de> 
Sent: Tuesday, May 3, 2022 11:13 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Robert Haas <robertmhaas@gmail.com>; pgsql-hackers@postgresql.org;
Thomas Munro <thomas.munro@gmail.com>
Subject: Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-05-03 14:23:23 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> So it's almost surely a timing issue, and your theory here seems
plausible.
> 
> > Unfortunately I don't think my theory holds, because I actually had 
> > added a defense against this into the test that I forgot about
momentarily...
> 
> Oh, hm.  I can try harder to repro it.

I've now reproduced it a couple times here running under rr, so it's
probably not worth putting too much effort into that.

Attached is a fix for the test that I think should avoid the problem.
Couldn't repro it with it applied, under both rr and valgrind.


My current problem is that I'm running into some IPC::Run issues (bug?). I
get "ack Broken pipe:" iff I add "SELECT pg_sleep(1);" after
"-- wait for lock held by prepared transaction"

It doesn't happen without that debugging thing, but it makes me worried that
it's something that'll come up in random situations.

It looks to me like it's a bug in IPC::Run - with a few changes I get the
failure to happen inside pump_nb(), which seems like it shouldn't error out
just because the child process exited...


I *think* it might not happen without the sleep. But I'm not at all
confident.

In general I'm kinda worried on how much effectively unmaintained perl stuff
we're depending :(

Greetings,

Andres Freund




pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Allow processes to reset procArrayGroupNext themselves instead of leader resetting for all the followers
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: wake up logical workers after ALTER SUBSCRIPTION