Re: deadlock-hard flakiness - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: deadlock-hard flakiness
Msg-id: 20230208221145.bwzhancellclrgia@awork3.anarazel.de
In response to: deadlock-hard flakiness (Andres Freund <andres@anarazel.de>)
Responses: Re: deadlock-hard flakiness
List: pgsql-hackers
Hi,

On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out	2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out	2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
>  step s6a7: <... completed>
>  step s6c: COMMIT;
>  step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
>  step s4a5: <... completed>
>  step s4c: COMMIT;
>  step s3a4: <... completed>
> +step s5c: <... completed>
>  step s3c: COMMIT;
>  step s2a3: <... completed>
>  step s2c: COMMIT;
>
> Commit 741d7f1047f fixed a similar issue in deadlock-hard. But it looks like
> we need something more. But perhaps this isn't an output ordering issue:
>
> How can we end up with s5c getting reported as waiting? I don't see how s5c
> could end up blocking on anything?

After looking through isolationtester's blocking detection logic I started to
suspect that what we're seeing is not being blocked by a heavyweight lock, but
by a snapshot. So I added logging to pg_isolation_test_session_is_blocked().
Took a while to reproduce the issue, but indeed:

https://cirrus-ci.com/task/4901334571286528
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/runningcheck.log

indicates that we indeed were blocked by a snapshot:

2023-02-08 21:30:12.123 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8971:0] LOG:  pid 9280 blocked due to snapshot by pid: 0
...
2023-02-08 21:30:12.155 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8973:0] LOG:  pid 9278 blocked due to snapshot by pid: 0

Unclear why we end up without a pid. It looks like 2PC removes the pid from
the field?

In the problematic case the prepared_xacts test is indeed scheduled
concurrently:

2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] ERROR:  transaction identifier "foo3" is already in use
2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] STATEMENT:  PREPARE TRANSACTION 'foo3';

foo3, for example, does use SERIALIZABLE.

I don't really understand how GetSafeSnapshotBlockingPids() can end up finding
deadlock-hard's sessions being blocked by a safe snapshot. Afaict nothing uses
serializable in that test. How can SXACT_FLAG_DEFERRABLE_WAITING be set for
the sxact of a backend that never used serializable? Are we possibly
forgetting to clear it or such?

I don't think it should affect the reports here, but I did break something
when removing SHMQueue - GetSafeSnapshotBlockingPids() doesn't check
output_size anymore. Will fix.

Thomas, any chance you could do a pass through 96003717645 to see if I screwed
up other things? I stared a lot at that change, but I obviously did miss at
least one thing.

Greetings,

Andres Freund