Thread: deadlock-hard flakiness
Hi,

On cfbot / CI, we've recently seen a lot of spurious test failures due to
src/test/isolation/specs/deadlock-hard.spec changing output. Always on
freebsd, when running tests against a pre-existing instance.

I'm fairly sure I've seen this failure on the buildfarm as well, but I'm too
impatient to wait for the buildfarm database query (it really should be
updated to use lz4 toast compression).

Example failures:

1) https://cirrus-ci.com/task/5307793230528512?logs=test_running#L211
https://api.cirrus-ci.com/v1/artifact/task/5307793230528512/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5307793230528512/testrun/build/testrun/runningcheck.log

2) https://cirrus-ci.com/task/6137098198056960?logs=test_running#L212
https://api.cirrus-ci.com/v1/artifact/task/6137098198056960/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/6137098198056960/testrun/build/testrun/runningcheck.log

So far the diff always is:

diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
--- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out	2023-02-07 05:32:34.536429000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out	2023-02-07 05:40:33.833908000 +0000
@@ -25,10 +25,11 @@
 step s6a7: <... completed>
 step s6c: COMMIT;
 step s5a6: <... completed>
-step s5c: COMMIT;
+step s5c: COMMIT; <waiting ...>
 step s4a5: <... completed>
 step s4c: COMMIT;
 step s3a4: <... completed>
+step s5c: <... completed>
 step s3c: COMMIT;
 step s2a3: <... completed>
 step s2c: COMMIT;

Commit 741d7f1047f fixed a similar issue in deadlock-hard. But it looks like
we need something more. But perhaps this isn't an output ordering issue:

How can we end up with s5c getting reported as waiting? I don't see how s5c
could end up blocking on anything?

Greetings,

Andres Freund
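[For readers unfamiliar with how that annotation is produced: isolationtester
doesn't infer "<waiting ...>" from timing alone. After launching a step
asynchronously, it asks the server, via pg_isolation_test_session_is_blocked()
on a separate control connection, whether the step's backend is blocked.
Below is a minimal libpq sketch of the idea, condensed from memory of
src/test/isolation/isolationtester.c; the connection string and the single
probe are simplifications (the real tester polls in a loop, passing the pids
of all the spec's sessions), not the actual code.

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
	/* hypothetical connection string; the real tester connects to the
	 * database pg_regress set up for it */
	PGconn	   *session = PQconnectdb("dbname=isolation_regression");
	PGconn	   *control = PQconnectdb("dbname=isolation_regression");
	char		query[256];

	if (PQstatus(session) != CONNECTION_OK || PQstatus(control) != CONNECTION_OK)
		return 1;

	/* launch a step without waiting for it to finish, as the tester does */
	PQsendQuery(session, "COMMIT;");

	/* ask the server whether the step's backend is blocked; the real code
	 * passes the pids of *all* the spec's sessions in the array */
	snprintf(query, sizeof(query),
			 "SELECT pg_catalog.pg_isolation_test_session_is_blocked(%d, '{%d}')",
			 PQbackendPID(session), PQbackendPID(session));

	PGresult   *res = PQexec(control, query);

	if (PQresultStatus(res) == PGRES_TUPLES_OK &&
		strcmp(PQgetvalue(res, 0, 0), "t") == 0)
		printf("step s5c: COMMIT; <waiting ...>\n");	/* the annotation in the diff */

	PQclear(res);
	PQfinish(session);
	PQfinish(control);
	return 0;
}

So a false positive from that server function is enough to produce exactly
the diff above, even if the step never waited on a lock at all.]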
Hi,

On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> So far the diff always is:
>
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out	2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out	2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
>  step s6a7: <... completed>
>  step s6c: COMMIT;
>  step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
>  step s4a5: <... completed>
>  step s4c: COMMIT;
>  step s3a4: <... completed>
> +step s5c: <... completed>
>  step s3c: COMMIT;
>  step s2a3: <... completed>
>  step s2c: COMMIT;

While trying to debug the create_index issue [1], I did end up hitting a
deadlock-soft output difference:

https://cirrus-ci.com/task/6332011665686528

diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-soft.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-soft.out
--- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-soft.out	2023-02-08 06:55:17.620898000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-soft.out	2023-02-08 06:56:17.622621000 +0000
@@ -8,10 +8,13 @@
 step d1a2: LOCK TABLE a2 IN ACCESS SHARE MODE; <waiting ...>
 step d2a1: LOCK TABLE a1 IN ACCESS SHARE MODE; <waiting ...>
 step d1a2: <... completed>
-step d1c: COMMIT;
+step d1c: COMMIT; <waiting ...>
 step e1l: <... completed>
 step e1c: COMMIT;
 step d2a1: <... completed>
-step d2c: COMMIT;
+step d1c: <... completed>
+step d2c: COMMIT; <waiting ...>
 step e2l: <... completed>
-step e2c: COMMIT;
+step e2c: COMMIT; <waiting ...>
+step d2c: <... completed>
+step e2c: <... completed>

Like in the deadlock-hard case, I don't understand how the commits suddenly
end up being considered waiting.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230208024748.ijvwabhqu4xlbvin%40awork3.anarazel.de
On Wed, Feb 8, 2023 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> I'm fairly sure I've seen this failure on the buildfarm as well, but I'm too
> impatient to wait for the buildfarm database query (it really should be
> updated to use lz4 toast compression).

Failures in deadlock-hard (excluding crashes, because they make lots of
tests appear to fail bogusly) all occurred on animals that are no longer
with us:

  animal   |  last_report_time
-----------+---------------------
 anole     | 2022-07-05 12:31:02
 dory      | 2021-09-30 04:50:08
 fossa     | 2021-10-28 01:50:29
 gharial   | 2022-07-05 22:04:23
 hyrax     | 2021-05-10 15:11:53
 lousyjack | 2020-05-18 10:03:03

The failures stopped in mid '21 as far as my scraper noticed:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-13%2016:31:57
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-11%2017:13:44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-06-11%2006:14:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-31%2006:41:25
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-23%2019:43:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-16%2000:36:16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-10%2000:42:43
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-05-08%2006:34:13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-04-22%2021:24:02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fossa&dt=2021-04-08%2019:36:06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-03-22%2013:26:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-03-13%2007:24:02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-03-05%2019:39:46
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-01-08%2003:16:28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-12-28%2011:05:15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-11-27%2015:39:27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2020-10-25%2023:47:21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-09-29%2021:35:52
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-07-29%2014:34:49
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-15%2005:03:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-14%2013:33:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-13%2022:03:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-12%2015:03:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-11%2022:03:06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-10%2022:33:02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-01-14%2010:11:30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-12-04%2001:38:28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-11-12%2005:43:59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-10-16%2016:43:58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-07-18%2021:57:59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-07-10%2005:59:16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2019-07-08%2015:02:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-23%2004:17:09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-12%2021:46:24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-09%2021:58:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2021:58:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2005:19:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-07%2000:23:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-05%2018:58:04
On Wed, Feb 8, 2023 at 11:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Feb 8, 2023 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm fairly sure I've seen this failure on the buildfarm as well, but I'm too
> > impatient to wait for the buildfarm database query (it really should be
> > updated to use lz4 toast compression).
>
> Failures in deadlock-hard (excluding crashes, because they make lots
> of tests appear to fail bogusly) all occurred on animals that are no
> longer with us:

Oh, and there was this thread:

https://www.postgresql.org/message-id/flat/CA%2BhUKGJ6xtAsXFFs%2BSGcR%3DJkv0wCje_W-SUxV1%2BN451Q-5t6MA%40mail.gmail.com
Hi,

On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out	2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out	2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
>  step s6a7: <... completed>
>  step s6c: COMMIT;
>  step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
>  step s4a5: <... completed>
>  step s4c: COMMIT;
>  step s3a4: <... completed>
> +step s5c: <... completed>
>  step s3c: COMMIT;
>  step s2a3: <... completed>
>  step s2c: COMMIT;
>
> Commit 741d7f1047f fixed a similar issue in deadlock-hard. But it looks like
> we need something more. But perhaps this isn't an output ordering issue:
>
> How can we end up with s5c getting reported as waiting? I don't see how s5c
> could end up blocking on anything?

After looking through isolationtester's blocking detection logic I started to
suspect that what we're seeing is not being blocked by a heavyweight lock,
but by a snapshot. So I added logging to
pg_isolation_test_session_is_blocked(). Took a while to reproduce the issue,
but indeed:

https://cirrus-ci.com/task/4901334571286528
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/runningcheck.log

indicates that we indeed were blocked by a snapshot:

2023-02-08 21:30:12.123 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8971:0] LOG:  pid 9280 blocked due to snapshot by pid: 0
...
2023-02-08 21:30:12.155 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8973:0] LOG:  pid 9278 blocked due to snapshot by pid: 0

Unclear why we end up without a pid. It looks like 2PC removes the pid from
the field?

In the problematic case the prepared_xacts test is indeed scheduled
concurrently:

2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] ERROR:  transaction identifier "foo3" is already in use
2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] STATEMENT:  PREPARE TRANSACTION 'foo3';

foo3 for example does use SERIALIZABLE.

I don't really understand how GetSafeSnapshotBlockingPids() can end up
finding deadlock-hard's sessions being blocked by a safe snapshot. Afaict
nothing uses serializable in that test. How can SXACT_FLAG_DEFERRABLE_WAITING
be set for the sxact of a backend that never did serializable? Are we
possibly forgetting to clear it or such?

I don't think it should affect the reports here, but I did break something
when removing SHMQueue - GetSafeSnapshotBlockingPids() doesn't check
output_size anymore. Will fix. Thomas, any chance you could do a pass through
96003717645 to see if I screwed up other things? I stared a lot at that
change, but I obviously did miss at least one thing.

Greetings,

Andres Freund
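[To make the above concrete: pg_isolation_test_session_is_blocked() performs
two checks, heavyweight-lock blockers intersected with the spec's own pids,
and then safe-snapshot blockers, where, if memory serves, any nonzero count is
treated as "blocked" without intersecting at all. Below is a self-contained
sketch of that shape; the helper names lock_blockers and
safe_snapshot_blockers are invented stubs standing in for pg_blocking_pids()
and GetSafeSnapshotBlockingPids(), wired to reproduce the "blocked ... by
pid: 0" log above. The real function lives in src/test/regress/regress.c.

#include <stdbool.h>
#include <stdio.h>

/* stub for pg_blocking_pids(): writes the pids holding heavyweight locks
 * that blocked_pid is waiting for, returns how many */
static int
lock_blockers(int blocked_pid, int *out, int out_size)
{
	(void) blocked_pid; (void) out; (void) out_size;
	return 0;					/* pretend: no lock waits */
}

/* stub for GetSafeSnapshotBlockingPids() */
static int
safe_snapshot_blockers(int blocked_pid, int *out, int out_size)
{
	(void) blocked_pid; (void) out_size;
	out[0] = 0;					/* pretend: one blocker whose pid is 0 */
	return 1;
}

static bool
session_is_blocked(int blocked_pid, const int *interesting, int ninteresting)
{
	int			pids[64];
	int			nblockers = lock_blockers(blocked_pid, pids, 64);
	int			snapshot_blocker;

	/* lock blockers count only if they belong to the test's own sessions */
	for (int i = 0; i < nblockers; i++)
		for (int j = 0; j < ninteresting; j++)
			if (pids[i] == interesting[j])
				return true;

	/*
	 * Safe-snapshot blockers are (assumption from memory) not intersected
	 * with the interesting pids: any nonzero count means "blocked".  So a
	 * single stray sxact with SXACT_FLAG_DEFERRABLE_WAITING is enough to
	 * flag an unrelated session, even when the reported pid is 0.
	 */
	if (safe_snapshot_blockers(blocked_pid, &snapshot_blocker, 1) > 0)
		return true;

	return false;
}

int
main(void)
{
	int			interesting[] = {9278, 9280};

	printf("blocked: %d\n",
		   (int) session_is_blocked(9280, interesting, 2));	/* prints 1 */
	return 0;
}

Under that reading, one bogus safe-snapshot "blocker" from an unrelated test
suffices to make deadlock-hard's COMMIT steps report as waiting.]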
Hi,

On 2023-02-08 14:11:45 -0800, Andres Freund wrote:
> On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> I don't really understand how GetSafeSnapshotBlockingPids() can end up
> finding deadlock-hard's sessions being blocked by a safe snapshot. Afaict
> nothing uses serializable in that test. How can SXACT_FLAG_DEFERRABLE_WAITING
> be set for the sxact of a backend that never did serializable? Are we
> possibly forgetting to clear it or such?
>
> I don't think it should affect the reports here, but I did break something
> when removing SHMQueue - GetSafeSnapshotBlockingPids() doesn't check
> output_size anymore. Will fix. Thomas, any chance you could do a pass through
> 96003717645 to see if I screwed up other things? I stared a lot at that
> change, but I obviously did miss at least one thing.

Argh, it's actually caused by 96003717645 as well: previously, a loop
iteration that didn't find a matching pid ended with sxact == NULL; now it
doesn't.

Greetings,

Andres Freund
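[The pattern is easy to demonstrate outside the server. The SHMQueue-style
search left its loop variable NULL when no pid matched; a dlist_foreach-style
rewrite naturally leaves the variable pointing at the last element visited,
so a later "sxact != NULL" test passes for an arbitrary transaction. A toy,
self-contained version follows; the Sxact struct and both function names are
invented for illustration, with a singly linked list standing in for the
shared-memory list. This is not the actual predicate.c code.

#include <stdio.h>
#include <stddef.h>

/* toy stand-in for SERIALIZABLEXACT; only the pid matters here */
typedef struct Sxact
{
	int			pid;
	struct Sxact *next;
} Sxact;

/* SHMQueue-era shape: the loop variable is NULL when nothing matched */
static Sxact *
find_old_style(Sxact *head, int pid)
{
	Sxact	   *sxact;

	for (sxact = head; sxact != NULL; sxact = sxact->next)
	{
		if (sxact->pid == pid)
			break;
	}
	return sxact;				/* NULL on no match */
}

/* dlist_foreach-era shape: sxact is assigned inside the loop, so after a
 * full pass with no match it still points at the last element */
static Sxact *
find_new_style_buggy(Sxact *head, int pid)
{
	Sxact	   *sxact = NULL;

	for (Sxact *iter = head; iter != NULL; iter = iter->next)
	{
		sxact = iter;
		if (sxact->pid == pid)
			break;
	}
	return sxact;				/* last element on no match: the bug */
}

int
main(void)
{
	Sxact		c = {303, NULL};
	Sxact		b = {202, &c};
	Sxact		a = {101, &b};

	/* search for a pid that is not in the list */
	printf("old: %p\n", (void *) find_old_style(&a, 999));			/* NULL */
	printf("new: %p\n", (void *) find_new_style_buggy(&a, 999));	/* &c */
	return 0;
}

That leftover pointer is how a backend that never ran SERIALIZABLE could be
reported as blocked by a safe snapshot: the subsequent flag check simply
landed on whatever sxact happened to be last in the list.]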