Thread: deadlock-hard flakiness

From: Andres Freund

Hi,

On cfbot / CI, we've recently seen a lot of spurious test failures due to
src/test/isolation/specs/deadlock-hard.spec producing changed output. Always on
FreeBSD, when running tests against a pre-existing instance.

I'm fairly sure I've seen this failure on the buildfarm as well, but I'm too
impatient to wait for the buildfarm database query (it really should be
updated to use lz4 toast compression).

Example failures:

1)
https://cirrus-ci.com/task/5307793230528512?logs=test_running#L211

https://api.cirrus-ci.com/v1/artifact/task/5307793230528512/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5307793230528512/testrun/build/testrun/runningcheck.log

2)
https://cirrus-ci.com/task/6137098198056960?logs=test_running#L212

https://api.cirrus-ci.com/v1/artifact/task/6137098198056960/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/6137098198056960/testrun/build/testrun/runningcheck.log

So far the diff always is:

diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
--- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out    2023-02-07 05:32:34.536429000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out    2023-02-07 05:40:33.833908000 +0000
@@ -25,10 +25,11 @@
 step s6a7: <... completed>
 step s6c: COMMIT;
 step s5a6: <... completed>
-step s5c: COMMIT;
+step s5c: COMMIT; <waiting ...>
 step s4a5: <... completed>
 step s4c: COMMIT;
 step s3a4: <... completed>
+step s5c: <... completed>
 step s3c: COMMIT;
 step s2a3: <... completed>
 step s2c: COMMIT;


Commit 741d7f1047f fixed a similar issue in deadlock-hard, but it looks like
we need something more. Or perhaps this isn't an output ordering issue at all:

How can we end up with s5c getting reported as waiting? I don't see how s5c
could end up blocking on anything.
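
For anyone not familiar with the mechanism: isolationtester annotates a step
with <waiting ...> when a probe function in the server reports the step's
session as blocked. A simplified sketch of that polling logic (the function
name and structure here are illustrative, not the actual isolationtester.c
code):

#include <stdbool.h>
#include <string.h>
#include "libpq-fe.h"

/*
 * isolationtester's control connection runs this query for any step that
 * hasn't returned promptly; pg_isolation_test_session_is_blocked() returns
 * true if the pid waits on a heavyweight lock held by one of the other test
 * sessions, or if it waits for a safe snapshot.
 */
static bool
step_is_waiting(PGconn *control_conn, const char *backend_pid,
                const char *interesting_pids_array)
{
    const char *params[2] = {backend_pid, interesting_pids_array};
    PGresult   *res;
    bool        waiting;

    res = PQexecParams(control_conn,
                       "SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, $2)",
                       2, NULL, params, NULL, NULL, 0);
    waiting = (PQresultStatus(res) == PGRES_TUPLES_OK &&
               strcmp(PQgetvalue(res, 0, 0), "t") == 0);
    PQclear(res);
    return waiting;
}

So "waiting" doesn't necessarily mean a heavyweight lock.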

Greetings,

Andres Freund



Re: deadlock-hard flakiness

From: Andres Freund

Hi,

On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> So far the diff always is:
> 
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out    2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out    2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
>  step s6a7: <... completed>
>  step s6c: COMMIT;
>  step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
>  step s4a5: <... completed>
>  step s4c: COMMIT;
>  step s3a4: <... completed>
> +step s5c: <... completed>
>  step s3c: COMMIT;
>  step s2a3: <... completed>
>  step s2c: COMMIT;

While trying to debug the create_index issue [1], I did end up hitting a
deadlock-soft output difference:

https://cirrus-ci.com/task/6332011665686528
diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-soft.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-soft.out
--- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-soft.out    2023-02-08 06:55:17.620898000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-soft.out    2023-02-08 06:56:17.622621000 +0000
@@ -8,10 +8,13 @@
 step d1a2: LOCK TABLE a2 IN ACCESS SHARE MODE; <waiting ...>
 step d2a1: LOCK TABLE a1 IN ACCESS SHARE MODE; <waiting ...>
 step d1a2: <... completed>
-step d1c: COMMIT;
+step d1c: COMMIT; <waiting ...>
 step e1l: <... completed>
 step e1c: COMMIT;
 step d2a1: <... completed>
-step d2c: COMMIT;
+step d1c: <... completed>
+step d2c: COMMIT; <waiting ...>
 step e2l: <... completed>
-step e2c: COMMIT;
+step e2c: COMMIT; <waiting ...>
+step d2c: <... completed>
+step e2c: <... completed>


Like in the deadlock-hard case, I don't understand how the commits suddenly
end up being considered waiting.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230208024748.ijvwabhqu4xlbvin%40awork3.anarazel.de



Re: deadlock-hard flakiness

From: Thomas Munro

On Wed, Feb 8, 2023 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> I'm fairly sure I've seen this failure on the buildfarm as well, but I'm too
> impatient to wait for the buildfarm database query (it really should be
> updated to use lz4 toast compression).

Failures in deadlock-hard (excluding crashes, because they make lots
of tests appear to fail bogusly) all occurred on animals that are no
longer with us:

  animal   |  last_report_time
-----------+---------------------
 anole     | 2022-07-05 12:31:02
 dory      | 2021-09-30 04:50:08
 fossa     | 2021-10-28 01:50:29
 gharial   | 2022-07-05 22:04:23
 hyrax     | 2021-05-10 15:11:53
 lousyjack | 2020-05-18 10:03:03

The failures stopped in mid '21 as far as my scraper noticed:

 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-13%2016:31:57
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-11%2017:13:44
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-06-11%2006:14:39
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-31%2006:41:25
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-23%2019:43:04
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-16%2000:36:16
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-10%2000:42:43
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-05-08%2006:34:13
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-04-22%2021:24:02
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fossa&dt=2021-04-08%2019:36:06
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-03-22%2013:26:03
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-03-13%2007:24:02
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-03-05%2019:39:46
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-01-08%2003:16:28
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-12-28%2011:05:15
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-11-27%2015:39:27
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2020-10-25%2023:47:21
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-09-29%2021:35:52
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-07-29%2014:34:49
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-15%2005:03:03
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-14%2013:33:03
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-13%2022:03:03
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-12%2015:03:03
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-11%2022:03:06
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-05-10%2022:33:02
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2020-01-14%2010:11:30
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-12-04%2001:38:28
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-11-12%2005:43:59
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-10-16%2016:43:58
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-07-18%2021:57:59
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-07-10%2005:59:16
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2019-07-08%2015:02:17
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-23%2004:17:09
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-12%2021:46:24
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-09%2021:58:03
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2021:58:04
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2005:19:17
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-07%2000:23:39
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-05%2018:58:04



Re: deadlock-hard flakiness

From: Thomas Munro

On Wed, Feb 8, 2023 at 11:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Feb 8, 2023 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm fairly sure I've seen this failure on the buildfarm as well, but I'm too
> > impatient to wait for the buildfarm database query (it really should be
> > updated to use lz4 toast compression).
>
> Failures in deadlock-hard (excluding crashes, because they make lots
> of tests appear to fail bogusly) all occurred on animals that are no
> longer with us:

Oh, and there was this thread:


https://www.postgresql.org/message-id/flat/CA%2BhUKGJ6xtAsXFFs%2BSGcR%3DJkv0wCje_W-SUxV1%2BN451Q-5t6MA%40mail.gmail.com



Re: deadlock-hard flakiness

From: Andres Freund

Hi,

On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out    2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out    2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
>  step s6a7: <... completed>
>  step s6c: COMMIT;
>  step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
>  step s4a5: <... completed>
>  step s4c: COMMIT;
>  step s3a4: <... completed>
> +step s5c: <... completed>
>  step s3c: COMMIT;
>  step s2a3: <... completed>
>  step s2c: COMMIT;
> 
> 
> Commit 741d7f1047f fixed a similar issue in deadlock-hard, but it looks like
> we need something more. Or perhaps this isn't an output ordering issue at all:
> 
> How can we end up with s5c getting reported as waiting? I don't see how s5c
> could end up blocking on anything.

After looking through isolationtester's blocking detection logic, I started to
suspect that what we're seeing is not a session blocked by a heavyweight lock,
but one blocked by a snapshot. So I added logging to
pg_isolation_test_session_is_blocked(). It took a while, but I managed to
reproduce the issue:
https://cirrus-ci.com/task/4901334571286528

https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/runningcheck.log

The runningcheck.log indicates that we were indeed blocked by a snapshot:
2023-02-08 21:30:12.123 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8971:0] LOG:  pid 9280 blocked due to snapshot by pid: 0
...
2023-02-08 21:30:12.155 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8973:0] LOG:  pid 9278 blocked due to snapshot by pid: 0
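
The logging patch itself isn't included here; a minimal sketch of that kind of
change, assuming it sits next to the safe-snapshot check in
pg_isolation_test_session_is_blocked() (lockfuncs.c), would be:

    int         dummy = 0;

    /*
     * GetSafeSnapshotBlockingPids() reports who, if anyone, blocks
     * blocked_pid from getting a safe snapshot; log the first such pid.
     */
    if (GetSafeSnapshotBlockingPids(blocked_pid, &dummy, 1) > 0)
    {
        elog(LOG, "pid %d blocked due to snapshot by pid: %d",
             blocked_pid, dummy);
        PG_RETURN_BOOL(true);
    }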
 


It's unclear why we end up without a pid. It looks like 2PC removes the pid
from the sxact's pid field?  In the problematic case the prepared_xacts test
is indeed scheduled concurrently:

2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] ERROR:  transaction identifier "foo3" is already in use
2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] STATEMENT:  PREPARE TRANSACTION 'foo3';

foo3, for example, does use SERIALIZABLE.


I don't really understand how GetSafeSnapshotBlockingPids() can end up finding
deadlock-hard's sessions being blocked by a safe snapshot. Afaict nothing uses
serializable in that test. How can SXACT_FLAG_DEFERRABLE_WAITING be set for
the sxact of a backend that never did serializable? Are we possibly forgetting
to clear it or such?
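
For reference, that flag should only be set while a backend is waiting inside
GetSafeSnapshot(), i.e. while starting a SERIALIZABLE READ ONLY DEFERRABLE
transaction; paraphrasing predicate_internals.h / predicate.c (not verbatim):

/* Set in GetSafeSnapshot() while the backend waits for its snapshot to
 * become safe. */
#define SxactIsDeferrableWaiting(sxact) \
    (((sxact)->flags & SXACT_FLAG_DEFERRABLE_WAITING) != 0)

/* GetSafeSnapshotBlockingPids() only reports blockers when the blocked
 * pid's own SERIALIZABLEXACT carries this flag. */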


I don't think it should affect the reports here, but I did break something
when removing SHMQueue - GetSafeSnapshotBlockingPids() doesn't check
output_size anymore. Will fix.  Thomas, any chance you could do a pass through
96003717645 to see if I screwed up other things? I stared a lot at that
change, but I obviously did miss at least one thing.
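
The missing check is presumably just a bounds guard along these lines
(variable names illustrative, not the actual code):

    /* don't overrun the caller's buffer of output_size entries */
    if (num_written >= output_size)
        break;
    output[num_written++] = blocking_pid;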

Greetings,

Andres Freund



Re: deadlock-hard flakiness

From: Andres Freund

Hi,

On 2023-02-08 14:11:45 -0800, Andres Freund wrote:
> I don't really understand how GetSafeSnapshotBlockingPids() can end up finding
> deadlock-hard's sessions being blocked by a safe snapshot. Afaict nothing uses
> serializable in that test. How can SXACT_FLAG_DEFERRABLE_WAITING be set for
> the sxact of a backend that never did serializable? Are we possibly forgetting
> to clear it or such?
> 
> 
> I don't think it should affect the reports here, but I did break something
> when removing SHMQueue - GetSafeSnapshotBlockingPids() doesn't check
> output_size anymore. Will fix.  Thomas, any chance you could do a pass through
> 96003717645 to see if I screwed up other things? I stared a lot at that
> change, but I obviously did miss at least one thing.

Argh, it's actually caused by 96003717645 as well: previously, a loop
iteration that didn't find a matching pid ended with sxact == NULL; now it
doesn't.
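
To illustrate, here is a paraphrased sketch of the pid search in
GetSafeSnapshotBlockingPids() (helper and member names from memory, not the
committed code):

/* SHMQueue-era loop: runs off the end of the active list and leaves
 * sxact == NULL when nothing matches blocked_pid. */
for (sxact = FirstPredXact(); sxact != NULL; sxact = NextPredXact(sxact))
{
    if (sxact->pid == blocked_pid)
        break;
}

/* After the dlist conversion: when nothing matches, sxact is left pointing
 * at the last element visited rather than NULL, so the subsequent
 * "sxact != NULL && SxactIsDeferrableWaiting(sxact)" test can look at an
 * unrelated transaction, e.g. apparently one prepared by the concurrently
 * scheduled prepared_xacts test, whose pid shows up as 0 in the log lines
 * upthread. */
dlist_foreach(iter, &PredXact->activeList)
{
    sxact = dlist_container(SERIALIZABLEXACT, xactLink, iter.cur);

    if (sxact->pid == blocked_pid)
        break;
}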

Greetings,

Andres Freund