Re: deadlock-hard flakiness - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: deadlock-hard flakiness
Date:
Msg-id: 20230208221145.bwzhancellclrgia@awork3.anarazel.de
In response to: deadlock-hard flakiness (Andres Freund <andres@anarazel.de>)
Responses: Re: deadlock-hard flakiness
List: pgsql-hackers
Hi,

On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out    2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out    2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
>  step s6a7: <... completed>
>  step s6c: COMMIT;
>  step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
>  step s4a5: <... completed>
>  step s4c: COMMIT;
>  step s3a4: <... completed>
> +step s5c: <... completed>
>  step s3c: COMMIT;
>  step s2a3: <... completed>
>  step s2c: COMMIT;
> 
> 
> Commit 741d7f1047f fixed a similar issue in deadlock-hard, but it looks like
> we need something more. Then again, perhaps this isn't an output ordering issue:
> 
> How can we end up with s5c getting reported as waiting? I don't see how s5c
> could end up blocking on anything?

After looking through isolationtester's blocking detection logic, I started to
suspect that what we're seeing is a session blocked not by a heavyweight lock,
but by a snapshot. So I added logging to
pg_isolation_test_session_is_blocked(). It took a while to reproduce the
issue, but indeed:
https://cirrus-ci.com/task/4901334571286528

https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/runningcheck.log

The runningcheck.log indicates that we indeed were blocked by a snapshot:
2023-02-08 21:30:12.123 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8971:0] LOG:  pid 9280 blocked due to snapshot by pid: 0
...
2023-02-08 21:30:12.155 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8973:0] LOG:  pid 9278 blocked due to snapshot by pid: 0
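(For reference, the logging was roughly along these lines. This is a sketch of
the ad-hoc instrumentation rather than a patch, with illustrative variable
names, hooked onto the existing GetSafeSnapshotBlockingPids() call in
pg_isolation_test_session_is_blocked():)

	/*
	 * Sketch of the debug logging: report which pid(s) the safe-snapshot
	 * machinery thinks are blocking blocked_pid.  blocked_pid comes from
	 * the surrounding function; the other names are illustrative.
	 */
	int		   *blocking_pids = (int *) palloc(MaxBackends * sizeof(int));
	int			num_blocking_pids;
	int			i;

	num_blocking_pids = GetSafeSnapshotBlockingPids(blocked_pid,
													blocking_pids,
													MaxBackends);
	for (i = 0; i < num_blocking_pids; i++)
		elog(LOG, "pid %d blocked due to snapshot by pid: %d",
			 blocked_pid, blocking_pids[i]);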
 


Unclear why we end up without a pid. It looks like 2PC clears the pid from the
blocker's sxact?  In the problematic case the prepared_xacts test is indeed
scheduled concurrently:

2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] ERROR:  transaction identifier "foo3" is already in use
2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] STATEMENT:  PREPARE TRANSACTION 'foo3';

The foo3 transaction in prepared_xacts, for example, does use SERIALIZABLE.


I don't really understand how GetSafeSnapshotBlockingPids() can end up finding
deadlock-hard's sessions blocked by a safe snapshot. Afaict nothing uses
SERIALIZABLE in that test. How can SXACT_FLAG_DEFERRABLE_WAITING be set for
the sxact of a backend that never used SERIALIZABLE? Are we possibly
forgetting to clear it, or something like that?
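(For context, my mental model of the check, as a sketch rather than the actual
predicate.c code; find_sxact_by_pid() is a hypothetical stand-in for the
linear search over the active SERIALIZABLEXACTs:)

/*
 * Sketch only: a backend should be reported as snapshot-blocked only if its
 * SERIALIZABLEXACT has SXACT_FLAG_DEFERRABLE_WAITING set, which should
 * happen solely while it waits in GetSafeSnapshot() for a
 * SERIALIZABLE READ ONLY DEFERRABLE transaction.  None of deadlock-hard's
 * sessions start such a transaction, so this should never be true for them.
 */
static bool
is_waiting_for_safe_snapshot(int blocked_pid)
{
	SERIALIZABLEXACT *sxact = find_sxact_by_pid(blocked_pid);

	return sxact != NULL && SxactIsDeferrableWaiting(sxact);
}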


I don't think it should affect the reports here, but I did break something
when removing SHMQueue - GetSafeSnapshotBlockingPids() doesn't check
output_size anymore. Will fix.  Thomas, any chance you could do a pass through
96003717645 to see if I screwed up other things? I stared a lot at that
change, but I obviously did miss at least one thing.
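(For concreteness, the missing guard would presumably look about like this in
the reporting loop; the variable names here follow my recollection of the
function and are not a literal patch:)

	/*
	 * Presumed shape of the missing bound in GetSafeSnapshotBlockingPids():
	 * stop once the caller-provided buffer is full instead of writing past
	 * output_size entries.
	 */
	dlist_foreach(iter, &blocking_sxact->possibleUnsafeConflicts)
	{
		RWConflict	conflict = dlist_container(RWConflictData, inLink, iter.cur);

		if (num_written >= output_size)
			break;
		output[num_written++] = conflict->sxactOut->pid;
	}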

Greetings,

Andres Freund


