On Tue, May 25, 2021 at 11:32:38AM -0400, Alvaro Herrera wrote:
> On 2021-May-24, Noah Misch wrote:
> > What if we had a standard that the step after the cancel shall send a query to
> > the backend that just received the cancel? Something like:
>
> Hmm ... I don't understand why this fixes the problem, but it
> drastically reduces the probability.

This comment, from run_permutation(), explains:

/*
 * Check whether the session that needs to perform the next step is
 * still blocked on an earlier step. If so, wait for it to finish.
 *
 * (In older versions of this tool, we allowed precisely one session
 * to be waiting at a time. If we reached a step that required that
 * session to execute the next command, we would declare the whole
 * permutation invalid, cancel everything, and move on to the next
 * one. Unfortunately, that made it impossible to test the deadlock
 * detector using this framework, unless the number of processes
 * involved in the deadlock was precisely two. We now assume that if
 * we reach a step that is still blocked, we need to wait for it to
 * unblock itself.)
 */

> Here's a complete patch. I got
> about one failure in 1000 instead of 1 in 10. The new failure looks
> like this:
>
> diff -U3 /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
> --- /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out 2021-05-25 11:12:42.333987835 -0400
> +++ /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out 2021-05-25 11:19:03.714947775 -0400
> @@ -13,7 +13,7 @@
>
> t
> step s2detach: <... completed>
> -error in steps s1cancel s2detach: ERROR: canceling statement due to user request

I'm guessing this is:

  report_multiple_error_messages("s1cancel", 1, {"s2detach"})

> +ERROR: canceling statement due to user request

And this is:

  report_multiple_error_messages("s2detach", 0, {})

> step s2noop: UNLISTEN noop;
> step s1c: COMMIT;
> step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')
>
>
> I find this a bit weird and I'm wondering if it could be an
> isolationtester bug -- why is it not attributing the error message to
> any steps?
I agree that looks like an isolationtester bug. isolationtester already
printed the tuples from s1cancel, so s1cancel should be considered finished.
This error message emission should never have any step name prefixing it.
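
To make my two guesses above concrete, here is a simplified sketch of how I read the reporting logic. It is not the actual isolationtester.c code: SketchStep, sketch_report_errors() and the helper are hypothetical names invented for this illustration, and it formats into a buffer instead of printing. The assumption is that a call with no extra steps emits the bare error message, while a call with extra steps prefixes each message with the names of all the steps involved.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for isolationtester's per-step bookkeeping. */
typedef struct
{
	const char *name;
	const char *errormsg;		/* NULL if the step raised no error */
} SketchStep;

/* Append one "error in steps ..." line to the NUL-terminated buffer "out". */
static void
sketch_append_line(char *out, size_t cap, const char *names, const char *msg)
{
	size_t		len = strlen(out);

	snprintf(out + len, cap - len, "error in steps %s: %s\n", names, msg);
}

/*
 * Format the error report for "step" plus "nextra" other steps that were
 * reported together with it.  With nextra == 0 the message is emitted
 * without any prefix (assumed behavior, per the guesses above).
 */
static void
sketch_report_errors(char *out, size_t cap, const SketchStep *step,
					 int nextra, const SketchStep *const *extra)
{
	char		names[256];
	int			n;

	assert(cap > 0);
	out[0] = '\0';

	if (nextra == 0)
	{
		if (step->errormsg)
			snprintf(out, cap, "%s\n", step->errormsg);
		return;
	}

	/* Build the space-separated list of step names used as the prefix. */
	snprintf(names, sizeof(names), "%s", step->name);
	for (n = 0; n < nextra; n++)
	{
		size_t		len = strlen(names);

		snprintf(names + len, sizeof(names) - len, " %s", extra[n]->name);
	}

	if (step->errormsg)
		sketch_append_line(out, cap, names, step->errormsg);
	for (n = 0; n < nextra; n++)
	{
		if (extra[n]->errormsg)
			sketch_append_line(out, cap, names, extra[n]->errormsg);
	}
}
```

Under those assumptions, reporting s1cancel with s2detach as the one extra step yields the "error in steps s1cancel s2detach: ..." line from the expected file, and reporting s2detach alone yields the bare "ERROR: ..." line from the failing run.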
Thanks,
nm