Re: Continuing instability in insert-conflict-specconflict test - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Continuing instability in insert-conflict-specconflict test
Date
Msg-id 2744723.1598371481@sss.pgh.pa.us
In response to Re: Continuing instability in insert-conflict-specconflict test  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Continuing instability in insert-conflict-specconflict test  (Asim Praveen <pasim@vmware.com>)
Re: Continuing instability in insert-conflict-specconflict test  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
I wrote:
> I've spent the day fooling around with a re-implementation of
> isolationtester that waits for all its controlled sessions to quiesce
> (either wait for client input, or block on a lock held by another
> session) before moving on to the next step.  That was not a feasible
> approach before we had the wait_event infrastructure, but it's
> seeming like it might be workable now.  Still have a few issues to
> sort out though ...

I wasted a good deal of time on this idea, and eventually concluded
that it's a dead end, because there is an unremovable race condition.
Namely, that even if the isolationtester's observer backend has
observed that test session X has quiesced according to its
wait_event_info, it is possible for the report of that fact to arrive
at the isolationtester client process before test session X's output
does.

It's quite obvious how that might happen if the isolationtester is
on a different machine than the PG server --- just imagine a dropped
packet in X's output that has to be retransmitted.  You might think
it shouldn't happen within a single machine, but I'm seeing results
that I cannot explain any other way (on an 8-core RHEL8 box).
It appears to not be particularly rare, either.
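
As a rough illustration of that ordering problem, here is a toy model in Python. It is not isolationtester code: the `Message` class, the `client_view` helper, and the latency figures are all made up for the example. It assumes only that session X's output and the observer's report travel over separate connections with independent delivery delays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    channel: str     # which connection carries the message (hypothetical label)
    payload: str
    sent_at: float   # time the server side emitted it, in ms
    latency: float   # delivery delay on that connection, in ms (assumed value)

    @property
    def arrives_at(self) -> float:
        return self.sent_at + self.latency


def client_view(messages):
    """Return payloads in the order the client process receives them."""
    # sorted() is stable, so messages with equal arrival times keep send order.
    return [m.payload for m in sorted(messages, key=lambda m: m.arrives_at)]


# Session X sends its NOTICE first and only then goes idle; the observer
# notices the idle state afterwards and reports it on its own connection.
notice = Message("session X", "NOTICE from X", sent_at=0.0, latency=5.0)
quiesced = Message("observer", "X is quiesced", sent_at=1.0, latency=0.5)

print(client_view([notice, quiesced]))
```

Although the NOTICE was sent first, the client sees the "quiesced" report first whenever X's connection is slower by more than the gap between the two sends, which is all the race needs.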

> Andres Freund <andres@anarazel.de> writes:
>> ISTM the issue at hand isn't so much that X expects something to be
>> printed by Y before it terminates, but that it expects the next step to
>> not be executed before Y unlocks. If I understand the wrong output
>> correctly, what happens is that "controller_print_speculative_locks" is
>> executed, even though s1 hasn't yet acquired the next lock.

The problem as I'm now understanding it is that
insert-conflict-specconflict.spec issues multiple commands in sequence
to its "controller" session, and expects that NOTICE outputs from a
different test session will arrive at a determinate point in that
sequence.  In practice that's not guaranteed, because (a) the other
test session might not send the NOTICE soon enough --- as my modified
specfile proves --- and (b) even if the NOTICE is timely sent, the
kernel will not guarantee timely receipt.  We could fix (a) by
introducing some explicit interlock between the controller and test
sessions, but (b) is a killer.

I think what we have to do to salvage this test is to get rid of the
use of NOTICE outputs, and instead have the test functions insert
log records into some table, which we can inspect after the fact
to verify that things happened as we expect.
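
A minimal sketch of that log-table idea, using Python's `sqlite3` module as a stand-in for a PostgreSQL table. The table name `test_log`, the `log_event` and `events_for` helpers, and the event strings are hypothetical and are not taken from the spec file; the point is only that the record of what happened is read back from the table afterwards instead of depending on when messages arrive.

```python
import sqlite3

# In-memory database standing in for the server; one connection plays all roles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_log ("
    " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
    " session TEXT NOT NULL,"
    " event TEXT NOT NULL)"
)


def log_event(session, event):
    """What a test function would do instead of raising a NOTICE."""
    conn.execute(
        "INSERT INTO test_log (session, event) VALUES (?, ?)",
        (session, event),
    )


def events_for(session):
    """Inspect the log after the fact, in insertion order."""
    rows = conn.execute(
        "SELECT event FROM test_log WHERE session = ? ORDER BY seq",
        (session,),
    )
    return [row[0] for row in rows]


# Hypothetical events from two test sessions, interleaved.
log_event("s1", "before insert")
log_event("s2", "before insert")
log_event("s1", "after insert")

print(events_for("s1"))
```

The check then runs as an ordinary query at the end of the test, so its result does not depend on how output from different sessions interleaves on the way to the client.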

            regards, tom lane


