Thread: why can the isolation tester handle only one waiting process?

why can the isolation tester handle only one waiting process?

From
Robert Haas
Date:
I had the idea that it would be useful to have some regression tests
that verify that the deadlock checker works as advertised, because we
currently have just about zero test coverage for it.  And it's easy
enough to write a regression test that causes a simple 2-way deadlock.
But you can't test anything more interesting than that, because of
this limitation described in the README:

> Currently, at most one step can be waiting at a time.  As long as one
> step is waiting, subsequent steps are run to completion synchronously.

Is there any reason not to try to lift that restriction?  (And do we
think it's hard?)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: why can the isolation tester handle only one waiting process?

From
Alvaro Herrera
Date:
Robert Haas wrote:
> I had the idea that it would be useful to have some regression tests
> that verify that the deadlock checker works as advertised, because we
> currently have just about zero test coverage for it.  And it's easy
> enough to write a regression test that causes a simple 2-way deadlock.
> But you can't test anything more interesting than that, because of
> this limitation described in the README:
> 
> > Currently, at most one step can be waiting at a time.  As long as one
> > step is waiting, subsequent steps are run to completion synchronously.
> 
> Is there any reason not to try to lift that restriction?  (And do we
> think it's hard?)

It's just what we needed at the time, so we didn't press any further.
Feel free to lift the restriction if you're so inclined.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why can the isolation tester handle only one waiting process?

From
Robert Haas
Date:
On Fri, Aug 14, 2015 at 10:14 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> I had the idea that it would be useful to have some regression tests
>> that verify that the deadlock checker works as advertised, because we
>> currently have just about zero test coverage for it.  And it's easy
>> enough to write a regression test that causes a simple 2-way deadlock.
>> But you can't test anything more interesting than that, because of
>> this limitation described in the README:
>>
>> > Currently, at most one step can be waiting at a time.  As long as one
>> > step is waiting, subsequent steps are run to completion synchronously.
>>
>> Is there any reason not to try to lift that restriction?  (And do we
>> think it's hard?)
>
> It's just what we needed at the time, so we didn't press any further.
> Feel free to lift the restriction if you're so inclined.

Thanks for the reply.  It appears that this is to some degree a
semantically relevant restriction.  We assume that a step never
unblocks except when we run a future step, which is false in the case
of the deadlock detector.  However, once one step is waiting, we run
further steps to completion, which means that there's time for the
deadlock detector to kick in after all.  To make this useful for
deadlock testing, something further is needed.

The simplest thing to do seems to be to allow for a way to configure
the maximum number of processes that are expected to wait during a
particular test.  That could default to 1, the current behavior.
That's pretty coarse-grained, but I think it would be sufficient for
what I want to do.  Alternatively, we could try to provide a way to
attach information to the step, or within the permutation, indicating
the points at which we expect waiters to accumulate and to be
released.  I'm not sure exactly what that would look like, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: why can the isolation tester handle only one waiting process?

From
Alvaro Herrera
Date:
Robert Haas wrote:

> Thanks for the reply.  It appears that this is to some degree a
> semantically relevant restriction.  We assume that a step never
> unblocks except when we run a future step, which is false in the case
> of the deadlock detector.  However, once one step is waiting, we run
> further steps to completion, which means that there's time for the
> deadlock detector to kick in after all.  To make this useful for
> deadlock testing, something further is needed.
> 
> The simplest thing to do seems to be to allow for a way to configure
> the maximum number of processes that are expected to wait during a
> particular test.  That could default to 1, the current behavior.
> That's pretty coarse-grained, but I think it would be sufficient for
> what I want to do.  Alternatively, we could try to provide a way to
> attach information to the step, or within the permutation, indicating
> the points at which we expect waiters to accumulate and to be
> released.  I'm not sure exactly what that would look like, though.

Hmm, clearly you couldn't attach the info to the step itself, because a
step that blocks in one permutation doesn't necessarily block in every
permutation.  You could attach it to each step that needed it in the
permutation, but then it wouldn't work to leave permutation
specification out for such a test.  Maybe that's an acceptable
restriction if you cause the test to fail with a specific error instead
of stalling forever (which is what happens currently IIRC).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why can the isolation tester handle only one waiting process?

From
Robert Haas
Date:
On Fri, Aug 14, 2015 at 2:57 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Hmm, clearly you couldn't attach the info to the step itself, because a
> step that blocks in one permutation doesn't necessarily block in every
> permutation.  You could attach it to each step that needed it in the
> permutation, but then it wouldn't work to leave permutation
> specification out for such a test.  Maybe that's an acceptable
> restriction if you cause the test to fail with a specific error instead
> of stalling forever (which is what happens currently IIRC).

After some study, I think the best thing to do here is change the way
we handle the case where we reach a step that the use of a connection
that is currently blocked on a lock.  Right now, we handle that by
declaring the permutation invalid; I'd like to change that so that
consider that a cue to wait for that connnection to unblock itself.
This will require a number of tests that currently blindly run through
all permutations to specify a list of permutations, or they will hang.

But I'm not sure that's such a bad thing, because running through all
permutations in those cases provides no additional test coverage.
Each invalid permutation runs the sequence of steps only up until the
point where it chooses an invalid next step.  Therefore, each invalid
permutation is testing an initial prefix of the steps tested by some
valid permutation.  If the "invalid" permutation ceased to be invalid,
because the command at which we give up returned immediately rather
than waiting, that would also change the test output of the other,
valid test of which it is the initial prefix.  And therefore, at least
as it seems to me, testing the invalid permutations is just a waste of
CPU time, and we'd be better off not doing it.

Actually, I'm really rather wondering if the list of valid
permutations should also be pruned for some of these tests.  Some of
these output files are thousands of lines long, and I'm not sure that
somebody has really gone through that whole file and made sure that
the output of each permutation is expected.  And I'm sure some of them
are functionally identical.

Attached are three patches.  The first one modifies the isolation
tester to allow multiple sessions to wait, using the design described
above.  The second one adds permutation specifications to all of the
existing regression tests that have "invalid" permutations in them, so
that we don't waste time trying those permutations.  The third one
adds a couple of simple deadlock tests.  I'd like to add tests for
some more complicated scenarios involving soft deadlocks of various
kinds, but this gives you the idea of what I have in mind.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: why can the isolation tester handle only one waiting process?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> After some study, I think the best thing to do here is change the way
> we handle the case where we reach a step that the use of a connection
> that is currently blocked on a lock.  Right now, we handle that by
> declaring the permutation invalid; I'd like to change that so that
> consider that a cue to wait for that connnection to unblock itself.

Maybe a timeout there would be a good idea?  I see that the code
currently will wait forever in some cases, which seems bad enough,
but it seems like you want to increase the risk of that considerably.
        regards, tom lane



Re: why can the isolation tester handle only one waiting process?

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Fri, Aug 14, 2015 at 2:57 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Hmm, clearly you couldn't attach the info to the step itself, because a
> > step that blocks in one permutation doesn't necessarily block in every
> > permutation.  You could attach it to each step that needed it in the
> > permutation, but then it wouldn't work to leave permutation
> > specification out for such a test.  Maybe that's an acceptable
> > restriction if you cause the test to fail with a specific error instead
> > of stalling forever (which is what happens currently IIRC).
> 
> After some study, I think the best thing to do here is change the way
> we handle the case where we reach a step that the use of a connection
> that is currently blocked on a lock.  Right now, we handle that by
> declaring the permutation invalid; I'd like to change that so that
> consider that a cue to wait for that connnection to unblock itself.
> This will require a number of tests that currently blindly run through
> all permutations to specify a list of permutations, or they will hang.

Well, hanging forever may not be all that great.  Buildfarm animals with
test processes stuck probably won't be happy.  Maybe put a cap on the
time we're willing to wait; something like a minute should suffice for
all reasonable tests.  At the same time I wonder if iterating as quickly
as possible is really such a hot idea; why don't we sleep even 100ms if
nothing is to be done immediately?  That would reduce log traffic if you
have log_statements=all, for one thing ...

I guess (from a patch author perspective) we can just use
isolationtester -n to produce appropriate permutation lines when
developing a spec file, and then prune the ones causing trouble.

FWIW I tried this with the spec I posted at 
http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org
and it seems to work fine (modulo a bug in the spec itself).  I didn't
try reverting the patch that fixed the bug.

> But I'm not sure that's such a bad thing, because running through all
> permutations in those cases provides no additional test coverage.
> Each invalid permutation runs the sequence of steps only up until the
> point where it chooses an invalid next step.  Therefore, each invalid
> permutation is testing an initial prefix of the steps tested by some
> valid permutation.  If the "invalid" permutation ceased to be invalid,
> because the command at which we give up returned immediately rather
> than waiting, that would also change the test output of the other,
> valid test of which it is the initial prefix.  And therefore, at least
> as it seems to me, testing the invalid permutations is just a waste of
> CPU time, and we'd be better off not doing it.

Well, the number of tests that actually exercise this is not large.
More time is spent in the timeout test, ISTM (even though the CPU is
sleeping during that, but it's still wasted clock time).

> Actually, I'm really rather wondering if the list of valid
> permutations should also be pruned for some of these tests.  Some of
> these output files are thousands of lines long, and I'm not sure that
> somebody has really gone through that whole file and made sure that
> the output of each permutation is expected.  And I'm sure some of them
> are functionally identical.

No objections there, but alter-table-1 and alter-table-2 seem to be the
only tests that have thousands of lines long of expected output and also
have invalid permutations in the expected output.  The only others with
1k+ lines are two-ids, receipt-reports and prepared-transactions, which
don't have invalid permutations.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why can the isolation tester handle only one waiting process?

From
Robert Haas
Date:
On Sat, Aug 15, 2015 at 1:17 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> After some study, I think the best thing to do here is change the way
>> we handle the case where we reach a step that the use of a connection
>> that is currently blocked on a lock.  Right now, we handle that by
>> declaring the permutation invalid; I'd like to change that so that
>> consider that a cue to wait for that connnection to unblock itself.
>> This will require a number of tests that currently blindly run through
>> all permutations to specify a list of permutations, or they will hang.
>
> Well, hanging forever may not be all that great.  Buildfarm animals with
> test processes stuck probably won't be happy.  Maybe put a cap on the
> time we're willing to wait; something like a minute should suffice for
> all reasonable tests.

Good idea.  Here's an updated patch series that takes that approach.
It cancels any query after 60 seconds of waiting, and if the query
doesn't respond to the cancel, then it bails out completely after 75
seconds (i.e. 15 seconds after attempting the cancel).

> At the same time I wonder if iterating as quickly
> as possible is really such a hot idea; why don't we sleep even 100ms if
> nothing is to be done immediately?  That would reduce log traffic if you
> have log_statements=all, for one thing ...

We could certainly do that; I'm not sure it really matters, though.
We'd probably make the test runtime a bit longer in exchange for a bit
less log chatter.  I'm not that excited about it either way.

> I guess (from a patch author perspective) we can just use
> isolationtester -n to produce appropriate permutation lines when
> developing a spec file, and then prune the ones causing trouble.

Yes.  And honestly, if you're not going to go through and look at what
happens in all the permutations, then why bother adding the test
anyway?  At that point you're just testing that nothing ever changes,
not that anything was correct.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: why can the isolation tester handle only one waiting process?

From
Robert Haas
Date:
On Mon, Aug 17, 2015 at 5:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Good idea.  Here's an updated patch series that takes that approach.
> It cancels any query after 60 seconds of waiting, and if the query
> doesn't respond to the cancel, then it bails out completely after 75
> seconds (i.e. 15 seconds after attempting the cancel).

Here's an updated patch series with some more improvements to the
isolationtester code, and some better test cases.  I now have a test
for (a) a "simple" deadlock, involving a lock upgrade scenario where
the process seeking the upgrade must jump the wait queue; (b) a hard
deadlock; and (c) a soft deadlock that can be resolved by reordering
the wait queue.  According to lcov this tests most of deadlock.c: 10
of 11 functions (not GetBlockingAutoVacuumPgproc), and 246 of 291
lines.  That's clearly an improvement over the status quo, but I'm
having a hard time feeling happy about it, because it's really only
testing the easy cases.

I can't construct a case where reversing any one single soft edge
doesn't immediately resolve the deadlock (see end of
TestConfigurationRecurse); the first one tried always works.  Moving a
process that would otherwise deadlock ahead of conflicting waiters
seems to be an extremely effective way of resolving deadlocks.  For it
to fail, reversing one of the edges in the waits-for graph must create
a new cycle.  But it seems to be quite hard for that to actually
happen: the new edge that is created after the reversal points to the
guy that got skipped ahead in the queue.  For that reversed edge to be
part of a cycle, the queue-skipping process has to be directly or
indirectly waiting for some other process he jumped over.  But,
clearly, he's only waiting for processes that are *still ahead* of him
in the queue, and he would have had to wait for those processes
whether he'd skipped ahead in the queue or not.  So perhaps a test
case here would involve a process that skips forward in the lock
queue, but not far enough?  But I haven't been able to figure it out.

I also can't construct a test case where ExpandConstraints returns
false (see TestConfiguration); the wait orderings it generates are
always self-consistent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: why can the isolation tester handle only one waiting process?

From
Robert Haas
Date:
On Tue, Sep 8, 2015 at 5:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's an updated patch series with some more improvements to the
> isolationtester code, and some better test cases.

OK, here's a final set of patches that I intend to commit in the next
few days if nobody objects, per discussion on the thread about
parallelism fixes.  It's basically the same as the previous set, but
there's a bug fix and an additional test case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: why can the isolation tester handle only one waiting process?

From
Andres Freund
Date:
On February 9, 2016 7:12:23 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>On Tue, Sep 8, 2015 at 5:11 PM, Robert Haas <robertmhaas@gmail.com>
>wrote:
>> Here's an updated patch series with some more improvements to the
>> isolationtester code, and some better test cases.
>
>OK, here's a final set of patches that I intend to commit in the next
>few days if nobody objects, per discussion on the thread about
>parallelism fixes.  It's basically the same as the previous set, but
>there's a bug fix and an additional test case.

Fwiw, as the person starting the ruckus over there, I think something like isolationtester has a much lower need for
carefulreview, consensus et al than say lock.c and deadlock.c.
 

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.