Thread: why can the isolation tester handle only one waiting process?
I had the idea that it would be useful to have some regression tests that
verify that the deadlock checker works as advertised, because we currently
have just about zero test coverage for it. And it's easy enough to write a
regression test that causes a simple 2-way deadlock. But you can't test
anything more interesting than that, because of this limitation described
in the README:

> Currently, at most one step can be waiting at a time. As long as one
> step is waiting, subsequent steps are run to completion synchronously.

Is there any reason not to try to lift that restriction? (And do we
think it's hard?)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
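For concreteness, a simple 2-way deadlock of the kind described can be provoked with a spec file along these lines. This is only a sketch: the table, session, and step names are invented, though the grammar follows the usual isolationtester spec format. One of the two sessions will be chosen as the deadlock victim, so the expected output has to account for the resulting ERROR.

```
# Two sessions each update one row, then try to update the other's;
# any permutation that interleaves the first two updates deadlocks.
setup
{
  CREATE TABLE dl (id int PRIMARY KEY);
  INSERT INTO dl VALUES (1), (2);
}

teardown
{
  DROP TABLE dl;
}

session "s1"
setup        { BEGIN; }
step "s1u1"  { UPDATE dl SET id = id WHERE id = 1; }
step "s1u2"  { UPDATE dl SET id = id WHERE id = 2; }
step "s1c"   { COMMIT; }

session "s2"
setup        { BEGIN; }
step "s2u2"  { UPDATE dl SET id = id WHERE id = 2; }
step "s2u1"  { UPDATE dl SET id = id WHERE id = 1; }
step "s2c"   { COMMIT; }

permutation "s1u1" "s2u2" "s1u2" "s2u1" "s1c" "s2c"
```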
Robert Haas wrote:

> I had the idea that it would be useful to have some regression tests
> that verify that the deadlock checker works as advertised, because we
> currently have just about zero test coverage for it. And it's easy
> enough to write a regression test that causes a simple 2-way deadlock.
> But you can't test anything more interesting than that, because of
> this limitation described in the README:
>
> > Currently, at most one step can be waiting at a time. As long as one
> > step is waiting, subsequent steps are run to completion synchronously.
>
> Is there any reason not to try to lift that restriction? (And do we
> think it's hard?)

It's just what we needed at the time, so we didn't press any further.
Feel free to lift the restriction if you're so inclined.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Aug 14, 2015 at 10:14 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> I had the idea that it would be useful to have some regression tests
>> that verify that the deadlock checker works as advertised, because we
>> currently have just about zero test coverage for it. And it's easy
>> enough to write a regression test that causes a simple 2-way deadlock.
>> But you can't test anything more interesting than that, because of
>> this limitation described in the README:
>>
>> > Currently, at most one step can be waiting at a time. As long as one
>> > step is waiting, subsequent steps are run to completion synchronously.
>>
>> Is there any reason not to try to lift that restriction? (And do we
>> think it's hard?)
>
> It's just what we needed at the time, so we didn't press any further.
> Feel free to lift the restriction if you're so inclined.

Thanks for the reply. It appears that this is to some degree a
semantically relevant restriction. We assume that a step never unblocks
except when we run a future step, which is false in the case of the
deadlock detector. However, once one step is waiting, we run further
steps to completion, which means that there's time for the deadlock
detector to kick in after all. To make this useful for deadlock testing,
something further is needed.

The simplest thing to do seems to be to allow for a way to configure the
maximum number of processes that are expected to wait during a
particular test. That could default to 1, the current behavior. That's
pretty coarse-grained, but I think it would be sufficient for what I
want to do. Alternatively, we could try to provide a way to attach
information to the step, or within the permutation, indicating the
points at which we expect waiters to accumulate and to be released. I'm
not sure exactly what that would look like, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:

> Thanks for the reply. It appears that this is to some degree a
> semantically relevant restriction. We assume that a step never
> unblocks except when we run a future step, which is false in the case
> of the deadlock detector. However, once one step is waiting, we run
> further steps to completion, which means that there's time for the
> deadlock detector to kick in after all. To make this useful for
> deadlock testing, something further is needed.
>
> The simplest thing to do seems to be to allow for a way to configure
> the maximum number of processes that are expected to wait during a
> particular test. That could default to 1, the current behavior.
> That's pretty coarse-grained, but I think it would be sufficient for
> what I want to do. Alternatively, we could try to provide a way to
> attach information to the step, or within the permutation, indicating
> the points at which we expect waiters to accumulate and to be
> released. I'm not sure exactly what that would look like, though.

Hmm, clearly you couldn't attach the info to the step itself, because a
step that blocks in one permutation doesn't necessarily block in every
permutation. You could attach it to each step that needed it in the
permutation, but then it wouldn't work to leave permutation
specification out for such a test. Maybe that's an acceptable
restriction if you cause the test to fail with a specific error instead
of stalling forever (which is what happens currently IIRC).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
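To make the per-permutation idea concrete, one could imagine annotating individual steps within a permutation line, along these lines. This syntax is purely hypothetical; it was never part of any patch, and the marker names are invented:

```
# Hypothetical: "(waits)" marks a step expected to block, and
# "(wakes ...)" marks the step expected to release an earlier waiter.
permutation "s1a" "s2a"(waits) "s1b"(wakes s2a) "s2b"
```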
On Fri, Aug 14, 2015 at 2:57 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Hmm, clearly you couldn't attach the info to the step itself, because a
> step that blocks in one permutation doesn't necessarily block in every
> permutation. You could attach it to each step that needed it in the
> permutation, but then it wouldn't work to leave permutation
> specification out for such a test. Maybe that's an acceptable
> restriction if you cause the test to fail with a specific error instead
> of stalling forever (which is what happens currently IIRC).

After some study, I think the best thing to do here is change the way we
handle the case where we reach a step that requires the use of a
connection that is currently blocked on a lock. Right now, we handle
that by declaring the permutation invalid; I'd like to change that so
that we consider that a cue to wait for that connection to unblock
itself.

This will require a number of tests that currently blindly run through
all permutations to specify a list of permutations, or they will hang.
But I'm not sure that's such a bad thing, because running through all
permutations in those cases provides no additional test coverage. Each
invalid permutation runs the sequence of steps only up until the point
where it chooses an invalid next step. Therefore, each invalid
permutation is testing an initial prefix of the steps tested by some
valid permutation. If the "invalid" permutation ceased to be invalid,
because the command at which we give up returned immediately rather than
waiting, that would also change the test output of the other, valid test
of which it is the initial prefix. And therefore, at least as it seems
to me, testing the invalid permutations is just a waste of CPU time, and
we'd be better off not doing it.

Actually, I'm really rather wondering if the list of valid permutations
should also be pruned for some of these tests. Some of these output
files are thousands of lines long, and I'm not sure that anybody has
really gone through those files and made sure that the output of each
permutation is expected. And I'm sure some of them are functionally
identical.

Attached are three patches. The first one modifies the isolation tester
to allow multiple sessions to wait, using the design described above.
The second one adds permutation specifications to all of the existing
regression tests that have "invalid" permutations in them, so that we
don't waste time trying those permutations. The third one adds a couple
of simple deadlock tests. I'd like to add tests for some more
complicated scenarios involving soft deadlocks of various kinds, but
this gives you the idea of what I have in mind.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes:
> After some study, I think the best thing to do here is change the way
> we handle the case where we reach a step that requires the use of a
> connection that is currently blocked on a lock. Right now, we handle
> that by declaring the permutation invalid; I'd like to change that so
> that we consider that a cue to wait for that connection to unblock
> itself.

Maybe a timeout there would be a good idea? I see that the code
currently will wait forever in some cases, which seems bad enough, but
it seems like you want to increase the risk of that considerably.

			regards, tom lane
Robert Haas wrote:
> On Fri, Aug 14, 2015 at 2:57 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Hmm, clearly you couldn't attach the info to the step itself, because a
> > step that blocks in one permutation doesn't necessarily block in every
> > permutation. You could attach it to each step that needed it in the
> > permutation, but then it wouldn't work to leave permutation
> > specification out for such a test. Maybe that's an acceptable
> > restriction if you cause the test to fail with a specific error instead
> > of stalling forever (which is what happens currently IIRC).
>
> After some study, I think the best thing to do here is change the way
> we handle the case where we reach a step that requires the use of a
> connection that is currently blocked on a lock. Right now, we handle
> that by declaring the permutation invalid; I'd like to change that so
> that we consider that a cue to wait for that connection to unblock
> itself.
> This will require a number of tests that currently blindly run through
> all permutations to specify a list of permutations, or they will hang.

Well, hanging forever may not be all that great. Buildfarm animals with
test processes stuck probably won't be happy. Maybe put a cap on the
time we're willing to wait; something like a minute should suffice for
all reasonable tests. At the same time I wonder if iterating as quickly
as possible is really such a hot idea; why don't we sleep even 100ms if
nothing is to be done immediately? That would reduce log traffic if you
have log_statement=all, for one thing ...

I guess (from a patch author perspective) we can just use
isolationtester -n to produce appropriate permutation lines when
developing a spec file, and then prune the ones causing trouble.

FWIW I tried this with the spec I posted at
http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org
and it seems to work fine (modulo a bug in the spec itself). I didn't
try reverting the patch that fixed the bug.

> But I'm not sure that's such a bad thing, because running through all
> permutations in those cases provides no additional test coverage.
> Each invalid permutation runs the sequence of steps only up until the
> point where it chooses an invalid next step. Therefore, each invalid
> permutation is testing an initial prefix of the steps tested by some
> valid permutation. If the "invalid" permutation ceased to be invalid,
> because the command at which we give up returned immediately rather
> than waiting, that would also change the test output of the other,
> valid test of which it is the initial prefix. And therefore, at least
> as it seems to me, testing the invalid permutations is just a waste of
> CPU time, and we'd be better off not doing it.

Well, the number of tests that actually exercise this is not large.
More time is spent in the timeout test, ISTM (even though the CPU is
sleeping during that, it's still wasted clock time).

> Actually, I'm really rather wondering if the list of valid
> permutations should also be pruned for some of these tests. Some of
> these output files are thousands of lines long, and I'm not sure that
> somebody has really gone through that whole file and made sure that
> the output of each permutation is expected. And I'm sure some of them
> are functionally identical.

No objections there, but alter-table-1 and alter-table-2 seem to be the
only tests that have thousands of lines of expected output and also have
invalid permutations in the expected output. The only others with 1k+
lines are two-ids, receipt-reports and prepared-transactions, which
don't have invalid permutations.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Aug 15, 2015 at 1:17 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> After some study, I think the best thing to do here is change the way
>> we handle the case where we reach a step that requires the use of a
>> connection that is currently blocked on a lock. Right now, we handle
>> that by declaring the permutation invalid; I'd like to change that so
>> that we consider that a cue to wait for that connection to unblock
>> itself.
>> This will require a number of tests that currently blindly run through
>> all permutations to specify a list of permutations, or they will hang.
>
> Well, hanging forever may not be all that great. Buildfarm animals with
> test processes stuck probably won't be happy. Maybe put a cap on the
> time we're willing to wait; something like a minute should suffice for
> all reasonable tests.

Good idea. Here's an updated patch series that takes that approach. It
cancels any query after 60 seconds of waiting, and if the query doesn't
respond to the cancel, then it bails out completely after 75 seconds
(i.e. 15 seconds after attempting the cancel).

> At the same time I wonder if iterating as quickly
> as possible is really such a hot idea; why don't we sleep even 100ms if
> nothing is to be done immediately? That would reduce log traffic if you
> have log_statement=all, for one thing ...

We could certainly do that; I'm not sure it really matters, though.
We'd probably make the test runtime a bit longer in exchange for a bit
less log chatter. I'm not that excited about it either way.

> I guess (from a patch author perspective) we can just use
> isolationtester -n to produce appropriate permutation lines when
> developing a spec file, and then prune the ones causing trouble.

Yes. And honestly, if you're not going to go through and look at what
happens in all the permutations, then why bother adding the test anyway?
At that point you're just testing that nothing ever changes, not that
anything was correct.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
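The 60/75-second scheme amounts to a two-stage timeout wrapped around the tester's wait loop: first try to cancel the stuck query, then give up entirely if even the cancel doesn't help. A rough sketch of that control flow in Python (this is not the actual C implementation; the function names and the polling interval are invented for illustration):

```python
import time

def run_with_two_stage_timeout(poll_done, send_cancel,
                               cancel_after=60.0, bail_after=75.0):
    """Poll until poll_done() reports the step finished.  After
    cancel_after seconds, call send_cancel() once (soft failure: the
    query should error out and the test keeps going); after bail_after
    seconds, give up entirely (hard failure)."""
    start = time.monotonic()
    cancelled = False
    while not poll_done():
        elapsed = time.monotonic() - start
        if elapsed >= bail_after:
            return "timed out"       # bail out completely
        if not cancelled and elapsed >= cancel_after:
            send_cancel()            # e.g. a query cancel on the backend
            cancelled = True
        time.sleep(0.01)             # polling interval, invented here
    return "cancelled" if cancelled else "completed"
```

Because the cancel is only sent once and the hard deadline keeps ticking regardless, a backend that ignores the cancel still cannot stall a buildfarm run forever.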
On Mon, Aug 17, 2015 at 5:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Good idea. Here's an updated patch series that takes that approach.
> It cancels any query after 60 seconds of waiting, and if the query
> doesn't respond to the cancel, then it bails out completely after 75
> seconds (i.e. 15 seconds after attempting the cancel).

Here's an updated patch series with some more improvements to the
isolationtester code, and some better test cases. I now have a test for
(a) a "simple" deadlock, involving a lock upgrade scenario where the
process seeking the upgrade must jump the wait queue; (b) a hard
deadlock; and (c) a soft deadlock that can be resolved by reordering the
wait queue. According to lcov this tests most of deadlock.c: 10 of 11
functions (not GetBlockingAutoVacuumPgproc), and 246 of 291 lines.

That's clearly an improvement over the status quo, but I'm having a hard
time feeling happy about it, because it's really only testing the easy
cases. I can't construct a case where reversing any one single soft
edge doesn't immediately resolve the deadlock (see end of
TestConfigurationRecurse); the first one tried always works. Moving a
process that would otherwise deadlock ahead of conflicting waiters seems
to be an extremely effective way of resolving deadlocks. For it to
fail, reversing one of the edges in the waits-for graph must create a
new cycle. But it seems to be quite hard for that to actually happen:
the new edge that is created after the reversal points to the guy that
got skipped ahead in the queue. For that reversed edge to be part of a
cycle, the queue-skipping process has to be directly or indirectly
waiting for some other process he jumped over. But, clearly, he's only
waiting for processes that are *still ahead* of him in the queue, and he
would have had to wait for those processes whether he'd skipped ahead in
the queue or not. So perhaps a test case here would involve a process
that skips forward in the lock queue, but not far enough? But I haven't
been able to figure it out.

I also can't construct a test case where ExpandConstraints returns false
(see TestConfiguration); the wait orderings it generates are always
self-consistent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
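The waits-for reasoning above is easier to follow with a toy model. The sketch below (plain Python, not the actual deadlock.c data structures; all names are invented) represents the graph as a mapping from each waiter to the set of processes blocking it, and models "skipping ahead in the queue" as reversing a single soft edge:

```python
def has_cycle(waits_for):
    """DFS cycle detection on a waits-for graph {proc: set of blockers}."""
    nodes = set(waits_for)
    for blockers in waits_for.values():
        nodes |= blockers
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(nodes, WHITE)

    def visit(p):
        color[p] = GRAY                    # on the current DFS path
        for q in waits_for.get(p, set()):
            if color[q] == GRAY or (color[q] == WHITE and visit(q)):
                return True                # back edge: found a cycle
        color[p] = BLACK                   # fully explored
        return False

    return any(visit(p) for p in nodes if color[p] == WHITE)

def reverse_soft_edge(waits_for, waiter, blocker):
    """Model letting `waiter` skip ahead of `blocker` in the wait queue:
    the soft edge waiter -> blocker disappears, and blocker now waits
    for waiter instead."""
    g = {p: set(b) for p, b in waits_for.items()}
    g.setdefault(waiter, set()).discard(blocker)
    g.setdefault(blocker, set()).add(waiter)
    return g
```

With A soft-waiting on B and B hard-waiting on A, the graph has a cycle; reversing the one soft edge removes it, which mirrors the observation that in these simple cases the first reversal tried always resolves the deadlock.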
On Tue, Sep 8, 2015 at 5:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's an updated patch series with some more improvements to the
> isolationtester code, and some better test cases.

OK, here's a final set of patches that I intend to commit in the next
few days if nobody objects, per discussion on the thread about
parallelism fixes. It's basically the same as the previous set, but
there's a bug fix and an additional test case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On February 9, 2016 7:12:23 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>On Tue, Sep 8, 2015 at 5:11 PM, Robert Haas <robertmhaas@gmail.com>
>wrote:
>> Here's an updated patch series with some more improvements to the
>> isolationtester code, and some better test cases.
>
>OK, here's a final set of patches that I intend to commit in the next
>few days if nobody objects, per discussion on the thread about
>parallelism fixes. It's basically the same as the previous set, but
>there's a bug fix and an additional test case.

FWIW, as the person starting the ruckus over there, I think something
like isolationtester has a much lower need for careful review,
consensus et al. than, say, lock.c and deadlock.c.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.