Thread: spoonbill vs. -HEAD

spoonbill vs. -HEAD

From: Stefan Kaltenbrunner
Hi all!


I finally started to investigate why spoonbill stopped reporting
feedback to the buildfarm about 2 months ago.
It seems that the foreign-keys locking patch (or something committed
very close to January 23rd) broke it in a fairly annoying way - running
the buildfarm script seems to consistently "stall" during the
isolationtester part of the regression testing, leaving the postgresql
instance running and causing all future buildfarm runs to fail...


The process listing at that time looks like:

https://www.kaltenbrunner.cc/files/process_listing.txt


pg_stat_activity of the running instance:


https://www.kaltenbrunner.cc/files/pg_stat_activity.txt


pg_locks:

https://www.kaltenbrunner.cc/files/pg_locks.txt


backtraces of the three backends:

https://www.kaltenbrunner.cc/files/bt_20467.txt
https://www.kaltenbrunner.cc/files/bt_20897.txt
https://www.kaltenbrunner.cc/files/bt_24285.txt




Stefan



Re: spoonbill vs. -HEAD

From: Tom Lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> I finally started to investigate why spoonbill stopped reporting
> feedback to the buildfarm about 2 months ago.
> It seems that the foreign-keys locking patch (or something committed
> very close to January 23rd) broke it in a fairly annoying way - running
> the buildfarm script seems to consistently "stall" during the
> isolationtester part of the regression testing, leaving the postgresql
> instance running and causing all future buildfarm runs to fail...

It looks from here like the isolationtester client is what's dropping
the ball --- the backend states are unsurprising, and two of them are
waiting for a new client command.  Can you get a stack trace from the
isolationtester process?
        regards, tom lane



Re: spoonbill vs. -HEAD

From: Stefan Kaltenbrunner
On 03/26/2013 08:45 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> I finally started to investigate why spoonbill stopped reporting
>> feedback to the buildfarm about 2 months ago.
>> It seems that the foreign-keys locking patch (or something committed
>> very close to January 23rd) broke it in a fairly annoying way - running
>> the buildfarm script seems to consistently "stall" during the
>> isolationtester part of the regression testing, leaving the postgresql
>> instance running and causing all future buildfarm runs to fail...
> 
> It looks from here like the isolationtester client is what's dropping
> the ball --- the backend states are unsurprising, and two of them are
> waiting for a new client command.  Can you get a stack trace from the
> isolationtester process?


https://www.kaltenbrunner.cc/files/isolationtester.txt


Stefan



Re: spoonbill vs. -HEAD

From: Tom Lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 03/26/2013 08:45 PM, Tom Lane wrote:
>> It looks from here like the isolationtester client is what's dropping
>> the ball --- the backend states are unsurprising, and two of them are
>> waiting for a new client command.  Can you get a stack trace from the
>> isolationtester process?

> https://www.kaltenbrunner.cc/files/isolationtester.txt

Hmm ... isolationtester.c:584 is in the code that tries to cancel the
current permutation (test case) after realizing that it's constructed
an invalid permutation.  It looks like the preceding PQcancel() failed
for some reason, since the waiting backend is still waiting.  The
isolationtester code doesn't bother to check for an error result there,
which is kinda bad, not that it's clear what it could do about it.
Could you look at the contents of the local variable "buf" in the
run_permutation stack frame?  Or else try modifying the code along the
lines of

-                PQcancel(cancel, buf, sizeof(buf));
+                if (!PQcancel(cancel, buf, sizeof(buf)))
+                  fprintf(stderr, "PQcancel failed: %s\n", buf);

and see if it prints anything interesting before hanging up.
        regards, tom lane
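For context, the libpq cancel sequence that isolationtester relies on looks
roughly like the sketch below (illustrative only, not the actual
isolationtester code; cancel_query is a made-up helper name):

#include <stdio.h>
#include <libpq-fe.h>

static void
cancel_query(PGconn *conn)
{
    char        errbuf[256];
    PGcancel   *cancel = PQgetCancel(conn);

    if (cancel == NULL)
        return;

    /* PQcancel returns 0 on failure and leaves an error message in errbuf */
    if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
        fprintf(stderr, "PQcancel failed: %s\n", errbuf);

    PQfreeCancel(cancel);
}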



Re: spoonbill vs. -HEAD

From: Andrew Dunstan
On 03/26/2013 02:50 PM, Stefan Kaltenbrunner wrote:
> Hi all!
>
>
> I finally started to investigate why spoonbill stopped reporting
> feedback to the buildfarm about 2 months ago.
> It seems that the foreign-keys locking patch (or something committed
> very close to January 23rd) broke it in a fairly annoying way - running
> the buildfarm script seems to consistently "stall" during the
> isolationtester part of the regression testing, leaving the postgresql
> instance running and causing all future buildfarm runs to fail...



There is some timeout code already in the buildfarm client. It was 
originally put there to help us when we got CVS hangs, a not infrequent 
occurrence in the early days, so it's currently only used if configured 
for the checkout phase, but it could easily be used to create a build 
timeout which would kill the whole process group if the timeout expired. 
It wouldn't work on Windows, and of course it won't solve whatever 
problem caused the hang in the first place, but it still might be worth 
doing.

cheers

andrew
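The mechanism Andrew describes is simple to sketch, even though the
buildfarm client itself is Perl; the following is a purely illustrative C
sketch (the build command name is made up), assuming the child is put into
its own process group so the whole group can be killed when the timeout
expires:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    int     timeout = 3600;     /* seconds; hypothetical value */
    int     status;
    pid_t   child = fork();

    if (child == 0)
    {
        /* child: start a new process group, then run the build step */
        setpgid(0, 0);
        execl("/bin/sh", "sh", "-c", "exec run_whole_build", (char *) NULL);
        _exit(127);             /* exec failed */
    }

    setpgid(child, child);      /* set it from the parent too, avoiding a race */

    /* wait for the child, checking once per second for the timeout */
    while (timeout-- > 0)
    {
        if (waitpid(child, &status, WNOHANG) == child)
            return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
        sleep(1);
    }

    /* timeout expired: kill the whole process group (negative pid) */
    fprintf(stderr, "build timed out, killing process group %d\n", (int) child);
    kill(-child, SIGKILL);
    waitpid(child, &status, 0);
    return 1;
}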




Re: spoonbill vs. -HEAD

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> There is some timeout code already in the buildfarm client. It was 
> originally put there to help us when we got CVS hangs, a not infrequent 
> occurrence in the early days, so it's currently only used if configured 
> for the checkout phase, but it could easily be used to create a build 
> timeout which would kill the whole process group if the timeout expired. 
> It wouldn't work on Windows, and of course it won't solve whatever 
> problem caused the hang in the first place, but it still might be worth 
> doing.

+1 --- at least then we'd get reports of failures, rather than the
current behavior where the animal just stops reporting.
        regards, tom lane



Re: spoonbill vs. -HEAD

From: Stefan Kaltenbrunner
On 03/26/2013 09:33 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> On 03/26/2013 08:45 PM, Tom Lane wrote:
>>> It looks from here like the isolationtester client is what's dropping
>>> the ball --- the backend states are unsurprising, and two of them are
>>> waiting for a new client command.  Can you get a stack trace from the
>>> isolationtester process?
> 
>> https://www.kaltenbrunner.cc/files/isolationtester.txt
> 
> Hmm ... isolationtester.c:584 is in the code that tries to cancel the
> current permutation (test case) after realizing that it's constructed
> an invalid permutation.  It looks like the preceding PQcancel() failed
> for some reason, since the waiting backend is still waiting.  The
> isolationtester code doesn't bother to check for an error result there,
> which is kinda bad, not that it's clear what it could do about it.
> Could you look at the contents of the local variable "buf" in the
> run_permutation stack frame?  Or else try modifying the code along the
> lines of
> 
> -                PQcancel(cancel, buf, sizeof(buf));
> +                if (!PQcancel(cancel, buf, sizeof(buf)))
> +                  fprintf(stderr, "PQcancel failed: %s\n", buf);
> 
> and see if it prints anything interesting before hanging up.

hmm - will look into that in a bit - but I also just noticed that on the
same day spoonbill broke there was also a commit to that file, adding
the fflush() calls immediately before that code.


Stefan



Re: spoonbill vs. -HEAD

From: Tom Lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> hmm - will look into that in a bit - but I also just noticed that on the
> same day spoonbill broke there was also a commit to that file, adding
> the fflush() calls immediately before that code.

It's hard to see how those would be related to this symptom.  My bet
is that the new fk-deadlock test exposed some pre-existing issue in
isolationtester.  Not quite clear what yet, though.

A different line of thought is that the cancel was received by the
backend but didn't succeed in cancelling the query for some reason.
        regards, tom lane



Re: spoonbill vs. -HEAD

From: Stefan Kaltenbrunner
On 03/26/2013 11:30 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> hmm - will look into that in a bit - but I also just noticed that on the
>> same day spoonbill broke there was also a commit to that file, adding
>> the fflush() calls immediately before that code.
> 
> It's hard to see how those would be related to this symptom.  My bet
> is that the new fk-deadlock test exposed some pre-existing issue in
> isolationtester.  Not quite clear what yet, though.

yeah, removing them does not seem to change the behaviour at all.


> 
> A different line of thought is that the cancel was received by the
> backend but didn't succeed in cancelling the query for some reason.

I added the "PQcancel failed" codepath you suggested, but it does not
seem to get triggered at all, so the above might actually be what is
happening...


Stefan



Re: spoonbill vs. -HEAD

From: Stefan Kaltenbrunner
On 03/26/2013 10:18 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> There is some timeout code already in the buildfarm client. It was 
>> originally put there to help us when we got CVS hangs, a not infrequent 
>> occurrence in the early days, so it's currently only used if configured 
>> for the checkout phase, but it could easily be used to create a build 
>> timeout which would kill the whole process group if the timeout expired. 
>> It wouldn't work on Windows, and of course it won't solve whatever 
>> problem caused the hang in the first place, but it still might be worth 
>> doing.
> 
> +1 --- at least then we'd get reports of failures, rather than the
> current behavior where the animal just stops reporting.

yeah, I have had multiple buildfarm members running into similar issues
(like the still-unexplained issues on spoonbill from a year back:
http://www.postgresql.org/message-id/4FE4B674.3020500@kaltenbrunner.cc),
so I would really like to see an option for a global timeout.



Stefan



Re: spoonbill vs. -HEAD

From: Andrew Dunstan
On 03/27/2013 03:49 PM, Stefan Kaltenbrunner wrote:
> On 03/26/2013 10:18 PM, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> There is some timeout code already in the buildfarm client. It was
>>> originally put there to help us when we got CVS hangs, a not infrequent
>>> occurrence in the early days, so it's currently only used if configured
>>> for the checkout phase, but it could easily be used to create a build
>>> timeout which would kill the whole process group if the timeout expired.
>>> It wouldn't work on Windows, and of course it won't solve whatever
>>> problem caused the hang in the first place, but it still might be worth
>>> doing.
>> +1 --- at least then we'd get reports of failures, rather than the
>> current behavior where the animal just stops reporting.
> yeah, I have had multiple buildfarm members running into similar issues
> (like the still-unexplained issues on spoonbill from a year back:
> http://www.postgresql.org/message-id/4FE4B674.3020500@kaltenbrunner.cc),
> so I would really like to see an option for a global timeout.
>

I have been looking at it briefly. There are a few wrinkles, but it's on 
the TODO list.

cheers

andrew







Re: spoonbill vs. -HEAD

From: Tom Lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 03/26/2013 11:30 PM, Tom Lane wrote:
>> A different line of thought is that the cancel was received by the
>> backend but didn't succeed in cancelling the query for some reason.

> I added the "PQcancel failed" codepath you suggested, but it does not
> seem to get triggered at all, so the above might actually be what is
> happening...

Stefan was kind enough to grant me access to spoonbill, and after
some experimentation I found out the problem.  It seems that OpenBSD
blocks additional deliveries of a signal while the signal handler is
in progress, and that this is implemented by just calling sigprocmask()
before and after calling the handler.  Therefore, if the handler doesn't
return normally --- like, say, it longjmps --- the restoration of the
previous mask never happens.  So we're left with the signal still
blocked, meaning second and subsequent attempts to interrupt the backend
don't work.

This isn't revealed by the regular regression tests because they don't
exercise PQcancel, but several recently-added isolation tests do attempt
to PQcancel the same backend more than once.

It's a bit surprising that it's taken us this long to recognize the
problem.  Typical use of PQcancel doesn't necessarily cause a failure:
StatementCancelHandler() won't exit through longjmp unless
ImmediateInterruptOK is true, which is only the case while waiting for a
heavyweight lock.  But still, you'd think somebody would've run into
the case in normal usage.

I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
StatementCancelHandler() and any other handlers that might exit via
longjmp.  I'm a bit inclined to only do this on platforms where a
problem is demonstrable, which so far is only OpenBSD.  (You'd
think that all BSDen would have the same issue, but the buildfarm
shows otherwise.)
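A standalone C sketch of the idea behind that fix (not the actual backend
patch): explicitly unblocking the signal inside the handler means that,
even on a platform that fails to restore the mask when the handler exits
via siglongjmp, later interrupts still get delivered.

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf mainloop;
static int  interrupts = 0;

static void
handler(int signo)
{
    sigset_t    unblock;

    /*
     * The moral equivalent of PG_SETMASK(&UnBlockSig): undo the implicit
     * sigprocmask() done on handler entry, since we are about to skip the
     * normal handler return by longjmp'ing out.
     */
    sigemptyset(&unblock);
    sigaddset(&unblock, signo);
    sigprocmask(SIG_UNBLOCK, &unblock, NULL);

    siglongjmp(mainloop, 1);
}

int
main(void)
{
    struct sigaction sa;

    sa.sa_handler = handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    if (sigsetjmp(mainloop, 1) != 0)
        printf("interrupt %d handled\n", interrupts);

    if (interrupts++ < 3)
        raise(SIGINT);          /* simulate repeated query cancels */

    printf("all interrupts were delivered\n");
    return 0;
}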

BTW, this does not seem to explain the symptoms shown at
http://www.postgresql.org/message-id/4FE4D89A.8020002@kaltenbrunner.cc
because what we were seeing there was that *all* signals appeared to be
blocked.  However, after this round of debugging I no longer have a lot
of faith in OpenBSD's ps, because it was lying to me about whether the
process had signals blocked or not (or at least, it couldn't see the
effects of the interrupt signal disable, although when I added debugging
code to print the active signal mask according to sigprocmask() I got
told the truth).  So I'm not sure how much trust to put in those older
ps results.  It's possible that the previous failures were a
manifestation of something related to this bug.
        regards, tom lane



Re: spoonbill vs. -HEAD

From: Tom Lane
I wrote:
> I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
> StatementCancelHandler() and any other handlers that might exit via
> longjmp.  I'm a bit inclined to only do this on platforms where a
> problem is demonstrable, which so far is only OpenBSD.  (You'd
> think that all BSDen would have the same issue, but the buildfarm
> shows otherwise.)

BTW, on further thought it seems like maybe this is an OpenBSD bug,
at least in part: what is evidently happening is that the temporary
blockage of SIGINT during the handler persists even after we've
longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
to establish that longjmp handler --- so why isn't the original signal
mask reinstalled when we return to the main loop?

If (your version of) OpenBSD is getting this wrong, it'd explain why
we've not seen similar behavior elsewhere.

This theory doesn't really exonerate our code completely, because we use
sigsetjmp(..., 0) in PG_TRY, which means that in a code path where we
catch a longjmp and don't PG_RE_THROW it, we could be left with the
signal disabled.  I don't believe that is happening in the
isolation-test cases, though.
        regards, tom lane



Re: spoonbill vs. -HEAD

From: Stefan Kaltenbrunner
On 04/03/2013 12:59 AM, Tom Lane wrote:
> I wrote:
>> I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
>> StatementCancelHandler() and any other handlers that might exit via
>> longjmp.  I'm a bit inclined to only do this on platforms where a
>> problem is demonstrable, which so far is only OpenBSD.  (You'd
>> think that all BSDen would have the same issue, but the buildfarm
>> shows otherwise.)
> 
> BTW, on further thought it seems like maybe this is an OpenBSD bug,
> at least in part: what is evidently happening is that the temporary
> blockage of SIGINT during the handler persists even after we've
> longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
> to establish that longjmp handler --- so why isn't the original signal
> mask reinstalled when we return to the main loop?
> 
> If (your version of) OpenBSD is getting this wrong, it'd explain why
> we've not seen similar behavior elsewhere.

hmm trolling the openbsd cvs history brings up this:

http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143


Stefan



Re: spoonbill vs. -HEAD

From: Tom Lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 04/03/2013 12:59 AM, Tom Lane wrote:
>> BTW, on further thought it seems like maybe this is an OpenBSD bug,
>> at least in part: what is evidently happening is that the temporary
>> blockage of SIGINT during the handler persists even after we've
>> longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
>> to establish that longjmp handler --- so why isn't the original signal
>> mask reinstalled when we return to the main loop?
>>
>> If (your version of) OpenBSD is getting this wrong, it'd explain why
>> we've not seen similar behavior elsewhere.

> hmm trolling the openbsd cvs history brings up this:
> http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143

That's about alternate signal stacks, which we're not using.

I put together a simple test program (attached) and tried it on
spoonbill, and it says that the signal *does* get unblocked when control
returns to the sigsetjmp(...,1).  So now I'm really confused.  Somehow
the results we're getting in a full-fledged backend do not match up with
the results gotten by this test program ... but why?

            regards, tom lane


Attachment
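(The attached program itself isn't reproduced in this archive; what follows
is only a plausible reconstruction of such a test, assuming it simply
reports whether SIGINT is still blocked after siglongjmp'ing out of the
handler back to a sigsetjmp(..., 1).)

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static sigjmp_buf env;

static void
handler(int signo)
{
    /* longjmp straight out of the handler, as the backend does */
    siglongjmp(env, 1);
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    if (sigsetjmp(env, 1) == 0)
    {
        raise(SIGINT);          /* triggers handler, which longjmps back */
        printf("handler did not run?\n");
    }
    else
    {
        sigset_t    cur;

        /* sigsetjmp(..., 1) should have restored the original mask here */
        sigprocmask(SIG_BLOCK, NULL, &cur);
        printf("after longjmp, SIGINT is %s\n",
               sigismember(&cur, SIGINT) ? "still blocked" : "unblocked");
    }
    return 0;
}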

Re: spoonbill vs. -HEAD

From: Stefan Kaltenbrunner
On 04/04/2013 02:18 AM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> On 04/03/2013 12:59 AM, Tom Lane wrote:
>>> BTW, on further thought it seems like maybe this is an OpenBSD bug,
>>> at least in part: what is evidently happening is that the temporary
>>> blockage of SIGINT during the handler persists even after we've
>>> longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
>>> to establish that longjmp handler --- so why isn't the original signal
>>> mask reinstalled when we return to the main loop?
>>>
>>> If (your version of) OpenBSD is getting this wrong, it'd explain why
>>> we've not seen similar behavior elsewhere.
>
>> hmm trolling the openbsd cvs history brings up this:
>> http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143
>
> That's about alternate signal stacks, which we're not using.
>
> I put together a simple test program (attached) and tried it on
> spoonbill, and it says that the signal *does* get unblocked when control
> returns to the sigsetjmp(...,1).  So now I'm really confused.  Somehow
> the results we're getting in a full-fledged backend do not match up with
> the results gotten by this test program ... but why?

as a followup to this - I spent some time upgrading spoonbill to the
latest OpenBSD release (5.3) the other day, and it now seems to be able
to pass a full regression test run on a manual run. I will add it to
the regular reporting schedule again, but it seems at least part of the
problem is/was an operating-system-level issue that got fixed in either
5.2 or 5.3 (spoonbill was on 5.1 before).


Stefan