Thread: Race condition in crash-recovery tests

Race condition in crash-recovery tests

From

Tom Lane

Date:

27 January 2019, 04:53:48

Recently, buildfarm member curculio has started to show a semi-repeatable
failure in src/test/recovery/t/013_crash_restart.pl:

# aborting wait: program died
# stream contents: >>psql:<stdin>:8: no connection to the server
# psql:<stdin>:8: connection to server was lost
# <<
# pattern searched for: (?^m:server closed the connection unexpectedly)

#   Failed test 'psql query died successfully after SIGKILL'
#   at t/013_crash_restart.pl line 198.

The message this test is looking for is what libpq reports upon getting
EOF or ECONNRESET from a socket read attempt.  The message it's actually
seeing is what libpq reports if it notices that the PQconn is *already*
in CONNECTION_BAD state when it's trying to send a new query.

I have no idea why we're seeing this in only one buildfarm member
and only for the past week or so, as it doesn't appear that any
related code has changed for months.  (Perhaps something changed
about curculio's host?)

I do see a plausibly related change back in October: commit 4247db625
added PQconsumeInput() calls in places where psql would call that after
having already emitted a query result, but before going idle.  So my
rough theory about what's happening is that psql loses the CPU immediately
after printing the query result containing "in-progress-before-sigquit",
which allows the test script to proceed, and doesn't get it back until
after the connected backend commits hara-kiri.  Then PQconsumeInput
tries to read the socket, gets EOF or ECONNRESET, and sets a message
(that's ignored) along with setting state = CONNECTION_BAD.  Then when
psql tries to send the next query, we get the observed message instead
of what the test is expecting.

I'm inclined to write this off as an inevitable race condition and
just change the test script to accept either message as a successful
result.  I think that 4247db625 made such races more likely, but I
don't believe it was impossible before.

Another idea is to change libpq so that both these cases emit identical
messages, but I don't really feel that that'd be an improvement.  Also,
since 4247db625 was back-patched, we'd have to back-patch the message
change as well, which I like even less.  People might be relying on
seeing either message spelling in some situations.

In any case, it'd be nice to have an explanation of why this just
started failing --- that fact makes it even harder to credit that
4247db625 was the trigger.

Thoughts?

            regards, tom lane

Re: Race condition in crash-recovery tests

From

Andres Freund

Date:

27 January 2019, 05:29:37

Hi,

On 2019-01-26 20:53:48 -0500, Tom Lane wrote:
> Recently, buildfarm member curculio has started to show a semi-repeatable
> failure in src/test/recovery/t/013_crash_restart.pl:
> 
> # aborting wait: program died
> # stream contents: >>psql:<stdin>:8: no connection to the server
> # psql:<stdin>:8: connection to server was lost
> # <<
> # pattern searched for: (?^m:server closed the connection unexpectedly)
> 
> #   Failed test 'psql query died successfully after SIGKILL'
> #   at t/013_crash_restart.pl line 198.
> 
> The message this test is looking for is what libpq reports upon getting
> EOF or ECONNRESET from a socket read attempt.  The message it's actually
> seeing is what libpq reports if it notices that the PQconn is *already*
> in CONNECTION_BAD state when it's trying to send a new query.
> 
> I have no idea why we're seeing this in only one buildfarm member
> and only for the past week or so, as it doesn't appear that any
> related code has changed for months.  (Perhaps something changed
> about curculio's host?)

I have no idea why it's just curculio, but I think I know why it only
started recently: Curculio doesn't appear to have tap tests enabled
before
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2019-01-17%2021%3A30%3A02


> just change the test script to accept either message as a successful
> result.  I think that 4247db625 made such races more likely, but I
> don't believe it was impossible before.

Sounds right to me - do you want to do the honors or shall I?


> Another idea is to change libpq so that both these cases emit identical
> messages, but I don't really feel that that'd be an improvement.  Also,
> since 4247db625 was back-patched, we'd have to back-patch the message
> change as well, which I like even less.  People might be relying on
> seeing either message spelling in some situations.

Yea, I don't think that's the way to go.

Greetings,

Andres Freund

Re: Race condition in crash-recovery tests

From

Tom Lane

Date:

27 January 2019, 05:45:11

Andres Freund <andres@anarazel.de> writes:
> On 2019-01-26 20:53:48 -0500, Tom Lane wrote:
>> I have no idea why we're seeing this in only one buildfarm member
>> and only for the past week or so, as it doesn't appear that any
>> related code has changed for months.  (Perhaps something changed
>> about curculio's host?)

> I have no idea why it's just curculio, but I think I know why it only
> started recently: Curculio doesn't appear to have tap tests enabled
> before
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2019-01-17%2021%3A30%3A02

Oh, right ... I knew that, actually, but forgot ...

So then we only have to assume that the race condition is encouraged
by something about the kernel scheduler's rules on that machine, which
isn't so much of a leap, especially since it's our only OpenBSD
critter.  The test case only exists in v11 and HEAD branches, and
curculio's only run this test a few times in v11, so the lack of
back-branch failures isn't so odd.

>> just change the test script to accept either message as a successful
>> result.  I think that 4247db625 made such races more likely, but I
>> don't believe it was impossible before.

> Sounds right to me - do you want to do the honors or shall I?

I'll do it in a bit.

            regards, tom lane