Thread: Race condition in server-crash testing

Race condition in server-crash testing

From
Tom Lane
Date:
My pet dinosaur gaur just failed [1] in
src/test/recovery/t/022_crash_temp_files.pl, which does this:

-----
my $ret = PostgreSQL::Test::Utils::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');

# Close psql session
$killme->finish;
$killme2->finish;

# Wait till server restarts
$node->poll_query_until('postgres', undef, '');
-----

It's hard to be totally sure, but I think what happened is that
gaur hit the in-hindsight-obvious race condition in this code:
we managed to execute a successful iteration of poll_query_until
before the postmaster had noticed its dead child and commenced
the restart.  The test lines after these are not prepared to see
failure-to-connect.

It's not obvious to me how to remove this race condition.
Thoughts?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gaur&dt=2022-04-03%2021%3A14%3A41



Re: Race condition in server-crash testing

From
Andres Freund
Date:
Hi,

On 2022-04-04 00:50:27 -0400, Tom Lane wrote:
> My pet dinosaur gaur just failed [1] in
> src/test/recovery/t/022_crash_temp_files.pl, which does this:
> 
> -----
> my $ret = PostgreSQL::Test::Utils::system_log('pg_ctl', 'kill', 'KILL', $pid);
> is($ret, 0, 'killed process with KILL');
> 
> # Close psql session
> $killme->finish;
> $killme2->finish;
> 
> # Wait till server restarts
> $node->poll_query_until('postgres', undef, '');
> -----
> 
> It's hard to be totally sure, but I think what happened is that
> gaur hit the in-hindsight-obvious race condition in this code:
> we managed to execute a successful iteration of poll_query_until
> before the postmaster had noticed its dead child and commenced
> the restart.  The test lines after these are not prepared to see
> failure-to-connect.
> 
> It's not obvious to me how to remove this race condition.
> Thoughts?

Maybe we can use pump_until() with the psql that's not getting killed? With a
non-matching regex? That'd only return once the backend was killed by
postmaster, afaics?

Greetings,

Andres Freund



Re: Race condition in server-crash testing

From
Noah Misch
Date:
On Sun, Apr 03, 2022 at 10:07:21PM -0700, Andres Freund wrote:
> On 2022-04-04 00:50:27 -0400, Tom Lane wrote:
> > My pet dinosaur gaur just failed [1] in
> > src/test/recovery/t/022_crash_temp_files.pl, which does this:
> > 
> > -----
> > my $ret = PostgreSQL::Test::Utils::system_log('pg_ctl', 'kill', 'KILL', $pid);
> > is($ret, 0, 'killed process with KILL');
> > 
> > # Close psql session
> > $killme->finish;
> > $killme2->finish;
> > 
> > # Wait till server restarts
> > $node->poll_query_until('postgres', undef, '');
> > -----
> > 
> > It's hard to be totally sure, but I think what happened is that
> > gaur hit the in-hindsight-obvious race condition in this code:
> > we managed to execute a successful iteration of poll_query_until
> > before the postmaster had noticed its dead child and commenced
> > the restart.  The test lines after these are not prepared to see
> > failure-to-connect.
> > 
> > It's not obvious to me how to remove this race condition.
> > Thoughts?
> 
> Maybe we can use pump_until() with the psql that's not getting killed? With a
> non-matching regex? That'd only return once the backend was killed by
> postmaster, afaics?

Sounds good; I suspect that will be better than any of the ideas I scratched
down when
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-08-31%2015%3A00%3A52
failed the same way.  For what it's worth, those were:

- Check that pg_postmaster_start_time() has changed.  But that runs into EPIPE
  trouble, requiring a write to a file or an eval{} to trap the EPIPE.
- Likewise, but check "select checkpoint_lsn from pg_control_checkpoint();".
- Poll pg_controldata until a new checkpoint happens.  Compare checkpoint LSN.
  Use checkpoint_timeout=1h to avoid non-end-of-recovery checkpoints.
- Poll logfile until "all server processes terminated; reinitializing".  Can
  be fooled with certain log_min_messages settings, but so can our other
  log-scraping tests.
- Grab the pid of e.g. the checkpointer and poll for that process to be gone.
  Can be fooled by PID reuse.



Re: Race condition in server-crash testing

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-04 00:50:27 -0400, Tom Lane wrote:
>> It's hard to be totally sure, but I think what happened is that
>> gaur hit the in-hindsight-obvious race condition in this code:
>> we managed to execute a successful iteration of poll_query_until
>> before the postmaster had noticed its dead child and commenced
>> the restart.  The test lines after these are not prepared to see
>> failure-to-connect.
>> It's not obvious to me how to remove this race condition.
>> Thoughts?

> Maybe we can use pump_until() with the psql that's not getting killed? With a
> non-matching regex? That'd only return once the backend was killed by
> postmaster, afaics?

Good idea.  What I actually did was to borrow the recently-fixed code
in 013_crash_restart.pl that checks for psql's "connection lost"
report.

            regards, tom lane