Thread: Race condition in server-crash testing
My pet dinosaur gaur just failed [1] in src/test/recovery/t/022_crash_temp_files.pl, which does this: ----- my $ret = PostgreSQL::Test::Utils::system_log('pg_ctl', 'kill', 'KILL', $pid); is($ret, 0, 'killed process with KILL'); # Close psql session $killme->finish; $killme2->finish; # Wait till server restarts $node->poll_query_until('postgres', undef, ''); ----- It's hard to be totally sure, but I think what happened is that gaur hit the in-hindsight-obvious race condition in this code: we managed to execute a successful iteration of poll_query_until before the postmaster had noticed its dead child and commenced the restart. The test lines after these are not prepared to see failure-to-connect. It's not obvious to me how to remove this race condition. Thoughts? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gaur&dt=2022-04-03%2021%3A14%3A41
Hi, On 2022-04-04 00:50:27 -0400, Tom Lane wrote: > My pet dinosaur gaur just failed [1] in > src/test/recovery/t/022_crash_temp_files.pl, which does this: > > ----- > my $ret = PostgreSQL::Test::Utils::system_log('pg_ctl', 'kill', 'KILL', $pid); > is($ret, 0, 'killed process with KILL'); > > # Close psql session > $killme->finish; > $killme2->finish; > > # Wait till server restarts > $node->poll_query_until('postgres', undef, ''); > ----- > > It's hard to be totally sure, but I think what happened is that > gaur hit the in-hindsight-obvious race condition in this code: > we managed to execute a successful iteration of poll_query_until > before the postmaster had noticed its dead child and commenced > the restart. The test lines after these are not prepared to see > failure-to-connect. > > It's not obvious to me how to remove this race condition. > Thoughts? Maybe we can use pump_until() with the psql that's not getting killed? With a non-matching regex? That'd only return once the backend was killed by postmaster, afaics? Greetings, Andres Freund
On Sun, Apr 03, 2022 at 10:07:21PM -0700, Andres Freund wrote: > On 2022-04-04 00:50:27 -0400, Tom Lane wrote: > > My pet dinosaur gaur just failed [1] in > > src/test/recovery/t/022_crash_temp_files.pl, which does this: > > > > ----- > > my $ret = PostgreSQL::Test::Utils::system_log('pg_ctl', 'kill', 'KILL', $pid); > > is($ret, 0, 'killed process with KILL'); > > > > # Close psql session > > $killme->finish; > > $killme2->finish; > > > > # Wait till server restarts > > $node->poll_query_until('postgres', undef, ''); > > ----- > > > > It's hard to be totally sure, but I think what happened is that > > gaur hit the in-hindsight-obvious race condition in this code: > > we managed to execute a successful iteration of poll_query_until > > before the postmaster had noticed its dead child and commenced > > the restart. The test lines after these are not prepared to see > > failure-to-connect. > > > > It's not obvious to me how to remove this race condition. > > Thoughts? > > Maybe we can use pump_until() with the psql that's not getting killed? With a > non-matching regex? That'd only return once the backend was killed by > postmaster, afaics? Sounds good; I suspect that will be better than any of the ideas I scratched down when https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-08-31%2015%3A00%3A52 failed the same way. For what it's worth, those were: - Check that pg_postmaster_start_time() has changed. But that runs into EPIPE trouble, requiring a write to a file or an eval{} to trap the EPIPE. - Likewise, but check "select checkpoint_lsn from pg_control_checkpoint();". - Poll pg_controldata until a new checkpoint happens. Compare checkpoint LSN. Use checkpoint_timeout=1h to avoid non-end-of-recovery checkpoints. - Poll logfile until "all server processes terminated; reinitializing". Can be fooled with certain log_min_messages settings, but so can our other log-scraping tests. - Grab the pid of e.g. the checkpointer and poll for that process to be gone. Can be fooled by PID reuse.
Andres Freund <andres@anarazel.de> writes: > On 2022-04-04 00:50:27 -0400, Tom Lane wrote: >> It's hard to be totally sure, but I think what happened is that >> gaur hit the in-hindsight-obvious race condition in this code: >> we managed to execute a successful iteration of poll_query_until >> before the postmaster had noticed its dead child and commenced >> the restart. The test lines after these are not prepared to see >> failure-to-connect. >> It's not obvious to me how to remove this race condition. >> Thoughts? > Maybe we can use pump_until() with the psql that's not getting killed? With a > non-matching regex? That'd only return once the backend was killed by > postmaster, afaics? Good idea. What I actually did was to borrow the recently-fixed code in 013_crash_restart.pl that checks for psql's "connection lost" report. regards, tom lane