Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f - Mailing list pgsql-committers

From Tom Lane
Subject Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
Date
Msg-id 926839.1642708567@sss.pgh.pa.us
Whole thread Raw
In response to Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f  (Andres Freund <andres@anarazel.de>)
Responses Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-committers
Andres Freund <andres@anarazel.de> writes:
> IOW, the test doesn't actually quite seem to be working in the "local" case?
> Hence not seeing the problem of picking up some wrong pid.

It looks like there are two different failure cases:

1. The symptom shown by skink:

# pg_ctl start failed; logfile:
...
2022-01-20 12:36:22.021 UTC [4023581][postmaster][:0] FATAL:  pre-existing shared memory block (key 16253309, ID
1602617349)is still in use 
2022-01-20 12:36:22.021 UTC [4023581][postmaster][:0] HINT:  Terminate any old server processes associated with data
directory"/mnt/resource/bf/build/skink-master/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata". 

I can reproduce this locally by using a valgrind wrapper around the
postmaster.  What is evidently happening is that some postmaster
child is slow to exit after the postmaster is kill -9'd (thanks to
being valgrind'd) and so the would-be new postmaster sees the shmem
block as still active.

2. The symptom shown on Noah's AIX flotilla:

# pg_ctl start failed; logfile:
...
2022-01-20 03:08:56.200 UTC [11796728:1] FATAL:  lock file "postmaster.pid" already exists
2022-01-20 03:08:56.200 UTC [11796728:2] HINT:  Is another postmaster (PID 6750300) running in data directory
"/home/nm/farm/xlc64/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?

Looking at the code in miscinit.c that emits that message, it seems
that kill(6750300, 0) must have succeeded, even though nontrivial
time has passed since we did the kill -9.  Alternatively, AIX returned
some errno other than the two we ignore --- but I checked its manpages
and it alleges just the same set of error conditions as Linux.

The first of these scenarios could be dealt with easily if we make
Cluster::start validate the PID it gets out of the pidfile.  The second
one seems problematic though: if the killed postmaster is not-dead-yet
when the replacement postmaster looks, it might still be that way when
Cluster::start checks, a few microseconds later, so that we'd "validate"
a PID that's not going to be around much longer.

What I'm thinking of doing is inventing a "soft_stop" variant of
Cluster::stop that won't complain if pg_ctl stop fails, and then
having 017_shm's poll_start() call that before retrying the start
call.  In this way, if we did manage to start a postmaster then
we'll shut it down before trying again, while if we didn't
successfully start one then soft_stop will end with _pid unset,
allowing Cluster::start to succeed.

It seems like it'd also be a good idea if the END block used
soft_stop instead of regular stop.  As is, if we have multiple
postmasters and the first one gives us trouble (say it's dead
already), I think we'll likely fail to stop the rest.  Not sure
what happens if we BAIL_OUT from an END block, but it seems
unlikely to be good.

            regards, tom lane



pgsql-committers by date:

Previous
From: Andres Freund
Date:
Subject: Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
Next
From: Tom Lane
Date:
Subject: pgsql: Tighten TAP tests' tracking of postmaster state some more.