pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Tom Lane
TAP tests: check for postmaster.pid anyway when "pg_ctl start" fails.

"pg_ctl start" might start a new postmaster and then return failure
anyway, for example if PGCTLTIMEOUT is exceeded.  If there is a
postmaster there, it's still incumbent on us to shut it down at
script end, so check for the PID file even though we are about
to fail.

This has been broken all along, so back-patch to all supported branches.

Discussion: https://postgr.es/m/647439.1642622744@sss.pgh.pa.us

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/6c4a8903b93fa5ee088a282b2e8b79ad6571a853

Modified Files
--------------
src/test/perl/PostgreSQL/Test/Cluster.pm | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)


Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Thomas Munro
On Thu, Jan 20, 2022 at 10:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> TAP tests: check for postmaster.pid anyway when "pg_ctl start" fails.
>
> "pg_ctl start" might start a new postmaster and then return failure
> anyway, for example if PGCTLTIMEOUT is exceeded.  If there is a
> postmaster there, it's still incumbent on us to shut it down at
> script end, so check for the PID file even though we are about
> to fail.
>
> This has been broken all along, so back-patch to all supported branches.
>
> Discussion: https://postgr.es/m/647439.1642622744@sss.pgh.pa.us

This seems to have caused some kind of problem for the 017_shm.pl test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-01-20%2003%3A23%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-01-20%2001%3A14%3A07
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2022-01-20%2000%3A10%3A37



Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Jan 20, 2022 at 10:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> TAP tests: check for postmaster.pid anyway when "pg_ctl start" fails.

> This seems to have caused some kind of problem for the 017_shm.pl test:

Hmm.  I think the problem is that poll_start() thinks it can just call
start() a second time after a failure.  If it wasn't a true failure
but a timeout, then _pid is now set and the second call complains.

I'm not entirely sure how that scenario worked before --- maybe
pg_ctl is not picky about whether the running postmaster is the
same one it started??

Anyway it seems like we need some less-fuzzy thinking in poll_start.
Haven't absorbed enough caffeine to decide what.

A possible band-aid to get rid of the buildfarm failures is to force
a higher PGCTLTIMEOUT during poll_start.

            regards, tom lane



I wrote:
> Hmm.  I think the problem is that poll_start() thinks it can just call
> start() a second time after a failure.  If it wasn't a true failure
> but a timeout, then _pid is now set and the second call complains.

Oh, wait --- the case that is failing is after 017_shm.pl has
intentionally kill -9'd a postmaster, so that its pidfile is
left behind.  The next attempted start fails on shmem id
conflict, but it doesn't remove the old pidfile, and then
the code I added to sub start erroneously picks that up
as a live postmaster PID.

Seems like we need to do 'kill 0' on the PID we get from
the file to verify that there's really a postmaster there.
(I wonder how well that works on Windows?  perlport claims
it does, but ...)

I fear I still don't have the whole story though because
per this theory it should fail everywhere, yet it doesn't.

            regards, tom lane



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Andres Freund
Hi,

On 2022-01-20 09:32:16 -0500, Tom Lane wrote:
> Seems like we need to do 'kill 0' on the PID we get from
> the file to verify that there's really a postmaster there.
> (I wonder how well that works on Windows?  perlport claims
> it does, but ...)

We seem to successfully do that in pg_ctl's postmaster_is_alive().


> I wrote:
> > Hmm.  I think the problem is that poll_start() thinks it can just call
> > start() a second time after a failure.  If it wasn't a true failure
> > but a timeout, then _pid is now set and the second call complains.
> 
> Oh, wait --- the case that is failing is after 017_shm.pl has
> intentionally kill -9'd a postmaster, so that its pidfile is
> left behind.  The next attempted start fails on shmem id
> conflict, but it doesn't remove the old pidfile, and then
> the code I added to sub start erroneously picks that up
> as a live postmaster PID.

> I fear I still don't have the whole story though because
> per this theory it should fail everywhere, yet it doesn't.

The sequence in regress_log_* that hoverfly gets differs from what I am
seeing locally. It's a bit hard to tell, because the test doesn't report
any progress until quite a bit into the run...

For the first kill9, I see:

local:
  ### Killing node "gnat" using signal 9
  ...
  ### Starting node "gnat"
  # Running: pg_ctl -w -D /home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata -l /home/andres/build/postgres/dev>
  pg_ctl: another server might be running; trying to start server anyway
  waiting for server to start.... done
  server started
  # Postmaster PID for node "gnat" is 2904505


hoverfly:
  ### Killing node "gnat" using signal 9
  ...
  ### Starting node "gnat"
  # Running: pg_ctl -w -D /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata -l /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/log/017_shm_gnat.log -o --cluster-name=gnat start
  pg_ctl: another server might be running; trying to start server anyway
  waiting for server to start.... stopped waiting
  pg_ctl: could not start server
  Examine the log output.
  # pg_ctl start failed; logfile:


local log:
  2022-01-20 11:02:40.109 PST [2905213] LOG:  starting PostgreSQL 15devel on x86_64-pc-linux-gnu, compiled by gcc-12 (Debian 12-20220106-1) 12.0.0 20220106 (experimental) [master r12-6277-g1935db29689], 64-bit
  2022-01-20 11:02:40.109 PST [2905213] LOG:  listening on Unix socket "/tmp/jAyaRMImQH/.s.PGSQL.53851"
  2022-01-20 11:02:40.110 PST [2905216] LOG:  database system was interrupted; last known up at 2022-01-20 11:02:39 PST
  2022-01-20 11:02:40.111 PST [2905216] LOG:  database system was not properly shut down; automatic recovery in progress


hoverfly:
  2022-01-20 18:04:53.087 UTC [50659690:1] LOG:  starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit
  2022-01-20 18:04:53.088 UTC [50659690:2] LOG:  listening on Unix socket "/home/nm/farm/tmp/oR5PbI8A5R/.s.PGSQL.61673"
  2022-01-20 18:04:53.092 UTC [393770:1] LOG:  database system was shut down at 2022-01-20 18:04:52 UTC
  2022-01-20 18:04:53.099 UTC [50659690:3] LOG:  database system is ready to accept connections
  2022-01-20 18:04:53.275 UTC [33751596:1] FATAL:  lock file "postmaster.pid" already exists
  2022-01-20 18:04:53.275 UTC [33751596:2] HINT:  Is another postmaster (PID 50659690) running in data directory "/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?

IOW, the test doesn't actually quite seem to be working in the "local" case?
Hence not seeing the problem of picking up some wrong pid.

I think it's a question of how quickly the postmaster-death detection
logic reacts. At first I thought it was a question of
USE_POSTMASTER_DEATH_SIGNAL, but I see the same with the pipe-based
approach. So it's probably just a question of event ordering.

I think this test could benefit from
$gnat->append_conf('postgresql.conf', 'log_min_messages=debug3');
or such, so that we see processes exiting.

I also think we ought to emit a debug message when detecting postmaster
death. Such cases are unnecessarily hard to debug right now.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> IOW, the test doesn't actually quite seem to be working in the "local" case?
> Hence not seeing the problem of picking up some wrong pid.

It looks like there are two different failure cases:

1. The symptom shown by skink:

# pg_ctl start failed; logfile:
...
2022-01-20 12:36:22.021 UTC [4023581][postmaster][:0] FATAL:  pre-existing shared memory block (key 16253309, ID 1602617349) is still in use
2022-01-20 12:36:22.021 UTC [4023581][postmaster][:0] HINT:  Terminate any old server processes associated with data directory "/mnt/resource/bf/build/skink-master/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata".

I can reproduce this locally by using a valgrind wrapper around the
postmaster.  What is evidently happening is that some postmaster
child is slow to exit after the postmaster is kill -9'd (thanks to
being valgrind'd) and so the would-be new postmaster sees the shmem
block as still active.

2. The symptom shown on Noah's AIX flotilla:

# pg_ctl start failed; logfile:
...
2022-01-20 03:08:56.200 UTC [11796728:1] FATAL:  lock file "postmaster.pid" already exists
2022-01-20 03:08:56.200 UTC [11796728:2] HINT:  Is another postmaster (PID 6750300) running in data directory "/home/nm/farm/xlc64/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?

Looking at the code in miscinit.c that emits that message, it seems
that kill(6750300, 0) must have succeeded, even though nontrivial
time has passed since we did the kill -9.  Alternatively, AIX returned
some errno other than the two we ignore --- but I checked its manpages
and it alleges just the same set of error conditions as Linux.

The first of these scenarios could be dealt with easily if we make
Cluster::start validate the PID it gets out of the pidfile.  The second
one seems problematic though: if the killed postmaster is not-dead-yet
when the replacement postmaster looks, it might still be that way when
Cluster::start checks, a few microseconds later, so that we'd "validate"
a PID that's not going to be around much longer.

What I'm thinking of doing is inventing a "soft_stop" variant of
Cluster::stop that won't complain if pg_ctl stop fails, and then
having 017_shm's poll_start() call that before retrying the start
call.  In this way, if we did manage to start a postmaster then
we'll shut it down before trying again, while if we didn't
successfully start one then soft_stop will end with _pid unset,
allowing Cluster::start to succeed.

It seems like it'd also be a good idea if the END block used
soft_stop instead of regular stop.  As is, if we have multiple
postmasters and the first one gives us trouble (say it's dead
already), I think we'll likely fail to stop the rest.  Not sure
what happens if we BAIL_OUT from an END block, but it seems
unlikely to be good.

            regards, tom lane



I wrote:
> What I'm thinking of doing is inventing a "soft_stop" variant of
> Cluster::stop that won't complain if pg_ctl stop fails, and then
> having 017_shm's poll_start() call that before retrying the start
> call.

I did that, modeling the details on Cluster::start's fail_ok option.

> It seems like it'd also be a good idea if the END block used
> soft_stop instead of regular stop.

I still like this idea, but I left it alone for the moment,
figuring I've used up my quota of buildfarm destabilization
for now.

            regards, tom lane



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Noah Misch
On Thu, Jan 20, 2022 at 05:30:44PM -0500, Tom Lane wrote:
> I wrote:
> > What I'm thinking of doing is inventing a "soft_stop" variant of
> > Cluster::stop that won't complain if pg_ctl stop fails, and then
> > having 017_shm's poll_start() call that before retrying the start
> > call.
> 
> I did that, modeling the details on Cluster::start's fail_ok option.

On AIX, this pair of commits added 5-10h of runtime to 017_shm.pl.

Before 6c4a890 or 4fdbf9a:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2022-01-19%2020%3A41%3A47
[21:14:17] t/017_shm.pl ......................... ok     7360 ms ( 0.01 usr  0.00 sys +  1.62 cusr  0.71 csys =  2.34 CPU)

After both:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2022-01-21%2021%3A03%3A28
[07:52:31] t/017_shm.pl ......................... ok 36908794 ms ( 0.00 usr  0.00 sys +  5.16 cusr 16.04 csys = 21.20 CPU)



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Tom Lane
Noah Misch <noah@leadboat.com> writes:
> On AIX, this pair of commits added 5-10h of runtime to 017_shm.pl.

Ouch!  Do you have any idea why?

            regards, tom lane



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Noah Misch
On Wed, Feb 09, 2022 at 09:41:09PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On AIX, this pair of commits added 5-10h of runtime to 017_shm.pl.
> 
> Ouch!  Do you have any idea why?

I do not.



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Tom Lane
Noah Misch <noah@leadboat.com> writes:
> On Wed, Feb 09, 2022 at 09:41:09PM -0500, Tom Lane wrote:
>> Ouch!  Do you have any idea why?

> I do not.

OK, I'll look into it tomorrow.

            regards, tom lane



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Tom Lane
Noah Misch <noah@leadboat.com> writes:
> On AIX, this pair of commits added 5-10h of runtime to 017_shm.pl.

I poked into this on gcc119 (hoverfly's host).  As best I can tell,
the root of the problem is that "kill 9" isn't instantaneous on this
platform.  In [1] we find this in 017_shm_gnat.log:

2022-02-10 07:08:20.175 UTC [41615696:1] LOG:  starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit
2022-02-10 07:08:20.175 UTC [41615696:2] LOG:  listening on Unix socket "/home/nm/farm/tmp/FTwLLIVrmg/.s.PGSQL.63971"
2022-02-10 07:08:20.182 UTC [34931288:1] LOG:  database system was shut down at 2022-02-10 07:08:20 UTC
2022-02-10 07:08:20.194 UTC [41615696:3] LOG:  database system is ready to accept connections
2022-02-10 07:08:20.380 UTC [26018140:1] FATAL:  lock file "postmaster.pid" already exists
2022-02-10 07:08:20.380 UTC [26018140:2] HINT:  Is another postmaster (PID 41615696) running in data directory "/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?
2022-02-10 12:48:44.665 UTC [18350518:1] LOG:  starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit

41615696 is the postmaster PID that is targeted by 017_shm.pl's first
kill9 call, after which we immediately attempt to start a new postmaster
(26018140), which is complaining that 41615696 is still alive.  Looking
at the code that issues that in miscinit.c, either kill(41615696, 0)
succeeded, or it returned some errno that's neither ESRCH nor EPERM.
AIX doesn't document any kill(2) errors outside of what POSIX says,
so I reject the second theory and conclude that kill() was still seeing
41615696 as alive at this point.

But once the replacement postmaster fails, we've got trouble, because
the old postmaster.pid file (mentioning 41615696) is still there.
017_shm.pl's poll_start sees that "pg_ctl start" failed but then
does
        # Clean up in case the start attempt just timed out or some such.
        $node->stop('fast', fail_ok => 1);
(which I added in 4fdbf9af5).

My hypothesis is that the kill() call in pg_ctl's do_stop() still sees
41615696 as alive at this point, so pg_ctl enters a loop of waiting for
postmaster.pid to go away, which it never will.  PGCTLTIMEOUT later
(~18000 seconds for hoverfly), pg_ctl gives up and the test proceeds.

So it looks to me like the core problem is that pg_ctl's do_stop()
is too trusting: if it once sees the postmaster PID as alive, it
figures that's the end of the story.  We need to modify it to
recheck that during the wait loop.  (do_restart and do_promote
seem similarly naive ... and why are there so many copies of the
wait loop, anyway?)

Another idea is to modify Cluster.pm's kill9 to not return until the
PID is gone according to kill(0).  On the other hand, that'd mask
problems like this, so I'm not entirely enthused about changing
that end of things.

            regards, tom lane

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2022-02-10%2006%3A29%3A25&stg=recovery-check



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Andres Freund
Hi,

On 2022-02-10 14:58:57 -0500, Tom Lane wrote:
> So it looks to me like the core problem is that pg_ctl's do_stop()
> is too trusting: if it once sees the postmaster PID as alive, it
> figures that's the end of the story.

Agreed, that's a problem. Even if it wasn't the cause of the slow tests on AIX.

kill -9 can be slow on other operating systems as well. On Linux, for
example, I think it's not processed immediately if the process is
already in the middle of dumping core.


> (do_restart and do_promote seem similarly naive ... and why are there so
> many copies of the wait loop, anyway?)

:(

There's generally some confusing duplication within pg_ctl.c itself, and
between pg_ctl and other programs. E.g., CreateRestrictedProcess()
exists both in pg_ctl and, in slightly different form, in
restricted_token.c. Wut? Also, why is restricted_token.c in common/
rather than port/?


> Another idea is to modify Cluster.pm's kill9 to not return until the
> PID is gone according to kill(0).  On the other hand, that'd mask
> problems like this, so I'm not entirely enthused about changing
> that end of things.

Seems reasonable.

Greetings,

Andres Freund



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Andres Freund
Hi,

On 2022-02-10 14:38:06 -0800, Andres Freund wrote:
> On 2022-02-10 14:58:57 -0500, Tom Lane wrote:
> > So it looks to me like the core problem is that pg_ctl's do_stop()
> > is too trusting: if it once sees the postmaster PID as alive, it
> > figures that's the end of the story.
> 
> Agreed, that's a problem. Even if it wasn't the cause of slot tests on AIX.

Should we take the danger of the postmaster dying without cleaning up
the pid file, and then another process reusing that PID, seriously? That
could obviously make pg_ctl wait until timeout.

Having the postmaster always open a named pipe or Unix socket in the
data directory would allow us to detect that, I guess. Or we could check
the shared memory segment for attached processes, but that seems like a
mess to do in pg_ctl.

Greetings,

Andres Freund



Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> Should we take the danger of postmaster dying without cleaning up the pid file
> and then another process reusing that pid serious? That obviously could make
> pg_ctl wait until timeout.

I'm under the impression that Unixen try not to recycle PIDs within
a short interval.  So I'm not that worried about this.  (Also, since
we treat EPERM like ESRCH, conflicting PIDs belonging to other user
IDs won't matter anyway.)

            regards, tom lane