Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f - Mailing list pgsql-committers

From Andres Freund
Subject Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
Date
Msg-id 20220120192736.msyesvdlljn6tptw@alap3.anarazel.de
Whole thread Raw
In response to Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-committers
Hi,

On 2022-01-20 09:32:16 -0500, Tom Lane wrote:
> Seems like we need to do 'kill 0' on the PID we get from
> the file to verify that there's really a postmaster there.
> (I wonder how well that works on Windows?  perlport claims
> it does, but ...)

We seem to successfully do that in pg_ctl's postmaster_is_alive().


> I wrote:
> > Hmm.  I think the problem is that poll_start() thinks it can just call
> > start() a second time after a failure.  If it wasn't a true failure
> > but a timeout, then _pid is now set and the second call complains.
> 
> Oh, wait --- the case that is failing is after 017_shm.pl has
> intentionally kill -9'd a postmaster, so that its pidfile is
> left behind.  The next attempted start fails on shmem id
> conflict, but it doesn't remove the old pidfile, and then
> the code I added to sub start erroneously picks that up
> as a live postmaster PID.

> I fear I still don't have the whole story though because
> per this theory it should fail everywhere, yet it doesn't.

The sequence in regress_log_* that hoverfly is getting and that I am seeing
locally differs. It's a bit hard to tell, because the test doesn't report any
test progress until quite a bit into the test...

For the first kill9, I see:

local:
  ### Killing node "gnat" using signal 9
  ...
  ### Starting node "gnat"
  # Running: pg_ctl -w -D
/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata-l
/home/andres/build/postgres/dev>
  pg_ctl: another server might be running; trying to start server anyway
  waiting for server to start.... done
  server started
  # Postmaster PID for node "gnat" is 2904505


hoverfly:
  ### Killing node "gnat" using signal 9
  ...
  ### Starting node "gnat"
  # Running: pg_ctl -w -D
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata-l
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/log/017_shm_gnat.log-o --cluster-name=gnat
start
  pg_ctl: another server might be running; trying to start server anyway
  waiting for server to start.... stopped waiting
  pg_ctl: could not start server
  Examine the log output.
  # pg_ctl start failed; logfile:


local log:
  2022-01-20 11:02:40.109 PST [2905213] LOG:  starting PostgreSQL 15devel on x86_64-pc-linux-gnu, compiled by gcc-12
(Debian12-20220106-1) 12.0.0 20220106 (experimental) [master r12-6277-g1935db29689], 64-bit
 
  2022-01-20 11:02:40.109 PST [2905213] LOG:  listening on Unix socket "/tmp/jAyaRMImQH/.s.PGSQL.53851"
  2022-01-20 11:02:40.110 PST [2905216] LOG:  database system was interrupted; last known up at 2022-01-20 11:02:39
PST
  2022-01-20 11:02:40.111 PST [2905216] LOG:  database system was not properly shut down; automatic recovery in
progress


hoverfly:
  2022-01-20 18:04:53.087 UTC [50659690:1] LOG:  starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by
/opt/IBM/xlc/16.1.0/bin/xlc_r-D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit
 
  2022-01-20 18:04:53.088 UTC [50659690:2] LOG:  listening on Unix socket
"/home/nm/farm/tmp/oR5PbI8A5R/.s.PGSQL.61673"
  2022-01-20 18:04:53.092 UTC [393770:1] LOG:  database system was shut down at 2022-01-20 18:04:52 UTC
  2022-01-20 18:04:53.099 UTC [50659690:3] LOG:  database system is ready to accept connections
  2022-01-20 18:04:53.275 UTC [33751596:1] FATAL:  lock file "postmaster.pid" already exists
  2022-01-20 18:04:53.275 UTC [33751596:2] HINT:  Is another postmaster (PID 50659690) running in data directory
"/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?

IOW, the test doesn't actually quite seem to be working in the "local" case?
Hence not seeing the problem of picking up some wrong pid.

I think it's a question of how quick the postmaster-death detection logic
is. At first I thought it's a question of USE_POSTMASTER_DEATH_SIGNAL, but I
see the same with the pipe based approach. So it's probably just a question of
event ordering.

I think this test could benefit from
$gnat->append_conf('postgresql.conf', 'log_min_messages=debug3');
or such, so that we see processes exiting.

I also think we ought to emit a debug message when detecting postmaster
death. It's unnecessarily hard to debug them right now.

Greetings,

Andres Freund



pgsql-committers by date:

Previous
From: Robert Haas
Date:
Subject: pgsql: Support base backup targets.
Next
From: Tom Lane
Date:
Subject: Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f