Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f - Mailing list pgsql-committers
From | Tom Lane |
---|---|
Subject | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f |
Date | |
Msg-id | 1709793.1644523137@sss.pgh.pa.us Whole thread Raw |
In response to | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f (Noah Misch <noah@leadboat.com>) |
Responses |
Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
|
List | pgsql-committers |
Noah Misch <noah@leadboat.com> writes: > On AIX, this pair of commits added 5-10h of runtime to 017_shm.pl. I poked into this on gcc119 (hoverfly's host). As best I can tell, the root of the problem is that "kill 9" isn't instantaneous on this platform. In [1] we find this in 017_shm_gnat.log: 2022-02-10 07:08:20.175 UTC [41615696:1] LOG: starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by /opt/IBM/xlc/16.1.0/bin/xlc_r-D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit 2022-02-10 07:08:20.175 UTC [41615696:2] LOG: listening on Unix socket "/home/nm/farm/tmp/FTwLLIVrmg/.s.PGSQL.63971" 2022-02-10 07:08:20.182 UTC [34931288:1] LOG: database system was shut down at 2022-02-10 07:08:20 UTC 2022-02-10 07:08:20.194 UTC [41615696:3] LOG: database system is ready to accept connections 2022-02-10 07:08:20.380 UTC [26018140:1] FATAL: lock file "postmaster.pid" already exists 2022-02-10 07:08:20.380 UTC [26018140:2] HINT: Is another postmaster (PID 41615696) running in data directory "/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"? 2022-02-10 12:48:44.665 UTC [18350518:1] LOG: starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by /opt/IBM/xlc/16.1.0/bin/xlc_r-D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit 41615696 is the postmaster PID that is targeted by 017_shm.pl's first kill9 call, after which we immediately attempt to start a new postmaster (26018140), which is complaining that 41615696 is still alive. Looking at the code that issues that in miscinit.c, either kill(41615696, 0) succeeded, or it returned some errno that's neither ESRCH nor EPERM. AIX doesn't document any kill(2) errors outside of what POSIX says, so I reject the second theory and conclude that kill() was still seeing 41615696 as alive at this point. But, once the replacement postmaster fails, we got trouble, because the old postmaster.pid file (mentioning 41615696) is still there. 017_shm.pl's poll_start sees that "pg_ctl start" failed but then does # Clean up in case the start attempt just timed out or some such. $node->stop('fast', fail_ok => 1); (which I added in 4fdbf9af5). My hypothesis is that the kill() call in pg_ctl's do_stop() still sees 41615696 as alive at this point, so pg_ctl enters a loop of waiting for postmaster.pid to go away, which it never will. PGCTLTIMEOUT later (~18000 seconds for hoverfly), pg_ctl gives up and the test proceeds. So it looks to me like the core problem is that pg_ctl's do_stop() is too trusting: if it once sees the postmaster PID as alive, it figures that's the end of the story. We need to modify it to recheck that during the wait loop. (do_restart and do_promote seem similarly naive ... and why are there so many copies of the wait loop, anyway?) Another idea is to modify Cluster.pm's kill9 to not return until the PID is gone according to kill(0). On the other hand, that'd mask problems like this, so I'm not entirely enthused about changing that end of things. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2022-02-10%2006%3A29%3A25&stg=recovery-check
pgsql-committers by date: