Re: Shutting down a warm standby database in 8.2beta3 - Mailing list pgsql-hackers

From Stephen Harris
Subject Re: Shutting down a warm standby database in 8.2beta3
Date
Msg-id 20061122185623.GA23202@pugwash.spuddy.org
In response to Re: [GENERAL] Shutting down a warm standby database in 8.2beta3  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Shutting down a warm standby database in 8.2beta3  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:
> 
>         kill(child_pid, SIGxxx);
>     #ifdef HAVE_SETSID
>         kill(-child_pid, SIGxxx);
>     #endif
> 
> In the normal case where the child has already completed setsid(), the
> extra signal sent to it should do no harm.  In the startup race

Hmm.  It looks like something more than this may be needed.  The postgres
recovery process appears to be ignoring the signal.  I ran the whole database
in its own process group (ksh runs processes in their own process group
by default, so pg_ctl became the session leader and so everything under
pg_ctl all stayed in that process group).

% ps -o pid,ppid,pgid,args -g 29141 | sort
  PID  PPID  PGID COMMAND
29145     1 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29146 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29147 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29501 29147 29141 sh -c /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYXLOG
29502 29501 29141 /bin/ksh -p /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYX
29537 29502 29141 sleep 5

I did
kill -QUIT -29141 ; sleep 1 ; touch /export/home/swharris/archives/STOP_SWEH_RECOVERY

This sent the QUIT signal to all of those processes.  The shell script
ignores the signal and tries to loop again, so the 'touch' command creates
the trigger file that tells it to exit(1) rather than loop again.
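
For anyone following along, a restore script of the kind described above might look roughly like the sketch below. This is not the 'rr' script from this post, which is not shown here; the function name wait_for_wal, its argument order, and the default 5-second poll are assumptions made for illustration, and signal handling is left out entirely. The log messages are modelled on the ones in the log excerpts in this message.

```shell
#!/bin/sh
# Hypothetical sketch of a warm-standby restore helper (NOT the author's 'rr').
# Usage: wait_for_wal ARCHIVE_DIR WAL_NAME DEST TRIGGER_FILE [POLL_SECONDS]
# Returns 0 once the segment has been copied to DEST, or 1 if the trigger
# file shows up first (or the copy fails), so recovery stops waiting.
wait_for_wal() {
    archive_dir=$1
    wal=$2
    dest=$3
    trigger=$4
    poll=${5:-5}

    echo "$(date): Attempting to restore $wal"

    # Poll until the requested segment appears in the archive.
    while [ ! -f "$archive_dir/$wal" ]; do
        # The trigger file is the out-of-band request to end recovery.
        if [ -f "$trigger" ]; then
            echo "$(date): End of recovery trigger file found"
            return 1
        fi
        echo "$(date): Waiting for file to become available"
        sleep "$poll"
    done

    cp "$archive_dir/$wal" "$dest" || return 1
    echo "$(date): Finished $wal"
    return 0
}
```

In a sketch like this, the trigger file is only checked between sleeps, so killing the sleep just makes the loop come round sooner; it is the trigger file, not the signal, that ends the wait.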

The log file (the timestamp entries are from my 'rr' program so I
can see what it's doing)...

To start with we see a normal recovery:
 Wed Nov 22 13:41:20 EST 2006: Attempting to restore 000000010000000100000056
 Wed Nov 22 13:41:25 EST 2006: Finished 000000010000000100000056
 LOG:  restored log file "000000010000000100000056" from archive
 Wed Nov 22 13:41:25 EST 2006: Attempting to restore 000000010000000100000057
 Wed Nov 22 13:41:25 EST 2006: Waiting for file to become available
 

Now I send the kill signal...
 LOG:  received immediate shutdown request

We can see that the sleep process got it!
 /export/home/swharris/rr[37]: 29537 Quit(coredump)
And my script detects the trigger file
 Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found

Now database recovery appears to continue as normal; the postgres
recovery processes are still running, despite having received SIGQUIT
 LOG:  could not open file "pg_xlog/000000010000000100000057" (log file 1, segment 87): No such file or directory
 LOG:  redo done at 1/56000070
 Wed Nov 22 13:43:51 EST 2006: Attempting to restore 000000010000000100000056
 Wed Nov 22 13:43:55 EST 2006: Finished 000000010000000100000056
 LOG:  restored log file "000000010000000100000056" from archive
 LOG:  archive recovery complete
 LOG:  database system is ready
 LOG:  logger shutting down
 

pg_xlog now contains 000000010000000100000056 and 000000010000000100000057

A similar sort of thing happens if I use SIGTERM rather than SIGQUIT

I'm out of here in an hour, so for all you US-based people, have a good
Thanksgiving holiday!

-- 

rgds
Stephen

