Thread: Re: [GENERAL] Shutting down a warm standby database in 8.2beta3
Stephen Harris <lists@spuddy.org> writes:
> Doing a shutdown "immediate" isn't too clever because it actually leaves
> the recovery threads running

> LOG:  restored log file "00000001000000010000003E" from archive
> LOG:  received immediate shutdown request
> LOG:  restored log file "00000001000000010000003F" from archive

Hm, that should work --- AFAICS the startup process should abort on
SIGQUIT the same as any regular backend.

[ thinks... ]  Ah-hah, "man system(3)" tells the tale:

    system() ignores the SIGINT and SIGQUIT signals, and blocks the
    SIGCHLD signal, while waiting for the command to terminate.  If this
    might cause the application to miss a signal that would have killed
    it, the application should examine the return value from system() and
    take whatever action is appropriate to the application if the command
    terminated due to receipt of a signal.

So the SIGQUIT went to the recovery script command and was missed by the
startup process.  It looks to me like your script actually ignored the
signal, which you'll need to fix, but it also looks like we are not
checking for these cases in RestoreArchivedFile(), which we'd better fix.
As the code stands, if the recovery script is killed by a signal, we'd
take that as normal termination of the recovery and proceed to come up,
which is definitely the Wrong Thing.

			regards, tom lane
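[ Illustration, not the actual PostgreSQL source: the kind of check Tom is
describing for RestoreArchivedFile() can be sketched in C by inspecting
system()'s return value with the wait-status macros, so that a script killed
by a signal is treated as a failure rather than as a normal exit.  The helper
name and the command string below are invented for the example. ]

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>

  /* Hypothetical helper: run a restore command and classify the result.
   * Returns 0 if the command exited cleanly with status 0, -1 otherwise.
   * Termination by signal is treated as a hard failure, since it most
   * likely means the administrator is trying to shut the standby down. */
  static int
  run_restore_command(const char *cmd)
  {
      int rc = system(cmd);

      if (rc == -1)
      {
          perror("system");
          return -1;
      }
      if (WIFSIGNALED(rc))
      {
          fprintf(stderr, "restore command terminated by signal %d\n",
                  WTERMSIG(rc));
          return -1;            /* do NOT treat this as "recovery finished" */
      }
      if (WEXITSTATUS(rc) != 0)
      {
          fprintf(stderr, "restore command exited with status %d\n",
                  WEXITSTATUS(rc));
          return -1;
      }
      return 0;
  }

  int
  main(void)
  {
      /* example invocation; the path is purely illustrative */
      return run_restore_command("cp /archive/000000010000000100000057 pg_xlog/RECOVERYXLOG") ? 1 : 0;
  }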
On Fri, Nov 17, 2006 at 05:03:44PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > Doing a shutdown "immediate" isn't too clever because it actually leaves
> > the recovery threads running
>
> > LOG:  restored log file "00000001000000010000003E" from archive
> > LOG:  received immediate shutdown request
> > LOG:  restored log file "00000001000000010000003F" from archive
>
> Hm, that should work --- AFAICS the startup process should abort on
> SIGQUIT the same as any regular backend.
>
> [ thinks... ]  Ah-hah, "man system(3)" tells the tale:
>
>     system() ignores the SIGINT and SIGQUIT signals, and blocks the
>     SIGCHLD signal, while waiting for the command to terminate.  If this
>     might cause the application to miss a signal that would have killed
>     it, the application should examine the return value from system() and
>     take whatever action is appropriate to the application if the command
>     terminated due to receipt of a signal.
>
> So the SIGQUIT went to the recovery script command and was missed by the
> startup process.  It looks to me like your script actually ignored the
> signal, which you'll need to fix, but it also looks like we are not

My script was just a ksh script and didn't do anything special with
signals.  Essentially it does

  #!/bin/ksh -p

  [...variable setup...]

  while [ ! -f $wanted_file ]
  do
    if [ -f $abort_file ]
    then
      exit 1
    fi
    sleep 5
  done

  cat $wanted_file

I know signals can be deferred in scripts (a signal sent to the script
during the sleep will be deferred if a trap handler had been written for
the signal) but they _do_ get delivered.

However, it seems the signal wasn't sent at all.  Once the wanted file
appeared, the recovery thread from postmaster started a _new_ script for
the next log.

I'll rewrite the script in perl (probably Monday when I'm back in the
office) and stick lots of signal() traps in to see if anything does get
sent to the script.

> As the code stands, if the recovery script is killed by a signal, we'd
> take that as normal termination of the recovery and proceed to come up,
> which is definitely the Wrong Thing.

Oh good; that means I'm not mad :-)

-- 
rgds
Stephen
"Stephen Harris" <lists@spuddy.org> writes: > My script was just a ksh script and didn't do anything special with signals. > Essentially it does > #!/bin/ksh -p > > [...variable setup...] > while [ ! -f $wanted_file ] > do > if [ -f $abort_file ] > then > exit 1 > fi > sleep 5 > done > cat $wanted_file > > I know signals can be deferred in scripts (a signal sent to the script during > the sleep will be deferred if a trap handler had been written for the signal) > but they _do_ get delivered. Sure, but it might be getting delivered to, say, your "sleep" command. You haven't checked the return value of sleep to handle any errors that may occur. As it stands you have to check for errors from every single command executed by your script. That doesn't seem terribly practical to expect of useres. As long as Postgres is using SIGQUIT for its own communication it seems it really ought to arrange to block the signal while the script is running so it will receive the signals it expects once the script ends. Alternatively perhaps Postgres really ought to be using USR1/USR2 or other signals that library routines won't think they have any business rearranging. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Stephen Harris <lists@spuddy.org> writes:
> However, it seems the signal wasn't sent at all.

Now that I think about it, the behavior of system() is predicated on the
assumption that SIGINT and SIGQUIT originate with the tty driver and are
broadcast to all members of the session's process group --- so the called
command will get them too, and there's no need for system() to do anything
except wait to see whether the called command dies or traps the signal.

This does not apply to signals originated by the postmaster --- it doesn't
even know that the child process is doing a system(), much less have any
way to signal the grandchild.  Ugh.

Reimplementing system() seems pretty ugly, but maybe we have no choice.
It strikes me that system() has a race condition as defined anyway, because
if a signal arrives between blocking the handler and issuing the fork(),
it'll disappear into the ether; and the same at the end of the routine.

			regards, tom lane
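[ For illustration only: one shape "reimplementing system()" could take ---
fork, exec the shell, and wait, without ignoring SIGINT/SIGQUIT in the
parent, so the startup process's own handlers still run while the script
executes.  This is a hedged sketch, not the change that was actually made;
pg_system_sketch() is an invented name. ]

  #include <errno.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /*
   * Hypothetical replacement for system(): run "cmd" via /bin/sh but leave
   * the parent's SIGINT/SIGQUIT disposition alone, so a signal from the
   * postmaster still interrupts the waiting parent.
   * Returns the child's wait status, or -1 on error.
   */
  static int
  pg_system_sketch(const char *cmd)
  {
      pid_t pid = fork();

      if (pid < 0)
          return -1;

      if (pid == 0)
      {
          /* child: behave like system()'s child */
          execl("/bin/sh", "sh", "-c", cmd, (char *) NULL);
          _exit(127);           /* exec failed */
      }

      for (;;)
      {
          int   status;
          pid_t w = waitpid(pid, &status, 0);

          if (w == pid)
              return status;
          if (w < 0 && errno != EINTR)
              return -1;
          /* EINTR: a signal arrived; the parent's handler has already run */
      }
  }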
On Fri, Nov 17, 2006 at 09:39:39PM -0500, Gregory Stark wrote:
> "Stephen Harris" <lists@spuddy.org> writes:
> >   [...variable setup...]
> >   while [ ! -f $wanted_file ]
> >   do
> >     if [ -f $abort_file ]
> >     then
> >       exit 1
> >     fi
> >     sleep 5
> >   done
> >   cat $wanted_file
> > I know signals can be deferred in scripts (a signal sent to the script during
>
> Sure, but it might be getting delivered to, say, your "sleep" command.  You

No.  The sleep command keeps on running.  I could see that using "ps".

To the best of my knowledge, a random child process of the script wouldn't
even get a signal.  All the postmaster recovery thread knows about is the
system() - ie "sh -c".  All sh knows about is the ksh process.  Neither
postmaster nor sh know about "sleep" and so "sleep" wouldn't receive the
signal (unless it was sent to all processes in the process group).

Here's an example from Solaris 10 demonstrating lack of signal propagation.

  $ uname -sr
  SunOS 5.10
  $ echo $0
  /bin/sh
  $ cat x
  #!/bin/ksh -p
  sleep 10000
  $ ./x &
  4622
  $ kill 4622
  $
  4622 Terminated
  $ ps -ef | grep sleep
      sweh  4624  4602   0 22:13:13 pts/1       0:00 grep sleep
      sweh  4623     1   0 22:13:04 pts/1       0:00 sleep 10000

This is, in fact, what proper "job control" shells do.  Doing the same test
with ksh as the command shell will kill the sleep :-)

  $ echo $0
  -ksh
  $ ./x &
  [1]     4632
  $ kill %1
  [1] + Terminated               ./x &
  $ ps -ef | grep sleep
      sweh  4635  4582   0 22:15:17 pts/1       0:00 grep sleep

[ Aside: The only way I've been able to guarantee all processes and child
processes and everything to be killed is to run a subprocess with setsid()
to create a new process group and kill the whole process group.  It's a
pain ]

If postmaster was sending a signal to the system() process then "sh -c"
might not signal the ksh script anyway.  The ksh script might terminate,
or it might defer until sleep had finished.  Only if postmaster had
signalled a complete process group would sleep ever see the signal.

-- 
rgds
Stephen
On Fri, Nov 17, 2006 at 10:49:39PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > However, it seems the signal wasn't sent at all.
>
> Now that I think about it, the behavior of system() is predicated on the
> assumption that SIGINT and SIGQUIT originate with the tty driver and are
> broadcast to all members of the session's process group --- so the

> This does not apply to signals originated by the postmaster --- it
> doesn't even know that the child process is doing a system(), much less
> have any way to signal the grandchild.  Ugh.

Why not, after calling fork(), create a new process group with setsid() and
then, instead of killing the recovery thread, kill the whole process group
(-PID rather than PID)?  Then every process (the recovery thread, the
system, the script, any child of the script) will all receive the signal.

-- 
rgds
Stephen
Stephen Harris <lists@spuddy.org> writes:
> On Fri, Nov 17, 2006 at 10:49:39PM -0500, Tom Lane wrote:
>> This does not apply to signals originated by the postmaster --- it
>> doesn't even know that the child process is doing a system(), much less
>> have any way to signal the grandchild.  Ugh.

> Why not, after calling fork() create a new process group with setsid() and
> then instead of killing the recovery thread, kill the whole process group
> (-PID rather than PID)?  Then every process (the recovery thread, the
> system, the script, any child of the script) will all receive the signal.

This seems like a good answer if setsid and/or setpgrp are universally
available.  I fear it won't work on Windows though :-(.  Also, each backend
would become its own process group leader --- does anyone know if adding
hundreds of process groups would slow down any popular kernels?

[ thinks for a bit... ]  Another issue is that there'd be a race condition
during backend start: if the postmaster tries to kill -PID before the
backend has managed to execute setsid, it wouldn't work.

			regards, tom lane
Gregory Stark <stark@enterprisedb.com> writes:
> Sure, but it might be getting delivered to, say, your "sleep" command.  You
> haven't checked the return value of sleep to handle any errors that may
> occur.  As it stands you have to check for errors from every single command
> executed by your script.

The expectation is that something like SIGINT or SIGQUIT would be delivered
to both the sleep command and the shell process running the script.  So the
shell should fail anyway.  (Of course, a nontrivial archive or recovery
script had better be checking for failures at each step, but this is not
very relevant to the immediate problem.)

> Alternatively perhaps Postgres really ought to be using USR1/USR2 or other
> signals that library routines won't think they have any business rearranging.

The existing signal assignments were all picked for what seem to me to be
good reasons; I'm disinclined to change them.  In any case, the important
point here is that we'd really like an archive or recovery script, or for
that matter any command executed via system() from a backend, to abort when
the parent backend is SIGINT'd or SIGQUIT'd.

Stephen's idea of executing setsid() at each backend start seems
interesting, but is there a way that will work on Windows?

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> Sure, but it might be getting delivered to, say, your "sleep" command. You >> haven't checked the return value of sleep to handle any errors that may occur. >> As it stands you have to check for errors from every single command executed >> by your script. > > The expectation is that something like SIGINT or SIGQUIT would be > delivered to both the sleep command and the shell process running the > script. So the shell should fail anyway. (Of course, a nontrivial > archive or recovery script had better be checking for failures at each > step, but this is not very relevant to the immediate problem.) Hm, I tried to test that before I sent that. But I guess my test was faulty since I was really testing what process the terminal handling delivered the signal to: $ cat /tmp/test.sh #!/bin/sh echo before sleep 5 || echo sleep failed echo after $ sh /tmp/test.sh ; echo $? before ^\ /tmp/test.sh: line 4: 23407 Quit sleep 5 sleep failed after 0 -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> Hm, I tried to test that before I sent that.  But I guess my test was
> faulty since I was really testing what process the terminal handling
> delivered the signal to:

Interesting.  I tried the same test on HPUX, and find that its /bin/sh
seems to ignore SIGQUIT but not SIGINT:

  $ sh /tmp/test.sh ; echo $?
  before
          -- typed ^C here
  130
  $ sh /tmp/test.sh ; echo $?
  before
          -- typed ^\ here
  /tmp/test.sh[4]: 25166 Quit(coredump)
  sleep failed
  after
  0
  $

There is nothing in the shell man page about this :-(

That seems to leave us back at square one.  How can we ensure an archive or
recovery script will fail on being signaled?  (Obviously we can't prevent
someone from trapping the signal, but it'd be good if the default behavior
was this way.)

			regards, tom lane
On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > Why not, after calling fork() create a new process group with setsid() and
> > then instead of killing the recovery thread, kill the whole process group
> > (-PID rather than PID)?  Then every process (the recovery thread, the
> > system, the script, any child of the script) will all receive the signal.
>
> This seems like a good answer if setsid and/or setpgrp are universally
> available.  I fear it won't work on Windows though :-(.  Also, each

It's POSIX, so I would suppose it's standard on most modern *nix platforms.
Windows... bluh.  I wonder how perl handles POSIX::setsid() on Windows!

> backend would become its own process group leader --- does anyone know
> if adding hundreds of process groups would slow down any popular
> kernels?

Shouldn't hurt.  This is, after all, what using "&" in a command line shell
with job control (csh, ksh, tcsh, bash, zsh) does.  Because you only run
one archive or recovery thread at a time (which is very good and very
clever) you won't have too many process groups at any instant in time.

> [ thinks for a bit... ]  Another issue is that there'd be a race
> condition during backend start: if the postmaster tries to kill -PID
> before the backend has managed to execute setsid, it wouldn't work.

*ponder*  Bugger.  Standard solutions (eg try three times with a second
pause) would mitigate this, but....  Hmm.

Another idea is to make the shutdown more co-operative, under control of
the script; eg an exit code of 0 means the xlog is now available, an exit
code of 1 means the log is non-existent (so recovery is complete) and an
exit code of 255 means "failure to recover; perform database shutdown".
In this way a solution similar to the existing trigger files ("recovery
complete") could be used.

It's a little messy in that pg_ctl wouldn't be used to shut down the
database; the script would essentially tell the recovery thread to abort,
which would tell the main postmaster to shut down.  We'd have no clients
connected, no child process running, so a smart shutdown would work.

-- 
rgds
Stephen
Stephen Harris <lists@spuddy.org> writes:
> On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
>> [ thinks for a bit... ]  Another issue is that there'd be a race
>> condition during backend start: if the postmaster tries to kill -PID
>> before the backend has managed to execute setsid, it wouldn't work.

> *ponder*  Bugger.  Standard solutions (eg try three times with a second
> pause) would mitigate this, but....  Hmm.

I think we could make it work if the postmaster does this:

	kill(child_pid, SIGxxx);
#ifdef HAVE_SETSID
	kill(-child_pid, SIGxxx);
#endif

In the normal case where the child has already completed setsid(), the
extra signal sent to it should do no harm.  In the startup race condition
case, the second kill() accomplishes nothing, and we are effectively
assuming that the child will enable signals and respond to the first kill
before it launches any grandchildren; seems safe enough.

Of course the postmaster should only do this when sending signals that are
likely to be interpreted the same way by random shell scripts as they are
by backends, ie, SIGINT, SIGQUIT, SIGTERM (not, say, SIGUSR1).

This would fix the problem on Unixen with setsid, which should be pretty
much all of them; on any that don't have it, the behavior doesn't change.
Still leaves Windows out in the cold, but I don't see that we have anything
much we can do about it there --- Windows users will just have to know to
kill the child script process instead of the postmaster.

Comments?

			regards, tom lane
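[ A hedged sketch fleshing Tom's fragment out into a compilable toy (invented
code, not the committed patch): the forked child calls setsid() right away so
it becomes a process-group leader, and the parent then signals the pid and,
where setsid is available, the whole group.  Compile with -DHAVE_SETSID to
exercise the group kill. ]

  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int
  main(void)
  {
      pid_t child = fork();

      if (child == 0)
      {
          /* Child: become a process-group/session leader immediately, so a
           * later kill(-child, sig) reaches this process and anything it
           * launches via system() (the restore script, its sleep, etc). */
          if (setsid() < 0)
              perror("setsid");
          execl("/bin/sh", "sh", "-c", "sleep 30", (char *) NULL);
          _exit(127);
      }

      sleep(1);   /* crude stand-in for "the child has had time to setsid" */

      /* Parent: signal the child directly, then its whole process group. */
      kill(child, SIGTERM);
  #ifdef HAVE_SETSID
      kill(-child, SIGTERM);
  #endif

      waitpid(child, NULL, 0);
      return 0;
  }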
> This would fix the problem on Unixen with setsid, which
> should be pretty much all of them; on any that don't have it,
> the behavior doesn't change.  Still leaves Windows out in the
> cold, but I don't see that we have anything much we can do
> about it there --- Windows users will just have to know to
> kill the child script process instead of the postmaster.

I haven't really been following this thread.  What exactly is it you want
to accomplish (maybe there is a different way to fix it on win32)?  Kill a
process along with all its subprocesses?

//Magnus
"Magnus Hagander" <mha@sollentuna.net> writes: > I haven't really been folliowing this thread. What exactly is it you > want to accomplish (maybe there is a different way to fix it on win32)? > Kill a process along with all it's subprocesses? Right. The problem (as seen on Unix) is that if the recovery subprocess is waiting for a recovery_command script to complete, it doesn't react to SIGQUIT (because system() ignores SIGQUIT) and the script doesn't react either (because it never gets the signal at all). We'd like "pg_ctl stop -m immediate" to abort the script and then have the recovery process fail, allowing restart later. More generally, it seems like any time a backend is executing system(), the called script ought to be subject to SIGINT or SIGQUIT signals sent to the backend by the postmaster. regards, tom lane
> > I haven't really been following this thread.  What exactly is it you
> > want to accomplish (maybe there is a different way to fix it on win32)?
> > Kill a process along with all its subprocesses?
>
> Right.  The problem (as seen on Unix) is that if the recovery
> subprocess is waiting for a restore_command script to
> complete, it doesn't react to SIGQUIT (because system()
> ignores SIGQUIT) and the script doesn't react either (because
> it never gets the signal at all).  We'd like "pg_ctl stop -m
> immediate" to abort the script and then have the recovery
> process fail, allowing restart later.  More generally, it
> seems like any time a backend is executing system(), the
> called script ought to be subject to SIGINT or SIGQUIT
> signals sent to the backend by the postmaster.

Hm, ok.  Well, first of all, the generic case will never really work -
because we send our own kind of signals, and a script (batch file executing
under cmd.exe) or anything non-postgresql-backend will never respond to
that anyway.

However, for this specific case we might be able to do something.  pg_ctl
creates a job object that it assigns the postmaster to when it's started,
if that works.  (Won't work if pg_ctl itself is running inside a job
object, but normally it won't be doing that.)  That means that we could
have pg_ctl look up this job object and execute TerminateJobObject() on
it.  That will kill all processes inside the job object (including
backends, but if we run it from pg_ctl *after* the postmaster has shut
down, there should be none of those left).

(This will work for the generic processes as well, but it will *only* work
for the pg_ctl-stops-the-process scenario; it can't be done from inside the
postmaster.)

It's not entirely pretty (to say the least), but might be worthwhile
looking at?

//Magnus
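[ Rough Win32 sketch of the job-object mechanism Magnus describes, for
illustration only; the job name, command line and error handling are
invented, and this is not pg_ctl's actual code.  The point is that processes
started inside a job (and their children) can all be killed with one
TerminateJobObject() call. ]

  #include <windows.h>
  #include <stdio.h>

  int
  main(void)
  {
      /* create a named job object ("PostgreSQLJobSketch" is made up) */
      HANDLE job = CreateJobObjectA(NULL, "PostgreSQLJobSketch");
      if (job == NULL)
      {
          fprintf(stderr, "CreateJobObject failed: %lu\n", GetLastError());
          return 1;
      }

      STARTUPINFOA        si;
      PROCESS_INFORMATION pi;
      char cmdline[] = "postgres.exe -D data";   /* illustrative command line */

      ZeroMemory(&si, sizeof(si));
      si.cb = sizeof(si);

      /* start the postmaster suspended, put it in the job, then let it run;
       * child processes it spawns inherit job membership by default */
      if (!CreateProcessA(NULL, cmdline, NULL, NULL, FALSE,
                          CREATE_SUSPENDED, NULL, NULL, &si, &pi))
      {
          fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
          return 1;
      }
      AssignProcessToJobObject(job, pi.hProcess);
      ResumeThread(pi.hThread);

      /* ... later, at "pg_ctl stop" time: kill every process still in the
       * job, e.g. a restore_command batch file left running after shutdown */
      TerminateJobObject(job, 1);

      CloseHandle(pi.hThread);
      CloseHandle(pi.hProcess);
      CloseHandle(job);
      return 0;
  }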
"Magnus Hagander" <mha@sollentuna.net> writes: > However, for this specific case we might be able to do something. pg_ctl > creates a job object that it assigns the postmaster to when it's started > if that works. (Won't work if pg_ctl itself is running inside a job > object, but normally it won't be doing that). That means that we could > have pg_ctl look up this job object and execute TerminateJobObject() on > it. That will kill all processes inside the job object (includnig > backends, but if we run it from pg_ctl *after* postmaster has shut down, > there should be none of those left). That would take care of cleaning up any scripts after the postmaster has already quit, but what about the problem of kicking the waiting backend off the system() call in the first place? I suppose that if we can hack things so system() doesn't block SIGQUIT on Windows, it might be OK, at least for the "-m immediate" case. I was hoping we could get Query Cancel to SIGINT a system()-invoked script in a regular backend, but that might be too much to hope for on Windows, especially if it hasn't even got a distinction between INT and QUIT ... regards, tom lane
On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
> >> [ thinks for a bit... ]  Another issue is that there'd be a race
> >> condition during backend start: if the postmaster tries to kill -PID
> >> before the backend has managed to execute setsid, it wouldn't work.
>
> > *ponder*  Bugger.  Standard solutions (eg try three times with a second
> > pause) would mitigate this, but....  Hmm.
>
> I think we could make it work if the postmaster does this:
>
> 	kill(child_pid, SIGxxx);
> #ifdef HAVE_SETSID
> 	kill(-child_pid, SIGxxx);
> #endif

That should work.  I just created a dummy test case where a process forks a
child, the child does setsid and then forks 3 times, so 8 processes all in
the new process group.  The original parent sleeps for 3 seconds then kills
the original child (the process group leader), then sleeps 3 more seconds
and then kills the process group.  A simple sig handler shows the signal
being received by the remaining children.

Tested on RHEL4 (2.6.9-42.0.3.ELsmp) and Solaris 10 (Sparc) and got the
same results.

  #include <signal.h>
  #include <unistd.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Signal handler: report which process got the signal, then exit. */
  void sig(int x)
  {
      int y = getpid();
      printf("Process %d got signal\n", y);
      exit(0);
  }

  /* Child: become a process group leader, fork into 8 processes, sleep. */
  void do_child()
  {
      setsid();
      fork();
      fork();
      fork();
      printf("%d sleeping for a long time!\n", (int) getpid());
      sleep(10000);
  }

  int main()
  {
      signal(SIGQUIT, &sig);
      signal(SIGCLD, SIG_IGN);
      int ch = fork();
      if (ch == 0)
      {
          do_child();
      }
      else
      {
          int e;
          sleep(3);
          printf("Killing first child\n");
          printf("Return is: %d\n", e = kill(ch, SIGQUIT));
          if (e) { perror("Bad kill"); }
          sleep(3);
          printf("Killing process group\n");
          printf("Return is: %d\n", e = kill(-ch, SIGQUIT));
          if (e) { perror("Bad kill"); }
      }
      exit(0);
  }

-- 
rgds
Stephen
On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:
>
> 	kill(child_pid, SIGxxx);
> #ifdef HAVE_SETSID
> 	kill(-child_pid, SIGxxx);
> #endif
>
> In the normal case where the child has already completed setsid(), the
> extra signal sent to it should do no harm.  In the startup race

Hmm.  It looks like something more than this may be needed.  The postgres
recovery process appears to be ignoring it.

I ran the whole database in its own process group (ksh runs processes in
their own process group by default, so pg_ctl became the session leader and
so everything under pg_ctl all stayed in that process group).

  % ps -o pid,ppid,pgid,args -g 29141 | sort
    PID  PPID  PGID COMMAND
  29145     1 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
  29146 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
  29147 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
  29501 29147 29141 sh -c /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYXLOG
  29502 29501 29141 /bin/ksh -p /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYX
  29537 29502 29141 sleep 5

I did

  kill -QUIT -29141 ; sleep 1 ; touch /export/home/swharris/archives/STOP_SWEH_RECOVERY

This sent the QUIT signal to all those processes.  The shell script ignores
it and so tries to start again, so the 'touch' command tells it to exit(1)
rather than loop again.

The log file (the timestamp entries are from my 'rr' program so I can see
what it's doing)...  To start with we see a normal recovery:

  Wed Nov 22 13:41:20 EST 2006: Attempting to restore 000000010000000100000056
  Wed Nov 22 13:41:25 EST 2006: Finished 000000010000000100000056
  LOG:  restored log file "000000010000000100000056" from archive
  Wed Nov 22 13:41:25 EST 2006: Attempting to restore 000000010000000100000057
  Wed Nov 22 13:41:25 EST 2006: Waiting for file to become available

Now I send the kill signal...

  LOG:  received immediate shutdown request

We can see that the sleep process got it!

  /export/home/swharris/rr[37]: 29537 Quit(coredump)

And my script detects the trigger file

  Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found

Now database recovery appears to continue as normal; the postgres recovery
processes are still running, despite having received SIGQUIT

  LOG:  could not open file "pg_xlog/000000010000000100000057" (log file 1, segment 87): No such file or directory
  LOG:  redo done at 1/56000070
  Wed Nov 22 13:43:51 EST 2006: Attempting to restore 000000010000000100000056
  Wed Nov 22 13:43:55 EST 2006: Finished 000000010000000100000056
  LOG:  restored log file "000000010000000100000056" from archive
  LOG:  archive recovery complete
  LOG:  database system is ready
  LOG:  logger shutting down

pg_xlog now contains 000000010000000100000056 and 000000010000000100000057

A similar sort of thing happens if I use SIGTERM rather than SIGQUIT

I'm out of here in an hour, so for all you US based people, have a good
Thanksgiving holiday!

-- 
rgds
Stephen
Stephen Harris <lists@spuddy.org> writes:
> We can see that the sleep process got it!
>   /export/home/swharris/rr[37]: 29537 Quit(coredump)
> And my script detects the trigger file
>   Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found

> Now database recovery appears to continue as normal; the postgres
> recovery processes are still running, despite having received SIGQUIT

Well, sure, because system() will have ignored the SIGQUIT.  You have to
make the shell script return an abort code rather than exit(1) or whatever
it's doing in this case.  (I've tested that CVS HEAD does the right thing
when the script is just a straight invocation of cp.  But if the script is
trapping errors, it's going to have to cooperate.)

			regards, tom lane
On Wed, Nov 22, 2006 at 03:03:03PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > We can see that the sleep process got it!
> >   /export/home/swharris/rr[37]: 29537 Quit(coredump)
> > And my script detects the trigger file
> >   Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found
>
> > Now database recovery appears to continue as normal; the postgres
> > recovery processes are still running, despite having received SIGQUIT
>
> Well, sure, because system() will have ignored the SIGQUIT.  You have to
> make the shell script return an abort code rather than exit(1) or
> whatever it's doing in this case.  (I've tested that CVS HEAD does the

Hmm, on Linux I did a kill -QUIT on the process group and then, rather than
cleanly exiting my recovery script, I "kill -9"d it.  The database still
completed recovery as previously described.

> right thing when the script is just a straight invocation of cp.  But
> if the script is trapping errors, it's going to have to cooperate.)

The script doesn't trap errors; ksh itself plays with SIGQUIT.  I guess I
should rewrite my script in perl and see what happens.

-- 
rgds
Stephen
On Thu, Nov 23, 2006 at 01:18:06PM -0500, Stephen Harris wrote:
> Hmm, on Linux I did a kill -QUIT on the process group and then rather
> than cleanly exiting my recovery script I "kill -9"d it.  The database
> still completed recovery as previously described.

OK, I just did the same tests on the CVS tree, rather than beta3, and I see
different results.  This time recovery aborts with a PANIC.

I did a "pg_ctl stop -m i" and then kill -9'd the recovery script and:

  LOG:  restored log file "000000010000000000000006" from archive
  Looking for 000000010000000000000007
  LOG:  received immediate shutdown request
  /tmp/rr: line 12: 22031 Quit                    sleep 5
  PANIC:  could not restore file "000000010000000000000007" from archive: return code 9

Similar behaviour was seen with "smart" and "fast" shutdowns.

  Looking for 000000010000000000000005
  Restoring 000000010000000000000005 to pg_xlog/RECOVERYXLOG
  LOG:  restored log file "000000010000000000000005" from archive
  LOG:  checkpoint record is at 0/5000020
  LOG:  redo record is at 0/5000020; undo record is at 0/0; shutdown FALSE
  LOG:  next transaction ID: 0/601; next OID: 24577
  LOG:  next MultiXactId: 1; next MultiXactOffset: 0
  LOG:  automatic recovery in progress
  LOG:  redo starts at 0/5000068
  Looking for 000000010000000000000006
  Restoring 000000010000000000000006 to pg_xlog/RECOVERYXLOG
  LOG:  restored log file "000000010000000000000006" from archive
  Looking for 000000010000000000000007
  LOG:  received fast shutdown request
  PANIC:  could not restore file "000000010000000000000007" from archive: return code 9
  LOG:  startup process (PID 22246) was terminated by signal 6
  LOG:  aborting startup due to startup process failure

Restarting the database appears to go back to asking for archive log 5
again.  I'll have to do some testing with more archives to see how far back
we have to go.

-- 
rgds
Stephen
Stephen Harris <lists@spuddy.org> writes:
> OK, I just did the same tests on the CVS tree, rather than beta3 and
> I see different results.  This time recovery aborts with a PANIC.

Yeah, that's expected since the whole recovery process is a critical
section.  We could change that but it's cosmetic.

> Restarting the database appears to go back to asking for archive log 5
> again.

It would back up to wherever the last recovery checkpoint was.  I think
we've got it hardwired at one recovery checkpoint per 100 source
checkpoints ... maybe that's too far apart.

			regards, tom lane
On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
> It would back up to wherever the last recovery checkpoint was.  I think
> we've got it hardwired at one recovery checkpoint per 100 source
> checkpoints ... maybe that's too far apart.

I'm talking out of my arse now, since I have no idea of the internals of
postgres recovery, but maybe a recovery checkpoint can be added each time
it asks for a new archive file, or every 100 source checkpoints, whichever
comes first.

A database that is mostly idle (eg overnight) but has 1 or 2 critical
transactions will have received lots of archive logs (from the 5 minute
checkpoint timeout) but not necessarily enough to cause recovery to
checkpoint.  This could mean needing to keep a large number of shipped
logfiles available on the standby system "just in case".

-- 
rgds
Stephen
On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > OK, I just did the same tests on the CVS tree, rather than beta3 and
> > I see different results.  This time recovery aborts with a PANIC.
>
> Yeah, that's expected since the whole recovery process is a critical
> section.  We could change that but it's cosmetic.

Should this cause a coredump when it happens?  With rc1, whenever I do a
shutdown I get a core file

  core: ELF 32-bit MSB core file SPARC Version 1, from 'postgres'

  (gdb) bt
  #0  0xff0a02d4 in _libc_kill () from /usr/lib/libc.so.1
  #1  0xff035a5c in abort () from /usr/lib/libc.so.1
  #2  0x0020cd50 in errfinish ()
  #3  0x0006d258 in RestoreArchivedFile ()

-- 
rgds
Stephen
Stephen Harris wrote:
> On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
> > Stephen Harris <lists@spuddy.org> writes:
> > > OK, I just did the same tests on the CVS tree, rather than beta3 and
> > > I see different results.  This time recovery aborts with a PANIC.
> >
> > Yeah, that's expected since the whole recovery process is a critical
> > section.  We could change that but it's cosmetic.
>
> Should this cause a coredump when it happens?  With rc1 whenever I
> do a shutdown I get a core file
>
>   core: ELF 32-bit MSB core file SPARC Version 1, from 'postgres'
>
>   (gdb) bt
>   #0  0xff0a02d4 in _libc_kill () from /usr/lib/libc.so.1
>   #1  0xff035a5c in abort () from /usr/lib/libc.so.1
>   #2  0x0020cd50 in errfinish ()
>   #3  0x0006d258 in RestoreArchivedFile ()

You should never get a core file.  I think we need the binary compiled with
symbols so we can get more information from this backtrace.

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Stephen Harris wrote:
>> Should this cause a coredump when it happens?

> You should never get a core file.

elog(PANIC) => abort() => core dump.  This is completely expected.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Stephen Harris wrote:
> >> Should this cause a coredump when it happens?
>
> > You should never get a core file.
>
> elog(PANIC) => abort() => core dump.  This is completely expected.

Well, then I should have said there is no reason you should ever get a
panic.

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Thu, Nov 30, 2006 at 09:58:52AM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Stephen Harris wrote:
> > >> Should this cause a coredump when it happens?
> >
> > > You should never get a core file.
> >
> > elog(PANIC) => abort() => core dump.  This is completely expected.
>
> Well, then I should have said there is no reason you should ever get a
> panic.

Well, as Tom said earlier in the thread

> > I see different results.  This time recovery aborts with a PANIC.
>
> Yeah, that's expected since the whole recovery process is a critical
> section.  We could change that but it's cosmetic.

Because of the changes made, we don't actually need to do a database
shutdown.  Merely killing the restore_command process (eg with kill -9)
appears to cause the same effect.

And now a personal opinion...  I think this is more than cosmetic; shutting
down a standby database cleanly is critical functionality for proper
warm-standby procedures.  What we have now "works", but should be tidied
up.  Probably low on the priority queue though :-)

-- 
rgds
Stephen
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> elog(PANIC) => abort() => core dump.  This is completely expected.

> Well, then I should have said there is no reason you should ever get a
> panic.

The reason it's a PANIC is that the whole of StartupXLOG() runs as a
critical section.  Given the recent changes to make elog(FATAL) do exit(1)
instead of exit(0), I think it might be possible to dispense with that, but
it's rather a ticklish thing to be changing the day before we wrap 8.2.
How excited are you about this?

			regards, tom lane
Stephen Harris <lists@spuddy.org> writes:
> I think this is more than cosmetic; shutting down a standby database
> cleanly is critical functionality for proper warm-standby procedures.
> What we have now "works", but should be tidied up.

I've made some small adjustments that make "pg_ctl stop -m fast" work for
shutting down a warm-standby server.  Please test.

			regards, tom lane
On Thu, Nov 30, 2006 at 01:32:31PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > I think this is more than cosmetic; shutting down a standby database
> > cleanly is critical functionality for proper warm-standby procedures.
> > What we have now "works", but should be tidied up.
>
> I've made some small adjustments that make "pg_ctl stop -m fast" work
> for shutting down a warm-standby server.  Please test.

That's a lot cleaner.  "fast" is good because of SIGTERM, which the shell
script also will respond to even while waiting on another process

  LOG:  restored log file "00000001000000000000003E" from archive
  Thu Nov 30 15:45:56 EST 2006: Attempting to restore 00000001000000000000003F
  Thu Nov 30 15:45:58 EST 2006: Finished 00000001000000000000003F
  LOG:  restored log file "00000001000000000000003F" from archive
  Thu Nov 30 15:45:59 EST 2006: Attempting to restore 000000010000000000000040
  Thu Nov 30 15:45:59 EST 2006: Waiting for file to become available
  LOG:  received fast shutdown request
  Terminated
  FATAL:  could not restore file "000000010000000000000040" from archive: return code 36608
  LOG:  startup process (PID 10730) exited with exit code 1
  LOG:  aborting startup due to startup process failure
  LOG:  logger shutting down

(the datestamp entries are from my script)

Starting up the standby database still goes back to earlier log files, but
I guess that's the 100 checkpoint thing you mentioned earlier.

The change you just made is very good and very much tidier; no core files,
no panics, cleans up children nicely.  Good stuff!

-- 
rgds
Stephen
Stephen Harris <lists@spuddy.org> writes:
> Starting up the standby database still goes back to earlier log files,
> but I guess that's the 100 checkpoint thing you mentioned earlier.

Actually I was misremembering that: the frequency of recovery checkpoints
is time-based, and for a slave that's in sync with its master, they should
occur about as often as checkpoints on the master.  So the amount of log to
be replayed should be comparable to what the master would have to replay if
it crashed.

			regards, tom lane
On Thu, Nov 30, 2006 at 03:59:15PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > Starting up the standby database still goes back to earlier log files,
> > but I guess that's the 100 checkpoint thing you mentioned earlier.
>
> Actually I was misremembering that: the frequency of recovery
> checkpoints is time-based, and for a slave that's in sync with its
> master, they should occur about as often as checkpoints on the master.
> So the amount of log to be replayed should be comparable to what the
> master would have to replay if it crashed.

Hmm.  For testing purposes I have archive_timeout=5, so an archive log is
created every 5 seconds if there's activity.  In idle periods it's created
every 5 minutes as the internal checkpoint occurs on the server.  However,
if I replay all those "empty" logs, then shut down and start up, it will
ask for some of them again.

As a test, I started a recovery and recovered 11 archive logs
(000000010000000000000035 to 00000001000000000000003F); some may have had
activity, I'm not sure.  I then waited 10 minutes after the last log file
had been replayed.  Then I did a fast shutdown.  Starting the standby
database again required going back to 000000010000000000000035 again.

-- 
rgds
Stephen
Stephen Harris <lists@spuddy.org> writes:
> On Thu, Nov 30, 2006 at 03:59:15PM -0500, Tom Lane wrote:
>> Actually I was misremembering that: the frequency of recovery
>> checkpoints is time-based, and for a slave that's in sync with its
>> master, they should occur about as often as checkpoints on the master.
>> So the amount of log to be replayed should be comparable to what the
>> master would have to replay if it crashed.

> As a test, I started a recovery and recovered 11 archive logs
> (000000010000000000000035 to 00000001000000000000003F); some may have had
> activity, I'm not sure.  I then waited 10 minutes after the last log file
> had been replayed.  Then I did a fast shutdown.  Starting the standby
> database again required going back to 000000010000000000000035 again.

That case is not what I described --- you weren't tracking a live master
that's generating checkpoints.

			regards, tom lane
On Fri, 2006-11-24 at 07:20 -0500, Stephen Harris wrote:
> On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
>
> > It would back up to wherever the last recovery checkpoint was.  I think
> > we've got it hardwired at one recovery checkpoint per 100 source
> > checkpoints ... maybe that's too far apart.
>
> I'm talking out of my arse now, since I have no idea of the internals
> of postgres recovery, but maybe a recovery checkpoint can be added each
> time it asks for a new archive file, or every 100 source checkpoints,
> whichever comes first.

We can take a restorepoint during replay only when we come across a
checkpoint WAL record.  We take the first checkpoint WAL record we come
across after checkpoint_timeout seconds have expired, not 100 as described.

History: the 100-checkpoints-per-restorepoint idea was the way it was
originally, following the discussions at Anniversary.  That was replaced
prior to code commit with the concept of a minimum delay between
restorepoints equal to checkpoint_timeout.

> A database that is mostly idle (eg overnight) but has 1 or 2 critical
> transactions will have received lots of archive logs (from the 5 minute
> checkpoint timeout) but not necessarily enough to cause recovery to
> checkpoint.  This could mean needing to keep a large number of shipped
> logfiles available on the standby system "just in case".

It doesn't work that way.  There is only ever one WAL file being worked
upon at a time, so there is no build-up of log files in the case you
mention.  The case you mention would be captured by the current code,
since after 5 minutes' worth of WAL replay we would attempt a restorepoint
at the next checkpoint record.

If replay goes at exactly the same speed as the original WAL writes, then
there would be one restorepoint per original checkpoint.  If replay goes 10
times faster then there would be one restorepoint for each 10 checkpoints,
etc.  So the restorepoint is taken every 5 minutes in wall-clock time
during replay, whereas that might be anywhere from 5 minutes upwards in
original-WAL-write time, possibly hours or days if you aren't writing any
WAL at all for long periods.  But note that will not translate into many
WAL files.  If the de-archive time is significant then the average rate of
WAL apply will go down and so the number of restorepoints per checkpoint
will fall.

So overall, I can't see any need to change the code, but I'm listening for
different opinions, and Stephen's especially has been very good to have,
thanks.

Humble apologies to all for my late arrival on this thread.  I missed the
start of it during a busy period last week and didn't notice subsequent
postings.  I'll try to filter things better in future since I'm clearly not
managing to read all postings.  Please feel free to include me in any
threads.

If there's anything that people think I should be working on in support of
this, please say and I'll prioritise that now ahead of my docs and advocacy
activities around the release.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
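[ A hedged sketch of the restorepoint gate Simon describes (invented code,
not PostgreSQL's source): a restorepoint is considered only when a
checkpoint WAL record is replayed, and taken only if checkpoint_timeout
seconds of wall-clock time have passed since the previous one. ]

  #include <stdbool.h>
  #include <stdio.h>
  #include <time.h>

  static time_t    last_restorepoint_time;           /* 0 at startup */
  static const int checkpoint_timeout = 300;         /* seconds; e.g. 5 minutes */

  /* Call this for every replayed WAL record; record_is_checkpoint says
   * whether the record just replayed was a checkpoint record. */
  static bool
  should_take_restorepoint(bool record_is_checkpoint)
  {
      time_t now = time(NULL);

      if (!record_is_checkpoint)
          return false;                              /* can only act on checkpoint records */
      if (now - last_restorepoint_time < checkpoint_timeout)
          return false;                              /* too soon since the last restorepoint */

      last_restorepoint_time = now;
      return true;
  }

  int
  main(void)
  {
      /* pretend we just replayed a checkpoint record */
      if (should_take_restorepoint(true))
          puts("restorepoint taken");
      return 0;
  }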