Thread: Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> Doing a shutdown "immediate" isn't too clever because it actually leaves
> the recovery threads running

> LOG:  restored log file "00000001000000010000003E" from archive
> LOG:  received immediate shutdown request
> LOG:  restored log file "00000001000000010000003F" from archive

Hm, that should work --- AFAICS the startup process should abort on
SIGQUIT the same as any regular backend.

[ thinks... ]  Ah-hah, "man system(3)" tells the tale:

     system() ignores the SIGINT and SIGQUIT signals, and blocks the
     SIGCHLD signal, while waiting for the command to terminate.  If this
     might cause the application to miss a signal that would have killed
     it, the application should examine the return value from system() and
     take whatever action is appropriate to the application if the command
     terminated due to receipt of a signal.

So the SIGQUIT went to the recovery script command and was missed by the
startup process.  It looks to me like your script actually ignored the
signal, which you'll need to fix, but it also looks like we are not
checking for these cases in RestoreArchivedFile(), which we'd better fix.
As the code stands, if the recovery script is killed by a signal, we'd
take that as normal termination of the recovery and proceed to come up,
which is definitely the Wrong Thing.
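
For illustration, the kind of check RestoreArchivedFile would need might look
roughly like this (a minimal sketch only; restore_file, the message text and
the abort path are placeholders, not the actual PostgreSQL code):

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    /* Returns 1 if the segment was restored, 0 if it simply isn't there yet;
     * bails out entirely if the restore command died on a shutdown signal. */
    static int
    restore_file(const char *cmd)
    {
        int rc = system(cmd);

        if (rc == 0)
            return 1;                   /* script succeeded, file restored */

        if (rc == -1 ||
            (WIFSIGNALED(rc) &&
             (WTERMSIG(rc) == SIGINT ||
              WTERMSIG(rc) == SIGQUIT ||
              WTERMSIG(rc) == SIGTERM)))
        {
            /* killed, most likely because the server is being shut down */
            fprintf(stderr, "restore command terminated by signal\n");
            exit(1);
        }

        return 0;                       /* nonzero exit: segment not available yet */
    }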

            regards, tom lane

Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Fri, Nov 17, 2006 at 05:03:44PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > Doing a shutdown "immediate" isn't too clever because it actually leaves
> > the recovery threads running
>
> > LOG:  restored log file "00000001000000010000003E" from archive
> > LOG:  received immediate shutdown request
> > LOG:  restored log file "00000001000000010000003F" from archive
>
> Hm, that should work --- AFAICS the startup process should abort on
> SIGQUIT the same as any regular backend.
>
> [ thinks... ]  Ah-hah, "man system(3)" tells the tale:
>
>      system() ignores the SIGINT and SIGQUIT signals, and blocks the
>      SIGCHLD signal, while waiting for the command to terminate.  If this
>      might cause the application to miss a signal that would have killed
>      it, the application should examine the return value from system() and
>      take whatever action is appropriate to the application if the command
>      terminated due to receipt of a signal.
>
> So the SIGQUIT went to the recovery script command and was missed by the
> startup process.  It looks to me like your script actually ignored the
> signal, which you'll need to fix, but it also looks like we are not

My script was just a ksh script and didn't do anything special with signals.
Essentially it does
  #!/bin/ksh -p

  [...variable setup...]
  while [ ! -f $wanted_file ]
  do
    if [ -f $abort_file ]
    then
      exit 1
    fi
    sleep 5
  done
  cat $wanted_file

I know signals can be deferred in scripts (a signal sent to the script during
the sleep will be deferred if a trap handler had been written for the signal)
but they _do_ get delivered.

However, it seems the signal wasn't sent at all.  Once the wanted file
appeared the recovery thread from postmaster started a _new_ script for
the next log.  I'll rewrite the script in perl (probably Monday when
I'm back in the office) and stick lots of signal() traps in to see if
anything does get sent to the script.

> As the code stands, if the recovery script is killed by a signal, we'd
> take that as normal termination of the recovery and proceed to come up,
> which is definitely the Wrong Thing.

Oh good; that means I'm not mad :-)

--

rgds
Stephen

Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Gregory Stark
Date:
"Stephen Harris" <lists@spuddy.org> writes:

> My script was just a ksh script and didn't do anything special with signals.
> Essentially it does
>   #!/bin/ksh -p
>
>   [...variable setup...]
>   while [ ! -f $wanted_file ]
>   do
>     if [ -f $abort_file ]
>     then
>       exit 1
>     fi
>     sleep 5
>   done
>   cat $wanted_file
>
> I know signals can be deferred in scripts (a signal sent to the script during
> the sleep will be deferred if a trap handler had been written for the signal)
> but they _do_ get delivered.

Sure, but it might be getting delivered to, say, your "sleep" command. You
haven't checked the return value of sleep to handle any errors that may occur.
As it stands you have to check for errors from every single command executed
by your script.

That doesn't seem terribly practical to expect of users. As long as Postgres
is using SIGQUIT for its own communication, it really ought to arrange to
block the signal while the script is running so that it will receive the
signals it expects once the script ends.

Alternatively perhaps Postgres really ought to be using USR1/USR2 or other
signals that library routines won't think they have any business rearranging.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> However, it seems the signal wasn't sent at all.

Now that I think about it, the behavior of system() is predicated on the
assumption that SIGINT and SIGQUIT originate with the tty driver and are
broadcast to all members of the session's process group --- so the
called command will get them too, and there's no need for system() to
do anything except wait to see whether the called command dies or traps
the signal.

This does not apply to signals originated by the postmaster --- it
doesn't even know that the child process is doing a system(), much less
have any way to signal the grandchild.  Ugh.

Reimplementing system() seems pretty ugly, but maybe we have no choice.
It strikes me that system() has a race condition as defined anyway,
because if a signal arrives between blocking the handler and issuing the
fork(), it'll disappear into the ether; and the same at the end of the
routine.
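For illustration only, a bare-bones replacement might look roughly like this
(a sketch under the assumption that the caller inspects the returned wait
status itself; it deliberately leaves the parent's SIGINT/SIGQUIT handling
untouched, unlike system()):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int
    my_system(const char *command)
    {
        pid_t   pid = fork();

        if (pid < 0)
            return -1;                  /* fork failed */
        if (pid == 0)
        {
            /* child: run the command through the shell, as system() would */
            execl("/bin/sh", "sh", "-c", command, (char *) NULL);
            _exit(127);                 /* exec failed */
        }

        for (;;)
        {
            int     status;

            if (waitpid(pid, &status, 0) == pid)
                return status;          /* caller checks WIFSIGNALED etc. */
            if (errno != EINTR)
                return -1;
            /* EINTR: a signal arrived and our own handler has already run */
        }
    }
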
        regards, tom lane


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Fri, Nov 17, 2006 at 09:39:39PM -0500, Gregory Stark wrote:
> "Stephen Harris" <lists@spuddy.org> writes:
> >   [...variable setup...]
> >   while [ ! -f $wanted_file ]
> >   do
> >     if [ -f $abort_file ]
> >     then
> >       exit 1
> >     fi
> >     sleep 5
> >   done
> >   cat $wanted_file

> > I know signals can be deferred in scripts (a signal sent to the script during

> Sure, but it might be getting delivered to, say, your "sleep" command. You

No.  The sleep command keeps on running.  I could see that using "ps".

To the best of my knowledge, a random child process of the script wouldn't
even get a signal.  All the postmaster recovery thread knows about is the
system() - ie "sh -c".  All sh knows about is the ksh process.  Neither
postmaster nor sh knows about "sleep", and so "sleep" wouldn't receive the
signal (unless it was sent to all processes in the process group).

Here's an example from Solaris 10 demonstrating lack of signal propagation.

  $ uname -sr
  SunOS 5.10
  $ echo $0
  /bin/sh
  $ cat x
  #!/bin/ksh -p

  sleep 10000
  $ ./x &
  4622
  $ kill 4622
  $
  4622 Terminated
  $ ps -ef | grep sleep
      sweh  4624  4602   0 22:13:13 pts/1       0:00 grep sleep
      sweh  4623     1   0 22:13:04 pts/1       0:00 sleep 10000

This is, in fact, what proper "job control" shells do.  Doing the same
test with ksh as the command shell will kill the sleep :-)

  $ echo $0
  -ksh
  $ ./x &
  [1]     4632
  $ kill %1
  [1] + Terminated               ./x &
  $ ps -ef | grep sleep
      sweh  4635  4582   0 22:15:17 pts/1       0:00 grep sleep

[ Aside: The only way I've been able to guarantee all processes and child
  processes and everything to be killed is to run a subprocess with
  setsid() to create a new process group and kill the whole process group.
  It's a pain ]

If postmaster was sending a signal to the system() process then "sh -c"
might not signal the ksh script, anyway.  The ksh script might terminate,
or it might defer until sleep had finished.  Only if postmaster had
signalled a complete process group would sleep ever see the signal.

--

rgds
Stephen

Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Fri, Nov 17, 2006 at 10:49:39PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > However, it seems the signal wasn't sent at all.
> 
> Now that I think about it, the behavior of system() is predicated on the
> assumption that SIGINT and SIGQUIT originate with the tty driver and are
> broadcast to all members of the session's process group --- so the

> This does not apply to signals originated by the postmaster --- it
> doesn't even know that the child process is doing a system(), much less
> have any way to signal the grandchild.  Ugh.

Why not, after calling fork() create a new process group with setsid() and
then instead of killing the recovery thread, kill the whole process group
(-PID rather than PID)?  Then every process (the recovery thread, the
system, the script, any child of the script) will all receive the signal.

-- 

rgds
Stephen


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> On Fri, Nov 17, 2006 at 10:49:39PM -0500, Tom Lane wrote:
>> This does not apply to signals originated by the postmaster --- it
>> doesn't even know that the child process is doing a system(), much less
>> have any way to signal the grandchild.  Ugh.

> Why not, after calling fork() create a new process group with setsid() and
> then instead of killing the recovery thread, kill the whole process group
> (-PID rather than PID)?  Then every process (the recovery thread, the
> system, the script, any child of the script) will all receive the signal.

This seems like a good answer if setsid and/or setpgrp are universally
available.  I fear it won't work on Windows though :-(.  Also, each
backend would become its own process group leader --- does anyone know
if adding hundreds of process groups would slow down any popular
kernels?

[ thinks for a bit... ]  Another issue is that there'd be a race
condition during backend start: if the postmaster tries to kill -PID
before the backend has managed to execute setsid, it wouldn't work.
        regards, tom lane


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Sure, but it might be getting delivered to, say, your "sleep" command. You
> haven't checked the return value of sleep to handle any errors that may occur.
> As it stands you have to check for errors from every single command executed
> by your script.

The expectation is that something like SIGINT or SIGQUIT would be
delivered to both the sleep command and the shell process running the
script.  So the shell should fail anyway.  (Of course, a nontrivial
archive or recovery script had better be checking for failures at each
step, but this is not very relevant to the immediate problem.)

> Alternatively perhaps Postgres really ought to be using USR1/USR2 or other
> signals that library routines won't think they have any business rearranging.

The existing signal assignments were all picked for what seem to me
to be good reasons; I'm disinclined to change them.  In any case, the
important point here is that we'd really like an archive or recovery
script, or for that matter any command executed via system() from a
backend, to abort when the parent backend is SIGINT'd or SIGQUIT'd.

Stephen's idea of executing setsid() at each backend start seems
interesting, but is there a way that will work on Windows?

            regards, tom lane

Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Sure, but it might be getting delivered to, say, your "sleep" command. You
>> haven't checked the return value of sleep to handle any errors that may occur.
>> As it stands you have to check for errors from every single command executed
>> by your script.
>
> The expectation is that something like SIGINT or SIGQUIT would be
> delivered to both the sleep command and the shell process running the
> script.  So the shell should fail anyway.  (Of course, a nontrivial
> archive or recovery script had better be checking for failures at each
> step, but this is not very relevant to the immediate problem.)

Hm, I tried to test that before I sent that. But I guess my test was faulty
since I was really testing what process the terminal handling delivered the
signal to:


$ cat /tmp/test.sh
#!/bin/sh

echo before
sleep 5 || echo sleep failed
echo after

$ sh /tmp/test.sh ; echo $?
before
^\
/tmp/test.sh: line 4: 23407 Quit                    sleep 5
sleep failed
after
0


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Hm, I tried to test that before I sent that. But I guess my test was faulty
> since I was really testing what process the terminal handling delivered the
> signal to:

Interesting.  I tried the same test on HPUX, and find that its /bin/sh
seems to ignore SIGQUIT but not SIGINT:

$ sh /tmp/test.sh ; echo $?
before                    -- typed ^C here
130
$ sh /tmp/test.sh ; echo $?
before                    -- typed ^\ here
/tmp/test.sh[4]: 25166 Quit(coredump)
sleep failed
after
0
$ 

There is nothing in the shell man page about this :-(

That seems to leave us back at square one.  How can we ensure an archive
or recovery script will fail on being signaled?  (Obviously we can't
prevent someone from trapping the signal, but it'd be good if the
default behavior was this way.)
        regards, tom lane


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > Why not, after calling fork() create a new process group with setsid() and
> > then instead of killing the recovery thread, kill the whole process group
> > (-PID rather than PID)?  Then every process (the recovery thread, the
> > system, the script, any child of the script) will all receive the signal.
> 
> This seems like a good answer if setsid and/or setpgrp are universally
> available.  I fear it won't work on Windows though :-(.  Also, each

It's POSIX, so I would suppose it's standard on most modern *nix
platforms.  Windows... bluh.  I wonder how perl handles POSIX::setsid()
on Windows!

> backend would become its own process group leader --- does anyone know
> if adding hundreds of process groups would slow down any popular
> kernels?

Shouldn't hurt.  This is, after all, what using "&" in a command line
shell with job control (csh, ksh, tcsh, bash, zsh) does.  Because you only
run one archive or recovery thread at a time (which is very good and very
clever) you won't have too many process groups at any instant in time.

> [ thinks for a bit... ]  Another issue is that there'd be a race
> condition during backend start: if the postmaster tries to kill -PID
> before the backend has managed to execute setsid, it wouldn't work.

*ponder*  Bugger.  Standard solutions (eg try three times with a second
pause) would mitigate this, but....  Hmm.

Another idea is to make the shutdown be more co-operative under control
of the script; eg an exit code of 0 means the xlog is now available, an
exit code of 1 means the log is non-existent (so recovery is complete), and
an exit code of 255 means "failure to recover; perform database shutdown".
In this way a solution similar to the existing trigger files ("recovery
complete") could be used.  It's a little messy in that pg_ctl wouldn't
be used to shut down the database; the script would essentially tell
the recovery thread to abort, which would tell the main postmaster to
shutdown.  We'd have no clients connected, no child process running,
so a smart shutdown would work.
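
A sketch of how the recovery side might interpret such a convention (the
enum, helper name and exact code values are hypothetical, following the
proposal above rather than any existing PostgreSQL behaviour):

    #include <stdlib.h>
    #include <sys/wait.h>

    typedef enum { RESTORE_OK, RESTORE_END, RESTORE_ABORT } RestoreResult;

    /* Map the restore script's exit status onto the three proposed outcomes:
     * 0 = segment restored, 1 = no such segment (recovery complete),
     * 255 (or death by signal) = abort recovery and shut the standby down. */
    static RestoreResult
    interpret_restore(const char *cmd)
    {
        int rc = system(cmd);

        if (rc == -1 || WIFSIGNALED(rc))
            return RESTORE_ABORT;

        switch (WEXITSTATUS(rc))
        {
            case 0:
                return RESTORE_OK;
            case 1:
                return RESTORE_END;
            default:
                return RESTORE_ABORT;
        }
    }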

-- 

rgds
Stephen


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
>> [ thinks for a bit... ]  Another issue is that there'd be a race
>> condition during backend start: if the postmaster tries to kill -PID
>> before the backend has managed to execute setsid, it wouldn't work.

> *ponder*  Bugger.  Standard solutions (eg try three times with a second
> pause) would mitigate this, but....  Hmm.

I think we could make it work if the postmaster does this:
        kill(child_pid, SIGxxx);
    #ifdef HAVE_SETSID
        kill(-child_pid, SIGxxx);
    #endif

In the normal case where the child has already completed setsid(), the
extra signal sent to it should do no harm.  In the startup race
condition case, the second kill() accomplishes nothing, and we are
effectively assuming that the child will enable signals and respond to
the first kill before it launches any grandchildren; seems safe enough.

Of course the postmaster should only do this when sending signals that
are likely to be interpreted the same way by random shell scripts as
they are by backends, ie, SIGINT, SIGQUIT, SIGTERM (not, say, SIGUSR1).

This would fix the problem on Unixen with setsid, which should be pretty
much all of them; on any that don't have it, the behavior doesn't
change.  Still leaves Windows out in the cold, but I don't see that we
have anything much we can do about it there --- Windows users will just
have to know to kill the child script process instead of the postmaster.

Comments?
        regards, tom lane


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
"Magnus Hagander"
Date:
> This would fix the problem on Unixen with setsid, which
> should be pretty much all of them; on any that don't have it,
> the behavior doesn't change.  Still leaves Windows out in the
> cold, but I don't see that we have anything much we can do
> about it there --- Windows users will just have to know to
> kill the child script process instead of the postmaster.

I haven't really been following this thread. What exactly is it you
want to accomplish (maybe there is a different way to fix it on win32)?
Kill a process along with all its subprocesses?

//Magnus


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
"Magnus Hagander" <mha@sollentuna.net> writes:
> I haven't really been following this thread. What exactly is it you
> want to accomplish (maybe there is a different way to fix it on win32)?
> Kill a process along with all its subprocesses?

Right.  The problem (as seen on Unix) is that if the recovery subprocess
is waiting for a restore_command script to complete, it doesn't react
to SIGQUIT (because system() ignores SIGQUIT) and the script doesn't
react either (because it never gets the signal at all).  We'd like
"pg_ctl stop -m immediate" to abort the script and then have the
recovery process fail, allowing restart later.  More generally, it
seems like any time a backend is executing system(), the called script
ought to be subject to SIGINT or SIGQUIT signals sent to the backend
by the postmaster.
        regards, tom lane


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
"Magnus Hagander"
Date:
> > I haven't really been following this thread. What exactly is it you
> > want to accomplish (maybe there is a different way to fix it on win32)?
> > Kill a process along with all its subprocesses?
>
> Right.  The problem (as seen on Unix) is that if the recovery
> subprocess is waiting for a restore_command script to
> complete, it doesn't react to SIGQUIT (because system()
> ignores SIGQUIT) and the script doesn't react either (because
> it never gets the signal at all).  We'd like "pg_ctl stop -m
> immediate" to abort the script and then have the recovery
> process fail, allowing restart later.  More generally, it
> seems like any time a backend is executing system(), the
> called script ought to be subject to SIGINT or SIGQUIT
> signals sent to the backend by the postmaster.

Hm, ok.
Well, first of all, the generic case will never really work - because we
send our own kind of signals, and a script (a batch file executing under
cmd.exe) or anything non-postgresql-backend will never respond to those
anyway.

However, for this specific case we might be able to do something. pg_ctl
creates a job object that it assigns the postmaster to when it's started,
if that works. (It won't work if pg_ctl itself is running inside a job
object, but normally it won't be doing that.) That means that we could
have pg_ctl look up this job object and execute TerminateJobObject() on
it. That will kill all processes inside the job object (including
backends, but if we run it from pg_ctl *after* the postmaster has shut down,
there should be none of those left).
(This will work for the generic processes as well, but it will *only*
work for the pg_ctl-stops-the-process scenario; it can't be done from
inside the postmaster.)

It's not entirely pretty (to say the least), but might be worthwhile
looking at?
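
For reference, the Win32 calls involved would be roughly these (a hedged
sketch only; the job name, command line and the CREATE_SUSPENDED sequence
are illustrative, not what pg_ctl actually does):

    #include <windows.h>

    int
    main(void)
    {
        /* At startup: create a named job, start the postmaster suspended,
         * put it in the job, then let it run. */
        HANDLE job = CreateJobObjectA(NULL, "Global\\MyPostgresJob");
        STARTUPINFOA si = { sizeof(si) };
        PROCESS_INFORMATION pi;
        char cmd[] = "postgres.exe";
        HANDLE stop;

        if (job != NULL &&
            CreateProcessA(NULL, cmd, NULL, NULL, FALSE,
                           CREATE_SUSPENDED, NULL, NULL, &si, &pi))
        {
            AssignProcessToJobObject(job, pi.hProcess);
            ResumeThread(pi.hThread);
        }

        /* Later, at shutdown: open the same job and terminate everything in
         * it, which also sweeps up any still-running recovery scripts. */
        stop = OpenJobObjectA(JOB_OBJECT_TERMINATE, FALSE, "Global\\MyPostgresJob");
        if (stop != NULL)
            TerminateJobObject(stop, 1);

        return 0;
    }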

//Magnus


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
"Magnus Hagander" <mha@sollentuna.net> writes:
> However, for this specific case we might be able to do something. pg_ctl
> creates a job object that it assigns the postmaster to when it's started
> if that works. (Won't work if pg_ctl itself is running inside a job
> object, but normally it won't be doing that). That means that we could
> have pg_ctl look up this job object and execute TerminateJobObject() on
> it. That will kill all processes inside the job object (including
> backends, but if we run it from pg_ctl *after* postmaster has shut down,
> there should be none of those left).

That would take care of cleaning up any scripts after the postmaster has
already quit, but what about the problem of kicking the waiting backend
off the system() call in the first place?  I suppose that if we can hack
things so system() doesn't block SIGQUIT on Windows, it might be OK, at
least for the "-m immediate" case.  I was hoping we could get Query
Cancel to SIGINT a system()-invoked script in a regular backend, but
that might be too much to hope for on Windows, especially if it hasn't
even got a distinction between INT and QUIT ...
        regards, tom lane


Re: [GENERAL] Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
> >> [ thinks for a bit... ]  Another issue is that there'd be a race
> >> condition during backend start: if the postmaster tries to kill -PID
> >> before the backend has managed to execute setsid, it wouldn't work.
> 
> > *ponder*  Bugger.  Standard solutions (eg try three times with a second
> > pause) would mitigate this, but....  Hmm.
> 
> I think we could make it work if the postmaster does this:
> 
>         kill(child_pid, SIGxxx);
>     #ifdef HAVE_SETSID
>         kill(-child_pid, SIGxxx);
>     #endif

That should work.  I just created a dummy test case where a process forks
a child, the child does setsid and then forks 3 times, so 8 processes
all in the new process group.  The original parent sleeps for 3 seconds then
kills the original child (the process group leader) then sleeps 3 more seconds
and then kills the process group.  A simple sig handler shows the signal
being received by the remaining children.

Tested on RHEL4 (2.6.9-42.0.3.ELsmp) and Solaris 10 (Sparc) and got the
same results.

#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

void sig(int x)
{
  int y = getpid();
  printf("Process %d got signal\n", y);
  exit(0);
}

void do_child()
{
  setsid();
  fork(); fork(); fork();
  printf("%d sleeping for a long time!\n", (int)getpid());
  sleep(10000);
}

int main()
{
  signal(SIGQUIT, &sig);
  signal(SIGCLD, SIG_IGN);
  int ch = fork();

  if (ch == 0)
  {
    do_child();
  }
  else
  {
    int e;
    sleep(3);
    printf("Killing first child\n");
    printf("Return is: %d\n", e = kill(ch, SIGQUIT));
    if (e) { perror("Bad kill"); }
    sleep(3);
    printf("Killing process group\n");
    printf("Return is: %d\n", e = kill(-ch, SIGQUIT));
    if (e) { perror("Bad kill"); }
  }

  exit(0);
}

-- 

rgds
Stephen


Re: Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:
> 
>         kill(child_pid, SIGxxx);
>     #ifdef HAVE_SETSID
>         kill(-child_pid, SIGxxx);
>     #endif
> 
> In the normal case where the child has already completed setsid(), the
> extra signal sent to it should do no harm.  In the startup race

Hmm.  It looks like something more than this may be needed.  The postgres
recovery process appears to be ignoring it.   I ran the whole database
in its own process group (ksh runs processes in their own process group
by default, so pg_ctl became the session leader and so everything under
pg_ctl all stayed in that process group).

% ps -o pid,ppid,pgid,args -g 29141 | sort
  PID  PPID  PGID COMMAND
29145     1 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29146 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29147 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29501 29147 29141 sh -c /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYXLOG
29502 29501 29141 /bin/ksh -p /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYX
29537 29502 29141 sleep 5

I did
kill -QUIT -29141 ; sleep 1 ; touch /export/home/swharris/archives/STOP_SWEH_RECOVERY

This sent the QUIT signal to all those processes.  The shell script ignores
it and so tries to start again, so the 'touch' command tells it to exit(1)
rather than loop again.

The log file (the timestamp entries are from my 'rr' program so I
can see what it's doing)...

To start with we see a normal recovery:
 Wed Nov 22 13:41:20 EST 2006: Attempting to restore 000000010000000100000056
 Wed Nov 22 13:41:25 EST 2006: Finished 000000010000000100000056
 LOG:  restored log file "000000010000000100000056" from archive
 Wed Nov 22 13:41:25 EST 2006: Attempting to restore 000000010000000100000057
 Wed Nov 22 13:41:25 EST 2006: Waiting for file to become available


Now I send the kill signal...
 LOG:  received immediate shutdown request

We can see that the sleep process got it!
 /export/home/swharris/rr[37]: 29537 Quit(coredump)
And my script detects the trigger file:
 Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found

Now database recovery appears to continue as normal; the postgres
recovery processes are still running, despite having received SIGQUIT
 LOG:  could not open file "pg_xlog/000000010000000100000057" (log file 1, segment 87): No such file or directory
 LOG:  redo done at 1/56000070
 Wed Nov 22 13:43:51 EST 2006: Attempting to restore 000000010000000100000056
 Wed Nov 22 13:43:55 EST 2006: Finished 000000010000000100000056
 LOG:  restored log file "000000010000000100000056" from archive
 LOG:  archive recovery complete
 LOG:  database system is ready
 LOG:  logger shutting down


pg_xlog now contains 000000010000000100000056 and 000000010000000100000057

A similar sort of thing happens if I use SIGTERM rather than SIGQUIT

I'm out of here in an hour, so for all you US based people, have a good
Thanksgiving holiday!

-- 

rgds
Stephen


Re: Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> We can see that the sleep process got it!
>   /export/home/swharris/rr[37]: 29537 Quit(coredump)
> And my script detects the trigger file
>   Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found

> Now database recovery appears to continue as normal; the postgres
> recovery processes are still running, despite having received SIGQUIT

Well, sure, because system() will have ignored the SIGQUIT.  You have to
make the shell script return an abort code rather than exit(1) or
whatever it's doing in this case.  (I've tested that CVS HEAD does the
right thing when the script is just a straight invocation of cp.  But
if the script is trapping errors, it's going to have to cooperate.)
        regards, tom lane


Re: Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Wed, Nov 22, 2006 at 03:03:03PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > We can see that the sleep process got it!
> >   /export/home/swharris/rr[37]: 29537 Quit(coredump)
> > And my script detects the trigger file
> >   Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found
> 
> > Now database recovery appears to continue as normal; the postgres
> > recovery processes are still running, despite having received SIGQUIT
> 
> Well, sure, because system() will have ignored the SIGQUIT.  You have to
> make the shell script return an abort code rather than exit(1) or
> whatever it's doing in this case.  (I've tested that CVS HEAD does the

Hmm, on Linux I did a kill -QUIT on the process group and then rather
than cleanly exiting my recovery script I "kill -9"d it.  The database
still completed recovery as previously described.  

> right thing when the script is just a straight invocation of cp.  But
> if the script is trapping errors, it's going to have to cooperate.)

The script doesn't trap errors; ksh itself plays with SIGQUIT.

I guess I should rewrite my script in perl and see what happens.

-- 

rgds
Stephen


Re: Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Thu, Nov 23, 2006 at 01:18:06PM -0500, Stephen Harris wrote:
> Hmm, on Linux I did a kill -QUIT on the process group and then rather
> than cleanly exiting my recovery script I "kill -9"d it.  The database
> still completed recovery as previously described.  

OK, I just did the same tests on the CVS tree, rather than beta3 and
I see different results.  This time recovery aborts with a PANIC.  I
did a "pg_ctl stop -m i" and then kill -9'd the recovery script and:

LOG:  restored log file "000000010000000000000006" from archive
Looking for 000000010000000000000007
LOG:  received immediate shutdown request
/tmp/rr: line 12: 22031 Quit                    sleep 5
PANIC:  could not restore file "000000010000000000000007" from archive: return code 9

Similar behaviour was seen with "smart" and "fast" shutdowns.

Looking for 000000010000000000000005
Restoring 000000010000000000000005 to pg_xlog/RECOVERYXLOG
LOG:  restored log file "000000010000000000000005" from archive
LOG:  checkpoint record is at 0/5000020
LOG:  redo record is at 0/5000020; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/601; next OID: 24577
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  automatic recovery in progress
LOG:  redo starts at 0/5000068
Looking for 000000010000000000000006
Restoring 000000010000000000000006 to pg_xlog/RECOVERYXLOG
LOG:  restored log file "000000010000000000000006" from archive
Looking for 000000010000000000000007
LOG:  received fast shutdown request
PANIC:  could not restore file "000000010000000000000007" from archive: return code 9
LOG:  startup process (PID 22246) was terminated by signal 6
LOG:  aborting startup due to startup process failure


Restarting the database appears to go back to asking for archive log 5
again.  I'll have to do some testing with more archives to see how far back
we have to go.

-- 

rgds
Stephen


Re: Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> OK, I just did the same tests on the CVS tree, rather than beta3 and
> I see different results.  This time recovery aborts with a PANIC.

Yeah, that's expected since the whole recovery process is a critical
section.  We could change that but it's cosmetic.

> Restarting the database appears to go back to asking for archive log 5
> again.

It would back up to wherever the last recovery checkpoint was.  I think
we've got it hardwired at one recovery checkpoint per 100 source
checkpoints ... maybe that's too far apart.
        regards, tom lane


Re: Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:

> It would back up to wherever the last recovery checkpoint was.  I think
> we've got it hardwired at one recovery checkpoint per 100 source
> checkpoints ... maybe that's too far apart.

I'm talking out of my arse now, since I have no idea of the internals
of postgres recovery, but maybe a recovery checkpoint could be taken each
time it asks for a new archive file, or every 100 source checkpoints,
whichever comes first.

A database that is mostly idle (eg overnight) but has 1 or 2 critical
transactions will have received lots of archive logs (from the 5 minute
checkpoint timeout) but not necessarily enough to cause recovery to
checkpoint.  This could mean needing to keep a large number of shipped
logfiles available on the standby system "just in case".

-- 

rgds
Stephen


Re: Shutting down a warm standby database in 8.2beta3

From
Stephen Harris
Date:
On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > OK, I just did the same tests on the CVS tree, rather than beta3 and
> > I see different results.  This time recovery aborts with a PANIC.
> 
> Yeah, that's expected since the whole recovery process is a critical
> section.  We could change that but it's cosmetic.

Should this cause a coredump when it happens?  With rc1 whenever I
do a shutdown I get a core file

core:           ELF 32-bit MSB core file SPARC Version 1, from 'postgres'

(gdb) bt
#0  0xff0a02d4 in _libc_kill () from /usr/lib/libc.so.1
#1  0xff035a5c in abort () from /usr/lib/libc.so.1
#2  0x0020cd50 in errfinish ()
#3  0x0006d258 in RestoreArchivedFile ()

-- 

rgds
Stephen


Re: Shutting down a warm standby database in

From
Bruce Momjian
Date:
Stephen Harris wrote:
> On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
> > Stephen Harris <lists@spuddy.org> writes:
> > > OK, I just did the same tests on the CVS tree, rather than beta3 and
> > > I see different results.  This time recovery aborts with a PANIC.
> > 
> > Yeah, that's expected since the whole recovery process is a critical
> > section.  We could change that but it's cosmetic.
> 
> Should this cause a coredump when it happens?  With rc1 whenever I
> do a shutdown I get a core file
> 
> core:           ELF 32-bit MSB core file SPARC Version 1, from 'postgres'
> 
> (gdb) bt
> #0  0xff0a02d4 in _libc_kill () from /usr/lib/libc.so.1
> #1  0xff035a5c in abort () from /usr/lib/libc.so.1
> #2  0x0020cd50 in errfinish ()
> #3  0x0006d258 in RestoreArchivedFile ()

You should never get a core file.  I think we need the binary compiled
with symbols so we can get more information from this backtrace.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Shutting down a warm standby database in 8.2beta3

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Stephen Harris wrote:
>> Should this cause a coredump when it happens?

> You should never get a core file.

elog(PANIC) => abort() => core dump.  This is completely expected.
        regards, tom lane


Re: Shutting down a warm standby database in

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Stephen Harris wrote:
> >> Should this cause a coredump when it happens?
> 
> > You should never get a core file.
> 
> elog(PANIC) => abort() => core dump.  This is completely expected.

Well, then I should have said there is no reason you should ever get a
panic.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Shutting down a warm standby database in

From
Stephen Harris
Date:
On Thu, Nov 30, 2006 at 09:58:52AM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Stephen Harris wrote:
> > >> Should this cause a coredump when it happens?
> > 
> > > You should never get a core file.
> > 
> > elog(PANIC) => abort() => core dump.  This is completely expected.
> 
> Well, then I should have said there is no reason you should ever get a
> panic.

Well, as Tom said earlier in the thread:

   > I see different results.  This time recovery aborts with a PANIC.

   Yeah, that's expected since the whole recovery process is a critical
   section.  We could change that but it's cosmetic.


Because of the changes made, we don't actually need to do a database
shutdown.  Merely killing the restore_command process (eg with kill -9)
appears to cause the same effect.

And now a personal opinion...

I think this is more than cosmetic; shutting down a standby database
cleanly is critical functionality for proper warm-standby procedures.
What we have now "works", but should be tidied up.  Probably low on the
priority queue though :-)

-- 

rgds
Stephen


Re: Shutting down a warm standby database in

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> elog(PANIC) => abort() => core dump.  This is completely expected.

> Well, then I should have said there is no reason you should ever get a
> panic.

The reason it's a PANIC is that the whole of StartupXLOG() runs as a
critical section.  Given the recent changes to make elog(FATAL) do
exit(1) instead of exit(0), I think it might be possible to dispense
with that, but it's rather a ticklish thing to be changing the day
before we wrap 8.2.  How excited are you about this?
        regards, tom lane


Re: Shutting down a warm standby database in

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> I think this is more than cosmetic; shutting down a standby database
> cleanly is critical functionality for proper warm-standby procedures.
> What we have now "works", but should be tidied up.

I've made some small adjustments that make "pg_ctl stop -m fast" work
for shutting down a warm-standby server.  Please test.
        regards, tom lane


Re: Shutting down a warm standby database in

From
Stephen Harris
Date:
On Thu, Nov 30, 2006 at 01:32:31PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > I think this is more than cosmetic; shutting down a standby database
> > cleanly is critical functionality for proper warm-standby procedures.
> > What we have now "works", but should be tidied up.
> 
> I've made some small adjustments that make "pg_ctl stop -m fast" work
> for shutting down a warm-standby server.  Please test.

That's a lot cleaner.  "fast" is good because of SIGTERM, which the
shell script will also respond to even while waiting on another process.


LOG:  restored log file "00000001000000000000003E" from archive
Thu Nov 30 15:45:56 EST 2006: Attempting to restore 00000001000000000000003F
Thu Nov 30 15:45:58 EST 2006: Finished 00000001000000000000003F
LOG:  restored log file "00000001000000000000003F" from archive
Thu Nov 30 15:45:59 EST 2006: Attempting to restore 000000010000000000000040
Thu Nov 30 15:45:59 EST 2006: Waiting for file to become available
LOG:  received fast shutdown request
Terminated
FATAL:  could not restore file "000000010000000000000040" from archive: return code 36608
LOG:  startup process (PID 10730) exited with exit code 1
LOG:  aborting startup due to startup process failure
LOG:  logger shutting down

(the datestamp entries are from my script)

Starting up the standby database still goes back to earlier log files,
but I guess that's the 100 checkpoint thing you mentioned earlier.

The change you just made is very good and very much tidier; no core files,
no panics, cleans up children nicely.  Good stuff!


-- 

rgds
Stephen


Re: Shutting down a warm standby database in

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> Starting up the standby database still goes back to earlier log files,
> but I guess that's the 100 checkpoint thing you mentioned earlier.

Actually I was misremembering that: the frequency of recovery
checkpoints is time-based, and for a slave that's in sync with its
master, they should occur about as often as checkpoints on the master.
So the amount of log to be replayed should be comparable to what the
master would have to replay if it crashed.
        regards, tom lane


Re: Shutting down a warm standby database in

From
Stephen Harris
Date:
On Thu, Nov 30, 2006 at 03:59:15PM -0500, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> > Starting up the standby database still goes back to earlier log files,
> > but I guess that's the 100 checkpoint thing you mentioned earlier.
> 
> Actually I was misremembering that: the frequency of recovery
> checkpoints is time-based, and for a slave that's in sync with its
> master, they should occur about as often as checkpoints on the master.
> So the amount of log to be replayed should be comparable to what the
> master would have to replay if it crashed.

Hmm.  For testing purposes I have 
archive_timeout=5

so an archive log is created every 5 seconds if there's activity.
In idle periods it's created every 5 minutes as the internal checkpoint
occurs on the server.  However if I replay all those "empty" logs then
shutdown then startup it will ask for some of them again.

As a test, I started a recovery, recovered 11 archive logs
(000000010000000000000035 to 00000001000000000000003F), some may have had
activity; I'm not sure.  I then waited 10 minutes after the last log file
had been replayed.  Then I did a fast shutdown.  Starting the standby
database again required it going back to 000000010000000000000035 again.

-- 

rgds
Stephen


Re: Shutting down a warm standby database in

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> On Thu, Nov 30, 2006 at 03:59:15PM -0500, Tom Lane wrote:
>> Actually I was misremembering that: the frequency of recovery
>> checkpoints is time-based, and for a slave that's in sync with its
>> master, they should occur about as often as checkpoints on the master.
>> So the amount of log to be replayed should be comparable to what the
>> master would have to replay if it crashed.

> As a test, I started a recover, recovered 11 archive logs
> (000000010000000000000035 to 00000001000000000000003F), some may have had
> activity; I not sure.  I then waited 10 minutes after the last log file
> had been replayed.  Then I did a fast shutdown.  Starting the standby
> database again required it going back to 000000010000000000000035 again.

That case is not what I described --- you weren't tracking a live master
that's generating checkpoints.
        regards, tom lane


Re: Shutting down a warm standby database in 8.2beta3

From
"Simon Riggs"
Date:
On Fri, 2006-11-24 at 07:20 -0500, Stephen Harris wrote:
> On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
> 
> > It would back up to wherever the last recovery checkpoint was.  I think
> > we've got it hardwired at one recovery checkpoint per 100 source
> > checkpoints ... maybe that's too far apart.
> 
> I'm talking out of my arse now, since I have no ideas of the internals
> of postgres recovery, but maybe a recovery checkpoint can be added each
> time it asks for a new archive file, or 100 source checkpoints, whatever
> comes first.

We can take a restorepoint during replay only when we come across a
checkpoint WAL record. We take the first checkpoint WAL record we come
across after checkpoint_timeout seconds have expired, not 100 as
described.

History: The 100 checkpoints per restorepoint idea was the way it was
originally following the discussions at Anniversary. That was replaced
prior to code commit with the concept of a minimum delay between
restorepoints equal to the checkpoint_timeout.

> A database that is mostly idle (eg overnight) but has 1 or 2 critical
> transactions will have received lots of archive logs (from the 5 minute
> checkpoint timeout) but not necessarily enough to cause recovery to
> checkpoint.  This could mean needing to keep a large number of shipped
> logfiles available on the standby system "just in case".

It doesn't work that way. There is only ever one WAL file being worked
upon at a time, so there is no build up of log files in the case you
mention.

The case you mention would be captured by the current code, since after
5 minutes worth of WAL replay we would attempt a restorepoint at the
next checkpoint record. If replay goes at exactly the same speed as the
original WAL writes, then there would be one restorepoint per original
checkpoint. If replay goes 10 times faster then there would be one
restorepoint for each 10 checkpoints, etc.. So the restorepoint is taken
every 5 minutes in wall-clock time during replay, whereas that might be
anywhere from 5 minutes upwards in original-WAL-write-time, possibly
hours or days if you aren't writing any WAL at all for long periods. But
note that will not translate into many WAL files.

If the de-archive time is significant then the average rate of WAL apply
will go down and so the number of restorepoints per checkpoint will
fall. So overall, I can't see any need to change the code, but I'm
listening for different opinions and Stephen's especially has been very
good to have, thanks.

Humble apologies to all for my late arrival on this thread. I missed the
start of it during a busy period last week and didn't notice subsequent
postings. I'll try to filter things better in future since I'm clearly
not managing to read all postings. Please feel free to include me into
any threads.

If there's anything that people think I should be working on in support
of this, please say and I'll prioritise that now ahead of my docs and
advocacy activities around the release.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com