Re: WAL Restore process during recovery - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: WAL Restore process during recovery |
Date | |
Msg-id | CA+U5nM+fz+OnWbLePcPn-3LP1HjL19Kx2nO8u+k-Uyiy=H8f4w@mail.gmail.com Whole thread Raw |
In response to | Re: WAL Restore process during recovery (Fujii Masao <masao.fujii@gmail.com>) |
Responses |
Re: WAL Restore process during recovery
|
List | pgsql-hackers |
On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > In StartChildProcess(), the code which emits an error when fork of walrestore > fails is required. OK > In reaper(), the following comment needs to be updated because an unexpected > exit (including FATAL) is treated as a crash in the patch. > > /* > * Was it the wal restore? If exit status is zero (normal) or one > * (FATAL exit), we assume everything is all right just like normal > * backends. > */ > if (pid == WalRestorePID) OK > Why does walrestore need to be invoked even when restore_command is > not specified? It seems to be useless. We invoke walreceiver only when > primary_conninfo is specified now. Similarly we should invoke walrestore > only when restore_command is specified? walreceiver is shutdown and restarted in case of failed connection. That never happens with walrestore because the command is run each time - when we issue system(3) a new process is forked to run the command. So there is no specific cleanup to perform and so no reason for a managed cleanup process. So I can't see a specific reason to change that. Do you think it makes a difference? > When I set up the file-based log-shipping environment using pg_standby, > ran "pgbench -i -s2", waited for walrestore to restore at least one WAL > file, and created the trigger file, then I encounterd the following error in > the standby. > > sby LOG: startup process requests 000000010000000000000003 from archive > trigger file found: smart failover > sby LOG: startup process sees last file was 000000010000000000000003 > sby FATAL: could not rename file "pg_xlog/RECOVERYXLOG" to > "pg_xlog/000000010000000000000003": No such file or directory > sby LOG: startup process (PID 11079) exited with exit code 1 > sby LOG: terminating any other active server processes Will look further. > When I set up streaming replication with setting restore_command, > I got the following messages repeatedly. The WAL file name was always > "000000000000000000000000". Will look further. > sby1 LOG: walrestore checking for next file to restore > sby1 LOG: restore of 000000000000000000000000 is already complete, so sleep Will look further. > In PostmasterStateMachine(), the following comment needs to mention WALRestore. > > * PM_WAIT_READONLY state ends when we have no regular backends that > * have been started during recovery. We kill the startup and > * walreceiver processes and transition to PM_WAIT_BACKENDS. Ideally, OK > In walrestore.c, the following comments seem to be incorrect. At least > an unexpected > exit of WALRestore doesn't start a recovery cycle in the patch. > > * If the WAL restore exits unexpectedly, the postmaster treats > that the same > * as a backend crash: shared memory may be corrupted, so remaining backends > * should be killed by SIGQUIT and then a recovery cycle started. Yes it does... > In walrestore.c > + * Main entry point for walrestore process > + * > + * This is invoked from BootstrapMain, which has already created the basic > + * execution environment, but not enabled signals yet. > > BootstrapMain() doesn't exist, and it should be changed to > AuxiliaryProcessMain(). > This is not a fault of the patch. There are the same typos in > bgwriter.c, walwriter.c > and checkpointer.c OK, will fix. > In walrestore.c > + * SIGUSR1 is presently unused; keep it spare in case someday we want this > + * process to participate in ProcSignal signalling. > > The above comment is incorrect because SIGUSR1 is presently used. OK > + /* > + * From here on, elog(ERROR) should end with exit(1), not send > + * control back to the sigsetjmp block above > + */ > + ExitOnAnyError = true; > > The above is not required because sigsetjmp is not used in walrestore.c OK > + /* Normal exit from the walwriter is here */ > + proc_exit(0); /* done */ > > Typo: s/walwriter/walrestore OK Cleaned up the points noted, new patch attached in case you wish to review further. Still has bug, so still with me to fix. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
pgsql-hackers by date: