Re: WAL Restore process during recovery - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: WAL Restore process during recovery
Date
Msg-id CA+U5nM+fz+OnWbLePcPn-3LP1HjL19Kx2nO8u+k-Uyiy=H8f4w@mail.gmail.com
Whole thread Raw
In response to Re: WAL Restore process during recovery  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: WAL Restore process during recovery
List pgsql-hackers
On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

> In StartChildProcess(), the code which emits an error when fork of walrestore
> fails is required.

OK

> In reaper(), the following comment needs to be updated because an unexpected
> exit (including FATAL) is treated as a crash in the patch.
>
>                /*
>                 * Was it the wal restore?  If exit status is zero (normal) or one
>                 * (FATAL exit), we assume everything is all right just like normal
>                 * backends.
>                 */
>                if (pid == WalRestorePID)

OK

> Why does walrestore need to be invoked even when restore_command is
> not specified? It seems to be useless. We invoke walreceiver only when
> primary_conninfo is specified now. Similarly we should invoke walrestore
> only when restore_command is specified?

walreceiver is shutdown and restarted in case of failed connection.
That never happens with walrestore because the command is run each
time - when we issue system(3) a new process is forked to run the
command. So there is no specific cleanup to perform and so no reason
for a managed cleanup process.

So I can't see a specific reason to change that. Do you think it makes
a difference?


> When I set up the file-based log-shipping environment using pg_standby,
> ran "pgbench -i -s2", waited for walrestore to restore at least one WAL
> file, and created the trigger file, then I encounterd the following error in
> the standby.
>
>    sby LOG:  startup process requests 000000010000000000000003 from archive
>    trigger file found: smart failover
>    sby LOG:  startup process sees last file was 000000010000000000000003
>    sby FATAL:  could not rename file "pg_xlog/RECOVERYXLOG" to
> "pg_xlog/000000010000000000000003": No such file or directory
>    sby LOG:  startup process (PID 11079) exited with exit code 1
>    sby LOG:  terminating any other active server processes

Will look further.

> When I set up streaming replication with setting restore_command,
> I got the following messages repeatedly. The WAL file name was always
> "000000000000000000000000".

Will look further.

>    sby1 LOG:  walrestore checking for next file to restore
>    sby1 LOG:  restore of 000000000000000000000000 is already complete, so sleep

Will look further.

> In PostmasterStateMachine(), the following comment needs to mention WALRestore.
>
>                 * PM_WAIT_READONLY state ends when we have no regular backends that
>                 * have been started during recovery.  We kill the startup and
>                 * walreceiver processes and transition to PM_WAIT_BACKENDS.  Ideally,

OK

> In walrestore.c, the following comments seem to be incorrect. At least
> an unexpected
> exit of WALRestore doesn't start a recovery cycle in the patch.
>
>     * If the WAL restore exits unexpectedly, the postmaster treats
> that the same
>     * as a backend crash: shared memory may be corrupted, so remaining backends
>     * should be killed by SIGQUIT and then a recovery cycle started.

Yes it does...

> In walrestore.c
> + * Main entry point for walrestore process
> + *
> + * This is invoked from BootstrapMain, which has already created the basic
> + * execution environment, but not enabled signals yet.
>
> BootstrapMain() doesn't exist, and it should be changed to
> AuxiliaryProcessMain().
> This is not a fault of the patch. There are the same typos in
> bgwriter.c, walwriter.c
> and checkpointer.c

OK, will fix.

> In walrestore.c
> +        * SIGUSR1 is presently unused; keep it spare in case someday we want this
> +        * process to participate in ProcSignal signalling.
>
> The above comment is incorrect because SIGUSR1 is presently used.

OK

> +                       /*
> +                        * From here on, elog(ERROR) should end with exit(1), not send
> +                        * control back to the sigsetjmp block above
> +                        */
> +                       ExitOnAnyError = true;
>
> The above is not required because sigsetjmp is not used in walrestore.c

OK

> +                       /* Normal exit from the walwriter is here */
> +                       proc_exit(0);           /* done */
>
> Typo: s/walwriter/walrestore

OK


Cleaned up the points noted, new patch attached in case you wish to
review further.

Still has bug, so still with me to fix.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Inline Extension
Next
From: Dimitri Fontaine
Date:
Subject: Re: Inline Extension