Re: Missing important information in backup.sgml - Mailing list pgsql-docs

From Stephen Frost
Subject Re: Missing important information in backup.sgml
Date
Msg-id 20161129134323.GO13284@tamriel.snowman.net
Whole thread Raw
In response to Re: Missing important information in backup.sgml  ("Gunnar \"Nick\" Bluth" <gunnar.bluth@pro-open.de>)
Responses Re: Missing important information in backup.sgml  ("Gunnar \"Nick\" Bluth" <gunnar.bluth@pro-open.de>)
List pgsql-docs
Greetings,

* Gunnar "Nick" Bluth (gunnar.bluth@pro-open.de) wrote:
> Am 23.11.2016 um 21:41 schrieb Stephen Frost:
> > * Gunnar "Nick" Bluth (gunnar.bluth@pro-open.de) wrote:
> > One of the very important things that should be done as part of a backup
> > is to ensure that all of the archive files required to restore the
> > database to a consistent state are safely stored in the archive.  If
> > that isn't done then it's possible that an incomplete archive may also
> > render backups invalid.
>
> Well, the need to have a complete archive is described in the docs
> already. Maybe the potential consequences of an incomplete archive
> should be pointed or more drastically...?

We should probably add to "step 5" something about "verify that all WAL
files have been archived between the start and stop backup" or similar.

> Now, the main purpose of my patch was to document a behaviour that many
> of us have run into, namely that FATAL error showing up in the log when
> the archive_command exits with RC > 127. It's a nuisance only, but it
> does send people on false tracks and should at least be mentioned in the
> documentation.

I agree that we should add information to the documentation that certain
error codes are handled differently.

> And since a couple of people does use rsync (or some wrappers around it)
> for archiving, and that is notoriously giving RCs > 127, it seems legit
> to at least mention it, no?

The low-level API documentation should be focused on the API and not how
to (mis)use the API using common unix commands.

> diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
> index 6eaed1e..a8f574e 100644
> --- a/doc/src/sgml/backup.sgml
> +++ b/doc/src/sgml/backup.sgml
> @@ -587,7 +587,8 @@ tar -cf backup.tar /usr/local/pgsql/data
>      the administrator specify a shell command to be executed to copy a
>      completed segment file to wherever it needs to go.  The command could be
>      as simple as a <literal>cp</>, or it could invoke a complex shell
> -    script — it's all up to you.
> +    script — it's all up to you. There however are some things to consider
> +    when creating such a command, most of which are covered below.
>     </para>

If we're changing this then we should remove the notion that it could be
"as simple as a cp" because it really should *not* be.  That is a
dangerous and very poor recommendation to be making to our users.

>     <para>
> @@ -636,7 +637,11 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>      <productname>PostgreSQL</> will assume that the file has been
>      successfully archived, and will remove or recycle it.  However, a nonzero
>      status tells <productname>PostgreSQL</> that the file was not archived;
> -    it will try again periodically until it succeeds.
> +    it will try again periodically until it succeeds. Note that an exit
> +    status of 128 or higher will cause the archiver to exit, resulting in a
> +    <literal>FATAL</> error in the server log. It will be restarted by the
> +    postmaster and continue where it left. E.g., <command>rsync</> is known
> +    for returning exit statuses of 255 on network issues.
>     </para>

I agree with adding documentation about what happens with different exit
status values.  I don't believe we should make any mention of rsync.  A
sophisticated enough user to understand exit codes should be able to
work out what the exit code is for whatever tool or tools they're using,
it's not on us to document what the exit codes are for every possible
tool (nor should one even use something as simple as a bare cp or rsync
in archive_command).

>     <para>
> @@ -696,6 +701,16 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>      preserve the file name (<literal>%f</>).
>     </para>
>
> +  <para>
> +    Depending on your specific requirements (and datacenter layout), you may
> +    want to make sure that the archived WAL segments have been written out to
> +    persistent storage before the <command>archive_command</> returns.
> +    Otherwise, a WAL segment that is assumed by <productname>PostgreSQL</>
> +    to be archived could be recycled or removed prematurely, rendering your
> +    archive incomplete (and thus disabling a recovery) in case of an outage of
> +    the archiving destination.
> +  </para>

I dislike the tone of this.  The wording I would suggest would be more
along the lines of:

The archive_command should only return success once the WAL segment has
been completely copied, written out, and synced to persistent storage as
PostgreSQL will recycle or remove the segment shortly after the
archive_command returns.  If the WAL segment is lost due to an outage of
the archive server or other issue, any backup which was performed during
the time that the WAL segment was written will be unusable and
Point-In-Time-Recovery from an earlier backup will not be usable past
the missing WAL segment.

Thanks!

Stephen

Attachment

pgsql-docs by date:

Previous
From: "Gunnar \"Nick\" Bluth"
Date:
Subject: Re: Missing important information in backup.sgml
Next
From: "Gunnar \"Nick\" Bluth"
Date:
Subject: Re: Missing important information in backup.sgml