Re: Missing important information in backup.sgml - Mailing list pgsql-docs

From Gunnar \"Nick\" Bluth
Subject Re: Missing important information in backup.sgml
Date
Msg-id d1b49a36-6e90-d8c6-6242-66d26ae32b6f@pro-open.de
Whole thread Raw
In response to Re: Missing important information in backup.sgml  (Stephen Frost <sfrost@snowman.net>)
List pgsql-docs
Am 29.11.2016 um 14:43 schrieb Stephen Frost:
> Greetings,
>
> * Gunnar "Nick" Bluth (gunnar.bluth@pro-open.de) wrote:
>> Am 23.11.2016 um 21:41 schrieb Stephen Frost:
>>> * Gunnar "Nick" Bluth (gunnar.bluth@pro-open.de) wrote:
>>> One of the very important things that should be done as part of a backup
>>> is to ensure that all of the archive files required to restore the
>>> database to a consistent state are safely stored in the archive.  If
>>> that isn't done then it's possible that an incomplete archive may also
>>> render backups invalid.
>>
>> Well, the need to have a complete archive is described in the docs
>> already. Maybe the potential consequences of an incomplete archive
>> should be pointed or more drastically...?
>
> We should probably add to "step 5" something about "verify that all WAL
> files have been archived between the start and stop backup" or similar.

+1

>> Now, the main purpose of my patch was to document a behaviour that many
>> of us have run into, namely that FATAL error showing up in the log when
>> the archive_command exits with RC > 127. It's a nuisance only, but it
>> does send people on false tracks and should at least be mentioned in the
>> documentation.
>
> I agree that we should add information to the documentation that certain
> error codes are handled differently.
>
>> And since a couple of people does use rsync (or some wrappers around it)
>> for archiving, and that is notoriously giving RCs > 127, it seems legit
>> to at least mention it, no?
>
> The low-level API documentation should be focused on the API and not how
> to (mis)use the API using common unix commands.
>
>> diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
>> index 6eaed1e..a8f574e 100644
>> --- a/doc/src/sgml/backup.sgml
>> +++ b/doc/src/sgml/backup.sgml
>> @@ -587,7 +587,8 @@ tar -cf backup.tar /usr/local/pgsql/data
>>      the administrator specify a shell command to be executed to copy a
>>      completed segment file to wherever it needs to go.  The command could be
>>      as simple as a <literal>cp</>, or it could invoke a complex shell
>> -    script — it's all up to you.
>> +    script — it's all up to you. There however are some things to consider
>> +    when creating such a command, most of which are covered below.
>>     </para>
>
> If we're changing this then we should remove the notion that it could be
> "as simple as a cp" because it really should *not* be.  That is a
> dangerous and very poor recommendation to be making to our users.

That thought creeped on me as well... ;-)

>>     <para>
>> @@ -636,7 +637,11 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>>      <productname>PostgreSQL</> will assume that the file has been
>>      successfully archived, and will remove or recycle it.  However, a nonzero
>>      status tells <productname>PostgreSQL</> that the file was not archived;
>> -    it will try again periodically until it succeeds.
>> +    it will try again periodically until it succeeds. Note that an exit
>> +    status of 128 or higher will cause the archiver to exit, resulting in a
>> +    <literal>FATAL</> error in the server log. It will be restarted by the
>> +    postmaster and continue where it left. E.g., <command>rsync</> is known
>> +    for returning exit statuses of 255 on network issues.
>>     </para>
>
> I agree with adding documentation about what happens with different exit
> status values.  I don't believe we should make any mention of rsync.  A
> sophisticated enough user to understand exit codes should be able to
> work out what the exit code is for whatever tool or tools they're using,
> it's not on us to document what the exit codes are for every possible
> tool (nor should one even use something as simple as a bare cp or rsync
> in archive_command).

Fair enough.

>
>>     <para>
>> @@ -696,6 +701,16 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>>      preserve the file name (<literal>%f</>).
>>     </para>
>>
>> +  <para>
>> +    Depending on your specific requirements (and datacenter layout), you may
>> +    want to make sure that the archived WAL segments have been written out to
>> +    persistent storage before the <command>archive_command</> returns.
>> +    Otherwise, a WAL segment that is assumed by <productname>PostgreSQL</>
>> +    to be archived could be recycled or removed prematurely, rendering your
>> +    archive incomplete (and thus disabling a recovery) in case of an outage of
>> +    the archiving destination.
>> +  </para>
>
> I dislike the tone of this.  The wording I would suggest would be more
> along the lines of:
>
> The archive_command should only return success once the WAL segment has
> been completely copied, written out, and synced to persistent storage as
> PostgreSQL will recycle or remove the segment shortly after the
> archive_command returns.  If the WAL segment is lost due to an outage of
> the archive server or other issue, any backup which was performed during
> the time that the WAL segment was written will be unusable and
> Point-In-Time-Recovery from an earlier backup will not be usable past
> the missing WAL segment.

Sounds good to me. I'll incorparate that and send -v6 soon!

--
Gunnar "Nick" Bluth
RHCE/SCLA

Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
_____________________________________________________________
In 1984 mainstream users were choosing VMS over UNIX.
Ten years later they are choosing Windows over UNIX.
What part of that message aren't you getting? - Tom Payne



Attachment

pgsql-docs by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Missing important information in backup.sgml
Next
From: Ian Barwick
Date:
Subject: monitoring.sgml - clarify length of query text displayed in pg_stat_statements