Re: Missing important information in backup.sgml - Mailing list pgsql-docs

From Gunnar "Nick" Bluth
Subject Re: Missing important information in backup.sgml
Date
Msg-id cd880db6-1ff6-8fa7-4868-249421cd6fb2@pro-open.de
In response to Re: Missing important information in backup.sgml  ("Gunnar "Nick" Bluth" <gunnar.bluth@pro-open.de>)
Responses Re: Missing important information in backup.sgml  (Kevin Grittner <kgrittn@gmail.com>)
List pgsql-docs
On 16.11.2016 at 22:07, Gunnar "Nick" Bluth wrote:
> On 16.11.2016 at 15:36, Stephen Frost wrote:
>> Gunnar, all,
>>
>> * Gunnar "Nick" Bluth (gunnar.bluth.extern@elster.de) wrote:
>>> On 16.11.2016 at 11:37, Gunnar "Nick" Bluth wrote:
>>>> I ran into this issue (see patch) a few times over the past years, and
>>>> tend to forget it again (sigh!). Today I had to clean up a few hundred
>>>> GB of unarchived WALs, so I decided to write a patch for the
>>>> documentation this time.
>>>
>>> Uhm, well, the actual problem was a stale replication slot... and
>>> tomatoes on my eyes, it seems ;-/. Ashes etc.!
>>>
>>> However, I still think a warning on (esp. rsync's) RCs >= 128 is worth
>>> considering (see -v2 attached).
>>
>> Frankly, I wouldn't suggest including such wording as it would imply
>> that using a bare rsync command is an acceptable configuration of
>> archive_command.  It isn't.  At the very least, a bare rsync does
>> nothing to ensure that the WAL has been fsync'd to permanent storage
>> before returning, leading to potential data loss due to the WAL
>> segment being removed by PG before the new segment has been permanently
>> stored.
>
> I, for one, deem a UPS-backed server in a different DC a pretty good
> starting point, and I reckon many others do as well... obviously it's
> not a belt and braces solution, but my guess would be that > 90% of
> users have something similar in place, many of them actually using rsync
> (or scp) one way or the other (or, think WAL-E et al., how do you force
> an fsync on AWS?!?).
> In environments where there's a risk of the WAL segment being
> overwritten before that target server has fsync'd, heck, yeah, you're
> right. But then you'd probably have something quite sophisticated in
> place, and hate to see allegedly random _FATAL_ errors that are _not
> documented outside the source code_ even more. Esp. when you can't tell
> for sure (from the docs) if archiving your WAL segment will be retried
> or not.
>
>> The PG documentation around archive command is, at best, a starting
>> point for individuals who wish to implement their own proper backup
>> solution, not as examples of good practice for production environments.
>
> True. Which doesn't mean there's no room for more hints, like "ok, we
> throw a FATAL error sometimes, but they're not really a problem, you
> know, it's just external software that basically everyone uses at one
> point or the other doing odd things sometimes" ;-).
>
> Alas, I've been hunting a red herring today, because when you find your
> pg_xlog cluttered with old files _and_ see FATAL archiving messages in
> your logs, your first thought is not "there's probably a replication slot
> left over", but "uh oh, those archive_command calls failed, so something
> might be somehow stuck now".
>
> I'll try to come up with something more comprehensive, taking your
> comments into account...

So, attached is what I came up with. It's obviously not "complete",
however it points out the RC >= 128 "quirk" and also mentions Stephen's
remarks on rsync (although to get actual _data loss_, you'd have to have
a power outage in the DC caused by your PG server exploding... ;-).
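
To illustrate the two points above, here is a rough sketch of what an archive_command wrapper could look like. It is not part of the attached patch. The function names (clamp_return_code, copy_with_fsync, run_clamped, main) are made up for illustration, and the threshold of 128 is simply the value discussed in this thread, not something verified against the archiver source. The fsync of the target directory assumes a POSIX system.

```python
import os
import shutil
import subprocess
import sys

# Assumption taken from this thread: return codes >= 128 from
# archive_command are logged as FATAL instead of as an ordinary failure.
RC_LIMIT = 128


def clamp_return_code(rc):
    """Map signal deaths and high exit codes to a plain failure (1)."""
    # subprocess reports "killed by signal N" as a negative return code.
    if rc < 0 or rc >= RC_LIMIT:
        return 1
    return rc


def run_clamped(argv):
    """Run an external command (e.g. rsync) and clamp its return code."""
    return clamp_return_code(subprocess.run(argv).returncode)


def copy_with_fsync(src, dst_dir):
    """Copy src into dst_dir and flush it to disk before returning."""
    final = os.path.join(dst_dir, os.path.basename(src))
    if os.path.exists(final):
        # Never overwrite a segment that is already archived.
        raise FileExistsError(final)
    tmp = final + ".tmp"
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # file contents reach stable storage
    os.rename(tmp, final)
    # fsync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(dst_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return final


def main(argv):
    """Return 0 on success and 1 on any failure, never a code >= 128."""
    if len(argv) != 3:
        return 1
    try:
        copy_with_fsync(argv[1], argv[2])
    except OSError as exc:
        print("archiving failed: %s" % exc, file=sys.stderr)
        return 1
    return 0
```

Saved under a hypothetical name such as archive_wal.py, with `sys.exit(main(sys.argv))` added at the bottom, it could be called as archive_command = 'python3 archive_wal.py %p /mnt/archive'. The point is only that the wrapper returns 0 after the segment has been fsync'd and an ordinary non-zero code otherwise; a local copy like this obviously does not address the remote (rsync/scp) case.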

Cheers,
--
Gunnar "Nick" Bluth
RHCE/SCLA

Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
_____________________________________________________________
In 1984 mainstream users were choosing VMS over UNIX.
Ten years later they are choosing Windows over UNIX.
What part of that message aren't you getting? - Tom Payne


