Re: Missing important information in backup.sgml - Mailing list pgsql-docs

From Gunnar "Nick" Bluth
Subject Re: Missing important information in backup.sgml
Date
Msg-id a5a614b8-9ce2-ae3b-1141-c09e391cbba2@pro-open.de
In response to Re: Missing important information in backup.sgml  (Kevin Grittner <kgrittn@gmail.com>)
Responses Re: Missing important information in backup.sgml  (Gunnar "Nick" Bluth <gunnar.bluth@pro-open.de>)
Re: Missing important information in backup.sgml  (Stephen Frost <sfrost@snowman.net>)
Re: Missing important information in backup.sgml  (Kevin Grittner <kgrittn@gmail.com>)
List pgsql-docs
On 23.11.2016 at 20:21, Kevin Grittner wrote:
> On Wed, Nov 23, 2016 at 12:24 PM, Gunnar "Nick" Bluth
> <gunnar.bluth@pro-open.de> wrote:
>
>> mentions Stephen's
>> remarks on rsync (although to get actual _data loss_, you'd have to have
>> a power outage in the DC caused by your PG server exploding... ;-).
>
> I have seen power loss between the UPS and a server; including a
> tech tripping on the power cord.  I have also seen servers abruptly
> shut down due to high temperatures in spite of having a UPS.  I
> have also seen an OS bug lock up a system such that it was
> impossible to get a clean shutdown before having to cycle power to
> recover.
>
> No explosion needed.
>
> If you value the data in your database you should assume that the
> OS could go down at any instant without proper shutdown, and that
> your storage system(s) could be lost without warning at any time.

Kevin, all,

I've been in this business for 15 years, and had my share of outages.
The worst case being an AC service guy pushing the big red button next
to the DC entrance, assuming it was the light switch...

It's not like I've not gone through the possible scenarios in my head
before writing such a broad statement. Let me explain.

Assertions (that I take as givens for anyone valuing his data...):
- you have decent HW (BBU controller, HDD cache off, ECC RAM, redundant
PSUs, ...)
- you have a decent DC (UPS, AC, ...)
- you use a single DB server and/or have no (synchronous) replication in place
- your archive server is in the same DC (potentially the same machine as
the DB server)
- (in case of SAN) your storage correctly reports when it has written to
disk/BBU cache
- your OS (and/or archive_script) does not report RC=0 before all data
has been _transmitted_ (think MongoDB... ;-)
- (for the sake of completeness) fsync=on for PG

Now, what could happen is
a) complete DC power outage
b) outage of DB server
c) outage of archive server (or the network connection to it)
d) outage of storage system
e) complete DC outage caused by your DB server vanishing (burning down,
exploding, melting, ...),
f) a complete _loss_ of the DC (nuclear strike, plane crash, ...)

In case a), your DB server would have fsync'd all committed transactions
=> no _data_ loss, but your _archive_ is potentially incomplete.
In case b), the same applies, but your archive should be intact.
In case c), the archiver would retry until your archiving server comes
back online => no _data_ loss, no _archive_ loss.
In case d), see a), or, if you're lucky, b).
In case e), you'd have lost your DB _and_ your archive may be incomplete.
In case f), you're f)....d anyway (oh, the coincidence! ;-).

Protecting yourself from case f) will involve a 2nd (3rd, ...) DC (or
some cloud thingie) anyway. In my experience, users that do have more
than one DC also have a policy in place saying that backups (which
archive logs would probably be counted as) have to be placed in a
different DC.

So, losing actual _data_ is unlikely (at least from the archiving point
of view...), but not explicitly fsync'ing the archive _may_ lead to
incomplete archives. Which is exactly what I tried to point out by
"[...], rendering your archive incomplete in case of a power outage".

Am I missing something?
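
To make the "explicitly fsync'ing the archive" point concrete, here is a
minimal sketch of an archive script that only reports RC=0 once the segment
is on stable storage. It is not taken from my patch; the function name
archive_segment, the ".tmp" suffix and the paths are made up for
illustration, and it assumes a POSIX system where a directory can be
opened and fsync'ed.

```python
import os
import shutil
import sys


def archive_segment(src_path: str, archive_dir: str) -> int:
    """Copy one WAL segment into archive_dir and flush it to stable storage.

    Returns 0 only after both the file contents and the directory entry
    have been fsync'ed; any failure yields a non-zero return code.
    (Hypothetical helper, not part of PostgreSQL.)
    """
    final_path = os.path.join(archive_dir, os.path.basename(src_path))
    tmp_path = final_path + ".tmp"  # illustrative naming choice

    # Refuse to overwrite a segment that has already been archived.
    if os.path.exists(final_path):
        return 1

    try:
        with open(src_path, "rb") as src, open(tmp_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
            dst.flush()             # push Python's buffer to the OS
            os.fsync(dst.fileno())  # push the OS cache to disk/BBU cache
        # Make the segment visible under its final name in one step.
        os.rename(tmp_path, final_path)
        # Flush the directory entry as well, so the rename survives a crash.
        dir_fd = os.open(archive_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    except OSError:
        return 1
    return 0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: archive_command = 'python3 archive_wal.py %p /mnt/archive'
    sys.exit(archive_segment(sys.argv[1], sys.argv[2]))
```

With something like this, a power outage right after the script returns
should no longer leave a hole in the archive: the non-zero return code on
any failure makes the archiver retry the segment later, same as in case c).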

P.S.: just to point that out... my patch does _not_ mention exploding
servers ;-)

Cheers,
--
Gunnar "Nick" Bluth
RHCE/SCLA

Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
_____________________________________________________________
In 1984 mainstream users were choosing VMS over UNIX.
Ten years later they are choosing Windows over UNIX.
What part of that message aren't you getting? - Tom Payne

