From: Jose Ildefonso Camargo Tolosa
Subject: Re: Synchronous Standalone Master Redoux
Date:
Msg-id: CAETJ_S-jFznJWX=wZfzwVMF4jxMLmWCH0GVLYW2aQ3sK0JMPoA@mail.gmail.com
In response to: Re: Synchronous Standalone Master Redoux (Aidan Van Dyk <aidan@highrise.ca>)
List: pgsql-hackers

On Thu, Jul 12, 2012 at 9:28 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>
>> So far as transaction durability is concerned... we have a continuous
>> background rsync over dark fiber for archived transaction logs, DRBD for
>> block-level sync, filesystem snapshots for our backups, a redundant async DR
>> cluster, an offsite backup location, and a tape archival service stretching
>> back for seven years. And none of that will cause the master to stop
>> processing transactions unless the master itself dies and triggers a
>> failover.
>
> Right, so say the dark fiber between New Orleans and Seattle (pick two
> places for your datacenter) happens to be the first thing failing in
> your NO data center.  Disconnect the sync-ness, and continue.  Not a
> problem, unless it happens to be Aug 29, 2005.
>
> You have lost data.  Maybe only a bit.  Maybe it wasn't even
> important.  But that's not for PostgreSQL to decide.

I never asked for that... but you (the one configuring the system) can
decide, and should be able to decide.  Right now, we can't decide.

>
> But because your PG on DRBD "continued" when it couldn't replicate to
> Seattle, it told its clients the data was durable, just minutes
> before the whole DC was under water.

Yeah, well, what is the probability of all of that?... Really tiny.  I
bet it is more likely that you win the lottery than that all of these
events happen within that time frame.  But risking monetary losses
because, for example, the online store stopped accepting orders while
the standby server was down: that's not acceptable for some companies
(and some companies just can't buy 3 DB servers, or more!).

>
> OK, so a wise admin team would have removed the NO DC from its
> primary role days before that hit.
>
> Change the NO to NYC and the date to Sept 11, 2001.
>
> OK, so maybe we can concede that these types of major catastrophes are
> more devastating to us than losing some data.
>
> Now your primary server was in AWS US East last week.  Its sync slave
> was in the affected AZ, but your PG primary continues on until, since
> it was an EC2 instance, it disappears.  Now where is your data?

Who would *really* trust their PostgreSQL DB to EC2?... I mean, the I/O
is not very good, and the price is not exactly low enough to justify
taking that risk.

All in all: you are still stacking up coincidences whose combined
probability is *so* low....

>
> Or the fire marshal orders an EPO of the data center (or the whole
> building), and the connection to your backup goes down minutes before
> your servers or other network peers do.
>
>> Using PG sync in its current incarnation would introduce an extra failure
>> scenario that wasn't there before. I'm pretty sure we're not the only ones
>> avoiding it for exactly that reason. Our queue discards messages it can't
>> fulfil within ten seconds and then throws an error for each one. We need to
>> decouple the secondary as quickly as possible if it becomes unresponsive,
>> and there's really no way to do that without something in the database, one
>> way or another.
>
> It introduces an "extra failure" because it has introduced an "extra
> data durability guarantee".
>
> Sure, many people don't *really* want that data durability guarantee,
> even though they would like the "maybe guaranteed" version of it.
>
> But that fine line is actually a difficult (impossible?) one to define
> if you don't know, at the moment of decision, what the next few
> moments will/could become.

You *never* know.  And the truth is that you have to make the decision
with what you have.  If you can pay for 10 servers nationwide, good for
you; not all of us can afford that (man, I could barely pay for two,
and that is because I *know* I don't want to risk losing the data or
the service because the single server died).

As it currently is, freezing the master because the standby died is not
good for all cases (and, I dare say, not for most cases), and having to
wait for Pacemaker or other monitoring to notice that, change the
master's config, and reload... that will cause a service disruption!
(for several seconds, usually ~30 seconds).
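
Just to illustrate the kind of thing we are forced to run today, here is
a minimal sketch of such an external watchdog (names, paths, the DSN and
the timeout are made up; it assumes psycopg2, a local superuser
connection, and that blanking synchronous_standby_names in
postgresql.conf plus a reload is an acceptable way to degrade to async):

#!/usr/bin/env python
# Hypothetical watchdog: if no sync standby is connected for more than
# TIMEOUT seconds, blank out synchronous_standby_names and reload, so
# the master stops waiting on a standby that is gone.
import re
import time
import psycopg2

DSN = "dbname=postgres user=postgres"                # assumption
CONF = "/etc/postgresql/9.1/main/postgresql.conf"    # assumption
TIMEOUT = 10                                         # assumption, seconds

def sync_standby_present(conn):
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM pg_stat_replication"
                " WHERE sync_state = 'sync'")
    return cur.fetchone()[0] > 0

def degrade_to_async(conn):
    # Blank out synchronous_standby_names in postgresql.conf ...
    with open(CONF) as f:
        conf = f.read()
    conf = re.sub(r"^\s*synchronous_standby_names\s*=.*$",
                  "synchronous_standby_names = ''", conf, flags=re.M)
    with open(CONF, "w") as f:
        f.write(conf)
    # ... and reload, releasing backends stuck waiting for sync commit.
    cur = conn.cursor()
    cur.execute("SELECT pg_reload_conf()")

def main():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    lost_since = None
    while True:
        if sync_standby_present(conn):
            lost_since = None
        elif lost_since is None:
            lost_since = time.time()
        elif time.time() - lost_since > TIMEOUT:
            degrade_to_async(conn)
            break
        time.sleep(1)

if __name__ == "__main__":
    main()

And even with something like that polling every second, the commits that
arrive while the standby is gone still sit blocked until the reload
happens, which is exactly the disruption I was talking about.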

