Re: Synchronous Standalone Master Redoux - Mailing list pgsql-hackers

From Aidan Van Dyk
Subject Re: Synchronous Standalone Master Redoux
Date
Msg-id CAC_2qU9rDFkUMO6ChANQsnsKQN9N0v5mhUru0r6BqowiNPaO=A@mail.gmail.com
In response to Re: Synchronous Standalone Master Redoux  (Shaun Thomas <sthomas@optionshouse.com>)
Responses Re: Synchronous Standalone Master Redoux
List pgsql-hackers
On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> So far as transaction durability is concerned... we have a continuous
> background rsync over dark fiber for archived transaction logs, DRBD for
> block-level sync, filesystem snapshots for our backups, a redundant async DR
> cluster, an offsite backup location, and a tape archival service stretching
> back for seven years. And none of that will cause the master to stop
> processing transactions unless the master itself dies and triggers a
> failover.

Right, so suppose the dark fiber between New Orleans and Seattle (pick
two places for your datacenters) happens to be the first thing to fail
in your NO data center.  You disconnect the sync-ness, and continue.
Not a problem, unless it happens to be Aug 29, 2005.

You have lost data.  Maybe only a bit.  Maybe it wasn't even
important.  But that's not for PostgreSQL to decide.

But because your PG on DRBD "continued" when it couldn't replicate to
Seattle, it told its clients the data was durable, just minutes
before the whole DC was under water.

OK, so a wise admin team would have removed the NO DC from its
primary role days before that hit.

Change the NO to NYC and the date Sept 11, 2001.

OK, so maybe we can concede that these types of major catastrophes are
more devastating to us than losing some data.

Now your primary server was in AWS US East last week.  Its sync slave
was in the affected AZ, but your PG primary continues on, until, since
it was an EC2 instance, it disappears.  Now where is your data?

Or the fire marshal orders an EPO (emergency power off) of the data
center (or whole building), and the connection to your backup goes
down minutes before your servers or other network peers.

> Using PG sync in its current incarnation would introduce an extra failure
> scenario that wasn't there before. I'm pretty sure we're not the only ones
> avoiding it for exactly that reason. Our queue discards messages it can't
> fulfil within ten seconds and then throws an error for each one. We need to
> decouple the secondary as quickly as possible if it becomes unresponsive,
> and there's really no way to do that without something in the database, one
> way or another.

It introduces an "extra failure", because it has introduced an "extra
data durability guarantee".

Sure, many people don't *really* want that data durability guarantee,
even though they would like the "maybe guaranteed" version of it.

But that fine line is actually a difficult (impossible?) one to define
if you don't know, at the moment of decision, what the next few
moments will/could become.
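
To make the trade-off concrete, here is a toy sketch in Python.  It is
not PostgreSQL code and does not model any real setting: the class
`Primary`, the flag `degrade_on_timeout`, and the helpers `run` and
`lost_after_site_failure` are all made up for illustration.  It assumes
one primary and one synchronous standby, a link that fails first, and
then the loss of the whole primary site.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy primary with one synchronous standby (hypothetical model)."""
    degrade_on_timeout: bool          # True = "continue standalone" behaviour
    standby_reachable: bool = True
    local_log: list = field(default_factory=list)
    standby_log: list = field(default_factory=list)
    acked: list = field(default_factory=list)

    def commit(self, txn):
        """Return True if the client is told the transaction is durable."""
        self.local_log.append(txn)
        if self.standby_reachable:
            self.standby_log.append(txn)
        elif not self.degrade_on_timeout:
            # Strict sync: no acknowledgement without the standby.
            # This is the "extra failure" the client gets to see.
            return False
        # Either replicated, or degraded to standalone and acked anyway.
        self.acked.append(txn)
        return True


def lost_after_site_failure(primary):
    """Acknowledged transactions that existed only at the lost site."""
    return [t for t in primary.acked if t not in primary.standby_log]


def run(degrade_on_timeout):
    p = Primary(degrade_on_timeout=degrade_on_timeout)
    p.commit("t1")                # link up: replicated and acked
    p.standby_reachable = False   # the link is the first thing to fail
    p.commit("t2")                # acked only if we degrade
    return lost_after_site_failure(p)
```

Under these assumptions, `run(True)` reports `t2` as acknowledged but
lost once the primary site is gone, while `run(False)` reports nothing
lost, at the price of refusing to acknowledge `t2` at all.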

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

