Re: Synchronous Standalone Master Redoux - Mailing list pgsql-hackers

From: Jose Ildefonso Camargo Tolosa
Subject: Re: Synchronous Standalone Master Redoux
Msg-id: CAETJ_S-nJrhYzBc3rwLXSoUxFHgC6rCRwBaHuWqmv4PL60qxmg@mail.gmail.com
In response to: Re: Synchronous Standalone Master Redoux (Shaun Thomas <sthomas@optionshouse.com>)
List: pgsql-hackers
Greetings,

On Wed, Jul 11, 2012 at 9:11 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 07/10/2012 06:02 PM, Daniel Farina wrote:
>
>> For example, what if DRBD can only complete one page per second for
>> some reason?  Does it simply have the primary wait at this glacial
>> pace, or drop synchronous replication and go degraded?  Or does it do
>> something more clever than just a timeout?
>
>
> That's a good question, and way beyond what I know about the internals. :)
> In practice though, there are configurable thresholds, and if exceeded, it
> will invalidate the secondary. When using Pacemaker, we've actually had
> instances where the 10G link we had between the servers died, so each node
> thought the other was down. That led to the secondary node self-promoting
> and trying to steal the VIP from the primary. Throw in a gratuitous arp, and
> you get a huge mess.

That's why Pacemaker *recommends* STONITH (Shoot The Other Node In The
Head).  Whenever the standby decides to promote itself, it first kills
the former master (just in case)... the STONITH device has to use an
independent connection.  Additionally, a redundant link between
cluster nodes is a must.
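For concreteness, a fencing setup in Pacemaker's crm shell looks roughly like this (a sketch only: the resource names, IPMI addresses, and credentials are all invented for illustration):

```
# Hypothetical IPMI-based STONITH configuration (all names/addresses invented).
crm configure primitive fence-node1 stonith:external/ipmi \
    params hostname=node1 ipaddr=10.0.0.101 userid=admin passwd=secret interface=lan
crm configure primitive fence-node2 stonith:external/ipmi \
    params hostname=node2 ipaddr=10.0.0.102 userid=admin passwd=secret interface=lan
# A node must never be responsible for fencing itself.
crm configure location l-fence-node1 fence-node1 -inf: node1
crm configure location l-fence-node2 fence-node2 -inf: node2
crm configure property stonith-enabled=true
```

Note that the IPMI/management network here should be the independent connection mentioned above, not the replication link.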

>
> That led to what DRBD calls split-brain, because both nodes were running
> and writing to the block device. Thankfully, you can actually tell one node
> to discard its changes and re-subscribe. Doing that will replay the
> transactions from the "good" node on the "bad" one. And even then, it's a
> good idea to run an online verify to do a block-by-block checksum and
> correct any differences.
>
> Of course, all of that's only possible because it's a block-level
> replication. I can't even imagine PG doing anything like that. It would have
> to know the last good transaction from the primary and do an implied PIT
> recovery to reach that state, then re-attach for sync commits.
>
>
>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.
>
>
> Which would be handy. With synchronous commits, it's given that the protocol
> is bi-directional. Then again, PG can detect when clients disconnect the
> instant they do so, and having such an event implicitly disable
> synchronous_standby_names until reconnect would be an easy fix. The database
> already keeps transaction logs, so replaying would still happen on
> re-attach. It could easily throw a warning for every sync-required commit so
> long as it's in "degraded" mode. Those alone are very small changes that
> don't really harm the intent of sync commit.
>
> That's basically what a RAID-1 does, and people have been fine with that for
> decades.
>
>
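The degraded-mode behavior Shaun describes can be sketched as a toy state machine. This is hypothetical Python, not PostgreSQL code; every name here is invented purely to illustrate the proposed semantics (keep committing when the sync standby vanishes, but warn on every sync-required commit until it re-attaches):

```python
# Toy model of the proposed "degraded" sync-commit mode (all names invented).
class Master:
    def __init__(self, synchronous=True):
        self.synchronous = synchronous
        self.standby_connected = True
        self.degraded = False
        self.warnings = []

    def on_standby_disconnect(self):
        # Proposal: instead of blocking every commit, drop to a
        # degraded standalone mode and keep accepting writes.
        self.standby_connected = False
        self.degraded = True

    def on_standby_reconnect(self):
        # WAL replay on re-attach catches the standby up, after
        # which synchronous commits resume silently.
        self.standby_connected = True
        self.degraded = False

    def commit(self, txid):
        if self.synchronous and not self.standby_connected:
            # Current behavior would block here; the proposal is to
            # proceed and warn for every sync-required commit.
            self.warnings.append(
                f"WARNING: commit {txid} not replicated (degraded mode)")
        return "committed"
```

A run of `on_standby_disconnect()` followed by `commit(2)` proceeds and records one warning, mirroring the "RAID-1 in degraded mode" analogy.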

I can't believe how many times I have seen this topic arise in the
mailing list... I was myself about to start a thread like this!
(thanks Shaun!).

I don't really get what people want out of synchronous streaming
replication.... DRBD (which is being used as the comparison) in
protocol C is synchronous (it won't confirm a write unless it has been
written to disk on both nodes).  PostgreSQL (8.4, 9.0, 9.1, ...) will
work just fine with it, except that you don't get a standby you can
connect to... also, you need to set up a dedicated volume for the DRBD
block device, set up DRBD, put the filesystem on top of DRBD, and then
handle DRBD promotion, partition mount (with possible FS error
handling), and starting PostgreSQL only after the FS is correctly
mounted.

With synchronous streaming replication you get about the same: the
standby will have the changes written to disk before the master
confirms the commit.  I don't really care whether the standby has
already applied the changes to its DB (although that would certainly
be nice)... the point is: the data is on the standby, and if the
master were to crash and I were to promote the standby, the standby
would have the same committed data the master had before it crashed.

So, why are we, HA people, bothering you DB people so much?  To
simplify things: it is simpler to set up synchronous streaming
replication than to set up DRBD plus the Pacemaker rules to promote
DRBD, mount the FS, and then start pgsql.
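By contrast, the streaming-replication side of the setup is a few lines of configuration. A sketch for the master's postgresql.conf on 9.1 (the standby application name is invented):

```
# master's postgresql.conf (sketch; 'standby1' is a made-up name)
wal_level = hot_standby
max_wal_senders = 3
synchronous_standby_names = 'standby1'   # commits wait for this standby's flush
```

No block device, no filesystem handoff, no mount ordering: promotion is just bringing the standby out of recovery.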

Also, there is a great perk to synchronous replication with Hot
Standby: you get a read-only standby that can be used for some things
(even though it doesn't always have exactly the same data as the
master).

I mean, a lot of people here have a really valid point: 2-safe
reliability is great, but how good is it if, when you lose it, the
whole system just freezes?  RAID 1 gives you 2-safe reliability, but
no one would use it if the machine froze when you lost one disk.  Same
for DRBD: it offers 2-safe reliability too (at the block level), but
it doesn't freeze if the secondary goes away!

Now, I see some people arguing that synchronous replication is
apparently not an HA feature (those who say that SR doesn't fit the HA
environment)... please, those people, answer this: why is synchronous
streaming replication under the High Availability chapter of the
PostgreSQL manual?

I really feel bad that people are so closed to fixing this.  Having
the master notice that the standby is no longer there and fall back to
"standalone" mode seems to bother them so much that they wouldn't even
allow *an option* for it... we are not asking you to change the
default behavior, just to add an option that makes it gracefully
continue operation and issue warnings.  After all, if you lose a disk
on a RAID array, you get some kind of indication of the failure so you
can fix it ASAP: you know you are at risk until you fix it, but you
can continue to function.  Name a single RAID controller that will
shut down your server on a single disk failure: I haven't seen any
card that does that, because nobody would buy it.

Adding more on a related issue: what's up with the fact that the
standby doesn't respect wal_keep_segments?  This forces some people to
copy the WAL files *twice*: once through streaming replication, and
again to a WAL archive.  If the master dies and you have more than one
standby (say, one synchronous and two asynchronous), you can actually
point the async ones at the sync one once you promote it (as long as
you trick the sync one into *not* switching the timeline, by moving
recovery.conf away and restarting instead of using "normal"
promotion), but if you don't have the WAL archive and one of the
standbys was too lagged, it wouldn't be able to recover.

Please, stop arguing about all of this: I don't think that adding an
option will hurt anybody (especially because the work was already done
by someone).  We are not asking you to change how things work; we just
want an option to decide whether the master freezes on standby
disconnection or continues automatically... is that asking so much?

Sincerely,

Ildefonso

