Re: Synchronous Standalone Master Redoux - Mailing list pgsql-hackers

From: Jose Ildefonso Camargo Tolosa
Subject: Re: Synchronous Standalone Master Redoux
Date:
Msg-id: CAETJ_S-nJrhYzBc3rwLXSoUxFHgC6rCRwBaHuWqmv4PL60qxmg@mail.gmail.com
In response to: Re: Synchronous Standalone Master Redoux (Shaun Thomas <sthomas@optionshouse.com>)
Responses: Re: Synchronous Standalone Master Redoux
           Re: Synchronous Standalone Master Redoux
List: pgsql-hackers
Greetings,

On Wed, Jul 11, 2012 at 9:11 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 07/10/2012 06:02 PM, Daniel Farina wrote:
>
>> For example, what if DRBD can only complete one page per second for
>> some reason? Does it simply have the primary wait at this glacial
>> pace, or drop synchronous replication and go degraded? Or does it do
>> something more clever than just a timeout?
>
> That's a good question, and way beyond what I know about the internals. :)
> In practice though, there are configurable thresholds, and if exceeded, it
> will invalidate the secondary. When using Pacemaker, we've actually had
> instances where the 10G link we had between the servers died, so each node
> thought the other was down. That led to the secondary node self-promoting
> and trying to steal the VIP from the primary. Throw in a gratuitous ARP,
> and you get a huge mess.

That's why Pacemaker *recommends* STONITH (Shoot The Other Node In The Head). Whenever the standby decides to promote itself, it first kills the former master (just in case)... the STONITH mechanism has to use an independent connection. Additionally, a redundant link between the cluster nodes is a must.

> That led to what DRBD calls split-brain, because both nodes were running
> and writing to the block device. Thankfully, you can actually tell one node
> to discard its changes and re-subscribe. Doing that will replay the
> transactions from the "good" node on the "bad" one. And even then, it's a
> good idea to run an online verify to do a block-by-block checksum and
> correct any differences.
>
> Of course, all of that's only possible because it's a block-level
> replication. I can't even imagine PG doing anything like that. It would
> have to know the last good transaction from the primary and do an implied
> PIT recovery to reach that state, then re-attach for sync commits.
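(For readers unfamiliar with DRBD: the split-brain recovery behavior Shaun describes is configurable in the DRBD resource itself. The following is only an illustrative sketch; the resource name and the specific policy choices are my own examples, not something from this thread.)

```
resource r0 {
  protocol C;   # fully synchronous: a write is confirmed only after it
                # reaches stable storage on BOTH nodes

  net {
    # Automatic split-brain recovery policies, by number of primaries
    # at the moment split-brain is detected:
    after-sb-0pri discard-zero-changes;  # no primary: keep the node that has changes
    after-sb-1pri discard-secondary;     # one primary: throw away the secondary's changes
    after-sb-2pri disconnect;            # two primaries: give up, require manual repair
  }

  handlers {
    # Notify an operator when split-brain is detected
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }
}
```

The `discard-secondary` policy corresponds to the "tell one node to discard its changes and re-subscribe" step mentioned above; an online verify afterwards is still advisable.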
>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.
>
> Which would be handy. With synchronous commits, it's given that the protocol
> is bi-directional. Then again, PG can detect when clients disconnect the
> instant they do so, and having such an event implicitly disable
> synchronous_standby_names until reconnect would be an easy fix. The database
> already keeps transaction logs, so replaying would still happen on
> re-attach. It could easily throw a warning for every sync-required commit so
> long as it's in "degraded" mode. Those alone are very small changes that
> don't really harm the intent of sync commit.
>
> That's basically what a RAID-1 does, and people have been fine with that for
> decades.

I can't believe how many times I have seen this topic arise on the mailing list... I was myself about to start a thread like this! (Thanks, Shaun!)

I don't really get what people want out of synchronous streaming replication. DRBD (which is being used as the comparison) in protocol C is synchronous: it won't confirm a write unless it was written to disk on both nodes. PostgreSQL (8.4, 9.0, 9.1, ...) will work just fine on top of it, except that you don't have a standby you can connect to. Also, you need to set up a dedicated volume for the DRBD block device, set up DRBD, put the filesystem on top of DRBD, and handle the DRBD promotion, the partition mount (with possible FS error handling), and only then start PostgreSQL once the FS is correctly mounted.

With synchronous streaming replication you get about the same guarantee: the standby will have the changes written to disk before the master confirms the commit. I don't really care whether the standby has already applied the changes to its DB (although that would certainly be nice).
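For anyone following along, the 9.1-era synchronous streaming setup being compared here comes down to a handful of settings. A minimal sketch (the standby's application name and the connection details are hypothetical examples):

```
# postgresql.conf on the master (PostgreSQL 9.1 era)
wal_level = hot_standby                  # WAL detailed enough for a hot standby
max_wal_senders = 3                      # allow streaming replication connections
synchronous_commit = on                  # wait for the standby's WAL flush
synchronous_standby_names = 'standby1'   # commits block until this standby confirms

# recovery.conf on the standby
# standby_mode = 'on'
# primary_conninfo = 'host=master port=5432 application_name=standby1'
```

It is exactly `synchronous_standby_names` that the proposed "degraded mode" would implicitly disable when the standby disconnects.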
The point is: the data is on the standby, and if the master were to crash and I were to promote the standby, the standby would have the same committed data the server had before it crashed.

So, why are we, the HA people, bothering you DB people so much? To simplify things: it is simpler to set up synchronous streaming replication than to set up DRBD plus Pacemaker rules to promote DRBD, mount the FS, and then start pgsql. Also, there is a great perk to synchronous replication with Hot Standby: you get a read-only standby that can be used for some things (even though it doesn't always have exactly the same data as the master).

I mean, a lot of people here have a really valid point: 2-safe reliability is great, but how good is it if, when you lose it, the WHOLE system just freezes? RAID-1 gives you 2-safe reliability, but no one would use it if the machine froze when it lost one disk. Same for DRBD: it offers 2-safe reliability too (at block level), but it doesn't freeze if the secondary goes away!

Now, I see some people arguing that synchronous replication is not an HA feature (those who say that SR doesn't fit the HA environment)... Please, those people: answer why synchronous streaming replication is documented under the High Availability chapter of the PostgreSQL manual.

I really feel bad that people are so closed to fixing this. Making the master notice that the standby is no longer there and fall back to "standalone" mode seems to bother them so much that they wouldn't even allow *an option* for it. We are not asking you to change the default behavior, just to add an option that makes it gracefully continue operation and issue warnings. After all, if you lose a disk in a RAID array, you get some kind of indication of the failure so you can fix it ASAP: you know you are at risk until you fix it, but you can continue to function.
Name a single RAID controller that will shut down your server on a single disk failure. I haven't seen any card that does that: nobody would buy it.

Adding more on a related issue: what's up with the fact that the standby doesn't respect wal_keep_segments? This forces some people to copy the WAL files *twice*: once through streaming replication, and again to a WAL archive. If the master dies and you have more than one standby (say, one synchronous and two asynchronous), you can actually point the async ones at the sync one once you promote it (as long as you trick the sync one into *not* switching the timeline, by moving away recovery.conf and restarting instead of using "normal" promotion), but if you don't have the WAL archive and one of the standbys was too lagged, it wouldn't be able to recover.

Please, stop arguing about all of this: I don't think that adding an option will hurt anybody (especially because the work was already done by someone). We are not asking to change how things work; we just want an option to decide whether the master should freeze on standby disconnection, or continue automatically... Is that asking so much?

Sincerely,

Ildefonso
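The double-copy workaround described above (streaming replication plus a shared WAL archive as a safety net) looks roughly like the following in config terms; the archive path and segment count are made-up examples, not values from the thread:

```
# postgresql.conf on the master
wal_keep_segments = 128                         # retain extra WAL segments for lagged standbys
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'   # hypothetical shared archive location

# recovery.conf on each standby: fall back to the archive when streaming
# can no longer supply an old-enough segment
# restore_command = 'cp /mnt/wal_archive/%f %p'
```

With a shared archive reachable from every node, a standby that falls too far behind can still catch up via `restore_command` even after the segments have been recycled on the (new) master.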