Thread: Synchronous Standalone Master Redoux
Hey everyone,

Upon doing some usability tests with PostgreSQL 9.1 recently, I ran across this discussion: http://archives.postgresql.org/pgsql-hackers/2011-12/msg01224.php

After reading the entire thing, I found it odd that the overriding pushback was that nobody could think of a use case. The argument was: if you don't care whether the slave dies, why not just use asynchronous replication?

I'd like to introduce all of you to DRBD. DRBD, for those who aren't familiar, is distributed (network) block-level replication. Right now, this is what we're using, and will use in the future, to ensure a stable synchronous PostgreSQL copy on our backup node. I was excited to read about synchronous replication, because with it came the possibility of two readable nodes on the servers we already have. You can't do that with DRBD; secondary nodes can't even mount the device.

So here's your use case:

1. Slave wants to be synchronous with master. Master wants replication on at least one slave. They have this, and are happy.
2. For whatever reason, slave crashes or becomes unavailable.
3. Master notices no more slaves are available, and operates in standalone mode, accumulating WAL files until a suitable slave appears.
4. Slave finishes rebooting/rebuilding/upgrading/whatever, and re-subscribes to the feed.
5. Slave stays in degraded sync (asynchronous) mode until it is caught up, and then switches to synchronous. This makes both master and slave happy, because the *intent* of synchronous replication is fulfilled.

PostgreSQL's implementation means the master will block until someone or something notices and tells it to stop waiting, or the slave comes back. For pretty much any high-availability environment, that is not viable. Based on that alone, I can't imagine a scenario where synchronous replication would be considered beneficial. The current setup doubles the unplanned-outage scenarios in such a way that I'd never use it in a production environment.

Right now, we only care if the master server dies. With sync rep, we'd have to watch both servers like a hawk and be ready to tell the master to disable sync rep, lest our 10k TPS system come to an absolute halt because the slave died. With DRBD, when a slave node goes offline, the master operates standalone until the secondary reappears, after which it re-synchronizes the missing data and then operates in sync mode again.

Just because the data is temporarily out of sync does *not* mean we want asynchronous replication. I think you'd be hard pressed to find many users taking advantage of DRBD's async mode. Just because data is temporarily catching up doesn't mean it will remain in that state.

I would *love* to have the functionality discussed in the patch. If I can make a case for it, I might even be able to convince my company to sponsor its addition, provided someone has time to integrate it. Right now, we're using DRBD so we can have a very short outage window while the offline node gets promoted, and it works, but that means a basically idle server at all times. I'd gladly accept a 10-20% performance hit for sync rep if it meant that other server could reliably act as a read slave. That's currently impossible, because async replication is too slow, and sync is too fragile for the reasons stated above.

Am I totally off-base here? I was shocked when I actually read the documentation on how sync rep worked and saw that no servers would function properly until at least two were online.

--
Shaun Thomas
OptionsHouse
sthomas@optionshouse.com
Shaun,

> PostgreSQL's implementation means the master will block until
> someone or something notices and tells it to stop waiting, or the slave
> comes back. For pretty much any high-availability environment, that is
> not viable. Based on that alone, I can't imagine a scenario where
> synchronous replication would be considered beneficial.

So there's an issue with the definition of "synchronous". What "synchronous" in "synchronous replication" means is "guarantee zero data loss or fail the transaction". It does NOT mean "master and slave have the same transactional data at the same time", as much as that would be great to have.

There are, indeed, systems where you'd rather shut down the system than accept writes which were not replicated, or we wouldn't have the feature. That just doesn't happen to fit your needs (nor, indeed, the needs of most people who think they want SR).

"Total-consistency" replication is what I think you want: a guarantee that at any given time a read query on the master will return the same results as a read query on the standby. Heck, *most* people would like to have that. You would also be advancing database science in general if you could come up with a way to implement it.

> slave. That's currently impossible, because async replication is too
> slow, and sync is too fragile for the reasons stated above.

So I'm unclear on why sync rep would be faster than async rep, given that they use exactly the same mechanism. Explain?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>
> 1. Slave wants to be synchronous with master. Master wants replication on at least one slave. They have this, and are happy.
> 2. For whatever reason, slave crashes or becomes unavailable.
> 3. Master notices no more slaves are available, and operates in standalone mode, accumulating WAL files until a suitable slave appears.
> 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and re-subscribes to the feed.
> 5. Slave stays in degraded sync (asynchronous) mode until it is caught up, and then switches to synchronous. This makes both master and slave happy, because the *intent* of synchronous replication is fulfilled.

So if I get this straight, what you are saying is "be asynchronous replication unless someone is around, in which case be synchronous" is the mode you want. I think if your goal is zero-transaction loss then you would want to rethink this; the goal of SR was two copies, no matter what, before COMMIT returns from the primary.

However, I think there is something you are stating here that deserves a finer point: right now, there is no graceful way to attenuate the speed of commit on a primary to ensure bounded lag of an *asynchronous* standby. This is a pretty tricky definition: consider if you bring a standby on-line from archive replay and it shows up in streaming with pretty high lag, and it stops all commit traffic while it gets back inside the bounded window of "acceptable" lag. That sounds pretty terrible, too. How does DRBD handle this? It seems like the catchup phase might be interesting prior art. On first inspection, the best I can come up with is something like "if the standby is making progress but failing to converge, attenuate the primary's speed of COMMIT until convergence is projected to occur within some target time".

Relatedly, one of the ugliest problems I have with continuous archiving is that there is no graceful way to attenuate the speed of operations to prevent a backlog that can fill up the disk containing pg_xlog. It also makes it very hard to strictly bound the amount of data that can remain outstanding and unarchived. To get around this, I was planning on very carefully making use of the status messages that tell synchronous replication to block and unblock operations, but perhaps a less strained interface is possible with some kind of cooperation from Postgres.

--
fdr
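For reference, the lag being discussed is at least observable, though not boundable, from the primary. A minimal sketch, assuming the 9.2-era pg_stat_replication columns and pg_xlog_location_diff():

    -- run on the primary: per-standby replication lag in bytes
    SELECT application_name,
           state,
           pg_xlog_location_diff(pg_current_xlog_location(),
                                 flush_location) AS flush_lag_bytes
      FROM pg_stat_replication;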
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Daniel Farina
> Sent: Tuesday, July 10, 2012 11:42 AM
>
>> On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>>
>> [five-step degraded-sync scenario snipped]
>
> So if I get this straight, what you are saying is "be asynchronous
> replication unless someone is around, in which case be synchronous" is
> the mode you want. I think if your goal is zero-transaction loss then
> you would want to rethink this, and that was the goal of SR: two
> copies, no matter what, before COMMIT returns from the primary.

For such cases, could an option be provided so that the user can change the mode to async if he wants to?
On Tue, Jul 10, 2012 at 8:42 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Daniel Farina
>> Sent: Tuesday, July 10, 2012 11:42 AM
>>
>>> On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>>>
>>> [five-step degraded-sync scenario snipped]
>>
>> So if I get this straight, what you are saying is "be asynchronous
>> replication unless someone is around, in which case be synchronous" is
>> the mode you want. I think if your goal is zero-transaction loss then
>> you would want to rethink this, and that was the goal of SR: two
>> copies, no matter what, before COMMIT returns from the primary.
>
> For such cases, could an option be provided so that the user can change
> the mode to async if he wants to?

You can already change synchronous_standby_names, and do so without a restart. That will change between sync and async just fine on a live system. And you can control that from some external monitor to define your own rules for exactly when it should drop to async mode.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
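To illustrate, a minimal sketch of doing that by hand (the standby name is a placeholder; the reload works because synchronous_standby_names only requires a SIGHUP, not a restart):

    # postgresql.conf on the master: any connected standby named here is sync
    synchronous_standby_names = 'standby1'

    # to degrade to async on a live system, empty the setting and reload:
    sed -i "s/^synchronous_standby_names.*/synchronous_standby_names = ''/" \
        "$PGDATA/postgresql.conf"
    psql -c "SELECT pg_reload_conf();"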
On 07/10/2012 01:11 AM, Daniel Farina wrote:

> So if I get this straight, what you are saying is "be asynchronous
> replication unless someone is around, in which case be synchronous"
> is the mode you want.

Er, no. I think I see where you might have gotten that, but no.

> This is a pretty tricky definition: consider if you bring a standby
> on-line from archive replay and it shows up in streaming with pretty
> high lag, and it stops all commit traffic while it gets back inside
> the bounded window of "acceptable" lag. That sounds pretty terrible,
> too. How does DRBD handle this? It seems like the catchup phase might
> be interesting prior art.

Well, DRBD actually has a very definitive sync mode, and no "attenuation" is involved at all. Here's what a fully working cluster looks like, according to /proc/drbd:

    cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate

Here's what happens when I disconnect the secondary:

    cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown

So there are a few things here:

1. The primary is waiting for the secondary to reconnect.
2. It knows its own data is still up to date.
3. It's waiting to assess the secondary when it reappears.
4. It's still capable of writing to the device.

This is more akin to degraded RAID-1. Writes are synchronous as long as two devices exist, but if one vanishes, you can still use the disk at your own risk, and checking the status of DRBD will show this readily. I also want to point out that it is *fully* synchronous when both nodes are available; you can't even call a filesystem sync without the sync succeeding on both nodes.

When you re-connect a secondary device, it catches up as fast as possible by replaying waiting transactions, and then re-attaches to the cluster. Until it's fully caught up, it doesn't exist: DRBD acknowledges that the secondary is there and attempting to catch up, but does not leave "degraded" mode until the secondary reaches "UpToDate" status.

This is a much more graceful failure scenario than is currently possible with PostgreSQL. With DRBD, you'd still need a tool to notice the master node is in an invalid state and perform a failover, but the secondary going belly-up will not suddenly halt the master.

But I'm not even hoping for *that* level of functionality. I just want PostgreSQL to notice when the secondary becomes unavailable *on its own*, and then carry on in "degraded non-sync mode", because it can do that much faster than any monitor I could possibly attach to perform the same function. I plan on using DRBD until either PG can do that, or a better alternative presents itself. Async is simply too slow for our OLTP system except for the disaster-recovery node, which isn't expected to carry on within seconds of the primary's failure. I briefly considered sync mode when it appeared as a feature, but I see it's still too early in its development cycle, because there are no degraded operation modes. That's fine; I'm willing to wait.

I just don't understand the push-back, I guess. RAID-1 is the poster child for synchronous writes for fault tolerance. It will whine constantly to anyone who will listen when operating on only one device, but at least it still works. I'm pretty sure nobody would use RAID-1 if its failure mode were: block writes until someone installs a replacement disk.

--
Shaun Thomas
OptionsHouse
sthomas@optionshouse.com
On Tue, Jul 10, 2012 at 9:28 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> Async is simply too slow for our OLTP system except for the disaster-
> recovery node, which isn't expected to carry on within seconds of the
> primary's failure. I briefly considered sync mode when it appeared as a
> feature, but I see it's still too early in its development cycle, because
> there are no degraded operation modes. That's fine; I'm willing to wait.

But this is where some of us are confused by what you're asking for. Async is actually *FASTER* than sync; it has less overhead. Synchronous replication is basically async replication with extra overhead: an artificial delay on the master before the commit *RETURNS* to the client. The data is still committed and viewable to new queries on the master, and on the slave, at the same rate as with async replication. Only the commit status returned to the client is delayed. So "async is too slow" is what we don't understand.

> I just don't understand the push-back, I guess. RAID-1 is the poster child
> for synchronous writes for fault tolerance. It will whine constantly to
> anyone who will listen when operating on only one device, but at least it
> still works. I'm pretty sure nobody would use RAID-1 if its failure mode
> were: block writes until someone installs a replacement disk.

I think most of us in the "synchronous replication must be synchronous replication" camp are there because the guarantees of a simple RAID-1 just aren't good enough for us ;-)

a.

--
Aidan Van Dyk                                             aidan@highrise.ca
http://www.highrise.ca/
Create like a god, command like a king, work like a slave.
On 07/09/2012 05:15 PM, Josh Berkus wrote:

> "Total-consistency" replication is what I think you want: a guarantee
> that at any given time a read query on the master will return the same
> results as a read query on the standby. Heck, *most* people would like
> to have that. You would also be advancing database science in general
> if you could come up with a way to implement it.

Doesn't having consistent transactional state across the systems imply that?

> So I'm unclear on why sync rep would be faster than async rep, given
> that they use exactly the same mechanism. Explain?

Too many mental gymnastics. I get that async is "faster" than sync, but the inconsistent transactional state makes it *look* slower. If a customer makes an order, but just happens to check that order's state on the secondary before it can catch up, that's a net loss. Like I said, that's fine for our DR system, or a reporting mirror, or any one of several use cases, but it's not good enough for a failover when better alternatives exist. In this case, better alternatives are anything that can guarantee transaction durability: DRBD or PG sync.

PG sync mode does what I want in that regard; it just has no graceful failure state without relatively invasive intervention. Theoretically we could write a Pacemaker agent, or some other simple harness, that monitors both servers and performs an LSB HUP after modifying the primary node to disable synchronous_standby_names if the secondary dies, or promotes the secondary if the primary dies. But after being spoiled by DRBD knowing the instant the secondary disconnects, yet still remaining available until the secondary is restored, we can't justifiably switch to something that will have the primary hang for ten seconds between monitor checks and service reloads.

I'm just saying I considered it briefly while testing over the last few days, but there's no way I can make a business case for it. PG sync rep is a great step forward, but it's not for us. Yet.

--
Shaun Thomas
OptionsHouse
sthomas@optionshouse.com
On 10.07.2012 17:31, Shaun Thomas wrote:
> On 07/09/2012 05:15 PM, Josh Berkus wrote:
>> So I'm unclear on why sync rep would be faster than async rep, given
>> that they use exactly the same mechanism. Explain?
>
> Too many mental gymnastics. I get that async is "faster" than sync, but
> the inconsistent transactional state makes it *look* slower. If a
> customer makes an order, but just happens to check that order's state on
> the secondary before it can catch up, that's a net loss. [...]
>
> PG sync mode does what I want in that regard; it just has no graceful
> failure state without relatively invasive intervention.

You are mistaken. PostgreSQL's synchronous replication does not guarantee that the transaction is immediately replayed in the standby. It only guarantees that it's been sync'd to disk in the standby, but if there are open snapshots or the system is simply busy, it might take minutes or more until the effects of that transaction become visible.

I agree that such a mode would be highly useful: one where a transaction is not acknowledged to the client as committed until it's been replicated *and* replayed in the standby. And in that mode, a timeout after which the master just goes ahead without the standby would be useful. You could then configure your middleware and/or standby to not use the standby server for queries after that timeout.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
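The distinction Heikki describes is visible on a running standby; an illustrative query, using the function names of the 9.1/9.2 era:

    -- run on the standby: WAL received-and-flushed vs. WAL replayed;
    -- synchronous replication only waits on the former
    SELECT pg_last_xlog_receive_location() AS received,
           pg_last_xlog_replay_location()  AS replayed;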
On 07/10/2012 09:40 AM, Heikki Linnakangas wrote:

> You are mistaken. It only guarantees that it's been sync'd to disk in
> the standby, but if there are open snapshots or the system is simply
> busy, it might take minutes or more until the effects of that
> transaction become visible.

Well, crap. It's subtle distinctions like this I wish I'd noticed before. It doesn't really affect our plans; it just makes sync rep even less viable for our use case. Thanks for the correction! :)

--
Shaun Thomas
OptionsHouse
sthomas@optionshouse.com
On Tue, Jul 10, 2012 at 6:28 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 07/10/2012 01:11 AM, Daniel Farina wrote:
>
>> So if I get this straight, what you are saying is "be asynchronous
>> replication unless someone is around, in which case be synchronous"
>> is the mode you want.
>
> Er, no. I think I see where you might have gotten that, but no.

From your other communications, this sounds like exactly what you want, because RAID-1 is rather like this: on writes, a degraded RAID-1 need not wait on its (non-existent) mirror, and can be faster, but once it has caught up it is not allowed to leave synchronization, which is slower than writing to one disk alone, since it takes the maximum of the times to write to two disks. While in the degraded state there is effectively only one copy of the data, and while a mirror rebuild is occurring the replication is effectively asynchronous, to bring it up to date.

--
fdr
Shaun,

> Too many mental gymnastics. I get that async is "faster" than sync, but
> the inconsistent transactional state makes it *look* slower. If a
> customer makes an order, but just happens to check that order's state on
> the secondary before it can catch up, that's a net loss. Like I said,
> that's fine for our DR system, or a reporting mirror, or any one of
> several use cases, but it's not good enough for a failover when
> better alternatives exist. In this case, better alternatives are
> anything that can guarantee transaction durability: DRBD or PG sync.

Per your exchange with Heikki, that's not actually how SyncRep works in 9.1, so it's not giving you what you want anyway.

This is why we felt that the "sync rep if you can" mode was useless and didn't accept it into 9.1. The *only* difference between sync rep and async rep is whether or not the master waits for an ack that the standby has written to the log.

I think one of the new modes in 9.2 forces synch-to-DB before ack. No?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Shaun Thomas <sthomas@optionshouse.com> writes:

> When you re-connect a secondary device, it catches up as fast as possible
> by replaying waiting transactions, and then re-attaches to the cluster.
> Until it's fully caught up, it doesn't exist: DRBD acknowledges that the
> secondary is there and attempting to catch up, but does not leave
> "degraded" mode until the secondary reaches "UpToDate" status.

That's exactly what happens with PostgreSQL when using asynchronous replication and archiving. When joining the cluster, the standby will feed from the archives until nothing recent enough is left over there, and only at that point will it contact the master. For a real graceful setup you need both archiving and replication.

Then, synchronous replication means that no transaction can make it to the master alone. The use case is not being allowed to tell the client it's ok when you're at risk of losing the transaction by crashing the master while it's the only one knowing about it.

What you explain you want reads to me as "async replication + archiving".

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
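A minimal sketch of that "async replication + archiving" combination (the host name, user, and paths here are placeholders):

    # master, postgresql.conf: stream and archive
    wal_level = hot_standby
    max_wal_senders = 3
    archive_mode = on
    archive_command = 'cp %p /mnt/archive/%f'

    # standby, recovery.conf: replay from the archive first, then stream
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com user=replicator'
    restore_command = 'cp /mnt/archive/%f %p'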
On Tue, Jul 10, 2012 at 2:42 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>
> What you explain you want reads to me as "async replication + archiving".

Notable caveat: one can't very easily measure or bound the amount of transaction loss in any graceful way as-is. We only have "unlimited lag" and "2-safe or bust".

Presumably the DRBD setup run by the original poster can do this:

* run without a partner in a degraded mode (to use common RAID terminology)
* asynchronously rebuild and catch up a new remote RAID partner
* switch to synchronous RAID-1, which attenuates the source of block-device changes to get 2-safe reliability (i.e. blocking on confirmations from two block devices)

However, the tricky part is DRBD's heuristic for deciding when degraded-but-nonzero performance of the network or block device justifies dropping attempts to replicate to its partner. Postgres's interpretation is "halt, because 2-safe is currently impossible." DRBD's seems to be "continue" (but hopefully record a statistic, because who knows how often you are actually 2-safe, then). For example, what if DRBD can only complete one page per second for some reason? Does it simply have the primary wait at this glacial pace, or drop synchronous replication and go degraded? Or does it do something more clever than just a timeout? These may seem like theoretical concerns, but 'slow, but non-zero' progress has been an actual thorn in my side many times.

Regardless of what DRBD does, I think the problem with the async/sync duality as-is is that there is no nice way to manage exposure to transaction loss under various situations and requirements. I'm not really sure what a solution might look like; I was going to do something grotesque and conjure carefully orchestrated standby status packets to accomplish this.

--
fdr
Daniel Farina <daniel@heroku.com> writes:

> Notable caveat: one can't very easily measure or bound the amount of
> transaction loss in any graceful way as-is. We only have "unlimited
> lag" and "2-safe or bust".

¡Per-transaction! You can change your mind mid-transaction and ask for 2-safe or bust. That's the detail we've not been talking about in this thread, and it makes the whole solution practical in real life, at least for me.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
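Concretely, the per-transaction control looks like this (the table is hypothetical; with synchronous_standby_names set and a lower default, a transaction can equally opt *in* with SET LOCAL synchronous_commit TO on):

    BEGIN;
    -- opt this one transaction out of waiting for the standby
    SET LOCAL synchronous_commit TO off;
    UPDATE page_hits SET n = n + 1 WHERE id = 1;  -- hypothetical low-value write
    COMMIT;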
On 07/10/2012 06:02 PM, Daniel Farina wrote:

> For example, what if DRBD can only complete one page per second for
> some reason? Does it simply have the primary wait at this glacial
> pace, or drop synchronous replication and go degraded? Or does it do
> something more clever than just a timeout?

That's a good question, and way beyond what I know about the internals. :) In practice though, there are configurable thresholds, and if they're exceeded, it will invalidate the secondary. When using Pacemaker, we've actually had instances where the 10G link between the servers died, so each node thought the other was down. That led to the secondary node self-promoting and trying to steal the VIP from the primary. Throw in a gratuitous ARP, and you get a huge mess.

That led to what DRBD calls split-brain, because both nodes were running and writing to the block device. Thankfully, you can actually tell one node to discard its changes and re-subscribe. Doing that will replay the transactions from the "good" node on the "bad" one. And even then, it's a good idea to run an online verify to do a block-by-block checksum and correct any differences.

Of course, all of that's only possible because it's block-level replication. I can't even imagine PG doing anything like that. It would have to know the last good transaction from the primary and do an implied PIT recovery to reach that state, then re-attach for sync commits.

> Regardless of what DRBD does, I think the problem with the
> async/sync duality as-is is that there is no nice way to manage exposure
> to transaction loss under various situations and requirements.

Which would be handy. With synchronous commits, it's a given that the protocol is bi-directional. Then again, PG can detect when clients disconnect the instant they do so, and having such an event implicitly disable synchronous_standby_names until reconnect would be an easy fix. The database already keeps transaction logs, so replaying would still happen on re-attach. It could easily throw a warning for every sync-required commit so long as it's in "degraded" mode. Those alone are very small changes that don't really harm the intent of sync commit.

That's basically what a RAID-1 does, and people have been fine with that for decades.

--
Shaun Thomas
OptionsHouse
sthomas@optionshouse.com
Shaun Thomas <sthomas@optionshouse.com> writes:

>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is that there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.

Yeah.

> Which would be handy. With synchronous commits, it's a given that the
> protocol is bi-directional. Then again, PG can detect when clients
> disconnect the instant they do so, and having such an event implicitly
> disable

It's not always possible, given how TCP works, if I understand correctly.

> synchronous_standby_names until reconnect would be an easy fix. The
> database already keeps transaction logs, so replaying would still happen
> on re-attach. It could easily throw a warning for every sync-required
> commit so long as it's in "degraded" mode. Those alone are very small
> changes that don't really harm the intent of sync commit.

We already have that, with the archives. The missing piece is how to apply that to synchronous replication…

> That's basically what a RAID-1 does, and people have been fine with that
> for decades.

… and we want to cover *data* availability (durability), not just service availability.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On 7/11/12 6:41 AM, Shaun Thomas wrote:

> Which would be handy. With synchronous commits, it's a given that the
> protocol is bi-directional. Then again, PG can detect when clients
> disconnect the instant they do so, and having such an event implicitly
> disable synchronous_standby_names until reconnect would be an easy fix.
> The database already keeps transaction logs, so replaying would still
> happen on re-attach. It could easily throw a warning for every
> sync-required commit so long as it's in "degraded" mode. Those alone are
> very small changes that don't really harm the intent of sync commit.

So your suggestion is to have an "allow degraded" switch, where if the sync standby doesn't respond within a certain threshold, the master switches to async, with a warning for each transaction which asks for sync?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Jul 10, 2012 at 12:57 PM, Josh Berkus <josh@agliodbs.com> wrote:

> Per your exchange with Heikki, that's not actually how SyncRep works in
> 9.1, so it's not giving you what you want anyway.
>
> This is why we felt that the "sync rep if you can" mode was useless and
> didn't accept it into 9.1. The *only* difference between sync rep and
> async rep is whether or not the master waits for an ack that the standby
> has written to the log.
>
> I think one of the new modes in 9.2 forces synch-to-DB before ack. No?

No. Such a mode has been discussed and draft patches have been circulated, but nothing's been committed. The new mode in 9.2 is less synchronous than the previous mode (wait for remote write rather than remote fsync), not more.

Now, if we DID have such a mode, then many people would likely attempt to use synchronous replication in that mode as a way of ensuring that read queries can't see stale data, rather than as a method of providing increased durability. And in that case it sure seems like it would be useful to wait only if the standby is connected. In fact, you'd almost certainly want to have multiple standbys running synchronously, and have the ability to wait for only those connected at the moment. You might also want a way for standbys that lose their connection to the master to refuse to take any new snapshots until the slave is reconnected and has caught up. Then you could guarantee that any query run on the slave will see all the commits that are visible on the master (and possibly more, since commits become visible on the slave first), which would be useful for many applications.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
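For reference, the synchronous_commit levels as they stand in 9.2, settable per session or per transaction (the final line is just an example of picking one):

    -- on            wait for the standby to flush the commit record (9.1 behavior)
    -- remote_write  wait only for the standby to write it (new in 9.2)
    -- local         wait only for the local flush
    -- off           do not wait even for the local flush
    SET synchronous_commit = remote_write;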
Greetings,

On Wed, Jul 11, 2012 at 9:11 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 07/10/2012 06:02 PM, Daniel Farina wrote:
>
>> For example, what if DRBD can only complete one page per second for
>> some reason? Does it simply have the primary wait at this glacial
>> pace, or drop synchronous replication and go degraded? Or does it do
>> something more clever than just a timeout?
>
> That's a good question, and way beyond what I know about the internals. :)
> In practice though, there are configurable thresholds, and if they're
> exceeded, it will invalidate the secondary. When using Pacemaker, we've
> actually had instances where the 10G link between the servers died, so
> each node thought the other was down. That led to the secondary node
> self-promoting and trying to steal the VIP from the primary. Throw in a
> gratuitous ARP, and you get a huge mess.

That's why Pacemaker *recommends* STONITH (Shoot The Other Node In The Head). Whenever the standby decides to promote itself, it would just kill the former master (just in case). The STONITH mechanism has to use an independent connection. Additionally, a redundant link between cluster nodes is a must.

> That led to what DRBD calls split-brain, because both nodes were running
> and writing to the block device. [...]

I can't believe how many times I have seen this topic arise on the mailing list... I was myself about to start a thread like this! (Thanks, Shaun!)

I don't really get what people want out of synchronous streaming replication. DRBD (which is being used as the comparison) in protocol C is synchronous: it won't confirm a write unless it was written to disk on both nodes. PostgreSQL (8.4, 9.0, 9.1, ...) will work just fine with it, except that you don't have a standby you can connect to. Also, you need to set up a dedicated volume for the DRBD block device, set up DRBD, put the filesystem on top of DRBD, and handle the DRBD promotion and the partition mount (with possible FS error handling), and then start PostgreSQL after the FS is correctly mounted.
With synchronous streaming replication you can have about the same: the standby will have the changes written to disk before the master confirms the commit. I don't really care if the standby has already applied the changes to its DB (although that would certainly be nice). The point is: the data is on the standby, and if the master were to crash and I were to promote the standby, the standby would have the same committed data the server had before it crashed.

So why are we HA people bothering you DB people so much? To simplify things: it is simpler to set up synchronous streaming replication than to set up DRBD plus Pacemaker rules to promote DRBD, mount the FS, and then start pgsql. Also, there is a great perk to synchronous replication with Hot Standby: you get a read-only standby that can be used for some things (even though it doesn't always have exactly the same data as the master).

I mean, a lot of people here have a really valid point: 2-safe reliability is great, but how good is it if, when you lose it, the whole system just freezes? RAID-1 gives you 2-safe reliability, but no one would use it if the machine froze when you lose one disk. The same goes for DRBD: it offers 2-safe reliability too (at the block level), but it doesn't freeze if the secondary goes away!

Now, I see some people arguing that synchronous replication is apparently not an HA feature (those who say that SR doesn't fit the HA environment). Please, those people: answer why synchronous streaming replication is under the High Availability chapter of the PostgreSQL manual.

I really feel bad that people are so closed to fixing this. Making the master notice that the standby is no longer there and fall back to "standalone" mode seems to bother them so much that they wouldn't even allow *an option* for it. We are not asking you to change the default behavior, just to add an option that makes it gracefully continue operation and issue warnings. After all, if you lose a disk in a RAID array, you get some kind of indication of the failure so you can get it fixed ASAP: you know you are at risk until you fix it, but you can continue to function. Name a single RAID controller that will shut down your server on a single disk failure; I haven't seen any card that does that, because nobody would buy it.

Adding more on a related issue: what's up with the fact that the standby doesn't respect wal_keep_segments? This forces some people to copy the WAL files *twice*: once through streaming replication, and again to a WAL archive. If the master dies and you have more than one standby (say, one synchronous and two asynchronous), you can actually point the async ones at the sync one once you promote it (as long as you trick the sync one into *not* switching the timeline, by moving recovery.conf away and restarting, instead of using "normal" promotion), but if you don't have the WAL archive and one of the standbys was too lagged, it won't be able to recover.

Please, stop arguing about all of this: I don't think that adding an option will hurt anybody (especially because the work was already done by someone). We are not asking to change how things work; we just want an option to decide whether we want it to freeze on standby disconnection, or to continue automatically. Is that asking so much?

Sincerely,

Ildefonso
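On the wal_keep_segments point, for reference, the master-side setting looks like this (the value is only an example; it should be sized to the expected outage window):

    # postgresql.conf on the master: retain at least this many 16MB WAL
    # segments in pg_xlog so a disconnected standby can resume streaming
    wal_keep_segments = 256    # ~4GB of retained WAL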
> Please, stop arguing about all of this: I don't think that adding an
> option will hurt anybody (especially because the work was already done
> by someone). We are not asking to change how things work; we just want
> an option to decide whether we want it to freeze on standby
> disconnection, or to continue automatically. Is that asking so much?

The objection is that, *given the way synchronous replication currently works*, having that kind of an option would make the "synchronous" setting fairly meaningless. The only benefit synchronous replication gives you is the guarantee that a write on the master is also on the standby. If you remove that guarantee, you are using asynchronous replication, even if the setting says synchronous.

I think what you really want is a separate "auto-degrade" setting. That is, a setting which says "if no synchronous standby is present, auto-degrade to async/standalone, and start writing warning messages to the logs and to any session that runs a synchronous transaction". That's an approach which makes some sense, but AFAICT it is somewhat different from the proposed patch.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Jul 11, 2012 at 11:48 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> Please, stop arguing about all of this: I don't think that adding an
>> option will hurt anybody (especially because the work was already done
>> by someone). We are not asking to change how things work; we just want
>> an option to decide whether we want it to freeze on standby
>> disconnection, or to continue automatically. Is that asking so much?
>
> The objection is that, *given the way synchronous replication currently
> works*, having that kind of an option would make the "synchronous"
> setting fairly meaningless. The only benefit synchronous replication
> gives you is the guarantee that a write on the master is also on the
> standby. If you remove that guarantee, you are using asynchronous
> replication, even if the setting says synchronous.

I know how synchronous replication works; I have read about it several times, I have seen it in real life, and I have seen it in virtual test environments. And no, it doesn't make synchronous replication meaningless, because it will work synchronously if it has someone to sync with, and asynchronously (or standalone) if it doesn't: that's perfect for an HA environment.

> I think what you really want is a separate "auto-degrade" setting. That
> is, a setting which says "if no synchronous standby is present,
> auto-degrade to async/standalone, and start writing warning messages to
> the logs and to any session that runs a synchronous transaction".
> That's an approach which makes some sense, but AFAICT it is somewhat
> different from the proposed patch.

Certainly different from the current patch, though the one I saw had, I believe, everything you describe there except the additional warnings.

As synchronous standby currently is, it just doesn't fit the HA usage, and if you really want to keep it that way, it doesn't belong in the HA chapter of the pgsql documentation and should be moved. And NO, async replication will *not* work for HA, because the master can have more transactions than the standby, and if the master crashes, the standby has no way to recover those transactions. With synchronous replication we have *exactly* what we need: the data is on the standby; after all, it will apply it once we promote it.

Ildefonso.
On Wed, Jul 11, 2012 at 3:03 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Notable caveat: one can't very easily measure or bound the amount of
>> transaction loss in any graceful way as-is. We only have "unlimited
>> lag" and "2-safe or bust".
>
> ¡Per-transaction! You can change your mind mid-transaction and ask for
> 2-safe or bust. That's the detail we've not been talking about in this
> thread, and it makes the whole solution practical in real life, at
> least for me.

It's a pretty good feature, but it's pretty dissatisfying that, as an administrator or provider (as opposed to a user setting synchronous_commit, as you are saying), one cannot have the latency of asynchronous transactions without exposing users to unbounded loss.

If I had a strong opinion on *how* this should be tunable, I'd voice it, but I think it's worth insisting that there is a missing part of this continuum: risk management involving non-zero but not-unbounded transaction loss is under-served. DRBD seems to have some heuristic somewhere in-between that makes people happy. I'm not saying it should be copied, but the fact that it makes people happy may be worth understanding.

I was quite excited by the syncrep feature because it does open the door to writing those, even if painfully, at all, since we now have both "unbounded" and "strictly bounded".

--
fdr
On Wed, Jul 11, 2012 at 6:41 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is that there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.
>
> Which would be handy. With synchronous commits, it's a given that the
> protocol is bi-directional. Then again, PG can detect when clients
> disconnect the instant they do so, and having such an event implicitly
> disable synchronous_standby_names until reconnect would be an easy fix.
> The database already keeps transaction logs, so replaying would still
> happen on re-attach. It could easily throw a warning for every
> sync-required commit so long as it's in "degraded" mode. Those alone are
> very small changes that don't really harm the intent of sync commit.
>
> That's basically what a RAID-1 does, and people have been fine with that
> for decades.

But RAID-1 as nominally seen is a fundamentally different problem, with much smaller differences in latency, bandwidth, and connectivity. Perhaps useful for study, but to suggest the problem is *that* similar I think is wrong. I think your wording is even more right here than you suggest: "that's *basically* what a RAID-1 does". I'm pretty unhappy with many user-facing aspects of this formulation, even though I think the fundamental need being addressed is reasonable.

But, putting that aside, why not write a piece of middleware that does precisely this, or whatever you want? It can live on the same machine as Postgres and ack synchronous commit when nobody is home, and notify (e.g. page) you in the most precise way you want if nobody is home "for a while".

--
fdr
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org]
> On Behalf Of Jose Ildefonso Camargo Tolosa
>
> Please, stop arguing about all of this: I don't think that adding an
> option will hurt anybody (especially because the work was already done
> by someone). We are not asking to change how things work; we just want
> an option to decide whether we want it to freeze on standby
> disconnection, or to continue automatically. Is that asking so much?

I think this kind of decision should be made by an outside utility or script. It would be better if the standby being down during sync replication could be detected from outside, with a command then sent to the master to change its mode or settings appropriately without stopping it. Putting more and more of this kind of logic into the replication code will make it more cumbersome.

With Regards,
Amit Kapila.
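A minimal sketch of such an outside check, run against the master (it assumes only the pg_stat_replication view, available since 9.1):

    -- zero rows here means no synchronous standby is connected,
    -- and synchronous commits on the master will block
    SELECT application_name, client_addr, state
      FROM pg_stat_replication
     WHERE sync_state = 'sync';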
Hi,

Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> writes:
> environments. And no, it doesn't make synchronous replication
> meaningless, because it will work synchronously if it has someone to
> sync with, and asynchronously (or standalone) if it doesn't: that's
> perfect for an HA environment.

You seem to want Service Availability when we are providing Data Availability. I'm not saying you shouldn't ask for what you're asking, just that it is a different need.

If you troll the archives, you will see that this debate has received much consideration already. The conclusion is that if you care about Service Availability you should have two standby servers and set them both as candidates to be the synchronous one. That way, when you lose one standby the service is unaffected: the second standby is now the synchronous one, and it's possible to re-attach the failed standby live, with or without archiving (with archiving is preferred, so that the master isn't involved in the catch-up phase).

> As synchronous standby currently is, it just doesn't fit the HA usage,

It does actually allow both data high availability and service high availability, provided that you feed at least two standbys. What you seem to be asking for is both data and service high availability with only two nodes. You're right that we can not provide that with current releases of PostgreSQL, and I'm not sure anyone has a solid plan to make that happen.

> and if you really want to keep it that way, it doesn't belong in the
> HA chapter of the pgsql documentation and should be moved. And NO,
> async replication will *not* work for HA, because the master can have
> more transactions than the standby, and if the master crashes, the
> standby has no way to recover those transactions. With synchronous
> replication we have *exactly* what we need: the data is on the standby;
> after all, it will apply it once we promote it.

Exactly. We want data availability first. Service availability is important too, and for that you need another standby.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On 07/12/2012 12:31 AM, Daniel Farina wrote:

> But RAID-1 as nominally seen is a fundamentally different problem,
> with much smaller differences in latency, bandwidth, and connectivity.
> Perhaps useful for study, but to suggest the problem is *that* similar
> I think is wrong.

Well, yes and no. One of the reasons I brought up DRBD was because it's basically RAID-1 over a network interface. It's not without overhead, but a few basic pgbench tests show it's still 10-15% faster than a synchronous PG setup for two servers in the same rack. Greg Smith's tests show that beyond a certain point, a synchronous PG setup effectively becomes untenable simply due to network latency in the protocol implementation. In reality, it probably wouldn't be usable beyond two servers in different datacenters in the same city.

RAID-1 was the model for DRBD, but I brought it up only because it's pretty much the definition of a synchronous commit that degrades gracefully. I'd even suggest it's more important in a network context than for RAID-1, because you're far more likely to get sync interruptions due to network issues than you are to have a disk fail.

> But, putting that aside, why not write a piece of middleware that
> does precisely this, or whatever you want? It can live on the same
> machine as Postgres and ack synchronous commit when nobody is home,
> and notify (e.g. page) you in the most precise way you want if nobody
> is home "for a while".

You're right that there are lots of ways to kinda get this ability; they're just not mature enough or capable enough to really matter. Tailing the log to watch for secondary disconnect is too slow. Monit- or Nagios-style checks are too slow and unreliable. A custom-built middle layer (a master-slave plugin for Pacemaker, for example) is too slow. All of these would rely on some kind of check interval. Set it too high, and we get 10,000×n missed transactions for n seconds. Too low, and we'd increase the likelihood of false positives and unnecessary detachments.

If it's possible through a PG 9.x extension, that'd probably be the way to *safely* handle it as a bolt-on solution. If the original author of the patch can convert it to such a beast, we'd install it approximately five seconds after it finished compiling.

So far as transaction durability is concerned: we have a continuous background rsync over dark fiber for archived transaction logs, DRBD for block-level sync, filesystem snapshots for our backups, a redundant async DR cluster, an offsite backup location, and a tape archival service stretching back seven years. And none of that will cause the master to stop processing transactions unless the master itself dies and triggers a failover.

Using PG sync in its current incarnation would introduce an extra failure scenario that wasn't there before. I'm pretty sure we're not the only ones avoiding it for exactly that reason. Our queue discards messages it can't fulfil within ten seconds and then throws an error for each one. We need to decouple the secondary as quickly as possible if it becomes unresponsive, and there's really no way to do that without something in the database, one way or another.

--
Shaun Thomas
OptionsHouse
sthomas@optionshouse.com
On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> So far as transaction durability is concerned: we have a continuous
> background rsync over dark fiber for archived transaction logs, DRBD for
> block-level sync, filesystem snapshots for our backups, a redundant async
> DR cluster, an offsite backup location, and a tape archival service
> stretching back seven years. And none of that will cause the master to
> stop processing transactions unless the master itself dies and triggers a
> failover.

Right. So say the dark fiber between New Orleans and Seattle (pick two places for your datacenters) happens to be the first thing failing in your NO data center. Disconnect the sync-ness and continue: not a problem, unless it happens to be Aug 29, 2005. You have lost data. Maybe only a bit. Maybe it wasn't even important. But that's not for PostgreSQL to decide. Because your PG on DRBD "continued" when it couldn't replicate to Seattle, it told its clients the data was durable just minutes before the whole DC was under water.

OK, so a wise admin team would have removed the NO DC from its primary role days before that hit. Change NO to NYC and the date to Sept 11, 2001.

OK, so maybe we can concede that these types of major catastrophes are more devastating to us than losing some data. Now suppose your primary server was in AWS US East last week. Its sync slave was in the affected AZ, but your PG primary continues on until, since it was an EC2 instance, it disappears. Now where is your data? Or the fire marshal orders an EPO for the data center (or whole building), and the connection to your backup goes down minutes before your servers or other network peers do.

> Using PG sync in its current incarnation would introduce an extra failure
> scenario that wasn't there before. I'm pretty sure we're not the only ones
> avoiding it for exactly that reason. Our queue discards messages it can't
> fulfil within ten seconds and then throws an error for each one. We need
> to decouple the secondary as quickly as possible if it becomes
> unresponsive, and there's really no way to do that without something in
> the database, one way or another.

It introduces an "extra failure" because it has introduced an extra data-durability guarantee. Sure, many people don't *really* want that guarantee, even though they would like the "maybe guaranteed" version of it. But that fine line is actually a difficult (impossible?) one to define if you don't know, at the moment of decision, what the next few moments will or could become.

a.

--
Aidan Van Dyk                                             aidan@highrise.ca
http://www.highrise.ca/
Create like a god, command like a king, work like a slave.
On Thu, Jul 12, 2012 at 11:33:26AM +0530, Amit Kapila wrote:
>> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org]
>> On Behalf Of Jose Ildefonso Camargo Tolosa
>>
>> Please, stop arguing about all of this: I don't think that adding an
>> option will hurt anybody (especially because the work was already done
>> by someone). We are not asking to change how things work; we just want
>> an option to decide whether we want it to freeze on standby
>> disconnection, or to continue automatically. Is that asking so much?
>
> I think this kind of decision should be made by an outside utility or
> script. It would be better if the standby being down during sync
> replication could be detected from outside, with a command then sent to
> the master to change its mode or settings appropriately without stopping
> it. Putting more and more of this kind of logic into the replication
> code will make it more cumbersome.

We certainly would need something external to inform administrators that the system is no longer synchronous.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
On Thu, Jul 12, 2012 at 08:21:08AM -0500, Shaun Thomas wrote:
>> But, putting that aside, why not write a piece of middleware that
>> does precisely this, or whatever you want? It can live on the same
>> machine as Postgres and ack synchronous commit when nobody is home,
>> and notify (e.g. page) you in the most precise way you want if nobody
>> is home "for a while".
>
> You're right that there are lots of ways to kinda get this ability;
> they're just not mature enough or capable enough to really matter.
> Tailing the log to watch for secondary disconnect is too slow. Monit-
> or Nagios-style checks are too slow and unreliable. A custom-built
> middle layer (a master-slave plugin for Pacemaker, for example) is
> too slow. All of these would rely on some kind of check interval.
> Set it too high, and we get 10,000×n missed transactions for n
> seconds. Too low, and we'd increase the likelihood of false
> positives and unnecessary detachments.

Well, the problem also exists if we add it as an internal database feature: how long do we wait to consider the standby dead, how do we inform administrators, etc. I don't think anyone says the feature is useless, but it isn't going to be a simple boolean either.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
On 07/12/2012 12:02 PM, Bruce Momjian wrote: > Well, the problem also exists if we add it as an internal database > feature --- how long do we wait to consider the standby dead, how do > we inform administrators, etc. True. Though if there is no secondary connected, either because it's not there yet, or because it disconnected, that's an easy check. It's the network lag/stall detection that's tricky. > I don't think anyone says the feature is useless, but it isn't going > to be a simple boolean either. Oh $Deity no. I'd never suggest that. I just tend to be overly verbose, and sometimes my intent gets lost in the rambling as I try to explain my perspective. I apologize if it somehow came across that anyone could just flip a switch and have it work. My C is way too rusty, or I'd be writing an extension right now to do this, or be looking over that patch I linked to originally to make suitable adaptations. I know I talk about how relatively handy DRBD is, but it's also a gigantic PITA since it has to exist underneath the actual filesystem. :) -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com
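A minimal sketch of the easy half of that check, as seen from the master's pg_stat_replication view (Python with psycopg2; the DSN and monitoring role are placeholders):

    import psycopg2

    # Connect to the master and look at who is streaming from it.
    # pg_stat_replication exists as of 9.1; sync_state is 'sync' for
    # the active synchronous standby and 'potential' for candidates.
    conn = psycopg2.connect("host=master dbname=postgres user=monitor")
    cur = conn.cursor()
    cur.execute("SELECT application_name, state, sync_state "
                "FROM pg_stat_replication")
    rows = cur.fetchall()
    if not any(sync_state == 'sync' for _, _, sync_state in rows):
        print("no synchronous standby connected")
    # The hard half -- a connected-but-stalled standby -- is not visible
    # in a single sample; you would have to watch flush_location advance
    # across samples, which is exactly the interval problem above.
    conn.close()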
On Thu, Jul 12, 2012 at 8:35 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Hi, > > Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> writes: >> environments. And no, it doesn't make synchronous replication >> meaningless, because it will work synchronously if it has someone to >> sync to, and work async (or standalone) if it doesn't: that's perfect >> for an HA environment. > > You seem to want Service Availability when we are providing Data > Availability. I'm not saying you shouldn't ask what you're asking, just > that it is a different need. Yes, and no: I don't see why we can't have an option to choose which one we want. I can see the point of "data availability": it is better to freeze the service than to risk losing transactions... however, try to explain that to some managers: "well, you know, the DB server froze the whole bank system because, well, the standby server died, and we didn't want to risk transaction loss, so we just froze the master... you know, in case the master were to die too before we had a reliable standby." I don't think a manager would really understand why you would block the whole company's system just because *the standby* server died (and why don't you block it when the master dies?!). Now, maybe that's a bad example, I know a bank should have at least 3 or 4 servers, with some of them in different geographical areas, but just think of the typical boss. With "Service Availability", you have data availability most of the time, until one of the servers fails (if you have just 2 nodes; if you have more than two: well, good for you!). But you can keep going with a single server, understanding that you are at high risk, which has to be fixed real soon (an emergency). > > If you troll the archives, you will see that this debate has received > much consideration already. The conclusion is that if you care about > Service Availability you should have 2 standby servers and set them both > as candidates to being the synchronous one. That's more cost, and for most applications it isn't worth the extra cost. Really, I see the point you have, and I have *never* asked to remove the data guarantees, only to have an option to relax them if the particular situation requires it: "enough safety" for a given cost. > > That way, when you lose one standby the service is unaffected, the > second standby is now the synchronous one, and it's possible to > re-attach the failed standby live, with or without archiving (which is > preferred so that the master isn't involved in the catch-up phase). > >> As synchronous standby currently is, it just doesn't fit the HA usage, > > It does actually allow both data high availability and service high > availability, provided that you feed at least two standbys. Still, it doesn't fit. You need to spend more on hardware, and more on power (and money there), and more carbon footprint... you get the point. Also, having 3 servers for your DB can be necessary (and possible) for some companies, but for others: no. > > What you seem to be asking is both data and service high availability > with only two nodes. You're right that we cannot provide that with > current releases of PostgreSQL. I'm not sure anyone has a solid plan to > make that happen. > >> and if you really want to keep it that way, it doesn't belong in the >> HA chapter of the pgsql documentation, and should be moved.
>> And NO, async replication will *not* work for HA, because the master can have >> more transactions than the standby, and if the master crashes, the standby >> will have no way to recover those transactions; with synchronous >> replication we have *exactly* what we need: the data in the standby, >> after all, it will apply it once we promote it. > > Exactly. We want data availability first. Service availability is > important too, and for that you need another standby. Yeah, you need that with PostgreSQL, but not with DRBD, for example (sorry, but DRBD is one of the flagships of HA things in the Linux world). Also, I'm not convinced about the "2nd standby" thing... I mean, just read this in the docs, which is a little alarming: "If primary restarts while commits are waiting for acknowledgement, those waiting transactions will be marked fully committed once the primary database recovers. There is no way to be certain that all standbys have received all outstanding WAL data at time of the crash of the primary. Some transactions may not show as committed on the standby, even though they show as committed on the primary. The guarantee we offer is that the application will not receive explicit acknowledgement of the successful commit of a transaction until the WAL data is known to be safely received by the standby." So... there is no *real* guarantee here either... I don't know how I skipped that paragraph before today... I mean, this implies that it is possible that a transaction could be marked as committed on the master, but the app was not informed of that (and thus could try to send it again), and the transaction was NOT applied on the standby... how can this happen? I mean, when the master comes back, shouldn't the standby get the missing WAL pieces from the master and then apply the transaction? The standby part is the one that I don't really get. On the application side... well, there are several ways in which you can miss the "commit confirmation": connection issues at the worst moment, and such, so I guess it is not *so* serious, and the app should have a way of checking its last transaction if it lost connectivity to the server before getting the transaction committed.
On Thu, Jul 12, 2012 at 9:28 AM, Aidan Van Dyk <aidan@highrise.ca> wrote: > On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas@optionshouse.com> wrote: > >> So far as transaction durability is concerned... we have a continuous >> background rsync over dark fiber for archived transaction logs, DRBD for >> block-level sync, filesystem snapshots for our backups, a redundant async DR >> cluster, an offsite backup location, and a tape archival service stretching >> back for seven years. And none of that will cause the master to stop >> processing transactions unless the master itself dies and triggers a >> failover. > > Right, so suppose the dark fiber between New Orleans and Seattle (pick two > places for your datacenter) happens to be the first thing failing in > your NO data center. Disconnect the sync-ness, and continue. Not a > problem, unless it happens to be Aug 29, 2005. > > You have lost data. Maybe only a bit. Maybe it wasn't even > important. But that's not for PostgreSQL to decide. I never asked for it... but you (the one who is configuring the system) can decide, and should be able to decide... right now: we can't decide. > > But because your PG on DRBD "continued" when it couldn't replicate to > Seattle, it told its clients the data was durable, just minutes > before the whole DC was under water. Yeah, well, what is the probability of all of that?... really tiny. I bet it is more likely that you win the lottery than that all of these events happen within that time frame. But risking monetary losses because, for example, the online store stopped accepting orders while the standby server was down: that's not acceptable for some companies (and some companies just can't buy 3 DB servers, or more!). > > OK, so a wise admin team would have removed the NO DC from its > primary role days before that hit. > > Change NO to NYC and the date to Sept 11, 2001. > > OK, so maybe we can concede that these types of major catastrophes are > more devastating to us than losing some data. > > Now your primary server was in AWS US East last week. Its sync slave > was in the affected AZ, but your PG primary continues on, until, since > it was an EC2 instance, it disappears. Now where is your data? Who would *really* trust your PostgreSQL DB to EC2?... I mean, the I/O is not very good, and the price is not exactly so low that you would take that risk. All in all: you are still stringing together coincidences that have *so low* a probability.... > > Or the fire marshal orders the data center (or whole building) EPO, > and the connection to your backup goes down minutes before your > servers or other network peers. > >> Using PG sync in its current incarnation would introduce an extra failure >> scenario that wasn't there before. I'm pretty sure we're not the only ones >> avoiding it for exactly that reason. Our queue discards messages it can't >> fulfil within ten seconds and then throws an error for each one. We need to >> decouple the secondary as quickly as possible if it becomes unresponsive, >> and there's really no way to do that without something in the database, one >> way or another. > > It introduces an "extra failure" because it has introduced an "extra > data durability guarantee". > > Sure, many people don't *really* want that data durability guarantee, > even though they would like the "maybe guaranteed" version of it. > > But that fine line is actually a difficult (impossible?) one to define > if you don't know, at the moment of decision, what the next few > moments will/could become.
You *never* know. And the truth is that you have to make the decision with what you have. If you can pay for 10 servers nationwide: good for you; not all of us can afford that (man, I could barely pay for two, and that's because I *know* I don't want to risk losing the data or the service because a single server died). As it currently is, freezing the master because the standby died is not good for all cases (and I dare say: for most cases), and having to wait for Pacemaker or other monitoring to notice that, change the master config, and reload... it will cause a service disruption! (for several seconds, usually ~30 seconds).
On Thu, Jul 12, 2012 at 12:17 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Jul 12, 2012 at 11:33:26AM +0530, Amit Kapila wrote: >> > From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] >> > On Behalf Of Jose Ildefonso Camargo Tolosa >> >> > Please, stop arguing about all of this: I don't think that adding an >> > option will hurt anybody (especially because the work was already done >> > by someone). We are not asking to change how things work; we just >> > want an option to decide whether we want it to freeze on standby >> > disconnection, or whether we want it to continue automatically... is that >> > asking so much? >> >> I think this kind of decision should be made by an outside utility or >> script. >> It would be better if it could be detected from outside that the standby is down >> during sync replication, and a command sent to the master to change its mode or >> settings appropriately without stopping the master. >> Putting more and more of this kind of logic into the replication code will make it >> more cumbersome. > > We certainly would need something external to inform administrators that > the system is no longer synchronous. That is *mandatory*, just as you monitor DRBD or disk arrays: if a disk fails, an alert has to be issued, so it can be fixed as soon as possible. But such alerts can wait 30 seconds to be sent out, so any monitoring system would be able to handle that; we just need to get the current system status from the monitoring system and create the corresponding rules: a simple matter, actually.
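The "corresponding rule" can be an ordinary Nagios-style plugin wrapped around the same view; the monitoring system runs it on its usual check interval and raises the alert. A sketch (the DSN is a placeholder; exit codes follow the Nagios convention of 0=OK, 2=CRITICAL, 3=UNKNOWN):

    #!/usr/bin/env python
    import sys
    import psycopg2

    try:
        conn = psycopg2.connect("host=master dbname=postgres user=monitor")
        cur = conn.cursor()
        cur.execute("SELECT count(*) FROM pg_stat_replication "
                    "WHERE sync_state IN ('sync', 'potential')")
        (candidates,) = cur.fetchone()
    except psycopg2.OperationalError as e:
        print("UNKNOWN: cannot reach master: %s" % e)
        sys.exit(3)
    if candidates == 0:
        print("CRITICAL: no sync-capable standby connected")
        sys.exit(2)
    print("OK: %d sync-capable standby(s) connected" % candidates)
    sys.exit(0)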
On Thu, Jul 12, 2012 at 8:27 PM, Jose Ildefonso Camargo Tolosa > Yeah, you need that with PostgreSQL, but not with DRBD, for example > (sorry, but DRBD is one of the flagships of HA things in the Linux > world). Also, I'm not convinced about the "2nd standby" thing... I > mean, just read this in the docs, which is a little alarming: > > "If primary restarts while commits are waiting for acknowledgement, > those waiting transactions will be marked fully committed once the > primary database recovers. There is no way to be certain that all > standbys have received all outstanding WAL data at time of the crash > of the primary. Some transactions may not show as committed on the > standby, even though they show as committed on the primary. The > guarantee we offer is that the application will not receive explicit > acknowledgement of the successful commit of a transaction until the > WAL data is known to be safely received by the standby." > > So... there is no *real* guarantee here either... I don't know how I > skipped that paragraph before today... I mean, this implies that it > is possible that a transaction could be marked as committed on the > master, but the app was not informed of that (and thus could try to > send it again), and the transaction was NOT applied on the standby... > how can this happen? I mean, when the master comes back, shouldn't the > standby get the missing WAL pieces from the master and then apply the > transaction? The standby part is the one that I don't really get. On > the application side... well, there are several ways in which you can > miss the "commit confirmation": connection issues at the worst moment, > and such, so I guess it is not *so* serious, and the app should > have a way of checking its last transaction if it lost connectivity to > the server before getting the transaction committed. But you already have that in a single-server situation as well. There is a window in which the commit is "durable" (visible to others, and will survive recovery from a crash) but the client doesn't yet know it's committed (and might never get the commit message due to server crash, network disconnect, client middle-tier crash, etc.). So people are already susceptible to that, and defending against it, no? ;-) And they are susceptible to that if they are on PostgreSQL, Oracle, MS SQL, DB2, etc. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, Jul 12, 2012 at 8:29 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > On Thu, Jul 12, 2012 at 8:27 PM, Jose Ildefonso Camargo Tolosa > >> Yeah, you need that with PostgreSQL, but not with DRBD, for example >> (sorry, but DRBD is one of the flagships of HA things in the Linux >> world). Also, I'm not convinced about the "2nd standby" thing... I >> mean, just read this in the docs, which is a little alarming: >> >> "If primary restarts while commits are waiting for acknowledgement, >> those waiting transactions will be marked fully committed once the >> primary database recovers. There is no way to be certain that all >> standbys have received all outstanding WAL data at time of the crash >> of the primary. Some transactions may not show as committed on the >> standby, even though they show as committed on the primary. The >> guarantee we offer is that the application will not receive explicit >> acknowledgement of the successful commit of a transaction until the >> WAL data is known to be safely received by the standby." >> >> So... there is no *real* guarantee here either... I don't know how I >> skipped that paragraph before today... I mean, this implies that it >> is possible that a transaction could be marked as committed on the >> master, but the app was not informed of that (and thus could try to >> send it again), and the transaction was NOT applied on the standby... >> how can this happen? I mean, when the master comes back, shouldn't the >> standby get the missing WAL pieces from the master and then apply the >> transaction? The standby part is the one that I don't really get. On >> the application side... well, there are several ways in which you can >> miss the "commit confirmation": connection issues at the worst moment, >> and such, so I guess it is not *so* serious, and the app should >> have a way of checking its last transaction if it lost connectivity to >> the server before getting the transaction committed. > > But you already have that in a single-server situation as well. There > is a window in which the commit is "durable" (visible to others, and > will survive recovery from a crash) but the client doesn't yet know > it's committed (and might never get the commit message due to server > crash, network disconnect, client middle-tier crash, etc.). > > So people are already susceptible to that, and defending against it, no? ;-) Right. What I'm saying is about that particular part of the docs: "If primary restarts while commits are waiting for acknowledgement, those waiting transactions will be marked fully committed once the primary database recovers. "(....)"Some transactions may not show as committed on the standby, even though they show as committed on the primary."(...) See? It sounds like, after the primary database recovers, the standby will still not have the transaction committed, and as far as I thought I knew, the standby should get that over the WAL stream from the master once it reconnects to it. > > And they are susceptible to that if they are on PostgreSQL, Oracle, MS > SQL, DB2, etc. Certainly. That's why I said: (...)"The standby part is the one that I don't really get. On the application side... well, there are several ways in which you can miss the "commit confirmation": connection issues at the worst moment, and such, so I guess it is not *so* serious, and the app should have a way of checking its last transaction if it lost connectivity to the server before getting the transaction committed."
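The usual application-side defence for that lost-acknowledgement window, on any of those databases, is an idempotency key: tag the transaction with a client-generated id, and if the connection dies before the commit is acknowledged, reconnect and look the id up. A rough sketch (the orders table and its txn_id column are invented for illustration):

    import uuid
    import psycopg2

    def commit_with_check(dsn, amount):
        txn_id = str(uuid.uuid4())  # client-generated idempotency key
        try:
            conn = psycopg2.connect(dsn)
            cur = conn.cursor()
            cur.execute("INSERT INTO orders (txn_id, amount) "
                        "VALUES (%s, %s)", (txn_id, amount))
            conn.commit()
            return True
        except psycopg2.OperationalError:
            # The connection died before we saw the acknowledgement;
            # the commit may or may not have happened. Ask the server.
            conn = psycopg2.connect(dsn)
            cur = conn.cursor()
            cur.execute("SELECT 1 FROM orders WHERE txn_id = %s",
                        (txn_id,))
            return cur.fetchone() is not None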
On Thu, Jul 12, 2012 at 4:10 PM, Shaun Thomas <sthomas@optionshouse.com> wrote: > On 07/12/2012 12:02 PM, Bruce Momjian wrote: > >> Well, the problem also exists if we add it as an internal database >> feature --- how long do we wait to consider the standby dead, how do >> we inform administrators, etc. > > > True. Though if there is no secondary connected, either because it's not > there yet, or because it disconnected, that's an easy check. It's the > network lag/stall detection that's tricky. Well, yes... but how does PostgreSQL currently notice that its "main synchronous standby" went away and that it has to use another standby as the synchronous one? How long does it take to notice that? > > >> I don't think anyone says the feature is useless, but it isn't going >> to be a simple boolean either. > > > Oh $Deity no. I'd never suggest that. I just tend to be overly verbose, and > sometimes my intent gets lost in the rambling as I try to explain my > perspective. I apologize if it somehow came across that anyone could just > flip a switch and have it work. > > My C is way too rusty, or I'd be writing an extension right now to do this, > or be looking over that patch I linked to originally to make suitable > adaptations. I know I talk about how relatively handy DRBD is, but it's also > a gigantic PITA since it has to exist underneath the actual filesystem. :) > > > -- > Shaun Thomas > OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 > 312-444-8534 > sthomas@optionshouse.com
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] > On Behalf Of Jose Ildefonso Camargo Tolosa >> On Thu, Jul 12, 2012 at 9:28 AM, Aidan Van Dyk <aidan@highrise.ca> wrote: > On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas@optionshouse.com> wrote: > > As it currently is, freezing the master because the standby > died is not good for all cases (and I dare say: for most cases), and > having to wait for Pacemaker or other monitoring to notice that, change > the master config, and reload... it will cause a service disruption! (for > several seconds, usually ~30 seconds). Yes, it is true that this can cause a service disruption, but the same will be true even if the master detects it internally via a timeout. By keeping this external, the current behavior of PostgreSQL can be maintained: if there is no standby in sync mode, it will wait; and the purpose is still served, since the command to change mode can be sent to the master from outside.
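As a concrete illustration of sending that command from outside: on 9.1 there is no SQL-level switch, so the external script has to rewrite synchronous_standby_names in postgresql.conf and ask for a reload (the setting is reloadable without a restart). A sketch; the config path and DSN are assumptions:

    import re
    import psycopg2

    CONF = "/var/lib/postgresql/9.1/main/postgresql.conf"

    def set_sync_names(value):
        # Rewrite the synchronous_standby_names line in place...
        with open(CONF) as f:
            conf = f.read()
        conf = re.sub(r"(?m)^#?\s*synchronous_standby_names\s*=.*$",
                      "synchronous_standby_names = '%s'" % value, conf)
        with open(CONF, "w") as f:
            f.write(conf)
        # ...and signal the postmaster to re-read its configuration.
        conn = psycopg2.connect("dbname=postgres user=postgres")
        conn.autocommit = True
        conn.cursor().execute("SELECT pg_reload_conf()")
        conn.close()

    set_sync_names("")          # standby gone: fall back to standalone
    # set_sync_names("standby1")  # restore sync once it has caught up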
Hi all, Here are some (slightly too long) thoughts about this. Shaun Thomas wrote 2012-07-12 22:40: > On 07/12/2012 12:02 PM, Bruce Momjian wrote: > >> Well, the problem also exists if we add it as an internal database >> feature --- how long do we wait to consider the standby dead, how do >> we inform administrators, etc. > > True. Though if there is no secondary connected, either because it's not > there yet, or because it disconnected, that's an easy check. It's the > network lag/stall detection that's tricky. It is indeed tricky to detect this. If you don't get an (immediate) reply from the secondary (and you never do!), then all you can do is wait and *eventually* (after how long? 250ms? 10s?) assume that there is no connection between them. The conclusion may very well be wrong sometimes. A second problem is that we still don't know if this is caused by some kind of network problem or if it's caused by the secondary not running. It's perfectly possible that both servers are working, but just can't communicate at the moment. The thing is that what we do next (at least if our data is important, and why otherwise use synchronous replication of any kind...) depends on what *did* happen. Assume that we have two database servers. At any time we need at most one primary database to be running. Without that requirement our data can get messed up completely... If HA is important to us, we may choose to do a failover to the secondary (and live without replication for the moment) if the primary fails. With synchronous replication, we can do this without losing any data. If the secondary also dies, then we do lose data (and we'll know it!), but it might be an acceptable risk. If the secondary isn't permanently damaged, then we might even be able to get the data back after some down time. Ok, so that's one way to reconfigure the database servers on a failure. If the secondary fails instead, then we can do similarly and remove it from the "cluster" (or in other words, disable synchronous replication to the secondary). Again, we don't lose any data by doing this. We're taking a certain risk, however. We can't safely do a failover to the secondary anymore... So if the primary fails now, then the only way not to lose data is to hope that we can get it back from the failed machine (the failure may be temporary). There's also the third possibility, of course, that the two servers are both up and running, but they can't communicate over the network at the moment (this is, by the way, a difference from RAID, I guess). What do we do then? Well, we still need at most one primary database server. We'll have to (somehow, which doesn't matter as much) decide which database to keep and consider the other one "down". Then we can just do as above (with all the same implications!). Is it always a good idea to keep the primary? No! What if you (as a stupid example) pull the network cable from the primary (or maybe turn off a switch so that it's isolated from most of the network)? In that case you probably want the secondary to take over instead. At least if you value service availability. At this point we can still do a safe failover too. My point here is that if HA is important to you, then you may very well want to disable synchronous replication on a failure to avoid down time, but this has to be integrated with your overall failover / cluster management solution. Just having the primary automatically disable synchronous replication doesn't seem overly useful to me...
If you're using synchronous replication to begin with, you probably want to *know* if you may have lost data or not. Otherwise, you will have to assume that you did and then you could frankly have been running async replication all along. If you do integrate it with your failover solution, then you can keep track of when it's safe to do a failover and when it's not, however, and decide how to handle each case. How you decide what to do with the servers on failures isn't that important here, really. You can probably run e.g. Pacemaker on 3+ machines and have it check for quorums to accomplish this. That's a good approach at least. You can still have only 2 database servers (for cost reasons), if you want. PostgreSQL could have all this built-in, but I don't think it sounds overly useful to only be able to disable synchronous replication on the primary after a timeout. Then you can never safely do a failover to the secondary, because you can't be sure synchronous replication was active on the failed primary... Regards, Hampus
On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote: > How you decide what to do with the servers on failures isn't that > important here, really. You can probably run e.g. Pacemaker on 3+ > machines and have it check for quorums to accomplish this. That's a > good approach at least. You can still have only 2 database servers > (for cost reasons), if you want. PostgreSQL could have all this > built-in, but I don't think it sounds overly useful to only be able > to disable synchronous replication on the primary after a timeout. > Then you can never safely do a failover to the secondary, because > you can't be sure synchronous replication was active on the failed > primary... So how about this for a Postgres TODO: Add configuration variable to allow Postgres to disable synchronous replication after a specified timeout, and add variable to alert administrators of the change. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Jul 13, 2012 at 12:25 AM, Amit Kapila <amit.kapila@huawei.com> wrote: > >> From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] >> On Behalf Of Jose Ildefonso Camargo Tolosa >>> On Thu, Jul 12, 2012 at 9:28 AM, Aidan Van Dyk <aidan@highrise.ca> wrote: >> On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas@optionshouse.com> > wrote: >> > >> As it currently is, freezing the master because the standby >> died is not good for all cases (and I dare say: for most cases), and >> having to wait for Pacemaker or other monitoring to notice that, change >> the master config, and reload... it will cause a service disruption! (for >> several seconds, usually ~30 seconds). > > Yes, it is true that this can cause a service disruption, but the same will be > true even if the master detects it internally via a timeout. > By keeping this external, the current behavior of PostgreSQL can be > maintained: > if there is no standby in sync mode, it will wait; and the purpose is still > served, since the command to change mode can be sent to the master from outside. > How does PostgreSQL currently detect that its main synchronous standby went away, and switch to another synchronous standby from the synchronous_standby_names config parameter? The same logic could be applied, optionally, to "no more synchronous standbys: go into standalone". -- Ildefonso Camargo Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC @cmdpromptinc - 509-416-6579
Hi Hampus, On Fri, Jul 13, 2012 at 2:42 AM, Hampus Wessman <hampus@hampuswessman.se> wrote: > Hi all, > > Here are some (slightly too long) thoughts about this. Nah, not that long. > > Shaun Thomas wrote 2012-07-12 22:40: > >> On 07/12/2012 12:02 PM, Bruce Momjian wrote: >> >>> Well, the problem also exists if we add it as an internal database >>> feature --- how long do we wait to consider the standby dead, how do >>> we inform administrators, etc. >> >> >> True. Though if there is no secondary connected, either because it's not >> there yet, or because it disconnected, that's an easy check. It's the >> network lag/stall detection that's tricky. > > > It is indeed tricky to detect this. If you don't get an (immediate) reply > from the secondary (and you never do!), then all you can do is wait and > *eventually* (after how long? 250ms? 10s?) assume that there is no > connection between them. The conclusion may very well be wrong sometimes. A > second problem is that we still don't know if this is caused by some kind of > network problem or if it's caused by the secondary not running. It's > perfectly possible that both servers are working, but just can't communicate > at the moment. How about: the same logic as it currently uses to detect when the "designated" synchronous standby is no longer there and move on to the next one in synchronous_standby_names? The rule to *know* that a standby went away is already there. > > The thing is that what we do next (at least if our data is important, and why > otherwise use synchronous replication of any kind...) depends on what *did* > happen. Assume that we have two database servers. At any time we need at > most one primary database to be running. Without that requirement our data > can get messed up completely... If HA is important to us, we may choose to Not necessarily, but true: that's why you kill the (failing?) node on promotion of the standby, just in case. > do a failover to the secondary (and live without replication for the moment) > if the primary fails. With synchronous replication, we can do this without > losing any data. If the secondary also dies, then we do lose data (and we'll > know it!), but it might be an acceptable risk. If the secondary isn't > permanently damaged, then we might even be able to get the data back after > some down time. Ok, so that's one way to reconfigure the database servers on > a failure. If the secondary fails instead, then we can do similarly and > remove it from the "cluster" (or in other words, disable synchronous > replication to the secondary). Again, we don't lose any data by doing this. Right, but you have to monitor the standby too! I.e., more work on the Pacemaker side, and non-trivial work. For example, just blowing away the standby won't do any good here; as for the master, you can just power it off, promote the standby, and be done with it! If the standby fails, you have to modify the master's config and reload the config there... more code: more chances of failure. > We're taking a certain risk, however. We can't safely do a failover to the > secondary anymore... So if the primary fails now, then the only way not to > lose data is to hope that we can get it back from the failed machine (the > failure may be temporary). > > There's also the third possibility, of course, that the two servers are both > up and running, but they can't communicate over the network at the moment > (this is, by the way, a difference from RAID, I guess). What do we do then?
Kill the "failing" node, just in case. In this case, without the "extra" work of monitoring the standby, you would just make the standby kill the master before promoting itself. > Well, we still need at most one primary database server. We'll have to > (somehow, which doesn't matter as much) decide which database to keep and > consider the other one "down". Then we can just do as above (with all the This is arbitrary; we usually just assume the master to be failing when the standby is healthy (from the standby's point of view). > same implications!). Is it always a good idea to keep the primary? No! What > if you (as a stupid example) pull the network cable from the primary (or > maybe turn off a switch so that it's isolated from most of the network)? In That means that you failed to have redundant connectivity to the standby (that is a must in clusters). Yes, a redundant switch too: with "smart switches" in the <US$100 range now, there is not much excuse for not having 2 switches connecting your cluster (and if you have just 2 nodes, you just need 2 network interfaces and 2 network cables). > that case you probably want the secondary to take over instead. At least if > you value service availability. At this point we can still do a safe > failover too. > > My point here is that if HA is important to you, then you may very well want > to disable synchronous replication on a failure to avoid down time, but this > has to be integrated with your overall failover / cluster management > solution. Just having the primary automatically disable synchronous That's not a trivial matter: you have to monitor the standby, and make changes to the master's configuration. > replication doesn't seem overly useful to me... If you're using synchronous > replication to begin with, you probably want to *know* if you may have lost > data or not. Otherwise, you will have to assume that you did and then you Right, and you would know: when the standby node (or service) goes down, the monitoring system can inform you... but it doesn't have to change the master's config. > could frankly have been running async replication all along. If you do No, you can't, because for the 99.9% of the time when the standby is healthy and connected, you would be at risk of losing transactions if you ran async replication. > integrate it with your failover solution, then you can keep track of when > it's safe to do a failover and when it's not, however, and decide how to > handle each case. Of course you can, but it is more complex, and likely slower. For example, if the master detects that the standby disconnected (the TCP connection was closed), it can just fall back to async while the standby comes back, then go through the "catch-up" process when it does, and go back to sync. The monitor will likely take, at the very least, 1 second (up to 30 seconds, in most configurations) to realize, make the change, and then reload the master's config. See, the main problem here is that, with the current PostgreSQL behavior, you have doubled the chances of service disruption: if the master fails, there is the time the cluster takes to note it and bring the standby up (and kill the master, likely), AND if the standby fails, there is the time the cluster takes to note it, change configs on the master, and reload. > > How you decide what to do with the servers on failures isn't that important > here, really. You can probably run e.g. Pacemaker on 3+ machines and have it > check for quorums to accomplish this. That's a good approach at least.
> You can still have only 2 database servers (for cost reasons), if you want. > PostgreSQL could have all this built-in, but I don't think it sounds overly > useful to only be able to disable synchronous replication on the primary > after a timeout. Then you can never safely do a failover to the secondary, > because you can't be sure synchronous replication was active on the failed > primary... Or have a mixed cluster of application servers and DB servers, and have them support each other for quorum. And no, not after a timeout: immediately if the TCP socket is closed, or otherwise with the same logic it uses to "switch" to another sync standby. -- Ildefonso Camargo Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC @cmdpromptinc - 509-416-6579
On Fri, Jul 13, 2012 at 10:22 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote: >> How you decide what to do with the servers on failures isn't that >> important here, really. You can probably run e.g. Pacemaker on 3+ >> machines and have it check for quorums to accomplish this. That's a >> good approach at least. You can still have only 2 database servers >> (for cost reasons), if you want. PostgreSQL could have all this >> built-in, but I don't think it sounds overly useful to only be able >> to disable synchronous replication on the primary after a timeout. >> Then you can never safely do a failover to the secondary, because >> you can't be sure synchronous replication was active on the failed >> primary... > > So how about this for a Postgres TODO: > > Add configuration variable to allow Postgres to disable synchronous > replication after a specified timeout, and add variable to alert > administrators of the change. I agree we need a TODO for this, but... I think timeout-only is not the best choice. There should be a maximum timeout (as a last resort: the maximum time we are willing to wait for the standby; this has to have the option of "forever"), but PostgreSQL certainly has to detect the *complete* disconnection of the standby (or of all the standbys in synchronous_standby_names). If it detects that no standbys are eligible to be the sync standby AND the option to fall back to async is enabled, it goes into standalone mode (as if synchronous_standby_names were empty); otherwise (if the option is disabled) it just continues to wait forever (the "last resort" timeout is ignored if the fallback option is disabled). I would call this "soft_synchronous_standby" and "soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane value would be ~5 seconds), or something like that (I'm quite bad at picking names :( ). -- Ildefonso Camargo Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC @cmdpromptinc - 509-416-6579
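To pin down the proposed semantics (both GUCs are hypothetical; nothing like them exists in PostgreSQL today): the master keeps waiting while a sync-eligible standby is connected, drops to standalone immediately on complete disconnection if the fallback is enabled, and otherwise honours the last-resort timeout, with 0 meaning wait forever. A sketch of the decision, in Python:

    def commit_wait_decision(eligible_standbys_connected, waited_s,
                             soft_synchronous_standby,
                             soft_synchronous_standby_timeout):
        # Return 'wait' or 'standalone' for a commit awaiting sync ack.
        # Hypothetical semantics for the proposed GUCs: a timeout of 0
        # means forever, and the timeout is ignored when the fallback
        # option is disabled.
        if not soft_synchronous_standby:
            return "wait"        # current behavior: block indefinitely
        if not eligible_standbys_connected:
            return "standalone"  # complete disconnection: fall back now
        if (soft_synchronous_standby_timeout > 0
                and waited_s >= soft_synchronous_standby_timeout):
            return "standalone"  # last-resort timeout expired
        return "wait"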
From: pgsql-hackers-owner@postgresql.org [pgsql-hackers-owner@postgresql.org] on behalf of Jose Ildefonso Camargo Tolosa [ildefonso.camargo@gmail.com] Sent: Saturday, July 14, 2012 6:08 AM On Fri, Jul 13, 2012 at 10:22 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote: > >> So how about this for a Postgres TODO: >> >> Add configuration variable to allow Postgres to disable synchronous >> replication after a specified timeout, and add variable to alert >> administrators of the change. > I agree we need a TODO for this, but... I think timeout-only is not > the best choice. There should be a maximum timeout (as a last > resort: the maximum time we are willing to wait for the standby; this > has to have the option of "forever"), but PostgreSQL certainly has > to detect the *complete* disconnection of the standby (or of all the standbys > in synchronous_standby_names). If it detects that no standbys are > eligible to be the sync standby AND the option to fall back to async is > enabled, it goes into standalone mode (as if > synchronous_standby_names were empty); otherwise (if the option is > disabled) it just continues to wait forever (the "last resort" > timeout is ignored if the fallback option is disabled). I would > call this "soft_synchronous_standby" and > "soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane > value would be ~5 seconds), or something like that (I'm quite bad at > picking names :( ). After it has gone into standalone mode, if the standby comes back, will it be able to return to sync mode with it? If not, won't that break the current behavior? Currently, I think, in freeze mode, if the standby comes back, sync-mode replication can start again. With Regards, Amit Kapila.
On Fri, Jul 13, 2012 at 11:12 PM, Amit kapila <amit.kapila@huawei.com> wrote: > From: pgsql-hackers-owner@postgresql.org [pgsql-hackers-owner@postgresql.org] on behalf of Jose Ildefonso Camargo Tolosa [ildefonso.camargo@gmail.com] > Sent: Saturday, July 14, 2012 6:08 AM > On Fri, Jul 13, 2012 at 10:22 AM, Bruce Momjian <bruce@momjian.us> wrote: >> On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote: >> >>> So how about this for a Postgres TODO: >>> >>> Add configuration variable to allow Postgres to disable synchronous >>> replication after a specified timeout, and add variable to alert >>> administrators of the change. > >> I agree we need a TODO for this, but... I think timeout-only is not >> the best choice. There should be a maximum timeout (as a last >> resort: the maximum time we are willing to wait for the standby; this >> has to have the option of "forever"), but PostgreSQL certainly has >> to detect the *complete* disconnection of the standby (or of all the standbys >> in synchronous_standby_names). If it detects that no standbys are >> eligible to be the sync standby AND the option to fall back to async is >> enabled, it goes into standalone mode (as if >> synchronous_standby_names were empty); otherwise (if the option is >> disabled) it just continues to wait forever (the "last resort" >> timeout is ignored if the fallback option is disabled). I would >> call this "soft_synchronous_standby" and >> "soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane >> value would be ~5 seconds), or something like that (I'm quite bad at >> picking names :( ). > > After it has gone into standalone mode, if the standby comes back, will it be able to return to sync mode with it? That's the idea, yes: after the standby comes back, the master would act as if the sync standby had connected for the first time: first going through "catchup" mode, and "once the lag between standby and primary reaches zero "(...)" we move to real-time streaming state" (from the 9.1 docs); at that point, normal sync behavior is restored. -- Ildefonso Camargo Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC @cmdpromptinc - 509-416-6579
> From: Jose Ildefonso Camargo Tolosa [ildefonso.camargo@gmail.com] > Sent: Saturday, July 14, 2012 9:36 AM > On Fri, Jul 13, 2012 at 11:12 PM, Amit kapila <amit.kapila@huawei.com> wrote: > From: pgsql-hackers-owner@postgresql.org [pgsql-hackers-owner@postgresql.org] on behalf of Jose Ildefonso Camargo Tolosa [ildefonso.camargo@gmail.com] > Sent: Saturday, July 14, 2012 6:08 AM > On Fri, Jul 13, 2012 at 10:22 AM, Bruce Momjian <bruce@momjian.us> wrote: >> On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote: >> >>>> So how about this for a Postgres TODO: >>>> >>>> Add configuration variable to allow Postgres to disable synchronous >>>> replication after a specified timeout, and add variable to alert >>>> administrators of the change. > >>> I agree we need a TODO for this, but... I think timeout-only is not >>> the best choice. There should be a maximum timeout (as a last >>> resort: the maximum time we are willing to wait for the standby; this >>> has to have the option of "forever"), but PostgreSQL certainly has >>> to detect the *complete* disconnection of the standby (or of all the standbys >>> in synchronous_standby_names). If it detects that no standbys are >>> eligible to be the sync standby AND the option to fall back to async is >>> enabled, it goes into standalone mode (as if >>> synchronous_standby_names were empty); otherwise (if the option is >>> disabled) it just continues to wait forever (the "last resort" >>> timeout is ignored if the fallback option is disabled). I would >>> call this "soft_synchronous_standby" and >>> "soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane >>> value would be ~5 seconds), or something like that (I'm quite bad at >>> picking names :( ). > > > After it has gone into standalone mode, if the standby comes back, will it be able to return to sync mode with it? > That's the idea, yes: after the standby comes back, the master would > act as if the sync standby had connected for the first time: first going > through "catchup" mode, and "once the lag between standby and > primary reaches zero "(...)" we move to real-time streaming state" > (from the 9.1 docs); at that point, normal sync behavior is restored. Idea-wise it looks okay, but are you sure that the current code/design can handle it the way you are suggesting? I am not sure it can work, because it might be the case that, due to network instability, the master has gone into standalone mode, and now that the standby is able to communicate again, it might be expecting to get more data rather than going into catchup mode. I believe someone who is an expert in this area of the code can comment here to make it more concrete. With Regards, Amit Kapila.
On Sat, Jul 14, 2012 at 12:42 AM, Amit kapila <amit.kapila@huawei.com> wrote: >> From: Jose Ildefonso Camargo Tolosa [ildefonso.camargo@gmail.com] >> Sent: Saturday, July 14, 2012 9:36 AM >> On Fri, Jul 13, 2012 at 11:12 PM, Amit kapila <amit.kapila@huawei.com> wrote: >> From: pgsql-hackers-owner@postgresql.org [pgsql-hackers-owner@postgresql.org] on behalf of Jose Ildefonso Camargo Tolosa [ildefonso.camargo@gmail.com] >> Sent: Saturday, July 14, 2012 6:08 AM >> On Fri, Jul 13, 2012 at 10:22 AM, Bruce Momjian <bruce@momjian.us> wrote: >>> On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote: >>> >>>>> So how about this for a Postgres TODO: >>>>> >>>>> Add configuration variable to allow Postgres to disable synchronous >>>>> replication after a specified timeout, and add variable to alert >>>>> administrators of the change. >> >>>> I agree we need a TODO for this, but... I think timeout-only is not >>>> the best choice. There should be a maximum timeout (as a last >>>> resort: the maximum time we are willing to wait for the standby; this >>>> has to have the option of "forever"), but PostgreSQL certainly has >>>> to detect the *complete* disconnection of the standby (or of all the standbys >>>> in synchronous_standby_names). If it detects that no standbys are >>>> eligible to be the sync standby AND the option to fall back to async is >>>> enabled, it goes into standalone mode (as if >>>> synchronous_standby_names were empty); otherwise (if the option is >>>> disabled) it just continues to wait forever (the "last resort" >>>> timeout is ignored if the fallback option is disabled). I would >>>> call this "soft_synchronous_standby" and >>>> "soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane >>>> value would be ~5 seconds), or something like that (I'm quite bad at >>>> picking names :( ). >> >> > After it has gone into standalone mode, if the standby comes back, will it be able to return to sync mode with it? > >> That's the idea, yes: after the standby comes back, the master would >> act as if the sync standby had connected for the first time: first going >> through "catchup" mode, and "once the lag between standby and >> primary reaches zero "(...)" we move to real-time streaming state" >> (from the 9.1 docs); at that point, normal sync behavior is restored. > > Idea-wise it looks okay, but are you sure that the current code/design can handle it the way you are suggesting? > I am not sure it can work, because it might be the case that, due to network instability, the master has gone into standalone mode, > and now that the standby is able to communicate again, it might be expecting to get more data rather than going into catchup mode. > I believe someone who is an expert in this area of the code can comment here to make it more concrete. Well, I'd need to dive into the code, but as far as I know, it is the master that decides to be in "catchup" mode, and the standby just takes care of sending feedback to the master. Also, it has to handle this situation, because currently, if the master goes away, whether because it crashed or because of network issues, the standby doesn't really know why; it will reconnect to the master and do whatever it needs to do to get in sync with the master again (be it: retry the connection several times while the master is restarting, or just reconnect to a waiting master and request the pending WAL segments). There has to be code in place to handle those issues, because it is already working.
I'm trying to get a solution that is as non-intrusive as possible, with the lowest amount of code added, reusing current logic and actions with small alterations so that performance doesn't suffer. > > With Regards, > Amit Kapila. -- Ildefonso Camargo Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC @cmdpromptinc - 509-416-6579
So, here's the core issue with degraded mode. I'm not mentioning this to block any patch anyone has, but rather out of a desire to see someone address this core problem with some clever idea I've not thought of. The problem in a nutshell is: indeterminacy. Assume someone implements degraded mode. Then: 1. Master has one synchronous standby, Standby1, and two asynchronous, Standby2 and Standby3. 2. Standby1 develops a NIC problem and is in and out of contact with Master. As a result, it's flipping in and out of synchronous / degraded mode. 3. Master fails catastrophically due to a RAID card meltdown. All data lost. At this point, the DBA is in kind of a pickle, because he doesn't know: (a) Was Standby1 in synchronous or degraded mode when Master died? The only log for that was on Master, which is now gone. (b) Is Standby1 actually the most caught up standby, and thus the appropriate new master for Standby2 and Standby3, or is it behind? With the current functionality of Synchronous Replication, you don't have either piece of indeterminacy, because some external management process (hopefully located on another server) needs to disable synchronous replication when Standby1 develops its problem. That is, if the master is accepting synchronous transactions at all, you know that Standby1 is up-to-date, and no data is lost. While you can answer (b) by checking all servers, (a) is particularly pernicious, because unless you have the application log all "operating in degraded mode" messages, there is no way to ever determine the truth. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sat, Jul 14, 2012 at 7:54 PM, Josh Berkus <josh@agliodbs.com> wrote: > So, here's the core issue with degraded mode. I'm not mentioning this > to block any patch anyone has, but rather out of a desire to see someone > address this core problem with some clever idea I've not thought of. > The problem in a nutshell is: indeterminacy. > > Assume someone implements degraded mode. Then: > > 1. Master has one synchronous standby, Standby1, and two asynchronous, > Standby2 and Standby3. > > 2. Standby1 develops a NIC problem and is in and out of contact with > Master. As a result, it's flipping in and out of synchronous / degraded > mode. > > 3. Master fails catastrophically due to a RAID card meltdown. All data > lost. > > At this point, the DBA is in kind of a pickle, because he doesn't know: > > (a) Was Standby1 in synchronous or degraded mode when Master died? The > only log for that was on Master, which is now gone. > > (b) Is Standby1 actually the most caught up standby, and thus the > appropriate new master for Standby2 and Standby3, or is it behind? > > With the current functionality of Synchronous Replication, you don't > have either piece of indeterminacy, because some external management > process (hopefully located on another server) needs to disable > synchronous replication when Standby1 develops its problem. That is, if > the master is accepting synchronous transactions at all, you know that > Standby1 is up-to-date, and no data is lost. > > While you can answer (b) by checking all servers, (a) is particularly > pernicious, because unless you have the application log all "operating > in degraded mode" messages, there is no way to ever determine the truth. Good explanation. In brief, the problem here is that you can only rely on the no-transaction-loss guarantee provided by synchronous replication if you can be certain that you'll always be aware of it when synchronous replication gets shut off. Right now that is trivially true, because it has to be shut off manually. If we provide a facility that logs a message and then shuts it off, we lose that certainty, because the log message could get eaten en route by the same calamity that takes down the master. There is no way for the master to WAIT for the log message to be delivered and only then degrade. However, we could craft a mechanism that has this effect. Suppose we create a new GUC with a name like synchronous_replication_status_change_command. If we're thinking about switching between synchronous replication and degraded mode automatically, we first run this command. If it returns 0, then we're allowed to switch, but if it returns anything else, then we're not allowed to switch (but can retry the command after a suitable interval). The user is responsible for supplying a command that records the status change somewhere off-box in a fashion that's sufficiently durable that the user has confidence that the notification won't subsequently be lost. For example, the user-supplied command could SSH into three machines located in geographically disparate data centers and create a file with a certain name on each one, returning 0 only if it's able to reach at least two of them and create the file on all the ones it can reach. If the master dies, but at least two out of those three machines are still alive, we can determine with confidence whether the master might have been in degraded mode at the time of the crash.
More or less paranoid versions of this scheme are possible depending on user preferences, but the key point is that for the no-transaction-loss guarantee to be of any use, there has to be a way to reliably know whether that guarantee was in effect at the time the master died in a fire. Logging isn't enough, but I think some more sophisticated mechanism can get us there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
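A sketch of the sort of user-supplied command being described, recording the mode change on three off-box witness hosts over ssh and succeeding only on a majority (the host names and marker path are placeholders):

    #!/usr/bin/env python
    import subprocess
    import sys

    HOSTS = ["witness1", "witness2", "witness3"]  # disparate data centers

    def record_status_change(new_mode):
        ok = 0
        for host in HOSTS:
            try:
                subprocess.check_call(
                    ["ssh", host, "touch /var/lib/pg-sync-state/" + new_mode],
                    timeout=5)
                ok += 1
            except Exception:
                pass  # unreachable witness; keep trying the others
        return ok >= 2  # majority recorded: the switch may proceed

    if __name__ == "__main__":
        # Exit 0 permits the mode change; anything else forbids it.
        sys.exit(0 if record_status_change(sys.argv[1]) else 1)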
On 16.07.2012 22:01, Robert Haas wrote: > On Sat, Jul 14, 2012 at 7:54 PM, Josh Berkus<josh@agliodbs.com> wrote: >> So, here's the core issue with degraded mode. I'm not mentioning this >> to block any patch anyone has, but rather out of a desire to see someone >> address this core problem with some clever idea I've not thought of. >> The problem in a nutshell is: indeterminacy. >> >> Assume someone implements degraded mode. Then: >> >> 1. Master has one synchronous standby, Standby1, and two asynchronous, >> Standby2 and Standby3. >> >> 2. Standby1 develops a NIC problem and is in and out of contact with >> Master. As a result, it's flipping in and out of synchronous / degraded >> mode. >> >> 3. Master fails catastrophically due to a RAID card meltdown. All data >> lost. >> >> At this point, the DBA is in kind of a pickle, because he doesn't know: >> >> (a) Was Standby1 in synchronous or degraded mode when Master died? The >> only log for that was on Master, which is now gone. >> >> (b) Is Standby1 actually the most caught up standby, and thus the >> appropriate new master for Standby2 and Standby3, or is it behind? >> >> With the current functionality of Synchronous Replication, you don't >> have either piece of indeterminacy, because some external management >> process (hopefully located on another server) needs to disable >> synchronous replication when Standby1 develops its problem. That is, if >> the master is accepting synchronous transactions at all, you know that >> Standby1 is up-to-date, and no data is lost. >> >> While you can answer (b) by checking all servers, (a) is particularly >> pernicious, because unless you have the application log all "operating >> in degraded mode" messages, there is no way to ever determine the truth. > > Good explanation. > > In brief, the problem here is that you can only rely on the > no-transaction-loss guarantee provided by synchronous replication if > you can be certain that you'll always be aware of it when synchronous > replication gets shut off. Right now that is trivially true, because > it has to be shut off manually. If we provide a facility that logs a > message and then shuts it off, we lose that certainty, because the log > message could get eaten en route by the same calamity that takes down > the master. There is no way for the master to WAIT for the log > message to be delivered and only then degrade. > > However, we could craft a mechanism that has this effect. Suppose we > create a new GUC with a name like > synchronous_replication_status_change_command. If we're thinking > about switching between synchronous replication and degraded mode > automatically, we first run this command. If it returns 0, then we're > allowed to switch, but if it returns anything else, then we're not > allowed to switch (but can retry the command after a suitable > interval). The user is responsible for supplying a command that > records the status change somewhere off-box in a fashion that's > sufficiently durable that the user has confidence that the > notification won't subsequently be lost. For example, the > user-supplied command could SSH into three machines located in > geographically disparate data centers and create a file with a certain > name on each one, returning 0 only if it's able to reach at least two > of them and create the file on all the ones it can reach.
> If the master dies, but at least two of those three machines are
> still alive, we can determine with confidence whether the master
> might have been in degraded mode at the time of the crash.
>
> More or less paranoid versions of this scheme are possible depending
> on user preferences, but the key point is that for the
> no-transaction-loss guarantee to be of any use, there has to be a way
> to reliably know whether that guarantee was in effect at the time the
> master died in a fire. Logging isn't enough, but I think some more
> sophisticated mechanism can get us there.

Yeah, I think that's the right general approach. Not necessarily that exact GUC, but something like that.

I don't want PostgreSQL to get more involved in determining the state of the standby, when to do failover, or when to fall back to degraded mode. That's a whole new territory with all kinds of problems, and there is plenty of software out there to handle that. Usually you have some external software to do monitoring and to initiate failovers anyway. What we need is a better API for co-operating with such software, to perform failover, and to switch replication between synchronous and asynchronous modes.

BTW, one little detail that I don't think has been mentioned in this thread before: even though the master currently knows whether a standby is connected or not, and you could write a patch to act based on that, there are other failure scenarios where you would still not be happy. For example, imagine that the standby has a disk failure. It stays connected to the master, but fails to fsync anything to disk. Would you want to fall back to degraded mode and just do asynchronous replication in that case? How do you decide when to do that in the master? Or what if the standby keeps making progress, but becomes incredibly slow for some reason, like disk failure in a RAID array? I'd rather outsource all that logic to external monitoring software - software that you should be running anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
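[For reference, the kind of mode switch Heikki has in mind is something external tooling can already drive today: emptying synchronous_standby_names and reloading the server disables synchronous waits, since the parameter is reloadable without a restart. A rough sketch follows; the host name and file paths are assumptions for a Debian-style 9.1 install and must be adapted.]

    # Run from a monitoring host once the sync standby is declared dead.
    # Assumes key-based SSH access as the postgres user; adjust the
    # config and data directory paths to match the actual installation.
    ssh postgres@db-master.example.com <<'EOF'
    sed -i "s/^synchronous_standby_names.*/synchronous_standby_names = ''/" \
        /etc/postgresql/9.1/main/postgresql.conf
    pg_ctl reload -D /var/lib/postgresql/9.1/main
    EOF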
On Mon, Jul 16, 2012 at 10:58 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> BTW, one little detail that I don't think has been mentioned in this thread
> before: even though the master currently knows whether a standby is
> connected or not, and you could write a patch to act based on that, there
> are other failure scenarios where you would still not be happy. For example,
> imagine that the standby has a disk failure. It stays connected to the
> master, but fails to fsync anything to disk. Would you want to fall back to
> degraded mode and just do asynchronous replication in that case? How do you
> decide when to do that in the master? Or what if the standby keeps making
> progress, but becomes incredibly slow for some reason, like disk failure in
> a RAID array? I'd rather outsource all that logic to external monitoring
> software - software that you should be running anyway.

I would like to express some support for the non-edge nature of this case. Outside of simple loss of availability of a server, losing access to a block device is probably the second-most-common cause of loss of availability for me. It's especially insidious because simple "select 1" checks may continue to return for quite some time, so instead we rely on parsing Linux diskstats to see if write progress hits zero for "a while."

In cases like these, the overhead of a shell command to rapidly consult a decision-making process can be prohibitive -- it's already a pretty big waster of time for me in WAL archiving/dearchiving, where process startup, SSL negotiation, and lack of parallelization can be pretty slow. A status-change command may exhibit the same problem.

I would like to plead that whatever is done would be most useful if it were controllable entirely via non-GUC mechanisms -- arguably that is already the case, since one can write a replication protocol client to do the job by faking the standby status update messages, but perhaps there is a more lucid way if one makes accommodation. In particular, the awkwardness of using pg_receivexlog[0] or a similar tool to replace archive_command is something that I feel should be addressed eventually, so as not to be a second-class citizen. Although that is already being worked on[1], the archive command has no backpressure either, other than running out of disk.

The case of restore_command is even more sore: remastering or archive recovery via streaming protocol actions is kind of a pain at the moment. I haven't thoroughly explored this yet and I don't think it is documented, but it can be hard for something that is dearchiving from WAL segments stored somewhere to find exactly the right record to start replaying at. The WAL record format is not stable, and it need not be, if the server helps by ignoring records that predate what it requires, or can inform the process feeding WAL that it got things wrong. Maybe that is the case, but it is not documented.

I also don't think there are any guarantees around the maximum size or alignment of WAL shipped by the streaming protocol in XLogData messages, and that's too bad. Also, the endianness of the WAL position fields in XLogData is host-byte-order-dependent, which sucks if you are forwarding WAL around but need to know what range is contained in a message. In practice many people can say "all I have is little-endian," but it is somewhat unpleasant and not necessarily the case. Correct me if I'm wrong; I'd be glad for it.
[0]: see the notes section, http://www.postgresql.org/docs/devel/static/app-pgreceivexlog.html
[1]: http://archives.postgresql.org/pgsql-hackers/2012-06/msg00348.php

--
fdr
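[The diskstats check Daniel alludes to is simple to approximate in shell. A rough sketch, where the device name, sampling interval, and the choice of the sectors-written counter are assumptions to adapt locally; a real monitor would also account for legitimately idle periods.]

    #!/bin/sh
    # Sample /proc/diskstats twice and flag the device as stalled if the
    # sectors-written counter (field 10) has not advanced.  Device and
    # interval are illustrative; real monitoring would sample repeatedly
    # and only alarm when writes are actually expected.
    DEV=sdb
    INTERVAL=30
    w1=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)
    sleep "$INTERVAL"
    w2=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)
    if [ "$w1" = "$w2" ]; then
        echo "no write progress on $DEV in ${INTERVAL}s" >&2
        exit 1
    fi
    exit 0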
On Fri, Jul 13, 2012 at 08:08:59PM -0430, Jose Ildefonso Camargo Tolosa wrote:
> On Fri, Jul 13, 2012 at 10:22 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote:
> >> How you decide what to do with the servers on failures isn't that
> >> important here, really. You can probably run e.g. Pacemaker on 3+
> >> machines and have it check for quorums to accomplish this. That's a
> >> good approach at least. You can still have only 2 database servers
> >> (for cost reasons), if you want. PostgreSQL could have all this
> >> built-in, but I don't think it sounds overly useful to only be able
> >> to disable synchronous replication on the primary after a timeout.
> >> Then you can never safely do a failover to the secondary, because
> >> you can't be sure synchronous replication was active on the failed
> >> primary...
> >
> > So how about this for a Postgres TODO:
> >
> > Add configuration variable to allow Postgres to disable synchronous
> > replication after a specified timeout, and add variable to alert
> > administrators of the change.
>
> I agree we need a TODO for this, but... I think timeout-only is not
> the best choice. There should be a maximum timeout (as a last resort:
> the maximum time we are willing to wait for a standby, which has to
> have the option of "forever"), but PostgreSQL certainly has to detect
> the *complete* disconnection of the standby (or of all standbys in
> synchronous_standby_names). If it detects that no standbys are
> eligible as sync standbys AND the option to fall back to async is
> enabled, it will go into standalone mode (as if
> synchronous_standby_names were empty); otherwise (if the option is
> disabled) it will just continue to wait forever (the "last resort"
> timeout is ignored if the fallback option is disabled). I would
> call these "soft_synchronous_standby" and
> "soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane
> value would be ~5 seconds), or something like that (I'm quite bad at
> picking names :( ).

TODO added:

    Allow synchronous_standby_names to be disabled after communication
    failure with all synchronous standby servers exceeds some timeout

    This also requires successful execution of a synchronous
    notification command.

    http://archives.postgresql.org/pgsql-hackers/2012-07/msg00409.php

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
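[Were the TODO implemented along the lines Jose sketches, the configuration might read roughly as follows. Both settings are proposals from this thread only; they exist in no PostgreSQL release, and the names and values are shown purely to make the idea concrete.]

    # Hypothetical postgresql.conf fragment -- neither setting exists in
    # PostgreSQL; names and defaults are Jose's suggestions from this thread.
    soft_synchronous_standby = on           # permit fallback to standalone mode
    soft_synchronous_standby_timeout = 5    # seconds to wait for a sync standby;
                                            # 0 = wait forever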