Thread: Geographic High-Availability/Replication
Hey all, new Postgres user here. We are trying to set up/research an HA/replicated solution with PostgreSQL between a datacenter in LA and a datacenter in NY.

We have a private network link between the two datacenters with a max round-trip of 150ms.

We will have a web server at each datacenter (among other servers) that will write/read to/from the local LAN DB. On writes, that data should be transmitted to the other datacenter so that if, for whatever reason, my website request was sent to LA instead of NY, all my session information etc. will still exist.

I was looking into PGCluster 1.7rc5, but their website states:

"...running one Cluster DB node in a different geographical location is not what PGCluster was built for..."

Does anyone have an idea? We cannot possibly be the only company needing to do this.

Thanks,
Matthew
Matthew wrote:
> Does anyone have an idea? We cannot possibly be the only company needing
> to do this.

No, but you are likely the only company trying to do it cross-continent ;) :). You cannot do multi-master cross-continent reliably. There are ways, using custom caches etc., to pull all session info into a geo-redundant data store, though.

Sincerely,

Joshua D. Drake
Command Prompt, Inc. (http://www.commandprompt.com/)
On Aug 22, 2007, at 3:37 PM, Joshua D. Drake wrote:
> You cannot do multi-master cross-continent reliably.

I'm pretty sure that credit card processors and some other companies do it... it just costs a LOT to actually do it well.

--
Decibel!, aka Jim Nasby
decibel@decibel.org
EnterpriseDB http://enterprisedb.com
Decibel! <decibel@decibel.org> wrote:
> I'm pretty sure that credit card processors and some other companies
> do it... it just costs a LOT to actually do it well.

Isn't this sort of requirement the entire reason for 2-phase commit?

--
Bill Moran
http://www.potentialtech.com
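For reference, PostgreSQL has shipped two-phase commit (PREPARE TRANSACTION / COMMIT PREPARED) since 8.1, and psycopg2 exposes it through its tpc API. A minimal sketch of an application-level coordinator spanning the two datacenters; the hostnames and the sessions table are invented for illustration:

```python
# Minimal two-phase commit sketch across two PostgreSQL nodes,
# using psycopg2's tpc API. Hosts and schema are hypothetical.
import psycopg2

la = psycopg2.connect(host="db.la.example.com", dbname="app")
ny = psycopg2.connect(host="db.ny.example.com", dbname="app")
conns = [la, ny]
xid = la.xid(1, "session-update-12345", "web")  # arbitrary labels

try:
    for conn in conns:
        conn.tpc_begin(xid)
        cur = conn.cursor()
        cur.execute("UPDATE sessions SET data = %s WHERE id = %s",
                    ("...", 42))
        # Phase 1: PREPARE TRANSACTION - the node persists the
        # transaction and promises it can commit it later.
        conn.tpc_prepare()
    # Phase 2: every node prepared successfully, so commit everywhere.
    for conn in conns:
        conn.tpc_commit()
except Exception:
    for conn in conns:
        try:
            conn.tpc_rollback()  # aborts prepared or in-flight work
        except psycopg2.Error:
            pass  # this node may have failed before tpc_begin ran
```

Note that 2PC makes the distributed commit atomic, but each commit still pays the cross-country round-trips, and a crashed coordinator leaves prepared transactions holding locks until resolved, which is part of why it isn't the whole HA story.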
On 08/23/07 17:22, Bill Moran wrote:
> Isn't this sort of requirement the entire reason for 2-phase commit?

Entire reason? Not that I've heard.

--
Ron Johnson, Jr.
Jefferson LA USA
Hi,

Matthew wrote:
> On writes, that data should be transmitted to the other datacenter so
> that if, for whatever reason, my website request was sent to LA instead
> of NY, all my session information etc. will still exist.

This is commonly known as synchronous replication. As that involves *at least* one round-trip *before* committing, it's quite expensive. Can you live with a delay of ~150ms before COMMIT confirmation?

Another issue is the reliability of your failure detectors. How does server B know that server A is really down (and not just the link)? Normally that's solved with a quorum device, so you need at least three servers, preferably in different locations.

Regards

Markus

Disclaimer: I'm the developer behind Postgres-R (www.postgres-r.org); have a look at it, it's designed to do what you are looking for.
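The quorum logic behind that three-server rule is tiny; a toy sketch, with node names and the health probe invented for illustration:

```python
# Toy quorum-based failure detection for a three-node setup.
# In reality probe() would be a short-timeout ping or "SELECT 1";
# here it reads a canned status table so the sketch is runnable.
NODES = ["db-la", "db-ny", "witness-chi"]
STATUS = {"db-la": True, "db-ny": False, "witness-chi": True}

def probe(node):
    return STATUS[node]

def have_quorum(me):
    # Count ourselves plus every peer we can still reach. A node
    # keeps accepting writes only while it sees a strict majority,
    # so a partitioned minority stops instead of diverging.
    votes = 1 + sum(probe(n) for n in NODES if n != me)
    return votes > len(NODES) // 2

print(have_quorum("db-la"))  # True: LA + witness = 2 of 3
```

With only two servers, a cut link is indistinguishable from a dead peer, and both sides would happily keep accepting writes; the third (witness) node is what breaks the tie.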
On 08/24/07 08:21, Markus Schiltknecht wrote:
> This is commonly known as synchronous replication. As that involves
> *at least* one round-trip *before* committing, it's quite expensive.
> Can you live with a delay of ~150ms before COMMIT confirmation?

Which puts an upper limit on transaction rates of ~6 TPS. Blech.

--
Ron Johnson, Jr.
Jefferson LA USA
"Ron Johnson" <ron.l.johnson@cox.net> writes: > On 08/24/07 08:21, Markus Schiltknecht wrote: > >> This is commonly known as synchronous replication. As that involves >> *at-least* one round-trip *before* committing, it's quite expensive. Can >> you live with a delay of ~150ms before COMMIT confirmation? > > Which puts an upper limit on transaction rates at 6TPS. Blech. Only if your application is single-threaded. By single-threaded I don't refer to operating system threads but to the architecture. If you're processing a large batch file handling records one by one and waiting for each commit before proceeding then it's single threaded. If you have a hundred independent clients on separate connections doing separate things then each one of them could get 6tps. Which you have will depend on your application and your needs, it may not be something you can change. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Hi,

Gregory Stark wrote:
> If you have a hundred independent clients on separate connections doing
> separate things, then each one of them could get 6 TPS.

Correct.

Plus, as in the implementation of Postgres-R, performance is *not* bound to the slowest node. Instead, every node can process transactions at its own speed. Slower nodes might then have to queue transactions from the faster ones until they catch up again.

Regards

Markus
Markus Schiltknecht <markus@bluegap.ch> wrote:
> Plus, as in the implementation of Postgres-R, performance is *not* bound
> to the slowest node. Instead, every node can process transactions at
> its own speed. Slower nodes might then have to queue transactions from
> the faster ones until they catch up again.

I'm curious as to how Postgres-R would handle a situation where the constant throughput exceeded the processing speed of one of the nodes. I can see your system working if it's just spike loads and the slow nodes can catch up during slow periods, but I'm wondering about the scenarios where an admin has underestimated the hardware requirements and one or more nodes is unable to keep up.

Just musing, really.

--
Bill Moran
http://www.potentialtech.com
Hi,

Bill Moran wrote:
> I'm curious as to how Postgres-R would handle a situation where the
> constant throughput exceeded the processing speed of one of the nodes.

Well, what do you expect to happen? This case is easily detectable, but I can only see two possible solutions: either stop the node which is too slow, or stop accepting new transactions for a while.

This technique is not meant to allow nodes to lag behind by several thousand transactions - that is better avoided. Rather, it's meant to decrease the commit delay necessary for synchronous replication.

> I can see your system working if it's just spike loads and the slow
> nodes can catch up during slow periods, but I'm wondering about the
> scenarios where an admin has underestimated the hardware requirements
> and one or more nodes is unable to keep up.

Please keep in mind that replication per se does not speed your database up; it rather adds a layer of reliability, which *costs* some performance. To increase the transactional throughput you would need to add partitioning to the mix. Or you could try to make use of the gained reliability and abandon WAL - you won't need it as long as at least one replica is running - which should increase the single node's throughput, and therefore the cluster's throughput, too.

When replication meets partitioning and load balancing, you get into a whole new world where new trade-offs need to be considered. Some look similar to those with RAID storage - probably Sequoia's term RAIDb isn't bad at all.

Regards

Markus
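The "stop accepting new transactions" option is ordinary back-pressure on a bounded queue; a toy sketch of the throttling variant, with the threshold and the lag source invented:

```python
# Toy admission control: stall new transactions while any replica's
# apply queue is too deep. Numbers and the lag source are invented.
import time

MAX_LAG = 1000  # queued-but-unapplied transactions per node

def replication_lag():
    """Stand-in: return {node: queued txn count} from monitoring."""
    return {"db-la": 12, "db-ny": 3}

def admit_transaction():
    # Block new work until every node is under the threshold, so a
    # slow node can only ever fall behind by a bounded amount.
    while max(replication_lag().values()) > MAX_LAG:
        time.sleep(0.01)  # back-pressure: callers queue up here
```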
On 8/26/07, Bill Moran <wmoran@potentialtech.com> wrote:
> I'm curious as to how Postgres-R would handle a situation where the
> constant throughput exceeded the processing speed of one of the nodes.

Such a situation is not a problem specific to Postgres-R or to synchronous replication in general. Asynchronous replication will break down too.

--
marko
Hi,

Marko Kreen wrote:
> Such a situation is not a problem specific to Postgres-R or to
> synchronous replication in general. Asynchronous replication will
> break down too.

Agreed, except that I don't consider slowness to be 'breaking down'.

Regards

Markus
On 8/23/07, Decibel! <decibel@decibel.org> wrote:
> I'm pretty sure that credit card processors and some other companies
> do it... it just costs a LOT to actually do it well.

And most credit card processing companies don't do it (credit card processing, that is) all that well.
In response to Markus Schiltknecht <markus@bluegap.ch>:

> Well, what do you expect to happen? This case is easily detectable, but
> I can only see two possible solutions: either stop the node which is too
> slow, or stop accepting new transactions for a while.

It appears as if I miscommunicated my point. I'm not expecting Postgres-R to break the laws of physics or anything, I'm just curious how it reacts. This is the difference between software that will be really great one day and software that is great now.

Great now would mean the system would notice that it's too far behind and Do The Right Thing automatically. I'm not exactly sure what The Right Thing is, but my first guess would be to force the hopelessly slow node out of the cluster. I expect this would be non-trivial, as you'd have to have a way to ensure it was a problem isolated to a single node (or a few nodes), and not just the whole cluster getting hit with unexpected traffic.

> This technique is not meant to allow nodes to lag behind by several
> thousand transactions - that is better avoided. Rather, it's meant to
> decrease the commit delay necessary for synchronous replication.

Of course not; that's why the behaviour when that non-ideal situation occurs is so interesting. How does Postgres-R fail? PostgreSQL fails wonderfully: a hardware crash will usually result in a system that can recover without operator intervention. In a system like Postgres-R, the failure scenarios are more numerous, and probably more complicated.

> Please keep in mind that replication per se does not speed your
> database up; it rather adds a layer of reliability, which *costs* some
> performance.

I understand. I'm not asking it to do something it's not designed to do. At least, I don't _think_ I am.

--
Bill Moran
http://www.potentialtech.com
Hello Bill,

Bill Moran wrote:
> It appears as if I miscommunicated my point. I'm not expecting
> Postgres-R to break the laws of physics or anything, I'm just curious
> how it reacts. This is the difference between software that will be
> really great one day and software that is great now.

Agreed. As Postgres-R is still a prototype, it *currently* does not handle the situation at all. But I'm thankful for this discussion, as it helps me figure out how Postgres-R *should* react. So, thank you for pointing this out.

> Great now would mean the system would notice that it's too far behind
> and Do The Right Thing automatically.

Hm, yeah, that's a tricky decision to make. For a start, I'd be in favor of just informing the administrator about the delay and letting him take care of the problem (as is currently done with 'disk full' conditions), instead of trying to do something clever automatically. (This seems much more PostgreSQL-like, too.)

> How does Postgres-R fail? PostgreSQL fails wonderfully: a hardware
> crash will usually result in a system that can recover without operator
> intervention. In a system like Postgres-R, the failure scenarios are
> more numerous, and probably more complicated.

I agree that there are more failure scenarios, although fewer are critical to the complete system. IMO, a node which is too slow should not be considered a failure, but rather a system limitation (possibly due to unfortunate configuration), much like out-of-memory or out-of-disk-space conditions. Forcing such a node to go down could have unwanted side effects on the other nodes (i.e. increased read-only traffic) *and* does not solve the real problem.

Again, thanks for pointing this out. I'll think more about some issues, especially similar corner cases like this one. Single-node disk full would be another example. Possibly also out-of-memory conditions?

Regards

Markus
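The "inform the administrator" approach is easy to prototype; a sketch of a lag watchdog, where the lag query, threshold, and alert channel are all placeholders:

```python
# Toy replication-lag watchdog in the "tell the admin, don't act"
# spirit of the post above. The lag source and alert sink are
# placeholders; a real deployment would query the cluster's own
# bookkeeping and page an operator.
import time

LAG_THRESHOLD = 500   # unapplied transactions
CHECK_EVERY = 30      # seconds

def current_lag(node):
    """Stand-in for asking a node how far behind it is."""
    return 0

def alert(msg):
    print("ALERT:", msg)  # stand-in for mail/pager/syslog

def watch(nodes):
    while True:
        for node in nodes:
            lag = current_lag(node)
            if lag > LAG_THRESHOLD:
                # Report, but deliberately take no automatic action.
                alert(f"{node} is {lag} txns behind")
        time.sleep(CHECK_EVERY)
```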
On Fri, Aug 24, 2007 at 06:54:35PM +0200, Markus Schiltknecht wrote:
> Plus, as in the implementation of Postgres-R, performance is *not* bound
> to the slowest node. Instead, every node can process transactions at
> its own speed. Slower nodes might then have to queue transactions from
> the faster ones until they catch up again.

But is the complete transaction information safely stored on all nodes before a commit returns?

--
Decibel!, aka Jim Nasby
decibel@decibel.org
EnterpriseDB http://enterprisedb.com
Hi,

Decibel! wrote:
> But is the complete transaction information safely stored on all nodes
> before a commit returns?

Good question. It depends very much on the group communication system and the guarantees it provides for message delivery. For certain, the information isn't safely stored on every node before commit confirmation. Let me quickly explain those two points.

Lately, I've read a lot about different group communication systems and how they handle delivery guarantees. Spread offers an 'agreed' and a 'safe' mode; only the latter guarantees that all nodes have received the data. It's a rather expensive mode in terms of latency.

In our case, it would be sufficient if at least n nodes confirmed having correctly received the data. That would allow for (n - 1) simultaneously failing nodes, so that there's always at least one correct node which has received the data, even if the sender failed just after sending. This one node can redistribute the data to the nodes which didn't receive the message, until all nodes have received it.

No group communication system I know of offers such fine-grained levels of delivery guarantees. Additionally, I've figured that it would be nice to have subgroups and multiple orderings within a group. Thus - opposed to my initial intention - I've finally started to write yet another group communication system, providing all of these nice features. Anyway, that's another story.

Regarding durability: given the above assumption that at most (n - 1) nodes fail, you don't have to care much about recovery, because there's always at least one running node which has all the data. As we know, reality doesn't always care about our assumptions. So, if you want to prevent data loss due to failures of more than (n - 1) nodes, possibly even all nodes, you'd have to do transaction logging, much like WAL, but cluster-wide.

Having every single node write a transaction log, like WAL, would be rather expensive, and complex during recovery, as you'd have to mix and match all the nodes' WALs. Instead, I think it's better to decouple transaction logging (backup) from ordinary operation. That gives you much more freedom. For example, you could have nodes dedicated to, and optimized for, logging. But most importantly, you have separated the problem: as long as your permanent storage for transaction logging is alive, you can recover your data, no matter what's happening with the rest of the cluster. And the other way around: as long as your cluster is alive (i.e. no more than (n - 1) simultaneous failures), you don't really need the transaction log.

So, before committing a transaction, a node has to wait for the delivery of the data through the GCS *and* for the transaction logger(s) to have written the data to permanent storage. Please note that those two operations can be done simultaneously, i.e. the latencies do not add up; the total delay is just the maximum of the two.

Regards

Markus
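The commit rule in that last paragraph - wait for n delivery acknowledgements and for the log flush in parallel, so the delay is the maximum rather than the sum - might look like this in outline (all names are invented and the waits are stubs):

```python
# Sketch of the commit rule above: a transaction may commit once
# (a) at least N_ACKS group members confirmed delivery, and
# (b) the cluster-wide transaction log has flushed the data.
# Both waits run concurrently, so the commit delay is roughly
# max(gcs_delivery, log_flush), not their sum.
from concurrent.futures import ThreadPoolExecutor

N_ACKS = 2  # tolerates N_ACKS - 1 simultaneous node failures

def wait_for_acks(txn_id, n):
    """Stand-in: block until n group members confirm delivery."""

def wait_for_log(txn_id):
    """Stand-in: block until the transaction logger has fsynced."""

def commit(txn_id):
    with ThreadPoolExecutor(max_workers=2) as pool:
        acks = pool.submit(wait_for_acks, txn_id, N_ACKS)
        log = pool.submit(wait_for_log, txn_id)
        acks.result()  # both must finish, but they overlap in
        log.result()   # time, so the delay is the slower of the two
    # ...local commit proceeds here...
```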