Thread: Geographic High-Availability/Replication

Geographic High-Availability/Replication

From: Matthew
Hey all, new Postgres user here. We are trying to set up/research an
HA/replicated solution with PostgreSQL between a datacenter in LA and a
datacenter in NY.

We have a private LAN link between the two datacenters with a max
round-trip time of 150ms.

We will have a web server at each datacenter (among other servers) that will
write/read to/from the local LAN DB. On writes, that data should be
transmitted to the other data center so that if, for whatever reason, my
website request was sent to LA instead of NY, all my session information
etc. will still exist.

I was looking into PGCluster 1.7rc5 but their website states:

    "...running one Cluster DB node in a different geographical location is
not what PGCluster was built for..."

Does anyone have an idea? We cannot possibly be the only company needing
to do this.

Thanks,
Matthew

Re: Geographic High-Availability/Replication

From: "Joshua D. Drake"

Matthew wrote:
> Hey all, new Postgres user here. We are trying to set up/research an
> HA/replicated solution with PostgreSQL between a datacenter in LA and a
> datacenter in NY.
>
> We have a private LAN link between the two datacenters with a max
> round-trip time of 150ms.
>
> We will have a web server at each datacenter (among other servers) that will
> write/read to/from the local LAN DB. On writes, that data should be
> transmitted to the other data center so that if, for whatever reason, my
> website request was sent to LA instead of NY, all my session information
> etc. will still exist.
>
> I was looking into PGCluster 1.7rc5 but their website states:
>
>     "...running one Cluster DB node in a different geographical location is
> not what PGCluster was built for..."
>
> Does anyone have an idea? We cannot possibly be the only company needing
> to do this.

No, but you are likely the only company trying to do it cross-continent ;).

You cannot do multi-master replication cross-continent reliably. There are
ways, using custom caches etc., to pull all session info into a
geo-redundant data store, though.
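
As an illustration of that approach (a sketch only, not a recommendation;
the table, DSNs and helper names are invented): the web tier commits session
writes to the local data center synchronously and ships the same row to the
other data center asynchronously, trading a small replication window for not
paying the 150ms link on every commit.

    import queue
    import threading
    import time

    import psycopg2

    LOCAL_DSN  = "host=db.la.example dbname=app"   # hypothetical
    REMOTE_DSN = "host=db.ny.example dbname=app"   # hypothetical

    # ON CONFLICT needs a reasonably modern PostgreSQL (9.5+); older versions
    # would need an UPDATE-then-INSERT dance instead.
    UPSERT = """
        INSERT INTO web_session (sid, data, updated_at)
        VALUES (%s, %s, now())
        ON CONFLICT (sid) DO UPDATE
            SET data = EXCLUDED.data, updated_at = now()
    """

    local = psycopg2.connect(LOCAL_DSN)
    remote = psycopg2.connect(REMOTE_DSN)
    outbox = queue.Queue()

    def save_session(sid, data):
        """Synchronous local write; the user never waits on the WAN link."""
        with local, local.cursor() as cur:
            cur.execute(UPSERT, (sid, data))
        outbox.put((sid, data))             # replicate best-effort, async

    def shipper():
        """Background thread pushing session rows to the other data center."""
        while True:
            sid, data = outbox.get()
            try:
                with remote, remote.cursor() as cur:
                    cur.execute(UPSERT, (sid, data))
            except psycopg2.Error:
                outbox.put((sid, data))     # remote DC unreachable: retry later
                time.sleep(1)               # (a real shipper would also reconnect)

    threading.Thread(target=shipper, daemon=True).start()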

Sincerely,

Joshua D. Drake


>
> Thanks,
> Matthew


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
            UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


Re: Geographic High-Availability/Replication

From: Decibel!
On Aug 22, 2007, at 3:37 PM, Joshua D. Drake wrote:
> You cannot do multi-master replication cross-continent reliably.

I'm pretty sure that credit card processors and some other companies
do it... it just costs a LOT to actually do it well.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: Geographic High-Availability/Replication

From: Bill Moran
Decibel! <decibel@decibel.org> wrote:
>
> On Aug 22, 2007, at 3:37 PM, Joshua D. Drake wrote:
> > You cannot do multi-master replication cross-continent reliably.
>
> I'm pretty sure that credit card processors and some other companies
> do it... it just costs a LOT to actually do it well.

Isn't this sort of requirement the entire reason for 2-phase commit?
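
For reference, PostgreSQL does expose two-phase commit as PREPARE TRANSACTION /
COMMIT PREPARED, and psycopg2 wraps it in the DB-API "tpc" calls (the servers
must allow prepared transactions via max_prepared_transactions). A minimal
sketch, with made-up DSNs and table, of committing the same change in both
data centers - note that each phase still pays a WAN round-trip, and an
in-doubt transaction left by a crash needs a coordinator to resolve it:

    import psycopg2

    la = psycopg2.connect("host=db.la.example dbname=app")  # hypothetical DSNs
    ny = psycopg2.connect("host=db.ny.example dbname=app")

    xid = la.xid(0, "session-update-42", "webapp")           # global transaction id

    for conn in (la, ny):
        conn.tpc_begin(xid)
        cur = conn.cursor()
        cur.execute("UPDATE web_session SET data = %s WHERE sid = %s",
                    ("...", "42"))

    try:
        for conn in (la, ny):
            conn.tpc_prepare()      # phase 1: each side durably promises to commit
    except psycopg2.Error:
        for conn in (la, ny):
            conn.tpc_rollback()     # someone refused: roll back everywhere
        raise

    for conn in (la, ny):
        conn.tpc_commit()           # phase 2: commit everywhere; a crash between
                                    # these commits leaves a prepared ("in doubt")
                                    # transaction to resolve on restart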

--
Bill Moran
http://www.potentialtech.com

Re: Geographic High-Availability/Replication

From: Ron Johnson

On 08/23/07 17:22, Bill Moran wrote:
> Decibel! <decibel@decibel.org> wrote:
>> On Aug 22, 2007, at 3:37 PM, Joshua D. Drake wrote:
>>> You cannot do multi-master replication cross-continent reliably.
>> I'm pretty sure that credit card processors and some other companies
>> do it... it just costs a LOT to actually do it well.
>
> Isn't this sort of requirement the entire reason for 2-phase commit?

Entire reason?  Not that I've heard.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: Geographic High-Availability/Replication

From: Markus Schiltknecht
Hi,

Matthew wrote:
> Hey all, new Postgres user here. We are trying to set up/research an
> HA/replicated solution with PostgreSQL between a datacenter in LA and a
> datacenter in NY.
>
> We have a private LAN link between the two datacenters with a max
> round-trip time of 150ms.
>
> We will have a web server at each datacenter (among other servers) that will
> write/read to/from the local LAN DB. On writes, that data should be
> transmitted to the other data center so that if, for whatever reason, my
> website request was sent to LA instead of NY, all my session information
> etc. will still exist.

This is commonly known as synchronous replication. As that involves *at
least* one round-trip *before* committing, it's quite expensive. Can you
live with a delay of ~150ms before COMMIT confirmation?

Another issue is the reliability of your failure detectors. How does
server B know that server A is really down (and not only the link)?
Normally, that's solved with a quorum device, so you have to have at
least three servers - preferably in different locations.
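
The majority rule itself is simple - a toy sketch of why two sites alone
cannot decide:

    def has_quorum(reachable, total):
        """A node may only keep accepting writes while it can see a strict
        majority of the configured nodes; otherwise it must assume it is the
        one that got isolated."""
        return reachable > total // 2

    # With only two data centers, a broken link leaves *both* sides without
    # a majority, so neither can safely promote itself:
    assert not has_quorum(1, 2)
    # With a third (tie-breaker) site, the side that still sees the quorum
    # device may continue:
    assert has_quorum(2, 3)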

Regards

Markus

Disclaimer: I'm the developer behind Postgres-R (www.postgres-r.org);
have a look at it, it's designed to do what you are looking for.


Re: Geographic High-Availability/Replication

From: Ron Johnson

On 08/24/07 08:21, Markus Schiltknecht wrote:
> Hi,
>
> Matthew wrote:
>> Hey all, new Postgres user here. We are trying to set up/research an
>> HA/replicated solution with PostgreSQL between a datacenter in LA and a
>> datacenter in NY.
>>
>> We have a private LAN link between the two datacenters with a max
>> round-trip time of 150ms.
>>
>> We will have a web server at each datacenter (among other servers) that will
>> write/read to/from the local LAN DB. On writes, that data should be
>> transmitted to the other data center so that if, for whatever reason, my
>> website request was sent to LA instead of NY, all my session information
>> etc. will still exist.
>
> This is commonly known as synchronous replication. As that involves *at
> least* one round-trip *before* committing, it's quite expensive. Can you
> live with a delay of ~150ms before COMMIT confirmation?

Which puts an upper limit on the transaction rate of about 6 TPS.  Blech.

> Another issue is the reliability of your failure detectors. How does
> server B know that server A is really down (and not only the link?).
> Normally, that's solved with a quorum device. So that you have to have
> at least three servers - preferably in different locations.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: Geographic High-Availability/Replication

From: Gregory Stark
"Ron Johnson" <ron.l.johnson@cox.net> writes:

> On 08/24/07 08:21, Markus Schiltknecht wrote:
>
>> This is commonly known as synchronous replication. As that involves *at
>> least* one round-trip *before* committing, it's quite expensive. Can you
>> live with a delay of ~150ms before COMMIT confirmation?
>
> Which puts an upper limit on the transaction rate of about 6 TPS.  Blech.

Only if your application is single-threaded. By single-threaded I don't refer
to operating system threads but to the architecture. If you're processing a
large batch file, handling records one by one and waiting for each commit
before proceeding, then it's single-threaded. If you have a hundred independent
clients on separate connections doing separate things, then each one of them
could get 6 TPS. Which you have will depend on your application and your needs;
it may not be something you can change.
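
Spelled out (150ms is the figure from the original post; the client count is
arbitrary):

    rtt = 0.150                            # seconds, LA <-> NY round trip
    per_connection = 1 / rtt               # ~6.7 synchronous commits/sec for one
                                           # serial client waiting on each COMMIT
    clients = 100                          # independent connections, each waiting
                                           # only on its own commits
    aggregate = clients * per_connection   # ~667 commits/sec across the cluster
    print(per_connection, aggregate)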

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: Geographic High-Availability/Replication

From: Markus Schiltknecht
Hi,

Gregory Stark wrote:
> Only if your application is single-threaded. By single-threaded I don't refer
> to operating system threads but to the architecture. If you're processing a
> large batch file, handling records one by one and waiting for each commit
> before proceeding, then it's single-threaded. If you have a hundred independent
> clients on separate connections doing separate things, then each one of them
> could get 6 TPS. Which you have will depend on your application and your needs;
> it may not be something you can change.

Correct.

Plus, as in the implementation of Postgres-R, performance is *not* bound
to the slowest node. Instead, every node can process transactions at
its own speed. Slower nodes might then have to queue transactions from
the faster nodes until they catch up again.
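
A conceptual sketch of that idea (this is not Postgres-R code, just a toy
illustration, under the assumption that the GCS hands every node the same
totally ordered stream of writesets): each node queues delivered writesets
locally and applies them at its own pace, so a slow node's queue grows
without throttling commits on the fast nodes.

    import queue
    import threading
    import time

    class Node:
        def __init__(self, name, apply_cost):
            self.name = name
            self.apply_cost = apply_cost       # seconds per writeset (simulated)
            self.pending = queue.Queue()       # delivered but not yet applied
            threading.Thread(target=self._applier, daemon=True).start()

        def deliver(self, writeset):
            # Delivery from the group communication system is cheap and happens
            # in the same total order on every node.
            self.pending.put(writeset)

        def lag(self):
            return self.pending.qsize()        # how far this node is behind

        def _applier(self):
            while True:
                self.pending.get()
                time.sleep(self.apply_cost)    # simulate applying the writeset

    # A fast and a slow node receive the same stream; only the slow one lags.
    fast, slow = Node("ny", 0.001), Node("la", 0.050)
    for ws in range(200):
        fast.deliver(ws)
        slow.deliver(ws)
    time.sleep(1)
    print(fast.lag(), slow.lag())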

Regards

Markus


Re: Geographic High-Availability/Replication

From: Bill Moran
Markus Schiltknecht <markus@bluegap.ch> wrote:
>
> Hi,
>
> Gregory Stark wrote:
> > Only if your application is single-threaded. By single-threaded I don't refer
> > to operating system threads but to the architecture. If you're processing a
> > large batch file, handling records one by one and waiting for each commit
> > before proceeding, then it's single-threaded. If you have a hundred independent
> > clients on separate connections doing separate things, then each one of them
> > could get 6 TPS. Which you have will depend on your application and your needs;
> > it may not be something you can change.
>
> Correct.
>
> Plus, as in the implementation of Postgres-R, performance is *not* bound
> to the slowest node. Instead, every node can process transactions at
> its own speed. Slower nodes might then have to queue transactions from
> the faster nodes until they catch up again.

I'm curious as to how Postgres-R would handle a situation where the
constant throughput exceeded the processing speed of one of the nodes.

I can see your system working if it's just spike loads and the slow
nodes can catch up during slow periods, but I'm wondering about the
scenarios where an admin has underestimated the hardware requirements
and one or more nodes is unable to keep up.

Just musing, really.

--
Bill Moran
http://www.potentialtech.com

Re: Geographic High-Availability/Replication

From: Markus Schiltknecht
Hi,

Bill Moran wrote:
> I'm curious as to how Postgres-R would handle a situation where the
> constant throughput exceeded the processing speed of one of the nodes.

Well, what do you expect to happen? This case is easily detectable, but
I can only see two possible solutions: either stop the node which is too
slow, or stop accepting new transactions for a while.

This technique is not meant to allow nodes to lag behind by several
thousand transactions - that is better avoided. Rather, it's meant to
decrease the commit delay necessary for synchronous replication.

> I can see your system working if it's just spike loads and the slow
> nodes can catch up during slow periods, but I'm wondering about the
> scenarios where an admin has underestimated the hardware requirements
> and one or more nodes is unable to keep up.

Please keep in mind that replication per se does not speed your
database up; rather, it adds a layer of reliability, which *costs* some
performance. To increase the transactional throughput you would need to
add partitioning to the mix. Or you could try to make use of the gained
reliability and abandon WAL - you won't need it as long as at least
one replica is running - which should increase the single node's
throughput and therefore the cluster's throughput, too.

When replication meets partitioning and load balancing, you'll get into
a whole new world, where new trade-offs need to be considered. Some look
similar to those with RAID storage - probably Sequoia's term RAIDb isn't
bad at all.

Regards

Markus


Re: Geographic High-Availability/Replication

From: "Marko Kreen"
On 8/26/07, Bill Moran <wmoran@potentialtech.com> wrote:
> I'm curious as to how Postgres-R would handle a situation where the
> constant throughput exceeded the processing speed of one of the nodes.

Such a situation is not a problem specific to Postgres-R or to
synchronous replication in general.  Asynchronous replication
will break down too.

--
marko

Re: Geographic High-Availability/Replication

From: Markus Schiltknecht
Hi,

Marko Kreen wrote:
> Such a situation is not a problem specific to Postgres-R or to
> synchronous replication in general.  Asynchronous replication
> will break down too.

Agreed, except that I don't consider slowness to be 'breaking down'.

Regards

Markus

Re: Geographic High-Availability/Replication

From: "Scott Marlowe"
On 8/23/07, Decibel! <decibel@decibel.org> wrote:
> On Aug 22, 2007, at 3:37 PM, Joshua D. Drake wrote:
> > You cannot do multi-master replication cross-continent reliably.
>
> I'm pretty sure that credit card processors and some other companies
> do it... it just costs a LOT to actually do it well.

And most credit card processing companies don't do it (credit card
processing, that is) all that well.

Re: Geographic High-Availability/Replication

From: Bill Moran
In response to Markus Schiltknecht <markus@bluegap.ch>:

> Hi,
>
> Bill Moran wrote:
> > I'm curious as to how Postgres-R would handle a situation where the
> > constant throughput exceeded the processing speed of one of the nodes.
>
> Well, what do you expect to happen? This case is easily detectable, but
> I can only see two possible solutions: either stop the node which is too
> slow, or stop accepting new transactions for a while.

It appears as if I miscommunicated my point.  I'm not expecting
Postgres-R to break the laws of physics or anything; I'm just
curious how it reacts.  This is the difference between software
that will be really great one day and software that is great now.

Great now would mean the system would notice that it's too far behind
and Do The Right Thing automatically.  I'm not exactly sure what The
Right Thing is, but my first guess would be to force the hopelessly
slow node out of the cluster.  I expect this would be non-trivial,
as you'd have to have a way to ensure it was a problem isolated to
a single node (or a few nodes), and not just the whole cluster getting
hit with unexpected traffic.
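
A rough sketch of that check (the thresholds and lag numbers are invented for
illustration): evict a node only when its backlog is far beyond what the rest
of the cluster is showing, so a cluster-wide traffic spike does not trigger
evictions.

    from statistics import median

    def nodes_to_evict(lag_by_node, factor=10, floor=1000):
        """Return nodes whose backlog (e.g. queued writesets) is both large in
        absolute terms and far above the cluster's typical backlog."""
        typical = median(lag_by_node.values())
        return [name for name, lag in lag_by_node.items()
                if lag > floor and lag > factor * max(typical, 1)]

    # One hopelessly slow node: the problem is isolated, force it out.
    print(nodes_to_evict({"la": 12, "ny": 25_000, "dal": 9}))           # ['ny']
    # Everyone is swamped: evicting would not help, so do nothing.
    print(nodes_to_evict({"la": 20_000, "ny": 25_000, "dal": 22_000}))  # []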

> This technique is not meant to allow nodes to lag behind several
> thousands of transactions - that should better be avoided. Rather it's
> meant to decrease the commit delay necessary for synchronous replication.

Of course not; that's why the behaviour when that non-ideal situation
occurs is so interesting.  How does Postgres-R fail?  PostgreSQL
fails wonderfully: a hardware crash will usually result in a system
that can recover without operator intervention.  In a system like
Postgres-R, the failure scenarios are more numerous, and probably
more complicated.

> > I can see your system working if it's just spike loads and the slow
> > nodes can catch up during slow periods, but I'm wondering about the
> > scenarios where an admin has underestimated the hardware requirements
> > and one or more nodes is unable to keep up.
>
> Please keep in mind, that replication per se does not speed your
> database up, it rather adds a layer of reliability, which *costs* some
> performance. To increase the transactional throughput you would need to
> add partitioning to the mix. Or you could try to make use of the gained
> reliability and abandon WAL - you won't need that as long as at least
> one replica is running - that should increase the single node's
> throughput and therefore the cluster's throughput, too.

I understand.  I'm not asking it to do something it's not designed to.
At least, I don't _think_ I am.

--
Bill Moran
http://www.potentialtech.com

Re: Geographic High-Availability/Replication

From: Markus Schiltknecht
Hello Bill,

Bill Moran wrote:
> It appears as if I miscommunicated my point.  I'm not expecting
> Postgres-R to break the laws of physics or anything; I'm just
> curious how it reacts.  This is the difference between software
> that will be really great one day and software that is great now.

Agreed. As Postgres-R is still a prototype, it currently does *not*
handle the situation at all. But I'm thankful for this discussion, as it
helps me figure out how Postgres-R *should* react. So, thank you
for pointing this out.

> Great now would mean the system would notice that it's too far behind
> and Do The Right Thing automatically.  I'm not exactly sure what The
> Right Thing is, but my first guess would be to force the hopelessly
> slow node out of the cluster.  I expect this would be non-trivial,
> as you'd have to have a way to ensure it was a problem isolated to
> a single node (or a few nodes), and not just the whole cluster getting
> hit with unexpected traffic.

Hm.. yeah, that's a tricky decision to make. For a start, I'd be in
favor of just informing the administrator about the delay and letting him
take care of the problem (as is currently done with 'disk full'
conditions), instead of trying to do something clever automatically.
(This seems to be much more PostgreSQL-like, too.)

> Of course not; that's why the behaviour when that non-ideal situation
> occurs is so interesting.  How does Postgres-R fail?  PostgreSQL
> fails wonderfully: a hardware crash will usually result in a system
> that can recover without operator intervention.  In a system like
> Postgres-R, the failure scenarios are more numerous, and probably
> more complicated.

I agree that there are more failure scenarios, although fewer of them
are critical to the complete system.

IMO, a node which is too slow should not be considered a failure, but
rather a system limitation (possibly due to unfortunate configuration),
much like out of memory or disk space conditions. Forcing such a node to
go down could have unwanted side effects on the other nodes (i.e.
increased read-only traffic) *and* does not solve the real problem.

Again, thanks for pointing this out. I'll think more about some issues,
especially similar corner cases like this one. Single-node disk full
would be another example. Possibly also out of memory conditions?

Regards

Markus


Re: Geographic High-Availability/Replication

From: Decibel!
On Fri, Aug 24, 2007 at 06:54:35PM +0200, Markus Schiltknecht wrote:
> Gregory Stark wrote:
> > Only if your application is single-threaded. By single-threaded I don't refer
> > to operating system threads but to the architecture. If you're processing a
> > large batch file, handling records one by one and waiting for each commit
> > before proceeding, then it's single-threaded. If you have a hundred independent
> > clients on separate connections doing separate things, then each one of them
> > could get 6 TPS. Which you have will depend on your application and your needs;
> > it may not be something you can change.
>
> Correct.
>
> Plus, as in the implementation of Postgres-R, performance is *not* bound
> to the slowest node. Instead, every node can process transactions at
> its own speed. Slower nodes might then have to queue transactions from
> the faster nodes until they catch up again.

But is the complete transaction information safely stored on all nodes
before a commit returns?
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Geographic High-Availability/Replication

From: Markus Schiltknecht
Hi,

Decibel! wrote:
> But is the complete transaction information safely stored on all nodes
> before a commit returns?

Good question. It depends very much on the group communication system
and the guarantees it provides for message delivery. What is certain is
that the information isn't safely stored on *every* node before commit
confirmation. Let me quickly explain those two points.

Lately, I've read a lot about different group communication systems and
how they handle delivery guarantees. Spread offers an 'agreed' and a
'safe' mode; only the latter guarantees that all nodes have received the
data, and it's a rather expensive mode in terms of latency.

In our case, it would be sufficient if at least n nodes confirmed
having correctly received the data. That would tolerate (n - 1)
simultaneously failing nodes, so that there's always at least one
correct node which has received the data, even if the sender failed
right after sending. That one node can redistribute the data to the
nodes which didn't receive it, until all nodes have it.
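
As a sketch of that guarantee (no group communication system I know of
exposes exactly this, so the class and method names here are invented): the
sender treats a message as safely delivered once at least n nodes have
acknowledged it, which tolerates n - 1 simultaneous failures.

    import threading

    class KSafeDelivery:
        """Marks a message 'safe' once at least k acknowledgements arrive."""
        def __init__(self, k):
            self.k = k
            self._acks = set()
            self._lock = threading.Lock()
            self.safe = threading.Event()

        def on_ack(self, node_id):
            with self._lock:
                self._acks.add(node_id)
                if len(self._acks) >= self.k:
                    self.safe.set()            # commit may now proceed

        def wait_safe(self, timeout=None):
            return self.safe.wait(timeout)

    # With k = 2, losing any single node (even the original sender) cannot
    # lose the message: at least one surviving node has it and can
    # redistribute it to the others.
    d = KSafeDelivery(k=2)
    d.on_ack("ny"); d.on_ack("la")
    assert d.wait_safe(timeout=0)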

No group communication system I know of offers such fine-grained levels
of delivery guarantees. Additionally, I figured it would be nice
to have subgroups and multiple orderings within a group. Thus - contrary
to my initial intention - I've finally started to write yet another
group communication system, providing all of these nice features.
Anyway, that's another story.

Regarding durability: given the above assumption that at most (n - 1)
nodes fail, you don't have to care much about recovery, because there's
always at least one running node which has all the data. As we know,
reality doesn't always care about our assumptions. So, if you want to
prevent data loss due to failures of more than (n - 1) nodes, possibly
even all nodes, you'd have to do transaction logging, much like WAL, but
cluster-wide. Having every single node write a transaction log, like
WAL, would be rather expensive and complex during recovery, as you'd
have to mix and match all the nodes' WALs.

Instead, I think it's better to decouple transaction logging (backup)
from ordinary operation. That gives you much more freedom. For example,
you could have nodes dedicated to and optimized for logging. But most
importantly, you have separated the problem: as long as your permanent
storage for transaction logging is intact, you can recover your data, no
matter what's happening with the rest of the cluster. And the other way
around: as long as your cluster is alive (i.e. no more than (n - 1)
simultaneous failures), you don't really need the transaction log.

So, before committing a transaction, a node has to wait for the delivery
of the data through the GCS *and* for the transaction logger(s) to have
written the data to permanent storage. Please note that those two
operations can be done simultaneously, i.e. the latencies do not add up;
the total is just the maximum of the two.
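
A small sketch of that last point (wait_gcs_delivery and wait_logger are
hypothetical stand-ins for the two confirmations): both waits run
concurrently, so the commit delay is roughly max(t_gcs, t_log) rather than
their sum.

    from concurrent.futures import ThreadPoolExecutor

    def wait_for_commit(txn, wait_gcs_delivery, wait_logger):
        """Block until the writeset is delivered through the GCS *and* the
        transaction logger has it on permanent storage, waiting for both in
        parallel."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            delivered = pool.submit(wait_gcs_delivery, txn)
            logged = pool.submit(wait_logger, txn)
            delivered.result()   # the two confirmations overlap in time,
            logged.result()      # so the added latency is the slower of the two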

Regards

Markus