Thread: High Availability, Load Balancing, and Replication Feature Matrix

High Availability, Load Balancing, and Replication Feature Matrix

From: Bruce Momjian

[ BCC to hackers.]

I have added a High Availability, Load Balancing, and Replication
Feature Matrix table to the docs:

    http://momjian.us/main/writings/pgsql/sgml/high-availability.html#HIGH-AVAILABILITY-MATRIX

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Markus Schiltknecht

Hello Bruce,

Bruce Momjian wrote:
> I have added a High Availability, Load Balancing, and Replication
> Feature Matrix table to the docs:

Nice work. I appreciate your efforts in clearing up the uncertainty that
surrounds this topic.

As you might have guessed, I have some complaints regarding the Feature
Matrix. I hope this won't discourage you; rather, I'd like to
contribute to an improved variant.

First of all, I don't quite like the negated formulations. I can see
that you want a dot to mark a positive feature, but I find it hard to
understand.

What I'm especially puzzled about is the "master never locks others"
item. All of the first four, namely "shared disk failover", "file system
replication", "warm standby" and "master slave replication", block
others (the slaves) completely, which is about the worst kind of lock.

Comparing "File System Replication" and "Shared Disk Failover",
you state that the former has "master server overhead", while the latter
doesn't. Seen solely from the single server node, this might be true.
But summed over the cluster, you have quite a similar network
load in both cases. I wouldn't say one has less overhead than the other
by definition.

Then, you are mixing apples and oranges. Why should a "statement based
replication solution" not require conflict resolution? You can build
eager as well as lazy statement based replication solutions; the one does
not have anything to do with the other, does it?

Same applies to "master slave replication" and "per table granularity".

And in the special case of (async, but eager) Postgres-R also to "async
multi-master replication" and "no conflict resolution necessary".
Although I can understand that that's a pretty nifty difference.

Given the matrix focuses on practically available solutions, I can see
some value in it. But from a more theoretical viewpoint, I find it
pretty confusing. Now, if you want a practically usable feature
comparison table, I'd strongly vote for clearly mentioning the products
you have in mind - otherwise the table pretends to be something it is not.

If it should be theoretically correct without mentioning available
solutions, I'd rather vote for explaining the terms and concepts.

To clarify my viewpoint, I'll quickly go over the features you're
mentioning and associate them with the concepts, as I understand them.

  - special hardware:  always nice, not much theoretical effect, a
                       network is a network, storage is storage.

  - multiple masters:  that's what single- vs multi masters is about:
                       writing transactions. Can be mixed with
                       eager/lazy, every combination makes
                       sense for certain applications.

  - overhead:          replication by definition generates overhead,
                       question is: how much, and where.

  - locking of others: again, question of how much and how fine grained
                       the locking is. In a single master repl. sol., the
                       slaves are locked completely. In lazy repl. sol.,
                       the locking is deferred until after the commit,
                       during conflict resolution. In eager repl. sol.,
                       the locking needs to take place before the commit.
                       But all replication systems need some kind of
                       locks!

  - data loss on fail: solely dependent on eager/lazy. (Given a real
                       replication, with a replica, which shared storage
                       does not provide, IMO)

  - slaves read only:  theoretically possible with all replication
                       systems, be they lazy/eager, single-/multi-
                       master. That we are unable to read from slave
                       nodes is an implementation annoyance of
                       Postgres, if you want.

  - per table gran.:   again, independent of lazy/eager, single-/multi.
                       Depends solely on the level where data is
                       replicated: block device, file system, statement,
                       WAL or other internal format.

  - conflict resol.:   in multi master systems, that depends on the
                       lazy/eager property. Single master systems
                       obviously never need to resolve conflicts.
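
To make the eager/lazy distinction above concrete, here is a minimal
sketch of where a commit waits in each case (plain Python; all names
are made up for illustration, no real product's API is being quoted):

    class Replica:
        def __init__(self, name):
            self.name = name
            self.rows = {}

        def apply(self, changeset):
            self.rows.update(changeset)   # stands in for the disk write

    def eager_commit(local_rows, changeset, replicas):
        # Eager: every replica confirms before the commit returns,
        # so commit latency includes the slowest replica.
        for r in replicas:
            r.apply(changeset)
        local_rows.update(changeset)
        return "committed everywhere"

    def lazy_commit(local_rows, changeset, replicas, queue):
        # Lazy: the commit returns after the local write; replicas
        # drain the queue later, which is where conflict resolution
        # has to happen.
        local_rows.update(changeset)
        queue.append(changeset)
        return "committed locally, replication deferred"

Anything still sitting in the queue when the master dies is lost, which
is exactly the "data loss on fail" point above.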

IMO, "data partitioning" is entirely orthogonal to replication. It
can be combined, in various ways. There's horizontal and vertical
partitioning, eager/lazy and single-/multi-master replication. I guess
we could find a use case for most of the combinations thereof. (Kudos
for finding a combination which definitely has no use case).

Well, these are my theories; do with them whatever you like. Comments
appreciated.

Kind regards

Markus


Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Bruce Momjian

Markus Schiltknecht wrote:
> Hello Bruce,
>
> Bruce Momjian wrote:
> > I have added a High Availability, Load Balancing, and Replication
> > Feature Matrix table to the docs:
>
> Nice work. I appreciate your efforts in clearing up the uncertainty that
> surrounds this topic.
>
> As you might have guessed, I have some complaints regarding the Feature
> Matrix. I hope this won't discourage you; rather, I'd like to
> contribute to an improved variant.

Not sure if you were around when we wrote this chapter but there was a
lot of good discussion to get it to where it is now.

> First of all, I don't quite like the negated formulations. I can see
> that you want a dot to mark a positive feature, but I find it hard to
> understand.

Well, the idea is to say "what things do I want and what offers it?"  If
you have positive/negative it makes it harder to do that.  I realize it
is confusing in a different way.  We could split out the negatives into
a different table but that seems worse.

> What I'm especially puzzled about is the "master never locks others"
> item. All of the first four, namely "shared disk failover", "file system
> replication", "warm standby" and "master slave replication", block
> others (the slaves) completely, which is about the worst kind of lock.

That item assumes you have slaves that are trying to do work.  The point
is that multi-master slows down the other slaves in a way no other
option does, which is the reason we don't support it yet.  I have
updated the wording to "No inter-server locking delay".

> Comparing "File System Replication" and "Shared Disk Failover",
> you state that the former has "master server overhead", while the latter
> doesn't. Seen solely from the single server node, this might be true.
> But summed over the cluster, you have quite a similar network
> load in both cases. I wouldn't say one has less overhead than the other
> by definition.

The point is that file system replication has to wait for the standby
server to write the blocks, while disk failover does not.  I don't think
the network is an issue considering many use NAS anyway.

> Then, you are mixing apples and oranges. Why should a "statement based
> replication solution" not require conflict resolution? You can build
> eager as well as lazy statement based replication solutions; the one does
> not have anything to do with the other, does it?

There is no dot there so I am saying "statement based replication
solution" requires conflict resolution.  Agreed you could do it without
conflict resolution and it is kind of independent.  How should we deal
with this?

> Same applies to "master slave replication" and "per table granularity".

I tried to mark them based on existing or typical solutions, but you are
right, especially if the master/slave is not PITR based.  Some can't do
per-table, like disk failover.

> And in the special case of (async, but eager) Postgres-R also to "async
> multi-master replication" and "no conflict resolution necessary".
> Although I can understand that that's a pretty nifty difference.

Yea, the table isn't going to be 100% but tries to summarize what is in the
section above.

> Given the matrix focuses on practically available solutions, I can see
> some value in it. But from a more theoretical viewpoint, I find it
> pretty confusing. Now, if you want a practically usable feature
> comparison table, I'd strongly vote for clearly mentioning the products
> you have in mind - otherwise the table pretends to be something it is not.

I considered that and I can add something that says you have to consider
the text above for more details.  Some require mentioning specific
solutions, like Slony, while others do not, like disk failover.

> If it should be theoretically correct without mentioning available
> solutions, I'd rather vote for explaining the terms and concepts.
>
> To clarify my viewpoint, I'll quickly go over the features you're
> mentioning and associate them with the concepts, as I understand them.
>
>   - special hardware:  always nice, not much theoretical effect, a
>                        network is a network, storage is storage.
>
>   - multiple masters:  that's what single- vs multi masters is about:
>                        writing transactions. Can be mixed with
>                        eager/lazy, every combination makes
>                        sense for certain applications.
>
>   - overhead:          replication by definition generates overhead,
>                        question is: how much, and where.
>
>   - locking of others: again, question of how much and how fine grained
>                        the locking is. In a single master repl. sol., the
>                        slaves are locked completely. In lazy repl. sol.,
>                        the locking is deferred until after the commit,
>                        during conflict resolution. In eager repl. sol.,
>                        the locking needs to take place before the commit.
>                        But all replication systems need some kind of
>                        locks!
>
>   - data loss on fail: solely dependent on eager/lazy. (Given a real
>                        replication, with a replica, which shared storage
>                        does not provide, IMO)
>
>   - slaves read only:  theoretically possible with all replication
>                        systems, be they lazy/eager, single-/multi-
>                        master. That we are unable to read from slave
>                        nodes is an implementation annoyance of
>                        Postgres, if you want.
>
>   - per table gran.:   again, independent of lazy/eager, single-/multi.
>                        Depends solely on the level where data is
>                        replicated: block device, file system, statement,
>                        WAL or other internal format.
>
>   - conflict resol.:   in multi master systems, that depends on the
>                        lazy/eager property. Single master systems
>                        obviously never need to resolve conflicts.

Right, but the point of the chart is to give people guidance, not to
give them details;  that is in the part above.

> IMO, "data partitioning" is entirely orthogonal to replication. It
> can be combined, in various ways. There's horizontal and vertical
> partitioning, eager/lazy and single-/multi-master replication. I guess
> we could find a use case for most of the combinations thereof. (Kudos
> for finding a combination which definitely has no use case).

Really?  Are you saying the office example is useless?  What is a good
use case for this?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Markus Schiltknecht

Hello Bruce,

thank you for your detailed answer.

Bruce Momjian wrote:
> Not sure if you were around when we wrote this chapter but there was a
> lot of good discussion to get it to where it is now.

Uh.. IIRC quite a good part of the discussion for chapter 23 was between
you and me, almost exactly a year ago. Or what discussion are you
referring to?

>> First of all, I don't quite like the negated formulations. I can see
>> that you want a dot to mark a positive feature, but I find it hard to
>> understand.
>
> Well, the idea is to say "what things do I want and what offers it?"  If
> you have positive/negative it makes it harder to do that.  I realize it
> is confusing in a different way.  We could split out the negatives into
> a different table but that seems worse.

Hm.. yeah, I can understand that. As those are things the user wants, I
think we could formulate positive wishes. Just a proposal:

No special hardware required:        works with commodity hardware

No conflict resolution necessary:    maintains durability property

master failure will never lose data: maintains durability
                                      on single node failure

With the other two I'm unsure.. I see it's very hard to find helpful
positive formulations...

>> What I'm especially puzzled about is the "master never locks others"
>> item. All of the first four, namely "shared disk failover", "file system
>> replication", "warm standby" and "master slave replication", block
>> others (the slaves) completely, which is about the worst kind of lock.
>
> That item assumes you have slaves that are trying to do work.

Yes, replication in general assumes that. So does high availability,
IMO. Having read-only slaves means nothing else but locking them from
write access.

> The point
> is that multi-master slows down the other slaves in a way no other
> option does,

Uh.. you mean the other masters? But according to that statement, "async
multi-master replication" as well as "statement-based replication
middleware" should not have a dot, because those slow down other
masters as well. In the async case at different points in time, yes, but
all masters have to write the data, which slows them down.

I'm suspecting you are rather talking about the network dependent commit
latency of eager replication solutions. I find the term "locking delay"
for that rather confusing. How about: "normal commit latency"? (Normal,
as in: depends on the storage system used, instead of on the network and
storage).

> which is the reason we don't support it yet.

Uhm.. PgCluster *is* a synchronous multi-master replication solution. It
also is a middleware and it does statement based replication. Which dots
of the matrix do you think apply for it?

>> Comparing "File System Replication" and "Shared Disk Failover",
>> you state that the former has "master server overhead", while the latter
>> doesn't. Seen solely from the single server node, this might be true.
>> But summed over the cluster, you have quite a similar network
>> load in both cases. I wouldn't say one has less overhead than the other
>> by definition.
>
> The point is that file system replication has to wait for the standby
> server to write the blocks, while disk failover does not.

In "disk failover", the master has to wait for the NAS to write the
blocks on mirrored disks, while in "file system replication" the master
has to wait for multiple nodes to write the blocks. As the nodes of a
replicated file system can write in parallel, very much like a RAID-1
NAS, I don't see that much of a difference there.

> I don't think
> the network is an issue considering many use NAS anyway.

I think you are comparing an enterprise NAS to a low-cost, commodity
hardware clustered filesystem. Take the same amount of money and the
same number of mirrors and you'll get comparable performance.

> There is no dot there so I am saying "statement based replication
> solution" requires conflict resolution.  Agreed you could do it without
> conflict resolution and it is kind of independent.  How should we deal
> with this?

Maybe a third state: 'n/a'?

>> And in the special case of (async, but eager) Postgres-R also to "async
>> multi-master replication" and "no conflict resolution necessary".
>> Although I can understand that that's a pretty nifty difference.
>
> Yea, the table isn't going to be 100% but tries to summarize what in the
> section above.

That's fine.

> [...]
>
> Right, but the point of the chart is to give people guidance, not to
> give them details;  that is in the part above.

Well, sure. But then we are back at the discussion of the parts above,
which is quite fuzzy, IMO. I'm still missing those details. And I'm
dubious about it being a basis for a feature matrix with clear dots or
no dots. For the reasons explained above.

>> IMO, "data partitioning" is entirely orthogonal to replication. It
>> can be combined, in various ways. There's horizontal and vertical
>> partitioning, eager/lazy and single-/multi-master replication. I guess
>> we could find a use case for most of the combinations thereof. (Kudos
>> for finding a combination which definitely has no use case).
>
> Really?  Are you saying the office example is useless?  What is a good
> use case for this?

Uhm, no sorry, I was unclear here. And not even correct. I was trying to
say that there's a use case for each and every combination of the three
properties above.

I'm now revoking one: "master-slave" combines very badly with "eager
replication". Because if you do eager replication, you can just as well have
multiple masters without any additional cost. So, only these three
combinations make sense:

  - lazy, master-slave
  - eager, master-slave
  - eager, multi-master

Now, no partitioning, horizontal as well as vertical partitioning can be
combined with any of the above replication methods, giving a total of
nine combinations, which all make perfect sense for certain applications.
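
For what it's worth, the nine combinations are easy to enumerate (a
purely illustrative Python snippet):

    from itertools import product

    replication = ["lazy master-slave", "eager master-slave",
                   "eager multi-master"]
    partitioning = ["none", "horizontal", "vertical"]

    # 3 replication methods x 3 partitioning choices = 9 combinations
    for r, p in product(replication, partitioning):
        print(f"{r} + {p} partitioning")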

If I understand correctly, your office example is about horizontal data
partitioning, with lazy, master-slave replication for the read-only copy
of the remote data. It makes perfect sense.


With regard to replication, there's another feature I think would be
worth mentioning: dynamic addition or removal of nodes (masters or
slaves). But that's solely implementation dependent, so it probably
doesn't fit into the matrix.

Another interesting property I'm missing is the existence of single
points of failure.

Regards

Markus


Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Bruce Momjian

Markus Schiltknecht wrote:
> Hello Bruce,
>
> thank you for your detailed answer.
>
> Bruce Momjian wrote:
> > Not sure if you were around when we wrote this chapter but there was a
> > lot of good discussion to get it to where it is now.
>
> Uh.. IIRC quite a good part of the discussion for chapter 23 was between
> you and me, almost exactly a year ago. Or what discussion are you
> referring to?

Sorry, I forgot who was involved in that discussion.

> >> First of all, I don't quite like the negated formulations. I can see
> >> that you want a dot to mark a positive feature, but I find it hard to
> >> understand.
> >
> > Well, the idea is to say "what things do I want and what offers it?"  If
> > you have positive/negative it makes it harder to do that.  I realize it
> > is confusing in a different way.  We could split out the negatives into
> > a different table but that seems worse.
>
> Hm.. yeah, I can understand that. As those are things the user wants, I
> think we could formulate positive wishes. Just a proposal:
>
> No special hardware required:        works with commodity hardware
>
> No conflict resolution necessary:    maintains durability property
>
> master failure will never lose data: maintains durability
>                                       on single node failure
>
> With the other two I'm unsure.. I see it's very hard to find helpful
> positive formulations...

Yea, that's where I got stuck --- that the positives were harder to
understand.

> >> What I'm especially puzzled about is the "master never locks others"
> >> item. All of the first four, namely "shared disk failover", "file system
> >> replication", "warm standby" and "master slave replication", block
> >> others (the slaves) completely, which is about the worst kind of lock.
> >
> > That item assumes you have slaves that are trying to do work.
>
> Yes, replication in general assumes that. So does high availability,
> IMO. Having read-only slaves means nothing else but locking them from
> write access.
>
> > The point
> > is that multi-master slows down the other slaves in a way no other
> > option does,
>
> Uh.. you mean the other masters? But according to that statement, "async

Sorry, I meant that a master that is modifying data is slowed down by
other masters to an extent that doesn't happen in other cases (e.g. with
slaves).  Is the current "No inter-server locking delay" OK?

> multi-master replication" as well as "statement-based replication
> middleware" should not have a dot, because those slow down other
> masters as well. In the async case at different points in time, yes, but
> all masters have to write the data, which slows them down.

Yea, that is why I have the new text about locking.

> I'm suspecting you are rather talking about the network dependent commit
> latency of eager replication solutions. I find the term "locking delay"
> for that rather confusing. How about: "normal commit latency"? (Normal,
> as in: depends on the storage system used, instead of on the network and
> storage).

Uh, I assume that multi-master locking often happens before the commit.

> > which is the reason we don't support it yet.
>
> Uhm.. PgCluster *is* a synchronous multi-master replication solution. It
> also is a middleware and it does statement based replication. Which dots
> of the matrix do you think apply for it?

I don't consider PgCluster middleware because the servers have to
cooperate with the middleware.  And I am told it is much slower for
writes than a single server, which supports my "locking" item, though it
is more "waiting for other masters" that is the delay, I think.

> >> Comparing "File System Replication" and "Shared Disk Failover",
> >> you state that the former has "master server overhead", while the latter
> >> doesn't. Seen solely from the single server node, this might be true.
> >> But summed over the cluster, you have quite a similar network
> >> load in both cases. I wouldn't say one has less overhead than the other
> >> by definition.
> >
> > The point is that file system replication has to wait for the standby
> > server to write the blocks, while disk failover does not.
>
> In "disk failover", the master has to wait for the NAS to write the
> blocks on mirrored disks, while in "file system replication" the master
> has to wait for multiple nodes to write the blocks. As the nodes of a
> replicated file system can write in parallel, very much like a RAID-1
> NAS, I don't see that much of a difference there.

I don't assume the disk failover has mirrored disks.  It can, just like a
single server can, but it isn't part of the backend process, and I
assume a RAID card that has RAM that can cache writes.  In the file
system replication case the server has to send commands to the
mirror and wait for completion.

> > I don't think
> > the network is an issue considering many use NAS anyway.
>
> I think you are comparing an enterprise NAS to a low-cost, commodity
> hardware clustered filesystem. Take the same amount of money and the
> same number of mirrors and you'll get comparable performance.

Agreed.  In the one case you are relying on another server, and in the
NAS case you are relying on a black box server.  I think the big
difference is that the other server is a separate entity, while the NAS
is a shared item.

> > There is no dot there so I am saying "statement based replication
> > solution" requires conflict resolution.  Agreed you could do it without
> > conflict resolution and it is kind of independent.  How should we deal
> > with this?
>
> Maybe a third state: 'n/a'?

Good idea, or "~".  How would middleware avoid conflicts, i.e. how would
it know that two incoming queries were in conflict?

> >> And in the special case of (async, but eager) Postgres-R also to "async
> >> multi-master replication" and "no conflict resolution necessary".
> >> Although I can understand that that's a pretty nifty difference.
> >
> > Yea, the table isn't going to be 100% but tries to summarize what is in the
> > section above.
>
> That's fine.
>
> > [...]
> >
> > Right, but the point of the chart is to give people guidance, not to
> > give them details;  that is in the part above.
>
> Well, sure. But then we are back at the discussion of the parts above,
> which is quite fuzzy, IMO. I'm still missing those details. And I'm
> dubious about it being a basis for a feature matrix with clear dots or
> no dots. For the reasons explained above.
>
> >> IMO, "data partitioning" is entirely orthogonal to replication. It
> >> can be combined, in various ways. There's horizontal and vertical
> >> partitioning, eager/lazy and single-/multi-master replication. I guess
> >> we could find a use case for most of the combinations thereof. (Kudos
> >> for finding a combination which definitely has no use case).
> >
> > Really?  Are you saying the office example is useless?  What is a good
> > use case for this?
>
> Uhm, no sorry, I was unclear here. And not even correct. I was trying to
> say that there's a use case for each and every combination of the three
> properties above.

OK.

> I'm now revoking one: "master-slave" combines very badly with "eager
> replication". Because if you do eager replication, you can just as well have
> multiple masters without any additional cost. So, only these three

Right.  I was trying to hit typical usages.

> combinations make sense:
>
>   - lazy, master-slave
>   - eager, master-slave
>   - eager, multi-master

Yep.

> Now, no partitioning, horizontal as well as vertical partitioning can be
> combined with any of the above replication methods, giving a total of
> nine combinations, which all make perfect sense for certain applications.
>
> If I understand correctly, your office example is about horizontal data
> partitioning, with lazy, master-slave replication for the read-only copy
> of the remote data. It makes perfect sense.

I did move it below and removed it from the chart because, as you say, how
to replicate to the slaves is an independent issue.

> With regard to replication, there's another feature I think would be
> worth mentioning: dynamic addition or removal of nodes (masters or
> slaves). But that's solely implementation dependent, so it probably
> doesn't fit into the matrix.

Yea, I had that but found you could add/remove slaves easily in most
cases.

> Another interesting property I'm missing is the existence of single
> points of failure.

Ah, yea, but then you get into power and fire issues.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Markus Schiltknecht

Hello Bruce,

Bruce Momjian wrote:
> Sorry, I forgot who was involved in that discussion.

Well, at least that means I didn't annoy you to death last time ;-)

>> With the other two I'm unsure.. I see it's very hard to find helpful
>> positive formulations...
>
> Yea, that's where I got stuck --- that the positives were harder to
> understand.

Okay, understood.

> Sorry, I meant that a master that is modifying data is slowed down by
> other masters to an extent that doesn't happen in other cases (e.g. with
> slaves).  Is the current "No inter-server locking delay" OK?

Yes, sort of. I know what you meant, but I find it hard to understand.
And with regard to anything except lazy or eager replication, it does
not make any sense. It's pretty moot saying anything about "inter-server
locking delays" for "statement-based replication middleware": you don't
know if it's lazy or eager. And all other solutions you are mentioning
are single-master or no replication at all. When there's only one
master, it's pretty obvious that there can't be any inter-(master)-server
locking delay. (Well, it's also very obvious that a single master never
'conflicts' with itself...)

Given you want to cover existing solutions, one could say that (AFAIK)
all statement based replication solutions are eager. But in that case,
the dot would be wrong, because the middleware would need to wait for at
least an absolute majority to confirm the commit. Which as well leads to
excessive locking, as you are saying for "synchronous multi-master
replication". Because it's a property inherent to eager multi-master
replication, as we correctly explain above the feature matrix.

>> multi-master replication" as well as "statement-based replication
>> middleware" should not have a dot, because those slow down other
>> masters as well. In the async case at different points in time, yes, but
>> all masters have to write the data, which slows them down.
>
> Yea, that is why I have the new text about locking.

To me this makes it sound like "statement-based replication" could be
faster than "synchronous multi-master replication". That's absolute
nonsense, since those two don't compare. Or to put it another way: most
"statement-based replication" solutions are "synchronous
multi-master replication" as well.

[ In that sense, stating that "PostgreSQL does not offer this kind of
replication" is wrong, under "Synchronous Multi-Master Replication". As
is the assumption that all those send "data changes". Probably you
should clarify that to say: "tuple based, eager multi-master
replication", because that's what you are talking about. ]

If you are comparing an eager, statement-based, multi-master replication
(like PgCluster) with an eager, tuple-based, multi-master replication
(like Postgres-R), the former can't possibly be faster than the latter.
I.e. it certainly doesn't have less (locking?) delays.

>>> which is the reason we don't support it yet.
>> Uhm.. PgCluster *is* a synchronous multi-master replication solution. It
>> also is a middleware and it does statement based replication. Which dots
>> of the matrix do you think apply for it?
>
> I don't consider PgCluster middleware because the servers have to
> cooperate with the middleware.

Okay, then take Sequoia: statement-based, middleware, synchronous (thus
eager) multi-master replication solution.

( I've never liked the term "middleware" in that chapter. It's solely a
question of implementation and does not have much to do with other
concepts of replication. )

> And I am told it is much slower for
> writes than a single server, which supports my "locking" item, though it
> is more "waiting for other masters" that is the delay, I think.

Uh.. with the dot there, you are saying that "statement based
middleware" does *not* have any inter-server locking delay.

What's the difference between "waiting for other masters" and "locking
delay"? What exactly do you consider a lock? Why should it be locking
when using binary-tuple replication, but not when using statement based
replication?

> I don't assume the disk failover has mirrored disks.  It can, just like a
> single server can, but it isn't part of the backend process, and I
> assume a RAID card that has RAM that can cache writes.

In that case, you'd lose the "master failure will never lose data"
property, no? Or do you trust the writeback cache and the connection to
the NAS that much as to assume it never fails?

>>> I don't think
>>> the network is an issue considering many use NAS anyway.
>> I think you are comparing an enterprise NAS to a low-cost, commodity
>> hardware clustered filesystem. Take the same amount of money and the
>> same number of mirrors and you'll get comparable performance.
>
> Agreed.  In the one case you are relying on another server, and in the
> NAS case you are relying on a black box server.  I think the big
> difference is that the other server is a separate entity, while the NAS
> is a shared item.

Correct, thus the former is a kind of single-master replication, while
the latter cannot be considered replication (lacking a replica). It's
rather a variant of how to enhance reliability of your single-master
database server system.

>>> There is no dot there so I am saying "statement based replication
>>> solution" requires conflict resolution.  Agreed you could do it without
>>> conflict resolution and it is kind of independent.  How should we deal
>>> with this?
>> Maybe a third state: 'n/a'?
>
> Good idea, or "~".  How would middleware avoid conflicts, i.e. how would
> it know that two incoming queries were in conflict?

A majority of servers rejecting or blocking the query? If a minority
blocks, the majority would win and apply the transaction, while the
minority would have to replay it? I
don't know, probably most solutions do something simpler, like aborting
a transaction even if only one server fails. Much simpler, and
sufficient for most cases.
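
As a rough sketch of that simple "abort even if only one server fails"
scheme (plain Python; the Node class and its methods are invented for
illustration, not taken from any real middleware):

    class Node:
        def __init__(self, name):
            self.name = name

        def execute(self, statement):
            # Stand-in for forwarding the statement to one backend;
            # a real backend would raise on a failure or conflict.
            return (self.name, "ok")

        def rollback(self):
            pass

    def run_on_all(nodes, statement):
        results = []
        try:
            for node in nodes:
                results.append(node.execute(statement))
        except Exception:
            # One node failed: abort everywhere rather than attempt
            # real conflict resolution or majority voting.
            for node in nodes:
                node.rollback()
            raise
        return results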

(Why do you ask me, I'm advocating internal, tuple level replication
with Postgres-R, not a statement based one :-) )

> I did move it below and removed it from the chart because, as you say, how
> to replicate to the slaves is an independent issue.

Okay, I like that better, thanks.

>> With regard to replication, there's another feature I think would be
>> worth mentioning: dynamic addition or removal of nodes (masters or
>> slaves). But that's solely implementation dependent, so it probably
>> doesn't fit into the matrix.
>
> Yea, I had that but found you could add/remove slaves easily in most
> cases.

Hm.. you're right.

>> Another interesting property I'm missing is the existence of single
>> points of failure.
>
> Ah, yea, but then you get into power and fire issues.

Which high-availability is all about, no?

But well, again, all kinds of replication (which excludes the NAS) can
theoretically be spread across the continent. So it might be pretty
useless to add dots for that.

Regards

Markus

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Bruce Momjian

Markus Schiltknecht wrote:
> Hello Bruce,
>
> Bruce Momjian wrote:
> > Sorry, I forgot who was involved in that discussion.
>
> Well, at least that means I didn't annoy you to death last time ;-)

Certainly not.  The more ideas the better.  I need all the help I can
get.

> > Sorry, I meant that a master that is modifying data is slowed down by
> > other masters to an extent that doesn't happen in other cases (e.g. with
> > slaves).  Is the current "No inter-server locking delay" OK?
>
> Yes, sort of. I know what you meant, but I find it hard to understand.
> And with regard to anything except lazy or eager replication, it does
> not make any sense. It's pretty moot saying anything about "inter-server
> locking delays" for "statement-based replication middleware": you don't
> know if it's lazy or eager. And all other solutions you are mentioning

I think the point is that with middleware each server is at least
working simultaneously, while with multi-master they aren't, at least in
most current implementations, no?  Now you can end up being as slow as
the slowest server but that seems pretty hard to represent, no?

> are single-master or no replication at all. When there's only one
> master, it's pretty obvious that there can't be any inter-(master)-server
> locking delay. (Well, it's also very obvious that a single master never
> 'conflicts' with itself...)

Totally agree.  What I need is a negative for multi-master so it is
clear why that option isn't used 100% of the time.  The text above
clearly describes the reason, but how to do that in a bullet?

I was thinking I could take "No master server overhead" and somehow make
multi-master double-cost by using two bullets, but because it is a
negative I can't.  :-(  We could just remove "No inter-server locking
delay" and assume the "No master server overhead" represents the locking
overhead but that kind of loses the distinction that the multi-master
has much higher overhead.  If you look at the chart it is kind of like
we have two items "no overhead" and "no significant overhead".  Would
that be better?

> Given you want to cover existing solutions, one could say that (AFAIK)
> all statement based replication solutions are eager. But in that case,

Agreed.

> the dot would be wrong, because the middleware would need to wait for at
> least an absolute majority to confirm the commit. Which as well leads to
> excessive locking, as you are saying for "synchronous multi-master
> replication". Because it's a property inherent to eager multi-master
> replication, as we correctly explain above the feature matrix.

See my comments above on simultaneous.

> >> multi-master replication" as well as "statement-based replication
> >> middleware" should not have a dot, because those slow down other
> >> masters as well. In the async case at different points in time, yes, but
> >> all masters have to write the data, which slows them down.
> >
> > Yea, that is why I have the new text about locking.
>
> To me this makes it sound like "statement-based replication" could be
> faster than "synchronous multi-master replication". That's absolute
> nonsense, since those two don't compare. Or to put it another way: most
> "statement-based replication" solutions are "synchronous
> multi-master replication" as well.

Agreed "statement-based replication" in a way offers multi-master
capabilities, but it has limitations, as outlined in
the doc details.  What I have done is changed the text to "No waiting
for multiple servers" and removed bullets from the appropriate
solutions. Is this better?

> >>> which is the reason we don't support it yet.
> >> Uhm.. PgCluster *is* a synchronous multi-master replication solution. It
> >> also is a middleware and it does statement based replication. Which dots
> >> of the matrix do you think apply for it?
> >
> > I don't consider PgCluster middleware because the servers have to
> > cooperate with the middleware.
>
> Okay, then take Sequoia: statement-based, middleware, synchronous (thus
> eager) multi-master replication solution.
>
> ( I've never liked the term "middleware" in that chapter. It's solely a
> question of implementation and does not have much to do with other
> concepts of replication. )

I had middleware in there because of the problem middleware has with
sequences and current_timestamp, i.e. you need to adjust the application
to deal with those sometimes.
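
To illustrate, here is a rough sketch of pushing that adjustment into
the middleware instead: evaluate the non-deterministic parts once and
broadcast literals (Python; the rewrite rules are invented for
illustration and far cruder than anything a real product ships):

    import re
    from datetime import datetime, timezone

    def rewrite(sql, next_sequence_value):
        # If every node evaluated current_timestamp itself, each
        # replica would store a slightly different value, so
        # substitute a single literal instead.
        now = datetime.now(timezone.utc).isoformat()
        sql = re.sub(r"current_timestamp", f"'{now}'", sql,
                     flags=re.IGNORECASE)
        # Same for nextval(): the middleware must hand out the value
        # so all nodes agree on it.
        sql = re.sub(r"nextval\('[^']*'\)",
                     str(next_sequence_value), sql)
        return sql

    print(rewrite("INSERT INTO log VALUES"
                  " (nextval('log_id_seq'), current_timestamp)", 42))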

> > I don't assume the disk failover has mirrored disks.  It can, just like a
> > single server can, but it isn't part of the backend process, and I
> > assume a RAID card that has RAM that can cache writes.
>
> In that case, you'd lose the "master failure will never lose data"
> property, no? Or do you trust the writeback cache and the connection to
> the NAS that much as to assume it never fails?

My assumption is that the _shared_ disk is not part of the master
itself.  Of course if the shared disk fails you are out of luck, which
is mentioned above.

> >>> I don't think
> >>> the network is an issue considering many use NAS anyway.
> >> I think you are comparing an enterprise NAS to a low-cost, commodity
> >> hardware clustered filesystem. Take the same amount of money and the
> >> same number of mirrors and you'll get comparable performance.
> >
> > Agreed.  In the one case you are relying on another server, and in the
> > NAS case you are relying on a black box server.  I think the big
> > difference is that the other server is a separate entity, while the NAS
> > is a shared item.
>
> Correct, thus the former is a kind of single-master replication, while
> the latter cannot be considered replication (lacking a replica). It's
> rather a variant of how to enhance reliability of your single-master
> database server system.

Right, which is why we call it "high availability" rather than
replication.

> >>> There is no dot there so I am saying "statement based replication
> >>> solution" requires conflict resolution.  Agreed you could do it without
> >>> conflict resolution and it is kind of independent.  How should we deal
> >>> with this?
> >> Maybe a third state: 'n/a'?
> >
> > Good idea, or "~".  How would middleware avoid conflicts, i.e. how would
> > it know that two incoming queries were in conflict?
>
> A majority of servers rejecting or blocking the query? If a minority
> blocks, the majority would win and apply the transaction, while the
> minority would have to replay it? I
> don't know, probably most solutions do something simpler, like aborting
> a transaction even if only one server fails. Much simpler, and
> sufficient for most cases.

Right, which I think we can call conflict resolution (abort on failure).

> (Why do you ask me, I'm advocating internal, tuple level replication
> with Postgres-R, not a statement based one :-) )

Sure.

> > I did move it below and removed it from the chart because, as you say, how
> > to replicate to the slaves is an independent issue.
>
> Okay, I like that better, thanks.
>
> >> With regard to replication, there's another feature I think would be
> >> worth mentioning: dynamic addition or removal of nodes (masters or
> >> slaves). But that's solely implementation dependent, so it probably
> >> doesn't fit into the matrix.
> >
> > Yea, I had that but found you could add/remove slaves easily in most
> > cases.
>
> Hm.. you're right.
>
> >> Another interesting property I'm missing is the existence of single
> >> points of failure.
> >
> > Ah, yea, but then you get into power and fire issues.
>
> Which high-availability is all about, no?
>
> But well, again, all kinds of replication (which excludes the NAS) can
> theoretically be spread across the continent. So it might be pretty
> useless to add dots for that.

Yea.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Markus Schiltknecht

Hello Bruce,

Bruce Momjian wrote:
> I think the point is that with middleware each server is at least
> working simultaneously, while with multi-master they aren't, at least in
> most current implementations, no?

Current implementations include PgCluster, which calls itself a
multi-master replication solution. I definitely also consider it that.

You are stating that PgCluster is a replication middleware and thus not
a multi-master replication suite. That's very confusing, IMO.

>> are single-master or no replication at all. When there's only one
>> master, it's pretty obvious that there can't be any inter-(master)-server
>> locking delay. (Well, it's also very obvious that a single master never
>> 'conflicts' with itself...)
>
> Totally agree.  What I need is a negative for multi-master so it is
> clear why that option isn't used 100% of the time.  The text above
> clearly describes the reason, but how to do that in a bullet?

Ah, I see where you are coming from. We certainly need a negative for
eager multi-master, even if it's my favorite discipline :-)

I'm fine with the current term ("no waiting for multiple servers"),
because it's a replication delay inherent to eager multi-master
replication - no matter if statement based or tuple based, or if it's
tightly woven into the database system or implemented in a middleware.

> I was thinking I could take "No master server overhead" and somehow make
> multi-master double-cost by using two bullets, but because it is a
> negative I can't.  :-(  We could just remove "No inter-server locking
> delay" and assume the "No master server overhead" represents the locking
> overhead but that kind of loses the distinction that the multi-master
> has much higher overhead.  If you look at the chart it is kind of like
> we have two items "no overhead" and "no significant overhead".  Would
> that be better?

I don't think that would be better, because it's even less clear.

"No master server overhead" and "No waiting for multiple servers" is
good enough, IMO.

> Agreed "statement-based replication" in a way offers multi-master
> capabilities, but it has limitations, as outlined in
> the doc details.  What I have done is changed the text to "No waiting
> for multiple servers" and removed bullets from the appropriate
> solutions. Is this better?

Yup, that's fine with me.

> I had middleware in there because of the problem middleware has with
> sequences and current_timestamp, i.e. you need to adjust the application
> to deal with those sometimes.

..or let the middleware do some parsing and introduce logic to handle that.
AFAICT, that's what the Sequoia people are doing.

> My assumption is that the _shared_ disk is not part of the master
> itself.  Of course if the shared disk fails you are out of luck, which
> is mentioned above.

Understood. However, please be aware that you are comparing parts of a
clustered database (in case of the NAS) to the full cluster (all other
cases).

>> A majority of servers rejecting or blocking the query? If a minority
>> blocks, the majority would win and apply the transaction, while the
>> minority would have to replay it? I
>> don't know, probably most solutions do something simpler, like aborting
>> a transaction even if only one server fails. Much simpler, and
>> sufficient for most cases.
>
> Right, which I think we can call conflict resolution (abort on failure).

Yes.

Regards

Markus

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Bruce Momjian

Markus Schiltknecht wrote:
> Hello Bruce,
>
> Bruce Momjian wrote:
> > I think the point is that with middleware each server is at least
> > working simultaneously, while with multi-master they aren't, at least in
> > most current implementations, no?
>
> Current implementations include PgCluster, which calls itself a
> multi-master replication solution. I definitely also consider it that.
>
> You are stating that PgCluster is a replication middleware and thus not
> a multi-master replication suite. That's very confusing, IMO.

Uh, I think of PgCluster as multi-master, but in a way it is a hybrid
because there is a central server that gets all the queries.

> >> are single-master or no replication at all. When there's only one
> >> master, it's pretty obvious that there can't be any inter-(master)-server
> >> locking delay. (Well, it's also very obvious that a single master never
> >> 'conflicts' with itself...)
> >
> > Totally agree.  What I need is a negative for multi-master so it is
> > clear why that option isn't used 100% of the time.  The text above
> > clearly describes the reason, but how to do that in a bullet?
>
> Ah, I see where you are coming from. We certainly need a negative for
> eager multi-master, even if it's my favorite discipline :-)
>
> I'm fine with the current term ("no waiting for multiple servers"),
> because it's a replication delay inherent to eager multi-master
> replication - no matter if statement based or tuple based, or if it's
> tightly woven into the database system or implemented in a middleware.

Good.

> > I was thinking I could take "No master server overhead" and somehow make
> > multi-master double-cost by using two bullets, but because it is a
> > negative I can't.  :-(  We could just remove "No inter-server locking
> > delay" and assume the "No master server overhead" represents the locking
> > overhead but that kind of loses the distinction that the multi-master
> > has much higher overhead.  If you look at the chart it is kind of like
> > we have two items "no overhead" and "no significant overhead".  Would
> > that be better?
>
> I don't think that would be better, because it's even less clear.
>
> "No master server overhead" and "No waiting for multiple servers" is
> good enough, IMO.

Good.

> > Agreed "statement-based replication" in a way offers multi-master
> > capabilities, but it has limitations, as outlined in
> > the doc details.  What I have done is changed the text to "No waiting
> > for multiple servers" and removed bullets from the appropriate
> > solutions. Is this better?
>
> Yup, that's fine with me.
>
> > I had middleware in there because of the problem middleware has with
> > sequences and current_timestamp, i.e. you need to adjust the application
> > to deal with those sometimes.
>
> ..or let the middleware do some parsing and introduce logic to handle that.
> AFAICT, that's what the Sequoia people are doing.
>
> > My assumption is that the _shared_ disk is not part of the master
> > itself.  Of course if the shared disk fails you are out of luck, which
> > is mentioned above.
>
> Understood. However, please be aware that you are comparing parts of a
> clustered database (in case of the NAS) to the full cluster (all other
> cases).

Yes, and the section above outlines those issues, I think.

> >> A majority of servers rejecting or blocking the query? If a minority
> >> blocks, the majority would win and apply the transaction, while the
> >> minority would have to replay it? I
> >> don't know, probably most solutions do something simpler, like aborting
> >> a transaction even if only one server fails. Much simpler, and
> >> sufficient for most cases.
> >
> > Right, which I think we can call conflict resolution (abort on failure).
>
> Yes.

Good.  Let me know if you think of other ideas.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Markus Schiltknecht

Hello Bruce,

Bruce Momjian wrote:
> Uh, I think of PgCluster as multi-master, but in a way it is a hybrid
> because there is a central server that gets all the queries.

Yes, PgCluster as well as Sequoia use statement based replication.
Sequoia is also clearly a middleware (no changes to Postgres needed).

Both suffer from the limitations you describe in "statement based
replication middleware". AFAICT Sequoia does quite well in circumventing
those. (Heck, it even tries to mask differences between database
systems, so you can keep a Postgres database in sync with a MySQL one.)

Depending on the RAIDb level you are using, Sequoia can be considered
multi-master (RAIDb-1) or single-master (RAIDb-0). Also note that
Sequoia can run multiple controllers, thus it does not rely on one
central server.

So, at least Sequoia is clearly a hybrid, in between your definitions of
"statement based replication middleware" and "synchronous multi-master
replication". Depending on how much of a "middleware" you consider
PgCluster to be, it's also a hybrid. Certainly it does statement based
replication.

Regards

Markus


Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Bruce Momjian

Markus Schiltknecht wrote:
> Hello Bruce,
>
> Bruce Momjian wrote:
> > Uh, I think of PgCluster as multi-master, but in a way it is a hybrid
> > because there is a central server that gets all the queries.
>
> Yes, PgCluster as well as Sequoia use statement based replication.
> Sequoia is also clearly a middleware (no changes to Postgres needed).
>
> Both suffer from the limitations you describe in "statement based
> replication middleware". AFAICT Sequoia does quite well in circumventing
> those. (Heck, it even tries to mask differences between database
> systems, so you can keep a Postgres database in sync with a MySQL one.)
>
> Depending on the RAIDb level you are using, Sequoia can be considered
> multi-master (RAIDb-1) or single-master (RAIDb-0). Also note that
> Sequoia can run multiple controllers, thus it does not rely on one
> central server.

But in those cases isn't the multi-master just at the storage level?  I
don't consider that multi-"master".

> So, at least Sequoia is clearly a hybrid, in between your definitions of
> "statement based replication middleware" and "synchronous multi-master
> replication". Depending on how much of a "middleware" you consider
> PgCluster to be, it's also a hybrid. Certainly it does statement based
> replication.

I am afraid we are stuck between clarity and understandability here. ;-)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Markus Schiltknecht

Hello Bruce,

Bruce Momjian wrote:
>> Depending on the RAIDb level you are using, Sequoia can be considered
>> multi-master (RAIDb-1) or single-master (RAIDb-0). Also note that
>> Sequoia can run multiple controllers, thus it does not rely on one
>> central server.
>
> But in those cases isn't the multi-master just at the storage level?  I
> don't consider that multi-"master".

Eh.. I think you misunderstood. The Sequoia people use RAIDb to mean
Redundant Array of Inexpensive _Databases_. A possible setup might look
like:

       controller  <-->   controller  <-->  controller
        /   |  \   (GCS)   /  |   \   (GCS)   /  |   \
       /    |   \         /   |    \         /   |    \
      DB    DB   DB      DB   DB    DB      DB   DB    DB
      |      |     .     .   .     .        .    .     .
     local  local
     disk   disk

(controllers may as well run on the same physical node as the DB itself)

Given we are talking about replication (mirroring), every database in
the scenario above hosts a replica of the data and has to apply all
(writing) transactions. That's pretty much what I'd call a master. The
complete system can be considered a (statement based) multi-master
replication solution. No shared storage or clustering file system is
involved.

With RAIDb-0, where they distribute tables across different databases,
we'd have a single-master solution. (Or rather data partitioning, since
there's no replica).

Then again, with RAIDb-2, which is what they call the mix of the two,
i.e. combining mirroring and partitioning, we are back at a multi-master
configuration, where only a subset of the nodes are masters.
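
A toy model of the difference, again in Python (the routing logic is
invented to illustrate the concept, not taken from Sequoia/C-JDBC):

    # RAIDb-1 (mirroring): every backend receives every write, so
    # each backend is a master holding a full replica.
    def raidb1_route(backends, statement):
        return list(backends)              # broadcast to all

    # RAIDb-0 (partitioning): each table lives on exactly one
    # backend, so there is no replica at all.
    def raidb0_route(backends, statement, table_owner):
        return [table_owner[statement["table"]]]

    # RAIDb-2 mixes the two: mirror some tables, partition others.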

See also [1] for better diagrams and explanations of their RAIDb concept.

As a side note: I personally don't like the name RAIDb and even less the
numbering. I prefer talking about replication (or mirroring) and
partitioning, as that's more meaningful than numbers.

> I am afraid we are stuck between clarity and understandability here. ;-)

Agreed, but that's where I think the current chapter creates confusion
by trying to separate into "statement based replication middleware" and
"synchronous multi-master replication". Such a separation does not
exist; instead, every combination of single- vs. multi-master and
statement based vs tuple based is possible. Examples:

single-master, tuple based: Slony-I, Mammoth Replicator
multi-master, tuple based: Postgres-R, Slony-II, maybe Bucardo (?)
single-master, statement based: maybe pgpool or skytools can do that (?)
multi-master, statement based: Sequoia, PgCluster

But IIRC we already had that discussion a year ago.

Regards

Markus

[1] C-JDBC site with samples
http://c-jdbc.objectweb.org/current/doc/userGuide/html/ar01s10.html

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Bruce Momjian

Markus Schiltknecht wrote:
> Eh.. I think you misunderstood. The Sequoia people use RAIDb to mean
> Redundant Array of Inexpensive _Databases_. A possible setup might look
> like:

You are right.  I didn't understand that.  Interesting.

> > I am afraid we are stuck between clarity and understandability here. ;-)
>
> Agreed, but that's where I think the current chapter creates confusion
> by trying to separate into "statement based replication middleware" and
> "synchronous multi-master replication". Such a separation does not
> exist; instead, every combination of single- vs. multi-master and
> statement based vs tuple based is possible. Examples:
>
> single-master, tuple based: Slony-I, Mammoth Replicator
> multi-master, tuple based: Postgres-R, Slony-II, maybe Bucardo (?)
> single-master, statement based: maybe pgpool or skytools can do that (?)
> multi-master, statement based: Sequoia, PgCluster

Uh, to me the issue is something like pgpool and Sequoia, where the
_master_/replication is happening _outside_ the server, vs something
like Oracle RAC where it is happening inside the server.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: High Availability, Load Balancing, and Replication Feature Matrix

From: Markus Schiltknecht

Hello Bruce,

Bruce Momjian wrote:
> Uh, to me the issue is something like pgpool and Sequoia, where the
> _master_/replication is happening _outside_ the server

Well, you are saying that the controllers are the masters and do
replication. I can see the reasoning behind it: they are the only nodes
which allow write access, seen from the outside.

However, I consider these controllers to be neither masters nor slaves,
because they don't carry a replica of the data. Instead I'm considering
the database nodes which are (synchronously or not) processing the
writing transactions on behalf of the controller to be the masters. They
do all the work and the locking, and they carry the replicated data.

PgCluster (and therefore Cybercluster, too) seem to follow my
definition, as they advertise themselves as multi-master
replication solutions (even though they only support a single
controller, AFAICT).

I didn't find any self-description of PgPool's replication feature, nor
of Sequoia's. However, I'd argue that both are generally considered
synchronous multi-master replication solutions as well, even if there's
only one controller.

> vs something
> like Oracle RAC where it is happening inside the server.

..or like Postgres-R :-)

Regards

Markus