Thread: PostgreSQL Documentation of High Availability and Load Balancing

PostgreSQL Documentation of High Availability and Load Balancing

From
Markus Schiltknecht
Date:
Hi,

We are about to extend the PostgreSQL documentation for high
availability and load balancing. We've had some discussions so far, but
only few people have spoken up until now. I'd really like to get some
more input from all the different projects providing solutions in this area.

You can find the current draft here:
http://momjian.us/main/writings/pgsql/sgml/high-availability.html

Please have a look and check if your project fits somewhere and if the
description is correct. I'll happily forward your inputs if you are not
subscribed to the (AFAIK closed) PostgreSQL docs mailing list.

Regards

Markus

Re: PostgreSQL Documentation of High Availability and Load Balancing

From
Markus Schiltknecht
Date:
Hello Jose,

José Orlando Pereira wrote:
> Hi,
>
> Thanks for the heads up. I mostly agree with the text (except maybe the
> paragraph on Oracle RAC and 2PC).

Hm, what's wrong with that? Okay, we should better not mention Oracle
RAC there, but it is a product doing 'Multi-Master Replication Using
Clustering', isn't it?

> Regarding GORDA, I would not try to fit it into existing bullets, as our
> current prototype implements several variations of query broadcasting,
> multi-master and master/slave, although none with production quality.

AFAIK, we want to help the users to get a general understanding about
replication, high availability and load balancing. So as to give them a
good starting point on their search for a solution to their problem(s).

Thus it's okay if you say GORDA is prototyping several of the algorithms
mentioned. I just would like to make sure that the descriptions are
general enough to be appropriate for most implementations.

For example, recently Josh Berkus proposed, that 'statement-based
replication' was the more common term for 'query broadcasting'. Do you
agree with that? What did you call it? Does the description fit?

> Instead, I'd point out PostgreSQL's friendliness to research and the resulting
> prototypes that stem from academia. This can help to draw community attention
> to our efforts.

Thank you for your suggestions. And I'm glad you're seeing PostgreSQL
that way. But I think your additions don't quite fit into the
documentation because they are too promotional.

I should probably have given some more details about previous
discussions. One consensus we've reached was, that we don't want to keep
a quickly changing list of open source projects in our documentation. So
probably nether GORDA nor Postgres-R will get mentioned there.

But for sure we'll compile a list of available replication solutions
*somewhere* on the website, where we can change it as often as we want.
I'm sure GORDA will get mentioned there.

@pgsql-docs: BTW, what's the state on that one? Or do I have to ask that
on -www?

Regards

Markus

Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

From
Markus Schiltknecht
Date:
Hi,

a.mitani@sra-europe.com wrote:
> Current generation of PGCluster is a Shared-Nothing type of multi-master
> and syncronous replication system.

Thank you for pointing us to yet another very common distinction in the
clustering world: shared-nothing vs. shared-disk or even
shared-everything. We don't touch that in the current documentation. Do
we want or need to do so?

> I think that the feature of this type of replication system is as the
> 'Multi-Master Replication Using Clustering' chapter of your document.

Most probably, yes. Please note that it's not *my* document :-)  Bruce
Momjian wrote most of it, with only some hints and annoying nit-picking
from my side.

> However, Oracle RAC is a Shared-Everything type of multi-master clustering
> system. If it set up appropriately, most of these limitations would be
> improved.

Shared-Everything, really? I thought they did their own distributed
shared memory or distributed locking stuff, so it would be shared-disk.
And together with their OCFS, they would reach shared-nothing. But I
don't really know.

@pgsql-docs: I'd strongly vote for not mentioning Oracle if we don't
event want to mention proprietary products for PostgreSQL. There are
enough research or ongoing projects (even some ongoing reserch projects
;-)  ) to mention. PgCluster-II, GORDA, Slony-II or Postgres-R come to mind.

> Next generation of PGCluster (I named PGCluster-II) will be a
> Shared-Everything type of multi-master clustering system as demonstrated
> in Toronto.

Yeah, I remember that demonstration. Do you think PGCluster-II fits
what's described under 'Multi-Master Replication Using Clustering'? Do
you think we should explain Shared-Nothing vs. Shared-Disk vs.
Shared-Everything there?

Regards

Markus

Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

From
Markus Schiltknecht
Date:
Hi,

a.mitani@sra-europe.com wrote:
> I believe that shared-everything and shared-disk are cmpletely different
> things.

No. AFAIK, shared everything is basically a marketing term. Some vendors
(especially Oracle) use it to mean 'shared disk', which is really
confusing. But that's probably intentional, as it sounds good to share
everything... much better than sharing nothing.

Others, like GreenPlum use the term to mean 'shared memory' for example
in [1], which is much more appropriate.

Anyway, the term 'Shared Nothing' seems to go back to Stonebraker's
paper 'The Case for Shared Nothing Architecture' [2]. There he defined
these three terms:

shared memory (SM):  multiple processors share a common central memory
shared disk (SD):    multiple processors each with private memory share
                     a common collection of disks
shared nothing (SN): neither memory nor peripheral storage is shared
                     among processors

As it makes no sense to have shared memory but individual disks, no term
for such a thing got defined.

> Oracle has said the RAC as Shared-Everything.
> Probably, the definition of the words would differ from them.

It's hard to find architectural details within their documents, but
AFAICS, they do something like distributed shared memory or distributed
locking, which they call "Cache Fusion", see [3].

In the very same paper, they define what a Cluster is for them: "A
cluster is a group of independent servers that cooperate as a single
system. The primary cluster components are processor nodes, a cluster
interconnect, and a shared disk subsystem. The clusters share disk
access and resources that manage data, but the distinct hardware cluster
nodes do not share memory."

Thus, Oracle RAC seems to be a shared disk solution. Only in conjunction
with their OCFS, you could probably call it a shared-nothing solution,
but it's certainly not a shared-memory thing.

I'd vote for explaining these terms in the PostgreSQL documentation, as
there seems to be a lot of confusion regarding these terms.

Regards

Markus

[1]: GreenPlum about Shared Nothing vs Shared Everything:
http://www.greenplum.com/products/sharedNothing.php

[2]: Michael Stonebraker, The Case for Shared Nothing Architecture:
http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf

[3]: A random Oracle Paper about "Cache Fusion" (tm):
http://www.oracle.com/technology/products/oracle9i/pdf/cache_fusion_rel2.pdf

Re: [Pgcluster-general] PostgreSQL Documentation of High

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> Hi,
>
> a.mitani@sra-europe.com wrote:
> > Current generation of PGCluster is a Shared-Nothing type of multi-master
> > and syncronous replication system.
>
> Thank you for pointing us to yet another very common distinction in the
> clustering world: shared-nothing vs. shared-disk or even
> shared-everything. We don't touch that in the current documentation. Do
> we want or need to do so?

I feel the shared-* issue splits us up like master/slave and
multi-master splits up --- it added more confusion than clarity, because
many solutions fell in the middle.

> > I think that the feature of this type of replication system is as the
> > 'Multi-Master Replication Using Clustering' chapter of your document.
>
> Most probably, yes. Please note that it's not *my* document :-)  Bruce
> Momjian wrote most of it, with only some hints and annoying nit-picking
> from my side.
>
> > However, Oracle RAC is a Shared-Everything type of multi-master clustering
> > system. If it set up appropriately, most of these limitations would be
> > improved.
>
> Shared-Everything, really? I thought they did their own distributed
> shared memory or distributed locking stuff, so it would be shared-disk.
> And together with their OCFS, they would reach shared-nothing. But I
> don't really know.

Yea, gets confusing.

> @pgsql-docs: I'd strongly vote for not mentioning Oracle if we don't
> event want to mention proprietary products for PostgreSQL. There are
> enough research or ongoing projects (even some ongoing reserch projects
> ;-)  ) to mention. PgCluster-II, GORDA, Slony-II or Postgres-R come to mind.

Good point. I mentioned Oracle RAC only because it seems to be an
industry standard, so by mentioning it, people know exactly what we are
talking about.  Is there a better way?  And people do ask for Oracle
RAC, so in a way we are telling them we don't have something similar.
As sad as that is, it is true currently.

> > Next generation of PGCluster (I named PGCluster-II) will be a
> > Shared-Everything type of multi-master clustering system as demonstrated
> > in Toronto.
>
> Yeah, I remember that demonstration. Do you think PGCluster-II fits
> what's described under 'Multi-Master Replication Using Clustering'? Do
> you think we should explain Shared-Nothing vs. Shared-Disk vs.
> Shared-Everything there?

pgcluster is must closer to Oracle RAC, but I haven't mentioned it
because I am unsure where it is in terms of usability and stability.
Comments?

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: PostgreSQL Documentation of High Availability and

From
Bruce Momjian
Date:
Nice idea.

---------------------------------------------------------------------------

Markus Schiltknecht wrote:
> Hi,
>
> We are about to extend the PostgreSQL documentation for high
> availability and load balancing. We've had some discussions so far, but
> only few people have spoken up until now. I'd really like to get some
> more input from all the different projects providing solutions in this area.
>
> You can find the current draft here:
> http://momjian.us/main/writings/pgsql/sgml/high-availability.html
>
> Please have a look and check if your project fits somewhere and if the
> description is correct. I'll happily forward your inputs if you are not
> subscribed to the (AFAIK closed) PostgreSQL docs mailing list.
>
> Regards
>
> Markus
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                 http://www.postgresql.org/about/donate

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [Pgcluster-general] PostgreSQL Documentation of

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> In the very same paper, they define what a Cluster is for them: "A
> cluster is a group of independent servers that cooperate as a single
> system. The primary cluster components are processor nodes, a cluster
> interconnect, and a shared disk subsystem. The clusters share disk
> access and resources that manage data, but the distinct hardware cluster
> nodes do not share memory."
>
> Thus, Oracle RAC seems to be a shared disk solution. Only in conjunction
> with their OCFS, you could probably call it a shared-nothing solution,
> but it's certainly not a shared-memory thing.
>
> I'd vote for explaining these terms in the PostgreSQL documentation, as
> there seems to be a lot of confusion regarding these terms.

OK, but how does explaining the terms help our users?

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [Sequoia] PostgreSQL Documentation of High Availability and Load

From
Markus Schiltknecht
Date:
Hi,

Emmanuel Cecchet wrote:
> I think that you can still have a matrix with the main
> features(performance, data loss/failover/failback on node failure,
> Disaster recovery, WAN, ...) and how each approach (master/slave, shared
> disk, multi-master, ...) addresses the issue.

Yes, I certainly agree with that.

>> These are good questions to analyze a certain solution. As far as our
>> documentation is concerned, I think giving rough estimates for
>> categories of replication algorithms is sufficient (i.e. stating that
>> Multi Master Replication scales very good for reading transactions,
>> but not very well for writing ones).
> Even here I think that there is a common misconception between
> performance and scalability. Most people think that by having multiple
> nodes their query will run faster which is obviously wrong if your
> original workload does not saturate a single node.

Sure. Do you think that should be made clearer?

> The replication
> mechanisms are even adding overhead (usually perceived as increased
> latency) to the query execution. It is ONLY when the workload increases
> that you can see throughput going up (ideally somewhat close to the
> workload increase) and query latency remaining stable. Unless you really
> have a parallel query execution (that is only efficient for big queries
> anyway), you will never see a performance improvement on a single query
> execution since this is always the same database engine that executes
> the query in the end.

I don't quite agree with that statement, but probably I'm just
misreading it. If you have enough concurrent transactions you can spread
among the nodes, you'll certainly note an improvement. After all, it's a
huge difference, if your single node is processing only ten or hundreds
of concurrent transactions.

Of course, the amount of concurrent transactions limits how far a
replication solution can scale. Having more nodes than concurrent
transactions does not make sense. (Of course with the exception of
parallel query execution.)

> But you are right that full replication (in shared nothing environments)
> does not perform with write heavy workload. At best it will go to the
> speed of the fastest node in the cluster, but it will usually degrade
> quickly. A good replication implementation will have a constant overhead
> on query execution time (let's say few millliseconds). Therefore the
> impact will be quite different if this is a small query or a
> long-running query. Adding few milliseconds to a query that takes
> seconds to execute is negligible but adding the same time to a
> sub-millisecond query will be a tremendous slowdown (in term of latency).
> To summarize, clustering solutions provide performance scalability
> (stable latency, throughput increasing almost linearly with load) but
> not performance improvement on individual query execution time.

Yes, for writing transactions, no for read-only ones (queries?). Or why
do you have to add overhead to read-only queries?

> If the
> client application is not multithreaded it is very unlikely that any
> solution will improve the application performance.

Ehm.. I wouldn't refer to threading here. You can very well have
multiple single-process programs running on different nodes...

I'd keep referring to concurrency of transactions.

> As an additional point, transactions including calls such as 'select
> nextval' should be considered as write transactions with PostgreSQL.

Sure.

> That might not be obvious for most users.

Agreed.

>>> When configured with RAIDb-1,
>> I know RAID-1, but what's a RAIBd-1?
> RAIDb is an acronym for Redundant Array of Inexpensive Databases.
> You can find an article on this at
> http://c-jdbc.objectweb.org/current/doc/RR-C-JDBC.pdf

Aha, thank you.

> That's great that the work was revived in 8.2. Yes, Postgres-R is much
> more embedded in Postgres but I was confused with Middle-R that was done
> later on with Bettina and Ricardo using a similar technique at the
> middlware level.

Yeah, I thought you meant that one. I don't know Middle-R at all, sorry.
Seems similar to sequoia. Did you base your work on Middle-R?

> What are your development plans for Postgres-R?

To make it work and production ready as soon as possible. ;-)  I'm
currently working on initialization and recovery.

Regards

Markus


Re: [Sequoia] PostgreSQL Documentation of High Availability and Load

From
Markus Schiltknecht
Date:
Hello Emmanuel,

Emmanuel Cecchet wrote:
> I just quickly went through the classification.
> I don't think that your description of 'multi-master replication using
> clustering' is correct. Oracle RAC is a shared disk approach and just
> send cache invalidations to other nodes but not actual data. As the disk
> is shared, data is only commited once to disk and there is a distributed
> locking protocol to make nodes agree on a serializable transactional order.

I agree with that. AFAI understand, we are not entirely sure how much of
that we want to cover in our documentation.

> You can have a look at the last ApacheCon presentation I gave
> (http://www.continuent.org/uploads/sequoia/Resources/2006-08-15Cecchet_ApacheConAsia2006.pdf),

thank you for that link, I'll read through it ASAP.

> there is a comparison of most solutions. I think that it would be cool
> to have a comparison matrix with the different features that you may
> need in an HA solution and see how the different solutions can provide
> an answer to these problems.

We've already come to the consensus, that we don't want to list all
products in our documentation, but better do that on the website, where
it can easily be changed at any time. The documentation should explain
fundamentals.

> What you describe as 'Statement-Based Replication Middleware' which is
> in fact multi-master replication at the middleware level, is usually
> more than just a simple proxy that broadcasts all queries. It must
> provide 1-copy-serializability if you want the cluster to be consistent.

Yes, the current description is very much targeted at pgpool. Thank you
very much for your input, that helps to clarify things.

> This means that write queries must be sent in the same serializable
> order to every node. As SQL is interpreted, macros can be replaced on
> the fly by the middleware with cluster-wide values. About sequences,
> calls to sequences must be broadcast as if it were writes. This is
> needed to update sequences the same way on all nodes. Note that
> sequences are parts of unrollbackable changes of the database. This
> means that even if a transaction rollbacks at one node, its sequence
> will not be rollbacked meaning that all other nodes must also play
> rollbacked transactions (at runtime or recovery time) if they were
> accessing sequences or involving other operations that the database
> cannot rollback.
>  From a performance standpoint, the middleware approach can balance
> queries that come from the same client connection to different nodes
> which is not the case when you are directly connected to a database
> instance. Also failover can be made fully transparent at the middleware
> level (any node failure can be completely hidden to the client). With
> Sequoia, we even hide middleware failures by providing transparent
> failover code in our driver.
> The discussion should add a section describing real HA features (not
> just load balancing):
> - Do you lose data on failure?
> - Will my transactions fail on a node failure?
> - Is failover transparent? Does it need special support in the client
> application?
> - Is failback a manual process?
> - How long does it take to restart/resynchronize a node?
> - Can I add nodes on the fly to the cluster?
> - Can I do maintenance operations without stopping the cluster?
> - Can I upgrade the cluster without interruptions (e.g. migrate from
> PostgreSQL 7.4 to 8.0 without interruption) ?
> - WAN support? How network partitions are handled?

These are good questions to analyze a certain solution. As far as our
documentation is concerned, I think giving rough estimates for
categories of replication algorithms is sufficient (i.e. stating that
Multi Master Replication scales very good for reading transactions, but
not very well for writing ones).

> When configured with RAIDb-1,

I know RAID-1, but what's a RAIBd-1?

> Sequoia provide a multi-master (full)
> replication at the middleware level. Note that Sequoia also allows for
> partial replication or data partitioning (but the granularity is the
> table meaning that you can only distribute tables on different nodes but
> not slice a table and split it on different nodes as you describe in
> Data Partitioning).
>
> Thanks again for this great work and I hope this will help improve the
> documentation.

Sure, thank you very much for your input. I'm glad we have the
possibility to cover sequoia, too.

> Note that there is also Postgres-R that is another approach to
> middleware-based replication.

Middleware? Mrs. Kemme also did some research with middleware
replication, but I wouldn't exactly call Postgres-R a middleware-based
replication solution. I've ported Postgres-R to PostgreSQL 8.2 and can
assure you that it's very well embedded into the backend of PostgreSQL.
Probably too well for some ;-)  You might want to check out my (slightly
outdated) website www.postgres-r.org.

Regards

Markus


Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

From
Markus Schiltknecht
Date:
Hi,

Bruce Momjian wrote:
> I feel the shared-* issue splits us up like master/slave and
> multi-master splits up

No, not quite. To sum up, I'd say the following combinations make sense:

sync, multi-master replication on shared-memory cluster (which is much
like a super-computer. With shared memory distributing locks does not
cost much - beside marketing, there is probably not much sense in
calling this a cluster at all).

sync, multi-master replication on shared-disk cluster (where locks and
memory-caches have to be synchronized. OracleRAC and PgCluster-II fit in
here.)

(Probably running an async replication on a shared-disk cluster would
make sense with MVCC and in some corner cases, but I don't see much
benefits in that.)

sync, multi-master replication on shared-nothing cluster (where locks,
caches and data needs to be synchronized over an interconnect.
Postgres-R, PgCluster, PgPool)

(sync, single-master replication does not make much sense, because if
you go sync at all, you could as well use the nodes which run in sync).

async, multi-master replication on shared-nothing cluster (i.e. Slony-I)

async, single-master replication on shared-nothing cluster (mainly for
failover purpose, you mention solutions for that)


For me these categorizations are important and help a good deal to
ensure what I'm talking about with somebody. The documentation is much
more focused on individual solutions, sometimes avoiding to categorize
them. I would love to get others opinions, but as not many others speak
up, I just accept it that way.

> Yea, gets confusing.

Well, Oracle also does a good deal in making it confusing, IMO.

> Good point. I mentioned Oracle RAC only because it seems to be an
> industry standard, so by mentioning it, people know exactly what we are
> talking about.

That's a point, even if I don't really know how much of an industry
standard it is. But given how badly Oracle does in explaining basics of
replication and clustering, I think it's not very beneficial.

> Is there a better way?  And people do ask for Oracle
> RAC, so in a way we are telling them we don't have something similar.
> As sad as that is, it is true currently.

How far is PGCluster-II? Does it make sense to mention it? Can
PGCluster-II be used with network filesystems like NFS, OCFS2 or the like?

> pgcluster is must closer to Oracle RAC,

Why do you think so? Oracle RAC is mainly based on a shared disk
cluster, where PGCluster bases on a shared nothing architecture.
PGCluster-II seems closer to Oracle RAC, for me.

> but I haven't mentioned it
> because I am unsure where it is in terms of usability and stability.
> Comments?

Did you work on it since Toronto, Mitani-San?

Regards

Markus


Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

From
Markus Schiltknecht
Date:
Hello Bruce,

Bruce Momjian wrote:
> OK, but how does explaining the terms help our users?

As we even have on sort-of-a solution for shared disk clusters (the
Shared Disk Failover part), we should explain this term (as you already
do there).

Clarifying that all other solutions are for shared nothing clusters
makes sense, IMO. We don't necessarily need to go into shared memory and
the confusion which shared everything introduced. OTOH, where else to
enlighten people about that if not in such a documentation?

To answer your question: by explaining these terms, they are
demystified. The users will understand the experts better and have some
fundamental terms which they can base their discussion on. Of course
it's questionable how far to go, and we are debating just that now, I think.

But I have no doubt in the OSS tradition of good documentation. Long
live the saying 'RTFM'! :-)

Regards

Markus


Re: PostgreSQL Documentation of High Availability and Load Balancing

From
Markus Schiltknecht
Date:
Hi,

José Orlando Pereira wrote:
>> Hm, what's wrong with that? Okay, we should better not mention Oracle
>> RAC there, but it is a product doing 'Multi-Master Replication Using
>> Clustering', isn't it?
>
> AFAIK, RAC uses a shared disk, thus it does not provide replication.

Oh, that's right. Hm... thus there are no such things as Multi-Master
Replication for shared-disk or shared-memory machines, because that's
not replication. My fault, sorry.

> And I
> don't think RAC can be emulated at all at the application level with 2PC.

No, that would not make sense. The paragraph is about Multi Master
Replication, which I thought Oracle RAC would be in. But I agree that
Oracle RAC should not be considered replication at all.

What do you think about sharing disks by the means of network file
systems, like OCFS2? I was under the impression that Oracle built that
one to run RAC on top of it. That combination would run on a shared
nothing cluster, but does that make it replication?

According to you, what category does Oracle RAC (and PGCluster-II)
belong to? Shared Disk Clusters?

> Classifying replication protocols is indeed a hard problem. Besides my issues
> with the multi-master replication using clustering category, I miss a
> reference to multi-master asynchronous replication (and thus, to
> reconciliation), which is a big issue in Oracle, MS SQL, etc literature.

Yeah, I'm missing that, too.

Well, the docu talks about async and sync, but IMO, it's somewhat sloppy
in that it only covers one aspect of synchronous replication (namely
that a failover will not loose data).

The other statement, that 'servers will return consistent results with
no propagation delay' is somewhat uncorrect, as there certainly is a
delay of propagation before the commit. And in that the individual
databases are very well consistent, just not synchronous.

Emmanuel Cecchet listed some questions one might use to categorize or
further specify aspects of synchronous replication in [1].

The current paragraph doesn't even clearly state that it's talking about
synchronous replication. Maybe we want to have only one paragraph for
Multi-Master replication and cover sync as well as async there?

> Coming from a fault-tolerant distributed systems background, we'd call
> that "replicated state machine" or "active replication". I don't think
> however that using those names in this context would be helpful.

Wikipedia has a definition of replicated state machine in [2]. I'm not
keen to use that term.

>> Thank you for your suggestions. And I'm glad you're seeing PostgreSQL
>> that way. But I think your additions don't quite fit into the
>> documentation because they are too promotional.
>
> Hey, you can't blame me for trying... ;)

No, it's more that I'm sorry for not having explained better what we need.

> Ok, I understand your motivations. I agree with the listing replication
> solutions somewhere on the website. I'd still add the research and innovation
> bullet,

Yes, pointing to that surely won't hurt.

> instead of trying to squeeze group-communication based stuff in
> existing bullets.

I see 2PC, shared memory and locking and using a GCS as implementation
details of sync, multi-master replication. I'd even put statement-based
replication in there, but one can reasonably argue about that. Anyway,
if at all, those should only be quickly mentioned as possible
implementations. But I don't think it helps to go that far. Having a
good description of sync MM and async MM replication is certainly
sufficient there.

Again, thank you very much for your inputs.

Regards

Markus


[1]: Emmanuel Cecchet:
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

[2]: Wikipedia definition of replicated state machine:
http://en.wikipedia.org/wiki/State_machine_replication

Re: PostgreSQL Documentation of High Availability and

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> Hi,
>
> Jos? Orlando Pereira wrote:
> >> Hm, what's wrong with that? Okay, we should better not mention Oracle
> >> RAC there, but it is a product doing 'Multi-Master Replication Using
> >> Clustering', isn't it?
> >
> > AFAIK, RAC uses a shared disk, thus it does not provide replication.
>
> Oh, that's right. Hm... thus there are no such things as Multi-Master
> Replication for shared-disk or shared-memory machines, because that's
> not replication. My fault, sorry.

OK, title now is "Multi-Master Clustering".

> > And I
> > don't think RAC can be emulated at all at the application level with 2PC.
>
> No, that would not make sense. The paragraph is about Multi Master
> Replication, which I thought Oracle RAC would be in. But I agree that
> Oracle RAC should not be considered replication at all.
>
> What do you think about sharing disks by the means of network file
> systems, like OCFS2? I was under the impression that Oracle built that
> one to run RAC on top of it. That combination would run on a shared
> nothing cluster, but does that make it replication?
>
> According to you, what category does Oracle RAC (and PGCluster-II)
> belong to? Shared Disk Clusters?
>
> > Classifying replication protocols is indeed a hard problem. Besides my issues
> > with the multi-master replication using clustering category, I miss a
> > reference to multi-master asynchronous replication (and thus, to
> > reconciliation), which is a big issue in Oracle, MS SQL, etc literature.
>
> Yeah, I'm missing that, too.

I added async multi-master:

     <varlistentry>
      <term>Multi-Master With Conflict Resolution</term>
      <listitem>

       <para>
        For servers that are not regularly connected, like laptops or
        remote servers, keeping data consistent among servers is a
        challenge.  One simple solution is to allow each server to
        modify the data, and have periodic communication compare
        databases and ask users to resolve any conflicts.
       </para>
      </listitem>
     </varlistentry>

> Well, the docu talks about async and sync, but IMO, it's somewhat sloppy
> in that it only covers one aspect of synchronous replication (namely
> that a failover will not loose data).
>
> The other statement, that 'servers will return consistent results with
> no propagation delay' is somewhat uncorrect, as there certainly is a
> delay of propagation before the commit. And in that the individual
> databases are very well consistent, just not synchronous.

OK, updated to add "little" delay, and removed "small" from async case:

  load-balanced servers will return consistent results with little
  propagation delay. Asynchronous updating has a delay between the

>
> Emmanuel Cecchet listed some questions one might use to categorize or
> further specify aspects of synchronous replication in [1].
>
> The current paragraph doesn't even clearly state that it's talking about
> synchronous replication. Maybe we want to have only one paragraph for
> Multi-Master replication and cover sync as well as async there?

Does the new conflict resolution section help that?

> > Coming from a fault-tolerant distributed systems background, we'd call
> > that "replicated state machine" or "active replication". I don't think
> > however that using those names in this context would be helpful.
>
> Wikipedia has a definition of replicated state machine in [2]. I'm not
> keen to use that term.
>
> >> Thank you for your suggestions. And I'm glad you're seeing PostgreSQL
> >> that way. But I think your additions don't quite fit into the
> >> documentation because they are too promotional.
> >
> > Hey, you can't blame me for trying... ;)
>
> No, it's more that I'm sorry for not having explained better what we need.

I was originally worried no one commented on my initial version of this
chapter.  I am not worried any more.  ;-)  Actually, I think we all
understand 60% of this topic, but a different 60%, so when we are done,
it will cover 100%.

> > Ok, I understand your motivations. I agree with the listing replication
> > solutions somewhere on the website. I'd still add the research and innovation
> > bullet,
>
> Yes, pointing to that surely won't hurt.
>
> > instead of trying to squeeze group-communication based stuff in
> > existing bullets.
>
> I see 2PC, shared memory and locking and using a GCS as implementation
> details of sync, multi-master replication. I'd even put statement-based
> replication in there, but one can reasonably argue about that. Anyway,
> if at all, those should only be quickly mentioned as possible
> implementations. But I don't think it helps to go that far. Having a
> good description of sync MM and async MM replication is certainly
> sufficient there.
>
> [1]: Emmanuel Cecchet:
> https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

Ah, good read.  I didn't realize the shared disk aspect of Oracle RAC,
and have removed mention of RAC from our documentation.  Oracle RAC
seems like an interesting hybrid solution.  They use shared disk so they
don't have to send the data to all the nodes, but send cache
invalidation information to all nodes so they know when something has
changed.  I have added the Oracle RAC details as an SGML comment in case
we ever need to mention it.

As far as going into the other details of what features each replication
solution has, e.g. adding nodes, etc, it is beyond the scope of this
chapter, though perhaps some of the items are appropriate.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [Pgcluster-general] PostgreSQL Documentation of

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> Hi,
>
> Bruce Momjian wrote:
> > I feel the shared-* issue splits us up like master/slave and
> > multi-master splits up
>
> No, not quite. To sum up, I'd say the following combinations make sense:
>
> sync, multi-master replication on shared-memory cluster (which is much
> like a super-computer. With shared memory distributing locks does not
> cost much - beside marketing, there is probably not much sense in
> calling this a cluster at all).

Wow, how is that different than an multi-CPU server?  I guess I don't
see the point to it.  The only value I see to it would be failover if
one of the servers fails, but it seems the failed server would be
holding locks that would make failover difficult to do without
restarting all the servers.

> sync, multi-master replication on shared-disk cluster (where locks and
> memory-caches have to be synchronized. OracleRAC and PgCluster-II fit in
> here.)

OK.  I didn't think pgcluster was shared disk.  I thought all the
synchronization was via the network.

> (Probably running an async replication on a shared-disk cluster would
> make sense with MVCC and in some corner cases, but I don't see much
> benefits in that.)
>
> sync, multi-master replication on shared-nothing cluster (where locks,
> caches and data needs to be synchronized over an interconnect.
> Postgres-R, PgCluster, PgPool)

Yes, I think we have that one covered.

> (sync, single-master replication does not make much sense, because if
> you go sync at all, you could as well use the nodes which run in sync).
>
> async, multi-master replication on shared-nothing cluster (i.e. Slony-I)

Covered.

> async, single-master replication on shared-nothing cluster (mainly for
> failover purpose, you mention solutions for that)

Added as a new entry calld Conflict Resolution.

> For me these categorizations are important and help a good deal to
> ensure what I'm talking about with somebody. The documentation is much
> more focused on individual solutions, sometimes avoiding to categorize
> them. I would love to get others opinions, but as not many others speak
> up, I just accept it that way.

One problem I have is that we we have shared disk failover, but no other
shared case with a PostgreSQL implementation, and people don't want to
mention Oracle RAC, so why do we mention it if we have no
implementations even in the works.

> > Yea, gets confusing.
>
> Well, Oracle also does a good deal in making it confusing, IMO.
>
> > Good point. I mentioned Oracle RAC only because it seems to be an
> > industry standard, so by mentioning it, people know exactly what we are
> > talking about.
>
> That's a point, even if I don't really know how much of an industry
> standard it is. But given how badly Oracle does in explaining basics of
> replication and clustering, I think it's not very beneficial.

OK, agreed, removed.

> > Is there a better way?  And people do ask for Oracle
> > RAC, so in a way we are telling them we don't have something similar.
> > As sad as that is, it is true currently.
>
> How far is PGCluster-II? Does it make sense to mention it? Can
> PGCluster-II be used with network filesystems like NFS, OCFS2 or the like?

I am waiting for email from Mitani-san, the pgcluster author.

>
> > pgcluster is must closer to Oracle RAC,
>
> Why do you think so? Oracle RAC is mainly based on a shared disk
> cluster, where PGCluster bases on a shared nothing architecture.
> PGCluster-II seems closer to Oracle RAC, for me.

Oh, I am not aware of pgcluster-II.  Did you mean pgpool-II?  I think
so.  I have mentioned pgpool-II now as part of Clustering For Parallel
Query Execution.  Is that OK?

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [Pgcluster-general] PostgreSQL Documentation of

From
"Joshua D. Drake"
Date:
On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote:
> Markus Schiltknecht wrote:
> > Hi,
> >
> > Bruce Momjian wrote:
> > > I feel the shared-* issue splits us up like master/slave and
> > > multi-master splits up
> >
> > No, not quite. To sum up, I'd say the following combinations make sense:
> >
> > sync, multi-master replication on shared-memory cluster (which is much
> > like a super-computer. With shared memory distributing locks does not
> > cost much - beside marketing, there is probably not much sense in
> > calling this a cluster at all).
>
> Wow, how is that different than an multi-CPU server?

You can't have 1000 cpus :).. You can have 1000 dual core servers.

Joshua D. Drake

--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: [Pgcluster-general] PostgreSQL Documentation of

From
Markus Schiltknecht
Date:
Joshua D. Drake wrote:
> On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote:
>> Markus Schiltknecht wrote:
>>> Hi,
>>>
>>> Bruce Momjian wrote:
>>>> I feel the shared-* issue splits us up like master/slave and
>>>> multi-master splits up
>>> No, not quite. To sum up, I'd say the following combinations make sense:
>>>
>>> sync, multi-master replication on shared-memory cluster (which is much
>>> like a super-computer. With shared memory distributing locks does not
>>> cost much - beside marketing, there is probably not much sense in
>>> calling this a cluster at all).
>> Wow, how is that different than an multi-CPU server?
>
> You can't have 1000 cpus :).. You can have 1000 dual core servers.

Have them share all their memory is the challenge, though. Ask IBM, they
certainly do big-irons like that.

As I said, it's like a super-computer, but you can also see it as a
cluster of CPUs with shared memory and shared disks. It's all just a
matter of your point of view.

Regards

Markus

Re: [Pgcluster-general] PostgreSQL Documentation of

From
Markus Schiltknecht
Date:
Hi,

Bruce Momjian wrote:
> Oh, I am not aware of pgcluster-II.  Did you mean pgpool-II?  I think
> so.

No, I really mean PGCluster-II. Didn't you attend A. Mitani's speech
about PGCluster? Check his slides:

http://conference.postgresql.org/download/TFCKUpload/62.pdf

Regards

Markus

Re: [Pgcluster-general] PostgreSQL Documentation of

From
Bruce Momjian
Date:
Joshua D. Drake wrote:
> On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote:
> > Markus Schiltknecht wrote:
> > > Hi,
> > >
> > > Bruce Momjian wrote:
> > > > I feel the shared-* issue splits us up like master/slave and
> > > > multi-master splits up
> > >
> > > No, not quite. To sum up, I'd say the following combinations make sense:
> > >
> > > sync, multi-master replication on shared-memory cluster (which is much
> > > like a super-computer. With shared memory distributing locks does not
> > > cost much - beside marketing, there is probably not much sense in
> > > calling this a cluster at all).
> >
> > Wow, how is that different than an multi-CPU server?
>
> You can't have 1000 cpus :).. You can have 1000 dual core servers.

But does anyone make a shared-memory cluster that can do 1000 cpu's?
Sounds like Sequent, but I didn't think anyone was doing this anymore.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [Pgcluster-general] PostgreSQL Documentation of

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> Hello Bruce,
>
> Bruce Momjian wrote:
> > OK, but how does explaining the terms help our users?
>
> As we even have on sort-of-a solution for shared disk clusters (the
> Shared Disk Failover part), we should explain this term (as you already
> do there).
>
> Clarifying that all other solutions are for shared nothing clusters
> makes sense, IMO. We don't necessarily need to go into shared memory and
> the confusion which shared everything introduced. OTOH, where else to
> enlighten people about that if not in such a documentation?
>
> To answer your question: by explaining these terms, they are
> demystified. The users will understand the experts better and have some
> fundamental terms which they can base their discussion on. Of course
> it's questionable how far to go, and we are debating just that now, I think.
>
> But I have no doubt in the OSS tradition of good documentation. Long
> live the saying 'RTFM'! :-)

I figured that shared-disk/memory only really makes sense for
multi-master clustering, so I mentioned it in that paragraph:

  <term>Multi-Master Clustering</term>
  <listitem>

   <para>
    In clustering, each server can accept write requests, and
    modified data is transmitted from the original server to every
    other server before each transaction commits.  Heavy write
    activity can cause excessive locking, leading to poor performance.
    In fact, write performance is often worse than that of a single
->    server.  Read requests can be sent to any server.  Some
->    implementations use cluster-wide shared memory or shared disk
->    to reduce the communication overhead.  Clustering is best for
    mostly read workloads, though its big advantage is that any
    server can accept write requests — there is no need to
    partition workloads between master and slave servers, and
    because the data changes are sent from one server to another,
    there is no problem with non-deterministic functions like
    <function>random()</>.

Is that enought?

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [Pgcluster-general] PostgreSQL Documentation of

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> Hi,
>
> Bruce Momjian wrote:
> > Oh, I am not aware of pgcluster-II.  Did you mean pgpool-II?  I think
> > so.
>
> No, I really mean PGCluster-II. Didn't you attend A. Mitani's speech
> about PGCluster? Check his slides:
>
> http://conference.postgresql.org/download/TFCKUpload/62.pdf

I just saw it.  It does seem more like Oracle RAC than any other method.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [Sequoia] PostgreSQL Documentation of High Availability and Load

From
Markus Schiltknecht
Date:
Hello Emmanuel,

Emmanuel Cecchet wrote:
>>> Even here I think that there is a common misconception between
>>> performance and scalability. Most people think that by having
>>> multiple nodes their query will run faster which is obviously wrong
>>> if your original workload does not saturate a single node.
>>
>> Sure. Do you think that should be made clearer?
> Yes, I think so because this is a very common belief that we experience
> with new users.

Okay, I have forwarded that to Bruce, who's editing the documentation
(and is a native English speaker). I'm not sure how we can cover this,
as we are very general in our description.

You might want to recheck the paragraph, which is now called
"Synchronous Multi-Master Replication":

http://momjian.us/main/writings/pgsql/sgml/high-availability.html

I'm particularly unsure, where Sequoia would fit in. There is still the
split between "Statement-Based Replication Middleware" and "Synchonous
Multi-Master Replication".

Does Sequoia offer any form of async replication?

>>> The replication mechanisms are even adding overhead (usually
>>> perceived as increased latency) to the query execution. It is ONLY
>>> when the workload increases that you can see throughput going up
>>> (ideally somewhat close to the workload increase) and query latency
>>> remaining stable. Unless you really have a parallel query execution
>>> (that is only efficient for big queries anyway), you will never see a
>>> performance improvement on a single query execution since this is
>>> always the same database engine that executes the query in the end.
>>
>> I don't quite agree with that statement, but probably I'm just
>> misreading it. If you have enough concurrent transactions you can
>> spread among the nodes, you'll certainly note an improvement. After
>> all, it's a huge difference, if your single node is processing only
>> ten or hundreds of concurrent transactions.
> Yes, but that already means that your single node was somewhat already a
> bottleneck. My point was that for low workloads (note that low is
> relative here since many users have dual-cpu machines with decent RAM
> and disks, and it takes quite a number of concurrent transactions to get
> to the peak point), you will not see any improvement and even you'll see
> a slight degradation especially from a latency perspective. Below the
> peak point of a single machine, you will get the same performance (from
> a client point of view) but the load on the various machine resources
> will decreased by the number of machines in the cluster (at best). For
> example, if I have a workload of 50 requests/second that generates 50%
> cpu load on 1 node, I will still get my 50 req/s with 2 machines but the
> cpu load will only be 25% on each node.
> Now the contention can be elsewhere (disk, locks, ...) and exhibit other
> scalability characteristics but it usually conforms to the model I
> described.

Agreed.

>> Of course, the amount of concurrent transactions limits how far a
>> replication solution can scale. Having more nodes than concurrent
>> transactions does not make sense. (Of course with the exception of
>> parallel query execution.)
> Yes but don't underestimate the capability of a single node to execute
> transactions in parallel as well. Oftentimes sending 2 concurrent
> transactions to a single node or to 2 different nodes does not make any
> difference (obviously it depends on the nature of the transaction).

Okay, I just have to believe that. Up until now I'm mostly basing on
theoretical estimates, rather than hard facts. :-)  You seem to have
made some real benchmarks. Did you publish them?

>>> To summarize, clustering solutions provide performance scalability
>>> (stable latency, throughput increasing almost linearly with load) but
>>> not performance improvement on individual query execution time.
>>
>> Yes, for writing transactions, no for read-only ones (queries?). Or
>> why do you have to add overhead to read-only queries?
> In a middleware approach you have to proxy the read results as well so
> you will add some latency there. When replication is integrated in the
> database you can prevent this extra hop but still the replication logic
> adds some overhead to any query (that seems inevitable if you want to
> ensure consistency).

Ah, okay, yes, the extra hop through the proxy has to be added even to
reading queries.

>> To make it work and production ready as soon as possible. ;-)  I'm
>> currently working on initialization and recovery.
> Good luck, this is the hardest part ! You'll soon figure out that
> replication was really the easy part !

Thanks.

May I ask what you use for automatic testing and benchmarking? I'm
currently stuck testing 90% of the time. Starting up the GCS, two
Databases and attaching the debugger to every process which could
possibly go havoc really takes a F***ING lot of time!

Regards

Markus



Re: [Sequoia] PostgreSQL Documentation of High Availability

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> Hello Emmanuel,
>
> Emmanuel Cecchet wrote:
> >>> Even here I think that there is a common misconception between
> >>> performance and scalability. Most people think that by having
> >>> multiple nodes their query will run faster which is obviously wrong
> >>> if your original workload does not saturate a single node.
> >>
> >> Sure. Do you think that should be made clearer?
> > Yes, I think so because this is a very common belief that we experience
> > with new users.
>
> Okay, I have forwarded that to Bruce, who's editing the documentation
> (and is a native English speaker). I'm not sure how we can cover this,
> as we are very general in our description.

OK, updated:

  <term>Multi-Server Parallel Query Execution</term>
  <listitem>

   <para>
    Many of the above solutions allow multiple servers to handle
    multiple queries, but none allow a single query to use multiple
    servers to complete faster.  This solution allows multiple
    servers to work concurrently on a single query.  This is usually
    accomplished by splitting the data among servers and having
    each server execute its part of the query and return results
    to a central server where they are combined and returned to
    the user.  Pgpool-II has this capability.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: PostgreSQL Documentation of High Availability and Load Balancing

From
Robert Treat
Date:
On Monday 20 November 2006 09:52, Markus Schiltknecht wrote:
> But for sure we'll compile a list of available replication solutions
> *somewhere* on the website, where we can change it as often as we want.
> I'm sure GORDA will get mentioned there.
>
> @pgsql-docs: BTW, what's the state on that one? Or do I have to ask that
> on -www?
>

Normally the www list is the best place to ask www related questions, but the
answer to this is to take a look at the techdocs section of the website
(http://www.postgresql.org/docs/techdocs/), it is a wiki like system that
allows any community member to put online guides and documentation into
place, made specifically for the purpose of things like what your looking
for.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: PostgreSQL Documentation of High Availability and Load

From
Markus Schiltknecht
Date:
Hi,

Robert Treat wrote:
> Normally the www list is the best place to ask www related questions, but the
> answer to this is to take a look at the techdocs section of the website
> (http://www.postgresql.org/docs/techdocs/)

That URL doesn't quite work because of the trailing slash. Removing it
and we're fine. Anyway, thanks for pointing me to the correct place.

> it is a wiki like system that
> allows any community member to put online guides and documentation into
> place, made specifically for the purpose of things like what your looking
> for.

Bruce, do we start a techdocs for listing the actual replication solutions?

Regards

Markus



Re: PostgreSQL Documentation of High Availability and

From
Bruce Momjian
Date:
Markus Schiltknecht wrote:
> Hi,
>
> Robert Treat wrote:
> > Normally the www list is the best place to ask www related questions, but the
> > answer to this is to take a look at the techdocs section of the website
> > (http://www.postgresql.org/docs/techdocs/)
>
> That URL doesn't quite work because of the trailing slash. Removing it
> and we're fine. Anyway, thanks for pointing me to the correct place.
>
> > it is a wiki like system that
> > allows any community member to put online guides and documentation into
> > place, made specifically for the purpose of things like what your looking
> > for.
>
> Bruce, do we start a techdocs for listing the actual replication solutions?

Sure.  Once it is done, I can add a link from our documentation.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +