Thread: PostgreSQL Documentation of High Availability and Load Balancing
Hi, We are about to extend the PostgreSQL documentation for high availability and load balancing. We've had some discussions so far, but only few people have spoken up until now. I'd really like to get some more input from all the different projects providing solutions in this area. You can find the current draft here: http://momjian.us/main/writings/pgsql/sgml/high-availability.html Please have a look and check if your project fits somewhere and if the description is correct. I'll happily forward your inputs if you are not subscribed to the (AFAIK closed) PostgreSQL docs mailing list. Regards Markus
Hello Jose, José Orlando Pereira wrote: > Hi, > > Thanks for the heads up. I mostly agree with the text (except maybe the > paragraph on Oracle RAC and 2PC). Hm, what's wrong with that? Okay, we should better not mention Oracle RAC there, but it is a product doing 'Multi-Master Replication Using Clustering', isn't it? > Regarding GORDA, I would not try to fit it into existing bullets, as our > current prototype implements several variations of query broadcasting, > multi-master and master/slave, although none with production quality. AFAIK, we want to help the users to get a general understanding about replication, high availability and load balancing. So as to give them a good starting point on their search for a solution to their problem(s). Thus it's okay if you say GORDA is prototyping several of the algorithms mentioned. I just would like to make sure that the descriptions are general enough to be appropriate for most implementations. For example, recently Josh Berkus proposed, that 'statement-based replication' was the more common term for 'query broadcasting'. Do you agree with that? What did you call it? Does the description fit? > Instead, I'd point out PostgreSQL's friendliness to research and the resulting > prototypes that stem from academia. This can help to draw community attention > to our efforts. Thank you for your suggestions. And I'm glad you're seeing PostgreSQL that way. But I think your additions don't quite fit into the documentation because they are too promotional. I should probably have given some more details about previous discussions. One consensus we've reached was, that we don't want to keep a quickly changing list of open source projects in our documentation. So probably nether GORDA nor Postgres-R will get mentioned there. But for sure we'll compile a list of available replication solutions *somewhere* on the website, where we can change it as often as we want. I'm sure GORDA will get mentioned there. @pgsql-docs: BTW, what's the state on that one? Or do I have to ask that on -www? Regards Markus
Re: [Pgcluster-general] PostgreSQL Documentation of High Availability
From
Markus Schiltknecht
Date:
Hi, a.mitani@sra-europe.com wrote: > Current generation of PGCluster is a Shared-Nothing type of multi-master > and syncronous replication system. Thank you for pointing us to yet another very common distinction in the clustering world: shared-nothing vs. shared-disk or even shared-everything. We don't touch that in the current documentation. Do we want or need to do so? > I think that the feature of this type of replication system is as the > 'Multi-Master Replication Using Clustering' chapter of your document. Most probably, yes. Please note that it's not *my* document :-) Bruce Momjian wrote most of it, with only some hints and annoying nit-picking from my side. > However, Oracle RAC is a Shared-Everything type of multi-master clustering > system. If it set up appropriately, most of these limitations would be > improved. Shared-Everything, really? I thought they did their own distributed shared memory or distributed locking stuff, so it would be shared-disk. And together with their OCFS, they would reach shared-nothing. But I don't really know. @pgsql-docs: I'd strongly vote for not mentioning Oracle if we don't event want to mention proprietary products for PostgreSQL. There are enough research or ongoing projects (even some ongoing reserch projects ;-) ) to mention. PgCluster-II, GORDA, Slony-II or Postgres-R come to mind. > Next generation of PGCluster (I named PGCluster-II) will be a > Shared-Everything type of multi-master clustering system as demonstrated > in Toronto. Yeah, I remember that demonstration. Do you think PGCluster-II fits what's described under 'Multi-Master Replication Using Clustering'? Do you think we should explain Shared-Nothing vs. Shared-Disk vs. Shared-Everything there? Regards Markus
Re: [Pgcluster-general] PostgreSQL Documentation of High Availability
From
Markus Schiltknecht
Date:
Hi, a.mitani@sra-europe.com wrote: > I believe that shared-everything and shared-disk are cmpletely different > things. No. AFAIK, shared everything is basically a marketing term. Some vendors (especially Oracle) use it to mean 'shared disk', which is really confusing. But that's probably intentional, as it sounds good to share everything... much better than sharing nothing. Others, like GreenPlum use the term to mean 'shared memory' for example in [1], which is much more appropriate. Anyway, the term 'Shared Nothing' seems to go back to Stonebraker's paper 'The Case for Shared Nothing Architecture' [2]. There he defined these three terms: shared memory (SM): multiple processors share a common central memory shared disk (SD): multiple processors each with private memory share a common collection of disks shared nothing (SN): neither memory nor peripheral storage is shared among processors As it makes no sense to have shared memory but individual disks, no term for such a thing got defined. > Oracle has said the RAC as Shared-Everything. > Probably, the definition of the words would differ from them. It's hard to find architectural details within their documents, but AFAICS, they do something like distributed shared memory or distributed locking, which they call "Cache Fusion", see [3]. In the very same paper, they define what a Cluster is for them: "A cluster is a group of independent servers that cooperate as a single system. The primary cluster components are processor nodes, a cluster interconnect, and a shared disk subsystem. The clusters share disk access and resources that manage data, but the distinct hardware cluster nodes do not share memory." Thus, Oracle RAC seems to be a shared disk solution. Only in conjunction with their OCFS, you could probably call it a shared-nothing solution, but it's certainly not a shared-memory thing. I'd vote for explaining these terms in the PostgreSQL documentation, as there seems to be a lot of confusion regarding these terms. Regards Markus [1]: GreenPlum about Shared Nothing vs Shared Everything: http://www.greenplum.com/products/sharedNothing.php [2]: Michael Stonebraker, The Case for Shared Nothing Architecture: http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf [3]: A random Oracle Paper about "Cache Fusion" (tm): http://www.oracle.com/technology/products/oracle9i/pdf/cache_fusion_rel2.pdf
Markus Schiltknecht wrote: > Hi, > > a.mitani@sra-europe.com wrote: > > Current generation of PGCluster is a Shared-Nothing type of multi-master > > and syncronous replication system. > > Thank you for pointing us to yet another very common distinction in the > clustering world: shared-nothing vs. shared-disk or even > shared-everything. We don't touch that in the current documentation. Do > we want or need to do so? I feel the shared-* issue splits us up like master/slave and multi-master splits up --- it added more confusion than clarity, because many solutions fell in the middle. > > I think that the feature of this type of replication system is as the > > 'Multi-Master Replication Using Clustering' chapter of your document. > > Most probably, yes. Please note that it's not *my* document :-) Bruce > Momjian wrote most of it, with only some hints and annoying nit-picking > from my side. > > > However, Oracle RAC is a Shared-Everything type of multi-master clustering > > system. If it set up appropriately, most of these limitations would be > > improved. > > Shared-Everything, really? I thought they did their own distributed > shared memory or distributed locking stuff, so it would be shared-disk. > And together with their OCFS, they would reach shared-nothing. But I > don't really know. Yea, gets confusing. > @pgsql-docs: I'd strongly vote for not mentioning Oracle if we don't > event want to mention proprietary products for PostgreSQL. There are > enough research or ongoing projects (even some ongoing reserch projects > ;-) ) to mention. PgCluster-II, GORDA, Slony-II or Postgres-R come to mind. Good point. I mentioned Oracle RAC only because it seems to be an industry standard, so by mentioning it, people know exactly what we are talking about. Is there a better way? And people do ask for Oracle RAC, so in a way we are telling them we don't have something similar. As sad as that is, it is true currently. > > Next generation of PGCluster (I named PGCluster-II) will be a > > Shared-Everything type of multi-master clustering system as demonstrated > > in Toronto. > > Yeah, I remember that demonstration. Do you think PGCluster-II fits > what's described under 'Multi-Master Replication Using Clustering'? Do > you think we should explain Shared-Nothing vs. Shared-Disk vs. > Shared-Everything there? pgcluster is must closer to Oracle RAC, but I haven't mentioned it because I am unsure where it is in terms of usability and stability. Comments? -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Nice idea. --------------------------------------------------------------------------- Markus Schiltknecht wrote: > Hi, > > We are about to extend the PostgreSQL documentation for high > availability and load balancing. We've had some discussions so far, but > only few people have spoken up until now. I'd really like to get some > more input from all the different projects providing solutions in this area. > > You can find the current draft here: > http://momjian.us/main/writings/pgsql/sgml/high-availability.html > > Please have a look and check if your project fits somewhere and if the > description is correct. I'll happily forward your inputs if you are not > subscribed to the (AFAIK closed) PostgreSQL docs mailing list. > > Regards > > Markus > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Markus Schiltknecht wrote: > In the very same paper, they define what a Cluster is for them: "A > cluster is a group of independent servers that cooperate as a single > system. The primary cluster components are processor nodes, a cluster > interconnect, and a shared disk subsystem. The clusters share disk > access and resources that manage data, but the distinct hardware cluster > nodes do not share memory." > > Thus, Oracle RAC seems to be a shared disk solution. Only in conjunction > with their OCFS, you could probably call it a shared-nothing solution, > but it's certainly not a shared-memory thing. > > I'd vote for explaining these terms in the PostgreSQL documentation, as > there seems to be a lot of confusion regarding these terms. OK, but how does explaining the terms help our users? -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Hi, Emmanuel Cecchet wrote: > I think that you can still have a matrix with the main > features(performance, data loss/failover/failback on node failure, > Disaster recovery, WAN, ...) and how each approach (master/slave, shared > disk, multi-master, ...) addresses the issue. Yes, I certainly agree with that. >> These are good questions to analyze a certain solution. As far as our >> documentation is concerned, I think giving rough estimates for >> categories of replication algorithms is sufficient (i.e. stating that >> Multi Master Replication scales very good for reading transactions, >> but not very well for writing ones). > Even here I think that there is a common misconception between > performance and scalability. Most people think that by having multiple > nodes their query will run faster which is obviously wrong if your > original workload does not saturate a single node. Sure. Do you think that should be made clearer? > The replication > mechanisms are even adding overhead (usually perceived as increased > latency) to the query execution. It is ONLY when the workload increases > that you can see throughput going up (ideally somewhat close to the > workload increase) and query latency remaining stable. Unless you really > have a parallel query execution (that is only efficient for big queries > anyway), you will never see a performance improvement on a single query > execution since this is always the same database engine that executes > the query in the end. I don't quite agree with that statement, but probably I'm just misreading it. If you have enough concurrent transactions you can spread among the nodes, you'll certainly note an improvement. After all, it's a huge difference, if your single node is processing only ten or hundreds of concurrent transactions. Of course, the amount of concurrent transactions limits how far a replication solution can scale. Having more nodes than concurrent transactions does not make sense. (Of course with the exception of parallel query execution.) > But you are right that full replication (in shared nothing environments) > does not perform with write heavy workload. At best it will go to the > speed of the fastest node in the cluster, but it will usually degrade > quickly. A good replication implementation will have a constant overhead > on query execution time (let's say few millliseconds). Therefore the > impact will be quite different if this is a small query or a > long-running query. Adding few milliseconds to a query that takes > seconds to execute is negligible but adding the same time to a > sub-millisecond query will be a tremendous slowdown (in term of latency). > To summarize, clustering solutions provide performance scalability > (stable latency, throughput increasing almost linearly with load) but > not performance improvement on individual query execution time. Yes, for writing transactions, no for read-only ones (queries?). Or why do you have to add overhead to read-only queries? > If the > client application is not multithreaded it is very unlikely that any > solution will improve the application performance. Ehm.. I wouldn't refer to threading here. You can very well have multiple single-process programs running on different nodes... I'd keep referring to concurrency of transactions. > As an additional point, transactions including calls such as 'select > nextval' should be considered as write transactions with PostgreSQL. Sure. > That might not be obvious for most users. Agreed. >>> When configured with RAIDb-1, >> I know RAID-1, but what's a RAIBd-1? > RAIDb is an acronym for Redundant Array of Inexpensive Databases. > You can find an article on this at > http://c-jdbc.objectweb.org/current/doc/RR-C-JDBC.pdf Aha, thank you. > That's great that the work was revived in 8.2. Yes, Postgres-R is much > more embedded in Postgres but I was confused with Middle-R that was done > later on with Bettina and Ricardo using a similar technique at the > middlware level. Yeah, I thought you meant that one. I don't know Middle-R at all, sorry. Seems similar to sequoia. Did you base your work on Middle-R? > What are your development plans for Postgres-R? To make it work and production ready as soon as possible. ;-) I'm currently working on initialization and recovery. Regards Markus
Hello Emmanuel, Emmanuel Cecchet wrote: > I just quickly went through the classification. > I don't think that your description of 'multi-master replication using > clustering' is correct. Oracle RAC is a shared disk approach and just > send cache invalidations to other nodes but not actual data. As the disk > is shared, data is only commited once to disk and there is a distributed > locking protocol to make nodes agree on a serializable transactional order. I agree with that. AFAI understand, we are not entirely sure how much of that we want to cover in our documentation. > You can have a look at the last ApacheCon presentation I gave > (http://www.continuent.org/uploads/sequoia/Resources/2006-08-15Cecchet_ApacheConAsia2006.pdf), thank you for that link, I'll read through it ASAP. > there is a comparison of most solutions. I think that it would be cool > to have a comparison matrix with the different features that you may > need in an HA solution and see how the different solutions can provide > an answer to these problems. We've already come to the consensus, that we don't want to list all products in our documentation, but better do that on the website, where it can easily be changed at any time. The documentation should explain fundamentals. > What you describe as 'Statement-Based Replication Middleware' which is > in fact multi-master replication at the middleware level, is usually > more than just a simple proxy that broadcasts all queries. It must > provide 1-copy-serializability if you want the cluster to be consistent. Yes, the current description is very much targeted at pgpool. Thank you very much for your input, that helps to clarify things. > This means that write queries must be sent in the same serializable > order to every node. As SQL is interpreted, macros can be replaced on > the fly by the middleware with cluster-wide values. About sequences, > calls to sequences must be broadcast as if it were writes. This is > needed to update sequences the same way on all nodes. Note that > sequences are parts of unrollbackable changes of the database. This > means that even if a transaction rollbacks at one node, its sequence > will not be rollbacked meaning that all other nodes must also play > rollbacked transactions (at runtime or recovery time) if they were > accessing sequences or involving other operations that the database > cannot rollback. > From a performance standpoint, the middleware approach can balance > queries that come from the same client connection to different nodes > which is not the case when you are directly connected to a database > instance. Also failover can be made fully transparent at the middleware > level (any node failure can be completely hidden to the client). With > Sequoia, we even hide middleware failures by providing transparent > failover code in our driver. > The discussion should add a section describing real HA features (not > just load balancing): > - Do you lose data on failure? > - Will my transactions fail on a node failure? > - Is failover transparent? Does it need special support in the client > application? > - Is failback a manual process? > - How long does it take to restart/resynchronize a node? > - Can I add nodes on the fly to the cluster? > - Can I do maintenance operations without stopping the cluster? > - Can I upgrade the cluster without interruptions (e.g. migrate from > PostgreSQL 7.4 to 8.0 without interruption) ? > - WAN support? How network partitions are handled? These are good questions to analyze a certain solution. As far as our documentation is concerned, I think giving rough estimates for categories of replication algorithms is sufficient (i.e. stating that Multi Master Replication scales very good for reading transactions, but not very well for writing ones). > When configured with RAIDb-1, I know RAID-1, but what's a RAIBd-1? > Sequoia provide a multi-master (full) > replication at the middleware level. Note that Sequoia also allows for > partial replication or data partitioning (but the granularity is the > table meaning that you can only distribute tables on different nodes but > not slice a table and split it on different nodes as you describe in > Data Partitioning). > > Thanks again for this great work and I hope this will help improve the > documentation. Sure, thank you very much for your input. I'm glad we have the possibility to cover sequoia, too. > Note that there is also Postgres-R that is another approach to > middleware-based replication. Middleware? Mrs. Kemme also did some research with middleware replication, but I wouldn't exactly call Postgres-R a middleware-based replication solution. I've ported Postgres-R to PostgreSQL 8.2 and can assure you that it's very well embedded into the backend of PostgreSQL. Probably too well for some ;-) You might want to check out my (slightly outdated) website www.postgres-r.org. Regards Markus
Re: [Pgcluster-general] PostgreSQL Documentation of High Availability
From
Markus Schiltknecht
Date:
Hi, Bruce Momjian wrote: > I feel the shared-* issue splits us up like master/slave and > multi-master splits up No, not quite. To sum up, I'd say the following combinations make sense: sync, multi-master replication on shared-memory cluster (which is much like a super-computer. With shared memory distributing locks does not cost much - beside marketing, there is probably not much sense in calling this a cluster at all). sync, multi-master replication on shared-disk cluster (where locks and memory-caches have to be synchronized. OracleRAC and PgCluster-II fit in here.) (Probably running an async replication on a shared-disk cluster would make sense with MVCC and in some corner cases, but I don't see much benefits in that.) sync, multi-master replication on shared-nothing cluster (where locks, caches and data needs to be synchronized over an interconnect. Postgres-R, PgCluster, PgPool) (sync, single-master replication does not make much sense, because if you go sync at all, you could as well use the nodes which run in sync). async, multi-master replication on shared-nothing cluster (i.e. Slony-I) async, single-master replication on shared-nothing cluster (mainly for failover purpose, you mention solutions for that) For me these categorizations are important and help a good deal to ensure what I'm talking about with somebody. The documentation is much more focused on individual solutions, sometimes avoiding to categorize them. I would love to get others opinions, but as not many others speak up, I just accept it that way. > Yea, gets confusing. Well, Oracle also does a good deal in making it confusing, IMO. > Good point. I mentioned Oracle RAC only because it seems to be an > industry standard, so by mentioning it, people know exactly what we are > talking about. That's a point, even if I don't really know how much of an industry standard it is. But given how badly Oracle does in explaining basics of replication and clustering, I think it's not very beneficial. > Is there a better way? And people do ask for Oracle > RAC, so in a way we are telling them we don't have something similar. > As sad as that is, it is true currently. How far is PGCluster-II? Does it make sense to mention it? Can PGCluster-II be used with network filesystems like NFS, OCFS2 or the like? > pgcluster is must closer to Oracle RAC, Why do you think so? Oracle RAC is mainly based on a shared disk cluster, where PGCluster bases on a shared nothing architecture. PGCluster-II seems closer to Oracle RAC, for me. > but I haven't mentioned it > because I am unsure where it is in terms of usability and stability. > Comments? Did you work on it since Toronto, Mitani-San? Regards Markus
Re: [Pgcluster-general] PostgreSQL Documentation of High Availability
From
Markus Schiltknecht
Date:
Hello Bruce, Bruce Momjian wrote: > OK, but how does explaining the terms help our users? As we even have on sort-of-a solution for shared disk clusters (the Shared Disk Failover part), we should explain this term (as you already do there). Clarifying that all other solutions are for shared nothing clusters makes sense, IMO. We don't necessarily need to go into shared memory and the confusion which shared everything introduced. OTOH, where else to enlighten people about that if not in such a documentation? To answer your question: by explaining these terms, they are demystified. The users will understand the experts better and have some fundamental terms which they can base their discussion on. Of course it's questionable how far to go, and we are debating just that now, I think. But I have no doubt in the OSS tradition of good documentation. Long live the saying 'RTFM'! :-) Regards Markus
Hi, José Orlando Pereira wrote: >> Hm, what's wrong with that? Okay, we should better not mention Oracle >> RAC there, but it is a product doing 'Multi-Master Replication Using >> Clustering', isn't it? > > AFAIK, RAC uses a shared disk, thus it does not provide replication. Oh, that's right. Hm... thus there are no such things as Multi-Master Replication for shared-disk or shared-memory machines, because that's not replication. My fault, sorry. > And I > don't think RAC can be emulated at all at the application level with 2PC. No, that would not make sense. The paragraph is about Multi Master Replication, which I thought Oracle RAC would be in. But I agree that Oracle RAC should not be considered replication at all. What do you think about sharing disks by the means of network file systems, like OCFS2? I was under the impression that Oracle built that one to run RAC on top of it. That combination would run on a shared nothing cluster, but does that make it replication? According to you, what category does Oracle RAC (and PGCluster-II) belong to? Shared Disk Clusters? > Classifying replication protocols is indeed a hard problem. Besides my issues > with the multi-master replication using clustering category, I miss a > reference to multi-master asynchronous replication (and thus, to > reconciliation), which is a big issue in Oracle, MS SQL, etc literature. Yeah, I'm missing that, too. Well, the docu talks about async and sync, but IMO, it's somewhat sloppy in that it only covers one aspect of synchronous replication (namely that a failover will not loose data). The other statement, that 'servers will return consistent results with no propagation delay' is somewhat uncorrect, as there certainly is a delay of propagation before the commit. And in that the individual databases are very well consistent, just not synchronous. Emmanuel Cecchet listed some questions one might use to categorize or further specify aspects of synchronous replication in [1]. The current paragraph doesn't even clearly state that it's talking about synchronous replication. Maybe we want to have only one paragraph for Multi-Master replication and cover sync as well as async there? > Coming from a fault-tolerant distributed systems background, we'd call > that "replicated state machine" or "active replication". I don't think > however that using those names in this context would be helpful. Wikipedia has a definition of replicated state machine in [2]. I'm not keen to use that term. >> Thank you for your suggestions. And I'm glad you're seeing PostgreSQL >> that way. But I think your additions don't quite fit into the >> documentation because they are too promotional. > > Hey, you can't blame me for trying... ;) No, it's more that I'm sorry for not having explained better what we need. > Ok, I understand your motivations. I agree with the listing replication > solutions somewhere on the website. I'd still add the research and innovation > bullet, Yes, pointing to that surely won't hurt. > instead of trying to squeeze group-communication based stuff in > existing bullets. I see 2PC, shared memory and locking and using a GCS as implementation details of sync, multi-master replication. I'd even put statement-based replication in there, but one can reasonably argue about that. Anyway, if at all, those should only be quickly mentioned as possible implementations. But I don't think it helps to go that far. Having a good description of sync MM and async MM replication is certainly sufficient there. Again, thank you very much for your inputs. Regards Markus [1]: Emmanuel Cecchet: https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html [2]: Wikipedia definition of replicated state machine: http://en.wikipedia.org/wiki/State_machine_replication
Markus Schiltknecht wrote: > Hi, > > Jos? Orlando Pereira wrote: > >> Hm, what's wrong with that? Okay, we should better not mention Oracle > >> RAC there, but it is a product doing 'Multi-Master Replication Using > >> Clustering', isn't it? > > > > AFAIK, RAC uses a shared disk, thus it does not provide replication. > > Oh, that's right. Hm... thus there are no such things as Multi-Master > Replication for shared-disk or shared-memory machines, because that's > not replication. My fault, sorry. OK, title now is "Multi-Master Clustering". > > And I > > don't think RAC can be emulated at all at the application level with 2PC. > > No, that would not make sense. The paragraph is about Multi Master > Replication, which I thought Oracle RAC would be in. But I agree that > Oracle RAC should not be considered replication at all. > > What do you think about sharing disks by the means of network file > systems, like OCFS2? I was under the impression that Oracle built that > one to run RAC on top of it. That combination would run on a shared > nothing cluster, but does that make it replication? > > According to you, what category does Oracle RAC (and PGCluster-II) > belong to? Shared Disk Clusters? > > > Classifying replication protocols is indeed a hard problem. Besides my issues > > with the multi-master replication using clustering category, I miss a > > reference to multi-master asynchronous replication (and thus, to > > reconciliation), which is a big issue in Oracle, MS SQL, etc literature. > > Yeah, I'm missing that, too. I added async multi-master: <varlistentry> <term>Multi-Master With Conflict Resolution</term> <listitem> <para> For servers that are not regularly connected, like laptops or remote servers, keeping data consistent among servers is a challenge. One simple solution is to allow each server to modify the data, and have periodic communication compare databases and ask users to resolve any conflicts. </para> </listitem> </varlistentry> > Well, the docu talks about async and sync, but IMO, it's somewhat sloppy > in that it only covers one aspect of synchronous replication (namely > that a failover will not loose data). > > The other statement, that 'servers will return consistent results with > no propagation delay' is somewhat uncorrect, as there certainly is a > delay of propagation before the commit. And in that the individual > databases are very well consistent, just not synchronous. OK, updated to add "little" delay, and removed "small" from async case: load-balanced servers will return consistent results with little propagation delay. Asynchronous updating has a delay between the > > Emmanuel Cecchet listed some questions one might use to categorize or > further specify aspects of synchronous replication in [1]. > > The current paragraph doesn't even clearly state that it's talking about > synchronous replication. Maybe we want to have only one paragraph for > Multi-Master replication and cover sync as well as async there? Does the new conflict resolution section help that? > > Coming from a fault-tolerant distributed systems background, we'd call > > that "replicated state machine" or "active replication". I don't think > > however that using those names in this context would be helpful. > > Wikipedia has a definition of replicated state machine in [2]. I'm not > keen to use that term. > > >> Thank you for your suggestions. And I'm glad you're seeing PostgreSQL > >> that way. But I think your additions don't quite fit into the > >> documentation because they are too promotional. > > > > Hey, you can't blame me for trying... ;) > > No, it's more that I'm sorry for not having explained better what we need. I was originally worried no one commented on my initial version of this chapter. I am not worried any more. ;-) Actually, I think we all understand 60% of this topic, but a different 60%, so when we are done, it will cover 100%. > > Ok, I understand your motivations. I agree with the listing replication > > solutions somewhere on the website. I'd still add the research and innovation > > bullet, > > Yes, pointing to that surely won't hurt. > > > instead of trying to squeeze group-communication based stuff in > > existing bullets. > > I see 2PC, shared memory and locking and using a GCS as implementation > details of sync, multi-master replication. I'd even put statement-based > replication in there, but one can reasonably argue about that. Anyway, > if at all, those should only be quickly mentioned as possible > implementations. But I don't think it helps to go that far. Having a > good description of sync MM and async MM replication is certainly > sufficient there. > > [1]: Emmanuel Cecchet: > https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html Ah, good read. I didn't realize the shared disk aspect of Oracle RAC, and have removed mention of RAC from our documentation. Oracle RAC seems like an interesting hybrid solution. They use shared disk so they don't have to send the data to all the nodes, but send cache invalidation information to all nodes so they know when something has changed. I have added the Oracle RAC details as an SGML comment in case we ever need to mention it. As far as going into the other details of what features each replication solution has, e.g. adding nodes, etc, it is beyond the scope of this chapter, though perhaps some of the items are appropriate. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Markus Schiltknecht wrote: > Hi, > > Bruce Momjian wrote: > > I feel the shared-* issue splits us up like master/slave and > > multi-master splits up > > No, not quite. To sum up, I'd say the following combinations make sense: > > sync, multi-master replication on shared-memory cluster (which is much > like a super-computer. With shared memory distributing locks does not > cost much - beside marketing, there is probably not much sense in > calling this a cluster at all). Wow, how is that different than an multi-CPU server? I guess I don't see the point to it. The only value I see to it would be failover if one of the servers fails, but it seems the failed server would be holding locks that would make failover difficult to do without restarting all the servers. > sync, multi-master replication on shared-disk cluster (where locks and > memory-caches have to be synchronized. OracleRAC and PgCluster-II fit in > here.) OK. I didn't think pgcluster was shared disk. I thought all the synchronization was via the network. > (Probably running an async replication on a shared-disk cluster would > make sense with MVCC and in some corner cases, but I don't see much > benefits in that.) > > sync, multi-master replication on shared-nothing cluster (where locks, > caches and data needs to be synchronized over an interconnect. > Postgres-R, PgCluster, PgPool) Yes, I think we have that one covered. > (sync, single-master replication does not make much sense, because if > you go sync at all, you could as well use the nodes which run in sync). > > async, multi-master replication on shared-nothing cluster (i.e. Slony-I) Covered. > async, single-master replication on shared-nothing cluster (mainly for > failover purpose, you mention solutions for that) Added as a new entry calld Conflict Resolution. > For me these categorizations are important and help a good deal to > ensure what I'm talking about with somebody. The documentation is much > more focused on individual solutions, sometimes avoiding to categorize > them. I would love to get others opinions, but as not many others speak > up, I just accept it that way. One problem I have is that we we have shared disk failover, but no other shared case with a PostgreSQL implementation, and people don't want to mention Oracle RAC, so why do we mention it if we have no implementations even in the works. > > Yea, gets confusing. > > Well, Oracle also does a good deal in making it confusing, IMO. > > > Good point. I mentioned Oracle RAC only because it seems to be an > > industry standard, so by mentioning it, people know exactly what we are > > talking about. > > That's a point, even if I don't really know how much of an industry > standard it is. But given how badly Oracle does in explaining basics of > replication and clustering, I think it's not very beneficial. OK, agreed, removed. > > Is there a better way? And people do ask for Oracle > > RAC, so in a way we are telling them we don't have something similar. > > As sad as that is, it is true currently. > > How far is PGCluster-II? Does it make sense to mention it? Can > PGCluster-II be used with network filesystems like NFS, OCFS2 or the like? I am waiting for email from Mitani-san, the pgcluster author. > > > pgcluster is must closer to Oracle RAC, > > Why do you think so? Oracle RAC is mainly based on a shared disk > cluster, where PGCluster bases on a shared nothing architecture. > PGCluster-II seems closer to Oracle RAC, for me. Oh, I am not aware of pgcluster-II. Did you mean pgpool-II? I think so. I have mentioned pgpool-II now as part of Clustering For Parallel Query Execution. Is that OK? -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote: > Markus Schiltknecht wrote: > > Hi, > > > > Bruce Momjian wrote: > > > I feel the shared-* issue splits us up like master/slave and > > > multi-master splits up > > > > No, not quite. To sum up, I'd say the following combinations make sense: > > > > sync, multi-master replication on shared-memory cluster (which is much > > like a super-computer. With shared memory distributing locks does not > > cost much - beside marketing, there is probably not much sense in > > calling this a cluster at all). > > Wow, how is that different than an multi-CPU server? You can't have 1000 cpus :).. You can have 1000 dual core servers. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
Joshua D. Drake wrote: > On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote: >> Markus Schiltknecht wrote: >>> Hi, >>> >>> Bruce Momjian wrote: >>>> I feel the shared-* issue splits us up like master/slave and >>>> multi-master splits up >>> No, not quite. To sum up, I'd say the following combinations make sense: >>> >>> sync, multi-master replication on shared-memory cluster (which is much >>> like a super-computer. With shared memory distributing locks does not >>> cost much - beside marketing, there is probably not much sense in >>> calling this a cluster at all). >> Wow, how is that different than an multi-CPU server? > > You can't have 1000 cpus :).. You can have 1000 dual core servers. Have them share all their memory is the challenge, though. Ask IBM, they certainly do big-irons like that. As I said, it's like a super-computer, but you can also see it as a cluster of CPUs with shared memory and shared disks. It's all just a matter of your point of view. Regards Markus
Hi, Bruce Momjian wrote: > Oh, I am not aware of pgcluster-II. Did you mean pgpool-II? I think > so. No, I really mean PGCluster-II. Didn't you attend A. Mitani's speech about PGCluster? Check his slides: http://conference.postgresql.org/download/TFCKUpload/62.pdf Regards Markus
Joshua D. Drake wrote: > On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote: > > Markus Schiltknecht wrote: > > > Hi, > > > > > > Bruce Momjian wrote: > > > > I feel the shared-* issue splits us up like master/slave and > > > > multi-master splits up > > > > > > No, not quite. To sum up, I'd say the following combinations make sense: > > > > > > sync, multi-master replication on shared-memory cluster (which is much > > > like a super-computer. With shared memory distributing locks does not > > > cost much - beside marketing, there is probably not much sense in > > > calling this a cluster at all). > > > > Wow, how is that different than an multi-CPU server? > > You can't have 1000 cpus :).. You can have 1000 dual core servers. But does anyone make a shared-memory cluster that can do 1000 cpu's? Sounds like Sequent, but I didn't think anyone was doing this anymore. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Markus Schiltknecht wrote: > Hello Bruce, > > Bruce Momjian wrote: > > OK, but how does explaining the terms help our users? > > As we even have on sort-of-a solution for shared disk clusters (the > Shared Disk Failover part), we should explain this term (as you already > do there). > > Clarifying that all other solutions are for shared nothing clusters > makes sense, IMO. We don't necessarily need to go into shared memory and > the confusion which shared everything introduced. OTOH, where else to > enlighten people about that if not in such a documentation? > > To answer your question: by explaining these terms, they are > demystified. The users will understand the experts better and have some > fundamental terms which they can base their discussion on. Of course > it's questionable how far to go, and we are debating just that now, I think. > > But I have no doubt in the OSS tradition of good documentation. Long > live the saying 'RTFM'! :-) I figured that shared-disk/memory only really makes sense for multi-master clustering, so I mentioned it in that paragraph: <term>Multi-Master Clustering</term> <listitem> <para> In clustering, each server can accept write requests, and modified data is transmitted from the original server to every other server before each transaction commits. Heavy write activity can cause excessive locking, leading to poor performance. In fact, write performance is often worse than that of a single -> server. Read requests can be sent to any server. Some -> implementations use cluster-wide shared memory or shared disk -> to reduce the communication overhead. Clustering is best for mostly read workloads, though its big advantage is that any server can accept write requests — there is no need to partition workloads between master and slave servers, and because the data changes are sent from one server to another, there is no problem with non-deterministic functions like <function>random()</>. Is that enought? -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Markus Schiltknecht wrote: > Hi, > > Bruce Momjian wrote: > > Oh, I am not aware of pgcluster-II. Did you mean pgpool-II? I think > > so. > > No, I really mean PGCluster-II. Didn't you attend A. Mitani's speech > about PGCluster? Check his slides: > > http://conference.postgresql.org/download/TFCKUpload/62.pdf I just saw it. It does seem more like Oracle RAC than any other method. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Hello Emmanuel, Emmanuel Cecchet wrote: >>> Even here I think that there is a common misconception between >>> performance and scalability. Most people think that by having >>> multiple nodes their query will run faster which is obviously wrong >>> if your original workload does not saturate a single node. >> >> Sure. Do you think that should be made clearer? > Yes, I think so because this is a very common belief that we experience > with new users. Okay, I have forwarded that to Bruce, who's editing the documentation (and is a native English speaker). I'm not sure how we can cover this, as we are very general in our description. You might want to recheck the paragraph, which is now called "Synchronous Multi-Master Replication": http://momjian.us/main/writings/pgsql/sgml/high-availability.html I'm particularly unsure, where Sequoia would fit in. There is still the split between "Statement-Based Replication Middleware" and "Synchonous Multi-Master Replication". Does Sequoia offer any form of async replication? >>> The replication mechanisms are even adding overhead (usually >>> perceived as increased latency) to the query execution. It is ONLY >>> when the workload increases that you can see throughput going up >>> (ideally somewhat close to the workload increase) and query latency >>> remaining stable. Unless you really have a parallel query execution >>> (that is only efficient for big queries anyway), you will never see a >>> performance improvement on a single query execution since this is >>> always the same database engine that executes the query in the end. >> >> I don't quite agree with that statement, but probably I'm just >> misreading it. If you have enough concurrent transactions you can >> spread among the nodes, you'll certainly note an improvement. After >> all, it's a huge difference, if your single node is processing only >> ten or hundreds of concurrent transactions. > Yes, but that already means that your single node was somewhat already a > bottleneck. My point was that for low workloads (note that low is > relative here since many users have dual-cpu machines with decent RAM > and disks, and it takes quite a number of concurrent transactions to get > to the peak point), you will not see any improvement and even you'll see > a slight degradation especially from a latency perspective. Below the > peak point of a single machine, you will get the same performance (from > a client point of view) but the load on the various machine resources > will decreased by the number of machines in the cluster (at best). For > example, if I have a workload of 50 requests/second that generates 50% > cpu load on 1 node, I will still get my 50 req/s with 2 machines but the > cpu load will only be 25% on each node. > Now the contention can be elsewhere (disk, locks, ...) and exhibit other > scalability characteristics but it usually conforms to the model I > described. Agreed. >> Of course, the amount of concurrent transactions limits how far a >> replication solution can scale. Having more nodes than concurrent >> transactions does not make sense. (Of course with the exception of >> parallel query execution.) > Yes but don't underestimate the capability of a single node to execute > transactions in parallel as well. Oftentimes sending 2 concurrent > transactions to a single node or to 2 different nodes does not make any > difference (obviously it depends on the nature of the transaction). Okay, I just have to believe that. Up until now I'm mostly basing on theoretical estimates, rather than hard facts. :-) You seem to have made some real benchmarks. Did you publish them? >>> To summarize, clustering solutions provide performance scalability >>> (stable latency, throughput increasing almost linearly with load) but >>> not performance improvement on individual query execution time. >> >> Yes, for writing transactions, no for read-only ones (queries?). Or >> why do you have to add overhead to read-only queries? > In a middleware approach you have to proxy the read results as well so > you will add some latency there. When replication is integrated in the > database you can prevent this extra hop but still the replication logic > adds some overhead to any query (that seems inevitable if you want to > ensure consistency). Ah, okay, yes, the extra hop through the proxy has to be added even to reading queries. >> To make it work and production ready as soon as possible. ;-) I'm >> currently working on initialization and recovery. > Good luck, this is the hardest part ! You'll soon figure out that > replication was really the easy part ! Thanks. May I ask what you use for automatic testing and benchmarking? I'm currently stuck testing 90% of the time. Starting up the GCS, two Databases and attaching the debugger to every process which could possibly go havoc really takes a F***ING lot of time! Regards Markus
Markus Schiltknecht wrote: > Hello Emmanuel, > > Emmanuel Cecchet wrote: > >>> Even here I think that there is a common misconception between > >>> performance and scalability. Most people think that by having > >>> multiple nodes their query will run faster which is obviously wrong > >>> if your original workload does not saturate a single node. > >> > >> Sure. Do you think that should be made clearer? > > Yes, I think so because this is a very common belief that we experience > > with new users. > > Okay, I have forwarded that to Bruce, who's editing the documentation > (and is a native English speaker). I'm not sure how we can cover this, > as we are very general in our description. OK, updated: <term>Multi-Server Parallel Query Execution</term> <listitem> <para> Many of the above solutions allow multiple servers to handle multiple queries, but none allow a single query to use multiple servers to complete faster. This solution allows multiple servers to work concurrently on a single query. This is usually accomplished by splitting the data among servers and having each server execute its part of the query and return results to a central server where they are combined and returned to the user. Pgpool-II has this capability. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Monday 20 November 2006 09:52, Markus Schiltknecht wrote: > But for sure we'll compile a list of available replication solutions > *somewhere* on the website, where we can change it as often as we want. > I'm sure GORDA will get mentioned there. > > @pgsql-docs: BTW, what's the state on that one? Or do I have to ask that > on -www? > Normally the www list is the best place to ask www related questions, but the answer to this is to take a look at the techdocs section of the website (http://www.postgresql.org/docs/techdocs/), it is a wiki like system that allows any community member to put online guides and documentation into place, made specifically for the purpose of things like what your looking for. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Hi, Robert Treat wrote: > Normally the www list is the best place to ask www related questions, but the > answer to this is to take a look at the techdocs section of the website > (http://www.postgresql.org/docs/techdocs/) That URL doesn't quite work because of the trailing slash. Removing it and we're fine. Anyway, thanks for pointing me to the correct place. > it is a wiki like system that > allows any community member to put online guides and documentation into > place, made specifically for the purpose of things like what your looking > for. Bruce, do we start a techdocs for listing the actual replication solutions? Regards Markus
Markus Schiltknecht wrote: > Hi, > > Robert Treat wrote: > > Normally the www list is the best place to ask www related questions, but the > > answer to this is to take a look at the techdocs section of the website > > (http://www.postgresql.org/docs/techdocs/) > > That URL doesn't quite work because of the trailing slash. Removing it > and we're fine. Anyway, thanks for pointing me to the correct place. > > > it is a wiki like system that > > allows any community member to put online guides and documentation into > > place, made specifically for the purpose of things like what your looking > > for. > > Bruce, do we start a techdocs for listing the actual replication solutions? Sure. Once it is done, I can add a link from our documentation. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +