Thread: Future In-Core Replication
I'm beginning to work on advanced additions to in-core replication for
PostgreSQL. There are a number of additional features for existing
single-master replication still to achieve, but the key topics to be
addressed are major leaps forward in functionality. I hope to add useful
features in 9.3, though I realise that many things could take two or even
more release cycles to achieve. (The last set of features took 8 years, so
I'm hoping to do this a little faster.) Some of my 2ndQuadrant colleagues
will be committing themselves to the project also, and we hope to work with
the community in the normal way to create new features. I mention this only
to say that major skills and resources will be devoted to this for the next
release(s), not that this is a private project.

Some people have talked about the need for "multi-master replication",
whereby 2+ databases communicate changes to one another. This topic has
been discussed in some depth in Computer Science academic papers, most
notably "The Dangers of Replication and a Solution" by the late Jim Gray.
I've further studied this to the point where I have a mathematical model
that allows me to predict what our likely success will be from implementing
that. Without meaning to worry you, MM replication alone is not a solution
for large data or the general case. For the general case, single-master
replication will continue to be the most viable option. For large and
distributed data sets, some form of partitioning/sharding is required,
simply because full multi-master replication just isn't viable at both
volume and scale. So my take on this is that MM is desirable, but is not
the only thing we need - we also need partial/filtered replication to make
large systems practical. Hence I've been calling this the "Bi-Directional
Replication" project. I'm aware that paragraph alone requires lots of
explanation, which I hope to provide both in writing and in person at the
forthcoming developer conference.

My starting point for designs is a key constraint: massive change to the
code base is not viable, so any in-core solution must make minimally
invasive changes. And of course, if it is in-core we must also add robust,
clear code with reasonable performance that does not impede non-replication
usage.

The use cases we will address are not focused on any one project or user.
I've distilled these points so far from talking to a wide variety of users,
from major enterprises to startups.

1. GEOGRAPHICALLY DISTRIBUTED - Large users require both High Availability,
which necessitates multiple nodes, and Disaster Recovery, which
necessitates geographically distributed nodes. So my focus is not on
"clustering" in the sense of Hadoop or Oracle RAC, since those technologies
require additional technologies to provide DR; my aim is to arrive at a
coherent set of technologies that provide all that we want. I'm aware that
other projects *are* focused on clustering, so there is even more reason
not to reinvent the wheel.

2. COHERENT - With regard to coherence, I note this thinking is similar to
the way that Oracle replication is evolving, where they have multiple kinds
of in-core replication and various purchased technologies. We have a
similar issue with regard to various external projects. I very much hope
that we can utilise the knowledge, code and expertise of those other
projects in the way we move forwards.

3. ONLINE UPGRADE - Highly available distributed systems must have a
mechanism for online upgrade, otherwise they won't stay HA for long. This
challenge must be part of the solution, and incidentally should be a useful
goal in itself.

4. MULTI-MASTER - The ability to update data from a variety of locations.

5. WRITE-SCALEABLE - The ability to partition data across nodes in a way
that allows the solution to improve beyond the write rate of a single node.

Those are the basic requirements that I am trying to address. There are a
great many important details, but the core of this is probably what I would
call "logical replication", that is, shipping changes to other nodes in a
way that does not tie us to the same physical representation that
recovery/streaming replication does now. Of course, non-physical
replication can take many forms. The assumption of consistency across nodes
is considered optional at this point, and I hope to support both eagerly
consistent and eventually consistent approaches.

I'm aware that this is a broad topic on which many people will want input,
and am also aware that this will take much time. This post is more about
announcing the project than discussing specific details. My strategy is to
come up with designs and prototypes of a few things that might be the best
way forwards. By building prototypes we will more quickly be able to
address the key questions before us. So there is currently work on
research-based development to allow wider discussion based upon something
more than just whiteboards. I'll be the first to explain things that don't
work. I also very much agree that "one size fits all" is the wrong
strategy, so there will be implementation options and parameters, and
possibly even multiple replication techniques.

I will also be organising a small-medium sized "Future of In-Core
Replication" meeting in Ottawa on Wed 16 May, 6-10pm. To avoid this
becoming an unworkably large meeting, attendance will be limited but is
open to highly technical PostgreSQL users who share these requirements, any
attendee of the main developer's meeting who wishes to attend, and other
developers working on PostgreSQL replication/related topics. That will also
allow me to order enough pizza for everyone. I'll send out private invites
to people whom I know (no spam) and think may be interested, but you are
welcome to email me to get access. (This will take me a day or two, so
don't ping me if you didn't get your invite.)

I'm going to do my best to include the right set of features for the
majority of people, all focused on submissions to PostgreSQL core, not any
external project.

Best Regards

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> I will also be organising a small-medium sized "Future of In-Core
> Replication" meeting in Ottawa on Wed 16 May, 6-10pm.

Thanks for such a rapid response. I've put up a wiki page and will be
adding names as they come through:

http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 4/26/12 7:41 AM, Simon Riggs wrote:

> 5. WRITE-SCALEABLE - the ability to partition data across nodes in a
> way that allows the solution to improve beyond the write rate of a
> single node.

It would be valuable to look at READ-SCALEABLE as well; specifically, a
second form of "synchronous" replication where you can read from a slave
"immediately" after transaction commit and have the changes be visible.
That ability would make it trivial to spread reads off of the master
database.

My hope is this wouldn't be horribly painful to achieve if we relaxed the
need to fsync the corresponding WAL on the slave; kind of the opposite of
the semi-synchronous mode we have now. My theory is that, thanks to full
page writes, a slave should normally have the necessary pages to handle a
WAL record in cache, so actually applying the WAL change shouldn't be
horribly slow.

--
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
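[Editor's note: what Jim describes can be made concrete with a small
sketch. This is hypothetical - no such synchronous_commit value existed at
the time; 'apply' is an invented setting used purely to illustrate the
proposed wait-until-visible semantics.]

    -- Hypothetical illustration only: 'apply' is an assumed value, not
    -- real syntax. The existing synchronous mode waits for the standby to
    -- receive/flush WAL; the mode sketched here would wait until the
    -- change is replayed, so a read routed to the standby immediately
    -- after COMMIT sees it.
    SET synchronous_commit = 'apply';

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- example table
    COMMIT;  -- would return only once the standby has applied the change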
Simon,

So the idea is that you'll present briefly your intentions for 9.3 at the
developer meeting, and then have the in-depth discussion afterwards? Sounds
great.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, Apr 26, 2012 at 6:48 PM, Josh Berkus <josh@agliodbs.com> wrote:

> So the idea is that you'll present briefly your intentions for 9.3 at
> the developer meeting, and then have the in-depth discussion
> afterwards? Sounds great.

I really, really do not want the developer meeting to turn into a series of
presentations. That's what we had last year, and it was boring and
unproductive. Furthermore, it excludes from the conversation everyone who
doesn't fit into a room in Ottawa. I think that plans should be presented
here, on pgsql-hackers, and the developer meeting should be reserved for
discussion of issues with which everyone who will be there is already
familiar. If a presentation is required, the developer meeting is the wrong
forum, IMHO of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon,

> I'm beginning to work on advanced additions to in-core replication for
> PostgreSQL.
...
> Those are the basic requirements that I am trying to address. There
> are a great many important details, but the core of this is probably
> what I would call "logical replication", that is, shipping changes to
> other nodes in a way that does not tie us to the same physical
> representation that recovery/streaming replication does now. Of
> course, non-physical replication can take many forms.

So, I'm a bit confused. You talk about this as "additions to in-core
replication", but then you talk about implementing logical replication,
which would NOT be an addition to binary replication. Can you explain what
you mean?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Apr 27, 2012 at 1:26 AM, Josh Berkus <josh@agliodbs.com> wrote:

> So, I'm a bit confused. You talk about this as "additions to in-core
> replication", but then you talk about implementing logical replication,
> which would NOT be an addition to binary replication. Can you explain
> what you mean?

The key point is that there is a specific objective of including additional
features in-core. That places some restrictions, but also offers some
opportunities. Tight integration allows performance improvements, as well
as ease of use etc.

I'm not sure what you mean by "would not be an addition to binary
replication". Yes, for reasons most elegantly explained by Robert here
[http://rhaas.blogspot.co.uk/2011/02/case-for-logical-replication.html],
physical/binary replication puts too many restrictions on us and we cannot
solve all of the problems that way. I was unaware of Robert's post, but it
sets the scene clearly. So the future of in-core replication, IMHO, is some
form of non-physical replication.

There are various options there, but anything that goes in would reuse
significant parts of the existing replication setup that already works so
well. Put that another way: the infrastructure for the secure and efficient
transport of replication messages is already in place. Reuse is also what
makes something useful achievable in a reasonable timescale. What we need
to consider is the form of those new non-physical WAL messages, how they
are built on the sender and how they are handled at the receiving end.

What I'm hoping to do is to build a basic prototype of logical replication
using WAL translation, so we can inspect it to see what the downsides are.
It's an extremely non-trivial problem and so I expect there to be mountains
to climb. There are other routes to logical replication, with messages
marshalled in a similar way to Slony/Londiste/Bucardo/Mammoth(?). So there
are options, with measurements to be made and discussions to be had. It
will take time for people to believe this is possible, and longer to
analyse and agree on the options.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 27, 2012 at 12:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> I really, really do not want the developer meeting to turn into a
> series of presentations. That's what we had last year, and it was
> boring and unproductive. ... If a presentation is required, the
> developer meeting is the wrong forum, IMHO of course.

I agree. Obviously, the word "presentation" is the issue here. If one
person speaks and everybody else is silent, that is not a good use of
anybody's time. On any topic, I expect the introducer of that topic to set
the scene, establish the major questions and encourage discussion. I would
follow that form myself: I want to hear the wisdom of others and alter my
plans accordingly.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> What I'm hoping to do is to build a basic prototype of logical
> replication using WAL translation, so we can inspect it to see what
> the downsides are.

Sounds like Mammoth. Did you take a look at that?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Apr 27, 2012 at 6:59 PM, Josh Berkus <josh@agliodbs.com> wrote:

>> What I'm hoping to do is to build a basic prototype of logical
>> replication using WAL translation, so we can inspect it to see what
>> the downsides are.
>
> Sounds like Mammoth. Did you take a look at that?

Well, they all sound similar. My info was that Mammoth was not WAL-based.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04/27/2012 11:33 AM, Simon Riggs wrote:

> Well, they all sound similar. My info was that Mammoth was not WAL-based.

Mammoth was transaction log based, but not WAL based.
On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> What I'm hoping to do is to build a basic prototype of logical
> replication using WAL translation, so we can inspect it to see what
> the downsides are. It's an extremely non-trivial problem and so I
> expect there to be mountains to climb. There are other routes to
> logical replication, with messages marshalled in a similar way to
> Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
> measurements to be made and discussions to be had.

I'll note that the latest version of Slony, expected to be 2.2 (which
generally seems to work, but we're stuck at the moment waiting to get free
cycles to QA it), has made a substantial change to its data representation.
The triggers used to cook data into a sort of "fractional WHERE clause,"
transforming an I/U/D into a string that you'd trivially combine with the
string INSERT INTO/UPDATE/DELETE FROM to get the logical update. If there
was a need to do anything fancier, you'd be left having to have a
"fractional SQL parser" to split the data out by hand.

New in 2.2 is that the log data is split out into an array of text values,
which means that if someone wanted to do some transformation, such as
filtering on value, or filtering out columns, they could modify the
application-of-updates code to query for the data that they want to fiddle
with. No parser needed.

It's doubtless worthwhile to take a peek at that to make sure it informs
your data representation appropriately. It's important to have data
represented in a fashion that is amenable to manipulation, and that
decidedly wasn't the case pre-2.2.

I wonder if a meaningful transport mechanism might involve combining:

a) A trigger that indicates that some data needs to be captured in a
"logical" form (rather than the presently pretty purely physical form of
WAL)

b) Perhaps a way of capturing logical updates in WAL

c) One of the old ideas that fell through was to try to capture commit
timestamps via triggers. Doing it directly turned out to be too
controversial to get in. Perhaps that's something that could be captured
via some process that parses WAL.

Something seems wrong about that in that it mixes together updates of
multiple forms into WAL, physical *and* logical, and perhaps that implies
that there should be an altogether separate "logical updates log." (LUL?)
That still involves capturing updates in a duplicative fashion, e.g. WAL +
LUL, which seems somehow wrong. Or perhaps I'm tilting at a windmill here.

With Slony/Londiste/Bucardo, we're capturing "LUL" in some tables, meaning
that it gets written both to the tables' data files as well as WAL. Adding
a binary LUL eliminates those table files and attendant WAL updates, thus
providing some savings. [Insert a LULCATS joke here...]

Perhaps I've just had too much coffee...

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
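[Editor's note: to make Christopher's contrast concrete, here is a
schematic sketch of the two representations. These are not Slony's actual
catalog definitions; table and column names are invented for illustration.]

    -- Pre-2.2 style: the logical change is captured as a SQL fragment,
    -- so filtering by column or value means parsing the string apart.
    CREATE TABLE log_pre22 (
        log_actionseq bigserial PRIMARY KEY,
        log_cmdtype   "char",   -- 'I', 'U' or 'D'
        log_cmddata   text      -- e.g.  (id,val) values ('1','x')
    );

    -- 2.2 style: column names and values are split into arrays, so an
    -- apply or transformation process can filter rows and columns in SQL.
    CREATE TABLE log_22 (
        log_actionseq bigserial PRIMARY KEY,
        log_cmdtype   "char",
        log_colnames  text[],   -- e.g. '{id,val}'
        log_colvalues text[]    -- e.g. '{1,x}'
    );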
On Fri, Apr 27, 2012 at 11:50 PM, Christopher Browne <cbbrowne@gmail.com> wrote:

> I'll note that the latest version of Slony ... has made a substantial
> change to its data representation....

The basic model I'm working to is that "logical replication" will ship
Logical Change Records (LCRs) using the same transport mechanism that we
built for WAL. How the LCRs are produced and how they are applied is a
subject for debate and measurement. We're lucky enough to have a variety of
mechanisms to compare - Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo - and
it's worth adding WAL translation there also. My initial thought is that
WAL translation has many positive aspects to it, and we are investigating.
There are also some variants on those themes, such as the one you discussed
above.

You probably won't recognise this as such, but I hope that people might see
that I'm hoping to build Slony 3.0, Londiste++ etc. At some point, we'll
all say "that's not Slony", but we'll also say (Josh already did) "that's
not binary replication". But it will be the descendant of all.

Backwards compatibility is not a goal, please note, but only because that
will complicate matters intensely.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
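[Editor's note: as a thought experiment only - no LCR format existed at
this point in the thread - the kind of information an LCR would need to
carry can be sketched as a SQL composite type. Every field name here is
invented.]

    -- Purely illustrative; not a proposed or existing format.
    CREATE TYPE lcr AS (
        origin_xid  xid,       -- transaction that made the change on the origin
        origin_lsn  text,      -- WAL position of the change on the origin
        action      "char",    -- 'I'nsert, 'U'pdate, 'D'elete, 'T'runcate
        relation    regclass,  -- target table
        pkey_vals   text[],    -- row identity, used for UPDATE/DELETE
        new_vals    text[]     -- column data, used for INSERT/UPDATE
    );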
On Sat, 2012-04-28 at 09:36 +0100, Simon Riggs wrote:

> The basic model I'm working to is that "logical replication" will ship
> Logical Change Records (LCRs) using the same transport mechanism that
> we built for WAL.

One outcome of this LCR approach is probably that you will be shipping
changes as they are made, and on the slave you have to either apply them in
N parallel transactions, committing each transaction when the LCR for the
corresponding transaction says so, or you have to collect the LCRs before
applying, then apply and commit committed-on-master transactions in commit
order and throw away the aborted ones. The optimal approach will probably
be some combination of these: collect and apply the short ones, and start
replay in a separate transaction if the commit does not arrive within N ms.

As to what LCRs should contain, it will probably be logical equivalents of
INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL. The
DDL could actually stay "raw" (as in LCRs for system tables) on the
generator side, as hopefully the rule that "system tables can't have
triggers" does not apply when generating the LCRs on the WAL path. If we
need to go back to ALTER TABLE ... commands, then this is probably wisest
to leave to the client. The client here could also be some xReader-like
middleman.

I would even go as far as to propose a variant of DML-WITH-LIMIT-1 to be
added to PostgreSQL's SQL syntax, so that the LCRs could be converted to
SQL text for some tasks and thus be easy to process using generic
text-based tools. The DML-WITH-LIMIT-1 is required to do single logical
updates on tables with non-unique rows. And as for any logical updates, we
will have a huge performance problem when doing UPDATE or DELETE on a large
table with no indexes, but fortunately this problem is on the slave, not
the master ;)

Generating and shipping the LCRs at WAL-generation time, or perhaps even a
bit earlier, will have the huge performance benefit of not doing double
writing of captured events on the master, which currently is needed for
several reasons, the main one being determining which transactions commit
and in what order. (This can't be solved on the master without a local
event log table, as we don't have commit/rollback triggers.) If we delegate
that part out of the master, then this alone enables us to be almost as
fast as a WAL-based replica in most cases, even when we have a different
logical structure on the slaves.

> How the LCRs are produced and how they are applied is a subject for
> debate and measurement. We're lucky enough to have a variety of
> mechanisms to compare - Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo - and
> it's worth adding WAL translation there also. ...
>
> You probably won't recognise this as such, but I hope that people
> might see that I'm hoping to build Slony 3.0, Londiste++ etc. At some
> point, we'll all say "that's not Slony", but we'll also say (Josh
> already did) "that's not binary replication". But it will be the
> descendant of all.

If we get efficient and flexible logical change event generation on the
master, then I'm sure the current trigger-based logical replication
providers will switch (for full replication), or at least add an extra LCR
source. It may still make sense to leave some flexibility on the master
side, so that some decisions - possibly even complex ones - could be made
when generating the LCRs.

What I would like is to have some of it exposed to userspace via functions
which could be used by developers to push their own LCRs.

As mentioned above, a significant part of this approach can be prototyped
with user-level triggers as soon as we have triggers on commit and
rollback, even though at slightly reduced performance. That is, it will
still have the trigger overhead, but we can omit all the extra writing and
then re-reading and event-table management on the master.

Wanting to play with Streaming Logical Replication (as opposed to the
current Chunked Logical Replication) is also one of the reasons that I
complained when the "command triggers" patch was kicked out of 9.2.

> Backwards compatibility is not a goal, please note, but only because
> that will complicate matters intensely.

Currently there really is nothing similar enough that this could be
backward compatible to :)

--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/
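[Editor's note: Hannu's DML-WITH-LIMIT-1 is proposed, not existing, syntax.
Its intended effect can be approximated today with ctid, the physical row
address - a sketch only: the table is made up, and ctid is not stable
across updates, so a real apply path would want proper core support.]

    -- Sketch: update exactly one of several identical rows. "events" and
    -- its columns are hypothetical. UPDATE ... LIMIT 1 does not exist in
    -- PostgreSQL; this ctid trick approximates the proposed semantics.
    UPDATE events
       SET payload = 'processed'
     WHERE ctid = (SELECT ctid
                     FROM events
                    WHERE payload = 'raw'
                    LIMIT 1);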
On Sat, 2012-04-28 at 21:40 +0200, Hannu Krosing wrote:

> > > I'll note that the latest version of Slony ... has made a
> > > substantial change to its data representation....

Btw, Londiste also moved from a Slony-like trigger with partial SQL to a
more general format a few versions ago. I sure hope they will move to JSON
as the data representation now that we'll have JSON in core in 9.2.

--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/
On Sat, Apr 28, 2012 at 8:40 PM, Hannu Krosing <hannu@krosing.net> wrote:

> As to what LCRs should contain, it will probably be logical equivalents
> of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.

Yeh

> I would even go as far as to propose a variant of DML-WITH-LIMIT-1 to
> be added to PostgreSQL's SQL syntax, so that the LCRs could be
> converted to SQL text for some tasks and thus be easy to process using
> generic text-based tools. The DML-WITH-LIMIT-1 is required to do single
> logical updates on tables with non-unique rows. And as for any logical
> updates, we will have a huge performance problem when doing UPDATE or
> DELETE on a large table with no indexes, but fortunately this problem
> is on the slave, not the master ;)

While that is possible, I would favour the do-nothing approach. By making
the default replication mode = none, we then require a PK to be assigned
before allowing replication mode = on for a table. Trying to replicate
tables without PKs is a problem that can wait, basically.

> Generating and shipping the LCRs at WAL-generation time, or perhaps
> even a bit earlier, will have the huge performance benefit of not doing
> double writing of captured events on the master ... If we delegate that
> part out of the master, then this alone enables us to be almost as fast
> as a WAL-based replica in most cases, even when we have a different
> logical structure on the slaves.

Agreed

> If we get efficient and flexible logical change event generation on the
> master, then I'm sure the current trigger-based logical replication
> providers will switch (for full replication), or at least add an extra
> LCR source. ...
>
> What I would like is to have some of it exposed to userspace via
> functions which could be used by developers to push their own LCRs.

Yes, I see that one coming. That use case is not something I'm focused on,
but I do recognise others wish to pursue that. The bit I'm not sure about
is whether we have custom handler code as well.

> As mentioned above, a significant part of this approach can be
> prototyped with user-level triggers as soon as we have triggers on
> commit and rollback, even though at slightly reduced performance. That
> is, it will still have the trigger overhead, but we can omit all the
> extra writing and then re-reading and event-table management on the
> master.

Agreed.

> Wanting to play with Streaming Logical Replication (as opposed to the
> current Chunked Logical Replication) is also one of the reasons that I
> complained when the "command triggers" patch was kicked out of 9.2.

Yeh. It's clear that project needs to move forwards quickly in 9.3 if we
are to make this Just Work in the way we hope.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
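[Editor's note: Simon's "replication mode = none/on" is design discussion,
not committed syntax. A hypothetical sketch of how it might surface - the
reloption name "replicate" is invented here purely for illustration.]

    -- Hypothetical syntax sketch: the "replicate" storage option does not
    -- exist. The point is the proposed rule: a table must have a primary
    -- key before logical replication can be enabled for it.
    CREATE TABLE customers (
        id   bigint PRIMARY KEY,  -- PK supplies row identity for UPDATE/DELETE LCRs
        name text
    );
    ALTER TABLE customers SET (replicate = on);       -- allowed: PK present
    -- ALTER TABLE pk_less_table SET (replicate = on); -- would raise an error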
On Sun, 2012-04-29 at 12:03 +0100, Simon Riggs wrote:

> While that is possible, I would favour the do-nothing approach. By
> making the default replication mode = none, we then require a PK to be
> assigned before allowing replication mode = on for a table. Trying to
> replicate tables without PKs is a problem that can wait, basically.

While this is a good approach in most cases, there is a large use case for
PK-less / index-less tables as large log files, where you may want to do
INSERT-only replication, perhaps with some automatic partitioning on
logdate. Allowing this is probably something to look at in the first
release, even though I'm not sure what should happen on violation of this
insert-only policy. Should it:

* refuse to continue and roll back the transaction (probably not)
* fail silently
* succeed but log the change locally
* succeed with some special flags so the other side can treat it specially
  without having to look up stuff in the system catalog
* (if we mark the unique / PK fields in some special way anyway, then the
  previous one is free :)

--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/
On Sun, Apr 29, 2012 at 6:20 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:

> While this is a good approach in most cases, there is a large use case
> for PK-less / index-less tables as large log files, where you may want
> to do INSERT-only replication, perhaps with some automatic partitioning
> on logdate. ...

OK, I think an insert-only replication mode would allow that.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
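[Editor's note: the kind of table Hannu has in mind can be sketched with
the trigger-based inheritance partitioning available in the 9.x era. Names
are illustrative; the insert-only replication mode itself is the
hypothetical part.]

    -- Illustrative only: a PK-less, append-only log table, partitioned by
    -- date. An "insert-only" replication mode would ship INSERTs for such
    -- tables and never need to identify individual rows for UPDATE/DELETE.
    CREATE TABLE app_log (
        logdate timestamptz NOT NULL DEFAULT now(),
        message text
    );  -- no PK, no indexes on the master

    CREATE TABLE app_log_2012_05 (
        CHECK (logdate >= '2012-05-01' AND logdate < '2012-06-01')
    ) INHERITS (app_log);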
Excerpts from Simon Riggs's message of jue abr 26 11:10:09 -0300 2012:

> On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> I will also be organising a small-medium sized "Future of In-Core
>> Replication" meeting in Ottawa on Wed 16 May, 6-10pm.
>
> Thanks for such a rapid response. I've put up a wiki page and will be
> adding names as they come through:
>
> http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting

How is this not redundant with the Cluster Summit?
http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit

... oh, you're also already enlisted in that one. Sigh.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sun, Apr 29, 2012 at 11:20 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:

> How is this not redundant with the Cluster Summit?
> http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit
>
> ... oh, you're also already enlisted in that one. Sigh.

My understanding is that the agenda for the cluster meeting is almost
entirely dedicated to Postgres-XC.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 30, 2012 at 7:35 AM, Dave Page <dpage@pgadmin.org> wrote:

> My understanding is that the agenda for the cluster meeting is almost
> entirely dedicated to Postgres-XC.

Yes. Regrettably, there wasn't a 90-minute slot available to discuss this
at the cluster meeting, the dev meeting agenda is fairly full, and my
presentation on replication was rejected for the main conference. As a
result, the additional meeting was the only way left open to me to
initiate detailed discussion.

The additional meeting will give us 240 minutes of discussion and allows
us to bring in some technical users as well.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 30, 2012 at 12:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Yes. Regrettably, there wasn't a 90-minute slot available to discuss
> this at the cluster meeting ...
>
> The additional meeting will give us 240 minutes of discussion and
> allows us to bring in some technical users as well.

Hi All,

Will live streaming/recording of the meeting be available?

Atri

--
Regards,
Atri
l'apprenant
On Thu, Apr 26, 2012 at 01:41:33PM +0100, Simon Riggs wrote:

> Some people have talked about the need for "multi-master replication",
> whereby 2+ databases communicate changes to one another. ... Without
> meaning to worry you, MM replication alone is not a solution for large
> data or the general case. For the general case, single-master
> replication will continue to be the most viable option. For large and
> distributed data sets, some form of partitioning/sharding is required,
> simply because full multi-master replication just isn't viable at both
> volume and scale. So my take on this is that MM is desirable, but is
> not the only thing we need - we also need partial/filtered replication
> to make large systems practical. Hence I've been calling this the
> "Bi-Directional Replication" project. ...

I would love to see a layout of exactly where these things make sense,
similar to what we do at the bottom of our documentation for "High
Availability, Load Balancing, and Replication":

http://www.postgresql.org/docs/9.1/static/different-replication-solutions.html

Users and developers just can't seem to get the calculus of where things
make sense into their heads, me included.

For example, you said that "MM replication alone is not a solution for
large data or the general case". Why is that? Is the goal of your work
really to do logical replication, which allows for major version upgrades?
Is that the defining feature?

--
Bruce Momjian <bruce@momjian.us>       http://momjian.us
EnterpriseDB                           http://enterprisedb.com

+ It's impossible for everything to be true. +
On Mon, Apr 30, 2012 at 12:38 PM, Bruce Momjian <bruce@momjian.us> wrote:

> For example, you said that "MM replication alone is not a solution for
> large data or the general case". Why is that? Is the goal of your work
> really to do logical replication, which allows for major version
> upgrades? Is that the defining feature?

TBH, I don't think MM replication belongs in the database at all. Ditto
any replication solution that implements 'eventual consistency' such that
after-the-fact conflict resolution is required. In an SQL database, when a
transaction commits, it should remain so. It belongs in the application
layer.

merlin
On Mon, Apr 30, 2012 at 6:38 PM, Bruce Momjian <bruce@momjian.us> wrote:

> I would love to see a layout of exactly where these things make sense,
> similar to what we do at the bottom of our documentation for "High
> Availability, Load Balancing, and Replication":
>
> http://www.postgresql.org/docs/9.1/static/different-replication-solutions.html
>
> For example, you said that "MM replication alone is not a solution for
> large data or the general case". Why is that? Is the goal of your work
> really to do logical replication, which allows for major version
> upgrades? Is that the defining feature?

Good question.

The use case, its breadth and utility are always the first place I start.
I'm in the middle of writing a presentation that explains this from first
principles, and I will be discussing it at the PgCon meeting. It's taken a
long time to articulate this, rather than make leaps of assumption and
belief.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 30, 2012 at 2:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> TBH, I don't think MM replication belongs in the database at all. Ditto
> any replication solution that implements 'eventual consistency' such
> that after-the-fact conflict resolution is required. In an SQL
> database, when a transaction commits, it should remain so. It belongs
> in the application layer.

I basically agree, at least in the medium term.

The logical replication solutions we have today generally seem to work by
watching the inserts, updates, and deletes go by and writing the changed
tuples to a side table. This is not very performant, because it amounts to
writing the data four times: we have to write WAL for the original change,
write the data files for the original change, write more WAL for the
change records, and then write those data files. Since all large database
solutions are eventually I/O-bound, this is not great. Writing and
flushing a separate replication log in parallel to WAL would get us down
to three writes, and extracting tuple data from the existing WAL would get
us down to two writes, which is as well as we ever know how to do.

If we just had that much in core - that is, the ability to efficiently
extract tuple inserts, updates, and deletes on a logical level - it would
be much easier to build a good logical replication system around
PostgreSQL than it is today, and the existing systems could be adapted to
deliver higher performance by making use of the new infrastructure. The
other half of the changes - applying the updates - is relatively
straightforward, and it wouldn't bother me to leave that in user-land,
especially in the MMR case, where you have to deal with conflict
resolution rules that may be much simpler to express in a higher-level
language than they would be in C.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
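[Editor's note: to put rough numbers on Robert's bookkeeping - a
simplification that ignores indexes, WAL record headers and full-page
images. If a transaction changes T bytes of tuple data:]

    trigger-based capture:  WAL(T) + heap(T) + WAL(T) + heap(T)  ~ 4T written
    separate logical log:   WAL(T) + heap(T) + logical log(T)    ~ 3T written
    extraction from WAL:    WAL(T) + heap(T)                     ~ 2T written

The last row is the floor: the WAL record and the data-file write are the
two copies any crash-safe database must make anyway, which is why
extracting logical changes from the existing WAL is as good as it gets.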
On Mon, Apr 30, 2012 at 07:55:00PM +0100, Simon Riggs wrote:

> Good question.
>
> The use case, its breadth and utility are always the first place I
> start. I'm in the middle of writing a presentation that explains this
> from first principles, and I will be discussing it at the PgCon
> meeting. It's taken a long time to articulate this, rather than make
> leaps of assumption and belief.

Yep, it is the "assumption and belief" that always confuses me.

--
Bruce Momjian <bruce@momjian.us>       http://momjian.us
EnterpriseDB                           http://enterprisedb.com

+ It's impossible for everything to be true. +
Robert Haas <robertmhaas@gmail.com> wrote:

> The other half of the changes - applying the updates - is relatively
> straightforward, and it wouldn't bother me to leave that in user-land,
> especially in the MMR case, where you have to deal with conflict
> resolution rules that may be much simpler to express in a higher-level
> language than they would be in C.

I've developed successful MMR more than once, and while I wouldn't say
it's exactly been implemented in the database, it hasn't exactly been in
application space either. The most successful implementations I've worked
with have been a layer just outside the database, of which application
software was completely unaware. The database wasn't aware of the
coordination per se; but it *did* need to provide transaction information
in a clean way, and the declarations of how data was distributed were in
the database.

In my experience a declarative definition of data distribution has always
been sufficient, and certainly cleaner to deal with than imperative coding
would be. YMMV.

-Kevin
On Mon, Apr 30, 2012 at 2:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> If we just had that much in core - that is, the ability to efficiently
> extract tuple inserts, updates, and deletes on a logical level - it
> would be much easier to build a good logical replication system around
> PostgreSQL than it is today, and the existing systems could be adapted
> to deliver higher performance by making use of the new infrastructure.
> ...

Yeah - here at $work the SQL Server team (once in a while we cross
no-man's land and converse) has some fancy technology that sits directly
on top of the transaction log and exposes an API that you can use to peek
into the river of data running through the log and do stuff with it. In
our case, they use it to triage extracts from about 100 or so distributed
databases into a centralized store in a relatively realtime fashion. HS/SR
simply can't do that, and there would be tremendous value in something
that could.

merlin
>> If we just had that much in core - that is, the ability to efficiently
>> extract tuple inserts, updates, and deletes on a logical level - it
>> would be much easier to build a good logical replication system around
>> PostgreSQL than it is today, and the existing systems could be adapted
>> to deliver higher performance by making use of the new infrastructure.

Well, this *is* the purpose of the cluster-hackers group: to add backend
support which would make external replication systems easier to build and
more efficient. So far the only real feature to come out of that has been
the Command Triggers, but if you read the TODO list of that group you'll
see that it's a laundry list of things replication systems need support
for in the backend.

http://wiki.postgresql.org/wiki/ClusterFeatures
http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit

What puts me off about this "let's start from first principles" approach
is that in our community we have years of experience (in a couple of
cases, over a decade) with logical-level replication. It seems like the
last thing we need is Yet Another PostgreSQL Replication System, started
over from scratch and years away from being production quality. Don't we
have enough external replication systems with not enough developers
behind them?

Even if improving an existing replication system proves to be impossible,
it would make more sense to start with an analysis of the strengths and
deficiencies of Slony, Londiste, Bucardo, etc., than with some kind of
"clean-room" approach.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Apr 30, 2012 at 11:43 PM, Josh Berkus <josh@agliodbs.com> wrote:

> Well, this *is* the purpose of the cluster-hackers group

Well, I tried all available means to discuss my ideas before organising an
external meeting. You can think of the InCore meeting as an extension of
the cluster-hackers meeting if you wish.

> What puts me off about this "let's start from first principles"
> approach is that in our community we have years of experience (in a
> couple of cases, over a decade) with logical-level replication. It
> seems like the last thing we need is Yet Another PostgreSQL Replication
> System, started over from scratch and years away from being production
> quality. Don't we have enough external replication systems with not
> enough developers behind them?

I've tried very hard to express my admiration and respect for the
developers of many replication systems, both personally when we meet in
person and on-list. And I've mentioned that I'd like to include as many
ideas as possible in an in-core approach.

Yes, we have many external replication systems. Many, many people have
expressed the desire for more coherent features in core. I believe we can
achieve production-quality code in 1, maybe 2 releases. That is only
possible by building on what we already have and reusing the concepts,
experience and perhaps even code from other projects. We are closer to
that than your statement allows. Yes, we have enough external replication
systems, and that is one reason why we need to put things in core and not
just create another external system.

> Even if improving an existing replication system proves to be
> impossible, it would make more sense to start with an analysis of the
> strengths and deficiencies of Slony, Londiste, Bucardo, etc., than with
> some kind of "clean-room" approach.

I'm not sure why you think I would *not* be starting with that analysis.

I look forward to discussing this in person, where I'm sure it will be
easier.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> Well, this *is* the purpose of the cluster-hackers group
>
> Well, I tried all available means to discuss my ideas before organising
> an external meeting. You can think of the InCore meeting as an
> extension of the cluster-hackers meeting if you wish.

That comment wasn't for you, it was for other folks who didn't seem to be
aware that some of these ideas had already been discussed. You, of course,
know, because you were there. And if you need a 3-hour meeting, there's no
question that you need the evening thing. There's just not enough time in
anyone's schedule.

> I'm not sure why you think I would *not* be starting with that analysis.

That wasn't the impression I got from your prior emails to -hackers.
Thanks for the clarification!

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
> Those are the basic requirements that I am trying to address. There
> are a great many important details, but the core of this is probably
> what I would call "logical replication", that is, shipping changes to
> other nodes in a way that does not tie us to the same physical
> representation that recovery/streaming replication does now. Of
> course, non-physical replication can take many forms.

Guessing from "shipping changes to other nodes", you seem to be implicitly
aiming at asynchronous replication? If so, I am afraid it will force users
to pay some cost to migrate from existing applications.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Tue, May 1, 2012 at 1:10 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:

> Guessing from "shipping changes to other nodes", you seem to be
> implicitly aiming at asynchronous replication? If so, I am afraid it
> will force users to pay some cost to migrate from existing
> applications.

The main plan is to use the existing streaming mechanism. That allows
changes/LCRs to be shipped synchronously to some nodes and asynchronously
to others. We need a mechanism that works across continents, so eager
replication is not the most likely candidate. So the focus would be on
shipping changes after they have been made, using lazy replication. It
sounds a bit strange, but we can have synchronous lazy replication, if you
wish it.

None of the above presumes anything about the content or role of the LCRs
being shipped.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
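[Editor's note: the per-standby mixing of modes Simon describes already
exists for physical streaming as of 9.1. A minimal postgresql.conf sketch,
with the standby names invented for illustration:]

    # postgresql.conf on the master (9.1+). A standby whose
    # application_name is listed here is synchronous; any other connected
    # standby streams the same WAL asynchronously.
    synchronous_standby_names = 'london'
    # e.g. a standby connecting with application_name=tokyo replicates lazily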
On Mon, Apr 30, 2012 at 6:43 PM, Josh Berkus <josh@agliodbs.com> wrote:

> Well, this *is* the purpose of the cluster-hackers group, to add
> backend support which would make external replication systems easier
> to build and more efficient. So far the only real feature to come out
> of that has been the Command Triggers, but if you read the TODO list
> of that group you'll see that it's a laundry list of things replication
> systems need support for in the backend.

Right, but the cluster-hackers group has essentially zero mailing list
traffic and seems to be making no progress on getting any of the features
they need into the backend, with the exception, as you say, of Command
Triggers. So to say, oh, well, we don't need a new effort to do this, we
have cluster-hackers, is just sticking our head in the sand. We obviously
do need more of an effort than we've heretofore had, and if Simon/2Q are
willing to step it up a few notches, kudos to them. Hopefully I'll get a
chance to participate as well.

One thing I am a bit uncomfortable with about this whole discussion is
that it seems like a lot of the design and research is happening off-list,
with intent to report results back here later. While that is obviously
fine at some level, keeping the design under wraps until the code is
written is a strategy that's had notoriously poor success in this
community, and I hope we're not in for a repeat of that strategy. I think
we ought to be sharing and debugging designs in public, not internally
within 2ndQuadrant - or any other company, or any other mailing list other
than this one. Nobody enjoys doing a lot of work on a patch and then
having it get killed because Tom Lane - or someone else - points out some
critical flaw in the design, and the way to avoid that is to try to flush
out the flaws in public before too much work goes into it. On the flip
side, I certainly understand that sometimes you need to take some time to
get your house in order before you can present a coherent public proposal,
so don't take this as an accusation that anyone is doing anything wrong;
it's just a possible concern.

> What puts me off about this "let's start from first principles"
> approach is that in our community we have years of experience (in a
> couple of cases, over a decade) with logical-level replication. It
> seems like the last thing we need is Yet Another PostgreSQL Replication
> System, started over from scratch and years away from being production
> quality. Don't we have enough external replication systems with not
> enough developers behind them?

No. We have, after over a decade of work, no meaningful core support for
the parts of replication that need core support in order to not suck.
People are switching from logical replication of various types to SR in
droves, not because they like the draconian restrictions that HS/SR
impose, but rather because it has superior performance and reliability
characteristics. It has those characteristics because it has core
support. You cannot build an airplane by bolting wings onto a car.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 1, 2012 at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> One thing I am a bit uncomfortable with about this whole discussion
> is that it seems like a lot of the design and research is happening
> off-list, with intent to report results back here later. While that
> is obviously fine at some level, keeping the design under wraps until
> the code is written is a strategy that's had notoriously poor success
> in this community, and I hope we're not in for a repeat of it. I
> think we ought to be sharing and debugging designs in public, not
> internally within 2ndQuadrant - or any other company, or any mailing
> list other than this one. Nobody enjoys doing a lot of work on a
> patch and then having it get killed because Tom Lane - or someone
> else - points out some critical flaw in the design, and the way to
> avoid that is to try to flush out the flaws in public before too much
> work goes into it. On the flip side, I certainly understand that
> sometimes you need to take some time to get your house in order
> before you can present a coherent public proposal, so don't take this
> as an accusation that anyone is doing anything wrong; it's just a
> possible concern.

I'm aware of the difficulties here. It's a bootstrap. If we start by discussing it, then we receive the predictable "that'll never work" replies. If we start by building a prototype, then we get the "you're hiding it" replies. Neither is ideal.

I'm also well aware of "if only you'd shown me your design sooner, I could have saved you the time", which is the stock response from every architect I ever met. The truth is, as I'm sure you might guess, that I don't know everything, so making a few mistakes early on lets us avoid the time-wasting annoyance of airing early-stage ideas.

All I can say is that I've done this a few times before, and the process is easier once we have a little structure in place, with details like which bike shed we are painting and what the phrase "to paint" entails. My hope is that we can present a basic set of facts and measurements that will act as a platform for challenge and discussion of requirements and ideas.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4/29/12 6:03 AM, Simon Riggs wrote:
>> The DML-WITH-LIMIT-1 is required to do single logical updates on tables
>> with non-unique rows.
>> And as for any logical updates we will have huge performance problem
>> when doing UPDATE or DELETE on large table with no indexes, but
>> fortunately this problem is on slave, not master ;)
> While that is possible, I would favour the do-nothing approach. By
> making the default replication mode = none, we then require a PK to be
> assigned before allowing replication mode = on for a table. Trying to
> replicate tables without PKs is a problem that can wait, basically.

Something that an in-core method might be able to do that an external one can't would be to support a method of uniquely identifying rows in tables with no PKs. A gross example (that undoubtedly wouldn't work in the real world) would be using TIDs. A real-world implementation might be based on a hidden serial column.

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
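[A hedged illustration of the "hidden serial column" idea in plain SQL as it exists today; the column is not truly hidden, and all names are made up for the example.]

    -- Give a PK-less table a surrogate identity that replication could
    -- key on; an in-core version would presumably hide this column.
    ALTER TABLE no_pk_table ADD COLUMN _row_identity bigserial;
    CREATE UNIQUE INDEX no_pk_table_row_identity_idx
        ON no_pk_table (_row_identity);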
> Something that an in-core method might be able to do that an external one
> can't would be to support a method of uniquely identifying rows in
> tables with no PKs. A gross example (that undoubtedly wouldn't work in
> the real world) would be using TIDs. A real-world implementation might
> be based on a hidden serial column.

Realistically you need more than a serial for MM replication. For each row-version, you need:

- serverID of last update
- serialID of row
- timestamp of last update

... and note that this would have to include deleted rows as well. Currently Bucardo does this by using several fields, but you could put together one 128-bit field which contains all of this information. Or you could do an Instagram and compress it into 64 bits, but that would require limiting the problem space in a way you probably can't for a general tool.

I do agree that depending on user-defined PKs raises a whole host of issues which we'd rather just sidestep, though.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
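[A sketch of that per-row-version metadata as ordinary columns, purely to make the three fields concrete. A real implementation would pack them into something like the single 128-bit value described, and would also need to retain entries for deleted rows, e.g. in a tombstone table; all names here are illustrative.]

    ALTER TABLE replicated_table
        ADD COLUMN last_origin_id int2,        -- serverID of last update
        ADD COLUMN row_serial     int8,        -- serialID of the row
        ADD COLUMN last_updated   timestamptz; -- timestamp of last update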
On Thu, May 3, 2012 at 6:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
> I do agree that depending on user-defined PKs raises a whole host of
> issues which we'd rather just sidestep, though.

What do you have in mind instead?

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 05/01/2012 09:09 AM, Robert Haas wrote:
> I think we ought to be sharing and debugging designs in
> public, not internally within 2ndQuadrant - or any other company, or
> any mailing list other than this one.

OK. You go first.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, 2012-05-03 at 00:58 -0500, Jim Nasby wrote:
> On 4/29/12 6:03 AM, Simon Riggs wrote:
> >> The DML-WITH-LIMIT-1 is required to do single logical updates on tables
> >> with non-unique rows.
> >> And as for any logical updates we will have huge performance problem
> >> when doing UPDATE or DELETE on large table with no indexes, but
> >> fortunately this problem is on slave, not master ;)
> > While that is possible, I would favour the do-nothing approach. By
> > making the default replication mode = none, we then require a PK to be
> > assigned before allowing replication mode = on for a table. Trying to
> > replicate tables without PKs is a problem that can wait, basically.
>
> Something that an in-core method might be able to do that an external one
> can't would be to support a method of uniquely identifying rows in tables
> with no PKs. A gross example (that undoubtedly wouldn't work in the real
> world) would be using TIDs.
> A real-world implementation might be based on a hidden serial column.

For logical replication we don't really need to uniquely identify such rows - it should be sufficient if we just update exactly one of the matching rows.

The way to do this is to put all fields of the OLD.* tuple in the WHERE clause and then update just one matching row.

IIRC updating (or deleting) CURRENT OF a cursor is currently supported only in pl/pgsql, so this needs to be done using a plpgsql cursor.

If the table has no indexes or the index lookup returns lots of rows, then this is bound to be slow, but in that case it was probably slow on the master too :)
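[A minimal plpgsql sketch of the approach just described, for a hypothetical two-column table t(a int, b text). The cursor's WHERE clause matches the full OLD.* image, and WHERE CURRENT OF pins the UPDATE to exactly one of the matching rows. NULL-safe comparison (IS NOT DISTINCT FROM) is omitted for brevity; the function name is made up.]

    CREATE FUNCTION apply_one_update(old_a int, old_b text,
                                     new_a int, new_b text)
    RETURNS boolean AS $$
    DECLARE
        cur CURSOR FOR
            SELECT * FROM t WHERE a = old_a AND b = old_b;
        rec t%ROWTYPE;
    BEGIN
        OPEN cur;
        FETCH cur INTO rec;        -- take just the first matching row
        IF NOT FOUND THEN
            CLOSE cur;
            RETURN false;          -- no matching row on this node
        END IF;
        UPDATE t SET a = new_a, b = new_b WHERE CURRENT OF cur;
        CLOSE cur;
        RETURN true;
    END;
    $$ LANGUAGE plpgsql;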
On Fri, May 4, 2012 at 8:32 AM, Hannu Krosing <hannu@krosing.net> wrote:
> For logical replication we don't really need to uniquely identify such
> rows - it should be sufficient if we just update exactly one of the
> matching rows.
>
> The way to do this is to put all fields of the OLD.* tuple in the WHERE
> clause and then update just one matching row.
>
> IIRC updating (or deleting) CURRENT OF a cursor is currently supported
> only in pl/pgsql, so this needs to be done using a plpgsql cursor.
>
> If the table has no indexes or the index lookup returns lots of rows,
> then this is bound to be slow, but in that case it was probably slow on
> the master too :)

I was about to write a reply saying exactly this, but you said it better than I would have been able to manage.

I think this is all exactly right.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 3, 2012 at 8:22 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 05/01/2012 09:09 AM, Robert Haas wrote:
>> I think we ought to be sharing and debugging designs in
>> public, not internally within 2ndQuadrant - or any other company, or
>> any mailing list other than this one.
>
> OK. You go first.

Are you requesting more transparency in general, asking for my thoughts on logical replication specifically, or something else?

I try pretty hard not to go off and do large amounts of work in a vacuum. If something is more than a couple days' work, I post the design on -hackers and wait for feedback before writing a line of code.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4 May 2012 14:01, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 4, 2012 at 8:32 AM, Hannu Krosing <hannu@krosing.net> wrote:
>> For logical replication we don't really need to uniquely identify such
>> rows - it should be sufficient if we just update exactly one of the
>> matching rows.
>>
>> The way to do this is to put all fields of the OLD.* tuple in the WHERE
>> clause and then update just one matching row.
>>
>> IIRC updating (or deleting) CURRENT OF a cursor is currently supported
>> only in pl/pgsql, so this needs to be done using a plpgsql cursor.
>>
>> If the table has no indexes or the index lookup returns lots of rows,
>> then this is bound to be slow, but in that case it was probably slow on
>> the master too :)
>
> I was about to write a reply saying exactly this, but you said it
> better than I would have been able to manage.
>
> I think this is all exactly right.

Yes, but it's not a high priority for inclusion. Many things like this will need to wait behind the really critical additional features.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 05/04/2012 09:03 AM, Robert Haas wrote:
> I try pretty hard not to go off and do large amounts of work in a
> vacuum. If something is more than a couple days' work, I post the
> design on -hackers and wait for feedback before writing a line of code.

That is an excellent luxury to have. You've worked very hard to earn it. Not everyone is in a position where that's possible, though.

> Are you requesting more transparency in general, asking for my
> thoughts on logical replication specifically, or something else?

The straw man argument here would require 100% transparency on everything you do in regards to PostgreSQL and related software. Before doing any development on any code, first post here to ask for design review. And if someone asks you to work on a program that isn't open source from day one, refuse unless you can operate that transparently.

That standard is nice if you can pull it off. But I don't give you a hard time if you have to make some compromises from that ideal to keep yourself gainfully employed. You do a ton of good work for the PostgreSQL community in a transparent way, so I assume that you're doing the best you can. I would like to see that same assumption extended to our side, too.

Here are the individual straw men in this area I'd like to see put out of their misery:

"You're developing things in secret": if that's the case, we're pretty bad at it, given the history I outlined at http://archives.postgresql.org/message-id/4F9B1B6C.5010300@2ndQuadrant.com

"That discussion didn't happen in the right place": it's not our fault that the cluster-hackers list exists. Go joust at getting that list shut down and their meeting during PGCon canceled if you think it's unproductive for discussions to happen there. I've been trying to bridge that gap for over two years now; note how many times I appear in the edit history at http://wiki.postgresql.org/index.php?title=ClusterFeatures&action=history

"You might do too much development in the wrong direction and not build the right thing": and? Yes, there are people who develop their way into a corner and end up doing unproductive work as a result. And there are others who submit things and give up when faced with feedback on them. Last time I checked, there wasn't anyone who flat-out rejects on-list feedback working for 2ndQuadrant. Instead, I see features that go through extensive and numerous review cycles based on what we hear back.

"Designs should be presented on-list before doing any development": this is not always practical for those of us who are doing feature development. Some feature sponsors are still getting used to open development. If we have a private development milestone date to hit *in order to get more funding for public PostgreSQL work*, which is often the case here, we try not to miss it; we'd be bad community members if we did. And sometimes that involves building a proof of concept or prototype here first, then submitting it to the community once it's moved on to being a proven concept. Since the community has a clear set of guidelines for how and when to submit new features, we make sure the development plans line up with them.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Fri, May 4, 2012 at 11:06 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> The straw man argument here would require 100% transparency on everything
> you do in regards to PostgreSQL and related software. Before doing any
> development on any code, first post here to ask for design review. And if
> someone asks you to work on a program that isn't open source from day one,
> refuse unless you can operate that transparently.

Well, look. At the end of the day, I don't really care whether you post your designs before writing code or not - unless it turns out that we get to the end of the development cycle, a gigantic patch shows up at the last minute, it gets rejected because people aren't satisfied with the design, and then massive bitching ensues because the author(s) put a lot of work into that patch. Then I care, because now the fact that no design consensus was sought at the outset has been transformed into a defect in the community process, which does in fact have defects, but that isn't one of them. We all know that design review is going to have to happen at some point, and if there's not an adequate opportunity to do that before the code is written, then it will happen after the code is written. If that means the code has to be thrown out, then that's the risk you take by writing the code first. As long as everybody understands that, do it in whatever order you like.

I think the real straw man here is the idea that it will somehow save time to skip the design phase and start writing code. I have never worked on a project, open source or otherwise, where that was true, and I believe that any textbook on software engineering you pick up is likely to tell you that in fact exactly the opposite is the case. Obviously, there are times when you need to write some throw-away code just to see how things shake out, and I do that all the time, and it makes complete sense, and I'm not knocking it. But if any of that code makes it into the committed patch, I count that as unusually lucky.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Robert, Hi all,

On Friday, May 04, 2012 06:29:33 PM Robert Haas wrote:
> Well, look. At the end of the day, I don't really care whether you
> post your designs before writing code or not - unless it turns out
> that we get to the end of the development cycle, a gigantic patch
> shows up at the last minute, it gets rejected because people aren't
> satisfied with the design, and then massive bitching ensues because
> the author(s) put a lot of work into that patch. Then I care, because
> now the fact that no design consensus was sought at the outset has
> been transformed into a defect in the community process, which does in
> fact have defects, but that isn't one of them. We all know that
> design review is going to have to happen at some point, and if there's
> not an adequate opportunity to do that before the code is written,
> then it will happen after the code is written. If that means the code
> has to be thrown out, then that's the risk you take by writing the
> code first. As long as everybody understands that, do it in whatever
> order you like.

In my understanding - as the person doing quite a bit of the coding at the moment - the point is to provide a very minimal *early* prototype as a sensible basis for design decisions/discussions. On the one hand, that is useful for getting a feeling for the problems involved. On the other hand, design discussions on -hackers without an underlying basic patch & design tend to drift into feature creep and bikeshedding; a prototype also helps against "this is impossible" claims. Parts of this thread and related ones are a fairly good example of this.

The plan is to show the early prototype around pgcon, and to send design documents and split-up patches (of that prototype) to -hackers a holiday and some cleanup later. I/We aim to have individual, independently usable parts of the patch submitted to the first 9.3 commitfest.

I definitely do not wish to screw anyone over by doing this, and I am sure that's the same for the others working on this, even if emotions sometimes get in the way.

Greetings,

Andres

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 4, 2012 at 12:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> In my understanding - as the person doing quite a bit of the coding at
> the moment - the point is to provide a very minimal *early* prototype
> as a sensible basis for design decisions/discussions. On the one hand,
> that is useful for getting a feeling for the problems involved. On the
> other hand, design discussions on -hackers without an underlying basic
> patch & design tend to drift into feature creep and bikeshedding; a
> prototype also helps against "this is impossible" claims. Parts of
> this thread and related ones are a fairly good example of this.
>
> The plan is to show the early prototype around pgcon, and to send
> design documents and split-up patches (of that prototype) to -hackers
> a holiday and some cleanup later. I/We aim to have individual,
> independently usable parts of the patch submitted to the first 9.3
> commitfest.
>
> I definitely do not wish to screw anyone over by doing this, and I am
> sure that's the same for the others working on this, even if emotions
> sometimes get in the way.

Thanks. I think we are basically all on the same page here. As I said upthread, I am not trying to accuse anyone of doing anything wrong, and as you and Simon and Greg have all said, you're not trying to do anything wrong. So, no problem. :-)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company