Thread: Proposal: Commit timestamp
For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread.

For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction. During replication, the replication process must be able to provide the remote node's timestamp so that the replicated data will be "as of the time it was written on the remote node", and not the current local time of the replica, which is by definition of "asynchronous" later.

To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.

If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.

The COMMIT syntax will get extended to

    COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];

The extension is limited to superusers and will override the normally generated commit timestamp. This will be used to give the replicating transaction on the replica the exact same timestamp it got on the originating master node.

The pg_tslog segments will be purged like the clog segments, after all transactions belonging to them have been stamped frozen. A frozen xid by definition has a timestamp of epoch. To ensure a system using this timestamp feature has enough time to perform its work, a new GUC variable defining an interval will prevent vacuum from freezing xids that are younger than that.

A function get_commit_timestamp(xid) returning timestamptz will return the commit time of a transaction as recorded by this feature.

Comments, changes, additions?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
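A minimal C sketch of the timestamp generation the proposal describes (monotonic bump by one microsecond, or millisecond truncation plus a node priority in the microsecond digits), assuming a microsecond-resolution system clock. The names next_commit_timestamp, last_commit_ts and the integer tslog_priority parameter are illustrative only, not actual PostgreSQL code:

#include <stdint.h>
#include <sys/time.h>

static int64_t last_commit_ts = 0;   /* last timestamp handed out, microseconds */

/* current system time in microseconds since the epoch */
static int64_t
current_time_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t) tv.tv_sec * 1000000 + tv.tv_usec;
}

/*
 * Return a strictly increasing commit timestamp.  With a priority
 * configured (assumed 0..999), the clock is truncated to milliseconds,
 * bumped in millisecond steps, and the priority fills the microsecond
 * digits as a cluster-wide tie breaker.
 */
int64_t
next_commit_timestamp(int tslog_priority)
{
    int64_t now = current_time_usec();

    if (tslog_priority > 0)
    {
        int64_t now_ms  = now / 1000;            /* truncate to milliseconds */
        int64_t last_ms = last_commit_ts / 1000;

        if (now_ms <= last_ms)
            now_ms = last_ms + 1;                /* bump by one millisecond */

        last_commit_ts = now_ms * 1000 + tslog_priority;
    }
    else
    {
        if (now <= last_commit_ts)
            now = last_commit_ts + 1;            /* bump by one microsecond */

        last_commit_ts = now;
    }
    return last_commit_ts;
}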
On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote: > For conflict resolution purposes in an asynchronous multimaster system, > the "last update" definition often comes into play. For this to work, > the system must provide a monotonically increasing timestamp taken at > the commit of a transaction. Do you really need an actual timestamptz derived from the system clock, or would a monotonically increasing 64-bit counter be sufficient? (The assumption that the system clock is monotonically increasing seems pretty fragile, in the presence of manual system clock changes, ntpd, etc.) > Comments, changes, additions? Would this feature have any use beyond the specific project/algorithm you have in mind? -Neil
Jan Wieck <JanWieck@Yahoo.com> writes: > To provide this data, I would like to add another "log" directory, > pg_tslog. The files in this directory will be similar to the clog, but > contain arrays of timestamptz values. Why should everybody be made to pay this overhead? > The COMMIT syntax will get extended to > COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>]; > The extension is limited to superusers and will override the normally > generated commit timestamp. This will be used to give the replicating > transaction on the replica the exact same timestamp it got on the > originating master node. I'm not convinced you've even thought this through. If you do that then you have no guarantee of commit timestamp monotonicity on the slave (if it has either multi masters or any locally generated transactions). Since this is supposedly for a multi-master system, that seems a rather fatal objection --- no node in the system will actually have commit timestamp monotonicity. What are you hoping to accomplish with this? regards, tom lane
On 1/25/2007 6:47 PM, Neil Conway wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>> For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction.
>
> Do you really need an actual timestamptz derived from the system clock, or would a monotonically increasing 64-bit counter be sufficient? (The assumption that the system clock is monotonically increasing seems pretty fragile, in the presence of manual system clock changes, ntpd, etc.)

Yes, I do need it to be a timestamp, and one assumption is that all servers in the multimaster cluster are ntp synchronized. The reason is that this is for asynchronous multimaster (in my case). Two sequences running on separate systems don't tell which was the "last update" on a timeline. This conflict resolution method alone is of course completely inadequate.

>> Comments, changes, additions?
>
> Would this feature have any use beyond the specific project/algorithm you have in mind?

The tablelog project on pgfoundry currently uses the transaction's start time but would be very delighted to have the commit time available instead.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 1/25/2007 6:47 PM, Neil Conway wrote:
>> Would this feature have any use beyond the specific project/algorithm you have in mind?
>
> The tablelog project on pgfoundry currently uses the transaction's start time but would be very delighted to have the commit time available instead.

BTW, it's not clear to me why you need a new log area for this. (We don't log transaction start time anywhere, so certainly tablelog's needs would not include it.) Commit timestamps are available from WAL commit records in a crash-and-restart scenario, so wouldn't that be enough?

regards, tom lane
On 1/25/2007 6:49 PM, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> To provide this data, I would like to add another "log" directory, >> pg_tslog. The files in this directory will be similar to the clog, but >> contain arrays of timestamptz values. > > Why should everybody be made to pay this overhead? It could be made an initdb time option. If you intend to use a product that requires this feature, you will be willing to pay that price. > >> The COMMIT syntax will get extended to >> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>]; >> The extension is limited to superusers and will override the normally >> generated commit timestamp. This will be used to give the replicating >> transaction on the replica the exact same timestamp it got on the >> originating master node. > > I'm not convinced you've even thought this through. If you do that then > you have no guarantee of commit timestamp monotonicity on the slave > (if it has either multi masters or any locally generated transactions). > Since this is supposedly for a multi-master system, that seems a rather > fatal objection --- no node in the system will actually have commit > timestamp monotonicity. What are you hoping to accomplish with this? Maybe I wasn't clear enough about this. If the commit timestamps on the local machine are guaranteed to increase at least by one millisecond (okay that limits the system to a sustained 1000 commits per second before it really seems to run ahead of time), then no two commits on the same instance will ever have the same timestamp. If furthermore each instance in a cluster has a distinct priority (the microsecond part added to the millisecond-truncated timestamp), each commit timestamp could even act as a globally unique ID. It does require that all the nodes in the cluster are configured with a distinct priority. What I hope to accomplish with this is a very easy, commit time based "last update wins" conflict resolution for data fields of the overwrite nature. The replication system I have in mind will have another field type of the balance nature, where it will never communicate the current value but only deltas that get applied regardless of the two timestamps. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 1/25/2007 7:41 PM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 1/25/2007 6:47 PM, Neil Conway wrote:
>>> Would this feature have any use beyond the specific project/algorithm you have in mind?
>
>> The tablelog project on pgfoundry currently uses the transaction's start time but would be very delighted to have the commit time available instead.
>
> BTW, it's not clear to me why you need a new log area for this. (We don't log transaction start time anywhere, so certainly tablelog's needs would not include it.) Commit timestamps are available from WAL commit records in a crash-and-restart scenario, so wouldn't that be enough?

First, I need the timestamp of the original transaction that caused the data to change, which can be a remote or a local transaction. So the timestamp currently recorded in the WAL commit record is useless, and the commit record has to be extended by one more timestamp.

Second, I don't think that an API scanning for WAL commit records by xid would be efficient enough to satisfy the needs of a timestamp based conflict resolution system, which would have to retrieve the timestamp for every row's xmin that it is about to update in order to determine if the old or the new values should be used.

Third, keeping the timestamp information in the WAL only would require keeping the WAL segments around until they are older than the admin-chosen minimum freeze age. I hope you don't want to force that penalty on everyone who intends to use multimaster replication.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Thu, 25 Jan 2007, Jan Wieck wrote: > > For a future multimaster replication system, I will need a couple of > features in the PostgreSQL server itself. I will submit separate > proposals per feature so that discussions can be kept focused on one > feature per thread. Hmm... "will need" ... Have you prototyped this system yet? ISTM you can prototype your proposal using "external" components so you can work out the kinks first. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 1/25/2007 8:42 PM, Richard Troy wrote:
> On Thu, 25 Jan 2007, Jan Wieck wrote:
>> For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread.
>
> Hmm... "will need" ... Have you prototyped this system yet? ISTM you can prototype your proposal using "external" components so you can work out the kinks first.

These details are pretty drilled down and are needed with the described functionality. And I will not make the same mistake as with Slony-I again and develop things that require backend support as totally external components (look at the catalog corruption mess I created there and you know what I'm talking about).

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 1/25/2007 6:49 PM, Tom Lane wrote: > > Jan Wieck <JanWieck@Yahoo.com> writes: > >> To provide this data, I would like to add another "log" directory, > >> pg_tslog. The files in this directory will be similar to the clog, but > >> contain arrays of timestamptz values. > > > > Why should everybody be made to pay this overhead? > > It could be made an initdb time option. If you intend to use a product > that requires this feature, you will be willing to pay that price. That is going to cut your usage by like 80%. There must be a better way. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 1/25/2007 11:41 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> On 1/25/2007 6:49 PM, Tom Lane wrote:
>>> Jan Wieck <JanWieck@Yahoo.com> writes:
>>>> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values.
>>>
>>> Why should everybody be made to pay this overhead?
>>
>> It could be made an initdb time option. If you intend to use a product that requires this feature, you will be willing to pay that price.
>
> That is going to cut your usage by like 80%. There must be a better way.

I'd love to. But it is a datum that needs to be collected at the moment where basically the clog entry is made ... I don't think any external module can do that ever.

You know how long I've been in and out and back into replication again. The one thing that pops up again and again in all the scenarios is "what the heck was the commit order?". Now the pure commit order for a single node could certainly be recorded from a sequence, but that doesn't cover the multi-node environment I am after. That's why I want it to be a timestamp with a few fudged bits at the end. If you look at what I've described, you will notice that as long as all node priorities are unique, this timestamp will be a globally unique ID in a somewhat ascending order along a timeline. That is what replication people are looking for.

Tom fears that the overhead is significant, which I do understand and, frankly, wonder myself about (actually I don't even have a vague estimate). I really think we should make this thing an initdb option and decide later if it's on or off by default. Probably we can implement it even in a way that one can turn it on/off and a postmaster restart plus waiting the desired freeze-delay would do.

What I know for certain is that no async replication system can ever do without the commit timestamp information. Using the transaction start time or even the single statements' timeofday will only lead to inconsistencies all over the place (I haven't been absent from the mailing lists for the past couple of months hiding in my closet ... I've been experimenting and trying to get around all these issues - in my closet).

Slony-I can survive without that information because everything happens on one node and we record snapshot information for later abuse. But look at the cost at which we are dealing with this rather trivial issue. All we need to know is the serializable commit order. And we have to issue queries that eventually might exceed address space limits?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

Furthermore, what would be the ramifications of master and slave system times being out of sync?

Finally, what if system time is rolled forward a few minutes as part of a correction and there were transactions completed in that time? There is a chance, albeit small, that two transactions will have the same timestamp. More importantly, this will throw all kinds of issues in when the slave sees transactions in the future. Even with regular NTP syncs, drift can cause a clock to be rolled forward a few milliseconds, possibly resulting in duplicate transaction IDs.

In summary, I don't think the use of system time has any place in PostgreSQL's internal consistency mechanisms; it is too unreliable an environment property. Why can't a counter be used for this instead?

- Naz.

Jan Wieck wrote:
> For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread.
>
> For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction. During replication, the replication process must be able to provide the remote node's timestamp so that the replicated data will be "as of the time it was written on the remote node", and not the current local time of the replica, which is by definition of "asynchronous" later.
>
> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.
>
> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>
> The COMMIT syntax will get extended to
>
> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally generated commit timestamp. This will be used to give the replicating transaction on the replica the exact same timestamp it got on the originating master node.
>
> The pg_tslog segments will be purged like the clog segments, after all transactions belonging to them have been stamped frozen. A frozen xid by definition has a timestamp of epoch. To ensure a system using this timestamp feature has enough time to perform its work, a new GUC variable defining an interval will prevent vacuum from freezing xids that are younger than that.
>
> A function get_commit_timestamp(xid) returning timestamptz will return the commit time of a transaction as recorded by this feature.
>
> Comments, changes, additions?
>
> Jan
Hi,

Jan Wieck wrote:
> The replication system I have in mind will have another field type of the balance nature, where it will never communicate the current value but only deltas that get applied regardless of the two timestamps.

I'd favor a more generally usable conflict resolution function interface, on top of which you can implement both the "last update wins" as well as the "balance" conflict resolution type.

Passing the last common ancestor and the two conflicting heads to the conflict resolution function (CRF) should be enough. That would easily allow implementing the "balance" type (as you can calculate both deltas). And if you want to rely on something as arbitrary as a timestamp, you'd simply have to add a timestamp column to your table and let the CRF decide upon that.

This would allow pretty much any type of conflict resolution, for example: higher priority cleanup transactions, which change lots of tuples and should better not be aborted later on. Those could be implemented by adding a priority column and having the CRF respect that one, too.

To find the last common ancestor tuple, transaction ids and MVCC are enough. You wouldn't need to add timestamps. You'd only have to make sure VACUUM doesn't delete tuples you still need.

Regards

Markus
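A minimal C sketch of the conflict-resolution-function interface described above, assuming the resolver is handed the last common ancestor and the two conflicting heads. The TupleVersion struct and both resolvers are illustrative only, not an existing API:

/* A tuple version as seen by a conflict resolution function (CRF). */
typedef struct TupleVersion
{
    long long balance;      /* a column of the "balance" nature */
    long long timestamp;    /* an ordinary column a CRF may inspect */
} TupleVersion;

typedef TupleVersion (*ConflictResolutionFn)(const TupleVersion *ancestor,
                                             const TupleVersion *local,
                                             const TupleVersion *remote);

/* "balance" resolution: apply both deltas relative to the common ancestor */
static TupleVersion
resolve_balance(const TupleVersion *ancestor,
                const TupleVersion *local,
                const TupleVersion *remote)
{
    TupleVersion result = *local;

    result.balance = ancestor->balance
                   + (local->balance - ancestor->balance)
                   + (remote->balance - ancestor->balance);
    return result;
}

/* "last update wins" resolution based on a user-supplied timestamp column */
static TupleVersion
resolve_last_update_wins(const TupleVersion *ancestor,
                         const TupleVersion *local,
                         const TupleVersion *remote)
{
    (void) ancestor;        /* not needed for this policy */
    return (remote->timestamp > local->timestamp) ? *remote : *local;
}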
Jan Wieck wrote:
> But it is a datum that needs to be collected at the moment where basically the clog entry is made ... I don't think any external module can do that ever.

How atomic does it need to be? External modules can register callbacks that get called right after the clog update and removing the xid from the MyProc entry. That's about as close to making the clog entry as you can get.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Jan, et al.,

On Jan 26, 2007, at 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

As you should be concerned. Looking on my desk through the last few issues of IEEE Transactions on Parallel and Distributed Systems, I see no time synch stuff for clusters of machines that is actually based on time. Almost all rely on something like a Lamport timestamp or some relaxation thereof. A few are based off a tree-based pulse. Using actual times is fraught with problems and is typically inappropriate for cluster synchronization needs.

> Furthermore, what would be the ramifications of master and slave system times being out of sync?

I'm much more concerned with the overall approach. The algorithm for replication should be published in theoretic style with a thorough analysis of its assumptions and a proof of correctness based on those assumptions. Databases and replication therein are definitely technologies that aren't "off-the-cuff," and they require rigorous academic discussion and acceptance before they will get adopted. People generally will not adopt technologies to store mission critical data until they are confident that they will both work as designed and work as implemented -- the second is far less important, as the weaknesses there are simply bugs.

I'm not implying that this rigorous dissection of replication design hasn't happened, but I didn't see it referenced anywhere in this thread. Can you point me to it? I've reviewed many of these papers and would like to better understand what you are aiming at.

Best regards,

Theo Schlossnagle
On 1/26/2007 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

Such a counter has only "local" relevance. How do you plan to compare the two separate counters on different machines to tell which transaction happened last?

Even if the system clock isn't monotonically increasing, the described increment system guarantees that the timestamp used appears so. Granted, this system will not work too well on a platform that doesn't allow slewing the system clock.

> Furthermore, what would be the ramifications of master and slave system times being out of sync?

The origin of a transaction must scan all tuples it updates and make sure that the timestamp it uses for commit appears in the future with respect to them.

> Finally, what if system time is rolled forward a few minutes as part of a correction and there were transactions completed in that time? There is a chance, albeit small, that two transactions will have the same timestamp. More importantly, this will throw all kinds of issues in when the slave sees transactions in the future. Even with regular NTP syncs, drift can cause a clock to be rolled forward a few milliseconds, possibly resulting in duplicate transaction IDs.
>
> In summary, I don't think the use of system time has any place in PostgreSQL's internal consistency mechanisms; it is too unreliable an environment property. Why can't a counter be used for this instead?

This is not something used for PostgreSQL's own consistency. It is a vehicle intended to be used to synchronize the "last update wins" decision process of an asynchronous multimaster system. If not with a timestamp, how would you make sure that the replication processes of two different nodes will come to the same conclusion as to which update was last? Especially considering that the replication might take place hours after the original transaction happened.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.

A transaction time table/log has other uses as well, so it's fairly interesting to have this.

> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally generated commit timestamp.

I don't think it's acceptable to override the normal timestamp. That could lead to non-monotonic time values which could screw up PITR. My view is that you still need PITR even when you are using replication, because the former provides recoverability and the latter provides availability.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On 1/26/2007 8:26 AM, Simon Riggs wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.
>
> A transaction time table/log has other uses as well, so it's fairly interesting to have this.
>
>> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>>
>> The extension is limited to superusers and will override the normally generated commit timestamp.
>
> I don't think it's acceptable to override the normal timestamp. That could lead to non-monotonic time values which could screw up PITR. My view is that you still need PITR even when you are using replication, because the former provides recoverability and the latter provides availability.

Without that it is rendered useless for conflict resolution purposes. The timestamp used does not necessarily have much to do with the real time at commit, although I'd like it to be as close as possible. This timestamp marks the age of the new datum in an update. Since the replication is asynchronous, the update on the remote systems will happen later, but the timestamp recorded with that datum must be the timestamp of the original transaction, not the current time when it is replicated remotely. All we have to determine that is the xmin in the row's tuple header, so that xmin must resolve to the original transaction's timestamp.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
* Jan Wieck (JanWieck@Yahoo.com) wrote: > On 1/26/2007 2:37 AM, Naz Gassiep wrote: > >I would be *very* concerned that system time is not a guaranteed > >monotonic entity. Surely a counter or other internally managed mechanism > >would be a better solution. > > Such a counter has only "local" relevance. How do you plan to compare > the two separate counters on different machines to tell which > transaction happened last? I'd also suggest you look into Lamport timestamps... Trusting the system clock just isn't practical, even with NTP. I've developed (albeit relatively small) systems using Lamport timestamps and would be happy to talk about it offlist. I've probably got some code I could share as well. Thanks, Stephen
Stephen Frost wrote: > I'd also suggest you look into Lamport timestamps... Trusting the > system clock just isn't practical, even with NTP. I've developed > (albeit relatively small) systems using Lamport timestamps and would be > happy to talk about it offlist. I've probably got some code I could > share as well. > that looks like what Oracle RAC uses: http://www.lc.leidenuniv.nl/awcourse/oracle/rac.920/a96597/coord.htm cheers andrew
On 1/26/2007 9:38 AM, Stephen Frost wrote:
> * Jan Wieck (JanWieck@Yahoo.com) wrote:
>> On 1/26/2007 2:37 AM, Naz Gassiep wrote:
>>> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.
>>
>> Such a counter has only "local" relevance. How do you plan to compare the two separate counters on different machines to tell which transaction happened last?
>
> I'd also suggest you look into Lamport timestamps... Trusting the system clock just isn't practical, even with NTP. I've developed (albeit relatively small) systems using Lamport timestamps and would be happy to talk about it offlist. I've probably got some code I could share as well.

I think the system I described is a slightly modified Lamport generator. You can consider the maximum timestamp of any row updated in this transaction as the "counters received from other nodes". Then I make sure that the next counter (timestamp) is higher than anything I know so far, and I add a cluster-wide unique tie breaker to that.

Looking closer, I don't even have to check the timestamps of the rows updated. Since a replicated remote transaction will bump the local Lamport clock on commit, a local transaction modifying such a row will have a timestamp in the future of that remote transaction, even if my local clock is limping behind.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> I think the system I described is a slightly modified Lamport generator. You can consider the maximum timestamp of any row updated in this transaction as the "counters received from other nodes". Then I make sure that the next counter (timestamp) is higher than anything I know so far, and I add a cluster-wide unique tie breaker to that.

If you know all the timestamps in the system then you don't need timestamps at all, just use a counter that you increment by one each time.

Isn't the whole reason people use timestamps that you don't have to depend on atomically knowing every timestamp in the system? So two transactions can commit simultaneously on different systems and use the timestamps to resolve conflicts later.

--
  Gregory Stark
  EnterpriseDB   http://www.enterprisedb.com
On 1/27/2007 7:26 AM, Gregory Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> I think the system I described is a slightly modified Lamport generator. You can consider the maximum timestamp of any row updated in this transaction as the "counters received from other nodes". Then I make sure that the next counter (timestamp) is higher than anything I know so far, and I add a cluster-wide unique tie breaker to that.
>
> If you know all the timestamps in the system then you don't need timestamps at all, just use a counter that you increment by one each time.
>
> Isn't the whole reason people use timestamps that you don't have to depend on atomically knowing every timestamp in the system? So two transactions can commit simultaneously on different systems and use the timestamps to resolve conflicts later.

This assumes that you never lose contact with the cluster, or if you do, that you instantly stop all update activity because you are at risk that the counters diverge. This risk is much higher with a simple counter than with a system clock that was in sync at the time of disconnect.

With all the disadvantages and the pain factor of an asynchronous multimaster replication system comes one big advantage. You can continue autonomously and let conflict resolution figure it out later.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.

Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?

Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.

How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?

--
Jim Nasby                   jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
On 2/1/2007 11:23 PM, Jim Nasby wrote:
> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>
> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>
> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>
> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?

Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>
>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>
>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>
>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>
> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.

Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.

How is your counter semantically equivalent to Lamport timestamps?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>>
>>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>>
>>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>>
>>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>>
>> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.
>
> Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.
>
> How is your counter semantically equivalent to Lamport timestamps?

Yes, you must be missing something.

The last used timestamp is remembered. When a remote transaction is replicated, the remembered timestamp is set to max(remembered, remote). For a local transaction, the remembered timestamp is set to max(remembered+1ms, systemclock) and that value is used as the transaction commit timestamp.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
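A minimal C sketch of the clock rule described above, with all values in milliseconds: max(remembered, remote) when replicating a remote commit, max(remembered+1ms, system clock) when committing locally. The names remembered_ts, observe_remote_commit and next_local_commit_ts are illustrative only; this is a sketch of the idea, not PostgreSQL code:

#include <stdint.h>
#include <sys/time.h>

static int64_t remembered_ts = 0;   /* per-node "last used" timestamp, ms */

/* current system time in milliseconds since the epoch */
static int64_t
system_clock_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t) tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

/* Replicating a remote transaction: only advance the local clock. */
void
observe_remote_commit(int64_t remote_ts)
{
    if (remote_ts > remembered_ts)
        remembered_ts = remote_ts;
}

/* Committing a local transaction: advance past everything seen so far. */
int64_t
next_local_commit_ts(void)
{
    int64_t candidate = remembered_ts + 1;   /* at least one tick ahead */
    int64_t now = system_clock_ms();

    remembered_ts = (now > candidate) ? now : candidate;
    return remembered_ts;
}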
On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:
> On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
>> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>>>
>>>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>>>
>>>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>>>
>>>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>>>
>>> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.
>>
>> Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.
>>
>> How is your counter semantically equivalent to Lamport timestamps?
>
> Yes, you must be missing something.
>
> The last used timestamp is remembered. When a remote transaction is replicated, the remembered timestamp is set to max(remembered, remote). For a local transaction, the remembered timestamp is set to max(remembered+1ms, systemclock) and that value is used as the transaction commit timestamp.

A Lamport clock, IIRC, requires a cluster-wide tick. This seems based only on activity and is thus an observational tick only, which means various nodes can have various perspectives at different times.

Given that time skew is prevalent, why is the system clock involved at all?

As is usual with distributed systems problems, they are very hard to explain casually and also hard to review from a theoretical angle without a proof. Are you basing this off a paper? If so, which one? If not, have you written a rigorous proof of correctness for this approach?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:
> On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:
>> On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
>>> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>>>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>>>>
>>>>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>>>>
>>>>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>>>>
>>>>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>>>>
>>>> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.
>>>
>>> Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.
>>>
>>> How is your counter semantically equivalent to Lamport timestamps?
>>
>> Yes, you must be missing something.
>>
>> The last used timestamp is remembered. When a remote transaction is replicated, the remembered timestamp is set to max(remembered, remote). For a local transaction, the remembered timestamp is set to max(remembered+1ms, systemclock) and that value is used as the transaction commit timestamp.
>
> A Lamport clock, IIRC, requires a cluster-wide tick. This seems based only on activity and is thus an observational tick only, which means various nodes can have various perspectives at different times.
>
> Given that time skew is prevalent, why is the system clock involved at all?

This question was already answered.

> As is usual with distributed systems problems, they are very hard to explain casually and also hard to review from a theoretical angle without a proof. Are you basing this off a paper? If so, which one? If not, have you written a rigorous proof of correctness for this approach?

I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.

I thank everyone for their attention.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Feb 3, 2007, at 5:09 PM, Jan Wieck wrote:
> On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:
>
> I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.
>
> I thank everyone for their attention.

Actually, I believe the commit timestamp stuff would be very useful in general. I would certainly like to see rigorous proofs of any multi-master replication technology built on top of them. I believe that while your replication stuff might rely on the commit timestamps, the commit timestamps rely on nothing else (except the work that you have been churning on).

Using commit timestamps, one can easily implement cross-vendor database replication. These can be used to implement something like trigger-selective redo logs. I think they can be used to produce DML logs that will require a lot less accounting to manage replicating tables from PostgreSQL into another database (like Oracle or MySQL).

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Jan Wieck wrote:
> I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.
>
> I thank everyone for their attention.

Going and working on it on your own doesn't seem like the proper solution. I don't see people objecting to adding it, but they want it to work, which I am sure you want too. You have to show how it will work and convince others of that, and then you have a higher chance it will work and be in the PostgreSQL codebase.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On 2/3/2007 5:20 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.
>>
>> I thank everyone for their attention.
>
> Going and working on it on your own doesn't seem like the proper solution. I don't see people objecting to adding it, but they want it to work, which I am sure you want too. You have to show how it will work and convince others of that, and then you have a higher chance it will work and be in the PostgreSQL codebase.

Bruce, I think I have explained in sufficient detail how this Lamport timestamp will be unique and ever increasing, with the node's ID being used as a tie breaker.

The only thing important for "last update wins" conflict resolution is that whatever timestamp you have associated with a row, the update you do to it must be associated with a later timestamp so that all other nodes will overwrite the data. If a third node gets the two updates out of order, it will do the second node's update, and since the row it has then has a later timestamp than the first update arriving late, it will throw away that information. All nodes in sync again.

This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp. A sketch of the decision rule follows below.

The problem seems to me that people want a full-scale proof of concept for the whole multimaster replication system I'm planning instead of thinking in isolation about this one aspect, the intended use case and other possible uses for it (like table logging). And we all know that that discussion will take us way beyond the 8.3 feature freeze date, so the whole thing will never get done.

I don't want to work on this on my own and I sure would prefer it to be a default PostgreSQL feature. As said, I have learned some things from Slony-I. One of them is that I will not go through any more ugly workarounds in order to not require a patched backend. If the features I really need aren't going to be in the default codebase, people will have to install from patched sources.

Finally, again, Slony-I could have well used this feature. With a logical commit timestamp, I would have never even thought about that other wart called xxid. It would have all been sooo much easier.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
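A minimal C sketch of the "last update wins" decision described above, assuming each row carries the commit timestamp and origin-node priority of the transaction that last wrote it. The RowStamp struct and function name are illustrative only; with unique per-node priorities folded into the timestamp, equal timestamps should never actually occur:

#include <stdbool.h>
#include <stdint.h>

typedef struct RowStamp
{
    int64_t commit_ts;      /* commit timestamp of the writing transaction */
    int     origin_node;    /* node priority, used as tie breaker */
} RowStamp;

/* Should an incoming replicated update overwrite the row we already have? */
static bool
incoming_update_wins(const RowStamp *current, const RowStamp *incoming)
{
    if (incoming->commit_ts != current->commit_ts)
        return incoming->commit_ts > current->commit_ts;

    /* identical timestamps cannot happen if priorities are unique,
     * but compare the node id anyway for safety */
    return incoming->origin_node > current->origin_node;
}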
Jan Wieck wrote:
> This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp.

Earlier you said that "one assumption is that all servers in the multimaster cluster are ntp synchronized", which already rang the alarm bells in me. Now that I read this, you appear to require synchronization not on the microsecond level but on some level. I think that would be pretty hard to manage for an administrator, seeing that NTP typically cannot provide such guarantees.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
> Jan Wieck wrote:
>> This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp.
>
> Earlier you said that "one assumption is that all servers in the multimaster cluster are ntp synchronized", which already rang the alarm bells in me. Now that I read this, you appear to require synchronization not on the microsecond level but on some level. I think that would be pretty hard to manage for an administrator, seeing that NTP typically cannot provide such guarantees.

Synchronization to some degree is wanted to avoid totally unexpected behavior. The conflict resolution algorithm itself can live perfectly fine with counters, but I guess you wouldn't want the result of it. Say you update a record on one node, then 10 minutes later you update the same record on another node. Unfortunately, the nodes had no communication, and because the first node is much busier, its counter is way advanced ... this would mean the update made 10 minutes later would get lost in the conflict resolution when the nodes reestablish communication. They would have the same data at the end, just not what any sane person would expect.

This behavior will kick in whenever the cross-node conflicting updates happen close enough together that the time difference between the clocks can affect it. So if you update the logically same row on two nodes within a tenth of a second, and the clocks are more than that apart, the conflict resolution can result in the older row surviving. Clock synchronization is simply used to minimize this.

The system clock is used only to keep the counters somewhat synchronized in the case of connection loss, to retain some degree of "last update" meaning. Without that, continuing autonomously during a network outage is just not practical.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Feb 4, 2007, at 10:06 AM, Jan Wieck wrote:
> On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
>> Jan Wieck wrote:
>>> This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp.
>>
>> Earlier you said that "one assumption is that all servers in the multimaster cluster are ntp synchronized", which already rang the alarm bells in me. Now that I read this, you appear to require synchronization not on the microsecond level but on some level. I think that would be pretty hard to manage for an administrator, seeing that NTP typically cannot provide such guarantees.
>
> Synchronization to some degree is wanted to avoid totally unexpected behavior. The conflict resolution algorithm itself can live perfectly fine with counters, but I guess you wouldn't want the result of it. Say you update a record on one node, then 10 minutes later you update the same record on another node. Unfortunately, the nodes had no communication, and because the first node is much busier, its counter is way advanced ... this would mean the update made 10 minutes later would get lost in the conflict resolution when the nodes reestablish communication. They would have the same data at the end, just not what any sane person would expect.
>
> This behavior will kick in whenever the cross-node conflicting updates happen close enough together that the time difference between the clocks can affect it. So if you update the logically same row on two nodes within a tenth of a second, and the clocks are more than that apart, the conflict resolution can result in the older row surviving. Clock synchronization is simply used to minimize this.
>
> The system clock is used only to keep the counters somewhat synchronized in the case of connection loss, to retain some degree of "last update" meaning. Without that, continuing autonomously during a network outage is just not practical.

A Lamport clock addresses this. It relies on a cluster-wide clock tick. While it could be based on the system clock, it would not be based on more than one clock. The point of the Lamport clock is that there is _a_ clock, not multiple ones.

One concept is to have a universal clock that ticks forward (like every second) and each node orders all their transactions inside the second-granular tick. Then each commit would be like {node, clocksecond, txn#}, and each time the clock ticks forward, txn# is reset to zero. This gives you ordered txns windowed in some cluster-wide acceptable window (1 second). However, this is totally broken as NTP is entirely insufficient for this purpose because of a variety of forms of clock skew. As such, the timestamp should be incremented via cluster consensus (a token ring, or the pulse generated by the leader of the current cluster membership quorum).

As the clock must be incremented cluster-wide, the need for it to be in sync with the system clock (on any or all of the systems) is obviated. In fact, the fact that you can't guarantee synchronicity means that it can be confusing -- one expects a time-based clock to be accurate to the time. A counter-based clock has no such expectations.
// Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
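The {node, clocksecond, txn#} ordering Theo describes might look roughly like the following Python sketch. The class and method names are invented, and the cluster-wide tick is assumed to arrive from some consensus mechanism such as the quorum leader's pulse; this illustrates the idea only and is not code from any actual system.

# Sketch of per-node commit stamping against a cluster-wide tick.
class NodeClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clocksecond = 0   # last cluster-wide tick seen
        self.txn_no = 0        # resets whenever the tick advances

    def on_cluster_tick(self, tick):
        # delivered cluster-wide, e.g. by the quorum leader
        if tick > self.clocksecond:
            self.clocksecond = tick
            self.txn_no = 0

    def stamp_commit(self):
        self.txn_no += 1
        return (self.clocksecond, self.txn_no, self.node_id)

a, b = NodeClock("A"), NodeClock("B")
a.on_cluster_tick(100); b.on_cluster_tick(100)
commits = [a.stamp_commit(), b.stamp_commit(), a.stamp_commit()]
print(sorted(commits))   # cluster-wide order, windowed to the shared tick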
"Theo Schlossnagle" <jesus@omniti.com> writes: > As the clock must be incremented clusterwide, the need for it to be insync with > the system clock (on any or all of the systems) is obviated. In fact, as you > can't guarantee the synchronicity means that it can be confusing -- one > expects a time-based clock to be accurate to the time. A counter-based clock > has no such expectations. So if the nodes get split they can keep operating independently but clients can see that there's no guarantee of ordering against transactions from other nodes because the clock isn't advancing? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 2/4/2007 10:53 AM, Theo Schlossnagle wrote: > As the clock must be incremented clusterwide, the need for it to be > insync with the system clock (on any or all of the systems) is > obviated. In fact, as you can't guarantee the synchronicity means > that it can be confusing -- one expects a time-based clock to be > accurate to the time. A counter-based clock has no such expectations. For the fourth time, the clock is in the mix to allow nodes to continue during a network outage. All your arguments seem to assume 100% network uptime. There will be no clusterwide clock or clusterwide increment when you lose connection. How does your idea cope with that? Obviously the counters will immediately drift apart based on the transaction load of the nodes as soon as the network goes down. And in order to avoid this "clock" confusion and wrong expectation, you'd rather have a system with such a simple, non-clock based counter and accept that it starts behaving totally wonky when the cluster reconnects after a network outage? I'd rather confuse a few people than have a last-update-wins conflict resolution that basically rolls dice to determine "last". Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote: > On 2/4/2007 10:53 AM, Theo Schlossnagle wrote: >> As the clock must be incremented clusterwide, the need for it to >> be insync with the system clock (on any or all of the systems) >> is obviated. In fact, as you can't guarantee the synchronicity >> means that it can be confusing -- one expects a time-based clock >> to be accurate to the time. A counter-based clock has no such >> expectations. > > For the fourth time, the clock is in the mix to allow to continue > during a network outage. All your arguments seem to assume 100% > network uptime. There will be no clusterwide clock or clusterwide > increment when you lose connection. How does your idea cope with that? That's exactly what a quorum algorithm is for. > Obviously the counters will immediately drift apart based on the > transaction load of the nodes as soon as the network goes down. And > in order to avoid this "clock" confusion and wrong expectation, > you'd rather have a system with such a simple, non-clock based > counter and accept that it starts behaving totally wonky when the > cluster reconnects after a network outage? I rather confuse a few > people than having a last update wins conflict resolution that > basically rolls dice to determine "last". If your cluster partitions and you have hours of independent action, and upon merge you apply a conflict resolution algorithm that undoes large portions of the last several hours of work on the nodes, you wouldn't call that "wonky"? For sane disconnected (or more generally, partitioned) operation in multi-master environments, a quorum for the dataset must be established. Now, one can consider the "database" to be the dataset. So, on network partitions those in "the" quorum are allowed to progress with data modification and others only read. However, there is no reason why the dataset _must_ be the database, or why multiple datasets _must_ share the same quorum algorithm. You could easily classify certain tables, schemas, or partitions into a specific dataset and apply a suitable quorum algorithm to that and a different quorum algorithm to other disjoint data sets. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
> One concept is to have a univeral clock that ticks forward (like > every second) and each node orders all their transactions inside the > second-granular tick. Then each commit would be like: {node, > clocksecond, txn#} and each time the clock ticks forward, txn# is > reset to zero. This gives you ordered txns that windowed in some > cluster-wide acceptable window (1 second). However, this is totally > broken as NTP is entirely insufficient for this purpose because of a > variety of forms of clock skew. As such, the timestamp should be > incremented via cluster consensus (one token ring or the pulse > generated by the leader of the current cluster membership quorom). I think you are completely ignoring practicability. Or are you saying that such a system exists and works for, e.g., a loosely connected group of laptop field agents that only sporadically have a connection to the cluster? I think Jan's definition gives a pragmatic solution to the problem, and will be able to give "good" automatic conflict resolution. It has the downsides he stated, and cannot guarantee 100% correct automatic conflict resolution in case of connection loss, but I am quite sure you are not able to do better without losing yourself in theory. E.g., assume all clocks vary by no more than 30 seconds when disconnected; you can then require manual (or rule-based) resolution for all conflicts whose timestamps differ by less than 1 minute. Andreas
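A rough rendering of Andreas's rule in Python: the 30-second drift bound and 1-minute margin are his example numbers, and everything else (function and variable names, the sample timestamps) is invented for illustration. Conflicts whose commit times lie closer together than the margin are escalated rather than decided by the clock.

from datetime import datetime, timedelta

SKEW_MARGIN = timedelta(minutes=1)   # assumed: clocks drift < 30 s while disconnected

def resolve(local_commit, remote_commit):
    gap = abs(local_commit - remote_commit)
    if gap < SKEW_MARGIN:
        return "manual-or-rule-based"   # too close to trust the clocks
    return "remote-wins" if remote_commit > local_commit else "local-wins"

print(resolve(datetime(2007, 2, 4, 13, 50, 0), datetime(2007, 2, 4, 14, 0, 0)))   # remote-wins
print(resolve(datetime(2007, 2, 4, 13, 50, 0), datetime(2007, 2, 4, 13, 50, 20))) # manual-or-rule-based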
On Sun, Feb 04, 2007 at 01:36:03PM -0500, Jan Wieck wrote: > For the fourth time, the clock is in the mix to allow to continue during > a network outage. All your arguments seem to assume 100% network uptime. > There will be no clusterwide clock or clusterwide increment when you > lose connection. How does your idea cope with that? I'm wondering whether a combined approach is needed. This makes things more complicated, but what if you somehow co-ordinate local counters with shared clock ticks? When you get a failure on your talk to the shared clock, you regard yourself as in some sort of failure (you're going to need softfails and that sort of thing, and yes, I'm flapping my hands in the air at the moment). At rejoin to the cluster, you need some sort of way to publish "here's the counter and the last global time I had" and "here's my current counter". You can publish local time with this too, I guess, to solve for conflict cases, but that seems like the sort of decision that needs to be pushed down to policy level. A -- Andrew Sullivan | ajs@crankycanuck.ca The whole tendency of modern prose is away from concreteness. --George Orwell
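Purely to illustrate the rejoin handshake Andrew hand-waves at, a node might publish something like the following on reconnect. This is a Python sketch with every field name invented; it reflects no actual protocol or implementation.

from dataclasses import dataclass

@dataclass
class RejoinReport:
    node_id: str
    last_shared_tick: int        # "the last global time I had"
    counter_at_disconnect: int   # "here's the counter" at that point
    current_counter: int         # "here's my current counter"
    local_time_now: str          # optional, left to policy for conflict cases

print(RejoinReport("node-b", last_shared_tick=4711,
                   counter_at_disconnect=120, current_counter=187,
                   local_time_now="2007-02-04T14:00:05"))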
Hi, Theo Schlossnagle wrote: > On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote: >> Obviously the counters will immediately drift apart based on the >> transaction load of the nodes as soon as the network goes down. And in >> order to avoid this "clock" confusion and wrong expectation, you'd >> rather have a system with such a simple, non-clock based counter and >> accept that it starts behaving totally wonky when the cluster >> reconnects after a network outage? I rather confuse a few people than >> having a last update wins conflict resolution that basically rolls >> dice to determine "last". > > If your cluster partition and you have hours of independent action and > upon merge you apply a conflict resolution algorithm that has enormous > effect undoing portions of the last several hours of work on the nodes, > you wouldn't call that "wonky?" You are talking about different things. Async replication, as Jan is planning to do, is per se "wonky", because you have to cope with conflicts by definition. And you have to resolve them by late-aborting a transaction (i.e. after a commit). Or put it another way: async MM replication means continuing in disconnected mode (w/o quorum or some such) and trying to reconciliate later on. It should not matter if the delay is just some milliseconds of network latency or three days (except of course that you probably have more data to reconciliate). > For sane disconnected (or more generally, partitioned) operation in > multi-master environments, a quorum for the dataset must be > established. Now, one can consider the "database" to be the dataset. > So, on network partitions those in "the" quorum are allowed to progress > with data modification and others only read. You can do this to *prevent* conflicts, but that clearly belongs to the world of sync replication. I'm doing this in Postgres-R: in case of network partitioning, only a primary partition may continue to process writing transactions. For async replication, it does not make sense to prevent conflicts when disconnected. Async is meant to cope with conflicts. So as to be independent of network latency. > However, there is no > reason why the dataset _must_ be the database and that multiple datasets > _must_ share the same quorum algorithm. You could easily classify > certain tables or schema or partitions into a specific dataset and apply > a suitable quorum algorithm to that and a different quorum algorithm to > other disjoint data sets. I call that partitioning (among nodes). And it's applicable to sync as well as async replication, while it makes more sense in sync replication. What I'm more concerned about, with Jan's proposal, is the assumption that you always want to resolve conflicts by time (except for balances, for which we don't have much information, yet). I'd rather say that time does not matter much if your nodes are disconnected. And (especially in async replication) you should prevent your clients from committing to one node and then reading from another, expecting to find your data there. So why resolve by time? It only makes the user think you could guarantee that order, but you certainly cannot. Regards Markus
> What I'm more concerned about, with Jan's proposal, is the assumption > that you always want to resolve conflicts by time (except for > balances, > for which we don't have much information, yet). I'd rather Um, I think the proposal was only about backend functionality that benefits replication in general, plus time-based conflict resolution. And "time based" is surely one of the important conflict resolution methods for async MM replication. Sure, there are others, like "rule based" or "priority based", but I think you don't need additional backend functionality for those. Andreas
Hi, Zeugswetter Andreas ADI SD wrote: > And "time based" > is surely one of the important conflict resolution methods for async MM > replication. That's what I'm questioning. Wouldn't any other deterministic, but seemingly random abort decision be as clever as time based conflict resolution? It would then be clear to the user that it's random and not some "in most cases time based, but not in others and only if..." thing. > Sure there are others, like "rule based" "priority based" but I think > you don't need additional backend functionality for those. Got the point, yes. I'm impatient, sorry. Nevertheless, I'm questioning whether it is worth adding backend functionality for that. And given this probably is the most wanted resolution method, this question might be "heretical". You could also see it as sort of a user education question: don't favor time-based resolution if that's the one resolution method with the most traps. Regards Markus
On 2/6/2007 11:44 AM, Markus Schiltknecht wrote: > Hi, > > Zeugswetter Andreas ADI SD wrote: >> And "time based" >> is surely one of the important conflict resolution methods for async MM >> replication. > > That's what I'm questioning. Wouldn't any other deterministic, but > seemingly random abort decision be as clever as time based conflict > resolution? It would then be clear to the user that it's random and not > some "in most cases time based, but no in others and only if..." thing. > >> Sure there are others, like "rule based" "priority based" but I think >> you don't need additional backend functionality for those. > > Got the point, yes. I'm impatient, sorry. > > Neither the less, I'm questioning if is it worth adding backend > functionality for that. And given this probably is the most wanted > resolution method, this question might be "heretical". You could also > see it as sort of an user educating question: don't favor time based > resolution if that's the one resolution method with the most traps. These are all very good suggestions for additional conflict resolution mechanisms that each solve one problem or another. As we have said for years now, one size will not fit all. What I am after for the moment is a system that by default supports last update wins on the row level, where "last update" certainly is a little fuzzy, but not by minutes. Plus balance type columns. A balance column is not propagated as a new value, but as a delta between the old and the new value. All replicas will apply the delta to that column regardless of whether the replication info is newer or older than the existing row. That way, literal value type columns (like an address) will maintain cluster-wide the value of the last update to the row, while balance type columns will maintain cluster-wide the sum of all changes. Whatever strategy one uses, in an async multimaster there are always cases that can be resolved by rules (last update being one of them), and some that I can't even imagine solving so far. I guess some of the cases will simply boil down to "the application has to make sure that ... never occurs". Think of a multi-item order, created on one node, while another node is deleting the long unused item (which would have to be backordered). Now while those two nodes figure out what to do to make this consistent again, a third node does a partial shipment of that order. The solution is simple: reinsert the deleted item ... only that there were rather nasty ON DELETE CASCADE's on that item that removed all the consumer reviews, product descriptions, data sheets and what not. It's going to be an awful lot of undo. I haven't really made up my mind about a user defined rule based conflict resolution interface yet. I do plan to have a unique and foreign key constraint based, synchronous advisory locking system on top of my system in a later version (advisory key locks would stay in place until the transaction that placed them replicates). I guess you see by now why I wanted to keep the discussion about the individual, rather generic support features in the backend separate from the particular features I plan to implement in the replication system. Everyone has different needs and consequently an async multi-master "must" do a whole range of mutually exclusive things altogether ... because Postgres can never accept a partial solution. We want the egg laying milk-wool-pig or nothing.
Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
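The row merge Jan describes (last update wins for literal columns, unconditional delta application for balance columns) could look roughly like this Python sketch; the column names, the change format, and the timestamps are all invented for illustration and are not taken from the planned system.

# Literal columns follow last-update-wins; balance columns are replicated
# as deltas and always applied, so every node converges on the sum of all changes.
BALANCE_COLUMNS = {"balance"}

def apply_replicated_update(local_row, local_commit_ts, change):
    # change = {"commit_ts": ..., "literal": {...}, "deltas": {...}}
    for col, delta in change["deltas"].items():
        local_row[col] += delta                    # applied regardless of ordering
    if change["commit_ts"] > local_commit_ts:      # literal values only win if newer
        local_row.update(change["literal"])
        local_commit_ts = change["commit_ts"]
    return local_row, local_commit_ts

row = {"address": "Old Street 1", "balance": 100.0}
row, ts = apply_replicated_update(
    row, "2007-02-06 12:00:00",
    {"commit_ts": "2007-02-06 11:59:00",           # older update from a peer
     "literal": {"address": "New Street 2"},
     "deltas": {"balance": -25.0}})
print(row)   # address keeps the newer local value, balance applies the delta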
Something worth noting... in the only places I've actually seen MM replication implemented, each master was in fact still responsible for its own set of data. It was essentially something that you could really do with Slony, if you could tolerate the extreme complexity that would be involved. It might well be worth focusing on that case first, before trying to come up with a perfect last-committed mechanism. On Feb 5, 2007, at 5:20 AM, Zeugswetter Andreas ADI SD wrote: > I think you are completely ignoring practicability. Or are you saying, > that such a system exists and works for e.g. a loosly connected > group of > laptop field agents that only sporadically have a connection to the > cluster. > > I think Jan's definition gives a pragmatic solution to the problem, > and will be able to give "good" automatic conflict resolution. > > It has downsides he stated, and cannot guarantee 100% correct > automatic > conflict > resolution in case of connection loss, but I am quite sure you are not > able to do > better, without loosing yourself in theory. > > e.g. assume all clocks vary by no more than 30 seconds when > disconnected, you can > require manual (or rule based) resolution to all conflicts that > vary by > less than > 1 minute. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Hi, Jan Wieck wrote: > Whatever strategy one will use, in an async multimaster there are always > cases that can be resolved by rules (last update being one of them), and > some that I can't even imagine solving so far. I guess some of the cases > will simply boil down to "the application has to make sure that ... > never occurs". Think of a multi-item order, created on one node, while > another node is deleting the long unused item (which would have to be > backordered). Now while those two nodes figure out what to do to make > this consistent again, a third node does a partial shipment of that > order. It helps to categorize these conflict types. There basically are: * data conflicts: simple row data, i.e. update - update conflicts. * uniqueness conflicts: two rows conflict because they'd violate a uniqueness constraint, i.e. insert - insert, update - insert, or update - update. * visibility conflicts: basically the remaining update - delete and delete - delete cases. But also SELECT FOR UPDATE candidates, etc... Everything having to do with rows not yet or no longer being visible to a transaction. Your example certainly involves a visibility conflict (update - delete). Not even (sync) Postgres-R can guarantee consistency on the visibility level, i.e. a first transaction's SELECT FOR UPDATE might not see some just recently committed transactions' newly inserted rows (because that one isn't replayed yet on the node, thus the transaction is working on an 'old' snapshot of the database state). Another simpler example: Postgres-R doesn't raise a serialization error on delete-delete conflicts, it simply deletes the row once, even though two committed transactions were each confirmed to have deleted the row. Luckily, most applications don't need that anyway, though. > The solution is simple, reinsert the deleted item ... ...at which point timestamps certainly won't help :-) Sorry, couldn't resist... > only that > there were rather nasty ON DELETE CASCADE's on that item that removed > all the consumer reviews, product descriptions, data sheets and what > not. It's going to be an awful lot of undo. Huh? Are you planning on aborting *parts* of a transaction? I didn't think about that, but my gut feeling is that you don't want to do that. > I haven't really made up my mind about a user defined rule based > conflict resolution interface yet. I do plan to have a unique and > foreign key constraint based, synchronous advisory locking system on top > of my system in a later version (advisory key locks would stay in place > until the transaction, that placed them, replicates). You'd have to elaborate on that... > I guess you see by now why I wanted to keep the discussion about the > individual, rather generic support features in the backend separate from > the particular features I plan to implement in the replication system. Sure. I know, discussions about replication can get endless, probably even are so by definition ;-) But hey, they're fun!
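Markus's taxonomy, rendered as a small Python lookup purely for illustration: the three categories are his, while the code and the operation names are invented and come from no implementation.

# Map pairs of conflicting operations (sorted alphabetically) to the three
# categories: data, uniqueness, visibility.
CONFLICT_CATEGORY = {
    ("update", "update"): "data (or uniqueness, if a unique key changes)",
    ("insert", "insert"): "uniqueness",
    ("insert", "update"): "uniqueness",
    ("delete", "update"): "visibility",
    ("delete", "delete"): "visibility",
    ("delete", "select_for_update"): "visibility",
}

def classify(op_a, op_b):
    return CONFLICT_CATEGORY.get(tuple(sorted((op_a, op_b))), "unknown")

print(classify("update", "delete"))   # visibility -- Jan's order/item example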
On Saturday 03 February 2007, Bruce Momjian wrote: > Jan Wieck wrote: > > I don't have any such paper and the proof of concept will be the > > implementation of the system. I do however see enough resistance against > > this proposal to withdraw the commit timestamp at this time. The new > > replication system will therefore require the installation of a patched, > > non-standard PostgreSQL version, compiled from sources cluster wide in > > order to be used. I am aware that this will dramatically reduce it's > > popularity but it is impossible to develop this essential feature as an > > external module. > > > > I thank everyone for their attention. > > Going and working on it on your own doesn't seem like the proper > solution. I don't see people objecting to adding it, but they want it > work, which I am sure you want too. You have to show how it will work > and convince others of that, and then you have a higher chance it will > work, and be in the PostgreSQL codebase. Hi, Would it be possible to solve the problem using the GORDA on-commit hook? Jan would be able to reliably obtain a commit timestamp with the desired semantics and store it in a regular table within transaction boundaries. PostgreSQL would not have to commit to specific timestamp semantics, and the patch is quite small. Regards, -- Jose Orlando Pereira
On 2/7/2007 2:37 AM, Markus Schiltknecht wrote: > Hi, > > Jan Wieck wrote: >> Whatever strategy one will use, in an async multimaster there are always >> cases that can be resolved by rules (last update being one of them), and >> some that I can't even imagine solving so far. I guess some of the cases >> will simply boil down to "the application has to make sure that ... >> never occurs". Think of a multi-item order, created on one node, while >> another node is deleting the long unused item (which would have to be >> backordered). Now while those two nodes figure out what to do to make >> this consistent again, a third node does a partial shipment of that >> order. > > It helps to categorize these conflict types. There basically are: Are we still discussing if the Postgres backend may provide support for a commit timestamp, that follows the rules for Lamport timestamps in a multi-node cluster? It seems more like we are drifting into what type of replication system I should design to please most people. Jan > > * data conflicts: simple row data, i.e. update - update conflicts. > > * uniqueness conflicts: two rows conflict because they'd violate a > uniquenes constraint, i.e. insert - insert, update - insert or update - > update. > > * visibility conflicts: basically the remaining update - delete and > delete - delete cases. But also SELECT FOR UPDATE candidates, etc... > Everything having to do with a rows not yet or no longer being visible > to a transaction. > > Your example certainly involves a visibility conflict (update - delete). > Not even (sync) Postgres-R can guarantee consistency on the visibility > level, i.e. a first transaction's SELECT FOR UPDATE might not see some > just recently committed transactions newly inserted rows (because that > one isn't replayed yet on the node, thus the transaction is working on > an 'old' snapshot of the database state). Another simpler example: > Postgres-R doesn't raise a serialization error on delete-delete > conflicts, it simply deletes the row once, even if two transactions > confirmed to have committed a transaction which deleted a row. > > Luckily, most applications don't need that anyway, though. > >> The solution is simple, reinsert the deleted item ... > > ..at which point timestamps certainly won't help :-) Sorry, couldn't > resist... > >> only that >> there were rather nasty ON DELETE CASCADE's on that item that removed >> all the consumer reviews, product descriptions, data sheets and what >> not. It's going to be an awful lot of undo. > > Huh? Are you planning on aborting *parts* of a transaction? I didn't > think about that, but my gut feeling is that you don't want to do that. > >> I haven't really made up my mind about a user defined rule based >> conflict resolution interface yet. I do plan to have a unique and >> foreign key constraint based, synchronous advisory locking system on top >> of my system in a later version (advisory key locks would stay in place >> until the transaction, that placed them, replicates). > > You'd have to elaborate on that... > >> I guess you see by now why I wanted to keep the discussion about the >> individual, rather generic support features in the backend separate from >> the particular features I plan to implement in the replication system. > > Sure. I know, discussions about replication can get endless, probably > even are so by definition ;-) But hey, they're fun! 
> >> Everyone has different needs and consequently an async multi-master >> "must" do a whole range of mutually exclusive things altogether ... >> because Postgres can never accept a partial solution. We want the egg >> laying milk-wool-pig or nothing. > > Like the one which would result from a merge of such an async > replication with a sync one? Imagine being able to choose between sync > and async per transaction... > > Regards > > Markus -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Hi, Jan Wieck wrote: > Are we still discussing if the Postgres backend may provide support for > a commit timestamp, that follows the rules for Lamport timestamps in a > multi-node cluster? No. And I think you know my opinion about that by now. ;-) > It seems more like we are drifting into what type of > replication system I should design to please most people. Nobody is telling you what you should do. You're free to do whatever you want to. I'm only trying to get a discussion going, because a) I'm interested in how you plan to solve these problems and b) in the past, most people were complaining that all the different replication efforts didn't try to work together. I'm slowly trying to open up and discuss what I'm doing with Postgres-R on the lists. Just yesterday at the SFPUG meeting, I've experienced how confusing it is for the users to have such a broad variety of (existing and upcoming) replication solutions. And I'm all for working together and probably even for merging different replication solutions. Regards Markus
> Jan Wieck wrote: > > Are we still discussing if the Postgres backend may provide support for > > a commit timestamp, that follows the rules for Lamport timestamps in a > > multi-node cluster? ...I thought you said in this thread that you haven't and weren't going to work on any kind of logical proof of its correctness, saw no value in prototyping your way to a clear (convincing) argument, and were withdrawing the proposal due to all the issues others raised which were, in light of this, unanswerable beyond conjecture. I thought that the thread was continuing because other people saw value in the kernel of the idea, would support it if it could be shown to be correct/useful, were disappointed you'd leave it at that and wanted to continue to see if something positive might come of the dialogue. So, the thread weaved around a bit. I think that if you want to nail this down, people here are willing to be convinced, but that hasn't happened yet. On Wed, 7 Feb 2007, Markus Schiltknecht wrote: > I'm only trying to get a discussion going, because a) I'm interested in > how you plan to solve these problems and b) in the past, most people > were complaining that all the different replication efforts didn't try > to work together. I'm slowly trying to open up and discuss what I'm > doing with Postgres-R on the lists. > > Just yesterday at the SFPUG meeting, I've experienced how confusing it > is for the users to have such a broad variety of (existing and upcoming) > replication solutions. And I'm all for working together and probably > even for merging different replication solutions. In support of that idea, I offer this: When Randy Eash wrote the world's first replication system for Ingres circa 1990, his work included ideas and features that are right now in the Postgres world fragmented among several existing replication / replication-related products, along with some things that are only now in discussion in this group. As discussed at the SFPUG meeting last night, real-world use cases are seldom if ever completely satisfied with a one-size-fits-all replication strategy. For example, a manufacturing company might want all factories to be capable of being autonomous but both report activities and take direction from corporate headquarters. To do this without having multiple databases at each site, a single database instance would likely be both a master and slave, but for differing aspects of the business's needs. Business decisions would resolve the conflicts (say, the manufacturing node always wins when it comes to data that pertains to its work), rather than something like a time-stamp based, last-timestamp/serialized-update-wins rule. Like Markus, I would like to see the various replication efforts merged as best they can be because even if the majority of users don't use a little bit of everything, surely the more interesting cases would like to and the entire community is better served if the various "solutions" are in harmony. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 2/7/2007 12:54 PM, Markus Schiltknecht wrote: > Hi, > > Jan Wieck wrote: >> Are we still discussing if the Postgres backend may provide support for >> a commit timestamp, that follows the rules for Lamport timestamps in a >> multi-node cluster? > > No. And I think you know my opinion about that by now. ;-) Then let me give you a little puzzle just for the fun of it. A database containing customer contact information (among other things) is a two-node multimaster system. One is serving the customer web portal, the other is used by the company staff including the call center. At 13:45 the two servers lose connectivity to each other, yet the internal staff can access the internal server while the web portal is accessible from the outside. At 13:50 customer A updates their credit card information through the web portal, while customer B does the same through the call center. At 13:55 both customers change their mind to use yet another credit card; now customer A phones the call center while customer B does it via the internet. At 14:00 the two servers reconnect and go through the conflict resolution. How do you intend to solve both conflicts without using any "clock", since that seems to be a stopword causing instant rejection of whatever you propose? Needless to say, both customers will be dissatisfied if you charge the "wrong" credit card during your next billing cycle. > >> It seems more like we are drifting into what type of >> replication system I should design to please most people. > > Nobody is telling you what you should do. You're free to do whatever you > want to. > > I'm only trying to get a discussion going, because a) I'm interested in > how you plan to solve these problems and b) in the past, most people > were complaining that all the different replication efforts didn't try > to work together. I'm slowly trying to open up and discuss what I'm > doing with Postgres-R on the lists. Which is a good discussion, because one of the reasons why I stopped looking into Postgres-R is the fact that it is based on the idea of pushing all the replication information through a system that generates a global serialized message queue. That by itself isn't the problem, but the fact that implementing a global serialized message queue has serious throughput issues that are (among other details) linked to the speed of light. I am trying to start with a system that doesn't rely on such a mechanism for everything. I do intend to add an option later that allows declaring a UNIQUE NOT NULL constraint to be synchronous. What that means is that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE will require the node to currently be a member of the (quorum or priority defined) majority of the cluster. An advisory lock system, based on a total order group communication, will grant the lock to the unique key values on a first-come, first-served basis. Every node in the cluster will keep those keys as "locked" until the asynchronous replication stream reports the locking transaction as ended. If another remote transaction in the meantime requires updating such a key, the incoming stream from that node will be on hold until the lock is cleared. This is to protect against node B replicating a transaction from node A and a later update on node B arriving on C before C got the first event from A. A node that got disconnected from the cluster must rebuild the current advisory lock list upon reconnecting to the cluster.
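The advisory key-lock bookkeeping just described might look roughly like this Python sketch. The class and method names are invented, and lock requests are assumed to arrive in the total order delivered by the group communication layer; this illustrates the idea only and is not code from the planned system.

from collections import defaultdict, deque

class AdvisoryKeyLocks:
    def __init__(self):
        self.holder = {}                      # key -> (node, xid) holding the lock
        self.held_back = defaultdict(deque)   # key -> queued remote changes

    def deliver_lock_request(self, key, node, xid):
        # called in total order; first come, first served
        if key not in self.holder:
            self.holder[key] = (node, xid)
            return True
        return False

    def incoming_change(self, key, change):
        # replication stream: hold back changes touching a locked key
        if key in self.holder:
            self.held_back[key].append(change)
            return []
        return [change]

    def transaction_ended(self, node, xid):
        # the async replication stream reported the locking transaction as ended
        released = [k for k, v in self.holder.items() if v == (node, xid)]
        ready = []
        for k in released:
            del self.holder[k]
            ready.extend(self.held_back.pop(k, []))
        return ready    # changes that may now be applied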
I think that this will be a way to overcome Postgres-R's communication bottleneck, as well as allowing limited update activity even during a completely disconnected state of a node. Synchronous or group communication messages are reduced to the cases, where the application cannot be implemented in a conflict free way, like allocating a natural primary key. There is absolutely no need to synchronize for example creating a sales order. An application can use global unique ID's for the order number. And everything possibly referenced by an order (items, customers, ...) is stored in a way that the references are never updated. Deletes to those possibly referenced objects are implemented in a two step process, where they are first marked obsolete, and later on things that have been marked obsolete for X long are deleted. A REPLICA TRIGGER on inserting an order will simply reset the obsolete flag of referenced objects. If a node is disconnected longer than X, you have a problem - hunt down the guy who defined X. > Just yesterday at the SFPUG meeting, I've experienced how confusing it > is for the users to have such a broad variety of (existing and upcoming) > replication solutions. And I'm all for working together and probably > even for merging different replication solutions. Merging certain ideas to come up with an async/sync hybrid? Seems to me we have similar enough ideas to need conflict resolution, because we had them simultaneously but communicate them asynchronously. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
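The two-step delete and REPLICA TRIGGER pattern Jan describes above might be modeled like this Python sketch; the table layout, names and the value of X are all invented for illustration. Items are only ever marked obsolete, replaying a replicated order insert clears the mark again, and the purge removes only items that stayed obsolete longer than X.

from datetime import datetime, timedelta

X = timedelta(days=7)
items = {"widget-42": {"obsolete_since": datetime(2007, 2, 1)}}

def replay_order_insert(order):
    # roughly what the REPLICA TRIGGER on order inserts would do
    for item_id in order["items"]:
        items[item_id]["obsolete_since"] = None

def purge(now):
    # delete only what has been marked obsolete for longer than X
    for item_id, item in list(items.items()):
        if item["obsolete_since"] and now - item["obsolete_since"] > X:
            del items[item_id]

replay_order_insert({"order_id": "G-0001", "items": ["widget-42"]})
purge(datetime(2007, 2, 10))
print(items)   # widget-42 survives because a replicated order referenced it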
Hi, Jan Wieck wrote: > Then let me give you a little puzzle just for the fun of it. > > A database containing customer contact information (among other things) > is a two node multimaster system. One is serving the customer web > portal, the other is used by the company staff including the call > center. At 13:45 the two servers lose connectivity to each other, yet > the internal staff can access the internal server while the web portal > is accessible from the outside. At 13:50 customer A updates their credit > card information through the web portal, while customer B does the same > through the call center. At 13:55 both customers change their mind to > use yet another credit card, now customer A phones the call center while > customer B does it via the internet. Phew, a mind twister... one customer would already be enough to trigger that sort of conflict... > At 14:00 the two servers reconnect and go through the conflict > resolution. How do you intend to solve both conflicts without using any > "clock", because that seems to be a stopword causing instant rejection > of whatever you propose. Needless to say, both customers will be > dissatisfied if you charge the "wrong" credit card during your next > billing cycle. Correct. But do these cases satisfy storing timestamps to each and every transaction you do? That's what I doubt, not the usefulness of time based conflict resolution for certain cases. You can always add a time based conflict resolution, by adding a timestamp column and decide upon that one. I'd guess that the overall costs are lower that way. But you've withdrawn that proposal already, so... > Which is a good discussion because one of the reasons why I stopped > looking into Postgres-R is the fact that is based on the idea to push > all the replication information through a system that generates a global > serialized message queue. That by itself isn't the problem, but the fact > that implementing a global serialized message queue has serious > throughput issues that are (among other details) linked to the speed of > light. Agreed. Nevertheless, there are use cases for such systems, because they put less limitations to the application. One could even argue, that your above example would be one ;-) > I am trying to start with a system, that doesn't rely on such a > mechanism for everything. I do intend to add an option later, that > allows to declare a UNIQUE NOT NULL constraint to be synchronous. What > that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE > will require the node to currently be a member of the (quorum or > priority defined) majority of the cluster. Sounds reasonable. > An advisory lock system, > based on a total order group communication, will grant the lock to the > unique key values on a first come, first serve base. Every node in the > cluster will keep those keys as "locked" until the asynchronous > replication stream reports the locking transaction as ended. If another > remote transaction in the meantime requires updating such key, the > incoming stream from that node will be on hold until the lock is > cleared. This is to protect agains node B replicating a transaction from > node A and a later update on node B arrives on C before C got the first > event from A. A node that got disconnected from the cluster must rebuild > the current advisory lock list upon reconnecting to the cluster. Yeah, this is a convenient way to replicate sequences via a GCS. 
> I think that this will be a way to overcome Postgres-R's communication > bottleneck, as well as allowing limited update activity even during a > completely disconnected state of a node. Synchronous or group > communication messages are reduced to the cases, where the application > cannot be implemented in a conflict free way, like allocating a natural > primary key. There is absolutely no need to synchronize for example > creating a sales order. Agreed, such cases can easily be optimized. But you have to be aware of the limitations these optimizations cause. Postgres-R is much more targeted at very general use cases. > An application can use global unique ID's for > the order number. And everything possibly referenced by an order (items, > customers, ...) is stored in a way that the references are never > updated. Deletes to those possibly referenced objects are implemented in > a two step process, where they are first marked obsolete, and later on > things that have been marked obsolete for X long are deleted. A REPLICA > TRIGGER on inserting an order will simply reset the obsolete flag of > referenced objects. If a node is disconnected longer than X, you have a > problem - hunt down the guy who defined X. Yeah, that's another very nice optimization. Again, as long as you know the limitations, that's all well and fine. > Merging certain ideas to come up with an async/sync hybrid? Seems to me > we have similar enough ideas to need conflict resolution, because we had > them simultaneously but communicate them asynchronously. Huh? Sorry, I didn't get what you're trying to say here. Regards Markus
On 2/7/2007 2:15 PM, Richard Troy wrote: >> Jan Wieck wrote: >> > Are we still discussing if the Postgres backend may provide support for >> > a commit timestamp, that follows the rules for Lamport timestamps in a >> > multi-node cluster? > > ...I thought you said in this thread that you haven't and weren't going to > work on any kind of logical proof of it's correctness, saw no value in > prototyping your way to a clear (convincing) argument, and were > withdrawing the proposal [...] I said I don't have any such documents. I was asked to continue this discussion in order to find people willing to help discover potential problems. I am prepared to continue this development isolated, although I wouldn't like to. The PostgreSQL developers community used to be good at throwing out ideas, brainstorming about the possibilities, adding more to them and coming up with very unique and flexible solutions. I am a little disappointed that much of that got lost over the years and please forgive me if I sound a little grumpy over that. The statement to withdraw the proposal was certainly premature - consider it not withdrawn at this time. However, comparing what used to be our process to what I see today, I must say that something like TOAST would never have happened. It was the result of a global brainstorming, that I simply translated into C code. Many details and features of the implementation are purely mine, but the really big sparks, that got it to what it is, I'm not claiming for myself. Most importantly, "give me proof of concept before we can talk about changing backend code" was not part of the process at all. We were pretty eager to change things back then, when we needed to get better in almost every way possible ... are we so good at replication that we need to be conservative in that respect now? We are certainly good at some things and have to be conservative with respect to them, but replication in my not so very humble opinion isn't one of them. I do understand that we have a codebase used in production these days. And because of that we have to maintain code and syntax stability to a degree, we didn't have back in the glory days of introducing EXCEPT and INTERCEPT (who's first incarnation was committed to the code base while completely destroying my entire work of fixing the rewriter). Maybe we need to introduce something entirely different, like the concept of an experimental feature. Something that we add to the code but that is explicitly flagged as not final, not stable, not guaranteed to stay or work in this or any other form. This requires that the feature has very limited interference with other parts of the system, like (or especially like) the query parser. If it turns out to be a problem in x.y.0, it will be backed out and gone in x.y.1. Or in a different way, like we create an experimental CVS branch off of every major release. That way, developers can easier share experimental code and if things settle there, they will be considered to be adopted into HEAD. > Like Markus, I would like to see the various replication efforts merged as > best they can be because even if the majority of users don't use a little > bit of everything, surely the more interesting cases would like to and the > entire community is better served if the various "solutions" are in > harmony. 
No doubt about that and I was the one organizing the Afilias sponsored meeting in Toronto back then, where my reversed Postgres-R idea was taken apart because it won't scale due to the gigantic amount of synchronized group communication it would require. Again, it might be that experimental features will cause more of the efforts to converge by using the same base as a compromise instead of having each and every support feature being designed completely independent. I still have a hard time understanding why someone would object to adding a feature, however useless it might seem to them, as long as it doesn't cost them anything. Admitted, any feature causes maintenance costs on the side of the PostgreSQL development community (mainly those, who actually contribute and maintain the code - fortunately that is a finite number - everyone please ask themselves if they are part of that). But aside from that, would anyone, who is questioning the commit timestamp as I proposed it, likewise vehemently object to yet another procedural language, or adding another log tuning switch? I don't think so. As long as it doesn't cost you unless you turn it on, why would you even care if it serves my purpose or not? The thing that kicked off this emotional spin was that multimaster replication is what so many people want, but nobody has a universal solution for. Everyone wants to see "their" problem solved "as well", or the solution isn't good. Tell you what, I can live with my problem solved even if it doesn't solve yours. Can you tell me what I have to modify in order to solve your problem as well, or are you asking me to not implement anything unless "I" find a way to solve everyones problems in one, big, universal solution? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 2/7/2007 9:27 PM, Markus Schiltknecht wrote: > Hi, > > Jan Wieck wrote: >> Then let me give you a little puzzle just for the fun of it. >> >> A database containing customer contact information (among other things) >> is a two node multimaster system. One is serving the customer web >> portal, the other is used by the company staff including the call >> center. At 13:45 the two servers lose connectivity to each other, yet >> the internal staff can access the internal server while the web portal >> is accessible from the outside. At 13:50 customer A updates their credit >> card information through the web portal, while customer B does the same >> through the call center. At 13:55 both customers change their mind to >> use yet another credit card, now customer A phones the call center while >> customer B does it via the internet. > > Phew, a mind twister... one customer would already be enough to trigger > that sort of conflict... > >> At 14:00 the two servers reconnect and go through the conflict >> resolution. How do you intend to solve both conflicts without using any >> "clock", because that seems to be a stopword causing instant rejection >> of whatever you propose. Needless to say, both customers will be >> dissatisfied if you charge the "wrong" credit card during your next >> billing cycle. > > Correct. But do these cases satisfy storing timestamps to each and every > transaction you do? That's what I doubt, not the usefulness of time > based conflict resolution for certain cases. > > You can always add a time based conflict resolution, by adding a > timestamp column and decide upon that one. I'd guess that the overall > costs are lower that way. Yes, yes, and yes ... but aside from the problem that you use the very ambiguous word "timestamp" (which somehow suggests using a "clock" of some sort), isn't the "begin" timestamp of a long running transaction worse than the "commit" timestamp, when all its work got visible to the outside world instantaneously? > > But you've withdrawn that proposal already, so... > >> Which is a good discussion because one of the reasons why I stopped >> looking into Postgres-R is the fact that is based on the idea to push >> all the replication information through a system that generates a global >> serialized message queue. That by itself isn't the problem, but the fact >> that implementing a global serialized message queue has serious >> throughput issues that are (among other details) linked to the speed of >> light. > > Agreed. Nevertheless, there are use cases for such systems, because they > put less limitations to the application. One could even argue, that your > above example would be one ;-) Now we're in sync :-) > >> I am trying to start with a system, that doesn't rely on such a >> mechanism for everything. I do intend to add an option later, that >> allows to declare a UNIQUE NOT NULL constraint to be synchronous. What >> that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE >> will require the node to currently be a member of the (quorum or >> priority defined) majority of the cluster. > > Sounds reasonable. > >> An advisory lock system, >> based on a total order group communication, will grant the lock to the >> unique key values on a first come, first serve base. Every node in the >> cluster will keep those keys as "locked" until the asynchronous >> replication stream reports the locking transaction as ended. 
If another >> remote transaction in the meantime requires updating such key, the >> incoming stream from that node will be on hold until the lock is >> cleared. This is to protect agains node B replicating a transaction from >> node A and a later update on node B arrives on C before C got the first >> event from A. A node that got disconnected from the cluster must rebuild >> the current advisory lock list upon reconnecting to the cluster. > > Yeah, this is a convenient way to replicate sequences via a GCS. > >> I think that this will be a way to overcome Postgres-R's communication >> bottleneck, as well as allowing limited update activity even during a >> completely disconnected state of a node. Synchronous or group >> communication messages are reduced to the cases, where the application >> cannot be implemented in a conflict free way, like allocating a natural >> primary key. There is absolutely no need to synchronize for example >> creating a sales order. > > Agreed, such cases can easily be optimized. But you have to be aware of > he limitations these optimizations cause. Postgres-R is much more > targeted at very general use cases. I am, if for no other reason than that I am familiar with the concepts underneath Postgres-R for more than 3 years. What I realized is that the "general use" case (for arbitrarily complex applications) is very likely to be in conflict with any kind of "good default performance" case. > >> An application can use global unique ID's for >> the order number. And everything possibly referenced by an order (items, >> customers, ...) is stored in a way that the references are never >> updated. Deletes to those possibly referenced objects are implemented in >> a two step process, where they are first marked obsolete, and later on >> things that have been marked obsolete for X long are deleted. A REPLICA >> TRIGGER on inserting an order will simply reset the obsolete flag of >> referenced objects. If a node is disconnected longer than X, you have a >> problem - hunt down the guy who defined X. > > Yeah, that's another very nice optimization. Again, as long as you know > the limitations, that's all well and fine. > >> Merging certain ideas to come up with an async/sync hybrid? Seems to me >> we have similar enough ideas to need conflict resolution, because we had >> them simultaneously but communicate them asynchronously. > > Huh? Sorry, I didn't get what you're trying to say here. Out of sync again ... we'll get there tomorrow ... unless your clock is way back and tomorrow will never come. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
I find the term "logical proof of its correctness" too restrictive. It sounds like some formal academic process that really doesn't work well for us. What I did want to hear is a layout of how the system would work, and an exchange of ideas until almost everyone was happy. Also, I saw the trigger patch with no explanation of why it was important or who would use it --- that also isn't going to fly well. So, to add something, the community needs to hear how it is going to help users, because every code addition has cost, and we don't want to add things unless they have general utility. If someone can't explain the utility of an addition, I question whether the person has fully thought through where they are going. As far as adding a language, no, we would not just add any language. We would judge whether the language is useful to our users. I think APL would be cool, but I am not sure it is usable, so there is a hurdle even there. As far as TOAST, there is no question in my mind that TOAST development would happen the same way today as it did when we did it in 2001 --- we have a problem, how can we fix it. --------------------------------------------------------------------------- Jan Wieck wrote: > On 2/7/2007 2:15 PM, Richard Troy wrote: > >> Jan Wieck wrote: > >> > Are we still discussing if the Postgres backend may provide support for > >> > a commit timestamp, that follows the rules for Lamport timestamps in a > >> > multi-node cluster? > > > > ...I thought you said in this thread that you haven't and weren't going to > > work on any kind of logical proof of it's correctness, saw no value in > > prototyping your way to a clear (convincing) argument, and were > > withdrawing the proposal [...] > > I said I don't have any such documents. I was asked to continue this > discussion in order to find people willing to help discover potential > problems. I am prepared to continue this development isolated, although > I wouldn't like to. > > The PostgreSQL developers community used to be good at throwing out > ideas, brainstorming about the possibilities, adding more to them and > coming up with very unique and flexible solutions. I am a little > disappointed that much of that got lost over the years and please > forgive me if I sound a little grumpy over that. The statement to > withdraw the proposal was certainly premature - consider it not > withdrawn at this time. However, comparing what used to be our process > to what I see today, I must say that something like TOAST would never > have happened. It was the result of a global brainstorming, that I > simply translated into C code. Many details and features of the > implementation are purely mine, but the really big sparks, that got it > to what it is, I'm not claiming for myself. Most importantly, "give me > proof of concept before we can talk about changing backend code" was not > part of the process at all. We were pretty eager to change things back > then, when we needed to get better in almost every way possible ... are > we so good at replication that we need to be conservative in that > respect now? We are certainly good at some things and have to be > conservative with respect to them, but replication in my not so very > humble opinion isn't one of them. > > I do understand that we have a codebase used in production these days.
> And because of that we have to maintain code and syntax stability to a > degree, we didn't have back in the glory days of introducing EXCEPT and > INTERCEPT (who's first incarnation was committed to the code base while > completely destroying my entire work of fixing the rewriter). Maybe we > need to introduce something entirely different, like the concept of an > experimental feature. Something that we add to the code but that is > explicitly flagged as not final, not stable, not guaranteed to stay or > work in this or any other form. This requires that the feature has very > limited interference with other parts of the system, like (or especially > like) the query parser. If it turns out to be a problem in x.y.0, it > will be backed out and gone in x.y.1. Or in a different way, like we > create an experimental CVS branch off of every major release. That way, > developers can easier share experimental code and if things settle > there, they will be considered to be adopted into HEAD. > > > Like Markus, I would like to see the various replication efforts merged as > > best they can be because even if the majority of users don't use a little > > bit of everything, surely the more interesting cases would like to and the > > entire community is better served if the various "solutions" are in > > harmony. > > No doubt about that and I was the one organizing the Afilias sponsored > meeting in Toronto back then, where my reversed Postgres-R idea was > taken apart because it won't scale due to the gigantic amount of > synchronized group communication it would require. Again, it might be > that experimental features will cause more of the efforts to converge by > using the same base as a compromise instead of having each and every > support feature being designed completely independent. > > I still have a hard time understanding why someone would object to > adding a feature, however useless it might seem to them, as long as it > doesn't cost them anything. Admitted, any feature causes maintenance > costs on the side of the PostgreSQL development community (mainly those, > who actually contribute and maintain the code - fortunately that is a > finite number - everyone please ask themselves if they are part of > that). But aside from that, would anyone, who is questioning the commit > timestamp as I proposed it, likewise vehemently object to yet another > procedural language, or adding another log tuning switch? I don't think > so. As long as it doesn't cost you unless you turn it on, why would you > even care if it serves my purpose or not? The thing that kicked off this > emotional spin was that multimaster replication is what so many people > want, but nobody has a universal solution for. Everyone wants to see > "their" problem solved "as well", or the solution isn't good. Tell you > what, I can live with my problem solved even if it doesn't solve yours. > Can you tell me what I have to modify in order to solve your problem as > well, or are you asking me to not implement anything unless "I" find a > way to solve everyones problems in one, big, universal solution? > > > Jan > > -- > #======================================================================# > # It's easier to get forgiveness for being wrong than for being right. # > # Let's break this rule - forgive me. 
# > #================================================== JanWieck@Yahoo.com # > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 2/7/2007 10:35 PM, Bruce Momjian wrote: > I find the term "logical proof of it's correctness" too restrictive. It > sounds like some formal academic process that really doesn't work well > for us. Thank you. > Also, I saw the trigger patch with no explaination of why it was > important or who would use it --- that also isn't going to fly well. You didn't respond to my explanation of how the current Slony implementation could improve and evolve using it. Are you missing something? I am discussing this very issue with our own QA department, and thus far, I think I have a majority of "would use a pg_trigger backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly how it corrupted my system catalog". > As far as TOAST, there is no question in my mind that TOAST development > would happen the same way today as it did when we did it in 2001 --- we > have a problem, how can we fix it. Looking at what did happen back then and what happens in this case, I do see a difference. There were concerns about the compression algorithm used ... what is used today is still the first incarnation, and nobody ever bothered to investigate whether there could possibly be anything better. Do you think lzcompress is the best we can come up with? I don't! So why is it still the thing used? Maybe it is good enough? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 2/7/2007 10:35 PM, Bruce Momjian wrote: > > I find the term "logical proof of it's correctness" too restrictive. It > > sounds like some formal academic process that really doesn't work well > > for us. > > Thank you. > > > Also, I saw the trigger patch with no explaination of why it was > > important or who would use it --- that also isn't going to fly well. > > You didn't respond to my explanation how the current Slony > implementation could improve and evolve using it. Are you missing > something? I am discussing this very issue with our own QA department, > and thus far, I think I have a majority of "would use a pg_trigger > backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly > how it corrupted my system catalog". No, I _now_ understand the use case, but when the patch was posted, the use case was missing. I would like to see a repost with the patch, and a description of its use so we can all move forward on that. > > As far as TOAST, there is no question in my mind that TOAST development > > would happen the same way today as it did when we did it in 2001 --- we > > have a problem, how can we fix it. > > Looking at what did happen back then and what happens in this case, I do > see a difference. There were concerns about the compression algorithm > used ... it still is today what was the first incarnation and nobody > ever bothered to even investigate if there could possibly be any better > thing. Do you think lzcompress is the best we can come up with? I don't! > So why is it still the thing used? Maybe it is good enough? It is simple/stupid enough, I would say, and the compression space is a mine-field of patents. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> Yes, yes, and yes ... but aside from the problem that you use the very > ambiguous word "timestamp" (which somehow suggests using a "clock" of > some sort), isn't the "begin" timestamp of a long running transaction [imho a begin timestamp is near useless] > worse than the "commit" timestamp, when all its work got visible to the > outside world instantaneously? This is one of the areas I am still worried about. Is one commit Lamport timestamp enough? I think for some conflict resolutions we need to look at the row level, and resolve conflicts per row and not per transaction (yes, this means that a tx might get partially replicated). What I am trying to get at is: maybe an infrastructure to produce Wieck Lamport timestamps, that can be used in different places like commit hooks and column defaults, would be of more general use. Maybe such a column could be a system column that is not visible with "select *" for those cases where commit is not enough. And a commit hook could insert it into clog-like storage. Andreas
On 2/7/2007 11:12 PM, Bruce Momjian wrote: > Jan Wieck wrote: >> On 2/7/2007 10:35 PM, Bruce Momjian wrote: >> > I find the term "logical proof of it's correctness" too restrictive. It >> > sounds like some formal academic process that really doesn't work well >> > for us. >> >> Thank you. >> >> > Also, I saw the trigger patch with no explaination of why it was >> > important or who would use it --- that also isn't going to fly well. >> >> You didn't respond to my explanation how the current Slony >> implementation could improve and evolve using it. Are you missing >> something? I am discussing this very issue with our own QA department, >> and thus far, I think I have a majority of "would use a pg_trigger >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly >> how it corrupted my system catalog". > > No, I _now_ understand the use case, but when the patch was posted, the > use case was missing. I would like to see a repost with the patch, and > a description of its use so we can all move forward on that. Is this a new policy that after discussion, all patches must be resubmitted with a summary and conclusions of the discussion? I can certainly do that for you, but just tell me if you are going to ask the same from everyone. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 2/7/2007 11:12 PM, Bruce Momjian wrote: > > Jan Wieck wrote: > >> On 2/7/2007 10:35 PM, Bruce Momjian wrote: > >> > I find the term "logical proof of it's correctness" too restrictive. It > >> > sounds like some formal academic process that really doesn't work well > >> > for us. > >> > >> Thank you. > >> > >> > Also, I saw the trigger patch with no explaination of why it was > >> > important or who would use it --- that also isn't going to fly well. > >> > >> You didn't respond to my explanation how the current Slony > >> implementation could improve and evolve using it. Are you missing > >> something? I am discussing this very issue with our own QA department, > >> and thus far, I think I have a majority of "would use a pg_trigger > >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly > >> how it corrupted my system catalog". > > > > No, I _now_ understand the use case, but when the patch was posted, the > > use case was missing. I would like to see a repost with the patch, and > > a description of its use so we can all move forward on that. > > Is this a new policy that after discussion, all patches must be > resubmitted with a summary and conclusions of the discussion? I can > certainly do that for you, but just tell me if you are going to ask the > same from everyone. No, I am asking only this time because I feel there was too much disconnect between the patch and the extensive replication discussion that few community members would see the connection. I would also like to know what your new feature does for each supported option. I have not seen that spelled out yet at all. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Jan Wieck wrote: > > On 2/7/2007 11:12 PM, Bruce Momjian wrote: > > > Jan Wieck wrote: > > >> On 2/7/2007 10:35 PM, Bruce Momjian wrote: > > >> > > >> > Also, I saw the trigger patch with no explaination of why it was > > >> > important or who would use it --- that also isn't going to fly well. > > >> > > >> You didn't respond to my explanation how the current Slony > > >> implementation could improve and evolve using it. Are you missing > > >> something? I am discussing this very issue with our own QA department, > > >> and thus far, I think I have a majority of "would use a pg_trigger > > >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly > > >> how it corrupted my system catalog". > > > > > > No, I _now_ understand the use case, but when the patch was posted, the > > > use case was missing. I would like to see a repost with the patch, and > > > a description of its use so we can all move forward on that. > > > > Is this a new policy that after discussion, all patches must be > > resubmitted with a summary and conclusions of the discussion? I can > > certainly do that for you, but just tell me if you are going to ask the > > same from everyone. > > No, I am asking only this time because I feel there was too much > disconnect between the patch and the extensive replication discussion > that few community members would see the connection. FYI, in my opinion the trigger addition is clearly useful to Mammoth Replicator as well. In fact, it's so obviously useful that I didn't see a need to state that in the original thread where it was discussed. Not sure about the timestamp stuff, because Replicator is not multi-master, so there's no conflict resolution to take care of. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > > > Is this a new policy that after discussion, all patches must be > > > resubmitted with a summary and conclusions of the discussion? I can > > > certainly do that for you, but just tell me if you are going to ask the > > > same from everyone. > > > > No, I am asking only this time because I feel there was too much > > disconnect between the patch and the extensive replication discussion > > that few community members would see the connection. > > FYI, in my opinion the trigger addition is clearly useful to Mammoth > Replicator as well. In fact, it's so obviously useful that I didn't see > a need to state that in the original thread where it was discussed. Right, I know it is useful too, but I would like a layout of what it does and why so everyone is clear on it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 2/8/2007 3:32 PM, Bruce Momjian wrote: > Alvaro Herrera wrote: >> > > Is this a new policy that after discussion, all patches must be >> > > resubmitted with a summary and conclusions of the discussion? I can >> > > certainly do that for you, but just tell me if you are going to ask the >> > > same from everyone. >> > >> > No, I am asking only this time because I feel there was too much >> > disconnect between the patch and the extensive replication discussion >> > that few community members would see the connection. >> >> FYI, in my opinion the trigger addition is clearly useful to Mammoth >> Replicator as well. In fact, it's so obviously useful that I didn't see >> a need to state that in the original thread where it was discussed. > > Right, I know it is useful too, but I would like a layout of what it > does and why so everyone is clear on it. I have no clue what got you into what you are doing here. But that shall not be my real concern. If you feel the need to do this sort of thing, be my guest. I will add the remaining changes to pg_rewrite, including the new support commands and changes to psql as well as pg_dump and resubmit the new patch with explanations that will hopefully help you to comprehend what and how this relatively small and fully backward compatible change in the trigger and rule firing mechanism will work and what existing problems it will solve. Regards, Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 2/8/2007 3:32 PM, Bruce Momjian wrote: > > Alvaro Herrera wrote: > >> > > Is this a new policy that after discussion, all patches must be > >> > > resubmitted with a summary and conclusions of the discussion? I can > >> > > certainly do that for you, but just tell me if you are going to ask the > >> > > same from everyone. > >> > > >> > No, I am asking only this time because I feel there was too much > >> > disconnect between the patch and the extensive replication discussion > >> > that few community members would see the connection. > >> > >> FYI, in my opinion the trigger addition is clearly useful to Mammoth > >> Replicator as well. In fact, it's so obviously useful that I didn't see > >> a need to state that in the original thread where it was discussed. > > > > Right, I know it is useful too, but I would like a layout of what it > > does and why so everyone is clear on it. > > I have no clue what got you into what you are doing here. But that shall > not be my real concern. If you feel the need to do this sort of thing, > be my guest. I will add the remaining changes to pg_rewrite, including > the new support commands and changes to psql as well as pg_dump and > resubmit the new patch with explanations that will hopefully help you to > comprehend what and how this relatively small and fully backward > compatible change in the trigger and rule firing mechanism will work and > what existing problems it will solve. Yep, that's what I want everyone to see. This is standard procedure for everyone in the community, and core is not immune. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Jan Wieck wrote: > On 2/8/2007 3:32 PM, Bruce Momjian wrote: >> Alvaro Herrera wrote: >>> >> Is this a new policy that after discussion, all patches must be >>> >> resubmitted with a summary and conclusions of the discussion? I can >>> >> certainly do that for you, but just tell me if you are going to ask the >>> >> same from everyone. >>> > No, I am asking only this time because I feel there was too much >>> > disconnect between the patch and the extensive replication discussion >>> > that few community members would see the connection. >>> >>> FYI, in my opinion the trigger addition is clearly useful to Mammoth >>> Replicator as well. In fact, it's so obviously useful that I didn't see >>> a need to state that in the original thread where it was discussed. >> >> Right, I know it is useful too, but I would like a layout of what it >> does and why so everyone is clear on it. Well, how deep are we talking here? My understanding of what Jan wants to do is simple. Be able to declare which triggers are fired depending on the state of the cluster. In Jan's terms, the Origin or Subscriber. In Replicator terms the Master or Slave. This is useful because I may have a trigger on the Master and the same trigger on the Slave. You do not want the trigger to fire on the Slave because we are doing data replication. In short, we replicate the result, not the action. However, you may want triggers that are on the Slave to fire separately. A reporting server that generates materialized views is a good example. Don't tie up the Master with what a Slave can do. Sincerely, Joshua D. Drake > > I have no clue what got you into what you are doing here. But that shall > not be my real concern. If you feel the need to do this sort of thing, > be my guest. I will add the remaining changes to pg_rewrite, including > the new support commands and changes to psql as well as pg_dump and > resubmit the new patch with explanations that will hopefully help you to > comprehend what and how this relatively small and fully backward > compatible change in the trigger and rule firing mechanism will work and > what existing problems it will solve. > > > Regards, > Jan > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
I just want an outline of what each option is supposed to control. If that information is in a documentation patch, then fine, he can just post that and tell people to read the patch documentation. --------------------------------------------------------------------------- Joshua D. Drake wrote: > Jan Wieck wrote: > > On 2/8/2007 3:32 PM, Bruce Momjian wrote: > >> Alvaro Herrera wrote: > >>> > > Is this a new policy that after discussion, all patches must be > > >>> > resubmitted with a summary and conclusions of the discussion? I can > >>> > > certainly do that for you, but just tell me if you are going to > >>> ask the > > same from everyone. > >>> > > No, I am asking only this time because I feel there was too much > >>> > disconnect between the patch and the extensive replication discussion > >>> > that few community members would see the connection. > >>> > >>> FYI, in my opinion the trigger addition is clearly useful to Mammoth > >>> Replicator as well. In fact, it's so obviously useful that I didn't see > >>> a need to state that in the original thread where it was discussed. > >> > >> Right, I know it is useful too, but I would like a layout of what it > >> does and why so everyone is clear on it. > > Well how deep are we talking here? My understanding of what Jan wants to > do is simple. > > Be able to declare which triggers are fired depending on the state of > the cluster. > > In Jan's terms, the Origin or Subscriber. In Replicator terms the Master > or Slave. > > This is useful because I may have a trigger on the Master and the same > trigger on the Slave. You do not want the trigger to fire on the Slave > because we are doing data replication. In short, the we replicate the > result, not the action. > > However, you may want triggers that are on the Slave to fire separately. > A reporting server that generates materialized views is a good example. > Don't tie up the Master with what a Slave can do. > > Sincerely, > > Joshua D. Drake > > > > > > > I have no clue what got you into what you are doing here. But that shall > > not be my real concern. If you feel the need to do this sort of thing, > > be my guest. I will add the remaining changes to pg_rewrite, including > > the new support commands and changes to psql as well as pg_dump and > > resubmit the new patch with explanations that will hopefully help you to > > comprehend what and how this relatively small and fully backward > > compatible change in the trigger and rule firing mechanism will work and > > what existing problems it will solve. > > > > > > Regards, > > Jan > > > > > -- > > === The PostgreSQL Company: Command Prompt, Inc. === > Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 > Providing the most comprehensive PostgreSQL solutions since 1997 > http://www.commandprompt.com/ > > Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate > PostgreSQL Replication: http://www.commandprompt.com/products/ > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, 8 Feb 2007, Joshua D. Drake wrote: > > Well how deep are we talking here? My understanding of what Jan wants to > do is simple. > > Be able to declare which triggers are fired depending on the state of > the cluster. > > In Jan's terms, the Origin or Subscriber. In Replicator terms the Master > or Slave. > > This is useful because I may have a trigger on the Master and the same > trigger on the Slave. You do not want the trigger to fire on the Slave > because we are doing data replication. In short, the we replicate the > result, not the action. > > However, you may want triggers that are on the Slave to fire separately. > A reporting server that generates materialized views is a good example. > Don't tie up the Master with what a Slave can do. > It'd be great if Jan considers the blending of replication; any given DB instance shouldn't be only a master/originator or only a slave/subscriber. A solution that lets you blend replication strategies in a single db is, from my point of view, very important. > > I have no clue what got you into what you are doing here. Jan, some sleep now and then might be helpful to your public disposition. -smile- peace, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 2/8/2007 11:41 PM, Richard Troy wrote: > On Thu, 8 Feb 2007, Joshua D. Drake wrote: >> >> Well how deep are we talking here? My understanding of what Jan wants to >> do is simple. >> >> Be able to declare which triggers are fired depending on the state of >> the cluster. >> >> In Jan's terms, the Origin or Subscriber. In Replicator terms the Master >> or Slave. >> >> This is useful because I may have a trigger on the Master and the same >> trigger on the Slave. You do not want the trigger to fire on the Slave >> because we are doing data replication. In short, the we replicate the >> result, not the action. >> >> However, you may want triggers that are on the Slave to fire separately. >> A reporting server that generates materialized views is a good example. >> Don't tie up the Master with what a Slave can do. >> > > It'd be great if Jan considers the blending of replication; Please elaborate. I would really like to get all you can contribute. > >> > I have no clue what got you into what you are doing here. > > Jan, some sleep now and then might be helpful to your public disposition. Richard, don't embarrass Bruce. He doesn't need your help. I have been with this project and know Bruce Momjian for more than 10 years. Every now and then, Bruce and I get into some sort of eventually publicly visible dispute that doesn't really mean much. I'll probably spend next Christmas with him and his family again, play a few rounds of backgammon with Wilma (who I really owe a revenge), hopefully don't interfere too much with Christine's work (especially when it involves handling food over a white carpet) and none of us will even remember this crap. Our friendship has been through some real tests. Any real problem we would have, we'd never discuss here. We would just meet and talk. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 2/7/2007 7:13 AM, José Orlando Pereira wrote: > On Saturday 03 February 2007, Bruce Momjian wrote: >> Jan Wieck wrote: >> > I don't have any such paper and the proof of concept will be the >> > implementation of the system. I do however see enough resistance against >> > this proposal to withdraw the commit timestamp at this time. The new >> > replication system will therefore require the installation of a patched, >> > non-standard PostgreSQL version, compiled from sources cluster wide in >> > order to be used. I am aware that this will dramatically reduce it's >> > popularity but it is impossible to develop this essential feature as an >> > external module. >> > >> > I thank everyone for their attention. >> >> Going and working on it on your own doesn't seem like the proper >> solution. I don't see people objecting to adding it, but they want it >> work, which I am sure you want too. You have to show how it will work >> and convince others of that, and then you have a higher chance it will >> work, and be in the PostgreSQL codebase. > > Hi, > > Would it be possible to solve the problem using the GORDA on-commit hook? > > Jan would be able reliably obtain a commit timestamp with the desired > semantics and store it in a regular table within transaction boundaries. I am not sure; I would have to look at what exactly that hook provides. The key to a Lamport timestamp is that it advances in commit order (plus some other things ... of course). If the hook can guarantee that the calls are always made in commit order, serialized without any race condition possible, it would probably be suitable. Jan > > PostgreSQL would not have to commit to a specific timestamp semantics and the > patch is quite small. > > Regards, > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
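To pin down the property being asked of such a hook, here is a minimal illustrative sketch in Python (not backend C, and none of these names exist in PostgreSQL or GORDA): a generator that is only useful if its callers are already serialized in commit order, and that never hands out a value that goes backwards even when the system clock does.

    import threading
    import time

    class CommitTimestampGenerator:
        """Hands out strictly increasing commit timestamps in microseconds.

        Assumes the caller invokes next_timestamp() once per transaction,
        serialized in commit order (e.g. from inside a commit hook); the
        lock only guards against concurrent callers, it cannot create order.
        """

        def __init__(self):
            self._lock = threading.Lock()
            self._last_us = 0

        def next_timestamp(self):
            with self._lock:
                now_us = int(time.time() * 1_000_000)
                # If the clock stalled or stepped backwards, nudge the value
                # forward by one microsecond so the sequence stays strictly
                # monotonic in commit order.
                if now_us <= self._last_us:
                    now_us = self._last_us + 1
                self._last_us = now_us
                return now_us

A commit hook would do nothing more than ts = generator.next_timestamp() while it still holds whatever serializes commits; all of the interesting guarantees live in that serialization, not in the arithmetic.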
Richard Troy wrote: > It'd be great if Jan considers the blending of replication; any given DB > instance shouldn't be only a master/originator or only a slave/subscriber. > A solution that lets you blend replication strategies in a single db is, > from my point of view, very important. > Perhaps if more people read Jan's posts he wouldn't be so frustrated. http://archives.postgresql.org/pgsql-hackers/2007-01/msg01302.php He clearly describes that the master/slave setting is per session, not per database. Kris Jurka
On Feb 8, 2007, at 8:41 PM, Richard Troy wrote: > It'd be great if Jan considers the blending of replication; any > given DB > instance shouldn't be only a master/originator or only a slave/ > subscriber. > A solution that lets you blend replication strategies in a single > db is, > from my point of view, very important. It might be constructive to define what a minimal "complete" set of replication primitives actually is in addition to which ones should be implemented. In addition to master/slave models, you have Paxos algorithms and dynamic reconfiguration models in literature that can utilize many of the same primitives but which are very different in implementation. I see the value of Jan's proposal, but perhaps it would be better to step back and make some assertions about the nature of the core capabilities that will be supported in some broader picture. Having a theoretically (mostly) complete set of usable primitives would be an incredibly powerful feature set. Cheers, J. Andrew Rogers jrogers@neopolitan.com
On Fri, 9 Feb 2007, Jan Wieck wrote: > > [ I wrote ] > > It'd be great if Jan considers the blending of replication; > > Please elaborate. I would really like to get all you can contribute. Thanks Jan, prefaced by saying that I really haven't read everything you've written on this (or what other people are doing, either), and that I've got a terrible flu right now (fever, etc), I'll give it a go - hopefully it's actually helpful. To wit: In general terms, "blending of replication [techniques]" means to me that one can have a single database instance serve as a master and as a slave (to use only one set of terminology), and as a multi-master, too, all simultaneously, letting the DBA / Architect choose which portions serve which roles (purposes). All replication features would respect the boundaries of such choices automatically, as it's all blended. In more specific terms, and I'm just brainstorming in public here, perhaps we can use the power of Schemas within a database to manage such divisions; commands which pertain to replication can/would include a schema specifier and elements within the schema can be replicated one way or another, at the whim of the DBA / Architect. For backwards compatibility, if a schema isn't specified, it indicates that the command pertains to the entire database. At the very least, a schema division strategy for replication leverages an existing DB-component binding/dividing mechanism that most everyone is familiar with. While there are/may be database-wide, nay, installation-wide constructs as in your Commit Timestamp proposal, I don't see that there's any conflict - at least, from what I understand of existing systems and proposals to date. HTH, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On Feb 7, 8:12 pm, b...@momjian.us (Bruce Momjian) wrote: > Jan Wieck wrote: > > On 2/7/2007 10:35 PM, Bruce Momjian wrote: > > > I find the term "logical proof of it's correctness" too restrictive. It > > > sounds like some formal academic process that really doesn't work well > > > for us. > > > Thank you. My intuition is that it might be possible to prove that _nothing_ can provide guaranteed ordering when there is disconnected operation. However, I think that the clock based ordering Jan has described could provide _probable_ ordering under disconnected operation. I can see three variables in the equation that would determine the probability of correctness for the ordering. 1) clock drift rate between disconnected clusters 2) disconnection time 3) transaction rate on the tables, or even rows involved There are probably more. I think that if Jan implements what he's described then a very interesting follow-up would be to do the statistical analysis necessary to quantify the risk of incorrect ordering while disconnected. (I've got x ms/ms relative clock drift, and y tps. How long can I run disconnected before falling under 99.999% probability of correctly ordered transactions?) > No, I _now_ understand the use case, but when the patch was posted, the > use case was missing. I would like to see a repost with the patch, and > a description of its use so we can all move forward on that. An additional use case for an on-commit timestamp is in the analysis of billing transactions in highly concurrent systems. For example, imagine your billing period is monthly and you have transactions which start before and end after the "end-of-month". Having the on-commit timestamp for these transactions may help when attempting to reconcile between transactions and account activities. Andrew
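One crude way to put numbers on those three variables (purely an illustration; the Poisson model, the function name, and the example figures are assumptions of this sketch, not measurements from the thread): accumulated skew grows with drift over the outage, and a pair of cross-node updates to the same rows is at risk of being misordered once their real-time separation falls inside that skew.

    import math

    def expected_misorderable_pairs(drift, tps_a, tps_b, outage_seconds, steps=10000):
        """Crude estimate of how many cross-node update pairs could be misordered.

        drift          -- relative clock drift, in seconds of skew per second
        tps_a, tps_b   -- rate of updates (per second) touching the *same* rows
                          on the two disconnected nodes, modeled as Poisson
        outage_seconds -- how long the nodes stay disconnected

        An update on node A at time t is counted as "at risk" if some update on
        node B falls within the clock skew accumulated by time t, because the
        commit timestamps can then no longer be trusted to reflect real order.
        """
        total = 0.0
        dt = outage_seconds / steps
        for i in range(steps):
            t = (i + 0.5) * dt
            skew = drift * t                               # skew accumulated so far
            p_hit = 1.0 - math.exp(-2.0 * tps_b * skew)    # P(a B-update lands that close)
            total += tps_a * p_hit * dt
        return total

    # Example: 50 ppm relative drift, 0.1 conflicting updates/s per node,
    # 8 hours of disconnected operation.
    print(expected_misorderable_pairs(50e-6, 0.1, 0.1, 8 * 3600))

The real analysis would of course need to model how conflicting updates cluster on hot rows, but even a toy formula like this makes the trade-off between drift, rate, and disconnection time visible.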
Richard Troy wrote: > In more specific terms, and I'm just brainstorming in public here, perhaps > we can use the power of Schemas within a database to manage such > divisions; commands which pertain to replication can/would include a > schema specifier and elements within the schema can be replicated one way > or another, at the whim of the DBA / Architect. For backwards > compatability, if a schema isn't specified, it indicates that command > pertains to the entire database. > > I understand that you're just thinking aloud, but overloading namespaces in this way strikes me as awful. Applications and extensions, which are the things that have need of namespaces, should not have to care about replication. If we have to design them for replication we'll be on a fast track to nowhere IMNSHO. cheers andrew
On 2/9/2007 2:27 PM, Richard Troy wrote: > In general terms, "blending of replication [techniques]" means to me that > one can have a single database instance serve as a master and as a slave > (to use only one set of terminology), and as a multi-master, too, all > simultaneously, letting the DBA / Architect choose which portions serve > which roles (purposes). All replication features would respect the > boundaries of such choices automatically, as it's all blended. That is specifically what the changes to pg_trigger and pg_rewrite take into account. However much you blend different techniques, a single transaction on one server will always fall into one of three categories. 1) It could be the original operation done by the client application. 2) It could be the actions performed by the replication engine to replay a remote transaction. And 3) it can be an administrative operation that must not be propagated at all. No matter how many different models you have in parallel, one single transaction will be either a master, a slave or an isolated local thing. The proposed changes make it possible to tell the session which of these three roles it is playing, and the triggers and rules can be configured to fire in the master/local role, in the slave role, always, or never. That functionality will work for master-slave as well as multi-master. Although my current plan isn't to create such a blended system, the proposed trigger and rule changes are designed to support exactly that in a 100% backward compatible way. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
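The decision table this describes is small; the following sketch is purely illustrative (the enum and function names are invented here, not the proposed catalog columns or syntax):

    from enum import Enum

    class SessionRole(Enum):
        ORIGIN = "origin"    # original operation performed by the client application
        REPLICA = "replica"  # replication engine replaying a remote transaction
        LOCAL = "local"      # administrative work that must not be propagated

    class FireMode(Enum):
        ORIGIN = "origin"    # fire for origin/local work (today's default behavior)
        REPLICA = "replica"  # fire only while replaying replicated data
        ALWAYS = "always"
        NEVER = "never"

    def trigger_fires(mode, role):
        """Return True if a trigger or rule with this setting fires in this session."""
        if mode is FireMode.ALWAYS:
            return True
        if mode is FireMode.NEVER:
            return False
        if mode is FireMode.ORIGIN:
            return role in (SessionRole.ORIGIN, SessionRole.LOCAL)
        return role is SessionRole.REPLICA

Because an untouched trigger keeps the origin/local behavior, existing setups keep working unchanged, which is what makes the change backward compatible.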
On Fri, 9 Feb 2007, Andrew Dunstan wrote: > Richard Troy wrote: > > In more specific terms, and I'm just brainstorming in public here, perhaps > > we can use the power of Schemas within a database to manage such > > divisions; commands which pertain to replication can/would include a > > schema specifier and elements within the schema can be replicated one way > > or another, at the whim of the DBA / Architect. For backwards > > compatability, if a schema isn't specified, it indicates that command > > pertains to the entire database. > > I understand that you're just thinking aloud, but overloading namespaces > in this way strikes me as awful. Applications and extensions, which are > the things that have need of namespaces, should not have to care about > replication. If we have to design them for replication we'll be on a > fast track to nowhere IMNSHO. Well, Andrew, replication _is_ an application. Or, you could think of replication as an extension to an application. I was under the impression that_users_ decide to put tables in schema spaces based upon _user_ need, and that Postgres developer's use of them for other purposes was incroaching on user choices, not the other way around. Either way, claiming "need" like this strikes me as stuck-in-a-rut or dogmatic thinking. Besides, don't we have schema nesting to help resolve any such "care?" And, what do you mean by "design them for replication?" While I'm in no way stuck on blending replication strategies via schemas, it does strike me as an appropriate concept and I'd preferr to have it evaluated based on technical merrit - possibly citing workarounds or solutions to technical issues, which is what I gather has been the tradition of this group: Use case first, technical merrit second... Other alternatives, ISTM, will have virtually the same look/feel as a schema from an external perspective, and the more I think of it the more I think using schemas is a sound, clean approach. That it offends someones sense of asthetics STM a poor rationale for not choosing it. Another question might be: What's lacking in the implementation of schemas that makes this a poor choice, and what could be done about it without much effort? Regards, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On Fri, 9 Feb 2007, Jan Wieck wrote: > > No matter how many different models you have in parallel, one single > transaction will be either a master, a slave or an isolated local thing. > The proposed changes allow to tell the session which of these three > roles it is playing and the triggers and rules can be configured to fire > during master/local role, slave role, always or never. That > functionality will work for master-slave as well as multi-master. > > Although my current plan isn't creating such a blended system, the > proposed trigger and rule changes are designed to support exactly that > in a 100% backward compatible way. > > Jan Fantastic! ...At some point you'll be thinking of the management end - turning it on or off, etc. That might be where these other points come more into play. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 2/9/2007 2:19 PM, Andrew Hammond wrote: > On Feb 7, 8:12 pm, b...@momjian.us (Bruce Momjian) wrote: >> Jan Wieck wrote: >> > On 2/7/2007 10:35 PM, Bruce Momjian wrote: >> > > I find the term "logical proof of it's correctness" too restrictive. It >> > > sounds like some formal academic process that really doesn't work well >> > > for us. >> >> > Thank you. > > My intuition is that it might be possible to prove that _nothing_ can > provide guaranteed ordering when there is disconnected operation. As a matter of physics, for two events happening outside of the event horizon of each other, the question of which happened first is pointless. > However, I think that the clock based ordering Jan has described could > provide _probable_ ordering under disconnected operation. I can see > three variables in the equation that would determine the probability > of correctness for the ordering. That precisely is the intended functionality. And I can describe exactly when two conflicting actions will result in the "wrong" row persisting. This will happen when the second update to the logically same row is performed on a server whose Lamport timestamp lags behind by more than the time between the two conflicting commits. Example: A user fills out a form, submits, hits the back button, corrects the input and submits again within 3 seconds. Load balancing sends both requests to different servers and the first server is 3.0001 seconds ahead ... the user's typo will be the winner. My Lamport timestamp conflict resolution will not be able to solve this problem. However, when this happens, one thing is guaranteed. The update from the second server, arriving on the first for replication, will be ignored because a locally generated row is newer. This fact can be used as an indicator that there is a possible conflict that was resolved using the wrong data (business-process-wise). All nodes in the cluster will end up using the same wrong row, so at least they are consistently wrong. Nevertheless, being able to identify possible problem cases this way will make it possible to initiate further action, including but not limited to human intervention. If this is not an acceptable risk for the application, other resolution methods will be needed. But I think in many cases, this form of default resolution will be "good enough". Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
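A minimal sketch of that default resolution, with the "ignored incoming update" used as the conflict flag (illustrative Python; the dict layout and function name are assumptions of the sketch, not Slony or backend code):

    def apply_replicated_row(local_row, incoming_row):
        """Last-update-wins resolution keyed on the commit (Lamport) timestamp.

        Both arguments are dicts carrying a 'commit_ts' key; local_row may be
        None if the row does not exist locally yet.  Returns (winner, flagged):
        flagged is True exactly in the case described above -- the incoming
        update is ignored because the locally generated row is newer, which
        marks a conflict that may have been resolved against business intent.
        """
        if local_row is None or incoming_row["commit_ts"] > local_row["commit_ts"]:
            return incoming_row, False   # normal case: the remote change wins
        return local_row, True           # local row is newer: keep it, flag it

    # The web-form example: on the fast server, the typo row carries the higher
    # commit timestamp, so the replicated correction loses and gets flagged.
    winner, flagged = apply_replicated_row(
        {"val": "typo", "commit_ts": 1003.0001},      # committed on the server running 3.0001s ahead
        {"val": "corrected", "commit_ts": 1003.0})    # committed 3s later in real time, lagging clock
    assert winner["val"] == "typo" and flagged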
Richard Troy wrote: > On Fri, 9 Feb 2007, Andrew Dunstan wrote: > >> Richard Troy wrote: >> >>> In more specific terms, and I'm just brainstorming in public here, perhaps >>> we can use the power of Schemas within a database to manage such >>> divisions; commands which pertain to replication can/would include a >>> schema specifier and elements within the schema can be replicated one way >>> or another, at the whim of the DBA / Architect. For backwards >>> compatability, if a schema isn't specified, it indicates that command >>> pertains to the entire database. >>> >> I understand that you're just thinking aloud, but overloading namespaces >> in this way strikes me as awful. Applications and extensions, which are >> the things that have need of namespaces, should not have to care about >> replication. If we have to design them for replication we'll be on a >> fast track to nowhere IMNSHO. >> > > Well, Andrew, replication _is_ an application. Or, you could think of > replication as an extension to an application. No, I don't think of it as either. It's a utility, more an extension of the DBMS than of the application. You don't replicate for the sake of replicating. > I was under the impression > that_users_ decide to put tables in schema spaces based upon _user_ need, > and that Postgres developer's use of them for other purposes was > incroaching on user choices, not the other way around. That's exactly what you would be doing with this proposal, encroaching on what I regard as user space. > Either way, > claiming "need" like this strikes me as stuck-in-a-rut or dogmatic > thinking. Besides, don't we have schema nesting to help resolve any such > "care?" No. We do now have schema nesting, for this or any other purpose. Where did you get that idea? If we did I would not be so resistant to using them for this purpose, but as it is, if you hijack schemas for replication segregation you will detract from their more obvious use in name segregation. cheers andrew
I wrote: > We do now have schema nesting, for this or any other purpose. s/now/not/ (of course) cheers andrew
On 2/9/2007 3:25 PM, Andrew Dunstan wrote: > Richard Troy wrote: >> On Fri, 9 Feb 2007, Andrew Dunstan wrote: >> >>> Richard Troy wrote: >>> >>>> In more specific terms, and I'm just brainstorming in public here, perhaps >>>> we can use the power of Schemas within a database to manage such >>>> divisions; commands which pertain to replication can/would include a >>>> schema specifier and elements within the schema can be replicated one way >>>> or another, at the whim of the DBA / Architect. For backwards >>>> compatability, if a schema isn't specified, it indicates that command >>>> pertains to the entire database. >>>> >>> I understand that you're just thinking aloud, but overloading namespaces >>> in this way strikes me as awful. Applications and extensions, which are >>> the things that have need of namespaces, should not have to care about >>> replication. If we have to design them for replication we'll be on a >>> fast track to nowhere IMNSHO. >>> >> >> Well, Andrew, replication _is_ an application. Or, you could think of >> replication as an extension to an application. > > No, I don't think of it as either. It's a utility, more an extension of > the DBMS than of the application. You don't replicate for the sake of > replicating. > >> I was under the impression >> that_users_ decide to put tables in schema spaces based upon _user_ need, >> and that Postgres developer's use of them for other purposes was >> incroaching on user choices, not the other way around. > > That's exactly what you would be doing with this proposal, encroaching > on what I regard as user space. I'd never use a schema for that. Look at Slony-I. It uses a user-selectable schema for the stuff it needs to store in the database, to keep Slony objects separate from user objects. But it organizes the user tables into what are called sets. A set can be any combination of tables and sequences from any number of namespaces. If I were to do it again, I would use names instead of numbers to identify sets, and I would allow for summary sets containing groups of simple sets. On the other issue, replication is part of the overall infrastructure and thereby part of the solution to a business problem. Like the hardware, database, and application, it solves a piece of the puzzle that none of the others is particularly good at. It will perform better, or eventually not at all, depending on how much consideration was given to the distributed nature of the business model when the application was designed. Applications that "happened by accident" rather than being designed usually don't allow any kind of distribution. And yes, there are people who replicate for the sake of it. They are the kind of people who ask for sync multi-master no matter what their actual problem might be, because it sounds most prestigious. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Friday 09 February 2007, Jan Wieck wrote: > I am not sure, I would have to look at what exactly that hook provides. > The key to a Lamport timestamp is that it is advancing it commit order > (plus some other things ... of course). If the hook can guarantee that > the calls are made always in commit order, serialized without any race > condition possible, it would probably be suitable. Actually what we do is a bit stronger. We use the commit hook to enforce an externally defined commit order. In our case, this is defined by a group communication protocol, which is even allowed to reorder a pair of transactions originating from the same replica. Therefore, achieving a commit order that is consistent with a local clock should be straightforward. Regards, -- Jose Orlando Pereira
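One way to picture the hook José describes is as a gate that releases commits in an externally supplied total order (an illustrative Python sketch under that assumption; the actual GORDA interface is not shown in this thread):

    import threading

    class OrderedCommitGate:
        """Release commits strictly in an externally supplied total order.

        The group communication layer (not shown) is assumed to hand every
        transaction a global sequence number.  wait_for_turn() blocks until all
        lower-numbered transactions have passed through, so whatever the commit
        hook does next -- such as taking a Lamport/commit timestamp -- happens
        in exactly that order.
        """

        def __init__(self):
            self._cond = threading.Condition()
            self._next_seq = 0

        def wait_for_turn(self, seq):
            with self._cond:
                while seq != self._next_seq:
                    self._cond.wait()

        def done(self, seq):
            with self._cond:
                self._next_seq = seq + 1
                self._cond.notify_all()

    # Inside a hypothetical commit hook:
    #   gate.wait_for_turn(seq)   # seq comes from the total-order delivery
    #   ...take the commit timestamp, write it wherever it needs to go...
    #   gate.done(seq)

Under that scheme the timestamp source never sees out-of-order callers, which is exactly the serialization guarantee Jan said the hook would need to provide.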