Thread: Proposal: Commit timestamp
For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread.

For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction. During replication, the replication process must be able to provide the remote node's timestamp so that the replicated data will be "as of the time it was written on the remote node", and not the current local time of the replica, which is by definition of "asynchronous" later.

To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.

If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.

The COMMIT syntax will get extended to

    COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];

The extension is limited to superusers and will override the normally generated commit timestamp. This will be used to give the replicating transaction on the replica the exact same timestamp it got on the originating master node.

The pg_tslog segments will be purged like the clog segments, after all transactions belonging to them have been stamped frozen. A frozen xid by definition has a timestamp of epoch. To ensure a system using this timestamp feature has enough time to perform its work, a new GUC variable defining an interval will prevent vacuum from freezing xids that are younger than that.

A function get_commit_timestamp(xid) returning timestamptz will return the commit time of a transaction as recorded by this feature.

Comments, changes, additions?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
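A minimal C sketch of the timestamp generation the proposal describes (monotonic bump by one microsecond, or millisecond truncation plus a node priority in the microsecond digits), assuming a microsecond-resolution system clock. The names next_commit_timestamp, last_commit_ts and the integer tslog_priority parameter are illustrative only, not actual PostgreSQL code:

#include <stdint.h>
#include <sys/time.h>

static int64_t last_commit_ts = 0;   /* last timestamp handed out, microseconds */

/* current system time in microseconds since the epoch */
static int64_t
current_time_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t) tv.tv_sec * 1000000 + tv.tv_usec;
}

/*
 * Return a strictly increasing commit timestamp.  With a priority
 * configured (assumed 0..999), the clock is truncated to milliseconds,
 * bumped in millisecond steps, and the priority fills the microsecond
 * digits as a cluster-wide tie breaker.
 */
int64_t
next_commit_timestamp(int tslog_priority)
{
    int64_t now = current_time_usec();

    if (tslog_priority > 0)
    {
        int64_t now_ms  = now / 1000;            /* truncate to milliseconds */
        int64_t last_ms = last_commit_ts / 1000;

        if (now_ms <= last_ms)
            now_ms = last_ms + 1;                /* bump by one millisecond */

        last_commit_ts = now_ms * 1000 + tslog_priority;
    }
    else
    {
        if (now <= last_commit_ts)
            now = last_commit_ts + 1;            /* bump by one microsecond */

        last_commit_ts = now;
    }
    return last_commit_ts;
}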
On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote: > For conflict resolution purposes in an asynchronous multimaster system, > the "last update" definition often comes into play. For this to work, > the system must provide a monotonically increasing timestamp taken at > the commit of a transaction. Do you really need an actual timestamptz derived from the system clock, or would a monotonically increasing 64-bit counter be sufficient? (The assumption that the system clock is monotonically increasing seems pretty fragile, in the presence of manual system clock changes, ntpd, etc.) > Comments, changes, additions? Would this feature have any use beyond the specific project/algorithm you have in mind? -Neil
Jan Wieck <JanWieck@Yahoo.com> writes: > To provide this data, I would like to add another "log" directory, > pg_tslog. The files in this directory will be similar to the clog, but > contain arrays of timestamptz values. Why should everybody be made to pay this overhead? > The COMMIT syntax will get extended to > COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>]; > The extension is limited to superusers and will override the normally > generated commit timestamp. This will be used to give the replicating > transaction on the replica the exact same timestamp it got on the > originating master node. I'm not convinced you've even thought this through. If you do that then you have no guarantee of commit timestamp monotonicity on the slave (if it has either multi masters or any locally generated transactions). Since this is supposedly for a multi-master system, that seems a rather fatal objection --- no node in the system will actually have commit timestamp monotonicity. What are you hoping to accomplish with this? regards, tom lane
On 1/25/2007 6:47 PM, Neil Conway wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>> For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction.
>
> Do you really need an actual timestamptz derived from the system clock, or would a monotonically increasing 64-bit counter be sufficient? (The assumption that the system clock is monotonically increasing seems pretty fragile, in the presence of manual system clock changes, ntpd, etc.)

Yes, I do need it to be a timestamp, and one assumption is that all servers in the multimaster cluster are ntp synchronized. The reason is that this is for asynchronous multimaster (in my case). Two sequences running on separate systems don't tell which was the "last update" on a timeline. This conflict resolution method alone is of course completely inadequate.

>> Comments, changes, additions?
>
> Would this feature have any use beyond the specific project/algorithm you have in mind?

The tablelog project on pgfoundry currently uses the transaction's start time but would be very delighted to have the commit time available instead.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 1/25/2007 6:47 PM, Neil Conway wrote:
>> Would this feature have any use beyond the specific project/algorithm you have in mind?
>
> The tablelog project on pgfoundry currently uses the transaction's start time but would be very delighted to have the commit time available instead.

BTW, it's not clear to me why you need a new log area for this. (We don't log transaction start time anywhere, so certainly tablelog's needs would not include it.) Commit timestamps are available from WAL commit records in a crash-and-restart scenario, so wouldn't that be enough?

regards, tom lane
On 1/25/2007 6:49 PM, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> To provide this data, I would like to add another "log" directory, >> pg_tslog. The files in this directory will be similar to the clog, but >> contain arrays of timestamptz values. > > Why should everybody be made to pay this overhead? It could be made an initdb time option. If you intend to use a product that requires this feature, you will be willing to pay that price. > >> The COMMIT syntax will get extended to >> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>]; >> The extension is limited to superusers and will override the normally >> generated commit timestamp. This will be used to give the replicating >> transaction on the replica the exact same timestamp it got on the >> originating master node. > > I'm not convinced you've even thought this through. If you do that then > you have no guarantee of commit timestamp monotonicity on the slave > (if it has either multi masters or any locally generated transactions). > Since this is supposedly for a multi-master system, that seems a rather > fatal objection --- no node in the system will actually have commit > timestamp monotonicity. What are you hoping to accomplish with this? Maybe I wasn't clear enough about this. If the commit timestamps on the local machine are guaranteed to increase at least by one millisecond (okay that limits the system to a sustained 1000 commits per second before it really seems to run ahead of time), then no two commits on the same instance will ever have the same timestamp. If furthermore each instance in a cluster has a distinct priority (the microsecond part added to the millisecond-truncated timestamp), each commit timestamp could even act as a globally unique ID. It does require that all the nodes in the cluster are configured with a distinct priority. What I hope to accomplish with this is a very easy, commit time based "last update wins" conflict resolution for data fields of the overwrite nature. The replication system I have in mind will have another field type of the balance nature, where it will never communicate the current value but only deltas that get applied regardless of the two timestamps. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 1/25/2007 7:41 PM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 1/25/2007 6:47 PM, Neil Conway wrote:
>>> Would this feature have any use beyond the specific project/algorithm you have in mind?
>
>> The tablelog project on pgfoundry currently uses the transaction's start time but would be very delighted to have the commit time available instead.
>
> BTW, it's not clear to me why you need a new log area for this. (We don't log transaction start time anywhere, so certainly tablelog's needs would not include it.) Commit timestamps are available from WAL commit records in a crash-and-restart scenario, so wouldn't that be enough?

First, I need the timestamp of the original transaction that caused the data to change, which can be a remote or a local transaction. So the timestamp currently recorded in the WAL commit record is useless, and the commit record has to be extended by one more timestamp.

Second, I don't think that an API scanning for WAL commit records by xid would be efficient enough to satisfy the needs of a timestamp based conflict resolution system, which would have to retrieve the timestamp for every row's xmin that it is about to update in order to determine if the old or the new values should be used.

Third, keeping the timestamp information in the WAL only would require keeping the WAL segments around until they are older than the admin-chosen minimum freeze age. I hope you don't want to force that penalty on everyone who intends to use multimaster replication.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Thu, 25 Jan 2007, Jan Wieck wrote: > > For a future multimaster replication system, I will need a couple of > features in the PostgreSQL server itself. I will submit separate > proposals per feature so that discussions can be kept focused on one > feature per thread. Hmm... "will need" ... Have you prototyped this system yet? ISTM you can prototype your proposal using "external" components so you can work out the kinks first. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 1/25/2007 8:42 PM, Richard Troy wrote:
> On Thu, 25 Jan 2007, Jan Wieck wrote:
>> For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread.
>
> Hmm... "will need" ... Have you prototyped this system yet? ISTM you can prototype your proposal using "external" components so you can work out the kinks first.

These details are pretty drilled down and are needed with the described functionality. And I will not make the same mistake as with Slony-I again and develop things that require backend support as totally external components (look at the catalog corruption mess I created there and you know what I'm talking about).

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 1/25/2007 6:49 PM, Tom Lane wrote: > > Jan Wieck <JanWieck@Yahoo.com> writes: > >> To provide this data, I would like to add another "log" directory, > >> pg_tslog. The files in this directory will be similar to the clog, but > >> contain arrays of timestamptz values. > > > > Why should everybody be made to pay this overhead? > > It could be made an initdb time option. If you intend to use a product > that requires this feature, you will be willing to pay that price. That is going to cut your usage by like 80%. There must be a better way. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 1/25/2007 11:41 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> On 1/25/2007 6:49 PM, Tom Lane wrote:
>>> Jan Wieck <JanWieck@Yahoo.com> writes:
>>>> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values.
>>>
>>> Why should everybody be made to pay this overhead?
>>
>> It could be made an initdb time option. If you intend to use a product that requires this feature, you will be willing to pay that price.
>
> That is going to cut your usage by like 80%. There must be a better way.

I'd love to. But it is a datum that needs to be collected at the moment where basically the clog entry is made ... I don't think any external module can do that ever.

You know how long I've been in and out and back into replication again. The one thing that pops up again and again in all the scenarios is "what the heck was the commit order?". Now the pure commit order for a single node could certainly be recorded from a sequence, but that doesn't cover the multi-node environment I am after. That's why I want it to be a timestamp with a few fudged bits at the end. If you look at what I've described, you will notice that as long as all node priorities are unique, this timestamp will be a globally unique ID in a somewhat ascending order along a timeline. That is what replication people are looking for.

Tom fears that the overhead is significant, which I do understand and, frankly, wonder myself about (actually I don't even have a vague estimate). I really think we should make this thing an initdb option and decide later if it's on or off by default. Probably we can implement it even in a way that one can turn it on/off and a postmaster restart plus waiting the desired freeze-delay would do.

What I know for certain is that no async replication system can ever do without the commit timestamp information. Using the transaction start time or even the single statements' timeofday will only lead to inconsistencies all over the place (I haven't been absent from the mailing lists for the past couple of months hiding in my closet ... I've been experimenting and trying to get around all these issues - in my closet).

Slony-I can survive without that information because everything happens on one node and we record snapshot information for later abuse. But look at the cost at which we are dealing with this rather trivial issue. All we need to know is the serializable commit order. And we have to issue queries that eventually might exceed address space limits?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

Furthermore, what would be the ramifications of master and slave system times being out of sync?

Finally, what if system time is rolled forward a few minutes as part of a correction and there were transactions completed in that time? There is a chance, albeit small, that two transactions will have the same timestamp. More importantly, this will throw all kinds of issues in when the slave sees transactions in the future. Even with regular NTP syncs, drift can cause a clock to be rolled forward a few milliseconds, possibly resulting in duplicate transaction IDs.

In summary, I don't think the use of system time has any place in PostgreSQL's internal consistency mechanisms; it is too unreliable an environment property. Why can't a counter be used for this instead?

- Naz.

Jan Wieck wrote:
> For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread.
>
> For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction. During replication, the replication process must be able to provide the remote node's timestamp so that the replicated data will be "as of the time it was written on the remote node", and not the current local time of the replica, which is by definition of "asynchronous" later.
>
> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.
>
> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>
> The COMMIT syntax will get extended to
>
> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally generated commit timestamp. This will be used to give the replicating transaction on the replica the exact same timestamp it got on the originating master node.
>
> The pg_tslog segments will be purged like the clog segments, after all transactions belonging to them have been stamped frozen. A frozen xid by definition has a timestamp of epoch. To ensure a system using this timestamp feature has enough time to perform its work, a new GUC variable defining an interval will prevent vacuum from freezing xids that are younger than that.
>
> A function get_commit_timestamp(xid) returning timestamptz will return the commit time of a transaction as recorded by this feature.
>
> Comments, changes, additions?
>
> Jan
Hi,

Jan Wieck wrote:
> The replication system I have in mind will have another field type of the balance nature, where it will never communicate the current value but only deltas that get applied regardless of the two timestamps.

I'd favor a more generally usable conflict resolution function interface, on top of which you can implement both the "last update wins" as well as the "balance" conflict resolution type.

Passing the last common ancestor and the two conflicting heads to the conflict resolution function (CRF) should be enough. That would easily allow implementing the "balance" type (as you can calculate both deltas). And if you want to rely on something as arbitrary as a timestamp, you'd simply have to add a timestamp column to your table and let the CRF decide upon that.

This would allow pretty much any type of conflict resolution, for example: higher priority cleanup transactions, which change lots of tuples and should better not be aborted later on. Those could be implemented by adding a priority column and having the CRF respect that one, too.

To find the last common ancestor tuple, transaction ids and MVCC are enough. You wouldn't need to add timestamps. You'd only have to make sure VACUUM doesn't delete tuples you still need.

Regards

Markus
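A minimal C sketch of the conflict-resolution-function interface described above, assuming the resolver is handed the last common ancestor and the two conflicting heads. The TupleVersion struct and both resolvers are illustrative only, not an existing API:

/* A tuple version as seen by a conflict resolution function (CRF). */
typedef struct TupleVersion
{
    long long balance;      /* a column of the "balance" nature */
    long long timestamp;    /* an ordinary column a CRF may inspect */
} TupleVersion;

typedef TupleVersion (*ConflictResolutionFn)(const TupleVersion *ancestor,
                                             const TupleVersion *local,
                                             const TupleVersion *remote);

/* "balance" resolution: apply both deltas relative to the common ancestor */
static TupleVersion
resolve_balance(const TupleVersion *ancestor,
                const TupleVersion *local,
                const TupleVersion *remote)
{
    TupleVersion result = *local;

    result.balance = ancestor->balance
                   + (local->balance - ancestor->balance)
                   + (remote->balance - ancestor->balance);
    return result;
}

/* "last update wins" resolution based on a user-supplied timestamp column */
static TupleVersion
resolve_last_update_wins(const TupleVersion *ancestor,
                         const TupleVersion *local,
                         const TupleVersion *remote)
{
    (void) ancestor;        /* not needed for this policy */
    return (remote->timestamp > local->timestamp) ? *remote : *local;
}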
Jan Wieck wrote:
> But it is a datum that needs to be collected at the moment where basically the clog entry is made ... I don't think any external module can do that ever.

How atomic does it need to be? External modules can register callbacks that get called right after the clog update and removing the xid from the MyProc entry. That's about as close to making the clog entry as you can get.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Jan, et al.,

On Jan 26, 2007, at 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

As you should be concerned. Looking on my desk through the last few issues of IEEE Transactions on Parallel and Distributed Systems, I see no time synch stuff for clusters of machines that is actually based on time. Almost all rely on something like a Lamport timestamp or some relaxation thereof. A few are based off a tree-based pulse. Using actual times is fraught with problems and is typically inappropriate for cluster synchronization needs.

> Furthermore, what would be the ramifications of master and slave system times being out of sync?

I'm much more concerned with the overall approach. The algorithm for replication should be published in theoretic style with a thorough analysis of its assumptions and a proof of correctness based on those assumptions. Databases and replication therein are definitely technologies that aren't "off-the-cuff," and they require rigorous academic discussion and acceptance before they will get adopted. People generally will not adopt technologies to store mission critical data until they are confident that they will both work as designed and work as implemented -- the second is far less important, as the weaknesses there are simply bugs.

I'm not implying that this rigorous dissection of replication design hasn't happened, but I didn't see it referenced anywhere in this thread. Can you point me to it? I've reviewed many of these papers and would like to better understand what you are aiming at.

Best regards,

Theo Schlossnagle
On 1/26/2007 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

Such a counter has only "local" relevance. How do you plan to compare the two separate counters on different machines to tell which transaction happened last?

Even if the system clock isn't monotonically increasing, the described increment system guarantees that the timestamp used appears so. Granted, this system will not work too well on a platform that doesn't allow slewing the system clock.

> Furthermore, what would be the ramifications of master and slave system times being out of sync?

The origin of a transaction must scan all tuples it updates and make sure that the timestamp it uses for commit appears in the future with respect to them.

> Finally, what if system time is rolled forward a few minutes as part of a correction and there were transactions completed in that time? There is a chance, albeit small, that two transactions will have the same timestamp. More importantly, this will throw all kinds of issues in when the slave sees transactions in the future. Even with regular NTP syncs, drift can cause a clock to be rolled forward a few milliseconds, possibly resulting in duplicate transaction IDs.
>
> In summary, I don't think the use of system time has any place in PostgreSQL's internal consistency mechanisms; it is too unreliable an environment property. Why can't a counter be used for this instead?

This is not something used for PostgreSQL's own consistency. It is a vehicle intended to be used to synchronize the "last update wins" decision process of an asynchronous multimaster system. If not with a timestamp, how would you make sure that the replication processes of two different nodes will come to the same conclusion as to which update was last? Especially considering that the replication might take place hours after the original transaction happened.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.

A transaction time table/log has other uses as well, so it's fairly interesting to have this.

> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally generated commit timestamp.

I don't think it's acceptable to override the normal timestamp. That could lead to non-monotonic time values which could screw up PITR. My view is that you still need PITR even when you are using replication, because the former provides recoverability and the latter provides availability.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On 1/26/2007 8:26 AM, Simon Riggs wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.
>
> A transaction time table/log has other uses as well, so it's fairly interesting to have this.
>
>> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>>
>> The extension is limited to superusers and will override the normally generated commit timestamp.
>
> I don't think it's acceptable to override the normal timestamp. That could lead to non-monotonic time values which could screw up PITR. My view is that you still need PITR even when you are using replication, because the former provides recoverability and the latter provides availability.

Without that it is rendered useless for conflict resolution purposes. The timestamp used does not necessarily have much to do with the real time at commit, although I'd like it to be as close as possible. This timestamp marks the age of the new datum in an update. Since the replication is asynchronous, the update on the remote systems will happen later, but the timestamp recorded with that datum must be the timestamp of the original transaction, not the current time when it is replicated remotely. All we have to determine that is the xmin in the row's tuple header, so that xmin must resolve to the original transaction's timestamp.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
* Jan Wieck (JanWieck@Yahoo.com) wrote: > On 1/26/2007 2:37 AM, Naz Gassiep wrote: > >I would be *very* concerned that system time is not a guaranteed > >monotonic entity. Surely a counter or other internally managed mechanism > >would be a better solution. > > Such a counter has only "local" relevance. How do you plan to compare > the two separate counters on different machines to tell which > transaction happened last? I'd also suggest you look into Lamport timestamps... Trusting the system clock just isn't practical, even with NTP. I've developed (albeit relatively small) systems using Lamport timestamps and would be happy to talk about it offlist. I've probably got some code I could share as well. Thanks, Stephen
Stephen Frost wrote: > I'd also suggest you look into Lamport timestamps... Trusting the > system clock just isn't practical, even with NTP. I've developed > (albeit relatively small) systems using Lamport timestamps and would be > happy to talk about it offlist. I've probably got some code I could > share as well. > that looks like what Oracle RAC uses: http://www.lc.leidenuniv.nl/awcourse/oracle/rac.920/a96597/coord.htm cheers andrew
On 1/26/2007 9:38 AM, Stephen Frost wrote:
> * Jan Wieck (JanWieck@Yahoo.com) wrote:
>> On 1/26/2007 2:37 AM, Naz Gassiep wrote:
>>> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.
>>
>> Such a counter has only "local" relevance. How do you plan to compare the two separate counters on different machines to tell which transaction happened last?
>
> I'd also suggest you look into Lamport timestamps... Trusting the system clock just isn't practical, even with NTP. I've developed (albeit relatively small) systems using Lamport timestamps and would be happy to talk about it offlist. I've probably got some code I could share as well.

I think the system I described is a slightly modified Lamport generator. You can consider the maximum timestamp of any row updated in this transaction as the "counters received from other nodes". Then I make sure that the next counter (timestamp) is higher than anything I know so far, and I add a cluster-wide unique tie breaker to that.

Looking closer, I don't even have to check the timestamps of the rows updated. Since a replicated remote transaction will bump the local Lamport clock on commit, a local transaction modifying such a row will have a timestamp in the future of that remote transaction, even if my local clock is limping behind.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> I think the system I described is a slightly modified Lamport generator. You can consider the maximum timestamp of any row updated in this transaction as the "counters received from other nodes". Then I make sure that the next counter (timestamp) is higher than anything I know so far, and I add a cluster-wide unique tie breaker to that.

If you know all the timestamps in the system then you don't need timestamps at all, just use a counter that you increment by one each time.

Isn't the whole reason people use timestamps that you don't have to depend on atomically knowing every timestamp in the system? So two transactions can commit simultaneously on different systems and use the timestamps to resolve conflicts later.

--
  Gregory Stark
  EnterpriseDB   http://www.enterprisedb.com
On 1/27/2007 7:26 AM, Gregory Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> I think the system I described is a slightly modified Lamport generator. You can consider the maximum timestamp of any row updated in this transaction as the "counters received from other nodes". Then I make sure that the next counter (timestamp) is higher than anything I know so far, and I add a cluster-wide unique tie breaker to that.
>
> If you know all the timestamps in the system then you don't need timestamps at all, just use a counter that you increment by one each time.
>
> Isn't the whole reason people use timestamps that you don't have to depend on atomically knowing every timestamp in the system? So two transactions can commit simultaneously on different systems and use the timestamps to resolve conflicts later.

This assumes that you never lose contact with the cluster, or if you do, that you instantly stop all update activity because you are at risk that the counters diverge. This risk is much higher with a simple counter than with a system clock that was in sync at the time of disconnect.

With all the disadvantages and the pain factor of an asynchronous multimaster replication system comes one big advantage. You can continue autonomously and let conflict resolution figure it out later.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.

Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?

Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.

How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?

--
Jim Nasby                   jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
On 2/1/2007 11:23 PM, Jim Nasby wrote:
> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>
> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>
> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>
> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?

Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>
>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>
>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>
>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>
> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.

Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.

How is your counter semantically equivalent to Lamport timestamps?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>>
>>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>>
>>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>>
>>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>>
>> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.
>
> Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.
>
> How is your counter semantically equivalent to Lamport timestamps?

Yes, you must be missing something.

The last used timestamp is remembered. When a remote transaction is replicated, the remembered timestamp is set to max(remembered, remote). For a local transaction, the remembered timestamp is set to max(remembered+1ms, systemclock) and that value is used as the transaction commit timestamp.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
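A minimal C sketch of the clock rule described above, with all values in milliseconds: max(remembered, remote) when replicating a remote commit, max(remembered+1ms, system clock) when committing locally. The names remembered_ts, observe_remote_commit and next_local_commit_ts are illustrative only; this is a sketch of the idea, not PostgreSQL code:

#include <stdint.h>
#include <sys/time.h>

static int64_t remembered_ts = 0;   /* per-node "last used" timestamp, ms */

/* current system time in milliseconds since the epoch */
static int64_t
system_clock_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t) tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

/* Replicating a remote transaction: only advance the local clock. */
void
observe_remote_commit(int64_t remote_ts)
{
    if (remote_ts > remembered_ts)
        remembered_ts = remote_ts;
}

/* Committing a local transaction: advance past everything seen so far. */
int64_t
next_local_commit_ts(void)
{
    int64_t candidate = remembered_ts + 1;   /* at least one tick ahead */
    int64_t now = system_clock_ms();

    remembered_ts = (now > candidate) ? now : candidate;
    return remembered_ts;
}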
On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:
> On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
>> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>>>
>>>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>>>
>>>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>>>
>>>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>>>
>>> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.
>>
>> Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.
>>
>> How is your counter semantically equivalent to Lamport timestamps?
>
> Yes, you must be missing something.
>
> The last used timestamp is remembered. When a remote transaction is replicated, the remembered timestamp is set to max(remembered, remote). For a local transaction, the remembered timestamp is set to max(remembered+1ms, systemclock) and that value is used as the transaction commit timestamp.

A Lamport clock, IIRC, requires a cluster-wide tick. This seems based only on activity and is thus an observational tick only, which means various nodes can have various perspectives at different times.

Given that time skew is prevalent, why is the system clock involved at all?

As is usual with distributed systems problems, they are very hard to explain casually and also hard to review from a theoretical angle without a proof. Are you basing this off a paper? If so, which one? If not, have you written a rigorous proof of correctness for this approach?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:
> On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:
>> On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
>>> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>>>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>>>> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>>>>>
>>>>> Wouldn't it be better to just store that information separately, rather than mucking with the timestamp?
>>>>>
>>>>> Though, there's another issue here... I don't think NTP is good for any better than a few milliseconds, even on a local network.
>>>>>
>>>>> How exact does the conflict resolution need to be, anyway? Would it really be a problem if transaction B committed 0.1 seconds after transaction A yet the cluster thought it was the other way around?
>>>>
>>>> Since the timestamp is basically a Lamport counter which is just bumped by the clock as well, it doesn't need to be too precise.
>>>
>>> Unless I'm missing something, you are _treating_ the counter as a Lamport timestamp, when in fact it is not and thus does not provide the semantics of a Lamport timestamp. As such, any algorithms that use Lamport timestamps as a basis or assumption for the proof of their correctness will not translate (provably) to this system.
>>>
>>> How is your counter semantically equivalent to Lamport timestamps?
>>
>> Yes, you must be missing something.
>>
>> The last used timestamp is remembered. When a remote transaction is replicated, the remembered timestamp is set to max(remembered, remote). For a local transaction, the remembered timestamp is set to max(remembered+1ms, systemclock) and that value is used as the transaction commit timestamp.
>
> A Lamport clock, IIRC, requires a cluster-wide tick. This seems based only on activity and is thus an observational tick only, which means various nodes can have various perspectives at different times.
>
> Given that time skew is prevalent, why is the system clock involved at all?

This question was already answered.

> As is usual with distributed systems problems, they are very hard to explain casually and also hard to review from a theoretical angle without a proof. Are you basing this off a paper? If so, which one? If not, have you written a rigorous proof of correctness for this approach?

I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.

I thank everyone for their attention.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Feb 3, 2007, at 5:09 PM, Jan Wieck wrote:
> On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:
>
> I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.
>
> I thank everyone for their attention.

Actually, I believe the commit timestamp stuff would be very useful in general. I would certainly like to see rigorous proofs of any multi-master replication technology built on top of them. I believe that while your replication stuff might rely on the commit timestamps, the commit timestamps rely on nothing else (except the work that you have been churning on).

Using commit timestamps, one can easily implement cross-vendor database replication. These can be used to implement something like trigger-selective redo logs. I think they can be used to produce DML logs that will require a lot less accounting to manage replicating tables from PostgreSQL into another database (like Oracle or MySQL).

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Jan Wieck wrote:
> I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.
>
> I thank everyone for their attention.

Going and working on it on your own doesn't seem like the proper solution. I don't see people objecting to adding it, but they want it to work, which I am sure you want too. You have to show how it will work and convince others of that, and then you have a higher chance it will work and be in the PostgreSQL codebase.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On 2/3/2007 5:20 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> I don't have any such paper, and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module.
>>
>> I thank everyone for their attention.
>
> Going and working on it on your own doesn't seem like the proper solution. I don't see people objecting to adding it, but they want it to work, which I am sure you want too. You have to show how it will work and convince others of that, and then you have a higher chance it will work and be in the PostgreSQL codebase.

Bruce, I think I have explained in sufficient detail how this Lamport timestamp will be unique and ever increasing, with the node's ID being used as a tie breaker.

The only thing important for "last update wins" conflict resolution is that whatever timestamp you have associated with a row, the update you do to it must be associated with a later timestamp so that all other nodes will overwrite the data. If a third node gets the two updates out of order, it will do the second node's update, and since the row it has then has a later timestamp than the first update arriving late, it will throw away that information. All nodes in sync again.

This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp. A sketch of the decision rule follows below.

The problem seems to me that people want a full-scale proof of concept for the whole multimaster replication system I'm planning instead of thinking in isolation about this one aspect, the intended use case and other possible uses for it (like table logging). And we all know that that discussion will take us way beyond the 8.3 feature freeze date, so the whole thing will never get done.

I don't want to work on this on my own and I sure would prefer it to be a default PostgreSQL feature. As said, I have learned some things from Slony-I. One of them is that I will not go through any more ugly workarounds in order to not require a patched backend. If the features I really need aren't going to be in the default codebase, people will have to install from patched sources.

Finally, again, Slony-I could have well used this feature. With a logical commit timestamp, I would have never even thought about that other wart called xxid. It would have all been sooo much easier.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
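A minimal C sketch of the "last update wins" decision described above, assuming each row carries the commit timestamp and origin-node priority of the transaction that last wrote it. The RowStamp struct and function name are illustrative only; with unique per-node priorities folded into the timestamp, equal timestamps should never actually occur:

#include <stdbool.h>
#include <stdint.h>

typedef struct RowStamp
{
    int64_t commit_ts;      /* commit timestamp of the writing transaction */
    int     origin_node;    /* node priority, used as tie breaker */
} RowStamp;

/* Should an incoming replicated update overwrite the row we already have? */
static bool
incoming_update_wins(const RowStamp *current, const RowStamp *incoming)
{
    if (incoming->commit_ts != current->commit_ts)
        return incoming->commit_ts > current->commit_ts;

    /* identical timestamps cannot happen if priorities are unique,
     * but compare the node id anyway for safety */
    return incoming->origin_node > current->origin_node;
}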
Jan Wieck wrote:
> This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp.

Earlier you said that "one assumption is that all servers in the multimaster cluster are ntp synchronized", which already rang the alarm bells in me. Now that I read this, you appear to require synchronization not on the microsecond level but on some level. I think that would be pretty hard to manage for an administrator, seeing that NTP typically cannot provide such guarantees.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
> Jan Wieck wrote:
>> This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp.
>
> Earlier you said that "one assumption is that all servers in the multimaster cluster are ntp synchronized", which already rang the alarm bells in me. Now that I read this, you appear to require synchronization not on the microsecond level but on some level. I think that would be pretty hard to manage for an administrator, seeing that NTP typically cannot provide such guarantees.

Synchronization to some degree is wanted to avoid totally unexpected behavior. The conflict resolution algorithm itself can live perfectly fine with counters, but I guess you wouldn't want the result of it. Say you update a record on one node, then 10 minutes later you update the same record on another node. Unfortunately, the nodes had no communication, and because the first node is much busier, its counter is way advanced ... this would mean the update made 10 minutes later would get lost in the conflict resolution when the nodes reestablish communication. They would have the same data at the end, just not what any sane person would expect.

This behavior will kick in whenever the cross-node conflicting updates happen close enough together that the time difference between the clocks can affect it. So if you update the logically same row on two nodes within a tenth of a second, and the clocks are more than that apart, the conflict resolution can result in the older row surviving. Clock synchronization is simply used to minimize this.

The system clock is used only to keep the counters somewhat synchronized in the case of connection loss, to retain some degree of "last update" meaning. Without that, continuing autonomously during a network outage is just not practical.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Feb 4, 2007, at 10:06 AM, Jan Wieck wrote:
> On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
>> Jan Wieck wrote:
>>> This is all that is needed for last update wins resolution. And as said before, the only reason the clock is involved in this is so that nodes can continue autonomously when they lose connection without conflict resolution going crazy later on, which it would do if they were simple counters. It doesn't require microsecond-synchronized clocks, and the system clock isn't just used as a Lamport timestamp.
>>
>> Earlier you said that "one assumption is that all servers in the multimaster cluster are ntp synchronized", which already rang the alarm bells in me. Now that I read this, you appear to require synchronization not on the microsecond level but on some level. I think that would be pretty hard to manage for an administrator, seeing that NTP typically cannot provide such guarantees.
>
> Synchronization to some degree is wanted to avoid totally unexpected behavior. The conflict resolution algorithm itself can live perfectly fine with counters, but I guess you wouldn't want the result of it. Say you update a record on one node, then 10 minutes later you update the same record on another node. Unfortunately, the nodes had no communication, and because the first node is much busier, its counter is way advanced ... this would mean the update made 10 minutes later would get lost in the conflict resolution when the nodes reestablish communication. They would have the same data at the end, just not what any sane person would expect.
>
> This behavior will kick in whenever the cross-node conflicting updates happen close enough together that the time difference between the clocks can affect it. So if you update the logically same row on two nodes within a tenth of a second, and the clocks are more than that apart, the conflict resolution can result in the older row surviving. Clock synchronization is simply used to minimize this.
>
> The system clock is used only to keep the counters somewhat synchronized in the case of connection loss, to retain some degree of "last update" meaning. Without that, continuing autonomously during a network outage is just not practical.

A Lamport clock addresses this. It relies on a cluster-wide clock tick. While it could be based on the system clock, it would not be based on more than one clock. The point of the Lamport clock is that there is _a_ clock, not multiple ones.

One concept is to have a universal clock that ticks forward (like every second) and each node orders all their transactions inside the second-granular tick. Then each commit would be like {node, clocksecond, txn#}, and each time the clock ticks forward, txn# is reset to zero. This gives you ordered txns windowed in some cluster-wide acceptable window (1 second). However, this is totally broken as NTP is entirely insufficient for this purpose because of a variety of forms of clock skew. As such, the timestamp should be incremented via cluster consensus (a token ring, or the pulse generated by the leader of the current cluster membership quorum).

As the clock must be incremented cluster-wide, the need for it to be in sync with the system clock (on any or all of the systems) is obviated. In fact, the fact that you can't guarantee synchronicity means that it can be confusing -- one expects a time-based clock to be accurate to the time. A counter-based clock has no such expectations.
// Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
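The {node, clocksecond, txn#} ordering Theo describes might look roughly like the following Python sketch. The class and method names are invented, and the cluster-wide tick is assumed to arrive from some consensus mechanism such as the quorum leader's pulse; this illustrates the idea only and is not code from any actual system.

# Sketch of per-node commit stamping against a cluster-wide tick.
class NodeClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clocksecond = 0   # last cluster-wide tick seen
        self.txn_no = 0        # resets whenever the tick advances

    def on_cluster_tick(self, tick):
        # delivered cluster-wide, e.g. by the quorum leader
        if tick > self.clocksecond:
            self.clocksecond = tick
            self.txn_no = 0

    def stamp_commit(self):
        self.txn_no += 1
        return (self.clocksecond, self.txn_no, self.node_id)

a, b = NodeClock("A"), NodeClock("B")
a.on_cluster_tick(100); b.on_cluster_tick(100)
commits = [a.stamp_commit(), b.stamp_commit(), a.stamp_commit()]
print(sorted(commits))   # cluster-wide order, windowed to the shared tick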
"Theo Schlossnagle" <jesus@omniti.com> writes: > As the clock must be incremented clusterwide, the need for it to be insync with > the system clock (on any or all of the systems) is obviated. In fact, as you > can't guarantee the synchronicity means that it can be confusing -- one > expects a time-based clock to be accurate to the time. A counter-based clock > has no such expectations. So if the nodes get split they can keep operating independently but clients can see that there's no guarantee of ordering against transactions from other nodes because the clock isn't advancing? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 2/4/2007 10:53 AM, Theo Schlossnagle wrote: > As the clock must be incremented clusterwide, the need for it to be > insync with the system clock (on any or all of the systems) is > obviated. In fact, as you can't guarantee the synchronicity means > that it can be confusing -- one expects a time-based clock to be > accurate to the time. A counter-based clock has no such expectations. For the fourth time, the clock is in the mix to allow nodes to continue during a network outage. All your arguments seem to assume 100% network uptime. There will be no clusterwide clock or clusterwide increment when you lose connection. How does your idea cope with that? Obviously the counters will immediately drift apart based on the transaction load of the nodes as soon as the network goes down. And in order to avoid this "clock" confusion and wrong expectation, you'd rather have a system with such a simple, non-clock based counter and accept that it starts behaving totally wonky when the cluster reconnects after a network outage? I'd rather confuse a few people than have a last-update-wins conflict resolution that basically rolls dice to determine "last". Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote: > On 2/4/2007 10:53 AM, Theo Schlossnagle wrote: >> As the clock must be incremented clusterwide, the need for it to >> be insync with the system clock (on any or all of the systems) >> is obviated. In fact, as you can't guarantee the synchronicity >> means that it can be confusing -- one expects a time-based clock >> to be accurate to the time. A counter-based clock has no such >> expectations. > > For the fourth time, the clock is in the mix to allow to continue > during a network outage. All your arguments seem to assume 100% > network uptime. There will be no clusterwide clock or clusterwide > increment when you lose connection. How does your idea cope with that? That's exactly what a quorum algorithm is for. > Obviously the counters will immediately drift apart based on the > transaction load of the nodes as soon as the network goes down. And > in order to avoid this "clock" confusion and wrong expectation, > you'd rather have a system with such a simple, non-clock based > counter and accept that it starts behaving totally wonky when the > cluster reconnects after a network outage? I rather confuse a few > people than having a last update wins conflict resolution that > basically rolls dice to determine "last". If your cluster partitions and you have hours of independent action, and upon merge you apply a conflict resolution algorithm that undoes large portions of the last several hours of work on the nodes, you wouldn't call that "wonky"? For sane disconnected (or more generally, partitioned) operation in multi-master environments, a quorum for the dataset must be established. Now, one can consider the "database" to be the dataset. So, on network partitions those in "the" quorum are allowed to progress with data modification and others only read. However, there is no reason why the dataset _must_ be the database, or why multiple datasets _must_ share the same quorum algorithm. You could easily classify certain tables, schemas, or partitions into a specific dataset and apply a suitable quorum algorithm to that and a different quorum algorithm to other disjoint data sets. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
> One concept is to have a univeral clock that ticks forward (like > every second) and each node orders all their transactions inside the > second-granular tick. Then each commit would be like: {node, > clocksecond, txn#} and each time the clock ticks forward, txn# is > reset to zero. This gives you ordered txns that windowed in some > cluster-wide acceptable window (1 second). However, this is totally > broken as NTP is entirely insufficient for this purpose because of a > variety of forms of clock skew. As such, the timestamp should be > incremented via cluster consensus (one token ring or the pulse > generated by the leader of the current cluster membership quorom). I think you are completely ignoring practicability. Or are you saying that such a system exists and works for, e.g., a loosely connected group of laptop field agents that only sporadically have a connection to the cluster? I think Jan's definition gives a pragmatic solution to the problem, and will be able to give "good" automatic conflict resolution. It has the downsides he stated, and cannot guarantee 100% correct automatic conflict resolution in case of connection loss, but I am quite sure you are not able to do better without losing yourself in theory. E.g., assume all clocks vary by no more than 30 seconds when disconnected; you can then require manual (or rule-based) resolution for all conflicts whose timestamps differ by less than 1 minute. Andreas
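A rough rendering of Andreas's rule in Python: the 30-second drift bound and 1-minute margin are his example numbers, and everything else (function and variable names, the sample timestamps) is invented for illustration. Conflicts whose commit times lie closer together than the margin are escalated rather than decided by the clock.

from datetime import datetime, timedelta

SKEW_MARGIN = timedelta(minutes=1)   # assumed: clocks drift < 30 s while disconnected

def resolve(local_commit, remote_commit):
    gap = abs(local_commit - remote_commit)
    if gap < SKEW_MARGIN:
        return "manual-or-rule-based"   # too close to trust the clocks
    return "remote-wins" if remote_commit > local_commit else "local-wins"

print(resolve(datetime(2007, 2, 4, 13, 50, 0), datetime(2007, 2, 4, 14, 0, 0)))   # remote-wins
print(resolve(datetime(2007, 2, 4, 13, 50, 0), datetime(2007, 2, 4, 13, 50, 20))) # manual-or-rule-based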
On Sun, Feb 04, 2007 at 01:36:03PM -0500, Jan Wieck wrote: > For the fourth time, the clock is in the mix to allow to continue during > a network outage. All your arguments seem to assume 100% network uptime. > There will be no clusterwide clock or clusterwide increment when you > lose connection. How does your idea cope with that? I'm wondering whether a combined approach is needed. This makes things more complicated, but what if you somehow co-ordinate local counters with shared clock ticks? When you get a failure on your talk to the shared clock, you regard yourself as in some sort of failure (you're going to need softfails and that sort of thing, and yes, I'm flapping my hands in the air at the moment). At rejoin to the cluster, you need some sort of way to publish "here's the counter and the last global time I had" and "here's my current counter". You can publish local time with this too, I guess, to solve for conflict cases, but that seems like the sort of decision that needs to be pushed down to policy level. A -- Andrew Sullivan | ajs@crankycanuck.ca The whole tendency of modern prose is away from concreteness. --George Orwell
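Purely to illustrate the rejoin handshake Andrew hand-waves at, a node might publish something like the following on reconnect. This is a Python sketch with every field name invented; it reflects no actual protocol or implementation.

from dataclasses import dataclass

@dataclass
class RejoinReport:
    node_id: str
    last_shared_tick: int        # "the last global time I had"
    counter_at_disconnect: int   # "here's the counter" at that point
    current_counter: int         # "here's my current counter"
    local_time_now: str          # optional, left to policy for conflict cases

print(RejoinReport("node-b", last_shared_tick=4711,
                   counter_at_disconnect=120, current_counter=187,
                   local_time_now="2007-02-04T14:00:05"))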
Hi, Theo Schlossnagle wrote: > On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote: >> Obviously the counters will immediately drift apart based on the >> transaction load of the nodes as soon as the network goes down. And in >> order to avoid this "clock" confusion and wrong expectation, you'd >> rather have a system with such a simple, non-clock based counter and >> accept that it starts behaving totally wonky when the cluster >> reconnects after a network outage? I rather confuse a few people than >> having a last update wins conflict resolution that basically rolls >> dice to determine "last". > > If your cluster partition and you have hours of independent action and > upon merge you apply a conflict resolution algorithm that has enormous > effect undoing portions of the last several hours of work on the nodes, > you wouldn't call that "wonky?" You are talking about different things. Async replication, as Jan is planning to do, is per se "wonky", because you have to cope with conflicts by definition. And you have to resolve them by late-aborting a transaction (i.e. after a commit). Or put it another way: async MM replication means continuing in disconnected mode (w/o quorum or some such) and trying to reconciliate later on. It should not matter if the delay is just some milliseconds of network latency or three days (except of course that you probably have more data to reconciliate). > For sane disconnected (or more generally, partitioned) operation in > multi-master environments, a quorum for the dataset must be > established. Now, one can consider the "database" to be the dataset. > So, on network partitions those in "the" quorum are allowed to progress > with data modification and others only read. You can do this to *prevent* conflicts, but that clearly belongs to the world of sync replication. I'm doing this in Postgres-R: in case of network partitioning, only a primary partition may continue to process writing transactions. For async replication, it does not make sense to prevent conflicts when disconnected. Async is meant to cope with conflicts. So as to be independent of network latency. > However, there is no > reason why the dataset _must_ be the database and that multiple datasets > _must_ share the same quorum algorithm. You could easily classify > certain tables or schema or partitions into a specific dataset and apply > a suitable quorum algorithm to that and a different quorum algorithm to > other disjoint data sets. I call that partitioning (among nodes). And it's applicable to sync as well as async replication, while it makes more sense in sync replication. What I'm more concerned about, with Jan's proposal, is the assumption that you always want to resolve conflicts by time (except for balances, for which we don't have much information, yet). I'd rather say that time does not matter much if your nodes are disconnected. And (especially in async replication) you should prevent your clients from committing to one node and then reading from another, expecting to find your data there. So why resolve by time? It only makes the user think you could guarantee that order, but you certainly cannot. Regards Markus
> What I'm more concerned about, with Jan's proposal, is the assumption > that you always want to resolve conflicts by time (except for > balances, > for which we don't have much information, yet). I'd rather Um, I think the proposal was only about backend functionality that benefits replication in general, plus time-based conflict resolution. And "time based" is surely one of the important conflict resolution methods for async MM replication. Sure, there are others, like "rule based" or "priority based", but I think you don't need additional backend functionality for those. Andreas
Hi, Zeugswetter Andreas ADI SD wrote: > And "time based" > is surely one of the important conflict resolution methods for async MM > replication. That's what I'm questioning. Wouldn't any other deterministic, but seemingly random abort decision be as clever as time based conflict resolution? It would then be clear to the user that it's random and not some "in most cases time based, but not in others and only if..." thing. > Sure there are others, like "rule based" "priority based" but I think > you don't need additional backend functionality for those. Got the point, yes. I'm impatient, sorry. Nevertheless, I'm questioning whether it is worth adding backend functionality for that. And given this probably is the most wanted resolution method, this question might be "heretical". You could also see it as sort of a user education question: don't favor time-based resolution if that's the one resolution method with the most traps. Regards Markus
On 2/6/2007 11:44 AM, Markus Schiltknecht wrote: > Hi, > > Zeugswetter Andreas ADI SD wrote: >> And "time based" >> is surely one of the important conflict resolution methods for async MM >> replication. > > That's what I'm questioning. Wouldn't any other deterministic, but > seemingly random abort decision be as clever as time based conflict > resolution? It would then be clear to the user that it's random and not > some "in most cases time based, but no in others and only if..." thing. > >> Sure there are others, like "rule based" "priority based" but I think >> you don't need additional backend functionality for those. > > Got the point, yes. I'm impatient, sorry. > > Neither the less, I'm questioning if is it worth adding backend > functionality for that. And given this probably is the most wanted > resolution method, this question might be "heretical". You could also > see it as sort of an user educating question: don't favor time based > resolution if that's the one resolution method with the most traps. These are all very good suggestions for additional conflict resolution mechanisms that each solve one problem or another. As we have said for years now, one size will not fit all. What I am after for the moment is a system that by default supports last update wins on the row level, where "last update" certainly is a little fuzzy, but not by minutes. Plus balance type columns. A balance column is not propagated as a new value, but as a delta between the old and the new value. All replicas will apply the delta to that column regardless of whether the replication info is newer or older than the existing row. That way, literal value type columns (like an address) will maintain cluster-wide the value of the last update to the row, while balance type columns will maintain cluster-wide the sum of all changes. Whatever strategy one uses, in an async multimaster there are always cases that can be resolved by rules (last update being one of them), and some that I can't even imagine solving so far. I guess some of the cases will simply boil down to "the application has to make sure that ... never occurs". Think of a multi-item order, created on one node, while another node is deleting the long unused item (which would have to be backordered). Now while those two nodes figure out what to do to make this consistent again, a third node does a partial shipment of that order. The solution is simple: reinsert the deleted item ... only that there were rather nasty ON DELETE CASCADE's on that item that removed all the consumer reviews, product descriptions, data sheets and what not. It's going to be an awful lot of undo. I haven't really made up my mind about a user defined rule based conflict resolution interface yet. I do plan to have a unique and foreign key constraint based, synchronous advisory locking system on top of my system in a later version (advisory key locks would stay in place until the transaction that placed them replicates). I guess you see by now why I wanted to keep the discussion about the individual, rather generic support features in the backend separate from the particular features I plan to implement in the replication system. Everyone has different needs and consequently an async multi-master "must" do a whole range of mutually exclusive things altogether ... because Postgres can never accept a partial solution. We want the egg laying milk-wool-pig or nothing.
Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
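The row merge Jan describes (last update wins for literal columns, unconditional delta application for balance columns) could look roughly like this Python sketch; the column names, the change format, and the timestamps are all invented for illustration and are not taken from the planned system.

# Literal columns follow last-update-wins; balance columns are replicated
# as deltas and always applied, so every node converges on the sum of all changes.
BALANCE_COLUMNS = {"balance"}

def apply_replicated_update(local_row, local_commit_ts, change):
    # change = {"commit_ts": ..., "literal": {...}, "deltas": {...}}
    for col, delta in change["deltas"].items():
        local_row[col] += delta                    # applied regardless of ordering
    if change["commit_ts"] > local_commit_ts:      # literal values only win if newer
        local_row.update(change["literal"])
        local_commit_ts = change["commit_ts"]
    return local_row, local_commit_ts

row = {"address": "Old Street 1", "balance": 100.0}
row, ts = apply_replicated_update(
    row, "2007-02-06 12:00:00",
    {"commit_ts": "2007-02-06 11:59:00",           # older update from a peer
     "literal": {"address": "New Street 2"},
     "deltas": {"balance": -25.0}})
print(row)   # address keeps the newer local value, balance applies the delta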
Something worth noting... in the only places I've actually seen MM replication implemented, each master was in fact still responsible for its own set of data. It was essentially something that you could really do with Slony, if you could tolerate the extreme complexity that would be involved. It might well be worth focusing on that case first, before trying to come up with a perfect last-committed mechanism. On Feb 5, 2007, at 5:20 AM, Zeugswetter Andreas ADI SD wrote: > I think you are completely ignoring practicability. Or are you saying, > that such a system exists and works for e.g. a loosly connected > group of > laptop field agents that only sporadically have a connection to the > cluster. > > I think Jan's definition gives a pragmatic solution to the problem, > and will be able to give "good" automatic conflict resolution. > > It has downsides he stated, and cannot guarantee 100% correct > automatic > conflict > resolution in case of connection loss, but I am quite sure you are not > able to do > better, without loosing yourself in theory. > > e.g. assume all clocks vary by no more than 30 seconds when > disconnected, you can > require manual (or rule based) resolution to all conflicts that > vary by > less than > 1 minute. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Hi, Jan Wieck wrote: > Whatever strategy one will use, in an async multimaster there are always > cases that can be resolved by rules (last update being one of them), and > some that I can't even imagine solving so far. I guess some of the cases > will simply boil down to "the application has to make sure that ... > never occurs". Think of a multi-item order, created on one node, while > another node is deleting the long unused item (which would have to be > backordered). Now while those two nodes figure out what to do to make > this consistent again, a third node does a partial shipment of that > order. It helps to categorize these conflict types. There basically are: * data conflicts: simple row data, i.e. update - update conflicts. * uniqueness conflicts: two rows conflict because they'd violate a uniqueness constraint, i.e. insert - insert, update - insert, or update - update. * visibility conflicts: basically the remaining update - delete and delete - delete cases. But also SELECT FOR UPDATE candidates, etc... Everything having to do with rows not yet or no longer being visible to a transaction. Your example certainly involves a visibility conflict (update - delete). Not even (sync) Postgres-R can guarantee consistency on the visibility level, i.e. a first transaction's SELECT FOR UPDATE might not see some just recently committed transactions' newly inserted rows (because that one isn't replayed yet on the node, thus the transaction is working on an 'old' snapshot of the database state). Another simpler example: Postgres-R doesn't raise a serialization error on delete-delete conflicts, it simply deletes the row once, even though two committed transactions were each confirmed to have deleted the row. Luckily, most applications don't need that anyway, though. > The solution is simple, reinsert the deleted item ... ...at which point timestamps certainly won't help :-) Sorry, couldn't resist... > only that > there were rather nasty ON DELETE CASCADE's on that item that removed > all the consumer reviews, product descriptions, data sheets and what > not. It's going to be an awful lot of undo. Huh? Are you planning on aborting *parts* of a transaction? I didn't think about that, but my gut feeling is that you don't want to do that. > I haven't really made up my mind about a user defined rule based > conflict resolution interface yet. I do plan to have a unique and > foreign key constraint based, synchronous advisory locking system on top > of my system in a later version (advisory key locks would stay in place > until the transaction, that placed them, replicates). You'd have to elaborate on that... > I guess you see by now why I wanted to keep the discussion about the > individual, rather generic support features in the backend separate from > the particular features I plan to implement in the replication system. Sure. I know, discussions about replication can get endless, probably even are so by definition ;-) But hey, they're fun!
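Markus's taxonomy, rendered as a small Python lookup purely for illustration: the three categories are his, while the code and the operation names are invented and come from no implementation.

# Map pairs of conflicting operations (sorted alphabetically) to the three
# categories: data, uniqueness, visibility.
CONFLICT_CATEGORY = {
    ("update", "update"): "data (or uniqueness, if a unique key changes)",
    ("insert", "insert"): "uniqueness",
    ("insert", "update"): "uniqueness",
    ("delete", "update"): "visibility",
    ("delete", "delete"): "visibility",
    ("delete", "select_for_update"): "visibility",
}

def classify(op_a, op_b):
    return CONFLICT_CATEGORY.get(tuple(sorted((op_a, op_b))), "unknown")

print(classify("update", "delete"))   # visibility -- Jan's order/item example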
On Saturday 03 February 2007, Bruce Momjian wrote: > Jan Wieck wrote: > > I don't have any such paper and the proof of concept will be the > > implementation of the system. I do however see enough resistance against > > this proposal to withdraw the commit timestamp at this time. The new > > replication system will therefore require the installation of a patched, > > non-standard PostgreSQL version, compiled from sources cluster wide in > > order to be used. I am aware that this will dramatically reduce it's > > popularity but it is impossible to develop this essential feature as an > > external module. > > > > I thank everyone for their attention. > > Going and working on it on your own doesn't seem like the proper > solution. I don't see people objecting to adding it, but they want it > work, which I am sure you want too. You have to show how it will work > and convince others of that, and then you have a higher chance it will > work, and be in the PostgreSQL codebase. Hi, Would it be possible to solve the problem using the GORDA on-commit hook? Jan would be able to reliably obtain a commit timestamp with the desired semantics and store it in a regular table within transaction boundaries. PostgreSQL would not have to commit to specific timestamp semantics, and the patch is quite small. Regards, -- Jose Orlando Pereira
On 2/7/2007 2:37 AM, Markus Schiltknecht wrote: > Hi, > > Jan Wieck wrote: >> Whatever strategy one will use, in an async multimaster there are always >> cases that can be resolved by rules (last update being one of them), and >> some that I can't even imagine solving so far. I guess some of the cases >> will simply boil down to "the application has to make sure that ... >> never occurs". Think of a multi-item order, created on one node, while >> another node is deleting the long unused item (which would have to be >> backordered). Now while those two nodes figure out what to do to make >> this consistent again, a third node does a partial shipment of that >> order. > > It helps to categorize these conflict types. There basically are: Are we still discussing if the Postgres backend may provide support for a commit timestamp, that follows the rules for Lamport timestamps in a multi-node cluster? It seems more like we are drifting into what type of replication system I should design to please most people. Jan > > * data conflicts: simple row data, i.e. update - update conflicts. > > * uniqueness conflicts: two rows conflict because they'd violate a > uniquenes constraint, i.e. insert - insert, update - insert or update - > update. > > * visibility conflicts: basically the remaining update - delete and > delete - delete cases. But also SELECT FOR UPDATE candidates, etc... > Everything having to do with a rows not yet or no longer being visible > to a transaction. > > Your example certainly involves a visibility conflict (update - delete). > Not even (sync) Postgres-R can guarantee consistency on the visibility > level, i.e. a first transaction's SELECT FOR UPDATE might not see some > just recently committed transactions newly inserted rows (because that > one isn't replayed yet on the node, thus the transaction is working on > an 'old' snapshot of the database state). Another simpler example: > Postgres-R doesn't raise a serialization error on delete-delete > conflicts, it simply deletes the row once, even if two transactions > confirmed to have committed a transaction which deleted a row. > > Luckily, most applications don't need that anyway, though. > >> The solution is simple, reinsert the deleted item ... > > ..at which point timestamps certainly won't help :-) Sorry, couldn't > resist... > >> only that >> there were rather nasty ON DELETE CASCADE's on that item that removed >> all the consumer reviews, product descriptions, data sheets and what >> not. It's going to be an awful lot of undo. > > Huh? Are you planning on aborting *parts* of a transaction? I didn't > think about that, but my gut feeling is that you don't want to do that. > >> I haven't really made up my mind about a user defined rule based >> conflict resolution interface yet. I do plan to have a unique and >> foreign key constraint based, synchronous advisory locking system on top >> of my system in a later version (advisory key locks would stay in place >> until the transaction, that placed them, replicates). > > You'd have to elaborate on that... > >> I guess you see by now why I wanted to keep the discussion about the >> individual, rather generic support features in the backend separate from >> the particular features I plan to implement in the replication system. > > Sure. I know, discussions about replication can get endless, probably > even are so by definition ;-) But hey, they're fun! 
> >> Everyone has different needs and consequently an async multi-master >> "must" do a whole range of mutually exclusive things altogether ... >> because Postgres can never accept a partial solution. We want the egg >> laying milk-wool-pig or nothing. > > Like the one which would result from a merge of such an async > replication with a sync one? Imagine being able to choose between sync > and async per transaction... > > Regards > > Markus -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Hi, Jan Wieck wrote: > Are we still discussing if the Postgres backend may provide support for > a commit timestamp, that follows the rules for Lamport timestamps in a > multi-node cluster? No. And I think you know my opinion about that by now. ;-) > It seems more like we are drifting into what type of > replication system I should design to please most people. Nobody is telling you what you should do. You're free to do whatever you want to. I'm only trying to get a discussion going, because a) I'm interested in how you plan to solve these problems and b) in the past, most people were complaining that all the different replication efforts didn't try to work together. I'm slowly trying to open up and discuss what I'm doing with Postgres-R on the lists. Just yesterday at the SFPUG meeting, I've experienced how confusing it is for the users to have such a broad variety of (existing and upcoming) replication solutions. And I'm all for working together and probably even for merging different replication solutions. Regards Markus
> Jan Wieck wrote: > > Are we still discussing if the Postgres backend may provide support for > > a commit timestamp, that follows the rules for Lamport timestamps in a > > multi-node cluster? ...I thought you said in this thread that you haven't and weren't going to work on any kind of logical proof of its correctness, saw no value in prototyping your way to a clear (convincing) argument, and were withdrawing the proposal due to all the issues others raised which were, in light of this, unanswerable beyond conjecture. I thought that the thread was continuing because other people saw value in the kernel of the idea, would support it if it could be shown to be correct/useful, were disappointed you'd leave it at that and wanted to continue to see if something positive might come of the dialogue. So, the thread weaved around a bit. I think that if you want to nail this down, people here are willing to be convinced, but that hasn't happened yet. On Wed, 7 Feb 2007, Markus Schiltknecht wrote: > I'm only trying to get a discussion going, because a) I'm interested in > how you plan to solve these problems and b) in the past, most people > were complaining that all the different replication efforts didn't try > to work together. I'm slowly trying to open up and discuss what I'm > doing with Postgres-R on the lists. > > Just yesterday at the SFPUG meeting, I've experienced how confusing it > is for the users to have such a broad variety of (existing and upcoming) > replication solutions. And I'm all for working together and probably > even for merging different replication solutions. In support of that idea, I offer this: When Randy Eash wrote the world's first replication system for Ingres circa 1990, his work included ideas and features that are right now in the Postgres world fragmented among several existing replication / replication-related products, along with some things that are only now in discussion in this group. As discussed at the SFPUG meeting last night, real-world use cases are seldom if ever completely satisfied with a one-size-fits-all replication strategy. For example, a manufacturing company might want all factories to be capable of being autonomous but both report activities and take direction from corporate headquarters. To do this without having multiple databases at each site, a single database instance would likely be both a master and slave, but for differing aspects of the business's needs. Business decisions would resolve the conflicts (say, the manufacturing node always wins when it comes to data that pertains to its work), rather than something like a time-stamp based, last-timestamp/serialized-update-wins rule. Like Markus, I would like to see the various replication efforts merged as best they can be because even if the majority of users don't use a little bit of everything, surely the more interesting cases would like to and the entire community is better served if the various "solutions" are in harmony. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 2/7/2007 12:54 PM, Markus Schiltknecht wrote: > Hi, > > Jan Wieck wrote: >> Are we still discussing if the Postgres backend may provide support for >> a commit timestamp, that follows the rules for Lamport timestamps in a >> multi-node cluster? > > No. And I think you know my opinion about that by now. ;-) Then let me give you a little puzzle just for the fun of it. A database containing customer contact information (among other things) is a two-node multimaster system. One is serving the customer web portal, the other is used by the company staff including the call center. At 13:45 the two servers lose connectivity to each other, yet the internal staff can access the internal server while the web portal is accessible from the outside. At 13:50 customer A updates their credit card information through the web portal, while customer B does the same through the call center. At 13:55 both customers change their mind to use yet another credit card; now customer A phones the call center while customer B does it via the internet. At 14:00 the two servers reconnect and go through the conflict resolution. How do you intend to solve both conflicts without using any "clock", since that seems to be a stopword causing instant rejection of whatever you propose? Needless to say, both customers will be dissatisfied if you charge the "wrong" credit card during your next billing cycle. > >> It seems more like we are drifting into what type of >> replication system I should design to please most people. > > Nobody is telling you what you should do. You're free to do whatever you > want to. > > I'm only trying to get a discussion going, because a) I'm interested in > how you plan to solve these problems and b) in the past, most people > were complaining that all the different replication efforts didn't try > to work together. I'm slowly trying to open up and discuss what I'm > doing with Postgres-R on the lists. Which is a good discussion, because one of the reasons why I stopped looking into Postgres-R is the fact that it is based on the idea of pushing all the replication information through a system that generates a global serialized message queue. That by itself isn't the problem, but the fact that implementing a global serialized message queue has serious throughput issues that are (among other details) linked to the speed of light. I am trying to start with a system that doesn't rely on such a mechanism for everything. I do intend to add an option later that allows declaring a UNIQUE NOT NULL constraint to be synchronous. What that means is that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE will require the node to currently be a member of the (quorum or priority defined) majority of the cluster. An advisory lock system, based on a total order group communication, will grant the lock to the unique key values on a first-come, first-served basis. Every node in the cluster will keep those keys as "locked" until the asynchronous replication stream reports the locking transaction as ended. If another remote transaction in the meantime requires updating such a key, the incoming stream from that node will be on hold until the lock is cleared. This is to protect against node B replicating a transaction from node A and a later update on node B arriving on C before C got the first event from A. A node that got disconnected from the cluster must rebuild the current advisory lock list upon reconnecting to the cluster.
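The advisory key-lock bookkeeping just described might look roughly like this Python sketch. The class and method names are invented, and lock requests are assumed to arrive in the total order delivered by the group communication layer; this illustrates the idea only and is not code from the planned system.

from collections import defaultdict, deque

class AdvisoryKeyLocks:
    def __init__(self):
        self.holder = {}                      # key -> (node, xid) holding the lock
        self.held_back = defaultdict(deque)   # key -> queued remote changes

    def deliver_lock_request(self, key, node, xid):
        # called in total order; first come, first served
        if key not in self.holder:
            self.holder[key] = (node, xid)
            return True
        return False

    def incoming_change(self, key, change):
        # replication stream: hold back changes touching a locked key
        if key in self.holder:
            self.held_back[key].append(change)
            return []
        return [change]

    def transaction_ended(self, node, xid):
        # the async replication stream reported the locking transaction as ended
        released = [k for k, v in self.holder.items() if v == (node, xid)]
        ready = []
        for k in released:
            del self.holder[k]
            ready.extend(self.held_back.pop(k, []))
        return ready    # changes that may now be applied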
I think that this will be a way to overcome Postgres-R's communication bottleneck, as well as allowing limited update activity even during a completely disconnected state of a node. Synchronous or group communication messages are reduced to the cases, where the application cannot be implemented in a conflict free way, like allocating a natural primary key. There is absolutely no need to synchronize for example creating a sales order. An application can use global unique ID's for the order number. And everything possibly referenced by an order (items, customers, ...) is stored in a way that the references are never updated. Deletes to those possibly referenced objects are implemented in a two step process, where they are first marked obsolete, and later on things that have been marked obsolete for X long are deleted. A REPLICA TRIGGER on inserting an order will simply reset the obsolete flag of referenced objects. If a node is disconnected longer than X, you have a problem - hunt down the guy who defined X. > Just yesterday at the SFPUG meeting, I've experienced how confusing it > is for the users to have such a broad variety of (existing and upcoming) > replication solutions. And I'm all for working together and probably > even for merging different replication solutions. Merging certain ideas to come up with an async/sync hybrid? Seems to me we have similar enough ideas to need conflict resolution, because we had them simultaneously but communicate them asynchronously. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
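The two-step delete and REPLICA TRIGGER pattern Jan describes above might be modeled like this Python sketch; the table layout, names and the value of X are all invented for illustration. Items are only ever marked obsolete, replaying a replicated order insert clears the mark again, and the purge removes only items that stayed obsolete longer than X.

from datetime import datetime, timedelta

X = timedelta(days=7)
items = {"widget-42": {"obsolete_since": datetime(2007, 2, 1)}}

def replay_order_insert(order):
    # roughly what the REPLICA TRIGGER on order inserts would do
    for item_id in order["items"]:
        items[item_id]["obsolete_since"] = None

def purge(now):
    # delete only what has been marked obsolete for longer than X
    for item_id, item in list(items.items()):
        if item["obsolete_since"] and now - item["obsolete_since"] > X:
            del items[item_id]

replay_order_insert({"order_id": "G-0001", "items": ["widget-42"]})
purge(datetime(2007, 2, 10))
print(items)   # widget-42 survives because a replicated order referenced it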
Hi, Jan Wieck wrote: > Then let me give you a little puzzle just for the fun of it. > > A database containing customer contact information (among other things) > is a two node multimaster system. One is serving the customer web > portal, the other is used by the company staff including the call > center. At 13:45 the two servers lose connectivity to each other, yet > the internal staff can access the internal server while the web portal > is accessible from the outside. At 13:50 customer A updates their credit > card information through the web portal, while customer B does the same > through the call center. At 13:55 both customers change their mind to > use yet another credit card, now customer A phones the call center while > customer B does it via the internet. Phew, a mind twister... one customer would already be enough to trigger that sort of conflict... > At 14:00 the two servers reconnect and go through the conflict > resolution. How do you intend to solve both conflicts without using any > "clock", because that seems to be a stopword causing instant rejection > of whatever you propose. Needless to say, both customers will be > dissatisfied if you charge the "wrong" credit card during your next > billing cycle. Correct. But do these cases satisfy storing timestamps to each and every transaction you do? That's what I doubt, not the usefulness of time based conflict resolution for certain cases. You can always add a time based conflict resolution, by adding a timestamp column and decide upon that one. I'd guess that the overall costs are lower that way. But you've withdrawn that proposal already, so... > Which is a good discussion because one of the reasons why I stopped > looking into Postgres-R is the fact that is based on the idea to push > all the replication information through a system that generates a global > serialized message queue. That by itself isn't the problem, but the fact > that implementing a global serialized message queue has serious > throughput issues that are (among other details) linked to the speed of > light. Agreed. Nevertheless, there are use cases for such systems, because they put less limitations to the application. One could even argue, that your above example would be one ;-) > I am trying to start with a system, that doesn't rely on such a > mechanism for everything. I do intend to add an option later, that > allows to declare a UNIQUE NOT NULL constraint to be synchronous. What > that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE > will require the node to currently be a member of the (quorum or > priority defined) majority of the cluster. Sounds reasonable. > An advisory lock system, > based on a total order group communication, will grant the lock to the > unique key values on a first come, first serve base. Every node in the > cluster will keep those keys as "locked" until the asynchronous > replication stream reports the locking transaction as ended. If another > remote transaction in the meantime requires updating such key, the > incoming stream from that node will be on hold until the lock is > cleared. This is to protect agains node B replicating a transaction from > node A and a later update on node B arrives on C before C got the first > event from A. A node that got disconnected from the cluster must rebuild > the current advisory lock list upon reconnecting to the cluster. Yeah, this is a convenient way to replicate sequences via a GCS. 
> I think that this will be a way to overcome Postgres-R's communication > bottleneck, as well as allowing limited update activity even during a > completely disconnected state of a node. Synchronous or group > communication messages are reduced to the cases, where the application > cannot be implemented in a conflict free way, like allocating a natural > primary key. There is absolutely no need to synchronize for example > creating a sales order. Agreed, such cases can easily be optimized. But you have to be aware of the limitations these optimizations cause. Postgres-R is much more targeted at very general use cases. > An application can use global unique ID's for > the order number. And everything possibly referenced by an order (items, > customers, ...) is stored in a way that the references are never > updated. Deletes to those possibly referenced objects are implemented in > a two step process, where they are first marked obsolete, and later on > things that have been marked obsolete for X long are deleted. A REPLICA > TRIGGER on inserting an order will simply reset the obsolete flag of > referenced objects. If a node is disconnected longer than X, you have a > problem - hunt down the guy who defined X. Yeah, that's another very nice optimization. Again, as long as you know the limitations, that's all well and fine. > Merging certain ideas to come up with an async/sync hybrid? Seems to me > we have similar enough ideas to need conflict resolution, because we had > them simultaneously but communicate them asynchronously. Huh? Sorry, I didn't get what you're trying to say here. Regards Markus
On 2/7/2007 2:15 PM, Richard Troy wrote: >> Jan Wieck wrote: >> > Are we still discussing if the Postgres backend may provide support for >> > a commit timestamp, that follows the rules for Lamport timestamps in a >> > multi-node cluster? > > ...I thought you said in this thread that you haven't and weren't going to > work on any kind of logical proof of it's correctness, saw no value in > prototyping your way to a clear (convincing) argument, and were > withdrawing the proposal [...] I said I don't have any such documents. I was asked to continue this discussion in order to find people willing to help discover potential problems. I am prepared to continue this development isolated, although I wouldn't like to. The PostgreSQL developers community used to be good at throwing out ideas, brainstorming about the possibilities, adding more to them and coming up with very unique and flexible solutions. I am a little disappointed that much of that got lost over the years and please forgive me if I sound a little grumpy over that. The statement to withdraw the proposal was certainly premature - consider it not withdrawn at this time. However, comparing what used to be our process to what I see today, I must say that something like TOAST would never have happened. It was the result of a global brainstorming, that I simply translated into C code. Many details and features of the implementation are purely mine, but the really big sparks, that got it to what it is, I'm not claiming for myself. Most importantly, "give me proof of concept before we can talk about changing backend code" was not part of the process at all. We were pretty eager to change things back then, when we needed to get better in almost every way possible ... are we so good at replication that we need to be conservative in that respect now? We are certainly good at some things and have to be conservative with respect to them, but replication in my not so very humble opinion isn't one of them. I do understand that we have a codebase used in production these days. And because of that we have to maintain code and syntax stability to a degree, we didn't have back in the glory days of introducing EXCEPT and INTERCEPT (who's first incarnation was committed to the code base while completely destroying my entire work of fixing the rewriter). Maybe we need to introduce something entirely different, like the concept of an experimental feature. Something that we add to the code but that is explicitly flagged as not final, not stable, not guaranteed to stay or work in this or any other form. This requires that the feature has very limited interference with other parts of the system, like (or especially like) the query parser. If it turns out to be a problem in x.y.0, it will be backed out and gone in x.y.1. Or in a different way, like we create an experimental CVS branch off of every major release. That way, developers can easier share experimental code and if things settle there, they will be considered to be adopted into HEAD. > Like Markus, I would like to see the various replication efforts merged as > best they can be because even if the majority of users don't use a little > bit of everything, surely the more interesting cases would like to and the > entire community is better served if the various "solutions" are in > harmony. 
No doubt about that and I was the one organizing the Afilias sponsored meeting in Toronto back then, where my reversed Postgres-R idea was taken apart because it won't scale due to the gigantic amount of synchronized group communication it would require. Again, it might be that experimental features will cause more of the efforts to converge by using the same base as a compromise instead of having each and every support feature being designed completely independent. I still have a hard time understanding why someone would object to adding a feature, however useless it might seem to them, as long as it doesn't cost them anything. Admitted, any feature causes maintenance costs on the side of the PostgreSQL development community (mainly those, who actually contribute and maintain the code - fortunately that is a finite number - everyone please ask themselves if they are part of that). But aside from that, would anyone, who is questioning the commit timestamp as I proposed it, likewise vehemently object to yet another procedural language, or adding another log tuning switch? I don't think so. As long as it doesn't cost you unless you turn it on, why would you even care if it serves my purpose or not? The thing that kicked off this emotional spin was that multimaster replication is what so many people want, but nobody has a universal solution for. Everyone wants to see "their" problem solved "as well", or the solution isn't good. Tell you what, I can live with my problem solved even if it doesn't solve yours. Can you tell me what I have to modify in order to solve your problem as well, or are you asking me to not implement anything unless "I" find a way to solve everyones problems in one, big, universal solution? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 2/7/2007 9:27 PM, Markus Schiltknecht wrote: > Hi, > > Jan Wieck wrote: >> Then let me give you a little puzzle just for the fun of it. >> >> A database containing customer contact information (among other things) >> is a two node multimaster system. One is serving the customer web >> portal, the other is used by the company staff including the call >> center. At 13:45 the two servers lose connectivity to each other, yet >> the internal staff can access the internal server while the web portal >> is accessible from the outside. At 13:50 customer A updates their credit >> card information through the web portal, while customer B does the same >> through the call center. At 13:55 both customers change their mind to >> use yet another credit card, now customer A phones the call center while >> customer B does it via the internet. > > Phew, a mind twister... one customer would already be enough to trigger > that sort of conflict... > >> At 14:00 the two servers reconnect and go through the conflict >> resolution. How do you intend to solve both conflicts without using any >> "clock", because that seems to be a stopword causing instant rejection >> of whatever you propose. Needless to say, both customers will be >> dissatisfied if you charge the "wrong" credit card during your next >> billing cycle. > > Correct. But do these cases satisfy storing timestamps to each and every > transaction you do? That's what I doubt, not the usefulness of time > based conflict resolution for certain cases. > > You can always add a time based conflict resolution, by adding a > timestamp column and decide upon that one. I'd guess that the overall > costs are lower that way. Yes, yes, and yes ... but aside from the problem that you use the very ambiguous word "timestamp" (which somehow suggests using a "clock" of some sort), isn't the "begin" timestamp of a long running transaction worse than the "commit" timestamp, when all its work got visible to the outside world instantaneously? > > But you've withdrawn that proposal already, so... > >> Which is a good discussion because one of the reasons why I stopped >> looking into Postgres-R is the fact that is based on the idea to push >> all the replication information through a system that generates a global >> serialized message queue. That by itself isn't the problem, but the fact >> that implementing a global serialized message queue has serious >> throughput issues that are (among other details) linked to the speed of >> light. > > Agreed. Nevertheless, there are use cases for such systems, because they > put less limitations to the application. One could even argue, that your > above example would be one ;-) Now we're in sync :-) > >> I am trying to start with a system, that doesn't rely on such a >> mechanism for everything. I do intend to add an option later, that >> allows to declare a UNIQUE NOT NULL constraint to be synchronous. What >> that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE >> will require the node to currently be a member of the (quorum or >> priority defined) majority of the cluster. > > Sounds reasonable. > >> An advisory lock system, >> based on a total order group communication, will grant the lock to the >> unique key values on a first come, first serve base. Every node in the >> cluster will keep those keys as "locked" until the asynchronous >> replication stream reports the locking transaction as ended. 
If another >> remote transaction in the meantime requires updating such key, the >> incoming stream from that node will be on hold until the lock is >> cleared. This is to protect agains node B replicating a transaction from >> node A and a later update on node B arrives on C before C got the first >> event from A. A node that got disconnected from the cluster must rebuild >> the current advisory lock list upon reconnecting to the cluster. > > Yeah, this is a convenient way to replicate sequences via a GCS. > >> I think that this will be a way to overcome Postgres-R's communication >> bottleneck, as well as allowing limited update activity even during a >> completely disconnected state of a node. Synchronous or group >> communication messages are reduced to the cases, where the application >> cannot be implemented in a conflict free way, like allocating a natural >> primary key. There is absolutely no need to synchronize for example >> creating a sales order. > > Agreed, such cases can easily be optimized. But you have to be aware of > he limitations these optimizations cause. Postgres-R is much more > targeted at very general use cases. I am, if for no other reason than that I am familiar with the concepts underneath Postgres-R for more than 3 years. What I realized is that the "general use" case (for arbitrarily complex applications) is very likely to be in conflict with any kind of "good default performance" case. > >> An application can use global unique ID's for >> the order number. And everything possibly referenced by an order (items, >> customers, ...) is stored in a way that the references are never >> updated. Deletes to those possibly referenced objects are implemented in >> a two step process, where they are first marked obsolete, and later on >> things that have been marked obsolete for X long are deleted. A REPLICA >> TRIGGER on inserting an order will simply reset the obsolete flag of >> referenced objects. If a node is disconnected longer than X, you have a >> problem - hunt down the guy who defined X. > > Yeah, that's another very nice optimization. Again, as long as you know > the limitations, that's all well and fine. > >> Merging certain ideas to come up with an async/sync hybrid? Seems to me >> we have similar enough ideas to need conflict resolution, because we had >> them simultaneously but communicate them asynchronously. > > Huh? Sorry, I didn't get what you're trying to say here. Out of sync again ... we'll get there tomorrow ... unless your clock is way back and tomorrow will never come. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
I find the term "logical proof of its correctness" too restrictive. It sounds like some formal academic process that really doesn't work well for us. What I did want to hear is a layout of how the system would work, and an exchange of ideas until almost everyone was happy. Also, I saw the trigger patch with no explanation of why it was important or who would use it --- that also isn't going to fly well. So, to add something, the community needs to hear how it is going to help users, because every code addition has cost, and we don't want to add things unless they have general utility. If someone can't explain the utility of an addition, I question whether the person has fully thought through where they are going. As far as adding a language, no, we would not just add any language. We would judge whether the language is useful to our users. I think APL would be cool, but I am not sure it is usable, so there is a hurdle even there. As far as TOAST, there is no question in my mind that TOAST development would happen the same way today as it did when we did it in 2001 --- we have a problem, how can we fix it. --------------------------------------------------------------------------- Jan Wieck wrote: > On 2/7/2007 2:15 PM, Richard Troy wrote: > >> Jan Wieck wrote: > >> > Are we still discussing if the Postgres backend may provide support for > >> > a commit timestamp, that follows the rules for Lamport timestamps in a > >> > multi-node cluster? > > > > ...I thought you said in this thread that you haven't and weren't going to > > work on any kind of logical proof of it's correctness, saw no value in > > prototyping your way to a clear (convincing) argument, and were > > withdrawing the proposal [...] > > I said I don't have any such documents. I was asked to continue this > discussion in order to find people willing to help discover potential > problems. I am prepared to continue this development isolated, although > I wouldn't like to. > > The PostgreSQL developers community used to be good at throwing out > ideas, brainstorming about the possibilities, adding more to them and > coming up with very unique and flexible solutions. I am a little > disappointed that much of that got lost over the years and please > forgive me if I sound a little grumpy over that. The statement to > withdraw the proposal was certainly premature - consider it not > withdrawn at this time. However, comparing what used to be our process > to what I see today, I must say that something like TOAST would never > have happened. It was the result of a global brainstorming, that I > simply translated into C code. Many details and features of the > implementation are purely mine, but the really big sparks, that got it > to what it is, I'm not claiming for myself. Most importantly, "give me > proof of concept before we can talk about changing backend code" was not > part of the process at all. We were pretty eager to change things back > then, when we needed to get better in almost every way possible ... are > we so good at replication that we need to be conservative in that > respect now? We are certainly good at some things and have to be > conservative with respect to them, but replication in my not so very > humble opinion isn't one of them. > > I do understand that we have a codebase used in production these days.
> And because of that we have to maintain code and syntax stability to a > degree, we didn't have back in the glory days of introducing EXCEPT and > INTERCEPT (who's first incarnation was committed to the code base while > completely destroying my entire work of fixing the rewriter). Maybe we > need to introduce something entirely different, like the concept of an > experimental feature. Something that we add to the code but that is > explicitly flagged as not final, not stable, not guaranteed to stay or > work in this or any other form. This requires that the feature has very > limited interference with other parts of the system, like (or especially > like) the query parser. If it turns out to be a problem in x.y.0, it > will be backed out and gone in x.y.1. Or in a different way, like we > create an experimental CVS branch off of every major release. That way, > developers can easier share experimental code and if things settle > there, they will be considered to be adopted into HEAD. > > > Like Markus, I would like to see the various replication efforts merged as > > best they can be because even if the majority of users don't use a little > > bit of everything, surely the more interesting cases would like to and the > > entire community is better served if the various "solutions" are in > > harmony. > > No doubt about that and I was the one organizing the Afilias sponsored > meeting in Toronto back then, where my reversed Postgres-R idea was > taken apart because it won't scale due to the gigantic amount of > synchronized group communication it would require. Again, it might be > that experimental features will cause more of the efforts to converge by > using the same base as a compromise instead of having each and every > support feature being designed completely independent. > > I still have a hard time understanding why someone would object to > adding a feature, however useless it might seem to them, as long as it > doesn't cost them anything. Admitted, any feature causes maintenance > costs on the side of the PostgreSQL development community (mainly those, > who actually contribute and maintain the code - fortunately that is a > finite number - everyone please ask themselves if they are part of > that). But aside from that, would anyone, who is questioning the commit > timestamp as I proposed it, likewise vehemently object to yet another > procedural language, or adding another log tuning switch? I don't think > so. As long as it doesn't cost you unless you turn it on, why would you > even care if it serves my purpose or not? The thing that kicked off this > emotional spin was that multimaster replication is what so many people > want, but nobody has a universal solution for. Everyone wants to see > "their" problem solved "as well", or the solution isn't good. Tell you > what, I can live with my problem solved even if it doesn't solve yours. > Can you tell me what I have to modify in order to solve your problem as > well, or are you asking me to not implement anything unless "I" find a > way to solve everyones problems in one, big, universal solution? > > > Jan > > -- > #======================================================================# > # It's easier to get forgiveness for being wrong than for being right. # > # Let's break this rule - forgive me. 
# > #================================================== JanWieck@Yahoo.com # > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 2/7/2007 10:35 PM, Bruce Momjian wrote: > I find the term "logical proof of it's correctness" too restrictive. It > sounds like some formal academic process that really doesn't work well > for us. Thank you. > Also, I saw the trigger patch with no explaination of why it was > important or who would use it --- that also isn't going to fly well. You didn't respond to my explanation of how the current Slony implementation could improve and evolve using it. Are you missing something? I am discussing this very issue with our own QA department, and thus far, I think I have a majority of "would use a pg_trigger backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly how it corrupted my system catalog". > As far as TOAST, there is no question in my mind that TOAST development > would happen the same way today as it did when we did it in 2001 --- we > have a problem, how can we fix it. Looking at what did happen back then and what happens in this case, I do see a difference. There were concerns about the compression algorithm used ... what is used today is still the first incarnation, and nobody ever bothered to investigate whether there could possibly be anything better. Do you think lzcompress is the best we can come up with? I don't! So why is it still the thing used? Maybe it is good enough? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 2/7/2007 10:35 PM, Bruce Momjian wrote: > > I find the term "logical proof of it's correctness" too restrictive. It > > sounds like some formal academic process that really doesn't work well > > for us. > > Thank you. > > > Also, I saw the trigger patch with no explaination of why it was > > important or who would use it --- that also isn't going to fly well. > > You didn't respond to my explanation how the current Slony > implementation could improve and evolve using it. Are you missing > something? I am discussing this very issue with our own QA department, > and thus far, I think I have a majority of "would use a pg_trigger > backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly > how it corrupted my system catalog". No, I _now_ understand the use case, but when the patch was posted, the use case was missing. I would like to see a repost with the patch, and a description of its use so we can all move forward on that. > > As far as TOAST, there is no question in my mind that TOAST development > > would happen the same way today as it did when we did it in 2001 --- we > > have a problem, how can we fix it. > > Looking at what did happen back then and what happens in this case, I do > see a difference. There were concerns about the compression algorithm > used ... it still is today what was the first incarnation and nobody > ever bothered to even investigate if there could possibly be any better > thing. Do you think lzcompress is the best we can come up with? I don't! > So why is it still the thing used? Maybe it is good enough? It is simple/stupid enough, I would say, and the compression space is a mine-field of patents. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> Yes, yes, and yes ... but aside from the problem that you use the very > ambiguous word "timestamp" (which somehow suggests using a "clock" of > some sort), isn't the "begin" timestamp of a long running transaction [imho a begin timestamp is near useless] > worse than the "commit" timestamp, when all its work got visible to the > outside world instantaneously? This is one of the areas I am still worried about. Is one commit Lamport timestamp enough? I think for some conflict resolutions we need to look at the row level, and resolve conflicts per row and not per transaction (yes, this means that a tx might get partially replicated). What I am trying to get at is: maybe an infrastructure to produce Wieck Lamport timestamps, that can be used in different places like commit hooks and column defaults, would be of more general use. Maybe such a column could be a system column that is not visible with "select *" for those cases where commit is not enough. And a commit hook could insert it into clog-like storage. Andreas
On 2/7/2007 11:12 PM, Bruce Momjian wrote: > Jan Wieck wrote: >> On 2/7/2007 10:35 PM, Bruce Momjian wrote: >> > I find the term "logical proof of it's correctness" too restrictive. It >> > sounds like some formal academic process that really doesn't work well >> > for us. >> >> Thank you. >> >> > Also, I saw the trigger patch with no explaination of why it was >> > important or who would use it --- that also isn't going to fly well. >> >> You didn't respond to my explanation how the current Slony >> implementation could improve and evolve using it. Are you missing >> something? I am discussing this very issue with our own QA department, >> and thus far, I think I have a majority of "would use a pg_trigger >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly >> how it corrupted my system catalog". > > No, I _now_ understand the use case, but when the patch was posted, the > use case was missing. I would like to see a repost with the patch, and > a description of its use so we can all move forward on that. Is this a new policy that after discussion, all patches must be resubmitted with a summary and conclusions of the discussion? I can certainly do that for you, but just tell me if you are going to ask the same from everyone. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 2/7/2007 11:12 PM, Bruce Momjian wrote: > > Jan Wieck wrote: > >> On 2/7/2007 10:35 PM, Bruce Momjian wrote: > >> > I find the term "logical proof of it's correctness" too restrictive. It > >> > sounds like some formal academic process that really doesn't work well > >> > for us. > >> > >> Thank you. > >> > >> > Also, I saw the trigger patch with no explaination of why it was > >> > important or who would use it --- that also isn't going to fly well. > >> > >> You didn't respond to my explanation how the current Slony > >> implementation could improve and evolve using it. Are you missing > >> something? I am discussing this very issue with our own QA department, > >> and thus far, I think I have a majority of "would use a pg_trigger > >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly > >> how it corrupted my system catalog". > > > > No, I _now_ understand the use case, but when the patch was posted, the > > use case was missing. I would like to see a repost with the patch, and > > a description of its use so we can all move forward on that. > > Is this a new policy that after discussion, all patches must be > resubmitted with a summary and conclusions of the discussion? I can > certainly do that for you, but just tell me if you are going to ask the > same from everyone. No, I am asking only this time because I feel there was too much disconnect between the patch and the extensive replication discussion that few community members would see the connection. I would also like to know what your new feature does for each supported option. I have not seen that spelled out yet at all. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Jan Wieck wrote: > > On 2/7/2007 11:12 PM, Bruce Momjian wrote: > > > Jan Wieck wrote: > > >> On 2/7/2007 10:35 PM, Bruce Momjian wrote: > > >> > > >> > Also, I saw the trigger patch with no explaination of why it was > > >> > important or who would use it --- that also isn't going to fly well. > > >> > > >> You didn't respond to my explanation how the current Slony > > >> implementation could improve and evolve using it. Are you missing > > >> something? I am discussing this very issue with our own QA department, > > >> and thus far, I think I have a majority of "would use a pg_trigger > > >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly > > >> how it corrupted my system catalog". > > > > > > No, I _now_ understand the use case, but when the patch was posted, the > > > use case was missing. I would like to see a repost with the patch, and > > > a description of its use so we can all move forward on that. > > > > Is this a new policy that after discussion, all patches must be > > resubmitted with a summary and conclusions of the discussion? I can > > certainly do that for you, but just tell me if you are going to ask the > > same from everyone. > > No, I am asking only this time because I feel there was too much > disconnect between the patch and the extensive replication discussion > that few community members would see the connection. FYI, in my opinion the trigger addition is clearly useful to Mammoth Replicator as well. In fact, it's so obviously useful that I didn't see a need to state that in the original thread where it was discussed. Not sure about the timestamp stuff, because Replicator is not multi-master, so there's no conflict resolution to take care of. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > > > Is this a new policy that after discussion, all patches must be > > > resubmitted with a summary and conclusions of the discussion? I can > > > certainly do that for you, but just tell me if you are going to ask the > > > same from everyone. > > > > No, I am asking only this time because I feel there was too much > > disconnect between the patch and the extensive replication discussion > > that few community members would see the connection. > > FYI, in my opinion the trigger addition is clearly useful to Mammoth > Replicator as well. In fact, it's so obviously useful that I didn't see > a need to state that in the original thread where it was discussed. Right, I know it is useful too, but I would like a layout of what it does and why so everyone is clear on it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 2/8/2007 3:32 PM, Bruce Momjian wrote: > Alvaro Herrera wrote: >> > > Is this a new policy that after discussion, all patches must be >> > > resubmitted with a summary and conclusions of the discussion? I can >> > > certainly do that for you, but just tell me if you are going to ask the >> > > same from everyone. >> > >> > No, I am asking only this time because I feel there was too much >> > disconnect between the patch and the extensive replication discussion >> > that few community members would see the connection. >> >> FYI, in my opinion the trigger addition is clearly useful to Mammoth >> Replicator as well. In fact, it's so obviously useful that I didn't see >> a need to state that in the original thread where it was discussed. > > Right, I know it is useful too, but I would like a layout of what it > does and why so everyone is clear on it. I have no clue what got you into what you are doing here. But that shall not be my real concern. If you feel the need to do this sort of thing, be my guest. I will add the remaining changes to pg_rewrite, including the new support commands and changes to psql as well as pg_dump and resubmit the new patch with explanations that will hopefully help you to comprehend what and how this relatively small and fully backward compatible change in the trigger and rule firing mechanism will work and what existing problems it will solve. Regards, Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > On 2/8/2007 3:32 PM, Bruce Momjian wrote: > > Alvaro Herrera wrote: > >> > > Is this a new policy that after discussion, all patches must be > >> > > resubmitted with a summary and conclusions of the discussion? I can > >> > > certainly do that for you, but just tell me if you are going to ask the > >> > > same from everyone. > >> > > >> > No, I am asking only this time because I feel there was too much > >> > disconnect between the patch and the extensive replication discussion > >> > that few community members would see the connection. > >> > >> FYI, in my opinion the trigger addition is clearly useful to Mammoth > >> Replicator as well. In fact, it's so obviously useful that I didn't see > >> a need to state that in the original thread where it was discussed. > > > > Right, I know it is useful too, but I would like a layout of what it > > does and why so everyone is clear on it. > > I have no clue what got you into what you are doing here. But that shall > not be my real concern. If you feel the need to do this sort of thing, > be my guest. I will add the remaining changes to pg_rewrite, including > the new support commands and changes to psql as well as pg_dump and > resubmit the new patch with explanations that will hopefully help you to > comprehend what and how this relatively small and fully backward > compatible change in the trigger and rule firing mechanism will work and > what existing problems it will solve. Yep, that's what I want everyone to see. This is standard procedure for everyone in the community, and core is not immune. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Jan Wieck wrote: > On 2/8/2007 3:32 PM, Bruce Momjian wrote: >> Alvaro Herrera wrote: >>> >> Is this a new policy that after discussion, all patches must be >>> >> resubmitted with a summary and conclusions of the discussion? I can >>> >> certainly do that for you, but just tell me if you are going to ask the >>> >> same from everyone. >>> > No, I am asking only this time because I feel there was too much >>> > disconnect between the patch and the extensive replication discussion >>> > that few community members would see the connection. >>> >>> FYI, in my opinion the trigger addition is clearly useful to Mammoth >>> Replicator as well. In fact, it's so obviously useful that I didn't see >>> a need to state that in the original thread where it was discussed. >> >> Right, I know it is useful too, but I would like a layout of what it >> does and why so everyone is clear on it. Well, how deep are we talking here? My understanding of what Jan wants to do is simple. Be able to declare which triggers are fired depending on the state of the cluster. In Jan's terms, the Origin or Subscriber. In Replicator terms the Master or Slave. This is useful because I may have a trigger on the Master and the same trigger on the Slave. You do not want the trigger to fire on the Slave because we are doing data replication. In short, we replicate the result, not the action. However, you may want triggers that are on the Slave to fire separately. A reporting server that generates materialized views is a good example. Don't tie up the Master with what a Slave can do. Sincerely, Joshua D. Drake > > I have no clue what got you into what you are doing here. But that shall > not be my real concern. If you feel the need to do this sort of thing, > be my guest. I will add the remaining changes to pg_rewrite, including > the new support commands and changes to psql as well as pg_dump and > resubmit the new patch with explanations that will hopefully help you to > comprehend what and how this relatively small and fully backward > compatible change in the trigger and rule firing mechanism will work and > what existing problems it will solve. > > > Regards, > Jan > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
I just want an outline of what each option is supposed to control. If that information is in a documentation patch, then fine, he can just post that and tell people to read the patch documentation. --------------------------------------------------------------------------- Joshua D. Drake wrote: > Jan Wieck wrote: > > On 2/8/2007 3:32 PM, Bruce Momjian wrote: > >> Alvaro Herrera wrote: > >>> > > Is this a new policy that after discussion, all patches must be > > >>> > resubmitted with a summary and conclusions of the discussion? I can > >>> > > certainly do that for you, but just tell me if you are going to > >>> ask the > > same from everyone. > >>> > > No, I am asking only this time because I feel there was too much > >>> > disconnect between the patch and the extensive replication discussion > >>> > that few community members would see the connection. > >>> > >>> FYI, in my opinion the trigger addition is clearly useful to Mammoth > >>> Replicator as well. In fact, it's so obviously useful that I didn't see > >>> a need to state that in the original thread where it was discussed. > >> > >> Right, I know it is useful too, but I would like a layout of what it > >> does and why so everyone is clear on it. > > Well how deep are we talking here? My understanding of what Jan wants to > do is simple. > > Be able to declare which triggers are fired depending on the state of > the cluster. > > In Jan's terms, the Origin or Subscriber. In Replicator terms the Master > or Slave. > > This is useful because I may have a trigger on the Master and the same > trigger on the Slave. You do not want the trigger to fire on the Slave > because we are doing data replication. In short, the we replicate the > result, not the action. > > However, you may want triggers that are on the Slave to fire separately. > A reporting server that generates materialized views is a good example. > Don't tie up the Master with what a Slave can do. > > Sincerely, > > Joshua D. Drake > > > > > > > I have no clue what got you into what you are doing here. But that shall > > not be my real concern. If you feel the need to do this sort of thing, > > be my guest. I will add the remaining changes to pg_rewrite, including > > the new support commands and changes to psql as well as pg_dump and > > resubmit the new patch with explanations that will hopefully help you to > > comprehend what and how this relatively small and fully backward > > compatible change in the trigger and rule firing mechanism will work and > > what existing problems it will solve. > > > > > > Regards, > > Jan > > > > > -- > > === The PostgreSQL Company: Command Prompt, Inc. === > Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 > Providing the most comprehensive PostgreSQL solutions since 1997 > http://www.commandprompt.com/ > > Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate > PostgreSQL Replication: http://www.commandprompt.com/products/ > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, 8 Feb 2007, Joshua D. Drake wrote: > > Well how deep are we talking here? My understanding of what Jan wants to > do is simple. > > Be able to declare which triggers are fired depending on the state of > the cluster. > > In Jan's terms, the Origin or Subscriber. In Replicator terms the Master > or Slave. > > This is useful because I may have a trigger on the Master and the same > trigger on the Slave. You do not want the trigger to fire on the Slave > because we are doing data replication. In short, the we replicate the > result, not the action. > > However, you may want triggers that are on the Slave to fire separately. > A reporting server that generates materialized views is a good example. > Don't tie up the Master with what a Slave can do. > It'd be great if Jan considers the blending of replication; any given DB instance shouldn't be only a master/originator or only a slave/subscriber. A solution that lets you blend replication strategies in a single db is, from my point of view, very important. > > I have no clue what got you into what you are doing here. Jan, some sleep now and then might be helpful to your public disposition. -smile- peace, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 2/8/2007 11:41 PM, Richard Troy wrote: > On Thu, 8 Feb 2007, Joshua D. Drake wrote: >> >> Well how deep are we talking here? My understanding of what Jan wants to >> do is simple. >> >> Be able to declare which triggers are fired depending on the state of >> the cluster. >> >> In Jan's terms, the Origin or Subscriber. In Replicator terms the Master >> or Slave. >> >> This is useful because I may have a trigger on the Master and the same >> trigger on the Slave. You do not want the trigger to fire on the Slave >> because we are doing data replication. In short, the we replicate the >> result, not the action. >> >> However, you may want triggers that are on the Slave to fire separately. >> A reporting server that generates materialized views is a good example. >> Don't tie up the Master with what a Slave can do. >> > > It'd be great if Jan considers the blending of replication; Please elaborate. I would really like to get all you can contribute. > >> > I have no clue what got you into what you are doing here. > > Jan, some sleep now and then might be helpful to your public disposition. Richard, don't embarrass Bruce. He doesn't need your help. I have been with this project and know Bruce Momjian for more than 10 years. Every now and then, Bruce and I get into some sort of eventually publicly visible dispute that doesn't really mean much. I'll probably spend next Christmas with him and his family again, play a few rounds of backgammon with Wilma (who I really owe a revenge), hopefully don't interfere too much with Christine's work (especially when it involves handling food over a white carpet) and none of us will even remember this crap. Our friendship has been through some real tests. Any real problem we would have, we'd never discuss here. We would just meet and talk. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 2/7/2007 7:13 AM, José Orlando Pereira wrote: > On Saturday 03 February 2007, Bruce Momjian wrote: >> Jan Wieck wrote: >> > I don't have any such paper and the proof of concept will be the >> > implementation of the system. I do however see enough resistance against >> > this proposal to withdraw the commit timestamp at this time. The new >> > replication system will therefore require the installation of a patched, >> > non-standard PostgreSQL version, compiled from sources cluster wide in >> > order to be used. I am aware that this will dramatically reduce it's >> > popularity but it is impossible to develop this essential feature as an >> > external module. >> > >> > I thank everyone for their attention. >> >> Going and working on it on your own doesn't seem like the proper >> solution. I don't see people objecting to adding it, but they want it >> work, which I am sure you want too. You have to show how it will work >> and convince others of that, and then you have a higher chance it will >> work, and be in the PostgreSQL codebase. > > Hi, > > Would it be possible to solve the problem using the GORDA on-commit hook? > > Jan would be able reliably obtain a commit timestamp with the desired > semantics and store it in a regular table within transaction boundaries. I am not sure; I would have to look at what exactly that hook provides. The key to a Lamport timestamp is that it advances in commit order (plus some other things ... of course). If the hook can guarantee that the calls are always made in commit order, serialized without any race condition possible, it would probably be suitable. Jan > > PostgreSQL would not have to commit to a specific timestamp semantics and the > patch is quite small. > > Regards, > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
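To pin down the property being asked of such a hook, here is a minimal illustrative sketch in Python (not backend C, and none of these names exist in PostgreSQL or GORDA): a generator that is only useful if its callers are already serialized in commit order, and that never hands out a value that goes backwards even when the system clock does.

    import threading
    import time

    class CommitTimestampGenerator:
        """Hands out strictly increasing commit timestamps in microseconds.

        Assumes the caller invokes next_timestamp() once per transaction,
        serialized in commit order (e.g. from inside a commit hook); the
        lock only guards against concurrent callers, it cannot create order.
        """

        def __init__(self):
            self._lock = threading.Lock()
            self._last_us = 0

        def next_timestamp(self):
            with self._lock:
                now_us = int(time.time() * 1_000_000)
                # If the clock stalled or stepped backwards, nudge the value
                # forward by one microsecond so the sequence stays strictly
                # monotonic in commit order.
                if now_us <= self._last_us:
                    now_us = self._last_us + 1
                self._last_us = now_us
                return now_us

A commit hook would do nothing more than ts = generator.next_timestamp() while it still holds whatever serializes commits; all of the interesting guarantees live in that serialization, not in the arithmetic.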
Richard Troy wrote: > It'd be great if Jan considers the blending of replication; any given DB > instance shouldn't be only a master/originator or only a slave/subscriber. > A solution that lets you blend replication strategies in a single db is, > from my point of view, very important. > Perhaps if more people read Jan's posts he wouldn't be so frustrated. http://archives.postgresql.org/pgsql-hackers/2007-01/msg01302.php He clearly describes that the master/slave setting is per session, not per database. Kris Jurka
On Feb 8, 2007, at 8:41 PM, Richard Troy wrote: > It'd be great if Jan considers the blending of replication; any > given DB > instance shouldn't be only a master/originator or only a slave/ > subscriber. > A solution that lets you blend replication strategies in a single > db is, > from my point of view, very important. It might be constructive to define what a minimal "complete" set of replication primitives actually is in addition to which ones should be implemented. In addition to master/slave models, you have Paxos algorithms and dynamic reconfiguration models in literature that can utilize many of the same primitives but which are very different in implementation. I see the value of Jan's proposal, but perhaps it would be better to step back and make some assertions about the nature of the core capabilities that will be supported in some broader picture. Having a theoretically (mostly) complete set of usable primitives would be an incredibly powerful feature set. Cheers, J. Andrew Rogers jrogers@neopolitan.com
On Fri, 9 Feb 2007, Jan Wieck wrote: > > [ I wrote ] > > It'd be great if Jan considers the blending of replication; > > Please elaborate. I would really like to get all you can contribute. Thanks Jan, prefaced by saying that I really haven't read everything you've written on this (or what other people are doing, either), and that I've got a terrible flu right now (fever, etc), I'll give it a go - hopefully it's actually helpful. To wit: In general terms, "blending of replication [techniques]" means to me that one can have a single database instance serve as a master and as a slave (to use only one set of terminology), and as a multi-master, too, all simultaneously, letting the DBA / Architect choose which portions serve which roles (purposes). All replication features would respect the boundaries of such choices automatically, as it's all blended. In more specific terms, and I'm just brainstorming in public here, perhaps we can use the power of Schemas within a database to manage such divisions; commands which pertain to replication can/would include a schema specifier and elements within the schema can be replicated one way or another, at the whim of the DBA / Architect. For backwards compatibility, if a schema isn't specified, it indicates that the command pertains to the entire database. At the very least, a schema division strategy for replication leverages an existing DB-component binding/dividing mechanism that most everyone is familiar with. While there are/may be database-wide, nay, installation-wide constructs as in your Commit Timestamp proposal, I don't see that there's any conflict - at least, from what I understand of existing systems and proposals to date. HTH, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On Feb 7, 8:12 pm, b...@momjian.us (Bruce Momjian) wrote: > Jan Wieck wrote: > > On 2/7/2007 10:35 PM, Bruce Momjian wrote: > > > I find the term "logical proof of it's correctness" too restrictive. It > > > sounds like some formal academic process that really doesn't work well > > > for us. > > > Thank you. My intuition is that it might be possible to prove that _nothing_ can provide guaranteed ordering when there is disconnected operation. However, I think that the clock based ordering Jan has described could provide _probable_ ordering under disconnected operation. I can see three variables in the equation that would determine the probability of correctness for the ordering. 1) clock drift rate between disconnected clusters 2) disconnection time 3) transaction rate on the tables, or even rows involved There are probably more. I think that if Jan implements what he's described then a very interesting follow-up would be to do the statistical analysis necessary to quantify the risk of incorrect ordering while disconnected. (I've got x ms/ms relative clock drift, and y tps. How long can I run disconnected before falling under 99.999% probability of correctly ordered transactions?) > No, I _now_ understand the use case, but when the patch was posted, the > use case was missing. I would like to see a repost with the patch, and > a description of its use so we can all move forward on that. An additional use case for an on-commit timestamp is in the analysis of billing transactions in highly concurrent systems. For example, imagine your billing period is monthly and you have transactions which start before and end after the "end-of-month". Having the on-commit timestamp for these transactions may help when attempting to reconcile between transactions and account activities. Andrew
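One crude way to put numbers on those three variables (purely an illustration; the Poisson model, the function name, and the example figures are assumptions of this sketch, not measurements from the thread): accumulated skew grows with drift over the outage, and a pair of cross-node updates to the same rows is at risk of being misordered once their real-time separation falls inside that skew.

    import math

    def expected_misorderable_pairs(drift, tps_a, tps_b, outage_seconds, steps=10000):
        """Crude estimate of how many cross-node update pairs could be misordered.

        drift          -- relative clock drift, in seconds of skew per second
        tps_a, tps_b   -- rate of updates (per second) touching the *same* rows
                          on the two disconnected nodes, modeled as Poisson
        outage_seconds -- how long the nodes stay disconnected

        An update on node A at time t is counted as "at risk" if some update on
        node B falls within the clock skew accumulated by time t, because the
        commit timestamps can then no longer be trusted to reflect real order.
        """
        total = 0.0
        dt = outage_seconds / steps
        for i in range(steps):
            t = (i + 0.5) * dt
            skew = drift * t                               # skew accumulated so far
            p_hit = 1.0 - math.exp(-2.0 * tps_b * skew)    # P(a B-update lands that close)
            total += tps_a * p_hit * dt
        return total

    # Example: 50 ppm relative drift, 0.1 conflicting updates/s per node,
    # 8 hours of disconnected operation.
    print(expected_misorderable_pairs(50e-6, 0.1, 0.1, 8 * 3600))

The real analysis would of course need to model how conflicting updates cluster on hot rows, but even a toy formula like this makes the trade-off between drift, rate, and disconnection time visible.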
Richard Troy wrote: > In more specific terms, and I'm just brainstorming in public here, perhaps > we can use the power of Schemas within a database to manage such > divisions; commands which pertain to replication can/would include a > schema specifier and elements within the schema can be replicated one way > or another, at the whim of the DBA / Architect. For backwards > compatability, if a schema isn't specified, it indicates that command > pertains to the entire database. > > I understand that you're just thinking aloud, but overloading namespaces in this way strikes me as awful. Applications and extensions, which are the things that have need of namespaces, should not have to care about replication. If we have to design them for replication we'll be on a fast track to nowhere IMNSHO. cheers andrew
On 2/9/2007 2:27 PM, Richard Troy wrote: > In general terms, "blending of replication [techniques]" means to me that > one can have a single database instance serve as a master and as a slave > (to use only one set of terminology), and as a multi-master, too, all > simultaneously, letting the DBA / Architect choose which portions serve > which roles (purposes). All replication features would respect the > boundaries of such choices automatically, as it's all blended. That is specifically what the changes to pg_trigger and pg_rewrite take into account. However much you blend different techniques, a single transaction on one server will always fall into one of three categories. 1) It could be the original operation done by the client application. 2) It could be the actions performed by the replication engine to replay a remote transaction. And 3) it can be an administrative operation that must not be propagated at all. No matter how many different models you have in parallel, one single transaction will be either a master, a slave or an isolated local thing. The proposed changes make it possible to tell the session which of these three roles it is playing, and the triggers and rules can be configured to fire in the master/local role, in the slave role, always, or never. That functionality will work for master-slave as well as multi-master. Although my current plan isn't to create such a blended system, the proposed trigger and rule changes are designed to support exactly that in a 100% backward compatible way. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
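The decision table this describes is small; the following sketch is purely illustrative (the enum and function names are invented here, not the proposed catalog columns or syntax):

    from enum import Enum

    class SessionRole(Enum):
        ORIGIN = "origin"    # original operation performed by the client application
        REPLICA = "replica"  # replication engine replaying a remote transaction
        LOCAL = "local"      # administrative work that must not be propagated

    class FireMode(Enum):
        ORIGIN = "origin"    # fire for origin/local work (today's default behavior)
        REPLICA = "replica"  # fire only while replaying replicated data
        ALWAYS = "always"
        NEVER = "never"

    def trigger_fires(mode, role):
        """Return True if a trigger or rule with this setting fires in this session."""
        if mode is FireMode.ALWAYS:
            return True
        if mode is FireMode.NEVER:
            return False
        if mode is FireMode.ORIGIN:
            return role in (SessionRole.ORIGIN, SessionRole.LOCAL)
        return role is SessionRole.REPLICA

Because an untouched trigger keeps the origin/local behavior, existing setups keep working unchanged, which is what makes the change backward compatible.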
On Fri, 9 Feb 2007, Andrew Dunstan wrote: > Richard Troy wrote: > > In more specific terms, and I'm just brainstorming in public here, perhaps > > we can use the power of Schemas within a database to manage such > > divisions; commands which pertain to replication can/would include a > > schema specifier and elements within the schema can be replicated one way > > or another, at the whim of the DBA / Architect. For backwards > > compatability, if a schema isn't specified, it indicates that command > > pertains to the entire database. > > I understand that you're just thinking aloud, but overloading namespaces > in this way strikes me as awful. Applications and extensions, which are > the things that have need of namespaces, should not have to care about > replication. If we have to design them for replication we'll be on a > fast track to nowhere IMNSHO. Well, Andrew, replication _is_ an application. Or, you could think of replication as an extension to an application. I was under the impression that_users_ decide to put tables in schema spaces based upon _user_ need, and that Postgres developer's use of them for other purposes was incroaching on user choices, not the other way around. Either way, claiming "need" like this strikes me as stuck-in-a-rut or dogmatic thinking. Besides, don't we have schema nesting to help resolve any such "care?" And, what do you mean by "design them for replication?" While I'm in no way stuck on blending replication strategies via schemas, it does strike me as an appropriate concept and I'd preferr to have it evaluated based on technical merrit - possibly citing workarounds or solutions to technical issues, which is what I gather has been the tradition of this group: Use case first, technical merrit second... Other alternatives, ISTM, will have virtually the same look/feel as a schema from an external perspective, and the more I think of it the more I think using schemas is a sound, clean approach. That it offends someones sense of asthetics STM a poor rationale for not choosing it. Another question might be: What's lacking in the implementation of schemas that makes this a poor choice, and what could be done about it without much effort? Regards, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On Fri, 9 Feb 2007, Jan Wieck wrote: > > No matter how many different models you have in parallel, one single > transaction will be either a master, a slave or an isolated local thing. > The proposed changes allow to tell the session which of these three > roles it is playing and the triggers and rules can be configured to fire > during master/local role, slave role, always or never. That > functionality will work for master-slave as well as multi-master. > > Although my current plan isn't creating such a blended system, the > proposed trigger and rule changes are designed to support exactly that > in a 100% backward compatible way. > > Jan Fantastic! ...At some point you'll be thinking of the management end - turning it on or off, etc. That might be where these other points come more into play. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On 2/9/2007 2:19 PM, Andrew Hammond wrote: > On Feb 7, 8:12 pm, b...@momjian.us (Bruce Momjian) wrote: >> Jan Wieck wrote: >> > On 2/7/2007 10:35 PM, Bruce Momjian wrote: >> > > I find the term "logical proof of it's correctness" too restrictive. It >> > > sounds like some formal academic process that really doesn't work well >> > > for us. >> >> > Thank you. > > My intuition is that it might be possible to prove that _nothing_ can > provide guaranteed ordering when there is disconnected operation. As a matter of physics, for two events happening outside of the event horizon of each other, the question of which happened first is pointless. > However, I think that the clock based ordering Jan has described could > provide _probable_ ordering under disconnected operation. I can see > three variables in the equation that would determine the probability > of correctness for the ordering. That precisely is the intended functionality. And I can describe exactly when two conflicting actions will result in the "wrong" row persisting. This will happen when the second update to the logically same row is performed on a server whose Lamport timestamp lags behind by more than the time between the two conflicting commits. Example: A user fills out a form, submits, hits the back button, corrects the input and submits again within 3 seconds. Load balancing sends both requests to different servers and the first server is 3.0001 seconds ahead ... the user's typo will be the winner. My Lamport timestamp conflict resolution will not be able to solve this problem. However, when this happens, one thing is guaranteed. The update from the second server, arriving on the first for replication, will be ignored because a locally generated row is newer. This fact can be used as an indicator that there is a possible conflict that was resolved using the wrong data (business-process-wise). All nodes in the cluster will end up using the same wrong row, so at least they are consistently wrong. Nevertheless, being able to identify possible problem cases this way will make it possible to initiate further action, including but not limited to human intervention. If this is not an acceptable risk for the application, other resolution methods will be needed. But I think in many cases, this form of default resolution will be "good enough". Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
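A minimal sketch of that default resolution, with the "ignored incoming update" used as the conflict flag (illustrative Python; the dict layout and function name are assumptions of the sketch, not Slony or backend code):

    def apply_replicated_row(local_row, incoming_row):
        """Last-update-wins resolution keyed on the commit (Lamport) timestamp.

        Both arguments are dicts carrying a 'commit_ts' key; local_row may be
        None if the row does not exist locally yet.  Returns (winner, flagged):
        flagged is True exactly in the case described above -- the incoming
        update is ignored because the locally generated row is newer, which
        marks a conflict that may have been resolved against business intent.
        """
        if local_row is None or incoming_row["commit_ts"] > local_row["commit_ts"]:
            return incoming_row, False   # normal case: the remote change wins
        return local_row, True           # local row is newer: keep it, flag it

    # The web-form example: on the fast server, the typo row carries the higher
    # commit timestamp, so the replicated correction loses and gets flagged.
    winner, flagged = apply_replicated_row(
        {"val": "typo", "commit_ts": 1003.0001},      # committed on the server running 3.0001s ahead
        {"val": "corrected", "commit_ts": 1003.0})    # committed 3s later in real time, lagging clock
    assert winner["val"] == "typo" and flagged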
Richard Troy wrote: > On Fri, 9 Feb 2007, Andrew Dunstan wrote: > >> Richard Troy wrote: >> >>> In more specific terms, and I'm just brainstorming in public here, perhaps >>> we can use the power of Schemas within a database to manage such >>> divisions; commands which pertain to replication can/would include a >>> schema specifier and elements within the schema can be replicated one way >>> or another, at the whim of the DBA / Architect. For backwards >>> compatability, if a schema isn't specified, it indicates that command >>> pertains to the entire database. >>> >> I understand that you're just thinking aloud, but overloading namespaces >> in this way strikes me as awful. Applications and extensions, which are >> the things that have need of namespaces, should not have to care about >> replication. If we have to design them for replication we'll be on a >> fast track to nowhere IMNSHO. >> > > Well, Andrew, replication _is_ an application. Or, you could think of > replication as an extension to an application. No, I don't think of it as either. It's a utility, more an extension of the DBMS than of the application. You don't replicate for the sake of replicating. > I was under the impression > that_users_ decide to put tables in schema spaces based upon _user_ need, > and that Postgres developer's use of them for other purposes was > incroaching on user choices, not the other way around. That's exactly what you would be doing with this proposal, encroaching on what I regard as user space. > Either way, > claiming "need" like this strikes me as stuck-in-a-rut or dogmatic > thinking. Besides, don't we have schema nesting to help resolve any such > "care?" No. We do now have schema nesting, for this or any other purpose. Where did you get that idea? If we did I would not be so resistant to using them for this purpose, but as it is, if you hijack schemas for replication segregation you will detract from their more obvious use in name segregation. cheers andrew
I wrote: > We do now have schema nesting, for this or any other purpose. s/now/not/ (of course) cheers andrew
On 2/9/2007 3:25 PM, Andrew Dunstan wrote: > Richard Troy wrote: >> On Fri, 9 Feb 2007, Andrew Dunstan wrote: >> >>> Richard Troy wrote: >>> >>>> In more specific terms, and I'm just brainstorming in public here, perhaps >>>> we can use the power of Schemas within a database to manage such >>>> divisions; commands which pertain to replication can/would include a >>>> schema specifier and elements within the schema can be replicated one way >>>> or another, at the whim of the DBA / Architect. For backwards >>>> compatability, if a schema isn't specified, it indicates that command >>>> pertains to the entire database. >>>> >>> I understand that you're just thinking aloud, but overloading namespaces >>> in this way strikes me as awful. Applications and extensions, which are >>> the things that have need of namespaces, should not have to care about >>> replication. If we have to design them for replication we'll be on a >>> fast track to nowhere IMNSHO. >>> >> >> Well, Andrew, replication _is_ an application. Or, you could think of >> replication as an extension to an application. > > No, I don't think of it as either. It's a utility, more an extension of > the DBMS than of the application. You don't replicate for the sake of > replicating. > >> I was under the impression >> that_users_ decide to put tables in schema spaces based upon _user_ need, >> and that Postgres developer's use of them for other purposes was >> incroaching on user choices, not the other way around. > > That's exactly what you would be doing with this proposal, encroaching > on what I regard as user space. I'd never use a schema for that. Look at Slony-I. It uses a user-selectable schema for the stuff it needs to store in the database, to keep Slony objects separate from user objects. But it organizes the user tables into what are called sets. A set can be any combination of tables and sequences from any number of namespaces. If I were to do it again, I would use names instead of numbers to identify sets, and I would allow for summary sets containing groups of simple sets. On the other issue, replication is part of the overall infrastructure and thereby part of the solution to a business problem. Like the hardware, database, and application, it solves a piece of the puzzle that none of the others is particularly good at. It will perform better, or eventually not at all, depending on how much consideration was given to the distributed nature of the business model when the application was designed. Applications that "happened by accident" rather than being designed usually don't allow any kind of distribution. And yes, there are people who replicate for the sake of it. They are the kind of people who ask for sync multi-master no matter what their actual problem might be, because it sounds most prestigious. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Friday 09 February 2007, Jan Wieck wrote: > I am not sure, I would have to look at what exactly that hook provides. > The key to a Lamport timestamp is that it is advancing it commit order > (plus some other things ... of course). If the hook can guarantee that > the calls are made always in commit order, serialized without any race > condition possible, it would probably be suitable. Actually what we do is a bit stronger. We use the commit hook to enforce an externally defined commit order. In our case, this is defined by a group communication protocol, which is even allowed to reorder a pair of transactions originating from the same replica. Therefore, achieving a commit order that is consistent with a local clock should be straightforward. Regards, -- Jose Orlando Pereira
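One way to picture the hook José describes is as a gate that releases commits in an externally supplied total order (an illustrative Python sketch under that assumption; the actual GORDA interface is not shown in this thread):

    import threading

    class OrderedCommitGate:
        """Release commits strictly in an externally supplied total order.

        The group communication layer (not shown) is assumed to hand every
        transaction a global sequence number.  wait_for_turn() blocks until all
        lower-numbered transactions have passed through, so whatever the commit
        hook does next -- such as taking a Lamport/commit timestamp -- happens
        in exactly that order.
        """

        def __init__(self):
            self._cond = threading.Condition()
            self._next_seq = 0

        def wait_for_turn(self, seq):
            with self._cond:
                while seq != self._next_seq:
                    self._cond.wait()

        def done(self, seq):
            with self._cond:
                self._next_seq = seq + 1
                self._cond.notify_all()

    # Inside a hypothetical commit hook:
    #   gate.wait_for_turn(seq)   # seq comes from the total-order delivery
    #   ...take the commit timestamp, write it wherever it needs to go...
    #   gate.done(seq)

Under that scheme the timestamp source never sees out-of-order callers, which is exactly the serialization guarantee Jan said the hook would need to provide.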