Re: Proposal: Commit timestamp - Mailing list pgsql-hackers

From Markus Schiltknecht
Subject Re: Proposal: Commit timestamp
Msg-id 45C82FFF.1010605@bluegap.ch
In response to Re: Proposal: Commit timestamp  (Theo Schlossnagle <jesus@omniti.com>)
Responses Re: Proposal: Commit timestamp  ("Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at>)
List pgsql-hackers
Hi,

Theo Schlossnagle wrote:
> On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:
>> Obviously the counters will immediately drift apart based on the 
>> transaction load of the nodes as soon as the network goes down. And in 
>> order to avoid this "clock" confusion and wrong expectation, you'd 
>> rather have a system with such a simple, non-clock based counter and 
>> accept that it starts behaving totally wonky when the cluster 
>> reconnects after a network outage? I rather confuse a few people than 
>> having a last update wins conflict resolution that basically rolls 
>> dice to determine "last".
> 
> If your cluster partition and you have hours of independent action and 
> upon merge you apply a conflict resolution algorithm that has enormous 
> effect undoing portions of the last several hours of work on the nodes, 
> you wouldn't call that "wonky?"

You are talking about different things. Async replication, as Jan is 
planning to do, is per se "wonky", because by definition you have to 
cope with conflicts. And you have to resolve them by late-aborting a 
transaction (i.e. after a commit). Or, to put it another way: async MM 
replication means continuing in disconnected mode (w/o quorum or some 
such) and trying to reconcile later on. It should not matter whether the 
delay is just a few milliseconds of network latency or three days 
(except, of course, that you probably have more data to reconcile).

> For sane disconnected (or more generally, partitioned) operation in 
> multi-master environments, a quorum for the dataset must be 
> established.  Now, one can consider the "database" to be the dataset.  
> So, on network partitions those in "the" quorum are allowed to progress 
> with data modification and others only read.

You can do this to *prevent* conflicts, but that clearly belongs to the 
world of sync replication. I'm doing this in Postgres-R: in case of 
network partitioning, only a primary partition may continue to process 
writing transactions. For async replication, it does not make sense to 
prevent conflicts when disconnected. Async is meant to cope with 
conflicts, so as to be independent of network latency.
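
To make the primary-partition rule above concrete, here is a minimal Python sketch. It assumes a strict-majority rule over a fixed, known node list; this message does not say how Postgres-R actually picks its primary partition, and the names is_primary_partition, reachable and all_nodes are made up for illustration.

```python
from typing import Iterable


def is_primary_partition(reachable: Iterable[str], all_nodes: Iterable[str]) -> bool:
    """Return True if the reachable nodes form a strict majority of the cluster.

    Assumption: "primary partition" means "more than half of all known nodes".
    A tie (exactly half) is not a majority, so neither side may write.
    """
    members = set(all_nodes)
    # Ignore anything reachable that is not a known cluster member.
    present = set(reachable) & members
    return 2 * len(present) > len(members)


def may_process_write(reachable: Iterable[str], all_nodes: Iterable[str]) -> bool:
    """Writing transactions are only allowed inside the primary partition."""
    return is_primary_partition(reachable, all_nodes)
```

With three nodes, a partition that still sees two of them keeps accepting writes, while the isolated node falls back to read-only. With four nodes split two against two, neither side qualifies under this rule.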

> However, there is no 
> reason why the dataset _must_ be the database and that multiple datasets 
> _must_ share the same quorum algorithm.  You could easily classify 
> certain tables or schema or partitions into a specific dataset and apply 
> a suitable quorum algorithm to that and a different quorum algorithm to 
> other disjoint data sets.

I call that partitioning (among nodes). It's applicable to sync as 
well as async replication, though it makes more sense in sync replication.
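
As a rough illustration of the per-dataset idea quoted above, the following Python sketch maps each dataset to its own quorum rule. Everything here is hypothetical: the dataset names, the node names and the two rules (strict majority, and "a designated node must be reachable") are assumptions chosen for the example, not something proposed in this thread.

```python
from typing import Callable, Dict, List, Set

# A quorum rule decides, from the reachable nodes and the dataset's member
# nodes, whether this partition may modify the dataset.
QuorumRule = Callable[[Set[str], Set[str]], bool]


def majority(reachable: Set[str], members: Set[str]) -> bool:
    """Strict majority of the dataset's member nodes is reachable."""
    return 2 * len(reachable & members) > len(members)


def requires_node(node: str) -> QuorumRule:
    """Build a rule that only allows writes where a designated node is reachable."""
    def rule(reachable: Set[str], members: Set[str]) -> bool:
        return node in reachable and node in members
    return rule


# Hypothetical classification of tables into datasets with different rules.
DATASET_MEMBERS: Dict[str, Set[str]] = {
    "accounts": {"n1", "n2", "n3"},
    "sessions": {"n1", "n2", "n3"},
}
DATASET_RULES: Dict[str, QuorumRule] = {
    "accounts": majority,
    "sessions": requires_node("n3"),
}


def writable_datasets(reachable: Set[str]) -> List[str]:
    """List the datasets this partition may modify; the rest stay read-only."""
    return sorted(
        name
        for name, rule in DATASET_RULES.items()
        if rule(reachable, DATASET_MEMBERS[name])
    )
```

In this sketch a partition containing n1 and n2 may write "accounts" but not "sessions", while the partition containing only n3 may write "sessions" but not "accounts".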

What I'm more concerned about, with Jan's proposal, is the assumption 
that you always want to resolve conflicts by time (except for balances, 
for which we don't have much information, yet). I'd rather say that time 
does not matter much if your nodes are disconnected. And (especially in 
async replication) you should prevent your clients from committing to 
one node and then reading from another, expecting to find their data 
there. So why resolve by time? It only makes the user think you could 
guarantee that order, but you certainly cannot.
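
The point about ordering can be shown with a small Python sketch. It assumes two disconnected nodes whose local clocks differ by a fixed skew and a plain "highest commit timestamp wins" rule; the Update type, the skew values and the function names are invented for the example and are not taken from Jan's proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    node: str
    value: str
    true_time: float    # real order of events, which the system cannot observe
    local_stamp: float  # commit timestamp taken from the node's own clock


# Assumed clock offsets in seconds: node B runs five seconds behind node A.
CLOCK_SKEW = {"A": 0.0, "B": -5.0}


def commit(node: str, value: str, true_time: float) -> Update:
    """Record an update, stamped with the committing node's (skewed) clock."""
    return Update(node, value, true_time, true_time + CLOCK_SKEW[node])


def last_update_wins(a: Update, b: Update) -> Update:
    """Resolve a conflict by keeping the update with the larger timestamp."""
    return a if a.local_stamp >= b.local_stamp else b


first = commit("A", "written first", true_time=100.0)
second = commit("B", "written second", true_time=102.0)
winner = last_update_wins(first, second)

# The update that really happened later loses, because B's clock lags.
print(winner.value, "| really last:", max(first, second, key=lambda u: u.true_time).value)
```

Here the resolver keeps "written first" even though the update on node B was committed two seconds later in real time, so the timestamp suggests an order that the system cannot actually guarantee.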

Regards

Markus


