Re: Synchronization levels in SR - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Synchronization levels in SR
Date
Msg-id 4C0606C2.4090901@2ndquadrant.com
Whole thread Raw
In response to Re: Synchronization levels in SR  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Synchronization levels in SR  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Re: Synchronization levels in SR  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Heikki Linnakangas wrote:
> The possibilities are endless... Your proposal above covers a pretty 
> good set of scenarios, but it's by no means complete. If we try to 
> solve everything the configuration will need to be written in a 
> Turing-complete Replication Description Language. We'll have to pick a 
> useful, easy-to-understand subset that covers the common scenarios. To 
> handle the more exotic scenarios, you can write a proxy that sits in 
> front of the master, and implements whatever rules you wish, with the 
> rules written in C.

I was thinking about this a bit recently.  As I see it, there are three 
fundamental parts of this:

1) We have a transaction that is being committed.  The rest of the 
computations here are all relative to it.

2) There is an (internal?) table that lists the state of each 
replication target relative to that transaction.  It would include the 
node name, perhaps some metadata ('location' seems the one that's most 
likely to help with the remote data center issue), and a state code.  
The codes from http://wiki.postgresql.org/wiki/Streaming_Replication 
work fine for the last part (which is the only dynamic one--everything 
else is static data being joined against):

async=hasn't received yet
recv=been received but just in RAM
fsync=received and synced to disk
apply=applied to the database

These would need to be enums so they can be ordered from lesser to 
greater consistency.

So in a 3 node case, the internal state table might look like this after 
a bit of data had been committed:

node | location | state
----------------------------------
a | local    | fsync 
b | remote | recv
c | remote | async

This means that the local node has a fully persistent copy, but the best 
either remote one has done is received the data, it's not on disk at all 
yet at the remote data center.  Still working its way through.

3) The decision about whether the data has been committed to enough 
places to be considered safe by the master is computed by a function 
that is passed this internal table as something like a SRF, and it 
returns a boolean.  Once that returns true, saying it's satisfied, the 
transaction closes on the master and continues to percolate out from 
there.  If it's false, we wait for another state change to come in and 
return to (2).

I would propose that most behaviors someone has expressed as being their 
desired implementation is possible to implement using this scheme. 

-Semi-sync commit:  proceed as soon somebody else has a copy and hope 
the copies all become consistent:  EXISTS WHERE state>=recv
-Don't proceed until there's a fsync'd commit on at least one of the 
remote nodes:  EXISTS WHERE location='remote' AND state>=fsync
-Look for a quorum of n commits of fsync quality:  CASE WHEN (SELECT 
COUNT(*) WHERE state>=fsync)>n THEN true else FALSE end;

Syntax is obviously rough but I think you can get the drift of what I'm 
suggesting.

While exposing the local state and running this computation isn't free, 
in situations where there truly are remote nodes in here being 
communicated with the network overhead is going to dwarf that.  If there 
were a fast path for the simplest cases and this complicated one for the 
rest, I think you could get the fully programmable behavior some people 
want using simple SQL, rather than having to write a new "Replication 
Description Language" or something so ambitious.  This data about what's 
been replicated to where looks an awful lot like a set of rows you can 
operate on using features already in the database to me.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



pgsql-hackers by date:

Previous
From: Takahiro Itagaki
Date:
Subject: Re: [BUGS] BUG #5487: dblink failed with 63 bytes connection names
Next
From: Fujii Masao
Date:
Subject: obsolete comments in xlog.c