Thread: A Replication Idea

A Replication Idea

From
Orion Henry
Date:
I've been thinking about replication and wanted to throw out an idea to see 
how fast it gets torn apart.  I'm sure the problem can't be this easy but I 
can't think of why.

Ok... Let's say you have two fresh databases, both empty.  You set up a 
postgres proxy for them.  The proxy works like this:
       It listens on port 5432.       It pools connections to both real databases.       It is very simple just
forwardingrequests and responses back and forth               between client and server.  A client can connect to the
proxy               and not be able to tell that it is not an actual postgres                database.       When
connectionsare made to it, it proxys connections to both back-end               databases.       If an
insert/update/delete/DDLcommand comes, it forwards it to both               machines.       If a query comes down the
lineit forwards it to one machine or the                other.       If one of the machines goes offline or is not
respondingthe proxy                queues up all update transactions intended for it and stops
forwardingqueries to it until it comes back online and all                queued transactions have been committed.
A new machine can be inserted to the cluster.  When the proxy is alerted               to this, it's first
communicationwould be to pgdumpall() one of               the functional databases and pipe it to the new one.  At that
              moment, it is considered an unreachable database and all update               transactions are queued for
whenthe dump/rebuild is complete.       If a machine dies in catastrophic failure it can be removed from the
  cluster, and once the machine is fixed, re-inserted as per                above.       If there were some SQL command
fordetermining the load a machine is                experiencing the proxy could intelligently balance the load
      to the machines in the cluster that can handle it.       If the proxy were to fail, clients could safely connect
toone of the                back end databases in read-only mode until the proxy came back                up.       The
proxywould store a log of incomplete transactions in some kind of                presistant storage for all the
databasesit's connected to, so                should it die, it can resume right where it left off assuming
  the log is intact.
 

With the proxy set up like this you could connect to it as though it were a 
database, upload your current data and schema and get most all the benifits 
of clustering.

With this setup could achieve load balancing, fail-over, master-master 
replication, master-slave replication, hot swap servers, dynamic addition 
and removal of servers and HA-like clustering.  The only thing it does not 
do is partition data across servers.  The only assumption I am aware of 
that I am making is that two identical databases, given the same set of 
arbitrary transactions will end up being the same.  The only single point 
of failure in this system would be the proxy itself.  A modification to the 
postgres client software could allow automatically fail-over to read-only 
connections with one of the back-end databases.  Also, the proxy could be 
run on a router or other diskless system.  I haven't really thought about 
it, but it may even be possible to use current HA technology and run a pool 
of failover proxy's.

If the proxy ended up NOT slowing the performance of a standalone, 
single-system server, it could become the default connection method to 
PostgreSQL such that a person could do an out-of-the-box install of the 
database and a year later realize they really wanted a cluster, they could 
hot-add a server without even restarting the database.  

So, long story short, I'd like to get people's comments on this.  If it 
won't/can't work or has been tried before, I want to hear about it before I 
start coding. ;)
       Orion



Re: A Replication Idea

From
Sean Chittenden
Date:
>         It listens on port 5432.
>         It pools connections to both real databases.

Check out pgsql-replication.  They're doing something more complex,
but will get you the multi-master model that you're proposing.  It's
using a reliable multicast model based on the spread toolkit
(spread.org).



As for your idea, it sounds really good, but has two pit-falls that I
know of:

1) TCP latency could intorduce race conditions and data syncronization  problems.

2) transaction WAL log syncronization.

-sc

-- 
Sean Chittenden


Re: A Replication Idea

From
Darren Johnson
Date:
>I've been thinking about replication and wanted to throw out an idea to see 
>how fast it gets torn apart.  I'm sure the problem can't be this easy but I 
>can't think of why.
>
I have some comments/questions to share.  If you are proposing SQL based 
replication (the statements
get planned, parsed, and executed on all replicas) how can you guarantee 
each replica will stay synchronized?
When it comes to executing a set of commands in a transactions, which 
could kick off triggers or call
stored procedures, or functions, how does the proxy know each data 
change in the transaction was
successful?  

While having an advantage of being outside of the core postgres code, 
you would not be affected
by constant changes, so development/integration would be less intrusive. OTOH things like conflict
resolution or avoidance in a multi master scenario are much more 
difficult to handle in your proxy
approach.

>   
>So, long story short, I'd like to get people's comments on this.  If it 
>won't/can't work or has been tried before, I want to hear about it before I 
>start coding. ;)
>
We did some research a while back, and you might find some of the 
information useful within...

http://gborg.postgresql.org/genpage?replication_research


If you're interested, maybe we could collaborate,

Darren




Re: A Replication Idea

From
"Zeugswetter Andreas SB SD"
Date:
> The only assumption I am aware of
> that I am making is that two identical databases, given the same set of
> arbitrary transactions will end up being the same.

Unfortunately I do not think you can make that assumption. Since
response time of the two machines will be different you might end up
generating different data on different servers.

>         If an insert/update/delete/DDL command comes, it forwards it to both
>                 machines.
>         If a query comes down the line it forwards it to one machine or the
>                 other.

Also in PostgreSQL you cannot easily distinguish between read-only
and read-write statements. Not without looking at the query plan.

Thus imho generic replication cannot be achieved by plugging something between
clients and the backend.

Andreas