Proposed project: Multimaster, synchronous replication system - Mailing list pgsql-general

From Chris Travers
Subject Proposed project: Multimaster, synchronous replication system
Date
Msg-id 008701c3b4a7$d249b5d0$2d00053d@SAMUEL
Whole thread Raw
List pgsql-general
Hi All;

I may be able to do this in Perl, but if there is enough interest in doing
something like this in C, maybe I can still help (however, consider your
self warned about my skill at coding C).

I am looking at the following design.  The clustering daemon either need to
run on separate computer systems or separate interfaces of the same database
servers.  The program would basically pretend to be Postmaster and intercept
the requests.  These requests would be sent to each of the clustered
servers.

Initially I had thought about filtering out the Select queries and running
them only on one system, but the possibility of having rules, stored
proceedures, etc. means that I could not be 100% sure that filtering out
select queries would leave all the databases in identical states.  Therefore
the select queries would be run against all servers.  Perhaps this could be
admin-configurable though.

For the first version, I would suggest using round robin dns load balancing,
but eventually I would like to include a protocol for IP address sharing as
well.

Error and Database State Handling:
1:  Each clusterign service would need to maintain a database of the state
and locations of:
    a: Other clustering services
    b: Database servers.

A separate daemon would run on each system running the clustering daemon.
This state management daemon would be responsible for asynchronously
handling error messages recieved from peers.  Upon recieving update
requests, it would update the state database, which would be read on each
new connection or on error.

When the clustering daemon recieves two different results to the first phase
of the TPC (probably a simple "SELECT TRUE;"), it uses the following logic:
1)  If there are more successes than failures, commit the transaction, and
proceed with error handling for those servers which did not succeed.
2)  If more failures than successes, rollback the transaction, and proceed
with error handling for those servers who did succeed.
3) If exactly half (i.e. 1 out of 2) failed, assume failures are less common
than successes, commit transaction, and proceed with error handling against
he server(s) which did not succeed.

Error handling is handled as:
1: Reread state database and eliminate any new duplicate failures.
2: Notify other clustering servers' state management daemons of the failure
so that the other state databases remain in sync.


Recovery:
Recovery is tricky because we don't know the extent or cause of the problem.
The only way to ensure that the databases are synchronized is to drop,
recreate, dump, and reload the database, applying all transactions in a
serialized way.  Here is my proposed algorythm:
1)  Have a new recovery daemon on each database server.  It contacts one of
the clustering servers for permission to restore.
2)  The cluster recovery coordinator daemon on the clustering server
contacts all other clustering servers requesting that all current
transactions be committed, and all new transactions be handled via recovery
system (4 below).  New transactions cannot be committed until permission has
been fully confirmed and acknowledged.
3)  One permission has been granted (all transactions going through
recovery), the recovery daemon makes a connection with pg_dump and begins a
restore of the database.  It then acknowledges permission, allowing new
transactions to be committed.
4)  For databases under recovery, the clustering daemon sends the SQL Query
and Transaction ID consisting of the MAC address of the interface (or some
other unique ID), and a serial number starting from 0.  Any commit or end
transaction statements are given a timestamp of 'now' in order to facilitate
serialization.  These statements are queued in a recovery schema until the
database has been restored.
4)  Once the database pg_restore has finished, the recovery statements are
executed in the order of transaction commitment.  Any failure of any
transaction causes the whole recovery to fail and the database to return to
a "Failed" state.
5)  Once the queue has no more committed transactions in it, the recovery
daemon notifies the state management daemons that it is now online.

Caveats:
Uncommitted transactions during recovery could kill the performance of the
entire cluster.

There are a few other areas where performance could be a problem as well.

Any thoughts, suggestions, etc?
Chris Travers


pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: disaster recovery
Next
From: "Chris Travers"
Date:
Subject: PostgreSQL, MySQL, etc., was Re: PostgreSQL is much faster than MySQL, only when...