Proposed project: Multimaster, synchronous replication system - Mailing list pgsql-general
From:    Chris Travers
Subject: Proposed project: Multimaster, synchronous replication system
Msg-id:  008701c3b4a7$d249b5d0$2d00053d@SAMUEL
List:    pgsql-general
Hi all;

I may be able to do this in Perl, but if there is enough interest in doing something like this in C, maybe I can still help (however, consider yourself warned about my skill at coding C). I am looking at the following design.

The clustering daemons would need to run either on separate computer systems or on separate interfaces of the same database servers. The program would basically pretend to be the postmaster and intercept requests, which would then be sent to each of the clustered servers. Initially I had thought about filtering out SELECT queries and running them on only one system, but the possibility of rules, stored procedures, etc. means that I could not be 100% sure that filtering out SELECT queries would leave all the databases in identical states. Therefore SELECT queries would be run against all servers, though perhaps this could be admin-configurable. For the first version I would suggest round-robin DNS load balancing, but eventually I would like to include a protocol for IP address sharing as well.

Error and database state handling:

1: Each clustering service would need to maintain a database of the state and locations of:
   a: Other clustering services
   b: Database servers

A separate daemon would run on each system running the clustering daemon. This state management daemon would be responsible for asynchronously handling error messages received from peers. Upon receiving update requests, it would update the state database, which would be read on each new connection or on error.

When the clustering daemon receives two different results in the first phase of the two-phase commit (probably a simple "SELECT TRUE;"), it uses the following logic:

1) If there are more successes than failures, commit the transaction and proceed with error handling for those servers which did not succeed.
2) If there are more failures than successes, roll back the transaction and proceed with error handling for those servers which did succeed.
3) If exactly half (i.e.
1 out of 2) failed, assume failures are less common than successes, commit the transaction, and proceed with error handling against the server(s) which did not succeed.

Error handling is handled as:

1: Reread the state database and eliminate any newly duplicated failures.
2: Notify the other clustering servers' state management daemons of the failure so that the other state databases remain in sync.

Recovery:

Recovery is tricky because we don't know the extent or cause of the problem. The only way to ensure that the databases are synchronized is to drop, recreate, dump, and reload the database, applying all transactions in a serialized way. Here is my proposed algorithm:

1) Have a new recovery daemon on each database server. It contacts one of the clustering servers for permission to restore.
2) The cluster recovery coordinator daemon on that clustering server contacts all other clustering servers, requesting that all current transactions be committed and all new transactions be handled via the recovery system (step 4 below). New transactions cannot be committed until permission has been fully confirmed and acknowledged.
3) Once permission has been granted (all transactions going through recovery), the recovery daemon makes a connection with pg_dump and begins a restore of the database. It then acknowledges permission, allowing new transactions to be committed.
4) For databases under recovery, the clustering daemon sends the SQL query along with a transaction ID consisting of the MAC address of the interface (or some other unique ID) and a serial number starting from 0. Any COMMIT or END TRANSACTION statements are given a timestamp of 'now' in order to facilitate serialization. These statements are queued in a recovery schema until the database has been restored.
5) Once pg_restore has finished, the queued recovery statements are executed in the order of transaction commitment. Any failure of any transaction causes the whole recovery to fail and the database to return to a "Failed" state.
6) Once the queue has no more committed transactions in it, the recovery daemon notifies the state management daemons that it is now online.

Caveats: Uncommitted transactions during recovery could kill the performance of the entire cluster. There are a few other areas where performance could be a problem as well.

Any thoughts, suggestions, etc.?

Chris Travers
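The commit/rollback vote described above collapses to a single rule: commit unless a strict majority of servers failed the first phase. A minimal sketch follows (the post proposes Perl or C; Python is used here only for brevity, and `decide_commit` is a hypothetical name, not part of the proposal):

```python
def decide_commit(successes: int, failures: int) -> bool:
    """Decide the second phase from first-phase votes.

    More successes than failures -> commit; more failures -> roll back;
    an exact tie (e.g. 1 out of 2) -> assume failures are the anomaly
    and commit. Error handling then targets the minority side.
    """
    if successes > failures:
        return True   # commit; error handling for the servers that failed
    if failures > successes:
        return False  # roll back; error handling for the servers that succeeded
    return True       # tie: commit, error handling for the servers that failed
```

For example, `decide_commit(1, 1)` commits, which is the 1-out-of-2 tie case the post singles out.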
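The recovery replay described above (queue statements with a node ID + serial transaction ID, timestamp each commit, then re-execute in commit order after pg_restore, failing the whole recovery on any error) can be sketched as follows. `QueuedTxn`, `replay_queue`, and the `execute` callback are hypothetical names for illustration, not part of the proposal:

```python
from dataclasses import dataclass


@dataclass
class QueuedTxn:
    txn_id: tuple        # (node unique ID, e.g. a MAC address, serial starting at 0)
    statements: list     # SQL statements queued while the node was under recovery
    commit_time: float   # timestamp assigned at COMMIT, used to serialize replay


def replay_queue(queue, execute):
    """Replay committed transactions in commit-timestamp order.

    `execute` runs one SQL statement against the freshly restored
    database; any failure aborts the whole recovery and the node
    returns to a "Failed" state, per the proposal.
    """
    for txn in sorted(queue, key=lambda t: t.commit_time):
        try:
            for stmt in txn.statements:
                execute(stmt)
        except Exception:
            return "Failed"
    return "Online"
```

Sorting by `commit_time` is what gives the serialized order the post asks for, regardless of the order in which statements were queued.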