Thread: Replication
I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it works like the whole rserv project. I don't like it. OK, what the hell do we need to do to get PostgreSQL replicating?
On Mon, 4 Feb 2002, mlw wrote: I've developed a replacement for Rserv and we are planning on releasing it as open source(ie as a contrib module). Like Rserv its trigger based but its much more flexible. The key adventages it has over Rserv is that it has -Support for multiple slaves -It Perserves transactions while doing the mirroring. Ie If rows A,B are originally added in the same transaction they will be mirrored in the same transaction. We have plans on adding filtering based on data/selective mirroring as well. (Ie only rows with COUNTRY='Canada' go to slave A, and rows with COUNTRY='China' go to slave B). But I'm not sure when I'll get to that. Support for conflict resolution(If allow edits to be made on the slaves) would be nice. I hope to be able to send a tarball with the source to the pgpatches list within the next few days. We've been using the system operationally for a number of months and have been happy with it. > I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it > works like the whole rserv project. I don't like it. > OK, what the hell do we need to do to get PostgreSQL replicating? > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Steven Singer ssinger@navtechinc.com Aircraft Performance Systems Phone: 519-747-1170 ext 282 Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR Waterloo, Ontario ARINC: YKFNSCR
> > OK, what the hell do we need to do to get PostgreSQL replicating? I hope you understand that replication, done right, is a massive project. I know that Darren any myself (and the rest of the pg-repl folks) have been waiting till 7.2 went gold till we did anymore work. I think we hope to have master / slave replicatin working for 7.3 and then target multimaster for 7.4. At least that's the hope. - Brandon ----------------------------------------------------------------------------c: 646-456-5455 h: 201-798-4983b. palmer, bpalmer@crimelabs.net pgp:crimelabs.net/bpalmer.pgp5
bpalmer wrote: > > > > > OK, what the hell do we need to do to get PostgreSQL replicating? > > I hope you understand that replication, done right, is a massive > project. I know that Darren any myself (and the rest of the pg-repl > folks) have been waiting till 7.2 went gold till we did anymore work. I > think we hope to have master / slave replicatin working for 7.3 and then > target multimaster for 7.4. At least that's the hope. I do know how hard replication is. I also understand how important it is. If you guys have a project going, and need developers, I am more than willing.
Steven wrote: > > On Mon, 4 Feb 2002, mlw wrote: > > I've developed a replacement for Rserv and we are planning on releasing > it as open source(ie as a contrib module). > > Like Rserv its trigger based but its much more flexible. > The key adventages it has over Rserv is that it has > -Support for multiple slaves > -It Perserves transactions while doing the mirroring. Ie If rows A,B are > originally added in the same transaction they will be mirrored in the same > transaction. I did a similar thing. I took the rserv trigger "as is," but rewrote the replication support code. What I eventually did was write a "snapshot daemon" which created snapshot files. Then a "slave daemon" which would check the last snapshot applied and apply all the snapshots, in order, as needed. One would run one of these daemons per slave server.
DBMirror doesn't use snapshot's instead it records a log of transactions that are committed to the database in a pair of tables. In the case of an INSERT this is the row that is being added. In the case of a delete the primary key of the row being deleted. And in the case of an UPDATE, the primary key before the update along with all of the data the row should have after an update. Then for each slave database a perl script walks though the transactions that are pending for that host and reconstructs SQL to send the row edits to that host. A record of the fact that transaction Y has been sent to host X is also kept. When transaction X has been sent to all of the hosts that are in the system it is then deleted from the Pending tables. I suspect that all of the information I'm storing in the Pending tables is also being stored by Postgres in its log but I haven't investigated how the information could be extracted(or how long it is kept for). That would reduce the extra storage overhead that the replication system imposes. As I remember(Its been a while since I've looked at it) RServ uses OID's in its tables to point to the data that needs to be replicated. We tried a similar approach but found difficulties with doing partial updates. On Mon, 4 Feb 2002, mlw wrote: > I did a similar thing. I took the rserv trigger "as is," but rewrote the > replication support code. What I eventually did was write a "snapshot daemon" > which created snapshot files. Then a "slave daemon" which would check the last > snapshot applied and apply all the snapshots, in order, as needed. One would > run one of these daemons per slave server. -- Steven Singer ssinger@navtechinc.com Aircraft Performance Systems Phone: 519-747-1170 ext 282 Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR Waterloo, Ontario ARINC: YKFNSCR
On Mon, 4 Feb 2002, mlw wrote: > I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it > works like the whole rserv project. I don't like it. > > OK, what the hell do we need to do to get PostgreSQL replicating? The trigger model is not a very sophisticated one. I think I have a better -- though more complicated -- one. This model would be able to handle multiple masters and master->slave. First of all, all machines in the cluster would have to be aware all the machines in the cluster. This would have to be stored in a new system table. The FE/BE protocol would need to be modified to accepted parsed node trees generated by pg_analyze_and_rewrite(). These could then be dispatched by the executing server, inside of pg_exec_query_string, to all other servers in the cluster (excluding itself). Naturally, this dispatch would need to be non-blocking. pg_exec_query_string() would need to check that nodetags to make sure selects and perhaps some commands are not dispatched. Before the executing server runs finish_xact_command(), it would check that the query was successfully executed on all machines otherwise abort. Such a system would need a few configuration options: whether or not you abort on failed replication to slaves, the ability to replicate only certain tables, etc. Naturally, this would slow down writes to the system (possibly a lot depending on the performance difference between the executing machine and the least powerful machine in the cluster), but most usages of postgresql are read intensive, not write. Any reason this model would not work? Gavin
Gavin Sherry wrote: > Naturally, this would slow down writes to the system (possibly a lot > depending on the performance difference between the executing machine and > the least powerful machine in the cluster), but most usages of postgresql > are read intensive, not write. > > Any reason this model would not work? What, then is the purpose of replication to multiple masters? I can think of only two reasons why you want replication. (1) Redundancy, make sure that if one server dies, then another server has the same data and is used seamlessly. (2) Increase performance over one system. In reason (1) I submit that a server load balance which sits on top of PostgreSQL, and executes writes on both servers while distributing reads would be best. This is a HUGE project. The load balancer must know EXACTLY how the system is configured, which includes all functions and everything. In reason (2) your system would fail to provide the scalability that would be needed. If writes take a long time, but reads are fine, what is the difference between the trigger based replicator? I have in the back of my mind, an idea of patching into the WAL stuff, and using that mechanism to push changes out to the slaves. Where one machine is still the master, but no trigger stuff, just a WAL patch. Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure exactly, the idea hasn't completely formed yet.
What you describe sounds like a form of a two-stage commit protocol. If the command worked on two of the replicated databases but failed on a third then the executing server would have to be able to undo the command on the replicated databases as well as itself. The problems with two stage commit type approches to replication are 1) Speed as you mentioned. Write speed isn't a concern for some applications but it is very important in others. and 2) All of the databases must be able to communicate with each other at all times in order for any edits to work. If the servers are connected over some sort of WAN that periodically has short outages this is a problem. Also if your using replication because you want to be able to take down one of the databases for short periods of time without bringing down the others your in trouble. btw: I posted the alternative to Rserv that I mentioned the other day to the pg-patches mailing list. If anyone is intreasted you should be able to grab it off the archives. On Thu, 7 Feb 2002, Gavin Sherry wrote: > > First of all, all machines in the cluster would have to be aware all the > machines in the cluster. This would have to be stored in a new system > table. > > The FE/BE protocol would need to be modified to accepted parsed node trees > generated by pg_analyze_and_rewrite(). These could then be dispatched by > the executing server, inside of pg_exec_query_string, to all other servers > in the cluster (excluding itself). Naturally, this dispatch would need to > be non-blocking. > > pg_exec_query_string() would need to check that nodetags to make sure > selects and perhaps some commands are not dispatched. > > Before the executing server runs finish_xact_command(), it would check > that the query was successfully executed on all machines otherwise > abort. Such a system would need a few configuration options: whether or > not you abort on failed replication to slaves, the ability to replicate > only certain tables, etc. > > Naturally, this would slow down writes to the system (possibly a lot > depending on the performance difference between the executing machine and > the least powerful machine in the cluster), but most usages of postgresql > are read intensive, not write. > > Any reason this model would not work? > > Gavin > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Steven Singer ssinger@navtechinc.com Aircraft Performance Systems Phone: 519-747-1170 ext 282 Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR Waterloo, Ontario ARINC: YKFNSCR
I'm not that familiar with the whole replication issues in PostgreSQL, however, I would be partial to replication that was based upon the playback of the (a?) journal file. (I believe that the WAL is a journal file.) By being based upon a journal file, it would be possible to accomplish two significant items. First, it would be possible to "restore" a database to an exact state just before a failure. Most commercial databases provide the ability to do this. Banks, etc. log the journal files directly to tape to provide a complete transaction history such that they can rebuild their database from any given snapshot. (Note that the journal file needs to be "editable" as a failure may be "delete from x" with a missing where clause.) This leads directly into the second advantage, the ability to have a replicated database operating anywhere, over any connection on any server. Speed of writes would not be a factor. In essence, as long as the replicated database had a snapshot of the database and then was provided with all journal files since the snapshot, it would be possible to build a current database. If the replicant got behind in the processing, it would catch up when things slowed down. In my opionion, the first advantage is in many ways most important. Replication becomes simply the restoration of the database in realtime on a second server. The "replication" task becomes the definition of a protocol for distributing the journal file. At least one major database vendor does replication (shadowing) in exactly this mannor. Maybe I'm all wet and the journal file and journal playback already exists. If so, IMHO, basing replication off of this would be the right direction. On Thu, 07 Feb 2002 07:52:23 EST, mlw wrote: > > I have in the back of my mind, an idea of patching into the WAL stuff, and > using that mechanism to push changes out to the slaves. > > Where one machine is still the master, but no trigger stuff, just a WAL patch. > Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure > exactly, the idea hasn't completely formed yet. >
>> The problems with two stage commit type approches to replication are IMHO the biggest problem with two phased commit is it doesn't scale. The more servers you add to the replica the slower it goes. Also there's the potential for dead locks across server boundaries. >> 2) All of the databases must be able to communicate with each other at> all times in order for any edits to work. Ifthe servers are> connected over some sort of WAN that periodically has short outages this> is a problem. Also if yourusing replication because you want to be able> to take down one of the databases for short periods of time without> bringing down the others your in trouble. All true for two phased commit protocol. To have multi master replication, you must have all systems communicating, but you can use a multicast group communication system instead of 2PC. Using total order messaging, you can ensure all changes are delivered to all servers in the replica in the same order. This group communication system also allows failures to be detected while other servers in the replica continue processing. A few of us are working with this theory, and trying to integrate with 7.2. There is a working model for 6.4, but its very limited. (insert, update, and deletes) We are currently hosted at http://gborg.postgresql.org/project/pgreplication/projdisplay.php But the site has been down the last 2 days. I've contacted the web master, but haven't seen any results yet. If any one knows what going on with gborg, I'd appreciate a status. Darren
Darren, Given that different replication strategies will probably be developed for PG, do you envisage DBAs to be able to select the type of replication for their installation? I.e. Replication being selectable rther like storage structures? Would be a killer bit of flexibility, given how enormous the impact of replication will be to corporate adoption of PG. Brad >>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<< On 2/8/02, 5:29:22 AM, Darren Johnson <darren.johnson@cox.net> wrote regarding Re: [HACKERS] Replication: > > > > The problems with two stage commit type approches to replication are > IMHO the biggest problem with two phased commit is it doesn't scale. > The more servers > you add to the replica the slower it goes. Also there's the potential > for dead locks across > server boundaries. > > > > 2) All of the databases must be able to communicate with each other at > > all times in order for any edits to work. If the servers are > > connected over some sort of WAN that periodically has short outages this > > is a problem. Also if your using replication because you want to be > able > > to take down one of the databases for short periods of time without > > bringing down the others your in trouble. > All true for two phased commit protocol. To have multi master > replication, you must have all > systems communicating, but you can use a multicast group communication > system instead of > 2PC. Using total order messaging, you can ensure all changes are > delivered to all servers in the > replica in the same order. This group communication system also allows > failures to be detected > while other servers in the replica continue processing. > A few of us are working with this theory, and trying to integrate with > 7.2. There is a working > model for 6.4, but its very limited. (insert, update, and deletes) We > are currently hosted at > http://gborg.postgresql.org/project/pgreplication/projdisplay.php > But the site has been down the last 2 days. I've contacted the web > master, but haven't seen > any results yet. If any one knows what going on with gborg, I'd > appreciate a status. > Darren > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> > Given that different replication strategies will probably be developed > for PG, do you envisage DBAs to be able to select the type of replication > for their installation? I.e. Replication being selectable rther like > storage structures? I can't speak for other replication solutions, but we are using the --with-replication or -r parameter when starting postmaster. Some day I hope there will be parameters for master/slave partial/full and sync/async, but it will be some time before we cross those bridges. Darren
I've been looking into database replication theory lately and have found some interesting papers discussing various approaches. (Here's one paper that struck me as being very helpful, http://citeseer.nj.nec.com/460405.html ) So far I favour an eager replication system which is predicated on a read local/write all available. The system should not depend on two phase commit or primary copy algorithms. The former leads to the whole system being as quick as the slowest machine. In addition, 2 phase commit involves 2n messages for each transaction which does not scale well at all. This idea will also have to take into account a crashed node which did not ack a transaction. The primary copy algorithms I've seen suffer from a single point of failure and potential bottlenecks at the primary node. Instead I like the master to master or peer to peer algorithm as discussed in the above paper. This approach accounts for network partitions, nodes leaving and joining a cluster and the ability to commit a transaction once the communication module has determined the total order of the said transaction, i.e. no need for waiting for acks. This scales well and research has shown it to increase the number of transactions/second a database cluster can handle over a single node. Postgres-R is another interesting approach which I think should be taken seriously. Anyone interested can read a paper on this at http://citeseer.nj.nec.com/330257.html Anyways, my two cents Randall Jonasz Software Engineer Click2net Inc. On Thu, 7 Feb 2002, mlw wrote: > Gavin Sherry wrote: > > Naturally, this would slow down writes to the system (possibly a lot > > depending on the performance difference between the executing machine and > > the least powerful machine in the cluster), but most usages of postgresql > > are read intensive, not write. > > > > Any reason this model would not work? > > What, then is the purpose of replication to multiple masters? > > I can think of only two reasons why you want replication. (1) Redundancy, make > sure that if one server dies, then another server has the same data and is used > seamlessly. (2) Increase performance over one system. > > In reason (1) I submit that a server load balance which sits on top of > PostgreSQL, and executes writes on both servers while distributing reads would > be best. This is a HUGE project. The load balancer must know EXACTLY how the > system is configured, which includes all functions and everything. > > In reason (2) your system would fail to provide the scalability that would be > needed. If writes take a long time, but reads are fine, what is the difference > between the trigger based replicator? > > I have in the back of my mind, an idea of patching into the WAL stuff, and > using that mechanism to push changes out to the slaves. > > Where one machine is still the master, but no trigger stuff, just a WAL patch. > Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure > exactly, the idea hasn't completely formed yet. > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html > >
I've not looked at the first paper, but I wil. > Postgres-R is another interesting approach which I think should be taken > seriously. Anyone interested can read a paper on this at > http://citeseer.nj.nec.com/330257.html I would point you to the info on gborg, but it seems to be down at the moment. - Brandon ----------------------------------------------------------------------------c: 646-456-5455 h: 201-798-4983b. palmer, bpalmer@crimelabs.net pgp:crimelabs.net/bpalmer.pgp5
> I've been looking into database replication theory lately and have found > some interesting papers discussing various approaches. (Here's > one paper that struck me as being very helpful, > http://citeseer.nj.nec.com/460405.html ) Here is another one from that same group, that addresses the WAN issues. > http://www.cnds.jhu.edu/pub/papers/cnds-2002-1.pdf enjoy, Darren
> > I have in the back of my mind, an idea of patching into the WAL stuff, and > > using that mechanism to push changes out to the slaves. > > > > Where one machine is still the master, but no trigger stuff, just a WAL patch. > > Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure > > exactly, the idea hasn't completely formed yet. > > FWIW, Sybase Replication Server does just such a thing. They have a secondary log marker (prevents the log from truncating past the oldest unreplicated transaction). A thread within the system called the "rep agent" (but it use to be a separate process call the LTM), reads the log and forwards it to the rep server, once the rep server has the whole transaction and it is written to a stable device (aka synced to disk) the rep server responds to the LTM telling him it's OK to move the log marker forward. Anyway, once the replication server proper has the transaction it uses a publish/subscribe methodology to see who wants get the update. Bidirectional replication is done by making two oneway replications. The whole thing is table based, it marks the tables as replicated or not in the database to save the trip to the repserver on un replicated tables. Plus you can take parts of a database (replicate all rows where the country is "us" to this server and all the rows with "uk" to that server). Or opposite you can roll up smaller regional databases to bigger ones, it's very flexible. Cheers, Brian
Thread added to TODO.detail/replication. mlw wrote: > I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it > works like the whole rserv project. I don't like it. > > OK, what the hell do we need to do to get PostgreSQL replicating? > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Thread added to TODO.detail/replication. mlw wrote: > I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it > works like the whole rserv project. I don't like it. > > OK, what the hell do we need to do to get PostgreSQL replicating? > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > > Thread added to TODO.detail/replication. > Here is what I have... http://gborg.postgresql.org/genpage?replication_72todo It's just for the pg-replication project, and needs some further discussion. Darren >