Thread: Replication

Replication

From
mlw
Date:
I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
works like the whole rserv project. I don't like it.

OK, what the hell do we need to do to get PostgreSQL replicating?


Re: Replication

From
Steven
Date:
On Mon, 4 Feb 2002, mlw wrote:

I've developed a replacement for Rserv and we are planning on releasing 
it as open source(ie as a contrib module).  

Like Rserv its trigger based but its much more flexible.
The key adventages it has over Rserv is that it has
-Support for multiple slaves
-It Perserves transactions while doing the mirroring. Ie  If rows A,B are 
originally added in the same transaction they will be mirrored in the same 
transaction.

We have plans on adding filtering based on data/selective mirroring as 
well. (Ie only rows with COUNTRY='Canada' go to 
slave A, and  rows with COUNTRY='China' go to slave B).
But I'm not sure when I'll get to that.

Support for conflict resolution(If allow edits to be made on the slaves) 
would be nice.

I hope to be able to send a tarball with the source to the pgpatches list 
within the next few days.

We've been using the system operationally for a number of months and have
been happy with it.

> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
> works like the whole rserv project. I don't like it. 
> OK, what the hell do we need to do to get PostgreSQL replicating?
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 

-- 
Steven Singer                                       ssinger@navtechinc.com
Aircraft Performance Systems                Phone:  519-747-1170 ext 282
Navtech Systems Support Inc.                AFTN:   CYYZXNSX SITA: YYZNSCR
Waterloo, Ontario                           ARINC:  YKFNSCR



Re: Replication

From
bpalmer
Date:
>
> OK, what the hell do we need to do to get PostgreSQL replicating?

I hope you understand that replication,  done right,  is a massive
project.  I know that Darren any myself (and the rest of the pg-repl
folks) have been waiting till 7.2 went gold till we did anymore work.  I
think we hope to have master / slave replicatin working for 7.3 and then
target multimaster for 7.4.  At least that's the hope.

- Brandon

----------------------------------------------------------------------------c: 646-456-5455
              h: 201-798-4983b. palmer,  bpalmer@crimelabs.net           pgp:crimelabs.net/bpalmer.pgp5
 



Re: Replication

From
mlw
Date:
bpalmer wrote:
> 
> >
> > OK, what the hell do we need to do to get PostgreSQL replicating?
> 
> I hope you understand that replication,  done right,  is a massive
> project.  I know that Darren any myself (and the rest of the pg-repl
> folks) have been waiting till 7.2 went gold till we did anymore work.  I
> think we hope to have master / slave replicatin working for 7.3 and then
> target multimaster for 7.4.  At least that's the hope.

I do know how hard replication is. I also understand how important it is.

If you guys have a project going, and need developers, I am more than willing.


Re: Replication

From
mlw
Date:
Steven wrote:
> 
> On Mon, 4 Feb 2002, mlw wrote:
> 
> I've developed a replacement for Rserv and we are planning on releasing
> it as open source(ie as a contrib module).
> 
> Like Rserv its trigger based but its much more flexible.
> The key adventages it has over Rserv is that it has
> -Support for multiple slaves
> -It Perserves transactions while doing the mirroring. Ie  If rows A,B are
> originally added in the same transaction they will be mirrored in the same
> transaction.

I did a similar thing. I took the rserv trigger "as is," but rewrote the
replication support code. What I eventually did was write a "snapshot daemon"
which created snapshot files. Then a "slave daemon" which would check the last
snapshot applied and apply all the snapshots, in order, as needed. One would
run one of these daemons per slave server.


Re: Replication

From
Steven
Date:
DBMirror doesn't use snapshot's instead it records a log of transactions 
that are committed to the database in a pair of tables.  
In the case of an INSERT this is the row that is being added.
In the case of a delete the primary key of the row being deleted.

And in the case of an UPDATE, the primary key before the update along with 
all of the data the row should have after an update.

Then for each slave database a perl script walks though the transactions 
that are pending for that host and reconstructs SQL to send the row edits 
to that host.  A record of the fact that transaction Y has been sent to 
host X is also kept.

When transaction X has been sent to all of the hosts that are in the 
system it is then deleted from the Pending tables.

I suspect that all of the information I'm storing in the Pending tables is 
also being stored by Postgres in its log but I haven't investigated how 
the information could be extracted(or how long it is kept for).  That 
would  reduce the extra storage overhead that the replication system 
imposes.

As I remember(Its been a while since I've looked at it) RServ uses OID's 
in its tables to point to the data that needs to be replicated.  We tried 
a similar approach but found difficulties with doing partial updates.






On Mon, 4 Feb 2002, mlw wrote:

> I did a similar thing. I took the rserv trigger "as is," but rewrote the
> replication support code. What I eventually did was write a "snapshot daemon"
> which created snapshot files. Then a "slave daemon" which would check the last
> snapshot applied and apply all the snapshots, in order, as needed. One would
> run one of these daemons per slave server.





-- 
Steven Singer                                       ssinger@navtechinc.com
Aircraft Performance Systems                Phone:  519-747-1170 ext 282
Navtech Systems Support Inc.                AFTN:   CYYZXNSX SITA: YYZNSCR
Waterloo, Ontario                           ARINC:  YKFNSCR



Re: Replication

From
Gavin Sherry
Date:
On Mon, 4 Feb 2002, mlw wrote:

> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
> works like the whole rserv project. I don't like it.
> 
> OK, what the hell do we need to do to get PostgreSQL replicating?

The trigger model is not a very sophisticated one. I think I have a better
-- though more complicated -- one. This model would be able to handle
multiple masters and master->slave.

First of all, all machines in the cluster would have to be aware all the
machines in the cluster. This would have to be stored in a new system
table.

The FE/BE protocol would need to be modified to accepted parsed node trees
generated by pg_analyze_and_rewrite(). These could then be dispatched by 
the executing server, inside of pg_exec_query_string, to all other servers
in the cluster (excluding itself). Naturally, this dispatch would need to
be non-blocking.

pg_exec_query_string() would need to check that nodetags to make sure
selects and perhaps some commands are not dispatched.

Before the executing server runs finish_xact_command(), it would check
that the query was successfully executed on all machines otherwise
abort. Such a system would need a few configuration options: whether or
not you abort on failed replication to slaves, the ability to replicate
only certain tables, etc.

Naturally, this would slow down writes to the system (possibly a lot
depending on the performance difference between the executing machine and
the least powerful machine in the cluster), but most usages of postgresql
are read intensive, not write.

Any reason this model would not work?

Gavin



Re: Replication

From
mlw
Date:
Gavin Sherry wrote:
> Naturally, this would slow down writes to the system (possibly a lot
> depending on the performance difference between the executing machine and
> the least powerful machine in the cluster), but most usages of postgresql
> are read intensive, not write.
> 
> Any reason this model would not work?

What, then is the purpose of replication to multiple masters?

I can think of only two reasons why you want replication. (1) Redundancy, make
sure that if one server dies, then another server has the same data and is used
seamlessly. (2) Increase performance over one system.

In reason (1) I submit that a server load balance which sits on top of
PostgreSQL, and executes writes on both servers while distributing reads would
be best. This is a HUGE project. The load balancer must know EXACTLY how the
system is configured, which includes all functions and everything. 

In reason (2) your system would fail to provide the scalability that would be
needed. If writes take a long time, but reads are fine, what is the difference
between the trigger based replicator?

I have in the back of my mind, an idea of patching into the WAL stuff, and
using that mechanism to push changes out to the slaves.

Where one machine is still the master, but no trigger stuff, just a WAL patch.
Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
exactly, the idea hasn't completely formed yet.


Re: Replication

From
Steven Singer
Date:

What you describe sounds like a form of a two-stage commit protocol.

If the command worked on two of the replicated databases but failed on a 
third then the executing server would have to be able to undo the command
on the replicated databases as well as itself.

The problems with two stage commit type approches to replication are 
1) Speed as you mentioned.  Write speed isn't a concern for some 
applications but it is very important in others.

and 
2) All of the databases must be able to communicate with each other at 
all times in order for any edits to work.   If the servers are 
connected over some sort of WAN that periodically has short outages this 
is a problem.   Also if your using replication because you want to be able 
to take down one of the databases for short periods of time without 
bringing down the others your in trouble.


btw: I posted the alternative to Rserv that I mentioned the other day to 
the  pg-patches mailing list.  If anyone is intreasted you should be able 
to grab it off the archives.

On Thu, 7 Feb 2002, Gavin Sherry wrote:

> 
> First of all, all machines in the cluster would have to be aware all the
> machines in the cluster. This would have to be stored in a new system
> table.
> 
> The FE/BE protocol would need to be modified to accepted parsed node trees
> generated by pg_analyze_and_rewrite(). These could then be dispatched by 
> the executing server, inside of pg_exec_query_string, to all other servers
> in the cluster (excluding itself). Naturally, this dispatch would need to
> be non-blocking.
> 
> pg_exec_query_string() would need to check that nodetags to make sure
> selects and perhaps some commands are not dispatched.
> 
> Before the executing server runs finish_xact_command(), it would check
> that the query was successfully executed on all machines otherwise
> abort. Such a system would need a few configuration options: whether or
> not you abort on failed replication to slaves, the ability to replicate
> only certain tables, etc.
> 
> Naturally, this would slow down writes to the system (possibly a lot
> depending on the performance difference between the executing machine and
> the least powerful machine in the cluster), but most usages of postgresql
> are read intensive, not write.
> 
> Any reason this model would not work?
> 
> Gavin
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 

-- 
Steven Singer                                       ssinger@navtechinc.com
Aircraft Performance Systems                Phone:  519-747-1170 ext 282
Navtech Systems Support Inc.                AFTN:   CYYZXNSX SITA: YYZNSCR
Waterloo, Ontario                           ARINC:  YKFNSCR



Re: Replication

From
F Harvell
Date:
I'm not that familiar with the whole replication issues in PostgreSQL,
however, I would be partial to replication that was based upon the
playback of the (a?) journal file.  (I believe that the WAL is a
journal file.)

By being based upon a journal file, it would be possible to accomplish
two significant items.  First, it would be possible to "restore" a
database to an exact state just before a failure.  Most commercial
databases provide the ability to do this.  Banks, etc. log the journal
files directly to tape to provide a complete transaction history such
that they can rebuild their database from any given snapshot.  (Note
that the journal file needs to be "editable" as a failure may be
"delete from x" with a missing where clause.)

This leads directly into the second advantage, the ability to have a
replicated database operating anywhere, over any connection on any
server.  Speed of writes would not be a factor.  In essence, as long
as the replicated database had a snapshot of the database and then was
provided with all journal files since the snapshot, it would be
possible to build a current database.  If the replicant got behind in
the processing, it would catch up when things slowed down.

In my opionion, the first advantage is in many ways most important.
Replication becomes simply the restoration of the database in realtime
on a second server.  The "replication" task becomes the definition of
a protocol for distributing the journal file.  At least one major
database vendor does replication (shadowing) in exactly this mannor.

Maybe I'm all wet and the journal file and journal playback already
exists.  If so, IMHO, basing replication off of this would be the
right direction.


On Thu, 07 Feb 2002 07:52:23 EST, mlw wrote:
> 
> I have in the back of my mind, an idea of patching into the WAL stuff, and
> using that mechanism to push changes out to the slaves.
> 
> Where one machine is still the master, but no trigger stuff, just a WAL patch.
> Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
> exactly, the idea hasn't completely formed yet.
> 




Re: Replication

From
Darren Johnson
Date:
>> The problems with two stage commit type approches to replication are

IMHO the biggest problem with two phased commit is it doesn't scale.
The more servers
you add to the replica the slower it goes.  Also there's the potential
for dead locks across
server boundaries.
>> 2) All of the databases must be able to communicate with each other at> all times in order for any edits to work.
Ifthe servers are> connected over some sort of WAN that periodically has short outages this> is a problem.   Also if
yourusing replication because you want to be 
 
able> to take down one of the databases for short periods of time without> bringing down the others your in trouble.

All true for two phased commit protocol.  To have multi master
replication, you must have all
systems communicating, but you can use a multicast group communication
system instead of
2PC.  Using total order messaging, you can ensure all changes are
delivered to all servers in the
replica in the same order.   This group communication system also allows
failures to be detected
while other servers in the replica continue processing.

A few of us are working with this theory, and trying to integrate with
7.2.  There is a working
model for 6.4, but its very limited.  (insert, update, and deletes)  We
are currently hosted at

http://gborg.postgresql.org/project/pgreplication/projdisplay.php
But the site has been down the last 2 days.  I've contacted the web
master, but haven't seen
any results yet.  If any one knows what going on with gborg, I'd
appreciate a status.

Darren



Re: Replication

From
Bradley Kieser
Date:
Darren,
Given that different replication strategies will probably be developed 
for PG, do you envisage DBAs to be able to select the type of replication 
for their installation? I.e. Replication being selectable rther like 
storage structures?

Would be a killer bit of flexibility, given how enormous the impact of 
replication will be to corporate adoption of PG.

Brad 


>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<

On 2/8/02, 5:29:22 AM, Darren Johnson <darren.johnson@cox.net> wrote 
regarding Re: [HACKERS] Replication:


>  >
>  > The problems with two stage commit type approches to replication are

> IMHO the biggest problem with two phased commit is it doesn't scale.
> The more servers
> you add to the replica the slower it goes.  Also there's the potential
> for dead locks across
> server boundaries.

>  >
>  > 2) All of the databases must be able to communicate with each other at
>  > all times in order for any edits to work.   If the servers are
>  > connected over some sort of WAN that periodically has short outages this
>  > is a problem.   Also if your using replication because you want to be
> able
>  > to take down one of the databases for short periods of time without
>  > bringing down the others your in trouble.

> All true for two phased commit protocol.  To have multi master
> replication, you must have all
> systems communicating, but you can use a multicast group communication
> system instead of
> 2PC.  Using total order messaging, you can ensure all changes are
> delivered to all servers in the
> replica in the same order.   This group communication system also allows
> failures to be detected
> while other servers in the replica continue processing.

> A few of us are working with this theory, and trying to integrate with
> 7.2.  There is a working
> model for 6.4, but its very limited.  (insert, update, and deletes)  We
> are currently hosted at

> http://gborg.postgresql.org/project/pgreplication/projdisplay.php
> But the site has been down the last 2 days.  I've contacted the web
> master, but haven't seen
> any results yet.  If any one knows what going on with gborg, I'd
> appreciate a status.

> Darren


> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)


Re: Replication

From
Darren Johnson
Date:
> 
> Given that different replication strategies will probably be developed 
> for PG, do you envisage DBAs to be able to select the type of replication 
> for their installation? I.e. Replication being selectable rther like 
> storage structures?

I can't speak for other replication solutions, but we are using the 
--with-replication or
-r parameter when starting postmaster.  Some day I hope there will be 
parameters for
master/slave partial/full and sync/async,  but it will be some time 
before we cross those
bridges.

Darren





Re: Replication

From
Randall Jonasz
Date:
I've been looking into database replication theory lately and have found
some interesting papers discussing various approaches.  (Here's
one paper that struck me as being very helpful,
http://citeseer.nj.nec.com/460405.html )  So far I favour an
eager replication system which is predicated on a read local/write all
available. The system should not depend on two phase commit or primary
copy algorithms.  The former leads to the whole system being as quick as
the slowest machine.  In addition, 2 phase commit involves 2n messages for
each transaction which does not scale well at all.  This idea will also
have to take into account a crashed node which did not ack a transaction.
The primary copy algorithms I've seen suffer from a single point of
failure and potential bottlenecks at the primary node.

Instead I like the master to master or peer to peer algorithm as discussed
in the above paper.  This approach accounts for network partitions, nodes
leaving and joining a cluster and the ability to commit a transaction once
the communication module has determined the total order of the said
transaction, i.e. no need for waiting for acks.   This scales well and
research has shown it to increase the number of transactions/second a
database cluster can handle over a single node.

Postgres-R is another interesting approach which I think should be taken
seriously. Anyone interested can read a paper on this at
http://citeseer.nj.nec.com/330257.html

Anyways, my two cents

Randall Jonasz
Software Engineer
Click2net Inc.


On Thu, 7 Feb 2002, mlw wrote:

> Gavin Sherry wrote:
> > Naturally, this would slow down writes to the system (possibly a lot
> > depending on the performance difference between the executing machine and
> > the least powerful machine in the cluster), but most usages of postgresql
> > are read intensive, not write.
> >
> > Any reason this model would not work?
>
> What, then is the purpose of replication to multiple masters?
>
> I can think of only two reasons why you want replication. (1) Redundancy, make
> sure that if one server dies, then another server has the same data and is used
> seamlessly. (2) Increase performance over one system.
>
> In reason (1) I submit that a server load balance which sits on top of
> PostgreSQL, and executes writes on both servers while distributing reads would
> be best. This is a HUGE project. The load balancer must know EXACTLY how the
> system is configured, which includes all functions and everything.
>
> In reason (2) your system would fail to provide the scalability that would be
> needed. If writes take a long time, but reads are fine, what is the difference
> between the trigger based replicator?
>
> I have in the back of my mind, an idea of patching into the WAL stuff, and
> using that mechanism to push changes out to the slaves.
>
> Where one machine is still the master, but no trigger stuff, just a WAL patch.
> Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
> exactly, the idea hasn't completely formed yet.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>
>



Re: Replication

From
bpalmer
Date:
I've not looked at the first paper,  but I wil.

> Postgres-R is another interesting approach which I think should be taken
> seriously. Anyone interested can read a paper on this at
> http://citeseer.nj.nec.com/330257.html

I would point you to the info on gborg,  but it seems to be down at the
moment.

- Brandon

----------------------------------------------------------------------------c: 646-456-5455
              h: 201-798-4983b. palmer,  bpalmer@crimelabs.net           pgp:crimelabs.net/bpalmer.pgp5
 



Re: Replication

From
Darren Johnson
Date:
> I've been looking into database replication theory lately and have found
> some interesting papers discussing various approaches.  (Here's
> one paper that struck me as being very helpful,
> http://citeseer.nj.nec.com/460405.html )


Here is another one from that same group, that addresses  the WAN issues.

> http://www.cnds.jhu.edu/pub/papers/cnds-2002-1.pdf


enjoy,

Darren





Re: Replication

From
Brian Bruns
Date:
> > I have in the back of my mind, an idea of patching into the WAL stuff, and
> > using that mechanism to push changes out to the slaves.
> >
> > Where one machine is still the master, but no trigger stuff, just a WAL patch.
> > Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
> > exactly, the idea hasn't completely formed yet.
> >

FWIW, Sybase Replication Server does just such a thing.  

They have a secondary log marker (prevents the log from truncating past 
the oldest unreplicated transaction).  A thread within the system called 
the "rep agent" (but it use to be a separate process call the LTM), reads 
the log and forwards it to the rep server, once the rep server has the 
whole transaction and it is written to a stable device (aka synced to 
disk) the rep server responds to the LTM telling him it's OK to move the 
log marker forward.

Anyway, once the replication server proper has the transaction it uses a 
publish/subscribe methodology to see who wants get the update.

Bidirectional replication is done by making two oneway replications.  The 
whole thing is table based, it marks the tables as replicated or not in 
the database to save the trip to the repserver on un replicated tables.

Plus you can take parts of a database (replicate all rows where the 
country is "us" to this server and all the rows with "uk" to that server).  
Or opposite you can roll up smaller regional databases to bigger ones, 
it's very flexible.


Cheers,

Brian



Re: Replication

From
Bruce Momjian
Date:
Thread added to TODO.detail/replication.

mlw wrote:
> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
> works like the whole rserv project. I don't like it.
> 
> OK, what the hell do we need to do to get PostgreSQL replicating?
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Replication

From
Bruce Momjian
Date:
Thread added to TODO.detail/replication.

mlw wrote:
> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
> works like the whole rserv project. I don't like it.
> 
> OK, what the hell do we need to do to get PostgreSQL replicating?
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Replication

From
Date:
> 
> 
> Thread added to TODO.detail/replication.
> 
Here is what I have...

http://gborg.postgresql.org/genpage?replication_72todo

It's just for the pg-replication project, and needs some
further discussion.

Darren
>