Thread: Big 7.4 items

Big 7.4 items

From
Bruce Momjian
Date:
I wanted to outline some of the big items we are looking at for 7.4:

Win32 Port:
    Katie Ward and Jan are working on contributing their Win32
    port for 7.4.  They plan to have a patch available by the end of
    December.

Point-In-Time Recovery (PITR)
    J. R. Nield did a PITR patch late in 7.3 development, and Patrick
    MacDonald from Red Hat is working on merging it into CVS and
    adding any missing pieces.  Patrick, do you have an ETA on that?
 

Replication
    I have talked to Darren Johnson and I believe 7.4 is the time to
    merge the Postgres-R source tree into our main CVS.  Most of the
    replication code will be in its own directory, with only minor
    changes to our existing tree.  They have single-master
    replication working now, so we may have that feature in some
    capacity for 7.4.  I know others are working on replication
    solutions.  This is probably the time to decide for certain if
    this is the direction we want to go for replication.  Most who
    have studied Postgres-R feel it is the most promising
    multi-master replication solution for reliably networked hosts.
 

Comments?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
"Shridhar Daithankar"
Date:
On 13 Dec 2002 at 1:22, Bruce Momjian wrote:
> Replication
> 
>     I have talked to Darren Johnson and I believe 7.4 is the time to
>     merge the Postgres-R source tree into our main CVS.  Most of the
>     replication code will be in its own directory, with only minor
>     changes to our existing tree.  They have single-master
>     replication working now, so we may have that feature in some
>     capacity for 7.4.  I know others are working on replication
>     solutions.  This is probably the time to decide for certain if
>     this is the direction we want to go for replication.  Most who
>     have studied Postgres-R feel it is the most promising
>     multi-master replication solution for reliably networked hosts.
> 
> Comments?

Some.

1) What kind of replication are we looking at? log file replay/synchronous etc.
If it is real time, like usogres (I hope I am in line with things here), that
would be real good. Choice is always good.

2) If we are going to have replication, can we have built-in load balancing? Is
it a good idea to have it in postgresql, or would a separate application be the
way to go?

And where are nested transactions?



Bye
Shridhar

--
Booker's Law:    An ounce of application is worth a ton of abstraction.



Re: Big 7.4 items

From
Hannu Krosing
Date:
On Fri, 2002-12-13 at 06:22, Bruce Momjian wrote:
> I wanted to outline some of the big items we are looking at for 7.4:
> Point-In-Time Recovery (PITR)
> 
>     J. R. Nield did a PITR patch late in 7.3 development, and Patrick
>     MacDonald from Red Hat is working on merging it into CVS and
>     adding any missing pieces.  Patrick, do you have an ETA on that?

How hard would it be to extend PITR for master-slave (hot backup)
replication, which should then amount to continuously shipping logs to
slave and doing nonstop PITR there :)

It will never be usable for multi-master replication, but somehow it
feels that for master-slave replication simple log replay would be the
most simple and robust solution.
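
(An illustrative sketch, not part of the original mail: the continuous
shipping loop in Python.  The directory names and the replay() stub are
hypothetical placeholders for the real WAL segment files and the
server's recovery code.)

import shutil
import time
from pathlib import Path

# Hypothetical paths; a real setup would ship the WAL segment files
# found under the master's pg_xlog directory.
MASTER_WAL = Path("/master/pg_xlog")
SLAVE_INBOX = Path("/slave/wal_inbox")

def replay(segment: Path) -> None:
    # Placeholder for "nonstop PITR": the slave's recovery code would
    # apply every WAL record in this segment to its data files.
    print(f"replaying {segment.name}")

def ship_logs_forever(poll_seconds: float = 5.0) -> None:
    """Continuously copy finished WAL segments to the slave and replay them."""
    shipped = set()
    while True:
        for segment in sorted(MASTER_WAL.glob("*")):
            if segment.name not in shipped:
                target = SLAVE_INBOX / segment.name
                shutil.copy(segment, target)
                replay(target)
                shipped.add(segment.name)
        time.sleep(poll_seconds)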

-- 
Hannu Krosing <hannu@tm.ee>


Re: Big 7.4 items

From
Mike Mascari
Date:
Bruce Momjian wrote:
> I wanted to outline some of the big items we are looking at for 7.4:
> 
> Win32 Port:
> 
>     Katie Ward and Jan are working on contributing their Win32
>     port for 7.4.  They plan to have a patch available by the end of
>     December.
> 
> Point-In-Time Recovery (PITR)
> 
>     J. R. Nield did a PITR patch late in 7.3 development, and Patrick
>     MacDonald from Red Hat is working on merging it into CVS and
>     adding any missing pieces.  Patrick, do you have an ETA on that?
> 
> Replication
> 
>     I have talked to Darren Johnson and I believe 7.4 is the time to
>     merge the Postgres-R source tree into our main CVS.  Most of the
>     replication code will be in its own directory, with only minor
>     changes to our existing tree.  They have single-master
>     replication working now, so we may have that feature in some
>     capacity for 7.4.  I know others are working on replication
>     solutions.  This is probably the time to decide for certain if
>     this is the direction we want to go for replication.  Most who
>     have studied Postgres-R feel it is the most promising
>     multi-master replication solution for reliably networked hosts.
> 
> Comments?

What about distributed TX support:


http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=20021106111554.69ae1dcd.pgsql%40snaga.org&rnum=2&prev=/groups%3Fq%3DNAGAYASU%2BSatoshi%26ie%3DUTF-8%26oe%3DUTF-8%26hl%3Den

Mike Mascari
mascarm@mascari.com



Re: Big 7.4 items

From
darren@up.hrcoxmail.com
Date:
> 
> How hard would it be to extend PITR for master-slave (hot backup)
> replication, which should then amount to continuously shipping logs to
> slave and doing nonstop PITR there :)

I have not looked at the PITR patch yet, but it might be
possible to use the same PITR format to queue/log writesets
with postgres-r, so we can have multi-master replication
and PITR from the same mechanism.

Darren



Re: Big 7.4 items

From
Joe Conway
Date:
Bruce Momjian wrote:
> Win32 Port:
> 
>     Katie Ward and Jan are working on contributing their Win32
>     port for 7.4.  They plan to have a patch available by the end of
>     December.

I have .Net Studio available to me, so if you need help in merging or testing 
or whatever, let me know.

> Point-In-Time Recovery (PITR)
> 
>     J. R. Nield did a PITR patch late in 7.3 development, and Patrick
>     MacDonald from Red Hat is working on merging it into CVS and
>     adding any missing pieces.  Patrick, do you have an ETA on that?

As Hannu asked (and related to your question below), is there any thought of 
extending this to allow simple log based replication? In many important 
scenarios that would be more than adequate, and simpler to set up.

> Replication
> 
>     I have talked to Darren Johnson and I believe 7.4 is the time to
>     merge the Postgres-R source tree into our main CVS.  Most of the
>     replication code will be in its own directory, with only minor
>     changes to our existing tree.  They have single-master
>     replication working now, so we may have that feature in some
>     capacity for 7.4.  I know others are working on replication
>     solutions.  This is probably the time to decide for certain if
>     this is the direction we want to go for replication.  Most who
>     have studied Postgres-R feel it is the most promising
>     multi-master replication solution for reliably networked hosts.

I'd question if we would want the one-and-only builtin replication method to 
be dependent on an external communication library (Spread). I would like to 
see Postgres-R merged, but I'd also like to see a simple log-based option.

> Comments?
> 

I'd also second Mike Mascari's question -- whatever happened to the person 
working on two-phase commit? Is that likely to be done for 7.4? Did he ever 
send in a patch?

Joe



Re: Big 7.4 items

From
Bruce Momjian
Date:
Shridhar Daithankar wrote:
> And where are nested transactions?

I didn't mention nested transactions because it didn't seem to be a
_big_ item like the others.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
Bruce Momjian
Date:
Shridhar Daithankar wrote:
> 1) What kind of replication are we looking at? log file
> replay/synchronous etc.  If it is real time, like usogres (I
> hope I am in line with things here), that would be real good.
> Choice is always good.

Good.  This is the discussion we need.  Let me quote the TODO list
replication section first:
   * Add replication of distributed databases [replication]
       o automatic failover
       o load balancing
       o master/slave replication
       o multi-master replication
       o partition data across servers
       o sample implementation in contrib/rserv
       o queries across databases or servers (two-phase commit)
       o allow replication over unreliable or non-persistent links
       o http://gborg.postgresql.org/project/pgreplication/projdisplay.php
 

OK, the first thing is that there isn't any one replication solution
that will behave optimally in all situations.  

Now, let me describe Postgres-R and then the other replication
solutions.  Postgres-R is multi-master, meaning you can send SELECT and
UPDATE/DELETE queries to any of the servers in the cluster, and get the
same result.  It is also synchronous, meaning it doesn't update the
local copy until it is sure the other nodes agree to the change. It
allows failover, because if one node goes down, the others keep going.

Now, let me contrast:

rserv and dbmirror do master/slave.  There is no mechanism to allow you
to do updates on the slave, and have them propagate to the master.  You
can, however, send SELECT queries to the slave, and in fact that's how
usogres does load balancing.

Two-phase commit is probably the most popular commercial replication
solution.  While it works for multi-master, it suffers from poor
performance and doesn't handle cases where one node disappears very
well.

Another replication need is for asynchronous replication, most
traditionally for traveling salesmen who need to update their databases
periodically.  The only solution I know for that is PeerDirect's
PostgreSQL commercial offering at http://www.peerdirect.com.  It is
possible PITR may help with this, but we need to handle propagating
changes made by the salesmen back up into the server, and to do that, we
will need a mechanism to handle conflicts that occur when two people
update the same records.  This is always a problem for asynchronous
replication.

> 2) If we are going to have replication, can we have built-in load
> balancing? Is it a good idea to have it in postgresql, or would a
> separate application be the way to go?

Well, because Postgres-R is multi-master, it has automatic load
balancing.  You can have your users point to whatever node you want. 
You can implement this "pointing" by using dns IP address cycling, or
have a router that auto-load balances, though you would need to keep a
db session on the same node, of course.
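
(An illustrative sketch, not part of the original mail: one hypothetical
way to do that "pointing" while keeping a session on the same node,
using made-up host names.)

import hashlib

# Hypothetical multi-master cluster; any node can take any query, but a
# database session must stay on the node where it started.
NODES = ["db1.example.com", "db2.example.com", "db3.example.com"]

def node_for_session(session_id: str) -> str:
    """Hash the session id so every request in the same database
    session lands on the same node (sticky sessions)."""
    digest = hashlib.md5(session_id.encode()).digest()
    return NODES[digest[0] % len(NODES)]

print(node_for_session("user-42"))  # always the same node for this session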

So, in summary, I think we will eventually have two directions for
replication.  One is Postgres-R for multi-master, synchronous
replication, and the other is PITR for asynchronous replication.  I don't think
there is any value to use PITR for synchronous replication because by
definition, you don't _store_ the changes for later use because it is
synchronous.  In synchronous, you communicate your changes to all the
nodes involved, then commit them.

I will describe the use of 'spread' and the Postgres-R internal issues
in my next email.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
Bruce Momjian
Date:
Hannu Krosing wrote:
> On Fri, 2002-12-13 at 06:22, Bruce Momjian wrote:
> > I wanted to outline some of the big items we are looking at for 7.4:
> > Point-In-Time Recovery (PITR)
> > 
> >     J. R. Nield did a PITR patch late in 7.3 development, and Patrick
> >     MacDonald from Red Hat is working on merging it into CVS and
> >     adding any missing pieces.  Patrick, do you have an ETA on that?
> 
> How hard would it be to extend PITR for master-slave (hot backup)
> replication, which should then amount to continuously shipping logs to
> slave and doing nonstop PITR there :)
> 
> It will never be usable for multi-master replication, but somehow it
> feels that for master-slave replication simple log replay would be the
> most simple and robust solution.

Exactly.  See my previous email.  We will eventually have two replication
solutions:  one, Postgres-R for multi-master, and PITR used for
asynchronous master/slave.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
Bruce Momjian
Date:
Mike Mascari wrote:
> What about distributed TX support:
> 
>
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=20021106111554.69ae1dcd.pgsql%40snaga.org&rnum=2&prev=/groups%3Fq%3DNAGAYASU%2BSatoshi%26ie%3DUTF-8%26oe%3DUTF-8%26hl%3Den

OK, yes, that is Satoshi's 2-phase commit implementation.  I will
address 2-phase commit vs Postgres-R in my next email about spread.


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
Bruce Momjian
Date:
darren@up.hrcoxmail.com wrote:
> >
> > How hard would it be to extend PITR for master-slave (hot backup)
> > replication, which should then amount to continuously shipping logs to
> > slave and doing nonstop PITR there :)
> 
> I have not looked at the PITR patch yet, but it might be possible
> to use the same PITR format to queue/log writesets with postgres-r,
> so we can have multi-master replication and PITR from the same
> mechanism.

Yes, we do need a method to send write sets to the various nodes, and
PITR may be a help in getting those write sets.  However, it should be
clear that we really aren't archiving-replaying them like you would
think for PITR.  We are only grabbing stuff from the PITR to send to
other nodes.  We may also be able to use PITR to bring nodes back up to
date if they have fallen out of communication.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
"Mike Mascari"
Date:
Okay. But please keep in mind that a 2-phase commit implementation is
used for more than just replication. Any distributed TX will require a
2PC protocol. As an example, for the DBLINK implementation to ultimately
be transaction safe (at least amongst multiple PostgreSQL
installations), the players in the distributed transaction must all be
participants in a 2PC exchange. And a participant whose communications
link is dropped needs to be able to recover by asking the coordinator
whether or not to complete or abort the distributed TX. I am 100%
ignorant of the distributed TX standard Tom referenced earlier, but I'd
guess there might be an assumption of 2PC support in the implementation.
In other words, I think we still need 2PC, regardless of the method of
replication. And if Satoshi Nagayasu has an implementation ready, why
not investigate its possibilities?

Mike Mascari
mascarm@mascari.com

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>


> Mike Mascari wrote:
> > What about distributed TX support:

> OK, yes, that is Satoshi's 2-phase commit implementation.  I will
> address 2-phase commit vs Postgres-R in my next email about spread.




Re: Big 7.4 items

From
snpe
Date:
On Friday 13 December 2002 17:51, Bruce Momjian wrote:
> Shridhar Daithankar wrote:
> > And where are nested transactions?
>
> I didn't mention nested transactions because it didn't seem to be a
> _big_ item like the others.

This is a big item.

regards
Haris Peco


Re: Big 7.4 items

From
Bruce Momjian
Date:
Joe Conway wrote:
> Bruce Momjian wrote:
> > Win32 Port:
> > 
> >     Katie Ward and Jan are working on contributing their Win32
> >     port for 7.4.  They plan to have a patch available by the end of
> >     December.
> 
> I have .Net Studio available to me, so if you need help in merging or testing 
> or whatever, let me know.

OK, Jan, let him know how he can help.

> > Point-In-Time Recovery (PITR)
> > 
> >     J. R. Nield did a PITR patch late in 7.3 development, and Patrick
> >     MacDonald from Red Hat is working on merging it into CVS and
> >     adding any missing pieces.  Patrick, do you have an ETA on that?
> 
> As Hannu asked (and related to your question below), is there any thought of 
> extending this to allow simple log based replication? In many important 
> scenarios that would be more than adequate, and simpler to set up.

Yes, see previous email.

> I'd question if we would want the one-and-only builtin replication method to 
> be dependent on an external communication library (Spread). I would like to 
> see Postgres-R merged, but I'd also like to see a simple log-based option.

OK, let me reiterate: I think we will have two replication solutions in
the end --- one, Postgres-R for multi-master/synchronous, and PITR for
master/slave asynchronous replication.

Let me address the Spread issue and two-phase commit.  (Spread is an
open source piece of software used by Postgres-R.)

In two-phase commit, when one node is about to commit, it gets a lock
from all the other nodes, does its commit, then releases the lock.  (Of
course, this is simplified.)  It is called two-phase because it says to
all the other nodes "I am about to do something, is that OK?", then when
it gets all OK's, it does the commit and says "I did the commit".
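
(An illustrative sketch, not part of the original mail: the two phases
in minimal Python.  The Node class is a stand-in for a remote server,
and real 2PC must also survive a crashed coordinator, which is part of
the weakness described above.)

class Node:
    """Stand-in for a remote server participating in the transaction."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def prepare(self, txid):
        # Phase 1: "I am about to do something, is that OK?"
        return self.healthy

    def commit(self, txid):
        # Phase 2: "I did the commit"
        print(f"{self.name}: committed {txid}")

    def abort(self, txid):
        print(f"{self.name}: aborted {txid}")

def two_phase_commit(nodes, txid):
    if all(node.prepare(txid) for node in nodes):
        for node in nodes:
            node.commit(txid)
        return True
    # Any node that says no (or has disappeared) aborts everyone.
    for node in nodes:
        node.abort(txid)
    return False

two_phase_commit([Node("A"), Node("B"), Node("C", healthy=False)], "tx1")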

Postgres-R uses a different mechanism.  This method is shown on page 22
and 24 and following of:
ftp://gborg.postgresql.org/pub/pgreplication/stable/PostgreSQLReplication.pdf.gz

The basic difference is that Spread groups all the write sets into a
queue whose ordering is the same on all the nodes.  Instead of asking
for approval for a commit, a node puts its commit in the Spread queue,
and then waits for its own commit to come back in the queue, meaning all
the other nodes saw its commit too.

The only tricky part is that while reading the other nodes' write sets
before its own arrives, a node has to check whether any of these conflict
with its own write set.  If one conflicts, it has to assume the earlier
write set succeeded and its own failed.  It also has to check the write
set stream and apply only those changes that don't conflict.

As stated before in Postgres-R discussion, this mechanism hinges on
being able to determine which write sets conflict because there is no
explicit "I aborted", only a stream of write sets, and each node has to
accept the non-conflicting ones and reject the conflicting ones.
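
(An illustrative sketch, not part of the original mail: the commit rule
in minimal Python, with writesets reduced to sets of row identifiers.)

def conflicts(ws_a, ws_b):
    """Two writesets conflict if they touch any of the same rows."""
    return bool(set(ws_a["rows"]) & set(ws_b["rows"]))

def decide(own, ordered_stream):
    """Scan the totally ordered stream: if our own writeset comes back
    before any conflicting one, we commit; otherwise the earlier
    writeset wins and ours must fail."""
    for ws in ordered_stream:
        if ws is own:
            return "commit"
        if conflicts(ws, own):
            return "abort"
    return "pending"  # our own writeset has not come back yet

mine = {"node": "A", "rows": ["t1:7", "t2:3"]}
stream = [{"node": "B", "rows": ["t1:7"]}, mine]  # conflicting WS ordered first
print(decide(mine, stream))  # -> abort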

> I'd also second Mike Mascari's question -- whatever happened to the person 
> working on two-phase commit? Is that likely to be done for 7.4? Did he ever 
> send in a patch?

I have not seen a patch from him, but it is very likely he could have
one for 7.4.  This is why it is good we discuss this now and figure out
where we want to go for 7.4 so we can get started.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
Bruce Momjian
Date:
Mike Mascari wrote:
> Okay. But please keep in mind that a 2-phase commit implementation
> is used for more than just replication. Any distributed TX will
> require a 2PC protocol. As an example, for the DBLINK implementation
> to ultimately be transaction safe (at least amongst multiple
> PostgreSQL installations), the players in the distributed
> transaction must all be participants in a 2PC exchange. And a
> participant whose communications link is dropped needs to be
> able to recover by asking the coordinator whether or not to
> complete or abort the distributed TX. I am 100% ignorant of the
> distributed TX standard Tom referenced earlier, but I'd guess
> there might be an assumption of 2PC support in the implementation.
> In other words, I think we still need 2PC, regardless of the
> method of replication. And if  Satoshi Nagayasu has an implementation
> ready, why not investigate its possibilities?

This is a good point.  I don't want to push Postgres-R as our solution. 
Rather, I have looked at both and like Postgres-R, but others need to
look at both and decide so we are all in agreement when we move forward.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
"Mike Mascari"
Date:
----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>

> Mike Mascari wrote:
> > Okay. But please keep in mind that a 2-phase commit implementation
> > is used for more than just replication.
>
> This is a good point.  I don't want to push Postgres-R as our solution.
> Rather, I have looked at both and like Postgres-R, but others need to
> look at both and decide so we are all in agreement when we move forward.
>

After having read your post regarding Spread, I see that it is an
alternative to 2PC as a distributed TX protocol. If I understand you
correctly, a DBLINK implementation built atop Spread would also be
possible. Correct? The question then is, do other RDBMS expose a 2PC
implementation which could then be leveraged at a later time? For
example, imagine:

1. 7.4 includes a native 2PC protocol with:

CREATE DATABASE LINK accounting
CONNECT TO accounting.acme.com:5432
IDENTIFIED BY mascarm/mascarm;

SELECT *
FROM employees@accounting;

INSERT INTO employees@accounting
VALUES (1, 'Mike', 'Mascari');

That would be great, allowing PostgreSQL servers running in different departments to participate in a distributed tx.

2. 7.5 includes a DBLINK which supports PostgreSQL participating in a
heterogeneous distributed transaction (with, say, an Oracle database):

CREATE DATABASE LINK finance
CONNECT TO <oracle names entry>
IDENTIFIED BY mascarm/mascarm
USING INTERFACE 'pg2oracle.so';

INSERT INTO employees@finance
VALUES (1, 'Mike', 'Mascari');

I guess I'm basically asking:

1) Is it necessary to *choose* between support for 2PC and Spread
(Postgres-R) or can't we have both? Spread for replication, 2PC for
non-replicating distributed TX?

2) Do major SQL DBMS vendors which support distributed options expose a
callable interface into a 2PC protocol that would allow PostgreSQL to
participate? I could check on this...

3) Are there any standards (besides ODBC, which, the last time I looked,
just had COMMIT/ABORT APIs) that have been defined and adopted by the
industry for distributed tx?

Again, I'd guess most people want:

1) High performance Master/Master replication *and* (r.e. Postgres-R)
2) Ability to participate in distributed tx's (r.e. 2PC?)

Mike Mascari
mascarm@mascari.com





Re: Big 7.4 items

From
Neil Conway
Date:
On Fri, 2002-12-13 at 13:20, Bruce Momjian wrote:
> Let me address the Spread issue and two-phase commit.  (Spread is an
> open source piece of software used by Postgres-R.)

Note that while Spread is open source in the sense that "the source is
available", its license is significantly more restrictive than
PostgreSQL's:
   http://www.spread.org/license/

Just FYI...

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC





Re: Big 7.4 items

From
Bruce Momjian
Date:
Neil Conway wrote:
> On Fri, 2002-12-13 at 13:20, Bruce Momjian wrote:
> > Let me address the Spread issue and two-phase commit.  (Spread is an
> > open source piece of software used by Postgres-R.)
> 
> Note that while Spread is open source in the sense that "the source is
> available", it's license is significantly more restrictive than
> PostgreSQL's:
> 
>     http://www.spread.org/license/
> 

Interesting.  It looks like a modified version of the old BSD license
where you are required to mention you are using Spread.  I believe we
can get that reduced.  (I think Darren already addressed this with
them.) We certainly are not going to accept software that requires all
PostgreSQL user sites to mention Spread.

The whole "mention" aspect of the old BSD license was pretty ambiguous,
and I assume this is similar.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
"Mike Mascari"
Date:
I wrote:
>
> I guess I'm basically asking:
>
> 1) Is it necessary to *choose* between support for 2PC and Spread
> (Postgres-R) or can't we have both? Spread for replication, 2PC for
> non-replicating distributed TX?
>
> 2) Do major SQL DBMS vendors which support distributed options expose
> a callable interface into a 2PC protocol that would allow PostgreSQL
> to participate? I could check on this...
>
> 3) Are there any standards (besides ODBC, which, the last time I
> looked, just had COMMIT/ABORT APIs) that have been defined and adopted
> by the industry for distributed tx?

Answer:

The Open Group's Open/XA C193 specification for an API for distributed transactions:

http://www.opengroup.org/public/pubs/catalog/c193.htm

I couldn't find any draft copies on the web, but a good description at the Sybase site:

http://manuals.sybase.com/onlinebooks/group-xs/xsg1111e/xatuxedo/@ebt-link;pt=61?target=%25N%13_446_START_RESTART_N%25

The standard is 2PC based.

Mike Mascari
mascarm@mascari.com




Re: Big 7.4 items

From
Jan Wieck
Date:
Bruce Momjian wrote:

> OK, the first thing is that there isn't any one replication solution
> that will behave optimally in all situations.

Right

> Now, let me describe Postgres-R and then the other replication
> solutions.  Postgres-R is multi-master, meaning you can send SELECT and
> UPDATE/DELETE queries to any of the servers in the cluster, and get the
> same result.  It is also synchronous, meaning it doesn't update the
> local copy until it is sure the other nodes agree to the change. It
> allows failover, because if one node goes down, the others keep going.

Wrong

It is asynchronous without the need of 2 phase commit. It is group
communication based and requires the group communication system to
guarantee total order. The tricky part is that the local transaction
must be on hold until its own commit message comes back without a prior
lock conflict by a replication transaction. If such a lock conflict
occurs, the replication transaction wins and the local transaction rolls
back.

> 
> Now, let me contrast:
> 
> rserv and dbmirror do master/slave.  There is no mechanism to allow you
> to do updates on the slave, and have them propagate to the master.  You
> can, however, send SELECT queries to the slave, and in fact that's how
> usogres does load balancing.

But you cannot use the result of such a SELECT to update anything. So
you can only farm out complete read-only transactions to the slaves.
Requires support from the application since the load balancing system
cannot know automatically what will be a read only transaction and what
not.

> 
> Two-phase commit is probably the most popular commercial replication
> solution.  While it works for multi-master, it suffers from poor
> performance and doesn't handle cases where one node disappears very
> well.
> 
> Another replication need is for asynchronous replication, most
> traditionally for traveling salesmen who need to update their databases
> periodically.  The only solution I know for that is PeerDirect's
> PostgreSQL commercial offering at http://www.peerdirect.com.  It is
> possible PITR may help with this, but we need to handle propagating
> changes made by the salesmen back up into the server, and to do that, we
> will need a mechanism to handle conflicts that occur when two people
> update the same records.  This is always a problem for asynchronous
> replication.

PITR doesn't help here at all, since PeerDirect's replication is trigger
and control table based. What makes our replication system unique is
that it works bidirectionally in a heterogeneous world.

> I will describe the use of 'spread' and the Postgres-R internal issues
> in my next email.

The last time I was playing with Spread (that was at Great Bridge in
Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
dropped messages when the network load got too high. This occurred
without any indication, no error, nothing. This is not exactly what I
understand as total order. I hope they have made some substantial
progress on that.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Big 7.4 items

From
darren@up.hrcoxmail.com
Date:
> > Note that while Spread is open source in the sense that "the source is
> > available", it's license is significantly more restrictive than
> > PostgreSQL's:
> > 
> >     http://www.spread.org/license/
> > 
> 
> Interesting.  It looks like a modified version of the old BSD license
> where you are required to mention you are using Spread.  I believe we
> can get that reduced.  (I think Darren already addressed this with
> them.) We certainly are not going to accept software that requires all
> PostgreSQL user sites to mention Spread.
> 

I don't think this is the case.  We don't redistribute Spread
from the pg-replication site.  There are links to the download
area.  I don't think this should be any different if
postgres-r is merged with the main postgresql tree.  If
Spread is the group communication we choose to use for
postgresql replication, then I would think some Spread
information would be in order on the advocacy site, and
in any set up documentation for replication.

I have spoken to Yair Amir from the Spread camp on
several occasions, and they are very excited about the
replication project.  I'm sure it won't be an issue, but
I will forward this message to him.

Darren




Re: Big 7.4 items

From
Bruce Momjian
Date:
darren@up.hrcoxmail.com wrote:
> > > Note that while Spread is open source in the sense that "the source is
> > > available", it's license is significantly more restrictive than
> > > PostgreSQL's:
> > > 
> > >     http://www.spread.org/license/
> > > 
> > 
> > Interesting.  It looks like a modified version of the old BSD license
> > where you are required to mention you are using Spread.  I believe we
> > can get that reduced.  (I think Darren already addressed this with
> > them.) We certainly are not going to accept software that requires all
> > PostgreSQL user sites to mention Spread.
> > 
> 
> I don't think this is the case.  We don't redistribute Spread
> from the pg-replication site.  There are links to the download
> area.  I don't think this should be any different if
> postgres-r is merged with the main postgresql tree.  If
> Spread is the group communication we choose to use for
> postgresql replication, then I would think some Spread
> information would be in order on the advocacy site, and
> in any set up documentation for replication.

Yes, the question is whether we will ship the Spread code inside our
tarball.  I doubt we would ever have replication running by default, but
we may want to ship a binary that was replication-enabled.  I am
especially thinking of commercial vendors.  Can you imagine Red Hat DB
being required to mention Spread on their web page?  I don't think that
will fly.

Of course we will want to mention Spread on our web site and in our
documentation, but we don't want to be forced to, and we don't want that
burden to "spread" out to other users.

> I have spoken to Yair Amir from the Spread camp on
> several occasions, and they are very excited about the
> replication project.  I'm sure it won't be an issue, but
> I will forward this message to him.

Good.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
Bruce Momjian
Date:
Jan Wieck wrote:
> Bruce Momjian wrote:
> 
> > OK, the first thing is that there isn't any one replication solution
> > that will behave optimally in all situations.
> 
> Right
> 
> > Now, let me describe Postgres-R and then the other replication
> > solutions.  Postgres-R is multi-master, meaning you can send SELECT and
> > UPDATE/DELETE queries to any of the servers in the cluster, and get the
> > same result.  It is also synchronous, meaning it doesn't update the
> > local copy until it is sure the other nodes agree to the change. It
> > allows failover, because if one node goes down, the others keep going.
> 
> Wrong
> 
> It is asynchronous without the need of 2 phase commit. It is group

Well, Darren's PDF at:
ftp://gborg.postgresql.org/pub/pgreplication/stable/PostgreSQLReplication.pdf.gz

calls Postgres-R "Type: Embedded, Peer-to-Peer, Sync".  I don't know
enough about replication so I will let you fight it out with him.  ;-)

> communication based and requires the group communication system to
> guarantee total order. The tricky part is that the local transaction
> must be on hold until its own commit message comes back without a prior
> lock conflict by a replication transaction. If such a lock conflict
> occurs, the replication transaction wins and the local transaction rolls
> back.

Yep, that's the tricky part.

> > 
> > Now, let me contrast:
> > 
> > rserv and dbmirror do master/slave.  There is no mechanism to allow you
> > to do updates on the slave, and have them propagate to the master.  You
> > can, however, send SELECT queries to the slave, and in fact that's how
> > usogres does load balancing.
> 
> But you cannot use the result of such a SELECT to update anything. So
> you can only farm out complete read-only transactions to the slaves.
> Requires support from the application since the load balancing system
> cannot know automatically what will be a read only transaction and what
> not.

Good point.  It has to be a read-only session because you can't jump
nodes during a session.  That definitely limits its usefulness.

> > Two-phase commit is probably the most popular commercial replication
> > solution.  While it works for multi-master, it suffers from poor
> > performance and doesn't handle cases where one node disappears very
> > well.
> > 
> > Another replication need is for asynchronous replication, most
> > traditionally for traveling salesmen who need to update their databases
> > periodically.  The only solution I know for that is PeerDirect's
> > PostgreSQL commercial offering at http://www.peerdirect.com.  It is
> > possible PITR may help with this, but we need to handle propagating
> > changes made by the salesmen back up into the server, and to do that, we
> > will need a mechanism to handle conflicts that occur when two people
> > update the same records.  This is always a problem for asynchronous
> > replication.
> 
> PITR doesn't help here at all, since PeerDirect's replication is trigger
> and control table based. What makes our replication system unique is
> that it works bidirectionally in a heterogeneous world.

I was only suggesting that PITR _may_ help as an archive method for the
changes.  PeerDirect stores those changes in a table?

> > I will describe the use of 'spread' and the Postgres-R internal issues
> > in my next email.
> 
> The last time I was playing with Spread (that was at Great Bridge in
> Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
> dropped messages when the network load got too high. This occurred
> without any indication, no error, nothing. This is not exactly what I
> understand as total order. I hope they have made some substantial
> progress on that.

That's a serious problem, clearly.  Hopefully it is either fixed or it
will get fixed.  We can't use it without reliability.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
darren@up.hrcoxmail.com
Date:
> It is asynchronous without the need of 2 phase commit. It is group
> communication based and requires the group communication system to
> guarantee total order. The tricky part is that the local transaction
> must be on hold until its own commit message comes back without a prior

No, it holds until its own Writeset comes back.  It commits
and then sends a commit message on the simple channel, so
commits don't wait for ordered writesets.

Remember, total order guarantees that if no changes in front of
the local changes conflict, the local changes can commit.


> lock conflict by a replication transaction. If such a lock conflict
> occurs, the replication transaction wins and the local transaction rolls
> back.

Correct.

> 
> The last time i was playing with spread (that was at Great Bridge in
> Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
> dropped messages when the network load got too high. This occured
> without any indication, no error, nothing. This is not exactly what I
> understand as total order. I hope they have made some substantial
> progress on that.
> 

I remember the TCL tester you set up, and having problems,
but I don't recall investigating what the problems were.
If you still have the code I can try and reproduce the 
problem, and investigate it on the spread list.  

Darren



Re: Big 7.4 items

From
darren@up.hrcoxmail.com
Date:
> > It is asynchronous without the need of 2 phase commit. It is group
> 
> Well, Darren's PDF at:
> 
>     ftp://gborg.postgresql.org/pub/pgreplication/stable/PostgreSQLReplication.pdf.gz
> 
> calls Postgres-R "Type: Embedded, Peer-to-Peer, Sync".  I don't know
> enough about replication so I will let you fight it out with him.  ;-)
> 


If we're still defining synchronous as pre-commit, then
postgres-r is synchronous.

Darren



Re: Big 7.4 items

From
Bruce Momjian
Date:
darren@up.hrcoxmail.com wrote:
> > It is asynchronous without the need of 2 phase commit. It is group
> > communication based and requires the group communication system to
> > guarantee total order. The tricky part is that the local transaction
> > must be on hold until its own commit message comes back without a prior
> 
> No, it holds until its own Writeset comes back.  It commits
> and then sends a commit message on the simple channel, so
> commits don't wait for ordered writesets.

Darren, can you clarify this?  Why does it send that message?  How does
it allow commits not to wait for ordered writesets?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
darren@up.hrcoxmail.com
Date:
>  
> 
> Darren, can you clarify this?  Why does it send that message?  How does
> it allow commits not to wait for ordered writesets?
> 

There are two channels.  One for total order writesets 
(changes to the DB).  The other is simple order for
aborts, commits, joins (systems joining the replica), etc.
The simple channel is necessary, because we don't want to
wait for total ordered changes to get an abort message and
so forth.  In some cases you might get an abort or a commit
message before you get the writeset it refers to.

Let's say we have systems A, B and C.  Each one has some
changes and sends a writeset to the group communication
system (GCS).  The total order dictates WS(A), WS(B), and
WS(C) and the write sets are received in that order at
each system.  Now C gets WS(A) no conflict, gets WS(B) no
conflict, and receives WS(C).  Now C can commit WS(C) even 
before the commit messages C(A) or C(B), because there is no
conflict.  
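
(An illustrative sketch, not part of the original mail: node C's view of
the two channels in minimal Python, with made-up message tuples and row
sets standing in for real writesets.)

from collections import deque

# Channel 1: writesets, delivered in the same total order on every node.
# Channel 2: simple order for commits/aborts; C(A) may even arrive
# before WS(A) does on some node.
total_order = deque([("WS", "A"), ("WS", "B"), ("WS", "C")])
simple = deque([("COMMIT", "A"), ("COMMIT", "B")])

def run_node(me, own_rows, rows_of):
    """Check each incoming writeset against our own; once our own
    writeset arrives with no prior conflict, commit it without
    waiting for C(A) or C(B)."""
    for kind, origin in total_order:
        if origin == me:
            print(f"{me}: WS({me}) came back, no prior conflict -> commit now")
            simple.append(("COMMIT", me))
            return
        if rows_of[origin] & own_rows:
            print(f"{me}: WS({origin}) conflicts -> local changes roll back")
            return
        print(f"{me}: WS({origin}) checked, no conflict")

run_node("C", own_rows={"t3:1"}, rows_of={"A": {"t1:7"}, "B": {"t2:9"}})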

Hope that helps,

Darren



Re: Big 7.4 items

From
Jan Wieck
Date:
darren@up.hrcoxmail.com wrote:
> 
> >
> >
> > Darren, can you clarify this?  Why does it send that message?  How does
> > it allow commits not to wait for ordered writesets?
> >
> 
> There are two channels.  One for total order writesets
> (changes to the DB).  The other is simple order for
> aborts, commits, joins (systems joining the replica), etc.
> The simple channel is necessary, because we don't want to
> wait for total ordered changes to get an abort message and
> so forth.  In some cases you might get an abort or a commit
> message before you get the writeset it refers to.
> 
> Let's say we have systems A, B and C.  Each one has some
> changes and sends a writeset to the group communication
> system (GCS).  The total order dictates WS(A), WS(B), and
> WS(C) and the write sets are received in that order at
> each system.  Now C gets WS(A) no conflict, gets WS(B) no
> conflict, and receives WS(C).  Now C can commit WS(C) even
> before the commit messages C(A) or C(B), because there is no
> conflict.

And that is IMHO not synchronous. C does not have to wait for A and B to
finish the same tasks. If now at this very moment two new transactions
query system A and system C (assuming A has not yet committed WS(C)
while C has), they will get different data back (thanks to non-blocking
reads). I think this is pretty asynchronous. 

It doesn't lead to inconsistencies, because the transaction on A cannot
do something that is in conflict with the changes made by WS(C), since
its WS(A)2 will come back after WS(C) arrived at A and thus WS(C)
arriving at A will cause WS(A)2 to roll back (WS used synonymously with Xact
in this context).

> 
> Hope that helps,
> 
> Darren

Hope this doesn't add too much confusion :-)

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Big 7.4 items

From
Bruce Momjian
Date:
darren@up.hrcoxmail.com wrote:
> >  
> > 
> > Darren, can you clarify this?  Why does it send that message?  How does
> > it allow commits not to wait for ordered writesets?
> > 
> 
> There are two channels.  One for total order writesets 
> (changes to the DB).  The other is simple order for
> aborts, commits, joins (systems joining the replica), etc.
> The simple channel is necessary, because we don't want to
> wait for total ordered changes to get an abort message and
> so forth.  In some cases you might get an abort or a commit
> message before you get the writeset it refers to.
> 
> Let's say we have systems A, B and C.  Each one has some
> changes and sends a writeset to the group communication
> system (GCS).  The total order dictates WS(A), WS(B), and
> WS(C) and the write sets are received in that order at
> each system.  Now C gets WS(A) no conflict, gets WS(B) no
> conflict, and receives WS(C).  Now C can commit WS(C) even 
> before the commit messages C(A) or C(B), because there is no
> conflict.  

Oh, so C doesn't apply A's changes until it sees A's commit, but it can
continue with its own changes because there is no conflict?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
Jan Wieck
Date:
darren@up.hrcoxmail.com wrote:
> 
> > It is asynchronous without the need of 2 phase commit. It is group
> > communication based and requires the group communication system to
> > guarantee total order. The tricky part is that the local transaction
> > must be on hold until its own commit message comes back without a prior
> 
> No, it holds until its own Writeset comes back.  It commits
> and then sends a commit message on the simple channel, so
> commits don't wait for ordered writesets.
> 
> Remember, total order guarantees that if no changes in front of
> the local changes conflict, the local changes can commit.

Right, it's the writeset ... that gets sent just before you flip bits
in the clog, then wait until it comes back and flip 'em.

> >
> > The last time I was playing with Spread (that was at Great Bridge in
> > Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
> > dropped messages when the network load got too high. This occurred
> > without any indication, no error, nothing. This is not exactly what I
> > understand as total order. I hope they have made some substantial
> > progress on that.
> >
> 
> I remember the TCL tester you set up, and having problems,
> but I don't recall investigating what the problems were.
> If you still have the code I can try and reproduce the
> problem, and investigate it on the spread list.

Maybe you heard about it, there was this funny conversation while
walking down the hallway:

"Did that German guy ever turn in his notebook?"

"You mean THIS German guy?"

"Yes, did he turn it in?"

"He is here. Right BEHIND YOU!!!"

"Hummmpf ... er ..."


The stuff was on that notebook. Sorry.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Big 7.4 items

From
Neil Conway
Date:
On Fri, 2002-12-13 at 13:36, Jan Wieck wrote:
> But you cannot use the result of such a SELECT to update anything. So
> you can only farm out complete read-only transactions to the slaves.
> Requires support from the application since the load balancing system
> cannot know automatically what will be a read only transaction and what
> not.

Interesting -- SQL contains the concept of "read only" and "read write"
transactions (the default is RW). If we implemented that (which
shouldn't be too difficult[1]), it might make differentiating between
classes of transactions a little easier. Client applications would still
need to be modified, but not nearly as much.

Does this sound like it's worth doing?

[1] -- AFAICS, the only tricky implementation detail is deciding exactly
which database operations are "writes". Does nextval() count, for
example?
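
(An illustrative sketch, not part of the original mail: how a balancer
might use such a declaration.  Host names are made up, and per the
footnote, deciding what counts as a "write" is exactly the open
question.)

MASTER = "master.example.com"
SLAVES = ["slave1.example.com", "slave2.example.com"]

def route(first_statement):
    """Send transactions the client declared READ ONLY to a slave;
    everything else (including anything that might call nextval())
    goes to the master."""
    if first_statement.upper().startswith("SET TRANSACTION READ ONLY"):
        return SLAVES[0]  # a real balancer would also keep sessions sticky
    return MASTER

print(route("SET TRANSACTION READ ONLY"))  # -> slave
print(route("UPDATE accounts SET ..."))    # -> master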

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC





Re: Big 7.4 items

From
Bruce Momjian
Date:
Neil Conway wrote:
> On Fri, 2002-12-13 at 13:36, Jan Wieck wrote:
> > But you cannot use the result of such a SELECT to update anything. So
> > you can only farm out complete read-only transactions to the slaves.
> > Requires support from the application since the load balancing system
> > cannot know automatically what will be a read only transaction and what
> > not.
> 
> Interesting -- SQL contains the concept of "read only" and "read write"
> transactions (the default is RW). If we implemented that (which
> shouldn't be too difficult[1]), it might make differentiating between
> classes of transactions a little easier. Client applications would still
> need to be modified, but not nearly as much.
> 
> Does this sound like it's worth doing?
> 
> [1] -- AFAICS, the only tricky implementation detail is deciding exactly
> which database operations are "writes". Does nextval() count, for
> example?

You can't migrate a session between nodes, so the entire _session_ has
to be read-only.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Big 7.4 items

From
"Christopher Kings-Lynne"
Date:
> This is a good point.  I don't want to push Postgres-R as our solution.
> Rather, I have looked at both and like Postgres-R, but others need to
> look at both and decide so we are all in agreement when we move forward.

I think either way, it's clear that they need to be in the main CVS, in
order for it to get up to speed.

Chris




Re: Big 7.4 items

From
Darren Johnson
Date:
>
>
>>
>>Let's say we have systems A, B and C.  Each one has some
>>changes and sends a writeset to the group communication
>>system (GCS).  The total order dictates WS(A), WS(B), and
>>WS(C) and the write sets are received in that order at
>>each system.  Now C gets WS(A) no conflict, gets WS(B) no
>>conflict, and receives WS(C).  Now C can commit WS(C) even
>>before the commit messages C(A) or C(B), because there is no
>>conflict.
>>
>
>And that is IMHO not synchronous. C does not have to wait for A and B to
>finish the same tasks. If now at this very moment two new transactions
>query system A and system C (assuming A has not yet committed WS(C)
>while C has), they will get different data back (thanks to non-blocking
>reads). I think this is pretty asynchronous. 
>

So if we hold WS(C) until we receive commit messages for WS(A) and
WS(B), will that meet your synchronous expectations, or do all the
systems need to commit the WS in the same order and at the same exact
time?

>
>
>It doesn't lead to inconsistencies, because the transaction on A cannot
>do something that is in conflict with the changes made by WS(C), since
>its WS(A)2 will come back after WS(C) arrived at A and thus WS(C)
>arriving at A will cause WS(A)2 to roll back (WS used synonymously with Xact
>in this context).
>
Right

>
>Hope this doesn't add too much confusion :-)
>
No, however I guess I need to adjust my slides to include your
definition of synchronous replication.  ;-)

Darren

>




Re: Big 7.4 items

From
Jan Wieck
Date:
Bruce Momjian wrote:
> 
> Joe Conway wrote:
> > Bruce Momjian wrote:
> > > Win32 Port:
> > >
> > >     Katie Ward and Jan are working on contributing their Win32
> > >     port for 7.4.  They plan to have a patch available by the end of
> > >     December.
> >
> > I have .Net Studio available to me, so if you need help in merging or testing
> > or whatever, let me know.
> 
> OK, Jan, let him know how he can help.

My current plan is to comb out the Win32 port only from what we've done
all together against 7.2.1. The result should be a clean patch that,
applied against 7.2.1, builds a native Windows port.

From there, this patch must be lifted up to 7.4.

I have the original context diff now down from 160,000 lines to 80,000
lines. I think I will have the clean diff against 7.2.1 somewhere next
week. That would IMHO be a good time for Tom to start complaining so
that we can work in the required changes during the 7.4 lifting. ;-)


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Big 7.4 items

From
Neil Conway
Date:
On Fri, 2002-12-13 at 20:20, Christopher Kings-Lynne wrote:
> > This is a good point.  I don't want to push Postgres-R as our solution.
> > Rather, I have looked at both and like Postgres-R, but others need to
> > look at both and decide so we are all in agreement when we move forward.
> 
> I think either way, it's clear that they need to be in the main CVS, in
> order for it to get up to speed.

Why's that?

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC





Re: Big 7.4 items

From
Shridhar Daithankar
Date:
On Friday 13 December 2002 11:01 pm, you wrote:
> Good.  This is the discussion we need.  Let me quote the TODO list
> replication section first:
>
>     * Add replication of distributed databases [replication]
>         o automatic failover

Very good. We need that for HA.

>         o load balancing

>         o master/slave replication
>         o multi-master replication
>         o partition data across servers

I am interested in this for a multitude of reasons. Scalability is obviously one
of them. But I was just wondering about some things (after going through all the
messages on this).

Once we have partitioning and replication, that effectively means the database
cache can span multiple machines and is no longer restricted by shared memory.
So will it work on Mosix now? Just a thought.


> OK, the first thing is that there isn't any one replication solution
> that will behave optimally in all situations.
>
> Now, let me describe Postgres-R and then the other replication
> solutions.  Postgres-R is multi-master, meaning you can send SELECT and
> UPDATE/DELETE queries to any of the servers in the cluster, and get the
> same result.  It is also synchronous, meaning it doesn't update the
> local copy until it is sure the other nodes agree to the change. It
> allows failover, because if one node goes down, the others keep going.
>
> Now, let me contrast:
>
> rserv and dbmirror do master/slave.  There is no mechanism to allow you
> to do updates on the slave, and have them propagate to the master.  You
> can, however, send SELECT queries to the slave, and in fact that's how
> usogres does load balancing.

Seems like mirroring vs. striping to me. Can we have both combined in either
fashion, just like RAID?

Most importantly, will it be able to resize the cluster on the fly? Are we
looking at network management of the database like Oracle does? (OK, the tools
are unwarranted in many situations, but it has to offer them.)

Most importantly I would like to see this thing easy to set up from a one
point-of-administration view.

Something like: declare a cluster of these n1 machines as database partitions
and have these other n2 machines do a slave sync with them for handling
loads. If these kinds of command-line options are there, adding easy tools on
top of them should be a snap.

And please, in-place database upgrades. Otherwise it will be a huge pain to
maintain such a cluster over long periods of time.


> Another replication need is for asynchronous replication, most
> traditionally for traveling salesmen who need to update their databases
> periodically.  The only solution I know for that is PeerDirect's
> PostgreSQL commercial offering at http://www.peerdirect.com.  It is
> possible PITR may help with this, but we need to handle propagating
> changes made by the salesmen back up into the server, and to do that, we
> will need a mechanism to handle conflicts that occur when two people
> update the same records.  This is always a problem for asynchronous
> replication.

We need not offer entire asynchronous replication all at once. We can have
levels of asynchronous replication, like read-only (syncing, essentially) and
read-write. Even if we get slave sync only at first, that will be a huge plus.

> > 2) If we are going to have replication, can we have built-in load
> > balancing? Is it a good idea to have it in postgresql, or would a
> > separate application be the way to go?
>
> Well, because Postgres-R is multi-master, it has automatic load
> balancing.  You can have your users point to whatever node you want.
> You can implement this "pointing" by using dns IP address cycling, or
> have a router that auto-load balances, though you would need to keep a
> db session on the same node, of course.

Umm. W.r.t. the above point, i.e. combining data partitioning and slave-sync, will
it take a partitioned cluster as a single server, or can that cluster take
care of itself in such situations?

>
> So, in summary, I think we will eventually have two directions for
> replication.  One is Postgres-R for multi-master, synchronous
> replication, and PITR, for asynchronous replication.  I don't think

I would put that as two options rather than two directions. We need to be able to
deploy them both if required.

Imagine postgresql running over a 500-machine cluster... ;-)
Shridhar


Re: Big 7.4 items

From
Justin Clift
Date:
darren@up.hrcoxmail.com wrote:
>>It is asynchronous without the need of 2 phase commit. It is group
>>communication based and requires the group communication system to
>>guarantee total order. The tricky part is that the local transaction
>>must be on hold until its own commit message comes back without a prior
> 
> 
> No, it holds until its own writeset comes back.  It commits
> and then sends a commit message on the simple channel, so
> commits don't wait for ordered writesets.
> 
> Remember, total order guarantees that if no changes in front of
> the local changes conflict, the local changes can commit.

Do people have to be careful about how they use sequences, as they don't normally roll back?

Regards and best wishes,

Justin Clift

<snip>
> Darren

-- 
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi



Re: Big 7.4 items

From
Justin Clift
Date:
Bruce Momjian wrote:
> Joe Conway wrote:
<snip>
>>>Point-In-Time Recovery (PITR)
>>>
>>>    J. R. Nield did a PITR patch late in 7.3 development, and Patrick
>>>    MacDonald from Red Hat is working on merging it into CVS and
>>>    adding any missing pieces.  Patrick, do you have an ETA on that?
>>
>>As Hannu asked (and related to your question below), is there any thought of 
>>extending this to allow simple log based replication? In many important 
>>scenarios that would be more than adequate, and simpler to set up.
<snip>

For PITR-log-based-replication, how much data would be required to be pushed 
out to each slave system in order to bring it up to date?

I'm having visions of a 16MB WAL file being pushed out to slave systems in 
order to update them with a few rows of data...

:-/

Regards and best wishes,

Justin Clift

-- 
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi



Re: Big 7.4 items

From
"Shridhar Daithankar"
Date:
On 14 Dec 2002 at 18:02, Justin Clift wrote:
> For PITR-log-based-replication, how much data would be required to be
> pushed out to each slave system in order to bring it up to date?
> 
> I'm having visions of a 16MB WAL file being pushed out to slave systems
> in order to update them with a few rows of data...

I was under the impression that data is pushed to the slave after a checkpoint 
is complete, i.e. after a 16MB WAL file has been recycled.

Conversely, a slave would contain accurate data up to the last WAL checkpoint.

I think a tunable WAL size should be of some help in such a scenario. Otherwise 
the system designer has to use async replication for per-transaction 
granularity.


Bye
Shridhar

--
Conference, n.: A special meeting in which the boss gathers subordinates to 
hear what they have to say, so long as it doesn't conflict with what he's 
already decided to do.



Re: Big 7.4 items

From
Christopher Kings-Lynne
Date:
> > I think in either way, it's clear that they need to be in the main CVS, in
> > order for it to get up to speed.
>
> Why's that?

Because until replication is in CVS, it won't be used, tested, improved, and 
developed as fast...

Chris




Re: Big 7.4 items - Replication

From
"Al Sutton"
Date:
For live replication could I propose that we consider the systems A, B, and C
connected to each other independently (i.e. A has links to B and C, B has
links to A and C, and C has links to A and B), and that replication is
handled by the node receiving the write-based transaction.

If we consider a write transaction that arrives at A (called WT(A)), system
A will then send WT(A) to systems B and C via its direct connections.
System A will receive back either an OK response if there are no conflicts,
a NOT_OK response if there are conflicts, or no response if the system is
unavailable.

If system A receives a NOT_OK response from any other node, it begins the
process of rolling back the transaction on all nodes which previously
issued an OK, and the transaction returns a failure code to the client which
submitted WT(A). The other systems (B and C) would track recent transactions,
and there would be a specified timeout after which the transaction is
considered safe and could not be rolled back.

Any system not returning an OK or NOT_OK state is assumed to be down,
error messages are logged to state that the transaction could not be sent to
the system due to its unavailability, and any monitoring system would
alert the administrator that a replicant is faulty.

There would also need to be code developed to ensure that a system could be
brought into sync with the current state of other systems within the group,
in order to allow new databases to be added, and faulty databases to be
re-entered into the group. This code could also be used for non-realtime
replication, to allow databases to be synchronised with the live master.

This would give a multi-master solution whereby a write transaction to any
one node would guarantee that all available replicants would also hold the
data once it is completed, and would also provide the code to handle
scenarios where non-realtime data replication is required.

This system assumes that the majority of transactions will be successful
(which should be the case for a well-designed system).

Comments?

Al.






----- Original Message -----
From: "Darren Johnson" <darren@up.hrcoxmail.com>
To: "Jan Wieck" <JanWieck@Yahoo.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>;
<shridhar_daithankar@persistent.co.in>; "PostgreSQL-development"
<pgsql-hackers@postgresql.org>
Sent: Saturday, December 14, 2002 1:28 AM
Subject: [mail] Re: [HACKERS] Big 7.4 items


> >
> >
> >>
> >>Let's say we have systems A, B and C.  Each one has some
> >>changes and sends a writeset to the group communication
> >>system (GCS).  The total order dictates WS(A), WS(B), and
> >>WS(C), and the writesets are received in that order at
> >>each system.  Now C gets WS(A), no conflict, gets WS(B), no
> >>conflict, and receives WS(C).  Now C can commit WS(C) even
> >>before the commit messages C(A) or C(B), because there is no
> >>conflict.
> >>
> >
> >And that is IMHO not synchronous. C does not have to wait for A and B to
> >finish the same tasks. If now at this very moment two new transactions
> >query system A and system C (assuming A has not yet committed WS(C)
> >while C has), they will get different data back (thanks to non-blocking
> >reads). I think this is pretty asynchronous.
> >
>
> So if we hold WS(C) until we receive commit messages for WS(A) and
> WS(B), will that meet your synchronous expectations, or do all the
> systems need to commit the WS in the same order and at the exact
> same time?
>
> >
> >
> >It doesn't lead to inconsistencies, because the transaction on A cannot
> >do something that is in conflict with the changes made by WS(C), since
> >it's WS(A)2 will come back after WS(C) arrived at A and thus WS(C)
> >arriving at A will cause WS(A)2 to rollback (WS used synonymous to Xact
> >in this context).
> >
> Right
>
> >
> >Hope this doesn't add too much confusion :-)
> >
> No, however I guess I need to adjust my slides to include your
> definition of synchronous
> replication.  ;-)
>
> Darren




Re: Big 7.4 items - Replication

From
Bruce Momjian
Date:
This sounds like two-phase commit. While it will work, it is probably
slower than Postgres-R's method.

---------------------------------------------------------------------------

Al Sutton wrote:
> For live replication could I propose that we consider the systems A, B, and C
> connected to each other independently (i.e. A has links to B and C, B has
> links to A and C, and C has links to A and B), and that replication is
> handled by the node receiving the write-based transaction.
<snip>
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073


Re: Big 7.4 items - Replication

From
Mathieu Arnold
Date:

--En cette belle journée de samedi 14 décembre 2002 11:59 -0500,
-- Bruce Momjian écrivait avec ses petits doigts :
>
> This sounds like two-phase commit. While it will work, it is probably
> slower than Postgres-R's method.

What exactly is Postgres-R's method ?

--
Mathieu Arnold


Re: [mail] Re: Big 7.4 items - Replication

From
"Al Sutton"
Date:
I see it as very difficult to avoid a two-stage process, because there will
be the following two parts to any transaction:

1) All databases must agree upon the acceptability of a transaction before
the client can be informed of its success.

2) All databases must be informed as to whether or not the transaction was
accepted by the entire replicant set, and thus whether it should be written
to the database.

If stage 1 is missed, then the client application may be informed of a
successful transaction which may fail when it is replicated to other
databases.

If stage 2 is missed, then databases may become out of sync because they have
accepted transactions that were rejected by other databases.

From reading the PDF on Postgres-R, I can see that one of two things
will occur:

a) There will be a central point of synchronization where conflicts will be
tested and dealt with. This is not desirable, because it will leave the
synchronization and replication processing load concentrated in one place,
which will limit scalability as well as leaving a single point of failure.

or

b) The Group Communication blob will consist of a number of processes which
need to talk to all of the others to interrogate them for changes which may
conflict with the current write being handled, and then issue the
transaction response. This is basically the two-phase commit solution with
the phases moved into the group communication process.

I can see the possibility of using solution b and having fewer group
communication processes than databases as an attempt to simplify things, but
this would mean the loss of a number of databases if the machine running the
group communication process for the set of databases is lost.

Al.

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: "Al Sutton" <al@alsutton.com>
Cc: "Darren Johnson" <darren@up.hrcoxmail.com>; "Jan Wieck"
<JanWieck@Yahoo.com>; <shridhar_daithankar@persistent.co.in>;
"PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Saturday, December 14, 2002 4:59 PM
Subject: [mail] Re: [HACKERS] Big 7.4 items - Replication


> This sounds like two-phase commit. While it will work, it is probably
> slower than Postgres-R's method.
<snip>




Re: [mail] Re: Big 7.4 items - Replication

From
Darren Johnson
Date:
>b) The Group Communication blob will consist of a number of processes which
>need to talk to all of the others to interrogate them for changes which may
>conflict with the current write being handled, and then issue the
>transaction response. This is basically the two-phase commit solution with
>the phases moved into the group communication process.
>
>I can see the possibility of using solution b and having fewer group
>communication processes than databases as an attempt to simplify things, but
>this would mean the loss of a number of databases if the machine running the
>group communication process for the set of databases is lost.
>
The group communication system doesn't just run on one system.  For
Postgres-R using Spread, there is actually a Spread daemon that runs on each
database server.  It has nothing to do with detecting the conflicts.  Its job
is to deliver messages in a total order for writesets, or simple order for
commits, aborts, joins, etc.

The detection of conflicts will be done at the database level, by a backend
process.  The basic concept is: if all databases get the writesets (changes)
in the exact same order, and apply them in a consistent order, avoiding
conflicts, then one-copy serialization is achieved (one copy of the database
replicated across all databases in the replica).
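
To make that concrete, here is a minimal sketch (hypothetical names, not the
actual Postgres-R code) of the apply loop each node could run over the
totally ordered writeset stream:

    # Sketch only: every node sees the same deliver() sequence, so all
    # nodes reach the same commit/abort decisions without extra messages.
    class Node:
        def __init__(self, name):
            self.name = name
            self.pending = {}   # local writeset id -> set of rows touched

        def local_writeset(self, ws_id, rows):
            # A local transaction finishes and multicasts its writeset.
            self.pending[ws_id] = set(rows)

        def deliver(self, ws_id, rows, origin):
            # Called for every writeset, in the GCS total order.
            if origin == self.name:
                # Our own writeset came back; if no earlier remote writeset
                # conflicted with it, it is still pending and can commit.
                return "COMMIT" if self.pending.pop(ws_id, None) else "ABORT"
            # A remote writeset ordered ahead of ours: abort any local
            # transaction whose rows it touches, then apply the remote one.
            conflicts = [i for i, r in self.pending.items() if set(rows) & r]
            for local_id in conflicts:
                del self.pending[local_id]
            return "APPLY"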

I hope that explains the group communication system's responsibility.

Darren






Re: [mail] Re: Big 7.4 items - Replication

From
"Al Sutton"
Date:
Many thanks for the explanation. Could you explain to me how the order of
the writesets is decided for the following scenario?

If a transaction takes 50ms to reach one database from another, then for a
specific data element (called X), the following timeline occurs:

at 0ms, T1(X) is written to system A.
at 10ms, T2(X) is written to system B.

Where T1(X) and T2(X) conflict.

My concern is that if the Group Communication Daemon (gcd) is operating on
each database, a successful result for T1(X) will be returned to the client
talking to database A because T2(X) has not reached it, and thus no conflict
is known about, and a successful result is returned to the client submitting
T2(X) to database B because it is not aware of T1(X). This would mean that
the two clients believe both T1(X) and T2(X) completed successfully, yet
they cannot both have done so, due to the conflict.

Thanks,

Al.

----- Original Message -----
From: "Darren Johnson" <darren@up.hrcoxmail.com>
To: "Al Sutton" <al@alsutton.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>; "Jan Wieck"
<JanWieck@Yahoo.com>; <shridhar_daithankar@persistent.co.in>;
"PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Saturday, December 14, 2002 6:48 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication


> The group communication system doesn't just run on one system.  For
> Postgres-R using Spread, there is actually a Spread daemon that runs on
> each database server.
<snip>




Re: [MLIST] Re: [mail] Re: Big 7.4 items - Replication

From
David Walker
Date:
Another concern I have with multi-master systems is what happens if the 
network splits in two, so that two master systems are taking commits for two 
separate sets of clients.  It seems to me that re-syncing the two databases 
once the network heals would be a very complex, or even impossible, task.

On Sunday 15 December 2002 04:16 am, Al Sutton wrote:
> Many thanks for the explanation. Could you explain to me how the order of
> the writesets is decided for the following scenario?
<snip>



Re: [MLIST] Re: [mail] Re: Big 7.4 items - Replication

From
"Al Sutton"
Date:
David,

This can be resolved by requiring that, for any transaction to succeed, the
entry-point database must receive acknowledgements from n/2 + 0.5 (rounded up
to the nearest integer) databases, where n is the total number in the
replicant set, i.e. a strict majority. The following cases are shown as an
example (see the sketch after the list):

Total Number of databases: 2
Number required to accept transaction: 2

Total Number of databases: 3
Number required to accept transaction: 2

Total Number of databases: 4
Number required to accept transaction: 3

Total Number of databases: 5
Number required to accept transaction: 3

Total Number of databases: 6
Number required to accept transaction: 4

Total Number of databases: 7
Number required to accept transaction: 4

Total Number of databases: 8
Number required to accept transaction: 5

This would prevent two replicant sub-sets forming, because it is impossible
for both sets to have over 50% of the databases.
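
As a quick sketch of the rule (the function name is mine, not from any
proposed implementation):

    import math

    def acks_required(n):
        # n/2 + 0.5, rounded up to the nearest integer: a strict majority
        # of the replicant set must acknowledge the transaction.
        return math.ceil(n / 2 + 0.5)

    for n in range(2, 9):
        print(n, acks_required(n))   # 2->2, 3->2, 4->3, 5->3, 6->4, 7->4, 8->5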

Applications would be able to detect when a database has dropped out of the
replicant set, because the database could report a state of "Unable to obtain
majority consensus". This would allow applications to differentiate between a
database out of the set, where writing to other databases in the set could
yield a successful result, and "Unable to commit due to conflict", where
trying other databases is pointless.

Al

----- Original Message -----
From: "David Walker" <pgsql@grax.com>
To: "Al Sutton" <al@alsutton.com>; "Darren Johnson"
<darren@up.hrcoxmail.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>; "Jan Wieck"
<JanWieck@Yahoo.com>; <shridhar_daithankar@persistent.co.in>;
"PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Sunday, December 15, 2002 2:29 PM
Subject: Re: [MLIST] Re: [mail] Re: [HACKERS] Big 7.4 items - Replication


> Another concern I have with multi-master systems is what happens if the
> network splits in two, so that two master systems are taking commits for
> two separate sets of clients.
<snip>




Re: [mail] Re: Big 7.4 items - Replication

From
"Al Sutton"
Date:
Jonathan,

How do the group communication daemons on system A and B agree that T2 is
after T1?

As I understand it, the operation is performed locally before being passed on
to the group for replication; when T2 arrives at system B, system B has no
knowledge of T1 and so can perform T2 successfully.

I am guessing that System B performs T2 locally, sends it to the group
communication daemon for ordering, and then receives it back from the group
communication order queue, after its position in the order queue has been
decided, before it is written to the database.

This would indicate to me that there is a single central point which decides
that T2 is after T1.

Is this true?

Al.

----- Original Message -----
From: "Jonathan Stanton" <jonathan@cnds.jhu.edu>
To: "Al Sutton" <al@alsutton.com>
Cc: "Darren Johnson" <darren@up.hrcoxmail.com>; "Bruce Momjian"
<pgman@candle.pha.pa.us>; "Jan Wieck" <JanWieck@Yahoo.com>;
<shridhar_daithankar@persistent.co.in>; "PostgreSQL-development"
<pgsql-hackers@postgresql.org>
Sent: Sunday, December 15, 2002 5:00 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication


> The total order provided by the group communication daemons guarantees
> that every member will see the transactions/writesets in the same order.
> So both A and B will see that T1 is ordered before T2 BEFORE writing
> anything back to the client. So for both servers T1 will be completed
> successfully, and T2 will be aborted because of conflicting writesets.
>
> Jonathan
<snip>




Re: [mail] Re: Big 7.4 items - Replication

From
"Al Sutton"
Date:
Jonathan,

Many thanks for clarifying the situation some more. With token passing, I
have the following concerns:

1) What happens if a server should die whilst it is in possession of the
token?

2) If I have n servers, and the time to pass the token between servers is x
milliseconds, I may have to wait for up to n times x milliseconds for a
transaction to be processed. If a server is limited to a single transaction
per possession of the token (in order to ensure no system hogs the token),
and the server develops a queue of length y, I will have to wait n times x
times y for the transaction to be processed.  Neither scenario, I believe,
would scale well beyond a small set of servers with low network latency
between them.

If we consider the following situation, I can illustrate why I'm still in
favour of a two-phase commit.

Imagine, for example, credit card details about the status of an account
replicated in real time between databases in London, Moscow, Singapore,
Sydney, and New York. If any server can talk to any other server with a
guaranteed packet transfer time of 150ms, a two-phase commit could complete
in 600ms in its worst case (assuming that the two phases consist of
request/response pairs, and that each server talks to all the others in
parallel). A token-passing system may have to wait for the token to pass
through every other server before reaching the one that has the transaction
committed to it, which could take about 750ms.

If you then expand the network to allow for a primary and a disaster recovery
database at each location, the two-phase commit still maintains its 600ms
response time, but the token-passing system doubles to 1500ms.
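
The arithmetic behind those figures, as a sketch (assuming a flat 150ms
inter-site latency and one outstanding transaction; the function names are
mine):

    def two_phase_commit_worst(latency_ms):
        # Two phases, each a parallel request/response round trip,
        # so four one-way hops regardless of cluster size.
        return 4 * latency_ms

    def token_ring_worst(n_servers, latency_ms):
        # Worst case: the token has just left, and must visit every
        # other server before returning to the waiting one.
        return n_servers * latency_ms

    print(two_phase_commit_worst(150))   # 600 ms for any number of sites
    print(token_ring_worst(5, 150))      # 750 ms with five sites
    print(token_ring_worst(10, 150))     # 1500 ms with a primary/DR pair per site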

Allowing disjointed segments to continue executing is also a concern, because
any split in the replication group could effectively double the accepted card
limit for any cardholder, should they purchase items from various locations
around the globe.

I can see an idea whereby the token is passed to the system with the most
transactions in a wait state, but this would cause low-volume databases to
lose out on response times to higher-volume ones, which is, again,
undesirable.

Al.

----- Original Message -----
From: "Jonathan Stanton" <jonathan@cnds.jhu.edu>
To: "Al Sutton" <al@alsutton.com>
Cc: "Darren Johnson" <darren@up.hrcoxmail.com>; "Bruce Momjian"
<pgman@candle.pha.pa.us>; "Jan Wieck" <JanWieck@Yahoo.com>;
<shridhar_daithankar@persistent.co.in>; "PostgreSQL-development"
<pgsql-hackers@postgresql.org>
Sent: Sunday, December 15, 2002 9:17 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication


> On Sun, Dec 15, 2002 at 07:42:35PM -0000, Al Sutton wrote:
> > Jonathan,
> >
> > How do the group communication daemons on system A and B agree that T2
> > is after T1?
>
> Let's split this into two separate problems:
>
> 1) How do the daemons totally order a set of messages (abstract
> messages)?
>
> 2) How do database transactions get split into writesets that are sent
> as messages through the group communication system?
>
> As to question 1, the set of daemons (usually one running on each
> participating server) run a distributed ordering algorithm, as well as
> distributed algorithms to provide message reliability, fault detection,
> and membership services. These are completely distributed algorithms; no
> "central" controller node exists, so even if network partitions occur
> the group communication system keeps running and providing ordering and
> reliability guarantees for messages.
>
> A number of different algorithms exist for providing a total order on
> messages. Spread currently uses a token algorithm, involving a token
> passed between the daemons and a counter attached to each message, but
> other algorithms exist and we have implemented some other ones in our
> research. You can find lots of details in the papers at
> www.cnds.jhu.edu/publications/ and www.spread.org.
>
> As to question 2, there are several different approaches to using such a
> total order for actual database replication. They all use the GCS total
> order to establish a single sequence of "events" that all the databases
> see. Then each database can act on the events as they are delivered by
> the GCS and be guaranteed that no other database will see a different
> order.
>
> In the Postgres-R case, the action received from a client is performed
> partially at the originating postgres server; the writesets are then
> sent through the GCS to order them and determine conflicts. Once they
> are delivered back, if no conflicts occurred in the meantime, the
> original transaction is completed and the result returned to the client.
> If a conflict occurred, the original transaction is rolled back and
> aborted, and the abort is returned to the client.
>
> > As I understand it, the operation is performed locally before being
> > passed on to the group for replication; when T2 arrives at system B,
> > system B has no knowledge of T1 and so can perform T2 successfully.
> >
> > I am guessing that System B performs T2 locally, sends it to the group
> > communication daemon for ordering, and then receives it back from the
> > group communication order queue, after its position in the order queue
> > has been decided, before it is written to the database.
>
> If I understand the above correctly, yes, that is the same as I describe
> above.
>
> > This would indicate to me that there is a single central point which
> > decides that T2 is after T1.
>
> No, there is a distributed algorithm that determines the order. The
> distributed algorithm "emulates" a central controller who decides the
> order, but no single controller actually exists.
>
> Jonathan
<snip>
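
As an aside, a toy sketch of the token-plus-counter ordering Jonathan
describes (purely illustrative; the real Spread protocol also handles
reliability, membership, and token recovery):

    # A token circulates among daemons; whoever holds it stamps its queued
    # messages with consecutive sequence numbers, so every daemon can
    # deliver all messages in one global order.
    class Daemon:
        def __init__(self, name):
            self.name = name
            self.outbox = []            # writesets waiting to be ordered

        def on_token(self, token_seq):
            stamped = []
            for msg in self.outbox:     # stamp while holding the token
                token_seq += 1
                stamped.append((token_seq, self.name, msg))
            self.outbox.clear()
            return token_seq, stamped   # stamped messages go to all daemons

    daemons = [Daemon("A"), Daemon("B"), Daemon("C")]
    daemons[0].outbox.append("WS(A)")
    daemons[2].outbox.append("WS(C)")
    seq, delivered = 0, []
    for d in daemons:                   # one lap of the token
        seq, stamped = d.on_token(seq)
        delivered.extend(stamped)
    print(delivered)                    # the single order every daemon sees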




Re: [mail] Re: Big 7.4 items - Replication

From
Jan Wieck
Date:
Darren Johnson wrote:

> The group communication system doesn't just run on one system.  For
> Postgres-R using Spread, there is actually a Spread daemon that runs on
> each database server.

The reason why group communication software is used is simply because
this software is designed with two goals in mind:

1) optimize bandwidth usage

2) make many-to-many communication easy

Number one is done by utilizing things like multicasting where
available.

Number two is done by using global scoped queues.

I add this only to avoid reading that pushing some PITR log snippets via
FTP, or worse, over a network would do the same. It did not in the past,
it does not now, and it will not in the future.


Jan

-- 

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Big 7.4 items

From
Greg Copeland
Date:
On Fri, 2002-12-13 at 04:53, Hannu Krosing wrote:
> <snip>
> 
> How hard would it be to extend PITR for master-slave (hot backup)
> replication, which should then amount to continuously shipping logs to
> the slave and doing nonstop PITR there :)
> 
> It will never be usable for multi-master replication, but somehow it
> feels that for master-slave replication simple log replay would be most
> simple and robust solution.

I'm curious: what would be the recovery strategy for PITR master-slave
replication should the master fail (assuming hot failover/backup)?  A
simple dump/restore?  Are there any facilities in PostgreSQL for PITR
archival which prevent PITR logs from being recycled (or perhaps, simply
archived off)?  What about PITR streaming to networked and/or removable
media?

-- 
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting



Re: Big 7.4 items

From
Shridhar Daithankar
Date:
On Monday 16 December 2002 07:26 pm, you wrote:
> I'm curious, what would be the recovery strategy for PITR master-slave
> replication should the master fail (assuming hot fail over/backup)?  A
> simple dump/restore?  Are there/is there any facilities in PorstgreSQL
> for PITR archival which prevents PITR logs from be recycled (or perhaps,
> simply archived off)?  What about PITR streaming to networked and/or
> removable media?

In asynchronous replication, WAL log records are fed to another host, which 
replays those transactions to sync the data. This way it does not matter if 
the WAL log is recycled, as it has already been replicated someplace else.
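
A minimal sketch of that shipping step (hypothetical paths and tool, not an
existing PostgreSQL utility; a real tool must only copy segments the server
has finished writing):

    import os, shutil

    WAL_DIR = "/usr/local/pgsql/data/pg_xlog"   # where the master writes WAL
    SHIP_DIR = "/mnt/standby/incoming"          # directory the standby replays

    def ship_completed_segments(shipped):
        # Copy any completed WAL segment we have not shipped yet, oldest
        # first; once a segment is safely copied, recycling it is harmless.
        for seg in sorted(os.listdir(WAL_DIR)):
            if seg not in shipped:
                shutil.copy(os.path.join(WAL_DIR, seg), SHIP_DIR)
                shipped.add(seg)
        return shipped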

HTH
Shridhar


Re: Big 7.4 items

From
Greg Copeland
Date:
I must have miscommunicated here, as you're describing PITR replication. 
I'm asking about a master failing and the slave picking up.  Now, some
n time later, how do you recover your master system to be back in sync
with the slave?  Obviously, I'm assuming some level of manual recovery. 
I'm wondering what the general approach would be.

Consider that on the slave, which is now the active server (master dead),
it's possible that the slave's PITR logs will be recycled before the master
can come back up.  As such, unless there is (a) an archival process for
PITR or (b) a method of streaming PITR logs off for archival, the odds of
using PITR to recover the master (resync, if you will) seem greatly
reduced, as you will be unable to replay PITR on the master for
synchronization.

Greg



On Mon, 2002-12-16 at 08:02, Shridhar Daithankar wrote:
> <snip>
> 
> In asynchronous replication, WAL log records are fed to another host, which 
> replays those transactions to sync the data. This way it does not matter if 
> the WAL log is recycled, as it has already been replicated someplace else.
> 
> HTH
> 
>  Shridhar
-- 
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting



Re: Big 7.4 items

From
Shridhar Daithankar
Date:
On Monday 16 December 2002 07:43 pm, you wrote:
> Consider that on the slave, which is now the active server (master dead),
> it's possible that the slave's PITR logs will be recycled before the master
> can come back up.  As such, unless there is (a) an archival process for
> PITR or (b) a method of streaming PITR logs off for archival, the odds of
> using PITR to recover the master (resync, if you will) seem greatly
> reduced, as you will be unable to replay PITR on the master for
> synchronization.

I agree. Since we are talking about features in a future release, I think it 
should be added to the TODO if not already there.

I don't know about WAL numbering, but AFAIU it increments, and old files are 
removed once there are as many WAL files as specified in postgresql.conf. IIRC 
there are some Perl-based replication projects which already use this 
feature.

Shridhar



Re: Big 7.4 items

From
Greg Copeland
Date:
On Mon, 2002-12-16 at 08:20, Shridhar Daithankar wrote:
> On Monday 16 December 2002 07:43 pm, you wrote:
> > <snip>
> 
> I agree. Since we are talking about features in a future release, I think it 
> should be added to the TODO if not already there.
> 
> I don't know about WAL numbering, but AFAIU it increments, and old files are 
> removed once there are as many WAL files as specified in postgresql.conf. IIRC 
> there are some Perl-based replication projects which already use this 
> feature.
> 

The problem with this is that most people, AFAICT, are going to size WAL
based on their performance/sizing requirements and not based on
theoretical estimates which someone might make to allow for a window of
failure.  That is, I don't believe increasing the number of WAL files is
going to satisfactorily address the issue.
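
To put numbers on that, here is a back-of-the-envelope calculation; the
segment size, segment count, and write rate are illustrative assumptions,
not measurements:

    # How long does a performance-sized WAL last before segments recycle?
    SEGMENT_MB = 16          # size of one WAL segment
    NUM_SEGMENTS = 64        # chosen for checkpoint throughput, not failures
    WRITE_MB_PER_HOUR = 200  # assumed sustained WAL generation under load

    window_hours = (SEGMENT_MB * NUM_SEGMENTS) / float(WRITE_MB_PER_HOUR)
    print("recycle window: %.1f hours" % window_hours)   # -> 5.1 hours

Roughly five hours -- nowhere near a multi-day master outage, which is why
archiving, rather than simply enlarging the WAL, is the robust fix.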


-- 
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting



Re: Big 7.4 items

From
Shridhar Daithankar
Date:
On Monday 16 December 2002 08:07 pm, you wrote:
> On Mon, 2002-12-16 at 08:20, Shridhar Daithankar wrote:
> > I don't know about WAL numbering, but AFAIU it increments, and old files
> > are removed once there are enough WAL files, as specified in
> > postgresql.conf. IIRC some Perl-based replication projects which use
> > this feature already exist.
>
> The problem with this is that most people, AFAICT, are going to size WAL
> based on their performance/sizing requirements and not based on
> theoretical estimates which someone might make to allow for a window of
> failure.  That is, I don't believe increasing the number of WAL files is
> going to satisfactorily address the issue.

Sorry for not being clear. When I said "WAL numbering", I meant the WAL
naming convention, where numbers are used to mark WAL files.

It is not the number of WAL files; that is entirely up to the installation.
IIRC, even in the replication project (sorry, I forgot the exact name), you
can set the number of WAL files it can have.
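
For concreteness, the naming convention as I understand it in the current
7.x sources: each segment file name is two 8-digit hex numbers, a log ID
and a segment number, so lexical order matches write order.  A toy
illustration (the helper below is mine, not a PostgreSQL API):

    # Toy illustration of 7.x-era WAL segment naming: "%08X%08X"
    # (log ID, segment number).  xlog_file_name() is a hypothetical helper.
    def xlog_file_name(log_id, seg):
        return "%08X%08X" % (log_id, seg)

    print(xlog_file_name(0, 0))    # 0000000000000000  (first segment)
    print(xlog_file_name(0, 255))  # 00000000000000FF
    print(xlog_file_name(1, 0))    # 0000000100000000  (sorts after the above)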
Shridhar



Re: Big 7.4 items

From
Bruce Momjian
Date:
Shridhar Daithankar wrote:
> On Monday 16 December 2002 08:07 pm, you wrote:
> > On Mon, 2002-12-16 at 08:20, Shridhar Daithankar wrote:
> > > I don't know about WAL numbering, but AFAIU it increments, and old files
> > > are removed once there are enough WAL files, as specified in
> > > postgresql.conf. IIRC some Perl-based replication projects which use
> > > this feature already exist.
> >
> > The problem with this is that most people, AFAICT, are going to size WAL
> > based on their performance/sizing requirements and not based on
> > theoretical estimates which someone might make to allow for a window of
> > failure.  That is, I don't believe increasing the number of WAL files is
> > going to satisfactorily address the issue.
> 
> Sorry for not being clear. When I said "WAL numbering", I meant the WAL
> naming convention, where numbers are used to mark WAL files.
> 
> It is not the number of WAL files; that is entirely up to the installation.
> IIRC, even in the replication project (sorry, I forgot the exact name), you
> can set the number of WAL files it can have.

Basically, PITR is going to have a way to archive off a log of database
changes, either from WAL or from somewhere else.  At some point, there
is going to have to be administrative action which says, "I have a
master down for three days.  I am going to have to save my PITR logs for
that period."  So, PITR will probably be used for recovery of a failed
master, and such recovery is going to require some administrative
action _if_ the automatic expiration of PITR logs is shorter than the
duration the master is down.
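
That administrative action could be as small as a hold flag that the
expiration job checks before pruning.  A sketch -- the directory layout,
retention period, and hold-file convention are all invented for
illustration:

    # Sketch: PITR log expiration that honors an administrative hold.
    import os
    import time

    ARCHIVE_DIR = "/var/archive/pitr"
    HOLD_FILE = os.path.join(ARCHIVE_DIR, "HOLD")  # admin: touch while master is down
    RETAIN_SECONDS = 24 * 3600                     # normal automatic expiration

    def expire_pitr_logs():
        if os.path.exists(HOLD_FILE):
            return   # a master is down; keep everything until the admin lifts the hold
        cutoff = time.time() - RETAIN_SECONDS
        for name in os.listdir(ARCHIVE_DIR):
            path = os.path.join(ARCHIVE_DIR, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.unlink(path)   # expired and not under hold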

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Big 7.4 items

From
Patrick Macdonald
Date:
Bruce Momjian wrote:
> 
> I wanted to outline some of the big items we are looking at for 7.4:
> 
> [snip]
>
> Point-In-Time Recovery (PITR)
> 
>         J. R. Nield did a PITR patch late in 7.3 development, and Patrick
>         MacDonald from Red Hat is working on merging it into CVS and
>         adding any missing pieces.  Patrick, do you have an ETA on that?

Neil Conway and I will be working on this starting the beginning
of January.  By the middle of January, we hope to have a handle on
an ETA.

Cheers,
Patrick
--
Patrick Macdonald
Red Hat Database Development


Re: Big 7.4 items

From
Bruce Momjian
Date:
Patrick Macdonald wrote:
> Bruce Momjian wrote:
> > 
> > I wanted to outline some of the big items we are looking at for 7.4:
> > 
> > [snip]
> >
> > Point-In-Time Recovery (PITR)
> > 
> >         J. R. Nield did a PITR patch late in 7.3 development, and Patrick
> >         MacDonald from Red Hat is working on merging it into CVS and
> >         adding any missing pieces.  Patrick, do you have an ETA on that?
> 
> Neil Conway and I will be working on this starting the beginning
> of January.  By the middle of January, we hope to have a handle on
> an ETA.

Eww, that is later than I was hoping.  I have put J.R.'s PITR patch up
at:
ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz

(I have tried to contact J.R. several times over the past few months,
with no reply.)

J.R. felt it was ready to go.  I would like to have an evaluation of the
patch to know what it does and what is missing.  I would like that
sooner rather than later because:
        o  I want to avoid too much code drift
        o  I don't want to find there are major pieces missing and to
           not have enough time to implement them in 7.4
        o  It is a big feature so I would like sufficient testing before beta

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation.  So, it seems PITR is moving forward.  Neil, can
you comment on where you are with this, and what still needs to be done?
Do we need to start looking at log archiving options?  How are the PITR
log contents different from the WAL log contents, except of course no
pre-write page images?

If we need to discuss things, perhaps we can do it now and get folks
working on other pieces, or at least thinking about them.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Big 7.4 items

From
Neil Conway
Date:
On Mon, 2002-12-16 at 13:38, Bruce Momjian wrote:
> OK, I just talked to Patrick on the phone, and he says Neil Conway is
> working on merging the code into 7.3, and adding missing pieces like
> logging table creation.  So, it seems PITR is moving forward.  Neil, can
> you comment on where you are with this, and what still needs to be done?

Well, I should have a preliminary merge of the old PITR patch with CVS
HEAD done by Wednesday or Thursday. It took me a while to merge because
(a) I've got final exams at university at the moment, and (b) I had to
merge most of it by hand, as much of the diff is, for some reason, a
single hunk (!).

As for the status of the code, I haven't really had a chance to evaluate
it; as Patrick noted, I think we should be able to give you an ETA by
the middle of January or so (I'll be offline starting Thursday until the
first week of January).

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC





Re: Big 7.4 items

From
Patrick Macdonald
Date:
Bruce Momjian wrote:
> 
> Patrick Macdonald wrote:
> > Bruce Momjian wrote:
> > >
> > > I wanted to outline some of the big items we are looking at for 7.4:
> > >
> > > [snip]
> > >
> > > Point-In-Time Recovery (PITR)
> > >
> > >         J. R. Nield did a PITR patch late in 7.3 development, and Patrick
> > >         MacDonald from Red Hat is working on merging it into CVS and
> > >         adding any missing pieces.  Patrick, do you have an ETA on that?
> >
> > Neil Conway and I will be working on this starting the beginning
> > of January.  By the middle of January, we hope to have a handle on
> > an ETA.
> 
> Eww, that is later than I was hoping.  I have put J.R.'s PITR patch up
> at:
> 
>         ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz
> 
> (I have tried to contact J.R. several times over the past few months,
> with no reply.)
> 
> J.R. felt it was ready to go.  I would like to have an evaluation of the
> patch to know what it does and what is missing.  I would like that
> sooner rather than later because:
> 
>         o  I want to avoid too much code drift
>         o  I don't want to find there are major pieces missing and to
>            not have enough time to implement them in 7.4
>         o  It is a big feature so I would like sufficient testing before beta
> 
> OK, I just talked to Patrick on the phone, and he says Neil Conway is
> working on merging the code into 7.3, and adding missing pieces like
> logging table creation.  So, it seems PITR is moving forward.

Well, sort of.  I stated that Neil was already working on merging the
patch into the CVS tip.  I also mentioned that there are missing 
pieces but have no idea if Neil is currently working on them.

Cheers,
Patrick
--
Patrick Macdonald
Red Hat Database Development


Re: Big 7.4 items

From
Bruce Momjian
Date:
Patrick Macdonald wrote:
> > OK, I just talked to Patrick on the phone, and he says Neil Conway is
> > working on merging the code into 7.3, and adding missing pieces like
> > logging table creation.  So, it seems PITR is moving forward.
> 
> Well, sort of.  I stated that Neil was already working on merging the
> patch into the CVS tip.  I also mentioned that there are missing 
> pieces but have no idea if Neil is currently working on them.

Oh, OK.  What I would like to do is find out what actually needs to be
done so we can get folks started on it.  If we can get a 7.3 merge,
maybe we should get it into CVS and then list the items needing
attention and folks can submit patches to implement those.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Big 7.4 items

From
Janardhan
Date:
The file "ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz"
does not have read permission, so it cannot be copied.  Please provide
read permission.

Regards,
jana

> Patrick Macdonald wrote:
> > Bruce Momjian wrote:
> > > I wanted to outline some of the big items we are looking at for 7.4:
> > >
> > > [snip]
> > >
> > > Point-In-Time Recovery (PITR)
> > >
> > >         J. R. Nield did a PITR patch late in 7.3 development, and Patrick
> > >         MacDonald from Red Hat is working on merging it into CVS and
> > >         adding any missing pieces.  Patrick, do you have an ETA on that?
> >
> > Neil Conway and I will be working on this starting the beginning
> > of January.  By the middle of January, we hope to have a handle on
> > an ETA.
>
> Eww, that is later than I was hoping.  I have put J.R.'s PITR patch up
> at:
>
>     ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz
>
> (I have tried to contact J.R. several times over the past few months,
> with no reply.)
>
> J.R. felt it was ready to go.  I would like to have an evaluation of the
> patch to know what it does and what is missing.  I would like that
> sooner rather than later because:
>
>         o  I want to avoid too much code drift
>         o  I don't want to find there are major pieces missing and to
>            not have enough time to implement them in 7.4
>         o  It is a big feature so I would like sufficient testing before beta
>
> OK, I just talked to Patrick on the phone, and he says Neil Conway is
> working on merging the code into 7.3, and adding missing pieces like
> logging table creation.  So, it seems PITR is moving forward.  Neil, can
> you comment on where you are with this, and what still needs to be done?
> Do we need to start looking at log archiving options?  How are the PITR
> log contents different from the WAL log contents, except of course no
> pre-write page images?
>
> If we need to discuss things, perhaps we can do it now and get folks
> working on other pieces, or at least thinking about them.

Re: Big 7.4 items

From
Bruce Momjian
Date:
Oops, sorry.  Permissions fixed.

---------------------------------------------------------------------------

Janardhan wrote:
> The file "ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz"
> does not have read permission, so it cannot be copied.  Please provide
> read permission.
> 
> Regards,
> jana
> 
> > Patrick Macdonald wrote:
> > > Bruce Momjian wrote:
> > > > I wanted to outline some of the big items we are looking at for 7.4:
> > > >
> > > > [snip]
> > > >
> > > > Point-In-Time Recovery (PITR)
> > > >
> > > >         J. R. Nield did a PITR patch late in 7.3 development, and Patrick
> > > >         MacDonald from Red Hat is working on merging it into CVS and
> > > >         adding any missing pieces.  Patrick, do you have an ETA on that?
> > >
> > > Neil Conway and I will be working on this starting the beginning
> > > of January.  By the middle of January, we hope to have a handle on
> > > an ETA.
> >
> > Eww, that is later than I was hoping.  I have put J.R.'s PITR patch up
> > at:
> >
> >     ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz
> >
> > (I have tried to contact J.R. several times over the past few months,
> > with no reply.)
> >
> > J.R. felt it was ready to go.  I would like to have an evaluation of the
> > patch to know what it does and what is missing.  I would like that
> > sooner rather than later because:
> >
> >     o  I want to avoid too much code drift
> >     o  I don't want to find there are major pieces missing and to
> >        not have enough time to implement them in 7.4
> >     o  It is a big feature so I would like sufficient testing before beta
> >
> > OK, I just talked to Patrick on the phone, and he says Neil Conway is
> > working on merging the code into 7.3, and adding missing pieces like
> > logging table creation.  So, it seems PITR is moving forward.  Neil, can
> > you comment on where you are with this, and what still needs to be done?
> > Do we need to start looking at log archiving options?  How are the PITR
> > log contents different from the WAL log contents, except of course no
> > pre-write page images?
> >
> > If we need to discuss things, perhaps we can do it now and get folks
> > working on other pieces, or at least thinking about them.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Big 7.4 items

From
Thomas O'Connell
Date:
So if this gets added to the 7.3 branch, will there be documentation 
accompanying it?

-tfo

In article <200212161838.gBGIcb717436@candle.pha.pa.us>, pgman@candle.pha.pa.us (Bruce Momjian) wrote:

> OK, I just talked to Patrick on the phone, and he says Neil Conway is
> working on merging the code into 7.3, and adding missing pieces like
> logging table creation.  So, it seems PITR is moving forward.


Re: Big 7.4 items

From
Bruce Momjian
Date:
I meant he is merging it into HEAD, not the 7.3 CVS.  Sorry for the
confusion.

---------------------------------------------------------------------------

Thomas O'Connell wrote:
> So if this gets added to the 7.3 branch, will there be documentation 
> accompanying it?
> 
> -tfo
> 
> In article <200212161838.gBGIcb717436@candle.pha.pa.us>,
>  pgman@candle.pha.pa.us (Bruce Momjian) wrote:
> 
> > OK, I just talked to Patrick on the phone, and he says Neil Conway is
> > working on merging the code into 7.3, and adding missing pieces like
> > logging table creation.  So, it seems PITR is moving forward.
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073