Thread: Replication Ideas

Replication Ideas

From: Chris Travers
Date:
Hi--

I have been thinking about the issues of multi-master replication and how
to do highly available, load-balanced clustering with PostgreSQL.  Here is
my outline, and I am looking for comments on the limitations of how this
would work.

Several PostgreSQL servers would share a virtual IP address, and would
coordinate among themselves which one will act as "Master" for the purposes
of a single transaction (though doing this per connection could be easier).
SELECT statements are handled exclusively by the transaction master, while
anything that writes to the database would be sent to all the "Masters."
At the end of each transaction the systems would poll each other as to
whether they were all successful (a rough sketch of these rules in code
follows the list):

1:  Any system which is successful in COMMITting the transaction must
ignore any system which fails the transaction until a recovery can be made.

2:  Any system which fails in COMMITting the transaction must cease to
be a master, provided that it receives a signal from any other member of
the cluster indicating that that member succeeded in committing the
transaction.

3: If all nodes fail to commit, then they all remain masters.
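
To make the decision rule concrete, here is a minimal sketch in Python;
the function and node names are invented for illustration, and the real
logic would of course live inside the server:

def resolve_commit_poll(results):
    """Apply rules 1-3 to a mapping of node name -> commit success."""
    succeeded = {n for n, ok in results.items() if ok}
    failed = set(results) - succeeded
    if not succeeded:
        # Rule 3: every node failed, so all remain masters and the
        # transaction is simply rolled back everywhere.
        return set(), set()
    # Rules 1 and 2: the committers ignore the failed nodes, and each
    # failed node must leave the master set until it has recovered.
    return succeeded, failed

# Example: node c fails its COMMIT and drops out until recovery.
print(resolve_commit_poll({"a": True, "b": True, "c": False}))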

Recovery would be done in several steps (sketched in code after the list):

1:  The database would be copied to the failed system using pg_dump.
2:  The node would then be brought current by a recovery from the
transaction log.
3:  This would be repeated in order to ensure that the database is up to
date.
4:  When two successive restores have been achieved with no new
additions to the database, the "All Recovered" signal is sent to the
cluster and the node is ready to start processing again.  (I need a
better way of doing this.)
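
A rough sketch of that catch-up loop, with restore_base_copy() and
replay_transaction_log() as hypothetical stand-ins for the pg_dump copy
and log replay (stubbed here so the control flow actually runs):

import random

def restore_base_copy(node):
    # Stand-in for step 1: copy the database over with pg_dump.
    print(f"{node}: restoring base copy")

def replay_transaction_log(node):
    # Stand-in for steps 2 and 3: replay whatever the masters logged
    # since the last pass; returns how many new transactions applied.
    return random.choice([0, 0, 2])

def recover(node):
    restore_base_copy(node)
    quiet = 0
    while quiet < 2:                  # step 4: two quiet replays in a row
        quiet = quiet + 1 if replay_transaction_log(node) == 0 else 0
    print(f"{node}: sending All-Recovered to the cluster")

recover("node-c")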

Note:  Recovery is the problem, I know.  My model is only a starting
point for the purposes of discussion, an attempt to bring something to
the conversation.

Any thoughts or suggestions?

Best Wishes,
Chris Travers


Re: Replication Ideas

From: Ron Johnson
Date:
On Sat, 2003-08-23 at 23:27, Chris Travers wrote:
> Hi--
>
> I have been thinking about the issues of multi-master replication and
> how to do highly available, load-balanced clustering with PostgreSQL.
> Here is my outline, and I am looking for comments on the limitations
> of how this would work.
> [...]
> Note:  Recovery is the problem, I know.  My model is only a starting
> point for the purposes of discussion, an attempt to bring something to
> the conversation.

This is vaguely similar to Two Phase Commit, which is a sine qua
non of distributed transactions, which is the s.q.n. of multi-master
replication.
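
For readers following along, here is a toy sketch of one 2PC round in
Python; the Participant class is invented for illustration, and real 2PC
additionally needs durable logging and timeouts at every step, which is
exactly where its blocking behavior comes from:

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit

    def prepare(self):
        # Phase 1 vote: after voting yes, a real participant must hold
        # its locks until it hears the coordinator's decision.
        return self.can_commit

    def finish(self, commit):
        print(f"{self.name}: {'COMMIT' if commit else 'ABORT'}")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # ask everyone
    decision = all(votes)                         # unanimity required
    for p in participants:                        # phase 2
        p.finish(decision)
    return decision

two_phase_commit([Participant("node-a"), Participant("node-b", can_commit=False)])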

--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"Eternal vigilance is the price of liberty: power is ever
stealing from the many to the few. The manna of popular liberty
must be gathered each day, or it is rotten... The hand entrusted
with power becomes, either from human depravity or esprit de
corps, the necessary enemy of the people. Only by continual
oversight can the democrat in office be prevented from hardening
into a despot: only by unintermitted agitation can a people be
kept sufficiently awake to principle not to let liberty be
smothered in material prosperity... Never look, for an age when
the people can be quiet and safe. At such times despotism, like
a shrouding mist, steals over the mirror of Freedom"
Wendell Phillips


Re: Replication Ideas

From: Chris Travers
Date:
Ron Johnson wrote:

>This is vaguely similar to Two Phase Commit, which is a sine qua
>non of distributed transactions, which is the s.q.n. of multi-master
>replication.
I may be wrong, but if I recall correctly, one of the problems with a
standard 2-phase commit is that if one server goes down, the other
masters cannot commit their transactions.  This would make a clustered
database server have a downtime equivalent to the total downtime of all
of its nodes.  This is a real problem.  Of course my understanding of
Two Phase Commit may be incorrect, in which case, I would appreciate it
if someone could point out where I am wrong.

It had occurred to me that the issue was one of failure handling more
than one of concept.  I.e. the problem is how one node's failure is
handled rather than the fundamental structure of Two Phase Commit.  If a
single node fails, we don't want that to take down the whole cluster,
and I have actually revised my logic a bit more (to make it even
safer).  In this I assume that:

1:  General failures on any one node are rare
2:  A failure is more likely to prevent a transaction from being
committed than allow one to be committed.

This hot-failover solution requires transparency from the client's
perspective -- i.e. the client should not have to choose a different
server should one go down, and should not need to know when a server comes
back up.  This also means that we need to assume that a load-balancing
solution can be a part of the clustering solution.  I would assume that
this would require a shared IP address for the public interface of the
server and a private communications channel where each node has a
separate IP address (similar to Microsoft's implementation of Network
Load Balancing).  Also, different transactions within a single
connection should be able to be handled by different nodes, so if one
node goes down, users don't have to reconnect.

So here is my suggested logic for high-availability/load-balanced
clustering:

1:  All nodes recognize each user connection and delegate transactions
rather than connections.

2:  At the beginning of a transaction, nodes decide who will take it.
Any operation which does not change the information or schema of the
database is handled exclusively on that node.  Other operations are
distributed across nodes.

3:  When the transaction is committed, the nodes "vote" on whether the
commit is valid (sketched below).  Majority rules, and the minority must
remove themselves from the cluster until they can synchronize their
databases with the remaining masters.  If the vote is split 50/50 (i.e.
one node fails in a two-node cluster), success is considered more likely
to be valid than failure, and the node(s) which failed to commit the
transaction must remove themselves from the cluster until they can
recover.
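
Here is the voting rule from step 3 as a minimal Python sketch; the names
are invented, and a real implementation would have to exchange these
votes over the private channel described above:

def commit_vote(votes):
    """votes: node name -> True if that node committed successfully."""
    yes = {n for n, ok in votes.items() if ok}
    no = set(votes) - yes
    if len(yes) >= len(no):
        # Majority committed, or a 50/50 split broken toward success:
        # the nodes that failed must leave and resynchronize.
        return "commit", no
    # Majority failed to commit: the committing minority is suspect
    # and must leave the cluster until it can resynchronize.
    return "abort", yes

print(commit_vote({"a": True, "b": False}))              # ('commit', {'b'})
print(commit_vote({"a": False, "b": False, "c": True}))  # ('abort', {'c'})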

Best Wishes,
Chris Travers




Re: Replication Ideas

From: Ron Johnson
Date:
On Mon, 2003-08-25 at 12:06, Chris Travers wrote:
> Ron Johnson wrote:
>
> >This is vaguely similar to Two Phase Commit, which is a sine qua
> >non of distributed transactions, which is the s.q.n. of multi-master
> >replication.
>
> I may be wrong, but if I recall correctly, one of the problems with a
> standard 2-phase commit is that if one server goes down, the other
> masters cannot commit their transactions.  This would make a clustered
> database server have a downtime equivalent to the total downtime of all
> of its nodes.  This is a real problem.  Of course my understanding of
> Two Phase Commit may be incorrect, in which case, I would appreciate it
> if someone could point out where I am wrong.

Note that I didn't mean to imply that 2PC is sufficient to implement
M-M.  The DBMS designer(s) must decide what to do (like queue up
changes) if 2PC fails.

--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"Our computers and their computers are the same color. The
conversion should be no problem!"
Unknown


Re: Replication Ideas

From: Alvaro Herrera
Date:
On Mon, Aug 25, 2003 at 10:06:22AM -0700, Chris Travers wrote:
> Ron Johnson wrote:
>
> >This is vaguely similar to Two Phase Commit, which is a sine qua
> >non of distributed transactions, which is the s.q.n. of multi-master
> >replication.
>
> I may be wrong, but if I recall correctly, one of the problems with a
> standard 2-phase commit is that if one server goes down, the other
> masters cannot commit their transactions.

Before the discussion goes any further, have you read the work related
to Postgres-R?  It's a substantially different animal from 2PC AFAIK.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended."  (Gerry Pourwelle)

Re: Replication Ideas

From: Chris Travers
Date:
Alvaro Herrera wrote:

>Before the discussion goes any further, have you read the work related
>to Postgres-R?  It's a substantially different animal from 2PC AFAIK.
Yes I have. Postgres-R is not a high-availability solution which is
capable of transparent failover, although it is a very useful project on
its own.

Best Wishes,
Chris Travers.


Re: Replication Ideas

From: Chris Travers
Date:
Tom Lane wrote:

>Chris Travers <chris@travelamericas.com> writes:
>>Yes I have. Postgres-R is not a high-availability solution which is
>>capable of transparent failover,
>
>What makes you say that?  My understanding is it's supposed to survive
>loss of individual servers.
>
>            regards, tom lane
My mistake.  I must have gotten them confused with another
(asynchronous) replication project.

Best Wishes,
Chris Travers


Re: Replication Ideas

From: Tom Lane
Date:
Chris Travers <chris@travelamericas.com> writes:
> Yes I have. Postgres-R is not a high-availability solution which is
> capable of transparent failover,

What makes you say that?  My understanding is it's supposed to survive
loss of individual servers.

            regards, tom lane

Re: Replication Ideas

From: Jan Wieck
Date:
WARNING: This is getting long ...

Postgres-R is a very interesting and inspiring idea. And I've been
kicking that concept around for a while now. What I don't like about it
is that it requires fundamental changes in the lock mechanism and that
it is based on the assumption of very low lock conflict.

<explain-PG-R>
In Postgres-R a committing transaction sends its writeset (WS - a list
of all updates done in this transaction) to the group communication
system (GC). The GC guarantees total order, meaning that all nodes will
receive all WSs in the same order, no matter how they have been sent.

If a node receives back its own WS before any error occurred, it goes
ahead and finalizes the commit. If it receives a foreign WS, it has to
apply the whole WS and commit it before it can process anything else. If
a local transaction, still in progress or waiting for its WS to come
back, holds a lock that is required to process such a remote WS, the
local transaction needs to be aborted to release its resources ... it
lost the total order race.
</explain-PG-R>
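
As a minimal model of the rule just described (all names illustrative,
and ignoring how WSs are actually applied):

# Local transactions still in flight, with the locks they hold, plus
# the set of transactions that already lost the total order race.
local = {"T1": {"row42"}}
aborted = set()

def deliver(stream):
    # 'stream' is the totally ordered sequence of (origin, locks) WSs
    # as the GC hands them to this node.
    for origin, locks in stream:
        if origin in local:
            print(f"{origin}: own WS came back first -> finalize commit")
            del local[origin]
        elif origin in aborted:
            continue          # its owner was already aborted locally
        else:
            # Foreign WS: any local xact holding a conflicting lock
            # must be aborted to release its resources.
            for txid, held in list(local.items()):
                if held & locks:
                    print(f"{txid}: lock conflict with remote WS -> abort")
                    del local[txid]
                    aborted.add(txid)
            print(f"remote WS from {origin}: apply and commit")

# T9's WS precedes T1's in the total order, so local T1 loses the race.
deliver([("T9", {"row42"}), ("T1", {"row42"})])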

Postgres-R requires that all remote WSs are applied and committed before
a local transaction can commit. Otherwise it couldn't correctly detect a
lock conflict. So there will not be any read-ahead. And since the total
order really counts here, it cannot apply any two remote WSs in
parallel: a race condition could exist where a later WS in the total
order runs faster and locks up an earlier one. So we have to squeeze all
remote WSs through one single replication worker process, and all the
locally parallel transactions that wait for their WSs to come back have
to wait until that poor little worker is done with the whole pile. Bye-bye
concurrency. And I don't know how the GC will deal with the backlog
either. It could well choke on it.

I do not see how this will scale well in a multi-SMP-system cluster. At
least the serialization of WSs will become a horror if there is
significant lock contention, as in a standard TPC-C on the district row
containing the order number counter. I don't know for sure, but I
suspect that with this kind of bottleneck, Postgres-R will have to
roll back more than 50% of its transactions when there are more than 4
nodes under heavy load (as in a benchmark run). That will suck ...


But ... initially I said that it is an inspiring concept ... soooo ...

I am currently hacking around with some C+PL/TclU+Spread constructs that
might form a rude kind of prototype creature.

My changes to the Postgres-R concept are that there will be as many
replicating slave processes as there are masters in total out in the
cluster ... yes, it will try to utilize all the CPUs in the cluster!
For failover reliability, a committing transaction will hold before
finalizing the commit and send its "I'm ready" to the GC. Every
replicator that reaches the same state sends "I'm ready" too. Spread
guarantees in SAFE_MESS mode that messages are delivered to all nodes in
a group, or that at least LEAVE/DISCONNECT messages are delivered before.
So if a node receives "I'm ready" from more than 50% of the nodes, there
is only a very small window in which multiple nodes would have to fail in
the same split second for the majority of nodes NOT to commit. A node that
reported "I'm ready" but lost more than 50% of the cluster before
committing has to roll back and rejoin, or wait for operator intervention.
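
That decision rule, reduced to a toy sketch (Spread itself is not modeled
here; the message count simply stands in for what SAFE_MESS delivery
would tell each node):

def commit_decision(cluster_size, ready_seen):
    """ready_seen: 'I'm ready' messages this node has received,
    its own included."""
    if ready_seen > cluster_size // 2:
        return "finalize commit"
    # Reported ready but lost the majority before committing:
    return "rollback and rejoin, or wait for operator intervention"

print(commit_decision(5, 3))   # majority reached -> finalize commit
print(commit_decision(5, 2))   # majority lost    -> rollback and rejoin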

Now the idea is to split up the communication into GC distribution
groups per transaction. So working master backends and associated
replication backends will join/leave a unique group for every
transaction in the cluster. This way, the per-process communication is
reduced to the required minimum.


As said, I am hacking on some code ...


Jan

Chris Travers wrote:
> My mistake.  I must have gotten them confused with another
> (asynchronous) replication project.
> [...]


--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Replication Ideas

From: Dennis Gearon
Date:
Jan Wieck wrote:

> WARNING: This is getting long ...
>
> Postgres-R is a very interesting and inspiring idea.  And I've been
> kicking that concept around for a while now.
> [...]
> As said, I am hacking on some code ...
>
> Jan

As my British friends would say, "Bully for you," and I applaud you
playing, struggling, and learning from this for our sakes.  Jeez, all I
think about is me, huh?


Re: Replication Ideas

From: "Marc G. Fournier"
Date:

On Mon, 25 Aug 2003, Tom Lane wrote:

> Chris Travers <chris@travelamericas.com> writes:
> > Yes I have. Postgres-R is not a high-availability solution which is
> > capable of transparent failover,
>
> What makes you say that?  My understanding is it's supposed to survive
> loss of individual servers.

How does it play 'catch up' when a server comes back online?

Note that I did go through the 'docs' on how it works, and am/was quite
impressed at what they were doing ... but say I have a large network,
and one group is connecting to ServerA, and another group to ServerB:
what happens when ServerA and ServerB lose network connectivity for any
period of time?  How do they re-sync when the network comes back up again?

Re: Replication Ideas

From: Tom Lane
Date:
"Marc G. Fournier" <scrappy@hub.org> writes:
> On Mon, 25 Aug 2003, Tom Lane wrote:
>> What makes you say that?  My understanding is it's supposed to survive
>> loss of individual servers.

> How does it play 'catch up' when a server comes back online?

The recovered server has to run through the part of the GCS data stream
that it missed the first time.  This is not conceptually different from
recovering using archived WAL logs (or archived trigger-driven
replication data streams).  As with using WAL for recovery, you have to
be able to archive the message stream until you don't need it any more.
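
A toy illustration of that replay, with the archived stream as a plain
list of sequence-numbered WSs (a purely hypothetical structure):

# The archive holds the totally ordered GCS stream as (seq, WS) pairs.
archive = [(1, "WS-a"), (2, "WS-b"), (3, "WS-c"), (4, "WS-d")]

def catch_up(last_applied):
    for seq, ws in archive:
        if seq > last_applied:      # replay only what the node missed
            print(f"replaying {ws} (seq {seq})")
            last_applied = seq
    return last_applied             # node is now current with the stream

catch_up(2)   # node went down after WS-b; replays WS-c and WS-d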

            regards, tom lane

Re: Replication Ideas

From: Ron Johnson
Date:
On Tue, 2003-08-26 at 22:37, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > If you can detect if outside transactions conflict with your
> > transaction, you should be able to determine if the outside transactions
> > conflict with each other.
>
> Uh ... not necessarily.  That amounts to assuming that every xact has
> complete knowledge of the actions of every other, which is an assumption
> I'd rather not make.  Detecting that what you've done conflicts with
> someone else is one thing, detecting that party B has conflicted with
> party C is another league entirely.

Maybe some sort of Lock Manager?  A process running on each node
keeps a tree structure of all locks, requested locks, what is
(requested to be) locked, and the type of lock.  If you are running
multi-master replication, the LMs keep in sync with one another,
thus creating a Distributed Lock Manager.  (This would also be the
key to implementing database clusters.  Of course, the interface
to the DLM would have to be pretty deep within Postgres itself...)

Using a DLM, the postmaster on node_a would know that the postmaster
on node_b has just locked a certain set of tuples and index keys,
and
(1) will queue up its request to lock that data in its own node's
    LM,
(2) which will propagate it to the other nodes,
(3) then when the node_b postmaster executes its COMMIT WORK, the
    node_a postmaster can obtain its desired locks.
(4) If the postmaster on node_[ac-z] needs to lock that same
    data, it will then similarly queue up and wait until the node_b
    postmaster executes its COMMIT WORK.

Notes:
a) this is, of course, not *sufficient* for multi-master;
b) yes, you need a fast, low-latency network for the DLM chatter.
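
A toy sketch of the queuing behavior in steps (1)-(4), for a single
resource; the class and method names are invented, and the cluster-wide
propagation in step (2) is only noted in a comment:

from collections import deque

class ResourceLock:
    """One resource's entry in the toy lock manager."""
    def __init__(self):
        self.holder = None
        self.waiters = deque()   # FIFO stands in for the total order

    def request(self, node):     # steps (1) and (2)
        if self.holder is None:
            self.holder = node
            print(f"{node}: lock granted")
        else:
            # In a real DLM this queue state would be propagated to
            # the lock manager on every node.
            self.waiters.append(node)
            print(f"{node}: queued behind {self.holder}")

    def commit_work(self, node): # steps (3) and (4)
        assert self.holder == node
        self.holder = self.waiters.popleft() if self.waiters else None
        if self.holder is not None:
            print(f"{self.holder}: lock granted after {node}'s COMMIT WORK")

rows = ResourceLock()
rows.request("node_b")       # node_b locks the tuples first
rows.request("node_a")       # node_a queues up behind it
rows.commit_work("node_b")   # node_b commits; node_a obtains its locks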

This is a tried and true method of synchronization.  DEC Rdb/VMS
has been using it for 19 years as the underpinnings of its cluster
technology, and Oracle licensed it from them (well, really Compaq)
for its 9i RAC.


--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"The UN couldn't break up a cookie fight in a Brownie meeting."
Larry Miller


Re: Replication Ideas

From: "Shridhar Daithankar"
Date:
On 26 Aug 2003 at 3:01, Marc G. Fournier wrote:

> On Mon, 25 Aug 2003, Tom Lane wrote:
>
> > Chris Travers <chris@travelamericas.com> writes:
> > > Yes I have. Postgres-R is not a high-availability solution which is
> > > capable of transparent failover,
> >
> > What makes you say that?  My understanding is it's supposed to survive
> > loss of individual servers.
>
> How does it play 'catch up' when a server comes back online?

<dumb idea>
PITR + an archive logs daemon?  The chances of a node and an archive log
daemon going down simultaneously are pretty low.  If the archive log
daemon runs on another machine, the MTBF should be pretty acceptable..
</dumb idea>


Bye
 Shridhar

--
The Briggs-Chase Law of Program Development:  To determine how long it
will take to write and debug a program, take your best estimate,
multiply that by two, add one, and convert to the next higher units.


Re: Replication Ideas

From: Jan Wieck
Date:

Ron Johnson wrote:

> Notes:
> a) this is, of course, not *sufficient* for multi-master
> b) yes, you need a fast, low latency network for the DLM chatter.

"Fast" is an understatement. The DLM you're talking about would (in our
case) need to use Spread's AGREED_MESS or SAFE_MESS service type,
meaning guarantee of total order. A transaction that needs any type of
lock sends that request into the DLM group and then waits. The incoming
stream of lock messages determines success or failure. With the overhead
of these service types I don't think one single communication group for
all database backends in the whole cluster guaranteeing total order will
be that efficient.

>
> This is a tried and true method of synchronization.  DEC Rdb/VMS
> has been using it for 19 years as the underpinnings of it's cluster
> technology, and Oracle licensed it from them (well, really Compaq)
> for it's 9i RAC.

Are you sure they're using it that way?


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Replication Ideas

From: Ron Johnson
Date:
On Thu, 2003-08-28 at 16:00, Jan Wieck wrote:
> Ron Johnson wrote:
>
> > Notes:
> > a) this is, of course, not *sufficient* for multi-master
> > b) yes, you need a fast, low latency network for the DLM chatter.
>
> "Fast" is an understatement. The DLM you're talking about would (in our
> case) need to use Spread's AGREED_MESS or SAFE_MESS service type,
> meaning guarantee of total order. A transaction that needs any type of
> lock sends that request into the DLM group and then waits. The incoming
> stream of lock messages determines success or failure. With the overhead
> of these service types I don't think one single communication group for
> all database backends in the whole cluster guaranteeing total order will
> be that efficient.

I guess it's the differing protocols involved.  DEC made clustering
(including Rdb/VMS) work over an 80Mbps protocol, back in The Day,
and HPaq says that it works fine now over fast ethernet.

> > This is a tried and true method of synchronization.  DEC Rdb/VMS
> > has been using it for 19 years as the underpinnings of it's cluster
> > technology, and Oracle licensed it from them (well, really Compaq)
> > for it's 9i RAC.
>
> Are you sure they're using it that way?

Not as sure as I am that the sun will rise in the east tomorrow,
but, yes, I am highly confident that Oracle modified the DLM for use
in 9i RAC.  Note that Oracle purchased Rdb/VMS from DEC back in 1994,
along with the engineers, so they have long knowledge of how it works
in VMS.  One of the reasons they bought Rdb was to merge the
technology into RDBMS.

--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"they love our milk and honey, but preach about another way of living"
Merle Haggard, "The Fighting Side Of Me"


Re: Replication Ideas

From: Dennis Gearon
Date:
Are these clusters physically together using dedicated LAN lines .... or
are they synchronizing over the Interwait?

Ron Johnson wrote:

>I guess it's the differing protocols involved.  DEC made clustering
>(including Rdb/VMS) work over an 80Mbps protocol, back in The Day,
>and HPaq says that it works fine now over fast ethernet.
>[...]


Re: Replication Ideas

From: Ron Johnson
Date:
On Thu, 2003-08-28 at 17:52, Dennis Gearon wrote:
> Are these clusters physically together using dedicated LAN lines .... or
> are they synchronizing over the Interwait?

There have been multiple methods over the years.  In order:

1. Cluster Interconnect (CI) : There's a big box, called the CI,
   that in the early days was really a stripped PDP-11 running
   an RTOS.  Each VAX (and, later, Alpha) is connected to the CI
   via special adapters and cables.  Disks are connected to "HSC"
   Storage Controllers which also plug into the CI.  Basically,
   it's a big, intelligent switch.  Disk sectors pass along the
   wires from VAX and Alpha to disks and back.  DLM messages pass
   along the wires from node to node.  With multiple CI adapters
   and HSCs (they were dual-ported) you could set up total
   dual-redundancy.  Up to 96 nodes can be clustered.  It still
   works, but Memory Channel is preferred now.

2. LAVC - Local Area VAX Cluster : In this scheme, disks were
   directly attached to nodes, and data (disk and DLM) is
   transferred back and forth across the 10Mbps Ethernet.  It
   could travel over TCP/IP or DECnet.  For obvious reasons, LAVC
   was a lot cheaper and slower than CI.

3. SCSI clusters : SCSI disks are wired to a dual-ported "HSZ"
   Storage Controller.  Then, SCSI cards on each of 2 nodes
   could be wired into a port.  The SCSI disks could also be
   wired to a 2nd HSZ, and a 2nd SCSI card in each node plugged
   into that HSZ, so dual-redundancy is achieved.  With modern
   versions of VMS, the SCSI drivers can choose which SCSI
   card to send data through, to increase performance.
   DLM messages are passed via TCP/IP.  Only 2 nodes can be
   clustered.  A related method uses Fibre Channel disks on
   "HSG" Storage Controllers.

4. Memory Channel : A higher speed interconnect.  Don't know
   much about it.  128 nodes can be clustered.

Note that since DLM awareness is built deep into VMS and all the
RTLs, every program is cluster-aware, no matter what type of
cluster method is used.



--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"Oh, great altar of passive entertainment, bestow upon me thy
discordant images at such speed as to render linear thought impossible"
Calvin, regarding TV