Re: Feature Request for 7.5 - Mailing list pgsql-general

From: Keith C. Perry
Subject: Re: Feature Request for 7.5
Msg-id: 1070483244.3fce472cb2454@webmail.vcsn.com
In response to: Re: Feature Request for 7.5 (Jan Wieck <JanWieck@Yahoo.com>)
List: pgsql-general

Jan,

To continue the brain-dump: I was curious how the GC protocol is going to be
implemented (if you have any ideas thus far).

Several years ago, I started working on a network security and intrusion
detection system for a client where the audit/logging system needed to be
redundant - they wanted 3 servers, each on a different LAN, in fact.

The work in that design was centered on making sure that the aggregated data
set was exactly the same on each of the 3 servers.  Not only were all the
event timestamps the same, but the events were ordered the same way in the logs.

The solution I was working on was a multicast IPv4 (possibly IPv6) network where
the "packet" of information had an id of some sort and the event data inside the
datagram had a timestamp (of course).
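
A minimal sketch of the kind of sender I was playing with, assuming a made-up
group address, port, and field layout (Python here just for illustration):

    import json
    import socket
    import time
    import uuid

    MCAST_GROUP = "239.192.0.1"   # hypothetical multicast group
    MCAST_PORT = 5007             # hypothetical port

    def send_event(payload):
        """Multicast one audit event carrying a unique id and a timestamp."""
        event = {
            "id": str(uuid.uuid4()),   # lets receivers spot duplicates and gaps
            "ts": time.time(),         # the event timestamp used for ordering
            "data": payload,
        }
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
        sock.sendto(json.dumps(event).encode(), (MCAST_GROUP, MCAST_PORT))
        sock.close()

    send_event({"sensor": "lan1", "msg": "example audit event"})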

The obvious problem is that multicasting is not reliable, so in order to make
sure all events were on all servers, there would be periodic polling that
would give a server with, say, 2 missing events the chance to "catch up".
This "catch-up" function made sure all events were ordered and that everyone
had the same last event.  This would be much more of an issue with a server a
couple of hops away than with a server on the same LAN.  The client never went
ahead with the system, so I apologize for not having some reference examples.
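
The gap detection itself can be as simple as numbering events per sender; a
toy version of the check the periodic poll would run (the names are mine, not
from the actual design):

    def find_gaps(seen):
        """Return the sequence numbers missing from one sender's stream."""
        have = set(seen)
        if not have:
            return []
        return [n for n in range(min(have), max(have)) if n not in have]

    # Server B heard events 1, 2 and 5 from sender A; during the poll it
    # would ask a peer to re-send 3 and 4.
    print(find_gaps([1, 2, 5]))   # -> [3, 4]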

This is totally different from what true replication is amongst a group of
database servers, but it seems to me that if the servers are in a multicast
group, transactions would at least theoretically be sent to all servers at the
same time.  I would think that a homogeneous system of servers is already
ordering events the same way, so transactions would occur properly unless one
was missed.  A "catch-up" function here would be difficult to implement
because if the servers are committing asynchronously then you can't catch up,
and one of your datasets has lost integrity.  Synchronously (meaning, "we'll
all commit now because we all agree on the current list of transactions")
seems a bit messy and not as scalable.
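
To make the distinction concrete, the synchronous case boils down to a barrier
like this (a toy sketch, nothing more):

    def all_agree(reports):
        """True only when every server reports the identical ordered list
        of transaction ids for the current batch."""
        return all(r == reports[0] for r in reports)

    reports = [["t1", "t2", "t3"],   # server A
               ["t1", "t2", "t3"],   # server B
               ["t1", "t2", "t3"]]   # server C
    print("commit" if all_agree(reports) else "abort and resync")

Every member blocks until that check passes, which is exactly the messiness
and scaling cost I mean.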

I didn't mean to get into all that, but how the GC is going to work in this
project is something that I'm curious about.
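
For concreteness, here is roughly how I picture the decision each member makes
at the end of the commit round you describe below - purely a guess on my part;
"delivered" stands in for whatever the GC library actually hands each member,
and the message names are mine:

    def decide(delivered):
        """Decide from the totally ordered message stream that every member
        sees identically; delivered holds one reply per group member."""
        for sender, msg in delivered:
            if msg in ("ABORT", "LEAVE"):   # lock conflict or member failure
                return "ROLLBACK"
        return "COMMIT"                     # all replied COMMIT; stamp clog

    print(decide([("n1", "COMMIT"), ("n2", "COMMIT"), ("n3", "COMMIT")]))
    print(decide([("n1", "COMMIT"), ("n2", "ABORT")]))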


--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com

Quoting Jan Wieck <JanWieck@Yahoo.com>:

> The following is more or less a brain-dump ... not finally thought out
> and not supposed to be considered a proposal at this time.
>
> The synchronous multi-master solution I have in mind needs a few
> currently non-existent support features in the backend. One is
> non-blocking locks and another is a callback mechanism just before
> marking a transaction in clog as committed.
>
> It will use reliable group communication (GC) that can guarantee total
> order. There is an AFTER trigger on all replicated tables. A daemon
> started for every database will create a number of threads/subprocesses.
> Each of these workers has its own separate DB connection and is a member
> of a different group in the GC. The number of these groups determines the
> maximum number of concurrent UPDATE transactions the cluster can handle.
>
> At the first call of the trigger inside of a transaction (this is the
> first modifying statement), the trigger allocates one of the replication
> groups (possibly waiting for one to become free). It now communicates
> with one daemon thread on every database in the cluster. The triggers
> now send the replication data into this group. It is not necessary to
> wait for the other cluster members as long as the GC guarantees FIFO by
> sender.
>
> At the time the transaction commits, it sends a commit message into the
> group. This message has another service type level, which is total order.
> It will now wait for all members in the replication group to reply with
> the same. When every member in the group has replied, all have agreed to
> commit and are just about to stamp clog.
>
> Since the service type is total order, the GC guarantees that either all
> members get the messages in the same order, or if one cannot get a
> message, a corresponding LEAVE message will be generated. Also, all the
> replication threads will use non-blocking locking. If any of them ever
> finds a locked row, it will send an ABORT message into the group,
> causing the whole group to roll back.
>
> This way, either all members of the group reach the "just before
> stamping clog" state together and know that everyone got there, or they
> will get an abort or leave message from any of their co-workers and roll
> back.
>
> There is a gap between reporting "ready" and really stamping clog in
> which a database might crash. This will cause all other cluster members
> to go ahead and commit while the crashed DB does not commit. But this is
> limited to crashes only, and a restarting database must rejoin/resync
> with the cluster anyway and doubt its own data. So this is not really a
> problem.
>
>
> With this synchronous model, read-only transactions can be handled on
> every node entirely independently of replication - this is the scaling
> part. The total number of UPDATE transactions is limited by the slowest
> cluster member and does not scale, but that is true for all synchronous
> solutions.
>
>
> Jan
>
>
> Chris Travers wrote:
>
> > Interesting feedback.
> >
> > It strikes me that, for many sorts of databases, multimaster synchronous
> > replication is not the best solution, for the reasons that Scott, Jan,
> > et al. have raised.  I am wondering how commercial RDBMSs get around this
> > problem?  There are several possibilities that I can think of -- have a
> > write master and many read-only slaves (available at the moment, IIRC).
> > Replication could then occur at the tuple level using linked databases,
> > triggers, etc.  Rewrite rules could then allow one to use the slaves to
> > "funnel" the queries back up to the master.  It seems to me that latency
> > would be a killer on this sort of solution, though everything would
> > effectively occur on all databases in the same order, but recovering from
> > a crash of the master could be complicated and result in additional
> > downtime...
> >
> > The other solution (still not "guaranteed" to work in all cases) is that
> > every proxy could be hardwired to attempt to contact databases in a set
> > order.  This would also avoid deadlocks.  Note that if sufficient business
> > logic is built into the database, one would be guaranteed that a single
> > "consistent" view would be maintained at any given time (conflicts would
> > result in the minority of up to 50 percent of the servers needing to go
> > through the recovery process -- not killing uptime, but certainly killing
> > performance).
> >
> > However, it seems to me that the only solution for many of these databases
> > is to have a "cluster in a box" solution where you have a system composed
> > entirely of redundant, hot-swappable hardware so that nearly anything can
> > be swapped out if it breaks.  In this case, we should be able to just run
> > PostgreSQL as is....
>
>
> --
> #======================================================================#
> # It's easier to get forgiveness for being wrong than for being right. #
> # Let's break this rule - forgive me.                                  #
> #================================================== JanWieck@Yahoo.com #



