Re: replication choices - Mailing list pgsql-general

From Ben
Subject Re: replication choices
Date
Msg-id Pine.LNX.4.64.0701311511210.8626@localhost.localdomain
Whole thread Raw
In response to Re: replication choices  ("Lenorovitz, Joel" <Joel.Lenorovitz@usap.gov>)
Responses Re: replication choices
List pgsql-general
Well, it sounds like your situation is more difficult than mine, in that
it looks like you might have to deal with the possibility of conflicting
changes. Fortuantely, I don't have that issue, because even though I have
data flowing both ways between my "master" and my sites, it only flows one
way for any given table.

But unlike you (?) I have the additional problem that some data changes at
the central site need to be replicated to multiple sites.

After looking into it more, it appears that Slony log shipping just may
work for me.... if only it weren't for the problem that to ship logs from
the remote sites back to the central site, each remote site needs to have
a normal slony node first, which I don't have the hardware for.

Short of conflict resolution, I don't see any big problems with your
scheme.

On Wed, 31 Jan 2007, Lenorovitz, Joel wrote:

> I believe I have a similar situation involving multiple database
> instances acting quasi-independently on a common (at least conceptually)
> set of data.  Since each instance can effectively operate independently,
> I am uncertain if the term replication is accurate, but here is my
> strategy to keep the data properly synchronized.  It is still unproven
> so any advice or scrutiny that can be given is welcome.
>
> Situation:
> There are multiple sites at which the same database/front-end
> application is running.  None of the sites are directly connected to one
> another over a network and the only communication between sites is
> effectively unidirectional to a central location (i.e., information can
> independently go both ways during a communications link, but it's not
> real-time duplex).  Each of the sites allows authorized users to perform
> any type of change to the data.
>
> Solution:
> Each site has 3 different versions of what I call the base schema:
> Confirmed, Pending, and Update.  Each of these versions has some special
> columns associated with it to capture other information about changes
> that are made to it (e.g. timestamp, action(insert,update,delete), and
> status).  The central site (which I'm loathe to call 'master') has these
> same schemas, plus it has an additional Update schema for each other
> site in the system.
>
> During normal use at each non-central site, the Pending schema is the
> active schema  from which data is queried and also added, modified, and
> deleted.  Each time a record is changed in the Pending schema it's
> status is flagged as 'pending' and the new data is copied to the Update
> schema.  Also copied to the Update schema is the old data from the
> record that was changed.  This effectively makes the Update schema a log
> of what each record in the database was changed to, what it was changed
> from, and when that happened (in UTC).  The data from the update schema
> is then dumped regularly to a flat file.
>
> When any remote site establishes a communications link with the central
> site, the flat files of the Update schema from each site are exchanged
> and the official synchronization time is taken to be that of the flat
> file that was updated least recently (i.e., the older file).  Then, at
> each site the data from the flat file is uploaded to the local Updates
> schema.  All of the records in the now more populous Update schema are
> then processed sequentially by timestamp and applied to the Confirmed
> schema so, in theory, the same changes should be simultaneously getting
> applied to the Confirmed schemas at both locations in question.
> Finally, each record in the Pending schema is set to the value contained
> in the Confirmed schema and the flag set back to 'confirmed', the two
> sites are considered synchronized, and then the whole process starts
> anew.
>
> There are some details that have been glossed over here to eschew
> obfuscation, and the actual situation at the central site is more
> complex than this in practice, but that is the gist of the approach.  I
> do not know of any product, Slony included, that has built in support
> for a situation such as this, so I suspect all of the details will have
> to be handled in a custom fashion.
>
> Anyhow, Ben, this is my working solution and, from the sounds of it,
> yours is the only case I have heard of that has the same set of
> challenges.  I am interested in hearing if these ideas will work for you
> and/or if anyone knows of any flaws in this methodology or a
> better/easier/more reliable means of accomplishing this task.  I should
> point out that, in our environment of understandably limited
> connectivity, we are definitely more tolerant of the delayed performance
> this synchronization strategy will cause than most users/companies would
> be.  The important thing for us is that the data integrity is maintained
> and that everyone at each site can access and change the data.
>
> Regards,
> Joel
>
> -----Original Message-----
> From: Ben [mailto:bench@silentmedia.com]
> Sent: Thursday, January 25, 2007 1:18 PM
> To: pgsql-general@postgresql.org
> Subject: replication choices
>
> Hi guys. I've inherited a system that I'm looking to add replication to.
>
> It already has some custom replication code, but it's being nice to say
> that code less than good. I'm hoping there's an existing project out
> there that will work much better. Unfortunately, I'm not seeing anything
> that obviously fits my needs, so maybe somebody here can suggest
> something.
>
> I've got a single cluster in the datacenter and dozens of remote sites.
> Many of these sites are on unreliable connections to the internet, and
> while they're online more often then not, when their network will go
> down isn't known, and even when it's up, the network isn't that fast.
>
> A vast majority of activity occurs at these remote sites, with very
> little at the datacenter cluster. That said, the datacenter cluster
> needs to keep pretty good copies of most (but not all) of the data at
> each site.
> Obviously the network unrealiability puts a limit on how up to date the
> datacenter can be, but loosing data is considered Bad. So, for instance,
> restoring the daily backup of each site at the datacenter is too
> infrequent.
>
> Each site will replicate to its own schema in the datacenter cluster, so
> I don't *think* I need a multi-master solution.... but at the same time,
> because data will be coming from multiple sites, simply replaying WAL
> files at the datacenter won't work.
>
> In addition, there will be some data changes made at the datacenter that
> will need to replicate to all of the remote sites as soon as they're
> online. It's ok if data being replicated from the datacenter ends up in
> a different schema at the remote sites than the schema which holds the
> data that will be replicated back to the datacenter.
>
> My current best guess of what to do is create a global schema at every
> database, a local schema at each site, and a schema for each site at the
> datacenter. Then I can use Slony to replicate the global schema from the
> datacenter to each site, and again use Slony to replicate the local
> schema from each site to that site's schema in the datacenter. But I'm
> not too familiar with Slony, and from what I understand, using Slony
> with bad networks leads to bad problems. I'm also not sure that Slony
> supports replicating from multiple sources to the same postgres install,
> even if each replication process is writing to a different schema.
>
> Are there any better options? Or is my Slony idea not so bad?
>

pgsql-general by date:

Previous
From: Guido Neitzer
Date:
Subject: Re: replication choices
Next
From: "Andrus"
Date:
Subject: Re: How to allow users to log on only from my application not from pgadmin