Re: Replication options? - Mailing list pgsql-general

From Andrew Sullivan
Subject Re: Replication options?
Msg-id 20040812102504.GA7952@libertyrms.info
In response to Re: Replication options?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
I'll try this again, since it doesn't seem to have made it to the
list.


On Wed, Aug 11, 2004 at 12:02:07PM -0400, Tom Lane wrote:
> Now erServer did work for them, but it required significant amounts of
> tuning and constant babysitting by the DBA.  (If Andrew Sullivan is
> paying attention to this thread, he can offer lots of gory details.)
> I can also personally testify that getting erServer set up is a major
> pain in the rear.  I haven't messed with Slony, but all reports are that
> it's a substantially better piece of code.

I can indeed provide gory details.  Erserver worked for us, and was
able to handle the load we gave it (at times pretty substantial).
But it had a number of flaws.  Some of these were mere matters of
implementation, and some were (in my view) fundamental.  Since I've
been observing radio silence on the list lately, I feel entitled to
blather on at length now.  So, below is the gore, and the reasons we
finally decided to abandon erserver.  This is very similar to the
negative part of what I had to say at OSCON, so if you were bored by
me there, you'll find this equally boring.

A.  First, the implementation faults.  As Vivek Khera pointed out,
the failover and setup support is not strong.

1.  Setting up erserver on a system which is not already replicated
is a major pain.  (We didn't have this problem because we always
launched with erserver support in place.)  On a database of a few gig,
you could easily have to take 24 hours downtime to get it set up.
Some of that was just faulty implementation, and if you have a single
not null unique column on every table, the problem is more to do with
poorly conceived setup scripts.  But finding this out turns out to
depend on having available an expert in the system (and as far as I
know, almost all the experts on it actually work here at Afilias).  I
did put together some notes on this topic for the BSD version of
erserver.  They're at
<http://gborg.postgresql.org/pipermail/erserver-general/2003-October/000169.html>
or <http://tinyurl.com/66b89>.

2.  Switchover is also a pain (we don't like to talk about failover:
erserver is, like Slony-I, async, and failover more or less
automatically risks stranding data on the dead master).  There are
some automation scripts which make it a little easier, but the basic
problem is getting your slave into a condition where it can actually
take over from the master.  The slaves in erserver really don't know
enough about the master to be in a position to do this.  It _is_
possible: I've done it.  It's not fun.  (Failing back is even less
fun, and essentially requires you to build a new slave.  See A1, above.
If you're going to use erserver as a disaster-avoidance system, you
need two identical servers, so that any one can play the role of
master.)
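
As a rough illustration of why failover on an async system risks
stranding data, here is a minimal Python sketch.  It is a toy model
under stated assumptions, not erserver or Slony-I code: the class
AsyncPair and its methods are hypothetical names invented for this
example, and transactions are reduced to plain strings.

```python
from typing import List


class AsyncPair:
    """Toy model of an asynchronous master/slave pair (hypothetical)."""

    def __init__(self) -> None:
        self.master: List[str] = []  # transactions committed on the master
        self.slave: List[str] = []   # transactions applied on the slave

    def commit(self, txn: str) -> None:
        # Async: the master acknowledges without waiting for the slave.
        self.master.append(txn)

    def sync(self) -> None:
        # One replication pass: the slave catches up to the master.
        self.slave = list(self.master)

    def stranded_on_failover(self) -> List[str]:
        # Anything committed after the last pass exists only on the master.
        return self.master[len(self.slave):]


pair = AsyncPair()
pair.commit("t1")
pair.sync()
pair.commit("t2")
pair.commit("t3")
stranded = pair.stranded_on_failover()
print(stranded)
```

If the master dies at this point, the transactions in `stranded` are
the ones left behind.  A controlled switchover avoids this by letting
the slave catch up before it takes over.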

3.  The engine was written in Java.  Java is a nice language, but the
JDK from Sun imposes a 3 G limit on the size of the JVM.  If you get
far enough behind, the VM just blows up, and then you have no hope of
recovering.  This is a _very_ serious limitation for high traffic
sites.  It also turned out to be completely fatal for certain users
who wanted to replicate large objects: one object would be enough to
make the system fail.  (For reasons that are too incredible to go
into, the process actually holds two copies of the data at one time
during a part of processing.  This is just a bug, though a dangerous
one.)

4.  The logging code was deliberately obfuscatory.  For some reason,
the person who originally wrote the Java code (note _not_ the original
code from PostgreSQL, Inc.) decided to wrap all the error handling in
an outer layer which returned the line number of the error handler
every time it threw an exception.  This meant that, from looking at
the logs, every case of a bug looks like it happened at the same
place.  You can imagine how much fun it was to fix things.  Every
person I've ever known who looked at the logging code suffered
retinal damage -- it was that bad.  (This is actually fixed in the
PostgreSQL-commercial version of the software, BTW.)

B.  Second, the fundamental errors.

1.  The first big problem came from something we thought was an
advantage: erserver replicates only the latest version of the row.
This reduces the replication overhead considerably, and for a long
time I was a great proponent of this approach.  I was wrong, because
the performance overhead that it imposes under certain perverse kinds
of loads is well and truly awful.  Even in the normal circumstance,
the performance penalty is noticeable; but it's not a problem if you
have enough excess capacity.  When that capacity is squeezed, you
run into a lot of pain.  In such cases, the replication application
starts to slow down.  Get under really heavy load, and you start to
have to worry about the JVM limits outlined in A3.  This can be
dealt with, but you absolutely need to hold its hand when things are
bad.  Your DBAs have better things to do, I assume.

The decision to send only the last row also cost some functionality,
because you can't build an historical-database slave with erserver,
unfortunately (if a row gets updated twice in the space of one
transaction, you won't see two changes on the slave, but only the
final state of the changed row).
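
The loss of history described above can be sketched in a few lines of
Python.  This is only a toy model, assuming a change stream of
(row id, new value) pairs; the function names are hypothetical and
nothing here is taken from the erserver source.

```python
from typing import Dict, List, Tuple

Change = Tuple[int, str]  # (row id, new value); hypothetical format


def replicate_latest_only(changes: List[Change]) -> List[Change]:
    """Collapse the stream so each row is sent once, with its final value."""
    latest: Dict[int, str] = {}
    for row_id, value in changes:
        latest[row_id] = value  # a later update overwrites an earlier one
    return list(latest.items())


def replicate_every_change(changes: List[Change]) -> List[Change]:
    """Send every change in order, so the slave can rebuild the history."""
    return list(changes)


# Row 1 is updated twice before the next replication pass.
changes = [(1, "pending"), (1, "shipped"), (2, "pending")]
latest = replicate_latest_only(changes)
full = replicate_every_change(changes)
print(latest)
print(full)
```

In the latest-only case the slave never sees row 1 as "pending", which
is why an historical-database slave cannot be built that way.  Sending
every change keeps the history at the cost of more replication
traffic.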

2.  Finally, there is the problem that the snapshot applications
occasionally could get into the situation where applying rows to the
slave would result either in bad data (bad) or errors on unique
indexes (also bad).  You had to choose between making your slave even
more unlike your master or potentially getting called in the middle
of the night to hand-fix the deadlock condition.  (Some further
discussion of this feature of the software is at
<http://gborg.postgresql.org/pipermail/erserver-general/2003-October/000185.html>
or <http://tinyurl.com/5erj7>.)

It is really the items in B that finally convinced us that we had to
give up on the erserver code and work on a fresh system.  I think
Jan will confirm that his Slony-I work drew some useful inspiration
from the erserver code (in particular, the magic that Vadim
performed).  But ultimately, erserver taught us as much about what
_else_ you needed before you got a real replication system.  In
particular, we felt that you needed more knowledge at all the nodes
than erserver was able to provide.  By contrast, you can usefully
think of Slony-I as a cluster-communication system which happens to
specialise in keeping the data the same on all subscribing nodes.

This isn't to say that erserver is not undergoing development.  I
understand from Geoff Davidson of PostgreSQL, Inc., that they are
continuing work on the product, with an eye to a multi-master
distributed system and automatic failover.  I think such developments
would be welcomed by PostgreSQL users.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@ca.afilias.info>                              M2P 2A8
                                        +1 416 646 3304 x4110

