Thread: postgres in HA constellation

From: "Sebastian Reitenbach"
Hi all,

I want to use Postgres, combined with PostGIS, in a highly available
configuration to store spatial data. I will have tens of millions of records
in the database, or even more, so the database will be fairly large and will
consume a lot of disk space. Therefore I thought of using shared storage.

The HA Postgres HOWTO lists the following possibilities:
- backup/restore with pg_dump/pg_restore (with a test database on a fairly
  fast computer, the process took about one or two days, which is way too
  long; see the example after this list)
- rsync, but I doubt that this will work in my case
- rserv, a replication script
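
For reference, the dump/restore cycle I tested was roughly like the following
(commands written from memory; database name and paths are examples):

  # custom-format dump, restorable with pg_restore
  pg_dump -Fc -f /backup/mydb.dump mydb
  # restore into an existing, empty database
  pg_restore -d mydb /backup/mydb.dump

  # postgresql.conf settings I would try while restoring, to speed it up
  # (remember to turn fsync back on afterwards!):
  #   fsync = off
  #   checkpoint_segments = 30
  #   maintenance_work_mem = 524288    # in KB on 8.1 (= 512 MB)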

The most promising option for an active/passive HA configuration seems to be
the rserv script. Unfortunately there was not much info about it in the HOWTO.

Furthermore, I found LifeKeeper, a commercial monitor for use with PostgreSQL
in an HA environment:
http://www.openminds.co.uk/high_availability_solutions/databases/postgresql.htm

Something I thought might work:
is there something that will repair an inconsistent PostgreSQL data store? E.g.
the master database dies, the slave mounts the storage, repairs it in a
reasonable time, and then starts working.


I assume having postgres running in a shared storage environment, with a

Are there any other possibilities that might work that I am not aware of?
Does anybody have experience with Postgres in an HA environment with shared
storage?

Any hint is greatly appreciated.

kind regards
Sebastian


Re: postgres in HA constellation

From: Bruno Wolff III
On Wed, Oct 04, 2006 at 11:23:45 -0000,
  Sebastian Reitenbach <itlistuser@rapideye.de> wrote:
>
> Something I thought might work:
> is there something that will repair an inconsistent PostgreSQL data store?
> E.g. the master database dies, the slave mounts the storage, repairs it in a
> reasonable time, and then starts working.

You can just start the slave server and it will recover using the WAL files
and should do so in a short time. However, you MUST be certain that the
master is no longer running when the slave starts or your database WILL BE
TRASHED.
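
In other words, the takeover itself can be as simple as the following sketch
(device and paths are made-up examples; making the first step true is the
hard part):

  # on the standby, ONLY once the old master is known to be powered off:
  mount /dev/sda1 /var/lib/pgsql
  pg_ctl -D /var/lib/pgsql/data start
  # postgres notices the unclean shutdown, replays the WAL, then starts
  # accepting connections
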
>
> Are there any other possibilities that might work that I am not aware of?
> Does anybody have experience with Postgres in an HA environment with shared
> storage?

You can also use servers with redundant hardware (hotplug CPUs and the like)
to make failure less likely. If you get the hardware failure rates down to
that of your storage system, that might be good enough for your purposes.

If you have multiple data centers to protect against disaster, then you might
look at Slony, which you can use to replicate to a slave system. However, I
think it is possible for the master to report a transaction as committed
before it is shipped off to the slave, so that if the master fails you might
lose some transactions when switching to the slave. But double check with
the Slony documentation on this.
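
For completeness: a Slony setup is driven by slonik scripts along these lines
(a from-memory sketch, not tested; cluster, database and table names are made
up, and every replicated table needs a primary key):

  cluster name = mycluster;
  node 1 admin conninfo = 'dbname=mydb host=master user=slony';
  node 2 admin conninfo = 'dbname=mydb host=slave user=slony';
  init cluster (id = 1, comment = 'master');
  store node (id = 2, comment = 'slave');
  store path (server = 1, client = 2,
              conninfo = 'dbname=mydb host=master user=slony');
  store path (server = 2, client = 1,
              conninfo = 'dbname=mydb host=slave user=slony');
  create set (id = 1, origin = 1, comment = 'replicated tables');
  set add table (set id = 1, origin = 1, id = 1,
                 fully qualified name = 'public.spatial_data');
  subscribe set (id = 1, provider = 1, receiver = 2, forward = no);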


Re: postgres in HA constellation

From: Brad Nicholson
On Wed, 2006-10-04 at 10:43 -0500, Bruno Wolff III wrote:
> On Wed, Oct 04, 2006 at 11:23:45 -0000,
>   Sebastian Reitenbach <itlistuser@rapideye.de> wrote:

> If you have multiple data centers to protect against disaster, then you might
> look at Slony, which you can use to replicate to a slave system. However, I
> think it is possible for the master to report a transaction as committed
> before it is shipped off to the slave, so that if the master fails you might
> lose some transactions when switching to the slave. But double check with
> the Slony documentation on this.

This is indeed accurate, and highly likely to happen if you need to do a
failover.  It all depends on the level of activity.

Brad.


Re: postgres in HA constellation

From: "Sebastian Reitenbach"
Hi,

Bruno Wolff III <bruno@wolff.to> wrote:
> On Wed, Oct 04, 2006 at 11:23:45 -0000,
>   Sebastian Reitenbach <itlistuser@rapideye.de> wrote:
> >
> > Something I thought might work:
> > is there something that will repair an inconsistent PostgreSQL data store?
> > E.g. the master database dies, the slave mounts the storage, repairs it in
> > a reasonable time, and then starts working.
>
> You can just start the slave server and it will recover using the WAL files
> and should do so in a short time. However, you MUST be certain that the
> master is no longer running when the slave starts or your database WILL BE
> TRASHED.
That sounds great: I can just start the slave and it will repair whatever
broke. I am aware that I have to make sure the master is really dead.

> >
> > Are there any other possibilities that might work that I am not aware of?
> > Does anybody have experience with Postgres in an HA environment with
> > shared storage?
>
> You can also use servers with redundant hardware (hotplug CPUs and the like)
> to make failure less likely. If you get the hardware failure rates down to
> that of your storage system, that might be good enough for your purposes.
>
> If you have multiple data centers to protect against disaster, then you might
> look at Slony, which you can use to replicate to a slave system. However, I
> think it is possible for the master to report a transaction as committed
> before it is shipped off to the slave, so that if the master fails you might
> lose some transactions when switching to the slave. But double check with
> the Slony documentation on this.

I just have one data center; no replication to a far-away site is needed.

Thank you all for your answers.

kind regards
Sebastian


Re: postgres in HA constellation

From: Andrew Sullivan
On Thu, Oct 05, 2006 at 04:24:17AM -0000, Sebastian Reitenbach wrote:
>
> I just have one data center; no replication to a far-away site is needed.

If it is at all feasible with your budget, I'd think _very strongly_
about replicating using Slony inside your data centre _too_.  The
shared storage answer is nice, but it is _really really really_ easy
to shoot yourself in the foot with a rocket propelled grenade with
that arrangement.  Very careful administration might prevent it, but
there is a reason that none of the corporate people will guarantee
two machines will never accidentally mount the same file system at
once: in a shared-disc-only system, it's impossible to be 100%
certain that the other machine really is dead and not coming back.
Very tricky scripts could of course lower the risk.

If you're really going to have all that data, it's going to be a
major pain to restore in the event of such corruption.  In addition,
your recovery will only be to the last dump.  That's why I suggest
replicating, either with Slony or something else, as a belt that will
nicely complement the suspenders of your shared-disc failover.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to.  That actually seems sort of quaint now.
        --J.D. Baldwin

Re: postgres in HA constellation

From: Brad Nicholson
On Thu, 2006-10-05 at 04:24 +0000, Sebastian Reitenbach wrote:
> Hi,
>
> Bruno Wolff III <bruno@wolff.to> wrote:
> > On Wed, Oct 04, 2006 at 11:23:45 -0000,
> >   Sebastian Reitenbach <itlistuser@rapideye.de> wrote:
> > >
> > > Something I thought might work:
> > > is there something that will repair an inconsistent PostgreSQL data
> > > store? E.g. the master database dies, the slave mounts the storage,
> > > repairs it in a reasonable time, and then starts working.
> >
> > You can just start the slave server and it will recover using the WAL files
> > and should do so in a short time. However, you MUST be certain that the
> > master is no longer running when the slave starts or your database WILL BE
> > TRASHED.
> That sounds great: I can just start the slave and it will repair whatever
> broke. I am aware that I have to make sure the master is really dead.

Be absolutely sure that you test, test, test this.  And then test it
some more.  There are all sorts of nasty surprises that can happen in an
HA config that can leave you with a corrupt database.  As mentioned, two
postmasters writing to the same directory is certainly one.

But you have to watch for problems in the HA config itself as well.

What happens if the load on the master shoots up to the point where it
can't release the disk resources in time?  Will the other box try to take
over and scribble on your file system?

What happens when you pull the plug on the master while under load
(either physically, or virtually with a halt -q)?

How does the HA solution play with your file system of choice?

Trust me, you want to know the answers to all of these questions and
more before you even consider turning on an HA solution in your
production environment.  Don't just rely on the marketing material and
what sales drones tell you.  If you don't know the answers, at least make
sure you have good backups.


Brad.


Re: postgres in HA constellation

From: Jim Nasby
On Oct 5, 2006, at 1:41 PM, Andrew Sullivan wrote:
> On Thu, Oct 05, 2006 at 04:24:17AM -0000, Sebastian Reitenbach wrote:
>>
>> I just have one data center; no replication to a far-away site is
>> needed.
>
> If it is at all feasible with your budget, I'd think _very strongly_
> about replicating using Slony inside your data centre _too_.  The
> shared storage answer is nice, but it is _really really really_ easy
> to shoot yourself in the foot with a rocket propelled grenade with
> that arrangement.  Very careful administration might prevent it, but
> there is a reason that none of the corporate people will guarantee
> two machines will never accidentally mount the same file system at
> once: in a shared-disc-only system, it's impossible to be 100%
> certain that the other machine really is dead and not coming back.
> Very tricky scripts could of course lower the risk.

Isn't it entirely possible that if the master gets trashed it would
start sending garbage to the Slony slave as well?

I think PITR would be a much better option to protect against this,
since you could probably recover up to the exact point of failover.
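
The moving parts are roughly these (8.x syntax from memory, simplified; the
archive location is an example, and it should not live on the same shared
storage you are trying to protect):

  # postgresql.conf on the master:
  archive_command = 'cp %p /mnt/archive/%f'

  # periodic base backup (see the docs for the caveats):
  psql -c "SELECT pg_start_backup('base');"
  tar cf /mnt/archive/base.tar $PGDATA
  psql -c "SELECT pg_stop_backup();"

  # recovery.conf in the restored data directory:
  restore_command = 'cp /mnt/archive/%f %p'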

When it comes to the actual failover, take a look at the Linux-HA
project. They've got some stuff you could probably use (such as the
heartbeat program). Another really good idea is to give the backup
machine the ability to kill the power to the primary machine, and not
have either machine mount the shared storage at bootup.
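
With heartbeat v1, the shared-disk resource group is a one-liner in
/etc/ha.d/haresources, something like this (untested; node name, device,
mount point and IP are examples):

  node1 IPaddr::192.168.0.10 Filesystem::/dev/sda1::/var/lib/pgsql::ext3 postgresql

where "postgresql" is an init-style script that heartbeat starts on the
active node and stops on failover.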

If you're interested in paying someone to help set this up, I
know that we (EnterpriseDB) have folks that have done this before. I
suspect that some of the other folks listed on the commercial support
page have done this as well (likely Command Prompt and Varlena).
--
Jim Nasby                                    jimn@enterprisedb.com
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: postgres in HA constellation

From: "Sebastian Reitenbach"
Hi all,

>
> I think PITR would be a much better option to protect against this,
> since you could probably recover up to the exact point of failover.
>
> When it comes to the actual failover, take a look at the Linux-HA
> project. They've got some stuff you could probably use (such as the
> heartbeat program). Another really good idea is to give the backup
> machine the ability to kill the power to the primary machine, and not
> have either machine mount the shared storage at bootup.
As I am using CARP on OpenBSD to set up HA clusters, I am very comfortable
with ucarp, the userland implementation for Linux. At boot, the database
files are mounted read-only by default and the database is not running; then
ucarp starts, and only if this machine becomes master does it remount the
postgres data read-write and start the database.
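
Roughly like this (untested; addresses, VHID and paths are examples):

  # started at boot on each node:
  ucarp -i eth0 -s 10.0.0.2 -v 10 -p secret -a 10.0.0.1 \
        -u /etc/vip-up.sh -d /etc/vip-down.sh

  # /etc/vip-up.sh -- run by ucarp only when this node becomes master:
  #!/bin/sh
  ip addr add 10.0.0.1/24 dev eth0
  mount -o remount,rw /var/lib/pgsql
  pg_ctl -D /var/lib/pgsql/data start
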
And I thought about killing the power of the "lost" master after a takeover
too, to make sure the machine will not come back unconditionally later.

kind regards
Sebastian


Re: postgres in HA constellation

From: "Jim C. Nasby"
On Fri, Oct 06, 2006 at 06:34:25AM -0000, Sebastian Reitenbach wrote:
> > I think PITR would be a much better option to protect against this,
> > since you could probably recover up to the exact point of failover.
> >
> > When it comes to the actual failover, take a look at the Linux-HA
> > project. They've got some stuff you could probably use (such as the
> > heartbeat program). Another really good idea is to give the backup
> > machine the ability to kill the power to the primary machine, and not
> > have either machine mount the shared storage at bootup.
> As I am using CARP on OpenBSD to set up HA clusters, I am very comfortable
> with ucarp, the userland implementation for Linux. At boot, the database
> files are mounted read-only by default and the database is not running; then
> ucarp starts, and only if this machine becomes master does it remount the
> postgres data read-write and start the database.
> And I thought about killing the power of the "lost" master after a takeover
> too, to make sure the machine will not come back unconditionally later.

Heh, I'd assumed you were on Linux, but of course there's no reason you
couldn't set up HA on OpenBSD. The key is just to make sure that you
never bring up two servers on the same data directory.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: postgres in HA constellation

From: Andrew Sullivan
On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
> Isn't it entirely possible that if the master gets trashed it would
> start sending garbage to the Slony slave as well?

Well, maybe, but unlikely.  What happens in a shared-disc failover is
that the second machine re-mounts the same partition as the old
machine had open.  The risk is the case where your to-be-removed
machine hasn't actually stopped writing on the partition yet, but
your failover software thinks it's dead, and can fail over.  Two
processes have the same Postgres data and WAL files mounted at the
same time, and blammo.  As nearly as I can tell, it takes
approximately zero time for this arrangement to make such a mess that
you're not committing any transactions.  Slony will only get the data
on COMMIT, so the risk is very small.

> I think PITR would be a much better option to protect against this,
> since you could probably recover up to the exact point of failover.

That oughta work too, except that your remounted WAL gets corrupted
under the imagined scenario, and then you copy the next updates to
the WAL.  So you have to save all the incremental copies of the WAL
you make, so that you don't have a garbage file to read.

As I said, I don't think that it's a bad idea to use this sort of
trick.  I just think it's a poor single line of defence, because when
it fails, it fails hard.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.
                --Brad Holland

Re: postgres in HA constellation

From: Andrew Sullivan
On Tue, Oct 10, 2006 at 10:11:08AM -0500, Jim C. Nasby wrote:
> couldn't set up HA on OpenBSD. The key is just to make sure that you
> never bring up two servers on the same data directory.

I think this highlights exactly what I'm trying to emphasise: in
actual, shared-nothing systems like this, there's no possible
guarantee of "never".  There are possible guarantees of "very
rarely".  The problem is, you're already trying to address a teeny
portion of the likely events on your machines.  So you have to assume
that more than one thing might break at the same time, and have a
recovery plan for it.  I submit that a recovery plan of "restore from
pg_dump" is usually not going to be enough if it was worth the cost
and hassle of setting up shared disk failover.  YMMV, of course.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
        --Philip Greenspun

Re: postgres in HA constellation

From: "Sebastian Reitenbach"
Hi,

>
> I think this highlights exactly what I'm trying to emphasise: in
> actual, shared-nothing systems like this, there's no possible
> guarantee of "never".  There are possible guarantees of "very
> rarely".  The problem is, you're already trying to address a teeny
> portion of the likely events on your machines.  So you have to assume
> that more than one thing might break at the same time, and have a
> recovery plan for it.  I submit that a recovery plan of "restore from
> pg_dump" is usually not going to be enough if it was worth the cost
> and hassle of setting up shared disk failover.  YMMV, of course.
>
Yes, I did a restore from a dump made with pg_dump; that took over a day.
Maybe I can speed it up a bit, but it is still going to take too long.


Sebastian


Re: postgres in HA constellation

From: "Jim C. Nasby"
On Wed, Oct 11, 2006 at 10:28:44AM -0400, Andrew Sullivan wrote:
> On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
> > Isn't it entirely possible that if the master gets trashed it would
> > start sending garbage to the Slony slave as well?
>
> Well, maybe, but unlikely.  What happens in a shared-disc failover is
> that the second machine re-mounts the same partition as the old
> machine had open.  The risk is the case where your to-be-removed
> machine hasn't actually stopped writing on the partition yet, but
> your failover software thinks it's dead, and can fail over.  Two
> processes have the same Postgres data and WAL files mounted at the
> same time, and blammo.  As nearly as I can tell, it takes
> approximately zero time for this arrangement to make such a mess that
> you're not committing any transactions.  Slony will only get the data
> on COMMIT, so the risk is very small.

Hrm... I guess it depends on how quickly the Slony master would stop
processing if it was talking to a shared disk that had been corrupted by
another postmaster.

> > I think PITR would be a much better option to protect against this,
> > since you could probably recover up to the exact point of failover.
>
> That oughta work too, except that your remounted WAL gets corrupted
> under the imagined scenario, and then you copy the next updates to
> the WAL.  So you have to save all the incremental copies of the WAL
> you make, so that you don't have a garbage file to read.
>
> As I said, I don't think that it's a bad idea to use this sort of
> trick.  I just think it's a poor single line of defence, because when
> it fails, it fails hard.

Yeah, STONITH is *critical* for shared-disk.
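
For example, with heartbeat that means a stonith directive in ha.cf telling
the surviving node how to cut power to its peer, along these lines (from
memory, not tested; the device type, address and credentials are examples):

  stonith_host * baytech 10.0.0.3 mylogin mysecretpassword
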
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: postgres in HA constellation

From: Brad Nicholson
On Wed, 2006-10-11 at 16:12 -0500, Jim C. Nasby wrote:
> On Wed, Oct 11, 2006 at 10:28:44AM -0400, Andrew Sullivan wrote:
> > On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
> > > Isn't it entirely possible that if the master gets trashed it would
> > > start sending garbage to the Slony slave as well?
> >
> > Well, maybe, but unlikely.  What happens in a shared-disc failover is
> > that the second machine re-mounts the same partition as the old
> > machine had open.  The risk is the case where your to-be-removed
> > machine hasn't actually stopped writing on the partition yet, but
> > your failover software thinks it's dead, and can fail over.  Two
> > processes have the same Postgres data and WAL files mounted at the
> > same time, and blammo.  As nearly as I can tell, it takes
> > approximately zero time for this arrangement to make such a mess that
> > you're not committing any transactions.  Slony will only get the data
> > on COMMIT, so the risk is very small.
>
> Hrm... I guess it depends on how quickly the Slony master would stop
> processing if it was talking to a shared disk that had been corrupted by
> another postmaster.

That doesn't depend on Slony, it depends on Postgres.  If transactions
are committing on the master, Slony will replicate them.  You could have
a situation where your HA failover trashes some of your database, but the
database still starts up.  It starts accepting and replicating
transactions before the corruption is discovered.

Brad.


Re: postgres in HA constellation

From: Chris Browne
bnichols@ca.afilias.info (Brad Nicholson) writes:
> On Wed, 2006-10-11 at 16:12 -0500, Jim C. Nasby wrote:
>> On Wed, Oct 11, 2006 at 10:28:44AM -0400, Andrew Sullivan wrote:
>> > On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
>> > > Isn't it entirely possible that if the master gets trashed it would
>> > > start sending garbage to the Slony slave as well?
>> >
>> > Well, maybe, but unlikely.  What happens in a shared-disc failover is
>> > that the second machine re-mounts the same partition as the old
>> > machine had open.  The risk is the case where your to-be-removed
>> > machine hasn't actually stopped writing on the partition yet, but
>> > your failover software thinks it's dead, and can fail over.  Two
>> > processes have the same Postgres data and WAL files mounted at the
>> > same time, and blammo.  As nearly as I can tell, it takes
>> > approximately zero time for this arrangement to make such a mess that
>> > you're not committing any transactions.  Slony will only get the data
>> > on COMMIT, so the risk is very small.
>>
>> Hrm... I guess it depends on how quickly the Slony master would stop
>> processing if it was talking to a shared disk that had been corrupted by
>> another postmaster.
>
> That doesn't depend on Slony, it depends on Postgres.  If transactions
> are committing on the master, Slony will replicate them.  You could have
> a situation where your HA failover trashes some of your database, but the
> database still starts up.  It starts accepting and replicating
> transactions before the corruption is discovered.

There's a bit of "joint responsibility" there.

Let's suppose that the disk has gone bad, zeroing out some index pages
for the Slony-I table sl_log_1.  (The situation will be the same for
just about any kind of corruption of a Slony-I internal table.)

There are two possibilities:
  1.  The PostgreSQL instance may notice that those pages are bad,
      returning an error message, and halting the SYNC.

  2.  The PostgreSQL instance may NOT notice that those pages are bad,
      and, as a result, fail to apply some updates, thereby corrupting
      the subscriber.

I think there's a pretty high probability of 1) happening rather than
2), but there is a risk of corruption of subscribers roughly
proportional to the probability of 2).

My "gut feel" is that the probability of 2) is pretty small, but I
don't have anything to point to as a proof of that...
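
One cheap cross-check after any failover event is to compare the origin and
the subscriber directly, e.g. row counts on the big tables (host, database
and table names are examples):

  psql -h master -d mydb -c "SELECT count(*) FROM public.spatial_data;"
  psql -h slave  -d mydb -c "SELECT count(*) FROM public.spatial_data;"
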
--
output = reverse("gro.mca" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/
"One of the main causes of the fall of the Roman Empire was that,
lacking zero, they had no way to indicate successful termination of
their C programs."  -- Robert Firth