Thread: PG on two nodes with shared disk ocfs2 & drbd

PG on two nodes with shared disk ocfs2 & drbd

From
Jasmin Dizdarevic
Date:
Hi, 

I have to build a load-balanced PostgreSQL cluster and wanted to ask whether this configuration would work:

A DRBD volume in dual-primary mode with an OCFS2 filesystem.

Will there be any conflicts if the shared volume is used as the PGDATA directory?

Read and write access on both nodes is a requirement for this cluster.

Thank you
Jasmin

Re: PG on two nodes with shared disk ocfs2 & drbd

From
Andrew Sullivan
Date:
On Sun, Feb 27, 2011 at 01:48:24PM +0100, Jasmin Dizdarevic wrote:
> A drbd disk in dual primary mode with ocfs2-filesystem.
>
> Will there be any conflicts if using the shared volume as PGDATA directory?

Yes.

A
--
Andrew Sullivan
ajs@crankycanuck.ca

Re: PG on two nodes with shared disk ocfs2 & drbd

From
John R Pierce
Date:
On 02/27/11 4:48 AM, Jasmin Dizdarevic wrote:
>
> I have to build a load balanced pg-cluster and I wanted...

master-master doesn't work very well with databases, especially ones
like postgres that are optimized for a high level of concurrency and
transactional integrity.

on proper hardware, postgres can handle quite high transaction volumes
on a single server.  If high availability is a requirement, you can run
a second server as a standby slave using either postgres streaming
replication, drbd-style block replication, or a similar technique.

you -can- distribute read accesses between a master and a slave server
via things like pgpool2, where all inserts, updates, DDL changes, etc.
are made to the master server, but reads can go to either.  note you
do NOT want to use block-level replication like drbd for this: the
drbd slave cannot be actively mounted, nor could the slave instance of
postgres be aware of changes to the underlying storage.  instead you
would use the streaming replication built into postgresql 9.0.  another
approach is to use the master server for all OLTP-type accesses, and the
hot standby server for more complex, long-running OLAP queries for
reporting, etc.  in case of master failure, the slave becomes the new
master, and you shut down the presumably less important OLAP operations
until such time as a new slave can be deployed.
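[To make the streaming-replication option concrete, here is a minimal
configuration sketch for PostgreSQL 9.0; hostnames, user names, and
addresses are illustrative, not taken from the thread:]

```
# primary: postgresql.conf
wal_level = hot_standby
max_wal_senders = 3

# primary: pg_hba.conf -- let the standby connect for replication
# host  replication  replicator  192.0.2.10/32  md5

# standby: postgresql.conf
hot_standby = on

# standby: recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
```

With this in place, read-only queries can be sent to the standby while
all writes go to the primary.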



Re: PG on two nodes with shared disk ocfs2 & drbd

From
Andrew Sullivan
Date:
On Sun, Feb 27, 2011 at 12:10:36PM -0800, John R Pierce wrote:
> are made to the master server, but reads are done to either.  note you
> do NOT want to use block level replication like drbd for this as the
> drbd slave can not be actively mounted, nor could the slave instance of
> postgres be aware of changes to the underlying storage, rather you would
> use the streaming replication built into postgresql 9.0.

Note that with drbd, you can have a piece of hot standby hardware
sitting there to take over the filesystem in real time, in the event
the original master blows up or something.  My experience with systems
designed like this is that they are a foot-bazooka: the only real
utility I ever saw in them was to increase on-call hours for sysadmins
after they blew off their own foot (and too often, my database) doing
something tricky with the standby server.  If it were me setting it
up, I'd think the streaming replication approach a better bet.  Not
that anything will save you when someone else has root and decides to
play with a production server.

I believe that Greenplum sells a system based on Postgres that is
supposed to do some kind of distributed cluster thing.  I don't
understand the details and it's been a long time since I had any look
at it.  I think it's intended to compete in the scalability rather
than the availability market.  Maybe someone around here knows more.

The only people I'm aware of who really do this sort of thing for
availability are Oracle with RAC, and Oracle with some mostly-works
clustering stuff in MySQL.  I have never met a happy customer of the
former, but I've heard some people tell me it's real impressive
technology when it's working.  (The unhappy people seemed mostly
unhappy because, for that kind of coin, they would like it to work
most of the time.  I know at least one metronet deployment that didn't
work even once for two years.)  In the case of the MySQL stuff, there
are some trade-offs in the design that make my heart sink.  But maybe
for the OP's application it will work.

A


--
Andrew Sullivan
ajs@crankycanuck.ca

Re: PG on two nodes with shared disk ocfs2 & drbd

From
Jasmin Dizdarevic
Date:
Thank you for your detailed information about HA and LB. First of all, it's a pity that there is no built-in feature for LB+HA (both of them simultaneously).
In my eyes, the pgpool2/3 solution has too many disadvantages and restrictions.
My idea was the one that John described: DML and DDL are done on the small box and reporting on the "big mama" with streaming replication and hot standby enabled. The only problem is that we use temp tables for reporting purposes. I hope that the query-duration impact of not using temp tables will be offset by running DML/DDL on the small box.
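[Editorial aside: the temp-table concern is real, since hot-standby
connections are read-only. A sketch of what a reporting query would hit
on a 9.0 hot standby:]

```sql
-- connected to the hot standby:
CREATE TEMPORARY TABLE report_scratch (id int, total numeric);
-- ERROR:  cannot execute CREATE TABLE in a read-only transaction
```

So any reporting SQL that materializes intermediate results in temp
tables has to be rewritten (e.g. as subqueries or joins) before it can
run on the standby.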

I think this will be the final configuration:
- drbd in dual-primary mode (ocfs2) as the archive location for the primary node
- streaming replication and hot standby

This is a good how-to for getting real high availability when the primary node goes down, but for now I'm going to deploy the described configuration with manual failover.
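[As a sketch, archiving WAL from the primary to the drbd-backed mount
would look roughly like this in postgresql.conf; the mount point is
illustrative:]

```
archive_mode = on
archive_command = 'test ! -f /mnt/drbd/archive/%f && cp %p /mnt/drbd/archive/%f'
```

The `test ! -f` guard refuses to overwrite an already-archived segment,
which is the pattern the PostgreSQL documentation recommends.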

Regards,
Jasmin


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: PG on two nodes with shared disk ocfs2 & drbd

From
Andrew Sullivan
Date:
On Mon, Feb 28, 2011 at 12:13:32AM +0100, Jasmin Dizdarevic wrote:
> Thank you for your detailed information about HA and LB. First of all it's a
> pitty that there is no built-in feature for LB+HA (both of them,
> simultaneous).

I think it's a pity that I'm not paid a million dollars a year, too,
but barring magic I don't think it'll happen soon.  Building
multi-master transactional ACID-type databases is very hard.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: PG on two nodes with shared disk ocfs2 & drbd

From
Andrew Sullivan
Date:
On Mon, Feb 28, 2011 at 12:13:32AM +0100, Jasmin Dizdarevic wrote:
> My idea was the one, that john described: DML and DDL are done on the small
> box and reporting on the "big mama" with streaming replication and hot
> stand-by enabled. the only problem is that we use temp tables for reporting
> purposes. i hope that the query duration impact with not using temp tables
> will be equalized through running dml/ddl on the small box.

By the way, despite my flip comment, it is entirely possible that what
you need would be better handled by one of the other replication
systems.  Slony is actually well-suited to this sort of thing, despite
the overhead that it imposes.  This is a matter of trade-offs, and you
might want to think about different roles for different boxes --
especially since hardware is so cheap these days.
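[For illustration, a minimal Slony-I setup replicating one table from
an OLTP origin to a reporting subscriber might be sketched in slonik
roughly like this; cluster, host, user, and table names are all made
up:]

```
cluster name = reportrepl;

node 1 admin conninfo = 'dbname=app host=oltp-box user=slony';
node 2 admin conninfo = 'dbname=app host=report-box user=slony';

init cluster (id = 1, comment = 'OLTP origin');
store node (id = 2, comment = 'reporting subscriber', event node = 1);
store path (server = 1, client = 2, conninfo = 'dbname=app host=oltp-box user=slony');
store path (server = 2, client = 1, conninfo = 'dbname=app host=report-box user=slony');

create set (id = 1, origin = 1, comment = 'tables to replicate');
set add table (set id = 1, origin = 1, id = 1, fully qualified name = 'public.orders');

subscribe set (id = 1, provider = 1, receiver = 2, forward = no);
```

Because the subscriber is a normal writable database, it can hold temp
tables and reporting-only objects, which is part of why Slony suits
this split.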

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: PG on two nodes with shared disk ocfs2 & drbd

From
John R Pierce
Date:
On 02/27/11 4:07 PM, Andrew Sullivan wrote:
> Multi-master transactional ACID-type databases with multiple masters is very hard.
>

indeed.

Oracle RAC works by having a distributed cache and locking manager
replicating over a fast interconnect like InfiniBand.  Oracle
fundamentally uses a transaction redo log rather than a write-ahead log
like postgres; this is somewhat more amenable to distributed processing
when combined with the distributed cache and lock manager.

The trade-off is that it's a really complicated and fragile system, with
a high opportunity for catastrophic failure.  It seems to me like Oracle
wants to sell turnkey database servers with their Exadata stuff.



Re: PG on two nodes with shared disk ocfs2 & drbd

From
Robert Treat
Date:
On Sun, Feb 27, 2011 at 7:17 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> On Mon, Feb 28, 2011 at 12:13:32AM +0100, Jasmin Dizdarevic wrote:
>> My idea was the one, that john described: DML and DDL are done on the small
>> box and reporting on the "big mama" with streaming replication and hot
>> stand-by enabled. the only problem is that we use temp tables for reporting
>> purposes. i hope that the query duration impact with not using temp tables
>> will be equalized through running dml/ddl on the small box.
>
> By the way, despite my flip comment, it is entirely possible that what
> you need would be better handled by one of the other replication
> systems.  Slony is actually well-suited to this sort of thing, despite
> the overhead that it imposes.  This is a matter of trade-offs, and you
> might want to think about different roles for different boxes --
> especially since hardware is so cheap these days.
>

Yeah, it's possible one of the async master-master systems like
Bucardo or rubyrep would also fit his needs. There are options here,
just no full-on pony/unicorn/pegasus mix like everyone hopes for.

Oh, I guess if someone is looking to fund/help development of such a
thing, it might be worth pointing people to Postgres-XC
(http://wiki.postgresql.org/wiki/Postgres-XC). It's got a ways to go,
but they are at least trying.

Robert Treat
play: xzilla.net
work: omniti.com
hiring: l42.org/Lg

Re: PG on two nodes with shared disk ocfs2 & drbd

From
Jasmin Dizdarevic
Date:
hehe...
Andrew, I appreciate pg and its free, open-source features - maybe I've chosen the wrong wording.
In my eyes, such a feature is getting more important nowadays. Postgres-R and -XC are interesting ideas.

thanks everybody for the comments

regards,
jasmin




Re: PG on two nodes with shared disk ocfs2 & drbd

From
Craig Ringer
Date:
On 03/03/11 09:01, Jasmin Dizdarevic wrote:
> hehe...
> andrew, I appriciate pg and it's free open source features - maybe I've
> chosen a wrong formulation.
> in my eyes such a feature is getting more important nowadays.

Why? Shared disk means a shared point of failure, and poor redundancy
against a variety of non-total failure conditions (data corruption,
etc). Add the synchronization costs to the mix, and I don't see the appeal.

I think clustering _in general_ is becoming a big issue, but I don't
really see the appeal of shared-disk clustering personally.

--
Craig Ringer