Thread: PostgreSQL clustering (shared disk)
Hello list,
I have a mission to implement a two-node active-passive PostgreSQL cluster. The databases in the cluster are rather large (hundreds of GBs), which prompts me to consider a shared-disk environment. I know this is not natively supported by PostgreSQL, but I have been investigating the Red Hat Cluster Suite with GFS. The idea is that the cluster software together with GFS (and HP iLO) would make sure that only one postmaster at a time can access the shared disk, and if the active node fails, the cluster software would shift the services to the previously passive node. What I'm pondering is whether the cluster can keep the postmasters synchronized at all times so that the database won't get corrupted.
Is there anyone on the list who has seen such a configuration, or, even better, implemented it themselves? I found a small document by Devrim Gunduz describing this scenario, but it was rather scant on details.
If shared disk is definitely out of the question, the fallback plan would be to use DRBD and Linux-HA.
Regards
MP
On 16.08.2007 08:42, Mikko Partio wrote:
> I have a mission to implement a two-node active-passive PostgreSQL
> cluster. The databases in the cluster are rather large (hundreds of GBs),
> which prompts me to consider a shared-disk environment. I know this is
> not natively supported by PostgreSQL, but I have been investigating the
> Red Hat Cluster Suite with GFS. The idea is that the cluster software
> together with GFS (and HP iLO) would make sure that only one postmaster
> at a time can access the shared disk, and if the active node fails, the
> cluster software would shift the services to the previously passive node.
> What I'm pondering is whether the cluster can keep the postmasters
> synchronized at all times so that the database won't get corrupted.
>
> Is there anyone on the list who has seen such a configuration, or, even
> better, implemented it themselves? I found a small document by Devrim
> Gunduz describing this scenario, but it was rather scant on details.
>
> If shared disk is definitely out of the question, the fallback plan would
> be to use DRBD and Linux-HA.

The usual setup is DRBD + Heartbeat, which is fast, simple and proven. Using a shared disk / SAN has disadvantages, such as a single point of failure, (usually) non-native fencing and (usually) much higher latency. DRBD handles a lot of things by itself that you would have to take care of yourself with a plain shared device.

Using a cluster file system such as GFS2, OCFS2 or Lustre is a waste of resources, as you can't have active/active with PostgreSQL anyway.

--
Regards,
Hannes Dorbath
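(For illustration: the DRBD + Heartbeat combination described above is usually driven by a DRBD resource definition plus a Heartbeat v1 haresources line. The sketch below is only an outline; the host names, devices, addresses, mount point and init script name are placeholders, not a tested configuration.)

    # /etc/drbd.conf (sketch) -- one DRBD resource mirrored between the two nodes
    resource r0 {
      protocol C;                    # synchronous: a write completes only once it is on both disks
      on node1 {
        device    /dev/drbd0;
        disk      /dev/sda5;         # underlying partition that holds $PGDATA
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on node2 {
        device    /dev/drbd0;
        disk      /dev/sda5;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

    # /etc/ha.d/haresources (sketch) -- Heartbeat v1 style: the active node promotes the
    # DRBD resource, mounts it, takes over the service IP and starts PostgreSQL
    node1 drbddisk::r0 Filesystem::/dev/drbd0::/var/lib/pgsql::ext3 IPaddr::192.168.0.10 postgresql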
Hi,

On Thu, 2007-08-16 at 09:42 +0300, Mikko Partio wrote:
> The idea is that the cluster software together with GFS (and HP iLO)
> would make sure that only one postmaster at a time can access the shared
> disk, and if the active node fails, the cluster software would shift the
> services to the previously passive node.

AFAIK, it is the fence device that will prevent postmaster access from the failed node. RHCS will just switch the servers.

> What I'm pondering is whether the cluster can keep the postmasters
> synchronized at all times so that the database won't get corrupted.

Keep all of $PGDATA on the shared disk. That would minimize data loss. (Of course, there is still a risk of data loss -- the postmasters are not aware of each other and they don't share each other's buffers, etc.)

Regards,
--
Devrim GÜNDÜZ
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, ODBCng - http://www.commandprompt.com/
Hi,

On Thu, 2007-08-16 at 10:05 +0300, Devrim GÜNDÜZ wrote:
> (Of course, there is still a risk of data loss -- the postmasters are
> not aware of each other and they don't share each other's buffers, etc.)

Err... I was talking about uncommitted transactions, and of course that does not mean data loss. (Thanks to Magnus for the reminder.)

Regards,
--
Devrim GÜNDÜZ
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, ODBCng - http://www.commandprompt.com/
Devrim GÜNDÜZ <devrim@CommandPrompt.com> writes:
>> What I'm pondering is whether the cluster can keep the postmasters
>> synchronized at all times so that the database won't get corrupted.
>
> Keep all of $PGDATA on the shared disk. That would minimize data loss.
> (Of course, there is still a risk of data loss -- the postmasters are
> not aware of each other and they don't share each other's buffers, etc.)

It would be much better to have the cluster software only run one postmaster at a time, starting up the secondary if the primary fails. That's the usual practice with shared storage.

-Doug
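(For illustration: the shared-storage practice Doug describes -- exactly one postmaster, started on whichever node currently owns the storage, with fencing keeping the other node away -- roughly corresponds to an RHCS cluster.conf along the following lines. Node names, iLO hostnames and credentials, the SAN device and the service IP are placeholders, not a tested configuration; note that $PGDATA sits on a plain ext3 volume mounted on one node at a time, so no GFS is involved.)

    <?xml version="1.0"?>
    <cluster name="pgcluster" config_version="1">
      <clusternodes>
        <clusternode name="node1" nodeid="1">
          <fence><method name="1"><device name="ilo-node1"/></method></fence>
        </clusternode>
        <clusternode name="node2" nodeid="2">
          <fence><method name="1"><device name="ilo-node2"/></method></fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice agent="fence_ilo" name="ilo-node1" hostname="ilo-node1.example.com"
                     login="admin" passwd="secret"/>
        <fencedevice agent="fence_ilo" name="ilo-node2" hostname="ilo-node2.example.com"
                     login="admin" passwd="secret"/>
      </fencedevices>
      <rm>
        <!-- rgmanager runs this service on exactly one node and relocates it on failure -->
        <service name="postgresql" autostart="1" recovery="relocate">
          <fs name="pgdata" device="/dev/mapper/san-pgdata" mountpoint="/var/lib/pgsql" fstype="ext3"/>
          <ip address="192.168.0.10"/>
          <script name="postgresql" file="/etc/init.d/postgresql"/>
        </service>
      </rm>
    </cluster>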
On 8/16/07, Douglas McNaught <doug@mcnaught.org> wrote:
> It would be much better to have the cluster software only run one
> postmaster at a time, starting up the secondary if the primary fails.
> That's the usual practice with shared storage.

This was my original intention. I'm still quite hesitant to trust the fencing device's ability to guarantee that only one postmaster at a time is running, because of the disastrous possibility of corrupting the whole database.

Maybe I'm just better off using the simpler (crude?) method of DRBD + Heartbeat?
Regards
MP
On 17.08.2007 11:12, Mikko Partio wrote:
> Maybe I'm just better off using the simpler (crude?) method of DRBD +
> Heartbeat?

Crude? Use what you like to use, but you should keep one thing in mind: if you don't know the software you are running in each and every detail, and how it behaves in each and every situation you can think of, it's a bad idea to use it in an HA setup.

You don't want to be one of those admins who configured something in a few days, moved production stuff onto it, and then failed to recover from a split-brain situation. Setting up an HA environment is something you do in months, not days, at least if you want to do it right. There is so much that can go wrong, and so much to learn. Keep it simple.

--
Regards,
Hannes Dorbath
On 8/17/07, Hannes Dorbath <light@theendofthetunnel.de> wrote:
> Setting up an HA environment is something you do in months, not days, at
> least if you want to do it right. There is so much that can go wrong, and
> so much to learn. Keep it simple.

Exactly my thoughts, as I have some experience with DRBD and I know it works. My point was that since I have access to a SAN environment, shared storage would be a more "elegant" solution, but as you pointed out, it's probably better to stick with the method that feels most comfortable.
Thanks for your thoughts.
Regards
MP
"Mikko Partio" <mpartio@gmail.com> writes: > This was my original intention. I'm still quite hesitant to trust the > fencing devices ability to quarantee that only one postmaster at a time is > running, because of the disastrous possibility of corrupting the whole > database. Making that guarantee is a fencing device's only excuse for existence. So I think you should trust that a properly-implemented fence will do what it's claimed to do. On the other side of the coin, I have little confidence in DRBD providing the storage semantics we need (in particular guaranteeing write ordering). So that path doesn't sound exactly risk-free either. regards, tom lane
On 17.08.2007 15:59, Tom Lane wrote:
> On the other side of the coin, I have little confidence in DRBD
> providing the storage semantics we need (in particular guaranteeing
> write ordering). So that path doesn't sound exactly risk-free either.

To my understanding, DRBD provides this. I think a discussion about that with the DRBD developers would be very useful for the many users searching for a solution to replicate PostgreSQL, so I'm cross-posting this to the DRBD list. Maybe you can make clear in detail what requirements PostgreSQL has.

--
Regards,
Hannes Dorbath
Hi,

> On the other side of the coin, I have little confidence in DRBD
> providing the storage semantics we need (in particular guaranteeing
> write ordering). So that path doesn't sound exactly risk-free either.

DRBD seems to enforce strict write ordering on both sides of the link, according to the docs. I didn't look at the code, but my plug-pulling tests on a busy PostgreSQL server didn't cause any problems. No conclusive evidence, but useful at least in my use case.

(And yes: I take pg_dumps often, just in case.)

- Sander
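(A minimal sketch of the kind of routine dump Sander mentions, assuming a crontab entry for the postgres user; the database name, schedule and backup path are placeholders.)

    # nightly custom-format dump of "mydb" at 02:00; % must be escaped in crontab entries
    0 2 * * *  pg_dump -Fc mydb > /var/backups/mydb-$(date +\%Y\%m\%d).dump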
On Fri, Aug 17, 2007 at 04:19:57PM +0200, Hannes Dorbath wrote:
> On 17.08.2007 15:59, Tom Lane wrote:
> > On the other side of the coin, I have little confidence in DRBD
> > providing the storage semantics we need (in particular guaranteeing
> > write ordering). So that path doesn't sound exactly risk-free either.
>
> To my understanding, DRBD provides this. I think a discussion about that
> with the DRBD developers would be very useful for the many users
> searching for a solution to replicate PostgreSQL, so I'm cross-posting
> this to the DRBD list. Maybe you can make clear in detail what
> requirements PostgreSQL has.

It does, AFAIK, if you configure it properly. I think it's the "protocol" parameter you need to set to C, which is the slowest, but it's the only one that waits for the block to hit *both* disks.

//Magnus
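(For reference, the "protocol" parameter Magnus mentions lives in the DRBD resource definition; a minimal fragment, assuming DRBD 8.x configuration syntax.)

    resource r0 {
      protocol C;   # C: a write is complete only once it has reached the disk on both nodes
                    # (A: local disk + local TCP send buffer; B: local disk + remote buffer cache)
      # per-host "on" sections, disk and net options as usual
    }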
On Fri, Aug 17, 2007 at 09:59:26AM -0400, Tom Lane wrote:
> "Mikko Partio" <mpartio@gmail.com> writes:
> > This was my original intention. I'm still quite hesitant to trust the
> > fencing device's ability to guarantee that only one postmaster at a
> > time is running, because of the disastrous possibility of corrupting
> > the whole database.
>
> Making that guarantee is a fencing device's only excuse for existence.
> So I think you should trust that a properly-implemented fence will do
> what it's claimed to do.
>
> On the other side of the coin, I have little confidence in DRBD
> providing the storage semantics we need (in particular guaranteeing
> write ordering). So that path doesn't sound exactly risk-free either.
>
> regards, tom lane

Of course we guarantee write ordering.

We (LINBIT, the company behind DRBD, paying the DRBD developers) operate quite a few Postgres clusters in production on clusters "powered by Heartbeat and DRBD". There are many more that we do not operate directly ourselves. Just because we happen to have a partnership with MySQL does not mean we don't like Postgres very much indeed :)

To get an idea of what DRBD does for you, please, if you are interested, read some of http://www.drbd.org/publications.html, maybe starting with the LinuxConf 2007 PDF.

cheers,

--
: Lars Ellenberg                           Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH     Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :