Thread: postgres in HA constellation
Hi all,

I want to use postgres, combined with postgis, in a highly available configuration to store spatial data. I will have tens of millions of datasets in the database, or even more, so the database will be fairly large and will consume a lot of disk space. Therefore I thought of using shared storage.

The HA postgres HOWTO lists the following possibilities:
- backup/restore with pg_dump/pg_restore (with a test database on a fairly fast computer the process took about one or two days, which is way too long)
- rsync, but I doubt that this will work in my case
- rserv, a replication script

The most promising option, in an active/passive HA configuration, seems to be the rserv script. Unfortunately there was not much information about it in the HOWTO.

Furthermore I found LifeKeeper, a commercial monitor for use with PostgreSQL in an HA environment:
http://www.openminds.co.uk/high_availability_solutions/databases/postgresql.htm

Something I thought might work: is there something that will repair an inconsistent postgresql data store? E.g. the master database dies, the slave mounts the storage, repairs it in a reasonable time, and then starts to work. I assume having postgres running in a shared storage environment, with a

Are there any other possibilities that might work that I am not aware of? Does anybody have experience with postgres in an HA environment with shared storage?

Any hint is greatly appreciated.

kind regards
Sebastian
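As a point of reference, the dump/restore path mentioned above looks roughly like the sketch below (the database name and file paths are made up for illustration); even with the custom format, restoring tens of millions of rows plus PostGIS indexes tends to stay in the hours-to-days range:

    # Minimal sketch of the pg_dump/pg_restore option. The custom (-Fc)
    # format keeps the dump compressed and lets pg_restore recreate
    # schema, data and indexes on its own.
    pg_dump -Fc -f /backup/spatialdb.dump spatialdb

    # On the machine doing the restore (hypothetical database name):
    createdb spatialdb
    pg_restore -d spatialdb /backup/spatialdb.dump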
On Wed, Oct 04, 2006 at 11:23:45 -0000,
  Sebastian Reitenbach <itlistuser@rapideye.de> wrote:
>
> Something I thought might work: is there something that will repair an
> inconsistent postgresql data store? E.g. the master database dies, the
> slave mounts the storage, repairs it in a reasonable time, and then
> starts to work.

You can just start the slave server and it will recover using the WAL files, and should do so in a short time. However, you MUST be certain that the master is no longer running when the slave starts, or your database WILL BE TRASHED.

> Are there any other possibilities that might work that I am not aware of?
> Does anybody have experience with postgres in an HA environment with
> shared storage?

You can also use servers with redundant hardware (hotplug CPUs and the like) to make failure less likely. If you get the hardware failure rates down to that of your storage system, that might be good enough for your purposes.

If you have multiple data centers to protect against disaster, then you might look at Slony, which you can use to replicate to a slave system. However, I think it is possible for the master to report a transaction as committed before it is shipped off to the slave, so if the master fails you might lose some transactions when switching to the slave. But double-check the Slony documentation on this.
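To make that advice concrete, a failover script along those lines might look like the following sketch; the fencing helper, device and paths are assumptions, not part of any existing setup:

    #!/bin/sh
    # Hypothetical failover sketch: only mount the shared storage and start
    # PostgreSQL once the old master is guaranteed to be dead.
    set -e

    # 1. Fence the old master first (power it off, cut its SAN access, ...).
    #    This helper is a placeholder for whatever mechanism the site uses.
    /usr/local/sbin/fence-old-master || exit 1

    # 2. Mount the shared storage read-write on the surviving node.
    mount -o rw /dev/sdb1 /var/lib/pgsql

    # 3. Start the postmaster; normal WAL crash recovery runs automatically
    #    and brings the data directory back to a consistent state.
    pg_ctl -D /var/lib/pgsql/data -l /var/lib/pgsql/startup.log start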
On Wed, 2006-10-04 at 10:43 -0500, Bruno Wolff III wrote:
> On Wed, Oct 04, 2006 at 11:23:45 -0000,
>   Sebastian Reitenbach <itlistuser@rapideye.de> wrote:
>
> If you have multiple data centers to protect against disaster, then you
> might look at Slony, which you can use to replicate to a slave system.
> However, I think it is possible for the master to report a transaction as
> committed before it is shipped off to the slave, so if the master fails
> you might lose some transactions when switching to the slave. But
> double-check the Slony documentation on this.

This is indeed accurate, and highly likely to happen if you need to do a failover. It all depends on the level of activity.

Brad.
Hi,

Bruno Wolff III <bruno@wolff.to> wrote:
> You can just start the slave server and it will recover using the WAL
> files, and should do so in a short time. However, you MUST be certain
> that the master is no longer running when the slave starts, or your
> database WILL BE TRASHED.

That sounds great, that I can just start the slave and it will repair whatever broke. I am aware that I have to make sure the master is really dead.

> You can also use servers with redundant hardware (hotplug CPUs and the
> like) to make failure less likely. If you get the hardware failure rates
> down to that of your storage system, that might be good enough for your
> purposes.
>
> If you have multiple data centers to protect against disaster, then you
> might look at Slony, which you can use to replicate to a slave system.
> However, I think it is possible for the master to report a transaction as
> committed before it is shipped off to the slave, so if the master fails
> you might lose some transactions when switching to the slave. But
> double-check the Slony documentation on this.

I just have one data center, so no replication to a remote site is needed.

Thank you all for your answers.

kind regards
Sebastian
On Thu, Oct 05, 2006 at 04:24:17AM -0000, Sebastian Reitenbach wrote:
>
> I just have one data center, so no replication to a remote site is needed.

If it is at all feasible with your budget, I'd think _very strongly_ about replicating using Slony inside your data centre _too_. The shared storage answer is nice, but it is _really really really_ easy to shoot yourself in the foot with a rocket propelled grenade with that arrangement. Very careful administration might prevent it, but there is a reason that none of the corporate people will guarantee two machines will never accidentally mount the same file system at once: in a shared-disc-only system, it's impossible to be 100% certain that the other machine really is dead and not coming back. Very tricky scripts could of course lower the risk.

If you're really going to have all that data, it's going to be a major pain to restore in the event of such corruption. In addition, your recovery will only be to the last dump. That's why I suggest replicating, either with Slony or something else, as a belt that will nicely complement the suspenders of your shared-disc failover.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly
what you told them to. That actually seems sort of quaint now.
    --J.D. Baldwin
On Thu, 2006-10-05 at 04:24 +0000, Sebastian Reitenbach wrote:
> Bruno Wolff III <bruno@wolff.to> wrote:
> > You can just start the slave server and it will recover using the WAL
> > files, and should do so in a short time. However, you MUST be certain
> > that the master is no longer running when the slave starts, or your
> > database WILL BE TRASHED.
>
> That sounds great, that I can just start the slave and it will repair
> whatever broke. I am aware that I have to make sure the master is really
> dead.

Be absolutely sure that you test, test, test this. And then test it some more. There are all sorts of nasty surprises that can happen in an HA config that can leave you with a corrupt database. As mentioned, two postmasters writing to the same data directory is certainly one. But you have to watch for problems in the HA config as well. What happens if the load on the master shoots up to the point where it can't release the disk resources in time? Will the other box try to take over and scribble on your file system? What happens when you pull the plug on the master while under load (either physically, or virtually with a halt -q)? How does the HA solution play with your file system of choice?

Trust me, you want to know the answers to all of these questions and more before you even consider turning an HA solution on in your production environment. Don't just rely on the marketing material and what the sales drones tell you. If not, make sure you have good backups.

Brad.
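One way to run the kind of pull-the-plug drill described above, with the host name, virtual IP and test table invented purely for illustration:

    #!/bin/sh
    # Hypothetical failover drill: kill the master the hard way while a
    # write load is running, then check that the standby has taken over.
    ssh master 'halt -q' &                    # simulate sudden death under load
    sleep 60                                  # give the cluster time to fail over
    psql -h 192.168.1.10 -d spatialdb \
         -c 'SELECT count(*) FROM spatial_data;'   # hypothetical table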
On Oct 5, 2006, at 1:41 PM, Andrew Sullivan wrote:
> On Thu, Oct 05, 2006 at 04:24:17AM -0000, Sebastian Reitenbach wrote:
>>
>> I just have one data center, so no replication to a remote site is needed.
>
> If it is at all feasible with your budget, I'd think _very strongly_
> about replicating using Slony inside your data centre _too_. The
> shared storage answer is nice, but it is _really really really_ easy
> to shoot yourself in the foot with a rocket propelled grenade with
> that arrangement. Very careful administration might prevent it, but
> there is a reason that none of the corporate people will guarantee
> two machines will never accidentally mount the same file system at
> once: in a shared-disc-only system, it's impossible to be 100%
> certain that the other machine really is dead and not coming back.
> Very tricky scripts could of course lower the risk.

Isn't it entirely possible that if the master gets trashed it would start sending garbage to the Slony slave as well?

I think PITR would be a much better option to protect against this, since you could probably recover up to the exact point of failover.

When it comes to the actual failover, take a look at the HA-Linux project. They've got some stuff you could probably use (such as the heartbeat program). Another really good idea is to give the backup machine the ability to kill the power to the primary machine, and not have either machine mount the shared storage at bootup.

If you're interested in paying someone to help set this up, I know that we (EnterpriseDB) have folks that have done this before. I suspect that some of the other folks listed on the commercial support page have done this as well (likely Command Prompt and Varlena).

--
Jim Nasby                                    jimn@enterprisedb.com
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
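A rough outline of the PITR idea, using the 8.x-era archive_command/recovery.conf mechanism; the archive location and data directory are assumptions:

    # On the primary, in postgresql.conf: ship every completed WAL segment
    # to storage the standby can reach.
    #   archive_command = 'cp %p /mnt/wal_archive/%f'

    # On the standby: restore the last base backup into the data directory,
    # then tell the server where to find archived WAL before starting it.
    cat > /var/lib/pgsql/data/recovery.conf <<'EOF'
    restore_command = 'cp /mnt/wal_archive/%f "%p"'
    EOF
    pg_ctl -D /var/lib/pgsql/data start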
Hi all,

> I think PITR would be a much better option to protect against this,
> since you could probably recover up to the exact point of failover.
>
> When it comes to the actual failover, take a look at the HA-Linux
> project. They've got some stuff you could probably use (such as the
> heartbeat program). Another really good idea is to give the backup
> machine the ability to kill the power to the primary machine, and not
> have either machine mount the shared storage at bootup.

As I am using carp on OpenBSD to set up HA clusters, I am very comfortable with ucarp, the userland implementation for Linux. At boot the database files are mounted read-only and the database is not running; then ucarp is started, and only if this machine becomes master does it remount the postgres data read-write and start the database.

I also thought about killing the power of the "lost" master after a takeover, to make sure the machine will not come back unconditionally later.

kind regards
Sebastian
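A sketch of how that ucarp wiring might look on the Linux side; addresses, interface and script paths are invented for illustration:

    #!/bin/sh
    # Hypothetical ucarp setup: the up-script remounts the shared storage
    # read-write and starts PostgreSQL only when this node becomes master.
    ucarp --interface=eth0 --srcip=192.168.1.11 --vhid=1 --pass=secret \
          --addr=192.168.1.10 \
          --upscript=/usr/local/sbin/vip-up.sh \
          --downscript=/usr/local/sbin/vip-down.sh &

    # /usr/local/sbin/vip-up.sh (sketch):
    #   ip addr add 192.168.1.10/24 dev eth0
    #   mount -o remount,rw /var/lib/pgsql
    #   pg_ctl -D /var/lib/pgsql/data start
    #
    # /usr/local/sbin/vip-down.sh (sketch):
    #   pg_ctl -D /var/lib/pgsql/data -m fast stop
    #   mount -o remount,ro /var/lib/pgsql
    #   ip addr del 192.168.1.10/24 dev eth0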
On Fri, Oct 06, 2006 at 06:34:25AM -0000, Sebastian Reitenbach wrote:
> > I think PITR would be a much better option to protect against this,
> > since you could probably recover up to the exact point of failover.
> >
> > When it comes to the actual failover, take a look at the HA-Linux
> > project. They've got some stuff you could probably use (such as the
> > heartbeat program). Another really good idea is to give the backup
> > machine the ability to kill the power to the primary machine, and not
> > have either machine mount the shared storage at bootup.
>
> As I am using carp on OpenBSD to set up HA clusters, I am very comfortable
> with ucarp, the userland implementation for Linux. At boot the database
> files are mounted read-only and the database is not running; then ucarp is
> started, and only if this machine becomes master does it remount the
> postgres data read-write and start the database.
>
> I also thought about killing the power of the "lost" master after a
> takeover, to make sure the machine will not come back unconditionally
> later.

Heh, I'd assumed you were on Linux, but of course there's no reason you couldn't set up HA on OpenBSD. The key is just to make sure that you never bring up two servers on the same data directory.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
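A trivial guard in that direction, with the data directory path as an assumption (it deliberately errs on the side of refusing to start):

    #!/bin/sh
    # Hypothetical safety guard: refuse to start the postmaster if a pid
    # file is already on the shared storage -- it may belong to a
    # postmaster that is still alive on the other node.
    PGDATA=/var/lib/pgsql/data

    if [ -f "$PGDATA/postmaster.pid" ]; then
        echo "postmaster.pid found on shared storage; verify the other node" >&2
        echo "is fenced and remove the file before starting." >&2
        exit 1
    fi
    pg_ctl -D "$PGDATA" start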
On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
> Isn't it entirely possible that if the master gets trashed it would
> start sending garbage to the Slony slave as well?

Well, maybe, but unlikely. What happens in a shared-disc failover is that the second machine re-mounts the same partition as the old machine had open. The risk is the case where your to-be-removed machine hasn't actually stopped writing on the partition yet, but your failover software thinks it's dead, and can fail over. Two processes have the same Postgres data and WAL files mounted at the same time, and blammo. As nearly as I can tell, it takes approximately zero time for this arrangement to make such a mess that you're not committing any transactions. Slony will only get the data on COMMIT, so the risk is very small.

> I think PITR would be a much better option to protect against this,
> since you could probably recover up to the exact point of failover.

That oughta work too, except that your remounted WAL gets corrupted under the imagined scenario, and then you copy the next updates to the WAL. So you have to save all the incremental copies of the WAL you make, so that you don't have a garbage file to read.

As I said, I don't think that it's a bad idea to use this sort of trick. I just think it's a poor single line of defence, because when it fails, it fails hard.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.
    --Brad Holland
On Tue, Oct 10, 2006 at 10:11:08AM -0500, Jim C. Nasby wrote:
> couldn't set up HA on OpenBSD. The key is just to make sure that you
> never bring up two servers on the same data directory.

I think this highlights exactly what I'm trying to emphasise: in actual, shared-nothing systems like this, there's no possible guarantee of "never". There are possible guarantees of "very rarely". The problem is, you're already trying to address a teeny portion of the likely events on your machines. So you have to assume that more than one thing might break at the same time, and have a recovery plan for it. I submit that a recovery plan of "restore from pg_dump" is usually not going to be enough if it was worth the cost and hassle of setting up shared disk failover. YMMV, of course.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the
marketplace.  --Philip Greenspun
Hi,

> I think this highlights exactly what I'm trying to emphasise: in
> actual, shared-nothing systems like this, there's no possible
> guarantee of "never". There are possible guarantees of "very
> rarely". The problem is, you're already trying to address a teeny
> portion of the likely events on your machines. So you have to assume
> that more than one thing might break at the same time, and have a
> recovery plan for it. I submit that a recovery plan of "restore from
> pg_dump" is usually not going to be enough if it was worth the cost
> and hassle of setting up shared disk failover. YMMV, of course.

Yes, I did a restore from a pg_dump dump; that took over a day. Maybe I can speed it up a bit, but it is going to take too long either way.

Sebastian
On Wed, Oct 11, 2006 at 10:28:44AM -0400, Andrew Sullivan wrote:
> On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
> > Isn't it entirely possible that if the master gets trashed it would
> > start sending garbage to the Slony slave as well?
>
> Well, maybe, but unlikely. What happens in a shared-disc failover is
> that the second machine re-mounts the same partition as the old
> machine had open. The risk is the case where your to-be-removed
> machine hasn't actually stopped writing on the partition yet, but
> your failover software thinks it's dead, and can fail over. Two
> processes have the same Postgres data and WAL files mounted at the
> same time, and blammo. As nearly as I can tell, it takes
> approximately zero time for this arrangement to make such a mess that
> you're not committing any transactions. Slony will only get the data
> on COMMIT, so the risk is very small.

Hrm... I guess it depends on how quickly the Slony master would stop processing if it was talking to a shared disk that had become corrupt from another postmaster.

> > I think PITR would be a much better option to protect against this,
> > since you could probably recover up to the exact point of failover.
>
> That oughta work too, except that your remounted WAL gets corrupted
> under the imagined scenario, and then you copy the next updates to
> the WAL. So you have to save all the incremental copies of the WAL
> you make, so that you don't have a garbage file to read.
>
> As I said, I don't think that it's a bad idea to use this sort of
> trick. I just think it's a poor single line of defence, because when
> it fails, it fails hard.

Yeah, STONITH is *critical* for shared-disk.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
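For completeness, a STONITH hook can be as small as a remote power-off against the old master's management card; the address and credentials below are assumptions:

    #!/bin/sh
    # Hypothetical STONITH hook: power the old master off at its management
    # interface before the standby is allowed to touch the shared storage.
    ipmitool -I lanplus -H 192.168.1.21 -U admin -P secret \
             chassis power off || exit 1   # abort the failover if fencing fails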
On Wed, 2006-10-11 at 16:12 -0500, Jim C. Nasby wrote:
> On Wed, Oct 11, 2006 at 10:28:44AM -0400, Andrew Sullivan wrote:
> > On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
> > > Isn't it entirely possible that if the master gets trashed it would
> > > start sending garbage to the Slony slave as well?
> >
> > Well, maybe, but unlikely. What happens in a shared-disc failover is
> > that the second machine re-mounts the same partition as the old
> > machine had open. The risk is the case where your to-be-removed
> > machine hasn't actually stopped writing on the partition yet, but
> > your failover software thinks it's dead, and can fail over. Two
> > processes have the same Postgres data and WAL files mounted at the
> > same time, and blammo. As nearly as I can tell, it takes
> > approximately zero time for this arrangement to make such a mess that
> > you're not committing any transactions. Slony will only get the data
> > on COMMIT, so the risk is very small.
>
> Hrm... I guess it depends on how quickly the Slony master would stop
> processing if it was talking to a shared disk that had become corrupt
> from another postmaster.

That doesn't depend on Slony, it depends on Postgres. If transactions are committing on the master, Slony will replicate them. You could have a situation where your HA failover trashes some of your database, but the database still starts up. It starts accepting and replicating transactions before the corruption is discovered.

Brad.
bnichols@ca.afilias.info (Brad Nicholson) writes:
> On Wed, 2006-10-11 at 16:12 -0500, Jim C. Nasby wrote:
>> Hrm... I guess it depends on how quickly the Slony master would stop
>> processing if it was talking to a shared disk that had become corrupt
>> from another postmaster.
>
> That doesn't depend on Slony, it depends on Postgres. If transactions
> are committing on the master, Slony will replicate them. You could have
> a situation where your HA failover trashes some of your database, but
> the database still starts up. It starts accepting and replicating
> transactions before the corruption is discovered.

There's a bit of "joint responsibility" there. Let's suppose that the disk has gone bad, zeroing out some index pages for the Slony-I table sl_log_1. (The situation will be the same for just about any kind of corruption of a Slony-I internal table.)

There are two possibilities:

1. The PostgreSQL instance may notice that those pages are bad, returning an error message and halting the SYNC.

2. The PostgreSQL instance may NOT notice that those pages are bad and, as a result, fail to apply some updates, thereby corrupting the subscriber.

I think there's a pretty high probability of 1) happening rather than 2), but there is a risk of corruption of subscribers roughly proportional to the probability of 2). My "gut feel" is that the probability of 2) is pretty small, but I don't have anything to point to as a proof of that...

--
output = reverse("gro.mca" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/
"One of the main causes of the fall of the Roman Empire was that, lacking
zero, they had no way to indicate successful termination of their C
programs."  -- Robert Firth