Thread: Warm standby failover mechanism
Hi all,
We're looking at setting up a warm-standby server using log shipping and aren't too sure about how we should trigger failover. Is there a commonly-used approach which is reliable enough to recommend? Looking at the documentation, there doesn't seem to be any recommendation. I preferrably don't want to use a witness server.
Also, what would you say is the best way to tell the failed primary server that it is no longer the primary server?
Thanks
Thom
We're looking at setting up a warm-standby server using log shipping and aren't too sure about how we should trigger failover. Is there a commonly-used approach which is reliable enough to recommend? Looking at the documentation, there doesn't seem to be any recommendation. I preferrably don't want to use a witness server.
Also, what would you say is the best way to tell the failed primary server that it is no longer the primary server?
Thanks
Thom
On Tue, 2009-02-24 at 16:55 +0000, Thom Brown wrote: > We're looking at setting up a warm-standby server using log shipping > and aren't too sure about how we should trigger failover. Is there a > commonly-used approach which is reliable enough to recommend? Looking > at the documentation, there doesn't seem to be any recommendation. I > preferrably don't want to use a witness server. > > Also, what would you say is the best way to tell the failed primary > server that it is no longer the primary server? http://www.postgresql.org/docs/8.3/static/pgstandby.html -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Thanks for the link Simon, but this doesn't recommend any method for triggering failover, or telling the primary that another server is now primary.
We've set up a primary server in archive mode to continuously archive to an NFS mount, and the standby server to continuously recovery from that directory (although I'm not sure that's actually working... I've probably overlooked something). The problem we face is working out how to tell the standby server that it is the primary. Yes this can be done with a trigger file in /tmp, but how would that automatically appear there? And when the failed server actually restarted, or the Postgres service restarts, how do we tell it that it is no longer the primary?
Thanks
Thom
We've set up a primary server in archive mode to continuously archive to an NFS mount, and the standby server to continuously recovery from that directory (although I'm not sure that's actually working... I've probably overlooked something). The problem we face is working out how to tell the standby server that it is the primary. Yes this can be done with a trigger file in /tmp, but how would that automatically appear there? And when the failed server actually restarted, or the Postgres service restarts, how do we tell it that it is no longer the primary?
Thanks
Thom
2009/2/24 Simon Riggs <simon@2ndquadrant.com>
http://www.postgresql.org/docs/8.3/static/pgstandby.html
On Tue, 2009-02-24 at 16:55 +0000, Thom Brown wrote:
> We're looking at setting up a warm-standby server using log shipping
> and aren't too sure about how we should trigger failover. Is there a
> commonly-used approach which is reliable enough to recommend? Looking
> at the documentation, there doesn't seem to be any recommendation. I
> preferrably don't want to use a witness server.
>
> Also, what would you say is the best way to tell the failed primary
> server that it is no longer the primary server?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, Feb 25, 2009 at 09:26:26AM +0000, Thom Brown wrote: > We've set up a primary server in archive mode to continuously archive to an > NFS mount, and the standby server to continuously recovery from that > directory (although I'm not sure that's actually working... I've probably > overlooked something). The problem we face is working out how to tell the > standby server that it is the primary. Yes this can be done with a trigger > file in /tmp, but how would that automatically appear there? Think backwards: How would it automatically *dis*appear from there. Have the secondary create it and check for existance at regular intervals. Have the primary delete it at slightly shorter intervals. When the secondary finds it again after the check interval the primary didn't delete it. Wait some more for good measure and/or try to ping the primary some other way... Just my 2 cents, Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
2009/2/25 Karsten Hilbert <Karsten.Hilbert@gmx.net>
Think backwards: How would it automatically *dis*appear from
there. Have the secondary create it and check for existance
at regular intervals. Have the primary delete it at slightly
shorter intervals.
When the secondary finds it again after the check interval
the primary didn't delete it. Wait some more for good
measure and/or try to ping the primary some other way...
Thanks for the suggestion. I'm wondering what the chances of such a create/delete system would have of accidentally causing failover? Is it possible to use the archive command parameter to firstly run the pg_standby command, and follow it by a touch /tmp/pg.triggerfile by using &&?
I'm still trying to work out how this file creation/deletion thing will work. If I can tag a "&& touch /tmp/pg.trigger" command to the end of the recovery command, how often will that be called? If I can't, I still need to ensure that it is created and deleted before the recovery command is called, otherwise it will see it before it is deleted and put itself online.
Has anyone got a practical example of what they've set up, or know what others have set up?
Has anyone got a practical example of what they've set up, or know what others have set up?
On Wed, 2009-02-25 at 13:54 +0000, Thom Brown wrote: > I'm still trying to work out how this file creation/deletion thing > will work. If I can tag a "&& touch /tmp/pg.trigger" command to the > end of the recovery command, how often will that be called? If I > can't, I still need to ensure that it is created and deleted before > the recovery command is called, otherwise it will see it before it is > deleted and put itself online. > > Has anyone got a practical example of what they've set up, or know > what others have set up? You are doing this the hard way. Grab PITRTTools. https://projects.commandprompt.com/public/pitrtools Just pull down the stable branch: svn co https://projects.commandprompt.com/public/pitrtools/repo/branches/1.2 Sincerely, Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
You are doing this the hard way. Grab PITRTTools.
https://projects.commandprompt.com/public/pitrtools
Thanks!
Thom
On Wed, 2009-02-25 at 16:33 +0000, Thom Brown wrote: > You are doing this the hard way. Grab PITRTTools. > > https://projects.commandprompt.com/public/pitrtools > > > I can't really dispute a recommendation from JD. I'll have to look > into that. It's a shame because we've spent ages trying to work out > where we've been going wrong in this whole process (we still haven't > got it picking up WALs from the archive). As a note, all PITRTools does is wrap around all the tools that you are trying to make work. So you will still need pg_standby, rsync, ssh etc... I have updated the wiki to make it a bit more friendly. https://projects.commandprompt.com/public/pitrtools/wiki Sincerely, Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
As a note, all PITRTools does is wrap around all the tools that you are
trying to make work. So you will still need pg_standby, rsync, ssh
etc...
I have updated the wiki to make it a bit more friendly.
https://projects.commandprompt.com/public/pitrtools/wiki
NOTICE: check_config_func()
NOTICE: Performing standard archive
NOTICE: archive_func()
NOTICE: send_queue_func()
NOTICE: list_queue_func()
NOTICE: generate_slave_list_func()
NOTICE: Your slaves are: ['192.168.1.17']
Traceback (most recent call last):
File "../pitr_tools/cmd_archiver", line 343, in <module>
archive_func()
File "../pitr_tools/cmd_archiver", line 254, in archive_func
queue = send_queue_func()
File "../pitr_tools/cmd_archiver", line 219, in send_queue_func
for host in list_queue_func():
File "../pitr_tools/cmd_archiver", line 202, in list_queue_func
list_archives = os.listdir(queuedir)
OSError: [Errno 2] No such file or directory: '/var/lib/postgresql/archive/192.168.1.17'
The only time that IP address appears in the INI file is on the slaves line, so I'm not sure why it's trying to find a directory called that.
On Wed, 2009-02-25 at 21:39 +0000, Thom Brown wrote: > > As a note, all PITRTools does is wrap around all the tools > that you are > trying to make work. So you will still need pg_standby, rsync, > ssh > etc... > > I have updated the wiki to make it a bit more friendly. > > https://projects.commandprompt.com/public/pitrtools/wiki > > > > I've given PITRTools a try, but I have been unsuccessful in utilising > it. I've configured both ini files, everything owned by user > postgres, and upon running "./cmd_archiver -C cmd_archiver.ini" I get: > > NOTICE: check_config_func() > NOTICE: Performing standard archive > NOTICE: archive_func() > NOTICE: send_queue_func() > NOTICE: list_queue_func() > NOTICE: generate_slave_list_func() > NOTICE: Your slaves are: ['192.168.1.17'] > Traceback (most recent call last): > File "../pitr_tools/cmd_archiver", line 343, in <module> > archive_func() > File "../pitr_tools/cmd_archiver", line 254, in archive_func > queue = send_queue_func() > File "../pitr_tools/cmd_archiver", line 219, in send_queue_func > for host in list_queue_func(): > File "../pitr_tools/cmd_archiver", line 202, in list_queue_func > list_archives = os.listdir(queuedir) > OSError: [Errno 2] No such file or directory: > '/var/lib/postgresql/archive/192.168.1.17' > > The only time that IP address appears in the INI file is on the slaves > line, so I'm not sure why it's trying to find a directory called that. Looks like you didn't run cmd_archiver -C <config_file> -I Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Looks like you didn't run cmd_archiver -C <config_file> -I
Ahh, okay, that did something, which I think means it created a directory named after the slave IP in the archive directory. I didn't see any mention of that switch in the README file, unless it's mentioned elsewhere and I missed it.
Now when I run "./cmd_archiver -C cmd_archiver.ini" I get:
rsync: link_stat "/var/lib/postgresql/pitr_tools/None" failed: No such file or directory (2)
rsync error: some files could not be transferred (code 23) at main.c(1058) [sender=3.0.3]
rsync: link_stat "/var/lib/postgresql/pitr_tools/None" failed: No such file or directory (2)
rsync error: some files could not be transferred (code 23) at main.c(1058) [sender=3.0.3]
FATAL: Unabled to rsync_transfer or queue_transfer
CRITICAL 5888
On Wed, 2009-02-25 at 22:06 +0000, Thom Brown wrote: > > Looks like you didn't run cmd_archiver -C <config_file> -I > > > > Ahh, okay, that did something, which I think means it created a > directory named after the slave IP in the archive directory. Right that is the queue directory. > Now when I run "./cmd_archiver -C cmd_archiver.ini" I get: > That command really shouldn't do anything but error. The whole point of the archiver is to be placed in the archive_command option in the postgresql.conf. E.g; cmd_archive -C cmd_archive.ini -F %p This really should be happening on the pitrtools list. Let's bounce over there and resolve this. Joshua D. Drake > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997