Thread: Slave promotion failure
Hi,
I have the following recovery.conf (Ubuntu 12.04):
standby_mode = on
restore_command = '/usr/local/omnipitr/bin/omnipitr-restore -D /var/lib/postgresql/9.1/main/ --source gzip=/var/backups/seevibes/wal/dbanalytics.production/ --remove-unneeded --temp-dir /var/tmp/omnipitr -l /var/log/omnipitr/restore.log --error-pgcontroldata hang --pgcontroldata-path /usr/lib/postgresql/9.1/bin/pg_controldata "%f" "%p"'
trigger_file = '/var/lib/postgresql/9.1/main/recovery.done'
archive_cleanup_command = '/usr/local/omnipitr/bin/omnipitr-cleanup --log /var/log/omnipitr/cleanup.log --archive gzip=/var/backups/seevibes/wal/dbanalytics.production "%r"'
I can't seem to promote the slave:
$ sudo -u postgres touch /var/lib/postgresql/9.1/main/recovery.done
# log is silent
$ sudo -u postgres /usr/lib/postgresql/9.1/bin/pg_ctl promote -D /var/lib/postgresql/9.1/main
server promoting
# log is silent
The PostgreSQL log around the time I attempted the promotions is:
2013-06-06 16:21:51.030 UTC - @ 26434 (00000) 2013-06-06 16:20:42 UTC - LOG: restored log file "0000000400001658000000CB" from archive
2013-06-06 16:22:35.324 UTC - @ 26411 (00000) 2013-06-06 16:20:41 UTC - LOG: received SIGHUP, reloading configuration files
2013-06-06 16:22:51.457 UTC - @ 26434 (00000) 2013-06-06 16:20:42 UTC - LOG: restored log file "0000000400001658000000CC" from archive
2013-06-06 16:24:51.034 UTC - @ 26434 (00000) 2013-06-06 16:20:42 UTC - LOG: restored log file "0000000400001658000000CD" from archive
The SIGHUP occurred because recovery.conf wasn't owned by user postgres (I use Puppet, which reloads the configuration on any change). A quick scan of the data directory reveals nothing out of the ordinary:
$ sudo ls -la /var/lib/postgresql/9.1/main/
total 108
drwx------ 13 postgres postgres 4096 Jun 6 16:27 .
drwxr-xr-x 3 700 postgres 17 May 31 15:46 ..
-rw------- 1 postgres postgres 184 May 22 12:10 backup_label.old
drwx------ 12 postgres postgres 130 Apr 18 17:33 base
drwx------ 2 postgres postgres 8192 Jun 6 16:32 global
drwx------ 2 postgres postgres 4096 May 31 20:57 pg_clog
drwx------ 4 postgres postgres 34 Jan 12 2012 pg_multixact
drwx------ 2 postgres postgres 17 Jun 6 16:20 pg_notify
drwx------ 2 postgres postgres 6 Jan 12 2012 pg_serial
drwx------ 2 postgres postgres 24 Jun 6 16:20 pg_stat_tmp
drwx------ 2 postgres postgres 17 Jun 6 08:32 pg_subtrans
drwx------ 2 postgres postgres 6 Jan 12 2012 pg_tblspc
drwx------ 2 postgres postgres 6 Jan 12 2012 pg_twophase
-rw------- 1 postgres postgres 4 Jan 12 2012 PG_VERSION
drwxr-xr-x 3 postgres postgres 45056 Jun 6 16:34 pg_xlog
-rw------- 1 postgres postgres 350 Jun 6 16:20 postmaster.opts
-rw------- 1 postgres postgres 93 Jun 6 16:20 postmaster.pid
-rw-r--r-- 1 postgres postgres 591 Jun 6 16:20 recovery.conf
lrwxrwxrwx 1 postgres postgres 36 Dec 2 2012 server.crt -> /etc/ssl/certs/ssl-cert-snakeoil.pem
lrwxrwxrwx 1 postgres postgres 38 Dec 2 2012 server.key -> /etc/ssl/private/ssl-cert-snakeoil.key
I also attempted to restart the slave, with and without recovery.done, to no avail. I must be missing something. Does anyone have an idea? I read http://www.postgresql.org/docs/9.1/static/warm-standby-failover.html very carefully, and I believe I did everything I was supposed to do.
Thanks,
François Beausoleil
On Fri, Jun 7, 2013 at 1:37 AM, François Beausoleil <francois@teksol.info> wrote:
> I can't seem to promote the slave:
> $ sudo -u postgres touch /var/lib/postgresql/9.1/main/recovery.done
> # log is silent
This has no effect. recovery.conf is renamed to recovery.done internally by the server. If recovery.done is present in the data folder alongside recovery.conf at the moment of promotion, recovery.done is removed before the rename. To use a trigger file for promotion, set trigger_file in recovery.conf; promotion is then triggered once that file has been created.
> $ sudo -u postgres /usr/lib/postgresql/9.1/bin/pg_ctl promote -D /var/lib/postgresql/9.1/main
> server promoting
> # log is silent
I am not a specialist in the Ubuntu-related internals, but this should be enough to promote the server.
Playing with recovery.done has no effect on the promotion. Perhaps there is some issue with the layer used for automatic settings?
Michael
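As a small illustration of the point about recovery.done, here is a minimal sketch of a sanity check on the trigger_file setting. The function names trigger_file_of and check_trigger_name are made up for this example, and the check assumes the value is written in single quotes on one line, as in the recovery.conf quoted in this thread.

```shell
#!/bin/sh
# Print the trigger_file value from a recovery.conf (prints nothing if unset).
# Assumes the value is single-quoted on a single line.
trigger_file_of() {
    sed -n "s/^[[:space:]]*trigger_file[[:space:]]*=[[:space:]]*'\([^']*\)'.*/\1/p" "$1"
}

# Return 0 if a trigger file is configured and is not named recovery.done,
# the name the server itself uses when it finishes recovery.
check_trigger_name() {
    trigger=$(trigger_file_of "$1")
    if [ -z "$trigger" ]; then
        echo "no trigger_file set"
        return 1
    fi
    if [ "$(basename "$trigger")" = "recovery.done" ]; then
        echo "trigger_file collides with recovery.done: $trigger"
        return 1
    fi
    echo "trigger_file looks fine: $trigger"
}
```

For example, check_trigger_name /var/lib/postgresql/9.1/main/recovery.conf (run as a user allowed to read the data directory) would flag the configuration from the first message.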
On 2013-06-06 at 18:40, Michael Paquier wrote:
> On Fri, Jun 7, 2013 at 1:37 AM, François Beausoleil <francois@teksol.info> wrote:
>> I can't seem to promote the slave:
>> $ sudo -u postgres touch /var/lib/postgresql/9.1/main/recovery.done
>> # log is silent
> This has no effect. recovery.conf is renamed to recovery.done internally by the server. If recovery.done is present in the data folder alongside recovery.conf at the moment of promotion, recovery.done is removed before the rename. To use a trigger file for promotion, set trigger_file in recovery.conf; promotion is then triggered once that file has been created.
I believe I know what my mistake is: I set trigger_file to /var/lib/postgresql/9.1/main/recovery.done, and PostgreSQL doesn't seem to like that name. I should set it to another name and retry.
It's still strange that pg_ctl promote didn't work, though. Maybe that was because recovery.done existed at the time I tried.
I'll try again today, with better names.
Thanks!
François
On 2013-06-07 at 07:00, François Beausoleil wrote:
> I believe I know what my mistake is: I set trigger_file to /var/lib/postgresql/9.1/main/recovery.done, and PostgreSQL doesn't seem to like that name. I should set it to another name and retry.
> It's still strange that pg_ctl promote didn't work, though. Maybe that was because recovery.done existed at the time I tried.
> I'll try again today, with better names.
Okay, here's my new recovery.conf:
standby_mode = on
restore_command = '/usr/local/omnipitr/bin/omnipitr-restore -D /var/lib/postgresql/9.1/main/ --source gzip=/var/backups/seevibes/wal/dbanalytics.production/ --remove-unneeded --temp-dir /var/tmp/omnipitr -l /var/log/omnipitr/restore.log --error-pgcontroldata hang --pgcontroldata-path /usr/lib/postgresql/9.1/bin/pg_controldata "%f" "%p"'
trigger_file = '/var/lib/postgresql/9.1/main/trigger-promotion'
archive_cleanup_command = '/usr/local/omnipitr/bin/omnipitr-cleanup --log /var/log/omnipitr/cleanup.log --archive gzip=/var/backups/seevibes/wal/dbanalytics.production "%r"'
Notice trigger_file has a better name. I touch the file using:
sudo -u postgres touch /var/lib/postgresql/9.1/main/trigger-promotion
and nothing happens: no messages appear in the log, PostgreSQL continues to apply WAL records.
I've just retried pg_ctl promote, and that didn't do anything either. I'm really at a loss to explain what is happening.
Bye,
François
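One way to tell whether the server has noticed the trigger file at all is to look at whether the file is still there some time after touching it. The sketch below assumes that the 9.1 server removes the trigger file once it acts on it and logs a "trigger file found" message; the function name trigger_state is made up for this example.

```shell
#!/bin/sh
# Report whether the trigger file is still waiting to be picked up.
# $1: the path configured as trigger_file in recovery.conf
trigger_state() {
    if [ -e "$1" ]; then
        # File still exists: the server has not acted on it (yet).
        echo "pending"
    else
        # Either the file was never created, or the server already removed it.
        echo "absent"
    fi
}
```

Run as the postgres user (the data directory is mode 700), trigger_state /var/lib/postgresql/9.1/main/trigger-promotion printing "pending" long after the touch would match the silent log. Running SELECT pg_is_in_recovery(); through psql on the standby is a complementary check: it keeps returning true for as long as the server has not been promoted.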
On 2013-06-07 at 12:00, François Beausoleil wrote:
> Notice trigger_file has a better name. I touch the file using:
> sudo -u postgres touch /var/lib/postgresql/9.1/main/trigger-promotion
> and nothing happens: no messages appear in the log, PostgreSQL continues to apply WAL records.
> I've just retried pg_ctl promote, and that too didn't do anything. I'm really at a loss to explain what happens.
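I answered my own question. I use OmniPITR, and I forgot to pass the --finish-recovery flag, pointing to the trigger file, to omnipitr-restore. As far as I can tell, without that flag omnipitr-restore just keeps waiting for the next WAL segment, so the promotion request never took effect.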
Bye,
François
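For reference, here is a sketch of what the corrected recovery.conf might look like, written as a small shell script that prints it. The paths are copied from the messages above; the position of --finish-recovery within the command and the choice to keep trigger_file pointing at the same path are assumptions, not something the thread confirms. The function name render_recovery_conf is made up for this example.

```shell
#!/bin/sh
# Sketch: print a recovery.conf in which omnipitr-restore is told about the
# trigger file through --finish-recovery. Paths are taken from the thread.
PGDATA=/var/lib/postgresql/9.1/main
TRIGGER="$PGDATA/trigger-promotion"
ARCHIVE=/var/backups/seevibes/wal/dbanalytics.production

render_recovery_conf() {
    # Unquoted heredoc: shell variables are expanded, %f, %p and %r are left
    # alone for PostgreSQL and OmniPITR to substitute.
    cat <<EOF
standby_mode = on
restore_command = '/usr/local/omnipitr/bin/omnipitr-restore -D $PGDATA/ --source gzip=$ARCHIVE/ --remove-unneeded --temp-dir /var/tmp/omnipitr -l /var/log/omnipitr/restore.log --finish-recovery $TRIGGER --error-pgcontroldata hang --pgcontroldata-path /usr/lib/postgresql/9.1/bin/pg_controldata "%f" "%p"'
trigger_file = '$TRIGGER'
archive_cleanup_command = '/usr/local/omnipitr/bin/omnipitr-cleanup --log /var/log/omnipitr/cleanup.log --archive gzip=$ARCHIVE "%r"'
EOF
}

render_recovery_conf
```

With a file like this in place, touching the trigger file as the postgres user should let omnipitr-restore stop waiting for further WAL segments so that recovery can finish.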