Thread: Slave promotion failure

From: François Beausoleil

Hi,

I have the following recovery.conf (Ubuntu 12.04):

standby_mode = on
restore_command = '/usr/local/omnipitr/bin/omnipitr-restore -D /var/lib/postgresql/9.1/main/ --source gzip=/var/backups/seevibes/wal/dbanalytics.production/ --remove-unneeded --temp-dir /var/tmp/omnipitr -l /var/log/omnipitr/restore.log --error-pgcontroldata hang --pgcontroldata-path /usr/lib/postgresql/9.1/bin/pg_controldata "%f" "%p"'
trigger_file = '/var/lib/postgresql/9.1/main/recovery.done'
archive_cleanup_command = '/usr/local/omnipitr/bin/omnipitr-cleanup --log /var/log/omnipitr/cleanup.log --archive gzip=/var/backups/seevibes/wal/dbanalytics.production "%r"'

I can't seem to promote the slave:

$ sudo -u postgres touch /var/lib/postgresql/9.1/main/recovery.done
# log is silent

$ sudo -u postgres /usr/lib/postgresql/9.1/bin/pg_ctl promote -D /var/lib/postgresql/9.1/main
server promoting
# log is silent

The PostgreSQL log around the time I attempted the promotions is:

2013-06-06 16:21:51.030 UTC - @ 26434 (00000) 2013-06-06 16:20:42 UTC - LOG:  restored log file "0000000400001658000000CB" from archive
2013-06-06 16:22:35.324 UTC - @ 26411 (00000) 2013-06-06 16:20:41 UTC - LOG:  received SIGHUP, reloading configuration files
2013-06-06 16:22:51.457 UTC - @ 26434 (00000) 2013-06-06 16:20:42 UTC - LOG:  restored log file "0000000400001658000000CC" from archive
2013-06-06 16:24:51.034 UTC - @ 26434 (00000) 2013-06-06 16:20:42 UTC - LOG:  restored log file "0000000400001658000000CD" from archive

The SIGHUP occurred because recovery.conf wasn't owned by user postgres (I use Puppet, which reloads the configuration on any change). A quick scan of the data directory reveals nothing out of the ordinary:

$ sudo ls -la /var/lib/postgresql/9.1/main/
total 108
drwx------ 13 postgres postgres  4096 Jun  6 16:27 .
drwxr-xr-x  3      700 postgres    17 May 31 15:46 ..
-rw-------  1 postgres postgres   184 May 22 12:10 backup_label.old
drwx------ 12 postgres postgres   130 Apr 18 17:33 base
drwx------  2 postgres postgres  8192 Jun  6 16:32 global
drwx------  2 postgres postgres  4096 May 31 20:57 pg_clog
drwx------  4 postgres postgres    34 Jan 12  2012 pg_multixact
drwx------  2 postgres postgres    17 Jun  6 16:20 pg_notify
drwx------  2 postgres postgres     6 Jan 12  2012 pg_serial
drwx------  2 postgres postgres    24 Jun  6 16:20 pg_stat_tmp
drwx------  2 postgres postgres    17 Jun  6 08:32 pg_subtrans
drwx------  2 postgres postgres     6 Jan 12  2012 pg_tblspc
drwx------  2 postgres postgres     6 Jan 12  2012 pg_twophase
-rw-------  1 postgres postgres     4 Jan 12  2012 PG_VERSION
drwxr-xr-x  3 postgres postgres 45056 Jun  6 16:34 pg_xlog
-rw-------  1 postgres postgres   350 Jun  6 16:20 postmaster.opts
-rw-------  1 postgres postgres    93 Jun  6 16:20 postmaster.pid
-rw-r--r--  1 postgres postgres   591 Jun  6 16:20 recovery.conf
lrwxrwxrwx  1 postgres postgres    36 Dec  2  2012 server.crt -> /etc/ssl/certs/ssl-cert-snakeoil.pem
lrwxrwxrwx  1 postgres postgres    38 Dec  2  2012 server.key -> /etc/ssl/private/ssl-cert-snakeoil.key

I also attempted to restart the slave, with and without recovery.done, to no avail. I must be missing something. Does anyone have an idea? I did read http://www.postgresql.org/docs/9.1/static/warm-standby-failover.html very carefully. I believe I did everything I was supposed to do.

Thanks,
François Beausoleil

Re: Slave promotion failure

From: Michael Paquier

On Fri, Jun 7, 2013 at 1:37 AM, François Beausoleil <francois@teksol.info> wrote:
> I can't seem to promote the slave:
>
> $ sudo -u postgres touch /var/lib/postgresql/9.1/main/recovery.done
> # log is silent

This has no effect. The server itself renames recovery.conf to recovery.done internally. If recovery.done is already present in the data folder alongside recovery.conf at the moment of promotion, recovery.done is removed before the rename. To use a trigger file for promotion, set trigger_file in recovery.conf; promotion is then kicked off once that file has been created.


> $ sudo -u postgres /usr/lib/postgresql/9.1/bin/pg_ctl promote -D /var/lib/postgresql/9.1/main
> server promoting
> # log is silent

I am not a specialist in Ubuntu-specific internals, but this should be enough to promote the server.
 
> I also attempted to restart the slave, with and without recovery.done, to no avail.

Playing with recovery.done has no effect on the promotion. Perhaps there is an issue with the layer used for automatic settings?
--
Michael
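
Since the server manages recovery.conf and recovery.done itself, it can be worth checking that trigger_file does not point at either name. The following is a minimal sketch of such a check; the function name check_trigger_file is hypothetical, and it assumes the value is written in single quotes as in the recovery.conf shown earlier. Whether the name alone would block promotion is not established here; the check only flags a name worth avoiding.

```shell
#!/bin/sh
# Hypothetical helper (not part of PostgreSQL): report whether trigger_file
# in a recovery.conf uses a file name that the server manages itself.
check_trigger_file() {
    conf="$1"
    # Extract the single-quoted value from the trigger_file line, if any.
    path=$(sed -n "s/^[[:space:]]*trigger_file[[:space:]]*=[[:space:]]*'\([^']*\)'.*/\1/p" "$conf")
    if [ -z "$path" ]; then
        echo "no trigger_file set"
        return 1
    fi
    case "$(basename "$path")" in
        recovery.conf|recovery.done)
            echo "reserved name: $path"
            return 2
            ;;
        *)
            echo "ok: $path"
            return 0
            ;;
    esac
}
```

For example, `check_trigger_file /var/lib/postgresql/9.1/main/recovery.conf` could be run before attempting a promotion.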

Re: Slave promotion failure

From: François Beausoleil

On 2013-06-06 at 18:40, Michael Paquier wrote:

> On Fri, Jun 7, 2013 at 1:37 AM, François Beausoleil <francois@teksol.info> wrote:
>> I can't seem to promote the slave:
>>
>> $ sudo -u postgres touch /var/lib/postgresql/9.1/main/recovery.done
>> # log is silent
>
> This has no effect. The server itself renames recovery.conf to recovery.done internally. If recovery.done is already present in the data folder alongside recovery.conf at the moment of promotion, recovery.done is removed before the rename. To use a trigger file for promotion, set trigger_file in recovery.conf; promotion is then kicked off once that file has been created.

I believe I know what my mistake is: I set trigger_file to /var/lib/postgresql/9.1/main/recovery.done -- and PostgreSQL doesn't seem to like that name. I should set it to another name and retry.

It's still strange that pg_ctl promote didn't work though. Maybe because recovery.done existed at the time I tried.

I'll try again today, with better names.

Thanks!
François

Re: Slave promotion failure

From: François Beausoleil

On 2013-06-07 at 07:00, François Beausoleil wrote:

> I'll try again today, with better names.

Okay, here's my new recovery.conf:

standby_mode = on
restore_command = '/usr/local/omnipitr/bin/omnipitr-restore -D /var/lib/postgresql/9.1/main/ --source gzip=/var/backups/seevibes/wal/dbanalytics.production/ --remove-unneeded --temp-dir /var/tmp/omnipitr -l /var/log/omnipitr/restore.log --error-pgcontroldata hang --pgcontroldata-path /usr/lib/postgresql/9.1/bin/pg_controldata "%f" "%p"'
trigger_file = '/var/lib/postgresql/9.1/main/trigger-promotion'
archive_cleanup_command = '/usr/local/omnipitr/bin/omnipitr-cleanup --log /var/log/omnipitr/cleanup.log --archive gzip=/var/backups/seevibes/wal/dbanalytics.production "%r"'

Notice trigger_file has a better name. I touch the file using:

sudo -u postgres touch /var/lib/postgresql/9.1/main/trigger-promotion

and nothing happens: no messages appear in the log, and PostgreSQL continues to apply WAL records.

I've just retried pg_ctl promote, and that didn't do anything either. I'm really at a loss to explain what is happening.

Bye,
François
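
When promotion attempts leave the log silent, one way to confirm that the server is still a standby is to read the cluster state reported by pg_controldata. The sketch below is only an illustration: the function names cluster_state and still_in_recovery are hypothetical, and it assumes pg_controldata prints its output in English (for instance with LC_ALL=C).

```shell
#!/bin/sh
# Print the value of the "Database cluster state" line read from stdin.
# Assumes English pg_controldata output.
cluster_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Return 0 if the state read from stdin says the server is still recovering.
still_in_recovery() {
    case "$(cluster_state)" in
        "in archive recovery"|"in crash recovery") return 0 ;;
        *) return 1 ;;
    esac
}
```

With the paths used in this thread, that would look like `sudo -u postgres /usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main | still_in_recovery && echo "still a standby"`. Running `SELECT pg_is_in_recovery();` in psql gives the same answer from inside the server if it accepts connections.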


Re: Slave promotion failure

From: François Beausoleil

On 2013-06-07 at 12:00, François Beausoleil wrote:

> I've just retried pg_ctl promote, and that didn't do anything either. I'm really at a loss to explain what is happening.

I answered my own question. I use OmniPITR, and I forgot to include the --finish-recovery flag, pointing to the trigger file.

Bye,
François
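
A plausible reading of this fix: a restore script that waits for the next WAL segment does not return control to the server while it waits, so the script has to watch for the trigger file itself. The sketch below illustrates that idea only. It is a hypothetical stand-in, not OmniPITR's code, and the function name wait_for_segment and its arguments are invented for the illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a waiting restore script; not OmniPITR's code.
# Usage: wait_for_segment ARCHIVE_DIR SEGMENT DEST FINISH_TRIGGER [POLL_SECONDS]
wait_for_segment() {
    archive="$1"; segment="$2"; dest="$3"; trigger="$4"; poll="${5:-5}"
    while :; do
        # If the finish trigger exists, stop waiting and fail, which hands
        # control back to the server so it can act on a promotion request.
        if [ -e "$trigger" ]; then
            return 1
        fi
        # Otherwise deliver the segment as soon as it shows up in the archive.
        if [ -f "$archive/$segment" ]; then
            cp "$archive/$segment" "$dest"
            return $?
        fi
        sleep "$poll"   # without the trigger check above, this loop never ends
    done
}
```

In this sketch, a script that omits the trigger check would keep waiting for the next segment, which would match the silent log seen after touching the trigger file or running pg_ctl promote.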
