Thread: PostgreSQL-12 replication failover, pg_rewind fails
Hi,
I am working on providing HA for replication, using automation scripts.
My setup consists of two nodes, a master and a slave. When the master fails, the slave is promoted to master. But when I try to re-register the old master as a slave, the pg_rewind command fails. Details below.
Non-default entries in postgresql.conf
===================================
archive_mode = on
archive_command = 'scp %p root@10.209.56.88:/pg_backup/%f'
archive_cleanup_command = '/bin/ssh root@10.209.56.88 "/usr/pgsql-12/bin/pg_archivecleanup /pg_backup %r"'
synchronous_standby_names = '*'
listen_addresses = '*'
wal_level = hot_standby
max_wal_senders = 32
max_replication_slots = 8
hot_standby = on
wal_log_hints = on
synchronous_commit = on
hot_standby_feedback = on
restart_after_crash = off
max_connections = 100
===================================
postgresql.auto.conf on the standby:
===================================
primary_conninfo = 'user=replicator host=10.209.57.16 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
restore_command = 'scp root@10.209.56.88:/pg_backup/%f %p'
recovery_target_timeline='latest'
===================================
Reproduction steps:
- Start master:
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ start
waiting for server to start....2020-05-11 05:26:01.665 IST [23999] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-11 05:26:01.665 IST [23999] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-11 05:26:01.665 IST [23999] LOG: listening on IPv6 address "::", port 5432
2020-05-11 05:26:01.666 IST [23999] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-11 05:26:01.668 IST [23999] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-05-11 05:26:01.688 IST [23999] LOG: redirecting log output to logging collector process
2020-05-11 05:26:01.688 IST [23999] HINT: Future log output will appear in directory "log".
done
server started
- Take a base backup on the slave:
/usr/pgsql-12/bin/pg_basebackup -D /pg_mnt/pg-12/data -X stream -R -c fast -U replicator --host=10.209.57.16 --port=5432
- Copy postgresql.auto.conf to the slave's data directory and touch standby.signal (a sketch of this step is shown below)
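For reference, this step looks roughly like the following; the source location of the prepared postgresql.auto.conf is illustrative and depends on the automation:
-bash-4.2$ cp /path/to/prepared/postgresql.auto.conf /pg_mnt/pg-12/data/
-bash-4.2$ touch /pg_mnt/pg-12/data/standby.signal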
- Start slave
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data start
waiting for server to start....2020-05-11 05:29:09.422 IST [22624] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-11 05:29:09.422 IST [22624] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-11 05:29:09.422 IST [22624] LOG: listening on IPv6 address "::", port 5432
2020-05-11 05:29:09.423 IST [22624] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-11 05:29:09.425 IST [22624] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-05-11 05:29:09.442 IST [22624] LOG: redirecting log output to logging collector process
2020-05-11 05:29:09.442 IST [22624] HINT: Future log output will appear in directory "log".
.. done
server started
- Check replication status on master:
-bash-4.2$ psql
psql (12.2)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+---------------------------------
pid | 24988
usesysid | 16385
usename | replicator
application_name | walreceiver
client_addr | 10.209.57.14
client_hostname |
client_port | 40802
backend_start | 2020-05-11 05:29:14.752341+05:30
backend_xmin | 588
state | streaming
sent_lsn | 5/A60000D8
write_lsn | 5/A60000D8
flush_lsn | 5/A60000D8
replay_lsn | 5/A60000D8
write_lag |
flush_lag |
replay_lag |
sync_priority | 1
sync_state | sync
reply_time | 2020-05-11 05:29:54.895857+05:30
- From a third node, write a large amount of data on the master:
#psql -h 10.209.57.16 --user=postgres
psql (12.2)
Type "help" for help.
postgres=# \c vcs
You are now connected to database "vcs" as user "postgres".
vcs=# create table c1 as select generate_series(1,10000000000);
- Kill the postgres processes on the master (kill -9 the postmaster):
-bash-4.2$ ps -ef | grep post
root 1860 18990 0 May08 pts/1 00:00:00 vim postgresql.cluster.conf
root 1925 14243 0 Apr07 pts/0 00:00:00 su - postgres
postgres 1926 1925 0 Apr07 pts/0 00:00:00 -bash
root 7321 1 0 Mar14 ? 00:00:15 /usr/libexec/postfix/master -w
postfix 7350 7321 0 Mar14 ? 00:00:03 qmgr -l -t unix -u
postfix 10410 7321 0 04:47 ? 00:00:00 pickup -l -t unix -u
postgres 23999 1 0 05:26 ? 00:00:00 /usr/pgsql-12/bin/postgres -D /pg_mnt/pg-12/data
postgres 24000 23999 0 05:26 ? 00:00:00 postgres: logger
postgres 24002 23999 0 05:26 ? 00:00:00 postgres: checkpointer
postgres 24003 23999 0 05:26 ? 00:00:00 postgres: background writer
postgres 24004 23999 0 05:26 ? 00:00:01 postgres: walwriter
postgres 24005 23999 0 05:26 ? 00:00:00 postgres: autovacuum launcher
postgres 24006 23999 0 05:26 ? 00:00:00 postgres: archiver archiving 0000003500000005000000C7
postgres 24007 23999 0 05:26 ? 00:00:00 postgres: stats collector
postgres 24008 23999 0 05:26 ? 00:00:00 postgres: logical replication launcher
postgres 24988 23999 1 05:29 ? 00:00:03 postgres: walsender replicator 10.209.57.14(40802) streaming 6/4A480000
postgres 25906 23999 76 05:31 ? 00:00:47 postgres: postgres vcs 10.209.57.15(58642) CREATE TABLE AS
postgres 26240 24006 1 05:32 ? 00:00:00 scp pg_wal/0000003500000005000000C7 root@10.209.56.88:/pg_backup/0000003500000005000000C7
postgres 26241 26240 0 05:32 ? 00:00:00 /usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -l root -- 10.209.56.88 scp -t /pg_backup/0000003500000005000000C7
postgres 26242 32277 0 05:32 pts/0 00:00:00 ps -ef
postgres 26243 32277 0 05:32 pts/0 00:00:00 grep --color=auto post
root 32276 24501 0 May10 pts/0 00:00:00 su - postgres
postgres 32277 32276 0 May10 pts/0 00:00:00 -bash
-bash-4.2$ kill -9 23999
-bash-4.2$ ps -ef | grep post
root 1860 18990 0 May08 pts/1 00:00:00 vim postgresql.cluster.conf
root 1925 14243 0 Apr07 pts/0 00:00:00 su - postgres
postgres 1926 1925 0 Apr07 pts/0 00:00:00 -bash
root 7321 1 0 Mar14 ? 00:00:15 /usr/libexec/postfix/master -w
postfix 7350 7321 0 Mar14 ? 00:00:03 qmgr -l -t unix -u
postfix 10410 7321 0 04:47 ? 00:00:00 pickup -l -t unix -u
postgres 24000 1 0 05:26 ? 00:00:00 postgres: logger
postgres 25906 1 81 05:31 ? 00:01:30 postgres: postgres vcs 10.209.57.15(58642) CREATE TABLE AS
postgres 26514 32277 0 05:33 pts/0 00:00:00 ps -ef
postgres 26515 32277 0 05:33 pts/0 00:00:00 grep --color=auto post
root 32276 24501 0 May10 pts/0 00:00:00 su - postgres
postgres 32277 32276 0 May10 pts/0 00:00:00 -bash
- Promote slave:
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data promote
waiting for server to promote..... done
server promoted
-bash-4.2$ psql
psql (12.2)
Type "help" for help.
postgres=# checkpoint;
CHECKPOINT
postgres=#
- Re-register the old master as a standby:
-bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"
pg_rewind: fatal: target server must be shut down cleanly
- Start and stop the database to complete crash recovery (old master):
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2020-05-11 05:36:29.005 IST [27684] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-11 05:36:29.006 IST [27684] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-11 05:36:29.006 IST [27684] LOG: listening on IPv6 address "::", port 5432
2020-05-11 05:36:29.007 IST [27684] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-11 05:36:29.008 IST [27684] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-05-11 05:36:29.027 IST [27684] LOG: redirecting log output to logging collector process
2020-05-11 05:36:29.027 IST [27684] HINT: Future log output will appear in directory "log".
........................................................... stopped waiting
pg_ctl: server did not start in time
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ status
pg_ctl: server is running (PID: 27684)
/usr/pgsql-12/bin/postgres "-D" "/pg_mnt/pg-12/data"
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart stop
waiting for server to shut down..... done
server stopped
- Rewind again:
-bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"
pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53
pg_rewind: error: could not open file "/pg_mnt/pg-12/data/pg_wal/0000003500000006000000B9": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 6/B9FFFFD8
I have tried this multiple times but always face the same error. Can someone help me resolve this?
Thanks and Regards,
Mariya
Hello,
Can someone please help me with the above query?
Regards,
Mariya
Hello.

At Tue, 12 May 2020 06:32:30 +0000, Mariya Rampurawala <Mariya.Rampurawala@veritas.com> wrote in
> I am working on providing HA for replication, using automation scripts.
> My setup consists of two nodes, a master and a slave. When the master fails, the slave is promoted to master. But when I try to re-register the old master as a slave, the pg_rewind command fails. Details below.
...
> - Rewind again:
> -bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"
> pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53
> pg_rewind: error: could not open file "/pg_mnt/pg-12/data/pg_wal/0000003500000006000000B9": No such file or directory
> pg_rewind: fatal: could not find previous WAL record at 6/B9FFFFD8
>
> I have tried this multiple times but always face the same error. Can someone help me resolve this?

As the error message says, a required WAL file has been removed on the old master. This is normal behavior and is described in the documentation:

https://www.postgresql.org/docs/12/app-pgrewind.html

> but if the target cluster ran for a long time after the divergence,
> the old WAL files might no longer be present. In that case, they can
> be manually copied from the WAL archive to the pg_wal directory, or
> fetched on startup by configuring primary_conninfo or restore_command.

So you need to restore the required WAL files from the archive or from the current master.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
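Concretely, restoring the missing segment named in the error from the archive host used in this thread would look roughly like this (a sketch; it mirrors the restore_command shown earlier), after which pg_rewind can be retried:

-bash-4.2$ scp root@10.209.56.88:/pg_backup/0000003500000006000000B9 /pg_mnt/pg-12/data/pg_wal/
-bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"

pg_rewind may then complain about further missing segments, which can be restored the same way.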
Hi,

Thank you for the response.

> but if the target cluster ran for a long time after the divergence,
> the old WAL files might no longer be present. In that case, they can
> be manually copied from the WAL archive to the pg_wal directory, or
> fetched on startup by configuring primary_conninfo or restore_command.

I hit this issue every time I follow the aforementioned steps, manually as well as with scripts. How long is "long time after divergence"? Is there a way I can make some configuration changes so that I don't hit this issue? Is there anything I must change in my restore command?

===================================
primary_conninfo = 'user=replicator host=10.209.57.16 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
restore_command = 'scp root@10.209.56.88:/pg_backup/%f %p'
===================================

Regards,
Mariya
On Tue, 2020-05-12 at 09:40 +0000, Mariya Rampurawala wrote:
> I hit this issue every time I follow the aforementioned steps, manually as well as with scripts.
> How long is "long time after divergence"? Is there a way I can make some
> configuration changes so that I don't hit this issue?
> Is there anything I must change in my restore command?

What you can do is to use a higher value for "wal_keep_segments". Then PostgreSQL will keep around that number of old WAL segments, which increases the chance for "pg_rewind" to succeed.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
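As a sketch, that would be a postgresql.conf entry like the following on both nodes (the value is an assumption to be sized for your write rate; with the default 16MB segment size, 1024 segments keep about 16GB of WAL):

wal_keep_segments = 1024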
At Tue, 12 May 2020 17:29:50 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in
> On Tue, 2020-05-12 at 09:40 +0000, Mariya Rampurawala wrote:
> > I hit this issue every time I follow the aforementioned steps, manually as well as with scripts.
> > How long is "long time after divergence"? Is there a way I can make some
> > configuration changes so that I don't hit this issue?
> > Is there anything I must change in my restore command?

As mentioned in the documentation, pg_rewind uses the WAL records starting from the last checkpoint just before the divergence point. The divergence point is shown as follows in your log messages:

> pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53

pg_rewind scans backward starting from that location to find a checkpoint record, which is the oldest WAL record pg_rewind needs. As you see, it is not a matter of calculation: there is no way other than actually restoring the WAL segment files, reading them, and trying to find it.

> What you can do is to use a higher value for "wal_keep_segments".
> Then PostgreSQL will keep around that number of old WAL segments,
> which increases the chance for "pg_rewind" to succeed.

So this is one effective way to reduce the chance of losing required WAL (segment) files.

On PG12, an easy way to automatically restore all required WAL files would be to restore each WAL file that pg_rewind complains is missing. Or, you could use pg_waldump to find a checkpoint record.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
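For instance, checking a restored segment for a checkpoint record could look roughly like this (the file name is taken from the error above; checkpoint records appear under the XLOG resource manager in pg_waldump output):

-bash-4.2$ /usr/pgsql-12/bin/pg_waldump /pg_mnt/pg-12/data/pg_wal/0000003500000006000000B9 | grep -i checkpoint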
Thank you Kyotaro and Laurenz for your quick responses. This helped me get my setup working.

Regards,
Mariya
On Wed, May 13, 2020 at 04:58:15AM +0000, Mariya Rampurawala wrote:
> Thank you Kyotaro and Laurenz for your quick responses.
> This helped me get my setup working.

Please note that we have added in Postgres 13 the possibility to use a restore_command when running pg_rewind, if the parameter is set in the target cluster's configuration. This way, you can fetch missing WAL segments from the archives during the rewind operation, without the need to rely on wal_keep_segments or the like.
--
Michael
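On PostgreSQL 13 that would look roughly like the following sketch (the binary and data paths are assumptions; --restore-target-wal tells pg_rewind to use the restore_command from the target cluster's configuration):

-bash-4.2$ /usr/pgsql-13/bin/pg_rewind -D /pg_mnt/pg-13/data --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres" --restore-target-wal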