PostgreSQL-12 replication failover, pg_rewind fails - Mailing list pgsql-general
From | Mariya Rampurawala |
---|---|
Subject | PostgreSQL-12 replication failover, pg_rewind fails |
Date | |
Msg-id | 2775DB08-A4FB-44EE-9A3B-81ABB4CB4F7F@veritas.com Whole thread Raw |
Responses |
Re: PostgreSQL-12 replication failover, pg_rewind fails
|
List | pgsql-general |
Hi,
I am working on providing HA for replication, using automation scripts.
My set up consists on two nodes, Master and Slave. When master fails, The slave is promoted to master. But when I try to re-register the old master as slave, the pg_rewind command fails. Details below.
Non-default entries in Postgresql.conf
===================================
archive_mode = on
archive_command = 'scp %p root@10.209.56.88:/pg_backup/%f'
archive_cleanup_command = '/bin/ssh root@10.209.56.88 "/usr/pgsql-12/bin/pg_archivecleanup /pg_backup %r"'
synchronous_standby_names = '*'
listen_addresses = '*'
wal_level = hot_standby
max_wal_senders = 32
max_replication_slots = 8
hot_standby = on
wal_log_hints = on
synchronous_commit = on
hot_standby_feedback = on
restart_after_crash = off
max_connections = 100
===================================
Postgresql.auto.conf at standby:
===================================
primary_conninfo = 'user=replicator host=10.209.57.16 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
restore_command = 'scp root@10.209.56.88:/pg_backup/%f %p'
recovery_target_timeline='latest'
===================================
Reproduction steps:
- Start master:
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ start
waiting for server to start....2020-05-11 05:26:01.665 IST [23999] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-11 05:26:01.665 IST [23999] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-11 05:26:01.665 IST [23999] LOG: listening on IPv6 address "::", port 5432
2020-05-11 05:26:01.666 IST [23999] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-11 05:26:01.668 IST [23999] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-05-11 05:26:01.688 IST [23999] LOG: redirecting log output to logging collector process
2020-05-11 05:26:01.688 IST [23999] HINT: Future log output will appear in directory "log".
done
server started
- Basebackup on slave:
/usr/pgsql-12/bin/pg_basebackup -D /pg_mnt/pg-12/data -X stream -R -c fast -U replicator --host=10.209.57.16 --port=5432
- Copy postgresql.auto.conf to slave data directory and touch standby.signal
- Start slave
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data start
waiting for server to start....2020-05-11 05:29:09.422 IST [22624] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-11 05:29:09.422 IST [22624] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-11 05:29:09.422 IST [22624] LOG: listening on IPv6 address "::", port 5432
2020-05-11 05:29:09.423 IST [22624] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-11 05:29:09.425 IST [22624] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-05-11 05:29:09.442 IST [22624] LOG: redirecting log output to logging collector process
2020-05-11 05:29:09.442 IST [22624] HINT: Future log output will appear in directory "log".
.. done
server started
- Check replication status on master:
-bash-4.2$ psql
psql (12.2)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+---------------------------------
pid | 24988
usesysid | 16385
usename | replicator
application_name | walreceiver
client_addr | 10.209.57.14
client_hostname |
client_port | 40802
backend_start | 2020-05-11 05:29:14.752341+05:30
backend_xmin | 588
state | streaming
sent_lsn | 5/A60000D8
write_lsn | 5/A60000D8
flush_lsn | 5/A60000D8
replay_lsn | 5/A60000D8
write_lag |
flush_lag |
replay_lag |
sync_priority | 1
sync_state | sync
reply_time | 2020-05-11 05:29:54.895857+05:30
- From a third node write large data on master
#psql -h 10.209.57.16 --user=postgres
psql (12.2)
Type "help" for help.
postgres=# \c vcs
You are now connected to database "vcs" as user "postgres".
vcs=# create table c1 as select generate_series(1,10000000000);
- Kill postgres processes on master:
-bash-4.2$ ps -ef | grep post
root 1860 18990 0 May08 pts/1 00:00:00 vim postgresql.cluster.conf
root 1925 14243 0 Apr07 pts/0 00:00:00 su - postgres
postgres 1926 1925 0 Apr07 pts/0 00:00:00 -bash
root 7321 1 0 Mar14 ? 00:00:15 /usr/libexec/postfix/master -w
postfix 7350 7321 0 Mar14 ? 00:00:03 qmgr -l -t unix -u
postfix 10410 7321 0 04:47 ? 00:00:00 pickup -l -t unix -u
postgres 23999 1 0 05:26 ? 00:00:00 /usr/pgsql-12/bin/postgres -D /pg_mnt/pg-12/data
postgres 24000 23999 0 05:26 ? 00:00:00 postgres: logger
postgres 24002 23999 0 05:26 ? 00:00:00 postgres: checkpointer
postgres 24003 23999 0 05:26 ? 00:00:00 postgres: background writer
postgres 24004 23999 0 05:26 ? 00:00:01 postgres: walwriter
postgres 24005 23999 0 05:26 ? 00:00:00 postgres: autovacuum launcher
postgres 24006 23999 0 05:26 ? 00:00:00 postgres: archiver archiving 0000003500000005000000C7
postgres 24007 23999 0 05:26 ? 00:00:00 postgres: stats collector
postgres 24008 23999 0 05:26 ? 00:00:00 postgres: logical replication launcher
postgres 24988 23999 1 05:29 ? 00:00:03 postgres: walsender replicator 10.209.57.14(40802) streaming 6/4A480000
postgres 25906 23999 76 05:31 ? 00:00:47 postgres: postgres vcs 10.209.57.15(58642) CREATE TABLE AS
postgres 26240 24006 1 05:32 ? 00:00:00 scp pg_wal/0000003500000005000000C7 root@10.209.56.88:/pg_backup/0000003500000005000000C7
postgres 26241 26240 0 05:32 ? 00:00:00 /usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -l root -- 10.209.56.88 scp -t /pg_backup/0000003500000005000000C7
postgres 26242 32277 0 05:32 pts/0 00:00:00 ps -ef
postgres 26243 32277 0 05:32 pts/0 00:00:00 grep --color=auto post
root 32276 24501 0 May10 pts/0 00:00:00 su - postgres
postgres 32277 32276 0 May10 pts/0 00:00:00 -bash
-bash-4.2$ kill -9 23999
-bash-4.2$ ps -ef | grep post
root 1860 18990 0 May08 pts/1 00:00:00 vim postgresql.cluster.conf
root 1925 14243 0 Apr07 pts/0 00:00:00 su - postgres
postgres 1926 1925 0 Apr07 pts/0 00:00:00 -bash
root 7321 1 0 Mar14 ? 00:00:15 /usr/libexec/postfix/master -w
postfix 7350 7321 0 Mar14 ? 00:00:03 qmgr -l -t unix -u
postfix 10410 7321 0 04:47 ? 00:00:00 pickup -l -t unix -u
postgres 24000 1 0 05:26 ? 00:00:00 postgres: logger
postgres 25906 1 81 05:31 ? 00:01:30 postgres: postgres vcs 10.209.57.15(58642) CREATE TABLE AS
postgres 26514 32277 0 05:33 pts/0 00:00:00 ps -ef
postgres 26515 32277 0 05:33 pts/0 00:00:00 grep --color=auto post
root 32276 24501 0 May10 pts/0 00:00:00 su - postgres
postgres 32277 32276 0 May10 pts/0 00:00:00 -bash
- Promote slave:
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data promote
waiting for server to promote..... done
server promoted
-bash-4.2$ psql
psql (12.2)
Type "help" for help.
postgres=# checkpoint;
CHECKPOINT
postgres=#
- Re-register old master as standby
-bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"
pg_rewind: fatal: target server must be shut down cleanly
- Start and stop db to finish recovery (old master)
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2020-05-11 05:36:29.005 IST [27684] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-11 05:36:29.006 IST [27684] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-11 05:36:29.006 IST [27684] LOG: listening on IPv6 address "::", port 5432
2020-05-11 05:36:29.007 IST [27684] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-11 05:36:29.008 IST [27684] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-05-11 05:36:29.027 IST [27684] LOG: redirecting log output to logging collector process
2020-05-11 05:36:29.027 IST [27684] HINT: Future log output will appear in directory "log".
........................................................... stopped waiting
pg_ctl: server did not start in time
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ status
pg_ctl: server is running (PID: 27684)
/usr/pgsql-12/bin/postgres "-D" "/pg_mnt/pg-12/data"
-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart stop
waiting for server to shut down..... done
server stopped
- Rewind again:
- -bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"
pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53
pg_rewind: error: could not open file "/pg_mnt/pg-12/data/pg_wal/0000003500000006000000B9": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 6/B9FFFFD8
I have tried this multiple times but always face the same error. Can someone help me resolve this?
Thanks and Regards,
Mariya
pgsql-general by date: