PostgreSQL-12 replication failover, pg_rewind fails

From
Mariya Rampurawala
Date:

Hi,

 

I am working on providing HA for replication, using automation scripts.

My setup consists of two nodes, a master and a slave. When the master fails, the slave is promoted to master. But when I try to re-register the old master as a slave, the pg_rewind command fails. Details below.

 

Non-default entries in postgresql.conf

===================================

archive_mode = on   

archive_command = 'scp %p root@10.209.56.88:/pg_backup/%f' 

archive_cleanup_command = '/bin/ssh root@10.209.56.88 "/usr/pgsql-12/bin/pg_archivecleanup /pg_backup %r"' 

synchronous_standby_names = '*'

listen_addresses = '*'

wal_level = hot_standby

max_wal_senders = 32

max_replication_slots = 8

hot_standby = on

wal_log_hints = on

synchronous_commit = on

hot_standby_feedback = on

restart_after_crash = off

max_connections = 100

===================================

 

postgresql.auto.conf at standby:

===================================

primary_conninfo = 'user=replicator host=10.209.57.16 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'

restore_command = 'scp  root@10.209.56.88:/pg_backup/%f %p'

recovery_target_timeline='latest'

===================================
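As an aside on how restore_command is used: before running it, PostgreSQL substitutes %f with the WAL file name being requested and %p with the path to copy it to. A small shell sketch of that substitution (the helper name here is made up for illustration) can help sanity-check the template above:

```shell
# Hypothetical helper mimicking PostgreSQL's %f/%p substitution in
# restore_command (and archive_command). Name and usage are illustrative only.
expand_command() {
  local template=$1 fname=$2 path=$3
  local cmd=${template//%f/$fname}   # %f -> requested WAL file name
  cmd=${cmd//%p/$path}               # %p -> destination path
  printf '%s\n' "$cmd"
}

expand_command 'scp root@10.209.56.88:/pg_backup/%f %p' \
  '0000003500000006000000B9' 'pg_wal/RECOVERYXLOG'
# -> scp root@10.209.56.88:/pg_backup/0000003500000006000000B9 pg_wal/RECOVERYXLOG
```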

 

 

Reproduction steps:

  1. Start master:

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/  start

waiting for server to start....2020-05-11 05:26:01.665 IST [23999] LOG:  starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

2020-05-11 05:26:01.665 IST [23999] LOG:  listening on IPv4 address "0.0.0.0", port 5432

2020-05-11 05:26:01.665 IST [23999] LOG:  listening on IPv6 address "::", port 5432

2020-05-11 05:26:01.666 IST [23999] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"

2020-05-11 05:26:01.668 IST [23999] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"

2020-05-11 05:26:01.688 IST [23999] LOG:  redirecting log output to logging collector process

2020-05-11 05:26:01.688 IST [23999] HINT:  Future log output will appear in directory "log".

done

server started

  2. Base backup on slave:

/usr/pgsql-12/bin/pg_basebackup -D /pg_mnt/pg-12/data -X stream -R -c fast -U replicator --host=10.209.57.16 --port=5432

  3. Copy postgresql.auto.conf to the slave data directory and touch standby.signal
  4. Start slave:

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data start

waiting for server to start....2020-05-11 05:29:09.422 IST [22624] LOG:  starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

2020-05-11 05:29:09.422 IST [22624] LOG:  listening on IPv4 address "0.0.0.0", port 5432

2020-05-11 05:29:09.422 IST [22624] LOG:  listening on IPv6 address "::", port 5432

2020-05-11 05:29:09.423 IST [22624] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"

2020-05-11 05:29:09.425 IST [22624] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"

2020-05-11 05:29:09.442 IST [22624] LOG:  redirecting log output to logging collector process

2020-05-11 05:29:09.442 IST [22624] HINT:  Future log output will appear in directory "log".

.. done

server started

  5. Check replication status on master:

-bash-4.2$ psql

psql (12.2)

Type "help" for help.

 

postgres=# \x

Expanded display is on.

postgres=# select * from pg_stat_replication;

-[ RECORD 1 ]----+---------------------------------

pid              | 24988

usesysid         | 16385

usename          | replicator

application_name | walreceiver

client_addr      | 10.209.57.14

client_hostname  | 

client_port      | 40802

backend_start    | 2020-05-11 05:29:14.752341+05:30

backend_xmin     | 588

state            | streaming

sent_lsn         | 5/A60000D8

write_lsn        | 5/A60000D8

flush_lsn        | 5/A60000D8

replay_lsn       | 5/A60000D8

write_lag        | 

flush_lag        | 

replay_lag       | 

sync_priority    | 1

sync_state       | sync

reply_time       | 2020-05-11 05:29:54.895857+05:30

 

  6. From a third node, write a large amount of data to the master:

#psql -h 10.209.57.16 --user=postgres

psql (12.2)

Type "help" for help.

 

postgres=# \c vcs

You are now connected to database "vcs" as user "postgres".

vcs=# create table c1 as select generate_series(1,10000000000);

 

  7. Kill postgres processes on master:

-bash-4.2$ ps -ef | grep post

root      1860 18990  0 May08 pts/1    00:00:00 vim postgresql.cluster.conf

root      1925 14243  0 Apr07 pts/0    00:00:00 su - postgres

postgres  1926  1925  0 Apr07 pts/0    00:00:00 -bash

root      7321     1  0 Mar14 ?        00:00:15 /usr/libexec/postfix/master -w

postfix   7350  7321  0 Mar14 ?        00:00:03 qmgr -l -t unix -u

postfix  10410  7321  0 04:47 ?        00:00:00 pickup -l -t unix -u

postgres 23999     1  0 05:26 ?        00:00:00 /usr/pgsql-12/bin/postgres -D /pg_mnt/pg-12/data

postgres 24000 23999  0 05:26 ?        00:00:00 postgres: logger   

postgres 24002 23999  0 05:26 ?        00:00:00 postgres: checkpointer   

postgres 24003 23999  0 05:26 ?        00:00:00 postgres: background writer   

postgres 24004 23999  0 05:26 ?        00:00:01 postgres: walwriter   

postgres 24005 23999  0 05:26 ?        00:00:00 postgres: autovacuum launcher   

postgres 24006 23999  0 05:26 ?        00:00:00 postgres: archiver   archiving 0000003500000005000000C7

postgres 24007 23999  0 05:26 ?        00:00:00 postgres: stats collector   

postgres 24008 23999  0 05:26 ?        00:00:00 postgres: logical replication launcher   

postgres 24988 23999  1 05:29 ?        00:00:03 postgres: walsender replicator 10.209.57.14(40802) streaming 6/4A480000

postgres 25906 23999 76 05:31 ?        00:00:47 postgres: postgres vcs 10.209.57.15(58642) CREATE TABLE AS

postgres 26240 24006  1 05:32 ?        00:00:00 scp pg_wal/0000003500000005000000C7 root@10.209.56.88:/pg_backup/0000003500000005000000C7

postgres 26241 26240  0 05:32 ?        00:00:00 /usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -l root -- 10.209.56.88 scp -t /pg_backup/0000003500000005000000C7

postgres 26242 32277  0 05:32 pts/0    00:00:00 ps -ef

postgres 26243 32277  0 05:32 pts/0    00:00:00 grep --color=auto post

root     32276 24501  0 May10 pts/0    00:00:00 su - postgres

postgres 32277 32276  0 May10 pts/0    00:00:00 -bash

-bash-4.2$ kill -9 23999

-bash-4.2$ ps -ef | grep post

root      1860 18990  0 May08 pts/1    00:00:00 vim postgresql.cluster.conf

root      1925 14243  0 Apr07 pts/0    00:00:00 su - postgres

postgres  1926  1925  0 Apr07 pts/0    00:00:00 -bash

root      7321     1  0 Mar14 ?        00:00:15 /usr/libexec/postfix/master -w

postfix   7350  7321  0 Mar14 ?        00:00:03 qmgr -l -t unix -u

postfix  10410  7321  0 04:47 ?        00:00:00 pickup -l -t unix -u

postgres 24000     1  0 05:26 ?        00:00:00 postgres: logger   

postgres 25906     1 81 05:31 ?        00:01:30 postgres: postgres vcs 10.209.57.15(58642) CREATE TABLE AS

postgres 26514 32277  0 05:33 pts/0    00:00:00 ps -ef

postgres 26515 32277  0 05:33 pts/0    00:00:00 grep --color=auto post

root     32276 24501  0 May10 pts/0    00:00:00 su - postgres

postgres 32277 32276  0 May10 pts/0    00:00:00 -bash

 

  8. Promote slave:

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data promote

waiting for server to promote..... done

server promoted

-bash-4.2$ psql

psql (12.2)

Type "help" for help.

 

postgres=# checkpoint;

CHECKPOINT

postgres=# 

  9. Re-register old master as standby:

-bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17  port=5432 user=postgres dbname=postgres"

pg_rewind: fatal: target server must be shut down cleanly

  10. Start and stop the database to finish crash recovery (old master):

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart start

pg_ctl: another server might be running; trying to start server anyway

waiting for server to start....2020-05-11 05:36:29.005 IST [27684] LOG:  starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

2020-05-11 05:36:29.006 IST [27684] LOG:  listening on IPv4 address "0.0.0.0", port 5432

2020-05-11 05:36:29.006 IST [27684] LOG:  listening on IPv6 address "::", port 5432

2020-05-11 05:36:29.007 IST [27684] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"

2020-05-11 05:36:29.008 IST [27684] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"

2020-05-11 05:36:29.027 IST [27684] LOG:  redirecting log output to logging collector process

2020-05-11 05:36:29.027 IST [27684] HINT:  Future log output will appear in directory "log".

........................................................... stopped waiting

pg_ctl: server did not start in time

 -bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ status

pg_ctl: server is running (PID: 27684)

/usr/pgsql-12/bin/postgres "-D" "/pg_mnt/pg-12/data"

 

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart stop

waiting for server to shut down..... done

server stopped

 

  11. Rewind again:

-bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17  port=5432 user=postgres dbname=postgres"

pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53

pg_rewind: error: could not open file "/pg_mnt/pg-12/data/pg_wal/0000003500000006000000B9": No such file or directory

pg_rewind: fatal: could not find previous WAL record at 6/B9FFFFD8
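The file name in this error follows directly from the divergence LSN: with the default 16 MB segment size, a WAL segment name is the timeline ID, the high 32 bits of the LSN, and the low 32 bits divided by 16 MB, each as 8 hex digits. A shell sketch (assuming the default segment size) reproduces the missing name from "6/B9FFFFD8 on timeline 53":

```shell
# Derive the WAL segment file name holding LSN 6/B9FFFFD8 on timeline 53,
# assuming the default 16 MB (0x1000000) WAL segment size.
tli=53
lsn='6/B9FFFFD8'
hi=${lsn%/*}                      # high 32 bits of the LSN ("6")
lo=${lsn#*/}                      # low 32 bits of the LSN ("B9FFFFD8")
seg=$(( 0x$lo / 0x1000000 ))      # segment number within the high word
printf '%08X%08X%08X\n' "$tli" "0x$hi" "$seg"
# -> 0000003500000006000000B9   (0x35 = timeline 53)
```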

 

 

I have tried this multiple times but always face the same error. Can someone help me resolve this?

 

 

 

 

Thanks and Regards,

Mariya

 

 

Re: PostgreSQL-12 replication failover, pg_rewind fails

From
Mariya Rampurawala
Date:

Hello,

Can someone please help me with the query below?

 

Regards,

Mariya

 

From: Mariya Rampurawala <Mariya.Rampurawala@veritas.com>
Date: Sunday, 10 May 2020 at 2:55 PM
To: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: PostgreSQL-12 replication failover, pg_rewind fails

 


 

 

Re: PostgreSQL-12 replication failover, pg_rewind fails

From
Kyotaro Horiguchi
Date:
Hello.

At Tue, 12 May 2020 06:32:30 +0000, Mariya Rampurawala <Mariya.Rampurawala@veritas.com> wrote in 
> I am working on providing HA for replication, using automation scripts.
> My set up consists on two nodes, Master and Slave. When master fails, The slave is promoted to master. But when I try
> to re-register the old master as slave, the pg_rewind command fails. Details below.
 
...
>   1.  Rewind again:
>   2.  -bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data --source-server="host=10.209.57.17  port=5432
> user=postgres dbname=postgres"
 
> 
> pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53
> 
> pg_rewind: error: could not open file "/pg_mnt/pg-12/data/pg_wal/0000003500000006000000B9": No such file or
> directory
> 
> pg_rewind: fatal: could not find previous WAL record at 6/B9FFFFD8
> 
> 
> I have tried this multiple times but always face the same error. Can someone help me resolve this?

As the error message says, a required WAL file has been removed on
the old master.  This is normal behavior and is described in the
documentation.

https://www.postgresql.org/docs/12/app-pgrewind.html

> but if the target cluster ran for a long time after the divergence,
> the old WAL files might no longer be present. In that case, they can
> be manually copied from the WAL archive to the pg_wal directory, or
> fetched on startup by configuring primary_conninfo or restore_command.

So you need to restore the required WAL files from the archive or
from the current master.
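In the setup described in this thread, that would mean copying the segment named in the error from the archive host back into the old master's pg_wal and retrying pg_rewind. A sketch (the helper name is made up, and it uses a local cp against a mounted archive for illustration; the thread's setup would use scp from root@10.209.56.88:/pg_backup instead):

```shell
# Sketch: put a missing WAL segment back into pg_wal, then retry pg_rewind.
restore_wal_segment() {
  local archive=$1 pgdata=$2 seg=$3
  cp "$archive/$seg" "$pgdata/pg_wal/$seg"
}

# Illustrative usage against the paths from this thread:
# restore_wal_segment /pg_backup /pg_mnt/pg-12/data 0000003500000006000000B9
# /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data \
#   --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"
```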

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [EXTERNAL] Re: PostgreSQL-12 replication failover, pg_rewind fails

From
Mariya Rampurawala
Date:
Hi,

Thank you for the response.

    > but if the target cluster ran for a long time after the divergence,
    > the old WAL files might no longer be present. In that case, they can
    > be manually copied from the WAL archive to the pg_wal directory, or
    > fetched on startup by configuring primary_conninfo or restore_command.

I hit this issue every time I follow the aforementioned steps, manually as well as with scripts.
How long is "a long time after the divergence"? Is there a way I can make some configuration changes so that I don't hit this issue?
Is there anything I must change in my restore_command?

===================================
primary_conninfo = 'user=replicator host=10.209.57.16 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
 
restore_command = 'scp  root@10.209.56.88:/pg_backup/%f %p'
===================================

Regards,
Mariya



Re: [EXTERNAL] Re: PostgreSQL-12 replication failover, pg_rewind fails

From
Laurenz Albe
Date:
On Tue, 2020-05-12 at 09:40 +0000, Mariya Rampurawala wrote:
> > but if the target cluster ran for a long time after the divergence,
> > the old WAL files might no longer be present. In that case, they can
> > be manually copied from the WAL archive to the pg_wal directory, or
> > fetched on startup by configuring primary_conninfo or restore_command.
> 
> I hit this issue every time I follow the aforementioned steps, manually as well as with scripts.
> How long is "long time after divergence"? Is there a way I can make some
> configuration changes so that I don’t hit this issue?
> Is there anything I must change in my restore command?

What you can do is use a higher value for "wal_keep_segments".
Then PostgreSQL will keep that number of old WAL segments around,
which increases the chance for "pg_rewind" to succeed.
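For instance (the value below is an illustrative guess, not a tuned recommendation; with the default 16 MB segment size, 256 segments retain roughly 4 GB of WAL):

```
# postgresql.conf on both nodes; the value is illustrative only
wal_keep_segments = 256    # retain at least 256 x 16 MB = 4 GB of WAL in pg_wal
```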

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: [EXTERNAL] Re: PostgreSQL-12 replication failover, pg_rewind fails

From
Kyotaro Horiguchi
Date:
At Tue, 12 May 2020 17:29:50 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in 
> On Tue, 2020-05-12 at 09:40 +0000, Mariya Rampurawala wrote:
> > > but if the target cluster ran for a long time after the divergence,
> > > the old WAL files might no longer be present. In that case, they can
> > > be manually copied from the WAL archive to the pg_wal directory, or
> > > fetched on startup by configuring primary_conninfo or restore_command.
> > 
> > I hit this issue every time I follow the aforementioned steps, manually as well as with scripts.
> > How long is "long time after divergence"? Is there a way I can make some
> > configuration changes so that I don’t hit this issue?
> > Is there anything I must change in my restore command?

As mentioned in the documentation, pg_rewind uses the WAL records
starting from the last checkpoint just before the divergence point. The
divergence point is shown as follows in your log messages.

> pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53

pg_rewind scans backward starting from that location to find a
checkpoint record, which is the oldest WAL record pg_rewind needs.  As
you can see, this is not something that can be calculated in advance;
there is no way other than actually restoring the WAL segment files
and scanning them.

> What you can do is to use a higher value for "wal_keep_segments".
> Then PostgreSQL will keep around that number of old WAL segments,
> which increases the chance for "pg_rewind" to succeed.

So this is one effective way to reduce the chance of losing the
required WAL segment files.

On PG12, an easy way to restore all required WAL files is to restore
each missing file whenever pg_rewind complains about it, and retry.

Or, you could use pg_waldump to find a checkpoint record.
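A sketch of that last approach: run pg_waldump over the restored segments and look for checkpoint records. The sample line below is hypothetical, standing in for real pg_waldump output, so the filter can be shown self-contained:

```shell
# Real usage would be something like:
#   /usr/pgsql-12/bin/pg_waldump /pg_backup/0000003500000006000000B9 | grep CHECKPOINT
# The sample line below is hypothetical, for illustration only.
sample='rmgr: XLOG  len (rec/tot): 106/106, lsn: 6/B9000028, desc: CHECKPOINT_ONLINE redo 6/B8FFFF70'
printf '%s\n' "$sample" | grep -o 'CHECKPOINT_[A-Z]*'
# -> CHECKPOINT_ONLINE
```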

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [EXTERNAL] Re: PostgreSQL-12 replication failover, pg_rewind fails

From
Mariya Rampurawala
Date:
Thank you Kyotaro and Laurenz for your quick responses.
This helped me get my setup working.

Regards,
Mariya



Re: [EXTERNAL] Re: PostgreSQL-12 replication failover, pg_rewind fails

From
Michael Paquier
Date:
On Wed, May 13, 2020 at 04:58:15AM +0000, Mariya Rampurawala wrote:
> Thank you Kyotaro and Laurenz for your quick responses.
> This helped me get my setup working.

Please note that in Postgres 13 we have added the possibility for
pg_rewind to use a restore_command, if that parameter is set in the
target cluster's configuration.  This way, you can fetch missing WAL
segments from the archives during the rewind operation, without
needing to rely on wal_keep_segments or the like.
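On PostgreSQL 13 the retry loop from this thread collapses to a single switch; a command fragment sketching it (not runnable in the thread's PG12 setup; the binary path is assumed):

```
# PostgreSQL 13+: let pg_rewind run the target cluster's restore_command
# itself to fetch any missing WAL segments from the archive.
/usr/pgsql-13/bin/pg_rewind -D /pg_mnt/pg-12/data --restore-target-wal \
  --source-server="host=10.209.57.17 port=5432 user=postgres dbname=postgres"
```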
--
Michael
