Thread: Streaming replication: PANIC on tertiary when secondary promoted
Hi,
I had a cascade serverA->serverB->serverC->serverD of Postgres 10.14 servers connected with streaming replication.
There was no archive shipping set up, but an empty directory, /data/pg_archive/10/dedupe_shard1_10/, was referenced for it in the configuration on each of the servers.
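For context, the stray "cp: cannot stat ..." lines interleaved with the logs below appear to come from a restore_command pointing at that empty directory. A minimal sketch of what recovery.conf on the standbys presumably looked like (the connection details here are hypothetical):

standby_mode = 'on'
primary_conninfo = 'host=serverB port=5432 user=replication'   # serverC streams from serverB, serverD from serverC
recovery_target_timeline = 'latest'
restore_command = 'cp /data/pg_archive/10/dedupe_shard1_10/%f %p'   # would produce the cp errors seen below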
When I promoted serverB, serverC crashed:
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG: replication terminated by primary server
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-2] 2021-06-16 14:45:43.717 UTC [43934] {-} DETAIL: End of WAL reached on timeline 1 at 190B0/36000000.
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [8-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG: fetching timeline history file for timeline 2 from primary server
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[33222]: [9-1] 2021-06-16 14:45:43.736 UTC [33222] {-} LOG: new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [9-1] 2021-06-16 14:45:43.746 UTC [43934] {-} LOG: restarted WAL streaming at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [10-1] 2021-06-16 14:45:43.746 UTC [43934] {-} FATAL: could not receive data from WAL stream: ERROR: requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [7-1] 2021-06-16 14:45:43.764 UTC [93365] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [8-1] 2021-06-16 14:45:43.764 UTC [93365] {-} FATAL: could not receive data from WAL stream: ERROR: requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [7-1] 2021-06-16 14:45:48.771 UTC [93421] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [10-1] 2021-06-16 14:45:48.792 UTC [33222] {-} LOG: invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [8-1] 2021-06-16 14:45:48.793 UTC [93421] {-} FATAL: terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [11-1] 2021-06-16 14:45:48.803 UTC [33222] {-} PANIC: could not open file "pg_wal/00000002000190B000000035": No such file or directory
After I started it, it seems to have caught up:
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100187]: [8-1] 2021-06-16 14:59:09.826 UTC [100187] {-} LOG: consistent recovery state reached at 190B0/35FFFFB8
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100172]: [5-1] 2021-06-16 14:59:09.827 UTC [100172] {-} LOG: database system is ready to accept read only connections
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[100187]: [9-1] 2021-06-16 14:59:10.281 UTC [100187] {-} LOG: invalid record length at 190B0/36000898: wanted 24, got 0
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101331]: [6-1] 2021-06-16 14:59:10.292 UTC [101331] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101332]: [6-1] 2021-06-16 14:59:10.332 UTC [101332] {[unknown]-[local]} [unknown]@[unknown] LOG: incomplete startup packet
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-1] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] ERROR: requested starting point 190B0/36000000 on timeline 1 is not in this server's history
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-2] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] DETAIL: This server's history forked from timeline 1 at 190B0/35FFFFB8.
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [6-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG: could not receive data from client: Connection reset by peer
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [7-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG: unexpected EOF on standby connection
But now serverD crashed in a similar fashion:
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [6-1] 2021-06-16 14:59:10.500 UTC [29127] {-} LOG: fetching timeline history file for timeline 2 from primary server
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-1] 2021-06-16 14:59:10.509 UTC [29127] {-} FATAL: could not start WAL streaming: ERROR: requested starting point 190B0/36000000 on timeline 1 is not in this server's history
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-2] DETAIL: This server's history forked from timeline 1 at 190B0/35FFFFB8.
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [9-1] 2021-06-16 14:59:10.524 UTC [32792] {-} LOG: new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [6-1] 2021-06-16 14:59:10.556 UTC [29138] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [10-1] 2021-06-16 14:59:10.588 UTC [32792] {-} LOG: invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [7-1] 2021-06-16 14:59:10.588 UTC [29138] {-} FATAL: terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [11-1] 2021-06-16 14:59:10.598 UTC [32792] {-} PANIC: could not open file "pg_wal/00000002000190B000000035": No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29143]: [6-1] 2021-06-16 14:59:10.616 UTC [29143] {-} LOG: started streaming WAL from primary at 190B0/35000000 on timeline 2
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [6-1] 2021-06-16 14:59:11.505 UTC [32790] {-} LOG: startup process (PID 32792) was terminated by signal 6: Aborted
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [7-1] 2021-06-16 14:59:11.506 UTC [32790] {-} LOG: terminating any other active server processes
It too came back after a restart.
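A sanity check along these lines (a sketch, using the PostgreSQL 10 names of the views and functions) should show whether a restarted standby is streaming again and already on the new timeline:

-- run on serverC and serverD after the restart
select status, received_lsn, received_tli from pg_stat_wal_receiver;
select pg_last_wal_replay_lsn();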
Is this all expected? Is the data safe and consistent now, or should I re-init serverC and serverD?
Best regards, Alex
What is your recovery_target_timeline set to on the replicas?

I just did a primary -> replica -> cascading replica setup, and then promoted the replica as the new primary. The cascading replica kept working fine, no restarts required.

For me recovery_target_timeline was set to 'latest'. I have pg14beta installed, btw.

initdb -D primary
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

postgres@db:~/playground$ pg_ctl -D primary -l logfile start
waiting for server to start.... done
server started
postgres@db:~/playground$ psql
psql (14beta1)
Type "help" for help.

postgres=# select pg_create_physical_replication_slot('replica');
 pg_create_physical_replication_slot
-------------------------------------
 (replica,)
(1 row)

postgres=# create table t(id int primary key);
CREATE TABLE
postgres=# insert into t select x from generate_series(1, 100) x; checkpoint;
INSERT 0 100
postgres=# \q

-- create a replica
postgres@db:~/playground$ pg_basebackup -D replica -R -X stream -S replica -v -d "dbname=postgres port=5432" -U postgres
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/2000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: write-ahead log end point: 0/2000138
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
postgres@db:~/playground$ vim replica/postgresql.conf

-- start the replica (port 5433)
postgres@db:~/playground$ pg_ctl -D replica -l replicalog start
waiting for server to start.... done
server started
postgres@db:~/playground$ psql -p 5433 -c 'select count(1) from t;'
 count
-------
   100
(1 row)

postgres@db:~/playground$ psql -p 5433
psql (14beta1)
Type "help" for help.

-- create a replica slot for cascading streaming replication
postgres=# select pg_create_physical_replication_slot('cascading_replica');
 pg_create_physical_replication_slot
-------------------------------------
 (cascading_replica,)
(1 row)

postgres=# \q

-- create a cascading replica off replica
postgres@db:~/playground$ pg_basebackup -D cascading_replica -R -X stream -S cascading_replica -v -d "dbname=postgres port=5433" -U postgres
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/3000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: write-ahead log end point: 0/30000D8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
postgres@db:~/playground$ vim cascading_replica/postgresql.conf
postgres@db:~/playground$ pg_ctl -D cascading_replica -l creplica start
waiting for server to start.... done
server started

-- validate receiving data fine.
postgres@db:~/playground$ psql -p 5434 -c 'select count(1) from t;'
 count
-------
   100
(1 row)

-- stop primary
postgres@db:~/playground$ pg_ctl -D primary -l logfile stop
waiting for server to shut down.... done
server stopped

-- promote replica to new primary
postgres@db:~/playground$ psql -p 5433
psql (14beta1)
Type "help" for help.

postgres=# select pg_promote();
 pg_promote
------------
 t
(1 row)

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)

postgres=# \q

-- do some dml, validate changes replayed to new replica.
postgres@db:~/playground$ psql -p 5434 -c 'select count(1) from t;'
 count
-------
   100
(1 row)

postgres@db:~/playground$ psql -p 5433 -c 'select count(1) from t;'
 count
-------
   100
(1 row)

postgres@db:~/playground$ psql -p 5433 -c 'delete from t where id < 50;'
DELETE 49
postgres@db:~/playground$ psql -p 5433 -c 'select count(1) from t;'
 count
-------
    51
(1 row)

postgres@db:~/playground$ psql -p 5434 -c 'select count(1) from t;'
 count
-------
    51
(1 row)

In all my cases recovery_target_timeline was set to 'latest', and I did not receive any PANIC messages in the logs.
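One difference from the original report: pg_promote() only exists since PostgreSQL 12, so on the 10.14 servers above the promotion would have been done with pg_ctl (or a trigger file) instead, e.g.:

postgres@db:~/playground$ pg_ctl -D replica promote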
I had it "latest" as well. I'll try to reproduce it again tomorrow. On 16/06/2021 17:20, Vijaykumar Jain wrote: > What is your recovery_target_timeline set to on replicas ? > > I just did a primary -> replica -> cascading replica setup. and then > promoted replica as new primary. > cascading replica was working fine, no restarts required. > > for me recovery_target_timeline was set to 'latest' > > i have pg14beta installed btw. > > initdb -D primary > The files belonging to this database system will be owned by user "postgres". > This user must also own the server process. > > postgres@db:~/playground$ pg_ctl -D primary -l logfile start > waiting for server to start.... done > server started > postgres@db:~/playground$ psql > psql (14beta1) > Type "help" for help. > > postgres=# select pg_create_physical_replication_slot('replica'); > pg_create_physical_replication_slot > ------------------------------------- > (replica,) > (1 row) > > postgres=# create table t(id int primary key); > CREATE TABLE > postgres=# insert into t select x from generate_series(1, 100) x; checkpoint; > INSERT 0 100 > postgres=# \q > > -- create a replica > postgres@db:~/playground$ pg_basebackup -D replica -R -X stream -S > replica -v -d "dbname=postgres port=5432" -U postgres > pg_basebackup: initiating base backup, waiting for checkpoint to complete > pg_basebackup: checkpoint completed > pg_basebackup: write-ahead log start point: 0/2000028 on timeline 1 > pg_basebackup: starting background WAL receiver > pg_basebackup: write-ahead log end point: 0/2000138 > pg_basebackup: waiting for background process to finish streaming ... > pg_basebackup: syncing data to disk ... > pg_basebackup: renaming backup_manifest.tmp to backup_manifest > pg_basebackup: base backup completed > postgres@db:~/playground$ vim replica/postgresql.conf > > --start the replica (port 5433) > postgres@db:~/playground$ pg_ctl -D replica -l replicalog start > waiting for server to start.... done > server started > postgres@db:~/playground$ psql -p 5433 -c 'select count(1) from t;' > count > ------- > 100 > (1 row) > > postgres@db:~/playground$ psql -p 5433 > psql (14beta1) > Type "help" for help. > > > -- create a replica slot for cascading streaming replication > postgres=# select pg_create_physical_replication_slot('cascading_replica'); > pg_create_physical_replication_slot > ------------------------------------- > (cascading_replica,) > (1 row) > > postgres=# \q > > -- create a cascading replica off replica > > postgres@db:~/playground$ pg_basebackup -D cascading_replica -R -X > stream -S cascading_replica -v -d "dbname=postgres port=5433" -U > postgres > pg_basebackup: initiating base backup, waiting for checkpoint to complete > pg_basebackup: checkpoint completed > pg_basebackup: write-ahead log start point: 0/3000028 on timeline 1 > pg_basebackup: starting background WAL receiver > pg_basebackup: write-ahead log end point: 0/30000D8 > pg_basebackup: waiting for background process to finish streaming ... > pg_basebackup: syncing data to disk ... > pg_basebackup: renaming backup_manifest.tmp to backup_manifest > pg_basebackup: base backup completed > postgres@db:~/playground$ vim cascading_replica/postgresql.conf > postgres@db:~/playground$ pg_ctl -D cascading_replica -l creplica start > waiting for server to start.... done > server started > > -- validate receiving data fine. 
On 16/06/2021 20:31, Alexey Bashtanov wrote:
> I had it "latest" as well.
> I'll try to reproduce it again tomorrow.

I cannot quite reproduce it artificially.

One more piece of detail: in the chain serverA->serverB->serverC->serverD, when serverB was promoted and serverC crashed, I had wal_compression enabled on serverA only.

Best regards, Alex
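For reference, a quick way to confirm where wal_compression differs across the nodes (run via psql against each of serverA..serverD):

show wal_compression;
select name, setting, source from pg_settings where name = 'wal_compression';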