Streaming replication: PANIC on tertiary when secondary promoted - Mailing list pgsql-general

From Alexey Bashtanov
Subject Streaming replication: PANIC on tertiary when secondary promoted
Date
Msg-id 887ed598-ccc4-4a82-0be9-f5065a026f2c@imap.cc
Whole thread Raw
Responses Re: Streaming replication: PANIC on tertiary when secondary promoted
List pgsql-general
Hi,

I had a cascade serverA->serverB->serverC->serverD of Postgres 10.14 servers connected with streaming replication.
There was no archive shipping set up, but there was an empty directory /data/pg_archive/10/dedupe_shard1_10/ mentioned in config for it on each of the servers.

When I promoted serverB, serverC crashed:

Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG:  replication terminated by primary server
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-2] 2021-06-16 14:45:43.717 UTC [43934] {-} DETAIL:  End of WAL reached on timeline 1 at 190B0/36000000.
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [8-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG:  fetching timeline history file for timeline 2 from primary server
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[33222]: [9-1] 2021-06-16 14:45:43.736 UTC [33222] {-} LOG:  new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [9-1] 2021-06-16 14:45:43.746 UTC [43934] {-} LOG:  restarted WAL streaming at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [10-1] 2021-06-16 14:45:43.746 UTC [43934] {-} FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [7-1] 2021-06-16 14:45:43.764 UTC [93365] {-} LOG:  started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [8-1] 2021-06-16 14:45:43.764 UTC [93365] {-} FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [7-1] 2021-06-16 14:45:48.771 UTC [93421] {-} LOG:  started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [10-1] 2021-06-16 14:45:48.792 UTC [33222] {-} LOG:  invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [8-1] 2021-06-16 14:45:48.793 UTC [93421] {-} FATAL:  terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [11-1] 2021-06-16 14:45:48.803 UTC [33222] {-} PANIC:  could not open file "pg_wal/00000002000190B000000035": No such file or directory

After I started it it seems to have caught up:

Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100187]: [8-1] 2021-06-16 14:59:09.826 UTC [100187] {-} LOG:  consistent recovery state reached at 190B0/35FFFFB8
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100172]: [5-1] 2021-06-16 14:59:09.827 UTC [100172] {-} LOG:  database system is ready to accept read only connections
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[100187]: [9-1] 2021-06-16 14:59:10.281 UTC [100187] {-} LOG:  invalid record length at 190B0/36000898: wanted 24, got 0
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101331]: [6-1] 2021-06-16 14:59:10.292 UTC [101331] {-} LOG:  started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101332]: [6-1] 2021-06-16 14:59:10.332 UTC [101332] {[unknown]-[local]} [unknown]@[unknown] LOG:  incomplete startup packet
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-1] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] ERROR:  requested starting point 190B0/36000000 on timeline 1 is not in this server's history
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-2] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] DETAIL:  This server's history forked from timeline 1 at 190B0/35FFFFB8.
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [6-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG:  could not receive data from client: Connection reset by peer
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [7-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG:  unexpected EOF on standby connection

But now serverD crashed in a similar fashion:

Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [6-1] 2021-06-16 14:59:10.500 UTC [29127] {-} LOG:  fetching timeline history file for timeline 2 from primary server
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-1] 2021-06-16 14:59:10.509 UTC [29127] {-} FATAL:  could not start WAL streaming: ERROR:  requested starting point 190B0/36000000 on timeline 1 is
 not in this server's history
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-2] #011DETAIL:  This server's history forked from timeline 1 at 190B0/35FFFFB8.
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [9-1] 2021-06-16 14:59:10.524 UTC [32792] {-} LOG:  new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [6-1] 2021-06-16 14:59:10.556 UTC [29138] {-} LOG:  started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [10-1] 2021-06-16 14:59:10.588 UTC [32792] {-} LOG:  invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [7-1] 2021-06-16 14:59:10.588 UTC [29138] {-} FATAL:  terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [11-1] 2021-06-16 14:59:10.598 UTC [32792] {-} PANIC:  could not open file "pg_wal/00000002000190B000000035": No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29143]: [6-1] 2021-06-16 14:59:10.616 UTC [29143] {-} LOG:  started streaming WAL from primary at 190B0/35000000 on timeline 2
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [6-1] 2021-06-16 14:59:11.505 UTC [32790] {-} LOG:  startup process (PID 32792) was terminated by signal 6: Aborted
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [7-1] 2021-06-16 14:59:11.506 UTC [32790] {-} LOG:  terminating any other active server processes

and it came back again too after restart.

Is it all expected? Is the data safe and consistent now or should I re-init serverC and serverD?

Best regards, Alex

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: A simple question about text fields
Next
From: Vijaykumar Jain
Date:
Subject: Re: Streaming replication: PANIC on tertiary when secondary promoted