Home > mailing lists

Re: Recovery from WAL archives not advancing timeline? - Mailing list pgsql-admin

From	Don Seiler
Subject	Re: Recovery from WAL archives not advancing timeline?
Date	August 10, 2020 23:28:34
Msg-id	CAHJZqBB5X37hGJJyUR2QG-NHqjUnQVTVcWxRk0Yjwsi=tpxv+Q@mail.gmail.com Whole thread Raw
In response to	Re: Recovery from WAL archives not advancing timeline? (Don Seiler <don@seiler.us>)
Responses	Re: Recovery from WAL archives not advancing timeline? (Rui DeSousa <rui@crazybean.net>)
List	pgsql-admin

Tree view

On Sat, Aug 8, 2020 at 2:24 PM Don Seiler <don@seiler.us> wrote:

Doing some more testing and it seems to boil down to the fact that a replica should create a .partial file when it is promoted. I tested both PG 9.6 and 10, with replicas streaming vs restoring WAL files. The sequence was always this:

Stop primary (A)
Observe primary writes on final WAL file and it is archived
Promote replica (B)
Add recovery.conf to A for just WAL recovery and start.
In cases where the replica B was a streaming replica, upon promotion it would create a .partial file in the old timeline and then a history file in the new timeline and proceed. These files were then archived by wal-g and the new replica (A) would recover them and continue replication.

If replica B was not streaming to begin with (so it was just restoring and recovering WAL from archives), then upon promotion it would NOT create a .partial file. It would just create the .history file and open up. When A is then started in step 4, it would complain like this (in PG 9.6):

2020-08-08 19:13:43.337 UTC [17837] LOG: database system was shut down at 2020-08-08 19:09:24 UTC

2020-08-08 19:13:44.580 UTC [17837] LOG: restored log file "00000002.history" from archive

ERROR: 2020/08/08 19:13:45.213175 Archive '00000003.history' does not exist.
2020-08-08 19:13:45.214 UTC [17837] LOG: entering standby mode

2020-08-08 19:13:45.676 UTC [17837] LOG: restored log file "00000002.history" from archive
2020-08-08 19:13:46.327 UTC [17837] LOG: restored log file "000000020000000000000006" from archive

2020-08-08 19:13:46.381 UTC [17837] LOG: invalid xl_info in primary checkpoint record
2020-08-08 19:13:46.381 UTC [17837] PANIC: could not locate a valid checkpoint record

2020-08-08 19:13:48.095 UTC [17836] LOG: startup process (PID 17837) was terminated by signal 6: Aborted
2020-08-08 19:13:48.095 UTC [17836] LOG: aborting startup due to startup process failure
2020-08-08 19:13:48.105 UTC [17836] LOG: database system is shut down

I even tested forcing an additional log switch from the new primary (which is the example you see above) but it still doesn't jive.

I'm not sure if this is a bug or expected behavior with non-streaming replicas but I believe it means that all of my old primaries need to be restored from new backups as the new primaries were all restore/recovery setup (due to VPN bandwidth limitations that made streaming difficult). Unless someone has a way forward?

I'm following up to see if anyone else would categorize this as a bug. It renders my old primaries dead until I do a restore on them.

Don.

Don Seiler
www.seiler.us

pgsql-admin by date:

From: Mohammed Afsar
Date: 10 August 2020, 14:51:44
Subject: Postgresql background writer and checkpointer memory utilization in Linux server

From: Rui DeSousa
Date: 11 August 2020, 03:03:54
Subject: Re: Recovery from WAL archives not advancing timeline?

Re: Recovery from WAL archives not advancing timeline? - Mailing list pgsql-admin

Previous

Next