Thread: BUG #17177: Secondary fails to start after upgrade from 13.3
The following bug has been logged on the website:

Bug reference:      17177
Logged by:          Andriy Bartash
Email address:      abartash@xmatters.com
PostgreSQL version: 13.4
Operating system:   CentOS7
Description:

Physical replication; primary and secondary are on 13.3. After upgrading the secondary first, I get:

2021-09-02 17:47:03.024 UTC [18068] (user=) (db=) (rhost=) (app=) [vxid: txid:0] [] LOG: unexpected timeline ID 2 in log segment 00000002000053E8000000E7, offset 4956160
2021-09-02 17:47:03.136 UTC [18068] (user=) (db=) (rhost=) (app=) [vxid: txid:0] [] LOG: unexpected timeline ID 2 in log segment 00000002000053E8000000E7, offset 5890048
2021-09-02 17:47:03.137 UTC [18171] (user=) (db=) (rhost=) (app=) [vxid: txid:0] [] FATAL: terminating walreceiver process due to administrator command

The cluster does not start. I tried replacing 00000002000053E8000000E7 from the primary manually, but it didn't help. Once PostgreSQL was downgraded to 13.3, the cluster started successfully in standby mode.

-----------------------------------------------------
pg_controldata output:
-----------------------------------------------------
pg_controldata -D /pgsql/cluster/data/
pg_control version number:            1300
Catalog version number:               202007201
Database system identifier:           6942522466342357095
Database cluster state:               in archive recovery
pg_control last modified:             Fri 03 Sep 2021 12:28:17 AM UTC
Latest checkpoint location:           53FC/37E028B8
Latest checkpoint's REDO location:    53FB/FAB28A40
Latest checkpoint's REDO WAL file:    00000002000053FB000000FA
Latest checkpoint's TimeLineID:       2
Latest checkpoint's PrevTimeLineID:   2

Please let me know if I need to provide anything else.

Thank you in advance,
Andriy
At Fri, 03 Sep 2021 00:31:49 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> ted timeline ID 2 in log segment
> 00000002000053E8000000E7, offset 4956160
> 2021-09-02 17:47:03.136 UTC [18068] (user=) (db=) (rhost=) (app=) [vxid:
> txid:0] [] LOG: unexpected timeline ID 2 in log segment
> 00000002000053E8000000E7, offset 5890048
> 2021-09-02 17:47:03.137 UTC [18171] (user=) (db=) (rhost=) (app=) [vxid:
> txid:0] [] FATAL: terminating walreceiver process due to administrator
> command
> Cluster is not starting.
> Tried to replace 00000002000053E8000000E7 from primary manually but it
> didn't help.
> Once downgraded postgres to 13.3, cluster started successfully in standby
> mode.
> pg_controldata -D /pgsql/cluster/data/
> Database cluster state:               in archive recovery
> Latest checkpoint location:           53FC/37E028B8
> Latest checkpoint's REDO WAL file:    00000002000053FB000000FA
> Latest checkpoint's TimeLineID:       2
> Latest checkpoint's PrevTimeLineID:   2

(The failing segment is older than the REDO location, but that seems to be because pg_controldata was run after the later successful startup.)

If 0826564292156762d32c183c6708c94564fcad1c is the cause, one possibility is that:

- You removed all WAL files before starting the standby but left the history files alone.
- The standby has a history file for a TLI greater than 2, and that file does not contain an entry for TLI 2.

Under those two conditions, no error was printed before that commit, but this error message is printed after it.

Do the above conditions match your environment? If so, the problem should be fixed by removing the extra history files (that is, the history files for TLI > 2).

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
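The cleanup Kyotaro suggests can be sketched in shell. This is only an illustration, not a command from the thread: the pg_wal path is taken from the report, the hard-coded TLI of 2 matches the pg_controldata output above, and `tli_of` is a helper name of my own. Run it with the standby stopped, and inspect before deleting anything.

```shell
# A timeline history file is named <TLI>.history, where <TLI> is the
# timeline ID as 8 hex digits; tli_of extracts it as a decimal number.
tli_of() { printf '%d\n' "0x${1%%.*}"; }

# Latest checkpoint's TimeLineID is 2 per the pg_controldata output,
# so any history file with TLI > 2 is a stray leftover.
PGWAL=${PGWAL:-/pgsql/cluster/data/pg_wal}   # illustrative path from the report
for f in "$PGWAL"/*.history; do
    [ -e "$f" ] || continue                  # glob matched nothing
    if [ "$(tli_of "$(basename "$f")")" -gt 2 ]; then
        echo "stray history file: $f"        # review, then: rm "$f"
    fi
done
```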
Hello Kyotaro
Thank you very much for your swift reply.
In my case there was indeed an extra .history file with TLI > 2:
ll /pgsql/cluster/data/pg_wal/|grep hist
-rw-------. 1 41 Mar 26 17:16 00000002.history
-rw-r--r--. 1 0 May 18 19:02 00000003.history
I have successfully reproduced the issue on TST environment.
I'll leave it up to you whether to close the bug or fix it (if necessary).
Once again, thank you