Postgres HA issue - Standby server does not start after Mastercompute host is shut down - Mailing list pgsql-general

From Nagendra Bandi
Subject Postgres HA issue - Standby server does not start after Mastercompute host is shut down
Date
Msg-id CAO8W-N9ZPacOAJ4PqaPxM7KCEMca-4LbQBOd1_iXJ=o6gJWvCg@mail.gmail.com
Whole thread Raw
Responses Re: Postgres HA issue - Standby server does not start after Mastercompute host is shut down
List pgsql-general
Dear community,
Looking for help on HA issue. If you could provide any hints on what I am missing that would be greatly appreciated.

Problem Description :

PostgreSQL database is set up for High Availability and asynchronous streaming replication with hot standby as described in https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION. There is only 1 master and 1 standby server set up. The deployment is on Openstack cloud where master database is running on one compute host and standby server is running on another compute host. When master is brought down (compute host is powered off), stand by is not able to take over. The PostgreSQL database server on the standby fails to start with errors messages given below.

Error logs from Standby Postgres DB: 

Logs form postgres log : 
< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150

< 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118

< 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup

< 2019-08-22 16:00:04.115 UTC >HINT: All WAL generated while online backup was taken must be available at recovery.

< 2019-08-22 16:00:04.116 UTC >LOG: startup process (PID 10450) exited with exit code 1

< 2019-08-22 16:00:04.116 UTC >LOG: aborting startup due to startup process failure

< 2019-08-22 16:00:06.692 UTC >LOG: database system was interrupted while in recovery at 2019-08-22 16:00:04 UTC

< 2019-08-22 16:00:06.692 UTC >HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.

< 2019-08-22 16:00:06.777 UTC >LOG: database system was not properly shut down; automatic recovery in progress

< 2019-08-22 16:00:06.780 UTC >LOG: redo starts at C/513BF078

< 2019-08-22 16:00:06.780 UTC >LOG: invalid record length at C/513BF150

< 2019-08-22 16:00:06.780 UTC >LOG: redo done at C/513BF118

< 2019-08-22 16:00:06.783 UTC >FATAL: WAL ends before end of online backup

< 2019-08-22 16:00:06.783 UTC >HINT: All WAL generated while online backup was taken must be available at recovery.

< 2019-08-22 16:00:06.785 UTC >LOG: startup process (PID 10645) exited with exit code 1

< 2019-08-22 16:00:06.785 UTC >LOG: aborting startup due to startup process failure

 

NOTE :

This issue is seen only when "compute host" is brought down with database size of about 2GB. Tests are repeated few times and issue is reproducible. However, issue is NOT seen when VM is brought down. i.e. When master is brought down (by making respective "VM" is powered off), standby is able to take over.

 

Alternatives Tried : 
Changed the streaming replication from asynchronous mode to synchronous mode (as described in
https://www.postgresql.org/docs/9.4/warm-standby.html#SYNCHRONOUS-REPLICATION) but even this alternative gave the same erroneous behaviour. Logs attached for this sync case as well.

 

Version Info :

POSTGRES Version : 9.4. Community edition.

OS Version: RHEL 7.2

Openstack version: Newton

Config Files:

master_postgresql.conf : postgresql.conf on the Master.

standby_postgresql.conf : postgresql.conf on the Standby.

standby_recovery.conf : recovery.conf on the Standby.


Log Files :

Log File name for async replication test case: 20190901_102820_pg_standby_debug_Async_replication.log

Description:

a.    Enabled debug on Postgres configuration by setting “log_min_messages = debug5”.

b.    The time when the test case is executed is “2019-09-01 10:28:20”, when the trigger file is found by Postgres.

 

Log File name for synchronous replication test case: 20190901_154356_pg_standby_debug_synchronous_replication.log

Description:

a)    Enabled debug on Postgres configuration by setting “log_min_messages = debug5”.

b)    The time when test case is executed is “2019-09-01 15:43:56”, when the trigger file is found by Postgres.


Appreciate your help.

Regards,
Nagendra
Attachment

pgsql-general by date:

Previous
From: Paul Ramsey
Date:
Subject: Re: Slow statement using parallelism after 9.6>11 upgrade
Next
From: Imre Samu
Date:
Subject: Re: Slow statement using parallelism after 9.6>11 upgrade