Thread: Postgres HA issue - Standby server does not start after Mastercompute host is shut down

Dear community,
Looking for help on HA issue. If you could provide any hints on what I am missing that would be greatly appreciated.

Problem Description :

PostgreSQL database is set up for High Availability and asynchronous streaming replication with hot standby as described in https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION. There is only 1 master and 1 standby server set up. The deployment is on Openstack cloud where master database is running on one compute host and standby server is running on another compute host. When master is brought down (compute host is powered off), stand by is not able to take over. The PostgreSQL database server on the standby fails to start with errors messages given below.

Error logs from Standby Postgres DB: 

Logs form postgres log : 
< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150

< 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118

< 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup

< 2019-08-22 16:00:04.115 UTC >HINT: All WAL generated while online backup was taken must be available at recovery.

< 2019-08-22 16:00:04.116 UTC >LOG: startup process (PID 10450) exited with exit code 1

< 2019-08-22 16:00:04.116 UTC >LOG: aborting startup due to startup process failure

< 2019-08-22 16:00:06.692 UTC >LOG: database system was interrupted while in recovery at 2019-08-22 16:00:04 UTC

< 2019-08-22 16:00:06.692 UTC >HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.

< 2019-08-22 16:00:06.777 UTC >LOG: database system was not properly shut down; automatic recovery in progress

< 2019-08-22 16:00:06.780 UTC >LOG: redo starts at C/513BF078

< 2019-08-22 16:00:06.780 UTC >LOG: invalid record length at C/513BF150

< 2019-08-22 16:00:06.780 UTC >LOG: redo done at C/513BF118

< 2019-08-22 16:00:06.783 UTC >FATAL: WAL ends before end of online backup

< 2019-08-22 16:00:06.783 UTC >HINT: All WAL generated while online backup was taken must be available at recovery.

< 2019-08-22 16:00:06.785 UTC >LOG: startup process (PID 10645) exited with exit code 1

< 2019-08-22 16:00:06.785 UTC >LOG: aborting startup due to startup process failure

 

NOTE :

This issue is seen only when "compute host" is brought down with database size of about 2GB. Tests are repeated few times and issue is reproducible. However, issue is NOT seen when VM is brought down. i.e. When master is brought down (by making respective "VM" is powered off), standby is able to take over.

 

Alternatives Tried : 
Changed the streaming replication from asynchronous mode to synchronous mode (as described in
https://www.postgresql.org/docs/9.4/warm-standby.html#SYNCHRONOUS-REPLICATION) but even this alternative gave the same erroneous behaviour. Logs attached for this sync case as well.

 

Version Info :

POSTGRES Version : 9.4. Community edition.

OS Version: RHEL 7.2

Openstack version: Newton

Config Files:

master_postgresql.conf : postgresql.conf on the Master.

standby_postgresql.conf : postgresql.conf on the Standby.

standby_recovery.conf : recovery.conf on the Standby.


Log Files :

Log File name for async replication test case: 20190901_102820_pg_standby_debug_Async_replication.log

Description:

a.    Enabled debug on Postgres configuration by setting “log_min_messages = debug5”.

b.    The time when the test case is executed is “2019-09-01 10:28:20”, when the trigger file is found by Postgres.

 

Log File name for synchronous replication test case: 20190901_154356_pg_standby_debug_synchronous_replication.log

Description:

a)    Enabled debug on Postgres configuration by setting “log_min_messages = debug5”.

b)    The time when test case is executed is “2019-09-01 15:43:56”, when the trigger file is found by Postgres.


Appreciate your help.

Regards,
Nagendra
Attachment

Re: Postgres HA issue - Standby server does not start after Mastercompute host is shut down

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
On Tue, 3 Sep 2019 20:27:37 +0530
Nagendra Bandi <nagendra.bandi@gmail.com> wrote:
...
> *Problem Description :*
> 
> PostgreSQL database is set up for High Availability and *asynchronous
> streaming replication* with *hot standby* as described in
> https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION.
> There is only 1 master and 1 standby server set up. The deployment is on
> Openstack cloud where master database is running on one compute host and
> standby server is running on another compute host. When master is brought
> down (*compute host is powered off*), stand by is *not* able to take over.
> The PostgreSQL database server on the standby fails to start with errors
> messages given below.
> 
> *Error logs from Standby Postgres DB:*
> 
> *Logs form postgres log :
> *< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150
> 
> < 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118
> 
> < 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup

It looks like the standby never received the end of online backup record from
the master.

Could you describe how you built your standby?

Regards,



Hi Jehan,
Thanks for your quick response.
Standby is built from the primary using pg_basebackup. i.e. Initial copy of the primary database is taken with pg_basebackup command and then restarted the server.
Regards,
Nagendra

On Tue, Sep 3, 2019 at 8:43 PM Jehan-Guillaume (ioguix) de Rorthais <ioguix@free.fr> wrote:
On Tue, 3 Sep 2019 20:27:37 +0530
Nagendra Bandi <nagendra.bandi@gmail.com> wrote:
...
> *Problem Description :*
>
> PostgreSQL database is set up for High Availability and *asynchronous
> streaming replication* with *hot standby* as described in
> https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION.
> There is only 1 master and 1 standby server set up. The deployment is on
> Openstack cloud where master database is running on one compute host and
> standby server is running on another compute host. When master is brought
> down (*compute host is powered off*), stand by is *not* able to take over.
> The PostgreSQL database server on the standby fails to start with errors
> messages given below.
>
> *Error logs from Standby Postgres DB:*
>
> *Logs form postgres log :
> *< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150
>
> < 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118
>
> < 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup

It looks like the standby never received the end of online backup record from
the master.

Could you describe how you built your standby?

Regards,

Re: Postgres HA issue - Standby server does not start after Mastercompute host is shut down

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
On Tue, 3 Sep 2019 21:31:23 +0530
Nagendra Bandi <nagendra.bandi@gmail.com> wrote:

> Hi Jehan,
> Thanks for your quick response.
> Standby is built from the primary using pg_basebackup. i.e. Initial copy of
> the primary database is taken with pg_basebackup command

pg_basebackup is supposed to make an non-exclusive backup, so this error
shouldn't happen.

Did you check the online backup actually succeed?

> and then restarted the server.

What is this server restart about? Do you restart the primary? Is it a simple
startup of the standby?

There something wrong somewhere in your procedure. Could you share your precise
procedure steps please?

> On Tue, Sep 3, 2019 at 8:43 PM Jehan-Guillaume (ioguix) de Rorthais <
> ioguix@free.fr> wrote:  
> 
> > On Tue, 3 Sep 2019 20:27:37 +0530
> > Nagendra Bandi <nagendra.bandi@gmail.com> wrote:
> > ...  
> > > *Problem Description :*
> > >
> > > PostgreSQL database is set up for High Availability and *asynchronous
> > > streaming replication* with *hot standby* as described in
> > >  
> > https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION
> > .  
> > > There is only 1 master and 1 standby server set up. The deployment is on
> > > Openstack cloud where master database is running on one compute host and
> > > standby server is running on another compute host. When master is brought
> > > down (*compute host is powered off*), stand by is *not* able to take  
> > over.  
> > > The PostgreSQL database server on the standby fails to start with errors
> > > messages given below.
> > >
> > > *Error logs from Standby Postgres DB:*
> > >
> > > *Logs form postgres log :
> > > *< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150
> > >
> > > < 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118
> > >
> > > < 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online  
> > backup
> >
> > It looks like the standby never received the end of online backup record
> > from
> > the master.
> >
> > Could you describe how you built your standby?
> >
> > Regards,
> >  



-- 
Jehan-Guillaume de Rorthais
Dalibo



On Tue, Sep 3, 2019 at 6:08 PM Nagendra Bandi <nagendra.bandi@gmail.com> wrote:
> Standby is built from the primary using pg_basebackup. i.e. Initial copy of the primary database is taken with
pg_basebackupcommand and then restarted the server.
 

This puzzles me:

< 2019-09-01 15:43:56.440 UTC >LOG:  trigger file found:
/var/opt/rh/rh-postgresql94/lib/pgsql/trigger_switch
< 2019-09-01 15:43:56.440 UTC >FATAL:  terminating walreceiver process
due to administrator command

< 2019-09-01 15:43:56.443 UTC >DEBUG:  switched WAL source from stream
to archive after failure
< 2019-09-01 15:43:56.443 UTC >LOG:  record with zero length at C/95193A70
< 2019-09-01 15:43:56.443 UTC >LOG:  redo done at C/95193A38
< 2019-09-01 15:43:56.443 UTC >LOG:  last completed transaction was at
log time 2019-09-01 15:39:18.804265+00
< 2019-09-01 15:43:56.443 UTC >DEBUG:  resetting unlogged relations:
cleanup 0 init 1
< 2019-09-01 15:43:56.448 UTC >FATAL:  WAL ends before end of online backup
< 2019-09-01 15:43:56.448 UTC >HINT:  All WAL generated while online
backup was taken must be available at recovery.

so the standby found the trigger file and terminated the wal receiver,
but then tried to get wals from local archive, why?
Seems to me the local WALs are not completed or removed, and so the
standby cannot get the last part, I suspect this is because there's a
restore_command that is not streaming (and that should be fine, since
it is optional for streaming).
Also the version is quite old.

Luca