Thread: Postgres HA issue - Standby server does not start after Mastercompute host is shut down
Postgres HA issue - Standby server does not start after Mastercompute host is shut down
Problem Description :
PostgreSQL database is set up for High Availability and asynchronous streaming replication with hot standby as described in https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION. There is only 1 master and 1 standby server set up. The deployment is on Openstack cloud where master database is running on one compute host and standby server is running on another compute host. When master is brought down (compute host is powered off), stand by is not able to take over. The PostgreSQL database server on the standby fails to start with errors messages given below.
Error logs from Standby Postgres DB:
Logs form postgres log : < 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150
< 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118
< 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup
< 2019-08-22 16:00:04.115 UTC >HINT: All WAL generated while online backup was taken must be available at recovery.
< 2019-08-22 16:00:04.116 UTC >LOG: startup process (PID 10450) exited with exit code 1
< 2019-08-22 16:00:04.116 UTC >LOG: aborting startup due to startup process failure
< 2019-08-22 16:00:06.692 UTC >LOG: database system was interrupted while in recovery at 2019-08-22 16:00:04 UTC
< 2019-08-22 16:00:06.692 UTC >HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
< 2019-08-22 16:00:06.777 UTC >LOG: database system was not properly shut down; automatic recovery in progress
< 2019-08-22 16:00:06.780 UTC >LOG: redo starts at C/513BF078
< 2019-08-22 16:00:06.780 UTC >LOG: invalid record length at C/513BF150
< 2019-08-22 16:00:06.780 UTC >LOG: redo done at C/513BF118
< 2019-08-22 16:00:06.783 UTC >FATAL: WAL ends before end of online backup
< 2019-08-22 16:00:06.783 UTC >HINT: All WAL generated while online backup was taken must be available at recovery.
< 2019-08-22 16:00:06.785 UTC >LOG: startup process (PID 10645) exited with exit code 1
< 2019-08-22 16:00:06.785 UTC >LOG: aborting startup due to startup process failure
NOTE :
This issue is seen only when "compute host" is brought down with database size of about 2GB. Tests are repeated few times and issue is reproducible. However, issue is NOT seen when VM is brought down. i.e. When master is brought down (by making respective "VM" is powered off), standby is able to take over.
Alternatives Tried :
Changed the streaming replication from asynchronous mode to synchronous mode (as described in https://www.postgresql.org/docs/9.4/warm-standby.html#SYNCHRONOUS-REPLICATION) but even this alternative gave the same erroneous behaviour. Logs attached for this sync case as well.
Version Info :
POSTGRES Version : 9.4. Community edition.
OS Version: RHEL 7.2
Openstack version: Newton
Config Files:
master_postgresql.conf : postgresql.conf on the Master.
standby_postgresql.conf : postgresql.conf on the Standby.
standby_recovery.conf : recovery.conf on the Standby.
Log Files :
Log File name for async replication test case: 20190901_102820_pg_standby_debug_Async_replication.log
Description:
a. Enabled debug on Postgres configuration by setting “log_min_messages = debug5”.
b. The time when the test case is executed is “2019-09-01 10:28:20”, when the trigger file is found by Postgres.
Log File name for synchronous replication test case: 20190901_154356_pg_standby_debug_synchronous_replication.log
Description:
a) Enabled debug on Postgres configuration by setting “log_min_messages = debug5”.
b) The time when test case is executed is “2019-09-01 15:43:56”, when the trigger file is found by Postgres.
Attachment
Re: Postgres HA issue - Standby server does not start after Mastercompute host is shut down
On Tue, 3 Sep 2019 20:27:37 +0530 Nagendra Bandi <nagendra.bandi@gmail.com> wrote: ... > *Problem Description :* > > PostgreSQL database is set up for High Availability and *asynchronous > streaming replication* with *hot standby* as described in > https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION. > There is only 1 master and 1 standby server set up. The deployment is on > Openstack cloud where master database is running on one compute host and > standby server is running on another compute host. When master is brought > down (*compute host is powered off*), stand by is *not* able to take over. > The PostgreSQL database server on the standby fails to start with errors > messages given below. > > *Error logs from Standby Postgres DB:* > > *Logs form postgres log : > *< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150 > > < 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118 > > < 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup It looks like the standby never received the end of online backup record from the master. Could you describe how you built your standby? Regards,
Re: Postgres HA issue - Standby server does not start after Mastercompute host is shut down
On Tue, 3 Sep 2019 20:27:37 +0530
Nagendra Bandi <nagendra.bandi@gmail.com> wrote:
...
> *Problem Description :*
>
> PostgreSQL database is set up for High Availability and *asynchronous
> streaming replication* with *hot standby* as described in
> https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION.
> There is only 1 master and 1 standby server set up. The deployment is on
> Openstack cloud where master database is running on one compute host and
> standby server is running on another compute host. When master is brought
> down (*compute host is powered off*), stand by is *not* able to take over.
> The PostgreSQL database server on the standby fails to start with errors
> messages given below.
>
> *Error logs from Standby Postgres DB:*
>
> *Logs form postgres log :
> *< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150
>
> < 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118
>
> < 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup
It looks like the standby never received the end of online backup record from
the master.
Could you describe how you built your standby?
Regards,
Re: Postgres HA issue - Standby server does not start after Mastercompute host is shut down
On Tue, 3 Sep 2019 21:31:23 +0530 Nagendra Bandi <nagendra.bandi@gmail.com> wrote: > Hi Jehan, > Thanks for your quick response. > Standby is built from the primary using pg_basebackup. i.e. Initial copy of > the primary database is taken with pg_basebackup command pg_basebackup is supposed to make an non-exclusive backup, so this error shouldn't happen. Did you check the online backup actually succeed? > and then restarted the server. What is this server restart about? Do you restart the primary? Is it a simple startup of the standby? There something wrong somewhere in your procedure. Could you share your precise procedure steps please? > On Tue, Sep 3, 2019 at 8:43 PM Jehan-Guillaume (ioguix) de Rorthais < > ioguix@free.fr> wrote: > > > On Tue, 3 Sep 2019 20:27:37 +0530 > > Nagendra Bandi <nagendra.bandi@gmail.com> wrote: > > ... > > > *Problem Description :* > > > > > > PostgreSQL database is set up for High Availability and *asynchronous > > > streaming replication* with *hot standby* as described in > > > > > https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION > > . > > > There is only 1 master and 1 standby server set up. The deployment is on > > > Openstack cloud where master database is running on one compute host and > > > standby server is running on another compute host. When master is brought > > > down (*compute host is powered off*), stand by is *not* able to take > > over. > > > The PostgreSQL database server on the standby fails to start with errors > > > messages given below. > > > > > > *Error logs from Standby Postgres DB:* > > > > > > *Logs form postgres log : > > > *< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150 > > > > > > < 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118 > > > > > > < 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online > > backup > > > > It looks like the standby never received the end of online backup record > > from > > the master. > > > > Could you describe how you built your standby? > > > > Regards, > > -- Jehan-Guillaume de Rorthais Dalibo
Re: Postgres HA issue - Standby server does not start after Mastercompute host is shut down
On Tue, Sep 3, 2019 at 6:08 PM Nagendra Bandi <nagendra.bandi@gmail.com> wrote: > Standby is built from the primary using pg_basebackup. i.e. Initial copy of the primary database is taken with pg_basebackupcommand and then restarted the server. This puzzles me: < 2019-09-01 15:43:56.440 UTC >LOG: trigger file found: /var/opt/rh/rh-postgresql94/lib/pgsql/trigger_switch < 2019-09-01 15:43:56.440 UTC >FATAL: terminating walreceiver process due to administrator command < 2019-09-01 15:43:56.443 UTC >DEBUG: switched WAL source from stream to archive after failure < 2019-09-01 15:43:56.443 UTC >LOG: record with zero length at C/95193A70 < 2019-09-01 15:43:56.443 UTC >LOG: redo done at C/95193A38 < 2019-09-01 15:43:56.443 UTC >LOG: last completed transaction was at log time 2019-09-01 15:39:18.804265+00 < 2019-09-01 15:43:56.443 UTC >DEBUG: resetting unlogged relations: cleanup 0 init 1 < 2019-09-01 15:43:56.448 UTC >FATAL: WAL ends before end of online backup < 2019-09-01 15:43:56.448 UTC >HINT: All WAL generated while online backup was taken must be available at recovery. so the standby found the trigger file and terminated the wal receiver, but then tried to get wals from local archive, why? Seems to me the local WALs are not completed or removed, and so the standby cannot get the last part, I suspect this is because there's a restore_command that is not streaming (and that should be fine, since it is optional for streaming). Also the version is quite old. Luca