Re: Standby crash - Mailing list pgsql-general

From Adrian Klaver
Subject Re: Standby crash
Date
Msg-id 1eda9158-df8d-4d62-2320-f937beb033ae@aklaver.com
Whole thread Raw
In response to Standby crash  (James Sewell <james.sewell@jirotech.com>)
List pgsql-general
On 08/11/2016 11:28 PM, James Sewell wrote:
> Hello,
>
> We recently experienced a critical failure when failing to a DR environment.
>
> This is in the following environment:

What is the version of Postgres used?

What is the OS and version used?

>
>   * 3 x PostgreSQL machines in Prod in a sync replication cluster

Is there a connection between the above and the below?

Or can you provide a schematic representation of both?


>   * 3 x PostgreSQL machines in DR, with a single machine async and the
>     other two cascading from the first machine.


The key seems to be this:
"The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory."

So is that appearing on all the machines in the DR setup?

Do you have other Postgres instances on those machines?

What do the OS logs show at the same time?


>
> There was network failure which isolated Production from everything
> else, Production has no errors during this time (and has now come back OK).
>
> DR did not tolerate the break, the following appeared in the logs and
> none of them can start postgres. There were no queries coming into DR at
> the time of the break.
>
> Please note that the "Host Key verification failed" messages are due to
> the scp command not functioning. This means restore_command is not
> working to restore from the XLOG archive, but should not effect anything
> else.
>
> 2016-08-12 02:52:55 GMT [23205]: [9077-1] user=,db=,client=
>  (0:00000)LOG:  restartpoint starting: time
> 2016-08-12 02:57:25 GMT [23205]: [9078-1] user=,db=,client=
>  (0:00000)LOG:  restartpoint complete: wrote 13361 buffers (0.6%); 0
> transaction log file(s) added, 0 removed, 3 recycled; write=269.972 s,
> sync=0.031 s, total=270.008 s; sync files=41, longest=0.004 s,
> average=0.000 s; distance=37747 kB, estimate=48968 kB
> 2016-08-12 02:57:25 GMT [23205]: [9079-1] user=,db=,client=
>  (0:00000)LOG:  recovery restart point at 3/7C7B0228
> 2016-08-12 02:57:25 GMT [23205]: [9080-1] user=,db=,client=
>  (0:00000)DETAIL:  last completed transaction was at log time 2016-08-12
> 02:57:24.033588+00
> 2016-08-12 02:57:55 GMT [23205]: [9081-1] user=,db=,client=
>  (0:00000)LOG:  restartpoint starting: time
> 2016-08-12 03:02:25 GMT [23205]: [9082-1] user=,db=,client=
>  (0:00000)LOG:  restartpoint complete: wrote 7196 buffers (0.3%); 0
> transaction log file(s) added, 0 removed, 2 recycled; write=269.988 s,
> sync=0.014 s, total=270.007 s; sync files=42, longest=0.002 s,
> average=0.000 s; distance=79888 kB, estimate=79888 kB
> 2016-08-12 03:02:25 GMT [23205]: [9083-1] user=,db=,client=
>  (0:00000)LOG:  recovery restart point at 3/815B42C8
> 2016-08-12 03:02:25 GMT [23205]: [9084-1] user=,db=,client=
>  (0:00000)DETAIL:  last completed transaction was at log time 2016-08-12
> 03:02:15.07639+00
> 2016-08-12 03:02:55 GMT [23205]: [9085-1] user=,db=,client=
>  (0:00000)LOG:  restartpoint starting: time
> 2016-08-12 03:04:00 GMT [22350]: [2-1] user=,db=,client=
>  (0:XX000)FATAL:  terminating walreceiver due to timeout
> Host key verification failed.^M
> Host key verification failed.^M
> 2016-08-12 03:04:00 GMT [23188]: [9-1] user=,db=,client=
>  (0:XX000)FATAL:  invalid memory alloc request size 3445219328
> 2016-08-12 03:04:01 GMT [23182]: [5-1] user=,db=,client=  (0:00000)LOG:
>  startup process (PID 23188) exited with exit code 1
> 2016-08-12 03:04:01 GMT [23182]: [6-1] user=,db=,client=  (0:00000)LOG:
>  terminating any other active server processes
> 2016-08-12 03:04:01 GMT [22285]: [1-1]
> user=replication,db=[unknown],client=10.8.0.14 10.8.0.14(55826)
> (0:57P02)WARNING:  terminating connection because of crash of another
> server process
> 2016-08-12 03:04:01 GMT [22285]: [2-1]
> user=replication,db=[unknown],client=10.8.0.14 10.8.0.14(55826)
> (0:57P02)DETAIL:  The postmaster has commanded this server process to
> roll back the current transaction and exit, because another server
> process exited abnormally and possibly corrupted shared memory.
> 2016-08-12 03:04:01 GMT [22285]: [3-1]
> user=replication,db=[unknown],client=10.8.0.14 10.8.0.14(55826)
> (0:57P02)HINT:  In a moment you should be able to reconnect to the
> database and repeat your command.
> 2016-08-12 03:04:01 GMT [22286]: [1-1]
> user=replication,db=[unknown],client=10.8.0.32 10.8.0.32(56442)
> (0:57P02)WARNING:  terminating connection because of crash of another
> server process
> 2016-08-12 03:04:01 GMT [22286]: [2-1]
> user=replication,db=[unknown],client=10.8.0.32 10.8.0.32(56442)
> (0:57P02)DETAIL:  The postmaster has commanded this server process to
> roll back the current transaction and exit, because another server
> process exited abnormally and possibly corrupted shared memory.
> 2016-08-12 03:04:01 GMT [22286]: [3-1]
> user=replication,db=[unknown],client=10.8.0.32 10.8.0.32(56442)
> (0:57P02)HINT:  In a moment you should be able to reconnect to the
> database and repeat your command.
> 2016-08-12 03:04:01 GMT [23182]: [7-1] user=,db=,client=  (0:00000)LOG:
>  archiver process (PID 23207) exited with exit code 1
> 2016-08-12 04:43:51 GMT [23614]: [1-1] user=,db=,client=  (0:00000)LOG:
>  database system was interrupted while in recovery at log time
> 2016-08-12 02:53:19 GMT
> 2016-08-12 04:43:51 GMT [23614]: [2-1] user=,db=,client=  (0:00000)HINT:
>  If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
> 2016-08-12 04:43:51 GMT [23615]: [1-1]
> user=postgres,db=postgres,client=[local] [local] (0:57P03)FATAL:  the
> database system is starting up
> Host key verification failed.^M
> 2016-08-12 04:43:51 GMT [23614]: [3-1] user=,db=,client=  (0:00000)LOG:
>  entering standby mode
> Host key verification failed.^M
> Host key verification failed.^M
> 2016-08-12 04:43:51 GMT [23614]: [4-1] user=,db=,client=  (0:00000)LOG:
>  redo starts at 3/815B42C8
> Host key verification failed.^M
> Host key verification failed.^M
> Host key verification failed.^M
> 2016-08-12 04:43:52 GMT [23643]: [1-1]
> user=postgres,db=postgres,client=[local] [local] (0:57P03)FATAL:  the
> database system is starting up
> Host key verification failed.^M
> Host key verification failed.^M
> Host key verification failed.^M
> Host key verification failed.^M
> 2016-08-12 04:43:53 GMT [23614]: [5-1] user=,db=,client=  (0:00000)LOG:
>  consistent recovery state reached at 3/8811DFF0
> 2016-08-12 04:43:53 GMT [23614]: [6-1] user=,db=,client=
>  (0:XX000)FATAL:  invalid memory alloc request size 3445219328
> 2016-08-12 04:43:53 GMT [23612]: [3-1] user=,db=,client=  (0:00000)LOG:
>  database system is ready to accept read only connections
> 2016-08-12 04:43:53 GMT [23612]: [4-1] user=,db=,client=  (0:00000)LOG:
>  startup process (PID 23614) exited with exit code 1
> 2016-08-12 04:43:53 GMT [23612]: [5-1] user=,db=,client=  (0:00000)LOG:
>  terminating any other active server processes
> 2016-08-12 04:43:53 GMT [23612]: [6-1] user=,db=,client=  (0:00000)LOG:
>  archiver process (PID 23627) exited with exit code 1
> 2016-08-12 04:44:04 GMT [23710]: [1-1] user=,db=,client=  (0:00000)LOG:
>  database system was interrupted while in recovery at log time
> 2016-08-12 02:53:19 GMT
> 2016-08-12 04:44:04 GMT [23710]: [2-1] user=,db=,client=  (0:00000)HINT:
>  If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
>
> Does anyone have any ideas? It looks to me like some sort of bug / error
> with the replication protocol or maybe some corruption on the master
> which wasn't noticed and fed across?
>
> If that's the case would checksums help here? Are the computed on the
> standby?
>
> Cheers,
>
> James Sewell,
>  Solutions Architect
>
>
>
> Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
> *P *(+61) 2 8099 9000
> <tel:(+61)%202%208099%209000> * **W* www.jirotech.com
> <http://www.jirotech.com/>  *F *(+61) 2 8099 9099
> <tel:(+61)%202%208099%209000>
>
> ------------------------------------------------------------------------
> The contents of this email are confidential and may be subject to legal
> or professional privilege and copyright. No representation is made that
> this email is free of viruses or other defects. If you have received
> this communication in error, you may not copy or distribute any part of
> it or otherwise disclose its contents to anyone. Please advise the
> sender of your incorrect receipt of this correspondence.


--
Adrian Klaver
adrian.klaver@aklaver.com


pgsql-general by date:

Previous
From: Ioana Danes
Date:
Subject: Re: Corrupted Data ?
Next
From: Adrian Klaver
Date:
Subject: Re: pgbasebackup is failing after truncate