Thread: postgres on physical replica crashes

postgres on physical replica crashes

From
greigwise
Date:
Hello.  I've had several instances where postgres on my physical replica
under version 9.6.6 is crashing with messages like the following in the
logs:

2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING:  terminating
connection because of crash of another server process
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:27 UTC  5acf5e12.6819  LOG:  database system is shut down

When this happens, what I've found is that I can go into the pg_xlog
directory on the replica, remove all the log files and the postgres will
restart and things seem to come back up normally.

So, the question is what's going on here... is the log maybe getting corrupt
in transmission somehow?  Should I be concerned about the viability of my
replica after having restarted in the described fashion?

Thanks,
Greig Wise



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html


Re: postgres on physical replica crashes

From
Hannes Erven
Date:
Hi Greig,


just last week I experienced the same situation as you on a 10.3 
physical replica (it even has checksums activated), and a few months ago 
on 9.6 .
We used the same resolution as you we, and so far we haven't noticed any 
problems with data integrity on the replicas.



The logs were as follows:
2018-04-13 06:31:16.947 CEST [15603] FATAL:  WAL-Receiver-Prozess wird 
abgebrochen wegen Zeitüberschreitung
2018-04-13 06:31:16.948 CEST [15213] FATAL:  invalid memory alloc 
request size 4280303616
2018-04-13 06:31:16.959 CEST [15212] LOG:  Startprozess (PID 15213) 
beendete mit Status 1
2018-04-13 06:31:16.959 CEST [15212] LOG:  aktive Serverprozesse werden 
abgebrochen
2018-04-13 06:31:16.959 CEST [19838] user@db WARNUNG:  Verbindung wird 
abgebrochen wegen Absturz eines anderen Serverprozesses
2018-04-13 06:31:16.959 CEST [19838] user@db DETAIL:  Der Postmaster hat 
diesen Serverprozess angewiesen, die aktuelle Transaktion zurückzurollen 
und die Sitzung zu beenden, weil ein anderer Serverprozess abnormal 
beendet wurde und möglicherweise das Shared Memory verfälscht hat.
2018-04-13 06:31:16.959 CEST [19838] user@db TIPP:  In einem Moment 
sollten Sie wieder mit der Datenbank verbinden und Ihren Befehl 
wiederholen können.


This replica then refused to start up:
2018-04-13 09:25:15.941 CEST [1957] LOG:  Standby-Modus eingeschaltet
2018-04-13 09:25:15.947 CEST [1957] LOG:  Redo beginnt bei 1C/69C0FF30
2018-04-13 09:25:15.951 CEST [1957] LOG:  konsistenter 
Wiederherstellungszustand erreicht bei 1C/69D9A9C0
2018-04-13 09:25:15.952 CEST [1956] LOG:  Datenbanksystem ist bereit, um 
lesende Verbindungen anzunehmen
2018-04-13 09:25:15.953 CEST [1957] FATAL:  invalid memory alloc request 
size 4280303616
2018-04-13 09:25:15.954 CEST [1956] LOG:  Startprozess (PID 1957) 
beendete mit Status 1


... until the WAL files from the hot standby's pg_wal were manually 
removed and re-downloaded from the primary.

Unfortunately I did not collect hard evidence, but I think I saw the 
primary's replication slot's restart point was set to a position /after/ 
the standby's actual restart location. This time, the error was noticed 
immediately and the required WAL was still present on the master.


A few months ago I experienced the same situation on a 9.6 cluster, but 
that was not noticed for a long time, and - despite using a replication 
slot! - the primary had already removed required segments. Fortunately I 
could get them from a tape backup...



Best regards,

    -hannes




Am 2018-04-18 um 18:16 schrieb greigwise:
> Hello.  I've had several instances where postgres on my physical replica
> under version 9.6.6 is crashing with messages like the following in the
> logs:
> 
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL:  The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT:  In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING:  terminating
> connection because of crash of another server process
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL:  The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT:  In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:27 UTC  5acf5e12.6819  LOG:  database system is shut down
> 
> When this happens, what I've found is that I can go into the pg_xlog
> directory on the replica, remove all the log files and the postgres will
> restart and things seem to come back up normally.
> 
> So, the question is what's going on here... is the log maybe getting corrupt
> in transmission somehow?  Should I be concerned about the viability of my
> replica after having restarted in the described fashion?
> 
> Thanks,
> Greig Wise
> 
> 
> 
> --
> Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
> 



Re: postgres on physical replica crashes

From
SRINIVASARAO OGURI
Date:
Hi Greig Wise,

If you are using CentOS/REDHAT - 07 , check this link "https://srinivasoguri.blogspot.in/2018/04/postgresql-crash-in-centosredhat-07.html"

On Fri, Apr 20, 2018 at 6:58 PM, Hannes Erven <hannes@erven.at> wrote:
Hi Greig,


just last week I experienced the same situation as you on a 10.3 physical replica (it even has checksums activated), and a few months ago on 9.6 .
We used the same resolution as you we, and so far we haven't noticed any problems with data integrity on the replicas.



The logs were as follows:
2018-04-13 06:31:16.947 CEST [15603] FATAL:  WAL-Receiver-Prozess wird abgebrochen wegen Zeitüberschreitung
2018-04-13 06:31:16.948 CEST [15213] FATAL:  invalid memory alloc request size 4280303616
2018-04-13 06:31:16.959 CEST [15212] LOG:  Startprozess (PID 15213) beendete mit Status 1
2018-04-13 06:31:16.959 CEST [15212] LOG:  aktive Serverprozesse werden abgebrochen
2018-04-13 06:31:16.959 CEST [19838] user@db WARNUNG:  Verbindung wird abgebrochen wegen Absturz eines anderen Serverprozesses
2018-04-13 06:31:16.959 CEST [19838] user@db DETAIL:  Der Postmaster hat diesen Serverprozess angewiesen, die aktuelle Transaktion zurückzurollen und die Sitzung zu beenden, weil ein anderer Serverprozess abnormal beendet wurde und möglicherweise das Shared Memory verfälscht hat.
2018-04-13 06:31:16.959 CEST [19838] user@db TIPP:  In einem Moment sollten Sie wieder mit der Datenbank verbinden und Ihren Befehl wiederholen können.


This replica then refused to start up:
2018-04-13 09:25:15.941 CEST [1957] LOG:  Standby-Modus eingeschaltet
2018-04-13 09:25:15.947 CEST [1957] LOG:  Redo beginnt bei 1C/69C0FF30
2018-04-13 09:25:15.951 CEST [1957] LOG:  konsistenter Wiederherstellungszustand erreicht bei 1C/69D9A9C0
2018-04-13 09:25:15.952 CEST [1956] LOG:  Datenbanksystem ist bereit, um lesende Verbindungen anzunehmen
2018-04-13 09:25:15.953 CEST [1957] FATAL:  invalid memory alloc request size 4280303616
2018-04-13 09:25:15.954 CEST [1956] LOG:  Startprozess (PID 1957) beendete mit Status 1


... until the WAL files from the hot standby's pg_wal were manually removed and re-downloaded from the primary.

Unfortunately I did not collect hard evidence, but I think I saw the primary's replication slot's restart point was set to a position /after/ the standby's actual restart location. This time, the error was noticed immediately and the required WAL was still present on the master.


A few months ago I experienced the same situation on a 9.6 cluster, but that was not noticed for a long time, and - despite using a replication slot! - the primary had already removed required segments. Fortunately I could get them from a tape backup...



Best regards,

        -hannes




Am 2018-04-18 um 18:16 schrieb greigwise:
Hello.  I've had several instances where postgres on my physical replica
under version 9.6.6 is crashing with messages like the following in the
logs:

2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING:  terminating
connection because of crash of another server process
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:27 UTC  5acf5e12.6819  LOG:  database system is shut down

When this happens, what I've found is that I can go into the pg_xlog
directory on the replica, remove all the log files and the postgres will
restart and things seem to come back up normally.

So, the question is what's going on here... is the log maybe getting corrupt
in transmission somehow?  Should I be concerned about the viability of my
replica after having restarted in the described fashion?

Thanks,
Greig Wise



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html