Re: postgres on physical replica crashes - Mailing list pgsql-general

From SRINIVASARAO OGURI
Subject Re: postgres on physical replica crashes
Date
Msg-id CAO69RdgCxp5Yq6KaopcdgLCHkh28eTKtJAk5TCW6x5Q1c_i9pw@mail.gmail.com
Whole thread Raw
In response to Re: postgres on physical replica crashes  (Hannes Erven <hannes@erven.at>)
List pgsql-general
Hi Greig Wise,

If you are using CentOS/REDHAT - 07 , check this link "https://srinivasoguri.blogspot.in/2018/04/postgresql-crash-in-centosredhat-07.html"

On Fri, Apr 20, 2018 at 6:58 PM, Hannes Erven <hannes@erven.at> wrote:
Hi Greig,


just last week I experienced the same situation as you on a 10.3 physical replica (it even has checksums activated), and a few months ago on 9.6 .
We used the same resolution as you we, and so far we haven't noticed any problems with data integrity on the replicas.



The logs were as follows:
2018-04-13 06:31:16.947 CEST [15603] FATAL:  WAL-Receiver-Prozess wird abgebrochen wegen Zeitüberschreitung
2018-04-13 06:31:16.948 CEST [15213] FATAL:  invalid memory alloc request size 4280303616
2018-04-13 06:31:16.959 CEST [15212] LOG:  Startprozess (PID 15213) beendete mit Status 1
2018-04-13 06:31:16.959 CEST [15212] LOG:  aktive Serverprozesse werden abgebrochen
2018-04-13 06:31:16.959 CEST [19838] user@db WARNUNG:  Verbindung wird abgebrochen wegen Absturz eines anderen Serverprozesses
2018-04-13 06:31:16.959 CEST [19838] user@db DETAIL:  Der Postmaster hat diesen Serverprozess angewiesen, die aktuelle Transaktion zurückzurollen und die Sitzung zu beenden, weil ein anderer Serverprozess abnormal beendet wurde und möglicherweise das Shared Memory verfälscht hat.
2018-04-13 06:31:16.959 CEST [19838] user@db TIPP:  In einem Moment sollten Sie wieder mit der Datenbank verbinden und Ihren Befehl wiederholen können.


This replica then refused to start up:
2018-04-13 09:25:15.941 CEST [1957] LOG:  Standby-Modus eingeschaltet
2018-04-13 09:25:15.947 CEST [1957] LOG:  Redo beginnt bei 1C/69C0FF30
2018-04-13 09:25:15.951 CEST [1957] LOG:  konsistenter Wiederherstellungszustand erreicht bei 1C/69D9A9C0
2018-04-13 09:25:15.952 CEST [1956] LOG:  Datenbanksystem ist bereit, um lesende Verbindungen anzunehmen
2018-04-13 09:25:15.953 CEST [1957] FATAL:  invalid memory alloc request size 4280303616
2018-04-13 09:25:15.954 CEST [1956] LOG:  Startprozess (PID 1957) beendete mit Status 1


... until the WAL files from the hot standby's pg_wal were manually removed and re-downloaded from the primary.

Unfortunately I did not collect hard evidence, but I think I saw the primary's replication slot's restart point was set to a position /after/ the standby's actual restart location. This time, the error was noticed immediately and the required WAL was still present on the master.


A few months ago I experienced the same situation on a 9.6 cluster, but that was not noticed for a long time, and - despite using a replication slot! - the primary had already removed required segments. Fortunately I could get them from a tape backup...



Best regards,

        -hannes




Am 2018-04-18 um 18:16 schrieb greigwise:
Hello.  I've had several instances where postgres on my physical replica
under version 9.6.6 is crashing with messages like the following in the
logs:

2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING:  terminating
connection because of crash of another server process
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:27 UTC  5acf5e12.6819  LOG:  database system is shut down

When this happens, what I've found is that I can go into the pg_xlog
directory on the replica, remove all the log files and the postgres will
restart and things seem to come back up normally.

So, the question is what's going on here... is the log maybe getting corrupt
in transmission somehow?  Should I be concerned about the viability of my
replica after having restarted in the described fashion?

Thanks,
Greig Wise



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html




pgsql-general by date:

Previous
From: Markus
Date:
Subject: Re: Query planner riddle (array-related?)
Next
From: Igor Korot
Date:
Subject: Re: Add schema to the query