Re: postgres on physical replica crashes - Mailing list pgsql-general
From | Hannes Erven |
---|---|
Subject | Re: postgres on physical replica crashes |
Date | |
Msg-id | c433628d-35ec-3273-34fc-ce9a171b400b@erven.at |
In response to | postgres on physical replica crashes (greigwise <greigwise@comcast.net>) |
Responses | Re: postgres on physical replica crashes |
List | pgsql-general |
Hi Greig,

just last week I experienced the same situation as you on a 10.3 physical replica (it even has checksums activated), and a few months ago on 9.6. We used the same resolution as you did, and so far we haven't noticed any problems with data integrity on the replicas.

The logs were as follows:

2018-04-13 06:31:16.947 CEST [15603] FATAL: terminating walreceiver due to timeout
2018-04-13 06:31:16.948 CEST [15213] FATAL: invalid memory alloc request size 4280303616
2018-04-13 06:31:16.959 CEST [15212] LOG: startup process (PID 15213) exited with exit code 1
2018-04-13 06:31:16.959 CEST [15212] LOG: terminating any other active server processes
2018-04-13 06:31:16.959 CEST [19838] user@db WARNING: terminating connection because of crash of another server process
2018-04-13 06:31:16.959 CEST [19838] user@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-04-13 06:31:16.959 CEST [19838] user@db HINT: In a moment you should be able to reconnect to the database and repeat your command.

This replica then refused to start up:

2018-04-13 09:25:15.941 CEST [1957] LOG: entering standby mode
2018-04-13 09:25:15.947 CEST [1957] LOG: redo starts at 1C/69C0FF30
2018-04-13 09:25:15.951 CEST [1957] LOG: consistent recovery state reached at 1C/69D9A9C0
2018-04-13 09:25:15.952 CEST [1956] LOG: database system is ready to accept read only connections
2018-04-13 09:25:15.953 CEST [1957] FATAL: invalid memory alloc request size 4280303616
2018-04-13 09:25:15.954 CEST [1956] LOG: startup process (PID 1957) exited with exit code 1

... until the WAL files from the hot standby's pg_wal were manually removed and re-downloaded from the primary.
Unfortunately I did not collect hard evidence, but I think I saw that the primary's replication slot's restart point was set to a position /after/ the standby's actual restart location. This time, the error was noticed immediately and the required WAL was still present on the master.

A few months ago I experienced the same situation on a 9.6 cluster, but that was not noticed for a long time, and - despite using a replication slot! - the primary had already removed required segments. Fortunately I could get them from a tape backup...

Best regards,

-hannes

On 2018-04-18 at 18:16, greigwise wrote:
> Hello. I've had several instances where postgres on my physical replica
> under version 9.6.6 is crashing with messages like the following in the
> logs:
>
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL: The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT: In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING: terminating
> connection because of crash of another server process
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL: The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT: In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:27 UTC 5acf5e12.6819 LOG: database system is shut down
>
> When this happens, what I've found is that I can go into the pg_xlog
> directory on the replica, remove all the log files and the postgres will
> restart and things seem to come back up normally.
>
> So, the question is what's going on here... is the log maybe getting corrupt
> in transmission somehow? Should I be concerned about the viability of my
> replica after having restarted in the described fashion?
>
> Thanks,
> Greig Wise
>
> --
> Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
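
The comparison described in the reply above (replication slot restart point versus the standby's restart location) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `parse_lsn`, `slot_is_ahead` and `wal_segment_name` are hypothetical, the segment size is assumed to be the default 16 MB, and the timeline is assumed to be 1. The LSN strings would have to be collected by hand, for example from `restart_lsn` in `pg_replication_slots` on the primary and from the "redo starts at" line in the standby's log.

```python
# Sketch only: helper names are hypothetical, not part of PostgreSQL.
# Assumes the default 16 MB WAL segment size and timeline 1.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '1C/69C0FF30' into a 64-bit integer."""
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_is_ahead(slot_restart_lsn: str, standby_redo_lsn: str) -> bool:
    """True if the slot's restart point lies after the standby's redo
    start, i.e. the primary may already recycle WAL the standby needs."""
    return parse_lsn(slot_restart_lsn) > parse_lsn(standby_redo_lsn)


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segment_no = parse_lsn(lsn) // WAL_SEGMENT_SIZE
    segments_per_id = 0x100000000 // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segment_no // segments_per_id,
        segment_no % segments_per_id,
    )


if __name__ == "__main__":
    redo = "1C/69C0FF30"  # redo start taken from the standby log above
    print(wal_segment_name(redo))
    # The slot position below is made up purely for illustration.
    print(slot_is_ahead("1C/6A000000", redo))
```

If `slot_is_ahead` returns true for real values, the segment named by `wal_segment_name` for the standby's redo location is one the primary is no longer obliged to keep, which would match the 9.6 incident where required segments had already been removed.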