Re: postgres on physical replica crashes - Mailing list pgsql-general

From Hannes Erven
Subject Re: postgres on physical replica crashes
Date
Msg-id c433628d-35ec-3273-34fc-ce9a171b400b@erven.at
Whole thread Raw
In response to postgres on physical replica crashes  (greigwise <greigwise@comcast.net>)
Responses Re: postgres on physical replica crashes
List pgsql-general
Hi Greig,


just last week I experienced the same situation as you on a 10.3 
physical replica (it even has checksums activated), and a few months ago 
on 9.6 .
We used the same resolution as you we, and so far we haven't noticed any 
problems with data integrity on the replicas.



The logs were as follows:
2018-04-13 06:31:16.947 CEST [15603] FATAL:  WAL-Receiver-Prozess wird 
abgebrochen wegen Zeitüberschreitung
2018-04-13 06:31:16.948 CEST [15213] FATAL:  invalid memory alloc 
request size 4280303616
2018-04-13 06:31:16.959 CEST [15212] LOG:  Startprozess (PID 15213) 
beendete mit Status 1
2018-04-13 06:31:16.959 CEST [15212] LOG:  aktive Serverprozesse werden 
abgebrochen
2018-04-13 06:31:16.959 CEST [19838] user@db WARNUNG:  Verbindung wird 
abgebrochen wegen Absturz eines anderen Serverprozesses
2018-04-13 06:31:16.959 CEST [19838] user@db DETAIL:  Der Postmaster hat 
diesen Serverprozess angewiesen, die aktuelle Transaktion zurückzurollen 
und die Sitzung zu beenden, weil ein anderer Serverprozess abnormal 
beendet wurde und möglicherweise das Shared Memory verfälscht hat.
2018-04-13 06:31:16.959 CEST [19838] user@db TIPP:  In einem Moment 
sollten Sie wieder mit der Datenbank verbinden und Ihren Befehl 
wiederholen können.


This replica then refused to start up:
2018-04-13 09:25:15.941 CEST [1957] LOG:  Standby-Modus eingeschaltet
2018-04-13 09:25:15.947 CEST [1957] LOG:  Redo beginnt bei 1C/69C0FF30
2018-04-13 09:25:15.951 CEST [1957] LOG:  konsistenter 
Wiederherstellungszustand erreicht bei 1C/69D9A9C0
2018-04-13 09:25:15.952 CEST [1956] LOG:  Datenbanksystem ist bereit, um 
lesende Verbindungen anzunehmen
2018-04-13 09:25:15.953 CEST [1957] FATAL:  invalid memory alloc request 
size 4280303616
2018-04-13 09:25:15.954 CEST [1956] LOG:  Startprozess (PID 1957) 
beendete mit Status 1


... until the WAL files from the hot standby's pg_wal were manually 
removed and re-downloaded from the primary.

Unfortunately I did not collect hard evidence, but I think I saw the 
primary's replication slot's restart point was set to a position /after/ 
the standby's actual restart location. This time, the error was noticed 
immediately and the required WAL was still present on the master.


A few months ago I experienced the same situation on a 9.6 cluster, but 
that was not noticed for a long time, and - despite using a replication 
slot! - the primary had already removed required segments. Fortunately I 
could get them from a tape backup...



Best regards,

    -hannes




Am 2018-04-18 um 18:16 schrieb greigwise:
> Hello.  I've had several instances where postgres on my physical replica
> under version 9.6.6 is crashing with messages like the following in the
> logs:
> 
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL:  The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT:  In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING:  terminating
> connection because of crash of another server process
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL:  The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT:  In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:27 UTC  5acf5e12.6819  LOG:  database system is shut down
> 
> When this happens, what I've found is that I can go into the pg_xlog
> directory on the replica, remove all the log files and the postgres will
> restart and things seem to come back up normally.
> 
> So, the question is what's going on here... is the log maybe getting corrupt
> in transmission somehow?  Should I be concerned about the viability of my
> replica after having restarted in the described fashion?
> 
> Thanks,
> Greig Wise
> 
> 
> 
> --
> Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
> 



pgsql-general by date:

Previous
From: Fabio Pardi
Date:
Subject: Re: pg_upgrade help
Next
From: Adrian Klaver
Date:
Subject: Re: Problem with trigger makes Detail record be invalid