Re: [HACKERS] Funny WAL corruption issue - Mailing list pgsql-hackers

From Vladimir Borodin
Subject Re: [HACKERS] Funny WAL corruption issue
Date
Msg-id 539CC85E-D8A5-42EA-B10A-5996D35695D4@simply.name
Whole thread Raw
In response to [HACKERS] Funny WAL corruption issue  (Chris Travers <chris.travers@gmail.com>)
Responses Re: [HACKERS] Funny WAL corruption issue
List pgsql-hackers
Hi, Chris.

10 авг. 2017 г., в 15:09, Chris Travers <chris.travers@gmail.com> написал(а):

Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal corruption and wanted to go over what I think happened and what I wonder about as a possible solution.

Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB database with variable load.  Master database has two slaves and generates 10-20MB of WAL traffic a second.  The data_checksum option is off.


The problem occurred when I attempted to restart the service on the slave using pg_ctl (I believe the service had been started with sys V init scripts).  On trying to restart, it gave me a nice "Invalid memory allocation request" error and promptly stopped.

The main logs showed a lot of messages like before the restart:
2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any process

On restart, the following was logged to stderr:
LOG:  entering standby mode
LOG:  redo starts at 1E39C/8B77B458
LOG:  consistent recovery state reached at 1E39C/E1117FF8
FATAL:  invalid memory alloc request size 3456458752
LOG:  startup process (PID 18167) exited with exit code 1
LOG:  terminating any other active server processes

LOG:  database system is shut down

After some troubleshooting I found that the wal segment had become corrupt, I copied the correct one from the master and everything came up to present.

So It seems like somewhere something crashed big time on the back-end and when we tried to restart, the wal ended in an invalid way.

We have reported the same thing [1] nearly a year ago. Could you please check with pg_xlogdump that both WALs (normal from master and corrupted) are exactly the same until some certain LSN?



I am wondering what can be done to prevent these sorts of things from happening in the future if, for example, a replica dies in the middle of a wal fsync. 
--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.


--
May the force be with you…

pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Next
From: Etsuro Fujita
Date:
Subject: [HACKERS] Comment typo in plannodes.h