
[HACKERS] Funny WAL corruption issue - Mailing list pgsql-hackers

From	Chris Travers
Subject	[HACKERS] Funny WAL corruption issue
Date	August 10, 2017 15:09:53
Msg-id	CAKt_ZfvqM8BmxnW6xV0RHDghYaspm0Lv=GOvN6t4jRdvgDEVrw@mail.gmail.com
Responses	Re: [HACKERS] Funny WAL corruption issue
List	pgsql-hackers


Hi;

I ran into a funny situation today involving PostgreSQL replication and WAL corruption, and I want to go over what I think happened and ask about a possible solution.

Basic information: a custom-built PostgreSQL 9.6.3 on Gentoo, with a ~5TB database under variable load. The master database has two slaves and generates 10-20MB of WAL traffic per second. Data checksums (data_checksums) are off.

The problem occurred when I attempted to restart the service on the slave using pg_ctl (I believe the service had been started with SysV init scripts). On restart, it gave me a nice "invalid memory alloc request size" error and promptly stopped.

Before the restart, the main logs showed a lot of messages like these:

2017-08-02 11:47:33 UTC LOG: PID 19033 in cancel request did not match any process

2017-08-02 11:47:33 UTC LOG: PID 19032 in cancel request did not match any process

2017-08-02 11:47:33 UTC LOG: PID 19024 in cancel request did not match any process

2017-08-02 11:47:33 UTC LOG: PID 19034 in cancel request did not match any process

On restart, the following was logged to stderr:

LOG: entering standby mode

LOG: redo starts at 1E39C/8B77B458

LOG: consistent recovery state reached at 1E39C/E1117FF8

FATAL: invalid memory alloc request size 3456458752

LOG: startup process (PID 18167) exited with exit code 1

LOG: terminating any other active server processes

LOG: database system is shut down
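To narrow down which WAL file to inspect, the LSNs in the log above can be mapped to segment file names. The following is only a rough sketch: it assumes the default 16MB segment size and timeline 1, neither of which is stated here, and the helper names are made up for illustration. It also checks the failed request size against PostgreSQL's MaxAllocSize limit (1GB minus one byte), which is why the allocation was rejected outright. The segment holding the LSN where consistency was reached is only a starting point; the corrupt record could be in that segment or a later one.

```python
# Assumption: default 16 MB WAL segments (the post does not state the size).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
# PostgreSQL's MaxAllocSize: ordinary allocations above this are rejected.
MAX_ALLOC_SIZE = 0x3FFFFFFF


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Map an LSN such as '1E39C/E1117FF8' to a WAL segment file name.

    The timeline defaults to 1, which is an assumption for illustration.
    """
    high_str, low_str = lsn.split("/")
    high = int(high_str, 16)
    low = int(low_str, 16)
    segments_per_xlogid = 0x100000000 // WAL_SEGMENT_SIZE
    segno = high * segments_per_xlogid + low // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


def exceeds_max_alloc(size: int) -> bool:
    """Return True if a request of this size is above MaxAllocSize."""
    return size > MAX_ALLOC_SIZE


if __name__ == "__main__":
    print(wal_segment_name("1E39C/8B77B458"))  # segment where redo starts
    print(wal_segment_name("1E39C/E1117FF8"))  # segment where consistency was reached
    print(exceeds_max_alloc(3456458752))       # size from the FATAL message
```

If the replica is on a different timeline, pass it explicitly, for example wal_segment_name("1E39C/E1117FF8", timeline=2).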

After some troubleshooting I found that the WAL segment had become corrupt. I copied the correct one from the master and everything caught up to the present.
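For anyone wanting to do the same kind of check, here is a minimal sketch of comparing the replica's copy of a segment with the master's copy and reporting where they first differ. This is not the exact procedure I used; the function name is hypothetical, and the 8192-byte block size assumes the default WAL page size. Note that a difference past the point the replica had actually received is expected (the tail of a partially filled segment can hold old data), so only a mismatch before that point suggests corruption.

```python
def first_mismatch(path_a, path_b, block_size=8192):
    """Return the byte offset of the first differing block, or None if identical.

    block_size defaults to 8192, the default WAL page size (an assumption here).
    Files of different lengths are reported as differing at the block where
    one of them runs out.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        offset = 0
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if block_a != block_b:
                return offset
            if not block_a:
                # Both files ended at the same point with no difference.
                return None
            offset += block_size
```

The two paths would be the segment in the replica's pg_xlog directory and a copy of the same-named file fetched from the master.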

So it seems that something on the back end crashed badly, and when we tried to restart, the WAL ended in an invalid way.

I am wondering what can be done to prevent this sort of thing from happening in the future if, for example, a replica dies in the middle of a WAL fsync.

Best Wishes,

Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.

http://www.efficito.com/learn_more
