Re: [HACKERS] Funny WAL corruption issue - Mailing list pgsql-hackers
From: Chris Travers
Subject: Re: [HACKERS] Funny WAL corruption issue
Msg-id: CAKt_ZfvvsufN-Hv86bTCEd=t1dKLUcFDhA3oBCKTGyLW1eUhRA@mail.gmail.com
In response to: Re: [HACKERS] Funny WAL corruption issue (Vladimir Borodin <root@simply.name>)
Responses: Re: [HACKERS] Funny WAL corruption issue
           Re: [HACKERS] Funny WAL corruption issue
List: pgsql-hackers
Sorry, meant to reply all.
--
On Thu, Aug 10, 2017 at 2:19 PM, Vladimir Borodin <root@simply.name> wrote:
> Hi, Chris.
>
>> 10 Aug 2017, at 15:09, Chris Travers <chris.travers@gmail.com> wrote:
>>
>> Hi;
>>
>> I ran into a funny situation today regarding PostgreSQL replication and WAL corruption, and wanted to go over what I think happened and what I wonder about as a possible solution.
>>
>> Basic information: custom-built PostgreSQL 9.6.3 on Gentoo, on a ~5TB database with variable load. The master database has two slaves and generates 10-20MB of WAL traffic a second. The data_checksums option is off.
>>
>> The problem occurred when I attempted to restart the service on the slave using pg_ctl (I believe the service had been started with sys V init scripts). On trying to restart, it gave me a nice "invalid memory alloc request" error and promptly stopped.
>>
>> The main logs showed a lot of messages like these before the restart:
>>
>> 2017-08-02 11:47:33 UTC LOG: PID 19033 in cancel request did not match any process
>> 2017-08-02 11:47:33 UTC LOG: PID 19032 in cancel request did not match any process
>> 2017-08-02 11:47:33 UTC LOG: PID 19024 in cancel request did not match any process
>> 2017-08-02 11:47:33 UTC LOG: PID 19034 in cancel request did not match any process
>>
>> On restart, the following was logged to stderr:
>>
>> LOG: entering standby mode
>> LOG: redo starts at 1E39C/8B77B458
>> LOG: consistent recovery state reached at 1E39C/E1117FF8
>> FATAL: invalid memory alloc request size 3456458752
>> LOG: startup process (PID 18167) exited with exit code 1
>> LOG: terminating any other active server processes
>> LOG: database system is shut down
>>
>> After some troubleshooting I found that the WAL segment had become corrupt. I copied the correct one from the master and everything came up to present.
>>
>> So it seems like somewhere something crashed big time on the back-end, and when we tried to restart, the WAL ended in an invalid way.
>
> We have reported the same thing [1] nearly a year ago. Could you please check with pg_xlogdump that both WALs (the normal one from the master and the corrupted one) are exactly the same up until some certain LSN?
Yes. Exactly the same output until a certain point where pg_xlogdump dies with an error. That is the same LSN where PostgreSQL died with an error on restart.
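To be concrete, the comparison was of roughly this shape (a rough sketch only; the segment name and paths below are illustrative and assume the affected segment is the one containing 1E39C/E1117FF8 on timeline 1 -- adjust to whatever segment actually failed):

    # Dump both copies of the suspect segment to text
    # (segment name and directories are illustrative).
    pg_xlogdump /path/to/master_copy/000000010001E39C000000E1 > master.txt
    pg_xlogdump /path/to/slave_copy/000000010001E39C000000E1 > slave.txt 2> slave.err

    # The dumps are identical up to the record where pg_xlogdump on the
    # slave's copy stops with an error (see slave.err); diff shows where
    # they diverge.
    diff master.txt slave.txt | head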
>> I am wondering what can be done to prevent these sorts of things from happening in the future if, for example, a replica dies in the middle of a WAL fsync.
>>
>> --
>> Best Wishes,
>> Chris Travers
>> Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.
Best Wishes,
Chris Travers
Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.