Thread: [HACKERS] Funny WAL corruption issue
Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal corruption and wanted to go over what I think happened and what I wonder about as a possible solution.

Basic information: a custom-built PostgreSQL 9.6.3 on Gentoo, on a ~5TB database with variable load. The master database has two slaves and generates 10-20MB of WAL traffic a second. The data_checksums option is off.

The problem occurred when I attempted to restart the service on the slave using pg_ctl (I believe the service had been started with Sys V init scripts). On trying to restart, it gave me a nice "Invalid memory allocation request" error and promptly stopped.

Before the restart, the main logs showed a lot of messages like:

2017-08-02 11:47:33 UTC LOG: PID 19033 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG: PID 19032 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG: PID 19024 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG: PID 19034 in cancel request did not match any process

On restart, the following was logged to stderr:

LOG: entering standby mode
LOG: redo starts at 1E39C/8B77B458
LOG: consistent recovery state reached at 1E39C/E1117FF8
FATAL: invalid memory alloc request size 3456458752
LOG: startup process (PID 18167) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down

After some troubleshooting I found that the wal segment had become corrupt. I copied the correct one from the master and everything came up to the present.

So it seems like somewhere something crashed big time on the back-end, and when we tried to restart, the wal ended in an invalid way.

I am wondering what can be done to prevent these sorts of things from happening in the future if, for example, a replica dies in the middle of a wal fsync.

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.
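For context on that FATAL: PostgreSQL refuses any single palloc request larger than MaxAllocSize (0x3FFFFFFF bytes, just under 1 GB), and the error text for that refusal is "invalid memory alloc request size". So a corrupt or stale record length read from WAL during redo tends to surface as exactly this message rather than as a checksum complaint. A minimal sketch of the bound (the constant is the server's; the Python around it is just illustration):

# Illustrative only: mirrors the size sanity check the server applies before
# a normal palloc(); requests above MaxAllocSize are rejected outright.
MAX_ALLOC_SIZE = 0x3FFFFFFF          # MaxAllocSize in src/include/utils/memutils.h

def alloc_size_is_valid(size: int) -> bool:
    return 0 <= size <= MAX_ALLOC_SIZE

print(alloc_size_is_valid(3_456_458_752))   # False -> "invalid memory alloc request size"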
Hi, Chris.

On 10 Aug 2017, at 15:09, Chris Travers <chris.travers@gmail.com> wrote:
> After some troubleshooting I found that the wal segment had become corrupt. I copied the correct one from the master and everything came up to the present.
>
> So it seems like somewhere something crashed big time on the back-end, and when we tried to restart, the wal ended in an invalid way.

We have reported the same thing [1] nearly a year ago. Could you please check with pg_xlogdump that both WALs (the normal one from the master and the corrupted one) are exactly the same up to a certain LSN?
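One way to do that comparison (a sketch, not from the thread: the copied-aside segment paths are placeholders, pg_xlogdump must be on PATH, and the tool is called pg_waldump from version 10 onwards):

# Sketch: run pg_xlogdump over the master's copy and the standby's copy of the
# same WAL segment and report the first record at which the outputs diverge.
import subprocess
from itertools import zip_longest

def dump(path):
    out = subprocess.run(["pg_xlogdump", path], capture_output=True, text=True)
    # pg_xlogdump exits non-zero when it hits the bad record; keep what it printed.
    return out.stdout.splitlines(), out.stderr.strip()

master_lines, master_err = dump("/tmp/000000010001E39C000000E1.master")
standby_lines, standby_err = dump("/tmp/000000010001E39C000000E1.standby")

for n, (a, b) in enumerate(zip_longest(master_lines, standby_lines), 1):
    if a != b:
        print(f"outputs diverge at line {n}:")
        print(f"  master : {a}")
        print(f"  standby: {b}")
        break
else:
    print("record output identical up to where pg_xlogdump stopped")
print("master stderr :", master_err)
print("standby stderr:", standby_err)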
Yes. Exactly the same output until a certain point where pg_xlogdump dies with an error. That is the same LSN where PostgreSQL died with an error on restart.
pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset 1146880
Hi Chris,

> I ran into a funny situation today regarding PostgreSQL replication and
> wal corruption and wanted to go over what I think happened and what I
> wonder about as a possible solution.

Sad story. Unfortunately I have no idea what the reason could be, nor can I suggest a good way to find it unless there is an already known sequence of steps that reproduces the issue.

I just wanted to point out that a hardware issue or third-party software issues (bugs in the FS, software RAID, ...) cannot be fully excluded from the list of suspects. According to the talk by Christophe Pettus [1], it's not as uncommon as most people think. If the issue reproduces from time to time on one replica and not on a second identical replica, there is a good chance you're facing a hardware issue. Another thing worth checking is the SMART status of the hard drives.

[1]: http://www.pgcon.org/2017/schedule/attachments/453_corruption-pgcon-2017.pdf

--
Best regards,
Aleksander Alekseev
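If you want to script that hardware sanity check across replicas, something like this works (a sketch: device names are placeholders, and it assumes smartmontools' smartctl is installed and run with root privileges):

# Sketch: query SMART health for a few devices via smartmontools.
import subprocess

for dev in ("/dev/sda", "/dev/sdb"):
    res = subprocess.run(["smartctl", "-H", dev], capture_output=True, text=True)
    status = "PASSED" if "PASSED" in res.stdout else "check output"
    print(dev, status)
    # For the full attribute table (reallocated sectors, pending sectors, ...):
    # subprocess.run(["smartctl", "-a", dev])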
On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> I just wanted to point out that a hardware issue or third-party software
> issues (bugs in the FS, software RAID, ...) cannot be fully excluded from
> the list of suspects. According to the talk by Christophe Pettus [1],
> it's not as uncommon as most people think.

This still might be a case of hardware corruption, but it does not look like one. The likelihood of two different people seeing a similar error message just a year apart is low. In our experience, hardware corruption usually looks like a random single-bit flip (most commonly a bad CPU or memory), a bunch of zeroes (bad storage), or a bunch of complete garbage (usually indicating in-memory pointer corruption).

Chris, if you still have the original WAL segment from the master and its corrupt copy from the standby, can you do a bit-by-bit comparison to see how they differ? Also, please share some hardware details if you can. Specifically, do you use ECC? If so, are there any ECC errors logged? Do you use physical disks/SSDs or some form of storage virtualization?

--
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd., Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047
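For the bit-by-bit comparison, something as small as this will find the first divergent offset (a sketch; the two file paths are placeholders for the master's copy and the standby's copy of the same segment):

# Sketch: find the first byte offset at which two copies of the same 16MB WAL
# segment differ, plus a little hex context around it.
def first_difference(path_a, path_b, context=16):
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    if a == b:
        return None
    n = min(len(a), len(b))
    off = next((i for i in range(n) if a[i] != b[i]), n)
    lo = max(0, off - context)
    return off, a[lo:off + context].hex(), b[lo:off + context].hex()

diff = first_difference("/tmp/segment.master", "/tmp/segment.standby")
if diff is None:
    print("files are identical")
else:
    off, good, bad = diff
    print(f"first difference at byte offset 0x{off:x}")
    print("master :", good)
    print("standby:", bad)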
The bitwise comparison is interesting. Remember the error was:

pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset 1146880

Starting with the good wal segment (I think the record starts at the 003b):
0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 008c 0000 0000 0000
0118000 d093 0005 0001 0000 8000 e111 e39c 0001
0118010 0084 0000 0000 0000 7fb8 e111 e39c 0001
0118020 0910 0000 ccac 2eba 2000 0056 067f 0000
0118030 4000 0000 2249 0195 b5c4 0000 08ff 0001
0118040 0002 0003 0004 0005 0006 0007 0008 0009
0118050 000a 000b 000c 000d 000e 000f 0010 0011
0118060 0012 0013 0014 0015 0016 0017 0018 0019
0118070 001a 001b 001c 001d 001e 001f 0020 0021

And the corrupt segment from the standby:

0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 4079 ce05 1cce ecf9
0118000 d093 0005 0001 0000 8000 6111 e375 0001
0118010 119d 0000 0000 0000 cfd4 00cc ca00 0410
0118020 1800 7c00 5923 544b dc20 914c 7a5c afec
0118030 db45 0060 b700 1910 1800 7c00 791f 2ede
0118040 c573 a110 5a88 e1e6 ab48 0034 9c00 2210
0118050 1800 7c00 4415 400d 2c7e b5e3 7c88 bcef
0118060 4666 00db 9900 0a10 1800 7c00 7d1d b355

Since this didn't throw a checksum error (we have data checksums disabled, but wal records ISTR have a separate CRC check), would this perhaps indicate that the checksum operated over incorrect data?
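Decoding those bytes ties the two dumps to both error messages. The copies diverge at offset 0x117ff8, where the next record header would start, and at 0x118000 each copy carries a standard XLOG page header: uint16 magic (0xD093 here), uint16 info, uint32 timeline, uint64 pageaddr, uint32 rem_len, little-endian. A sketch that unpacks them from the hexdump words above; the round-up at the end mirrors how the WAL reader sizes its record buffer (an 8 kB multiple), which looks like where the FATAL's 3456458752 comes from:

# Sketch: interpret the bytes from the two hexdumps above.
# hexdump -x prints little-endian 16-bit words, so "d093" is bytes 93 d0 on disk.
import struct

def lsn(v):                       # format a 64-bit XLogRecPtr the way pg prints it
    return f"{v >> 32:X}/{v & 0xFFFFFFFF:08X}"

# Page headers at segment offset 0x118000 (uint16 magic, uint16 info,
# uint32 tli, uint64 pageaddr, uint32 rem_len -- XLogPageHeaderData).
good_hdr = bytes.fromhex("93d00500" "01000000" "008011e19ce30100" "84000000")
bad_hdr  = bytes.fromhex("93d00500" "01000000" "0080116175e30100" "9d110000")

for name, raw in (("master ", good_hdr), ("standby", bad_hdr)):
    magic, info, tli, pageaddr, rem_len = struct.unpack("<HHIQI", raw)
    print(f"{name}: magic=0x{magic:04X} tli={tli} pageaddr={lsn(pageaddr)}")
# master : pageaddr 1E39C/E1118000  (what recovery expects at this offset)
# standby: pageaddr 1E375/61118000  (the "unexpected pageaddr" in the error)

# The four bytes at 0x117ff8 in the standby copy, read as a record length
# (xl_tot_len), then rounded up to an 8 kB multiple the way the WAL reader
# sizes its record buffer:
tot_len = struct.unpack("<I", bytes.fromhex("7940" "05ce"))[0]
rounded = tot_len + (8192 - tot_len % 8192)
print(hex(tot_len), rounded)      # 0xce054079, 3456458752 -- the FATAL's size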
Also, in the absolute majority of cases corruption is caught by checksums. I am not familiar with the WAL protocol - do we have enough checksums when writing it out and on the wire? I suspect there are many more things PostgreSQL could do to be more resilient, and at least to detect corruption earlier.
--
Vladimir Rusinov
PostgreSQL SRE, Google Ireland
Google Ireland Ltd., Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047
On 10 August 2017 at 15:26, Chris Travers <chris.travers@gmail.com> wrote:
>
>
> The bitwise comparison is interesting. Remember the error was:
>
> pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected
> pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
> 1146880
...
> Since this didn't throw a checksum error (we have data checksums disabled but wal records ISTR have a separate CRC check), would this perhaps indicate that the checksum operated over incorrect data?
No checksum error plus this "unexpected pageaddr" doesn't necessarily
mean data corruption. It could mean that when the database stopped logging
it was reusing a wal file, and the old wal stream happened to have a record
boundary at the same byte position. So the previous record's checksum passes
and the following record's checksum passes, but the record header is for a
different wal stream position.

I think you could actually hack xlogdump to ignore this condition and keep
dumping, and you'll see whether the records that follow appear to be old wal
data. I haven't actually tried this though.
--
greg
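Greg's recycling theory fits the numbers. WAL segments are normally recycled by renaming old files rather than zeroing them, so stale contents can survive past a valid record boundary until they are overwritten. A quick check (a sketch; the default 16 MB segment size, the standard name formula, and timeline 1 are assumed) that the stale pageaddr points at the same in-segment offset of a much older segment:

# Sketch: map the stale pageaddr from the standby's copy back to the WAL
# segment it originally belonged to (default 16 MB segments, timeline 1 assumed).
SEG_SIZE = 16 * 1024 * 1024
SEGS_PER_XLOGID = 0x100000000 // SEG_SIZE     # 256

def segment_name(tli, lsn):
    segno = lsn // SEG_SIZE
    return f"{tli:08X}{segno // SEGS_PER_XLOGID:08X}{segno % SEGS_PER_XLOGID:08X}"

def seg_offset(lsn):
    return lsn % SEG_SIZE

stale = 0x0001E37561118000        # pageaddr found in the standby's copy
current = 0x0001E39CE1118000      # pageaddr expected at that position

print(segment_name(1, stale), hex(seg_offset(stale)))      # 000000010001E37500000061 0x118000
print(segment_name(1, current), hex(seg_offset(current)))  # 000000010001E39C000000E1 0x118000
# Same 0x118000 offset, but a segment from much earlier in the WAL stream --
# consistent with a recycled segment whose old contents were never overwritten
# past the point where the new WAL stream stopped.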