Re: [HACKERS] Funny WAL corruption issue - Mailing list pgsql-hackers
From | Chris Travers |
---|---|
Subject | Re: [HACKERS] Funny WAL corruption issue |
Date | |
Msg-id | CAKt_Zfvj=0cXBqEW2UBjtcY7Y2munm1Z7dPqxTh4PSCA76cB-g@mail.gmail.com |
In response to | Re: [HACKERS] Funny WAL corruption issue (Vladimir Rusinov <vrusinov@google.com>) |
Responses | Re: [HACKERS] Funny WAL corruption issue |
List | pgsql-hackers |
On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:

I just wanted to point out that a hardware issue or third-party software
issues (bugs in FS, software RAID, ...) could not be fully excluded from
the list of suspects. According to the talk by Christophe Pettus [1],
it's not as uncommon as most people think.

This still might be a case of hardware corruption, but it does not look like one.

The likelihood of two different people seeing a similar error message just a year apart is low. In our experience, hardware corruption usually looks like a random single bit flip (most commonly a bad CPU or memory), a bunch of zeroes (bad storage), or a bunch of complete garbage (usually indicating in-memory pointer corruption).

Chris, if you still have the original WAL segment from the master and its corrupt copy from the standby, can you do a bit-by-bit comparison to see how they differ? Also, if you can, please share some hardware details. Specifically, do you use ECC? If so, are there any ECC errors logged? Do you use physical disks/SSDs or some form of storage virtualization?
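A byte-level comparison like the one asked for here can be sketched in Python. This is only an illustration: `diff_ranges` and `changed_bits` are hypothetical helpers, and the paths in the usage comment are placeholders, not the real segment locations.

```python
def diff_ranges(a, b):
    """Return (start, end) byte ranges, end exclusive, where a and b differ."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # Any length mismatch is reported as one trailing range.
        ranges.append((common, max(len(a), len(b))))
    return ranges


def changed_bits(a, b, start, end):
    """Count differing bits in a range; a total of 1 suggests a single bit flip."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a[start:end], b[start:end]))


# Usage (placeholder paths):
#   good = open("/path/to/master/segment", "rb").read()
#   bad = open("/path/to/standby/segment", "rb").read()
#   for start, end in diff_ranges(good, bad):
#       print(hex(start), hex(end), changed_bits(good, bad, start, end))
```

A single range with one changed bit would point toward a bit flip, a long run of zeroes toward storage, and a long run of unrelated bytes toward something else.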
pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset 1146880
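As a sanity check on the addresses in that message, here is a minimal Python sketch of how an LSN maps to a segment and a page offset. It assumes the default 16 MB WAL segment size and 8 kB WAL page size (neither is stated in this thread), and the helper name `locate_lsn` is made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default; not stated in the thread
WAL_PAGE_SIZE = 8192                 # assumed default WAL block size


def parse_lsn(text):
    """Turn 'HI/LO' (hex) into a 64-bit integer LSN."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def locate_lsn(text):
    """Hypothetical helper: which segment and page does this LSN fall in?"""
    lsn = parse_lsn(text)
    segno = lsn // WAL_SEGMENT_SIZE
    segs_per_log = 0x100000000 // WAL_SEGMENT_SIZE
    offset = lsn % WAL_SEGMENT_SIZE
    page_start = offset - offset % WAL_PAGE_SIZE
    return {
        # Last 16 hex digits of the segment file name (the timeline prefix is omitted).
        "segment_suffix": "%08X%08X" % (segno // segs_per_log, segno % segs_per_log),
        "offset_in_segment": offset,
        "page_start": page_start,
        "next_page_start": page_start + WAL_PAGE_SIZE,
    }


info = locate_lsn("1E39C/E1117FB8")
print(info)
```

Under those assumptions the record at 1E39C/E1117FB8 starts 0x117FB8 bytes into the segment ending in 0001E39C000000E1, and the next page begins at 1146880 (0x118000), which is the offset pg_xlogdump reports. The page address expected there would be 1E39C/E1118000.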
Starting with the good segment:
Good WAL segment; I think the record starts at 003b:
0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 008c 0000 0000 0000
0118000 d093 0005 0001 0000 8000 e111 e39c 0001
0118010 0084 0000 0000 0000 7fb8 e111 e39c 0001
0118020 0910 0000 ccac 2eba 2000 0056 067f 0000
0118030 4000 0000 2249 0195 b5c4 0000 08ff 0001
0118040 0002 0003 0004 0005 0006 0007 0008 0009
0118050 000a 000b 000c 000d 000e 000f 0010 0011
0118060 0012 0013 0014 0015 0016 0017 0018 0019
0118070 001a 001b 001c 001d 001e 001f 0020 0021
Corrupt WAL segment, same offsets:
0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 4079 ce05 1cce ecf9
0118000 d093 0005 0001 0000 8000 6111 e375 0001
0118010 119d 0000 0000 0000 cfd4 00cc ca00 0410
0118020 1800 7c00 5923 544b dc20 914c 7a5c afec
0118030 db45 0060 b700 1910 1800 7c00 791f 2ede
0118040 c573 a110 5a88 e1e6 ab48 0034 9c00 2210
0118050 1800 7c00 4415 400d 2c7e b5e3 7c88 bcef
0118060 4666 00db 9900 0a10 1800 7c00 7d1d b355
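To read the two page headers at 0118000 side by side, here is a small Python sketch. It assumes the dumps are default hexdump-style output (16-bit words printed on a little-endian machine) and that the page header begins with xlp_magic, xlp_info, xlp_tli, xlp_pageaddr and xlp_rem_len in that order, as in XLogPageHeaderData. The function names are made up for illustration.

```python
import struct


def words_to_bytes(row):
    """Convert hexdump-style 16-bit words back to bytes (little-endian assumed)."""
    return b"".join(struct.pack("<H", int(w, 16)) for w in row.split())


def decode_page_header(row):
    """Decode the first 20 bytes of a WAL page: magic, info, tli, pageaddr, rem_len."""
    raw = words_to_bytes(row)[:20]
    magic, info, tli, pageaddr, rem_len = struct.unpack("<HHIQI", raw)
    return {
        "magic": "%04X" % magic,
        "info": info,
        "tli": tli,
        "pageaddr": "%X/%08X" % (pageaddr >> 32, pageaddr & 0xFFFFFFFF),
        "rem_len": rem_len,
    }


# Words copied from offsets 0118000-0118013 of the two dumps above.
good = decode_page_header("d093 0005 0001 0000 8000 e111 e39c 0001 0084 0000")
bad = decode_page_header("d093 0005 0001 0000 8000 6111 e375 0001 119d 0000")
print(good)
print(bad)
```

Under those assumptions the first dump decodes to pageaddr 1E39C/E1118000 and the second to 1E375/61118000, the value in the pg_xlogdump error; magic and timeline are the same in both. The two dumps are identical through 0117ff7 and first differ at 0117ff8, eight bytes before the page boundary.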
Also, in the absolute majority of cases corruption is caught by checksums. I am not familiar with the WAL protocol: do we have enough checksums when writing it out and on the wire? I suspect there is much more PostgreSQL can do to be more resilient, or at least to detect corruption earlier.
--
Vladimir Rusinov
PostgreSQL SRE, Google Ireland
Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047
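For context on the checksum question: WAL records carry a per-record CRC in the record header (CRC-32C as of PostgreSQL 9.5), which is verified when the record is read back. The sketch below is a plain bitwise CRC-32C in Python, not PostgreSQL's implementation, and the payload is made up; it only illustrates that such a checksum changes when a single bit is flipped.

```python
def crc32c(data, crc=0):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


payload = bytearray(b"example WAL record payload")  # made-up data
original = crc32c(payload)

payload[5] ^= 0x01                                  # flip one bit
flipped = crc32c(payload)

print(hex(original), hex(flipped), original != flipped)
```

A per-record CRC of this kind detects damage to a record's bytes once the record is read. It does not by itself say where in the pipeline the bytes were damaged.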