Re: [HACKERS] Funny WAL corruption issue - Mailing list pgsql-hackers

From Chris Travers
Subject Re: [HACKERS] Funny WAL corruption issue
Date
Msg-id CAKt_Zfvj=0cXBqEW2UBjtcY7Y2munm1Z7dPqxTh4PSCA76cB-g@mail.gmail.com
In response to Re: [HACKERS] Funny WAL corruption issue  (Vladimir Rusinov <vrusinov@google.com>)
Responses Re: [HACKERS] Funny WAL corruption issue
List pgsql-hackers


On Thu, Aug 10, 2017 at 3:17 PM, Vladimir Rusinov <vrusinov@google.com> wrote:


On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
I just wanted to point out that a hardware issue or third party software
issues (bugs in FS, software RAID, ...) could not be fully excluded from
the list of suspects. According to the talk by Christophe Pettus [1]
it's not as uncommon as most people think.

This still might be a case of hardware corruption, but it does not look like one.

Yeah, I don't think so either.  The systems were not restarted, only the service, so I don't think this is a lie-on-write case.  We have ECC with full checks, etc.  It really looks like something I initiated caused it, but I'm not sure what, and I'm really not interested in trying to reproduce it on a db of this size.

The likelihood of two different people seeing a similar error message just a year apart is low. In our experience, hardware corruption usually looks like a random single bit flip (most commonly a bad CPU or memory), a bunch of zeroes (bad storage), or a bunch of complete garbage (usually indicates in-memory pointer corruption).

Chris, if you still have the original WAL segment from the master and its corrupt copy from the standby, can you do a bit-by-bit comparison to see how they differ? Also, if you can, please share some hardware details. Specifically, do you use ECC? If so, are there any ECC errors logged? Do you use physical disks/SSDs or some form of storage virtualization?

Straight on bare metal, ECC with no errors logged.  SSD for both data and WAL.

The bitwise comparison is interesting.  Remember the error was:

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset 1146880
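
For reference, the file name and offset in that message follow arithmetically from the LSN. A minimal sketch, assuming the default 16 MB WAL segment size; the helper names `parse_lsn` and `locate` are made up for illustration, and the timeline defaults to 0 only so the result matches the file name as pg_xlogdump printed it:

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def parse_lsn(text):
    """Turn 'hi/lo' hex notation into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def locate(lsn, timeline=0):
    """Return (segment file name, byte offset within that segment)."""
    segno = lsn // WAL_SEGMENT_SIZE
    segs_per_xlogid = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16 MB segments
    name = "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )
    return name, lsn % WAL_SEGMENT_SIZE


# Record named in the error, and the page boundary that follows it.
print(locate(parse_lsn("1E39C/E1117FB8")))  # ('000000000001E39C000000E1', 1146808)
print(locate(parse_lsn("1E39C/E1118000")))  # ('000000000001E39C000000E1', 1146880)

# The page address that was actually found on that page.
print(locate(parse_lsn("1E375/61118000")))  # ('000000000001E37500000061', 1146880)
```

Under the same assumption, the unexpected page address 1E375/61118000 maps to the same in-segment offset (1146880) in a different, lower-numbered segment.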


Starting with the good segment:

Good WAL segment; I think the record starts at 003b:


0117fb0 0000 0000 0000 0000 003b 0000 0000 0000

0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01

0117fd0 0200 0000 067f 0000 4000 0000 2249 0195

0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000

0117ff0 0000 0003 0000 0000 008c 0000 0000 0000

0118000 d093 0005 0001 0000 8000 e111 e39c 0001

0118010 0084 0000 0000 0000 7fb8 e111 e39c 0001

0118020 0910 0000 ccac 2eba 2000 0056 067f 0000

0118030 4000 0000 2249 0195 b5c4 0000 08ff 0001

0118040 0002 0003 0004 0005 0006 0007 0008 0009

0118050 000a 000b 000c 000d 000e 000f 0010 0011

0118060 0012 0013 0014 0015 0016 0017 0018 0019

0118070 001a 001b 001c 001d 001e 001f 0020 0021


Bad WAL segment:

0117fb0 0000 0000 0000 0000 003b 0000 0000 0000

0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01

0117fd0 0200 0000 067f 0000 4000 0000 2249 0195

0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000

0117ff0 0000 0003 0000 0000 4079 ce05 1cce ecf9

0118000 d093 0005 0001 0000 8000 6111 e375 0001

0118010 119d 0000 0000 0000 cfd4 00cc ca00 0410

0118020 1800 7c00 5923 544b dc20 914c 7a5c afec

0118030 db45 0060 b700 1910 1800 7c00 791f 2ede

0118040 c573 a110 5a88 e1e6 ab48 0034 9c00 2210

0118050 1800 7c00 4415 400d 2c7e b5e3 7c88 bcef

0118060 4666 00db 9900 0a10 1800 7c00 7d1d b355

0118070 d432 8365 de99 4dba 87c7 00ed 6200 2210 

I think the divergence is interesting here.  Up to offset 0117ff8, they are identical.  Then the last half of that line differs.
The first half of the next line is the same (up to 011800a this time), but the last 6 bytes differ (those six hold what appears to be the address reported in the error message, pageaddr 1E375/61118000), and only a few bits differ in the rest of the line.
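
The comparison above can be reproduced mechanically. A rough sketch, assuming the dumps came from plain `hexdump` on a little-endian machine (so each 4-digit group is a byte-swapped 16-bit word); `parse_hexdump` is a hypothetical helper, and only the two lines around the divergence are included:

```python
def parse_hexdump(text):
    """Map byte offset -> byte value for lines of: hex offset, then 16-bit words."""
    out = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        base = int(fields[0], 16)
        for i, word in enumerate(fields[1:]):
            value = int(word, 16)
            out[base + 2 * i] = value & 0xFF   # low byte is stored first
            out[base + 2 * i + 1] = value >> 8
    return out


GOOD = """
0117ff0 0000 0003 0000 0000 008c 0000 0000 0000
0118000 d093 0005 0001 0000 8000 e111 e39c 0001
"""

BAD = """
0117ff0 0000 0003 0000 0000 4079 ce05 1cce ecf9
0118000 d093 0005 0001 0000 8000 6111 e375 0001
"""

good = parse_hexdump(GOOD)
bad = parse_hexdump(BAD)

# Offsets present in both dumps whose byte values differ.
diff = sorted(off for off in good if off in bad and good[off] != bad[off])

print("first difference at 0x%07x" % diff[0])                 # 0x0117ff8
print(["0x%07x" % off for off in diff if off >= 0x118000])    # ['0x011800b', '0x011800c']
```

Byte by byte, the page-header line differs in only two places, both inside the address field.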

It looks like some data and some flags were overwritten, perhaps while the process exited.  Very interesting.
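
To put names on those bytes, here is a tentative decode of the two headers visible in the dumps. It assumes the 9.5-or-later on-disk layout (page header: magic, info, timeline, page address, remaining length; record header: total length, xid, previous-record pointer, info, rmgr id, two padding bytes, CRC), little-endian byte order, and 8-byte record alignment. The helper names are illustrative only:

```python
import struct


def words_to_bytes(words):
    """Convert hexdump-style 16-bit words (little-endian) back to raw bytes."""
    return b"".join(struct.pack("<H", int(w, 16)) for w in words.split())


def fmt_lsn(value):
    return "%X/%08X" % (value >> 32, value & 0xFFFFFFFF)


def page_header(raw):
    """Decode the 20-byte short page header."""
    magic, info, tli, pageaddr, rem_len = struct.unpack_from("<HHIQI", raw, 0)
    return {"magic": magic, "info": info, "tli": tli,
            "pageaddr": fmt_lsn(pageaddr), "rem_len": rem_len}


def record_header(raw):
    """Decode the 24-byte record header."""
    tot_len, xid, prev, info, rmid, crc = struct.unpack_from("<IIQBBxxI", raw, 0)
    return {"tot_len": tot_len, "xid": xid, "prev": fmt_lsn(prev),
            "info": info, "rmid": rmid, "crc": crc}


# Bytes at 0118000 (plus the first 4 bytes of the 0118010 line) from each copy.
good_page = page_header(words_to_bytes(
    "d093 0005 0001 0000 8000 e111 e39c 0001 0084 0000"))
bad_page = page_header(words_to_bytes(
    "d093 0005 0001 0000 8000 6111 e375 0001 119d 0000"))
print(good_page["pageaddr"], bad_page["pageaddr"])  # 1E39C/E1118000 1E375/61118000

# Bytes at 0117fb8, identical in both copies.
rec = record_header(words_to_bytes(
    "003b 0000 0000 0000 7f28 e111 e39c 0001 0940 0000 cb88 db01"))
print(rec["tot_len"], rec["prev"])  # 59 1E39C/E1117F28

# Start of the following record, rounding the length up to a multiple of 8.
next_record = 0x117FB8 + ((rec["tot_len"] + 7) & ~7)
print("0x%07x" % next_record)  # 0x0117ff8
```

Under these assumptions the record at 0117fb8 is 59 bytes long, so the next record would begin at 0117ff8, which is where the two copies start to differ.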


Also, in the vast majority of cases corruption is caught by checksums. I am not familiar with the WAL protocol - do we have enough checksums when writing it out and on the wire? I suspect there is much more PostgreSQL can do to be more resilient, and at least to detect corruption earlier.

Since this didn't throw a checksum error (we have data checksums disabled, but ISTR WAL records have a separate CRC check), would this perhaps indicate that the checksum operated over incorrect data?
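
On the CRC question, a small sketch of how a per-record CRC-32C check could work may help frame it. The CRC-32C routine below is the standard algorithm; the record layout (24-byte header with the CRC in its last 4 bytes, checksum taken over the payload first and then over the header up to the CRC field) is an assumption based on the 9.5+ format, and `make_record` and `record_ok` are hypothetical helpers:

```python
import struct


def _make_table():
    """Build the byte-wise lookup table for the reflected CRC-32C polynomial."""
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()
HEADER_SIZE = 24  # assumed record header size
CRC_OFFSET = 20   # assumed position of the CRC field inside the header


def crc32c_update(crc, data):
    """Feed more bytes into a running (not yet finalized) CRC."""
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def crc32c(data):
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def record_crc(record):
    """CRC over the payload, then over the header bytes preceding the CRC field."""
    crc = crc32c_update(0xFFFFFFFF, record[HEADER_SIZE:])
    crc = crc32c_update(crc, record[:CRC_OFFSET])
    return crc ^ 0xFFFFFFFF


def make_record(payload, xid=0, prev=0, info=0, rmid=0):
    """Build a synthetic record (not real WAL) with a valid CRC."""
    header = struct.pack("<IIQBBxx", HEADER_SIZE + len(payload), xid, prev, info, rmid)
    crc = record_crc(header + b"\0\0\0\0" + payload)
    return header + struct.pack("<I", crc) + payload


def record_ok(record):
    (stored,) = struct.unpack_from("<I", record, CRC_OFFSET)
    return stored == record_crc(record)


rec = make_record(b"example payload")
damaged = rec[:-1] + bytes([rec[-1] ^ 0x01])  # flip one bit in the payload
print(record_ok(rec), record_ok(damaged))     # True False
```

A check like this only shows that a record's bytes match the CRC stored with them. It says nothing about whether the record belongs at that position in the stream, which is a separate test (such as the page address check that produced the error above).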

-- 
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047




--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.
