Re: New WAL code dumps core trivially on replay of bad data - Mailing list pgsql-hackers

From Amit kapila
Subject Re: New WAL code dumps core trivially on replay of bad data
Date
Msg-id 6C0B27F7206C9E4CA54AE035729E9C38285258A8@szxeml509-mbx
Whole thread Raw
In response to New WAL code dumps core trivially on replay of bad data  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: New WAL code dumps core trivially on replay of bad data
List pgsql-hackers
Tom Lane Sent: Saturday, August 18, 2012 7:16 AM

> The startup process's stack trace is

> #0  0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
>    at xlog.c:3713
> 3713            COMP_CRC32(crc, XLogRecGetData(record), len);
> (gdb) bt
> #0  0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
>    at xlog.c:3713
> #1  0x270690 in ReadRecord (RecPtr=0x7b03bad0, emode=15,
>    fetching_ckpt=0 '\000') at xlog.c:4006

> The current WAL address is 80658424 == 0x04cebff8, that is just 8 bytes
> short of a page boundary, and what RecordIsValid thinks it is dealing
> with is


> so it merrily tries to compute a checksum on a gigabyte worth of data,
> and soon falls off the end of memory.

> In reality, inspection of the WAL file suggests that this is the end of
> valid data and what should have happened is that replay just stopped.
> The xl_len and so forth shown above are just garbage from off the end of
> what was actually read from the file (everything beyond offset 0xcebff8
> in file 4 is in fact zeroes).

> I'm not sure whether this is just a matter of having failed to
> sanity-check that xl_tot_len is at least SizeOfXLogRecord, or whether
> there is a deeper problem with the new design of continuation records
> that makes it impossible to validate records safely.

Earlier there was a check related to total length in ReadRecord, before it calls RecordIsValid()    if
(record->xl_tot_len< SizeOfXLogRecord + record->xl_len ||              record->xl_tot_len > SizeOfXLogRecord +
record->xl_len+                        XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ)) 

I think that missing check of total length has caused this problem. However now this check will be different.

With Regards,
Amit Kapila.


pgsql-hackers by date:

Previous
From: Jeff Janes
Date:
Subject: "CLUSTER VERBOSE" tab completion
Next
From: Craig Ringer
Date:
Subject: Re: [PATCH] Docs: Make notes on sequences and rollback more obvious