Home > mailing lists

Re: New WAL code dumps core trivially on replay of bad data - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: New WAL code dumps core trivially on replay of bad data
Date	August 20, 2012 06:39:44
Msg-id	5031DBA7.6010309@enterprisedb.com Whole thread Raw
In response to	Re: New WAL code dumps core trivially on replay of bad data (Amit kapila <amit.kapila@huawei.com>)
Responses	Re: New WAL code dumps core trivially on replay of bad data
List	pgsql-hackers

Tree view

On 18.08.2012 08:52, Amit kapila wrote:
> Tom Lane Sent: Saturday, August 18, 2012 7:16 AM
>
>> so it merrily tries to compute a checksum on a gigabyte worth of data,
>> and soon falls off the end of memory.
>
>> In reality, inspection of the WAL file suggests that this is the end of
>> valid data and what should have happened is that replay just stopped.
>> The xl_len and so forth shown above are just garbage from off the end of
>> what was actually read from the file (everything beyond offset 0xcebff8
>> in file 4 is in fact zeroes).
>
>> I'm not sure whether this is just a matter of having failed to
>> sanity-check that xl_tot_len is at least SizeOfXLogRecord, or whether
>> there is a deeper problem with the new design of continuation records
>> that makes it impossible to validate records safely.
>
> Earlier there was a check related to total length in ReadRecord, before it calls RecordIsValid()
>       if (record->xl_tot_len<  SizeOfXLogRecord + record->xl_len ||
>                 record->xl_tot_len>  SizeOfXLogRecord + record->xl_len +
>                           XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
>
> I think that missing check of total length has caused this problem. However now this check will be different.

That check still exists, in ValidXLogRecordHeader(). However, we now 
allocate the buffer for the whole record before that check, based on 
xl_tot_len, if the record header is split across pages. The theory in 
allocating the buffer is that a bogus xl_tot_len field will cause the 
malloc() to fail, returning NULL, and we treat that the same as a broken 
header. However, with memory overcommit, what happens is that the 
malloc() succeeds, but the process is killed when it actually tries to 
use all that memory.

I think we need to delay the allocation of the record buffer. We need to 
read and validate the whole record header first, like we did before, 
before we trust xl_tot_len enough to call malloc() with it. I'll take a 
shot at doing that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

pgsql-hackers by date:

From: Craig Ringer
Date: 20 August 2012, 04:08:03
Subject: Re: [PATCH] Docs: Make notes on sequences and rollback more obvious

From: Rushabh Lathia
Date: 20 August 2012, 06:50:56
Subject: Primary Key Constraint on inheritance table not getting route to child tables

Re: New WAL code dumps core trivially on replay of bad data - Mailing list pgsql-hackers

Previous

Next