Re: New WAL code dumps core trivially on replay of bad data - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: New WAL code dumps core trivially on replay of bad data
Date
Msg-id 5032531F.9090007@enterprisedb.com
Whole thread Raw
In response to Re: New WAL code dumps core trivially on replay of bad data  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: New WAL code dumps core trivially on replay of bad data
List pgsql-hackers
On 20.08.2012 17:04, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 18.08.2012 08:52, Amit kapila wrote:
>>> I think that missing check of total length has caused this problem. However now this check will be different.
>
>> That check still exists, in ValidXLogRecordHeader(). However, we now
>> allocate the buffer for the whole record before that check, based on
>> xl_tot_len, if the record header is split across pages. The theory in
>> allocating the buffer is that a bogus xl_tot_len field will cause the
>> malloc() to fail, returning NULL, and we treat that the same as a broken
>> header.
>
> Uh, no, you misread it.  xl_tot_len is *zero* in this example.  The
> problem is that RecordIsValid believes xl_len (and backup block size)
> even when it exceeds xl_tot_len.

Ah yes, I see that now. I think all we need then is a check for 
xl_tot_len >= SizeOfXLogRecord.

>> I think we need to delay the allocation of the record buffer. We need to
>> read and validate the whole record header first, like we did before,
>> before we trust xl_tot_len enough to call malloc() with it. I'll take a
>> shot at doing that.
>
> I don't believe this theory at all.  Overcommit applies to writing on
> pages that were formerly shared with the parent process --- it should
> not have anything to do with malloc'ing new space.  But anyway, this
> is not what happened in my example.

I was thinking that we might read gigabytes worth of bogus WAL into the 
memory buffer, if xl_tot_len is bogus and large, e.g 0xffffffff. But now 
that I look closer, the xlog record is validated after reading the first 
continuation page, so we should catch a bogus xl_tot_len value at that 
point. And there is a cross-check with xl_rem_len on every continuation 
page, too.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: alter enum add value if not exists
Next
From: Tom Lane
Date:
Subject: Re: New WAL code dumps core trivially on replay of bad data