Re: Page Checksums - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Page Checksums
Msg-id 4EEE402B.1030807@enterprisedb.com
In response to Re: Page Checksums  (David Fetter <david@fetter.org>)
Responses Re: Page Checksums  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
On 18.12.2011 20:44, David Fetter wrote:
> On Sun, Dec 18, 2011 at 12:19:32PM +0200, Heikki Linnakangas wrote:
>> On 18.12.2011 10:54, David Fetter wrote:
>>> On Sun, Dec 18, 2011 at 10:14:38AM +0200, Heikki Linnakangas wrote:
>>>> On 17.12.2011 23:33, David Fetter wrote:
>>>>> If this introduces new failure modes, please detail, and preferably
>>>>> demonstrate, just what those new modes are.
>>>>
>>>> Hint bits, torn pages ->   failed CRC. See earlier discussion:
>>>>
>>>> http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php
>>>
>>> The patch requires that full page writes be on in order to obviate
>>> this problem by never reading a torn page.
>>
>> Doesn't help. Hint bit updates are not WAL-logged.
>
> What new failure modes are you envisioning for this case?

Umm, the one explained in the email I linked to... Let me try once more. 
For the sake of keeping the example short, imagine that the PostgreSQL 
block size is 8 bytes, and the OS block size is 4 bytes. The CRC is 1 
byte, and is stored in the first byte of each page.

In the beginning, a page is in the buffer cache, and it looks like this:

AA 12 34 56  78 9A BC DE

AA is the checksum. Now a hint bit on the last byte is set, so that the 
page in the shared buffer cache looks like this:

AA 12 34 56  78 9A BC DF

Now PostgreSQL wants to evict the page from the buffer cache, so it 
recalculates the CRC. The page in the buffer cache now looks like this:

BB 12 34 56  78 9A BC DF

Now, PostgreSQL writes the page to the OS cache, with the write() system 
call. It sits in the OS cache for a few seconds, and then the OS decides 
to flush the first 4 bytes, i.e. the first OS block, to disk. On disk, 
you now have this:

BB 12 34 56  78 9A BC DE

If the server now crashes, before the OS has flushed the second half of 
the PostgreSQL page to disk, you have a classic torn page. The updated 
CRC made it to disk, but the hint bit did not. The CRC on disk is not 
valid for the rest of the contents of that page on disk.

Without CRCs, that's not a problem because the data is valid whether or 
not the hint bit makes it to the disk. It's just a hint, after all. But 
when you have a CRC on the page, the CRC is only valid if both the CRC 
update *and* the hint bit update make it to disk, or neither.

So you've just turned an innocent torn page, which PostgreSQL tolerates 
just fine, into a block with bad CRC.
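
The sequence above can be replayed as a minimal sketch. It assumes a toy 1-byte checksum (the sum of bytes 1..7 modulo 256) as a stand-in for a real CRC, so the computed checksum values differ from the placeholder AA and BB used in the example. The helper names (checksum, with_checksum, is_valid) are made up for illustration and are not PostgreSQL functions.

```python
OS_BLOCK = 4   # toy OS block size from the example
PAGE_SIZE = 8  # toy PostgreSQL page size from the example


def checksum(page: bytes) -> int:
    """Toy 1-byte checksum over bytes 1..7 (stand-in for a real CRC)."""
    return sum(page[1:]) % 256


def with_checksum(page: bytes) -> bytes:
    """Return a copy of the page with byte 0 set to its checksum."""
    return bytes([checksum(page)]) + page[1:]


def is_valid(page: bytes) -> bool:
    """True if the stored checksum matches the rest of the page."""
    return page[0] == checksum(page)


# Page as last written to disk, with a valid checksum in byte 0.
payload = bytes.fromhex("123456789abcde")
on_disk = with_checksum(b"\x00" + payload)
assert len(on_disk) == PAGE_SIZE

# A hint bit is set in the last byte of the in-memory copy (DE -> DF).
in_memory = bytearray(on_disk)
in_memory[-1] |= 0x01

# Before eviction, the checksum is recalculated.
in_memory = with_checksum(bytes(in_memory))

# Crash after the OS has flushed only the first OS block:
# new first half (with the new checksum), old second half.
torn = in_memory[:OS_BLOCK] + on_disk[OS_BLOCK:]

print(is_valid(on_disk), is_valid(in_memory), is_valid(torn))
```

This prints "True True False". Note that torn[1:] is byte-for-byte identical to the old, perfectly good page contents; the only thing wrong with the torn page is the checksum byte itself.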

> Any way to
> simulate them, even if it's by injecting faults into the source code?

Hmm, it's hard to persuade the OS to suffer a torn page on purpose. What 
you could do is split the write() call in mdwrite() into two. First 
write the 1st half of the page, then the second. Then you can put a 
breakpoint in between the writes, and kill the system before the 2nd 
half is written.
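
As a rough illustration of that idea outside PostgreSQL, here is a sketch in Python rather than a patch to mdwrite(). The function split_write and its between_halves hook are hypothetical names; the hook stands in for the breakpoint, and raising an exception there only imitates killing the system (it does not exercise the OS cache). os.pwrite is available on Unix only.

```python
import os
import tempfile


class SimulatedCrash(Exception):
    """Raised by the hook to imitate the system dying mid-write."""


def split_write(fd, offset, page, between_halves=None):
    """Write `page` at `offset` as two separate half-page writes.

    `between_halves` is a hypothetical hook called after the first half
    has been written; it marks where the breakpoint would go.
    """
    half = len(page) // 2
    os.pwrite(fd, page[:half], offset)
    if between_halves is not None:
        between_halves()
    os.pwrite(fd, page[half:], offset + half)


def crash():
    raise SimulatedCrash


# Toy pages from the example: old page on disk, new page being evicted.
old_page = bytes.fromhex("aa123456789abcde")
new_page = bytes.fromhex("bb123456789abcdf")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "relation")
    with open(path, "wb") as f:
        f.write(old_page)

    fd = os.open(path, os.O_RDWR)
    try:
        split_write(fd, 0, new_page, between_halves=crash)
    except SimulatedCrash:
        pass  # the second half never gets written
    finally:
        os.close(fd)

    with open(path, "rb") as f:
        result = f.read()

print(result.hex())
```

The file ends up with the new first half (including the new checksum byte BB) and the old second half (ending in DE), which is the torn page from the example.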

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

