Re: Page Checksums + Double Writes - Mailing list pgsql-hackers

From	Simon Riggs
Subject	Re: Page Checksums + Double Writes
Date	December 22, 2011 01:59:31
Msg-id	CA+U5nMLAJ2cYvk9bLytK1QwMBOrmTAtq8GiSF=oC81F4aSWhYA@mail.gmail.com
In response to	Re: Page Checksums + Double Writes (Simon Riggs <simon@2ndQuadrant.com>)
List	pgsql-hackers

On Thu, Dec 22, 2011 at 12:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

>> Having two different page formats running around in the system at the
>> same time is far from free; in the worst case it means that every single
>> piece of code that touches pages has to know about and be prepared to
>> cope with both versions.  That's a rather daunting prospect, from a
>> coding perspective and even more from a testing perspective.  Maybe
>> the issues can be kept localized, but I've seen no analysis done of
>> what the impact would be or how we could minimize it.  I do know that
>> we considered the idea and mostly rejected it a year or two back.
>
> I'm looking at that now.
>
> My feeling is it probably depends upon how different the formats are,
> so given we are discussing a 4 byte addition to the header, it might
> be doable.
>
> I'm investing some time on the required analysis.

We've assumed until now that adding a CRC to the Page Header would add 4
bytes, meaning that we are assuming we are taking a CRC-32 check
field. This will change the size of the header and thus break
pg_upgrade in a straightforward implementation. Breaking pg_upgrade is
not acceptable. We can get around this by making code dependent upon
page version, allowing mixed page versions in one executable. That
causes the PageGetItemId() macro to be page version dependent. After
review, altering the speed of PageGetItemId() is not acceptable either
(show me microbenchmarks if you doubt that). In a large minority of
cases the line pointer and the page header will be in separate cache
lines.

As Kevin points out, we have 13 bits spare in the pd_flags field of
PageHeader, so we have a little wiggle room there. In addition to that
I notice that the version part of pd_pagesize_version is itself 8 bits
(the page size is the other 8 bits, packed into the same field), yet we
currently use just one bit of that, since version is 4. Version 3 was
last seen in Postgres 8.2, now de-supported.

Since we don't care too much about backwards compatibility with data
in Postgres 8.2 and below, we can just assume that all pages are
version 4, unless marked otherwise with additional flags. We then use
two separate bits in pd_flags to show PD_HAS_CRC (0x0008 and 0x8000).
We then completely replace the 16-bit version field with a 16-bit CRC
value, rather than a 32-bit value. Why two flag bits? If either CRC
bit is set we assume the page's CRC is supposed to be valid. This
ensures that a single bit error doesn't switch off CRC checking when
it was supposed to be active. I suggest we remove the page size data
completely; if we need to keep that we should mark 8192 bytes as the
default and set bits for 16kB and 32 kB respectively.
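
To make the two-flag-bit idea concrete, here is a minimal sketch in C. The
names PD_HAS_CRC_LO, PD_HAS_CRC_HI, page_expects_crc() and
page_mark_has_crc() are hypothetical, invented for illustration; only the
bit values 0x0008 and 0x8000 come from the proposal above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit values come from the proposal. */
#define PD_HAS_CRC_LO   0x0008
#define PD_HAS_CRC_HI   0x8000
#define PD_HAS_CRC_MASK (PD_HAS_CRC_LO | PD_HAS_CRC_HI)

/*
 * A page is expected to carry a valid CRC if EITHER flag bit is set,
 * so a single flipped flag bit cannot silently disable checking.
 */
bool
page_expects_crc(uint16_t pd_flags)
{
    return (pd_flags & PD_HAS_CRC_MASK) != 0;
}

/* When enabling CRCs on a page, set BOTH bits. */
uint16_t
page_mark_has_crc(uint16_t pd_flags)
{
    return (uint16_t) (pd_flags | PD_HAS_CRC_MASK);
}
```

With both bits set on write, losing either one to a single-bit error still
leaves page_expects_crc() returning true, and the CRC check then reports the
damage.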

With those changes, we are able to re-organise the page header so that
we can add a 16 bit checksum (CRC), yet retain the same size of
header. Thus, we don't need to change PageGetItemId(). We would
require changes to PageHeaderIsValid() and PageInit() only. Making
these changes means we are reducing the number of bits used to
validate the page header, though we are providing a much better way of
detecting page validity, so the change is of positive benefit.
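
As a rough illustration of why the header size stays the same, the sketch
below puts a simplified version 4 header next to the proposed layout. The
struct names, the pd_checksum field name and the flattened field types are
assumptions made for this example (the line pointer array is omitted, and
the LSN is shown as two plain 32-bit words); this is not the actual
bufpage.h definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the current (version 4) page header. */
typedef struct PageHeaderV4Sketch
{
    uint32_t pd_lsn_xlogid;       /* LSN, high part */
    uint32_t pd_lsn_xrecoff;      /* LSN, low part */
    uint16_t pd_tli;
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
    uint16_t pd_pagesize_version; /* page size and layout version */
    uint32_t pd_prune_xid;
} PageHeaderV4Sketch;

/* Proposed layout: the 16-bit field is reused for the checksum. */
typedef struct PageHeaderCrcSketch
{
    uint32_t pd_lsn_xlogid;
    uint32_t pd_lsn_xrecoff;
    uint16_t pd_tli;
    uint16_t pd_flags;            /* carries the PD_HAS_CRC bits */
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
    uint16_t pd_checksum;         /* hypothetical name, replaces version */
    uint32_t pd_prune_xid;
} PageHeaderCrcSketch;

/* Same size and same field offset, so line pointers start where they did. */
_Static_assert(sizeof(PageHeaderV4Sketch) == sizeof(PageHeaderCrcSketch),
               "header size must not change");
_Static_assert(offsetof(PageHeaderV4Sketch, pd_pagesize_version) ==
               offsetof(PageHeaderCrcSketch, pd_checksum),
               "checksum must occupy the old version slot");
```

Because nothing after the header moves, code such as PageGetItemId() that
indexes from the end of the header would not need to know which layout it
is looking at.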

Adding a CRC was a performance concern because of the hint bit
problem, so making the value 16 bits long gives performance where it
is needed. Note that we do now have a separation of bgwriter and
checkpointer, so we have more CPU bandwidth to address the problem.
Adding multiple bgwriters is also possible.

Notably, this proposal makes CRC checking optional, so if performance
is a concern it can be disabled completely.

Which CRC algorithm to choose?
"A study of error detection capabilities for random independent bit
errors and burst errors reveals that XOR, two's complement addition,
and Adler checksums are suboptimal for typical network use. Instead,
one's complement addition should be used for networks willing to
sacrifice error detection effectiveness to reduce compute cost,
Fletcher checksum for networks looking for a balance of error
detection and compute cost, and CRCs for networks willing to pay a
higher compute cost for significantly improved error detection."
The Effectiveness of Checksums for Embedded Control Networks,
Maxino, T.C.  Koopman, P.J.,
Dependable and Secure Computing, IEEE Transactions on
Issue Date: Jan.-March 2009
Available here - http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

Based upon that paper, I suggest we use Fletcher-16. The overall
concept is not sensitive to the choice of checksum algorithm, however,
and the algorithm itself could be another option: F16 or CRC. My poor
understanding of the difference is that F16 is about 20 times cheaper
to calculate, at the expense of about 1000 times worse error detection
(but still pretty good).
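
For reference, a textbook Fletcher-16 over a byte buffer looks roughly like
the sketch below. This is the plain modulo-255 form, not tuned for speed,
and the function name fletcher16() is an assumption for this example. A real
page checksum would also have to skip or zero the checksum field itself
while summing, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Textbook Fletcher-16: two running sums, each kept modulo 255.
 * sum1 is the simple sum of the bytes; sum2 is the sum of the
 * successive values of sum1, which makes the result depend on
 * byte order as well as byte values.
 */
uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;

    for (size_t i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }

    /* High byte is sum2, low byte is sum1. */
    return (uint16_t) ((sum2 << 8) | sum1);
}
```

As a sanity check, the five bytes "abcde" give 0xC8F0 with this form.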

16 bit CRCs are not the strongest available, but still support
excellent error detection rates - better than 1 failure in a million,
possibly much better depending on which algorithm and block size.
That's easily good enough to detect our kind of errors.

This idea doesn't rule out the possibility of a 4 byte CRC-32 added in
the future, since we still have 11 bits spare for use as future page
version indicators. (If we did that, it is clear that we should add
the checksum as a *trailer* not as part of the header.)

So overall, I do now think it's still possible to add an optional
checksum in the 9.2 release and am willing to pursue it unless there
are technical objections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
