Re: Page Checksums + Double Writes - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Page Checksums + Double Writes |
Date | |
Msg-id | CA+U5nMLAJ2cYvk9bLytK1QwMBOrmTAtq8GiSF=oC81F4aSWhYA@mail.gmail.com |
In response to | Re: Page Checksums + Double Writes (Simon Riggs <simon@2ndQuadrant.com>) |
List | pgsql-hackers |
On Thu, Dec 22, 2011 at 12:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

>> Having two different page formats running around in the system at the
>> same time is far from free; in the worst case it means that every single
>> piece of code that touches pages has to know about and be prepared to
>> cope with both versions. That's a rather daunting prospect, from a
>> coding perspective and even more from a testing perspective. Maybe
>> the issues can be kept localized, but I've seen no analysis done of
>> what the impact would be or how we could minimize it. I do know that
>> we considered the idea and mostly rejected it a year or two back.
>
> I'm looking at that now.
>
> My feeling is it probably depends upon how different the formats are,
> so given we are discussing a 4 byte addition to the header, it might
> be doable.
>
> I'm investing some time on the required analysis.

We've assumed until now that adding a CRC to the Page Header would add 4 bytes, meaning that we are assuming a CRC-32 check field. This will change the size of the header and thus break pg_upgrade in a straightforward implementation. Breaking pg_upgrade is not acceptable. We can get around this by making code dependent upon page version, allowing mixed page versions in one executable. That causes the PageGetItemId() macro to be page version dependent. After review, altering the speed of PageGetItemId() is not acceptable either (show me microbenchmarks if you doubt that). In a large minority of cases the line pointer and the page header will be in separate cache lines.

As Kevin points out, we have 13 bits spare in the pd_flags of PageHeader, so we have a little wiggle room there. In addition to that I notice that the version part of pd_pagesize_version is itself 8 bits (the page size is the other 8 bits, packed together), yet we currently use just one bit of that, since version is 4. Version 3 was last seen in Postgres 8.2, now de-supported.
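For anyone following along without bufpage.h open, here is a minimal sketch of the packing described above. It assumes the flag values and the size/version layout are as in the 9.1 headers (quoted from memory, not copied from the tree), and the helper function names are hypothetical, made up for illustration only.

```c
#include <stdint.h>

/* pd_flags bits as I understand them in 9.1 (assumption, check bufpage.h) */
#define PD_HAS_FREE_LINES   0x0001
#define PD_PAGE_FULL        0x0002
#define PD_ALL_VISIBLE      0x0004
#define PD_VALID_FLAG_BITS  0x0007  /* OR of all currently used bits */

/* Hypothetical helper: pack page size and layout version into 16 bits.
 * The page size is a multiple of 256, so its low byte is always zero
 * and can carry the version number instead. */
static uint16_t
pack_size_version(uint16_t page_size, uint8_t version)
{
    return (uint16_t) ((page_size & 0xFF00) | version);
}

/* Hypothetical helper: recover the page size (high byte, times 256). */
static uint16_t
unpack_page_size(uint16_t pd_pagesize_version)
{
    return (uint16_t) (pd_pagesize_version & 0xFF00);
}

/* Hypothetical helper: recover the layout version (low byte). */
static uint8_t
unpack_version(uint16_t pd_pagesize_version)
{
    return (uint8_t) (pd_pagesize_version & 0x00FF);
}

/* Count the pd_flags bits that are not yet assigned a meaning. */
static int
spare_flag_bits(void)
{
    int spare = 0;

    for (int bit = 0; bit < 16; bit++)
    {
        if ((PD_VALID_FLAG_BITS & (1u << bit)) == 0)
            spare++;
    }
    return spare;
}
```

With an 8192 byte page at version 4 the packed field is 0x2004, and only one bit of the low byte is set, which is the "just one bit" mentioned above.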
Since we don't care too much about backwards compatibility with data in Postgres 8.2 and below, we can just assume that all pages are version 4, unless marked otherwise with additional flags. We then use two separate bits in pd_flags to show PD_HAS_CRC (0x0008 and 0x8000). We then completely replace the 16-bit pd_pagesize_version field with a 16-bit CRC value, rather than a 32-bit value. Why two flag bits? If either CRC bit is set we assume the page's CRC is supposed to be valid. This ensures that a single bit error doesn't switch off CRC checking when it was supposed to be active. I suggest we remove the page size data completely; if we need to keep that we should mark 8192 bytes as the default and set bits for 16 kB and 32 kB respectively.

With those changes, we are able to re-organise the page header so that we can add a 16-bit checksum (CRC), yet retain the same size of header. Thus, we don't need to change PageGetItemId(). We would require changes to PageHeaderIsValid() and PageInit() only. Making these changes means we are reducing the number of bits used to validate the page header, though we are providing a much better way of detecting page validity, so the change is of positive benefit.

Adding a CRC was a performance concern because of the hint bit problem, so making the value 16 bits long gives performance where it is needed. Note that we do now have a separation of bgwriter and checkpointer, so we have more CPU bandwidth to address the problem. Adding multiple bgwriters is also possible.

Notably, this proposal makes CRC checking optional, so if performance is a concern it can be disabled completely.

Which CRC algorithm to choose?

"A study of error detection capabilities for random independent bit errors and burst errors reveals that XOR, two's complement addition, and Adler checksums are suboptimal for typical network use.
Instead, one's complement addition should be used for networks willing to sacrifice error detection effectiveness to reduce compute cost, Fletcher checksum for networks looking for a balance of error detection and compute cost, and CRCs for networks willing to pay a higher compute cost for significantly improved error detection."

The Effectiveness of Checksums for Embedded Control Networks, Maxino, T.C.; Koopman, P.J., IEEE Transactions on Dependable and Secure Computing, Jan.-March 2009.
Available here: http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

Based upon that paper, I suggest we use Fletcher-16. The overall concept is not sensitive to the choice of checksum algorithm, however, and the algorithm itself could be another option: F16 or CRC. My poor understanding of the difference is that F16 is about 20 times cheaper to calculate, at the expense of about 1000 times worse error detection (but still pretty good).

16-bit CRCs are not the strongest available, but still support excellent error detection rates: better than 1 failure in a million, possibly much better depending on which algorithm and block size. That's easily good enough to detect our kind of errors.

This idea doesn't rule out the possibility of a 4 byte CRC-32 added in the future, since we still have 11 bits spare for use as future page version indicators. (If we did that, it is clear that we should add the checksum as a *trailer*, not as part of the header.)

So overall, I do now think it's still possible to add an optional checksum in the 9.2 release and am willing to pursue it unless there are technical objections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
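To make the proposal concrete, here is a rough sketch of the "either flag bit enables checking" rule combined with a textbook Fletcher-16. This is not a patch: the macro and function names are hypothetical, the two bit values are simply the ones suggested above, and which bytes of the page the checksum covers (and how the stored 16-bit field is excluded from its own calculation) is deliberately left open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The two pd_flags bits proposed above; the macro names are hypothetical. */
#define PD_HAS_CRC_LO   0x0008
#define PD_HAS_CRC_HI   0x8000

/* Checking is expected if EITHER bit is set, so a single flipped bit
 * cannot silently turn verification off for a checksummed page. */
static bool
page_expects_checksum(uint16_t pd_flags)
{
    return (pd_flags & (PD_HAS_CRC_LO | PD_HAS_CRC_HI)) != 0;
}

/* Textbook Fletcher-16: two running sums modulo 255, combined into
 * 16 bits with the second sum in the high byte. */
static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;

    for (size_t i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t) ((sum2 << 8) | sum1);
}

/* Hypothetical verification step: "body" is whatever byte range the
 * checksum is defined to cover. Pages without either flag bit set are
 * accepted without computing anything, since checking is optional. */
static bool
checksum_ok(uint16_t pd_flags, uint16_t stored,
            const uint8_t *body, size_t len)
{
    if (!page_expects_checksum(pd_flags))
        return true;
    return fletcher16(body, len) == stored;
}
```

The modulo on every byte keeps the sketch easy to read; a real implementation would more likely defer the reduction across a run of bytes, which is where much of Fletcher's speed advantage over a CRC comes from.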