9.3: summary of corruption detection / checksums / CRCs discussion - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | 9.3: summary of corruption detection / checksums / CRCs discussion |
Date | |
Msg-id | 1335044457.25680.99.camel@jdavis |
List | pgsql-hackers |
A lot of discussion took place regarding corruption detection, and I am
attempting to summarize it in a useful way. Please excuse the lack of
references; I'm hoping to agree on the basic problem space and the
nature of the solutions offered, and then turn it into a wiki where we
can get into the details. Also, almost every discussion touched on
several of these issues.

Please help me fill in the parts of the problem and directions in the
high-level solution that are missing. This thread is not intended to get
into low-level design or decision making.

First, I'll get a few of the smaller issues out of the way:

* In addition to data pages and slru/clog, it also may be useful to
  detect problems with temp files. It's also much easier, because we
  don't need to worry about upgrade or crash scenarios. Performance
  impact unknown, but could easily be turned on/off at runtime.

* In addition to detecting random garbage, we also need to be able to
  detect zeroing of pages. Right now, a zero page is not considered
  corrupt, so that's a problem. We'll need to WAL table extension
  operations, and we'll need to mitigate the performance impact of doing
  so. I think we can do that by extending larger tables by many pages
  (say, 16 at a time) so we can amortize the cost of WAL and avoid
  contention.

* Utilities, like those for taking a base backup, should also verify the
  checksum.

* In addition to detecting random garbage and zeros, we need to detect
  entire pages being transposed into different parts of the same file or
  different files. To do this we can include the database ID,
  tablespace, relfilenode, and page number in the CRC calculation.
  Perhaps only include relfilenode and page# to make it easier for
  utilities to check. None of this information needs to actually be
  stored on the page, so it doesn't affect the header. However, if we
  are going to expand the page header anyway, it would be useful to
  include this information so that the CRC can be calculated without any
  external context.

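As a rough illustration of the last point above (folding the page's
location into the CRC so that a transposed page fails verification),
here is a minimal Python sketch. It is only a sketch under stated
assumptions: zlib's CRC-32 stands in for whatever checksum would really
be used, and the function name, the packing of relfilenode and page
number, and the sample values are all hypothetical, not anything from
the PostgreSQL source.

```python
import struct
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size


def page_checksum(page: bytes, relfilenode: int, blkno: int) -> int:
    """CRC-32 over the page's location followed by its contents.

    Hypothetical helper: the location is fed into the CRC but is not
    stored anywhere on the page itself.
    """
    if len(page) != PAGE_SIZE:
        raise ValueError("unexpected page size")
    location = struct.pack("<II", relfilenode, blkno)
    # zlib.crc32 accepts a running value, so the location acts as a prefix.
    return zlib.crc32(page, zlib.crc32(location))


# Arbitrary non-zero page contents for the demonstration.
page = bytes(range(256)) * (PAGE_SIZE // 256)

# Checksum computed when the page is written as block 7 of relfilenode 16384.
stored = page_checksum(page, relfilenode=16384, blkno=7)

# The same bytes read back from the right place verify ...
assert page_checksum(page, 16384, 7) == stored
# ... but the same bytes showing up at another block or in another file do not.
assert page_checksum(page, 16384, 8) != stored
assert page_checksum(page, 16385, 7) != stored
```

A utility that knows only the file name and block offset could run the
same check, which is the argument for limiting the inputs to relfilenode
and page number.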
Now, onto the two big problems, upgrade and torn pages:

-----------------------------------------------
UPGRADE (and on/off)
-----------------------------------------------

* Should we try to use existing space in header?

  It's very appealing to be able to avoid the upgrade work by using
  existing space in the header. There was a surprising amount of
  discussion about which space to use: pd_pagesize_version or pd_tli.
  There was also some discussion of using a few bits from various
  places.

* Table-level, or system level?

  Table-level would be appealing if there turns out to be a significant
  performance impact. But there are challenges during recovery, because
  no relcache is available. It seems like a relatively minor problem,
  because pages would indicate whether they have a checksum or not, but
  there are some details to be worked out.

* If we do expand the header, we need an upgrade path.

  One proposed approach is to start reserving the necessary space in the
  previous version (with a simple point release), and have some way to
  verify that all of the pages have the required free space to upgrade.
  Then, the new version can update pages gradually, with some final
  VACUUM to ensure that all pages are the new version. That sounds easy,
  except that we need some way to free up space on the old pages in the
  old version, which is non-trivial. For heap pages, that could be like
  an update; but for index pages, it would need to be something like a
  page split, which is specific to the index type.

* Also, if we expand the page header, we need to figure out something
  for the SLRU/CLOG as well.

* We'll need some variant of VACUUM to turn checksums on/off (either
  per-table or system wide).

-----------------------------------------------
TORN PAGES
-----------------------------------------------

We don't want torn pages to falsely indicate a checksum failure.
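
To make the torn-page concern concrete, here is a small Python sketch of
how a partial write can look like corruption. Everything in it is an
assumption for illustration: a 512-byte atomic write unit, a checksum
kept in the first four bytes of the page, zlib's CRC-32 as a stand-in
checksum, and a one-bit in-place change (for example, a hint bit) late
in the page. None of the names or the layout come from PostgreSQL.

```python
import zlib

PAGE_SIZE = 8192    # PostgreSQL's default block size
SECTOR_SIZE = 512   # assumed atomic write unit of the storage
CHECKSUM_BYTES = 4  # hypothetical: checksum lives in the first 4 bytes


def with_checksum(body: bytes) -> bytes:
    """Prefix the page body with the CRC-32 of that body."""
    return zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "little") + body


def verify(page: bytes) -> bool:
    """Recompute the CRC-32 of the body and compare with the stored value."""
    stored = int.from_bytes(page[:CHECKSUM_BYTES], "little")
    return zlib.crc32(page[CHECKSUM_BYTES:]) == stored


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate a crash after only the first few sectors reached disk."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


# Old page image: arbitrary contents with a valid checksum.
old_body = bytearray(i % 251 for i in range(PAGE_SIZE - CHECKSUM_BYTES))
old_page = with_checksum(bytes(old_body))

# New page image: one bit flipped late in the page, checksum recomputed.
new_body = bytearray(old_body)
new_body[6000] ^= 0x01
new_page = with_checksum(bytes(new_body))

# Crash after the first sector: new checksum, but the old body on disk.
torn = torn_write(old_page, new_page, sectors_written=1)

assert verify(old_page) and verify(new_page)
assert not verify(torn)  # neither image is corrupt, yet the check fails
```

Both page images are individually valid; only the mix of the two fails,
which is why the page needs some protection beyond the checksum itself.
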
Many page writes are already protected from this with full-page images
in the WAL; but hint bit updates (including the index dead tuple
markers) are not.

* Just pay the price -- WAL all hint bit updates, including FPIs.

* Double-Write buffer -- this attacks the problem most directly. Don't
  make any changes to the way hint bits are done; instead, push all page
  writes through a double-write buffer. There are numerous performance
  implications here, some of which may improve performance and some
  which may hurt performance. It's hard to say, at the end, whether this
  will be a good solution for everyone (particularly those without
  battery-backed caches), but it seems like an accepted approach that
  can be very good for the people who need performance the most.

* Bulk Load -- this is more indirect. The idea is that, during normal
  OLTP operation, using the WAL for hints might not be so bad, because
  the page is likely to need a FPI for some other reason. The worst case
  is when bulk loading, so see if we can set hint bits during the bulk
  load in an MVCC-safe way.
  http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com

* Some way of caching CLOG information or making the access faster.
  IIRC, there were some vague ideas about mmapping() the CLOG, or
  caching a very small representation of the CLOG.

* Something else -- There are a few other lines of thought here. For
  instance, can we use WAL for hint bits without a FPI, and still
  protect against torn pages causing CRC failures? This is related to a
  comment during the 2011 developer meeting, where someone brought up
  the idea of idempotent WAL changes, and how that could help us avoid
  FPIs. It seems possible after reading the discussions, but I'm not
  clear enough on the challenges to summarize them here.

If we do use WAL for hint bit updates, that has an impact on Hot
Standby, because HS can't write WAL. So, it would seem that HS could not
set hint bits.

Comments?

Regards,
	Jeff Davis