Thread: 9.3: summary of corruption detection / checksums / CRCs discussion
A lot of discussion took place regarding corruption detection, and I am attempting to summarize it in a useful way. Please excuse the lack of references; I'm hoping to agree on the basic problem space and the nature of the solutions offered, and then turn it into a wiki where we can get into the details. Also, almost every discussion touched on several of these issues. Please help me fill in the parts of the problem and directions in the high-level solution that are missing. This thread is not intended to get into low-level design or decision making.

First, I'll get a few of the smaller issues out of the way:

* In addition to data pages and slru/clog, it also may be useful to detect problems with temp files. It's also much easier, because we don't need to worry about upgrade or crash scenarios. Performance impact unknown, but could easily be turned on/off at runtime.

* In addition to detecting random garbage, we also need to be able to detect zeroing of pages. Right now, a zero page is not considered corrupt, so that's a problem. We'll need to WAL table extension operations, and we'll need to mitigate the performance impact of doing so. I think we can do that by extending larger tables by many pages (say, 16 at a time) so we can amortize the cost of WAL and avoid contention.

* Utilities, like those for taking a base backup, should also verify the checksum.

* In addition to detecting random garbage and zeros, we need to detect entire pages being transposed into different parts of the same file or different files. To do this we can include the database ID, tablespace, relfilenode, and page number in the CRC calculation. Perhaps only include relfilenode and page# to make it easier for utilities to check. None of this information needs to actually be stored on the page, so it doesn't affect the header. However, if we are going to expand the page header anyway, it would be useful to include this information so that the CRC can be calculated without any external context. (A rough code sketch of this idea follows at the end of this message.)

Now, onto the two big problems, upgrade and torn pages:

-----------------------------------------------
UPGRADE (and on/off)
-----------------------------------------------

* Should we try to use existing space in header? It's very appealing to be able to avoid the upgrade work by using existing space in the header. There was a surprising amount of discussion about which space to use: pd_pagesize_version or pd_tli. There was also some discussion of using a few bits from various places.

* Table-level, or system level? Table-level would be appealing if there turns out to be a significant performance impact. But there are challenges during recovery, because no relcache is available. It seems like a relatively minor problem, because pages would indicate whether they have a checksum or not, but there are some details to be worked out.

* If we do expand the header, we need an upgrade path. One proposed approach is to start reserving the necessary space in the previous version (with a simple point release), and have some way to verify that all of the pages have the required free space to upgrade. Then, the new version can update pages gradually, with some final VACUUM to ensure that all pages are the new version. That sounds easy, except that we need some way to free up space on the old pages in the old version, which is non-trivial. For heap pages, that could be like an update; but for index pages, it would need to be something like a page split, which is specific to the index type.

* Also, if we expand the page header, we need to figure out something for the SLRU/CLOG as well.

* We'll need some variant of VACUUM to turn checksums on/off (either per-table or system wide).

-----------------------------------------------
TORN PAGES
-----------------------------------------------

We don't want torn pages to falsely indicate a checksum failure. Many page writes are already protected from this with full-page images in the WAL; but hint bit updates (including the index dead tuple markers) are not.

* Just pay the price -- WAL all hint bit updates, including FPIs.

* Double-Write buffer -- this attacks the problem most directly. Don't make any changes to the way hint bits are done; instead, push all page writes through a double-write buffer. There are numerous performance implications here, some of which may improve performance and some which may hurt performance. It's hard to say, at the end, whether this will be a good solution for everyone (particularly those without battery-backed caches), but it seems like an accepted approach that can be very good for the people who need performance the most.

* Bulk Load -- this is more indirect. The idea is that, during normal OLTP operation, using the WAL for hints might not be so bad, because the page is likely to need a FPI for some other reason. The worst case is when bulk loading, so see if we can set hint bits during the bulk load in an MVCC-safe way.
http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com

* Some way of caching CLOG information or making the access faster. IIRC, there were some vague ideas about mmapping() the CLOG, or caching a very small representation of the CLOG.

* Something else -- There are a few other lines of thought here. For instance, can we use WAL for hint bits without a FPI, and still protect against torn pages causing CRC failures? This is related to a comment during the 2011 developer meeting, where someone brought up the idea of idempotent WAL changes, and how that could help us avoid FPIs. It seems possible after reading the discussions, but not clear enough on the challenges to summarize here.

If we do use WAL for hint bit updates, that has an impact on Hot Standby, because HS can't write WAL. So, it would seem that HS could not set hint bits.

Comments?

Regards,
Jeff Davis
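A minimal sketch of the page-transposition idea from the bullet list above: mix the page's expected location into the checksum so that a page copied intact to the wrong block or file no longer verifies. Everything here is illustrative only, with made-up names and a toy FNV-1a hash rather than whatever checksum algorithm is ultimately chosen; it is not actual PostgreSQL code.

    #include <stdint.h>
    #include <stddef.h>

    #define BLCKSZ 8192

    /*
     * Sketch only: fold the page's identity (relfilenode and block number)
     * into a simple FNV-1a hash of the page contents.  A page copied
     * byte-for-byte to a different block or file then fails verification,
     * because the location used at verify time no longer matches the one
     * used when the checksum was computed.  None of the location data has
     * to be stored on the page itself.
     */
    static uint32_t
    page_checksum_with_location(const uint8_t *page, uint32_t relfilenode,
                                uint32_t blkno)
    {
        uint32_t hash = 2166136261u;        /* FNV offset basis */
        size_t   i;

        /* Mix in the expected location first... */
        for (i = 0; i < 4; i++)
        {
            hash ^= (relfilenode >> (8 * i)) & 0xff;
            hash *= 16777619u;              /* FNV prime */
        }
        for (i = 0; i < 4; i++)
        {
            hash ^= (blkno >> (8 * i)) & 0xff;
            hash *= 16777619u;
        }

        /* ...then the page contents.  (A real implementation would skip or
         * zero the stored checksum field while hashing.) */
        for (i = 0; i < BLCKSZ; i++)
        {
            hash ^= page[i];
            hash *= 16777619u;
        }
        return hash;
    }

Verification recomputes the value using the location the page was actually read from, so a transposed page fails even though its contents are internally self-consistent.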
On Sat, Apr 21, 2012 at 10:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:

> * In addition to detecting random garbage, we also need to be able to detect zeroing of pages. Right now, a zero page is not considered corrupt, so that's a problem. We'll need to WAL table extension operations, and we'll need to mitigate the performance impact of doing so. I think we can do that by extending larger tables by many pages (say, 16 at a time) so we can amortize the cost of WAL and avoid contention.

I haven't seen this come up in discussion. WAL logging table extensions wouldn't by itself work because currently we treat the file size on disk as the size of the table. So you would have to do the extension in the critical section or else different backends might see the wrong file size and write out conflicting wal entries.

> -----------------------------------------------
> TORN PAGES
> -----------------------------------------------
>
> We don't want torn pages to falsely indicate a checksum failure. Many page writes are already protected from this with full-page images in the WAL; but hint bit updates (including the index dead tuple markers) are not.
>
> * Just pay the price -- WAL all hint bit updates, including FPIs.
>
> * Double-Write buffer -- this attacks the problem most directly. Don't

The earlier consensus was to move all the hint bits to a dedicated area and exclude them from the checksum. I think double-write buffers seem to have become more fashionable but a summary that doesn't describe the former is definitely incomplete.

Fwiw the tradeoff here is at least partly between small and large systems. For double writes to be at all efficient you need either flash or a dedicated third spindle in addition to the logs and data. For smaller systems that would be a huge cost but for larger ones that's not really a problem at all.

> * Bulk Load -- this is more indirect. The idea is that, during normal OLTP operation, using the WAL for hints might not be so bad, because the page is likely to need a FPI for some other reason. The worst case is when bulk loading, so see if we can set hint bits during the bulk load in an MVCC-safe way.
> http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com

That link points to the MVCC-safe truncate patch. I don't follow how optimizations in bulk loads are relevant to wal logging hint bit updates.

--
greg
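To illustrate the "dedicated hint area excluded from the checksum" idea mentioned above: if all hints lived in one fixed region of the page, the verifier could simply skip that region, so a torn write that only changed hints could never produce a false checksum failure. The offsets and the toy checksum below are invented for illustration; as noted later in the thread, the hard part is that today's hints are scattered through tuple headers and index line pointers, not gathered in one place.

    #include <stdint.h>
    #include <stddef.h>

    #define BLCKSZ        8192
    #define HINT_AREA_OFF 24      /* hypothetical: hints collected here ...  */
    #define HINT_AREA_LEN 64      /* ... in a fixed-size region after header */

    /*
     * Sketch of checksumming a page while ignoring a dedicated hint area,
     * so that a torn write affecting only hints cannot look like corruption.
     */
    static uint16_t
    checksum_skipping_hints(const uint8_t *page)
    {
        uint32_t sum = 0;
        size_t   i;

        for (i = 0; i < BLCKSZ; i++)
        {
            if (i >= HINT_AREA_OFF && i < HINT_AREA_OFF + HINT_AREA_LEN)
                continue;                           /* hint area not covered */
            sum = (sum << 1 | sum >> 31) ^ page[i]; /* toy rotate-and-xor    */
        }
        return (uint16_t) (sum ^ (sum >> 16));
    }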
On Sun, 2012-04-22 at 00:08 +0100, Greg Stark wrote:

> On Sat, Apr 21, 2012 at 10:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > * In addition to detecting random garbage, we also need to be able to detect zeroing of pages. Right now, a zero page is not considered corrupt, so that's a problem. We'll need to WAL table extension operations, and we'll need to mitigate the performance impact of doing so. I think we can do that by extending larger tables by many pages (say, 16 at a time) so we can amortize the cost of WAL and avoid contention.
>
> I haven't seen this come up in discussion.

I don't have any links, and it might just be based on in-person discussions. I think it's just being left as a loose end for later, but it will eventually need to be solved.

> WAL logging table extensions wouldn't by itself work because currently we treat the file size on disk as the size of the table. So you would have to do the extension in the critical section or else different backends might see the wrong file size and write out conflicting wal entries.

By "critical section", I assume you mean "while holding the relation extension lock" not "while inside a CRITICAL_SECTION()", right? There would be some synchronization overhead, to be sure, but I think it can be done. Ideally, we'd be able to do large enough extensions that, if there is a parallel bulk load on a single table or something, the overhead could be made insignificant.

I didn't intend to get too much into the detail in this thread, but if it's a totally ridiculous or impossible idea, I'll remove it.

> The earlier consensus was to move all the hint bits to a dedicated area and exclude them from the checksum. I think double-write buffers seem to have become more fashionable but a summary that doesn't describe the former is definitely incomplete.

Thank you, that's the kind of omission I was looking to catch.

> That link points to the MVCC-safe truncate patch. I don't follow how optimizations in bulk loads are relevant to wal logging hint bit updates.

I should have linked to these messages:

http://archives.postgresql.org/message-id/CA+TgmoYLOzDezzJKyJ8_x2bPeEerAo5dJ-OMvS1fLQOQSQP5jg@mail.gmail.com
http://archives.postgresql.org/message-id/CA+Tgmoa4Xs1jbZhm=pb9Xi4AGMJXRB2a4GSE9EJtLo=70Zne=g@mail.gmail.com

Though perhaps I'm reading too much into Robert's comments.

Regards,
Jeff Davis
On Sat, Apr 21, 2012 at 5:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:

> * In addition to detecting random garbage, we also need to be able to detect zeroing of pages. Right now, a zero page is not considered corrupt, so that's a problem. We'll need to WAL table extension operations, and we'll need to mitigate the performance impact of doing so. I think we can do that by extending larger tables by many pages (say, 16 at a time) so we can amortize the cost of WAL and avoid contention.

I think that extending tables in larger chunks is probably a very good idea for performance reasons completely apart from checksums. However, I don't think that WAL-logging relation extension will allow you to treat a zero page as an error condition unless you not only WAL-log the operation but also *flush WAL* before performing the actual table-extension operation. Otherwise, we might crash after the actual extension and before the WAL record hits the disk, and now we're back to having a zero'd page in the file. And the impact of writing *and flushing* WAL for every extension seems likely to be more than we're willing to pay. If we extended 16 pages at a time, that means waiting for 8 WAL fsyncs per MB of relation extension. On my MacBook Pro, which is admittedly a pessimal case for fsyncs, that would work out to an extra half second of elapsed time per MB written, so pgbench -i -s 100 would probably take about an extra 640 seconds. If that's a correct analysis, that sounds pretty bad, because right now it's taking 143 seconds.

> * Should we try to use existing space in header? It's very appealing to be able to avoid the upgrade work by using existing space in the header. There was a surprising amount of discussion about which space to use: pd_pagesize_version or pd_tli. There was also some discussion of using a few bits from various places.

It still seems to me that pd_tli is the obvious choice, since there is no code anywhere in the system that relies on that field having any particular value, so we can pretty much whack it around at will without breaking anything. AFAICS, the only reason to bump the page format is if we want a longer checksum - 32 bits, say, instead of 16. But I am inclined to think that 16 ought to be enough to detect the vast majority of cases of corruption.

> * Table-level, or system level? Table-level would be appealing if there turns out to be a significant performance impact. But there are challenges during recovery, because no relcache is available. It seems like a relatively minor problem, because pages would indicate whether they have a checksum or not, but there are some details to be worked out.

I think the root of the issue here is that it's hard to turn checksums on and off *online*. If checksums were an initdb-time option, the design here would be pretty simple: store the flag in the control file. And it wouldn't even be hard to allow the flag to be flipped while the cluster is shut down: just write a utility to checksum and rewrite all the blocks, fsync everything, and then flip the flag in the control file and fsync that; also, force the user to rebuild all their standbys. This might not be very convenient, but it would be comparatively simple to implement. However, I've been assuming that we do want to handle turning checksums on and off without shutting down the cluster, and that definitely makes things more complicated.
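Unpacking the arithmetic behind the relation-extension estimate above; the inputs are the ones stated in the message, except the data volume for pgbench -i -s 100, which is an assumed figure chosen to match the quoted total:

    16 pages/extension x 8 kB/page                      = 128 kB per extension
    1024 kB / 128 kB per extension                      = 8 WAL fsyncs per MB extended
    ~0.5 s extra fsync wait per MB x ~1280 MB of data   ~ 640 extra seconds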
In theory it could work more or less the same way as in the off-line case: you launch some command or function that visits every data block in the cluster and rewrites them all, xlogging as it goes. However, I'm a little concerned that it won't be very usable in that form, for some of the same reasons that the off-line solution might not be very usable: you essentially force the user to do a very large data-rewriting operation all at once. We already know that kicking off a big vacuum in the background can impact the performance of the whole cluster, and that's just one table; this would be the whole cluster all at once. Moreover, if it's based on the existing vacuum code, which seems generally desirable, it seems like it's going to have to be something you run in one database at a time until you've hit them all. In that case, you're going to need some way to indicate which tables or databases have been processed already and which haven't yet ... and if you have that, then it's not a very big step to think that maybe we ought to just control it on a per-table or per-database level to begin with, since that would also have some of the collateral benefits you mention. The alternative is to have a new command, or a new form of vacuum, that processes every table in every database in the cluster regardless of which DB you're connected to at the moment. That's a lot of new, fairly special-purpose code and it doesn't seem very convenient from a management perspective, but it is probably simpler otherwise.

I'm not sure what the right decision is here. I like the idea of being able to activate and deactivate the feature without shutting down the cluster, and I really like the idea of being able to do it incrementally, one table at a time, instead of all at once. But it's surely more complex. If you store the flag at the table level, then it's not available during recovery (unless we invent some new trick to make that work, like storing it in some other format that is simple enough to be interpreted by the recovery code); if you store it at the page level, then you can miss corruption if that corruption has the side effect of clearing the bit. It's not dissimilar to your complaint upthread about empty pages: if you see a page that says it hasn't been checksummed, how do you know whether it's really supposed to have no checksum, or whether it ended up unchecksummed because of exactly the sort of error we're trying to detect?

> We don't want torn pages to falsely indicate a checksum failure. Many page writes are already protected from this with full-page images in the WAL; but hint bit updates (including the index dead tuple markers) are not.
>
> * Just pay the price -- WAL all hint bit updates, including FPIs.

Simon's patch contains a pretty good idea on this point: we don't actually need to WAL-log all hint bit updates. In the worst case, we need to log an FPI once per checkpoint cycle, and then only if the page is hint-bit-dirtied before it's dirtied by a fully WAL-logged operation. Moreover, if we WAL-log an FPI when the page is hint-bit-dirtied, and a fully WAL-logged operation comes along later in the same checkpoint cycle, we can skip the FPI for the second operation, which probably recovers most of the cost of the previous FPI. The real problem here is: what happens when we're dirtying a large number of pages only for hint bits? In that case, we're potentially generating a huge volume of WAL.
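A sketch of the decision rule just described, under assumed names (this is not the actual patch): a hint-bit-only dirtying needs a full-page image only if the page hasn't already been covered by WAL since the current checkpoint's redo pointer; once an FPI has been written, the page LSN moves past the redo point, and any later change in the same cycle, hinted or fully logged, can skip the image.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    /* Toy per-buffer state, for illustration only. */
    typedef struct ToyBuffer
    {
        XLogRecPtr page_lsn;    /* LSN of last WAL record that touched the page */
    } ToyBuffer;

    /*
     * "At most one FPI per checkpoint cycle": a hint-bit update needs a
     * full-page image only if the page has no WAL coverage newer than the
     * current checkpoint's redo pointer.
     */
    static bool
    hint_update_needs_fpi(const ToyBuffer *buf, XLogRecPtr redo_ptr,
                          bool checksums_enabled)
    {
        if (!checksums_enabled)
            return false;                   /* torn hints are harmless without checksums */
        return buf->page_lsn <= redo_ptr;   /* nothing logged for this page since checkpoint */
    }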
We can probably optimize the bulk-loading case, but there are going to be other cases that stink. Apart from bulk-loading, pgbench doesn't take much of a hit because it's basically not going to touch a pgbench_accounts page again until it comes back to update another row on the same page, so the overhead shouldn't be that bad. But it might be much worse on some other workload where there are lots of reads mixed in with the writes: now you've got a good chance of kicking out a bunch of extra FPIs. There are some things we could do about that - e.g. make sure to set all possible hint bits before evicting an already-dirty buffer, so that the chances of needing to dirty it again the next time we visit are minimized. However, I don't think all of that is work that has to be done up front. I suspect we will need to do some mitigation of the bulk-loading case before committing even an initial version of checksums, but I think beyond that we could probably postpone most of the other optimization in this area to a future patch.

> * Double-Write buffer -- this attacks the problem most directly. Don't make any changes to the way hint bits are done; instead, push all page writes through a double-write buffer. There are numerous performance implications here, some of which may improve performance and some which may hurt performance. It's hard to say, at the end, whether this will be a good solution for everyone (particularly those without battery-backed caches), but it seems like an accepted approach that can be very good for the people who need performance the most.

I'm pretty skeptical of this approach. It may work out well with specialized hardware, but none of the hardware I have is going to cope with that volume of fsyncs anything like gracefully. Unless of course we make the buffer really big, but that hasn't seemed to be a popular approach.

> * Some way of caching CLOG information or making the access faster. IIRC, there were some vague ideas about mmapping() the CLOG, or caching a very small representation of the CLOG.

I'm interested in pursuing this in general, independent of checksums. Not sure how far I will get, but reducing CLOG contention is something I do plan to spend more time on.

> * Something else -- There are a few other lines of thought here. For instance, can we use WAL for hint bits without a FPI, and still protect against torn pages causing CRC failures? This is related to a comment during the 2011 developer meeting, where someone brought up the idea of idempotent WAL changes, and how that could help us avoid FPIs. It seems possible after reading the discussions, but not clear enough on the challenges to summarize here.

There are some possibilities here, but overall I have my doubts about backing up checksums behind another complex project with uncertain prospects.

> If we do use WAL for hint bit updates, that has an impact on Hot Standby, because HS can't write WAL. So, it would seem that HS could not set hint bits.

Yeah. The good news is that WAL-logging hint bits on the master would propagate those hint bits to the slave, which would reduce the need to set them on the slave. But there's definitely some potential for performance regression here; I'm just not sure how much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
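To make the fsync concern about the double-write approach concrete, here is a minimal sketch of the write path it implies. It is a toy under assumed file descriptors, not a proposal or existing code; a real implementation would batch many pages per double-write flush, which is exactly where the "volume of fsyncs" worry comes from.

    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * Double-write sketch: a dirty page is first appended to a dedicated
     * double-write file and fsync'd there, and only then written to its
     * real location.  After a crash, a torn page in a data file can be
     * recovered from the intact copy in the double-write file.
     */
    static int
    double_write_page(int dw_fd, int data_fd, off_t data_offset,
                      const uint8_t page[BLCKSZ])
    {
        if (write(dw_fd, page, BLCKSZ) != BLCKSZ)
            return -1;
        if (fsync(dw_fd) != 0)          /* the page must be durable somewhere first */
            return -1;
        if (pwrite(data_fd, page, BLCKSZ, data_offset) != BLCKSZ)
            return -1;
        return 0;                       /* data file is fsync'd later, e.g. at checkpoint */
    }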
On Sat, Apr 21, 2012 at 7:08 PM, Greg Stark <stark@mit.edu> wrote:

> The earlier consensus was to move all the hint bits to a dedicated area and exclude them from the checksum. I think double-write buffers seem to have become more fashionable but a summary that doesn't describe the former is definitely incomplete.

I don't think we ever had any consensus that moving the hint bits around was a good idea. For one thing, they are only hints in one direction. It's OK to clear them by accident, but it's not OK to set them by accident. For two things, it's not exactly clear how we'd rearrange the page to make this work at all: where are those hint bits gonna go, if not in the tuple headers? For three things, index pages have hint-type changes that are not single-bit changes.

> That link points to the MVCC-safe truncate patch. I don't follow how optimizations in bulk loads are relevant to wal logging hint bit updates.

That patch actually has more than one optimization in it, I think, but the basic idea is that if we could figure out a way to set HEAP_XMIN_COMMITTED when loading data into a table created or truncated within the same transaction, the need to set hint bits on first scan of the table would be eliminated. Writing the xmin as FrozenTransactionId would save even more, though it introduces some additional complexity.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 24, 2012 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> For three things, index pages have hint-type changes that are not single-bit changes.

? Just how big are these? Part of the reason hint bit updates are safe is because one bit definitely absolutely has to be entirely in one page. You can't tear a page in the middle of a bit. In reality the size is much larger, probably 4k and almost certainly at least 512 bytes. But the postgres block layout doesn't really offer many guarantees about the location of anything relative to those 512 byte blocks, so probably anything larger than a word is unsafe to update.

The main problem with the approach was that we kept finding more hint bits we had forgotten about. Once the coding idiom was established, it seems it was a handy hammer for a lot of problems.

--
greg
On 4/21/12 2:40 PM, Jeff Davis wrote:

> If we do use WAL for hint bit updates, that has an impact on Hot Standby, because HS can't write WAL. So, it would seem that HS could not set hint bits.

If we're WAL-logging hint bits, then the standby would be receiving them, so it doesn't *need* to write them. However, I suspect that WAL-logging hint bits would be prohibitively expensive.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Apr 24, 2012 at 3:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Sat, Apr 21, 2012 at 7:08 PM, Greg Stark <stark@mit.edu> wrote:
>> The earlier consensus was to move all the hint bits to a dedicated area and exclude them from the checksum. I think double-write buffers seem to have become more fashionable but a summary that doesn't describe the former is definitely incomplete.
>
> I don't think we ever had any consensus that moving the hint bits around was a good idea. For one thing, they are only hints in one direction. It's OK to clear them by accident, but it's not OK to set them by accident. For two things, it's not exactly clear how we'd rearrange the page to make this work at all: where are those hint bits gonna go, if not in the tuple headers? For three things, index pages have hint-type changes that are not single-bit changes.
>
>> That link points to the MVCC-safe truncate patch. I don't follow how optimizations in bulk loads are relevant to wal logging hint bit updates.
>
> That patch actually has more than one optimization in it, I think, but the basic idea is that if we could figure out a way to set HEAP_XMIN_COMMITTED when loading data into a table created or truncated within the same transaction, the need to set hint bits on first scan of the table would be eliminated. Writing the xmin as FrozenTransactionId would save even more, though it introduces some additional complexity.

This would be great but it's only a corner case. A pretty common application flow is to write a large number of records, scan them, update them, scan them again, delete them, etc. in a table that's already established and possibly pretty large. Unfortunately this type of work doesn't get a lot of coverage with the common benchmarks.

Also, wouldn't the extra out of band wal traffic from hint bits exacerbate contention issues on the wal insert lock?

merlin
On Tue, Apr 24, 2012 at 8:52 PM, Greg Stark <stark@mit.edu> wrote:

> On Tue, Apr 24, 2012 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> For three things, index pages have hint-type changes that are not single-bit changes.
>
> ? Just how big are these? Part of the reason hint bit updates are safe is because one bit definitely absolutely has to be entirely in one page. You can't tear a page in the middle of a bit. In reality the size is much larger, probably 4k and almost certainly at least 512 bytes. But the postgres block layout doesn't really offer many guarantees about the location of anything relative to those 512 byte blocks, so probably anything larger than a word is unsafe to update.

See _bt_killitems. It uses ItemIdMarkDead, which looks like it will turn into a 4-byte store.

> The main problem with the approach was that we kept finding more hint bits we had forgotten about. Once the coding idiom was established, it seems it was a handy hammer for a lot of problems.

It is. And I think we shouldn't be lulled into the trap of thinking hint bits are bad. They do cause some problems, but they exist because they solve even worse problems. It's fundamentally pretty useful to be able to cache the results of expensive calculations in data pages, which is what hints allow us to do, and they let us do it without incurring the overhead of WAL-logging. Even if we could find a way of making CLOG access cheap enough that we didn't need HEAP_XMIN/XMAX_COMMITTED, it wouldn't clear the way to getting rid of hinting entirely.

I strongly suspect that the btree item-is-dead hinting is actually MORE valuable than the heap hint bits. CLOG probes are expensive, but there is room for optimization there through caching, and just because the data set is relatively limited in size. OTOH, the btree hints potentially save you a heap fetch on the next trip through, which potentially means a random I/O into a huge table. That's nothing to sneeze at. It also means that the next index insertion in the page can potentially prune that item away completely, allowing faster space re-use. That's nothing to sneeze at, either.

To put that another way, the reason why WAL-logging all hints seems expensive is because NOT WAL-logging hints is a huge performance optimization. If we can come up with an even better performance optimization that also reduces the need to write out hinted pages, then of course we should do that, but we shouldn't hate the optimization we have because it's not as good as the one we wish we had.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
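For anyone following along, the line pointer definitions behind that _bt_killitems observation look roughly like this (paraphrased from src/include/storage/itemid.h; consult the actual header for the exact text):

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                    lp_flags:2,     /* state of line pointer */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;

    #define LP_DEAD 3               /* dead, may or may not have storage */

    #define ItemIdMarkDead(itemId) \
        ((itemId)->lp_flags = LP_DEAD)

Because lp_flags is a 2-bit field packed into a 32-bit word shared with lp_off and lp_len, the assignment compiles to a read-modify-write of the whole word, which is why it behaves like a 4-byte store rather than a single-bit flip.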
On Wed, Apr 25, 2012 at 9:28 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

>> That patch actually has more than one optimization in it, I think, but the basic idea is that if we could figure out a way to set HEAP_XMIN_COMMITTED when loading data into a table created or truncated within the same transaction, the need to set hint bits on first scan of the table would be eliminated. Writing the xmin as FrozenTransactionId would save even more, though it introduces some additional complexity.
>
> This would be great but it's only a corner case. A pretty common application flow is to write a large number of records, scan them, update them, scan them again, delete them, etc. in a table that's already established and possibly pretty large. Unfortunately this type of work doesn't get a lot of coverage with the common benchmarks.
>
> Also, wouldn't the extra out of band wal traffic from hint bits exacerbate contention issues on the wal insert lock?

Hm, you probably remember the process-local hint bit cache. While my implementation had a lot of issues, and it's arguable if I'm a patient enough C programmer to come up with one that passes muster, it did work as advertised: it greatly reduced hint bit traffic, to the point of eliminating it, in cases like this, including the bulk load case. Some variant of the approach would probably take at least some of the sting out of the logged hint bits, if that's the direction things have to go.

That doesn't help high-concurrency cases where the #tuples/xid ratio is low, though. And even a small increase in contention on the wal insert lock could mean a measurable reduction in scaling, unless that problem is solved independently.

merlin
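For context on that last point, here is a very rough sketch of the kind of backend-local cache being described: a tiny direct-mapped table of recently looked-up transaction statuses. On a hit, the backend can treat the tuple as if its hint bit were already set, without probing CLOG and without dirtying (or WAL-logging) the page. The names and sizing are invented for illustration; this is not the actual patch.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define XID_CACHE_SIZE 1024     /* small, per-backend, so no locking is needed */

    typedef struct XidStatusCache
    {
        TransactionId xid[XID_CACHE_SIZE];
        uint8_t       status[XID_CACHE_SIZE];   /* 0 = empty, 1 = committed, 2 = aborted */
    } XidStatusCache;

    /* Return true and fill *status if the xid's fate is already cached. */
    static bool
    xid_cache_lookup(const XidStatusCache *cache, TransactionId xid,
                     uint8_t *status)
    {
        uint32_t slot = xid % XID_CACHE_SIZE;

        if (cache->xid[slot] == xid && cache->status[slot] != 0)
        {
            *status = cache->status[slot];
            return true;
        }
        return false;
    }

    /* Remember the outcome of a CLOG probe for later reuse. */
    static void
    xid_cache_store(XidStatusCache *cache, TransactionId xid, uint8_t status)
    {
        uint32_t slot = xid % XID_CACHE_SIZE;

        cache->xid[slot] = xid;
        cache->status[slot] = status;
    }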