Re: 9.3: summary of corruption detection / checksums / CRCs discussion - Mailing list pgsql-hackers

From:           Robert Haas
Subject:        Re: 9.3: summary of corruption detection / checksums / CRCs discussion
Date:
Msg-id:         CA+TgmoY9pB27f22aWfYrPXKYaoCfLNojDu5aTkWu-xrYyMgymA@mail.gmail.com
In response to: 9.3: summary of corruption detection / checksums / CRCs discussion (Jeff Davis <pgsql@j-davis.com>)
List:           pgsql-hackers
On Sat, Apr 21, 2012 at 5:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> * In addition to detecting random garbage, we also need to be able to
> detect zeroing of pages. Right now, a zero page is not considered
> corrupt, so that's a problem. We'll need to WAL table extension
> operations, and we'll need to mitigate the performance impact of doing
> so. I think we can do that by extending larger tables by many pages
> (say, 16 at a time) so we can amortize the cost of WAL and avoid
> contention.

I think that extending tables in larger chunks is probably a very good idea for performance reasons completely apart from checksums. However, I don't think that WAL-logging relation extension will allow you to treat a zero page as an error condition unless you not only WAL-log the operation but also *flush WAL* before performing the actual table-extension operation. Otherwise, we might crash after the actual extension and before the WAL record hits the disk, and then we're back to having a zeroed page in the file. And the impact of writing *and flushing* WAL for every extension seems likely to be more than we're willing to pay. If we extend 16 pages at a time, that's 128 kB per extension at the default 8 kB block size, which means waiting for 8 WAL fsyncs per MB of relation extension. On my MacBook Pro, which is admittedly a pessimal case for fsyncs, that would work out to an extra half second of elapsed time per MB written, so pgbench -i -s 100 (roughly 1.3 GB of accounts data) would probably take about an extra 640 seconds. If that's a correct analysis, it sounds pretty bad, because right now that run takes 143 seconds.

> * Should we try to use existing space in header? It's very appealing to
> be able to avoid the upgrade work by using existing space in the header.
> There was a surprising amount of discussion about which space to use:
> pd_pagesize_version or pd_tli. There was also some discussion of using a
> few bits from various places.

It still seems to me that pd_tli is the obvious choice, since there is no code anywhere in the system that relies on that field having any particular value, so we can pretty much whack it around at will without breaking anything (rough sketch below). AFAICS, the only reason to bump the page format is if we want a longer checksum - 32 bits, say, instead of 16. But I am inclined to think that 16 bits ought to be enough to detect the vast majority of cases of corruption.

> * Table-level, or system level? Table-level would be appealing if there
> turns out to be a significant performance impact. But there are
> challenges during recovery, because no relcache is available. It seems
> like a relatively minor problem, because pages would indicate whether
> they have a checksum or not, but there are some details to be worked
> out.

I think the root of the issue here is that it's hard to turn checksums on and off *online*. If checksums were an initdb-time option, the design here would be pretty simple: store the flag in the control file. And it wouldn't even be hard to allow the flag to be flipped while the cluster is shut down: just write a utility that checksums and rewrites all the blocks, fsyncs everything, and then flips the flag in the control file and fsyncs that; also, force the user to rebuild all their standbys. This might not be very convenient, but it would be comparatively simple to implement. However, I've been assuming that we do want to handle turning checksums on and off without shutting down the cluster, and that definitely makes things more complicated.
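(Going back to the header question for a second: just to make the pd_tli idea concrete, the change could be as small as the sketch of src/include/storage/bufpage.h below. This is only a sketch - the replacement field name is illustrative, not settled.)

/* Sketch of src/include/storage/bufpage.h: the only change from the
 * current layout is that the 16-bit pd_tli field becomes a 16-bit
 * checksum.  Field name illustrative. */
typedef struct PageHeaderData
{
	XLogRecPtr	pd_lsn;			/* LSN of last change to this page */
	uint16		pd_checksum;	/* was pd_tli; 16-bit page checksum */
	uint16		pd_flags;		/* flag bits */
	LocationIndex pd_lower;		/* offset to start of free space */
	LocationIndex pd_upper;		/* offset to end of free space */
	LocationIndex pd_special;	/* offset to start of special space */
	uint16		pd_pagesize_version;	/* page size and layout version */
	TransactionId pd_prune_xid;	/* oldest prunable XID, or zero if none */
	ItemIdData	pd_linp[1];		/* line pointer array */
} PageHeaderData;

Since pd_pagesize_version is untouched, that wouldn't require a page-format bump unless we later decide we want 32 bits after all.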
In theory, turning checksums on and off online could work more or less the same way as in the off-line case: you launch some command or function that visits every data block in the cluster and rewrites them all, xlogging as it goes. However, I'm a little concerned that it won't be very usable in that form, for some of the same reasons that the off-line solution might not be very usable: you essentially force the user to do a very large data-rewriting operation all at once. We already know that kicking off a big vacuum in the background can impact the performance of the whole cluster, and that's just one table; this would be the whole cluster all at once. Moreover, if it's based on the existing vacuum code, which seems generally desirable, it seems like it's going to have to be something you run in one database at a time until you've hit them all. In that case, you're going to need some way to indicate which tables or databases have been processed already and which haven't yet ... and if you have that, then it's not a very big step to think that maybe we ought to just control it on a per-table or per-database level to begin with, since that would also have some of the collateral benefits you mention. The alternative is to have a new command, or a new form of vacuum, that processes every table in every database regardless of which DB you're connected to at the moment. That's a lot of new, fairly special-purpose code, and it doesn't seem very convenient from a management perspective, but it is probably simpler in other respects.

I'm not sure what the right decision is here. I like the idea of being able to activate and deactivate the feature without shutting down the cluster, and I really like the idea of being able to do it incrementally, one table at a time, instead of all at once. But it's surely more complex. If you store the flag at the table level, then it's not available during recovery (unless we invent some new trick to make that work, like storing it in some other format that is simple enough to be interpreted by the recovery code); if you store it at the page level, then you can miss corruption if that corruption has the side effect of clearing the bit. It's not dissimilar to your complaint upthread about empty pages: just as you can't tell whether an all-zeroes page is legitimately empty or got zeroed by an error, if you see a page whose bit says it hasn't been checksummed, how do you know whether it really was never checksummed, or whether it ended up looking that way because of exactly the sort of corruption we're trying to detect?

> We don't want torn pages to falsely indicate a checksum failure. Many
> page writes are already protected from this with full-page images in the
> WAL; but hint bit updates (including the index dead tuple markers) are
> not.
>
> * Just pay the price -- WAL all hint bit updates, including FPIs.

Simon's patch contains a pretty good idea on this point: we don't actually need to WAL-log all hint bit updates. In the worst case, we need to log an FPI once per checkpoint cycle, and then only if the page is hint-bit-dirtied before it's dirtied by a fully WAL-logged operation. Moreover, if we WAL-log an FPI when the page is hint-bit-dirtied, and a fully WAL-logged operation comes along later in the same checkpoint cycle, we can skip the FPI for the second operation, which probably recovers most of the cost of the previous FPI. The real problem here is: what happens when we're dirtying a large number of pages only for hint bits? In that case, we're potentially generating a huge volume of WAL.
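To spell out that once-per-checkpoint rule a bit more concretely, here's a rough sketch of its shape - not Simon's actual code. checksums_enabled, log_hint_bit_fpi(), and the wrapper function itself are made up for illustration; the buffer/xlog calls are the existing APIs.

#include "postgres.h"
#include "access/xlog.h"		/* GetRedoRecPtr(), XLByteLT() */
#include "storage/bufmgr.h"		/* SetBufferCommitInfoNeedsSave() */
#include "storage/bufpage.h"	/* Page, PageGetLSN() */

extern bool checksums_enabled;				/* hypothetical flag */
extern void log_hint_bit_fpi(Buffer buf);	/* hypothetical: WAL-log an FPI of buf
											 * and stamp the page with the new LSN */

void
SetHintBitsAndDirty(Buffer buf)		/* hypothetical: called after setting hint bits on buf's page */
{
	Page		page = BufferGetPage(buf);

	/*
	 * If the page's LSN is older than the redo pointer, no fully WAL-logged
	 * change (and hence no FPI) has covered it yet in this checkpoint
	 * cycle, so a torn write of a hint-bit-only change could leave the
	 * on-disk checksum wrong.  Emit an FPI now; because log_hint_bit_fpi()
	 * advances the page LSN past the redo pointer, this happens at most
	 * once per page per checkpoint cycle, and a later WAL-logged change in
	 * the same cycle can skip its own FPI.
	 */
	if (checksums_enabled && XLByteLT(PageGetLSN(page), GetRedoRecPtr()))
		log_hint_bit_fpi(buf);

	SetBufferCommitInfoNeedsSave(buf);	/* dirty the buffer for a hint-bit-only change */
}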
We can probably optimize the bulk-loading case, but there are going to be other cases that stink. Bulk loading aside, pgbench doesn't take much of a hit, because it basically isn't going to touch a pgbench_accounts page again until it comes back to update another row on the same page, so the overhead shouldn't be that bad. But it might be much worse on some other workload where there are lots of reads mixed in with the writes: then you've got a good chance of kicking out a bunch of extra FPIs. There are some things we could do about that - e.g. make sure to set all possible hint bits before evicting an already-dirty buffer, so that the chances of needing to dirty it again on the next visit are minimized. However, I don't think all of that is work that has to be done up front. I suspect we will need to do some mitigation of the bulk-loading case before committing even an initial version of checksums, but beyond that I think we could probably postpone most of the other optimization in this area to a future patch.

> * Double-Write buffer -- this attacks the problem most directly. Don't
> make any changes to the way hint bits are done; instead, push all page
> writes through a double-write buffer. There are numerous performance
> implications here, some of which may improve performance and some which
> may hurt performance. It's hard to say, at the end, whether this will be
> a good solution for everyone (particularly those without battery-backed
> caches), but it seems like an accepted approach that can be very good
> for the people who need performance the most.

I'm pretty skeptical of this approach. It may work out well with specialized hardware, but none of the hardware I have is going to cope with that volume of fsyncs anything like gracefully. Unless, of course, we make the buffer really big, but that hasn't seemed to be a popular approach.

> * Some way of caching CLOG information or making the access faster.
> IIRC, there were some vague ideas about mmapping() the CLOG, or caching
> a very small representation of the CLOG.

I'm interested in pursuing this in general, independent of checksums. Not sure how far I will get, but reducing CLOG contention is something I do plan to spend more time on.

> * Something else -- There are a few other lines of thought here. For
> instance, can we use WAL for hint bits without a FPI, and still protect
> against torn pages causing CRC failures? This is related to a comment
> during the 2011 developer meeting, where someone brought up the idea of
> idempotent WAL changes, and how that could help us avoid FPIs. It seems
> possible after reading the discussions, but not clear enough on the
> challenges to summarize here.

There are some possibilities here, but overall I have my doubts about backing up checksums behind another complex project with uncertain prospects.

> If we do use WAL for hint bit updates, that has an impact on Hot
> Standby, because HS can't write WAL. So, it would seem that HS could not
> set hint bits.

Yeah. The good news is that WAL-logging hint bits on the master would propagate those hint bits to the slave, which would reduce the need to set them on the slave. But there's definitely some potential for performance regression here; I'm just not sure how much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company