From: Robert Haas
Subject: Re: backup manifests
Msg-id: CA+TgmoZMy77L89GSqifqH7wzyUcfbZYzXiFD9syzAea=AkiJpg@mail.gmail.com
In response to: Re: backup manifests (David Steele <david@pgmasters.net>)
List: pgsql-hackers
On Fri, Nov 22, 2019 at 2:29 PM David Steele <david@pgmasters.net> wrote:
> See:
> https://www.nist.gov/system/files/documents/2017/04/26/lrdc_systems_part2_032713.pdf
> Search for "The maximum block size"

Hmm, so it says: "The maximum block size that can be protected by a 32-bit CRC is 512MB." My problem is that (1) it doesn't back this up with a citation or any kind of logical explanation, and (2) it's not very clear what "protected" means. Tels replies downthread to explain that the internal state of the 32-bit CRC calculation is also limited to 32 bits, and changes once per bit, so that after processing 512MB = 2^29 bytes = 2^32 bits of data, you're guaranteed to start repeating internal states. Perhaps this is also what the NIST folks had in mind, though it's hard to know. This link provides some more details:

https://community.arm.com/developer/tools-software/tools/f/keil-forum/17467/crc-for-256-byte-data

Not everyone on the thread agrees with everybody else, but it seems like there are size limits below which a CRC-n is guaranteed to detect all 1-bit and 2-bit errors, and above which this is no longer guaranteed. They put the limit *lower* than what NIST supposes, namely 2^(n-1)-1 bits, which would be 256MB, not 512MB, if I'm doing the math correctly. However, they also say that above that value you are still likely to detect most errors: absent an intelligent adversary, the chance of a random collision when corruption is present is still about 1 in 4 billion (2^-32).

To me, guaranteed detection of 1-bit and 2-bit errors (and the other kinds of specific things CRC is designed to catch) doesn't seem like a principal design consideration. It's nice if we can get it, and I'm not against it, but these are algorithms that are designed to be used when data undergoes a digital-to-analog-to-digital conversion, where for example it's possible that the conversion back to digital loses sync and reads 9 bits or 7 bits rather than 8 bits. And that's not really what we're doing here: we all know that bits get flipped sometimes, but nobody uses scp to copy a 1GB file and ends up with a file that is 1GB +/- a few bits. Some lower-level part of the communication stack is handling that part of the work; you're going to get exactly 1GB. So it seems to me that here, as with XLOG, we're not relying on the specific CRC properties that were designed to catch, and in some cases repair, bit flips caused by wrinkles in an A-to-D conversion, but just on its general tendency to probably not match if any bits got flipped. And those properties hold regardless of input length.

That being said, having done some reading on this, I am a little concerned that we're getting further and further from the design center of the CRC algorithm. Like relation segment files, XLOG records are not packets subject to bit insertions, but at least they're small, and relation files are not. Using a 40-year-old algorithm that was intended for things like making sure the modem hadn't lost framing in the last second to verify 1GB files feels, in some nebulous way, like we might be stretching. That being said, I'm not sure what we think the reasonable alternatives are. Users aren't going to be better off if we say that, because CRC-32C might not do a great job detecting errors, we're not going to check for errors at all. If we go the other way and say we're going to use some variant of SHA, they will be better off, but at the price of what looks like a *significant* hit in terms of backup time.
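For anyone who wants to sanity-check the size arithmetic above, here's a quick throwaway sketch (plain Python; the figures are the ones from this discussion, nothing straight out of the NIST document):

    n = 32                                # CRC width in bits
    state_bytes = 2**n // 8               # internal state must repeat after 2^32 bits
    print(state_bytes == 512 * 1024**2)   # True -- the 512MB NIST figure

    limit_bits = 2**(n - 1) - 1           # claimed guarantee limit for 2-bit errors
    print(limit_bits // 8)                # 268435455, i.e. just shy of 256MB

    print(2**n)                           # 4294967296 -- the "1 in 4 billion" odds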
> > "Typically an n-bit CRC applied to a data block of arbitrary length > > will detect any single error burst not longer than n bits, and the > > fraction of all longer error bursts that it will detect is (1 − > > 2^−n)." > > I'm not sure how encouraging I find this -- a four-byte error not a lot > and 2^32 is only 4 billion. We have individual users who have backed up > more than 4 billion files over the last few years. I agree that people have a lot more than 4 billion files backed up, but I'm not sure it matters very much given the use case I'm trying to enable. There's a lot of difference between delta restore and backup integrity checking. For backup integrity checking, my goal is that, on those occasions when a file gets corrupted, the chances that we notice that it has been corrupted. For that purpose, a 32-bit checksum is probably sufficient. If a file gets corrupted, we have about a 1-in-4-billion chance of being unable to detect it. If 4 billion files get corrupted, we'll miss, on average, one of those corruption events. That's sad, but so is the fact that you had *4 billion corrupted files*. This is not the total number of files backed up; this is the number of those that got corrupted. I don't really know how common it is to copy a file and end up with a corrupt copy, but if you say it's one-in-a-million, which I suspect is far too high, then you'd have to back up something like 4 quadrillion files before you missed a corruption event, and that's a *very* big number. Now delta restore is a whole different kettle of fish. The birthday problem is huge here. If you've got a 32-bit checksum for file A, and you go and look it up in a database of checksums, and that database has even 1 billion things in it, you've got a pretty decent shot of latching onto a file that is not actually the same as file A. The problem goes away almost entirely if you only compare against previous versions of that file from that database cluster. You've probably only got tens or maybe at the very outside hundreds or thousands of backups of that particular file, and a collision is unlikely even with only a 32-bit checksum -- though even there maybe you'd like to use something larger just to be on the safe side. But if you're going to compare to other files from the same cluster, or even worse any file from any cluster, 32 bits is *woefully* inadequate. TBH even using SHA for such use cases feels a little scary to me. It's probably good enough -- 2^160 for SHA-1 is a *lot* bigger than 2^32, and 2^512 for SHA-512 is enormous. But I'd want to spend time thinking very carefully about the math before designing such a system. > OK, I'll buy that. But I *don't* think CRCs should be allowed for > deltas (when we have them) and I *do* think we should caveat their > effectiveness (assuming we can agree on them). Sounds good. > In general the answer to faster backups should be more cores/faster > network/faster disk, not compromising backup integrity. I understand > we'll need to wait until we have parallelism in pg_basebackup to justify > that answer. I would like to dispute that characterization of what we're talking about here. If we added a 1-bit checksum (parity bit) it would be *strictly better* than what we're doing right now, which is nothing. 
> OK, I'll buy that. But I *don't* think CRCs should be allowed for
> deltas (when we have them) and I *do* think we should caveat their
> effectiveness (assuming we can agree on them).

Sounds good.

> In general the answer to faster backups should be more cores/faster
> network/faster disk, not compromising backup integrity. I understand
> we'll need to wait until we have parallelism in pg_basebackup to
> justify that answer.

I would like to dispute that characterization of what we're talking about here. If we added a 1-bit checksum (a parity bit), it would be *strictly better* than what we're doing right now, which is nothing. That's not a serious proposal, because it's obvious we can do a lot better for trivial additional cost, but deciding to use a weaker kind of checksum to avoid adding too much overhead is not wimping out, because it's still going to be strong enough to catch the overwhelming majority of problems that go undetected today. Even an *8-bit* checksum would give us a >99% chance of catching a corrupted file (see the quick sketch at the end of this mail), which would be noticeably better than the 0% chance we have today. Even a manifest with no checksums at all, one that just checked the presence and size of files, would catch tons of operator error, e.g.:

- wait, that database had tablespaces?
- were those logs in pg_clog anything important?
- oh, I wasn't supposed to start postgres on the copy of the database stored in the backup directory?

So I don't think we're talking about whether to compromise backup integrity. I think we're talking about this: if we're going to make backup integrity better than it is today, how much better should we try to make it, and what are the trade-offs? The straw man here is that we could make the database infinitely secure if we put it in a concrete bunker and sank it to the bottom of the ocean, with the small price that we'd no longer be able to access it either. Somewhere between that extreme and the other extreme of setting the authentication method to "trust" for 0.0.0.0/0, there's a happy medium where security is tolerably good but ease of access isn't crippled, and the same thing applies here. We could (probably) be the first database on the planet to store a 1024-bit encrypted checksum of every 8kB block, but that seems like it's going too far in the "concrete bunker" direction. IMHO, at least, we should be aiming for something that has a high probability of catching real problems and a low probability of being super-annoying.
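Coming back to the checksum-width numbers above, for concreteness, here are the detection odds for a few widths, under the simplifying assumption that a corrupted file collides uniformly at random:

    # Chance an n-bit checksum catches a random corruption event,
    # assuming a corrupted file collides uniformly at random:
    for n in (1, 8, 32):
        print(n, 1 - 2.0**-n)
    # 1  0.5                 -- even a parity bit catches half of them
    # 8  0.99609375          -- the ">99%" figure above
    # 32 0.9999999997671694  -- CRC-32-sized odds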
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company