Re: backup manifests - Mailing list pgsql-hackers

From Robert Haas
Subject Re: backup manifests
Msg-id CA+TgmoZMy77L89GSqifqH7wzyUcfbZYzXiFD9syzAea=AkiJpg@mail.gmail.com
In response to Re: backup manifests  (David Steele <david@pgmasters.net>)
List pgsql-hackers
On Fri, Nov 22, 2019 at 2:29 PM David Steele <david@pgmasters.net> wrote:
> See:
> https://www.nist.gov/system/files/documents/2017/04/26/lrdc_systems_part2_032713.pdf
> Search for "The maximum block size"

Hmm, so it says: "The maximum block size that can be protected by a
32-bit CRC is 512MB." My problem is that (1) it doesn't back this up
with a citation or any kind of logical explanation and (2) it's not
very clear what "protected" means. Tels replies downthread to explain
that the internal state of the 32-bit CRC calculation is also limited
to 32 bits, and changes once per bit, so that after processing 512MB =
2^29 bytes = 2^32 bits of data, you're guaranteed to start repeating
internal states. Perhaps this is also what the NIST folks had in mind,
though it's hard to know.
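
To sanity-check that argument, here's the arithmetic spelled out (a
throwaway Python sketch, nothing PostgreSQL-specific):

# A 32-bit CRC register can hold only 2**32 distinct states, and the
# state advances once per input bit.
block_bytes = 512 * 1024 * 1024    # 512MB
block_bits = block_bytes * 8       # 2**32 bits
crc_states = 2 ** 32               # possible values of a 32-bit register
print(block_bits == crc_states)    # True: one input bit per possible state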

This link provides some more details:

https://community.arm.com/developer/tools-software/tools/f/keil-forum/17467/crc-for-256-byte-data

Not everyone on the thread agrees with everybody else, but it seems
like there are size limits below which a CRC-n is guaranteed to detect
all 1-bit and 2-bit errors, and above which this is no longer
guaranteed. They put the limit *lower* than what NIST supposes, namely
2^(n-1)-1 bits, which would be 256MB, not 512MB, if I'm doing math
correctly. However, they also say that above that value, you are still
likely to detect most errors. Absent an intelligent adversary, the
chance of a random collision when corruption is present is still about
1 in 4 billion (2^-32).
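
Spelling out those numbers (again just a quick Python sketch; the
2^(n-1)-1 limit is taken from that thread, not something I've verified
independently):

# Claimed guaranteed-detection limit for an n-bit CRC: 2**(n-1) - 1 bits.
n = 32
limit_bits = 2 ** (n - 1) - 1
print(limit_bits // 8 / (1024 * 1024))   # ~256 MB, not 512 MB

# Past that limit, a random corruption still goes undetected only with
# probability about 2**-32:
print(2.0 ** -32)                        # ~2.3e-10, i.e. ~1 in 4 billion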

To me, guaranteed detection of 1-bit and 2-bit errors (and the other
kinds of specific things CRC is designed to catch) doesn't seem like a
principal design consideration. It's nice if we can get it and I'm not
against it, but these are algorithms that are designed to be used when
data undergoes a digital-to-analog-to-digital conversion, where for
example it's possible that the conversion back to digital loses
sync and reads 9 bits or 7 bits rather than 8 bits. And that's not
really what we're doing here: we all know that bits get flipped
sometimes, but nobody uses scp to copy a 1GB file and ends up with a
file that is 1GB +/- a few bits. Some lower-level part of the
communication stack is handling that part of the work; you're going to
get exactly 1GB. So it seems to me that here, as with XLOG, we're not
relying on the specific CRC properties that were intended to be used
to catch and in some cases repair bit flips caused by wrinkles in an
A-to-D conversion, but just on its general tendency to probably not
match if any bits got flipped. And those properties hold regardless of
input length.
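
Just to illustrate that "probably won't match" property concretely,
here's a toy example -- note it uses Python's zlib.crc32, which is plain
CRC-32 rather than the CRC-32C we actually use, but the behavior being
illustrated is the same:

import os
import zlib

data = bytearray(os.urandom(1024 * 1024))   # stand-in for file contents
before = zlib.crc32(bytes(data))
data[123456] ^= 0x01                        # flip one bit somewhere
after = zlib.crc32(bytes(data))
print(before != after)                      # True: the flip changes the CRC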

That being said, having done some reading on this, I am a little
concerned that we're getting further and further from the design
center of the CRC algorithm. Like relation segment files, XLOG records
are not packets subject to bit insertions, but at least they're small,
and relation files are not. Using a 40-year-old algorithm that was
intended to be used for things like making sure the modem hadn't lost
framing in the last second to verify 1GB files feels, in some nebulous
way, like we might be stretching. That said, I'm not sure what
we think the reasonable alternatives are. Users aren't going to be
better off if we say that, because CRC-32C might not do a great job
detecting errors, we're not going to check for errors at all. If we go
the other way and say we're going to use some variant of SHA, they
will be better off, but at the price of what looks like a
*significant* hit in terms of backup time.
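
For a rough sense of that cost difference, here's a crude timing sketch
-- zlib.crc32 standing in for CRC-32C again, and the absolute numbers
will vary wildly with hardware and build options, so treat it as
illustrative only:

import hashlib
import os
import time
import zlib

payload = os.urandom(64 * 1024 * 1024)      # 64MB sample

def timed(label, fn):
    start = time.perf_counter()
    fn(payload)
    print(label, time.perf_counter() - start, "seconds")

timed("crc32  ", zlib.crc32)
timed("sha1   ", lambda d: hashlib.sha1(d).digest())
timed("sha512 ", lambda d: hashlib.sha512(d).digest())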

> > "Typically an n-bit CRC applied to a data block of arbitrary length
> > will detect any single error burst not longer than n bits, and the
> > fraction of all longer error bursts that it will detect is (1 −
> > 2^−n)."
>
> I'm not sure how encouraging I find this -- a four-byte error is not a lot
> and 2^32 is only 4 billion.  We have individual users who have backed up
> more than 4 billion files over the last few years.

I agree that people have a lot more than 4 billion files backed up,
but I'm not sure it matters very much given the use case I'm trying to
enable. There's a lot of difference between delta restore and backup
integrity checking. For backup integrity checking, my goal is that, on
those occasions when a file gets corrupted, the chances that we notice
that it has been corrupted are high. For that purpose, a 32-bit checksum is
probably sufficient. If a file gets corrupted, we have about a
1-in-4-billion chance of being unable to detect it. If 4 billion files
get corrupted, we'll miss, on average, one of those corruption events.
That's sad, but so is the fact that you had *4 billion corrupted
files*. This is not the total number of files backed up; this is the
number of those that got corrupted. I don't really know how common it
is to copy a file and end up with a corrupt copy, but if you say it's
one-in-a-million, which I suspect is far too high, then you'd have to
back up something like 4 quadrillion files before you missed a
corruption event, and that's a *very* big number.
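
Here's that arithmetic, for anyone who wants to check it (the
one-in-a-million corruption rate is the made-up number from the previous
sentence, not a measurement):

miss_probability = 2.0 ** -32    # chance a 32-bit checksum misses a corrupted file
corruption_rate = 1e-6           # assumed chance that any given copy is corrupted
files_per_missed_corruption = 1 / (corruption_rate * miss_probability)
print(files_per_missed_corruption)   # ~4.3e15, about 4 quadrillion files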

Now delta restore is a whole different kettle of fish. The birthday
problem is huge here. If you've got a 32-bit checksum for file A, and
you go and look it up in a database of checksums, and that database
has even 1 billion things in it, you've got a pretty decent shot at
latching onto a file that is not actually the same as file A. The
problem goes away almost entirely if you only compare against previous
versions of that file from that database cluster. You've probably only
got tens or maybe at the very outside hundreds or thousands of backups
of that particular file, and a collision is unlikely even with only a
32-bit checksum -- though even there maybe you'd like to use something
larger just to be on the safe side. But if you're going to compare to
other files from the same cluster, or even worse any file from any
cluster, 32 bits is *woefully* inadequate. TBH even using SHA for such
use cases feels a little scary to me. It's probably good enough --
2^160 for SHA-1 is a *lot* bigger than 2^32, and 2^512 for SHA-512 is
enormous. But I'd want to spend time thinking very carefully about the
math before designing such a system.
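
To put rough numbers on the lookup scenario: the chance that file A's
32-bit checksum matches at least one of k unrelated files is about
1 - exp(-k / 2^32), so (quick sketch):

import math

def false_match_probability(k, n_bits=32):
    return 1.0 - math.exp(-k / 2.0 ** n_bits)

print(false_match_probability(1_000_000_000))   # ~0.21: a pretty decent shot
print(false_match_probability(1_000))           # ~2.3e-07: unlikely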

> OK, I'll buy that.  But I *don't* think CRCs should be allowed for
> deltas (when we have them) and I *do* think we should caveat their
> effectiveness (assuming we can agree on them).

Sounds good.

> In general the answer to faster backups should be more cores/faster
> network/faster disk, not compromising backup integrity.  I understand
> we'll need to wait until we have parallelism in pg_basebackup to justify
> that answer.

I would like to dispute that characterization of what we're talking
about here. If we added a 1-bit checksum (parity bit) it would be
*strictly better* than what we're doing right now, which is nothing.
That's not a serious proposal because it's obvious we can do a lot
better for trivial additional cost, but deciding that we're going to
use a weaker kind of checksum to avoid adding too much overhead is not
wimping out, because it's still going to be strong enough to catch the
overwhelming majority of problems that go undetected today. Even an
*8-bit* checksum would give us a >99% chance of catching a corrupted
file (see the quick arithmetic after the list below), which would be
noticeably better than the 0% chance we have
today. Even a manifest with no checksums at all that just checked the
presence and size of files would catch tons of operator error, e.g.

- wait, that database had tablespaces?
- were those logs in pg_clog anything important?
- oh, I wasn't supposed to start postgres on the copy of the database
stored in the backup directory?
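
To put numbers on the checksum-width point above, under the simplifying
assumption that corruption effectively randomizes the checksum:

for n in (1, 8, 32):
    print(n, "bit checksum misses about 1 in", 2 ** n, "random corruptions")
# 1 bit: 1 in 2; 8 bits: 1 in 256 (>99% caught); 32 bits: 1 in ~4.3 billion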

So I don't think we're talking about whether to compromise backup
integrity. I think we're talking about - if we're going to make backup
integrity better than it is today, how much better should we try to
make it, and what are the trade-offs there? The straw man here is that
we could make the database infinitely secure if we put it in a
concrete bunker and sunk it to the bottom of the ocean, with the small
price that we'd no longer be able to access it either. Somewhere
between that extreme and the other extreme of setting the
authentication method to 0.0.0.0/0 trust there's a happy medium where
security is tolerably good but ease of access isn't crippled, and the
same thing applies here. We could (probably) be the first database on
the planet to store a 1024-bit encrypted checksum of every 8kB block,
but that seems like it's going too far in the "concrete bunker"
direction. IMHO, at least, we should be aiming for something that has
a high probability of catching real problems and a low probability of
being super-annoying.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


