
From: Robert Haas
Subject: Re: backup manifests
Msg-id: CA+TgmoZH_j_LsL9w0UuCqEP57svb=NyPKkFk_FDRqzEc+GFdUA@mail.gmail.com
In response to: Re: backup manifests (David Steele <david@pgmasters.net>)
List: pgsql-hackers
On Fri, Nov 22, 2019 at 1:10 PM David Steele <david@pgmasters.net> wrote:
> Well, the maximum amount of data that can be protected with a 32-bit CRC
> is 512MB according to all the sources I found (NIST, Wikipedia, etc).  I
> presume that's what we are talking about since I can't find any 64-bit
> CRC code in core or this patch.

Could you give a more precise citation for this? I can't find a
reference to that in the Wikipedia article off-hand and I don't know
where to look in NIST. I apologize if I'm being dense here, but I
don't see why there should be any limit on the amount of data that can
be protected. The important thing is that if the original file F is
altered to F', we hope that CHECKSUM(F) != CHECKSUM(F'). The
probability of that, assuming that the alteration is random rather
than malicious and that the checksum function is equally likely to
produce every possible output, is just 1 - 2^-${CHECKSUM_BITS},
regardless of the length of the message (except that there might be
some special cases for very short messages, which don't matter here).

My analysis seems to match
https://en.wikipedia.org/wiki/Cyclic_redundancy_check, which says:

"Typically an n-bit CRC applied to a data block of arbitrary length
will detect any single error burst not longer than n bits, and the
fraction of all longer error bursts that it will detect is (1 −
2^−n)."

Notice the phrase "a data block of arbitrary length" and the formula "1 - 2^-n".
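
Just to make that arithmetic concrete, here's a back-of-the-envelope
sketch in Python (purely illustrative, nothing to do with the patch):
the undetected fraction from the formula depends only on the checksum
width, and CRC-32 is perfectly happy to run over input of any length.

    import zlib

    # Fraction of random corruptions an n-bit checksum is expected to
    # miss, per the 1 - 2^-n detection formula above; note that the
    # length of the input never enters into it.
    for bits in (16, 32, 64):
        print(f"{bits}-bit checksum: undetected fraction ~ {2.0 ** -bits:.3e}")

    # CRC-32 is defined for input of arbitrary length; a bigger block
    # doesn't change the width of the check.
    print(hex(zlib.crc32(b"short message")))
    print(hex(zlib.crc32(bytes(100 * 1024 * 1024))))   # 100 MB of zeros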

> > Phrased more positively, if you want a cryptographic hash
> > at all, you should probably use one that isn't widely viewed as too
> > weak.
>
> Sure.  There's another advantage to picking an algorithm with lower
> collision rates, though.
>
> CRCs are fine for catching transmission errors (as caveated above) but
> not as great for comparing two files for equality.  With strong hashes
> you can confidently compare local files against the path, size, and hash
> stored in the manifest and save yourself a round-trip to the remote
> storage to grab the file if it has not changed locally.

I agree in part. I think there are two reasons why a cryptographically
strong hash is desirable for delta restore. First, since the checksums
are longer, the probability of a false match happening randomly is
lower, which is important. Even if the above analysis is correct and
the chance of a false match is just 2^-32 with a 32-bit CRC, if you
back up ten million files every day, you'll likely get a false match
within a few years or less, and once is too often. Second, unlike the
transmission errors I supposed above, which probably are more or less
random, the contents of a PostgreSQL data file are not chosen at
random. It seems somewhat possible that there is an adversary who is
trying to choose the data that gets stored in some particular record
so as to create a false checksum match. A CRC is a lot easier to fool
than a cryptographic hash, so I think that using a CRC of *any* length
for this kind of use case would be extremely dangerous no matter the
probability of an accidental match.
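
FWIW, the "few years or less" figure is easy to check. Here's the same
sort of illustrative sketch (assumed numbers, not measurements): ten
million comparisons a day against a 2^-32 false-match probability
gives an expected first false match in a bit over a year, while a
160-bit hash pushes that out beyond any timescale we care about.

    # Expected days until the first accidental checksum match, assuming
    # files_per_day comparisons per day and a uniform false-match
    # probability of 2^-bits per comparison (illustrative numbers only).
    files_per_day = 10_000_000

    for bits in (32, 160, 256):   # CRC-32, SHA-1, SHA-256 widths
        p = 2.0 ** -bits
        expected_days = 1.0 / (p * files_per_day)
        print(f"{bits:3d} bits: ~{expected_days:.3g} days to first false match")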

> This is the basic premise of what we call delta restore which can speed
> up restores by orders of magnitude.
>
> Delta restore is the main advantage that made us decide to require SHA1
> checksums.  In most cases, restore speed is more important than backup
> speed.

I see your point, but it's not the whole story. We've encountered a
bunch of cases where the time it took to complete a backup exceeded
the user's desired backup interval, which is obviously very bad, or
even more commonly where it exceeded the length of the user's
"low-usage" period when they could tolerate the extra overhead imposed
by the backup. A few percentage points is probably not a big deal, but
a user who has an 8-hour window to get the backup done overnight will
not be happy if it's taking 6 hours now and we tack 40%-50% on to
that. So I think that we either have to disable backup checksums by
default, or figure out a way to get the overhead down to something a
lot smaller than what current tests are showing -- which we could
possibly do without changing the algorithm if we can somehow make it a
lot cheaper, but otherwise I think the choice is between disabling the
functionality altogether by default and adopting a less-expensive
algorithm. Maybe someday when delta restore is in core and widely used
and CPUs are faster, it'll make sense to revise the default, and
that's cool, but I can't see imposing a big overhead by default to
enable a feature core doesn't have yet...
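
(To make the window arithmetic explicit: a backup that takes 6 hours
today becomes roughly 6 x 1.4 = 8.4 to 6 x 1.5 = 9 hours with a
40%-50% overhead, which doesn't fit in an 8-hour window at all.)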

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


