better page-level checksums - Mailing list pgsql-hackers

From Robert Haas
Subject better page-level checksums
Date
Msg-id CA+TgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA@mail.gmail.com
In response to Re: storing an explicit nonce  (Stephen Frost <sfrost@snowman.net>)
Responses Re: better page-level checksums
List pgsql-hackers
On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost <sfrost@snowman.net> wrote:
> Alternatively, we could use
> that technique to just provide a better per-page checksum than what we
> have today.  Maybe we could figure out how to leverage that to move to
> 64bit transaction IDs with some page-level epoch.

I'm interested in assessing the feasibility of a "better page-level
checksums" feature. I have a few questions, and a few observations.
One of my questions is what algorithm(s) we'd want to support. I did a
quick Google search and found that btrfs supports CRC-32C, XXHASH,
SHA256, and BLAKE2B. I don't know that we want to support that many
options (but maybe we do) and I don't think CRC-32C makes any sense
here, for two reasons. First, we've already got a 16-bit checksum, and
a 32-bit checksum doesn't seem like it's gaining enough to be worth
the implementation complexity. Second, we're probably going to have to
dole out per-page space in multiples of MAXALIGN, and that's usually
8. I think for this purpose we should limit ourselves to algorithms
whose output size is, at minimum, 64 bits, and ideally, a multiple of
64 bits. I'm sure there are plenty of options other than the ones that
btrfs uses; I mentioned them only as a way of jump-starting the
discussion. Note that SHA-256 and BLAKE2B apparently emit enormously
wide 32 BYTE checksums. That's a lot of space to consume with a
checksum, but your chances of a collision are very small indeed.
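
To make the space arithmetic concrete, here is a rough sketch (not
PostgreSQL code) of how much per-page space each candidate would consume
if space is handed out in multiples of MAXALIGN. All names prefixed
with sketch_ are hypothetical. The alignment of 8 is an assumption, and
the digest widths assume the variants btrfs uses (64-bit xxhash, 256-bit
BLAKE2b).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: MAXALIGN is 8, as on most 64-bit platforms. */
#define SKETCH_MAXALIGN 8

/* Hypothetical identifiers for the algorithms mentioned above. */
enum sketch_checksum_kind
{
    SKETCH_CRC32C,
    SKETCH_XXHASH64,
    SKETCH_SHA256,
    SKETCH_BLAKE2B_256
};

/* Raw digest width in bytes for each algorithm. */
static size_t
sketch_digest_bytes(enum sketch_checksum_kind kind)
{
    switch (kind)
    {
        case SKETCH_CRC32C:
            return 4;
        case SKETCH_XXHASH64:
            return 8;
        case SKETCH_SHA256:
            return 32;
        case SKETCH_BLAKE2B_256:
            return 32;
    }
    return 0;
}

/* Per-page space actually consumed: digest width rounded up to MAXALIGN. */
static size_t
sketch_reserved_bytes(enum sketch_checksum_kind kind)
{
    size_t n = sketch_digest_bytes(kind);

    /* The bit trick below only works for power-of-two alignments. */
    assert((SKETCH_MAXALIGN & (SKETCH_MAXALIGN - 1)) == 0);
    return (n + SKETCH_MAXALIGN - 1) & ~((size_t) (SKETCH_MAXALIGN - 1));
}
```

Under these assumptions a 4-byte CRC-32C would still cost 8 bytes per
page, the same as a 64-bit hash, which is the second reason given above
for skipping it.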

Even if we only offer one new kind of checksum, making space for a
wider checksum makes the page format variable in a way that it
currently isn't. There seem to be about 50 compile-time constants in
the source code whose values are computed based on the block size and
amount of special space in use by some particular AM (yikes!). For
example, for the heap, there's stuff like MaxHeapTuplesPerPage and
MaxHeapTupleSize. If in the future we have some clusters that are just
like the ones we have today, and other clusters where we've allowed
space for a checksum, then those constants become run-time variable.
And since they're used in some very low-level functions that are
called a lot, like PageGetHeapFreeSpace(), that seems potentially
problematic. The problem is multiplied if you also think about trying
to store an epoch on each heap page, as per Stephen's proposal above,
because now every page used by any AM whatsoever might or might not
have a checksum, and every heap page might also have or not have an
epoch XID. I think it's going to be somewhat tricky to figure out a
scheme here that avoids making things slow. Originally I was thinking
that things like MaxHeapTuplesPerPage ought to become macros or static
inline functions, but now I have what I think is a better idea: make
them into global variables and initialize them with the appropriate
values for the cluster right after we read the control file. This
doesn't solve the problem if some pages are different than others,
though, and even for the case where every page in the cluster has the
same amount of reserved space, reading a global variable is not going
to be as efficient as referring to a constant compiled right into the
code. I'm hopeful that all of this is solvable somehow, but it's
hairy, for sure.
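
A minimal sketch of the global-variable idea, assuming a single
cluster-wide amount of reserved space per page. This is not PostgreSQL
code: every sketch_/SKETCH_ name is hypothetical, and the sizes (8kB
block, 24-byte page header, 4-byte line pointer, 23-byte heap tuple
header, MAXALIGN of 8) are assumed defaults hard-coded for illustration.
The formulas are modeled on how MaxHeapTuplesPerPage and
MaxHeapTupleSize are derived today.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed default sizes, hard-coded for illustration only. */
#define SKETCH_BLCKSZ        8192
#define SKETCH_PAGE_HEADER   24     /* page header size */
#define SKETCH_ITEMID        4      /* line pointer size */
#define SKETCH_TUPLE_HEADER  23     /* heap tuple header size */
#define SKETCH_MAXALIGN(x)   (((size_t) (x) + 7) & ~(size_t) 7)

/* Formerly compile-time constants, now run-time globals. */
static size_t sketch_max_heap_tuples_per_page;
static size_t sketch_max_heap_tuple_size;

/*
 * Would be called once, right after the control file is read, with the
 * cluster-wide number of bytes reserved on every page.
 */
static void
sketch_init_page_limits(size_t reserved_bytes)
{
    size_t usable;

    /* Reserved space is assumed to be handed out in MAXALIGN multiples. */
    assert(reserved_bytes == SKETCH_MAXALIGN(reserved_bytes));

    usable = SKETCH_BLCKSZ - reserved_bytes;

    sketch_max_heap_tuples_per_page =
        (usable - SKETCH_PAGE_HEADER) /
        (SKETCH_MAXALIGN(SKETCH_TUPLE_HEADER) + SKETCH_ITEMID);

    sketch_max_heap_tuple_size =
        usable - SKETCH_MAXALIGN(SKETCH_PAGE_HEADER + SKETCH_ITEMID);
}
```

With no reserved space this gives 291 tuples per page and a maximum
tuple size of 8160 bytes; reserving 32 bytes gives 290 and 8128. Hot
callers would then load these globals instead of using an immediate
constant, which is the efficiency concern raised above.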

Another thing I realized is that we would probably end up with the
pd_checksum unused when this other feature is activated. If someone
comes up with a clever idea for how to allocate extra space without
needing things to be a multiple of MAXIMUM_ALIGNOF, they could
potentially shrink the space they need elsewhere by 2 bytes and then
use both that space and pd_checksum, but otherwise pd_checksum is
likely to be dead when an enhanced checksum feature is in use. Since
it's also dead when checksums are turned off, that's probably OK. I
suppose another possibility is to allow both to be turned on and off
independently, i.e. let someone have both a 16-bit checksum in
pd_checksum, and also a wider checksum in this other chunk of space,
but I'm not sure whether that's really a useful thing to be able to
do. (Opinions?)

I'm also a little fuzzy on what the command-line interface for
selecting this functionality would look like. The existing option to
initdb is just --data-checksums, which doesn't leave any way to say
what kind of checksums you want.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


