better page-level checksums - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | better page-level checksums |
Date | |
Msg-id | CA+TgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA@mail.gmail.com |
In response to | Re: storing an explicit nonce (Stephen Frost <sfrost@snowman.net>) |
Responses | Re: better page-level checksums |
List | pgsql-hackers |
On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost <sfrost@snowman.net> wrote:
> Alternatively, we could use
> that technique to just provide a better per-page checksum than what we
> have today. Maybe we could figure out how to leverage that to move to
> 64bit transaction IDs with some page-level epoch.

I'm interested in assessing the feasibility of a "better page-level checksums" feature. I have a few questions, and a few observations.

One of my questions is what algorithm(s) we'd want to support. I did a quick Google search and found that btrfs supports CRC-32C, XXHASH, SHA256, and BLAKE2B. I don't know that we want to support that many options (but maybe we do) and I don't think CRC-32C makes any sense here, for two reasons. First, we've already got a 16-bit checksum, and a 32-bit checksum doesn't seem like it's gaining enough to be worth the implementation complexity. Second, we're probably going to have to dole out per-page space in multiples of MAXALIGN, and that's usually 8. I think for this purpose we should limit ourselves to algorithms whose output size is, at minimum, 64 bits, and ideally, a multiple of 64 bits. I'm sure there are plenty of options other than the ones that btrfs uses; I mentioned them only as a way of jump-starting the discussion. Note that SHA-256 and BLAKE2B apparently emit enormously wide 32-byte checksums. That's a lot of space to consume with a checksum, but your chances of a collision are very small indeed.

Even if we only offer one new kind of checksum, making space for a wider checksum makes the page format variable in a way that it currently isn't. There seem to be about 50 compile-time constants in the source code whose values are computed based on the block size and amount of special space in use by some particular AM (yikes!). For example, for the heap, there's stuff like MaxHeapTuplesPerPage and MaxHeapTupleSize.
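
To make the arithmetic concrete, here is a rough standalone C sketch, not actual PostgreSQL code. It rounds a checksum's size up to the alignment quantum and shows how two heap limits would shrink once that space is reserved on every page. All `sketch_` and `SKETCH_` names are hypothetical, and the hard-coded sizes are assumptions for a default build (8kB blocks, 8-byte MAXALIGN, 24-byte page header, 4-byte line pointer, 23-byte heap tuple header).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for a default build; every name here is hypothetical. */
#define SKETCH_BLCKSZ            8192   /* block size in bytes */
#define SKETCH_MAXIMUM_ALIGNOF   8      /* usual MAXALIGN quantum */
#define SKETCH_PAGE_HEADER_SIZE  24     /* fixed page header */
#define SKETCH_LINE_POINTER_SIZE 4      /* one line pointer */
#define SKETCH_HEAP_TUPLE_HEADER 23     /* heap tuple header, unaligned */

/* Round len up to the next multiple of the alignment quantum. */
static size_t
sketch_maxalign(size_t len)
{
    return (len + SKETCH_MAXIMUM_ALIGNOF - 1)
        & ~((size_t) SKETCH_MAXIMUM_ALIGNOF - 1);
}

/* Per-page space to set aside for a checksum of digest_bytes bytes. */
static size_t
sketch_reserved_space(size_t digest_bytes)
{
    assert(digest_bytes <= SKETCH_BLCKSZ);
    return sketch_maxalign(digest_bytes);
}

/* Upper bound on heap tuples per page with `reserved` bytes set aside. */
static size_t
sketch_max_heap_tuples_per_page(size_t reserved)
{
    size_t usable = SKETCH_BLCKSZ - SKETCH_PAGE_HEADER_SIZE - reserved;
    size_t per_tuple = sketch_maxalign(SKETCH_HEAP_TUPLE_HEADER)
        + SKETCH_LINE_POINTER_SIZE;

    return usable / per_tuple;
}

/* Largest heap tuple that fits on a page with `reserved` bytes set aside. */
static size_t
sketch_max_heap_tuple_size(size_t reserved)
{
    return SKETCH_BLCKSZ
        - sketch_maxalign(SKETCH_PAGE_HEADER_SIZE + SKETCH_LINE_POINTER_SIZE)
        - reserved;
}
```

Under these assumptions a 4-byte CRC-32C would still cost 8 bytes per page, and reserving 32 bytes would lower the tuples-per-page bound from 291 to 290 and the maximum tuple size from 8160 to 8128. The `reserved` argument is exactly the part that stops being a compile-time constant.
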
If in the future we have some clusters that are just like the ones we have today, and other clusters where we've allowed space for a checksum, then those constants become run-time variable. And since they're used in some very low-level functions that are called a lot, like PageGetHeapFreeSpace(), that seems potentially problematic. The problem is multiplied if you also think about trying to store an epoch on each heap page, as per Stephen's proposal above, because now every page used by any AM whatsoever might or might not have a checksum, and every heap page might also have or not have an epoch XID.

I think it's going to be somewhat tricky to figure out a scheme here that avoids making things slow. Originally I was thinking that things like MaxHeapTuplesPerPage ought to become macros or static inline functions, but now I have what I think is a better idea: make them into global variables and initialize them with the appropriate values for the cluster right after we read the control file. This doesn't solve the problem if some pages are different than others, though, and even for the case where every page in the cluster has the same amount of reserved space, reading a global variable is not going to be as efficient as referring to a constant compiled right into the code. I'm hopeful that all of this is solvable somehow, but it's hairy, for sure.

Another thing I realized is that we would probably end up with the pd_checksum field unused when this other feature is activated. If someone comes up with a clever idea for how to allocate extra space without needing things to be a multiple of MAXIMUM_ALIGNOF, they could potentially shrink the space they need elsewhere by 2 bytes and then use both that space and pd_checksum, but otherwise pd_checksum is likely to be dead when an enhanced checksum feature is in use. Since it's also dead when checksums are turned off, that's probably OK. I suppose another possibility is to allow both to be turned on and off independently, i.e.
let someone have both a 16-bit checksum in pd_checksum, and also a wider checksum in this other chunk of space, but I'm not sure whether that's really a useful thing to be able to do. (Opinions?)

I'm also a little fuzzy on what the command-line interface for selecting this functionality would look like. The existing option to initdb is just --data-checksums, which doesn't leave any way to say what kind of checksums you want.

--
Robert Haas
EDB: http://www.enterprisedb.com