Re: better page-level checksums - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: better page-level checksums |
Date | |
Msg-id | CA+TgmoZUmvPr1XRJOZjL4L_jvkB8hWr-F=CqvG6dDHvZjR0iZg@mail.gmail.com |
In response to | Re: better page-level checksums (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: better page-level checksums
Re: better page-level checksums |
List | pgsql-hackers |
On Thu, Jun 9, 2022 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Why not? The only problems that it won't solve are all related to
> crypto. Which is perfectly fine, but it seems like there is a
> terminology issue here. ISTM that you're really talking about adding a
> cryptographic hash function, not a checksum. These are rather
> different things.

I don't think those are mutually exclusive categories. I shall cite Wikipedia: "Cryptographic hash ... can also be used as ordinary hash functions, to index data in hash tables, for fingerprinting, to detect duplicate data or uniquely identify files, and as checksums to detect accidental data corruption."[1] There is also PostgreSQL precedent in the form of the --manifest-checksums argument to pg_basebackup, whose legal values are SHA{224,256,384,512}|CRC32C|NONE. The man page for the "shasum" utility says that the purpose of the command is to "Print or Check SHA Checksums".

I'm not perfectly attached to the idea of using SHA here, but it seems to me that's pretty much the standard thing these days. Stephen Frost and David Steele pushed hard for SHA checksums in backup manifests, and actually wanted it to be the default.

I think that if you're the kind of person who looks at our existing page checksums and finds them too weak, I doubt that CRC-32C is going to make you feel any better. You're probably the sort of person who thinks that checksums should have a lot of bits, and you're probably not going to be satisfied with the properties of an algorithm invented in the 1960s. Of course if there's anyone out there who thinks that our existing 16-bit checksums are a pile of garbage but would be much happier if CRC-32C is an option, I am happy to have them show up here and say so, but I find it much more likely that people who want this kind of feature would advocate for a more modern algorithm.

> My preference is for an approach that builds on that, or at least
> doesn't significantly complicate it. So a cryptographic hash or nonce
> can go in the special area proper (structs like BTPageOpaqueData don't
> need any changes), but at a page offset before the special area proper
> -- not after.
>
> What disadvantages does that approach have, if any, from your point of view?

I think it would be an extremely good idea to store the extended checksum at the same offset in every page. Right now, code that wants to compute checksums, or a tool like pg_checksums that wants to verify them, can find the checksum without needing to interpret any of the remaining page contents. Things get sticky if you have to interpret the page contents to locate the checksum that's going to tell you whether the page contents are messed up. Perhaps this could be worked around if you tried hard enough, but I don't see what we get out of it.

I don't think that putting the checksum at the very end of every page precludes using variable-size special space in the AMs, or even complicates it much, because if there's a fixed-length block of stuff at the end of every page, you can easily account for that. There's a lot less code that cares about the space above pd_special than there is code that cares about any other portion of the page.

--
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://en.wikipedia.org/wiki/Cryptographic_hash_function
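
To make the fixed-offset argument above concrete, here is a minimal sketch in C. Everything in it is an assumption for illustration: a hypothetical 8192-byte page, an 8-byte trailer at the very end of the page, and FNV-1a used purely as a stand-in for whatever algorithm (SHA, CRC-32C, or something else) might actually be chosen. None of the names or sizes are PostgreSQL's. The point is only that a verifier can find and check the trailer without interpreting anything else on the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration; not PostgreSQL's page format. */
#define PAGE_SIZE   8192u                     /* assumed block size */
#define TRAILER_LEN 8u                        /* assumed fixed-length trailer */
#define TRAILER_OFF (PAGE_SIZE - TRAILER_LEN) /* same offset in every page */

/* Stand-in 64-bit FNV-1a hash; not cryptographic, chosen only for brevity. */
static uint64_t
fnv1a64(const uint8_t *data, size_t len)
{
    uint64_t h = UINT64_C(0xcbf29ce484222325);

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/* Hash everything before the trailer and store the result in the trailer. */
static void
page_set_checksum(uint8_t *page)
{
    uint64_t sum;

    assert(page != NULL);
    sum = fnv1a64(page, TRAILER_OFF);
    memcpy(page + TRAILER_OFF, &sum, sizeof(sum));
}

/* Verify using only the fixed offset; no page contents are interpreted. */
static bool
page_verify_checksum(const uint8_t *page)
{
    uint64_t stored;

    memcpy(&stored, page + TRAILER_OFF, sizeof(stored));
    return stored == fnv1a64(page, TRAILER_OFF);
}

/*
 * An access method with variable-size special space would treat this as
 * the end of the page, so the trailer costs it one constant subtraction.
 */
static size_t
page_usable_end(void)
{
    return TRAILER_OFF;
}
```

In this sketch a caller would run page_set_checksum() on a buffer of PAGE_SIZE bytes before writing it out and page_verify_checksum() after reading it back. Any special space would end at page_usable_end() rather than at PAGE_SIZE.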