Re: better page-level checksums - Mailing list pgsql-hackers

From Robert Haas
Subject Re: better page-level checksums
Date
Msg-id CA+TgmoZUmvPr1XRJOZjL4L_jvkB8hWr-F=CqvG6dDHvZjR0iZg@mail.gmail.com
In response to Re: better page-level checksums  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: better page-level checksums
Re: better page-level checksums
List pgsql-hackers
On Thu, Jun 9, 2022 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Why not? The only problems that it won't solve are all related to
> crypto. Which is perfectly fine, but it seems like there is a
> terminology issue here. ISTM that you're really talking about adding a
> cryptographic hash function, not a checksum. These are rather
> different things.

I don't think those are mutually exclusive categories. I shall cite
Wikipedia: "Cryptographic hash ... can also be used as ordinary hash
functions, to index data in hash tables, for fingerprinting, to detect
duplicate data or uniquely identify files, and as checksums to detect
accidental data corruption."[1] There is also PostgreSQL precedent in
the form of the --manifest-checksums argument to pg_basebackup, whose
legal values are SHA{224,256,384,512}|CRC32C|NONE. The man page for
the "shasum" utility says that the purpose of the command is to "Print
or Check SHA Checksums".

I'm not perfectly attached to the idea of using SHA here, but it seems
to me that's pretty much the standard thing these days. Stephen Frost
and David Steele pushed hard for SHA checksums in backup manifests,
and actually wanted it to be the default.

If you're the kind of person who looks at our existing page checksums
and finds them too weak, I doubt that CRC-32C is going to make you
feel any better. You're probably the sort of person who
thinks that checksums should have a lot of bits, and you're probably
not going to be satisfied with the properties of an algorithm invented
in the 1960s. Of course if there's anyone out there who thinks that
our existing 16-bit checksums are a pile of garbage but would be much
happier if CRC-32C is an option, I am happy to have them show up here
and say so, but I find it much more likely that people who want this
kind of feature would advocate for a more modern algorithm.

> My preference is for an approach that builds on that, or at least
> doesn't significantly complicate it. So a cryptographic hash or nonce
> can go in the special area proper (structs like BTPageOpaqueData don't
> need any changes), but at a page offset before the special area proper
> -- not after.
>
> What disadvantages does that approach have, if any, from your point of view?

I think it would be an extremely good idea to store the extended
checksum at the same offset in every page. Right now, code that wants
to compute checksums, or a tool like pg_checksums that wants to verify
them, can find the checksum without needing to interpret any of the
remaining page contents. Things get sticky if you have to interpret
the page contents to locate the checksum that's going to tell you
whether the page contents are messed up. Perhaps this could be worked
around if you tried hard enough, but I don't see what we get out of
it. I don't think that putting the checksum at the very end of
every page precludes using variable-size special space in the AMs, or
even complicates it much, because if there's a fixed-length block of
stuff at the end of every page, you can easily account for that.
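
To make the layout argument concrete, here is a rough sketch in C. It
is illustrative only: the SKETCH_* macros and sketch_* functions are
hypothetical names, not existing PostgreSQL code, and the 32-byte
trailer length is just an assumption (enough room for a SHA-256
digest). The point is that the trailer's offset depends only on the
block size, and an AM accounts for it by subtracting a constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for illustration; none of these names exist in PostgreSQL. */
#define SKETCH_BLCKSZ        8192   /* PostgreSQL's default block size */
#define SKETCH_EXT_CKSUM_LEN 32     /* assumed trailer size, e.g. SHA-256 */

/* Offset of the fixed-length trailer: a function of the block size only. */
static inline size_t
sketch_ext_checksum_offset(void)
{
    return SKETCH_BLCKSZ - SKETCH_EXT_CKSUM_LEN;
}

/*
 * A verifier can locate the stored checksum without interpreting the page
 * header or any other page contents.
 */
static inline const uint8_t *
sketch_ext_checksum_ptr(const uint8_t *page)
{
    return page + sketch_ext_checksum_offset();
}

/*
 * An AM with variable-size special space accounts for the trailer by
 * subtracting its fixed length: the space usable by the AM runs from
 * pd_special up to the start of the trailer.
 */
static inline size_t
sketch_am_special_size(uint16_t pd_special)
{
    assert(pd_special <= sketch_ext_checksum_offset());
    return sketch_ext_checksum_offset() - pd_special;
}
```

With these assumed sizes, a page whose pd_special is 8144 would leave
16 bytes of special space for the AM, followed by the 32-byte trailer
at offset 8160.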

There's a lot less code that cares about the space above pd_special
than there is code that cares about any other portion of the page.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://en.wikipedia.org/wiki/Cryptographic_hash_function


