Re: documenting the backup manifest file format - Mailing list pgsql-hackers

From David Steele
Subject Re: documenting the backup manifest file format
Date
Msg-id 3d49cad1-43f3-4c8a-1676-fb552892e629@pgmasters.net
Whole thread Raw
In response to Re: documenting the backup manifest file format  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 4/13/20 4:14 PM, Robert Haas wrote:
> On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
>> Also, I
>> see no mention of prettification-chars such as newlines or indentation.
>> I suppose if I pass a manifest file through prettification (or Windows
>> newline conversion), the checksum may break.
> 
> It would indeed break. I'm not sure what you want me to say here,
> though. If you're trying to parse a manifest, you shouldn't care about
> how the whitespace is arranged. If you're trying to generate one, you
> can arrange it any way you like, as long as you also include it in the
> checksum.

pgBackRest ignores whitespace but this is a legacy of the way Perl 
calculated checksums, not an intentional feature. This worked well when 
the manifest was loaded as a whole, converted to JSON, and checksummed, 
but it is a major pain for the streaming code we now have in C.

I guarantee that that our next manifest version will do a simple 
checksum of bytes as Robert has done in this feature.

So, I'm +1 as implemented.

>> Why is the top-level checksum only allowed to be SHA-256, if the files
>> can use up to SHA-512?

<snip>

> I agree that it's a little bit weird that you can have a stronger
> checksum for the files instead of the manifest itself, but I also
> wonder what the use case would be for using a stronger checksum on the
> manifest. David Steele argued that strong checksums on the files could
> be useful to software that wants to rifle through all the backups
> you've ever taken and find another copy of that file by looking for
> something with a matching checksum. CRC-32C wouldn't be strong enough
> for that, because eventually you could have enough files that you
> start to have collisions. The SHA algorithms output enough bits to
> make that quite unlikely. But this argument only makes sense for the
> files, not the manifest.

Agreed. I think SHA-256 is *more* than enough to protect the manifest 
against corruption. That said, since the cost of SHA-256 vs. SHA-512 in 
the context on the manifest is negligible we could just use the stronger 
algorithm to deflect a similar question going forward.

That choice might not age well, but we could always say, well, we picked 
it because it was the strongest available at the time. Allowing a choice 
of which algorithm to use for to manifest checksum seems like it will 
just make verifying the file harder with no tangible benefit.

Maybe just a comment in the docs about why SHA-256 was used would be fine.

>> (Also, did we intentionally omit the dash in
>> hash names, so "SHA-256" to make it SHA256?  This will also be critical
>> for checksumming the manifest itself.)
> 
> I debated this with myself, settled on this spelling, and nobody
> complained until now. It could be changed, though. I didn't have any
> particular reason for choosing it except the feeling that people would
> probably prefer to type --manifest-checksum=sha256 rather than
> --manifest-checksum=sha-256.

+1 for sha256 rather than sha-256.

Regards,
-- 
-David
david@pgmasters.net



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Poll: are people okay with function/operator table redesign?
Next
From: Tom Lane
Date:
Subject: Re: Poll: are people okay with function/operator table redesign?