Re: backup manifests - Mailing list pgsql-hackers

From Robert Haas
Subject Re: backup manifests
Date
Msg-id CA+TgmoaBh6TEY3Re3rA75WaEd+XXRzmC8=UYEU+2JiryKNFUsQ@mail.gmail.com
Whole thread Raw
In response to Re: backup manifests  (Tels <nospam-pg-abuse@bloodgate.com>)
Responses Re: backup manifests  (Rushabh Lathia <rushabh.lathia@gmail.com>)
List pgsql-hackers
On Fri, Nov 22, 2019 at 5:15 PM Tels <nospam-pg-abuse@bloodgate.com> wrote:
> It is related to the number of states...

Thanks for this explanation. See my reply to David where I also
discuss this point.

> However, if you choose a hash, please do not go below SHA-256. Both MD5
> and SHA-1 already had collision attacks, and these only got to be bound
> to be worse.
>
>    https://www.mscs.dal.ca/~selinger/md5collision/
>    https://shattered.io/

Yikes, that second link, about SHA-1, is depressing. Now, it's not
likely that an attacker has access to your backup repository and can
spend 6500 years of CPU time to engineer a Trojan file there (maybe
more, because the files are probably bigger than the PDFs they used in
that case) and then induce you to restore and rely upon that backup.
However, it's entirely likely that somebody is going to eventually ban
SHA-1 as the attacks get better, which is going to be a problem for us
whether the underlying exposures are problems or not.

> It might even be a wise idea to encode the used Hash-Algorithm into the
> manifest file, so it can be changed later. The hash length might be not
> enough to decide which algorithm is the one used.

I agree. Let's write
SHA256:bc1c3a57369acd0d2183a927fb2e07acbbb1c97f317bbc3b39d93ec65b754af5
or similar rather than just the hash. That way even if the entire SHA
family gets cracked, we can easily substitute in something else that
hasn't been cracked yet.

(It is unclear to me why anyone supposes that *any* popular hash
function won't eventually be cracked. For a K-bit hash function, there
are 2^K possible outputs, where K is probably in the hundreds. But
there are 2^{2^33} possible 1GB files. So for every possible output
value, there are 2^{2^33-K} inputs that produce that value, which is a
very very big number. The probability that any given input produces a
certain output is very low, but the number of possible inputs that
produce a given output is very high; so assuming that nobody's ever
going to figure out how to construct them seems optimistic.)

> To get a feeling one can use:
>
>     openssl speed md5 sha1 sha256 sha512
>
> On my really-not-fast desktop CPU (i5-4690T CPU @ 2.50GHz) it says:
>
>   The 'numbers' are in 1000s of bytes per second processed.
>    type       16 bytes     64 bytes    256 bytes   1024 bytes   8192
> bytes  16384 bytes
>    md5       122638.55k   277023.96k   487725.57k   630806.19k
> 683892.74k   688553.98k
>    sha1      127226.45k   313891.52k   632510.55k   865753.43k
> 960995.33k   977215.19k
>    sha256     77611.02k   173368.15k   325460.99k   412633.43k
> 447022.92k   448020.48k
>    sha512     51164.77k   205189.87k   361345.79k   543883.26k
> 638372.52k   645933.74k
>
> Or in other words, it can hash nearly 931 MByte /s with SHA-1 and about
> 427 MByte / s with SHA-256 (if I haven't miscalculated something). You'd
> need a
> pretty fast disk (aka M.2 SSD) and network (aka > 1 Gbit) to top these
> speeds
> and then you'd use a real CPU for your server, not some poor Intel
> powersaving
> surfing thingy-majingy :)

I mean, how fast is in theory doesn't matter nearly as much as what
happens when you benchmark the proposed implementation, and the
results we have so far don't support the theory that this is so cheap
as to be negligible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: backup manifests
Next
From: vignesh C
Date:
Subject: Re: dropdb --force