Thread: data checksums

data checksums

From
bruno vieira da silva
Date:
Hello.
I've been dealing with some database corruption events, so I've been wondering whether to enable data checksums on my deployments.

So my question is: why aren't data checksums enabled by default in PostgreSQL? The docs
mention a considerable performance penalty; how considerable is it?

Thanks
--
Bruno Vieira da Silva

Re: data checksums

From
Ron Johnson
Date:
On Tue, Aug 6, 2024 at 11:12 AM bruno vieira da silva <brunogiovs@gmail.com> wrote:
Hello.
I've been dealing with some database corruption events, so I've been wondering whether to enable data checksums on my deployments.

So my question is: why aren't data checksums enabled by default in PostgreSQL? The docs
mention a considerable performance penalty; how considerable is it?
 
I'm going to hazard a guess that the relevant line from the docs was written many, many years ago, back when servers were much slower.

We certainly didn't notice any speed degradation when going from PG 9.6.24 to PG 14.12.  Or... we just use enough hardware and flash-only SAN disks to make the overhead minuscule.

--
Death to America, and butter sauce.
Iraq lobster!

Re: data checksums

From
Christophe Pettus
Date:

> On Aug 6, 2024, at 08:11, bruno vieira da silva <brunogiovs@gmail.com> wrote:
>
> So my question is: why aren't data checksums enabled by default in PostgreSQL?

At this point, mostly historical reasons.  They're also superfluous if your underlying file system or storage hardware
does storage-level corruption checks (which most don't).

> The docs
> mention a considerable performance penalty; how considerable is it?

That line is probably somewhat out of date at this point.  We haven't seen a significant slowdown in enabling them on
any modern hardware.  I always turn them on, except on the type of filesystems/hardware mentioned above.
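
For anyone who wants to check or turn them on, the moving parts look roughly like this (pg_checksums needs PG 12+ and a cleanly shut-down cluster; the data directory path is just a placeholder):

    # see whether the running cluster already has checksums enabled
    psql -c "SHOW data_checksums;"

    # enable them at cluster creation time
    initdb --data-checksums -D /path/to/data

    # or enable them offline on an existing cluster (PG 12+), after a clean shutdown
    pg_checksums --enable -D /path/to/data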


Re: data checksums

From
Greg Sabino Mullane
Date:
The penalty is not "considerable", but it is not completely trivial either; it leans toward the trivial side. Part of the problem is that it is hard to measure, as it is very workload-dependent. As to why it is not the default: Postgres is careful and conservative by default, and not everybody was convinced that enabling checksums is worth the tradeoff, especially (IIRC) given the lack of real-world examples of people discovering issues thanks to these checksums. So yes, please enable them, and share with us if the checksums catch anything.
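
A quick way to see whether they ever do catch anything is the per-database counters (PG 12+; they only move once checksums are enabled):

    # any non-zero checksum_failures is worth investigating
    psql -c "SELECT datname, checksum_failures, checksum_last_failure
             FROM pg_stat_database
             WHERE checksum_failures > 0;"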

I think the argument is worth re-opening, because (as far as I know) all of the companies providing Postgres support, whether fully cloud-managed, setting up a complex HA cluster, or just providing tuning advice, have enabled checksums by default for many, many years. So a big +1 from me to finally making it the default. It's a de facto default at this point anyway.

Cheers,
Greg

Re: data checksums

From
Laurenz Albe
Date:
On Tue, 2024-08-06 at 09:29 -0700, Christophe Pettus wrote:
>
> > On Aug 6, 2024, at 08:11, bruno vieira da silva <brunogiovs@gmail.com> wrote:
> >
> > So my question is: why aren't data checksums enabled by default in PostgreSQL?
>
> At this point, mostly historical reasons.  They're also superfluous if your underlying
> file system or storage hardware does storage-level corruption checks (which most don't).

I am surprised by that.  Would you say that most storage systems will happily give you a
garbage block if there was a hardware problem somewhere?

> > The docs
> > mention a considerable performance penalty; how considerable is it?
>
> That line is probably somewhat out of date at this point.  We haven't seen a significant
> slowdown in enabling them on any modern hardware.  I always turn them on, except on the
> type of filesystems/hardware mentioned above.

Turning data checksums on will write WAL for hint bits, which can significantly increase
the amount of WAL written.
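
A rough way to quantify that on a given workload is to compare WAL positions before and after the same run, once with checksums and once without (the LSN values below are placeholders):

    # record the WAL position before and after the workload
    psql -c "SELECT pg_current_wal_lsn();"    # before, e.g. 0/3000148
    # ... run the workload ...
    psql -c "SELECT pg_current_wal_lsn();"    # after,  e.g. 0/A123450

    # bytes of WAL generated between the two points
    psql -c "SELECT pg_wal_lsn_diff('0/A123450', '0/3000148');"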

Yours,
Laurenz Albe



Re: data checksums

From
Christophe Pettus
Date:

> On Aug 6, 2024, at 19:45, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> I am surprised by that.  Would you say that most storage systems will happily give you a
> garbage block if there was a hardware problem somewhere?

"Most" is hard for me to judge.  HDDs can have uncorrected and undetected errors, definitely.  ZFS, for example, can
correctthose (within limits); XFS doesn't try.  I have been told that SSDs can have uncorrected/undetected errors as
well,but I don't know details on that. 
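
For what it's worth, ZFS exposes that kind of checking directly ("tank" below is just a placeholder pool name):

    # scrub re-reads every block and verifies its checksum
    zpool scrub tank
    # the CKSUM column reports detected (and, given redundancy, corrected) errors
    zpool status -v tank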

> Turning data checksums on will write WAL for hint bits, which can significantly increase
> the amount of WAL written.

I was curious about that, so I just did a quick experiment using pgbench, with identical databases except for
checksums.  They both generated the same amount of WAL within 10% or so, so I don't think the impact is huge.  (And you
need the hint bits for pg_rewind, which is a nice thing to have in your back pocket if required.)
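
For anyone who wants to repeat that comparison, the shape of it is roughly as follows; scale, client count, and duration here are arbitrary rather than the settings actually used. Run it against two otherwise identical clusters, one initdb'ed with --data-checksums and one without, and compare the WAL positions:

    createdb bench
    pgbench -i -s 100 bench
    psql -d bench -c "SELECT pg_current_wal_lsn();"   # before
    pgbench -c 8 -j 4 -T 600 bench
    psql -d bench -c "SELECT pg_current_wal_lsn();"   # after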


Re: data checksums

From
Daniel Gustafsson
Date:
> On 6 Aug 2024, at 18:29, Christophe Pettus <xof@thebuild.com> wrote:
>> On Aug 6, 2024, at 08:11, bruno vieira da silva <brunogiovs@gmail.com> wrote:

>> The docs
>> mention a considerable performance penalty; how considerable is it?
>
> That line is probably somewhat out of date at this point.  We haven't seen a significant slowdown in enabling them on
> any modern hardware.  I always turn them on, except on the type of filesystems/hardware mentioned above.

The last in-depth analysis of data checksums (and hint bits) overhead that I
can remember is from 2019:

  https://www.postgresql.org/message-id/20190330192543.GH4719%40development

Hardware advances in the last five years may very well have made these findings
irrelevant, however.

--
Daniel Gustafsson




Re: data checksums

From
Ron Johnson
Date:
On Wed, Aug 7, 2024 at 3:41 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 6 Aug 2024, at 18:29, Christophe Pettus <xof@thebuild.com> wrote:
>> On Aug 6, 2024, at 08:11, bruno vieira da silva <brunogiovs@gmail.com> wrote:

>> The docs
>> mention a considerable performance penalty; how considerable is it?
>
> That line is probably somewhat out of date at this point.  We haven't seen a significant slowdown in enabling them on any modern hardware.  I always turn them on, except on the type of filesystems/hardware mentioned above.

The last in-depth analysis of data checksums (and hint bits) overhead that I
can remember is from 2019:

  https://www.postgresql.org/message-id/20190330192543.GH4719%40development

A quote from that post:
"I have not investigated the exact reasons, but my hypothesis it's about
the amount of WAL generated during the initial CREATE INDEX (because it
probably ends up setting the hint bits), which puts additional pressure
on the storage."
 
Presuming that hypothesis is true: how often do "you" run CREATE INDEX (or VACUUM FULL or CLUSTER)?  I certainly don't run them very often.

--
Death to America, and butter sauce!
Iraq lobster...