Thread: Enable data checksums by default
Lately, PostgreSQL has moved many defaults from the "bare minimum" side towards the "user friendly by default" side, e.g. hot_standby & replication in the default configuration, parallelism, and generally higher defaults for resource knobs like *_mem, autovacuum_* and so on.

I think the next step in that direction would be to enable data checksums by default. They make sense in most setups, and people who plan to run very performance-critical systems where checksums might be too much overhead need to tune many knobs anyway, and can just as well disable them manually, instead of everyone else having to enable them manually. Also, disabling is much easier than enabling.

One argument against checksums used to be that we lacked tools to fix problems with them, but ignore_checksum_failure and the pg_checksums tool address that.

The attached patch flips the default in initdb. It also adds a new option -K, --no-data-checksums that wasn't present previously. The docs are updated to state the new default, and the testsuite exercises the -K option.

Christoph
Attachment
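To make the described initdb change concrete, here is a small standalone sketch of how such option handling could look. This is illustrative only, not the attached patch; the treatment of the existing -k switch and the variable names are assumptions based on the mail above.

    #include <getopt.h>
    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Sketch of the proposed initdb behaviour: data checksums default to
     * "on", the existing -k/--data-checksums switch becomes a no-op, and
     * a new -K/--no-data-checksums switch turns them off again.
     */
    static bool data_checksums = true;	/* proposed new default */

    int
    main(int argc, char **argv)
    {
    	static struct option long_options[] = {
    		{"data-checksums", no_argument, NULL, 'k'},
    		{"no-data-checksums", no_argument, NULL, 'K'},
    		{NULL, 0, NULL, 0}
    	};
    	int			c;

    	while ((c = getopt_long(argc, argv, "kK", long_options, NULL)) != -1)
    	{
    		switch (c)
    		{
    			case 'k':			/* kept for compatibility, now redundant */
    				data_checksums = true;
    				break;
    			case 'K':			/* new: opt out of the default */
    				data_checksums = false;
    				break;
    			default:
    				fprintf(stderr, "unknown option\n");
    				return 1;
    		}
    	}

    	printf("data checksums: %s\n", data_checksums ? "enabled" : "disabled");
    	return 0;
    }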
Christoph Berg <myon@debian.org> writes:
> I think the next step in that direction would be to enable data
> checksums by default. They make sense in most setups,

Well, that is exactly the point that needs some proof, not just an unfounded assertion.

IMO, the main value of checksums is that they allow the Postgres project to deflect blame. That's nice for us but I'm not sure that it's a benefit for users. I've seen little if any data to suggest that checksums actually catch enough problems to justify the extra CPU costs and the risk of false positives.

			regards, tom lane
Hi,

On 2019-03-22 12:07:22 -0400, Tom Lane wrote:
> Christoph Berg <myon@debian.org> writes:
> > I think the next step in that direction would be to enable data
> > checksums by default. They make sense in most setups,
>
> Well, that is exactly the point that needs some proof, not just
> an unfounded assertion.
>
> IMO, the main value of checksums is that they allow the Postgres
> project to deflect blame. That's nice for us but I'm not sure
> that it's a benefit for users. I've seen little if any data to
> suggest that checksums actually catch enough problems to justify
> the extra CPU costs and the risk of false positives.

IDK, being able to verify in some form that backups aren't corrupted on an IO level is mighty nice. That often allows one to detect the issue while one still has older backups around.

My problem is more that I'm not confident the checks are mature enough. The basebackup checks are atm not able to detect random data, and neither the basebackup nor the backend checks detect zeroed-out files/file ranges.

Greetings,

Andres Freund
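To illustrate why zeroed-out files or file ranges slip past the current checks, here is a self-contained sketch of the general shape of such a page validity test. The checksum routine and the header offsets below are simplified stand-ins, not PostgreSQL's pg_checksum_page or real page layout; the point is only that an all-zero block is treated as a legitimately new page and never fails verification.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Toy checksum, a stand-in only, NOT pg_checksum_page(). */
    static uint16_t
    toy_checksum(const uint8_t *page, uint32_t blkno)
    {
    	uint32_t	h = 0x811c9dc5 ^ blkno;

    	for (size_t i = 0; i < BLCKSZ; i++)
    	{
    		if (i == 8 || i == 9)	/* skip the bytes holding the checksum */
    			continue;
    		h = (h ^ page[i]) * 16777619u;
    	}
    	return (uint16_t) (h ^ (h >> 16));
    }

    /*
     * Sketch of the verification behaviour being discussed: a fully zeroed
     * block is accepted as a "new" page without any checksum comparison,
     * which is why zeroed-out files or ranges go undetected.
     */
    static bool
    page_looks_valid(const uint8_t *page, uint32_t blkno)
    {
    	bool		all_zeroes = true;

    	for (size_t i = 0; i < BLCKSZ; i++)
    	{
    		if (page[i] != 0)
    		{
    			all_zeroes = false;
    			break;
    		}
    	}
    	if (all_zeroes)
    		return true;			/* "new" page: nothing to verify */

    	uint16_t	stored;			/* checksum stored at bytes 8-9 here */

    	memcpy(&stored, page + 8, sizeof(stored));
    	return stored == toy_checksum(page, blkno);
    }

    int
    main(void)
    {
    	uint8_t		zeroed[BLCKSZ] = {0};

    	/* a fully zeroed block sails through the check */
    	printf("zeroed block accepted: %s\n",
    		   page_looks_valid(zeroed, 0) ? "yes" : "no");
    	return 0;
    }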
On 3/22/19 5:10 PM, Andres Freund wrote:
> Hi,
>
> On 2019-03-22 12:07:22 -0400, Tom Lane wrote:
>> Christoph Berg <myon@debian.org> writes:
>>> I think the next step in that direction would be to enable data
>>> checksums by default. They make sense in most setups,
>>
>> Well, that is exactly the point that needs some proof, not just
>> an unfounded assertion.
>>
>> IMO, the main value of checksums is that they allow the Postgres
>> project to deflect blame. That's nice for us but I'm not sure
>> that it's a benefit for users. I've seen little if any data to
>> suggest that checksums actually catch enough problems to justify
>> the extra CPU costs and the risk of false positives.
>

I'm not sure about checksums being an effective tool to deflect blame. Consider the recent fsync retry issues: due to the assumption that we could just retry fsync, we might have lost some of the writes, resulting in torn pages and checksum failures. I'm sure we could argue about how much sense the fsync behavior makes, but I doubt checksum failures are enough to deflect blame here.

> IDK, being able to verify in some form that backups aren't corrupted on
> an IO level is mighty nice. That often allows one to detect the issue
> while one still has older backups around.
>

Yeah, I agree that's a valuable capability. I think the question is how effective it actually is, considering how much storage has changed over the past few years (which necessarily affects the type of failures people have to deal with).

> My problem is more that I'm not confident the checks are mature
> enough. The basebackup checks are atm not able to detect random data,
> and neither the basebackup nor the backend checks detect zeroed-out
> files/file ranges.
>

Yep :-(

The pg_basebackup vulnerability to random garbage in a page header is unfortunate; we'd better improve that. It's not clear to me what checksums can do about zeroed pages (and/or truncated files), though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
> On 3/22/19 5:10 PM, Andres Freund wrote:
> > IDK, being able to verify in some form that backups aren't corrupted on
> > an IO level is mighty nice. That often allows one to detect the issue
> > while one still has older backups around.
> >
>
> Yeah, I agree that's a valuable capability. I think the question is how
> effective it actually is, considering how much storage has changed over
> the past few years (which necessarily affects the type of failures
> people have to deal with).

I'm not sure I understand? How do the changes around storage meaningfully affect the need to have some trust in backups and to benefit from earlier detection?

> It's not clear to me what checksums can do about zeroed pages (and/or
> truncated files), though.

Well, there's nothing fundamental about needing added pages to be zeroes. We could extend files with pages initialized with actual valid checksums, instead of

	/* new buffers are zero-filled */
	MemSet((char *) bufBlock, 0, BLCKSZ);
	/* don't set checksum for all-zero page */
	smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);

The problem is that it's hard to do so safely without adding a lot of additional WAL logging. A lot of filesystems will journal metadata changes (like the size of the file), but not contents. So after a crash the tail end might appear zeroed out, even if we never wrote zeroes. That's obviously solvable by WAL logging, but that's not cheap.

It might still be a good idea to just write a page with an initialized header / checksum at that point, as that ought to still detect a number of problems we can't detect right now.

Greetings,

Andres Freund
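A standalone sketch of the alternative described here, extending a file with an initialized, checksummed page instead of zeroes. pwrite() stands in for smgrextend(), the header bytes and the checksum are simplified placeholders, and, as noted above, a real implementation would still need extra WAL logging to survive the crash case.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Toy checksum, a stand-in only, NOT pg_checksum_page(). */
    static uint16_t
    toy_checksum(const uint8_t *page, uint32_t blkno)
    {
    	uint32_t	h = 0x811c9dc5 ^ blkno;

    	for (size_t i = 0; i < BLCKSZ; i++)
    	{
    		if (i == 8 || i == 9)	/* skip the checksum field itself */
    			continue;
    		h = (h ^ page[i]) * 16777619u;
    	}
    	return (uint16_t) (h ^ (h >> 16));
    }

    /*
     * Extend a relation file by one block that carries a pretend header
     * and a valid checksum, rather than all zeroes.  Without additional
     * WAL logging, a crash can still leave this block zeroed (or absent),
     * because many filesystems journal the size change but not the data.
     */
    int
    extend_with_checksummed_page(int fd, uint32_t blkno)
    {
    	uint8_t		page[BLCKSZ];
    	uint16_t	checksum;

    	memset(page, 0, sizeof(page));
    	page[12] = 24;						/* pretend "pd_lower": header-only page */
    	page[14] = (uint8_t) (BLCKSZ & 0xff);	/* pretend "pd_upper" */
    	page[15] = (uint8_t) (BLCKSZ >> 8);

    	checksum = toy_checksum(page, blkno);
    	memcpy(page + 8, &checksum, sizeof(checksum));

    	return pwrite(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ ? 0 : -1;
    }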
On 3/22/19 5:41 PM, Andres Freund wrote:
> Hi,
>
> On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
>> On 3/22/19 5:10 PM, Andres Freund wrote:
>>> IDK, being able to verify in some form that backups aren't corrupted on
>>> an IO level is mighty nice. That often allows one to detect the issue
>>> while one still has older backups around.
>>>
>>
>> Yeah, I agree that's a valuable capability. I think the question is how
>> effective it actually is, considering how much storage has changed over
>> the past few years (which necessarily affects the type of failures
>> people have to deal with).
>
> I'm not sure I understand? How do the changes around storage
> meaningfully affect the need to have some trust in backups and to
> benefit from earlier detection?
>

Having trust in backups is still desirable - nothing changes that, obviously. The question I was posing was rather "Are checksums still effective on current storage systems?"

I'm wondering if the storage systems people use nowadays may be failing in ways that are not reliably detectable by checksums. I don't have any data to either support or reject that hypothesis, though.

>
>> It's not clear to me what checksums can do about zeroed pages (and/or
>> truncated files), though.
>
> Well, there's nothing fundamental about needing added pages to be
> zeroes. We could extend files with pages initialized with actual valid
> checksums, instead of
>
>	/* new buffers are zero-filled */
>	MemSet((char *) bufBlock, 0, BLCKSZ);
>	/* don't set checksum for all-zero page */
>	smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
>
> The problem is that it's hard to do so safely without adding a lot of
> additional WAL logging. A lot of filesystems will journal metadata
> changes (like the size of the file), but not contents. So after a crash
> the tail end might appear zeroed out, even if we never wrote
> zeroes. That's obviously solvable by WAL logging, but that's not cheap.
>

Hmmm. I'd say a filesystem that does not guarantee having all the data after an fsync is outright broken, but maybe that's what checksums are meant to protect against.

> It might still be a good idea to just write a page with an initialized
> header / checksum at that point, as that ought to still detect a number
> of problems we can't detect right now.
>

Sounds reasonable.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-03-22 18:01:32 +0100, Tomas Vondra wrote:
> On 3/22/19 5:41 PM, Andres Freund wrote:
> > On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
> >> On 3/22/19 5:10 PM, Andres Freund wrote:
> >>> IDK, being able to verify in some form that backups aren't corrupted on
> >>> an IO level is mighty nice. That often allows one to detect the issue
> >>> while one still has older backups around.
> >>
> >> Yeah, I agree that's a valuable capability. I think the question is how
> >> effective it actually is, considering how much storage has changed over
> >> the past few years (which necessarily affects the type of failures
> >> people have to deal with).
> >
> > I'm not sure I understand? How do the changes around storage
> > meaningfully affect the need to have some trust in backups and to
> > benefit from earlier detection?
>
> Having trust in backups is still desirable - nothing changes that,
> obviously. The question I was posing was rather "Are checksums still
> effective on current storage systems?"
>
> I'm wondering if the storage systems people use nowadays may be failing
> in ways that are not reliably detectable by checksums. I don't have any
> data to either support or reject that hypothesis, though.

I don't think it's useful to paint unsubstantiated doom-and-gloom pictures.

> >> It's not clear to me what checksums can do about zeroed pages (and/or
> >> truncated files), though.
> >
> > Well, there's nothing fundamental about needing added pages to be
> > zeroes. We could extend files with pages initialized with actual valid
> > checksums, instead of
> >
> >	/* new buffers are zero-filled */
> >	MemSet((char *) bufBlock, 0, BLCKSZ);
> >	/* don't set checksum for all-zero page */
> >	smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
> >
> > The problem is that it's hard to do so safely without adding a lot of
> > additional WAL logging. A lot of filesystems will journal metadata
> > changes (like the size of the file), but not contents. So after a crash
> > the tail end might appear zeroed out, even if we never wrote
> > zeroes. That's obviously solvable by WAL logging, but that's not cheap.
>
> Hmmm. I'd say a filesystem that does not guarantee having all the data
> after an fsync is outright broken, but maybe that's what checksums are
> meant to protect against.

There's no fsync here. smgrextend(with-valid-checksum); crash; - the OS will probably have journalled the file size change, but not the contents. After a crash it's thus likely that the data page will appear zeroed.

Which prevents us from erroring out when encountering a zeroed page, even though that'd be very good for error detection capabilities, because storage systems will show corrupted data as zeroes in a number of cases.

Greetings,

Andres Freund
Re: To Tom Lane 2019-03-26 <20190326151446.GG3829@msg.df7cb.de>
> I run a benchmark with checksums disabled/enabled. shared_buffers is
> 512kB to make sure almost any read will fetch the page from the OS
> cache; scale factor is 50 (~750MB) to make sure the whole cluster fits
> into RAM.
[...]
> So the cost is 5% in this very contrived case. In almost any other
> setting, the cost would be lower, I'd think.

(That was on 12devel, btw.)

That was about the most extreme OLTP read-only workload. After thinking about it some more, I realized that exercising large seqscans might be an even better way to test it because of less per-query overhead.

Same setup again, shared_buffers = 16 (128kB), jit = off, max_parallel_workers_per_gather = 0:

select count(bid) from pgbench_accounts;

no checksums:   ~456ms
with checksums: ~489ms

456.0/489 = 0.9325

The cost of checksums is about 6.75% here.

Christoph
On Wed, Mar 27, 2019, 15:57 Christoph Berg <myon@debian.org> wrote:
> Re: To Tom Lane 2019-03-26 <20190326151446.GG3829@msg.df7cb.de>
> > I run a benchmark with checksums disabled/enabled. shared_buffers is
> > 512kB to make sure almost any read will fetch the page from the OS
> > cache; scale factor is 50 (~750MB) to make sure the whole cluster fits
> > into RAM.
> [...]
> > So the cost is 5% in this very contrived case. In almost any other
> > setting, the cost would be lower, I'd think.
>
> (That was on 12devel, btw.)
>
> That was about the most extreme OLTP read-only workload. After
> thinking about it some more, I realized that exercising large seqscans
> might be an even better way to test it because of less per-query
> overhead.
>
> Same setup again, shared_buffers = 16 (128kB), jit = off,
> max_parallel_workers_per_gather = 0:
>
> select count(bid) from pgbench_accounts;
>
> no checksums:   ~456ms
> with checksums: ~489ms
>
> 456.0/489 = 0.9325
>
> The cost of checksums is about 6.75% here.
Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's a bit of low hanging fruit there to use a runtime CPU check to pick a better optimized checksum function.
Regards,
Ants Aasma
Re: Ants Aasma 2019-03-27 <CA+CSw_twXdRzDN2XsSZBxEj63DeZ+f6_hs3Qf7hmXfenxSq+jg@mail.gmail.com>
> Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's
> a bit of low hanging fruit there to use a runtime CPU check to pick a
> better optimized checksum function.

Frankly, no. This is with the apt.pg.o packages which are supposed to be usable by everyone. If there is a better per-CPU checksum function, PG should pick it at runtime. Special compiler flags are a no-go here.

CPPFLAGS = -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/mit-krb5
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer

Christoph
On 2019-03-22 16:16, Christoph Berg wrote:
> I think the next step in that direction would be to enable data
> checksums by default. They make sense in most setups, and people who
> plan to run very performance-critical systems where checksums might be
> too much overhead need to tune many knobs anyway, and can just as well
> disable them manually, instead of everyone else having to enable them
> manually. Also, disabling is much easier than enabling.

It would also enable pg_rewind to work by default.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tuesday, 26.03.2019 at 16:14 +0100, Christoph Berg wrote:
> select 92551.0/97363;
> 0.9506
>
> So the cost is 5% in this very contrived case. In almost any other
> setting, the cost would be lower, I'd think.

Well, my machine (Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz, 32 GByte RAM) tells me this:

pgbench -s 50 -i pgbench
pg_ctl -o "--shared-buffers=128kB" restart
pgbench -r -P4 -Mprepared -T60 -c $clients -j $clients -n -S
...prewarm...

Clients   checksums
 1          20110
 2          35338
 4          67207
 8          96627
16         110091

Clients   no checksums
 1          21716
 2          38543
 4          72118
 8         117545
16         121415

Clients   Impact
 1         0.926045312212194
 2         0.916846119918014
 4         0.931903269641421
 8         0.822042621974563
16         0.906733105464728

So between ~7% and ~18% impact with checksums in this specific case here.

Bernd
On Thu, Mar 28, 2019 at 10:38 AM Christoph Berg <myon@debian.org> wrote:
> Re: Ants Aasma 2019-03-27 <CA+CSw_twXdRzDN2XsSZBxEj63DeZ+f6_hs3Qf7hmXfenxSq+jg@mail.gmail.com>
> > Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's
> > a bit of low hanging fruit there to use a runtime CPU check to pick a
> > better optimized checksum function.
>
> Frankly, no. This is with the apt.pg.o packages which are supposed to
> be usable by everyone. If there is a better per-CPU checksum function,
> PG should pick it at runtime. Special compiler flags are a no-go here.
I went ahead and tested it on the count(*) test, same settings as upthread. Median of 5 runs of 20txs on Intel i5-2500k @ 4GHz.
No checksum: 344ms
Checksums: 384ms (+12%)
No checksum march=native: 344ms
Checksums march=native: 369ms (+7%)
The checksum code was written to be easily auto-vectorized by the compiler. So if we just compile the same function with different compiler flags and pick between them at runtime, the overhead can be approximately halved. I'm not saying that this needs to be done before enabling checksums by default, just that when considering overhead, we can foresee it being much lower in future versions.
Regards,
Ants Aasma
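A minimal, self-contained illustration of the runtime-dispatch idea (GCC or Clang on x86 assumed). The checksum arithmetic here is a stand-in rather than PostgreSQL's algorithm; the point is only that the same C function can be compiled twice, once with a vector target attribute, and the faster build chosen at run time, which would sidestep the -march=native packaging problem mentioned above.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Plain build of the toy checksum loop. */
    static uint32_t
    checksum_generic(const uint8_t *buf, size_t len)
    {
    	uint32_t	sum = 0;

    	for (size_t i = 0; i < len; i++)
    		sum = sum * 31 + buf[i];
    	return sum;
    }

    /* Same source, but compiled for AVX2-capable CPUs. */
    __attribute__((target("avx2")))
    static uint32_t
    checksum_avx2(const uint8_t *buf, size_t len)
    {
    	uint32_t	sum = 0;

    	for (size_t i = 0; i < len; i++)
    		sum = sum * 31 + buf[i];
    	return sum;
    }

    /* Resolved once at startup, like a CPU-dispatch trampoline would do. */
    static uint32_t (*checksum_impl) (const uint8_t *, size_t);

    static void
    choose_checksum_impl(void)
    {
    	__builtin_cpu_init();
    	if (__builtin_cpu_supports("avx2"))
    		checksum_impl = checksum_avx2;
    	else
    		checksum_impl = checksum_generic;
    }

    int
    main(void)
    {
    	uint8_t		block[8192] = {0};

    	choose_checksum_impl();
    	printf("checksum = %u\n", (unsigned) checksum_impl(block, sizeof(block)));
    	return 0;
    }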
On Fri, Mar 29, 2019 at 11:16:11AM +0100, Bernd Helmle wrote:
> So between ~7% and ~18% impact with checksums in this specific case here.

I can't really believe that many people set up shared_buffers at 128kB, which would cause such a large number of page evictions, but I can believe that many users have shared_buffers set to its default value, and that we are going to get complaints about "performance drop after upgrade to v12" if we switch data checksums to on by default.

--
Michael
Attachment
On Friday, 29.03.2019 at 23:10 +0900, Michael Paquier wrote:
>
> I can't really believe that many people set up shared_buffers at
> 128kB, which would cause such a large number of page evictions, but I
> can believe that many users have shared_buffers set to its default
> value, and that we are going to get complaints about "performance drop
> after upgrade to v12" if we switch data checksums to on by default.

Yeah, I think Christoph's benchmark is based on this thinking. I assume this very unrealistic scenario is meant to emulate the worst case (many buffer reads, high checksum calculation load).

Bernd
Re: Bernd Helmle 2019-03-29 <3586bb9345a59bfc8d13a50a7c729be1ee6759fd.camel@oopsware.de>
> On Friday, 29.03.2019 at 23:10 +0900, Michael Paquier wrote:
> >
> > I can't really believe that many people set up shared_buffers at
> > 128kB, which would cause such a large number of page evictions, but I
> > can believe that many users have shared_buffers set to its default
> > value, and that we are going to get complaints about "performance drop
> > after upgrade to v12" if we switch data checksums to on by default.
>
> Yeah, I think Christoph's benchmark is based on this thinking. I assume
> this very unrealistic scenario is meant to emulate the worst case (many
> buffer reads, high checksum calculation load).

It's not unrealistic to have large seqscans that are all buffer misses; the table just has to be big enough. The idea in my benchmark was that if I make shared buffers really small, and the table still fits into RAM, I should be seeing only buffer misses, but without any delay for actually reading from disk.

Christoph
On Fri, Mar 29, 2019 at 08:35:26PM +0100, Christoph Berg wrote:
> Re: Bernd Helmle 2019-03-29 <3586bb9345a59bfc8d13a50a7c729be1ee6759fd.camel@oopsware.de>
> > On Friday, 29.03.2019 at 23:10 +0900, Michael Paquier wrote:
> > >
> > > I can't really believe that many people set up shared_buffers at
> > > 128kB, which would cause such a large number of page evictions, but I
> > > can believe that many users have shared_buffers set to its default
> > > value, and that we are going to get complaints about "performance drop
> > > after upgrade to v12" if we switch data checksums to on by default.
> >
> > Yeah, I think Christoph's benchmark is based on this thinking. I assume
> > this very unrealistic scenario is meant to emulate the worst case (many
> > buffer reads, high checksum calculation load).
>
> It's not unrealistic to have large seqscans that are all buffer
> misses; the table just has to be big enough. The idea in my benchmark
> was that if I make shared buffers really small, and the table still
> fits into RAM, I should be seeing only buffer misses, but without any
> delay for actually reading from disk.
>
> Christoph

FWIW I think it's a mistake to focus solely on CPU utilization, which all the benchmarks performed on this thread do, because they look at tps of in-memory read-only workloads. Checksums have other costs too, not just the additional CPU time. Most importantly, they require wal_log_hints to be set (which people may or may not want anyway).

I've done a simple benchmark that does read-only (-S) and read-write (-N) pgbench runs with different scales, but also measures the duration of the pgbench init and the amount of WAL produced during the tests.

On a small machine (i5, 8GB RAM, SSD RAID) the results are these:

       scale  config    |  init     tps     wal
  ======================|==============================
  ro    10    no-hints  |     2  117038     130
              hints     |     2  116378     146
              checksums |     2  115619     147
        ----------------|------------------------------
        200   no-hints  |    32   88340    2407
              hints     |    37   86154    2628
              checksums |    36   83336    2624
        ----------------|------------------------------
        2000  no-hints  |   365   38680    1967
              hints     |   423   38670    2123
              checksums |   504   37510    2046
  ----------------------|------------------------------
  rw    10    no-hints  |     2   19691     437
              hints     |     2   19712     437
              checksums |     2   19654     437
        ----------------|------------------------------
        200   no-hints  |    32   15839    2745
              hints     |    37   15735    2783
              checksums |    36   15646    2775
        ----------------|------------------------------
        2000  no-hints  |   365    5371    3721
              hints     |   423    5270    3671
              checksums |   504    5094    3574

The no-hints config is the default (wal_log_hints=off, data_checksums=off), hints sets wal_log_hints=on, and checksums enables data checksums. All the configs were somewhat tuned (1GB shared buffers, max_wal_size high enough not to hit checkpoints very often, etc.).

I've also done the tests on a larger machine (2x E5-2620v4, 32GB of RAM, NVMe SSD), and the general pattern is about the same - while the tps and amount of WAL (not covering the init) do not change, the time for initialization increases significantly (by 20-40%). This effect is even clearer when using slower storage (SATA-based RAID). The results then look like this:

       scale  config    |  init     tps     wal
  ======================|==============================
  ro    100   no-hints  |    49  229459     122
              hints     |   101  167983     190
              checksums |   103  156307     190
        ----------------|------------------------------
        1000  no-hints  |   580  152167     109
              hints     |  1047  122814     142
              checksums |  1080  118586     141
        ----------------|------------------------------
        6000  no-hints  |  4035     508       1
              hints     | 11193     502       1
              checksums | 11376     506       1
  ----------------------|------------------------------
  rw    100   no-hints  |    49     279     192
              hints     |   101     275     190
              checksums |   103     275     190
        ----------------|------------------------------
        1000  no-hints  |   580     237     210
              hints     |  1047     225     201
              checksums |  1080     224     200
        ----------------|------------------------------
        6000  no-hints  |  4035     135     123
              hints     | 11193     133     122
              checksums | 11376     132     121

and when expressed as relative to no-hints:

       scale  config    |  init    tps    wal
  ======================|=========================
  ro    100   hints     |  206%    73%   155%
              checksums |  210%    68%   155%
        ----------------|-------------------------
        1000  hints     |  181%    81%   131%
              checksums |  186%    78%   129%
        ----------------|-------------------------
        6000  hints     |  277%    99%   100%
              checksums |  282%   100%   104%
  ----------------------|-------------------------
  rw    100   hints     |  206%    99%    99%
              checksums |  210%    99%    99%
        ----------------|-------------------------
        1000  hints     |  181%    95%    96%
              checksums |  186%    95%    95%
        ----------------|-------------------------
        6000  hints     |  277%    99%    99%
              checksums |  282%    98%    98%

I have not investigated the exact reasons, but my hypothesis is that it's about the amount of WAL generated during the initial CREATE INDEX (because it probably ends up setting the hint bits), which puts additional pressure on the storage. Unfortunately, this additional cost is unlikely to go away :-(

Now, maybe we want to enable checksums by default anyway, but we should not pretend that the only cost related to checksums is CPU usage.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On March 30, 2019 3:25:43 PM EDT, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [...]
>
> Now, maybe we want to enable checksums by default anyway, but we should
> not pretend that the only cost related to checksums is CPU usage.

Thanks for running these, very helpful.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Tomas Vondra 2019-03-30 <20190330192543.GH4719@development>
> I have not investigated the exact reasons, but my hypothesis is that
> it's about the amount of WAL generated during the initial CREATE INDEX
> (because it probably ends up setting the hint bits), which puts
> additional pressure on the storage.
>
> Unfortunately, this additional cost is unlikely to go away :-(

If WAL volume is a problem, would wal_compression help?

> Now, maybe we want to enable checksums by default anyway, but we should
> not pretend that the only cost related to checksums is CPU usage.

Thanks for doing these tests. The point I'm trying to make is: why do we run without data checksums by default? For example, we do checksum the WAL all the time, and there's not even an option to disable that, even if it might make things faster. Why don't we enable data checksums by default as well?

Christoph
On Mon, Apr 1, 2019 at 10:17 AM Christoph Berg <myon@debian.org> wrote:
> Re: Tomas Vondra 2019-03-30 <20190330192543.GH4719@development>
> > I have not investigated the exact reasons, but my hypothesis is that
> > it's about the amount of WAL generated during the initial CREATE INDEX
> > (because it probably ends up setting the hint bits), which puts
> > additional pressure on the storage.
> >
> > Unfortunately, this additional cost is unlikely to go away :-(
>
> If WAL volume is a problem, would wal_compression help?
>
> > Now, maybe we want to enable checksums by default anyway, but we should
> > not pretend that the only cost related to checksums is CPU usage.
>
> Thanks for doing these tests. The point I'm trying to make is: why do
> we run without data checksums by default? For example, we do checksum
> the WAL all the time, and there's not even an option to disable that,
> even if it might make things faster. Why don't we enable data
> checksums by default as well?
I think one of the often-overlooked original reasons was that we need to log hint bits, the same as when wal_log_hints is set.
Of course, if we consider it today, you have to do that in order to use pg_rewind as well, so a lot of people who want to run any form of HA setup will have that turned on anyway. I think that has turned out to be a much weaker reason than it was originally thought to be.
On Mon, Apr 1, 2019 at 10:16:47AM +0200, Christoph Berg wrote:
> Re: Tomas Vondra 2019-03-30 <20190330192543.GH4719@development>
> > I have not investigated the exact reasons, but my hypothesis is that
> > it's about the amount of WAL generated during the initial CREATE INDEX
> > (because it probably ends up setting the hint bits), which puts
> > additional pressure on the storage.
> >
> > Unfortunately, this additional cost is unlikely to go away :-(
>
> If WAL volume is a problem, would wal_compression help?
>
> > Now, maybe we want to enable checksums by default anyway, but we should
> > not pretend that the only cost related to checksums is CPU usage.
>
> Thanks for doing these tests. The point I'm trying to make is: why do
> we run without data checksums by default? For example, we do checksum
> the WAL all the time, and there's not even an option to disable that,
> even if it might make things faster. Why don't we enable data
> checksums by default as well?

We checksum WAL because we know partial WAL writes are likely to happen when power fails during a write. Data pages have pre-images (GUC full_page_writes) stored in the WAL, so they are protected from partial writes and hence are less likely to need checksum protection to detect corruption.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On Fri, Mar 22, 2019 at 12:07:22PM -0400, Tom Lane wrote:
> Christoph Berg <myon@debian.org> writes:
> > I think the next step in that direction would be to enable data
> > checksums by default. They make sense in most setups,
>
> Well, that is exactly the point that needs some proof, not just
> an unfounded assertion.
>
> IMO, the main value of checksums is that they allow the Postgres
> project to deflect blame. That's nice for us but I'm not sure
> that it's a benefit for users. I've seen little if any data to
> suggest that checksums actually catch enough problems to justify
> the extra CPU costs and the risk of false positives.

Enabling checksums by default will require anyone using pg_upgrade to run initdb with checksums disabled before running pg_upgrade, for one release. We could add checksums for non-link pg_upgrade runs, but we don't have code to do that yet, and most people use link mode anyway.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Hi,

On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
> Enabling checksums by default will require anyone using pg_upgrade to
> run initdb with checksums disabled before running pg_upgrade, for one
> release. We could add checksums for non-link pg_upgrade runs, but we
> don't have code to do that yet, and most people use link mode anyway.

Hm. We could just have pg_upgrade run pg_checksums --enable/--disable, based on the old cluster, and print a warning on mismatches. Not sure if that's worth it, but ...

Greetings,

Andres Freund
On Thursday, April 11, 2019 6:58 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
>
> > Enabling checksums by default will require anyone using pg_upgrade to
> > run initdb with checksums disabled before running pg_upgrade, for one
> > release. We could add checksums for non-link pg_upgrade runs, but we
> > don't have code to do that yet, and most people use link mode anyway.
>
> Hm. We could just have pg_upgrade run pg_checksums --enable/--disable,
> based on the old cluster, and print a warning on mismatches. Not sure if
> that's worth it, but ...

That would be for link mode; for copy mode you'd have to initdb with checksums turned off and run pg_checksums on the new cluster, else the non-destructive nature of copy mode would be lost.

Another option would be to teach pg_upgrade to checksum the cluster on the fly during the upgrade. That would however be a big conceptual change for pg_upgrade, as it currently does not modify the cluster data. In Greenplum we have done this, but it was an easier choice there as we are rewriting all the pages anyway. It would also create yet another utility which can checksum an offline cluster, but I wanted to bring the idea to the table.

cheers ./daniel
Hi,

On 2019-04-11 18:15:41 +0000, Daniel Gustafsson wrote:
> On Thursday, April 11, 2019 6:58 PM, Andres Freund <andres@anarazel.de> wrote:
>
> > On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
> >
> > > Enabling checksums by default will require anyone using pg_upgrade to
> > > run initdb with checksums disabled before running pg_upgrade, for one
> > > release. We could add checksums for non-link pg_upgrade runs, but we
> > > don't have code to do that yet, and most people use link mode anyway.
> >
> > Hm. We could just have pg_upgrade run pg_checksums --enable/--disable,
> > based on the old cluster, and print a warning on mismatches. Not sure if
> > that's worth it, but ...
>
> That would be for link mode; for copy mode you'd have to initdb with checksums
> turned off and run pg_checksums on the new cluster, else the non-destructive
> nature of copy mode would be lost.

I don't think so? But I think we might just have misunderstood each other. What I was suggesting is that we could take over the burden of having to match the old cluster's checksum enabled/disabled setting when initdb'ing the new cluster, by changing the new cluster instead of erroring out with:

	if (oldctrl->data_checksum_version == 0 &&
		newctrl->data_checksum_version != 0)
		pg_fatal("old cluster does not use data checksums but the new one does\n");
	else if (oldctrl->data_checksum_version != 0 &&
			 newctrl->data_checksum_version == 0)
		pg_fatal("old cluster uses data checksums but the new one does not\n");
	else if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
		pg_fatal("old and new cluster pg_controldata checksum versions do not match\n");

As the new cluster at that time isn't yet related to the old cluster, I don't see why that'd influence the non-destructive nature?

Greetings,

Andres Freund
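As a rough illustration of that suggestion, here is a hypothetical sketch of pg_upgrade adjusting the new cluster instead of erroring out. The helper names are invented, system() is used only to keep the sketch self-contained, and the real pg_upgrade code would use its own ClusterInfo struct and exec machinery; pg_checksums --enable/--disable and -D are the documented options of that tool.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for pg_upgrade's per-cluster information. */
    typedef struct
    {
    	int			data_checksum_version;	/* 0 = checksums disabled */
    	const char *bindir;
    	const char *pgdata;
    } ClusterInfo;

    /*
     * Run pg_checksums on the (still independent) new cluster.  system()
     * keeps the sketch self-contained; it is not how pg_upgrade invokes
     * external programs.
     */
    static int
    run_pg_checksums(const ClusterInfo *cluster, const char *mode)
    {
    	char		cmd[1024];

    	snprintf(cmd, sizeof(cmd), "\"%s/pg_checksums\" %s -D \"%s\"",
    			 cluster->bindir, mode, cluster->pgdata);
    	return system(cmd);
    }

    /* Make the new cluster's checksum setting match the old cluster's. */
    void
    match_checksum_setting(const ClusterInfo *oldcl, ClusterInfo *newcl)
    {
    	if (oldcl->data_checksum_version == newcl->data_checksum_version)
    		return;					/* nothing to do */

    	printf("adjusting new cluster to %s data checksums\n",
    		   oldcl->data_checksum_version != 0 ? "use" : "not use");

    	if (run_pg_checksums(newcl,
    						 oldcl->data_checksum_version != 0 ?
    						 "--enable" : "--disable") != 0)
    	{
    		fprintf(stderr, "could not adjust data checksums on new cluster\n");
    		exit(1);
    	}
    	newcl->data_checksum_version = oldcl->data_checksum_version;
    }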
On Thursday, April 11, 2019 8:56 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-11 18:15:41 +0000, Daniel Gustafsson wrote:
> > [...]
> >
> > That would be for link mode; for copy mode you'd have to initdb with checksums
> > turned off and run pg_checksums on the new cluster, else the non-destructive
> > nature of copy mode would be lost.
>
> I don't think so? But I think we might just have misunderstood each
> other. What I was suggesting is that we could take over the burden of
> having to match the old cluster's checksum enabled/disabled setting when
> initdb'ing the new cluster, by changing the new cluster instead of
> erroring out with:
>
>	if (oldctrl->data_checksum_version == 0 &&
>		newctrl->data_checksum_version != 0)
>		pg_fatal("old cluster does not use data checksums but the new one does\n");
>	else if (oldctrl->data_checksum_version != 0 &&
>			 newctrl->data_checksum_version == 0)
>		pg_fatal("old cluster uses data checksums but the new one does not\n");
>	else if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
>		pg_fatal("old and new cluster pg_controldata checksum versions do not match\n");
>
> As the new cluster at that time isn't yet related to the old cluster, I
> don't see why that'd influence the non-destructive nature?

Right, now I see what you mean, and I indeed misunderstood you. Thanks for clarifying.

cheers ./daniel