Re: Enable data checksums by default - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Enable data checksums by default
Date: 2019-03-30 19:25:43
Msg-id: 20190330192543.GH4719@development
In response to: Re: Enable data checksums by default (Christoph Berg <myon@debian.org>)
List: pgsql-hackers
On Fri, Mar 29, 2019 at 08:35:26PM +0100, Christoph Berg wrote:
>Re: Bernd Helmle 2019-03-29 <3586bb9345a59bfc8d13a50a7c729be1ee6759fd.camel@oopsware.de>
>> On Friday, March 29, 2019 at 23:10 +0900, Michael Paquier wrote:
>> >
>> > I can't really believe that many people set up shared_buffers at
>> > 128kB
>> > which would cause such a large number of page evictions, but I can
>> > believe that many users have shared_buffers set to its default value
>> > and that we are going to get complaints about "performance drop after
>> > upgrade to v12" if we switch data checksums to on by default.
>>
>> Yeah, I think Christoph's benchmark is based on this thinking. I assume
>> this very unrealistic scenario should emulate the worst case (many
>> buffer_reads, high checksum calculation load).
>
>It's not unrealistic to have large seqscans that are all buffer
>misses, the table just has to be big enough. The idea in my benchmark
>was that if I make shared buffers really small, and the table still
>fits into RAM, I should be seeing only buffer misses, but without any
>delay for actually reading from disk.
>
>Christoph
>

FWIW I think it's a mistake to focus solely on CPU utilization, which is
what all the benchmarks in this thread do, because they look at tps of
in-memory read-only workloads. Checksums have other costs too, not just
the additional CPU time. Most importantly, they force hint bits to be
WAL-logged - the same effect as setting wal_log_hints=on (which people
may or may not want anyway).
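
For context, a rough sketch of how the two options are set (the data
directory path here is just a placeholder):

    # data checksums are an initdb-time choice
    initdb --data-checksums -D /path/to/data

    # verify what a cluster was initialized with
    pg_controldata /path/to/data | grep -i checksum

    # wal_log_hints, on the other hand, is just a GUC in postgresql.conf:
    #   wal_log_hints = on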

I've done a simple benchmark that does read-only (-S) and read-write
(-N) pgbench runs at different scales, and that also measures the
duration of the pgbench init step and the amount of WAL produced during
the tests.
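
The runs look roughly like this (a sketch only - the scale, duration,
client counts and database name are placeholders, not the exact values
I used):

    pgbench -i -s 200 bench                # init step (timed)
    pgbench -S -T 300 -c 16 -j 4 bench     # read-only run
    pgbench -N -T 300 -c 16 -j 4 bench     # read-write run

    # WAL volume can be measured by diffing LSNs around each run
    before=$(psql -AtX -c "SELECT pg_current_wal_lsn()" bench)
    pgbench -N -T 300 -c 16 -j 4 bench
    psql -AtX -c "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '$before'))" bench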

On a small machine (i5, 8GB RAM, SSD RAID) the results are these:

         scale    config     |    init           tps        wal
    =========================|==================================
    ro      10    no-hints   |       2        117038        130
                  hints      |       2        116378        146
                  checksums  |       2        115619        147
          -------------------|----------------------------------
           200    no-hints   |      32         88340        2407
                  hints      |      37         86154        2628
                  checksums  |      36         83336        2624
          -------------------|----------------------------------
          2000    no-hints   |     365         38680        1967
                  hints      |     423         38670        2123
                  checksums  |     504         37510        2046
    -------------------------|----------------------------------
    rw      10    no-hints   |       2         19691        437
                  hints      |       2         19712        437
                  checksums  |       2         19654        437
          -------------------|----------------------------------
           200    no-hints   |      32         15839        2745
                  hints      |      37         15735        2783
                  checksums  |      36         15646        2775
          -------------------|----------------------------------
          2000    no-hints   |     365         5371         3721
                  hints      |     423         5270         3671
                  checksums  |     504         5094         3574

The no-hints config is the default (wal_log_hints=off,
data_checksums=off), the hints config sets wal_log_hints=on, and the
checksums config enables data checksums. All the configs were somewhat
tuned (1GB shared buffers, max_wal_size high enough not to hit
checkpoints very often, etc.).
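
In postgresql.conf terms, the three configs differ only in the hint-bit
settings (a sketch; the max_wal_size value is a placeholder):

    # common tuning for all three configs
    shared_buffers = 1GB
    max_wal_size   = 64GB    # high enough to avoid frequent checkpoints

    # no-hints:   wal_log_hints = off, initdb without --data-checksums
    # hints:      wal_log_hints = on,  initdb without --data-checksums
    # checksums:  initdb --data-checksums (hint bits are then WAL-logged
    #             regardless of wal_log_hints)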

I've also run the tests on a larger machine (2x E5-2620v4, 32GB of RAM,
NVMe SSD), and the general pattern is about the same - while the tps and
the amount of WAL (not counting the init) do not change, the time for
initialization increases significantly (by 20-40%).

This effect is even clearer when using slower storage (SATA-based RAID).
The results then look like this:

         scale    config     |    init           tps        wal
    =========================|==================================
    ro     100    no-hints   |      49        229459        122
                  hints      |     101        167983        190
                  checksums  |     103        156307        190
          -------------------|----------------------------------
          1000    no-hints   |     580        152167        109
                  hints      |    1047        122814        142
                  checksums  |    1080        118586        141
          -------------------|----------------------------------
          6000    no-hints   |    4035           508          1
                  hints      |   11193           502          1
                  checksums  |   11376           506          1
    -------------------------|----------------------------------
    rw     100    no-hints   |      49           279        192
                  hints      |     101           275        190
                  checksums  |     103           275        190
          -------------------|----------------------------------
          1000    no-hints   |     580           237        210
                  hints      |    1047           225        201
                  checksums  |    1080           224        200
          -------------------|----------------------------------
          6000    no-hints   |    4035           135        123
                  hints      |   11193           133        122
                  checksums  |   11376           132        121

and when expressed relative to no-hints:

            scale       config  |     init       tps        wal
    ============================|===============================
    ro        100        hints  |     206%       73%        155%
                     checksums  |     210%       68%        155%
             -------------------|--------------------------------
             1000        hints  |     181%       81%        131%
                     checksums  |     186%       78%        129%
             -------------------|--------------------------------
             6000        hints  |     277%       99%        100%
                     checksums  |     282%       100%       104%
    ----------------------------|--------------------------------
    rw        100        hints  |     206%       99%        99%
                     checksums  |     210%       99%        99%
             -------------------|--------------------------------
             1000        hints  |     181%       95%        96%
                     checksums  |     186%       95%        95%
             -------------------|--------------------------------
             6000        hints  |     277%       99%        99%
                     checksums  |     282%       98%        98%

I have not investigated the exact reasons, but my hypothesis is that
it's about the amount of WAL generated during the initial CREATE INDEX
(because it probably ends up setting the hint bits), which puts
additional pressure on the storage.
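
One way to verify this would be pg_waldump, which can attribute WAL
volume to record types (a sketch; the path and segment names are
placeholders - they'd be whatever segments the init step spans):

    pg_waldump --stats=record -p /path/to/data/pg_wal \
        000000010000000000000001 000000010000000000000010

    # with wal_log_hints / checksums, one would expect extra full-page
    # images from hint-bit-only changes during CREATE INDEX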

Unfortunately, this additional cost is unlikely to go away :-(

Now, maybe we want to enable checksums by default anyway, but we should
not pretend that the only cost related to checksums is CPU usage.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


