Thread: Enable data checksums by default
Lately, PostgreSQL has moved many defaults from the "bare minimum" side towards the "user friendly by default" side, e.g. hot_standby & replication in the default configuration, parallelism, and generally higher defaults for resource knobs like *_mem, autovacuum_* and so on.

I think the next step in that direction would be to enable data checksums by default. They make sense in most setups, and people who plan to run very performance-critical systems where checksums might be too much overhead need to tune many knobs anyway, and can just as well disable them manually, instead of everyone else having to enable them manually. Also, disabling is much easier than enabling.

One argument against checksums used to be that we lacked tools to fix problems with them, but ignore_checksum_failure and the pg_checksums tool address that.

The attached patch flips the default in initdb. It also adds a new option -K, --no-data-checksums that wasn't present previously. The docs are updated to state the new default, and the testsuite exercises the -K option.

Christoph
Attachment
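To make the described initdb change concrete, here is a small standalone sketch of how such option handling could look. This is illustrative only, not the attached patch; the treatment of the existing -k switch and the variable names are assumptions based on the mail above.

    #include <getopt.h>
    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Sketch of the proposed initdb behaviour: data checksums default to
     * "on", the existing -k/--data-checksums switch becomes a no-op, and
     * a new -K/--no-data-checksums switch turns them off again.
     */
    static bool data_checksums = true;	/* proposed new default */

    int
    main(int argc, char **argv)
    {
    	static struct option long_options[] = {
    		{"data-checksums", no_argument, NULL, 'k'},
    		{"no-data-checksums", no_argument, NULL, 'K'},
    		{NULL, 0, NULL, 0}
    	};
    	int			c;

    	while ((c = getopt_long(argc, argv, "kK", long_options, NULL)) != -1)
    	{
    		switch (c)
    		{
    			case 'k':			/* kept for compatibility, now redundant */
    				data_checksums = true;
    				break;
    			case 'K':			/* new: opt out of the default */
    				data_checksums = false;
    				break;
    			default:
    				fprintf(stderr, "unknown option\n");
    				return 1;
    		}
    	}

    	printf("data checksums: %s\n", data_checksums ? "enabled" : "disabled");
    	return 0;
    }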
Christoph Berg <myon@debian.org> writes:
> I think the next step in that direction would be to enable data
> checksums by default. They make sense in most setups,

Well, that is exactly the point that needs some proof, not just an unfounded assertion.

IMO, the main value of checksums is that they allow the Postgres project to deflect blame. That's nice for us but I'm not sure that it's a benefit for users. I've seen little if any data to suggest that checksums actually catch enough problems to justify the extra CPU costs and the risk of false positives.

			regards, tom lane
Hi,

On 2019-03-22 12:07:22 -0400, Tom Lane wrote:
> Christoph Berg <myon@debian.org> writes:
> > I think the next step in that direction would be to enable data
> > checksums by default. They make sense in most setups,
>
> Well, that is exactly the point that needs some proof, not just
> an unfounded assertion.
>
> IMO, the main value of checksums is that they allow the Postgres
> project to deflect blame. That's nice for us but I'm not sure
> that it's a benefit for users. I've seen little if any data to
> suggest that checksums actually catch enough problems to justify
> the extra CPU costs and the risk of false positives.

IDK, being able to verify in some form that backups aren't corrupted on an IO level is mighty nice. That often allows one to detect the issue while one still has older backups around.

My problem is more that I'm not confident the checks are mature enough. The basebackup checks are atm not able to detect random data, and neither the basebackup nor the backend checks detect zeroed-out files/file ranges.

Greetings,

Andres Freund
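To illustrate why zeroed-out files or file ranges slip past the current checks, here is a self-contained sketch of the general shape of such a page validity test. The checksum routine and the header offsets below are simplified stand-ins, not PostgreSQL's pg_checksum_page or real page layout; the point is only that an all-zero block is treated as a legitimately new page and never fails verification.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Toy checksum, a stand-in only, NOT pg_checksum_page(). */
    static uint16_t
    toy_checksum(const uint8_t *page, uint32_t blkno)
    {
    	uint32_t	h = 0x811c9dc5 ^ blkno;

    	for (size_t i = 0; i < BLCKSZ; i++)
    	{
    		if (i == 8 || i == 9)	/* skip the bytes holding the checksum */
    			continue;
    		h = (h ^ page[i]) * 16777619u;
    	}
    	return (uint16_t) (h ^ (h >> 16));
    }

    /*
     * Sketch of the verification behaviour being discussed: a fully zeroed
     * block is accepted as a "new" page without any checksum comparison,
     * which is why zeroed-out files or ranges go undetected.
     */
    static bool
    page_looks_valid(const uint8_t *page, uint32_t blkno)
    {
    	bool		all_zeroes = true;

    	for (size_t i = 0; i < BLCKSZ; i++)
    	{
    		if (page[i] != 0)
    		{
    			all_zeroes = false;
    			break;
    		}
    	}
    	if (all_zeroes)
    		return true;			/* "new" page: nothing to verify */

    	uint16_t	stored;			/* checksum stored at bytes 8-9 here */

    	memcpy(&stored, page + 8, sizeof(stored));
    	return stored == toy_checksum(page, blkno);
    }

    int
    main(void)
    {
    	uint8_t		zeroed[BLCKSZ] = {0};

    	/* a fully zeroed block sails through the check */
    	printf("zeroed block accepted: %s\n",
    		   page_looks_valid(zeroed, 0) ? "yes" : "no");
    	return 0;
    }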
On 3/22/19 5:10 PM, Andres Freund wrote:
> Hi,
>
> On 2019-03-22 12:07:22 -0400, Tom Lane wrote:
>> Christoph Berg <myon@debian.org> writes:
>>> I think the next step in that direction would be to enable data
>>> checksums by default. They make sense in most setups,
>>
>> Well, that is exactly the point that needs some proof, not just
>> an unfounded assertion.
>>
>> IMO, the main value of checksums is that they allow the Postgres
>> project to deflect blame. That's nice for us but I'm not sure
>> that it's a benefit for users. I've seen little if any data to
>> suggest that checksums actually catch enough problems to justify
>> the extra CPU costs and the risk of false positives.
>

I'm not sure about checksums being an effective tool to deflect blame. Consider the recent fsync retry issues: due to the assumption that we could just retry fsync, we might have lost some of the writes, resulting in torn pages and checksum failures. I'm sure we could argue about how much sense the fsync behavior makes, but I doubt checksum failures are enough to deflect blame here.

> IDK, being able to verify in some form that backups aren't corrupted on
> an IO level is mighty nice. That often allows one to detect the issue
> while one still has older backups around.
>

Yeah, I agree that's a valuable capability. I think the question is how effective it actually is, considering how much storage has changed over the past few years (which necessarily affects the type of failures people have to deal with).

> My problem is more that I'm not confident the checks are mature
> enough. The basebackup checks are atm not able to detect random data,
> and neither the basebackup nor the backend checks detect zeroed-out
> files/file ranges.
>

Yep :-(

The pg_basebackup vulnerability to random garbage in a page header is unfortunate; we'd better improve that. It's not clear to me what checksums can do about zeroed pages (and/or truncated files), though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
> On 3/22/19 5:10 PM, Andres Freund wrote:
> > IDK, being able to verify in some form that backups aren't corrupted on
> > an IO level is mighty nice. That often allows one to detect the issue
> > while one still has older backups around.
> >
>
> Yeah, I agree that's a valuable capability. I think the question is how
> effective it actually is, considering how much storage has changed over
> the past few years (which necessarily affects the type of failures
> people have to deal with).

I'm not sure I understand? How do the changes around storage meaningfully affect the need to have some trust in backups and to benefit from earlier detection?

> It's not clear to me what checksums can do about zeroed pages (and/or
> truncated files), though.

Well, there's nothing fundamental about needing added pages to be zeroes. We could extend files with pages initialized with actual valid checksums, instead of

	/* new buffers are zero-filled */
	MemSet((char *) bufBlock, 0, BLCKSZ);
	/* don't set checksum for all-zero page */
	smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);

The problem is that it's hard to do so safely without adding a lot of additional WAL logging. A lot of filesystems will journal metadata changes (like the size of the file), but not contents. So after a crash the tail end might appear zeroed out, even if we never wrote zeroes. That's obviously solvable by WAL logging, but that's not cheap.

It might still be a good idea to just write a page with an initialized header / checksum at that point, as that ought to still detect a number of problems we can't detect right now.

Greetings,

Andres Freund
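A standalone sketch of the alternative described here, extending a file with an initialized, checksummed page instead of zeroes. pwrite() stands in for smgrextend(), the header bytes and the checksum are simplified placeholders, and, as noted above, a real implementation would still need extra WAL logging to survive the crash case.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Toy checksum, a stand-in only, NOT pg_checksum_page(). */
    static uint16_t
    toy_checksum(const uint8_t *page, uint32_t blkno)
    {
    	uint32_t	h = 0x811c9dc5 ^ blkno;

    	for (size_t i = 0; i < BLCKSZ; i++)
    	{
    		if (i == 8 || i == 9)	/* skip the checksum field itself */
    			continue;
    		h = (h ^ page[i]) * 16777619u;
    	}
    	return (uint16_t) (h ^ (h >> 16));
    }

    /*
     * Extend a relation file by one block that carries a pretend header
     * and a valid checksum, rather than all zeroes.  Without additional
     * WAL logging, a crash can still leave this block zeroed (or absent),
     * because many filesystems journal the size change but not the data.
     */
    int
    extend_with_checksummed_page(int fd, uint32_t blkno)
    {
    	uint8_t		page[BLCKSZ];
    	uint16_t	checksum;

    	memset(page, 0, sizeof(page));
    	page[12] = 24;						/* pretend "pd_lower": header-only page */
    	page[14] = (uint8_t) (BLCKSZ & 0xff);	/* pretend "pd_upper" */
    	page[15] = (uint8_t) (BLCKSZ >> 8);

    	checksum = toy_checksum(page, blkno);
    	memcpy(page + 8, &checksum, sizeof(checksum));

    	return pwrite(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ ? 0 : -1;
    }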
On 3/22/19 5:41 PM, Andres Freund wrote:
> Hi,
>
> On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
>> On 3/22/19 5:10 PM, Andres Freund wrote:
>>> IDK, being able to verify in some form that backups aren't corrupted on
>>> an IO level is mighty nice. That often allows one to detect the issue
>>> while one still has older backups around.
>>>
>>
>> Yeah, I agree that's a valuable capability. I think the question is how
>> effective it actually is, considering how much storage has changed over
>> the past few years (which necessarily affects the type of failures
>> people have to deal with).
>
> I'm not sure I understand? How do the changes around storage
> meaningfully affect the need to have some trust in backups and to
> benefit from earlier detection?
>

Having trust in backups is still desirable - nothing changes that, obviously. The question I was posing was rather "Are checksums still effective on current storage systems?"

I'm wondering if the storage systems people use nowadays may be failing in ways that are not reliably detectable by checksums. I don't have any data to either support or reject that hypothesis, though.

>
>> It's not clear to me what checksums can do about zeroed pages (and/or
>> truncated files), though.
>
> Well, there's nothing fundamental about needing added pages to be
> zeroes. We could extend files with pages initialized with actual valid
> checksums, instead of
>
>	/* new buffers are zero-filled */
>	MemSet((char *) bufBlock, 0, BLCKSZ);
>	/* don't set checksum for all-zero page */
>	smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
>
> The problem is that it's hard to do so safely without adding a lot of
> additional WAL logging. A lot of filesystems will journal metadata
> changes (like the size of the file), but not contents. So after a crash
> the tail end might appear zeroed out, even if we never wrote
> zeroes. That's obviously solvable by WAL logging, but that's not cheap.
>

Hmmm. I'd say a filesystem that does not guarantee having all the data after an fsync is outright broken, but maybe that's what checksums are meant to protect against.

> It might still be a good idea to just write a page with an initialized
> header / checksum at that point, as that ought to still detect a number
> of problems we can't detect right now.
>

Sounds reasonable.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-03-22 18:01:32 +0100, Tomas Vondra wrote:
> On 3/22/19 5:41 PM, Andres Freund wrote:
> > On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
> >> On 3/22/19 5:10 PM, Andres Freund wrote:
> >>> IDK, being able to verify in some form that backups aren't corrupted on
> >>> an IO level is mighty nice. That often allows one to detect the issue
> >>> while one still has older backups around.
> >>
> >> Yeah, I agree that's a valuable capability. I think the question is how
> >> effective it actually is, considering how much storage has changed over
> >> the past few years (which necessarily affects the type of failures
> >> people have to deal with).
> >
> > I'm not sure I understand? How do the changes around storage
> > meaningfully affect the need to have some trust in backups and to
> > benefit from earlier detection?
>
> Having trust in backups is still desirable - nothing changes that,
> obviously. The question I was posing was rather "Are checksums still
> effective on current storage systems?"
>
> I'm wondering if the storage systems people use nowadays may be failing
> in ways that are not reliably detectable by checksums. I don't have any
> data to either support or reject that hypothesis, though.

I don't think it's useful to paint unsubstantiated doom-and-gloom pictures.

> >> It's not clear to me what checksums can do about zeroed pages (and/or
> >> truncated files), though.
> >
> > Well, there's nothing fundamental about needing added pages to be
> > zeroes. We could extend files with pages initialized with actual valid
> > checksums, instead of
> >
> >	/* new buffers are zero-filled */
> >	MemSet((char *) bufBlock, 0, BLCKSZ);
> >	/* don't set checksum for all-zero page */
> >	smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
> >
> > The problem is that it's hard to do so safely without adding a lot of
> > additional WAL logging. A lot of filesystems will journal metadata
> > changes (like the size of the file), but not contents. So after a crash
> > the tail end might appear zeroed out, even if we never wrote
> > zeroes. That's obviously solvable by WAL logging, but that's not cheap.
>
> Hmmm. I'd say a filesystem that does not guarantee having all the data
> after an fsync is outright broken, but maybe that's what checksums are
> meant to protect against.

There's no fsync here. smgrextend(with-valid-checksum); crash; - the OS will probably have journalled the file size change, but not the contents. After a crash it's thus likely that the data page will appear zeroed.

Which prevents us from erroring out when encountering a zeroed page, even though that'd be very good for error detection capabilities, because storage systems will show corrupted data as zeroes in a number of cases.

Greetings,

Andres Freund
Re: To Tom Lane 2019-03-26 <20190326151446.GG3829@msg.df7cb.de>
> I run a benchmark with checksums disabled/enabled. shared_buffers is
> 512kB to make sure almost any read will fetch the page from the OS
> cache; scale factor is 50 (~750MB) to make sure the whole cluster fits
> into RAM.
[...]
> So the cost is 5% in this very contrived case. In almost any other
> setting, the cost would be lower, I'd think.

(That was on 12devel, btw.)

That was about the most extreme OLTP read-only workload. After thinking about it some more, I realized that exercising large seqscans might be an even better way to test it because of less per-query overhead.

Same setup again, shared_buffers = 16 (128kB), jit = off, max_parallel_workers_per_gather = 0:

select count(bid) from pgbench_accounts;

no checksums:   ~456ms
with checksums: ~489ms

456.0/489 = 0.9325

The cost of checksums is about 6.75% here.

Christoph
On Wed, Mar 27, 2019, 15:57 Christoph Berg <myon@debian.org> wrote:
> Re: To Tom Lane 2019-03-26 <20190326151446.GG3829@msg.df7cb.de>
> > I run a benchmark with checksums disabled/enabled. shared_buffers is
> > 512kB to make sure almost any read will fetch the page from the OS
> > cache; scale factor is 50 (~750MB) to make sure the whole cluster fits
> > into RAM.
> [...]
> > So the cost is 5% in this very contrived case. In almost any other
> > setting, the cost would be lower, I'd think.
>
> (That was on 12devel, btw.)
>
> That was about the most extreme OLTP read-only workload. After
> thinking about it some more, I realized that exercising large seqscans
> might be an even better way to test it because of less per-query
> overhead.
>
> Same setup again, shared_buffers = 16 (128kB), jit = off,
> max_parallel_workers_per_gather = 0:
>
> select count(bid) from pgbench_accounts;
>
> no checksums:   ~456ms
> with checksums: ~489ms
>
> 456.0/489 = 0.9325
>
> The cost of checksums is about 6.75% here.
Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's a bit of low hanging fruit there to use a runtime CPU check to pick a better optimized checksum function.
Regards,
Ants Aasma
Re: Ants Aasma 2019-03-27 <CA+CSw_twXdRzDN2XsSZBxEj63DeZ+f6_hs3Qf7hmXfenxSq+jg@mail.gmail.com>
> Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's
> a bit of low hanging fruit there to use a runtime CPU check to pick a
> better optimized checksum function.

Frankly, no. This is with the apt.pg.o packages which are supposed to be usable by everyone. If there is a better per-CPU checksum function, PG should pick it at runtime. Special compiler flags are a no-go here.

CPPFLAGS = -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/mit-krb5
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer

Christoph
On 2019-03-22 16:16, Christoph Berg wrote:
> I think the next step in that direction would be to enable data
> checksums by default. They make sense in most setups, and people who
> plan to run very performance-critical systems where checksums might be
> too much overhead need to tune many knobs anyway, and can just as well
> disable them manually, instead of everyone else having to enable them
> manually. Also, disabling is much easier than enabling.

It would also enable pg_rewind to work by default.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tuesday, 26.03.2019 at 16:14 +0100, Christoph Berg wrote:
> select 92551.0/97363;
> 0.9506
>
> So the cost is 5% in this very contrived case. In almost any other
> setting, the cost would be lower, I'd think.

Well, my machine (Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz, 32 GByte RAM) tells me this:

pgbench -s 50 -i pgbench
pg_ctl -o "--shared-buffers=128kB" restart
pgbench -r -P4 -Mprepared -T60 -c $clients -j $clients -n -S
...prewarm...

Clients   checksums
 1          20110
 2          35338
 4          67207
 8          96627
16         110091

Clients   no checksums
 1          21716
 2          38543
 4          72118
 8         117545
16         121415

Clients   Impact
 1         0.926045312212194
 2         0.916846119918014
 4         0.931903269641421
 8         0.822042621974563
16         0.906733105464728

So between ~7% and ~18% impact with checksums in this specific case here.

Bernd
On Thu, Mar 28, 2019 at 10:38 AM Christoph Berg <myon@debian.org> wrote:
> Re: Ants Aasma 2019-03-27 <CA+CSw_twXdRzDN2XsSZBxEj63DeZ+f6_hs3Qf7hmXfenxSq+jg@mail.gmail.com>
> > Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's
> > a bit of low hanging fruit there to use a runtime CPU check to pick a
> > better optimized checksum function.
>
> Frankly, no. This is with the apt.pg.o packages which are supposed to
> be usable by everyone. If there is a better per-CPU checksum function,
> PG should pick it at runtime. Special compiler flags are a no-go here.
I went ahead and tested it on the count(*) test, same settings as upthread. Median of 5 runs of 20txs on Intel i5-2500k @ 4GHz.
No checksum: 344ms
Checksums: 384ms (+12%)
No checksum march=native: 344ms
Checksums march=native: 369ms (+7%)
The checksum code was written to be easily auto-vectorized by the compiler. So if we just compile the same function with different compiler flags and pick between them at runtime, the overhead can be approximately halved. I'm not saying that this needs to be done before enabling checksums by default, just that when considering overhead, we can foresee it being much lower in future versions.
Regards,
Ants Aasma
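A minimal, self-contained illustration of the runtime-dispatch idea (GCC or Clang on x86 assumed). The checksum arithmetic here is a stand-in rather than PostgreSQL's algorithm; the point is only that the same C function can be compiled twice, once with a vector target attribute, and the faster build chosen at run time, which would sidestep the -march=native packaging problem mentioned above.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Plain build of the toy checksum loop. */
    static uint32_t
    checksum_generic(const uint8_t *buf, size_t len)
    {
    	uint32_t	sum = 0;

    	for (size_t i = 0; i < len; i++)
    		sum = sum * 31 + buf[i];
    	return sum;
    }

    /* Same source, but compiled for AVX2-capable CPUs. */
    __attribute__((target("avx2")))
    static uint32_t
    checksum_avx2(const uint8_t *buf, size_t len)
    {
    	uint32_t	sum = 0;

    	for (size_t i = 0; i < len; i++)
    		sum = sum * 31 + buf[i];
    	return sum;
    }

    /* Resolved once at startup, like a CPU-dispatch trampoline would do. */
    static uint32_t (*checksum_impl) (const uint8_t *, size_t);

    static void
    choose_checksum_impl(void)
    {
    	__builtin_cpu_init();
    	if (__builtin_cpu_supports("avx2"))
    		checksum_impl = checksum_avx2;
    	else
    		checksum_impl = checksum_generic;
    }

    int
    main(void)
    {
    	uint8_t		block[8192] = {0};

    	choose_checksum_impl();
    	printf("checksum = %u\n", (unsigned) checksum_impl(block, sizeof(block)));
    	return 0;
    }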
On Fri, Mar 29, 2019 at 11:16:11AM +0100, Bernd Helmle wrote:
> So between ~7% and ~18% impact with checksums in this specific case here.

I can't really believe that many people set up shared_buffers at 128kB, which would cause such a large number of page evictions, but I can believe that many users have shared_buffers set to its default value, and that we are going to get complaints about "performance drop after upgrade to v12" if we switch data checksums to on by default.

--
Michael
Attachment
On Friday, 29.03.2019 at 23:10 +0900, Michael Paquier wrote:
>
> I can't really believe that many people set up shared_buffers at
> 128kB, which would cause such a large number of page evictions, but I
> can believe that many users have shared_buffers set to its default
> value, and that we are going to get complaints about "performance drop
> after upgrade to v12" if we switch data checksums to on by default.

Yeah, I think Christoph's benchmark is based on this thinking. I assume this very unrealistic scenario is meant to emulate the worst case (many buffer reads, high checksum calculation load).

Bernd
Re: Bernd Helmle 2019-03-29 <3586bb9345a59bfc8d13a50a7c729be1ee6759fd.camel@oopsware.de>
> On Friday, 29.03.2019 at 23:10 +0900, Michael Paquier wrote:
> >
> > I can't really believe that many people set up shared_buffers at
> > 128kB, which would cause such a large number of page evictions, but I
> > can believe that many users have shared_buffers set to its default
> > value, and that we are going to get complaints about "performance drop
> > after upgrade to v12" if we switch data checksums to on by default.
>
> Yeah, I think Christoph's benchmark is based on this thinking. I assume
> this very unrealistic scenario is meant to emulate the worst case (many
> buffer reads, high checksum calculation load).

It's not unrealistic to have large seqscans that are all buffer misses; the table just has to be big enough. The idea in my benchmark was that if I make shared buffers really small, and the table still fits into RAM, I should be seeing only buffer misses, but without any delay for actually reading from disk.

Christoph
On Fri, Mar 29, 2019 at 08:35:26PM +0100, Christoph Berg wrote:
> Re: Bernd Helmle 2019-03-29 <3586bb9345a59bfc8d13a50a7c729be1ee6759fd.camel@oopsware.de>
> > On Friday, 29.03.2019 at 23:10 +0900, Michael Paquier wrote:
> > >
> > > I can't really believe that many people set up shared_buffers at
> > > 128kB, which would cause such a large number of page evictions, but I
> > > can believe that many users have shared_buffers set to its default
> > > value, and that we are going to get complaints about "performance drop
> > > after upgrade to v12" if we switch data checksums to on by default.
> >
> > Yeah, I think Christoph's benchmark is based on this thinking. I assume
> > this very unrealistic scenario is meant to emulate the worst case (many
> > buffer reads, high checksum calculation load).
>
> It's not unrealistic to have large seqscans that are all buffer
> misses; the table just has to be big enough. The idea in my benchmark
> was that if I make shared buffers really small, and the table still
> fits into RAM, I should be seeing only buffer misses, but without any
> delay for actually reading from disk.
>
> Christoph

FWIW I think it's a mistake to focus solely on CPU utilization, which all the benchmarks performed on this thread do, because they look at tps of in-memory read-only workloads. Checksums have other costs too, not just the additional CPU time. Most importantly, they require wal_log_hints to be set (which people may or may not want anyway).

I've done a simple benchmark that does read-only (-S) and read-write (-N) pgbench runs with different scales, but also measures the duration of the pgbench init and the amount of WAL produced during the tests.

On a small machine (i5, 8GB RAM, SSD RAID) the results are these:

       scale  config    |  init     tps     wal
  ======================|==============================
  ro    10    no-hints  |     2  117038     130
              hints     |     2  116378     146
              checksums |     2  115619     147
        ----------------|------------------------------
        200   no-hints  |    32   88340    2407
              hints     |    37   86154    2628
              checksums |    36   83336    2624
        ----------------|------------------------------
        2000  no-hints  |   365   38680    1967
              hints     |   423   38670    2123
              checksums |   504   37510    2046
  ----------------------|------------------------------
  rw    10    no-hints  |     2   19691     437
              hints     |     2   19712     437
              checksums |     2   19654     437
        ----------------|------------------------------
        200   no-hints  |    32   15839    2745
              hints     |    37   15735    2783
              checksums |    36   15646    2775
        ----------------|------------------------------
        2000  no-hints  |   365    5371    3721
              hints     |   423    5270    3671
              checksums |   504    5094    3574

The no-hints config is the default (wal_log_hints=off, data_checksums=off), hints sets wal_log_hints=on, and checksums enables data checksums. All the configs were somewhat tuned (1GB shared buffers, max_wal_size high enough not to hit checkpoints very often, etc.).

I've also done the tests on a larger machine (2x E5-2620v4, 32GB of RAM, NVMe SSD), and the general pattern is about the same - while the tps and amount of WAL (not covering the init) do not change, the time for initialization increases significantly (by 20-40%). This effect is even clearer when using slower storage (SATA-based RAID). The results then look like this:

       scale  config    |  init     tps     wal
  ======================|==============================
  ro    100   no-hints  |    49  229459     122
              hints     |   101  167983     190
              checksums |   103  156307     190
        ----------------|------------------------------
        1000  no-hints  |   580  152167     109
              hints     |  1047  122814     142
              checksums |  1080  118586     141
        ----------------|------------------------------
        6000  no-hints  |  4035     508       1
              hints     | 11193     502       1
              checksums | 11376     506       1
  ----------------------|------------------------------
  rw    100   no-hints  |    49     279     192
              hints     |   101     275     190
              checksums |   103     275     190
        ----------------|------------------------------
        1000  no-hints  |   580     237     210
              hints     |  1047     225     201
              checksums |  1080     224     200
        ----------------|------------------------------
        6000  no-hints  |  4035     135     123
              hints     | 11193     133     122
              checksums | 11376     132     121

and when expressed as relative to no-hints:

       scale  config    |  init    tps    wal
  ======================|=========================
  ro    100   hints     |  206%    73%   155%
              checksums |  210%    68%   155%
        ----------------|-------------------------
        1000  hints     |  181%    81%   131%
              checksums |  186%    78%   129%
        ----------------|-------------------------
        6000  hints     |  277%    99%   100%
              checksums |  282%   100%   104%
  ----------------------|-------------------------
  rw    100   hints     |  206%    99%    99%
              checksums |  210%    99%    99%
        ----------------|-------------------------
        1000  hints     |  181%    95%    96%
              checksums |  186%    95%    95%
        ----------------|-------------------------
        6000  hints     |  277%    99%    99%
              checksums |  282%    98%    98%

I have not investigated the exact reasons, but my hypothesis is that it's about the amount of WAL generated during the initial CREATE INDEX (because it probably ends up setting the hint bits), which puts additional pressure on the storage. Unfortunately, this additional cost is unlikely to go away :-(

Now, maybe we want to enable checksums by default anyway, but we should not pretend that the only cost related to checksums is CPU usage.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On March 30, 2019 3:25:43 PM EDT, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [...]
>
> Now, maybe we want to enable checksums by default anyway, but we should
> not pretend that the only cost related to checksums is CPU usage.

Thanks for running these, very helpful.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Tomas Vondra 2019-03-30 <20190330192543.GH4719@development>
> I have not investigated the exact reasons, but my hypothesis is that
> it's about the amount of WAL generated during the initial CREATE INDEX
> (because it probably ends up setting the hint bits), which puts
> additional pressure on the storage.
>
> Unfortunately, this additional cost is unlikely to go away :-(

If WAL volume is a problem, would wal_compression help?

> Now, maybe we want to enable checksums by default anyway, but we should
> not pretend that the only cost related to checksums is CPU usage.

Thanks for doing these tests. The point I'm trying to make is: why do we run without data checksums by default? For example, we do checksum the WAL all the time, and there's not even an option to disable that, even if it might make things faster. Why don't we enable data checksums by default as well?

Christoph
On Mon, Apr 1, 2019 at 10:17 AM Christoph Berg <myon@debian.org> wrote:
> Re: Tomas Vondra 2019-03-30 <20190330192543.GH4719@development>
> > I have not investigated the exact reasons, but my hypothesis is that
> > it's about the amount of WAL generated during the initial CREATE INDEX
> > (because it probably ends up setting the hint bits), which puts
> > additional pressure on the storage.
> >
> > Unfortunately, this additional cost is unlikely to go away :-(
>
> If WAL volume is a problem, would wal_compression help?
>
> > Now, maybe we want to enable checksums by default anyway, but we should
> > not pretend that the only cost related to checksums is CPU usage.
>
> Thanks for doing these tests. The point I'm trying to make is: why do
> we run without data checksums by default? For example, we do checksum
> the WAL all the time, and there's not even an option to disable that,
> even if it might make things faster. Why don't we enable data
> checksums by default as well?
I think one of the often-overlooked original reasons was that we need to log hint bits, the same as when wal_log_hints is set.
Of course, if we consider it today, you have to do that in order to use pg_rewind as well, so a lot of people who want to run any form of HA setup will have that turned on anyway. I think that has turned out to be a much weaker reason than it was originally thought to be.
On Mon, Apr 1, 2019 at 10:16:47AM +0200, Christoph Berg wrote:
> Re: Tomas Vondra 2019-03-30 <20190330192543.GH4719@development>
> > I have not investigated the exact reasons, but my hypothesis is that
> > it's about the amount of WAL generated during the initial CREATE INDEX
> > (because it probably ends up setting the hint bits), which puts
> > additional pressure on the storage.
> >
> > Unfortunately, this additional cost is unlikely to go away :-(
>
> If WAL volume is a problem, would wal_compression help?
>
> > Now, maybe we want to enable checksums by default anyway, but we should
> > not pretend that the only cost related to checksums is CPU usage.
>
> Thanks for doing these tests. The point I'm trying to make is: why do
> we run without data checksums by default? For example, we do checksum
> the WAL all the time, and there's not even an option to disable that,
> even if it might make things faster. Why don't we enable data
> checksums by default as well?

We checksum WAL because we know partial WAL writes are likely to happen when power fails during a write. Data pages have pre-images (GUC full_page_writes) stored in the WAL, so they are protected from partial writes and hence are less likely to need checksum protection to detect corruption.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On Fri, Mar 22, 2019 at 12:07:22PM -0400, Tom Lane wrote:
> Christoph Berg <myon@debian.org> writes:
> > I think the next step in that direction would be to enable data
> > checksums by default. They make sense in most setups,
>
> Well, that is exactly the point that needs some proof, not just
> an unfounded assertion.
>
> IMO, the main value of checksums is that they allow the Postgres
> project to deflect blame. That's nice for us but I'm not sure
> that it's a benefit for users. I've seen little if any data to
> suggest that checksums actually catch enough problems to justify
> the extra CPU costs and the risk of false positives.

Enabling checksums by default will require anyone using pg_upgrade to run initdb with checksums disabled before running pg_upgrade, for one release. We could add checksums for non-link pg_upgrade runs, but we don't have code to do that yet, and most people use link mode anyway.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Hi,

On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
> Enabling checksums by default will require anyone using pg_upgrade to
> run initdb with checksums disabled before running pg_upgrade, for one
> release. We could add checksums for non-link pg_upgrade runs, but we
> don't have code to do that yet, and most people use link mode anyway.

Hm. We could just have pg_upgrade run pg_checksums --enable/--disable, based on the old cluster, and print a warning on mismatches. Not sure if that's worth it, but ...

Greetings,

Andres Freund
On Thursday, April 11, 2019 6:58 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
>
> > Enabling checksums by default will require anyone using pg_upgrade to
> > run initdb with checksums disabled before running pg_upgrade, for one
> > release. We could add checksums for non-link pg_upgrade runs, but we
> > don't have code to do that yet, and most people use link mode anyway.
>
> Hm. We could just have pg_upgrade run pg_checksums --enable/--disable,
> based on the old cluster, and print a warning on mismatches. Not sure if
> that's worth it, but ...

That would be for link mode; for copy mode you'd have to initdb with checksums turned off and run pg_checksums on the new cluster, else the non-destructive nature of copy mode would be lost.

Another option would be to teach pg_upgrade to checksum the cluster on the fly during the upgrade. That would however be a big conceptual change for pg_upgrade, as it currently does not modify the cluster data. In Greenplum we have done this, but it was an easier choice there as we are rewriting all the pages anyway. It would also create yet another utility which can checksum an offline cluster, but I wanted to bring the idea to the table.

cheers ./daniel
Hi,

On 2019-04-11 18:15:41 +0000, Daniel Gustafsson wrote:
> On Thursday, April 11, 2019 6:58 PM, Andres Freund <andres@anarazel.de> wrote:
>
> > On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
> >
> > > Enabling checksums by default will require anyone using pg_upgrade to
> > > run initdb with checksums disabled before running pg_upgrade, for one
> > > release. We could add checksums for non-link pg_upgrade runs, but we
> > > don't have code to do that yet, and most people use link mode anyway.
> >
> > Hm. We could just have pg_upgrade run pg_checksums --enable/--disable,
> > based on the old cluster, and print a warning on mismatches. Not sure if
> > that's worth it, but ...
>
> That would be for link mode; for copy mode you'd have to initdb with checksums
> turned off and run pg_checksums on the new cluster, else the non-destructive
> nature of copy mode would be lost.

I don't think so? But I think we might just have misunderstood each other. What I was suggesting is that we could take over the burden of having to match the old cluster's checksum enabled/disabled setting when initdb'ing the new cluster, by changing the new cluster instead of erroring out with:

	if (oldctrl->data_checksum_version == 0 &&
		newctrl->data_checksum_version != 0)
		pg_fatal("old cluster does not use data checksums but the new one does\n");
	else if (oldctrl->data_checksum_version != 0 &&
			 newctrl->data_checksum_version == 0)
		pg_fatal("old cluster uses data checksums but the new one does not\n");
	else if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
		pg_fatal("old and new cluster pg_controldata checksum versions do not match\n");

As the new cluster at that time isn't yet related to the old cluster, I don't see why that'd influence the non-destructive nature?

Greetings,

Andres Freund
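As a rough illustration of that suggestion, here is a hypothetical sketch of pg_upgrade adjusting the new cluster instead of erroring out. The helper names are invented, system() is used only to keep the sketch self-contained, and the real pg_upgrade code would use its own ClusterInfo struct and exec machinery; pg_checksums --enable/--disable and -D are the documented options of that tool.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for pg_upgrade's per-cluster information. */
    typedef struct
    {
    	int			data_checksum_version;	/* 0 = checksums disabled */
    	const char *bindir;
    	const char *pgdata;
    } ClusterInfo;

    /*
     * Run pg_checksums on the (still independent) new cluster.  system()
     * keeps the sketch self-contained; it is not how pg_upgrade invokes
     * external programs.
     */
    static int
    run_pg_checksums(const ClusterInfo *cluster, const char *mode)
    {
    	char		cmd[1024];

    	snprintf(cmd, sizeof(cmd), "\"%s/pg_checksums\" %s -D \"%s\"",
    			 cluster->bindir, mode, cluster->pgdata);
    	return system(cmd);
    }

    /* Make the new cluster's checksum setting match the old cluster's. */
    void
    match_checksum_setting(const ClusterInfo *oldcl, ClusterInfo *newcl)
    {
    	if (oldcl->data_checksum_version == newcl->data_checksum_version)
    		return;					/* nothing to do */

    	printf("adjusting new cluster to %s data checksums\n",
    		   oldcl->data_checksum_version != 0 ? "use" : "not use");

    	if (run_pg_checksums(newcl,
    						 oldcl->data_checksum_version != 0 ?
    						 "--enable" : "--disable") != 0)
    	{
    		fprintf(stderr, "could not adjust data checksums on new cluster\n");
    		exit(1);
    	}
    	newcl->data_checksum_version = oldcl->data_checksum_version;
    }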
On Thursday, April 11, 2019 8:56 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-11 18:15:41 +0000, Daniel Gustafsson wrote:
> > [...]
> >
> > That would be for link mode; for copy mode you'd have to initdb with checksums
> > turned off and run pg_checksums on the new cluster, else the non-destructive
> > nature of copy mode would be lost.
>
> I don't think so? But I think we might just have misunderstood each
> other. What I was suggesting is that we could take over the burden of
> having to match the old cluster's checksum enabled/disabled setting when
> initdb'ing the new cluster, by changing the new cluster instead of
> erroring out with:
>
>	if (oldctrl->data_checksum_version == 0 &&
>		newctrl->data_checksum_version != 0)
>		pg_fatal("old cluster does not use data checksums but the new one does\n");
>	else if (oldctrl->data_checksum_version != 0 &&
>			 newctrl->data_checksum_version == 0)
>		pg_fatal("old cluster uses data checksums but the new one does not\n");
>	else if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
>		pg_fatal("old and new cluster pg_controldata checksum versions do not match\n");
>
> As the new cluster at that time isn't yet related to the old cluster, I
> don't see why that'd influence the non-destructive nature?

Right, now I see what you mean, and I indeed misunderstood you. Thanks for clarifying.

cheers ./daniel