Thread: [HACKERS] Checksums by default?

[HACKERS] Checksums by default?

From: Magnus Hagander
Is it time to enable checksums by default, and give initdb a switch to turn it off instead?

I keep running into situations where people haven't enabled it, because (a) they didn't know about it, or (b) their packaging system ran initdb for them so they didn't even know they could. And of course they usually figure this out once the db has enough data and traffic that the only way to fix it is to set up something like slony/bucardo/pglogical and a whole new server to deal with it.. (Which is something that would also be good to fix -- but having the default changed would be useful as well)


Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Magnus Hagander (magnus@hagander.net) wrote:
> Is it time to enable checksums by default, and give initdb a switch to turn
> it off instead?

Yes, please.

We've already agreed to make changes to have a better user experience
and ask those who really care about certain performance aspects to have
to configure for performance instead (see: wal_level changes); I view
this as being very much in that same vein.

I know one argument in the past has been that we don't have a tool that
can be used to check all of the checksums, but that's also changed now
that pgBackRest supports verifying checksums during backups.  I'm all
for adding a tool to core to perform a validation too, of course, though
it does make a lot of sense to validate checksums during backup since
you're reading all the pages anyway.
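For illustration, here's a minimal sketch of what such an offline
validation pass over a single relation segment file could look like.
It leans on pg_checksum_page() from the server headers; the
scan_segment() helper, file handling, and error reporting are
hypothetical, and it assumes the cluster was shut down cleanly so no
torn pages are in flight:

#include "postgres.h"

#include <stdio.h>

#include "storage/bufpage.h"
#include "storage/checksum.h"

/* Verify every page of one relation segment file; returns mismatch count. */
static int
scan_segment(const char *path, BlockNumber first_blkno)
{
    /* the union forces the alignment the checksum code expects */
    union
    {
        char        data[BLCKSZ];
        uint32      force_align;
    }           buf;
    BlockNumber blkno = first_blkno;
    int         failures = 0;
    FILE       *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;

    while (fread(buf.data, 1, BLCKSZ, fp) == BLCKSZ)
    {
        PageHeader  phdr = (PageHeader) buf.data;

        /* brand-new (all-zero) pages carry no checksum yet */
        if (!PageIsNew((Page) buf.data) &&
            phdr->pd_checksum != pg_checksum_page(buf.data, blkno))
        {
            fprintf(stderr, "checksum mismatch: %s, block %u\n", path, blkno);
            failures++;
        }
        blkno++;
    }
    fclose(fp);
    return failures;
}

Note that the caller must pass the block number of the segment's first
page (the segment number times RELSEG_SIZE), since the block number is
mixed into the checksum to catch pages written to the wrong location.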

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Michael Paquier
On Sat, Jan 21, 2017 at 7:39 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Is it time to enable checksums by default, and give initdb a switch to turn
> it off instead?
>
> I keep running into situations where people haven't enabled it, because (a)
> they didn't know about it, or (b) their packaging system ran initdb for them
> so they didn't even know they could. And of course they usually figure this
> out once the db has enough data and traffic that the only way to fix it is
> to set up something like slony/bucardo/pglogical and a whole new server to
> deal with it.. (Which is something that would also be good to fix -- but
> having the default changed would be useful as well)

Perhaps that's not mandatory, but I think that one obstacle in
changing this default is to be able to have pg_upgrade work from a
checksum-disabled old instance to a checksum-enabled instance. That
would really help with its adoption.
-- 
Michael



Re: [HACKERS] Checksums by default?

From: Magnus Hagander


On Sat, Jan 21, 2017 at 3:05 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Jan 21, 2017 at 7:39 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Is it time to enable checksums by default, and give initdb a switch to turn
> it off instead?
>
> I keep running into situations where people haven't enabled it, because (a)
> they didn't know about it, or (b) their packaging system ran initdb for them
> so they didn't even know they could. And of course they usually figure this
> out once the db has enough data and traffic that the only way to fix it is
> to set up something like slony/bucardo/pglogical and a whole new server to
> deal with it.. (Which is something that would also be good to fix -- but
> having the default changed would be useful as well)

Perhaps that's not mandatory, but I think that one obstacle in
changing this default is to be able to have pg_upgrade work from a
checksum-disabled old instance to a checksum-enabled instance. That
would really help with its adoption.

That's a different use case though.

If we just change the default, then we'd have to teach pg_upgrade to initialize the upgraded cluster without checksums. We still need to keep that *option*, just reverse the default. 

Being able to enable checksums on the fly is a different feature. Which I'd really like to have. I have some unfinished code for it, but it's a bit too unfinished so far :)


Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Sat, Jan 21, 2017 at 7:39 PM, Magnus Hagander <magnus@hagander.net> wrote:
> > Is it time to enable checksums by default, and give initdb a switch to turn
> > it off instead?
> >
> > I keep running into situations where people haven't enabled it, because (a)
> > they didn't know about it, or (b) their packaging system ran initdb for them
> > so they didn't even know they could. And of course they usually figure this
> > out once the db has enough data and traffic that the only way to fix it is
> > to set up something like slony/bucardo/pglogical and a whole new server to
> > deal with it.. (Which is something that would also be good to fix -- but
> > having the default changed would be useful as well)
>
> Perhaps that's not mandatory, but I think that one obstacle in
> changing this default is to be able to have pg_upgrade work from a
> checksum-disabled old instance to a checksum-enabled instance. That
> would really help with its adoption.

That's moving the goal-posts here about 3000 miles away and I don't
believe it's necessary to have that to make this change.

I agree that it'd be great to have, of course, and we're looking at if
we could do something like: backup a checksum-disabled system, perform a
restore which adds checksums and marks the cluster as now having
checksums.  If we can work out a good way to do that *and* have it work
with incremental backup/restore, then we could possibly provide a
small-downtime-window way to upgrade to a database with checksums.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Stephen Frost
Magnus,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Sat, Jan 21, 2017 at 3:05 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>
> > On Sat, Jan 21, 2017 at 7:39 PM, Magnus Hagander <magnus@hagander.net>
> > wrote:
> > > Is it time to enable checksums by default, and give initdb a switch to
> > turn
> > > it off instead?
> > >
> > > I keep running into situations where people haven't enabled it, because
> > (a)
> > > they didn't know about it, or (b) their packaging system ran initdb for
> > them
> > > so they didn't even know they could. And of course they usually figure
> > this
> > > out once the db has enough data and traffic that the only way to fix it
> > is
> > > to set up something like slony/bucardo/pglogical and a whole new server
> > to
> > > deal with it.. (Which is something that would also be good to fix -- but
> > > having the default changed would be useful as well)
> >
> > Perhaps that's not mandatory, but I think that one obstacle in
> > changing this default is to be able to have pg_upgrade work from a
> > checksum-disabled old instance to a checksum-enabled instance. That
> > would really help with its adoption.
>
> That's a different use case though.

Agreed.

> If we just change the default, then we'd have to teach pg_upgrade to
> initialize the upgraded cluster without checksums. We still need to keep
> that *option*, just reverse the default.

Just to clarify- pg_upgrade doesn't init the new database, the user (or
a distribution script) does.  As such *pg_upgradecluster* would have to
know to init the new cluster correctly based on the options the old
cluster was init'd with, but it might actually already do that (not sure
off-hand), and, even if it doesn't, it shouldn't be too hard to make it
do that.

> Being able to enable checksums on the fly is a different feature. Which I'd
> really like to have. I have some unfinished code for it, but it's a bit too
> unfinished so far :)

Agreed.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Petr Jelinek
On 21/01/17 11:39, Magnus Hagander wrote:
> Is it time to enable checksums by default, and give initdb a switch to
> turn it off instead?

I'd like to see benchmark first, both in terms of CPU and in terms of
produced WAL (=network traffic) given that it turns on logging of hint bits.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Checksums by default?

From: Stephen Frost
Petr,

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> On 21/01/17 11:39, Magnus Hagander wrote:
> > Is it time to enable checksums by default, and give initdb a switch to
> > turn it off instead?
>
> I'd like to see benchmark first, both in terms of CPU and in terms of
> produced WAL (=network traffic) given that it turns on logging of hint bits.

Benchmarking was done previously, but I don't think it's really all that
relevant, we should be checksum'ing by default because we care about the
data and it's hard to get checksums enabled on a running system.

If this is going to be a serious argument made against making this
change (and, frankly, I don't believe that it should be) then what we
should do is simply provide a way for users to disable checksums.  It
would be one-way and require a restart, of course, but it wouldn't be
hard to do.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Fujii Masao
On Sun, Jan 22, 2017 at 12:18 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 21/01/17 11:39, Magnus Hagander wrote:
>> Is it time to enable checksums by default, and give initdb a switch to
>> turn it off instead?
>
> I'd like to see benchmark first, both in terms of CPU and in terms of
> produced WAL (=network traffic) given that it turns on logging of hint bits.

+1

If the performance overhead by the checksums is really negligible,
we may be able to get rid of wal_log_hints parameter, as well.

Regards,

-- 
Fujii Masao



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Fujii Masao (masao.fujii@gmail.com) wrote:
> On Sun, Jan 22, 2017 at 12:18 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
> > On 21/01/17 11:39, Magnus Hagander wrote:
> >> Is it time to enable checksums by default, and give initdb a switch to
> >> turn it off instead?
> >
> > I'd like to see benchmark first, both in terms of CPU and in terms of
> > produced WAL (=network traffic) given that it turns on logging of hint bits.
>
> +1
>
> If the performance overhead by the checksums is really negligible,
> we may be able to get rid of wal_log_hints parameter, as well.

Prior benchmarks showed it to be on the order of a few percent, as I
recall, so I'm not sure that we can say it's negligible (and that's not
why Magnus was proposing changing the default).

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Tom Lane
Magnus Hagander <magnus@hagander.net> writes:
> Is it time to enable checksums by default, and give initdb a switch to turn
> it off instead?

Have we seen *even one* report of checksums catching problems in a useful
way?

I think this will be making the average user pay X% for nothing.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > Is it time to enable checksums by default, and give initdb a switch to turn
> > it off instead?
>
> Have we seen *even one* report of checksums catching problems in a useful
> way?

This isn't the right question.

The right question is "have we seen reports of corruption which
checksums *would* have caught?"  Admittedly, that's a much harder
question to answer, but I've definitely seen various reports of
corruption in the field, though it's reasonably rare (which I am sure we
can all be thankful for).  I can't say for sure which of those cases
would have been caught if checksums had been enabled, but I have a hard
time believing that none of them would have been caught sooner if
checksums had been enabled and regular checksum validation was being
performed.

Given our current default and the relative rarity with which it happens,
it'll be a great deal longer until we see such a report, but when we do (and I
don't doubt that we will, eventually) what are we going to do about it?
Tell the vast majority of people who still don't have checksums enabled
because it wasn't the default that they need to pg_dump/reload?  That's
not a good way to treat our users.

> I think this will be making the average user pay X% for nothing.

Have we seen *even one* report of someone having to disable checksums
for performance reasons?  If so, that's an argument for giving a way for
users who really trust their hardware, virtualization system, kernel,
storage network, and everything else involved, to disable checksums (as
I suggested elsewhere), not a reason to keep the current default.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Petr Jelinek
On 21/01/17 16:40, Stephen Frost wrote:
> Petr,
> 
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>> On 21/01/17 11:39, Magnus Hagander wrote:
>>> Is it time to enable checksums by default, and give initdb a switch to
>>> turn it off instead?
>>
>> I'd like to see benchmark first, both in terms of CPU and in terms of
>> produced WAL (=network traffic) given that it turns on logging of hint bits.
> 
> Benchmarking was done previously, but I don't think it's really all that
> relevant, we should be checksum'ing by default because we care about the
> data and it's hard to get checksums enabled on a running system.
> 

I do think that performance implications are very relevant. And I
haven't seen any serious benchmark that would incorporate all current
differences between using and not using checksums.

The change of wal_level was supported by benchmark, I think it's
reasonable to ask for this to be as well.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Checksums by default?

From: Stephen Frost
Petr,

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> On 21/01/17 16:40, Stephen Frost wrote:
> > * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> >> On 21/01/17 11:39, Magnus Hagander wrote:
> >>> Is it time to enable checksums by default, and give initdb a switch to
> >>> turn it off instead?
> >>
> >> I'd like to see benchmark first, both in terms of CPU and in terms of
> >> produced WAL (=network traffic) given that it turns on logging of hint bits.
> >
> > Benchmarking was done previously, but I don't think it's really all that
> > relevant, we should be checksum'ing by default because we care about the
> > data and it's hard to get checksums enabled on a running system.
>
> I do think that performance implications are very relevant. And I
> haven't seen any serious benchmark that would incorporate all current
> differences between using and not using checksums.

This is just changing the *default*, not requiring checksums to always
be enabled.  We do not hold the same standards for our defaults as we do
for always-enabled code, for clear reasons- not every situation is the
same and that's why we have defaults that people can change.

There are interesting arguments to be made about whether checksum'ing is
ever worthwhile at all (some seem to see that the feature is entirely
useless and we should just rip that code out, but I don't agree with
that), or if we should just always enable it (because fewer options is a
good thing and we care about our users' data and checksum'ing is worth
the performance hit if it's a small hit; I'm more on the fence when it
comes to this one as I have heard people say that they've run into cases
where it makes enough of a difference in performance to matter for them).

We don't currently configure the defaults for any system to be the
fastest possible performance, or we wouldn't have changed wal_level and
we would have more aggressive settings for things like default work_mem,
maintenance_work_mem, shared_buffers, max_wal_size,
checkpoint_completion_target, all of the autovacuum settings,
effective_io_concurrency, effective_cache_size, etc, etc.

> The change of wal_level was supported by benchmark, I think it's
> reasonable to ask for this to be as well.

No, it wasn't, it was that people felt the cases where changing
wal_level would seriously hurt performance didn't out-weigh the value of
making the change to the default.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Have we seen *even one* report of checksums catching problems in a useful
>> way?

> This isn't the right question.

I disagree.  If they aren't doing something useful for people who have
turned them on, what's the reason to think they'd do something useful
for the rest?

> The right question is "have we seen reports of corruption which
> checksums *would* have caught?"

Sure, that's also a useful question, one which hasn't been answered.

A third useful question is "have we seen any reports of false-positive
checksum failures?".  Even one false positive, IMO, would have costs that
likely outweigh any benefits for typical installations with reasonably
reliable storage hardware.

I really do not believe that there's a case for turning on checksums by
default, and I *certainly* won't go along with turning them on without
somebody actually making that case.  "Is it time yet" is not an argument.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Petr Jelinek
On 21/01/17 17:31, Stephen Frost wrote:
> Petr,
> 
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>> On 21/01/17 16:40, Stephen Frost wrote:
>>> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>>>> On 21/01/17 11:39, Magnus Hagander wrote:
>>>>> Is it time to enable checksums by default, and give initdb a switch to
>>>>> turn it off instead?
>>>>
>>>> I'd like to see benchmark first, both in terms of CPU and in terms of
>>>> produced WAL (=network traffic) given that it turns on logging of hint bits.
>>>
>>> Benchmarking was done previously, but I don't think it's really all that
>>> relevant, we should be checksum'ing by default because we care about the
>>> data and it's hard to get checksums enabled on a running system.
>>
>> I do think that performance implications are very relevant. And I
>> haven't seen any serious benchmark that would incorporate all current
>> differences between using and not using checksums.
> 
> This is just changing the *default*, not requiring checksums to always
> be enabled.  We do not hold the same standards for our defaults as we do
> for always-enabled code, for clear reasons- not every situation is the
> same and that's why we have defaults that people can change.

I can buy that. If it's possible to turn checksums off without
recreating the data directory then I think it would be okay to have the
default on.

>> The change of wal_level was supported by benchmark, I think it's
>> reasonable to ask for this to be as well.
> 
> No, it wasn't, it was that people felt the cases where changing
> wal_level would seriously hurt performance didn't out-weigh the value of
> making the change to the default.
> 

Really?

https://www.postgresql.org/message-id/d34ce5b5-131f-66ce-f7c5-eb406dbe026f@2ndquadrant.com

https://www.postgresql.org/message-id/83b33502-1bf8-1ffb-7c73-5b61ddeb68ab@2ndquadrant.com


--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Checksums by default?

From: Andreas Karlsson
On 01/21/2017 04:48 PM, Stephen Frost wrote:
> * Fujii Masao (masao.fujii@gmail.com) wrote:
>> If the performance overhead by the checksums is really negligible,
>> we may be able to get rid of wal_log_hints parameter, as well.
>
> Prior benchmarks showed it to be on the order of a few percent, as I
> recall, so I'm not sure that we can say it's negligible (and that's not
> why Magnus was proposing changing the default).

It might be worth looking into using the CRC CPU instruction to reduce 
this overhead, like we do for the WAL checksums. Since that is a 
different algorithm it would be a compatibility break and we would need 
to support the old algorithm for upgraded clusters..
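To make that concrete, a CRC-32C loop built on the SSE4.2 instruction
could look roughly like the sketch below. To be clear, this is not the
algorithm data pages use today, just an illustration of the idea; a
real patch would also need runtime CPU detection and a portable
fallback, and this assumes compilation with -msse4.2:

#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>      /* SSE4.2 CRC32 intrinsics */

/* CRC-32C over one data page, consuming eight bytes per instruction. */
static uint32_t
page_crc32c(const void *page, size_t len)
{
    const uint64_t *p = page;
    uint64_t    crc = 0xFFFFFFFF;
    size_t      i;

    /* BLCKSZ is a multiple of 8, so no tail handling is needed here */
    for (i = 0; i < len / 8; i++)
        crc = _mm_crc32_u64(crc, p[i]);

    return (uint32_t) crc ^ 0xFFFFFFFF;
}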

Andreas



Re: [HACKERS] Checksums by default?

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>> The change of wal_level was supported by benchmark, I think it's
>> reasonable to ask for this to be as well.

> No, it wasn't, it was that people felt the cases where changing
> wal_level would seriously hurt performance didn't out-weigh the value of
> making the change to the default.

It was "supported" in the sense that somebody took the trouble to measure
the impact, so that we had some facts on which to base the value judgment
that the cost was acceptable.  In the case of checksums, you seem to be in
a hurry to arrive at a conclusion without any supporting evidence.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> On 21/01/17 17:31, Stephen Frost wrote:
> > This is just changing the *default*, not requiring checksums to always
> > be enabled.  We do not hold the same standards for our defaults as we do
> > for always-enabled code, for clear reasons- not every situation is the
> > same and that's why we have defaults that people can change.
>
> I can buy that. If it's possible to turn checksums off without
> recreating the data directory then I think it would be okay to have the
> default on.

I'm glad to hear that.

> >> The change of wal_level was supported by benchmark, I think it's
> >> reasonable to ask for this to be as well.
> >
> > No, it wasn't, it was that people felt the cases where changing
> > wal_level would seriously hurt performance didn't out-weigh the value of
> > making the change to the default.
>
> Really?

Yes.

> https://www.postgresql.org/message-id/d34ce5b5-131f-66ce-f7c5-eb406dbe026f@2ndquadrant.com

From the above link:

> So while it'd be trivial to construct workloads demonstrating the
> optimizations in wal_level=minimal (e.g. initial loads doing CREATE
> TABLE + COPY + CREATE INDEX in a single transaction), but that would be
> mostly irrelevant I guess.

> Instead, I've decided to run regular pgbench TPC-B-like workload on a
> bunch of different scales, and measure throughput + some xlog stats with
> each of the three wal_level options.

In other words, there was no performance testing of the cases where the
wal_level=minimal (the old default) optimizations matter, compared
against wal_level > minimal.

I'm quite sure that the performance numbers for the CREATE TABLE + COPY
case with wal_level=minimal would have been *far* better than for
wal_level > minimal.

That case was entirely punted on as "mostly irrelevant" even though
there are known production environments where those optimizations make a
huge difference.  Those are OLAP cases though, and not nearly enough
folks around here seem to care one bit about them, which I continue to
be disappointed by.

Even so, I *did* agree with the change to the default of wal_level,
based on an understanding of its value and that users could change to
wal_level=minimal if they wished to, just as I am arguing that same
thing here when it comes to checksums.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> >> The change of wal_level was supported by benchmark, I think it's
> >> reasonable to ask for this to be as well.
>
> > No, it wasn't, it was that people felt the cases where changing
> > wal_level would seriously hurt performance didn't out-weigh the value of
> > making the change to the default.
>
> It was "supported" in the sense that somebody took the trouble to measure
> the impact, so that we had some facts on which to base the value judgment
> that the cost was acceptable.  In the case of checksums, you seem to be in
> a hurry to arrive at a conclusion without any supporting evidence.

No, no one measured the impact in the cases where wal_level=minimal
makes a big difference, that I saw, at least.

Further info with links to what was done are in my reply to Petr.

As for checksums, I do see value in them and I'm pretty sure that the
author of that particular feature did as well, or we wouldn't even have
it as an option.  You seem to be of the opinion that we might as well
just rip all of that code out as being useless.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Andreas Karlsson (andreas@proxel.se) wrote:
> On 01/21/2017 04:48 PM, Stephen Frost wrote:
> >* Fujii Masao (masao.fujii@gmail.com) wrote:
> >>If the performance overhead by the checksums is really negligible,
> >>we may be able to get rid of wal_log_hints parameter, as well.
> >
> >Prior benchmarks showed it to be on the order of a few percent, as I
> >recall, so I'm not sure that we can say it's negligible (and that's not
> >why Magnus was proposing changing the default).
>
> It might be worth looking into using the CRC CPU instruction to
> reduce this overhead, like we do for the WAL checksums. Since that
> is a different algorithm it would be a compatibility break and we
> would need to support the old algorithm for upgraded clusters..

+1.

I'd be all for removing the option and requiring checksums if we do that
and it turns out that the performance hit ends up being less than 1%.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Andres Freund
On 2017-01-21 11:39:18 +0100, Magnus Hagander wrote:
> Is it time to enable checksums by default, and give initdb a switch to turn
> it off instead?

-1 - the WAL overhead is quite massive, and in contrast to the other
GUCs recently changed you can't just switch this around.

Andres



Re: [HACKERS] Checksums by default?

From: Andres Freund
On 2017-01-22 00:41:55 +0900, Fujii Masao wrote:
> On Sun, Jan 22, 2017 at 12:18 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
> > On 21/01/17 11:39, Magnus Hagander wrote:
> >> Is it time to enable checksums by default, and give initdb a switch to
> >> turn it off instead?
> >
> > I'd like to see benchmark first, both in terms of CPU and in terms of
> > produced WAL (=network traffic) given that it turns on logging of hint bits.
> 
> +1
> 
> If the performance overhead by the checksums is really negligible,
> we may be able to get rid of wal_log_hints parameter, as well.

It's not just the performance overhead, but also the volume of WAL due
to the additional full-page images (FPIs)...

Andres



Re: [HACKERS] Checksums by default?

From: Petr Jelinek
On 21/01/17 17:51, Stephen Frost wrote:
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>> On 21/01/17 17:31, Stephen Frost wrote:
>>> This is just changing the *default*, not requiring checksums to always
>>> be enabled.  We do not hold the same standards for our defaults as we do
>>> for always-enabled code, for clear reasons- not every situation is the
>>> same and that's why we have defaults that people can change.
>>
>> I can buy that. If it's possible to turn checksums off without
>> recreating the data directory then I think it would be okay to have the
>> default on.
> 
> I'm glad to hear that.
> 
>>>> The change of wal_level was supported by benchmark, I think it's
>>>> reasonable to ask for this to be as well.
>>>
>>> No, it wasn't, it was that people felt the cases where changing
>>> wal_level would seriously hurt performance didn't out-weigh the value of
>>> making the change to the default.
>>
>> Really?
> 
> Yes.
> 
>> https://www.postgresql.org/message-id/d34ce5b5-131f-66ce-f7c5-eb406dbe026f@2ndquadrant.com
> 
> From the above link:
> 
>> So while it'd be trivial to construct workloads demonstrating the 
>> optimizations in wal_level=minimal (e.g. initial loads doing CREATE 
>> TABLE + COPY + CREATE INDEX in a single transaction), but that would be 
>> mostly irrelevant I guess.
> 
>> Instead, I've decided to run regular pgbench TPC-B-like workload on a 
>> bunch of different scales, and measure throughput + some xlog stats with 
>> each of the three wal_level options.
> 
> In other words, there was no performance testing of the cases where
> wal_level=minimal (the old default) optimizations would have been
> compared against wal_level > minimal.
> 
> I'm quite sure that the performance numbers for the CREATE TABLE + COPY
> case with wal_level=minimal would have been *far* better than for
> wal_level > minimal.

Which is a random use case very few people do on a regular basis.
Checksums affect *everybody*.

What the benchmarks gave us is a way to make an informed decision for
common use. All I am asking for here is to be able to make an informed
decision as well.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Checksums by default?

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> As for checksums, I do see value in them and I'm pretty sure that the
> author of that particular feature did as well, or we wouldn't even have
> it as an option.  You seem to be of the opinion that we might as well
> just rip all of that code out as being useless.

Not at all; I just think that it's not clear that they are a net win
for the average user, and so I'm unconvinced that turning them on by
default is a good idea.  I could be convinced otherwise by suitable
evidence.  What I'm objecting to is turning them on without making
any effort to collect such evidence.

Also, if we do decide to do that, there's the question of timing.
As I mentioned, one of the chief risks I see is the possibility of
false-positive checksum failures due to bugs; I think that code has seen
sufficiently little field use that we should have little confidence that
no such bugs remain.  So if we're gonna do it, I'd prefer to do it at the
very start of a devel cycle, so as to have the greatest opportunity to
find bugs before we ship the new default.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> On 21/01/17 17:51, Stephen Frost wrote:
> > I'm quite sure that the performance numbers for the CREATE TABLE + COPY
> > case with wal_level=minimal would have been *far* better than for
> > wal_level > minimal.
>
> Which is a random use case very few people do on a regular basis.
> Checksums affect *everybody*.

It's not a 'random use case very few people do on a regular basis', it's
a different use case which a subset of our users do.  I agree that
changing the default for checksums would affect everyone who uses just
the defaults.

What I think is different about checksums, really, is that you have to
pass an option to initdb to enable them.  That's right from a technical
perspective, but I seriously doubt everyone re-reads the 'man' page for
initdb when they're setting up a new cluster, and if they didn't read
the release notes for that particular release where checksums were
introduced, they might not be aware that they exist or that they're
defaulted to off.

Values in postgresql.conf, at least in my experience, get a lot more
regular review as there's always things changing there from
release-to-release and anyone even slightly experienced with PG knows
that our defaults in postgresql.conf pretty much suck and have to be
adjusted.  I expect we'd see a lot more people using checksums if they
were in postgresql.conf somehow.  I don't have any particular answer to
that problem, just thought it interesting to consider how flags to
initdb differ from postgresql.conf configuration options.

> What the benchmarks gave us is a way to make an informed decision for
> common use. All I am asking for here is to be able to make an informed
> decision as well.

From above:

> >>>> The change of wal_level was supported by benchmark, I think it's
> >>>> reasonable to ask for this to be as well.
> >>>
> >>> No, it wasn't, it was that people felt the cases where changing
> >>> wal_level would seriously hurt performance didn't out-weigh the value of
> >>> making the change to the default.

In other words, the change of the *default* for checksums, at least in
my view, is well worth the performance impact, and I came to that
conclusion based on the previously published work when the feature was
being developed, which was a few percent, as I recall, though I'd be
fine with having it be the default even if it was 5%.

At what point would you say we shouldn't have the default be to have
checksums enabled?  1%?  5%? 10%?

All that said, to be clear, I don't have any objection to Tomas (or
whomever) doing performance testing of this case if he's interested and
has time, but I don't feel we really need a huge analysis of this.
We did an analysis when the feature was developed and I doubt we would
have been able to even get it included if it resulted in a 10% or more
performance hit, though it wasn't that high, as I recall.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Andres Freund
On 2017-01-21 12:09:53 -0500, Tom Lane wrote:
> Also, if we do decide to do that, there's the question of timing.
> As I mentioned, one of the chief risks I see is the possibility of
> false-positive checksum failures due to bugs; I think that code has seen
> sufficiently little field use that we should have little confidence that
> no such bugs remain.  So if we're gonna do it, I'd prefer to do it at the
> very start of a devel cycle, so as to have the greatest opportunity to
> find bugs before we ship the new default.

What wouldn't hurt is enabling it by default in pg_regress on master for
a while. That seems like a good thing to do independent of flipping the
default.

Andres



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > As for checksums, I do see value in them and I'm pretty sure that the
> > author of that particular feature did as well, or we wouldn't even have
> > it as an option.  You seem to be of the opinion that we might as well
> > just rip all of that code out as being useless.
>
> Not at all; I just think that it's not clear that they are a net win
> for the average user, and so I'm unconvinced that turning them on by
> default is a good idea.  I could be convinced otherwise by suitable
> evidence.  What I'm objecting to is turning them on without making
> any effort to collect such evidence.

> Also, if we do decide to do that, there's the question of timing.
> As I mentioned, one of the chief risks I see is the possibility of
> false-positive checksum failures due to bugs; I think that code has seen
> sufficiently little field use that we should have little confidence that
> no such bugs remain.  So if we're gonna do it, I'd prefer to do it at the
> very start of a devel cycle, so as to have the greatest opportunity to
> find bugs before we ship the new default.

I can agree with a goal to make sure we aren't enabling code that a
bunch of people are going to be seeing false positives due to, as that
would certainly be bad.  I also agree that further testing is generally
a good idea, I just dislike the idea that we would prevent this change
to the default due to performance concerns of a few percent in a default
install.

I'll ask around with some of the big PG installed bases (RDS, Heroku,
etc) and see if they're a) enabling checksums already (and therefore
doing a lot of testing of those code paths), and b) if they are, if
they have seen any true or false positive reports from it.

If we hear back that the large installed bases in those environments are
already running with checksums enabled, without issues, then I'm not
sure that we need to wait until the start of the next devel cycle to
change the default.  Alternatively, if we end up wanting to change the
on-disk format, as discussed elsewhere on this thread, then we might
want to wait and make that change at the same time as a change to the
default, or at least not change the default right before we decide to
change the format.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Andres Freund (andres@anarazel.de) wrote:
> On 2017-01-21 12:09:53 -0500, Tom Lane wrote:
> > Also, if we do decide to do that, there's the question of timing.
> > As I mentioned, one of the chief risks I see is the possibility of
> > false-positive checksum failures due to bugs; I think that code has seen
> > sufficiently little field use that we should have little confidence that
> > no such bugs remain.  So if we're gonna do it, I'd prefer to do it at the
> > very start of a devel cycle, so as to have the greatest opportunity to
> > find bugs before we ship the new default.
>
> What wouldn't hurt is enabling it by default in pg_regress on master for
> a while. That seems like a good thing to do independent of flipping the
> default.

Oh.  I like that idea, a lot.

+1.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> What wouldn't hurt is enabling it by default in pg_regress on master for
> a while. That seems like a good thing to do independent of flipping the
> default.

Yeah, I could get behind that.  I'm not certain how much the regression
tests really stress checksumming: most of the tests work with small
amounts of data that probably never leave shared buffers.  But it'd be
some hard evidence at least.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Petr Jelinek
On 21/01/17 18:15, Stephen Frost wrote:
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>> On 21/01/17 17:51, Stephen Frost wrote:
>>> I'm quite sure that the performance numbers for the CREATE TABLE + COPY
>>> case with wal_level=minimal would have been *far* better than for
>>> wal_level > minimal.
>>
>> Which is a random use case very few people do on a regular basis.
>> Checksums affect *everybody*.
> 
> It's not a 'random use case very few people do on a regular basis', it's
> a different use case which a subset of our users do.  I agree that
> changing the default for checksums would affect everyone who uses just
> the defaults.
> 
> What I think is different about checksums, really, is that you have to
> pass an option to initdb to enable them.  That's right from a technical
> perspective, but I seriously doubt everyone re-reads the 'man' page for
> initdb when they're setting up a new cluster, and if they didn't read
> the release notes for that particular release where checksums were
> introduced, they might not be aware that they exist or that they're
> defaulted to off.
> 
> Values in postgresql.conf, at least in my experience, get a lot more
> regular review as there's always things changing there from
> release-to-release and anyone even slightly experienced with PG knows
> that our defaults in postgresql.conf pretty much suck and have to be
> adjusted.  I expect we'd see a lot more people using checksums if they
> were in postgresql.conf somehow.  I don't have any particular answer to
> that problem, just thought it interesting to consider how flags to
> initdb differ from postgresql.conf configuration options.

So in summary, "postgresql.conf options are easy to change" while
"initdb options are hard to change"; I can see this argument used both
for enabling and for disabling checksums by default. As I said, I would
be less worried if it was easy to turn off, but we are not there afaik.
And even then I'd still want a benchmark first.

> 
>> What the benchmarks gave us is a way to make an informed decision for
>> common use. All I am asking for here is to be able to make an informed
>> decision as well.
> 
> From above:
> 
>>>>>> The change of wal_level was supported by benchmark, I think it's
>>>>>> reasonable to ask for this to be as well.
>>>>>
>>>>> No, it wasn't, it was that people felt the cases where changing
>>>>> wal_level would seriously hurt performance didn't out-weigh the value of
>>>>> making the change to the default.
> 
> In other words, the change of the *default* for checksums, at least in
> my view, is well worth the performance impact.
> 

As we don't know what the performance impact is (there was no benchmark
done on a reasonably current code base) I really don't understand how
you can judge if it's worth it or not.

I stand by the opinion that changing a default which affects
performance without any benchmark is a bad idea.

And for the record, I care much less about overall TPS; I care a lot
more about the amount of WAL produced, because in 90%+ of the
environments that I work with any increase in WAL amount means at least
double the increase in network bandwidth due to replication.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> As we don't know what the performance impact is (there was no benchmark
> done on a reasonably current code base) I really don't understand how
> you can judge if it's worth it or not.

Because I see having checksums as, frankly, something we always should
have had (as most other databases do, for good reason...) and because
they will hopefully prevent data loss.  I'm willing to give up a fair
bit to minimize the risk of losing data.

> I stand by the opinion that changing a default which affects
> performance without any benchmark is a bad idea.

I'd be surprised if the performance impact has really changed all that
much since the code went in.  Perhaps that's overly optimistic of me.

> And for the record, I care much less about overall TPS; I care a lot
> more about the amount of WAL produced, because in 90%+ of the
> environments that I work with any increase in WAL amount means at least
> double the increase in network bandwidth due to replication.

Do you run with all defaults in those environments?

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Petr Jelinek
On 21/01/17 18:46, Stephen Frost wrote:
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>> As we don't know what the performance impact is (there was no benchmark
>> done on a reasonably current code base) I really don't understand how
>> you can judge if it's worth it or not.
> 
> Because I see having checksums as, frankly, something we always should
> have had (as most other databases do, for good reason...) and because
> they will hopefully prevent data loss.  I'm willing to give up a fair
> bit to minimize the risk of losing data.
> 
>> I stand by the opinion that changing a default which affects
>> performance without any benchmark is a bad idea.
> 
> I'd be surprised if the performance impact has really changed all that
> much since the code went in.  Perhaps that's overly optimistic of me.
> 

My problem is that we are still only guessing. And while my gut also
tells me that the TPS difference will not be big, it also tells me that
changes like this in important software like PostgreSQL should not be
made based purely on it.

>> And for the record, I care much less about overall TPS; I care a lot
>> more about the amount of WAL produced, because in 90%+ of the
>> environments that I work with any increase in WAL amount means at least
>> double the increase in network bandwidth due to replication.
> 
> Do you run with all defaults in those environments?
> 

For initdb? Mostly yes.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Checksums by default?

From: Andres Freund
On 2017-01-21 12:46:05 -0500, Stephen Frost wrote:
> > I stand by the opinion that changing a default which affects
> > performance without any benchmark is a bad idea.
> 
> I'd be surprised if the performance impact has really changed all that
> much since the code went in.  Perhaps that's overly optimistic of me.

Back then there were cases with well over 20% overhead. More edge cases,
but that's a lot.  And our scalability back then was a lot worse than
where we are today.


> > And for the record, I care much less about overall TPS; I care a lot
> > more about the amount of WAL produced, because in 90%+ of the
> > environments that I work with any increase in WAL amount means at least
> > double the increase in network bandwidth due to replication.
> 
> Do you run with all defaults in those environments?

Irrelevant - changing requires re-initdb'ing. That's unrealistic.

Greetings,

Andres Freund



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> > Do you run with all defaults in those environments?
>
> For initdb? Mostly yes.

Ok, fine, but you probably wouldn't if this change went in.

For me, it's the other way around: I have to go enable checksums at initdb time
unless there's an excuse not to, and that's only when I get to be
involved at initdb time, which is often not the case, sadly.  There's
certainly lots and lots of environments where the extra CPU and WAL
aren't even remotely close to being an issue.

* Andres Freund (andres@anarazel.de) wrote:
> On 2017-01-21 12:46:05 -0500, Stephen Frost wrote:
> > Do you run with all defaults in those environments?
>
> Irrelevant - changing requires re-initdb'ing. That's unrealistic.

I disagree.  Further, we can add an option to be able to disable
checksums without needing to re-initdb pretty trivially, which addresses
the case where someone's having a problem because it's enabled, as
discussed.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2017-01-21 12:46:05 -0500, Stephen Frost wrote:
>> Do you run with all defaults in those environments?

> Irrelevant - changing requires re-initdb'ing. That's unrealistic.

If you can't turn checksums *off* without re-initdb, that raises the
stakes for this enormously.  But why is that so hard?  Seems like if
you just stop checking them, you're done.  I see that we have the
state recorded in pg_control, but surely we could teach some utility
or other to update that file while the postmaster is stopped.
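For what it's worth, the pg_control surgery itself looks small.  A
hypothetical sketch, assuming the server headers, with the caller
responsible for reading global/pg_control, validating its existing CRC,
and writing the struct back (with an fsync) while the postmaster is
down:

#include "postgres.h"

#include "catalog/pg_control.h"
#include "port/pg_crc32c.h"

/* Clear the checksum flag and refresh pg_control's own CRC. */
static void
mark_checksums_off(ControlFileData *ControlFile)
{
    ControlFile->data_checksum_version = 0;

    /* the CRC covers everything up to, but not including, the crc field */
    INIT_CRC32C(ControlFile->crc);
    COMP_CRC32C(ControlFile->crc,
                (char *) ControlFile,
                offsetof(ControlFileData, crc));
    FIN_CRC32C(ControlFile->crc);
}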

I think a reasonable prerequisite before we even consider this change
is a patch to make it possible to turn off checksumming.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> Because I see having checksums as, frankly, something we always should
> have had (as most other databases do, for good reason...) and because
> they will hopefully prevent data loss.  I'm willing to give up a fair
> bit to minimize the risk of losing data.

To be perfectly blunt, that's just magical thinking.  Checksums don't
prevent data loss in any way, shape, or form.  In fact, they can *cause*
data loss, or at least make it harder for you to retrieve your data,
in the event of bugs causing false-positive checksum failures.

What checksums can do for you, perhaps, is notify you in a reasonably
timely fashion if you've already lost data due to storage-subsystem
problems.  But in a pretty high percentage of cases, that fact would
be extremely obvious anyway, because of visible data corruption.

I think the only really clear benefit accruing from checksums is that
they make it easier to distinguish storage-subsystem failures from
Postgres bugs.  That can certainly be a benefit to some users, but
I remain dubious that the average user will find it worth any noticeable
amount of overhead.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Andres Freund
On 2017-01-21 13:03:52 -0500, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2017-01-21 12:46:05 -0500, Stephen Frost wrote:
> > > Do you run with all defaults in those environments?
> > 
> > Irrelevant - changing requires re-initdb'ing. That's unrealistic.
> 
> I disagree.  Further, we can add an option to be able to disable
> checksums without needing to re-initdb pretty trivially, which addresses
> the case where someone's having a problem because it's enabled, as
> discussed.

Sure, it might be easy, but we don't have it.  Personally I think
checksums just aren't even ready for prime time. If we had:
- ability to switch on/off at runtime (early patches for that have IIRC been posted)
- *builtin* tooling to check checksums for everything
- *builtin* tooling to compute checksums after changing setting
- configurable background sweeps for checksums

then the story would look differently. Right now checksums just aren't
particularly useful due to not having the above. Just checking recent
data doesn't really guarantee much - failures are more likely in old
data, and the data might even be read from ram.

Greetings,

Andres Freund



Re: [HACKERS] Checksums by default?

From: Andres Freund
On 2017-01-21 13:04:18 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-01-21 12:46:05 -0500, Stephen Frost wrote:
> >> Do you run with all defaults in those environments?
> 
> > Irrelevant - changing requires re-initdb'ing. That's unrealistic.
> 
> If you can't turn checksums *off* without re-initdb, that raises the
> stakes for this enormously.

Not right now, no.


> But why is that so hard?

Don't think it is, it's just that nobody has done it ;)


Greetings,

Andres Freund



Re: [HACKERS] Checksums by default?

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> Sure, it might be easy, but we don't have it.  Personally I think
> checksums just aren't even ready for prime time. If we had:
> - ability to switch on/off at runtime (early patches for that have IIRC
>   been posted)
> - *builtin* tooling to check checksums for everything
> - *builtin* tooling to compute checksums after changing setting
> - configurable background sweeps for checksums

Yeah, and there's a bunch of usability tooling that we don't have,
centered around "what do you do after you get a checksum error?".
AFAIK there's no way to check or clear such an error; but without
such tools, I'm afraid that checksums are as much of a foot-gun
as a benefit.

I think developing all this stuff is a good long-term activity,
but I'm hesitant about turning checksums loose on the average
user before we have it.

To draw a weak analogy, our checksums right now are more or less
where our replication was in 9.1 --- we had it, but there were still
an awful lot of rough edges.  It was only just in this past release
cycle that we started to change default settings towards the idea
that they should support replication by default.  I think the checksum
tooling likewise needs years of maturation before we can say that it's
realistically ready to be the default.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > Because I see having checksums as, frankly, something we always should
> > have had (as most other databases do, for good reason...) and because
> > they will hopefully prevent data loss.  I'm willing to give up a fair
> > bit to minimize the risk of losing data.
>
> To be perfectly blunt, that's just magical thinking.  Checksums don't
> prevent data loss in any way, shape, or form.  In fact, they can *cause*
> data loss, or at least make it harder for you to retrieve your data,
> in the event of bugs causing false-positive checksum failures.

This is not a new argument, at least to me, and I don't agree with it.

> What checksums can do for you, perhaps, is notify you in a reasonably
> timely fashion if you've already lost data due to storage-subsystem
> problems.  But in a pretty high percentage of cases, that fact would
> be extremely obvious anyway, because of visible data corruption.

Exactly, and that awareness will allow a user to prevent further data
loss or corruption.  Slow corruption over time is a very much known and
accepted real-world case that people do experience, and bit flips are
common enough that someone wrote a not-that-old blog post about
them:

https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1

A really nice property of checksums on pages is that they also tell you
what data you *didn't* lose, which can be extremely valuable.

> I think the only really clear benefit accruing from checksums is that
> they make it easier to distinguish storage-subsystem failures from
> Postgres bugs.  That can certainly be a benefit to some users, but
> I remain dubious that the average user will find it worth any noticeable
> amount of overhead.

Or memory errors, or kernel bugs, or virtualization bugs, if they happen
at the right time.  We keep adding to the bits between the DB and the
storage and to think they're all perfect is certainly a step farther
than I'd go.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From: Stephen Frost
* Andres Freund (andres@anarazel.de) wrote:
> On 2017-01-21 13:03:52 -0500, Stephen Frost wrote:
> > * Andres Freund (andres@anarazel.de) wrote:
> > > On 2017-01-21 12:46:05 -0500, Stephen Frost wrote:
> > > > Do you run with all defaults in those environments?
> > >
> > > Irrelevant - changing requires re-initdb'ing. That's unrealistic.
> >
> > I disagree.  Further, we can add an option to be able to disable
> > checksums without needing to re-initdb pretty trivially, which addresses
> > the case where someone's having a problem because it's enabled, as
> > discussed.
>
> Sure, it might be easy, but we don't have it.  Personally I think
> checksums just aren't even ready for prime time. If we had:
> - ability to switch on/off at runtime (early patches for that have IIRC
>   been posted)
> - *builtin* tooling to check checksums for everything
> - *builtin* tooling to compute checksums after changing setting
> - configurable background sweeps for checksums

I'm certainly all for adding, well, all of that.

I don't think we need *all* of it before considering enabling checksums
by default, but I do think it'd be great if we had people working on
adding those.

> then the story would look differently. Right now checksums just aren't
> particularly useful due to not having the above. Just checking recent
> data doesn't really guarantee much - failures are more likely in old
> data, and the data might even be read from ram.

I agree that failures tend to be more likely in old data, though, as
with everything, "it depends."

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Sure, it might be easy, but we don't have it.  Personally I think
> > checksums just aren't even ready for prime time. If we had:
> > - ability to switch on/off at runtime (early patches for that have IIRC
> >   been posted)
> > - *builtin* tooling to check checksums for everything
> > - *builtin* tooling to compute checksums after changing setting
> > - configurable background sweeps for checksums
>
> Yeah, and there's a bunch of usability tooling that we don't have,
> centered around "what do you do after you get a checksum error?".
> AFAIK there's no way to check or clear such an error; but without
> such tools, I'm afraid that checksums are as much of a foot-gun
> as a benefit.

Uh, ignore_checksum_failure and zero_damaged_pages ...?

Not that I'd suggest flipping those on for your production database the
first time you see a checksum failure, but we aren't completely without
a way to address such cases.

Or the to-be-implemented ability to disable checksums for a cluster.

> I think developing all this stuff is a good long-term activity,
> but I'm hesitant about turning checksums loose on the average
> user before we have it.

What I dislike about this stance is that it just means we're going to
have more and more systems out there that won't have checksums enabled,
and there's not going to be an easy way to fix that.

> To draw a weak analogy, our checksums right now are more or less
> where our replication was in 9.1 --- we had it, but there were still
> an awful lot of rough edges.  It was only just in this past release
> cycle that we started to change default settings towards the idea
> that they should support replication by default.  I think the checksum
> tooling likewise needs years of maturation before we can say that it's
> realistically ready to be the default.

Well, checksums were introduced in 9.3, which would mean that this is
really only being proposed a year earlier than in the replication
timeline case, if I'm following correctly.  I do agree that checksums
have not seen quite as much love as the replication work, though I'm
tempted to argue that's because they aren't an "interesting" feature now
that we've got them -- but even those uninteresting features really need
someone to champion them when they're important.  Unfortunately, our
situation with checksums does make me feel a bit like they were added to
satisfy a check-box requirement and then not really developed very much
further.

Following your weak analogy though, it's not like users are going to
have to dump/restore their entire cluster to change their systems to
take advantage of the new replication capabilities.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Sat, Jan 21, 2017 at 6:41 PM, Andreas Karlsson <andreas@proxel.se> wrote:
> On 01/21/2017 04:48 PM, Stephen Frost wrote:
>>
>> * Fujii Masao (masao.fujii@gmail.com) wrote:
>>>
>>> If the performance overhead by the checksums is really negligible,
>>> we may be able to get rid of wal_log_hints parameter, as well.
>>
>>
>> Prior benchmarks showed it to be on the order of a few percent, as I
>> recall, so I'm not sure that we can say it's negligible (and that's not
>> why Magnus was proposing changing the default).
>
>
> It might be worth looking into using the CRC CPU instruction to reduce this
> overhead, like we do for the WAL checksums. Since that is a different
> algorithm it would be a compatibility break and we would need to support the
> old algorithm for upgraded clusters..

We looked at that when picking the algorithm. At that point it seemed
that CRC CPU instructions were not universal enough to rely on them.
The algorithm we settled on was designed to be fast on SIMD hardware.
Unfortunately, on x86-64 that requires SSE4.1 integer instructions, so
with default compiles there is a lot of performance left on the table.
Low-hanging fruit would be to do CPU detection as in the CRC case and
enable an SSE4.1-optimized variant when those instructions are
available. IIRC it was actually a lot faster than the naive hardware
CRC that is used for WAL and about on par with interleaved CRC.

That said, the actual checksum calculation was not a big issue for
performance. The only way to make it really matter was a workload
larger than shared buffers but smaller than RAM, with tiny per-page
execution overhead. My test case was SELECT COUNT(*) on wide rows with
a small fillfactor. Having to WAL-log hint bits is the major issue.
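
If anyone wants to reproduce that test, a sketch along these lines
(table name and sizes are made up):

psql -c "CREATE TABLE wide (id int, pad text) WITH (fillfactor = 10);"
psql -c "INSERT INTO wide SELECT i, repeat('x', 500) FROM generate_series(1, 100000) i;"
# size shared_buffers well below the table, keep the table below RAM
psql -c "SELECT count(*) FROM wide;"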

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Sat, Jan 21, 2017 at 7:39 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> So in summary "postgresql.conf options are easy to change" while "initdb
> options are hard to change", I can see this argument used both for
> enabling or disabling checksums by default. As I said I would be less
> worried if it was easy to turn off, but we are not there afaik. And even
> then I'd still want benchmark first.

Adding support for disabling checksums is almost trivial, as it only
requires flipping a value in the control file. And I have sitting
around somewhere a similarly simple tool for turning on checksums while
the database is offline. FWIW, based on the customers and fellow
conference-goers I have talked to, most would gladly take the
performance hit, but not the downtime, to turn checksums on for an
existing database.
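
For reference, the value in question is already visible offline -- a
disable tool would just rewrite it with the server shut down:

pg_controldata "$PGDATA" | grep -i checksum
# prints e.g. "Data page checksum version:           1"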

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Ants Aasma (ants.aasma@eesti.ee) wrote:
> On Sat, Jan 21, 2017 at 7:39 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
> > So in summary "postgresql.conf options are easy to change" while "initdb
> > options are hard to change", I can see this argument used both for
> > enabling or disabling checksums by default. As I said I would be less
> > worried if it was easy to turn off, but we are not there afaik. And even
> > then I'd still want benchmark first.
>
> Adding support for disabling checksums is almost trivial, as it only
> requires flipping a value in the control file. And I have sitting
> around somewhere a similarly simple tool for turning on checksums while
> the database is offline. FWIW, based on the customers and fellow
> conference-goers I have talked to, most would gladly take the
> performance hit, but not the downtime, to turn checksums on for an
> existing database.

I've had the same reaction from folks I've talked to, unless it was the
cases where they were just floored that we didn't have them enabled by
default and now they felt the need to go get them enabled on all their
systems...

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
"Greg Sabino Mullane"
Date:

tl;dr +1 from me for changing the default, it is worth it.

Tom Lane wrote:
> Have we seen *even one* report of checksums catching
> problems in a useful way?

Sort of chicken-and-egg, as most places don't have it enabled. 
Which leads us to:

Stephen Frost replies:
> This isn't the right question.
>
> The right question is "have we seen reports of corruption which
> checksums *would* have caught?"

Well, I've seen corruption that almost certainly would have been caught
much earlier, rather than stumbled upon later when the corruption
happened to finally trigger an error. I don't normally report such
things to the list: it's almost always a hardware bug or bad RAM. I
would only post if it were caused by a Postgres bug.

Tom Lane wrote:
> I think this will be making the average user pay X% for nothing.

I think you mean "the average user who doesn't check what initdb
options are available". And we can certainly post a big notice about
this in the release notes, so people can use the initdb option
--disable-data-checksums if they want.

> ... pay X% for nothing.

It is not for nothing, it is for increasing reliability by detecting 
(and pinpointing!) corruption as early as possible.

-- 
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201701211513
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8





Re: [HACKERS] Checksums by default?

From
Michael Banck
Date:
On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
> On Sat, Jan 21, 2017 at 6:41 PM, Andreas Karlsson <andreas@proxel.se> wrote:
> > It might be worth looking into using the CRC CPU instruction to reduce this
> > overhead, like we do for the WAL checksums. Since that is a different
> > algorithm it would be a compatibility break and we would need to support the
> > old algorithm for upgraded clusters..
> 
> We looked at that when picking the algorithm. At that point it seemed
> that CRC CPU instructions were not universal enough to rely on them.
> The algorithm we settled on was designed to be fast on SIMD hardware.
> Unfortunately, on x86-64 that requires SSE4.1 integer instructions, so
> with default compiles there is a lot of performance left on the table.
> Low-hanging fruit would be to do CPU detection as in the CRC case and
> enable an SSE4.1-optimized variant when those instructions are
> available. IIRC it was actually a lot faster than the naive hardware
> CRC that is used for WAL and about on par with interleaved CRC.

I am afraid that won't fly with most end-user packages, because
distributions can't just build packages on their newest machine and then
have users get SIGILL or whatever because their 2014 server doesn't have
that instruction -- or would they still work?

So you would have to do runtime detection of the CPU features, and use
the appropriate code if they are available.  Which also makes regression
testing harder, as not all codepaths would get exercised all the time.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer



Re: [HACKERS] Checksums by default?

From
"Greg Sabino Mullane"
Date:


Tom Lane points out:
> Yeah, and there's a bunch of usability tooling that we don't have,
> centered around "what do you do after you get a checksum error?".

I've asked myself this as well, and came up with a proof-of-concept
repair tool called pg_healer:

http://blog.endpoint.com/2016/09/pghealer-repairing-postgres-problems.html

It's very rough, but my vision is that someday Postgres will 
have a background process akin to autovacuum that constantly 
sniffs out corruption problems and (optionally) repairs them.
The ability to self-repair is very limited unless checksums are 
enabled.

I agree that there is work needed and problems to be solved
with our checksum implementation (e.g. what if a cosmic ray
hits the checksum itself!?), but I would love to see what we do have
enabled by default, so we dramatically increase the pool of people
with checksums enabled.

-- 
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201701211522
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8





Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Sat, Jan 21, 2017 at 10:16 PM, Michael Banck
<michael.banck@credativ.de> wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> On Sat, Jan 21, 2017 at 6:41 PM, Andreas Karlsson <andreas@proxel.se> wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support the
>> > old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we settled on was designed to be fast on SIMD hardware.
>> Unfortunately, on x86-64 that requires SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table.
>> Low-hanging fruit would be to do CPU detection as in the CRC case and
>> enable an SSE4.1-optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> I am afraid that won't fly with most end-user packages, because
> distributions can't just build packages on their newest machine and then
> have users get SIGILL or whatever because their 2014 server doesn't have
> that instruction -- or would they still work?
>
> So you would have to do runtime detection of the CPU features, and use
> the appropriate code if they are available.  Which also makes regression
> testing harder, as not all codepaths would get exercised all the time.

Runtime detection is exactly what I had in mind. The code path would
also be the same, as at least the two most important compilers need
only a compilation-flag change. And the required instruction was
introduced in 2007, so I think anybody who is concerned about
performance is covered.
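
For the curious, whether a given host has the instructions is easy to
check on Linux:

grep -c sse4_1 /proc/cpuinfo    # one match per core; non-zero means SSE4.1 is available

The in-server dispatch would then be a one-time check at startup, just
like the CRC case.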

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Thomas Munro
Date:
On Sun, Jan 22, 2017 at 7:37 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Exactly, and that awareness will allow a user to prevent further data
> loss or corruption.  Slow corruption over time is a very much known and
> accepted real-world case that people do experience, as well as bit
> flipping enough for someone to write a not-that-old blog post about
> them:
>
> https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1

I have no doubt that low frequency cosmic ray bit flipping in main
memory is a real phenomenon, having worked at a company that runs
enough computers to see ECC messages in kernel logs on a regular
basis.  But our checksums can't actually help with that, can they?  We
verify checksums on the way into shared buffers, and compute new
checksums on the way back to disk, so any bit-flipping that happens in
between those two times -- while your data is a sitting duck in shared
buffers -- would not be detected by this scheme.  That's ECC's job.

So the risk being defended against is corruption while in the disk
subsystem, whatever that might consist of (and certainly that includes
more buffers in strange places that themselves are susceptible to
memory faults etc, and hopefully they have their own error detection
and correction).  Certainly the ZFS community thinks that pile of
turtles can't be trusted and that extra checks are worthwhile, and you
can find anecdotal reports and studies about filesystem corruption
being detected, for example in the links from
https://en.wikipedia.org/wiki/ZFS#Data_integrity .

So +1 for enabling it by default.  I always turn that on.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Thomas,

* Thomas Munro (thomas.munro@enterprisedb.com) wrote:
> On Sun, Jan 22, 2017 at 7:37 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > Exactly, and that awareness will allow a user to prevent further data
> > loss or corruption.  Slow corruption over time is a very much known and
> > accepted real-world case that people do experience, as well as bit
> > flipping enough for someone to write a not-that-old blog post about
> > them:
> >
> > https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1
>
> I have no doubt that low frequency cosmic ray bit flipping in main
> memory is a real phenomenon, having worked at a company that runs
> enough computers to see ECC messages in kernel logs on a regular
> basis.  But our checksums can't actually help with that, can they?  We
> verify checksums on the way into shared buffers, and compute new
> checksums on the way back to disk, so any bit-flipping that happens in
> between those two times -- while your data is a sitting duck in shared
> buffers -- would not be detected by this scheme.  That's ECC's job.

Ideally, everyone's gonna run with ECC and have that handle it.  That
said, there's still the possibility that the bit is flipped after we've
calculated the checksum but before it's hit disk, or before it's
actually been written out to the storage system underneath.  You're
correct that if the bit is flipped before we go to write the buffer out
that we won't detect that case, but there's not much help for that
without compromising performance more than even I'd be ok with.

> So the risk being defended against is corruption while in the disk
> subsystem, whatever that might consist of (and certainly that includes
> more buffers in strange places that themselves are susceptible to
> memory faults etc, and hopefully they have their own error detection
> and correction).  Certainly the ZFS community thinks that pile of
> turtles can't be trusted and that extra checks are worthwhile, and you
> can find anecdotal reports and studies about filesystem corruption
> being detected, for example in the links from
> https://en.wikipedia.org/wiki/ZFS#Data_integrity .

Agreed.  wrt your point above, if you consider "everything that happens
after we have passed over a given bit to incorporate its value into our
CRC" to be "disk subsystem" then I think we're in agreement on this
point and that there's a bunch of stuff that happens then which could be
caught by checking our CRC.

I've seen some other funny things out there in the wild too though, like
a page suddenly being half-zero'd because the virtualization system ran
out of memory and barfed.  I realize that our CRC might not catch such a
case if it's in our shared buffers before we write the page out, but if
it happens in the kernel's write buffer after we pushed it from shared
buffers then our CRC would detect it.

Which actually brings up another point about whether CRCs save us from
data loss: they certainly do if you catch the corruption before you
have expired the WAL and the WAL data is clean.

> So +1 for enabling it by default.  I always turn that on.

Ditto.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Not at all; I just think that it's not clear that they are a net win
> for the average user, and so I'm unconvinced that turning them on by
> default is a good idea.  I could be convinced otherwise by suitable
> evidence.  What I'm objecting to is turning them on without making
> any effort to collect such evidence.

As it happens, rather unexpectedly, we had evidence of a bit-flip
happening on a 9.1.24 install show up on IRC today:

https://paste.fedoraproject.org/533186/85041907/

What that shows is the output from:

select * from heap_page_items(get_raw_page('theirtable', 4585));

With a row whose t_ctid is (134222313,18).  Looking at the base-2 format
of 4585 and 134222313:

0000 0000 0000 0000 0001 0001 1110 1001
0000 1000 0000 0000 0001 0001 1110 1001
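
XORing the two values confirms it is a single flipped bit, bit 27:

psql -c "SELECT 134222313 # 4585, 134222313 = (4585 | (1 << 27));"
# returns 134217728 (= 2^27) and true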

There appears to be other issues with the page also but this was
discovered through a pg_dump where the user was trying to get data
out to upgrade to something more recent.  Not clear if the errors on the
page all happened at once or if it was over time, of course, but it's at
least possible that this particular area of storage has been degrading
over time and that identifying an error when it was just the bit-flip in
the t_ctid (thanks to a checksum) might have allowed the user to pull
out the data.

During the discussion on IRC, someone else mentioned a similar problem
which was due to not having ECC memory in their server.  As discussed,
that might mean that we wouldn't have caught the corruption since we
only calculate the checksum on the way out of shared_buffers, but it's
also entirely possible that we would have because it could have happened
in kernel space too.

We're still working with the user to see if we can get their data out,
but that looks like pretty good evidence that maybe we should care about
enabling checksums to catch corruption before it causes undue pain for
our users.

The raw page is here: https://paste.fedoraproject.org/533195/48504224/
if anyone is curious to look at it further (we're looking through it
too).

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/21/17 10:02 AM, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Is it time to enable checksums by default, and give initdb a switch to turn
>> it off instead?
> Have we seen *even one* report of checksums catching problems in a useful
> way?

I've experienced multiple corruption events that were ultimately tracked 
down to storage problems. These first manifested as corruption to PG 
page structures, and I have no way to know how much user data might have 
been corrupted.

Obviously I can't prove that checksums would have caught all of those 
errors, but there's one massive benefit they would have given us: we'd 
know that Postgres was not the source of the corruption. That would have 
eliminated a lot of guesswork.

So that we can stop guessing about performance, I did a simple benchmark 
on my laptop. Stock config except for synchronous_commit=off and 
checkpoint_timeout=1min, with the idea being that we want to test 
flushing buffers. Both databases initialized with scale=50, or 800MB. 
shared_buffers was 128MB.

After a couple runs in each database to create dead tuples, runs were 
performed with pgbench -rT 300 & sleep 2 && PGPORT=5444 pgbench -rT 300

No checksums    Checksums
818 tps         758 tps
821 tps         877 tps
879 tps         799 tps
739 tps         808 tps
867 tps         845 tps
854 tps         831 tps

Looking at per-statement latency, the variation is always in the update 
to branches. I'll try to get some sequential runs tonight.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 01/21/2017 05:35 PM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>> Have we seen *even one* report of checksums catching problems in
>>> auseful way?
>
>> This isn't the right question.
>
> I disagree. If they aren't doing something useful for people who
> have turned them on, what's the reason to think they'd do something
> useful for the rest?
>

I believe Stephen is right. The fact that you don't see something, e.g. 
reports about checksums catching something in production deployments, 
proves nothing because of the "survivorship bias" discovered by Abraham 
Wald during WWII [1]. Not seeing bombers with bullet holes in the 
engines does not mean you don't need to armor the engines. Quite the 
opposite.

[1] 
https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d#.j9d9c35mb

Applied to checksums, we're quite unlikely to see reports about data 
corruption caught by checksums because "ERROR: invalid page in block X" 
is such a clear sign of data corruption that people don't even ask us 
about that. Combine that with the fact that most people are running with 
defaults (i.e. no checksums) and that data corruption is a rare event by 
nature, and we're bound to have no such reports.

What we do have, however, are reports about strange errors from 
instances without checksums enabled, errors that were either determined 
to be data corruption or that disappeared after a dump/restore or 
reindexing. It's hard to say for sure whether those were cases of data 
corruption (where checksums might have helped) or some other bug 
(resulting in a corrupted page with the checksum computed on the 
already-corrupted page).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 01/21/2017 05:51 PM, Stephen Frost wrote:
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>> On 21/01/17 17:31, Stephen Frost wrote:
>>> This is just changing the *default*, not requiring checksums to always
>>> be enabled.  We do not hold the same standards for our defaults as we do
>>> for always-enabled code, for clear reasons- not every situation is the
>>> same and that's why we have defaults that people can change.
>>
>> I can buy that. If it's possible to turn checksums off without
>> recreating data directory then I think it would be okay to have default on.
>
> I'm glad to hear that.
>
>>>> The change of wal_level was supported by benchmark, I think it's
>>>> reasonable to ask for this to be as well.
>>>
>>> No, it wasn't, it was that people felt the cases where changing
>>> wal_level would seriously hurt performance didn't out-weigh the value of
>>> making the change to the default.
>>
>> Really?
>
> Yes.
>
>> https://www.postgresql.org/message-id/d34ce5b5-131f-66ce-f7c5-eb406dbe026f@2ndquadrant.com
>
> From the above link:
>
>> So while it'd be trivial to construct workloads demonstrating the
>> optimizations in wal_level=minimal (e.g. initial loads doing
>> CREATE TABLE + COPY + CREATE INDEX in a single transaction), but
>> that would be mostly irrelevant I guess.
>
>> Instead, I've decided to run regular pgbench TPC-B-like workload on
>> a bunch of different scales, and measure throughput + some xlog
>> stats with each of the three wal_level options.
>
> In other words, there was no performance testing of the cases where
> wal_level=minimal (the old default) optimizations would have been
> compared against wal_level > minimal.
>
> I'm quite sure that the performance numbers for the CREATE TABLE +
> COPY case with wal_level=minimal would have been *far* better than
> for wal_level > minimal.
>
> That case was entirely punted on as "mostly irrelevant" even though
> there are known production environments where those optimizations
> make a huge difference. Those are OLAP cases though, and not nearly
> enough folks around here seem to care one bit about them, which I
> continue to be disappointed by.
>

You make it look as if we swept that case under the carpet, despite 
there being quite a bit of relevant discussion in that thread.

We might argue about how many deployments benefit from the 
wal_level=minimal optimization (I'm sure some of our customers benefit 
from it too), and whether that makes it a 'common workload' or not.

It's trivial to construct a workload demonstrating a pretty arbitrary 
performance advantage for the wal_level=minimal case. Hence there was no 
point in wasting time demonstrating it, which made the case rather 
irrelevant for the benchmarking.

Moreover, there are quite a few differences between enabling checksums 
by default, and switching to wal_level=minimal:

- There are reasonable workarounds that give you wal_level=minimal back 
(UNLOGGED tables), and those work even when you actually need a standby, 
which is pretty common these days. With checksums there's nothing like that.

- You can switch between wal_level values by merely restarting the 
cluster, while checksums may only be enabled/disabled by initdb.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 01/21/2017 05:53 PM, Stephen Frost wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>>> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
>>>> The change of wal_level was supported by benchmark, I think it's
>>>> reasonable to ask for this to be as well.
>>
>>> No, it wasn't, it was that people felt the cases where changing
>>> wal_level would seriously hurt performance didn't out-weigh the value of
>>> making the change to the default.
>>
>> It was "supported" in the sense that somebody took the trouble to
>> measure the impact, so that we had some facts on which to base the
>> value judgment that the cost was acceptable. In the case of
>> checksums, you seem to be in a hurry to arrive at a conclusion
>> without any supporting evidence.
>

Exactly.

> No, no one measured the impact in the cases where wal_level=minimal
> makes a big difference, that I saw, at least.
>

We already knew we could construct data-loading workloads relying on the 
wal_level=minimal optimization and demonstrating pretty arbitrary 
benefits, so there was no point in benchmarking them. That does not mean 
those cases were not considered, though.
>
> Further info with links to what was done are in my reply to Petr.
>
> As for checksums, I do see value in them and I'm pretty sure that
> the author of that particular feature did as well, or we wouldn't
> even have it as an option. You seem to be of the opinion that we
> might as well just rip all of that code and work out as being
> useless.
>

I do see value in them too, and if turning them off again were as 
simple as reverting back to wal_level=minimal, I would be strongly in 
favor of enabling them by default (after a bit of benchmarking, 
similar to what we did in the wal_level=minimal case).

The fact that disabling them requires initdb makes the reasoning much 
trickier, though.

That being said, I'm ready to do some benchmarking on this, so that we 
have at least some numbers to argue about. Can we agree on a set of 
workloads that we want to benchmark in the first round?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 01/21/2017 04:18 PM, Petr Jelinek wrote:
> On 21/01/17 11:39, Magnus Hagander wrote:
>> Is it time to enable checksums by default, and give initdb a switch to
>> turn it off instead?
>
> I'd like to see benchmark first, both in terms of CPU and in terms of
> produced WAL (=network traffic) given that it turns on logging of hint bits.
>

... and those hint bits may easily trigger full-page writes, resulting 
in significant write amplification.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Magnus Hagander
Date:


On Sat, Jan 21, 2017 at 8:16 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
On Sat, Jan 21, 2017 at 7:39 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> So in summary "postgresql.conf options are easy to change" while "initdb
> options are hard to change", I can see this argument used both for
> enabling or disabling checksums by default. As I said I would be less
> worried if it was easy to turn off, but we are not there afaik. And even
> then I'd still want benchmark first.

Adding support for disabling checksums is almost trivial, as it only
requires flipping a value in the control file. And I have sitting
around somewhere a similarly simple tool for turning on checksums while
the database is offline. FWIW, based on the customers and fellow
conference-goers I have talked to, most would gladly take the
performance hit, but not the downtime, to turn checksums on for an
existing database.

This is exactly the scenario I've been exposed to over and over again. If it can be turned on/off online, then the default matters a lot less. But it has to be online.

Yes, you can set up a replica (which today requires third party products like slony, bucardo or pglogical -- at least we'll hopefully have pglogical fully in 10, but it's still a very expensive way to fix the problem).

If we can make it cheap and easy to turn them off, that makes a change of the default a lot cheaper. Did you have a tool for that sitting around as well? 

--

Re: [HACKERS] Checksums by default?

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Magnus Hagander
> Is it time to enable checksums by default, and give initdb a switch to turn
> it off instead?
> 
> I keep running into situations where people haven't enabled it, because
> (a) they didn't know about it, or (b) their packaging system ran initdb
> for them so they didn't even know they could. And of course they usually
> figure this out once the db has enough data and traffic that the only way
> to fix it is to set up something like slony/bucardo/pglogical and a whole
> new server to deal with it.. (Which is something that would also be good
> to fix -- but having the default changed would be useful as well)

+10
I was wondering why the community had decided to turn it off by default.  IIRC, the reason was that the performance
overhead was 20-30% when the entire data directory was placed on tmpfs, but that's not as important as data protection
by default.

Regards
Takayuki Tsunakawa



Re: [HACKERS] Checksums by default?

From
Vladimir Borodin
Date:
> On 21 Jan 2017, at 18:18, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>
> On 21/01/17 11:39, Magnus Hagander wrote:
>> Is it time to enable checksums by default, and give initdb a switch to
>> turn it off instead?

+1

>
> I'd like to see benchmark first, both in terms of CPU and in terms of
> produced WAL (=network traffic) given that it turns on logging of hint bits.

Here are the results of my testing for 9.4 in December 2014. The benchmark was done on a production-like use case where
all the data fits in memory (~20 GB) but doesn’t fit into shared_buffers (which was intentionally small -- 128 MB on the
stress stand), so that shared_blk_read / shared_blk_hit = 1/4. The workload was typical OLTP but with mostly write
queries (80% writes, 20% reads).

Here are the numbers of WAL segments written during the test:
Defaults         263
wal_log_hints    410
checksums        367

I couldn’t work out why WAL write amplification is even worse for wal_log_hints than for checksums, but I checked
several times and it always reproduced.
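
If anyone wants to reproduce the WAL measurement, the idea is simply to diff WAL positions before and after each run
(pre-10 function names; the starting LSN below is a placeholder):

psql -c "SELECT pg_current_xlog_location();"                                       # note the position before the run
psql -c "SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/3000000');"   # bytes of WAL since the noted position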

As for CPU, I couldn’t see the overhead [1], and perf top showed me less than 2% spent calculating CRCs.

For all new DBs we now enable checksums at initdb, and several dozen of our shards use checksums now. I don’t see any
performance difference for them compared with non-checksummed clusters. And we have already had one case where we caught
data corruption with checksums.

[1] https://yadi.sk/i/VAiWjv6t3AQCs2?lang=en

--
May the force be with you…
https://simply.name




Re: [HACKERS] Checksums by default?

From
Amit Kapila
Date:
On Sun, Jan 22, 2017 at 3:43 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> That being said, I'm ready to do some benchmarking on this, so that we have
> at least some numbers to argue about. Can we agree on a set of workloads
> that we want to benchmark in the first round?
>

I think if we can get data for a pgbench read-write workload where the
data doesn't fit in shared buffers but fits in RAM, that can give us
some indication.  We can try varying the ratio of shared buffers to
data size.  This should exercise the checksum code both when buffers
are evicted and at the next read.  I think it also makes sense to check
the WAL data size for each of those runs.
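
Something along these lines, with sizes picked so that the data is
larger than shared buffers but smaller than RAM (the numbers are just
placeholders):

pgbench -i -s 1000               # roughly 15 GB of data
pgbench -c 16 -j 4 -T 600        # read-write TPC-B-like run
# repeat with shared_buffers = 1GB, 4GB, 8GB, ... and compare TPS and WAL volume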


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 01/23/2017 08:30 AM, Amit Kapila wrote:
> On Sun, Jan 22, 2017 at 3:43 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>> That being said, I'm ready to do some benchmarking on this, so that we have
>> at least some numbers to argue about. Can we agree on a set of workloads
>> that we want to benchmark in the first round?
>>
>
> I think if we can get data for pgbench read-write workload when data
> doesn't fit in shared buffers but fit in RAM, that can give us some
> indication.  We can try by varying the ratio of shared buffers w.r.t
> data.  This should exercise the checksum code both when buffers are
> evicted and at next read.  I think it also makes sense to check the
> WAL data size for each of those runs.
>

Yes, I'm thinking that's pretty much the worst case for OLTP-like 
workload, because it has to evict buffers from shared buffers, 
generating a continuous stream of writes. Doing that on good storage 
(e.g. PCI-e SSD or possibly tmpfs) will further limit the storage 
overhead, making the time spent computing checksums much more 
significant. Makes sense?

What about other types of workload? I think we should not look just at 
write-heavy workloads - I wonder what is the overhead of verifying the 
checksums in read-only workloads (again, with data that fits into RAM).

What about large data loads simulating OLAP, and exports (e.g. pg_dump)?

That leaves us with 4 workload types, I guess:

1) read-write OLTP (shared buffers < data < RAM)
2) read-only OLTP (shared buffers < data < RAM)
3) large data loads (COPY)
4) large data exports (pg_dump)

Anything else?
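
For concreteness, I'm thinking of roughly this for each of them (names
and durations are placeholders, to be tuned):

pgbench -c 16 -T 600                        # 1) read-write OLTP
pgbench -S -c 16 -T 600                     # 2) read-only OLTP
psql -c "COPY big FROM '/tmp/big.copy';"    # 3) large data load
pg_dump bench > /tmp/bench.sql              # 4) large data export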

The other question is of course hardware - IIRC there are differences 
between CPUs. I do have a new e5-2620v4, but perhaps it'd be good to 
also do some testing on a Power machine, or an older Intel CPU.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Amit Kapila
Date:
On Mon, Jan 23, 2017 at 1:18 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/23/2017 08:30 AM, Amit Kapila wrote:
>>
>>
>> I think if we can get data for pgbench read-write workload when data
>> doesn't fit in shared buffers but fit in RAM, that can give us some
>> indication.  We can try by varying the ratio of shared buffers w.r.t
>> data.  This should exercise the checksum code both when buffers are
>> evicted and at next read.  I think it also makes sense to check the
>> WAL data size for each of those runs.
>>
>
> Yes, I'm thinking that's pretty much the worst case for OLTP-like workload,
> because it has to evict buffers from shared buffers, generating a continuous
> stream of writes. Doing that on good storage (e.g. PCI-e SSD or possibly
> tmpfs) will further limit the storage overhead, making the time spent
> computing checksums much more significant. Makes sense?
>

Yeah, I think that can be helpful with respect to WAL, but for data,
if we are considering the case where everything fits in RAM, then
faster storage might or might not help.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 01/23/2017 09:57 AM, Amit Kapila wrote:
> On Mon, Jan 23, 2017 at 1:18 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 01/23/2017 08:30 AM, Amit Kapila wrote:
>>>
>>>
>>> I think if we can get data for pgbench read-write workload when data
>>> doesn't fit in shared buffers but fit in RAM, that can give us some
>>> indication.  We can try by varying the ratio of shared buffers w.r.t
>>> data.  This should exercise the checksum code both when buffers are
>>> evicted and at next read.  I think it also makes sense to check the
>>> WAL data size for each of those runs.
>>>
>>
>> Yes, I'm thinking that's pretty much the worst case for OLTP-like workload,
>> because it has to evict buffers from shared buffers, generating a continuous
>> stream of writes. Doing that on good storage (e.g. PCI-e SSD or possibly
>> tmpfs) will further limit the storage overhead, making the time spent
>> computing checksums much more significant. Makes sense?
>>
>
> Yeah, I think that can be helpful with respect to WAL, but for data,
> if we are considering the case where everything fits in RAM, then
> faster storage might or might not help.
>

I'm not sure I understand. Why wouldn't faster storage help? It's only a 
matter of generating enough dirty buffers (that get evicted from shared 
buffers) to saturate the storage. With some storage you'll hit that at 
100 MB/s, with PCI-e it might be more like 1GB/s.

Of course, if the main bottleneck is somewhere else (e.g. hitting 100% 
CPU utilization before putting any pressure on storage), that's not 
going to make much difference.

Or perhaps I missed something important?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Amit Kapila
Date:
On Mon, Jan 23, 2017 at 3:56 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/23/2017 09:57 AM, Amit Kapila wrote:
>>
>> On Mon, Jan 23, 2017 at 1:18 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> On 01/23/2017 08:30 AM, Amit Kapila wrote:
>>>>
>>>>
>>>>
>>>> I think if we can get data for pgbench read-write workload when data
>>>> doesn't fit in shared buffers but fit in RAM, that can give us some
>>>> indication.  We can try by varying the ratio of shared buffers w.r.t
>>>> data.  This should exercise the checksum code both when buffers are
>>>> evicted and at next read.  I think it also makes sense to check the
>>>> WAL data size for each of those runs.
>>>>
>>>
>>> Yes, I'm thinking that's pretty much the worst case for OLTP-like
>>> workload,
>>> because it has to evict buffers from shared buffers, generating a
>>> continuous
>>> stream of writes. Doing that on good storage (e.g. PCI-e SSD or possibly
>>> tmpfs) will further limit the storage overhead, making the time spent
>>> computing checksums much more significant. Makes sense?
>>>
>>
>> Yeah, I think that can be helpful with respect to WAL, but for data,
>> if we are considering the case where everything fits in RAM, then
>> faster storage might or might not help.
>>
>
> I'm not sure I understand. Why wouldn't faster storage help? It's only a
> matter of generating enough dirty buffers (that get evicted from shared
> buffers) to saturate the storage.
>

When a page gets evicted from shared buffers, it is just pushed to
the kernel; the real write to disk won't happen until the kernel feels
like it, or until a checkpoint forces the data out to storage. So if
we have a fast storage subsystem it can speed up the writes from the
kernel to disk, but I am not sure how much that helps in improving TPS.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 01/23/2017 01:40 PM, Amit Kapila wrote:
> On Mon, Jan 23, 2017 at 3:56 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 01/23/2017 09:57 AM, Amit Kapila wrote:
>>>
>>> On Mon, Jan 23, 2017 at 1:18 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>> On 01/23/2017 08:30 AM, Amit Kapila wrote:
>>>>>
>>>>>
>>>>>
>>>>> I think if we can get data for pgbench read-write workload when data
>>>>> doesn't fit in shared buffers but fit in RAM, that can give us some
>>>>> indication.  We can try by varying the ratio of shared buffers w.r.t
>>>>> data.  This should exercise the checksum code both when buffers are
>>>>> evicted and at next read.  I think it also makes sense to check the
>>>>> WAL data size for each of those runs.
>>>>>
>>>>
>>>> Yes, I'm thinking that's pretty much the worst case for OLTP-like
>>>> workload,
>>>> because it has to evict buffers from shared buffers, generating a
>>>> continuous
>>>> stream of writes. Doing that on good storage (e.g. PCI-e SSD or possibly
>>>> tmpfs) will further limit the storage overhead, making the time spent
>>>> computing checksums much more significant. Makes sense?
>>>>
>>>
>>> Yeah, I think that can be helpful with respect to WAL, but for data,
>>> if we are considering the case where everything fits in RAM, then
>>> faster storage might or might not help.
>>>
>>
>> I'm not sure I understand. Why wouldn't faster storage help? It's only a
>> matter of generating enough dirty buffers (that get evicted from shared
>> buffers) to saturate the storage.
>>
>
> When the page gets evicted from shared buffer, it is just pushed to
> kernel; the real write to disk won't happen until the kernel feels
> like it.They are written to storage later when a checkpoint occurs.
> So, now if we have fast storage subsystem then it can improve the
> writes from kernel to disk, but not sure how much that can help in
> improving TPS.
>

I don't think that's quite true. If the pages are evicted by bgwriter, 
since 9.6 there's a flush every 512kB. This will also flush data written 
by backends, of course. But even without that flushing, the OS does not 
wait with the flush until the very last moment -- that'd be a huge I/O 
spike. Instead, the OS will write the dirty data to disk after 30 
seconds, or after accumulating some predefined amount of dirty data.

So the system will generally get into a "stable state" where it writes 
about the same amount of data to disk on average.
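
For anyone who wants to watch this on their own system, the relevant
knobs (the GUCs are real; the sysctl names are Linux's):

psql -c "SHOW bgwriter_flush_after;"    # 512kB by default since 9.6
psql -c "SHOW backend_flush_after;"     # 0 (disabled) by default
sysctl vm.dirty_expire_centisecs        # kernel writeback age limit, 3000 (30s) by default
sysctl vm.dirty_background_ratio        # kernel writeback size threshold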

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Amit Kapila
Date:
On Mon, Jan 23, 2017 at 6:57 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/23/2017 01:40 PM, Amit Kapila wrote:
>>
>> On Mon, Jan 23, 2017 at 3:56 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> On 01/23/2017 09:57 AM, Amit Kapila wrote:
>>>>
>>>>
>>>> On Mon, Jan 23, 2017 at 1:18 PM, Tomas Vondra
>>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>>
>>>>>
>>>>> On 01/23/2017 08:30 AM, Amit Kapila wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think if we can get data for pgbench read-write workload when data
>>>>>> doesn't fit in shared buffers but fit in RAM, that can give us some
>>>>>> indication.  We can try by varying the ratio of shared buffers w.r.t
>>>>>> data.  This should exercise the checksum code both when buffers are
>>>>>> evicted and at next read.  I think it also makes sense to check the
>>>>>> WAL data size for each of those runs.
>>>>>>
>>>>>
>>>>> Yes, I'm thinking that's pretty much the worst case for OLTP-like
>>>>> workload,
>>>>> because it has to evict buffers from shared buffers, generating a
>>>>> continuous
>>>>> stream of writes. Doing that on good storage (e.g. PCI-e SSD or
>>>>> possibly
>>>>> tmpfs) will further limit the storage overhead, making the time spent
>>>>> computing checksums much more significant. Makes sense?
>>>>>
>>>>
>>>> Yeah, I think that can be helpful with respect to WAL, but for data,
>>>> if we are considering the case where everything fits in RAM, then
>>>> faster storage might or might not help.
>>>>
>>>
>>> I'm not sure I understand. Why wouldn't faster storage help? It's only a
>>> matter of generating enough dirty buffers (that get evicted from shared
>>> buffers) to saturate the storage.
>>>
>>
>> When a page gets evicted from shared buffers, it is just pushed to
>> the kernel; the real write to disk won't happen until the kernel feels
>> like it, or until a checkpoint forces the data out to storage. So if
>> we have a fast storage subsystem it can speed up the writes from the
>> kernel to disk, but I am not sure how much that helps in improving TPS.
>>
>
> I don't think that's quite true. If the pages are evicted by bgwriter, since
> 9.6 there's a flush every 512kB.
>

Right, but backend_flush_after is zero by default.

> This will also flush data written by
> backends, of course. But even without that flushing, the OS does not wait
> with the flush until the very last moment -- that'd be a huge I/O spike.
> Instead, the OS will write the dirty data to disk after 30 seconds, or after
> accumulating some predefined amount of dirty data.
>

This is the reason I said it might or might not help.  I think there
is no point in having too much discussion on this point: if you have
access to a fast storage system, then go ahead and perform the tests on
it; if not, we can try without it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Sat, Jan 21, 2017 at 9:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Not at all; I just think that it's not clear that they are a net win
> for the average user, and so I'm unconvinced that turning them on by
> default is a good idea.  I could be convinced otherwise by suitable
> evidence.  What I'm objecting to is turning them on without making
> any effort to collect such evidence.

+1

One insight Jim Gray has in the classic paper "Why Do Computers Stop
and What Can Be Done About It?" [1] is that fault-tolerant hardware is
table stakes, and so most failures are related to operator error, and
to a lesser extent software bugs. The paper is about 30 years old.

I don't recall ever seeing a checksum failure on a Heroku Postgres
database, even though they were enabled as soon as the feature became
available. I have seen a few corruption problems brought to light by
amcheck, though, all of which were due to bugs in software.
Apparently, before I joined Heroku there were real reliability
problems with the storage subsystem that Heroku Postgres runs on (it's
a pluggable storage service from a popular cloud provider -- the
"pluggable" functionality would have made it fairly novel at the
time). These problems were something that the Heroku Postgres team
dealt with about 6 years ago. However, anecdotal evidence suggests
that the reliability of the same storage system *vastly* improved
roughly a year or two later. We still occasionally lose drives, but
drives seem to fail fast in a fashion that lets us recover without
data loss easily. In practice, Postgres checksums do *not* seem to
catch problems. That's been my experience, at least.

Obviously every additional check helps, and it may be something we can
do without any appreciable downside. I'd like to see a benchmark.

[1] http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf
-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Merlin Moncure
Date:
On Sat, Jan 21, 2017 at 12:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Sure, it might be easy, but we don't have it.  Personally I think
>> checksums just aren't even ready for prime time. If we had:
>> - ability to switch on/off at runtime (early patches for that have IIRC
>>   been posted)
>> - *builtin* tooling to check checksums for everything
>> - *builtin* tooling to compute checksums after changing setting
>> - configurable background sweeps for checksums
>
> Yeah, and there's a bunch of usability tooling that we don't have,
> centered around "what do you do after you get a checksum error?".
> AFAIK there's no way to check or clear such an error; but without
> such tools, I'm afraid that checksums are as much of a foot-gun
> as a benefit.

I see your point here, but they sure saved my ass with that pl/sh
issue.  So I'm inclined to lightly disagree; there are good arguments
either way.

merlin



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/23/17 1:30 AM, Amit Kapila wrote:
> On Sun, Jan 22, 2017 at 3:43 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>> That being said, I'm ready to do some benchmarking on this, so that we have
>> at least some numbers to argue about. Can we agree on a set of workloads
>> that we want to benchmark in the first round?
>>
>
> I think if we can get data for pgbench read-write workload when data
> doesn't fit in shared buffers but fit in RAM, that can give us some
> indication.  We can try by varying the ratio of shared buffers w.r.t
> data.  This should exercise the checksum code both when buffers are
> evicted and at next read.  I think it also makes sense to check the
> WAL data size for each of those runs.

I tried testing this (and thought I sent an email about it, but don't 
see it now :/). Unfortunately, on my laptop I wasn't getting terribly 
consistent runs; I was seeing +/- ~8% TPS. Sometimes checksums appeared 
to add ~10% overhead, but it was hard to tell.

If someone has a more stable (as in, dedicated) setup, testing would be 
useful.

BTW, I ran the test with small (default 128MB) shared_buffers, scale 50 
(800MB database), sync_commit = off, checkpoint_timeout = 1min, to try 
and significantly increase the rate of buffers being written out.
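
Roughly, the setup was (a sketch; two clusters side by side, identical
apart from -k):

initdb -D nochk                  # checksums off (the current default)
initdb -k -D chk                 # -k / --data-checksums turns them on
# in both postgresql.conf: synchronous_commit = off, checkpoint_timeout = 1min
pgbench -i -s 50                 # ~800 MB database
pgbench -rT 300                  # -r reports per-statement latencies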
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/23/17 6:14 PM, Peter Geoghegan wrote:
> In practice, Postgres checksums do *not* seem to
> catch problems. That's been my experience, at least.

For someone running on a bunch of AWS hardware that doesn't really 
surprise me. Presumably, anyone operating at that scale would be quickly 
overwhelmed if odd hardware errors were even remotely common. (Note that 
odd errors aren't the same as an outright failure.)

Where I'd expect this to help is with anyone running a moderate-sized 
data center that doesn't have the kind of monitoring resources a cloud 
provider does.

As for collecting data, I don't really know what more data we can get. 
We get data corruption reports on a fairly regular basis. I think it's a 
very safe bet that CRCs would identify somewhere between 20% and 80% of 
those cases. Maybe that number could be better refined, but that's still 
going to be guesswork.

As others have mentioned, right now practically no one enables this, so 
we've got zero data on how useful it might actually be. If the patch to 
make this a GUC goes through then at least we could tell people that 
have experienced corruption to enable this. That might provide some 
data, though the horse is already well out of the barn by then.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Jim,

* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> As others have mentioned, right now practically no one enables this,
> so we've got zero data on how useful it might actually be.

Uhm, Peter G just said that Heroku enables this on all their databases
and have yet to see a false-positive report or an issue with having it
enabled.

That, plus the reports and evidence we've seen in the past couple days,
look like a pretty ringing endorsement for having them.

I'll ping the RDS crowd and see if they'll tell me what they're doing
and what their thoughts are on it.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
>> As others have mentioned, right now practically no one enables this,
>> so we've got zero data on how useful it might actually be.

> Uhm, Peter G just said that Heroku enables this on all their databases
> and have yet to see a false-positive report or an issue with having it
> enabled.

> That, plus the reports and evidence we've seen in the past couple days,
> look like a pretty ringing endorsement for having them.

You must have read a different Peter G than I did.  What I read was

>> I don't recall ever seeing a checksum failure on a Heroku Postgres
>> database,

which did not sound like an endorsement to me.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/23/17 6:55 PM, Stephen Frost wrote:
> * Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
>> As others have mentioned, right now practically no one enables this,
>> so we've got zero data on how useful it might actually be.
> Uhm, Peter G just said that Heroku enables this on all their databases
> and have yet to see a false-positive report or an issue with having it
> enabled.
>
> That, plus the reports and evidence we've seen in the past couple days,
> look like a pretty ringing endorsement for having them.
>
> I'll ping the RDS crowd and see if they'll tell me what they're doing
> and what their thoughts are on it.

Oh, I read the thread as "there's no data to support checksums are 
useful", not "there's no data to support there's little risk of bugs or 
false-positives". I certainly agree that Heroku is a good test of both 
of those.

IIRC Grant's mentioned in one of his presentations that they enable 
checksums, but getting more explicit info would be good.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> >> As others have mentioned, right now practically no one enables this,
> >> so we've got zero data on how useful it might actually be.
>
> > Uhm, Peter G just said that Heroku enables this on all their databases
> > and have yet to see a false-positive report or an issue with having it
> > enabled.
>
> > That, plus the reports and evidence we've seen in the past couple days,
> > look like a pretty ringing endorsement for having them.
>
> You must have read a different Peter G than I did.  What I read was
>
> >> I don't recall ever seeing a checksum failure on a Heroku Postgres
> >> database,

Not sure how this part of that sentence was missed:

-----
... even though they were enabled as soon as the feature became
available.
-----

Which would seem to me to say "the code's been running for a long time
on a *lot* of systems without throwing a false positive or surfacing a
bug."

Given your up-thread concerns that enabling checksums could lead to
false positives and might surface bugs, that's pretty good indication
that those concerns are unfounded.

In addition, it shows that big hosting providers were anxious to get the
feature and enabled it immediately for their users, while we debate if
it might be useful for *our* users.  I certainly don't believe that
Heroku or Amazon have all the right answers for everything, but I do
think we should consider that they enabled checksums immediately, along
with the other consultants on this thread who have said the same.

Lastly, I've already pointed out that there were 2 cases recently
reported on IRC of corruption on reasonably modern gear, with a third
comment following that up from Merlin.  These notions that corruption
doesn't happen today, or that we would have heard about it if it had,
also look unfounded from my perspective.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Jim,

* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> On 1/23/17 6:55 PM, Stephen Frost wrote:
> >* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> >>As others have mentioned, right now practically no one enables this,
> >>so we've got zero data on how useful it might actually be.
> >Uhm, Peter G just said that Heroku enables this on all their databases
> >and have yet to see a false-positive report or an issue with having it
> >enabled.
> >
> >That, plus the reports and evidence we've seen in the past couple days,
> >look like a pretty ringing endorsement for having them.
> >
> >I'll ping the RDS crowd and see if they'll tell me what they're doing
> >and what their thoughts are on it.
>
> Oh, I read the thread as "there's no data to support that checksums
> are useful",

There's been multiple reports on this thread that corruption does
happen.  Sure, it'd be nice if we had a report of it happening with
checksums enabled and where checksums caught it, but I don't see any
basis for an argument that they wouldn't ever catch real-world
bit-flipping corruption.

> IIRC Grant's mentioned in one of his presentations that they enable
> checksums, but getting more explicit info would be good.

Frankly, my recollection is that they wouldn't use PG until it had
page-level checksums, and that they run it on all of their instances,
but I'd like to get confirmation of that, if I can, and also hear if
they've got examples of the checksums we have catching real issues.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/23/17 7:15 PM, Tom Lane wrote:
>> Uhm, Peter G just said that Heroku enables this on all their databases
>> and have yet to see a false-positive report or an issue with having it
>> enabled.
>> That, plus the reports and evidence we've seen in the past couple days,
>> look like a pretty ringing endorsement for having them.
> You must have read a different Peter G than I did.  What I read was
>
>>> I don't recall ever seeing a checksum failure on a Heroku Postgres
>>> database,
> which did not sound like an endorsement to me.

Well, it is pretty good evidence that there are no bugs and that false 
positives aren't a problem. As I mentioned earlier, my bet is that any 
significantly large cloud provider has a ton of things going on behind 
the scenes to prevent oddball (as in non-repeating) errors. When you've 
got 1M+ servers, even small-probability bugs can become really big problems.

In any case, how can we go about collecting data that checksums help? We 
certainly know people suffer data corruption. We can only guess at how 
many of those incidents would be caught by checksums. I don't see how we 
can get data on that unless we get a lot more users running checksums.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Mon, Jan 23, 2017 at 5:26 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Not sure how this part of that sentence was missed:
>
> -----
> ... even though they were enabled as soon as the feature became
> available.
> -----
>
> Which would seem to me to say "the code's been running for a long time
> on a *lot* of systems without throwing a false positive or surfacing a
> bug."

I think you've both understood what I said correctly. Note that I
remain neutral on the question of whether or not checksums should be
enabled by default.

Perhaps I've missed the point entirely, but, I have to ask: How could
there ever be false positives? With checksums, false positives are
simply not allowed. Therefore, there cannot be a false positive,
unless we define checksums as a mechanism that should only find
problems that originate somewhere at or below the filesystem. We
clearly have not done that, so ISTM that checksums could legitimately
find bugs in the checksum code. I am not being facetious.

-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Michael Paquier
Date:
On Tue, Jan 24, 2017 at 10:26 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> >> I don't recall ever seeing a checksum failure on a Heroku Postgres
>> >> database,
>
> Not sure how this part of that sentence was missed:
>
> -----
> ... even though they were enabled as soon as the feature became
> available.
> -----
>
> Which would seem to me to say "the code's been running for a long time
> on a *lot* of systems without throwing a false positive or surfacing a
> bug."

I am reading that the same way Tom is: enabling it has proved to
Heroku that it caught no problems in years, meaning that the
performance cost induced by enabling it has bought nothing in
practice, except the insurance of catching problems should they
happen.

> Given your up-thread concerns that enabling checksums could lead to
> false positives and might surface bugs, that's pretty good indication
> that those concerns are unfounded.

FWIW, I have seen failures that could have been hardware failures, or
where enabling checksums may have helped, but they were on pre-9.3
instances. Now there is a performance penalty in enabling them, and it
depends largely on the rate of page eviction from shared buffers, which
is pretty high for some load patterns I work with; still, even with
that we saw only a 1~2% throughput penalty in measurements. Perhaps not
everybody would like to pay this price. FWIW, we are thinking about
paying it, as VMs are more sensitive to vmdk-like classes of bugs.
Judging by this thread, -hackers are likely fine with the cost induced.
-- 
Michael



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Peter Geoghegan (pg@heroku.com) wrote:
> On Mon, Jan 23, 2017 at 5:26 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Not sure how this part of that sentence was missed:
> >
> > -----
> > ... even though they were enabled as soon as the feature became
> > available.
> > -----
> >
> > Which would seem to me to say "the code's been running for a long time
> > on a *lot* of systems without throwing a false positive or surfacing a
> > bug."
>
> I think you've both understood what I said correctly. Note that I
> remain neutral on the question of whether or not checksums should be
> enabled by default.
>
> Perhaps I've missed the point entirely, but, I have to ask: How could
> there ever be false positives? With checksums, false positives are
> simply not allowed. Therefore, there cannot be a false positive,
> unless we define checksums as a mechanism that should only find
> problems that originate somewhere at or below the filesystem. We
> clearly have not done that, so ISTM that checksums could legitimately
> find bugs in the checksum code. I am not being facetious.

I'm not sure I'm following your question here.  A false positive would
be a case where the checksum code throws an error on a page whose
checksum is correct, or where the checksum has failed but nothing is
actually wrong/different on the page.

As for the purpose of checksums, it's exactly to identify cases where
the page has been changed since we wrote it out, due to corruption in
the kernel, filesystem, storage system, etc.  As we only check them when
we read in a page and calculate them when we go to write the page out,
they aren't helpful for shared_buffers corruption, generally speaking.
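
For reference, that read-side boundary check is roughly the following,
condensed from ReadBuffer_common() in bufmgr.c (the zero_damaged_pages
escape hatch is elided):

    smgrread(smgr, forkNum, blockNum, (char *) bufBlock);

    /* verifies the checksum (if enabled) plus basic header sanity */
    if (!PageIsVerified((Page) bufBlock, blockNum))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("invalid page in block %u of relation %s",
                        blockNum,
                        relpath(smgr->smgr_rnode, forkNum))));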

It might be interesting to consider checking them in 'clean' pages in
shared_buffers in a background process, as that, presumably, *would*
detect shared buffers corruption.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Petr Jelinek
Date:
On 24/01/17 02:39, Michael Paquier wrote:
> On Tue, Jan 24, 2017 at 10:26 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>>>> I don't recall ever seeing a checksum failure on a Heroku Postgres
>>>>> database,
>>
>> Not sure how this part of that sentence was missed:
>>
>> -----
>> ... even though they were enabled as soon as the feature became
>> available.
>> -----
>>
>> Which would seem to me to say "the code's been running for a long time
>> on a *lot* of systems without throwing a false positive or surfacing a
>> bug."
> 
> I am reading that the same way Tom is: enabling it has proved to
> Heroku that it caught no problems in years, meaning that the
> performance cost induced by enabling it has bought nothing in
> practice, except the insurance of catching problems should they
> happen.
> 

+1

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Mon, Jan 23, 2017 at 5:47 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> Perhaps I've missed the point entirely, but, I have to ask: How could
>> there ever be false positives? With checksums, false positives are
>> simply not allowed. Therefore, there cannot be a false positive,
>> unless we define checksums as a mechanism that should only find
>> problems that originate somewhere at or below the filesystem. We
>> clearly have not done that, so ISTM that checksums could legitimately
>> find bugs in the checksum code. I am not being facetious.
>
> I'm not sure I'm following your question here.  A false positive would
> be a case where the checksum code throws an error on a page whose
> checksum is correct, or where the checksum has failed but nothing is
> actually wrong/different on the page.

I thought that checksums went in in part because we thought that there
was some chance that they'd find bugs in Postgres. I was under the
impression that that was at least a secondary goal of checksums.

> As for the purpose of checksums, it's exactly to identify cases where
> the page has been changed since we wrote it out, due to corruption in
> the kernel, filesystem, storage system, etc.  As we only check them when
> we read in a page and calculate them when we go to write the page out,
> they aren't helpful for shared_buffers corruption, generally speaking.

I'd have guessed that they might catch a bug in recovery itself, even
when the filesystem maintains the guarantees Postgres requires.

In any case, it seems exceedingly unlikely that the checksum code
itself would fail.

-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> Perhaps I've missed the point entirely, but, I have to ask: How could
> there ever be false positives?

Bugs.  For example, checksum is computed while somebody else is setting
a hint bit in the page, so that what is written out is completely valid
except that the checksum doesn't match.  (I realize that that specific
scenario should be impossible given our implementation, but I hope you
aren't going to claim that bugs in the checksum code are impossible.)

Maybe this is a terminology problem.  I'm taking "false positive" to mean
"checksum reports a failure, but in fact there is no observable data
corruption".  Depending on why the false positive occurred, that might
help alert you to underlying storage problems, but it isn't helping you
with respect to being able to access your perfectly valid data.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Mon, Jan 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe this is a terminology problem.  I'm taking "false positive" to mean
> "checksum reports a failure, but in fact there is no observable data
> corruption".  Depending on why the false positive occurred, that might
> help alert you to underlying storage problems, but it isn't helping you
> with respect to being able to access your perfectly valid data.

It was a terminology problem. Thank you for the clarification.


-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> I thought that checksums went in in part because we thought that there
> was some chance that they'd find bugs in Postgres.

Not really.  AFAICS the only point is to catch storage-system malfeasance.

It's barely possible that checksumming would help detect cases where
we'd written data meant for block A into block B, but I don't rate
that as being significantly more probable than bugs in the checksum
code itself.  Also, if that case did happen, the checksum code might
"detect" it in some sense, but it would be remarkably unhelpful at
identifying the actual cause.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/23/17 7:47 PM, Stephen Frost wrote:
> It might be interesting to consider checking them in 'clean' pages in
> shared_buffers in a background process, as that, presumably, *would*
> detect shared buffers corruption.

Hmm... that would be interesting. Assuming the necessary functions are 
exposed it presumably wouldn't be difficult to do that in an extension, 
as a bgworker.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> On 1/23/17 7:47 PM, Stephen Frost wrote:
>> It might be interesting to consider checking them in 'clean' pages in
>> shared_buffers in a background process, as that, presumably, *would*
>> detect shared buffers corruption.

> Hmm... that would be interesting. Assuming the necessary functions are 
> exposed it presumably wouldn't be difficult to do that in an extension, 
> as a bgworker.

But we don't maintain the checksum of a page while it sits in shared
buffers.  Trying to do so would break, eg, concurrent hint-bit updates.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/23/17 8:24 PM, Tom Lane wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> On 1/23/17 7:47 PM, Stephen Frost wrote:
>>> It might be interesting to consider checking them in 'clean' pages in
>>> shared_buffers in a background process, as that, presumably, *would*
>>> detect shared buffers corruption.
>
>> Hmm... that would be interesting. Assuming the necessary functions are
>> exposed it presumably wouldn't be difficult to do that in an extension,
>> as a bgworker.
>
> But we don't maintain the checksum of a page while it sits in shared
> buffers.  Trying to do so would break, eg, concurrent hint-bit updates.

Hrm, I thought the checksum would be valid if the buffer is marked clean?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> > On 1/23/17 7:47 PM, Stephen Frost wrote:
> >> It might be interesting to consider checking them in 'clean' pages in
> >> shared_buffers in a background process, as that, presumably, *would*
> >> detect shared buffers corruption.
>
> > Hmm... that would be interesting. Assuming the necessary functions are
> > exposed it presumably wouldn't be difficult to do that in an extension,
> > as a bgworker.
>
> But we don't maintain the checksum of a page while it sits in shared
> buffers.  Trying to do so would break, eg, concurrent hint-bit updates.

Hence why I said 'clean' pages..

Perhaps I'm missing something here, but with checksums enabled, a hint
bit update is going to dirty the page (and we're going to write it into
the WAL and write it out to the heap), no?

We'd have to accept that checking the checksum on a page would require a
read lock on each page as it goes through, I imagine.  Though we could
do something like this: check if the page is clean, obtain a read lock,
re-check that it's still clean, and if so calculate the checksum before
letting the lock go.  That way at least we avoid trying to lock dirty
pages.
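
In rough pseudo-C, that sweep might look like the following (a sketch
only: verify_page_checksum() is a made-up helper, buffer pinning is
glossed over, the header accessors are approximations of what bufmgr.c
uses internally, and it assumes the stored checksum is still fresh for
clean pages):

    for (int i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);

        /* cheap unlocked pre-check: skip pages that are already dirty */
        if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
            continue;

        LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);

        /* re-check under the lock: it may have been dirtied meanwhile */
        if (!(pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY))
            verify_page_checksum(BufHdrGetBlock(bufHdr),
                                 bufHdr->tag.blockNum);

        LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
    }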

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> But we don't maintain the checksum of a page while it sits in shared
>> buffers.  Trying to do so would break, eg, concurrent hint-bit updates.

> Hence why I said 'clean' pages..

When we write out a page, we copy it into private memory and compute the
checksum there, right?  We don't copy the checksum back into the page's
shared-buffer image, and if we did, that would defeat the point because we
would've had to maintain exclusive lock on the buffer or else the checksum
might be out of date.  So I don't see how this works without throwing away
a lot of carefully-designed behavior.
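
For concreteness, the write-out path does essentially this, lightly
abridged from PageSetChecksumCopy() in bufpage.c; FlushBuffer() hands
the returned private copy, not the shared-buffer image, to smgrwrite():

    char *
    PageSetChecksumCopy(Page page, BlockNumber blkno)
    {
        static char *pageCopy = NULL;

        /* no checksum needed: hand back the shared page as-is */
        if (PageIsNew(page) || !DataChecksumsEnabled())
            return (char *) page;

        if (pageCopy == NULL)
            pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);

        /* checksum is computed on a private copy; original untouched */
        memcpy(pageCopy, (char *) page, BLCKSZ);
        ((PageHeader) pageCopy)->pd_checksum =
            pg_checksum_page(pageCopy, blkno);
        return pageCopy;
    }
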
        regards, tom lane



Re: [HACKERS] Checksums by default?

From
Andres Freund
Date:
On 2017-01-23 21:40:53 -0500, Stephen Frost wrote:
> Perhaps I'm missing something here, but with checksums enabled, a hint
> bit update is going to dirty the page (and we're going to write it into
> the WAL and write it out to the heap), no?

No.  We only WAL-log hint bits the first time a page is modified after a
checkpoint.  It's quite likely that you'll set hint bits in the same
checkpoint cycle in which the row was last modified (necessitating the
hint bit change).  So we can't just pessimize this.
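
The relevant logic is in MarkBufferDirtyHint() in bufmgr.c, heavily
abridged here (locking and the actual LSN bookkeeping omitted):

    if (XLogHintBitIsNeeded() &&
        (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
    {
        /*
         * Emits a full-page image only if the page has not already
         * been WAL-logged since the last checkpoint; otherwise the
         * hint-bit change generates no WAL at all.
         */
        lsn = XLogSaveBufferForHint(buffer, buffer_std);
    }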

I'm a bit confused by the number of technically wrong arguments in
this thread.

Greetings,

Andres Freund



Re: [HACKERS] Checksums by default?

From
Merlin Moncure
Date:
On Mon, Jan 23, 2017 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> I thought that checksums went in in part because we thought that there
>> was some chance that they'd find bugs in Postgres.
>
> Not really.  AFAICS the only point is to catch storage-system malfeasance.
>
> It's barely possible that checksumming would help detect cases where
> we'd written data meant for block A into block B, but I don't rate
> that as being significantly more probable than bugs in the checksum
> code itself.  Also, if that case did happen, the checksum code might
> "detect" it in some sense, but it would be remarkably unhelpful at
> identifying the actual cause.

Hm, but at least in some cases wouldn't it protect people from further
damage?  End user data damage ought to be prevented at all costs IMO.

merlin



Re: [HACKERS] Checksums by default?

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> Hm, but at least in some cases wouldn't it protect people from further
> damage?  End user data damage ought to be prevented at all costs IMO.

Well ... not directly.  Disallowing you from accessing busted block A
doesn't in itself prevent the same thing from happening to block B.
The argument seems to be that checksum failure complaints might prompt
users to, say, replace a failing disk drive before it goes dead completely.
But I think there's a whole lot of wishful thinking in that, particularly
when it comes to the sort of low-information users who would actually
be affected by a change in the default checksum setting.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From
Andres Freund
Date:
On 2017-01-23 21:11:37 -0600, Merlin Moncure wrote:
> End user data damage ought to be prevented at all costs IMO.

Really, really, really not. We should do a lot, but if that'd be the
only priority we'd enable all the expensive as shit stuff and be so slow
that there'd be no users.  We'd never add new scalability/performance
features, because they'll initially have more bugs / increase
complexity. Etc.

Andres



Re: [HACKERS] Checksums by default?

From
Torsten Zuehlsdorff
Date:
On 21.01.2017 19:35, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Sure, it might be easy, but we don't have it.  Personally I think
>> checksums just aren't even ready for prime time. If we had:
>> - ability to switch on/off at runtime (early patches for that have IIRC
>>   been posted)
>> - *builtin* tooling to check checksums for everything
>> - *builtin* tooling to compute checksums after changing setting
>> - configurable background sweeps for checksums
>
> Yeah, and there's a bunch of usability tooling that we don't have,
> centered around "what do you do after you get a checksum error?".
> AFAIK there's no way to check or clear such an error; but without
> such tools, I'm afraid that checksums are as much of a foot-gun
> as a benefit.

I wanted to raise the same issue. A "something is broken" flag is fine 
to avoid more things getting broken. But if you can't repair them, it's 
not very useful.

Since I'm a heavy user of ZFS: it has checksums, and if you enable 
shadow copies or use a RAID, checksums are helpful, since they allow 
you to recover from the problems.

I personally would prefer to enable checksums manually and then get the 
ability to repair damage. Manually, because this would at least double 
the needed space.

Greetings,
Torsten



Re: [HACKERS] Checksums by default?

From
Torsten Zuehlsdorff
Date:

On 21.01.2017 19:37, Stephen Frost wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>>> Because I see having checksums as, frankly, something we always should
>>> have had (as most other databases do, for good reason...) and because
>>> they will hopefully prevent data loss.  I'm willing to give us a fair
>>> bit to minimize the risk of losing data.
>>
>> To be perfectly blunt, that's just magical thinking.  Checksums don't
>> prevent data loss in any way, shape, or form.  In fact, they can *cause*
>> data loss, or at least make it harder for you to retrieve your data,
>> in the event of bugs causing false-positive checksum failures.
>
> This is not a new argument, at least to me, and I don't agree with it.

I don't agree either. Yes, statistically it is more likely that 
checksums cause data loss: the I/O is greater, therefore the disk has 
more to do and fails sooner. But the same is true for RAID: adding more 
disks increases the odds of a disk failure.

So: yes, if you use checksums on a single disk, they are more likely to 
cause problems. But if you manage it right (like ZFS, for example), 
it's an overall gain.

>> What checksums can do for you, perhaps, is notify you in a reasonably
>> timely fashion if you've already lost data due to storage-subsystem
>> problems.  But in a pretty high percentage of cases, that fact would
>> be extremely obvious anyway, because of visible data corruption.
>
> Exactly, and that awareness will allow a user to prevent further data
> loss or corruption.  Slow corruption over time is a very much known and
> accepted real-world case that people do experience, as well as bit
> flipping enough for someone to write a not-that-old blog post about
> them:
>
> https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1
>
> A really nice property of checksums on pages is that they also tell you
> what data you *didn't* lose, which can be extremely valuable.

Indeed!

Greetings,
Torsten



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Tue, Jan 24, 2017 at 4:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> I thought that checksums went in in part because we thought that there
>> was some chance that they'd find bugs in Postgres.
>
> Not really.  AFAICS the only point is to catch storage-system malfeasance.

This matches my understanding. Actual physical media errors are caught
by lower level checksums/error correction codes, and memory errors are
caught by ECC RAM. Checksums do very little for PostgreSQL bugs, which
leaves only filesystem and storage firmware bugs. However, the latter
are still reasonably common faults. I have seen multiple cases where,
after reviewing the corruption with a hex editor, the only reasonable
conclusion was a bug in the storage system. Data shifted around by
non-page size amounts, non-page aligned extents that are zeroed out,
etc. Unfortunately none of those customers had checksums turned on at
the time. I feel that reacting to such errors with a non-cryptic and
easily debuggable checksum error is much better than erroring out with
huge memory allocations, crashing or returning bogus data. Timely
reaction to data corruption is really important for minimizing data
loss.

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Greetings,

* Ants Aasma (ants.aasma@eesti.ee) wrote:
> On Tue, Jan 24, 2017 at 4:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Peter Geoghegan <pg@heroku.com> writes:
> >> I thought that checksums went in in part because we thought that there
> >> was some chance that they'd find bugs in Postgres.
> >
> > Not really.  AFAICS the only point is to catch storage-system malfeasance.
>
> This matches my understanding. Actual physical media errors are caught
> by lower level checksums/error correction codes, and memory errors are
> caught by ECC RAM.

Not everyone runs with ECC, sadly.

> Checksums do very little for PostgreSQL bugs, which
> leaves only filesystem and storage firmware bugs. However the latter
> are still reasonably common faults.

Agreed, but in addition to filesystem and storage firmware bugs,
virtualization systems can have bugs as well.  If those bugs hit the
kernel's cache (which is actually the more likely case; that's what the
VM system is going to think it can monkey with, as long as it works with
the kernel), then you can have cases which PG's checksums would likely
catch, since we check the checksum when we read from the kernel's read
cache, and calculate the checksum before we push the page to the
kernel's write cache.

> I have seen multiple cases where,
> after reviewing the corruption with a hex editor, the only reasonable
> conclusion was a bug in the storage system. Data shifted around by
> non-page size amounts, non-page aligned extents that are zeroed out,
> etc.

Right, I've seen similar kinds of things happening in memory of
virtualized systems; things like random chunks of memory suddenly being
zero'd.

> Unfortunately none of those customers had checksums turned on at
> the time. I feel that reacting to such errors with a non-cryptic and
> easily debuggable checksum error is much better than erroring out with
> huge memory allocations, crashing or returning bogus data. Timely
> reaction to data corruption is really important for minimizing data
> loss.

Agreed.

In addition to that, in larger environments where there are multiple
databases involved for the explicit purpose of fail-over, a system which
is going south because of bad memory or storage could be detected and
pulled out, potentially with zero data loss.  Of course, to minimize
data loss, it'd be extremely important for the fail-over system to
identify a checksum error more-or-less immediately and take the bad node
out.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
"Joshua D. Drake"
Date:
On 01/21/2017 09:09 AM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> As for checksums, I do see value in them and I'm pretty sure that the
>> author of that particular feature did as well, or we wouldn't even have
>> it as an option.  You seem to be of the opinion that we might as well
>> just rip all of that code and work out as being useless.
>
> Not at all; I just think that it's not clear that they are a net win
> for the average user,

Tom is correct here. They are not a net win for the average user. We 
tend to forget that although we collectively have a lot of enterprise 
installs where this does matter, those do not come anywhere near the 
number of average-user installs.

From an advocacy perspective, the average-user install is the one that 
we tend most, because that tending (in theory) will grow into something 
more fruitful, e.g. an enterprise install, over time, because we 
constantly and consistently provided a reasonable and expected 
experience to the average user.

Sincerely,

JD




-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/24/17 10:30 AM, Joshua D. Drake wrote:
>
> Tom is correct here. They are not a net win for the average user. We
> tend to forget that although we collectively have a lot of enterprise
> installs where this does matter, those do not come anywhere near the
> number of average-user installs.
>
> From an advocacy perspective, the average-user install is the one that
> we tend most, because that tending (in theory) will grow into something
> more fruitful, e.g. an enterprise install, over time, because we
> constantly and consistently provided a reasonable and expected
> experience to the average user.

I'm not completely grokking your second paragraph, but I would think 
that an average user would love to get a heads-up that their hardware 
is failing.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Sat, Jan 21, 2017 at 11:57 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-01-21 11:39:18 +0100, Magnus Hagander wrote:
>> Is it time to enable checksums by default, and give initdb a switch to turn
>> it off instead?
>
> -1 - the WAL overhead is quite massive, and in contrast to the other
> GUCs recently changed you can't just switch this around.

I agree.  I bet that if somebody does the test suggested by Amit
downthread, it'll turn out that the performance is just awful.  And
those cases are common.  I think the people saying "well, the overhead
is worth it" must be people whose systems (or whose customer's
systems) aren't processing continuous heavy OLTP workloads.  If you've
got a data warehousing workload, checksums are probably pretty cheap.
If you've got a low-velocity OLTP workload, or an OLTP workload that
fits in shared_buffers, it's probably bearable.  But if you've got 8GB
of shared_buffers and 100GB of data, and you've got 100 or so backends
continuously doing random updates, I think checksums are going to nail
you to the wall.  And EnterpriseDB, at least, has lots of customers
that do exactly that sort of thing.

Having said that, I've certainly run into situations where I speculated
that a customer had a hardware problem and they speculated that we had
given them buggy database software.  In a pretty significant number of
cases, the customer turned out to be right; for example, some of those
people were suffering from multixact bugs that resulted in
unexplainable corruption.  Now, would it have been useful to know that
checksums were passing (suggesting a PostgreSQL problem) rather than
failing (suggesting an OS problem)?  Yes, that would have been great.
I could have given those customers better support.  On the other hand,
I think I've run into MORE cases where the customer was desperately
seeking options to improve write performance, which remains a pretty
significant problem for PostgreSQL.  I can't see taking a significant
hit in that area for my convenience in understanding what's going on
in data corruption situations.  The write performance penalty is paid
by everybody all the time, whereas data corruption is a rare event
even among support cases.

And even when you do have corruption, whether or not the data
corruption is accompanied by a checksum failure is only ONE extra bit
of useful data.  A failure doesn't guarantee a hardware problem; it
could be caused by a faulty backup procedure, like forgetting to run
pg_start_backup().  The lack of a failure doesn't guarantee a software
problem; it could be caused by a faulty backup procedure, like using
an OS-level snapshot facility that isn't exactly simultaneous across
tablespaces.  What you really need to do when a customer has
corruption is figure out why they have corruption, and the leading
cause by far is neither the hardware nor the software but some kind of
user error.  Checksums are at best a very modest assist in figuring
out whether an error has been made and if so of what type.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Wed, Jan 25, 2017 at 12:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I'm not completely grokking your second paragraph, but I would think that an
> average user would love to get a heads-up that their hardware is failing.

Sure.  If the database runs fast enough with checksums enabled,
there's basically no reason to have them turned off.  The issue is
when it doesn't.

Also, it's not as if there are no other ways of checking whether your
disks are failing.  SMART, for example, is supposed to tell you about
incipient hardware failures before PostgreSQL ever sees a bit flip.
Surely an average user would love to get a heads-up that their
hardware is failing even when that hardware is not being used to power
PostgreSQL, yet many people don't bother to configure SMART (or
similar proprietary systems provided by individual vendors).

Trying to force those people to use checksums is just masterminding;
they've made their own decision that it's not worth bothering with.
When something goes wrong, WE still care about distinguishing hardware
failure from PostgreSQL failure.   Our pride is on the line.  But the
customer often doesn't.  The DBA isn't the same person as the
operating system guy, and the operating system guy isn't going to
listen to the DBA even if the DBA complains of checksum failures.  Or
the customer has 100 things on the same piece of hardware and
PostgreSQL is the only one that failed; or alternatively they all
failed around the same time; either way the culprit is obvious.  Or
the remedy is to restore from backup[1] whether the problem is
hardware or software and regardless of whose software is to blame.  Or
their storage cost a million dollars and is a year old and they simply
won't believe that it's failing.  Or their storage cost a hundred
dollars and is 8 years old and they're looking for an excuse to
replace it whether it's responsible for the problem du jour or not.

I think it's great that we have a checksum feature and I think it's
great for people who want to use it and are willing to pay the cost of
it to turn it on.  I don't accept the argument that all of our users,
or even most of them, fall into that category.  I also think it's
disappointing that there's such a vigorous argument for changing the
default when so little follow-on development has gone into this
feature.  If we had put any real effort into making this easier to
turn on and off, for example, the default value would be less
important, because people could change it more easily.  But nobody's
making that effort.  I suggest that the people who think this a
super-high-value feature should be willing to put some real work into
improving it instead of trying to force it down everybody's throat
as-is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Alternatively, sometimes the remedy is to wish they had a usable
backup while frantically running pg_resetxlog.



Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2017 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Trying to force those people to use checksums is just masterminding;
> they've made their own decision that it's not worth bothering with.
> When something goes wrong, WE still care about distinguishing hardware
> failure from PostgreSQL failure.   Our pride is on the line.  But the
> customer often doesn't.  The DBA isn't the same person as the
> operating system guy, and the operating system guy isn't going to
> listen to the DBA even if the DBA complains of checksum failures.

We need to invest in corruption detection/verification tools that are
run on an as-needed basis. They are available to users of every other
major database system.


-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Wed, Jan 25, 2017 at 1:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Jan 25, 2017 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Trying to force those people to use checksums is just masterminding;
>> they've made their own decision that it's not worth bothering with.
>> When something goes wrong, WE still care about distinguishing hardware
>> failure from PostgreSQL failure.   Our pride is on the line.  But the
>> customer often doesn't.  The DBA isn't the same person as the
>> operating system guy, and the operating system guy isn't going to
>> listen to the DBA even if the DBA complains of checksum failures.
>
> We need to invest in corruption detection/verification tools that are
> run on an as-needed basis. They are available to users of every other
> major database system.

+1, but the trick is (a) figuring out exactly what to develop and (b)
finding the time to develop it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jan 25, 2017 at 12:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> > I'm not completely grokking your second paragraph, but I would think that an
> > average user would love to get a heads-up that their hardware is failing.
>
> Sure.  If the database runs fast enough with checksums enabled,
> there's basically no reason to have them turned off.  The issue is
> when it doesn't.

I don't believe we're talking about forcing every user to have checksums
enabled.  We are discussing the default.

Would you say that most users' databases run fast enough with checksums
enabled?  Or more than most, maybe 70%?  80%?  In today's environment,
I'd probably say that it's more like 90+%.

Yet, our default is to have them disabled and *really* hard to enable.

I agree that it's unfortunate that we haven't put more effort into
fixing that- I'm all for it, but it's disappointing to see that people
are not in favor of changing the default as I believe it would both help
our users and encourage more development of the feature.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Peter Geoghegan (pg@heroku.com) wrote:
> On Wed, Jan 25, 2017 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Trying to force those people to use checksums is just masterminding;
> > they've made their own decision that it's not worth bothering with.
> > When something goes wrong, WE still care about distinguishing hardware
> > failure from PostgreSQL failure.   Our pride is on the line.  But the
> > customer often doesn't.  The DBA isn't the same person as the
> > operating system guy, and the operating system guy isn't going to
> > listen to the DBA even if the DBA complains of checksum failures.
>
> We need to invest in corruption detection/verification tools that are
> run on an as-needed basis. They are available to users of every other
> major database system.

Agreed.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> Would you say that most users' databases run fast enough with checksums
> enabled?  Or more than most, maybe 70%?  80%?  In today's environment,
> I'd probably say that it's more like 90+%.

It would be nice if there were some actual evidence about this, rather
than numbers picked out of the air.

> I agree that it's unfortunate that we haven't put more effort into
> fixing that- I'm all for it, but it's disappointing to see that people
> are not in favor of changing the default as I believe it would both help
> our users and encourage more development of the feature.

I think the really key point is that a whole lot of infrastructure work
needs to be done still, and changing the default before that work has been
done is not going to be user-friendly.  The most pressing issue being the
difficulty of changing the setting after the fact.  It would be a *whole*
lot easier to sell default-on if there were a way to turn it off, and yet
you want us to buy into default-on before that way exists.  Come back
after that feature is in, and we can talk.
        regards, tom lane



Re: [HACKERS] Checksums by default?

From
"Joshua D. Drake"
Date:
On 01/25/2017 11:41 AM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> Would you say that most users' databases run fast enough with checksums
>> enabled?  Or more than most, maybe 70%?  80%?  In today's environment,
>> I'd probably say that it's more like 90+%.
>
> It would be nice if there were some actual evidence about this, rather
> than numbers picked out of the air.
>
>> I agree that it's unfortunate that we haven't put more effort into
>> fixing that- I'm all for it, but it's disappointing to see that people
>> are not in favor of changing the default as I believe it would both help
>> our users and encourage more development of the feature.
>
> I think the really key point is that a whole lot of infrastructure work
> needs to be done still, and changing the default before that work has been
> done is not going to be user-friendly.  The most pressing issue being the
> difficulty of changing the setting after the fact.  It would be a *whole*
> lot easier to sell default-on if there were a way to turn it off, and yet
> you want us to buy into default-on before that way exists.  Come back
> after that feature is in, and we can talk.

+1

Sincerely,

JD


-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Wed, Jan 25, 2017 at 8:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Also, it's not as if there are no other ways of checking whether your
> disks are failing.  SMART, for example, is supposed to tell you about
> incipient hardware failures before PostgreSQL ever sees a bit flip.
> Surely an average user would love to get a heads-up that their
> hardware is failing even when that hardware is not being used to power
> PostgreSQL, yet many people don't bother to configure SMART (or
> similar proprietary systems provided by individual vendors).

You really can't rely on SMART to tell you about hardware failures: 1
in 4 drives fail completely with no SMART indication at all [1]. And
for the 1-in-1000 annual checksum failure rate, other indicators except
system restarts had only a weak correlation [2]. And this is before
counting filesystem and other OS bugs that SMART knows nothing about.

My view may be biased by mostly seeing the cases where things have
already gone wrong, but I recommend that support clients turn checksums
on unless it's known that write IO is going to be an issue. Especially
because I know that if it turns out to be a problem I can go in and
quickly hack together a tool to help them turn it off. I do agree that
to change the PostgreSQL default, at least some tool to turn it off
online should be included.

[1] https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
[2] https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Wed, Jan 25, 2017 at 2:23 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> Sure.  If the database runs fast enough with checksums enabled,
>> there's basically no reason to have them turned off.  The issue is
>> when it doesn't.
>
> I don't believe we're talking about forcing every user to have checksums
> enabled.  We are discussing the default.

I never said otherwise.

> Would you say that most users' databases run fast enough with checksums
> enabled?  Or more than most, maybe 70%?  80%?  In today's environment,
> I'd probably say that it's more like 90+%.

I don't have statistics on that, but I'd certainly agree that it's
over 90%.  However, I estimate that the percentage of people who
wouldn't be helped by checksums is also over 90%.  I don't think
it's easy to say whether there are more people who would benefit from
checksums than would be hurt by the performance penalty, or vice
versa.  My own feeling is the latter, but I understand that yours is
the former.

> Yet, our default is to have them disabled and *really* hard to enable.

First of all, that could be fixed by further development.

Second, really hard to enable is a relative term.  I accept that
enabling checksums is not a pleasant process.  Right now, you'd have
to do a dump/restore, or use logical replication to replicate the data
to a new cluster and then switch over.  On the other hand, if
checksums are really a critical feature, how are people getting to the
point where they've got a mission-critical production system and only
then discovering that they want to enable checksums?  If you tell
somebody "we have an optional feature called checksums and you should
really use it" and they respond "well, I'd like to, but I already put
my system into critical production use and it's not worth it to me to
take downtime to get them enabled", that sounds to me like the feature
is nice-to-have, not absolutely essential.  When something is
essential, you find a way to get it done, whether it's painful or not,
because that's what essential means.  And if checksums are not
essential, then they shouldn't be enabled by default unless they're
very cheap -- and I think we already know that's not true in all
workloads.

> I agree that it's unfortunate that we haven't put more effort into
> fixing that- I'm all for it, but it's disappointing to see that people
> are not in favor of changing the default as I believe it would both help
> our users and encourage more development of the feature.

I think it would help some users and hurt others.  I do agree that it
would encourage more development of the feature -- almost of
necessity.  In particular, I bet it would spur development of an
efficient way of turning checksums off -- but I'd rather see us
approach it from the other direction: let's develop an efficient way
of turning the feature on and off FIRST.  Deciding that the feature
has to be on for everyone because turning it on later is too hard for
the people who later decide they want it is letting the tail wag the
dog.

Also, I think that one of the big problems with the way checksums work
is that you don't find problems with your archived data until it's too
late.  Suppose that in February bits get flipped in a block.  You
don't access the data until July[1].  Well, it's nice to have the
system tell you that the data is corrupted, but what are you going to
do about it?  By that point, all of your backups are probably
corrupted.  So it's basically:

ERROR: you're screwed

It's nice to know that (maybe?) but without a recovery strategy a
whole lot of people who get that message are going to immediately
start asking "How do I ignore the fact that I'm screwed and try to
read the data anyway?".  And then you wonder what the point of having
the feature turned on is, especially if it's costly.  It's almost an
attractive nuisance at that point - nobody wants to be the user that
turns off checksums because they sound good on paper, but when you
actually have a problem an awful lot of people are NOT going to want
to try to restore from backup and maybe lose recent transactions.
They're going to want to ignore the checksum failures.  That's kind of
awful.

Peter's comments upthread get at this: "We need to invest in
corruption detection/verification tools that are run on an as-needed
basis."  Exactly.  If we could verify that our data is good before
throwing away our old backups, that'd be good.  If we could verify
that our indexes were structurally sane, that would be superior to
anything checksums can ever give us because it catches not only
storage failures but also software failures within PostgreSQL itself
and user malfeasance above the PostgreSQL layer (e.g. redefining the
supposedly-immutable function to give different answers) and damage
inflicted inadvertently by environmental changes (e.g. upgrading glibc
and having strcoll() change its mind).  If we could verify that every
XID and MXID in the heap points to a clog or multixact record that
still exists, that'd catch more than just bit flips.

I'm not trying to downplay the usefulness of checksums *in a certain
context*.  It's a good feature, and I'm glad we have it.  But I think
you're somewhat inflating the utility of it while discounting the very
real costs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] of the following year, maybe.



Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2017 at 12:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Also, I think that one of the big problems with the way checksums work
> is that you don't find problems with your archived data until it's too
> late.  Suppose that in February bits get flipped in a block.  You
> don't access the data until July[1].  Well, it's nice to have the
> system tell you that the data is corrupted, but what are you going to
> do about it?  By that point, all of your backups are probably
> corrupted.  So it's basically:
>
> ERROR: you're screwed
>
> It's nice to know that (maybe?) but without a recovery strategy a
> whole lot of people who get that message are going to immediately
> start asking "How do I ignore the fact that I'm screwed and try to
> read the data anyway?".

That's also how I tend to think about it.

I understand that my experience with storage devices is unusually
narrow compared to everyone else here. That's why I remain neutral on
the high level question of whether or not we ought to enable checksums
by default. I'll ask other hackers to answer what may seem like a very
naive question, while bearing what I just said in mind. The question
is: Have you ever actually seen a checksum failure in production? And,
if so, how helpful was it?

I myself have not, despite the fact that Heroku uses checksums
wherever possible, and has the technical means to detect problems like
this across the entire fleet of customer databases. Not even once.
This is not what I would have expected myself several years ago.

-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jan 25, 2017 at 2:23 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Yet, our default is to have them disabled and *really* hard to enable.
>
> First of all, that could be fixed by further development.

I'm certainly all for doing so, but I don't agree that it is
necessarily required before we flip the default.  That said, if the way
to get checksums enabled by default is providing a relatively easy way
to turn them off, then that's something which I'll do what I can to
help work towards.  In other words, I'm not going to continue to argue,
given the various opinions of the group, that we should just flip it
tomorrow.

I hope to discuss it further after we have the ability to turn it off
easily.

> Second, really hard to enable is a relative term.  I accept that
> enabling checksums is not a pleasant process.  Right now, you'd have
> to do a dump/restore, or use logical replication to replicate the data
> to a new cluster and then switch over.  On the other hand, if
> checksums are really a critical feature, how are people getting to the
> point where they've got a mission-critical production system and only
> then discovering that they want to enable checksums?

I truly do wish everyone would come talk to me before building out a
database.  Perhaps that's been your experience, in which case, I envy
you, but I tend to get a reaction more along the lines of "wait, what do
you mean I had to pass some option to initdb to enable checksums?!?!".
The fact that we've got a WAL implementation, that we clearly understand
fsync requirements and why full page writes make sense, and that our WAL
has its own CRCs which aren't possible to disable, tends to lead people
to think we really know what we're doing and that we care a lot about
their data.

> > I agree that it's unfortunate that we haven't put more effort into
> > fixing that- I'm all for it, but it's disappointing to see that people
> > are not in favor of changing the default as I believe it would both help
> > our users and encourage more development of the feature.
>
> I think it would help some users and hurt others.  I do agree that it
> would encourage more development of the feature -- almost of
> necessity.  In particular, I bet it would spur development of an
> efficient way of turning checksums off -- but I'd rather see us
> approach it from the other direction: let's develop an efficient way
> of turning the feature on and off FIRST.  Deciding that the feature
> has to be on for everyone because turning it on later is too hard for
> the people who later decide they want it is letting the tail wag the
> dog.

As I have said, I don't believe it has to be on for everyone.

> Also, I think that one of the big problems with the way checksums work
> is that you don't find problems with your archived data until it's too
> late.  Suppose that in February bits get flipped in a block.  You
> don't access the data until July[1].  Well, it's nice to have the
> system tell you that the data is corrupted, but what are you going to
> do about it?  By that point, all of your backups are probably
> corrupted.  So it's basically:

If your backup system is checking the checksums when backing up PG,
which I think every backup system *should* be doing, then guess what?
You've got a backup which you can go back to immediately, possibly with
the ability to restore all of the data from WAL.  That won't always be
the case, naturally, but it's a much better position than simply having
a system which continues to degrade until you've actually reached the
"you're screwed" level because PG will no longer read a page or perhaps
can't even start up, *and* you no longer have any backups.

As it is, there are backup solutions which *do* check the checksum when
backing up PG.  This is no longer, thankfully, some hypothetical thing,
but something which really exists and will hopefully keep users from
losing data.
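
To make that concrete, the heart of such a check can be a small
frontend loop along these lines.  This is a sketch only: it assumes a
cleanly shut down cluster and gets the checksum code by including
checksum_impl.h the way frontend tools do; a scan of a *running*
cluster would additionally have to cope with torn pages, which is
exactly the question raised next.

    #include "postgres_fe.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"
    #include "storage/checksum_impl.h"

    /* union trick to get a suitably aligned page buffer */
    typedef union
    {
        char    data[BLCKSZ];
        double  force_align;
    } AlignedBlock;

    static void
    scan_segment(const char *path, BlockNumber first_blkno)
    {
        AlignedBlock buf;
        BlockNumber blkno = first_blkno;
        FILE       *fp = fopen(path, "rb");

        if (fp == NULL)
        {
            fprintf(stderr, "could not open \"%s\"\n", path);
            return;
        }

        while (fread(buf.data, 1, BLCKSZ, fp) == BLCKSZ)
        {
            PageHeader  hdr = (PageHeader) buf.data;
            uint16      expected;

            /* new (all-zero) pages carry no checksum */
            if (!PageIsNew((Page) buf.data))
            {
                expected = pg_checksum_page(buf.data, blkno);
                if (expected != hdr->pd_checksum)
                    fprintf(stderr, "checksum mismatch: \"%s\", block %u\n",
                            path, blkno);
            }
            blkno++;
        }
        fclose(fp);
    }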

> It's nice to know that (maybe?) but without a recovery strategy a
> whole lot of people who get that message are going to immediately
> start asking "How do I ignore the fact that I'm screwed and try to
> read the data anyway?".

And we have options for that.

> And then you wonder what the point of having
> the feature turned on is, especially if it's costly.  It's almost an
> attractive nuisance at that point - nobody wants to be the user that
> turns off checksums because they sound good on paper, but when you
> actually have a problem an awful lot of people are NOT going to want
> to try to restore from backup and maybe lose recent transactions.
> They're going to want to ignore the checksum failures.  That's kind of
> awful.

Presently, last I checked at least, the database system doesn't fall
over and die if a single page's checksum fails.  I agree entirely that
we want the system to fail gracefully (unless the user instructs us
otherwise, perhaps because they have a redundant system that they can
flip to immediately).

> Peter's comments upthread get at this: "We need to invest in
> corruption detection/verification tools that are run on an as-needed
> basis."  Exactly.  If we could verify that our data is good before
> throwing away our old backups, that'd be good.

Ok.  We have that.  I admit that it's not in PG proper yet, and I can't
turn back the clock and make that happen, but we *do* have the ability
to verify that our backups are free of checksum errors, on
checksum-enabled databases, before we throw away the old backup.

> If we could verify
> that our indexes were structurally sane, that would be superior to
> anything checksums can ever give us because it catches not only
> storage failures but also software failures within PostgreSQL itself
> and user malfeasance above the PostgreSQL layer (e.g. redefining the
> supposedly-immutable function to give different answers) and damage
> inflicted inadvertently by environmental changes (e.g. upgrading glibc
> and having strcoll() change its mind).

We have a patch for that and I would love to see it in core already.  If
it's possible to make that run against a backup, then I'd be happy to
see about integrating it into a backup solution.

> If we could verify that every
> XID and MXID in the heap points to a clog or multixact record that
> still exists, that'd catch more than just bit flips.

Interesting idea...  That might not be too hard to implement on
systems with enough memory to pull all of clog in and then check as the
pages go by.  I can't promise anything, but I'm definitely going to
discuss the idea with David.
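
The clog-side arithmetic, at least, is simple; here is a rough
standalone sketch (the constants are hard-coded to their usual values
for an 8k BLCKSZ build, and the small in-memory array is obviously just
a stand-in for pulling the real pg_clog contents in):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ                  8192
#define CLOG_BITS_PER_XACT      2
#define CLOG_XACTS_PER_BYTE     4
#define CLOG_XACTS_PER_PAGE     (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define SLRU_PAGES_PER_SEGMENT  32

typedef uint32_t TransactionId;

/* 0 = in progress, 1 = committed, 2 = aborted, 3 = subcommitted */
static int
clog_status(TransactionId xid, const uint8_t *clog)
{
    uint32_t    pageno = xid / CLOG_XACTS_PER_PAGE;
    uint32_t    byteno = (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
    uint32_t    bshift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;

    return (clog[(size_t) pageno * BLCKSZ + byteno] >> bshift) & 0x03;
}

int
main(void)
{
    static uint8_t fake_clog[4 * BLCKSZ];   /* stand-in for clog in memory */
    TransactionId xid = 123456;

    /* mark the xid committed in the fake clog, then read it back */
    fake_clog[(size_t) (xid / CLOG_XACTS_PER_PAGE) * BLCKSZ
              + (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE] =
        0x01 << ((xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT);

    printf("xid %u: segment pg_clog/%04X, status %d\n",
           xid,
           xid / CLOG_XACTS_PER_PAGE / SLRU_PAGES_PER_SEGMENT,
           clog_status(xid, fake_clog));
    return 0;
}

The hard part is the heap side: walking every tuple's xmin/xmax and
deciding which of them are actually still expected to have clog
coverage.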

> I'm not trying to downplay the usefulness of checksums *in a certain
> context*.  It's a good feature, and I'm glad we have it.  But I think
> you're somewhat inflating the utility of it while discounting the very
> real costs.

The costs for checksums don't bother me any more than the costs for WAL
or WAL CRCs or full page writes.  They may not be required on every
system, but they're certainly required on more than 'zero' entirely
reasonable systems which people deploy in their production environments.
I'd rather walk into an engagement where the user is saying "yeah, we
enabled checksums and it caught this corruption issue" than having to
break the bad news, which I've had to do over and over, that their
existing system hasn't got checksums enabled.  This isn't hypothetical,
it's what I run into regularly with entirely reasonable and skilled
engineers who have been deploying PG.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> As it is, there are backup solutions which *do* check the checksum when
> backing up PG.  This is no longer, thankfully, some hypothetical thing,
> but something which really exists and will hopefully keep users from
> losing data.

Wouldn't that have issues with torn pages?


-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Michael Paquier
Date:
On Thu, Jan 26, 2017 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> As it is, there are backup solutions which *do* check the checksum when
>> backing up PG.  This is no longer, thankfully, some hypothetical thing,
>> but something which really exists and will hopefully keep users from
>> losing data.
>
> Wouldn't that have issues with torn pages?

Why? What do you foresee here? I would think such backup solutions are
careful enough to ensure the durability of pages so that they
are not partially written.
-- 
Michael



Re: [HACKERS] Checksums by default?

From
Andres Freund
Date:
On 2017-01-26 09:19:28 +0900, Michael Paquier wrote:
> On Thu, Jan 26, 2017 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> > On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >> As it is, there are backup solutions which *do* check the checksum when
> >> backing up PG.  This is no longer, thankfully, some hypothetical thing,
> >> but something which really exists and will hopefully keep users from
> >> losing data.
> >
> > Wouldn't that have issues with torn pages?
> 
> Why? What do you foresee here? I would think such backup solutions are
> careful enough to ensure the durability of pages so that they
> are not partially written.

That means you have to replay enough WAL to get into a consistent
state...

- Andres



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Wed, Jan 25, 2017 at 6:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> I hope to discuss it further after we have the ability to turn it off
> easily.

I think we should have the ability to flip it in BOTH directions easily.

>> Second, really hard to enable is a relative term.  I accept that
>> enabling checksums is not a pleasant process.  Right now, you'd have
>> to do a dump/restore, or use logical replication to replicate the data
>> to a new cluster and then switch over.  On the other hand, if
>> checksums are really a critical feature, how are people getting to the
>> point where they've got a mission-critical production system and only
>> then discovering that they want to enable checksums?
>
I truly do wish everyone would come talk to me before building out a
> database.  Perhaps that's been your experience, in which case, I envy
> you, but I tend to get a reaction more along the lines of "wait, what do
> you mean I had to pass some option to initdb to enable checksum?!?!".
> The fact that we've got a WAL implementation and clearly understand
> fsync requirements, why full page writes make sense, and that our WAL
has its own CRCs which can't be disabled, tends to lead people
> to think we really know what we're doing and that we care a lot about
> their data.

It sounds to me like you are misleading users about the positives and
negatives of checksums, which then causes them to be shocked that they
are not the default.

> As I have said, I don't believe it has to be on for everyone.

For the second time, I didn't say that.  But the default has a
powerful influence on behavior.  If it didn't, you wouldn't be trying
to get it changed.

> [ unsolicited bragging about an unspecified backup tool, presumably pgbackrest ]

Great.

> Presently, last I checked at least, the database system doesn't fall
> over and die if a single page's checksum fails.

This is another thing that I never said.

> [ more unsolicited bragging about an unspecified backup tool, presumably still pgbackrest ]

Swell.

>> I'm not trying to downplay the usefulness of checksums *in a certain
>> context*.  It's a good feature, and I'm glad we have it.  But I think
>> you're somewhat inflating the utility of it while discounting the very
>> real costs.
>
> The costs for checksums don't bother me any more than the costs for WAL
> or WAL CRCs or full page writes.

Obviously.  But I think they should.  Frankly, I think the costs for
full page writes should bother the heck out of all of us, but the
solution isn't to shut them off any more than it is to enable
checksums despite the cost.  It's to find a way to reduce the costs.

> They may not be required on every
> system, but they're certainly required on more than 'zero' entirely
> reasonable systems which people deploy in their production environments.

Nobody said otherwise.

> I'd rather walk into an engagement where the user is saying "yeah, we
> enabled checksums and it caught this corruption issue" than having to
> break the bad news, which I've had to do over and over, that their
> existing system hasn't got checksums enabled.  This isn't hypothetical,
> it's what I run into regularly with entirely reasonable and skilled
> engineers who have been deploying PG.

Maybe you should just stop telling them and use the time thus freed up
to work on improving the checksum feature.

I'm skeptical of this whole discussion because you seem to be filled
with unalloyed confidence that checksums have little performance
impact and will do wonderful things to prevent data loss, whereas I
think they have significant performance impact and will only very
slightly help to prevent data loss.  I admit that the idea of having
pgbackrest verify checksums while backing up seems like it could
greatly improve the chances of checksums being useful, but I'm not
going to endorse changing PostgreSQL's default for pgbackrest's
benefit.  It's got to be to the benefit of PostgreSQL users broadly,
not just the subset of those people who use one particular backup
tool.  Also, the massive hit that will probably occur on
high-concurrency OLTP workloads larger than shared_buffers is going to
be hard to justify for any amount of backup security.  I think that
problem's got to be solved or at least mitigated before we think about
changing this.  I realize that not everyone would set the bar that
high, but I see far too many customers with exactly that workload to
dismiss it lightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Checksums by default?

From
Michael Paquier
Date:
On Thu, Jan 26, 2017 at 9:22 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-01-26 09:19:28 +0900, Michael Paquier wrote:
>> On Thu, Jan 26, 2017 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> > On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> >> As it is, there are backup solutions which *do* check the checksum when
>> >> backing up PG.  This is no longer, thankfully, some hypothetical thing,
>> >> but something which really exists and will hopefully keep users from
>> >> losing data.
>> >
>> > Wouldn't that have issues with torn pages?
>>
>> Why? What do you foresee here? I would think such backup solutions are
>> careful enough to ensure the durability of pages so that they
>> are not partially written.
>
> That means you have to replay enough WAL to get into a consistent
> state...

Ah, OK, I got the point. Yes, that would be a problem when checking this
field on raw backups, except if the page size matched the kernel's 4k
one.
-- 
Michael



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Wed, Jan 25, 2017 at 7:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jan 26, 2017 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>> As it is, there are backup solutions which *do* check the checksum when
>>> backing up PG.  This is no longer, thankfully, some hypothetical thing,
>>> but something which really exists and will hopefully keep users from
>>> losing data.
>>
>> Wouldn't that have issues with torn pages?
>
> Why? What do you foresee here? I would think such backup solutions are
> careful enough to ensure the durability of pages so that they
> are not partially written.

Well, you'd have to keep a read(fd, buf, 8192) performed by the backup
tool from overlapping with a write(fd, buf, 8192) performed by the
backend.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Peter Geoghegan (pg@heroku.com) wrote:
> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > As it is, there are backup solutions which *do* check the checksum when
> > backing up PG.  This is no longer, thankfully, some hypothetical thing,
> > but something which really exists and will hopefully keep users from
> > losing data.
>
> Wouldn't that have issues with torn pages?

No, why would it?  The page has either been written out by PG to the OS,
in which case the backup s/w will see the new page, or it hasn't been.
Our testing has not turned up any issues as yet.  That said, it's
relatively new and I wouldn't be surprised if we need to do some
adjustments in that area, which might be system-dependent even.  We
could certainly check the WAL for the page that had a checksum error (we
currently simply report them, though don't throw away a prior backup if
we detect one).

This isn't like a case where only half the page made it to the disk
because of a system failure though; everything is online and working
properly during an online backup.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Thu, Jan 26, 2017 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> > On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >> As it is, there are backup solutions which *do* check the checksum when
> >> backing up PG.  This is no longer, thankfully, some hypothetical thing,
> >> but something which really exists and will hopefully keep users from
> >> losing data.
> >
> > Wouldn't that have issues with torn pages?
>
> Why? What do you foresee here? I would think such backup solutions are
> careful enough to ensure the durability of pages so that they
> are not partially written.

I believe his concern was that the backup sw might see a
partially-updated page when it reads the file while PG is writing it.
In other words, would the kernel return some intermediate state of data
while an fwrite() is in progress.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jan 25, 2017 at 7:19 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Thu, Jan 26, 2017 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> >> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >>> As it is, there are backup solutions which *do* check the checksum when
> >>> backing up PG.  This is no longer, thankfully, some hypothetical thing,
> >>> but something which really exists and will hopefully keep users from
> >>> losing data.
> >>
> >> Wouldn't that have issues with torn pages?
> >
> > Why? What do you foresee here? I would think such backup solutions are
> > careful enough to ensure the durability of pages so that they
> > are not partially written.
>
> Well, you'd have to keep a read(fd, buf, 8192) performed by the backup
> tool from overlapping with a write(fd, buf, 8192) performed by the
> backend.

As Michael mentioned, that'd depend on whether things are atomic from a
user's perspective at certain sizes (perhaps 4k, which wouldn't be too
surprising, but may also be system-dependent), in which case verifying
that the page is in the WAL would be sufficient.
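
(For a manual check, pg_xlogdump already gets you most of the way there.
A sketch, where the LSN range and the tablespace/database/relfilenode/
block identifiers are all placeholders:

    pg_xlogdump -p $PGDATA/pg_xlog -s 0/5000028 -e 0/6000000 \
        | grep 'rel 1663/16384/16385 blk 42'

A record carrying a full-page image for the block means the copy taken
from the data file can be reconstructed during recovery anyway.)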

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Andres Freund
Date:
On 2017-01-25 19:30:08 -0500, Stephen Frost wrote:
> * Peter Geoghegan (pg@heroku.com) wrote:
> > On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > > As it is, there are backup solutions which *do* check the checksum when
> > > backing up PG.  This is no longer, thankfully, some hypothetical thing,
> > > but something which really exists and will hopefully keep users from
> > > losing data.
> > 
> > Wouldn't that have issues with torn pages?
> 
> No, why would it?  The page has either been written out by PG to the OS,
> in which case the backup s/w will see the new page, or it hasn't been.

Uh. Writes aren't atomic on that granularity.  That means you very well
*can* see a torn page (in linux you can e.g. on 4KB os page boundaries
of a 8KB postgres page). Just read a page while it's being written out.

You simply can't reliably verify checksums without replaying WAL (or
creating a manual version of replay, as in checking the WAL for a FPW).


> This isn't like a case where only half the page made it to the disk
> because of a system failure though; everything is online and working
> properly during an online backup.

I don't think that really changes anything.

Greetings,

Andres Freund



Re: [HACKERS] Checksums by default?

From
Michael Paquier
Date:
On Thu, Jan 26, 2017 at 9:32 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Wed, Jan 25, 2017 at 7:19 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > On Thu, Jan 26, 2017 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> >> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> >>> As it is, there are backup solutions which *do* check the checksum when
>> >>> backing up PG.  This is no longer, thankfully, some hypothetical thing,
>> >>> but something which really exists and will hopefully keep users from
>> >>> losing data.
>> >>
>> >> Wouldn't that have issues with torn pages?
>> >
>> > Why? What do you foresee here? I would think such backup solutions are
>> > careful enough to ensure the durability of pages so that they
>> > are not partially written.
>>
>> Well, you'd have to keep a read(fd, buf, 8192) performed by the backup
>> tool from overlapping with a write(fd, buf, 8192) performed by the
>> backend.
>
> As Michael mentioned, that'd depend on whether things are atomic from a
> user's perspective at certain sizes (perhaps 4k, which wouldn't be too
> surprising, but may also be system-dependent), in which case verifying
> that the page is in the WAL would be sufficient.

That would be enough. It should also be rare enough that there would
not be that many pages to track when looking at records from the
backup start position to minimum recovery point. It could be also
simpler, though more time-consuming, to just let a backup recover up
to the minimum recovery point (recovery_target = 'immediate'), and
then run the checksum sanity checks. There are other checks usually
needed on a backup anyway like being sure that index pages are in good
shape even with a correct checksum, etc.
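
For illustration, that verification run needs nothing more exotic than
restoring the backup and dropping in a recovery.conf along these lines
(the restore_command is of course site-specific):

    restore_command = 'cp /path/to/archive/%f %p'
    recovery_target = 'immediate'

Once the server reaches consistency and pauses or stops per
recovery_target_action, the checks can be run against it.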

But here I am really hijacking the thread, so I'll stop..
-- 
Michael



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Andres Freund (andres@anarazel.de) wrote:
> On 2017-01-25 19:30:08 -0500, Stephen Frost wrote:
> > * Peter Geoghegan (pg@heroku.com) wrote:
> > > On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > > > As it is, there are backup solutions which *do* check the checksum when
> > > > backing up PG.  This is no longer, thankfully, some hypothetical thing,
> > > > but something which really exists and will hopefully keep users from
> > > > losing data.
> > >
> > > Wouldn't that have issues with torn pages?
> >
> > No, why would it?  The page has either been written out by PG to the OS,
> > in which case the backup s/w will see the new page, or it hasn't been.
>
> Uh. Writes aren't atomic on that granularity.  That means you very well
> *can* see a torn page (in linux you can e.g. on 4KB os page boundaries
> of a 8KB postgres page). Just read a page while it's being written out.
>
> You simply can't reliably verify checksums without replaying WAL (or
> creating a manual version of replay, as in checking the WAL for a FPW).

Looking through the WAL isn't any surprise and is something we've been
planning to do for other reasons anyway.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Michael,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> That would be enough. It should also be rare enough that there would
> not be that many pages to track when looking at records from the
> backup start position to minimum recovery point. It could be also
> simpler, though more time-consuming, to just let a backup recover up
> to the minimum recovery point (recovery_target = 'immediate'), and
> then run the checksum sanity checks. There are other checks usually
> needed on a backup anyway like being sure that index pages are in good
> shape even with a correct checksum, etc.

Believe me, I'm all for *all* of that.

> But here I am really hijacking the thread, so I'll stop..

If you have further thoughts, I'm all ears.  This is all relatively new,
and I don't expect to have all of the answers or solutions.

Obviously, having to bring up a full database is an extra step (one we
try to make easy to do), but, sadly, we don't have any way to ask PG to
verify all the checksums with released versions, so that's what we're
working with.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jan 25, 2017 at 6:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > I hope to discuss it further after we have the ability to turn it off
> > easily.
>
> I think we should have the ability to flip it in BOTH directions easily.

Presumably you imply this to mean "before we enable it by default."  I'm
not sure that I can agree with that, but we haven't got it in either
direction yet, so it's not terribly interesting to discuss that
particular "what if."

> It sounds to me like you are misleading users about the positives and
> negatives of checksums, which then causes them to be shocked that they
> are not the default.

I don't try to claim that they are without downsides or performance
impacts, if that's the implication here.

> > [ more unsolicited bragging about an unspecified backup tool, presumably still pgbackrest ]

It was explicitly to counter the claim that there aren't things out
there which are working to actively check the checksums.

> > I'd rather walk into an engagement where the user is saying "yeah, we
> > enabled checksums and it caught this corruption issue" than having to
> > break the bad news, which I've had to do over and over, that their
> > existing system hasn't got checksums enabled.  This isn't hypothetical,
> > it's what I run into regularly with entirely reasonable and skilled
> > engineers who have been deploying PG.
>
> Maybe you should just stop telling them and use the time thus freed up
> to work on improving the checksum feature.

I'm working to improve the usefulness of our checksum feature in a way
which will produce practical and much more immediate results than
anything I could do today in PG.  That said, I do plan to also support
working on checksums as I'm able to.  At the moment, that's supporting
Magnus' thread about enabling them by default.  I'd be a bit surprised
if he was trying to force a change on PG because he thinks it's going to
improve things for pgbackrest, but if so, I'm not going to complain when
it seems like an entirely sensible and good change which will benefit
PG's users too.

Even better would be if we had an independent tool to check checksums
endorsed by the PG community, but that won't happen for a release cycle.
I'd also be extremely happy if the other backup tools out there grew
the ability to check checksums in PG pages; frankly, I hope that adding
it to pgbackrest will push them to do so.

> I'm skeptical of this whole discussion because you seem to be filled
> with unalloyed confidence that checksums have little performance
> impact and will do wonderful things to prevent data loss, whereas I
> think they have significant performance impact and will only very
> slightly help to prevent data loss.

I admit that they'll have a significant performance impact in some
environments, but I think the vast majority of installations won't see
anything different, while some of them may be saved by it, including, as
likely as not, a number of actual corruption issues that have been
brought up on these lists in the past few days, simply because reports
were asked for.

> I admit that the idea of having
> pgbackrest verify checksums while backing up seems like it could
> greatly improve the chances of checksums being useful, but I'm not
> going to endorse changing PostgreSQL's default for pgbackrest's
> benefit.

I'm glad to hear that you generally endorse the idea of having a backup
tool verify checksums.  I'd love it if all of them did and I'm not going
to apologize for, as far as I'm aware, being the first to even make an
effort in that direction.

> It's got to be to the benefit of PostgreSQL users broadly,
> not just the subset of those people who use one particular backup
> tool.

Hopefully, other backup solutions will add similar capability, perhaps
someone will also write an independent tool, and eventually those will
get out in released versions; maybe PG will grow a tool to check
checksums too.  But I can't make other tool authors implement it, nor
can I make other committers work on it, and while I'm doing what I can,
as I'm sure you understand, we all have a lot of different hats.

> Also, the massive hit that will probably occur on
> high-concurrency OLTP workloads larger than shared_buffers is going to
> be hard to justify for any amount of backup security.  I think that
> problem's got to be solved or at least mitigated before we think about
> changing this.  I realize that not everyone would set the bar that
> high, but I see far too many customers with exactly that workload to
> dismiss it lightly.

I have a sneaking suspicion that the customers which you get directly
involved with tend to be at a different level than the majority of PG
users which exist out in the wild (I can't say that it's really any
different for me).  I don't think that's a bad thing, but I do think
users at all levels deserve consideration and not just those running
close to the limits of their gear.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2017 at 1:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I understand that my experience with storage devices is unusually
> narrow compared to everyone else here. That's why I remain neutral on
> the high level question of whether or not we ought to enable checksums
> by default. I'll ask other hackers to answer what may seem like a very
> naive question, while bearing what I just said in mind. The question
> is: Have you ever actually seen a checksum failure in production? And,
> if so, how helpful was it?

I'm surprised that nobody has answered my question yet.

I'm not claiming that not actually seeing any corruption in the wild
due to a failing checksum invalidates any argument. I *do* think that
data points like this can be helpful, though.

-- 
Peter Geoghegan



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
Peter,

* Peter Geoghegan (pg@heroku.com) wrote:
> On Wed, Jan 25, 2017 at 1:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > I understand that my experience with storage devices is unusually
> > narrow compared to everyone else here. That's why I remain neutral on
> > the high level question of whether or not we ought to enable checksums
> > by default. I'll ask other hackers to answer what may seem like a very
> > naive question, while bearing what I just said in mind. The question
> > is: Have you ever actually seen a checksum failure in production? And,
> > if so, how helpful was it?
>
> I'm surprised that nobody has answered my question yet.
>
> I'm not claiming that not actually seeing any corruption in the wild
> due to a failing checksum invalidates any argument. I *do* think that
> data points like this can be helpful, though.

Sadly, without having them enabled by default, there's not a huge corpus
of example cases to draw from.

There have been a few examples already posted about corruption failures
with PG, but one can't say with certainty that they would have been
caught sooner if checksums had been enabled.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
Thomas Munro
Date:
On Thu, Jan 26, 2017 at 2:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Sadly, without having them enabled by default, there's not a huge corpus
> of example cases to draw from.
>
> There have been a few examples already posted about corruption failures
> with PG, but one can't say with certainty that they would have been
> caught sooner if checksums had been enabled.

I don't know how comparable it is to our checksum technology, but
MySQL seems to have some kind of checksums on table data, and you can
find public emails, blogs etc lamenting corrupted databases by
searching Google for the string "InnoDB: uncompressed page, stored
checksum in field1" (that's the start of a longer error message that
includes actual and expected checksums).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Wed, Jan 25, 2017 at 7:37 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-01-25 19:30:08 -0500, Stephen Frost wrote:
>> * Peter Geoghegan (pg@heroku.com) wrote:
>> > On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> > > As it is, there are backup solutions which *do* check the checksum when
>> > > backing up PG.  This is no longer, thankfully, some hypothetical thing,
>> > > but something which really exists and will hopefully keep users from
>> > > losing data.
>> >
>> > Wouldn't that have issues with torn pages?
>>
>> No, why would it?  The page has either been written out by PG to the OS,
>> in which case the backup s/w will see the new page, or it hasn't been.
>
> Uh. Writes aren't atomic on that granularity.  That means you very well
> *can* see a torn page (in linux you can e.g. on 4KB os page boundaries
> of a 8KB postgres page). Just read a page while it's being written out.

Yeah.  This is also why backups force full page writes on even if
they're turned off in general.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Checksums by default?

From
Greg Stark
Date:
On 26 January 2017 at 01:58, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> I don't know how comparable it is to our checksum technology, but
> MySQL seems to have some kind of checksums on table data, and you can
> find public emails, blogs etc lamenting corrupted databases by
> searching Google for the string "InnoDB: uncompressed page, stored
> checksum in field1" (that's the start of a longer error message that
> includes actual and expected checksums).

I'm not sure what exactly that teaches us, however. I see these were
often associated with software bugs (apparently MySQL long assumed
that a checksum of 0 never happened, for example).

Every non-software case I stumbled across seemed to follow a power
failure. Apparently MySQL uses a "doublewrite buffer" to protect
against torn pages, but when I search for that I get tons of people
inquiring how to turn it off... So even without software bugs in the
checksum code, I don't know that the frequency of the error necessarily
teaches us anything about the frequency of hardware corruption either.

And more to the point it seems what people are asking for in all those
lamentations is how they can convince MySQL to continue and ignore the
corruption. A typical response was "We slightly modified innochecksum
and added option -f that means if the checksum of a page is wrong,
rewrite it in the InnoDB page header." Which begs the question...

-- 
greg



Re: [HACKERS] Checksums by default?

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jan 25, 2017 at 7:37 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-01-25 19:30:08 -0500, Stephen Frost wrote:
> >> * Peter Geoghegan (pg@heroku.com) wrote:
> >> > On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >> > > As it is, there are backup solutions which *do* check the checksum when
> >> > > backing up PG.  This is no longer, thankfully, some hypothetical thing,
> >> > > but something which really exists and will hopefully keep users from
> >> > > losing data.
> >> >
> >> > Wouldn't that have issues with torn pages?
> >>
> >> No, why would it?  The page has either been written out by PG to the OS,
> >> in which case the backup s/w will see the new page, or it hasn't been.
> >
> > Uh. Writes aren't atomic on that granularity.  That means you very well
> > *can* see a torn page (in linux you can e.g. on 4KB os page boundaries
> > of a 8KB postgres page). Just read a page while it's being written out.
>
> Yeah.  This is also why backups force full page writes on even if
> they're turned off in general.

I've got a question into David about this, I know we chatted about the
risk at one point, I just don't recall what we ended up doing (I can
imagine a few different possible things- re-read the page, which isn't a
guarantee but reduces the chances a fair bit, or check the LSN, or
perhaps the plan was to just check if it's in the WAL, as I mentioned)
or if we ended up concluding it wasn't a risk for some, perhaps
incorrect, reason and need to revisit it.

Thanks!

Stephen

Re: [HACKERS] Checksums by default?

From
"Joshua D. Drake"
Date:
On 01/25/2017 05:25 PM, Peter Geoghegan wrote:
> On Wed, Jan 25, 2017 at 1:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I understand that my experience with storage devices is unusually
>> narrow compared to everyone else here. That's why I remain neutral on
>> the high level question of whether or not we ought to enable checksums
>> by default. I'll ask other hackers to answer what may seem like a very
>> naive question, while bearing what I just said in mind. The question
>> is: Have you ever actually seen a checksum failure in production? And,
>> if so, how helpful was it?

No.

JD



-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.



Re: [HACKERS] Checksums by default?

From
David Steele
Date:
On 1/25/17 10:38 PM, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Wed, Jan 25, 2017 at 7:37 PM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2017-01-25 19:30:08 -0500, Stephen Frost wrote:
>>>> * Peter Geoghegan (pg@heroku.com) wrote:
>>>>> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>>>>> As it is, there are backup solutions which *do* check the checksum when
>>>>>> backing up PG.  This is no longer, thankfully, some hypothetical thing,
>>>>>> but something which really exists and will hopefully keep users from
>>>>>> losing data.
>>>>>
>>>>> Wouldn't that have issues with torn pages?
>>>>
>>>> No, why would it?  The page has either been written out by PG to the OS,
>>>> in which case the backup s/w will see the new page, or it hasn't been.
>>>
>>> Uh. Writes aren't atomic on that granularity.  That means you very well
>>> *can* see a torn page (in linux you can e.g. on 4KB os page boundaries
>>> of a 8KB postgres page). Just read a page while it's being written out.
>>
>> Yeah.  This is also why backups force full page writes on even if
>> they're turned off in general.
>
> I've got a question into David about this, I know we chatted about the
> risk at one point, I just don't recall what we ended up doing (I can
> imagine a few different possible things- re-read the page, which isn't a
> guarantee but reduces the chances a fair bit, or check the LSN, or
> perhaps the plan was to just check if it's in the WAL, as I mentioned)
> or if we ended up concluding it wasn't a risk for some, perhaps
> incorrect, reason and need to revisit it.

The solution was to simply ignore the checksums of any pages with an
LSN >= the LSN returned by pg_start_backup().  This means that hot blocks
may never be checked during backup, but if they are active then any
problems should be caught directly by PostgreSQL.

This technique assumes that blocks can be consistently read in the order
they were written.  If the second 4k (or 512 byte, etc.) block of the
fwrite is visible before the first 4k block then there would be a false
positive.  I have a hard time imagining any sane buffering system
working this way, but I can't discount it.

It's definitely possible for pages on disk to have this characteristic
(i.e., the first block is not written first) but that should be fixed
during recovery before it is possible to take a backup.

Note that reports of page checksum errors are informational only and do
not have any effect on the backup process.  Even so we would definitely
prefer to avoid false positives.  If anybody can poke a hole in this
solution then I would like to hear it.
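
In rough C, the test amounts to this (a sketch of the logic, not our
actual code, and assuming the PostgreSQL headers are available for the
page macros):

#include "postgres_fe.h"
#include "storage/bufpage.h"

static bool
should_verify_page(Page page, XLogRecPtr backup_start_lsn)
{
    /* never-initialized pages are all zeroes and carry no checksum */
    if (PageIsNew(page))
        return false;

    /*
     * A page stamped at or after the backup start LSN may be in the
     * middle of being written while we read it; recovery will restore
     * it from a full-page image anyway, so skip verification rather
     * than risk a false positive.
     */
    if (PageGetLSN(page) >= backup_start_lsn)
        return false;

    return true;
}

Here backup_start_lsn is simply the LSN handed back by
pg_start_backup().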

--
-David
david@pgmasters.net


Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
Hi,

I've been running a bunch of benchmarks to measure the overhead of data 
checksums. I'm not sure if we're still considering this change for 10, 
but let me post the results anyway.

The benchmark was fairly simple - run a few simple workloads, measure 
tps, amount of WAL generated etc. The scripts and results (including sar 
data etc.) are available here:

     https://bitbucket.org/tvondra/checksum-bench

I encourage others to validate the results, see if I made some stupid 
mistakes and perhaps do additional analysis of the data. There's quite a 
bit of data, including pg_stat_* snapshots collected every second, etc.

The workloads benchmarked were

* regular pgbench

* 'skewed' pgbench (simulating workloads that frequently access only 
small subset of data)

* 'skewed-n' (same as before, but skipping updates on tiny tables, to 
prevent concurrency issues, just like "pgbench -N")

* pgbench with 90% reads, 10% writes

All of this was run at three different scales:

* 100 - fits into shared buffers

* 1000 - fits into RAM

* 10000 - exceeds RAM

This was done on a machine with 2 x 8/16 cores, with 32 clients, each 
run taking 2h. The machine has PCIe SSD drive (Intel 750), so fairly 
powerful both for reads and writes.
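
For reference, each plain pgbench run boils down to something like the
following (the exact invocations, including the custom scripts for the
skewed and read-mostly variants, are in the repository above; database
name and thread count here are illustrative):

    pgbench -i -s 1000 bench            # initialize, here at the "RAM" scale
    pgbench -c 32 -j 32 -T 7200 bench   # 32 clients, 2-hour run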

The TPS results look like this (charts and spreadsheet attached):

     test         scale        checksums        no-checksums      ratio
     -------------------------------------------------------------------
     pgbench        100            20888               21166     98.69%
                   1000            15578               16036     97.14%
                  10000             5205                5352     97.25%

     read-write     100           151228              152330     99.28%
                   1000            95396               96755     98.60%
                  10000            13067               14514     90.03%

     skewed         100            20957               20805    100.73%
                   1000            20865               20744    100.58%
                  10000             8855                9482     93.39%

     skewed-n       100            29226               28104    103.99%
                   1000            28011               27612    101.45%
                  10000             9363               10109     92.62%


So it seems that with smaller scales the overhead is fairly low (~2%, so 
within noise). On the large data set the throughput drops by ~10% with 
some of the workloads.

The amount of WAL generated (over the 2h run with checkpoints triggered 
every 30 minutes) looks like this:

     test         scale        checksums        no-checksums      ratio
     -------------------------------------------------------------------
     pgbench        100            66.11               66.48     99.45%
                   1000           119.77              119.55    100.19%
                  10000           375.05              350.63    106.97%

     read-write     100            26.06               26.26     99.26%
                   1000            70.37               70.01    100.51%
                  10000           147.54              154.78     95.33%

     skewed         100            65.64               64.88    101.18%
                   1000           117.20              114.79    102.10%
                  10000           370.15              346.39    106.86%

     skewed-n       100            61.00               56.53    107.92%
                   1000           106.85              102.48    104.26%
                  10000           372.03              348.29    106.82%


Those numbers should probably be corrected a bit to compensate for the 
tps difference (if you do ~10% more TPS, you expect ~10% more WAL), but 
I think it's clear the WAL overhead is roughly +10% or so.

I'm not suggesting this means we should or should not enable data 
checksums by default, but hopefully it provides some actual data 
measuring the impact. If needed, I can do more tests with other 
workloads, and it shouldn't be difficult to use the scripts on other 
systems if you want to run some benchmarks yourself.


regards

Tomas Vondra


Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 1/25/17 6:40 PM, Stephen Frost wrote:
> Obviously, having to bring up a full database is an extra step (one we
> try to make easy to do), but, sadly, we don't have any way to ask PG to
> verify all the checksums with released versions, so that's what we're
> working with.

Wouldn't it be fairly trivial to write an extension that did that though?

foreach r in pg_class where relkind in (...)
  for (b = 0; b < r.relpages; b++)
    ReadBufferExtended(..., BAS_BULKREAD);
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Andres Freund
Date:
On 2017-02-03 17:23:15 -0600, Jim Nasby wrote:
> On 1/25/17 6:40 PM, Stephen Frost wrote:
> > Obviously, having to bring up a full database is an extra step (one we
> > try to make easy to do), but, sadly, we don't have any way to ask PG to
> > verify all the checksums with released versions, so that's what we're
> > working with.
>
> Wouldn't it be fairly trivial to write an extension that did that though?
>
> foreach r in pg_class where relkind in (...)
>   for (b = 0; b < r.relpages; b++)
>     ReadBufferExtended(..., BAS_BULKREAD);

You can't really see things from other databases that way tho. So you
need to write a tool that iterates all databases and such.  Not that
that's a huge problem, but it doesn't make things easier at least.

(and you need to deal with things like forks, but that's not a huge
issue)

- Andres



Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Mon, Jan 30, 2017 at 12:29 PM, David Steele <david@pgmasters.net> wrote:
> The solution was to simply ignore the checksums of any pages with an
> LSN >= the LSN returned by pg_start_backup().  This means that hot blocks
> may never be checked during backup, but if they are active then any
> problems should be caught directly by PostgreSQL.

I feel like this doesn't fix the problem.  Suppose the backup process
reads part of a block that hasn't been modified in a while, and then
PostgreSQL writes the block, and then the backup process reads the
rest of the block.  The LSN test will not prevent the checksum from
being verified, but the checksum will fail to match.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 2/3/17 5:31 PM, Andres Freund wrote:
> You can't really see things from other databases that way tho. So you
> need to write a tool that iterates all databases and such.  Not that
> that's a huge problem, but it doesn't make things easier at least.

True. Not terribly hard to iterate though, and if the author of this 
mythical extension really wanted to, they could probably use a bgworker 
that was free to iterate through the databases.

> (and you need to deal with things like forks, but that's not a huge
> issue)

Yeah, which maybe requires version-specific hard-coded knowledge of how 
many forks you might have.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
Hi,

I've repeated those benchmarks on a much smaller/older machine, with 
only minimal changes (mostly related to RAM and cores available). I 
expected to see more significant differences, assuming that newer CPUs 
would handle the checksumming better, but to my surprise the impact of 
enabling checksums on this machine is only ~2%.

As usual, full results and statistics are available for review here:

     https://bitbucket.org/tvondra/checksum-bench-i5

Looking at average TPS (measured over 2 hours, with checkpoints every 
30 minutes), I see this:

         test    scale     checksums   no-checksums      ratio
     -----------------------------------------------------------
      pgbench       50          7444           7518     99.02%
                   300          6863           6936     98.95%
                  1000          4195           4295     97.67%

      read-write    50         48858          48832    100.05%
                   300         41999          42302     99.28%
                  1000         16539          16666     99.24%

      skewed        50          7485           7480    100.07%
                   300          7245           7280     99.52%
                  1000          5950           6050     98.35%

      skewed-n      50         10234          10226    100.08%
                   300          9618           9649     99.68%
                  1000          7371           7393     99.70%

And the amount of WAL produced looks like this:

      test       scale        checksums     no-checksums       ratio
     -----------------------------------------------------------------
      pgbench       50            24.89            24.67      100.89%
                   300            37.94            37.54      101.07%
                  1000            65.91            64.88      101.58%

      read-write    50            10.00            9.98       100.11%
                   300            23.28            23.35       99.66%
                  1000            54.20            53.20      101.89%

      skewed        50            24.35            24.01      101.43%
                   300            35.12            34.51      101.77%
                  1000            52.14            51.15      101.93%

      skewed-n      50            21.71            21.13      102.73%
                   300            32.23            31.54      102.18%
                  1000            53.24            51.94      102.50%

Again, this is hardly proof that no workload exists where data checksums 
have a much worse impact, but I had expected to see a much more 
significant impact on these workloads.


Incidentally, I've been dealing with a checksum failure reported by a 
customer last week, and based on the experience I tend to agree that we 
don't have the tools needed to deal with checksum failures. I think such 
tooling should be a 'must have' for enabling checksums by default.

In this particular case the checksum failure is particularly annoying 
because it happens during recovery (on a standby, after a restart), 
during startup, so FATAL means shutdown.

I've managed to inspect the page in a different way (dd and pageinspect 
from another instance), and it looks fine - no obvious data corruption, 
the only thing that seems borked is the checksum itself, and only three 
consecutive bits are flipped in the checksum. So this doesn't seem like 
a "stale checksum" - hardware issue is a possibility (the machine has 
ECC RAM though), but it might just as easily be a bug in PostgreSQL, 
when something scribbles over the checksum due to a buffer overflow, 
just before we write the buffer to the OS.  So 'false failures' are not 
an entirely impossible thing.
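
(For anyone wanting to replicate the inspection: pulling a single
suspect block out for pageinspect is a one-liner; the dboid/relfilenode
and block number below are placeholders, and 8k pages are assumed:

    dd if=$PGDATA/base/<dboid>/<relfilenode> of=page.bin \
       bs=8192 skip=123 count=1

    -- then, with page.bin copied somewhere below the scratch
    -- instance's data directory, since pg_read_binary_file() refuses
    -- paths outside it:
    SELECT * FROM page_header(pg_read_binary_file('page.bin', 0, 8192));
)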

And no, backups may not be a suitable solution - the failure happens on 
a standby, and the page (luckily) is not corrupted on the master. Which 
means that perhaps the standby got corrupted by a WAL, which would 
affect the backups too. I can't verify this, though, because the WAL got 
removed from the archive, already. But it's a possibility.

So I think we're not ready to enable checksums by default for everyone, 
not until we can provide tools to deal with failures like this (I don't 
think users will be amused if we tell them to use 'dd' and inspect the 
pages in a hex editor).

ISTM the way forward is to keep the current default (disabled), but to 
allow enabling checksums on the fly. That will mostly fix the issue for 
people who actually want checksums but don't realize they need to enable 
them at initdb time (and starting from scratch is not an option for 
them), are running on good hardware and are capable of dealing with 
checksum errors if needed, even without more built-in tooling.

Being able to disable checksums on the fly is nice, but it only really 
solves the issue of extra overhead - it doesn't really help with the 
failures (particularly when you can't even start the database, because 
of a checksum failure in the startup phase).

So, shall we discuss what tooling would be useful / desirable?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Checksums by default?

From
Robert Haas
Date:
On Fri, Feb 10, 2017 at 7:38 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Incidentally, I've been dealing with a checksum failure reported by a
> customer last week, and based on the experience I tend to agree that we
> don't have the tools needed to deal with checksum failures. I think such
> tooling should be a 'must have' for enabling checksums by default.
>
> In this particular case the checksum failure is particularly annoying
> because it happens during recovery (on a standby, after a restart), during
> startup, so FATAL means shutdown.
>
> I've managed to inspect the page in a different way (dd and pageinspect from
> another instance), and it looks fine - no obvious data corruption, the only
> thing that seems borked is the checksum itself, and only three consecutive
> bits are flipped in the checksum. So this doesn't seem like a "stale
> checksum" - hardware issue is a possibility (the machine has ECC RAM
> though), but it might just as easily be a bug in PostgreSQL, when something
> scribbles over the checksum due to a buffer overflow, just before we write
> the buffer to the OS. So 'false failures' are not an entirely impossible thing.
>
> And no, backups may not be a suitable solution - the failure happens on a
> standby, and the page (luckily) is not corrupted on the master. Which means
> that perhaps the standby got corrupted by a WAL, which would affect the
> backups too. I can't verify this, though, because the WAL got removed from
> the archive, already. But it's a possibility.
>
> So I think we're not ready to enable checksums by default for everyone, not
> until we can provide tools to deal with failures like this (I don't think
> users will be amused if we tell them to use 'dd' and inspect the pages in a
> hex editor).
>
> ISTM the way forward is to keep the current default (disabled), but to allow
> enabling checksums on the fly. That will mostly fix the issue for people who
> actually want checksums but don't realize they need to enable them at initdb
> time (and starting from scratch is not an option for them), are running on
> good hardware and are capable of dealing with checksum errors if needed,
> even without more built-in tooling.
>
> Being able to disable checksums on the fly is nice, but it only really
> solves the issue of extra overhead - it doesn't really help with the failures
> (particularly when you can't even start the database, because of a checksum
> failure in the startup phase).
>
> So, shall we discuss what tooling would be useful / desirable?

FWIW, I appreciate this analysis and I think it's exactly the kind of
thing we need to set a strategy for moving forward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 2/10/17 6:38 PM, Tomas Vondra wrote:
> And no, backups may not be a suitable solution - the failure happens on
> a standby, and the page (luckily) is not corrupted on the master. Which
> means that perhaps the standby got corrupted by a WAL, which would
> affect the backups too. I can't verify this, though, because the WAL got
> removed from the archive, already. But it's a possibility.

Possibly related... I've got a customer that periodically has SR replicas 
stop in their tracks due to WAL checksum failure. I don't think there's 
any hardware correlation (they've seen this on multiple machines). 
Studying the code, it occurred to me that if there's any bugs in the 
handling of individual WAL record sizes or pointers during SR then you 
could get CRC failures. So far every one of these occurrences has been 
repairable by replacing the broken WAL file on the replica. I've 
requested that next time this happens they save the bad WAL.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 02/13/2017 02:29 AM, Jim Nasby wrote:
> On 2/10/17 6:38 PM, Tomas Vondra wrote:
>> And no, backups may not be a suitable solution - the failure happens on
>> a standby, and the page (luckily) is not corrupted on the master. Which
>> means that perhaps the standby got corrupted by a WAL, which would
>> affect the backups too. I can't verify this, though, because the WAL got
>> removed from the archive, already. But it's a possibility.
>
> Possibly related... I've got a customer that periodically has SR replicas
> stop in their tracks due to WAL checksum failure. I don't think there's
> any hardware correlation (they've seen this on multiple machines).
> Studying the code, it occurred to me that if there's any bugs in the
> handling of individual WAL record sizes or pointers during SR then you
> could get CRC failures. So far every one of these occurrences has been
> repairable by replacing the broken WAL file on the replica. I've
> requested that next time this happens they save the bad WAL.

I don't follow. You're talking about WAL checksums, this thread is about 
data checksums. I'm not seeing any WAL checksum failure, but when the 
standby attempts to apply the WAL (in particular a Btree/DELETE WAL 
record), it detects an incorrect data checksum in the underlying table.

So either there's a hardware issue, or the heap got corrupted by some 
preceding WAL. Or maybe one of the tiny gnomes in the CPU got tired and 
punched the bits wrong.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 02/11/2017 01:38 AM, Tomas Vondra wrote:
>
> Incidentally, I've been dealing with a checksum failure reported by a
> customer last week, and based on the experience I tend to agree that we
> don't have the tools needed to deal with checksum failures. I think such
> tooling should be a 'must have' for enabling checksums by default.
>
> In this particular case the checksum failure is particularly annoying
> because it happens during recovery (on a standby, after a restart),
> during startup, so FATAL means shutdown.
>
> I've managed to inspect the page in a different way (dd and pageinspect
> from another instance), and it looks fine - no obvious data corruption,
> the only thing that seems borked is the checksum itself, and only three
> consecutive bits are flipped in the checksum. So this doesn't seem like
> a "stale checksum" - hardware issue is a possibility (the machine has
> ECC RAM though), but it might just as easily be a bug in PostgreSQL,
> when something scribbles over the checksum due to a buffer overflow,
> just before we write the buffer to the OS. So 'false failures' are not
> an entirely impossible thing.
>

Not to leave this without any resolution: it seems the issue was caused 
by a SAN. Some configuration change or similar work was being done at 
the time of the issue, and the SAN somehow ended up writing a page into 
a different relfilenode, at a different block.
a btree index and got written into a heap relfilenode, but otherwise it 
was correct - the only thing that changed seems to be the block number, 
which explains the minimal difference in the checksum.

I don't think we'll learn much more, but it seems the checksums did 
their work in detecting the issue.
>
> So I think we're not ready to enable checksums by default for everyone,
> not until we can provide tools to deal with failures like this (I don't
> think users will be amused if we tell them to use 'dd' and inspect the
> pages in a hex editor).
>
> ISTM the way forward is to keep the current default (disabled), but to
> allow enabling checksums on the fly. That will mostly fix the issue for
> people who actually want checksums but don't realize they need to enable
> them at initdb time (and starting from scratch is not an option for
> them), are running on good hardware and are capable of dealing with
> checksum errors if needed, even without more built-in tooling.
>
> Being able to disable checksums on the fly is nice, but it only really
> solves the issue of extra overhead - it does not really help with the
> failures (particularly when you can't even start the database, because
> of a checksum failure in the startup phase).
>
> So, shall we discuss what tooling would be useful / desirable?
>

Although the checksums did detect the issue (we might never have 
noticed without them, or maybe the instance would have mysteriously 
crashed), I still think better tooling is needed.

I've posted some minor pageinspect improvements I hacked together while 
investigating this, but I don't think pageinspect is a very good tool 
for investigating checksum / data corruption issues, for a number of 
reasons:

1) It does not work at all when the instance does not even start - you 
have to manually dump the pages and try inspecting them from another 
instance.

2) Even then it assumes the pages are not corrupted, and may easily 
cause segfaults or other issues if that's not the case.

3) It works only on a manual, page-by-page basis.

4) It does not even try to resolve the issue.

For example, I think it'd be great to have a tool that works even on 
instances that are not running - something that recursively walks 
through all files in a data directory, verifies checksums on 
everything, and lists/dumps pages with broken checksums for further 
inspection. I have an alpha-alpha version of something along those 
lines, written before the root cause was identified.
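
To illustrate the core idea, a rough and untested sketch (several 
assumptions: it borrows pg_checksum_page() by including 
storage/checksum_impl.h, so it has to be built against the PostgreSQL 
source tree the way pg_filedump does; it takes a single relation 
segment file on the command line; and it ignores forks and segment 
numbering - blocks in segments after the first would need the segment 
offset added to blkno):

/* verify_checksums.c - rough sketch of an offline checksum verifier */
#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/checksum_impl.h"

#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    static const char zeropage[BLCKSZ]; /* all zeroes, for comparison */
    char        page[BLCKSZ];
    BlockNumber blkno = 0;
    FILE       *f;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <relation-segment-file>\n", argv[0]);
        return 1;
    }

    if ((f = fopen(argv[1], "rb")) == NULL)
    {
        perror(argv[1]);
        return 1;
    }

    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        /* new, never-written pages are all zeroes and carry no checksum */
        if (memcmp(page, zeropage, BLCKSZ) != 0)
        {
            PageHeader  phdr = (PageHeader) page;
            uint16      expected = pg_checksum_page(page, blkno);

            if (phdr->pd_checksum != expected)
                printf("block %u: stored checksum %u, computed %u\n",
                       blkno, phdr->pd_checksum, expected);
        }
        blkno++;
    }
    fclose(f);
    return 0;
}

A tool like this obviously can't tell a PostgreSQL bug from real 
corruption, but it would at least produce a list of suspect blocks to 
dump and inspect, without needing a running instance.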

It'd be nice to have something that could help with fixing the issues 
(e.g. by fetching the last FPI from a backup, or something similar). 
But that's clearly way more difficult.

There are probably some other tools that might be useful when dealing 
with data corruption (e.g. scrubbing to detect it).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Bruce Momjian
Date:
On Sat, Jan 21, 2017 at 12:46:05PM -0500, Stephen Frost wrote:
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> > As we don't know the performance impact is (there was no benchmark done
> > on reasonably current code base) I really don't understand how you can
> > judge if it's worth it or not.
> 
> Because I see having checksums as, frankly, something we always should
> have had (as most other databases do, for good reason...) and because
> they will hopefully prevent data loss.  I'm willing to give up a fair
> bit to minimize the risk of losing data.

Do these other databases do checksums because they don't do
full_page_writes?  They just detect torn pages rather than repair them
like we do?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Checksums by default?

From
Magnus Hagander
Date:
On Thu, Feb 23, 2017 at 10:37 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Sat, Jan 21, 2017 at 12:46:05PM -0500, Stephen Frost wrote:
> * Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:
> > As we don't know the performance impact is (there was no benchmark done
> > on reasonably current code base) I really don't understand how you can
> > judge if it's worth it or not.
>
> Because I see having checksums as, frankly, something we always should
> have had (as most other databases do, for good reason...) and because
> they will hopefully prevent data loss.  I'm willing to give up a fair
> bit to minimize the risk of losing data.

Do these other databases do checksums because they don't do
full_page_writes?  They just detect torn pages rather than repair them
like we do?

Torn page detection is usually done by means other than checksums; I don't think the two are necessarily related.

--

Re: [HACKERS] Checksums by default?

From
Bruce Momjian
Date:
On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
> > It might be worth looking into using the CRC CPU instruction to reduce this
> > overhead, like we do for the WAL checksums. Since that is a different
> > algorithm it would be a compatibility break and we would need to support the
> > old algorithm for upgraded clusters..
> 
> We looked at that when picking the algorithm. At that point it seemed
> that CRC CPU instructions were not universal enough to rely on them.
> The algorithm we ended up on was designed to be fast on SIMD hardware.
> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
> with default compiles there is a lot of performance left on the table. A
> low hanging fruit would be to do CPU detection like the CRC case and
> enable a SSE4.1 optimized variant when those instructions are
> available. IIRC it was actually a lot faster than the naive hardware
> CRC that is used for WAL and about on par with interleaved CRC.

Uh, I thought we already did compile-time testing for SSE4.1 and used
those instructions if present.  Why do you say "with default compiles
there is a lot of performance left on the table"?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Checksums by default?

From
Tomas Vondra
Date:
On 02/24/2017 06:47 PM, Bruce Momjian wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>>> It might be worth looking into using the CRC CPU instruction to reduce this
>>> overhead, like we do for the WAL checksums. Since that is a different
>>> algorithm it would be a compatibility break and we would need to support the
>>> old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table. A
>> low hanging fruit would be to do CPU detection like the CRC case and
>> enable a SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought we already did compile-time testing for SSE4.1 and used
> those instructions if present.  Why do you say "with default compiles
> there is a lot of performance left on the table"?
>

Compile-time is not enough - we build binary packages that may then be 
installed on machines without the SSE4.1 instructions available.

On Intel this may not be a huge issue - the first microarchitecture 
with SSE4.1 was "Penryn" (the 45nm Core 2 shrink), launched in late 
2007, so we're only left with quite old boxes based on the original 
"Intel Core" (and perhaps the even older P6).

On AMD, it's a bit worse - the first micro-architecture with SSE4.1 was 
Bulldozer (late 2011). So quite a few CPUs out there, even if most 
people use Intel.

In any case, we can't just build x86-64 packages with compile-time 
SSE4.1 checks.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Fri, Feb 24, 2017 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support the
>> > old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table. A
>> low hanging fruit would be to do CPU detection like the CRC case and
>> enable a SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought we already did compile-time testing for SSE4.1 and used
> those instructions if present.  Why do you say "with default compiles
> there is a lot of performance left on the table"?

Compile-time checks don't help because the compiled binary could be
run on a different host that does not have SSE4.1 (however unlikely
that is at this point in time). A runtime check is done for the
WAL checksums, which use a special CRC32 instruction. Block checksums
predate that and use a different algorithm, picked because it could be
accelerated with vectorized execution on non-Intel architectures. We
just never got around to adding runtime checks for the architecture to
enable this speedup.
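
To sketch what such a runtime switch could look like (the function
names here are made up for illustration; the only real API used is the
GCC/Clang __builtin_cpu_init()/__builtin_cpu_supports() pair, which do
the CPUID probing for us):

/* runtime_dispatch.c - hypothetical sketch of runtime ISA dispatch */
#include <stdint.h>

/*
 * Two builds of the same checksum loop: the SSE4.1 variant would live
 * in a separate translation unit compiled with -msse4.1
 * -ftree-vectorize, the generic one with default flags. Both names
 * are invented for this sketch.
 */
extern uint16_t checksum_page_generic(char *page, uint32_t blkno);
extern uint16_t checksum_page_sse41(char *page, uint32_t blkno);

typedef uint16_t (*checksum_fn) (char *, uint32_t);

static checksum_fn checksum_page_impl;

static void
checksum_select_impl(void)
{
    __builtin_cpu_init();       /* populate the CPU feature flags */
    if (__builtin_cpu_supports("sse4.1"))
        checksum_page_impl = checksum_page_sse41;
    else
        checksum_page_impl = checksum_page_generic;
}

The same could be done with an extra "avx2" branch, or with GCC's
ifunc attribute instead of a manual function pointer.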

The attached test runs 1M iterations of the checksum about 3x faster
when compiled with SSE4.1 and vectorization, 4x if AVX2 is added into
the mix.

Test:
gcc $CFLAGS -Isrc/include -DN=1000000 testchecksum.c -o testchecksum
&& time ./testchecksum

Results:
CFLAGS="-O2": 2.364s
CFLAGS="-O2 -msse4.1 -ftree-vectorize": 0.752s
CFLAGS="-O2 -mavx2 -ftree-vectorize": 0.552s

That 0.552s is 15GB/s per core on a 3 year old laptop.

Regards,
Ants Aasma


Re: [HACKERS] Checksums by default?

From
Bruce Momjian
Date:
On Fri, Feb 24, 2017 at 08:31:09PM +0200, Ants Aasma wrote:
> >> We looked at that when picking the algorithm. At that point it seemed
> >> that CRC CPU instructions were not universal enough to rely on them.
> >> The algorithm we ended up on was designed to be fast on SIMD hardware.
> >> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
> >> with default compiles there is a lot of performance left on the table. A
> >> low hanging fruit would be to do CPU detection like the CRC case and
> >> enable a SSE4.1 optimized variant when those instructions are
> >> available. IIRC it was actually a lot faster than the naive hardware
> >> CRC that is used for WAL and about on par with interleaved CRC.
> >
> > Uh, I thought we already did compile-time testing for SSE4.1 and used
> > those instructions if present.  Why do you say "with default compiles
> > there is a lot of performance left on the table"?
> 
> Compile-time checks don't help because the compiled binary could be
> run on a different host that does not have SSE4.1 (however unlikely
> that is at this point in time). A runtime check is done for the

Right.

> WAL checksums, which use a special CRC32 instruction. Block checksums
> predate that and use a different algorithm, picked because it could be
> accelerated with vectorized execution on non-Intel architectures. We
> just never got around to adding runtime checks for the architecture to
> enable this speedup.

Oh, that's why we will hopefully eventually change the page checksum
algorithm to use the special CRC32 instruction, and set a new checksum
version --- got it.  I assume there is currently no compile-time way to
do this.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Checksums by default?

From
Jim Nasby
Date:
On 2/24/17 12:30 PM, Tomas Vondra wrote:
> In any case, we can't just build x86-64 packages with compile-time
> SSE4.1 checks.

Dumb question... since we're already discussing llvm for the executor, 
would that potentially be an option here? AIUI that also opens the 
possibility of using the GPU as well.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Checksums by default?

From
Bruce Momjian
Date:
On Fri, Feb 24, 2017 at 01:49:07PM -0600, Jim Nasby wrote:
> On 2/24/17 12:30 PM, Tomas Vondra wrote:
> >In any case, we can't just build x86-64 packages with compile-time
> >SSE4.1 checks.
> 
> Dumb question... since we're already discussing llvm for the executor, would
> that potentially be an option here? AIUI that also opens the possibility of
> using the GPU as well.

Uh, as far as I know, the best you are going to get from llvm is
standard assembly, while SSE4.1 provides special vector instructions,
so those would be faster - and in a way they are a GPU built into the
CPU.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Oh, that's why we will hopefully eventually change the page checksum
> algorithm to use the special CRC32 instruction, and set a new checksum
> version --- got it.  I assume there is currently no compile-time way to
> do this.

Using CRC32 as implemented now for the WAL would be significantly
slower than what we have now due to instruction latency. Even the best
theoretical implementation using the CRC32 instruction would still be
about the same speed as what we have now. I haven't seen anybody
working on swapping out the current algorithm. And I don't really see
a reason to - it would introduce a load of headaches for no real gain.

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Fri, Feb 24, 2017 at 9:49 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 2/24/17 12:30 PM, Tomas Vondra wrote:
>>
>> In any case, we can't just build x86-64 packages with compile-time
>> SSE4.1 checks.
>
>
> Dumb question... since we're already discussing llvm for the executor, would
> that potentially be an option here? AIUI that also opens the possibility of
> using the GPU as well.

Just transferring the block to the GPU would be slower than what we
have now. Theoretically LLVM could be used to JIT the checksum
calculation, but just precompiling a couple of versions and switching
between them at runtime would be simpler and would give the same
speedup.

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Fri, Feb 24, 2017 at 10:02 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Uh, as far as I know, the best you are going to get from llvm is
> standard assembly, while the SSE4.1 instructions use special assembly
> instructions, so they would be faster, and in a way they are a GPU built
> into CPUs.

Both LLVM and GCC are capable of compiling the code we have into a
vectorized loop using SSE4.1 or AVX2 instructions, given the proper
compilation flags. This is exactly what was giving the speedup in the
test I showed in my e-mail.

Regards,
Ants Aasma



Re: [HACKERS] Checksums by default?

From
Bruce Momjian
Date:
On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote:
> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Oh, that's why we will hopefully eventually change the page checksum
> > algorithm to use the special CRC32 instruction, and set a new checksum
> > version --- got it.  I assume there is currently no compile-time way to
> > do this.
> 
> Using CRC32 as implemented now for the WAL would be significantly
> slower than what we have now due to instruction latency. Even the best
> theoretical implementation using the CRC32 instruction would still be
> about the same speed as what we have now. I haven't seen anybody
> working on swapping out the current algorithm. And I don't really see
> a reason to - it would introduce a load of headaches for no real gain.

Uh, I am confused.  I thought you said we were leaving some performance
on the table.  What is that?  I thought CRC32 was SSE4.1.  Why is CRC32
good for the WAL but bad for the page checksums?  What about the WAL
page images?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Checksums by default?

From
Ants Aasma
Date:
On Fri, Feb 24, 2017 at 10:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote:
>> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > Oh, that's why we will hopefully eventually change the page checksum
>> > algorithm to use the special CRC32 instruction, and set a new checksum
>> > version --- got it.  I assume there is currently no compile-time way to
>> > do this.
>>
>> Using CRC32 as implemented now for the WAL would be significantly
>> slower than what we have now due to instruction latency. Even the best
>> theoretical implementation using the CRC32 instruction would still be
>> about the same speed than what we have now. I haven't seen anybody
>> working on swapping out the current algorithm. And I don't really see
>> a reason to, it would introduce a load of headaches for no real gain.
>
> Uh, I am confused.  I thought you said we were leaving some performance
> on the table.  What is that?  I thought CRC32 was SSE4.1.  Why is CRC32
> good for the WAL but bad for the page checksums?  What about the WAL
> page images?

The page checksum algorithm was designed to take advantage of CPUs
that provide vectorized 32bit integer multiplication. On x86 this was
introduced with the SSE4.1 extensions, which means that by default we
can't take advantage of the design. The code is written so that
compiler auto-vectorization works on it, so only the appropriate
compilation flags are needed to compile a version that does use vector
instructions. However, to enable this in generic builds, a runtime
switch between different levels of vectorization support is needed.
This is what is leaving the performance on the table.

The page checksum algorithm we have is extremely fast - memcpy fast.
Even without vectorization it is right up there with Murmurhash3a and
xxHash. With vectorization it's 4x faster. And it works this fast on
most modern CPUs, not only Intel. The downside is that it only works
well for large blocks, and only for a fixed power-of-2 size with the
current implementation. WAL page images have the page hole removed, so
they can't easily take advantage of this.
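
For reference, here is a condensed sketch of the vectorization-friendly
shape of the loop (simplified from src/include/storage/checksum_impl.h;
the real code seeds each lane from a table of base offsets and runs two
extra mixing rounds per lane before folding, so treat this as an
illustration, not the actual algorithm):

/* checksum_sketch.c - simplified illustration of the lane structure */
#include <stdint.h>
#include <stddef.h>

#define N_SUMS    32            /* number of independent FNV-like lanes */
#define FNV_PRIME 16777619

static uint32_t
checksum_block_sketch(const uint32_t *data, size_t nwords)
{
    uint32_t    sums[N_SUMS] = {0}; /* real code uses per-lane seeds */
    uint32_t    result = 0;
    size_t      i, j;

    /*
     * Lane j touches every N_SUMS-th word. The lanes are fully
     * independent, which is what lets the compiler turn the inner loop
     * into SSE4.1/AVX2 vector multiplies under -ftree-vectorize.
     */
    for (i = 0; i + N_SUMS <= nwords; i += N_SUMS)
        for (j = 0; j < N_SUMS; j++)
        {
            uint32_t    tmp = sums[j] ^ data[i + j];

            sums[j] = tmp * FNV_PRIME ^ (tmp >> 17);
        }

    /* fold the lanes into a single 32-bit result */
    for (j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}

This also shows why the page hole is a problem for WAL page images: the
stride assumes a contiguous, fixed-size block.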

That said, I haven't really seen either the hardware accelerated CRC32
calculation nor the non-vectorized page checksum take a noticeable
amount of time on real world workloads. The benchmarks presented in
this thread seem to corroborate this observation.

Regards,
Ants Aasma