Thread: Re: [PATCHES] wal_checksum = on (default) | off

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> In this thread, I outlined an idea for reducing cost of WAL CRC checking
> http://archives.postgresql.org/pgsql-hackers/2006-10/msg01299.php
> wal_checksum = on (default) | off

This still seems awfully dangerous to me.

> Recovery can occur with/without same setting of wal_checksum, to avoid
> complications from crashes immediately after turning GUC on.

Surely not.  Otherwise even the "on" setting is not really a defense.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Thu, 2007-01-04 at 10:00 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > In this thread, I outlined an idea for reducing cost of WAL CRC checking
> > http://archives.postgresql.org/pgsql-hackers/2006-10/msg01299.php
> > wal_checksum = on (default) | off
>
> This still seems awfully dangerous to me.

Understood.

> > Recovery can occur with/without same setting of wal_checksum, to avoid
> > complications from crashes immediately after turning GUC on.
>
> Surely not.  Otherwise even the "on" setting is not really a defense.

Only when the CRC is exactly zero, which happens very very rarely.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Thu, 2007-01-04 at 10:00 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> Recovery can occur with/without same setting of wal_checksum, to avoid
>>> complications from crashes immediately after turning GUC on.
>>
>> Surely not.  Otherwise even the "on" setting is not really a defense.

> Only when the CRC is exactly zero, which happens very very rarely.

"It works most of the time" doesn't exactly satisfy me.  What's the
use-case for changing the variable on the fly anyway?  Seems a better
solution is just to lock down the setting at postmaster start.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Thu, 2007-01-04 at 11:09 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-01-04 at 10:00 -0500, Tom Lane wrote:
> >> "Simon Riggs" <simon@2ndquadrant.com> writes:
> >>> Recovery can occur with/without same setting of wal_checksum, to avoid
> >>> complications from crashes immediately after turning GUC on.
> >>
> >> Surely not.  Otherwise even the "on" setting is not really a defense.
>
> > Only when the CRC is exactly zero, which happens very very rarely.
>
> "It works most of the time" doesn't exactly satisfy me.  What's the
> use-case for changing the variable on the fly anyway?  Seems a better
> solution is just to lock down the setting at postmaster start.

That would prevent us from using the secondary checkpoint location, in
the case of a crash effecting the primary checkpoint when it is a
shutdown checkpoint where we changed the setting of wal_checksum. It
seemed safer to allow a very rare error through to the next level of
error checking rather than to close the door so tight that recovery
would not be possible in a very rare case.

If your're good with server start, so am I.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
Florian Weimer
Date:
* Simon Riggs:

>> Surely not.  Otherwise even the "on" setting is not really a defense.
>
> Only when the CRC is exactly zero, which happens very very rarely.

Have you tried switching to Adler32 instead of CRC32?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Thu, 2007-01-04 at 17:58 +0100, Florian Weimer wrote:
> * Simon Riggs:
>
> >> Surely not.  Otherwise even the "on" setting is not really a defense.
> >
> > Only when the CRC is exactly zero, which happens very very rarely.
>
> Have you tried switching to Adler32 instead of CRC32?

No. Please explain further.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
Florian Weimer <fweimer@bfk.de> writes:
> Have you tried switching to Adler32 instead of CRC32?

Is anything known about the error detection capabilities of Adler32?
There's a lot of math behind CRCs but AFAIR Adler's method is pretty
much ad-hoc.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Thu, 2007-01-04 at 11:09 -0500, Tom Lane wrote:
>> "It works most of the time" doesn't exactly satisfy me.

> It seemed safer to allow a very rare error through to the next level of
> error checking rather than to close the door so tight that recovery
> would not be possible in a very rare case.

If a DBA is turning checksums off at all, he's already bought into the
assumption that he's prepared to recover from backups.  What you don't
seem to get here is that this "feature" is pretty darn questionable in
the first place, and for it to have a side effect of poking a hole in
the system's reliability even when it's off is more than enough to get
it rejected outright.  It's just a No Sale.

I don't believe that the hole is real small, either, as
overwrite-with-zeroes is not exactly an unheard-of failure mode for
filesystems.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Thu, 2007-01-04 at 12:13 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-01-04 at 11:09 -0500, Tom Lane wrote:
> >> "It works most of the time" doesn't exactly satisfy me.
>
> > It seemed safer to allow a very rare error through to the next level of
> > error checking rather than to close the door so tight that recovery
> > would not be possible in a very rare case.
>
> If a DBA is turning checksums off at all, he's already bought into the
> assumption that he's prepared to recover from backups.  What you don't
> seem to get here is that this "feature" is pretty darn questionable in
> the first place, and for it to have a side effect of poking a hole in
> the system's reliability even when it's off is more than enough to get
> it rejected outright.  It's just a No Sale.

I get it, and I listened. I'm was/am happy to do it the way you
suggested; I was merely explaining that I had considered the issue.

New patch enclosed.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Attachment

Re: [PATCHES] wal_checksum = on (default) | off

From
Florian Weimer
Date:
* Tom Lane:

> Florian Weimer <fweimer@bfk.de> writes:
>> Have you tried switching to Adler32 instead of CRC32?
>
> Is anything known about the error detection capabilities of Adler32?
> There's a lot of math behind CRCs but AFAIR Adler's method is pretty
> much ad-hoc.

Correct me if I'm wrong, but the main reason for the WAL CRC is to
detect partial WAL writes (due to improper caching, for instance).
This means that you're out of the realm of traditional CRC analysis
anyway, because the things you are guarding against are neither burts
errors nor n-bit errors (for small n).

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
Florian Weimer <fw@deneb.enyo.de> writes:
> * Tom Lane:
>> There's a lot of math behind CRCs but AFAIR Adler's method is pretty
>> much ad-hoc.

> Correct me if I'm wrong, but the main reason for the WAL CRC is to
> detect partial WAL writes (due to improper caching, for instance).

Well, that's *a* reason, but not the only one, and IMHO not one that
gives any particular guidance on what kind of checksum to use.

> This means that you're out of the realm of traditional CRC analysis
> anyway, because the things you are guarding against are neither burts
> errors nor n-bit errors (for small n).

I think short burst errors are fairly likely: the kind of scenario I'm
worried about is a wild store corrupting a word of a WAL entry while
it's waiting around to be written in the WAL buffers.  So the CRC math
does give me some comfort that that'll be detected.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
Florian Weimer
Date:
* Tom Lane:

> I think short burst errors are fairly likely: the kind of scenario I'm
> worried about is a wild store corrupting a word of a WAL entry while
> it's waiting around to be written in the WAL buffers.

Ah, does this mean that each WAL entry gets its own checksum?  In this
case, Adler32 is indeed suboptimal because it doesn't use the full 32
bits for short inputs.  It might still catch many wild stores, but the
statistics are worse than for CRC32.

(I had assumed that PostgreSQLs WAL checksumming was justified by the
partial write issue.  The wild store could easily occur with a heap
page, too, and AFAIK, tuples, aren't checksummed.  Which would be an
interesting option, I guess.)

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
Florian Weimer <fw@deneb.enyo.de> writes:
> Ah, does this mean that each WAL entry gets its own checksum?

Right.

> (I had assumed that PostgreSQLs WAL checksumming was justified by the
> partial write issue.  The wild store could easily occur with a heap
> page, too, and AFAIK, tuples, aren't checksummed.  Which would be an
> interesting option, I guess.)

We've discussed it but there's never been a pressing reason to do it.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
"Zeugswetter Andreas ADI SD"
Date:
> > >>> Recovery can occur with/without same setting of wal_checksum, to
avoid
> > >>> complications from crashes immediately after turning GUC on.
> > >>
> > >> Surely not.  Otherwise even the "on" setting is not really a
defense.
> >
> > > Only when the CRC is exactly zero, which happens very very rarely.
> >
> > "It works most of the time" doesn't exactly satisfy me.  What's the

Agreed

> > use-case for changing the variable on the fly anyway?  Seems a
better
> > solution is just to lock down the setting at postmaster start.

I guess that the use case is more for a WAL based replicate, that
has/wants a different setting. Maybe we want a WAL entry for the change,
or force a log switch (so you can interrupt the replicate, change it's
setting
and proceed with the next log) ?

Maybe a 3rd mode for replicates that ignores 0 CRC's ?

Andreas

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Fri, 2007-01-05 at 11:01 +0100, Zeugswetter Andreas ADI SD wrote:

> > > What's the use-case for changing the variable on the fly anyway?  Seems a
> better
> > > solution is just to lock down the setting at postmaster start.
>
> I guess that the use case is more for a WAL based replicate, that
> has/wants a different setting. Maybe we want a WAL entry for the change,
> or force a log switch (so you can interrupt the replicate, change it's
> setting
> and proceed with the next log) ?
>
> Maybe a 3rd mode for replicates that ignores 0 CRC's ?

Well, wal_checksum allows you to have this turned ON for the main server
and OFF on a Warm Standby.

The recovery process doesn't check for postgresql.conf reloads, so
setting it at server start is effectively the same thing in that case.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
"Zeugswetter Andreas ADI SD"
Date:
> > > > What's the use-case for changing the variable on the fly anyway?
Seems a
> > better
> > > > solution is just to lock down the setting at postmaster start.
> >
> > I guess that the use case is more for a WAL based replicate, that
> > has/wants a different setting. Maybe we want a WAL entry for the
change,
> > or force a log switch (so you can interrupt the replicate, change
it's
> > setting
> > and proceed with the next log) ?
> >
> > Maybe a 3rd mode for replicates that ignores 0 CRC's ?
>
> Well, wal_checksum allows you to have this turned ON for the main
server
> and OFF on a Warm Standby.

Ok, so when you need CRC's on a replicate (but not on the master) you
turn it
off during standby replay, but turn it on when you start the replicate
for normal operation.

Andreas

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:
> Ok, so when you need CRC's on a replicate (but not on the master) you
> turn it
> off during standby replay, but turn it on when you start the replicate
> for normal operation.

Thought: even when it's off, the CRC had better be computed for
shutdown-checkpoint records.  Else there's no way to turn it on even
with a postmaster restart --- unless we accept the idea of poking a hole
in the normal mode.  (Which I still dislike, and even more so if the
special value is zero.  Almost any other value would be safer than zero.)

On the whole, though, I still don't want to put this in.  I don't think
Simon has thought it through sufficiently, and we haven't even seen any
demonstration of a big speedup.

(Another hole in the implementation as given: pg_resetxlog.)

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
"Zeugswetter Andreas ADI SD"
Date:
> > Ok, so when you need CRC's on a replicate (but not on the master)
you
> > turn it
> > off during standby replay, but turn it on when you start the
replicate
> > for normal operation.
>
> Thought: even when it's off, the CRC had better be computed for
> shutdown-checkpoint records.  Else there's no way to turn it on even
> with a postmaster restart --- unless we accept the idea of poking a
hole
> in the normal mode.  (Which I still dislike, and even more so if the
> special value is zero.  Almost any other value would be safer than
zero.)
>
> On the whole, though, I still don't want to put this in.  I don't
think
> Simon has thought it through sufficiently,

Well, the part that we do not really want a special value (at least not
0)
is new, and makes things a bit more complicated.

> and we haven't even seen any demonstration of a big speedup.

Yes, iirc the demonstration was with the 64 bit crc instead of the
sufficient
32-bit (or a bad crc compiler optimization?).
But I do think it can be shown to provide significant speedup
(at least peak burst performance).

Especially on target hardware WAL write IO is extremely fast
(since it is write cached), so the CPU should show.

Andreas

Re: [PATCHES] wal_checksum = on (default) | off

From
Jim Nasby
Date:
On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
> Ok, so when you need CRC's on a replicate (but not on the master) you
> turn it
> off during standby replay, but turn it on when you start the replicate
> for normal operation.

Which sounds to me like a good reason to allow the option in
recovery.conf as well...
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
Jim Nasby <decibel@decibel.org> writes:
> On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
>> Ok, so when you need CRC's on a replicate (but not on the master) you

> Which sounds to me like a good reason to allow the option in
> recovery.conf as well...

Actually, I'm not seeing the use-case for a slave having a different
setting from the master at all?

    "My backup server is less reliable than the primary."

    "My backup server is more reliable than the primary."

Somehow, neither of these statements seem likely to be uttered by
a sane DBA ...

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
"Joshua D. Drake"
Date:
> Actually, I'm not seeing the use-case for a slave having a different
> setting from the master at all?
>
>     "My backup server is less reliable than the primary."
>
>     "My backup server is more reliable than the primary."
>
> Somehow, neither of these statements seem likely to be uttered by
> a sane DBA ...

My backup server is actually my dev machine.
My backup server is just a reporting machine.
My backup machine is using SATA just because it is just an absolute
emergency machine.
My backups machine is also my web server.

Real world dictates differently. Let's not forget that not every company
can spend 100k on two identical machines, yet many companies can spend
50k + 5k for a backup machine based on Sata or secondary services.

Sincerely,

Joshua D. Drake

--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Fri, 2007-01-05 at 22:57 -0500, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
> > On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
> >> Ok, so when you need CRC's on a replicate (but not on the master) you
>
> > Which sounds to me like a good reason to allow the option in
> > recovery.conf as well...
>
> Actually, I'm not seeing the use-case for a slave having a different
> setting from the master at all?
>
>     "My backup server is less reliable than the primary."
>
>     "My backup server is more reliable than the primary."
>
> Somehow, neither of these statements seem likely to be uttered by
> a sane DBA ...

If I take a backup of a server and bring it up on a new system, the
blocks in the backup will not have been CRC checked before they go to
disk.

If I take the same server and now stream log records across to it, why
*must* that data be CRC checked, when the original data has not been?

I'm proposing choice, with a safe default. That's all.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
Bruce Momjian
Date:
Simon Riggs wrote:
> > Somehow, neither of these statements seem likely to be uttered by
> > a sane DBA ...
>
> If I take a backup of a server and bring it up on a new system, the
> blocks in the backup will not have been CRC checked before they go to
> disk.
>
> If I take the same server and now stream log records across to it, why
> *must* that data be CRC checked, when the original data has not been?
>
> I'm proposing choice, with a safe default. That's all.

Are there performance numbers to justify the option?  We don't give
people options unless there is real value to it.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [PATCHES] wal_checksum = on (default) | off

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Fri, 2007-01-05 at 22:57 -0500, Tom Lane wrote:
> > Jim Nasby <decibel@decibel.org> writes:
> > > On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
> > >> Ok, so when you need CRC's on a replicate (but not on the master) you
> >
> > > Which sounds to me like a good reason to allow the option in
> > > recovery.conf as well...
> >
> > Actually, I'm not seeing the use-case for a slave having a different
> > setting from the master at all?
> >
> >     "My backup server is less reliable than the primary."
> >
> >     "My backup server is more reliable than the primary."
> >
> > Somehow, neither of these statements seem likely to be uttered by
> > a sane DBA ...
>
> If I take a backup of a server and bring it up on a new system, the
> blocks in the backup will not have been CRC checked before they go to
> disk.
>
> If I take the same server and now stream log records across to it, why
> *must* that data be CRC checked, when the original data has not been?
>
> I'm proposing choice, with a safe default. That's all.

I am assuming this item is dead because no performance results have been
reported.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Wed, 2007-01-10 at 23:32 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Fri, 2007-01-05 at 22:57 -0500, Tom Lane wrote:
> > > Jim Nasby <decibel@decibel.org> writes:
> > > > On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
> > > >> Ok, so when you need CRC's on a replicate (but not on the master) you
> > >
> > > > Which sounds to me like a good reason to allow the option in
> > > > recovery.conf as well...
> > >
> > > Actually, I'm not seeing the use-case for a slave having a different
> > > setting from the master at all?
> > >
> > >     "My backup server is less reliable than the primary."
> > >
> > >     "My backup server is more reliable than the primary."
> > >
> > > Somehow, neither of these statements seem likely to be uttered by
> > > a sane DBA ...
> >
> > If I take a backup of a server and bring it up on a new system, the
> > blocks in the backup will not have been CRC checked before they go to
> > disk.
> >
> > If I take the same server and now stream log records across to it, why
> > *must* that data be CRC checked, when the original data has not been?
> >
> > I'm proposing choice, with a safe default. That's all.
>
> I am assuming this item is dead because no performance results have been
> reported.

It's been on my hold queue, as a result of its lack of clear acceptance.

Results from earlier tests show the routines which are dominated by CRC
checking overhead are prominent in a number of important workloads.
Those workloads all have a substantial disk component, so test results
will vary between no saving at all on a disk-bound system to some
savings on a CPU bound system.

Restore        RecordIsValid() #1 on oprofile results at 50-70% CPU

COPY        XLogInsert() #1 on oprofile results at 17% CPU
        (full_page_writes = on)

OLTP        no test with full_page_writes = on (less relevant)

OLTP        XLogInsert() #5 on oprofile results at 1.2% CPU
        (full_page_writes = off)
Removing the CRC checks on WAL would likely be the easiest to remove 1%
CPU on the system as it stands. Other changes require algorithmic or
architectural changes to improve matters, though gains can be much
larger. 1% doesn't sound much, but PostgreSQL is a very sleek beast
these days. As we improve things in other areas the importance of this
patch as a tuning switch will grow.

Clearly the current patch is not accepted, but can we imagine a patch
that saved substantial CPU time in these areas that would be acceptable?
*Always* as a non-default option, IMHO, with careful documentation as to
its possible use.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> COPY        XLogInsert() #1 on oprofile results at 17% CPU
>         (full_page_writes = on)

But what portion of that is actually CRC-related?  XLogInsert does quite
a lot.

Anyway, I can't see degrading the reliability of the system for a gain
in the range of a few percent, which is the most that we'd be likely
to get here ... for a factor of two or more, maybe people would be
willing to take a risk.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Thu, 2007-01-11 at 09:01 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > COPY        XLogInsert() #1 on oprofile results at 17% CPU
> >         (full_page_writes = on)
>
> But what portion of that is actually CRC-related?  XLogInsert does quite
> a lot.
>
> Anyway, I can't see degrading the reliability of the system for a gain
> in the range of a few percent, which is the most that we'd be likely
> to get here ... for a factor of two or more, maybe people would be
> willing to take a risk.

All I would add is that the loss of reliability was not certain in all
cases, otherwise I myself would have dropped the idea long ago. With the
spectre of doubt surrounding this, I'm happy to drop the idea until we
have proof/greater certainty either way.

Patch revoked.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> COPY        XLogInsert() #1 on oprofile results at 17% CPU
>>         (full_page_writes = on)
>
> But what portion of that is actually CRC-related?  XLogInsert does quite
> a lot.
>
> Anyway, I can't see degrading the reliability of the system for a gain
> in the range of a few percent, which is the most that we'd be likely
> to get here ... for a factor of two or more, maybe people would be
> willing to take a risk.

Well the problem with that is when you have dozens of 1% improvements which
are additive and the net negative effects aren't additive. Since we can't
quantify the effect on reliability here it's hard to tell if that's the case.

What did you think about protecting against torn writes using id numbers every
512 bytes. That "improves" the reliability of the system in that it detects
torn writes 100% of the instead of just (2^32-1)/2^32 (or 99.99999998%) of the
time. (which poking a hole in the CRC mechanism would reduce to (2^31-1)/2^31
(or 99.99999995%)).

I don't think either of those percentages are significant. If you crash your
system once a day and always get a torn WAL page you're still far more likely
to still get corrupted data due to cosmic rays long before you get corrupted
data due to mistakenly applying a torn page from WAL.

But if we can save CPU on every WAL write while not harming reliability (in
fact increasing it, albeit insignificantly) why not?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> What did you think about protecting against torn writes using id numbers every
> 512 bytes.

Pretty much not happening; or are you volunteering to fix every part of
the system to tolerate injections of inserted data anywhere in a stored
datum?

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> What did you think about protecting against torn writes using id numbers every
>> 512 bytes.
>
> Pretty much not happening; or are you volunteering to fix every part of
> the system to tolerate injections of inserted data anywhere in a stored
> datum?

I was thinking to do it at a low level as the xlog records are prepared to be
written to the filesystem and as the data is being read from disk. I haven't
read that code yet to see where to inject it but I understand there's already
a copy happening and it could be done there. Even if we optimize out all the
copies we could do it in the actual i/o call using readv/writev.

I wasn't thinking of doing it on actual disk buffers since it doesn't help us
avoid full page writes, but it could be done there too using readv/writev in
the smgr. That might be useful for a full_page_writes=off system so even if it
can't guarantee no corruption at least it can guarantee no silent corruption.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Pretty much not happening; or are you volunteering to fix every part of
>> the system to tolerate injections of inserted data anywhere in a stored
>> datum?

> I was thinking to do it at a low level as the xlog records are prepared to be
> written to the filesystem and as the data is being read from disk. I haven't
> read that code yet to see where to inject it but I understand there's already
> a copy happening and it could be done there.

You understand wrong ... a tuple sitting on disk is normally read
directly from the shared buffer, and I don't think we want to pay for
copying it.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> Pretty much not happening; or are you volunteering to fix every part of
>>> the system to tolerate injections of inserted data anywhere in a stored
>>> datum?
>
>> I was thinking to do it at a low level as the xlog records are prepared to be
>> written to the filesystem and as the data is being read from disk. I haven't
>> read that code yet to see where to inject it but I understand there's already
>> a copy happening and it could be done there.
>
> You understand wrong ... a tuple sitting on disk is normally read
> directly from the shared buffer, and I don't think we want to pay for
> copying it.

"xlog records"


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [PATCHES] wal_checksum = on (default) | off

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> You understand wrong ... a tuple sitting on disk is normally read
>> directly from the shared buffer, and I don't think we want to pay for
>> copying it.

> "xlog records"

Oh, sorry, had the wrong context in mind.  I'm still not very impressed
with the idea --- a CRC check will catch many kinds of problems, whereas
this approach catches exactly one kind of problem.

            regards, tom lane

Re: [PATCHES] wal_checksum = on (default) | off

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Oh, sorry, had the wrong context in mind.  I'm still not very impressed
> with the idea --- a CRC check will catch many kinds of problems, whereas
> this approach catches exactly one kind of problem.

Well in fairness I tossed in a throwaway comment at the end of that email
about heap pages. I'll do the same here since I can't resist. But the main
thread here is about xlog really.

It just seems to me like it's better to target each problem with a solution
that addresses it directly than have one feature that we hope hope addresses
them all more or less.

Having a CRC in WAL but not in the heap seems kind of pointless. If your
hardware is unreliable the corruption could anywhere. Depending on it to solve
multiple problems means we can't offer the option to disable it because it
would affect other things as well.

What I would like to see is a CRC option that would put CRC checks in every
disk page whether heap, index, WAL, control file, etc. I think we would
default that to off to match our current setup most closely.

Separately we would have a feature in WAL to detect torn pages so that we can
reliably detect the end of valid WAL. That would have to always be on. But
having it as a separate feature means the CRC could be optional.

Also, incidentally like I mentioned in my previous email, we could do the torn
page detection in heap pages too by handling it in the smgr using
readv/writev. No copies, no corrupted datums. Essentially the tags would be
inserted on the fly as the data was copied into kernel space.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [PATCHES] wal_checksum = on (default) | off

From
"Simon Riggs"
Date:
On Thu, 2007-01-11 at 17:06 +0000, Gregory Stark wrote:
> Having a CRC in WAL but not in the heap seems kind of pointless.

Yes...

> If your
> hardware is unreliable the corruption could anywhere.

Agreed.

Other DBMS have one setting for the whole server; I've never seen
separate settings for WAL and data.

> Depending on it to solve
> multiple problems means we can't offer the option to disable it
> because it
> would affect other things as well.
>
> What I would like to see is a CRC option that would put CRC checks in
> every
> disk page whether heap, index, WAL, control file, etc. I think we
> would
> default that to off to match our current setup most closely.
>
> Separately we would have a feature in WAL to detect torn pages so that
> we can
> reliably detect the end of valid WAL. That would have to always be on.
> But
> having it as a separate feature means the CRC could be optional.

Your thoughts seem logical to me.

It does seem a bigger project than I'd envisaged, but doable, one day.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] wal_checksum = on (default) | off

From
Martijn van Oosterhout
Date:
On Thu, Jan 11, 2007 at 11:10:38PM +0000, Simon Riggs wrote:
> On Thu, 2007-01-11 at 17:06 +0000, Gregory Stark wrote:
> > Having a CRC in WAL but not in the heap seems kind of pointless.
>
> Yes...
>
> > If your
> > hardware is unreliable the corruption could anywhere.
>
> Agreed.

I thought the point was that the WAL protects against unexpected power
failure, that sort of thing. In that situation, the memory is the first
to be corrupted, and an active DMA transfer will thus be corrupted
also. We don't need to worry about the data, because the WAL is known
to be accurate.

The WAL does not protect against random data corruption, in normal
operation it is never read. If we want to detect random corruption,
we'd need checksum everywhere, yes. But that's not the goal here.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.