Thread: Re: [PATCHES] wal_checksum = on (default) | off
"Simon Riggs" <simon@2ndquadrant.com> writes:
> In this thread, I outlined an idea for reducing cost of WAL CRC checking
> http://archives.postgresql.org/pgsql-hackers/2006-10/msg01299.php
> wal_checksum = on (default) | off

This still seems awfully dangerous to me.

> Recovery can occur with/without same setting of wal_checksum, to avoid
> complications from crashes immediately after turning GUC on.

Surely not.  Otherwise even the "on" setting is not really a defense.

			regards, tom lane
On Thu, 2007-01-04 at 10:00 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > In this thread, I outlined an idea for reducing cost of WAL CRC checking
> > http://archives.postgresql.org/pgsql-hackers/2006-10/msg01299.php
> > wal_checksum = on (default) | off
>
> This still seems awfully dangerous to me.

Understood.

> > Recovery can occur with/without same setting of wal_checksum, to avoid
> > complications from crashes immediately after turning GUC on.
>
> Surely not.  Otherwise even the "on" setting is not really a defense.

Only when the CRC is exactly zero, which happens very very rarely.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Thu, 2007-01-04 at 10:00 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> Recovery can occur with/without same setting of wal_checksum, to avoid
>>> complications from crashes immediately after turning GUC on.
>>
>> Surely not.  Otherwise even the "on" setting is not really a defense.

> Only when the CRC is exactly zero, which happens very very rarely.

"It works most of the time" doesn't exactly satisfy me.  What's the
use-case for changing the variable on the fly anyway?  Seems a better
solution is just to lock down the setting at postmaster start.

			regards, tom lane
On Thu, 2007-01-04 at 11:09 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-01-04 at 10:00 -0500, Tom Lane wrote:
> >> "Simon Riggs" <simon@2ndquadrant.com> writes:
> >>> Recovery can occur with/without same setting of wal_checksum, to avoid
> >>> complications from crashes immediately after turning GUC on.
> >>
> >> Surely not.  Otherwise even the "on" setting is not really a defense.
>
> > Only when the CRC is exactly zero, which happens very very rarely.
>
> "It works most of the time" doesn't exactly satisfy me.  What's the
> use-case for changing the variable on the fly anyway?  Seems a better
> solution is just to lock down the setting at postmaster start.

That would prevent us from using the secondary checkpoint location in
the case of a crash affecting the primary checkpoint, when it is a
shutdown checkpoint where we changed the setting of wal_checksum. It
seemed safer to allow a very rare error through to the next level of
error checking rather than to close the door so tight that recovery
would not be possible in a very rare case.

If you're good with server start, so am I.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
* Simon Riggs:

>> Surely not.  Otherwise even the "on" setting is not really a defense.
>
> Only when the CRC is exactly zero, which happens very very rarely.

Have you tried switching to Adler32 instead of CRC32?

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On Thu, 2007-01-04 at 17:58 +0100, Florian Weimer wrote:
> * Simon Riggs:
>
> >> Surely not.  Otherwise even the "on" setting is not really a defense.
> >
> > Only when the CRC is exactly zero, which happens very very rarely.
>
> Have you tried switching to Adler32 instead of CRC32?

No. Please explain further.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Florian Weimer <fweimer@bfk.de> writes:
> Have you tried switching to Adler32 instead of CRC32?

Is anything known about the error detection capabilities of Adler32?
There's a lot of math behind CRCs but AFAIR Adler's method is pretty
much ad-hoc.

			regards, tom lane
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Thu, 2007-01-04 at 11:09 -0500, Tom Lane wrote:
>> "It works most of the time" doesn't exactly satisfy me.

> It seemed safer to allow a very rare error through to the next level of
> error checking rather than to close the door so tight that recovery
> would not be possible in a very rare case.

If a DBA is turning checksums off at all, he's already bought into the
assumption that he's prepared to recover from backups.  What you don't
seem to get here is that this "feature" is pretty darn questionable in
the first place, and for it to have a side effect of poking a hole in
the system's reliability even when it's off is more than enough to get
it rejected outright.  It's just a No Sale.

I don't believe that the hole is real small, either, as
overwrite-with-zeroes is not exactly an unheard-of failure mode for
filesystems.

			regards, tom lane
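[The overwrite-with-zeroes objection can be made concrete with a short
sketch. This is not the real RecordIsValid() logic, just the objected-to
"zero CRC means skip the check" rule in miniature, with hypothetical
names, in Python for brevity:]

```python
import zlib

def record_is_valid(payload: bytes, stored_crc: int) -> bool:
    # Hypothetical sketch of the proposed escape hatch: a stored CRC of
    # exactly zero is taken to mean "written with wal_checksum = off",
    # so the check is skipped entirely.
    if stored_crc == 0:
        return True
    return zlib.crc32(payload) == stored_crc

# An intact record passes, and a corrupted one fails, as expected.
rec = b"some WAL record payload"
assert record_is_valid(rec, zlib.crc32(rec))
assert not record_is_valid(b"corrupted" + rec, zlib.crc32(rec))

# But a region overwritten with zeroes by a failing filesystem reads
# back with a zero CRC field too -- and sails straight through.
assert record_is_valid(b"\x00" * 64, 0)
```

[That last assertion is exactly the hole Tom describes: the one failure
mode a zero sentinel cannot distinguish is the filesystem failure mode
he considers most likely.]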
On Thu, 2007-01-04 at 12:13 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-01-04 at 11:09 -0500, Tom Lane wrote:
> >> "It works most of the time" doesn't exactly satisfy me.
>
> > It seemed safer to allow a very rare error through to the next level of
> > error checking rather than to close the door so tight that recovery
> > would not be possible in a very rare case.
>
> If a DBA is turning checksums off at all, he's already bought into the
> assumption that he's prepared to recover from backups.  What you don't
> seem to get here is that this "feature" is pretty darn questionable in
> the first place, and for it to have a side effect of poking a hole in
> the system's reliability even when it's off is more than enough to get
> it rejected outright.  It's just a No Sale.

I get it, and I listened. I was/am happy to do it the way you
suggested; I was merely explaining that I had considered the issue.

New patch enclosed.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
* Tom Lane:

> Florian Weimer <fweimer@bfk.de> writes:
>> Have you tried switching to Adler32 instead of CRC32?
>
> Is anything known about the error detection capabilities of Adler32?
> There's a lot of math behind CRCs but AFAIR Adler's method is pretty
> much ad-hoc.

Correct me if I'm wrong, but the main reason for the WAL CRC is to
detect partial WAL writes (due to improper caching, for instance).
This means that you're out of the realm of traditional CRC analysis
anyway, because the things you are guarding against are neither burst
errors nor n-bit errors (for small n).
Florian Weimer <fw@deneb.enyo.de> writes:
> * Tom Lane:
>> There's a lot of math behind CRCs but AFAIR Adler's method is pretty
>> much ad-hoc.

> Correct me if I'm wrong, but the main reason for the WAL CRC is to
> detect partial WAL writes (due to improper caching, for instance).

Well, that's *a* reason, but not the only one, and IMHO not one that
gives any particular guidance on what kind of checksum to use.

> This means that you're out of the realm of traditional CRC analysis
> anyway, because the things you are guarding against are neither burst
> errors nor n-bit errors (for small n).

I think short burst errors are fairly likely: the kind of scenario I'm
worried about is a wild store corrupting a word of a WAL entry while
it's waiting around to be written in the WAL buffers.  So the CRC math
does give me some comfort that that'll be detected.

			regards, tom lane
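[The comfort from the CRC math here can be checked empirically: CRC-32
provably detects every single-bit error, and every burst error shorter
than its 32-bit polynomial. A brute-force sweep with Python's zlib
(illustrative only, not PostgreSQL's actual CRC code) confirms this for
a sample record:]

```python
import zlib

msg = bytearray(b"a WAL entry waiting in the WAL buffers")
good = zlib.crc32(msg)

# Flip each bit of the record in turn: the CRC-32 guarantee is that
# every single-bit error (the simplest "wild store") changes the CRC.
for i in range(len(msg) * 8):
    msg[i // 8] ^= 1 << (i % 8)
    assert zlib.crc32(msg) != good, f"undetected error at bit {i}"
    msg[i // 8] ^= 1 << (i % 8)

assert zlib.crc32(msg) == good  # record restored intact
```

[Adler-32 offers no comparable guarantee, which is what makes the CRC
math relevant even outside the partial-write scenario.]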
* Tom Lane:

> I think short burst errors are fairly likely: the kind of scenario I'm
> worried about is a wild store corrupting a word of a WAL entry while
> it's waiting around to be written in the WAL buffers.

Ah, does this mean that each WAL entry gets its own checksum?  In this
case, Adler32 is indeed suboptimal because it doesn't use the full 32
bits for short inputs.  It might still catch many wild stores, but the
statistics are worse than for CRC32.

(I had assumed that PostgreSQL's WAL checksumming was justified by the
partial write issue.  The wild store could easily occur with a heap
page, too, and AFAIK tuples aren't checksummed.  Which would be an
interesting option, I guess.)
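[The short-input weakness is easy to demonstrate, since Python's zlib
happens to provide both checksums. For a one-byte input b, Adler-32
yields (1+b) in each of its two 16-bit halves, so only 256 of the 2^32
possible checksum values can ever occur, all of them small:]

```python
import zlib

# Adler-32 over one byte b is ((1+b) << 16) | (1+b): both halves are
# determined by the single byte, so the checksum space collapses.
adler = {zlib.adler32(bytes([b])) for b in range(256)}
crc = {zlib.crc32(bytes([b])) for b in range(256)}

assert len(adler) == 256 and len(crc) == 256
assert max(adler) == 256 * 65537   # every Adler value is below 2**25
assert max(crc) > 2**31            # CRC-32 spreads over the full word
```

[For per-record WAL checksums, where many records are only tens of
bytes long, this is precisely the "doesn't use the full 32 bits"
problem described above.]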
Florian Weimer <fw@deneb.enyo.de> writes:
> Ah, does this mean that each WAL entry gets its own checksum?

Right.

> (I had assumed that PostgreSQL's WAL checksumming was justified by the
> partial write issue.  The wild store could easily occur with a heap
> page, too, and AFAIK tuples aren't checksummed.  Which would be an
> interesting option, I guess.)

We've discussed it but there's never been a pressing reason to do it.

			regards, tom lane
> > >>> Recovery can occur with/without same setting of wal_checksum, to avoid
> > >>> complications from crashes immediately after turning GUC on.
> > >>
> > >> Surely not.  Otherwise even the "on" setting is not really a defense.
> >
> > > Only when the CRC is exactly zero, which happens very very rarely.
> >
> > "It works most of the time" doesn't exactly satisfy me.  What's the

Agreed

> > use-case for changing the variable on the fly anyway?  Seems a better
> > solution is just to lock down the setting at postmaster start.

I guess that the use case is more for a WAL based replicate, that
has/wants a different setting. Maybe we want a WAL entry for the
change, or force a log switch (so you can interrupt the replicate,
change its setting and proceed with the next log)?

Maybe a 3rd mode for replicates that ignores 0 CRC's?

Andreas
On Fri, 2007-01-05 at 11:01 +0100, Zeugswetter Andreas ADI SD wrote:
> > > What's the use-case for changing the variable on the fly anyway?
> > > Seems a better solution is just to lock down the setting at
> > > postmaster start.
>
> I guess that the use case is more for a WAL based replicate, that
> has/wants a different setting. Maybe we want a WAL entry for the
> change, or force a log switch (so you can interrupt the replicate,
> change its setting and proceed with the next log)?
>
> Maybe a 3rd mode for replicates that ignores 0 CRC's?

Well, wal_checksum allows you to have this turned ON for the main server
and OFF on a Warm Standby. The recovery process doesn't check for
postgresql.conf reloads, so setting it at server start is effectively
the same thing in that case.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
> > > > What's the use-case for changing the variable on the fly anyway?
> > > > Seems a better solution is just to lock down the setting at
> > > > postmaster start.
> >
> > I guess that the use case is more for a WAL based replicate, that
> > has/wants a different setting. Maybe we want a WAL entry for the
> > change, or force a log switch (so you can interrupt the replicate,
> > change its setting and proceed with the next log)?
> >
> > Maybe a 3rd mode for replicates that ignores 0 CRC's?
>
> Well, wal_checksum allows you to have this turned ON for the main server
> and OFF on a Warm Standby.

Ok, so when you need CRC's on a replicate (but not on the master) you
turn it off during standby replay, but turn it on when you start the
replicate for normal operation.

Andreas
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:
> Ok, so when you need CRC's on a replicate (but not on the master) you
> turn it off during standby replay, but turn it on when you start the
> replicate for normal operation.

Thought: even when it's off, the CRC had better be computed for
shutdown-checkpoint records.  Else there's no way to turn it on even
with a postmaster restart --- unless we accept the idea of poking a hole
in the normal mode.  (Which I still dislike, and even more so if the
special value is zero.  Almost any other value would be safer than
zero.)

On the whole, though, I still don't want to put this in.  I don't think
Simon has thought it through sufficiently, and we haven't even seen any
demonstration of a big speedup.

(Another hole in the implementation as given: pg_resetxlog.)

			regards, tom lane
> > Ok, so when you need CRC's on a replicate (but not on the master) you
> > turn it off during standby replay, but turn it on when you start the
> > replicate for normal operation.
>
> Thought: even when it's off, the CRC had better be computed for
> shutdown-checkpoint records.  Else there's no way to turn it on even
> with a postmaster restart --- unless we accept the idea of poking a hole
> in the normal mode.  (Which I still dislike, and even more so if the
> special value is zero.  Almost any other value would be safer than zero.)
>
> On the whole, though, I still don't want to put this in.  I don't think
> Simon has thought it through sufficiently,

Well, the part that we do not really want a special value (at least not
0) is new, and makes things a bit more complicated.

> and we haven't even seen any demonstration of a big speedup.

Yes, IIRC the demonstration was with the 64-bit CRC instead of the
sufficient 32-bit (or a bad CRC compiler optimization?). But I do think
it can be shown to provide significant speedup (at least peak burst
performance). Especially on target hardware WAL write IO is extremely
fast (since it is write cached), so the CPU should show.

Andreas
On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
> Ok, so when you need CRC's on a replicate (but not on the master) you
> turn it off during standby replay, but turn it on when you start the
> replicate for normal operation.

Which sounds to me like a good reason to allow the option in
recovery.conf as well...

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Jim Nasby <decibel@decibel.org> writes:
> On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
>> Ok, so when you need CRC's on a replicate (but not on the master) you

> Which sounds to me like a good reason to allow the option in
> recovery.conf as well...

Actually, I'm not seeing the use-case for a slave having a different
setting from the master at all?

	"My backup server is less reliable than the primary."

	"My backup server is more reliable than the primary."

Somehow, neither of these statements seem likely to be uttered by
a sane DBA ...

			regards, tom lane
> Actually, I'm not seeing the use-case for a slave having a different
> setting from the master at all?
>
> 	"My backup server is less reliable than the primary."
>
> 	"My backup server is more reliable than the primary."
>
> Somehow, neither of these statements seem likely to be uttered by
> a sane DBA ...

My backup server is actually my dev machine.

My backup server is just a reporting machine.

My backup machine is using SATA just because it is an absolute
emergency machine.

My backup machine is also my web server.

Real world dictates differently. Let's not forget that not every
company can spend 100k on two identical machines, yet many companies
can spend 50k + 5k for a backup machine based on SATA or secondary
services.

Sincerely,

Joshua D. Drake

-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
On Fri, 2007-01-05 at 22:57 -0500, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
> > On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
> > >> Ok, so when you need CRC's on a replicate (but not on the master) you
>
> > Which sounds to me like a good reason to allow the option in
> > recovery.conf as well...
>
> Actually, I'm not seeing the use-case for a slave having a different
> setting from the master at all?
>
> "My backup server is less reliable than the primary."
>
> "My backup server is more reliable than the primary."
>
> Somehow, neither of these statements seem likely to be uttered by
> a sane DBA ...

If I take a backup of a server and bring it up on a new system, the
blocks in the backup will not have been CRC checked before they go to
disk.

If I take the same server and now stream log records across to it, why
*must* that data be CRC checked, when the original data has not been?

I'm proposing choice, with a safe default. That's all.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs wrote:
> > Somehow, neither of these statements seem likely to be uttered by
> > a sane DBA ...
>
> If I take a backup of a server and bring it up on a new system, the
> blocks in the backup will not have been CRC checked before they go to
> disk.
>
> If I take the same server and now stream log records across to it, why
> *must* that data be CRC checked, when the original data has not been?
>
> I'm proposing choice, with a safe default. That's all.

Are there performance numbers to justify the option?  We don't give
people options unless there is real value to it.

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Simon Riggs wrote:
> On Fri, 2007-01-05 at 22:57 -0500, Tom Lane wrote:
> > Jim Nasby <decibel@decibel.org> writes:
> > > On Jan 5, 2007, at 6:30 AM, Zeugswetter Andreas ADI SD wrote:
> > >> Ok, so when you need CRC's on a replicate (but not on the master) you
> >
> > > Which sounds to me like a good reason to allow the option in
> > > recovery.conf as well...
> >
> > Actually, I'm not seeing the use-case for a slave having a different
> > setting from the master at all?
> >
> > "My backup server is less reliable than the primary."
> >
> > "My backup server is more reliable than the primary."
> >
> > Somehow, neither of these statements seem likely to be uttered by
> > a sane DBA ...
>
> If I take a backup of a server and bring it up on a new system, the
> blocks in the backup will not have been CRC checked before they go to
> disk.
>
> If I take the same server and now stream log records across to it, why
> *must* that data be CRC checked, when the original data has not been?
>
> I'm proposing choice, with a safe default. That's all.

I am assuming this item is dead because no performance results have been
reported.

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Wed, 2007-01-10 at 23:32 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > If I take a backup of a server and bring it up on a new system, the
> > blocks in the backup will not have been CRC checked before they go to
> > disk.
> >
> > If I take the same server and now stream log records across to it, why
> > *must* that data be CRC checked, when the original data has not been?
> >
> > I'm proposing choice, with a safe default. That's all.
>
> I am assuming this item is dead because no performance results have been
> reported.

It's been on my hold queue, as a result of its lack of clear acceptance.

Results from earlier tests show that the routines dominated by CRC
checking overhead are prominent in a number of important workloads.
Those workloads all have a substantial disk component, so test results
will vary between no saving at all on a disk-bound system and some
savings on a CPU-bound system.

Restore  RecordIsValid() #1 on oprofile results at 50-70% CPU
COPY     XLogInsert() #1 on oprofile results at 17% CPU
         (full_page_writes = on)
OLTP     no test with full_page_writes = on (less relevant)
OLTP     XLogInsert() #5 on oprofile results at 1.2% CPU
         (full_page_writes = off)

Removing the CRC checks on WAL would likely be the easiest way to
remove 1% CPU on the system as it stands. Other changes require
algorithmic or architectural changes to improve matters, though gains
can be much larger.

1% doesn't sound much, but PostgreSQL is a very sleek beast these days.
As we improve things in other areas the importance of this patch as a
tuning switch will grow.

Clearly the current patch is not accepted, but can we imagine a patch
that saved substantial CPU time in these areas that would be
acceptable? *Always* as a non-default option, IMHO, with careful
documentation as to its possible use.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
> COPY     XLogInsert() #1 on oprofile results at 17% CPU
>          (full_page_writes = on)

But what portion of that is actually CRC-related?  XLogInsert does quite
a lot.

Anyway, I can't see degrading the reliability of the system for a gain
in the range of a few percent, which is the most that we'd be likely
to get here ... for a factor of two or more, maybe people would be
willing to take a risk.

			regards, tom lane
On Thu, 2007-01-11 at 09:01 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > COPY     XLogInsert() #1 on oprofile results at 17% CPU
> >          (full_page_writes = on)
>
> But what portion of that is actually CRC-related?  XLogInsert does quite
> a lot.
>
> Anyway, I can't see degrading the reliability of the system for a gain
> in the range of a few percent, which is the most that we'd be likely
> to get here ... for a factor of two or more, maybe people would be
> willing to take a risk.

All I would add is that the loss of reliability was not certain in all
cases, otherwise I myself would have dropped the idea long ago. With
the spectre of doubt surrounding this, I'm happy to drop the idea until
we have proof/greater certainty either way.

Patch revoked.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> COPY     XLogInsert() #1 on oprofile results at 17% CPU
>>          (full_page_writes = on)
>
> But what portion of that is actually CRC-related?  XLogInsert does quite
> a lot.
>
> Anyway, I can't see degrading the reliability of the system for a gain
> in the range of a few percent, which is the most that we'd be likely
> to get here ... for a factor of two or more, maybe people would be
> willing to take a risk.

Well, the problem with that is when you have dozens of 1% improvements
which are additive and the net negative effects aren't additive. Since
we can't quantify the effect on reliability here it's hard to tell if
that's the case.

What did you think about protecting against torn writes using id
numbers every 512 bytes? That "improves" the reliability of the system
in that it detects torn writes 100% of the time instead of just
(2^32-1)/2^32 (or 99.99999998%) of the time (which poking a hole in the
CRC mechanism would reduce to (2^31-1)/2^31 (or 99.99999995%)).

I don't think either of those percentages are significant. If you crash
your system once a day and always get a torn WAL page you're still far
more likely to get corrupted data due to cosmic rays long before you
get corrupted data due to mistakenly applying a torn page from WAL. But
if we can save CPU on every WAL write while not harming reliability (in
fact increasing it, albeit insignificantly) why not?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> What did you think about protecting against torn writes using id
> numbers every 512 bytes?

Pretty much not happening; or are you volunteering to fix every part of
the system to tolerate injections of inserted data anywhere in a stored
datum?

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> What did you think about protecting against torn writes using id
>> numbers every 512 bytes?
>
> Pretty much not happening; or are you volunteering to fix every part of
> the system to tolerate injections of inserted data anywhere in a stored
> datum?

I was thinking to do it at a low level as the xlog records are prepared
to be written to the filesystem and as the data is being read from
disk. I haven't read that code yet to see where to inject it but I
understand there's already a copy happening and it could be done there.
Even if we optimize out all the copies we could do it in the actual i/o
call using readv/writev.

I wasn't thinking of doing it on actual disk buffers since it doesn't
help us avoid full page writes, but it could be done there too using
readv/writev in the smgr. That might be useful for a
full_page_writes=off system so even if it can't guarantee no corruption
at least it can guarantee no silent corruption.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
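[The tag scheme being proposed can be sketched as follows. The helper
names, the 4-byte tag width, and the byte-copying are all illustrative;
the real thing would interleave the tags via the readv/writev iovec
rather than copying, as described above. Python for brevity:]

```python
SECTOR = 512   # disk sector size: the atomic unit of a write
TAG_LEN = 4    # hypothetical per-page sequence tag width

def tag_sectors(page: bytes, seq: int) -> bytes:
    # On write: prefix every 512-byte sector with the page's sequence
    # tag, so all sectors of one write carry the same mark.
    tag = seq.to_bytes(TAG_LEN, "little")
    return b"".join(tag + page[off:off + SECTOR]
                    for off in range(0, len(page), SECTOR))

def untag_sectors(raw: bytes) -> bytes:
    # On read: every sector must carry the same tag; a mismatch means
    # some sectors still hold the previous write -- a torn page.
    tag, chunks = raw[:TAG_LEN], []
    for off in range(0, len(raw), SECTOR + TAG_LEN):
        if raw[off:off + TAG_LEN] != tag:
            raise ValueError("torn write detected")
        chunks.append(raw[off + TAG_LEN:off + TAG_LEN + SECTOR])
    return b"".join(chunks)

page = bytes(range(256)) * 8              # a 2 kB "WAL page"
raw = tag_sectors(page, seq=7)
assert untag_sectors(raw) == page         # round-trips cleanly

# Simulate a torn write: only the first sector of the new version made
# it to disk; the tail sectors still hold the previous page (seq=6).
torn = raw[:SECTOR + TAG_LEN] + tag_sectors(page, seq=6)[SECTOR + TAG_LEN:]
try:
    untag_sectors(torn)
    raise AssertionError("torn page went undetected")
except ValueError:
    pass
```

[Unlike a CRC, this detects a torn write with certainty rather than
probabilistically, but it detects only that one failure mode, which is
exactly the trade-off debated below.]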
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Pretty much not happening; or are you volunteering to fix every part of
>> the system to tolerate injections of inserted data anywhere in a stored
>> datum?

> I was thinking to do it at a low level as the xlog records are prepared
> to be written to the filesystem and as the data is being read from
> disk. I haven't read that code yet to see where to inject it but I
> understand there's already a copy happening and it could be done there.

You understand wrong ... a tuple sitting on disk is normally read
directly from the shared buffer, and I don't think we want to pay for
copying it.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> Pretty much not happening; or are you volunteering to fix every part of
>>> the system to tolerate injections of inserted data anywhere in a stored
>>> datum?
>
>> I was thinking to do it at a low level as the xlog records are prepared
>> to be written to the filesystem and as the data is being read from
>> disk. I haven't read that code yet to see where to inject it but I
>> understand there's already a copy happening and it could be done there.
>
> You understand wrong ... a tuple sitting on disk is normally read
> directly from the shared buffer, and I don't think we want to pay for
> copying it.

"xlog records"

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> You understand wrong ... a tuple sitting on disk is normally read
>> directly from the shared buffer, and I don't think we want to pay for
>> copying it.

> "xlog records"

Oh, sorry, had the wrong context in mind.  I'm still not very impressed
with the idea --- a CRC check will catch many kinds of problems, whereas
this approach catches exactly one kind of problem.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Oh, sorry, had the wrong context in mind.  I'm still not very impressed
> with the idea --- a CRC check will catch many kinds of problems, whereas
> this approach catches exactly one kind of problem.

Well, in fairness I tossed in a throwaway comment at the end of that
email about heap pages. I'll do the same here since I can't resist. But
the main thread here is about xlog really.

It just seems to me like it's better to target each problem with a
solution that addresses it directly than have one feature that we hope
addresses them all more or less.

Having a CRC in WAL but not in the heap seems kind of pointless. If
your hardware is unreliable the corruption could be anywhere. Depending
on it to solve multiple problems means we can't offer the option to
disable it because it would affect other things as well.

What I would like to see is a CRC option that would put CRC checks in
every disk page whether heap, index, WAL, control file, etc. I think we
would default that to off to match our current setup most closely.

Separately we would have a feature in WAL to detect torn pages so that
we can reliably detect the end of valid WAL. That would have to always
be on. But having it as a separate feature means the CRC could be
optional.

Also, incidentally, like I mentioned in my previous email, we could do
the torn page detection in heap pages too by handling it in the smgr
using readv/writev. No copies, no corrupted datums. Essentially the
tags would be inserted on the fly as the data was copied into kernel
space.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
On Thu, 2007-01-11 at 17:06 +0000, Gregory Stark wrote:
> Having a CRC in WAL but not in the heap seems kind of pointless.

Yes...

> If your hardware is unreliable the corruption could be anywhere.

Agreed. Other DBMS have one setting for the whole server; I've never
seen separate settings for WAL and data.

> Depending on it to solve multiple problems means we can't offer the
> option to disable it because it would affect other things as well.
>
> What I would like to see is a CRC option that would put CRC checks in
> every disk page whether heap, index, WAL, control file, etc. I think we
> would default that to off to match our current setup most closely.
>
> Separately we would have a feature in WAL to detect torn pages so that
> we can reliably detect the end of valid WAL. That would have to always
> be on. But having it as a separate feature means the CRC could be
> optional.

Your thoughts seem logical to me. It does seem a bigger project than
I'd envisaged, but doable, one day.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Thu, Jan 11, 2007 at 11:10:38PM +0000, Simon Riggs wrote:
> On Thu, 2007-01-11 at 17:06 +0000, Gregory Stark wrote:
> > Having a CRC in WAL but not in the heap seems kind of pointless.
>
> Yes...
>
> > If your hardware is unreliable the corruption could be anywhere.
>
> Agreed.

I thought the point was that the WAL protects against unexpected power
failure, that sort of thing. In that situation, the memory is the first
to be corrupted, and an active DMA transfer will thus be corrupted
also. We don't need to worry about the data, because the WAL is known
to be accurate.

The WAL does not protect against random data corruption; in normal
operation it is never read. If we want to detect random corruption,
we'd need checksums everywhere, yes. But that's not the goal here.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.