Thread: CRCs

CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
Vadim wrote:
> Tom wrote:
> > Bruce wrote:
> > > ... If the CRC on
> > > the WAL log checks for errors that are not checked anywhere else,
> > > then fine, but I thought disk CRC would just duplicate the I/O
> > > subsystem/disk.
> >
> > A disk-block CRC would detect partially written blocks (ie,
> > power drops after disk has written M of the N sectors in a
> > block). The disk's own checks will NOT consider this condition a
> > failure. I'm not convinced that WAL will reliably detect it either
> > (Vadim?).
>
> Idea proposed by Andreas about "physical log" is implemented! Now WAL
> saves whole data blocks on the first modification after a checkpoint.
> This way, on recovery, modified data blocks will first be restored *as
> a whole*. Isn't that much better than just detecting partial writes?

This seems to protect against some partial writes, but see below.

> > Certainly WAL will not help for corruption caused by external agents, 
> > away from any updates that are actually being performed/logged.
>
> What do you mean by "external agents"?

External agents include RAM bit drops and noise on cables when
blocks are (read and re-) written.  Every time data is moved, 
there is a chance of an undetected error being introduced.  The 
disk only promises (within limits) to deliver the sector that 
was written; it doesn't promise that what was written is what 
you meant to write.  Errors of this sort accumulate unless 
caught by end-to-end checks.

External agents include bugs in database code, bugs in OS code,
bugs in disk controller firmware, and bugs in disk firmware.
Each can result in clobbered data, blocks being written in the
wrong place, blocks said to be written but not, and any number
of other variations.  All this code is written by humans, and
even the most thorough testing cannot cover even the majority
of code paths.

External agents include sector errors not caught by the disk CRC: 
the disk only promises to keep the number of errors delivered to a
reasonably low (and documented) level.  It's up to the user to 
notice the errors that slip through.

and Andreas wrote:
> > A disk-block CRC would detect partially written blocks (ie, power
> > drops after disk has written M of the N sectors in a block). The
> > disk's own checks will NOT consider this condition a failure.
>
> But physical log recovery will rewrite every page that was changed
> after last checkpoint, thus this is not an issue anymore.

No.  That assumes that when the drive _says_ the block is written, 
it is really on the disk.  That is not true for IDE drives.  It is 
true for SCSI drives only when the SCSI spec is implemented correctly,
but implementing the spec correctly interferes with favorable benchmark 
results.

> >  I'm not convinced that WAL will reliably detect it either
> > (Vadim?). Certainly WAL will not help for corruption caused by
> > external agents, away from any updates that are actually being
> > performed/logged.
>
> The external agent (if malevolent) could write a correct CRC anyway.
> If, on the other hand, the agent writes complete garbage, vacuum will
> notice.

Vacuum does not check most of the bits in the blocks it reads.  
(Bad bits in metadata will cause a crash only if you're lucky.
If not, they result in more corruption.)

A database is unusual among computer applications in that an error
introduced today can sit unnoticed on the disk, and then result in 
an unnoticed wrong answer six months later.  We need to be able to
detect bad bits as soon as possible, before the backups have been
overwritten.  CRCs are how we can detect cumulative corruption from 
all sources.

Nathan Myers
ncm@zembu.com
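
To illustrate the end-to-end check argued for above, here is a minimal
Python sketch. It is not PostgreSQL code: the 8 KB block size, the 4-byte
trailer layout, and the helper names seal_block and verify_block are
assumptions made for the example, and zlib.crc32 merely stands in for
whatever CRC a database would actually use.

```python
import zlib

BLOCK_SIZE = 8192                # assumed block size, for illustration only
PAYLOAD_SIZE = BLOCK_SIZE - 4    # last 4 bytes hold the CRC (hypothetical layout)


def seal_block(payload: bytes) -> bytes:
    """Pad the payload and append its CRC-32 so the block carries its own check."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one block")
    body = payload.ljust(PAYLOAD_SIZE, b"\x00")
    return body + zlib.crc32(body).to_bytes(4, "big")


def verify_block(block: bytes) -> bool:
    """Recompute the CRC over the body and compare it with the stored trailer."""
    body, stored = block[:-4], int.from_bytes(block[-4:], "big")
    return zlib.crc32(body) == stored


good = seal_block(b"some tuple data")

# Simulate a single bit flipped somewhere between memory and the platter.
damaged = bytearray(good)
damaged[100] ^= 0x01

print(verify_block(good))            # True
print(verify_block(bytes(damaged)))  # False
```

The check only helps if it is run every time the block is read back, so
that a bad block is noticed while good backups still exist.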



RE: CRCs

From
"Mikheev, Vadim"
Date:
> > But physical log recovery will rewrite every page that was changed
> > after last checkpoint, thus this is not an issue anymore.
> 
> No.  That assumes that when the drive _says_ the block is written, 
> it is really on the disk.  That is not true for IDE drives.  It is 
> true for SCSI drives only when the SCSI spec is implemented correctly,
> but implementing the spec correctly interferes with favorable 
> benchmark results.

You know - this is *core* assumption. If drive lies about this then
*nothing* will help you. Do you remember core rule of WAL?
"Changes must be logged *before* changed data pages written".
If this rule will be broken then data files will be inconsistent
after crash recovery and you will not notice this, w/wo CRC in
data blocks.

I agreed that CRCs could help to detect other errors but probably
it's too late for 7.1

Vadim


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jan 12, 2001 at 01:07:56PM -0800, Mikheev, Vadim wrote:
> > > But physical log recovery will rewrite every page that was changed
> > > after last checkpoint, thus this is not an issue anymore.
> > 
> > No.  That assumes that when the drive _says_ the block is written, 
> > it is really on the disk.  That is not true for IDE drives.  It is 
> > true for SCSI drives only when the SCSI spec is implemented correctly,
> > but implementing the spec correctly interferes with favorable 
> > benchmark results.
> 
> You know - this is *core* assumption. If drive lies about this then
> *nothing* will help you. Do you remember core rule of WAL?
> "Changes must be logged *before* changed data pages written".
> If this rule will be broken then data files will be inconsistent
> after crash recovery and you will not notice this, w/wo CRC in
> data blocks.

You can include the data blocks' CRCs in the log entries.

> I agreed that CRCs could help to detect other errors but probably
> it's too late for 7.1.

7.2 is not too far off.  I'm hoping to see it then.

Nathan Myers
ncm@zembu.com


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jan 12, 2001 at 12:35:14PM -0800, Nathan Myers wrote:
> Vadim wrote:
> > What do you mean by "external agents"?
> 
> External agents include RAM bit drops and noise on cables when
> blocks are (read and re-) written.  Every time data is moved, 
> there is a chance of an undetected error being introduced.  The 
> disk only promises (within limits) to deliver the sector that 
> was written; it doesn't promise that what was written is what 
> you meant to write.  Errors of this sort accumulate unless 
> caught by end-to-end checks.
> 
> External agents include bugs in database code, bugs in OS code,
> bugs in disk controller firmware, and bugs in disk firmware.
> Each can result in clobbered data, blocks being written in the
> wrong place, blocks said to be written but not, and any number
> of other variations.  All this code is written by humans, and
> even the most thorough testing cannot cover even the majority
> of code paths.
> 
> External agents include sector errors not caught by the disk CRC: 
> the disk only promises to keep the number of errors delivered to a
> reasonably low (and documented) level.  It's up to the user to 
> notice the errors that slip through.

Interestingly, right after I posted this I noticed that cron 
noticed a corrupt inode in /dev on my machine.  The disk is 
happy with it, but I'm not...

Nathan Myers
ncm@zembu.com


RE: CRCs

From
"Mikheev, Vadim"
Date:
> > You know - this is *core* assumption. If drive lies about this then
> > *nothing* will help you. Do you remember core rule of WAL?
> > "Changes must be logged *before* changed data pages written".
> > If this rule will be broken then data files will be inconsistent
> > after crash recovery and you will not notice this, w/wo CRC in
> > data blocks.
> 
> You can include the data blocks' CRCs in the log entries.

How could it help?

Vadim


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jan 12, 2001 at 02:16:07PM -0800, Mikheev, Vadim wrote:
> > > You know - this is *core* assumption. If drive lies about this then
> > > *nothing* will help you. Do you remember core rule of WAL?
> > > "Changes must be logged *before* changed data pages written".
> > > If this rule will be broken then data files will be inconsistent
> > > after crash recovery and you will not notice this, w/wo CRC in
> > > data blocks.
> > 
> > You can include the data blocks' CRCs in the log entries.
> 
> How could it help?

It wouldn't help you recover, but you would be able to report that 
you cannot recover.

To be more specific, if the blocks referenced in the log are partially 
written, their CRCs will (probably) be wrong.  If they are not 
physically written at all, their CRCs will be correct but will 
not match what is in the log.  In either case the user will know 
immediately that the database has been corrupted, and must fall 
back on a failover image or backup.

It would be no bad thing to include the CRC of the referenced block
wherever a block reference lives in the file format.

Nathan Myers
ncm@zembu.com
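
The two cases described above (a partially written block versus a block
that was never written) can be told apart roughly as follows. This is a
sketch under stated assumptions, not the WAL implementation: the block
layout (body plus 4-byte CRC trailer), the 512-byte sector size, and the
names seal, stored_crc and check_block are all hypothetical.

```python
import zlib

SECTOR = 512   # assumed sector size


def seal(body: bytes) -> bytes:
    """Return the body followed by a 4-byte CRC-32 trailer (hypothetical layout)."""
    return body + zlib.crc32(body).to_bytes(4, "big")


def stored_crc(block: bytes) -> int:
    """Read the CRC trailer that the block carries."""
    return int.from_bytes(block[-4:], "big")


def check_block(block: bytes, logged_crc: int) -> str:
    """Classify an on-disk block against the CRC recorded in its log entry."""
    if zlib.crc32(block[:-4]) != stored_crc(block):
        return "torn"    # partially written: block disagrees with its own CRC
    if stored_crc(block) != logged_crc:
        return "stale"   # intact, but not the version the log says was written
    return "ok"


old = seal(b"AAAA" + bytes(1016))   # 1024 bytes, i.e. two sectors
new = seal(b"BBBB" + bytes(1016))
logged_crc = stored_crc(new)        # what the log entry would carry

# Power dropped after only the first sector of the new block reached the disk.
torn = new[:SECTOR] + old[SECTOR:]

print(check_block(new, logged_crc))   # ok
print(check_block(old, logged_crc))   # stale
print(check_block(torn, logged_crc))  # torn
```

Neither "stale" nor "torn" can be repaired from this information alone;
the check only lets recovery report that the database is damaged.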


Re: CRCs

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
>>>>>> "Changes must be logged *before* changed data pages written".
>>>>>> If this rule will be broken then data files will be inconsistent
>>>>>> after crash recovery and you will not notice this, w/wo CRC in
>>>>>> data blocks.
>>>> 
>>>> You can include the data blocks' CRCs in the log entries.
>> 
>> How could it help?

> It wouldn't help you recover, but you would be able to report that 
> you cannot recover.

How?  The scenario Vadim is pointing out is where the disk drive writes
a changed data block in advance of the WAL log entry describing the
change.  Then power drops and the WAL entry never gets made.  At
restart, how will you realize that that data block now contains data you
don't want?  There's not even a log entry telling you you need to look
at it, much less one that tells you what should be in it.

AFAICS, disk-block CRCs do not guard against mishaps involving intended
writes.  They will help guard against data corruption that might creep
in due to outside factors, however.
        regards, tom lane


RE: CRCs

From
"Mikheev, Vadim"
Date:
> > It wouldn't help you recover, but you would be able to report that 
> > you cannot recover.
> 
> How? The scenario Vadim is pointing out is where the disk 
> drive writes a changed data block in advance of the WAL log entry
> describing the change. Then power drops and the WAL entry never gets
> made. At restart, how will you realize that that data block now
> contains data you don't want? There's not even a log entry telling
> you you need to look at it, much less one that tells you what should
> be in it.
> 
> AFAICS, disk-block CRCs do not guard against mishaps involving intended
> writes. They will help guard against data corruption that might creep
> in due to outside factors, however.

I couldn't describe better -:)

Vadim


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jan 12, 2001 at 06:06:21PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> >>>>>> "Changes must be logged *before* changed data pages written".
> >>>>>> If this rule will be broken then data files will be inconsistent
> >>>>>> after crash recovery and you will not notice this, w/wo CRC in
> >>>>>> data blocks.
> >>>> 
> >>>> You can include the data blocks' CRCs in the log entries.
> >> 
> >> How could it help?
> 
> > It wouldn't help you recover, but you would be able to report that 
> > you cannot recover.
> 
> How?  The scenario Vadim is pointing out is where the disk drive writes
> a changed data block in advance of the WAL log entry describing the
> change.  Then power drops and the WAL entry never gets made.  At
> restart, how will you realize that that data block now contains data you
> don't want?  There's not even a log entry telling you you need to look
> at it, much less one that tells you what should be in it.

OK.  In that case, recent transactions that were acknowledged to user 
programs just disappear.  The database isn't corrupt, but it doesn't
contain what the user believes is in it.

The only way I can think of to guard against that is to have a sequence
number in each acknowledgement sent to users, and also reported when the 
database recovers.  If users log their ACK numbers, they can be compared
when the database comes back up.

Obviously it's better to configure the disk so that it doesn't lie about
what's been written.

> AFAICS, disk-block CRCs do not guard against mishaps involving intended
> writes.  They will help guard against data corruption that might creep
> in due to outside factors, however.

Right.  

Nathan Myers
ncm@zembu.com
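
The acknowledgement-number idea above can be sketched as a toy model.
Everything here is hypothetical (the ToyDatabase class, its commit and
recover methods, and the reached_disk flag that simulates a drive lying
about a write); it only shows how a client-side log of ACK numbers would
expose lost commits after recovery.

```python
class ToyDatabase:
    """Toy model: commits get increasing sequence numbers; a crash may lose a tail."""

    def __init__(self) -> None:
        self.acked = 0      # highest sequence number acknowledged to clients
        self.durable = 0    # highest sequence number that really reached disk

    def commit(self, reached_disk: bool) -> int:
        """Acknowledge a commit and return its sequence number to the client."""
        self.acked += 1
        # A commit is durable only if it and all earlier commits reached disk.
        if reached_disk and self.durable == self.acked - 1:
            self.durable = self.acked
        return self.acked

    def recover(self) -> int:
        """After a crash, report the last sequence number found on disk."""
        self.acked = self.durable
        return self.durable


db = ToyDatabase()
client_log = [
    db.commit(reached_disk=True),    # ACK 1
    db.commit(reached_disk=True),    # ACK 2
    db.commit(reached_disk=False),   # ACK 3: drive claimed success but lied
]

recovered = db.recover()
lost = [seq for seq in client_log if seq > recovered]
print(recovered, lost)   # 2 [3]
```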


Re: CRCs

From
Alfred Perlstein
Date:
* Nathan Myers <ncm@zembu.com> [010112 15:49] wrote:
> On Fri, Jan 12, 2001 at 06:06:21PM -0500, Tom Lane wrote:
> > ncm@zembu.com (Nathan Myers) writes:
> > >>>>>> "Changes must be logged *before* changed data pages written".
> > >>>>>> If this rule will be broken then data files will be inconsistent
> > >>>>>> after crash recovery and you will not notice this, w/wo CRC in
> > >>>>>> data blocks.
> > >>>> 
> > >>>> You can include the data blocks' CRCs in the log entries.
> > >> 
> > >> How could it help?
> > 
> > > It wouldn't help you recover, but you would be able to report that 
> > > you cannot recover.
> > 
> > How?  The scenario Vadim is pointing out is where the disk drive writes
> > a changed data block in advance of the WAL log entry describing the
> > change.  Then power drops and the WAL entry never gets made.  At
> > restart, how will you realize that that data block now contains data you
> > don't want?  There's not even a log entry telling you you need to look
> > at it, much less one that tells you what should be in it.
> 
> OK.  In that case, recent transactions that were acknowledged to user 
> programs just disappear.  The database isn't corrupt, but it doesn't
> contain what the user believes is in it.
> 
> The only way I can think of to guard against that is to have a sequence
> number in each acknowledgement sent to users, and also reported when the 
> database recovers.  If users log their ACK numbers, they can be compared
> when the database comes back up.
> 
> Obviously it's better to configure the disk so that it doesn't lie about
> what's been written.

I thought WAL+fsync wasn't supposed to allow this to happen?

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


RE: CRCs

From
"Mikheev, Vadim"
Date:
> > How?  The scenario Vadim is pointing out is where the disk 
> > drive writes a changed data block in advance of the WAL log
> > entry describing the change. Then power drops and the WAL
> > entry never gets made. At restart, how will you realize that
> > that data block now contains data you don't want? There's not
> > even a log entry telling you you need to look at it, much less
> > one that tells you what should be in it.
> 
> OK. In that case, recent transactions that were acknowledged to user 
> programs just disappear. The database isn't corrupt, but it doesn't
> contain what the user believes is in it.

Example.

1. Tuple was inserted into index.
2. Looking for free buffer bufmgr decides to write index block.
3. Following WAL core rule bufmgr first calls XLogFlush() to write
   and fsync log record related to index tuple insertion.
4. *Believing* that log record is on disk now (after successful fsync)
   bufmgr writes index block.

If log record was not really flushed on disk in 3. but on-disk image of
index block was updated in 4. and system crashed after this then after
restart recovery you'll have unlawful index tuple pointing to where?
Who knows! No guarantee that corresponding heap tuple was flushed on
disk.

Isn't database corrupted now?

Vadim
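
The failure in the example above can be reproduced with a toy model of a
drive with a volatile write cache. This is only an illustration: the Disk
class, its honest flag, and the destage method are invented for the sketch,
and the "wal" and "index_block" names are placeholders, not PostgreSQL
structures.

```python
class Disk:
    """Toy disk: writes land in a volatile cache; sync() should persist them."""

    def __init__(self, honest: bool) -> None:
        self.honest = honest
        self.cache = {}
        self.platter = {}

    def write(self, name: str, data: str) -> None:
        self.cache[name] = data

    def sync(self, name: str) -> None:
        # A lying drive reports success without moving anything to the platter.
        if self.honest and name in self.cache:
            self.platter[name] = self.cache.pop(name)

    def destage(self, name: str) -> None:
        """The drive persists a cached item on its own schedule."""
        if name in self.cache:
            self.platter[name] = self.cache.pop(name)

    def power_loss(self) -> dict:
        """Drop the cache and return what actually survived."""
        self.cache.clear()
        return dict(self.platter)


def write_index_block(disk: Disk) -> None:
    disk.write("wal", "index tuple inserted")      # log record for the change
    disk.sync("wal")                               # step 3: log first
    disk.write("index_block", "new index tuple")   # step 4: then the data page
    disk.destage("index_block")                    # drive happens to persist it


for honest in (True, False):
    disk = Disk(honest)
    write_index_block(disk)
    print(honest, sorted(disk.power_loss()))
# True ['index_block', 'wal']
# False ['index_block']
```

With the lying drive, the index block survives without the log record that
describes it, which is exactly the state the WAL rule is meant to exclude.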


Re: CRCs

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> If log record was not really flushed on disk in 3. but on-disk image of
> index block was updated in 4. and system crashed after this then after
> restart recovery you'll have unlawful index tuple pointing to where?
> Who knows! No guarantee that corresponding heap tuple was flushed on
> disk.

This example doesn't seem very convincing.  Wouldn't the XLOG entry
describing creation of the heap tuple appear in the log before the one
for the index tuple?  Or are you assuming that both these XLOG entries
are lost due to disk drive malfeasance?
        regards, tom lane


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jan 12, 2001 at 04:10:36PM -0800, Alfred Perlstein wrote:
> Nathan Myers <ncm@zembu.com> [010112 15:49] wrote:
> >
> > Obviously it's better to configure the disk so that it doesn't
> > lie about what's been written.
> 
> I thought WAL+fsync wasn't supposed to allow this to happen?

It's an OS and hardware configuration matter; you only get correct
WAL+fsync semantics if the underlying system is configured right.  
IDE disks are almost always configured wrong, to spoof benchmarks; 
SCSI disks sometimes are.

If they're configured wrong, then (now that we have a CRC in the 
log entry) in the event of a power outage the database might come 
back with recently-acknowledged transaction results discarded.
That's a lot better than a corrupt database, but it's not 
industrial-grade semantics.  (Use a UPS.)

Nathan Myers
ncm@zembu.com


RE: CRCs

From
"Mikheev, Vadim"
Date:
> > If log record was not really flushed on disk in 3. but 
> > on-disk image of index block was updated in 4. and system
> > crashed after this then after restart recovery you'll have
> > unlawful index tuple pointing to where? Who knows!
> > No guarantee that corresponding heap tuple was flushed on
> > disk.
> 
> This example doesn't seem very convincing.  Wouldn't the XLOG entry
> describing creation of the heap tuple appear in the log before the one
> for the index tuple?  Or are you assuming that both these XLOG entries
> are lost due to disk drive malfeasance?

Yes, that was assumed.
Once UNDO is implemented and uncommitted tuples are removed by the
rollback part of after-crash recovery, we'll get a corrupted database
without that assumption.

Vadim


Re: CRCs

From
Daniele Orlandi
Date:
Nathan Myers wrote:
> 
> It wouldn't help you recover, but you would be able to report that
> you cannot recover.

While this could help in detecting hardware problems, you still won't be
able to detect some (many) memory errors, because the CRC will be
calculated on the already corrupted data.

Of course, there are other situations where the CRC will not match, and a
mismatch, appropriately logged, is a reliable heads-up warning.

Bye!

-- Daniele
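
The limitation described above can be shown in a few lines. This is a
minimal sketch, with zlib.crc32 standing in for the database's CRC and a
made-up "balance=1000" payload: the same bit flip is caught when it
happens after the CRC is computed, and missed when it happens before.

```python
import zlib

original = b"balance=1000"

# Case 1: the bit flips after the CRC was computed (e.g. on the way to disk).
crc_at_write = zlib.crc32(original)
on_disk = bytearray(original)
on_disk[8] ^= 0x08                 # '1' becomes '9'
print(zlib.crc32(bytes(on_disk)) == crc_at_write)    # False: detected

# Case 2: a RAM error hits the buffer before the CRC is computed.
in_memory = bytearray(original)
in_memory[8] ^= 0x08               # same flip, but earlier
crc_of_bad_data = zlib.crc32(bytes(in_memory))
print(zlib.crc32(bytes(in_memory)) == crc_of_bad_data)  # True: not detected
```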


Re: CRCs

From
Tom Lane
Date:
>> AFAICS, disk-block CRCs do not guard against mishaps involving intended
>> writes.  They will help guard against data corruption that might creep
>> in due to outside factors, however.

> Right.  

Given that we seem to have agreed on that, I withdraw my complaint about
disk-block-CRC not being in there for 7.1.  I think we are still a ways
away from the point where externally-induced corruption is a major share
of our failure rate ;-).  7.2 or so will be time enough to add this
feature, and I'd really rather not force another initdb for 7.1.
        regards, tom lane


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jan 12, 2001 at 11:30:30PM -0500, Tom Lane wrote:
> >> AFAICS, disk-block CRCs do not guard against mishaps involving intended
> >> writes.  They will help guard against data corruption that might creep
> >> in due to outside factors, however.
> 
> > Right.  
> 
> Given that we seem to have agreed on that, I withdraw my complaint about
> disk-block-CRC not being in there for 7.1.  I think we are still a ways
> away from the point where externally-induced corruption is a major share
> of our failure rate ;-).  7.2 or so will be time enough to add this
> feature, and I'd really rather not force another initdb for 7.1.

More to the point, putting CRCs on data blocks might have unintended
consequences for dump or vacuum processes.  7.1 is a monumental 
accomplishment even without corruption detection, and the sooner
the world has it, the better.

Nathan Myers
ncm@zembu.com


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jan 12, 2001 at 04:38:37PM -0800, Mikheev, Vadim wrote:
> Example.
> 1. Tuple was inserted into index.
> 2. Looking for free buffer bufmgr decides to write index block.
> 3. Following WAL core rule bufmgr first calls XLogFlush() to write
>    and fsync log record related to index tuple insertion.
> 4. *Believing* that log record is on disk now (after successful fsync)
>    bufmgr writes index block.
> 
> If log record was not really flushed on disk in 3. but on-disk image of
> index block was updated in 4. and system crashed after this then after
> restart recovery you'll have unlawful index tuple pointing to where?
> Who knows! No guarantee that corresponding heap tuple was flushed on
> disk.
> 
> Isn't database corrupted now?

Note, I haven't read the WAL code, so much of what I've said is based 
on what I know is and isn't possible with logging, rather than on 
Vadim's actual choices.  I know it's *possible* to implement a logging 
database which can maintain consistency without need for strict write 
ordering; but without strict write ordering, it is not possible to 
guarantee durable transactions.  That is, after a power outage, such 
a database may be guaranteed to recover uncorrupted, but some number 
(>= 0) of the last few acknowledged/committed transactions may be lost.

Vadim's implementation assumes strict write ordering, so that (e.g.) 
with IDE disks a corrupt database is possible in the event of a power 
outage.  (Database and OS crashes don't count; those don't keep the 
blocks from finding their way from on-disk buffers to disk.)  This is 
no criticism; it is more efficient to assume strict write ordering, 
and a database that can lose (the last few) committed transactions 
has limited value.

To achieve disk write-order independence is probably not a worthwhile 
goal, but for systems that cannot provide strict write ordering (e.g., 
most PCs) it would be helpful to be able to detect that the database 
has become corrupted.  In Vadim's example above, if the index were to
contain not only the heap blocks' numbers, but also their CRCs, then 
the corruption could be detected when the index is used.  When the 
block is read in, its CRC is checked, and when it is referenced via 
the index, the two CRC values are simply compared and the corruption
is revealed. 

On a machine that does provide strict write ordering, the CRCs in the 
index might be unnecessary overhead, but they also provide cross-checks
to help detect corruption introduced by bugs and whatnot.

Or maybe I don't know what I'm talking about.  

Nathan Myers
ncm@zembu.com


Re: CRCs

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
> To achieve disk write-order independence is probably not a worthwhile 
> goal, but for systems that cannot provide strict write ordering (e.g., 
> most PCs) it would be helpful to be able to detect that the database 
> has become corrupted.  In Vadim's example above, if the index were to
> contain not only the heap blocks' numbers, but also their CRCs, then 
> the corruption could be detected when the index is used.  When the 
> block is read in, its CRC is checked, and when it is referenced via 
> the index, the two CRC values are simply compared and the corruption
> is revealed. 

A row-level CRC might be useful for this, but it would have to be on
the data only (not the tuple commit-status bits).  It'd be totally
impractical with a block CRC, I think.  To do it with a block CRC, every
time you changed *anything* in a heap page, you'd have to find all the
index items for each row on the page and update their copies of the
heap block's CRC.  That could easily turn one disk-write into hundreds,
not to mention the index search costs.  Similarly, a check value that is
affected by tuple status updates would enormously increase the cost of
marking tuples committed or dead.

Instead of a partial row CRC, we could just as well use some other bit
of identifying information, say the row OID.  Given a block CRC on the
heap page, we'll be pretty confident already that the heap page is OK,
we just need to guard against the possibility that it's older than the
index item.  Checking that there is a valid tuple at the slot indicated
by the index item, and that it has the right OID, should be a good
enough (and cheap enough) test.
        regards, tom lane
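
As a rough illustration of the OID cross-check proposed above, the sketch
below follows an index item to a heap slot and compares OIDs. The data
structures (HeapTuple, IndexItem, a dict of slot lists for the heap) and
the fetch function are hypothetical stand-ins, not the actual PostgreSQL
tuple or index formats.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HeapTuple:
    oid: int
    data: str


@dataclass
class IndexItem:
    block: int   # heap block number
    slot: int    # slot within that block (block + slot act as the pointer)
    oid: int     # extra copy of the row OID, used only as a cross-check


Heap = Dict[int, List[Optional[HeapTuple]]]


def fetch(heap: Heap, item: IndexItem) -> HeapTuple:
    """Follow an index item and verify it points at the row it was built for."""
    slots = heap.get(item.block, [])
    tup = slots[item.slot] if item.slot < len(slots) else None
    if tup is None:
        raise LookupError("index item points at an empty slot")
    if tup.oid != item.oid:
        raise LookupError("OID mismatch between index item and heap tuple")
    return tup


heap: Heap = {7: [HeapTuple(oid=1001, data="row a"), None]}

print(fetch(heap, IndexItem(block=7, slot=0, oid=1001)).data)   # row a

try:
    fetch(heap, IndexItem(block=7, slot=1, oid=1002))
except LookupError as err:
    print(err)   # index item points at an empty slot
```

Unlike a copy of the block CRC, the stored OID does not change when other
rows on the same heap page are modified, so index items need no rewriting.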


Re: CRCs

From
ncm@zembu.com (Nathan Myers)
Date:
On Sat, Jan 13, 2001 at 12:49:34PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > ... for systems that cannot provide strict write ordering (e.g., 
> > most PCs) it would be helpful to be able to detect that the database 
> > has become corrupted.  In Vadim's example above, if the index were to
> > contain not only the heap blocks' numbers, but also their CRCs, then 
> > the corruption could be detected when the index is used.  ...
> 
> A row-level CRC might be useful for this, but it would have to be on
> the data only (not the tuple commit-status bits).  It'd be totally
> impractical with a block CRC, I think.   ...

I almost wrote about an indirect scheme to share the expected block CRC
value among all the index entries that need it, but thought it would 
distract from the correct approach:

> Instead of a partial row CRC, we could just as well use some other bit
> of identifying information, say the row OID.   ...

Good.  But, wouldn't the TID be more specific?  True, it would be pretty
unlikely for a block to have an old tuple with the right OID in the same
place.  Belt-and-braces says check both :-).  Either way, the check seems 
independent of block CRCs.   Would this check be simple enough to be safe
for 7.1? 

Nathan Myers
ncm@zembu.com


Re: CRCs

From
Horst Herb
Date:
On Sunday 14 January 2001 04:49, Tom Lane wrote:

> A row-level CRC might be useful for this, but it would have to be on
> the data only (not the tuple commit-status bits).  It'd be totally
> impractical with a block CRC, I think.  To do it with a block CRC, every
> time you changed *anything* in a heap page, you'd have to find all the
> index items for each row on the page and update their copies of the
> heap block's CRC.  That could easily turn one disk-write into hundreds,
> not to mention the index search costs.  Similarly, a check value that is
> affected by tuple status updates would enormously increase the cost of
> marking tuples committed or dead.

Ah, finally. Looks like we are moving in circles (or spirals ;-) ). Remember
that some 3-4 months ago I requested help from this list several times
regarding a trigger function that implements a CRC only on the user-defined
attributes? I wrote one in pgtcl, which was slow, and had trouble with the C
equivalent due to lack of documentation. I still believe this is useful
enough that it should be an option in Postgres and not a user-defined function.

Horst


Re: CRCs

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
>> Instead of a partial row CRC, we could just as well use some other bit
>> of identifying information, say the row OID.   ...

> Good.  But, wouldn't the TID be more specific?

Uh, the TID *is* the pointer from index to heap.  There's no redundancy
that way.

> Would this check be simple enough to be safe for 7.1? 

It'd probably be safe, but adding OIDs to index tuples would force an
initdb, which I'd rather avoid at this stage of the cycle.
        regards, tom lane


RE: CRCs

From
"Mikheev, Vadim"
Date:
> Instead of a partial row CRC, we could just as well use some other bit
> of identifying information, say the row OID.  Given a block CRC on the
> heap page, we'll be pretty confident already that the heap page is OK,
> we just need to guard against the possibility that it's older than the
> index item.  Checking that there is a valid tuple at the slot 
> indicated by the index item, and that it has the right OID, should be
> a good enough (and cheap enough) test.

This would work in 7.1 but not in 7.2 anyway (assuming UNDO and true
transaction rollback to be implemented). There will be no permanent
pg_log and after crash recovery any heap tuple with unknown t_xmin status
will be assumed as committed. Rollback will remove tuples inserted by
uncommitted transactions but this will be possible only for *logged*
modifications.

One should properly configure disk drives instead of hacking around
this problem. "Log before modifying data pages" is the *rule* for any WAL
system, like Oracle, Informix and a dozen others.

Vadim