On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:
> So this discussion died with no solution arising to the
> hint-bit-setting-invalidates-the-CRC problem.
>
> Apparently the only solution in sight is to WAL-log hint bits. Simon
> opines it would be horrible from a performance standpoint to WAL-log
> every hint bit set, and I think we all agree with that. So we need to
> find an alternative mechanism to WAL log hint bits.
It occurred to me that maybe we don't need to WAL-log the CRC checks.

Proposal
* We reserve enough space on a disk block for a CRC check. When a dirty
block is written to disk we calculate and annotate the CRC value, though
this is *not* WAL logged.
* In normal running we re-check the CRC when we read the block back into
shared_buffers.
* In recovery we will overwrite the last image of a block from WAL, so
we ignore the block CRC check, since the WAL record was already CRC
checked. If full_page_writes = off, we ignore and zero the block's CRC
for any block touched during recovery. We do those things because the
block CRC in the WAL is likely to be different to that on disk, due to
hints.
* We also re-check the CRC on a block immediately before we dirty the
block (for any reason). This minimises the chance that in-memory
corruption of a clean block goes undetected.
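
To make the four points above concrete, here is a rough sketch in Python
rather than backend C. Everything in it is an assumption for illustration
only: the 4-byte CRC field at the start of the block, the use of
zlib.crc32, the convention that a zeroed field means "no CRC recorded",
and all of the names (Buffer, write_block, read_block, verify). None of
this is existing PostgreSQL code.

```python
import zlib

BLOCK_SIZE = 8192   # PostgreSQL's default block size
CRC_SIZE = 4        # assumption: 4 bytes reserved at the start of each block


class BlockCorrupt(Exception):
    """Raised when a block's stored CRC does not match its contents."""


def compute_crc(block):
    # The CRC covers everything except the reserved CRC field itself.
    return zlib.crc32(bytes(block[CRC_SIZE:]))


def verify(block):
    stored = int.from_bytes(block[:CRC_SIZE], "little")
    # Assumption: a zeroed field means "no CRC recorded" (for example a
    # block touched in recovery with full_page_writes = off), so skip it.
    if stored != 0 and stored != compute_crc(block):
        raise BlockCorrupt("block CRC mismatch")


class Buffer:
    """Hypothetical stand-in for one page in shared_buffers."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.dirty = False

    def modify(self, offset, payload):
        # Re-check the CRC on the clean -> dirty transition only; once the
        # page is dirty the stored CRC is stale until the next write.
        if not self.dirty:
            verify(self.data)
            self.dirty = True
        self.data[offset:offset + len(payload)] = payload


def write_block(disk, blkno, buf):
    # Stamp the CRC at write time only; nothing here is WAL-logged.
    buf.data[:CRC_SIZE] = compute_crc(buf.data).to_bytes(CRC_SIZE, "little")
    disk[blkno] = bytes(buf.data)
    buf.dirty = False


def read_block(disk, blkno, in_recovery=False):
    buf = Buffer(disk[blkno])
    # In recovery the block will be overwritten from an already
    # CRC-checked WAL record, so the on-disk CRC is ignored.
    if not in_recovery:
        verify(buf.data)
    return buf
```

Note that a hint-bit change goes through modify() like any other change,
so the CRC is simply recomputed at the next write_block() and the hint
itself never needs a WAL record. A real CRC that happens to equal zero
would be treated as "unchecked" in this sketch, which a real design would
have to handle differently.
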
So in the typical case, all blocks moving from disk <-> memory and from
clean -> dirty are CRC checked. With full_page_writes = on we have a
good CRC every time. With full_page_writes = off we are exposed only on
the blocks that changed during the last checkpoint cycle, and only if we
crash. That seems good because most databases are up 99% of the time, so
any corruptions are likely to occur in normal running, not as a result
of crashes.
This would be a run-time option.
Like it?
-- Simon Riggs www.2ndQuadrant.com