On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:
> Ok, I think I see the bug. And I think its been introduced in the
> checkpoints patch.
Well spotted. (I think you mean checksums patch).
> If by now the first backend has proceeded to PageSetLSN() we are writing
> different data to disk than the one we computed the checksum of
> before. Boom.
Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broke.
> I think the whole locking interactions in MarkBufferDirtyHint() need to
> be thought over pretty carefully.
When we write out a buffer with checksums enabled, we take a copy of
the buffer so that the checksum is consistent, even while other
backends may be writing hints to the same bufer.
I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().
Simple patch to do this attached for discussion. (Not tested).
We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure lets solve the
problem first then discuss which approach is best.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services