On Tue, 2013-03-05 at 11:35 +0200, Heikki Linnakangas wrote:
> If you enable checksums, the free space map never gets updated in a
> standby. It will slowly drift to be completely out of sync with reality,
> which could lead to significant slowdown and bloat after failover.
One of the design points of this patch is that those operations that use
MarkBufferDirtyHint(), including tuple hint bits, the FSM, index dead
markers, etc., do not directly go to the standby. That's because the
standby can't write WAL, so it can't protect itself against a torn page
breaking the checksum.
However, these do make it through by riding along with a full-page image
in the WAL. The fact that checksums are enabled means that these full
page images will be written once per modified page per checkpoint, and
then replayed on the standby. FSM should get the updates the same way,
even though no other WAL is written for the FSM.
If full_page_writes are disabled, then the updates will never arrive.
But in that case, I think we can just go ahead and dirty the page during
recovery, because there isn't a real problem. I was hesitant to make
this change in my patch because:
1. I wanted to see if someone saw a flaw in this reasoning; and
2. I noticed that full_page_images can be changed with a SIGHUP, which
could add complexity (I don't see any reason we allow this... shouldn't
we just force a restart for that change?).
I added a README file, moved some of the explanatory material there, and
tried to clarify this situation.
Let me know if you see a problem that I'm missing. I verified that at
least some FSM changes do make it through with checksums on, but I
didn't dig much deeper than that.
> Since the checksums are an all-or-nothing cluster-wide setting, the
> three extra flags in the page header, PD_CHECKSUMS1, PD_CHECKSUM2 and
> PD_HEADERCHECK, are not needed. Let's leave them out. That keeps the
> code simpler, and leaves the bits free for future use. If we want to
> enable such per-page setting in the future, we can add it later. For a
> per-relation scheme, they're not needed.
Removed header bits.
> XLogCheckBuffer() and XLogCheckBufferNeedsBackup() read the page LSN
> without a lock. That's not atomic, so it could incorrectly determine
> that a page doesn't need to be backed up. We used to always hold an
> exclusive lock on the buffer when it's called, which prevents
> modifications to the LSN, but that's no longer the case.
Fixed. I added a new exported function, BufferGetLSNAtomic().
There was another similar omission in gistget.c.
By the way, I can not find any trace of XLogCheckBufferNeedsBackup(),
was that a typo?
> Shouldn't SetBufferCommitInfoNeedsSave() check the BM_PERMANENT flag? I
> think it will generate WAL records for unlogged tables as it is.
Fixed.
I also rebased and added a GUC to control whether the checksum failure
causes an error or not.
I need to do another self-review after these changes and some more
extensive testing, so I might have missed a couple things.
Regards,
Jeff Davis