On Oct26, 2011, at 15:57 , Florian Pflug wrote:
> As you said, the CLOG page corresponding to nextId
> *should* always be accessible at the start of recovery (Unless whole file
> has been removed by VACUUM, that is). So we shouldn't need to extends CLOG.
> Yet the error suggest that the CLOG is, in fact, too short. What I said
> is that we shouldn't apply any fix (for the CLOG problem) before we understand
> the reason for that apparent contradiction.
Ha! I think I've got a working theory.
In CreateCheckPoint(), we determine the nextId that'll go into the checkpoint
record, and then call CheckPointGuts() which does the actual writing and fsyncing.
So far, that fine. If a transaction ID is assigned before we compute the
checkpoint's nextXid, we'll extend the CLOG accordingly, and CheckPointGuts() will
make sure the new CLOG page goes to disk.
But, if wal_level = hot_standby, we also call LogStandbySnapshot() in
CreateCheckPoint(), and we do that *after* CheckPointGuts(). Which would be
fine too, except that LogStandbySnapshot() re-assigned the *current* value of
ShmemVariableCache->nextXid to the checkpoint's nextXid field.
Thus, if the CLOG is extended after (or in the middle of) CheckPointGuts(), but
before LogStandbySnapshot(), then we end up with a nextXid in the checkpoint
whose CLOG page hasn't necessarily made it to the disk yet. The longer CheckPointGuts()
takes to finish it's work the more likely it becomes (assuming that CLOG writing
and syncing doesn't happen at the very end). This fits the OP's observation ob the
problem vanishing when pg_start_backup() does an immediate checkpoint.
I dunno how to this fix, though, since I don't really understand why
LogStandbySnapshot() needs to modify the checkpoint's nextXid.Simon, is there some
documentation on what assumptions the hot standby code makes about the various XID
fields included in a checkpoint?
best regards,
Florian Pflug