[PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync - Mailing list pgsql-hackers
| From | Florian G. Pflug |
|---|---|
| Subject | [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync |
| Date | |
| Msg-id | 4627D30D.8020803@phlo.org |
| Responses | Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync |
| | Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync |
| List | pgsql-hackers |
Hi,

I believe I have discovered the following problem in pgsql 8.2 and HEAD, concerning warm standbys using WAL log shipping.

The problem is that after a crash, the master might complete incomplete actions via rm_cleanup() - but since it won't wal-log those changes, the slave won't know about this. This will at least prevent the creation of any further restart points on the slave (because safe_restartpoint() will never return true again); it might even cause data corruption, if subsequent wal records are interpreted wrongly by the slave because it sees different data than the master did when it generated them.

Attached is a patch that lets RecoveryRestartPoint call all rm_cleanup() methods and create a restart point whenever it encounters a shutdown checkpoint in the wal (because those are generated after recovery). This ought not to cause a performance degradation, because shutdown checkpoints occur very infrequently.

The patch is per discussion with Simon Riggs. I've not yet had a chance to test it; I only made sure that it compiles. I'm sending it out now because I hope it might make it into 8.2.4.

greetings, Florian Pflug

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
 	 * Perform a checkpoint to update all our recovery activity to disk.
 	 *
 	 * Note that we write a shutdown checkpoint rather than an on-line
-	 * one. This is not particularly critical, but since we may be
-	 * assigning a new TLI, using a shutdown checkpoint allows us to have
-	 * the rule that TLI only changes in shutdown checkpoints, which
-	 * allows some extra error checking in xlog_redo.
+	 * one. A slave will always create a restart point if it sees a
+	 * shutdown checkpoint, and will call all rm_cleanup() methods before
+	 * it does so. This guarantees that any actions taken by the master
+	 * in rm_cleanup will also be carried out on the slave.
+	 * Additionally, we may be assigning a new TLI, so using a shutdown
+	 * checkpoint allows us to have the rule that TLI only changes in shutdown
+	 * checkpoints, which allows some extra error checking in xlog_redo.
 	 */
 	CreateCheckPoint(true, true);

@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
  * restartpoint is needed or not.
  */
 static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
 {
 	int			elapsed_secs;
 	int			rmid;

 	/*
-	 * Do nothing if the elapsed time since the last restartpoint is less than
-	 * half of checkpoint_timeout.  (We use a value less than
-	 * checkpoint_timeout so that variations in the timing of checkpoints on
-	 * the master, or speed of transmission of WAL segments to a slave, won't
-	 * make the slave skip a restartpoint once it's synced with the master.)
-	 * Checking true elapsed time keeps us from doing restartpoints too often
-	 * while rapidly scanning large amounts of WAL.
+	 * If the checkpoint we saw in the wal was a shutdown checkpoint, it might
+	 * have been written after the recovery following a crash of the master.
+	 * In that case, the master will have completed any actions that were
+	 * incomplete when it crashed *during recovery*, and these completions
+	 * are therefore *not* logged in the wal.
+	 * To prevent getting out of sync, we follow what the master did, and
+	 * call the rm_cleanup() methods. To be on the safe side, we then create
+	 * a RestartPoint, regardless of the time elapsed. Note that asking
+	 * the resource managers if they have partial state would be redundant
+	 * after calling rm_cleanup().
 	 */
-	elapsed_secs = time(NULL) - ControlFile->time;
-	if (elapsed_secs < CheckPointTimeout / 2)
-		return;
+	if (shutdownCheckpoint) {
+		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+		{
+			if (RmgrTable[rmid].rm_cleanup != NULL)
+				RmgrTable[rmid].rm_cleanup();
+		}
+	}
+	else {
+		/*
+		 * Do nothing if the elapsed time since the last restartpoint is less than
+		 * half of checkpoint_timeout.  (We use a value less than
+		 * checkpoint_timeout so that variations in the timing of checkpoints on
+		 * the master, or speed of transmission of WAL segments to a slave, won't
+		 * make the slave skip a restartpoint once it's synced with the master.)
+		 * Checking true elapsed time keeps us from doing restartpoints too often
+		 * while rapidly scanning large amounts of WAL.
+		 */
+		elapsed_secs = time(NULL) - ControlFile->time;
+		if (elapsed_secs < CheckPointTimeout / 2)
+			return;

-	/*
-	 * Is it safe to checkpoint?  We must ask each of the resource managers
-	 * whether they have any partial state information that might prevent a
-	 * correct restart from this point.  If so, we skip this opportunity, but
-	 * return at the next checkpoint record for another try.
-	 */
-	for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
-	{
-		if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
-			if (!(RmgrTable[rmid].rm_safe_restartpoint()))
-				return;
+		/*
+		 * Is it safe to checkpoint?  We must ask each of the resource managers
+		 * whether they have any partial state information that might prevent a
+		 * correct restart from this point.  If so, we skip this opportunity, but
+		 * return at the next checkpoint record for another try.
+		 */
+		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+		{
+			if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
+				if (!(RmgrTable[rmid].rm_safe_restartpoint()))
+					return;
+		}
 	}

 	/*
@@ -5835,7 +5859,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
 			ThisTimeLineID = checkPoint.ThisTimeLineID;
 		}

-		RecoveryRestartPoint(&checkPoint);
+		RecoveryRestartPoint(&checkPoint, true);
 	}
 	else if (info == XLOG_CHECKPOINT_ONLINE)
 	{
@@ -5864,7 +5888,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
 				(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
 						checkPoint.ThisTimeLineID, ThisTimeLineID)));

-		RecoveryRestartPoint(&checkPoint);
+		RecoveryRestartPoint(&checkPoint, false);
 	}
 	else if (info == XLOG_SWITCH)
 	{
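For readers who want the failure mechanism spelled out outside of the diff, here is a minimal, self-contained sketch of the sequence of events. It is not the actual xlog.c code: the rm_cleanup/rm_safe_restartpoint callback names are borrowed from the patch above, while the stand-in struct, the pending_actions counter and the two "sides" simulated in main() are purely illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for a PostgreSQL resource manager's callbacks. */
typedef struct
{
    void (*rm_cleanup) (void);              /* finish incomplete actions */
    bool (*rm_safe_restartpoint) (void);    /* "no partial state left?" */
} RmgrSketch;

/* e.g. an action whose start, but not its completion, appears in the WAL */
static int  pending_actions = 1;

static void
sketch_cleanup(void)
{
    /* The master completes the action here, but writes NO WAL record for it. */
    pending_actions = 0;
}

static bool
sketch_safe_restartpoint(void)
{
    return pending_actions == 0;
}

static const RmgrSketch rmgr = {sketch_cleanup, sketch_safe_restartpoint};

int
main(void)
{
    /*
     * Master, recovering after its crash: WAL replay left a partial action,
     * rm_cleanup() completes it, and then a shutdown checkpoint is written.
     */
    rmgr.rm_cleanup();
    printf("master: safe_restartpoint=%d, then writes a shutdown checkpoint\n",
           (int) rmgr.rm_safe_restartpoint());

    /*
     * Slave, replaying the same WAL: the cleanup was never logged, so without
     * the patch the partial action stays pending forever, safe_restartpoint()
     * keeps returning false, and no further restartpoint is ever created.
     */
    pending_actions = 1;
    printf("slave (unpatched): safe_restartpoint=%d, restartpoints stop\n",
           (int) rmgr.rm_safe_restartpoint());
    return 0;
}
```

With the patch, the slave reacts to the shutdown checkpoint record by running the same rm_cleanup() calls itself and forcing a restartpoint, so its on-disk state catches up with the master's.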