
From Florian G. Pflug
Subject [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync
Date
Msg-id 4627D30D.8020803@phlo.org
List pgsql-hackers
Hi

I believe I have discovered the following problem in pgsql 8.2 and HEAD,
concerning warm-standbys using WAL log shipping.

The problem is that after a crash, the master might complete incomplete
actions via rm_cleanup() - but since it doesn't wal-log those changes,
the slave never learns about them. At the very least this prevents the
creation of any further restart points on the slave (because
safe_restartpoint will never return true again); it might even cause
data corruption, if the slave interprets subsequent wal records wrongly
because it sees different data than the master saw when it generated them.
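
For illustration, this is roughly what happens at the end of crash recovery
on the master (a simplified sketch of the rm_cleanup() pass in StartupXLOG,
not the exact source): each resource manager gets a chance to finish any
multi-record action that was left incomplete, and none of it is wal-logged:

    int     rmid;

    for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
    {
        /* Let each rmgr finish incomplete actions it tracked during redo,
         * e.g. an unfinished btree page split. */
        if (RmgrTable[rmid].rm_cleanup != NULL)
            RmgrTable[rmid].rm_cleanup();
    }
    /* No wal record is written for these fixups, so a log-shipping slave
     * replaying the same wal never sees them. */

The loop the patch adds to RecoveryRestartPoint below is the slave-side
counterpart of this.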

Attached is a patch that makes RecoveryRestartPoint call all
rm_cleanup() methods and create a restart point whenever it encounters
a shutdown checkpoint in the wal (because those are generated after
recovery). This ought not to cause any performance degradation, because
shutdown checkpoints occur very infrequently.

The patch is per discussion with Simon Riggs.

I've not yet had a chance to test this patch; I've only made sure
that it compiles. I'm sending it out now because I hope it might
still make it into 8.2.4.

greetings, Florian Pflug
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
          * Perform a checkpoint to update all our recovery activity to disk.
          *
          * Note that we write a shutdown checkpoint rather than an on-line
-         * one. This is not particularly critical, but since we may be
-         * assigning a new TLI, using a shutdown checkpoint allows us to have
-         * the rule that TLI only changes in shutdown checkpoints, which
-         * allows some extra error checking in xlog_redo.
+         * one. A slave will always create a restart point if it sees a
+         * shutdown checkpoint, and will call all rm_cleanup() methods before
+         * it does so. This guarantees that any actions taken by the master
+         * in rm_cleanup will also be carried out on the slave.
+         * Additionally, we may be assigning a new TLI, so using a shutdown
+         * checkpoint allows us to have the rule that TLI only changes in shutdown
+         * checkpoints, which allows some extra error checking in xlog_redo.
          */
         CreateCheckPoint(true, true);

@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
  * restartpoint is needed or not.
  */
 static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
 {
     int            elapsed_secs;
     int            rmid;

     /*
-     * Do nothing if the elapsed time since the last restartpoint is less than
-     * half of checkpoint_timeout.    (We use a value less than
-     * checkpoint_timeout so that variations in the timing of checkpoints on
-     * the master, or speed of transmission of WAL segments to a slave, won't
-     * make the slave skip a restartpoint once it's synced with the master.)
-     * Checking true elapsed time keeps us from doing restartpoints too often
-     * while rapidly scanning large amounts of WAL.
+     * If the checkpoint we saw in the wal was a shutdown checkpoint, it might
+     * have been written after the recovery following a crash of the master.
+     * In that case, the master will have completed any actions that were
+     * incomplete when it crashed *during recovery*, and these completions
+     * are therefore *not* logged in the wal.
+     * To prevent getting out of sync, we follow what the master did, and
+     * call the rm_cleanup() methods. To be on the safe side, we then create
+     * a RestartPoint, regardless of the time elapsed. Note that asking
+     * the resource managers if they have partial state would be redundant
+     * after calling rm_cleanup().
      */
-    elapsed_secs = time(NULL) - ControlFile->time;
-    if (elapsed_secs < CheckPointTimeout / 2)
-        return;
+    if (shutdownCheckpoint) {
+        for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+        {
+            if (RmgrTable[rmid].rm_cleanup != NULL)
+                RmgrTable[rmid].rm_cleanup();
+        }
+    }
+    else {
+        /*
+         * Do nothing if the elapsed time since the last restartpoint is less than
+         * half of checkpoint_timeout.    (We use a value less than
+         * checkpoint_timeout so that variations in the timing of checkpoints on
+         * the master, or speed of transmission of WAL segments to a slave, won't
+         * make the slave skip a restartpoint once it's synced with the master.)
+         * Checking true elapsed time keeps us from doing restartpoints too often
+         * while rapidly scanning large amounts of WAL.
+         */
+        elapsed_secs = time(NULL) - ControlFile->time;
+        if (elapsed_secs < CheckPointTimeout / 2)
+            return;

-    /*
-     * Is it safe to checkpoint?  We must ask each of the resource managers
-     * whether they have any partial state information that might prevent a
-     * correct restart from this point.  If so, we skip this opportunity, but
-     * return at the next checkpoint record for another try.
-     */
-    for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
-    {
-        if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
-            if (!(RmgrTable[rmid].rm_safe_restartpoint()))
-                return;
+        /*
+         * Is it safe to checkpoint?  We must ask each of the resource managers
+         * whether they have any partial state information that might prevent a
+         * correct restart from this point.  If so, we skip this opportunity, but
+         * return at the next checkpoint record for another try.
+         */
+        for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+        {
+            if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
+                if (!(RmgrTable[rmid].rm_safe_restartpoint()))
+                    return;
+        }
     }

     /*
@@ -5835,7 +5859,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
             ThisTimeLineID = checkPoint.ThisTimeLineID;
         }

-        RecoveryRestartPoint(&checkPoint);
+        RecoveryRestartPoint(&checkPoint, true);
     }
     else if (info == XLOG_CHECKPOINT_ONLINE)
     {
@@ -5864,7 +5888,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
                     (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
                             checkPoint.ThisTimeLineID, ThisTimeLineID)));

-        RecoveryRestartPoint(&checkPoint);
+        RecoveryRestartPoint(&checkPoint, false);
     }
     else if (info == XLOG_SWITCH)
     {
