Thread: Hot standby, recovery infra

From: Heikki Linnakangas
Date:
I've been reviewing and massaging the so-called recovery infra patch.

To recap, the goal is to:
- start background writer during (archive) recovery
- skip the shutdown checkpoint at the end of recovery. Instead, the
database is brought up immediately, and the bgwriter performs a normal
online checkpoint, while we're already accepting connections.
- keep track of when we reach a consistent point in the recovery, where
we could let read-only backends in, which is obviously required for hot
standby.

The 1st and 2nd points provide some useful functionality, even without
the rest of the hot standby patch.

I've refactored the patch quite heavily, making it a lot simpler, and
over 1/3 smaller than before:

The signaling between the bgwriter and startup process during recovery
was quite complicated. The startup process periodically sent checkpoint
records to the bgwriter, so that bgwriter could perform restart points.
I've replaced that by storing the last seen checkpoint in shared
memory in xlog.c. CreateRestartPoint() picks it up from there. This
means that bgwriter can decide autonomously when to perform a restart
point; it no longer needs to be told to do so by the startup process,
which is nice in a standby. What could happen before is that the standby
processes a checkpoint record, and decides not to make it a restartpoint
because not enough time has passed since the last one. If we then get a long
idle period after that, we'd really want to make the previous checkpoint
record a restart point after all, after some time has passed. That is
what will happen now, which is a usability enhancement, although the
real motivation for this refactoring was to make the code simpler.

The bgwriter is now always responsible for all checkpoints and
restartpoints (well, except in a stand-alone backend), which makes it
easier to understand what's going on, IMHO.

There was one pretty fundamental bug in the minSafeStartPoint handling:
it was always set when a WAL file was opened for reading, which means it
was also moved backwards when the recovery began by reading the WAL
segment containing the last restart/checkpoint, rendering it useless for
the purpose it was designed for. Fortunately that was easy to fix. Another tiny
bug was that log_restartpoints was not respected, because it was stored
in a variable in startup process' memory, and wasn't seen by bgwriter.

One aspect that troubles me a bit is the changes in XLogFlush. I guess
we no longer have the problem that you can't start up the database if
we've read in a corrupted page from disk, because we now start up before
checkpointing. However, it does mean that if a corrupt page is read into
shared buffers, we'll never be able to checkpoint. But then again, I
guess that's already true without this patch.


I feel quite good about this patch now. Given the amount of code churn,
it requires testing, and I'll read it through one more time after
sleeping on it. Simon, do you see anything wrong with this?

(this patch is also in my git repository at git.postgresql.org, branch
recoveryinfra.)

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bd6035d..30fea49 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -119,12 +119,26 @@ CheckpointStatsData CheckpointStats;
  */
 TimeLineID    ThisTimeLineID = 0;

-/* Are we doing recovery from XLOG? */
+/*
+ * Are we doing recovery from XLOG?
+ *
+ * This is only ever true in the startup process, when it's replaying WAL.
+ * It's used in functions that need to act differently when called from a
+ * redo function (e.g skip WAL logging).  To check whether the system is in
+ * recovery regardless of what process you're running in, use
+ * IsRecoveryProcessingMode().
+ */
 bool        InRecovery = false;

 /* Are we recovering using offline XLOG archives? */
 static bool InArchiveRecovery = false;

+/*
+ * Local copy of shared RecoveryProcessingMode variable. True actually
+ * means "not known, need to check the shared state"
+ */
+static bool LocalRecoveryProcessingMode = true;
+
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;

@@ -133,16 +147,22 @@ static char *recoveryRestoreCommand = NULL;
 static bool recoveryTarget = false;
 static bool recoveryTargetExact = false;
 static bool recoveryTargetInclusive = true;
-static bool recoveryLogRestartpoints = false;
 static TransactionId recoveryTargetXid;
 static TimestampTz recoveryTargetTime;
 static TimestampTz recoveryLastXTime = 0;
+/*
+ * log_restartpoints is stored in shared memory because it needs to be
+ * accessed by bgwriter when it performs restartpoints
+ */

 /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
 static TransactionId recoveryStopXid;
 static TimestampTz recoveryStopTime;
 static bool recoveryStopAfter;

+/* is the database in consistent state yet? */
+static bool    reachedSafeStartPoint = false;
+
 /*
  * During normal operation, the only timeline we care about is ThisTimeLineID.
  * During recovery, however, things are more complicated.  To simplify life
@@ -313,6 +333,25 @@ typedef struct XLogCtlData
     int            XLogCacheBlck;    /* highest allocated xlog buffer index */
     TimeLineID    ThisTimeLineID;

+    /*
+     * SharedRecoveryProcessingMode indicates if we're still in crash or
+     * archive recovery. It's checked by IsRecoveryProcessingMode()
+     */
+    bool        SharedRecoveryProcessingMode;
+
+    /*
+     * During recovery, we keep a copy of the latest checkpoint record
+     * here. It's used by the background writer when it wants to create
+     * a restartpoint.
+     *
+     * is info_lck spinlock a bit too light-weight to protect this?
+     */
+    XLogRecPtr    lastCheckPointRecPtr;
+    CheckPoint    lastCheckPoint;
+
+    /* Should restartpoints be logged? Taken from recovery.conf */
+    bool        recoveryLogRestartpoints;
+
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;

@@ -399,6 +438,7 @@ static void XLogArchiveCleanup(const char *xlog);
 static void readRecoveryCommandFile(void);
 static void exitArchiveRecovery(TimeLineID endTLI,
                     uint32 endLogId, uint32 endLogSeg);
+static void exitRecovery(void);
 static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);

@@ -483,6 +523,11 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
     bool        updrqst;
     bool        doPageWrites;
     bool        isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+    bool        isRecoveryEnd = (rmid == RM_XLOG_ID && info == XLOG_RECOVERY_END);
+
+    /* cross-check on whether we should be here or not */
+    if (IsRecoveryProcessingMode() && !isRecoveryEnd)
+        elog(FATAL, "cannot make new WAL entries during recovery");

     /* info's high bits are reserved for use by me */
     if (info & XLR_INFO_MASK)
@@ -1730,7 +1775,7 @@ XLogFlush(XLogRecPtr record)
     XLogwrtRqst WriteRqst;

     /* Disabled during REDO */
-    if (InRedo)
+    if (IsRecoveryProcessingMode())
         return;

     /* Quick exit if already known flushed */
@@ -1818,9 +1863,9 @@ XLogFlush(XLogRecPtr record)
      * the bad page is encountered again during recovery then we would be
      * unable to restart the database at all!  (This scenario has actually
      * happened in the field several times with 7.1 releases. Note that we
-     * cannot get here while InRedo is true, but if the bad page is brought in
-     * and marked dirty during recovery then CreateCheckPoint will try to
-     * flush it at the end of recovery.)
+     * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
+     * brought in and marked dirty during recovery then if a checkpoint were
+     * performed at the end of recovery it will try to flush it.
      *
      * The current approach is to ERROR under normal conditions, but only
      * WARNING during recovery, so that the system can be brought up even if
@@ -1830,7 +1875,7 @@ XLogFlush(XLogRecPtr record)
      * and so we will not force a restart for a bad LSN on a data page.
      */
     if (XLByteLT(LogwrtResult.Flush, record))
-        elog(InRecovery ? WARNING : ERROR,
+        elog(ERROR,
         "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
              record.xlogid, record.xrecoff,
              LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
@@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
         unlink(tmppath);
     }

-    elog(DEBUG2, "done creating and filling new WAL file");
+    XLogFileName(tmppath, ThisTimeLineID, log, seg);
+    elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

     /* Set flag to tell caller there was no existent file */
     *use_existent = false;
@@ -2409,6 +2455,33 @@ XLogFileRead(uint32 log, uint32 seg, int emode)
                      xlogfname);
             set_ps_display(activitymsg, false);

+            /*
+             * Calculate and write out a new safeStartPoint. This defines
+             * the latest LSN that might appear on-disk while we apply
+             * the WAL records in this file. If we crash during recovery
+             * we must reach this point again before we can prove
+             * database consistency. Not a restartpoint! Restart points
+             * define where we should start recovery from, if we crash.
+             */
+            if (InArchiveRecovery)
+            {
+                XLogRecPtr    nextSegRecPtr;
+                uint32        nextLog = log;
+                uint32        nextSeg = seg;
+
+                NextLogSeg(nextLog, nextSeg);
+                nextSegRecPtr.xlogid = nextLog;
+                nextSegRecPtr.xrecoff = nextSeg * XLogSegSize;
+
+                LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+                if (XLByteLT(ControlFile->minSafeStartPoint, nextSegRecPtr))
+                {
+                    ControlFile->minSafeStartPoint = nextSegRecPtr;
+                    UpdateControlFile();
+                }
+                LWLockRelease(ControlFileLock);
+            }
+
             return fd;
         }
         if (errno != ENOENT)    /* unexpected failure? */
@@ -4592,13 +4665,13 @@ readRecoveryCommandFile(void)
             /*
              * does nothing if a recovery_target is not also set
              */
-            if (!parse_bool(tok2, &recoveryLogRestartpoints))
-                  ereport(ERROR,
-                            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                      errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
+            if (!parse_bool(tok2, &XLogCtl->recoveryLogRestartpoints))
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                            errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
             ereport(LOG,
-                    (errmsg("log_restartpoints = %s", tok2)));
-        }
+                (errmsg("log_restartpoints = %s", tok2)));
+         }
         else
             ereport(FATAL,
                     (errmsg("unrecognized recovery parameter \"%s\"",
@@ -4734,7 +4807,10 @@ exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)

     /*
      * Rename the config file out of the way, so that we don't accidentally
-     * re-enter archive recovery mode in a subsequent crash.
+     * re-enter archive recovery mode in a subsequent crash. We have already
+     * restored all the WAL segments we need from the archive, and we trust
+     * that they are not going to go away even if we crash. (XXX: should
+     * we fsync() them all to ensure that?)
      */
     unlink(RECOVERY_COMMAND_DONE);
     if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
@@ -4876,6 +4952,7 @@ StartupXLOG(void)
     CheckPoint    checkPoint;
     bool        wasShutdown;
     bool        reachedStopPoint = false;
+    bool        performedRecovery = false;
     bool        haveBackupLabel = false;
     XLogRecPtr    RecPtr,
                 LastRec,
@@ -4888,6 +4965,8 @@ StartupXLOG(void)
     uint32        freespace;
     TransactionId oldestActiveXID;

+    XLogCtl->SharedRecoveryProcessingMode = true;
+
     /*
      * Read control file and check XLOG status looks valid.
      *
@@ -5108,9 +5187,15 @@ StartupXLOG(void)
         if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
             ControlFile->minRecoveryPoint = minRecoveryLoc;
         ControlFile->time = (pg_time_t) time(NULL);
+        /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();

         /*
+         * Reset pgstat data, because it may be invalid after recovery.
+         */
+        pgstat_reset_all();
+
+        /*
          * If there was a backup label file, it's done its job and the info
          * has now been propagated into pg_control.  We must get rid of the
          * label file so that if we crash during recovery, we'll pick up at
@@ -5155,6 +5240,7 @@ StartupXLOG(void)
             bool        recoveryContinue = true;
             bool        recoveryApply = true;
             ErrorContextCallback errcontext;
+            XLogRecPtr    minSafeStartPoint;

             InRedo = true;
             ereport(LOG,
@@ -5162,6 +5248,12 @@ StartupXLOG(void)
                             ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));

             /*
+             * Take a local copy of minSafeStartPoint at the beginning of
+             * recovery, because it's updated as we go.
+             */
+            minSafeStartPoint = ControlFile->minSafeStartPoint;
+
+            /*
              * main redo apply loop
              */
             do
@@ -5217,6 +5309,32 @@ StartupXLOG(void)

                 LastRec = ReadRecPtr;

+                /*
+                 * Have we reached our safe starting point? If so, we can
+                 * signal postmaster to enter consistent recovery mode.
+                 *
+                 * There are two points in the log we must pass. The first is
+                 * the minRecoveryPoint, which is the LSN at the time the
+                 * base backup was taken that we are about to rollforward from.
+                 * If recovery has ever crashed or was stopped there is
+                 * another point also: minSafeStartPoint, which is the
+                 * latest LSN that recovery could have reached prior to crash.
+                 */
+                if (!reachedSafeStartPoint &&
+                     XLByteLE(minSafeStartPoint, EndRecPtr) &&
+                     XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+                {
+                    reachedSafeStartPoint = true;
+                    if (InArchiveRecovery)
+                    {
+                        ereport(LOG,
+                            (errmsg("consistent recovery state reached at %X/%X",
+                                EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+                        if (IsUnderPostmaster)
+                            SendPostmasterSignal(PMSIGNAL_RECOVERY_START);
+                    }
+                }
+
                 record = ReadRecord(NULL, LOG);
             } while (record != NULL && recoveryContinue);

@@ -5238,6 +5356,7 @@ StartupXLOG(void)
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
+            reachedSafeStartPoint = true;
         }
     }

@@ -5251,9 +5370,9 @@ StartupXLOG(void)

     /*
      * Complain if we did not roll forward far enough to render the backup
-     * dump consistent.
+     * dump consistent and start safely.
      */
-    if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
+    if (InRecovery && !reachedSafeStartPoint)
     {
         if (reachedStopPoint)    /* stopped because of stop request */
             ereport(FATAL,
@@ -5375,39 +5494,14 @@ StartupXLOG(void)
         XLogCheckInvalidPages();

         /*
-         * Reset pgstat data, because it may be invalid after recovery.
+         * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
+         * a shutdown checkpoint here, but we ask bgwriter to do that now.
          */
-        pgstat_reset_all();
+        exitRecovery();

-        /*
-         * Perform a checkpoint to update all our recovery activity to disk.
-         *
-         * Note that we write a shutdown checkpoint rather than an on-line
-         * one. This is not particularly critical, but since we may be
-         * assigning a new TLI, using a shutdown checkpoint allows us to have
-         * the rule that TLI only changes in shutdown checkpoints, which
-         * allows some extra error checking in xlog_redo.
-         */
-        CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+        performedRecovery = true;
     }

-    /*
-     * Preallocate additional log files, if wanted.
-     */
-    PreallocXlogFiles(EndOfLog);
-
-    /*
-     * Okay, we're officially UP.
-     */
-    InRecovery = false;
-
-    ControlFile->state = DB_IN_PRODUCTION;
-    ControlFile->time = (pg_time_t) time(NULL);
-    UpdateControlFile();
-
-    /* start the archive_timeout timer running */
-    XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
-
     /* initialize shared-memory copy of latest checkpoint XID/epoch */
     XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
     XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
@@ -5441,6 +5535,67 @@ StartupXLOG(void)
         readRecordBuf = NULL;
         readRecordBufSize = 0;
     }
+
+    /*
+     * If we had to replay any WAL records, request a checkpoint. This isn't
+     * strictly necessary: if we crash now, the recovery will simply restart
+     * from the same point where it started this time around (or from the
+     * last restartpoint). The control file is left in DB_IN_*_RECOVERY
+     * state; the first checkpoint will change that to DB_IN_PRODUCTION.
+     */
+    if (performedRecovery)
+    {
+        /*
+         * Okay, we can come up now. Allow others to write WAL.
+         */
+        XLogCtl->SharedRecoveryProcessingMode = false;
+
+        RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE |
+                          CHECKPOINT_STARTUP);
+    }
+    else
+    {
+        /*
+         * No recovery, so let's just get on with it.
+         */
+        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+        ControlFile->state = DB_IN_PRODUCTION;
+        ControlFile->time = (pg_time_t) time(NULL);
+        UpdateControlFile();
+        LWLockRelease(ControlFileLock);
+
+        /*
+         * Okay, we're officially UP.
+         */
+        XLogCtl->SharedRecoveryProcessingMode = false;
+    }
+
+    /* start the archive_timeout timer running */
+    XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);
+
+}
+
+/*
+ * IsRecoveryProcessingMode()
+ *
+ * Fast test for whether we're still in recovery or not. We test the shared
+ * state each time only until we leave recovery mode. After that we never
+ * look again, relying upon the settings of our local state variables. This
+ * is designed to avoid the need for a separate initialisation step.
+ */
+bool
+IsRecoveryProcessingMode(void)
+{
+    if (!LocalRecoveryProcessingMode)
+        return false;
+    else
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile XLogCtlData *xlogctl = XLogCtl;
+
+        LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+        return LocalRecoveryProcessingMode;
+    }
 }

 /*
@@ -5696,22 +5851,27 @@ ShutdownXLOG(int code, Datum arg)
  * Log start of a checkpoint.
  */
 static void
-LogCheckpointStart(int flags)
+LogCheckpointStart(int flags, bool restartpoint)
 {
-    elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
-         (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
-         (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
-         (flags & CHECKPOINT_FORCE) ? " force" : "",
-         (flags & CHECKPOINT_WAIT) ? " wait" : "",
-         (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
-         (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
+    if (restartpoint)
+        elog(LOG, "restartpoint starting:%s",
+             (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "");
+    else
+        elog(LOG, "checkpoint starting:%s%s%s%s%s%s%s",
+             (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+             (flags & CHECKPOINT_STARTUP) ? " startup" : "",
+             (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
+             (flags & CHECKPOINT_FORCE) ? " force" : "",
+             (flags & CHECKPOINT_WAIT) ? " wait" : "",
+             (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
+             (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
 }

 /*
  * Log end of a checkpoint.
  */
 static void
-LogCheckpointEnd(void)
+LogCheckpointEnd(int flags, bool restartpoint)
 {
     long        write_secs,
                 sync_secs,
@@ -5734,17 +5894,26 @@ LogCheckpointEnd(void)
                         CheckpointStats.ckpt_sync_end_t,
                         &sync_secs, &sync_usecs);

-    elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
-         "%d transaction log file(s) added, %d removed, %d recycled; "
-         "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
-         CheckpointStats.ckpt_bufs_written,
-         (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
-         CheckpointStats.ckpt_segs_added,
-         CheckpointStats.ckpt_segs_removed,
-         CheckpointStats.ckpt_segs_recycled,
-         write_secs, write_usecs / 1000,
-         sync_secs, sync_usecs / 1000,
-         total_secs, total_usecs / 1000);
+    if (restartpoint)
+        elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
+             "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+             CheckpointStats.ckpt_bufs_written,
+             (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
+             write_secs, write_usecs / 1000,
+             sync_secs, sync_usecs / 1000,
+             total_secs, total_usecs / 1000);
+    else
+        elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
+             "%d transaction log file(s) added, %d removed, %d recycled; "
+             "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+             CheckpointStats.ckpt_bufs_written,
+             (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
+             CheckpointStats.ckpt_segs_added,
+             CheckpointStats.ckpt_segs_removed,
+             CheckpointStats.ckpt_segs_recycled,
+             write_secs, write_usecs / 1000,
+             sync_secs, sync_usecs / 1000,
+             total_secs, total_usecs / 1000);
 }

 /*
@@ -5800,9 +5969,11 @@ CreateCheckPoint(int flags)

     if (shutdown)
     {
+        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
         ControlFile->state = DB_SHUTDOWNING;
         ControlFile->time = (pg_time_t) time(NULL);
         UpdateControlFile();
+        LWLockRelease(ControlFileLock);
     }

     /*
@@ -5906,7 +6077,7 @@ CreateCheckPoint(int flags)
      * to log anything if we decided to skip the checkpoint.
      */
     if (log_checkpoints)
-        LogCheckpointStart(flags);
+        LogCheckpointStart(flags, false);

     TRACE_POSTGRESQL_CHECKPOINT_START(flags);

@@ -6010,11 +6181,18 @@ CreateCheckPoint(int flags)
     XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);

     /*
-     * Update the control file.
+     * Update the control file. In 8.4, this routine becomes the primary
+     * point for recording changes of state in the control file at the
+     * end of recovery. Postmaster state already shows us being in
+     * normal running mode, but it is only after this point that we
+     * are completely free of reperforming a recovery if we crash.  Note
+     * that this is executed by bgwriter after the death of Startup process.
      */
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
     if (shutdown)
         ControlFile->state = DB_SHUTDOWNED;
+    else
+        ControlFile->state = DB_IN_PRODUCTION;
     ControlFile->prevCheckPoint = ControlFile->checkPoint;
     ControlFile->checkPoint = ProcLastRecPtr;
     ControlFile->checkPointCopy = checkPoint;
@@ -6068,12 +6246,11 @@ CreateCheckPoint(int flags)
      * in subtrans.c).    During recovery, though, we mustn't do this because
      * StartupSUBTRANS hasn't been called yet.
      */
-    if (!InRecovery)
-        TruncateSUBTRANS(GetOldestXmin(true, false));
+    TruncateSUBTRANS(GetOldestXmin(true, false));

     /* All real work is done, but log before releasing lock. */
     if (log_checkpoints)
-        LogCheckpointEnd();
+        LogCheckpointEnd(flags, false);

         TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                 NBuffers, CheckpointStats.ckpt_segs_added,
@@ -6101,32 +6278,16 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 }

 /*
- * Set a recovery restart point if appropriate
- *
- * This is similar to CreateCheckPoint, but is used during WAL recovery
- * to establish a point from which recovery can roll forward without
- * replaying the entire recovery log.  This function is called each time
- * a checkpoint record is read from XLOG; it must determine whether a
- * restartpoint is needed or not.
+ * Store checkpoint record in shared memory, so that it can be used as a
+ * restartpoint. This function is called each time a checkpoint record is
+ * read from XLOG.
  */
 static void
 RecoveryRestartPoint(const CheckPoint *checkPoint)
 {
-    int            elapsed_secs;
     int            rmid;
-
-    /*
-     * Do nothing if the elapsed time since the last restartpoint is less than
-     * half of checkpoint_timeout.    (We use a value less than
-     * checkpoint_timeout so that variations in the timing of checkpoints on
-     * the master, or speed of transmission of WAL segments to a slave, won't
-     * make the slave skip a restartpoint once it's synced with the master.)
-     * Checking true elapsed time keeps us from doing restartpoints too often
-     * while rapidly scanning large amounts of WAL.
-     */
-    elapsed_secs = (pg_time_t) time(NULL) - ControlFile->time;
-    if (elapsed_secs < CheckPointTimeout / 2)
-        return;
+    /* use volatile pointer to prevent code rearrangement */
+    volatile XLogCtlData *xlogctl = XLogCtl;

     /*
      * Is it safe to checkpoint?  We must ask each of the resource managers
@@ -6148,28 +6309,111 @@ RecoveryRestartPoint(const CheckPoint *checkPoint)
     }

     /*
-     * OK, force data out to disk
+     * Copy the checkpoint record to shared memory, so that bgwriter can
+     * use it the next time it wants to perform a restartpoint.
+     */
+    SpinLockAcquire(&xlogctl->info_lck);
+    XLogCtl->lastCheckPointRecPtr = ReadRecPtr;
+    memcpy(&XLogCtl->lastCheckPoint, checkPoint, sizeof(CheckPoint));
+    SpinLockRelease(&xlogctl->info_lck);
+
+    /*
+     * XXX: Should we try to perform restartpoints if we're not in consistent
+     * recovery? The bgwriter isn't doing it for us in that case.
+     */
+}
+
+/*
+ * This is similar to CreateCheckPoint, but is used during WAL recovery
+ * to establish a point from which recovery can roll forward without
+ * replaying the entire recovery log.
+ */
+void
+CreateRestartPoint(int flags)
+{
+    XLogRecPtr lastCheckPointRecPtr;
+    CheckPoint lastCheckPoint;
+    /* use volatile pointer to prevent code rearrangement */
+    volatile XLogCtlData *xlogctl = XLogCtl;
+
+    /* Get a local copy of the last checkpoint record. */
+    SpinLockAcquire(&xlogctl->info_lck);
+    lastCheckPointRecPtr = xlogctl->lastCheckPointRecPtr;
+    memcpy(&lastCheckPoint, &XLogCtl->lastCheckPoint, sizeof(CheckPoint));
+    SpinLockRelease(&xlogctl->info_lck);
+
+    /*
+     * If the last checkpoint record we've replayed is already our last
+     * restartpoint, we're done.
      */
-    CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);
+    if (XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
+    {
+        ereport(DEBUG2,
+                (errmsg("skipping restartpoint, already performed at %X/%X",
+                        lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
+        return;
+    }

     /*
-     * Update pg_control so that any subsequent crash will restart from this
-     * checkpoint.    Note: ReadRecPtr gives the XLOG address of the checkpoint
-     * record itself.
+     * Acquire CheckpointLock to ensure only one restartpoint happens at a time.
+     * We rely on this lock to ensure that the startup process doesn't exit
+     * Recovery while we are half way through a restartpoint. XXX ?
      */
+    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+
+    /* Check that we're still in recovery mode. */
+    if (!IsRecoveryProcessingMode())
+    {
+        ereport(DEBUG2,
+                (errmsg("skipping restartpoint, recovery has already ended")));
+        LWLockRelease(CheckpointLock);
+        return;
+    }
+
+    if (XLogCtl->recoveryLogRestartpoints)
+    {
+        /*
+         * Prepare to accumulate statistics.
+         */
+        MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+        CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+
+        LogCheckpointStart(flags, true);
+    }
+
+    CheckPointGuts(lastCheckPoint.redo, flags);
+
+    /*
+     * Update pg_control, using current time
+     */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
     ControlFile->prevCheckPoint = ControlFile->checkPoint;
-    ControlFile->checkPoint = ReadRecPtr;
-    ControlFile->checkPointCopy = *checkPoint;
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
     ControlFile->time = (pg_time_t) time(NULL);
     UpdateControlFile();
+    LWLockRelease(ControlFileLock);

-    ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
+    /*
+     * Currently, there is no need to truncate pg_subtrans during recovery.
+     * If we did do that, we will need to have called StartupSUBTRANS()
+     * already and then TruncateSUBTRANS() would go here.
+     */
+
+    /* All real work is done, but log before releasing lock. */
+    if (XLogCtl->recoveryLogRestartpoints)
+        LogCheckpointEnd(flags, true);
+
+    ereport((XLogCtl->recoveryLogRestartpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
+                    lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
+
     if (recoveryLastXTime)
-        ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
-                (errmsg("last completed transaction was at log time %s",
-                        timestamptz_to_str(recoveryLastXTime))));
+        ereport((XLogCtl->recoveryLogRestartpoints ? LOG : DEBUG2),
+            (errmsg("last completed transaction was at log time %s",
+                    timestamptz_to_str(recoveryLastXTime))));
+
+    LWLockRelease(CheckpointLock);
 }

 /*
@@ -6234,7 +6478,43 @@ RequestXLogSwitch(void)
 }

 /*
+ * exitRecovery()
+ *
+ * Exit recovery state and write a XLOG_RECOVERY_END record. This is the
+ * only record type that can record a change of timelineID. We assume
+ * caller has already set ThisTimeLineID, if appropriate.
+ */
+static void
+exitRecovery(void)
+{
+    XLogRecData rdata;
+
+    rdata.buffer = InvalidBuffer;
+    rdata.data = (char *) (&ThisTimeLineID);
+    rdata.len = sizeof(TimeLineID);
+    rdata.next = NULL;
+
+    /*
+     * This is the only type of WAL message that can be inserted during
+     * recovery. This ensures that we don't allow others to get access
+     * until after we have changed state.
+     */
+    (void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata);
+
+    /*
+     * We don't XLogFlush() here otherwise we'll end up zeroing the WAL
+     * file ourselves. So just let bgwriter's forthcoming checkpoint do
+     * that for us.
+     */
+
+    InRecovery = false;
+}
+
+/*
  * XLOG resource manager's routines
+ *
+ * Definitions of message info are in include/catalog/pg_control.h,
+ * though not all messages relate to control file processing.
  */
 void
 xlog_redo(XLogRecPtr lsn, XLogRecord *record)
@@ -6272,21 +6552,38 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
         ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

         /*
-         * TLI may change in a shutdown checkpoint, but it shouldn't decrease
+         * TLI no longer changes at shutdown checkpoint, since as of 8.4,
+         * shutdown checkpoints only occur at shutdown. Much less confusing.
          */
-        if (checkPoint.ThisTimeLineID != ThisTimeLineID)
+
+        RecoveryRestartPoint(&checkPoint);
+    }
+    else if (info == XLOG_RECOVERY_END)
+    {
+        TimeLineID    tli;
+
+        memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID));
+
+        /*
+         * TLI may change when recovery ends, but it shouldn't decrease.
+         *
+         * This is the only WAL record that can tell us to change timelineID
+         * while we process WAL records.
+         *
+         * We can *choose* to stop recovery at any point, generating a
+         * new timelineID which is recorded using this record type.
+         */
+        if (tli != ThisTimeLineID)
         {
-            if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
+            if (tli < ThisTimeLineID ||
                 !list_member_int(expectedTLIs,
-                                 (int) checkPoint.ThisTimeLineID))
+                                 (int) tli))
                 ereport(PANIC,
-                        (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
-                                checkPoint.ThisTimeLineID, ThisTimeLineID)));
+                        (errmsg("unexpected timeline ID %u (after %u) at recovery end record",
+                                tli, ThisTimeLineID)));
             /* Following WAL records should be run with new TLI */
-            ThisTimeLineID = checkPoint.ThisTimeLineID;
+            ThisTimeLineID = tli;
         }
-
-        RecoveryRestartPoint(&checkPoint);
     }
     else if (info == XLOG_CHECKPOINT_ONLINE)
     {
@@ -6309,7 +6606,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
         ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
         ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

-        /* TLI should not change in an on-line checkpoint */
+        /* TLI must not change at a checkpoint */
         if (checkPoint.ThisTimeLineID != ThisTimeLineID)
             ereport(PANIC,
                     (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 6a0cd4e..428a440 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -49,6 +49,7 @@
 #include <unistd.h>

 #include "access/xlog_internal.h"
+#include "catalog/pg_control.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -197,6 +198,9 @@ BackgroundWriterMain(void)
 {
     sigjmp_buf    local_sigjmp_buf;
     MemoryContext bgwriter_context;
+    bool        BgWriterRecoveryMode;
+    /* use volatile pointer to prevent code rearrangement */
+    volatile BgWriterShmemStruct *bgs = BgWriterShmem;

     BgWriterShmem->bgwriter_pid = MyProcPid;
     am_bg_writer = true;
@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
      */
     PG_SETMASK(&UnBlockSig);

+    BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+    if (BgWriterRecoveryMode)
+        elog(DEBUG1, "bgwriter starting during recovery");
+    else
+        InitXLOGAccess();
+
+    /*
+     * If someone requested a checkpoint before we started up, process that.
+     *
+     * This check exists primarily for crash recovery: after the startup
+     * process is finished with WAL replay, it will request a checkpoint, but
+     * the background writer might not have started yet. This check will
+     * actually not notice a checkpoint that's been requested without any
+     * flags, but it's good enough for the startup checkpoint.
+     */
+    SpinLockAcquire(&bgs->ckpt_lck);
+    if (bgs->ckpt_flags)
+        checkpoint_requested = true;
+    SpinLockRelease(&bgs->ckpt_lck);
+
     /*
      * Loop forever
      */
@@ -396,7 +421,8 @@ BackgroundWriterMain(void)
              */
             ExitOnAnyError = true;
             /* Close down the database */
-            ShutdownXLOG(0, 0);
+            if (!BgWriterRecoveryMode)
+                ShutdownXLOG(0, 0);
             /* Normal exit from the bgwriter is here */
             proc_exit(0);        /* done */
         }
@@ -418,14 +444,26 @@ BackgroundWriterMain(void)
         }

         /*
+         * Check if we've exited recovery. We do this after determining
+         * whether to perform a checkpoint or not, to be sure that we
+         * perform a real checkpoint and not a restartpoint, if someone
+         * (like the startup process!) requested a checkpoint immediately
+         * after exiting recovery.
+         */
+        if (BgWriterRecoveryMode && !IsRecoveryProcessingMode())
+        {
+            elog(DEBUG1, "bgwriter changing from recovery to normal mode");
+
+            InitXLOGAccess();
+            BgWriterRecoveryMode = false;
+        }
+
+        /*
          * Do a checkpoint if requested, otherwise do one cycle of
          * dirty-buffer writing.
          */
         if (do_checkpoint)
         {
-            /* use volatile pointer to prevent code rearrangement */
-            volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
             /*
              * Atomically fetch the request flags to figure out what kind of a
              * checkpoint we should perform, and increase the started-counter
@@ -444,7 +482,8 @@ BackgroundWriterMain(void)
              * implementation will not generate warnings caused by
              * CheckPointTimeout < CheckPointWarning.
              */
-            if ((flags & CHECKPOINT_CAUSE_XLOG) &&
+            if (!BgWriterRecoveryMode &&
+                (flags & CHECKPOINT_CAUSE_XLOG) &&
                 elapsed_secs < CheckPointWarning)
                 ereport(LOG,
                         (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
@@ -455,14 +494,18 @@ BackgroundWriterMain(void)
              * Initialize bgwriter-private variables used during checkpoint.
              */
             ckpt_active = true;
-            ckpt_start_recptr = GetInsertRecPtr();
+            if (!BgWriterRecoveryMode)
+                ckpt_start_recptr = GetInsertRecPtr();
             ckpt_start_time = now;
             ckpt_cached_elapsed = 0;

             /*
              * Do the checkpoint.
              */
-            CreateCheckPoint(flags);
+            if (!BgWriterRecoveryMode)
+                CreateCheckPoint(flags);
+            else
+                CreateRestartPoint(flags);

             /*
              * After any checkpoint, close all smgr files.    This is so we
@@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
     pg_time_t    now;
     pg_time_t    last_time;

-    if (XLogArchiveTimeout <= 0)
+    if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())
         return;

     now = (pg_time_t) time(NULL);
@@ -586,7 +629,8 @@ BgWriterNap(void)
         (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
             break;
         pg_usleep(1000000L);
-        AbsorbFsyncRequests();
+        if (!IsRecoveryProcessingMode())
+            AbsorbFsyncRequests();
         udelay -= 1000000L;
     }

@@ -714,16 +758,19 @@ IsCheckpointOnSchedule(double progress)
      * However, it's good enough for our purposes, we're only calculating an
      * estimate anyway.
      */
-    recptr = GetInsertRecPtr();
-    elapsed_xlogs =
-        (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-         ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-        CheckPointSegments;
-
-    if (progress < elapsed_xlogs)
+    if (!IsRecoveryProcessingMode())
     {
-        ckpt_cached_elapsed = elapsed_xlogs;
-        return false;
+        recptr = GetInsertRecPtr();
+        elapsed_xlogs =
+            (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+             ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+            CheckPointSegments;
+
+        if (progress < elapsed_xlogs)
+        {
+            ckpt_cached_elapsed = elapsed_xlogs;
+            return false;
+        }
     }

     /*
@@ -850,6 +897,7 @@ BgWriterShmemInit(void)
  *
  * flags is a bitwise OR of the following:
  *    CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *    CHECKPOINT_IS_STARTUP: checkpoint is for database startup.
  *    CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *        ignoring checkpoint_completion_target parameter.
  *    CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
@@ -916,6 +964,18 @@ RequestCheckpoint(int flags)
     {
         if (BgWriterShmem->bgwriter_pid == 0)
         {
+            /*
+             * The only difference between a startup checkpoint and a normal
+             * online checkpoint is that it's quite normal for the bgwriter
+             * to not be up yet when the startup checkpoint is requested.
+             * (it might be, though). That's ok, background writer will
+             * perform the checkpoint as soon as it starts up.
+             */
+            if (flags & CHECKPOINT_STARTUP)
+            {
+                Assert(!(flags & CHECKPOINT_WAIT));
+                break;
+            }
             if (ntries >= 20)        /* max wait 2.0 sec */
             {
                 elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3380b80..221c9b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -228,7 +228,18 @@ static bool FatalError = false; /* T if recovering from backend crash */

 /*
  * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * recovery.
+ *
+ * Recovery is split into two phases: crash recovery and consistent (archive)
+ * recovery.  The startup process begins with crash recovery, replaying WAL
+ * until a self-consistent database state is reached. At that point, it
+ * signals postmaster, and we switch to consistent recovery phase. The
+ * background writer is launched, while the startup process continues
+ * applying WAL.  We could start accepting connections to perform read-only
+ * queries at this point, if we had the infrastructure to do that. When the
+ * startup process exits, we switch to PM_RUN state. The startup process can
+ * also skip the consistent recovery altogether, as it will during normal
+ * startup when there's no recovery to be done, for example.
  *
  * Normal child backends can only be launched when we are in PM_RUN state.
  * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
@@ -254,6 +265,7 @@ typedef enum
 {
     PM_INIT,                    /* postmaster starting */
     PM_STARTUP,                    /* waiting for startup subprocess */
+    PM_RECOVERY,                /* consistent recovery mode */
     PM_RUN,                        /* normal "database is alive" state */
     PM_WAIT_BACKUP,                /* waiting for online backup mode to end */
     PM_WAIT_BACKENDS,            /* waiting for live backends to exit */
@@ -1302,7 +1314,7 @@ ServerLoop(void)
          * state that prevents it, start one.  It doesn't matter if this
          * fails, we'll just try again later.
          */
-        if (BgWriterPID == 0 && pmState == PM_RUN)
+        if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
             BgWriterPID = StartBackgroundWriter();

         /*
@@ -2116,7 +2128,7 @@ reaper(SIGNAL_ARGS)
         if (pid == StartupPID)
         {
             StartupPID = 0;
-            Assert(pmState == PM_STARTUP);
+            Assert(pmState == PM_STARTUP || pmState == PM_RECOVERY);

             /* FATAL exit of startup is treated as catastrophic */
             if (!EXIT_STATUS_0(exitstatus))
@@ -2157,11 +2169,12 @@ reaper(SIGNAL_ARGS)
             load_role();

             /*
-             * Crank up the background writer.    It doesn't matter if this
-             * fails, we'll just try again later.
+             * Crank up the background writer, if we didn't do that already
+             * when we entered consistent recovery phase.  It doesn't matter
+             * if this fails, we'll just try again later.
              */
-            Assert(BgWriterPID == 0);
-            BgWriterPID = StartBackgroundWriter();
+            if (BgWriterPID == 0)
+                BgWriterPID = StartBackgroundWriter();

             /*
              * Likewise, start other special children as needed.  In a restart
@@ -3847,6 +3860,51 @@ sigusr1_handler(SIGNAL_ARGS)

     PG_SETMASK(&BlockSig);

+    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+    {
+        Assert(pmState == PM_STARTUP);
+
+        /*
+         * Go to shutdown mode if a shutdown request was pending.
+         */
+        if (Shutdown > NoShutdown)
+        {
+            pmState = PM_WAIT_BACKENDS;
+            /* PostmasterStateMachine logic does the rest */
+        }
+        else
+        {
+            /*
+             * Startup process has entered recovery
+             */
+            pmState = PM_RECOVERY;
+
+            /*
+             * Load the flat authorization file into postmaster's cache. The
+             * startup process won't have recomputed this from the database yet,
+             * so it may change following recovery.
+             */
+            load_role();
+
+            /*
+             * Crank up the background writer.    It doesn't matter if this
+             * fails, we'll just try again later.
+             */
+            Assert(BgWriterPID == 0);
+            BgWriterPID = StartBackgroundWriter();
+
+            /*
+             * Likewise, start other special children as needed.
+             */
+            Assert(PgStatPID == 0);
+            PgStatPID = pgstat_start();
+
+            /* XXX at this point we could accept read-only connections */
+            ereport(DEBUG1,
+                 (errmsg("database system is in consistent recovery mode")));
+        }
+    }
+
     if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
     {
         /*
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 62b22bd..a7b81e3 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -268,3 +268,12 @@ out (and anyone else who flushes buffer contents to disk must do so too).
 This ensures that the page image transferred to disk is reasonably consistent.
 We might miss a hint-bit update or two but that isn't a problem, for the same
 reasons mentioned under buffer access rules.
+
+As of 8.4, background writer starts during recovery mode when there is
+some form of potentially extended recovery to perform. It performs an
+identical service to normal processing, except that checkpoints it
+writes are technically restartpoints. Flushing outstanding WAL for dirty
+buffers is also skipped, though there shouldn't ever be new WAL entries
+at that time in any case. We could choose to start background writer
+immediately but we hold off until we can prove the database is in a
+consistent state so that postmaster has a single, clean state change.
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 4ea849d..3bba50a 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -197,6 +197,9 @@ main(int argc, char *argv[])
     printf(_("Minimum recovery ending location:     %X/%X\n"),
            ControlFile.minRecoveryPoint.xlogid,
            ControlFile.minRecoveryPoint.xrecoff);
+    printf(_("Minimum safe starting location:       %X/%X\n"),
+           ControlFile.minSafeStartPoint.xlogid,
+           ControlFile.minSafeStartPoint.xrecoff);
     printf(_("Maximum data alignment:               %u\n"),
            ControlFile.maxAlign);
     /* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 51cdde1..b20d4bd 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -603,6 +603,8 @@ RewriteControlFile(void)
     ControlFile.prevCheckPoint.xrecoff = 0;
     ControlFile.minRecoveryPoint.xlogid = 0;
     ControlFile.minRecoveryPoint.xrecoff = 0;
+    ControlFile.minSafeStartPoint.xlogid = 0;
+    ControlFile.minSafeStartPoint.xrecoff = 0;

     /* Now we can force the recorded xlog seg size to the right thing. */
     ControlFile.xlog_seg_size = XLogSegSize;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6913f7c..6f58b80 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -133,7 +133,16 @@ typedef struct XLogRecData
 } XLogRecData;

 extern TimeLineID ThisTimeLineID;        /* current TLI */
-extern bool InRecovery;
+
+/*
+ * Prior to 8.4, all activity during recovery was carried out by Startup
+ * process. This local variable continues to be used in many parts of the
+ * code to indicate actions taken by RecoveryManagers. Other processes that
+ * potentially perform work during recovery should check
+ * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
+ */
+extern bool InRecovery;
+
 extern XLogRecPtr XactLastRecEnd;

 /* these variables are GUC parameters related to XLOG */
@@ -161,11 +170,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_IS_SHUTDOWN    0x0001    /* Checkpoint is for shutdown */
 #define CHECKPOINT_IMMEDIATE    0x0002    /* Do it without delays */
 #define CHECKPOINT_FORCE        0x0004    /* Force even if no activity */
+#define CHECKPOINT_STARTUP        0x0008    /* Startup checkpoint */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT            0x0008    /* Wait for completion */
+#define CHECKPOINT_WAIT            0x0010    /* Wait for completion */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG    0x0010    /* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME    0x0020    /* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG    0x0020    /* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME    0x0040    /* Elapsed time */

 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
@@ -199,6 +209,8 @@ extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
 extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

+extern bool IsRecoveryProcessingMode(void);
+
 extern void UpdateControlFile(void);
 extern Size XLOGShmemSize(void);
 extern void XLOGShmemInit(void);
@@ -207,6 +219,7 @@ extern void StartupXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
+extern void CreateRestartPoint(int flags);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 400f32c..e69c8ec 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -21,7 +21,7 @@


 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION    843
+#define PG_CONTROL_VERSION    847

 /*
  * Body of CheckPoint XLOG records.  This is declared here because we keep
@@ -46,7 +46,7 @@ typedef struct CheckPoint
 #define XLOG_NOOP                        0x20
 #define XLOG_NEXTOID                    0x30
 #define XLOG_SWITCH                        0x40
-
+#define XLOG_RECOVERY_END            0x50

 /* System status indicator */
 typedef enum DBState
@@ -102,6 +102,7 @@ typedef struct ControlFileData
     CheckPoint    checkPointCopy; /* copy of last check point record */

     XLogRecPtr    minRecoveryPoint;        /* must replay xlog to here */
+    XLogRecPtr    minSafeStartPoint;        /* safe point after recovery crashes */

     /*
      * This data is used to check for hardware-architecture compatibility of
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3101092..1904187 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -22,6 +22,7 @@
  */
 typedef enum
 {
+    PMSIGNAL_RECOVERY_START,    /* move to PM_RECOVERY state */
     PMSIGNAL_PASSWORD_CHANGE,    /* pg_auth file has changed */
     PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
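
To make the XLOG_RECOVERY_END rule in the xlog_redo() hunk above concrete, here is a minimal standalone sketch of the timeline check. It is not the patch's code: the function names are hypothetical, a plain array stands in for the expectedTLIs list, and a false return marks the cases where the patch would ereport(PANIC).

```c
#include <assert.h>     /* only needed for the checks in the usage note */
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TimeLineID;    /* stand-in for the server's typedef */

/* Hypothetical helper: plays the role of list_member_int(expectedTLIs, tli). */
static bool
tli_is_expected(const TimeLineID *expected, size_t n, TimeLineID tli)
{
    for (size_t i = 0; i < n; i++)
    {
        if (expected[i] == tli)
            return true;
    }
    return false;
}

/*
 * Hypothetical helper mirroring the redo rule for a recovery end record:
 * the timeline may change, but it must not decrease and it must be one of
 * the expected timelines. Returns false where the patch would PANIC;
 * otherwise updates *current and returns true.
 */
static bool
apply_recovery_end_tli(TimeLineID *current,
                       const TimeLineID *expected, size_t n,
                       TimeLineID tli)
{
    if (tli == *current)
        return true;            /* no timeline change recorded */

    if (tli < *current || !tli_is_expected(expected, n, tli))
        return false;           /* unexpected timeline ID */

    *current = tli;             /* following WAL records run with the new TLI */
    return true;
}
```

With expected timelines {1, 3} and a current timeline of 1, a record carrying 3 is accepted and switches the timeline, while a later record carrying 1 (a decrease) or 5 (not expected) is rejected.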

Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 12:04 +0200, Heikki Linnakangas wrote:
> I've been reviewing and massaging the so called recovery infra patch.

Thanks.

> I feel quite good about this patch now. Given the amount of code
> churn, it requires testing, and I'll read it through one more time
> after sleeping over it. 

There's nothing major I feel we should discuss.

The way restartpoints happen is a useful improvement, thanks.

> Simon, do you see anything wrong with this?

A few minor points:

* I think we are now renaming the recovery.conf file too early. The
comment says "We have already restored all the WAL segments we need from
the archive, and we trust that they are not going to go away even if we
crash." We have, but the files overwrite each other as they arrive, so
if the last restartpoint is not in the last restored WAL file then it
will only exist in the archive. The recovery.conf is the only place
where we store the information on where the archive is and how to access
it, so by renaming it out of the way we will be unable to crash recover
until the first checkpoint is complete. So the way this was in the
original patch is the correct way to go, AFAICS.

* My original intention was to deprecate log_restartpoints, and I would
still like to do so. log_checkpoints does just as well for that. Even
less code than before...

* comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version
because it matches the IS_SHUTDOWN naming.

* In CreateCheckPoint() the if test on TruncateSUBTRANS() has been
removed, but the comment has not been changed (to explain why).

* PG_CONTROL_VERSION bump should be just one increment, to 844. I
deliberately had it higher to help spot mismatches earlier, and to avoid
needless patch conflicts.
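
For reference on the flag naming point above, this is a small sketch of the flag layout from the xlog.h hunk of the posted patch, together with the early exit that RequestCheckpoint() takes for a startup checkpoint. The flag values are copied from the patch; the helper function and its name are hypothetical and only illustrate the logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as they appear in the posted patch's xlog.h hunk. */
#define CHECKPOINT_IS_SHUTDOWN  0x0001  /* Checkpoint is for shutdown */
#define CHECKPOINT_IMMEDIATE    0x0002  /* Do it without delays */
#define CHECKPOINT_FORCE        0x0004  /* Force even if no activity */
#define CHECKPOINT_STARTUP      0x0008  /* Startup checkpoint */
#define CHECKPOINT_WAIT         0x0010  /* Wait for completion */
#define CHECKPOINT_CAUSE_XLOG   0x0020  /* XLOG consumption */
#define CHECKPOINT_CAUSE_TIME   0x0040  /* Elapsed time */

/*
 * Hypothetical helper: returns true when a request can simply be left
 * pending because the bgwriter has not started yet. In the patch this
 * applies only to startup checkpoints, which must not ask to wait.
 */
static bool
can_leave_request_pending(int flags, int bgwriter_pid)
{
    if (bgwriter_pid != 0)
        return false;       /* bgwriter is up; signal it as usual */

    if (flags & CHECKPOINT_STARTUP)
    {
        assert(!(flags & CHECKPOINT_WAIT));
        return true;        /* bgwriter picks the request up when it starts */
    }

    return false;           /* other requests retry and eventually give up */
}
```

A request with CHECKPOINT_STARTUP | CHECKPOINT_IMMEDIATE and no running bgwriter is left pending, whereas CHECKPOINT_FORCE alone is not.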

So it looks pretty much ready for commit very soon.

We should continue to measure performance of recovery in the light of
these changes. I still feel that fsyncing the control file on each
XLogFileRead() will give a noticeable performance penalty, mostly
because we know doing exactly the same thing in normal running caused a
performance penalty. But that is easily changed and cannot be done with
any certainty without wider feedback, so no reason to delay code commit.

-- Simon Riggs           www.2ndQuadrant.com PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I've been reviewing and massaging the so called recovery infra patch.

Great!

> I feel quite good about this patch now. Given the amount of code churn, it
> requires testing, and I'll read it through one more time after sleeping over
> it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it.

> @@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
>      pg_time_t    now;
>      pg_time_t    last_time;
>
> -    if (XLogArchiveTimeout <= 0)
> +    if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())

The above change destroys archive_timeout because checking the timeout
is always skipped after recovery is done.

> @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
>       */
>      PG_SETMASK(&UnBlockSig);
>
> +    BgWriterRecoveryMode = IsRecoveryProcessingMode();
> +
> +    if (BgWriterRecoveryMode)
> +        elog(DEBUG1, "bgwriter starting during recovery");
> +    else
> +        InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

> @@ -1302,7 +1314,7 @@ ServerLoop(void)
>           * state that prevents it, start one.  It doesn't matter if this
>           * fails, we'll just try again later.
>           */
> -        if (BgWriterPID == 0 && pmState == PM_RUN)
> +        if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
>              BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

> @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
>          unlink(tmppath);
>      }
>
> -    elog(DEBUG2, "done creating and filling new WAL file");
> +    XLogFileName(tmppath, ThisTimeLineID, log, seg);
> +    elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as "tmppath" might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().

regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:

> > @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
> >       */
> >      PG_SETMASK(&UnBlockSig);
> >
> > +    BgWriterRecoveryMode = IsRecoveryProcessingMode();
> > +
> > +    if (BgWriterRecoveryMode)
> > +        elog(DEBUG1, "bgwriter starting during recovery");
> > +    else
> > +        InitXLOGAccess();
> 
> Why is InitXLOGAccess() called also here when bgwriter is started after
> recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

> > @@ -1302,7 +1314,7 @@ ServerLoop(void)
> >           * state that prevents it, start one.  It doesn't matter if this
> >           * fails, we'll just try again later.
> >           */
> > -        if (BgWriterPID == 0 && pmState == PM_RUN)
> > +        if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
> >              BgWriterPID = StartBackgroundWriter();
> 
> Likewise, we should try to start also the stats collector during recovery?

We did in the previous patch...

> > @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
> >          unlink(tmppath);
> >      }
> >
> > -    elog(DEBUG2, "done creating and filling new WAL file");
> > +    XLogFileName(tmppath, ThisTimeLineID, log, seg);
> > +    elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
> 
> This debug message is somewhat confusing, because the WAL file
> represented as "tmppath" might have been already created by
> previous XLogFileInit() via InstallXLogFileSegment().

I think those are just for debugging and can be removed.

-- Simon Riggs           www.2ndQuadrant.com PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Wed, Jan 28, 2009 at 11:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:
>
>> > @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
>> >      */
>> >     PG_SETMASK(&UnBlockSig);
>> >
>> > +   BgWriterRecoveryMode = IsRecoveryProcessingMode();
>> > +
>> > +   if (BgWriterRecoveryMode)
>> > +           elog(DEBUG1, "bgwriter starting during recovery");
>> > +   else
>> > +           InitXLOGAccess();
>>
>> Why is InitXLOGAccess() called also here when bgwriter is started after
>> recovery? That is already called by AuxiliaryProcessMain().
>
> InitXLOGAccess() sets the timeline and also gets the latest record
> pointer. If the bgwriter is started in recovery these values need to be
> reset later. It's easier to call it twice.

Right. But, InitXLOGAccess() called during main loop is enough for
that purpose.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 23:54 +0900, Fujii Masao wrote:
> >> Why is InitXLOGAccess() called also here when bgwriter is started after
> >> recovery? That is already called by AuxiliaryProcessMain().
> >
> > InitXLOGAccess() sets the timeline and also gets the latest record
> > pointer. If the bgwriter is started in recovery these values need to be
> > reset later. It's easier to call it twice.
> 
> Right. But, InitXLOGAccess() called during main loop is enough for
> that purpose.

I think the code is clearer the way it is. Otherwise you'd read
AuxiliaryProcessMain() and think the bgwriter didn't need xlog access.

-- Simon Riggs           www.2ndQuadrant.com PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> I feel quite good about this patch now. Given the amount of code churn, it
>> requires testing, and I'll read it through one more time after sleeping over
>> it. Simon, do you see anything wrong with this?
> 
> I also read this patch and found something odd. 

Many thanks for looking into this!

>> @@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
>>      pg_time_t    now;
>>      pg_time_t    last_time;
>>
>> -    if (XLogArchiveTimeout <= 0)
>> +    if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())
> 
> The above change destroys archive_timeout because checking the timeout
> is always skipped after recovery is done.

Yep, good catch. That obviously needs to be 
"IsRecoveryProcessingMode()", without the exclamation mark.

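The effect of the stray negation can be seen in a small standalone sketch. The two globals below are hypothetical stand-ins for XLogArchiveTimeout and the result of IsRecoveryProcessingMode(); each function returns true when CheckArchiveTimeout() would return early without checking the timeout.

```c
#include <assert.h>     /* only needed for the checks in the usage note */
#include <stdbool.h>

/* Hypothetical stand-ins for server state, not the real globals. */
static int  archive_timeout_secs = 60;  /* role of XLogArchiveTimeout */
static bool in_recovery = false;        /* role of IsRecoveryProcessingMode() */

/* Guard as posted: skips the check whenever we are NOT in recovery. */
static bool
skip_check_as_posted(void)
{
    return archive_timeout_secs <= 0 || !in_recovery;
}

/* Guard with the exclamation mark dropped: skips only during recovery. */
static bool
skip_check_as_corrected(void)
{
    return archive_timeout_secs <= 0 || in_recovery;
}
```

With a positive timeout and recovery finished, the posted guard skips the check while the corrected one does not, which matches the report that archive_timeout stops working once recovery is done.
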
>> @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
>>       */
>>      PG_SETMASK(&UnBlockSig);
>>
>> +    BgWriterRecoveryMode = IsRecoveryProcessingMode();
>> +
>> +    if (BgWriterRecoveryMode)
>> +        elog(DEBUG1, "bgwriter starting during recovery");
>> +    else
>> +        InitXLOGAccess();
> 
> Why is InitXLOGAccess() called also here when bgwriter is started after
> recovery? That is already called by AuxiliaryProcessMain().

Oh, I didn't realize that. Agreed, it's a bit disconcerting that 
InitXLOGAccess() is called twice (there was a 2nd call within the loop 
in the original patch as well). Looking at InitXLOGAccess, it's harmless 
to call it multiple times, but it seems better to remove the 
InitXLOGAccess call from AuxiliaryProcessMain().

InitXLOGAccess() needs to be called after seeing that 
IsRecoveryProcessingMode() == false, because it won't get the right 
timeline id before that.

>> @@ -1302,7 +1314,7 @@ ServerLoop(void)
>>           * state that prevents it, start one.  It doesn't matter if this
>>           * fails, we'll just try again later.
>>           */
>> -        if (BgWriterPID == 0 && pmState == PM_RUN)
>> +        if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
>>              BgWriterPID = StartBackgroundWriter();
> 
> Likewise, we should try to start also the stats collector during recovery?

Hmm, there's not much point as this patch stands, but I guess we should 
in the hot standby patch, where we let backends in.

>> @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
>>          unlink(tmppath);
>>      }
>>
>> -    elog(DEBUG2, "done creating and filling new WAL file");
>> +    XLogFileName(tmppath, ThisTimeLineID, log, seg);
>> +    elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
> 
> This debug message is somewhat confusing, because the WAL file
> represented as "tmppath" might have been already created by
> previous XLogFileInit() via InstallXLogFileSegment().

I don't quite understand what you're saying, but I think I'll just 
revert that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I feel quite good about this patch now. Given the amount of code churn, it
>> requires testing, and I'll read it through one more time after sleeping over
>> it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment which subsequent recovery
cannot reach. Is this bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)

> FATAL:  WAL ends before end time of backup dump

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I feel quite good about this patch now. Given the amount of code churn, it
>> requires testing, and I'll read it through one more time after sleeping over
>> it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd. I apologize if I misread it..

Sorry for my random reply.

Though this is a matter of taste, I think that it's weird that bgwriter
runs with ThisTimeLineID = 0 during recovery. This is because
XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
cause bugs in the near future, though it is harmless currently.

> +    /*
> +     * XXX: Should we try to perform restartpoints if we're not in consistent
> +     * recovery? The bgwriter isn't doing it for us in that case.
> +     */

I think yes. This is helpful for systems where it takes a long time to get
a base backup, i.e. where it also takes a long time to reach a consistent
recovery point.

> +CreateRestartPoint(int flags)
<snip>
> +     * We rely on this lock to ensure that the startup process doesn't exit
> +     * Recovery while we are half way through a restartpoint. XXX ?
>       */
> +    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process
from exiting recovery because the startup process doesn't acquire that lock.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:
> Hi,
> 
> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >> I feel quite good about this patch now. Given the amount of code churn, it
> >> requires testing, and I'll read it through one more time after sleeping over
> >> it. Simon, do you see anything wrong with this?
> >
> > I also read this patch and found something odd. I apologize if I misread it..
> 
> Sorry for my random reply.
> 
> Though this is a matter of taste, I think that it's weird that bgwriter
> runs with ThisTimeLineID = 0 during recovery. This is because
> XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
> be a cause of bug in the near future, though this is harmless currently.

It doesn't. That's exactly what InitXLogAccess() was for.

> > +    /*
> > +     * XXX: Should we try to perform restartpoints if we're not in consistent
> > +     * recovery? The bgwriter isn't doing it for us in that case.
> > +     */
> 
> I think yes. This is helpful for the system which it takes a long time to get
> a base backup, ie. it also takes a long time to reach a consistent recovery
> point.

The original patch did this.

> > +CreateRestartPoint(int flags)
> <snip>
> > +     * We rely on this lock to ensure that the startup process doesn't exit
> > +     * Recovery while we are half way through a restartpoint. XXX ?
> >       */
> > +    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
> 
> Is this comment correct? CheckpointLock cannot prevent the startup process
> from exiting recovery because the startup process doesn't acquire that lock.

The original patch acquired CheckpointLock during exitRecovery to prove
that a restartpoint was not in progress. It no longer does this, so not
sure if Heikki has found another way and the comment is wrong, or that
removing the lock request is a bug.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
> Hi,
> 
> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >> I feel quite good about this patch now. Given the amount of code churn, it
> >> requires testing, and I'll read it through one more time after sleeping over
> >> it. Simon, do you see anything wrong with this?
> >
> > I also read this patch and found something odd. I apologize if I misread it..
> 
> If archive recovery fails after it reaches the last valid record
> in the last unfilled WAL segment, subsequent recovery might cause
> the following fatal error. This is because minSafeStartPoint indicates
> the end of the last unfilled WAL segment which subsequent recovery
> cannot reach. Is this bug? (I'm not sure how to fix this problem
> because I don't understand yet why minSafeStartPoint is required.)
> 
> > FATAL:  WAL ends before end time of backup dump

I think you're right. We need a couple of changes to avoid confusing
messages.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:
>> Though this is a matter of taste, I think that it's weird that bgwriter
>> runs with ThisTimeLineID = 0 during recovery. This is because
>> XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
>> be a cause of bug in the near future, though this is harmless currently.
> 
> It doesn't. That's exactly what InitXLogAccess() was for.

It does *during recovery*, before InitXLogAccess is called. Yeah, it's 
harmless currently. It would be pretty hard to keep it up-to-date in 
bgwriter and other processes. I think it's better to keep it at 0, which 
is clearly an invalid value, than try to keep it up-to-date and risk 
using an old value. TimeLineID is not used in a lot of places, 
currently. It might be a good idea to add some "Assert(TimeLineID != 0)" 
to places where it is used.

>>> +    /*
>>> +     * XXX: Should we try to perform restartpoints if we're not in consistent
>>> +     * recovery? The bgwriter isn't doing it for us in that case.
>>> +     */
>> I think yes. This is helpful for the system which it takes a long time to get
>> a base backup, ie. it also takes a long time to reach a consistent recovery
>> point.
> 
> The original patch did this.

Yeah, I took it out. ISTM if you restore from a base backup, you'd want 
to run it until consistent recovery anyway. We can put it back in if 
people feel it's helpful. There's two ways to do it: we can fire up the 
bgwriter before reaching consistent recovery point, or we can do the 
restartpoints in startup process before that point. I'm inclined to fire 
up the bgwriter earlier. That way, bgwriter remains responsible for all 
checkpoints and restartpoints, which seems clearer. I can't see any 
particular reason why bgwriter startup and reaching the consistent 
recovery point need to be linked together.

>>> +CreateRestartPoint(int flags)
>> <snip>
>>> +     * We rely on this lock to ensure that the startup process doesn't exit
>>> +     * Recovery while we are half way through a restartpoint. XXX ?
>>>       */
>>> +    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
>> Is this comment correct? CheckpointLock cannot prevent the startup process
>> from exiting recovery because the startup process doesn't acquire that lock.
> 
> The original patch acquired CheckpointLock during exitRecovery to prove
> that a restartpoint was not in progress. It no longer does this, so not
> sure if Heikki has found another way and the comment is wrong, or that
> removing the lock request is a bug.

The comment is obsolete. There's no reason for startup process to wait 
for a restartpoint to finish. Looking back at the archives and the 
history of that, I got the impression that in a very early version of 
the patch, startup process still created a shutdown checkpoint after 
recovery. To do that, it had to interrupt an ongoing restartpoint. That 
was deemed too dangerous/weird, so it was changed to wait for it to 
finish instead. But since the startup process no longer creates a 
shutdown checkpoint, the waiting became obsolete, right?

If there's a restartpoint in progress when we reach the end of recovery, 
startup process will RequestCheckpoint as usual. That will cause 
bgwriter to hurry the on-going restartpoint, skipping the usual delays, 
and start the checkpoint as soon as the restartpoint is finished.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 09:34 +0200, Heikki Linnakangas wrote:

> It does *during recovery*, before InitXLogAccess is called. Yeah, it's
> harmless currently. It would be pretty hard to keep it up-to-date in 
> bgwriter and other processes. I think it's better to keep it at 0,
> which is clearly an invalid value, than try to keep it up-to-date and
> risk using an old value. TimeLineID is not used in a lot of places, 
> currently. It might be a good idea to add some "Assert(TimeLineID !=
> 0)" to places where it is used.

Agreed. TimeLineID is a normal-running concept used for writing WAL.
Perhaps we should even solidify the meaning of TimeLineID == 0 as
"cannot write WAL".

I see a problem there for any process that exists both before and after
recovery ends, which includes bgwriter. In that case we must not flush
WAL before recovery ends, yet afterwards we *must* flush WAL. To do that
we *must* have a valid TimeLineID set.

I would suggest we put InitXLogAccess into IsRecoveryProcessingMode(),
so if the mode changes we immediately set everything we need to allow
XLogFlush() calls to work correctly.
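
A rough sketch of that suggestion, under stated assumptions: the struct and function names below are hypothetical stand-ins for XLogCtl, the process-local state, and IsRecoveryProcessingMode(). The recovery check itself picks up the timeline the first time it sees that recovery has ended, so a later flush always finds a valid (non-zero) value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical stand-in for the shared state kept in XLogCtl. */
typedef struct SharedXLogState
{
    bool       in_recovery;  /* cleared by the startup process at end of recovery */
    TimeLineID timeline;     /* only meaningful once recovery has ended */
} SharedXLogState;

/* Process-local copies; timeline 0 means "cannot write WAL". */
typedef struct LocalXLogState
{
    bool       in_recovery;
    TimeLineID timeline;
} LocalXLogState;

/*
 * Returns true while recovery is in progress. On the first call that
 * observes the end of recovery, the local timeline is initialized.
 */
bool
recovery_in_progress(LocalXLogState *local, const SharedXLogState *shared)
{
    if (!local->in_recovery)
        return false;           /* recovery never restarts within a process */
    if (shared->in_recovery)
        return true;

    /* Transition: recovery just ended, so pick up the timeline now. */
    local->in_recovery = false;
    local->timeline = shared->timeline;
    return false;
}

/* Placeholder for a WAL flush; it only enforces the timeline invariant. */
void
flush_wal(const LocalXLogState *local)
{
    assert(local->timeline != 0);
    /* actual flushing elided */
}
```

The assertion in flush_wal() mirrors the "Assert(TimeLineID != 0)" idea from earlier in the thread.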

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
>> Hi,
>>
>> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> I feel quite good about this patch now. Given the amount of code churn, it
>>>> requires testing, and I'll read it through one more time after sleeping over
>>>> it. Simon, do you see anything wrong with this?
>>> I also read this patch and found something odd. I apologize if I misread it..
>> If archive recovery fails after it reaches the last valid record
>> in the last unfilled WAL segment, subsequent recovery might cause
>> the following fatal error. This is because minSafeStartPoint indicates
>> the end of the last unfilled WAL segment which subsequent recovery
>> cannot reach. Is this bug? (I'm not sure how to fix this problem
>> because I don't understand yet why minSafeStartPoint is required.)
>>
>>> FATAL:  WAL ends before end time of backup dump
> 
> I think you're right. We need a couple of changes to avoid confusing
> messages.

Hmm, we could update minSafeStartPoint in XLogFlush instead. That was 
suggested when the idea of minSafeStartPoint was first thought of. 
Updating minSafeStartPoint is analogous to flushing WAL: 
minSafeStartPoint must be advanced to X before we can flush a data page 
with LSN X. To avoid excessive controlfile updates, whenever we update 
minSafeStartPoint, we can update it to the latest WAL record we've read.
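
The first option could look roughly like the sketch below. It is only an illustration: LSNs are simplified to a single 64-bit number, and the struct, its fields, and the function name are hypothetical. Before a data page with LSN X is written out, the stored point is advanced to at least X, and it jumps straight to the latest record read so that later pages rarely need another control file update.

```c
#include <stdint.h>

typedef uint64_t Lsn;   /* simplified: a WAL position as one 64-bit number */

typedef struct RecoveryState
{
    Lsn min_safe_start_point;  /* value stored in the control file */
    Lsn last_replayed;         /* end of the latest WAL record read so far */
    int control_file_writes;   /* counts simulated control file updates */
} RecoveryState;

/* Call before writing out a data page whose LSN is page_lsn. */
void
advance_min_safe_start_point(RecoveryState *st, Lsn page_lsn)
{
    if (page_lsn <= st->min_safe_start_point)
        return;                 /* already covered, no control file write */

    /* Jump ahead to the latest record read to save future updates. */
    st->min_safe_start_point =
        st->last_replayed > page_lsn ? st->last_replayed : page_lsn;
    st->control_file_writes++;
}
```

Because the stored value never exceeds WAL that has actually been read, it cannot end up beyond the end of valid WAL in a partially filled segment.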

Or we could simply ignore that error if we've reached minSafeStartPoint 
- 1 segment, assuming that even though minSafeStartPoint is higher, we 
can't have gone past the end of valid WAL records in the last segment in 
previous recovery either. But that feels more fragile.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 11:20 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
> >> Hi,
> >>
> >> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >>>> I feel quite good about this patch now. Given the amount of code churn, it
> >>>> requires testing, and I'll read it through one more time after sleeping over
> >>>> it. Simon, do you see anything wrong with this?
> >>> I also read this patch and found something odd. I apologize if I misread it..
> >> If archive recovery fails after it reaches the last valid record
> >> in the last unfilled WAL segment, subsequent recovery might cause
> >> the following fatal error. This is because minSafeStartPoint indicates
> >> the end of the last unfilled WAL segment which subsequent recovery
> >> cannot reach. Is this bug? (I'm not sure how to fix this problem
> >> because I don't understand yet why minSafeStartPoint is required.)
> >>
> >>> FATAL:  WAL ends before end time of backup dump
> > 
> > I think you're right. We need a couple of changes to avoid confusing
> > messages.
> 
> Hmm, we could update minSafeStartPoint in XLogFlush instead. That was 
> suggested when the idea of minSafeStartPoint was first thought of. 
> Updating minSafeStartPoint is analogous to flushing WAL: 
> minSafeStartPoint must be advanced to X before we can flush a data page 
> with LSN X. To avoid excessive controlfile updates, whenever we update 
> minSafeStartPoint, we can update it to the latest WAL record we've read.
> 
> Or we could simply ignore that error if we've reached minSafeStartPoint 
> - 1 segment, assuming that even though minSafeStartPoint is higher, we 
> can't have gone past the end of valid WAL records in the last segment in 
> previous recovery either. But that feels more fragile.

My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
that we are still recovering up to the point of the end of the base
backup. Once we reach minSafeStartPoint we then switch state to
DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
enables writing new minSafeStartPoints when we open new WAL files in the
future. 

We then have messages only when in DB_IN_ARCHIVE_RECOVERY_BASE state:

  if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint) &&
      ControlFile->state == DB_IN_ARCHIVE_RECOVERY_BASE)
  {
      if (reachedStopPoint) /* stopped because of stop request */
          ereport(FATAL,
                  (errmsg("requested recovery stop point is before end time of backup dump")));
      else /* ran off end of WAL */
          ereport(FATAL,
                  (errmsg("WAL ends before end time of backup dump")));
  }

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
> another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
> that we are still recovering up to the point of the end of the base
> backup. Once we reach minSafeStartPoint we then switch state to
> DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
> enables writing new minSafeStartPoints when we open new WAL files in the
> future. 

I don't see how that helps, the bug has nothing to do with base backups. It 
comes from the fact that we set minSafeStartPoint beyond the actual end 
of WAL, if the last WAL segment is only partially filled (= fails CRC 
check at some point). If we crash after setting minSafeStartPoint like 
that, and then restart recovery, we'll get the error.

The last WAL segment could be partially filled for example because the 
DBA has manually copied the last unarchived WAL segments to pg_xlog, as 
we recommend in the manual.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
It looks like if you issue a fast shutdown during recovery, postmaster 
doesn't kill bgwriter.

...
LOG:  restored log file "000000010000000000000028" from archive
LOG:  restored log file "000000010000000000000029" from archive
LOG:  consistent recovery state reached at 0/2900005C
...
LOG:  restored log file "00000001000000000000002F" from archive
LOG:  restored log file "000000010000000000000030" from archive
LOG:  restored log file "000000010000000000000031" from archive
LOG:  restored log file "000000010000000000000032" from archive
LOG:  restored log file "000000010000000000000033" from archive
LOG:  restartpoint starting: time
LOG:  restored log file "000000010000000000000034" from archive
LOG:  received fast shutdown request
LOG:  restored log file "000000010000000000000035" from archive
FATAL:  terminating connection due to administrator command
LOG:  startup process (PID 14137) exited with exit code 1
LOG:  aborting startup due to startup process failure
hlinnaka@heikkilaptop:~/pgsql$
hlinnaka@heikkilaptop:~/pgsql$ LOG:  restartpoint complete: wrote 3324 
buffers (5.1%); write=13.996 s, sync=2.016 s, total=16.960 s
LOG:  recovery restart point at 0/3000FA14

Seems that reaper() needs to be taught that bgwriter can be active 
during consistent recovery. I'll take a look at how to do that.


BTW, the message "terminating connection ..." is a bit misleading. It's 
referring to the startup process, which is hardly a connection. We have 
that in CVS HEAD too, so it's not something introduced by the patch, but 
seems worth changing in HS, since we then let real connections in while 
startup process is still running.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
> > another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
> > that we are still recovering up to the point of the end of the base
> > backup. Once we reach minSafeStartPoint we then switch state to
> > DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
> > enables writing new minSafeStartPoints when we open new WAL files in the
> > future. 
> 
> I don't see how that helps, the bug has nothing to do with base backups. 

Sorry, disagree.

> It 
> comes from the fact that we set minSafeStartPoint beyond the actual end 
> of WAL, if the last WAL segment is only partially filled (= fails CRC 
> check at some point). If we crash after setting minSafeStartPoint like 
> that, and then restart recovery, we'll get the error.

Look again please. My proposal would avoid the error when it is not
relevant, yet keep it when it is (while recovering base backups). 

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:
>> It 
>> comes from the fact that we set minSafeStartPoint beyond the actual end 
>> of WAL, if the last WAL segment is only partially filled (= fails CRC 
>> check at some point). If we crash after setting minSafeStartPoint like 
>> that, and then restart recovery, we'll get the error.
> 
> Look again please. My proposal would avoid the error when it is not
> relevant, yet keep it when it is (while recovering base backups). 

I fail to see what base backups have to do with this. The problem arises 
in this scenario:

0. A base backup is unzipped. recovery.conf is copied in place, and the 
remaining unarchived WAL segments are copied from the primary server to 
pg_xlog. The last WAL segment is only partially filled. Let's say that 
redo point is in WAL segment 1. The last, partial, WAL segment is 3, and 
WAL ends at 0/3500000
1. postmaster is started, recovery starts.
2. WAL segment 1 is restored from archive.
3. We reach consistent recovery point
4. We restore WAL segment 2 from archive. minSafeStartPoint is advanced 
to 0/3000000
5. WAL segment 2 is completely replayed, we move on to WAL segment 3. It 
is not in archive, but it's found in pg_xlog. minSafeStartPoint is 
advanced to 0/4000000. Note that that's beyond end of WAL.
6. At replay of WAL record 0/3200000, the recovery is interrupted. For 
example, by a fast shutdown request, or crash.

Now when we restart the recovery, we will never reach minSafeStartPoint, 
which is now 0/4000000, and we'll fail with the error that Fujii-san 
pointed out. We're already way past the min recovery point of base 
backup by then.
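
The arithmetic in the scenario above can be replayed with a small sketch. The function names are hypothetical; the segment size of 0x1000000 is chosen only because it matches the 0/3000000 and 0/4000000 boundaries used in the example, and LSNs are simplified to a single 64-bit number.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Segment size implied by the example's segment boundaries. */
#define SEG_SIZE UINT64_C(0x1000000)

/* End position of WAL segment number segno. */
Lsn
segment_end(uint32_t segno)
{
    return ((Lsn) segno + 1) * SEG_SIZE;
}

/*
 * Behavior described in steps 4 and 5: opening a segment advances
 * minSafeStartPoint to the end of that segment.
 */
Lsn
open_segment(Lsn min_safe_start_point, uint32_t segno)
{
    Lsn end = segment_end(segno);

    return end > min_safe_start_point ? end : min_safe_start_point;
}

/* End-of-recovery check: true when startup would fail with the FATAL error. */
bool
wal_ends_too_early(Lsn end_of_log, Lsn min_safe_start_point)
{
    return end_of_log < min_safe_start_point;
}
```

Opening segments 2 and 3 leaves the stored point at 0x4000000, while valid WAL ends at 0x3500000, so the check fires on every subsequent recovery attempt.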

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:

> Now when we restart the recovery, we will never reach
> minSafeStartPoint, which is now 0/4000000, and we'll fail with the
> error that Fujii-san pointed out. We're already way past the min
> recovery point of base backup by then.

The problem was that we reported this error

FATAL:  WAL ends before end time of backup dump

and this is inappropriate because, as you say, we are way past the min
recovery point of base backup.

If you look again at my proposal you will see that the proposal avoids
the above error by keeping track of whether we are past the point of
base backup or not. If we are still in base backup we get the error and
if we are past it we do not.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:
> 
>> Now when we restart the recovery, we will never reach
>> minSafeStartPoint, which is now 0/4000000, and we'll fail with the
>> error that Fujii-san pointed out. We're already way past the min
>> recovery point of base backup by then.
> 
> The problem was that we reported this error
> 
> FATAL:  WAL ends before end time of backup dump
> 
> and this is inappropriate because, as you say, we are way past the min
> recovery point of base backup.
> 
> If you look again at my proposal you will see that the proposal avoids
> the above error by keeping track of whether we are past the point of
> base backup or not. If we are still in base backup we get the error and
> if we are past it we do not.

Oh, we would simply ignore the fact that we haven't reached the 
minSafeStartPoint at the end of recovery, and start up anyway. Ok, that 
would avoid the problem Fujii-san described. It's like my suggestion of 
ignoring the message if we're at minSafeStartPoint - 1 segment, just 
more lenient. I don't understand why you'd need a new control file 
state, though.

You'd lose the extra protection minSafeStartPoint gives, though. For 
example, if you interrupt recovery and move recovery point backwards, we 
could refuse to start up when it's not safe to do so. It's currently a 
"don't do that!" case, but we could protect against that with 
minSafeStartPoint.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> It looks like if you issue a fast shutdown during recovery, postmaster 
> doesn't kill bgwriter.

Hmm, seems like we haven't thought through how shutdown during 
consistent recovery is supposed to behave in general. Right now, smart 
shutdown doesn't do anything during consistent recovery, because the 
startup process will just keep going. And fast shutdown will simply 
ExitPostmaster(1), which is clearly not right.

I'm thinking that in both smart and fast shutdown, the startup process 
should exit in a controlled way as soon as it's finished with the 
current WAL record, and set minSafeStartPoint to the current point in 
the replay.
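
A simplified sketch of that idea follows. Everything in it is hypothetical: the shutdown request is modeled as a flag that becomes set while a given record is being replayed (in a real server it would be set from a signal handler), and LSNs are single 64-bit numbers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

typedef struct ReplayState
{
    Lsn  replayed_upto;         /* end LSN of the last fully replayed record */
    Lsn  min_safe_start_point;  /* value that would go to the control file */
    bool shutdown_requested;    /* would be set from a signal handler */
} ReplayState;

/*
 * Replay records whose end LSNs are given in record_ends[]. A shutdown
 * request is simulated as arriving during record number shutdown_at
 * (pass nrecords for "no shutdown"). Returns the number of records replayed.
 */
size_t
replay_until_shutdown(ReplayState *st, const Lsn *record_ends,
                      size_t nrecords, size_t shutdown_at)
{
    size_t i;

    for (i = 0; i < nrecords; i++)
    {
        /* The request is only acted upon between records. */
        if (st->shutdown_requested)
            break;

        if (i == shutdown_at)
            st->shutdown_requested = true;   /* arrives mid-record */

        st->replayed_upto = record_ends[i];  /* finish the current record */
    }

    /* Controlled exit: remember exactly how far replay got. */
    if (st->shutdown_requested)
        st->min_safe_start_point = st->replayed_upto;

    return i;
}
```

The record during which the request arrives is still completed, and the stored point is the exact replay location rather than a segment boundary.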

I wonder if bgwriter should perform a restartpoint before exiting? 
You'll have to start with recovery on the next startup anyway, but at 
least we could minimize the amount of WAL that needs to be replayed.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:
>>
>>> Now when we restart the recovery, we will never reach
>>> minSafeStartPoint, which is now 0/4000000, and we'll fail with the
>>> error that Fujii-san pointed out. We're already way past the min
>>> recovery point of base backup by then.
>>
>> The problem was that we reported this error
>>
>> FATAL:  WAL ends before end time of backup dump
>>
>> and this is inappropriate because, as you say, we are way past the min
>> recovery point of base backup.
>>
>> If you look again at my proposal you will see that the proposal avoids
>> the above error by keeping track of whether we are past the point of
>> base backup or not. If we are still in base backup we get the error and
>> if we are past it we do not.
> 
> Oh, we would simply ignore the fact that we haven't reached the 
> minSafeStartPoint at the end of recovery, and start up anyway. Ok, that 
> would avoid the problem Fujii-san described. It's like my suggestion of 
> ignoring the message if we're at minSafeStartPoint - 1 segment, just 
> more lenient. I don't understand why you'd need a new control file 
> state, though.
> 
> You'd lose the extra protection minSafeStartPoint gives, though. For 
> example, if you interrupt recovery and move recovery point backwards, we 
> could refuse to start up when it's not safe to do so. It's currently a 
> "don't do that!" case, but we could protect against that with 
> minSafeStartPoint.

Hmm, another point of consideration is how this interacts with the 
pause/continue. In particular, it was suggested earlier that you could 
put an option into recovery.conf to start in paused mode. If you pause 
recovery, and then stop and restart the server, and have that option in 
recovery.conf, I would expect that when you enter consistent recovery 
you're at the exact same paused location as before stopping the server. 
The upshot of that is that we need to set minSafeStartPoint to that 
exact location, at least when you pause & stop in a controlled fashion.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
I just realized that the new minSafeStartPoint is actually exactly the 
same concept as the existing minRecoveryPoint. As the recovery 
progresses, we could advance minRecoveryPoint just as well as the new 
minSafeStartPoint.

Perhaps it's a good idea to keep them separate anyway though, the 
original minRecoveryPoint might be a useful debugging aid. Or what do 
you think?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 20:35 +0200, Heikki Linnakangas wrote:
> Hmm, another point of consideration is how this interacts with the 
> pause/continue. In particular, it was suggested earlier that you
> could 
> put an option into recovery.conf to start in paused mode. If you
> pause 
> recovery, and then stop and restart the server, and have that option
> in 
> recovery.conf, I would expect that when you enter consistent recovery 
> you're at the exact same paused location as before stopping the
> server. 
> The upshot of that is that we need to set minSafeStartPoint to that 
> exact location, at least when you pause & stop in a controlled
> fashion.

OK, makes sense.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > It looks like if you issue a fast shutdown during recovery, postmaster 
> > doesn't kill bgwriter.
> 
> Hmm, seems like we haven't thought through how shutdown during 
> consistent recovery is supposed to behave in general. Right now, smart 
> shutdown doesn't do anything during consistent recovery, because the 
> startup process will just keep going. And fast shutdown will simply 
> ExitPostmaster(1), which is clearly not right.

That whole area was something I was leaving until last, since immediate
shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
this before Christmas, briefly).

> I'm thinking that in both smart and fast shutdown, the startup process 
> should exit in a controlled way as soon as it's finished with the 
> current WAL record, and set minSafeStartPoint to the current point in 
> the replay.

That makes sense, though isn't required.

> I wonder if bgwriter should perform a restartpoint before exiting? 
> You'll have to start with recovery on the next startup anyway, but at 
> least we could minimize the amount of WAL that needs to be replayed.

That seems like extra work for no additional benefit.

I think we're beginning to blur the lines between review and you just
adding some additional stuff in this area. There's nothing to stop you
doing further changes after this has been committed. We can also commit
what we have with some caveats also, i.e. commit in pieces.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Fri, 2009-01-30 at 11:33 +0200, Heikki Linnakangas wrote:
> I just realized that the new minSafeStartPoint is actually exactly the 
> same concept as the existing minRecoveryPoint. As the recovery 
> progresses, we could advance minRecoveryPoint just as well as the new 
> minSafeStartPoint.
> 
> Perhaps it's a good idea to keep them separate anyway though, the 
> original minRecoveryPoint might be a useful debugging aid. Or what do 
> you think?

I think we've been confusing ourselves substantially. The patch already
has everything it needs, but there is a one-line-fixable bug where
Fujii-san says.

The code comments already explain how this works:

 * There are two points in the log that we must pass. The first
 * is minRecoveryPoint, which is the LSN at the time the
 * base backup was taken that we are about to rollforward from.
 * If recovery has ever crashed or was stopped there is also
 * another point also: minSafeStartPoint, which we know the
 * latest LSN that recovery could have reached prior to crash.

The later message
FATAL  WAL ends before end time of backup dump

was originally triggered if
if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))

and I changed that. Now I look at it again, I see that the original if
test, shown above, is correct and should not have been changed.
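
To make the rule in that comment concrete, here is a minimal sketch of the "have we passed both points" test. The names `Lsn`, `lsn_le` and `reached_consistency` are hypothetical stand-ins for XLogRecPtr, XLByteLE and the check in StartupXLOG; this is an illustration under those assumptions, not the code in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr: a two-part WAL position. */
typedef struct Lsn
{
    uint32_t    xlogid;     /* log file number */
    uint32_t    xrecoff;    /* byte offset within that log file */
} Lsn;

/* True if a <= b: compare the log file number first, then the offset. */
static bool
lsn_le(Lsn a, Lsn b)
{
    return a.xlogid < b.xlogid ||
        (a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * Replay has reached a consistent state once the end of the last replayed
 * record has passed both points: the one recorded for the base backup and
 * the latest position an earlier, interrupted recovery could have reached.
 */
static bool
reached_consistency(Lsn end_rec_ptr, Lsn min_recovery_point,
                    Lsn min_safe_start_point)
{
    return lsn_le(min_recovery_point, end_rec_ptr) &&
        lsn_le(min_safe_start_point, end_rec_ptr);
}
```

With a backup point of 0/100 and a safe start point of 1/50, a replay position of 0/200 has passed only the first point, so it would not yet count as consistent.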

Other than that, I don't see the need for further change. Heikki's
suggestions to write a new minSafeStartPoint are good ones and fit
within the existing mechanisms and meanings of these variables.
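
As a rough illustration of how minSafeStartPoint can be kept moving forward, the sketch below advances it to the start of the next segment whenever replay opens a segment, and never moves it backwards. `ControlData`, `advance_safe_start_point`, the flat 64-bit position and the 16 MB segment size are all assumptions made for the example, not definitions taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEG_SIZE (16u * 1024u * 1024u)  /* assumed 16 MB WAL segment size */

/* Hypothetical miniature of the control file: only the field we need. */
typedef struct ControlData
{
    uint64_t    min_safe_start_point;   /* flat byte position, for simplicity */
    int         writes;                 /* counts simulated control file updates */
} ControlData;

/*
 * Called when replay opens segment number seg_no. Anything in that segment
 * may reach disk while it is being applied, so the safe start point is moved
 * to the start of the following segment. Returns true if it was advanced.
 */
static bool
advance_safe_start_point(ControlData *ctl, uint64_t seg_no)
{
    uint64_t    next_seg_start = (seg_no + 1) * (uint64_t) SEG_SIZE;

    if (ctl->min_safe_start_point < next_seg_start)
    {
        ctl->min_safe_start_point = next_seg_start;
        ctl->writes++;          /* stands in for UpdateControlFile() */
        return true;
    }
    return false;               /* already at or past that point */
}
```

The comparison before the update keeps the control file from being rewritten when a segment is re-read, and keeps the value from ever moving backwards.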

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-01-29 at 14:21 +0200, Heikki Linnakangas wrote:
> It looks like if you issue a fast shutdown during recovery, postmaster 
> doesn't kill bgwriter.

Thanks for the report.

I'm thinking of adding a new function that will make crash testing easier.

pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
which when replayed will just throw a FATAL error and crash Startup
process. We won't be adding that to the user docs...

This will allow us to produce tests that crash the server at specific
places, rather than trying to trap those points manually.

> Seems that reaper() needs to be taught that bgwriter can be active 
> during consistent recovery. I'll take a look at how to do that.
> 
> 
> BTW, the message "terminating connection ..." is a bit misleading. It's 
> referring to the startup process, which is hardly a connection. We have 
> that in CVS HEAD too, so it's not something introduced by the patch, but 
> seems worth changing in HS, since we then let real connections in while 
> startup process is still running.
> 
-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> I'm thinking of adding a new function that will make crash testing easier.
> 
> pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
> which when replayed will just throw a FATAL error and crash Startup
> process. We won't be adding that to the user docs...
> 
> This will allow us to produce tests that crash the server at specific
> places, rather than trying to trap those points manually.

Heh, talk about a footgun ;-). I don't think including that in CVS is a 
good idea.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
>> Hmm, seems like we haven't thought through how shutdown during 
>> consistent recovery is supposed to behave in general. Right now, smart 
>> shutdown doesn't do anything during consistent recovery, because the 
>> startup process will just keep going. And fast shutdown will simply 
>> ExitPostmaster(1), which is clearly not right.
> 
> That whole area was something I was leaving until last, since immediate
> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> this before Christmas, briefly).

We must handle shutdown gracefully, can't just leave bgwriter running 
after postmaster exit.

Hmm, why does pg_standby catch SIGQUIT? Seems it could just let it kill 
the process.

>> I wonder if bgwriter should perform a restartpoint before exiting? 
>> You'll have to start with recovery on the next startup anyway, but at 
>> least we could minimize the amount of WAL that needs to be replayed.
> 
> That seems like extra work for no additional benefit.
> 
> I think we're beginning to blur the lines between review and you just
> adding some additional stuff in this area. There's nothing to stop you
> doing further changes after this has been committed.

Sure. I think the "shutdown restartpoint" might actually fall out of the 
way the code is structured anyway: bgwriter normally performs a 
checkpoint before exiting.

> We can also commit
> what we have with some caveats also, i.e. commit in pieces.

This late in the release cycle, I don't want to commit anything that we 
would have to rip out if we run out of time. There is no difference from 
review or testing point of view whether the code is in CVS or not.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Ok, here's an attempt to make shutdown work gracefully.

Startup process now signals postmaster three times during startup: first
when it has done all the initialization, and starts redo. At that point,
postmaster launches bgwriter, which starts to perform restartpoints when
it deems appropriate. The 2nd time signals when we've reached consistent
recovery state. As the patch stands, that's not significant, but it will
be with all the rest of the hot standby stuff. The 3rd signal is sent
when startup process has finished recovery. Postmaster used to wait for
the startup process to exit, and check the return code to determine
that, but now that we support shutdown, startup process also returns
with 0 exit code when it has been requested to terminate.
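
The three signals can be pictured as a small state machine on the postmaster side. The sketch below is only an illustration: the enum values, the `Postmaster` struct and `handle_recovery_signal` are hypothetical names chosen for the example, and the real postmaster tracks far more state than this.

```c
#include <stdbool.h>

/* Hypothetical names; the identifiers used in the patch may differ. */
typedef enum RecoverySignal
{
    SIG_RECOVERY_STARTED,       /* initialization done, redo is starting */
    SIG_RECOVERY_CONSISTENT,    /* consistent recovery state reached */
    SIG_RECOVERY_FINISHED       /* recovery has completed */
} RecoverySignal;

typedef enum PmState
{
    PM_STARTUP,
    PM_RECOVERY,
    PM_RECOVERY_CONSISTENT,
    PM_RUN
} PmState;

typedef struct Postmaster
{
    PmState     state;
    bool        bgwriter_running;
} Postmaster;

/* React to one signal from the startup process. */
static void
handle_recovery_signal(Postmaster *pm, RecoverySignal sig)
{
    switch (sig)
    {
        case SIG_RECOVERY_STARTED:
            if (pm->state == PM_STARTUP)
            {
                pm->state = PM_RECOVERY;
                pm->bgwriter_running = true;    /* can now do restartpoints */
            }
            break;
        case SIG_RECOVERY_CONSISTENT:
            /* Only meaningful once redo has started. */
            if (pm->state == PM_RECOVERY)
                pm->state = PM_RECOVERY_CONSISTENT;
            break;
        case SIG_RECOVERY_FINISHED:
            /* Signalled explicitly, because exit code 0 is now ambiguous. */
            pm->state = PM_RUN;
            pm->bgwriter_running = true;
            break;
    }
}
```

The explicit third signal is what lets the postmaster tell "recovery finished" apart from "startup process exited cleanly because it was asked to terminate".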

The startup process now catches SIGTERM, and calls proc_exit() at the
next WAL record. That's what will happen in a fast shutdown. Unexpected
death of the startup process is treated the same as a backend/auxiliary
process crash.
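
The SIGTERM handling follows the usual "set a flag in the handler, act on it in the main loop" pattern. Here is a minimal standalone sketch of that pattern; `replay_loop` and its parameters are hypothetical and exist only to simulate the behaviour, and where the real startup process would call proc_exit(0) the sketch simply leaves the loop.

```c
#include <signal.h>

/* Set by the signal handler, polled by the replay loop. */
static volatile sig_atomic_t shutdown_requested = 0;

static void
handle_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;     /* do nothing else in the handler */
}

/*
 * Simulated redo loop over n_records records. A SIGTERM is raised while
 * record number sigterm_during is being applied (pass -1 for no signal).
 * Returns how many records were applied before the loop stopped.
 */
static int
replay_loop(int n_records, int sigterm_during)
{
    int         applied = 0;
    int         i;

    shutdown_requested = 0;
    signal(SIGTERM, handle_sigterm);

    for (i = 0; i < n_records; i++)
    {
        /* Checked between records only; real code would proc_exit(0). */
        if (shutdown_requested)
            break;

        if (i == sigterm_during)
            raise(SIGTERM);     /* a fast shutdown arriving mid-record */

        applied++;              /* stands in for applying one WAL record */
    }
    return applied;
}
```

Because the flag is only examined at the top of the loop, the record in flight when the signal arrives is always applied in full before the process stops.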

InitXLOGAccess is now called in IsRecoveryProcessingMode() as you suggested.
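
The idea is a per-process cached flag that is re-read from shared memory only while it still says "in recovery", with the one-time initialization done on the first call that sees recovery has ended. A minimal sketch follows, with hypothetical names (`shared_in_recovery`, `local_in_recovery`, `init_wal_access`, `in_recovery`) standing in for the shared-memory flag, the local copy, InitXLOGAccess() and IsRecoveryProcessingMode().

```c
#include <stdbool.h>

/* Stand-in for the flag kept in shared memory. */
static volatile bool shared_in_recovery = true;

/* Per-process cache; true really means "unknown, check shared state". */
static bool local_in_recovery = true;

static int  init_calls = 0;     /* counts one-time initializations */

static void
init_wal_access(void)
{
    init_calls++;               /* stands in for InitXLOGAccess() */
}

static bool
in_recovery(void)
{
    /* Recovery cannot be re-entered, so a cached "false" is final. */
    if (!local_in_recovery)
        return false;

    local_in_recovery = shared_in_recovery;

    /* First time we notice that recovery has ended: initialize once. */
    if (!local_in_recovery)
        init_wal_access();

    return local_in_recovery;
}
```

After the shared flag flips to false, the first call pays for the initialization and every later call returns from the cached value without touching shared memory.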

Simon Riggs wrote:
> On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> It looks like if you issue a fast shutdown during recovery, postmaster
>>> doesn't kill bgwriter.
>> Hmm, seems like we haven't thought through how shutdown during
>> consistent recovery is supposed to behave in general. Right now, smart
>> shutdown doesn't do anything during consistent recovery, because the
>> startup process will just keep going. And fast shutdown will simply
>> ExitPostmaster(1), which is clearly not right.
>
> That whole area was something I was leaving until last, since immediate
> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> this before Christmas, briefly).
>
>> I'm thinking that in both smart and fast shutdown, the startup process
>> should exit in a controlled way as soon as it's finished with the
>> current WAL record, and set minSafeStartPoint to the current point in
>> the replay.
>
> That makes sense, though isn't required.
>
>> I wonder if bgwriter should perform a restartpoint before exiting?
>> You'll have to start with recovery on the next startup anyway, but at
>> least we could minimize the amount of WAL that needs to be replayed.
>
> That seems like extra work for no additional benefit.
>
> I think we're beginning to blur the lines between review and you just
> adding some additional stuff in this area. There's nothing to stop you
> doing further changes after this has been committed. We can also commit
> what we have with some caveats also, i.e. commit in pieces.
>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bd6035d..50be1d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "catalog/pg_control.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
@@ -47,6 +48,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/flatfiles.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 #include "pg_trace.h"
@@ -119,12 +121,26 @@ CheckpointStatsData CheckpointStats;
  */
 TimeLineID    ThisTimeLineID = 0;

-/* Are we doing recovery from XLOG? */
+/*
+ * Are we doing recovery from XLOG?
+ *
+ * This is only ever true in the startup process, when it's replaying WAL.
+ * It's used in functions that need to act differently when called from a
+ * redo function (e.g skip WAL logging).  To check whether the system is in
+ * recovery regardless of what process you're running in, use
+ * IsRecoveryProcessingMode().
+ */
 bool        InRecovery = false;

 /* Are we recovering using offline XLOG archives? */
 static bool InArchiveRecovery = false;

+/*
+ * Local copy of shared RecoveryProcessingMode variable. True actually
+ * means "not known, need to check the shared state"
+ */
+static bool LocalRecoveryProcessingMode = true;
+
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;

@@ -133,7 +149,6 @@ static char *recoveryRestoreCommand = NULL;
 static bool recoveryTarget = false;
 static bool recoveryTargetExact = false;
 static bool recoveryTargetInclusive = true;
-static bool recoveryLogRestartpoints = false;
 static TransactionId recoveryTargetXid;
 static TimestampTz recoveryTargetTime;
 static TimestampTz recoveryLastXTime = 0;
@@ -313,6 +328,22 @@ typedef struct XLogCtlData
     int            XLogCacheBlck;    /* highest allocated xlog buffer index */
     TimeLineID    ThisTimeLineID;

+    /*
+     * SharedRecoveryProcessingMode indicates if we're still in crash or
+     * archive recovery. It's checked by IsRecoveryProcessingMode()
+     */
+    bool        SharedRecoveryProcessingMode;
+
+    /*
+     * During recovery, we keep a copy of the latest checkpoint record
+     * here. It's used by the background writer when it wants to create
+     * a restartpoint.
+     *
+     * is info_lck spinlock a bit too light-weight to protect this?
+     */
+    XLogRecPtr    lastCheckPointRecPtr;
+    CheckPoint    lastCheckPoint;
+
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;

@@ -390,6 +421,11 @@ static TimeLineID lastPageTLI = 0;

 static bool InRedo = false;

+/*
+ * Flag set by interrupt handlers for later service in the redo loop.
+ */
+static volatile sig_atomic_t shutdown_requested = false;
+

 static void XLogArchiveNotify(const char *xlog);
 static void XLogArchiveNotifySeg(uint32 log, uint32 seg);
@@ -399,6 +435,7 @@ static void XLogArchiveCleanup(const char *xlog);
 static void readRecoveryCommandFile(void);
 static void exitArchiveRecovery(TimeLineID endTLI,
                     uint32 endLogId, uint32 endLogSeg);
+static void exitRecovery(void);
 static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);

@@ -483,6 +520,11 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
     bool        updrqst;
     bool        doPageWrites;
     bool        isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+    bool        isRecoveryEnd = (rmid == RM_XLOG_ID && info == XLOG_RECOVERY_END);
+
+    /* cross-check on whether we should be here or not */
+    if (IsRecoveryProcessingMode() && !isRecoveryEnd)
+        elog(FATAL, "cannot make new WAL entries during recovery");

     /* info's high bits are reserved for use by me */
     if (info & XLR_INFO_MASK)
@@ -1730,7 +1772,7 @@ XLogFlush(XLogRecPtr record)
     XLogwrtRqst WriteRqst;

     /* Disabled during REDO */
-    if (InRedo)
+    if (IsRecoveryProcessingMode())
         return;

     /* Quick exit if already known flushed */
@@ -1818,9 +1860,9 @@ XLogFlush(XLogRecPtr record)
      * the bad page is encountered again during recovery then we would be
      * unable to restart the database at all!  (This scenario has actually
      * happened in the field several times with 7.1 releases. Note that we
-     * cannot get here while InRedo is true, but if the bad page is brought in
-     * and marked dirty during recovery then CreateCheckPoint will try to
-     * flush it at the end of recovery.)
+     * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
+     * brought in and marked dirty during recovery then if a checkpoint were
+     * performed at the end of recovery it will try to flush it.
      *
      * The current approach is to ERROR under normal conditions, but only
      * WARNING during recovery, so that the system can be brought up even if
@@ -1830,7 +1872,7 @@ XLogFlush(XLogRecPtr record)
      * and so we will not force a restart for a bad LSN on a data page.
      */
     if (XLByteLT(LogwrtResult.Flush, record))
-        elog(InRecovery ? WARNING : ERROR,
+        elog(ERROR,
         "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
              record.xlogid, record.xrecoff,
              LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
@@ -2409,6 +2451,33 @@ XLogFileRead(uint32 log, uint32 seg, int emode)
                      xlogfname);
             set_ps_display(activitymsg, false);

+            /*
+             * Calculate and write out a new safeStartPoint. This defines
+             * the latest LSN that might appear on-disk while we apply
+             * the WAL records in this file. If we crash during recovery
+             * we must reach this point again before we can prove
+             * database consistency. Not a restartpoint! Restart points
+             * define where we should start recovery from, if we crash.
+             */
+            if (InArchiveRecovery)
+            {
+                XLogRecPtr    nextSegRecPtr;
+                uint32        nextLog = log;
+                uint32        nextSeg = seg;
+
+                NextLogSeg(nextLog, nextSeg);
+                nextSegRecPtr.xlogid = nextLog;
+                nextSegRecPtr.xrecoff = nextSeg * XLogSegSize;
+
+                LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+                if (XLByteLT(ControlFile->minSafeStartPoint, nextSegRecPtr))
+                {
+                    ControlFile->minSafeStartPoint = nextSegRecPtr;
+                    UpdateControlFile();
+                }
+                LWLockRelease(ControlFileLock);
+            }
+
             return fd;
         }
         if (errno != ENOENT)    /* unexpected failure? */
@@ -2677,11 +2746,22 @@ RestoreArchivedFile(char *path, const char *xlogfname,
      * those it's a good bet we should have gotten it too.  Aborting on other
      * signals such as SIGTERM seems a good idea as well.
      *
+     * However, if we were requested to terminate, we don't really care what
+     * happened to the restore command, so we just exit cleanly. In fact,
+     * the restore command most likely received the SIGTERM too, and we don't
+     * want to complain about that.
+     *
      * Per the Single Unix Spec, shells report exit status > 128 when a called
      * command died on a signal.  Also, 126 and 127 are used to report
      * problems such as an unfindable command; treat those as fatal errors
      * too.
      */
+    if (shutdown_requested && InRedo)
+    {
+        /* XXX: We should update minSafeStartPoint to the exact value here */
+        proc_exit(0);
+    }
+
     signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

     ereport(signaled ? FATAL : DEBUG2,
@@ -4587,18 +4667,6 @@ readRecoveryCommandFile(void)
             ereport(LOG,
                     (errmsg("recovery_target_inclusive = %s", tok2)));
         }
-        else if (strcmp(tok1, "log_restartpoints") == 0)
-        {
-            /*
-             * does nothing if a recovery_target is not also set
-             */
-            if (!parse_bool(tok2, &recoveryLogRestartpoints))
-                  ereport(ERROR,
-                            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                      errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
-            ereport(LOG,
-                    (errmsg("log_restartpoints = %s", tok2)));
-        }
         else
             ereport(FATAL,
                     (errmsg("unrecognized recovery parameter \"%s\"",
@@ -4734,7 +4802,10 @@ exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)

     /*
      * Rename the config file out of the way, so that we don't accidentally
-     * re-enter archive recovery mode in a subsequent crash.
+     * re-enter archive recovery mode in a subsequent crash. We have already
+     * restored all the WAL segments we need from the archive, and we trust
+     * that they are not going to go away even if we crash. (XXX: should
+     * we fsync() them all to ensure that?)
      */
     unlink(RECOVERY_COMMAND_DONE);
     if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
@@ -4876,6 +4947,8 @@ StartupXLOG(void)
     CheckPoint    checkPoint;
     bool        wasShutdown;
     bool        reachedStopPoint = false;
+    bool        reachedSafeStartPoint = false;
+    bool        performedRecovery = false;
     bool        haveBackupLabel = false;
     XLogRecPtr    RecPtr,
                 LastRec,
@@ -4888,6 +4961,8 @@ StartupXLOG(void)
     uint32        freespace;
     TransactionId oldestActiveXID;

+    XLogCtl->SharedRecoveryProcessingMode = true;
+
     /*
      * Read control file and check XLOG status looks valid.
      *
@@ -5108,9 +5183,15 @@ StartupXLOG(void)
         if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
             ControlFile->minRecoveryPoint = minRecoveryLoc;
         ControlFile->time = (pg_time_t) time(NULL);
+        /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();

         /*
+         * Reset pgstat data, because it may be invalid after recovery.
+         */
+        pgstat_reset_all();
+
+        /*
          * If there was a backup label file, it's done its job and the info
          * has now been propagated into pg_control.  We must get rid of the
          * label file so that if we crash during recovery, we'll pick up at
@@ -5155,6 +5236,7 @@ StartupXLOG(void)
             bool        recoveryContinue = true;
             bool        recoveryApply = true;
             ErrorContextCallback errcontext;
+            XLogRecPtr    minSafeStartPoint;

             InRedo = true;
             ereport(LOG,
@@ -5162,6 +5244,16 @@ StartupXLOG(void)
                             ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));

             /*
+             * Take a local copy of minSafeStartPoint at the beginning of
+             * recovery, because it's updated as we go.
+             */
+            minSafeStartPoint = ControlFile->minSafeStartPoint;
+
+            /* Let postmaster know we've started redo now */
+            if (InArchiveRecovery && IsUnderPostmaster)
+                SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+
+            /*
              * main redo apply loop
              */
             do
@@ -5186,6 +5278,46 @@ StartupXLOG(void)
 #endif

                 /*
+                 * Process any requests or signals received recently.
+                 */
+                if (shutdown_requested)
+                {
+                    /*
+                     * We were requested to exit without finishing recovery.
+                     *
+                     * XXX: We should update minSafeStartPoint to the exact
+                     * value here.
+                     */
+                    proc_exit(0);
+                }
+
+                /*
+                 * Have we reached our safe starting point? If so, we can
+                 * signal postmaster to enter consistent recovery mode.
+                 *
+                 * There are two points in the log we must pass. The first is
+                 * the minRecoveryPoint, which is the LSN at the time the
+                 * base backup was taken that we are about to rollforward from.
+                 * If recovery has ever crashed or was stopped there is
+                 * another point also: minSafeStartPoint, which is the
+                 * latest LSN that recovery could have reached prior to crash.
+                 */
+                if (!reachedSafeStartPoint &&
+                     XLByteLE(minSafeStartPoint, EndRecPtr) &&
+                     XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+                {
+                    reachedSafeStartPoint = true;
+                    if (InArchiveRecovery)
+                    {
+                        ereport(LOG,
+                            (errmsg("consistent recovery state reached at %X/%X",
+                                EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+                        if (IsUnderPostmaster)
+                            SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
+                    }
+                }
+
+                /*
                  * Have we reached our recovery target?
                  */
                 if (recoveryStopsHere(record, &recoveryApply))
@@ -5238,6 +5370,7 @@ StartupXLOG(void)
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
+            reachedSafeStartPoint = true;
         }
     }

@@ -5253,7 +5386,7 @@ StartupXLOG(void)
      * Complain if we did not roll forward far enough to render the backup
      * dump consistent.
      */
-    if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
+    if (InRecovery && !reachedSafeStartPoint)
     {
         if (reachedStopPoint)    /* stopped because of stop request */
             ereport(FATAL,
@@ -5375,38 +5508,16 @@ StartupXLOG(void)
         XLogCheckInvalidPages();

         /*
-         * Reset pgstat data, because it may be invalid after recovery.
+         * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
+         * a shutdown checkpoint here, but we ask bgwriter to do that now.
          */
-        pgstat_reset_all();
+        exitRecovery();

-        /*
-         * Perform a checkpoint to update all our recovery activity to disk.
-         *
-         * Note that we write a shutdown checkpoint rather than an on-line
-         * one. This is not particularly critical, but since we may be
-         * assigning a new TLI, using a shutdown checkpoint allows us to have
-         * the rule that TLI only changes in shutdown checkpoints, which
-         * allows some extra error checking in xlog_redo.
-         */
-        CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+        performedRecovery = true;
     }

-    /*
-     * Preallocate additional log files, if wanted.
-     */
-    PreallocXlogFiles(EndOfLog);
-
-    /*
-     * Okay, we're officially UP.
-     */
-    InRecovery = false;
-
-    ControlFile->state = DB_IN_PRODUCTION;
-    ControlFile->time = (pg_time_t) time(NULL);
-    UpdateControlFile();
-
     /* start the archive_timeout timer running */
-    XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
+    XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);

     /* initialize shared-memory copy of latest checkpoint XID/epoch */
     XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
@@ -5441,6 +5552,74 @@ StartupXLOG(void)
         readRecordBuf = NULL;
         readRecordBufSize = 0;
     }
+
+    /*
+     * If we had to replay any WAL records, request a checkpoint. This isn't
+     * strictly necessary: if we crash now, the recovery will simply restart
+     * from the same point as this time (or from the last restartpoint). The
+     * control file is left in DB_IN_*_RECOVERY state; the first checkpoint
+     * will change that to DB_IN_PRODUCTION.
+     */
+    if (performedRecovery)
+    {
+        /*
+         * Okay, we can come up now. Allow others to write WAL.
+         */
+        XLogCtl->SharedRecoveryProcessingMode = false;
+
+        RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE |
+                          CHECKPOINT_STARTUP);
+    }
+    else
+    {
+        /*
+         * No recovery, so let's just get on with it.
+         */
+        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+        ControlFile->state = DB_IN_PRODUCTION;
+        ControlFile->time = (pg_time_t) time(NULL);
+        UpdateControlFile();
+        LWLockRelease(ControlFileLock);
+
+        /*
+         * Okay, we're officially UP.
+         */
+        XLogCtl->SharedRecoveryProcessingMode = false;
+    }
+}
+
+/*
+ * Is the system still in recovery?
+ *
+ * As a side-effect, we initialize the local TimeLineID and RedoRecPtr
+ * variables the first time we see that recovery is finished.
+ */
+bool
+IsRecoveryProcessingMode(void)
+{
+    /*
+     * We check shared state each time only until we leave recovery mode.
+     * We can't re-enter recovery, so we rely on the local state variable
+     * after that.
+     */
+    if (!LocalRecoveryProcessingMode)
+        return false;
+    else
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile XLogCtlData *xlogctl = XLogCtl;
+
+        LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+
+        /*
+         * Initialize TimeLineID and RedoRecPtr the first time we see that
+         * recovery is finished.
+         */
+        if (!LocalRecoveryProcessingMode)
+            InitXLOGAccess();
+
+        return LocalRecoveryProcessingMode;
+    }
 }

 /*
@@ -5572,6 +5751,8 @@ InitXLOGAccess(void)
 {
     /* ThisTimeLineID doesn't change so we need no lock to copy it */
     ThisTimeLineID = XLogCtl->ThisTimeLineID;
+    Assert(ThisTimeLineID != 0);
+
     /* Use GetRedoRecPtr to copy the RedoRecPtr safely */
     (void) GetRedoRecPtr();
 }
@@ -5683,7 +5864,10 @@ ShutdownXLOG(int code, Datum arg)
     ereport(LOG,
             (errmsg("shutting down")));

-    CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+    if (IsRecoveryProcessingMode())
+        CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+    else
+        CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
     ShutdownCLOG();
     ShutdownSUBTRANS();
     ShutdownMultiXact();
@@ -5696,10 +5880,22 @@ ShutdownXLOG(int code, Datum arg)
  * Log start of a checkpoint.
  */
 static void
-LogCheckpointStart(int flags)
+LogCheckpointStart(int flags, bool restartpoint)
 {
-    elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
+    char *msg;
+
+    /*
+     * XXX: This is hopelessly untranslatable. We could call gettext_noop
+     * for the main message, but what about all the flags?
+     */
+    if (restartpoint)
+        msg = "restartpoint starting:%s%s%s%s%s%s%s";
+    else
+        msg = "checkpoint starting:%s%s%s%s%s%s%s";
+
+    elog(LOG, msg,
          (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+         (flags & CHECKPOINT_STARTUP) ? " startup" : "",
          (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
          (flags & CHECKPOINT_FORCE) ? " force" : "",
          (flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -5711,7 +5907,7 @@ LogCheckpointStart(int flags)
  * Log end of a checkpoint.
  */
 static void
-LogCheckpointEnd(void)
+LogCheckpointEnd(bool restartpoint)
 {
     long        write_secs,
                 sync_secs,
@@ -5734,17 +5930,26 @@ LogCheckpointEnd(void)
                         CheckpointStats.ckpt_sync_end_t,
                         &sync_secs, &sync_usecs);

-    elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
-         "%d transaction log file(s) added, %d removed, %d recycled; "
-         "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
-         CheckpointStats.ckpt_bufs_written,
-         (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
-         CheckpointStats.ckpt_segs_added,
-         CheckpointStats.ckpt_segs_removed,
-         CheckpointStats.ckpt_segs_recycled,
-         write_secs, write_usecs / 1000,
-         sync_secs, sync_usecs / 1000,
-         total_secs, total_usecs / 1000);
+    if (restartpoint)
+        elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
+             "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+             CheckpointStats.ckpt_bufs_written,
+             (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
+             write_secs, write_usecs / 1000,
+             sync_secs, sync_usecs / 1000,
+             total_secs, total_usecs / 1000);
+    else
+        elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
+             "%d transaction log file(s) added, %d removed, %d recycled; "
+             "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+             CheckpointStats.ckpt_bufs_written,
+             (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
+             CheckpointStats.ckpt_segs_added,
+             CheckpointStats.ckpt_segs_removed,
+             CheckpointStats.ckpt_segs_recycled,
+             write_secs, write_usecs / 1000,
+             sync_secs, sync_usecs / 1000,
+             total_secs, total_usecs / 1000);
 }

 /*
@@ -5775,6 +5980,10 @@ CreateCheckPoint(int flags)
     TransactionId *inCommitXids;
     int            nInCommit;

+    /* shouldn't happen */
+    if (IsRecoveryProcessingMode())
+        elog(ERROR, "can't create a checkpoint during recovery");
+
     /*
      * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
      * (This is just pro forma, since in the present system structure there is
@@ -5800,9 +6009,11 @@ CreateCheckPoint(int flags)

     if (shutdown)
     {
+        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
         ControlFile->state = DB_SHUTDOWNING;
         ControlFile->time = (pg_time_t) time(NULL);
         UpdateControlFile();
+        LWLockRelease(ControlFileLock);
     }

     /*
@@ -5906,7 +6117,7 @@ CreateCheckPoint(int flags)
      * to log anything if we decided to skip the checkpoint.
      */
     if (log_checkpoints)
-        LogCheckpointStart(flags);
+        LogCheckpointStart(flags, false);

     TRACE_POSTGRESQL_CHECKPOINT_START(flags);

@@ -6010,11 +6221,14 @@ CreateCheckPoint(int flags)
     XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);

     /*
-     * Update the control file.
+     * Update the control file. This also sets state to DB_IN_PRODUCTION
+     * if this is the first checkpoint after recovery.
      */
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
     if (shutdown)
         ControlFile->state = DB_SHUTDOWNED;
+    else
+        ControlFile->state = DB_IN_PRODUCTION;
     ControlFile->prevCheckPoint = ControlFile->checkPoint;
     ControlFile->checkPoint = ProcLastRecPtr;
     ControlFile->checkPointCopy = checkPoint;
@@ -6068,12 +6282,11 @@ CreateCheckPoint(int flags)
      * in subtrans.c).    During recovery, though, we mustn't do this because
      * StartupSUBTRANS hasn't been called yet.
      */
-    if (!InRecovery)
-        TruncateSUBTRANS(GetOldestXmin(true, false));
+    TruncateSUBTRANS(GetOldestXmin(true, false));

     /* All real work is done, but log before releasing lock. */
     if (log_checkpoints)
-        LogCheckpointEnd();
+        LogCheckpointEnd(false);

         TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                 NBuffers, CheckpointStats.ckpt_segs_added,
@@ -6101,32 +6314,17 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 }

 /*
- * Set a recovery restart point if appropriate
- *
- * This is similar to CreateCheckPoint, but is used during WAL recovery
- * to establish a point from which recovery can roll forward without
- * replaying the entire recovery log.  This function is called each time
- * a checkpoint record is read from XLOG; it must determine whether a
- * restartpoint is needed or not.
+ * This is used during WAL recovery to establish a point from which recovery
+ * can roll forward without replaying the entire recovery log.  This function
+ * is called each time a checkpoint record is read from XLOG. It is stored
+ * in shared memory, so that it can be used as a restartpoint later on.
  */
 static void
 RecoveryRestartPoint(const CheckPoint *checkPoint)
 {
-    int            elapsed_secs;
     int            rmid;
-
-    /*
-     * Do nothing if the elapsed time since the last restartpoint is less than
-     * half of checkpoint_timeout.    (We use a value less than
-     * checkpoint_timeout so that variations in the timing of checkpoints on
-     * the master, or speed of transmission of WAL segments to a slave, won't
-     * make the slave skip a restartpoint once it's synced with the master.)
-     * Checking true elapsed time keeps us from doing restartpoints too often
-     * while rapidly scanning large amounts of WAL.
-     */
-    elapsed_secs = (pg_time_t) time(NULL) - ControlFile->time;
-    if (elapsed_secs < CheckPointTimeout / 2)
-        return;
+    /* use volatile pointer to prevent code rearrangement */
+    volatile XLogCtlData *xlogctl = XLogCtl;

     /*
      * Is it safe to checkpoint?  We must ask each of the resource managers
@@ -6148,28 +6346,111 @@ RecoveryRestartPoint(const CheckPoint *checkPoint)
     }

     /*
-     * OK, force data out to disk
+     * Copy the checkpoint record to shared memory, so that bgwriter can
+     * use it the next time it wants to perform a restartpoint.
      */
-    CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);
+    SpinLockAcquire(&xlogctl->info_lck);
+    XLogCtl->lastCheckPointRecPtr = ReadRecPtr;
+    memcpy(&XLogCtl->lastCheckPoint, checkPoint, sizeof(CheckPoint));
+    SpinLockRelease(&xlogctl->info_lck);
+}
+
+/*
+ * This is similar to CreateCheckPoint, but is used during WAL recovery
+ * to establish a point from which recovery can roll forward without
+ * replaying the entire recovery log.
+ */
+void
+CreateRestartPoint(int flags)
+{
+    XLogRecPtr lastCheckPointRecPtr;
+    CheckPoint lastCheckPoint;
+    /* use volatile pointer to prevent code rearrangement */
+    volatile XLogCtlData *xlogctl = XLogCtl;
+
+    /*
+     * Acquire CheckpointLock to ensure only one restartpoint happens at a
+     * time. (This is just pro forma, since in the present system structure
+     * there is only one process that is allowed to issue checkpoints or
+     * restart points at any given time.)
+     */
+    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+
+    /* Get a local copy of the last checkpoint record. */
+    SpinLockAcquire(&xlogctl->info_lck);
+    lastCheckPointRecPtr = xlogctl->lastCheckPointRecPtr;
+    memcpy(&lastCheckPoint, &XLogCtl->lastCheckPoint, sizeof(CheckPoint));
+    SpinLockRelease(&xlogctl->info_lck);

     /*
-     * Update pg_control so that any subsequent crash will restart from this
-     * checkpoint.    Note: ReadRecPtr gives the XLOG address of the checkpoint
-     * record itself.
+     * If the last checkpoint record we've replayed is already our last
+     * restartpoint, we're done.
      */
+    if (XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
+    {
+        ereport(DEBUG2,
+                (errmsg("skipping restartpoint, already performed at %X/%X",
+                        lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
+        LWLockRelease(CheckpointLock);
+        return;
+    }
+
+    /*
+     * Check that we're still in recovery mode. It's ok if we exit recovery
+     * mode after this check, the restart point is valid anyway.
+     */
+    if (!IsRecoveryProcessingMode())
+    {
+        ereport(DEBUG2,
+                (errmsg("skipping restartpoint, recovery has already ended")));
+        LWLockRelease(CheckpointLock);
+        return;
+    }
+
+    if (log_checkpoints)
+    {
+        /*
+         * Prepare to accumulate statistics.
+         */
+        MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+        CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+
+        LogCheckpointStart(flags, true);
+    }
+
+    CheckPointGuts(lastCheckPoint.redo, flags);
+
+    /*
+     * Update pg_control, using current time
+     */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
     ControlFile->prevCheckPoint = ControlFile->checkPoint;
-    ControlFile->checkPoint = ReadRecPtr;
-    ControlFile->checkPointCopy = *checkPoint;
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
     ControlFile->time = (pg_time_t) time(NULL);
     UpdateControlFile();
+    LWLockRelease(ControlFileLock);
+
+    /*
+     * Currently, there is no need to truncate pg_subtrans during recovery.
+     * If we did do that, we will need to have called StartupSUBTRANS()
+     * already and then TruncateSUBTRANS() would go here.
+     */

-    ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
+    /* All real work is done, but log before releasing lock. */
+    if (log_checkpoints)
+        LogCheckpointEnd(true);
+
+    ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
+                    lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
+
     if (recoveryLastXTime)
-        ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
-                (errmsg("last completed transaction was at log time %s",
-                        timestamptz_to_str(recoveryLastXTime))));
+        ereport((log_checkpoints ? LOG : DEBUG2),
+            (errmsg("last completed transaction was at log time %s",
+                    timestamptz_to_str(recoveryLastXTime))));
+
+    LWLockRelease(CheckpointLock);
 }

 /*
@@ -6234,7 +6515,43 @@ RequestXLogSwitch(void)
 }

 /*
+ * exitRecovery()
+ *
+ * Exit recovery state and write a XLOG_RECOVERY_END record. This is the
+ * only record type that can record a change of timelineID. We assume
+ * caller has already set ThisTimeLineID, if appropriate.
+ */
+static void
+exitRecovery(void)
+{
+    XLogRecData rdata;
+
+    rdata.buffer = InvalidBuffer;
+    rdata.data = (char *) (&ThisTimeLineID);
+    rdata.len = sizeof(TimeLineID);
+    rdata.next = NULL;
+
+    /*
+     * This is the only type of WAL message that can be inserted during
+     * recovery. This ensures that we don't allow others to get access
+     * until after we have changed state.
+     */
+    (void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata);
+
+    /*
+     * We don't XLogFlush() here otherwise we'll end up zeroing the WAL
+     * file ourselves. So just let bgwriter's forthcoming checkpoint do
+     * that for us.
+     */
+
+    InRecovery = false;
+}
+
+/*
  * XLOG resource manager's routines
+ *
+ * Definitions of message info are in include/catalog/pg_control.h,
+ * though not all messages relate to control file processing.
  */
 void
 xlog_redo(XLogRecPtr lsn, XLogRecord *record)
@@ -6272,21 +6589,38 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
         ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

         /*
-         * TLI may change in a shutdown checkpoint, but it shouldn't decrease
+         * TLI no longer changes at shutdown checkpoint, since as of 8.4,
+         * shutdown checkpoints only occur at shutdown. Much less confusing.
          */
-        if (checkPoint.ThisTimeLineID != ThisTimeLineID)
+
+        RecoveryRestartPoint(&checkPoint);
+    }
+    else if (info == XLOG_RECOVERY_END)
+    {
+        TimeLineID    tli;
+
+        memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID));
+
+        /*
+         * TLI may change when recovery ends, but it shouldn't decrease.
+         *
+         * This is the only WAL record that can tell us to change timelineID
+         * while we process WAL records.
+         *
+         * We can *choose* to stop recovery at any point, generating a
+         * new timelineID which is recorded using this record type.
+         */
+        if (tli != ThisTimeLineID)
         {
-            if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
+            if (tli < ThisTimeLineID ||
                 !list_member_int(expectedTLIs,
-                                 (int) checkPoint.ThisTimeLineID))
+                                 (int) tli))
                 ereport(PANIC,
-                        (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
-                                checkPoint.ThisTimeLineID, ThisTimeLineID)));
+                        (errmsg("unexpected timeline ID %u (after %u) at recovery end record",
+                                tli, ThisTimeLineID)));
             /* Following WAL records should be run with new TLI */
-            ThisTimeLineID = checkPoint.ThisTimeLineID;
+            ThisTimeLineID = tli;
         }
-
-        RecoveryRestartPoint(&checkPoint);
     }
     else if (info == XLOG_CHECKPOINT_ONLINE)
     {
@@ -6309,7 +6643,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
         ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
         ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

-        /* TLI should not change in an on-line checkpoint */
+        /* TLI must not change at a checkpoint */
         if (checkPoint.ThisTimeLineID != ThisTimeLineID)
             ereport(PANIC,
                     (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
@@ -7224,3 +7558,89 @@ CancelBackup(void)
     }
 }

+/* ------------------------------------------------------
+ *  Startup Process main entry point and signal handlers
+ * ------------------------------------------------------
+ */
+
+/*
+ * startupproc_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+startupproc_quickdie(SIGNAL_ARGS)
+{
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * DO NOT proc_exit() -- we're here because shared memory may be
+     * corrupted, so we don't want to try to clean up our transaction. Just
+     * nail the windows shut and get out of town.
+     *
+     * Note we do exit(2) not exit(0).    This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.
+     */
+    exit(2);
+}
+
+
+/* SIGTERM: set flag to abort redo and exit */
+static void
+StartupProcShutdownHandler(SIGNAL_ARGS)
+{
+    shutdown_requested = true;
+}
+
+/* Main entry point for startup process */
+void
+StartupProcessMain(void)
+{
+    /*
+     * If possible, make this process a group leader, so that the postmaster
+     * can signal any child processes too.
+     */
+#ifdef HAVE_SETSID
+    if (setsid() < 0)
+        elog(FATAL, "setsid() failed: %m");
+#endif
+
+    /*
+     * Properly accept or ignore signals the postmaster might send us
+     */
+    pqsignal(SIGHUP, SIG_IGN);    /* ignore config file updates */
+    pqsignal(SIGINT, SIG_IGN);        /* ignore query cancel */
+    pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
+    pqsignal(SIGQUIT, startupproc_quickdie);        /* hard crash time */
+    pqsignal(SIGALRM, SIG_IGN);
+    pqsignal(SIGPIPE, SIG_IGN);
+    pqsignal(SIGUSR1, SIG_IGN);
+    pqsignal(SIGUSR2, SIG_IGN);
+
+    /*
+     * Reset some signals that are accepted by postmaster but not here
+     */
+    pqsignal(SIGCHLD, SIG_DFL);
+    pqsignal(SIGTTIN, SIG_DFL);
+    pqsignal(SIGTTOU, SIG_DFL);
+    pqsignal(SIGCONT, SIG_DFL);
+    pqsignal(SIGWINCH, SIG_DFL);
+
+    /*
+     * Unblock signals (they were blocked when the postmaster forked us)
+     */
+    PG_SETMASK(&UnBlockSig);
+
+    StartupXLOG();
+
+    BuildFlatFiles(false);
+
+    /* Let postmaster know that startup is finished */
+    SendPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED);
+
+    /* exit normally */
+    proc_exit(0);
+}
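
For readers skimming the xlog.c hunks above, here is a minimal sketch of the handoff they implement: the startup process records the last replayed checkpoint in shared memory, and the bgwriter later decides on its own whether that checkpoint should become a restartpoint. All names and types below (Lsn, SharedState, ControlState, record_checkpoint, try_restartpoint) are hypothetical stand-ins, not the patch's real structures, and the spinlock, CheckpointLock and the actual flushing are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr: a plain 64-bit WAL position. */
typedef uint64_t Lsn;

/* Stand-in for the fields the patch adds to shared memory. */
typedef struct
{
    Lsn last_checkpoint_recptr; /* where the checkpoint record itself lives */
    Lsn last_checkpoint_redo;   /* redo pointer stored in that record */
} SharedState;

/* Stand-in for the relevant pg_control fields. */
typedef struct
{
    Lsn prev_checkpoint;
    Lsn checkpoint;
    Lsn checkpoint_redo;
} ControlState;

/* Startup process side: remember the checkpoint record just replayed. */
void
record_checkpoint(SharedState *shared, Lsn recptr, Lsn redo)
{
    assert(shared != NULL);
    /* The real code takes a spinlock around these two stores. */
    shared->last_checkpoint_recptr = recptr;
    shared->last_checkpoint_redo = redo;
}

/*
 * bgwriter side: turn the remembered checkpoint into a restartpoint.
 * Returns true if a restartpoint was performed, false if it was skipped.
 */
bool
try_restartpoint(const SharedState *shared, ControlState *control,
                 bool in_recovery)
{
    Lsn recptr;
    Lsn redo;

    assert(shared != NULL && control != NULL);

    /* Take a local copy (under the spinlock in the real code). */
    recptr = shared->last_checkpoint_recptr;
    redo = shared->last_checkpoint_redo;

    /* Already restartpointed at (or past) this checkpoint: nothing to do. */
    if (redo <= control->checkpoint_redo)
        return false;

    /* Recovery has ended: a normal checkpoint will follow instead. */
    if (!in_recovery)
        return false;

    /* Dirty buffers would be flushed here. */

    control->prev_checkpoint = control->checkpoint;
    control->checkpoint = recptr;
    control->checkpoint_redo = redo;
    return true;
}
```

Because try_restartpoint() only looks at shared state, it can be called on a timer. A checkpoint record that arrives shortly before a long idle period still becomes a restartpoint once the bgwriter gets around to it.
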
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 431a95f..13d5bcb 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -37,7 +37,6 @@
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
-#include "utils/flatfiles.h"
 #include "utils/fmgroids.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -416,14 +415,12 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */

         case StartupProcess:
-            bootstrap_signals();
-            StartupXLOG();
-            BuildFlatFiles(false);
-            proc_exit(0);        /* startup done */
+            /* don't set signals, startup process has its own agenda */
+            StartupProcessMain();
+            proc_exit(1);        /* should never return */

         case BgWriterProcess:
             /* don't set signals, bgwriter has its own agenda */
-            InitXLOGAccess();
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */

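
As an aside on "startup process has its own agenda": the pattern StartupProcessMain() relies on is a SIGTERM handler that only sets a flag, which the redo loop then polls. A minimal sketch of that pattern in standard C follows. The function names and the replay loop are made up for illustration, and where the real redo loop checks the flag is an assumption here, not something this part of the patch shows.

```c
#include <assert.h>
#include <signal.h>

/* Set from the signal handler, polled by the (hypothetical) redo loop. */
static volatile sig_atomic_t shutdown_requested = 0;

/* SIGTERM: only set a flag; the real work happens outside the handler. */
static void
shutdown_handler(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Install the handlers; returns 0 on success, -1 on failure. */
int
install_startup_signals(void)
{
    if (signal(SIGTERM, shutdown_handler) == SIG_ERR)
        return -1;
    if (signal(SIGINT, SIG_IGN) == SIG_ERR)     /* ignore query cancel */
        return -1;
    return 0;
}

/*
 * Pretend to replay nrecords WAL records.  To keep the sketch
 * self-contained, SIGTERM is raised from inside the loop when the record
 * index equals sigterm_at (pass -1 to never raise it).  Returns the number
 * of records applied before the shutdown flag was seen.
 */
int
replay_records(int nrecords, int sigterm_at)
{
    int applied = 0;
    int i;

    assert(nrecords >= 0);

    for (i = 0; i < nrecords; i++)
    {
        if (i == sigterm_at)
            raise(SIGTERM);     /* simulates the postmaster's signal */
        if (shutdown_requested)
            break;              /* abort redo */
        applied++;
    }
    return applied;
}
```
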
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 6a0cd4e..4c8c54c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -49,6 +49,7 @@
 #include <unistd.h>

 #include "access/xlog_internal.h"
+#include "catalog/pg_control.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -197,6 +198,9 @@ BackgroundWriterMain(void)
 {
     sigjmp_buf    local_sigjmp_buf;
     MemoryContext bgwriter_context;
+    bool        BgWriterRecoveryMode = true;
+    /* use volatile pointer to prevent code rearrangement */
+    volatile BgWriterShmemStruct *bgs = BgWriterShmem;

     BgWriterShmem->bgwriter_pid = MyProcPid;
     am_bg_writer = true;
@@ -356,6 +360,20 @@ BackgroundWriterMain(void)
     PG_SETMASK(&UnBlockSig);

     /*
+     * If someone requested a checkpoint before we started up, process that.
+     *
+     * This check exists primarily for crash recovery: after the startup
+     * process is finished with WAL replay, it will request a checkpoint, but
+     * the background writer might not have started yet. This check will
+     * actually not notice a checkpoint that's been requested without any
+     * flags, but it's good enough for the startup checkpoint.
+     */
+    SpinLockAcquire(&bgs->ckpt_lck);
+    if (bgs->ckpt_flags)
+        checkpoint_requested = true;
+    SpinLockRelease(&bgs->ckpt_lck);
+
+    /*
      * Loop forever
      */
     for (;;)
@@ -397,6 +415,7 @@ BackgroundWriterMain(void)
             ExitOnAnyError = true;
             /* Close down the database */
             ShutdownXLOG(0, 0);
+
             /* Normal exit from the bgwriter is here */
             proc_exit(0);        /* done */
         }
@@ -418,14 +437,25 @@ BackgroundWriterMain(void)
         }

         /*
+         * Check if we've exited recovery. We do this after determining
+         * whether to perform a checkpoint or not, to be sure that we
+         * perform a real checkpoint and not a restartpoint, if someone
+         * (like the startup process!) requested a checkpoint immediately
+         * after exiting recovery. And we must have the right TimeLineID
+         * when we perform a checkpoint.
+         */
+        if (BgWriterRecoveryMode && !IsRecoveryProcessingMode())
+        {
+            elog(DEBUG1, "bgwriter changing from recovery to normal mode");
+            BgWriterRecoveryMode = false;
+        }
+
+        /*
          * Do a checkpoint if requested, otherwise do one cycle of
          * dirty-buffer writing.
          */
         if (do_checkpoint)
         {
-            /* use volatile pointer to prevent code rearrangement */
-            volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
             /*
              * Atomically fetch the request flags to figure out what kind of a
              * checkpoint we should perform, and increase the started-counter
@@ -444,7 +474,8 @@ BackgroundWriterMain(void)
              * implementation will not generate warnings caused by
              * CheckPointTimeout < CheckPointWarning.
              */
-            if ((flags & CHECKPOINT_CAUSE_XLOG) &&
+            if (!BgWriterRecoveryMode &&
+                (flags & CHECKPOINT_CAUSE_XLOG) &&
                 elapsed_secs < CheckPointWarning)
                 ereport(LOG,
                         (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
@@ -455,14 +486,18 @@ BackgroundWriterMain(void)
              * Initialize bgwriter-private variables used during checkpoint.
              */
             ckpt_active = true;
-            ckpt_start_recptr = GetInsertRecPtr();
+            if (!BgWriterRecoveryMode)
+                ckpt_start_recptr = GetInsertRecPtr();
             ckpt_start_time = now;
             ckpt_cached_elapsed = 0;

             /*
              * Do the checkpoint.
              */
-            CreateCheckPoint(flags);
+            if (!BgWriterRecoveryMode)
+                CreateCheckPoint(flags);
+            else
+                CreateRestartPoint(flags);

             /*
              * After any checkpoint, close all smgr files.    This is so we
@@ -507,7 +542,7 @@ CheckArchiveTimeout(void)
     pg_time_t    now;
     pg_time_t    last_time;

-    if (XLogArchiveTimeout <= 0)
+    if (XLogArchiveTimeout <= 0 || IsRecoveryProcessingMode())
         return;

     now = (pg_time_t) time(NULL);
@@ -586,7 +621,8 @@ BgWriterNap(void)
         (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
             break;
         pg_usleep(1000000L);
-        AbsorbFsyncRequests();
+        if (!IsRecoveryProcessingMode())
+            AbsorbFsyncRequests();
         udelay -= 1000000L;
     }

@@ -714,16 +750,19 @@ IsCheckpointOnSchedule(double progress)
      * However, it's good enough for our purposes, we're only calculating an
      * estimate anyway.
      */
-    recptr = GetInsertRecPtr();
-    elapsed_xlogs =
-        (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-         ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-        CheckPointSegments;
-
-    if (progress < elapsed_xlogs)
+    if (!IsRecoveryProcessingMode())
     {
-        ckpt_cached_elapsed = elapsed_xlogs;
-        return false;
+        recptr = GetInsertRecPtr();
+        elapsed_xlogs =
+            (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+             ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+            CheckPointSegments;
+
+        if (progress < elapsed_xlogs)
+        {
+            ckpt_cached_elapsed = elapsed_xlogs;
+            return false;
+        }
     }

     /*
@@ -850,6 +889,7 @@ BgWriterShmemInit(void)
  *
  * flags is a bitwise OR of the following:
  *    CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *    CHECKPOINT_STARTUP: checkpoint is for database startup.
  *    CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *        ignoring checkpoint_completion_target parameter.
  *    CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
@@ -916,6 +956,18 @@ RequestCheckpoint(int flags)
     {
         if (BgWriterShmem->bgwriter_pid == 0)
         {
+            /*
+             * The only difference between a startup checkpoint and a normal
+             * online checkpoint is that it's quite normal for the bgwriter
+             * to not be up yet when the startup checkpoint is requested.
+             * (it might be, though). That's ok, background writer will
+             * perform the checkpoint as soon as it starts up.
+             */
+            if (flags & CHECKPOINT_STARTUP)
+            {
+                Assert(!(flags & CHECKPOINT_WAIT));
+                break;
+            }
             if (ntries >= 20)        /* max wait 2.0 sec */
             {
                 elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
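
The bgwriter.c changes above boil down to two small decisions: on startup, notice a checkpoint that was requested before the bgwriter existed, and on each cycle, pick a restartpoint or a real checkpoint depending on a latched recovery-mode flag. The sketch below models just that logic. Every identifier and flag value in it is hypothetical, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, not the real CHECKPOINT_* bits. */
#define SKETCH_CKPT_STARTUP     0x01
#define SKETCH_CKPT_IMMEDIATE   0x02

typedef struct
{
    int ckpt_flags;             /* pending request flags in "shared memory" */
} SketchShmem;

typedef struct
{
    bool recovery_mode;         /* latched: starts true, cleared only once */
    bool checkpoint_requested;
} SketchBgWriter;

typedef enum
{
    ACTION_NONE,
    ACTION_RESTARTPOINT,
    ACTION_CHECKPOINT
} SketchAction;

/* Requester side: OR the flags into shared memory. */
void
sketch_request(SketchShmem *shm, int flags)
{
    assert(shm != NULL);
    shm->ckpt_flags |= flags;
}

/* bgwriter startup: pick up a request made before we were running. */
void
sketch_bgwriter_start(SketchBgWriter *bgw, const SketchShmem *shm)
{
    assert(bgw != NULL && shm != NULL);
    bgw->recovery_mode = true;
    /* A request made with no flags at all is invisible to this test. */
    bgw->checkpoint_requested = (shm->ckpt_flags != 0);
}

/* One pass through the main loop. */
SketchAction
sketch_bgwriter_cycle(SketchBgWriter *bgw, SketchShmem *shm, bool in_recovery)
{
    bool do_checkpoint;

    assert(bgw != NULL && shm != NULL);

    do_checkpoint = bgw->checkpoint_requested;
    bgw->checkpoint_requested = false;

    /* Leave recovery mode before dispatching, so that a checkpoint
     * requested right after recovery ends is a real checkpoint. */
    if (bgw->recovery_mode && !in_recovery)
        bgw->recovery_mode = false;

    if (!do_checkpoint)
        return ACTION_NONE;

    shm->ckpt_flags = 0;        /* request consumed */
    return bgw->recovery_mode ? ACTION_RESTARTPOINT : ACTION_CHECKPOINT;
}
```

The flag-less request case is the limitation the patch comment mentions. It does not matter for the startup checkpoint, which always carries at least one flag.
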
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3380b80..15fc7ad 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -226,10 +226,36 @@ static int    Shutdown = NoShutdown;

 static bool FatalError = false; /* T if recovering from backend crash */

+/* State of WAL redo */
+#define            NoRecovery            0
+#define            RecoveryStarted        1
+#define            RecoveryConsistent    2
+#define            RecoveryCompleted    3
+
+static int    RecoveryStatus = NoRecovery;
+
 /*
  * We use a simple state machine to control startup, shutdown, and
  * crash recovery (which is rather like shutdown followed by startup).
  *
+ * After doing all the postmaster initialization work, we enter PM_STARTUP
+ * state and the startup process is launched. The startup process begins by
+ * reading the control file and other preliminary initialization steps. When
+ * it's ready to start WAL redo, it signals postmaster, and we switch to
+ * PM_RECOVERY phase. The background writer is launched, while the startup
+ * process continues applying WAL.
+ *
+ * After reaching a consistent point in WAL redo, startup process signals
+ * us again, and we switch to PM_RECOVERY_CONSISTENT phase. There's currently
+ * no difference between PM_RECOVERY and PM_RECOVERY_CONSISTENT, but we
+ * could start accepting connections to perform read-only queries at this
+ * point, if we had the infrastructure to do that.
+ *
+ * When the WAL redo is finished, the startup process signals us the third
+ * time, and we switch to PM_RUN state. The startup process can also skip the
+ * recovery and consistent recovery phases altogether, as it will during
+ * normal startup when there's no recovery to be done, for example.
+ *
  * Normal child backends can only be launched when we are in PM_RUN state.
  * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
  * In other states we handle connection requests by launching "dead_end"
@@ -254,6 +280,8 @@ typedef enum
 {
     PM_INIT,                    /* postmaster starting */
     PM_STARTUP,                    /* waiting for startup subprocess */
+    PM_RECOVERY,                /* in recovery mode */
+    PM_RECOVERY_CONSISTENT,        /* consistent recovery mode */
     PM_RUN,                        /* normal "database is alive" state */
     PM_WAIT_BACKUP,                /* waiting for online backup mode to end */
     PM_WAIT_BACKENDS,            /* waiting for live backends to exit */
@@ -307,6 +335,7 @@ static void pmdie(SIGNAL_ARGS);
 static void reaper(SIGNAL_ARGS);
 static void sigusr1_handler(SIGNAL_ARGS);
 static void dummy_handler(SIGNAL_ARGS);
+static void CheckRecoverySignals(void);
 static void CleanupBackend(int pid, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
 static void LogChildExit(int lev, const char *procname,
@@ -1302,7 +1331,9 @@ ServerLoop(void)
          * state that prevents it, start one.  It doesn't matter if this
          * fails, we'll just try again later.
          */
-        if (BgWriterPID == 0 && pmState == PM_RUN)
+        if (BgWriterPID == 0 &&
+            (pmState == PM_RUN || pmState == PM_RECOVERY ||
+             pmState == PM_RECOVERY_CONSISTENT))
             BgWriterPID = StartBackgroundWriter();

         /*
@@ -1982,7 +2013,7 @@ pmdie(SIGNAL_ARGS)
             ereport(LOG,
                     (errmsg("received smart shutdown request")));

-            if (pmState == PM_RUN)
+            if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
             {
                 /* autovacuum workers are told to shut down immediately */
                 SignalAutovacWorkers(SIGTERM);
@@ -2019,7 +2050,14 @@ pmdie(SIGNAL_ARGS)

             if (StartupPID != 0)
                 signal_child(StartupPID, SIGTERM);
-            if (pmState == PM_RUN || pmState == PM_WAIT_BACKUP)
+            if (pmState == PM_RECOVERY)
+            {
+                /* only bgwriter is active in this state */
+                pmState = PM_WAIT_BACKENDS;
+            }
+            if (pmState == PM_RUN ||
+                pmState == PM_WAIT_BACKUP ||
+                pmState == PM_RECOVERY_CONSISTENT)
             {
                 ereport(LOG,
                         (errmsg("aborting any active transactions")));
@@ -2116,10 +2154,22 @@ reaper(SIGNAL_ARGS)
         if (pid == StartupPID)
         {
             StartupPID = 0;
-            Assert(pmState == PM_STARTUP);

-            /* FATAL exit of startup is treated as catastrophic */
-            if (!EXIT_STATUS_0(exitstatus))
+            /*
+             * Check if we've received a signal from the startup process
+             * first. This can change pmState. If the startup process sends
+             * a signal, and exits immediately after that, we might not have
+             * processed the signal yet, and we need to know if it completed
+             * recovery before exiting.
+             */
+            CheckRecoverySignals();
+
+            /*
+             * Unexpected exit of startup process (including FATAL exit)
+             * during PM_STARTUP is treated as catastrophic. There are no
+             * other processes running yet.
+             */
+            if (pmState == PM_STARTUP)
             {
                 LogChildExit(LOG, _("startup process"),
                              pid, exitstatus);
@@ -2127,60 +2177,27 @@ reaper(SIGNAL_ARGS)
                 (errmsg("aborting startup due to startup process failure")));
                 ExitPostmaster(1);
             }
-
             /*
-             * Startup succeeded - we are done with system startup or
-             * recovery.
+             * Any unexpected exit (including FATAL exit) of the startup
+             * process is treated as a crash.
              */
-            FatalError = false;
-
-            /*
-             * Go to shutdown mode if a shutdown request was pending.
-             */
-            if (Shutdown > NoShutdown)
+            if (!EXIT_STATUS_0(exitstatus))
             {
-                pmState = PM_WAIT_BACKENDS;
-                /* PostmasterStateMachine logic does the rest */
+                HandleChildCrash(pid, exitstatus,
+                                 _("startup process"));
                 continue;
             }
-
-            /*
-             * Otherwise, commence normal operations.
-             */
-            pmState = PM_RUN;
-
-            /*
-             * Load the flat authorization file into postmaster's cache. The
-             * startup process has recomputed this from the database contents,
-             * so we wait till it finishes before loading it.
-             */
-            load_role();
-
             /*
-             * Crank up the background writer.    It doesn't matter if this
-             * fails, we'll just try again later.
-             */
-            Assert(BgWriterPID == 0);
-            BgWriterPID = StartBackgroundWriter();
-
-            /*
-             * Likewise, start other special children as needed.  In a restart
-             * situation, some of them may be alive already.
+             * Startup process exited normally, but didn't finish recovery.
+             * This can happen if someone other than postmaster kills the
+             * startup process with SIGTERM. Treat it like a crash.
              */
-            if (WalWriterPID == 0)
-                WalWriterPID = StartWalWriter();
-            if (AutoVacuumingActive() && AutoVacPID == 0)
-                AutoVacPID = StartAutoVacLauncher();
-            if (XLogArchivingActive() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
-
-            /* at this point we are really open for business */
-            ereport(LOG,
-                 (errmsg("database system is ready to accept connections")));
-
-            continue;
+            if (pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
+            {
+                HandleChildCrash(pid, exitstatus,
+                                 _("startup process"));
+                continue;
+            }
         }

         /*
@@ -2443,6 +2460,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         }
     }

+    /* Take care of the startup process too */
+    if (pid == StartupPID)
+        StartupPID = 0;
+    else if (StartupPID != 0 && !FatalError)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) StartupPID)));
+        signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /* Take care of the bgwriter too */
     if (pid == BgWriterPID)
         BgWriterPID = 0;
@@ -2514,7 +2543,9 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)

     FatalError = true;
     /* We now transit into a state of waiting for children to die */
-    if (pmState == PM_RUN ||
+    if (pmState == PM_RECOVERY ||
+        pmState == PM_RECOVERY_CONSISTENT ||
+        pmState == PM_RUN ||
         pmState == PM_WAIT_BACKUP ||
         pmState == PM_SHUTDOWN)
         pmState = PM_WAIT_BACKENDS;
@@ -2582,6 +2613,128 @@ LogChildExit(int lev, const char *procname, int pid, int exitstatus)
 static void
 PostmasterStateMachine(void)
 {
+    /* Startup states */
+
+    if (pmState == PM_STARTUP && RecoveryStatus > NoRecovery)
+    {
+        /* Recovery has started */
+
+        /*
+         * Go to shutdown mode if a shutdown request was pending.
+         */
+        if (Shutdown > NoShutdown)
+        {
+            pmState = PM_WAIT_BACKENDS;
+            /* PostmasterStateMachine logic does the rest */
+        }
+        else
+        {
+            /*
+             * Crank up the background writer.    It doesn't matter if this
+             * fails, we'll just try again later.
+             */
+            Assert(BgWriterPID == 0);
+            BgWriterPID = StartBackgroundWriter();
+
+            pmState = PM_RECOVERY;
+        }
+    }
+    if (pmState == PM_RECOVERY && RecoveryStatus >= RecoveryConsistent)
+    {
+        /*
+         * Go to shutdown mode if a shutdown request was pending.
+         */
+        if (Shutdown > NoShutdown)
+        {
+            pmState = PM_WAIT_BACKENDS;
+            /* PostmasterStateMachine logic does the rest */
+        }
+        else
+        {
+            /*
+             * Startup process has entered recovery. We consider that good
+             * enough to reset FatalError.
+             */
+            pmState = PM_RECOVERY_CONSISTENT;
+            FatalError = false;
+
+            /*
+             * Load the flat authorization file into postmaster's cache. The
+             * startup process won't have recomputed this from the database yet,
+             * so it may change following recovery.
+             */
+            load_role();
+
+            /*
+             * Likewise, start other special children as needed.
+             */
+            Assert(PgStatPID == 0);
+            PgStatPID = pgstat_start();
+
+            /* XXX at this point we could accept read-only connections */
+            ereport(DEBUG1,
+                 (errmsg("database system is in consistent recovery mode")));
+        }
+    }
+    if ((pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT || pmState == PM_STARTUP) && RecoveryStatus == RecoveryCompleted)
+    {
+        /*
+         * Startup succeeded - we are done with system startup or
+         * recovery.
+         */
+        FatalError = false;
+
+        /*
+         * Go to shutdown mode if a shutdown request was pending.
+         */
+        if (Shutdown > NoShutdown)
+        {
+            pmState = PM_WAIT_BACKENDS;
+            /* PostmasterStateMachine logic does the rest */
+        }
+        else
+        {
+            /*
+             * Otherwise, commence normal operations.
+             */
+            pmState = PM_RUN;
+
+            /*
+             * Load the flat authorization file into postmaster's cache. The
+             * startup process has recomputed this from the database contents,
+             * so we wait till it finishes before loading it.
+             */
+            load_role();
+
+            /*
+             * Crank up the background writer, if we didn't do that already
+             * when we entered consistent recovery phase.  It doesn't matter
+             * if this fails, we'll just try again later.
+             */
+            if (BgWriterPID == 0)
+                BgWriterPID = StartBackgroundWriter();
+
+            /*
+             * Likewise, start other special children as needed.  In a restart
+             * situation, some of them may be alive already.
+             */
+            if (WalWriterPID == 0)
+                WalWriterPID = StartWalWriter();
+            if (AutoVacuumingActive() && AutoVacPID == 0)
+                AutoVacPID = StartAutoVacLauncher();
+            if (XLogArchivingActive() && PgArchPID == 0)
+                PgArchPID = pgarch_start();
+            if (PgStatPID == 0)
+                PgStatPID = pgstat_start();
+
+            /* at this point we are really open for business */
+            ereport(LOG,
+                (errmsg("database system is ready to accept connections")));
+        }
+    }
+
+    /* Shutdown states */
+
     if (pmState == PM_WAIT_BACKUP)
     {
         /*
@@ -2734,6 +2887,8 @@ PostmasterStateMachine(void)
         shmem_exit(1);
         reset_shared(PostPortNumber);

+        RecoveryStatus = NoRecovery;
+
         StartupPID = StartupDataBase();
         Assert(StartupPID != 0);
         pmState = PM_STARTUP;
@@ -3838,6 +3993,37 @@ ExitPostmaster(int status)
 }

 /*
+ * common code used in sigusr1_handler() and reaper() to handle
+ * recovery-related signals from startup process
+ */
+static void
+CheckRecoverySignals(void)
+{
+    bool changed = false;
+
+    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED))
+    {
+        Assert(pmState == PM_STARTUP);
+
+        RecoveryStatus = RecoveryStarted;
+        changed = true;
+    }
+    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT))
+    {
+        RecoveryStatus = RecoveryConsistent;
+        changed = true;
+    }
+    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED))
+    {
+        RecoveryStatus = RecoveryCompleted;
+        changed = true;
+    }
+
+    if (changed)
+        PostmasterStateMachine();
+}
+
+/*
  * sigusr1_handler - handle signal conditions from child processes
  */
 static void
@@ -3847,6 +4033,8 @@ sigusr1_handler(SIGNAL_ARGS)

     PG_SETMASK(&BlockSig);

+    CheckRecoverySignals();
+
     if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
     {
         /*
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 62b22bd..a7b81e3 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -268,3 +268,12 @@ out (and anyone else who flushes buffer contents to disk must do so too).
 This ensures that the page image transferred to disk is reasonably consistent.
 We might miss a hint-bit update or two but that isn't a problem, for the same
 reasons mentioned under buffer access rules.
+
+As of 8.4, background writer starts during recovery mode when there is
+some form of potentially extended recovery to perform. It performs an
+identical service to normal processing, except that checkpoints it
+writes are technically restartpoints. Flushing outstanding WAL for dirty
+buffers is also skipped, though there shouldn't ever be new WAL entries
+at that time in any case. We could choose to start background writer
+immediately but we hold off until we can prove the database is in a
+consistent state so that postmaster has a single, clean state change.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index cf98323..b359395 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -324,7 +324,7 @@ InitCommunication(void)
  * If you're wondering why this is separate from InitPostgres at all:
  * the critical distinction is that this stuff has to happen before we can
  * run XLOG-related initialization, which is done before InitPostgres --- in
- * fact, for cases such as checkpoint creation processes, InitPostgres may
+ * fact, for cases such as the background writer process, InitPostgres may
  * never be done at all.
  */
 void
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 4ea849d..3bba50a 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -197,6 +197,9 @@ main(int argc, char *argv[])
     printf(_("Minimum recovery ending location:     %X/%X\n"),
            ControlFile.minRecoveryPoint.xlogid,
            ControlFile.minRecoveryPoint.xrecoff);
+    printf(_("Minimum safe starting location:       %X/%X\n"),
+           ControlFile.minSafeStartPoint.xlogid,
+           ControlFile.minSafeStartPoint.xrecoff);
     printf(_("Maximum data alignment:               %u\n"),
            ControlFile.maxAlign);
     /* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 51cdde1..b20d4bd 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -603,6 +603,8 @@ RewriteControlFile(void)
     ControlFile.prevCheckPoint.xrecoff = 0;
     ControlFile.minRecoveryPoint.xlogid = 0;
     ControlFile.minRecoveryPoint.xrecoff = 0;
+    ControlFile.minSafeStartPoint.xlogid = 0;
+    ControlFile.minSafeStartPoint.xrecoff = 0;

     /* Now we can force the recorded xlog seg size to the right thing. */
     ControlFile.xlog_seg_size = XLogSegSize;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6913f7c..c3b3ec7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -133,7 +133,16 @@ typedef struct XLogRecData
 } XLogRecData;

 extern TimeLineID ThisTimeLineID;        /* current TLI */
-extern bool InRecovery;
+
+/*
+ * Prior to 8.4, all activity during recovery was carried out by the Startup
+ * process. This local variable continues to be used in many parts of the
+ * code to indicate actions taken by RecoveryManagers. Other processes that
+ * potentially perform work during recovery should check
+ * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
+ */
+extern bool InRecovery;
+
 extern XLogRecPtr XactLastRecEnd;

 /* these variables are GUC parameters related to XLOG */
@@ -161,11 +170,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_IS_SHUTDOWN    0x0001    /* Checkpoint is for shutdown */
 #define CHECKPOINT_IMMEDIATE    0x0002    /* Do it without delays */
 #define CHECKPOINT_FORCE        0x0004    /* Force even if no activity */
+#define CHECKPOINT_STARTUP        0x0008    /* Startup checkpoint */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT            0x0008    /* Wait for completion */
+#define CHECKPOINT_WAIT            0x0010    /* Wait for completion */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG    0x0010    /* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME    0x0020    /* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG    0x0020    /* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME    0x0040    /* Elapsed time */

 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
@@ -199,6 +209,8 @@ extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
 extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

+extern bool IsRecoveryProcessingMode(void);
+
 extern void UpdateControlFile(void);
 extern Size XLOGShmemSize(void);
 extern void XLOGShmemInit(void);
@@ -207,9 +219,12 @@ extern void StartupXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
+extern void CreateRestartPoint(int flags);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

+extern void StartupProcessMain(void);
+
 #endif   /* XLOG_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 400f32c..e69c8ec 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -21,7 +21,7 @@


 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION    843
+#define PG_CONTROL_VERSION    847

 /*
  * Body of CheckPoint XLOG records.  This is declared here because we keep
@@ -46,7 +46,7 @@ typedef struct CheckPoint
 #define XLOG_NOOP                        0x20
 #define XLOG_NEXTOID                    0x30
 #define XLOG_SWITCH                        0x40
-
+#define XLOG_RECOVERY_END            0x50

 /* System status indicator */
 typedef enum DBState
@@ -102,6 +102,7 @@ typedef struct ControlFileData
     CheckPoint    checkPointCopy; /* copy of last check point record */

     XLogRecPtr    minRecoveryPoint;        /* must replay xlog to here */
+    XLogRecPtr    minSafeStartPoint;        /* safe point after recovery crashes */

     /*
      * This data is used to check for hardware-architecture compatibility of
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3101092..62dddfc 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -22,6 +22,9 @@
  */
 typedef enum
 {
+    PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
+    PMSIGNAL_RECOVERY_CONSISTENT, /* recovery has reached consistent state */
+    PMSIGNAL_RECOVERY_COMPLETED, /* recovery completed */
     PMSIGNAL_PASSWORD_CHANGE,    /* pg_auth file has changed */
     PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */

Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I'm thinking to add a new function that will make crash testing easier.
> > 
> > pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
> > which when replayed will just throw a FATAL error and crash Startup
> > process. We won't be adding that to the user docs...
> > 
> > This will allow us to produce tests that crash the server at specific
> > places, rather than trying to trap those points manually.
> 
> Heh, talk about a footgun ;-). I don't think including that in CVS is a 
> good idea.

Thought you'd like it. I'd have preferred it in a plugin... :-(

Not really sure why it's a problem though. We support 
pg_ctl stop -m immediate, which is the equivalent, yet we don't regard
that as a footgun.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Fri, 2009-01-30 at 13:25 +0200, Heikki Linnakangas wrote:
> > That whole area was something I was leaving until last, since immediate
> > shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> > this before Christmas, briefly).
> 
> We must handle shutdown gracefully, can't just leave bgwriter running 
> after postmaster exit.

Nobody was suggesting we should.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote:
> Ok, here's an attempt to make shutdown work gracefully.
> 
> Startup process now signals postmaster three times during startup: first 
> when it has done all the initialization, and starts redo. At that point, 
> postmaster launches bgwriter, which starts to perform restartpoints when 
> it deems appropriate. The 2nd time signals when we've reached consistent 
> recovery state. As the patch stands, that's not significant, but it will 
> be with all the rest of the hot standby stuff. The 3rd signal is sent 
> when startup process has finished recovery. Postmaster used to wait for 
> the startup process to exit, and check the return code to determine 
> that, but now that we support shutdown, startup process also returns 
> with 0 exit code when it has been requested to terminate.

Yeh, seems much cleaner.

Slightly bizarre though cos now we're pretty much back to my originally
proposed design. C'est la vie.

I like this way because it means we might in the future get Startup
process to perform post-recovery actions also.

> The startup process now catches SIGTERM, and calls proc_exit() at the 
> next WAL record. That's what will happen in a fast shutdown. Unexpected 
> death of the startup process is treated the same as a backend/auxiliary 
> process crash.

Good. Like your re-arrangement of StartupProcessMain also.


Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be if
(IsUnderPostmaster), or at least a comment to explain why not or perhaps
an Assert.

Think you need to just throw away this chunk

@@ -5253,7 +5386,7 @@ StartupXLOG(void)
        * Complain if we did not roll forward far enough to render the backup
        * dump consistent.
        */
-       if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
+       if (InRecovery && !reachedSafeStartPoint)
        {
                if (reachedStopPoint)   /* stopped because of stop request */
                        ereport(FATAL,




-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I'm thinking to add a new function that will make crash testing easier.
>>>
>>> pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
>>> which when replayed will just throw a FATAL error and crash Startup
>>> process. We won't be adding that to the user docs...
>>>
>>> This will allow us to produce tests that crash the server at specific
>>> places, rather than trying to trap those points manually.
>> Heh, talk about a footgun ;-). I don't think including that in CVS is a 
>> good idea.
> 
> Thought you'd like it. I'd have preferred it in a plugin... :-(
> 
> Not really sure why it's a problem though. We support 
> pg_ctl stop -m immediate, which is the equivalent, yet we don't regard
> that as a footgun.

If you poison your WAL archive with an XLOG_CRASH_STANDBY record, 
recovery will never be able to proceed over that point. There would have 
to be a switch to ignore those records, at the very least.

pg_ctl stop -m immediate has some use in a production system, while this 
would be a pure development aid. For that purpose, you might as well use a 
patched version.

Presumably you want to test different kinds of crashes and at different 
points. FATAL, PANIC, segfault etc. A single crash mechanism like that 
will only test one path.

You don't really need to do it with a new WAL record. You could just add 
a GUC or recovery.conf option along the lines of recovery_target: 
crash_target=0/123456, and check for that in ReadRecord or wherever you 
want the crash to occur.
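
A minimal sketch of that idea, under stated assumptions: crash_target is not an existing option, and parse_crash_target() and reached_crash_target() are hypothetical helpers invented for illustration. XLogRecPtr is a stand-in for the two-part (xlogid, xrecoff) location that the patch prints with %X/%X. The check would be called wherever the crash is wanted, with the location of the record being read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the two-part WAL location (xlogid, xrecoff). */
typedef struct XLogRecPtr
{
    uint32_t xlogid;    /* log file number */
    uint32_t xrecoff;   /* byte offset within that log file */
} XLogRecPtr;

/* Hypothetical setting: crash_target is an assumption, not a real option. */
static bool crash_target_set = false;
static XLogRecPtr crash_target;

/* Parse the "X/X" hex notation, e.g. "0/123456". Returns false on bad input. */
bool
parse_crash_target(const char *value)
{
    unsigned int id;
    unsigned int off;
    char extra;

    /* Exactly two hex fields must match, with nothing trailing. */
    if (sscanf(value, "%X/%X%c", &id, &off, &extra) != 2)
        return false;

    crash_target.xlogid = id;
    crash_target.xrecoff = off;
    crash_target_set = true;
    return true;
}

/* True once the record location is at or beyond the configured target. */
bool
reached_crash_target(XLogRecPtr recptr)
{
    if (!crash_target_set)
        return false;
    if (recptr.xlogid != crash_target.xlogid)
        return recptr.xlogid > crash_target.xlogid;
    return recptr.xrecoff >= crash_target.xrecoff;
}
```

The caller decides how to fail once reached_crash_target() returns true (FATAL, PANIC, a deliberate segfault), so one setting could exercise several crash paths.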

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote:
>> Ok, here's an attempt to make shutdown work gracefully.
>>
>> Startup process now signals postmaster three times during startup: first 
>> when it has done all the initialization, and starts redo. At that point, 
>> postmaster launches bgwriter, which starts to perform restartpoints when 
>> it deems appropriate. The 2nd time signals when we've reached consistent 
>> recovery state. As the patch stands, that's not significant, but it will 
>> be with all the rest of the hot standby stuff. The 3rd signal is sent 
>> when startup process has finished recovery. Postmaster used to wait for 
>> the startup process to exit, and check the return code to determine 
>> that, but now that we support shutdown, startup process also returns 
>> with 0 exit code when it has been requested to terminate.
> 
> Yeh, seems much cleaner.
> 
> Slightly bizarre though cos now we're pretty much back to my originally
> proposed design. C'est la vie.

Yep. I didn't see any objections to that approach in the archives. There 
were other problems in the early versions of the patch, but nothing 
related to this arrangement.

> I like this way because it means we might in the future get Startup
> process to perform post-recovery actions also.

Yeah, it does. Do you have something in mind already?

> Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be if
> (IsUnderPostmaster), or at least a comment to explain why not or perhaps
> an Assert.

Nah, StartupProcessMain is only run under postmaster; you don't want to 
install signal handlers in a stand-alone backend. Stand-alone backend 
calls StartupXLOG directly.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Sat, 2009-01-31 at 22:32 +0200, Heikki Linnakangas wrote:

> If you poison your WAL archive with an XLOG_CRASH_STANDBY record, 
> recovery will never be able to proceed over that point. There would have 
> to be a switch to ignore those records, at the very least.

Definitely in assert mode only.

I'll do it as a test patch and keep it separate from main line.

> You don't really need to do it with a new WAL record. You could just add 
> a GUC or recovery.conf option along the lines of recovery_target: 
> crash_target=0/123456, and check for that in ReadRecord or wherever you 
> want the crash to occur.

Knowing that LSN is somewhat harder

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Sat, 2009-01-31 at 22:41 +0200, Heikki Linnakangas wrote:

> > I like this way because it means we might in the future get Startup
> > process to perform post-recovery actions also.
> 
> Yeah, it does. Do you have something in mind already?

Yes, but nothing that needs to be discussed yet.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The startup process now catches SIGTERM, and calls proc_exit() at the next
> WAL record. That's what will happen in a fast shutdown. Unexpected death of
> the startup process is treated the same as a backend/auxiliary process
> crash.

If unexpected death of the startup process happens in automatic recovery
after a crash, postmaster and bgwriter may get stuck, because HandleChildCrash()
can be called before the FatalError flag is reset. When FatalError is still set,
HandleChildCrash() doesn't kill any auxiliary processes. So bgwriter survives
the crash and postmaster waits forever for the death of bgwriter in recovery
status (which means that new connections cannot be started). Is this a bug?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> The startup process now catches SIGTERM, and calls proc_exit() at the next
>> WAL record. That's what will happen in a fast shutdown. Unexpected death of
>> the startup process is treated the same as a backend/auxiliary process
>> crash.
> 
> If unexpected death of the startup process happens in automatic recovery
> after a crash, postmaster and bgwriter may get stuck, because HandleChildCrash()
> can be called before the FatalError flag is reset. When FatalError is still set,
> HandleChildCrash() doesn't kill any auxiliary processes. So bgwriter survives
> the crash and postmaster waits forever for the death of bgwriter in recovery
> status (which means that new connections cannot be started). Is this a bug?

Yes, and in fact I ran into it myself yesterday while testing. It seems 
that we should reset FatalError earlier, ie. when the recovery starts 
and bgwriter is launched. I'm not sure why in CVS HEAD we don't reset 
FatalError until after the startup process is finished. Resetting it as 
soon as all the processes have been terminated and startup process is 
launched again would seem like a more obvious place to do it. The only 
difference that I can see is that if someone tries to connect while the 
startup process is running, you now get a "the database system is in 
recovery mode" message instead of "the database system is starting up" 
if we're reinitializing after crash. We can keep that behavior, just 
need to add another flag to mean "reinitializing after crash" that isn't 
reset until the recovery is over.
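
A rough sketch of the two-flag arrangement, as an illustration only: PmFlags, reinit_after_crash, and the note_*() helpers are hypothetical names, not code from the patch. The point is that fatal_error can be cleared as soon as recovery starts, while the second flag keeps the connection-rejection message unchanged until recovery is over.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical postmaster-side flags; names are assumptions for illustration. */
typedef struct PmFlags
{
    bool fatal_error;           /* crash cleanup still in progress */
    bool reinit_after_crash;    /* this startup follows a crash */
} PmFlags;

/* A child crashed: both flags are raised. */
void
note_child_crash(PmFlags *f)
{
    f->fatal_error = true;
    f->reinit_after_crash = true;
}

/* Startup process has begun redo: crash handling may work normally again. */
void
note_recovery_started(PmFlags *f)
{
    f->fatal_error = false;
}

/* Recovery finished: only now forget that we restarted after a crash. */
void
note_recovery_completed(PmFlags *f)
{
    f->fatal_error = false;
    f->reinit_after_crash = false;
}

/* Message for a client that connects while the startup process is running. */
const char *
startup_rejection_message(const PmFlags *f)
{
    if (f->reinit_after_crash)
        return "the database system is in recovery mode";
    return "the database system is starting up";
}
```

With this split, a second crash during recovery finds fatal_error already cleared, so the usual child-termination path is not skipped, and clients still see the recovery-mode message.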

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> * I think we are now renaming the recovery.conf file too early. The
> comment says "We have already restored all the WAL segments we need from
> the archive, and we trust that they are not going to go away even if we
> crash." We have, but the files overwrite each other as they arrive, so
> if the last restartpoint is not in the last restored WAL file then it
> will only exist in the archive. The recovery.conf is the only place
> where we store the information on where the archive is and how to access
> it, so by renaming it out of the way we will be unable to crash recover
> until the first checkpoint is complete. So the way this was in the
> original patch is the correct way to go, AFAICS.

I can see what you mean now. In fact we're not safe even when the last 
restartpoint is in the last restored WAL file, because we always restore 
segments from the archive to a temporary filename.

Yes, renaming recovery.conf at the first checkpoint avoids that problem. 
However, it means that we'll re-enter archive recovery if we crash 
before that checkpoint is finished. Won't that cause havoc if more files 
have appeared in the archive since the crash, and we restore those even 
though we already started up from an earlier point? How do the timelines 
work in that case?

We could avoid that by performing a good old startup checkpoint, but I 
quite like the fast failover time we get without it.

> * my original intention was to deprecate log_restartpoints and would
> still like to do so. log_checkpoints does just as well for that. Even
> less code than before...

Ok, done.

> * comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
> the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version
> because it matches the IS_SHUTDOWN naming.

Fixed.

> * In CreateCheckpoint() the if test on TruncateSubtrans() has been
> removed, but the comment has not been changed (to explain why).

Thanks, comment updated. (we now call CreateCheckPoint() only after 
recovery is finished)

> We should continue to measure performance of recovery in the light of
> these changes. I still feel that fsyncing the control file on each
> XLogFileRead() will give a noticeable performance penalty, mostly
> because we know doing exactly the same thing in normal running caused a
> performance penalty. But that is easily changed and cannot be done with
> any certainty without wider feedback, so no reason to delay code commit.

I've changed the way minRecoveryPoint is updated now anyway, so it no 
longer happens every XLogFileRead().

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Wed, 2009-02-04 at 19:03 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > * I think we are now renaming the recovery.conf file too early. The
> > comment says "We have already restored all the WAL segments we need from
> > the archive, and we trust that they are not going to go away even if we
> > crash." We have, but the files overwrite each other as they arrive, so
> > if the last restartpoint is not in the last restored WAL file then it
> > will only exist in the archive. The recovery.conf is the only place
> > where we store the information on where the archive is and how to access
> > it, so by renaming it out of the way we will be unable to crash recover
> > until the first checkpoint is complete. So the way this was in the
> > original patch is the correct way to go, AFAICS.
> 
> I can see what you mean now. In fact we're not safe even when the last 
> restartpoint is in the last restored WAL file, because we always restore 
> segments from the archive to a temporary filename.
> 
> Yes, renaming recovery.conf at the first checkpoint avoids that problem. 
> However, it means that we'll re-enter archive recovery if we crash 
> before that checkpoint is finished. Won't that cause havoc if more files 
> have appeared in the archive since the crash, and we restore those even 
> though we already started up from an earlier point? How do the timelines 
> work in that case?

More archive files being added to archive is possible, but unlikely.
What *is* a definite problem is that if we ended recovery by manual
command (possible with HS patch) or recovery target then we would have
no record of which point it was that we finished at. It is then possible
that the recovery.conf has been re-edited, causing re-recovery to end at
a different place. That seems like a bad thing.

> We could avoid that by performing a good old startup checkpoint, but I 
> quite like the fast failover time we get without it.

ISTM it's either slow failover or (fast failover, but restart archive
recovery if crashes).

I would suggest that at end of recovery we write the last LSN to the
control file, so if we crash recover then we will always end archive
recovery at the same place again should we re-enter it. So we would have
a recovery_target_lsn that overrides recovery_target_xid etc..

Given where we are, I would suggest we go for the slow failover option
in this release. Doing otherwise is a risky decision with little gain.
BGwriter-in-recovery is a good thing of itself and we have other code to
review yet with a higher importance level, and the rest of HS is code
I'm actually much happier with. I've spent weeks trying to get the
transition between recovery and normal running clean, but it seems like
time to say I didn't get it right *enough* to be absolutely certain.
(Just the fast failover part of patch!). Thanks for the review.

> > We should continue to measure performance of recovery in the light of
> > these changes. I still feel that fsyncing the control file on each
> > XLogFileRead() will give a noticeable performance penalty, mostly
> > because we know doing exactly the same thing in normal running caused a
> > performance penalty. But that is easily changed and cannot be done with
> > any certainty without wider feedback, so no reason to delay code commit.
> 
> I've changed the way minRecoveryPoint is updated now anyway, so it no 
> longer happens every XLogFileRead().

Care to elucidate?

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Yes, and in fact I ran into it myself yesterday while testing. It seems that
> we should reset FatalError earlier, ie. when the recovery starts and
> bgwriter is launched. I'm not sure why in CVS HEAD we don't reset
> FatalError until after the startup process is finished. Resetting it as soon
> as all the processes have been terminated and startup process is launched again
> would seem like a more obvious place to do it.

Which may repeat the recovery crash and reinitializing forever. To prevent
this problem, unexpected death of startup process should not cause
reinitializing?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> ... I'm not sure why in CVS HEAD we don't reset
>> FatalError until after the startup process is finished.

> Which may repeat the recovery crash and reinitializing forever. To prevent
> this problem, unexpected death of startup process should not cause
> reinitializing?

Fujii-san has it in one.
        regards, tom lane


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> ... I'm not sure why in CVS HEAD we don't reset
>>> FatalError until after the startup process is finished.
> 
>> Which may repeat the recovery crash and reinitializing forever. To prevent
>> this problem, unexpected death of startup process should not cause
>> reinitializing?
> 
> Fujii-san has it in one.

In CVS HEAD, we always do ExitPostmaster(1) if the startup process dies 
unexpectedly, regardless of FatalError. So it serves no such purpose there.

But yeah, we do have that problem with the patch. What do we want to do 
if the startup process dies with FATAL? It seems reasonable to assume 
that the problem isn't going to just go away if we restart: we'd do 
exactly the same sequence of actions after restart.

But if it's after reaching consistent recovery, the system should still 
be in consistent state, and we could stay open for read-only 
transactions. That would be useful to recover a corrupted database/WAL; 
you could let the WAL replay run as far as it can, and then connect 
and investigate the situation, or run pg_dump.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
>> We could avoid that by performing a good old startup checkpoint, but I 
>> quite like the fast failover time we get without it.
> 
> ISTM it's either slow failover or (fast failover, but restart archive
> recovery if crashes).
> 
> I would suggest that at end of recovery we write the last LSN to the
> control file, so if we crash recover then we will always end archive
> recovery at the same place again should we re-enter it. So we would have
> a recovery_target_lsn that overrides recovery_target_xid etc..

Hmm, we don't actually want to end recovery at the same point again: if 
there are some updates right after the database came up, but before the 
first checkpoint and crash, those actions need to be replayed too.

> Given where we are, I would suggest we go for the slow failover option
> in this release.

Agreed. We could do it for crash recovery, but I'd rather not have two 
different ways to do it. It's not as important for crash recovery, either.

>>> We should continue to measure performance of recovery in the light of
>>> these changes. I still feel that fsyncing the control file on each
>>> XLogFileRead() will give a noticeable performance penalty, mostly
>>> because we know doing exactly the same thing in normal running caused a
>>> performance penalty. But that is easily changed and cannot be done with
>>> any certainty without wider feedback, so no reason to delay code commit.
>>
>> I've changed the way minRecoveryPoint is updated now anyway, so it no 
>> longer happens every XLogFileRead().
> 
> Care to elucidate?

I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
patch soon.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:

> >> I've changed the way minRecoveryPoint is updated now anyway, so it no 
> >> longer happens every XLogFileRead().
> > 
> > Care to elucidate?
> 
> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
> patch soon.

Why do you think XLogFlush is called less frequently than XLogFileRead?

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> >> We could avoid that by performing a good old startup checkpoint, but I 
> >> quite like the fast failover time we get without it.
> > 
> > ISTM it's either slow failover or (fast failover, but restart archive
> > recovery if crashes).
> > 
> > I would suggest that at end of recovery we write the last LSN to the
> > control file, so if we crash recover then we will always end archive
> > recovery at the same place again should we re-enter it. So we would have
> > a recovery_target_lsn that overrides recovery_target_xid etc..
> 
> Hmm, we don't actually want to end recovery at the same point again: if 
> there's some updates right after the database came up, but before the 
> first checkpoint and crash, those actions need to be replayed too.

I think we do need to. Crash recovery is supposed to return us to the
same state. Where we ended ArchiveRecovery is part of that state. 

It isn't for crash recovery to decide that it saw a few more
transactions and to apply them anyway. The user may have
specifically ended recovery to prevent those additional changes from
taking place.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I would suggest that at end of recovery we write the last LSN to the
>>> control file, so if we crash recover then we will always end archive
>>> recovery at the same place again should we re-enter it. So we would have
>>> a recovery_target_lsn that overrides recovery_target_xid etc..
>> Hmm, we don't actually want to end recovery at the same point again: if 
>> there's some updates right after the database came up, but before the 
>> first checkpoint and crash, those actions need to be replayed too.
> 
> I think we do need to. Crash recovery is supposed to return us to the
> same state. Where we ended ArchiveRecovery is part of that state. 
> 
> It isn't for crash recovery to decide that it saw a few more
> transactions and decided to apply them anyway. The user may have
> specifically ended recovery to avoid those additional changes from
> taking place.

Those additional changes were made in the standby, after ending 
recovery. So the sequence of events is:

1. Standby performs archive recovery
2. End of archive (or stop point) reached. Open for connections, 
read-write. Request an online checkpoint. Online checkpoint begins.
3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits, 
commit returns.
4. "pg_ctl stop -m immediate". The checkpoint started in step 2 hasn't 
finished yet
5. Restart the server.

The server will now re-enter archive recovery. But this time, we want to 
replay the INSERT too.

(let's do the startup checkpoint for now, as discussed elsewhere in this 
thread)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
> 
>>>> I've changed the way minRecoveryPoint is updated now anyway, so it no 
>>>> longer happens every XLogFileRead().
>>> Care to elucidate?
>> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
>> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
>> patch soon.
> 
> Why do you think XLogFlush is called less frequently than XLogFileRead?

It's not, but we only need to update the control file when we're 
"flushing" an LSN that's greater than current minRecoveryPoint. And when 
we do update minRecoveryPoint, we can update it to the LSN of the last 
record we've read from the archive.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 10:07 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> I would suggest that at end of recovery we write the last LSN to the
> >>> control file, so if we crash recover then we will always end archive
> >>> recovery at the same place again should we re-enter it. So we would have
> >>> a recovery_target_lsn that overrides recovery_target_xid etc..
> >> Hmm, we don't actually want to end recovery at the same point again: if 
> >> there's some updates right after the database came up, but before the 
> >> first checkpoint and crash, those actions need to be replayed too.
> > 
> > I think we do need to. Crash recovery is supposed to return us to the
> > same state. Where we ended ArchiveRecovery is part of that state. 
> > 
> > It isn't for crash recovery to decide that it saw a few more
> > transactions and decided to apply them anyway. The user may have
> > specifically ended recovery to avoid those additional changes from
> > taking place.
> 
> Those additional changes were made in the standby, after ending 
> recovery. So the sequence of events is:
> 
> 1. Standby performs archive recovery
> 2. End of archive (or stop point) reached. Open for connections, 
> read-write. Request an online checkpoint. Online checkpoint begins.
> 3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits, 
> commit returns.
> 4. "pg_ctl stop -m immediate". The checkpoint started in step 2 hasn't 
> finished yet
> 5. Restart the server.
> 
> The server will now re-enter archive recovery. But this time, we want to 
> replay the INSERT too.

I agree with this, so let me restate both comments together.

When the server starts it begins a new timeline. 

When recovering we must switch to that timeline at the same point we did
previously when we ended archive recovery. We currently don't record
when that is and we need to.

Yes, we must also replay the records in the new timeline once we have
switched to it, as you say. But we must not replay any further in the
older timeline(s).
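
The rule restated above can be sketched as a small predicate. This is only an illustration under stated assumptions, not code from the patch: TimelineSwitch, should_replay() and the single-integer Lsn type are hypothetical, and only one timeline switch is modelled rather than a full timeline history.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified types; the real XLogRecPtr is a two-field struct. */
typedef uint32_t TimeLine;
typedef uint64_t Lsn;

/* Hypothetical record of where archive recovery ended the previous time. */
typedef struct
{
    TimeLine old_tli;      /* timeline that was being recovered */
    TimeLine new_tli;      /* timeline started when recovery ended */
    Lsn      switch_lsn;   /* first LSN that belongs to the new timeline */
} TimelineSwitch;

/*
 * Decide whether a record should be replayed when recovery is re-entered.
 * Records on the old timeline are replayed only up to the switch point;
 * records on the new timeline are replayed from the switch point onwards.
 */
bool
should_replay(const TimelineSwitch *sw, TimeLine rec_tli, Lsn rec_lsn)
{
    assert(sw != NULL);

    if (rec_tli == sw->old_tli)
        return rec_lsn < sw->switch_lsn;
    if (rec_tli == sw->new_tli)
        return rec_lsn >= sw->switch_lsn;
    return false;               /* unrelated timeline */
}
```

With a recorded switch at LSN 500 from timeline 1 to timeline 2, an INSERT logged on timeline 2 after the switch is replayed, while anything beyond LSN 500 on timeline 1 is not.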

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 10:31 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
> > 
> >>>> I've changed the way minRecoveryPoint is updated now anyway, so it no 
> >>>> longer happens every XLogFileRead().
> >>> Care to elucidate?
> >> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
> >> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
> >> patch soon.
> > 
> > Why do you think XLogFlush is called less frequently than XLogFileRead?
> 
> It's not, but we only need to update the control file when we're 
> "flushing" an LSN that's greater than current minRecoveryPoint. And when 
> we do update minRecoveryPoint, we can update it to the LSN of the last 
> record we've read from the archive.

So we might end up flushing more often *and* we will be doing it
potentially in the code path of other users.

This change seems speculative and also against what has previously been
agreed with Tom. If he chooses not to comment on your changes, that's up
to him, but I don't think you should remove things quietly that have
been put there through the community process, as if they caused
problems. I feel like I'm in the middle here. 

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 09:26 +0000, Simon Riggs wrote:

> This change seems speculative and also against what has previously been
> agreed with Tom. If he chooses not to comment on your changes, that's up
> to him, but I don't think you should remove things quietly that have
> been put there through the community process, as if they caused
> problems. I feel like I'm in the middle here. 

This is only a problem because of two independent reviewers.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-05 at 10:31 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
>>>> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
>>>> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
>>>> patch soon.
>>> Why do you think XLogFlush is called less frequently than XLogFileRead?
>> It's not, but we only need to update the control file when we're 
>> "flushing" an LSN that's greater than current minRecoveryPoint. And when 
>> we do update minRecoveryPoint, we can update it to the LSN of the last 
>> record we've read from the archive.
> 
> So we might end up flushing more often *and* we will be doing it
> potentially in the code path of other users.

For example, imagine a database that fits completely in shared buffers. 
If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
If we update in XLogFlush the way I described, you only need to update 
when we flush a page from the buffer cache, which will only happen at 
restartpoints. That's far fewer updates.

Expanding that example to a database that doesn't fit in cache, you're 
still replacing pages from the buffer cache that have been untouched for 
longest. Such pages will have an old LSN, too, so we shouldn't need to 
update very often.

I'm sure you can come up with an example of where we end up fsyncing 
more often, but it doesn't seem like the common case to me.

> This change seems speculative and also against what has previously been
> agreed with Tom. If he chooses not to comment on your changes, that's up
> to him, but I don't think you should remove things quietly that have
> been put there through the community process, as if they caused
> problems. I feel like I'm in the middle here. 

I'd like to have the extra protection that this approach gives. If we 
let safeStartPoint be ahead of the actual WAL we've replayed, we have 
to just assume we're fine if we reach end of WAL before reaching that 
point. That assumption falls down if e.g. recovery is stopped, and you go 
and remove the last few WAL segments from the archive before restarting 
it, or signal pg_standby to trigger failover too early. Tracking the 
real safe starting point and enforcing it always protects you from that.

(we did discuss this a week ago: 
http://archives.postgresql.org/message-id/4981F6E0.2040503@enterprisedb.com)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:

> > So we might end up flushing more often *and* we will be doing it
> > potentially in the code path of other users.
> 
> For example, imagine a database that fits completely in shared buffers. 
> If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
> If we update in XLogFlush the way I described, you only need to update 
> when we flush a page from the buffer cache, which will only happen at 
> restartpoints. That's far less updates.

Oh, did you change the bgwriter so it doesn't do normal page cleaning? 

General thoughts: Latest HS patch has a CPU profile within 1-2% of
current and the use of ProcArrayLock is fairly minimal now. The
additional CPU is recoveryStopsHere(), which enables the manual control
of recovery, so the trade off seems worth it. The major CPU hog remains
RecordIsValid, which is the CRC checks. Startup is still I/O bound.
Specific avoidable I/O hogs are (1) checkpoints, (2) page cleaning so I
hope we can avoid both of those. 

I also hope to minimise the I/O cost of keeping track of the consistency
point. If that was done as infrequently as each restartpoint then I
would certainly be very happy, but that won't happen in the proposed
scheme if we do page cleaning.

> Expanding that example to a database that doesn't fit in cache, you're 
> still replacing pages from the buffer cache that have been untouched for 
> longest. Such pages will have an old LSN, too, so we shouldn't need to 
> update very often.

They will tend to be written in ascending LSN order, which will mean we
continually update the control file. Anything out of order does skip a
write. The better the cache is at picking out LRU blocks, the more writes
we will make.

> I'm sure you can come up with an example of where we end up fsyncing 
> more often, but it doesn't seem like the common case to me.

I'm not trying to come up with counterexamples...

> > This change seems speculative and also against what has previously been
> > agreed with Tom. If he chooses not to comment on your changes, that's up
> > to him, but I don't think you should remove things quietly that have
> > been put there through the community process, as if they caused
> > problems. I feel like I'm in the middle here. 
> 
> I'd like to have the extra protection that this approach gives. If we 
> let safeStartPoint to be ahead of the actual WAL we've replayed, we have 
> to just assume we're fine if we reach end of WAL before reaching that 
> point. That assumption falls down if e.g recovery is stopped, and you go 
> and remove the last few WAL segments from the archive before restarting 
> it, or signal pg_standby to trigger failover too early. Tracking the 
> real safe starting point and enforcing it always protects you from that.

Doing it this way will require you to remove existing specific error
messages about ending before end time of backup, to be replaced by more
general ones that say "consistency not reached" which is harder to
figure out what to do about it.

> (we did discuss this a week ago: 
> http://archives.postgresql.org/message-id/4981F6E0.2040503@enterprisedb.com)

Yes, we need to update it in that case. Though that is in no way agreement
to the other changes, nor does it require them.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
> 
>>> So we might end up flushing more often *and* we will be doing it
>>> potentially in the code path of other users.
>> For example, imagine a database that fits completely in shared buffers. 
>> If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
>> If we update in XLogFlush the way I described, you only need to update 
>> when we flush a page from the buffer cache, which will only happen at 
>> restartpoints. That's far less updates.
> 
> Oh, did you change the bgwriter so it doesn't do normal page cleaning? 

No. Ok, that wasn't completely accurate. The page cleaning by bgwriter 
will perform XLogFlushes, but that should be pretty insignificant. When 
there's little page replacement going on, bgwriter will do a small 
trickle of page cleaning, which won't matter much. If there's more page 
replacement going on, bgwriter is cleaning up pages that will soon be 
replaced, so it's just offsetting work from other backends (or the 
startup process in this case).

>> Expanding that example to a database that doesn't fit in cache, you're 
>> still replacing pages from the buffer cache that have been untouched for 
>> longest. Such pages will have an old LSN, too, so we shouldn't need to 
>> update very often.
> 
> They will tend to be written in ascending LSN order which will mean we
> continually update the control file. Anything out of order does skip a
> write. The better the cache is at finding LRU blocks out the more writes
> we will make.

When minRecoveryPoint is updated, it's not updated to just the LSN that's 
being flushed. It's updated to the recptr of the most recently read WAL 
record. That's an important point, and it avoids that behavior. It's just 
like XLogFlush, which normally flushes all of the outstanding WAL, not just 
up to the requested LSN.
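
To make the difference concrete, here is a rough sketch that counts control file updates under both policies for pages flushed in ascending LSN order. It is a toy model, not the patch itself: count_control_file_updates() and the integer Lsn type are made up for illustration, and the crash-recovery case (invalid minRecoveryPoint) is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; the real XLogRecPtr is a two-field struct. */
typedef uint64_t Lsn;

/*
 * Count control file updates for a series of page flushes.
 *
 * page_lsn[i] is the LSN of the i-th flushed page, and replay_end[i] is how
 * far WAL replay had progressed at that moment.  If advance_to_replay_end is
 * nonzero, each update moves minRecoveryPoint all the way to replay_end[i];
 * otherwise it only moves to page_lsn[i].
 */
int
count_control_file_updates(const Lsn *page_lsn, const Lsn *replay_end,
                           int n, int advance_to_replay_end)
{
    Lsn min_recovery_point = 0;
    int updates = 0;

    for (int i = 0; i < n; i++)
    {
        /* a page cannot carry an LSN beyond what has been replayed */
        assert(replay_end[i] >= page_lsn[i]);

        if (page_lsn[i] <= min_recovery_point)
            continue;           /* already covered, no control file write */

        min_recovery_point = advance_to_replay_end ? replay_end[i]
                                                   : page_lsn[i];
        updates++;
    }
    return updates;
}
```

For four pages with LSNs 100, 200, 300 and 400, flushed while replay has already reached LSN 1000, advancing only to the page LSN costs four control file updates, while advancing to the replay position costs one.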

>> I'd like to have the extra protection that this approach gives. If we 
>> let safeStartPoint to be ahead of the actual WAL we've replayed, we have 
>> to just assume we're fine if we reach end of WAL before reaching that 
>> point. That assumption falls down if e.g recovery is stopped, and you go 
>> and remove the last few WAL segments from the archive before restarting 
>> it, or signal pg_standby to trigger failover too early. Tracking the 
>> real safe starting point and enforcing it always protects you from that.
> 
> Doing it this way will require you to remove existing specific error
> messages about ending before end time of backup, to be replaced by more
> general ones that say "consistency not reached" which is harder to
> figure out what to do about it.

Yeah. If that's an important distinction, we could still save the 
original backup stop location somewhere, just so that we can give the 
old error message when we've not passed that location. But perhaps a 
message like "WAL ends before reaching a consistent state" with a hint 
"Make sure you archive all the WAL created during backup" or something 
would suffice.
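
As a sketch of the check being discussed, something like the following would run when replay hits the end of WAL. The function name reached_consistency(), the integer Lsn type and the plain fprintf logging are assumptions for illustration only; the message and hint texts are the ones proposed above, not necessarily what ends up in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an LSN; the real XLogRecPtr is a two-field struct. */
typedef uint64_t Lsn;

/*
 * Called when replay reaches the end of available WAL.  Returns true if
 * the minimum recovery point was reached, so the database is consistent.
 * Otherwise logs the proposed message and hint and returns false, and the
 * caller must refuse to open the database.
 */
bool
reached_consistency(Lsn end_of_wal, Lsn min_recovery_point, FILE *log)
{
    assert(log != NULL);

    if (end_of_wal >= min_recovery_point)
        return true;

    fprintf(log,
            "FATAL: WAL ends before reaching a consistent state\n"
            "HINT: Make sure you archive all the WAL created during backup\n");
    return false;
}
```

The same check covers both the "backup end not reached" case and the case where WAL segments were removed from the archive after an interrupted recovery, which is why a single general message is used.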

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> > 
> >>> So we might end up flushing more often *and* we will be doing it
> >>> potentially in the code path of other users.
> >> For example, imagine a database that fits completely in shared buffers. 
> >> If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
> >> If we update in XLogFlush the way I described, you only need to update 
> >> when we flush a page from the buffer cache, which will only happen at 
> >> restartpoints. That's far less updates.
> > 
> > Oh, did you change the bgwriter so it doesn't do normal page cleaning? 
> 
> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter 
> will perform XLogFlushes, but that should be pretty insignificant. When 
> there's little page replacement going on, bgwriter will do a small 
> trickle of page cleaning, which won't matter much. 

Yes, that case is good, but it wasn't the use case we're trying to speed
up by having the bgwriter active during recovery. We're worried about
I/O bound recoveries.

> If there's more page 
> replacement going on, bgwriter is cleaning up pages that will soon be 
> replaced, so it's just offsetting work from other backends (or the 
> startup process in this case).

Which only needs to be done if we follow this route, so is not an
argument in favour.

Using more I/O in I/O bound cases makes the worst case even worse.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> So we might end up flushing more often *and* we will be doing it
>>>>> potentially in the code path of other users.
>>>> For example, imagine a database that fits completely in shared buffers. 
>>>> If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
>>>> If we update in XLogFlush the way I described, you only need to update 
>>>> when we flush a page from the buffer cache, which will only happen at 
>>>> restartpoints. That's far less updates.
>>> Oh, did you change the bgwriter so it doesn't do normal page cleaning? 
>> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter 
>> will perform XLogFlushes, but that should be pretty insignificant. When 
>> there's little page replacement going on, bgwriter will do a small 
>> trickle of page cleaning, which won't matter much. 
> 
> Yes, that case is good, but it wasn't the use case we're trying to speed
> up by having the bgwriter active during recovery. We're worried about
> I/O bound recoveries.

Ok, let's do the math:

By updating minRecoveryPoint in XLogFileRead, you're fsyncing the 
control file once every 16MB of WAL.

By updating in XLogFlush, the frequency depends on the amount of 
shared_buffers available to buffer the modified pages, the average WAL 
record size, and the cache hit ratio. Let's determine the worst case:

The smallest WAL record that dirties a page is a heap deletion record. 
That contains just enough information to locate the tuple. If I'm 
reading the headers right, that record is 48 bytes long (28 bytes of 
xlog header + 18 bytes of payload + padding). Assuming that the WAL is 
full of just those records, that there are no full page images, and that 
the cache hit ratio is 0%, we will need (16 MB / 48 B) * 8 kB = 2730 MB 
of shared_buffers to reach the mark of one fsync per 16 MB of WAL.

So if you have a lower shared_buffers setting than 2.7 GB, you can have 
more frequent fsyncs this way in the worst case. If you think of the 
typical case, you're probably not doing all deletes, and you're having a 
non-zero cache hit ratio, so you achieve the same frequency with a much 
lower shared_buffers setting. And if you're really that I/O bound, I 
doubt the few extra fsyncs matter much.

Also note that when the control file is updated in XLogFlush, it's 
typically the bgwriter doing it as it cleans buffers ahead of the clock 
hand, not the startup process.
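
The worst-case arithmetic above can be reproduced with a few lines of C. This is just the calculation from this mail written out; worst_case_shared_buffers() is a made-up helper, and the 48-byte record size is the estimate given above, not a verified figure.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shared buffers (in bytes) needed so that the records in one WAL segment
 * dirty no more pages than fit in the cache, assuming every record dirties
 * a different page (0% cache hit ratio) and there are no full page images.
 */
uint64_t
worst_case_shared_buffers(uint64_t wal_segment_bytes,
                          uint64_t record_bytes,
                          uint64_t page_bytes)
{
    assert(record_bytes > 0);

    uint64_t records_per_segment = wal_segment_bytes / record_bytes;

    return records_per_segment * page_bytes;
}
```

Calling it with a 16 MB segment, 48-byte records and 8 kB pages gives 2,863,308,800 bytes, which is the 2730 MB (about 2.7 GB) quoted above.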

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 14:18 +0200, Heikki Linnakangas wrote:

> when the control file is updated in XLogFlush, it's 
> typically the bgwriter doing it as it cleans buffers ahead of the
> clock hand, not the startup process

That is the key point. Let's do it your way.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Ok, here's another version. Major changes since last patch:

- Startup checkpoint is now again performed after the recovery is
finished, before allowing (read-write) connections. This is because we
couldn't solve the problem of re-entering recovery after a crash before
the first online checkpoint.

- minSafeStartPoint is gone, and its functionality has been folded into
minRecoveryPoint. It was really the same semantics. There might have
been some debugging value in keeping the backup stop time around, but
it's in the backup label file in the base backup anyway.

- minRecoveryPoint is now updated in XLogFlush, instead of when a file
is restored from archive.

- log_restartpoints is gone. Use log_checkpoints in postgresql.conf
instead

Outstanding issues:

- If bgwriter is performing a restartpoint when recovery ends, the
startup checkpoint will be queued up behind the restartpoint. And since
it uses the same smoothing logic as checkpoints, it can take quite some
time for that to finish. The original patch had some code to hurry up
the restartpoint by signaling the bgwriter if
LWLockConditionalAcquire(CheckPointLock) fails, but there's a race
condition with that if a restartpoint starts right after that check. We
could let the bgwriter do the checkpoint too, and wait for it, but
bgwriter might not be running yet, and we'd have to allow bgwriter to
write WAL while disallowing it for all other processes, which seems
quite complex. Seems like we need something like the
LWLockConditionalAcquire approach, but built into CreateCheckPoint to
eliminate the race condition

- If you perform a fast shutdown while startup process is waiting for
the restore command, startup process sometimes throws a FATAL error
which escalates into an immediate shutdown. That leads to
different messages in the logs, and to skipping the shutdown
restartpoint that we now otherwise perform.

- It's not clear to me if the rest of the xlog flushing related
functions, XLogBackgroundFlush, XLogNeedsFlush and XLogAsyncCommitFlush,
need to work during recovery, and what they should do.

I'll continue working on those outstanding items.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/transam/xlog.c
--- src/backend/access/transam/xlog.c
***************
*** 36,41 ****
--- 36,42 ----
  #include "catalog/pg_control.h"
  #include "catalog/pg_type.h"
  #include "funcapi.h"
+ #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
***************
*** 47,52 ****
--- 48,54 ----
  #include "storage/smgr.h"
  #include "storage/spin.h"
  #include "utils/builtins.h"
+ #include "utils/flatfiles.h"
  #include "utils/guc.h"
  #include "utils/ps_status.h"
  #include "pg_trace.h"
***************
*** 119,130 **** CheckpointStatsData CheckpointStats;
   */
  TimeLineID    ThisTimeLineID = 0;

! /* Are we doing recovery from XLOG? */
  bool        InRecovery = false;

  /* Are we recovering using offline XLOG archives? */
  static bool InArchiveRecovery = false;

  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;

--- 121,146 ----
   */
  TimeLineID    ThisTimeLineID = 0;

! /*
!  * Are we doing recovery from XLOG?
!  *
!  * This is only ever true in the startup process, when it's replaying WAL.
!  * It's used in functions that need to act differently when called from a
!  * redo function (e.g skip WAL logging).  To check whether the system is in
!  * recovery regardless of what process you're running in, use
!  * IsRecoveryProcessingMode().
!  */
  bool        InRecovery = false;

  /* Are we recovering using offline XLOG archives? */
  static bool InArchiveRecovery = false;

+ /*
+  * Local copy of shared RecoveryProcessingMode variable. True actually
+  * means "not known, need to check the shared state"
+  */
+ static bool LocalRecoveryProcessingMode = true;
+
  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;

***************
*** 133,139 **** static char *recoveryRestoreCommand = NULL;
  static bool recoveryTarget = false;
  static bool recoveryTargetExact = false;
  static bool recoveryTargetInclusive = true;
- static bool recoveryLogRestartpoints = false;
  static TransactionId recoveryTargetXid;
  static TimestampTz recoveryTargetTime;
  static TimestampTz recoveryLastXTime = 0;
--- 149,154 ----
***************
*** 242,250 **** static XLogRecPtr RedoRecPtr;
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint (ensures only one
!  * checkpointer at a time; currently, with all checkpoints done by the
!  * bgwriter, this is just pro forma).
   *
   *----------
   */
--- 257,264 ----
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint or restartpoint (ensures
!  * only one checkpointer at a time)
   *
   *----------
   */
***************
*** 313,318 **** typedef struct XLogCtlData
--- 327,351 ----
      int            XLogCacheBlck;    /* highest allocated xlog buffer index */
      TimeLineID    ThisTimeLineID;

+     /*
+      * SharedRecoveryProcessingMode indicates if we're still in crash or
+      * archive recovery. It's checked by IsRecoveryProcessingMode()
+      */
+     bool        SharedRecoveryProcessingMode;
+
+     /*
+      * During recovery, we keep a copy of the latest checkpoint record
+      * here. It's used by the background writer when it wants to create
+      * a restartpoint.
+      *
+      * is info_lck spinlock a bit too light-weight to protect these?
+      */
+     XLogRecPtr    lastCheckPointRecPtr;
+     CheckPoint    lastCheckPoint;
+
+     /* end+1 of the last record replayed (or being replayed) */
+     XLogRecPtr    replayEndRecPtr;
+
      slock_t        info_lck;        /* locks shared variables shown above */
  } XLogCtlData;

***************
*** 387,395 **** static XLogRecPtr ReadRecPtr;    /* start of last record read */
--- 420,435 ----
  static XLogRecPtr EndRecPtr;    /* end+1 of last record read */
  static XLogRecord *nextRecord = NULL;
  static TimeLineID lastPageTLI = 0;
+ static XLogRecPtr minRecoveryPoint; /* local copy of ControlFile->minRecoveryPoint */
+ static bool    updateMinRecoveryPoint = true;

  static bool InRedo = false;

+ /*
+  * Flag set by interrupt handlers for later service in the redo loop.
+  */
+ static volatile sig_atomic_t shutdown_requested = false;
+

  static void XLogArchiveNotify(const char *xlog);
  static void XLogArchiveNotifySeg(uint32 log, uint32 seg);
***************
*** 420,425 **** static void PreallocXlogFiles(XLogRecPtr endptr);
--- 460,466 ----
  static void RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr);
  static void ValidateXLOGDirectoryStructure(void);
  static void CleanupBackupHistory(void);
+ static void UpdateMinRecoveryPoint(XLogRecPtr lsn);
  static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode);
  static bool ValidXLOGHeader(XLogPageHeader hdr, int emode);
  static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
***************
*** 484,489 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 525,534 ----
      bool        doPageWrites;
      bool        isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);

+     /* cross-check on whether we should be here or not */
+     if (IsRecoveryProcessingMode())
+         elog(FATAL, "cannot make new WAL entries during recovery");
+
      /* info's high bits are reserved for use by me */
      if (info & XLR_INFO_MASK)
          elog(PANIC, "invalid xlog info mask %02X", info);
***************
*** 1718,1723 **** XLogSetAsyncCommitLSN(XLogRecPtr asyncCommitLSN)
--- 1763,1817 ----
  }

  /*
+  * Advance minRecoveryPoint in control file.
+  *
+  * If we crash during recovery, we must reach this point again before the
+  * database is consistent. If minRecoveryPoint is already greater than or
+  * equal to 'lsn', it is not updated.
+  */
+ static void
+ UpdateMinRecoveryPoint(XLogRecPtr lsn)
+ {
+     /* Quick check using our local copy of the variable */
+     if (!updateMinRecoveryPoint || XLByteLE(lsn, minRecoveryPoint))
+         return;
+
+     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+     /* update local copy */
+     minRecoveryPoint = ControlFile->minRecoveryPoint;
+
+     /*
+      * An invalid minRecoveryPoint means that we need to recover all the WAL,
+      * ie. crash recovery. Don't update the control file in that case.
+      */
+     if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0)
+         updateMinRecoveryPoint = false;
+     else if (XLByteLT(minRecoveryPoint, lsn))
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         /*
+          * To avoid having to update the control file too often, we update
+          * it all the way to the last record being replayed, even though 'lsn'
+          * would suffice for correctness.
+          */
+         SpinLockAcquire(&xlogctl->info_lck);
+         minRecoveryPoint = xlogctl->replayEndRecPtr;
+         SpinLockRelease(&xlogctl->info_lck);
+
+         /* update control file */
+         ControlFile->minRecoveryPoint = minRecoveryPoint;
+         UpdateControlFile();
+
+         elog(DEBUG2, "updated min recovery point to %X/%X",
+              minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff);
+     }
+     LWLockRelease(ControlFileLock);
+ }
+
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
***************
*** 1729,1737 **** XLogFlush(XLogRecPtr record)
      XLogRecPtr    WriteRqstPtr;
      XLogwrtRqst WriteRqst;

!     /* Disabled during REDO */
!     if (InRedo)
          return;

      /* Quick exit if already known flushed */
      if (XLByteLE(record, LogwrtResult.Flush))
--- 1823,1837 ----
      XLogRecPtr    WriteRqstPtr;
      XLogwrtRqst WriteRqst;

!     /*
!      * During REDO, we don't try to flush the WAL, but update minRecoveryPoint
!      * instead.
!      */
!     if (IsRecoveryProcessingMode())
!     {
!         UpdateMinRecoveryPoint(record);
          return;
+     }

      /* Quick exit if already known flushed */
      if (XLByteLE(record, LogwrtResult.Flush))
***************
*** 1818,1826 **** XLogFlush(XLogRecPtr record)
       * the bad page is encountered again during recovery then we would be
       * unable to restart the database at all!  (This scenario has actually
       * happened in the field several times with 7.1 releases. Note that we
!      * cannot get here while InRedo is true, but if the bad page is brought in
!      * and marked dirty during recovery then CreateCheckPoint will try to
!      * flush it at the end of recovery.)
       *
       * The current approach is to ERROR under normal conditions, but only
       * WARNING during recovery, so that the system can be brought up even if
--- 1918,1926 ----
       * the bad page is encountered again during recovery then we would be
       * unable to restart the database at all!  (This scenario has actually
       * happened in the field several times with 7.1 releases. Note that we
!      * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
!      * brought in and marked dirty during recovery, then a checkpoint performed
!      * at the end of recovery will try to flush it.)
       *
       * The current approach is to ERROR under normal conditions, but only
       * WARNING during recovery, so that the system can be brought up even if
***************
*** 2677,2687 **** RestoreArchivedFile(char *path, const char *xlogfname,
--- 2777,2799 ----
       * those it's a good bet we should have gotten it too.  Aborting on other
       * signals such as SIGTERM seems a good idea as well.
       *
+      * However, if we were requested to terminate, we don't really care what
+      * happened to the restore command, so we just exit cleanly. In fact,
+      * the restore command most likely received the SIGTERM too, and we don't
+      * want to complain about that.
+      *
       * Per the Single Unix Spec, shells report exit status > 128 when a called
       * command died on a signal.  Also, 126 and 127 are used to report
       * problems such as an unfindable command; treat those as fatal errors
       * too.
       */
+     if (shutdown_requested && InRedo)
+     {
+         /* XXX: Is EndRecPtr always the right value here? */
+         UpdateMinRecoveryPoint(EndRecPtr);
+         proc_exit(0);
+     }
+
      signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

      ereport(signaled ? FATAL : DEBUG2,
***************
*** 4590,4607 **** readRecoveryCommandFile(void)
              ereport(LOG,
                      (errmsg("recovery_target_inclusive = %s", tok2)));
          }
-         else if (strcmp(tok1, "log_restartpoints") == 0)
-         {
-             /*
-              * does nothing if a recovery_target is not also set
-              */
-             if (!parse_bool(tok2, &recoveryLogRestartpoints))
-                   ereport(ERROR,
-                             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                       errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
-             ereport(LOG,
-                     (errmsg("log_restartpoints = %s", tok2)));
-         }
          else
              ereport(FATAL,
                      (errmsg("unrecognized recovery parameter \"%s\"",
--- 4702,4707 ----
***************
*** 4883,4889 **** StartupXLOG(void)
      XLogRecPtr    RecPtr,
                  LastRec,
                  checkPointLoc,
!                 minRecoveryLoc,
                  EndOfLog;
      uint32        endLogId;
      uint32        endLogSeg;
--- 4983,4989 ----
      XLogRecPtr    RecPtr,
                  LastRec,
                  checkPointLoc,
!                 backupStopLoc,
                  EndOfLog;
      uint32        endLogId;
      uint32        endLogSeg;
***************
*** 4891,4896 **** StartupXLOG(void)
--- 4991,4998 ----
      uint32        freespace;
      TransactionId oldestActiveXID;

+     XLogCtl->SharedRecoveryProcessingMode = true;
+
      /*
       * Read control file and check XLOG status looks valid.
       *
***************
*** 4970,4976 **** StartupXLOG(void)
                          recoveryTargetTLI,
                          ControlFile->checkPointCopy.ThisTimeLineID)));

!     if (read_backup_label(&checkPointLoc, &minRecoveryLoc))
      {
          /*
           * When a backup_label file is present, we want to roll forward from
--- 5072,5078 ----
                          recoveryTargetTLI,
                          ControlFile->checkPointCopy.ThisTimeLineID)));

!     if (read_backup_label(&checkPointLoc, &backupStopLoc))
      {
          /*
           * When a backup_label file is present, we want to roll forward from
***************
*** 5108,5118 **** StartupXLOG(void)
          ControlFile->prevCheckPoint = ControlFile->checkPoint;
          ControlFile->checkPoint = checkPointLoc;
          ControlFile->checkPointCopy = checkPoint;
!         if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
!             ControlFile->minRecoveryPoint = minRecoveryLoc;
          ControlFile->time = (pg_time_t) time(NULL);
          UpdateControlFile();

          /*
           * If there was a backup label file, it's done its job and the info
           * has now been propagated into pg_control.  We must get rid of the
--- 5210,5232 ----
          ControlFile->prevCheckPoint = ControlFile->checkPoint;
          ControlFile->checkPoint = checkPointLoc;
          ControlFile->checkPointCopy = checkPoint;
!         if (backupStopLoc.xlogid != 0 || backupStopLoc.xrecoff != 0)
!         {
!             if (XLByteLT(ControlFile->minRecoveryPoint, backupStopLoc))
!                 ControlFile->minRecoveryPoint = backupStopLoc;
!         }
          ControlFile->time = (pg_time_t) time(NULL);
+         /* No need to hold ControlFileLock yet, we aren't up far enough */
          UpdateControlFile();

+         /* update our local copy of minRecoveryPoint */
+         minRecoveryPoint = ControlFile->minRecoveryPoint;
+
+         /*
+          * Reset pgstat data, because it may be invalid after recovery.
+          */
+         pgstat_reset_all();
+
          /*
           * If there was a backup label file, it's done its job and the info
           * has now been propagated into pg_control.  We must get rid of the
***************
*** 5157,5168 **** StartupXLOG(void)
          {
              bool        recoveryContinue = true;
              bool        recoveryApply = true;
              ErrorContextCallback errcontext;

              InRedo = true;
!             ereport(LOG,
!                     (errmsg("redo starts at %X/%X",
!                             ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));

              /*
               * main redo apply loop
--- 5271,5306 ----
          {
              bool        recoveryContinue = true;
              bool        recoveryApply = true;
+             bool        reachedMinRecoveryPoint = false;
              ErrorContextCallback errcontext;
+             /* use volatile pointer to prevent code rearrangement */
+             volatile XLogCtlData *xlogctl = XLogCtl;
+
+             /* Update shared copy of replayEndRecPtr */
+             SpinLockAcquire(&xlogctl->info_lck);
+             xlogctl->replayEndRecPtr = ReadRecPtr;
+             SpinLockRelease(&xlogctl->info_lck);

              InRedo = true;
!
!             if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0)
!                 ereport(LOG,
!                         (errmsg("redo starts at %X/%X",
!                                 ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
!             else
!                 ereport(LOG,
!                         (errmsg("redo starts at %X/%X, consistency will be reached at %X/%X",
!                         ReadRecPtr.xlogid, ReadRecPtr.xrecoff,
!                         minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
!
!             /*
!              * Let postmaster know we've started redo now.
!              *
!              * After this point, we can no longer assume that there are no
!              * other processes running concurrently.
!              */
!             if (InArchiveRecovery && IsUnderPostmaster)
!                 SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);

              /*
               * main redo apply loop
***************
*** 5189,5194 **** StartupXLOG(void)
--- 5327,5361 ----
  #endif

                  /*
+                  * Process any requests or signals received recently.
+                  */
+                 if (shutdown_requested)
+                 {
+                     /*
+                      * We were requested to exit without finishing recovery.
+                      */
+                     UpdateMinRecoveryPoint(ReadRecPtr);
+                     proc_exit(0);
+                 }
+
+                 /*
+                  * Have we reached our safe starting point? If so, we can
+                  * tell postmaster that the database is consistent now.
+                  */
+                 if (!reachedMinRecoveryPoint &&
+                      XLByteLE(minRecoveryPoint, EndRecPtr))
+                 {
+                     reachedMinRecoveryPoint = true;
+                     if (InArchiveRecovery)
+                     {
+                         ereport(LOG,
+                                 (errmsg("consistent recovery state reached")));
+                         if (IsUnderPostmaster)
+                             SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
+                     }
+                 }
+
+                 /*
                   * Have we reached our recovery target?
                   */
                  if (recoveryStopsHere(record, &recoveryApply))
***************
*** 5213,5218 **** StartupXLOG(void)
--- 5380,5390 ----
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
                  }

+                 /* Update shared copy of replayEndRecPtr */
+                 SpinLockAcquire(&xlogctl->info_lck);
+                 xlogctl->replayEndRecPtr = EndRecPtr;
+                 SpinLockRelease(&xlogctl->info_lck);
+
                  RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);

                  /* Pop the error context stack */
***************
*** 5256,5269 **** StartupXLOG(void)
       * Complain if we did not roll forward far enough to render the backup
       * dump consistent.
       */
!     if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
      {
          if (reachedStopPoint)    /* stopped because of stop request */
              ereport(FATAL,
!                     (errmsg("requested recovery stop point is before end time of backup dump")));
          else    /* ran off end of WAL */
              ereport(FATAL,
!                     (errmsg("WAL ends before end time of backup dump")));
      }

      /*
--- 5428,5441 ----
       * Complain if we did not roll forward far enough to render the backup
       * dump consistent.
       */
!     if (InRecovery && XLByteLT(EndOfLog, minRecoveryPoint))
      {
          if (reachedStopPoint)    /* stopped because of stop request */
              ereport(FATAL,
!                     (errmsg("requested recovery stop point is before consistent recovery point")));
          else    /* ran off end of WAL */
              ereport(FATAL,
!                     (errmsg("WAL ended before a consistent state was reached")));
      }

      /*
***************
*** 5358,5363 **** StartupXLOG(void)
--- 5530,5541 ----
      /* Pre-scan prepared transactions to find out the range of XIDs present */
      oldestActiveXID = PrescanPreparedTransactions();

+     /*
+      * Allow writing WAL for us. But not for other backends! That's done
+      * after writing the shutdown checkpoint and finishing recovery.
+      */
+     LocalRecoveryProcessingMode = false;
+
      if (InRecovery)
      {
          int            rmid;
***************
*** 5378,5388 **** StartupXLOG(void)
          XLogCheckInvalidPages();

          /*
-          * Reset pgstat data, because it may be invalid after recovery.
-          */
-         pgstat_reset_all();
-
-         /*
           * Perform a checkpoint to update all our recovery activity to disk.
           *
           * Note that we write a shutdown checkpoint rather than an on-line
--- 5556,5561 ----
***************
*** 5404,5415 **** StartupXLOG(void)
       */
      InRecovery = false;

      ControlFile->state = DB_IN_PRODUCTION;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();

      /* start the archive_timeout timer running */
!     XLogCtl->Write.lastSegSwitchTime = ControlFile->time;

      /* initialize shared-memory copy of latest checkpoint XID/epoch */
      XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
--- 5577,5590 ----
       */
      InRecovery = false;

+     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
      ControlFile->state = DB_IN_PRODUCTION;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();
+     LWLockRelease(ControlFileLock);

      /* start the archive_timeout timer running */
!     XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);

      /* initialize shared-memory copy of latest checkpoint XID/epoch */
      XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
***************
*** 5444,5449 **** StartupXLOG(void)
--- 5619,5663 ----
          readRecordBuf = NULL;
          readRecordBufSize = 0;
      }
+
+     /*
+      * All done. Allow others to write WAL.
+      */
+     XLogCtl->SharedRecoveryProcessingMode = false;
+ }
+
+ /*
+  * Is the system still in recovery?
+  *
+  * As a side-effect, we initialize the local TimeLineID and RedoRecPtr
+  * variables the first time we see that recovery is finished.
+  */
+ bool
+ IsRecoveryProcessingMode(void)
+ {
+     /*
+      * We check shared state each time only until we leave recovery mode.
+      * We can't re-enter recovery, so we rely on the local state variable
+      * after that.
+      */
+     if (!LocalRecoveryProcessingMode)
+         return false;
+     else
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+
+         /*
+          * Initialize TimeLineID and RedoRecPtr the first time we see that
+          * recovery is finished.
+          */
+         if (!LocalRecoveryProcessingMode)
+             InitXLOGAccess();
+
+         return LocalRecoveryProcessingMode;
+     }
  }

  /*
***************
*** 5575,5580 **** InitXLOGAccess(void)
--- 5789,5796 ----
  {
      /* ThisTimeLineID doesn't change so we need no lock to copy it */
      ThisTimeLineID = XLogCtl->ThisTimeLineID;
+     Assert(ThisTimeLineID != 0);
+
      /* Use GetRedoRecPtr to copy the RedoRecPtr safely */
      (void) GetRedoRecPtr();
  }
***************
*** 5686,5692 **** ShutdownXLOG(int code, Datum arg)
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
--- 5902,5911 ----
      ereport(LOG,
              (errmsg("shutting down")));

!     if (IsRecoveryProcessingMode())
!         CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
!     else
!         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
***************
*** 5699,5707 **** ShutdownXLOG(int code, Datum arg)
   * Log start of a checkpoint.
   */
  static void
! LogCheckpointStart(int flags)
  {
!     elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
           (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
           (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
           (flags & CHECKPOINT_FORCE) ? " force" : "",
--- 5918,5937 ----
   * Log start of a checkpoint.
   */
  static void
! LogCheckpointStart(int flags, bool restartpoint)
  {
!     char *msg;
!
!     /*
!      * XXX: This is hopelessly untranslatable. We could call gettext_noop
!      * for the main message, but what about all the flags?
!      */
!     if (restartpoint)
!         msg = "restartpoint starting:%s%s%s%s%s%s";
!     else
!         msg = "checkpoint starting:%s%s%s%s%s%s";
!
!     elog(LOG, msg,
           (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
           (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
           (flags & CHECKPOINT_FORCE) ? " force" : "",
***************
*** 5714,5720 **** LogCheckpointStart(int flags)
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(void)
  {
      long        write_secs,
                  sync_secs,
--- 5944,5950 ----
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(bool restartpoint)
  {
      long        write_secs,
                  sync_secs,
***************
*** 5737,5753 **** LogCheckpointEnd(void)
                          CheckpointStats.ckpt_sync_end_t,
                          &sync_secs, &sync_usecs);

!     elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
!          "%d transaction log file(s) added, %d removed, %d recycled; "
!          "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!          CheckpointStats.ckpt_bufs_written,
!          (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!          CheckpointStats.ckpt_segs_added,
!          CheckpointStats.ckpt_segs_removed,
!          CheckpointStats.ckpt_segs_recycled,
!          write_secs, write_usecs / 1000,
!          sync_secs, sync_usecs / 1000,
!          total_secs, total_usecs / 1000);
  }

  /*
--- 5967,5992 ----
                          CheckpointStats.ckpt_sync_end_t,
                          &sync_secs, &sync_usecs);

!     if (restartpoint)
!         elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
!              "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!              CheckpointStats.ckpt_bufs_written,
!              (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!              write_secs, write_usecs / 1000,
!              sync_secs, sync_usecs / 1000,
!              total_secs, total_usecs / 1000);
!     else
!         elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
!              "%d transaction log file(s) added, %d removed, %d recycled; "
!              "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!              CheckpointStats.ckpt_bufs_written,
!              (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!              CheckpointStats.ckpt_segs_added,
!              CheckpointStats.ckpt_segs_removed,
!              CheckpointStats.ckpt_segs_recycled,
!              write_secs, write_usecs / 1000,
!              sync_secs, sync_usecs / 1000,
!              total_secs, total_usecs / 1000);
  }

  /*
***************
*** 5778,5788 **** CreateCheckPoint(int flags)
      TransactionId *inCommitXids;
      int            nInCommit;

      /*
       * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
-      * (This is just pro forma, since in the present system structure there is
-      * only one process that is allowed to issue checkpoints at any given
-      * time.)
       */
      LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

--- 6017,6028 ----
      TransactionId *inCommitXids;
      int            nInCommit;

+     /* shouldn't happen */
+     if (IsRecoveryProcessingMode())
+         elog(ERROR, "can't create a checkpoint during recovery");
+
      /*
       * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
       */
      LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

***************
*** 5803,5811 **** CreateCheckPoint(int flags)
--- 6043,6053 ----

      if (shutdown)
      {
+         LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
          ControlFile->state = DB_SHUTDOWNING;
          ControlFile->time = (pg_time_t) time(NULL);
          UpdateControlFile();
+         LWLockRelease(ControlFileLock);
      }

      /*
***************
*** 5909,5915 **** CreateCheckPoint(int flags)
       * to log anything if we decided to skip the checkpoint.
       */
      if (log_checkpoints)
!         LogCheckpointStart(flags);

      TRACE_POSTGRESQL_CHECKPOINT_START(flags);

--- 6151,6157 ----
       * to log anything if we decided to skip the checkpoint.
       */
      if (log_checkpoints)
!         LogCheckpointStart(flags, false);

      TRACE_POSTGRESQL_CHECKPOINT_START(flags);

***************
*** 6068,6074 **** CreateCheckPoint(int flags)
       * Truncate pg_subtrans if possible.  We can throw away all data before
       * the oldest XMIN of any running transaction.    No future transaction will
       * attempt to reference any pg_subtrans entry older than that (see Asserts
!      * in subtrans.c).    During recovery, though, we mustn't do this because
       * StartupSUBTRANS hasn't been called yet.
       */
      if (!InRecovery)
--- 6310,6316 ----
       * Truncate pg_subtrans if possible.  We can throw away all data before
       * the oldest XMIN of any running transaction.    No future transaction will
       * attempt to reference any pg_subtrans entry older than that (see Asserts
!      * in subtrans.c).  During recovery, though, we mustn't do this because
       * StartupSUBTRANS hasn't been called yet.
       */
      if (!InRecovery)
***************
*** 6076,6082 **** CreateCheckPoint(int flags)

      /* All real work is done, but log before releasing lock. */
      if (log_checkpoints)
!         LogCheckpointEnd();

          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
--- 6318,6324 ----

      /* All real work is done, but log before releasing lock. */
      if (log_checkpoints)
!         LogCheckpointEnd(false);

          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
***************
*** 6104,6135 **** CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
  }

  /*
!  * Set a recovery restart point if appropriate
!  *
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
!  * to establish a point from which recovery can roll forward without
!  * replaying the entire recovery log.  This function is called each time
!  * a checkpoint record is read from XLOG; it must determine whether a
!  * restartpoint is needed or not.
   */
  static void
  RecoveryRestartPoint(const CheckPoint *checkPoint)
  {
-     int            elapsed_secs;
      int            rmid;
!
!     /*
!      * Do nothing if the elapsed time since the last restartpoint is less than
!      * half of checkpoint_timeout.    (We use a value less than
!      * checkpoint_timeout so that variations in the timing of checkpoints on
!      * the master, or speed of transmission of WAL segments to a slave, won't
!      * make the slave skip a restartpoint once it's synced with the master.)
!      * Checking true elapsed time keeps us from doing restartpoints too often
!      * while rapidly scanning large amounts of WAL.
!      */
!     elapsed_secs = (pg_time_t) time(NULL) - ControlFile->time;
!     if (elapsed_secs < CheckPointTimeout / 2)
!         return;

      /*
       * Is it safe to checkpoint?  We must ask each of the resource managers
--- 6346,6362 ----
  }

  /*
!  * This is used during WAL recovery to establish a point from which recovery
!  * can roll forward without replaying the entire recovery log.  This function
!  * is called each time a checkpoint record is read from XLOG.  The record is
!  * stored in shared memory, so that it can be used as a restartpoint later on.
   */
  static void
  RecoveryRestartPoint(const CheckPoint *checkPoint)
  {
      int            rmid;
!     /* use volatile pointer to prevent code rearrangement */
!     volatile XLogCtlData *xlogctl = XLogCtl;

      /*
       * Is it safe to checkpoint?  We must ask each of the resource managers
***************
*** 6151,6178 **** RecoveryRestartPoint(const CheckPoint *checkPoint)
      }

      /*
!      * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);

      /*
!      * Update pg_control so that any subsequent crash will restart from this
!      * checkpoint.    Note: ReadRecPtr gives the XLOG address of the checkpoint
!      * record itself.
       */
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
!     ControlFile->checkPoint = ReadRecPtr;
!     ControlFile->checkPointCopy = *checkPoint;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();

!     ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
              (errmsg("recovery restart point at %X/%X",
!                     checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
      if (recoveryLastXTime)
!         ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
!                 (errmsg("last completed transaction was at log time %s",
!                         timestamptz_to_str(recoveryLastXTime))));
  }

  /*
--- 6378,6487 ----
      }

      /*
!      * Copy the checkpoint record to shared memory, so that bgwriter can
!      * use it the next time it wants to perform a restartpoint.
!      */
!     SpinLockAcquire(&xlogctl->info_lck);
!     xlogctl->lastCheckPointRecPtr = ReadRecPtr;
!     memcpy(&XLogCtl->lastCheckPoint, checkPoint, sizeof(CheckPoint));
!     SpinLockRelease(&xlogctl->info_lck);
! }
!
! /*
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
!  * to establish a point from which recovery can roll forward without
!  * replaying the entire recovery log.
!  */
! void
! CreateRestartPoint(int flags)
! {
!     XLogRecPtr lastCheckPointRecPtr;
!     CheckPoint lastCheckPoint;
!     /* use volatile pointer to prevent code rearrangement */
!     volatile XLogCtlData *xlogctl = XLogCtl;
!
!     /*
!      * Acquire CheckpointLock to ensure only one restartpoint or checkpoint
!      * happens at a time.
       */
!     LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
!
!     /* Get a local copy of the last checkpoint record. */
!     SpinLockAcquire(&xlogctl->info_lck);
!     lastCheckPointRecPtr = xlogctl->lastCheckPointRecPtr;
!     memcpy(&lastCheckPoint, &XLogCtl->lastCheckPoint, sizeof(CheckPoint));
!     SpinLockRelease(&xlogctl->info_lck);

      /*
!      * If the last checkpoint record we've replayed is already our last
!      * restartpoint, we're done.
       */
+     if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
+     {
+         ereport(DEBUG2,
+                 (errmsg("skipping restartpoint, already performed at %X/%X",
+                         lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
+         LWLockRelease(CheckpointLock);
+         return;
+     }
+
+     /*
+      * Check that we're still in recovery mode. It's ok if we exit recovery
+      * mode after this check, the restart point is valid anyway.
+      */
+     if (!IsRecoveryProcessingMode())
+     {
+         ereport(DEBUG2,
+                 (errmsg("skipping restartpoint, recovery has already ended")));
+         LWLockRelease(CheckpointLock);
+         return;
+     }
+
+     if (log_checkpoints)
+     {
+         /*
+          * Prepare to accumulate statistics.
+          */
+         MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+         CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+
+         LogCheckpointStart(flags, true);
+     }
+
+     CheckPointGuts(lastCheckPoint.redo, flags);
+
+     /*
+      * Update pg_control, using current time
+      */
+     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
!     ControlFile->checkPoint = lastCheckPointRecPtr;
!     ControlFile->checkPointCopy = lastCheckPoint;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();
+     LWLockRelease(ControlFileLock);
+
+     /*
+      * Currently, there is no need to truncate pg_subtrans during recovery.
+      * If we did that, we would need to have called StartupSUBTRANS()
+      * already, and then TruncateSUBTRANS() would go here.
+      */
+
+     /* All real work is done, but log before releasing lock. */
+     if (log_checkpoints)
+         LogCheckpointEnd(true);

!     ereport((log_checkpoints ? LOG : DEBUG2),
              (errmsg("recovery restart point at %X/%X",
!                     lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
!
      if (recoveryLastXTime)
!         ereport((log_checkpoints ? LOG : DEBUG2),
!             (errmsg("last completed transaction was at log time %s",
!                     timestamptz_to_str(recoveryLastXTime))));
!
!     LWLockRelease(CheckpointLock);
  }

  /*
***************
*** 6238,6243 **** RequestXLogSwitch(void)
--- 6547,6555 ----

  /*
   * XLOG resource manager's routines
+  *
+  * Definitions of message info are in include/catalog/pg_control.h,
+  * though not all messages relate to control file processing.
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
***************
*** 6284,6292 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
                                   (int) checkPoint.ThisTimeLineID))
                  ereport(PANIC,
                          (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
!                                 checkPoint.ThisTimeLineID, ThisTimeLineID)));
!             /* Following WAL records should be run with new TLI */
!             ThisTimeLineID = checkPoint.ThisTimeLineID;
          }

          RecoveryRestartPoint(&checkPoint);
--- 6596,6604 ----
                                   (int) checkPoint.ThisTimeLineID))
                  ereport(PANIC,
                          (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
!                                checkPoint.ThisTimeLineID, ThisTimeLineID)));
!            /* Following WAL records should be run with new TLI */
!            ThisTimeLineID = checkPoint.ThisTimeLineID;
          }

          RecoveryRestartPoint(&checkPoint);
***************
*** 7227,7229 **** CancelBackup(void)
--- 7539,7627 ----
      }
  }

+ /* ------------------------------------------------------
+  *  Startup Process main entry point and signal handlers
+  * ------------------------------------------------------
+  */
+
+ /*
+  * startupproc_quickdie() occurs when signalled SIGQUIT by the postmaster.
+  *
+  * Some backend has bought the farm,
+  * so we need to stop what we're doing and exit.
+  */
+ static void
+ startupproc_quickdie(SIGNAL_ARGS)
+ {
+     PG_SETMASK(&BlockSig);
+
+     /*
+      * DO NOT proc_exit() -- we're here because shared memory may be
+      * corrupted, so we don't want to try to clean up our transaction. Just
+      * nail the windows shut and get out of town.
+      *
+      * Note we do exit(2) not exit(0).    This is to force the postmaster into a
+      * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+      * backend.  This is necessary precisely because we don't clean up our
+      * shared memory state.
+      */
+     exit(2);
+ }
+
+
+ /* SIGTERM: set flag to abort redo and exit */
+ static void
+ StartupProcShutdownHandler(SIGNAL_ARGS)
+ {
+     shutdown_requested = true;
+ }
+
+ /* Main entry point for startup process */
+ void
+ StartupProcessMain(void)
+ {
+     /*
+      * If possible, make this process a group leader, so that the postmaster
+      * can signal any child processes too.
+      */
+ #ifdef HAVE_SETSID
+     if (setsid() < 0)
+         elog(FATAL, "setsid() failed: %m");
+ #endif
+
+     /*
+      * Properly accept or ignore signals the postmaster might send us
+      */
+     pqsignal(SIGHUP, SIG_IGN);    /* ignore config file updates */
+     pqsignal(SIGINT, SIG_IGN);        /* ignore query cancel */
+     pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
+     pqsignal(SIGQUIT, startupproc_quickdie);        /* hard crash time */
+     pqsignal(SIGALRM, SIG_IGN);
+     pqsignal(SIGPIPE, SIG_IGN);
+     pqsignal(SIGUSR1, SIG_IGN);
+     pqsignal(SIGUSR2, SIG_IGN);
+
+     /*
+      * Reset some signals that are accepted by postmaster but not here
+      */
+     pqsignal(SIGCHLD, SIG_DFL);
+     pqsignal(SIGTTIN, SIG_DFL);
+     pqsignal(SIGTTOU, SIG_DFL);
+     pqsignal(SIGCONT, SIG_DFL);
+     pqsignal(SIGWINCH, SIG_DFL);
+
+     /*
+      * Unblock signals (they were blocked when the postmaster forked us)
+      */
+     PG_SETMASK(&UnBlockSig);
+
+     StartupXLOG();
+
+     BuildFlatFiles(false);
+
+     /* Let postmaster know that startup is finished */
+     SendPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED);
+
+     /* exit normally */
+     proc_exit(0);
+ }
*** src/backend/bootstrap/bootstrap.c
--- src/backend/bootstrap/bootstrap.c
***************
*** 37,43 ****
  #include "storage/proc.h"
  #include "tcop/tcopprot.h"
  #include "utils/builtins.h"
- #include "utils/flatfiles.h"
  #include "utils/fmgroids.h"
  #include "utils/memutils.h"
  #include "utils/ps_status.h"
--- 37,42 ----
***************
*** 416,429 **** AuxiliaryProcessMain(int argc, char *argv[])
              proc_exit(1);        /* should never return */

          case StartupProcess:
!             bootstrap_signals();
!             StartupXLOG();
!             BuildFlatFiles(false);
!             proc_exit(0);        /* startup done */

          case BgWriterProcess:
              /* don't set signals, bgwriter has its own agenda */
-             InitXLOGAccess();
              BackgroundWriterMain();
              proc_exit(1);        /* should never return */

--- 415,426 ----
              proc_exit(1);        /* should never return */

          case StartupProcess:
!             /* don't set signals, startup process has its own agenda */
!             StartupProcessMain();
!             proc_exit(1);        /* should never return */

          case BgWriterProcess:
              /* don't set signals, bgwriter has its own agenda */
              BackgroundWriterMain();
              proc_exit(1);        /* should never return */

*** src/backend/postmaster/bgwriter.c
--- src/backend/postmaster/bgwriter.c
***************
*** 49,54 ****
--- 49,55 ----
  #include <unistd.h>

  #include "access/xlog_internal.h"
+ #include "catalog/pg_control.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
***************
*** 197,202 **** BackgroundWriterMain(void)
--- 198,204 ----
  {
      sigjmp_buf    local_sigjmp_buf;
      MemoryContext bgwriter_context;
+     bool        BgWriterRecoveryMode = true;

      BgWriterShmem->bgwriter_pid = MyProcPid;
      am_bg_writer = true;
***************
*** 418,423 **** BackgroundWriterMain(void)
--- 420,439 ----
          }

          /*
+          * Check if we've exited recovery. We do this after determining
+          * whether to perform a checkpoint or not, to be sure that we
+          * perform a real checkpoint and not a restartpoint, if someone
+          * requested a checkpoint immediately after exiting recovery. And
+          * we must have the right TimeLineID when we perform a checkpoint;
+          * IsRecoveryProcessingMode() initializes that as a side-effect.
+          */
+         if (BgWriterRecoveryMode && !IsRecoveryProcessingMode())
+         {
+             elog(DEBUG1, "bgwriter changing from recovery to normal mode");
+             BgWriterRecoveryMode = false;
+         }
+
+         /*
           * Do a checkpoint if requested, otherwise do one cycle of
           * dirty-buffer writing.
           */
***************
*** 444,450 **** BackgroundWriterMain(void)
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if ((flags & CHECKPOINT_CAUSE_XLOG) &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
--- 460,467 ----
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (!BgWriterRecoveryMode &&
!                 (flags & CHECKPOINT_CAUSE_XLOG) &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
***************
*** 455,468 **** BackgroundWriterMain(void)
               * Initialize bgwriter-private variables used during checkpoint.
               */
              ckpt_active = true;
!             ckpt_start_recptr = GetInsertRecPtr();
              ckpt_start_time = now;
              ckpt_cached_elapsed = 0;

              /*
               * Do the checkpoint.
               */
!             CreateCheckPoint(flags);

              /*
               * After any checkpoint, close all smgr files.    This is so we
--- 472,489 ----
               * Initialize bgwriter-private variables used during checkpoint.
               */
              ckpt_active = true;
!             if (!BgWriterRecoveryMode)
!                 ckpt_start_recptr = GetInsertRecPtr();
              ckpt_start_time = now;
              ckpt_cached_elapsed = 0;

              /*
               * Do the checkpoint.
               */
!             if (!BgWriterRecoveryMode)
!                 CreateCheckPoint(flags);
!             else
!                 CreateRestartPoint(flags);

              /*
               * After any checkpoint, close all smgr files.    This is so we
***************
*** 507,513 **** CheckArchiveTimeout(void)
      pg_time_t    now;
      pg_time_t    last_time;

!     if (XLogArchiveTimeout <= 0)
          return;

      now = (pg_time_t) time(NULL);
--- 528,534 ----
      pg_time_t    now;
      pg_time_t    last_time;

!     if (XLogArchiveTimeout <= 0 || IsRecoveryProcessingMode())
          return;

      now = (pg_time_t) time(NULL);
***************
*** 586,592 **** BgWriterNap(void)
          (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
              break;
          pg_usleep(1000000L);
!         AbsorbFsyncRequests();
          udelay -= 1000000L;
      }

--- 607,614 ----
          (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
              break;
          pg_usleep(1000000L);
!         if (!IsRecoveryProcessingMode())
!             AbsorbFsyncRequests();
          udelay -= 1000000L;
      }

***************
*** 714,729 **** IsCheckpointOnSchedule(double progress)
       * However, it's good enough for our purposes, we're only calculating an
       * estimate anyway.
       */
!     recptr = GetInsertRecPtr();
!     elapsed_xlogs =
!         (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
!          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
!         CheckPointSegments;
!
!     if (progress < elapsed_xlogs)
      {
!         ckpt_cached_elapsed = elapsed_xlogs;
!         return false;
      }

      /*
--- 736,754 ----
       * However, it's good enough for our purposes, we're only calculating an
       * estimate anyway.
       */
!     if (!IsRecoveryProcessingMode())
      {
!         recptr = GetInsertRecPtr();
!         elapsed_xlogs =
!             (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
!              ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
!             CheckPointSegments;
!
!         if (progress < elapsed_xlogs)
!         {
!             ckpt_cached_elapsed = elapsed_xlogs;
!             return false;
!         }
      }

      /*
*** src/backend/postmaster/postmaster.c
--- src/backend/postmaster/postmaster.c
***************
*** 225,235 **** static pid_t StartupPID = 0,
--- 225,262 ----
  static int    Shutdown = NoShutdown;

  static bool FatalError = false; /* T if recovering from backend crash */
+ static bool RecoveryError = false; /* T if recovery failed */
+
+ /* State of WAL redo */
+ #define            NoRecovery            0
+ #define            RecoveryStarted        1
+ #define            RecoveryConsistent    2
+ #define            RecoveryCompleted    3
+
+ static int    RecoveryStatus = NoRecovery;

  /*
   * We use a simple state machine to control startup, shutdown, and
   * crash recovery (which is rather like shutdown followed by startup).
   *
+  * After doing all the postmaster initialization work, we enter PM_STARTUP
+  * state and the startup process is launched. The startup process begins by
+  * reading the control file and other preliminary initialization steps. When
+  * it's ready to start WAL redo, it signals postmaster, and we switch to
+  * PM_RECOVERY phase. The background writer is launched, while the startup
+  * process continues applying WAL.
+  *
+  * After reaching a consistent point in WAL redo, startup process signals
+  * us again, and we switch to PM_RECOVERY_CONSISTENT phase. There's currently
+  * no difference between PM_RECOVERY and PM_RECOVERY_CONSISTENT, but we
+  * could start accepting connections to perform read-only queries at this
+  * point, if we had the infrastructure to do that.
+  *
+  * When the WAL redo is finished, the startup process signals us a third
+  * time, and we switch to PM_RUN state. The startup process can also skip the
+  * recovery and consistent recovery phases altogether, as it will during
+  * normal startup when there's no recovery to be done, for example.
+  *
   * Normal child backends can only be launched when we are in PM_RUN state.
   * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
   * In other states we handle connection requests by launching "dead_end"
***************
*** 245,259 **** static bool FatalError = false; /* T if recovering from backend crash */
   *
   * Notice that this state variable does not distinguish *why* we entered
   * states later than PM_RUN --- Shutdown and FatalError must be consulted
!  * to find that out.  FatalError is never true in PM_RUN state, nor in
!  * PM_SHUTDOWN states (because we don't enter those states when trying to
!  * recover from a crash).  It can be true in PM_STARTUP state, because we
!  * don't clear it until we've successfully recovered.
   */
  typedef enum
  {
      PM_INIT,                    /* postmaster starting */
      PM_STARTUP,                    /* waiting for startup subprocess */
      PM_RUN,                        /* normal "database is alive" state */
      PM_WAIT_BACKUP,                /* waiting for online backup mode to end */
      PM_WAIT_BACKENDS,            /* waiting for live backends to exit */
--- 272,288 ----
   *
   * Notice that this state variable does not distinguish *why* we entered
   * states later than PM_RUN --- Shutdown and FatalError must be consulted
!  * to find that out.  FatalError is never true in PM_RECOVERY_* or PM_RUN
!  * states, nor in PM_SHUTDOWN states (because we don't enter those states
!  * when trying to recover from a crash).  It can be true in PM_STARTUP state,
!  * because we don't clear it until we've successfully started WAL redo.
   */
  typedef enum
  {
      PM_INIT,                    /* postmaster starting */
      PM_STARTUP,                    /* waiting for startup subprocess */
+     PM_RECOVERY,                /* in recovery mode */
+     PM_RECOVERY_CONSISTENT,        /* consistent recovery mode */
      PM_RUN,                        /* normal "database is alive" state */
      PM_WAIT_BACKUP,                /* waiting for online backup mode to end */
      PM_WAIT_BACKENDS,            /* waiting for live backends to exit */
***************
*** 307,312 **** static void pmdie(SIGNAL_ARGS);
--- 336,342 ----
  static void reaper(SIGNAL_ARGS);
  static void sigusr1_handler(SIGNAL_ARGS);
  static void dummy_handler(SIGNAL_ARGS);
+ static void CheckRecoverySignals(void);
  static void CleanupBackend(int pid, int exitstatus);
  static void HandleChildCrash(int pid, int exitstatus, const char *procname);
  static void LogChildExit(int lev, const char *procname,
***************
*** 1302,1308 **** ServerLoop(void)
           * state that prevents it, start one.  It doesn't matter if this
           * fails, we'll just try again later.
           */
!         if (BgWriterPID == 0 && pmState == PM_RUN)
              BgWriterPID = StartBackgroundWriter();

          /*
--- 1332,1340 ----
           * state that prevents it, start one.  It doesn't matter if this
           * fails, we'll just try again later.
           */
!         if (BgWriterPID == 0 &&
!             (pmState == PM_RUN || pmState == PM_RECOVERY ||
!              pmState == PM_RECOVERY_CONSISTENT))
              BgWriterPID = StartBackgroundWriter();

          /*
***************
*** 1752,1758 **** canAcceptConnections(void)
              return CAC_WAITBACKUP;    /* allow superusers only */
          if (Shutdown > NoShutdown)
              return CAC_SHUTDOWN;    /* shutdown is pending */
!         if (pmState == PM_STARTUP && !FatalError)
              return CAC_STARTUP; /* normal startup */
          return CAC_RECOVERY;    /* else must be crash recovery */
      }
--- 1784,1793 ----
              return CAC_WAITBACKUP;    /* allow superusers only */
          if (Shutdown > NoShutdown)
              return CAC_SHUTDOWN;    /* shutdown is pending */
!         if (!FatalError &&
!             (pmState == PM_STARTUP ||
!              pmState == PM_RECOVERY ||
!              pmState == PM_RECOVERY_CONSISTENT))
              return CAC_STARTUP; /* normal startup */
          return CAC_RECOVERY;    /* else must be crash recovery */
      }
***************
*** 1982,1988 **** pmdie(SIGNAL_ARGS)
              ereport(LOG,
                      (errmsg("received smart shutdown request")));

!             if (pmState == PM_RUN)
              {
                  /* autovacuum workers are told to shut down immediately */
                  SignalAutovacWorkers(SIGTERM);
--- 2017,2023 ----
              ereport(LOG,
                      (errmsg("received smart shutdown request")));

!             if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
              {
                  /* autovacuum workers are told to shut down immediately */
                  SignalAutovacWorkers(SIGTERM);
***************
*** 2019,2025 **** pmdie(SIGNAL_ARGS)

              if (StartupPID != 0)
                  signal_child(StartupPID, SIGTERM);
!             if (pmState == PM_RUN || pmState == PM_WAIT_BACKUP)
              {
                  ereport(LOG,
                          (errmsg("aborting any active transactions")));
--- 2054,2067 ----

              if (StartupPID != 0)
                  signal_child(StartupPID, SIGTERM);
!             if (pmState == PM_RECOVERY)
!             {
!                 /* only bgwriter is active in this state */
!                 pmState = PM_WAIT_BACKENDS;
!             }
!             if (pmState == PM_RUN ||
!                 pmState == PM_WAIT_BACKUP ||
!                 pmState == PM_RECOVERY_CONSISTENT)
              {
                  ereport(LOG,
                          (errmsg("aborting any active transactions")));
***************
*** 2116,2125 **** reaper(SIGNAL_ARGS)
          if (pid == StartupPID)
          {
              StartupPID = 0;
-             Assert(pmState == PM_STARTUP);

!             /* FATAL exit of startup is treated as catastrophic */
!             if (!EXIT_STATUS_0(exitstatus))
              {
                  LogChildExit(LOG, _("startup process"),
                               pid, exitstatus);
--- 2158,2179 ----
          if (pid == StartupPID)
          {
              StartupPID = 0;

!             /*
!              * Check if we've received a signal from the startup process
!              * first. This can change pmState. If the startup process sends
!              * a signal, and exits immediately after that, we might not have
!              * processed the signal yet, and we need to know if it completed
!              * recovery before exiting.
!              */
!             CheckRecoverySignals();
!
!             /*
!              * Unexpected exit of startup process (including FATAL exit)
!              * during PM_STARTUP is treated as catastrophic. There are no
!              * other processes running yet.
!              */
!             if (pmState == PM_STARTUP)
              {
                  LogChildExit(LOG, _("startup process"),
                               pid, exitstatus);
***************
*** 2127,2186 **** reaper(SIGNAL_ARGS)
                  (errmsg("aborting startup due to startup process failure")));
                  ExitPostmaster(1);
              }
-
              /*
!              * Startup succeeded - we are done with system startup or
!              * recovery.
               */
!             FatalError = false;
!
!             /*
!              * Go to shutdown mode if a shutdown request was pending.
!              */
!             if (Shutdown > NoShutdown)
              {
!                 pmState = PM_WAIT_BACKENDS;
!                 /* PostmasterStateMachine logic does the rest */
                  continue;
              }
-
              /*
!              * Otherwise, commence normal operations.
!              */
!             pmState = PM_RUN;
!
!             /*
!              * Load the flat authorization file into postmaster's cache. The
!              * startup process has recomputed this from the database contents,
!              * so we wait till it finishes before loading it.
!              */
!             load_role();
!
!             /*
!              * Crank up the background writer.    It doesn't matter if this
!              * fails, we'll just try again later.
               */
!             Assert(BgWriterPID == 0);
!             BgWriterPID = StartBackgroundWriter();
!
!             /*
!              * Likewise, start other special children as needed.  In a restart
!              * situation, some of them may be alive already.
!              */
!             if (WalWriterPID == 0)
!                 WalWriterPID = StartWalWriter();
!             if (AutoVacuumingActive() && AutoVacPID == 0)
!                 AutoVacPID = StartAutoVacLauncher();
!             if (XLogArchivingActive() && PgArchPID == 0)
!                 PgArchPID = pgarch_start();
!             if (PgStatPID == 0)
!                 PgStatPID = pgstat_start();
!
!             /* at this point we are really open for business */
!             ereport(LOG,
!                  (errmsg("database system is ready to accept connections")));
!
!             continue;
          }

          /*
--- 2181,2210 ----
                  (errmsg("aborting startup due to startup process failure")));
                  ExitPostmaster(1);
              }
              /*
!              * Any unexpected exit (including FATAL exit) of the startup
!              * process is treated as a crash, except that we don't want
!              * to reinitialize.
               */
!             if (!EXIT_STATUS_0(exitstatus))
              {
!                 RecoveryError = true;
!                 HandleChildCrash(pid, exitstatus,
!                                  _("startup process"));
                  continue;
              }
              /*
!              * Startup process exited normally, but didn't finish recovery.
!              * This can happen if someone other than the postmaster kills
!              * the startup process with SIGTERM. Treat it like a crash.
               */
!             if (pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
!             {
!                 RecoveryError = true;
!                 HandleChildCrash(pid, exitstatus,
!                                  _("startup process"));
!                 continue;
!             }
          }

          /*
***************
*** 2443,2448 **** HandleChildCrash(int pid, int exitstatus, const char *procname)
--- 2467,2484 ----
          }
      }

+     /* Take care of the startup process too */
+     if (pid == StartupPID)
+         StartupPID = 0;
+     else if (StartupPID != 0 && !FatalError)
+     {
+         ereport(DEBUG2,
+                 (errmsg_internal("sending %s to process %d",
+                                  (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                  (int) StartupPID)));
+         signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
+     }
+
      /* Take care of the bgwriter too */
      if (pid == BgWriterPID)
          BgWriterPID = 0;
***************
*** 2514,2520 **** HandleChildCrash(int pid, int exitstatus, const char *procname)

      FatalError = true;
      /* We now transit into a state of waiting for children to die */
!     if (pmState == PM_RUN ||
          pmState == PM_WAIT_BACKUP ||
          pmState == PM_SHUTDOWN)
          pmState = PM_WAIT_BACKENDS;
--- 2550,2558 ----

      FatalError = true;
      /* We now transit into a state of waiting for children to die */
!     if (pmState == PM_RECOVERY ||
!         pmState == PM_RECOVERY_CONSISTENT ||
!         pmState == PM_RUN ||
          pmState == PM_WAIT_BACKUP ||
          pmState == PM_SHUTDOWN)
          pmState = PM_WAIT_BACKENDS;
***************
*** 2582,2587 **** LogChildExit(int lev, const char *procname, int pid, int exitstatus)
--- 2620,2746 ----
  static void
  PostmasterStateMachine(void)
  {
+     /* Startup states */
+
+     if (pmState == PM_STARTUP && RecoveryStatus > NoRecovery)
+     {
+         /* WAL redo has started. We're out of reinitialization. */
+         FatalError = false;
+
+         /*
+          * Go to shutdown mode if a shutdown request was pending.
+          */
+         if (Shutdown > NoShutdown)
+         {
+             pmState = PM_WAIT_BACKENDS;
+             /* PostmasterStateMachine logic does the rest */
+         }
+         else
+         {
+             /*
+              * Crank up the background writer.    It doesn't matter if this
+              * fails, we'll just try again later.
+              */
+             Assert(BgWriterPID == 0);
+             BgWriterPID = StartBackgroundWriter();
+
+             pmState = PM_RECOVERY;
+         }
+     }
+     if (pmState == PM_RECOVERY && RecoveryStatus >= RecoveryConsistent)
+     {
+         /*
+          * Go to shutdown mode if a shutdown request was pending.
+          */
+         if (Shutdown > NoShutdown)
+         {
+             pmState = PM_WAIT_BACKENDS;
+             /* PostmasterStateMachine logic does the rest */
+         }
+         else
+         {
+             /*
+              * Startup process has reached a consistent point in WAL
+              * redo. FatalError was already reset when redo started.
+              */
+             pmState = PM_RECOVERY_CONSISTENT;
+
+             /*
+              * Load the flat authorization file into postmaster's cache. The
+              * startup process won't have recomputed this from the database yet,
+              * so it may change following recovery.
+              */
+             load_role();
+
+             /*
+              * Likewise, start other special children as needed.
+              */
+             Assert(PgStatPID == 0);
+             PgStatPID = pgstat_start();
+
+             /* XXX at this point we could accept read-only connections */
+             ereport(DEBUG1,
+                  (errmsg("database system is in consistent recovery mode")));
+         }
+     }
+     if ((pmState == PM_RECOVERY ||
+          pmState == PM_RECOVERY_CONSISTENT ||
+          pmState == PM_STARTUP) &&
+         RecoveryStatus == RecoveryCompleted)
+     {
+         /*
+          * Startup succeeded.
+          *
+          * Go to shutdown mode if a shutdown request was pending.
+          */
+         if (Shutdown > NoShutdown)
+         {
+             pmState = PM_WAIT_BACKENDS;
+             /* PostmasterStateMachine logic does the rest */
+         }
+         else
+         {
+             /*
+              * Otherwise, commence normal operations.
+              */
+             pmState = PM_RUN;
+
+             /*
+              * Load the flat authorization file into postmaster's cache. The
+              * startup process has recomputed this from the database contents,
+              * so we wait till it finishes before loading it.
+              */
+             load_role();
+
+             /*
+              * Crank up the background writer, if we didn't do that already
+              * when we entered recovery phase.  It doesn't matter if this
+              * fails, we'll just try again later.
+              */
+             if (BgWriterPID == 0)
+                 BgWriterPID = StartBackgroundWriter();
+
+             /*
+              * Likewise, start other special children as needed.  In a restart
+              * situation, some of them may be alive already.
+              */
+             if (WalWriterPID == 0)
+                 WalWriterPID = StartWalWriter();
+             if (AutoVacuumingActive() && AutoVacPID == 0)
+                 AutoVacPID = StartAutoVacLauncher();
+             if (XLogArchivingActive() && PgArchPID == 0)
+                 PgArchPID = pgarch_start();
+             if (PgStatPID == 0)
+                 PgStatPID = pgstat_start();
+
+             /* at this point we are really open for business */
+             ereport(LOG,
+                 (errmsg("database system is ready to accept connections")));
+         }
+     }
+
+     /* Shutdown states */
+
      if (pmState == PM_WAIT_BACKUP)
      {
          /*
***************
*** 2723,2728 **** PostmasterStateMachine(void)
--- 2882,2896 ----
      }

      /*
+      * If recovery failed, wait for all non-syslogger children to exit,
+      * and then exit postmaster. We don't try to reinitialize when recovery
+      * fails, because more than likely it will just fail again and we will
+      * keep trying forever.
+      */
+     if (RecoveryError && pmState == PM_NO_CHILDREN)
+         ExitPostmaster(1);
+
+     /*
       * If we need to recover from a crash, wait for all non-syslogger
       * children to exit, then reset shmem and StartupDataBase.
       */
***************
*** 2734,2739 **** PostmasterStateMachine(void)
--- 2902,2909 ----
          shmem_exit(1);
          reset_shared(PostPortNumber);

+         RecoveryStatus = NoRecovery;
+
          StartupPID = StartupDataBase();
          Assert(StartupPID != 0);
          pmState = PM_STARTUP;
***************
*** 3838,3843 **** ExitPostmaster(int status)
--- 4008,4044 ----
  }

  /*
+  * common code used in sigusr1_handler() and reaper() to handle
+  * recovery-related signals from startup process
+  */
+ static void
+ CheckRecoverySignals(void)
+ {
+     bool changed = false;
+
+     if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED))
+     {
+         Assert(pmState == PM_STARTUP);
+
+         RecoveryStatus = RecoveryStarted;
+         changed = true;
+     }
+     if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT))
+     {
+         RecoveryStatus = RecoveryConsistent;
+         changed = true;
+     }
+     if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED))
+     {
+         RecoveryStatus = RecoveryCompleted;
+         changed = true;
+     }
+
+     if (changed)
+         PostmasterStateMachine();
+ }
+
+ /*
   * sigusr1_handler - handle signal conditions from child processes
   */
  static void
***************
*** 3847,3852 **** sigusr1_handler(SIGNAL_ARGS)
--- 4048,4055 ----

      PG_SETMASK(&BlockSig);

+     CheckRecoverySignals();
+
      if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
      {
          /*
*** src/backend/storage/buffer/README
--- src/backend/storage/buffer/README
***************
*** 268,270 **** out (and anyone else who flushes buffer contents to disk must do so too).
--- 268,279 ----
  This ensures that the page image transferred to disk is reasonably consistent.
  We might miss a hint-bit update or two but that isn't a problem, for the same
  reasons mentioned under buffer access rules.
+
+ As of 8.4, background writer starts during recovery mode when there is
+ some form of potentially extended recovery to perform. It performs an
+ identical service to normal processing, except that checkpoints it
+ writes are technically restartpoints. Flushing outstanding WAL for dirty
+ buffers is also skipped, though there shouldn't ever be new WAL entries
+ at that time in any case. We could choose to start background writer
+ immediately but we hold off until we can prove the database is in a
+ consistent state so that postmaster has a single, clean state change.
*** src/backend/utils/init/postinit.c
--- src/backend/utils/init/postinit.c
***************
*** 324,330 **** InitCommunication(void)
   * If you're wondering why this is separate from InitPostgres at all:
   * the critical distinction is that this stuff has to happen before we can
   * run XLOG-related initialization, which is done before InitPostgres --- in
!  * fact, for cases such as checkpoint creation processes, InitPostgres may
   * never be done at all.
   */
  void
--- 324,330 ----
   * If you're wondering why this is separate from InitPostgres at all:
   * the critical distinction is that this stuff has to happen before we can
   * run XLOG-related initialization, which is done before InitPostgres --- in
!  * fact, for cases such as the background writer process, InitPostgres may
   * never be done at all.
   */
  void
*** src/include/access/xlog.h
--- src/include/access/xlog.h
***************
*** 133,139 **** typedef struct XLogRecData
  } XLogRecData;

  extern TimeLineID ThisTimeLineID;        /* current TLI */
! extern bool InRecovery;
  extern XLogRecPtr XactLastRecEnd;

  /* these variables are GUC parameters related to XLOG */
--- 133,148 ----
  } XLogRecData;

  extern TimeLineID ThisTimeLineID;        /* current TLI */
!
! /*
!  * Prior to 8.4, all activity during recovery was carried out by the startup
!  * process. This local variable continues to be used in many parts of the
!  * code to indicate actions taken by RecoveryManagers. Other processes that
!  * potentially perform work during recovery should check
!  * IsRecoveryProcessingMode(); see the XLogCtl notes in xlog.c.
!  */
! extern bool InRecovery;
!
  extern XLogRecPtr XactLastRecEnd;

  /* these variables are GUC parameters related to XLOG */
***************
*** 199,204 **** extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
--- 208,215 ----
  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

+ extern bool IsRecoveryProcessingMode(void);
+
  extern void UpdateControlFile(void);
  extern Size XLOGShmemSize(void);
  extern void XLOGShmemInit(void);
***************
*** 207,215 **** extern void StartupXLOG(void);
--- 218,229 ----
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
  extern void CreateCheckPoint(int flags);
+ extern void CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

+ extern void StartupProcessMain(void);
+
  #endif   /* XLOG_H */
*** src/include/storage/pmsignal.h
--- src/include/storage/pmsignal.h
***************
*** 22,27 ****
--- 22,30 ----
   */
  typedef enum
  {
+     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
+     PMSIGNAL_RECOVERY_CONSISTENT, /* recovery has reached consistent state */
+     PMSIGNAL_RECOVERY_COMPLETED, /* recovery completed */
      PMSIGNAL_PASSWORD_CHANGE,    /* pg_auth file has changed */
      PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
      PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */

Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:

> - If bgwriter is performing a restartpoint when recovery ends, the 
> startup checkpoint will be queued up behind the restartpoint. And since 
> it uses the same smoothing logic as checkpoints, it can take quite some 
> time for that to finish. The original patch had some code to hurry up 
> the restartpoint by signaling the bgwriter if 
> LWLockConditionalAcquire(CheckPointLock) fails, but there's a race 
> condition with that if a restartpoint starts right after that check. We 
> could let the bgwriter do the checkpoint too, and wait for it, but 
> bgwriter might not be running yet, and we'd have to allow bgwriter to 
> write WAL while disallowing it for all other processes, which seems 
> quite complex. Seems like we need something like the 
> LWLockConditionalAcquire approach, but built into CreateCheckPoint to 
> eliminate the race condition

Seems straightforward? Hold the lock longer.

> - If you perform a fast shutdown while startup process is waiting for 
> the restore command, startup process sometimes throws a FATAL error 
> which escalates into an immediate shutdown. That leads to 
> different messages in the logs, and skipping of the shutdown 
> restartpoint that we now otherwise perform.

Sometimes?

> - It's not clear to me if the rest of the xlog flushing related 
> functions, XLogBackgroundFlush, XLogNeedsFlush and XLogAsyncCommitFlush, 
> need to work during recovery, and what they should do.

XLogNeedsFlush should always return false while IsRecoveryProcessingMode()
is true. The WAL is already in the WAL files, not in wal_buffers anymore.

XLogAsyncCommitFlush should contain Assert(!IsRecoveryProcessingMode())
since it is called during a VACUUM FULL only.

XLogBackgroundFlush should never be called during recovery because the
WALWriter is never active in recovery. That should just be documented.

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
>> - If you perform a fast shutdown while startup process is waiting for 
>> the restore command, startup process sometimes throws a FATAL error 
>> which escalates into an immediate shutdown. That leads to 
>> different messages in the logs, and skipping of the shutdown 
>> restartpoint that we now otherwise perform.
> 
> Sometimes?

I think what happens is that if the restore command receives the SIGTERM 
and dies before the startup process that's waiting for the restore 
command receives the SIGTERM, the startup process throws a FATAL error 
because the restore command died unexpectedly. I put this

>     if (shutdown_requested && InRedo)
>     {
>         /* XXX: Is EndRecPtr always the right value here? */
>         UpdateMinRecoveryPoint(EndRecPtr);
>         proc_exit(0);
>     }

right after the "system(xlogRestoreCmd)" call, to exit gracefully if we 
were requested to shut down while restore command was running, but it 
seems that that's not enough because of the race condition.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Fri, 2009-02-06 at 10:06 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
> > >> - If you perform a fast shutdown while startup process is waiting for 
> > >> the restore command, startup process sometimes throws a FATAL error 
> > >> which escalates into an immediate shutdown. That leads to 
> > >> different messages in the logs, and skipping of the shutdown 
> > >> restartpoint that we now otherwise perform.
> > 
> > Sometimes?
> 
> I think what happens is that if the restore command receives the SIGTERM 
> and dies before the startup process that's waiting for the restore 
> command receives the SIGTERM, the startup process throws a FATAL error 
> because the restore command died unexpectedly. I put this
> 
> >     if (shutdown_requested && InRedo)
> >     {
> >         /* XXX: Is EndRecPtr always the right value here? */
> >         UpdateMinRecoveryPoint(EndRecPtr);
> >         proc_exit(0);
> >     }
> 
> right after the "system(xlogRestoreCmd)" call, to exit gracefully if we 
> were requested to shut down while restore command was running, but it 
> seems that that's not enough because of the race condition.

Can we trap the death of the restorecmd and handle it differently from
the death of the startup process?

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-02-06 at 10:06 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
>>>> - If you perform a fast shutdown while startup process is waiting for
>>>> the restore command, startup process sometimes throws a FATAL error
>>>> which escalates into an immediate shutdown. That leads to
>>>> different messages in the logs, and skipping of the shutdown
>>>> restartpoint that we now otherwise perform.
>>> Sometimes?
>> I think what happens is that if the restore command receives the SIGTERM
>> and dies before the startup process that's waiting for the restore
>> command receives the SIGTERM, the startup process throws a FATAL error
>> because the restore command died unexpectedly. I put this
>>
>>>     if (shutdown_requested && InRedo)
>>>     {
>>>         /* XXX: Is EndRecPtr always the right value here? */
>>>         UpdateMinRecoveryPoint(EndRecPtr);
>>>         proc_exit(0);
>>>     }
>> right after the "system(xlogRestoreCmd)" call, to exit gracefully if we
>> were requested to shut down while restore command was running, but it
>> seems that that's not enough because of the race condition.
>
> Can we trap the death of the restorecmd and handle it differently from
> the death of the startup process?

The startup process launches the restore command, so it's the startup
process that needs to handle its death.
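
As a rough illustration of what "handling its death" can look like in the
process that called system(), the sketch below decodes the returned wait
status with the standard POSIX macros. The enum and function names are
hypothetical and not part of the patch; the thresholds follow the patch's
own comment (exit codes above 125 are fatal), and, like the patch, only a
SIGTERM delivered to the shell itself is treated as a shutdown request.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical outcome classes for a restore command run via system(). */
typedef enum
{
    RESTORE_OK,             /* command exited with status 0 */
    RESTORE_FILE_MISSING,   /* ordinary non-zero exit, e.g. file not in archive */
    RESTORE_SHUTDOWN,       /* child was killed by SIGTERM */
    RESTORE_FATAL           /* other signal, or shell-level failure (126, 127, >128) */
} RestoreOutcome;

static RestoreOutcome
classify_restore_status(int rc)
{
    int         code;

    if (rc == -1)
        return RESTORE_FATAL;   /* system() could not create the child */

    if (WIFSIGNALED(rc))
        return (WTERMSIG(rc) == SIGTERM) ? RESTORE_SHUTDOWN : RESTORE_FATAL;

    /* system() does not report stopped children, so this is a normal exit */
    assert(WIFEXITED(rc));
    code = WEXITSTATUS(rc);

    if (code == 0)
        return RESTORE_OK;
    if (code > 125)
        return RESTORE_FATAL;   /* shell could not run the command, or it died on a signal */
    return RESTORE_FILE_MISSING;
}
```

Checking WIFSIGNALED() before WTERMSIG() matters because WTERMSIG() is only
defined for a status that reports death by signal.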

Anyway, I think I've found a solution. While we're executing the restore
command, we're in a state where it's safe to proc_exit(0). We can set a
flag that tells the signal handler when we're executing the restore
command, so that the handler can call proc_exit(0) on SIGTERM. So
if the startup process receives the SIGTERM first, it will proc_exit(0)
immediately, and if the restore command dies first due to the SIGTERM,
the startup process exits with proc_exit(0) when it sees that the restore
command exited because of the SIGTERM. If either process receives
SIGTERM for some reason other than a fast shutdown request, postmaster
will see that the startup process exited unexpectedly, and handle that
like a child process crash.
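
The flag-plus-handler idea described above can be sketched outside
PostgreSQL roughly as follows. This is a minimal standalone illustration,
not the patch itself: the function names are made up, and _exit(0) and
exit(0) stand in for proc_exit(0), which only exists inside the backend.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; polled by the main loop between units of work. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Non-zero only while we are blocked in the external command. */
static volatile sig_atomic_t in_restore_command = 0;

static void
handle_sigterm(int signo)
{
    (void) signo;
    if (in_restore_command)
        _exit(0);               /* safe point: leave immediately */
    shutdown_requested = 1;     /* otherwise let the main loop notice it */
}

static void
install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigterm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);
}

/* Runs cmd through the shell and returns the raw system() status. */
static int
run_restore_command(const char *cmd)
{
    int         rc;

    assert(!in_restore_command);    /* not meant to be re-entered */

    in_restore_command = 1;
    /* Catch a SIGTERM that arrived just before the flag was set. */
    if (shutdown_requested)
        exit(0);

    rc = system(cmd);
    in_restore_command = 0;
    return rc;
}
```

Setting the flag first and only then re-checking shutdown_requested is what
closes the window in which a signal could otherwise be recorded but never
acted on while the process sits in system().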

Attached is an updated patch that does that, and I've fixed all the
other outstanding issues I listed earlier as well. Now I'm feeling again
that this is in pretty good shape.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 36,41 ****
--- 36,42 ----
  #include "catalog/pg_control.h"
  #include "catalog/pg_type.h"
  #include "funcapi.h"
+ #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
***************
*** 47,52 ****
--- 48,54 ----
  #include "storage/smgr.h"
  #include "storage/spin.h"
  #include "utils/builtins.h"
+ #include "utils/flatfiles.h"
  #include "utils/guc.h"
  #include "utils/ps_status.h"
  #include "pg_trace.h"
***************
*** 119,130 **** CheckpointStatsData CheckpointStats;
   */
  TimeLineID    ThisTimeLineID = 0;

! /* Are we doing recovery from XLOG? */
  bool        InRecovery = false;

  /* Are we recovering using offline XLOG archives? */
  static bool InArchiveRecovery = false;

  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;

--- 121,146 ----
   */
  TimeLineID    ThisTimeLineID = 0;

! /*
!  * Are we doing recovery from XLOG?
!  *
!  * This is only ever true in the startup process, when it's replaying WAL.
!  * It's used in functions that need to act differently when called from a
!  * redo function (e.g. skip WAL logging).  To check whether the system is in
!  * recovery regardless of what process you're running in, use
!  * IsRecoveryProcessingMode().
!  */
  bool        InRecovery = false;

  /* Are we recovering using offline XLOG archives? */
  static bool InArchiveRecovery = false;

+ /*
+  * Local copy of shared RecoveryProcessingMode variable. True actually
+  * means "not known, need to check the shared state"
+  */
+ static bool LocalRecoveryProcessingMode = true;
+
  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;

***************
*** 133,139 **** static char *recoveryRestoreCommand = NULL;
  static bool recoveryTarget = false;
  static bool recoveryTargetExact = false;
  static bool recoveryTargetInclusive = true;
- static bool recoveryLogRestartpoints = false;
  static TransactionId recoveryTargetXid;
  static TimestampTz recoveryTargetTime;
  static TimestampTz recoveryLastXTime = 0;
--- 149,154 ----
***************
*** 242,250 **** static XLogRecPtr RedoRecPtr;
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint (ensures only one
!  * checkpointer at a time; currently, with all checkpoints done by the
!  * bgwriter, this is just pro forma).
   *
   *----------
   */
--- 257,264 ----
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint or restartpoint (ensures
!  * only one checkpointer at a time)
   *
   *----------
   */
***************
*** 313,318 **** typedef struct XLogCtlData
--- 327,351 ----
      int            XLogCacheBlck;    /* highest allocated xlog buffer index */
      TimeLineID    ThisTimeLineID;

+     /*
+      * SharedRecoveryProcessingMode indicates if we're still in crash or
+      * archive recovery.  It's checked by IsRecoveryProcessingMode().
+      */
+     bool        SharedRecoveryProcessingMode;
+
+     /*
+      * During recovery, we keep a copy of the latest checkpoint record
+      * here.  Used by the background writer when it wants to create
+      * a restartpoint.
+      *
+      * Protected by info_lck.
+      */
+     XLogRecPtr    lastCheckPointRecPtr;
+     CheckPoint    lastCheckPoint;
+
+     /* end+1 of the last record replayed (or being replayed) */
+     XLogRecPtr    replayEndRecPtr;
+
      slock_t        info_lck;        /* locks shared variables shown above */
  } XLogCtlData;

***************
*** 387,395 **** static XLogRecPtr ReadRecPtr;    /* start of last record read */
--- 420,440 ----
  static XLogRecPtr EndRecPtr;    /* end+1 of last record read */
  static XLogRecord *nextRecord = NULL;
  static TimeLineID lastPageTLI = 0;
+ static XLogRecPtr minRecoveryPoint; /* local copy of ControlFile->minRecoveryPoint */
+ static bool    updateMinRecoveryPoint = true;

  static bool InRedo = false;

+ /*
+  * Flag set by interrupt handlers for later service in the redo loop.
+  */
+ static volatile sig_atomic_t shutdown_requested = false;
+ /*
+  * Flag set when executing a restore command, to tell SIGTERM signal handler
+  * that it's safe to just proc_exit(0).
+  */
+ static volatile sig_atomic_t in_restore_command = false;
+

  static void XLogArchiveNotify(const char *xlog);
  static void XLogArchiveNotifySeg(uint32 log, uint32 seg);
***************
*** 420,425 **** static void PreallocXlogFiles(XLogRecPtr endptr);
--- 465,471 ----
  static void RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr);
  static void ValidateXLOGDirectoryStructure(void);
  static void CleanupBackupHistory(void);
+ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
  static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode);
  static bool ValidXLOGHeader(XLogPageHeader hdr, int emode);
  static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
***************
*** 484,489 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 530,539 ----
      bool        doPageWrites;
      bool        isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);

+     /* cross-check on whether we should be here or not */
+     if (IsRecoveryProcessingMode())
+         elog(FATAL, "cannot make new WAL entries during recovery");
+
      /* info's high bits are reserved for use by me */
      if (info & XLR_INFO_MASK)
          elog(PANIC, "invalid xlog info mask %02X", info);
***************
*** 1718,1723 **** XLogSetAsyncCommitLSN(XLogRecPtr asyncCommitLSN)
--- 1768,1830 ----
  }

  /*
+  * Advance minRecoveryPoint in control file.
+  *
+  * If we crash during recovery, we must reach this point again before the
+  * database is consistent.
+  *
+  * If 'force' is true, 'lsn' argument is ignored. Otherwise, minRecoveryPoint
+  * is only updated if it's not already greater than or equal to 'lsn'.
+  */
+ static void
+ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
+ {
+     /* Quick check using our local copy of the variable */
+     if (!updateMinRecoveryPoint || (!force && XLByteLE(lsn, minRecoveryPoint)))
+         return;
+
+     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+     /* update local copy */
+     minRecoveryPoint = ControlFile->minRecoveryPoint;
+
+     /*
+      * An invalid minRecoveryPoint means that we need to recover all the WAL,
+      * ie. crash recovery. Don't update the control file in that case.
+      */
+     if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0)
+         updateMinRecoveryPoint = false;
+     else if (force || XLByteLT(minRecoveryPoint, lsn))
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+         XLogRecPtr newMinRecoveryPoint;
+
+         /*
+          * To avoid having to update the control file too often, we update
+          * it all the way to the last record being replayed, even though 'lsn'
+          * would suffice for correctness.
+          */
+         SpinLockAcquire(&xlogctl->info_lck);
+         newMinRecoveryPoint = xlogctl->replayEndRecPtr;
+         SpinLockRelease(&xlogctl->info_lck);
+
+         /* update control file */
+         if (XLByteLT(ControlFile->minRecoveryPoint, newMinRecoveryPoint))
+         {
+             ControlFile->minRecoveryPoint = newMinRecoveryPoint;
+             UpdateControlFile();
+             minRecoveryPoint = newMinRecoveryPoint;
+         }
+
+         ereport(DEBUG2,
+                 (errmsg("updated min recovery point to %X/%X",
+                         minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
+     }
+     LWLockRelease(ControlFileLock);
+ }
+
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
***************
*** 1729,1737 **** XLogFlush(XLogRecPtr record)
      XLogRecPtr    WriteRqstPtr;
      XLogwrtRqst WriteRqst;

!     /* Disabled during REDO */
!     if (InRedo)
          return;

      /* Quick exit if already known flushed */
      if (XLByteLE(record, LogwrtResult.Flush))
--- 1836,1850 ----
      XLogRecPtr    WriteRqstPtr;
      XLogwrtRqst WriteRqst;

!     /*
!      * During REDO, we don't try to flush the WAL, but update minRecoveryPoint
!      * instead.
!      */
!     if (IsRecoveryProcessingMode())
!     {
!         UpdateMinRecoveryPoint(record, false);
          return;
+     }

      /* Quick exit if already known flushed */
      if (XLByteLE(record, LogwrtResult.Flush))
***************
*** 1818,1826 **** XLogFlush(XLogRecPtr record)
       * the bad page is encountered again during recovery then we would be
       * unable to restart the database at all!  (This scenario has actually
       * happened in the field several times with 7.1 releases. Note that we
!      * cannot get here while InRedo is true, but if the bad page is brought in
!      * and marked dirty during recovery then CreateCheckPoint will try to
!      * flush it at the end of recovery.)
       *
       * The current approach is to ERROR under normal conditions, but only
       * WARNING during recovery, so that the system can be brought up even if
--- 1931,1939 ----
       * the bad page is encountered again during recovery then we would be
       * unable to restart the database at all!  (This scenario has actually
       * happened in the field several times with 7.1 releases. Note that we
!      * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
!      * brought in and marked dirty during recovery then a checkpoint performed
!      * at the end of recovery will try to flush it.)
       *
       * The current approach is to ERROR under normal conditions, but only
       * WARNING during recovery, so that the system can be brought up even if
***************
*** 1857,1862 **** XLogBackgroundFlush(void)
--- 1970,1979 ----
      XLogRecPtr    WriteRqstPtr;
      bool        flexible = true;

+     /* XLOG doesn't need flushing during recovery */
+     if (IsRecoveryProcessingMode())
+         return;
+
      /* read LogwrtResult and update local state */
      {
          /* use volatile pointer to prevent code rearrangement */
***************
*** 1928,1933 **** XLogAsyncCommitFlush(void)
--- 2045,2054 ----
      /* use volatile pointer to prevent code rearrangement */
      volatile XLogCtlData *xlogctl = XLogCtl;

+     /* There are no asynchronously committed transactions during recovery */
+     if (IsRecoveryProcessingMode())
+         return;
+
      SpinLockAcquire(&xlogctl->info_lck);
      WriteRqstPtr = xlogctl->asyncCommitLSN;
      SpinLockRelease(&xlogctl->info_lck);
***************
*** 1944,1949 **** XLogAsyncCommitFlush(void)
--- 2065,2074 ----
  bool
  XLogNeedsFlush(XLogRecPtr record)
  {
+     /* XLOG doesn't need flushing during recovery */
+     if (IsRecoveryProcessingMode())
+         return false;
+
      /* Quick exit if already known flushed */
      if (XLByteLE(record, LogwrtResult.Flush))
          return false;
***************
*** 2619,2627 **** RestoreArchivedFile(char *path, const char *xlogfname,
--- 2744,2765 ----
                               xlogRestoreCmd)));

      /*
+      * Set in_restore_command to tell the signal handler that we should exit
+      * right away on SIGTERM. We know that we're in a safe point to do that.
+      * Check if we had already received the signal, so that we don't miss
+      * a shutdown request received just before this.
+      */
+     in_restore_command = true;
+     if (shutdown_requested)
+         proc_exit(0);
+
+     /*
       * Copy xlog from archival storage to XLOGDIR
       */
      rc = system(xlogRestoreCmd);
+
+     in_restore_command = false;
+
      if (rc == 0)
      {
          /*
***************
*** 2674,2687 **** RestoreArchivedFile(char *path, const char *xlogfname,
       * assume that recovery is complete and start up the database!) It's
       * essential to abort on child SIGINT and SIGQUIT, because per spec
       * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
!      * those it's a good bet we should have gotten it too.  Aborting on other
!      * signals such as SIGTERM seems a good idea as well.
       *
       * Per the Single Unix Spec, shells report exit status > 128 when a called
       * command died on a signal.  Also, 126 and 127 are used to report
       * problems such as an unfindable command; treat those as fatal errors
       * too.
       */
      signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

      ereport(signaled ? FATAL : DEBUG2,
--- 2812,2835 ----
       * assume that recovery is complete and start up the database!) It's
       * essential to abort on child SIGINT and SIGQUIT, because per spec
       * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
!      * those it's a good bet we should have gotten it too.
!      *
!      * On SIGTERM, assume we have received a fast shutdown request, and exit
!      * cleanly. It's pure chance whether we receive the SIGTERM first, or the
!      * child process. If we receive it first, the signal handler will call
!      * proc_exit(0), otherwise we do it here. If we or the child process
!      * received SIGTERM for any other reason than a fast shutdown request,
!      * postmaster will perform an immediate shutdown when it sees us exiting
!      * unexpectedly.
       *
       * Per the Single Unix Spec, shells report exit status > 128 when a called
       * command died on a signal.  Also, 126 and 127 are used to report
       * problems such as an unfindable command; treat those as fatal errors
       * too.
       */
+     if (WTERMSIG(rc) == SIGTERM)
+         proc_exit(0);
+
      signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

      ereport(signaled ? FATAL : DEBUG2,
***************
*** 4590,4607 **** readRecoveryCommandFile(void)
              ereport(LOG,
                      (errmsg("recovery_target_inclusive = %s", tok2)));
          }
-         else if (strcmp(tok1, "log_restartpoints") == 0)
-         {
-             /*
-              * does nothing if a recovery_target is not also set
-              */
-             if (!parse_bool(tok2, &recoveryLogRestartpoints))
-                   ereport(ERROR,
-                             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                       errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
-             ereport(LOG,
-                     (errmsg("log_restartpoints = %s", tok2)));
-         }
          else
              ereport(FATAL,
                      (errmsg("unrecognized recovery parameter \"%s\"",
--- 4738,4743 ----
***************
*** 4883,4889 **** StartupXLOG(void)
      XLogRecPtr    RecPtr,
                  LastRec,
                  checkPointLoc,
!                 minRecoveryLoc,
                  EndOfLog;
      uint32        endLogId;
      uint32        endLogSeg;
--- 5019,5025 ----
      XLogRecPtr    RecPtr,
                  LastRec,
                  checkPointLoc,
!                 backupStopLoc,
                  EndOfLog;
      uint32        endLogId;
      uint32        endLogSeg;
***************
*** 4891,4896 **** StartupXLOG(void)
--- 5027,5034 ----
      uint32        freespace;
      TransactionId oldestActiveXID;

+     XLogCtl->SharedRecoveryProcessingMode = true;
+
      /*
       * Read control file and check XLOG status looks valid.
       *
***************
*** 4970,4976 **** StartupXLOG(void)
                          recoveryTargetTLI,
                          ControlFile->checkPointCopy.ThisTimeLineID)));

!     if (read_backup_label(&checkPointLoc, &minRecoveryLoc))
      {
          /*
           * When a backup_label file is present, we want to roll forward from
--- 5108,5114 ----
                          recoveryTargetTLI,
                          ControlFile->checkPointCopy.ThisTimeLineID)));

!     if (read_backup_label(&checkPointLoc, &backupStopLoc))
      {
          /*
           * When a backup_label file is present, we want to roll forward from
***************
*** 5108,5118 **** StartupXLOG(void)
          ControlFile->prevCheckPoint = ControlFile->checkPoint;
          ControlFile->checkPoint = checkPointLoc;
          ControlFile->checkPointCopy = checkPoint;
!         if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
!             ControlFile->minRecoveryPoint = minRecoveryLoc;
          ControlFile->time = (pg_time_t) time(NULL);
          UpdateControlFile();

          /*
           * If there was a backup label file, it's done its job and the info
           * has now been propagated into pg_control.  We must get rid of the
--- 5246,5268 ----
          ControlFile->prevCheckPoint = ControlFile->checkPoint;
          ControlFile->checkPoint = checkPointLoc;
          ControlFile->checkPointCopy = checkPoint;
!         if (backupStopLoc.xlogid != 0 || backupStopLoc.xrecoff != 0)
!         {
!             if (XLByteLT(ControlFile->minRecoveryPoint, backupStopLoc))
!                 ControlFile->minRecoveryPoint = backupStopLoc;
!         }
          ControlFile->time = (pg_time_t) time(NULL);
+         /* No need to hold ControlFileLock yet, we aren't up far enough */
          UpdateControlFile();

+         /* update our local copy of minRecoveryPoint */
+         minRecoveryPoint = ControlFile->minRecoveryPoint;
+
+         /*
+          * Reset pgstat data, because it may be invalid after recovery.
+          */
+         pgstat_reset_all();
+
          /*
           * If there was a backup label file, it's done its job and the info
           * has now been propagated into pg_control.  We must get rid of the
***************
*** 5157,5168 **** StartupXLOG(void)
          {
              bool        recoveryContinue = true;
              bool        recoveryApply = true;
              ErrorContextCallback errcontext;

              InRedo = true;
!             ereport(LOG,
!                     (errmsg("redo starts at %X/%X",
!                             ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));

              /*
               * main redo apply loop
--- 5307,5347 ----
          {
              bool        recoveryContinue = true;
              bool        recoveryApply = true;
+             bool        reachedMinRecoveryPoint = false;
              ErrorContextCallback errcontext;
+             /* use volatile pointer to prevent code rearrangement */
+             volatile XLogCtlData *xlogctl = XLogCtl;
+
+             /* Update shared replayEndRecPtr */
+             SpinLockAcquire(&xlogctl->info_lck);
+             xlogctl->replayEndRecPtr = ReadRecPtr;
+             SpinLockRelease(&xlogctl->info_lck);

              InRedo = true;
!
!             if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0)
!                 ereport(LOG,
!                         (errmsg("redo starts at %X/%X",
!                                 ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
!             else
!                 ereport(LOG,
!                         (errmsg("redo starts at %X/%X, consistency will be reached at %X/%X",
!                         ReadRecPtr.xlogid, ReadRecPtr.xrecoff,
!                         minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
!
!             /*
!              * Let postmaster know we've started redo now, so that it can
!              * launch bgwriter to perform restartpoints.  We don't bother
!              * during crash recovery as restartpoints can only be performed
!              * during archive recovery.  And we'd like to keep crash recovery
!              * simple, to avoid introducing bugs that could prevent you from
!              * recovering after a crash.
!              *
!              * After this point, we can no longer assume that we're the only
!              * process in addition to postmaster!
!              */
!             if (InArchiveRecovery && IsUnderPostmaster)
!                 SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);

              /*
               * main redo apply loop
***************
*** 5189,5194 **** StartupXLOG(void)
--- 5368,5397 ----
  #endif

                  /*
+                  * Check if we were requested to exit without finishing
+                  * recovery.
+                  */
+                 if (shutdown_requested)
+                     proc_exit(0);
+
+                 /*
+                  * Have we reached our safe starting point? If so, we can
+                  * tell postmaster that the database is consistent now.
+                  */
+                 if (!reachedMinRecoveryPoint &&
+                      XLByteLE(minRecoveryPoint, EndRecPtr))
+                 {
+                     reachedMinRecoveryPoint = true;
+                     if (InArchiveRecovery)
+                     {
+                         ereport(LOG,
+                                 (errmsg("consistent recovery state reached")));
+                         if (IsUnderPostmaster)
+                             SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
+                     }
+                 }
+
+                 /*
                   * Have we reached our recovery target?
                   */
                  if (recoveryStopsHere(record, &recoveryApply))
***************
*** 5213,5218 **** StartupXLOG(void)
--- 5416,5430 ----
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
                  }

+                 /*
+                  * Update shared replayEndRecPtr before replaying this
+                  * record, so that XLogFlush will update minRecoveryPoint
+                  * correctly.
+                  */
+                 SpinLockAcquire(&xlogctl->info_lck);
+                 xlogctl->replayEndRecPtr = EndRecPtr;
+                 SpinLockRelease(&xlogctl->info_lck);
+
                  RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);

                  /* Pop the error context stack */
***************
*** 5256,5269 **** StartupXLOG(void)
       * Complain if we did not roll forward far enough to render the backup
       * dump consistent.
       */
!     if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
      {
          if (reachedStopPoint)    /* stopped because of stop request */
              ereport(FATAL,
!                     (errmsg("requested recovery stop point is before end time of backup dump")));
          else    /* ran off end of WAL */
              ereport(FATAL,
!                     (errmsg("WAL ends before end time of backup dump")));
      }

      /*
--- 5468,5481 ----
       * Complain if we did not roll forward far enough to render the backup
       * dump consistent.
       */
!     if (InRecovery && XLByteLT(EndOfLog, minRecoveryPoint))
      {
          if (reachedStopPoint)    /* stopped because of stop request */
              ereport(FATAL,
!                     (errmsg("requested recovery stop point is before consistent recovery point")));
          else    /* ran off end of WAL */
              ereport(FATAL,
!                     (errmsg("WAL ends before consistent recovery point")));
      }

      /*
***************
*** 5358,5363 **** StartupXLOG(void)
--- 5570,5581 ----
      /* Pre-scan prepared transactions to find out the range of XIDs present */
      oldestActiveXID = PrescanPreparedTransactions();

+     /*
+      * Allow writing WAL for us, so that we can create a checkpoint record.
+      * But not yet for other backends!
+      */
+     LocalRecoveryProcessingMode = false;
+
      if (InRecovery)
      {
          int            rmid;
***************
*** 5378,5388 **** StartupXLOG(void)
          XLogCheckInvalidPages();

          /*
-          * Reset pgstat data, because it may be invalid after recovery.
-          */
-         pgstat_reset_all();
-
-         /*
           * Perform a checkpoint to update all our recovery activity to disk.
           *
           * Note that we write a shutdown checkpoint rather than an on-line
--- 5596,5601 ----
***************
*** 5404,5415 **** StartupXLOG(void)
       */
      InRecovery = false;

      ControlFile->state = DB_IN_PRODUCTION;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();

      /* start the archive_timeout timer running */
!     XLogCtl->Write.lastSegSwitchTime = ControlFile->time;

      /* initialize shared-memory copy of latest checkpoint XID/epoch */
      XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
--- 5617,5630 ----
       */
      InRecovery = false;

+     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
      ControlFile->state = DB_IN_PRODUCTION;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();
+     LWLockRelease(ControlFileLock);

      /* start the archive_timeout timer running */
!     XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);

      /* initialize shared-memory copy of latest checkpoint XID/epoch */
      XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
***************
*** 5444,5449 **** StartupXLOG(void)
--- 5659,5703 ----
          readRecordBuf = NULL;
          readRecordBufSize = 0;
      }
+
+     /*
+      * All done. Allow others to write WAL.
+      */
+     XLogCtl->SharedRecoveryProcessingMode = false;
+ }
+
+ /*
+  * Is the system still in recovery?
+  *
+  * As a side-effect, we initialize the local TimeLineID and RedoRecPtr
+  * variables the first time we see that recovery is finished.
+  */
+ bool
+ IsRecoveryProcessingMode(void)
+ {
+     /*
+      * We check shared state each time only until we leave recovery mode.
+      * We can't re-enter recovery, so we rely on the local state variable
+      * after that.
+      */
+     if (!LocalRecoveryProcessingMode)
+         return false;
+     else
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+
+         /*
+          * Initialize TimeLineID and RedoRecPtr the first time we see that
+          * recovery is finished.
+          */
+         if (!LocalRecoveryProcessingMode)
+             InitXLOGAccess();
+
+         return LocalRecoveryProcessingMode;
+     }
  }

  /*
***************
*** 5575,5580 **** InitXLOGAccess(void)
--- 5829,5836 ----
  {
      /* ThisTimeLineID doesn't change so we need no lock to copy it */
      ThisTimeLineID = XLogCtl->ThisTimeLineID;
+     Assert(ThisTimeLineID != 0);
+
      /* Use GetRedoRecPtr to copy the RedoRecPtr safely */
      (void) GetRedoRecPtr();
  }
***************
*** 5686,5692 **** ShutdownXLOG(int code, Datum arg)
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
--- 5942,5951 ----
      ereport(LOG,
              (errmsg("shutting down")));

!     if (IsRecoveryProcessingMode())
!         CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
!     else
!         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
***************
*** 5699,5707 **** ShutdownXLOG(int code, Datum arg)
   * Log start of a checkpoint.
   */
  static void
! LogCheckpointStart(int flags)
  {
!     elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
           (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
           (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
           (flags & CHECKPOINT_FORCE) ? " force" : "",
--- 5958,5977 ----
   * Log start of a checkpoint.
   */
  static void
! LogCheckpointStart(int flags, bool restartpoint)
  {
!     char *msg;
!
!     /*
!      * XXX: This is hopelessly untranslatable. We could call gettext_noop
!      * for the main message, but what about all the flags?
!      */
!     if (restartpoint)
!         msg = "restartpoint starting:%s%s%s%s%s%s";
!     else
!         msg = "checkpoint starting:%s%s%s%s%s%s";
!
!     elog(LOG, msg,
           (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
           (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
           (flags & CHECKPOINT_FORCE) ? " force" : "",
***************
*** 5714,5720 **** LogCheckpointStart(int flags)
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(void)
  {
      long        write_secs,
                  sync_secs,
--- 5984,5990 ----
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(bool restartpoint)
  {
      long        write_secs,
                  sync_secs,
***************
*** 5737,5753 **** LogCheckpointEnd(void)
                          CheckpointStats.ckpt_sync_end_t,
                          &sync_secs, &sync_usecs);

!     elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
!          "%d transaction log file(s) added, %d removed, %d recycled; "
!          "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!          CheckpointStats.ckpt_bufs_written,
!          (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!          CheckpointStats.ckpt_segs_added,
!          CheckpointStats.ckpt_segs_removed,
!          CheckpointStats.ckpt_segs_recycled,
!          write_secs, write_usecs / 1000,
!          sync_secs, sync_usecs / 1000,
!          total_secs, total_usecs / 1000);
  }

  /*
--- 6007,6032 ----
                          CheckpointStats.ckpt_sync_end_t,
                          &sync_secs, &sync_usecs);

!     if (restartpoint)
!         elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
!              "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!              CheckpointStats.ckpt_bufs_written,
!              (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!              write_secs, write_usecs / 1000,
!              sync_secs, sync_usecs / 1000,
!              total_secs, total_usecs / 1000);
!     else
!         elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
!              "%d transaction log file(s) added, %d removed, %d recycled; "
!              "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!              CheckpointStats.ckpt_bufs_written,
!              (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!              CheckpointStats.ckpt_segs_added,
!              CheckpointStats.ckpt_segs_removed,
!              CheckpointStats.ckpt_segs_recycled,
!              write_secs, write_usecs / 1000,
!              sync_secs, sync_usecs / 1000,
!              total_secs, total_usecs / 1000);
  }

  /*
***************
*** 5778,5790 **** CreateCheckPoint(int flags)
      TransactionId *inCommitXids;
      int            nInCommit;

      /*
       * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
!      * (This is just pro forma, since in the present system structure there is
!      * only one process that is allowed to issue checkpoints at any given
!      * time.)
       */
!     LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

      /*
       * Prepare to accumulate statistics.
--- 6057,6089 ----
      TransactionId *inCommitXids;
      int            nInCommit;

+     /* shouldn't happen */
+     if (IsRecoveryProcessingMode())
+         elog(ERROR, "can't create a checkpoint during recovery");
+
      /*
       * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
!      * During normal operation, bgwriter is the only process that creates
!      * checkpoints, but at the end of archive recovery, the bgwriter can be busy
!      * creating a restartpoint while the startup process tries to perform the
!      * startup checkpoint.
       */
!     if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
!     {
!         Assert(InRecovery);
!
!         /*
!          * A restartpoint is in progress. Wait until it finishes. This can
!          * cause an extra restartpoint to be performed, but that's OK because
!          * we're just about to perform a checkpoint anyway. Flushing the
!          * buffers in this restartpoint can take some time, but that time is
!          * saved from the upcoming checkpoint so the net effect is zero.
!          */
!         ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint")));
!         RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
!
!         LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
!     }

      /*
       * Prepare to accumulate statistics.
***************
*** 5803,5811 **** CreateCheckPoint(int flags)
--- 6102,6112 ----

      if (shutdown)
      {
+         LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
          ControlFile->state = DB_SHUTDOWNING;
          ControlFile->time = (pg_time_t) time(NULL);
          UpdateControlFile();
+         LWLockRelease(ControlFileLock);
      }

      /*
***************
*** 5909,5915 **** CreateCheckPoint(int flags)
       * to log anything if we decided to skip the checkpoint.
       */
      if (log_checkpoints)
!         LogCheckpointStart(flags);

      TRACE_POSTGRESQL_CHECKPOINT_START(flags);

--- 6210,6216 ----
       * to log anything if we decided to skip the checkpoint.
       */
      if (log_checkpoints)
!         LogCheckpointStart(flags, false);

      TRACE_POSTGRESQL_CHECKPOINT_START(flags);

***************
*** 6076,6082 **** CreateCheckPoint(int flags)

      /* All real work is done, but log before releasing lock. */
      if (log_checkpoints)
!         LogCheckpointEnd();

          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
--- 6377,6383 ----

      /* All real work is done, but log before releasing lock. */
      if (log_checkpoints)
!         LogCheckpointEnd(false);

          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
***************
*** 6104,6135 **** CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
  }

  /*
!  * Set a recovery restart point if appropriate
!  *
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
!  * to establish a point from which recovery can roll forward without
!  * replaying the entire recovery log.  This function is called each time
!  * a checkpoint record is read from XLOG; it must determine whether a
!  * restartpoint is needed or not.
   */
  static void
  RecoveryRestartPoint(const CheckPoint *checkPoint)
  {
-     int            elapsed_secs;
      int            rmid;
!
!     /*
!      * Do nothing if the elapsed time since the last restartpoint is less than
!      * half of checkpoint_timeout.    (We use a value less than
!      * checkpoint_timeout so that variations in the timing of checkpoints on
!      * the master, or speed of transmission of WAL segments to a slave, won't
!      * make the slave skip a restartpoint once it's synced with the master.)
!      * Checking true elapsed time keeps us from doing restartpoints too often
!      * while rapidly scanning large amounts of WAL.
!      */
!     elapsed_secs = (pg_time_t) time(NULL) - ControlFile->time;
!     if (elapsed_secs < CheckPointTimeout / 2)
!         return;

      /*
       * Is it safe to checkpoint?  We must ask each of the resource managers
--- 6405,6421 ----
  }

  /*
!  * This is used during WAL recovery to establish a point from which recovery
!  * can roll forward without replaying the entire recovery log.  This function
!  * is called each time a checkpoint record is read from XLOG. The record is
!  * stored in shared memory, so that it can be used as a restartpoint later on.
   */
  static void
  RecoveryRestartPoint(const CheckPoint *checkPoint)
  {
      int            rmid;
!     /* use volatile pointer to prevent code rearrangement */
!     volatile XLogCtlData *xlogctl = XLogCtl;

      /*
       * Is it safe to checkpoint?  We must ask each of the resource managers
***************
*** 6151,6178 **** RecoveryRestartPoint(const CheckPoint *checkPoint)
      }

      /*
!      * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);

      /*
!      * Update pg_control so that any subsequent crash will restart from this
!      * checkpoint.    Note: ReadRecPtr gives the XLOG address of the checkpoint
!      * record itself.
       */
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
!     ControlFile->checkPoint = ReadRecPtr;
!     ControlFile->checkPointCopy = *checkPoint;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();

!     ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
              (errmsg("recovery restart point at %X/%X",
!                     checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
      if (recoveryLastXTime)
!         ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
!                 (errmsg("last completed transaction was at log time %s",
!                         timestamptz_to_str(recoveryLastXTime))));
  }

  /*
--- 6437,6564 ----
      }

      /*
!      * Copy the checkpoint record to shared memory, so that bgwriter can
!      * use it the next time it wants to perform a restartpoint.
!      */
!     SpinLockAcquire(&xlogctl->info_lck);
!     XLogCtl->lastCheckPointRecPtr = ReadRecPtr;
!     memcpy(&XLogCtl->lastCheckPoint, checkPoint, sizeof(CheckPoint));
!     SpinLockRelease(&xlogctl->info_lck);
! }
!
! /*
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
!  * to establish a point from which recovery can roll forward without
!  * replaying the entire recovery log.
!  *
!  * Returns true if a new restartpoint was established. We can only establish
!  * a restartpoint if we have replayed a checkpoint record since last
!  * restartpoint.
!  */
! bool
! CreateRestartPoint(int flags)
! {
!     XLogRecPtr lastCheckPointRecPtr;
!     CheckPoint lastCheckPoint;
!     /* use volatile pointer to prevent code rearrangement */
!     volatile XLogCtlData *xlogctl = XLogCtl;
!
!     /*
!      * Acquire CheckpointLock to ensure only one restartpoint or checkpoint
!      * happens at a time.
!      */
!     LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
!
!     /* Get a local copy of the last checkpoint record. */
!     SpinLockAcquire(&xlogctl->info_lck);
!     lastCheckPointRecPtr = xlogctl->lastCheckPointRecPtr;
!     memcpy(&lastCheckPoint, &XLogCtl->lastCheckPoint, sizeof(CheckPoint));
!     SpinLockRelease(&xlogctl->info_lck);
!
!     /*
!      * Check that we're still in recovery mode. It's ok if we exit recovery
!      * mode after this check; the restart point is valid anyway.
!      */
!     if (!IsRecoveryProcessingMode())
!     {
!         ereport(DEBUG2,
!                 (errmsg("skipping restartpoint, recovery has already ended")));
!         LWLockRelease(CheckpointLock);
!         return false;
!     }
!
!     /*
!      * If the last checkpoint record we've replayed is already our last
!      * restartpoint, we can't perform a new restart point. We still update
!      * minRecoveryPoint in that case, so that if this is a shutdown restart
!      * point, we won't start up earlier than before. That's not strictly
!      * necessary, but when we get hot standby capability, it would be rather
!      * weird if the database opened up for read-only connections at a
!      * point-in-time before the last shutdown. Such time travel is still
!      * possible in case of immediate shutdown, though.
!      *
!      * We don't explicitly advance minRecoveryPoint when we do create a
!      * restartpoint. It's assumed that flushing the buffers will do that
!      * as a side-effect.
       */
!     if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
!         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
!     {
!         XLogRecPtr InvalidXLogRecPtr = {0, 0};
!         ereport(DEBUG2,
!                 (errmsg("skipping restartpoint, already performed at %X/%X",
!                         lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
!
!         UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
!         LWLockRelease(CheckpointLock);
!         return false;
!     }
!
!     if (log_checkpoints)
!     {
!         /*
!          * Prepare to accumulate statistics.
!          */
!         MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
!         CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
!
!         LogCheckpointStart(flags, true);
!     }
!
!     CheckPointGuts(lastCheckPoint.redo, flags);

      /*
!      * Update pg_control, using current time
       */
+     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
!     ControlFile->checkPoint = lastCheckPointRecPtr;
!     ControlFile->checkPointCopy = lastCheckPoint;
      ControlFile->time = (pg_time_t) time(NULL);
      UpdateControlFile();
+     LWLockRelease(ControlFileLock);

!     /*
!      * Currently, there is no need to truncate pg_subtrans during recovery.
!      * If we did that, we would need to have called StartupSUBTRANS()
!      * already, and then TruncateSUBTRANS() would go here.
!      */
!
!     /* All real work is done, but log before releasing lock. */
!     if (log_checkpoints)
!         LogCheckpointEnd(true);
!
!     ereport((log_checkpoints ? LOG : DEBUG2),
              (errmsg("recovery restart point at %X/%X",
!                     lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
!
      if (recoveryLastXTime)
!         ereport((log_checkpoints ? LOG : DEBUG2),
!             (errmsg("last completed transaction was at log time %s",
!                     timestamptz_to_str(recoveryLastXTime))));
!
!     LWLockRelease(CheckpointLock);
!     return true;
  }

  /*
***************
*** 6238,6243 **** RequestXLogSwitch(void)
--- 6624,6632 ----

  /*
   * XLOG resource manager's routines
+  *
+  * Definitions of message info are in include/catalog/pg_control.h,
+  * though not all messages relate to control file processing.
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
***************
*** 6284,6292 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
                                   (int) checkPoint.ThisTimeLineID))
                  ereport(PANIC,
                          (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
!                                 checkPoint.ThisTimeLineID, ThisTimeLineID)));
!             /* Following WAL records should be run with new TLI */
!             ThisTimeLineID = checkPoint.ThisTimeLineID;
          }

          RecoveryRestartPoint(&checkPoint);
--- 6673,6681 ----
                                   (int) checkPoint.ThisTimeLineID))
                  ereport(PANIC,
                          (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
!                                checkPoint.ThisTimeLineID, ThisTimeLineID)));
!            /* Following WAL records should be run with new TLI */
!            ThisTimeLineID = checkPoint.ThisTimeLineID;
          }

          RecoveryRestartPoint(&checkPoint);
***************
*** 7227,7229 **** CancelBackup(void)
--- 7616,7707 ----
      }
  }

+ /* ------------------------------------------------------
+  *  Startup Process main entry point and signal handlers
+  * ------------------------------------------------------
+  */
+
+ /*
+  * startupproc_quickdie() occurs when signalled SIGQUIT by the postmaster.
+  *
+  * Some backend has bought the farm,
+  * so we need to stop what we're doing and exit.
+  */
+ static void
+ startupproc_quickdie(SIGNAL_ARGS)
+ {
+     PG_SETMASK(&BlockSig);
+
+     /*
+      * DO NOT proc_exit() -- we're here because shared memory may be
+      * corrupted, so we don't want to try to clean up our transaction. Just
+      * nail the windows shut and get out of town.
+      *
+      * Note we do exit(2) not exit(0).    This is to force the postmaster into a
+      * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+      * backend.  This is necessary precisely because we don't clean up our
+      * shared memory state.
+      */
+     exit(2);
+ }
+
+
+ /* SIGTERM: set flag to abort redo and exit */
+ static void
+ StartupProcShutdownHandler(SIGNAL_ARGS)
+ {
+     if (in_restore_command)
+         proc_exit(0);
+     else
+         shutdown_requested = true;
+ }
+
+ /* Main entry point for startup process */
+ void
+ StartupProcessMain(void)
+ {
+     /*
+      * If possible, make this process a group leader, so that the postmaster
+      * can signal any child processes too.
+      */
+ #ifdef HAVE_SETSID
+     if (setsid() < 0)
+         elog(FATAL, "setsid() failed: %m");
+ #endif
+
+     /*
+      * Properly accept or ignore signals the postmaster might send us
+      */
+     pqsignal(SIGHUP, SIG_IGN);    /* ignore config file updates */
+     pqsignal(SIGINT, SIG_IGN);        /* ignore query cancel */
+     pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
+     pqsignal(SIGQUIT, startupproc_quickdie);        /* hard crash time */
+     pqsignal(SIGALRM, SIG_IGN);
+     pqsignal(SIGPIPE, SIG_IGN);
+     pqsignal(SIGUSR1, SIG_IGN);
+     pqsignal(SIGUSR2, SIG_IGN);
+
+     /*
+      * Reset some signals that are accepted by postmaster but not here
+      */
+     pqsignal(SIGCHLD, SIG_DFL);
+     pqsignal(SIGTTIN, SIG_DFL);
+     pqsignal(SIGTTOU, SIG_DFL);
+     pqsignal(SIGCONT, SIG_DFL);
+     pqsignal(SIGWINCH, SIG_DFL);
+
+     /*
+      * Unblock signals (they were blocked when the postmaster forked us)
+      */
+     PG_SETMASK(&UnBlockSig);
+
+     StartupXLOG();
+
+     BuildFlatFiles(false);
+
+     /* Let postmaster know that startup is finished */
+     SendPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED);
+
+     /* exit normally */
+     proc_exit(0);
+ }
*** a/src/backend/bootstrap/bootstrap.c
--- b/src/backend/bootstrap/bootstrap.c
***************
*** 37,43 ****
  #include "storage/proc.h"
  #include "tcop/tcopprot.h"
  #include "utils/builtins.h"
- #include "utils/flatfiles.h"
  #include "utils/fmgroids.h"
  #include "utils/memutils.h"
  #include "utils/ps_status.h"
--- 37,42 ----
***************
*** 416,429 **** AuxiliaryProcessMain(int argc, char *argv[])
              proc_exit(1);        /* should never return */

          case StartupProcess:
!             bootstrap_signals();
!             StartupXLOG();
!             BuildFlatFiles(false);
!             proc_exit(0);        /* startup done */

          case BgWriterProcess:
              /* don't set signals, bgwriter has its own agenda */
-             InitXLOGAccess();
              BackgroundWriterMain();
              proc_exit(1);        /* should never return */

--- 415,426 ----
              proc_exit(1);        /* should never return */

          case StartupProcess:
!             /* don't set signals, startup process has its own agenda */
!             StartupProcessMain();
!             proc_exit(1);        /* should never return */

          case BgWriterProcess:
              /* don't set signals, bgwriter has its own agenda */
              BackgroundWriterMain();
              proc_exit(1);        /* should never return */

*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
***************
*** 49,54 ****
--- 49,55 ----
  #include <unistd.h>

  #include "access/xlog_internal.h"
+ #include "catalog/pg_control.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
***************
*** 197,202 **** BackgroundWriterMain(void)
--- 198,204 ----
  {
      sigjmp_buf    local_sigjmp_buf;
      MemoryContext bgwriter_context;
+     bool        BgWriterRecoveryMode = true;

      BgWriterShmem->bgwriter_pid = MyProcPid;
      am_bg_writer = true;
***************
*** 418,428 **** BackgroundWriterMain(void)
--- 420,446 ----
          }

          /*
+          * Check if we've exited recovery. We do this after determining
+          * whether to perform a checkpoint or not, to be sure that we
+          * perform a real checkpoint and not a restartpoint, if someone
+          * requested a checkpoint immediately after exiting recovery. And
+          * we must have the right TimeLineID when we perform a checkpoint;
+          * IsRecoveryProcessingMode() initializes that as a side-effect.
+          */
+         if (BgWriterRecoveryMode && !IsRecoveryProcessingMode())
+         {
+             elog(DEBUG1, "bgwriter changing from recovery to normal mode");
+             BgWriterRecoveryMode = false;
+         }
+
+         /*
           * Do a checkpoint if requested, otherwise do one cycle of
           * dirty-buffer writing.
           */
          if (do_checkpoint)
          {
+             bool    ckpt_performed = false;
+
              /* use volatile pointer to prevent code rearrangement */
              volatile BgWriterShmemStruct *bgs = BgWriterShmem;

***************
*** 444,450 **** BackgroundWriterMain(void)
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if ((flags & CHECKPOINT_CAUSE_XLOG) &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
--- 462,469 ----
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (!BgWriterRecoveryMode &&
!                 (flags & CHECKPOINT_CAUSE_XLOG) &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
***************
*** 455,468 **** BackgroundWriterMain(void)
               * Initialize bgwriter-private variables used during checkpoint.
               */
              ckpt_active = true;
!             ckpt_start_recptr = GetInsertRecPtr();
              ckpt_start_time = now;
              ckpt_cached_elapsed = 0;

              /*
               * Do the checkpoint.
               */
!             CreateCheckPoint(flags);

              /*
               * After any checkpoint, close all smgr files.    This is so we
--- 474,494 ----
               * Initialize bgwriter-private variables used during checkpoint.
               */
              ckpt_active = true;
!             if (!BgWriterRecoveryMode)
!                 ckpt_start_recptr = GetInsertRecPtr();
              ckpt_start_time = now;
              ckpt_cached_elapsed = 0;

              /*
               * Do the checkpoint.
               */
!             if (!BgWriterRecoveryMode)
!             {
!                 CreateCheckPoint(flags);
!                 ckpt_performed = true;
!             }
!             else
!                 ckpt_performed = CreateRestartPoint(flags);

              /*
               * After any checkpoint, close all smgr files.    This is so we
***************
*** 477,490 **** BackgroundWriterMain(void)
              bgs->ckpt_done = bgs->ckpt_started;
              SpinLockRelease(&bgs->ckpt_lck);

!             ckpt_active = false;

!             /*
!              * Note we record the checkpoint start time not end time as
!              * last_checkpoint_time.  This is so that time-driven checkpoints
!              * happen at a predictable spacing.
!              */
!             last_checkpoint_time = now;
          }
          else
              BgBufferSync();
--- 503,529 ----
              bgs->ckpt_done = bgs->ckpt_started;
              SpinLockRelease(&bgs->ckpt_lck);

!             if (ckpt_performed)
!             {
!                 /*
!                  * Note we record the checkpoint start time not end time as
!                  * last_checkpoint_time.  This is so that time-driven
!                  * checkpoints happen at a predictable spacing.
!                  */
!                 last_checkpoint_time = now;
!             }
!             else
!             {
!                 /*
!                  * We were not able to perform the restartpoint (checkpoints
!                  * throw an ERROR in case of error), most likely because we
!                  * have not received any new checkpoint WAL records since the
!                  * last restartpoint. Try again in 15 s.
!                  */
!                 last_checkpoint_time = now - CheckPointTimeout + 15;
!             }

!             ckpt_active = false;
          }
          else
              BgBufferSync();
***************
*** 507,513 **** CheckArchiveTimeout(void)
      pg_time_t    now;
      pg_time_t    last_time;

!     if (XLogArchiveTimeout <= 0)
          return;

      now = (pg_time_t) time(NULL);
--- 546,552 ----
      pg_time_t    now;
      pg_time_t    last_time;

!     if (XLogArchiveTimeout <= 0 || IsRecoveryProcessingMode())
          return;

      now = (pg_time_t) time(NULL);
***************
*** 586,592 **** BgWriterNap(void)
          (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
              break;
          pg_usleep(1000000L);
!         AbsorbFsyncRequests();
          udelay -= 1000000L;
      }

--- 625,632 ----
          (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
              break;
          pg_usleep(1000000L);
!         if (!IsRecoveryProcessingMode())
!             AbsorbFsyncRequests();
          udelay -= 1000000L;
      }

***************
*** 714,729 **** IsCheckpointOnSchedule(double progress)
       * However, it's good enough for our purposes, we're only calculating an
       * estimate anyway.
       */
!     recptr = GetInsertRecPtr();
!     elapsed_xlogs =
!         (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
!          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
!         CheckPointSegments;
!
!     if (progress < elapsed_xlogs)
      {
!         ckpt_cached_elapsed = elapsed_xlogs;
!         return false;
      }

      /*
--- 754,772 ----
       * However, it's good enough for our purposes, we're only calculating an
       * estimate anyway.
       */
!     if (!IsRecoveryProcessingMode())
      {
!         recptr = GetInsertRecPtr();
!         elapsed_xlogs =
!             (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
!              ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
!             CheckPointSegments;
!
!         if (progress < elapsed_xlogs)
!         {
!             ckpt_cached_elapsed = elapsed_xlogs;
!             return false;
!         }
      }

      /*
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 225,235 **** static pid_t StartupPID = 0,
--- 225,262 ----
  static int    Shutdown = NoShutdown;

  static bool FatalError = false; /* T if recovering from backend crash */
+ static bool RecoveryError = false; /* T if recovery failed */
+
+ /* State of WAL redo */
+ #define            NoRecovery            0
+ #define            RecoveryStarted        1
+ #define            RecoveryConsistent    2
+ #define            RecoveryCompleted    3
+
+ static int    RecoveryStatus = NoRecovery;

  /*
   * We use a simple state machine to control startup, shutdown, and
   * crash recovery (which is rather like shutdown followed by startup).
   *
+  * After doing all the postmaster initialization work, we enter PM_STARTUP
+  * state and the startup process is launched. The startup process begins by
+  * reading the control file and other preliminary initialization steps. When
+  * it's ready to start WAL redo, it signals postmaster, and we switch to
+  * PM_RECOVERY phase. The background writer is launched, while the startup
+  * process continues applying WAL.
+  *
+  * After reaching a consistent point in WAL redo, startup process signals
+  * us again, and we switch to PM_RECOVERY_CONSISTENT phase. There's currently
+  * no difference between PM_RECOVERY and PM_RECOVERY_CONSISTENT, but we
+  * could start accepting connections to perform read-only queries at this
+  * point, if we had the infrastructure to do that.
+  *
+  * When the WAL redo is finished, the startup process signals us the third
+  * time, and we switch to PM_RUN state. The startup process can also skip the
+  * recovery and consistent recovery phases altogether, as it will during
+  * normal startup when there's no recovery to be done, for example.
+  *
   * Normal child backends can only be launched when we are in PM_RUN state.
   * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
   * In other states we handle connection requests by launching "dead_end"
***************
*** 245,259 **** static bool FatalError = false; /* T if recovering from backend crash */
   *
   * Notice that this state variable does not distinguish *why* we entered
   * states later than PM_RUN --- Shutdown and FatalError must be consulted
!  * to find that out.  FatalError is never true in PM_RUN state, nor in
!  * PM_SHUTDOWN states (because we don't enter those states when trying to
!  * recover from a crash).  It can be true in PM_STARTUP state, because we
!  * don't clear it until we've successfully recovered.
   */
  typedef enum
  {
      PM_INIT,                    /* postmaster starting */
      PM_STARTUP,                    /* waiting for startup subprocess */
      PM_RUN,                        /* normal "database is alive" state */
      PM_WAIT_BACKUP,                /* waiting for online backup mode to end */
      PM_WAIT_BACKENDS,            /* waiting for live backends to exit */
--- 272,288 ----
   *
   * Notice that this state variable does not distinguish *why* we entered
   * states later than PM_RUN --- Shutdown and FatalError must be consulted
!  * to find that out.  FatalError is never true in PM_RECOVERY_* or PM_RUN
!  * states, nor in PM_SHUTDOWN states (because we don't enter those states
!  * when trying to recover from a crash).  It can be true in PM_STARTUP state,
!  * because we don't clear it until we've successfully started WAL redo.
   */
  typedef enum
  {
      PM_INIT,                    /* postmaster starting */
      PM_STARTUP,                    /* waiting for startup subprocess */
+     PM_RECOVERY,                /* in recovery mode */
+     PM_RECOVERY_CONSISTENT,        /* consistent recovery mode */
      PM_RUN,                        /* normal "database is alive" state */
      PM_WAIT_BACKUP,                /* waiting for online backup mode to end */
      PM_WAIT_BACKENDS,            /* waiting for live backends to exit */
***************
*** 307,312 **** static void pmdie(SIGNAL_ARGS);
--- 336,342 ----
  static void reaper(SIGNAL_ARGS);
  static void sigusr1_handler(SIGNAL_ARGS);
  static void dummy_handler(SIGNAL_ARGS);
+ static void CheckRecoverySignals(void);
  static void CleanupBackend(int pid, int exitstatus);
  static void HandleChildCrash(int pid, int exitstatus, const char *procname);
  static void LogChildExit(int lev, const char *procname,
***************
*** 1302,1308 **** ServerLoop(void)
           * state that prevents it, start one.  It doesn't matter if this
           * fails, we'll just try again later.
           */
!         if (BgWriterPID == 0 && pmState == PM_RUN)
              BgWriterPID = StartBackgroundWriter();

          /*
--- 1332,1340 ----
           * state that prevents it, start one.  It doesn't matter if this
           * fails, we'll just try again later.
           */
!         if (BgWriterPID == 0 &&
!             (pmState == PM_RUN || pmState == PM_RECOVERY ||
!              pmState == PM_RECOVERY_CONSISTENT))
              BgWriterPID = StartBackgroundWriter();

          /*
***************
*** 1752,1758 **** canAcceptConnections(void)
              return CAC_WAITBACKUP;    /* allow superusers only */
          if (Shutdown > NoShutdown)
              return CAC_SHUTDOWN;    /* shutdown is pending */
!         if (pmState == PM_STARTUP && !FatalError)
              return CAC_STARTUP; /* normal startup */
          return CAC_RECOVERY;    /* else must be crash recovery */
      }
--- 1784,1793 ----
              return CAC_WAITBACKUP;    /* allow superusers only */
          if (Shutdown > NoShutdown)
              return CAC_SHUTDOWN;    /* shutdown is pending */
!         if (!FatalError &&
!             (pmState == PM_STARTUP ||
!              pmState == PM_RECOVERY ||
!              pmState == PM_RECOVERY_CONSISTENT))
              return CAC_STARTUP; /* normal startup */
          return CAC_RECOVERY;    /* else must be crash recovery */
      }
***************
*** 1982,1988 **** pmdie(SIGNAL_ARGS)
              ereport(LOG,
                      (errmsg("received smart shutdown request")));

!             if (pmState == PM_RUN)
              {
                  /* autovacuum workers are told to shut down immediately */
                  SignalAutovacWorkers(SIGTERM);
--- 2017,2023 ----
              ereport(LOG,
                      (errmsg("received smart shutdown request")));

!             if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
              {
                  /* autovacuum workers are told to shut down immediately */
                  SignalAutovacWorkers(SIGTERM);
***************
*** 2019,2025 **** pmdie(SIGNAL_ARGS)

              if (StartupPID != 0)
                  signal_child(StartupPID, SIGTERM);
!             if (pmState == PM_RUN || pmState == PM_WAIT_BACKUP)
              {
                  ereport(LOG,
                          (errmsg("aborting any active transactions")));
--- 2054,2067 ----

              if (StartupPID != 0)
                  signal_child(StartupPID, SIGTERM);
!             if (pmState == PM_RECOVERY)
!             {
!                 /* only bgwriter is active in this state */
!                 pmState = PM_WAIT_BACKENDS;
!             }
!             if (pmState == PM_RUN ||
!                 pmState == PM_WAIT_BACKUP ||
!                 pmState == PM_RECOVERY_CONSISTENT)
              {
                  ereport(LOG,
                          (errmsg("aborting any active transactions")));
***************
*** 2116,2125 **** reaper(SIGNAL_ARGS)
          if (pid == StartupPID)
          {
              StartupPID = 0;
-             Assert(pmState == PM_STARTUP);

!             /* FATAL exit of startup is treated as catastrophic */
!             if (!EXIT_STATUS_0(exitstatus))
              {
                  LogChildExit(LOG, _("startup process"),
                               pid, exitstatus);
--- 2158,2179 ----
          if (pid == StartupPID)
          {
              StartupPID = 0;

!             /*
!              * Check if we've received a signal from the startup process
!              * first. This can change pmState. If the startup process sends
!              * a signal, and exits immediately after that, we might not have
!              * processed the signal yet, and we need to know if it completed
!              * recovery before exiting.
!              */
!             CheckRecoverySignals();
!
!             /*
!              * Unexpected exit of startup process (including FATAL exit)
!              * during PM_STARTUP is treated as catastrophic. There are no
!              * other processes running yet.
!              */
!             if (pmState == PM_STARTUP)
              {
                  LogChildExit(LOG, _("startup process"),
                               pid, exitstatus);
***************
*** 2127,2186 **** reaper(SIGNAL_ARGS)
                  (errmsg("aborting startup due to startup process failure")));
                  ExitPostmaster(1);
              }
-
              /*
!              * Startup succeeded - we are done with system startup or
!              * recovery.
               */
!             FatalError = false;
!
!             /*
!              * Go to shutdown mode if a shutdown request was pending.
!              */
!             if (Shutdown > NoShutdown)
              {
!                 pmState = PM_WAIT_BACKENDS;
!                 /* PostmasterStateMachine logic does the rest */
                  continue;
              }
-
              /*
!              * Otherwise, commence normal operations.
!              */
!             pmState = PM_RUN;
!
!             /*
!              * Load the flat authorization file into postmaster's cache. The
!              * startup process has recomputed this from the database contents,
!              * so we wait till it finishes before loading it.
!              */
!             load_role();
!
!             /*
!              * Crank up the background writer.    It doesn't matter if this
!              * fails, we'll just try again later.
               */
!             Assert(BgWriterPID == 0);
!             BgWriterPID = StartBackgroundWriter();
!
!             /*
!              * Likewise, start other special children as needed.  In a restart
!              * situation, some of them may be alive already.
!              */
!             if (WalWriterPID == 0)
!                 WalWriterPID = StartWalWriter();
!             if (AutoVacuumingActive() && AutoVacPID == 0)
!                 AutoVacPID = StartAutoVacLauncher();
!             if (XLogArchivingActive() && PgArchPID == 0)
!                 PgArchPID = pgarch_start();
!             if (PgStatPID == 0)
!                 PgStatPID = pgstat_start();
!
!             /* at this point we are really open for business */
!             ereport(LOG,
!                  (errmsg("database system is ready to accept connections")));
!
!             continue;
          }

          /*
--- 2181,2210 ----
                  (errmsg("aborting startup due to startup process failure")));
                  ExitPostmaster(1);
              }
              /*
!              * Any unexpected exit (including FATAL exit) of the startup
!              * process is treated as a crash, except that we don't want
!              * to reinitialize.
               */
!             if (!EXIT_STATUS_0(exitstatus))
              {
!                 RecoveryError = true;
!                 HandleChildCrash(pid, exitstatus,
!                                  _("startup process"));
                  continue;
              }
              /*
!              * Startup process exited normally, but didn't finish recovery.
!              * This can happen if someone other than postmaster kills the
!              * startup process with SIGTERM. Treat it like a crash.
               */
!             if (pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
!             {
!                 RecoveryError = true;
!                 HandleChildCrash(pid, exitstatus,
!                                  _("startup process"));
!                 continue;
!             }
          }

          /*
***************
*** 2443,2448 **** HandleChildCrash(int pid, int exitstatus, const char *procname)
--- 2467,2484 ----
          }
      }

+     /* Take care of the startup process too */
+     if (pid == StartupPID)
+         StartupPID = 0;
+     else if (StartupPID != 0 && !FatalError)
+     {
+         ereport(DEBUG2,
+                 (errmsg_internal("sending %s to process %d",
+                                  (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                  (int) StartupPID)));
+         signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
+     }
+
      /* Take care of the bgwriter too */
      if (pid == BgWriterPID)
          BgWriterPID = 0;
***************
*** 2514,2520 **** HandleChildCrash(int pid, int exitstatus, const char *procname)

      FatalError = true;
      /* We now transit into a state of waiting for children to die */
!     if (pmState == PM_RUN ||
          pmState == PM_WAIT_BACKUP ||
          pmState == PM_SHUTDOWN)
          pmState = PM_WAIT_BACKENDS;
--- 2550,2558 ----

      FatalError = true;
      /* We now transit into a state of waiting for children to die */
!     if (pmState == PM_RECOVERY ||
!         pmState == PM_RECOVERY_CONSISTENT ||
!         pmState == PM_RUN ||
          pmState == PM_WAIT_BACKUP ||
          pmState == PM_SHUTDOWN)
          pmState = PM_WAIT_BACKENDS;
***************
*** 2582,2587 **** LogChildExit(int lev, const char *procname, int pid, int exitstatus)
--- 2620,2746 ----
  static void
  PostmasterStateMachine(void)
  {
+     /* Startup states */
+
+     if (pmState == PM_STARTUP && RecoveryStatus > NoRecovery)
+     {
+         /* WAL redo has started. We're out of reinitialization. */
+         FatalError = false;
+
+         /*
+          * Go to shutdown mode if a shutdown request was pending.
+          */
+         if (Shutdown > NoShutdown)
+         {
+             pmState = PM_WAIT_BACKENDS;
+             /* PostmasterStateMachine logic does the rest */
+         }
+         else
+         {
+             /*
+              * Crank up the background writer.    It doesn't matter if this
+              * fails, we'll just try again later.
+              */
+             Assert(BgWriterPID == 0);
+             BgWriterPID = StartBackgroundWriter();
+
+             pmState = PM_RECOVERY;
+         }
+     }
+     if (pmState == PM_RECOVERY && RecoveryStatus >= RecoveryConsistent)
+     {
+         /*
+          * Go to shutdown mode if a shutdown request was pending.
+          */
+         if (Shutdown > NoShutdown)
+         {
+             pmState = PM_WAIT_BACKENDS;
+             /* PostmasterStateMachine logic does the rest */
+         }
+         else
+         {
+             /*
+              * Recovery has reached a consistent point. FatalError was
+              * already reset when WAL redo started.
+              */
+             pmState = PM_RECOVERY_CONSISTENT;
+
+             /*
+              * Load the flat authorization file into postmaster's cache. The
+              * startup process won't have recomputed this from the database yet,
+              * so it may change following recovery.
+              */
+             load_role();
+
+             /*
+              * Likewise, start other special children as needed.
+              */
+             Assert(PgStatPID == 0);
+             PgStatPID = pgstat_start();
+
+             /* XXX at this point we could accept read-only connections */
+             ereport(DEBUG1,
+                  (errmsg("database system is in consistent recovery mode")));
+         }
+     }
+     if ((pmState == PM_RECOVERY ||
+          pmState == PM_RECOVERY_CONSISTENT ||
+          pmState == PM_STARTUP) &&
+         RecoveryStatus == RecoveryCompleted)
+     {
+         /*
+          * Startup succeeded.
+          *
+          * Go to shutdown mode if a shutdown request was pending.
+          */
+         if (Shutdown > NoShutdown)
+         {
+             pmState = PM_WAIT_BACKENDS;
+             /* PostmasterStateMachine logic does the rest */
+         }
+         else
+         {
+             /*
+              * Otherwise, commence normal operations.
+              */
+             pmState = PM_RUN;
+
+             /*
+              * Load the flat authorization file into postmaster's cache. The
+              * startup process has recomputed this from the database contents,
+              * so we wait till it finishes before loading it.
+              */
+             load_role();
+
+             /*
+              * Crank up the background writer, if we didn't do that already
+              * when we entered consistent recovery phase.  It doesn't matter
+              * if this fails, we'll just try again later.
+              */
+             if (BgWriterPID == 0)
+                 BgWriterPID = StartBackgroundWriter();
+
+             /*
+              * Likewise, start other special children as needed.  In a restart
+              * situation, some of them may be alive already.
+              */
+             if (WalWriterPID == 0)
+                 WalWriterPID = StartWalWriter();
+             if (AutoVacuumingActive() && AutoVacPID == 0)
+                 AutoVacPID = StartAutoVacLauncher();
+             if (XLogArchivingActive() && PgArchPID == 0)
+                 PgArchPID = pgarch_start();
+             if (PgStatPID == 0)
+                 PgStatPID = pgstat_start();
+
+             /* at this point we are really open for business */
+             ereport(LOG,
+                 (errmsg("database system is ready to accept connections")));
+         }
+     }
+
+     /* Shutdown states */
+
      if (pmState == PM_WAIT_BACKUP)
      {
          /*
***************
*** 2723,2728 **** PostmasterStateMachine(void)
--- 2882,2896 ----
      }

      /*
+      * If recovery failed, wait for all non-syslogger children to exit,
+      * and then exit postmaster. We don't try to reinitialize when recovery
+      * fails, because more than likely it will just fail again and we will
+      * keep trying forever.
+      */
+     if (RecoveryError && pmState == PM_NO_CHILDREN)
+         ExitPostmaster(1);
+
+     /*
       * If we need to recover from a crash, wait for all non-syslogger
       * children to exit, then reset shmem and StartupDataBase.
       */
***************
*** 2734,2739 **** PostmasterStateMachine(void)
--- 2902,2909 ----
          shmem_exit(1);
          reset_shared(PostPortNumber);

+         RecoveryStatus = NoRecovery;
+
          StartupPID = StartupDataBase();
          Assert(StartupPID != 0);
          pmState = PM_STARTUP;
***************
*** 3838,3843 **** ExitPostmaster(int status)
--- 4008,4044 ----
  }

  /*
+  * common code used in sigusr1_handler() and reaper() to handle
+  * recovery-related signals from startup process
+  */
+ static void
+ CheckRecoverySignals(void)
+ {
+     bool changed = false;
+
+     if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED))
+     {
+         Assert(pmState == PM_STARTUP);
+
+         RecoveryStatus = RecoveryStarted;
+         changed = true;
+     }
+     if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT))
+     {
+         RecoveryStatus = RecoveryConsistent;
+         changed = true;
+     }
+     if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED))
+     {
+         RecoveryStatus = RecoveryCompleted;
+         changed = true;
+     }
+
+     if (changed)
+         PostmasterStateMachine();
+ }
+
+ /*
   * sigusr1_handler - handle signal conditions from child processes
   */
  static void
***************
*** 3847,3852 **** sigusr1_handler(SIGNAL_ARGS)
--- 4048,4055 ----

      PG_SETMASK(&BlockSig);

+     CheckRecoverySignals();
+
      if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
      {
          /*
*** a/src/backend/storage/buffer/README
--- b/src/backend/storage/buffer/README
***************
*** 268,270 **** out (and anyone else who flushes buffer contents to disk must do so too).
--- 268,279 ----
  This ensures that the page image transferred to disk is reasonably consistent.
  We might miss a hint-bit update or two but that isn't a problem, for the same
  reasons mentioned under buffer access rules.
+
+ As of 8.4, background writer starts during recovery mode when there is
+ some form of potentially extended recovery to perform. It performs an
+ identical service to normal processing, except that checkpoints it
+ writes are technically restartpoints. Flushing outstanding WAL for dirty
+ buffers is also skipped, though there shouldn't ever be new WAL entries
+ at that time in any case. We could choose to start background writer
+ immediately but we hold off until we can prove the database is in a
+ consistent state so that postmaster has a single, clean state change.
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***************
*** 324,330 **** InitCommunication(void)
   * If you're wondering why this is separate from InitPostgres at all:
   * the critical distinction is that this stuff has to happen before we can
   * run XLOG-related initialization, which is done before InitPostgres --- in
!  * fact, for cases such as checkpoint creation processes, InitPostgres may
   * never be done at all.
   */
  void
--- 324,330 ----
   * If you're wondering why this is separate from InitPostgres at all:
   * the critical distinction is that this stuff has to happen before we can
   * run XLOG-related initialization, which is done before InitPostgres --- in
!  * fact, for cases such as the background writer process, InitPostgres may
   * never be done at all.
   */
  void
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 133,139 **** typedef struct XLogRecData
  } XLogRecData;

  extern TimeLineID ThisTimeLineID;        /* current TLI */
! extern bool InRecovery;
  extern XLogRecPtr XactLastRecEnd;

  /* these variables are GUC parameters related to XLOG */
--- 133,148 ----
  } XLogRecData;

  extern TimeLineID ThisTimeLineID;        /* current TLI */
!
! /*
!  * Prior to 8.4, all activity during recovery was carried out by Startup
!  * process. This local variable continues to be used in many parts of the
!  * code to indicate actions taken by RecoveryManagers. Other processes that
!  * potentially perform work during recovery should check
!  * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
!  */
! extern bool InRecovery;
!
  extern XLogRecPtr XactLastRecEnd;

  /* these variables are GUC parameters related to XLOG */
***************
*** 199,204 **** extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
--- 208,215 ----
  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

+ extern bool IsRecoveryProcessingMode(void);
+
  extern void UpdateControlFile(void);
  extern Size XLOGShmemSize(void);
  extern void XLOGShmemInit(void);
***************
*** 207,215 **** extern void StartupXLOG(void);
--- 218,229 ----
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
  extern void CreateCheckPoint(int flags);
+ extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

+ extern void StartupProcessMain(void);
+
  #endif   /* XLOG_H */
*** a/src/include/storage/pmsignal.h
--- b/src/include/storage/pmsignal.h
***************
*** 22,27 ****
--- 22,30 ----
   */
  typedef enum
  {
+     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
+     PMSIGNAL_RECOVERY_CONSISTENT, /* recovery has reached consistent state */
+     PMSIGNAL_RECOVERY_COMPLETED, /* recovery completed */
      PMSIGNAL_PASSWORD_CHANGE,    /* pg_auth file has changed */
      PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
      PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
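
For readers following the state transitions in the patch above, here is a
minimal sketch of the startup half of PostmasterStateMachine(), reduced to a
pure function. The names RecoveryProgress, PmPhase and advance_phase are made
up for illustration and do not appear in the patch; launching child processes,
load_role() and logging are left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the patch's RecoveryStatus values. */
typedef enum
{
    NO_RECOVERY,
    RECOVERY_STARTED,
    RECOVERY_CONSISTENT,
    RECOVERY_COMPLETED
} RecoveryProgress;

/* Hypothetical stand-ins for the relevant pmState values. */
typedef enum
{
    ST_STARTUP,
    ST_RECOVERY,
    ST_RECOVERY_CONSISTENT,
    ST_RUN,
    ST_WAIT_BACKENDS
} PmPhase;

/*
 * Apply the three startup checks in the same order as the patch does.
 * A pending shutdown diverts any transition to ST_WAIT_BACKENDS.
 */
static PmPhase
advance_phase(PmPhase phase, RecoveryProgress progress, int shutdown_pending)
{
    assert(progress >= NO_RECOVERY && progress <= RECOVERY_COMPLETED);

    /* WAL redo has started */
    if (phase == ST_STARTUP && progress > NO_RECOVERY)
        phase = shutdown_pending ? ST_WAIT_BACKENDS : ST_RECOVERY;

    /* a consistent point has been reached */
    if (phase == ST_RECOVERY && progress >= RECOVERY_CONSISTENT)
        phase = shutdown_pending ? ST_WAIT_BACKENDS : ST_RECOVERY_CONSISTENT;

    /* recovery has finished */
    if ((phase == ST_STARTUP ||
         phase == ST_RECOVERY ||
         phase == ST_RECOVERY_CONSISTENT) &&
        progress == RECOVERY_COMPLETED)
        phase = shutdown_pending ? ST_WAIT_BACKENDS : ST_RUN;

    return phase;
}
```

Because the checks cascade, a startup process that reports completion while
the postmaster is still in the startup phase reaches the run phase in a
single call.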

Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Mon, 2009-02-09 at 17:13 +0200, Heikki Linnakangas wrote:

> Attached is an updated patch that does that, and I've fixed all the 
> other outstanding issues I listed earlier as well. Now I'm feeling
> again that this is in pretty good shape.

UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
updated the control file, leading to log filling behaviour on an idle
system.

DEBUG:  updated min recovery point to ...

We should just tuck the message into the "if" section above it.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2009-02-09 at 17:13 +0200, Heikki Linnakangas wrote:
> 
>> Attached is an updated patch that does that, and I've fixed all the
>> other outstanding issues I listed earlier as well. Now I'm feeling
>> again that this is in pretty good shape.
> 
> UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
> updated the control file, leading to log filling behaviour on an idle
> system.
> 
> DEBUG:  updated min recovery point to ...
> 
> We should just tuck the message into the "if" section above it.

The outer "if" should ensure that it isn't printed repeatedly on an idle 
system. But I agree it belongs inside the inner if section.
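
A rough sketch of the shape being discussed, assuming an outer check against
a locally cached value and an inner check against the shared value. The real
UpdateMinRecoveryPoint() is not shown in this message, so every name below is
hypothetical. The point is only that the debug line sits inside the inner
"if", so it is printed when the stored value actually changes and not merely
because the outer check let the call through.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;                   /* hypothetical flat WAL position */

static Lsn shared_min_recovery_point;   /* stands in for the control file value */
static Lsn local_min_recovery_point;    /* this process's cached copy */
static int messages_logged;             /* counts debug lines, for demonstration */

static void
update_min_recovery_point(Lsn lsn)
{
    /* Outer check: cheap test against the possibly stale local copy. */
    if (lsn <= local_min_recovery_point)
        return;

    /* Refresh the local copy from the shared value. */
    local_min_recovery_point = shared_min_recovery_point;

    /* Inner check: only now do we know whether an update is needed. */
    if (shared_min_recovery_point < lsn)
    {
        shared_min_recovery_point = lsn;    /* "write the control file" */
        local_min_recovery_point = lsn;

        /* The message lives here, next to the actual update. */
        fprintf(stderr, "DEBUG:  updated min recovery point to %llu\n",
                (unsigned long long) lsn);
        messages_logged++;
    }

    assert(local_min_recovery_point == shared_min_recovery_point);
}
```

If another process has already advanced the shared value, the outer check can
still pass on a stale local copy; with the message in the inner block, that
case stays silent.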

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:

> The outer "if" should ensure that it isn't printed repeatedly on an idle 
> system. 

Regrettably not.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:
> 
>> The outer "if" should ensure that it isn't printed repeatedly on an idle 
>> system. 
> 
> Regrettably not.

Ok, committed. I fixed that and some comment changes. I also renamed 
IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid confusion 
with the "real" processing modes defined in miscadmin.h. That will 
probably cause you merge conflicts in the hot standby patch, but it 
should be a matter of search-replace to fix.

The changes need to be documented. At least the removal of 
log_restartpoints is a clear user-visible change.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Wed, 2009-02-18 at 18:01 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:
> > 
> >> The outer "if" should ensure that it isn't printed repeatedly on an idle 
> >> system. 
> > 
> > Regrettably not.
> 
> Ok, committed. 

Cool.

> I fixed that and some comment changes. I also renamed 
> IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid confusion 
> with the "real" processing modes defined in miscadmin.h. That will 
> probably cause you merge conflicts in the hot standby patch, but it 
> should be a matter of search-replace to fix.

Yep, good change, agree with reasons.

> The changes need to be documented. At least the removal of 
> log_restartpoints is a clear user-visible change.

Yep.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> That whole area was something I was leaving until last, since immediate
> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> this before Christmas, briefly).

This problem remains in current HEAD. I mean, immediate shutdown
may be unable to kill the startup process because system() which
executes restore_command ignores SIGQUIT while waiting.
When I tried immediate shutdown during recovery, only the startup
process survived. This is undesirable behavior, I think.

Should the following code be added to RestoreArchivedFile()?

----
if (WTERMSIG(rc) == SIGQUIT)
       exit(2);
----

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> That whole area was something I was leaving until last, since immediate
>> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
>> this before Christmas, briefly).
> 
> This problem remains in current HEAD. I mean, immediate shutdown
> may be unable to kill the startup process because system() which
> executes restore_command ignores SIGQUIT while waiting.
> When I tried immediate shutdown during recovery, only the startup
> process survived. This is undesirable behavior, I think.

Yeah, we need to fix that.

> Should the following code be added to RestoreArchivedFile()?
> 
> ----
> if (WTERMSIG(rc) == SIGQUIT)
>        exit(2);
> ----

I don't see how that helps, as we already have this in there:

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: return code %d",
                    xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with 
SIGQUIT.

I think the real problem here is that pg_standby traps SIGQUIT. The 
startup process doesn't receive the SIGQUIT because it's in system(), 
and pg_standby doesn't propagate it to the startup process either 
because it traps it.

I think we should simply remove the signal handler for SIGQUIT from 
pg_standby. Or will that lead to a core dump by default? In that case, we 
need pg_standby to exit(128) or similar, so that RestoreArchivedFile 
understands that the command was killed by a signal.

Another approach is to check that the postmaster is still alive, like we
do in walwriter and bgwriter:

    /*
     * Emergency bailout if postmaster has died.  This is to avoid the
     * necessity for manual cleanup of all postmaster children.
     */
    if (!PostmasterIsAlive(true))
        exit(1);

However, I'm afraid there's a race condition with that. If we do that 
right after system(), postmaster might've signaled us but not exited 
yet. We could check that in the main loop, but if we wrongly interpret 
the exit of the recovery command as a "file not found - go ahead and 
start up", the damage might be done by the time we notice that the 
postmaster is gone.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery infra

From
Fujii Masao
Date:
Hi,

On Fri, Feb 27, 2009 at 3:38 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I think the real problem here is that pg_standby traps SIGQUIT. The startup
> process doesn't receive the SIGQUIT because it's in system(), and pg_standby
> doesn't propagate it to the startup process either because it traps it.

Yes, you are right.

> I think we should simply remove the signal handler for SIGQUIT from
> pg_standby.

+1

> I don't see how that helps, as we already have this in there:
>
>        signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
>
>        ereport(signaled ? FATAL : DEBUG2,
>                (errmsg("could not restore file \"%s\" from archive: return code %d",
>                                xlogfname, rc)));
>
> which means we already ereport(FATAL) if the restore command dies with SIGQUIT.

SIGQUIT should kill the process immediately, so I think that the startup
process as well as other auxiliary processes should call exit(2) instead of
ereport(FATAL). Right?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby, recovery infra

From
Simon Riggs
Date:
On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:

> I think we should simply remove the signal handler for SIGQUIT from 
> pg_standby.

If you do this, please make it release dependent so pg_standby behaves
correctly for the release it is being used with.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Hot standby, recovery infra

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:
> 
>> I think we should simply remove the signal handler for SIGQUIT from 
>> pg_standby.
> 
> If you do this, please make it release dependent so pg_standby behaves
> correctly for the release it is being used with.

Hmm, I don't think there's a way for pg_standby to know which version of 
PostgreSQL is calling it. Assuming there is, how would you want it to 
behave? If you want no change in behavior in old releases, can't we just 
leave it unfixed in back-branches? In fact, it seems more useful to not 
detect the server version, so that if you do want the new behavior, you 
can use an 8.4 pg_standby against an 8.3 server.

In back-branches, I think we need to decide between fixing this, at the 
risk of breaking someone's script that is using "killall -QUIT 
pg_standby" or similar to trigger failover, and leaving it as it is 
knowing that immediate shutdown doesn't work on a standby server. I'm 
not sure which is best.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com