Re: BUG #4879: bgwriter fails to fsync the file in recovery mode - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #4879: bgwriter fails to fsync the file in recovery mode
Date
Msg-id 4A43AFF3.9040907@enterprisedb.com
Whole thread Raw
In response to Re: BUG #4879: bgwriter fails to fsync the file in recovery mode  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: BUG #4879: bgwriter fails to fsync the file in recovery mode
List pgsql-bugs
Simon Riggs wrote:
> On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote:
>
>> What about "revert the patch"?
>
> That's probably just as dangerous.

I don't feel comfortable either reverting such a big patch at last
minute. Would need a fair amount of testing to make sure that the
revertion is correct and that no patches committed after the patch are
now broken.

I'm testing the attached patch at the moment. It's the same as the
previous one, with the elog() in mdsync() issue fixed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fd9d41..274369f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5156,6 +5156,7 @@ StartupXLOG(void)
     XLogRecord *record;
     uint32        freespace;
     TransactionId oldestActiveXID;
+    bool        bgwriterLaunched = false;

     XLogCtl->SharedRecoveryInProgress = true;

@@ -5472,7 +5473,11 @@ StartupXLOG(void)
              * process in addition to postmaster!
              */
             if (InArchiveRecovery && IsUnderPostmaster)
+            {
+                SetForwardFsyncRequests();
                 SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+                bgwriterLaunched = true;
+            }

             /*
              * main redo apply loop
@@ -5742,7 +5747,17 @@ StartupXLOG(void)
          * assigning a new TLI, using a shutdown checkpoint allows us to have
          * the rule that TLI only changes in shutdown checkpoints, which
          * allows some extra error checking in xlog_redo.
+         *
+         * If bgwriter is active, we can't do the necessary fsyncs in the
+         * startup process because bgwriter has accumulated fsync requests
+         * into its private memory. So we request a last forced restart point,
+         * which will flush them to disk, before performing the actual
+         * checkpoint. It's safe to flush the buffers first, because no other
+         * process can do (WAL-logged) changes to data pages in between.
          */
+        if (bgwriterLaunched)
+            RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+                              CHECKPOINT_WAIT);
         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);

         /*
@@ -6219,27 +6234,11 @@ CreateCheckPoint(int flags)

     /*
      * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
-     * During normal operation, bgwriter is the only process that creates
-     * checkpoints, but at the end of archive recovery, the bgwriter can be
-     * busy creating a restartpoint while the startup process tries to perform
-     * the startup checkpoint.
+     * (This is just pro forma, since in the present system structure there is
+     * only one process that is allowed to issue checkpoints at any given
+     * time.)
      */
-    if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
-    {
-        Assert(InRecovery);
-
-        /*
-         * A restartpoint is in progress. Wait until it finishes. This can
-         * cause an extra restartpoint to be performed, but that's OK because
-         * we're just about to perform a checkpoint anyway. Flushing the
-         * buffers in this restartpoint can take some time, but that time is
-         * saved from the upcoming checkpoint so the net effect is zero.
-         */
-        ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint")));
-        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
-
-        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
-    }
+    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

     /*
      * Prepare to accumulate statistics.
@@ -6660,8 +6659,9 @@ CreateRestartPoint(int flags)
      * restartpoint. It's assumed that flushing the buffers will do that as a
      * side-effect.
      */
-    if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
-        XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
+    if (!(flags & CHECKPOINT_FORCE) &&
+        (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)))
     {
         XLogRecPtr    InvalidXLogRecPtr = {0, 0};

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d42e86a..56824d5 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -204,6 +204,21 @@ mdinit(void)
 }

 /*
+ * In archive recovery, we rely on bgwriter to do fsyncs(), but we don't
+ * know that we do archive recovery at process startup when pendingOpsTable
+ * has already been created. Calling this function drops pendingOpsTable
+ * and causes any subsequent requests to be forwarded to bgwriter.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+    /* Perform any pending ops we may have queued up */
+    if (pendingOpsTable)
+        mdsync();
+    pendingOpsTable = NULL;
+}
+
+/*
  *    mdexists() -- Does the physical file exist?
  *
  * Note: this will return true for lingering files, with pending deletions
@@ -908,9 +923,18 @@ mdsync(void)
     /*
      * This is only called during checkpoints, and checkpoints should only
      * occur in processes that have created a pendingOpsTable.
+     *
+     * Startup process performing a startup checkpoint after archive recovery
+     * is an exception. It has no pendingOpsTable, but that's OK because it
+     * has requested bgwriter to perform a restartpoint before the checkpoint
+     * which does mdsync() instead.
      */
     if (!pendingOpsTable)
+    {
+        if (InRecovery)
+            return;
         elog(ERROR, "cannot sync without a pendingOpsTable");
+    }

     /*
      * If we are in the bgwriter, the sync had better include all fsync
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 7556b14..fd79d7b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -109,6 +109,7 @@ extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);

+extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
                      BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);

pgsql-bugs by date:

Previous
From: Simon Riggs
Date:
Subject: Re: BUG #4879: bgwriter fails to fsync the file in recovery mode
Next
From: Tom Lane
Date:
Subject: Re: BUG #4879: bgwriter fails to fsync the file in recovery mode