Load Distributed Checkpoints, take 3 - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Load Distributed Checkpoints, take 3
Date
Msg-id 46792FF3.8000301@enterprisedb.com
Responses Re: Load Distributed Checkpoints, take 3  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Load Distributed Checkpoints, take 3  (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
List pgsql-patches
Here's an updated WIP patch for load distributed checkpoints.

I added a spinlock to protect the signaling fields between bgwriter and
backends. The current non-locking approach becomes really hard to get
right as the patch adds two new flags, and both are more important than
the existing ckpt_time_warn flag.
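
Concretely, the shared signaling area in bgwriter.c now looks like this
(abridged from the patch below; the request-queue fields are unchanged and
omitted here):

    typedef struct
    {
        pid_t       bgwriter_pid;           /* PID of bgwriter (0 if not started) */

        slock_t     ckpt_lck;               /* protects all the ckpt_* fields below */

        int         ckpt_started;           /* advances when a checkpoint starts */
        int         ckpt_done;              /* advances when a checkpoint finishes */
        int         ckpt_failed;            /* advances when a checkpoint fails */

        bool        ckpt_rqst_time_warn;    /* warn if too soon since last ckpt */
        bool        ckpt_rqst_immediate;    /* an immediate ckpt has been requested */
        bool        ckpt_rqst_force;        /* checkpoint even if no WAL activity */

        ...
    } BgWriterShmemStruct;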

In fact, I think there's a small race condition in CVS HEAD:

1. pg_start_backup() is called, which calls RequestCheckpoint
2. RequestCheckpoint takes note of the old value of ckpt_started
3. bgwriter wakes up from pg_usleep, and sees that we've exceeded
checkpoint_timeout.
4. bgwriter increases ckpt_started to note that a new checkpoint has started
5. RequestCheckpoint signals bgwriter to start a new checkpoint
6. bgwriter calls CreateCheckPoint, with the force-flag set to false
because this checkpoint was triggered by timeout
7. RequestCheckpoint sees that ckpt_started has increased, and starts to
wait for ckpt_done to reach the new value.
8. CreateCheckPoint finishes immediately, because there was no XLOG
activity since the last checkpoint.
9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns.
10. pg_start_backup() continues, with potentially the same redo location
and thus the same history filename as the previous backup.

Now I admit that the chances of that happening are extremely small;
people don't usually issue two pg_start_backup calls without *any*
WAL-logged activity in between them, for example. But as we add the new
flags, avoiding scenarios like that becomes harder.
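
To make the fix explicit, here's the request/acknowledge handshake in
condensed form (simplified from the patch below, not the literal code).
The point is that a backend snapshots ckpt_started in the same critical
section where it sets its request flags, so once it sees ckpt_started
advance it knows the bgwriter has consumed exactly those flags:

    /* Backend side, in RequestCheckpoint(): */
    SpinLockAcquire(&bgs->ckpt_lck);
    old_started = bgs->ckpt_started;        /* snapshot the counters ... */
    old_failed = bgs->ckpt_failed;
    if (immediate)
        bgs->ckpt_rqst_immediate = true;    /* ... together with the request flags */
    SpinLockRelease(&bgs->ckpt_lck);

    /* Bgwriter side, just before calling CreateCheckPoint(): */
    SpinLockAcquire(&bgs->ckpt_lck);
    immediate = bgs->ckpt_rqst_immediate;   /* consume the flags ... */
    bgs->ckpt_rqst_immediate = false;
    bgs->ckpt_started++;                    /* ... and acknowledge the start
                                             * atomically with consuming them */
    SpinLockRelease(&bgs->ckpt_lck);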

Since the last patch, I did some cleanup and refactoring, and added a
bunch of comments and user documentation.

I haven't yet changed GetInsertRecPtr to use the almost up-to-date value
protected by the info_lck per Simon's suggestion, and I need to do some
correctness testing. After that, I'm done with the patch.
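
For comparison, the info_lck-based version Simon suggested would
presumably look something like this (untested sketch, not part of the
attached patch; it assumes XLogCtl->LogwrtRqst.Write, which is advanced
under info_lck when the insert position crosses a page boundary, is a
close-enough approximation of the insert position for the progress
estimate):

    XLogRecPtr
    GetInsertRecPtr(void)
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;
        XLogRecPtr    recptr;

        SpinLockAcquire(&xlogctl->info_lck);
        recptr = xlogctl->LogwrtRqst.Write;
        SpinLockRelease(&xlogctl->info_lck);

        return recptr;
    }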

PS. In case you're wondering what took me so long since the last
revision, I've spent a lot of time reviewing HOT.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml    7 Jun 2007 19:19:56 -0000    1.126
--- doc/src/sgml/config.sgml    19 Jun 2007 14:24:31 -0000
***************
*** 1565,1570 ****
--- 1565,1608 ----
        </listitem>
       </varlistentry>

+      <varlistentry id="guc-checkpoint-smoothing" xreflabel="checkpoint_smoothing">
+       <term><varname>checkpoint_smoothing</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_smoothing</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the target length of checkpoints, as a fraction of
+         the checkpoint interval. The default is 0.3.
+
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checkpoint-rate" xreflabel="checkpoint_rate">
+       <term><varname>checkpoint_rate</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_rate</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the minimum I/O rate used to flush dirty buffers during a
+         checkpoint, when there are not many dirty buffers in the buffer cache.
+         The default is 512 KB/s.
+
+         Note: the accuracy of this setting depends on
+         <varname>bgwriter_delay</varname>. This value is converted internally
+         to pages / bgwriter_delay, so if, for example, the minimum allowed
+         bgwriter_delay setting of 10ms is used, the effective minimum
+         checkpoint I/O rate is 1 page / 10 ms, or 800 KB/s.
+
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml    31 Jan 2007 20:56:19 -0000    1.43
--- doc/src/sgml/wal.sgml    19 Jun 2007 14:26:45 -0000
***************
*** 217,225 ****
    </para>

    <para>
     There will be at least one WAL segment file, and will normally
     not be more than 2 * <varname>checkpoint_segments</varname> + 1
!    files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
     Ordinarily, when old log segment files are no longer needed, they
--- 217,245 ----
    </para>

    <para>
+    If there are a lot of dirty buffers in the buffer cache, flushing them
+    all at checkpoint time causes a heavy burst of I/O that can disrupt other
+    activity in the system. To avoid that, the checkpoint I/O can be distributed
+    over a longer period of time, defined with
+    <varname>checkpoint_smoothing</varname>. It's given as a fraction of the
+    checkpoint interval, as defined by <varname>checkpoint_timeout</varname>
+    and <varname>checkpoint_segments</varname>. The WAL segment consumption
+    and elapsed time are monitored and the I/O rate is adjusted during the
+    checkpoint so that it finishes when the given fraction of elapsed time
+    or WAL segments has passed, whichever comes first. However, that could lead
+    to unnecessarily prolonged checkpoints when there are not many dirty buffers
+    in the cache. To avoid that, <varname>checkpoint_rate</varname> can be used
+    to set the minimum I/O rate used. Note that prolonging checkpoints
+    affects recovery time, because the longer the checkpoint takes, the more
+    WAL needs to be kept around and replayed in recovery.
+   </para>
+
+   <para>
     There will be at least one WAL segment file, and will normally
     not be more than 2 * <varname>checkpoint_segments</varname> + 1
!    files, though there can be more if a large
!    <varname>checkpoint_smoothing</varname> setting is used.
!    Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
     Ordinarily, when old log segment files are no longer needed, they
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c    31 May 2007 15:13:01 -0000    1.272
--- src/backend/access/transam/xlog.c    20 Jun 2007 10:44:40 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestCheckpoint(false, true);
                      }
                  }
              }
--- 1608,1614 ----
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestXLogFillCheckpoint();
                      }
                  }
              }
***************
*** 5110,5116 ****
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true);

          /*
           * Close down recovery environment
--- 5110,5116 ----
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true, true);

          /*
           * Close down recovery environment
***************
*** 5319,5324 ****
--- 5319,5340 ----
  }

  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+     XLogCtlInsert  *Insert = &XLogCtl->Insert;
+     XLogRecPtr        recptr;
+
+     LWLockAcquire(WALInsertLock, LW_SHARED);
+     INSERT_RECPTR(recptr, Insert, Insert->curridx);
+     LWLockRelease(WALInsertLock);
+
+     return recptr;
+ }
+
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
***************
*** 5383,5389 ****
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
--- 5399,5405 ----
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
***************
*** 5395,5405 ****
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
--- 5411,5424 ----
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
+  * If immediate is true, we try to finish the checkpoint as fast as we can,
+  * ignoring the checkpoint_smoothing parameter.
+  *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool immediate, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
***************
*** 5591,5597 ****
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo);

      START_CRIT_SECTION();

--- 5610,5616 ----
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo, immediate);

      START_CRIT_SECTION();

***************
*** 5693,5708 ****
  /*
   * Flush all data in shared memory to disk, and fsync
   *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool();            /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
--- 5712,5730 ----
  /*
   * Flush all data in shared memory to disk, and fsync
   *
+  * If immediate is true, try to finish as quickly as possible, ignoring
+  * the GUC variables that throttle checkpoint I/O.
+  *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool(immediate);        /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
***************
*** 5710,5716 ****
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckpoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
--- 5732,5738 ----
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
***************
*** 5751,5757 ****
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo);

      /*
       * Update pg_control so that any subsequent crash will restart from this
--- 5773,5779 ----
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, true);

      /*
       * Update pg_control so that any subsequent crash will restart from this
***************
*** 6177,6183 ****
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestCheckpoint(true, false);

          /*
           * Now we need to fetch the checkpoint record location, and also its
--- 6199,6205 ----
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestLazyCheckpoint();

          /*
           * Now we need to fetch the checkpoint record location, and also its
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c    7 Mar 2007 13:35:02 -0000    1.233
--- src/backend/bootstrap/bootstrap.c    19 Jun 2007 15:29:51 -0000
***************
*** 489,495 ****

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true);

      /* Clean up and exit */
      cleanup();
--- 489,495 ----

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true, true);

      /* Clean up and exit */
      cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.195
diff -c -r1.195 dbcommands.c
*** src/backend/commands/dbcommands.c    1 Jun 2007 19:38:07 -0000    1.195
--- src/backend/commands/dbcommands.c    20 Jun 2007 09:36:24 -0000
***************
*** 404,410 ****
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync();

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
--- 404,410 ----
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(true);

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
***************
*** 507,513 ****
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestCheckpoint(true, false);

          /*
           * Close pg_database, but keep lock till commit (this is important to
--- 507,513 ----
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestImmediateCheckpoint();

          /*
           * Close pg_database, but keep lock till commit (this is important to
***************
*** 661,667 ****
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestCheckpoint(true, false);
  #endif

      /*
--- 661,667 ----
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestImmediateCheckpoint();
  #endif

      /*
***************
*** 1427,1433 ****
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync();

          /*
           * Copy this subdirectory to the new location
--- 1427,1433 ----
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync(true);

          /*
           * Copy this subdirectory to the new location
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.38
diff -c -r1.38 bgwriter.c
*** src/backend/postmaster/bgwriter.c    27 May 2007 03:50:39 -0000    1.38
--- src/backend/postmaster/bgwriter.c    20 Jun 2007 12:58:20 -0000
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"

  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>

***************
*** 59,64 ****
--- 60,66 ----
  #include "storage/pmsignal.h"
  #include "storage/shmem.h"
  #include "storage/smgr.h"
+ #include "storage/spin.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 112,122 ****
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     sig_atomic_t ckpt_started;    /* advances when checkpoint starts */
!     sig_atomic_t ckpt_done;        /* advances when checkpoint done */
!     sig_atomic_t ckpt_failed;    /* advances when checkpoint fails */

!     sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
--- 114,128 ----
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     slock_t        ckpt_lck;        /* protects all the ckpt_* fields */

!     int            ckpt_started;    /* advances when checkpoint starts */
!     int            ckpt_done;        /* advances when checkpoint done */
!     int            ckpt_failed;    /* advances when checkpoint fails */
!
!     bool    ckpt_rqst_time_warn;    /* warn if too soon since last ckpt */
!     bool    ckpt_rqst_immediate;    /* an immediate ckpt has been requested */
!     bool    ckpt_rqst_force;        /* checkpoint even if no WAL activity */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
***************
*** 131,136 ****
--- 137,143 ----
  int            BgWriterDelay = 200;
  int            CheckPointTimeout = 300;
  int            CheckPointWarning = 30;
+ double        CheckPointSmoothing = 0.3;

  /*
   * Flags set by interrupt handlers for later service in the main loop.
***************
*** 146,154 ****
--- 153,176 ----

  static bool ckpt_active = false;

+ /* Current time and WAL insert location when checkpoint was started */
+ static time_t        ckpt_start_time;
+ static XLogRecPtr    ckpt_start_recptr;
+
+ static double        ckpt_cached_elapsed;
+
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;

+ /* Prototypes for private functions */
+
+ static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force);
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(void);
+ static bool IsCheckpointOnSchedule(double progress);
+ static bool ImmediateCheckpointRequested(void);
+
+ /* Signal handlers */

  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
***************
*** 170,175 ****
--- 192,198 ----

      Assert(BgWriterShmem != NULL);
      BgWriterShmem->bgwriter_pid = MyProcPid;
+     SpinLockInit(&BgWriterShmem->ckpt_lck);
      am_bg_writer = true;

      /*
***************
*** 281,288 ****
--- 304,314 ----
              /* use volatile pointer to prevent code rearrangement */
              volatile BgWriterShmemStruct *bgs = BgWriterShmem;

+             SpinLockAcquire(&BgWriterShmem->ckpt_lck);
              bgs->ckpt_failed++;
              bgs->ckpt_done = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
              ckpt_active = false;
          }

***************
*** 328,337 ****
      for (;;)
      {
          bool        do_checkpoint = false;
-         bool        force_checkpoint = false;
          time_t        now;
          int            elapsed_secs;
-         long        udelay;

          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
--- 354,361 ----
***************
*** 354,360 ****
          {
              checkpoint_requested = false;
              do_checkpoint = true;
-             force_checkpoint = true;
              BgWriterStats.m_requested_checkpoints++;
          }
          if (shutdown_requested)
--- 378,383 ----
***************
*** 377,387 ****
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             if (!force_checkpoint)
!                 BgWriterStats.m_timed_checkpoints++;
          }

          /*
--- 400,409 ----
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (!do_checkpoint && elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             BgWriterStats.m_timed_checkpoints++;
          }

          /*
***************
*** 390,395 ****
--- 412,445 ----
           */
          if (do_checkpoint)
          {
+             /* use volatile pointer to prevent code rearrangement */
+             volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+             bool time_warn;
+             bool immediate;
+             bool force;
+
+             /*
+              * Atomically check the request flags to figure out what
+              * kind of a checkpoint we should perform, and increase the
+              * started-counter to acknowledge that we've started
+              * a new checkpoint.
+              */
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+
+             time_warn = bgs->ckpt_rqst_time_warn;
+             bgs->ckpt_rqst_time_warn = false;
+
+             immediate = bgs->ckpt_rqst_immediate;
+             bgs->ckpt_rqst_immediate = false;
+
+             force = bgs->ckpt_rqst_force;
+             bgs->ckpt_rqst_force = false;
+
+             bgs->ckpt_started++;
+
+             SpinLockRelease(&bgs->ckpt_lck);
+
              /*
               * We will warn if (a) too soon since last checkpoint (whatever
               * caused it) and (b) somebody has set the ckpt_time_warn flag
***************
*** 397,417 ****
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (BgWriterShmem->ckpt_time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!             BgWriterShmem->ckpt_time_warn = false;

              /*
               * Indicate checkpoint start to any waiting backends.
               */
              ckpt_active = true;
-             BgWriterShmem->ckpt_started++;

!             CreateCheckPoint(false, force_checkpoint);

              /*
               * After any checkpoint, close all smgr files.    This is so we
--- 447,474 ----
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!

              /*
               * Indicate checkpoint start to any waiting backends.
               */
              ckpt_active = true;

!             ckpt_start_recptr = GetInsertRecPtr();
!             ckpt_start_time = now;
!             ckpt_cached_elapsed = 0;
!
!             elog(DEBUG1, "CHECKPOINT: start");
!
!             CreateCheckPoint(false, immediate, force);
!
!             elog(DEBUG1, "CHECKPOINT: end");

              /*
               * After any checkpoint, close all smgr files.    This is so we
***************
*** 422,428 ****
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
              ckpt_active = false;

              /*
--- 479,487 ----
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             SpinLockAcquire(&bgs->ckpt_lck);
!             bgs->ckpt_done = bgs->ckpt_started;
!             SpinLockRelease(&bgs->ckpt_lck);
              ckpt_active = false;

              /*
***************
*** 433,446 ****
              last_checkpoint_time = now;
          }
          else
!             BgBufferSync();

          /*
!          * Check for archive_timeout, if so, switch xlog files.  First we do a
!          * quick check using possibly-stale local state.
           */
!         if (XLogArchiveTimeout > 0 &&
!             (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
--- 492,530 ----
              last_checkpoint_time = now;
          }
          else
!         {
!             BgAllSweep();
!             BgLruSweep();
!         }

          /*
!          * Check for archive_timeout and switch xlog files if necessary.
           */
!         CheckArchiveTimeout();
!
!         /* Nap for the configured time. */
!         BgWriterNap();
!     }
! }
!
! /*
!  * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
!  *        if needed
!  */
! static void
! CheckArchiveTimeout(void)
! {
!     time_t        now;
!
!     if (XLogArchiveTimeout <= 0)
!         return;
!
!     now = time(NULL);
!
!     /* First we do a quick check using possibly-stale local state. */
!     if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
!         return;
!
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
***************
*** 450,459 ****

              last_xlog_switch_time = Max(last_xlog_switch_time, last_time);

-             /* if we did a checkpoint, 'now' might be stale too */
-             if (do_checkpoint)
-                 now = time(NULL);
-
              /* Now we can do the real check */
              if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
              {
--- 534,539 ----
***************
*** 478,483 ****
--- 558,572 ----
                  last_xlog_switch_time = now;
              }
          }
+ }
+
+ /*
+  * BgWriterNap -- Nap for the configured time or until a signal is received.
+  */
+ static void
+ BgWriterNap(void)
+ {
+     long        udelay;

          /*
           * Send off activity statistics to the stats collector
***************
*** 496,502 ****
           * We absorb pending requests after each short sleep.
           */
          if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
!             (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0))
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
--- 585,592 ----
           * We absorb pending requests after each short sleep.
           */
          if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
!             (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) ||
!             ckpt_active)
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
***************
*** 505,522 ****

          while (udelay > 999999L)
          {
!             if (got_SIGHUP || checkpoint_requested || shutdown_requested)
                  break;
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!         if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
              pg_usleep(udelay);
      }
  }


  /* --------------------------------
   *        signal handler routines
--- 595,766 ----

          while (udelay > 999999L)
          {
!             /* If a checkpoint is active, postpone reloading the config
!              * until the checkpoint is finished, and ignore any
!              * non-immediate checkpoint requests.
!              */
!             if (shutdown_requested ||
!                 (!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
!                 (ckpt_active && ImmediateCheckpointRequested()))
                  break;
+
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!
!         if (!(shutdown_requested ||
!               (!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
!               (ckpt_active && ImmediateCheckpointRequested())))
              pg_usleep(udelay);
+ }
+
+ /*
+  * Returns true if an immediate checkpoint request is pending.
+  */
+ static bool
+ ImmediateCheckpointRequested(void)
+ {
+     if (checkpoint_requested)
+     {
+         volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+         /*
+          * We're only looking at a single field, so we don't need to
+          * acquire the lock in this case.
+          */
+         if (bgs->ckpt_rqst_immediate)
+             return true;
      }
+     return false;
  }

+ /*
+  * CheckpointWriteDelay -- periodical sleep in checkpoint write phase
+  *
+  * During checkpoint, this is called periodically by the buffer manager while
+  * writing out dirty buffers from the shared buffer cache. We estimate whether
+  * we've made enough progress to finish this checkpoint in time before the
+  * next one is due, taking checkpoint_smoothing into account.
+  * If so, we perform one round of normal bgwriter activity including LRU-
+  * cleaning of buffer cache, switching xlog segment if archive_timeout has
+  * passed, and sleeping for BgWriterDelay msecs.
+  *
+  * 'progress' is an estimate of how much of the write work has been done, as a
+  * fraction between 0.0 meaning none, and 1.0 meaning all done.
+  */
+ void
+ CheckpointWriteDelay(double progress)
+ {
+     /*
+      * Return immediately if we should finish the checkpoint ASAP.
+      */
+     if (!am_bg_writer || CheckPointSmoothing <= 0 || shutdown_requested ||
+         ImmediateCheckpointRequested())
+         return;
+
+     elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress);
+
+     /* Take a nap and perform the usual bgwriter duties, unless we're behind
+      * schedule, in which case we just try to catch up as quickly as possible.
+      */
+     if (IsCheckpointOnSchedule(progress))
+     {
+         CheckArchiveTimeout();
+         BgLruSweep();
+         BgWriterNap();
+     }
+ }
+
+ /*
+  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+  *         in time?
+  *
+  * Compares the current progress against the time/segments elapsed since the
+  * last checkpoint, and returns true if the progress we've made so far is
+  * greater than the elapsed time/segments.
+  *
+  * If another checkpoint has already been requested, always return false.
+  */
+ static bool
+ IsCheckpointOnSchedule(double progress)
+ {
+     struct timeval    now;
+     XLogRecPtr        recptr;
+     double            progress_in_time,
+                     progress_in_xlog;
+
+     Assert(ckpt_active);
+
+     /* scale progress according to CheckPointSmoothing */
+     progress *= CheckPointSmoothing;
+
+     /*
+      * Check against the cached value first. Only do the more expensive
+      * calculations once we reach the target previously calculated. Since
+      * neither time nor the WAL insert pointer moves backwards, a freshly
+      * calculated value can only be greater than or equal to the cached value.
+      */
+     if (progress < ckpt_cached_elapsed)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Still behind cached=%.3f, progress=%.3f",
+              ckpt_cached_elapsed, progress);
+         return false;
+     }
+
+     gettimeofday(&now, NULL);
+
+     /*
+      * Check progress against time elapsed and checkpoint_timeout.
+      */
+     progress_in_time = ((double) (now.tv_sec - ckpt_start_time) +
+         now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+     if (progress < progress_in_time)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, progress=%.3f",
+              progress_in_time, progress);
+
+         ckpt_cached_elapsed = progress_in_time;
+
+         return false;
+     }
+
+     /*
+      * Check progress against WAL segments written and checkpoint_segments.
+      *
+      * We compare the current WAL insert location against the location
+      * computed before calling CreateCheckPoint. The code in XLogInsert that
+      * actually triggers a checkpoint when checkpoint_segments is exceeded
+      * compares against RedoRecptr, so this is not completely accurate.
+      * However, it's good enough for our purposes; we're only calculating
+      * an estimate anyway.
+      */
+     recptr = GetInsertRecPtr();
+     progress_in_xlog =
+         (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+         CheckPointSegments;
+
+     if (progress < progress_in_xlog)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, progress=%.3f",
+              progress_in_xlog, progress);
+
+         ckpt_cached_elapsed = progress_in_xlog;
+
+         return false;
+     }
+
+
+     /* It looks like we're on schedule. */
+
+     elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f progress=%.3f",
+          progress_in_time, progress_in_xlog, progress);
+
+     return true;
+ }

  /* --------------------------------
   *        signal handler routines
***************
*** 618,625 ****
  }

  /*
   * RequestCheckpoint
!  *        Called in backend processes to request an immediate checkpoint
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
--- 862,910 ----
  }

  /*
+  * RequestImmediateCheckpoint
+  *        Called in backend processes to request an immediate checkpoint.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestImmediateCheckpoint(void)
+ {
+     RequestCheckpoint(true, false, true, true);
+ }
+
+ /*
+  * RequestLazyCheckpoint
+  *        Called in backend processes to request a lazy checkpoint.
+  *
+  * This is essentially the same as RequestImmediateCheckpoint, except
+  * that this form obeys the checkpoint_smoothing GUC variable, and
+  * can therefore take a lot longer.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestLazyCheckpoint(void)
+ {
+     RequestCheckpoint(true, false, false, true);
+ }
+
+ /*
+  * RequestXLogFillCheckpoint
+  *        Signals the bgwriter that we've reached checkpoint_segments
+  *
+  * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, this returns
+  * immediately without waiting for the checkpoint to finish.
+  */
+ void
+ RequestXLogFillCheckpoint(void)
+ {
+     RequestCheckpoint(false, true, false, false);
+ }
+
+ /*
   * RequestCheckpoint
!  *        Common subroutine for all the above Request*Checkpoint variants.
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
***************
*** 628,648 ****
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
! void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     sig_atomic_t old_failed = bgs->ckpt_failed;
!     sig_atomic_t old_started = bgs->ckpt_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         CreateCheckPoint(false, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
--- 913,942 ----
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
+  *
+  * If immediate is true, the checkpoint should be finished ASAP.
+  *
+  * If force is true, force a checkpoint even if no XLOG activity has occurred
+  * since the last one.
   */
! static void
! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     int old_failed, old_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         /*
!          * There's no point in doing lazy checkpoints in a standalone
!          * backend, because there are no other backends the checkpoint could
!          * disrupt.
!          */
!         CreateCheckPoint(false, true, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
***************
*** 653,661 ****
          return;
      }

!     /* Set warning request flag if appropriate */
      if (warnontime)
!         bgs->ckpt_time_warn = true;

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
--- 947,974 ----
          return;
      }

!     /*
!      * Atomically set the request flags, and take a snapshot of the counters.
!      * This ensures that when we see that ckpt_started > old_started,
!      * we know the flags we set here have been seen by bgwriter.
!      *
!      * Note that we effectively OR the flags with any existing flags, to
!      * avoid overriding a "stronger" request by another backend.
!      */
!     SpinLockAcquire(&bgs->ckpt_lck);
!
!     old_failed = bgs->ckpt_failed;
!     old_started = bgs->ckpt_started;
!
!     /* Set request flags as appropriate */
      if (warnontime)
!         bgs->ckpt_rqst_time_warn = true;
!     if (immediate)
!         bgs->ckpt_rqst_immediate = true;
!     if (force)
!         bgs->ckpt_rqst_force = true;
!
!     SpinLockRelease(&bgs->ckpt_lck);

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
***************
*** 674,701 ****
       */
      if (waitforit)
      {
!         while (bgs->ckpt_started == old_started)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
-         old_started = bgs->ckpt_started;

          /*
!          * We are waiting for ckpt_done >= old_started, in a modulo sense.
!          * This is a little tricky since we don't know the width or signedness
!          * of sig_atomic_t.  We make the lowest common denominator assumption
!          * that it is only as wide as "char".  This means that this algorithm
!          * will cope correctly as long as we don't sleep for more than 127
!          * completed checkpoints.  (If we do, we will get another chance to
!          * exit after 128 more checkpoints...)
           */
!         while (((signed char) (bgs->ckpt_done - old_started)) < 0)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!         if (bgs->ckpt_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
--- 987,1031 ----
       */
      if (waitforit)
      {
!         int new_started, new_failed;
!
!         /* Wait for a new checkpoint to start. */
!         for(;;)
          {
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_started = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if (new_started != old_started)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }

          /*
!          * We are waiting for ckpt_done >= new_started, in a modulo sense.
!          * This algorithm will cope correctly as long as we don't sleep for
!          * more than INT_MAX completed checkpoints.  (If we do, we will get
!          * another chance to exit after INT_MAX more checkpoints...)
           */
!         for(;;)
          {
+             int new_done;
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_done = bgs->ckpt_done;
+             new_failed = bgs->ckpt_failed;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if(new_done - new_started >= 0)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!
!         if (new_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.220
diff -c -r1.220 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    30 May 2007 20:11:58 -0000    1.220
--- src/backend/storage/buffer/bufmgr.c    20 Jun 2007 12:47:43 -0000
***************
*** 32,38 ****
   *
   * BufferSync() -- flush all dirty buffers in the buffer pool.
   *
!  * BgBufferSync() -- flush some dirty buffers in the buffer pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
--- 32,40 ----
   *
   * BufferSync() -- flush all dirty buffers in the buffer pool.
   *
!  * BgAllSweep() -- write out some dirty buffers in the pool.
!  *
!  * BgLruSweep() -- write out some lru dirty buffers in the pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
***************
*** 74,79 ****
--- 76,82 ----
  double        bgwriter_all_percent = 0.333;
  int            bgwriter_lru_maxpages = 5;
  int            bgwriter_all_maxpages = 5;
+ int            checkpoint_rate = 512; /* in pages/s */


  long        NDirectFileRead;    /* some I/O's are direct file access. bypass
***************
*** 645,651 ****
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

--- 648,654 ----
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

***************
*** 1000,1037 ****
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
!     int            buf_id;
      int            num_to_scan;
      int            absorb_counter;

      /*
       * Find out where to start the circular scan.
       */
!     buf_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers.
       */
      num_to_scan = NBuffers;
      absorb_counter = WRITES_PER_ABSORB;
      while (num_to_scan-- > 0)
      {
!         if (SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;

              /*
               * If in bgwriter, absorb pending fsync requests after each
               * WRITES_PER_ABSORB write operations, to prevent overflow of the
               * fsync request queue.  If not in bgwriter process, this is a
               * no-op.
               */
              if (--absorb_counter <= 0)
              {
--- 1003,1127 ----
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
+  * If 'immediate' is true, write them all ASAP; otherwise throttle the
+  * I/O rate according to the checkpoint_rate GUC variable, and perform
+  * normal bgwriter duties periodically.
   */
  void
! BufferSync(bool immediate)
  {
!     int            buf_id, start_id;
      int            num_to_scan;
+     int            num_to_write;
+     int            num_written;
      int            absorb_counter;
+     int            num_written_since_nap;
+     int            writes_per_nap;
+
+     /*
+      * Convert checkpoint_rate to the number of writes to perform in a
+      * period of BgWriterDelay. The result is an integer, so we lose some
+      * precision here. There are a lot of other factors as well that affect
+      * the real rate, for example the granularity of the OS timer used for
+      * BgWriterDelay, whether any of the writes block, and time spent in
+      * CheckpointWriteDelay performing normal bgwriter duties.
+      */
+     writes_per_nap = Max(1, checkpoint_rate * BgWriterDelay / 1000);

      /*
       * Find out where to start the circular scan.
       */
!     start_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers, and mark the ones that need to be written with
!      * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
!      * can estimate how much work needs to be done.
!      *
!      * This allows us to only write those pages that were dirty when the
!      * checkpoint began, and haven't been flushed to disk since. Whenever a
!      * page with BM_CHECKPOINT_NEEDED is written out by normal backends or
!      * the bgwriter LRU-scan, the flag is cleared, and any pages dirtied after
!      * this scan don't have the flag set.
!      */
!     num_to_scan = NBuffers;
!     num_to_write = 0;
!     buf_id = start_id;
!     while (num_to_scan-- > 0)
!     {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!
!         /*
!          * Header spinlock is enough to examine BM_DIRTY, see comment in
!          * SyncOneBuffer.
!          */
!         LockBufHdr(bufHdr);
!
!         if (bufHdr->flags & BM_DIRTY)
!         {
!             bufHdr->flags |= BM_CHECKPOINT_NEEDED;
!             num_to_write++;
!         }
!
!         UnlockBufHdr(bufHdr);
!
!         if (++buf_id >= NBuffers)
!             buf_id = 0;
!     }
!
!     elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers);
!
!     /*
!      * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.
       */
      num_to_scan = NBuffers;
+     num_written = num_written_since_nap = 0;
      absorb_counter = WRITES_PER_ABSORB;
+     buf_id = start_id;
      while (num_to_scan-- > 0)
      {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!         bool needs_flush;
!
!         /* We don't need to acquire the lock here, because we're
!          * only looking at a single bit. It's possible that someone
!          * else writes the buffer and clears the flag right after we
!          * check, but that doesn't matter. This assumes that no-one
!          * clears the flag and sets it again while holding the buffer
!          * header lock, expecting no-one to see the intermediate state.
!          */
!         needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0;
!
!         if (needs_flush && SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;
+             num_written++;
+
+             /*
+              * Perform normal bgwriter duties and sleep to throttle
+              * our I/O rate.
+              */
+             if (!immediate && ++num_written_since_nap >= writes_per_nap)
+             {
+                 num_written_since_nap = 0;
+                 CheckpointWriteDelay((double) (num_written) / num_to_write);
+             }

              /*
               * If in bgwriter, absorb pending fsync requests after each
               * WRITES_PER_ABSORB write operations, to prevent overflow of the
               * fsync request queue.  If not in bgwriter process, this is a
               * no-op.
+              *
+              * AbsorbFsyncRequests is also called inside CheckpointWriteDelay,
+              * so this is partially redundant. However, we can't rely solely
+              * on the call in CheckpointWriteDelay, because it's only made
+              * before sleeping. In case CheckpointWriteDelay doesn't sleep,
+              * we need to absorb pending requests ourselves.
               */
              if (--absorb_counter <= 0)
              {
***************
*** 1045,1059 ****
  }

  /*
!  * BgBufferSync -- Write out some dirty buffers in the pool.
   *
   * This is called periodically by the background writer process.
   */
  void
! BgBufferSync(void)
  {
      static int    buf_id1 = 0;
-     int            buf_id2;
      int            num_to_scan;
      int            num_written;

--- 1135,1152 ----
  }

  /*
!  * BgAllSweep -- Write out some dirty buffers in the pool.
   *
+  * Runs the bgwriter all-sweep algorithm to write dirty buffers to
+  * minimize work at checkpoint time.
   * This is called periodically by the background writer process.
+  *
+  * XXX: Is this really needed with load distributed checkpoints?
   */
  void
! BgAllSweep(void)
  {
      static int    buf_id1 = 0;
      int            num_to_scan;
      int            num_written;

***************
*** 1063,1072 ****
      /*
       * To minimize work at checkpoint time, we want to try to keep all the
       * buffers clean; this motivates a scan that proceeds sequentially through
!      * all buffers.  But we are also charged with ensuring that buffers that
!      * will be recycled soon are clean when needed; these buffers are the ones
!      * just ahead of the StrategySyncStart point.  We make a separate scan
!      * through those.
       */

      /*
--- 1156,1162 ----
      /*
       * To minimize work at checkpoint time, we want to try to keep all the
       * buffers clean; this motivates a scan that proceeds sequentially through
!      * all buffers.
       */

      /*
***************
*** 1098,1103 ****
--- 1188,1210 ----
          }
          BgWriterStats.m_buf_written_all += num_written;
      }
+ }
+
+ /*
+  * BgLruSweep -- Write out some lru dirty buffers in the pool.
+  */
+ void
+ BgLruSweep(void)
+ {
+     int            buf_id2;
+     int            num_to_scan;
+     int            num_written;
+
+     /*
+      * The purpose of this sweep is to ensure that buffers that
+      * will be recycled soon are clean when needed; these buffers are the ones
+      * just ahead of the StrategySyncStart point.
+      */

      /*
       * This loop considers only unpinned buffers close to the clock sweep
***************
*** 1341,1349 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
!     BufferSync();
      smgrsync();
  }

--- 1448,1459 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
!     elog(DEBUG1, "CHECKPOINT: write phase");
!     BufferSync(immediate || CheckPointSmoothing <= 0);
!
!     elog(DEBUG1, "CHECKPOINT: sync phase");
      smgrsync();
  }

***************
*** 2132,2138 ****
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~BM_DIRTY;
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
--- 2242,2248 ----
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.280
diff -c -r1.280 utility.c
*** src/backend/tcop/utility.c    30 May 2007 20:12:01 -0000    1.280
--- src/backend/tcop/utility.c    20 Jun 2007 09:36:31 -0000
***************
*** 1089,1095 ****
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestCheckpoint(true, false);
              break;

          case T_ReindexStmt:
--- 1089,1095 ----
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestImmediateCheckpoint();
              break;

          case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.396
diff -c -r1.396 guc.c
*** src/backend/utils/misc/guc.c    8 Jun 2007 18:23:52 -0000    1.396
--- src/backend/utils/misc/guc.c    20 Jun 2007 10:14:06 -0000
***************
*** 1487,1492 ****
--- 1487,1503 ----
          30, 0, INT_MAX, NULL, NULL
      },

+
+     {
+         {"checkpoint_rate", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("Minimum I/O rate used to write dirty buffers during checkpoints."),
+             NULL,
+             GUC_UNIT_BLOCKS
+         },
+         &checkpoint_rate,
+         512, 0, 100000, NULL, NULL
+     },
+
      {
          {"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
              gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
***************
*** 1866,1871 ****
--- 1877,1891 ----
          0.1, 0.0, 100.0, NULL, NULL
      },

+     {
+         {"checkpoint_smoothing", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."),
+             NULL
+         },
+         &CheckPointSmoothing,
+         0.3, 0.0, 0.9, NULL, NULL
+     },
+
      /* End-of-list marker */
      {
          {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.216
diff -c -r1.216 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    3 Jun 2007 17:08:15 -0000    1.216
--- src/backend/utils/misc/postgresql.conf.sample    20 Jun 2007 10:03:17 -0000
***************
*** 168,173 ****
--- 168,175 ----

  #checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min        # range 30s-1h
+ #checkpoint_smoothing = 0.3        # checkpoint duration, range 0.0 - 0.9
+ #checkpoint_rate = 512.0KB        # min. checkpoint write rate per second
  #checkpoint_warning = 30s        # 0 is off

  # - Archiving -
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.78
diff -c -r1.78 xlog.h
*** src/include/access/xlog.h    30 May 2007 20:12:02 -0000    1.78
--- src/include/access/xlog.h    19 Jun 2007 14:10:07 -0000
***************
*** 171,179 ****
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
--- 171,180 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h    5 Jan 2007 22:19:57 -0000    1.9
--- src/include/postmaster/bgwriter.h    20 Jun 2007 09:27:20 -0000
***************
*** 20,29 ****
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;

  extern void BackgroundWriterMain(void);

! extern void RequestCheckpoint(bool waitforit, bool warnontime);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 20,33 ----
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;
+ extern double CheckPointSmoothing;

  extern void BackgroundWriterMain(void);

! extern void RequestImmediateCheckpoint(void);
! extern void RequestLazyCheckpoint(void);
! extern void RequestXLogFillCheckpoint(void);
! extern void CheckpointWriteDelay(double progress);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.90
diff -c -r1.90 buf_internals.h
*** src/include/storage/buf_internals.h    30 May 2007 20:12:03 -0000    1.90
--- src/include/storage/buf_internals.h    12 Jun 2007 11:42:23 -0000
***************
*** 35,40 ****
--- 35,41 ----
  #define BM_IO_ERROR                (1 << 4)        /* previous I/O failed */
  #define BM_JUST_DIRTIED            (1 << 5)        /* dirtied since write started */
  #define BM_PIN_COUNT_WAITER        (1 << 6)        /* have waiter for sole pin */
+ #define BM_CHECKPOINT_NEEDED    (1 << 7)        /* this needs to be written in checkpoint */

  typedef bits16 BufFlags;

Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.104
diff -c -r1.104 bufmgr.h
*** src/include/storage/bufmgr.h    30 May 2007 20:12:03 -0000    1.104
--- src/include/storage/bufmgr.h    20 Jun 2007 10:28:43 -0000
***************
*** 36,41 ****
--- 36,42 ----
  extern double bgwriter_all_percent;
  extern int    bgwriter_lru_maxpages;
  extern int    bgwriter_all_maxpages;
+ extern int    checkpoint_rate;

  /* in buf_init.c */
  extern DLLIMPORT char *BufferBlocks;
***************
*** 136,142 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 137,143 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,168 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(void);
! extern void BgBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);

--- 162,170 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
! extern void BgAllSweep(void);
! extern void BgLruSweep(void);

  extern void AtProcExit_LocalBuffers(void);

