Thread: Load Distributed Checkpoints, final patch

From: Heikki Linnakangas

Here's the latest revision of Itagaki-san's Load Distributed Checkpoints patch:

* bgwriter all-scan is gone. We might or might not improve the LRU-sweep
later so that it can perform any duties the all-sweep might have had
besides reducing the impact of a checkpoint.

* one new GUC variable, called checkpoint_completion_target. Default is
0.5, which should be more than enough to smooth checkpoints on a system
that's not already overloaded. It's also small enough that it shouldn't
hurt recovery times much on a system that isn't already struggling to hit
its recovery time requirements. You can set it to 0 if you want the old
checkpoint behavior for some reason. Maximum is 0.9, to leave some
headroom for fsync and any other things that need to happen during a
checkpoint.
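
(A worked example with numbers of my own, just to illustrate the pacing:
with checkpoint_timeout = 300s, checkpoint_segments = 3 and the default
target of 0.5, the write phase is throttled so that it finishes roughly
when 150 seconds have elapsed or about 1.5 new WAL segments have been
filled since the checkpoint started, whichever comes first.)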

* The minimum rate we write at during a checkpoint is 1 page /
bgwriter_delay.
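
(For scale: with the default bgwriter_delay of 200 ms and 8 kB pages, that
floor works out to 5 pages/s, or roughly 40 kB/s. These are illustrative
numbers derived from the defaults, not values the patch itself contains.)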

* Added a paragraph to the user manual describing the feature. Also updated
the formula for the expected number of WAL segments; the new formula is (2 +
checkpoint_completion_target) * checkpoint_segments + 1. I believe the
comments in xlog.c regarding XLOGfileslop are still valid.
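
(Plugging in numbers as an illustration: with checkpoint_segments = 3 and
the default checkpoint_completion_target of 0.5, the bound is
(2 + 0.5) * 3 + 1 = 8.5, i.e. about 8-9 segment files, or roughly 136 MB
at the default 16 MB segment size, compared to 2 * 3 + 1 = 7 files with
the old formula.)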

* The signaling in bgwriter.c is based on a spinlock. Tom advised against
using the spinlock when it's not strictly necessary, but IMHO it's easier
to understand this way. Feel free to revert that when committing if you
disagree.

* The algorithm for estimating progress wrt. checkpoint_segments is the
same as before. Bursty WAL activity will lead to bursty checkpoint
activity, but I wanted to keep it simple for now. In any case, the I/O
rate will be smoother than without the patch.
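
To make that concrete, here is a small standalone toy model of the
scheduling rule (my own sketch for illustration; the real logic is in
IsCheckpointOnSchedule() in bgwriter.c and works against checkpoint_timeout
and the WAL insert pointer rather than precomputed fractions): the fraction
of buffers written so far is scaled by checkpoint_completion_target and
compared against the fraction of checkpoint_timeout elapsed and the
fraction of checkpoint_segments consumed; the write loop naps only while
the scaled progress stays ahead of both.

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Toy model of the pacing decision. 'progress' is the fraction of dirty
     * buffers written so far, 'time_frac' the fraction of checkpoint_timeout
     * elapsed, 'xlog_frac' the fraction of checkpoint_segments consumed.
     * The write loop may nap only while scaled progress is ahead of both.
     */
    static bool
    on_schedule(double progress, double target, double time_frac, double xlog_frac)
    {
        progress *= target;        /* scale by checkpoint_completion_target */
        if (progress < time_frac)
            return false;          /* behind the clock, hurry up */
        if (progress < xlog_frac)
            return false;          /* behind WAL consumption, hurry up */
        return true;
    }

    int
    main(void)
    {
        /* 40% written, 10% of timeout and 15% of segments gone: 0.20 >= both */
        printf("%s\n", on_schedule(0.40, 0.5, 0.10, 0.15) ? "nap" : "hurry");
        /* same progress, but 30% of segments already consumed: 0.20 < 0.30 */
        printf("%s\n", on_schedule(0.40, 0.5, 0.10, 0.30) ? "nap" : "hurry");
        return 0;
    }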

* There are some DEBUG elogs which we might want to replace with better
ones later, per the patch in the patch queue by Greg Smith. The ones
that are there now are useful for testing this feature, but are a bit
crude for DBAs to use.

Barring any objections from committer, I'm finished with this patch.

I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
Smith's suggestion just to see what happens, but I doubt that will
change my mind on the above decisions.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.127
diff -c -r1.127 config.sgml
*** doc/src/sgml/config.sgml    19 Jun 2007 20:13:21 -0000    1.127
--- doc/src/sgml/config.sgml    26 Jun 2007 16:31:28 -0000
***************
*** 1565,1570 ****
--- 1565,1586 ----
        </listitem>
       </varlistentry>

+      <varlistentry id="guc-checkpoint-completion-target" xreflabel="checkpoint_completion_target">
+       <term><varname>checkpoint_completion_target</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_completion_target</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the target length of checkpoints, as a fraction of
+         the checkpoint interval. The default is 0.5.
+
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml    31 Jan 2007 20:56:19 -0000    1.43
--- doc/src/sgml/wal.sgml    26 Jun 2007 16:53:17 -0000
***************
*** 217,224 ****
    </para>

    <para>
     There will be at least one WAL segment file, and will normally
!    not be more than 2 * <varname>checkpoint_segments</varname> + 1
     files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
--- 217,242 ----
    </para>

    <para>
+    To avoid flooding the OS and I/O subsystem with a sudden burst of page writes,
+    the writing of dirty buffers during a checkpoint is spread over a period of time.
+    That period is controlled by <varname>checkpoint_completion_target</varname>.
+    It's given as a fraction of the checkpoint interval, as defined by
+    <varname>checkpoint_timeout</varname> and
+    <varname>checkpoint_segments</varname>. WAL segment consumption
+    and elapsed time are monitored, and the I/O rate is adjusted during the
+    checkpoint so that it finishes when the given fraction of elapsed time
+    or WAL segments has passed, whichever is sooner. For example, with the
+    default of 0.5, checkpoint writes are paced so that the checkpoint finishes
+    after 50% of checkpoint_timeout or checkpoint_segments has elapsed. On a
+    system that's very close to maximum I/O throughput during normal operation,
+    you might want to increase <varname>checkpoint_completion_target</varname>,
+    though prolonging checkpoints affects recovery time, because more WAL needs
+    to be kept around and replayed during recovery.
+   </para>
+
+   <para>
     There will be at least one WAL segment file, and will normally
!    not be more than (2 + checkpoint_completion_target) * <varname>checkpoint_segments</varname> + 1
     files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c    31 May 2007 15:13:01 -0000    1.272
--- src/backend/access/transam/xlog.c    26 Jun 2007 17:51:09 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestCheckpoint(false, true);
                      }
                  }
              }
--- 1608,1614 ----
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestXLogFillCheckpoint();
                      }
                  }
              }
***************
*** 5110,5116 ****
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true);

          /*
           * Close down recovery environment
--- 5110,5116 ----
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true, true);

          /*
           * Close down recovery environment
***************
*** 5319,5324 ****
--- 5319,5347 ----
  }

  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  *
+  * NOTE: The value *actually* returned is the position of the last full
+  * xlog page. It lags behind the real insert position by at most 1 page.
+  * That way, we don't need to acquire WALInsertLock, which can be quite
+  * heavily contended, and an approximation is enough for the current
+  * usage of this function.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+     /* use volatile pointer to prevent code rearrangement */
+     volatile XLogCtlData *xlogctl = XLogCtl;
+     XLogRecPtr recptr;
+
+     SpinLockAcquire(&xlogctl->info_lck);
+     recptr = xlogctl->LogwrtRqst.Write;
+     SpinLockRelease(&xlogctl->info_lck);
+
+     return recptr;
+ }
+
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
***************
*** 5383,5389 ****
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
--- 5406,5412 ----
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
***************
*** 5395,5405 ****
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
--- 5418,5431 ----
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
+  * If immediate is true, we try to finish the checkpoint as fast as we can,
+  * ignoring the checkpoint_completion_target parameter.
+  *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool immediate, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
***************
*** 5591,5597 ****
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo);

      START_CRIT_SECTION();

--- 5617,5623 ----
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo, immediate);

      START_CRIT_SECTION();

***************
*** 5693,5708 ****
  /*
   * Flush all data in shared memory to disk, and fsync
   *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool();            /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
--- 5719,5737 ----
  /*
   * Flush all data in shared memory to disk, and fsync
   *
+  * If immediate is true, try to finish as quickly as possible, ignoring
+  * the GUC variables to throttle checkpoint I/O.
+  *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool(immediate);        /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
***************
*** 5710,5716 ****
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckpoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
--- 5739,5745 ----
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
***************
*** 5751,5757 ****
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo);

      /*
       * Update pg_control so that any subsequent crash will restart from this
--- 5780,5786 ----
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, true);

      /*
       * Update pg_control so that any subsequent crash will restart from this
***************
*** 6177,6183 ****
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestCheckpoint(true, false);

          /*
           * Now we need to fetch the checkpoint record location, and also its
--- 6206,6212 ----
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestLazyCheckpoint();

          /*
           * Now we need to fetch the checkpoint record location, and also its
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c    7 Mar 2007 13:35:02 -0000    1.233
--- src/backend/bootstrap/bootstrap.c    19 Jun 2007 15:29:51 -0000
***************
*** 489,495 ****

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true);

      /* Clean up and exit */
      cleanup();
--- 489,495 ----

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true, true);

      /* Clean up and exit */
      cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.195
diff -c -r1.195 dbcommands.c
*** src/backend/commands/dbcommands.c    1 Jun 2007 19:38:07 -0000    1.195
--- src/backend/commands/dbcommands.c    20 Jun 2007 09:36:24 -0000
***************
*** 404,410 ****
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync();

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
--- 404,410 ----
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(true);

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
***************
*** 507,513 ****
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestCheckpoint(true, false);

          /*
           * Close pg_database, but keep lock till commit (this is important to
--- 507,513 ----
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestImmediateCheckpoint();

          /*
           * Close pg_database, but keep lock till commit (this is important to
***************
*** 661,667 ****
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestCheckpoint(true, false);
  #endif

      /*
--- 661,667 ----
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestImmediateCheckpoint();
  #endif

      /*
***************
*** 1427,1433 ****
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync();

          /*
           * Copy this subdirectory to the new location
--- 1427,1433 ----
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync(true);

          /*
           * Copy this subdirectory to the new location
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.38
diff -c -r1.38 bgwriter.c
*** src/backend/postmaster/bgwriter.c    27 May 2007 03:50:39 -0000    1.38
--- src/backend/postmaster/bgwriter.c    26 Jun 2007 18:14:47 -0000
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"

  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>

***************
*** 59,64 ****
--- 60,66 ----
  #include "storage/pmsignal.h"
  #include "storage/shmem.h"
  #include "storage/smgr.h"
+ #include "storage/spin.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 70,101 ****
   *
   * The ckpt counters allow backends to watch for completion of a checkpoint
   * request they send.  Here's how it works:
!  *    * At start of a checkpoint, bgwriter increments ckpt_started.
   *    * On completion of a checkpoint, bgwriter sets ckpt_done to
   *      equal ckpt_started.
!  *    * On failure of a checkpoint, bgwrite first increments ckpt_failed,
!  *      then sets ckpt_done to equal ckpt_started.
!  * All three fields are declared sig_atomic_t to ensure they can be read
!  * and written without explicit locking.  The algorithm for backends is:
!  *    1. Record current values of ckpt_failed and ckpt_started (in that
!  *       order!).
   *    2. Send signal to request checkpoint.
   *    3. Sleep until ckpt_started changes.  Now you know a checkpoint has
   *       begun since you started this algorithm (although *not* that it was
   *       specifically initiated by your signal).
   *    4. Record new value of ckpt_started.
!  *    5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
!  *       arithmetic here in case counters wrap around.)  Now you know a
!  *       checkpoint has started and completed, but not whether it was
!  *       successful.
   *    6. If ckpt_failed is different from the originally saved value,
   *       assume request failed; otherwise it was definitely successful.
   *
!  * An additional field is ckpt_time_warn; this is also sig_atomic_t for
!  * simplicity, but is only used as a boolean.  If a backend is requesting
!  * a checkpoint for which a checkpoints-too-close-together warning is
!  * reasonable, it should set this field TRUE just before sending the signal.
!  *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
   * fields are protected by BgWriterCommLock.
--- 72,103 ----
   *
   * The ckpt counters allow backends to watch for completion of a checkpoint
   * request they send.  Here's how it works:
!  *    * At start of a checkpoint, bgwriter reads the request flags
!  *      and increments ckpt_started, while holding ckpt_lck.
   *    * On completion of a checkpoint, bgwriter sets ckpt_done to
   *      equal ckpt_started.
!  *    * On failure of a checkpoint, bgwriter increments ckpt_failed,
!  *      and sets ckpt_done to equal ckpt_started.
!  *
!  * The algorithm for backends is:
!  *    1. Record current values of ckpt_failed and ckpt_started, and
!  *     set any request flags, while holding ckpt_lck.
   *    2. Send signal to request checkpoint.
   *    3. Sleep until ckpt_started changes.  Now you know a checkpoint has
   *       begun since you started this algorithm (although *not* that it was
   *       specifically initiated by your signal).
   *    4. Record new value of ckpt_started.
!  *    5. Sleep until ckpt_done >= saved value of ckpt_started.
!  *       Now you know a checkpoint has started and completed, but not whether
!  *       it was successful.
   *    6. If ckpt_failed is different from the originally saved value,
   *       assume request failed; otherwise it was definitely successful.
   *
!  * There are three request flags that indicate what kind of checkpoint
!  * should be performed next. They are cumulative; if, for example, an
!  * immediate checkpoint is requested, a later request for a non-immediate
!  * checkpoint won't clear the flag.
!  *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
   * fields are protected by BgWriterCommLock.
***************
*** 112,122 ****
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     sig_atomic_t ckpt_started;    /* advances when checkpoint starts */
!     sig_atomic_t ckpt_done;        /* advances when checkpoint done */
!     sig_atomic_t ckpt_failed;    /* advances when checkpoint fails */

!     sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
--- 114,128 ----
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     slock_t        ckpt_lck;        /* protects all the ckpt_* fields */

!     int            ckpt_started;    /* advances when checkpoint starts */
!     int            ckpt_done;        /* advances when checkpoint done */
!     int            ckpt_failed;    /* advances when checkpoint fails */
!
!     bool    ckpt_rqst_time_warn;    /* warn if too soon since last ckpt */
!     bool    ckpt_rqst_immediate;    /* an immediate ckpt has been requested */
!     bool    ckpt_rqst_force;        /* checkpoint even if no WAL activity */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
***************
*** 131,136 ****
--- 137,143 ----
  int            BgWriterDelay = 200;
  int            CheckPointTimeout = 300;
  int            CheckPointWarning = 30;
+ double        CheckPointCompletionTarget = 0.5;

  /*
   * Flags set by interrupt handlers for later service in the main loop.
***************
*** 146,154 ****
--- 153,180 ----

  static bool ckpt_active = false;

+ /*
+  * Current time and WAL insert location when checkpoint was started.
+  * Used to track progress during the checkpoint and throttle I/O
+  * to hit checkpoint_completion_target.
+  */
+ static time_t        ckpt_start_time;
+ static XLogRecPtr    ckpt_start_recptr;
+
+ static double        ckpt_cached_elapsed;
+
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;

+ /* Prototypes for private functions */
+
+ static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force);
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(void);
+ static bool IsCheckpointOnSchedule(double progress);
+ static bool ImmediateCheckpointRequested(void);
+
+ /* Signal handlers */

  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
***************
*** 170,175 ****
--- 196,202 ----

      Assert(BgWriterShmem != NULL);
      BgWriterShmem->bgwriter_pid = MyProcPid;
+     SpinLockInit(&BgWriterShmem->ckpt_lck);
      am_bg_writer = true;

      /*
***************
*** 281,288 ****
--- 308,318 ----
              /* use volatile pointer to prevent code rearrangement */
              volatile BgWriterShmemStruct *bgs = BgWriterShmem;

+             SpinLockAcquire(&bgs->ckpt_lck);
              bgs->ckpt_failed++;
              bgs->ckpt_done = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
              ckpt_active = false;
          }

***************
*** 328,337 ****
      for (;;)
      {
          bool        do_checkpoint = false;
-         bool        force_checkpoint = false;
          time_t        now;
          int            elapsed_secs;
-         long        udelay;

          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
--- 358,365 ----
***************
*** 354,360 ****
          {
              checkpoint_requested = false;
              do_checkpoint = true;
-             force_checkpoint = true;
              BgWriterStats.m_requested_checkpoints++;
          }
          if (shutdown_requested)
--- 382,387 ----
***************
*** 377,387 ****
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             if (!force_checkpoint)
!                 BgWriterStats.m_timed_checkpoints++;
          }

          /*
--- 404,413 ----
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (!do_checkpoint && elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             BgWriterStats.m_timed_checkpoints++;
          }

          /*
***************
*** 390,395 ****
--- 416,449 ----
           */
          if (do_checkpoint)
          {
+             /* use volatile pointer to prevent code rearrangement */
+             volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+             bool time_warn;
+             bool immediate;
+             bool force;
+
+             /*
+              * Atomically check the request flags to figure out what
+              * kind of a checkpoint we should perform, and increase the
+              * started-counter to acknowledge that we've started
+              * a new checkpoint.
+              */
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+
+             time_warn = bgs->ckpt_rqst_time_warn;
+             bgs->ckpt_rqst_time_warn = false;
+
+             immediate = bgs->ckpt_rqst_immediate;
+             bgs->ckpt_rqst_immediate = false;
+
+             force = bgs->ckpt_rqst_force;
+             bgs->ckpt_rqst_force = false;
+
+             bgs->ckpt_started++;
+
+             SpinLockRelease(&bgs->ckpt_lck);
+
              /*
               * We will warn if (a) too soon since last checkpoint (whatever
               * caused it) and (b) somebody has set the ckpt_time_warn flag
***************
*** 397,417 ****
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (BgWriterShmem->ckpt_time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!             BgWriterShmem->ckpt_time_warn = false;

              /*
!              * Indicate checkpoint start to any waiting backends.
               */
              ckpt_active = true;
!             BgWriterShmem->ckpt_started++;

!             CreateCheckPoint(false, force_checkpoint);

              /*
               * After any checkpoint, close all smgr files.    This is so we
--- 451,477 ----
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!

              /*
!              * Initialize bgwriter-private variables used during checkpoint.
               */
              ckpt_active = true;
!             ckpt_start_recptr = GetInsertRecPtr();
!             ckpt_start_time = now;
!             ckpt_cached_elapsed = 0;
!
!             elog(DEBUG1, "CHECKPOINT: start");
!
!             CreateCheckPoint(false, immediate, force);

!             elog(DEBUG1, "CHECKPOINT: end");

              /*
               * After any checkpoint, close all smgr files.    This is so we
***************
*** 422,428 ****
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
              ckpt_active = false;

              /*
--- 482,491 ----
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             SpinLockAcquire(&bgs->ckpt_lck);
!             bgs->ckpt_done = bgs->ckpt_started;
!             SpinLockRelease(&bgs->ckpt_lck);
!
              ckpt_active = false;

              /*
***************
*** 436,446 ****
              BgBufferSync();

          /*
!          * Check for archive_timeout, if so, switch xlog files.  First we do a
!          * quick check using possibly-stale local state.
           */
!         if (XLogArchiveTimeout > 0 &&
!             (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
--- 499,531 ----
              BgBufferSync();

          /*
!          * Check for archive_timeout and switch xlog files if necessary.
           */
!         CheckArchiveTimeout();
!
!         /* Nap for the configured time. */
!         BgWriterNap();
!     }
! }
!
! /*
!  * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
!  *        if needed
!  */
! static void
! CheckArchiveTimeout(void)
! {
!     time_t        now;
!
!     if (XLogArchiveTimeout <= 0)
!         return;
!
!     now = time(NULL);
!
!     /* First we do a quick check using possibly-stale local state. */
!     if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
!         return;
!
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
***************
*** 450,459 ****

              last_xlog_switch_time = Max(last_xlog_switch_time, last_time);

-             /* if we did a checkpoint, 'now' might be stale too */
-             if (do_checkpoint)
-                 now = time(NULL);
-
              /* Now we can do the real check */
              if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
              {
--- 535,540 ----
***************
*** 478,483 ****
--- 559,573 ----
                  last_xlog_switch_time = now;
              }
          }
+ }
+
+ /*
+  * BgWriterNap -- Nap for the configured time or until a signal is received.
+  */
+ static void
+ BgWriterNap(void)
+ {
+     long        udelay;

          /*
           * Send off activity statistics to the stats collector
***************
*** 495,502 ****
           *
           * We absorb pending requests after each short sleep.
           */
!         if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
!             (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0))
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
--- 585,592 ----
           *
           * We absorb pending requests after each short sleep.
           */
!         if ((bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) ||
!             ckpt_active)
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
***************
*** 505,522 ****

          while (udelay > 999999L)
          {
!             if (got_SIGHUP || checkpoint_requested || shutdown_requested)
                  break;
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!         if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
              pg_usleep(udelay);
      }
  }


  /* --------------------------------
   *        signal handler routines
--- 595,774 ----

          while (udelay > 999999L)
          {
!             if (got_SIGHUP || shutdown_requested ||
!                 (!ckpt_active && checkpoint_requested) ||
!                 (ckpt_active && ImmediateCheckpointRequested()))
                  break;
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!
!         if (!(got_SIGHUP || shutdown_requested ||
!               (!ckpt_active && checkpoint_requested) ||
!               (ckpt_active && ImmediateCheckpointRequested())))
              pg_usleep(udelay);
+ }
+
+ /*
+  * Returns true if an immediate checkpoint request is pending.
+  */
+ static bool
+ ImmediateCheckpointRequested(void)
+ {
+     if (checkpoint_requested)
+     {
+         volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+         /*
+          * We don't need to acquire the ckpt_lck in this case because we're
+          * only looking at a single boolean field.
+          */
+         if (bgs->ckpt_rqst_immediate)
+             return true;
      }
+     return false;
  }

+ #define WRITES_PER_ABSORB 1000
+
+ /*
+  * CheckpointWriteDelay -- yield control to bgwriter during a checkpoint
+  *
+  * To give bgwriter a chance to perform its normal activities while we're
+  * performing a checkpoint, this function is called after each page written
+  * from the buffer cache.
+  *
+  * To throttle the checkpoint I/O rate, we also sleep periodically so that
+  * we hit checkpoint_completion_target.
+  *
+  * 'progress' is an estimate of the fraction of the writes that have been
+  * done, between 0.0 meaning none and 1.0 meaning all done. If 'immediate'
+  * is true, this is an immediate checkpoint and we should not sleep.
+  */
+ void
+ CheckpointWriteDelay(bool immediate, double progress)
+ {
+     static int absorb_counter = WRITES_PER_ABSORB;
+
+     /*
+      * Take a nap and perform the usual bgwriter duties, unless we're behind
+      * schedule, in which case we just try to catch up as quickly as possible.
+      */
+     if (!immediate && am_bg_writer && !shutdown_requested &&
+         !ImmediateCheckpointRequested() &&
+         IsCheckpointOnSchedule(progress))
+     {
+         if (got_SIGHUP)
+         {
+             got_SIGHUP = false;
+             ProcessConfigFile(PGC_SIGHUP);
+         }
+         CheckArchiveTimeout();
+         BgBufferSync();
+         BgWriterNap();
+
+         AbsorbFsyncRequests();
+         absorb_counter = WRITES_PER_ABSORB;
+     }
+     else if (--absorb_counter <= 0)
+     {
+         /*
+          * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+          * operations when we don't sleep, to prevent overflow of the fsync
+          * request queue.
+          */
+         AbsorbFsyncRequests();
+         absorb_counter = WRITES_PER_ABSORB;
+     }
+ }
+
+ /*
+  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+  *         in time?
+  *
+  * Compares the current progress against the time/segments elapsed since last
+  * checkpoint, and returns true if the progress we've made this far is greater
+  * than the elapsed time/segments.
+  */
+ static bool
+ IsCheckpointOnSchedule(double progress)
+ {
+     struct timeval    now;
+     XLogRecPtr        recptr;
+     double            elapsed_time,
+                     elapsed_xlogs;
+
+     Assert(ckpt_active);
+
+     /* Scale progress according to checkpoint_completion_target. */
+     progress *= CheckPointCompletionTarget;
+
+     /*
+      * Check against the cached value first. Only do the more expensive
+      * calculations once we reach the target previously calculated. Since
+      * neither time nor the WAL insert pointer moves backwards, a freshly
+      * calculated value can only be greater than or equal to the cached value.
+      */
+     if (progress < ckpt_cached_elapsed)
+     {
+         elog(DEBUG3, "IsCheckpointOnSchedule: Still behind cached=%.3f, scaled progress=%.3f",
+              ckpt_cached_elapsed, progress);
+         return false;
+     }
+
+     /*
+      * Check progress against time elapsed and checkpoint_timeout.
+      */
+     gettimeofday(&now, NULL);
+
+     elapsed_time = ((double) (now.tv_sec - ckpt_start_time) +
+         now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+     if (progress < elapsed_time)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, scaled progress=%.3f",
+              elapsed_time, progress);
+
+         ckpt_cached_elapsed = elapsed_time;
+
+         return false;
+     }
+
+     /*
+      * Check progress against WAL segments written and checkpoint_segments.
+      *
+      * We compare the current WAL insert location against the location
+      * computed before calling CreateCheckPoint. The code in XLogInsert that
+      * actually triggers a checkpoint when checkpoint_segments is exceeded
+      * compares against RedoRecptr, so this is not completely accurate.
+      * However, it's good enough for our purposes, we're only calculating
+      * an estimate anyway.
+      */
+     recptr = GetInsertRecPtr();
+     elapsed_xlogs =
+         (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+         CheckPointSegments;
+
+     if (progress < elapsed_xlogs)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, scaled progress=%.3f",
+              elapsed_xlogs, progress);
+
+         ckpt_cached_elapsed = elapsed_xlogs;
+
+         return false;
+     }
+
+     /* It looks like we're on schedule. */
+
+     elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f, scaled progress=%.3f",
+          elapsed_time, elapsed_xlogs, progress);
+
+     return true;
+ }

  /* --------------------------------
   *        signal handler routines
***************
*** 618,626 ****
  }

  /*
!  * RequestCheckpoint
   *        Called in backend processes to request an immediate checkpoint
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
   * immediately.
--- 870,919 ----
  }

  /*
!  * RequestImmediateCheckpoint
   *        Called in backend processes to request an immediate checkpoint
   *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestImmediateCheckpoint(void)
+ {
+     RequestCheckpoint(true, false, true, true);
+ }
+
+ /*
+  * RequestLazyCheckpoint
+  *        Called in backend processes to request a lazy checkpoint.
+  *
+  * This is essentially the same as RequestImmediateCheckpoint, except
+  * that this form obeys the checkpoint_completion_target GUC variable, and
+  * can therefore take a lot longer.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestLazyCheckpoint(void)
+ {
+     RequestCheckpoint(true, false, false, true);
+ }
+
+ /*
+  * RequestXLogFillCheckpoint
+  *        Signals the bgwriter that we've reached checkpoint_segments
+  *
+  * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, this returns
+  * immediately without waiting for the checkpoint to finish.
+  */
+ void
+ RequestXLogFillCheckpoint(void)
+ {
+     RequestCheckpoint(false, true, false, false);
+ }
+
+ /*
+  * RequestCheckpoint
+  *        Common subroutine for all the above Request*Checkpoint variants.
+  *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
   * immediately.
***************
*** 628,648 ****
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
! void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     sig_atomic_t old_failed = bgs->ckpt_failed;
!     sig_atomic_t old_started = bgs->ckpt_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         CreateCheckPoint(false, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
--- 921,950 ----
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
+  *
+  * If immediate is true, the checkpoint should be finished ASAP.
+  *
+  * If force is true, force a checkpoint even if no XLOG activity has occurred
+  * since the last one.
   */
! static void
! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     int old_failed, old_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         /*
!          * There's no point in doing lazy checkpoints in a standalone
!          * backend, because there are no other backends the checkpoint could
!          * disrupt.
!          */
!         CreateCheckPoint(false, true, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
***************
*** 653,661 ****
          return;
      }

!     /* Set warning request flag if appropriate */
      if (warnontime)
!         bgs->ckpt_time_warn = true;

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
--- 955,982 ----
          return;
      }

!     /*
!      * Atomically set the request flags, and take a snapshot of the counters.
!      * This ensures that when we see that ckpt_started > old_started,
!      * we know the flags we set here have been seen by bgwriter.
!      *
!      * Note that we effectively OR the flags with any existing flags, to
!      * avoid overriding a "stronger" request by another backend.
!      */
!     SpinLockAcquire(&bgs->ckpt_lck);
!
!     old_failed = bgs->ckpt_failed;
!     old_started = bgs->ckpt_started;
!
!     /* Set request flags as appropriate */
      if (warnontime)
!         bgs->ckpt_rqst_time_warn = true;
!     if (immediate)
!         bgs->ckpt_rqst_immediate = true;
!     if (force)
!         bgs->ckpt_rqst_force = true;
!
!     SpinLockRelease(&bgs->ckpt_lck);

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
***************
*** 674,701 ****
       */
      if (waitforit)
      {
!         while (bgs->ckpt_started == old_started)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
-         old_started = bgs->ckpt_started;

          /*
!          * We are waiting for ckpt_done >= old_started, in a modulo sense.
!          * This is a little tricky since we don't know the width or signedness
!          * of sig_atomic_t.  We make the lowest common denominator assumption
!          * that it is only as wide as "char".  This means that this algorithm
!          * will cope correctly as long as we don't sleep for more than 127
!          * completed checkpoints.  (If we do, we will get another chance to
!          * exit after 128 more checkpoints...)
           */
!         while (((signed char) (bgs->ckpt_done - old_started)) < 0)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!         if (bgs->ckpt_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
--- 995,1036 ----
       */
      if (waitforit)
      {
!         int new_started, new_failed;
!
!         /* Wait for a new checkpoint to start. */
!         for(;;)
          {
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_started = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if (new_started != old_started)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }

          /*
!          * We are waiting for ckpt_done >= new_started, in a modulo sense.
           */
!         for(;;)
          {
+             int new_done;
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_done = bgs->ckpt_done;
+             new_failed = bgs->ckpt_failed;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if(new_done - new_started >= 0)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!
!         if (new_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.221
diff -c -r1.221 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    18 Jun 2007 00:47:20 -0000    1.221
--- src/backend/storage/buffer/bufmgr.c    26 Jun 2007 17:01:30 -0000
***************
*** 71,79 ****
  /* GUC variables */
  bool        zero_damaged_pages = false;
  double        bgwriter_lru_percent = 1.0;
- double        bgwriter_all_percent = 0.333;
  int            bgwriter_lru_maxpages = 5;
- int            bgwriter_all_maxpages = 5;


  long        NDirectFileRead;    /* some I/O's are direct file access. bypass
--- 71,77 ----
***************
*** 644,650 ****
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

--- 642,648 ----
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

***************
*** 999,1042 ****
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
!     int            buf_id;
      int            num_to_scan;
!     int            absorb_counter;

      /*
       * Find out where to start the circular scan.
       */
!     buf_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers.
       */
      num_to_scan = NBuffers;
!     absorb_counter = WRITES_PER_ABSORB;
      while (num_to_scan-- > 0)
      {
!         if (SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;

              /*
!              * If in bgwriter, absorb pending fsync requests after each
!              * WRITES_PER_ABSORB write operations, to prevent overflow of the
!              * fsync request queue.  If not in bgwriter process, this is a
!              * no-op.
               */
!             if (--absorb_counter <= 0)
!             {
!                 AbsorbFsyncRequests();
!                 absorb_counter = WRITES_PER_ABSORB;
!             }
          }
          if (++buf_id >= NBuffers)
              buf_id = 0;
--- 997,1098 ----
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
+  * If 'immediate' is true, write them all ASAP, otherwise throttle the
+  * I/O rate according to the checkpoint_completion_target GUC variable, and
+  * perform normal bgwriter duties periodically.
   */
  void
! BufferSync(bool immediate)
  {
!     int            buf_id, start_id;
      int            num_to_scan;
!     int            num_to_write;
!     int            num_written;

      /*
       * Find out where to start the circular scan.
       */
!     start_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers, and mark the ones that need to be written with
!      * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
!      * can estimate how much work needs to be done.
!      *
!      * This allows us to only write those pages that were dirty when the
!      * checkpoint began, and haven't been flushed to disk since. Whenever a
!      * page with BM_CHECKPOINT_NEEDED is written out, either by us later
!      * in this function, or by normal backends or the bgwriter LRU-scan, the
!      * flag is cleared. Any buffer dirtied after this point won't have the
!      * flag set.
       */
      num_to_scan = NBuffers;
!     num_to_write = 0;
!     buf_id = start_id;
      while (num_to_scan-- > 0)
      {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!
!         /*
!          * Header spinlock is enough to examine BM_DIRTY, see comment in
!          * SyncOneBuffer.
!          */
!         LockBufHdr(bufHdr);
!
!         if (bufHdr->flags & BM_DIRTY)
!         {
!             bufHdr->flags |= BM_CHECKPOINT_NEEDED;
!             num_to_write++;
!         }
!
!         UnlockBufHdr(bufHdr);
!
!         if (++buf_id >= NBuffers)
!             buf_id = 0;
!     }
!
!     elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers);
!
!     /*
!      * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.
!      */
!     num_to_scan = NBuffers;
!     num_written = 0;
!     buf_id = start_id;
!     while (num_to_scan-- > 0 && num_written < num_to_write)
!     {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!         bool needs_flush;
!
!         /* We don't need to acquire the lock here, because we're
!          * only looking at a single bit. It's possible that someone
!          * else writes the buffer and clears the flag right after we
!          * check, but that doesn't matter. This assumes that no-one
!          * clears the flag and sets it again while holding info_lck,
!          * expecting no-one to see the intermediary state.
!          */
!         needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0;
!
!         if (needs_flush && SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;
+             num_written++;

              /*
!              * Perform normal bgwriter duties and sleep to throttle
!              * our I/O rate.
!              *
!              * Note that num_written doesn't include buffers written
!              * by other backends, or the bgwriter LRU-sweep. That means
!              * that the estimate of how much progress we've made is
!              * conservative.
               */
!             CheckpointWriteDelay(immediate,
!                                  (double) num_written / num_to_write);
          }
          if (++buf_id >= NBuffers)
              buf_id = 0;
***************
*** 1051,1058 ****
  void
  BgBufferSync(void)
  {
!     static int    buf_id1 = 0;
!     int            buf_id2;
      int            num_to_scan;
      int            num_written;

--- 1107,1113 ----
  void
  BgBufferSync(void)
  {
!     int            buf_id;
      int            num_to_scan;
      int            num_written;

***************
*** 1060,1104 ****
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * To minimize work at checkpoint time, we want to try to keep all the
!      * buffers clean; this motivates a scan that proceeds sequentially through
!      * all buffers.  But we are also charged with ensuring that buffers that
       * will be recycled soon are clean when needed; these buffers are the ones
!      * just ahead of the StrategySyncStart point.  We make a separate scan
!      * through those.
       */

      /*
-      * This loop runs over all buffers, including pinned ones.    The starting
-      * point advances through the buffer pool on successive calls.
-      *
-      * Note that we advance the static counter *before* trying to write. This
-      * ensures that, if we have a persistent write failure on a dirty buffer,
-      * we'll still be able to make progress writing other buffers. (The
-      * bgwriter will catch the error and just call us again later.)
-      */
-     if (bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0)
-     {
-         num_to_scan = (int) ((NBuffers * bgwriter_all_percent + 99) / 100);
-         num_written = 0;
-
-         while (num_to_scan-- > 0)
-         {
-             if (++buf_id1 >= NBuffers)
-                 buf_id1 = 0;
-             if (SyncOneBuffer(buf_id1, false))
-             {
-                 if (++num_written >= bgwriter_all_maxpages)
-                 {
-                     BgWriterStats.m_maxwritten_all++;
-                     break;
-                 }
-             }
-         }
-         BgWriterStats.m_buf_written_all += num_written;
-     }
-
-     /*
       * This loop considers only unpinned buffers close to the clock sweep
       * point.
       */
--- 1115,1126 ----
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * The purpose of this sweep is to ensure that buffers that
       * will be recycled soon are clean when needed; these buffers are the ones
!      * just ahead of the StrategySyncStart point.
       */

      /*
       * This loop considers only unpinned buffers close to the clock sweep
       * point.
       */
***************
*** 1107,1117 ****
          num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
          num_written = 0;

!         buf_id2 = StrategySyncStart();

          while (num_to_scan-- > 0)
          {
!             if (SyncOneBuffer(buf_id2, true))
              {
                  if (++num_written >= bgwriter_lru_maxpages)
                  {
--- 1129,1139 ----
          num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
          num_written = 0;

!         buf_id = StrategySyncStart();

          while (num_to_scan-- > 0)
          {
!             if (SyncOneBuffer(buf_id, true))
              {
                  if (++num_written >= bgwriter_lru_maxpages)
                  {
***************
*** 1119,1126 ****
                      break;
                  }
              }
!             if (++buf_id2 >= NBuffers)
!                 buf_id2 = 0;
          }
          BgWriterStats.m_buf_written_lru += num_written;
      }
--- 1141,1148 ----
                      break;
                  }
              }
!             if (++buf_id >= NBuffers)
!                 buf_id = 0;
          }
          BgWriterStats.m_buf_written_lru += num_written;
      }
***************
*** 1340,1348 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
!     BufferSync();
      smgrsync();
  }

--- 1362,1373 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
!     elog(DEBUG1, "CHECKPOINT: write phase");
!     BufferSync(immediate);
!
!     elog(DEBUG1, "CHECKPOINT: sync phase");
      smgrsync();
  }

***************
*** 2131,2137 ****
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~BM_DIRTY;
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
--- 2156,2162 ----
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.280
diff -c -r1.280 utility.c
*** src/backend/tcop/utility.c    30 May 2007 20:12:01 -0000    1.280
--- src/backend/tcop/utility.c    20 Jun 2007 09:36:31 -0000
***************
*** 1089,1095 ****
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestCheckpoint(true, false);
              break;

          case T_ReindexStmt:
--- 1089,1095 ----
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestImmediateCheckpoint();
              break;

          case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.400
diff -c -r1.400 guc.c
*** src/backend/utils/misc/guc.c    20 Jun 2007 18:31:39 -0000    1.400
--- src/backend/utils/misc/guc.c    26 Jun 2007 16:58:23 -0000
***************
*** 1569,1583 ****
      },

      {
-         {"bgwriter_all_maxpages", PGC_SIGHUP, RESOURCES,
-             gettext_noop("Background writer maximum number of all pages to flush per round."),
-             NULL
-         },
-         &bgwriter_all_maxpages,
-         5, 0, 1000, NULL, NULL
-     },
-
-     {
          {"log_rotation_age", PGC_SIGHUP, LOGGING_WHERE,
              gettext_noop("Automatic log file rotation will occur after N minutes."),
              NULL,
--- 1569,1574 ----
***************
*** 1830,1844 ****
      },

      {
-         {"bgwriter_all_percent", PGC_SIGHUP, RESOURCES,
-             gettext_noop("Background writer percentage of all buffers to flush per round."),
-             NULL
-         },
-         &bgwriter_all_percent,
-         0.333, 0.0, 100.0, NULL, NULL
-     },
-
-     {
          {"seed", PGC_USERSET, UNGROUPED,
              gettext_noop("Sets the seed for random-number generation."),
              NULL,
--- 1821,1826 ----
***************
*** 1865,1870 ****
--- 1847,1861 ----
          0.1, 0.0, 100.0, NULL, NULL
      },

+     {
+         {"checkpoint_completion_target", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."),
+             NULL
+         },
+         &CheckPointCompletionTarget,
+         0.5, 0.0, 0.9, NULL, NULL
+     },
+
      /* End-of-list marker */
      {
          {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.216
diff -c -r1.216 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    3 Jun 2007 17:08:15 -0000    1.216
--- src/backend/utils/misc/postgresql.conf.sample    26 Jun 2007 17:02:05 -0000
***************
*** 168,173 ****
--- 168,174 ----

  #checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min        # range 30s-1h
+ #checkpoint_completion_target = 0.5    # checkpoint duration, range 0.0 - 0.9
  #checkpoint_warning = 30s        # 0 is off

  # - Archiving -
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.78
diff -c -r1.78 xlog.h
*** src/include/access/xlog.h    30 May 2007 20:12:02 -0000    1.78
--- src/include/access/xlog.h    19 Jun 2007 14:10:07 -0000
***************
*** 171,179 ****
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
--- 171,180 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h    5 Jan 2007 22:19:57 -0000    1.9
--- src/include/postmaster/bgwriter.h    25 Jun 2007 10:33:38 -0000
***************
*** 20,29 ****
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;

  extern void BackgroundWriterMain(void);

! extern void RequestCheckpoint(bool waitforit, bool warnontime);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 20,33 ----
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;
+ extern double CheckPointCompletionTarget;

  extern void BackgroundWriterMain(void);

! extern void RequestImmediateCheckpoint(void);
! extern void RequestLazyCheckpoint(void);
! extern void RequestXLogFillCheckpoint(void);
! extern void CheckpointWriteDelay(bool immediate, double progress);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.90
diff -c -r1.90 buf_internals.h
*** src/include/storage/buf_internals.h    30 May 2007 20:12:03 -0000    1.90
--- src/include/storage/buf_internals.h    12 Jun 2007 11:42:23 -0000
***************
*** 35,40 ****
--- 35,41 ----
  #define BM_IO_ERROR                (1 << 4)        /* previous I/O failed */
  #define BM_JUST_DIRTIED            (1 << 5)        /* dirtied since write started */
  #define BM_PIN_COUNT_WAITER        (1 << 6)        /* have waiter for sole pin */
+ #define BM_CHECKPOINT_NEEDED    (1 << 7)        /* this needs to be written in checkpoint */

  typedef bits16 BufFlags;

Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.104
diff -c -r1.104 bufmgr.h
*** src/include/storage/bufmgr.h    30 May 2007 20:12:03 -0000    1.104
--- src/include/storage/bufmgr.h    25 Jun 2007 11:28:36 -0000
***************
*** 33,41 ****
  /* in bufmgr.c */
  extern bool zero_damaged_pages;
  extern double bgwriter_lru_percent;
- extern double bgwriter_all_percent;
  extern int    bgwriter_lru_maxpages;
- extern int    bgwriter_all_maxpages;

  /* in buf_init.c */
  extern DLLIMPORT char *BufferBlocks;
--- 33,39 ----
***************
*** 136,142 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 134,140 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,167 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(void);
  extern void BgBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);
--- 159,165 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
  extern void BgBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);

Re: Load Distributed Checkpoints, final patch

From
Michael Glaesemann
Date:
On Jun 26, 2007, at 13:49 , Heikki Linnakangas wrote:

> Maximum is 0.9, to leave some headroom for fsync and any other
> things that need to happen during a checkpoint.

I think it might be more user-friendly to make the maximum 1 (meaning
as much smoothing as you could possibly get) and internally reserve a
certain amount off for whatever headroom might be required. It's more
common for users to see a value range from 0 to 1 rather than 0 to 0.9.
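
Something like this, just to illustrate what I mean (made-up names, not
actual variables in the patch):

    /*
     * Sketch only: map the user-visible 0.0 - 1.0 knob onto the internal
     * target, quietly keeping 10% of the interval as headroom.
     */
    static double
    effective_completion_target(double user_setting)
    {
        return user_setting * 0.9;
    }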

Michael Glaesemann
grzm seespotcode net



Re: Load Distributed Checkpoints, final patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Barring any objections from committer, I'm finished with this patch.

Sounds great, I'll start looking this over.

> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
> Smith's suggestion just to see what happens, but I doubt that will
> change my mind on the above decisions.

When do you expect to have those results?

            regards, tom lane

Re: Load Distributed Checkpoints, final patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Barring any objections from committer, I'm finished with this patch.
>
> Sounds great, I'll start looking this over.
>
>> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
>> Smith's suggestion just to see what happens, but I doubt that will
>> change my mind on the above decisions.
>
> When do you expect to have those results?

In a few days. I'm doing long tests because the variability in the 1h
tests was very high.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, final patch

From
Heikki Linnakangas
Date:
Michael Glaesemann wrote:
>
> On Jun 26, 2007, at 13:49 , Heikki Linnakangas wrote:
>
>> Maximum is 0.9, to leave some headroom for fsync and any other things
>> that need to happen during a checkpoint.
>
> I think it might be more user-friendly to make the maximum 1 (meaning as
> much smoothing as you could possibly get) and internally reserve a
> certain amount off for whatever headroom might be required. It's more
> common for users to see a value range from 0 to 1 rather than 0 to 0.9.

It would then be counter-intuitive if you set it to 1.0, and see that
your checkpoints consistently take 90% of the checkpoint interval.

We could just allow any value up to 1.0, and note in the docs that you
should leave some headroom, unless you don't mind starting the next
checkpoint a bit late. That actually sounds pretty good.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, final patch

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> We could just allow any value up to 1.0, and note in the docs that you should
> leave some headroom, unless you don't mind starting the next checkpoint a bit
> late. That actually sounds pretty good.

What exactly happens if a checkpoint takes so long that the next checkpoint
starts? Aside from it not actually helping, is there much reason to avoid this
situation? Have we ever actually tested it?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Load Distributed Checkpoints, final patch

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>
>> We could just allow any value up to 1.0, and note in the docs that you should
>> leave some headroom, unless you don't mind starting the next checkpoint a bit
>> late. That actually sounds pretty good.
>
> What exactly happens if a checkpoint takes so long that the next checkpoint
> starts? Aside from it not actually helping, is there much reason to avoid this
> situation?

Not really. We might run out of preallocated WAL segments, and will have
to create more. Recovery could be longer than expected since the real
checkpoint interval ends up being longer, but you can't make very
accurate recovery time estimations anyway.

> Have we ever actually tested it?

I haven't.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, final patch

From
Greg Smith
Date:
On Tue, 26 Jun 2007, Gregory Stark wrote:

> What exactly happens if a checkpoint takes so long that the next checkpoint
> starts? Aside from it not actually helping, is there much reason to avoid this
> situation? Have we ever actually tested it?

More segments get created, and because of how they are cleared at the
beginning this causes its own mini-I/O storm through the same buffered
write channel the checkpoint writes are going into (which may or may not
be the same way normal WAL writes go, depending on whether you're using
O_[D]SYNC WAL writes).  I've seen some weird and intermittent breakdowns
from the contention that occurs when this happens, and it's certainly
something to be avoided.

To test it you could just use a big buffer cache, write like mad to it,
and make checkpoint_segments smaller than it should be for that workload.
It's easy enough to kill yourself exactly this way right now though, and
the fact that LDC gives you a parameter to aim this particular foot-gun
more precisely isn't a big deal IMHO as long as the documentation is
clear.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, final patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> We could just allow any value up to 1.0, and note in the docs that you
> should leave some headroom, unless you don't mind starting the next
> checkpoint a bit late. That actually sounds pretty good.

Yeah, that sounds fine.  There isn't actually any harm in starting a
checkpoint later than otherwise expected, is there?  The worst
consequence I can think of is a backend having to take time to
manufacture a new xlog segment, because we didn't finish a checkpoint
in time to recycle old ones.  This might be better handled by allowing
a bit more slop in the number of recycled-into-the-future xlog segments.

Come to think of it, shouldn't we be allowing some extra slop in the
number of future segments to account for xlog archiving delays, when
that's enabled?

            regards, tom lane

Re: Load Distributed Checkpoints, final patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> We could just allow any value up to 1.0, and note in the docs that you
>> should leave some headroom, unless you don't mind starting the next
>> checkpoint a bit late. That actually sounds pretty good.
>
> Yeah, that sounds fine.  There isn't actually any harm in starting a
> checkpoint later than otherwise expected, is there?  The worst
> consequence I can think of is a backend having to take time to
> manufacture a new xlog segment, because we didn't finish a checkpoint
> in time to recycle old ones.  This might be better handled by allowing
> a bit more slop in the number of recycled-into-the-future xlog segments.
>
> Come to think of it, shouldn't we be allowing some extra slop in the
> number of future segments to account for xlog archiving delays, when
> that's enabled?

XLogFileSlop is currently 2 * checkpoint_segments + 1 since last
checkpoint, which is just enough to accommodate a checkpoint that lasts
the full checkpoint interval. If we want to keep as much "slop" there as
before, then yes that should be increased to (2 +
checkpoint_completion_target) * checkpoint_segments + 1, or just 3 *
checkpoint_segments if we want to keep it simple.
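
In xlog.c terms the change would be roughly this (untested sketch, assuming
the current macro definition; CheckPointCompletionTarget is the GUC variable
the patch adds, CheckPointSegments is the existing variable behind
checkpoint_segments):

    /* today, if I read it right: 2 * checkpoint_segments + 1 */
    #define XLOGfileslop    (2 * CheckPointSegments + 1)

    /* keeping the same headroom as before */
    #define XLOGfileslop \
        ((int) ((2.0 + CheckPointCompletionTarget) * CheckPointSegments) + 1)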

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, final patch

From
Greg Smith
Date:
On Tue, 26 Jun 2007, Heikki Linnakangas wrote:

> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg Smith's
> suggestion just to see what happens, but I doubt that will change my mind on
> the above decisions.

I don't either; at worst I'd expect a small documentation update, perhaps
with some warnings based on what's discovered there.  The form you've
added checkpoint_completion_target in is sufficient to address all the
serious concerns I had; I can turn it off, I can smooth just a bit without
increasing recovery time too much, or I can go all out smooth.

Certainly no one should consider waiting for the tests I asked you about a
hurdle to getting this patch committed; slowing that down was never my
intention in bringing that up.  I'm just curious to see if anything
scurries out of some of the darker corners in this area when they're
illuminated.  I'd actually like to see this get committed relatively soon
because there's two interleaved merges stuck behind this one (the more
verbose logging patch and the LRU modifications).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, final patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Here's latest revision of Itagaki-sans Load Distributed Checkpoints patch:

Applied with some minor revisions to make some of the internal APIs a
bit cleaner; mostly, it seemed like a good idea to replace all those
bool parameters with a flag-bits approach, so that you could have
something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
"false, true, true, false" ...

For the moment I removed all the debugging elog's in the patch.
We still have Greg Smith's checkpoint logging patch to look at
(which I suppose needs adjustment now), and that seems like the
appropriate venue to consider what to put in.

Also, the question of redesigning the bgwriter's LRU scan is
still open.  I believe that's on Greg's plate, too.

One other closely connected item that might be worth looking at is the
code for creating new future xlog segments (PreallocXlogFiles).  Greg
was griping upthread about xlog segment creation being a real
performance drag.  I realized that as we currently have it set up, the
checkpoint code is next to useless for high-WAL-volume installations,
because it only considers making *one* future XLOG segment.  Once you've
built up enough XLOG segments, the system isn't too bad about recycling
them, but there will be a nasty startup transient where foreground
processes have to stop and make the things.  I wonder whether it would
help if we (a) have the bgwriter call PreallocXlogFiles during its
normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
it will make a future segment as soon as we start using the last
existing segment, instead of only when we're nearly done.  This would at
least make it more likely that the bgwriter does the work instead of a
foreground process.  I'm hesitant to go much further than that, because
I don't want to bloat the minimum disk footprint for low-volume
installations, but the minimum footprint is really 2 xlog files anyway...
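
Very roughly, (a) would amount to one extra call in the bgwriter's normal
loop, something like this (sketch only; it hand-waves over the fact that
PreallocXlogFiles is presently static in xlog.c and would have to be
exported in some form, while GetInsertRecPtr is the function this patch
adds to xlog.h):

    /*
     * Sketch, not real code: ask for a future segment on each bgwriter
     * cycle, based on the current WAL insert position.
     */
    PreallocXlogFiles(GetInsertRecPtr());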

            regards, tom lane

Re: Load Distributed Checkpoints, final patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Here's latest revision of Itagaki-sans Load Distributed Checkpoints patch:
>
> Applied with some minor revisions to make some of the internal APIs a
> bit cleaner; mostly, it seemed like a good idea to replace all those
> bool parameters with a flag-bits approach, so that you could have
> something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
> "false, true, true, false" ...

Thanks.

> For the moment I removed all the debugging elog's in the patch.
> We still have Greg Smith's checkpoint logging patch to look at
> (which I suppose needs adjustment now), and that seems like the
> appropriate venue to consider what to put in.

Ok, I'll look at that next.

> One other closely connected item that might be worth looking at is the
> code for creating new future xlog segments (PreallocXlogFiles).  Greg
> was griping upthread about xlog segment creation being a real
> performance drag.  I realized that as we currently have it set up, the
> checkpoint code is next to useless for high-WAL-volume installations,
> because it only considers making *one* future XLOG segment.  Once you've
> built up enough XLOG segments, the system isn't too bad about recycling
> them, but there will be a nasty startup transient where foreground
> processes have to stop and make the things.  I wonder whether it would
> help if we (a) have the bgwriter call PreallocXlogFiles during its
> normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
> it will make a future segment as soon as we start using the last
> existing segment, instead of only when we're nearly done.  This would at
> least make it more likely that the bgwriter does the work instead of a
> foreground process.  I'm hesitant to go much further than that, because
> I don't want to bloat the minimum disk footprint for low-volume
> installations, but the minimum footprint is really 2 xlog files anyway...

That seems like a good idea. Segment creation might also become a problem
if you have WAL archiving set up and the archiving falls behind, so that
existing log files are not recycled fast enough.

The comment in PreallocXlogFiles is out of date:

> /*
>  * Preallocate log files beyond the specified log endpoint, according to
>  * the XLOGfile user parameter.
>  */

As you pointed out, it only preallocates one log file. And there is no
XLOGfile mentioned anywhere else in the source tree.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, final patch

From
Greg Smith
Date:
On Wed, 27 Jun 2007, Tom Lane wrote:

> Also, the question of redesigning the bgwriter's LRU scan is
> still open.  I believe that's on Greg's plate, too.

Greg's plate was temporarily fried after his house was hit by lightning
yesterday.  I just got everything back on-line again, so no coding
progress, but I think I finished assimilating your "epiphany" during that
time.  Now I realize that what you're suggesting is that under healthy
low-load conditions, the LRU scan really should be able to keep up right
behind the clock sweep point.  Noting how far behind it is serves as a
measurement of how badly it is failing to match the rate at which
re-usable buffers are being dirtied, and the only question is how fast and
how far it should try to drive its cleaned-up point forward when that
happens.
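
In concrete terms, the measurement I'm picturing is roughly this (made-up
names, nothing from the tree; NBuffers is the size of shared_buffers in
pages):

    /*
     * How far the point the LRU scan has cleaned up to lags behind the
     * current clock-sweep position, wrapping around the buffer pool.
     */
    static int
    lru_cleaning_lag(int sweep_point, int cleaned_to)
    {
        int     lag = sweep_point - cleaned_to;

        if (lag < 0)
            lag += NBuffers;
        return lag;
    }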

> Once you've built up enough XLOG segments, the system isn't too bad
> about recycling them, but there will be a nasty startup transient where
> foreground processes have to stop and make the things.

Exactly.  I found it problematic in four situations:

1) Slow checkpoint doesn't finish in time and new segments are being
created while the checkpoint is also busy (this is the really bad one)

2) The archive logger stops doing anything (usually because the archive disk
is filled) and nothing gets recycled until that's fixed.

3) checkpoint_segments is changed, so then performance is really sluggish
for a bit until all the segments are built back up again

4) You ran an early manual checkpoint, which doesn't seem to recycle as
many segments usefully

Any change that would be more proactive about creating segments in these
situations than the current code would be a benefit, even though these are
not common paths people encounter.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, final patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> The comment in PreallocXlogFiles is out of date:

Yeah, I changed it yesterday ...

> As you pointed out, it only preallocates one log file. And there is no
> XLOGfile mentioned anywhere else in the source tree.

If memory serves, there once was a variable there, but we simplified it
out of existence for reasons no longer apparent.  Possibly it'd be worth
trolling the CVS log and archives to find out why we did that.

Anyway, what I'm thinking at the moment is that it's not so much that
PreallocXlogFiles needs to do more work as that it needs to be called
more often.  Right now we only do it once per checkpoint.

            regards, tom lane

Re: Load Distributed Checkpoints, final patch

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
>>> Smith's suggestion just to see what happens, but I doubt that will
>>> change my mind on the above decisions.
>>
>> When do you expect to have those results?
>
> In a few days. I'm doing long tests because the variability in the 1h
> tests was very high.

I ran two tests with 200 warehouses to see how LDC behaves on a badly
overloaded system; see tests imola-319 and imola-320. It seems to work
quite well. In fact the checkpoint spike is, relatively speaking, less
severe than with a smaller # of warehouses even in the baseline test run,
and LDC smooths it very nicely.

After those two tests, I noticed that I had full_page_writes=off in all
tests performed earlier :(. That throws off the confidence in those
results, so I ran more tests with full_page_writes on and off to compare
the effect. I also wanted to compare the effectiveness of the patch when
checkpoints are triggered by either checkpoint_timeout or
checkpoint_segments.

imola-326 - imola-330 are all configured so that checkpoints happen
roughly on a 50 minute interval. On imola-326, checkpoints are triggered
by checkpoint_segments, and on imola-327 they're triggered by
checkpoint_timeout. On imola-326, the write phase lasts ~7 minutes, and
on imola-327, it lasts ~10 minutes. Because of full_page_writes, a lot
more WAL is consumed right after starting the checkpoint, so we end up
being more aggressive than necessary at the beginning.

For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
minutes there, and the graphs look very smooth. That suggests that
spreading the writes over a longer time wouldn't make a difference, but
smoothing the rush at the beginning of checkpoint might. I'm going to
try the algorithm I posted, that uses the WAL consumption rate from
previous checkpoint interval in the calculations.

Imola-329 is the same as imola-328, but with updated CVS source tree
instead of older tree + patch. The purpose of this test was basically to
just verify that what was committed works the same as the patch.

Imola-330 is comparable with imola-327: checkpoints are triggered by
timeout and full_page_writes=on. But 330 was patched to call
PreallocXlogFiles in the bgwriter, per Tom's idea. According to the logs,
most WAL segments are created by the bgwriter in that test, and response times
look slightly better with the patch, though I'm not sure the difference
is statistically significant.

As before, the results are available at
http://community.enterprisedb.com/ldc/

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, final patch

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
> minutes there, and the graphs look very smooth. That suggests that
> spreading the writes over a longer time wouldn't make a difference, but
> smoothing the rush at the beginning of checkpoint might. I'm going to
> try the algorithm I posted, that uses the WAL consumption rate from
> previous checkpoint interval in the calculations.

One thing that concerns me is that checkpoint smoothing happening just
after the checkpoint starts is causing I/O at the same time that
full_page_writes is causing additional I/O.  Ideally we would do the
smoothing toward the end of the checkpoint cycle, but I realize that has
problems of its own.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Load Distributed Checkpoints, final patch

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Heikki Linnakangas wrote:
>> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
>> minutes there, and the graphs look very smooth. That suggests that
>> spreading the writes over a longer time wouldn't make a difference, but
>> smoothing the rush at the beginning of checkpoint might. I'm going to
>> try the algorithm I posted, that uses the WAL consumption rate from
>> previous checkpoint interval in the calculations.

> One thing that concerns me is that checkpoint smoothing happening just
> after the checkpoint starts is causing I/O at the same time that
> full_page_writes is causing additional I/O.

I'm tempted to just apply some sort of nonlinear correction to the
WAL-based progress measurement.  Squaring it would be cheap but is
probably too extreme.  Carrying over info from the previous cycle
doesn't seem like it would help much; rather, the point is exactly
that we *don't* want a constant write speed during the checkpoint.
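
To be concrete, I'm thinking of something on this order (sketch only,
made-up function name):

    /*
     * Sketch: "progress" is the WAL-based progress fraction, 0.0 - 1.0.
     * Returning progress^2 makes the pacing lazy early in the checkpoint,
     * when full_page_writes inflates WAL consumption; an exponent between
     * 1 and 2 would be a milder correction.
     */
    static double
    nonlinear_progress(double progress)
    {
        return progress * progress;
    }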

            regards, tom lane

Re: Load Distributed Checkpoints, final patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Heikki Linnakangas wrote:
>>> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
>>> minutes there, and the graphs look very smooth. That suggests that
>>> spreading the writes over a longer time wouldn't make a difference, but
>>> smoothing the rush at the beginning of checkpoint might. I'm going to
>>> try the algorithm I posted, that uses the WAL consumption rate from
>>> previous checkpoint interval in the calculations.
>
>> One thing that concerns me is that checkpoint smoothing happening just
>> after the checkpoint starts is causing I/O at the same time that
>> full_page_writes is causing additional I/O.
>
> I'm tempted to just apply some sort of nonlinear correction to the
> WAL-based progress measurement.  Squaring it would be cheap but is
> probably too extreme.  Carrying over info from the previous cycle
> doesn't seem like it would help much; rather, the point is exactly
> that we *don't* want a constant write speed during the checkpoint.

While thinking about this, I made an observation on full_page_writes.
Currently, we perform a full page write whenever LSN < RedoRecPtr. If
we're clever, we can skip or defer some of the full page writes:

The rule is that when we replay, we need to always replay a full page
image before we apply any regular WAL records on the page. When we begin
a checkpoint, there are two possible outcomes: we crash before the new
checkpoint is finished, and we replay starting from the previous redo
ptr, or we finish the checkpoint successfully, and we replay starting
from the new redo ptr (or we don't crash and don't need to recover).

To be able to recover from the previous redo ptr, we don't need to write
a full page image if we have already written one since the previous redo
ptr.

To be able to recover from the new redo ptr, we don't need to write a
full page image, if we haven't flushed the page yet. It will be written
and fsync'd by the time the checkpoint finishes.

IOW, we can skip full page images of pages that we have already taken a
full page image of since the previous checkpoint, and that we haven't yet
flushed during the current checkpoint.
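
As a sketch, the test would become something like this (made-up names; the
two booleans stand for the bookkeeping I haven't worked out yet, and
XLByteLT is just the usual LSN comparison):

    /*
     * Sketch only, not working code.  "took_fpi_since_prev_redo" would come
     * from a new page header flag, and "flushed_during_this_checkpoint"
     * from BM_CHECKPOINT_NEEDED-style bookkeeping in the buffer header.
     */
    static bool
    need_full_page_image(XLogRecPtr page_lsn, XLogRecPtr redo_ptr,
                         bool took_fpi_since_prev_redo,
                         bool flushed_during_this_checkpoint)
    {
        if (took_fpi_since_prev_redo && !flushed_during_this_checkpoint)
            return false;           /* safe to skip or defer the image */

        /* otherwise, fall back to the current LSN < RedoRecPtr rule */
        return XLByteLT(page_lsn, redo_ptr);
    }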

This might reduce the overall WAL I/O a little bit, but more
importantly, it spreads the impact of taking full page images over the
checkpoint duration. That's a good thing on its own, but it also makes
it unnecessary to compensate for the full_page_writes rush in the
checkpoint smoothing.

I'm still trying to get my head around the bookkeeping required to get
that right; I think it's possible using the new BM_CHECKPOINT_NEEDED
flag and a new flag in the page header to mark pages for which we skipped
taking the full page image when they were last modified.

For 8.3, we should probably just do some simple compensation in the
checkpoint throttling code, if we want to do anything at all. But this
is something to think about in the future.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, final patch

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> For 8.3, we should probably just do some simple compensation in the checkpoint
> throttling code, if we want to do anything at all. But this is something to
> think about in the future.

Just as a stress test it might be interesting to run a quick tpcc test with
very short checkpoint intervals. Something like 30s. Just to make sure that
the logic is all correct and unexpected things don't start happening.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Load Distributed Checkpoints, final patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> While thinking about this, I made an observation on full_page_writes.
> Currently, we perform a full page write whenever LSN < RedoRecPtr. If
> we're clever, we can skip or defer some of the full page writes:

I'm not convinced this is safe; in particular, ISTM that a PITR slave
following the WAL log is likely to be at risk if it tries to restart
from the checkpoint you've omitted some full-page-images after.  There's
no guarantee it will have flushed pages at the same spots the master did.

            regards, tom lane

Re: Load Distributed Checkpoints, final patch

From
Bruce Momjian
Date:
This has been saved for the 8.4 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > Here's latest revision of Itagaki-sans Load Distributed Checkpoints patch:
>
> Applied with some minor revisions to make some of the internal APIs a
> bit cleaner; mostly, it seemed like a good idea to replace all those
> bool parameters with a flag-bits approach, so that you could have
> something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
> "false, true, true, false" ...
>
> For the moment I removed all the debugging elog's in the patch.
> We still have Greg Smith's checkpoint logging patch to look at
> (which I suppose needs adjustment now), and that seems like the
> appropriate venue to consider what to put in.
>
> Also, the question of redesigning the bgwriter's LRU scan is
> still open.  I believe that's on Greg's plate, too.
>
> One other closely connected item that might be worth looking at is the
> code for creating new future xlog segments (PreallocXlogFiles).  Greg
> was griping upthread about xlog segment creation being a real
> performance drag.  I realized that as we currently have it set up, the
> checkpoint code is next to useless for high-WAL-volume installations,
> because it only considers making *one* future XLOG segment.  Once you've
> built up enough XLOG segments, the system isn't too bad about recycling
> them, but there will be a nasty startup transient where foreground
> processes have to stop and make the things.  I wonder whether it would
> help if we (a) have the bgwriter call PreallocXlogFiles during its
> normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
> it will make a future segment as soon as we start using the last
> existing segment, instead of only when we're nearly done.  This would at
> least make it more likely that the bgwriter does the work instead of a
> foreground process.  I'm hesitant to go much further than that, because
> I don't want to bloat the minimum disk footprint for low-volume
> installations, but the minimum footprint is really 2 xlog files anyway...
>
>             regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Load Distributed Checkpoints, final patch

From
Bruce Momjian
Date:
Added to TODO:

* Test to see if calling PreallocXlogFiles() from the background writer
  will help with WAL segment creation latency

  http://archives.postgresql.org/pgsql-patches/2007-06/msg00340.php


---------------------------------------------------------------------------

Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > Here's latest revision of Itagaki-sans Load Distributed Checkpoints patch:
>
> Applied with some minor revisions to make some of the internal APIs a
> bit cleaner; mostly, it seemed like a good idea to replace all those
> bool parameters with a flag-bits approach, so that you could have
> something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
> "false, true, true, false" ...
>
> For the moment I removed all the debugging elog's in the patch.
> We still have Greg Smith's checkpoint logging patch to look at
> (which I suppose needs adjustment now), and that seems like the
> appropriate venue to consider what to put in.
>
> Also, the question of redesigning the bgwriter's LRU scan is
> still open.  I believe that's on Greg's plate, too.
>
> One other closely connected item that might be worth looking at is the
> code for creating new future xlog segments (PreallocXlogFiles).  Greg
> was griping upthread about xlog segment creation being a real
> performance drag.  I realized that as we currently have it set up, the
> checkpoint code is next to useless for high-WAL-volume installations,
> because it only considers making *one* future XLOG segment.  Once you've
> built up enough XLOG segments, the system isn't too bad about recycling
> them, but there will be a nasty startup transient where foreground
> processes have to stop and make the things.  I wonder whether it would
> help if we (a) have the bgwriter call PreallocXlogFiles during its
> normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
> it will make a future segment as soon as we start using the last
> existing segment, instead of only when we're nearly done.  This would at
> least make it more likely that the bgwriter does the work instead of a
> foreground process.  I'm hesitant to go much further than that, because
> I don't want to bloat the minimum disk footprint for low-volume
> installations, but the minimum footprint is really 2 xlog files anyway...
>
>             regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +