Thread: Load Distributed Checkpoints, take 3

From: Heikki Linnakangas

Here's an updated WIP patch for load distributed checkpoints.

I added a spinlock to protect the signaling fields between the bgwriter and
backends. The current non-locking approach gets really difficult to keep
correct as the patch adds two new flags, both of which are more important
than the existing ckpt_time_warn flag.
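
To make the change easier to see at a glance, here's the core of the new
arrangement, condensed from bgwriter.c in the attached patch (request-queue
fields, declarations and error handling omitted):

typedef struct
{
    pid_t       bgwriter_pid;       /* PID of bgwriter (0 if not started) */

    slock_t     ckpt_lck;           /* protects all the ckpt_* fields */

    int         ckpt_started;       /* advances when checkpoint starts */
    int         ckpt_done;          /* advances when checkpoint done */
    int         ckpt_failed;        /* advances when checkpoint fails */

    bool        ckpt_rqst_time_warn;    /* warn if too soon since last ckpt */
    bool        ckpt_rqst_immediate;    /* an immediate ckpt has been requested */
    bool        ckpt_rqst_force;        /* checkpoint even if no WAL activity */

    /* ... request queue fields unchanged ... */
} BgWriterShmemStruct;

/*
 * In RequestCheckpoint(), bgs is a volatile pointer to the struct above.
 * The request flags are set and the counters snapshotted in one critical
 * section; flags are only ever ORed in, so a weaker request can't
 * downgrade a stronger one filed by another backend.
 */
SpinLockAcquire(&bgs->ckpt_lck);

old_failed = bgs->ckpt_failed;
old_started = bgs->ckpt_started;

if (warnontime)
    bgs->ckpt_rqst_time_warn = true;
if (immediate)
    bgs->ckpt_rqst_immediate = true;
if (force)
    bgs->ckpt_rqst_force = true;

SpinLockRelease(&bgs->ckpt_lck);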

In fact, I think there's a small race condition in CVS HEAD:

1. pg_start_backup() is called, which calls RequestCheckpoint
2. RequestCheckpoint takes note of the old value of ckpt_started
3. bgwriter wakes up from pg_usleep, and sees that we've exceeded
checkpoint_timeout.
4. bgwriter increases ckpt_started to note that a new checkpoint has started
5. RequestCheckpoint signals bgwriter to start a new checkpoint
6. bgwriter calls CreateCheckpoint, with the force-flag set to false
because this checkpoint was triggered by timeout
7. RequestCheckpoint sees that ckpt_started has increased, and starts to
wait for ckpt_done to reach the new value.
8. CreateCheckpoint finishes immediately, because there was no XLOG
activity since last checkpoint.
9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns.
10. pg_start_backup() continues, with potentially the same redo location
and thus history filename as previous backup.

Now I admit that the chances of that happening are extremely small:
people don't usually do two pg_start_backup calls without *any* WAL-logged
activity in between them, for example. But as we add the new flags,
avoiding scenarios like that becomes harder.
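
With the spinlock, the waiting side becomes simple to reason about: the
backend snapshots ckpt_started under the same lock it uses to set its request
flags, and the bgwriter consumes the flags and bumps ckpt_started inside a
single critical section as well, so once the backend sees ckpt_started advance
it knows its flags have been taken into account. Condensed from the waitforit
path of RequestCheckpoint() in the attached patch (declarations omitted):

/* Wait for a new checkpoint to start. */
for (;;)
{
    SpinLockAcquire(&bgs->ckpt_lck);
    new_started = bgs->ckpt_started;
    SpinLockRelease(&bgs->ckpt_lck);

    if (new_started != old_started)
        break;                  /* bgwriter has picked up our request flags */

    CHECK_FOR_INTERRUPTS();
    pg_usleep(100000L);
}

/* Then wait for that checkpoint to finish: ckpt_done catches up with it. */
for (;;)
{
    int     new_done;

    SpinLockAcquire(&bgs->ckpt_lck);
    new_done = bgs->ckpt_done;
    new_failed = bgs->ckpt_failed;
    SpinLockRelease(&bgs->ckpt_lck);

    if (new_done - new_started >= 0)
        break;

    CHECK_FOR_INTERRUPTS();
    pg_usleep(100000L);
}

if (new_failed != old_failed)
    ereport(ERROR,
            (errmsg("checkpoint request failed"),
             errhint("Consult recent messages in the server log for details.")));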

Since the last patch, I've done some cleanup and refactoring, and added a
bunch of comments and user documentation.

I haven't yet changed GetInsertRecPtr to use the almost up-to-date value
protected by the info_lck per Simon's suggestion, and I need to do some
correctness testing. After that, I'm done with the patch.

PS. In case you're wondering what took me so long since the last revision,
I've spent a lot of time reviewing HOT.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml    7 Jun 2007 19:19:56 -0000    1.126
--- doc/src/sgml/config.sgml    19 Jun 2007 14:24:31 -0000
***************
*** 1565,1570 ****
--- 1565,1609 ----
        </listitem>
       </varlistentry>

+      <varlistentry id="guc-checkpoint-smoothing" xreflabel="checkpoint_smoothing">
+       <term><varname>checkpoint_smoothing</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_smoothing</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the target length of checkpoints, as a fraction of
+         the checkpoint interval. The default is 0.3.
+
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checkpoint-rate" xreflabel="checkpoint_rate">
+       <term><varname>checkpoint_rate</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_rate</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the minimum I/O rate used to flush dirty buffers during a
+         checkpoint, when there are not many dirty buffers in the buffer cache.
+         The default is 512 KB/s.
+
+         Note: the accuracy of this setting depends on
+         <varname>bgwriter_delay</varname>. The value is converted internally
+         to pages per bgwriter_delay, so if, for example, the minimum allowed
+         bgwriter_delay setting of 10ms is used, the effective minimum
+         checkpoint I/O rate is 1 page per 10 ms, or 800 KB/s.
+
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml    31 Jan 2007 20:56:19 -0000    1.43
--- doc/src/sgml/wal.sgml    19 Jun 2007 14:26:45 -0000
***************
*** 217,225 ****
    </para>

    <para>
     There will be at least one WAL segment file, and will normally
     not be more than 2 * <varname>checkpoint_segments</varname> + 1
!    files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
     Ordinarily, when old log segment files are no longer needed, they
--- 217,245 ----
    </para>

    <para>
+    If there are a lot of dirty buffers in the buffer cache, flushing them
+    all at checkpoint time causes a heavy burst of I/O that can disrupt other
+    activity in the system. To avoid that, the checkpoint I/O can be distributed
+    over a longer period of time, defined with
+    <varname>checkpoint_smoothing</varname>. It's given as a fraction of the
+    checkpoint interval, as defined by <varname>checkpoint_timeout</varname>
+    and <varname>checkpoint_segments</varname>. WAL segment consumption
+    and elapsed time are monitored, and the I/O rate is adjusted during the
+    checkpoint so that it finishes when the given fraction of elapsed time
+    or WAL segments has passed, whichever comes first. However, that could lead
+    to unnecessarily prolonged checkpoints when there are not many dirty buffers
+    in the cache. To avoid that, <varname>checkpoint_rate</varname> can be used
+    to set a minimum I/O rate. Note that prolonging checkpoints
+    affects recovery time, because the longer a checkpoint takes, the more WAL
+    needs to be kept around and replayed in recovery.
+   </para>
+
+   <para>
     There will be at least one WAL segment file, and will normally
     not be more than 2 * <varname>checkpoint_segments</varname> + 1
!    files, though there can be more if a large
!    <varname>checkpoint_smoothing</varname> setting is used.
!    Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
     Ordinarily, when old log segment files are no longer needed, they
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c    31 May 2007 15:13:01 -0000    1.272
--- src/backend/access/transam/xlog.c    20 Jun 2007 10:44:40 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestCheckpoint(false, true);
                      }
                  }
              }
--- 1608,1614 ----
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestXLogFillCheckpoint();
                      }
                  }
              }
***************
*** 5110,5116 ****
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true);

          /*
           * Close down recovery environment
--- 5110,5116 ----
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true, true);

          /*
           * Close down recovery environment
***************
*** 5319,5324 ****
--- 5319,5340 ----
  }

  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+     XLogCtlInsert  *Insert = &XLogCtl->Insert;
+     XLogRecPtr        recptr;
+
+     LWLockAcquire(WALInsertLock, LW_SHARED);
+     INSERT_RECPTR(recptr, Insert, Insert->curridx);
+     LWLockRelease(WALInsertLock);
+
+     return recptr;
+ }
+
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
***************
*** 5383,5389 ****
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
--- 5399,5405 ----
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
***************
*** 5395,5405 ****
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
--- 5411,5424 ----
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
+  * If immediate is true, we try to finish the checkpoint as fast as we can,
+  * ignoring the checkpoint_smoothing parameter.
+  *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool immediate, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
***************
*** 5591,5597 ****
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo);

      START_CRIT_SECTION();

--- 5610,5616 ----
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo, immediate);

      START_CRIT_SECTION();

***************
*** 5693,5708 ****
  /*
   * Flush all data in shared memory to disk, and fsync
   *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool();            /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
--- 5712,5730 ----
  /*
   * Flush all data in shared memory to disk, and fsync
   *
+  * If immediate is true, try to finish as quickly as possible, ignoring
+  * the GUC variables to throttle checkpoint I/O.
+  *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool(immediate);        /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
***************
*** 5710,5716 ****
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckpoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
--- 5732,5738 ----
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
***************
*** 5751,5757 ****
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo);

      /*
       * Update pg_control so that any subsequent crash will restart from this
--- 5773,5779 ----
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, true);

      /*
       * Update pg_control so that any subsequent crash will restart from this
***************
*** 6177,6183 ****
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestCheckpoint(true, false);

          /*
           * Now we need to fetch the checkpoint record location, and also its
--- 6199,6205 ----
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestLazyCheckpoint();

          /*
           * Now we need to fetch the checkpoint record location, and also its
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c    7 Mar 2007 13:35:02 -0000    1.233
--- src/backend/bootstrap/bootstrap.c    19 Jun 2007 15:29:51 -0000
***************
*** 489,495 ****

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true);

      /* Clean up and exit */
      cleanup();
--- 489,495 ----

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true, true);

      /* Clean up and exit */
      cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.195
diff -c -r1.195 dbcommands.c
*** src/backend/commands/dbcommands.c    1 Jun 2007 19:38:07 -0000    1.195
--- src/backend/commands/dbcommands.c    20 Jun 2007 09:36:24 -0000
***************
*** 404,410 ****
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync();

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
--- 404,410 ----
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(true);

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
***************
*** 507,513 ****
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestCheckpoint(true, false);

          /*
           * Close pg_database, but keep lock till commit (this is important to
--- 507,513 ----
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestImmediateCheckpoint();

          /*
           * Close pg_database, but keep lock till commit (this is important to
***************
*** 661,667 ****
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestCheckpoint(true, false);
  #endif

      /*
--- 661,667 ----
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestImmediateCheckpoint();
  #endif

      /*
***************
*** 1427,1433 ****
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync();

          /*
           * Copy this subdirectory to the new location
--- 1427,1433 ----
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync(true);

          /*
           * Copy this subdirectory to the new location
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.38
diff -c -r1.38 bgwriter.c
*** src/backend/postmaster/bgwriter.c    27 May 2007 03:50:39 -0000    1.38
--- src/backend/postmaster/bgwriter.c    20 Jun 2007 12:58:20 -0000
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"

  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>

***************
*** 59,64 ****
--- 60,66 ----
  #include "storage/pmsignal.h"
  #include "storage/shmem.h"
  #include "storage/smgr.h"
+ #include "storage/spin.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 112,122 ****
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     sig_atomic_t ckpt_started;    /* advances when checkpoint starts */
!     sig_atomic_t ckpt_done;        /* advances when checkpoint done */
!     sig_atomic_t ckpt_failed;    /* advances when checkpoint fails */

!     sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
--- 114,128 ----
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     slock_t        ckpt_lck;        /* protects all the ckpt_* fields */

!     int            ckpt_started;    /* advances when checkpoint starts */
!     int            ckpt_done;        /* advances when checkpoint done */
!     int            ckpt_failed;    /* advances when checkpoint fails */
!
!     bool    ckpt_rqst_time_warn;    /* warn if too soon since last ckpt */
!     bool    ckpt_rqst_immediate;    /* an immediate ckpt has been requested */
!     bool    ckpt_rqst_force;        /* checkpoint even if no WAL activity */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
***************
*** 131,136 ****
--- 137,143 ----
  int            BgWriterDelay = 200;
  int            CheckPointTimeout = 300;
  int            CheckPointWarning = 30;
+ double        CheckPointSmoothing = 0.3;

  /*
   * Flags set by interrupt handlers for later service in the main loop.
***************
*** 146,154 ****
--- 153,176 ----

  static bool ckpt_active = false;

+ /* Current time and WAL insert location when checkpoint was started */
+ static time_t        ckpt_start_time;
+ static XLogRecPtr    ckpt_start_recptr;
+
+ static double        ckpt_cached_elapsed;
+
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;

+ /* Prototypes for private functions */
+
+ static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force);
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(void);
+ static bool IsCheckpointOnSchedule(double progress);
+ static bool ImmediateCheckpointRequested(void);
+
+ /* Signal handlers */

  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
***************
*** 170,175 ****
--- 192,198 ----

      Assert(BgWriterShmem != NULL);
      BgWriterShmem->bgwriter_pid = MyProcPid;
+     SpinLockInit(&BgWriterShmem->ckpt_lck);
      am_bg_writer = true;

      /*
***************
*** 281,288 ****
--- 304,314 ----
              /* use volatile pointer to prevent code rearrangement */
              volatile BgWriterShmemStruct *bgs = BgWriterShmem;

+             SpinLockAcquire(&BgWriterShmem->ckpt_lck);
              bgs->ckpt_failed++;
              bgs->ckpt_done = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
              ckpt_active = false;
          }

***************
*** 328,337 ****
      for (;;)
      {
          bool        do_checkpoint = false;
-         bool        force_checkpoint = false;
          time_t        now;
          int            elapsed_secs;
-         long        udelay;

          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
--- 354,361 ----
***************
*** 354,360 ****
          {
              checkpoint_requested = false;
              do_checkpoint = true;
-             force_checkpoint = true;
              BgWriterStats.m_requested_checkpoints++;
          }
          if (shutdown_requested)
--- 378,383 ----
***************
*** 377,387 ****
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             if (!force_checkpoint)
!                 BgWriterStats.m_timed_checkpoints++;
          }

          /*
--- 400,409 ----
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (!do_checkpoint && elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             BgWriterStats.m_timed_checkpoints++;
          }

          /*
***************
*** 390,395 ****
--- 412,445 ----
           */
          if (do_checkpoint)
          {
+             /* use volatile pointer to prevent code rearrangement */
+             volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+             bool time_warn;
+             bool immediate;
+             bool force;
+
+             /*
+              * Atomically check the request flags to figure out what
+              * kind of a checkpoint we should perform, and increase the
+              * started-counter to acknowledge that we've started
+              * a new checkpoint.
+              */
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+
+             time_warn = bgs->ckpt_rqst_time_warn;
+             bgs->ckpt_rqst_time_warn = false;
+
+             immediate = bgs->ckpt_rqst_immediate;
+             bgs->ckpt_rqst_immediate = false;
+
+             force = bgs->ckpt_rqst_force;
+             bgs->ckpt_rqst_force = false;
+
+             bgs->ckpt_started++;
+
+             SpinLockRelease(&bgs->ckpt_lck);
+
              /*
               * We will warn if (a) too soon since last checkpoint (whatever
               * caused it) and (b) somebody has set the ckpt_time_warn flag
***************
*** 397,417 ****
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (BgWriterShmem->ckpt_time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!             BgWriterShmem->ckpt_time_warn = false;

              /*
               * Indicate checkpoint start to any waiting backends.
               */
              ckpt_active = true;
-             BgWriterShmem->ckpt_started++;

!             CreateCheckPoint(false, force_checkpoint);

              /*
               * After any checkpoint, close all smgr files.    This is so we
--- 447,474 ----
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!

              /*
               * Indicate checkpoint start to any waiting backends.
               */
              ckpt_active = true;

!             ckpt_start_recptr = GetInsertRecPtr();
!             ckpt_start_time = now;
!             ckpt_cached_elapsed = 0;
!
!             elog(DEBUG1, "CHECKPOINT: start");
!
!             CreateCheckPoint(false, immediate, force);
!
!             elog(DEBUG1, "CHECKPOINT: end");

              /*
               * After any checkpoint, close all smgr files.    This is so we
***************
*** 422,428 ****
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
              ckpt_active = false;

              /*
--- 479,487 ----
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             SpinLockAcquire(&bgs->ckpt_lck);
!             bgs->ckpt_done = bgs->ckpt_started;
!             SpinLockRelease(&bgs->ckpt_lck);
              ckpt_active = false;

              /*
***************
*** 433,446 ****
              last_checkpoint_time = now;
          }
          else
!             BgBufferSync();

          /*
!          * Check for archive_timeout, if so, switch xlog files.  First we do a
!          * quick check using possibly-stale local state.
           */
!         if (XLogArchiveTimeout > 0 &&
!             (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
--- 492,530 ----
              last_checkpoint_time = now;
          }
          else
!         {
!             BgAllSweep();
!             BgLruSweep();
!         }

          /*
!          * Check for archive_timeout and switch xlog files if necessary.
           */
!         CheckArchiveTimeout();
!
!         /* Nap for the configured time. */
!         BgWriterNap();
!     }
! }
!
! /*
!  * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
!  *        if needed
!  */
! static void
! CheckArchiveTimeout(void)
! {
!     time_t        now;
!
!     if (XLogArchiveTimeout <= 0)
!         return;
!
!     now = time(NULL);
!
!     /* First we do a quick check using possibly-stale local state. */
!     if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
!         return;
!
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
***************
*** 450,459 ****

              last_xlog_switch_time = Max(last_xlog_switch_time, last_time);

-             /* if we did a checkpoint, 'now' might be stale too */
-             if (do_checkpoint)
-                 now = time(NULL);
-
              /* Now we can do the real check */
              if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
              {
--- 534,539 ----
***************
*** 478,483 ****
--- 558,572 ----
                  last_xlog_switch_time = now;
              }
          }
+ }
+
+ /*
+  * BgWriterNap -- Nap for the configured time or until a signal is received.
+  */
+ static void
+ BgWriterNap(void)
+ {
+     long        udelay;

          /*
           * Send off activity statistics to the stats collector
***************
*** 496,502 ****
           * We absorb pending requests after each short sleep.
           */
          if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
!             (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0))
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
--- 585,592 ----
           * We absorb pending requests after each short sleep.
           */
          if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
!             (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) ||
!             ckpt_active)
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
***************
*** 505,522 ****

          while (udelay > 999999L)
          {
!             if (got_SIGHUP || checkpoint_requested || shutdown_requested)
                  break;
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!         if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
              pg_usleep(udelay);
      }
  }


  /* --------------------------------
   *        signal handler routines
--- 595,766 ----

          while (udelay > 999999L)
          {
!             /* If a checkpoint is active, postpone reloading the config
!              * until the checkpoint is finished, and don't care about
!              * non-immediate checkpoint requests.
!              */
!             if (shutdown_requested ||
!                 (!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
!                 (ckpt_active && ImmediateCheckpointRequested()))
                  break;
+
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!
!         if (!(shutdown_requested ||
!               (!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
!               (ckpt_active && ImmediateCheckpointRequested())))
              pg_usleep(udelay);
+ }
+
+ /*
+  * Returns true if an immediate checkpoint request is pending.
+  */
+ static bool
+ ImmediateCheckpointRequested(void)
+ {
+     if (checkpoint_requested)
+     {
+         volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+         /*
+          * We're only looking at a single field, so we don't need to
+          * acquire the lock in this case.
+          */
+         if (bgs->ckpt_rqst_immediate)
+             return true;
      }
+     return false;
  }

+ /*
+  * CheckpointWriteDelay -- periodic sleep during the checkpoint write phase
+  *
+  * During checkpoint, this is called periodically by the buffer manager while
+  * writing out dirty buffers from the shared buffer cache. We estimate if we've
+  * made enough progress so that we're going to finish this checkpoint in time
+  * before the next one is due, taking checkpoint_smoothing into account.
+  * If so, we perform one round of normal bgwriter activity including LRU-
+  * cleaning of buffer cache, switching xlog segment if archive_timeout has
+  * passed, and sleeping for BgWriterDelay msecs.
+  *
+  * 'progress' is an estimate of how much of the write work has been done, as
+  * a fraction between 0.0 meaning none, and 1.0 meaning all done.
+  */
+ void
+ CheckpointWriteDelay(double progress)
+ {
+     /*
+      * Return immediately if we should finish the checkpoint ASAP.
+      */
+     if (!am_bg_writer || CheckPointSmoothing <= 0 || shutdown_requested ||
+         ImmediateCheckpointRequested())
+         return;
+
+     elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress);
+
+     /* Take a nap and perform the usual bgwriter duties, unless we're behind
+      * schedule, in which case we just try to catch up as quickly as possible.
+      */
+     if (IsCheckpointOnSchedule(progress))
+     {
+         CheckArchiveTimeout();
+         BgLruSweep();
+         BgWriterNap();
+     }
+ }
+
+ /*
+  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+  *         in time?
+  *
+  * Compares the current progress against the time/segments elapsed since last
+  * checkpoint, and returns true if the progress we've made so far is greater
+  * than the elapsed time/segments.
+  *
+  * If another checkpoint has already been requested, always return false.
+  */
+ static bool
+ IsCheckpointOnSchedule(double progress)
+ {
+     struct timeval    now;
+     XLogRecPtr        recptr;
+     double            progress_in_time,
+                     progress_in_xlog;
+
+     Assert(ckpt_active);
+
+     /* scale progress according to CheckPointSmoothing */
+     progress *= CheckPointSmoothing;
+
+     /*
+      * Check against the cached value first. Only do the more expensive
+      * calculations once we reach the target previously calculated. Since
+      * neither time nor the WAL insert pointer moves backwards, a freshly
+      * calculated value can only be greater than or equal to the cached value.
+      */
+     if (progress < ckpt_cached_elapsed)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Still behind cached=%.3f, progress=%.3f",
+              ckpt_cached_elapsed, progress);
+         return false;
+     }
+
+     gettimeofday(&now, NULL);
+
+     /*
+      * Check progress against time elapsed and checkpoint_timeout.
+      */
+     progress_in_time = ((double) (now.tv_sec - ckpt_start_time) +
+         now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+     if (progress < progress_in_time)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, progress=%.3f",
+              progress_in_time, progress);
+
+         ckpt_cached_elapsed = progress_in_time;
+
+         return false;
+     }
+
+     /*
+      * Check progress against WAL segments written and checkpoint_segments.
+      *
+      * We compare the current WAL insert location against the location
+      * computed before calling CreateCheckPoint. The code in XLogInsert that
+      * actually triggers a checkpoint when checkpoint_segments is exceeded
+      * compares against RedoRecptr, so this is not completely accurate.
+      * However, it's good enough for our purposes; we're only calculating
+      * an estimate anyway.
+      */
+     recptr = GetInsertRecPtr();
+     progress_in_xlog =
+         (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+         CheckPointSegments;
+
+     if (progress < progress_in_xlog)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, progress=%.3f",
+              progress_in_xlog, progress);
+
+         ckpt_cached_elapsed = progress_in_xlog;
+
+         return false;
+     }
+
+
+     /* It looks like we're on schedule. */
+
+     elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f progress=%.3f",
+          progress_in_time, progress_in_xlog, progress);
+
+     return true;
+ }

  /* --------------------------------
   *        signal handler routines
***************
*** 618,625 ****
  }

  /*
   * RequestCheckpoint
!  *        Called in backend processes to request an immediate checkpoint
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
--- 862,910 ----
  }

  /*
+  * RequestImmediateCheckpoint
+  *        Called in backend processes to request an immediate checkpoint.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestImmediateCheckpoint(void)
+ {
+     RequestCheckpoint(true, false, true, true);
+ }
+
+ /*
+  * RequestLazyCheckpoint
+  *        Called in backend processes to request a lazy checkpoint.
+  *
+  * This is essentially the same as RequestImmediateCheckpoint, except
+  * that this form obeys the checkpoint_smoothing GUC variable, and
+  * can therefore take a lot longer.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestLazyCheckpoint(void)
+ {
+     RequestCheckpoint(true, false, false, true);
+ }
+
+ /*
+  * RequestXLogFillCheckpoint
+  *        Signals the bgwriter that we've reached checkpoint_segments
+  *
+  * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, this returns
+  * immediately without waiting for the checkpoint to finish.
+  */
+ void
+ RequestXLogFillCheckpoint(void)
+ {
+     RequestCheckpoint(false, true, false, false);
+ }
+
+ /*
   * RequestCheckpoint
!  *        Common subroutine for all the above Request*Checkpoint variants.
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
***************
*** 628,648 ****
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
! void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     sig_atomic_t old_failed = bgs->ckpt_failed;
!     sig_atomic_t old_started = bgs->ckpt_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         CreateCheckPoint(false, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
--- 913,942 ----
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
+  *
+  * If immediate is true, the checkpoint should be finished ASAP.
+  *
+  * If force is true, force a checkpoint even if no XLOG activity has occurred
+  * since the last one.
   */
! static void
! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     int old_failed, old_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         /*
!          * There's no point in doing lazy checkpoints in a standalone
!          * backend, because there are no other backends the checkpoint could
!          * disrupt.
!          */
!         CreateCheckPoint(false, true, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
***************
*** 653,661 ****
          return;
      }

!     /* Set warning request flag if appropriate */
      if (warnontime)
!         bgs->ckpt_time_warn = true;

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
--- 947,974 ----
          return;
      }

!     /*
!      * Atomically set the request flags, and take a snapshot of the counters.
!      * This ensures that when we see that ckpt_started > old_started,
!      * we know the flags we set here have been seen by bgwriter.
!      *
!      * Note that we effectively OR the flags with any existing flags, to
!      * avoid overriding a "stronger" request by another backend.
!      */
!     SpinLockAcquire(&bgs->ckpt_lck);
!
!     old_failed = bgs->ckpt_failed;
!     old_started = bgs->ckpt_started;
!
!     /* Set request flags as appropriate */
      if (warnontime)
!         bgs->ckpt_rqst_time_warn = true;
!     if (immediate)
!         bgs->ckpt_rqst_immediate = true;
!     if (force)
!         bgs->ckpt_rqst_force = true;
!
!     SpinLockRelease(&bgs->ckpt_lck);

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
***************
*** 674,701 ****
       */
      if (waitforit)
      {
!         while (bgs->ckpt_started == old_started)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
-         old_started = bgs->ckpt_started;

          /*
!          * We are waiting for ckpt_done >= old_started, in a modulo sense.
!          * This is a little tricky since we don't know the width or signedness
!          * of sig_atomic_t.  We make the lowest common denominator assumption
!          * that it is only as wide as "char".  This means that this algorithm
!          * will cope correctly as long as we don't sleep for more than 127
!          * completed checkpoints.  (If we do, we will get another chance to
!          * exit after 128 more checkpoints...)
           */
!         while (((signed char) (bgs->ckpt_done - old_started)) < 0)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!         if (bgs->ckpt_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
--- 987,1031 ----
       */
      if (waitforit)
      {
!         int new_started, new_failed;
!
!         /* Wait for a new checkpoint to start. */
!         for (;;)
          {
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_started = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if (new_started != old_started)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }

          /*
!          * We are waiting for ckpt_done >= new_started, in a modulo sense.
!          * This algorithm will cope correctly as long as we don't sleep for
!          * more than INT_MAX completed checkpoints.  (If we do, we will get
!          * another chance to exit after INT_MAX more checkpoints...)
           */
!         for (;;)
          {
+             int new_done;
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_done = bgs->ckpt_done;
+             new_failed = bgs->ckpt_failed;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if (new_done - new_started >= 0)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!
!         if (new_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.220
diff -c -r1.220 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    30 May 2007 20:11:58 -0000    1.220
--- src/backend/storage/buffer/bufmgr.c    20 Jun 2007 12:47:43 -0000
***************
*** 32,38 ****
   *
   * BufferSync() -- flush all dirty buffers in the buffer pool.
   *
!  * BgBufferSync() -- flush some dirty buffers in the buffer pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
--- 32,40 ----
   *
   * BufferSync() -- flush all dirty buffers in the buffer pool.
   *
!  * BgAllSweep() -- write out some dirty buffers in the pool.
!  *
!  * BgLruSweep() -- write out some lru dirty buffers in the pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
***************
*** 74,79 ****
--- 76,82 ----
  double        bgwriter_all_percent = 0.333;
  int            bgwriter_lru_maxpages = 5;
  int            bgwriter_all_maxpages = 5;
+ int            checkpoint_rate = 512; /* in pages/s */


  long        NDirectFileRead;    /* some I/O's are direct file access. bypass
***************
*** 645,651 ****
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

--- 648,654 ----
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

***************
*** 1000,1037 ****
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
!     int            buf_id;
      int            num_to_scan;
      int            absorb_counter;

      /*
       * Find out where to start the circular scan.
       */
!     buf_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers.
       */
      num_to_scan = NBuffers;
      absorb_counter = WRITES_PER_ABSORB;
      while (num_to_scan-- > 0)
      {
!         if (SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;

              /*
               * If in bgwriter, absorb pending fsync requests after each
               * WRITES_PER_ABSORB write operations, to prevent overflow of the
               * fsync request queue.  If not in bgwriter process, this is a
               * no-op.
               */
              if (--absorb_counter <= 0)
              {
--- 1003,1127 ----
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
+  * If 'immediate' is true, write them all ASAP, otherwise throttle the
+  * I/O rate according to checkpoint_write_rate GUC variable, and perform
+  * normal bgwriter duties periodically.
   */
  void
! BufferSync(bool immediate)
  {
!     int            buf_id, start_id;
      int            num_to_scan;
+     int            num_to_write;
+     int            num_written;
      int            absorb_counter;
+     int            num_written_since_nap;
+     int            writes_per_nap;
+
+     /*
+      * Convert checkpoint_rate (pages/s) to the number of writes to perform
+      * in each period of BgWriterDelay (ms). The result is an integer, so we
+      * lose some precision here. There are a lot of other factors as well
+      * that affect the real rate, for example the granularity of the OS timer
+      * used for BgWriterDelay, whether any of the writes block, and time
+      * spent in CheckpointWriteDelay performing normal bgwriter duties.
+      */
+     writes_per_nap = Max(1, checkpoint_rate * BgWriterDelay / 1000);

      /*
       * Find out where to start the circular scan.
       */
!     start_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers, and mark the ones that need to be written with
!      * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
!      * can estimate how much work needs to be done.
!      *
!      * This allows us to only write those pages that were dirty when the
!      * checkpoint began, and haven't been flushed to disk since. Whenever a
!      * page with BM_CHECKPOINT_NEEDED is written out by normal backends or
!      * the bgwriter LRU-scan, the flag is cleared, and any pages dirtied after
!      * this scan don't have the flag set.
!      */
!     num_to_scan = NBuffers;
!     num_to_write = 0;
!     buf_id = start_id;
!     while (num_to_scan-- > 0)
!     {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!
!         /*
!          * Header spinlock is enough to examine BM_DIRTY, see comment in
!          * SyncOneBuffer.
!          */
!         LockBufHdr(bufHdr);
!
!         if (bufHdr->flags & BM_DIRTY)
!         {
!             bufHdr->flags |= BM_CHECKPOINT_NEEDED;
!             num_to_write++;
!         }
!
!         UnlockBufHdr(bufHdr);
!
!         if (++buf_id >= NBuffers)
!             buf_id = 0;
!     }
!
!     elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers);
!
!     /*
!      * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.
       */
      num_to_scan = NBuffers;
+     num_written = num_written_since_nap = 0;
      absorb_counter = WRITES_PER_ABSORB;
+     buf_id = start_id;
      while (num_to_scan-- > 0)
      {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!         bool needs_flush;
!
!         /* We don't need to acquire the buffer header lock here, because
!          * we're only looking at a single bit. It's possible that someone
!          * else writes the buffer and clears the flag right after we
!          * check, but that doesn't matter. This assumes that no-one
!          * clears the flag and sets it again while holding the buffer
!          * header lock, expecting no-one to see the intermediate state.
!          */
!         needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0;
!
!         if (needs_flush && SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;
+             num_written++;
+
+             /*
+              * Perform normal bgwriter duties and sleep to throttle
+              * our I/O rate.
+              */
+             if (!immediate && ++num_written_since_nap >= writes_per_nap)
+             {
+                 num_written_since_nap = 0;
+                 CheckpointWriteDelay((double) (num_written) / num_to_write);
+             }

              /*
               * If in bgwriter, absorb pending fsync requests after each
               * WRITES_PER_ABSORB write operations, to prevent overflow of the
               * fsync request queue.  If not in bgwriter process, this is a
               * no-op.
+              *
+              * AbsorbFsyncRequests is also called inside CheckpointWriteDelay,
+              * so this is partially redundant. However, we can't totally rely
+              * on the call in CheckpointWriteDelay, because it's only made
+              * before sleeping. In case CheckpointWriteDelay doesn't sleep,
+              * we need to absorb pending requests ourselves.
               */
              if (--absorb_counter <= 0)
              {
***************
*** 1045,1059 ****
  }

  /*
!  * BgBufferSync -- Write out some dirty buffers in the pool.
   *
   * This is called periodically by the background writer process.
   */
  void
! BgBufferSync(void)
  {
      static int    buf_id1 = 0;
-     int            buf_id2;
      int            num_to_scan;
      int            num_written;

--- 1135,1152 ----
  }

  /*
!  * BgAllSweep -- Write out some dirty buffers in the pool.
   *
+  * Runs the bgwriter all-sweep algorithm to write dirty buffers to
+  * minimize work at checkpoint time.
   * This is called periodically by the background writer process.
+  *
+  * XXX: Is this really needed with load distributed checkpoints?
   */
  void
! BgAllSweep(void)
  {
      static int    buf_id1 = 0;
      int            num_to_scan;
      int            num_written;

***************
*** 1063,1072 ****
      /*
       * To minimize work at checkpoint time, we want to try to keep all the
       * buffers clean; this motivates a scan that proceeds sequentially through
!      * all buffers.  But we are also charged with ensuring that buffers that
!      * will be recycled soon are clean when needed; these buffers are the ones
!      * just ahead of the StrategySyncStart point.  We make a separate scan
!      * through those.
       */

      /*
--- 1156,1162 ----
      /*
       * To minimize work at checkpoint time, we want to try to keep all the
       * buffers clean; this motivates a scan that proceeds sequentially through
!      * all buffers.
       */

      /*
***************
*** 1098,1103 ****
--- 1188,1210 ----
          }
          BgWriterStats.m_buf_written_all += num_written;
      }
+ }
+
+ /*
+  * BgLruSweep -- Write out some lru dirty buffers in the pool.
+  */
+ void
+ BgLruSweep(void)
+ {
+     int            buf_id2;
+     int            num_to_scan;
+     int            num_written;
+
+     /*
+      * The purpose of this sweep is to ensure that buffers that
+      * will be recycled soon are clean when needed; these buffers are the ones
+      * just ahead of the StrategySyncStart point.
+      */

      /*
       * This loop considers only unpinned buffers close to the clock sweep
***************
*** 1341,1349 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
!     BufferSync();
      smgrsync();
  }

--- 1448,1459 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
!     elog(DEBUG1, "CHECKPOINT: write phase");
!     BufferSync(immediate || CheckPointSmoothing <= 0);
!
!     elog(DEBUG1, "CHECKPOINT: sync phase");
      smgrsync();
  }

***************
*** 2132,2138 ****
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~BM_DIRTY;
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
--- 2242,2248 ----
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.280
diff -c -r1.280 utility.c
*** src/backend/tcop/utility.c    30 May 2007 20:12:01 -0000    1.280
--- src/backend/tcop/utility.c    20 Jun 2007 09:36:31 -0000
***************
*** 1089,1095 ****
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestCheckpoint(true, false);
              break;

          case T_ReindexStmt:
--- 1089,1095 ----
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestImmediateCheckpoint();
              break;

          case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.396
diff -c -r1.396 guc.c
*** src/backend/utils/misc/guc.c    8 Jun 2007 18:23:52 -0000    1.396
--- src/backend/utils/misc/guc.c    20 Jun 2007 10:14:06 -0000
***************
*** 1487,1492 ****
--- 1487,1503 ----
          30, 0, INT_MAX, NULL, NULL
      },

+
+     {
+         {"checkpoint_rate", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("Minimum I/O rate used to write dirty buffers during checkpoints."),
+             NULL,
+             GUC_UNIT_BLOCKS
+         },
+         &checkpoint_rate,
+         100, 0, 100000, NULL, NULL
+     },
+
      {
          {"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
              gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
***************
*** 1866,1871 ****
--- 1877,1891 ----
          0.1, 0.0, 100.0, NULL, NULL
      },

+     {
+         {"checkpoint_smoothing", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."),
+             NULL
+         },
+         &CheckPointSmoothing,
+         0.3, 0.0, 0.9, NULL, NULL
+     },
+
      /* End-of-list marker */
      {
          {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.216
diff -c -r1.216 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    3 Jun 2007 17:08:15 -0000    1.216
--- src/backend/utils/misc/postgresql.conf.sample    20 Jun 2007 10:03:17 -0000
***************
*** 168,173 ****
--- 168,175 ----

  #checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min        # range 30s-1h
+ #checkpoint_smoothing = 0.3        # checkpoint duration, range 0.0 - 0.9
+ #checkpoint_rate = 512.0KB        # min. checkpoint write rate per second
  #checkpoint_warning = 30s        # 0 is off

  # - Archiving -
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.78
diff -c -r1.78 xlog.h
*** src/include/access/xlog.h    30 May 2007 20:12:02 -0000    1.78
--- src/include/access/xlog.h    19 Jun 2007 14:10:07 -0000
***************
*** 171,179 ****
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
--- 171,180 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h    5 Jan 2007 22:19:57 -0000    1.9
--- src/include/postmaster/bgwriter.h    20 Jun 2007 09:27:20 -0000
***************
*** 20,29 ****
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;

  extern void BackgroundWriterMain(void);

! extern void RequestCheckpoint(bool waitforit, bool warnontime);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 20,33 ----
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;
+ extern double CheckPointSmoothing;

  extern void BackgroundWriterMain(void);

! extern void RequestImmediateCheckpoint(void);
! extern void RequestLazyCheckpoint(void);
! extern void RequestXLogFillCheckpoint(void);
! extern void CheckpointWriteDelay(double progress);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.90
diff -c -r1.90 buf_internals.h
*** src/include/storage/buf_internals.h    30 May 2007 20:12:03 -0000    1.90
--- src/include/storage/buf_internals.h    12 Jun 2007 11:42:23 -0000
***************
*** 35,40 ****
--- 35,41 ----
  #define BM_IO_ERROR                (1 << 4)        /* previous I/O failed */
  #define BM_JUST_DIRTIED            (1 << 5)        /* dirtied since write started */
  #define BM_PIN_COUNT_WAITER        (1 << 6)        /* have waiter for sole pin */
+ #define BM_CHECKPOINT_NEEDED    (1 << 7)        /* this needs to be written in checkpoint */

  typedef bits16 BufFlags;

Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.104
diff -c -r1.104 bufmgr.h
*** src/include/storage/bufmgr.h    30 May 2007 20:12:03 -0000    1.104
--- src/include/storage/bufmgr.h    20 Jun 2007 10:28:43 -0000
***************
*** 36,41 ****
--- 36,42 ----
  extern double bgwriter_all_percent;
  extern int    bgwriter_lru_maxpages;
  extern int    bgwriter_all_maxpages;
+ extern int    checkpoint_rate;

  /* in buf_init.c */
  extern DLLIMPORT char *BufferBlocks;
***************
*** 136,142 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 137,143 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,168 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(void);
! extern void BgBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);

--- 162,170 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
! extern void BgAllSweep(void);
! extern void BgLruSweep(void);

  extern void AtProcExit_LocalBuffers(void);


Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I added a spinlock to protect the signaling fields between bgwriter and
> backends. The current non-locking approach gets really difficult as the
> patch adds two new flags, and both are more important than the existing
> ckpt_time_warn flag.

That may be, but you could minimize the cost and notational ugliness by
not using the spinlock where you don't have to.  Put the sig_atomic_t
fields back the way they were, and many of the uses of the spinlock go
away.  All you really need it for is the places where the additional
flags are set or read.

> In fact, I think there's a small race condition in CVS HEAD:

Yeah, probably --- the original no-locking design didn't have any side
flags.  The reason you need the lock is for a backend to be sure that
a newly-started checkpoint is using its requested flags.  But the
detection of checkpoint termination is still the same.

Some other comments:

I tend to agree with whoever said upthread that the combination of GUC
variables proposed here is confusing and ugly.  It'd make more sense to
have min and max checkpoint rates in KB/s, with the max checkpoint rate
only honored when we are predicting we'll finish before the next
checkpoint time.

The flow of control and responsibility between xlog.c, bgwriter.c and
bufmgr.c seems to have reached the breaking point of unintelligibility.
Can you think of a refactoring that would simplify it?  We might have
to resign ourselves to this much messiness, but I'd rather not...

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
ITAGAKI Takahiro
Date:
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> Here's an updated WIP patch for load distributed checkpoints.
> Since last patch, I did some clean up and refactoring, and added a bunch
> of comments, and user documentation.

The only thing I don't understand is the naming of 'checkpoint_smoothing'.
Can users imagine the unit of 'smoothing' is a fraction?

You explain the parameter with the word 'fraction'.
Why don't you simply name it 'checkpoint_fraction'?
| Specifies the target length of checkpoints, as a fraction of
| the checkpoint interval. The default is 0.3.

Sorry if I'm missing discussions about the naming.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
ITAGAKI Takahiro wrote:
> The only thing I don't understand is the naming of 'checkpoint_smoothing'.
> Can users imagine the unit of 'smoothing' is a fraction?
>
> You explain the parameter with the word 'fraction'.
> Why don't you simply name it 'checkpoint_fraction'?
> | Specifies the target length of checkpoints, as a fraction of
> | the checkpoint interval. The default is 0.3.

I chose checkpoint_smoothing because it tells you what the parameter is
for. If you want more smoothing, tune it up, and if you want less, tune
it down. checkpoint_fraction makes you wonder what you can do with it
and why you would change it.

> Sorry if I'm missing discussions about the naming.

No, I chose _smoothing on my own. People didn't like
checkpoint_write_percent either (me included).

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I added a spinlock to protect the signaling fields between bgwriter and
>> backends. The current non-locking approach gets really difficult as the
>> patch adds two new flags, and both are more important than the existing
>> ckpt_time_warn flag.
>
> That may be, but you could minimize the cost and notational ugliness by
> not using the spinlock where you don't have to.  Put the sig_atomic_t
> fields back the way they were, and many of the uses of the spinlock go
> away.  All you really need it for is the places where the additional
> flags are set or read.

I find it easier to understand if it's used whenever any of the fields
are accessed. You don't need to read and write ckpt_failed and
ckpt_started/ckpt_done in a specific order anymore, for example.

> Some other comments:
>
> I tend to agree with whoever said upthread that the combination of GUC
> variables proposed here is confusing and ugly.  It'd make more sense to
> have min and max checkpoint rates in KB/s, with the max checkpoint rate
> only honored when we are predicting we'll finish before the next
> checkpoint time.

Really? I thought everyone was happy with the current combination, and
that it was just the old trio of parameters controlling the write, nap
and sync phases that was ugly. One particularly nice thing about
defining the duration as a fraction of checkpoint interval is that we
can come up with a reasonable default value that doesn't depend on your
hardware.

How would a min and max rate work?

Anyone else have an opinion on the parameters?

> The flow of control and responsibility between xlog.c, bgwriter.c and
> bufmgr.c seems to have reached the breaking point of unintelligibility.
> Can you think of a refactoring that would simplify it?  We might have
> to resign ourselves to this much messiness, but I'd rather not...

The problem we're trying to solve is doing a checkpoint while running
the normal bgwriter activity at the same time. The normal way to do two
things simultaneously is to have two different processes (or threads). I
thought about having a separate checkpoint process, but I gave up on
that thought because the communication needed between backends, bgwriter
and the checkpointer seems like a mess. The checkpointer would need the
pendingOpsTable so that it can do the fsyncs, and it would also need to
receive the forget-messages for that table. We could move that table
entirely to the checkpointer, but since bgwriter is presumably doing
most of the writes, there would be a lot of chatter between bgwriter and
the checkpointer.

The current approach is like co-operative multitasking. BufferSync
yields control to bgwriter every now and then.

The division of labor between xlog.c and other modules is not that bad,
IMO. CreateCheckPoint is the main entry point to create a checkpoint,
and it calls other modules to do their stuff, including bufmgr.c.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> In fact, I think there's a small race condition in CVS HEAD:
>
> Yeah, probably --- the original no-locking design didn't have any side
> flags.  The reason you need the lock is for a backend to be sure that
> a newly-started checkpoint is using its requested flags.  But the
> detection of checkpoint termination is still the same.

Actually, the race condition I outlined isn't related to the flags. It's
possible because RequestCheckpoint doesn't guarantee that a checkpoint
is performed when there's been no WAL activity since the last one.

I did use a new force-flag to fix it, but I'm sure there are other ways.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I tend to agree with whoever said upthread that the combination of GUC
>> variables proposed here is confusing and ugly.  It'd make more sense to
>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>> only honored when we are predicting we'll finish before the next
>> checkpoint time.

> Really? I thought everyone is happy with the current combination, and
> that it was just the old trio of parameters controlling the write, nap
> and sync phases that was ugly. One particularly nice thing about
> defining the duration as a fraction of checkpoint interval is that we
> can come up with a reasonable default value that doesn't depend on your
> hardware.

That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch.  Do we need the min rate
at all?  If so, why can't it be in the same units as the max (ie, a
fraction of checkpoint)?

> How would a min and max rate work?

Pretty much the same as the code does now, no?  You either delay, or not.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I tend to agree with whoever said upthread that the combination of GUC
>>> variables proposed here is confusing and ugly.  It'd make more sense to
>>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>>> only honored when we are predicting we'll finish before the next
>>> checkpoint time.
>
>> Really? I thought everyone is happy with the current combination, and
>> that it was just the old trio of parameters controlling the write, nap
>> and sync phases that was ugly. One particularly nice thing about
>> defining the duration as a fraction of checkpoint interval is that we
>> can come up with a reasonable default value that doesn't depend on your
>> hardware.
>
> That argument would hold some water if you weren't introducing a
> hardware-dependent min rate in the same patch.  Do we need the min rate
> at all?  If so, why can't it be in the same units as the max (ie, a
> fraction of checkpoint)?
>
>> How would a min and max rate work?
>
> Pretty much the same as the code does now, no?  You either delay, or not.

I don't think you understand how the settings work. Did you read the
documentation? If you did, it's apparently not adequate.

The main tuning knob is checkpoint_smoothing, which is defined as a
fraction of the checkpoint interval (both checkpoint_timeout and
checkpoint_segments are taken into account). Normally, the write phase
of a checkpoint takes exactly that much time.

So the length of a checkpoint stays the same regardless of how many
dirty buffers there are (assuming you don't exceed the bandwidth of your
hardware), but the I/O rate varies.

There's another possible strategy: keep the I/O rate constant, but vary
the length of the checkpoint. checkpoint_rate allows you to do that.

I'm envisioning we set the defaults so that checkpoint_smoothing is the
effective control in a relatively busy system, and checkpoint_rate
ensures that we don't unnecessarily prolong checkpoints on an idle system.
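
As a rough sketch of how I see the two knobs combining (illustrative
pseudocode only, not the patch's actual logic; dirty_pages and
checkpoint_interval_secs are made-up names):

    /* pages/s needed to finish the write phase by the smoothing deadline */
    double deadline_rate = dirty_pages /
        (CheckPointSmoothing * checkpoint_interval_secs);

    /* ... but never write slower than the configured floor */
    double write_rate = Max(deadline_rate, checkpoint_rate);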

Now how would you replace checkpoint_smoothing with a max I/O rate? The
only real alternative way of defining it is directly as a time and/or
segments, similar to checkpoint_timeout and checkpoint_segments, but
then we'd really need two knobs instead of one.

Though maybe we could just hard-code it to 0.8, for example, and tell
people to tune checkpoint_rate if they want more aggressive checkpoints.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I don't think you understand how the settings work. Did you read the
> documentation? If you did, it's apparently not adequate.

I did read the documentation, and I'm not complaining that I don't
understand it.  I'm complaining that I don't like the presented API
because it's self-inconsistent.  You've got two parameters that are in
effect upper and lower bounds for the checkpoint write rate, but they
are named inconsistently and not even measured in the same kind of unit.
Nor do I agree that the inconsistency buys any ease of use.

> The main tuning knob is checkpoint_smoothing, which is defined as a
> fraction of the checkpoint interval (both checkpoint_timeout and
> checkpoint_segments are taken into account). Normally, the write phase
> of a checkpoint takes exactly that much time.

So the question is, why in the heck would anyone want the behavior that
"checkpoints take exactly X time"??  The useful part of this whole patch
is to cap the write rate at something that doesn't interfere too much
with foreground queries.  I don't see why people wouldn't prefer
"checkpoints can take any amount of time up to the checkpoint interval,
but we do our best not to exceed Y writes/second".

Basically I don't see what useful values checkpoint_smoothing would have
other than 0 and 1.  You might as well make it a bool.

> There's another possible strategy: keep the I/O rate constant, but vary
> the length of the checkpoint. checkpoint_rate allows you to do that.

But only from the lower side.

> Now how would you replace checkpoint_smoothing with a max I/O rate?

I don't see why you think that's hard.  It looks to me like the
components of the decision are the same numbers in any case: you have to
estimate your progress towards checkpoint completion, your available
time till next checkpoint, and your write rate.  Then you either delay
or not.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The main tuning knob is checkpoint_smoothing, which is defined as a
>> fraction of the checkpoint interval (both checkpoint_timeout and
>> checkpoint_segments are taken into account). Normally, the write phase
>> of a checkpoint takes exactly that much time
>
> So the question is, why in the heck would anyone want the behavior that
> "checkpoints take exactly X time"??  The useful part of this whole patch
> is to cap the write rate at something that doesn't interfere too much
> with foreground queries.  I don't see why people wouldn't prefer
> "checkpoints can take any amount of time up to the checkpoint interval,
> but we do our best not to exceed Y writes/second".

Because it's easier to tune. You don't need to know how much checkpoint
I/O you can tolerate. The system will use just enough I/O bandwidth to
meet the deadline, but not more than that.

> Basically I don't see what useful values checkpoint_smoothing would have
> other than 0 and 1.  You might as well make it a bool.

Well, that's one option. It feels like a good thing to be able to control
how much headroom you have until the next checkpoint, but maybe we
can just hardcode it close to 1. It's also good to avoid spreading the
checkpoints unnecessarily, to keep recovery times lower, but you can
control that with the min rate setting as well.

>> There's another possible strategy: keep the I/O rate constant, but vary
>> the length of the checkpoint. checkpoint_rate allows you to do that.
>
> But only from the lower side.
>
>> Now how would you replace checkpoint_smoothing with a max I/O rate?
>
> I don't see why you think that's hard.  It looks to me like the
> components of the decision are the same numbers in any case: you have to
> estimate your progress towards checkpoint completion, your available
> time till next checkpoint, and your write rate.  Then you either delay
> or not.

Let me put it this way: if you define both a min and a max I/O rate, when
would the max rate limit take effect? If there are few dirty buffers in
the pool, so that writing at the min rate finishes the checkpoint before
the next one is due, that's the rate you'll use. If there are more, you'll
need to write fast enough to finish the checkpoint in time, regardless of
the max rate. Or would you let the next checkpoint slip and keep writing
at the max rate? That seems like a footgun if someone sets it too low.

Or are you thinking that we have just one setting: checkpoint_rate? You
describe it as a maximum, but I've been thinking of it as a minimum
because you *will* exceed it if the next checkpoint is due soon.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> So the question is, why in the heck would anyone want the behavior that
>> "checkpoints take exactly X time"??

> Because it's easier to tune. You don't need to know how much checkpoint
> I/O you can tolerate. The system will use just enough I/O bandwidth to
> meet the deadline, but not more than that.

Uh, not really.  After consuming more caffeine and reading the patch
more carefully, I think there are several problems here:

1. checkpoint_rate is used thusly:

    writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);

where writes_per_nap is the max number of dirty blocks to write before
taking a bgwriter nap.  Now surely this is completely backward: if
BgWriterDelay is increased, the number of writes to allow per nap
decreases?  If you think checkpoint_rate is expressed in some kind of
physical bytes/sec unit, that cannot be right; the number of blocks
per nap has to increase if the naps get longer.  (BTW, the patch seems
a bit schizoid about whether checkpoint_rate is int or float.)

2. checkpoint_smoothing is used thusly:

    /* scale progress according to CheckPointSmoothing */
    progress *= CheckPointSmoothing;

where the progress value being scaled is the fraction so far completed
of the total number of dirty pages we expect to have to write.  This
is then compared against estimates of the total fraction of the time-
between-checkpoints that has elapsed; if less, we are behind schedule
and should not nap, if more, we are ahead of schedule and may nap.
(This is a bit odd, but I guess it's all right because it's equivalent
to dividing the elapsed-time estimate by CheckPointSmoothing, which
seems a more natural way of thinking about what needs to happen.)
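
Spelled out as a sketch, the decision looks roughly like this (only the
"progress *= CheckPointSmoothing" line is taken from the patch; the other
names are illustrative):

    /* fraction of the expected dirty pages already written */
    double progress = (double) pages_written / pages_to_write;

    /* scale progress according to CheckPointSmoothing */
    progress *= CheckPointSmoothing;

    /*
     * elapsed_fraction: estimated fraction of the checkpoint interval
     * consumed so far, judged by both elapsed time and xlog segments used
     */
    if (progress < elapsed_fraction)
        ;                                   /* behind schedule: keep writing */
    else
        pg_usleep(BgWriterDelay * 1000L);   /* ahead of schedule: nap */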

What's bugging me about this is that we are either going to be writing
at checkpoint_rate if ahead of schedule, or max possible rate if behind;
that's not "smoothing" to me, that's introducing some pretty bursty
behavior.  ISTM that actual "smoothing" would involve adjusting
writes_per_nap up or down according to whether we are ahead or behind
schedule, so as to have a finer degree of control over the I/O rate.
(I'd also consider saving the last writes_per_nap value across
checkpoints so as to have a more nearly accurate starting value next
time.)

In any case I still concur with Takahiro-san that "smoothing" doesn't
seem the most appropriate name for the parameter.  Something along the
lines of "checkpoint_completion_target" would convey more about what it
does, I think.

And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all, because as the code stands, with the default BgWriterDelay you
would have to increase checkpoint_rate to 4x its proposed default before
writes_per_nap moves off its minimum of 1.  This says to me that the
system's tested behavior has been so insensitive to checkpoint_rate
that we probably need not expose such a parameter at all; just hardwire
the minimum writes_per_nap at 1.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> 1. checkpoint_rate is used thusly:
>
>     writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);
>
> where writes_per_nap is the max number of dirty blocks to write before
> taking a bgwriter nap.  Now surely this is completely backward: if
> BgWriterDelay is increased, the number of writes to allow per nap
> decreases?  If you think checkpoint_rate is expressed in some kind of
> physical bytes/sec unit, that cannot be right; the number of blocks
> per nap has to increase if the naps get longer.

Uh, you're right, that's just plain wrong. checkpoint_rate is in
pages/s, so that line should be

writes_per_nap = Min(1, checkpoint_rate * BgWriterDelay / 1000)

> (BTW, the patch seems
> a bit schizoid about whether checkpoint_rate is int or float.)

Yeah, I've gone back and forth on the data type. I wanted it to be a
float, but guc code doesn't let you specify a float in KB, so I switched
it to int.

> 2. checkpoint_smoothing is used thusly:
>
>     /* scale progress according to CheckPointSmoothing */
>     progress *= CheckPointSmoothing;
>
> where the progress value being scaled is the fraction so far completed
> of the total number of dirty pages we expect to have to write.  This
> is then compared against estimates of the total fraction of the time-
> between-checkpoints that has elapsed; if less, we are behind schedule
> and should not nap, if more, we are ahead of schedule and may nap.

> (This is a bit odd, but I guess it's all right because it's equivalent
> to dividing the elapsed-time estimate by CheckPointSmoothing, which
> seems a more natural way of thinking about what needs to happen.)

Yeah, it's a bit unnatural. I did it that way so we don't have to divide
all three of the estimates: cached_elapsed, progress_in_time and
progress_in_xlog. Maybe it's not worth micro-optimizing that.

> What's bugging me about this is that we are either going to be writing
> at checkpoint_rate if ahead of schedule, or max possible rate if behind;
> that's not "smoothing" to me, that's introducing some pretty bursty
> behavior.  ISTM that actual "smoothing" would involve adjusting
> writes_per_nap up or down according to whether we are ahead or behind
> schedule, so as to have a finer degree of control over the I/O rate.
> (I'd also consider saving the last writes_per_nap value across
> checkpoints so as to have a more nearly accurate starting value next
> time.)

That sounds a lot more complex. The burstiness at that level shouldn't
matter much. The OS is still going to cache the writes, and should even
out the bursts.

Assuming time/xlogs elapse at a steady rate, we will write some multiple
of writes_per_nap pages between each sleep. With a small writes_per_nap,
writing just writes_per_nap isn't enough to catch up after we fall
behind, so we'll write more than that between each sleep. That means
that on each iteration, we'll write either N*writes_per_nap, or
(N+1)*writes_per_nap. At worst, that means either writes_per_nap or
2*writes_per_nap pages on each iteration. That's not too bad, I think.
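
A tiny worked example with invented numbers: say staying on schedule calls
for about 30 pages in this iteration, and writes_per_nap is 12:

    int needed = 30, writes_per_nap = 12;

    /* we write in whole chunks of writes_per_nap, so round up */
    int written = ((needed + writes_per_nap - 1) / writes_per_nap)
                  * writes_per_nap;    /* 36 = 3 * writes_per_nap */

The next iteration might need only 2 * writes_per_nap, so consecutive bursts
differ by at most one writes_per_nap chunk.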

> In any case I still concur with Takahiro-san that "smoothing" doesn't
> seem the most appropriate name for the parameter.  Something along the
> lines of "checkpoint_completion_target" would convey more about what it
> does, I think.

Ok, I'm not wedded to smoothing.

> And checkpoint_rate really needs to be named checkpoint_min_rate, if
> it's going to be a minimum.  However, I question whether we need it at
> all, because as the code stands, with the default BgWriterDelay you
> would have to increase checkpoint_rate to 4x its proposed default before
> writes_per_nap moves off its minimum of 1.

Hmm. With bgwriter_delay of 200 ms and checkpoint_min_rate of 512 KB/s,
using the non-broken formula above, we get:

(512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.

So I think that's fine. (Looking at the patch, I see that the default in
guc.c is actually 100 pages/s, contrary to the documentation; that needs
to be fixed.)
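
The same arithmetic as a sketch (the variable names are made up for this
example, and Max() is used here as a floor of one page per nap):

    int checkpoint_min_rate = 512 * 1024;   /* bytes/s, i.e. 512 KB/s */
    int bgwriter_delay = 200;               /* ms */

    int pages_per_sec  = checkpoint_min_rate / BLCKSZ;          /* 64 */
    int writes_per_nap = Max(1, pages_per_sec * bgwriter_delay / 1000);
                                             /* 64 * 200 / 1000 = 12 */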

> This says to me that the
> system's tested behavior has been so insensitive to checkpoint_rate
> that we probably need not expose such a parameter at all; just hardwire
> the minimum writes_per_nap at 1.

I've set checkpoint_rate to a small value in my tests on purpose to
control the feature with the other parameter. That must be why I haven't
noticed the bogus calculation of it before.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Bruce Momjian
Date:
Tom Lane wrote:
> And checkpoint_rate really needs to be named checkpoint_min_rate, if
> it's going to be a minimum.  However, I question whether we need it at
> all, because as the code stands, with the default BgWriterDelay you
> would have to increase checkpoint_rate to 4x its proposed default before
> writes_per_nap moves off its minimum of 1.  This says to me that the
> system's tested behavior has been so insensitive to checkpoint_rate
> that we probably need not expose such a parameter at all; just hardwire
> the minimum writes_per_nap at 1.

I agree on starting with as simple a GUC as possible and expanding it as
there is need in the field.  99% of people are never going to test this
value as much as you are.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> (BTW, the patch seems
>> a bit schizoid about whether checkpoint_rate is int or float.)

> Yeah, I've gone back and forth on the data type. I wanted it to be a
> float, but guc code doesn't let you specify a float in KB, so I switched
> it to int.

I seriously question trying to claim that it's blocks at all, seeing
that the *actual* units are pages per unit time.  Pretending that it's
a memory unit does more to obscure the truth than document it.

>> What's bugging me about this is that we are either going to be writing
>> at checkpoint_rate if ahead of schedule, or max possible rate if behind;
>> that's not "smoothing" to me, that's introducing some pretty bursty
>> behavior.

> That sounds a lot more complex. The burstiness at that level shouldn't
> matter much. The OS is still going to cache the writes, and should even
> out the bursts.

With the changes you're proposing here, the burstiness would be quite
severe.  OTOH, if writes_per_nap is always 1, then bufmgr is going to
recheck the delay situation after every page, so what you have actually
tested is as granular as it could get.

>> And checkpoint_rate really needs to be named checkpoint_min_rate, if
>> it's going to be a minimum.  However, I question whether we need it at
>> all,

> Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s,
>   using the non-broken formula above, we get:

> (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.

> So I think that's fine.

"Fine?"  That's 12x the value you have actually tested.  That's enough
of a change to invalidate all your performance testing IMHO.  I still
think you've not demonstrated a need to expose this parameter.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> (BTW, the patch seems
>>> a bit schizoid about whether checkpoint_rate is int or float.)
>
>> Yeah, I've gone back and forth on the data type. I wanted it to be a
>> float, but guc code doesn't let you specify a float in KB, so I switched
>> it to int.
>
> I seriously question trying to claim that it's blocks at all, seeing
> that the *actual* units are pages per unit time.  Pretending that it's
> a memory unit does more to obscure the truth than document it.

Hmm. What I wanted to avoid is that the I/O rate you get then depends on
your bgwriter_delay, so if you change that you need to change
checkpoint_min_rate as well.

Now we already have that issue with bgwriter_all/lru_maxpages, and I
don't like it there either. If you think it's better to let the user
define it directly as pages/bgwriter_delay, fine.

>>> And checkpoint_rate really needs to be named checkpoint_min_rate, if
>>> it's going to be a minimum.  However, I question whether we need it at
>>> all,
>
>> Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s,
>>   using the non-broken formula above, we get:
>
>> (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.
>
>> So I think that's fine.
>
> "Fine?"  That's 12x the value you have actually tested.  That's enough
> of a change to invalidate all your performance testing IMHO.

I'll reschedule the tests to be sure, after we settle on how we want to
control this feature.

> I still think you've not demonstrated a need to expose this parameter.

Greg Smith wanted to explicitly control the I/O rate, and let the
checkpoint duration vary. I personally think that fixing the checkpoint
duration is better because it's easier to tune.

But if we only do that, you might end up with ridiculously long
checkpoints when there aren't many dirty pages. If we want to avoid that,
we need some way of telling what's a safe minimum rate to write at,
because that can vary greatly depending on your hardware.

But maybe we don't care about prolonging checkpoints, and don't really
need any GUCs at all. We could then just hardcode writes_per_nap to some
low value, and target duration close to 1.0. You would have a checkpoint
running practically all the time, and you would use
checkpoint_timeout/checkpoint_segments to control how long it takes. I'm
a bit worried about jumping to such a completely different regime,
though. For example, pg_start_backup needs to create a new checkpoint,
so it would need to wait on average 1.5 * checkpoint_timeout/segments,
and recovery would need to process on average 1.5 times as much WAL as before.
Though with LDC, you should get away with shorter checkpoint intervals
than before, because the checkpoints aren't as invasive.

If we do that, we should remove bgwriter_all_* settings. They wouldn't
do much because we would have checkpoint running all the time, writing
out dirty pages.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I still think you've not demonstrated a need to expose this parameter.

> Greg Smith wanted to explicitly control the I/O rate, and let the
> checkpoint duration vary. I personally think that fixing the checkpoint
> duration is better because it's easier to tune.

> But if we only do that, you might end up with ridiculously long
> checkpoints when there aren't many dirty pages. If we want to avoid that,
> we need some way of telling what's a safe minimum rate to write at,
> because that can vary greatly depending on your hardware.

As long as the minimum rate is at least 1/bgdelay, I don't think this is
a big problem.

> But maybe we don't care about prolonging checkpoints, and don't really
> need any GUCs at all. We could then just hardcode writes_per_nap to some
> low value, and target duration close to 1.0. You would have a checkpoint
> running practically all the time, and you would use
> checkpoint_timeout/checkpoint_segments to control how long it takes. I'm
> a bit worried about jumping to such a completely different regime,
> though. For example, pg_start_backup needs to create a new checkpoint,
> so it would need to wait on average 1.5 * checkpoint_timeout/segments,

Maybe I misread the patch, but I thought that if someone requested an
immediate checkpoint, the checkpoint-in-progress would effectively flip
to immediate mode.  So that could be handled by offering an immediate vs
extended checkpoint option in pg_start_backup.  I'm not sure it's a
problem though, since as previously noted you probably want
pg_start_backup to be noninvasive.  Also, one could do a manual
CHECKPOINT command then immediately pg_start_backup if one wanted
as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?)

> and recovery would need to process on average 1.5 times as much WAL as before.
> Though with LDC, you should get away with shorter checkpoint intervals
> than before, because the checkpoints aren't as invasive.

No, you still want a pretty long checkpoint interval, because of the
increase in WAL traffic due to more page images being dumped when the
interval is short.

> If we do that, we should remove bgwriter_all_* settings. They wouldn't
> do much because we would have checkpoint running all the time, writing
> out dirty pages.

Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Maybe I misread the patch, but I thought that if someone requested an
> immediate checkpoint, the checkpoint-in-progress would effectively flip
> to immediate mode.  So that could be handled by offering an immediate vs
> extended checkpoint option in pg_start_backup.  I'm not sure it's a
> problem though, since as previously noted you probably want
> pg_start_backup to be noninvasive.  Also, one could do a manual
> CHECKPOINT command then immediately pg_start_backup if one wanted
> as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?)

Yeah, that's possible.

>> and recovery would need to process on average 1.5 times as much WAL as before.
>> Though with LDC, you should get away with shorter checkpoint intervals
>> than before, because the checkpoints aren't as invasive.
>
> No, you still want a pretty long checkpoint interval, because of the
> increase in WAL traffic due to more page images being dumped when the
> interval is short.
>
>> If we do that, we should remove bgwriter_all_* settings. They wouldn't
>> do much because we would have checkpoint running all the time, writing
>> out dirty pages.
>
> Yeah, I'm not sure that we've thought through the interactions with the
> existing bgwriter behavior.

I searched the archives a bit for the discussions when the current
bgwriter settings were born, and found this thread:

http://archives.postgresql.org/pgsql-hackers/2004-12/msg00784.php

The idea of Load Distributed Checkpoints certainly isn't new :).

Ok, if we approach this from the idea that there will be *no* GUC
variables at all to control this, and we remove the bgwriter_all_*
settings as well, does anyone see a reason why that would be bad? Here's
the ones mentioned this far:

1. we need to keep 2x as much WAL segments around as before.

2. pg_start_backup will need to wait for a long time.

3. Recovery will take longer, because the last committed redo
ptr will lag behind more.

1. and 3. can be alleviated by using a smaller
checkpoint_timeout/segments, though as you pointed out that leads to
higher WAL traffic. 2. is not a big deal, and we can add an 'immediate'
parameter to pg_start_backup if necessary.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Fri, 22 Jun 2007, Tom Lane wrote:

> Yeah, I'm not sure that we've thought through the interactions with the
> existing bgwriter behavior.

The entire background writer mess needs a rewrite, and the best way to
handle that is going to shift considerably with LDC applied.

As the person who was complaining about corner cases I'm not in a position
to talk more explicitly about, I can at least summarize my opinion of how
I feel everyone should be thinking about this patch and you can take what
you want from that.  In the context of getting an 8.3 release finalized, I
think you should be building a LDC implementation that accomplishes the
main goal of spreading the checkpoints out, which is clearly working, and
erring on the side of more knobs in cases where it's unclear whether they're
needed or not.  It's clear what my position is on which non-obvious knobs I
think are important for some people.

Which is worse:  putting in a tuning setting that it's discovered
everybody just uses the same value for, or discovering after release that
there's a use-case for that setting and by hard-coding it you've made the
DBAs for a class of applications you didn't think about miserable?  In the
cases where there's good evidence so far of the right setting, just make
that the default, and the only harm is GUC bloat.

Nothing should be done that changes the existing behavior if the LDC
feature is turned off, so anything more obtrusive to the background writer
is right out.  Make reducing the knobs, optimizing the default behavior,
and rewriting the background writer to better fit into its new context a
major goal of 8.4.  I know I've got a whole notebook full of stuff on that
topic I've been ignoring so as not to distract you guys from getting 8.3
done.

That's the low risk plan, and the design/beta/release period here is short
enough that I think going too experimental beyond that is a bad idea.  To
pick an example, when I read this idea from Heikki:

> You would have a checkpoint running practically all the time, and you
> would use checkpoint_timeout/checkpoint_segments to control how long it
> takes... If we do that, we should remove bgwriter_all_* settings

Whether or not I think this is an awesome idea, the very idea of a change
that big at this point gives me the willies.  Just off the top of my head,
there's a whole class of issues involving recycling xlog segments that this
would introduce, and I would be really unhappy with the implications.  Did
anyone else ever notice that when a new xlog segment is created, the write
to clear it out doesn't happen via direct I/O like the rest of the xlog
writes do?  That write goes through the regular buffered path instead.
The checkpoint logging patch I submitted logs when this happens
specifically because that particular issue messed with some operations and
I found it important to be aware of.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Ok, if we approach this from the idea that there will be *no* GUC
> variables at all to control this, and we remove the bgwriter_all_*
> settings as well, does anyone see a reason why that would be bad? Here's
> the ones mentioned this far:

> 1. we need to keep 2x as much WAL segments around as before.

No, only 50% more.  We've always kept WAL back to the
checkpoint-before-last, so the peak usage has been about 2 checkpoint
distances (plus a bit), and now it'd tend to be about 3.

> 2. pg_start_backup will need to wait for a long time.

> 3. Recovery will take longer, because the last committed redo
> ptr will lag behind more.

True, you'd have to replay 1.5 checkpoint intervals on average instead
of 0.5 (more or less, assuming checkpoints had been short).  I don't
think we're in the business of optimizing crash recovery time though.

It might be that getting rid of checkpoint_smoothing is an overreaction,
and that we should keep that parameter.  IIRC Greg had worried about
being able to turn this behavior off, so we'd still need at least a
bool, and we might as well expose the fraction instead.  (Still don't
like the name "smoothing" though.)  I agree with removing the non-LRU
part of the bgwriter's write logic though; that should simplify matters
a bit and cut down the overall I/O volume.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> As the person who was complaining about corner cases I'm not in a position
> to talk more explicitly about, I can at least summarize my opinion of how
> I feel everyone should be thinking about this patch and you can take what
> you want from that.

Sorry, but we're not going to make decisions on the basis of evidence
you won't show us.  Right at the moment the best thing to do seems to
be to enable LDC with a low minimum write rate and a high target
duration, and remove the thereby-obsoleted "all buffers" scan of the
existing bgwriter logic.  If you've got specific evidence why any of
these things need to be parameterized, let's see it.  Personally I think
that we have a bad track record of exposing GUC variables as a
substitute for understanding performance issues at the start, and this
approach isn't doing any favors for DBAs.

> Whether or not I think this is an awesome idea, the very idea of a change
> that big at this point gives me the willies.  Just off the top of my head,
> there's a whole class of issues involving recycling xlog segments this
> would introduce I would be really unhappy with the implications of.

Really?  Name one.  AFAICS this should not affect xlog recycling in the
slightest (other than that we have to bump up the target number of live
segments from 2x to 3x the checkpoint distance).

> Did anyone else ever notice that when a new xlog segment is created,
> the write to clear it out doesn't happen via direct I/O like the rest
> of the xlog writes do?

It's not supposed to matter, because that path isn't supposed to be
taken often.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Fri, 22 Jun 2007, Tom Lane wrote:

> Greg had worried about being able to turn this behavior off, so we'd
> still need at least a bool, and we might as well expose the fraction
> instead.  I agree with removing the non-LRU part of the bgwriter's write
> logic though

If you accept that being able to turn LDC off is important, people who
aren't turning it on may still want the existing bgwriter_all_* settings
so they can keep running things the way they are now.  It's certainly
reasonable to skip that code path when doing things the LDC way.

> True, you'd have to replay 1.5 checkpoint intervals on average instead
> of 0.5 (more or less, assuming checkpoints had been short).  I don't
> think we're in the business of optimizing crash recovery time though.

If you're not, I think you should be.  Keeping that replay interval time
down was one of the reasons why the people I was working with were
displeased with the implications of the very spread out style of some LDC
tunings.  They were already unhappy with the implied recovery time of how
high they had to set checkpoint_segments for good performance, and making
it that much bigger aggravates the issue.  Given a knob where the LDC can
be spread out a bit but not across the entire interval, that makes it
easier to control how much expansion there is relative to the current
behavior.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Greg Smith wrote:
>> True, you'd have to replay 1.5 checkpoint intervals on average instead
>> of 0.5 (more or less, assuming checkpoints had been short).  I don't
>> think we're in the business of optimizing crash recovery time though.
>
> If you're not, I think you should be.  Keeping that replay interval time
> down was one of the reasons why the people I was working with were
> displeased with the implications of the very spread out style of some
> LDC tunings.  They were already unhappy with the implied recovery time
> of how high they had to set checkpoint_settings for good performance,
> and making it that much bigger aggrevates the issue.  Given a knob where
> the LDC can be spread out a bit but not across the entire interval, that
> makes it easier to control how much expansion there is relative to the
> current behavior.

I agree on that one: we *should* optimize crash recovery time. It may
not be the most important thing on earth, but it's a significant
consideration for some systems.

However, I think shortening the checkpoint interval is a perfectly valid
solution to that. It does lead to more full page writes, but in 8.3 more
full page writes can actually make the recovery go faster, not slower,
because with we no longer read in the previous contents of the page when
we restore it from a full page image. In any case, while people
sometimes complain that we have a large WAL footprint, it's not usually
a problem.

This is off-topic, but at PGCon in May, Itagaki-san and his colleagues
whose names I can't remember, pointed out to me very clearly that our
recovery is *slow*. So slow, that in the benchmarks they were running,
their warm stand-by slave couldn't keep up with the master generating
the WAL, even though both are running on the same kind of hardware.

The reason is simple: There can be tens of backends doing I/O and
generating WAL, but in recovery we serialize them. If you have decent
I/O hardware that could handle for example 10 concurrent random I/Os, at
recovery we'll be issuing them one at a time. That's a scalability
issue, and doesn't show up on a laptop or a small server with a single disk.

That's one of the first things I'm planning to tackle when the 8.4 dev
cycle opens. And I'm planning to look at recovery times in general; I've
never even measured it before so who knows what comes up.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Greg Smith wrote:
> On Fri, 22 Jun 2007, Tom Lane wrote:
>
>> Greg had worried about being able to turn this behavior off, so we'd
>> still need at least a bool, and we might as well expose the fraction
>> instead.  I agree with removing the non-LRU part of the bgwriter's
>> write logic though
>
> If you accept that being able to turn LDC off is important, people who
> aren't turning it on may still want the existing bgwriter_all_* settings
> so they can keep running things the way they are now.  It's certainly
> reasonable to skip that code path when doing things the LDC way.

Well, if you don't want anything to change, don't upgrade.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
This message is going to come off as kind of angry, and I hope you don't
take that personally.  I'm very frustrated with this whole area right now
but am unable to do anything to improve that situation.

On Fri, 22 Jun 2007, Tom Lane wrote:

> If you've got specific evidence why any of these things need to be
> parameterized, let's see it.

All I'm trying to suggest here is that you might want to pause and
consider whether you want to make a change that might break existing,
happily working installations based just on the small number of tests that
have been done on this patch so far.  A nice stack of DBT2 results is very
informative, but the DBT2 workload is not everybody's workload.

Did you see anybody else predicting issues with the LDC patch on
overloaded systems as are starting to be seen in the 150 warehouse/90%
latency figures in Heikki's most recent results?  The way I remember that,
it was just me pushing to expose that problem, because I knew it was there
from my unfortunately private tests, but it was difficult to encounter the
issue on other types of benchmarks (thanks again to Greg Stark and Heikki
for helping with that).  But that's fine, if you want to blow off the rest
of my suggestions now just because the other things I'm worried about are
also very hard problems to expose and I can't hand you a smoking gun,
that's your decision.

> Personally I think that we have a bad track record of exposing GUC
> variables as a substitute for understanding performance issues at the
> start, and this approach isn't doing any favors for DBAs.

I think this project has an awful track record of introducing new GUC
variables and never having a plan to follow through with a process to
figure out how they should be set.  The almost complete lack of
standardization and useful tools for collecting performance information
about this database boggles my mind, and you're never going to get the
performance related sections of the GUC streamlined without it.

We were just talking about the mess that is effective_cache_size recently.
As a more topical example here, the background writer was officially
released in early 2005, with a bizarre collection of tunables.  I had to
help hack on that code myself, over two years later, to even start
exposing the internal statistics data needed to optimize it correctly.
The main reason I can't prove some of my concerns is that I got so
side-tracked adding the infrastructure needed to even show they exist that
I wasn't able to nail down exactly what was going on well enough to
generate a public test case before the project that exposed the issues
wrapped up.

> Right at the moment the best thing to do seems to be to enable LDC with
> a low minimum write rate and a high target duration, and remove the
> thereby-obsoleted "all buffers" scan of the existing bgwriter logic.

I have reason to believe there's a set of use cases where a more
accelerated LDC approach than everyone seems to be leaning toward is
appropriate, which would then reinvigorate the need for the all-scan BGW
component.  I have a whole new design for the non-LRU background writer
that fixes most of what's wrong with it that I'm waiting for 8.4 to pass out
and get feedback on, but if everybody is hell-bent on just yanking the
whole thing out in preference to these really lazy checkpoints, go ahead
and do what you want.  My life would be easier if I just tossed all that
out and forgot about the whole thing, and I'm real close to doing just
that right now.

>> Did anyone else ever notice that when a new xlog segment is created,
>> the write to clear it out doesn't happen via direct I/O like the rest
>> of the xlog writes do?
> It's not supposed to matter, because that path isn't supposed to be
> taken often.

Yes, but during the situation it does happen in--when checkpoints take so
much longer than expected that more segments have to be created, or in an
archive logger failure--it badly impacts an already unpleasant situation.

>> there's a whole class of issues involving recycling xlog segments this
>> would introduce I would be really unhappy with the implications of.
> Really?  Name one.

You already mentioned expansion of the log segments used, which is a
primary issue.  Acting like all the additional segments used for some of
the more extreme checkpoint spreading approaches are without cost is
completely unrealistic IMHO.  In the situation I just described above, I
also noticed the way O_DIRECT sync writes get mixed with buffered WAL
writes seems to cause some weird I/O scheduling issues in Linux that can
make worst-case latency degrade.  But since I can't prove that, I guess I
might as well not even mention that either.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
"Simon Riggs"
Date:
On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:

> However, I think shortening the checkpoint interval is a perfectly valid
> solution to that.

Agreed. That's what checkpoint_timeout is for. Greg can't choose to use
checkpoint_segments as the limit and then complain about unbounded
recovery time, because that was clearly a conscious choice.

> In any case, while people
> sometimes complain that we have a large WAL footprint, it's not usually
> a problem.

IMHO its a huge problem. Turning on full_page_writes means that the
amount of WAL generated varies fairly linearly with number of blocks
touched, which means large databases become a problem.

Suzuki-san's team had results that showed this was a problem also.

> This is off-topic, but at PGCon in May, Itagaki-san and his colleagues
> whose names I can't remember, pointed out to me very clearly that our
> recovery is *slow*. So slow, that in the benchmarks they were running,
> their warm stand-by slave couldn't keep up with the master generating
> the WAL, even though both are running on the same kind of hardware.

> The reason is simple: There can be tens of backends doing I/O and
> generating WAL, but in recovery we serialize them. If you have decent
> I/O hardware that could handle for example 10 concurrent random I/Os, at
> recovery we'll be issuing them one at a time. That's a scalability
> issue, and doesn't show up on a laptop or a small server with a single disk.

The results showed that the current recovery isn't scalable beyond a
certain point, not that it was slow per se. The effect isn't noticeable
on systems generating fewer writes (of any size) or ones where the cache
hit ratio is high. It isn't accurate to say laptops and small servers
only, but it would be accurate to say that high-volume, I/O-bound OLTP is
the place where the scalability of recovery does not match the
performance scalability of the master server.

Yes, we need to make recovery more scalable by de-serializing I/O.

Slony also serializes changes onto the Slave nodes, so the general
problem of scalability of recovery solutions needs to be tackled.

> That's one of the first things I'm planning to tackle when the 8.4 dev
> cycle opens. And I'm planning to look at recovery times in general; I've
> never even measured it before so who knows what comes up.

I'm planning on working on recovery also, as is Florian. Let's make sure
we coordinate what we do to avoid patch conflicts.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:
>
>> However, I think shortening the checkpoint interval is a perfectly valid
>> solution to that.
>
> Agreed. That's what checkpoint_timeout is for. Greg can't choose to use
> checkpoint_segments as the limit and then complain about unbounded
> recovery time, because that was clearly a conscious choice.

No, by checkpoint interval, I meant both checkpoint_timeout and
checkpoint_segments. You can decrease checkpoint_segments as well to
shorten recovery time.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
"Simon Riggs"
Date:
On Fri, 2007-06-22 at 16:21 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:

> > 3. Recovery will take longer, because the distance last committed redo
> > ptr will lag behind more.
>
> True, you'd have to replay 1.5 checkpoint intervals on average instead
> of 0.5 (more or less, assuming checkpoints had been short).  I don't
> think we're in the business of optimizing crash recovery time though.

Agreed, though only for crash recovery. Most archive recoveries will not
change noticeably - half a checkpoint isn't very long in PITR terms.

> It might be that getting rid of checkpoint_smoothing is an overreaction,
> and that we should keep that parameter.  IIRC Greg had worried about
> being able to turn this behavior off, so we'd still need at least a
> bool, and we might as well expose the fraction instead.  (Still don't
> like the name "smoothing" though.)

ISTM we don't need the checkpoint_smoothing parameter.

I can't see why anyone would want to turn off smoothing: If they are
doing many writes, then they will be affected by the sharp dive at
checkpoint, which happens *every* checkpoint. The increase in crash
recovery time is an easily acceptable trade-off for what is gained; if
not, they can always make checkpoint_timeout smaller.

If they aren't doing many writes they don't care what the setting of
checkpoint_smoothing is and recovery will be very fast anyway.

>I agree with removing the non-LRU
> part of the bgwriter's write logic though; that should simplify matters
> a bit and cut down the overall I/O volume.

Yes, agreed.


--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Load Distributed Checkpoints, take 3

From
"Simon Riggs"
Date:
On Fri, 2007-06-22 at 16:57 -0400, Greg Smith wrote:
> If you're not, I think you should be.  Keeping that replay interval
> time down was one of the reasons why the people I was working with
> were displeased with the implications of the very spread out style of
> some LDC tunings.  They were already unhappy with the implied recovery
> time of how high they had to set checkpoint_segments for good
> performance, and making it that much bigger aggravates the issue.
> Given a knob where the LDC can be spread out a bit but not across the
> entire interval, that makes it easier to control how much expansion
> there is relative to the current behavior.

We won't need to set checkpoint_segments so high, since performance is
smoothed across checkpoints by LDC and it's OK to allow them more
frequently. So this concern need not apply with LDC.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Sun, 24 Jun 2007, Simon Riggs wrote:

> I can't see why anyone would want to turn off smoothing: If they are
> doing many writes, then they will be affected by the sharp dive at
> checkpoint, which happens *every* checkpoint.

There are service-level agreement situations where a short and sharp
disruption is more acceptable than a more prolonged one.  As some of the
overloaded I/O tests are starting to show, the LDC may be a backward step
for someone in that sort of environment.

I am not a fan of introducing a replacement feature based on what I
consider too limited testing, and I don't feel this one has been beaten
on long enough yet to start pruning features that would allow better
backward compatibility/transitioning.  I think that's introducing an
unnecessary risk to the design.

> We won't need to set checkpoint_segments so high, since performance is
> smoothed across checkpoints by LDC and it's OK to allow them more
> frequently. So this concern need not apply with LDC.

Performance *should* be smoothed across checkpoints by LDC and my
concern *may* not apply.  I think assuming it will always help based on
the limited number of test results presented so far is extrapolation.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> I am not a fan of introducing a replacement feature based on what I
> consider too limited testing, and I don't feel this one has been beaten
> on long enough yet to start pruning features that would allow better
> backward compatibility/transitioning.  I think that's introducing an
> unnecessary risk to the design.

While it's certainly true that we could do with wider testing of this
patch, I'm not sure why you hold such strong allegiance to the status
quo.  We know that the status quo isn't working very well.  And if you
think that the current code had enormous amounts of testing before it
went in, I've got to disillusion you :-(

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Sun, 24 Jun 2007, Simon Riggs wrote:

> Greg can't choose to use checkpoint_segments as the limit and then
> complain about unbounded recovery time, because that was clearly a
> conscious choice.

I'm complaining only because everyone seems content to wander in a
direction where the multiplier on checkpoint_segments for how many
segments are actually active at once will go up considerably, which can
make a known problem (recovery time) worse.  I'm thinking about things
like how the release notes for 8.3 are going to address this.  Is the plan
to say something like "whatever you set checkpoint_segments to before so
you were happy with the crash recovery time, make sure to re-run all those
tests again because we made a big change there, it may take a lot longer
now with the same value, and too bad if you don't like it because there's
no way to get the old behavior back.  Suck it up, decrease
checkpoint_segments, and get used to having more checkpoints if you have
to keep the same recovery time".

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Mon, 25 Jun 2007, Tom Lane wrote:

> I'm not sure why you hold such strong allegiance to the status quo.  We
> know that the status quo isn't working very well.

Don't get me wrong here; I am a big fan of this patch, think it's an
important step forward, and it's exactly the fact that I'm so shell
shocked from abuse by the status quo that I'm still mixed up in this mess
(I really should be ignoring the lot of you and writing new code instead).

LDC certainly makes things better in almost every case.  My "allegiance"
comes from having seen a class of transactions where LDC made things worse
on a fast/overloaded system, in that it made some types of service
guarantees harder to meet, and I just don't know who else might run into
problems in that area.  I'm worried that if it's not adjustable, you're
introducing a risk that you'll take a step backward for some of this
code's users, and that will be hard to undo given the way releases are
structured here.

I spent some time trading stocks for a living.  There are sometimes
situations you can get into there where there is a tiny chance that
something very bad can happen with a trade, and many people get wiped out
by such things.  If it's possible in that situation to remove that risk
with something inexpensive, you do it, even though the net expected value
of the change might be slightly negative.  This seems like such a
situation to me.  If it's possible to take away the risk of other people
running into an unforeseen problem with the LDC patch just by keeping a
knob that's already there, then unless that's an expensive operation my
opinion is that you should pick a good default but not remove it yet.

> And if you think that the current code had enormous amounts of testing
> before it went in, I've got to disillusion you :-(

It's having been on the painful receiving end of that fact that makes me
so paranoid now :)

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Greg Smith wrote:
> LDC certainly makes things better in almost every case.  My "allegiance"
> comes from having seen a class of transactions where LDC made things
> worse on a fast/overloaded system, in that it made some types of service
> guarantees harder to meet, and I just don't know who else might run into
> problems in that area.

Please describe the class of transactions and the service guarantees so
that we can reproduce that, and figure out what's the best solution.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
"Simon Riggs"
Date:
On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote:
> On Sun, 24 Jun 2007, Simon Riggs wrote:
>
> > Greg can't choose to use checkpoint_segments as the limit and then
> > complain about unbounded recovery time, because that was clearly a
> > conscious choice.
>
> I'm complaining

I apologise for using that emotive phrase.

> only because everyone seems content to wander in a
> direction where the multiplier on checkpoint_segments for how many
> segments are actually active at once will go up considerably, which can
> make a known problem (recovery time) worse.

+50% more. Recovery time is a consideration that can be adjusted for. We
have done nothing to make recovery rate worse; the additional WAL leads
to an increased recovery time *only* if you keep the same parameter
settings. There is no reason to keep them the same, nor do we promise
that parameters will keep the exact meaning they had in previous releases.

As you say, we can put comments in the release notes to advise people of
a 50% increase in recovery time if the parameters stay the same. That
would be balanced by the comment that checkpoints are now considerably
smoother than before and more frequent checkpoints are unlikely to be a
problem, so it is OK to reduce the parameters from the settings you used
in previous releases.

I don't see any reason why we would want unsmooth checkpoints; similarly
we don't want unsmooth bg writing, unsmooth archiving, etc. Performance
will increase, response times will drop and people will be happy.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Load Distributed Checkpoints, take 3

From
Magnus Hagander
Date:
On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote:
> On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote:
> > On Sun, 24 Jun 2007, Simon Riggs wrote:
> >
> > > Greg can't choose to use checkpoint_segments as the limit and then
> > > complain about unbounded recovery time, because that was clearly a
> > > conscious choice.
> >
> > I'm complaining
>
> I apologise for using that emotive phrase.
>
> > only because everyone seems content to wander in a
> > direction where the multiplier on checkpoint_segments for how many
> > segments are actually active at once will go up considerably, which can
> > make a known problem (recovery time) worse.
>
> +50% more. Recovery time is a consideration that can be adjusted for. We
> have done nothing to make recovery rate worse; the additional WAL leads
> to an increased recovery time *only* if you keep the same parameter
> settings. There is no reason to keep them the same, nor do we promise
> that parameters will keep the exact meaning they had in previous releases.
>
> As you say, we can put comments in the release notes to advise people of
> a 50% increase in recovery time if the parameters stay the same. That
> would be balanced by the comment that checkpoints are now considerably
> smoother than before and more frequent checkpoints are unlikely to be a
> problem, so it is OK to reduce the parameters from the settings you used
> in previous releases.

Didn't we already add other features that make recovery much *faster* than
before? In that case, are they fast enough to neutralise this increased
time (a guesstimate, of course)?

Or did I mess that up with stuff we added for 8.2? :-)

//Magnus


Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Magnus Hagander wrote:
> On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote:
>> As you say, we can put comments in the release notes to advise people of
>> a 50% increase in recovery time if the parameters stay the same. That
>> would be balanced by the comment that checkpoints are now considerably
>> smoother than before and more frequent checkpoints are unlikely to be a
>> problem, so it is OK to reduce the parameters from the settings you used
>> in previous releases.
>
> Didn't we already add other features that make recovery much *faster* than
> before?

Yes, we no longer read in a page just to overwrite it with a full page
image.

> In that case, are they fast enough to neutralise this increased
> time (a guesstimate, of course)?

It depends a lot on your workload. Under the right circumstances, it
will more than offset any increase caused by LDC, but sometimes it won't
make any difference at all (for example with full_page_writes=off).

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
"Simon Riggs"
Date:
On Mon, 2007-06-25 at 12:56 +0200, Magnus Hagander wrote:

> Didn't we already add other features that make recovery much *faster* than
> before? In that case, are they fast enough to neutralise this increased
> time (a guesstimate, of course)?
>
> Or did I mess that up with stuff we added for 8.2? :-)

Only with full_page_writes on, but yes you're right, the predicted
recovery times are changing for 8.3 anyway.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> I agree with removing the non-LRU
> part of the bgwriter's write logic though; that should simplify matters
> a bit and cut down the overall I/O volume.

On further thought, there is one workload where removing the non-LRU
part would be counterproductive:

If you have a system with a very bursty transaction rate, it's possible
that when it's time for a checkpoint, there hasn't been any WAL logged
activity since last checkpoint, so we skip it. When that happens, the
buffer cache might still be full of dirty pages, because of hint bit
updates. That still isn't a problem on its own, but if you then do a
huge batch update, you have to flush those dirty pages at that point. It
would be better to trickle flush those dirty pages during the idle period.

So we might still want to keep the non-LRU scan. Or alternatively, when
the checkpoint is a no-op, we call BufferSync nevertheless. That's
effectively the same thing, except that BufferSync would be controlled
by the logic to estimate the deadline until next checkpoint, instead of
the current bgwriter_all_*-settings.

Greg, is this the kind of workload you're having, or is there some other
scenario you're worried about?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> On further thought, there is one workload where removing the non-LRU
> part would be counterproductive:

> If you have a system with a very bursty transaction rate, it's possible
> that when it's time for a checkpoint, there hasn't been any WAL logged
> activity since last checkpoint, so we skip it. When that happens, the
> buffer cache might still be full of dirty pages, because of hint bit
> updates. That still isn't a problem on its own, but if you then do a
> huge batch update, you have to flush those dirty pages at that point. It
> would be better to trickle flush those dirty pages during the idle period.

But wouldn't the LRU-based scan accomplish that?

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> On further thought, there is one workload where removing the non-LRU
>> part would be counterproductive:
>
>> If you have a system with a very bursty transaction rate, it's possible
>> that when it's time for a checkpoint, there hasn't been any WAL logged
>> activity since last checkpoint, so we skip it. When that happens, the
>> buffer cache might still be full of dirty pages, because of hint bit
>> updates. That still isn't a problem on its own, but if you then do a
>> huge batch update, you have to flush those dirty pages at that point. It
>> would be better to trickle flush those dirty pages during the idle period.
>
> But wouldn't the LRU-based scan accomplish that?

It only scans bgwriter_lru_percent buffers ahead of the clock hand. If
the hand isn't moving, it keeps scanning the same buffers over and over
again. You can crank it all the way up to 100%, though, in which case it
would work, but that starts to get expensive CPU-wise.
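
To put a number on how small that window is, here's a trivial standalone
illustration (the parameter values are just examples, not a claim about
any particular configuration, and this is not the bufmgr code):

#include <stdio.h>

int
main(void)
{
    /* Illustration only: with, say, shared_buffers = 8192 pages and
     * bgwriter_lru_percent = 1.0, each bgwriter round inspects only
     * about 82 buffers ahead of the clock hand -- and if the hand
     * never moves, it is the same handful of buffers every round. */
    int    shared_buffers = 8192;
    double bgwriter_lru_percent = 1.0;

    int window = (int) (shared_buffers * bgwriter_lru_percent / 100.0);

    printf("buffers examined per round: %d\n", window);
    return 0;
}

Only cranking the percentage way up widens that window, which is exactly
where the CPU cost comes from.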

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> If you have a system with a very bursty transaction rate, it's possible
>>> that when it's time for a checkpoint, there hasn't been any WAL logged
>>> activity since last checkpoint, so we skip it. When that happens, the
>>> buffer cache might still be full of dirty pages, because of hint bit
>>> updates. That still isn't a problem on its own, but if you then do a
>>> huge batch update, you have to flush those dirty pages at that point. It
>>> would be better to trickle flush those dirty pages during the idle period.
>>
>> But wouldn't the LRU-based scan accomplish that?

> It only scans bgwriter_lru_percent buffers ahead of the clock hand. If
> the hand isn't moving, it keeps scanning the same buffers over and over
> again. You can crank it all the way up to 100%, though, in which case it
> would work, but that starts to get expensive CPU-wise.

Hmm.  But if we're going to do that, we might as well have a checkpoint
for our troubles, no?  The reason for the current design is the
assumption that a bgwriter_all scan is less burdensome than a
checkpoint, but that is no longer true given this rewrite.  AFAICS all
the bgwriter_all scan will accomplish is induce extra I/O in most
scenarios.  And it's certainly extra code in an area that's already been
rendered pretty darn baroque by this patch.

Also, the design assumption here is that the bgwriter_lru scan is
supposed to keep enough buffers clean to satisfy the system's need for
new buffers.  If it's not able to do so, it's not apparent to me that
the bgwriter_all scan helps --- the latter is most likely flushing the
wrong buffers, ie, those not close in front of the clock sweep.  You'd
be well advised to increase lru_percent anyway to keep more buffers
clean, if this scenario is worth optimizing rather than others for your
usage.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> Greg, is this the kind of workload you're having, or is there some other
> scenario you're worried about?

The way transitions between completely idle and all-out bursts happen was
one problematic area I struggled with.  Since the LRU point doesn't move
during the idle parts, and the lingering buffers have a usage_count>0, the
LRU scan won't touch them; the only way to clear out a bunch of dirty
buffers leftover from the last burst is with the all-scan.  Ideally, you
want those to write during idle periods so you're completely clean when
the next burst comes.  My plan for the code I wanted to put into 8.4 one
day was to have something like the current all-scan that defers to the LRU
and checkpoint, such that if neither of them is doing anything it would
go searching for buffers it might blow out.  Because the all-scan mainly
gets in the way under heavy load right now I've only found mild settings
helpful, but if it had a bit more information about what else was going on
it could run much harder during slow spots.  That's sort of the next stage
to the auto-tuning LRU writer code in the grand design floating through my
head.

As a general comment on this subject, a lot of the work in LDC presumes
you have an accurate notion of how close the next checkpoint is.  On
systems that can dirty buffers and write WAL really fast, I've found hyper
bursty workloads are a challenge for it to cope with.  You can go from
thinking you have all sorts of time to stream the data out to discovering
the next checkpoint is coming up fast in only seconds.  In that situation,
you'd have been better off had you been writing faster during the period
preceding the burst, when the code thought it should be "smooth"[1].
That falls into the category of things I haven't found a good way for
other people to test (I happened to have an internal bursty app that
aggravated this area to use).

[1] This is actually a reference to "Yacht Rock", one of my favorite web
sites:  http://www.channel101.com/shows/show.php?show_id=152

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> It only scans bgwriter_lru_percent buffers ahead of the clock hand. If the
> hand isn't moving, it keeps scanning the same buffers over and over again.
> You can crank it all the way up to 100%, though, in which case it would work,
> but that starts to get expensive CPU-wise.

In addition to being a CPU pig, that still won't necessarily work because
of the way the LRU writer ignores things with a non-zero usage_count.  If
it's dirty, it's probably been used recently as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> Please describe the class of transactions and the service guarantees so
> that we can reproduce that, and figure out what's the best solution.

I'm confident you're already moving in that direction by noticing how the
90th percentile numbers were kind of weird with your 150 warehouse DBT2
tests, and I already mentioned how that could be usefully fleshed out by
more tests during beta.  That number is the kind of service guarantee I'm
talking about--if 90% of transactions used to complete in <4.5ms but that
number is now closer to 6ms, that could be considered worse performance by
some service metrics even if the average and worst-case performance were
improved.

The only thing I can think of if you wanted to make the problem more like
what I was seeing would be switching the transaction mix on that around to
do more UPDATEs relative to the other types of transactions; having more
of those seemed to aggravate my LDC-related issues because they leave a
different pattern of dirty+used buffers around than other operations.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> The way transitions between completely idle and all-out bursts happen was
> one problematic area I struggled with.  Since the LRU point doesn't move
> during the idle parts, and the lingering buffers have a usage_count>0, the
> LRU scan won't touch them; the only way to clear out a bunch of dirty
> buffers leftover from the last burst is with the all-scan.

One thing that might be worth changing is that right now, BgBufferSync
starts over from the current clock-sweep point on each call --- that is,
each bgwriter cycle.  So it can't really be made to write very many
buffers without excessive CPU work.  Maybe we should redefine it to have
some static state carried across bgwriter cycles, such that it would
write at most N dirty buffers per call, but scan through X percent of
the buffers, possibly across several calls, before returning to the (by
now probably advanced) clock-sweep point.  This would allow a larger
value of X to be used than is currently practical.  You might wish to
recheck the clock sweep point on each iteration just to make sure the
scan hasn't fallen behind it, but otherwise I don't see any downside.
The scenario where somebody re-dirties a buffer that was cleaned by the
bgwriter scan isn't a problem, because that buffer will also have had its
usage_count increased and thereby not be a candidate for replacement.
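
For what it's worth, a rough standalone sketch of that idea (a toy buffer
pool and invented names only; this is not BgBufferSync itself) might look
like:

#include <stdbool.h>
#include <stdio.h>

/* Toy buffer pool, for illustration only -- not PostgreSQL's structures. */
#define NBUFFERS 128

typedef struct
{
    bool dirty;
    int  usage_count;
} ToyBuffer;

static ToyBuffer pool[NBUFFERS];
static int clock_hand = 0;      /* advanced elsewhere by the clock sweep */

/*
 * A bgwriter-style scan that remembers its position across calls.  Each
 * call writes at most max_writes dirty buffers and examines at most
 * scan_quota buffers, so covering X percent of the pool may take several
 * calls.  (A real version would also restart from the sweep point if the
 * clock hand had overtaken scan_pos since the last call.)
 */
static void
scan_with_carried_state(int max_writes, int scan_quota)
{
    static int scan_pos = -1;   /* carried across calls */
    int        written = 0;

    if (scan_pos < 0)
        scan_pos = clock_hand;  /* first call: start at the sweep point */

    while (scan_quota-- > 0 && written < max_writes)
    {
        ToyBuffer *buf = &pool[scan_pos];

        if (buf->dirty)
        {
            buf->dirty = false; /* stands in for an actual buffer write */
            written++;
        }
        scan_pos = (scan_pos + 1) % NBUFFERS;
    }
    printf("wrote %d buffers, scan position now %d (hand at %d)\n",
           written, scan_pos, clock_hand);
}

int
main(void)
{
    /* Dirty a stretch of buffers and run two cycles to show that the
     * second call continues where the first one stopped. */
    for (int i = 10; i < 40; i++)
        pool[i].dirty = true;

    scan_with_carried_state(4, 16);
    scan_with_carried_state(4, 16);
    return 0;
}

The point is just that the write quota per call and the fraction of the
pool covered become independent knobs once the scan position survives
across cycles.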

> As a general comment on this subject, a lot of the work in LDC presumes
> you have an accurate notion of how close the next checkpoint is.

Yeah; this is one reason I was interested in carrying some write-speed
state across checkpoints instead of having the calculation start from
scratch each time.  That wouldn't help systems that sit idle a long time
and suddenly go nuts, but it seems to me that smoothing the write rate
across more than one checkpoint could help if the fluctuations occur
over a timescale of a few checkpoints.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Hmm.  But if we're going to do that, we might as well have a checkpoint
> for our troubles, no?  The reason for the current design is the
> assumption that a bgwriter_all scan is less burdensome than a
> checkpoint, but that is no longer true given this rewrite.

Per comments in CreateCheckPoint, another reason is that we also skip the
extra checkpoints to avoid writing two checkpoint records to the same
page, risking losing both on a crash:

>      * If this isn't a shutdown or forced checkpoint, and we have not inserted
>      * any XLOG records since the start of the last checkpoint, skip the
>      * checkpoint.    The idea here is to avoid inserting duplicate checkpoints
>      * when the system is idle. That wastes log space, and more importantly it
>      * exposes us to possible loss of both current and previous checkpoint
>      * records if the machine crashes just as we're writing the update.
>      * (Perhaps it'd make even more sense to checkpoint only when the previous
>      * checkpoint record is in a different xlog page?)

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
>> The way transitions between completely idle and all-out bursts happen was
>> one problematic area I struggled with.  Since the LRU point doesn't move
>> during the idle parts, and the lingering buffers have a usage_count>0, the
>> LRU scan won't touch them; the only way to clear out a bunch of dirty
>> buffers leftover from the last burst is with the all-scan.
>
> One thing that might be worth changing is that right now, BgBufferSync
> starts over from the current clock-sweep point on each call --- that is,
> each bgwriter cycle.  So it can't really be made to write very many
> buffers without excessive CPU work.  Maybe we should redefine it to have
> some static state carried across bgwriter cycles, such that it would
> write at most N dirty buffers per call, but scan through X percent of
> the buffers, possibly across several calls, before returning to the (by
> now probably advanced) clock-sweep point.  This would allow a larger
> value of X to be used than is currently practical.  You might wish to
> recheck the clock sweep point on each iteration just to make sure the
> scan hasn't fallen behind it, but otherwise I don't see any downside.
> The scenario where somebody re-dirties a buffer that was cleaned by the
> bgwriter scan isn't a problem, because that buffer will also have had its
> usage_count increased and thereby not be a candidate for replacement.

Something along those lines could be useful. I've thought of that
before, but it never occurred to me that if a page in front of the clock
hand is re-dirtied, it's no longer a candidate for replacement anyway...

I'm going to leave the all- and lru- bgwriter scans alone for now to get
this LDC patch finished. We still have the bgwriter autotuning patch in
the queue. Let's think about this in the context of that patch.

>> As a general comment on this subject, a lot of the work in LDC presumes
>> you have an accurate notion of how close the next checkpoint is.
>
> Yeah; this is one reason I was interested in carrying some write-speed
> state across checkpoints instead of having the calculation start from
> scratch each time.  That wouldn't help systems that sit idle a long time
> and suddenly go nuts, but it seems to me that smoothing the write rate
> across more than one checkpoint could help if the fluctuations occur
> over a timescale of a few checkpoints.

Hmm. This problem only applies to checkpoints triggered by
checkpoint_segments; time tends to move forward at a constant rate.

I'm not worried about small fluctuations or bursts. As I argued earlier,
the OS still buffers the writes and should give some extra smoothing of
the physical writes. I believe bursts of say 50-100 MB would easily fit
in OS cache, as long as there's enough gap between them. I haven't
tested that, though.

Here's a proposal for an algorithm to smooth bigger bursts:

The basic design is the same as before. We keep track of elapsed time
and elapsed xlogs, and based on them we estimate how much progress in
flushing the buffers we should've made by now, and then we catch up
until we reach that. The estimate for time is the same. The estimate for
xlogs gets more complicated:

Let's have a few definitions first:

Ro = elapsed segments / elapsed time, from previous checkpoint cycle.
For example, 1.25 means that the checkpoint was triggered by
checkpoint_segments, and we had spent 1/1.25 =  80% of
checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean
that checkpoint was triggered by checkpoint_timeout, and we had spent
25% of checkpoint_segments by then.

Rn = elapsed segments / elapsed time this far from current in-progress
checkpoint.

t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though
could be > 1 if next checkpoint is already due)
s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 -
1.0, though could again be > 1 if next checkpoint is already due)

R = estimate for WAL segment consumption rate, as checkpoint_segments /
checkpoint_timeout

R = Ro * (1 - t) + Rn * t

In other words, at the beginning of the checkpoint, we give more weight
to the state carried over from previous checkpoint. As we go forward,
more weight is given to the rate calculated from current cycle.

 From R, we extrapolate how much progress we should've done by now:

Target progress = R * t

This would require saving just one number from the previous cycle (Rn),
and the estimation function wouldn't need to be called at steady time
intervals. It gives a pretty steady I/O rate even if there are big bursts
in WAL activity, but still reacts to changes in the rate.
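
To make the arithmetic concrete, here's a minimal standalone sketch of the
estimate as described above (illustration only, not patch code; the
function and variable names are made up):

#include <stdio.h>

/*
 * Illustration of the proposed estimate.  Inputs are expressed as
 * fractions of checkpoint_timeout / checkpoint_segments, exactly as
 * defined above.
 */
static double
target_progress(double Ro,  /* segments/time rate from the previous cycle */
                double Rn,  /* segments/time rate measured so far this cycle */
                double t)   /* elapsed time, fraction of checkpoint_timeout */
{
    /* Early in the cycle trust the previous rate; later, the current one. */
    double R = Ro * (1.0 - t) + Rn * t;

    /* Fraction of checkpoint_segments we expect to have consumed by now,
     * which doubles as the xlog-based flush-progress target. */
    return R * t;
}

int
main(void)
{
    double Ro = 1.25;   /* previous cycle hit checkpoint_segments at 80% of the timeout */
    double Rn = 0.5;    /* the current cycle is consuming WAL more slowly */

    for (double t = 0.25; t <= 1.0; t += 0.25)
        printf("t = %.2f  target progress = %.2f\n",
               t, target_progress(Ro, Rn, t));
    return 0;
}

Whatever the exact weighting ends up being, the only state that has to
survive from one checkpoint cycle to the next is that single rate number.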

I'm not convinced this is worth the effort, though. First of all, this
is only a problem if you use checkpoint_segments to control your
checkpoints, so you can lower checkpoint_timeout to do more work during
the idle periods. Secondly, with the optimization of not flushing
buffers during checkpoint that were dirtied after the start of
checkpoint, the LRU-sweep will also contribute to flushing the buffers
and finishing the checkpoint. We don't count them towards the progress
made ATM, but we probably should. Lastly, distributing the writes even a
little bit is going to be smoother than the current behavior anyway.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Hmm.  But if we're going to do that, we might as well have a checkpoint
>> for our troubles, no?  The reason for the current design is the
>> assumption that a bgwriter_all scan is less burdensome than a
>> checkpoint, but that is no longer true given this rewrite.

> Per comments in CreateCheckPoint, another reason is that we also skip the
> extra checkpoints to avoid writing two checkpoint records to the same
> page, risking losing both on a crash:

That's not the reason for the current design --- that logic existed long
before bgwriter did.

Anyway, if there are no XLOG records since the last checkpoint, there's
probably nothing in shared buffers that needs flushing.  There might be
some dirty hint-bits, but the only reason to push those out is to make
some free buffers available, and doing that is not checkpoint's job (nor
the all-buffers scan's job); that's what the LRU scan is for.

The only reason that the all-buffers scan was put in was to try to make
the next checkpoint cheaper.  That reason falls to the ground with LDC.
What we have left is merely excess I/O.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Anyway, if there are no XLOG records since the last checkpoint, there's
> probably nothing in shared buffers that needs flushing.  There might be
> some dirty hint-bits, but the only reason to push those out is to make
> some free buffers available, and doing that is not checkpoint's job (nor
> the all-buffers scan's job); that's what the LRU scan is for.

Yeah, except the LRU scan is not doing a very good job at that. It will
ignore buffers with usage_count > 0, and it only scans
bgwriter_lru_percent buffers ahead of the clock hand.

One pathological case is a COPY of a table slightly smaller than
shared_buffers. That will fill the buffer cache. If you then have a
checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer
cache will be full of pages with just hint-bit-updates, but no WAL
activity since last checkpoint.

But let's fix the LRU scan rather than work around its deficiencies.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> ... that's what the LRU scan is for.

> Yeah, except the LRU scan is not doing a very good job at that. It will
> ignore buffers with usage_count > 0, and it only scans
> bgwriter_lru_percent buffers ahead of the clock hand.

Which is exactly in sync with which buffers are actually candidates for
replacement.  It could be looking further ahead of the clock hand, as
per my previous suggestion, but there is no need to push out buffers
with usage_count > 0 until after the sweep has decremented that to 0.
Doing so would tend to create excess I/O --- it makes it more likely
that multiple page-dirtying events on a page will result in separate
writes instead of a single write.

> One pathological case is a COPY of a table slightly smaller than
> shared_buffers. That will fill the buffer cache. If you then have a
> checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer
> cache will be full of pages with just hint-bit-updates, but no WAL
> activity since last checkpoint.

This argument supposes that the bgwriter will do nothing while the COPY
is proceeding.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> One pathological case is a COPY of a table slightly smaller than
>> shared_buffers. That will fill the buffer cache. If you then have a
>> checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer
>> cache will be full of pages with just hint-bit-updates, but no WAL
>> activity since last checkpoint.
>
> This argument supposes that the bgwriter will do nothing while the COPY
> is proceeding.

It will clean buffers ahead of the COPY, but it won't write the buffers
COPY leaves behind since they have usage_count=1.

Let me demonstrate this with an imaginary example with shared_buffers=4:

buf_id    usage_count    dirty
1    0        f
2    0        f
3    0        f
4    0        f

After COPY

buf_id    usage_count    dirty
1    1        t
2    1        t
3    1        t
4    1        t

CHECKPOINT:

buf_id    usage_count    dirty
1    1        f
2    1        f
3    1        f
4    1        f

VACUUM:

buf_id    usage_count    dirty
1    1        t
2    1        t
3    1        t
4    1        t

As soon as a backend asks for a buffer, the situation is defused, as the
backend will do a full clock sweep and decrement the usage_count of each
buffer to 0, letting the bgwriter LRU scan clean them.

Having the buffer cache full of dirty buffers is not a problem on its
own, so this only becomes a performance issue if you then issue another
large COPY etc. that needs those buffers, and you now have to write them
at the busy time.

This is a corner case that might not be worth worrying about. It's also
mitigated by the fact that the OS cache is most likely clean after a
period of idle time, and should be able to absorb the write burst.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> This argument supposes that the bgwriter will do nothing while the COPY
>> is proceeding.

> It will clean buffers ahead of the COPY, but it won't write the buffers
> COPY leaves behind since they have usage_count=1.

Yeah, and they don't *need* to be written until the clock sweep has
passed over them once.  I'm not impressed with the idea of writing
buffers because we might need them someday; that just costs extra
I/O due to re-dirtying in too many scenarios.

(Note that COPY per se will not trigger this behavior anyway, since it
will act in a limited number of buffers because of the recent buffer
access strategy patch.)

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> (Note that COPY per se will not trigger this behavior anyway, since it
> will act in a limited number of buffers because of the recent buffer
> access strategy patch.)

Actually we dropped it from COPY, because it didn't seem to improve
performance in the tests we ran.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> (Note that COPY per se will not trigger this behavior anyway, since it
>> will act in a limited number of buffers because of the recent buffer
>> access strategy patch.)

> Actually we dropped it from COPY, because it didn't seem to improve
> performance in the tests we ran.

Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
other large heapscan.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> (Note that COPY per se will not trigger this behavior anyway, since it
>>> will act in a limited number of buffers because of the recent buffer
>>> access strategy patch.)
>
>> Actually we dropped it from COPY, because it didn't seem to improve
>> performance in the tests we ran.
>
> Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
> other large heapscan.

Umm, I'm talking about populating a table with COPY *FROM*. That's not a
heap scan at all.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
>> other large heapscan.

> Umm, I'm talking about populating a table with COPY *FROM*. That's not a
> heap scan at all.

No wonder we're failing to communicate.  I assumed you were talking
about copy-to-file.  Copy-from-file is going to be creating WAL entries
hence the no-checkpoint case doesn't apply anyway, no?

[ thinks ... ]  Oh, you must be positing the case where the recently
added skip-WAL-if-table-is-new-in-this-transaction optimization applies.
Well, that thing could probably do with some more work anyway (I wonder
why it's using shared buffers at all anymore).  I don't really think
that case should be allowed to drive our thinking about how the bgwriter
should work.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
>>> other large heapscan.
>
>> Umm, I'm talking about populating a table with COPY *FROM*. That's not a
>> heap scan at all.
>
> No wonder we're failing to communicate.  I assumed you were talking
> about copy-to-file.  Copy-from-file is going to be creating WAL entries
> hence the no-checkpoint case doesn't apply anyway, no?

We are indeed having trouble communicating :(.

No, I'm not talking about the new non-WAL-logged COPY optimization. COPY
FROM *would* create WAL entries, and the next checkpoint would clean
them. So far so good. But then you run VACUUM, as you often do after
loading a table, to set all hint bits. That will *not* generate WAL, and
next checkpoint is skipped.

To recap, the sequence is:

1. COPY FROM
2. checkpoint
3. VACUUM

Now you have buffer cache full of dirty buffers with usage_count=1, and
no WAL activity since last checkpoint. They will not be flushed until:
a) WAL activity happens and next checkpoint comes
b) database is shut down, or manual CHECKPOINT is issued
c) clock hand advances and decrements the usage_counts

It's a corner case. Probably not a problem in practice; you usually run
CREATE INDEX after loading a table, for example. But it exists.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load Distributed Checkpoints, take 3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> To recap, the sequence is:

> 1. COPY FROM
> 2. checkpoint
> 3. VACUUM

> Now you have buffer cache full of dirty buffers with usage_count=1,

Well, it won't be very full, because VACUUM works in a limited number of
buffers (and did even before the BufferAccessStrategy patch).

I have no doubt that there are scenarios such as you are thinking about,
but it definitely seems like a corner case that doesn't justify keeping
the all-buffers scan.  That scan is costing us extra I/O in ordinary
non-corner cases, so it's not free to keep it.

            regards, tom lane

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Mon, 25 Jun 2007, Tom Lane wrote:

> right now, BgBufferSync starts over from the current clock-sweep point
> on each call --- that is, each bgwriter cycle.  So it can't really be
> made to write very many buffers without excessive CPU work.  Maybe we
> should redefine it to have some static state carried across bgwriter
> cycles

The LRU portion restarts like that, and it's certainly not optimal.  But
the auto-tuning LRU patch that's already near application makes this much
less of an issue because it only does real work when buffers have been
allocated, so the sweep point will have moved along.  I'll add this idea
to my list of things that would be nice to have as part of a larger
rewrite; I think it's more trouble than it's worth to chase right now.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Tue, 26 Jun 2007, Tom Lane wrote:

> I have no doubt that there are scenarios such as you are thinking about,
> but it definitely seems like a corner case that doesn't justify keeping
> the all-buffers scan.  That scan is costing us extra I/O in ordinary
> non-corner cases, so it's not free to keep it.

And scenarios I'm concerned about but can't diagram as easily fall into
this category as well.  I agree that an LDC-enabled config would ship with
the all-buffers scan turned off as redundant and wasteful, in which case
the only cost of keeping it is code baggage.  But the fact that there are
corner cases floating around this area is what makes me feel that removing
it altogether is still a bit premature.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Greg Smith
Date:
On Tue, 26 Jun 2007, Tom Lane wrote:

> I'm not impressed with the idea of writing buffers because we might need
> them someday; that just costs extra I/O due to re-dirtying in too many
> scenarios.

This is kind of an interesting statement to me because it really
highlights the difference between how I think about this problem and how
you see it.  As far as I'm concerned, there's a hierarchy of I/O the
database needs to finish that goes like this:

1) Client back-end writes (blocked until a buffer appears)
2) LRU writes so (1) doesn't happen
3) Checkpoint writes
4) Dirty pages with a non-zero usage count

In my view of the world, there should be one parameter for a target rate
of how much I/O you can stand under normal use, and the background writer
should work its way as far down this chain as it can until it meets that.
If there's plenty of clean buffers for the expected new allocations and
there's no checkpoint going on, by all means write out some buffers we
might re-dirty if there's I/O to spare.  If you write them twice, so what?
You didn't even get to that point as an option until all the important
stuff was taken care of and the system was near idle.
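
Just to illustrate the shape of what I mean (a thought experiment only --
none of these names or functions exist anywhere), the policy is basically
a budgeted pass down that hierarchy:

#include <stdio.h>

/*
 * Thought-experiment sketch of the priority scheme described above.
 * Nothing here corresponds to real PostgreSQL code; every name is
 * invented, and the writers are placeholders for buffer-manager calls.
 */
typedef int (*write_source)(int budget);    /* returns pages written */

static int write_for_waiting_backends(int budget) { (void) budget; return 0; }
static int write_lru_candidates(int budget)       { (void) budget; return 0; }
static int write_checkpoint_pages(int budget)     { (void) budget; return 0; }
static int write_hot_dirty_pages(int budget)      { (void) budget; return 0; }

static void
background_write_cycle(int io_budget)
{
    /* Work down the hierarchy until the per-cycle I/O budget is spent. */
    write_source sources[] = {
        write_for_waiting_backends,     /* 1. client writes blocked on a buffer */
        write_lru_candidates,           /* 2. keep (1) from happening */
        write_checkpoint_pages,         /* 3. checkpoint writes */
        write_hot_dirty_pages           /* 4. dirty pages with usage_count > 0 */
    };

    for (int i = 0; i < 4 && io_budget > 0; i++)
        io_budget -= sources[i](io_budget);

    printf("budget left over: %d\n", io_budget);
}

int
main(void)
{
    background_write_cycle(100);        /* e.g. 100 pages per cycle */
    return 0;
}

The lower levels only ever see whatever budget is left after the higher
ones, which is the whole point: buffers we might re-dirty get written only
with I/O that would otherwise have gone unused.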

The elimination of the all-scan background writer means that true hot and
dirty spots in the buffer cache, like popular index blocks on a heavily
updated table that never get a zero usage_count, are never going to be
written out other than as part of the checkpoint process.  That's OK for
now, but I'd like it to be the case that one day the database's I/O
scheduling would eventually get to those, in order to optimize performance
in the kind of bursty scenarios I've been mentioning lately.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load Distributed Checkpoints, take 3

From
Gregory Stark
Date:
"Greg Smith" <gsmith@gregsmith.com> writes:

> If you write them twice, so what? You didn't even get to that point as an
> option until all the important stuff was taken care of and the system was
> near idle.

Well, even if it's near idle, you were still occupying the I/O system for a
few milliseconds. If someone else came in with a block request at that time,
you extended their response time by that much.

> The elimination of the all-scan background writer means that true hot and dirty
> spots in the buffer cache, like popular index blocks on a heavily updated table
> that never get a zero usage_count, are never going to be written out other than
> as part of the checkpoint process.

If they're really popular blocks on a heavily updated table, then writing
them out unnecessarily doesn't really buy us anything.

The case where it helps us is when they weren't really popular, but we
never got around to writing them out, and then when we do need to write
them out the system is busy. In that case we wasted a chance to write them
out while the system was more idle.

But don't forget that we're still going through the OS's buffers. That will
smooth out a lot of the I/O we generate anyway. Just because Postgres is idle
doesn't mean the OS isn't busy flushing the buffers we wrote out when it was
busy.

> That's OK for now, but I'd like it to be the case that one day the
> database's I/O scheduling would eventually get to those, in order to
> optimize performance in the kind of bursty scenarios I've been mentioning
> lately.

I think the feeling is that the bursty scenarios are really corner cases.
Heikki described a case where you're dirtying tons of blocks without
triggering any WAL. That can happen but it's pretty peculiar.

I can imagine a scenario where you have a system that's very busy for 60s and
then idle for 60s repeatedly. And for some reason you configure a
checkpoint_timeout on the order of 20m or so (assuming you're travelling
precisely 60mph).

In that scenario the bgwriter's LRU scan has to fight to keep up for 60s while
it's mostly writing out dirty pages that it could have flushed out during the
idle time. Effectively you're only making use of half the I/O bandwidth since
the bgwriter doesn't do any work for half the duty cycle.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Load Distributed Checkpoints, take 3

From
Alvaro Herrera
Date:
Gregory Stark wrote:

> I can imagine a scenario where you have a system that's very busy for 60s and
> then idle for 60s repeatedly. And for some reason you configure a
> checkpoint_timeout on the order of 20m or so (assuming you're travelling
> precisely 60mph).

Is that Scottish m?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.