Thread: Load Distributed Checkpoints test results

Load Distributed Checkpoints test results

From
Heikki Linnakangas
Date:
Here's results from a batch of test runs with LDC. This patch only
spreads out the writes, fsyncs work as before. This patch also includes
the optimization that we don't write buffers that were dirtied after
starting the checkpoint.
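
In case the diff below is heavy going, here's a toy standalone sketch of that
optimization (a made-up simulation, not the patch code; in the actual patch it
is the two-pass loop in BufferSync() using the new BM_CHECKPOINT_NEEDED flag):

/*
 * Toy simulation of the "mark, then write" idea: buffers dirty at
 * checkpoint start are flagged, and only flagged buffers get written,
 * so anything dirtied while the (now longer) write phase is running is
 * left for the next checkpoint.  Not PostgreSQL code.
 */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
    bool        dirty;
    bool        checkpoint_needed;
} ToyBuffer;

static ToyBuffer buffers[NBUFFERS];

static void
checkpoint_write_phase(void)
{
    /* Pass 1: flag everything that is dirty right now. */
    for (int i = 0; i < NBUFFERS; i++)
        buffers[i].checkpoint_needed = buffers[i].dirty;

    /* Buffer 3 gets dirtied *after* the checkpoint has started... */
    buffers[3].dirty = true;

    /* Pass 2: write only the flagged buffers; buffer 3 is skipped. */
    for (int i = 0; i < NBUFFERS; i++)
    {
        if (buffers[i].checkpoint_needed)
        {
            printf("writing buffer %d\n", i);
            buffers[i].dirty = false;
            buffers[i].checkpoint_needed = false;
        }
    }
}

int
main(void)
{
    buffers[0].dirty = buffers[5].dirty = true; /* dirty before the checkpoint */
    checkpoint_write_phase();
    return 0;
}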

http://community.enterprisedb.com/ldc/

See tests 276-280. 280 is the baseline with no patch attached, the
others are with load distributed checkpoints with different values for
checkpoint_write_percent. But after running the tests I noticed that the
spreading was actually controlled by checkpoint_write_rate, which sets
the minimum rate for the writes, so all those tests with the patch
applied are effectively the same; the writes were spread over a period
of 1 minute. I'll fix that setting and run more tests.
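
To make the interaction of those two settings clearer, here's a
back-of-the-envelope sketch of the pacing logic (purely illustrative; the
timeout, buffer count and rate below are made-up numbers, not the test
configuration). Whichever bound comes out smaller controls the spreading, and
with numbers like these the minimum-rate bound wins, which mirrors what
happened in these runs:

#include <stdio.h>

static double
target_progress(double elapsed_sec, double checkpoint_timeout_sec,
                double wal_segs_used, double checkpoint_segments,
                double write_percent)
{
    double in_time = elapsed_sec / checkpoint_timeout_sec;
    double in_xlog = wal_segs_used / checkpoint_segments;
    double p = (in_time > in_xlog) ? in_time : in_xlog;

    if (p > 1.0)
        p = 1.0;
    return p / (write_percent / 100.0);
}

int
main(void)
{
    double  timeout = 900.0;       /* assume checkpoint_timeout = 15min */
    double  write_percent = 50.0;  /* checkpoint_write_percent */
    int     num_to_write = 30000;  /* dirty buffers counted at checkpoint start */
    int     write_rate = 100;      /* checkpoint_write_rate: buffers between naps */
    double  nap_sec = 0.2;         /* roughly one bgwriter_delay per nap */

    /* The pacing aims to stretch the writes over this much wall-clock time, */
    double  target_window = timeout * write_percent / 100.0;
    /* but writing at least write_rate buffers per nap finishes them all in
     * at most about this long (ignoring the write time itself). */
    double  rate_bound = (double) num_to_write / write_rate * nap_sec;

    printf("target window: %.0f s, minimum-rate bound: %.0f s\n",
           target_window, rate_bound);
    printf("progress target after 60 s: %.3f\n",
           target_progress(60.0, timeout, 0.5, 3.0, write_percent));
    return 0;
}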

The response time graphs show that the patch reduces the max (new-order)
response times during checkpoints from ~40-60 s to ~15-20 s. The change
in the minute-by-minute average is even more significant.

The change in overall average response times is also very significant.
1.5s without patch, and ~0.3-0.4s with the patch for new-order
transactions. That also means that we pass the TPC-C requirement that
90th percentile of response times must be < average.


All that said, there are still significant checkpoint spikes present, even
though they're much less severe than without the patch. I'm willing to
settle for this for 8.3. Does anyone want to push for more testing, and
think about spreading out the fsyncs as well, and/or adding a delay between
the writes and fsyncs?

Attached is the patch used in the tests. It still needs some love..

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml    7 Jun 2007 19:19:56 -0000    1.126
--- doc/src/sgml/config.sgml    12 Jun 2007 08:16:55 -0000
***************
*** 1565,1570 ****
--- 1565,1619 ----
        </listitem>
       </varlistentry>

+      <varlistentry id="guc-checkpoint-write-percent" xreflabel="checkpoint_write_percent">
+       <term><varname>checkpoint_write_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_write_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         To spread out checkpoint work, each checkpoint delays so that writing
+         out all dirty buffers in the shared buffer pool takes the specified
+         fraction of the time. The default value is 50.0 (50% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checkpoint-nap-percent" xreflabel="checkpoint_nap_percent">
+       <term><varname>checkpoint_nap_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_nap_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the delay between writing out all dirty buffers and flushing
+         all modified files, giving the kernel's disk writer time to flush dirty
+         buffers and thus reducing the work left for the following sync phase.
+         The default value is 10.0 (10% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checkpoint-sync-percent" xreflabel="checkpoint_sync_percent">
+       <term><varname>checkpoint_sync_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_sync_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         To spread out checkpoint work, each checkpoint delays so that flushing
+         all modified files takes the specified fraction of the time.
+         The default value is 20.0 (20% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c    31 May 2007 15:13:01 -0000    1.272
--- src/backend/access/transam/xlog.c    12 Jun 2007 08:16:55 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 5319,5324 ****
--- 5319,5341 ----
  }

  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+     volatile XLogCtlData *xlogctl = XLogCtl;
+     XLogCtlInsert  *Insert = &XLogCtl->Insert;
+     XLogRecPtr        recptr;
+
+     SpinLockAcquire(&xlogctl->info_lck);
+     INSERT_RECPTR(recptr, Insert, Insert->curridx);
+     SpinLockRelease(&xlogctl->info_lck);
+
+     return recptr;
+ }
+
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
***************
*** 5591,5597 ****
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo);

      START_CRIT_SECTION();

--- 5608,5614 ----
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo, force);

      START_CRIT_SECTION();

***************
*** 5697,5708 ****
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool();            /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
--- 5714,5725 ----
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool(immediate);        /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
***************
*** 5751,5757 ****
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo);

      /*
       * Update pg_control so that any subsequent crash will restart from this
--- 5768,5774 ----
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, true);

      /*
       * Update pg_control so that any subsequent crash will restart from this
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.195
diff -c -r1.195 dbcommands.c
*** src/backend/commands/dbcommands.c    1 Jun 2007 19:38:07 -0000    1.195
--- src/backend/commands/dbcommands.c    12 Jun 2007 08:16:55 -0000
***************
*** 404,410 ****
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync();

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
--- 404,410 ----
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(true);

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
***************
*** 1427,1433 ****
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync();

          /*
           * Copy this subdirectory to the new location
--- 1427,1433 ----
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync(true);

          /*
           * Copy this subdirectory to the new location
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.38
diff -c -r1.38 bgwriter.c
*** src/backend/postmaster/bgwriter.c    27 May 2007 03:50:39 -0000    1.38
--- src/backend/postmaster/bgwriter.c    12 Jun 2007 11:23:57 -0000
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"

  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>

***************
*** 117,122 ****
--- 118,124 ----
      sig_atomic_t ckpt_failed;    /* advances when checkpoint fails */

      sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */
+     sig_atomic_t ckpt_force;        /* any waiter for the checkpoint? */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
***************
*** 131,136 ****
--- 133,139 ----
  int            BgWriterDelay = 200;
  int            CheckPointTimeout = 300;
  int            CheckPointWarning = 30;
+ double        checkpoint_write_percent = 50.0;

  /*
   * Flags set by interrupt handlers for later service in the main loop.
***************
*** 146,155 ****
--- 149,166 ----

  static bool ckpt_active = false;

+ static time_t        ckpt_start_time;
+ static XLogRecPtr    ckpt_start_recptr;
+ static double        ckpt_progress_at_sync_start;
+
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;


+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(long msec);
+ static bool NextCheckpointRequested(void);
+ static double GetCheckpointElapsedProgress(void);
  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
  static void ReqCheckpointHandler(SIGNAL_ARGS);
***************
*** 331,337 ****
          bool        force_checkpoint = false;
          time_t        now;
          int            elapsed_secs;
-         long        udelay;

          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
--- 342,347 ----
***************
*** 350,362 ****
              got_SIGHUP = false;
              ProcessConfigFile(PGC_SIGHUP);
          }
-         if (checkpoint_requested)
-         {
-             checkpoint_requested = false;
-             do_checkpoint = true;
-             force_checkpoint = true;
-             BgWriterStats.m_requested_checkpoints++;
-         }
          if (shutdown_requested)
          {
              /*
--- 360,365 ----
***************
*** 377,387 ****
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             if (!force_checkpoint)
!                 BgWriterStats.m_timed_checkpoints++;
          }

          /*
--- 380,396 ----
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (checkpoint_requested)
!         {
!             checkpoint_requested = false;
!             force_checkpoint = BgWriterShmem->ckpt_force;
!             do_checkpoint = true;
!             BgWriterStats.m_requested_checkpoints++;
!         }
!         else if (elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             BgWriterStats.m_timed_checkpoints++;
          }

          /*
***************
*** 404,416 ****
--- 413,430 ----
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
              BgWriterShmem->ckpt_time_warn = false;
+             BgWriterShmem->ckpt_force = false;

              /*
               * Indicate checkpoint start to any waiting backends.
               */
              ckpt_active = true;
+             elog(DEBUG1, "CHECKPOINT: start");
              BgWriterShmem->ckpt_started++;

+             ckpt_start_time = now;
+             ckpt_start_recptr = GetInsertRecPtr();
+             ckpt_progress_at_sync_start = 0;
              CreateCheckPoint(false, force_checkpoint);

              /*
***************
*** 423,428 ****
--- 437,443 ----
               * Indicate checkpoint completion to any waiting backends.
               */
              BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
+             elog(DEBUG1, "CHECKPOINT: end");
              ckpt_active = false;

              /*
***************
*** 439,446 ****
           * Check for archive_timeout, if so, switch xlog files.  First we do a
           * quick check using possibly-stale local state.
           */
!         if (XLogArchiveTimeout > 0 &&
!             (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
--- 454,481 ----
           * Check for archive_timeout, if so, switch xlog files.  First we do a
           * quick check using possibly-stale local state.
           */
!         CheckArchiveTimeout();
!
!         /* Nap for the configured time. */
!         BgWriterNap(0);
!     }
! }
!
! /*
!  * CheckArchiveTimeout -- check for archive_timeout
!  */
! static void
! CheckArchiveTimeout(void)
! {
!     time_t        now;
!
!     if (XLogArchiveTimeout <= 0)
!         return;
!
!     now = time(NULL);
!     if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
!         return;
!
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
***************
*** 450,459 ****

              last_xlog_switch_time = Max(last_xlog_switch_time, last_time);

-             /* if we did a checkpoint, 'now' might be stale too */
-             if (do_checkpoint)
-                 now = time(NULL);
-
              /* Now we can do the real check */
              if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
              {
--- 485,490 ----
***************
*** 478,483 ****
--- 509,526 ----
                  last_xlog_switch_time = now;
              }
          }
+ }
+
+ /*
+  * BgWriterNap -- short nap in bgwriter
+  *
+  * Nap for the configured bgwriter delay, clamped to at most mdelay
+  * milliseconds when mdelay is nonzero.
+  */
+ static void
+ BgWriterNap(long mdelay)
+ {
+     long        udelay;

          /*
           * Send off activity statistics to the stats collector
***************
*** 503,508 ****
--- 546,555 ----
          else
              udelay = 10000000L; /* Ten seconds */

+         /* Clamp the delay to the upper bound. */
+         if (mdelay > 0)
+             udelay = Min(udelay, mdelay * 1000L);
+
          while (udelay > 999999L)
          {
              if (got_SIGHUP || checkpoint_requested || shutdown_requested)
***************
*** 514,522 ****
--- 561,664 ----

          if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
              pg_usleep(udelay);
+ }
+
+ /*
+  * CheckpointWriteDelay -- periodical sleep in checkpoint write phase
+  */
+ void
+ CheckpointWriteDelay(double progress)
+ {
+     double target_progress;
+     bool next_requested;
+
+     if (!ckpt_active || checkpoint_write_percent <= 0)
+         return;
+
+     next_requested = NextCheckpointRequested();
+     target_progress = GetCheckpointElapsedProgress() / (checkpoint_write_percent / 100);
+
+     elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f, target=%.3f, next=%d",
+          progress, target_progress, next_requested);
+
+     if (!next_requested &&
+         progress > target_progress)
+     {
+         AbsorbFsyncRequests();
+         BgLruBufferSync();
+         BgWriterNap(0);
      }
  }

+ /*
+  * NextCheckpointRequested -- true iff the next checkpoint is requested
+  *
+  *    Also check for any signals received recently.
+  */
+ static bool
+ NextCheckpointRequested(void)
+ {
+     if (!am_bg_writer || !ckpt_active)
+         return true;
+
+     /* Don't sleep this checkpoint if next checkpoint is requested. */
+     if (checkpoint_requested || shutdown_requested ||
+         (time(NULL) - ckpt_start_time >= CheckPointTimeout))
+     {
+         elog(DEBUG1, "NextCheckpointRequested");
+         checkpoint_requested = true;
+         return true;
+     }
+
+     /* Process reload signals. */
+     if (got_SIGHUP)
+     {
+         got_SIGHUP = false;
+         ProcessConfigFile(PGC_SIGHUP);
+     }
+
+     /* Check for archive_timeout and nap for the configured time. */
+     CheckArchiveTimeout();
+
+     return false;
+ }
+
+ /*
+  * GetCheckpointElapsedProgress -- progress of the current checkpoint, as a fraction in the range 0.0 to 1.0
+  */
+ static double
+ GetCheckpointElapsedProgress(void)
+ {
+     struct timeval    now;
+     XLogRecPtr        recptr;
+     double            progress_in_time,
+                     progress_in_xlog;
+     double            percent;
+
+     Assert(ckpt_active);
+
+     /* coordinate the progress with checkpoint_timeout */
+     gettimeofday(&now, NULL);
+     progress_in_time = ((double) (now.tv_sec - ckpt_start_time) +
+         now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+     /* coordinate the progress with checkpoint_segments */
+     recptr = GetInsertRecPtr();
+     progress_in_xlog =
+         (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+         CheckPointSegments;
+
+     percent = Max(progress_in_time, progress_in_xlog);
+     if (percent > 1.0)
+         percent = 1.0;
+
+     elog(DEBUG2, "GetCheckpointElapsedProgress: time=%.3f, xlog=%.3f",
+         progress_in_time, progress_in_xlog);
+
+     return percent;
+ }
+

  /* --------------------------------
   *        signal handler routines
***************
*** 656,661 ****
--- 798,805 ----
      /* Set warning request flag if appropriate */
      if (warnontime)
          bgs->ckpt_time_warn = true;
+     if (waitforit)
+         bgs->ckpt_force = true;

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.220
diff -c -r1.220 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    30 May 2007 20:11:58 -0000    1.220
--- src/backend/storage/buffer/bufmgr.c    12 Jun 2007 12:43:37 -0000
***************
*** 74,79 ****
--- 74,80 ----
  double        bgwriter_all_percent = 0.333;
  int            bgwriter_lru_maxpages = 5;
  int            bgwriter_all_maxpages = 5;
+ int            checkpoint_write_rate = 1;


  long        NDirectFileRead;    /* some I/O's are direct file access. bypass
***************
*** 1002,1031 ****
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
      int            buf_id;
      int            num_to_scan;
      int            absorb_counter;

      /*
       * Find out where to start the circular scan.
       */
!     buf_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
       * Loop over all buffers.
       */
      num_to_scan = NBuffers;
      absorb_counter = WRITES_PER_ABSORB;
!     while (num_to_scan-- > 0)
      {
!         if (SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;

              /*
               * If in bgwriter, absorb pending fsync requests after each
--- 1003,1092 ----
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(bool immediate)
  {
      int            buf_id;
      int            num_to_scan;
+     int            num_written;
      int            absorb_counter;
+     int            writes_per_nap = checkpoint_write_rate;
+     int            num_to_write;
+     int            start_id;
+     int            num_written_since_nap;

      /*
       * Find out where to start the circular scan.
       */
!     start_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
+      * Loop over all buffers, and mark the ones that need to be written.
+      */
+     num_to_scan = NBuffers;
+     num_to_write = 0;
+     buf_id = start_id;
+     while (num_to_scan-- > 0)
+     {
+         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
+         LockBufHdr(bufHdr);
+
+         if (bufHdr->flags & BM_DIRTY)
+         {
+             bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+             num_to_write++;
+         }
+         else
+         {
+             /* There shouldn't be any buffers in the cache with the flag
+              * set, but better safe than sorry in case the previous checkpoint
+              * crashed. If we didn't clear the flag, we might end the
+              * write-loop below early, because num_to_write wouldn't include
+              * any leftover pages. Alternatively, we could count them into
+              * num_to_write, but we might as well clear the flag and avoid the work.
+              */
+             bufHdr->flags &= ~BM_CHECKPOINT_NEEDED;
+         }
+
+         UnlockBufHdr(bufHdr);
+
+         if (++buf_id >= NBuffers)
+             buf_id = 0;
+     }
+
+     elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers);
+
+     /*
       * Loop over all buffers.
       */
      num_to_scan = NBuffers;
+     num_written = num_written_since_nap = 0;
      absorb_counter = WRITES_PER_ABSORB;
!     buf_id = start_id;
!     while (num_to_scan-- > 0 && num_written < num_to_write)
      {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!         bool needs_flush;
!
!         LockBufHdr(bufHdr);
!
!         needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0;
!
!         UnlockBufHdr(bufHdr);
!
!         if (needs_flush && SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;
+             num_written++;
+
+             if (!immediate && ++num_written_since_nap >= writes_per_nap)
+             {
+                 num_written_since_nap = 0;
+                 CheckpointWriteDelay(
+                     (double) (num_written) / num_to_write);
+             }

              /*
               * If in bgwriter, absorb pending fsync requests after each
***************
*** 1053,1059 ****
  BgBufferSync(void)
  {
      static int    buf_id1 = 0;
-     int            buf_id2;
      int            num_to_scan;
      int            num_written;

--- 1114,1119 ----
***************
*** 1099,1104 ****
--- 1159,1177 ----
          BgWriterStats.m_buf_written_all += num_written;
      }

+     BgLruBufferSync();
+ }
+
+ /*
+  * BgLruBufferSync -- Write out some lru dirty buffers in the pool.
+  */
+ void
+ BgLruBufferSync(void)
+ {
+     int            buf_id2;
+     int            num_to_scan;
+     int            num_written;
+
      /*
       * This loop considers only unpinned buffers close to the clock sweep
       * point.
***************
*** 1341,1349 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
!     BufferSync();
      smgrsync();
  }

--- 1414,1425 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
!     elog(DEBUG1, "CHECKPOINT: write phase");
!     BufferSync(immediate || checkpoint_write_percent <= 0);
!
!     elog(DEBUG1, "CHECKPOINT: sync phase");
      smgrsync();
  }

***************
*** 2132,2138 ****
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~BM_DIRTY;
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
--- 2208,2214 ----
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.396
diff -c -r1.396 guc.c
*** src/backend/utils/misc/guc.c    8 Jun 2007 18:23:52 -0000    1.396
--- src/backend/utils/misc/guc.c    12 Jun 2007 10:50:50 -0000
***************
*** 1579,1584 ****
--- 1579,1593 ----
      },

      {
+         {"checkpoint_write_rate", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("XXX"),
+             NULL
+         },
+         &checkpoint_write_rate,
+         1, 0, 1000000, NULL, NULL
+     },
+
+     {
          {"log_rotation_age", PGC_SIGHUP, LOGGING_WHERE,
              gettext_noop("Automatic log file rotation will occur after N minutes."),
              NULL,
***************
*** 1866,1871 ****
--- 1875,1889 ----
          0.1, 0.0, 100.0, NULL, NULL
      },

+     {
+         {"checkpoint_write_percent", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("Sets the duration percentage of write phase in checkpoints."),
+             NULL
+         },
+         &checkpoint_write_percent,
+         50.0, 0.0, 100.0, NULL, NULL
+     },
+
      /* End-of-list marker */
      {
          {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.216
diff -c -r1.216 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    3 Jun 2007 17:08:15 -0000    1.216
--- src/backend/utils/misc/postgresql.conf.sample    12 Jun 2007 08:16:55 -0000
***************
*** 168,173 ****
--- 168,176 ----

  #checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min        # range 30s-1h
+ #checkpoint_write_percent = 50.0        # duration percentage in write phase
+ #checkpoint_nap_percent = 10.0        # duration percentage between write and sync phases
+ #checkpoint_sync_percent = 20.0        # duration percentage in sync phase
  #checkpoint_warning = 30s        # 0 is off

  # - Archiving -
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.78
diff -c -r1.78 xlog.h
*** src/include/access/xlog.h    30 May 2007 20:12:02 -0000    1.78
--- src/include/access/xlog.h    12 Jun 2007 08:16:55 -0000
***************
*** 174,179 ****
--- 174,180 ----
  extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h    5 Jan 2007 22:19:57 -0000    1.9
--- src/include/postmaster/bgwriter.h    12 Jun 2007 08:16:55 -0000
***************
*** 20,29 ****
--- 20,35 ----
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;
+ extern double    checkpoint_write_percent;
+ extern double    checkpoint_nap_percent;
+ extern double    checkpoint_sync_percent;

  extern void BackgroundWriterMain(void);

  extern void RequestCheckpoint(bool waitforit, bool warnontime);
+ extern void CheckpointWriteDelay(double progress);
+ extern void CheckpointNapDelay(double percent);
+ extern void CheckpointSyncDelay(double progress, double percent);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.90
diff -c -r1.90 buf_internals.h
*** src/include/storage/buf_internals.h    30 May 2007 20:12:03 -0000    1.90
--- src/include/storage/buf_internals.h    12 Jun 2007 11:42:23 -0000
***************
*** 35,40 ****
--- 35,41 ----
  #define BM_IO_ERROR                (1 << 4)        /* previous I/O failed */
  #define BM_JUST_DIRTIED            (1 << 5)        /* dirtied since write started */
  #define BM_PIN_COUNT_WAITER        (1 << 6)        /* have waiter for sole pin */
+ #define BM_CHECKPOINT_NEEDED    (1 << 7)        /* this needs to be written in checkpoint */

  typedef bits16 BufFlags;

Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.104
diff -c -r1.104 bufmgr.h
*** src/include/storage/bufmgr.h    30 May 2007 20:12:03 -0000    1.104
--- src/include/storage/bufmgr.h    12 Jun 2007 08:52:28 -0000
***************
*** 36,41 ****
--- 36,42 ----
  extern double bgwriter_all_percent;
  extern int    bgwriter_lru_maxpages;
  extern int    bgwriter_all_maxpages;
+ extern int    checkpoint_write_rate;

  /* in buf_init.c */
  extern DLLIMPORT char *BufferBlocks;
***************
*** 136,142 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 137,143 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,168 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(void);
  extern void BgBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);

--- 162,170 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
  extern void BgBufferSync(void);
+ extern void BgLruBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);


Re: Load Distributed Checkpoints test results

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> The response time graphs show that the patch reduces the max (new-order)
> response times during checkpoints from ~40-60 s to ~15-20 s. 

I think that's the headline number here. The worst-case response time is
reduced from about 60s to about 17s. That's pretty impressive on its own. It
would be worth knowing if that benefit goes away if we push the machine again
to the edge of its i/o bandwidth.

> The change in overall average response times is also very significant. 1.5s
> without patch, and ~0.3-0.4s with the patch for new-order transactions. That
> also means that we pass the TPC-C requirement that 90th percentile of response
> times must be < average.

Incidentally this is backwards: the 90th percentile response time must be
greater than the average response time for that transaction.

This isn't actually a very stringent test given that most of the data points
in the 90th percentile are actually substantially below the maximum. It's
quite possible to achieve it even with maximum response times above 60s.

However TPC-E has even more stringent requirements:
   During Steady State the throughput of the SUT must be sustainable for the
   remainder of a Business Day started at the beginning of the Steady State.

   Some aspects of the benchmark implementation can result in rather
   insignificant but frequent variations in throughput when computed over
   somewhat shorter periods of time. To meet the sustainable throughput
   requirement, the cumulative effect of these variations over one Business
   Day must not exceed 2% of the Reported Throughput.

   Comment 1: This requirement is met when the throughput computed over any
   period of one hour, sliding over the Steady State by increments of ten
   minutes, varies from the Reported Throughput by no more than 2%.

   Some aspects of the benchmark implementation can result in rather
   significant but sporadic variations in throughput when computed over some
   much shorter periods of time. To meet the sustainable throughput
   requirement, the cumulative effect of these variations over one Business
   Day must not exceed 20% of the Reported Throughput.

   Comment 2: This requirement is met when the throughput level computed over
   any period of ten minutes, sliding over the Steady State by increments of
   one minute, varies from the Reported Throughput by no more than 20%.
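
To make that sliding-window rule concrete, here's a rough sketch of the check
it describes (the per-minute sample layout, names, and numbers below are made
up for illustration, not taken from the spec or from any real run):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Throughput over every window of `window` samples, sliding by `step`,
 * must stay within `tolerance` of the reported throughput.
 */
static bool
throughput_sustained(const double *tpm, size_t n, size_t window, size_t step,
                     double reported_tpm, double tolerance)
{
    for (size_t start = 0; start + window <= n; start += step)
    {
        double sum = 0.0;

        for (size_t i = start; i < start + window; i++)
            sum += tpm[i];

        double avg = sum / window;

        if (avg < reported_tpm * (1.0 - tolerance) ||
            avg > reported_tpm * (1.0 + tolerance))
            return false;
    }
    return true;
}

int
main(void)
{
    /* Two hours of made-up per-minute throughput samples. */
    double tpm[120];

    for (int i = 0; i < 120; i++)
        tpm[i] = (i % 15 == 0) ? 800.0 : 1000.0;    /* periodic checkpoint dip */

    double reported = 975.0;

    /* 60-minute windows, sliding by 10 minutes, within 2%. */
    printf("hourly check:    %s\n",
           throughput_sustained(tpm, 120, 60, 10, reported, 0.02) ? "pass" : "fail");
    /* 10-minute windows, sliding by 1 minute, within 20%. */
    printf("10-minute check: %s\n",
           throughput_sustained(tpm, 120, 10, 1, reported, 0.20) ? "pass" : "fail");
    return 0;
}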


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: Load Distributed Checkpoints test results

From
Josh Berkus
Date:
Greg,

> However TPC-E has even more stringent requirements:

I'll see if I can get our TPCE people to test this, but I'd say that the 
existing patch is already good enough to be worth accepting based on the TPCC 
results.

However, I would like to see some community testing on oddball workloads (like 
huge ELT operations and read-only workloads) to see if the patch imposes any 
extra overhead on non-OLTP databases.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Load Distributed Checkpoints test results

From
ITAGAKI Takahiro
Date:
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> Here's results from a batch of test runs with LDC. This patch only 
> spreads out the writes, fsyncs work as before.

I saw similar results in my tests. Spreading only the writes is enough
for OLTP, at least on Linux with a middle-or-high-grade storage system.
It also works well on a desktop-grade Windows machine.

However, I don't know how it works on other OSes, including Solaris
and FreeBSD, that have different I/O policies. Would anyone test it
in those environments?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Load Distributed Checkpoints test results

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Here's results from a batch of test runs with LDC. This patch only 
> spreads out the writes, fsyncs work as before. This patch also includes 
> the optimization that we don't write buffers that were dirtied after 
> starting the checkpoint.
> 
> http://community.enterprisedb.com/ldc/
> 
> See tests 276-280. 280 is the baseline with no patch attached, the 
> others are with load distributed checkpoints with different values for 
> checkpoint_write_percent. But after running the tests I noticed that the 
> spreading was actually controlled by checkpoint_write_rate, which sets 
> the minimum rate for the writes, so all those tests with the patch 
> applied are effectively the same; the writes were spread over a period 
> of 1 minute. I'll fix that setting and run more tests.

I ran another series of tests, with a less aggressive bgwriter_delay 
setting, which also affects the minimum rate of the writes in the WIP 
patch I used.

Now that the checkpoints are spread out more, the response times are 
very smooth.

With the 40% checkpoint_write_percent setting, the checkpoints last ~3 
minutes. About 85% of the buffer cache is dirty at the beginning of 
checkpoints, and thanks to the optimization of not writing pages dirtied 
after checkpoint start, only ~47% of those are actually written by the 
checkpoint. That explains why the checkpoints only last ~3 minutes, and 
not checkpoint_timeout*checkpoint_write_percent, which would be 6 
minutes. The estimation of how much progress has been done and how much 
is left doesn't take the gain from that optimization into account.
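
Just to spell out the arithmetic (checkpoint_timeout must have been 15 min
here for 40% to come out to 6 minutes):

    expected write phase:  0.40 * 15 min  = 6 min
    written by checkpoint: ~47% of the buffers counted at the start
    observed write phase:  ~0.47 * 6 min  = ~2.8 min, i.e. ~3 minutes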

The sync phase only takes ~5 seconds. I'm very happy with these results.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Load Distributed Checkpoints test results

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> I ran another series of tests, with a less aggressive bgwriter_delay setting,
> which also affects the minimum rate of the writes in the WIP patch I used.
>
> Now that the checkpoints are spread out more, the response times are very
> smooth.

So obviously the reason the results are so dramatic is that the checkpoints
used to push the i/o bandwidth demand up over 100%. By spreading it out you
can see in the io charts that even during the checkpoint the i/o busy rate
stays just under 100% except for a few data points.

If I understand it right Greg Smith's concern is that in a busier system where
even *with* the load distributed checkpoint the i/o bandwidth demand during
the checkpoint was *still* being pushed over 100% then spreading out the load
would only exacerbate the problem by extending the outage.

To that end it seems like what would be useful is a pair of tests with and
without the patch with about 10% larger warehouse size (~ 115) which would
push the i/o bandwidth demand up to about that level.

It might even make sense to run a test with an outright overloaded system to see if
the patch doesn't exacerbate the condition. Something with a warehouse size of
maybe 150. I would expect it to fail the TPCC constraints either way but what
would be interesting to know is whether it fails by a larger margin with the
LDC behaviour or a smaller margin.

Even just the fact that we're passing at 105 warehouses -- and apparently with
quite a bit of headroom too -- whereas previously we were failing at that
level on this hardware is a positive result as far as the TPCC benchmark
methodology is concerned.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: Load Distributed Checkpoints test results

From
Greg Smith
Date:
On Fri, 15 Jun 2007, Gregory Stark wrote:

> If I understand it right Greg Smith's concern is that in a busier system 
> where even *with* the load distributed checkpoint the i/o bandwidth 
> demand during the checkpoint was *still* being pushed over 100% then 
> spreading out the load would only exacerbate the problem by extending 
> the outage.

Thank you for that very concise summary; that's exactly what I've run 
into.  DBT2 creates a heavy write load, but it's not testing a real burst 
behavior where something is writing as fast as it's possible to.

I've been involved in applications that are more like a data logging 
situation, where periodically you get some data source tossing 
transactions in as fast as it will hit disk--the upstream source 
temporarily becomes faster at generating data during these periods than 
the database itself can be.  Under normal conditions, the LDC smoothing 
would be a win, as it would lower the number of times the entire flow of 
operations got stuck.  But at these peaks it will, as you say, extend the 
outage.

> It might even make sense to run a test with an outright overloaded to 
> see if the patch doesn't exacerbate the condition.

Exactly.  I expect that it will make things worse, but I'd like to keep an 
eye on making sure the knobs are available so that it's only slightly 
worse.

I think it's important to at least recognize that someone who wants LDC 
normally might occasionally have a period where they're completely 
overloaded, and that this new feature doesn't have an unexpected breakdown 
when that happens.  I'm still stuggling with creating a simple test case 
to demonstrate what I'm concerned about.  I'm not familiar enough with the 
TPC testing to say whether your suggestions for adjusting warehouse size 
would accomplish that (because the flow is so different I had to abandon 
working with that a while ago as not being representative of what I was 
doing), but I'm glad you're thinking about it.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Load Distributed Checkpoints test results

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Now that the checkpoints are spread out more, the response times are very
>> smooth.
> 
> So obviously the reason the results are so dramatic is that the checkpoints
> used to push the i/o bandwidth demand up over 100%. By spreading it out you
> can see in the io charts that even during the checkpoint the i/o busy rate
> stays just under 100% except for a few data points.
> 
> If I understand it right Greg Smith's concern is that in a busier system where
> even *with* the load distributed checkpoint the i/o bandwidth demand during
> the checkpoint was *still* being pushed over 100% then spreading out the load
> would only exacerbate the problem by extending the outage.
> 
> To that end it seems like what would be useful is a pair of tests with and
> without the patch with about 10% larger warehouse size (~ 115) which would
> push the i/o bandwidth demand up to about that level.

I still don't see how spreading the writes could make things worse, but 
running more tests is easy. I'll schedule tests with more warehouses 
over the weekend.

> It might even make sense to run a test with an outright overloaded to see if
> the patch doesn't exacerbate the condition. Something with a warehouse size of
> maybe 150. I would expect it to fail the TPCC constraints either way but what
> would be interesting to know is whether it fails by a larger margin with the
> LDC behaviour or a smaller margin.

I'll do that as well, though experiences with tests like that in the 
past have been that it's hard to get repeatable results that way.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Load Distributed Checkpoints test results

From
Gregory Stark
Date:
"Greg Smith" <gsmith@gregsmith.com> writes:

> On Fri, 15 Jun 2007, Gregory Stark wrote:
>
>> If I understand it right Greg Smith's concern is that in a busier system
>> where even *with* the load distributed checkpoint the i/o bandwidth demand
>> during the checkpoint was *still* being pushed over 100% then spreading out
>> the load would only exacerbate the problem by extending the outage.
>
> Thank you for that very concise summary; that's exactly what I've run into.
> DBT2 creates a heavy write load, but it's not testing a real burst behavior
> where something is writing as fast as it's possible to.

Ah, thanks, that's precisely the distinction that I was missing. It's funny,
something that was so counter-intuitive initially has become so ingrained in
my thinking that I didn't even notice I was assuming it any more.

DBT2 has "think times" which it uses to limit the flow of transactions. This
is critical to ensuring that you're forced to increase the scale of the
database if you want to report larger transaction rates which of course is
what everyone wants to brag about.

Essentially this is what makes it an OLTP benchmark. You're measuring how well
you can keep up with a flow of transactions which arrive at a fixed speed
independent of the database.

But what you're concerned about is not OLTP performance at all. It's a kind of
DSS system -- perhaps there's another TLA that's more precise. But the point
is you're concerned with total throughput and not response time. You don't
have a fixed rate imposed by outside circumstances with which you have to keep
up all the time. You just want to have the highest throughput overall.

The good news is that this should be pretty easy to test though. The main
competitor for DBT2 is BenchmarkSQL whose main deficiency is precisely the
lack of support for the think times. We can run BenchmarkSQL runs to see if
the patch impacts performance when it's set to run as fast as possible with no
think times.

While in theory spreading out the writes could have a detrimental effect I
think we should wait until we see actual numbers. I have a pretty strong
suspicion that the effect would be pretty minimal. We're still doing the same
amount of i/o total, just with a slightly less chance for the elevator
algorithm to optimize the pattern.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: Load Distributed Checkpoints test results

From
"Gregory Maxwell"
Date:
On 6/15/07, Gregory Stark <stark@enterprisedb.com> wrote:
> While in theory spreading out the writes could have a detrimental effect I
> think we should wait until we see actual numbers. I have a pretty strong
> suspicion that the effect would be pretty minimal. We're still doing the same
> amount of i/o total, just with a slightly less chance for the elevator
> algorithm to optimize the pattern.

..and the sort patching suggests that the OS's elevator isn't doing a
great job for large flushes in any case. I wouldn't be shocked to see
load distributed checkpoints cause an unconditional improvement since
they may do better at avoiding the huge burst behavior that is
overrunning the OS elevator in any case.


Re: Load Distributed Checkpoints test results

From
PFC
Date:
On Fri, 15 Jun 2007 22:28:34 +0200, Gregory Maxwell <gmaxwell@gmail.com>  
wrote:

> On 6/15/07, Gregory Stark <stark@enterprisedb.com> wrote:
>> While in theory spreading out the writes could have a detrimental  
>> effect I
>> think we should wait until we see actual numbers. I have a pretty strong
>> suspicion that the effect would be pretty minimal. We're still doing  
>> the same
>> amount of i/o total, just with a slightly less chance for the elevator
>> algorithm to optimize the pattern.
>
> ..and the sort patching suggests that the OS's elevator isn't doing a
> great job for large flushes in any case. I wouldn't be shocked to see
> load distributed checkpoints cause an unconditional improvement since
> they may do better at avoiding the huge burst behavior that is
> overrunning the OS elevator in any case.
...also consider that if someone uses RAID5, sorting the writes may  
produce more full-stripe writes, which avoid the read-then-write cycle  
that kills RAID5 performance...


Re: Load Distributed Checkpoints test results

From
Josh Berkus
Date:
All,

Where is the most current version of this patch?  I want to test it on TPCE, 
but there seem to be  4-5 different versions floating around, and the patch 
tracker hasn't been updated.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Load Distributed Checkpoints test results

From
Greg Smith
Date:
On Fri, 15 Jun 2007, Gregory Stark wrote:

> But what you're concerned about is not OLTP performance at all.

It's an OLTP system most of the time that periodically gets unexpectedly 
high volume.  The TPC-E OLTP test suite actually has a MarketFeed 
component in it that has similar properties to what I was fighting 
with.  In a real-world Market Feed, you spec the system to survive a very 
high volume day of trades.  But every now and then there's some event that 
causes volumes to spike way outside of anything you would ever be able to plan 
for, and much data ends up getting lost as a result of systems not being 
able to keep up.  A look at the 1987 "Black Monday" crash is informative 
here: http://en.wikipedia.org/wiki/Black_Monday_(1987)

> But the point is you're concerned with total throughput and not response 
> time. You don't have a fixed rate imposed by outside circumstances with 
> which you have to keep up all the time. You just want to have the 
> highest throughput overall.

Actually, I think I care about response time more than you do.  In a 
typical data logging situation, there is some normal rate at which you 
expect transactions to arrive.  There's usually something memory-based 
upstream that can buffer a small amount of delay, so an occasional short 
checkpoint blip can be tolerated.  But if there's ever a really extended 
one, you actually start losing data when the buffers overflow.

The last project I was working on, any checkpoint that caused a 
transaction to slip for more than 5 seconds would cause a data loss.  One 
of the defenses against that happening is that you have a wicked fast 
transaction rate to clear the buffer out when things are going well, but by 
no means is that rate the important thing--never having the response time 
halt for so long that transactions get lost is.

> The good news is that this should be pretty easy to test though. The 
> main competitor for DBT2 is BenchmarkSQL whose main deficiency is 
> precisely the lack of support for the think times.

Maybe you can get something useful out of that one.  I found that the 
performance impact of the JDBC layer in the middle lowered overall 
throughput and distanced me from what was happening so much that it 
blurred what was going on.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Load Distributed Checkpoints test results

From
Heikki Linnakangas
Date:
Josh Berkus wrote:
> Where is the most current version of this patch?  I want to test it on TPCE, 
> but there seem to be  4-5 different versions floating around, and the patch 
> tracker hasn't been updated.

It would be the ldc-justwrites-2.patch:
http://archives.postgresql.org/pgsql-patches/2007-06/msg00149.php

Thanks in advance for the testing!

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Load Distributed Checkpoints test results

From
"Simon Riggs"
Date:
On Sun, 2007-06-17 at 01:36 -0400, Greg Smith wrote:

> The last project I was working on, any checkpoint that caused a 
> transaction to slip for more than 5 seconds would cause a data loss.  One 
> of the defenses against that happening is that you have a wicked fast 
> transaction rate to clear the buffer out when thing are going well, but by 
> no means is that rate the important thing--never having the response time 
> halt for so long that transactions get lost is.

You would want longer checkpoints in that case.

You're saying you don't want long checkpoints because they cause an
effective outage. The current situation is that checkpoints are so
severe that they cause an effective halt to processing, even though
checkpoints allow processing to continue. Checkpoints don't hold any
locks that prevent normal work from occurring but they did cause an
unthrottled burst of work to occur that raised expected service times
dramatically on an already busy server.

There were a number of effects contributing to the high impact of
checkpointing. Heikki's recent changes reduce the impact of checkpoints
so that they do *not* halt other processing. Longer checkpoints do *not*
mean longer halts in processing; they actually reduce the halt in
processing. Smoother checkpoints mean smaller resource queues when a
burst coincides with a checkpoint, so anybody with throughput-maximised
or bursty apps should want longer, smooth checkpoints.

You're right to ask for a minimum write rate, since this allows very
small checkpoints to complete in reduced times. There's no gain from
having long checkpoints per se, just the reduction in peak write rate
they typically cause.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Load Distributed Checkpoints test results

From
Greg Smith
Date:
On Mon, 18 Jun 2007, Simon Riggs wrote:

> Smoother checkpoints mean smaller resource queues when a burst coincides 
> with a checkpoint, so anybody with throughput-maximised or bursty apps 
> should want longer, smooth checkpoints.

True as long as two conditions hold:

1) Buffers needed to fill allocation requests are still being written fast 
enough.  The buffer allocation code starts burning a lot of CPU+lock 
resources when many clients are all searching the pool looking for a 
buffer and there aren't many clean ones to be found.  The way the current 
checkpoint code starts at the LRU point and writes everything dirty in the 
order new buffers will be allocated in, as fast as possible, means it's 
doing the optimal procedure to keep this from happening.  It's being 
presumed that making the LRU writer active will mitigate this issue; my 
experience suggests that may not be as effective as hoped--unless it gets 
changed so that it's allowed to decrement usage_count (a rough sketch of 
the sweep follows below).

To pick one example of a direction I'm a little concerned about related to 
this, Itagaki's sorted writes results look very interesting.  But as his 
test system is such that the actual pgbench TPS numbers are 1/10 of the 
ones I was seeing when I started having ugly buffer allocation issues, I'm 
real sure the particular test he's running isn't sensitive to issues in 
this area at all; there's just not enough buffer cache churn if you're 
only doing a couple of hundred TPS for this to happen.

2) The checkpoint still finishes in time.

The thing you can't forget about when dealing with an overloaded system is 
that there's no such thing as lowering the load of the checkpoint such 
that it doesn't have a bad impact.  Assume new transactions are being 
generated by an upstream source such that the database itself is the 
bottleneck, and you're always filling 100% of I/O capacity.  All I'm 
trying to get everyone to consider is that if you have a large pool of 
dirty buffers to deal with in this situation, it's possible (albeit 
difficult) to get into a situation where if the checkpoint doesn't write 
out the dirty buffers fast enough, the client backends will evacuate them 
instead in a way that makes the whole process less efficient than the 
current behavior.
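
For reference, here's a simplified sketch of the clock-sweep allocation I mean
in point 1 (structure and names are stand-ins, not bufmgr.c code).  The point
is that only the sweep itself counts usage_count down, so a writer that cleans
buffers without decrementing it leaves the sweep with just as much
counting-down to do:

#include <stdio.h>

#define NBUFFERS 16

typedef struct
{
    int     usage_count;   /* bumped on access, counted down by the sweep */
    int     pinned;        /* currently in use by some backend */
} ToyBufferDesc;

static ToyBufferDesc pool[NBUFFERS];
static int sweep_hand = 0;

static int
clock_sweep_allocate(void)
{
    for (;;)
    {
        ToyBufferDesc *buf = &pool[sweep_hand];

        sweep_hand = (sweep_hand + 1) % NBUFFERS;

        if (buf->pinned)
            continue;               /* in use, skip it */

        if (buf->usage_count > 0)
        {
            buf->usage_count--;     /* recently used: give it another lap */
            continue;
        }

        /* usage_count == 0: this one can be evicted and reused */
        return (int) (buf - pool);
    }
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        pool[i].usage_count = i % 3;    /* made-up recent-use counts */
    pool[0].pinned = 1;                 /* pretend buffer 0 is in use */

    printf("victim buffer: %d\n", clock_sweep_allocate());
    return 0;
}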

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Load Distributed Checkpoints test results

From
Heikki Linnakangas
Date:
I've uploaded the latest test results to the results page at 
http://community.enterprisedb.com/ldc/

The test results on the index page are not in a completely logical 
order, sorry about that.

I ran a series of tests with 115 warehouses, and no surprises there. LDC 
smooths the checkpoints nicely.

Another series with 150 warehouses is more interesting. At that # of 
warehouses, the data disks are 100% busy according to iostat. The 90th 
percentile response times are somewhat higher with LDC, though the 
variability in both the baseline and LDC test runs seems to be pretty 
high. Looking at the response time graphs, even with LDC there's clear 
checkpoint spikes there, but they're much less severe than without.

Another series was with 90 warehouses, but without think times, driving 
the system to full load. LDC seems to smooth the checkpoints very nicely 
in these tests.

Heikki Linnakangas wrote:
> Gregory Stark wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>> Now that the checkpoints are spread out more, the response times are 
>>> very
>>> smooth.
>>
>> So obviously the reason the results are so dramatic is that the 
>> checkpoints
>> used to push the i/o bandwidth demand up over 100%. By spreading it 
>> out you
>> can see in the io charts that even during the checkpoint the i/o busy 
>> rate
>> stays just under 100% except for a few data points.
>>
>> If I understand it right Greg Smith's concern is that in a busier
>> system where even *with* the load distributed checkpoint the i/o
>> bandwidth demand during the checkpoint was *still* being pushed over
>> 100% then spreading out the load would only exacerbate the problem by
>> extending the outage.
>>
>> To that end it seems like what would be useful is a pair of tests with 
>> and
>> without the patch with about 10% larger warehouse size (~ 115) which 
>> would
>> push the i/o bandwidth demand up to about that level.
> 
> I still don't see how spreading the writes could make things worse, but 
> running more tests is easy. I'll schedule tests with more warehouses 
> over the weekend.
> 
>> It might even make sense to run a test with an outright overloaded
>> system to see if the patch doesn't exacerbate the condition. Something
>> with a warehouse size of maybe 150. I would expect it to fail the TPC-C
>> constraints either way but what would be interesting to know is whether
>> it fails by a larger margin with the LDC behaviour or a smaller margin.
> 
> I'll do that as well, though experiences with tests like that in the 
> past have been that it's hard to get repeatable results that way.



--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Load Distributed Checkpoints test results

From
Greg Smith
Date:
On Wed, 20 Jun 2007, Heikki Linnakangas wrote:

> Another series with 150 warehouses is more interesting. At that # of 
> warehouses, the data disks are 100% busy according to iostat. The 90th 
> percentile response times are somewhat higher with LDC, though the 
> variability in both the baseline and LDC test runs seems to be pretty high.

Great, this is exactly the behavior I had observed and wanted someone 
else to independently run into.  When you're in 100% disk busy land, LDC 
can shift the distribution of bad transactions around in a way that some 
people may not be happy with, and that might represent a step backward 
from the current code for them.  I hope you can understand now why I've 
been so vocal that it must be possible to pull this new behavior out so 
that the current form of checkpointing is still available.

While it shows up in the 90th percentile figure, what happens is most 
obvious in the response time distribution graphs.  Someone who is 
currently getting a run like #295: http://community.enterprisedb.com/ldc/295/rt.html

Might be really unhappy if they turn on LDC expecting to smooth out 
checkpoints and get the shift of #296 instead: 
http://community.enterprisedb.com/ldc/296/rt.html

That is of course cherry-picking the most extreme examples.  But it 
illustrates my concern about the possibility of LDC making things worse 
on a really overloaded system, which is kind of counter-intuitive because 
you might expect that to be the best case for its improvements.

When I summarize the percentile behavior from your results with 150 
warehouses in a table like this:

Test   LDC write %   90th percentile (s)
295    None          3.703
297    None          4.432
292    10            3.432
298    20            5.925
296    30            5.992
294    40            4.132

I think it does a better job of showing how LDC can shift the top 
percentile around under heavy load, even though there are runs where it's 
a clear improvement.  Since there is so much variability in results when 
you get into this territory, you really need to run a lot of these tests 
to get a feel for the spread of behavior.  I spent about a week of 
continuously running tests stalking this bugger before I felt I'd mapped 
out the boundaries with my app.  You've got your own priorities, but I'd 
suggest you try to find enough time for a more exhaustive look at this 
area before nailing down the final form for the patch.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Load Distributed Checkpoints test results

From
Bruce Momjian
Date:
Greg Smith wrote:
> I think it does a better job of showing how LDC can shift the top 
> percentile around under heavy load, even though there are runs where it's 
> a clear improvement.  Since there is so much variability in results when 
> you get into this territory, you really need to run a lot of these tests 
> to get a feel for the spread of behavior.  I spent about a week of 
> continuously running tests stalking this bugger before I felt I'd mapped 
> out the boundaries with my app.  You've got your own priorities, but I'd 
> suggest you try to find enough time for a more exhaustive look at this 
> area before nailing down the final form for the patch.

OK, I have hit my limit on people asking for more testing.  I am not
against testing, but I don't want to get into a situation where we just
keep asking for more tests and not move forward.  I am going to rely on
the patch submitters to suggest when enough testing has been done and
move on.

I don't expect this patch to be perfect when it is applied.  I do expect
it to be a best effort, and it will get continual real-world testing during
beta and we can continue to improve this.  Right now, we know we have a
serious issue with checkpoint I/O, and this patch is going to improve
that in most cases.  I don't want to see us reject it or greatly delay
beta as we try to make it perfect.

My main point is that we should keep trying to make the patch better, but
the patch doesn't have to be perfect to get applied.  I don't want us to
get into a death-by-testing spiral.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Load Distributed Checkpoints test results

From
Heikki Linnakangas
Date:
Greg Smith wrote:
> While it shows up in the 90th percentile figure, what happens is most 
> obvious in the response time distribution graphs.  Someone who is 
> currently getting a run like #295: http://community.enterprisedb.com/ldc/295/rt.html
> 
> Might be really unhappy if they turn on LDC expecting to smooth out 
> checkpoints and get the shift of #296 instead: 
> http://community.enterprisedb.com/ldc/296/rt.html

You mean the shift and "flattening" of the graph to the right in the 
delivery response time distribution graph? Looking at the other runs, 
that graph looks sufficiently different between the two baseline runs 
and the patched runs that I really wouldn't draw any conclusion from that.

In any case you *can* disable LDC if you want to.

> That is of course cherry-picking the most extreme examples.  But it 
> illustrates my concern about the possibility of LDC making things worse 
> on a really overloaded system, which is kind of counter-intuitive 
> because you might expect that to be the best case for its improvements.

Well, it is indeed cherry-picking, so I still don't see how LDC could 
make things worse on a really overloaded system. I grant you there might 
indeed be such a case, but I'd like to understand the underlying 
mechanism, or at least see an example.

> Since there is so much variability in results 
> when you get into this territory, you really need to run a lot of these 
> tests to get a feel for the spread of behavior.

I think that's the real lesson from this. In any case, at least LDC 
doesn't seem to hurt much in any of the configurations tested so far, 
and it smooths the checkpoints a lot in most of them.

>  I spent about a week of 
> continuously running tests stalking this bugger before I felt I'd mapped 
> out the boundaries with my app.  You've got your own priorities, but I'd 
> suggest you try to find enough time for a more exhaustive look at this 
> area before nailing down the final form for the patch.

I don't have any good simple ideas on how to make it better in the 8.3 
timeframe, so I don't think there's much to learn from repeating these 
tests.

That said, running tests is easy and doesn't take much effort. If you 
have suggestions for configurations or workloads to test, I'll be happy 
to do that.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Load Distributed Checkpoints test results

From
"Joshua D. Drake"
Date:
Bruce Momjian wrote:
> Greg Smith wrote:

> I don't expect this patch to be perfect when it is applied.  I do expect
> it to be a best effort, and it will get continual real-world testing during
> beta and we can continue to improve this.  Right now, we know we have a
> serious issue with checkpoint I/O, and this patch is going to improve
> that in most cases.  I don't want to see us reject it or greatly delay
> beta as we try to make it perfect.
> 
> My main point is that we should keep trying to make the patch better, but
> the patch doesn't have to be perfect to get applied.  I don't want us to
> get into a death-by-testing spiral.

Death by testing? The only comment I have is that it could be useful to 
be able to turn this feature off via a GUC. Other than that, I think it 
is great.

Joshua D. Drake



-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: Load Distributed Checkpoints test results

From
Heikki Linnakangas
Date:
Joshua D. Drake wrote:
> The only comment I have is that it could be useful to 
> be able to turn this feature off via a GUC. Other than that, I think it 
> is great.

Yeah, you can do that.
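
For example, something along these lines in postgresql.conf (take this as 
a sketch rather than gospel: the exact GUC name and semantics may still 
change before the final version of the patch):

# sketch only: assumes a write percent of zero falls back to the old
# write-everything-at-once checkpoint behaviour
checkpoint_write_percent = 0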

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Load Distributed Checkpoints test results

From
Greg Smith
Date:
On Wed, 20 Jun 2007, Heikki Linnakangas wrote:

> You mean the shift and "flattening" of the graph to the right in the delivery 
> response time distribution graph?

Right, that's what ends up happening during the problematic cases.  To 
pick numbers out of the air, instead of 1% of the transactions getting 
nailed really hard, by spreading things out you might have 5% of them get 
slowed considerably but not awfully.  For some applications, that might be 
considered a step backwards.

> I'd like to understand the underlying mechanism

I had to capture regular snapshots of the buffer cache internals via 
pg_buffercache to figure out where the breakdown was in my case.

> I don't have any good simple ideas on how to make it better in the 8.3 
> timeframe, so I don't think there's much to learn from repeating these tests.

Right now, it's not clear which of the runs represent normal behavior and 
which might be anomalies.  That's the thing you might learn if you had 10 
at each configuration instead of just 1.  The goal for the 8.3 timeframe 
in my mind would be to perhaps have enough data to give better guidelines 
for defaults and a range of useful settings in the documentation.

The only other configuration I'd be curious to see is pushing the number 
of warehouses even higher, to see whether the 90th percentile numbers 
spread further from the current behavior.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Load Distributed Checkpoints test results

From
Greg Smith
Date:
On Wed, 20 Jun 2007, Bruce Momjian wrote:

> I don't expect this patch to be perfect when it is applied.  I do expect
> it to be a best effort, and it will get continual real-world testing during
> beta and we can continue to improve this.

This is completely fair.  Consider my suggestions something that people 
might want to look out for during beta rather than a task Heikki should 
worry about before applying the patch.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD