bgwriter changes - Mailing list pgsql-hackers

From Neil Conway
Subject bgwriter changes
Date
Msg-id 41BEEB0E.4070003@samurai.com
Responses Re: bgwriter changes
List pgsql-hackers
In recent discussion[1] with Simon Riggs, there has been some talk of
making some changes to the bgwriter. To summarize the problem, the
bgwriter currently scans the entire T1+T2 buffer lists and returns a
list of all the currently dirty buffers. It then selects a subset of
that list (computed using bgwriter_percent and bgwriter_maxpages) to
flush to disk. Not only does this mean we can end up scanning a
significant portion of shared_buffers for every invocation of the
bgwriter, we also do the scan while holding the BufMgrLock, likely
hurting scalability.
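
For reference, a rough standalone sketch of how the per-round write
limit falls out of the two settings. It mirrors the arithmetic in the
current BufferSync(), but round_limit() is a made-up name rather than a
backend function, and num_dirty stands in for the length of the list
that StrategyDirtyBufferList() returns after the full scan.

```c
#include <assert.h>

/*
 * Sketch only: how many buffers one round may write, given the number
 * of currently dirty buffers and the two limits.  round_limit() is a
 * hypothetical helper, not part of bufmgr.c.  A limit of -1 means
 * "no limit" (the checkpoint case).
 */
int
round_limit(int num_dirty, int percent, int maxpages)
{
    /* If either limit is zero, background writing is disabled */
    if (percent == 0 || maxpages == 0)
        return 0;

    if (percent > 0)
    {
        assert(percent <= 100);
        /* take percent% of the dirty buffers, rounding up */
        num_dirty = (num_dirty * percent + 99) / 100;
    }

    /* then cap the result at maxpages */
    if (maxpages > 0 && num_dirty > maxpages)
        num_dirty = maxpages;

    return num_dirty;
}
```

With the current defaults (bgwriter_percent = 1, bgwriter_maxpages =
100) and 5000 dirty buffers, round_limit(5000, 1, 100) is 50, so the
whole list is built in order to write 50 buffers.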

I think a fix for this in some fashion is warranted for 8.0. Possible
solutions:

(1) Special-case bgwriter_percent=100. The only reason we need to return
a list of all the dirty buffers is so that we can choose n% of them to
satisfy bgwriter_percent. That is obviously unnecessary if we have
bgwriter_percent=100. I think this change won't help most users,
*unless* we also make bgwriter_percent=100 the default.

(2) Remove bgwriter_percent. I have yet to hear anyone argue that
there's an actual need for bgwriter_percent in tuning bgwriter behavior,
and one less GUC var is a good thing, all else being equal. This is
effectively the same as #1 with the default changed, only with less
flexibility.

(3) Change the meaning of bgwriter_percent, per Simon's proposal. Make
it mean "the percentage of the buffer pool to scan, at most, to look for
dirty buffers". I don't think this is workable, at least not at this
point in the release cycle, because it means we might not smooth out
checkpoint load, one of the primary goals of the bgwriter (in this
proposal bgwriter would only ever consider writing out a small subset of
the total shared buffer cache: the least-recently-used n%, with 2% being
a suggested default). Some variant of this might be worth exploring for
8.1 though.
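
To make the concern with (3) concrete, here is a toy model of a scan
limited to the LRU-most n% of the pool. Everything in it is an
assumption made for illustration: dirty[] stands in for the buffer list
in LRU-to-MRU order, the function name is invented, and rounding the
scan limit up is a guess rather than part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of option 3: examine only the least-recently-used percent%
 * of the pool and collect the dirty entries found there.  dirty[] is a
 * stand-in for the T1/T2 lists in LRU-to-MRU order; none of these names
 * exist in the backend.  Returns the number of entries stored in out[].
 */
int
collect_dirty_lru_percent(const bool *dirty, int nbuffers, int percent,
                          int *out, int maxpages)
{
    int         scan_limit;
    int         found = 0;
    int         i;

    assert(percent >= 0 && percent <= 100);

    /* number of entries to examine; rounding up is an assumption */
    scan_limit = (nbuffers * percent + 99) / 100;

    for (i = 0; i < scan_limit && found < maxpages; i++)
    {
        if (dirty[i])
            out[found++] = i;
    }

    return found;
}
```

With a 10-buffer pool and percent = 20, only the two LRU-most entries
are examined, so a dirty buffer at position 5 is never written by the
bgwriter and is left for the checkpoint.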

A patch (implementing #2) is attached -- any benchmark results would be
helpful. Increasing shared_buffers (to 10,000 or more) should make the
problem noticeable.
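
The effect of the patch on the scan can be modeled in a few lines. This
is a simplified sketch, not the code in freelist.c: dirty[] again
stands in for the buffer list in LRU-to-MRU order, and
dirty_list_bounded() and its "examined" counter are invented for the
example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the bounded scan: walk an LRU-to-MRU list and stop as
 * soon as max_buffers dirty entries have been collected.  Returns the
 * number collected and reports in *examined how many entries were
 * looked at.  Hypothetical names throughout.
 */
int
dirty_list_bounded(const bool *dirty, int nbuffers, int max_buffers,
                   int *out, int *examined)
{
    int         found = 0;
    int         i;

    assert(max_buffers >= 0);

    for (i = 0; i < nbuffers && found < max_buffers; i++)
    {
        if (dirty[i])
            out[found++] = i;
    }

    *examined = i;
    return found;
}
```

When the LRU end of the list is mostly dirty, the scan stops after
roughly max_buffers entries. When fewer than max_buffers buffers are
dirty, it still walks the whole list.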

Opinions on which route is the best, or on some alternative solution? My
inclination is toward #2, but I'm not dead-set on it.

-Neil

[1] http://archives.postgresql.org/pgsql-hackers/2004-12/msg00386.php
Index: doc/src/sgml/runtime.sgml
===================================================================
RCS file: /var/lib/cvs/pgsql/doc/src/sgml/runtime.sgml,v
retrieving revision 1.296
diff -c -r1.296 runtime.sgml
*** doc/src/sgml/runtime.sgml    13 Dec 2004 18:05:09 -0000    1.296
--- doc/src/sgml/runtime.sgml    14 Dec 2004 04:52:26 -0000
***************
*** 1350,1382 ****
          <para>
           Specifies the delay between activity rounds for the
           background writer.  In each round the writer issues writes
!          for some number of dirty buffers (controllable by the
!          following parameters).  The selected buffers will always be
!          the least recently used ones among the currently dirty
!          buffers.  It then sleeps for <varname>bgwriter_delay</>
!          milliseconds, and repeats.  The default value is 200. Note
!          that on many systems, the effective resolution of sleep
!          delays is 10 milliseconds; setting <varname>bgwriter_delay</>
!          to a value that is not a multiple of 10 may have the same
!          results as setting it to the next higher multiple of 10.
!          This option can only be set at server start or in the
!          <filename>postgresql.conf</filename> file.
!         </para>
!        </listitem>
!       </varlistentry>
!
!       <varlistentry id="guc-bgwriter-percent" xreflabel="bgwriter_percent">
!        <term><varname>bgwriter_percent</varname> (<type>integer</type>)</term>
!        <indexterm>
!         <primary><varname>bgwriter_percent</> configuration parameter</primary>
!        </indexterm>
!        <listitem>
!         <para>
!          In each round, no more than this percentage of the currently
!          dirty buffers will be written (rounding up any fraction to
!          the next whole number of buffers).  The default value is
!          1. This option can only be set at server start or in the
!          <filename>postgresql.conf</filename> file.
          </para>
         </listitem>
        </varlistentry>
--- 1350,1367 ----
          <para>
           Specifies the delay between activity rounds for the
           background writer.  In each round the writer issues writes
!          for some number of dirty buffers (controllable by
!          <varname>bgwriter_maxpages</varname>).  The selected buffers
!          will always be the least recently used ones among the
!          currently dirty buffers.  It then sleeps for
!          <varname>bgwriter_delay</> milliseconds, and repeats.  The
!          default value is 200. Note that on many systems, the
!          effective resolution of sleep delays is 10 milliseconds;
!          setting <varname>bgwriter_delay</> to a value that is not a
!          multiple of 10 may have the same results as setting it to the
!          next higher multiple of 10.  This option can only be set at
!          server start or in the <filename>postgresql.conf</filename>
!          file.
          </para>
         </listitem>
        </varlistentry>
***************
*** 1398,1409 ****
       </variablelist>

       <para>
!       Smaller values of <varname>bgwriter_percent</varname> and
!       <varname>bgwriter_maxpages</varname> reduce the extra I/O load
!       caused by the background writer, but leave more work to be done
!       at checkpoint time.  To reduce load spikes at checkpoints,
!       increase the values.  To disable background writing entirely,
!       set <varname>bgwriter_percent</varname> and/or
        <varname>bgwriter_maxpages</varname> to zero.
       </para>
      </sect3>
--- 1383,1396 ----
       </variablelist>

       <para>
!       Decreasing <varname>bgwriter_maxpages</varname> or increasing
!       <varname>bgwriter_delay</varname> will reduce the extra I/O load
!       caused by the background writer, but will leave more work to be
!       done at checkpoint time. To reduce load spikes at checkpoints,
!       increase the number of pages written per round
!       (<varname>bgwriter_maxpages</varname>) or reduce the delay
!       between rounds (<varname>bgwriter_delay</varname>). To disable
!       background writing entirely, set
        <varname>bgwriter_maxpages</varname> to zero.
       </para>
      </sect3>
Index: src/backend/catalog/index.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/catalog/index.c,v
retrieving revision 1.242
diff -c -r1.242 index.c
*** src/backend/catalog/index.c    1 Dec 2004 19:00:39 -0000    1.242
--- src/backend/catalog/index.c    14 Dec 2004 04:32:39 -0000
***************
*** 1062,1068 ****
          /* Send out shared cache inval if necessary */
          if (!IsBootstrapProcessingMode())
              CacheInvalidateHeapTuple(pg_class, tuple);
!         BufferSync(-1, -1);
      }
      else if (dirty)
      {
--- 1062,1068 ----
          /* Send out shared cache inval if necessary */
          if (!IsBootstrapProcessingMode())
              CacheInvalidateHeapTuple(pg_class, tuple);
!         BufferSync(-1);
      }
      else if (dirty)
      {
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.147
diff -c -r1.147 dbcommands.c
*** src/backend/commands/dbcommands.c    18 Nov 2004 01:14:26 -0000    1.147
--- src/backend/commands/dbcommands.c    14 Dec 2004 04:40:19 -0000
***************
*** 332,338 ****
       * up-to-date for the copy.  (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1, -1);

      /*
       * Close virtual file descriptors so the kernel has more available for
--- 332,338 ----
       * up-to-date for the copy.  (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1);

      /*
       * Close virtual file descriptors so the kernel has more available for
***************
*** 1206,1212 ****
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync(-1, -1);

  #ifndef WIN32

--- 1206,1212 ----
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync(-1);

  #ifndef WIN32

Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.11
diff -c -r1.11 bgwriter.c
*** src/backend/postmaster/bgwriter.c    5 Nov 2004 17:11:28 -0000    1.11
--- src/backend/postmaster/bgwriter.c    14 Dec 2004 04:44:26 -0000
***************
*** 116,122 ****
   * GUC parameters
   */
  int            BgWriterDelay = 200;
- int            BgWriterPercent = 1;
  int            BgWriterMaxPages = 100;

  int            CheckPointTimeout = 300;
--- 116,121 ----
***************
*** 372,378 ****
              n = 1;
          }
          else
!             n = BufferSync(BgWriterPercent, BgWriterMaxPages);

          /*
           * Nap for the configured time or sleep for 10 seconds if there
--- 371,377 ----
              n = 1;
          }
          else
!             n = BufferSync(BgWriterMaxPages);

          /*
           * Nap for the configured time or sleep for 10 seconds if there
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.182
diff -c -r1.182 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    24 Nov 2004 02:56:17 -0000    1.182
--- src/backend/storage/buffer/bufmgr.c    14 Dec 2004 04:40:18 -0000
***************
*** 671,717 ****
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * percent/maxpages should be -1 in the former case, and limit values (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int percent, int maxpages)
  {
      BufferDesc **dirty_buffers;
      BufferTag  *buftags;
      int            num_buffer_dirty;
      int            i;

!     /* If either limit is zero then we are disabled from doing anything... */
!     if (percent == 0 || maxpages == 0)
          return 0;

      /*
       * Get a list of all currently dirty buffers and how many there are.
       * We do not flush buffers that get dirtied after we started. They
       * have to wait until the next checkpoint.
       */
!     dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
!     buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));

      LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
      num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
!                                                NBuffers);
!
!     /*
!      * If called by the background writer, we are usually asked to only
!      * write out some portion of dirty buffers now, to prevent the IO
!      * storm at checkpoint time.
!      */
!     if (percent > 0)
!     {
!         Assert(percent <= 100);
!         num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
!     }
!     if (maxpages > 0 && num_buffer_dirty > maxpages)
!         num_buffer_dirty = maxpages;

      /* Make sure we can handle the pin inside the loop */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 671,710 ----
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * maxpages should be -1 in the former case, and a limit value (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int maxpages)
  {
      BufferDesc **dirty_buffers;
      BufferTag  *buftags;
      int            num_buffer_dirty;
      int            i;

!     /* If maxpages is zero then we're effectively disabled */
!     if (maxpages == 0)
          return 0;

+     /* If -1, flush all dirty buffers */
+     if (maxpages == -1)
+         maxpages = NBuffers;
+
      /*
+      * Get a list of up to "maxpages" dirty buffers, starting from LRU and
       * Get a list of all currently dirty buffers and how many there are.
       * We do not flush buffers that get dirtied after we started. They
       * have to wait until the next checkpoint.
       */
!     dirty_buffers = (BufferDesc **) palloc(maxpages * sizeof(BufferDesc *));
!     buftags = (BufferTag *) palloc(maxpages * sizeof(BufferTag));

      LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
      num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
!                                                maxpages);
!     Assert(num_buffer_dirty <= maxpages);

      /* Make sure we can handle the pin inside the loop */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 947,953 ****
  void
  FlushBufferPool(void)
  {
!     BufferSync(-1, -1);
      smgrsync();
  }

--- 940,946 ----
  void
  FlushBufferPool(void)
  {
!     BufferSync(-1);
      smgrsync();
  }

Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.48
diff -c -r1.48 freelist.c
*** src/backend/storage/buffer/freelist.c    16 Sep 2004 16:58:31 -0000    1.48
--- src/backend/storage/buffer/freelist.c    14 Dec 2004 04:22:02 -0000
***************
*** 753,810 ****
      int            num_buffer_dirty = 0;
      int            cdb_id_t1;
      int            cdb_id_t2;
-     int            buf_id;
-     BufferDesc *buf;

      /*
!      * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
!      * dirty buffers found in that order to the list. The ARC strategy
!      * keeps all used buffers including pinned ones in the T1 or T2 list.
!      * So we cannot miss any dirty buffers.
       */
      cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
      cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];

      while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
      {
          if (cdb_id_t1 >= 0)
          {
              buf_id = StrategyCDB[cdb_id_t1].buf_id;
-             buf = &BufferDescriptors[buf_id];
-
-             if (buf->flags & BM_VALID)
-             {
-                 if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
-                 {
-                     buffers[num_buffer_dirty] = buf;
-                     buftags[num_buffer_dirty] = buf->tag;
-                     num_buffer_dirty++;
-                     if (num_buffer_dirty >= max_buffers)
-                         break;
-                 }
-             }
-
              cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
          }
!
!         if (cdb_id_t2 >= 0)
          {
              buf_id = StrategyCDB[cdb_id_t2].buf_id;
!             buf = &BufferDescriptors[buf_id];

!             if (buf->flags & BM_VALID)
              {
!                 if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
!                 {
!                     buffers[num_buffer_dirty] = buf;
!                     buftags[num_buffer_dirty] = buf->tag;
!                     num_buffer_dirty++;
!                     if (num_buffer_dirty >= max_buffers)
!                         break;
!                 }
              }
-
-             cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
          }
      }

--- 753,797 ----
      int            num_buffer_dirty = 0;
      int            cdb_id_t1;
      int            cdb_id_t2;

      /*
!      * Traverse the T1 and T2 list from LRU to MRU in "parallel" and
!      * add all dirty buffers found in that order to the list. The ARC
!      * strategy keeps all used buffers including pinned ones in the T1
!      * or T2 list.  So we cannot miss any dirty buffers.
       */
      cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
      cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];

      while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
      {
+         int            buf_id;
+         BufferDesc *buf;
+
          if (cdb_id_t1 >= 0)
          {
              buf_id = StrategyCDB[cdb_id_t1].buf_id;
              cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
          }
!         else
          {
+             Assert(cdb_id_t2 >= 0);
              buf_id = StrategyCDB[cdb_id_t2].buf_id;
!             cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
!         }
!
!         buf = &BufferDescriptors[buf_id];

!         if (buf->flags & BM_VALID)
!         {
!             if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
              {
!                 buffers[num_buffer_dirty] = buf;
!                 buftags[num_buffer_dirty] = buf->tag;
!                 num_buffer_dirty++;
!                 if (num_buffer_dirty >= max_buffers)
!                     break;
              }
          }
      }

Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.250
diff -c -r1.250 guc.c
*** src/backend/utils/misc/guc.c    24 Nov 2004 19:51:03 -0000    1.250
--- src/backend/utils/misc/guc.c    14 Dec 2004 04:44:40 -0000
***************
*** 1249,1263 ****
      },

      {
-         {"bgwriter_percent", PGC_SIGHUP, RESOURCES,
-             gettext_noop("Background writer percentage of dirty buffers to flush per round"),
-             NULL
-         },
-         &BgWriterPercent,
-         1, 0, 100, NULL, NULL
-     },
-
-     {
          {"bgwriter_maxpages", PGC_SIGHUP, RESOURCES,
              gettext_noop("Background writer maximum number of pages to flush per round"),
              NULL
--- 1249,1254 ----
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.134
diff -c -r1.134 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    5 Nov 2004 19:16:16 -0000    1.134
--- src/backend/utils/misc/postgresql.conf.sample    14 Dec 2004 04:54:47 -0000
***************
*** 96,106 ****
  #vacuum_cost_page_dirty = 20    # 0-10000 credits
  #vacuum_cost_limit = 200    # 0-10000 credits

! # - Background writer -

  #bgwriter_delay = 200        # 10-10000 milliseconds between rounds
! #bgwriter_percent = 1        # 0-100% of dirty buffers in each round
! #bgwriter_maxpages = 100    # 0-1000 buffers max per round


  #---------------------------------------------------------------------------
--- 96,105 ----
  #vacuum_cost_page_dirty = 20    # 0-10000 credits
  #vacuum_cost_limit = 200    # 0-10000 credits

! # - Background Writer -

  #bgwriter_delay = 200        # 10-10000 milliseconds between rounds
! #bgwriter_maxpages = 100    # max buffers written per round, 0 disables


  #---------------------------------------------------------------------------
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.3
diff -c -r1.3 bgwriter.h
*** src/include/postmaster/bgwriter.h    29 Aug 2004 04:13:09 -0000    1.3
--- src/include/postmaster/bgwriter.h    14 Dec 2004 04:44:44 -0000
***************
*** 18,24 ****

  /* GUC options */
  extern int    BgWriterDelay;
- extern int    BgWriterPercent;
  extern int    BgWriterMaxPages;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;
--- 18,23 ----
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.88
diff -c -r1.88 bufmgr.h
*** src/include/storage/bufmgr.h    16 Oct 2004 18:57:26 -0000    1.88
--- src/include/storage/bufmgr.h    14 Dec 2004 04:40:09 -0000
***************
*** 150,156 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern int    BufferSync(int percent, int maxpages);

  extern void InitLocalBuffer(void);

--- 150,156 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern int    BufferSync(int maxpages);

  extern void InitLocalBuffer(void);

