Re: Spread checkpoint sync - Mailing list pgsql-hackers

From: Greg Smith
Subject: Re: Spread checkpoint sync
Date:
Msg-id: 4D31AE64.3000202@2ndquadrant.com
In response to: Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Robert Haas wrote:
> Idea #2: At the beginning of a checkpoint when we scan all the
> buffers, count the number of buffers that need to be synced for each
> relation.  Use the same hashtable that we use for tracking pending
> fsync requests.  Then, interleave the writes and the fsyncs...
>
> Idea #3: Stick with the idea of a fixed delay between fsyncs, but
> compute how many fsyncs you think you're ultimately going to need at
> the start of the checkpoint, and back up the target completion time by
> 3 s per fsync from the get-go, so that the checkpoint still finishes
> on schedule.
>

What I've been working on is something halfway between these two ideas.
I have a patch, and it doesn't work right yet because I just broke it,
but since I have some faint hope this will all come together any minute
now I'm going to share it before someone announces a deadline has passed
or something.  (Whistling.)  I'm going to add this messy thing and the
patch you submitted upthread to the CF list; I'll review yours, and I'll
either fix the remaining problem in this one myself or rewrite it to one
of your ideas, and then it's on to a round of benchmarking.

Once upon a time we got a patch from Itagaki Takahiro whose purpose was
to sort writes before sending them out:

http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't work reliably for everyone because of the now
well-understood ext3 issues--I never replicated that speedup at the
time, for example.  And this was before the spread checkpoint code went
into 8.3.  The hope was that it wasn't really going to be necessary
after that anyway.

Back to today...instead of something complicated, it struck me that if I
just had a count of exactly how many files were involved in each
checkpoint, that would be helpful.  I could keep the idea of a fixed
delay between fsyncs, but just auto-tune that delay amount based on the
count.  And how do you count the number of unique things in a list?
Well, you can always sort them.  I thought that if the sorted writes
patch got back to functional again, it could serve two purposes.  It
would group all of the writes for a file together, and if you did the
syncs in the same sorted order they would have the maximum odds of
discovering the data was already written.  So rather than this possible
order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly, we'd get
this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been
written before the associated sync comes along.

Also, I could just traverse the sorted list with some simple logic to
count the number of unique files, and then set the delay between fsync
writes based on it.  In the above, once the list is sorted, it's easy to
see how many times the table name changes during a linear scan of the
sorted data.  3 files, so if the checkpoint target gives me, say, a
minute of time to sync them, I can delay 20 seconds between each.
Simple math, and exactly the approach I used to get reasonable behavior
on the busy production system this all started on.  There's some
unresolved trickiness in the segment-driven checkpoint case, but one
thing at a time.

So I fixed the bitrot on the old sorted patch, which was fun as it came
from before the 8.3 changes.  It seemed to work.  I then moved the
structure it uses to hold the list of buffers to write, the thing that's
sorted, into shared memory.  It's got a predictable maximum size;
relying on palloc in the middle of the checkpoint code seems bad; and
there's some potential gain from not reallocating it every time through.

Somewhere along the way, it started doing this instead of what I wanted:

 BadArgument("!(((header->context) != ((void *)0) &&
(((((Node*)((header->context)))->type) == T_AllocSetContext))))", File:
"mcxt.c", Line: 589)

(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used
up my window of time wandering down.

There's good bits in the patch I submitted for the last CF and in the
patch you wrote earlier this week.  This unfinished patch may be a
valuable idea to fit in there too once I fix it, or maybe it's
fundamentally flawed and one of the other ideas you suggested (or I have
sitting on the potential design list) will work better.  There's a patch
integration problem that needs to be solved here, but I think almost all
the individual pieces are available.  I'd hate to see this fail to get
integrated now just for lack of time, considering the problem is so
serious when you run into it.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..c8c0f67 100644
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
***************
*** 20,25 ****
--- 20,26 ----

  BufferDesc *BufferDescriptors;
  char       *BufferBlocks;
+ BufAndTag  *BufferTags;
  int32       *PrivateRefCount;


*************** int32       *PrivateRefCount;
*** 72,79 ****
  void
  InitBufferPool(void)
  {
!     bool        foundBufs,
!                 foundDescs;

      BufferDescriptors = (BufferDesc *)
          ShmemInitStruct("Buffer Descriptors",
--- 73,81 ----
  void
  InitBufferPool(void)
  {
!     bool        foundBufs;
!     bool        foundDescs;
!     bool        foundTags;

      BufferDescriptors = (BufferDesc *)
          ShmemInitStruct("Buffer Descriptors",
*************** InitBufferPool(void)
*** 83,92 ****
          ShmemInitStruct("Buffer Blocks",
                          NBuffers * (Size) BLCKSZ, &foundBufs);

!     if (foundDescs || foundBufs)
      {
!         /* both should be present or neither */
!         Assert(foundDescs && foundBufs);
          /* note: this path is only taken in EXEC_BACKEND case */
      }
      else
--- 85,98 ----
          ShmemInitStruct("Buffer Blocks",
                          NBuffers * (Size) BLCKSZ, &foundBufs);

!     BufferTags = (BufAndTag *)
!         ShmemInitStruct("Dirty Buffer Tags",
!                         NBuffers * sizeof(BufAndTag), &foundTags);
!
!     if (foundDescs || foundBufs || foundTags)
      {
!         /* all should be present or none */
!         Assert(foundDescs && foundBufs && foundTags);
          /* note: this path is only taken in EXEC_BACKEND case */
      }
      else
*************** BufferShmemSize(void)
*** 171,176 ****
--- 177,185 ----
      /* size of data pages */
      size = add_size(size, mul_size(NBuffers, BLCKSZ));

+     /* size of checkpoint buffer tags */
+     size = add_size(size, mul_size(NBuffers, sizeof(BufAndTag)));
+
      /* size of stuff controlled by freelist.c */
      size = add_size(size, StrategyShmemSize());

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..bd779bf 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 1158,1163 ****
--- 1158,1181 ----
      }
  }

+ static int
+ bufcmp(const void *a, const void *b)
+ {
+     const BufAndTag *lhs = (const BufAndTag *) a;
+     const BufAndTag *rhs = (const BufAndTag *) b;
+     int        r;
+
+     r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+     if (r != 0)
+         return r;
+     if (lhs->tag.blockNum < rhs->tag.blockNum)
+         return -1;
+     else if (lhs->tag.blockNum > rhs->tag.blockNum)
+         return 1;
+     else
+         return 0;
+ }
+
  /*
   * BufferSync -- Write out all dirty buffers in the pool.
   *
*************** static void
*** 1171,1180 ****
  BufferSync(int flags)
  {
      int            buf_id;
-     int            num_to_scan;
      int            num_to_write;
      int            num_written;
      int            mask = BM_DIRTY;

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1189,1202 ----
  BufferSync(int flags)
  {
      int            buf_id;
      int            num_to_write;
      int            num_written;
      int            mask = BM_DIRTY;
+     int            dirty_buf;
+     int            dirty_files;
+     Oid            last_seen_rel;
+     ForkNumber  last_seen_fork;
+     BlockNumber last_seen_block;

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(int flags)
*** 1216,1221 ****
--- 1238,1245 ----
          if ((bufHdr->flags & mask) == mask)
          {
              bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+             BufferTags[num_to_write].buf_id = buf_id;
+             BufferTags[num_to_write].tag = bufHdr->tag;
              num_to_write++;
          }

*************** BufferSync(int flags)
*** 1225,1246 ****
      if (num_to_write == 0)
          return;                    /* nothing to do */

      TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);

      /*
       * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
!      * since we might as well dump soon-to-be-recycled buffers first.
       *
       * Note that we don't read the buffer alloc count here --- that should be
       * left untouched till the next BgBufferSync() call.
!      */
!     buf_id = StrategySyncStart(NULL, NULL);
!     num_to_scan = NBuffers;
      num_written = 0;
!     while (num_to_scan-- > 0)
!     {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

          /*
           * We don't need to acquire the lock here, because we're only looking
--- 1249,1307 ----
      if (num_to_write == 0)
          return;                    /* nothing to do */

+     /*
+      * Sort the list of buffers to write.  It's then straightforward to
+      * count the approximate number of files involved.  There may be
+      * some small error from buffers that turn out to be skipped below,
+      * but for the purposes this file count serves, that's acceptable.
+      */
+     qsort(BufferTags, num_to_write, sizeof(*BufferTags), bufcmp);
+
+     /*
+      * Count the number of unique node/fork combinations, relying on the
+      * sorted order
+      */
+
+     /* Initialize with the first entry in the dirty buffer list */
+     last_seen_rel = BufferTags[0].tag.rnode.relNode;
+     last_seen_fork = BufferTags[0].tag.forkNum;
+     last_seen_block = BufferTags[0].tag.blockNum;
+     dirty_files = 1;
+
+     for (dirty_buf = 1; dirty_buf < num_to_write; dirty_buf++)
+     {
+         if ((last_seen_rel != BufferTags[dirty_buf].tag.rnode.relNode) ||
+             (last_seen_fork != BufferTags[dirty_buf].tag.forkNum))
+         {
+             last_seen_rel = BufferTags[dirty_buf].tag.rnode.relNode;
+             last_seen_fork = BufferTags[dirty_buf].tag.forkNum;
+             dirty_files++;
+         }
+     }
+
+     /*
+      * TODO:  This doesn't account for the fact that blocks might span multiple
+      * files within the same relation yet.
+      */
+
+     elog(DEBUG1, "BufferSync found %d buffers to write involving %d files",
+          num_to_write, dirty_files);
+
      TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);

      /*
       * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.
       *
       * Note that we don't read the buffer alloc count here --- that should be
       * left untouched till the next BgBufferSync() call.
!      */
      num_written = 0;
!     for (dirty_buf = 0; dirty_buf < num_to_write; dirty_buf++)
!     {
!         volatile BufferDesc *bufHdr;
!         buf_id = BufferTags[dirty_buf].buf_id;
!         bufHdr = &BufferDescriptors[buf_id];

          /*
           * We don't need to acquire the lock here, because we're only looking
*************** BufferSync(int flags)
*** 1263,1282 ****
                  num_written++;

                  /*
-                  * We know there are at most num_to_write buffers with
-                  * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
-                  * num_written reaches num_to_write.
-                  *
-                  * Note that num_written doesn't include buffers written by
-                  * other backends, or by the bgwriter cleaning scan. That
-                  * means that the estimate of how much progress we've made is
-                  * conservative, and also that this test will often fail to
-                  * trigger.  But it seems worth making anyway.
-                  */
-                 if (num_written >= num_to_write)
-                     break;
-
-                 /*
                   * Perform normal bgwriter duties and sleep to throttle our
                   * I/O rate.
                   */
--- 1324,1329 ----
*************** BufferSync(int flags)
*** 1284,1292 ****
                                       (double) num_written / num_to_write);
              }
          }
-
-         if (++buf_id >= NBuffers)
-             buf_id = 0;
      }

      /*
--- 1331,1336 ----
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0652bdf..1c9c910 100644
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
*************** typedef struct sbufdesc
*** 167,175 ****
--- 167,185 ----
  #define LockBufHdr(bufHdr)        SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  #define UnlockBufHdr(bufHdr)    SpinLockRelease(&(bufHdr)->buf_hdr_lock)

+ /*
+  * Checkpoint time mapping between the buffer id values and the associated
+  * buffer tags of dirty buffers to write
+  */
+ typedef struct BufAndTag
+ {
+     int            buf_id;
+     BufferTag    tag;
+ } BufAndTag;

  /* in buf_init.c */
  extern PGDLLIMPORT BufferDesc *BufferDescriptors;
+ extern PGDLLIMPORT BufAndTag *BufferTags;

  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
