Thread: Change BgWriterCommLock to spinlock

Change BgWriterCommLock to spinlock

From
Qingqing Zhou
Date:
The following patch changes BgWriterCommLock to a spinlock. To conform to
the spinlock coding rules (though not strictly needed, since we are in a
critical section), the requests array is rewritten as a circular one, which
reduces the lock hold time when the bgwriter absorbs the requests. A
byproduct benefit is that we can avoid the critical section in
AbsorbFsyncRequests(), since we only remove requests once they are done.

The concurrency control of the circular array relies on the fact that there
is only a single reader, and I don't see any future need to change that.
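For illustration, here is a standalone sketch of the scheme (this is not the patch code itself: a C11 atomic_flag stands in for slock_t, plain ints for RelFileNode, the tiny queue size and demo_remember() are invented for the demo, standing in for NBuffers and RememberFsyncRequest()):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_REQUESTS 4            /* NBuffers in the real patch; tiny here */

typedef struct
{
    int rnode;                    /* stands in for RelFileNode */
    int segno;
} Request;

static atomic_flag req_lock = ATOMIC_FLAG_INIT;   /* stands in for slock_t */
static int req_oldest = 0;        /* index of the oldest pending request */
static int req_newest = 0;        /* index where the next request goes */
static int num_requests = 0;
static Request requests[MAX_REQUESTS];

static void spin_acquire(void) { while (atomic_flag_test_and_set(&req_lock)) ; }
static void spin_release(void) { atomic_flag_clear(&req_lock); }

/* Writer side, analogous to ForwardFsyncRequest(): any backend may call it */
static bool forward_request(int rnode, int segno)
{
    spin_acquire();
    if (num_requests >= MAX_REQUESTS)
    {
        spin_release();
        return false;             /* queue full: caller must fsync itself */
    }
    requests[req_newest].rnode = rnode;
    requests[req_newest].segno = segno;
    num_requests++;
    if (++req_newest >= MAX_REQUESTS)
        req_newest = 0;           /* handle wraparound */
    spin_release();
    return true;
}

/* Reader side, analogous to AbsorbFsyncRequests(): only the bgwriter calls
 * it.  Reading requests[] outside the lock is safe because writers only
 * append at req_newest and this is the only code that advances req_oldest;
 * entries are dequeued only after remember() has run for all of them. */
static int absorb_requests(void (*remember)(int, int))
{
    int num, oldest, i;

    spin_acquire();
    num = num_requests;
    oldest = req_oldest;
    spin_release();

    for (i = 0; i < num; i++)
    {
        Request *r = &requests[(oldest + i) % MAX_REQUESTS];
        remember(r->rnode, r->segno);
    }

    /* All absorbed; now it is safe to remove them from the queue. */
    spin_acquire();
    num_requests -= num;
    req_oldest = (req_oldest + num) % MAX_REQUESTS;
    spin_release();
    return num;
}

/* Demo consumer standing in for RememberFsyncRequest() */
static int absorbed[16];
static int n_absorbed = 0;
static void demo_remember(int rnode, int segno)
{
    (void) segno;
    absorbed[n_absorbed++] = rnode;
}
```

Note that the real patch walks the two contiguous segments of the circle with separate loops instead of taking a modulo per element; the modulo form above is just shorter.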

Regards,
Qingqing

---

Index: backend/postmaster/bgwriter.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.22
diff -c -r1.22 bgwriter.c
*** backend/postmaster/bgwriter.c    8 Dec 2005 19:19:22 -0000    1.22
--- backend/postmaster/bgwriter.c    8 Jan 2006 03:36:51 -0000
***************
*** 56,61 ****
--- 56,62 ----
  #include "storage/ipc.h"
  #include "storage/pmsignal.h"
  #include "storage/smgr.h"
+ #include "storage/spin.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 94,101 ****
   * reasonable, it should set this field TRUE just before sending the signal.
   *
   * The requests array holds fsync requests sent by backends and not yet
!  * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
!  * fields are protected by BgWriterCommLock.
   *----------
   */
  typedef struct
--- 95,101 ----
   * reasonable, it should set this field TRUE just before sending the signal.
   *
   * The requests array holds fsync requests sent by backends and not yet
!  * absorbed by the bgwriter.
   *----------
   */
  typedef struct
***************
*** 115,126 ****

      sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */

!     int            num_requests;    /* current # of requests */
!     int            max_requests;    /* allocated array size */
      BgWriterRequest requests[1];    /* VARIABLE LENGTH ARRAY */
  } BgWriterShmemStruct;

! static BgWriterShmemStruct *BgWriterShmem;

  /*
   * GUC parameters
--- 115,131 ----

      sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */

!     /* The following implements a circular array */
!     slock_t        req_lock;            /* protects the array */
!     int            req_oldest;            /* start of requests */
!     int            req_newest;            /* end of requests */
!     int            num_requests;        /* current number of requests */
!     int            max_requests;        /* allocated array size */
      BgWriterRequest requests[1];    /* VARIABLE LENGTH ARRAY */
  } BgWriterShmemStruct;

! /* use volatile pointer to prevent code rearrangement */
! static volatile BgWriterShmemStruct *BgWriterShmem;

  /*
   * GUC parameters
***************
*** 528,535 ****
--- 533,542 ----
      if (found)
          return;                    /* already initialized */

+     /* Init the circular array */
      MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
      BgWriterShmem->max_requests = NBuffers;
+     SpinLockInit(&BgWriterShmem->req_lock);
  }

  /*
***************
*** 638,660 ****
  bool
  ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno)
  {
!     BgWriterRequest *request;

      if (!IsUnderPostmaster)
          return false;            /* probably shouldn't even get here */
      Assert(BgWriterShmem != NULL);

!     LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
      if (BgWriterShmem->bgwriter_pid == 0 ||
          BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
      {
!         LWLockRelease(BgWriterCommLock);
          return false;
      }
!     request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
      request->rnode = rnode;
      request->segno = segno;
!     LWLockRelease(BgWriterCommLock);
      return true;
  }

--- 645,673 ----
  bool
  ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno)
  {
!     volatile BgWriterRequest *request;

      if (!IsUnderPostmaster)
          return false;            /* probably shouldn't even get here */
      Assert(BgWriterShmem != NULL);

!     SpinLockAcquire(&BgWriterShmem->req_lock);
      if (BgWriterShmem->bgwriter_pid == 0 ||
          BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
      {
!         SpinLockRelease(&BgWriterShmem->req_lock);
          return false;
      }
!     request = &BgWriterShmem->requests[BgWriterShmem->req_newest++];
!     BgWriterShmem->num_requests++;
      request->rnode = rnode;
      request->segno = segno;
!
!     /* handle wrap around */
!     if (BgWriterShmem->req_newest >= BgWriterShmem->max_requests)
!         BgWriterShmem->req_newest = 0;
!     SpinLockRelease(&BgWriterShmem->req_lock);
!
      return true;
  }

***************
*** 670,712 ****
  void
  AbsorbFsyncRequests(void)
  {
!     BgWriterRequest *requests = NULL;
!     BgWriterRequest *request;
!     int            n;

      if (!am_bg_writer)
          return;

!     /*
!      * We have to PANIC if we fail to absorb all the pending requests (eg,
!      * because our hashtable runs out of memory).  This is because the system
!      * cannot run safely if we are unable to fsync what we have been told to
!      * fsync.  Fortunately, the hashtable is so small that the problem is
!      * quite unlikely to arise in practice.
!      */
!     START_CRIT_SECTION();
!
!     /*
!      * We try to avoid holding the lock for a long time by copying the request
!      * array.
!      */
!     LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
!
!     n = BgWriterShmem->num_requests;
!     if (n > 0)
      {
!         requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
!         memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
      }
!     BgWriterShmem->num_requests = 0;
!
!     LWLockRelease(BgWriterCommLock);
!
!     for (request = requests; n > 0; request++, n--)
!         RememberFsyncRequest(request->rnode, request->segno);

!     if (requests)
!         pfree(requests);

!     END_CRIT_SECTION();
  }
--- 683,738 ----
  void
  AbsorbFsyncRequests(void)
  {
!     int                num;

      if (!am_bg_writer)
          return;

!     /* Take a snapshot now since I am the only reader */
!     SpinLockAcquire(&BgWriterShmem->req_lock);
!     num = BgWriterShmem->num_requests;
!     if (num == 0)
      {
!         Assert(BgWriterShmem->req_oldest == BgWriterShmem->req_newest);
!         SpinLockRelease(&BgWriterShmem->req_lock);
      }
!     else
!     {
!         volatile BgWriterRequest *requests,
!                         *request;
!         int                i,
!                         oldest,
!                         newest;
!
!         oldest = BgWriterShmem->req_oldest;
!         newest = BgWriterShmem->req_newest;
!         SpinLockRelease(&BgWriterShmem->req_lock);
!
!         /* Handle all the requests */
!         requests = BgWriterShmem->requests;
!         if (oldest < newest)
!         {
!             Assert(num == newest - oldest);
!             for (i = num, request = requests + oldest; i > 0; request++, i--)
!                 RememberFsyncRequest(request->rnode, request->segno);
!         }
!         else
!         {
!             int        toend = BgWriterShmem->max_requests - oldest;

!             Assert(num == toend + newest);
!             for (i = toend, request = requests + oldest;  i > 0 ; request++, i--)
!                 RememberFsyncRequest(request->rnode, request->segno);
!             for (i = newest, request = requests;  i > 0 ; request++, i--)
!                 RememberFsyncRequest(request->rnode, request->segno);
!         }

!         /* We've absorbed all the requests, now safe to remove them */
!         SpinLockAcquire(&BgWriterShmem->req_lock);
!         Assert(BgWriterShmem->num_requests >= num);
!         BgWriterShmem->num_requests -= num;
!         BgWriterShmem->req_oldest = (BgWriterShmem->req_oldest + num)
!                                     % BgWriterShmem->max_requests;
!         SpinLockRelease(&BgWriterShmem->req_lock);
!     }
  }
Index: include/storage/lwlock.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/storage/lwlock.h,v
retrieving revision 1.25
diff -c -r1.25 lwlock.h
*** include/storage/lwlock.h    4 Jan 2006 21:06:32 -0000    1.25
--- include/storage/lwlock.h    8 Jan 2006 03:36:52 -0000
***************
*** 44,50 ****
      MultiXactOffsetControlLock,
      MultiXactMemberControlLock,
      RelCacheInitLock,
-     BgWriterCommLock,
      TwoPhaseStateLock,
      FirstLockMgrLock,            /* must be last except for MaxDynamicLWLock */

--- 44,49 ----

Re: Change BgWriterCommLock to spinlock

From
Tom Lane
Date:
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> The following patch changes BgWriterCommLock to a spinlock.

Why is this a good idea?

            regards, tom lane

Re: Change BgWriterCommLock to spinlock

From
Qingqing Zhou
Date:

On Sun, 8 Jan 2006, Tom Lane wrote:
>
> Why is this a good idea?
>

"In the spirit of incremental improvement":
(1) The spinlock itself is lighter weight than the LWLock used here, so we
can reduce the lock contention a little bit in AbsorbFsyncRequests();
(2) We no longer need the critical section in AbsorbFsyncRequests();

Regards,
Qingqing

Re: Change BgWriterCommLock to spinlock

From
Tom Lane
Date:
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> On Sun, 8 Jan 2006, Tom Lane wrote:
>> Why is this a good idea?

> "In the spirit of incremental improvement":
> (1) The spinlock itself is lighter weight than the LWLock used here, so we
> can reduce the lock contention a little bit in AbsorbFsyncRequests();

Spinlock-based coding is inherently much more fragile than LWLock-based
coding.  I'm against changing things in that direction unless a
substantial performance improvement can be gained.  You didn't offer
any evidence of improvement at all.

> (2) We no longer need the critical section in AbsorbFsyncRequests();

Really?  I think this coding still breaks, rather badly, if
RememberFsyncRequest fails.  Certainly the reasons for needing a
critical section have nothing to do with what kind of lock is used.

            regards, tom lane

Re: Change BgWriterCommLock to spinlock

From
Qingqing Zhou
Date:

On Sun, 8 Jan 2006, Tom Lane wrote:
>
> > (1) The spinlock itself is lighter weight than the LWLock used here, so
> > we can reduce the lock contention a little bit in AbsorbFsyncRequests();
>
> Spinlock-based coding is inherently much more fragile than LWLock-based
> coding.  I'm against changing things in that direction unless a
> substantial performance improvement can be gained.  You didn't offer
> any evidence of improvement at all.
>
Yeah, the performance improvement is only theoretical and marginal.
Would you suggest we keep the LWLock but use the circular-array part?

> > (2) We no longer need the critical section in AbsorbFsyncRequests();
>
> Really?  I think this coding still breaks, rather badly, if
> RememberFsyncRequest fails.  Certainly the reasons for needing a
> critical section have nothing to do with what kind of lock is used.
>
Yeah, it is not related to the lock. But I changed the algorithm to a
circular array as well, and noticed there is only one reader, so we can
remove the requests after we are successfully done. In other words, if
there is a problem, we never remove the unhandled requests.

Regards,
Qingqing


Re: Change BgWriterCommLock to spinlock

From
Tom Lane
Date:
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> Yeah, the performance improvement is only theoretical and marginal.
> Would you suggest we keep the LWLock but use the circular-array part?

They're separable issues anyway.

> Yeah, it is not related to the lock. But I changed the algorithm to a
> circular array as well, and noticed there is only one reader, so we can
> remove the requests after we are successfully done. In other words, if
> there is a problem, we never remove the unhandled requests.

I doubt that's more robust than before though.  If we did run out of
memory in the hashtable, this coding would essentially block further
operation of the request queue, because it would keep trying to
re-insert all the entries in the queue, and keep failing.

If you want the bgwriter to keep working in the face of an out-of-memory
condition in the hashtable, I think you'd have to change the coding so
that it takes requests one at a time from the queue.  Then, as
individual requests get removed from the hashtable, individual new
requests could get put in, even if the total queue is too large to fit
all at once.  However this would result in much greater lock-acquisition
overhead, so it doesn't really seem like a win.  We haven't seen any
evidence that the current coding has a problem under field conditions.
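A minimal sketch of that one-at-a-time alternative (hypothetical, not proposed code: demo names, a tiny queue size, a C11 atomic_flag standing in for slock_t, and remember() returning false modeling the hashtable out-of-memory failure):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_REQUESTS 4

typedef struct { int rnode; int segno; } Request;

static atomic_flag req_lock = ATOMIC_FLAG_INIT;
static int req_oldest = 0, req_newest = 0, num_requests = 0;
static Request requests[MAX_REQUESTS];

static void spin_acquire(void) { while (atomic_flag_test_and_set(&req_lock)) ; }
static void spin_release(void) { atomic_flag_clear(&req_lock); }

/* Writer side: same circular enqueue as before */
static bool push(int rnode, int segno)
{
    spin_acquire();
    if (num_requests >= MAX_REQUESTS) { spin_release(); return false; }
    requests[req_newest].rnode = rnode;
    requests[req_newest].segno = segno;
    num_requests++;
    req_newest = (req_newest + 1) % MAX_REQUESTS;
    spin_release();
    return true;
}

/* Dequeue one request per lock cycle.  If remember() fails (standing in
 * for the hashtable running out of memory), the failing request stays at
 * the head of the queue and we stop; nothing is lost, and later retries
 * can make progress as hashtable entries are freed.  The price is one
 * lock acquisition per entry instead of two per batch. */
static bool absorb_one_at_a_time(bool (*remember)(int, int))
{
    for (;;)
    {
        Request r;

        spin_acquire();
        if (num_requests == 0)
        {
            spin_release();
            return true;          /* queue drained */
        }
        r = requests[req_oldest];
        spin_release();

        if (!remember(r.rnode, r.segno))
            return false;         /* leave it queued; retry later */

        spin_acquire();
        req_oldest = (req_oldest + 1) % MAX_REQUESTS;
        num_requests--;
        spin_release();
    }
}

/* Demo consumer: fails on one chosen rnode, mimicking an OOM failure */
static int fail_on = -1;
static int n_ok = 0;
static bool demo_remember(int rnode, int segno)
{
    (void) segno;
    if (rnode == fail_on)
        return false;
    n_ok++;
    return true;
}
```

As the message above notes, even with this shape the bgwriter would still need extra bookkeeping before declaring a checkpoint complete while requests remain queued.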

Another issue to keep in mind is that correct operation requires that
the bgwriter not declare a checkpoint complete until it's completed
every fsync request that was queued before the checkpoint started.
So if the bgwriter is to try to keep going after failing to absorb
all the pending requests, there would have to be some logic addition
to keep track of whether it's OK to complete a checkpoint or not.

            regards, tom lane

Re: Change BgWriterCommLock to spinlock

From
Qingqing Zhou
Date:

On Sun, 8 Jan 2006, Tom Lane wrote:
>
> If you want the bgwriter to keep working in the face of an out-of-memory
> condition in the hashtable, I think you'd have to change the coding so
> that it takes requests one at a time from the queue.
>
The patched version will issue ERROR instead of PANIC in this condition, so
the bgwriter can still keep running. I don't quite understand what you mean
by "want the bgwriter to keep working" -- do you mean not issuing an ERROR
but doing a retry? An ERROR is not avoidable unless we change the
out-of-memory handling logic inside the hashtable.

>
> Another issue to keep in mind is that correct operation requires that
> the bgwriter not declare a checkpoint complete until it's completed
> every fsync request that was queued before the checkpoint started.
> So if the bgwriter is to try to keep going after failing to absorb
> all the pending requests, there would have to be some logic addition
> to keep track of whether it's OK to complete a checkpoint or not.
>
As above, if the bgwriter fails to absorb, it will quit the job (and the
checkpoint will not be finished).

Do you think it makes sense for us to continue working on the patch, or
should we let it be?

Regards,
Qingqing

Re: Change BgWriterCommLock to spinlock

From
Tom Lane
Date:
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> Do you think it makes sense for us to continue working on the patch, or
> should we let it be?

I don't see any problem here that urgently needs solving.  If we ever
see any reports of out-of-memory failures in the bgwriter, then it'll
be time to worry about this, but I think it quite unlikely that we
ever will.  (Even if we do, a simpler answer would be to increase
the initial size of the pending-requests hashtable.)

            regards, tom lane

Re: Change BgWriterCommLock to spinlock

From
Qingqing Zhou
Date:

On Sun, 8 Jan 2006, Tom Lane wrote:
>
> I don't see any problem here that urgently needs solving.  If we ever
> see any reports of out-of-memory failures in the bgwriter, then it'll
> be time to worry about this, but I think it quite unlikely that we
> ever will.  (Even if we do, a simpler answer would be to increase
> the initial size of the pending-requests hashtable.)
>

Agreed.

Regards
Qingqing