Shared locking in slru.c

From Tom Lane
Subject Shared locking in slru.c
Date
Msg-id 29931.1133376793@sss.pgh.pa.us
Responses Re: Shared locking in slru.c  (Kenneth Marshall <ktm@it.is.rice.edu>)
Re: Shared locking in slru.c  (Manfred Koizar <mkoi-pg@aon.at>)
Re: Shared locking in slru.c  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I've been fooling around with a test case Rob Creager sent me, which is
able to drive PG into a context-switch storm caused by contention for
the SubtransControlLock.  Rob asked for the test case not to be posted
publicly (it's part of some proprietary code), but I found that you can
cause some of the same behavior just by running pgbench while holding
an open transaction in a psql session.  pgbench isn't able to saturate
the machine but I think that's due to unrelated pgbench limitations.

The basic cause of the problem is that by holding an open transaction
while many other transactions come and go, the window of transactions
for which pg_subtrans has to be consulted gets wider and wider.  If the
system is doing a lot of updates, the fraction of tuples for which
HeapTupleSatisfiesSnapshot has to invoke SubTransGetTopmostTransaction
gets larger too, and so the extent of contention for SubtransControlLock
increases, even though no subtransactions are actually in use.
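
To make that concrete, here is a minimal, simplified sketch of the
decision the visibility check effectively makes (this is not the actual
tqual.c code, and the helper name is invented for illustration): an xid
older than the snapshot's xmin needs no pg_subtrans lookup at all, while
an xid inside the still-open window may be a subtransaction and so has
to be mapped through SubTransGetTopmostTransaction.

    /*
     * Simplified illustration only -- not the actual tqual.c code, and
     * the helper name is invented.  Assumes the usual backend headers.
     */
    static bool
    xid_needs_subtrans_lookup(TransactionId xid, Snapshot snapshot)
    {
        /*
         * An xid that precedes the snapshot's xmin was already completed
         * before every live transaction started, so its status can be
         * resolved without consulting pg_subtrans.
         */
        if (TransactionIdPrecedes(xid, snapshot->xmin))
            return false;

        /*
         * Otherwise the xid falls inside the open window and could be a
         * subtransaction, so the visibility check must map it to its
         * top-level parent with SubTransGetTopmostTransaction(), which
         * reads pg_subtrans under SubtransControlLock.  A long-lived
         * open transaction pins snapshot->xmin far in the past, so this
         * branch is taken for more and more tuples.
         */
        return true;
    }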

I've been looking at various ways to resolve this, but one thing that
seems promising is to hack slru.c to take the control lock in shared
mode, not exclusive mode, for read-only accesses to pages that are
already in memory.  The vast majority of accesses to pg_subtrans in the
problem scenario are read-only, and so this reduces the need to block
and the consequent CS storm.

A quick-hack prototype patch for this is attached.  (It changes only
SubTransGetParent, but if applied, TransactionIdGetStatus and so on
would of course get the same treatment.)  The idea is that we take the
lock in shared mode and look to see if the required page is in memory;
if so, just return it; if not, release the lock, reacquire it in
exclusive mode, and proceed with the existing code.

The reason that the existing code always takes the lock in exclusive
mode, even if there's no intention to change data, is that it also needs
to adjust the LRU information to reflect the page access, and the way
that we're doing that requires exclusive access.  So to make the idea
work at all, we need some alternative way of managing recency-of-use
info.
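
For reference, the existing exclusive-mode bookkeeping looks roughly
like this (a from-memory sketch with an invented name, not the exact
slru.c code): touching one page rewrites the counters of every slot,
which is why a shared lock isn't sufficient today.

    /*
     * Rough sketch of the existing recency-of-use update (not verbatim
     * slru.c code): touching one page rewrites the counters of *all*
     * slots, so the caller must hold the control lock exclusively.
     */
    static void
    slru_recently_used_sketch(SlruShared shared, int slotno)
    {
        int        other;

        for (other = 0; other < NUM_SLRU_BUFFERS; other++)
            shared->page_lru_count[other]++;    /* age every page ... */
        shared->page_lru_count[slotno] = 0;     /* ... except the one just hit */
    }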

The way the attached patch attacks this is for the shared-lock access
case to simply set the page's LRU counter to zero, without bumping up
the LRU counters of the other pages as the normal adjustment would do.
This is safe to execute in a shared environment since even if someone
else is concurrently touching the same page, they'll also be trying
to set its counter to zero, and so there's no possibility of ending
up in a bad state.  However, this leaves us with the likelihood that
multiple pages will have equal LRU counters when it comes time to
throw out a page to make room for another.  The patch deals with that
by selecting the furthest-back page of those with the highest LRU
setting.  I'm not totally happy with this heuristic, though, and was
wondering if anyone had a better idea.  Anyone seen a lock-free
data structure for LRU or approximately-LRU state tracking?
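
For readability, the victim-selection heuristic can also be restated
outside the diff format (a condensed sketch; the function name is
invented, and the logic is meant to mirror the hunk in the attached
patch): among slots not holding the latest page, prefer the largest LRU
count, breaking ties in favor of the page that PagePrecedes says is
furthest back.

    /*
     * Condensed restatement of the patch's replacement heuristic; the
     * function name is invented, the logic mirrors the attached diff.
     */
    static int
    slru_select_victim_sketch(SlruCtl ctl)
    {
        SlruShared        shared = ctl->shared;
        int                bestslot = 0;
        unsigned int    bestcount = 0;
        int                best_page = shared->latest_page_number;
        int                slotno;

        for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
        {
            unsigned int    count = shared->page_lru_count[slotno];
            int                page = shared->page_number[slotno];

            if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
                return slotno;        /* a free slot beats any victim */
            if (page == shared->latest_page_number)
                continue;            /* never evict the latest page */
            if (count > bestcount ||
                (count == bestcount && ctl->PagePrecedes(page, best_page)))
            {
                bestslot = slotno;
                bestcount = count;
                best_page = page;
            }
        }
        return bestslot;
    }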

            regards, tom lane


--- src/backend/access/transam/subtrans.c    Tue Nov 29 14:34:58 2005
***************
*** 110,118 ****
      if (!TransactionIdIsNormal(xid))
          return InvalidTransactionId;

!     LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
!
!     slotno = SimpleLruReadPage(SubTransCtl, pageno, xid);
      ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
      ptr += entryno;

--- 110,116 ----
      if (!TransactionIdIsNormal(xid))
          return InvalidTransactionId;

!     slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
      ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
      ptr += entryno;


*** src/backend/access/transam/slru.c.orig    Tue Nov 22 16:06:16 2005
--- src/backend/access/transam/slru.c    Tue Nov 29 14:34:51 2005
***************
*** 352,357 ****
--- 352,395 ----
  }

  /*
+  * Find a page in a shared buffer, reading it in if necessary.
+  * The page number must correspond to an already-initialized page.
+  *
+  * The passed-in xid is used only for error reporting, and may be
+  * InvalidTransactionId if no specific xid is associated with the action.
+  *
+  * Return value is the shared-buffer slot number now holding the page.
+  * The buffer's LRU access info is updated.
+  *
+  * Control lock must be held at entry, and will be held at exit.
+  */
+ int
+ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+ {
+     SlruShared    shared = ctl->shared;
+     int            slotno;
+
+     LWLockAcquire(shared->ControlLock, LW_SHARED);
+
+     /* See if page is already in a buffer */
+     for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
+     {
+         if (shared->page_number[slotno] == pageno &&
+             shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
+             shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
+         {
+             shared->page_lru_count[slotno] = 0;
+             return slotno;
+         }
+     }
+
+     LWLockRelease(shared->ControlLock);
+     LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+
+     return SimpleLruReadPage(ctl, pageno, xid);
+ }
+
+ /*
   * Write a page from a shared buffer, if necessary.
   * Does nothing if the specified slot is not dirty.
   *
***************
*** 729,736 ****
      for (;;)
      {
          int            slotno;
!         int            bestslot = 0;
!         unsigned int bestcount = 0;

          /* See if page already has a buffer assigned */
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
--- 767,775 ----
      for (;;)
      {
          int            slotno;
!         int            bestslot;
!         unsigned int bestcount;
!         int            best_page_number;

          /* See if page already has a buffer assigned */
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
***************
*** 744,758 ****
           * If we find any EMPTY slot, just select that one. Else locate the
           * least-recently-used slot that isn't the latest page.
           */
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
          {
              if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
                  return slotno;
!             if (shared->page_lru_count[slotno] > bestcount &&
!                 shared->page_number[slotno] != shared->latest_page_number)
              {
                  bestslot = slotno;
!                 bestcount = shared->page_lru_count[slotno];
              }
          }

--- 783,808 ----
           * If we find any EMPTY slot, just select that one. Else locate the
           * least-recently-used slot that isn't the latest page.
           */
+         bestslot = 0;
+         bestcount = 0;
+         best_page_number = shared->latest_page_number;
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
          {
+             unsigned int    this_lru_count;
+             int            this_page_number;
+
              if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
                  return slotno;
!             this_lru_count = shared->page_lru_count[slotno];
!             this_page_number = shared->page_number[slotno];
!             if ((this_lru_count > bestcount ||
!                  (this_lru_count == bestcount &&
!                   ctl->PagePrecedes(this_page_number, best_page_number))) &&
!                 this_page_number != shared->latest_page_number)
              {
                  bestslot = slotno;
!                 bestcount = this_lru_count;
!                 best_page_number = this_page_number;
              }
          }
