Thread: Shared locking in slru.c

Shared locking in slru.c

From
Tom Lane
Date:
I've been fooling around with a test case Rob Creager sent me, which is
able to drive PG into a context-switch storm caused by contention for
the SubtransControlLock.  Rob asked for the test case not to be posted
publicly (it's part of some proprietary code), but I found that you can
cause some of the same behavior just by running pgbench while holding
an open transaction in a psql session.  pgbench isn't able to saturate
the machine but I think that's due to unrelated pgbench limitations.

The basic cause of the problem is that by holding an open transaction
while many other transactions come and go, the window of transactions
for which pg_subtrans has to be consulted gets wider and wider.  If the
system is doing a lot of updates, the fraction of tuples for which
HeapTupleSatisfiesSnapshot has to invoke SubTransGetTopmostTransaction
gets larger too, and so the extent of contention for SubtransControlLock
increases, even though no subtransactions are actually in use.

I've been looking at various ways to resolve this, but one thing that
seems promising is to hack slru.c to take the control lock in shared
mode, not exclusive mode, for read-only accesses to pages that are
already in memory.  The vast majority of accesses to pg_subtrans in the
problem scenario are read-only, and so this reduces the need to block
and the consequent CS storm.

A quick-hack prototype patch for this is attached.  (It changes only
SubTransGetParent, but if applied of course TransactionIdGetStatus and
so on would get the same treatment.)  The idea is that we take the lock
in shared mode, look to see if the required page is in memory, and if so
just return it; if not, release the lock, reacquire in exclusive mode,
proceed with the existing code.

The reason that the existing code always takes the lock in exclusive
mode, even if there's no intention to change data, is that it also needs
to adjust the LRU information to reflect the page access, and the way
that we're doing that requires exclusive access.  So to make the idea
work at all, we need some alternative way of managing recency-of-use
info.

The way the attached patch attacks this is for the shared-lock access
case to simply set the page's LRU counter to zero, without bumping up
the LRU counters of the other pages as the normal adjustment would do.
This is safe to execute in a shared environment since even if someone
else is concurrently touching the same page, they'll also be trying
to set its counter to zero, and so there's no possibility of ending
up in a bad state.  However, this leaves us with the likelihood that
multiple pages will have equal LRU counters when it comes time to
throw out a page to make room for another.  The patch deals with that
by selecting the furthest-back page of those with the highest LRU
setting.  I'm not totally happy with this heuristic, though, and was
wondering if anyone had a better idea.  Anyone seen a lock-free
data structure for LRU or approximately-LRU state tracking?

            regards, tom lane


--- src/backend/access/transam/subtrans.c    Tue Nov 29 14:34:58 2005
***************
*** 110,118 ****
      if (!TransactionIdIsNormal(xid))
          return InvalidTransactionId;

!     LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
!
!     slotno = SimpleLruReadPage(SubTransCtl, pageno, xid);
      ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
      ptr += entryno;

--- 110,116 ----
      if (!TransactionIdIsNormal(xid))
          return InvalidTransactionId;

!     slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
      ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
      ptr += entryno;


*** src/backend/access/transam/slru.c.orig    Tue Nov 22 16:06:16 2005
--- src/backend/access/transam/slru.c    Tue Nov 29 14:34:51 2005
***************
*** 352,357 ****
--- 352,395 ----
  }

  /*
+  * Find a page in a shared buffer, reading it in if necessary.
+  * The page number must correspond to an already-initialized page.
+  *
+  * The passed-in xid is used only for error reporting, and may be
+  * InvalidTransactionId if no specific xid is associated with the action.
+  *
+  * Return value is the shared-buffer slot number now holding the page.
+  * The buffer's LRU access info is updated.
+  *
+  * Control lock must be held at entry, and will be held at exit.
+  */
+ int
+ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+ {
+     SlruShared    shared = ctl->shared;
+     int            slotno;
+
+     LWLockAcquire(shared->ControlLock, LW_SHARED);
+
+     /* See if page is already in a buffer */
+     for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
+     {
+         if (shared->page_number[slotno] == pageno &&
+             shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
+             shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
+         {
+             shared->page_lru_count[slotno] = 0;
+             return slotno;
+         }
+     }
+
+     LWLockRelease(shared->ControlLock);
+     LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+
+     return SimpleLruReadPage(ctl, pageno, xid);
+ }
+
+ /*
   * Write a page from a shared buffer, if necessary.
   * Does nothing if the specified slot is not dirty.
   *
***************
*** 729,736 ****
      for (;;)
      {
          int            slotno;
!         int            bestslot = 0;
!         unsigned int bestcount = 0;

          /* See if page already has a buffer assigned */
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
--- 767,775 ----
      for (;;)
      {
          int            slotno;
!         int            bestslot;
!         unsigned int bestcount;
!         int            best_page_number;

          /* See if page already has a buffer assigned */
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
***************
*** 744,758 ****
           * If we find any EMPTY slot, just select that one. Else locate the
           * least-recently-used slot that isn't the latest page.
           */
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
          {
              if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
                  return slotno;
!             if (shared->page_lru_count[slotno] > bestcount &&
!                 shared->page_number[slotno] != shared->latest_page_number)
              {
                  bestslot = slotno;
!                 bestcount = shared->page_lru_count[slotno];
              }
          }

--- 783,808 ----
           * If we find any EMPTY slot, just select that one. Else locate the
           * least-recently-used slot that isn't the latest page.
           */
+         bestslot = 0;
+         bestcount = 0;
+         best_page_number = shared->latest_page_number;
          for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
          {
+             unsigned int    this_lru_count;
+             int            this_page_number;
+
              if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
                  return slotno;
!             this_lru_count = shared->page_lru_count[slotno];
!             this_page_number = shared->page_number[slotno];
!             if ((this_lru_count > bestcount ||
!                  (this_lru_count == bestcount &&
!                   ctl->PagePrecedes(this_page_number, best_page_number))) &&
!                 this_page_number != shared->latest_page_number)
              {
                  bestslot = slotno;
!                 bestcount = this_lru_count;
!                 best_page_number = this_page_number;
              }
          }

Re: Shared locking in slru.c

From
Kenneth Marshall
Date:
On Wed, Nov 30, 2005 at 03:23:55PM -0500, Tom Lane wrote:
> Kenneth Marshall <ktm@is.rice.edu> writes:
> > ... In pseudo-code, the operations to
> > read the control information are:
> 
> > WriteControl:
> > 1. Set latch.
> > 2. Update control information
> > 3. Increment latch version number.
> > 4. Unset latch.
> 
> > ReadControl:
> > 1. Read latch version number.
> > 2. Read control information.
> > 3. Check latch. If locked go to step 1.
> > 4. Read latch version number. If the value is different from the
> >    value read in step 1, go to step 1.
> 
> Hm, I don't see how that's safe in the presence of concurrent would-be
> writers?  (Or is that what you meant by "queuing up lock requests"?)
> 
>             regards, tom lane

The latch is definitely safe for readers and writers concurrently
accessing the information. It does not provide the ordered waiting
for a lock that the LWLock will. It is also so light-weight that
for the types of reads and updates to the shared areas that it may
outperform the existing LWLock even during contention.

Ken


Re: Shared locking in slru.c

From
Kenneth Marshall
Date:
On Wed, Nov 30, 2005 at 01:53:13PM -0500, Tom Lane wrote:
> I've been looking at various ways to resolve this, but one thing that
> seems promising is to hack slru.c to take the control lock in shared
> mode, not exclusive mode, for read-only accesses to pages that are
> already in memory.  The vast majority of accesses to pg_subtrans in the
> problem scenario are read-only, and so this reduces the need to block
> and the consequent CS storm.
> 
> A quick-hack prototype patch for this is attached.  (It changes only
> SubTransGetParent, but if applied of course TransactionIdGetStatus and
> so on would get the same treatment.)  The idea is that we take the lock
> in shared mode, look to see if the required page is in memory, and if so
> just return it; if not, release the lock, reacquire in exclusive mode,
> proceed with the existing code.
> 
> The reason that the existing code always takes the lock in exclusive
> mode, even if there's no intention to change data, is that it also needs
> to adjust the LRU information to reflect the page access, and the way
> that we're doing that requires exclusive access.  So to make the idea
> work at all, we need some alternative way of managing recency-of-use
> info.
> 
> The way the attached patch attacks this is for the shared-lock access
> case to simply set the page's LRU counter to zero, without bumping up
> the LRU counters of the other pages as the normal adjustment would do.
> This is safe to execute in a shared environment since even if someone
> else is concurrently touching the same page, they'll also be trying
> to set its counter to zero, and so there's no possibility of ending
> up in a bad state.  However, this leaves us with the likelihood that
> multiple pages will have equal LRU counters when it comes time to
> throw out a page to make room for another.  The patch deals with that
> by selecting the furthest-back page of those with the highest LRU
> setting.  I'm not totally happy with this heuristic, though, and was
> wondering if anyone had a better idea.  Anyone seen a lock-free
> data structure for LRU or approximately-LRU state tracking?
> 
>             regards, tom lane

Tom,

In this situation, if this code does not need the ability to queue
access to the control lock, it may be reasonable to use a latch process
to update the control information. In pseudo-code, the operations to
read the control information are:

WriteControl:
1. Set latch.
2. Update control information
3. Increment latch version number.
4. Unset latch.

ReadControl:
1. Read latch version number.
2. Read control information.
3. Check latch. If locked go to step 1.
4. Read latch version number. If the value is different from the  value read in step 1, go to step 1.

In this way, a read operation will only need two reads of the version
number and one of the shared data. A write operation will be very
lightweight. Setting the latch can be a single TAS/CAS operation and
unsetting the latch and incrementing the version number can be done
in a single write. Of course, if you need to be able to queue up lock
requests this will not work.

Ken


Re: Shared locking in slru.c

From
Manfred Koizar
Date:
On Wed, 30 Nov 2005 13:53:13 -0500, Tom Lane <tgl@sss.pgh.pa.us>
wrote:
>The way the attached patch attacks this is for the shared-lock access
>case to simply set the page's LRU counter to zero, without bumping up
>the LRU counters of the other pages as the normal adjustment would do.

If you still plan to do this, you might also want to revert the
micro-optimisation intruduced by the original SLRU patch:

| Apart from refactoring I made a little change to SlruRecentlyUsed,
| formerly ClogRecentlyUsed:  It now skips incrementing lru_counts, if
| slotno is already the LRU slot, thus saving a few CPU cycles.

|+#define SlruRecentlyUsed(shared, slotno)    \
|+    do { \
|+        if ((shared)->page_lru_count[slotno] != 0) { \
|+            int        iilru; \
|+            for (iilru = 0; iilru < NUM_CLOG_BUFFERS; iilru++) \
|+                (shared)->page_lru_count[iilru]++; \
|+            (shared)->page_lru_count[slotno] = 0; \
|+        } \
|+    } while (0)

Otherwise you could end up with a stable state of several pages having
lru_count == 0.
ServusManfred



Re: Shared locking in slru.c

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> On Wed, 30 Nov 2005 13:53:13 -0500, Tom Lane <tgl@sss.pgh.pa.us>
> wrote:
>> The way the attached patch attacks this is for the shared-lock access
>> case to simply set the page's LRU counter to zero, without bumping up
>> the LRU counters of the other pages as the normal adjustment would do.

> If you still plan to do this, you might also want to revert the
> micro-optimisation intruduced by the original SLRU patch:

Good point --- thanks for mentioning it.  I'm still fooling with the
modified code because it seems like it's not doing very well at managing
the SLRU pool, and perhaps that's got something to do with it ...
        regards, tom lane


Re: Shared locking in slru.c

From
Tom Lane
Date:
I wrote:
> The way the attached patch attacks this is for the shared-lock access
> case to simply set the page's LRU counter to zero, without bumping up
> the LRU counters of the other pages as the normal adjustment would do.
> ...
> I'm not totally happy with this heuristic, though, and was
> wondering if anyone had a better idea.  Anyone seen a lock-free
> data structure for LRU or approximately-LRU state tracking?

I've come up with what seems a slightly better way to do this.  The idea
is to replace the current structure for tracking which page is the least-
recently-used with this:

typedef struct SlruSharedData
{   ...
   /*----------    * We mark a page "most recently used" by setting    *        page_lru_count[slotno] =
++cur_lru_count;   * The oldest page is therefore the one with the highest value of    *        cur_lru_count -
page_lru_count[slotno]   *----------    */   int            cur_lru_count;
 
   ...
   int            page_lru_count[NUM_SLRU_BUFFERS];
   ...

which makes the SlruRecentlyUsed macro look like

#define SlruRecentlyUsed(shared, slotno) \   do { \       int        new_lru_count = (shared)->cur_lru_count; \
if(new_lru_count != (shared)->page_lru_count[slotno]) { \           (shared)->cur_lru_count = ++new_lru_count; \
  (shared)->page_lru_count[slotno] = new_lru_count; \       } \   } while (0)
 

and SlruSelectLRUPage() has to look for the maximum value of
"cur_count - shared->page_lru_count[slotno]" rather than just
"shared->page_lru_count[slotno]" as before.  This seems like a win
in any case since it takes cycles out of the commonly used path at
a small increase in the cost of SlruSelectLRUPage --- but in that
routine you are about to do I/O anyway, so a few extra subtractions
are negligible.

However, the real reason for doing this is that I think it's OK for
the proposed SimpleLruReadPage_ReadOnly routine to apply this form
of SlruRecentlyUsed even though it holds only a shared lock.  Assuming
that int reads and writes are atomic, the only possible failures are

1. If a process running the macro is delayed, it might store a stale
(hence too small) value back into cur_lru_count or a page_lru_count
array element after someone else has incremented them to a larger value.

2. Two processes might read the same cur_lru_count value at the same
time, so that one of their increments is lost.  This has the same end
effect as #1, though.

Given reasonable care in SlruSelectLRUPage (see attached excerpt), we
can defend against these scenarios and usually still make a reasonable
choice of which page to evict.  In any case, the worst possible scenario
is that we make a nonoptimal choice of page to evict due to confused
lru_count state.  This price seems well worth the chance to reduce
contention for shared memory.

Thoughts, objections?
        regards, tom lane

       /*        * If we find any EMPTY slot, just select that one. Else locate the        * least-recently-used slot
toreplace.        *        * Normally the page_lru_count values will all be different and so        * there will be a
well-definedLRU page.  But since we allow        * concurrent execution of SlruRecentlyUsed() within        *
SimpleLruReadPage_ReadOnly(),it is possible that multiple pages        * acquire the same lru_count values.  In that
casewe break ties by        * choosing the furthest-back page.        *        * In no case will we select the slot
containinglatest_page_number        * for replacement, even if it appears least recently used.        *        * Notice
thatthis next line forcibly advances cur_lru_count to a        * value that is certainly beyond any value that will be
inthe        * page_lru_count array after the loop finishes.  This ensures that        * the next execution of
SlruRecentlyUsedwill give us good data,        * even if it's against a page that has the current counter value.
*/      cur_count = (shared->cur_lru_count)++;       best_delta = -1;       bestslot = 0;            /* no-op, just
keepscompiler quiet */       best_page_number = 0;    /* ditto */       for (slotno = 0; slotno < NUM_SLRU_BUFFERS;
slotno++)      {           int            this_delta;           int            this_page_number;
 
           if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)               return slotno;           this_delta =
cur_count- shared->page_lru_count[slotno];           if (this_delta < 0)           {               /*                *
Cleanup in case shared updates have caused cur_count                * increments to get "lost".  We back off the page
counts,               * rather than trying to increase cur_count, to avoid any                * question of infinite
loopsor failure in the presence of                * wrapped-around counts.                */
shared->page_lru_count[slotno]= cur_count;               this_delta = 0;           }           this_page_number =
shared->page_number[slotno];          if ((this_delta > best_delta ||                (this_delta == best_delta &&
         ctl->PagePrecedes(this_page_number, best_page_number))) &&               this_page_number !=
shared->latest_page_number)          {               bestslot = slotno;               best_delta = this_delta;
    best_page_number = this_page_number;           }       }