Re: Move unused buffers to freelist - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Move unused buffers to freelist
Date: 2013-06-27
Msg-id: 20130627130146.GH1254@alap2.anarazel.de
In response to: Re: Move unused buffers to freelist (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
> I'd like to just back up a minute here and talk about the broader
> picture here.

Sounds like a very good plan.

> So in other words,
> there's no huge *performance* problem for a working set larger than
> shared_buffers, but there is a huge *scalability* problem.  Now why is
> that?

> As far as I can tell, the answer is that we've got a scalability
> problem around BufFreelistLock.

Part of the problem is its name ;)

> Contention on the buffer mapping
> locks may also be a problem, but all of my previous benchmarking (with
> LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
> in the room.

Contention-wise I agree. What I have seen is that we have a huge
amount of cacheline bouncing around the buffer header spinlocks.

> My interest in having the background writer add buffers
> to the free list is basically around solving that problem.  It's a
> pretty dramatic problem, as the graph above shows, and this patch
> doesn't solve it.

> One thing that occurred to me while writing this note is that the
> background writer doesn't have any compelling reason to run on a
> read-only workload.  It will still run at a certain minimum rate, so
> that it cycles the buffer pool every 2 minutes, if I remember
> correctly.

I have previously added some ad-hoc instrumentation that printed the
number of buffers that were required (by other backends) during a
bgwriter cycle and the number of buffers that the buffer manager could
actually write out. I don't think I found any workload where the
bgwriter wrote out a relevant percentage of the necessary pages.
That would explain why the patch doesn't show a big benefit: the
freelist is empty most of the time, so we don't benefit from the reduced
work done under the lock.
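
To illustrate the kind of numbers I mean, a minimal sketch of such
instrumentation could look roughly like the following; the counter names
and the reporting function are made up for illustration, not what I
actually had:

    /* Hypothetical per-cycle counters, reset at the start of every bgwriter cycle. */
    static uint32 cycle_buffers_needed = 0;   /* allocations backends had to do themselves */
    static uint32 cycle_buffers_written = 0;  /* buffers the bgwriter managed to write out */

    static void
    report_bgwriter_cycle(void)
    {
        elog(LOG, "bgwriter cycle: backends needed %u buffers, bgwriter wrote %u (%.1f%%)",
             cycle_buffers_needed, cycle_buffers_written,
             cycle_buffers_needed > 0 ?
                 (100.0 * cycle_buffers_written) / cycle_buffers_needed : 0.0);

        cycle_buffers_needed = 0;
        cycle_buffers_written = 0;
    }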

I think the whole algorithm that guides how much the background writer
actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we are actually able to measure the
benefit from this patch. I personally don't think there's much to
salvage from the current code.

Problems with the current code:

* It doesn't manipulate the usage_count and never does anything to used
  pages, which means it will just about never find a victim buffer in a
  busy database.
* It is by far not aggressive enough; it only touches a few buffers ahead
  of the clock sweep.
* It does not advance the clock sweep, so the individual backends will
  touch the same buffers again and bounce the buffer spinlock cachelines
  around.
* Its adaptation logic is so sluggish that it takes several minutes to
  adapt.
* ...


There's another thing we could do to noticeably improve the scalability
of buffer acquisition. Currently we do a huge amount of work under the
freelist lock.
In StrategyGetBuffer:

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    ...
    // check freelist, will usually be empty
    ...
    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        ++StrategyControl->nextVictimBuffer;

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }

So we perform the entire clock sweep, until we have found a single buffer
we can use, inside a *global* lock. At times we need to iterate over all
of shared buffers up to BM_MAX_USAGE_COUNT (5) times until we have pushed
down all the usage counts enough (and if the database is busy it can take
even longer...).
In a busy database, where usually all the usage counts are high, the next
backend will touch a lot of those buffers again, which causes massive
cache eviction & bouncing.

It seems far more sensible to only protect the clock sweep's
nextVictimBuffer with a spinlock. With some care all the rest can happen
without any global interlock.
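
Roughly what I have in mind is sketched below; the victim_lock spinlock
is an assumed new field in StrategyControl, and the freelist check and
error handling are elided:

    /*
     * Advance the clock hand under a small spinlock only; the per-buffer
     * inspection then happens without any global lock, just under the
     * buffer header spinlock.
     */
    for (;;)
    {
        int     victim;

        SpinLockAcquire(&StrategyControl->victim_lock);  /* assumed new field */
        victim = StrategyControl->nextVictimBuffer;
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;
        SpinLockRelease(&StrategyControl->victim_lock);

        buf = &BufferDescriptors[victim];

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;
            else
            {
                /* Found a victim, returned with the header spinlock held */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }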

I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter moving pages onto the freelist makes
sense, because then backends don't need to deal with dirty pages.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


