Re: Move unused buffers to freelist - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Move unused buffers to freelist
Date
Msg-id CA+TgmoZK_-riTdfY=9e4k8WyHNMQZH-WHLw6Uf=B7gT18tBeEA@mail.gmail.com
Whole thread Raw
In response to Re: Move unused buffers to freelist  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Move unused buffers to freelist
List pgsql-hackers
On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Contention wise I aggree. What I have seen is that we have a huge
> amount of cacheline bouncing around the buffer header spinlocks.

How did you measure that?

> I have previously added some adhoc instrumentation that printed the
> amount of buffers that were required (by other backends) during a
> bgwriter cycle and the amount of buffers that the buffer manager could
> actually write out.

I think you can see how many are needed from buffers_alloc.  No?

> I don't think I actually found any workload where
> the bgwriter actually wroute out a relevant percentage of the necessary
> pages.

Check.

> Problems with the current code:
>
> * doesn't manipulate the usage_count and never does anything to used
>   pages. Which means it will just about never find a victim buffer in a
>   busy database.

Right.  I was thinking that was part of this patch, but it isn't.  I
think we should definitely add that.  In other words, the background
writer's job should be to run the clock sweep and add buffers to the
free list.  I think we should also split the lock: a spinlock for the
freelist, and an lwlock for the clock sweep.

> * by far not aggressive enough, touches only a few buffers ahead of the
>   clock sweep.

Check.  Fixing this might be a separate patch, but then again maybe
not.  The changes we're talking about here provide a natural feedback
mechanism: if we observe that the freelist is empty (or less than some
length, like 32 buffers?) set the background writer's latch, because
we know it's not keeping up.

> * does not advance the clock sweep, so the individual backends will
>   touch the same buffers again and transfer all the buffer spinlock
>   cacheline around

Yes, I think that should be fixed as part of this patch too.  It's
obviously connected to the point about usage counts.

> * The adaption logic it has makes it so slow to adapt that it takes
>   several minutes to adapt.

Yeah.  I don't know if fixing that will fall naturally out of these
other changes or not, but I think it's a second-order concern in any
event.

> There's another thing we could do to noticeably improve scalability of
> buffer acquiration. Currently we do a huge amount of work under the
> freelist lock.
> In StrategyGetBuffer:
>     LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> ...
>     // check freelist, will usually be empty
> ...
>     for (;;)
>     {
>         buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
>
>         ++StrategyControl->nextVictimBuffer;
>
>         LockBufHdr(buf);
>         if (buf->refcount == 0)
>         {
>             if (buf->usage_count > 0)
>             {
>                 buf->usage_count--;
>             }
>             else
>             {
>                 /* Found a usable buffer */
>                 if (strategy != NULL)
>                     AddBufferToRing(strategy, buf);
>                 return buf;
>             }
>         }
>         UnlockBufHdr(buf);
>     }
>
> So, we perform the entire clock sweep until we found a single buffer we
> can use inside a *global* lock. At times we need to iterate over the
> whole shared buffers BM_MAX_USAGE_COUNT (5) times till we pushed down all
> the usage counts enough (if the database is busy it can take even
> longer...).
> In a busy database where usually all the usagecounts are high the next
> backend will touch a lot of those buffers again which causes massive
> cache eviction & bouncing.
>
> It seems far more sensible to only protect the clock sweep's
> nextVictimBuffer with a spinlock. With some care all the rest can happen
> without any global interlock.

That's a lot more spinlock acquire/release cycles, but it might work
out to a win anyway.  Or it might lead to the system suffering a
horrible spinlock-induced death spiral on eviction-heavy workloads.

> I think even after fixing this - which we definitely should do - having
> a sensible/more aggressive bgwriter moving pages onto the freelist makes
> sense because then backends then don't need to deal with dirty pages.

Or scanning to find evictable pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn
Next
From: Tom Lane
Date:
Subject: Re: pg_filedump 9.3: checksums (and a few other fixes)