Thread: Thinking about breaking up the BufMgrLock

Thinking about breaking up the BufMgrLock

From
Tom Lane
Date:
We've been speculating for a while about ways to reduce contention for
the BufMgrLock, in hopes of fixing the "context swap storm" problem
exhibited by certain patterns of concurrent queries.  The traditional
way to fix such problems is to divide what is protected by the lock
into finer-grain lockable objects.  (I'm not totally convinced that this
approach can help us, because more locks means more low-level locking
overhead.  But let's pursue it and see where it goes; we sure don't have
any other ideas at the moment.)  Neil Conway took a stab at this about a
year ago:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00173.php
but that approach had some problems IMHO, and in any case he never got
the patch to the point of working.  Here's a bit of fresh thinking about
the problem.

To fix the test cases we have been looking at, there are two things
we need to fix the performance of: ReadBuffer for the case where the
requested page is already in a shared buffer, and ReleaseBuffer.  The
test cases essentially put multiple backends into tight loops doing
these operations.  These cases do not require I/O or any kernel call,
so ideally they should be fast, but what we are seeing is that they
suffer from excessive contention for the BufMgrLock.

ReadBuffer needs to do a lookup to map the page ID to a buffer ID,
which in principle requires only a shared lock on the page-to-buffer
mapping (embodied in the buf_table hash table).  Assuming success, it
also needs to mark the buffer pinned and update the LRU-list position
of the buffer.  Marking pinned is certainly a buffer-local change,
so we could imagine splitting up the BufMgrLock like this:

1. A global LWLock for the page-to-buffer mapping, call it say
BufMappingLock.  Share lock on this is sufficient to allow reading the
hash table; must get exclusive lock when reassigning any buffer's page
identity.

2. A global LWLock for the LRU strategy information, say BufLRULock
or BufStrategyLock.

3. Per-buffer LWLocks (or maybe just spinlocks) protecting the state of
each buffer header; you need this lock to examine or change a buffer
header.

ReleaseBuffer has no locking problems in this formulation: it just grabs
the per-buffer lock, decrements its refcount, releases the lock.
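
As a rough sketch of how items 1-3 and ReleaseBuffer might fit together
(placeholder names only -- SketchBufferDesc and hdr_lock are invented for
illustration, not actual bufmgr code; only the locking shape matters):

    /* Rough sketch only; placeholder names, not real bufmgr structures. */
    typedef struct SketchBufferDesc
    {
        BufferTag   tag;          /* page identity; item 1 guards changes to it */
        int         refcount;     /* pin count */
        int         usage_flags;  /* room for per-buffer state bits */
        slock_t     hdr_lock;     /* per-buffer spinlock (item 3) */
    } SketchBufferDesc;

    /* ReleaseBuffer needs only the per-buffer lock -- no global lock at all. */
    static void
    SketchReleaseBuffer(SketchBufferDesc *buf)
    {
        SpinLockAcquire(&buf->hdr_lock);
        buf->refcount--;
        SpinLockRelease(&buf->hdr_lock);
    }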

ReadBuffer looks like:
    * Acquire share lock on BufMappingLock.
    * Search hash table for desired ID.  (we assume success)
    * acquire per-buffer lock.
    * increment buffer refcount.
    * release per-buffer lock.
    * release share lock on BufMappingLock.
    * update the LRU state.

(We have to bump the pin count on the target buffer before releasing the
BufMappingLock, otherwise someone could reassign the buffer as soon as
we release BufMappingLock.)
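
In the same sketchy notation (placeholder names again, and assuming the
hash lookup succeeds), the in-cache path of ReadBuffer comes out as:

    static SketchBufferDesc *
    SketchReadBufferHit(BufferTag *tag)
    {
        SketchBufferDesc *buf;

        /* shared lock suffices to read the mapping hash table */
        LWLockAcquire(BufMappingLock, LW_SHARED);
        buf = sketch_buf_table_lookup(tag);         /* assume success */

        /* pin before dropping BufMappingLock, so the buffer can't be
         * reassigned out from under us */
        SpinLockAcquire(&buf->hdr_lock);
        buf->refcount++;
        SpinLockRelease(&buf->hdr_lock);

        LWLockRelease(BufMappingLock);

        /* the sticking point: still wants exclusive access to a global list */
        sketch_update_lru(buf);

        return buf;
    }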

This would be pretty good from a locking point of view, except that
"update the LRU state" seems to require taking an exclusive lock on a
global data structure, which puts us about back where we were.
Contention for a global BufLRULock might be a bit better than for the
existing overall BufMgrLock, but it'll still cripple the performance
of ReadBuffer.

Perhaps someone knows of a neat data structure that can maintain an LRU
list with only local updates?  I don't though.

The best idea I've had about it is to add flag bits to the per-buffer
state that essentially indicate "pending move to head of T1" or "pending
move to head of T2".  We would make ReadBuffer set the appropriate bit
when it grabs a buffer, but not actually update the global LRU state.
Operations such as the periodic bgwriter scan for work to do, or an
attempt to locate a reusable buffer, would scan the LRU list from tail
to front.  When they hit a buffer that's got one of these bits set, they
would not process it immediately, but would move it to the list head and
clear the bit.  Doing this would require write lock on the LRU list, as
well as per-buffer locks on each buffer as it's examined or changed,
but *not* any lock on BufMappingLock.  So these operations wouldn't
interfere too badly with the success path in ReadBuffer.
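
Continuing the placeholder sketch, the flag-bit idea might look roughly
like this (the list type and helpers such as lru_tail/lru_move_to_head are
assumed here, not real code):

    #define BUF_PENDING_T1_HEAD  0x01   /* pending move to head of T1 */
    #define BUF_PENDING_T2_HEAD  0x02   /* pending move to head of T2 */

    /* ReadBuffer success path: set a bit, touch no global list. */
    static void
    sketch_mark_recently_used(SketchBufferDesc *buf, int which)
    {
        SpinLockAcquire(&buf->hdr_lock);
        buf->usage_flags |= which;
        SpinLockRelease(&buf->hdr_lock);
    }

    /* bgwriter scan or victim search: holds the LRU write lock (but not
     * BufMappingLock) and processes deferred moves as it goes. */
    static void
    sketch_scan_from_tail(SketchLRUList *list)
    {
        SketchBufferDesc *buf, *prev;

        for (buf = lru_tail(list); buf != NULL; buf = prev)
        {
            prev = lru_prev(list, buf);

            SpinLockAcquire(&buf->hdr_lock);
            if (buf->usage_flags != 0)
            {
                buf->usage_flags = 0;
                SpinLockRelease(&buf->hdr_lock);
                lru_move_to_head(list, buf);    /* deferred move-to-front */
                continue;
            }
            SpinLockRelease(&buf->hdr_lock);

            /* ... otherwise this buffer is a candidate to clean or reuse ... */
        }
    }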

This would convert the existing "strict LRU" behavior into an
"approximate LRU".  I'm worried that the change might be penny-wise and
pound-foolish: if a poorer cache management algorithm causes us to have
to do more I/O, it's probably a net loss to save some lock contention.
But without a smart idea about data structures I don't see how to do
better.

Thoughts?
        regards, tom lane


Re: Thinking about breaking up the BufMgrLock

From
Neil Conway
Date:
On Sun, 2005-02-06 at 19:30 -0500, Tom Lane wrote:
> This would be pretty good from a locking point of view, except that
> "update the LRU state" seems to require taking an exclusive lock on a
> global data structure, which puts us about back where we were.

We're only concerned with a buffer's access recency when it is on the
free list, right? (That is certainly true with naive LRU; I confess I
haven't had a chance to grok the 2Q paper yet). If so:

- when ReadBuffer() is called on an in-cache but previously-free buffer,
increment the buffer's ref count; this logically removes it from the
LRU, but does not actually do the physical list removal.

- when we need to fault in a page from disk, acquire the LRU lock and
select a buffer for replacement. At this point we can do some physical
cleanup of the free list, by removing buffers with a non-zero refcount
from the free list. We can do this cleanup relatively cheaply now,
because we had to acquire the LRU lock anyway, and this is the slow path
of ReadBuffer(). This work might also be done by the bgwriter (removing
clean but pinned buffers from close to the LRU of the free list).

- we need only update a pinned buffer's LRU position when its shared
refcount drops to zero, not on every ReleaseBuffer()

This means we need to acquire the LRU lock once for each acquire/release
pair on a previously-unpinned buffer (rather than twice, if we had to
acquire the LRU lock on acquire as well). Not sure if that's enough to
remove the contention problem; on the plus side, it doesn't change the
actual behavior of the replacement policy.
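
A hedged sketch of that release path, reusing the placeholder names from
earlier in the thread (a real version would recheck the refcount after
taking the LRU lock, since another backend could re-pin in the gap):

    static void
    sketch_release_buffer(SketchBufferDesc *buf)
    {
        bool    now_unpinned;

        SpinLockAcquire(&buf->hdr_lock);
        now_unpinned = (--buf->refcount == 0);
        SpinLockRelease(&buf->hdr_lock);

        /* pay for the LRU lock only when the last pin goes away, so an
         * acquire/release pair on a cached buffer hits it once, not twice */
        if (now_unpinned)
        {
            LWLockAcquire(BufLRULock, LW_EXCLUSIVE);
            sketch_move_to_mru(buf);
            LWLockRelease(BufLRULock);
        }
    }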

> Perhaps someone knows of a neat data structure that can maintain an LRU
> list with only local updates?  I don't though.

What operations does 2Q require on the shared lists? (Assuming that's
the replacement policy we're going with.) Depending on how complex the
list modifications are, non-blocking algorithms might be worth
considering. For example, to remove a node from the middle of a linked
list can be done via atomic CAS; you need to redo the CAS in the
presence of a concurrent modification of the particular node you're
trying to modify, but that means we are contending over individual list
nodes, not the list as a whole.
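
For illustration only, a single-CAS unlink might look like the following
(C11 atomics; this deliberately ignores the ABA problem and concurrent
deletion of the predecessor, which a production lock-free list -- e.g.
Harris-style marked pointers -- has to handle):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct LFNode
    {
        _Atomic(struct LFNode *) next;
        int                      buf_id;
    } LFNode;

    /*
     * Unlink 'node', whose current predecessor is 'prev', by swinging
     * prev->next past it.  If another backend changed prev->next in the
     * meantime, the CAS fails and the caller re-finds the predecessor and
     * retries -- so contention is per node, not per list.
     */
    static bool
    lf_try_unlink(LFNode *prev, LFNode *node)
    {
        LFNode *expected = node;

        return atomic_compare_exchange_strong(&prev->next, &expected,
                                              atomic_load(&node->next));
    }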

-Neil




Re: Thinking about breaking up the BufMgrLock

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> We're only concerned with a buffer's access recency when it is on the
> free list, right?

Right; pinned buffers are not logically part of the freelist.  (Whether
they are so physically is open to choice.)

Another free variable, AFAICS, is whether to do the move-to-front during
ReadBuffer or during ReleaseBuffer.  You could possibly argue that the
latter would be slightly more accurate, but it probably doesn't matter
much.

> This means we need to acquire the LRU lock once for each acquire/release
> pair on a previously-unpinned buffer (rather than twice, if we had to
> acquire the LRU lock on acquire as well).

I think this is true of any scheme that separates the lookup table from
LRU management: you hit the LRU overhead during acquire, or during
release, but not both.  I suspect that we need more than a 2X reduction
in contention to fix the problem, though.

> What operations does 2Q require on the shared lists?

Remove from list, insert at head of list (often but not always paired as
a move-to-front operation).  I think we also have an insert-at-tail case
for VACUUM.  One difficulty is that we need to maintain a list length
counter, which is a central point of contention except for the
move-to-front-of-same-list case.

> For example, to remove a node from the middle of a linked
> list can be done via atomic CAS; you need to redo the CAS in the
> presence of a concurrent modification of the particular node you're
> trying to modify, but that means we are contending over individual list
> nodes, not the list as a whole.

Unfortunately, most of the time you need to touch the list head, so I'm
unclear that this buys us much, at least if we try to preserve true
"move to front" behavior.  It might be interesting to consider "move
ahead one place" instead of "move to front" --- that would avoid the
pressure on the list head.  Again you have to worry whether you're
blowing off too much performance in the cache management algorithm...
        regards, tom lane


Re: Thinking about breaking up the BufMgrLock

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> ReadBuffer needs to do a lookup to map the page ID to a buffer ID,
> which in principle requires only a shared lock on the page-to-buffer
> mapping (embodied in the buf_table hash table).  Assuming success, it
> also needs to mark the buffer pinned and update the LRU-list position
> of the buffer.  Marking pinned is certainly a buffer-local change,
> so we could imagine splitting up the BufMgrLock like this:

So the only reason you need the global lock is for the LRU updates?

This is a well understood problem. I remember it from my Systems class in
school. And searching on google finds lecture notes that match my memory that
there are other systems generally preferred to LRU precisely because they
don't require extensive list management in the critical path.

For example these slides:

http://www.cs.utexas.edu/users/lorenzo/corsi/cs372/03F/notes/11-20.pdf


They describe a few that could be relevant. I recall the clock sweep having
been recommended in class as having most of the best properties of LRU with
very low cost in the critical path.

To use a buffer you only have to set a single local flag indicating the buffer
has been used. A separate step advances circularly, one buffer at a time,
clearing the flags. If it finds a buffer whose flag has stayed clear for a
complete cycle through the list, it can evict it, either removing it entirely
or merely demoting it to a second list.
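
A toy version of that clock sweep, ignoring pins and locking entirely
(NBUFFERS, used[] and clock_hand are invented names for this sketch):

    #include <stdbool.h>

    #define NBUFFERS 1024

    static bool used[NBUFFERS];     /* per-buffer "recently used" flag */
    static int  clock_hand = 0;

    /* Critical path: a single local flag write, no list manipulation. */
    static inline void
    clock_touch(int buf_id)
    {
        used[buf_id] = true;
    }

    /* Separate step: advance circularly, clearing flags; a buffer whose flag
     * stayed clear for a whole revolution is the next candidate. */
    static int
    clock_find_victim(void)
    {
        for (;;)
        {
            int id = clock_hand;

            clock_hand = (clock_hand + 1) % NBUFFERS;
            if (used[id])
                used[id] = false;   /* second chance */
            else
                return id;
        }
    }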



-- 
greg



Re: Thinking about breaking up the BufMgrLock

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I recall the clock sweep having
> been recommended in class as having most of the best properties of LRU with
> very low cost in the critical path.

Right.  The "pending move to front" idea that I suggested is basically a
variant of a clock algorithm: it takes two trips through the LRU list
before a page falls off and becomes eligible for replacement.  (It's
even closer to the "second chance" clock algorithm.)

The $64 question with any of these things is how much performance at the
cache-management level are we willing to give up in order to gain
low-level efficiency?  We probably don't want to go very far in that
direction.  But maybe a clock scan will be a good compromise.
        regards, tom lane


Re: Thinking about breaking up the BufMgrLock

From
"Jim C. Nasby"
Date:
On Sun, Feb 06, 2005 at 10:53:38PM -0500, Greg Stark wrote:
> This is a well understood problem. I remember it from my Systems class in
> school. And searching on google finds lecture notes that match my memory that
> there are other systems generally preferred to LRU precisely because they
> don't require extensive list management in the critical path.
> 
> For example these slides:
> 
> http://www.cs.utexas.edu/users/lorenzo/corsi/cs372/03F/notes/11-20.pdf

It might be worth taking a gander at the appropriate sections of either
The Design and Implementation of the 4.4 BSD Operating System
(http://lnk.nu/amazon.com/1ds) or the 4.3 version
(http://lnk.nu/amazon.com/1dt). I've read the 4.3 one, and there's a lot
of discussion of the buffer management algorithms chosen, and why they
were chosen.
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: Thinking about breaking up the BufMgrLock

From
"Simon Riggs"
Date:
>Neil Conway writes
> We're only concerned with a buffer's access recency when it is on the
> free list, right? (That is certainly true with naive LRU; I confess I
> haven't had a chance to grok the 2Q paper yet). If so:

> - we need only update a pinned buffer's LRU position when its shared
> refcount drops to zero, not on every ReleaseBuffer()
>
> This means we need to acquire the LRU lock once for each acquire/release
> pair on a previously-unpinned buffer (rather than twice, if we had to
> acquire the LRU lock on acquire as well). Not sure if that's enough to
> remove the contention problem; on the plus side, it doesn't change the
> actual behavior of the replacement policy.

This is a fantastic idea, well done.

> - when we need to fault in a page from disk, acquire the LRU lock and
> select a buffer for replacement. At this point we can do some physical
> cleanup of the free list, by removing buffers with a non-zero refcount
> from the free list. We can do this cleanup relatively cheaply now,
> because we had to acquire the LRU lock anyway, and this is the slow path
> of ReadBuffer(). This work might also be done by the bgwriter (removing
> clean but pinned buffers from close to the LRU of the free list).

I'm not sure I understand this. Removing clean buffers doesn't cost
much; it's the removal of dirty ones that costs, surely? There is no work
to remove a clean buffer.

Best Regards, Simon Riggs



Re: Thinking about breaking up the BufMgrLock

From
"Simon Riggs"
Date:
>Tom Lane
> But without a smart idea about data structures I don't see how to do
> better.
>
> Thoughts?
>

Hmm, seems like you summed up the lack of algorithmic choices very well.

After much thought, I believe the best way is to implement bufferpools
(BPs). That is, we don't just have one bufferhash and one LRU, we have
many pairs. We then work out some mapping by which a block can be known
to be in a particular BP, then acquire the lock for that BP.

Using BPs means that each block is only ever associated with one BP.
Hence, there is still only a single BPBufMgrLock to Acquire/Release, but
there could be considerably less contention for a specific lock.

BPs can be implemented manually or automatically. Commercial systems
provide facilities for manual placement of specific tables into specific
bufferpools. That is more work, so I would favour automatically assigned
BPs, though leaving the door open to further extension. It seems simple
to use one or more of the numbers associated with a block as input to a
hash algorithm to identify the BP in which to locate the block. After
some thinking, using relationId seems to be the best candidate. That
would mean an index or sequential scan would tend to re-acquire the same
BP lock repeatedly.

There is some possibility that the multiple BP LRUs could become
unbalanced, so very short LRUs would not benefit as much from splitting
up in this way. So, I suggest linking the number of BPs to the number of
shared buffers, say 1 BP per 10,000 shared_buffers, up to a maximum of
8.
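
By way of a hedged illustration (all names invented; it assumes
PostgreSQL's Oid type), the BP selection could be as simple as hashing the
relation's OID, with the pool count derived from shared_buffers as
suggested above:

    #define MAX_BUFFERPOOLS 8

    /* roughly 1 bufferpool per 10,000 shared_buffers, capped at 8 */
    static int
    sketch_num_bufferpools(int nbuffers)
    {
        int n = nbuffers / 10000;

        return (n < 1) ? 1 : (n > MAX_BUFFERPOOLS) ? MAX_BUFFERPOOLS : n;
    }

    /* every block of a relation maps to the same BP, so an index or
     * sequential scan keeps re-acquiring the same BP lock */
    static int
    sketch_bufferpool_for(Oid relationId, int num_bufferpools)
    {
        return (int) (relationId % num_bufferpools);
    }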

I believe a self-balancing algorithm between the multiple lists would be
a fairly straightforward addition as well.

ISTM that there is no chance of deadlock in using BPs. The current code
assumes that all blocks are accessed via a single BufMgrLock. There is
never an attempt to acquire BufMgrLock when it is already held, so there
is no chance of deadlock.

Overall, this idea would reduce BufMgrLock contention by as many
multiples as you choose, as well as allowing more practical use of
larger shared_buffers settings which would further drive up performance.
The costs need not be incurred at all for smaller shared_buffers and the
current situation (num_bufferpools = 1) can remain the default.

Also, the current code already allows for multiple lists, so extending
the number from 4 (or 3) to multiples seems more straightforward.

One other consideration is the eviction of dirty buffers, but more
detail on that in another post. What should be observed there is that
dirty buffer evictions perform Acquire/Release of the BufMgrLock at
least twice. Improving the bgwriter will also have an effect on
BufMgrLock contention.

Best Regards, Simon Riggs



Re: Thinking about breaking up the BufMgrLock

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> After much thought, I believe the best way is to implement bufferpools
> (BPs). That is, we don't just have one bufferhash and one LRU, we have
> many pairs. We then work out some mapping by which a block can be known
> to be in a particular BP, then acquire the lock for that BP.

I think this is basically wrongheaded, because it achieves its reduction
in contention by a one-for-one sacrifice of cache allocation efficiency;
that is, any individual page is now fighting for survival in a pool only
1/Nth as large as before.  We're going to lose more in I/O than we can
hope to save at the processor level.

I think a clock algorithm or something similar may be a reasonable way
to go, though.
        regards, tom lane


Re: Thinking about breaking up the BufMgrLock

From
Pailloncy Jean-Gerard
Date:
> What operations does 2Q require on the shared lists? (Assuming that's
> the replacement policy we're going with.) Depending on how complex the
> list modifications are, non-blocking algorithms might be worth
> considering. For example, to remove a node from the middle of a linked
> list can be done via atomic CAS; you need to redo the CAS in the
> presence of a concurrent modification of the particular node you're
> trying to modify, but that means we are contending over individual list
> nodes, not the list as a whole.
If you plan to use CAS for a lock-free algorithm, there was a thread
about concurrent lock-free algorithms a few days ago:
http://archives.postgresql.org/pgsql-hackers/2005-01/msg00736.php

I give some references to new papers I found on wait-free
algorithms.

I think we can adapt the wait-free GC described at
http://www.cs.chalmers.se/~phs/TechnicalReports/Sun04_WaitFreeRef.pdf
to the cache list.

More generally, I think a deep study of this recent work on
wait-free algorithms would be a big win.

Cordialement,
Jean-Gérard Pailloncy



Re: Thinking about breaking up the BufMgrLock

From
"Simon Riggs"
Date:
>Tom Lane
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > After much thought, I believe the best way is to implement bufferpools
> > (BPs). That is, we don't just have one bufferhash and one LRU, we have
> > many pairs. We then work out some mapping by which a block can be known
> > to be in a particular BP, then acquire the lock for that BP.
>
> I think this is basically wrongheaded, because it achieves its reduction
> in contention by a one-for-one sacrifice of cache allocation efficiency;
> that is, any individual page is now fighting for survival in a pool only
> 1/Nth as large as before.  We're going to lose more in I/O than we can
> hope to save at the processor level.

Well, as you might expect, I disagree. I also expected that reaction:
ISTM that once all the good ideas are used up (as you carefully explained
was the case), the next good idea must be a more offbeat one (to
paraphrase Sherlock Holmes).

I do agree that the goal is to reduce contention without increasing I/O.

With BPs: yes, an individual block is fighting for survival in a smaller
pool, but the number of blocks that might want to go into that pool is
reduced by the same ratio. With a small shared_buffers, shorter lists
might be a problem, but sizing it large enough would take away some of
those issues as well. I think a straightforward list-balancing algorithm
would reduce any imbalance across lists.

BPs would not give much benefit on a single CPU; my goal here is to
increase SMP performance.

Overall, performance results must be the final arbiter in what is best.

Best Regards, Simon Riggs



Re: Thinking about breaking up the BufMgrLock

From
Kenneth Marshall
Date:
On Sun, Feb 06, 2005 at 07:30:37PM -0500, Tom Lane wrote:
> 
> ReadBuffer needs to do a lookup to map the page ID to a buffer ID,
> which in principle requires only a shared lock on the page-to-buffer
> mapping (embodied in the buf_table hash table).  Assuming success, it
> also needs to mark the buffer pinned and update the LRU-list position
> of the buffer.  Marking pinned is certainly a buffer-local change,
> so we could imagine splitting up the BufMgrLock like this:
> 
> 1. A global LWLock for the page-to-buffer mapping, call it say
> BufMappingLock.  Share lock on this is sufficient to allow reading the
> hash table; must get exclusive lock when reassigning any buffer's page
> identity.
> 
> 2. A global LWLock for the LRU strategy information, say BufLRULock
> or BufStrategyLock.
> 
> 3. Per-buffer LWLocks (or maybe just spinlocks) protecting the state of
> each buffer header; you need this lock to examine or change a buffer
> header.
> 
> ReleaseBuffer has no locking problems in this formulation: it just grabs
> the per-buffer lock, decrements its refcount, releases the lock.
> 
For the per-buffer case, a latch would provide a lightweight method of
updating the contents of the buffer without hampering read-only access. A
latch consists of a latch bit and a sequence number that can be set in an
atomic action. The flow for the two cases is simple:

Write:
1. Get latch.
2. Update the buffer.
3. Increment the sequence number.
4. Release the latch.

Read:
1. Read version number.
2. Read buffer.
3. Check latch. If latched, go to 1.
4. If version number has changed, go to 1.

By using this process, readers will only ever see a consistent state of
the buffer. Also, since the read does not entail a write operation, it
will not cause a cache line update or contribute to a cache update
"storm". The "get latch" operation can be implemented using an atomic
operation such as TAS (test-and-set) or CAS (compare-and-set).
This would give readers extremely lightweight access to the
buffer - no cache line update hit. If you need sequenced access
to the buffer, then you would need to use LWLocks, but in many cases,
such as item 3 in Tom's list, a latch would work well.
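
A sketch of that latch protocol using C11 atomics (an odd sequence value
plays the latch bit; memory-ordering details and making the protected
fields themselves atomic are glossed over, and the names are invented):

    #include <stdatomic.h>

    typedef struct
    {
        atomic_uint seq;          /* odd == latched */
        int         payload;      /* stand-in for the protected buffer fields */
    } SketchLatch;

    static void
    latch_write(SketchLatch *l, int new_payload)
    {
        unsigned s;

        /* 1. get latch: make the sequence odd via CAS */
        do {
            s = atomic_load(&l->seq) & ~1u;
        } while (!atomic_compare_exchange_weak(&l->seq, &s, s + 1));

        l->payload = new_payload;      /* 2. update the buffer */
        atomic_store(&l->seq, s + 2);  /* 3+4. bump sequence, release latch */
    }

    static int
    latch_read(SketchLatch *l)
    {
        unsigned s;
        int      v;

        do {
            s = atomic_load(&l->seq);        /* 1. read version number   */
            v = l->payload;                  /* 2. read buffer           */
        } while ((s & 1) ||                  /* 3. latched? retry        */
                 atomic_load(&l->seq) != s); /* 4. version changed? retry */
        return v;
    }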

> ReadBuffer looks like:
> 
>     * Acquire share lock on BufMappingLock.
>     * Search hash table for desired ID.  (we assume success)
>     * acquire per-buffer lock.
>     * increment buffer refcount.
>     * release per-buffer lock.
>     * release share lock on BufMappingLock.
>     * update the LRU state.
> 
> (We have to bump the pin count on the target buffer before releasing the
> BufMappingLock, otherwise someone could reassign the buffer as soon as
> we release BufMappingLock.)
> 
> This would be pretty good from a locking point of view, except that
> "update the LRU state" seems to require taking an exclusive lock on a
> global data structure, which puts us about back where we were.
> Contention for a global BufLRULock might be a bit better than for the
> existing overall BufMgrLock, but it'll still cripple the performance
> of ReadBuffer.
> 
> Perhaps someone knows of a neat data structure that can maintain an LRU
> list with only local updates?  I don't though.
> 
The clock algorithm is pretty close to this and provides an approximation
to LRU that eliminates the need to move buffers to the MRU position by
using a reference bit.

> This would convert the existing "strict LRU" behavior into an
> "approximate LRU".  I'm worried that the change might be penny-wise and
> pound-foolish: if a poorer cache management algorithm causes us to have
> to do more I/O, it's probably a net loss to save some lock contention.
> But without a smart idea about data structures I don't see how to do
> better.
> 

One alternative to an approximate LRU, such as the clock algorithm, would
be to have multiple buffer pools, as we discussed in the previous thread.
The contention would be reduced by roughly a factor of N, where N is the
number of pools. Of course, buffers should be allocated in a fashion that
maximizes locality and minimizes the effect of scan cache pollution.

More food for thought.

Ken