Re: Design notes for BufMgrLock rewrite - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Design notes for BufMgrLock rewrite
Date
Msg-id 1109025309.3801.133.camel@localhost.localdomain
Whole thread Raw
In response to Design notes for BufMgrLock rewrite  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Design notes for BufMgrLock rewrite
Re: Design notes for BufMgrLock rewrite
List pgsql-hackers
My understanding from this is:
If we have a buffer cache hit ratio of 93%, then we should expect:
- 93% of buffer requests to require only shared BufMappingLocks
- 7% of buffer requests would require an exclusive BufFreelistLock then
an exclusive BufMappingLock.

That seems like an improvement would come from allowing multiple
successful simultaneous cache hits, which would be welcome and that is a
very good thing. 

I also like the simplicity with which the bgwriter will be able to
easily stay ahead of the clock sweep, writing out dirty buffers, without
taking exclusive system-wide locks. This would produce a
StrategyDirtyBufferList design that is not dependant upon size of
shared_buffers, further improving the efficacy of this new design since
it will allow us to further increase shared_buffers.

ISTM this new design will increase scalability directly in line with the
cache hit ratio, but would still suffer from poor scalability for cache
misses. That concerns me, since a large table scan would require all-
exclusive locks to complete its scan, since it will typically be 95%+
cache misses. That would mean that well tuned OLTP applications could be
more scalable, but DW or mixed applications would not be. Servers with
few CPUs would not see this difference in as marked a way as higher-end
servers.

The cache would also be spoiled from scans, though I think we can handle
those the same way as Vacuum.

This design seems to be a clear improvement on the current design. I am
still encouraged that the freelist structures should be subdivided into
many smaller pieces, thereby producing finer grained locks (the earlier
bufferpools proposal). This could be implemented as an additional
feature on top of this patch, or as an alternate design on cvstip.

[It might be worth having separate bufferpools for indexes and heap
blocks, so that seq scans never effect the index cache.]

Whatever is done from here, I think it is certain that we can improve
things by providing hints from the higher code layers down to the buffer
management layer, as everybody keeps suggesting for Vacuum.

[I'm assuming that there are no system-wide locks held across I/Os, that
bit seems a bit unclear from the description]


Best Regards, Simon Riggs



On Sun, 2005-02-13 at 17:07 -0500, Tom Lane wrote:
> I'm working on an experimental patch to break up the BufMgrLock along
> the lines we discussed a few days ago --- in particular, using a clock
> sweep algorithm instead of LRU lists for the buffer replacement strategy.
> I started by writing up some design notes, which are attached for
> review in case anyone has better ideas.
> 
> One thing I realized quickly is that there is no natural way in a clock
> algorithm to discourage VACUUM from blowing out the cache.  I came up
> with a slightly ugly idea that's described below.  Can anyone do better?
> 
>             regards, tom lane
> 
> 
> Buffer manager's internal locking
> ---------------------------------
> 
> Before PostgreSQL 8.1, all operations of the shared buffer manager itself
> were protected by a single system-wide lock, the BufMgrLock, which
> unsurprisingly proved to be a source of contention.  The new locking scheme
> avoids grabbing system-wide exclusive locks in common code paths.  It works
> like this:
> 
> * There is a system-wide LWLock, the BufMappingLock, that notionally
> protects the mapping from buffer tags (page identifiers) to buffers.
> (Physically, it can be thought of as protecting the hash table maintained
> by buf_table.c.)  To look up whether a buffer exists for a tag, it is
> sufficient to obtain share lock on the BufMappingLock.  Note that one
> must pin the found buffer, if any, before releasing the BufMappingLock.
> To alter the page assignment of any buffer, one must hold exclusive lock
> on the BufMappingLock.  This lock must be held across adjusting the buffer's
> header fields and changing the buf_table hash table.  The only common
> operation that needs exclusive lock is reading in a page that was not
> in shared buffers already, which will require at least a kernel call
> and usually a wait for I/O, so it will be slow anyway.
> 
> * A separate system-wide LWLock, the BufFreelistLock, provides mutual
> exclusion for operations that access the buffer free list or select
> buffers for replacement.  This is always taken in exclusive mode since
> there are no read-only operations on those data structures.  The buffer
> management policy is designed so that BufFreelistLock need not be taken
> except in paths that will require I/O, and thus will be slow anyway.
> (Details appear below.)  It is never necessary to hold the BufMappingLock
> and the BufFreelistLock at the same time.
> 
> * Each buffer header contains a spinlock that must be taken when examining
> or changing fields of that buffer header.  This allows operations such as
> ReleaseBuffer to make local state changes without taking any system-wide
> lock.  We use a spinlock, not an LWLock, since there are no cases where
> the lock needs to be held for more than a few instructions.
> 
> Note that a buffer header's spinlock does not control access to the data
> held within the buffer.  Each buffer header also contains an LWLock, the
> "buffer context lock", that *does* represent the right to access the data
> in the buffer.  It is used per the rules above.
> 
> There is yet another set of per-buffer LWLocks, the io_in_progress locks,
> that are used to wait for I/O on a buffer to complete.  The process doing
> a read or write takes exclusive lock for the duration, and processes that
> need to wait for completion try to take shared locks (which they release
> immediately upon obtaining).  XXX on systems where an LWLock represents
> nontrivial resources, it's fairly annoying to need so many locks.  Possibly
> we could use per-backend LWLocks instead (a buffer header would then contain
> a field to show which backend is doing its I/O).
> 
> 
> Buffer replacement strategy
> ---------------------------
> 
> There is a "free list" of buffers that are prime candidates for replacement.
> In particular, buffers that are completely free (contain no valid page) are
> always in this list.  We may also throw buffers into this list if we
> consider their pages unlikely to be needed soon.  The list is singly-linked
> using fields in the buffer headers; we maintain head and tail pointers in
> global variables.  (Note: although the list links are in the buffer headers,
> they are considered to be protected by the BufFreelistLock, not the
> buffer-header spinlocks.)  To choose a victim buffer to recycle when there
> are no free buffers available, we use a simple clock-sweep algorithm, which
> avoids the need to take system-wide locks during common operations.  It
> works like this:
> 
> Each buffer header contains a "recently used" flag bit, which is set true
> whenever the buffer is unpinned.  (Setting this bit requires only the
> buffer header spinlock, which would have to be taken anyway to decrement
> the buffer reference count, so it's nearly free.)
> 
> The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
> through all the available buffers.  NextVictimBuffer is protected by the
> BufFreelistLock.
> 
> The algorithm for a process that needs to obtain a victim buffer is:
> 
> 1. Obtain BufFreelistLock.
> 
> 2. If buffer free list is nonempty, remove its head buffer.  If the buffer
> is pinned or has its "recently used" bit set, it cannot be used; ignore
> it and return to the start of step 2.  Otherwise, pin the buffer,
> release BufFreelistLock, and return the buffer.
> 
> 3. Otherwise, select the buffer pointed to by NextVictimBuffer, and
> circularly advance NextVictimBuffer for next time.
> 
> 4. If the selected buffer is pinned or has its "recently used" bit set,
> it cannot be used.  Clear its "recently used" bit and return to step 3
> to examine the next buffer.
> 
> 5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
> 
> (Note that if the selected buffer is dirty, we will have to write it out
> before we recycle it; if someone else pins the buffer meanwhile we will
> have to give up and try another buffer.  This however is not a concern
> of the basic select-a-victim-buffer algorithm.)
> 
> This scheme selects only victim buffers that have gone unused since they
> were last passed over by the "clock hand".
> 
> A special provision is that while running VACUUM, a backend does not set the
> "recently used" bit on buffers it accesses.  In fact, if ReleaseBuffer sees
> that it is dropping the pin count to zero and the "recently used" bit is not
> set, then it appends the buffer to the tail of the free list.  (This implies
> that VACUUM, but only VACUUM, must take the BufFreelistLock during
> ReleaseBuffer; this shouldn't create much of a contention problem.)  This
> provision encourages VACUUM to work in a relatively small number of buffers
> rather than blowing out the entire buffer cache.  It is reasonable since a
> page that has been touched only by VACUUM is unlikely to be needed again
> soon.
> 
> Since VACUUM usually requests many pages very fast, the effect of this is that
> it will get back the very buffers it filled and possibly modified on the next
> call and will therefore do its work in a few shared memory buffers, while
> being able to use whatever it finds in the cache already.  This also implies
> that most of the write traffic caused by a VACUUM will be done by the VACUUM
> itself and not pushed off onto other processes.
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org



pgsql-hackers by date:

Previous
From: "Magnus Hagander"
Date:
Subject: Re: sigint psql
Next
From: Simon Riggs
Date:
Subject: Re: 8.0.X and the ARC patent