RFC: bufmgr locking changes - Mailing list pgsql-hackers
From | Neil Conway |
---|---|
Subject | RFC: bufmgr locking changes |
Date | |
Msg-id | 87k7431j3e.fsf@mailbox.samurai.com Whole thread Raw |
Responses |
Re: RFC: bufmgr locking changes
Re: RFC: bufmgr locking changes |
List | pgsql-hackers |
I've attached a (gzip'ed) patch that makes the following changes to the buffer manager: (1) Overhaul locking; whenever the code needs to modify the state of an individual buffer, do synchronization via a per-buffer "meta data lock" rather than the global BufMgrLock. For more info on the motivation for this, see the several recent -hackers threads on the subject. (2) Reduce superfluous dirtying of pages: previously, LockBuffer(buf, LW_EXCLUSIVE) would automatically dirty the buffer. This behavior has been removed, to reduce the number of pages that are dirtied. For more info on the motivation for this, see the previous -hackers thread on the topic. (3) Simplify the code in a bunch of places This basically involved rewriting bufmgr.c (which makes the patch a little illegible; it might be easier to just apply it and read through the new code). That also means there are still a fair number of bugs and unresolved issues. The new locking works as follows: - modifying the shared buffer table (buf_table.c) or making calls into the freelist code (freelist.c) still requires holding the BufMgrLock. There was some talk of trying to add more granular locking to freelist.c itself: if there is still significant contention for the BufMgrLock after these changes have been applied, I'll take a look at doing that - to modify a BufferDesc's meta data (shared refcount, flags, tag, etc.), one must hold the buffer's "meta data lock". Also, I removed the existing "io_in_progress" lock; instead, we now hold the buffer's meta data lock while doing buffer I/O. - if a backend holds the BufMgrLock, it will never try to LWLockAcquire() a per-buffer meta data lock, due to the risk of deadlock (and the loss in concurrency if we got blocked waiting while still holding the BufMgrLock). Instead it does a LWLockConditionalAcquire() and handles an acquisition failure sanely - if a backend holds the meta data lock for a buffer, it CAN attempt to LWLockAcquire() the BufMgrLock. This is safe from deadlocks, due to the previous paragraph. The code is still a work-in-progress (running `pgbench -s 30 -c 20 -t 1000` bails out pretty quickly, for example), but any comments on whether this scheme is in-theory correct would be very welcome. For #2, my understanding of the existing XLOG code is incomplete, so perhaps someone can tell me if I'm on the right track. I've modified a few of the XLogInsert() call sites so that they now: - acquire an exclusive buffer cntx_lock - modify the content of the page/buffer - WriteNoReleaseBuffer() to mark the buffer as dirty; since we still hold the buffer's cntx_lock, a checkpoint process can't write this page to disk yet. This replaces the implicit "mark this page as dirty" behavior of LockBuffer() - do the XLogInsert() - release the cntx_lock - ReleaseBuffer() to decrement the buffer's pincount For example, sequence.c now works like this (I've probably missed a few places) -- is this correct, and/or is there a better way to do it? Notes, and caveats: - Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now completely equivalent to WriteNoReleaseBuffer(), so I just removed the former and replaced all the calls to it with calls to the later. - Moved the code for flushing a dirty buffer to disk to buf_io.c - Moved UnpinBuffer() and PinBuffer() to bufmgr.c, from freelist.c - Removed the following now-unused buffer flag bits: BM_VALID, BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the 'flags' field of the BufferDesc down to 8 bits (from 16) - Removed the cntxDirty field from the BufferDesc: now that we don't need to acquire the BufMgrLock to modify the buffer's flags, there's no reason to keep this around - Make 'PrivateRefCount' an array of uint16s, rather than longs. This saves 2 bits * shared_buffers per backend on 32-bit machines and 6 bits * shared_buffers per backend on some 64-bit machines. It means a given backend can only pin a single buffer 65,636 times, but that should be more than enough. Similarly, made LocalRefCount an array of uint16s. I was thinking of adding overflow checking to the refcount increment code to make sure we fail safely if a backend *does* try to exceed this number of pins, but I can't imagine a scenario when it would actually occur, so I haven't bothered. - Remove the BufferLocks array. AFAICS this wasn't actually necessary. - A few loose ends in the code still need to be wrapped up (for example, I need to take another glance at the pin-waiter stuff, and the error handling still needs some more work), but I think most of the functionality is there. Areas of concern are denoted by 'XXX' comments. Any comments? -Neil
Attachment
pgsql-hackers by date: