RFC: bufmgr locking changes - Mailing list pgsql-hackers

From Neil Conway
Subject RFC: bufmgr locking changes
Date
Msg-id 87k7431j3e.fsf@mailbox.samurai.com
Whole thread Raw
Responses Re: RFC: bufmgr locking changes
Re: RFC: bufmgr locking changes
List pgsql-hackers
I've attached a (gzip'ed) patch that makes the following changes to
the buffer manager:

    (1) Overhaul locking; whenever the code needs to modify the state
        of an individual buffer, do synchronization via a per-buffer
        "meta data lock" rather than the global BufMgrLock. For more
        info on the motivation for this, see the several recent
        -hackers threads on the subject.

    (2) Reduce superfluous dirtying of pages: previously,
        LockBuffer(buf, LW_EXCLUSIVE) would automatically dirty the
        buffer. This behavior has been removed, to reduce the number
        of pages that are dirtied. For more info on the motivation for
        this, see the previous -hackers thread on the topic.

    (3) Simplify the code in a bunch of places

This basically involved rewriting bufmgr.c (which makes the patch a
little illegible; it might be easier to just apply it and read through
the new code). That also means there are still a fair number of bugs
and unresolved issues.

The new locking works as follows:

    - modifying the shared buffer table (buf_table.c) or making calls
      into the freelist code (freelist.c) still requires holding the
      BufMgrLock. There was some talk of trying to add more granular
      locking to freelist.c itself: if there is still significant
      contention for the BufMgrLock after these changes have been
      applied, I'll take a look at doing that

    - to modify a BufferDesc's meta data (shared refcount, flags, tag,
      etc.), one must hold the buffer's "meta data lock". Also, I
      removed the existing "io_in_progress" lock; instead, we now hold
      the buffer's meta data lock while doing buffer I/O.

    - if a backend holds the BufMgrLock, it will never try to
      LWLockAcquire() a per-buffer meta data lock, due to the risk of
      deadlock (and the loss in concurrency if we got blocked waiting
      while still holding the BufMgrLock). Instead it does a
      LWLockConditionalAcquire() and handles an acquisition failure
      sanely

    - if a backend holds the meta data lock for a buffer, it CAN
      attempt to LWLockAcquire() the BufMgrLock. This is safe from
      deadlocks, due to the previous paragraph.

The code is still a work-in-progress (running `pgbench -s 30 -c 20 -t
1000` bails out pretty quickly, for example), but any comments on
whether this scheme is in-theory correct would be very welcome.

For #2, my understanding of the existing XLOG code is incomplete, so
perhaps someone can tell me if I'm on the right track. I've modified a
few of the XLogInsert() call sites so that they now:

    - acquire an exclusive buffer cntx_lock

    - modify the content of the page/buffer

    - WriteNoReleaseBuffer() to mark the buffer as dirty; since we
      still hold the buffer's cntx_lock, a checkpoint process can't
      write this page to disk yet. This replaces the implicit "mark
      this page as dirty" behavior of LockBuffer()

    - do the XLogInsert()

    - release the cntx_lock

    - ReleaseBuffer() to decrement the buffer's pincount

For example, sequence.c now works like this (I've probably missed a
few places) -- is this correct, and/or is there a better way to do it?

Notes, and caveats:

- Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
  completely equivalent to WriteNoReleaseBuffer(), so I just removed
  the former and replaced all the calls to it with calls to the later.

- Moved the code for flushing a dirty buffer to disk to buf_io.c

- Moved UnpinBuffer() and PinBuffer() to bufmgr.c, from freelist.c

- Removed the following now-unused buffer flag bits: BM_VALID,
  BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
  'flags' field of the BufferDesc down to 8 bits (from 16)

- Removed the cntxDirty field from the BufferDesc: now that we don't
  need to acquire the BufMgrLock to modify the buffer's flags, there's
  no reason to keep this around

- Make 'PrivateRefCount' an array of uint16s, rather than longs. This
  saves 2 bits * shared_buffers per backend on 32-bit machines and 6
  bits * shared_buffers per backend on some 64-bit machines. It means
  a given backend can only pin a single buffer 65,636 times, but that
  should be more than enough. Similarly, made LocalRefCount an array
  of uint16s.

  I was thinking of adding overflow checking to the refcount increment
  code to make sure we fail safely if a backend *does* try to exceed
  this number of pins, but I can't imagine a scenario when it would
  actually occur, so I haven't bothered.

- Remove the BufferLocks array. AFAICS this wasn't actually necessary.

- A few loose ends in the code still need to be wrapped up (for
  example, I need to take another glance at the pin-waiter stuff, and
  the error handling still needs some more work), but I think most of
  the functionality is there. Areas of concern are denoted by 'XXX'
  comments.

Any comments?

-Neil

Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Brokenness in parsing of pg_hba.conf
Next
From: Kurt Roeckx
Date:
Subject: Re: RFC: bufmgr locking changes