Thread: RFC: bufmgr locking changes

RFC: bufmgr locking changes

From: Neil Conway
Date: Wed, 7 Jan 2004 17:37:09 -0500
I've attached a (gzip'ed) patch that makes the following changes to
the buffer manager:

    (1) Overhaul locking; whenever the code needs to modify the state
        of an individual buffer, do synchronization via a per-buffer
        "meta data lock" rather than the global BufMgrLock. For more
        info on the motivation for this, see the several recent
        -hackers threads on the subject.

    (2) Reduce superfluous dirtying of pages: previously,
        LockBuffer(buf, LW_EXCLUSIVE) would automatically dirty the
        buffer. This behavior has been removed, to reduce the number
        of pages that are dirtied. For more info on the motivation for
        this, see the previous -hackers thread on the topic.

    (3) Simplify the code in a bunch of places

This basically involved rewriting bufmgr.c (which makes the patch a
little illegible; it might be easier to just apply it and read through
the new code). That also means there are still a fair number of bugs
and unresolved issues.

The new locking works as follows:

    - modifying the shared buffer table (buf_table.c) or making calls
      into the freelist code (freelist.c) still requires holding the
      BufMgrLock. There was some talk of trying to add more granular
      locking to freelist.c itself: if there is still significant
      contention for the BufMgrLock after these changes have been
      applied, I'll take a look at doing that

    - to modify a BufferDesc's meta data (shared refcount, flags, tag,
      etc.), one must hold the buffer's "meta data lock". Also, I
      removed the existing "io_in_progress" lock; instead, we now hold
      the buffer's meta data lock while doing buffer I/O.

    - if a backend holds the BufMgrLock, it will never try to
      LWLockAcquire() a per-buffer meta data lock, due to the risk of
      deadlock (and the loss in concurrency if we got blocked waiting
      while still holding the BufMgrLock). Instead it does a
      LWLockConditionalAcquire() and handles an acquisition failure
      sanely

    - if a backend holds the meta data lock for a buffer, it CAN
      attempt to LWLockAcquire() the BufMgrLock. This is safe from
      deadlocks, due to the previous paragraph.
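
To make the deadlock argument concrete, here is a minimal sketch of the
two directions (the "meta_lock" field name is illustrative, not
necessarily what the patch uses):

    static bool
    modify_buf_while_holding_BufMgrLock(BufferDesc *buf)
    {
        /* never block on a meta data lock while holding BufMgrLock */
        if (!LWLockConditionalAcquire(buf->meta_lock, LW_EXCLUSIVE))
            return false;           /* caller must cope with failure */
        /* ... modify buf->flags, buf->refcount, buf->tag ... */
        LWLockRelease(buf->meta_lock);
        return true;
    }

    static void
    touch_shared_state_while_holding_meta_lock(void)
    {
        /*
         * Blocking on BufMgrLock here is deadlock-free: no BufMgrLock
         * holder ever blocks on a meta data lock (see above), so no
         * lock-ordering cycle can form.
         */
        LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
        /* ... consult the shared buffer table or freelist ... */
        LWLockRelease(BufMgrLock);
    }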

The code is still a work-in-progress (running `pgbench -s 30 -c 20 -t
1000` bails out pretty quickly, for example), but any comments on
whether this scheme is in-theory correct would be very welcome.

For #2, my understanding of the existing XLOG code is incomplete, so
perhaps someone can tell me if I'm on the right track. I've modified a
few of the XLogInsert() call sites so that they now:

    - acquire an exclusive buffer cntx_lock

    - modify the content of the page/buffer

    - WriteNoReleaseBuffer() to mark the buffer as dirty; since we
      still hold the buffer's cntx_lock, a checkpoint process can't
      write this page to disk yet. This replaces the implicit "mark
      this page as dirty" behavior of LockBuffer()

    - do the XLogInsert()

    - release the cntx_lock

    - ReleaseBuffer() to decrement the buffer's pincount

For example, sequence.c now works like this (I've probably missed a
few places) -- is this correct, and/or is there a better way to do it?
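
In code form, the pattern looks roughly like this (loosely modeled on
sequence.c; the rmgr id, record type, and rdata payload are just
placeholders):

    static void
    example_logged_update(Buffer buf, Page page, XLogRecData *rdata)
    {
        XLogRecPtr  recptr;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);    /* cntx_lock */

        /* ... modify the page contents ... */

        /*
         * Dirty the buffer while still holding the cntx_lock, so a
         * checkpoint can't flush the page before its WAL record
         * exists; this replaces the implicit dirtying that
         * LockBuffer() used to do.
         */
        WriteNoReleaseBuffer(buf);

        recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG, rdata);
        PageSetLSN(page, recptr);

        LockBuffer(buf, BUFFER_LOCK_UNLOCK);       /* drop cntx_lock */
        ReleaseBuffer(buf);                        /* drop the pin */
    }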

Notes, and caveats:

- Removed SetBufferCommitInfoNeedsSave(). AFAICS, this is now
  completely equivalent to WriteNoReleaseBuffer(), so I just removed
  the former and replaced all the calls to it with calls to the latter.

- Moved the code for flushing a dirty buffer to disk to buf_io.c

- Moved UnpinBuffer() and PinBuffer() to bufmgr.c, from freelist.c

- Removed the following now-unused buffer flag bits: BM_VALID,
  BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
  'flags' field of the BufferDesc down to 8 bits (from 16)

- Removed the cntxDirty field from the BufferDesc: now that we don't
  need to acquire the BufMgrLock to modify the buffer's flags, there's
  no reason to keep this around

- Made 'PrivateRefCount' an array of uint16s, rather than longs. This
  saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
  bytes * shared_buffers per backend on some 64-bit machines. It means
  a given backend can only pin a single buffer 65,535 times, but that
  should be more than enough. Similarly, made LocalRefCount an array
  of uint16s.

  I was thinking of adding overflow checking to the refcount increment
  code to make sure we fail safely if a backend *does* try to exceed
  this number of pins, but I can't imagine a scenario when it would
  actually occur, so I haven't bothered.
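
  For illustration, the guard would be something like:

      /* hypothetical overflow check on the 16-bit refcount */
      if (PrivateRefCount[b] >= 0xFFFF)
          elog(ERROR, "too many pins on buffer %d", b + 1);
      PrivateRefCount[b]++;

  where 'b' is the buffer's index into the array.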

- Removed the BufferLocks array. AFAICS this wasn't actually necessary.

- A few loose ends in the code still need to be wrapped up (for
  example, I need to take another glance at the pin-waiter stuff, and
  the error handling still needs some more work), but I think most of
  the functionality is there. Areas of concern are denoted by 'XXX'
  comments.

Any comments?

-Neil


Re: RFC: bufmgr locking changes

From: Kurt Roeckx
On Wed, Jan 07, 2004 at 05:37:09PM -0500, Neil Conway wrote:
> 
>     - if a backend holds the BufMgrLock, it will never try to
>       LWLockAcquire() a per-buffer meta data lock, due to the risk of
>       deadlock (and the loss in concurrency if we got blocked waiting
>       while still holding the BufMgrLock). Instead it does a
>       LWLockConditionalAcquire() and handles an acquisition failure
>       sanely

I can only see this working if you always check that you actually
got the right buffer after you lock the meta data.  There is
nothing preventing another backend from reusing the buffer for
something else between the time you release the BufMgrLock and
the time you lock the buffer's meta data.
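
That is, after getting the meta data lock you would need a re-check,
something like this (tag-equality test sketched here; the exact macro
may differ):

    for (;;)
    {
        /* ... find the candidate buffer under BufMgrLock, then ... */
        LWLockRelease(BufMgrLock);
        LWLockAcquire(buf->meta_lock, LW_EXCLUSIVE);
        if (BUFFERTAGS_EQUAL(&buf->tag, &wanted_tag))
            break;              /* still the page we wanted */
        /* buffer was recycled for another page; start over */
        LWLockRelease(buf->meta_lock);
        LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
    }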

Note that your PinBuffer() is now always called when you already
hold the meta data lock, which is probably a good idea, or you're
going to make that function more complex than it should be.  Just
say that it should hold the meta data lock instead of the BufMgrLock
when it's called.


Kurt



Re: RFC: bufmgr locking changes

From: Tom Lane
Neil Conway <neilc@samurai.com> writes:
>     - to modify a BufferDesc's meta data (shared refcount, flags, tag,
>       etc.), one must hold the buffer's "meta data lock". Also, I
>       removed the existing "io_in_progress" lock; instead, we now hold
>       the buffer's meta data lock while doing buffer I/O.

The latter is a really bad idea IMHO.  The io_in_progress lock can be
held for eons (in CPU terms) and should not be blocking people who
simply want to bump their refcount up and down.  In particular, there is
no reason for an in-process write to block people who want read-only
access to the buffer.

It's possible that you could combine the io_in_progress lock with the
cntx_lock, but I doubt locking out access to the metadata during i/o
is a good idea.

>     - if a backend holds the BufMgrLock, it will never try to
>       LWLockAcquire() a per-buffer meta data lock, due to the risk of
>       deadlock (and the loss in concurrency if we got blocked waiting
>       while still holding the BufMgrLock). Instead it does a
>       LWLockConditionalAcquire() and handles an acquisition failure
>       sanely

Hmm, what's "sanely" exactly?  It seems to me that figuring out how to
not need to lock in this direction is a critical part of the redesign,
so you can't just handwave here and expect people to understand.

> - Removed SetBufferCommitInfoNeedsSave(). AFAICS, this is now
>   completely equivalent to WriteNoReleaseBuffer(), so I just removed
>   the former and replaced all the calls to it with calls to the latter.

The reason I've kept the separation was as a form of documentation as to
the reason for each write.  Although they currently do the same thing,
that might not always be true.  I'd prefer not to eliminate the
distinction from the source code --- though I'd not object if you want
to make SetBufferCommitInfoNeedsSave a macro that invokes the other
routine.

> - Removed the following now-unused buffer flag bits: BM_VALID,
>   BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
>   'flags' field of the BufferDesc down to 8 bits (from 16)

I cannot believe that it's workable to remove BM_JUST_DIRTIED.  How are
you handling the race conditions that this was for?  I'm also
unconvinced that removing BM_IO_IN_PROGRESS is a good idea.

> - Made 'PrivateRefCount' an array of uint16s, rather than longs. This
>   saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
>   bytes * shared_buffers per backend on some 64-bit machines. It means
>   a given backend can only pin a single buffer 65,535 times, but that
>   should be more than enough. Similarly, made LocalRefCount an array
>   of uint16s.

I think both of these are ill-considered micro-optimization.  How do you
know that the pin count can't exceed 64K?  Consider recursive plpgsql
functions for a likely counterexample.

>   I was thinking of adding overflow checking to the refcount increment
>   code to make sure we fail safely if a backend *does* try to exceed
>   this number of pins, but I can't imagine a scenario when it would
>   actually occur, so I haven't bothered.

Leaving the arrays as longs is a much saner approach IMHO.

> - Removed the BufferLocks array. AFAICS this wasn't actually necessary.

Please put that back.  It is there to avoid unnecessary acquisitions of
buffer locks during UnlockBuffers (which is executed during any
transaction abort).  Without it, you will be needing to lock every
single buffer during an abort in order to check its flags.
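
Roughly, the array lets UnlockBuffers() skip almost every buffer with a
purely local test (a sketch; details per the existing code):

    int     i;

    for (i = 0; i < NBuffers; i++)
    {
        if (BufferLocks[i] == 0)
            continue;           /* purely local test, no lock taken */
        /* ... only now touch this buffer's shared lock state ... */
    }

Without BufferLocks, that cheap skip goes away.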
        regards, tom lane


Re: RFC: bufmgr locking changes

From: Neil Conway
(Sorry Tom, I was meaning to reply to you once I'd had a chance to
revise the bufmgr patch; since that seems a fair ways off, I figured
it would be better to respond now.)

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Neil Conway <neilc@samurai.com> writes:
>> we now hold the buffer's meta data lock while doing buffer I/O.
>
> The latter is a really bad idea IMHO.  The io_in_progress lock can be
> held for eons (in CPU terms) and should not be blocking people who
> simply want to bump their refcount up and down.

My reasoning was that the contention for the per-buffer meta data lock
should be pretty low. The io_in_progress lock is held when we're
either faulting a page in or flushing a page out. In the first case,
we can't actually use the buffer no matter how we do the locking (its
content is incomplete), so there's no effective loss in
concurrency. In the second case, what kinds of concurrent activity can
we allow on the buffer? (We can allow reads, of course, but I don't
believe we can allow writes.)

However, I'll think some more on this, you (and Jan, who raised this
point a while ago via IRC) are probably correct.

> It's possible that you could combine the io_in_progress lock with the
> cntx_lock

Yeah, that's a possibility.

>> - Removed SetBufferCommitInfoNeedsSave(). AFAICS, this is now
>>   completely equivalent to WriteNoReleaseBuffer(), so I just removed
>>   the former and replaced all the calls to it with calls to the latter.
>
> The reason I've kept the separation was as a form of documentation as to
> the reason for each write.  Although they currently do the same thing,
> that might not always be true.  I'd prefer not to eliminate the
> distinction from the source code --- though I'd not object if you want
> to make SetBufferCommitInfoNeedsSave a macro that invokes the other
> routine.

Ok, fair enough -- I've changed SetBufferCommitInfoNeedsSave() to be a
macro for WriteNoReleaseBuffer().
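
Specifically, something like:

    #define SetBufferCommitInfoNeedsSave(buffer) \
        WriteNoReleaseBuffer(buffer)

so the distinct name (and its documentation value) survives at no cost.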

>> - Made 'PrivateRefCount' an array of uint16s, rather than longs. This
>>   saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
>>   bytes * shared_buffers per backend on some 64-bit machines. It means
>>   a given backend can only pin a single buffer 65,535 times, but that
>>   should be more than enough. Similarly, made LocalRefCount an array
>>   of uint16s.
>
> I think both of these are ill-considered micro-optimization.  How do you
> know that the pin count can't exceed 64K?  Consider recursive plpgsql
> functions for a likely counterexample.

Fair enough -- I couldn't conceive of an actual scenario in which
a single backend would acquire > 64K pins on a single buffer, but I'll
take your word that it's not so far fetched. However, there is still
room for improvement, IMHO: on a machine with 64-bit longs, we're
plainly allocating 4 bytes more than we need to. Any objection if I
change these to arrays of int32?

> Please put that back.  It is there to avoid unnecessary acquisitions
> of buffer locks during UnlockBuffers (which is executed during any
> transaction abort).  Without it, you will be needing to lock every
> single buffer during an abort in order to check its flags.

It seems bizarre that we need to iterate through a few thousand array
elements just to do some lock cleanup. I'll take a closer look at
this...

-Neil



Re: RFC: bufmgr locking changes

From: Tom Lane
Neil Conway <neilc@samurai.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> The latter is a really bad idea IMHO.  The io_in_progress lock can be
>> held for eons (in CPU terms) and should not be blocking people who
>> simply want to bump their refcount up and down.

> My reasoning was that the contention for the per-buffer meta data lock
> should be pretty low. The io_in_progress lock is held when we're
> either faulting a page in or flushing a page out. In the first case,
> we can't actually use the buffer no matter how we do the locking (its
> content is incomplete), so there's no effective loss in
> concurrency. In the second case, what kinds of concurrent activity can
> we allow on the buffer? (We can allow reads, of course, but I don't
> believe we can allow writes.)

True, there's no win in the read-busy case, but I think you
underestimate the value of the write-busy case.  Multiple concurrent
readers are a very important consideration.  In Postgres it is possible
for a reader to cause a write to occur (because it sets commit hint
bits, as per the SetBufferCommitInfoNeedsSave() business), and so you
could have a situation like
    Reader pins page
    Reader examines some tuples
    Reader sets a commit bit and dirties page
    ...
                                        Writer starts write
    ...
    Reader examines some more tuples
    Reader unpins page
                                        Writer finishes write

If the reader can't unpin until the writer is done, then we will have
foreground readers blocked on the background writer process, which is
exactly what we do not want.

>> I think both of these are ill-considered micro-optimization.  How do you
>> know that the pin count can't exceed 64K?  Consider recursive plpgsql
>> functions for a likely counterexample.

> Fair enough -- I couldn't conceive of an actual scenario in which
> a single backend would acquire > 64K pins on a single buffer, but I'll
> take your word that it's not so far fetched. However, there is still
> room for improvement, IMHO: on a machine with 64-bit longs, we're
> plainly allocating 4 bytes more than we need to. Any objection if I
> change these to arrays of int32?

That seems like a reasonable compromise.
        regards, tom lane