Re: ReadRecentBuffer() doesn't scale well - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: ReadRecentBuffer() doesn't scale well
Date
Msg-id CA+hUKG+r9=0Ss9Lbv+vEEaVuKhu+VMWZpGwvB5jTKB2Lm1Bb=g@mail.gmail.com
Whole thread Raw
In response to ReadRecentBuffer() doesn't scale well  (Andres Freund <andres@anarazel.de>)
Responses Re: ReadRecentBuffer() doesn't scale well
Re: ReadRecentBuffer() doesn't scale well
List pgsql-hackers
On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> As mentioned nearby [1], Thomas brought up [2] the idea of using
> ReadRecentBuffer() _bt_getroot().  I couldn't resist and prototyped it.

Thanks!

> Unfortunately it scaled way worse at first. This is not an inherent issue, but
> due to an implementation choice in ReadRecentBuffer().  Whereas the normal
> BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> LockBufHdr(), checks if the buffer ID is the same and then uses
> PinBuffer_Locked().
>
> The problem with that is that PinBuffer() takes care to not hold the buffer
> header spinlock, it uses compare_exchange to atomically acquire the pin, while
> guaranteing nobody holds the lock.  When holding the buffer header spinlock,
> there obviously is the risk of being scheduled out (or even just not have
> exclusive access to the cacheline).

Yeah.  Aside from inherent nastiness of user-space spinlocks, this new
use case is also enormously more likely to contend and then get into
trouble by being preempted due to btree root pages being about the
hottest pages in the universe than the use case I was focusing on at
the time.

> The fairly obvious solution to this is to just use PinBuffer() and just unpin
> the buffer if its identity was changed concurrently. There could be an
> unlocked pre-check as well.  However, there's the following comment in
> ReadRecentBuffer():
>                          * It's now safe to pin the buffer.  We can't pin first and ask
>                          * questions later, because it might confuse code paths like
>                          * InvalidateBuffer() if we pinned a random non-matching buffer.
>                          */
>
> But I'm not sure I buy that - there's plenty other things that can briefly
> acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
> contents, etc).

I may well have been too cautious with that.  The worst thing I can
think of right now is that InvalidateBuffer() would busy loop (as it
already does in other rare cases) when it sees a pin.

> Another difference between using PinBuffer() and PinBuffer_locked() is that
> the latter does not adjust a buffer's usagecount.
>
> Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> increasing usagecount anymore?

Yeah, that is not great.  The simplification you suggest would fix
that too, though I guess it would also bump the usage count of buffers
that don't have the tag we expected; that's obviously rare and erring
on a better side though.

> FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> easy to get much bigger wins, e.g. with a index nested loop with a relatively
> small index on the inner side.

Wooo, that's better than I was hoping.  Thanks for trying it out!  I
think, for the complexity involved (ie very little), it's a nice
result, and worth considering even though it's also a solid clue that
we could do much better than this with a (yet to be designed)
longer-lived pin scheme.  smgr_targblock could be another
easy-to-cache candidate, ie a place where there is a single
interesting hot page that we're already keeping track of with no
requirement for new backend-local mapping machinery.



pgsql-hackers by date:

Previous
From: Kirk Wolak
Date:
Subject: Re: Add some more corruption error codes to relcache
Next
From: Peter Geoghegan
Date:
Subject: Re: ReadRecentBuffer() doesn't scale well