Home > mailing lists

Re: Buffer locking is special (hints, checksums, AIO writes) - Mailing list pgsql-hackers

From	Noah Misch
Subject	Re: Buffer locking is special (hints, checksums, AIO writes)
Date	August 27 22:14:41
Msg-id	20250827191441.1c.nmisch@google.com Whole thread Raw
In response to	Re: Buffer locking is special (hints, checksums, AIO writes) (Andres Freund <andres@anarazel.de>)
Responses	Re: Buffer locking is special (hints, checksums, AIO writes)
List	pgsql-hackers

Tree view

On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> > On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > > == Problem 3 - Cacheline contention ==
> >
> > > c) Read accesses to the BufferDesc cause contention
> > >
> > >    Some code, like nbtree, relies on functions like
> > >    BufferGetBlockNumber(). Unfortunately that contends with concurrent
> > >    modifications of the buffer descriptor (like pinning). Potential solutions
> > >    are to rely less on functions like BufferGetBlockNumber() or to split out
> > >    the memory for that into a separate (denser?) array.
> >
> > Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
> > structure, since the buffer's mapping can't change until we unpin.
> 
> Hm. I didn't think about a backend local datastructure for that, perhaps
> because it seems not cheap to maintain (both from a runtime and a space
> perspective).

Yes, paying off the cost of maintaining it could be tricky.  It could be the
kind of thing where the overhead loses at 10 cores and wins at 40 cores.  It
could also depend heavily on the workload's concurrent pins per backend.

> If we store the read-only data for buffers separately from the read-write
> data, we could access that from backends without a lock, since it can't change
> with the buffer pinned.

Good point.  That alone may be enough of a win.

> One way to do that would be to maintain a back-pointer from the BufferDesc to
> the BufferLookupEnt, since the latter *already* contains the BufferTag. We
> probably don't want to add another indirection to the buffer mapping hash
> table, otherwise we could deduplicate the other way round and just put padding
> between the modified and read-only part of a buffer desc.

I think you're saying clients would save the back-pointer once and dereference
it many times, with each dereference of a saved back-pointer avoiding a shmem
read of BufferDesc.tag.  Is that right?

> > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> 
> > I would consider {AccessShare, Exclusive, AccessExclusive}.
> 
> One thing I forgot to mention is that with the proposed re-architecture in
> place, we could subsequently go further and make pinning just be a very
> lightweight lock level, instead of that being a separate dedicated
> infrstructure.  One nice outgrowth of that would be that that acquiring a
> cleanup lock would just be a real lock acquisition, instead of the dedicated
> limited machinery we have right now.
> 
> Which would leave us with:
> - reference (pins today)
> - share
> - share-exclusive
> - exclusive
> - cleanup
> 
> This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> way...

Could map it like this:

AccessShare - pins today
RowShare - check tuple visibility (BUFFER_LOCK_SHARE today)
Share - set hint bits
ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
Exclusive - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
AccessExclusive - cleanup lock or evict the buffer

That has a separate level for hint bits vs. I/O, so multiple backends could
set hint bits.  I don't know whether the benchmarks would favor maintaining
that distinction.

> > What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.
> 
> There are a few hundred references to the lock levels though, seems painful to
> rename them :(

Yes, especially in comments and extensions.  Likely more important than that
for the long-term, your latest proposal has the advantage of keeping short
names for the most-commonly-referenced lock types.  (We could keep
BUFFER_LOCK_SHARE with the lower layers translating that into RowShare, but
that weakens or eliminates the benefit of reducing what readers need to
learn.)  For what it's worth, 6 PGXN modules reference BUFFER_LOCK_SHARE
and/or BUFFER_LOCK_EXCLUSIVE.

Compared to share-exclusive, I think I'd prefer a name that describes the use
cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).
What do you think of that?  I don't know whether that should win vs. names
like ShareUpdateExclusive, though.

pgsql-hackers by date:

From: Sami Imseih
Date: 27 August, 22:13:39
Subject: Re: Improve LWLock tranche name visibility across backends

From: Robert Haas
Date: 27 August, 22:22:55
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)

Re: Buffer locking is special (hints, checksums, AIO writes) - Mailing list pgsql-hackers

Previous

Next