From: Andres Freund
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
Msg-id: wyxi4ezllclq24f5oplseb2dv7r4pk5ylof4gzrzlljciuuhsi@2zdujvnacdau
In response to: Re: Buffer locking is special (hints, checksums, AIO writes)  (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers

Hi,

On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > == Problem 2 - AIO writes vs exclusive locks ==
> >
> > Separate from the hint bit issue, there is a second issue that I didn't have a
> > good answer for: Making acquiring an exclusive lock concurrency safe in the
> > presence of asynchronous writes:
> >
> > The problem is that while a buffer is being written out, it obviously has to
> > be share locked. That's true even with AIO. With AIO the share lock is held
> > until the IO is completed. The problem is that if a backend wants to
> > exclusively lock a buffer undergoing AIO, it can't just wait for the content
> > lock as today, it might have to actually reap the IO completion from the
> > operating system. If one just were to wait for the content lock, there's no
> > forward progress guarantee.
> >
> > The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> > BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
> > locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
> > problem is that it's surprisingly hard to do so race free:
> >
> > If a backend A were to just check if a buffer is undergoing IO before locking
> > it, a backend B could start IO on the buffer between A checking for
> > BM_IO_IN_PROGRESS and acquiring the content lock.  We obviously can't just
> > hold the buffer header spinlock across a blocking lwlock acquisition.
> >
> > There potentially are ways to synchronize the buffer state and the content
> > lock, but it requires deep integration between bufmgr.c and lwlock.c.
>
> You may have considered and rejected simpler alternatives for (2) before
> picking the approach you go on to outline.

I tried a few things...


> Anything interesting?

Not really.

The first one you propose is what I looked at for a while:

> For example, I imagine these might work with varying degrees of
> inefficiency:
>
> - Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
>   there's a non-I/O lock conflict.

It's nontrivial to make this race free. The problem is the case where we *do*
have to wait for the exclusive content lock: after LWLockConditionalAcquire()
has failed and we have checked that no IO is in progress, the owning backend
can release the lwlock and another backend can start IO on the buffer before
we block on the content lock.
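
To spell it out, here is roughly the naive sequence and where the window
opens. This is an illustrative sketch only, not actual bufmgr.c code; the
function name is made up:

#include "postgres.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

static void
LockBufferExclusiveNaive(BufferDesc *buf)
{
	LWLock	   *content_lock = BufferDescriptorGetContentLock(buf);
	uint32		buf_state;
	PgAioWaitRef iow;

	/* Fast path: nobody holds a conflicting content lock. */
	if (LWLockConditionalAcquire(content_lock, LW_EXCLUSIVE))
		return;

	/* A write is in flight? Then reap its completion first. */
	buf_state = LockBufHdr(buf);
	iow = buf->io_wref;
	UnlockBufHdr(buf, buf_state);
	if ((buf_state & BM_VALID) && (buf_state & BM_IO_IN_PROGRESS))
		pgaio_wref_wait(&iow);

	/*
	 * RACE: right here the current holder can release the content lock and
	 * another backend can start a new write (setting BM_IO_IN_PROGRESS).
	 * The blocking acquisition below then waits behind that IO without
	 * anybody being obliged to reap its completion - no forward progress
	 * guarantee.
	 */
	LWLockAcquire(content_lock, LW_EXCLUSIVE);
}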

I came up with a complicated scheme where setting IO in progress would
afterwards wake up all lwlock waiters, and all exclusive content lock waits
were done with LWLockAcquireOrWait().  I think that was working - but it's
also slower and seems really fragile and ugly.
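
Roughly, the locking side looked like the sketch below (same caveats and
helpers as the sketch above; the fragile part is that starting IO
additionally has to wake up the content lock's waiters, otherwise the
LWLockAcquireOrWait() below could sleep without ever re-checking):

static void
LockBufferExclusiveAcquireOrWait(BufferDesc *buf)
{
	LWLock	   *content_lock = BufferDescriptorGetContentLock(buf);

	for (;;)
	{
		uint32		buf_state;
		PgAioWaitRef iow;

		/* If a write is in flight, reap its completion first. */
		buf_state = LockBufHdr(buf);
		iow = buf->io_wref;
		UnlockBufHdr(buf, buf_state);
		if ((buf_state & BM_VALID) && (buf_state & BM_IO_IN_PROGRESS))
			pgaio_wref_wait(&iow);

		/*
		 * Either we get the lock (done), or we slept until it was released
		 * and return *without* holding it - in which case we loop around
		 * and re-check whether IO was started in the meantime.
		 */
		if (LWLockAcquireOrWait(content_lock, LW_EXCLUSIVE))
			return;
	}
}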


> - Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

That just seems expensive. We could make it cheaper by doing it only if
LWLockConditionalAcquire() doesn't succeed, but it still seems not great.  And
it doesn't really help with addressing the "setting hint bits while IO is in
progress" part...


> > == Problem 3 - Cacheline contention ==
>
> > c) Read accesses to the BufferDesc cause contention
> >
> >    Some code, like nbtree, relies on functions like
> >    BufferGetBlockNumber(). Unfortunately that contends with concurrent
> >    modifications of the buffer descriptor (like pinning). Potential solutions
> >    are to rely less on functions like BufferGetBlockNumber() or to split out
> >    the memory for that into a separate (denser?) array.
>
> Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
> structure, since the buffer's mapping can't change until we unpin.

Hm. I hadn't thought about a backend-local data structure for that, perhaps
because it doesn't seem cheap to maintain (both from a runtime and a space
perspective).

If we store the read-only data for buffers separately from the read-write
data, we could access that from backends without a lock, since it can't change
with the buffer pinned.

One way to do that would be to maintain a back-pointer from the BufferDesc to
the BufferLookupEnt, since the latter *already* contains the BufferTag. We
probably don't want to add another indirection to the buffer mapping hash
table; otherwise we could deduplicate the other way round and just put padding
between the modified and read-only parts of a buffer desc.
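
For illustration, the separate-array variant could look something like this.
The names are made up and local buffers are ignored; it's just a sketch of
the layout idea, not a patch:

#include "postgres.h"
#include "storage/buf_internals.h"

/* Identity of the buffer; can't change while the buffer is pinned. */
typedef struct BufferDescRO
{
	BufferTag	tag;
} BufferDescRO;

/* Dense, read-mostly array, parallel to BufferDescriptors. */
extern PGDLLIMPORT BufferDescRO *BufferDescriptorsRO;

static inline BlockNumber
BufferGetBlockNumberCheap(Buffer buffer)
{
	/*
	 * Caller must hold a pin.  Nothing in this array is modified while the
	 * buffer is pinned, so concurrent pin/unpin traffic on the shared
	 * BufferDesc doesn't bounce this cache line around.
	 */
	return BufferDescriptorsRO[buffer - 1].tag.blockNum;
}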



> On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > The order of changes I think makes the most sense is the following:
>
> No concerns so far.  I won't claim I can picture all the implications and be
> sure this is the right thing, but it sounds promising.  I like your principle
> of ordering changes to avoid performance regressions.

I suspect we'll have to merge this incrementally to stay sane. I don't want to
end up with a period of worse performance due to this, as that could make it
harder to evaluate other work.


> > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

> I would consider {AccessShare, Exclusive, AccessExclusive}.

One thing I forgot to mention is that with the proposed re-architecture in
place, we could subsequently go further and make pinning just a very
lightweight lock level, instead of a separate dedicated infrastructure.  One
nice outgrowth of that would be that acquiring a cleanup lock would just be a
real lock acquisition, instead of the dedicated limited machinery we have
right now.

Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup

This doesn't quite seem to map onto the heavyweight lock levels in a sensible
way...
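
To make the comparison concrete, the conflict matrix I have in mind for those
five levels is roughly the following (a sketch with made-up names, not a
settled proposal):

#include "postgres.h"

typedef enum BufferLockLevel
{
	BUFLOCK_REFERENCE,			/* today's pin */
	BUFLOCK_SHARE,
	BUFLOCK_SHARE_EXCLUSIVE,	/* hint bits, checksums, AIO writes */
	BUFLOCK_EXCLUSIVE,
	BUFLOCK_CLEANUP
} BufferLockLevel;

/* conflicts[a] has bit b set iff level a conflicts with level b */
static const uint8 buffer_lock_conflicts[] = {
	[BUFLOCK_REFERENCE] = 1 << BUFLOCK_CLEANUP,
	[BUFLOCK_SHARE] = (1 << BUFLOCK_EXCLUSIVE) | (1 << BUFLOCK_CLEANUP),
	[BUFLOCK_SHARE_EXCLUSIVE] = (1 << BUFLOCK_SHARE_EXCLUSIVE) |
		(1 << BUFLOCK_EXCLUSIVE) | (1 << BUFLOCK_CLEANUP),
	[BUFLOCK_EXCLUSIVE] = (1 << BUFLOCK_SHARE) |
		(1 << BUFLOCK_SHARE_EXCLUSIVE) | (1 << BUFLOCK_EXCLUSIVE) |
		(1 << BUFLOCK_CLEANUP),
	[BUFLOCK_CLEANUP] = (1 << BUFLOCK_REFERENCE) | (1 << BUFLOCK_SHARE) |
		(1 << BUFLOCK_SHARE_EXCLUSIVE) | (1 << BUFLOCK_EXCLUSIVE) |
		(1 << BUFLOCK_CLEANUP),
};

Cleanup being the only level that conflicts with plain references matches how
cleanup locks behave today (exclusive lock plus waiting for all other pins to
go away), which is part of what makes folding pins into the lock levels
attractive.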



> What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.

There are a few hundred references to the lock levels though; renaming them
seems painful :(


> That has the same conflict matrix as the corresponding heavyweight locks,
> which seems good.

> I don't love our mode names, particularly ShareRowExclusive being
> unsharable.

I hate them with a passion :). Except for the most basic ones they just don't
stay in my head for more than a few hours.


> However, learning one special taxonomy is better than learning two.

But that's fair.


Greetings,

Andres Freund


