Re: Buffer locking is special (hints, checksums, AIO writes) - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Buffer locking is special (hints, checksums, AIO writes) |
Date | |
Msg-id | 20250827001449.fb.nmisch@google.com |
In response to | Re: Buffer locking is special (hints, checksums, AIO writes) (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> I'm working on making bufmgr.c ready for AIO writes.

Nice!

> == Problem 2 - AIO writes vs exclusive locks ==
>
> Separate from the hint bit issue, there is a second issue that I didn't have a
> good answer for: Making acquiring an exclusive lock concurrency safe in the
> presence of asynchronous writes:
>
> The problem is that while a buffer is being written out, it obviously has to
> be share locked. That's true even with AIO. With AIO the share lock is held
> until the IO is completed. The problem is that if a backend wants to
> exclusively lock a buffer undergoing AIO, it can't just wait for the content
> lock as today, it might have to actually reap the IO completion from the
> operating system. If one just were to wait for the content lock, there's no
> forward progress guarantee.
>
> The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> BM_IO_IN_PROGRESS are set). To ensure forward progress guarantee, an exclusive
> locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
> problem is that it's surprisingly hard to do so race free:
>
> If a backend A were to just check if a buffer is undergoing IO before locking
> it, a backend B could start IO on the buffer between A checking for
> BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
> hold the buffer header spinlock across a blocking lwlock acquisition.
>
> There potentially are ways to synchronize the buffer state and the content
> lock, but it requires deep integration between bufmgr.c and lwlock.c.

You may have considered and rejected simpler alternatives for (2) before
picking the approach you go on to outline.  Anything interesting?  For
example, I imagine these might work with varying degrees of inefficiency:

- Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
  there's a non-I/O lock conflict.
- Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

> == Problem 3 - Cacheline contention ==

> c) Read accesses to the BufferDesc cause contention
>
>    Some code, like nbtree, relies on functions like
>    BufferGetBlockNumber(). Unfortunately that contends with concurrent
>    modifications of the buffer descriptor (like pinning). Potential solutions
>    are to rely less on functions like BufferGetBlockNumber() or to split out
>    the memory for that into a separate (denser?) array.

Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
structure, since the buffer's mapping can't change until we unpin.

> d) Even after addressing all of the above, there's still a lot of contention
>
>    I think the solution here would be something roughly to fastpath locks. If
>    a buffer is very contended, we can mark it as super-pinned & share locked,
>    avoiding any atomic operation on the buffer descriptor itself. Instead the
>    current lock and pincount would be stored in each backend's PGPROC.
>    Obviously evicting or exclusively-locking such a buffer would be a lot more
>    expensive.
>
>    I've prototyped it and it helps a *lot*. The reason I mention this here is
>    that this seems impossible to do while using the generic lwlocks for the
>    content lock.

Nice.

On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > The order of changes I think makes the most sense is the following:

No concerns so far.  I won't claim I can picture all the implications and be
sure this is the right thing, but it sounds promising.  I like your principle
of ordering changes to avoid performance regressions.

> > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

I would consider {AccessShare, Exclusive, AccessExclusive}.  What the $SUBJECT
proposal calls SHARE-EXCLUSIVE would become Exclusive.
That has the same conflict matrix as the corresponding heavyweight locks,
which seems good.  I don't love our mode names, particularly
ShareRowExclusive being unsharable.  However, learning one special taxonomy
is better than learning two.

> > AFAIK "share exclusive" or "SX" is standard terminology.

Can you say more about that?  I looked around at
https://google.com/search?q=share+exclusive+%22sx%22+lock but didn't find
anything well-aligned with the proposal:

https://dev.mysql.com/doc/dev/mysql-server/latest//PAGE_LOCK_ORDER.html looked
most relevant, but it doesn't give the big idea.

https://mysqlonarm.github.io/Understanding-InnoDB-rwlock-stats/ is less
authoritative but does articulate the big idea, as "Shared-Exclusive (SX):
offer write access to the resource with inconsistent read. (relaxed
exclusive)."  That differs from $SUBJECT semantics, in which SHARE-EXCLUSIVE
can't see inconsistent reads.

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_LOCK.html
has term SX = "sub exclusive".  I gather an SX lock on a table lets one do
SELECT FOR UPDATE on that table (each row is the "sub"component being locked).

https://man.freebsd.org/cgi/man.cgi?query=sx_slock&sektion=9&format=html uses
the term "SX", but it's more like our lwlocks.  One acquires S or X, not
blends of them.
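
To make the naming suggestion above concrete, here is a minimal sketch of the
conflict matrix that {AccessShare, Exclusive, AccessExclusive} would imply.
The entries are the documented heavyweight-lock conflicts restricted to those
three modes.  The enum, table, and function names are hypothetical, chosen
only for illustration; they are not identifiers from bufmgr.c or lwlock.c, and
nothing here is taken from the proposed patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode names for illustration; not PostgreSQL identifiers. */
typedef enum ContentLockMode
{
    CL_ACCESS_SHARE = 0,        /* what is called "share" today */
    CL_EXCLUSIVE,               /* what $SUBJECT calls SHARE-EXCLUSIVE */
    CL_ACCESS_EXCLUSIVE,        /* what is called "exclusive" today */
    CL_NUM_MODES
} ContentLockMode;

/*
 * conflict_table[held][requested] is true when the two modes conflict.
 * Rows mirror the heavyweight-lock conflicts among the same-named modes:
 * AccessShare conflicts only with AccessExclusive; Exclusive conflicts with
 * itself and with AccessExclusive; AccessExclusive conflicts with everything.
 */
static const bool conflict_table[CL_NUM_MODES][CL_NUM_MODES] = {
    /* held AccessShare     */ {false, false, true},
    /* held Exclusive       */ {false, true, true},
    /* held AccessExclusive */ {true, true, true},
};

/* Return true if a request for "requested" must wait behind "held". */
bool
content_lock_conflicts(ContentLockMode held, ContentLockMode requested)
{
    assert((int) held >= 0 && (int) held < CL_NUM_MODES);
    assert((int) requested >= 0 && (int) requested < CL_NUM_MODES);
    return conflict_table[held][requested];
}
```

Under this sketch, content_lock_conflicts(CL_ACCESS_SHARE, CL_EXCLUSIVE) is
false while content_lock_conflicts(CL_EXCLUSIVE, CL_EXCLUSIVE) is true, which
is the "readers allowed, only one such holder" behavior the thread attributes
to SHARE-EXCLUSIVE.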