Re: AIO v2.5 - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: AIO v2.5
Msg-id: g5eisego74jjmdqck2ge4r3bunnjk4m56o7omdec6pnzdp42nf@gcici63b6iyf
In response to: Re: AIO v2.5 (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
Hi,

On 2025-04-01 09:07:27 -0700, Noah Misch wrote:
> On Tue, Apr 01, 2025 at 11:55:20AM -0400, Andres Freund wrote:
> > WRT the locking issues, I've been wondering whether we could make
> > LWLockWaitForVar() work for that purpose, but I doubt it's the right approach.
> > Probably better to get rid of the LWLock*Var functions and go for the approach
> > I had in v1, namely a version of LWLockAcquire() with a callback that gets
> > called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the
> > lock acquisition to abort.
>
> What are the best thing(s) to read to understand the locking issues?

Unfortunately I think it's our discussion from a few days/weeks ago.

The problem basically is that functions like LockBuffer(EXCLUSIVE) need to be able
to non-racily

a) wait for in-flight IOs
b) acquire the content lock

If you just do it naively like this:

    else if (mode == BUFFER_LOCK_EXCLUSIVE)
    {
        if (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS)
            WaitIO(buf);
        LWLockAcquire(content_lock, LW_EXCLUSIVE);
    }

you obviously could have another backend start new IO between the WaitIO() and
the LWLockAcquire().  If that other backend then doesn't consume the
completion of that IO, the current backend could end up endlessly waiting for
the IO.  I don't see a way to avoid that with narrow changes just to
LockBuffer().


We need some infrastructure that allows us to avoid that issue.  One approach
could be to integrate more tightly with lwlock.c. If

1) anyone starting IO were to wake up all waiters for the LWLock, and

2) the waiting side checked that there is no IO in progress *after*
   LWLockQueueSelf(), but before PGSemaphoreLock(),

then the backend doing LockBuffer() would be guaranteed to have the chance to
wait for the IO rather than for the lwlock, along the lines of the sketch
below.
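
To make that concrete, here is a rough sketch of such a primitive.  To be
clear: LWLockAttemptLock(), LWLockQueueSelf() and LWLockDequeueSelf() are
existing static helpers in lwlock.c, but the function name
LWLockAcquireWithCheck(), the LWLockPreWaitCheck callback type and the elided
wakeup/extraWaits bookkeeping are invented for illustration, not an actual
API:

    typedef bool (*LWLockPreWaitCheck) (void *arg);

    /*
     * Like LWLockAcquire(), except that 'check' runs after we are queued
     * but before we sleep on the semaphore.  Returns false, without the
     * lock held, if the callback aborted the acquisition.
     */
    static bool
    LWLockAcquireWithCheck(LWLock *lock, LWLockMode mode,
                           LWLockPreWaitCheck check, void *check_arg)
    {
        for (;;)
        {
            /* try to grab it; LWLockAttemptLock() returns true iff we must wait */
            if (!LWLockAttemptLock(lock, mode))
                return true;

            /* add ourselves to the wait queue */
            LWLockQueueSelf(lock, mode);

            /* the lock might have been released concurrently, retry once */
            if (!LWLockAttemptLock(lock, mode))
            {
                LWLockDequeueSelf(lock);
                return true;
            }

            /*
             * We are queued now, so per 1) anyone starting IO is guaranteed
             * to wake us.  Recheck for in-progress IO before sleeping;
             * bailing out here lets LockBuffer() do WaitIO() and retry,
             * instead of blocking on the semaphore while the IO's
             * completion goes unconsumed.
             */
            if (check != NULL && !check(check_arg))
            {
                LWLockDequeueSelf(lock);
                return false;
            }

            /* wakeup loop / extraWaits handling of LWLockAcquire() elided */
            PGSemaphoreLock(MyProc->sem);
        }
    }

LockBuffer(BUFFER_LOCK_EXCLUSIVE) would then loop: call it with a callback
that returns false if BM_IO_IN_PROGRESS is set in the buffer header, and on a
false return do WaitIO(buf) and go around again.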


But there might be better approaches.

I'm not really convinced that using generic lwlocks for buffer locking is the
best idea. There are just too many special things about buffers. E.g. we have
rather massive NUMA scalability issues due to the amount of lock traffic from
buffer header and content lock atomic operations, particularly on things like the
uppermost levels of a btree.  I've played with ideas like super-pinning and
locking btree root pages, which move all the overhead to the side that wants
to exclusively lock such a page - but that doesn't really make sense for
lwlocks in general.


Greetings,

Andres Freund


