Re: Potential deadlock in pgaio_io_wait() - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Potential deadlock in pgaio_io_wait()
Msg-id: kl7wzxh6lhtbexnkp7meobqjksnayvzvh53uzsdjsubdfkom37@kemmr6est2g6
In response to: Re: Potential deadlock in pgaio_io_wait() (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
Hi,

On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
> I discussed this off-list with Andres who provided the following review:
>
> * +1 for the analysis
> * +1 for the solution
> * his benchmark machine shows no regression under heavy IO submission workload
> * needs better comments
>
> I had expected that patch to be rejected as too slow. I was thinking
> that it should be enough to insert a memory barrier and then do an
> unlocked check for an empty waitlist in ConditionVariableBroadcast()
> to avoid a spinlock acquire/release. That caused a few weird hangs I
> didn't understand, which is why I didn't post that version last time,
> but he pointed out that the other side also needs a memory barrier in
> ConditionVariablePrepareToSleep() and the existing spinlock
> acquire/release is not enough.

FTR, the reason we would need a barrier there is that we define
SpinLockRelease() as a non-atomic volatile store on x86. That means there is
no guarantee that another backend sees an up-to-date view of memory when
reading a memory location *without* acquiring the spinlock.

The reason this is correct for spinlocks is that x86 has strongly ordered
stores, and the *acquisition* of a spinlock will only succeed once the
release of the spinlock has become visible. But if you read something
*without* holding the spinlock, you are not guaranteed to see all the
changes made while the spinlock was held.

I do wonder if we accidentally rely on SpinLockRelease() being a barrier
anywhere...

> I also thought of a small optimisation, presented in the -B patch.
> It's a bit of a shame to wait for backend->submit_cv and then also
> ioh->cv in io_method=worker. It's just a bunch of IPC ping-pong for
> nothing. So I figured it should be allowed to fall all the way
> through based on its lack of ->wait_one. Worth bothering with?

I don't think I'd go there without concrete evidence that it helps - it
makes it harder to understand when to wait for what.
Not terribly so, but enough that I'd not go there without some measurable
benefit.

> On a superficial note:
>
>  AIO_IO_COMPLETION	"Waiting for another process to complete IO."
> +AIO_IO_SUBMIT	"Waiting for another process to submit IO."
>  AIO_IO_URING_SUBMIT	"Waiting for IO submission via io_uring."
>  AIO_IO_URING_EXECUTION	"Waiting for IO execution via io_uring."
>
> We're inconsistent in our choice of noun or verb. I went with _SUBMIT
> to match the following line, rather than _SUBMISSION to match the
> preceding line. Shrug.

FWIW, I think so far it's a verb when the process is doing the work (e.g.
the process is calling io_uring_enter(to_submit = ..)). It's a noun if we
wait for somebody *else* to do the completion, since we're not doing the
work ourselves.

> From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sun, 3 Aug 2025 23:07:56 +1200
> Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
>
> Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
> PGAIO_HS_STAGED fell through to waiting for completion. The owner only
> promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
> prepared to call ->wait_one() itself once the IO is submitted in order
> to guarantee progress and avoid deadlocks on IO methods that provide
> ->wait_one().
>
> Introduce a new per-backend condition variable submit_cv, woken by
> pgaio_submit_stage(), and use it to wait for the state to advance. The
> new broadcast doesn't seem to cause any measurable slowdown, so ideas
> for optimizing the common no-waiters case were abandoned for now.
>
> It may not be possible to reach any real deadlock with existing AIO
> users, but that situation could change. There's also no reason the
> waiter shouldn't begin to wait via the IO method as soon as possible
> even without a deadlock.
>
> Picked up by testing a proposed IO method that has ->wait_one(), like
> io_method=io_uring, and code review.
LGTM.

Greetings,

Andres Freund