Re: AIO v2.5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.5
Date
Msg-id xbv74hjsabzy3pjh6wmz3iksaxia7r6atrhsecpszfvbl4eiei@ulw3ty6xb3bu
Whole thread Raw
In response to Re: AIO v2.5  (Alexander Lakhin <exclusion@gmail.com>)
List pgsql-hackers
Hi,

On 2025-05-08 19:22:27 -0700, Noah Misch wrote:
> On Thu, May 08, 2025 at 09:06:18PM -0400, Andres Freund wrote:
> > On 2025-05-02 20:05:11 -0700, Noah Misch wrote:
> > > On Wed, Apr 30, 2025 at 04:00:35PM -0400, Andres Freund wrote:
> 
> > We do need to hold interrupts in a few other places, I think - with some debug
> > infrastructure (things like calling ProcessBarrierSmgrRelease() whenever
> > interrupts could be processed and calling CFI() in errstart() in its return
> > false case) it's possible to find state confusions which trigger
> > assertions. The issue is that pgaio_io_update_state() contains a
> > pgaio_debug_io() and executing pgaio_closing_fd() in places that call
> > pgaio_io_update_state() doesn't end well.  There's a similar danger with the
> > debug message in pgaio_io_reclaim().
> > 
> > In the attached patch I added an assertion to pgaio_io_update_state()
> > verifying that interrupts are held and added code to hold interrupts in the
> > relevant places.
> 
> Works for me.
> 
> > > For the "no free IOs despite no in-flight IOs" case, I'd replace the
> > > ereport(ERROR) with "return;" since we now know interrupt processing reclaimed
> > > an IO.
> > 
> > Hm - it seems better to me to check if there are now free handles and return
> > if that's the case, but to keep the error check in case there actually is no
> > free IO? That seems like a not implausible bug...
> 
> Works for me.
> 
> > > Then decide what protection if any, we need against bugs causing an
> > > infinite loop in caller pgaio_io_acquire().  What's the case motivating the
> > > unbounded loop in pgaio_io_acquire(), as opposed to capping at two
> > > pgaio_io_acquire_nb() calls?  If the theory is that pgaio_io_acquire() could
> > > be reentrant, what scenario would reach that reentrancy?
> > 
> > I do not remember why I wrote this as an endless loop.  If you prefer I could
> > change that as part of this patch.
> 
> I asked because it would be sad to remove the ereport(ERROR) like I proposed
> and then have a bug cause a real infinite loop.  Removing the loop was one way
> to prove that can't happen.  As you say, another way would be keeping the
> ereport(ERROR) and guarding it with a free-handles check, like in your patch
> today.  I don't have a strong preference between those.

I pushed it this way just now.

Thanks again for Alexander for reporting & investigating the issues and for
Noah's review.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: strange perf regression with data checksums
Next
From: Tatsuo Ishii
Date:
Subject: Adding null patch entry to cfbot/CommitFest