Re: AIO v2.5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.5
Date
Msg-id xg2s43enrr7lwvghh55kofxdj7o7zd6s6ocgmowel76oxyb2gm@eh6guzumnido
Whole thread Raw
In response to Re: AIO v2.5  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: AIO v2.5
List pgsql-hackers
Hi,

After a bit more private back and forth with Alexander I have found the issue
- and it's pretty stupid:

pgaio_io_wait_for_free() does what it says on the tin. For that, after a bunch
of other things, finds the oldest in-flight IO and waits for it.

        PgAioHandle *ioh = dclist_head_element(PgAioHandle, node,
                                               &pgaio_my_backend->in_flight_ios);

        switch (ioh->state)
        {
...
            case PGAIO_HS_COMPLETED_IO:
            case PGAIO_HS_SUBMITTED:
                pgaio_debug_io(DEBUG2, ioh,
                               "waiting for free io with %d in flight",
                               dclist_count(&pgaio_my_backend->in_flight_ios));
...
                pgaio_io_wait(ioh, ioh->generation);
                break;


The problem is that, if the log level is low enough, ereport() (which is
called by pgaio_debug_io()), processes interrupts.  The interrupt processing
may end up execute ProcessBarrierSmgrRelease(), which in turn needs to wait
for all in-flight IOs before the IOs are closed.

Which then leads to the
            elog(PANIC, "waiting for own IO in wrong state: %d",
                 state);

error.

The waiting for in-flight IOs before closing FDs only happens with io-uring,
hence this only triggering with io-uring.


A similar set of steps can lead to the "no free IOs despite no in-flight IOs"
ERROR that Alexander also observed - if pgaio_submit_staged() triggers a debug
ereport that executes ProcessBarrierSmgrRelease() in an interrupt, we might
wait for all in-flight IOs during IO submission, triggering the error.


I'm somewhat amazed that Alexander could somewhat reliably reproduce this - I
haven't been able to do so once using Alexander's recipe.  I did find a
reproducer though:

c=32;pgbench -n -c$c -j$c -P1 -T1000 -f <(echo 'SELECT sum(abalance) FROM pgbench_accounts;')
c=1;pgbench -n -c$c -j$c -P1 -T1000 -f <(echo 'DROP DATABASE IF EXISTS foo;CREATE DATABASE foo;')

trigger both issues quickly if run with
  log_min_messages=debug2
  io_method=io_uring
and not when a non-debug log level is used.


I'm not yet sure how to best fix it - locally I have done so by pgaio_debug()
do a HOLD_INTERRUPTS()/RESUME_INTERRUPTS() around the call to ereport.  But
that doesn't really seem great - otoh requiring various pieces of code to know
that anything emitting debug messages needs to hold interrupts etc makes for
rare and hard to understand bugs.

We could just make the relevant functions hold interrupts, and that might be
the best path forward, but we don't really need to hold all interrupts
(e.g. termination would be fine), so it's a bit coarse grained.  It would need
to happen in a few places, which isn't great either.

Other suggestions?


Thanks again for finding and reporting this Alexander!

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Alexander Lakhin
Date:
Subject: Re: Performance issues with v18 SQL-language-function changes
Next
From: Mihail Nikalayeu
Date:
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements