Re: AIO v2.5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.5
Msg-id u3otgmy67yiltqw4533wqqhfbpzrf5ds7pqowssvw6w27klb7c@gbjnkuutrsza
In response to Re: AIO v2.5  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers
Hi,

On 2025-03-06 12:36:43 +0100, Jakub Wartak wrote:
> On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Questions:
> >
> > - My current thinking is that we'd set io_method = worker initially - so we
> >   actually get some coverage - and then decide whether to switch to
> >   io_method=sync by default for 18 sometime around beta1/2. Does that sound
> >   reasonable?
>
> IMHO, yes, good idea. Anyway, the final outcome will partially depend on
> how many other stream consumers get committed, right?

I think it's more a question of whether we find cases where it performs
substantially worse with the read stream users that exist.  The behaviour for
non-read-stream IO shouldn't change.
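
For context, the settings under discussion would look roughly like this in
postgresql.conf (a sketch; names and defaults are as proposed in this patch
series and may change before release):

```
# Proposed AIO settings (sketch; subject to change before the PG 18 release)
io_method = worker      # one of: sync, worker, io_uring (platform-dependent)
io_workers = 3          # number of IO worker processes; making this PGC_SIGHUP
                        # is what's being discussed above
```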


> > - To allow io_workers to be PGC_SIGHUP, and to eventually allow automatically
> >   in/decreasing the number of active workers, the max number of workers (32) is
> >   always allocated. That means we use more semaphores than before. I think
> >   that's ok, it's not 1995 anymore.  Alternatively we could add an
> >   "io_workers_max" GUC and probe for it in initdb.
>
> Wouldn't that matter only on *BSDs?

Yea, NetBSD and OpenBSD only, I think.


> > - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
> >   sure that's great?
> >
> >   They could be an enum array or such too? That'd perhaps be a bit more
> >   extensible? OTOH, we don't currently use enums in the catalogs and arrays
> >   are somewhat annoying to conjure up from C.
>
> s/pg_stat_aios/pg_aios/ ? :^)

Ooops, yes.


> It looks good to me as it is.
> Anyway, it is a debugging view - perhaps mark it as such in the docs - so
> there is no stable API for it, and it shouldn't be queried by any software
> anyway.

Cool
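
FWIW, for anyone following along, eyeballing the view while debugging is just
(assuming it keeps its current name, and with the caveat above that the column
set is not a stable API):

```
# Sketch: inspect in-flight AIO handles on a running server
psql -X -c 'SELECT * FROM pg_aios;'
```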


> > - Documentation for pg_stat_aios.
>
> pg_aios! :)
>
> So, I've taken the aio-2 branch from your GitHub repo for a small ride
> on legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is more of a
> question: perhaps IO workers should auto-close fds on errors, or should
> we use SIGUSR2 for it? The scenario is like this:

When you say "auto-close", you mean that one IO error should trigger *all*
workers to close their FDs?


> so usual stuff with kernel remounting it RO, but here's the dragon
> with io_method=worker:
>
> # mount -o remount,rw /flakey/
> mount: /flakey: cannot remount /dev/mapper/flakey read-write, is
> write-protected.
> # umount /flakey # to fsck or just mount rw again
> umount: /flakey: target is busy.
> # lsof /flakey/
> COMMAND     PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
> postgres 103483 postgres   14u   REG  253,2 36249600   17
> /flakey/tblspace/PG_18_202503031/5/24586
> postgres 103484 postgres    6u   REG  253,2 36249600   17
> /flakey/tblspace/PG_18_202503031/5/24586
> postgres 103485 postgres    6u   REG  253,2 36249600   17
> /flakey/tblspace/PG_18_202503031/5/24586
>
> Those 10348[345] are IO workers, they have still open fds and there's
> no way to close those without restart -- well without close()
> injection probably via gdb.

The same is already true for bgwriter, checkpointer, etc.?
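
(For completeness: the close() injection via gdb that you allude to can be done
without a restart, though it's strictly a last-resort hack. The PID and fd
number below are the example values from your lsof output; the cast is needed
in case the binary lacks debug info.)

```
# Last-resort sketch: force-close a stale fd in a running IO worker via gdb
gdb --batch -p 103483 -ex 'call (int) close(14)'
```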


> pg_terminate_backend() on those won't work. The only thing that works seems
> to be sending SIGUSR2

Sending SIGINT works.


> , but is that safe [there could be some errors after pwrite() ]?

Could you expand on that?


> With
> io_method=sync, just quitting the backend of course works. Not sure
> what your thoughts are, because any other bgworker could be holding open
> fds there. It's a very minor thing. Otherwise an outage of a separate
> (rarely used) tablespace could make it impossible to fsck there and
> lower the availability of the DB (due to the potential restart
> required).

I think a crash-restart is the only valid thing to get out of a scenario like
that, independent of AIO:

- If there have been any writes, we need to perform crash recovery anyway, to
  recreate those writes.
- If there were just reads, it's good to restart as well, as otherwise there
  might be pages in the buffer pool that no longer exist on disk, due to
  the errors.

Greetings,

Andres Freund


