Re: AIO v2.5 - Mailing list pgsql-hackers

From: Jakub Wartak
Subject: Re: AIO v2.5
Msg-id: CAKZiRmxj1Z77CH-_J+KuuxTVU8zpOwxqpR7GMLtAGGbN2i44+Q@mail.gmail.com
In response to: AIO v2.5 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:

> Attached is v2.5 of the AIO patchset.
[..]
Hi, thanks for working on this!

> Questions:
>
> - My current thinking is that we'd set io_method = worker initially - so we
>   actually get some coverage - and then decide whether to switch to
>   io_method=sync by default for 18 sometime around beta1/2. Does that sound
>   reasonable?

IMHO, yes, good idea. Anyway, the final outcome will partially depend
on how many other stream consumers get committed, right?

> - Three of the commits in the series really are just precursor commits to
>   their subsequent commits, which I found helpful for development and review,
>   namely:
>
>   - aio: Basic subsystem initialization
>   - aio: Skeleton IO worker infrastructure
>   - aio: Add liburing dependency
>
>   Not sure if it's worth keeping these separate or whether they should just be
>   merged with their "real commit".

For me it was easier to review those as separate commits.

> - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
>   but all the ereport()s add a noticeable amount of overhead at high IO
>   throughput (at multiple gigabytes/second), so that's probably not right
>   forever.  I'd leave this on initially and then change it to default to off
>   later.  I think that's ok?

+1, hopefully nothing is recording/logging/running with
log_min_messages >= debug3, because only then does the overhead start
to become visible.
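For anyone wanting to actually see that chatter, something like this
should do (a sketch, assuming superuser and otherwise default logging
settings):

# surface the PGAIO_VERBOSE ereports in the server log
psql -c "ALTER SYSTEM SET log_min_messages = debug3"
psql -c "SELECT pg_reload_conf()"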

> - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
>   automatically in/decrease active workers, the max number of workers (32) is
>   always allocated. That means we use more semaphores than before. I think
>   that's ok, it's not 1995 anymore.  Alternatively we can add a
>   "io_workers_max" GUC and probe for it in initdb.

Wouldn't that matter only on *BSDs?
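(On e.g. FreeBSD the knobs to eyeball would be the SysV semaphore
limits, something like:

# FreeBSD: limits the extra ~32 semaphores would count against
sysctl kern.ipc.semmni kern.ipc.semmns

- sysctl names from memory, so treat them as an assumption.)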

BTW I somehow cannot imagine someone saturating >= 32 workers (if one
does, better to switch to io_uring anyway?), but I have a related
question about those workers closing their fds; see the scenario
below.
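(Assuming io_workers does end up PGC_SIGHUP as proposed above,
resizing the pool on a live system would then be just:

psql -c "ALTER SYSTEM SET io_workers = 8"
psql -c "SELECT pg_reload_conf()"

- with 8 being a made-up value standing in for whatever the workload
needs.)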

> - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
>   sure that's great?
>
>   They could be an enum array or such too? That'd perhaps be a bit more
>   extensible? OTOH, we don't currently use enums in the catalogs and arrays
>   are somewhat annoying to conjure up from C.

s/pg_stat_aios/pg_aios/ ? :^) It looks good to me as it is. Anyway, it
is a debugging view - perhaps mark it as such in the docs - so there
is no stable API for it and it shouldn't be queried by any software
anyway.

> - Documentation for pg_stat_aios.

pg_aios! :)
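(Not that any tooling should depend on it, but for eyeballing
in-flight IOs while debugging, a plain

psql -x -c "SELECT * FROM pg_aios"

is enough - just treat the column set as unstable.)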

So, I've taken the aio-2 branch from your GitHub repo for a small ride
on legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is more
of a question: perhaps IO workers should auto-close fds on errors, or
should we use SIGUSR2 for that? The scenario is like this:

# dm-dust is not that available even on modern distros (not always
# compiled in), but dm-flakey seemed to work on 4.18.x:
losetup /dev/loop0 /dd.img
mkfs.ext4 -j /dev/loop0
mkdir /flakey
mount /dev/loop0 /flakey   # for now it will work
mkdir /flakey/tblspace
chown postgres /flakey/tblspace
chmod 0700 /flakey/tblspace

# in psql (t1fail definition reconstructed to match the inserts below):
psql -c "CREATE TABLESPACE test1 LOCATION '/flakey/tblspace'"
psql -c "CREATE TABLE t1fail (id int) TABLESPACE test1"
psql -c "INSERT INTO t1fail SELECT generate_series(1, 1000000)"

pg_ctl stop
umount /flakey
# dm-flakey table: offset 0, device up for 1s, then all I/O fails for 1s, repeating
echo "0 `blockdev --getsz /dev/loop0` flakey /dev/loop0 0 1 1" | dmsetup create flakey
mount /dev/mapper/flakey /flakey
# might even say: mount: /flakey: can't read superblock on /dev/mapper/flakey.
# in that case simply retry:
mount /dev/mapper/flakey /flakey
pg_ctl start

and then this will happen:

postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file
"pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error

(the same error on two further attempts, and then:)

postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not open file
"pg_tblspc/24579/PG_18_202503031/5/24586_vm": Read-only file system

so the usual stuff with the kernel remounting it read-only, but here's
the dragon with io_method=worker:

# mount -o remount,rw /flakey/
mount: /flakey: cannot remount /dev/mapper/flakey read-write, is
write-protected.
# umount /flakey # to fsck or just mount rw again
umount: /flakey: target is busy.
# lsof /flakey/
COMMAND     PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
postgres 103483 postgres   14u   REG  253,2 36249600   17
/flakey/tblspace/PG_18_202503031/5/24586
postgres 103484 postgres    6u   REG  253,2 36249600   17
/flakey/tblspace/PG_18_202503031/5/24586
postgres 103485 postgres    6u   REG  253,2 36249600   17
/flakey/tblspace/PG_18_202503031/5/24586

Those 10348[345] are IO workers; they still have open fds and there's
no way to close those without a restart -- well, short of injecting
close() calls via gdb. pg_terminate_backend() on them won't work. The
only thing that seems to work is sending SIGUSR2, but is that safe
[there could be some errors pending after pwrite()]? With
io_method=sync, just quitting the backend of course works. Not sure
what your thoughts are, because any other bgworker could be holding
open fds there. It's a very minor thing, but otherwise an outage of a
separate (rarely used) tablespace could make it impossible to fsck
there and lower the availability of the DB (due to a potential
restart being required). I'm thinking especially of scenarios where
lots of schemas are used with lots of tablespaces, OR where
temp_tablespaces points at some dedicated (fast/furious/faulty)
device. So I'm hoping SIGUSR2 is enough, right
(4231f4059e5e54d78c56b904f30a5873da88e163 seems to be doing it
anyway)?
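For the record, the gdb workaround I mean, with the pid/fd taken from
the lsof output above (obviously a last-resort hack, not a
recommendation):

# forcibly close the stale fd inside IO worker 103483 (fd 14 per lsof)
gdb -p 103483 --batch -ex 'call (int) close(14)'
# or the signal that does seem to work:
kill -USR2 103483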

BTW: while at it, I've tried amcheck/pg_surgery for a minute and they
both seem to work.
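For completeness, once umount finally succeeds, tearing the test rig
down is just:

dmsetup remove flakey
losetup -d /dev/loop0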

-J.


