Re: AIO v2.5 - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: AIO v2.5
Msg-id: CAKZiRmxj1Z77CH-_J+KuuxTVU8zpOwxqpR7GMLtAGGbN2i44+Q@mail.gmail.com
In response to: AIO v2.5 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:

> Attached is v2.5 of the AIO patchset.
[..]

Hi,

Thanks for working on this!

> Questions:
>
> - My current thinking is that we'd set io_method = worker initially - so we
>   actually get some coverage - and then decide whether to switch to
>   io_method=sync by default for 18 sometime around beta1/2. Does that sound
>   reasonable?

IMHO, yes, good idea. Anyway, the final outcome will partially depend on how
many other stream consumers get committed, right?

> - Three of the commits in the series really are just precursor commits to
>   their subsequent commits, which I found helpful for development and review,
>   namely:
>
>   - aio: Basic subsystem initialization
>   - aio: Skeleton IO worker infrastructure
>   - aio: Add liburing dependency
>
>   Not sure if it's worth keeping these separate or whether they should just be
>   merged with their "real commit".

For me it was easier to read those when they were separate.

> - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
>   but all the ereport()s add a noticeable amount of overhead at high IO
>   throughput (at multiple gigabytes/second), so that's probably not right
>   forever. I'd leave this on initially and then change it to default to off
>   later. I think that's ok?

+1; hopefully nothing is recording/logging/running with log_min_messages >=
debug3, because only then does it start to be visible.

> - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
>   automatically in/decrease active workers, the max number of workers (32) is
>   always allocated. That means we use more semaphores than before. I think
>   that's ok, it's not 1995 anymore. Alternatively we can add a
>   "io_workers_max" GUC and probe for it in initdb.

Wouldn't that matter only on *BSDs? BTW, I somehow cannot imagine someone
saturating >= 32 workers (if one does, better to switch to uring anyway?), but
I have a related question below about closing the fds held by those workers.

> - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
>   sure that's great?
>
>   They could be an enum array or such too? That'd perhaps be a bit more
>   extensible? OTOH, we don't currently use enums in the catalogs and arrays
>   are somewhat annoying to conjure up from C.

s/pg_stat_aios/pg_aios/ ? :^) It looks good to me as it is. Anyway, it is a
debugging view - perhaps mark it as such in the docs - so there is no stable
API for it and it shouldn't be queried by any software anyway.

> - Documentation for pg_stat_aios.

pg_aios! :)

So, I've taken the aio-2 branch from your github repo for a small ride on
legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is more of a
question: perhaps IO workers should auto-close fds on errors, or should we use
SIGUSR2 for that? The scenario is like this:

# dm-dust is not that available even on modern distros (not always compiled),
# but flakey seemed to work on 4.18.x:
losetup /dev/loop0 /dd.img
mkfs.ext4 -j /dev/loop0
mkdir /flakey
mount /dev/loop0 /flakey        # for now it will work
mkdir /flakey/tblspace
chown postgres /flakey/tblspace
chmod 0700 /flakey/tblspace

CREATE TABLESPACE test1 LOCATION '/flakey/tblspace';
-- CREATE TABLE t1fail on the test1 tablespace + INSERT some data

pg_ctl stop
umount /flakey
echo "0 `blockdev --getsz /dev/loop0` flakey /dev/loop0 0 1 1" | dmsetup create flakey   # after 1s start throwing IO errors
mount /dev/mapper/flakey /flakey
# might even say: mount: /flakey: can't read superblock on /dev/mapper/flakey.
mount /dev/mapper/flakey /flakey
pg_ctl start
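For reference, the postgresql.conf knobs involved in this exercise boil down to
something like the following sketch (GUC names as used in the patchset; the
exact values/defaults are from memory, so treat them as assumptions rather than
what the patch mandates):

io_method = worker           # the proposed initial default; sync and io_uring are the alternatives
io_workers = 3               # IIRC the patchset default; the max of 32 is always allocated, per the quoted text above
#log_min_messages = debug3   # only if you really want to see the PGAIO_VERBOSE ereport()s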
and then this will happen:

postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file "pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file "pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file "pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not open file "pg_tblspc/24579/PG_18_202503031/5/24586_vm": Read-only file system

So, the usual stuff with the kernel remounting it read-only, but here's the
dragon with io_method=worker:

# mount -o remount,rw /flakey/
mount: /flakey: cannot remount /dev/mapper/flakey read-write, is write-protected.
# umount /flakey        # to fsck or just mount rw again
umount: /flakey: target is busy.
# lsof /flakey/
COMMAND     PID     USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
postgres 103483 postgres  14u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586
postgres 103484 postgres   6u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586
postgres 103485 postgres   6u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586

Those 10348[345] are IO workers; they still have open fds and there's no way to
close those without a restart -- well, without close() injection, probably via
gdb. pg_terminate_backend() on those won't work. The only thing that works
seems to be sending SIGUSR2, but is that safe [there could be some errors after
pwrite()]? With io_method=sync, just quitting the backend of course works. Not
sure what your thoughts are, because any other bgworker could be holding open
fds there. It's a very minor thing. Otherwise such an outage of a separate
(rarely used) tablespace could make it impossible to fsck there and lower the
availability of the DB (due to the potential restart required). I'm thinking
especially of scenarios where lots of schemas are used with lots of
tablespaces, OR where temp_tablespaces points at some dedicated
(fast/furious/faulty) device.

So I'm hoping SIGUSR2 is enough, right
(4231f4059e5e54d78c56b904f30a5873da88e163 seems to be doing it anyway)?
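In case it helps anyone trying to reproduce this, the unblocking part I was
describing boils down to roughly the following sketch (PIDs and fd numbers are
the ones from the lsof output above; whether the SIGUSR2 route is actually safe
is exactly the open question):

# identify the IO workers still pinning the broken filesystem
lsof /flakey/
# the only thing that seemed to release the fds without a full restart:
kill -USR2 103483 103484 103485
# or, brute force, inject a close() of the offending fd from outside,
# e.g. fd 6 in worker 103484 (it is fd 14 in 103483):
gdb -p 103484 -batch -ex 'call (int) close(6)'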
BTW: While at this, I've tried amcheck/pg_surgery for 1 min and they both seem
to work.

-J.