Thread: Re: AIO v2.4
Hi,

Attached is v2.4 of the AIO patchset. Changes:

- Introduce "batchmode"; while not in batchmode, IOs get submitted
  immediately.

  Thomas didn't like how this worked previously, and while this was a
  surprisingly large amount of work, I agree that it looks better now.

  I vacillated a bunch on the naming. For now it's

    extern void pgaio_enter_batchmode(void);
    extern void pgaio_exit_batchmode(void);

  (a usage sketch follows at the end of this mail)

  I did adjust the README and wrote a reasonably long comment above enter:
  https://github.com/anarazel/postgres/blob/a324870186ddff9a31b10472b790eb4e744c40b3/src/backend/storage/aio/aio.c#L931-L960

- Batchmode needs to be exited in case of errors. For that

  - a new pgaio_after_error() call has been added to all the relevant
    places

  - xact.c calls to aio have been (re-)added to check that there are no
    in-progress batches / unsubmitted IOs at the end of a transaction.
    Before that I had just removed at-eoxact "callbacks" :)

  This checking has holes though:
  https://postgr.es/m/upkkyhyuv6ultnejrutqcu657atw22kluh4lt2oidzxxtjqux3%40a4hdzamh4wzo

  Because this only means that we will not detect all buggy code, rather
  than misbehaving for correct code, I think this may be ok for now.

- Renamed aio_init.h to aio_subsys.h

  The newly added pgaio_after_error() calls would have required including
  aio.h in a good bit more places that won't themselves issue AIO. That
  seemed wrong.

  There already was an aio_init.h to avoid needing to include aio.h in
  places like ipci.c, but it seemed wrong to put pgaio_after_error() in
  aio_init.h. So I renamed it to aio_subsys.h - not sure that's the best
  name, but I can live with it.

- Now that Thomas submitted the necessary read_stream.c improvements, the
  prior big TODO about one StartReadBuffers() call needing to start many
  IOs has been addressed.

  Thomas' thread:
  https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com

  For now I've also included Thomas' patches in my queue, but they should
  get pushed independently. Review comments specific to those patches are
  probably better put on the other thread.

  Thomas' patches also fix several issues that were addressed in my WIP
  adjustments to read_stream.c. There are a few left, but it does look
  better.

  The included commits are 0003-0008.

- I rewrote the tests into a tap test. That was exceedingly painful.
  Partially due to tap infrastructure bugs on windows that would sometimes
  cause inscrutable failures, see
  https://www.postgresql.org/message-id/wmovm6xcbwh7twdtymxuboaoarbvwj2haasd3sikzlb3dkgz76%40n45rzycluzft

  I just pushed that fix earlier today.

- Added docs for new GUCs and moved them to a more appropriate section.
  See also
  https://postgr.es/m/x3tlw2jk5gm3r3mv47hwrshffyw7halpczkfbk3peksxds7bvc%40lguk43z3bsyq

- If IO workers fail to reopen the file for an IO, the IO is now marked
  as failed. Previously we'd just hang.

  To test this I added an injection point that triggers the failure. I
  don't know how else this could be tested.

- Added liburing dependency build documentation

- Added a check hook to ensure io_max_concurrency isn't set to 0 (-1 is
  for auto-config)

- Fixed that with io_method == sync we'd issue fadvise calls when not
  appropriate; that was a consequence of my hacky read_stream.c changes.

- Renamed some of the aio<->bufmgr.c interface functions. I don't think
  they're quite perfect, but they're in later patches, so I don't want to
  focus too much on them right now.

- Comment improvements etc.

- Got rid of an unused wait event and renamed other wait events to make
  more sense.

- Previously the injection points were added as part of the test patch; I
  now moved them into the commits adding the code being tested. It was
  too annoying to edit otherwise.

Todo:

- There's a decent amount of FIXMEs in later commits related to
  ereport(LOG)s needing relpath() while in a critical section. I did
  propose a solution to that yesterday:
  https://postgr.es/m/h3a7ftrxypgxbw6ukcrrkspjon5dlninedwb5udkrase3rgqvn%403cokde6btlrl

- A few more corner case tests for the interaction of multiple backends
  trying to do IO on overlapping buffers would be good.

- Our temp table test coverage is atrociously bad.

Questions:

- The test module requires StartBufferIO() to be visible outside of
  bufmgr.c - I think that's ok, but it would be good to know if others
  agree.

I'm planning to push the first two commits soon; I think they're ok on
their own, even if nothing else were to go in.

Greetings,

Andres Freund
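A rough illustration of how a caller might use the batch-mode API above.
This is a sketch under assumptions, not code from the patchset:
start_one_read() is a hypothetical helper, and the patch handles the error
path via pgaio_after_error() rather than a local PG_TRY() block.

    /* stage several IOs, then submit them in one go */
    pgaio_enter_batchmode();
    PG_TRY();
    {
        for (int i = 0; i < nblocks; i++)
            start_one_read(rel, blocknos[i]);   /* staged, not yet submitted */
    }
    PG_FINALLY();
    {
        /* batchmode must be exited even on error */
        pgaio_exit_batchmode();
    }
    PG_END_TRY();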
Attachment
- v2.4-0001-Ensure-a-resowner-exists-for-all-paths-that-may.patch
- v2.4-0002-Allow-lwlocks-to-be-unowned.patch
- v2.4-0003-Refactor-read_stream.c-s-circular-arithmetic.patch
- v2.4-0004-Allow-more-buffers-for-sequential-read-streams.patch
- v2.4-0005-Improve-buffer-pool-API-for-per-backend-pin-lim.patch
- v2.4-0006-Respect-pin-limits-accurately-in-read_stream.c.patch
- v2.4-0007-Support-buffer-forwarding-in-read_stream.c.patch
- v2.4-0008-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.4-0009-aio-Basic-subsystem-initialization.patch
- v2.4-0010-aio-Core-AIO-implementation.patch
- v2.4-0011-aio-Skeleton-IO-worker-infrastructure.patch
- v2.4-0012-aio-Add-worker-method.patch
- v2.4-0013-aio-Add-liburing-dependency.patch
- v2.4-0014-aio-Add-io_uring-method.patch
- v2.4-0015-aio-Add-README.md-explaining-higher-level-desig.patch
- v2.4-0016-aio-Implement-smgr-md-fd-aio-methods.patch
- v2.4-0017-aio-Add-pg_aios-view.patch
- v2.4-0018-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch
- v2.4-0019-bufmgr-Implement-AIO-read-support.patch
- v2.4-0020-bufmgr-Use-aio-for-StartReadBuffers.patch
- v2.4-0021-WIP-aio-read_stream.c-adjustments-for-real-AIO.patch
- v2.4-0022-aio-Add-test_aio-module.patch
- v2.4-0023-aio-Add-bounce-buffers.patch
- v2.4-0024-bufmgr-Implement-AIO-write-support.patch
- v2.4-0025-aio-Add-IO-queue-helper.patch
- v2.4-0026-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.4-0027-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.4-0028-WIP-Use-MAP_POPULATE.patch
Hi,

On 2025-02-19 14:10:44 -0500, Andres Freund wrote:
> I'm planning to push the first two commits soon; I think they're ok on
> their own, even if nothing else were to go in.

I did that for the lwlock patch. But I think I might not do the same for
the "Ensure a resowner exists for all paths that may perform AIO" patch.
The paths for which we are missing resowners are concerned with WAL
writes - but it'll be a while before we get AIO WAL writes. It'd be
fairly harmless to do this change before, but I found the justifying code
comments hard to phrase. E.g.:

--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -361,8 +361,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	BaseInit();
 
 	bootstrap_signals();
+
+	/* need a resowner for IO during BootStrapXLOG() */
+	CreateAuxProcessResourceOwner();
+
 	BootStrapXLOG(bootstrap_data_checksum_version);
 
+	ReleaseAuxProcessResources(true);
+	CurrentResourceOwner = NULL;
+
 	/*
 	 * To ensure that src/common/link-canary.c is linked into the backend, we
 	 * must call it from somewhere. Here is as good as anywhere.

Given that there's no use of resowners inside BootStrapXLOG() today, and
won't be for the next months, it seems confusing?

Greetings,

Andres Freund
Hi,

Attached is v2.5 of the AIO patchset. Relative to 2.4 I:

- Committed some earlier commits.

  I ended up *not* committing the patch to create resowners in more
  backends (e.g. walsender), as that's not really a dependency for now.

  One of the more important things to get committed was in a separate
  thread:
  https://postgr.es/m/b6vveqz6r3wno66rho5lqi6z5kyhfgtvi3jcodyq5rlpp3cu44%40c6dsgf3z7yhs

  Now relpath() can be used for logging while in a critical section. That
  alone allowed me to remove most of the remaining FIXMEs.

- Split the md.c read/write patches; the write side is more complicated
  and isn't needed before write support arrives (much later in the queue
  and very likely not for 18).

  The complicated bit about write support is needing to
  register_dirty_segment() after completion of the write. If
  RegisterSyncRequest() fails, the IO completer needs to open the file
  and sync it itself; unfortunately PathNameOpenFile() allocates memory,
  which isn't ok while in a critical section (even though it'd not be
  detected, as it's using malloc()).

- Reordered patches so that Thomas' read_stream work comes after the
  basic AIO infrastructure patches; there's no dependency on the earlier
  patches.

  I think Thomas might have a newer version of some of these, but since
  they're not intended to be committed as part of this, I didn't spend
  the time to rebase to the latest version.

- Added a small bit of data that can be provided to callbacks, which
  makes it a lot cleaner to transport information like ZERO_ON_ERROR
  (a sketch follows at the end of this mail).

  I also did s/shared_callbacks/callbacks/, as the prior name was
  outdated.

- Substantially expanded the tests, most importantly generic temp file
  tests and AIO-specific cross-backend tests.

  As part of the expanded tests I also needed to export
  TerminateBufferIO(), as was, as previously mentioned, already done in
  an earlier version for StartBufferIO(). Nobody commented on that, so I
  think that's ok.

  I also renamed the tests away from the very inventively named tbl_a,
  tbl_b...

- Moved the commit to create resowners in more places to much later in
  the queue; it's not actually needed for bufmgr.c IO, and nothing
  needing it will land in 18.

- Added a proper commit message for the main commit. I'd appreciate folks
  reading through it. I'm sure I forgot a lot of folks and a lot of
  things.

- Did a fair bit of comment polishing.

- Addressed an XXX in the "aio infrastructure" commit suggesting that we
  might want to error out if a backend is waiting on its own unsubmitted
  IO. Noah argued for erroring out. I now made it so.

- Temporarily added a commit to increase the open-file limit on openbsd.
  I saw related errors without this patch too, but it fails more often
  with it. I already sent a separate email about this.

At this point I am not aware of anything significant left to do in the
main AIO commit, save some of the questions below. There are a lot more
potential optimizations etc., but this is already a very complicated
piece of work, so I think they'll just have to wait for later.

There are a few things to clean up in the bufmgr.c commits; I don't yet
quite like the function naming and there could be a bit less duplication.
But I don't think that needs to be resolved before the main commit.

Questions:

- My current thinking is that we'd set io_method = worker initially - so
  we actually get some coverage - and then decide whether to switch to
  io_method=sync by default for 18 sometime around beta1/2. Does that
  sound reasonable?

- We could reduce memory usage a tiny bit if we made the mapping between
  pgproc and per-backend-aio-state more complicated, i.e. not just
  indexed by ProcNumber. Right now IO workers have the per-backend AIO
  state, but don't actually need it. I'm mildly inclined to think that
  the complexity isn't worth it, but I'm on the fence.

- Three of the commits in the series really are just precursor commits to
  their subsequent commits, which I found helpful for development and
  review, namely:

  - aio: Basic subsystem initialization
  - aio: Skeleton IO worker infrastructure
  - aio: Add liburing dependency

  Not sure if it's worth keeping these separate or whether they should
  just be merged with their "real commit".

- Thomas suggested renaming
    COMPLETED_IO     -> COMPLETED,
    COMPLETED_SHARED -> TERMINATED_BY_COMPLETER,
    COMPLETED_LOCAL  -> TERMINATED_BY_SUBMITTER
  in
  https://www.postgresql.org/message-id/CA%2BhUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw%40mail.gmail.com

  While the other things in the email were commented upon by others and
  addressed in v2.4, the naming aspect wasn't further remarked upon by
  others. I'm not personally in love with the suggested names, but I
  could live with them.

- Right now this series defines PGAIO_VERBOSE to 1. That's good for
  debugging, but all the ereport()s add a noticeable amount of overhead
  at high IO throughput (at multiple gigabytes/second), so that's
  probably not right forever. I'd leave this on initially and then change
  it to default to off later. I think that's ok?

- To allow io_workers to be PGC_SIGHUP, and to eventually allow to
  automatically in/decrease active workers, the max number of workers
  (32) is always allocated. That means we use more semaphores than
  before. I think that's ok, it's not 1995 anymore. Alternatively we can
  add an "io_workers_max" GUC and probe for it in initdb.

- pg_stat_aios currently has the IO Handle flags as dedicated columns.
  Not sure that's great?

  They could be an enum array or such too? That'd perhaps be a bit more
  extensible? OTOH, we don't currently use enums in the catalogs and
  arrays are somewhat annoying to conjure up from C.

Todo:

- A few more passes over the main commit; I'm sure there are a few more
  inartful comments, odd formatting and such.

- Check if there's a decent way to deduplicate
  pgaio_io_call_complete_shared() and pgaio_io_call_complete_local().

- Figure out how to deduplicate support for LockBufferForCleanup() in
  TerminateBufferIO().

- Documentation for pg_stat_aios.

- Check if documentation for track_io_timing needs to be adjusted; after
  the bufmgr.c changes we only track waiting for an IO.

- Some of the test_aio code is specific to non-temp tables; it's probably
  worth generalizing it to deal with temp tables and invoking it for
  both.

Greetings,

Andres
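To picture the "small bit of data" for callbacks mentioned above: a tiny
per-IO datum registered together with the completion callback can carry
flags such as ZERO_ON_ERROR to the completion side. The sketch below is
illustrative only; the constant names and the registration function's
exact signature are assumptions, not verified from the patch.

    uint8   cb_data = 0;

    if (flags & READ_BUFFERS_ZERO_ON_ERROR)       /* hypothetical flag */
        cb_data |= PGAIO_CB_ZERO_ON_ERROR;        /* hypothetical bit */

    /* the datum travels with the handle to the completion callback */
    pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, cb_data);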
Attachment
- v2.5-0001-aio-Basic-subsystem-initialization.patch
- v2.5-0002-aio-Add-asynchronous-I-O-infrastructure.patch
- v2.5-0003-aio-Skeleton-IO-worker-infrastructure.patch
- v2.5-0004-aio-Add-worker-method.patch
- v2.5-0005-aio-Add-liburing-dependency.patch
- v2.5-0006-aio-Add-io_uring-method.patch
- v2.5-0007-aio-Add-README.md-explaining-higher-level-desig.patch
- v2.5-0008-aio-Implement-smgr-md-fd-read-support.patch
- v2.5-0009-aio-Add-pg_aios-view.patch
- v2.5-0010-Refactor-read_stream.c-s-circular-arithmetic.patch
- v2.5-0011-Improve-buffer-pool-API-for-per-backend-pin-lim.patch
- v2.5-0012-Respect-pin-limits-accurately-in-read_stream.c.patch
- v2.5-0013-Support-buffer-forwarding-in-read_stream.c.patch
- v2.5-0014-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.5-0015-WIP-tests-Expand-temp-table-tests-to-some-pin-r.patch
- v2.5-0016-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch
- v2.5-0017-bufmgr-Implement-AIO-read-support.patch
- v2.5-0018-bufmgr-Use-aio-for-StartReadBuffers.patch
- v2.5-0019-WIP-aio-read_stream.c-adjustments-for-real-AIO.patch
- v2.5-0020-aio-Add-test_aio-module.patch
- v2.5-0021-wip-ci-Increase-openbsd-kern.maxfiles-to-fix-co.patch
- v2.5-0022-aio-Implement-smgr-md-fd-write-support.patch
- v2.5-0023-aio-Add-bounce-buffers.patch
- v2.5-0024-bufmgr-Implement-AIO-write-support.patch
- v2.5-0025-aio-Add-IO-queue-helper.patch
- v2.5-0026-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.5-0027-Ensure-a-resowner-exists-for-all-paths-that-may.patch
- v2.5-0028-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.5-0029-WIP-Use-MAP_POPULATE.patch
On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> Attached is v2.5 of the AIO patchset.
[..]

Hi, thanks for working on this!

> Questions:
>
> - My current thinking is that we'd set io_method = worker initially - so we
>   actually get some coverage - and then decide whether to switch to
>   io_method=sync by default for 18 sometime around beta1/2. Does that sound
>   reasonable?

IMHO, yes, good idea. Anyway, the final outcome will partially depend on
how many other stream consumers get committed, right?

> - Three of the commits in the series really are just precursor commits to
>   their subsequent commits, which I found helpful for development and
>   review, namely:
>
>   - aio: Basic subsystem initialization
>   - aio: Skeleton IO worker infrastructure
>   - aio: Add liburing dependency
>
>   Not sure if it's worth keeping these separate or whether they should
>   just be merged with their "real commit".

For me it was easier to read those when they are separate.

> - Right now this series defines PGAIO_VERBOSE to 1. That's good for
>   debugging, but all the ereport()s add a noticeable amount of overhead
>   at high IO throughput (at multiple gigabytes/second), so that's
>   probably not right forever. I'd leave this on initially and then
>   change it to default to off later. I think that's ok?

+1, hopefully nothing is recording/logging/running with
log_min_messages>=debug3, because only then does it start to be visible.

> - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
>   automatically in/decrease active workers, the max number of workers
>   (32) is always allocated. That means we use more semaphores than
>   before. I think that's ok, it's not 1995 anymore. Alternatively we can
>   add an "io_workers_max" GUC and probe for it in initdb.

Wouldn't that matter only on *BSDs? BTW I somehow cannot imagine someone
saturating >= 32 workers (if one does, better to switch to uring
anyway?), but I have a related question about the fds those workers keep
open.

> - pg_stat_aios currently has the IO Handle flags as dedicated columns.
>   Not sure that's great?
>
>   They could be an enum array or such too? That'd perhaps be a bit more
>   extensible? OTOH, we don't currently use enums in the catalogs and
>   arrays are somewhat annoying to conjure up from C.

s/pg_stat_aios/pg_aios/ ? :^)

It looks good to me as it is. Anyway, it is a debugging view - perhaps
mark it as such in the docs - so there is no stable API for it and it
shouldn't be queried by any software anyway.

> - Documentation for pg_stat_aios.

pg_aios! :)

So, I've taken the aio-2 branch from your github repo for a small ride on
legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is more a
question: perhaps IO workers should auto-close fds on errors, or should
we use SIGUSR2 for it? The scenario is like this:

# dm-dust is not that available even on modern distros (not always
# compiled), but flakey seemed to work on 4.18.x:
losetup /dev/loop0 /dd.img
mkfs.ext4 -j /dev/loop0
mkdir /flakey
mount /dev/loop0 /flakey # for now it will work
mkdir /flakey/tblspace
chown postgres /flakey/tblspace
chmod 0700 /flakey/tblspace
CREATE TABLESPACE test1 LOCATION '/flakey/tblspace'
CREATE TABLE t1fail on that test1 tablespace + INSERT SOME DATA
pg_ctl stop
umount /flakey
# after 1s start throwing IO errors:
echo "0 `blockdev --getsz /dev/loop0` flakey /dev/loop0 0 1 1" | dmsetup create flakey
mount /dev/mapper/flakey /flakey # might even say: mount: /flakey: can't read superblock on /dev/mapper/flakey.
mount /dev/mapper/flakey /flakey
pg_ctl start

and then this will happen:

postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file "pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file "pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file "pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not open file "pg_tblspc/24579/PG_18_202503031/5/24586_vm": Read-only file system

so the usual stuff with the kernel remounting it RO, but here's the
dragon with io_method=worker:

# mount -o remount,rw /flakey/
mount: /flakey: cannot remount /dev/mapper/flakey read-write, is write-protected.
# umount /flakey # to fsck or just mount rw again
umount: /flakey: target is busy.
# lsof /flakey/
COMMAND      PID     USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
postgres  103483 postgres  14u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586
postgres  103484 postgres   6u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586
postgres  103485 postgres   6u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586

Those 10348[345] are IO workers; they still have open fds and there's no
way to close those without a restart -- well, without close() injection
via gdb. pg_terminate_backend() on those won't work. The only thing that
works seems to be sending SIGUSR2, but is that safe [there could be some
errors after pwrite()]? With io_method=sync, just quitting the backend of
course works. Not sure what your thoughts are, because any other bgworker
could be having open fds there. It's a very minor thing.

Otherwise that outage of a separate tablespace (rarely used) would
potentially cause an inability to fsck there and lower the availability
of the DB (due to the potential restart required). I'm thinking
especially of scenarios where lots of schemas are used with lots of
tablespaces OR where temp_tablespace is employed for some dedicated
(fast/furious/faulty) device. So I'm hoping SIGUSR2 is enough, right
(4231f4059e5e54d78c56b904f30a5873da88e163 seems to be doing it anyway)?

BTW: While at this, I've tried amcheck/pg_surgery for 1 min and they both
seem to work.

-J.
Hi,

On 2025-03-06 12:36:43 +0100, Jakub Wartak wrote:
> On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Questions:
> >
> > - My current thinking is that we'd set io_method = worker initially - so we
> >   actually get some coverage - and then decide whether to switch to
> >   io_method=sync by default for 18 sometime around beta1/2. Does that sound
> >   reasonable?
>
> IMHO, yes, good idea. Anyway, the final outcome will partially depend on
> how many other stream consumers get committed, right?

I think it's more whether we find cases where it performs substantially
worse with the read stream users that exist. The behaviour for
non-read-stream IO shouldn't change.

> > - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
> >   automatically in/decrease active workers, the max number of workers
> >   (32) is always allocated. That means we use more semaphores than
> >   before. I think that's ok, it's not 1995 anymore. Alternatively we
> >   can add an "io_workers_max" GUC and probe for it in initdb.
>
> Wouldn't that matter only on *BSDs?

Yea, NetBSD and OpenBSD only, I think.

> > - pg_stat_aios currently has the IO Handle flags as dedicated columns.
> >   Not sure that's great?
> >
> >   They could be an enum array or such too? That'd perhaps be a bit
> >   more extensible? OTOH, we don't currently use enums in the catalogs
> >   and arrays are somewhat annoying to conjure up from C.
>
> s/pg_stat_aios/pg_aios/ ? :^)

Ooops, yes.

> It looks good to me as it is. Anyway, it is a debugging view - perhaps
> mark it as such in the docs - so there is no stable API for it and it
> shouldn't be queried by any software anyway.

Cool.

> > - Documentation for pg_stat_aios.
>
> pg_aios! :)
>
> So, I've taken the aio-2 branch from your github repo for a small ride
> on legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is more a
> question: perhaps IO workers should auto-close fds on errors, or should
> we use SIGUSR2 for it? The scenario is like this:

When you say "auto-close", you mean that one IO error should trigger
*all* workers to close their FDs?

> so the usual stuff with the kernel remounting it RO, but here's the
> dragon with io_method=worker:
>
> # mount -o remount,rw /flakey/
> mount: /flakey: cannot remount /dev/mapper/flakey read-write, is
> write-protected.
> # umount /flakey # to fsck or just mount rw again
> umount: /flakey: target is busy.
> # lsof /flakey/
> COMMAND      PID     USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
> postgres  103483 postgres  14u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586
> postgres  103484 postgres   6u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586
> postgres  103485 postgres   6u  REG  253,2 36249600   17 /flakey/tblspace/PG_18_202503031/5/24586
>
> Those 10348[345] are IO workers; they still have open fds and there's
> no way to close those without a restart -- well, without close()
> injection via gdb.

The same is already true with bgwriter, checkpointer etc?

> pg_terminate_backend() on those won't work. The only thing that works
> seems to be sending SIGUSR2

Sending SIGINT works.

> , but is that safe [there could be some errors after pwrite()]?

Could you expand on that?

> With io_method=sync, just quitting the backend of course works. Not
> sure what your thoughts are, because any other bgworker could be having
> open fds there. It's a very minor thing. Otherwise that outage of a
> separate tablespace (rarely used) would potentially cause an inability
> to fsck there and lower the availability of the DB (due to the
> potential restart required).

I think a crash-restart is the only valid thing to get out of a scenario
like that, independent of AIO:

- If there had been any writes, we need to perform crash recovery anyway,
  to recreate those writes

- If there just were reads, it's good to restart as well, as otherwise
  there might be pages in the buffer pool that don't exist on disk
  anymore, due to the errors.

Greetings,

Andres Freund
On Tue, Mar 4, 2025 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> - pg_stat_aios currently has the IO Handle flags as dedicated columns.
>   Not sure that's great?

I don't like the name. Pluralizing abbreviations is weird, and it's even
weirder when the abbreviation is not one that is universally known. Maybe
just drop the "s".

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-03-06 10:33:33 -0500, Robert Haas wrote:
> On Tue, Mar 4, 2025 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> > - pg_stat_aios currently has the IO Handle flags as dedicated columns.
> >   Not sure that's great?
>
> I don't like the name.

I don't think it changes anything, but as Jakub pointed out, I thinko'd
the name in the email you're responding to: it's pg_aios, not
pg_stat_aios. It shows the currently in-flight IOs, not accumulated
statistics about them, hence no _stat_.

I don't like the name either. IIRC I asked for suggestions elsewhere in
the thread; not a lot was forthcoming, so I left it at pg_aios.

> Pluralizing abbreviations is weird, and it's even weirder when the
> abbreviation is not one that is universally known. Maybe just drop the
> "s".

I went with plural because that's what we have in other views showing the
"current" state:

- pg_cursors
- pg_file_settings
- pg_prepared_statements
- pg_prepared_xacts
- pg_replication_slots
- pg_locks
- ...

But you're right that those aren't abbreviations.

Greetings,

Andres Freund
On Thu, Mar 6, 2025 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-03-06 12:36:43 +0100, Jakub Wartak wrote:
> > On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > Questions:
> > >
> > > - My current thinking is that we'd set io_method = worker initially - so we
> > >   actually get some coverage - and then decide whether to switch to
> > >   io_method=sync by default for 18 sometime around beta1/2. Does that
> > >   sound reasonable?
> >
> > IMHO, yes, good idea. Anyway, the final outcome will partially depend
> > on how many other stream consumers get committed, right?
>
> I think it's more whether we find cases where it performs substantially
> worse with the read stream users that exist. The behaviour for
> non-read-stream IO shouldn't change.

OK, so in order to get the full picture for v18beta this would mean
$thread + the following ones?:

- Use read streams in autoprewarm
- BitmapHeapScan table AM violation removal (and use streaming read API)
- Index Prefetching (it seems it has stalled?)

or is there something more planned? (I'm asking what to apply on top of
AIO to minimize the number of potential test runs, which seem to take
lots of time, so as to do it all in one go.)

> > So, I've taken the aio-2 branch from your github repo for a small ride
> > on legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is more a
> > question: perhaps IO workers should auto-close fds on errors, or
> > should we use SIGUSR2 for it? The scenario is like this:
>
> When you say "auto-close", you mean that one IO error should trigger
> *all* workers to close their FDs?

Yeah, I somehow was thinking about such a thing, but after you bolded
that "*all*", my question sounds much more stupid than it did yesterday.
Sorry for asking a stupid question :)

> The same is already true with bgwriter, checkpointer etc?

Yeah... I was kind of looking for a way of getting "higher availability"
in the presence of partial IO (tablespace) errors.

> > pg_terminate_backend() on those won't work. The only thing that works
> > seems to be sending SIGUSR2
>
> Sending SIGINT works.

Ugh, ok, it looks like I've been overthinking that, cool.

> > , but is that safe [there could be some errors after pwrite()]?
>
> Could you expand on that?

It is pure speculation on my side: well, I'm always concerned about
leaving something out there without cleanup after errors and then
re-using it for something else much later, especially on edge-cases like
NFS or FUSE. In the backend we could maintain some state, but io_workers
are shared across backends. E.g. some pwrite() failing on NFS, we are not
closing that fd, and then reusing it for something else much later for a
different backend (although AFAIK close() does not guarantee anything,
but e.g. it could be that some inode/path or something was simply marked
dangling - a fresh pair of close()/open() could return an error, but here
we would just keep on pwrite()ing there?).

OK, the only question that remains: does it make sense to try something
like pgbench on NFS UDP mountopt=hard,nointr + intermittent iptables DROP
from time to time, or is it not worth trying?

> > With io_method=sync, just quitting the backend of course works. Not
> > sure what your thoughts are, because any other bgworker could be
> > having open fds there. It's a very minor thing. Otherwise that outage
> > of a separate tablespace (rarely used) would potentially cause an
> > inability to fsck there and lower the availability of the DB (due to
> > the potential restart required).
>
> I think a crash-restart is the only valid thing to get out of a scenario
> like that, independent of AIO:
>
> - If there had been any writes, we need to perform crash recovery
>   anyway, to recreate those writes
>
> - If there just were reads, it's good to restart as well, as otherwise
>   there might be pages in the buffer pool that don't exist on disk
>   anymore, due to the errors.

OK, cool, thanks!

-J.
Hi,

On 2025-03-07 11:21:09 +0100, Jakub Wartak wrote:
> On Thu, Mar 6, 2025 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it's more whether we find cases where it performs
> > substantially worse with the read stream users that exist. The
> > behaviour for non-read-stream IO shouldn't change.
>
> OK, so in order to get the full picture for v18beta this would mean
> $thread + the following ones?:
>
> - Use read streams in autoprewarm
> - BitmapHeapScan table AM violation removal (and use streaming read API)

Yep.

> - Index Prefetching (it seems it has stalled?)

I don't think there's any chance it'll be in 18. There's a good bit more
work needed before it can go in...

> or is there something more planned? (I'm asking what to apply on top of
> AIO to minimize the number of potential test runs, which seem to take
> lots of time, so as to do it all in one go.)

I think there may be some more (e.g. btree index vacuuming), but I don't
think they'll have *that* big an impact.

> > > So, I've taken the aio-2 branch from your github repo for a small
> > > ride on legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is
> > > more a question: perhaps IO workers should auto-close fds on errors,
> > > or should we use SIGUSR2 for it? The scenario is like this:
> >
> > When you say "auto-close", you mean that one IO error should trigger
> > *all* workers to close their FDs?
>
> Yeah, I somehow was thinking about such a thing, but after you bolded
> that "*all*", my question sounds much more stupid than it did
> yesterday. Sorry for asking a stupid question :)

Don't worry about that :)

> > The same is already true with bgwriter, checkpointer etc?
>
> Yeah... I was kind of looking for a way of getting "higher availability"
> in the presence of partial IO (tablespace) errors.

I'm really doubtful that's worthwhile to pursue. IME the system is pretty
much hosed once this starts happening, and it's often made *worse* by
trying to limp along.

> OK, the only question that remains: does it make sense to try something
> like pgbench on NFS UDP mountopt=hard,nointr + intermittent iptables
> DROP from time to time, or is it not worth trying?

I don't think it's particularly interesting. But then I'd *never* trust
any meaningful data to a PG running on NFS.

Greetings,

Andres Freund
Hi,

On 2025-03-06 11:53:41 -0500, Andres Freund wrote:
> On 2025-03-06 10:33:33 -0500, Robert Haas wrote:
> > On Tue, Mar 4, 2025 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > - pg_stat_aios currently has the IO Handle flags as dedicated
> > >   columns. Not sure that's great?
> >
> > I don't like the name.
>
> I don't think it changes anything, but as Jakub pointed out, I thinko'd
> the name in the email you're responding to: it's pg_aios, not
> pg_stat_aios.
>
> It shows the currently in-flight IOs, not accumulated statistics about
> them, hence no _stat_.
>
> I don't like the name either. IIRC I asked for suggestions elsewhere in
> the thread; not a lot was forthcoming, so I left it at pg_aios.

What about pg_io_handles?

Greetings,

Andres Freund
Hi,

Tom, CCed you since you have worked most on elog.c.

On 2025-03-07 16:23:51 -0500, Andres Freund wrote:
> What about pg_io_handles?

While looking at the view I felt motivated to tackle the one FIXME in the
implementation of the view. Namely that the "error_desc" column wasn't
populated (the view did show that there was an error, but not what the
error was).

Which led me down a sad sad rabbit hole, largely independent of AIO.

A bit of background: For AIO, completion callbacks can signal errors
(e.g. a page header failing validation). That error can be logged in the
callback and/or raised later, e.g. by the query that issued the IO. AIO
callbacks happen in critical sections, which is required to be able to
use AIO for WAL (see README.md for more details).

Currently errors are logged/raised by ereport()s in functions that get
passed an elevel, pretty standard. A few of the ereport()s use
errcode_for_file_access() to translate an errno to an sqlerrcode.

Now on to the problem: The result of an ereport() can't be put into a
view, obviously. I didn't think it'd be good if each kind of error needed
to be implemented twice, once with ereport() and once to just return a
string to put in the view.

I tried a few things:

1) Use errsave() to allow delayed reporting of the error

   I encountered a few problems:

   - errsave() doesn't allow the log level to be specified, which means
     it can't directly be used to LOG if no context is specified.

     This could be worked around by always specifying the context, with
     ErrorSaveContext.details_wanted = true, and having generic code that
     changes the elevel to whatever is appropriate and then uses
     ThrowErrorData() to log the message.

   - errsave_start() sets assoc_context to CurrentMemoryContext and
     errsave_finish() allocates an ErrorData copy in CurrentMemoryContext.

     This makes naive use of this approach when logging in a critical
     section impossible. If ErrorSaveContext is not passed in, an ERROR
     will be raised, even if we just want to log. If ErrorSaveContext is
     used, we allocate memory in the caller's context, which isn't
     allowed in a critical section.

     The only way I saw to work around that was to switch to ErrorContext
     before calling errsave(). That's doable, as the logging is called
     from one function (pgaio_result_report()).

     That kinda works, but as a consequence we more than double the
     memory usage in ErrorContext, as errsave_finish() will palloc a new
     ErrorData and ThrowErrorData() copies that ErrorData and all its
     strings back to ErrorContext.

2) Have the error callback format the error using a helper function
   instead of using ereport()

   Problems:

   - errcode_for_file_access() would need to be reimplemented / split
     into a function translating an errno into an sqlerrcode without
     getting it from the error data stack

   - emitting the log message in a critical section would require either
     doing the error formatting in ErrorContext or creating another
     context with reserved memory to do so

   - allowing to specify DETAIL, HINT etc. basically requires a small
     reimplementation of the elog.c interface

3) Use pre_format_elog_string() / format_elog_string(), similar to what
   guc.c does for check hooks via GUC_check_errmsg(), GUC_check_errhint()
   ...

   Problems:

   - Requires duplicating errcode_for_file_access(), for a similar reason
     as in 2)

   - Not exactly pretty

   - Somewhat gnarly, but doable, to make use of %m safe. The way it's
     done in guc.h afaict isn't safe: pre_format_elog_string() is called
     for each of GUC_check_{errmsg,errdetail,errhint}. As the global
     errno might get set during the format_elog_string(), it'll not be
     the right one during the next GUC_check_*.

4) Don't use ereport() directly, but instead put the errstart() in
   pgaio_result_report(), before calling the error description callback.
   When emitting a log message, call errfinish() after the callback. For
   the view, get the message out via CopyErrorData() and free the memory
   again using FlushErrorState(). (A sketch follows at the end of this
   mail.)

   Problems:

   - Seems extremely hacky

I implemented all of them, but don't really like any of them.

Unless somebody has a better idea, or we agree that one of the above is
actually an acceptable approach, I'm inclined to simply remove the column
containing the description of the error. The window in which one could
see an IO with an error is rather short most of the time anyway, and the
error will also be logged.

It's a bit annoying that adding the column later would require revising
the signature of the error reporting callback at that time, but I think
that degree of churn is acceptable.

The main reason I wanted to write this up is that it seems that we're
just lacking some infrastructure here.

Greetings,

Andres Freund
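To make 4) concrete, a minimal sketch, for discussion only: errstart(),
errfinish(), CopyErrorData() and FlushErrorState() are existing elog.c
interfaces, but everything around them here is hypothetical.

    if (errstart(elevel, TEXTDOMAIN))
    {
        /* hypothetical callback that adds errmsg(), errdetail(), ... */
        report_error_desc_cb(ioh);

        if (emit_log)
            errfinish(__FILE__, __LINE__, __func__);  /* log the message */
        else
        {
            /* for the view: capture the message instead of emitting it */
            ErrorData  *edata = CopyErrorData();
            char       *desc = pstrdup(edata->message);

            FlushErrorState();
            /* ... hand desc to the pg_aios view ... */
        }
    }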
Andres Freund <andres@anarazel.de> writes:
> While looking at the view I felt motivated to tackle the one FIXME in
> the implementation of the view. Namely that the "error_desc" column
> wasn't populated (the view did show that there was an error, but not
> what the error was).
> Which led me down a sad sad rabbit hole, largely independent of AIO.
> ...
> The main reason I wanted to write this up is that it seems that we're
> just lacking some infrastructure here.

Maybe. The mention of elog.c in the same breath with critical sections is
already enough to scare me; we surely daren't invoke gettext() in a
critical section, for instance. I feel the most we could hope for here is
to report a constant string that would not get translated till later,
outside the critical section. That seems less about infrastructure and
more about how the AIO error handling/reporting code is laid out.

In the meantime, if leaving the error out of this view is enough to make
the problem go away, let's do that.

			regards, tom lane
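A minimal sketch of the constant-string idea, assuming the AIO callback
only needs to pick from a fixed set of messages (gettext_noop() and _()
are existing infrastructure; the surrounding code is illustrative):

    /* inside the critical section: no gettext(), just record the constant */
    const char *error_desc = gettext_noop("invalid page header");

    /* later, outside the critical section, translating is safe */
    ereport(LOG, errmsg("IO failed: %s", _(error_desc)));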
Hi,

Attached is v2.6 of the AIO patchset. Relative to 2.5 I:

- Improved the split between subsystem initialization and the main AIO
  commit, as well as the one between the worker infrastructure and
  io_method=worker.

  Seemed worthwhile, as the only one who voiced an opinion about
  squashing those commits was opposed to it.

- Added a lot more comments to aio.h/aio_internal.h. I think just about
  anything that should conceivably have a comment has one.

- Reordered fields in PgAioHandle to waste less space due to padding

- Narrowed a few *count fields; they were 64bit without ever being able
  to reach that

- Used aio_types.h more widely, instead of "manual" forward declarations.
  This required moving a few typedefs to aio_types.h

- Substantial commit message improvements.

- Removed the pg_aios.error_desc column, due to:
  https://postgr.es/m/qzxq6mqqozctlfcg2kg5744gmyubicvuehnp4a7up472thlvz2%40y5xqgd5wcwhw

- Reordered the commits slightly, to put the README just after the
  smgr.c/md.c/... support, as the README references those in the examples

- Stopped creating backend-local io_uring instances; those are vestigial
  for now. We likely will want to reintroduce them at some point (e.g.
  for network IO), but we can do that at that time.

- There were a lot of duplicated codepaths in the bufmgr.c support for
  AIO due to temp tables. I added a few commits refactoring the temp
  buffers state management to look a lot more like the shared buffer
  code. I'm not sure that that's the best path, but they all seemed like
  substantial improvements on their own.

- Putting io_method in PG_TEST_INITDB_EXTRA_OPTS previously broke a test,
  because Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
  options specified by ->extra. I now worked around that by appending the
  io method to a local PG_TEST_INITDB_EXTRA_OPTS, but brrr.

- The tracepoint for read completion omitted the fact that it was a temp
  table, if so.

- Fixed some duplicated function decls, due to a misresolved
  merge-conflict

Current state:

- 0001, 0002 - core AIO - IMO pretty much ready
- 0003, 0004 - IO worker - same
- 0005, 0006 - io_uring support - close, but we need to do something
  about set_max_fds(), which errors out spuriously in some cases
- 0007 - smgr/md/fd.c readv support - seems quite close, but might
  benefit from another pass through
- 0008 - README - I think it's good, but I'm probably not seeing the
  trees for the forest anymore
- 0009 - pg_aios view - naming not resolved, docs missing
- 0010 to 0014 - from another thread, just included here due to a
  dependency
- 0016 to 0020 - cleanups for temp buffers code - I just wrote these to
  clean up the code before making larger changes, needs review
- 0021 - keep BufferDesc refcount up to date for temp buffers - I think
  that's pretty much ready, but depends on earlier patches
- 0022 - bufmgr readv AIO support - some naming and some code duplication
  need to be resolved, but otherwise quite close
- 0023 - use AIO in StartReadBuffers() - perhaps a bit of polishing
  needed
- 0024 - adjust read_stream.c for AIO - I think Thomas has a better patch
  for this in the works
- 0025 - tests for AIO - I think it's reasonable, unless somebody objects
  to exporting a few bufmgr.c functions to the test
- the rest: not for 18

Greetings,

Andres Freund
Attachment
- v2.6-0018-localbuf-Introduce-TerminateLocalBufferIO.patch
- v2.6-0001-aio-Basic-subsystem-initialization.patch
- v2.6-0002-aio-Add-asynchronous-I-O-infrastructure.patch
- v2.6-0003-aio-Infrastructure-for-io_method-worker.patch
- v2.6-0004-aio-Add-io_method-worker.patch
- v2.6-0005-aio-Add-liburing-dependency.patch
- v2.6-0006-aio-Add-io_method-io_uring.patch
- v2.6-0007-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.6-0008-aio-Add-README.md-explaining-higher-level-desig.patch
- v2.6-0009-aio-Add-pg_aios-view.patch
- v2.6-0010-Refactor-read_stream.c-s-circular-arithmetic.patch
- v2.6-0011-Improve-buffer-pool-API-for-per-backend-pin-lim.patch
- v2.6-0012-Respect-pin-limits-accurately-in-read_stream.c.patch
- v2.6-0013-Support-buffer-forwarding-in-read_stream.c.patch
- v2.6-0014-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.6-0015-tests-Expand-temp-table-tests-to-some-pin-relat.patch
- v2.6-0016-localbuf-Fix-dangerous-coding-pattern-in-GetLoc.patch
- v2.6-0017-localbuf-Introduce-InvalidateLocalBuffer.patch
- v2.6-0019-localbuf-Introduce-FlushLocalBuffer.patch
- v2.6-0020-localbuf-Introduce-StartLocalBufferIO.patch
- v2.6-0021-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.6-0022-bufmgr-Implement-AIO-read-support.patch
- v2.6-0023-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.6-0024-WIP-aio-read_stream.c-adjustments-for-real-AIO.patch
- v2.6-0025-aio-Add-test_aio-module.patch
- v2.6-0026-aio-Implement-smgr-md-fd-write-support.patch
- v2.6-0027-aio-Add-bounce-buffers.patch
- v2.6-0028-bufmgr-Implement-AIO-write-support.patch
- v2.6-0029-aio-Add-IO-queue-helper.patch
- v2.6-0030-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.6-0031-Ensure-a-resowner-exists-for-all-paths-that-may.patch
- v2.6-0032-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.6-0033-WIP-Use-MAP_POPULATE.patch
On Mon, Mar 10, 2025 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> - 0016 to 0020 - cleanups for temp buffers code - I just wrote these to
>   clean up the code before making larger changes, needs review

This is a review of 0016-0020.

Commit messages for 0017-0020 are thin. I assume you will beef them up a
bit before committing. Really, though, those matter much less than 0016
which is an actual bug (or pre-bug) fix. I called out the ones where I
think you should really consider adding more detail to the commit
message.

0016:

  * the case, write it out before reusing it!
  */
- if (buf_state & BM_DIRTY)
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
  {
+     uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);

I don't love that you fetch in the if statement and inside the if
statement. You wouldn't normally do this, so it sticks out. I get that
you want to avoid having the problem this commit fixes again, but maybe
it is worth just fetching the buf_state above the if statement and adding
a comment that it could have changed so you must do that. Anyway, I think
your future patches make the local buf_state variable in this function
obsolete, so perhaps it doesn't matter.

Not related to this patch, but while reading this code, I noticed that
this line of code is really weird:

    LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();

I actually don't understand what it is doing ... setting the result of
the macro to the result of GetLocalBufferStorage()? I haven't seen
anything like that before.

Otherwise, this patch LGTM.

0017:

+++ b/src/backend/storage/buffer/localbuf.c
@@ -56,6 +56,7 @@ static int NLocalPinnedBuffers = 0;
 static Buffer GetLocalVictimBuffer(void);
+static void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced);

Technically this line is too long.

+ * InvalidateLocalBuffer -- mark a local buffer invalid.
+ *
+ * If check_unreferenced is true, error out if the buffer is still
+ * used. Passing false is appropriate when redesignating the buffer instead
+ * dropping it.
+ *
+ * See also InvalidateBuffer().
+ */
+static void
+InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
+{

I was on the fence about the language "buffer is still used", since this
is about the ref count and not the usage count. If this is the language
used elsewhere perhaps it is fine.

I also was not sure what redesignate means here. If you mean to use this
function in the future in other contexts than eviction and dropping
buffers, fine. But otherwise, maybe just use a more obvious word (like
eviction).

0018:

Compiler now warns that buf_state is unused in GetLocalVictimBuffer().

@@ -4564,8 +4548,7 @@ FlushRelationBuffers(Relation rel)
 							IOCONTEXT_NORMAL, IOOP_WRITE,
 							io_start, 1, BLCKSZ);

-			buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
-			pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			TerminateLocalBufferIO(bufHdr, true, 0);

FlushRelationBuffers() used to clear BM_JUST_DIRTIED, which it seems like
wouldn't have been applicable to local buffers before but, actually, with
async IO could perhaps happen in the future? Anyway,
TerminateLocalBufferIO() doesn't clear that flag, so you should call that
out if it was intentional.

@@ -5652,8 +5635,11 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
+	buf_state &= ~BM_IO_IN_PROGRESS;
+	buf_state &= ~BM_IO_ERROR;
-	buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);

Is it worth mentioning in the commit message that you made a cosmetic
change to TerminateBufferIO()?

0019: LGTM

0020:

This commit message is probably too thin. I think you need to at least
say something about this being used by AIO in the future. Out of context
of this patch set, it will be confusing.

+/*
+ * Like StartBufferIO, but for local buffers
+ */
+bool
+StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
+{

I think you could use a comment about why nowait might be useful for
local buffers in the future. It wouldn't make sense with synchronous I/O,
so it feels a bit weird without any comment.

+	if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
+	{
+		/* someone else already did the I/O */
+		UnlockBufHdr(bufHdr, buf_state);
+		return false;
+	}

UnlockBufHdr() explicitly says it should not be called for local buffers.
I know that code is unreachable right now, but it doesn't feel quite
right. I'm not sure what the architecture of AIO local buffers will be
like, but if other processes can't access these buffers, I don't know why
you would need BM_LOCKED. And if you will, I think you need to edit the
UnlockBufHdr() comment.

@@ -1450,13 +1450,11 @@ static inline bool
 WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
 {
 	if (BufferIsLocal(buffer))
 	else
-		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+		return StartBufferIO(GetBufferDescriptor(buffer - 1),
+							 true, nowait);

I'm not sure it is worth the diff in the non-local buffer case to reflow
this. It is already confusing enough in this patch that you are adding
some code that is mostly unneeded.

- Melanie
Hi,

On 2025-03-11 11:31:18 -0400, Melanie Plageman wrote:
> On Mon, Mar 10, 2025 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > - 0016 to 0020 - cleanups for temp buffers code - I just wrote these
> >   to clean up the code before making larger changes, needs review
>
> This is a review of 0016-0020.
>
> Commit messages for 0017-0020 are thin. I assume you will beef them up
> a bit before committing.

Yea. I wanted to get some feedback on whether these refactorings are a
good idea or not...

> Really, though, those matter much less than 0016 which is an actual bug
> (or pre-bug) fix. I called out the ones where I think you should really
> consider adding more detail to the commit message.
>
> 0016:

Do you think we should backpatch that change? It's not really an active
bug in 16+, but it's also not quite right. The other changes surely
shouldn't be backpatched...

>   * the case, write it out before reusing it!
>   */
> - if (buf_state & BM_DIRTY)
> + if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
>   {
> +     uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
>
> I don't love that you fetch in the if statement and inside the if
> statement. You wouldn't normally do this, so it sticks out. I get that
> you want to avoid having the problem this commit fixes again, but maybe
> it is worth just fetching the buf_state above the if statement and
> adding a comment that it could have changed so you must do that.

It seems way too easy to introduce new similar breakages if the scope of
buf_state is that wide - yesterday I wasted 90min because I did just that
in another similar place. The narrower scopes make that much less likely
to be a problem.

> Anyway, I think your future patches make the local buf_state variable
> in this function obsolete, so perhaps it doesn't matter.

Leaving the defensive-programming aspect aside, it does seem like a
better intermediary state to me to have the local vars than to have to
change more lines when introducing FlushLocalBuffer() etc.

> Not related to this patch, but while reading this code, I noticed that
> this line of code is really weird:
>
>     LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
>
> I actually don't understand what it is doing ... setting the result of
> the macro to the result of GetLocalBufferStorage()? I haven't seen
> anything like that before.

Yes, that's what it's doing. LocalBufferBlockPointers() evaluates to a
value that can be used as an lvalue in an assignment. Not exactly
pretty...

> Otherwise, this patch LGTM.
>
> 0017:
>
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -56,6 +56,7 @@ static int NLocalPinnedBuffers = 0;
>  static Buffer GetLocalVictimBuffer(void);
> +static void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced);
>
> Technically this line is too long.

Oh, do I love our line length limits. But, um, is it actually too long?
It's 78 chars, which is exactly our limit, I think?

> I was on the fence about the language "buffer is still used", since
> this is about the ref count and not the usage count. If this is the
> language used elsewhere perhaps it is fine.

I guess I can make it "still pinned".

> I also was not sure what redesignate means here. If you mean to use
> this function in the future in other contexts than eviction and
> dropping buffers, fine. But otherwise, maybe just use a more obvious
> word (like eviction).

I was trying to reference changing the identity of the buffer as part of
buffer replacement, where we keep a pin to the buffer - compared to the
use of InvalidateLocalBuffer() in DropRelationAllLocalBuffers() /
DropRelationLocalBuffers(). New version:

/*
 * InvalidateLocalBuffer -- mark a local buffer invalid.
 *
 * If check_unreferenced is true, error out if the buffer is still
 * pinned. Passing false is appropriate when calling InvalidateLocalBuffer()
 * as part of changing the identity of a buffer, instead of just dropping the
 * buffer.
 *
 * See also InvalidateBuffer().
 */

> 0018:
>
> Compiler now warns that buf_state is unused in GetLocalVictimBuffer().

Oops. Missed that because it was then removed in a later commit...

> FlushRelationBuffers() used to clear BM_JUST_DIRTIED, which it seems
> like wouldn't have been applicable to local buffers before but,
> actually, with async IO could perhaps happen in the future? Anyway,
> TerminateLocalBufferIO() doesn't clear that flag, so you should call
> that out if it was intentional.

I think it'd be good to start using BM_JUST_DIRTIED, even if just to make
the code between local and shared buffers more similar. But that's better
done separately.

I don't know why FlushRelationBuffers cleared it; it's never set at the
moment. I'll add a note to the commit message.

> Is it worth mentioning in the commit message that you made a cosmetic
> change to TerminateBufferIO()?

Doesn't really seem worth calling out, but if you think it should, I
will.

> 0020:
> This commit message is probably too thin. I think you need to at least
> say something about this being used by AIO in the future. Out of
> context of this patch set, it will be confusing.

Yep.

> I think you could use a comment about why nowait might be useful for
> local buffers in the future. It wouldn't make sense with synchronous
> I/O, so it feels a bit weird without any comment.

Hm, fair point. Another approach would be to defer adding the argument to
a later patch; it doesn't need to be added here.

> +	if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
> +	{
> +		/* someone else already did the I/O */
> +		UnlockBufHdr(bufHdr, buf_state);
> +		return false;
> +	}
>
> UnlockBufHdr() explicitly says it should not be called for local
> buffers. I know that code is unreachable right now, but it doesn't feel
> quite right. I'm not sure what the architecture of AIO local buffers
> will be like, but if other processes can't access these buffers, I
> don't know why you would need BM_LOCKED. And if you will, I think you
> need to edit the UnlockBufHdr() comment.

You are right, this is a bug in my change. I started with a copy of
StartBufferIO() and whittled it down insufficiently. Thanks for catching
that!

Wonder if we should add an assert against this to UnlockBufHdr()...

> I'm not sure it is worth the diff in the non-local buffer case to
> reflow this. It is already confusing enough in this patch that you are
> adding some code that is mostly unneeded.

Heh, you're right. I had to add a line break in StartLocalBufferIO() and
it looked wrong to have the two lines formatted differently :)

Thanks for the review!

Greetings,

Andres Freund
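For anyone else puzzled by that assignment: a macro that expands to an
array element (or any other lvalue) can stand on the left-hand side of an
assignment. A standalone illustration of the pattern, simplified from
(and not identical to) the localbuf.c definitions:

    #include <stdlib.h>

    static char *blocks[8];

    /* expands to blocks[(i)], an array element, hence an lvalue */
    #define SlotGetBlock(i) (blocks[(i)])

    static void
    example(void)
    {
        SlotGetBlock(0) = malloc(8192);   /* assigns into blocks[0] */
    }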
On Tue, Mar 11, 2025 at 1:56 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-03-11 11:31:18 -0400, Melanie Plageman wrote: > > Commit messages for 0017-0020 are thin. I assume you will beef them up > > a bit before committing. > > Yea. I wanted to get some feedback on whether these refactorings are a good > idea or not... I'd say yes, they seem like a good idea. > > Really, though, those matter much less than 0016 which is an actual bug (or > > pre-bug) fix. I called out the ones where I think you should really consider > > adding more detail to the commit message. > > > > 0016: > > Do you think we should backpatch that change? It's not really an active bug in > 16+, but it's also not quite right. The other changes surely shouldn't be > backpatched... I don't feel strongly about it. PinLocalBuffer() is passed with adjust_usagecount false and we have loads of other places where things would just not work if we changed the boolean flag passed in to a function called by it (bgwriter and SyncOneBuffer() with skip_recently_used comes to mind). On the other hand it's a straightforward fix that only needs to be backpatched a couple versions, so it definitely doesn't hurt. > > +++ b/src/backend/storage/buffer/localbuf.c > > @@ -56,6 +56,7 @@ static int NLocalPinnedBuffers = 0; > > static Buffer GetLocalVictimBuffer(void); > > +static void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced); > > > > Technically this line is too long > > Oh, do I love our line length limits. But, um, is it actually too long? It's > 78 chars, which is exactly our limit, I think? Teccchnically it's 79, which is why it showed up for me with this handy line from the committing wiki page git diff origin/master -- src/backend/storage/buffer/localbuf.c | grep -E '^(\+|diff)' | sed 's/^+//' | expand -t4 | awk "length > 78 || /^diff/" But anyway, it doesn't really matter. I only mentioned it because I noticed it visually looked long. > > + if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY)) > > + { > > + /* someone else already did the I/O */ > > + UnlockBufHdr(bufHdr, buf_state); > > + return false; > > + } > > > > UnlockBufHdr() explicitly says it should not be called for local > > buffers. I know that code is unreachable right now, but it doesn't > > feel quite right. I'm not sure what the architecture of AIO local > > buffers will be like, but if other processes can't access these > > buffers, I don't know why you would need BM_LOCKED. And if you will, I > > think you need to edit the UnlockBufHdr() comment. > > You are right, this is a bug in my change. I started with a copy of > StartBufferIO() and whittled it down insufficiently. Thanks for catching that! > > Wonder if we should add an assert against this to UnlockBufHdr()... Yea, I think that makes sense. - Melanie
On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote: > On 2024-09-16 07:43:49 -0700, Noah Misch wrote: > > For non-sync IO methods, I gather it's essential that a process other than the > > IO definer be scanning for incomplete IOs and completing them. > > Otherwise, deadlocks like this would happen: > > > backend1 locks blk1 for non-IO reasons > > backend2 locks blk2, starts AIO write > > backend1 waits for lock on blk2 for non-IO reasons > > backend2 waits for lock on blk1 for non-IO reasons > > > > If that's right, in worker mode, the IO worker resolves that deadlock. What > > resolves it under io_uring? Another process that happens to do > > pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to > > make that happen systematically. > > Yea, it's code that I haven't forward ported yet. I think basically > LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't > immediately acquire the lock and if the buffer has IO going on. I'm not finding that code in v2.6. What function has it? [I wrote a bunch of the subsequent comments against v2.5. I may have missed instances of v2.6 obsoleting them.] On Tue, Mar 04, 2025 at 02:00:14PM -0500, Andres Freund wrote: > Attached is v2.5 of the AIO patchset. > - Added a proper commit message for the main commit. I'd appreciate folks > reading through it. I'm sure I forgot a lot of folks and a lot of things. Commit message looks fine. > At this point I am not aware of anything significant left to do in the main > AIO commit, save some of the questions below. That is a big milestone. > Questions: > > - My current thinking is that we'd set io_method = worker initially - so we > actually get some coverage - and then decide whether to switch to > io_method=sync by default for 18 sometime around beta1/2. Does that sound > reasonable? Yes. > - We could reduce memory usage a tiny bit if we made the mapping between > pgproc and per-backend-aio-state more complicated, i.e. not just indexed by > ProcNumber. Right now IO workers have the per-backend AIO state, but don't > actually need it. I'm mildly inclined to think that the complexity isn't > worth it, but on the fence. The max memory savings, for 32 IO workers, is like the difference between max_connections=500 and max_connections=532, right? If that's right, I wouldn't bother in the foreseeable future. > - Three of the commits in the series really are just precursor commits to > their subsequent commits, which I found helpful for development and review, > namely: > > - aio: Basic subsystem initialization > - aio: Skeleton IO worker infrastructure > - aio: Add liburing dependency > > Not sure if it's worth keeping these separate or whether they should just be > merged with their "real commit". The split aided my review. It's trivial to turn an unmerged stack of commits into the merged equivalent, but unmerging is hard. > - Thomas suggested renaming > COMPLETED_IO->COMPLETED, > COMPLETED_SHARED->TERMINATED_BY_COMPLETER, > COMPLETED_LOCAL->TERMINATED_BY_SUBMITTER > in > https://www.postgresql.org/message-id/CA%2BhUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw%40mail.gmail.com > > While the other things in the email were commented upon by others and > addressed in v2.4, the naming aspect wasn't further remarked upon by others. > I'm not personally in love with the suggested names, but I could live with > them. I, too, could live with those. None of these naming proposals bother me, and I would not have raised the topic myself.
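For reference, a sketch of where those names sit in the handle state machine (enum layout assumed from the patch, earlier states elided):

typedef enum PgAioHandleState
{
	/* ... IDLE, HANDED_OUT, DEFINED, STAGED, SUBMITTED elided ... */

	PGAIO_HS_COMPLETED_IO,		/* proposed: COMPLETED */
	PGAIO_HS_COMPLETED_SHARED,	/* proposed: TERMINATED_BY_COMPLETER */
	PGAIO_HS_COMPLETED_LOCAL,	/* proposed: TERMINATED_BY_SUBMITTER */
} PgAioHandleState;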
If I were changing it further, I'd use these principles: - use COMPLETED or TERMINATED, not both - I like COMPLETED, because _complete_ works well in a function name. _terminate_ sounds more like an abnormal interruption. - If one state name lacks a suffix, it should be the final state. So probably one of: {COMPLETED,TERMINATED,FINISHED,REAPED,DONE}_{KERN,RETURN,RETVAL,ERRNO} {COMPLETED,TERMINATED,FINISHED,REAPED,DONE}_{SHMEM,SHARED} {COMPLETED,TERMINATED,FINISHED,REAPED,DONE}{_SUBMITTER,} If it were me picking today, I'd pick: COMPLETED_RETURN COMPLETED_SHMEM COMPLETED > - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging, > but all the ereport()s add a noticeable amount of overhead at high IO > throughput (at multiple gigabytes/second), so that's probably not right > forever. I'd leave this on initially and then change it to default to off > later. I think that's ok? Sure. Perhaps make it depend on USE_ASSERT_CHECKING later? > - To allow io_workers to be PGC_SIGHUP, and to eventually allow to > automatically in/decrease active workers, the max number of workers (32) is > always allocated. That means we use more semaphores than before. I think > that's ok, it's not 1995 anymore. Alternatively we can add a > "io_workers_max" GUC and probe for it in initdb. Let's start as you have it. If someone wants to make things perfect for non-root BSD users, they can add the GUC later. io_method=sync is a sufficient backup plan indefinitely. > - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not > sure that's great? > > They could be an enum array or such too? That'd perhaps be a bit more > extensible? OTOH, we don't currently use enums in the catalogs and arrays > are somewhat annoying to conjure up from C. An enum array does seem elegant and extensible, but it has the problems you say. (I would expect to lose time setting up pg_enum.oid values to not change between releases.) A possible compromise would be a text array like heap_tuple_infomask_flags() does. Overall, I'm not seeing a clear need to change away from the bool columns. > Todo: > - Figure out how to deduplicate support for LockBufferForCleanup() in > TerminateBufferIO(). Yes, I agree there's an opportunity for a WakePinCountWaiter() or similar subroutine. > - Check if documentation for track_io_timing needs to be adjusted, after the > bufmgr.c changes we only track waiting for an IO. Yes. On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote: > Attached is v2.6 of the AIO patchset. > - 0005, 0006 - io_uring support - close, but we need to do something about > set_max_fds(), which errors out spuriously in some cases What do we know about those cases? I don't see a set_max_fds(); is that set_max_safe_fds(), or something else? > - 0025 - tests for AIO - I think it's reasonable, unless somebody objects to > exporting a few bufmgr.c functions to the test I'll essentially never object to that. > + * AIO handles need be registered in critical sections and therefore > + * cannot use the normal ResoureElem mechanism. s/ResoureElem/ResourceElem/ > + <varlistentry id="guc-io-method" xreflabel="io_method"> > + <term><varname>io_method</varname> (<type>enum</type>) > + <indexterm> > + <primary><varname>io_method</varname> configuration parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + Selects the method for executing asynchronous I/O. 
> + Possible values are: > + <itemizedlist> > + <listitem> > + <para> > + <literal>sync</literal> (execute asynchronous I/O synchronously) The part in parentheses reads like a contradiction to me. How about phrasing it like one of these: (execute I/O synchronously, even I/O eligible for asynchronous execution) (execute asynchronous-eligible I/O synchronously) (execute I/O synchronously, even when asynchronous execution was feasible) > + * This could be in aio_internal.h, as it is not pubicly referenced, but typo -> publicly > + * On what is IO being performed. End sentence with question mark, probably. > + * List of in-flight IOs. Also contains IOs that aren't strict speaking s/strict/strictly/ > + /* > + * Start executing passed in IOs. > + * > + * Will not be called if ->needs_synchronous_execution() returned true. > + * > + * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE. > + * I recommend adding "Always called in a critical section." since at least pgaio_worker_submit() subtly needs it. > + */ > + int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios); > + * Each backend can only have one AIO handle that that has been "handed out" s/that that/that/ > + * AIO, it typically will pass the handle to smgr., which will pass it on to s/smgr.,/smgr.c,/ or just "smgr" > +PgAioHandle * > +pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret) > +{ > + if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE) > + { > + Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE); > + pgaio_submit_staged(); I'm seeing the "num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE" case uncovered in a check-world coverage report. I tried PGAIO_SUBMIT_BATCH_SIZE=2, io_max_concurrency=1, and io_max_concurrency=64. Do you already have a recipe for reaching this case? > +/* > + * Stage IO for execution and, if necessary, submit it immediately. > + * > + * Should only be called from pgaio_io_prep_*(). > + */ > +void > +pgaio_io_stage(PgAioHandle *ioh, PgAioOp op) > +{ We've got closely-associated verbs "prepare", "prep", and "stage". README.md doesn't mention "stage". Can one of the following two changes happen? - README.md starts mentioning "stage" and how it differs from the others - Code stops using "stage" > + * locallbacks just before reclaiming at multiple callsites. s/locallbacks/local callbacks/ > + * Check if the the referenced IO completed, without blocking. s/the the/the/ > + * Batch submission mode needs to explicitly ended with > + * pgaio_exit_batchmode(), but it is allowed to throw errors, in which case > + * error recovery will end the batch. This sentence needs some grammar help, I think. Maybe use: * End batch submission mode with pgaio_exit_batchmode(). (Throwing errors is * allowed; error recovery will end the batch.) > Size > AioShmemSize(void) > { > Size sz = 0; > > + /* > + * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT. > + * However, if the DBA explicitly set wal_buffers = -1 in the config file, s/wal_buffers/io_max_concurrency/ > +extern int io_workers; By the rule that GUC vars are PGDLLIMPORT, this should be PGDLLIMPORT. > +static void > +maybe_adjust_io_workers(void) This also restarts workers that exit, so perhaps name it start_io_workers_if_missing(). > +{ ... > + /* Try to launch one. */ > + child = StartChildProcess(B_IO_WORKER); > + if (child != NULL) > + { > + io_worker_children[id] = child; > + ++io_worker_count; > + } > + else > + break; /* XXX try again soon? 
*/ Can LaunchMissingBackgroundProcesses() become the sole caller of this function, replacing the current mix of callers? That would be more conducive to promptly doing the right thing after launch failure. > --- a/src/backend/utils/init/miscinit.c > +++ b/src/backend/utils/init/miscinit.c > @@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType) > case B_CHECKPOINTER: > backendDesc = gettext_noop("checkpointer"); > break; > + case B_IO_WORKER: > + backendDesc = "io worker"; Wrap in gettext_noop() like B_CHECKPOINTER does. > + Only has an effect if <xref linkend="guc-max-wal-senders"/> is set to > + <literal>worker</literal>. s/guc-max-wal-senders/guc-io-method/ > + * of IOs, wakeups "fan out"; each woken IO worker can wake two more. qXXX s/qXXX/XXX/ > + /* > + * It's very unlikely, but possible, that reopen fails. E.g. due > + * to memory allocations failing or file permissions changing or > + * such. In that case we need to fail the IO. > + * > + * There's not really a good errno we can report here. > + */ > + error_errno = ENOENT; Agreed there's not a good errno, but let's use a fake errno that we're mighty unlikely to confuse with an actual case of libc returning that errno. Like one of EBADF or EOWNERDEAD. > + for (int contextno = 0; contextno < TotalProcs; contextno++) > + { > + PgAioUringContext *context = &pgaio_uring_contexts[contextno]; > + int ret; > + > + /* > + * XXX: Probably worth sharing the WQ between the different rings, > + * when supported by the kernel. Could also cause additional > + * contention, I guess? > + */ > +#if 0 > + if (!AcquireExternalFD()) > + elog(ERROR, "No external FD available"); > +#endif Probably remove the "#if 0" or add a comment on why it's here. > + ret = io_uring_submit(uring_instance); > + pgstat_report_wait_end(); > + > + if (ret == -EINTR) > + { > + pgaio_debug(DEBUG3, > + "aio method uring: submit EINTR, nios: %d", > + num_staged_ios); > + } > + else if (ret < 0) > + elog(PANIC, "failed: %d/%s", > + ret, strerror(-ret)); I still think (see 2024-09-16 review) EAGAIN should do the documented recommendation instead of PANIC: EAGAIN The kernel was unable to allocate memory for the request, or otherwise ran out of resources to handle it. The application should wait for some completions and try again. At a minimum, it deserves a comment like "We accept PANIC on memory exhaustion here." > + pgstat_report_wait_end(); > + > + if (ret == -EINTR) > + { > + continue; > + } > + else if (ret != 0) > + { > + elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret)); I think errno isn't meaningful here, so %m doesn't belong. > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -2687,6 +2687,12 @@ include_dir 'conf.d' > <literal>worker</literal> (execute asynchronous I/O using worker processes) > </para> > </listitem> > + <listitem> > + <para> > + <literal>io_uring</literal> (execute asynchronous I/O using > + io_uring, if available) > + </para> > + </listitem> Docs should eventually cover RLIMIT_MEMLOCK per https://github.com/axboe/liburing "ulimit settings". Maybe RLIMIT_NOFILE, too. > @@ -2498,6 +2529,12 @@ FilePathName(File file) > int > FileGetRawDesc(File file) > { > + int returnCode; > + > + returnCode = FileAccess(file); > + if (returnCode < 0) > + return returnCode; > + > Assert(FileIsValid(file)); > return VfdCache[file].fd; > } What's the rationale for this function's change? 
> +The main reason to want to use Direct IO are: > +The main reason *not* to use Direct IO are: x2 s/main reason/main reasons/ > + and direct IO without O_DSYNC needs to issue a write and after the writes > + completion a cache cache flush, whereas O\_DIRECT + O\_DSYNC can use a s/writes/write's/ > + single FUA write). I recommend including the acronym expansion: s/FUA/Force Unit Access (FUA)/ > +In an `EXEC_BACKEND` build backends executable code and other process local s/backends/backends'/ > +state is not necessarily mapped to the same addresses in each process due to > +ASLR. This means that the shared memory cannot contain pointer to callbacks. s/pointer/pointers/ > +The "solution" to this the ability to associate multiple completion callbacks > +with a handle. E.g. bufmgr.c can have a callback to update the BufferDesc > +state and to verify the page and md.c. another callback to check if the IO > +operation was successful. One of these or similar: s/md.c. another/md.c can have another/ s/md.c. /md.c / I've got one high-level question that I felt could take too long to answer for myself by code reading. What's the cleanup story if process A does elog(FATAL) with unfinished I/O? Specifically: - Suppose some other process B reuses the shared memory AIO data structures that pertained to process A. After that, some process C completes the I/O in shmem. Do we avoid confusing B by storing local callback data meant for A in shared memory now pertaining to B? - Thinking more about this README paragraph: +In addition to completion, AIO callbacks also are called to "prepare" an +IO. This is, e.g., used to increase buffer reference counts to account for the +AIO subsystem referencing the buffer, which is required to handle the case +where the issuing backend errors out and releases its own pins while the IO is +still ongoing. Which function performs that reference count increase? I'm not finding it today. I wanted to look at how it ensures the issuing backend still exists as the function increases the reference count. One later-patch item: > +static PgAioResult > +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed) > +{ ... > + TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum, > + tag.blockNum, > + tag.spcOid, > + tag.dbOid, > + tag.relNumber, > + INVALID_PROC_NUMBER, > + false); I wondered about whether the buffer-read-done probe should happen in the process that calls the complete_shared callback or in the process that did the buffer-read-start probe. When I see dtrace examples, they usually involve explicitly naming each PID to trace. Assuming that's indeed the norm, I think the local callback would be the better place, so a given trace contains both probes. If it were reasonable to dtrace all current and future postmaster kids, that would argue for putting the probe in the complete_shared callback. Alternatively, one could argue for separate probes buffer-read-done-shmem and buffer-read-done. Thanks, nm
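To illustrate the EXEC_BACKEND/ASLR point from the README excerpt above: a hedged sketch of ID-based callback dispatch, with the enum values and table names assumed rather than taken verbatim from the patch. Shared memory stores only small integer IDs; each process resolves them through its own statically-initialized, address-space-local table, so no function pointers ever live in shared memory:

typedef enum PgAioHandleCallbackID
{
	PGAIO_HCB_INVALID = 0,
	PGAIO_HCB_MD_READV,
	PGAIO_HCB_SHARED_BUFFER_READV,
	PGAIO_HCB_LOCAL_BUFFER_READV,
} PgAioHandleCallbackID;

typedef struct PgAioHandleCallbacks
{
	void		(*stage) (PgAioHandle *ioh, uint8 cb_data);
	PgAioResult (*complete_shared) (PgAioHandle *ioh, PgAioResult prior,
									uint8 cb_data);
} PgAioHandleCallbacks;

/* callback structs defined in md.c/bufmgr.c; names here are assumptions */
extern const PgAioHandleCallbacks aio_md_readv_cb;
extern const PgAioHandleCallbacks aio_shared_buffer_readv_cb;
extern const PgAioHandleCallbacks aio_local_buffer_readv_cb;

/* per-process table, indexed by ID; only the IDs go into shared memory */
static const PgAioHandleCallbacks *const aio_handle_cbs[] = {
	[PGAIO_HCB_MD_READV] = &aio_md_readv_cb,
	[PGAIO_HCB_SHARED_BUFFER_READV] = &aio_shared_buffer_readv_cb,
	[PGAIO_HCB_LOCAL_BUFFER_READV] = &aio_local_buffer_readv_cb,
};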
Hi, On 2025-03-11 12:41:08 -0700, Noah Misch wrote: > On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote: > > On 2024-09-16 07:43:49 -0700, Noah Misch wrote: > > > For non-sync IO methods, I gather it's essential that a process other than the > > > IO definer be scanning for incomplete IOs and completing them. > > > > Otherwise, deadlocks like this would happen: > > > > > backend1 locks blk1 for non-IO reasons > > > backend2 locks blk2, starts AIO write > > > backend1 waits for lock on blk2 for non-IO reasons > > > backend2 waits for lock on blk1 for non-IO reasons > > > > > > If that's right, in worker mode, the IO worker resolves that deadlock. What > > > resolves it under io_uring? Another process that happens to do > > > pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to > > > make that happen systematically. > > > > Yea, it's code that I haven't forward ported yet. I think basically > > LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't > > immediately acquire the lock and if the buffer has IO going on. > > I'm not finding that code in v2.6. What function has it? My local version now has it... Sorry, I was focusing on the earlier patches until now. What do we want to do for ConditionalLockBufferForCleanup() (I don't think IsBufferCleanupOK() can matter)? I suspect we should also make it wait for the IO. See below: Not for 18, but for full write support, we'll also need logic to wait for IO in LockBuffer(BUFFER_LOCK_EXCLUSIVE) and answer the same question as for ConditionalLockBufferForCleanup() for ConditionalLockBuffer(). It's not an issue with the current level of write support in the stack of patches. But with v1 AIO, which had support for a lot more ways of doing asynchronous writes, it turned out that not handling it in ConditionalLockBuffer() triggers an endless loop. This can be kind-of-reproduced today by just making ConditionalLockBuffer() always return false - triggers a hang in the regression tests: spginsert() loops around spgdoinsert() until it succeeds. spgdoinsert() locks the child page with ConditionalLockBuffer() and gives up if it can't. That seems like rather bad code in spgist, because, even without AIO, it'll busy-loop until the buffer is unlocked. Which could take a while, given that it'll conflict even with a share locker and thus synchronous writes. Even if we fixed spgist, it seems rather likely that there's other code that wouldn't tolerate "spurious" failures. Which leads me to think that causing the IO to complete is probably the safest bet. Triggering IO completion never requires acquiring new locks that could participate in a deadlock, so it'd be safe. > > At this point I am not aware of anything significant left to do in the main > > AIO commit, save some of the questions below. > > That is a big milestone. Indeed! > > - We could reduce memory usage a tiny bit if we made the mapping between > > pgproc and per-backend-aio-state more complicated, i.e. not just indexed by > > ProcNumber. Right now IO workers have the per-backend AIO state, but don't > > actually need it. I'm mildly inclined to think that the complexity isn't > > worth it, but on the fence. > > The max memory savings, for 32 IO workers, is like the difference between > max_connections=500 and max_connections=532, right? Even less than that: Aux processes aren't always used as a multiplier in places where max_connections etc. are. E.g. max_locks_per_transaction is just multiplied by MaxBackends, not MaxBackends+NUM_AUXILIARY_PROCS.
> If that's right, I wouldn't bother in the foreseeable future. Cool. > > - Three of the commits in the series really are just precursor commits to > > their subsequent commits, which I found helpful for development and review, > > namely: > > > > - aio: Basic subsystem initialization > > - aio: Skeleton IO worker infrastructure > > - aio: Add liburing dependency > > > > Not sure if it's worth keeping these separate or whether they should just be > > merged with their "real commit". > > The split aided my review. It's trivial to turn an unmerged stack of commits > into the merged equivalent, but unmerging is hard. That's been the feedback so far, so I'll leave it split. > > - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging, > > but all the ereport()s add a noticeable amount of overhead at high IO > > throughput (at multiple gigabytes/second), so that's probably not right > > forever. I'd leave this on initially and then change it to default to off > > later. I think that's ok? > > Sure. Perhaps make it depend on USE_ASSERT_CHECKING later? Yea, that makes sense. > > - To allow io_workers to be PGC_SIGHUP, and to eventually allow to > > automatically in/decrease active workers, the max number of workers (32) is > > always allocated. That means we use more semaphores than before. I think > > that's ok, it's not 1995 anymore. Alternatively we can add a > > "io_workers_max" GUC and probe for it in initdb. > > Let's start as you have it. If someone wants to make things perfect for > non-root BSD users, they can add the GUC later. io_method=sync is a > sufficient backup plan indefinitely. Cool. I think we'll really need to do something about this for BSD users regardless of AIO. Or maybe those OSs should fix something, but somehow I am not having high hopes for an OS that claims to have POSIX-conforming unnamed semaphores due to having a syscall that always returns EPERM... [1]. > > - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not > > sure that's great? > > > > They could be an enum array or such too? That'd perhaps be a bit more > > extensible? OTOH, we don't currently use enums in the catalogs and arrays > > are somewhat annoying to conjure up from C. > > An enum array does seem elegant and extensible, but it has the problems you > say. (I would expect to lose time setting up pg_enum.oid values to not change > between releases.) A possible compromise would be a text array like > heap_tuple_infomask_flags() does. Overall, I'm not seeing a clear need to > change away from the bool columns. Yea, I think that's where I ended up too. If we get a dozen flags we can reconsider. > > Todo: > > > - Figure out how to deduplicate support for LockBufferForCleanup() in > > TerminateBufferIO(). > > Yes, I agree there's an opportunity for a WakePinCountWaiter() or similar > subroutine. Done. > > - Check if documentation for track_io_timing needs to be adjusted, after the > > bufmgr.c changes we only track waiting for an IO. > > Yes. The relevant sentences seem to be: - "Enables timing of database I/O calls." s/calls/waits/ - "Time spent in {read,write,writeback,extend,fsync} operations" s/in/waiting for/ Even though not all of these will use AIO, the "waiting for" formulation seems just as accurate. - "Columns tracking I/O time will only be non-zero when <xref linkend="guc-track-io-timing"/> is enabled." s/time/wait time/ > On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote: > > Attached is v2.6 of the AIO patchset.
> > > - 0005, 0006 - io_uring support - close, but we need to do something about > > set_max_fds(), which errors out spuriously in some cases > > What do we know about those cases? I don't see a set_max_fds(); is that > set_max_safe_fds(), or something else? Sorry, yes, set_max_safe_fds(). The problem basically is that with io_uring we will have a large number of FDs already allocated by the time set_max_safe_fds() is called. set_max_safe_fds() subtracts already_open from max_files_per_process allowing few, and even negative, IOs. I think we should redefine max_files_per_process to be about the number of files each *backend* will additionally open. Jelte was working on related patches, see [2] > > + * AIO handles need be registered in critical sections and therefore > > + * cannot use the normal ResoureElem mechanism. > > s/ResoureElem/ResourceElem/ Oops, fixed. > > + <varlistentry id="guc-io-method" xreflabel="io_method"> > > + <term><varname>io_method</varname> (<type>enum</type>) > > + <indexterm> > > + <primary><varname>io_method</varname> configuration parameter</primary> > > + </indexterm> > > + </term> > > + <listitem> > > + <para> > > + Selects the method for executing asynchronous I/O. > > + Possible values are: > > + <itemizedlist> > > + <listitem> > > + <para> > > + <literal>sync</literal> (execute asynchronous I/O synchronously) > > The part in parentheses reads like a contradiction to me. There's something to that... > How about phrasing it like one of these: > > (execute I/O synchronously, even I/O eligible for asynchronous execution) > (execute asynchronous-eligible I/O synchronously) > (execute I/O synchronously, even when asynchronous execution was feasible) I like the second one best, adopted. > [..] > End sentence with question mark, probably. > [..] > s/strict/strictly/ > [..] > I recommend adding "Always called in a critical section." since at least > pgaio_worker_submit() subtly needs it. > [..] > s/that that/that/ > [..] > s/smgr.,/smgr.c,/ or just "smgr" > [..] > s/locallbacks/local callbacks/ > [..] > s/the the/the/ All adopted. > > +PgAioHandle * > > +pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret) > > +{ > > + if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE) > > + { > > + Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE); > > + pgaio_submit_staged(); > > I'm seeing the "num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE" case uncovered in a > check-world coverage report. I tried PGAIO_SUBMIT_BATCH_SIZE=2, > io_max_concurrency=1, and io_max_concurrency=64. Do you already have a recipe > for reaching this case? With the default server settings it's hard to hit due to read_stream.c limiting how much IO it issues: 1) The default io_combine_limit=16 makes reads larger, reducing the queue depth, at least for sequential scans 2) The default shared_buffers/max_connections settings limit the number of buffers that can be pinned to 86, which will only allow a small number of IOs due to 86/io_combine_limit = ~5 3) The default effective_io_concurrency only allows one IO in flight Melanie has a patch to adjust effective_io_concurrency: https://www.postgresql.org/message-id/CAAKRu_Z4ekRbfTacYYVrvu9xRqS6G4DMbZSbN_1usaVtj%2Bbv2w%40mail.gmail.com If I increase shared_buffers and decrease io_combine_limit and put an elog(PANIC) in that branch, it's rather quickly hit. > > +/* > > + * Stage IO for execution and, if necessary, submit it immediately. > > + * > > + * Should only be called from pgaio_io_prep_*(). 
> > + */ > > +void > > +pgaio_io_stage(PgAioHandle *ioh, PgAioOp op) > > +{ > > We've got closely-associated verbs "prepare", "prep", and "stage". README.md > doesn't mention "stage". Can one of the following two changes happen? > > - README.md starts mentioning "stage" and how it differs from the others > - Code stops using "stage" I'll try to add something to README.md. To me the sequence is prepare->stage. > > + * Batch submission mode needs to explicitly ended with > > + * pgaio_exit_batchmode(), but it is allowed to throw errors, in which case > > + * error recovery will end the batch. > > This sentence needs some grammar help, I think. Indeed. > Maybe use: > > * End batch submission mode with pgaio_exit_batchmode(). (Throwing errors is > * allowed; error recovery will end the batch.) I like it. > > Size > > AioShmemSize(void) > > { > > Size sz = 0; > > > > + /* > > + * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT. > > + * However, if the DBA explicitly set wal_buffers = -1 in the config file, > > s/wal_buffers/io_max_concurrency/ Ooops. > > +extern int io_workers; > > By the rule that GUC vars are PGDLLIMPORT, this should be PGDLLIMPORT. Indeed. I wish we had something finding violations of this automatically... > > +static void > > +maybe_adjust_io_workers(void) > > This also restarts workers that exit, so perhaps name it > start_io_workers_if_missing(). But it also stops IO workers if necessary? > > +{ > ... > > + /* Try to launch one. */ > > + child = StartChildProcess(B_IO_WORKER); > > + if (child != NULL) > > + { > > + io_worker_children[id] = child; > > + ++io_worker_count; > > + } > > + else > > + break; /* XXX try again soon? */ > > Can LaunchMissingBackgroundProcesses() become the sole caller of this > function, replacing the current mix of callers? That would be more conducive > to promptly doing the right thing after launch failure. I'm not sure that'd be a good idea - right now IO workers are started before the startup process, as the startup process might need to perform IO. If we started it only later in ServerLoop() we'd potentially do a fair bit of work, including starting checkpointer, bgwriter, bgworkers before we started IO workers. That shouldn't actively break anything, but it would likely make things slower. I rather dislike the code around when we start what. Leaving AIO aside, during a normal startup we start checkpointer, bgwriter before the startup process. But during a crash restart we don't explicitly start them. Why make things uniform when it could also be exciting :) > > --- a/src/backend/utils/init/miscinit.c > > +++ b/src/backend/utils/init/miscinit.c > > @@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType) > > case B_CHECKPOINTER: > > backendDesc = gettext_noop("checkpointer"); > > break; > > + case B_IO_WORKER: > > + backendDesc = "io worker"; > > Wrap in gettext_noop() like B_CHECKPOINTER does. > > > + Only has an effect if <xref linkend="guc-max-wal-senders"/> is set to > > + <literal>worker</literal>. > > s/guc-max-wal-senders/guc-io-method/ > > > + * of IOs, wakeups "fan out"; each woken IO worker can wake two more. qXXX > > s/qXXX/XXX/ All fixed. > > + /* > > + * It's very unlikely, but possible, that reopen fails. E.g. due > > + * to memory allocations failing or file permissions changing or > > + * such. In that case we need to fail the IO. > > + * > > + * There's not really a good errno we can report here.
> > + */ > > + error_errno = ENOENT; > > Agreed there's not a good errno, but let's use a fake errno that we're mighty > unlikely to confuse with an actual case of libc returning that errno. Like > one of EBADF or EOWNERDEAD. Can we rely on that to be present on all platforms, including Windows? > > + for (int contextno = 0; contextno < TotalProcs; contextno++) > > + { > > + PgAioUringContext *context = &pgaio_uring_contexts[contextno]; > > + int ret; > > + > > + /* > > + * XXX: Probably worth sharing the WQ between the different rings, > > + * when supported by the kernel. Could also cause additional > > + * contention, I guess? > > + */ > > +#if 0 > > + if (!AcquireExternalFD()) > > + elog(ERROR, "No external FD available"); > > +#endif > > Probably remove the "#if 0" or add a comment on why it's here. Will do. It was an attempt at dealing with the set_max_safe_fds() issue above, but it turned out to not work at all, given how fd.c currently works. > > + ret = io_uring_submit(uring_instance); > > + pgstat_report_wait_end(); > > + > > + if (ret == -EINTR) > > + { > > + pgaio_debug(DEBUG3, > > + "aio method uring: submit EINTR, nios: %d", > > + num_staged_ios); > > + } > > + else if (ret < 0) > > + elog(PANIC, "failed: %d/%s", > > + ret, strerror(-ret)); > > I still think (see 2024-09-16 review) EAGAIN should do the documented > recommendation instead of PANIC: > > EAGAIN The kernel was unable to allocate memory for the request, or > otherwise ran out of resources to handle it. The application should wait for > some completions and try again. I don't think this can be hit in a recoverable way. We'd likely just end up with an untested path that quite possibly would be wrong. What wait time would be appropriate? What problems would it cause if we just slept while holding critical lwlocks? I think it'd typically just delay the crash-restart if we did, making it harder to recover from the problem. Because we are careful to limit how many outstanding IO requests there are on an io_uring instance, the kernel has to have run *severely* out of memory to hit this. I suspect it might currently be *impossible* to hit this due to ENOMEM, because io_uring will fall back to allocating individual requests if the batch allocation it normally does fails. My understanding is that for small allocations the kernel will try to reclaim memory forever, only large ones can fail. Even if it were possible to hit, the likelihood that postgres can continue to work ok if the kernel can't allocate ~250 bytes seems very low. How about adding a dedicated error message for EAGAIN? IMO io_uring_enter()'s meaning of EAGAIN is, uhm, unconventional, so a better error message than strerror() might be good? Proposed comment: /* * The io_uring_enter() manpage suggests that the appropriate * reaction to EAGAIN is: * * "The application should wait for some completions and try * again" * * However, it seems unlikely that that would help in our case, as * we apply a low limit to the number of outstanding IOs and thus * also outstanding completions, making it unlikely that we'd get * EAGAIN while the OS is in good working order. * * Additionally, it would be problematic to just wait here, our * caller might hold critical locks. It'd possibly lead to * delaying the crash-restart that seems likely to occur when the * kernel is under such heavy memory pressure.
*/ > > + pgstat_report_wait_end(); > > + > > + if (ret == -EINTR) > > + { > > + continue; > > + } > > + else if (ret != 0) > > + { > > + elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret)); > > I think errno isn't meaningful here, so %m doesn't belong. You're right. I wonder if we should make errno meaningful though (by setting it), the elog.c machinery captures it and I know that there are logging hooks that utilize that fact. That'd also avoid the need to use strerror() here. > > --- a/doc/src/sgml/config.sgml > > +++ b/doc/src/sgml/config.sgml > > @@ -2687,6 +2687,12 @@ include_dir 'conf.d' > > <literal>worker</literal> (execute asynchronous I/O using worker processes) > > </para> > > </listitem> > > + <listitem> > > + <para> > > + <literal>io_uring</literal> (execute asynchronous I/O using > > + io_uring, if available) > > + </para> > > + </listitem> > > Docs should eventually cover RLIMIT_MEMLOCK per > https://github.com/axboe/liburing "ulimit settings". The way we currently use io_uring (i.e. no registered buffers), the RLIMIT_MEMLOCK advice only applies to linux <= 5.11. I'm not sure that's worth documenting? > Maybe RLIMIT_NOFILE, too. Yea, we probably need to. Depends a bit on where we go with [2] though. > > > @@ -2498,6 +2529,12 @@ FilePathName(File file) > > int > > FileGetRawDesc(File file) > > { > > + int returnCode; > > + > > + returnCode = FileAccess(file); > > + if (returnCode < 0) > > + return returnCode; > > + > > Assert(FileIsValid(file)); > > return VfdCache[file].fd; > > } > > What's the rationale for this function's change? It flatly didn't work before. I guess I can make that a separate commit. > > +The main reason to want to use Direct IO are: > > > +The main reason *not* to use Direct IO are: > > x2 s/main reason/main reasons/ > > > + and direct IO without O_DSYNC needs to issue a write and after the writes > > + completion a cache cache flush, whereas O\_DIRECT + O\_DSYNC can use a > > s/writes/write's/ > > > + single FUA write). > > I recommend including the acronym expansion: s/FUA/Force Unit Access (FUA)/ > > > +In an `EXEC_BACKEND` build backends executable code and other process local > > s/backends/backends'/ > > > +state is not necessarily mapped to the same addresses in each process due to > > +ASLR. This means that the shared memory cannot contain pointer to callbacks. > > s/pointer/pointers/ > > > +The "solution" to this the ability to associate multiple completion callbacks > > +with a handle. E.g. bufmgr.c can have a callback to update the BufferDesc > > +state and to verify the page and md.c. another callback to check if the IO > > +operation was successful. > > One of these or similar: > s/md.c. another/md.c can have another/ > s/md.c. /md.c / All applied. > I've got one high-level question that I felt could take too long to answer for > myself by code reading. What's the cleanup story if process A does > elog(FATAL) with unfinished I/O? Specifically: It's a good question. Luckily there's a relatively easy answer: pgaio_shutdown() is registered via before_shmem_exit() in pgaio_init_backend() and pgaio_shutdown() waits for all IOs to finish. The main reason this exists is that the AIO mechanisms in various OSs, at least in some OS versions, don't like it if the issuing process exits while the IO is in flight. IIRC that was the case in v1 with posix_aio (which we don't support in v2, and probably should never use) and I think also with io_uring in some kernel versions.
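A minimal sketch of that hookup (the callback signature is ipc.h's pg_on_exit_callback; the waiting logic itself is only summarized in a comment):

static void
pgaio_shutdown(int code, Datum arg)
{
	/*
	 * Wait here for all IOs this backend has in flight to complete; some
	 * kernel AIO implementations dislike the submitting process exiting
	 * while its IO is still running.
	 */
}

void
pgaio_init_backend(void)
{
	/* ... attach per-backend AIO state ... */

	/* runs before the backend detaches from shared memory at exit */
	before_shmem_exit(pgaio_shutdown, 0);
}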
Another reason is that those requests would show up in pg_aios (or whatever we end up naming it) until they're reused, which doesn't seem great. > - Suppose some other process B reuses the shared memory AIO data structures > that pertained to process A. After that, some process C completes the I/O > in shmem. Do we avoid confusing B by storing local callback data meant for > A in shared memory now pertaining to B? This will, before pgaio_shutdown() gets involved, also be prevented by local callbacks being cleared by resowner cleanup. We take care that that resowner cleanup happens before process exit. That's important, because the backend local pointer could be invalidated by an ERROR. > - Thinking more about this README paragraph: > > +In addition to completion, AIO callbacks also are called to "prepare" an > +IO. This is, e.g., used to increase buffer reference counts to account for the > +AIO subsystem referencing the buffer, which is required to handle the case > +where the issuing backend errors out and releases its own pins while the IO is > +still ongoing. > > Which function performs that reference count increase? I'm not finding it > today. Ugh, I just renamed the relevant functions in my local branch, while trying to reduce the code duplication between shared and local buffers ;). In <= v2.6 it's shared_buffer_stage_common() and local_buffer_readv_stage(). In v2.7-to-be it is buffer_stage_common(), which now supports both shared and local buffers. > I wanted to look at how it ensures the issuing backend still exists as the > function increases the reference count. The reference count is increased solely in the BufferDesc, *not* in the backend-local pin tracking. Earlier I had tracked the pin in BufferDesc for shared buffers (as the pin needs to be released upon completion, which might be in another backend), but in LocalRefCount[] for temp buffers. But that turned out to not work when the backend errors out, as it would make CheckForLocalBufferLeaks() complain. > > One later-patch item: > > > +static PgAioResult > > +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed) > > +{ > ... > > + TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum, > > + tag.blockNum, > > + tag.spcOid, > > + tag.dbOid, > > + tag.relNumber, > > + INVALID_PROC_NUMBER, > > + false); > > I wondered about whether the buffer-read-done probe should happen in the > process that calls the complete_shared callback or in the process that did the > buffer-read-start probe. Yea, that's a good point. I should at least have added a comment pointing out that it's a choice with pros and cons. The reason I went for doing it in the completion callback is that it seemed better to get the READ_DONE event as soon as possible, even if the issuer of the IO is currently busy doing other things. The shared completion callback is after all where the buffer state is updated for shared buffers. But I think you have a point too. > When I see dtrace examples, they usually involve explicitly naming each PID > to trace TBH, i've only ever used our tracepoints via perf and bpftrace, not dtrace itself. For those it's easy to trace more than just a single pid and to monitor system-wide. I don't really know enough about using dtrace itself. > Assuming that's indeed the norm, I think the local callback would > be the better place, so a given trace contains both probes.
Seems like a shame to add an extra indirect function call for a tracing feature that afaict approximately nobody ever uses (IIRC we several times have passed wrong things to tracepoints without that being noticed). TBH, the tracepoints are so poorly documented and maintained that I was tempted to suggest removing them a couple times. This was an awesome review, thanks! Andres Freund [1] https://man.openbsd.org/sem_init.3#STANDARDS [2] https://postgr.es/m/D80MHNSG4EET.6MSV5G9P130F%40jeltef.nl
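To restate the pin bookkeeping above as code — a hypothetical helper, not the patch's buffer_stage_common(): for a shared buffer, only the refcount in the buffer header is bumped, while PrivateRefCount is left alone (temp buffers analogously skip LocalRefCount[]):

static void
aio_pin_shared_buffer(BufferDesc *buf_hdr)
{
	uint32		buf_state = LockBufHdr(buf_hdr);

	/*
	 * This pin is owned by the AIO subsystem, not the issuing backend, so
	 * the issuer erroring out and releasing its own pins cannot drop it,
	 * and the pin-leak checks at backend exit have nothing to complain
	 * about.
	 */
	buf_state += BUF_REFCOUNT_ONE;
	UnlockBufHdr(buf_hdr, buf_state);
}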
On Tue, Mar 11, 2025 at 07:55:35PM -0400, Andres Freund wrote: > On 2025-03-11 12:41:08 -0700, Noah Misch wrote: > > On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote: > > > On 2024-09-16 07:43:49 -0700, Noah Misch wrote: > What do we want to do for ConditionalLockBufferForCleanup() (I don't think > IsBufferCleanupOK() can matter)? I suspect we should also make it wait for > the IO. See below: I agree IsBufferCleanupOK() can't matter. It asserts that the caller already holds the exclusive buffer lock, and it doesn't loop or wait. > [...] leads me to think that causing > the IO to complete is probably the safest bet. Triggering IO completion never > requires acquiring new locks that could participate in a deadlock, so it'd be > safe. Yes, that decision looks right to me. I scanned the callers, and none of them clearly prefers a different choice. If we someday find one caller prefers a false return over blocking on I/O completion, we can always introduce a new ConditionalLock* variant for that. > > > - To allow io_workers to be PGC_SIGHUP, and to eventually allow to > > > automatically in/decrease active workers, the max number of workers (32) is > > > always allocated. That means we use more semaphores than before. I think > > > that's ok, it's not 1995 anymore. Alternatively we can add a > > > "io_workers_max" GUC and probe for it in initdb. > > > > Let's start as you have it. If someone wants to make things perfect for > > non-root BSD users, they can add the GUC later. io_method=sync is a > > sufficient backup plan indefinitely. > > Cool. > > I think we'll really need to do something about this for BSD users regardless > of AIO. Or maybe those OSs should fix something, but somehow I am not having > high hopes for an OS that claims to have POSIX confirming unnamed semaphores > due to having a syscall that always returns EPERM... [1]. I won't mind a project making things better for non-root BSD users. I do think such a project should not block other projects making things better for everything else (like $SUBJECT). > > > - Check if documentation for track_io_timing needs to be adjusted, after the > > > bufmgr.c changes we only track waiting for an IO. > > > > Yes. > > The relevant sentences seem to be: > > - "Enables timing of database I/O calls." > > s/calls/waits/ > > - "Time spent in {read,write,writeback,extend,fsync} operations" > > s/in/waiting for/ > > Even though not all of these will use AIO, the "waiting for" formulation > seems just as accurate. > > - "Columns tracking I/O time will only be non-zero when <xref > linkend="guc-track-io-timing"/> is enabled." > > s/time/wait time/ Sounds good. > > On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote: > > > Attached is v2.6 of the AIO patchset. > > > > > - 0005, 0006 - io_uring support - close, but we need to do something about > > > set_max_fds(), which errors out spuriously in some cases > > > > What do we know about those cases? I don't see a set_max_fds(); is that > > set_max_safe_fds(), or something else? > > Sorry, yes, set_max_safe_fds(). The problem basically is that with io_uring we > will have a large number of FDs already allocated by the time > set_max_safe_fds() is called. set_max_safe_fds() subtracts already_open from > max_files_per_process allowing few, and even negative, IOs. > > I think we should redefine max_files_per_process to be about the number of > files each *backend* will additionally open. Jelte was working on related > patches, see [2] Got it. 
max_files_per_process is a quaint setting, documented as follows (I needed the reminder): If the kernel is enforcing a safe per-process limit, you don't need to worry about this setting. But on some platforms (notably, most BSD systems), the kernel will allow individual processes to open many more files than the system can actually support if many processes all try to open that many files. If you find yourself seeing <quote>Too many open files</quote> failures, try reducing this setting. I could live with v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but would lean against it since it feels unduly novel to have a setting where we use the postgresql.conf value to calculate a value that becomes the new SHOW-value of the same setting. Options I'd consider before that: - Like you say, "redefine max_files_per_process to be about the number of files each *backend* will additionally open". It will become normal that each backend's actual FD list length is max_files_per_process + MaxBackends if io_method=io_uring. Outcome is not unlike v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch + v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't mutate max_files_per_process. Benchmark results should not change beyond the inter-major-version noise level unless one sets io_method=io_uring. I'm feeling best about this one, but I've not been thinking about it long. - When building with io_uring, make the max_files_per_process default something like 10000 instead. Disadvantages: FD usage grows even if you don't use io_uring. Merely rebuilding with io_uring (not enabling it at runtime) will change benchmark results. High MaxBackends still needs to override the value. > > > +static void > > > +maybe_adjust_io_workers(void) > > > > This also restarts workers that exit, so perhaps name it > > start_io_workers_if_missing(). > > But it also stops IO workers if necessary? Good point. Maybe just add a comment like "start or stop IO workers to close the gap between the running count and the configured count intent". > > > +{ > > ... > > > + /* Try to launch one. */ > > > + child = StartChildProcess(B_IO_WORKER); > > > + if (child != NULL) > > > + { > > > + io_worker_children[id] = child; > > > + ++io_worker_count; > > > + } > > > + else > > > + break; /* XXX try again soon? */ > > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this > > function, replacing the current mix of callers? That would be more conducive > > to promptly doing the right thing after launch failure. > > I'm not sure that'd be a good idea - right now IO workers are started before > the startup process, as the startup process might need to perform IO. If we > started it only later in ServerLoop() we'd potentially do a fair bit of work, > including starting checkpointer, bgwriter, bgworkers before we started IO > workers. That shouldn't actively break anything, but it would likely make > things slower. I missed that. How about keeping the two calls associated with PM_STARTUP but replacing the assign_io_workers() and process_pm_child_exit() calls with one in LaunchMissingBackgroundProcesses()? In the event of a launch failure, I think that would retry the launch quickly, as opposed to maybe-never. > I rather dislike the code around when we start what. Leaving AIO aside, during > a normal startup we start checkpointer, bgwriter before the startup > process. But during a crash restart we don't explicitly start them. 
Why make > things uniform when it could also be exciting :) It's become some artisanal code! :) > > > + /* > > > + * It's very unlikely, but possible, that reopen fails. E.g. due > > > + * to memory allocations failing or file permissions changing or > > > + * such. In that case we need to fail the IO. > > > + * > > > + * There's not really a good errno we can report here. > > > + */ > > > + error_errno = ENOENT; > > > > Agreed there's not a good errno, but let's use a fake errno that we're mighty > > unlikely to confuse with an actual case of libc returning that errno. Like > > one of EBADF or EOWNERDEAD. > > Can we rely on that to be present on all platforms, including Windows? I expect EBADF is universal. EBADF would be fine. EOWNERDEAD is from 2006. https://learn.microsoft.com/en-us/cpp/c-runtime-library/errno-constants?view=msvc-140 says VS2015 had EOWNERDEAD (the page doesn't have links for older Visual Studio versions, so I consider them unknown). https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi lists some OSs not having it, the newest of which looks like NetBSD 9.3 (2022). We could use it and add a #define for platforms lacking it. > > > + ret = io_uring_submit(uring_instance); > > > + pgstat_report_wait_end(); > > > + > > > + if (ret == -EINTR) > > > + { > > > + pgaio_debug(DEBUG3, > > > + "aio method uring: submit EINTR, nios: %d", > > > + num_staged_ios); > > > + } > > > + else if (ret < 0) > > > + elog(PANIC, "failed: %d/%s", > > > + ret, strerror(-ret)); > > > > I still think (see 2024-09-16 review) EAGAIN should do the documented > > recommendation instead of PANIC: > > > > EAGAIN The kernel was unable to allocate memory for the request, or > > otherwise ran out of resources to handle it. The application should wait for > > some completions and try again. > > I don't think this can be hit in a recoverable way. We'd likely just end up > with an untested path that quite possibly would be wrong. > > What wait time would be appropriate? What problems would it cause if we just > slept while holding critical lwlocks? I think it'd typically just delay the > crash-restart if we did, making it harder to recover from the problem. I might use 30s like pgwin32_open_handle(), but 30s wouldn't be principled. > Because we are careful to limit how many outstanding IO requests there are on > an io_uring instance, the kernel has to have run *severely* out of memory to > hit this. > > I suspect it might currently be *impossible* to hit this due to ENOMEM, > because io_uring will fall back to allocating individual requests if the batch > allocation it normally does fails. My understanding is that for small > allocations the kernel will try to reclaim memory forever, only large ones can > fail. > > Even if it were possible to hit, the likelihood that postgres can continue to > work ok if the kernel can't allocate ~250 bytes seems very low. > > How about adding a dedicated error message for EAGAIN? IMO io_uring_enter()'s > meaning of EAGAIN is, uhm, unconventional, so a better error message than > strerror() might be good? I'm fine with the present error message.
> Proposed comment: > /* > * The io_uring_enter() manpage suggests that the appropriate > * reaction to EAGAIN is: > * > * "The application should wait for some completions and try > * again" > * > * However, it seems unlikely that that would help in our case, as > * we apply a low limit to the number of outstanding IOs and thus > * also outstanding completions, making it unlikely that we'd get > * EAGAIN while the OS is in good working order. > * > * Additionally, it would be problematic to just wait here, our > * caller might hold critical locks. It'd possibly lead to > * delaying the crash-restart that seems likely to occur when the > * kernel is under such heavy memory pressure. > */ That works for me. No retry needed, then. > > > + pgstat_report_wait_end(); > > > + > > > + if (ret == -EINTR) > > > + { > > > + continue; > > > + } > > > + else if (ret != 0) > > > + { > > > + elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret)); > > > > I think errno isn't meaningful here, so %m doesn't belong. > > You're right. I wonder if we should make errno meaningful though (by setting > it), the elog.c machinery captures it and I know that there are logging hooks > that utilize that fact. That'd also avoid the need to use strerror() here. That's better still. > > > --- a/doc/src/sgml/config.sgml > > > +++ b/doc/src/sgml/config.sgml > > > @@ -2687,6 +2687,12 @@ include_dir 'conf.d' > > > <literal>worker</literal> (execute asynchronous I/O using worker processes) > > > </para> > > > </listitem> > > > + <listitem> > > > + <para> > > > + <literal>io_uring</literal> (execute asynchronous I/O using > > > + io_uring, if available) > > > + </para> > > > + </listitem> > > > > Docs should eventually cover RLIMIT_MEMLOCK per > > https://github.com/axboe/liburing "ulimit settings". > > The way we currently use io_uring (i.e. no registered buffers), the > RLIMIT_MEMLOCK advice only applies to linux <= 5.11. I'm not sure that's > worth documenting? Kernel 5.11 will be 5.5 years old by the time v18 is out. Yeah, no need for doc coverage of that. > > One later-patch item: > > > > > +static PgAioResult > > > +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed) > > > +{ > > ... > > > + TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum, > > > + tag.blockNum, > > > + tag.spcOid, > > > + tag.dbOid, > > > + tag.relNumber, > > > + INVALID_PROC_NUMBER, > > > + false); > > > > I wondered about whether the buffer-read-done probe should happen in the > > process that calls the complete_shared callback or in the process that did the > > buffer-read-start probe. > > Yea, that's a good point. I should at least have added a comment pointing out > that it's a choice with pros and cons. > > The reason I went for doing it in the completion callback is that it seemed > better to get the READ_DONE event as soon as possible, even if the issuer of > the IO is currently busy doing other things. The shared completion callback is > after all where the buffer state is updated for shared buffers. > > But I think you have a point too. > > > > When I see dtrace examples, they usually involve explicitly naming each PID > > to trace > > TBH, i've only ever used our tracepoints via perf and bpftrace, not dtrace > itself. For those it's easy to trace more than just a single pid and to > monitor system-wide. I don't really know enough about using dtrace itself. Perhaps just a comment, then. 
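Sketching the errno idea agreed on just above, against the quoted fragment (the message wording is made up; ret carries a negated errno, io_uring style):

		if (ret == -EINTR)
			continue;
		else if (ret != 0)
		{
			/* let %m and any emit_log_hook see the kernel error */
			errno = -ret;
			elog(PANIC, "io_uring wait failed: %m");
		}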
> > Assuming that's indeed the norm, I think the local callback would > > be the better place, so a given trace contains both probes. > > Seems like a shame to add an extra indirect function call Yep. > This was an awesome review, thanks! Glad it helped. > [1] https://man.openbsd.org/sem_init.3#STANDARDS > [2] https://postgr.es/m/D80MHNSG4EET.6MSV5G9P130F%40jeltef.nl
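To make the errno idea above concrete, here is a minimal sketch, not taken from the patch: the wait call and the message are stand-ins, but the point is negating liburing's return value into errno so that elog.c and logging hooks capture it and %m renders the error without a separate strerror() call.

	while (true)
	{
		struct io_uring_cqe *cqe;
		int			ret;

		ret = io_uring_wait_cqe(uring_instance, &cqe);	/* returns -errno on failure */

		if (ret == -EINTR)
			continue;			/* interrupted, just retry the wait */
		else if (ret != 0)
		{
			/* make errno meaningful, so elog.c and hooks capture it */
			errno = -ret;
			elog(PANIC, "io_uring wait failed: %m");
		}

		break;
	}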
Hi, On 2025-03-11 19:55:35 -0400, Andres Freund wrote: > On 2025-03-11 12:41:08 -0700, Noah Misch wrote: > > On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote: > > > On 2024-09-16 07:43:49 -0700, Noah Misch wrote: > > > > For non-sync IO methods, I gather it's essential that a process other than the > > > > IO definer be scanning for incomplete IOs and completing them. > > > > > > Otherwise, deadlocks like this would happen: > > > > > > > backend1 locks blk1 for non-IO reasons > > > > backend2 locks blk2, starts AIO write > > > > backend1 waits for lock on blk2 for non-IO reasons > > > > backend2 waits for lock on blk1 for non-IO reasons > > > > > > > > If that's right, in worker mode, the IO worker resolves that deadlock. What > > > > resolves it under io_uring? Another process that happens to do > > > > pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to > > > > make that happen systematically. > > > > > > Yea, it's code that I haven't forward ported yet. I think basically > > > LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't > > > immediately acquire the lock and if the buffer has IO going on. > > > > I'm not finding that code in v2.6. What function has it? > > My local version now has it... Sorry, I was focusing on the earlier patches > until now. Looking more at my draft, I don't think it was race-free. I had a race-free way of doing it in the v1 patch (by making lwlocks extensible, so the check for IO could happen between enqueueing on the lwlock wait queue and sleeping on the semaphore), but that obviously requires that infrastructure. I want to focus on reads for now, so I'll add FIXMEs to the relevant places in the patch to support AIO writes and focus on the rest of the patch for now. Greetings, Andres Freund
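For readers following along, the shape being discussed is roughly the following, using the wait-reference helpers as they are named later in the thread. As noted above, this ordering is not race-free: an IO can be started between the check and the moment the backend actually sleeps on the lock, which is why the v1 approach hooked into the lwlock wait queue instead.

	/* e.g. in LockBuffer(), when the content lock can't be taken immediately */
	if (!LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf_hdr), mode))
	{
		PgAioWaitRef iow = buf_hdr->io_wref;

		/* complete any AIO still in flight on this buffer before blocking */
		if (pgaio_wref_valid(&iow))
			pgaio_wref_wait(&iow);

		LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), mode);
	}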
Hi, On 2025-03-11 20:57:43 -0700, Noah Misch wrote: > > I think we'll really need to do something about this for BSD users regardless > > of AIO. Or maybe those OSs should fix something, but somehow I am not having > > high hopes for an OS that claims to have POSIX conforming unnamed semaphores > > due to having a syscall that always returns EPERM... [1]. > > I won't mind a project making things better for non-root BSD users. I do > think such a project should not block other projects making things better for > everything else (like $SUBJECT). Oh, I strongly agree. The main reason I would like it to be addressed is that I'm pretty tired of having to think about open/netbsd whenever we update some default setting. > > > On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote: > > > > Attached is v2.6 of the AIO patchset. > > > > > > > - 0005, 0006 - io_uring support - close, but we need to do something about > > > > set_max_fds(), which errors out spuriously in some cases > > > > > > What do we know about those cases? I don't see a set_max_fds(); is that > > > set_max_safe_fds(), or something else? > > > > Sorry, yes, set_max_safe_fds(). The problem basically is that with io_uring we > > will have a large number of FDs already allocated by the time > > set_max_safe_fds() is called. set_max_safe_fds() subtracts already_open from > > max_files_per_process allowing few, and even negative, IOs. > > > > I think we should redefine max_files_per_process to be about the number of > > files each *backend* will additionally open. Jelte was working on related > > patches, see [2] > > Got it. max_files_per_process is a quaint setting, documented as follows (I > needed the reminder): > > If the kernel is enforcing > a safe per-process limit, you don't need to worry about this setting. > But on some platforms (notably, most BSD systems), the kernel will > allow individual processes to open many more files than the system > can actually support if many processes all try to open > that many files. If you find yourself seeing <quote>Too many open > files</quote> failures, try reducing this setting. > > I could live with > v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but would lean > against it since it feels unduly novel to have a setting where we use the > postgresql.conf value to calculate a value that becomes the new SHOW-value of > the same setting. I think we may update some other GUCs, but not sure. > Options I'd consider before that: > - Like you say, "redefine max_files_per_process to be about the number of > files each *backend* will additionally open". It will become normal that > each backend's actual FD list length is max_files_per_process + MaxBackends > if io_method=io_uring. Outcome is not unlike > v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch + > v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't > mutate max_files_per_process. Benchmark results should not change beyond > the inter-major-version noise level unless one sets io_method=io_uring. I'm > feeling best about this one, but I've not been thinking about it long. Yea, I think that's something probably worth doing separately from Jelte's patch. I do think that it'd be rather helpful to have Jelte's patch to increase NOFILE in addition though. > > > > +static void > > > > +maybe_adjust_io_workers(void) > > > > > > This also restarts workers that exit, so perhaps name it > > > start_io_workers_if_missing(). > > > > But it also stops IO workers if necessary? > > Good point. 
Maybe just add a comment like "start or stop IO workers to close > the gap between the running count and the configured count intent". It's now /* * Start or stop IO workers, to close the gap between the number of running * workers and the number of configured workers. Used to respond to change of * the io_workers GUC (by increasing and decreasing the number of workers), as * well as workers terminating in response to errors (by starting * "replacement" workers). */ > > > > +{ > > > ... > > > > + /* Try to launch one. */ > > > > + child = StartChildProcess(B_IO_WORKER); > > > > + if (child != NULL) > > > > + { > > > > + io_worker_children[id] = child; > > > > + ++io_worker_count; > > > > + } > > > > + else > > > > + break; /* XXX try again soon? */ > > > > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this > > > function, replacing the current mix of callers? That would be more conducive > > > to promptly doing the right thing after launch failure. > > > > I'm not sure that'd be a good idea - right now IO workers are started before > > the startup process, as the startup process might need to perform IO. If we > > started it only later in ServerLoop() we'd potentially do a fair bit of work, > > including starting checkpointer, bgwriter, bgworkers before we started IO > > workers. That shouldn't actively break anything, but it would likely make > > things slower. > > I missed that. How about keeping the two calls associated with PM_STARTUP but > replacing the assign_io_workers() and process_pm_child_exit() calls with one > in LaunchMissingBackgroundProcesses()? I think replacing the call in assign_io_workers() is a good idea, that way we don't need assign_io_workers(). Less convinced it's a good idea to do the same for process_pm_child_exit() - if IO workers errored out we'll launch backends etc before we get to LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but seems a bit odd. I think LaunchMissingBackgroundProcesses() should be split into one that starts aux processes and one that starts bgworkers. The one maintaining aux processes should be called before we start backends, the latter not. > In the event of a launch failure, I think that would retry the launch > quickly, as opposed to maybe-never. That's a fair point. > > > > + /* > > > > + * It's very unlikely, but possible, that reopen fails. E.g. due > > > > + * to memory allocations failing or file permissions changing or > > > > + * such. In that case we need to fail the IO. > > > > + * > > > > + * There's not really a good errno we can report here. > > > > + */ > > > > + error_errno = ENOENT; > > > > > > Agreed there's not a good errno, but let's use a fake errno that we're mighty > > > unlikely to confuse with an actual case of libc returning that errno. Like > > > one of EBADF or EOWNERDEAD. > > > > Can we rely on that to be present on all platforms, including windows? > > I expect EBADF is universal. EBADF would be fine. Hm, that's actually an error that could happen for other reasons, and IMO would be more confusing than ENOENT. The latter describes the issue to a reasonable extent, EBADFD seems like it would be more confusing. I'm not sure it's worth investing time in this - it really shouldn't happen, and we probably have bigger problems than the error code if it does. But if we do want to do something, I think I can see a way to report a dedicated error message for this. > EOWNERDEAD is from 2006. 
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/errno-constants?view=msvc-140 > says VS2015 had EOWNERDEAD (the page doesn't have links for older Visual > Studio versions, so I consider them unknown). Oh, that's a larger list than I'd have thought. > https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi > lists some OSs not having it, the newest of which looks like NetBSD 9.3 > (2022). We could use it and add a #define for platforms lacking it. What would we define it as? I guess we could just pick a high value, but... Greetings, Andres Freund
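For illustration, the gnulib-style workaround being discussed would be something like the following; the value is arbitrary and only needs to stay clear of errnos libc actually returns:

	#ifndef EOWNERDEAD
	#define EOWNERDEAD 9999		/* arbitrary, must not collide with real errnos */
	#endif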
Hi, Attached is v2.7, with the following changes: - Significantly deduplicated AIO related code in bufmgr.c Previously the code for temp and shared buffers was duplicated to an uncomfortable degree. Now there is a common helper that implements the behaviour for both cases (see the sketch below). The BM_PIN_COUNT_WAITER supporting code was also deduplicated, by introducing a helper function. - Fixed typos / improved phrasing, per Noah's review - Add comment explaining why retries for EAGAIN for io_uring_enter syscall failures don't seem to make sense, improve related error messages slightly - Added a comment to aio.h explaining that aio_types.h might suffice for function declarations and aio_init.h for initialization related code. - Added and expanded comments for PgAioHandleState, explaining the state machine in more detail. - Updated README to mention the stage callback (instead of the outdated "prepare"), plus some other minor cleanups. - Added a commit rephrasing track_io_timing related docs to talk about waits - Added FIXME to method_uring.c about the set_max_safe_fds() issue. Depending on when/how that is resolved, the relevant commits can be reordered relative to the rest. - Improved localbuf: patches and commit messages, as per Melanie's review - Added FIXMEs to the bufmgr.c write support (only in later commit, unlikely to be realistic for 18) denoting that deadlock risk needs to be addressed. We probably need some lwlock.c improvements to make that race-free, otherwise I'd just have fixed this. - Added a comment discussing the placement of the TRACE_POSTGRESQL_BUFFER_READ_DONE callback - removed a few debug ereports() from the StartReadBuffers patch Unresolved: - Whether to continue starting new workers in process_pm_child_exit() - What to name the view (currently pg_aios). I'm inclined to go for pg_io_handles right now. - set_max_safe_fds() related issues for the io_uring backend Greetings, Andres Freund
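As a sketch of the deduplication pattern mentioned above: buffer_stage_common()'s signature is quoted verbatim later in the thread, while the wrapper names here are guesses. The temp and shared (and later read and write) variants presumably become thin wrappers around one always-inlined helper, letting the compiler specialize away the is_temp/is_write branches:

	static void
	shared_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
	{
		buffer_stage_common(ioh, /* is_write */ false, /* is_temp */ false);
	}

	static void
	local_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
	{
		buffer_stage_common(ioh, /* is_write */ false, /* is_temp */ true);
	}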
Attachment
- v2.7-0001-aio-Basic-subsystem-initialization.patch
- v2.7-0002-aio-Add-core-asynchronous-I-O-infrastructure.patch
- v2.7-0003-aio-Infrastructure-for-io_method-worker.patch
- v2.7-0004-aio-Add-io_method-worker.patch
- v2.7-0005-aio-Add-liburing-dependency.patch
- v2.7-0006-aio-Add-io_method-io_uring.patch
- v2.7-0007-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.7-0008-aio-Add-README.md-explaining-higher-level-desig.patch
- v2.7-0009-aio-Add-pg_aios-view.patch
- v2.7-0010-Refactor-read_stream.c-s-circular-arithmetic.patch
- v2.7-0011-Improve-buffer-pool-API-for-per-backend-pin-lim.patch
- v2.7-0012-Respect-pin-limits-accurately-in-read_stream.c.patch
- v2.7-0013-Support-buffer-forwarding-in-read_stream.c.patch
- v2.7-0014-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.7-0015-tests-Expand-temp-table-tests-to-some-pin-relat.patch
- v2.7-0016-localbuf-Fix-dangerous-coding-pattern-in-GetLoc.patch
- v2.7-0017-localbuf-Introduce-InvalidateLocalBuffer.patch
- v2.7-0018-localbuf-Introduce-TerminateLocalBufferIO.patch
- v2.7-0019-localbuf-Introduce-FlushLocalBuffer.patch
- v2.7-0020-localbuf-Introduce-StartLocalBufferIO.patch
- v2.7-0021-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.7-0022-bufmgr-Implement-AIO-read-support.patch
- v2.7-0023-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.7-0024-docs-Reframe-track_io_timing-related-docs-as-wa.patch
- v2.7-0025-WIP-aio-read_stream.c-adjustments-for-real-AIO.patch
- v2.7-0026-aio-Add-test_aio-module.patch
- v2.7-0027-aio-Implement-smgr-md-fd-write-support.patch
- v2.7-0028-aio-Add-bounce-buffers.patch
- v2.7-0029-bufmgr-Implement-AIO-write-support.patch
- v2.7-0030-aio-Add-IO-queue-helper.patch
- v2.7-0031-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.7-0032-Ensure-a-resowner-exists-for-all-paths-that-may.patch
- v2.7-0033-Temporary-Increase-BAS_BULKREAD-size.patch
Andres Freund <andres@anarazel.de> wrote: > Attached is v2.7, with the following changes: Attached are a few proposals for minor comment fixes. Besides that, it occurred to me when I was trying to get familiar with the patch set (respectable work, btw) that an additional Assert() statement could make sense: diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c index a9c351eb0dc..325688f0f23 100644 --- a/src/backend/storage/aio/aio.c +++ b/src/backend/storage/aio/aio.c @@ -413,6 +413,7 @@ pgaio_io_stage(PgAioHandle *ioh, PgAioOp op) bool needs_synchronous; Assert(ioh->state == PGAIO_HS_HANDED_OUT); + Assert(pgaio_my_backend->handed_out_io == ioh); Assert(pgaio_io_has_target(ioh)); ioh->op = op; -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
Hi, On 2025-03-13 11:53:03 +0100, Antonin Houska wrote: > Attached are a few proposals for minor comment fixes. Thanks, applied. > Besides that, it occurred to me when I was trying to get familiar with the > patch set (respectable work, btw) that an additional Assert() statement could > make sense: Yea, it does. I added it to another place as well. Attached is v2.8 with the following changes: - I wasn't happy with the way StartReadBuffers(), WaitReadBuffers() and AsyncReadBuffers() interacted. The io_method=sync path in particular was too cute by half, calling WaitReadBuffers() from within WaitReadBuffers(). I think the new state is considerably better. Plenty of other smaller changes as part of that. One worth calling out is that ReadBuffersCanStartIO() now submits staged IO before actually blocking (a sketch follows below). Not the prettiest code, but I think it's ok. - Added a function to assert the sanity of a ReadBuffersOperation While doing the refactoring for the prior point, I temporarily had a bug that returned buffers for which IO wasn't actually performed. Surprisingly the only assertion that triggered was when that buffer was read again by another operation, because it had been marked dirty, despite never being valid. Now there's a function that can be used to check that the buffers referenced by a ReadBuffersOperation are in a sane state. I guess this could be committed independently, but it'd not be entirely trivial to extract, so I'm currently leaning against doing that. - Previously VacuumCostActive accounting happened after IO completion. But that doesn't seem quite right, it'd allow submitting a lot of IO at once. It's now moved to AsyncReadBuffers() - With io_method=sync or with worker and temp tables, smgrstartreadv() would actually execute the IO. But the time accounting was done entirely around pgaio_wref_wait(). Now it's done in both places. - Rebased onto newer version of Thomas' read_stream.c changes With that newer version the integration with read stream for actually doing AIO is a bit simpler. There's one FIXME in the patch, because I don't really understand what a comment is referring to. I also split out a more experimental patch to make more efficient use of batching in read stream; the heuristics are more complicated, and it works well enough without them. - I added a commit to clean up the buffer access accounting for the case that a buffer was read in concurrently. That IMO is somewhat bogus on master, and it seemed to get more bogus with AIO. - Integrated Antonin Houska's fixes and Assert suggestion - Added a patch to address the smgr.c/md.c interrupt issue (a problem in master), see https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl I think the reasonable next steps are: - Commit "localbuf: *" commits - Commit temp table tests, likely after lowering the minimum temp_buffers setting - Pursue a fix of the smgr interrupt issue on the referenced thread This can happen in parallel with AIO patches up to "aio: Implement support for reads in smgr/md/fd" - Commit the core AIO infrastructure patch after doing a few more passes - Commit IO worker support - In parallel: Find a way to deal with the set_max_safe_fds() issue that we've been discussing on this thread recently. As that only affects io_uring, it doesn't have to block other patches going in. - Do a round of review of the read_stream changes Thomas recently posted (and that are also included here) - Try to get some more review for the bufmgr.c related changes. I've whacked them around a fair bit lately. 
- Try to get Thomas to review my read_stream.c changes Open items: - The upstream BAS_BULKREAD is so small that throughput is substantially worse once a table reaches 1/4 shared_buffers. That patch in the patchset as-is is probably not good enough, although I am not sure about that - The set_max_safe_fds() issue for io_uring - Right now effective_io_concurrency cannot be set > 0 on Windows and other platforms that lack posix_fadvise. But with AIO we can read ahead without posix_fadvise(). It'd not really make anything worse than today to not remove the limit, but it'd be pretty weird to prevent windows etc from benefiting from AIO. Need to look around and see whether it would require anything other than doc changes. Greetings, Andres Freund
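A sketch of the ReadBuffersCanStartIO() change mentioned above. The helper names (notably pgaio_submit_staged()) are assumptions, the nowait parameter mirrors the StartBufferIO() changes discussed later in the thread, and local buffers are ignored for brevity:

	static inline bool
	ReadBuffersCanStartIO(Buffer buffer, bool nowait)
	{
		BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);

		/* first try to start the IO without blocking */
		if (StartBufferIO(buf_hdr, true, true))
			return true;

		if (nowait)
			return false;

		/*
		 * We are about to block waiting for in-progress IO on this buffer.
		 * Submit our own staged-but-unsubmitted IOs first, so nobody
		 * (including ourselves) ends up waiting on IO still sitting in our
		 * batch.
		 */
		pgaio_submit_staged();

		return StartBufferIO(buf_hdr, true, false);
	}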
Attachment
- v2.8-0001-aio-Basic-subsystem-initialization.patch
- v2.8-0002-aio-Add-core-asynchronous-I-O-infrastructure.patch
- v2.8-0003-aio-Infrastructure-for-io_method-worker.patch
- v2.8-0004-aio-Add-io_method-worker.patch
- v2.8-0005-aio-Add-liburing-dependency.patch
- v2.8-0006-aio-Add-io_method-io_uring.patch
- v2.8-0007-smgr-Hold-interrupts-in-most-smgr-functions.patch
- v2.8-0008-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.8-0009-aio-Add-README.md-explaining-higher-level-desig.patch
- v2.8-0010-aio-Add-pg_aios-view.patch
- v2.8-0011-Improve-read_stream.c-advice-for-big-random-chu.patch
- v2.8-0012-Simplify-distance-heuristics-in-read_stream.c.patch
- v2.8-0013-Support-buffer-forwarding-in-read_stream.c.patch
- v2.8-0014-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.8-0015-tests-Expand-temp-table-tests-to-some-pin-relat.patch
- v2.8-0016-localbuf-Fix-dangerous-coding-pattern-in-GetLoc.patch
- v2.8-0017-localbuf-Introduce-InvalidateLocalBuffer.patch
- v2.8-0018-localbuf-Introduce-TerminateLocalBufferIO.patch
- v2.8-0019-localbuf-Introduce-FlushLocalBuffer.patch
- v2.8-0020-localbuf-Introduce-StartLocalBufferIO.patch
- v2.8-0021-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.8-0022-bufmgr-Implement-AIO-read-support.patch
- v2.8-0023-bufmgr-Improve-stats-when-buffer-was-read-in-co.patch
- v2.8-0024-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.8-0025-docs-Reframe-track_io_timing-related-docs-as-wa.patch
- v2.8-0026-aio-Basic-read_stream-adjustments-for-real-AIO.patch
- v2.8-0027-aio-Experimental-heuristics-to-increase-batchin.patch
- v2.8-0028-aio-Add-test_aio-module.patch
- v2.8-0029-aio-Implement-smgr-md-fd-write-support.patch
- v2.8-0030-aio-Add-bounce-buffers.patch
- v2.8-0031-bufmgr-Implement-AIO-write-support.patch
- v2.8-0032-aio-Add-IO-queue-helper.patch
- v2.8-0033-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.8-0034-Ensure-a-resowner-exists-for-all-paths-that-may.patch
- v2.8-0035-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.8-0036-WIP-Use-MAP_POPULATE.patch
Hi, On 2025-03-14 15:43:15 -0400, Andres Freund wrote: > Open items: > > - The upstream BAS_BULKREAD is so small that throughput is substantially worse > once a table reaches 1/4 shared_buffers. That patch in the patchset as-is is > probably not good enough, although I am not sure about that > > > - The set_max_safe_fds() issue for io_uring > > > - Right now effective_io_concurrency cannot be set > 0 on Windows and other > platforms that lack posix_fadvise. But with AIO we can read ahead without > posix_fadvise(). > > It'd not really make anything worse than today to not remove the limit, but > it'd be pretty weird to prevent windows etc from benefiting from AIO. Need > to look around and see whether it would require anything other than doc > changes. A fourth, smaller, question: - Should the docs for debug_io_direct be rephrased and if so, how? Without read-stream-AIO debug_io_direct=data has completely unusable performance if there's ever any data IO - and if there's no IO there's no point in using the option. Now there is a certain set of workloads where performance with debug_io_direct=data can be better than master, sometimes substantially so. But at the same time, without support for at least: - AIO writes for at least checkpointer, bgwriter doing one synchronous IO for each buffer is ... slow. - read-streamified index vacuuming And probably also: - AIO-ified writes for writes executed by backends, e.g. due to strategies Doing one synchronous IO for each buffer is ... slow. And e.g. with COPY we do a *lot* of those. OTOH, it could be fine if most modifications are done via INSERTs instead of COPY. - prefetching for non-BHS index accesses Without prefetching, a well correlated index-range scan will be orders of magnitude slower with DIO. - Anything bypassing shared_buffers, like RelationCopyStorage() or bulk_write.c will be extremely slow The only saving grace is that these aren't all *that* common. Due to those constraints I think it's pretty clear we can't remove the debug_ prefix at this time. Perhaps it's worth going from <para> Currently this feature reduces performance, and is intended for developer testing only. </para> to <para> Currently this feature reduces performance in many workloads, and is intended for testing only. </para> I.e. qualify the downside with "many workloads" and widen the audience ever so slightly? Greetings, Andres Freund
Hi, Attached is v2.9 with the following, fairly small, changes: - Rebased on top of the additional committed read stream patches - Committed the localbuf: refactorings (but not the change to expand refcounting of local buffers, that seems a bit more dependent on the rest) - Committed the temp table test after some annoying debugging https://postgr.es/m/w5wr26ijzp7xz2qrxkt6dzvmmn2tn6tn5fp64y6gq5iuvg43hw%40v4guo6x776dq - Some rephrasing and moving of comments in the first two commits - There was a small bug, found while reviewing, in the smgr reopen call: the PgAioOpData->read.fd field was referenced for both reads and writes, which failed to fail because both read/write use a compatible struct layout (sketched below). Unless I hear otherwise, I plan to commit the first two patches fairly soon, followed by the worker support patches a few buildfarm cycles later. I'm sure there's a bunch of stuff worth improving in the AIO infrastructure and I can't imagine a project of this size not having bugs. But I think it's in a state where that's better worked out in tree, with broader exposure and testing. Greetings, Andres Freund
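Spelling out the reopen bug, with union member names inferred from the description rather than copied from the patch:

	/* in the IO worker's reopen path, for a write: */
	io->op_data.read.fd = newfd;	/* buggy: read branch of the PgAioOpData union */
	io->op_data.write.fd = newfd;	/* intended: the write branch */

The bug "failed to fail" because both union branches lay out fd at the same offset.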
Attachment
- v2.9-0001-aio-Basic-subsystem-initialization.patch
- v2.9-0002-aio-Add-core-asynchronous-I-O-infrastructure.patch
- v2.9-0003-aio-Infrastructure-for-io_method-worker.patch
- v2.9-0004-aio-Add-io_method-worker.patch
- v2.9-0005-aio-Add-liburing-dependency.patch
- v2.9-0006-aio-Add-io_method-io_uring.patch
- v2.9-0007-smgr-Hold-interrupts-in-most-smgr-functions.patch
- v2.9-0008-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.9-0009-aio-Add-README.md-explaining-higher-level-desig.patch
- v2.9-0010-aio-Add-pg_aios-view.patch
- v2.9-0011-Support-buffer-forwarding-in-read_stream.c.patch
- v2.9-0012-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.9-0013-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.9-0014-bufmgr-Implement-AIO-read-support.patch
- v2.9-0015-bufmgr-Improve-stats-when-buffer-was-read-in-co.patch
- v2.9-0016-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.9-0017-docs-Reframe-track_io_timing-related-docs-as-wa.patch
- v2.9-0018-aio-Basic-read_stream-adjustments-for-real-AIO.patch
- v2.9-0019-aio-Experimental-heuristics-to-increase-batchin.patch
- v2.9-0020-aio-Add-test_aio-module.patch
- v2.9-0021-aio-Implement-smgr-md-fd-write-support.patch
- v2.9-0022-aio-Add-bounce-buffers.patch
- v2.9-0023-bufmgr-Implement-AIO-write-support.patch
- v2.9-0024-aio-Add-IO-queue-helper.patch
- v2.9-0025-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.9-0026-Ensure-a-resowner-exists-for-all-paths-that-may.patch
- v2.9-0027-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.9-0028-WIP-Use-MAP_POPULATE.patch
On Fri, Mar 14, 2025 at 3:43 PM Andres Freund <andres@anarazel.de> wrote: > > Open items: > > - Right now effective_io_concurrency cannot be set > 0 on Windows and other > platforms that lack posix_fadvise. But with AIO we can read ahead without > posix_fadvise(). > > It'd not really make anything worse than today to not remove the limit, but > it'd be pretty weird to prevent windows etc from benefiting from AIO. Need > to look around and see whether it would require anything other than doc > changes. I've attached a patch that removes the limit for effective_io_concurrency and maintenance_io_concurrency. I tested both GUCs with fadvise manually disabled on my system and I think it is working for those read stream users I tried (vacuum and BHS). I checked around to make sure no one was using only the value of the guc to guard prefetches, and it seems like we're safe. The one thing I am wondering about with the docs is whether or not we need to make it more clear that only a subset of the "simultaneous I/O" behavior controlled by eic/mic is available if your system doesn't have fadvise. I tried to do that a bit, but I avoided getting into too many details. - Melanie
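For context, the call-site pattern that audit is about looks roughly like this: prefetch calls are gated on the build actually having posix_fadvise (USE_PREFETCH), not on the GUC value alone, which is what makes raising the GUC on fadvise-less builds safe:

	#ifdef USE_PREFETCH
		/* only reached when the build supports issuing prefetch advice */
		if (effective_io_concurrency > 0)
			(void) PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
	#endif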
Attachment
Hi, Attached is v2.10, with the following changes: - committed core AIO infrastructure patch A few cosmetic changes - committed io_method=worker Two non-cosmetic changes: - The shmem allocation functions over-estimated the amount of shared memory required. - pgaio_worker_shmem_init() should initialize up to MAX_IO_WORKERS, not just io_workers, since the latter is intentionally PGC_SIGHUP (found by Thomas; see the sketch below) - Bunch of typo fixes found by searching for repeated words Thomas found one and then I searched for more. - Added Melanie's patch to allow effective_io_concurrency to be set on windows etc - Fixed an outdated function reference (thanks to Bilal) - Reordered patches to be a bit more in dependency order E.g. "bufmgr: Implement AIO read support" doesn't depend on Thomas' "buffer forwarding" patches and thus can be committed before those go in. Next steps: - Decide what to do about the smgr interrupt issue I guess it could be deferred, based on the argument it only matters with a sufficiently high debug level, but I don't feel comfortable with that. I think it'd be reasonable to just go with the patch I sent on the other thread (and included here). - Get somebody to look at "bufmgr: Improve stats when buffer was read in concurrently" This arguably fixes a bug, or just weird behaviour, on master. - Address the set_max_safe_fds() issue and once resolved, commit io_uring method Can happen concurrently with next steps - Commit "aio: Implement support for reads in smgr/md/fd" - Get somebody to do one more pass at bufmgr related commits, I think they could use a less in-the-weeds eye. That's the following commits: - localbuf: Track pincount in BufferDesc as well - bufmgr: Implement AIO read support - bufmgr: Use AIO in StartReadBuffers() - aio: Basic read_stream adjustments for real AIO Questions / Unresolved: - Write support isn't going to land in 18, but there is a tiny bit of code regarding writes in the code for bufmgr IO. I guess I could move that to a later commit? I'm inclined to leave it, the structure of the code only really makes sense knowing that it's going to be shared between reads & writes. - pg_aios view name Greetings, Andres Freund
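A sketch of the pgaio_worker_shmem_init() fix described above; the signature and the init_io_worker_slot() helper are stand-ins, not the patch's actual code:

	void
	pgaio_worker_shmem_init(void)
	{
		/*
		 * Initialize state for every possible worker, not just the first
		 * io_workers ones: the GUC is PGC_SIGHUP and can be raised later.
		 */
		for (int i = 0; i < MAX_IO_WORKERS; i++)
			init_io_worker_slot(i);
	}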
Attachment
- v2.10-0001-smgr-Hold-interrupts-in-most-smgr-functions.patch
- v2.10-0002-bufmgr-Improve-stats-when-buffer-is-read-in-co.patch
- v2.10-0003-aio-Add-liburing-dependency.patch
- v2.10-0004-aio-Add-io_method-io_uring.patch
- v2.10-0005-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.10-0006-aio-Add-README.md-explaining-higher-level-desi.patch
- v2.10-0007-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.10-0008-bufmgr-Implement-AIO-read-support.patch
- v2.10-0009-Support-buffer-forwarding-in-read_stream.c.patch
- v2.10-0010-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.10-0011-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.10-0012-aio-Basic-read_stream-adjustments-for-real-AIO.patch
- v2.10-0013-docs-Reframe-track_io_timing-related-docs-as-w.patch
- v2.10-0014-Enable-IO-concurrency-on-all-systems.patch
- v2.10-0015-aio-Add-pg_aios-view.patch
- v2.10-0016-aio-Experimental-heuristics-to-increase-batchi.patch
- v2.10-0017-aio-Add-test_aio-module.patch
- v2.10-0018-aio-Implement-smgr-md-fd-write-support.patch
- v2.10-0019-aio-Add-bounce-buffers.patch
- v2.10-0020-bufmgr-Implement-AIO-write-support.patch
- v2.10-0021-aio-Add-IO-queue-helper.patch
- v2.10-0022-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.10-0023-Ensure-a-resowner-exists-for-all-paths-that-ma.patch
- v2.10-0024-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.10-0025-WIP-Use-MAP_POPULATE.patch
On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote: > Attached is v2.10, This is a review of 0008: bufmgr: Implement AIO read support I'm afraid it is more of a cosmetic review than a sign-off on the patch's correctness, but perhaps it will help future readers who may have the same questions I did. In the commit message: bufmgr: Implement AIO read support This implements the following: - Addition of callbacks to maintain buffer state on completion of a readv - Addition of a wait reference to BufferDesc, to allow backends to wait for IOs - StartBufferIO(), WaitIO(), TerminateBufferIO() support for waiting AIO I think it might be nice to say something about allowing backends to complete IOs issued by other backends. @@ -40,6 +41,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = { CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb), CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb), + + CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb), + + CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb), #undef CALLBACK_ENTRY }; I personally can't quite figure out why the read and write callbacks are defined differently than the stage, complete, and report callbacks. I know there is a comment above PgAioHandleCallbackID about something about this, but it didn't really clarify it for me. Maybe you can put a block comment at the top of aio_callback.c? Or perhaps I just need to study it more... @@ -5482,10 +5503,19 @@ WaitIO(BufferDesc *buf) + if (pgaio_wref_valid(&iow)) + { + pgaio_wref_wait(&iow); + ConditionVariablePrepareToSleep(cv); + continue; + } I'd add some comment above this. I reread it many times, and I still only _think_ I understand what it does. I think the reason we need ConditionVariablePrepareToSleep() again is because pgaio_io_wait() may have called ConditionVariableCancelSleep() so we need to ConditionVariablePrepareToSleep() again (it was done already at the top of Wait())? I'll admit I find the CV API quite confusing, so maybe I'm just misunderstanding it. Maybe worth mentioning in the commit message about why WaitIO() has to work differently for AIO than sync IO. /* * Support LockBufferForCleanup() * * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling. * Most of the time the current backend will hold another pin preventing * that from happening, but that's e.g. not the case when completing an IO * another backend started. */ I found this wording a bit confusing. what about this: We may have just released the last pin other than the waiter's. In most cases, this backend holds another pin on the buffer. But, if, for example, this backend is completing an IO issued by another backend, it may be time to wake the waiter. /* * Helper for AIO staging callback for both reads and writes as well as temp * and shared buffers. */ static pg_attribute_always_inline void buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp) I think buffer_stage_common() needs the function comment to explain what unit it is operating on. It is called "buffer_" singular but then it loops through io_data which appears to contain multiple buffers /* * Check that all the buffers are actually ones that could conceivably * be done in one IO, i.e. are sequential. */ if (i == 0) first = buf_hdr->tag; else { Assert(buf_hdr->tag.relNumber == first.relNumber); Assert(buf_hdr->tag.blockNum == first.blockNum + i); } So it is interesting to me that this validation is done at this level. 
Enforcing sequentialness doesn't seem like it would be intrinsically related to or required to stage IOs. And there isn't really anything in this function that seems like it would require it either. Usually an assert is pretty close to the thing it is protecting. Oh and I think the end of the loop in buffer_stage_common() would look nicer with a small refactor with the resulting code looking like this: /* temp buffers don't use BM_IO_IN_PROGRESS */ Assert(!is_temp || (buf_state & BM_IO_IN_PROGRESS)); /* we better have ensured the buffer is present until now */ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1); /* * Reflect that the buffer is now owned by the subsystem. * * For local buffers: This can't be done just in LocalRefCount as one * might initially think, as this backend could error out while AIO is * still in progress, releasing all the pins by the backend itself. */ buf_state += BUF_REFCOUNT_ONE; buf_hdr->io_wref = io_ref; if (is_temp) { pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state); continue; } UnlockBufHdr(buf_hdr, buf_state); if (is_write) { LWLock *content_lock; content_lock = BufferDescriptorGetContentLock(buf_hdr); Assert(LWLockHeldByMe(content_lock)); /* * Lock is now owned by AIO subsystem. */ LWLockDisown(content_lock); } /* * Stop tracking this buffer via the resowner - the AIO system now * keeps track. */ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer); } In buffer_readv_complete(), this comment /* * Iterate over all the buffers affected by this IO and call appropriate * per-buffer completion function for each buffer. */ makes it sound like we might invoke different completion functions (like invoke the completion callback), but that isn't what happens here. failed = prior_result.status == ARS_ERROR || prior_result.result <= buf_off; Though not introduced in this commit, I will say that I find the ARS prefix not particularly helpful. Though not as brief, something like AIO_RS_ERROR would probably be more clear. @@ -515,10 +517,25 @@ MarkLocalBufferDirty(Buffer buffer) * Like StartBufferIO, but for local buffers */ bool -StartLocalBufferIO(BufferDesc *bufHdr, bool forInput) +StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait) { - uint32 buf_state = pg_atomic_read_u32(&bufHdr->state); + uint32 buf_state; + + /* + * The buffer could have IO in progress, e.g. when there are two scans of + * the same relation. Either wait for the other IO or return false. + */ + if (pgaio_wref_valid(&bufHdr->io_wref)) + { + PgAioWaitRef iow = bufHdr->io_wref; + + if (nowait) + return false; + + pgaio_wref_wait(&iow); + } + buf_state = pg_atomic_read_u32(&bufHdr->state); if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY)) { /* someone else already did the I/O */ I'd move this comment ("someone else already did") outside of the if statement so it kind of separates it into three clear cases: 1) the IO is in progress and you can wait on it if you want, 2) the IO is completed by someone else (is this possible for local buffers without AIO?) 3) you can start the IO - Melanie
Hi, On 2025-03-18 21:00:17 -0400, Melanie Plageman wrote: > On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote: > > Attached is v2.10, > > This is a review of 0008: bufmgr: Implement AIO read support > > I'm afraid it is more of a cosmetic review than a sign-off on the > patch's correctness, but perhaps it will help future readers who may > have the same questions I did. I think that's actually an important level of review. I'm, as odd as that sounds, more confident about the architectural stuff than about "understandability" etc. > In the commit message: > bufmgr: Implement AIO read support > > This implements the following: > - Addition of callbacks to maintain buffer state on completion of a readv > - Addition of a wait reference to BufferDesc, to allow backends to wait for > IOs > - StartBufferIO(), WaitIO(), TerminateBufferIO() support for waiting AIO > > I think it might be nice to say something about allowing backends to > complete IOs issued by other backends. Hm, I'd have said that's basically implied by the way AIO works (as outlined in the added README.md), but I can think of a way to mention it here. > @@ -40,6 +41,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = { > CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb), > > CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb), > + > + CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb), > + > + CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb), > #undef CALLBACK_ENTRY > }; > > I personally can't quite figure out why the read and write callbacks > are defined differently than the stage, complete, and report > callbacks. I know there is a comment above PgAioHandleCallbackID about > something about this, but it didn't really clarify it for me. Maybe > you can put a block comment at the top of aio_callback.c? Or perhaps I > just need to study it more... They're not implemented differently - PgAioHandleCallbacks (which is what is contained in aio_handle_cbs, just with a name added) all have a stage, complete and report callbacks. E.g. for SHARED_BUFFER_READV you have a stage (to transfer the buffer pins to the AIO subsystem), a shared completion (to verify the page, to remove BM_IO_IN_PROGRESS and set BM_VALID/BM_IO_ERROR, as appropriate) and a report callback (to report a page validation error). Maybe more of the relevant types and functions should have been plural, but then it becomes very awkward to talk about the separate registrations of multiple callbacks (i.e. the set of callbacks for md.c and the set of callbacks for bufmgr.c). > @@ -5482,10 +5503,19 @@ WaitIO(BufferDesc *buf) > + if (pgaio_wref_valid(&iow)) > + { > + pgaio_wref_wait(&iow); > + ConditionVariablePrepareToSleep(cv); > + continue; > + } > > I'd add some comment above this. I reread it many times, and I still > only _think_ I understand what it does. I think the reason we need > ConditionVariablePrepareToSleep() again is because pgaio_io_wait() may > have called ConditionVariableCancelSleep() so we need to > ConditionVariablePrepareToSleep() again (it was done already at the > top of Wait())? Oh, yes, that definitely needs a comment. I've been marinating in this for so long that it seems obvious, but if I take a step back, it's not at all obvious. The issue is that pgaio_wref_wait() internally waits on a *different* condition variable than the BufferDesc's CV. 
The consequences of not doing this would be fairly mild - the next ConditionVariableSleep would prepare to sleep and return immediately - but it's unnecessary. > Maybe worth mentioning in the commit message about why WaitIO() has to > work differently for AIO than sync IO. K. > /* > * Support LockBufferForCleanup() > * > * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling. > * Most of the time the current backend will hold another pin preventing > * that from happening, but that's e.g. not the case when completing an IO > * another backend started. > */ > > I found this wording a bit confusing. what about this: > > We may have just released the last pin other than the waiter's. In most cases, > this backend holds another pin on the buffer. But, if, for example, this > backend is completing an IO issued by another backend, it may be time to wake > the waiter. WFM. > /* > * Helper for AIO staging callback for both reads and writes as well as temp > * and shared buffers. > */ > static pg_attribute_always_inline void > buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp) > > I think buffer_stage_common() needs the function comment to explain > what unit it is operating on. > It is called "buffer_" singular but then it loops through io_data > which appears to contain multiple buffers Hm. Yea. Originally it was just for readv and was duplicated for writes. The vectorized bit hinted at being for multiple buffers. > /* > * Check that all the buffers are actually ones that could conceivably > * be done in one IO, i.e. are sequential. > */ > if (i == 0) > first = buf_hdr->tag; > else > { > Assert(buf_hdr->tag.relNumber == first.relNumber); > Assert(buf_hdr->tag.blockNum == first.blockNum + i); > } > > So it is interesting to me that this validation is done at this level. > Enforcing sequentialness doesn't seem like it would be intrinsically > related to or required to stage IOs. And there isn't really anything > in this function that seems like it would require it either. Usually > an assert is pretty close to the thing it is protecting. Staging is the last buffer-aware thing that happens before IO is actually executed. If you were to do a readv() into buffers that aren't for sequential blocks, you would get bogus buffer pool contents, because obviously it doesn't make sense to read data for block N+1 into the buffer for block N+3 or whatnot. The assertions did find bugs during development, fwiw. > Oh and I think the end of the loop in buffer_stage_common() would look > nicer with a small refactor with the resulting code looking like this: > > /* temp buffers don't use BM_IO_IN_PROGRESS */ > Assert(!is_temp || (buf_state & BM_IO_IN_PROGRESS)); > > /* we better have ensured the buffer is present until now */ > Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1); > > /* > * Reflect that the buffer is now owned by the subsystem. > * > * For local buffers: This can't be done just in LocalRefCount as one > * might initially think, as this backend could error out while AIO is > * still in progress, releasing all the pins by the backend itself. > */ > buf_state += BUF_REFCOUNT_ONE; > buf_hdr->io_wref = io_ref; > > if (is_temp) > { > pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state); > continue; > } > > UnlockBufHdr(buf_hdr, buf_state); > > if (is_write) > { > LWLock *content_lock; > > content_lock = BufferDescriptorGetContentLock(buf_hdr); > > Assert(LWLockHeldByMe(content_lock)); > > /* > * Lock is now owned by AIO subsystem. 
> */ > LWLockDisown(content_lock); > } > > /* > * Stop tracking this buffer via the resowner - the AIO system now > * keeps track. > */ > ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer); > } I don't particularly like this, I'd like to make the logic for shared and local buffers more similar over time. E.g. by also tracking local buffer IO via resowner. > In buffer_readv_complete(), this comment > > /* > * Iterate over all the buffers affected by this IO and call appropriate > * per-buffer completion function for each buffer. > */ > > makes it sound like we might invoke different completion functions (like invoke > the completion callback), but that isn't what happens here. Oops, that's how it used to work, but it doesn't anymore, because it ended up with too much duplication. > failed = > prior_result.status == ARS_ERROR > || prior_result.result <= buf_off; > > Though not introduced in this commit, I will say that I find the ARS prefix not > particularly helpful. Though not as brief, something like AIO_RS_ERROR would > probably be more clear. Fair enough. I'd go for PGAIO_RS_ERROR etc though. > @@ -515,10 +517,25 @@ MarkLocalBufferDirty(Buffer buffer) > * Like StartBufferIO, but for local buffers > */ > bool > -StartLocalBufferIO(BufferDesc *bufHdr, bool forInput) > +StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait) > { > - uint32 buf_state = pg_atomic_read_u32(&bufHdr->state); > + uint32 buf_state; > + > + /* > + * The buffer could have IO in progress, e.g. when there are two scans of > + * the same relation. Either wait for the other IO or return false. > + */ > + if (pgaio_wref_valid(&bufHdr->io_wref)) > + { > + PgAioWaitRef iow = bufHdr->io_wref; > + > + if (nowait) > + return false; > + > + pgaio_wref_wait(&iow); > + } > > + buf_state = pg_atomic_read_u32(&bufHdr->state); > if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY)) > { > /* someone else already did the I/O */ > > I'd move this comment ("someone else already did") outside of the if > statement so it kind of separates it into three clear cases: FWIW it's inside because that's how StartBufferIOs comment has been for a fair while... > 1) the IO is in progress and you can wait on it if you want, > 2) the IO is completed by someone else (is this possible for local buffers > without AIO?) No, that's not possible without AIO. > 3) you can start the IO I'll give it a go. Thanks for the review! Greetings, Andres Freund
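To make the callback-set structure concrete, it presumably looks roughly like this; member names are guesses based on the discussion, not copied from the patch:

	typedef struct PgAioHandleCallbacks
	{
		/* transfer resource ownership (e.g. buffer pins) to the AIO subsystem */
		void		(*stage) (PgAioHandle *ioh, uint8 cb_data);

		/* update shared state: clear BM_IO_IN_PROGRESS, set BM_VALID/BM_IO_ERROR */
		PgAioResult (*complete_shared) (PgAioHandle *ioh, PgAioResult prior_result,
										uint8 cb_data);

		/* turn a result into an ereport(), e.g. a page validation error */
		void		(*report) (PgAioResult result, const PgAioTargetData *td,
							   int elevel);
	} PgAioHandleCallbacks;

aio_handle_cbs then maps each PgAioHandleCallbackID to one such set, which is why a single ID like PGAIO_HCB_SHARED_BUFFER_READV covers staging, completion and reporting at once.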
On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote: > > Attached is v2.10 This is a review of 0002: bufmgr: Improve stats when buffer is read in concurrently In the commit message, it might be worth distinguishing that pg_stat_io and vacuum didn't double count reads; they under-counted hits. pgBufferUsage and relation-level stats (pg_stat_all_tables etc) overcounted reads and undercounted hits. Quick example: On master, if we try to read 7 blocks and 3 were hits and 2 were completed by someone else then - pg_stat_io and VacuumCostBalance would record 3 hits and 2 reads, which looks like 2 misses. - pgBufferUsage would record 3 hits and 4 reads, which looks like 4 misses. - pg_stat_all_tables would record 3 hits and 7 reads, which looks like 4 misses. The correct result is 5 hits and 2 reads, i.e. 2 misses (or 7 reads and 5 hits for pg_stat_all_tables, which does the math later). @@ -1463,8 +1450,13 @@ WaitReadBuffers(ReadBuffersOperation *operation) if (!WaitReadBuffersCanStartIO(buffers[i], false)) { /* - * Report this as a 'hit' for this backend, even though it must - * have started out as a miss in PinBufferForBlock(). + * Report and track this as a 'hit' for this backend, even though + * it must have started out as a miss in PinBufferForBlock(). + * + * Some of the accesses would otherwise never be counted (e.g. + * pgBufferUsage) or counted as a miss (e.g. + * pgstat_count_buffer_hit(), as we always call + * pgstat_count_buffer_read()). */ I think this comment should be changed. It reads like something written when discovering this problem and not like something useful in the future. I think you can probably drop the whole second paragraph. You could make it even more clear by mentioning that the other backend will count it as a read. Otherwise, LGTM - Melanie
On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote: > > Attached is v2.10, I noticed a few comments could be improved in 0011: bufmgr: Use AIO in StartReadBuffers() In WaitReadBuffers(), this comment is incomplete: /* - * Skip this block if someone else has already completed it. If an - * I/O is already in progress in another backend, this will wait for - * the outcome: either done, or something went wrong and we will - * retry. + * If there is an IO associated with the operation, we may need to + * wait for it. It's possible for there to be no IO if */ In WaitReadBuffers(), too many thes /* * Most of the the the one IO we started will read in everything. But * we need to deal with short reads and buffers not needing IO * anymore. */ In ReadBuffersCanStartIO() + /* + * Unfortunately a false returned StartBufferIO() doesn't allow to + * distinguish between the buffer already being valid and IO already + * being in progress. Since IO already being in progress is quite + * rare, this approach seems fine. + */ maybe reword "a false returned StartBufferIO()" Above and in AsyncReadBuffers() * To support retries after short reads, the first operation->nblocks_done is * buffers are skipped. can't quite understand this + * On return *nblocks_progres is updated to reflect the number of buffers progress spelled wrong * A secondary benefit is that this would allows us to measure the time in * pgaio_io_acquire() without causing undue timer overhead in the common, * non-blocking, case. However, currently the pgstats infrastructure * doesn't really allow that, as it a) asserts that an operation can't * have time without operations b) doesn't have an API to report * "accumulated" time. */ allows->allow What would the time spent in pgaio_io_acquire() be reported as? Time submitting IOs? Time waiting for a handle? And what is "accumulated" time here? It seems like you just add the time to the running total and that is already accumulated. - Melanie
Hi, On 2025-03-19 13:20:17 -0400, Melanie Plageman wrote: > On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote: > > > > Attached is v2.10, > > I noticed a few comments could be improved in 0011: bufmgr: Use AIO > in StartReadBuffers() > [...] Yep. > Above and in AsyncReadBuffers() > > * To support retries after short reads, the first operation->nblocks_done is > * buffers are skipped. > > can't quite understand this Heh, yea, it's easy to misunderstand. "short read" in the sense of a partial read, i.e. a preadv() that only read some of the blocks, not all. I'm replacing the "short" with partial. (also removed the superfluous "is") > * A secondary benefit is that this would allows us to measure the time in > * pgaio_io_acquire() without causing undue timer overhead in the common, > * non-blocking, case. However, currently the pgstats infrastructure > * doesn't really allow that, as it a) asserts that an operation can't > * have time without operations b) doesn't have an API to report > * "accumulated" time. > */ > > allows->allow > > What would the time spent in pgaio_io_acquire() be reported as? I'd report it as additional time for the IO we're trying to start, as that wait would otherwise not happen. > And what is "accumulated" time here? It seems like you just add the time to > the running total and that is already accumulated. Afaict there currently is no way to report a time delta to pgstat. pgstat_count_io_op_time() computes the time since pgstat_prepare_io_time(). Due to the assertions that time cannot be reported for an operation with a zero count, we can't just do pgstat_prepare_io_time(); ...; pgstat_count_io_op_time(); twice, with the first one passing cnt=0. Greetings, Andres Freund
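A sketch of that constraint, with parameter lists abbreviated and wait_for_free_io_handle() as a hypothetical stand-in for the blocking step:

	instr_time	io_start = pgstat_prepare_io_time(track_io_timing);

	wait_for_free_io_handle();	/* may block while all handles are in use */

	/*
	 * There is no way to report just this wait: pgstat asserts that
	 * reported time comes with a non-zero operation count, and there's no
	 * API to add a bare time delta to the accumulated totals.
	 */
	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_READ,
							io_start, /* cnt */ 0);		/* would trip the assert */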
On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote: > On 2025-03-11 20:57:43 -0700, Noah Misch wrote: > > - Like you say, "redefine max_files_per_process to be about the number of > > files each *backend* will additionally open". It will become normal that > > each backend's actual FD list length is max_files_per_process + MaxBackends > > if io_method=io_uring. Outcome is not unlike > > v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch + > > v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't > > mutate max_files_per_process. Benchmark results should not change beyond > > the inter-major-version noise level unless one sets io_method=io_uring. I'm > > feeling best about this one, but I've not been thinking about it long. > > Yea, I think that's something probably worth doing separately from Jelte's > patch. I do think that it'd be rather helpful to have jelte's patch to > increase NOFILE in addition though. Agreed. > > > > > +static void > > > > > +maybe_adjust_io_workers(void) > > > > > > > > This also restarts workers that exit, so perhaps name it > > > > start_io_workers_if_missing(). > > > > > > But it also stops IO workers if necessary? > > > > Good point. Maybe just add a comment like "start or stop IO workers to close > > the gap between the running count and the configured count intent". > > It's now > /* > * Start or stop IO workers, to close the gap between the number of running > * workers and the number of configured workers. Used to respond to change of > * the io_workers GUC (by increasing and decreasing the number of workers), as > * well as workers terminating in response to errors (by starting > * "replacement" workers). > */ Excellent. > > > > > +{ > > > > ... > > > > > + /* Try to launch one. */ > > > > > + child = StartChildProcess(B_IO_WORKER); > > > > > + if (child != NULL) > > > > > + { > > > > > + io_worker_children[id] = child; > > > > > + ++io_worker_count; > > > > > + } > > > > > + else > > > > > + break; /* XXX try again soon? */ > > > > > > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this > > > > function, replacing the current mix of callers? That would be more conducive > > > > to promptly doing the right thing after launch failure. > > > > > > I'm not sure that'd be a good idea - right now IO workers are started before > > > the startup process, as the startup process might need to perform IO. If we > > > started it only later in ServerLoop() we'd potentially do a fair bit of work, > > > including starting checkpointer, bgwriter, bgworkers before we started IO > > > workers. That shouldn't actively break anything, but it would likely make > > > things slower. > > > > I missed that. How about keeping the two calls associated with PM_STARTUP but > > replacing the assign_io_workers() and process_pm_child_exit() calls with one > > in LaunchMissingBackgroundProcesses()? > > I think replacing the call in assign_io_workers() is a good idea, that way we > don't need assign_io_workers(). > > Less convinced it's a good idea to do the same for process_pm_child_exit() - > if IO workers errored out we'll launch backends etc before we get to > LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but > seems a bit odd. Works for me. > I think LaunchMissingBackgroundProcesses() should be split into one that > starts aux processes and one that starts bgworkers. The one maintaining aux > processes should be called before we start backends, the latter not. 
That makes sense, though I've not thought about it much. > > > > > + /* > > > > > + * It's very unlikely, but possible, that reopen fails. E.g. due > > > > > + * to memory allocations failing or file permissions changing or > > > > > + * such. In that case we need to fail the IO. > > > > > + * > > > > > + * There's not really a good errno we can report here. > > > > > + */ > > > > > + error_errno = ENOENT; > > > > > > > > Agreed there's not a good errno, but let's use a fake errno that we're mighty > > > > unlikely to confuse with an actual case of libc returning that errno. Like > > > > one of EBADF or EOWNERDEAD. > > > > > > Can we rely on that to be present on all platforms, including windows? > > > > I expect EBADF is universal. EBADF would be fine. > > Hm, that's actually an error that could happen for other reasons, and IMO > would be more confusing than ENOENT. The latter describes the issue to a > reasonable extent, EBADFD seems like it would be more confusing. > > I'm not sure it's worth investing time in this - it really shouldn't happen, > and we probably have bigger problems than the error code if it does. But if we > do want to do something, I think I can see a way to report a dedicated error > message for this. I agree it's not worth much investment. Let's leave that one as-is. We can always change it further if the not-really-good errno shows up too much. > > https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi > > lists some OSs not having it, the newest of which looks like NetBSD 9.3 > > (2022). We could use it and add a #define for platforms lacking it. > > What would we define it as? I guess we could just pick a high value, but... Some second-best value, but I withdraw that idea. On Wed, Mar 12, 2025 at 07:23:47PM -0400, Andres Freund wrote: > Attached is v2.7, with the following changes: > Unresolved: > > - Whether to continue starting new workers in process_pm_child_exit() I'm fine with that continuing. It's hurting ~nothing. > - What to name the view (currently pg_aios). I'm inclined to go for > pg_io_handles right now. I like pg_aios mildly better than pg_io_handles, since "handle" sounds implementation-centric. On Fri, Mar 14, 2025 at 03:43:15PM -0400, Andres Freund wrote: > Attached is v2.8 with the following changes: > - In parallel: Find a way to deal with the set_max_safe_fds() issue that we've > been discussing on this thread recently. As that only affects io_uring, it > doesn't have to block other patches going in. As above, I like the "redefine" option. > - Right now effective_io_concurrency cannot be set > 0 on Windows and other > platforms that lack posix_fadvise. But with AIO we can read ahead without > posix_fadvise(). > > It'd not really make anything worse than today to not remove the limit, but > it'd be pretty weird to prevent windows etc from benefiting from AIO. Need > to look around and see whether it would require anything other than doc > changes. Worth changing, but non-blocking. On Fri, Mar 14, 2025 at 03:58:43PM -0400, Andres Freund wrote: > - Should the docs for debug_io_direct be rephrased and if so, how? > Perhaps it's worth going from > > <para> > Currently this feature reduces performance, and is intended for > developer testing only. > </para> > to > <para> > Currently this feature reduces performance in many workloads, and is > intended for testing only. > </para> > > I.e. qualify the downside with "many workloads" and widen the audience ever so > slightly? Yes, that's good. 
Other than the smgr patch review sent on its own thread, I've not yet reviewed any of these patches comprehensively. Given the speed of change, I felt it was time to flush comments buffered since 2025-03-11: commit 0284401 wrote: > aio: Basic subsystem initialization > @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len) > */ > LWLockReleaseAll(); > pgstat_report_wait_end(); > + pgaio_error_cleanup(); AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio(). Is that proper? They do call pgaio_error_cleanup() as seen here, so the only loss is some asserts. (The load-bearing part does get done.) commit da72269 wrote: > aio: Add core asynchronous I/O infrastructure > + * This could be in aio_internal.h, as it is not pubicly referenced, but typo -> publicly commit 55b454d wrote: > aio: Infrastructure for io_method=worker > + /* Try to launch one. */ > + child = StartChildProcess(B_IO_WORKER); > + if (child != NULL) > + { > + io_worker_children[id] = child; > + ++io_worker_count; > + } > + else > + break; /* XXX try again soon? */ I'd change the comment to something like one of: retry after DetermineSleepTime() next LaunchMissingBackgroundProcesses() will retry in <60s On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote: > - Decide what to do about the smgr interrupt issue Replied on that thread. It's essentially ready. > Questions / Unresolved: > > - Write support isn't going to land in 18, but there is a tiny bit of code > regarding writes in the code for bufmgr IO. I guess I could move that to a > later commit? > > I'm inclined to leave it, the structure of the code only really makes sense > knowing that it's going to be shared between reads & writes. Fine to leave it. > - pg_aios view name Covered above. > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support Some comments about BM_IO_IN_PROGRESS may need updates. This paragraph: * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a buffer to complete (and in releases before 14, it was accompanied by a per-buffer LWLock). The process doing a read or write sets the flag for the duration, and processes that need to wait for it to be cleared sleep on a condition variable. And these individual lines from "git grep BM_IO_IN_PROGRESS": * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the * only one process at a time can set the BM_IO_IN_PROGRESS bit. * only one process at a time can set the BM_IO_IN_PROGRESS bit. * i.e at most one BM_IO_IN_PROGRESS bit is set per proc. The last especially. For the other three lines and the paragraph, the notion of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or being the process "doing a read" becomes less significant when one process starts the IO and another completes it. > + /* we better have ensured the buffer is present until now */ > + Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1); I'd delete that comment; to me, the assertion alone is clearer. > + ereport(LOG, > + (errcode(ERRCODE_DATA_CORRUPTED), > + errmsg("invalid page in block %u of relation %s; zeroing out page", This is changing level s/WARNING/LOG/. That seems orthogonal to the patch's goals; is it needed? If so, I recommend splitting it out as a preliminary patch, to highlight the behavior change for release notes. > +/* > + * Perform completion handling of a single AIO read. This read may cover > + * multiple blocks / buffers.
> + * > + * Shared between shared and local buffers, to reduce code duplication. > + */ > +static pg_attribute_always_inline PgAioResult > +buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result, > + uint8 cb_data, bool is_temp) > +{ > + PgAioResult result = prior_result; > + PgAioTargetData *td = pgaio_io_get_target_data(ioh); > + uint64 *io_data; > + uint8 handle_data_len; > + > + if (is_temp) > + { > + Assert(td->smgr.is_temp); > + Assert(pgaio_io_get_owner(ioh) == MyProcNumber); > + } > + else > + Assert(!td->smgr.is_temp); > + > + /* > + * Iterate over all the buffers affected by this IO and call appropriate > + * per-buffer completion function for each buffer. > + */ > + io_data = pgaio_io_get_handle_data(ioh, &handle_data_len); > + for (uint8 buf_off = 0; buf_off < handle_data_len; buf_off++) > + { > + Buffer buf = io_data[buf_off]; > + PgAioResult buf_result; > + bool failed; > + > + Assert(BufferIsValid(buf)); > + > + /* > + * If the entire failed on a lower-level, each buffer needs to be Missing word, probably fix like: s,entire failed on a lower-level,entire I/O failed on a lower level, > + * marked as failed. In case of a partial read, some buffers may be > + * ok. > + */ > + failed = > + prior_result.status == ARS_ERROR > + || prior_result.result <= buf_off; I didn't run an experiment to check the following, but I think this should be s/<=/</. Suppose we requested two blocks and read some amount of bytes [1*BLCKSZ, 2*BLCKSZ - 1]. md_readv_complete will store result=1. buf_off==0 should compute failed=false here, but buf_off==1 should compute failed=true. I see this relies on md_readv_complete having converted "result" to blocks. Was there some win from doing that as opposed to doing the division here? Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier to follow, to me. > + > + buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed, > + is_temp); > + > + /* > + * If there wasn't any prior error and the IO for this page failed in > + * some form, set the whole IO's to the page's result. s/the IO for this page/page verification/ s/IO's/IO's result/ > + */ > + if (result.status != ARS_ERROR && buf_result.status != ARS_OK) > + { > + result = buf_result; > + pgaio_result_report(result, td, LOG); > + } > + } > + > + return result; > +}
Hi, On 2025-03-19 14:25:30 -0700, Noah Misch wrote: > On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote: > > - Right now effective_io_concurrency cannot be set > 0 on Windows and other > > platforms that lack posix_fadvise. But with AIO we can read ahead without > > posix_fadvise(). > > > > It'd not really make anything worse than today to not remove the limit, but > > it'd be pretty weird to prevent windows etc from benefiting from AIO. Need > > to look around and see whether it would require anything other than doc > > changes. > > Worth changing, but non-blocking. Thankfully Melanie submitted a patch for that... > Other than the smgr patch review sent on its own thread, I've not yet reviewed > any of these patches comprehensively. Given the speed of change, I felt it > was time to flush comments buffered since 2025-03-11: Thanks! > commit 0284401 wrote: > > aio: Basic subsystem initialization > > > @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len) > > */ > > LWLockReleaseAll(); > > pgstat_report_wait_end(); > > + pgaio_error_cleanup(); > > AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and > WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio(). Is that > proper? They do call pgaio_error_cleanup() as seen here, so the only loss is > some asserts. (The load-bearing part does get done.) I don't think it's particularly good that we use the AtEOXact_* functions in the sigsetjmp blocks, that feels like a weird mixup of infrastructure to me. So this was intentional. > commit da72269 wrote: > > aio: Add core asynchronous I/O infrastructure > > > + * This could be in aio_internal.h, as it is not pubicly referenced, but > > typo -> publicly /me has a red face. > commit 55b454d wrote: > > aio: Infrastructure for io_method=worker > > > + /* Try to launch one. */ > > + child = StartChildProcess(B_IO_WORKER); > > + if (child != NULL) > > + { > > + io_worker_children[id] = child; > > + ++io_worker_count; > > + } > > + else > > + break; /* XXX try again soon? */ > > I'd change the comment to something like one of: > > retry after DetermineSleepTime() > next LaunchMissingBackgroundProcesses() will retry in <60s Hm, we retry more frequently than that if there are new connections... Maybe just "try again next time"? > On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote: > > - Decide what to do about the smgr interrupt issue > > Replied on that thread. It's essentially ready. Cool, will reply there in a bit. > > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support > > Some comments about BM_IO_IN_PROGRESS may need updates. This paragraph: > > * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a > buffer to complete (and in releases before 14, it was accompanied by a > per-buffer LWLock). The process doing a read or write sets the flag for the > duration, and processes that need to wait for it to be cleared sleep on a > condition variable. First draft: * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a buffer to complete (and in releases before 14, it was accompanied by a per-buffer LWLock). The process start a read or write sets the flag. When the I/O is completed, be it by the process that initiated the I/O or by another process, the flag is removed and the Buffer's condition variable is signalled.
Processes that need to wait for the I/O to complete can wait for asynchronous I/O to using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be unset by sleeping on the buffer's condition variable. > And these individual lines from "git grep BM_IO_IN_PROGRESS": > > * i.e at most one BM_IO_IN_PROGRESS bit is set per proc. > > The last especially. Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's 12f3867f5534. I think the comment can just be deleted? > * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > For the other three lines and the paragraph, the notion > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or > being the process "doing a read" becomes less significant when one process > starts the IO and another completes it. Hm. I think they'd be ok as-is, but we can probably improve them. Maybe * Now it's safe to write buffer to disk. Note that no one else should * have been able to write it while we were busy with log flushing because * we got the exclusive right to perform I/O by setting the * BM_IO_IN_PROGRESS bit. > > + /* we better have ensured the buffer is present until now */ > > + Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1); > > I'd delete that comment; to me, the assertion alone is clearer. Ok. > > + ereport(LOG, > > + (errcode(ERRCODE_DATA_CORRUPTED), > > + errmsg("invalid page in block %u of relation %s; zeroing out page", > > This is changing level s/WARNING/LOG/. That seems orthogonal to the patch's > goals; is it needed? If so, I recommend splitting it out as a preliminary > patch, to highlight the behavior change for release notes. No, it's not needed. I think I looked over the patch at some point and considered the log-level wrong according to our guidelines and thought I'd broken it. > > + /* > > + * If the entire failed on a lower-level, each buffer needs to be > > Missing word, probably fix like: > s,entire failed on a lower-level,entire I/O failed on a lower level, Yep. > > + * marked as failed. In case of a partial read, some buffers may be > > + * ok. > > + */ > > + failed = > > + prior_result.status == ARS_ERROR > > + || prior_result.result <= buf_off; > > I didn't run an experiment to check the following, but I think this should be > s/<=/</. Suppose we requested two blocks and read some amount of bytes > [1*BLCKSZ, 2*BLSCKSZ - 1]. md_readv_complete will store result=1. buf_off==0 > should compute failed=false here, but buf_off==1 should compute failed=true. Huh, you might be right. I thought I wrote a test for this, I wonder why it didn't catch the problem... > I see this relies on md_readv_complete having converted "result" to blocks. > Was there some win from doing that as opposed to doing the division here? > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier > to follow, to me. It seemed like that would be wrong layering - what if we had an smgr that could store data in a compressed format? The raw read would be of a smaller size. The smgr API deals in BlockNumbers, only the md.c layer should know about bytes. > > + > > + buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed, > > + is_temp); > > + > > + /* > > + * If there wasn't any prior error and the IO for this page failed in > > + * some form, set the whole IO's to the page's result. > > s/the IO for this page/page verification/ > s/IO's/IO's result/ Agreed. 
Thanks for the review! Greetings, Andres Freund
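As a toy model of the flag-plus-condition-variable protocol sketched in the draft comment above — one participant completes the IO and clears the flag, everyone else sleeps until then. This uses plain pthreads (build with -pthread) rather than PostgreSQL's ConditionVariable API, so it is an analogy under that assumption, not the bufmgr code:

```
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool io_in_progress = true;  /* stands in for BM_IO_IN_PROGRESS */

static void *
completer(void *arg)
{
    usleep(10 * 1000);              /* pretend the IO takes a moment */
    pthread_mutex_lock(&lock);
    io_in_progress = false;         /* completion may happen elsewhere */
    pthread_cond_broadcast(&cv);    /* "signal the condition variable" */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int
main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, completer, NULL);
    pthread_mutex_lock(&lock);
    while (io_in_progress)          /* recheck the flag, not just the wakeup */
        pthread_cond_wait(&cv, &lock);
    pthread_mutex_unlock(&lock);
    puts("IO complete");
    pthread_join(t, NULL);
    return 0;
}
```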
On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote: > On 2025-03-19 14:25:30 -0700, Noah Misch wrote: > > commit 55b454d wrote: > > > aio: Infrastructure for io_method=worker > > > > > + /* Try to launch one. */ > > > + child = StartChildProcess(B_IO_WORKER); > > > + if (child != NULL) > > > + { > > > + io_worker_children[id] = child; > > > + ++io_worker_count; > > > + } > > > + else > > > + break; /* XXX try again soon? */ > > > > I'd change the comment to something like one of: > > > > retry after DetermineSleepTime() > > next LaunchMissingBackgroundProcesses() will retry in <60s > > Hm, we retry more frequently than that if there are new connections... Maybe > just "try again next time"? Works for me. > > On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote: > > > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support > > > > Some comments about BM_IO_IN_PROGRESS may need updates. This paragraph: > > > > * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a > > buffer to complete (and in releases before 14, it was accompanied by a > > per-buffer LWLock). The process doing a read or write sets the flag for the > > duration, and processes that need to wait for it to be cleared sleep on a > > condition variable. > > First draft: > * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a > buffer to complete (and in releases before 14, it was accompanied by a > per-buffer LWLock). The process start a read or write sets the flag. When the s/start/starting/ > I/O is completed, be it by the process that initiated the I/O or by another > process, the flag is removed and the Buffer's condition variable is signalled. > Processes that need to wait for the I/O to complete can wait for asynchronous > I/O to using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be unset by s/to using/by using/ > sleeping on the buffer's condition variable. Sounds good. > > And these individual lines from "git grep BM_IO_IN_PROGRESS": > > > > * i.e at most one BM_IO_IN_PROGRESS bit is set per proc. > > > > The last especially. > > Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's > 12f3867f5534. I think the comment can just be deleted? Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine. > > * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the > > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > > > For the other three lines and the paragraph, the notion > > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or > > being the process "doing a read" becomes less significant when one process > > starts the IO and another completes it. > > Hm. I think they'd be ok as-is, but we can probably improve them. Maybe Looking again, I agree they're okay. > > > > * Now it's safe to write buffer to disk. Note that no one else should > * have been able to write it while we were busy with log flushing because > * we got the exclusive right to perform I/O by setting the > * BM_IO_IN_PROGRESS bit. That's fine too. Maybe s/perform/stage/ or s/perform/start/. > > I see this relies on md_readv_complete having converted "result" to blocks. > > Was there some win from doing that as opposed to doing the division here? > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier > > to follow, to me.
> > It seemed like that would be wrong layering - what if we had an smgr that > could store data in a compressed format? The raw read would be of a smaller > size. The smgr API deals in BlockNumbers, only the md.c layer should know > about bytes. I hadn't thought of that. That's a good reason.
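To make the layering argument concrete, here is a standalone sketch; BLCKSZ is hardcoded as an assumption for the example, and the function name is invented. The point is only that the byte-to-block conversion lives at the md level, so a hypothetical compressed smgr could translate its own on-disk sizes without callers above smgr ever seeing bytes:

```
#include <stdio.h>

#define BLCKSZ 8192                 /* assumed block size for the sketch */

/* md layer: the only place that knows a block occupies BLCKSZ raw bytes */
static int
md_result_in_blocks(long raw_bytes)
{
    return (int) (raw_bytes / BLCKSZ);
}

int
main(void)
{
    /* a two-block request that came back one byte short of complete */
    long raw = 2L * BLCKSZ - 1;

    /* completion callbacks above smgr only ever see "1 block read" */
    printf("blocks successfully read: %d\n", md_result_in_blocks(raw));
    return 0;
}
```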
On Tue, Mar 18, 2025 at 9:12 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Attached is v2.10, with the following changes: > > - committed core AIO infrastructure patch Hi, yay, It's happening.jpg ;) Some thoughts about 2.10-0004: What do you think about adding, to the io_uring patch, a note about the need to ensure that the kernel.io_uring_disabled sysctl permits io_uring? (Some distros might disable it.) E.g. in doc/src/sgml/config.sgml, after the io_method <listitems>... there could be --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml <literal>io_uring</literal> (execute asynchronous I/O using io_uring, if available) [..] and then add something like: + "At present io_method=io_uring is supported only on Linux and requires Linux's sysctl kernel.io_uring_disabled (if present) to be at value 0 (enabled) or 1 (with kernel.io_uring_group set to PostgreSQL's GID)." Rationale: it seems that at least RHEL 9.x will have this knob present (but e.g. RHEL 8.10 doesn't, even with kernel-ml 6.4.2, as this seems to come with 6.6+; I also saw somewhere that somebody had issues with this on a probably-backported kernel in Rocky 9.x). Also, with further googling, I have found that mysql can throw - when executed from podman/docker: "mysqld: io_uring_queue_init() failed with ENOSYS: check seccomp filters, and the kernel version (newer than 5.1 required)" and this leaves two probable follow-up questions when adjusting this sentence: a. shouldn't we add some sentence about containers/namespaces/seccomp allowing this? b. and/or shouldn't we reference in the docs a minimum kernel version? (This is somewhat wild: liburing could be installed and compiled against, but the runtime kernel could be < 5.1.) -J.
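A small standalone probe makes the sysctl question above concrete. It only reads /proc and is not part of the patchset; the 0/1/2 meanings follow the kernel documentation for the kernels (6.6+) that have the knob, as already summarized in the suggested doc sentence:

```
#include <stdio.h>

int
main(void)
{
    FILE *f = fopen("/proc/sys/kernel/io_uring_disabled", "r");
    int val;

    if (f == NULL)
    {
        puts("sysctl absent (kernel < 6.6): io_uring not restricted by it");
        return 0;
    }
    if (fscanf(f, "%d", &val) == 1)
        printf("kernel.io_uring_disabled = %d (0=enabled, 1=restricted to "
               "kernel.io_uring_group, 2=disabled)\n", val);
    fclose(f);
    return 0;
}
```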
Hi, On 2025-03-19 18:17:37 -0400, Andres Freund wrote: > On 2025-03-19 14:25:30 -0700, Noah Misch wrote: > > > + * marked as failed. In case of a partial read, some buffers may be > > > + * ok. > > > + */ > > > + failed = > > > + prior_result.status == ARS_ERROR > > > + || prior_result.result <= buf_off; > > > > I didn't run an experiment to check the following, but I think this should be > > s/<=/</. Suppose we requested two blocks and read some amount of bytes > > [1*BLCKSZ, 2*BLCKSZ - 1]. md_readv_complete will store result=1. buf_off==0 > > should compute failed=false here, but buf_off==1 should compute failed=true. > > Huh, you might be right. I thought I wrote a test for this, I wonder why it > didn't catch the problem... It was correct as-is. With result=1 you get precisely the result you describe as the desired outcome, no? prior_result.result <= buf_off -> 1 <= 0 -> failed = 0 1 <= 1 -> failed = 1 but if it were < as you suggest: prior_result.result < buf_off -> 1 < 0 -> failed = 0 1 < 1 -> failed = 0 I.e. we would assume that the second buffer also completed. What does concern me is that the existing tests do *not* catch the problem if I turn "<=" into "<". The second buffer in this case wrongly gets marked as valid. We do retry the read (because bufmgr.c thinks only one block was read), but find the buffer to already be valid. The reason the test doesn't fail is the way I set up the "short read" tests. The injection point runs after the IO completed and just modifies the result. However, the actual buffer contents still got modified. The easiest way around that seems to be to have the injection point actually zero out the remaining memory. Not pretty, but it'd be harder to just submit shortened IOs in multiple IO methods. It'd be even better if we could trivially use something like randomize_mem(), but it's only conditionally compiled... Greetings, Andres Freund
On Thu, Mar 20, 2025 at 01:05:05PM -0400, Andres Freund wrote: > On 2025-03-19 18:17:37 -0400, Andres Freund wrote: > > On 2025-03-19 14:25:30 -0700, Noah Misch wrote: > > > > + * marked as failed. In case of a partial read, some buffers may be > > > > + * ok. > > > > + */ > > > > + failed = > > > > + prior_result.status == ARS_ERROR > > > > + || prior_result.result <= buf_off; > > > > > > I didn't run an experiment to check the following, but I think this should be > > > s/<=/</. Suppose we requested two blocks and read some amount of bytes > > > [1*BLCKSZ, 2*BLCKSZ - 1]. md_readv_complete will store result=1. buf_off==0 > > > should compute failed=false here, but buf_off==1 should compute failed=true. > > > > Huh, you might be right. I thought I wrote a test for this, I wonder why it > > didn't catch the problem... > > It was correct as-is. With result=1 you get precisely the result you describe > as the desired outcome, no? > prior_result.result <= buf_off > -> > 1 <= 0 -> failed = 0 > 1 <= 1 -> failed = 1 > > but if it were < as you suggest: > > prior_result.result < buf_off > -> > 1 < 0 -> failed = 0 > 1 < 1 -> failed = 0 > > I.e. we would assume that the second buffer also completed. That's right. I see it now. My mistake. > What does concern me is that the existing tests do *not* catch the problem if > I turn "<=" into "<". The second buffer in this case wrongly gets marked as > valid. We do retry the read (because bufmgr.c thinks only one block was read), > but find the buffer to already be valid. > > The reason the test doesn't fail is the way I set up the "short read" > tests. The injection point runs after the IO completed and just modifies the > result. However, the actual buffer contents still got modified. > > > The easiest way around that seems to be to have the injection point actually > zero out the remaining memory. Sounds reasonable and sufficient. FYI, I've resumed the comprehensive review. That's still ongoing.
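The settled semantics are easy to check in isolation: with "result" counted in successfully read blocks, the buffer at 0-based offset buf_off failed exactly when result <= buf_off. A standalone rendering of the arithmetic above (function name invented; this is not the bufmgr code):

```
#include <stdbool.h>
#include <stdio.h>

static bool
buffer_failed(int blocks_read, int buf_off)
{
    /* the buffer succeeded only if buf_off < blocks_read */
    return blocks_read <= buf_off;
}

int
main(void)
{
    /* two-block read where only the first block arrived: result = 1 */
    printf("buf 0 failed: %d\n", buffer_failed(1, 0));  /* 0: first block ok */
    printf("buf 1 failed: %d\n", buffer_failed(1, 1));  /* 1: must be retried */
    return 0;
}
```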
Hi, On 2025-03-19 18:11:18 -0700, Noah Misch wrote: > On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote: > > On 2025-03-19 14:25:30 -0700, Noah Misch wrote: > > Hm, we retry more frequently than that if there are new connections... Maybe > > just "try again next time"? > > Works for me. > > > > And these individual lines from "git grep BM_IO_IN_PROGRESS": > > > > > > * i.e at most one BM_IO_IN_PROGRESS bit is set per proc. > > > > > > The last especially. > > > > Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's > > 12f3867f5534. I think the comment can just be deleted? > > Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine. > > > > * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the > > > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > > > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > > > > > For the other three lines and the paragraph, the notion > > > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or > > > being the process "doing a read" becomes less significant when one process > > > starts the IO and another completes it. > > > > Hm. I think they'd be ok as-is, but we can probably improve them. Maybe > > Looking again, I agree they're okay. > > > > > * Now it's safe to write buffer to disk. Note that no one else should > > * have been able to write it while we were busy with log flushing because > > * we got the exclusive right to perform I/O by setting the > > * BM_IO_IN_PROGRESS bit. > > That's fine too. Maybe s/perform/stage/ or s/perform/start/. I put these comment changes into their own patch, as it seemed confusing to change them as part of one of the already queued commits. > > > I see this relies on md_readv_complete having converted "result" to blocks. > > > Was there some win from doing that as opposed to doing the division here? > > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier > > > to follow, to me. > > > > It seemed like that would be wrong layering - what if we had an smgr that > > could store data in a compressed format? The raw read would be of a smaller > > size. The smgr API deals in BlockNumbers, only the md.c layer should know > > about bytes. > > I hadn't thought of that. That's a good reason. I thought that was better documented, but alas, it wasn't. How about updating the documentation of smgrstartreadv to the following: /* * smgrstartreadv() -- asynchronous version of smgrreadv() * * This starts an asynchronous readv IO using the IO handle `ioh`. Other than * `ioh` all parameters are the same as smgrreadv(). * * Completion callbacks above smgr will be passed the result as the number of * successfully read blocks if the read [partially] succeeds. This maintains * the abstraction that smgr operates on the level of blocks, rather than * bytes. */ I briefly had a bug in test_aio's injection point that led to *increasing* the number of bytes successfully read. That triggered an assertion failure in bufmgr.c, but not closer to the problem. Is it worth adding an assert against that to md_readv_complete? Can't quite decide. Greetings, Andres Freund
On Thu, Mar 20, 2025 at 02:54:14PM -0400, Andres Freund wrote: > On 2025-03-19 18:11:18 -0700, Noah Misch wrote: > > On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote: > > > On 2025-03-19 14:25:30 -0700, Noah Misch wrote: > > > > I see this relies on md_readv_complete having converted "result" to blocks. > > > > Was there some win from doing that as opposed to doing the division here? > > > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier > > > > to follow, to me. > > > > > > It seemed like that would be wrong layering - what if we had an smgr that > > > could store data in a compressed format? The raw read would be of a smaller > > > size. The smgr API deals in BlockNumbers, only the md.c layer should know > > > about bytes. > > > > I hadn't thought of that. That's a good reason. > > I thought that was better documented, but alas, it wasn't. How about updating > the documentation of smgrstartreadv to the following: > > /* > * smgrstartreadv() -- asynchronous version of smgrreadv() > * > * This starts an asynchronous readv IO using the IO handle `ioh`. Other than > * `ioh` all parameters are the same as smgrreadv(). > * > * Completion callbacks above smgr will be passed the result as the number of > * successfully read blocks if the read [partially] succeeds. This maintains > * the abstraction that smgr operates on the level of blocks, rather than > * bytes. > */ That's good. Possibly add "(Buffers for blocks not successfully read might bear unspecified modifications, up to the full nblocks.)" In a bit of over-thinking this, I wondered if shared_buffer_readv_complete would be better named shared_buffer_smgrreadv_complete, to emphasize the smgrreadv semantics. PGAIO_HCB_SHARED_BUFFER_READV likewise. But I tend to think not. smgrreadv() has no "result" concept, so the symmetry is limited. > I briefly had a bug in test_aio's injection point that led to *increasing* > the number of bytes successfully read. That triggered an assertion failure in > bufmgr.c, but not closer to the problem. Is it worth adding an assert against > that to md_readv_complete? Can't quite decide. I'd lean yes, if in doubt.
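For the assert question at the end, here is a standalone mock-up of the kind of guard being considered; the function name and the hardcoded BLCKSZ are stand-ins, and the real check would live in md_readv_complete rather than in a toy like this:

```
#include <assert.h>
#include <stdio.h>

#define BLCKSZ 8192                 /* assumed block size for the sketch */

static int
mock_readv_complete(long raw_result, int nblocks_requested)
{
    /* would catch an injection point (or kernel) inflating the result */
    assert(raw_result <= (long) nblocks_requested * BLCKSZ);
    return (int) (raw_result / BLCKSZ);
}

int
main(void)
{
    printf("blocks read: %d\n", mock_readv_complete(BLCKSZ, 2));
    return 0;
}
```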
Hi, Attached v2.11, with the following changes: - Pushed the smgr interrupt change, as discussed on the dedicated thread - Pushed "bufmgr: Improve stats when a buffer is read in concurrently" It was reviewed by Melanie and there didn't seem to be any reason to wait further. - Addressed feedback from Melanie - Addressed feedback from Noah - Added a new commit: aio: Change prefix of PgAioResultStatus values to PGAIO_RS_ As suggested/requested by Melanie. I think she's unfortunately right. - Added a patch for some comment fixups for code that's either older or already pushed - Added an error check for FileStartReadV() failing FileStartReadV() actually can fail, if the file can't be re-opened. I thought it'd be important for the error message to differ from the one that's issued for read actually failing, so I went with: "could not start reading blocks %u..%u in file \"%s\": %m" but I'm not sure how good that is. - Added a new commit to redefine set_max_safe_fds() to not subtract already_open fds from max_files_per_process This prevents io_method=io_uring from failing when RLIMIT_NOFILE is high enough, but more than max_files_per_process io_uring instances need to be created. - Improved error message if io_uring_queue_init() fails Added errhint()s for likely cases of failure. Added errcode(). I was tempted to use errcode_for_file_access(), but that doesn't support ENOSYS - perhaps I should add that instead? - Disable io_uring method when using EXEC_BACKEND, they're not compatible I chose to do this with a define aio.h, but I guess we could also do it at configure time? That seems more complicated though - how would we even know that EXEC_BACKEND is used on non-windows? Not sure yet how to best disable testing io_uring in this case. We can't just query EXEC_BACKEND from pg_config.h unfortunately. I guess making the initdb not fail and checking the error log would work, but that doesn't work nicely with Cluster.pm. - Changed test_aio's short-read injection point to zero out the rest of the IO, otherwise some tests fail to fail even if a bug in retries of partial reads is introduced - Improved method_io_uring.c includes a bit (no pgstat.h) Questions: - We only "look" at BM_IO_ERROR for writes, isn't that somewhat weird? See AbortBufferIO(Buffer buffer) It doesn't really matter for the patchset, but it just strikes me as an oddity. Greetings, Andres Freund
Attachment
- v2.11-0001-aio-bufmgr-Comment-fixes.patch
- v2.11-0002-aio-Change-prefix-of-PgAioResultStatus-values-.patch
- v2.11-0003-Redefine-max_files_per_process-to-control-addi.patch
- v2.11-0004-aio-Add-liburing-dependency.patch
- v2.11-0005-aio-Add-io_method-io_uring.patch
- v2.11-0006-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.11-0007-aio-Add-README.md-explaining-higher-level-desi.patch
- v2.11-0008-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.11-0009-bufmgr-Implement-AIO-read-support.patch
- v2.11-0010-Support-buffer-forwarding-in-read_stream.c.patch
- v2.11-0011-Support-buffer-forwarding-in-StartReadBuffers.patch
- v2.11-0012-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.11-0013-aio-Basic-read_stream-adjustments-for-real-AIO.patch
- v2.11-0014-docs-Reframe-track_io_timing-related-docs-as-w.patch
- v2.11-0015-Enable-IO-concurrency-on-all-systems.patch
- v2.11-0016-aio-Add-pg_aios-view.patch
- v2.11-0017-aio-Add-test_aio-module.patch
- v2.11-0018-aio-Experimental-heuristics-to-increase-batchi.patch
- v2.11-0019-aio-Implement-smgr-md-fd-write-support.patch
- v2.11-0020-aio-Add-bounce-buffers.patch
- v2.11-0021-bufmgr-Implement-AIO-write-support.patch
- v2.11-0022-aio-Add-IO-queue-helper.patch
- v2.11-0023-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.11-0024-Ensure-a-resowner-exists-for-all-paths-that-ma.patch
- v2.11-0025-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.11-0026-WIP-Use-MAP_POPULATE.patch
- v2.11-0027-StartReadBuffers-debug-stuff.patch
On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > Attached v2.11, with the following changes: > - Added an error check for FileStartReadV() failing > > FileStartReadV() actually can fail, if the file can't be re-opened. I > thought it'd be important for the error message to differ from the one > that's issued for read actually failing, so I went with: > > "could not start reading blocks %u..%u in file \"%s\": %m" > > but I'm not sure how good that is. Message looks good. > - Improved error message if io_uring_queue_init() fails > > Added errhint()s for likely cases of failure. > > Added errcode(). I was tempted to use errcode_for_file_access(), but that > doesn't support ENOSYS - perhaps I should add that instead? Either way is fine with me. ENOSYS -> ERRCODE_FEATURE_NOT_SUPPORTED is a good general mapping to have in errcode_for_file_access(), but it's also not a problem to keep it the way v2.11 has it. > - Disable io_uring method when using EXEC_BACKEND, they're not compatible > > I chose to do this with a define aio.h, but I guess we could also do it at > configure time? That seems more complicated though - how would we even know > that EXEC_BACKEND is used on non-windows? Agreed, "make PROFILE=-DEXEC_BACKEND" is a valid way to get EXEC_BACKEND. > Not sure yet how to best disable testing io_uring in this case. We can't > just query EXEC_BACKEND from pg_config.h unfortunately. I guess making the > initdb not fail and checking the error log would work, but that doesn't work > nicely with Cluster.pm. How about "postgres -c io_method=io_uring -C <anything>": --- a/src/test/modules/test_aio/t/001_aio.pl +++ b/src/test/modules/test_aio/t/001_aio.pl @@ -29,7 +29,13 @@ $node_worker->stop(); # Test io_method=io_uring ### -if ($ENV{with_liburing} eq 'yes') +sub have_io_uring +{ + local %ENV = $node_worker->_get_env(); # any node works + return run_log [qw(postgres -c io_method=io_uring -C io_method)]; +} + +if (have_io_uring()) { my $node_uring = create_node('io_uring'); $node_uring->start(); > Questions: > > > - We only "look" at BM_IO_ERROR for writes, isn't that somewhat weird? > > See AbortBufferIO(Buffer buffer) > > It doesn't really matter for the patchset, but it just strikes me as an oddity. That caught my attention in an earlier review round, but I didn't find it important enough to raise. It's mildly unfortunate to be setting BM_IO_ERROR for reads when the only thing BM_IO_ERROR drives is message "Multiple failures --- write error might be permanent." It's minor, so let's leave it that way for the foreseeable future. > Subject: [PATCH v2.11 01/27] aio, bufmgr: Comment fixes Ready to commit, though other comment fixes might come up in later reviews. One idea so far is to comment on valid states after some IoMethodOps callbacks: --- a/src/include/storage/aio_internal.h +++ b/src/include/storage/aio_internal.h @@ -310,6 +310,9 @@ typedef struct IoMethodOps /* * Start executing passed in IOs. * + * Shall advance state to PGAIO_HS_SUBMITTED. (By the time this returns, + * other backends might have advanced the state further.) + * * Will not be called if ->needs_synchronous_execution() returned true. * * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE. @@ -321,6 +324,12 @@ typedef struct IoMethodOps /* * Wait for the IO to complete. Optional. * + * On return, state shall be PGAIO_HS_COMPLETED_IO, + * PGAIO_HS_COMPLETED_SHARED or PGAIO_HS_COMPLETED_LOCAL. (The callback + * need not change the state if it's already one of those.) 
If state is + * PGAIO_HS_COMPLETED_IO, state will reach PGAIO_HS_COMPLETED_SHARED + * without further intervention. + * * If not provided, it needs to be guaranteed that the IO method calls * pgaio_io_process_completion() without further interaction by the * issuing backend. > Subject: [PATCH v2.11 02/27] aio: Change prefix of PgAioResultStatus values to > PGAIO_RS_ Ready to commit > Subject: [PATCH v2.11 03/27] Redefine max_files_per_process to control > additionally opened files Ready to commit > Subject: [PATCH v2.11 04/27] aio: Add liburing dependency > --- a/meson.build > +++ b/meson.build > @@ -944,6 +944,18 @@ endif > > > > +############################################################### > +# Library: liburing > +############################################################### > + > +liburingopt = get_option('liburing') > +liburing = dependency('liburing', required: liburingopt) > +if liburing.found() > + cdata.set('USE_LIBURING', 1) > +endif This is a different style from other deps; is it equivalent to our standard style? Example for lz4: lz4opt = get_option('lz4') if not lz4opt.disabled() lz4 = dependency('liblz4', required: false) # Unfortunately the dependency is named differently with cmake if not lz4.found() # combine with above once meson 0.60.0 is required lz4 = dependency('lz4', required: lz4opt, method: 'cmake', modules: ['LZ4::lz4_shared'], ) endif if lz4.found() cdata.set('USE_LZ4', 1) cdata.set('HAVE_LIBLZ4', 1) endif else lz4 = not_found_dep endif > --- a/configure.ac > +++ b/configure.ac > @@ -975,6 +975,14 @@ AC_SUBST(with_readline) > PGAC_ARG_BOOL(with, libedit-preferred, no, > [prefer BSD Libedit over GNU Readline]) > > +# > +# liburing > +# > +AC_MSG_CHECKING([whether to build with liburing support]) > +PGAC_ARG_BOOL(with, liburing, no, [io_uring support, for asynchronous I/O], Fourth arg generally starts with "build" for args like this. I suggest "build with io_uring support, for asynchronous I/O". Comparable options: --with-llvm build with LLVM based JIT support --with-tcl build Tcl modules (PL/Tcl) --with-perl build Perl modules (PL/Perl) --with-python build Python modules (PL/Python) --with-gssapi build with GSSAPI support --with-pam build with PAM support --with-bsd-auth build with BSD Authentication support --with-ldap build with LDAP support --with-bonjour build with Bonjour support --with-selinux build with SELinux support --with-systemd build with systemd support --with-libcurl build with libcurl support --with-libxml build with XML support --with-libxslt use XSLT support when building contrib/xml2 --with-lz4 build with LZ4 support --with-zstd build with ZSTD support > + [AC_DEFINE([USE_LIBURING], 1, [Define to build with io_uring support. (--with-liburing)])]) > +AC_MSG_RESULT([$with_liburing]) > +AC_SUBST(with_liburing) > > # > # UUID library > @@ -1463,6 +1471,9 @@ elif test "$with_uuid" = ossp ; then > fi > AC_SUBST(UUID_LIBS) > > +if test "$with_liburing" = yes; then > + PKG_CHECK_MODULES(LIBURING, liburing) > +fi We usually put this right after the AC_MSG_CHECKING ... AC_SUBST block. This currently has unrelated stuff separating them. Also, with the exception of icu, we follow PKG_CHECK_MODULES uses by absorbing flags from pkg-config and use AC_CHECK_LIB to add the actual "-l". By not absorbing flags, I think a liburing in a nonstandard location would require --with-libraries and --with-includes, unlike the other PKG_CHECK_MODULES-based dependencies. 
lz4 is a representative example of our standard: ``` AC_MSG_CHECKING([whether to build with LZ4 support]) PGAC_ARG_BOOL(with, lz4, no, [build with LZ4 support], [AC_DEFINE([USE_LZ4], 1, [Define to 1 to build with LZ4 support. (--with-lz4)])]) AC_MSG_RESULT([$with_lz4]) AC_SUBST(with_lz4) if test "$with_lz4" = yes; then PKG_CHECK_MODULES(LZ4, liblz4) # We only care about -I, -D, and -L switches; # note that -llz4 will be added by AC_CHECK_LIB below. for pgac_option in $LZ4_CFLAGS; do case $pgac_option in -I*|-D*) CPPFLAGS="$CPPFLAGS $pgac_option";; esac done for pgac_option in $LZ4_LIBS; do case $pgac_option in -L*) LDFLAGS="$LDFLAGS $pgac_option";; esac done fi # ... later in file ... if test "$with_lz4" = yes ; then AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])]) fi ``` I think it's okay to not use the AC_CHECK_LIB and rely on explicit src/backend/Makefile code like you've done, but we shouldn't miss CPPFLAGS/LDFLAGS (or should have a comment on why missing them is right). > --- a/doc/src/sgml/installation.sgml > +++ b/doc/src/sgml/installation.sgml lz4 and other deps have a mention in <sect1 id="install-requirements">, in addition to sections edited here. > Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring (Still reviewing this one.)
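To see the proposed submit contract in action, here is a compilable toy that follows the quoted signature int (*submit)(uint16, PgAioHandle **); everything else (the stub types and the state constants) is invented for the sketch and does not match aio_internal.h:

```
#include <stdint.h>
#include <stdio.h>

typedef uint16_t uint16;

/* stand-ins for the real handle and its state field */
typedef enum { TOY_HS_STAGED, TOY_HS_SUBMITTED } ToyHandleState;
typedef struct PgAioHandle { ToyHandleState state; } PgAioHandle;

static int
toy_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
{
    for (uint16 i = 0; i < num_staged_ios; i++)
    {
        /* hand the IO to the kernel or a worker here, then ... */
        staged_ios[i]->state = TOY_HS_SUBMITTED; /* shall reach SUBMITTED */
    }
    return num_staged_ios;
}

int
main(void)
{
    PgAioHandle h = {TOY_HS_STAGED};
    PgAioHandle *ios[] = {&h};

    printf("submitted: %d\n", toy_submit(1, ios));
    return 0;
}
```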
On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > Attached v2.11 > Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring Apart from some isolated cosmetic points, this is ready to commit: > + ereport(ERROR, > + errcode(err), > + errmsg("io_uring_queue_init failed: %m"), > + hint != NULL ? errhint("%s", hint) : 0); https://www.postgresql.org/docs/current/error-style-guide.html gives the example: BAD: open() failed: %m BETTER: could not open file %s: %m Hence, this errmsg should change, perhaps to: "could not setup io_uring queues: %m". > + pgaio_debug_io(DEBUG3, ioh, > + "wait_one io_gen: %llu, ref_gen: %llu, cycle %d", > + (long long unsigned) ref_generation, > + (long long unsigned) ioh->generation, In the message string, io_gen appears before ref_gen. In the subsequent args, the order is swapped relative to the message string. > --- a/src/backend/utils/activity/wait_event_names.txt > +++ b/src/backend/utils/activity/wait_event_names.txt > @@ -192,6 +192,8 @@ ABI_compatibility: > > Section: ClassName - WaitEventIO > > +AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring." > +AIO_IO_URING_COMPLETION "Waiting for IO completion via io_uring." > AIO_IO_COMPLETION "Waiting for IO completion." I'm wondering if there's an opportunity to enrich the last two wait event names and/or descriptions. The current descriptions suggest to me more similarity than is actually there. Inputs to the decision: - AIO_IO_COMPLETION waits for an IO in PGAIO_HS_DEFINED, PGAIO_HS_STAGED, or PGAIO_HS_COMPLETED_IO to reach PGAIO_HS_COMPLETED_SHARED. The three starting states are the states where some other backend owns the next action, so the current backend can only wait to be signaled. - AIO_IO_URING_COMPLETION waits for the kernel to do enough so we can move from PGAIO_HS_SUBMITTED to PGAIO_HS_COMPLETED_IO. Possible names and descriptions, based on PgAioHandleState enum names and comments: AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring." AIO_COMPLETED_SHARED "Waiting for IO shared completion callback." If "shared completion callback" is too internals-focused, perhaps this: AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring." AIO_COMPLETED_SHARED "Waiting for IO completion to update shared memory." > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -2710,6 +2710,12 @@ include_dir 'conf.d' > <literal>worker</literal> (execute asynchronous I/O using worker processes) > </para> > </listitem> > + <listitem> > + <para> > + <literal>io_uring</literal> (execute asynchronous I/O using > + io_uring, if available) I feel the "if available" doesn't quite fit, since we'll fail if unavailable. Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux" there to reduce surprise on other platforms. > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd (Still reviewing this one.)
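The swapped-argument hazard in the pgaio_debug_io call is easy to reproduce standalone; the first printf mirrors the bug (values land under the wrong labels), the second the fix:

```
#include <stdio.h>

int
main(void)
{
    unsigned long long io_gen = 7, ref_gen = 42;

    /* buggy ordering: ref_gen is printed under the io_gen label */
    printf("wait_one io_gen: %llu, ref_gen: %llu\n", ref_gen, io_gen);
    /* corrected ordering matches the format string */
    printf("wait_one io_gen: %llu, ref_gen: %llu\n", io_gen, ref_gen);
    return 0;
}
```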
Hi, On 2025-03-22 17:20:56 -0700, Noah Misch wrote: > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > > Not sure yet how to best disable testing io_uring in this case. We can't > > just query EXEC_BACKEND from pg_config.h unfortunately. I guess making the > > initdb not fail and checking the error log would work, but that doesn't work > > nicely with Cluster.pm. > > How about "postgres -c io_method=io_uring -C <anything>": > > --- a/src/test/modules/test_aio/t/001_aio.pl > +++ b/src/test/modules/test_aio/t/001_aio.pl > @@ -29,7 +29,13 @@ $node_worker->stop(); > # Test io_method=io_uring > ### > > -if ($ENV{with_liburing} eq 'yes') > +sub have_io_uring > +{ > + local %ENV = $node_worker->_get_env(); # any node works > + return run_log [qw(postgres -c io_method=io_uring -C io_method)]; > +} > + > +if (have_io_uring()) > { > my $node_uring = create_node('io_uring'); > $node_uring->start(); Yea, that's a good idea. One thing that doesn't seem great is that it requires a prior node - what if we do -c io_method=invalid? That would report the list of valid GUC options, so we could just grep for io_uring. It's too bad that postgres --describe-config a) doesn't report the possible enum values b) doesn't apply/validate -c options > > Subject: [PATCH v2.11 01/27] aio, bufmgr: Comment fixes > > Ready to commit, though other comment fixes might come up in later reviews. I'll reorder it to a bit later in the series, to accumulate a few more. > One idea so far is to comment on valid states after some IoMethodOps > callbacks: > > --- a/src/include/storage/aio_internal.h > +++ b/src/include/storage/aio_internal.h > @@ -310,6 +310,9 @@ typedef struct IoMethodOps > /* > * Start executing passed in IOs. > * > + * Shall advance state to PGAIO_HS_SUBMITTED. (By the time this returns, > + * other backends might have advanced the state further.) > + * > * Will not be called if ->needs_synchronous_execution() returned true. > * > * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE. > @@ -321,6 +324,12 @@ typedef struct IoMethodOps > /* > * Wait for the IO to complete. Optional. > * > + * On return, state shall be PGAIO_HS_COMPLETED_IO, > + * PGAIO_HS_COMPLETED_SHARED or PGAIO_HS_COMPLETED_LOCAL. (The callback > + * need not change the state if it's already one of those.) If state is > + * PGAIO_HS_COMPLETED_IO, state will reach PGAIO_HS_COMPLETED_SHARED > + * without further intervention. > + * > * If not provided, it needs to be guaranteed that the IO method calls > * pgaio_io_process_completion() without further interaction by the > * issuing backend. I think these are a good idea. I added those to the copy-edit patch, with a few more tweaks: @@ -315,6 +315,9 @@ typedef struct IoMethodOps /* * Start executing passed in IOs. * + * Shall advance state to at least PGAIO_HS_SUBMITTED. (By the time this + * returns, other backends might have advanced the state further.) + * * Will not be called if ->needs_synchronous_execution() returned true. * * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE. @@ -323,12 +326,24 @@ typedef struct IoMethodOps */ int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios); - /* + /* --- * Wait for the IO to complete. Optional. * + * On return, state shall be one of + * - PGAIO_HS_COMPLETED_IO + * - PGAIO_HS_COMPLETED_SHARED + * - PGAIO_HS_COMPLETED_LOCAL + * + * The callback must not block if the handle is already in one of those + * states, or has been reused (see pgaio_io_was_recycled()).
If, on + * return, the state is PGAIO_HS_COMPLETED_IO, state will reach + * PGAIO_HS_COMPLETED_SHARED without further intervention by the IO + * method. + * * If not provided, it needs to be guaranteed that the IO method calls * pgaio_io_process_completion() without further interaction by the * issuing backend. + * --- */ void (*wait_one) (PgAioHandle *ioh, uint64 ref_generation); > > Subject: [PATCH v2.11 03/27] Redefine max_files_per_process to control > > additionally opened files > > Ready to commit Cool! > > Subject: [PATCH v2.11 04/27] aio: Add liburing dependency > > > --- a/meson.build > > +++ b/meson.build > > @@ -944,6 +944,18 @@ endif > > > > > > > > +############################################################### > > +# Library: liburing > > +############################################################### > > + > > +liburingopt = get_option('liburing') > > +liburing = dependency('liburing', required: liburingopt) > > +if liburing.found() > > + cdata.set('USE_LIBURING', 1) > > +endif > > This is a different style from other deps; is it equivalent to our standard > style? Yes - the only reason to be more complicated in the lz4 case is that we want to fall back to other ways of looking up the dependency (primarily because of windows. But that's not required for liburing, which oviously is linux only. > > --- a/configure.ac > > +++ b/configure.ac > > @@ -975,6 +975,14 @@ AC_SUBST(with_readline) > > PGAC_ARG_BOOL(with, libedit-preferred, no, > > [prefer BSD Libedit over GNU Readline]) > > > > +# > > +# liburing > > +# > > +AC_MSG_CHECKING([whether to build with liburing support]) > > +PGAC_ARG_BOOL(with, liburing, no, [io_uring support, for asynchronous I/O], > > Fourth arg generally starts with "build" for args like this. I suggest "build > with io_uring support, for asynchronous I/O". WFM. > > + [AC_DEFINE([USE_LIBURING], 1, [Define to build with io_uring support. (--with-liburing)])]) > > +AC_MSG_RESULT([$with_liburing]) > > +AC_SUBST(with_liburing) > > > > # > > # UUID library > > @@ -1463,6 +1471,9 @@ elif test "$with_uuid" = ossp ; then > > fi > > AC_SUBST(UUID_LIBS) > > > > +if test "$with_liburing" = yes; then > > + PKG_CHECK_MODULES(LIBURING, liburing) > > +fi > > We usually put this right after the AC_MSG_CHECKING ... AC_SUBST block. We don't really seem to do that for "dependency checks" in general, e.g. PGAC_CHECK_PERL_CONFIGS, PGAC_CHECK_PYTHON_EMBED_SETUP, PGAC_CHECK_READLINE, dependency dependent AC_CHECK_LIB calls, .. later in configure.ac than the defnition of the option. TBH, I've always struggled trying to discern what the organizing principle of configure.ac is. But you're right that the PKG_CHECK_MODULES calls are closer-by. And I'm happy to move towards having the code for each dep all in one place, so moved. A related thing: We seem to have no order of the $with_ checks that I can discern. Should the liburing check be at a different place? > This currently has unrelated stuff separating them. Also, with the > exception of icu, we follow PKG_CHECK_MODULES uses by absorbing flags from > pkg-config and use AC_CHECK_LIB to add the actual "-l". I think for liburing I was trying to follow ICU's example - injecting CFLAGS and LIBS just in the parts of the build dir that needs them. For LIBS I think I did so: diff --git a/src/backend/Makefile b/src/backend/Makefile ... +# The backend conditionally needs libraries that most executables don't need. +LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS) But ugh, for some reason I didn't do that for LIBURING_CFLAGS. 
In the v1.x version of aio I had aio:src/backend/storage/aio/Makefile:override CPPFLAGS += $(LIBURING_CFLAGS) but somehow lost that somewhere along the way to v2.x. I think I like targeting where ${LIB}_LIBS and ${LIB}_CFLAGS are applied more narrowly better than just adding to the global CFLAGS, CPPFLAGS, LDFLAGS. I'm somewhat inclined to add LIBURING_CFLAGS in src/backend rather than src/backend/storage/aio/ though. But I'm also willing to do it entirely differently. > > --- a/doc/src/sgml/installation.sgml > > +++ b/doc/src/sgml/installation.sgml > > lz4 and other deps have a mention in <sect1 id="install-requirements">, in > addition to sections edited here. Good point. Although once more I feel defeated by the ordering used :) Hm, that list is rather incomplete. At least libxml, libxslt, selinux, curl, uuid, systemd and bonjour aren't listed. Not sure if it makes sense to add liburing, given that? Greetings, Andres Freund
On Sun, Mar 23, 2025 at 11:11:53AM -0400, Andres Freund wrote: > On 2025-03-22 17:20:56 -0700, Noah Misch wrote: > > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > > > Not sure yet how to best disable testing io_uring in this case. We can't > > > just query EXEC_BACKEND from pg_config.h unfortunately. I guess making the > > > initdb not fail and checking the error log would work, but that doesn't work > > > nicely with Cluster.pm. > > > > How about "postgres -c io_method=io_uring -C <anything>": > > > > --- a/src/test/modules/test_aio/t/001_aio.pl > > +++ b/src/test/modules/test_aio/t/001_aio.pl > > @@ -29,7 +29,13 @@ $node_worker->stop(); > > # Test io_method=io_uring > > ### > > > > -if ($ENV{with_liburing} eq 'yes') > > +sub have_io_uring > > +{ > > + local %ENV = $node_worker->_get_env(); # any node works > > + return run_log [qw(postgres -c io_method=io_uring -C io_method)]; > > +} > > + > > +if (have_io_uring()) > > { > > my $node_uring = create_node('io_uring'); > > $node_uring->start(); > > Yea, that's a good idea. > > One thing that doesn't seem great is that it requires a prior node - what if > we do -c io_method=invalid? That would report the list of valid GUC options, > so we could just grep for io_uring. Works for me. > > One idea so far is to comment on valid states after some IoMethodOps > > callbacks: > I think these are a good idea. I added those to the copy-edit patch, with a > few more tweaks: The tweaks made it better. > > > Subject: [PATCH v2.11 04/27] aio: Add liburing dependency > > > + [AC_DEFINE([USE_LIBURING], 1, [Define to build with io_uring support. (--with-liburing)])]) > > > +AC_MSG_RESULT([$with_liburing]) > > > +AC_SUBST(with_liburing) > > > > > > # > > > # UUID library > > > @@ -1463,6 +1471,9 @@ elif test "$with_uuid" = ossp ; then > > > fi > > > AC_SUBST(UUID_LIBS) > > > > > > +if test "$with_liburing" = yes; then > > > + PKG_CHECK_MODULES(LIBURING, liburing) > > > +fi > > > > We usually put this right after the AC_MSG_CHECKING ... AC_SUBST block. > > We don't really seem to do that for "dependency checks" in general, e.g. > PGAC_CHECK_PERL_CONFIGS, PGAC_CHECK_PYTHON_EMBED_SETUP, PGAC_CHECK_READLINE, > dependency-dependent AC_CHECK_LIB calls, ... later in configure.ac than the > definition of the option. AC_CHECK_LIB stays far away, yes. > But you're right that the PKG_CHECK_MODULES calls are closer-by. And I'm happy > to move towards having the code for each dep all in one place, so moved. > > > A related thing: We seem to have no order of the $with_ checks that I can > discern. Should the liburing check be at a different place? No opinion on that one. It's fine. > > This currently has unrelated stuff separating them. Also, with the > > exception of icu, we follow PKG_CHECK_MODULES uses by absorbing flags from > > pkg-config and use AC_CHECK_LIB to add the actual "-l". > > I think for liburing I was trying to follow ICU's example - injecting CFLAGS > and LIBS just in the parts of the build dir that need them. > > For LIBS I think I did so: > > diff --git a/src/backend/Makefile b/src/backend/Makefile > ... > +# The backend conditionally needs libraries that most executables don't need. > +LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS) > > But ugh, for some reason I didn't do that for LIBURING_CFLAGS.
In the v1.x > version of aio I had > aio:src/backend/storage/aio/Makefile:override CPPFLAGS += $(LIBURING_CFLAGS) > > but somehow lost that somewhere along the way to v2.x. > > > I think I like targeting where ${LIB}_LIBS and ${LIB}_CFLAGS are applied more > narrowly better than just adding to the global CFLAGS, CPPFLAGS, LDFLAGS. Agreed. > somewhat inclined to add LIBURING_CFLAGS in src/backend rather than > src/backend/storage/aio/ though. > > But I'm also willing to do it entirely differently. The CPPFLAGS addition, located wherever makes sense, resolves that point. > > > --- a/doc/src/sgml/installation.sgml > > > +++ b/doc/src/sgml/installation.sgml > > > > lz4 and other deps have a mention in <sect1 id="install-requirements">, in > > addition to sections edited here. > > Good point. > > Although once more I feel defeated by the ordering used :) > > Hm, that list is rather incomplete. At least libxml, libxslt, selinux, curl, > uuid, systemd and bonjour aren't listed. > > Not sure if it makes sense to add liburing, given that? That's a lot of preexisting incompleteness. I withdraw the point about <sect1 id="install-requirements">. Unrelated to the above, another question about io_uring: commit da722699 wrote: > +/* > + * Need to submit staged but not yet submitted IOs using the fd, otherwise > + * the IO would end up targeting something bogus. > + */ > +void > +pgaio_closing_fd(int fd) An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in PGAIO_HS_COMPLETED_IO clearly doesn't block that close. For io_method=worker, closing in PGAIO_HS_SUBMITTED is okay. For io_method=io_uring, is there a reference about it being okay to close during PGAIO_HS_SUBMITTED? I looked awhile for an authoritative view on that, but I didn't find one. If we can rely on io_uring_submit() returning only after the kernel has given the io_uring its own reference to all applicable file descriptors, I expect it's okay to close the process's FD. If the io_uring acquires its reference later than that, I expect we shouldn't close before that later time.
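The fd-lifetime question above can be probed empirically. The following standalone program (assumes liburing and a default, non-SQPOLL ring; build with -luring; io_uring_prep_read needs kernel 5.6+) closes the fd right after io_uring_submit() returns and still observes a successful completion, consistent with the kernel taking its own file reference at submission — an experiment under those assumptions, not an authoritative statement of the contract:

```
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];
    int fd;

    if (io_uring_queue_init(4, &ring, 0) < 0)
        return 1;
    fd = open("/etc/os-release", O_RDONLY);
    if (fd < 0)
        return 1;
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);
    close(fd);                  /* fd closed while the read may be in flight */
    io_uring_wait_cqe(&ring, &cqe);
    printf("read result: %d\n", cqe->res);  /* still a positive byte count */
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```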
Hi, On 2025-03-22 19:09:55 -0700, Noah Misch wrote: > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > > Attached v2.11 > > > Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring > > Apart from some isolated cosmetic points, this is ready to commit: > > > + ereport(ERROR, > > + errcode(err), > > + errmsg("io_uring_queue_init failed: %m"), > > + hint != NULL ? errhint("%s", hint) : 0); > > https://www.postgresql.org/docs/current/error-style-guide.html gives the example: > > BAD: open() failed: %m > BETTER: could not open file %s: %m > > Hence, this errmsg should change, perhaps to: > "could not setup io_uring queues: %m". You're right. I didn't intentionally "violate" the policy, but I do have to admit, I'm not a huge fan of that aspect, it just obfuscates what actually failed, forcing one to look at the code or strace to figure out what precisely failed. (Changed) > > + pgaio_debug_io(DEBUG3, ioh, > > + "wait_one io_gen: %llu, ref_gen: %llu, cycle %d", > > + (long long unsigned) ref_generation, > > + (long long unsigned) ioh->generation, > > In the message string, io_gen appears before ref_gen. In the subsequent args, > the order is swapped relative to the message string. Oops, you're right. > > --- a/src/backend/utils/activity/wait_event_names.txt > > +++ b/src/backend/utils/activity/wait_event_names.txt > > @@ -192,6 +192,8 @@ ABI_compatibility: > > > > Section: ClassName - WaitEventIO > > > > +AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring." > > +AIO_IO_URING_COMPLETION "Waiting for IO completion via io_uring." > > AIO_IO_COMPLETION "Waiting for IO completion." > > I'm wondering if there's an opportunity to enrich the last two wait event > names and/or descriptions. The current descriptions suggest to me more > similarity than is actually there. Inputs to the decision: > > - AIO_IO_COMPLETION waits for an IO in PGAIO_HS_DEFINED, PGAIO_HS_STAGED, or > PGAIO_HS_COMPLETED_IO to reach PGAIO_HS_COMPLETED_SHARED. The three > starting states are the states where some other backend owns the next > action, so the current backend can only wait to be signaled. > > - AIO_IO_URING_COMPLETION waits for the kernel to do enough so we can move > from PGAIO_HS_SUBMITTED to PGAIO_HS_COMPLETED_IO. > > Possible names and descriptions, based on PgAioHandleState enum names and > comments: > > AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring." > AIO_COMPLETED_SHARED "Waiting for IO shared completion callback." > > If "shared completion callback" is too internals-focused, perhaps this: > > AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring." > AIO_COMPLETED_SHARED "Waiting for IO completion to update shared memory." Hm, right now AIO_IO_COMPLETION also covers the actual "raw" execution of the IO with io_method=worker/sync. For that AIO_COMPLETED_SHARED would be inappropriate. We could use a different wait event if we wait for an IO via CV in PGAIO_HS_SUBMITTED, with a small refactoring of pgaio_io_wait(). But I'm not sure that would get you that far - we don't broadcast the CV when transitioning from PGAIO_HS_SUBMITTED -> PGAIO_HS_COMPLETED_IO, so the wait event would stay the same, now wrong, until the shared callback completes. Obviously waking everyone up just so they can use a different wait event doesn't make sense. A more minimal change would be to narrow AIO_IO_URING_COMPLETION to "execution" or something like that, to hint at a separation between the raw IO being completed and the IO, including the callbacks completing.
> > --- a/doc/src/sgml/config.sgml > > +++ b/doc/src/sgml/config.sgml > > @@ -2710,6 +2710,12 @@ include_dir 'conf.d' > > <literal>worker</literal> (execute asynchronous I/O using worker processes) > > </para> > > </listitem> > > + <listitem> > > + <para> > > + <literal>io_uring</literal> (execute asynchronous I/O using > > + io_uring, if available) > > I feel the "if available" doesn't quite fit, since we'll fail if unavailable. > Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux" > there to reduce surprise on other platforms. You're right, the if available can be misunderstood. But not mentioning that it's an optional dependency seems odd too. What about something like <para> <literal>io_uring</literal> (execute asynchronous I/O using io_uring, requires postgres to have been built with <link linkend="configure-option-with-liburing"><option>--with-liburing</option></link> / <link linkend="configure-with-liburing-meson"><option>-Dliburing</option></link>) </para> Should the docs for --with-liburing/-Dliburing mention it's linux only? We don't seem to do that for things like systemd (linux), selinux (linux) and only kinda for bonjour (macos). Greetings, Andres Freund
On Sun, Mar 23, 2025 at 11:57:48AM -0400, Andres Freund wrote: > On 2025-03-22 19:09:55 -0700, Noah Misch wrote: > > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > > > Attached v2.11 > > > > > Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring > > > --- a/src/backend/utils/activity/wait_event_names.txt > > > +++ b/src/backend/utils/activity/wait_event_names.txt > > > @@ -192,6 +192,8 @@ ABI_compatibility: > > > > > > Section: ClassName - WaitEventIO > > > > > > +AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring." > > > +AIO_IO_URING_COMPLETION "Waiting for IO completion via io_uring." > > > AIO_IO_COMPLETION "Waiting for IO completion." > > > > I'm wondering if there's an opportunity to enrich the last two wait event > > names and/or descriptions. The current descriptions suggest to me more > > similarity than is actually there. Inputs to the decision: > > > > - AIO_IO_COMPLETION waits for an IO in PGAIO_HS_DEFINED, PGAIO_HS_STAGED, or > > PGAIO_HS_COMPLETED_IO to reach PGAIO_HS_COMPLETED_SHARED. The three > > starting states are the states where some other backend owns the next > > action, so the current backend can only wait to be signaled. > > > > - AIO_IO_URING_COMPLETION waits for the kernel to do enough so we can move > > from PGAIO_HS_SUBMITTED to PGAIO_HS_COMPLETED_IO. > > > > Possible names and descriptions, based on PgAioHandleState enum names and > > comments: > > > > AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring." > > AIO_COMPLETED_SHARED "Waiting for IO shared completion callback." > > > > If "shared completion callback" is too internals-focused, perhaps this: > > > > AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring." > > AIO_COMPLETED_SHARED "Waiting for IO completion to update shared memory." > > Hm, right now AIO_IO_COMPLETION also covers the actual "raw" execution of the > IO with io_method=worker/sync. Right, it could start with the IO in PGAIO_HS_DEFINED and end with the IO in PGAIO_HS_COMPLETED_SHARED. So another part of the wait may be the definer doing work before exiting batch mode. > For that AIO_COMPLETED_SHARED would be > inappropriate. The concept I had in mind was "waiting to reach PGAIO_HS_COMPLETED_SHARED, whatever obstacles that involves". Another candidate description string: AIO_COMPLETED_SHARED "Waiting for another process to complete IO." > We could use a different wait event if we wait for an IO via CV in > PGAIO_HS_SUBMITTED, with a small refactoring of pgaio_io_wait(). But I'm not > sure that would get you that far - we don't broadcast the CV when > transitioning from PGAIO_HS_SUBMITTED -> PGAIO_HS_COMPLETED_IO, so we'd keep > the same, now wrong, wait event until the shared callback completes. > Obviously waking everyone up just so they can use a different wait event > doesn't make sense. Agreed. The mapping of code ranges to wait events seems fine to me. I'm mainly trying to optimize the wait event description strings to fit those code ranges. > A more minimal change would be to narrow AIO_IO_URING_COMPLETION to > "execution" or something like that, to hint at a separation between the raw IO > being completed and the IO, including the callbacks completing. Yes, that would work for me.
> > > --- a/doc/src/sgml/config.sgml > > > +++ b/doc/src/sgml/config.sgml > > > @@ -2710,6 +2710,12 @@ include_dir 'conf.d' > > > <literal>worker</literal> (execute asynchronous I/O using worker processes) > > > </para> > > > </listitem> > > > + <listitem> > > > + <para> > > > + <literal>io_uring</literal> (execute asynchronous I/O using > > > + io_uring, if available) > > > > I feel the "if available" doesn't quite fit, since we'll fail if unavailable. > > Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux" > > there to reduce surprise on other platforms. > > You're right, the if available can be misunderstood. But not mentioning that > it's an optional dependency seems odd too. What about something like > > <para> > <literal>io_uring</literal> (execute asynchronous I/O using > io_uring, requires postgres to have been built with > <link linkend="configure-option-with-liburing"><option>--with-liburing</option></link> / > <link linkend="configure-with-liburing-meson"><option>-Dliburing</option></link>) > </para> I'd change s/postgres to have been built/a build with/ since the SGML docs don't use the term "postgres" that way. Otherwise, that works for me. > Should the docs for --with-liburing/-Dliburing mention it's linux only? We > don't seem to do that for things like systemd (linux), selinux (linux) and > only kinda for bonjour (macos). No need, I think.
Hi, On 2025-03-23 08:55:29 -0700, Noah Misch wrote: > On Sun, Mar 23, 2025 at 11:11:53AM -0400, Andres Freund wrote: > Unrelated to the above, another question about io_uring: > > commit da722699 wrote: > > +/* > > + * Need to submit staged but not yet submitted IOs using the fd, otherwise > > + * the IO would end up targeting something bogus. > > + */ > > +void > > +pgaio_closing_fd(int fd) > > An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in > PGAIO_HS_COMPLETED_IO clearly doesn't block that close. For io_method=worker, > closing in PGAIO_HS_SUBMITTED is okay. For io_method=io_uring, is there a > reference about it being okay to close during PGAIO_HS_SUBMITTED? I looked > awhile for an authoritative view on that, but I didn't find one. If we can > rely on io_uring_submit() returning only after the kernel has given the > io_uring its own reference to all applicable file descriptors, I expect it's > okay to close the process's FD. If the io_uring acquires its reference later > than that, I expect we shouldn't close before that later time. I'm fairly sure io_uring has its own reference for the file descriptor by the time io_uring_enter() returns [1]. What io_uring does *not* reliably tolerate is the issuing process *exiting* before the IO completes, even if there are other processes attached to the same io_uring instance. AIO v1 had a posix_aio backend, which, on several platforms, did *not* tolerate the FD being closed before the IO completes. Because of that IoMethodOps had a closing_fd callback, which posix_aio used to wait for the IO's completion [2]. I've added a test case exercising this path for all io methods. But I can't think of a way that would, with high likelihood, catch io_uring not actually holding a reference to the fd - the IO will almost always complete too quickly for us to catch that. But it still seems better than not testing the path at all - it does catch at least the problem of pgaio_closing_fd() not doing anything. Greetings, Andres Freund [1] See https://github.com/torvalds/linux/blob/586de92313fcab8ed84ac5f78f4d2aae2db92c59/io_uring/io_uring.c#L1728 called from https://github.com/torvalds/linux/blob/586de92313fcab8ed84ac5f78f4d2aae2db92c59/io_uring/io_uring.c#L2204 called from https://github.com/torvalds/linux/blob/586de92313fcab8ed84ac5f78f4d2aae2db92c59/io_uring/io_uring.c#L3372 in the io_uring_enter() syscall [2] https://github.com/anarazel/postgres/blob/a08cd717b5af4e51afb25ec86623973158a72ab9/src/backend/storage/aio/aio_posix.c#L738
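As an illustration of [1], the following standalone sketch (not part of the patchset) submits a read via liburing and close()s the fd before reaping the completion. Per the caveat above it cannot prove that the kernel holds its own file reference - the read usually completes almost immediately anyway - but a successful cqe->res after close() is at least consistent with that reading of the kernel source. The file path is an arbitrary example:

/* build with: cc close_after_submit.c -luring */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char		buf[4096];
	int			fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		exit(1);

	fd = open("/etc/hosts", O_RDONLY);
	if (fd < 0)
		exit(1);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	io_uring_submit(&ring);		/* kernel takes its file reference here */

	close(fd);					/* close before reaping the completion */

	if (io_uring_wait_cqe(&ring, &cqe) < 0)
		exit(1);
	printf("read result: %d\n", cqe->res);	/* expect >= 0, not -EBADF */
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}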
commit 247ce06b wrote: > + pgaio_io_reopen(ioh); > + > + /* > + * To be able to exercise the reopen-fails path, allow injection > + * points to trigger a failure at this point. > + */ > + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN"); > + > + error_errno = 0; > + error_ioh = NULL; > + > + /* > + * We don't expect this to ever fail with ERROR or FATAL, no need > + * to keep error_ioh set to the IO. > + * pgaio_io_perform_synchronously() contains a critical section to > + * ensure we don't accidentally fail. > + */ > + pgaio_io_perform_synchronously(ioh); A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a HOLD_INTERRUPTS() should surround the above region of code. It's likely hard to reproduce a problem, because pgaio_io_call_inj() does nothing in many builds, and pgaio_io_perform_synchronously() starts by entering a critical section. On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > Attached v2.11 > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd > +int > +FileStartReadV(PgAioHandle *ioh, File file, > + int iovcnt, off_t offset, > + uint32 wait_event_info) > +{ > + int returnCode; > + Vfd *vfdP; > + > + Assert(FileIsValid(file)); > + > + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d", > + file, VfdCache[file].fileName, > + (int64) offset, > + iovcnt)); > + > + returnCode = FileAccess(file); > + if (returnCode < 0) > + return returnCode; > + > + vfdP = &VfdCache[file]; > + > + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset); FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from functions so named. The "start" verb sounds to me like unconditional PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do you see it? > +/* > + * AIO error reporting callback for mdstartreadv(). > + * > + * Errors are encoded as follows: > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0 I recommend replacing "errno != 0" with either "that errno" or "errno == error_data". > Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design Ready for commit apart from some trivia: > +if (ioret.result.status == PGAIO_RS_ERROR) > + pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR); I think ioret and aio_ret are supposed to be the same object. If that's right, change one of the names. Likewise elsewhere in this file. > +The central API piece for postgres' AIO abstraction are AIO handles. To > +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and > +then "defined", i.e. associate an IO operation with the handle. s/"defined"/"define" it/ or similar > +The "solution" to this the ability to associate multiple completion callbacks s/this the/this is the/ > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer) > Assert(refcount > 0); > if (refcount != 1) > return false; > + > + /* > + * Check that the AIO subsystem doesn't have a pin. 
Likely not > + * possible today, but better safe than sorry. > + */ > + bufHdr = GetLocalBufferDescriptor(-buffer - 1); > + buf_state = pg_atomic_read_u32(&bufHdr->state); > + refcount = BUF_STATE_GET_REFCOUNT(buf_state); > + Assert(refcount > 0); > + if (refcount != 1) > + return false; > + LockBufferForCleanup() should get code like this ConditionalLockBufferForCleanup() code, either now or when "not possible today" ends. Currently, it just assumes all local buffers are cleanup-lockable: /* Nobody else to wait for */ if (BufferIsLocal(buffer)) return; > @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced) > > buf_state = pg_atomic_read_u32(&bufHdr->state); > > - if (check_unreferenced && LocalRefCount[bufid] != 0) > + /* > + * We need to test not just LocalRefCount[bufid] but also the BufferDesc > + * itself, as the latter is used to represent a pin by the AIO subsystem. > + * This can happen if AIO is initiated and then the query errors out. > + */ > + if (check_unreferenced && > + (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0)) > elog(ERROR, "block %u of %s is still referenced (local %u)", I didn't write a test to prove it, but I'm suspecting we'll reach the above ERROR with this sequence: CREATE TEMP TABLE foo ...; [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing] DROP TABLE foo; DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for the particular rel) before InvalidateLocalBuffer(). Or use something like the logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in corresponding bufmgr code. I think that bufmgr ERROR is unreachable, since only a private refcnt triggers that bufmgr ERROR. Is there something preventing the localbuf error from being a problem? (This wouldn't require changes to the current patch; responsibility would fall in a bufmgr AIO patch.) > Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support (Still reviewing this and later patches, but incidental observations follow.) > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, > + bool failed, bool is_temp) > +{ ... > + PgAioResult result; ... > + result.status = PGAIO_RS_OK; ... > + return result; gcc 14.2.0 -Werror gives me: bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized] Zeroing the unset fields silenced it: --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -7221,3 +7221,3 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, char *bufdata = BufferGetBlock(buffer); - PgAioResult result; + PgAioResult result = { .status = PGAIO_RS_OK }; uint32 set_flag_bits; @@ -7238,4 +7238,2 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, - result.status = PGAIO_RS_OK; - /* check for garbage data */ > Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO > @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream) > static void > read_stream_look_ahead(ReadStream *stream) > { > + /* > + * Allow amortizing the cost of submitting IO over multiple IOs. This > + * requires that we don't do any operations that could lead to a deadlock > + * with staged-but-unsubmitted IO. > + */ > + pgaio_enter_batchmode(); We call read_stream_get_block() while in batchmode, so the stream callback needs to be ready for that. 
A complicated case is collect_corrupt_items_read_stream_next_block(), which may do its own buffer I/O to read in a vmbuffer for VM_ALL_FROZEN(). That's feeling to me like a recipe for corner cases reaching ERROR "starting batch while batch already in progress". Are there mitigating factors? > Subject: [PATCH v2.11 17/27] aio: Add test_aio module > + # verify that page verification errors are detected even as part of a > + # shortened multi-block read (tbl_corr, block 1 is tbl_corred) Is "tbl_corred" a typo of something? > --- /dev/null > +++ b/src/test/modules/test_aio/test_aio.c > @@ -0,0 +1,657 @@ > +/*------------------------------------------------------------------------- > + * > + * delay_execution.c > + * Test module to allow delay between parsing and execution of a query. > + * > + * The delay is implemented by taking and immediately releasing a specified > + * advisory lock. If another process has previously taken that lock, the > + * current process will be blocked until the lock is released; otherwise, > + * there's no effect. This allows an isolationtester script to reliably > + * test behaviors where some specified action happens in another backend > + * between parsing and execution of any desired query. > + * > + * Copyright (c) 2020-2025, PostgreSQL Global Development Group > + * > + * IDENTIFICATION > + * src/test/modules/delay_execution/delay_execution.c Header comment is surviving from copy-paste of delay_execution.c. > + * Tor tests we don't want the resowner release preventing us from s/Tor/For/
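For the interrupt-holding point at the top of this review, a hedged sketch of how the quoted worker-loop region might be wrapped; the names are those from commit 247ce06b, but the exact placement is for the patch author to decide:

/*
 * Hold interrupts so that a CHECK_FOR_INTERRUPTS(), e.g. inside the
 * injection point, cannot run smgrreleaseall() and close() the fd that
 * pgaio_io_reopen() just stored.
 */
HOLD_INTERRUPTS();

pgaio_io_reopen(ioh);
pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");

error_errno = 0;
error_ioh = NULL;

pgaio_io_perform_synchronously(ioh);

RESUME_INTERRUPTS();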
On Mon, Mar 24, 2025 at 5:59 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-03-23 08:55:29 -0700, Noah Misch wrote: > > An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in > > PGAIO_HS_COMPLETED_IO clearly doesn't block that close. For io_method=worker, > > closing in PGAIO_HS_SUBMITTED is okay. For io_method=io_uring, is there a > > reference about it being okay to close during PGAIO_HS_SUBMITTED? I looked > > awhile for an authoritative view on that, but I didn't find one. If we can > > rely on io_uring_submit() returning only after the kernel has given the > > io_uring its own reference to all applicable file descriptors, I expect it's > > okay to close the process's FD. If the io_uring acquires its reference later > > than that, I expect we shouldn't close before that later time. > > I'm fairly sure io_uring has its own reference for the file descriptor by the > time io_uring_enter() returns [1]. What io_uring does *not* reliably tolerate > is the issuing process *exiting* before the IO completes, even if there are > other processes attached to the same io_uring instance. It is a bit strange that the documentation doesn't say that explicitly. You can sorta-maybe-kinda infer it from the fact that io_uring didn't originally support cancelling requests at all, maybe a small clue that it also didn't cancel them when you closed the fd :-) The only sane alternative would seem to be that they keep running and have their own reference to the *file* (not the fd), which is the actual case, and might also be inferrable at a stretch from the io_uring_register() documentation that says it reduces overheads with a "long term reference" reducing "per-I/O overhead". (The distant third option/non-option is a sort of late/async binding fd as seen in the Glibc user space POSIX AIO implementation, but that sort of madness doesn't seem to be the sort of thing anyone working in the kernel would entertain for a nanosecond...) Anyway, there are also public discussions involving Mr Axboe that discuss the fact that async operations continue to run when the associated fd is closed, eg from people who were surprised by that when porting stuff from other systems, which might help fill in the documentation gap a teensy bit if people want to see something outside the source code: https://github.com/axboe/liburing/issues/568 > AIO v1 had a posix_aio backend, which, on several platforms, did *not* > tolerate the FD being closed before the IO completes. Because of that > IoMethodOps had a closing_fd callback, which posix_aio used to wait for the > IO's completion [2]. Just for the record while remembering this stuff: Windows is another system that took the cancel-on-close approach, so the Windows IOCP proof-of-concept patches also used that AIO v1 callback and we'll have to think about that again if/when we want to get that stuff going on AIO v2. I recall also speculating that it might be better to teach the vfd system to pick another victim to close instead if an fd was currently tied up with an asynchronous I/O for the benefit of those cancel-on-close systems, hopefully without any happy-path book-keeping. But just submitting staged I/O is a nice and cheap solution for now, without them in the picture.
Hi, On 2025-03-23 17:29:39 -0700, Noah Misch wrote: > commit 247ce06b wrote: > > + pgaio_io_reopen(ioh); > > + > > + /* > > + * To be able to exercise the reopen-fails path, allow injection > > + * points to trigger a failure at this point. > > + */ > > + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN"); > > + > > + error_errno = 0; > > + error_ioh = NULL; > > + > > + /* > > + * We don't expect this to ever fail with ERROR or FATAL, no need > > + * to keep error_ioh set to the IO. > > + * pgaio_io_perform_synchronously() contains a critical section to > > + * ensure we don't accidentally fail. > > + */ > > + pgaio_io_perform_synchronously(ioh); > > A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee > smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that > interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a > HOLD_INTERRUPTS() should surround the above region of code. It's likely hard > to reproduce a problem, because pgaio_io_call_inj() does nothing in many > builds, and pgaio_io_perform_synchronously() starts by entering a critical > section. Hm, I guess you're right - it would be pretty bonkers for the injection to process interrupts, but it's much better to clarify the code to make that not an option. Once doing that, it seemed a similar assertion in pgaio_io_before_prep() would be appropriate. > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > > Attached v2.11 > > > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd > > > +int > > +FileStartReadV(PgAioHandle *ioh, File file, > > + int iovcnt, off_t offset, > > + uint32 wait_event_info) > > +{ > > + int returnCode; > > + Vfd *vfdP; > > + > > + Assert(FileIsValid(file)); > > + > > + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d", > > + file, VfdCache[file].fileName, > > + (int64) offset, > > + iovcnt)); > > + > > + returnCode = FileAccess(file); > > + if (returnCode < 0) > > + return returnCode; > > + > > + vfdP = &VfdCache[file]; > > + > > + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset); > > FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED > w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from > functions so named. The "start" verb sounds to me like unconditional > PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like > the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at > PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names > FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do > you see it? I have a surprisingly strong negative reaction to that proposed naming. To me the staging is a distinct step that happens *after* the IO is fully defined. Making all the layered calls that lead up to that named that way would IMO be a bad idea. I however don't particularly like the *start* or *prep* names, I've gone back and forth on those a couple times. I could see "begin" work uniformly across those. > > +/* > > + * AIO error reporting callback for mdstartreadv(). > > + * > > + * Errors are encoded as follows: > > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0 > > I recommend replacing "errno != 0" with either "that errno" or "errno == > error_data". Done. > > Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design > > Ready for commit apart from some trivia: Great.
> > +if (ioret.result.status == PGAIO_RS_ERROR) > > + pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR); > > I think ioret and aio_ret are supposed to be the same object. If that's > right, change one of the names. Likewise elsewhere in this file. You're right. > > +The central API piece for postgres' AIO abstraction are AIO handles. To > > +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and > > +then "defined", i.e. associate an IO operation with the handle. > > s/"defined"/"define" it/ or similar > > > +The "solution" to this the ability to associate multiple completion callbacks > > s/this the/this is the/ Applied. > > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well > > > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer) > > Assert(refcount > 0); > > if (refcount != 1) > > return false; > > + > > + /* > > + * Check that the AIO subsystem doesn't have a pin. Likely not > > + * possible today, but better safe than sorry. > > + */ > > + bufHdr = GetLocalBufferDescriptor(-buffer - 1); > > + buf_state = pg_atomic_read_u32(&bufHdr->state); > > + refcount = BUF_STATE_GET_REFCOUNT(buf_state); > > + Assert(refcount > 0); > > + if (refcount != 1) > > + return false; > > + > > LockBufferForCleanup() should get code like this > ConditionalLockBufferForCleanup() code, either now or when "not possible > today" ends. Currently, it just assumes all local buffers are > cleanup-lockable: > > /* Nobody else to wait for */ > if (BufferIsLocal(buffer)) > return; Kinda, yes, kinda no? LockBufferForCleanup() assumes, even for shared buffers, that the current backend can't be doing anything that conflicts with acquiring a buffer pin - note that it doesn't check the backend local pincount for shared buffers either. LockBufferForCleanup() kind of has to make that assumption, because there's no way to wait for yourself to release another pin, because obviously waiting in LockBufferForCleanup() would prevent that from ever happening. It's somewhat disheartening that the comments for LockBufferForCleanup() don't mention that the caller somehow needs to ensure it isn't called with other pins on the relation. Nor does LockBufferForCleanup() have any asserts checking how many backend-local pins exist. Leaving documentation / asserts aside, I think this is largely a safe assumption given current callers. With one exception, it's all vacuum or recovery related code - as vacuum can't run in a transaction, we can't conflict with another pin by the same backend. The one exception is heap_surgery.c - it doesn't quite seem safe, as the surrounding query (or another query with a cursor) could have a pin on the target block. The most obvious fix would be to use CheckTableNotInUse(), but that might break some reasonable uses. Or maybe it should just not use a cleanup lock - it's not obvious to me why it uses one. But tbh, I don't care too much, given what heap_surgery is. > > @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced) > > > > buf_state = pg_atomic_read_u32(&bufHdr->state); > > > > - if (check_unreferenced && LocalRefCount[bufid] != 0) > > + /* > > + * We need to test not just LocalRefCount[bufid] but also the BufferDesc > > + * itself, as the latter is used to represent a pin by the AIO subsystem. > > + * This can happen if AIO is initiated and then the query errors out. 
> > + */ > > + if (check_unreferenced && > > + (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0)) > > elog(ERROR, "block %u of %s is still referenced (local %u)", > > I didn't write a test to prove it, but I'm suspecting we'll reach the above > ERROR with this sequence: > > CREATE TEMP TABLE foo ...; > [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing] > DROP TABLE foo; That seems plausible. I'll try to write a test after this email. > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for > the particular rel) before InvalidateLocalBuffer(). Or use something like the > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in > corresponding bufmgr code. Just waiting for the IO in InvalidateBuffer() does seem like the best bet to me. It's going to be reached pretty rarely; waiting for all concurrent IO seems unnecessarily heavyweight. I don't think it matters much today, but once we do things like asynchronously writing back buffers or WAL, the situation will be different. I think this points to the comment above the WaitIO() in InvalidateBuffer() needing a bit of adapting - an in-progress read can trigger the WaitIO as well. Something like: /* * We assume the reason for it to be pinned is that either we were * asynchronously reading the page in before erroring out or someone else * is flushing the page out. Wait for the IO to finish. (This could be * an infinite loop if the refcount is messed up... it would be nice to * time out after awhile, but there seems no way to be sure how many loops * may be needed. Note that if the other guy has pinned the buffer but * not yet done StartBufferIO, WaitIO will fall through and we'll * effectively be busy-looping here.) */ > > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, > > > + bool failed, bool is_temp) > > > +{ > > ... > > > + PgAioResult result; > > ... > > > + result.status = PGAIO_RS_OK; > > ... > > > + return result; > > > > gcc 14.2.0 -Werror gives me: > > > > bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized] Gngngng. Since when is it a bug for some fields of a struct to be uninitialized, as long as they're not used? Interestingly I don't see that warning, despite also using gcc 14.2.0. I'll just move to your solution, but it seems odd. > > Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO > > > @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream) > > static void > > read_stream_look_ahead(ReadStream *stream) > > { > > + /* > > + * Allow amortizing the cost of submitting IO over multiple IOs. This > > + * requires that we don't do any operations that could lead to a deadlock > > + * with staged-but-unsubmitted IO. > > + */ > > + pgaio_enter_batchmode(); > > We call read_stream_get_block() while in batchmode, so the stream callback > needs to be ready for that. A complicated case is > collect_corrupt_items_read_stream_next_block(), which may do its own buffer > I/O to read in a vmbuffer for VM_ALL_FROZEN(). That's feeling to me like a > recipe for corner cases reaching ERROR "starting batch while batch already in > progress". Are there mitigating factors? Ugh, yes, you're right. heap_vac_scan_next_block() is also affected. 
I don't think "starting batch while batch already in progress" is the real issue though - it seems easy enough to avoid starting another batch inside, partially because current cases seem unlikely to need to do batchable IO inside. What worries me more is that code might block while there's unsubmitted IO - which seems entirely plausible. I can see a few approaches: 1) Declare that all read stream callbacks have to be careful and cope with batch mode I'm not sure how viable that is, not starting batches seems ok, but ensuring that the code doesn't block is a different story. 2) Have read stream users opt-in to batching Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough to implement and to add to the callsites where that's fine. 3) Teach read stream to "look ahead" far enough to determine all the blocks that could be issued in a batch outside of batchmode I think that's probably not a great idea, it'd lead us to looking further ahead than we really need to, which could increase "unfairness" in e.g. parallel sequential scan. 4) Just defer using batch mode for now It's a nice win with io_uring for random IO, e.g. from bitmap heap scans , but there's no need to immediately solve this. I think regardless of what we go for, it's worth splitting "aio: Basic read_stream adjustments for real AIO" into the actually basic parts (i.e. introducing sync_mode) from the not actually so basic parts (i.e. batching). I suspect that 2) would be the best approach. Only the read stream user knows what it needs to do in the callback. > > > Subject: [PATCH v2.11 17/27] aio: Add test_aio module > > > + # verify that page verification errors are detected even as part of a > > + # shortened multi-block read (tbl_corr, block 1 is tbl_corred) > > Is "tbl_corred" a typo of something? I think that was a search&replace of the table name gone wrong. It was just supposed to be "corrupted". > > + * > > + * IDENTIFICATION > > + * src/test/modules/delay_execution/delay_execution.c > > Header comment is surviving from copy-paste of delay_execution.c. Oh, how I hate these pointless comments. Fixed. Greetings, Andres Freund
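Circling back to the InvalidateBuffer()/InvalidateLocalBuffer() point above, a hedged sketch of what the wait in InvalidateLocalBuffer() could look like, assuming the PgAioWaitRef API from the earlier patches; the actual change may differ:

/* in InvalidateLocalBuffer(), before the "still referenced" check */
if (pgaio_wref_valid(&bufHdr->io_wref))
{
	PgAioWaitRef iow = bufHdr->io_wref;

	/*
	 * IO may have been started on the buffer before the query errored
	 * out; wait for it so the AIO subsystem's pin is released before we
	 * test the reference counts.
	 */
	pgaio_wref_wait(&iow);
}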
Hi, On 2025-03-24 11:43:47 -0400, Andres Freund wrote: > > I didn't write a test to prove it, but I'm suspecting we'll reach the above > > ERROR with this sequence: > > > > CREATE TEMP TABLE foo ...; > > [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing] > > DROP TABLE foo; > > That seems plausible. I'll try to write a test after this email. FWIW, a test did indeed confirm that. Luckily: > > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I > > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for > > the particular rel) before InvalidateLocalBuffer(). Or use something like the > > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in > > corresponding bufmgr code. > > Just waiting for the IO in InvalidateBuffer() does seem like the best bet to > me. This did indeed resolve the issue. I've extended the testsuite to test for that and a bunch more things. Working on sending out a new version... > > We call read_stream_get_block() while in batchmode, so the stream callback > > needs to be ready for that. A complicated case is > > collect_corrupt_items_read_stream_next_block(), which may do its own buffer > > I/O to read in a vmbuffer for VM_ALL_FROZEN(). That's feeling to me like a > > recipe for corner cases reaching ERROR "starting batch while batch already in > > progress". Are there mitigating factors? > > Ugh, yes, you're right. heap_vac_scan_next_block() is also affected. > > I don't think "starting batch while batch already in progress" is the real > issue though - it seems easy enough to avoid starting another batch inside, > partially because current cases seem unlikely to need to do batchable IO > inside. What worries me more is that code might block while there's > unsubmitted IO - which seems entirely plausible. > > > I can see a few approaches: > > 1) Declare that all read stream callbacks have to be careful and cope with > batch mode > > I'm not sure how viable that is, not starting batches seems ok, but > ensuring that the code doesn't block is a different story. > > > 2) Have read stream users opt-in to batching > > Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough > to implement and to add to the callsites where that's fine. > > > 3) Teach read stream to "look ahead" far enough to determine all the blocks > that could be issued in a batch outside of batchmode > > I think that's probably not a great idea, it'd lead us to looking further > ahead than we really need to, which could increase "unfairness" in > e.g. parallel sequential scan. > > > 4) Just defer using batch mode for now > > It's a nice win with io_uring for random IO, e.g. from bitmap heap scans , > but there's no need to immediately solve this. > > > I think regardless of what we go for, it's worth splitting > "aio: Basic read_stream adjustments for real AIO" > into the actually basic parts (i.e. introducing sync_mode) from the not > actually so basic parts (i.e. batching). > > > I suspect that 2) would be the best approach. Only the read stream user knows > what it needs to do in the callback. I still think 2) would be the best option. Writing a patch for that. If a callback may sometimes need to block, it can still opt into READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking. The hardest part is to explain the flag. Here's my current attempt: /* --- * Opt-in to using AIO batchmode. 
* * Submitting IO in larger batches can be more efficient than doing so * one-by-one, particularly for many small reads. It does, however, require * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not: * a) block without first calling pgaio_submit_staged(), unless a * to-be-waited-on lock cannot be part of a deadlock, e.g. because it is * never acquired in a nested fashion * b) directly or indirectly start another batch pgaio_enter_batchmode() * * As this requires care and is nontrivial in some cases, batching is only * used with explicit opt-in. * --- */ #define READ_STREAM_USE_BATCHING 0x08 Greetings, Andres Freund
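For illustration, a hypothetical caller opting in could look as below; read_stream_begin_relation() and block_range_read_stream_cb are existing read stream APIs, while strategy, rel and cb_data are placeholders:

/* safe only if the callback neither blocks nor starts its own batch */
stream = read_stream_begin_relation(READ_STREAM_FULL |
									READ_STREAM_USE_BATCHING,
									strategy,
									rel,
									MAIN_FORKNUM,
									block_range_read_stream_cb,
									&cb_data,
									0);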
On Tue, Mar 25, 2025 at 11:55 AM Andres Freund <andres@anarazel.de> wrote: > If a callback may sometimes need to block, it can still opt into > READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking. > > The hardest part is to explain the flag. Here's my current attempt: > > /* --- > * Opt-in to using AIO batchmode. > * > * Submitting IO in larger batches can be more efficient than doing so > * one-by-one, particularly for many small reads. It does, however, require > * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO > * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not: > * a) block without first calling pgaio_submit_staged(), unless a > * to-be-waited-on lock cannot be part of a deadlock, e.g. because it is > * never acquired in a nested fashion > * b) directly or indirectly start another batch pgaio_enter_batchmode() > * > * As this requires care and is nontrivial in some cases, batching is only > * used with explicit opt-in. > * --- > */ > #define READ_STREAM_USE_BATCHING 0x08 +1 I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE would be better, to highlight that you are making a declaration about a property of your callback, not just turning on an independent go-fast feature... I fished those words out of the main (?) description of this topic atop pgaio_enter_batchmode(). Just a thought, IDK.
Hi, On 2025-03-25 13:07:49 +1300, Thomas Munro wrote: > On Tue, Mar 25, 2025 at 11:55 AM Andres Freund <andres@anarazel.de> wrote: > > #define READ_STREAM_USE_BATCHING 0x08 > > +1 > > I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE > would be better, to highlight that you are making a declaration about > a property of your callback, not just turning on an independent > go-fast feature... I fished those words out of the main (?) > description of this topic atop pgaio_enter_batchmode(). Just a > thought, IDK. The relevant lines are already very deeply indented, so I'm a bit wary of such a long name. I think we'd basically have to use a separate flags variable everywhere and that is annoying due to us following C89 variable declaration positions... Greetings, Andres Freund
On Mon, Mar 24, 2025 at 11:43:47AM -0400, Andres Freund wrote: > On 2025-03-23 17:29:39 -0700, Noah Misch wrote: > > commit 247ce06b wrote: > > > + pgaio_io_reopen(ioh); > > > + > > > + /* > > > + * To be able to exercise the reopen-fails path, allow injection > > > + * points to trigger a failure at this point. > > > + */ > > > + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN"); > > > + > > > + error_errno = 0; > > > + error_ioh = NULL; > > > + > > > + /* > > > + * We don't expect this to ever fail with ERROR or FATAL, no need > > > + * to keep error_ioh set to the IO. > > > + * pgaio_io_perform_synchronously() contains a critical section to > > > + * ensure we don't accidentally fail. > > > + */ > > > + pgaio_io_perform_synchronously(ioh); > > > > A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee > > smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that > > interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a > > HOLD_INTERRUPTS() should surround the above region of code. It's likely hard > > to reproduce a problem, because pgaio_io_call_inj() does nothing in many > > builds, and pgaio_io_perform_synchronously() starts by entering a critical > > section. > > Hm, I guess you're right - it would be pretty bonkers for the injection to > process interrupts, but it's much better to clarify the code to make that not > an option. Once doing that, it seemed a similar assertion > in pgaio_io_before_prep() would be appropriate. Agreed. Following that line of thinking, the io_uring case needs to HOLD_INTERRUPTS() (or hold smgrrelease() specifically) all the way from pgaio_io_before_prep() to PGAIO_HS_SUBMITTED. The fd has to stay valid until io_uring_submit(). (We may be due for a test mode that does smgrreleaseall() at every CHECK_FOR_INTERRUPTS()?) > > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > > > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd > > > > > +int > > > +FileStartReadV(PgAioHandle *ioh, File file, > > > + int iovcnt, off_t offset, > > > + uint32 wait_event_info) > > > +{ > > > + int returnCode; > > > + Vfd *vfdP; > > > + > > > + Assert(FileIsValid(file)); > > > + > > > + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d", > > > + file, VfdCache[file].fileName, > > > + (int64) offset, > > > + iovcnt)); > > > + > > > + returnCode = FileAccess(file); > > > + if (returnCode < 0) > > > + return returnCode; > > > + > > > + vfdP = &VfdCache[file]; > > > + > > > + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset); > > > > FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED > > w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from > > functions so named. The "start" verb sounds to me like unconditional > > PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like > > the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at > > PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names > > FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do > > you see it? > > I have a surprisingly strong negative reaction to that proposed naming. To me > the staging is a distinct step that happens *after* the IO is fully > defined. Making all the layered calls that lead up to that named that way > would IMO be a bad idea.
As a general naming principle, I think the name of a function that advances through multiple named steps should mention the last step. Naming the function after just a non-last step feels weird to me. For example, serving a meal consists of steps menu_define, mix_ingredients, and plate_food. It would be weird to me if a function called meal_menu_define() mixed ingredients or plated food, but it's fine if meal_plate_food() does all three steps. A second strategy is to name both the first and last steps: meal_define_menu_thru_plate_food() is fine apart from being long. A third strategy is to have meal_plate_food() assert that meal_mix_ingredients() has been called. I wouldn't mind "staging" as a distinct step, but I think today's API boundaries hide the distinction. PGAIO_HS_DEFINED is a temporary state during a pgaio_io_stage() call, so the process that defines and stages the IO can observe PGAIO_HS_DEFINED only while pgaio_io_stage() is on the stack. The aforementioned "third strategy" could map to having distinct smgrdefinereadv() and smgrstagereadv(). I don't know how well that would work out overall. I wouldn't be optimistic about that winning, but I mention it for completeness. > I however don't particularly like the *start* or *prep* names, I've gone back > and forth on those a couple times. I could see "begin" work uniformly across > those. For ease of new readers understanding things, I think it helps for the functions that advance PgAioHandleState to have names that use words from PgAioHandleState. It's one less mapping to get into the reader's head. "Begin", "Start" and "prep" are all outside that taxonomy, making the reader learn how to map them to the taxonomy. What reward does the reader get at the end of that exercise? I'm not seeing one, but please do tell me what I'm missing here. > > > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well > > > > > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer) > > > Assert(refcount > 0); > > > if (refcount != 1) > > > return false; > > > + > > > + /* > > > + * Check that the AIO subsystem doesn't have a pin. Likely not > > > + * possible today, but better safe than sorry. > > > + */ > > > + bufHdr = GetLocalBufferDescriptor(-buffer - 1); > > > + buf_state = pg_atomic_read_u32(&bufHdr->state); > > > + refcount = BUF_STATE_GET_REFCOUNT(buf_state); > > > + Assert(refcount > 0); > > > + if (refcount != 1) > > > + return false; > > > + > > > > LockBufferForCleanup() should get code like this > > ConditionalLockBufferForCleanup() code, either now or when "not possible > > today" ends. Currently, it just assumes all local buffers are > > cleanup-lockable: > > > > /* Nobody else to wait for */ > > if (BufferIsLocal(buffer)) > > return; > > Kinda, yes, kinda no? LockBufferForCleanup() assumes, even for shared > buffers, that the current backend can't be doing anything that conflicts with > acquiring a buffer pin - note that it doesn't check the backend local pincount > for shared buffers either. It checks the local pincount via callee CheckBufferIsPinnedOnce(). As the patch stands, LockBufferForCleanup() can succeed when ConditionalLockBufferForCleanup() would have returned false. I'm not seeking to raise the overall standard of the *Cleanup() family of functions, but I am trying to keep members of that family agreeing on the standard. Like the comment, I expect it's academic today. I expect it will stay academic. 
Anything that does a cleanup will start by reading the buffer, which will resolve any refcnt the AIO subsystem holds for a read. If there's an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on that. How about just removing the ConditionalLockBufferForCleanup() changes or replacing them with a comment (like the present paragraph)? > I think this points to the comment above the WaitIO() in InvalidateBuffer() > needing a bit of adapting - an in-progress read can trigger the WaitIO as > well. Something like: > > /* > * We assume the reason for it to be pinned is that either we were > * asynchronously reading the page in before erroring out or someone else > * is flushing the page out. Wait for the IO to finish. (This could be > * an infinite loop if the refcount is messed up... it would be nice to > * time out after awhile, but there seems no way to be sure how many loops > * may be needed. Note that if the other guy has pinned the buffer but > * not yet done StartBufferIO, WaitIO will fall through and we'll > * effectively be busy-looping here.) > */ Agreed. > > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, > > > + bool failed, bool is_temp) > > > +{ > > ... > > > + PgAioResult result; > > ... > > > + result.status = PGAIO_RS_OK; > > ... > > > + return result; > > > > gcc 14.2.0 -Werror gives me: > > > > bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized] > > Gngngng. Since when is it a bug for some fields of a struct to be > uninitialized, as long as they're not used? > > Interestingly I don't see that warning, despite also using gcc 14.2.0. I badly neglected to mention my non-default flags: CFLAGS='-O2 -fno-sanitize-recover=all -fsanitize=address,alignment,undefined --param=max-vartrack-size=150000000 -ftrivial-auto-var-init=pattern' COPT=-Werror -Wno-error=array-bounds Final CFLAGS, including the ones "configure" elects on its own: configure: using CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wmissing-variable-declarations -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-format-truncation -Wno-cast-function-type-strict -g -O2 -fno-sanitize-recover=all -fsanitize=address,alignment,undefined --param=max-vartrack-size=150000000 -ftrivial-auto-var-init=pattern (I use -Wno-error=array-bounds because the sanitizer options elicit a lot of those warnings. Today's master is free from maybe-uninitialized warnings in this configuration, though.) > I'll just move to your solution, but it seems odd. Got it. > I think regardless of what we go for, it's worth splitting > "aio: Basic read_stream adjustments for real AIO" > into the actually basic parts (i.e. introducing sync_mode) from the not > actually so basic parts (i.e. batching). Fair. On Mon, Mar 24, 2025 at 06:55:22PM -0400, Andres Freund wrote: > Hi, > > On 2025-03-24 11:43:47 -0400, Andres Freund wrote: > > > I didn't write a test to prove it, but I'm suspecting we'll reach the above > > > ERROR with this sequence: > > > > > > CREATE TEMP TABLE foo ...; > > > [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing] > > > DROP TABLE foo; > > > > That seems plausible. I'll try to write a test after this email. > > FWIW, a test did indeed confirm that. 
Luckily: > > > > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I > > > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for > > > the particular rel) before InvalidateLocalBuffer(). Or use something like the > > > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in > > > corresponding bufmgr code. > > > > Just waiting for the IO in InvalidateBuffer() does seem like the best bet to > > me. > > This did indeed resolve the issue. I'm happy with that approach. On Tue, Mar 25, 2025 at 01:07:49PM +1300, Thomas Munro wrote: > On Tue, Mar 25, 2025 at 11:55 AM Andres Freund <andres@anarazel.de> wrote: > > If a callback may sometimes need to block, it can still opt into > > READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking. > > > > The hardest part is to explain the flag. Here's my current attempt: > > > > /* --- > > * Opt-in to using AIO batchmode. > > * > > * Submitting IO in larger batches can be more efficient than doing so > > * one-by-one, particularly for many small reads. It does, however, require > > * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO > > * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not: > > * a) block without first calling pgaio_submit_staged(), unless a > > * to-be-waited-on lock cannot be part of a deadlock, e.g. because it is > > * never acquired in a nested fashion > > * b) directly or indirectly start another batch pgaio_enter_batchmode() I think a callback could still do: pgaio_exit_batchmode() ... arbitrary code that might reach pgaio_enter_batchmode() ... pgaio_enter_batchmode() > > * > > * As this requires care and is nontrivial in some cases, batching is only > > * used with explicit opt-in. > > * --- > > */ > > #define READ_STREAM_USE_BATCHING 0x08 > > +1 Agreed. It's simple, and there's no loss of generality. > I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE > would be better, to highlight that you are making a declaration about > a property of your callback, not just turning on an independent > go-fast feature... I fished those words out of the main (?) > description of this topic atop pgaio_enter_batchmode(). Just a > thought, IDK. Good points. I lean toward your renaming suggestion, or shortening to READ_STREAM_BATCHMODE_AWARE or READ_STREAM_BATCH_OK. I'm also fine with the original name, though. Thanks, nm
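To spell out that escape hatch, a hypothetical batchmode-aware callback might look like this; every my_* and MyScanState name here is invented for illustration:

static BlockNumber
my_next_block_cb(ReadStream *stream,
				 void *callback_private_data,
				 void *per_buffer_data)
{
	MyScanState *state = callback_private_data;

	if (my_needs_vm_probe(state))
	{
		/*
		 * Step out of the caller's batch around work that may block or
		 * start its own batch, then re-enter it before returning.
		 */
		pgaio_exit_batchmode();
		my_blocking_vm_probe(state);
		pgaio_enter_batchmode();
	}

	return my_next_block(state);
}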
Hi, Attached v2.12, with the following changes: - Pushed the max_files_per_process change I plan to look at which parts of Jelte's change are worth doing on top. Thanks for the review, Noah. - Rebased over Thomas' commit of the remaining read stream changes Yay! - Addressed Noah's review comments - Added another test to test_aio/, to test that changing io_workers while running works, and that workers are restarted if terminated Written by Bilal - Made InvalidateLocalBuffer wait for IO if necessary As reported / suggested by Noah - Added tests for dropping tables with ongoing IO This failed, as Noah predicted, without the InvalidateLocalBuffer() change. - Added a commit to explicitly hold interrupts in workers after pgaio_io_reopen() As suggested by Noah. - Added a commit to fix a logic error around what gets passed to ioh->report_return - this led to temporary buffer validation errors not being reported Discovered while extending the tests, as noted in the next point. I could see a few different "formulations" of this change (e.g. the report_return stuff could be populated by pgaio_io_call_complete_local() instead), but I don't think it matters much. - Add temporary table coverage to test_aio This required changing test_aio.c to cope with temporary tables as well. - io_uring tests don't run anymore when built with EXEC_BACKEND and liburing enabled - Split the read stream patch into two Noah, quite rightly, pointed out that it's not safe to use batching if the next-block callback may block (or start its own batch). The best idea seems to be to make users of read stream opt in to batching. I've done that in a patch that uses it where it seems safe without doing extra work. See also the commit message. - Added a commit to add I/O, Asynchronous I/O glossary and acronym entries - Docs for pg_aios - Renamed pg_aios.offset to off, to avoid use of a keyword - Updated the io_uring wait event name while waiting for IOs to complete to AIO_IO_URING_COMPLETION and updated the description of AIO_IO_COMPLETION to "Waiting for another process to complete IO." I think this is a mix of different suggestions by Noah. TODO: - There are more tests in test_aio that should be expanded to run for temp tables as well, not just normal tables - Add an explicit test for the checksum verification in the completion callback There is an existing test for an invalid page detected by page header verification in test_aio, but not for checksum failures. I think it's indirectly covered (e.g. in amcheck), but seems better to test it explicitly. Wonder if it's worth adding some coverage for when checksums are disabled? Probably not necessary? Greetings, Andres Freund
Attachment
- v2.12-0001-aio-Be-more-paranoid-about-interrupts.patch
- v2.12-0002-aio-Pass-result-of-local-callbacks-to-report_r.patch
- v2.12-0003-aio-Add-liburing-dependency.patch
- v2.12-0004-aio-Add-io_method-io_uring.patch
- v2.12-0005-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.12-0006-aio-Add-README.md-explaining-higher-level-desi.patch
- v2.12-0007-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.12-0008-bufmgr-Implement-AIO-read-support.patch
- v2.12-0009-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.12-0010-aio-Basic-read_stream-adjustments-for-real-AIO.patch
- v2.12-0011-read_stream-Introduce-and-use-optional-batchmo.patch
- v2.12-0012-docs-Reframe-track_io_timing-related-docs-as-w.patch
- v2.12-0013-Enable-IO-concurrency-on-all-systems.patch
- v2.12-0014-docs-Add-acronym-and-glossary-entries-for-I-O-.patch
- v2.12-0015-aio-Add-pg_aios-view.patch
- v2.12-0016-aio-Add-test_aio-module.patch
- v2.12-0017-aio-bufmgr-Comment-fixes.patch
- v2.12-0018-aio-Experimental-heuristics-to-increase-batchi.patch
- v2.12-0019-aio-Implement-smgr-md-fd-write-support.patch
- v2.12-0020-aio-Add-bounce-buffers.patch
- v2.12-0021-bufmgr-Implement-AIO-write-support.patch
- v2.12-0022-aio-Add-IO-queue-helper.patch
- v2.12-0023-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.12-0024-Ensure-a-resowner-exists-for-all-paths-that-ma.patch
- v2.12-0025-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.12-0026-WIP-Use-MAP_POPULATE.patch
Hi, On 2025-03-23 09:32:48 -0700, Noah Misch wrote: > Another candidate description string: > > AIO_COMPLETED_SHARED "Waiting for another process to complete IO." I liked that one and adopted it. > > A more minimal change would be to narrow AIO_IO_URING_COMPLETION to > > "execution" or something like that, to hint at a separation between the raw IO > > being completed and the IO, including the callbacks completing. > > Yes, that would work for me. I updated both the name and the description of this one to EXECUTION, but I'm not sure I like it for the name... > > > > --- a/doc/src/sgml/config.sgml > > > > +++ b/doc/src/sgml/config.sgml > > > > @@ -2710,6 +2710,12 @@ include_dir 'conf.d' > > > > <literal>worker</literal> (execute asynchronous I/O using worker processes) > > > > </para> > > > > </listitem> > > > > + <listitem> > > > > + <para> > > > > + <literal>io_uring</literal> (execute asynchronous I/O using > > > > + io_uring, if available) > > > > > > I feel the "if available" doesn't quite fit, since we'll fail if unavailable. > > > Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux" > > > there to reduce surprise on other platforms. > > > > You're right, the if available can be misunderstood. But not mentioning that > > it's an optional dependency seems odd too. What about something like > > > > <para> > > <literal>io_uring</literal> (execute asynchronous I/O using > > io_uring, requires postgres to have been built with > > <link linkend="configure-option-with-liburing"><option>--with-liburing</option></link> / > > <link linkend="configure-with-liburing-meson"><option>-Dliburing</option></link>) > > </para> > > I'd change s/postgres to have been built/a build with/ since the SGML docs > don't use the term "postgres" that way. Otherwise, that works for me. Went with that. Greetings, Andres Freund
On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support

[I checked that v2.12 doesn't invalidate these review comments, but I
didn't technically rebase the review onto v2.12's line numbers.]

>  static void
>  TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
> -				  bool forget_owner)
> +				  bool forget_owner, bool syncio)
>  {
>  	uint32		buf_state;
>
> @@ -5586,6 +5636,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
>  	if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
>  		buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
>
> +	if (!syncio)
> +	{
> +		/* release ownership by the AIO subsystem */
> +		Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
> +		buf_state -= BUF_REFCOUNT_ONE;
> +		pgaio_wref_clear(&buf->io_wref);
> +	}

Looking at the callers:

ZeroAndLockBuffer[1083]          TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
ExtendBufferedRelShared[2869]    TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
FlushBuffer[4827]                TerminateBufferIO(buf, true, 0, true, true);
AbortBufferIO[6637]              TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
buffer_readv_complete_one[7279]  TerminateBufferIO(buf_hdr, false, set_flag_bits, false, false);
buffer_writev_complete_one[7427] TerminateBufferIO(buf_hdr, clear_dirty, set_flag_bits, false, false);

I think we can improve on the "syncio" arg name.  The first two aren't
doing IO, and AbortBufferIO() may be cleaning up what would have been an
AIO if it hadn't failed early.  Perhaps name the arg "release_aio" and
pass release_aio=true instead of syncio=false (release_aio = !syncio).

> + * about which buffers are target by IO can be hard to debug, making

s/target/targeted/

> +static pg_attribute_always_inline PgAioResult
> +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> +						  bool failed, bool is_temp)
> +{
...
> +		if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> +		{
> +			ereport(WARNING,
> +					(errcode(ERRCODE_DATA_CORRUPTED),
> +					 errmsg("invalid page in block %u of relation %s; zeroing out page",

My earlier review requested s/LOG/WARNING/, but I wasn't thinking about
this in full depth.  In the !is_temp case, this runs in a complete_shared
callback.  A process unrelated to the original IO may run this callback.
That's unfortunate in two ways.  First, that other process's client gets
an unexpected WARNING.  The process getting the WARNING may not even have
zero_damaged_pages enabled.  Second, the client of the process that
staged the IO gets no message.

AIO ERROR-level messages handle this optimally.  We emit a LOG-level
message in the process that runs the complete_shared callback, and we
arrange for the ERROR-level message in the stager.  That would be ideal
here: LOG in the complete_shared runner, WARNING in the stager.

One could simplify things by forcing io_method=sync under ZERO_ON_ERROR
|| zero_damaged_pages, perhaps as a short-term approach.

Thoughts?
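To make the rename suggestion concrete, the relevant part of
TerminateBufferIO() could read as follows. This is a sketch based on the
quoted hunk, not code from the actual patch; the elided parts are marked:

	static void
	TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
					  bool forget_owner, bool release_aio)
	{
		uint32		buf_state = LockBufHdr(buf);

		/* ... dirty-flag handling as in the quoted hunk ... */

		if (release_aio)
		{
			/* release ownership by the AIO subsystem */
			Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
			buf_state -= BUF_REFCOUNT_ONE;
			pgaio_wref_clear(&buf->io_wref);
		}

		/* ... set flags, unlock, wake waiters as before ... */
	}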
Hi,

On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> (We may be due for a test mode that does smgrreleaseall() at every
> CHECK_FOR_INTERRUPTS()?)

I suspect we are.  I'm a bit afraid of even trying...

...

It's extremely slow - but at least the main regression as well as the aio
tests pass!

> > I however don't particularly like the *start* or *prep* names, I've gone back
> > and forth on those a couple times. I could see "begin" work uniformly across
> > those.
>
> For ease of new readers understanding things, I think it helps for the
> functions that advance PgAioHandleState to have names that use words from
> PgAioHandleState.  It's one less mapping to get into the reader's head.

Unfortunately for me it's kind of the opposite in this case, see below.

> "Begin", "Start" and "prep" are all outside that taxonomy, making the reader
> learn how to map them to the taxonomy.  What reward does the reader get at the
> end of that exercise?  I'm not seeing one, but please do tell me what I'm
> missing here.

Because the end state varies, depending on the number of previously
staged IOs, the IO method and whether batchmode is enabled, I think it's
better if the "function naming pattern" (i.e. FileStartReadv,
smgrstartreadv etc) is *not* aligned with an internal state name.  It
will just mislead readers to think that there's a deterministic mapping
when that does not exist.

That's not an excuse for pgaio_io_prep* though, that's a pointlessly
different naming that I just stopped seeing.

I'll try to think more about this, perhaps I can make myself see your POV
more.

> > > > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well

> > > LockBufferForCleanup() should get code like this
> > > ConditionalLockBufferForCleanup() code, either now or when "not possible
> > > today" ends.  Currently, it just assumes all local buffers are
> > > cleanup-lockable:
> > >
> > > 	/* Nobody else to wait for */
> > > 	if (BufferIsLocal(buffer))
> > > 		return;
> >
> > Kinda, yes, kinda no? LockBufferForCleanup() assumes, even for shared
> > buffers, that the current backend can't be doing anything that conflicts with
> > acquiring a buffer pin - note that it doesn't check the backend local pincount
> > for shared buffers either.
>
> It checks the local pincount via callee CheckBufferIsPinnedOnce().

In exactly one of the callers :/

> As the patch stands, LockBufferForCleanup() can succeed when
> ConditionalLockBufferForCleanup() would have returned false.

That's already true today, right?  In master
ConditionalLockBufferForCleanup() for temp buffers checks LocalRefCount,
whereas LockBufferForCleanup() doesn't.

I think I agree with your suggestion further below, but independent of
that, I don't see how the current modification in the patch makes this
any worse.

Historically this behaviour of LockBufferForCleanup() kinda somewhat
makes sense - the only place we use LockBufferForCleanup() is in a
non-transactional command, i.e. vacuum / index vacuum.  So
LockBufferForCleanup() turns out to only be safe in that context.

> Like the comment, I expect it's academic today.  I expect it will stay
> academic.  Anything that does a cleanup will start by reading the buffer,
> which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> that.  How about just removing the ConditionalLockBufferForCleanup() changes
> or replacing them with a comment (like the present paragraph)?
I think we'll need an expanded version of what I suggest once we have
writes - but as you say, it shouldn't matter as long as we only have
reads.  So I think moving the relevant changes, with adjusted caveats, to
the bufmgr: write change makes sense.

> > > /* ---
> > >  * Opt-in to using AIO batchmode.
> > >  *
> > >  * Submitting IO in larger batches can be more efficient than doing so
> > >  * one-by-one, particularly for many small reads.  It does, however, require
> > >  * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > >  * batching (c.f. pgaio_enter_batchmode()).  Basically, the callback may not:
> > >  * a) block without first calling pgaio_submit_staged(), unless a
> > >  *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > >  *    never acquired in a nested fashion
> > >  * b) directly or indirectly start another batch pgaio_enter_batchmode()

> I think a callback could still do:
>
>     pgaio_exit_batchmode()
>     ... arbitrary code that might reach pgaio_enter_batchmode() ...
>     pgaio_enter_batchmode()

Yea - but I somehow doubt there are many cases where it makes sense to
deep-queue IOs within the callback.  The cases I can think of are things
like ensuring the right VM buffer is in s_b.  But if it turns out to be
necessary, what you suggest would be an out.

Do you think it's worth mentioning the above workaround?  I'm mildly
inclined not to.  If it turns out to be actually useful to do nested
batching, we can change it so that nested batching *is* allowed, that'd
not be hard.

> > >  *
> > >  * As this requires care and is nontrivial in some cases, batching is only
> > >  * used with explicit opt-in.
> > >  * ---
> > >  */
> > > #define READ_STREAM_USE_BATCHING	0x08
> >
> > +1
>
> Agreed.  It's simple, and there's no loss of generality.
>
> > I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE
> > would be better, to highlight that you are making a declaration about
> > a property of your callback, not just turning on an independent
> > go-fast feature...  I fished those words out of the main (?)
> > description of this topic atop pgaio_enter_batchmode().  Just a
> > thought, IDK.
>
> Good points.  I lean toward your renaming suggestion, or shortening to
> READ_STREAM_BATCHMODE_AWARE or READ_STREAM_BATCH_OK.  I'm also fine with the
> original name, though.

I'm ok with all of these.  In order of preference:

1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
2) READ_STREAM_BATCHMODE_AWARE
3) READ_STREAM_CALLBACK_BATCHMODE_AWARE

Greetings,

Andres Freund
Hi,

On 2025-03-24 19:20:37 -0700, Noah Misch wrote:
> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> >  static void
> >  TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
> > -				  bool forget_owner)
> > +				  bool forget_owner, bool syncio)
> ...
> Looking at the callers:
>
> ZeroAndLockBuffer[1083]          TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
> ExtendBufferedRelShared[2869]    TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
> FlushBuffer[4827]                TerminateBufferIO(buf, true, 0, true, true);
> AbortBufferIO[6637]              TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
> buffer_readv_complete_one[7279]  TerminateBufferIO(buf_hdr, false, set_flag_bits, false, false);
> buffer_writev_complete_one[7427] TerminateBufferIO(buf_hdr, clear_dirty, set_flag_bits, false, false);
>
> I think we can improve on the "syncio" arg name.  The first two aren't doing
> IO, and AbortBufferIO() may be cleaning up what would have been an AIO if it
> hadn't failed early.  Perhaps name the arg "release_aio" and pass
> release_aio=true instead of syncio=false (release_aio = !syncio).

Yes, I think that makes sense.  Will do that tomorrow.

> > +static pg_attribute_always_inline PgAioResult
> > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > +						  bool failed, bool is_temp)
> > +{
> ...
> > +		if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> > +		{
> > +			ereport(WARNING,
> > +					(errcode(ERRCODE_DATA_CORRUPTED),
> > +					 errmsg("invalid page in block %u of relation %s; zeroing out page",
>
> My earlier review requested s/LOG/WARNING/, but I wasn't thinking about this
> in full depth.  In the !is_temp case, this runs in a complete_shared
> callback.  A process unrelated to the original IO may run this callback.
> That's unfortunate in two ways.  First, that other process's client gets an
> unexpected WARNING.  The process getting the WARNING may not even have
> zero_damaged_pages enabled.  Second, the client of the process that staged
> the IO gets no message.

Ah, right.  That could be why I had flipped it.  If so, shame on me for
not adding a comment...

> AIO ERROR-level messages handle this optimally.  We emit a LOG-level message
> in the process that runs the complete_shared callback, and we arrange for the
> ERROR-level message in the stager.  That would be ideal here: LOG in the
> complete_shared runner, WARNING in the stager.

We could obviously downgrade (crossgrade? a LOG is more severe than a
WARNING in some ways, but not others) the message when run in a different
backend fairly easily.  Still emitting a WARNING in the stager however is
a bit more tricky.

Before thinking more deeply about how we could emit the WARNING in the
stager:

Is it actually sane to use WARNING here?  At least for ZERO_ON_ERROR that
could trigger a rather massive flood of messages to the client in a
*normal* situation.  I'm thinking of something like an insert extending a
relation some time after an immediate restart and encountering a lot of
FSM corruption (due to its non-crash-safe-ness) during the search for
free space and the subsequent FSM vacuum.  It might be ok to LOG that,
but sending a lot of WARNINGs to the client seems not quite right.

If we want to implement it, I think we could introduce PGAIO_RS_WARN,
which then could tell the stager to issue the WARNING.  It would add a
bit of distributed cost, both to callbacks and users of AIO, but it might
not be too bad.
> One could simplify things by forcing io_method=sync under ZERO_ON_ERROR ||
> zero_damaged_pages, perhaps as a short-term approach.

Yea, that could work.  Perhaps even just for zero_damaged_pages, after
changing it so that ZERO_ON_ERROR always just LOGs.

Hm, it seems somewhat nasty to have rather different performance
characteristics when forced to use zero_damaged_pages to recover from a
problem.  Imagine an instance that's configured to use DIO and then needs
to use zero_damaged_pages to recover from corruption...

/me adds writing a test for both ZERO_ON_ERROR and zero_damaged_pages to
the TODO.

Greetings,

Andres Freund
On Tue, Mar 25, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> Attached v2.12, with the following changes:

Here's a tiny fixup to make io_concurrency=0 turn on
READ_BUFFERS_SYNCHRONOUSLY as mooted in a FIXME.  Without this, AIO will
still run at level 1 even if you asked for 0.  Feel free to squash, or
ignore and I'll push it later, whatever suits...  (tested on the tip of
your public aio-2 branch).
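The shape of that fixup is presumably something along these lines in
read_stream.c's setup path. This is a sketch of the described behavior, not
the attached patch; the variable and field names are assumptions, while
READ_BUFFERS_SYNCHRONOUSLY is from the patch series:

	/*
	 * Sketch: effective_io_concurrency = 0 still needs one read in
	 * flight, but it should be executed synchronously rather than via
	 * AIO.
	 */
	if (max_ios == 0)
	{
		max_ios = 1;
		stream->read_buffers_flags |= READ_BUFFERS_SYNCHRONOUSLY;	/* assumed field */
	}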
On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > (We may be due for a test mode that does smgrreleaseall() at every
> > CHECK_FOR_INTERRUPTS()?)
>
> I suspect we are. I'm a bit afraid of even trying...
>
> ...
>
> It's extremely slow - but at least the main regression as well as the aio tests pass!

One less thing!

> > > I however don't particularly like the *start* or *prep* names, I've gone back
> > > and forth on those a couple times. I could see "begin" work uniformly across
> > > those.
> >
> > For ease of new readers understanding things, I think it helps for the
> > functions that advance PgAioHandleState to have names that use words from
> > PgAioHandleState.  It's one less mapping to get into the reader's head.
>
> Unfortunately for me it's kind of the opposite in this case, see below.
>
> > "Begin", "Start" and "prep" are all outside that taxonomy, making the reader
> > learn how to map them to the taxonomy.  What reward does the reader get at the
> > end of that exercise?  I'm not seeing one, but please do tell me what I'm
> > missing here.
>
> Because the end state varies, depending on the number of previously staged
> IOs, the IO method and whether batchmode is enabled, I think it's better if
> the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> *not* aligned with an internal state name. It will just mislead readers to
> think that there's a deterministic mapping when that does not exist.

That's fair.  Could we provide the mapping in a comment, something like
the following?

--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -34,5 +34,10 @@
  * linearly through all states.
  *
- * State changes should all go through pgaio_io_update_state().
+ * State changes should all go through pgaio_io_update_state().  Its callers
+ * use these naming conventions:
+ *
+ * - A "start" function (e.g. FileStartReadV()) moves an IO from
+ *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
+ *   PGAIO_HS_COMPLETED_LOCAL.
  */
 typedef enum PgAioHandleState

> That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> naming that I just stopped seeing.
>
> I'll try to think more about this, perhaps I can make myself see your POV
> more.

> > As the patch stands, LockBufferForCleanup() can succeed when
> > ConditionalLockBufferForCleanup() would have returned false.
>
> That's already true today, right? In master ConditionalLockBufferForCleanup()
> for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.

I'm finding a LocalRefCount check under LockBufferForCleanup:

LockBufferForCleanup(Buffer buffer)
{
...
	CheckBufferIsPinnedOnce(buffer);

CheckBufferIsPinnedOnce(Buffer buffer)
{
	if (BufferIsLocal(buffer))
	{
		if (LocalRefCount[-buffer - 1] != 1)
			elog(ERROR, "incorrect local pin count: %d",
				 LocalRefCount[-buffer - 1]);
	}
	else
	{
		if (GetPrivateRefCount(buffer) != 1)
			elog(ERROR, "incorrect local pin count: %d",
				 GetPrivateRefCount(buffer));
	}
}

> > Like the comment, I expect it's academic today.  I expect it will stay
> > academic.  Anything that does a cleanup will start by reading the buffer,
> > which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > or replacing them with a comment (like the present paragraph)?
>
> I think we'll need an expanded version of what I suggest once we have writes -
> but as you say, it shouldn't matter as long as we only have reads.  So I think
> moving the relevant changes, with adjusted caveats, to the bufmgr: write
> change makes sense.

Moving those changes works for me.  I'm not currently seeing the need
under writes, but that may get clearer upon reaching those patches.

> > > > /* ---
> > > >  * Opt-in to using AIO batchmode.
> > > >  *
> > > >  * Submitting IO in larger batches can be more efficient than doing so
> > > >  * one-by-one, particularly for many small reads.  It does, however, require
> > > >  * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > > >  * batching (c.f. pgaio_enter_batchmode()).  Basically, the callback may not:
> > > >  * a) block without first calling pgaio_submit_staged(), unless a
> > > >  *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > > >  *    never acquired in a nested fashion
> > > >  * b) directly or indirectly start another batch pgaio_enter_batchmode()
>
> > I think a callback could still do:
> >
> >     pgaio_exit_batchmode()
> >     ... arbitrary code that might reach pgaio_enter_batchmode() ...
> >     pgaio_enter_batchmode()
>
> Yea - but I somehow doubt there are many cases where it makes sense to
> deep-queue IOs within the callback.  The cases I can think of are things like
> ensuring the right VM buffer is in s_b.  But if it turns out to be necessary,
> what you suggest would be an out.

I don't foresee a callback specifically wanting to batch, but callbacks
might call into other infrastructure that can elect to batch.  The
exit+reenter pattern would be better than adding no-batch options to
other infrastructure.

> Do you think it's worth mentioning the above workaround?  I'm mildly inclined
> not to.

Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
exit+reenter is banned.  Maybe "(b) start another batch (without first
exiting one)".  It's also fine as-is, though.

> If it turns out to be actually useful to do nested batching, we can change it
> so that nested batching *is* allowed, that'd not be hard.

Good point.

> I'm ok with all of these.  In order of preference:
>
> 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
> 2) READ_STREAM_BATCHMODE_AWARE
> 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE

Same for me.
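For illustration, the exit+reenter escape hatch inside a next-block
callback could look like the following. This is a sketch: my_block_cb,
do_blocking_work() and next_block_to_read() are hypothetical, while the two
pgaio_* calls and the callback signature are the real API:

	static BlockNumber
	my_block_cb(ReadStream *stream, void *callback_private_data,
				void *per_buffer_data)
	{
		/*
		 * Temporarily leave batchmode so arbitrary blocking code -
		 * which may itself enter and exit a batch - is safe to run
		 * here.
		 */
		pgaio_exit_batchmode();

		do_blocking_work();				/* hypothetical */

		pgaio_enter_batchmode();

		return next_block_to_read();	/* hypothetical */
	}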
On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> On 2025-03-24 19:20:37 -0700, Noah Misch wrote:
> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > > +static pg_attribute_always_inline PgAioResult
> > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > > +						  bool failed, bool is_temp)
> > > +{
> > ...
> > > +		if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> > > +		{
> > > +			ereport(WARNING,
> > > +					(errcode(ERRCODE_DATA_CORRUPTED),
> > > +					 errmsg("invalid page in block %u of relation %s; zeroing out page",
> >
> > My earlier review requested s/LOG/WARNING/, but I wasn't thinking about this
> > in full depth.  In the !is_temp case, this runs in a complete_shared
> > callback.  A process unrelated to the original IO may run this callback.
> > That's unfortunate in two ways.  First, that other process's client gets an
> > unexpected WARNING.  The process getting the WARNING may not even have
> > zero_damaged_pages enabled.  Second, the client of the process that staged
> > the IO gets no message.
>
> Ah, right.  That could be why I had flipped it.  If so, shame on me for not
> adding a comment...
>
> > AIO ERROR-level messages handle this optimally.  We emit a LOG-level message
> > in the process that runs the complete_shared callback, and we arrange for the
> > ERROR-level message in the stager.  That would be ideal here: LOG in the
> > complete_shared runner, WARNING in the stager.
>
> We could obviously downgrade (crossgrade? a LOG is more severe than a
> WARNING in some ways, but not others) the message when run in a different
> backend fairly easily.  Still emitting a WARNING in the stager however is a
> bit more tricky.
>
> Before thinking more deeply about how we could emit the WARNING in the stager:
>
> Is it actually sane to use WARNING here?  At least for ZERO_ON_ERROR that could
> trigger a rather massive flood of messages to the client in a *normal*
> situation.  I'm thinking of something like an insert extending a relation some
> time after an immediate restart and encountering a lot of FSM corruption (due
> to its non-crash-safe-ness) during the search for free space and the
> subsequent FSM vacuum.  It might be ok to LOG that, but sending a lot of
> WARNINGs to the client seems not quite right.

Orthogonal to AIO, I do think LOG (or even DEBUG1?) is better for
ZERO_ON_ERROR.  The ZERO_ON_ERROR case also should not use
ERRCODE_DATA_CORRUPTED.  (That errcode shouldn't appear for business as
usual.  It should signify wrong or irretrievable query results,
essentially.)

For zero_damaged_pages, WARNING seems at least defensible, and
ERRCODE_DATA_CORRUPTED is right.  It wouldn't be the worst thing to
change zero_damaged_pages to LOG and let the complete_shared runner log
it, as long as we release-note that.  It's superuser-only, and the
superuser can learn to check the log.  One typically should use
zero_damaged_pages in one session at a time, so the logs won't be too
confusing.

Another thought on complete_shared running on other backends: I wonder if
we should push an ErrorContextCallback that adds "CONTEXT: completing I/O
of other process" or similar, so people wonder less about how "SELECT
FROM a" led to a log message about IO on table "b".

> If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> then could tell the stager to issue the WARNING.  It would add a bit of
> distributed cost, both to callbacks and users of AIO, but it might not be too
> bad.
> > One could simplify things by forcing io_method=sync under ZERO_ON_ERROR ||
> > zero_damaged_pages, perhaps as a short-term approach.
>
> Yea, that could work.  Perhaps even just for zero_damaged_pages, after
> changing it so that ZERO_ON_ERROR always just LOGs.

Yes.

> Hm, it seems somewhat nasty to have rather different performance
> characteristics when forced to use zero_damaged_pages to recover from a
> problem.  Imagine an instance that's configured to use DIO and then needs to
> use zero_damaged_pages to recover from corruption...

True.  I'd be willing to bet high-scale use of zero_damaged_pages is
rare.  By high scale, I mean something like reading a whole large table,
as opposed to a TID scan of the known-problematic range.  That said,
people (including me) expect the emergency tools to be good even if
they're used rarely.  You're not wrong to worry about it.
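The ErrorContextCallback idea from upthread would follow the usual
errcontext pattern. A minimal sketch, where the function names and the
plumbing that records the stager's pid are assumptions; the
ErrorContextCallback mechanics are the real API:

	static void
	shared_completion_errcontext(void *arg)
	{
		errcontext("completing I/O of other process %d", *(int *) arg);
	}

	static void
	run_complete_shared_callbacks(int stager_pid)	/* hypothetical */
	{
		ErrorContextCallback errcallback;

		errcallback.callback = shared_completion_errcontext;
		errcallback.arg = &stager_pid;
		errcallback.previous = error_context_stack;
		error_context_stack = &errcallback;

		/* ... invoke the complete_shared callbacks here ... */

		error_context_stack = errcallback.previous;
	}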
Hi,

On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > (We may be due for a test mode that does smgrreleaseall() at every
> > > CHECK_FOR_INTERRUPTS()?)
> >
> > I suspect we are. I'm a bit afraid of even trying...
> >
> > ...
> >
> > It's extremely slow - but at least the main regression as well as the aio tests pass!
>
> One less thing!

Unfortunately I'm now doubting the thoroughness of my check - while I
made every CFI() execute smgrreleaseall(), I didn't trigger CFI() in
cases where we trigger it conditionally.  E.g. elog(DEBUGN, ...) only
executes a CFI if log_min_messages <= DEBUGN...

I'll try that in a bit.

> > Because the end state varies, depending on the number of previously staged
> > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > *not* aligned with an internal state name. It will just mislead readers to
> > think that there's a deterministic mapping when that does not exist.
>
> That's fair.  Could we provide the mapping in a comment, something like the
> following?

Yes!

I wonder if it should also be duplicated or referenced elsewhere,
although I am not sure where precisely.

> --- a/src/include/storage/aio_internal.h
> +++ b/src/include/storage/aio_internal.h
> @@ -34,5 +34,10 @@
>   * linearly through all states.
>   *
> - * State changes should all go through pgaio_io_update_state().
> + * State changes should all go through pgaio_io_update_state().  Its callers
> + * use these naming conventions:
> + *
> + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> + *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> + *   PGAIO_HS_COMPLETED_LOCAL.
>   */
>  typedef enum PgAioHandleState

One detail I'm not sure about: The above change is correct, but perhaps a
bit misleading, because we can actually go "back" to IDLE.  Not sure how
to best phrase that though.

> > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> > naming that I just stopped seeing.

I assume you're on board with renaming _io_prep* to _io_start_*?

> > I'll try to think more about this, perhaps I can make myself see your POV
> > more.

> > > As the patch stands, LockBufferForCleanup() can succeed when
> > > ConditionalLockBufferForCleanup() would have returned false.
> >
> > That's already true today, right? In master ConditionalLockBufferForCleanup()
> > for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.
>
> I'm finding a LocalRefCount check under LockBufferForCleanup:

I guess I should have stopped looking at code / replying before my last
email last night...  Not sure how I missed that.

> CheckBufferIsPinnedOnce(Buffer buffer)
> {
> 	if (BufferIsLocal(buffer))
> 	{
> 		if (LocalRefCount[-buffer - 1] != 1)
> 			elog(ERROR, "incorrect local pin count: %d",
> 				 LocalRefCount[-buffer - 1]);
> 	}
> 	else
> 	{
> 		if (GetPrivateRefCount(buffer) != 1)
> 			elog(ERROR, "incorrect local pin count: %d",
> 				 GetPrivateRefCount(buffer));
> 	}
> }

Pretty random orthogonal thought, that I was reminded of by the above
code snippet:

It sure seems we should at some point get rid of LocalRefCount[] and just
use the GetPrivateRefCount() infrastructure for both shared and local
buffers.  I don't think the GetPrivateRefCount() infrastructure cares
about local/non-local, leaving a few asserts aside.
If we do that, and start to use BM_IO_IN_PROGRESS, combined with
ResourceOwnerRememberBufferIO(), the set of differences between shared
and local buffers would be a lot smaller.

> > > Like the comment, I expect it's academic today.  I expect it will stay
> > > academic.  Anything that does a cleanup will start by reading the buffer,
> > > which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > > or replacing them with a comment (like the present paragraph)?
> >
> > I think we'll need an expanded version of what I suggest once we have writes -
> > but as you say, it shouldn't matter as long as we only have reads.  So I think
> > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > change makes sense.
>
> Moving those changes works for me.  I'm not currently seeing the need under
> writes, but that may get clearer upon reaching those patches.

FWIW, I don't think it's currently worth looking at the write side in
detail, there are enough required changes to make that not necessarily
the best use of your time at this point.  At least:

- Write logic needs to be rebased on top of the patch series to not hit
  dirty buffers while IO is going on

  The performance impact of doing the memory copies is rather
  substantial, as on intel memory bandwidth is *the* IO bottleneck even
  just for the checksum computation, without a copy.  That makes the
  memory copy for something like bounce buffers hurt really badly.  And
  the memory usage of bounce buffers is also really concerning.

  And even without checksums, several filesystems *really* don't like
  buffers getting modified during DIO writes.  Which I think would mean
  we ought to use bounce buffers for *all* writes, which would impose a
  *very* substantial overhead (basically removing the benefit of DMA
  happening off-cpu).

- Right now the sync.c integration with smgr.c/md.c isn't properly safe
  to use in a critical section

  The only reason it doesn't immediately fail is that it's reasonably
  rare that RegisterSyncRequest() fails *and* either:

  - smgropen()->hash_search(HASH_ENTER) decides to resize the hash table,
    even though the lookup is guaranteed to succeed for io_method=worker.

  - an io_method=uring completion is run in a different backend and
    smgropen() needs to build a new entry and thus needs to allocate
    memory

  For a bit I thought this could be worked around easily enough by not
  doing an smgropen() in mdsyncfiletag(), or adding a "fallible"
  smgropen() and instead just opening the file directly.  That actually
  does kinda solve the problem, but only because the memory allocation in
  PathNameOpenFile() uses malloc(), not palloc(), and thus doesn't
  trigger the problem.

- I think it requires new lwlock.c infrastructure (as v1 of aio had), to
  make LockBuffer(BUFFER_LOCK_EXCLUSIVE) etc wait in a concurrency safe
  manner for in-progress writes

  I can think of ways to solve this purely in bufmgr.c, but only in ways
  that would cause other problems (e.g. setting BM_IO_IN_PROGRESS before
  waiting for an exclusive lock) and/or be expensive.

- My current set of patches doesn't implement bgwriter_flush_after,
  checkpointer_flush_after

  I think that's not too hard to do, it's mainly round tuits.

- temp_file_limit is not respected by aio writes

  I guess that could be ok if AIO writes are only used by checkpointer /
  bgwriter, but we need to figure out a way to deal with that.
  Perhaps by redesigning temp_file_limit, the current implementation
  seems like a rather substantial layering violation.

- Too much duplicated code, as there's the aio and non-aio write paths.

  That might be ok for a bit.

I updated the commit messages of the relevant commits with the above,
there were abbreviated versions of most of the above, but not in enough
detail for anybody but me (and maybe not even that).

> > Do you think it's worth mentioning the above workaround?  I'm mildly inclined
> > not to.
>
> Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
> exit+reenter is banned.  Maybe "(b) start another batch (without first exiting
> one)".  It's also fine as-is, though.

I updated it to:

 * b) start another batch (without first exiting batchmode and re-entering
 *    before returning)

> > I'm ok with all of these.  In order of preference:
> >
> > 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
> > 2) READ_STREAM_BATCHMODE_AWARE
> > 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE
>
> Same for me.

For now I'll leave it at READ_STREAM_USE_BATCHING, but if Thomas has a
preference I'll go for whatever we have a majority for.

Greetings,

Andres Freund
Hi,

On 2025-03-25 17:10:19 +1300, Thomas Munro wrote:
> On Tue, Mar 25, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> > Attached v2.12, with the following changes:
>
> Here's a tiny fixup to make io_concurrency=0 turn on
> READ_BUFFERS_SYNCHRONOUSLY as mooted in a FIXME.  Without this, AIO
> will still run at level 1 even if you asked for 0.  Feel free to
> squash, or ignore and I'll push it later, whatever suits...  (tested on
> the tip of your public aio-2 branch).

Thanks!

I squashed it into "aio: Basic read_stream adjustments for real AIO" and
updated the commit message to account for that.

Greetings,

Andres Freund
Hi,

On 2025-03-25 07:11:20 -0700, Noah Misch wrote:
> On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> > Is it actually sane to use WARNING here?  At least for ZERO_ON_ERROR that could
> > trigger a rather massive flood of messages to the client in a *normal*
> > situation.  I'm thinking of something like an insert extending a relation some
> > time after an immediate restart and encountering a lot of FSM corruption (due
> > to its non-crash-safe-ness) during the search for free space and the
> > subsequent FSM vacuum.  It might be ok to LOG that, but sending a lot of
> > WARNINGs to the client seems not quite right.
>
> Orthogonal to AIO, I do think LOG (or even DEBUG1?) is better for
> ZERO_ON_ERROR.  The ZERO_ON_ERROR case also should not use
> ERRCODE_DATA_CORRUPTED.  (That errcode shouldn't appear for business as usual.
> It should signify wrong or irretrievable query results, essentially.)

I strongly agree on the errcode - basically makes it much harder to use
the errcode to trigger alerting.  And we don't have any other way to do
that...

I'm, obviously, positive on not using WARNING for ZERO_ON_ERROR.  I'm
neutral on LOG vs DEBUG1, I can see arguments for either.

> For zero_damaged_pages, WARNING seems at least defensible, and
> ERRCODE_DATA_CORRUPTED is right.  It wouldn't be the worst thing to change
> zero_damaged_pages to LOG and let the complete_shared runner log it, as long
> as we release-note that.  It's superuser-only, and the superuser can learn to
> check the log.  One typically should use zero_damaged_pages in one session at
> a time, so the logs won't be too confusing.

It's obviously tempting to go for that, I'm somewhat undecided what the
best way is right now.  There might be a compromise, see below:

> > If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> > then could tell the stager to issue the WARNING.  It would add a bit of
> > distributed cost, both to callbacks and users of AIO, but it might not be too
> > bad.

FWIW, I prototyped this, it's not hard.

But it can't replace the current WARNING with 100% fidelity: If we read
60 blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we
can't encode that many block offsets in a single PgAioResult, there's not
enough space, and enlarging it far enough doesn't seem to make sense
either.

What we *could* do is to emit one WARNING for each bufmgr.c
smgrstartreadv(), with that warning saying that there were N zeroed
blocks in a read from block X to block Y and a HINT saying that there are
more details in the server log.

> Another thought on complete_shared running on other backends: I wonder if we
> should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
> other process" or similar, so people wonder less about how "SELECT FROM a" led
> to a log message about IO on table "b".

I've been wondering about that as well, and yes, we probably should.

I'd add the pid of the backend that started the IO to the message -
although I need to check whether we're trying to keep PIDs of other
processes from unprivileged users.

I think we probably should add a similar, but not equivalent, context in
io workers.  Maybe "I/O worker executing I/O on behalf of process %d".

Greetings,

Andres Freund
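A sketch of what that one-message-per-read idea could look like in the
stager. All variable names here are illustrative, not from the prototype;
ereport(), errmsg() and errdetail() are the real API (and, per the
follow-up below, DETAIL rather than HINT):

	/* Sketch only: summarize all zeroed pages of one smgrstartreadv(). */
	ereport(WARNING,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg("zeroing out %d invalid pages among blocks %u..%u of relation %s",
					nzeroed, first_block, last_block, rpath),
			 errdetail("Details for the individual invalid pages are in the server log.")));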
On Tue, Mar 25, 2025 at 11:26:14AM -0400, Andres Freund wrote:
> On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > > (We may be due for a test mode that does smgrreleaseall() at every
> > > > CHECK_FOR_INTERRUPTS()?)
> > >
> > > I suspect we are. I'm a bit afraid of even trying...
> > >
> > > ...
> > >
> > > It's extremely slow - but at least the main regression as well as the aio tests pass!
> >
> > One less thing!
>
> Unfortunately I'm now doubting the thoroughness of my check - while I made
> every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we
> trigger it conditionally.  E.g. elog(DEBUGN, ...) only executes a CFI if
> log_min_messages <= DEBUGN...
>
> I'll try that in a bit.

While having nagging thoughts that we might be releasing FDs before
io_uring gets them into kernel custody, I tried this hack to maximize FD
turnover:

static void
ReleaseLruFiles(void)
{
#if 0
	while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
	{
		if (!ReleaseLruFile())
			break;
	}
#else
	while (ReleaseLruFile())
		;
#endif
}

"make check" with default settings (io_method=worker) passes, but
io_method=io_uring in the TEMP_CONFIG file got different diffs in each of
two runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the
failures.  Here's the richer of the two diffs:

diff -U3 src/test/regress/expected/sanity_check.out src/test/regress/results/sanity_check.out
--- src/test/regress/expected/sanity_check.out	2024-10-24 12:43:25.741817594 -0700
+++ src/test/regress/results/sanity_check.out	2025-03-25 08:27:44.875151566 -0700
@@ -1,4 +1,7 @@
 VACUUM;
+ERROR:  index "pg_enum_oid_index" contains corrupted page at block 2
+HINT:  Please REINDEX it.
+CONTEXT:  while vacuuming index "pg_enum_oid_index" of relation "pg_catalog.pg_enum"
 --
 -- Sanity check: every system catalog that has OIDs should have
 -- a unique index on OID.  This ensures that the OIDs will be unique,
diff -U3 src/test/regress/expected/oidjoins.out src/test/regress/results/oidjoins.out
--- src/test/regress/expected/oidjoins.out	2023-07-06 19:58:07.686364439 -0700
+++ src/test/regress/results/oidjoins.out	2025-03-25 08:28:02.584335458 -0700
@@ -233,6 +233,8 @@
 NOTICE:  checking pg_policy {polrelid} => pg_class {oid}
 NOTICE:  checking pg_policy {polroles} => pg_authid {oid}
 NOTICE:  checking pg_default_acl {defaclrole} => pg_authid {oid}
+WARNING:  FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,5)",0)
+WARNING:  FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,7)",402654464)
 NOTICE:  checking pg_default_acl {defaclnamespace} => pg_namespace {oid}
 NOTICE:  checking pg_init_privs {classoid} => pg_class {oid}
 NOTICE:  checking pg_seclabel {classoid} => pg_class {oid}

> > > Because the end state varies, depending on the number of previously staged
> > > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > > *not* aligned with an internal state name. It will just mislead readers to
> > > think that there's a deterministic mapping when that does not exist.
> >
> > That's fair.  Could we provide the mapping in a comment, something like the
> > following?
>
> Yes!
>
> I wonder if it should also be duplicated or referenced elsewhere, although I
> am not sure where precisely.

I considered the README.md also, but adding that wasn't an obvious win.
> > --- a/src/include/storage/aio_internal.h
> > +++ b/src/include/storage/aio_internal.h
> > @@ -34,5 +34,10 @@
> >   * linearly through all states.
> >   *
> > - * State changes should all go through pgaio_io_update_state().
> > + * State changes should all go through pgaio_io_update_state().  Its callers
> > + * use these naming conventions:
> > + *
> > + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> > + *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> > + *   PGAIO_HS_COMPLETED_LOCAL.
> >   */
> >  typedef enum PgAioHandleState
>
> One detail I'm not sure about: The above change is correct, but perhaps a bit
> misleading, because we can actually go "back" to IDLE.  Not sure how to best
> phrase that though.

Not sure either.  Maybe the above could change to "to PGAIO_HS_STAGED or
any subsequent state" and the comment at PGAIO_HS_STAGED could say like
"Once in this state, concurrent activity could move the IO all the way to
PGAIO_HS_COMPLETED_LOCAL and recycle it back to IDLE."

> > > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> > > naming that I just stopped seeing.
>
> I assume you're on board with renaming _io_prep* to _io_start_*?

Yes.

> > > I'll try to think more about this, perhaps I can make myself see your POV
> > > more.

> > CheckBufferIsPinnedOnce(Buffer buffer)
> > {
> > 	if (BufferIsLocal(buffer))
> > 	{
> > 		if (LocalRefCount[-buffer - 1] != 1)
> > 			elog(ERROR, "incorrect local pin count: %d",
> > 				 LocalRefCount[-buffer - 1]);
> > 	}
> > 	else
> > 	{
> > 		if (GetPrivateRefCount(buffer) != 1)
> > 			elog(ERROR, "incorrect local pin count: %d",
> > 				 GetPrivateRefCount(buffer));
> > 	}
> > }
>
> Pretty random orthogonal thought, that I was reminded of by the above code
> snippet:
>
> It sure seems we should at some point get rid of LocalRefCount[] and just use
> the GetPrivateRefCount() infrastructure for both shared and local buffers.  I
> don't think the GetPrivateRefCount() infrastructure cares about
> local/non-local, leaving a few asserts aside.  If we do that, and start to use
> BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of
> differences between shared and local buffers would be a lot smaller.

That sounds promising.

> > > > Like the comment, I expect it's academic today.  I expect it will stay
> > > > academic.  Anything that does a cleanup will start by reading the buffer,
> > > > which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> > > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > > > or replacing them with a comment (like the present paragraph)?
> > >
> > > I think we'll need an expanded version of what I suggest once we have writes -
> > > but as you say, it shouldn't matter as long as we only have reads.  So I think
> > > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > > change makes sense.
> >
> > Moving those changes works for me.  I'm not currently seeing the need under
> > writes, but that may get clearer upon reaching those patches.
>
> FWIW, I don't think it's currently worth looking at the write side in detail,

Got it.  (I meant I didn't see a first-principles need, not that I had
deduced lack of need from a specific writes implementation.)

> > > Do you think it's worth mentioning the above workaround?  I'm mildly inclined
> > > not to.
> > Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
> > exit+reenter is banned.  Maybe "(b) start another batch (without first exiting
> > one)".  It's also fine as-is, though.
>
> I updated it to:
>
>  * b) start another batch (without first exiting batchmode and re-entering
>  *    before returning)

That's good.
On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> On 2025-03-25 07:11:20 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> > > If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> > > then could tell the stager to issue the WARNING.  It would add a bit of
> > > distributed cost, both to callbacks and users of AIO, but it might not be too
> > > bad.
>
> FWIW, I prototyped this, it's not hard.
>
> But it can't replace the current WARNING with 100% fidelity: If we read 60
> blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we can't
> encode that many block offsets in a single PgAioResult, there's not enough
> space, and enlarging it far enough doesn't seem to make sense either.
>
> What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(),
> with that warning saying that there were N zeroed blocks in a read from block
> X to block Y and a HINT saying that there are more details in the server log.

Sounds fine.

> > Another thought on complete_shared running on other backends: I wonder if we
> > should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
> > other process" or similar, so people wonder less about how "SELECT FROM a" led
> > to a log message about IO on table "b".
>
> I've been wondering about that as well, and yes, we probably should.
>
> I'd add the pid of the backend that started the IO to the message - although
> I need to check whether we're trying to keep PIDs of other processes from
> unprivileged users.

We don't.

> I think we probably should add a similar, but not equivalent, context in io
> workers.  Maybe "I/O worker executing I/O on behalf of process %d".

Sounds good.
Hi,

On 2025-03-25 08:58:08 -0700, Noah Misch wrote:
> While having nagging thoughts that we might be releasing FDs before io_uring
> gets them into kernel custody, I tried this hack to maximize FD turnover:
>
> static void
> ReleaseLruFiles(void)
> {
> #if 0
> 	while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
> 	{
> 		if (!ReleaseLruFile())
> 			break;
> 	}
> #else
> 	while (ReleaseLruFile())
> 		;
> #endif
> }
>
> "make check" with default settings (io_method=worker) passes, but
> io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
> runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
> Here's the richer of the two diffs:

Yikes.  That's a very good catch.

I spent a bit of time debugging this.  I think I see what's going on - it
turns out that the kernel does *not* open the FDs during io_uring_enter()
if IOSQE_ASYNC is specified [1].  Which we do add heuristically, in an
attempt to avoid a small but measurable slowdown for sequential scans
that are fully buffered (c.f. pgaio_uring_submit()).  If I disable that
heuristic, your patch above passes all tests here.

I don't know if that's an intentional or unintentional behavioral
difference.

There are 2 1/2 ways around this:

1) Stop using IOSQE_ASYNC heuristic
2a) Wait for all in-flight IOs when any FD gets closed
2b) Wait for all in-flight IOs using FD when it gets closed

Given that we have clear evidence that io_uring doesn't completely
support closing FDs while IOs are in flight, be it a bug or intentional,
it seems clearly better to go for 2a or 2b.

Greetings,

Andres Freund

[1] Files are instead opened when the queue entry is being worked on.
Interestingly that only happens when the IO is *explicitly* requested to
be executed in the workqueue with IOSQE_ASYNC, not when it's put there
because it couldn't be done in a non-blocking way.
On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> On 2025-03-25 08:58:08 -0700, Noah Misch wrote:
> > While having nagging thoughts that we might be releasing FDs before io_uring
> > gets them into kernel custody, I tried this hack to maximize FD turnover:
> >
> > static void
> > ReleaseLruFiles(void)
> > {
> > #if 0
> > 	while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
> > 	{
> > 		if (!ReleaseLruFile())
> > 			break;
> > 	}
> > #else
> > 	while (ReleaseLruFile())
> > 		;
> > #endif
> > }
> >
> > "make check" with default settings (io_method=worker) passes, but
> > io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
> > runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
> > Here's the richer of the two diffs:
>
> Yikes.  That's a very good catch.
>
> I spent a bit of time debugging this.  I think I see what's going on - it turns
> out that the kernel does *not* open the FDs during io_uring_enter() if
> IOSQE_ASYNC is specified [1].  Which we do add heuristically, in an attempt to
> avoid a small but measurable slowdown for sequential scans that are fully
> buffered (c.f. pgaio_uring_submit()).  If I disable that heuristic, your patch
> above passes all tests here.

Same result here.  As an additional data point, I tried adding this so
every reopen gets a new FD number (leaks FDs wildly):

--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1304,5 +1304,5 @@ LruDelete(File file)
 	 * to leak the FD than to mess up our internal state.
 	 */
-	if (close(vfdP->fd) != 0)
+	if (dup2(2, vfdP->fd) != vfdP->fd)
 		elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
 			 "could not close file \"%s\": %m", vfdP->fileName);

The same "make check" w/ TEMP_CONFIG io_method=io_uring passes with the
combination of that and the max-turnover change to ReleaseLruFiles().

> I don't know if that's an intentional or unintentional behavioral difference.
>
> There are 2 1/2 ways around this:
>
> 1) Stop using IOSQE_ASYNC heuristic
> 2a) Wait for all in-flight IOs when any FD gets closed
> 2b) Wait for all in-flight IOs using FD when it gets closed
>
> Given that we have clear evidence that io_uring doesn't completely support
> closing FDs while IOs are in flight, be it a bug or intentional, it seems
> clearly better to go for 2a or 2b.

Agreed.  If a workload spends significant time on fd.c closing files, I
suspect that workload already won't have impressive benchmark numbers.
Performance-seeking workloads will already want to tune FD usage high
enough to keep FDs long-lived.  So (1) clearly loses, and neither (2a)
nor (2b) clearly beats the other.  I'd try (2b) first but, if
complicated, quickly abandon it in favor of (2a).  What other
considerations could be important?
Hi,

On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > I don't know if that's an intentional or unintentional behavioral difference.
> >
> > There are 2 1/2 ways around this:
> >
> > 1) Stop using IOSQE_ASYNC heuristic
> > 2a) Wait for all in-flight IOs when any FD gets closed
> > 2b) Wait for all in-flight IOs using FD when it gets closed
> >
> > Given that we have clear evidence that io_uring doesn't completely support
> > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > clearly better to go for 2a or 2b.
>
> Agreed.  If a workload spends significant time on fd.c closing files, I
> suspect that workload already won't have impressive benchmark numbers.
> Performance-seeking workloads will already want to tune FD usage high enough
> to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> abandon it in favor of (2a).  What other considerations could be important?

The only other consideration I can think of is whether this should
happen for all io_methods or not.

I'm inclined to do it via a bool in IoMethodOps, but I guess one could
argue it's a bit weird to have a bool in a struct called *Ops.

Greetings,

Andres Freund
On Tue, Mar 25, 2025 at 04:07:35PM -0400, Andres Freund wrote:
> On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> > On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > > There are 2 1/2 ways around this:
> > >
> > > 1) Stop using IOSQE_ASYNC heuristic
> > > 2a) Wait for all in-flight IOs when any FD gets closed
> > > 2b) Wait for all in-flight IOs using FD when it gets closed
> > >
> > > Given that we have clear evidence that io_uring doesn't completely support
> > > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > > clearly better to go for 2a or 2b.
> >
> > Agreed.  If a workload spends significant time on fd.c closing files, I
> > suspect that workload already won't have impressive benchmark numbers.
> > Performance-seeking workloads will already want to tune FD usage high enough
> > to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> > clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> > abandon it in favor of (2a).  What other considerations could be important?
>
> The only other consideration I can think of is whether this should happen for
> all io_methods or not.

Either way is fine, I think.

> I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
> it's a bit weird to have a bool in a struct called *Ops.

That wouldn't bother me.  IndexAmRoutine has many bools, and "Ops" is
basically a synonym of "Routine".
Hi,

On 2025-03-25 13:18:50 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 04:07:35PM -0400, Andres Freund wrote:
> > On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> > > On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > > > There are 2 1/2 ways around this:
> > > >
> > > > 1) Stop using IOSQE_ASYNC heuristic
> > > > 2a) Wait for all in-flight IOs when any FD gets closed
> > > > 2b) Wait for all in-flight IOs using FD when it gets closed
> > > >
> > > > Given that we have clear evidence that io_uring doesn't completely support
> > > > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > > > clearly better to go for 2a or 2b.
> > >
> > > Agreed.  If a workload spends significant time on fd.c closing files, I
> > > suspect that workload already won't have impressive benchmark numbers.
> > > Performance-seeking workloads will already want to tune FD usage high enough
> > > to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> > > clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> > > abandon it in favor of (2a).  What other considerations could be important?
> >
> > The only other consideration I can think of is whether this should happen for
> > all io_methods or not.
>
> Either way is fine, I think.

Here's a draft incremental patch (attached as a .fixup to avoid
triggering cfbot) implementing 2b).

> > I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
> > it's a bit weird to have a bool in a struct called *Ops.
>
> That wouldn't bother me.  IndexAmRoutine has many bools, and "Ops" is
> basically a synonym of "Routine".

Cool.  Done that way.

The repeated-iteration approach taken in pgaio_closing_fd() isn't the
prettiest, but it's hard to imagine that ever being noticeable.

This survives a testrun where I use your torture patch and where I force
all IOs to use ASYNC.  Previously that did not get very far.  I also did
verify that, if I allow a small number of FDs, we do not wrongly wait for
all IOs.

Greetings,

Andres Freund
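Roughly, 2b) amounts to something like the following. This is a sketch of
the described approach, not the attached fixup: the lookup and wait helpers
are hypothetical names, while the IoMethodOps bool is the opt-in discussed
above:

	void
	pgaio_closing_fd(int fd)
	{
		/* only io_uring needs this; io_method=worker reopens files itself */
		if (!pgaio_method_ops->wait_on_fd_before_close)
			return;

		/*
		 * Waiting for one IO can cause further IOs to be submitted, so
		 * rescan the in-flight list from scratch after every wait.
		 */
		for (;;)
		{
			PgAioHandle *ioh = find_inflight_io_using_fd(fd);	/* hypothetical */

			if (ioh == NULL)
				break;
			pgaio_io_wait_one(ioh);		/* hypothetical wait primitive */
		}
	}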
On Tue, Mar 25, 2025 at 04:56:53PM -0400, Andres Freund wrote:
> The repeated-iteration approach taken in pgaio_closing_fd() isn't the
> prettiest, but it's hard to imagine that ever being noticeable.

Yep.  I've reviewed the fixup code, and it looks all good.

> This survives a testrun where I use your torture patch and where I force all
> IOs to use ASYNC.  Previously that did not get very far.  I also did verify
> that, if I allow a small number of FDs, we do not wrongly wait for all IOs.

I, too, see the test diffs gone.
Hi,

On 2025-03-25 09:15:43 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> > FWIW, I prototyped this, it's not hard.
> >
> > But it can't replace the current WARNING with 100% fidelity: If we read 60
> > blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we can't
> > encode that many block offsets in a single PgAioResult, there's not enough
> > space, and enlarging it far enough doesn't seem to make sense either.
> >
> > What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(),
> > with that warning saying that there were N zeroed blocks in a read from block
> > X to block Y and a HINT saying that there are more details in the server log.

It should probably be DETAIL, not HINT...

> Sounds fine.

I got that working.

To make it readable, it required changing the division of labor between
buffer_readv_complete() and buffer_readv_complete_one() a bit, but I
think it's actually easier to understand now.

Still need to beef up the test infrastructure a bit to make the
multi-block cases more easily testable.

Could use some input on the framing of the message/detail.  Right now
it's:

ERROR:  invalid page in block 8 of relation base/5/16417
DETAIL:  Read of 8 blocks, starting at block 7, 1 other pages in the same read are invalid.

But that doesn't seem great.  Maybe:

DETAIL:  Read of blocks 7..14, 1 other pages in the same read were also invalid.

But that still isn't really a sentence.

Greetings,

Andres Freund
On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote:
> Attached v2.12, with the following changes:

> TODO:

> Wonder if it's worth adding some coverage for when checksums are disabled?
> Probably not necessary?

Probably not necessary, agreed.  Orthogonal to AIO, it's likely worth a
CI "SPECIAL" and/or buildfarm animal that runs all tests w/ checksums
disabled.

> Subject: [PATCH v2.12 01/28] aio: Be more paranoid about interrupts

Ready for commit

> Subject: [PATCH v2.12 02/28] aio: Pass result of local callbacks to
>  ->report_return

Ready for commit w/ up to one cosmetic change:

> @@ -296,7 +299,9 @@ pgaio_io_call_complete_local(PgAioHandle *ioh)
>
>  	/*
>  	 * Note that we don't save the result in ioh->distilled_result, the local
> -	 * callback's result should not ever matter to other waiters.
> +	 * callback's result should not ever matter to other waiters. However, the
> +	 * local backend does care, so we return the result as modified by local
> +	 * callbacks, which then can be passed to ioh->report_return->result.
>  	 */
>  	pgaio_debug_io(DEBUG3, ioh,
>  				   "after local completion: distilled result: (status %s, id %u, error_data %d, result %d), raw_result:%d",

Should this debug message remove the word "distilled", since this commit
solidifies distilled_result as referring to the complete_shared result?

> Subject: [PATCH v2.12 03/28] aio: Add liburing dependency

Ready for commit

> Subject: [PATCH v2.12 04/28] aio: Add io_method=io_uring

Ready for commit w/ open_fd.fixup

> Subject: [PATCH v2.12 05/28] aio: Implement support for reads in smgr/md/fd

Ready for commit w/ up to two cosmetic changes:

> +/*
> + * AIO error reporting callback for mdstartreadv().
> + *
> + * Errors are encoded as follows:
> + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0

I recommend replacing "errno != 0" with either "that errno" or
"errno == error_data".

Second, the aio_internal.h comment changes discussed in
postgr.es/m/20250325155808.f7.nmisch@google.com and earlier.

> Subject: [PATCH v2.12 06/28] aio: Add README.md explaining higher level design

Ready for commit

(This and the previous patch have three spots that would change with the
s/prep/start/ renames.  No opinion on whether to rename before or rename
after.)

> Subject: [PATCH v2.12 07/28] localbuf: Track pincount in BufferDesc as well

The plan here looks good:
postgr.es/m/dbeeaize47y7esifdrinpa2l7cqqb67k72exvuf3appyxywjnc@7bt76mozhcy2

> Subject: [PATCH v2.12 08/28] bufmgr: Implement AIO read support

See review here and later discussion:
postgr.es/m/20250325022037.91.nmisch@google.com

> Subject: [PATCH v2.12 09/28] bufmgr: Use AIO in StartReadBuffers()

Ready for commit after a batch of small things, all but one of which have
no implications beyond code cosmetics.  This is my first comprehensive
review of this patch.  I like the test coverage (by the end of the patch
series).  For anyone else following, I found "diff -w" helpful for the
bufmgr.c changes.  That's because a key part is former WaitReadBuffers()
code moving up an indentation level to its home in new subroutine
AsyncReadBuffers().

> 	Assert(*nblocks == 1 || allow_forwarding);
> 	Assert(*nblocks > 0);
> 	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
> +	Assert(*nblocks == 1 || allow_forwarding);

Duplicates the assert three lines back.
> + nblocks = aio_ret->result.result; > + > + elog(DEBUG3, "partial read, will retry"); > + > + } > + else if (aio_ret->result.status == PGAIO_RS_ERROR) > + { > + pgaio_result_report(aio_ret->result, &aio_ret->target_data, ERROR); > + nblocks = 0; /* silence compiler */ > + } > > Assert(nblocks > 0); > Assert(nblocks <= MAX_IO_COMBINE_LIMIT); > > + operation->nblocks_done += nblocks; I struggled somewhat with the variety of "nblocks" variables: this local nblocks, operation->nblocks, actual_nblocks, and *nblocks in/out parameters of some functions. No one of them is clearly wrong to use the name, and some of these names are preexisting. That said, if you see opportunities to push in the direction of more-specific names, I'd welcome it. For example, this local variable could become add_to_nblocks_done instead. > + AsyncReadBuffers(operation, &nblocks); I suggest renaming s/nblocks/ignored_nblocks_progress/ here. > + * If we need to wait for IO before we can get a handle, submit already > + * staged IO first, so that other backends don't need to wait. There s/already staged/already-staged/. Normally I'd skip this as nitpicking, but I misread this particular sentence twice, as "submit" being the subject that "staged" something. (It's still nitpicking, alas.) > /* > * How many neighboring-on-disk blocks can we scatter-read into other > * buffers at the same time? In this case we don't wait if we see an > - * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the > + * I/O already in progress. We already set BM_IO_IN_PROGRESS for the > * head block, so we should get on with that I/O as soon as possible. > - * We'll come back to this block again, above. > + * > + * We'll come back to this block in the next call to > + * StartReadBuffers() -> AsyncReadBuffers(). Did this mean to say "WaitReadBuffers() -> AsyncReadBuffers()"? I'm guessing so, since WaitReadBuffers() is the one that loops. It might be referring to read_stream_start_pending_read()'s next StartReadBuffers(), though. I think this could just delete the last sentence. The function header comment already mentions the possibility of reading a subset of the request. This spot doesn't need to detail how the higher layers come back to here. > + smgrstartreadv(ioh, operation->smgr, forknum, > + blocknum + nblocks_done, > + io_pages, io_buffers_len); > + pgstat_count_io_op_time(io_object, io_context, IOOP_READ, > + io_start, 1, *nblocks_progress * BLCKSZ); We don't assign *nblocks_progress until lower in the function, so I think "io_buffers_len" should replace "*nblocks_progress" here. (This is my only non-cosmetic comment on this patch.) > Subject: [PATCH v2.12 10/28] aio: Basic read_stream adjustments for real AIO (Still reviewing this and later patches, but incidental observations follow.) > Subject: [PATCH v2.12 16/28] aio: Add test_aio module > +use List::Util qw(sample); sample() is new in 2020: https://metacpan.org/release/PEVANS/Scalar-List-Utils-1.68/source/Changes#L100 Hence, I'd expect some buildfarm failures. I'd try to use shuffle(), then take the first N elements. > +++ b/src/test/modules/test_aio/test_aio.c > @@ -0,0 +1,712 @@ > +/*------------------------------------------------------------------------- > + * > + * delay_execution.c > + * Test module to allow delay between parsing and execution of a query. > + * > + * The delay is implemented by taking and immediately releasing a specified > + * advisory lock.
If another process has previously taken that lock, the > + * current process will be blocked until the lock is released; otherwise, > + * there's no effect. This allows an isolationtester script to reliably > + * test behaviors where some specified action happens in another backend > + * between parsing and execution of any desired query. > + * > + * Copyright (c) 2020-2025, PostgreSQL Global Development Group > + * > + * IDENTIFICATION > + * src/test/modules/test_aio/test_aio.c To elaborate on my last review, the entire header comment was a copy from delay_execution.c. v2.12 fixes the IDENTIFICATION, but the rest needs updates. Thanks, nm
On Tue, Mar 25, 2025 at 08:17:17PM -0400, Andres Freund wrote: > On 2025-03-25 09:15:43 -0700, Noah Misch wrote: > > On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote: > > > FWIW, I prototyped this, it's not hard. > > > > > > But it can't replace the current WARNING with 100% fidelity: If we read 60 > > > blocks in a single smgrreadv, we today would emit 60 WARNINGs. But we > > > can't encode that many block offsets in a single PgAioResult, there's not enough > > > space, and enlarging it far enough doesn't seem to make sense either. > > > > > > > > > What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(), > > > with that warning saying that there were N zeroed blocks in a read from block > > > X to block Y and a HINT saying that there are more details in the server log. > > It should probably be DETAIL, not HINT... Either is fine with me. I would go for HINT if referring to the server log, given the precedent of errhint("See server log for query details."). DETAIL fits for block counts, though: > Could use some input on the framing of the message/detail. Right now it's: > > ERROR: invalid page in block 8 of relation base/5/16417 > DETAIL: Read of 8 blocks, starting at block 7, 1 other pages in the same read are invalid. > > But that doesn't seem great. Maybe: > > DETAIL: Read of blocks 7..14, 1 other pages in the same read were also invalid. > > But that still isn't really a sentence. How about this for the multi-page case: WARNING: zeroing out %u invalid pages among blocks %u..%u of relation %s DETAIL: Block %u held first invalid page. HINT: See server log for the other %u invalid blocks. For the one-page case, the old message can stay: WARNING: invalid page in block %u of relation %s; zeroing out page
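For concreteness, the proposed multi-page variant could be assembled roughly like this (just a sketch, not the committed wording; zeroed_count, first_blkno, last_blkno, first_invalid_blkno and rpath are hypothetical names for values decoded from the completed IO):

    ereport(WARNING,
            errcode(ERRCODE_DATA_CORRUPTED),
            errmsg("zeroing out %u invalid pages among blocks %u..%u of relation %s",
                   zeroed_count, first_blkno, last_blkno, rpath),
            errdetail("Block %u held first invalid page.",
                      first_invalid_blkno),
            errhint("See server log for the other %u invalid blocks.",
                    zeroed_count - 1));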
I reviewed everything up to and including "[PATCH v2.12 17/28] aio, bufmgr: Comment fixes", the last patch before write support. postgr.es/m/20250326001915.bc.nmisch@google.com covered patches 1-9, and this email covers patches 10-17. All remaining review comments are minor, so I've marked the commitfest entry Ready for Committer. If there's anything you'd like re-reviewed before you commit it, feel free to bring it to my attention. Thanks for getting the feature to this stage! On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote: > Subject: [PATCH v2.12 10/28] aio: Basic read_stream adjustments for real AIO > @@ -631,6 +637,9 @@ read_stream_begin_impl(int flags, > * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled > * above. If we had real asynchronous I/O we might need a slightly > * different definition. > + * > + * FIXME: Not sure what different definition we would need? I guess we > + * could add the READ_BUFFERS_SYNCHRONOUSLY flag automatically? I think we don't need a different definition. max_ios comes from effective_io_concurrency and similar settings. The above comment's definition of max_ios=0 matches that GUC's documented behavior: The allowed range is <literal>1</literal> to <literal>1000</literal>, or <literal>0</literal> to disable issuance of asynchronous I/O requests. I'll guess the comment meant that "advice disabled" is a no-op for AIO, so we could reasonably argue to have effective_io_concurrency=0 distinguish itself from effective_io_concurrency=1 in some different way for AIO. Equally, there's no hurry to use that freedom to distinguish them. > Subject: [PATCH v2.12 11/28] read_stream: Introduce and use optional batchmode > support > This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and > uses it where appropriate. I'd also use the new flag on the read_stream_begin_smgr_relation() call in RelationCopyStorageUsingBuffer(). It uses block_range_read_stream_cb, and other streams of that callback rightly use the flag. > + * b) directly or indirectly start another batch pgaio_enter_batchmode() Needs new wording from end of postgr.es/m/20250325155808.f7.nmisch@google.com > Subject: [PATCH v2.12 12/28] docs: Reframe track_io_timing related docs as > wait time > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems Consider also updating this comment to stop focusing on prefetch; I think changing that aligns with the patch's other changes: /* * How many buffers PrefetchBuffer callers should try to stay ahead of their * ReadBuffer calls by. Zero means "never prefetch". This value is only used * for buffers not belonging to tablespaces that have their * effective_io_concurrency parameter set. */ int effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY; > -#io_combine_limit = 128kB # usually 1-128 blocks (depends on OS) > +#io_combine_limit = 128kB # usually 1-32 blocks (depends on OS) I think "usually 1-128" remains right given: GUC_UNIT_BLOCKS #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX #define PG_IOV_MAX Min(IOV_MAX, 128) > - On systems without prefetch advice support, attempting to configure > - any value other than <literal>0</literal> will error out. > + On systems with prefetch advice support, > + <varname>effective_io_concurrency</varname> also controls the prefetch distance. Wrap the last line. > Subject: [PATCH v2.12 14/28] docs: Add acronym and glossary entries for I/O > and AIO > These could use a lot more polish. To me, it's fine as-is. 
> I did not actually reference the new entries yet, because I don't really > understand what our policy for that is. I haven't seen much of a policy on that. > Subject: [PATCH v2.12 15/28] aio: Add pg_aios view > +retry: > + > + /* > + * There is no lock that could prevent the state of the IO to advance > + * concurrently - and we don't want to introduce one, as that would > + * introduce atomics into a very common path. Instead we > + * > + * 1) Determine the state + generation of the IO. > + * > + * 2) Copy the IO to local memory. > + * > + * 3) Check if state or generation of the IO changed. If the state > + * changed, retry, if the generation changed don't display the IO. > + */ > + > + /* 1) from above */ > + start_generation = live_ioh->generation; > + pg_read_barrier(); Based on the "really started after this function was called" and "no risk of a livelock here" comments below, I think "retry:" should be here. We don't want to livelock in the form of chasing ever-growing start_generation numbers. > + /* > + * The IO completed and a new one was started with the same ID. Don't > + * display it - it really started after this function was called. > + * There be a risk of a livelock if we just retried endlessly, if IOs > + * complete very quickly. > + */ > + if (live_ioh->generation != start_generation) > + continue; > + > + /* > + * The IOs state changed while we were "rendering" it. Just start from s/IOs/IO's/ > + * scratch. There's no risk of a livelock here, as an IO has a limited > + * sets of states it can be in, and state changes go only in a single > + * direction. > + */ > + if (live_ioh->state != start_state) > + goto retry; > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>target</structfield> <type>text</type> > + </para> > + <para> > + What kind of object is the I/O targeting: > + <itemizedlist spacing="compact"> > + <listitem> > + <para> > + <literal>smgr</literal>, I/O on postgres relations s/postgres relations/relations/ since SGML docs don't use the term "postgres" that way. > Subject: [PATCH v2.12 16/28] aio: Add test_aio module > --- a/src/test/modules/meson.build > +++ b/src/test/modules/meson.build > @@ -1,5 +1,6 @@ > # Copyright (c) 2022-2025, PostgreSQL Global Development Group > > +subdir('test_aio') > subdir('brin') List is alphabetized; please preserve that. > +++ b/src/test/modules/test_aio/Makefile > @@ -0,0 +1,26 @@ > +# src/test/modules/delay_execution/Makefile Update filename in comment. > +++ b/src/test/modules/test_aio/meson.build > @@ -0,0 +1,37 @@ > +# Copyright (c) 2022-2024, PostgreSQL Global Development Group s/2024/2025/ > --- /dev/null > +++ b/src/test/modules/test_aio/t/001_aio.pl s/ {4}/\t/g on this file. It's mostly \t now, with some exceptions. > + test_inject_worker('worker', $node_worker); What do we expect to happen if autovacuum or checkpointer runs one of these injection points? I'm guessing it would at most make that process fail without affecting the test outcome. If so, that's fine. > + $waitfor,); s/,// > + # normal handle use > + psql_like($io_method, $psql, "handle_get_release()", > + qq(SELECT handle_get_release()), > + qr/^$/, qr/^$/); > + > + # should error out, API violation > + psql_like($io_method, $psql, "handle_get_twice()", > + qq(SELECT handle_get_release()), > + qr/^$/, qr/^$/); Last two lines are a clone of the previous psql_like() call. I guess this wants to instead call handle_get_twice() and check for some stderr. 
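Circling back to the pg_aios snapshot loop above, a minimal sketch of the barrier/retry pattern with "retry:" hoisted before step 1, as suggested (the loop scaffolding and variable names here are illustrative, not the patch's):

    for (int i = 0; i < io_handle_count; i++)
    {
        PgAioHandle *live_ioh = &live_iohs[i];
        PgAioHandle ioh_copy;
        uint64      start_generation;
        uint8       start_state;

    retry:
        /* 1) determine generation + state of the IO */
        start_generation = live_ioh->generation;
        pg_read_barrier();
        start_state = live_ioh->state;

        /* 2) copy the IO to local memory */
        ioh_copy = *live_ioh;
        pg_read_barrier();

        /*
         * 3) A new generation means the IO we copied completed and its
         * handle was reused; skip it, since retrying could chase
         * ever-growing generations (a livelock). A mere state change can
         * only advance a bounded number of times, so retrying on it
         * terminates.
         */
        if (live_ioh->generation != start_generation)
            continue;
        if (live_ioh->state != start_state)
            goto retry;

        /* ... render ioh_copy into the view's output ... */
    }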
> + "read_rel_block_ll() of $tblname page", What does "_ll" stand for? > + # Issue IO without waiting for completion, then exit > + $psql_a->query_safe( > + qq(SELECT read_rel_block_ll('tbl_ok', 1, wait_complete=>false);)); > + $psql_a->reconnect_and_clear(); > + > + # Check that another backend can read the relevant block > + psql_like( > + $io_method, > + $psql_b, > + "completing read started by exited backend", I think the exiting backend's pgaio_shutdown() completed it. > +sub test_inject This deserves a brief comment on the behaviors being tested, like the previous functions have. It seems to be about short reads and hard failures like EIO. > Subject: [PATCH v2.12 17/28] aio, bufmgr: Comment fixes
Hi, On 2025-03-25 17:19:15 -0700, Noah Misch wrote: > On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote: > > @@ -296,7 +299,9 @@ pgaio_io_call_complete_local(PgAioHandle *ioh) > > > > /* > > * Note that we don't save the result in ioh->distilled_result, the local > > - * callback's result should not ever matter to other waiters. > > + * callback's result should not ever matter to other waiters. However, the > > + * local backend does care, so we return the result as modified by local > > + * callbacks, which then can be passed to ioh->report_return->result. > > */ > > pgaio_debug_io(DEBUG3, ioh, > > "after local completion: distilled result: (status %s, id %u, error_data %d, result %d), raw_result:%d", > > Should this debug message remove the word "distilled", since this commit > solidifies distilled_result as referring to the complete_shared result? Good point, updated. > > Subject: [PATCH v2.12 01/28] aio: Be more paranoid about interrupts > Ready for commit > > Subject: [PATCH v2.12 02/28] aio: Pass result of local callbacks to > > ->report_return > > Ready for commit w/ up to one cosmetic change: > And pushed. Together with the s/pgaio_io_prep_/pgaio_io_start_/ renaming we've been discussing. Btw, I figured out the origin of that: I was just mirroring the liburing API... Thanks again for the reviews. > > Subject: [PATCH v2.12 03/28] aio: Add liburing dependency > > Ready for commit > > > > Subject: [PATCH v2.12 04/28] aio: Add io_method=io_uring > > Ready for commit w/ open_fd.fixup Yay. Planning to push those soon. > > Subject: [PATCH v2.12 05/28] aio: Implement support for reads in smgr/md/fd > > Ready for commit w/ up to two cosmetic changes: Cool. > > +/* > > + * AIO error reporting callback for mdstartreadv(). > > + * > > + * Errors are encoded as follows: > > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0 > > I recommend replacing "errno != 0" with either "that errno" or "errno == > error_data". Applied. > Second, the aio_internal.h comment changes discussed in > postgr.es/m/20250325155808.f7.nmisch@google.com and earlier. Here's my current version of that: * Note that the externally visible functions to start IO * (e.g. FileStartReadV(), via pgaio_io_start_readv()) move an IO from * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most * PGAIO_HS_COMPLETED_LOCAL (at which point the handle will be reused). Does that work? I think I'll push that as part of the comment updates patch instead of "Implement support for reads in smgr/md/fd", unless you see a reason to do so differently. I'd have done it in the patch to s/prep/start/, but then it would reference functions that don't exist yet... > > Subject: [PATCH v2.12 06/28] aio: Add README.md explaining higher level design > > Ready for commit Cool. Comments in it reference PGAIO_HCB_SHARED_BUFFER_READV, so I'm inclined to reorder it until after "bufmgr: Implement AIO read support". There's also a small change in a new patch in the series (not yet sent), due to the changes related to emitting WARNINGs about checksum failures to the client connection. I think that part is fine, but... > (This and the previous patch have three spots that would change with the > s/prep/start/ renames. No opinion on whether to rename before or rename > after.) I thought it'd be better to do the renaming first.
> > Subject: [PATCH v2.12 07/28] localbuf: Track pincount in BufferDesc as well > > The plan here looks good: > postgr.es/m/dbeeaize47y7esifdrinpa2l7cqqb67k72exvuf3appyxywjnc@7bt76mozhcy2 > > Subject: [PATCH v2.12 08/28] bufmgr: Implement AIO read support > > See review here and later discussion: > postgr.es/m/20250325022037.91.nmisch@google.com I'm working on a version with those addressed. > > Subject: [PATCH v2.12 09/28] bufmgr: Use AIO in StartReadBuffers() > > Ready for commit after a batch of small things, all but one of which have no > implications beyond code cosmetics. Yay. > I like the test coverage (by the end of the patch series). I'm really shocked just how bad our test coverage for a lot of this is today :( > For anyone else following, I found "diff -w" helpful for the bufmgr.c > changes. That's because a key part is former WaitReadBuffers() code moving > up an indentation level to its home in new subroutine AsyncReadBuffers(). For reviewing changes that move stuff around a lot I find this rather helpful: git diff --color-moved --color-moved-ws=ignore-space-change That highlights removed code differently from moved code and, thanks to ignore-space-change, treats code that differs only in whitespace as moved. > > Assert(*nblocks == 1 || allow_forwarding); > > Assert(*nblocks > 0); > > Assert(*nblocks <= MAX_IO_COMBINE_LIMIT); > > + Assert(*nblocks == 1 || allow_forwarding); > > Duplicates the assert three lines back. Ah, it was moved into ce1a75c4fea, which I didn't notice while rebasing... > > + nblocks = aio_ret->result.result; > > + > > + elog(DEBUG3, "partial read, will retry"); > > + > > + } > > + else if (aio_ret->result.status == PGAIO_RS_ERROR) > > + { > > + pgaio_result_report(aio_ret->result, &aio_ret->target_data, ERROR); > > + nblocks = 0; /* silence compiler */ > > + } > > > > Assert(nblocks > 0); > > Assert(nblocks <= MAX_IO_COMBINE_LIMIT); > > > > + operation->nblocks_done += nblocks; > > I struggled somewhat with the variety of "nblocks" variables: this local > nblocks, operation->nblocks, actual_nblocks, and *nblocks in/out parameters of > some functions. No one of them is clearly wrong to use the name, and some of > these names are preexisting. That said, if you see opportunities to push in > the direction of more-specific names, I'd welcome it. > > For example, this local variable could become add_to_nblocks_done instead. I named it "newly_read_blocks", hope that works? > > + AsyncReadBuffers(operation, &nblocks); > > I suggest renaming s/nblocks/ignored_nblocks_progress/ here. Adopted. Unfortunately I didn't see a good way to reduce the number of the other nblocks variables, as they are all, I think, preexisting. > > + * If we need to wait for IO before we can get a handle, submit already > > + * staged IO first, so that other backends don't need to wait. There > > s/already staged/already-staged/. Normally I'd skip this as nitpicking, but I > misread this particular sentence twice, as "submit" being the subject that > "staged" something. (It's still nitpicking, alas.) Makes sense - it doesn't help that it was at a linebreak... > > /* > > * How many neighboring-on-disk blocks can we scatter-read into other > > * buffers at the same time? In this case we don't wait if we see an > > - * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the > > + * I/O already in progress. We already set BM_IO_IN_PROGRESS for the > > * head block, so we should get on with that I/O as soon as possible. > > - * We'll come back to this block again, above.
> > + * > > + * We'll come back to this block in the next call to > > + * StartReadBuffers() -> AsyncReadBuffers(). > > Did this mean to say "WaitReadBuffers() -> AsyncReadBuffers()"? I'm guessing > so, since WaitReadBuffers() is the one that loops. It might be referring to > read_stream_start_pending_read()'s next StartReadBuffers(), though. I was referring to the latter, as that is the more common case (it's pretty easy to hit if you e.g. have multiple sequential scans on the same table going). > I think this could just delete the last sentence. The function header comment > already mentions the possibility of reading a subset of the request. This > spot doesn't need to detail how the higher layers come back to here. Agreed. > > + smgrstartreadv(ioh, operation->smgr, forknum, > > + blocknum + nblocks_done, > > + io_pages, io_buffers_len); > > + pgstat_count_io_op_time(io_object, io_context, IOOP_READ, > > + io_start, 1, *nblocks_progress * BLCKSZ); > > We don't assign *nblocks_progress until lower in the function, so I think > "io_buffers_len" should replace "*nblocks_progress" here. (This is my only > non-cosmetic comment on this patch.) Good catch! > > Subject: [PATCH v2.12 16/28] aio: Add test_aio module > > > +use List::Util qw(sample); > > sample() is new in 2020: > https://metacpan.org/release/PEVANS/Scalar-List-Utils-1.68/source/Changes#L100 > > Hence, I'd expect some buildfarm failures. I'd try to use shuffle(), then > take the first N elements. Hah. Bilal's patch was using shuffle(). I wanted to reduce the number of iterations and first did as you suggested and then saw that there's a nicer way... Done that way again... > > +++ b/src/test/modules/test_aio/test_aio.c > > @@ -0,0 +1,712 @@ > > +/*------------------------------------------------------------------------- > > + * > > + * delay_execution.c > > + * Test module to allow delay between parsing and execution of a query. > > + * > > + * The delay is implemented by taking and immediately releasing a specified > > + * advisory lock. If another process has previously taken that lock, the > > + * current process will be blocked until the lock is released; otherwise, > > + * there's no effect. This allows an isolationtester script to reliably > > + * test behaviors where some specified action happens in another backend > > + * between parsing and execution of any desired query. > > + * > > + * Copyright (c) 2020-2025, PostgreSQL Global Development Group > > + * > > + * IDENTIFICATION > > + * src/test/modules/test_aio/test_aio.c > > To elaborate on my last review, the entire header comment was a copy from > delay_execution.c. v2.12 fixes the IDENTIFICATION, but the rest needs > updates. I was really too tired that day... Embarrassing. Greetings, Andres Freund
On Tue, 25 Mar 2025 at 01:18, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Attached v2.12, with the following changes: I took a quick gander through this just out of curiosity (yes, I know I'm late), and found these show-stoppers: v2.12-0015-aio-Add-pg_aios-view.patch: + <literal>ERROR</literal> mean the I/O failed with an error. s/mean/means/ v2.12-0021-bufmgr-Implement-AIO-write-support.patch +shared buffer lock still allows some modification, e.g., for hint bits(see s/bits\(see/bits \(see/ +buffers that can be used as the source / target for IO. A bounce buffer be s/be/can be/ Regards Thom
On Wed, Mar 26, 2025 at 04:33:49PM -0400, Andres Freund wrote: > On 2025-03-25 17:19:15 -0700, Noah Misch wrote: > > On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote: > > Second, the aio_internal.h comment changes discussed in > > postgr.es/m/20250325155808.f7.nmisch@google.com and earlier. > > Here's my current version of that: > > * Note that the externally visible functions to start IO > * (e.g. FileStartReadV(), via pgaio_io_start_readv()) move an IO from > * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most > * PGAIO_HS_COMPLETED_LOCAL (at which point the handle will be reused). > > Does that work? Yes. > I think I'll push that as part of the comment updates patch instead of > "Implement support for reads in smgr/md/fd", unless you see a reason to do so > differently. I'd have done it in the patch to s/prep/start/, but then it would > reference functions that don't exist yet... Agreed. > > > Subject: [PATCH v2.12 06/28] aio: Add README.md explaining higher level design > > > > Ready for commit > > Cool. > > Comments in it reference PGAIO_HCB_SHARED_BUFFER_READV, so I'm inclined to > reorder it until after "bufmgr: Implement AIO read support". Agreed. > > For example, this local variable could become add_to_nblocks_done instead. > > I named it "newly_read_blocks", hope that works? Yes.
Hi, On 2025-03-26 11:31:02 -0700, Noah Misch wrote: > I reviewed everything up to and including "[PATCH v2.12 17/28] aio, bufmgr: > Comment fixes", the last patch before write support. Thanks! > postgr.es/m/20250326001915.bc.nmisch@google.com covered patches 1-9, and this > email covers patches 10-17. All remaining review comments are minor, so I've > marked the commitfest entry Ready for Committer. If there's anything you'd > like re-reviewed before you commit it, feel free to bring it to my attention. > Thanks for getting the feature to this stage! As part of our discussion around the WARNING stuff I did make some changes, it'd be good if you could look at those once I send them. While I squashed the rest of the changes (addressing review comments) into their base commits, I left the error-reporting related bits and pieces in fixup commits, to make that easier. > On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote: > > Subject: [PATCH v2.12 10/28] aio: Basic read_stream adjustments for real AIO > > > @@ -631,6 +637,9 @@ read_stream_begin_impl(int flags, > > * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled > > * above. If we had real asynchronous I/O we might need a slightly > > * different definition. > > + * > > + * FIXME: Not sure what different definition we would need? I guess we > > + * could add the READ_BUFFERS_SYNCHRONOUSLY flag automatically? > > I think we don't need a different definition. max_ios comes from > effective_io_concurrency and similar settings. The above comment's definition > of max_ios=0 matches that GUC's documented behavior: > > The allowed range is > <literal>1</literal> to <literal>1000</literal>, or > <literal>0</literal> to disable issuance of asynchronous I/O requests. > > I'll guess the comment meant that "advice disabled" is a no-op for AIO, so we > could reasonably argue to have effective_io_concurrency=0 distinguish itself > from effective_io_concurrency=1 in some different way for AIO. Equally, > there's no hurry to use that freedom to distinguish them. Thomas has since provided an implementation of what he was thinking of when writing that comment: https://postgr.es/m/CA%2BhUKG%2B8SC2%3DAD3bC0Pn85aMXm-PE2JSFGhC%3DMFVJvNQLObZeA%40mail.gmail.com I squashed that into "aio: Basic read_stream adjustments for real AIO". > > Subject: [PATCH v2.12 11/28] read_stream: Introduce and use optional batchmode > > support > > > This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and > > uses it where appropriate. > > I'd also use the new flag on the read_stream_begin_smgr_relation() call in > RelationCopyStorageUsingBuffer(). It uses block_range_read_stream_cb, and > other streams of that callback rightly use the flag. Ah, yes. I had searched for all read_stream_begin_relation(), but not for _smgr... > > + * b) directly or indirectly start another batch pgaio_enter_batchmode() > > Needs new wording from end of postgr.es/m/20250325155808.f7.nmisch@google.com Locally it's that, just need to send out a new version... * * b) start another batch (without first exiting batchmode and re-entering * before returning) > > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems > > Consider also updating this comment to stop focusing on prefetch; I think > changing that aligns with the patch's other changes: > > /* > * How many buffers PrefetchBuffer callers should try to stay ahead of their > * ReadBuffer calls by. Zero means "never prefetch". 
This value is only used > * for buffers not belonging to tablespaces that have their > * effective_io_concurrency parameter set. > */ > int effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY; Good point. Although I suspect it might be worth adjusting this, and also the config.sgml bit about effective_io_concurrency separately. That seems like it might take an iteration or two. > > -#io_combine_limit = 128kB # usually 1-128 blocks (depends on OS) > > +#io_combine_limit = 128kB # usually 1-32 blocks (depends on OS) > > I think "usually 1-128" remains right given: > GUC_UNIT_BLOCKS > #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX > #define PG_IOV_MAX Min(IOV_MAX, 128) You're right. I think I got this wrong when rebasing over conflicts due to 06fb5612c97. > > - On systems without prefetch advice support, attempting to configure > > - any value other than <literal>0</literal> will error out. > > + On systems with prefetch advice support, > > + <varname>effective_io_concurrency</varname> also controls the prefetch distance. > > Wrap the last line. Done. > > Subject: [PATCH v2.12 14/28] docs: Add acronym and glossary entries for I/O > > and AIO > > > These could use a lot more polish. > > To me, it's fine as-is. Cool. > > I did not actually reference the new entries yet, because I don't really > > understand what our policy for that is. > > I haven't seen much of a policy on that. That's sure what it looks like to me :/ > > > Subject: [PATCH v2.12 15/28] aio: Add pg_aios view > > +retry: > > + > > + /* > > + * There is no lock that could prevent the state of the IO to advance > > + * concurrently - and we don't want to introduce one, as that would > > + * introduce atomics into a very common path. Instead we > > + * > > + * 1) Determine the state + generation of the IO. > > + * > > + * 2) Copy the IO to local memory. > > + * > > + * 3) Check if state or generation of the IO changed. If the state > > + * changed, retry, if the generation changed don't display the IO. > > + */ > > + > > + /* 1) from above */ > > + start_generation = live_ioh->generation; > > + pg_read_barrier(); > > Based on the "really started after this function was called" and "no risk of a > livelock here" comments below, I think "retry:" should be here. We don't > want to livelock in the form of chasing ever-growing start_generation numbers. You're right. > > + * scratch. There's no risk of a livelock here, as an IO has a limited > > + * sets of states it can be in, and state changes go only in a single > > + * direction. > > + */ > > + if (live_ioh->state != start_state) > > + goto retry; > > > + <entry role="catalog_table_entry"><para role="column_definition"> > > + <structfield>target</structfield> <type>text</type> > > + </para> > > + <para> > > + What kind of object is the I/O targeting: > > + <itemizedlist spacing="compact"> > > + <listitem> > > + <para> > > + <literal>smgr</literal>, I/O on postgres relations > > s/postgres relations/relations/ since SGML docs don't use the term "postgres" > that way. Not sure what I was even trying to express with "postgres relations" vs plain "relations" here... > > > Subject: [PATCH v2.12 16/28] aio: Add test_aio module > > > --- a/src/test/modules/meson.build > > +++ b/src/test/modules/meson.build > > @@ -1,5 +1,6 @@ > > # Copyright (c) 2022-2025, PostgreSQL Global Development Group > > > > +subdir('test_aio') > > subdir('brin') > > List is alphabetized; please preserve that. 
> > > +++ b/src/test/modules/test_aio/Makefile > > @@ -0,0 +1,26 @@ > > +# src/test/modules/delay_execution/Makefile > > Update filename in comment. > > > +++ b/src/test/modules/test_aio/meson.build > > @@ -0,0 +1,37 @@ > > +# Copyright (c) 2022-2024, PostgreSQL Global Development Group > > s/2024/2025/ Done. > > --- /dev/null > > +++ b/src/test/modules/test_aio/t/001_aio.pl > > s/ {4}/\t/g on this file. It's mostly \t now, with some exceptions. Huh. No idea how that happened. > > + test_inject_worker('worker', $node_worker); > > What do we expect to happen if autovacuum or checkpointer runs one of these > injection points? I'm guessing it would at most make that process fail > without affecting the test outcome. If so, that's fine. Autovacuum I disabled on the relations, to prevent that. I think checkpointer should behave as you describe, although I could wonder if it could confuse wait_for_log() based checks - but even so, I think that would at worst lead to a test missing a bug, in extremely rare circumstances. I tried triggering that condition, but it's pretty hard to hit, even after lowering checkpoint_timeout to 1s and looping in the tests. > > + # normal handle use > > + psql_like($io_method, $psql, "handle_get_release()", > > + qq(SELECT handle_get_release()), > > + qr/^$/, qr/^$/); > > + > > + # should error out, API violation > > + psql_like($io_method, $psql, "handle_get_twice()", > > + qq(SELECT handle_get_release()), > > + qr/^$/, qr/^$/); > > Last two lines are a clone of the previous psql_like() call. I guess this > wants to instead call handle_get_twice() and check for some stderr. Indeed. > > + "read_rel_block_ll() of $tblname page", > > What does "_ll" stand for? "low level". I added a C comment: /* * A "low level" read. This does similar things to what * StartReadBuffers()/WaitReadBuffers() do, but provides more control (and * less sanity). */ > > + # Issue IO without waiting for completion, then exit > > + $psql_a->query_safe( > > + qq(SELECT read_rel_block_ll('tbl_ok', 1, wait_complete=>false);)); > > + $psql_a->reconnect_and_clear(); > > + > > + # Check that another backend can read the relevant block > > + psql_like( > > + $io_method, > > + $psql_b, > > + "completing read started by exited backend", > > I think the exiting backend's pgaio_shutdown() completed it. I wrote the test precisely to exercise that path, otherwise it's pretty hard to reach. It does seem to reach the path reasonably reliably, although it's much harder to catch that causing problems, as the IO is typically too fast. > > +sub test_inject > > This deserves a brief comment on the behaviors being tested, like the previous > functions have. It seems to be about short reads and hard failures like EIO. Done. Greetings, Andres Freund
Hi, On 2025-03-26 21:20:47 +0000, Thom Brown wrote: > I took a quick gander through this just out of curiosity (yes, I know > I'm late), and found these show-stoppers: > > v2.12-0015-aio-Add-pg_aios-view.patch: > > + <literal>ERROR</literal> mean the I/O failed with an error. > > s/mean/means/ > > > v2.12-0021-bufmgr-Implement-AIO-write-support.patch > > +shared buffer lock still allows some modification, e.g., for hint bits(see > > s/bits\(see/bits \(see) > > +buffers that can be used as the source / target for IO. A bounce buffer be > > s/be/can be/ Thanks! Squashed into my local tree. Greetings, Andres Freund
On Thu, Mar 27, 2025 at 10:41 AM Andres Freund <andres@anarazel.de> wrote: > > > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems > > > > Consider also updating this comment to stop focusing on prefetch; I think > > changing that aligns with the patch's other changes: > > > > /* > > * How many buffers PrefetchBuffer callers should try to stay ahead of their > > * ReadBuffer calls by. Zero means "never prefetch". This value is only used > > * for buffers not belonging to tablespaces that have their > > * effective_io_concurrency parameter set. > > */ > > int effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY; > > Good point. Although I suspect it might be worth adjusting this, and also the > config.sgml bit about effective_io_concurrency separately. That seems like it > might take an iteration or two. +1 for rewriting that separately from this work on the code (I can have a crack at that if you want). For the comment, my suggestion would be something like: "Default limit on the level of concurrency that each I/O stream (currently, ReadStream but in future other kinds of streams) can use. Zero means that I/O is always performed synchronously, ie not concurrently with query execution. This value can be overridden at the tablespace level with the parameter of the same name. Note that streams performing I/O not classified as single-session work respect maintenance_io_concurrency instead."
Hi, Attached v2.13, with the following changes: Changes: - Pushed a fair number of commits A lot of thanks goes to Noah for his detailed reviews! - As Noah pointed out, the zero_damaged_pages warning could be emitted in an io worker or another backend, but omitted in the backend that started the IO To address that: 1) I added a new commit "aio: Add WARNING result status" (itself trivial) 2) I changed buffer_readv_complete() to encode the warning/error in a more detailed way than before (was_zeroed, first_invalid_off, count_invalid) As part of that I put the encoding/decoding into a static inline (see the sketch after the attachment list) 3) Tracking the number of invalid buffers was awkward with buffer_readv_complete_one() returning a PgAioResult. Now it just reports whether it found an invalid page with an out argument. 4) As discussed, there now is a different error message for the case of multiple invalid pages The code is a bit awkward in order to avoid code duplication; curious whether that's seen as acceptable? I could just duplicate the entire ereport() instead. 5) The WARNING in the callback is now a LOG, as it will be sent to the client as a WARNING explicitly when the IO's results are processed I actually chose LOG_SERVER_ONLY - that seemed slightly better than just LOG? But not at all sure. There's a comment explaining this now too. Noah, I think this set of changes would benefit from another round of review. I left these changes in "squash-later: " commits, to make it easier to see / think about. - Added a comment about the pgaio_result_report() in md_readv_complete(). I changed it to LOG_SERVER_ONLY as well, but I'm not at all sure about that. - Previously the buffer completion callback checked zero_damaged_pages - but that's not right, the GUC hopefully is only set on a per-session basis I solved that by having AsyncReadBuffers() add ZERO_ON_ERROR to the flags if zero_damaged_pages is configured. Also added a comment explaining that we probably should eventually use a separate flag, so we can adjust the errcode etc differently. - Explicit test for zero_damaged_pages and ZERO_ON_ERROR As part of that I made read_rel_block_ll() support reading multiple blocks. That makes it a lot easier to verify that we handle cases like a 4-block read where blocks 2 and 3 are invalid correctly. - I removed the code that "localbuf: Track pincount in BufferDesc as well" added to ConditionalLockBufferForCleanup() and IsBufferCleanupOK() as discussed Right now the situations that the code was worried about don't exist yet, as we only support reads. I added a comment about not needing to worry about that yet to "bufmgr: Implement AIO read support". And then changed that comment to a FIXME in the write patches. - Squashed Thomas' change to make io_concurrency=0 really not use AIO - Lots of other review comments by Noah addressed - Merged typo fixes by Thom Brown TODO: - There are more tests in test_aio that should be expanded to run for temp tables as well, not just normal tables - Add an explicit test for the checksum verification in the completion callback There is an existing test in test_aio for an invalid page detected by page header verification, but not for checksum failures. I think it's indirectly covered (e.g. in amcheck), but seems better to test it explicitly. - Add error context callbacks for io worker and "foreign" IO completion Greetings, Andres Freund
Attachment
- v2.13-0001-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.13-0002-docs-Add-acronym-and-glossary-entries-for-I-O-.patch
- v2.13-0003-aio-Add-pg_aios-view.patch
- v2.13-0004-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.13-0005-aio-bufmgr-Comment-fixes.patch
- v2.13-0006-aio-Add-WARNING-result-status.patch
- v2.13-0007-bufmgr-Implement-AIO-read-support.patch
- v2.13-0008-squash-later-bufmgr-Implement-AIO-read-support.patch
- v2.13-0009-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.13-0010-squash-later-bufmgr-Use-AIO-in-StartReadBuffer.patch
- v2.13-0011-aio-Add-README.md-explaining-higher-level-desi.patch
- v2.13-0012-squash-later-aio-Add-README.md-explaining-high.patch
- v2.13-0013-aio-Basic-read_stream-adjustments-for-real-AIO.patch
- v2.13-0014-read_stream-Introduce-and-use-optional-batchmo.patch
- v2.13-0015-docs-Reframe-track_io_timing-related-docs-as-w.patch
- v2.13-0016-Enable-IO-concurrency-on-all-systems.patch
- v2.13-0017-aio-Add-test_aio-module.patch
- v2.13-0018-aio-Experimental-heuristics-to-increase-batchi.patch
- v2.13-0019-aio-Implement-smgr-md-fd-write-support.patch
- v2.13-0020-aio-Add-bounce-buffers.patch
- v2.13-0021-bufmgr-Implement-AIO-write-support.patch
- v2.13-0022-aio-Add-IO-queue-helper.patch
- v2.13-0023-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.13-0024-Ensure-a-resowner-exists-for-all-paths-that-ma.patch
- v2.13-0025-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.13-0026-WIP-Use-MAP_POPULATE.patch
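To illustrate item 2 of the zero_damaged_pages changes above, the encoding/decoding helpers could look roughly like the following sketch (the field widths, macro and function names are illustrative assumptions, not the patch's actual code, and would have to fit into PgAioResult's error_data):

    /* assumed layout; a real encoding would size fields to MAX_IO_COMBINE_LIMIT */
    #define READV_ZEROED_BIT   (1u << 15)
    #define READV_OFF_SHIFT    7
    #define READV_FIELD_MASK   0x7f

    static inline uint32
    buffer_readv_encode_error_data(bool was_zeroed,
                                   uint8 first_invalid_off,
                                   uint8 count_invalid)
    {
        return (was_zeroed ? READV_ZEROED_BIT : 0) |
            ((uint32) (first_invalid_off & READV_FIELD_MASK) << READV_OFF_SHIFT) |
            ((uint32) count_invalid & READV_FIELD_MASK);
    }

    static inline void
    buffer_readv_decode_error_data(uint32 error_data, bool *was_zeroed,
                                   uint8 *first_invalid_off,
                                   uint8 *count_invalid)
    {
        *was_zeroed = (error_data & READV_ZEROED_BIT) != 0;
        *first_invalid_off = (error_data >> READV_OFF_SHIFT) & READV_FIELD_MASK;
        *count_invalid = error_data & READV_FIELD_MASK;
    }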
Hi, On 2025-03-26 21:07:40 -0400, Andres Freund wrote: > TODO > ... > - Add an explicit test for the checksum verification in the completion callback > > There is an existing test in test_aio for an invalid page detected by page header verification, but not for checksum failures. > > I think it's indirectly covered (e.g. in amcheck), but seems better to test > it explicitly. Ah, for crying out loud. As it turns out, no, we do not have *ANY* tests for this on the server side. Not a single one. I'm somewhat apoplectic: data_checksums is a really complicated feature, which we just started *turning on by default*, without a single test of the failure behaviour, when detecting failures is the one thing the feature is supposed to do. I now wrote some tests. And I both regret doing so (because it found problems, which would have been apparent long ago, if the feature had come with *any* tests, if I had gone the same way I could have just pushed stuff) and am glad I did (because I dislike pushing broken stuff). I have to admit, I was tempted to just ignore this issue and just not say anything about tests for checksum failures anymore. Problems: 1) PageIsVerifiedExtended() emits a WARNING; just like with ZERO_ON_ERROR, we don't want to emit it in a) io workers or b) another backend if it completes the IO. This isn't hard to address, we can add PIV_LOG_LOG (or something like that) to emit it at a different log level and an out-parameter to trigger sending a warning / adjust the warning/error message we already emit once the issuer completes the IO. 2) With IO workers (and "foreign completors", in rare cases), the checksum failures would be attributed wrongly, as it reports all stats to MyDatabaseId As it turns out, this is already borked on master for shared relations, since pg_stat_database.checksum_failures has existed, see [1]. This isn't too hard to fix, if we adjust the signature of PageIsVerifiedExtended() to pass in the database oid. But see also 3) 3) We can't pgstat_report_checksum_failure() during the completion callback, as it *sometimes* allocates memory Aside from the allocation-in-critical-section asserts, I think this is *extremely* unlikely to actually cause a problem in practice. But we don't want to rely on that, obviously. Addressing the first two is pretty simple and/or needs to be done anyway, since it's a currently existing bug, as discussed in [1]. Addressing 3) is not at all trivial. Here's what I've thought of so far: Approach I) My first thoughts were around trying to make the relevant pgstat infrastructure either not need to allocate memory, or handle memory allocation failures gracefully. Unfortunately that seems not really viable: The most successful approach I tried was to report stats directly to the dshash table, and only report stats if there's already an entry (which there just about always will be, except for a very short period after stats have been reset). Unfortunately that fails because to access the shared memory with the stats data we need to do dsa_get_address(), which can fail if the relevant dsm segment wasn't already mapped in the current process (it allocates memory in the process of mapping in the segment). There's no API to do that without erroring out. That aspect rules out a number of other approaches that sounded like they could work - we e.g. could increase the refcount of the relevant pgstat entry before issuing IO, ensuring that it's around by the time we need to report.
But that wouldn't get around the issue of needing to map in the dsm segment. Approach II) Don't report the error in the completion callback. The obvious place would be to do it where we'll raise the warning/error in the issuing process. The big disadvantage is that it could lead to under-counting checksum errors: a) A read stream does 2+ concurrent reads for the same relation, and more than one encounters checksum errors. When processing the results for the first failed read, we raise an error and thus won't process the results of the second+ reads with errors. b) A read is started asynchronously, but before the backend gets around to processing the result of the IO, it errors out during that other work (possibly due to a cancellation). Because the backend never looked at the results of the IOs, the checksum errors don't get accounted for. b) doesn't overly bother me, but a) seems problematic. Approach III) Accumulate checksum errors in two backend local variables (one for database specific errors, one for errors on shared relations), which will be flushed by the backend that issued IO during the next pgstat_report_start(). Two disadvantages: - Accumulation of errors will be delayed until the next pgstat_report_start(). That seems acceptable, after all we do so for a lot of other stats. - We need to register a local callback for shared buffer reads, which don't need one today. That's a small bit of added overhead. It's a shame to do so for counters that approximately never get incremented. Approach IV): Embrace piercing abstractions / generic infrastructure and put two atomic variables (one for shared relations, one for the backend's database) in some backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and update those in the completion callback. Flush them to the shared stats in pgstat_report_start() or such. This would avoid the need for the local completion callback, and would also allow introducing a function to see the number of "unflushed" checksum errors. It also doesn't require transporting the number of errors between the shared callback and the local callback - but we might want to have that for the error message anyway. I wish the new-to-18 pgstat_backend() were designed in a way that made this possible nicely. But unfortunately it puts the backend-specific data in the dshash table / dynamic shared memory, rather than in a MaxBackends + NUM_AUX sized array in plain shared memory. As explained in I), we can't rely on having the entire array mapped. Leaving the issue from this email aside, that also adds a fair bit of overhead to other cases. Does anybody have better ideas? I think II), III) and IV) are all relatively simple to implement. The most complicated bit is that a bit of bit-squeezing is necessary to fit the number of checksum errors (in addition to the number of otherwise invalid pages) into the available space for error data. It's doable. We could also just increase the size of PgAioResult. I've implemented II), but I'm not sure the disadvantages are acceptable. Greetings, Andres Freund [1] https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg%40e4gt7npsohuz
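To make approach III concrete, a minimal sketch (all names hypothetical; it assumes the existing pgstat_report_checksum_failures_in_db() can be called at flush time, outside any critical section):

    /* bumped in local completion callbacks; no allocation, no locking */
    static uint64 pending_checksum_failures_db;      /* current database */
    static uint64 pending_checksum_failures_shared;  /* shared relations */

    static inline void
    remember_checksum_failures(bool on_shared_rel, uint64 count)
    {
        if (on_shared_rel)
            pending_checksum_failures_shared += count;
        else
            pending_checksum_failures_db += count;
    }

    /* called from pgstat_report_start() in the IO-issuing backend */
    static void
    flush_pending_checksum_failures(void)
    {
        if (pending_checksum_failures_db != 0)
            pgstat_report_checksum_failures_in_db(MyDatabaseId,
                                                  (int) pending_checksum_failures_db);
        if (pending_checksum_failures_shared != 0)
            pgstat_report_checksum_failures_in_db(InvalidOid,
                                                  (int) pending_checksum_failures_shared);
        pending_checksum_failures_db = 0;
        pending_checksum_failures_shared = 0;
    }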
On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote: > I now wrote some tests. And I both regret doing so (because it found problems, > which would have been apparent long ago, if the feature had come with *any* > tests, if I had gone the same way I could have just pushed stuff) and am glad > I did (because I dislike pushing broken stuff). > > I have to admit, I was tempted to just ignore this issue and just not say > anything about tests for checksum failures anymore. I don't blame you. > 3) We can't pgstat_report_checksum_failure() during the completion callback, > as it *sometimes* allocates memory > > Aside from the allocation-in-critical-section asserts, I think this is > *extremely* unlikely to actually cause a problem in practice. But we don't > want to rely on that, obviously. > Addressing 3) is not at all trivial. Here's what I've thought of so far: > > > Approach I) > Unfortunately that fails because to access the shared memory with the stats > data we need to do dsa_get_address() > Approach II) > > Don't report the error in the completion callback. The obvious place would be > to do it where we'll raise the warning/error in the issuing process. The > big disadvantage is that it could lead to under-counting checksum > errors: > > a) A read stream does 2+ concurrent reads for the same relation, and more than > one encounters checksum errors. When processing the results for the first > failed read, we raise an error and thus won't process the results of the > second+ reads with errors. > > b) A read is started asynchronously, but before the backend gets around to > processing the result of the IO, it errors out during that other work > (possibly due to a cancellation). Because the backend never looked at the > results of the IOs, the checksum errors don't get accounted for. > > b) doesn't overly bother me, but a) seems problematic. While neither are great, I could live with both. I guess I'm optimistic that clusters experiencing checksum failures won't lose enough reports to these loss sources to make the difference in whether monitoring catches them. In other words, a cluster will report N failures without these losses and N-K after these losses. If N is large enough for relevant monitoring to flag the cluster appropriately, N-K will also be large enough. > Approach III) > > Accumulate checksum errors in two backend local variables (one for database > specific errors, one for errors on shared relations), which will be flushed by > the backend that issued IO during the next pgstat_report_start(). > > Two disadvantages: > > - Accumulation of errors will be delayed until the next > pgstat_report_start(). That seems acceptable, after all we do so for a lot > of other stats. Yep, acceptable. > - We need to register a local callback for shared buffer reads, which don't > need one today. That's a small bit of added overhead. It's a shame to do > so for counters that approximately never get incremented. Fair concern. An idea is to let the complete_shared callback change the callback list associated with the IO, so it could change PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW. The latter would differ from the former only in having the extra local callback. Could that help? I think the only overhead is using more PGAIO_HCB numbers. We currently reserve 256 (uint8), but one could imagine trying to pack into fewer bits. That said, this wouldn't paint us into a corner. We could change the approach later.
pgaio_io_call_complete_local() starts a critical section. Is that a problem for this approach? > Approach IV): > > Embrace piercing abstractions / generic infrastructure and put two atomic > variables (one for shared relations, one for the backend's database) in some > backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and > update those in the completion callback. Flush them to the shared > stats in pgstat_report_start() or such. I could live with that. I feel better about Approach III currently, though. Overall, I'm feeling best about III long-term, but II may be the right tactical choice. > Does anybody have better ideas? I think no, but here are some ideas I tossed around: - Like your Approach III, but have the completing process store the count locally and flush it, instead of the staging process doing so. Would need more than 2 slots, but we could have a fixed number of slots and just discard any reports that arrive with all slots full. Reporting checksum failures in, say, 8 databases in quick succession probably tells the DBA there's "enough corruption to start worrying". Missing the 9th database would be okay. - Pre-warm the memory allocations and DSAs we could possibly need, so we can report those stats in critical sections, from the completing process. Bad since there's an entry per database, hence no reasonable limit on how much memory a process might need to pre-warm. We could even end up completing an IO for a database that didn't exist on entry to our critical section. - Skip the checksum pgstats if we're completing in a critical section. Doesn't work since we _always_ make a critical section to complete I/O. This email isn't as well-baked as I like, but the alternative was delaying it 24-48h depending on how other duties go over those hours. My v2.13 review is still in-progress, too.
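For comparison, the fixed-slot idea might be sketched as follows (slot count and all names invented for illustration; count == 0 marks a free slot):

    #define PENDING_CKSUM_SLOTS 8

    typedef struct PendingChecksumFailures
    {
        Oid     dboid;      /* InvalidOid is used for shared relations */
        uint64  count;
    } PendingChecksumFailures;

    static PendingChecksumFailures pending_slots[PENDING_CKSUM_SLOTS];

    static void
    remember_checksum_failures(Oid dboid, uint64 count)
    {
        for (int i = 0; i < PENDING_CKSUM_SLOTS; i++)
        {
            if (pending_slots[i].count == 0)
                pending_slots[i].dboid = dboid; /* claim the free slot */
            if (pending_slots[i].dboid == dboid)
            {
                pending_slots[i].count += count;
                return;
            }
        }
        /* all slots hold other databases: deliberately drop the report */
    }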
Hi, On 2025-03-27 20:22:23 -0700, Noah Misch wrote: > On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote: > > Don't report the error in the completion callback. The obvious place would be > > to do it where we'll raise the warning/error in the issuing process. The > > big disadvantage is that it could lead to under-counting checksum > > errors: > > > > a) A read stream does 2+ concurrent reads for the same relation, and more than > > one encounters checksum errors. When processing the results for the first > > failed read, we raise an error and thus won't process the results of the > > second+ reads with errors. > > > > b) A read is started asynchronously, but before the backend gets around to > > processing the result of the IO, it errors out during that other work > > (possibly due to a cancellation). Because the backend never looked at the > > results of the IOs, the checksum errors don't get accounted for. > > > > b) doesn't overly bother me, but a) seems problematic. > > While neither are great, I could live with both. I guess I'm optimistic that > clusters experiencing checksum failures won't lose enough reports to these > loss sources to make the difference in whether monitoring catches them. In > other words, a cluster will report N failures without these losses and N-K > after these losses. If N is large enough for relevant monitoring to flag the > cluster appropriately, N-K will also be large enough. That's true. > > Approach III) > > > > Accumulate checksum errors in two backend local variables (one for database > > specific errors, one for errors on shared relations), which will be flushed by > > the backend that issued IO during the next pgstat_report_start(). FWIW, two variables turn out to not quite suffice - as I realized later, we actually can issue IO on behalf of arbitrary databases, due to ScanSourceDatabasePgClass() and RelationCopyStorageUsingBuffer(). That unfortunately makes it much harder to be able to guarantee that the completor of an IO has the DSM segment for a pg_stat_database stats entry mapped. > > - We need to register a local callback for shared buffer reads, which don't > > need one today. That's a small bit of added overhead. It's a shame to do > > so for counters that approximately never get incremented. > > Fair concern. An idea is to let the complete_shared callback change the > callback list associated with the IO, so it could change > PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW. The > latter would differ from the former only in having the extra local callback. > Could that help? I think the only overhead is using more PGAIO_HCB numbers. I think changing the callback could work - I'll do some measurements in a coffee or two, but I suspect the overhead is not worth being too worried about for now. There's a different aspect that worries me slightly more, see further down. > We currently reserve 256 (uint8), but one could imagine trying to pack into > fewer bits. Yea, my current local worktree reduces it to 6 bits for now, to make space for keeping track of the number of checksum failures in error data (as part of that, it adds defines for the bit widths). If that becomes an issue we can make PgAioResult wider, but I suspect that won't be too soon. One simplification that we could make is to only ever report one checksum failure for each IO, even if N buffers failed - after all that's what HEAD does (by virtue of throwing an error after the first). Then we'd not track the number of checksum errors.
> That said, this wouldn't paint us into a corner. We could change the > approach later. Indeed - I think we mainly need something that works for now. I think medium term the right fix here would be to make sure that the stats can be accounted for with just an atomic increment somewhere. We've had several discussions around having an in-memory datastructure for every relation that currently has a buffer in shared_buffers, to store e.g. the relation length and the sync requests. If we get that (I think Thomas has a prototype), we can accumulate the number of checksum errors in there, for example. It'd also allow us to address the biggest blocker for writes, namely that RememberSyncRequest() could fail, *after* IO completion. > pgaio_io_call_complete_local() starts a critical section. Is that a problem > for this approach? I think we can make it not a problem - I added a pgstat_prepare_report_checksum_failure(dboid) that ensures the calling backend has a reference to the relevant shared memory stats entry. If we make the rule that it has to be called *before* starting buffered IO (i.e. in AsyncReadBuffers()) - sketched at the end of this mail - we can be sure the stats reference still exists by the time local completion runs (as there isn't a way to have the stats entry dropped without dropping the database, which isn't possible while a) the database is still connected to, for normal IO, or b) the CREATE DATABASE is still running). Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a local hashtable. That's more expensive than an indirect function call (i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if not we can apply a mini-cache for the current database, which is surely the only thing that ever matters for performance. > > Approach IV): > > > > Embrace piercing abstractions / generic infrastructure and put two atomic > > variables (one for shared relations, one for the backend's database) in some > > backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and > > update that in the completion callback. Flush that variable to the shared > > stats in pgstat_report_start() or such. > > I could live with that. I feel better about Approach III currently, though. > Overall, I'm feeling best about III long-term, but II may be the right > tactical choice. I think it's easy to change between these approaches. Both require that we encode the number of checksum failures in the result, which is where most of the complexity lies (but still a rather surmountable amount of complexity). > I think no, but here are some ideas I tossed around: > > - Like your Approach III, but have the completing process store the count > locally and flush it, instead of the staging process doing so. Would need > more than 2 slots, but we could have a fixed number of slots and just > discard any reports that arrive with all slots full. Reporting checksum > failures in, say, 8 databases in quick succession probably tells the DBA > there's "enough corruption to start worrying". Missing the 9th database > would be okay. Yea. I think that'd be an ok fallback, but if we can make III' work, it'd be nicer. > - Pre-warm the memory allocations and DSAs we could possibly need, so we can > report those stats in critical sections, from the completing process. Bad > since there's an entry per database, hence no reasonable limit on how much > memory a process might need to pre-warm. We could even end up completing an > IO for a database that didn't exist on entry to our critical section.
I experimented with this one - it works surprisingly well, because for IO workers we could just do the pre-warming outside of the critical section, and it's *exceedingly* rare that any other completor would ever need to complete IO for a database other than the current one / a shared relation. But it does leave a nasty edge case that we'd just have to accept. I guess we could just make it so that in that case stats aren't reported. But it seems pretty ugly. > This email isn't as well-baked as I'd like, but the alternative was delaying it > 24-48h depending on how other duties go over those hours. My v2.13 review is > still in progress, too. It's appreciated! Greetings, Andres Freund
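To make the ordering rule for Approach III above concrete, a minimal sketch; the call sites are illustrative, and pgstat_report_checksum_failures_in_db() stands in for whatever the reporting function ends up being:

    /*
     * In AsyncReadBuffers(), before the IO is staged and before any critical
     * section: make sure this backend holds a reference to the relevant
     * pg_stat_database entry, so it cannot go away before local completion.
     */
    pgstat_prepare_report_checksum_failure(dboid);

    /* ... stage and (eventually) submit the IO ... */

    /*
     * Later, in the issuing backend's local completion callback, possibly
     * inside a critical section: the reference acquired above still exists,
     * so reporting reduces to updating the already-pinned entry.
     */
    if (checkfail_count > 0)
        pgstat_report_checksum_failures_in_db(dboid, checkfail_count);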
Hi, On 2025-03-28 08:54:42 -0400, Andres Freund wrote: > One simplification that we could make is to only ever report one checksum > failure for each IO, even if N buffers failed - after all that's what HEAD > does (by virtue of throwing an error after the first). Then we'd not track the > number of checksum errors. Just after sending, I thought of another variation: Report the number of *invalid* pages (which we already track) as checksum errors, if there was at least one checksum error. It's imo rather weird that we track checksum errors but we don't track invalid page headers, despite the latter being an even worse indication of something having gone wrong... Greetings, Andres Freund
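In code form, that variation would be just the following (variable names hypothetical):

    /* invalid_page_count covers both checksum failures and bad page headers */
    if (checkfail_count > 0)
        pgstat_report_checksum_failures_in_db(dboid, invalid_page_count);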
Hi, On 2025-03-28 08:54:42 -0400, Andres Freund wrote: > On 2025-03-27 20:22:23 -0700, Noah Misch wrote: > > On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote: > > > - We need to register a local callback for shared buffer reads, which don't > > > need them today. That's a small bit of added overhead. It's a shame to do > > > so for counters that approximately never get incremented. > > > > Fair concern. An idea is to let the complete_shared callback change the > > callback list associated with the IO, so it could change > > PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW. The > > latter would differ from the former only in having the extra local callback. > > Could that help? I think the only overhead is using more PGAIO_HCB numbers. > > I think changing the callback could work - I'll do some measurements in a > coffee or two, but I suspect the overhead is not worth being too worried about > for now. There's a different aspect that worries me slightly more, see > further down. > ... > Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a > local hashtable. That's more expensive than an indirect function call > (i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if > not we can apply a mini-cache for the current database, which is surely the > only thing that ever matters for performance. I tried it and at ~30GB/s of read IO, with checksums disabled, I can't see a difference with either the unnecessary complete_local callback or the lookup in pgstat_prepare_report_checksum_failure(). In a profile there are a few hits inside pgstat_get_entry_ref(), but not enough to matter. Hence I think this isn't worth worrying about, at least for now. I think we have far bigger fish to fry at this point than such a small performance difference. I've adjusted the comment above TRACE_POSTGRESQL_BUFFER_READ_DONE() to not mention the overhead. I'm still inclined to think that it's better to call it in the shared completion callback. I also fixed support and added tests for ignore_checksum_failure, which also needs to be determined at the start of the IO, not at completion. Once more there were no tests, of course. I spent the last 6 hours on the stupid error/warning messages around this - somewhat ridiculous. The number of combinations is annoyingly large. It's e.g. plausible to use ignore_checksum_failure=on and zero_damaged_pages=on at the same time for recovery. The same buffer could both be ignored *and* zeroed. Or somebody could use ignore_checksum_failure=on but then still encounter a page that is invalid. But I finally got to a point where the code ends up readable, without undue duplication. It would, leaving some nasty hack aside, require an errhint_internal() - but I can't imagine a reason against introducing that, given we have it for errmsg and errdetail. Here's the relevant code: /* * Treat a read that had both zeroed buffers *and* ignored checksums as a * special case, it's too irregular to be emitted the same way as the other * cases. */ if (zeroed_any && ignored_any) { Assert(zeroed_any && ignored_any); Assert(nblocks > 1); /* same block can't be both zeroed and ignored */ Assert(result.status != PGAIO_RS_ERROR); affected_count = zeroed_or_error_count; ereport(elevel, errcode(ERRCODE_DATA_CORRUPTED), errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s", affected_count, checkfail_count, first, last, rpath.str), affected_count > 1 ?
errdetail("Block %u held first zeroed page.", first + first_off) : 0, errhint("See server log for details about the other %u invalid blocks.", affected_count + checkfail_count - 1)); return; } /* * The other messages are highly repetitive. To avoid duplicating a long * and complicated ereport(), gather the translated format strings * separately and then do one common ereport. */ if (result.status == PGAIO_RS_ERROR) { Assert(!zeroed_any); /* can't have invalid pages when zeroing them */ affected_count = zeroed_or_error_count; msg_one = _("invalid page in block %u of relation %s"); msg_mult = _("%u invalid pages among blocks %u..%u of relation %s"); det_mult = _("Block %u held first invalid page."); hint_mult = _("See server log for the other %u invalid blocks."); } else if (zeroed_any && !ignored_any) { affected_count = zeroed_or_error_count; msg_one = _("invalid page in block %u of relation %s; zeroing out page"); msg_mult = _("zeroing out %u invalid pages among blocks %u..%u of relation %s"); det_mult = _("Block %u held first zeroed page."); hint_mult = _("See server log for the other %u zeroed blocks."); } else if (!zeroed_any && ignored_any) { affected_count = checkfail_count; msg_one = _("ignoring checksum failure in block %u of relation %s"); msg_mult = _("ignoring %u checksum failures among blocks %u..%u of relation %s"); det_mult = _("Block %u held first ignored page."); hint_mult = _("See server log for the other %u ignored blocks."); } else pg_unreachable(); ereport(elevel, errcode(ERRCODE_DATA_CORRUPTED), affected_count == 1 ? errmsg_internal(msg_one, first + first_off, rpath.str) : errmsg_internal(msg_mult, affected_count, first, last, rpath.str), affected_count > 1 ? errdetail_internal(det_mult, first + first_off) : 0, affected_count > 1 ? errhint_internal(hint_mult, affected_count - 1) : 0); Does that approach make sense? What do you think about using "zeroing invalid page in block %u of relation %s" instead of "invalid page in block %u of relation %s; zeroing out page" I thought about instead translating "ignoring", "ignored", "zeroing", "zeroed", etc separately, but I have doubts about how well that would actually translate. Greetings, Andres Freund
On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote: > The number of combinations is annoyingly large. It's e.g. plausible to use > ignore_checksum_failure=on and zero_damaged_pages=on at the same time for > recovery. That's intricate indeed. > But I finally got to a point where the code ends up readable, without undue > duplication. It would, leaving some nasty hack aside, require an > errhint_internal() - but I can't imagine a reason against introducing that, > given we have it for errmsg and errdetail. Introducing that is fine. > Here's the relevant code: > > /* > * Treat a read that had both zeroed buffers *and* ignored checksums as a > * special case, it's too irregular to be emitted the same way as the other > * cases. > */ > if (zeroed_any && ignored_any) > { > Assert(zeroed_any && ignored_any); > Assert(nblocks > 1); /* same block can't be both zeroed and ignored */ > Assert(result.status != PGAIO_RS_ERROR); > affected_count = zeroed_or_error_count; > > ereport(elevel, > errcode(ERRCODE_DATA_CORRUPTED), > errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s", > affected_count, checkfail_count, first, last, rpath.str), Translation stumbles on this one, because each of the first two %u is plural-sensitive. I'd do one of: - Call ereport() twice, once for zeroed pages and once for ignored checksums. Since elevel <= ERROR here, that doesn't lose the second call. - s/pages/page(s)/ like msgid "There are %d other session(s) and %d prepared transaction(s) using the database." - Something more like the style of VACUUM VERBOSE, e.g. "INTRO_TEXT: %u zeroed, %u checksums ignored". I've not written INTRO_TEXT, and this doesn't really resolve pluralization. Probably don't use this option. > affected_count > 1 ? > errdetail("Block %u held first zeroed page.", > first + first_off) : 0, > errhint("See server log for details about the other %u invalid blocks.", > affected_count + checkfail_count - 1)); > return; > } > > /* > * The other messages are highly repetitive. To avoid duplicating a long > * and complicated ereport(), gather the translated format strings > * separately and then do one common ereport. > */ > if (result.status == PGAIO_RS_ERROR) > { > Assert(!zeroed_any); /* can't have invalid pages when zeroing them */ > affected_count = zeroed_or_error_count; > msg_one = _("invalid page in block %u of relation %s"); > msg_mult = _("%u invalid pages among blocks %u..%u of relation %s"); > det_mult = _("Block %u held first invalid page."); > hint_mult = _("See server log for the other %u invalid blocks."); For each hint_mult, we would usually use ngettext() instead of _(). (Would be errhint_plural() if not separated from its ereport().) Alternatively, s/blocks/block(s)/ is fine.
> } > else if (zeroed_any && !ignored_any) > { > affected_count = zeroed_or_error_count; > msg_one = _("invalid page in block %u of relation %s; zeroing out page"); > msg_mult = _("zeroing out %u invalid pages among blocks %u..%u of relation %s"); > det_mult = _("Block %u held first zeroed page."); > hint_mult = _("See server log for the other %u zeroed blocks."); > } > else if (!zeroed_any && ignored_any) > { > affected_count = checkfail_count; > msg_one = _("ignoring checksum failure in block %u of relation %s"); > msg_mult = _("ignoring %u checksum failures among blocks %u..%u of relation %s"); > det_mult = _("Block %u held first ignored page."); > hint_mult = _("See server log for the other %u ignored blocks."); > } > else > pg_unreachable(); > > ereport(elevel, > errcode(ERRCODE_DATA_CORRUPTED), > affected_count == 1 ? > errmsg_internal(msg_one, first + first_off, rpath.str) : > errmsg_internal(msg_mult, affected_count, first, last, rpath.str), > affected_count > 1 ? errdetail_internal(det_mult, first + first_off) : 0, > affected_count > 1 ? errhint_internal(hint_mult, affected_count - 1) : 0); > > Does that approach make sense? Yes. > What do you think about using > "zeroing invalid page in block %u of relation %s" > instead of > "invalid page in block %u of relation %s; zeroing out page" I like the replacement. It moves the important part to the front, and it's shorter. > I thought about instead translating "ignoring", "ignored", "zeroing", > "zeroed", etc separately, but I have doubts about how well that would actually > translate. Agreed, I wouldn't have high hopes for that. An approach like that would probably need messages that separate the independently-translated part grammatically, e.g.: /* last %s is translation of "ignore" or "zero-fill" */ "invalid page in block %u of relation %s; resolved by method \"%s\"" (Again, I'm not recommending that.)
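For reference, the ngettext() variant for one of the hint strings from the quoted code would look roughly like:

    /* plural-aware: picks the singular or plural msgid based on the count */
    hint_mult = ngettext("See server log for the other %u invalid block.",
                         "See server log for the other %u invalid blocks.",
                         affected_count - 1);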
Hi, On 2025-03-29 06:41:43 -0700, Noah Misch wrote: > On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote: > > But I finally got to a point where the code ends up readable, without undue > > duplication. It would, leaving some nasty hack aside, require an > > errhint_internal() - but I can't imagine a reason against introducing that, > > given we have it for errmsg and errdetail. > > Introducing that is fine. Cool. > > Here's the relevant code: > > > > /* > > * Treat a read that had both zeroed buffers *and* ignored checksums as a > > * special case, it's too irregular to be emitted the same way as the other > > * cases. > > */ > > if (zeroed_any && ignored_any) > > { > > Assert(zeroed_any && ignored_any); > > Assert(nblocks > 1); /* same block can't be both zeroed and ignored */ > > Assert(result.status != PGAIO_RS_ERROR); > > affected_count = zeroed_or_error_count; > > > > ereport(elevel, > > errcode(ERRCODE_DATA_CORRUPTED), > > errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s", > > affected_count, checkfail_count, first, last, rpath.str), > > Translation stumbles on this one, because each of the first two %u is > plural-sensitive. Fair. We don't generally seem to have been very careful around this in related code, but there's no reason to just continue down that road when it's easy. E.g. in md.c we unconditionally output "could not read blocks %u..%u in file \"%s\": %m" even if it's just a single block... > I'd do one of: > > - Call ereport() twice, once for zeroed pages and once for ignored checksums. > Since elevel <= ERROR here, that doesn't lose the second call. > > - s/pages/page(s)/ like msgid "There are %d other session(s) and %d prepared > transaction(s) using the database." I think I like this better. > > /* > > * The other messages are highly repetitive. To avoid duplicating a long > > * and complicated ereport(), gather the translated format strings > > * separately and then do one common ereport. > > */ > > if (result.status == PGAIO_RS_ERROR) > > { > > Assert(!zeroed_any); /* can't have invalid pages when zeroing them */ > > affected_count = zeroed_or_error_count; > > msg_one = _("invalid page in block %u of relation %s"); > > msg_mult = _("%u invalid pages among blocks %u..%u of relation %s"); > > det_mult = _("Block %u held first invalid page."); > > hint_mult = _("See server log for the other %u invalid blocks."); > > For each hint_mult, we would usually use ngettext() instead of _(). (Would be > errhint_plural() if not separated from its ereport().) Alternatively, > s/blocks/block(s)/ is fine. I will go with the (s) here as well; this stuff is too rare to be worth having pluralized messages imo. > > Does that approach make sense? > > Yes. > ... > > I like the replacement. It moves the important part to the front, and it's > shorter. Cool, I squashed them with the relevant changes now. Attached is v2.14: Changes: - Added a commit to fix stats attribution of checksum errors; previously the checksum errors detected in bufmgr.c/storage.c were always attributed to the current database. This would have caused bigger issues with worker-based IO, as IO workers aren't connected to databases. - Added a commit to allow checksum error reports to happen in critical sections. For that a pgstat_prepare_report_checksum_failure() has to be called in the same backend, before the report from within the critical section. Other suggestions for the name welcome.
- Expanded on the idea in v2.13 to track the number of invalid buffers in the IO's result, by also tracking checksum errors. Combined with the previous point, this fixes the issue of an assert during checksum failure reporting outlined in: https://postgr.es/m/5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga%40ptwkinxqo3az This required being a bit more careful with space in the error, to be able to squeeze in the checksum failure count. - The ignore_checksum_failure of the issuer needs to be used when completing IO, not the completor's, particularly when using io_method=worker. For that the access to ignore_checksum_failure had to be moved from PageIsVerified() to its callers. I added tests for ignore_checksum_failure, including its interplay with zero_damaged_pages. - Deduplicated the error reporting in buffer_readv_report() somewhat by only having the selection of format strings be done in branches. I think this ends up a lot more readable than the huge ereport before. - Added details about the changed error/warning logging to "bufmgr: Implement AIO read support"'s commit message. - polished the commit to add PGAIO_RS_WARNING a bit, adding defines for the bit-widths of PgAioResult portions and static asserts to verify them - Squashed the changes that I had kept separately in v2.13; it was too hard to do that while doing the above changes. I did make the encoding function cast the arguments to uint32 before shifting. I think that's implied by the C integer promotion rules, but it seemed fishy enough to not want to believe in that. I also added a StaticAssertStmt() to ensure we are only using the available bit space (a sketch of the encoding follows below). - Added a test for a) checksum errors being detected b) CREATE DATABASE ... STRATEGY WAL_LOG. The latter is interesting because it also provides test coverage for doing IO for objects in other databases. - Removed an obsoleted inclusion of pg_trace.h in localbuf.c TODO: - I think the tests around zero_damaged_pages, ignore_checksum_failure should be expanded a bit more. There are two FIXMEs in the tests about that. At the moment there are two different test functions for zero_damaged_pages and ignore_checksum_failure; I'm not sure how good that is. I wanted to get this version out, because I have to run some errands, otherwise I'd have implemented them first... Next steps: - push the checksums stats fix - unless somebody sees a reason to not use LOG_SERVER_ONLY in "aio: Implement support for reads in smgr/md/fd", push that Besides that the only change since Noah's last review of that commit is an added comment. - push acronym, glossary change - push pg_aios view (depends a tiny bit on the smgr/md/fd change above) - push "localbuf: Track pincount in BufferDesc as well" - I think I addressed all of Noah's review feedback - address the above TODO Greetings, Andres Freund
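For concreteness, a sketch of what the encoding function described above amounts to; the 7-bit field width matches the decode snippet discussed later in the thread, but the names, field layout and error_data width here are illustrative, not the committed definitions:

    #define READV_COUNT_BITS    7
    #define READV_COUNT_MASK    ((1 << READV_COUNT_BITS) - 1)
    #define ERROR_DATA_BITS     24      /* illustrative width of PgAioResult.error_data */

    static inline uint32
    buffer_readv_encode_error_data(uint8 first_off, uint8 zeroed_or_error_count,
                                   uint8 checkfail_count)
    {
        uint32      error_data = 0;

        /* ensure the three fields actually fit into the available space */
        StaticAssertStmt(3 * READV_COUNT_BITS <= ERROR_DATA_BITS,
                         "encoded error data must fit in PgAioResult.error_data");

        /* cast before shifting, rather than relying on integer promotion */
        error_data |= (uint32) (first_off & READV_COUNT_MASK);
        error_data |= (uint32) (zeroed_or_error_count & READV_COUNT_MASK) << READV_COUNT_BITS;
        error_data |= (uint32) (checkfail_count & READV_COUNT_MASK) << (2 * READV_COUNT_BITS);

        return error_data;
    }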
Attachment
- v2.14-0001-Fix-mis-attribution-of-checksum-failure-stats-.patch
- v2.14-0002-aio-Implement-support-for-reads-in-smgr-md-fd.patch
- v2.14-0003-docs-Add-acronym-and-glossary-entries-for-I-O-.patch
- v2.14-0004-aio-Add-pg_aios-view.patch
- v2.14-0005-localbuf-Track-pincount-in-BufferDesc-as-well.patch
- v2.14-0006-aio-bufmgr-Comment-fixes.patch
- v2.14-0007-aio-Add-WARNING-result-status.patch
- v2.14-0008-pgstat-Allow-checksum-errors-to-be-reported-in.patch
- v2.14-0009-Add-errhint_internal.patch
- v2.14-0010-bufmgr-Implement-AIO-read-support.patch
- v2.14-0011-Let-caller-of-PageIsVerified-control-ignore_ch.patch
- v2.14-0012-bufmgr-Use-AIO-in-StartReadBuffers.patch
- v2.14-0013-aio-Add-README.md-explaining-higher-level-desi.patch
- v2.14-0014-aio-Basic-read_stream-adjustments-for-real-AIO.patch
- v2.14-0015-read_stream-Introduce-and-use-optional-batchmo.patch
- v2.14-0016-docs-Reframe-track_io_timing-related-docs-as-w.patch
- v2.14-0017-Enable-IO-concurrency-on-all-systems.patch
- v2.14-0018-aio-Add-test_aio-module.patch
- v2.14-0019-aio-Experimental-heuristics-to-increase-batchi.patch
- v2.14-0020-aio-Implement-smgr-md-fd-write-support.patch
- v2.14-0021-aio-Add-bounce-buffers.patch
- v2.14-0022-bufmgr-Implement-AIO-write-support.patch
- v2.14-0023-aio-Add-IO-queue-helper.patch
- v2.14-0024-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.14-0025-Ensure-a-resowner-exists-for-all-paths-that-ma.patch
- v2.14-0026-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.14-0027-WIP-Use-MAP_POPULATE.patch
Hi, On 2025-03-29 10:48:10 -0400, Andres Freund wrote: > Attached is v2.14: FWIW, there was a last minute change in the test that fails in one task on CI, due to reading across the smaller segment size configured for one of the runs. Doesn't quite seem worth posting a new version for. > - push the checksums stats fix Done. > - unless somebody sees a reason to not use LOG_SERVER_ONLY in > "aio: Implement support for reads in smgr/md/fd", push that > > Besides that the only change since Noah's last review of that commit is an > added comment. Also done. If we want to change log level later, it's easy to do so. I made some small changes since the version I had posted: - I found one dangling reference to mdread() instead of mdreadv() - I had accidentally squashed the fix to Noah's review comment about a comment above md_readv_report() to the wrong commit (smgr/md/fd.c write support) - PGAIO_HCB_MD_WRITEV was added in "smgr/md/fd.c read support" instead of "smgr/md/fd.c write support" > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above) I think I found an issue with this one - as it stands the view was viewable by everyone. While it doesn't provide a *lot* of insight, it still seems a bit too much for an unprivileged user to learn what part of a relation any other user is currently reading. There'd be two different ways to address that: 1) revoke view & function from public, grant to a limited role (presumably pg_read_all_stats) 2) copy pg_stat_activity's approach of using something like #define HAS_PGSTAT_PERMISSIONS(role) (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role)) on a per-IO basis. Greetings, Andres Freund
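For option 2), the per-IO check would presumably boil down to something like the following (function name hypothetical); option 1) would instead be a plain REVOKE/GRANT in system_views.sql:

    #include "postgres.h"
    #include "catalog/pg_authid_d.h"
    #include "miscadmin.h"
    #include "utils/acl.h"

    /* may the calling user see an IO owned by the given role? */
    static bool
    pgaio_io_visible(Oid owner_role)
    {
        return has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) ||
            has_privs_of_role(GetUserId(), owner_role);
    }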
On Sat, Mar 29, 2025 at 2:25 PM Andres Freund <andres@anarazel.de> wrote: > > I think I found an issue with this one - as it stands the view was viewable by > everyone. While it doesn't provide a *lot* of insight, it still seems a bit > too much for an unprivileged user to learn what part of a relation any other > user is currently reading. > > There'd be two different ways to address that: > 1) revoke view & function from public, grant to a limited role (presumably > pg_read_all_stats) > 2) copy pg_stat_activity's approach of using something like > > #define HAS_PGSTAT_PERMISSIONS(role) (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role)) > > on a per-IO basis. Is it easier to later change it to be more restrictive or less? If it is easier to later lock it down more, then go with 2, otherwise go with 1? - Melanie
Flushing half-baked review comments before going offline for a few hours: On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote: > Attached v2.13, with the following changes: > 5) The WARNING in the callback is now a LOG, as it will be sent to the > client as a WARNING explicitly when the IO's results are processed > > I actually chose LOG_SERVER_ONLY - that seemed slightly better than just > LOG? But not at all sure. LOG_SERVER_ONLY and its synonym COMMERR look to be used for: - ProcessLogMemoryContextInterrupt() - messages before successful authentication - protocol sync loss, where we'd fail to send a client message - client already gone The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT. If a client has decided to set client_min_messages that high, the client might be interested in the fact that it got side-tracked completing someone else's IO. On the other hand, almost none of those sidetrack events will produce messages. The main argument I'd envision for LOG_SERVER_ONLY is that we consider the message content sensitive, but I don't see the message content as materially sensitive. Since you committed LOG_SERVER_ONLY, let's keep that decision. My last draft review discouraged it, but it doesn't matter. pgaio_result_report() should assert elevel != LOG to avoid future divergence. > - Previously the buffer completion callback checked zero_damaged_pages - but > that's not right, the GUC hopefully is only set on a per-session basis Good catch. I've now audited the complete_shared callbacks for other variable references and actions not acceptable there. I found nothing beyond what you found by v2.14. On Sat, Mar 29, 2025 at 10:48:10AM -0400, Andres Freund wrote: > On 2025-03-29 06:41:43 -0700, Noah Misch wrote: > > On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote: > Subject: [PATCH v2.14 01/29] Fix mis-attribution of checksum failure stats to > the wrong database I've skipped reviewing this patch, since it's already committed. If it needs post-commit review, let me know. > Subject: [PATCH v2.14 02/29] aio: Implement support for reads in smgr/md/fd > + /* > + * Immediately log a message about the IO error, but only to the > + * server log. The reason to do so immediately is that the originator > + * might not process the query result immediately (because it is busy > + * doing another part of query processing) or at all (e.g. if it was > + * cancelled or errored out due to another IO also failing). The > + * issuer of the IO will emit an ERROR when processing the IO's s/issuer/definer/ please, to avoid proliferating synonyms. Likewise two other places in the patches. > +/* > + * smgrstartreadv() -- asynchronous version of smgrreadv() > + * > + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than > + * `ioh` all parameters are the same as smgrreadv(). I would add a comment starting with: Compared to smgrreadv(), more responsibilities fall on layers above smgr. Higher layers handle partial reads. smgr will ereport(LOG_SERVER_ONLY) some problems, but higher layers are responsible for pgaio_result_report() to mirror that news to the user and (for ERROR) abort the (sub)transaction. md_readv_complete() comment "issuer of the IO will emit an ERROR" says some of that, but someone adding a smgrstartreadv() call is less likely to find it there. I say "comment starting with", because I think there's a remaining decision about who owns the zeroing currently tied to smgrreadv(). An audit of mdreadv() vs. 
AIO counterparts found this part of mdreadv(): if (nbytes == 0) { /* * We are at or past EOF, or we read a partial block at EOF. * Normally this is an error; upper levels should never try to * read a nonexistent block. However, if zero_damaged_pages * is ON or we are InRecovery, we should instead return zeroes * without complaining. This allows, for example, the case of * trying to update a block that was later truncated away. */ if (zero_damaged_pages || InRecovery) { I didn't write a test to prove its absence, but I'm not finding such code in the AIO path. I wondered if we could just Assert(!InRecovery), but adding that to md_readv_complete() failed 001_stream_rep.pl with this stack: ExceptionalCondition at assert.c:52 md_readv_complete at md.c:2043 pgaio_io_call_complete_shared at aio_callback.c:258 pgaio_io_process_completion at aio.c:515 pgaio_io_perform_synchronously at aio_io.c:148 pgaio_io_stage at aio.c:453 pgaio_io_start_readv at aio_io.c:87 FileStartReadV at fd.c:2243 mdstartreadv at md.c:1005 smgrstartreadv at smgr.c:757 AsyncReadBuffers at bufmgr.c:1938 StartReadBuffersImpl at bufmgr.c:1422 StartReadBuffer at bufmgr.c:1515 ReadBuffer_common at bufmgr.c:1246 ReadBufferExtended at bufmgr.c:818 vm_readbuf at visibilitymap.c:584 visibilitymap_pin at visibilitymap.c:203 heap_xlog_insert at heapam_xlog.c:450 heap_redo at heapam_xlog.c:1195 ApplyWalRecord at xlogrecovery.c:1995 PerformWalRecovery at xlogrecovery.c:1825 StartupXLOG at xlog.c:5895 If this is a real problem, fix options may include: - Implement the InRecovery zeroing for real. - Make the InRecovery case somehow use real mdreadv(), not pgaio_io_perform_synchronously() to use AIO APIs with synchronous AIO. I'll guess this is harder than the previous option, though. > Subject: [PATCH v2.14 04/29] aio: Add pg_aios view > + /* > + * There is no lock that could prevent the state of the IO to advance > + * concurrently - and we don't want to introduce one, as that would > + * introduce atomics into a very common path. Instead we > + * > + * 1) Determine the state + generation of the IO. > + * > + * 2) Copy the IO to local memory. > + * > + * 3) Check if state or generation of the IO changed. If the state > + * changed, retry, if the generation changed don't display the IO. > + */ > + > + /* 1) from above */ > + start_generation = live_ioh->generation; > + pg_read_barrier(); I think "retry:" needs to be here, above start_state assignment. Otherwise, the "live_ioh->state != start_state" test will keep seeing a state mismatch. > + start_state = live_ioh->state; > + > +retry: > + if (start_state == PGAIO_HS_IDLE) > + continue; > Subject: [PATCH v2.14 05/29] localbuf: Track pincount in BufferDesc as well > Subject: [PATCH v2.14 07/29] aio: Add WARNING result status > Subject: [PATCH v2.14 08/29] pgstat: Allow checksum errors to be reported in > critical sections > Subject: [PATCH v2.14 09/29] Add errhint_internal() Ready for commit > Subject: [PATCH v2.14 10/29] bufmgr: Implement AIO read support > Buffer reads executed this infrastructure will report invalid page / checksum > errors / warnings differently than before: s/this/through this/ > + *zeroed_or_error_count = rem_error & ((1 << 7) - 1); > + rem_error >>= 7; These raw "7" are good places to use your new #define values. Likewise in buffer_readv_encode_error(). > + * that was errored or zerored or, if no errors/zeroes, the first ignored s/zerored/zeroed/ > + * enough. 
If there is an error, the error is the integeresting offset, typo "integeresting" > +/* > + * We need a backend-local completion callback for shared buffers, to be able > + * to report checksum errors correctly. Unfortunately that can only safely > + * happen if the reporting backend has previously called Missing end of sentence. > @@ -144,8 +144,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail > */ There's an outdated comment ending here: /* * Throw a WARNING if the checksum fails, but only after we've checked for * the all-zeroes case. */ > if (checksum_failure) > { > - if ((flags & PIV_LOG_WARNING) != 0) > - ereport(WARNING, > + if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0) > + ereport(flags & PIV_LOG_WARNING ? WARNING : LOG, > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control > ignore_checksum_failure > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers() > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode > support > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as > wait time > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems Ready for commit > Subject: [PATCH v2.14 18/29] aio: Add test_aio module I didn't yet re-review the v2.13 or 2.14 changes to this one. That's still in my queue. One thing I noticed anyway: > +# Tests using injection points. Mostly to exercise had IO errors that are s/had/hard/ On Sat, Mar 29, 2025 at 02:25:15PM -0400, Andres Freund wrote: > On 2025-03-29 10:48:10 -0400, Andres Freund wrote: > > Attached is v2.14: > > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above) > > I think I found an issue with this one - as it stands the view was viewable by > everyone. While it doesn't provide a *lot* of insight, it still seems a bit > too much for an unprivileged user to learn what part of a relation any other > user is currently reading. > > There'd be two different ways to address that: > 1) revoke view & function from public, grant to a limited role (presumably > pg_read_all_stats) > 2) copy pg_stat_activity's approach of using something like > > #define HAS_PGSTAT_PERMISSIONS(role) (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role)) > > on a per-IO basis. No strong opinion. I'm not really worried about any of this information leaking. Nothing in pg_aios comes close to the sensitivity of pg_stat_activity.query. pg_stat_activity is mighty cautious, hiding even stuff like wait_event_type that I wouldn't worry about. Hence, another valid choice is (3) change nothing. Meanwhile, I see substantially less need to monitor your own IOs than to monitor your own pg_stat_activity rows, and even your own IOs potentially reveal things happening in other sessions, e.g. evicting buffers that others read and you never read. So restrictions wouldn't be too painful, and (1) arguably helps privacy more than (2). I'd likely go with (1) today.
Hi, On 2025-03-29 14:29:29 -0700, Noah Misch wrote: > Flushing half-baked review comments before going offline for a few hours: > > On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote: > > Attached v2.13, with the following changes: > > > 5) The WARNING in the callback is now a LOG, as it will be sent to the > > client as a WARNING explicitly when the IO's results are processed > > > > I actually chose LOG_SERVER_ONLY - that seemed slightly better than just > > LOG? But not at all sure. > > LOG_SERVER_ONLY and its synonym COMMERR look to be used for: > > - ProcessLogMemoryContextInterrupt() > - messages before successful authentication > - protocol sync loss, where we'd fail to send a client message > - client already gone > > The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT. > If a client has decided to set client_min_messages that high, the client might > be interested in the fact that it got side-tracked completing someone else's > IO. On the other hand, almost none of those sidetrack events will produce > messages. The main argument I'd envision for LOG_SERVER_ONLY is that we > consider the message content sensitive, but I don't see the message content as > materially sensitive. I don't think it's sensitive - it just seems a bit silly to send the same thing to the client twice, rather than only to the server log. I'm happy to change it to LOG if you prefer. Your points below mean some comments need to be updated in smgr/md.c anyway. > > - Previously the buffer completion callback checked zero_damaged_pages - but > > that's not right, the GUC hopefully is only set on a per-session basis > > Good catch. I've now audited the complete_shared callbacks for other variable > references and actions not acceptable there. I found nothing beyond what you > found by v2.14. I didn't find anything else either. > > Subject: [PATCH v2.14 02/29] aio: Implement support for reads in smgr/md/fd > > > + /* > > + * Immediately log a message about the IO error, but only to the > > + * server log. The reason to do so immediately is that the originator > > + * might not process the query result immediately (because it is busy > > + * doing another part of query processing) or at all (e.g. if it was > > + * cancelled or errored out due to another IO also failing). The > > + * issuer of the IO will emit an ERROR when processing the IO's > > s/issuer/definer/ please, to avoid proliferating synonyms. Likewise two other > places in the patches. Hm. Will do. Doesn't bother me personally, but happy to change it. > > +/* > > + * smgrstartreadv() -- asynchronous version of smgrreadv() > > + * > > + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than > > + * `ioh` all parameters are the same as smgrreadv(). > > I would add a comment starting with: > > Compared to smgrreadv(), more responsibilities fall on layers above smgr. > Higher layers handle partial reads. smgr will ereport(LOG_SERVER_ONLY) some > problems, but higher layers are responsible for pgaio_result_report() to > mirror that news to the user and (for ERROR) abort the (sub)transaction. Hm - if we document that in all the smgrstart* we'd end up with something like that in a lot of places - but OTOH, this is the first one so far... > I say "comment starting with", because I think there's a remaining decision > about who owns the zeroing currently tied to smgrreadv(). An audit of > mdreadv() vs.
AIO counterparts found this part of mdreadv(): > > if (nbytes == 0) > { > /* > * We are at or past EOF, or we read a partial block at EOF. > * Normally this is an error; upper levels should never try to > * read a nonexistent block. However, if zero_damaged_pages > * is ON or we are InRecovery, we should instead return zeroes > * without complaining. This allows, for example, the case of > * trying to update a block that was later truncated away. > */ > if (zero_damaged_pages || InRecovery) > { > > I didn't write a test to prove its absence, but I'm not finding such code in > the AIO path. Yes, there is no such codepath. A while ago I had started a thread about whether the above codepath is necessary, as the whole idea of putting a buffer into shared buffers that doesn't exist on-disk is *extremely* ill-conceived: it puts a buffer into shared buffers that somehow wasn't readable on disk, *without* creating it on disk. The problem is that an mdnblocks() wouldn't know about that only-in-memory part of the relation and thus most parts of PG won't consider that buffer to exist - it'd be just skipped in sequential scans etc, but then it'd trigger errors when extending the relation ("unexpected data beyond EOF"), etc. I had planned to put an error into mdreadv() at the time, but somehow lost track of that - I kind of mentally put this issue into the "done" category :( Based on my research, the InRecovery path is not reachable (most recovery buffer reads go through XLogReadBufferExtended(), which extends files at that layer, and the exceptions like VM/FSM have explicit code to extend the relation, c.f. vm_readbuf()). It actually looks to me like it *never* was reachable; the XLogReadBufferExtended() predecessors, back to the initial addition of WAL to PG, had such an extension path, as did vm/fsm. The zero_damaged_pages path hasn't reliably worked for a long time afaict, because _mdfd_getseg() doesn't know about it (note we're not passing EXTENSION_CREATE). So unless the buffer is just after the physical end of the last segment, you'll just get an error at that point. To my knowledge we haven't heard related complaints. It makes some sense to have zero_damaged_pages for actually existing pages reached from sequential / tid / COPY on the table level - after all that's the only way you might get data out during data recovery. But those would never reach this logic, as such scans rely on mdnblocks(). For index -> heap fetches the option seems mainly dangerous, because that'll just create random buffers in shared buffers that, as explained above, won't then be reached by other scans. And index scans during data recovery are not a good idea in the first place; all one should do in that situation is dump out the data. At the very least we need to add a comment about this though. If we want to implement it, it'd be easy enough, but that logic is so insane that I think we shouldn't do it unless there is some *VERY* clear evidence that we need it. > I wondered if we could just Assert(!InRecovery), but adding that to > md_readv_complete() failed 001_stream_rep.pl with this stack: I'd expect that to fail in a lot of paths: XLogReadBufferExtended() -> ReadBufferWithoutRelcache() -> ReadBuffer_common() -> StartReadBuffer() > > Subject: [PATCH v2.14 04/29] aio: Add pg_aios view > > > + /* > > + * There is no lock that could prevent the state of the IO to advance > > + * concurrently - and we don't want to introduce one, as that would > > + * introduce atomics into a very common path.
Instead we > > + * > > + * 1) Determine the state + generation of the IO. > > + * > > + * 2) Copy the IO to local memory. > > + * > > + * 3) Check if state or generation of the IO changed. If the state > > + * changed, retry, if the generation changed don't display the IO. > > + */ > > + > > + /* 1) from above */ > > + start_generation = live_ioh->generation; > > + pg_read_barrier(); > > I think "retry:" needs to be here, above start_state assignment. Otherwise, > the "live_ioh->state != start_state" test will keep seeing a state mismatch. Damn, you're right. > > Subject: [PATCH v2.14 05/29] localbuf: Track pincount in BufferDesc as well > > Subject: [PATCH v2.14 07/29] aio: Add WARNING result status > > Subject: [PATCH v2.14 08/29] pgstat: Allow checksum errors to be reported in > > critical sections > > Subject: [PATCH v2.14 09/29] Add errhint_internal() > > Ready for commit Cool > > Subject: [PATCH v2.14 10/29] bufmgr: Implement AIO read support > > > Buffer reads executed this infrastructure will report invalid page / checksum > > errors / warnings differently than before: > > s/this/through this/ Fixed. > > + *zeroed_or_error_count = rem_error & ((1 << 7) - 1); > > + rem_error >>= 7; > > These raw "7" are good places to use your new #define values. Likewise in > buffer_readv_encode_error(). Which define value are you thinking of here? I don't think any of the ones I added apply? But I think you're right it'd be good to have some define for it, at least locally. > > + * that was errored or zerored or, if no errors/zeroes, the first ignored > > s/zerored/zeroed/ > > > + * enough. If there is an error, the error is the integeresting offset, > > typo "integeresting" :(. Fixed. > > +/* > > + * We need a backend-local completion callback for shared buffers, to be able > > + * to report checksum errors correctly. Unfortunately that can only safely > > + * happen if the reporting backend has previously called > > Missing end of sentence. It's now: /* * We need a backend-local completion callback for shared buffers, to be able * to report checksum errors correctly. Unfortunately that can only safely * happen if the reporting backend has previously called * pgstat_prepare_report_checksum_failure(), which we can only guarantee in * the backend that started the IO. Hence this callback. */ > > @@ -144,8 +144,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail > > */ > > There's an outdated comment ending here: > > /* > * Throw a WARNING if the checksum fails, but only after we've checked for > * the all-zeroes case. > */ Updated to: /* * Throw a WARNING/LOG, as instructed by PIV_LOG_*, if the checksum fails, * but only after we've checked for the all-zeroes case. */ I found one more, the newly added comment about checksum_failure_p was still talking about ignore_checksum_failure, but it should now be IGNORE_CHECKSUM_FAILURE. > > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control > > ignore_checksum_failure > > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers() > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design > > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO > > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode > > support > > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as > > wait time > > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems > > Ready for commit Cool. 
> > Subject: [PATCH v2.14 18/29] aio: Add test_aio module > > I didn't yet re-review the v2.13 or 2.14 changes to this one. That's still in > my queue. That's good - I think some of the tests need to expand a bit more. Since that's at the end of the dependency chain... > One thing I noticed anyway: > > +# Tests using injection points. Mostly to exercise had IO errors that are > > s/had/hard/ Fixed. > On Sat, Mar 29, 2025 at 02:25:15PM -0400, Andres Freund wrote: > > On 2025-03-29 10:48:10 -0400, Andres Freund wrote: > > > Attached is v2.14: > > > > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above) > > > > I think I found an issue with this one - as it stands the view was viewable by > > everyone. While it doesn't provide a *lot* of insight, it still seems a bit > > too much for an unprivileged user to learn what part of a relation any other > > user is currently reading. > > > > There'd be two different ways to address that: > > 1) revoke view & function from public, grant to a limited role (presumably > > pg_read_all_stats) > > 2) copy pg_stat_activity's approach of using something like > > > > #define HAS_PGSTAT_PERMISSIONS(role) (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role)) > > > > on a per-IO basis. > > No strong opinion. Same. > I'm not really worried about any of this information leaking. Nothing in > pg_aios comes close to the sensitivity of pg_stat_activity.query. > pg_stat_activity is mighty cautious, hiding even stuff like wait_event_type > that I wouldn't worry about. Hence, another valid choice is (3) change > nothing. I'd also be on board with that. > Meanwhile, I see substantially less need to monitor your own IOs than to > monitor your own pg_stat_activity rows, and even your own IOs potentially > reveal things happening in other sessions, e.g. evicting buffers that others > read and you never read. So restrictions wouldn't be too painful, and (1) > arguably helps privacy more than (2). > > I'd likely go with (1) today. Sounds good to me. It also has the advantage of being much easier to test than 2). Greetings, Andres Freund
On Sat, Mar 29, 2025 at 08:39:54PM -0400, Andres Freund wrote: > On 2025-03-29 14:29:29 -0700, Noah Misch wrote: > > On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote: > > The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT. > > If a client has decided to set client_min_messages that high, the client might > > be interested in the fact that it got side-tracked completing someone else's > > IO. On the other hand, almost none of those sidetrack events will produce > > messages. The main argument I'd envision for LOG_SERVER_ONLY is that we > > consider the message content sensitive, but I don't see the message content as > > materially sensitive. > > I don't think it's sensitive - it just seems a bit silly to send the same > thing to the client twice, rather than only to the server log. Ah, that adds weight to the benefit of LOG_SERVER_ONLY. > I'm happy to change it to > LOG if you prefer. Your points below mean some comments need to be updated in > smgr/md.c anyway. Nah. > > > +/* > > > + * smgrstartreadv() -- asynchronous version of smgrreadv() > > > + * > > > + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than > > > + * `ioh` all parameters are the same as smgrreadv(). > > > > I would add a comment starting with: > > > > Compared to smgrreadv(), more responsibilities fall on layers above smgr. > > Higher layers handle partial reads. smgr will ereport(LOG_SERVER_ONLY) some > > problems, but higher layers are responsible for pgaio_result_report() to > > mirror that news to the user and (for ERROR) abort the (sub)transaction. > > Hm - if we document that in all the smgrstart* we'd end up with something like > that in a lot of places - but OTOH, this is the first one so far... Alternatively, to avoid duplication: See $PLACE for the tasks that the caller's layer owns, in contrast to smgr owning them for smgrreadv(). > > I say "comment starting with", because I think there's a remaining decision > > about who owns the zeroing currently tied to smgrreadv(). An audit of > > mdreadv() vs. AIO counterparts found this part of mdreadv(): > > > > if (nbytes == 0) > > { > > /* > > * We are at or past EOF, or we read a partial block at EOF. > > * Normally this is an error; upper levels should never try to > > * read a nonexistent block. However, if zero_damaged_pages > > * is ON or we are InRecovery, we should instead return zeroes > > * without complaining. This allows, for example, the case of > > * trying to update a block that was later truncated away. > > */ > > if (zero_damaged_pages || InRecovery) > > { > > > > I didn't write a test to prove its absence, but I'm not finding such code in > > the AIO path. > > Yes, there is no such codepath. > > A while ago I had started a thread about whether the above codepath is > necessary postgr.es/m/3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd which I had forgotten completely. I've redone your audit, and I agree the InRecovery case is dead code. In check-world, InRecovery reaches mdstartreadv() and mdreadv() only via XLogReadBufferExtended(), vm_readbuf(), and fsm_readbuf(). The zero_damaged_pages case entails more of a judgment call about whether or not its rule breakage eclipses its utility. Fortunately, a disappointed zero_damaged_pages user could work around that code's absence by stopping the server and using "dd" to extend the segment with zeros.
> I had planned to put an error into mdreadv() at the time, but somehow lost > track of that - I kind of mentally put this issue into the "done" category :( > At the very least we need to add a comment about this though. I'm fine with either of: 1. Replace that mdreadv() code with an error. 2. Update comment in mdreadv() that we're phasing out this code due to the InRecovery case's dead code status and the zero_damaged_pages problems; AIO intentionally doesn't replicate it. Maybe add Assert(false). I'd do (2) for v18, then do (1) first thing in v19 development. > > > + *zeroed_or_error_count = rem_error & ((1 << 7) - 1); > > > + rem_error >>= 7; > > > > These raw "7" are good places to use your new #define values. Likewise in > > buffer_readv_encode_error(). > > Which define value are you thinking of here? I don't think any of the ones I > added apply? But I think you're right it'd be good to have some define for > it, at least locally. It was just my imagination. Withdrawn. > It's now: > > /* > * We need a backend-local completion callback for shared buffers, to be able > * to report checksum errors correctly. Unfortunately that can only safely > * happen if the reporting backend has previously called > * pgstat_prepare_report_checksum_failure(), which we can only guarantee in > * the backend that started the IO. Hence this callback. > */ Sounds good. > Updated to: > /* > * Throw a WARNING/LOG, as instructed by PIV_LOG_*, if the checksum fails, > * but only after we've checked for the all-zeroes case. > */ > > I found one more, the newly added comment about checksum_failure_p was still > talking about ignore_checksum_failure, but it should now be IGNORE_CHECKSUM_FAILURE. That works. > > > Subject: [PATCH v2.14 18/29] aio: Add test_aio module > > > > > > I didn't yet re-review the v2.13 or 2.14 changes to this one. That's still in > > > my queue. > > > > That's good - I think some of the tests need to expand a bit more. Since > > that's at the end of the dependency chain... Okay, I'll delay on re-reviewing that one. When it's a good time, please put the CF entry back in Needs Review. The patches before it are all ready for commit after the above points of this mail.
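Option 2 would amount to roughly the following in mdreadv() (comment wording illustrative, not a committed change):

    if (nbytes == 0)
    {
        /*
         * Previously: zero-fill at/past EOF when zero_damaged_pages ||
         * InRecovery. The InRecovery part is dead code (recovery extends
         * relations before reading them), and the zero_damaged_pages part
         * hasn't worked reliably because _mdfd_getseg() isn't passed
         * EXTENSION_CREATE; AIO intentionally has no equivalent. Slated to
         * become a hard error in v19.
         */
        Assert(false);

        if (zero_damaged_pages || InRecovery)
        {
            /* existing zero-fill code, slated for removal */
        }
    }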
On Tue, Mar 25, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote: > > Another thought on complete_shared running on other backends: I wonder if we > > should push an ErrorContextCallback that adds "CONTEXT: completing I/O of > > other process" or similar, so people wonder less about how "SELECT FROM a" led > > to a log message about IO on table "b". > > I've been wondering about that as well, and yes, we probably should. > > I'd add the pid of the backend that started the IO to the message - although > need to check whether we're trying to keep PIDs of other processes from > unprivileged users. > > I think we probably should add a similar, but not equivalent, context in io > workers. Maybe "I/O worker executing I/O on behalf of process %d". I think this has not yet been done. The attached patch is an attempt to add error context for IO completions done by another backend (when using io_uring), and for IO processing by an IO worker in general. It seems to work -- that is, running the test_aio tests, you can see the context in the logs. I'm not certain that I did this in the way you both were envisioning, though. - Melanie
Attachment
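The usual shape of such an error context callback, for reference (names illustrative, not necessarily what the attached patch uses):

    typedef struct CompletionContextArg
    {
        int32       owner_pid;      /* pid of the backend that defined the IO */
    } CompletionContextArg;

    static void
    io_completion_error_callback(void *arg)
    {
        CompletionContextArg *ctx = (CompletionContextArg *) arg;

        errcontext("completing I/O of other process %d", ctx->owner_pid);
    }

    /* wrapped around running another backend's completion callbacks */
    static void
    complete_foreign_io(CompletionContextArg *ctx)
    {
        ErrorContextCallback errcallback;

        errcallback.callback = io_completion_error_callback;
        errcallback.arg = ctx;
        errcallback.previous = error_context_stack;
        error_context_stack = &errcallback;

        /* ... the shared completion callbacks would run here ... */

        error_context_stack = errcallback.previous;
    }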
Hi, On 2025-03-27 10:52:10 +1300, Thomas Munro wrote: > On Thu, Mar 27, 2025 at 10:41 AM Andres Freund <andres@anarazel.de> wrote: > > > > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems > > > > > > Consider also updating this comment to stop focusing on prefetch; I think > > > changing that aligns with the patch's other changes: > > > > > > /* > > > * How many buffers PrefetchBuffer callers should try to stay ahead of their > > > * ReadBuffer calls by. Zero means "never prefetch". This value is only used > > > * for buffers not belonging to tablespaces that have their > > > * effective_io_concurrency parameter set. > > > */ > > > int effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY; > > > > Good point. Although I suspect it might be worth adjusting this, and also the > > config.sgml bit about effective_io_concurrency separately. That seems like it > > might take an iteration or two. > > +1 for rewriting that separately from this work on the code (I can > have a crack at that if you want). You taking a crack at that would be appreciated! > For the comment, my suggestion would be something like: > > "Default limit on the level of concurrency that each I/O stream > (currently, ReadStream but in future other kinds of streams) can use. > Zero means that I/O is always performed synchronously, ie not > concurrently with query execution. This value can be overridden at the > tablespace level with the parameter of the same name. Note that > streams performing I/O not classified as single-session work respect > maintenance_io_concurrency instead." Generally sounds good. I do wonder if the last sentence could be made a bit simpler; it took me a few seconds to parse "not classified as single-session". Maybe "classified as performing work for multiple sessions respect maintenance_io_concurrency instead."? Greetings, Andres Freund
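Folding that simplification into Thomas's suggestion, the comment would read roughly:

    /*
     * Default limit on the level of concurrency that each I/O stream
     * (currently, ReadStream, but in future other kinds of streams) can use.
     * Zero means that I/O is always performed synchronously, i.e. not
     * concurrently with query execution. This value can be overridden at the
     * tablespace level with the parameter of the same name. Note that
     * streams classified as performing work for multiple sessions respect
     * maintenance_io_concurrency instead.
     */
    int         effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;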
Hi, On 2025-03-29 14:29:29 -0700, Noah Misch wrote: > > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control > > ignore_checksum_failure > > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers() > > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO > > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode > > support > > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as > > wait time > > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems > > Ready for commit I've pushed most of these after some very light further editing. Yay. Thanks a lot for all the reviews! So far the buildfarm hasn't been complaining, but it's early days. I didn't yet push > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design because I want to integrate some language that could be referenced by smgrstartreadv() (and more in the future), as we have been talking about. Tomorrow I'll work on sending out a new version with the remaining patches. I plan for that version to have: - pg_aios view with the security checks (already done, trivial) - a commit with updated language for smgrstartreadv(), probably referencing aio's README - a change to mdreadv() around zero_damaged_pages, as Noah and I have been discussing - updated tests, with the FIXMEs etc addressed - a reviewed version of the errcontext callback patch that Melanie sent earlier today Todo beyond that: - The comment and docs updates we've been discussing in https://postgr.es/m/5fc6r4smanncsmqw7ib6s3uj6eoiqoioxbd5ibmhtqimcggtoe%40fyrok2gozsoq - I think a good long search through the docs is in order, there probably are other things that should be updated, beyond concrete references to effective_io_concurrency etc. - Whether we should do something, and if so what, about BAS_BULKREAD for 18. Thomas may have some thoughts / code. - Whether there's anything committable around Jelte's RLIMIT_NOFILE changes. Greetings, Andres Freund
Hi, On 2025-03-30 19:46:57 -0400, Andres Freund wrote: > I didn't yet push > > > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design > > because I want to integrate some language that could be referenced by > smgrstartreadv() (and more in the future), as we have been talking about. I tried a bunch of variations and none of them seemed great. So I ended up with a lightly polished version of your suggested comment above smgrstartreadv(). We can later see about generalizing it. > Tomorrow I'll work on sending out a new version with the remaining patches. I > plan for that version to have: Got a bit distracted with $work stuff today, but here we go. The updated version has all of that: > - pg_aios view with the security checks (already done, trivial) > > - a commit with updated language for smgrstartreadv(), probably referencing > aio's README > > - a change to mdreadv() around zero_damaged_pages, as Noah and I have been > discussing > > - updated tests, with the FIXMEs etc addressed > > - a reviewed version of the errcontext callback patch that Melanie sent > earlier today Although I didn't actually find anything in that, other than one unnecessary change. Greetings, Andres Freund
Attachments:
- v2.15-0001-docs-Add-acronym-and-glossary-entries-for-I-O-.patch
- v2.15-0002-aio-Add-pg_aios-view.patch
- v2.15-0003-aio-Add-README.md-explaining-higher-level-desi.patch
- v2.15-0004-aio-Add-test_aio-module.patch
- v2.15-0005-md-Add-comment-assert-to-buffer-zeroing-path-i.patch
- v2.15-0006-aio-comment-polishing.patch
- v2.15-0007-aio-Add-errcontext-for-processing-I-Os-for-ano.patch
- v2.15-0008-aio-Experimental-heuristics-to-increase-batchi.patch
- v2.15-0009-aio-Implement-smgr-md-fd-write-support.patch
- v2.15-0010-aio-Add-bounce-buffers.patch
- v2.15-0011-bufmgr-Implement-AIO-write-support.patch
- v2.15-0012-aio-Add-IO-queue-helper.patch
- v2.15-0013-bufmgr-use-AIO-in-checkpointer-bgwriter.patch
- v2.15-0014-Ensure-a-resowner-exists-for-all-paths-that-ma.patch
- v2.15-0015-Temporary-Increase-BAS_BULKREAD-size.patch
- v2.15-0016-WIP-Use-MAP_POPULATE.patch
Hi Andres, > > I didn't yet push > > > > > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design I have several notes about 0003 / README.md: 1. I noticed that the use of "Postgres" and "postgres" is inconsistent. 2. ``` +pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0); ``` Perhaps I'm a bit late here, but the name of the function is weird. It registers a single callback, but the name is "_callbacks". 3. The use of "AIO Handle" and "AioHandle" is inconsistent. 4. - pgaio_io_register_callbacks - pgaio_io_set_handle_data_32 If I understand correctly, one can register multiple callbacks per AIO Handle (right? ...). However I don't see an obvious way to match handle data to the given callback. If all the callbacks get the same handle data... well it's weird IMO, but we should explicitly say that. On top of that we should probably explain in which order the callbacks are going to be executed. If there are any guarantees in this respect, of course. 5. pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1) Perhaps it's worth mentioning whether `buffer` can be freed after the call, i.e. whether it's stored by value or by reference. It's also worth clarifying whether the maximum number of buffers is limited or not. 6. It is worth clarifying whether AIO allows reads and writes or only reads at the moment. Perhaps it's also worth explicitly saying that AIO is for disk IO only, not for network IO. 7. It is worth clarifying how many times the callbacks are called when reading multiple buffers. Is it guaranteed that the callbacks are called once, or does it somehow depend on the implementation, and what happens if the I/O succeeds only partially? 8. I believe we should tell a bit more about the context in which the callbacks are called. Particularly what happens to the memory contexts and if I can allocate/free memory, can I throw ERRORs, can I create new AIO Handles, is it expected that the callback should return quickly, are the signals masked while the callback is executed, can I use sockets, is it guaranteed that the callback is going to be called in the same process (I guess so, but the text doesn't explicitly promise that), etc. 9. ``` +Because acquisition of an IO handle +[must always succeed](#io-can-be-started-in-critical-sections) +and the number of AIO Handles +[has to be limited](#state-for-aio-needs-to-live-in-shared-memory) +AIO handles can be reused as soon as they have completed. ``` What does pgaio_io_acquire() do if we are out of AIO Handles? Since it always succeeds I guess it should block the caller in this case, but IMO we should say this explicitly. 10. > > because I want to integrate some language that could be referenced by > > smgrstartreadv() (and more in the future), as we have been talking about. > > I tried a bunch of variations and none of them seemed great. So I ended up > with a lightly polished version of your suggested comment above > smgrstartreadv(). We can later see about generalizing it. IMO the problem here is that the README doesn't show the code that does IO per se, and thus doesn't give the full picture of how AIO should be used. Perhaps instead of referencing smgrstartreadv() it would be better to provide a simple but complete example, one that opens a binary file and reads 512 bytes from it at a given offset, for instance. -- Best regards, Aleksander Alekseev
Hi, On 2025-04-01 14:56:17 +0300, Aleksander Alekseev wrote: > Hi Andres, > > > > I didn't yet push > > > > > > > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design > > I have several notes about 0003 / README.md: > > 1. I noticed that the use of "Postgres" and "postgres" is inconsistent. :/ It probably should be consistent, but I have no idea which of the spellings we should go for. Either looks ugly in some contexts. > 2. > > ``` > +pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0); > ``` > > Perhaps I'm a bit late here, but the name of the function is weird. It > registers a single callback, but the name is "_callbacks". It registers a *set* of callbacks (stage, complete_shared, complete_local, report_error) on the handle. > 3. The use of "AIO Handle" and "AioHandle" is inconsistent. This seems ok to me. > 4. > > - pgaio_io_register_callbacks > - pgaio_io_set_handle_data_32 > > If I understand correctly, one can register multiple callbacks per > AIO Handle (right? ...). However I don't see an obvious way to match > handle data to the given callback. If all the callbacks get the same > handle data... well it's weird IMO, but we should explicitly say that. There is: /* * Associate an array of data with the Handle. This is e.g. useful to the * transport knowledge about which buffers a multi-block IO affects to * completion callbacks. * * Right now this can be done only once for each IO, even though multiple * callbacks can be registered. There aren't any known usecases requiring more * and the required amount of shared memory does add up, so it doesn't seem * worth multiplying memory usage by PGAIO_HANDLE_MAX_CALLBACKS. */ > On top of that we should probably explain in which order the callbacks > are going to be executed. If there are any guarantees in this respect, > of course. Alongside PgAioHandleCallbacks: * * The latest registered callback is called first. This allows * higher-level code to register callbacks that can rely on callbacks * registered by lower-level code to already have been executed. * > 5. pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1) > > Perhaps it's worth mentioning whether `buffer` can be freed after the call, > i.e. whether it's stored by value or by reference. By value. > It's also worth clarifying whether the maximum number of buffers is limited or > not. It's limited to PG_IOV_MAX, fwiw. > 6. It is worth clarifying whether AIO allows reads and writes or only reads > at the moment. We have patches for writes, I just ran out of time for 18. I'm not particularly excited about adding stuff that then needs to be removed in 19. > Perhaps it's also worth explicitly saying that AIO is for disk IO only, not > for network IO. I'd like to integrate network IO too. I have a local prototype, fwiw. > 7. It is worth clarifying how many times the callbacks are called when > reading multiple buffers. Is it guaranteed that the callbacks are > called once, or does it somehow depend on the implementation, and > what happens if the I/O succeeds only partially? The aio subsystem doesn't know anything about buffers. Callbacks are executed exactly once, with the exception of the error reporting callback, which could be called multiple times. > 8. I believe we should tell a bit more about the context in which the > callbacks are called.
Particularly what happens to the memory contexts > and if I can allocate/free memory, can I throw ERRORs, can I create > new AIO Handles, is it expected that the callback should return > quickly, are the signals masked while the callback is executed, can I > use sockets, is it guaranteed that the callback is going to be called > in the same process (I guess so, but the text doesn't explicitly > promise that), etc. There is the following above pgaio_io_register_callbacks() * Note that callbacks are executed in critical sections. This is necessary * to be able to execute IO in critical sections (consider e.g. WAL * logging). To perform AIO we first need to acquire a handle, which, if there * are no free handles, requires waiting for IOs to complete and to execute * their completion callbacks. * * Callbacks may be executed in the issuing backend but also in another * backend (because that backend is waiting for the IO) or in IO workers (if * io_method=worker is used). And also a bunch of detail along struct PgAioHandleCallbacks. > 9. > > ``` > +Because acquisition of an IO handle > +[must always succeed](#io-can-be-started-in-critical-sections) > +and the number of AIO Handles > +[has to be limited](#state-for-aio-needs-to-live-in-shared-memory) > +AIO handles can be reused as soon as they have completed. > ``` > > What does pgaio_io_acquire() do if we are out of AIO Handles? Since it > always succeeds I guess it should block the caller in this case, but > IMO we should say this explicitly. That's documented above pgaio_io_acquire(). > 10. > > > > because I want to integrate some language that could be referenced by > > > smgrstartreadv() (and more in the future), as we have been talking about. > > > > I tried a bunch of variations and none of them seemed great. So I ended up > > with a lightly polished version of your suggested comment above > > smgrstartreadv(). We can later see about generalizing it. > > IMO the problem here is that the README doesn't show the code that does IO > per se, and thus doesn't give the full picture of how AIO should be > used. Perhaps instead of referencing smgrstartreadv() it would be > better to provide a simple but complete example, one that opens a > binary file and reads 512 bytes from it at a given offset, for > instance. IMO the example is already long enough; if you want that level of detail, you can just look at smgrstartreadv() etc. The idea about explaining it at that level is that that is basically what is required to use AIO in a new place, whereas implementing AIO for a new target, or a new IO operation, requires a bit more care. Greetings, Andres Freund
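Putting these answers together, the usage pattern the README describes looks roughly like this (a sketch assembled from the excerpts above; exact signatures may differ from the tree):

```
PgAioHandle *ioh;
PgAioReturn ioret;
Buffer      buffer;         /* the buffer this IO targets */

pgaio_enter_batchmode();

/* May block until a handle is free, but always succeeds eventually. */
ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);

/* Registers the *set* of callbacks (stage/complete/report) at once. */
pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);

/*
 * Handle data is copied by value (at most PG_IOV_MAX elements) and is
 * shared by all callbacks registered on this handle.
 */
pgaio_io_set_handle_data_32(ioh, (uint32 *) &buffer, 1);

/* ... stage the IO itself, e.g. via smgrstartreadv() ... */

pgaio_exit_batchmode();
```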
On Mon, Mar 31, 2025 at 08:41:39PM -0400, Andres Freund wrote: > updated version All non-write patches (1-7) are ready for commit, though I have some cosmetic recommendations below. I've marked the commitfest entry Ready for Committer. > + # Check a page validity error in another block, to ensure we report > + # the correct block number > + $psql_a->query_safe( > + qq( > +SELECT modify_rel_block('tbl_zero', 3, corrupt_header=>true); > +)); > + psql_like( > + $io_method, > + $psql_a, > + "$persistency: test zeroing of invalid block 3", > + qq(SELECT read_rel_block_ll('tbl_zero', 3, zero_on_error=>true);), > + qr/^$/, > + qr/^psql:<stdin>:\d+: WARNING: invalid page in block 3 of relation base\/.*\/.*; zeroing out page$/ > + ); > + > + > + # Check a page validity error in another block, to ensure we report > + # the correct block number This comment is a copy of the previous test's comment. While the comment is not false, consider changing it to: # Check one read reporting multiple invalid blocks. > + $psql_a->query_safe( > + qq( > +SELECT modify_rel_block('tbl_zero', 2, corrupt_header=>true); > +SELECT modify_rel_block('tbl_zero', 3, corrupt_header=>true); > +)); > + # First test error > + psql_like( > + $io_method, > + $psql_a, > + "$persistency: test reading of invalid block 2,3 in larger read", > + qq(SELECT read_rel_block_ll('tbl_zero', 1, nblocks=>4, zero_on_error=>false)), > + qr/^$/, > + qr/^psql:<stdin>:\d+: ERROR: 2 invalid pages among blocks 1..4 of relation base\/.*\/.*\nDETAIL: Block 2 held first invalid page\.\nHINT:[^\n]+$/ > + ); > + > + # Then test zeroing via ZERO_ON_ERROR flag > + psql_like( > + $io_method, > + $psql_a, > + "$persistency: test zeroing of invalid block 2,3 in larger read, ZERO_ON_ERROR", > + qq(SELECT read_rel_block_ll('tbl_zero', 1, nblocks=>4, zero_on_error=>true)), > + qr/^$/, > + qr/^psql:<stdin>:\d+: WARNING: zeroing out 2 invalid pages among blocks 1..4 of relation base\/.*\/.*\nDETAIL: Block 2 held first zeroed page\.\nHINT:[^\n]+$/ > + ); > + > + # Then test zeroing vio zero_damaged_pages s/vio/via/ > +# Verify checksum handling when creating database from an invalid database. > +# This also serves as a minimal check that cross-database IO is handled > +# reasonably. To me, "invalid database" is a term of art from the message "cannot connect to invalid database". Hence, I would change "invalid database" to "database w/ invalid block" or similar, here and below. (Alternatively, just delete "from an invalid database". It's clear from the context.) > + if (corrupt_checksum) > + { > + bool successfully_corrupted = 0; > + > + /* > + * Any single modification of the checksum could just end up being > + * valid again. To be sure > + */ Unfinished sentence. That said, I'm not following why we'd need this loop. If this test code were changing the input to the checksum, it's true that an input bit flip might reach the same pd_checksum. The test case is changing pd_checksum, not the input bits. I don't see how changing pd_checksum could leave the page still valid. There's only one valid pd_checksum value for a given input page. > + /* > + * The underlying IO actually completed OK, and thus the "invalid" > + * portion of the IOV actually contains valid data. That can hide > + * a lot of problems, e.g. if we were to wrongly mark a buffer, > + * that wasn't read according to the shortened-read, IO as valid, > + * the contents would look valid and we might miss a bug. Minimally s/read, IO/read IO,/ but I'd edit a bit further: * a lot of problems, e.g.
if we were to wrongly mark-valid a * buffer that wasn't read according to the shortened-read IO, the * contents would look valid and we might miss a bug. > Subject: [PATCH v2.15 05/18] md: Add comment & assert to buffer-zeroing path > in md[start]readv() > The zero_damaged_pages path is incomplete, as as missing segments are not s/as as/as/ > For now, put an Assert(false) comments documenting this choice into mdreadv() s/comments/and comments/ > + * For PG 18, we are putting an Assert(false) in into > + * mdreadv() (triggering failures in assertion-enabled builds, s/in into/in/ > Subject: [PATCH v2.15 06/18] aio: comment polishing > + * - Partial reads need to be handle by the caller re-issuing IO for the > + * unread blocks s/handle/handled/ > Subject: [PATCH v2.15 07/18] aio: Add errcontext for processing I/Os for > another backend
Hi, On 2025-04-01 08:11:59 -0700, Noah Misch wrote: > On Mon, Mar 31, 2025 at 08:41:39PM -0400, Andres Freund wrote: > > updated version > > All non-write patches (1-7) are ready for commit, though I have some cosmetic > recommendations below. I've marked the commitfest entry Ready for Committer. Thanks! I haven't yet pushed the changes, but will work on that in the afternoon. I plan to afterwards close the CF entry and will eventually create a new one for write support, although probably only rebasing onto https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m and addressing some of the locking issues. WRT the locking issues, I've been wondering whether we could make LWLockWaitForVar() work for that purpose, but I doubt it's the right approach. Probably better to get rid of the LWLock*Var functions and go for the approach I had in v1, namely a version of LWLockAcquire() with a callback that gets called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the lock acquisition to abort. > This comment is a copy of the previous test's comment. While the comment is > not false, consider changing it to: > > # Check one read reporting multiple invalid blocks. > > + # Then test zeroing vio zero_damaged_pages > > s/vio/via/ > These make sense. > > +# Verify checksum handling when creating database from an invalid database. > > +# This also serves as a minimal check that cross-database IO is handled > > +# reasonably. > > To me, "invalid database" is a term of art from the message "cannot connect to > invalid database". Hence, I would change "invalid database" to "database w/ > invalid block" or similar, here and below. (Alternatively, just delete "from > an invalid database". It's clear from the context.) Yea, I agree, this is easy to misunderstand when stepping back. I went for "with an invalid block". > > + if (corrupt_checksum) > > + { > > + bool successfully_corrupted = 0; > > + > > + /* > > + * Any single modification of the checksum could just end up being > > + * valid again. To be sure > > + */ > > Unfinished sentence. Oops. See below. > > That said, I'm not following why we'd need this loop. If this test code > were changing the input to the checksum, it's true that an input bit flip > might reach the same pd_checksum. The test case is changing pd_checksum, > not the input bits. We might be changing the input, due to the zero/corrupt_header options. Or we might be called on a page that is *already* corrupted. I did encounter that situation once while writing tests, where the tests only passed if I made the + 1 a + 2. Which was, uh, rather confusing and left me feeling like I was cursed that day. > I don't see how changing pd_checksum could leave the > page still valid. There's only one valid pd_checksum value for a given > input page. I updated the comment to: /* * Any single modification of the checksum could just end up being * valid again, due to e.g. corrupt_header changing the data in a way * that'd result in the "corrupted" checksum, or the checksum already * being invalid. Retry in that, unlikely, case. */ > > + /* > > + * The underlying IO actually completed OK, and thus the "invalid" > > + * portion of the IOV actually contains valid data. That can hide > > + * a lot of problems, e.g. if we were to wrongly mark a buffer, > > + * that wasn't read according to the shortened-read, IO as valid, > > + * the contents would look valid and we might miss a bug. > > Minimally s/read, IO/read IO,/ but I'd edit a bit further: > > * a lot of problems, e.g.
if we were to wrongly mark-valid a > * buffer that wasn't read according to the shortened-read IO, the > * contents would look valid and we might miss a bug. Adopted. > > Subject: [PATCH v2.15 05/18] md: Add comment & assert to buffer-zeroing path > > in md[start]readv() > > > The zero_damaged_pages path is incomplete, as as missing segments are not > > s/as as/as/ > > > For now, put an Assert(false) comments documenting this choice into mdreadv() > > s/comments/and comments/ > > > + * For PG 18, we are putting an Assert(false) in into > > + * mdreadv() (triggering failures in assertion-enabled builds, > > s/in into/in/ > > Subject: [PATCH v2.15 06/18] aio: comment polishing > > > + * - Partial reads need to be handle by the caller re-issuing IO for the > > + * unread blocks > > s/handle/handled/ All adopted. I'm sorry that you had to see so much of tiredness-enhanced dyslexia :(. Greetings, Andres Freund
On Tue, Apr 01, 2025 at 11:55:20AM -0400, Andres Freund wrote: > On 2025-04-01 08:11:59 -0700, Noah Misch wrote: > > On Mon, Mar 31, 2025 at 08:41:39PM -0400, Andres Freund wrote: > I haven't yet pushed the changes, but will work on that in the afternoon. > > I plan to afterwards close the CF entry and will eventually create a new one > for write support, although probably only rebasing onto > https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m > and addressing some of the locking issues. Sounds good. > WRT the locking issues, I've been wondering whether we could make > LWLockWaitForVar() work that purpose, but I doubt it's the right approach. > Probably better to get rid of the LWLock*Var functions and go for the approach > I had in v1, namely a version of LWLockAcquire() with a callback that gets > called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the > lock acquisition to abort. What are the best thing(s) to read to understand the locking issues? > > > +# Verify checksum handling when creating database from an invalid database. > > > +# This also serves as a minimal check that cross-database IO is handled > > > +# reasonably. > > > > To me, "invalid database" is a term of art from the message "cannot connect to > > invalid database". Hence, I would change "invalid database" to "database w/ > > invalid block" or similar, here and below. (Alternatively, just delete "from > > an invalid database". It's clear from the context.) > > Yea, I agree, this is easy to misunderstand when stepping back. I went for "with > an invalid block". Sounds good. > > > + if (corrupt_checksum) > > > + { > > > + bool successfully_corrupted = 0; > > > + > > > + /* > > > + * Any single modification of the checksum could just end up being > > > + * valid again. To be sure > > > + */ > > > > Unfinished sentence. > > Oops. See below. > > > > That said, I'm not following why we'd need this loop. If this test code > > were changing the input to the checksum, it's true that an input bit flip > > might reach the same pd_checksum. The test case is changing pd_checksum, > > not the input bits. > > We might be changing the input, due to the zero/corrupt_header options. Or we > might be called on a page that is *already* corrupted. I did encounter that > situation once while writing tests, where the tests only passed if I made the > + 1 a + 2. Which was, uh, rather confusing and left me feel like I was cursed > that day. Got it. > > I don't see how changing pd_checksum could leave the > > page still valid. There's only one valid pd_checksum value for a given > > input page. > > I updated the comment to: > /* > * Any single modification of the checksum could just end up being > * valid again, due to e.g. corrupt_header changing the data in a way > * that'd result in the "corrupted" checksum, or the checksum already > * being invalid. Retry in that, unlikely, case. > */ Works for me.
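For illustration, the retry loop that comment describes amounts to roughly the following (a sketch; the test module's actual code differs in detail, and page/blkno stand in for the test function's arguments):

```
PageHeader  ph = (PageHeader) page;
uint16      orig_checksum = ph->pd_checksum;
uint16      offset = 1;

for (;;)
{
    ph->pd_checksum = orig_checksum + offset;

    /*
     * If corrupt_header et al. changed the input so that this "corrupted"
     * value happens to be the valid checksum, or the page was already
     * corrupt to begin with, try the next offset.
     */
    if (ph->pd_checksum != pg_checksum_page((char *) page, blkno))
        break;
    offset++;
}
```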
Hi, On 2025-04-01 09:07:27 -0700, Noah Misch wrote: > On Tue, Apr 01, 2025 at 11:55:20AM -0400, Andres Freund wrote: > > WRT the locking issues, I've been wondering whether we could make > > LWLockWaitForVar() work for that purpose, but I doubt it's the right approach. > > Probably better to get rid of the LWLock*Var functions and go for the approach > > I had in v1, namely a version of LWLockAcquire() with a callback that gets > > called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the > > lock acquisition to abort. > > What are the best thing(s) to read to understand the locking issues? Unfortunately I think it's our discussion from a few days/weeks ago. The problem basically is that functions like LockBuffer(EXCLUSIVE) need to be able to non-racily a) wait for in-flight IOs b) acquire the content lock If you just do it naively like this: else if (mode == BUFFER_LOCK_EXCLUSIVE) { if (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS) WaitIO(buf); LWLockAcquire(content_lock, LW_EXCLUSIVE); } you obviously could have another backend start new IO between the WaitIO() and the LWLockAcquire(). If that other backend then doesn't consume the completion of that IO, the current backend could end up endlessly waiting for the IO. I don't see a way to avoid that with narrow changes just to LockBuffer(). We need some infrastructure that allows us to avoid that issue. One approach could be to integrate more tightly with lwlock.c. If 1) anyone starting IO were to wake up all waiters for the LWLock 2) The waiting side checked that there is no IO in progress *after* LWLockQueueSelf(), but before PGSemaphoreLock(), then the backend doing LockBuffer() would be guaranteed to have the chance to wait for the IO, rather than the lwlock. But there might be better approaches. I'm not really convinced that using generic lwlocks for buffer locking is the best idea. There's just too many special things about buffers. E.g. we have rather massive NUMA scalability issues due to the amount of lock traffic from buffer header and content lock atomic operations, particularly on things like the uppermost levels of a btree. I've played with ideas like super-pinning and locking btree root pages, which move all the overhead to the side that wants to exclusively lock such a page - but that doesn't really make sense for lwlocks in general. Greetings, Andres Freund
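To make the lwlock.c idea concrete, the shape would be roughly the following (very much a sketch; nothing like this exists yet, and LWLockAttemptLock()/LWLockQueueSelf()/LWLockDequeueSelf() are lwlock.c internals):

```
/*
 * Hypothetical LWLockAcquire() variant: abort_cb runs after queueing but
 * before sleeping, so a concurrently started IO cannot be missed.
 */
static bool
LWLockAcquireOrAbort(LWLock *lock, LWLockMode mode,
                     bool (*abort_cb) (void *arg), void *arg)
{
    for (;;)
    {
        /* LWLockAttemptLock() returns true if we would have to wait */
        if (!LWLockAttemptLock(lock, mode))
            return true;        /* got the lock */

        LWLockQueueSelf(lock, mode);

        /* e.g. check BM_IO_IN_PROGRESS; if set, bail out, WaitIO(), retry */
        if (abort_cb(arg))
        {
            LWLockDequeueSelf(lock);
            return false;
        }

        /* otherwise sleep and re-attempt, as LWLockAcquire() does */
        PGSemaphoreLock(MyProc->sem);
    }
}
```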
Hi, On 2025-04-01 11:55:20 -0400, Andres Freund wrote: > I haven't yet pushed the changes, but will work on that in the afternoon. There are three different types of failures in the test_aio test so far: 1) TEMP_CONFIG See https://postgr.es/m/zh5u22wbpcyfw2ddl3lsvmsxf4yvsrvgxqwwmfjddc4c2khsgp%40gfysyjsaelr5 2) Failure on at least some windows BF machines: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-04-01%2020%3A15%3A19 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-04-01%2019%3A03%3A07 Afaict the error is independent of AIO, instead just related to CREATE DATABASE ... STRATEGY wal_log failing on windows. In contrast to dropdb(), which does /* * Force a checkpoint to make sure the checkpointer has received the * message sent by ForgetDatabaseSyncRequests. */ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); /* Close all smgr fds in all backends. */ WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)); createdb_failure_callback() does no such thing. But it's rather likely that we, bgwriter, checkpointer (and now IO workers) have files open for the target database. Note that the test is failing even with "io_method=sync", which obviously doesn't use IO workers, so it's not related to that. It's probably not a good idea to blockingly request a checkpoint and a barrier inside a PG_TRY/PG_ENSURE_ERROR_CLEANUP() though, so this would need a bit more rearchitecting. I think I'm just going to make the test more lenient by not insisting that the error is the first thing on psql's stderr. 3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07 # +++ tap check in src/test/modules/test_aio +++ # Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr' # at t/001_aio.pl line 318. # 'psql:<stdin>:4: ERROR: starting batch while batch already in progress' # doesn't match '(?^:open AIO batch at end)' The problem is basically that the test intentionally forgets to exit batchmode - normally that would trigger an error at the end of the transaction, which the test verifies. However, with RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and erroring out because batchmode isn't allowed to be entered recursively.
#0 pgaio_enter_batchmode () at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/aio.c:997 #1 0x000055ec847959bf in read_stream_look_ahead (stream=0x55ecbcfda098) at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:438 #2 0x000055ec84796514 in read_stream_next_buffer (stream=0x55ecbcfda098, per_buffer_data=0x0) at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:890 #3 0x000055ec8432520b in heap_fetch_next_buffer (scan=0x55ecbcfd1c00, dir=ForwardScanDirection) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:679 #4 0x000055ec843259a4 in heapgettup_pagemode (scan=0x55ecbcfd1c00, dir=ForwardScanDirection, nkeys=1, key=0x55ecbcfd1620) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1041 #5 0x000055ec843263ba in heap_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1420 #6 0x000055ec8434ebe5 in table_scan_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18) at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1041 #7 0x000055ec8434f786 in systable_getnext (sysscan=0x55ecbcfd8088) at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:541 #8 0x000055ec849c784a in SearchCatCacheMiss (cache=0x55ecbcf81000, nkeys=1, hashValue=3830081846, hashIndex=2, v1=403, v2=0,v3=0, v4=0) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1543 #9 0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcf81000, nkeys=1, v1=403, v2=0, v3=0, v4=0) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464 #10 0x000055ec849c73ec in SearchCatCache1 (cache=0x55ecbcf81000, v1=403) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332 #11 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=2, key1=403) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228 #12 0x000055ec849d8c78 in RelationInitIndexAccessInfo (relation=0x7f6a85901c20) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1456 #13 0x000055ec849d8471 in RelationBuildDesc (targetRelId=2703, insertIt=true) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1201 #14 0x000055ec849d9e9c in RelationIdGetRelation (relationId=2703) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2100 #15 0x000055ec842d219f in relation_open (relationId=2703, lockmode=1) at ../../../../../home/andres/src/postgresql/src/backend/access/common/relation.c:58 #16 0x000055ec8435043c in index_open (relationId=2703, lockmode=1) at ../../../../../home/andres/src/postgresql/src/backend/access/index/indexam.c:137 #17 0x000055ec8434f2f9 in systable_beginscan (heapRelation=0x7f6a859353a8, indexId=2703, indexOK=true, snapshot=0x0, nkeys=1,key=0x7ffc11aa7c90) at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:400 #18 0x000055ec849c782c in SearchCatCacheMiss (cache=0x55ecbcfa0e80, nkeys=1, hashValue=2659955452, hashIndex=60, v1=2278,v2=0, v3=0, v4=0) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1533 #19 0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcfa0e80, nkeys=1, v1=2278, v2=0, v3=0, v4=0) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464 #20 0x000055ec849c73ec 
in SearchCatCache1 (cache=0x55ecbcfa0e80, v1=2278) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332 #21 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=82, key1=2278) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228 #22 0x000055ec849d0375 in getTypeOutputInfo (type=2278, typOutput=0x55ecbcfd15d0, typIsVarlena=0x55ecbcfd15d8) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/lsyscache.c:2995 #23 0x000055ec842d1a57 in printtup_prepare_info (myState=0x55ecbcfcec00, typeinfo=0x55ecbcfd0588, numAttrs=1) at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:277 #24 0x000055ec842d1ba6 in printtup (slot=0x55ecbcfd0b28, self=0x55ecbcfcec00) at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:315 #25 0x000055ec84541f54 in ExecutePlan (queryDesc=0x55ecbced4290, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55ecbcfcec00) at ../../../../../home/andres/src/postgresql/src/backend/executor/execMain.c:1814 I don't really have a good idea how to deal with that yet. Greetings, Andres
Hi, On 2025-04-01 17:47:51 -0400, Andres Freund wrote: > 3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07 > > # +++ tap check in src/test/modules/test_aio +++ > > # Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr' > # at t/001_aio.pl line 318. > # 'psql:<stdin>:4: ERROR: starting batch while batch already in progress' > # doesn't match '(?^:open AIO batch at end)' > > > The problem is basically that the test intentionally forgets to exit batchmode > - normally that would trigger an error at the end of the transaction, which > the test verifies. However, with RELCACHE_FORCE_RELEASE and > CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and > erroring out because batchmode isn't allowed to be entered recursively. > > > #0 pgaio_enter_batchmode () at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/aio.c:997 > #1 0x000055ec847959bf in read_stream_look_ahead (stream=0x55ecbcfda098) > at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:438 > #2 0x000055ec84796514 in read_stream_next_buffer (stream=0x55ecbcfda098, per_buffer_data=0x0) > at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:890 > #3 0x000055ec8432520b in heap_fetch_next_buffer (scan=0x55ecbcfd1c00, dir=ForwardScanDirection) > at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:679 > #4 0x000055ec843259a4 in heapgettup_pagemode (scan=0x55ecbcfd1c00, dir=ForwardScanDirection, nkeys=1, key=0x55ecbcfd1620) > at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1041 > #5 0x000055ec843263ba in heap_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18) > at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1420 > #6 0x000055ec8434ebe5 in table_scan_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18) > at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1041 > #7 0x000055ec8434f786 in systable_getnext (sysscan=0x55ecbcfd8088) at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:541 > #8 0x000055ec849c784a in SearchCatCacheMiss (cache=0x55ecbcf81000, nkeys=1, hashValue=3830081846, hashIndex=2, v1=403,v2=0, v3=0, v4=0) > at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1543 > #9 0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcf81000, nkeys=1, v1=403, v2=0, v3=0, v4=0) > at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464 > #10 0x000055ec849c73ec in SearchCatCache1 (cache=0x55ecbcf81000, v1=403) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332 > #11 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=2, key1=403) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228 > #12 0x000055ec849d8c78 in RelationInitIndexAccessInfo (relation=0x7f6a85901c20) > at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1456 > #13 0x000055ec849d8471 in RelationBuildDesc (targetRelId=2703, insertIt=true) > at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1201 > #14 0x000055ec849d9e9c in RelationIdGetRelation (relationId=2703) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2100 > #15 
0x000055ec842d219f in relation_open (relationId=2703, lockmode=1) at ../../../../../home/andres/src/postgresql/src/backend/access/common/relation.c:58 > #16 0x000055ec8435043c in index_open (relationId=2703, lockmode=1) at ../../../../../home/andres/src/postgresql/src/backend/access/index/indexam.c:137 > #17 0x000055ec8434f2f9 in systable_beginscan (heapRelation=0x7f6a859353a8, indexId=2703, indexOK=true, snapshot=0x0, nkeys=1,key=0x7ffc11aa7c90) > at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:400 > #18 0x000055ec849c782c in SearchCatCacheMiss (cache=0x55ecbcfa0e80, nkeys=1, hashValue=2659955452, hashIndex=60, v1=2278,v2=0, v3=0, v4=0) > at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1533 > #19 0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcfa0e80, nkeys=1, v1=2278, v2=0, v3=0, v4=0) > at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464 > #20 0x000055ec849c73ec in SearchCatCache1 (cache=0x55ecbcfa0e80, v1=2278) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332 > #21 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=82, key1=2278) at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228 > #22 0x000055ec849d0375 in getTypeOutputInfo (type=2278, typOutput=0x55ecbcfd15d0, typIsVarlena=0x55ecbcfd15d8) > at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/lsyscache.c:2995 > #23 0x000055ec842d1a57 in printtup_prepare_info (myState=0x55ecbcfcec00, typeinfo=0x55ecbcfd0588, numAttrs=1) > at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:277 > #24 0x000055ec842d1ba6 in printtup (slot=0x55ecbcfd0b28, self=0x55ecbcfcec00) > at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:315 > #25 0x000055ec84541f54 in ExecutePlan (queryDesc=0x55ecbced4290, operation=CMD_SELECT, sendTuples=true, numberTuples=0,direction=ForwardScanDirection, > dest=0x55ecbcfcec00) at ../../../../../home/andres/src/postgresql/src/backend/executor/execMain.c:1814 > > > I don't really have a good idea how to deal with that yet. Hm. Making the query something like SELECT * FROM (VALUES (NULL), (batch_start())); avoids the wrong output, because the type lookup happens for the first row already. But that's pretty magical and probably fragile. Greetings, Andres Freund
On Tue, Apr 01, 2025 at 06:25:28PM -0400, Andres Freund wrote: > On 2025-04-01 17:47:51 -0400, Andres Freund wrote: > > 3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07 > > > > # +++ tap check in src/test/modules/test_aio +++ > > > > # Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr' > > # at t/001_aio.pl line 318. > > # 'psql:<stdin>:4: ERROR: starting batch while batch already in progress' > > # doesn't match '(?^:open AIO batch at end)' > > > > > > The problem is basically that the test intentionally forgets to exit batchmode > > - normally that would trigger an error at the end of the transaction, which > > the test verifies. However, with RELCACHE_FORCE_RELEASE and > > CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and > > erroring out because batchmode isn't allowed to be entered recursively. > > I don't really have a good idea how to deal with that yet. > > Hm. Making the query something like > > SELECT * FROM (VALUES (NULL), (batch_start())); > > avoids the wrong output, because the type lookup happens for the first row > already. But that's pretty magical and probably fragile. Hmm. Some options: a. VALUES() trick above. For test code, it's hard to argue with something that seems to solve it in practice. b. Encapsulate the test in a PROCEDURE, so perhaps less happens between the batch_start() and the procedure-managed COMMIT. Maybe less fragile than (a), maybe more fragile. c. Move RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE to be GUC-controlled, like how CLOBBER_CACHE_ALWAYS changed into the debug_discard_caches GUC. Then disable them for relevant parts of test_aio. This feels best long-term, but it's bigger. I also wanted this in syscache-update-pruned.spec[1]. d. Have test_aio deduce whether these are set, probably by observing memory contexts or DEBUG messages. Maybe have every postmaster startup print a DEBUG message about these settings being enabled. Skip relevant parts of test_aio. This sounds messy. Each of those feels defensible to me. I'd probably do (a) or (b) to start. [1] For that spec, an alternative expected output sufficed. Incidentally, I'll soon fix that spec flaking on valgrind/skink.
Hi, I've pushed fixes for 1) and 2) and am working on 3). On 2025-04-01 17:13:24 -0700, Noah Misch wrote: > On Tue, Apr 01, 2025 at 06:25:28PM -0400, Andres Freund wrote: > > On 2025-04-01 17:47:51 -0400, Andres Freund wrote: > > > 3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined: > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07 > > > > > > # +++ tap check in src/test/modules/test_aio +++ > > > > > > # Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr' > > > # at t/001_aio.pl line 318. > > > # 'psql:<stdin>:4: ERROR: starting batch while batch already in progress' > > > # doesn't match '(?^:open AIO batch at end)' > > > > > > > > > The problem is basically that the test intentionally forgets to exit batchmode > > > - normally that would trigger an error at the end of the transaction, which > > > the test verifies. However, with RELCACHE_FORCE_RELEASE and > > > CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and > > > erroring out because batchmode isn't allowed to be entered recursively. > > > > I don't really have a good idea how to deal with that yet. > > > > Hm. Making the query something like > > > > SELECT * FROM (VALUES (NULL), (batch_start())); > > > > avoids the wrong output, because the type lookup happens for the first row > > already. But that's pretty magical and probably fragile. > > Hmm. Some options: > > a. VALUES() trick above. For test code, it's hard to argue with something > that seems to solve it in practice. I think I'll go for a slightly nicer version of that, namely "SELECT WHERE batch_start() IS NULL". I think that ends up the least verbose of the ideas we've been discussing. > c. Move RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE to be > GUC-controlled, like how CLOBBER_CACHE_ALWAYS changed into the > debug_discard_caches GUC. Then disable them for relevant parts of > test_aio. This feels best long-term, but it's bigger. I also wanted this > in syscache-update-pruned.spec[1]. Yea, that'd probably be a good thing medium-term. Greetings, Andres Freund
Hi.
On Wed, Apr 2, 2025 at 08:58, Andres Freund <andres@anarazel.de> wrote:
Hi,
I've pushed fixes for 1) and 2) and am working on 3).
Coverity has one report about this.
CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
13. uninit_use_in_call: Using uninitialized value result_one. Field result_one.result is uninitialized when calling pgaio_result_report. Below is not a fix, but a suggestion:
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1c37d7dfe2..b0f9ce452c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
else
result->status = PGAIO_RS_WARNING;
+ result->result = 0;
+
/*
* The encoding is complicated enough to warrant cross-checking it against
* the decode function.
@@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
/* Check for garbage data. */
if (!failed)
{
- PgAioResult result_one;
-
if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
failed_checksum))
{
@@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
*/
if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
{
+ PgAioResult result_one;
+
buffer_readv_encode_error(&result_one, is_temp,
*zeroed_buffer,
*ignored_checksum,
1. I couldn't find the correct value to initialize the *result* field.
2. result_one's scope can be reduced.
best regards,
Ranier Vilela
Hi, On 2025-04-01 17:47:51 -0400, Andres Freund wrote: > There are three different types of failures in the test_aio test so far: And a fourth, visible after I enabled liburing support for skink. https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-04-03%2007%3A06%3A19&stg=pg_upgrade-check (ignore the pg_upgrade and oauth failures, they're independent, I've raised them separately) 4a) 2025-04-03 10:58:32.978 UTC [2486740][client backend][3/6:0] LOG: short read injection point called, is enabled: 0 ==2486740== VALGRINDERROR-BEGIN ==2486740== Invalid read of size 2 ==2486740== at 0x59C8AC: PageIsNew (bufpage.h:237) ==2486740== by 0x59C8AC: PageIsVerified (bufpage.c:108) ==2486740== by 0x567870: buffer_readv_complete_one (bufmgr.c:6873) ==2486740== by 0x567870: buffer_readv_complete (bufmgr.c:6996) ==2486740== by 0x567870: shared_buffer_readv_complete (bufmgr.c:7153) ==2486740== by 0x55DDB2: pgaio_io_call_complete_shared (aio_callback.c:256) ==2486740== by 0x55D6F1: pgaio_io_process_completion (aio.c:512) ==2486740== by 0x55F53A: pgaio_uring_drain_locked (method_io_uring.c:370) ==2486740== by 0x55F7B8: pgaio_uring_wait_one (method_io_uring.c:449) ==2486740== by 0x55C702: pgaio_io_wait (aio.c:587) ==2486740== by 0x55C8B0: pgaio_wref_wait (aio.c:900) ==2486740== by 0x8639240: read_rel_block_ll (test_aio.c:440) ==2486740== by 0x3B915C: ExecInterpExpr (execExprInterp.c:953) ==2486740== by 0x3B4E4E: ExecInterpExprStillValid (execExprInterp.c:2299) ==2486740== by 0x3F7E97: ExecEvalExprNoReturn (executor.h:445) ==2486740== Address 0x8fa400e is in a rw- anonymous segment ==2486740== ==2486740== VALGRINDERROR-END The reason for this is that this test unpins the buffer (from the backend's view) before waiting for the IO. While the AIO subsystem holds a pin, UnpinBufferNoOwner() marked the buffer as inaccessible: /* * Mark buffer non-accessible to Valgrind. * * Note that the buffer may have already been marked non-accessible * within access method code that enforces that buffers are only * accessed while a buffer lock is held. */ VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ); I think to fix this we need to mark buffers as accessible around the PageIsVerified() call in buffer_readv_complete_one(), IFF they're not pinned by the backend. Unfortunately, this is complicated by the fact that local buffers do not have valgrind integration :(, so we should only do that for shared buffers, as otherwise the local buffer stays inaccessible the next time it is pinned.
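A sketch of that fix in buffer_readv_complete_one() (hand-waving the pin check; backend_has_pin is a hypothetical predicate, and the NOACCESS dance is skipped for temp buffers per the above):

```
bool        make_accessible = !is_temp && !backend_has_pin(buffer);

if (make_accessible)
    VALGRIND_MAKE_MEM_DEFINED(bufdata, BLCKSZ);

if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
                    failed_checksum))
{
    /* ... existing invalid-page handling ... */
}

if (make_accessible)
    VALGRIND_MAKE_MEM_NOACCESS(bufdata, BLCKSZ);
```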
4b) That's not all though, after getting past this failure, I see uninitialized memory errors for reads into temporary buffers: ==3334031== VALGRINDERROR-BEGIN ==3334031== Conditional jump or move depends on uninitialised value(s) ==3334031== at 0xD7C859: PageIsVerified (bufpage.c:108) ==3334031== by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876) ==3334031== by 0xD385D1: buffer_readv_complete (bufmgr.c:7002) ==3334031== by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210) ==3334031== by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306) ==3334031== by 0xD24720: pgaio_io_reclaim (aio.c:644) ==3334031== by 0xD24400: pgaio_io_process_completion (aio.c:521) ==3334031== by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382) ==3334031== by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461) ==3334031== by 0xD245E0: pgaio_io_wait (aio.c:587) ==3334031== by 0xD24FFE: pgaio_wref_wait (aio.c:900) ==3334031== by 0xD2F471: WaitReadBuffers (bufmgr.c:1695) ==3334031== by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898) ==3334031== by 0x8B4861: heap_fetch_next_buffer (heapam.c:654) ==3334031== by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016) ==3334031== by 0x8B594F: heap_getnextslot (heapam.c:1375) ==3334031== by 0xB28AA4: table_scan_getnextslot (tableam.h:1031) ==3334031== by 0xB29177: SeqNext (nodeSeqscan.c:81) ==3334031== by 0xB28F75: ExecScanFetch (execScan.h:126) ==3334031== by 0xB28FDD: ExecScanExtended (execScan.h:170) The reason for this one is, I think, that valgrind doesn't understand io_uring sufficiently. Which isn't surprising; io_uring's nature of an in-memory queue of commands is somewhat hard to intercept by tools like valgrind and rr. The best fix for that one would, I think, be to have method_io_uring.c iterate over the IOV and mark the relevant regions as defined? That does fix the issue at least and does seem to make sense? Not quite sure if we should mark the entire IOV as defined or just the portion that was actually read - the latter is additional fiddly code, and it's not clear it's likely to be helpful? 4c) Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after PageIsVerified() causes the *next* read into the same buffer in an IO worker to fail: ==3339904== Syscall param pread64(buf) points to unaddressable byte(s) ==3339904== at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64) ==3339904== by 0x5B3B6AC: __syscall_cancel (cancellation.c:75) ==3339904== by 0x5B93C83: pread (pread64.c:25) ==3339904== by 0xD274F4: pg_preadv (pg_iovec.h:56) ==3339904== by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137) ==3339904== by 0xD2A6D7: IoWorkerMain (method_worker.c:538) ==3339904== by 0xC91E26: postmaster_child_launch (launch_backend.c:290) ==3339904== by 0xC99594: StartChildProcess (postmaster.c:3972) ==3339904== by 0xC99EE3: maybe_adjust_io_workers (postmaster.c:4403) ==3339904== by 0xC958A8: PostmasterMain (postmaster.c:1381) ==3339904== by 0xB69622: main (main.c:227) ==3339904== Address 0x7f936d386000 is in a rw- anonymous segment Because, from the view of the IO worker, that memory is still marked NOACCESS, even if it has since been marked accessible in the backend. We could address this by conditioning the VALGRIND_MAKE_MEM_NOACCESS() on not being in an IO worker, but it seems better to instead explicitly mark the region accessible in the worker, before executing the IO. In a first hack, I did that in pgaio_io_perform_synchronously(), but that is likely too broad.
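In code, the fixes for 4b) and 4c) could be as small as the following (sketches; how iov/iovcnt are obtained from the handle is elided, and the client-request macros are no-ops in !USE_VALGRIND builds):

```
/* 4b) in method_io_uring.c, after io_uring reports a successful read: */
for (int i = 0; i < iovcnt; i++)
    VALGRIND_MAKE_MEM_DEFINED(iov[i].iov_base, iov[i].iov_len);

/*
 * 4c) in the IO worker, before issuing the preadv(): make the target
 * region addressable (but undefined), since the defining backend marked
 * it NOACCESS when ceding its pin.
 */
for (int i = 0; i < iovcnt; i++)
    VALGRIND_MAKE_MEM_UNDEFINED(iov[i].iov_base, iov[i].iov_len);
```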
I don't think the same scenario exists when IOs are executed synchronously in the foreground. Questions: 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't need to have special-case logic for that. But it also makes the change less localized and more "impactful"; who knows what kind of skullduggery we have been getting away with unnoticed. I haven't written the code up yet, but I don't think it'd be all that much code to add valgrind support to localbuf. 2) Any better ideas to handle the above issues than what I outlined? Greetings, Andres Freund
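For question 1), valgrind support in localbuf.c would presumably mirror what bufmgr.c already does for shared buffers, i.e. something like this in the local pin/unpin paths (a sketch of the idea, not a complete patch):

```
/* when a local buffer is first pinned: */
VALGRIND_MAKE_MEM_DEFINED(LocalBufHdrGetBlock(bufHdr), BLCKSZ);

/* when the last local pin is released: */
VALGRIND_MAKE_MEM_NOACCESS(LocalBufHdrGetBlock(bufHdr), BLCKSZ);
```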
Hi, On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote: > Em qua., 2 de abr. de 2025 às 08:58, Andres Freund <andres@anarazel.de> > escreveu: > > > Hi, > > > > I've pushed fixes for 1) and 2) and am working on 3). > > > Coverity has one report about this. > > CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT) > 13. uninit_use_in_call: Using uninitialized value result_one. Field > result_one.result is uninitialized when calling pgaio_result_report. Isn't this a rather silly thing to warn about for coverity? The field isn't used in pgaio_result_report(). It can't be a particularly rare thing to have struct fields that aren't always used? > Below not is a fix, but some suggestion: > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 1c37d7dfe2..b0f9ce452c 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result, > else > result->status = PGAIO_RS_WARNING; > > + result->result = 0; > + That'd be completely wrong - and the tests indeed fail if you do that. The read might succeed with a warning (e.g. due to zero_damaged_pages) in which case the result still carries important information about how many blocks were successfully read. > /* > * The encoding is complicated enough to warrant cross-checking it against > * the decode function. > @@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 > buf_off, Buffer buffer, > /* Check for garbage data. */ > if (!failed) > { > - PgAioResult result_one; > - > if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags, > failed_checksum)) > { > @@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 > buf_off, Buffer buffer, > */ > if (*buffer_invalid || *failed_checksum || *zeroed_buffer) > { > + PgAioResult result_one; > + > buffer_readv_encode_error(&result_one, is_temp, > *zeroed_buffer, > *ignored_checksum, > > > 1. I couldn't find the correct value to initialize the *result* field. It is not accessed in this path. I guess we can just zero-initialize result_one to shut up coverity. > 2. result_one can be reduced scope. True. Greetings, Andres Freund
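The zero-initialization amounts to a one-line change (sketch), which keeps Coverity quiet without changing behavior, since the field is never read on this path:

```
PgAioResult result_one = {0};
```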
On Thu, Apr 3, 2025 at 15:35, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote:
> On Wed, Apr 2, 2025 at 08:58, Andres Freund <andres@anarazel.de>
> wrote:
>
> > Hi,
> >
> > I've pushed fixes for 1) and 2) and am working on 3).
> >
> Coverity has one report about this.
>
> CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
> 13. uninit_use_in_call: Using uninitialized value result_one. Field
> result_one.result is uninitialized when calling pgaio_result_report.
Isn't this a rather silly thing to warn about for coverity?
Personally, I consider every warning to be important.
The field isn't
used in pgaio_result_report(). It can't be a particularly rare thing to have
struct fields that aren't always used?
Always considered a risk, someone may start using it.
> Below not is a fix, but some suggestion:
>
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 1c37d7dfe2..b0f9ce452c 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
> else
> result->status = PGAIO_RS_WARNING;
>
> + result->result = 0;
> +
That'd be completely wrong - and the tests indeed fail if you do that. The
read might succeed with a warning (e.g. due to zero_damaged_pages) in which
case the result still carries important information about how many blocks were
successfully read.
That's exactly why it's not a patch.
> /*
> * The encoding is complicated enough to warrant cross-checking it against
> * the decode function.
> @@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> buf_off, Buffer buffer,
> /* Check for garbage data. */
> if (!failed)
> {
> - PgAioResult result_one;
> -
> if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
> failed_checksum))
> {
> @@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> buf_off, Buffer buffer,
> */
> if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
> {
> + PgAioResult result_one;
> +
> buffer_readv_encode_error(&result_one, is_temp,
> *zeroed_buffer,
> *ignored_checksum,
>
>
> 1. I couldn't find the correct value to initialize the *result* field.
It is not accessed in this path. I guess we can just zero-initialize
result_one to shut up coverity.
Very good.
> 2. result_one can be reduced scope.
True.
Ok.
best regards,
Ranier Vilela
On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:
> 4b)
>
> That's not all though; after getting past this failure, I see uninitialized
> memory errors for reads into temporary buffers:
>
> ==3334031== VALGRINDERROR-BEGIN
> ==3334031== Conditional jump or move depends on uninitialised value(s)
> ==3334031==    at 0xD7C859: PageIsVerified (bufpage.c:108)
> ==3334031==    by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876)
> ==3334031==    by 0xD385D1: buffer_readv_complete (bufmgr.c:7002)
> ==3334031==    by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210)
> ==3334031==    by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306)
> ==3334031==    by 0xD24720: pgaio_io_reclaim (aio.c:644)
> ==3334031==    by 0xD24400: pgaio_io_process_completion (aio.c:521)
> ==3334031==    by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382)
> ==3334031==    by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461)
> ==3334031==    by 0xD245E0: pgaio_io_wait (aio.c:587)
> ==3334031==    by 0xD24FFE: pgaio_wref_wait (aio.c:900)
> ==3334031==    by 0xD2F471: WaitReadBuffers (bufmgr.c:1695)
> ==3334031==    by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898)
> ==3334031==    by 0x8B4861: heap_fetch_next_buffer (heapam.c:654)
> ==3334031==    by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016)
> ==3334031==    by 0x8B594F: heap_getnextslot (heapam.c:1375)
> ==3334031==    by 0xB28AA4: table_scan_getnextslot (tableam.h:1031)
> ==3334031==    by 0xB29177: SeqNext (nodeSeqscan.c:81)
> ==3334031==    by 0xB28F75: ExecScanFetch (execScan.h:126)
> ==3334031==    by 0xB28FDD: ExecScanExtended (execScan.h:170)
>
> The reason for this one is, I think, that valgrind doesn't understand io_uring
> sufficiently. Which isn't surprising: io_uring's nature as an in-memory queue
> of commands is rather hard for tools like valgrind and rr to intercept.
>
> The best fix for that one would, I think, be to have method_io_uring.c iterate
> over the IOV and mark the relevant regions as defined. That does fix the
> issue at least, and does seem to make sense?

Makes sense. Valgrind knows that read() makes its target bytes "defined". It probably doesn't have an io_uring equivalent for that.

I expect we only need this for local buffers, and it's unclear to me how the fix for (4a) didn't fix this. Before bufmgr Valgrind integration (1e0dfd1 of 2020-07) there was no explicit handling of shared_buffers. I suspect that worked because the initial mmap() of shared memory was considered "defined" (zeros), and steps like PageAddItem() copy only defined bytes into buffers. Hence, shared_buffers remained defined without explicit Valgrind client requests.

This example uses local buffers. Storage for those comes from MemoryContextAlloc() in GetLocalBufferStorage(). That memory starts undefined, but it becomes defined at PageInit() or read(). Hence, I expected the fix for (4a) to make the buffer defined after io_uring read. What makes the outcome different?

In the general case, we could want client requests as follows:

- If completor==definer and has not dropped pin:
  - Make DEFINED before verifying page. That's all. It might be cleaner to do
    this when first retrieving a return value from io_uring, since this just
    makes up for what Valgrind already does for readv().

- If completor!=definer or has dropped pin:
  - Make NOACCESS in definer when definer cedes its own pin.
  - For io_method=worker, make UNDEFINED before starting readv(). It might be
    cleanest to do this when the worker first acts as the owner of the AIO
    subsystem pin, if that's a clear moment earlier than readv().
  - Make DEFINED in completor before verifying page. It might be cleaner to do
    this when the completor first retrieves a return value from io_uring, since
    this just makes up for what Valgrind already does for readv().
  - Make NOACCESS in completor after verifying page. Similarly, it might be
    cleaner to do this when the completor releases the AIO subsystem pin.

> Not quite sure if we should mark
> the entire IOV as defined or just the portion that was actually read - the
> latter is additional fiddly code, and it's not clear it's likely to be helpful?

Seems fine to do it the simpler way if that saves fiddly code.

> 4c)
>
> Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after
> PageIsVerified() causes the *next* read into the same buffer in an IO worker
> to fail:
>
> ==3339904== Syscall param pread64(buf) points to unaddressable byte(s)
> ==3339904==    at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64)
> ==3339904==    by 0x5B3B6AC: __syscall_cancel (cancellation.c:75)
> ==3339904==    by 0x5B93C83: pread (pread64.c:25)
> ==3339904==    by 0xD274F4: pg_preadv (pg_iovec.h:56)
> ==3339904==    by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137)
> ==3339904==    by 0xD2A6D7: IoWorkerMain (method_worker.c:538)
> ==3339904==    by 0xC91E26: postmaster_child_launch (launch_backend.c:290)
> ==3339904==    by 0xC99594: StartChildProcess (postmaster.c:3972)
> ==3339904==    by 0xC99EE3: maybe_adjust_io_workers (postmaster.c:4403)
> ==3339904==    by 0xC958A8: PostmasterMain (postmaster.c:1381)
> ==3339904==    by 0xB69622: main (main.c:227)
> ==3339904==  Address 0x7f936d386000 is in a rw- anonymous segment
>
> This fails because, from the point of view of the IO worker, that memory is
> still marked NOACCESS, even if it has since been marked accessible in the
> backend.
>
> We could address this by conditioning the VALGRIND_MAKE_MEM_NOACCESS() on not
> being in an IO worker, but it seems better to instead explicitly mark the
> region accessible in the worker, before executing the IO.

Sounds good. Since the definer gave the AIO subsystem a pin on the worker's behalf, it's like the worker is doing an implicit pin and explicit unpin.

> In a first hack, I did that in pgaio_io_perform_synchronously(), but that is
> likely too broad. I don't think the same scenario exists when IOs are
> executed synchronously in the foreground.
>
> Questions:
>
> 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't
> need to have special-case logic for that. But it also makes the change less
> localized and more "impactful" - who knows what kind of skullduggery we have
> been getting away with unnoticed.
>
> I haven't written the code up yet, but I don't think it'd be all that much
> code to add valgrind support to localbuf.

It would be the right thing long-term, and it's not a big deal if it causes some false positives initially. So if you're leaning that way, that's good.

> 2) Any better ideas to handle the above issues than what I outlined?

Not here, unless the discussion under (4b) differs usefully from what you planned.
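Read as Valgrind client requests tied to moments in the IO's life, the protocol above could look like the following. This is illustrative only: the stub takes the buffer's block directly, and none of these calls correspond to actual bufmgr/aio functions.

#include <stdbool.h>
#include <stddef.h>
#include <valgrind/memcheck.h>

/* Illustrative walk through the client requests for one buffer's block. */
static void
valgrind_protocol_sketch(char *bufdata, size_t blcksz, bool same_backend)
{
	if (same_backend)
	{
		/*
		 * completor == definer, pin never dropped: one request suffices,
		 * issued before PageIsVerified().
		 */
		VALGRIND_MAKE_MEM_DEFINED(bufdata, blcksz);
		return;
	}

	/* definer cedes its own pin */
	VALGRIND_MAKE_MEM_NOACCESS(bufdata, blcksz);
	/* io_method=worker: before starting readv() */
	VALGRIND_MAKE_MEM_UNDEFINED(bufdata, blcksz);
	/* completor: before verifying the page */
	VALGRIND_MAKE_MEM_DEFINED(bufdata, blcksz);
	/* completor: after verifying the page */
	VALGRIND_MAKE_MEM_NOACCESS(bufdata, blcksz);
}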
Hi,

On 2025-04-03 16:16:50 -0300, Ranier Vilela wrote:
> On Thu, Apr 3, 2025 at 3:35 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote:
> > > On Wed, Apr 2, 2025 at 8:58 AM Andres Freund <andres@anarazel.de>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I've pushed fixes for 1) and 2) and am working on 3).
> > > >
> > > Coverity has one report about this.
> > >
> > > CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
> > > 13. uninit_use_in_call: Using uninitialized value result_one. Field
> > > result_one.result is uninitialized when calling pgaio_result_report.
> >
> > Isn't this a rather silly thing to warn about for coverity?
>
> Personally, I consider every warning to be important.

If the warning is wrong, then it's not helpful. Warning quality really matters.

Zero-initializing everything *REDUCES* what static analysis and sanitizers can do. The analyzer/sanitizer can't tell that you just silenced a warning by zero-initializing something that shouldn't be accessed. If there later is an access, the zero is probably the wrong value, but no tool can tell you, because you did initialize it after all.

> > The field isn't
> > used in pgaio_result_report(). It can't be a particularly rare thing to have
> > struct fields that aren't always used?
>
> I've always considered that a risk - someone may start using it.

That makes it worse! E.g. valgrind won't raise errors about it anymore.

Greetings,

Andres Freund
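A contrived illustration of that point, not from the patchset: with the zero-initialization in place, Memcheck has nothing to report when the "unused" field later gets read, even though zero was never a meaningful value.

#include <stdio.h>

typedef struct
{
	int		status;
	int		result;			/* only meaningful for some statuses */
} DemoResult;

int
main(void)
{
	DemoResult	r = {0};	/* zero-init purely to silence a UNINIT warning */

	r.status = 1;

	/*
	 * Someone later starts consuming r.result. Without the {0} above,
	 * valgrind would flag this branch as depending on uninitialised
	 * memory; with it, the (probably wrong) zero sails through silently.
	 */
	if (r.result > 0)
		printf("read %d blocks\n", r.result);

	return 0;
}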
Hi,

Sorry for the slow work on this. The cycle times are humongous due to valgrind being so slow...

On 2025-04-03 12:40:23 -0700, Noah Misch wrote:
> On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:
> > The best fix for that one would, I think, be to have method_io_uring.c iterate
> > over the IOV and mark the relevant regions as defined. That does fix the
> > issue at least, and does seem to make sense?
>
> Makes sense. Valgrind knows that read() makes its target bytes "defined". It
> probably doesn't have an io_uring equivalent for that.

Correct - and I think it would be nontrivial to add, because there's no easy syscall to intercept...

> I expect we only need this for local buffers, and it's unclear to me how the
> fix for (4a) didn't fix this.

At that time I didn't apply the fix from 4a) to local buffers, because local buffers, in HEAD, don't have the valgrind integration. Without that, marking the buffer as NOACCESS would cause all sorts of issues, because it'd be considered inaccessible even after pinning. As you analyzed, the buffer then ends up considered undefined, due to the MemoryContextAlloc().

> In the general case, we could want client requests as follows:
>
> - If completor==definer and has not dropped pin:
>   - Make DEFINED before verifying page. That's all. It might be cleaner to do
>     this when first retrieving a return value from io_uring, since this just
>     makes up for what Valgrind already does for readv().

Yea, I think it's better to do that in io_uring. It's what I have done in the attached.

> - If completor!=definer or has dropped pin:
>   - Make NOACCESS in definer when definer cedes its own pin.

That's the current behaviour for shared buffers, right?

>   - For io_method=worker, make UNDEFINED before starting readv(). It might be
>     cleanest to do this when the worker first acts as the owner of the AIO
>     subsystem pin, if that's a clear moment earlier than readv().

Hm, what do we need this for?

>   - Make DEFINED in completor before verifying page. It might be cleaner to do
>     this when the completor first retrieves a return value from io_uring, since
>     this just makes up for what Valgrind already does for readv().

I think we can't rely on the marking done while retrieving the result from io_uring, as that might have happened in a different backend for a temp buffer. That'd only happen if we got io_uring events for *another* IO that involved a shared rel, but it can happen.

> > Not quite sure if we should mark
> > the entire IOV as defined or just the portion that was actually read - the
> > latter is additional fiddly code, and it's not clear it's likely to be helpful?
>
> Seems fine to do it the simpler way if that saves fiddly code.

Can't quite decide, it's just at the border of what I consider too fiddly... See the change to method_io_uring.c in the attached patch.

> > Questions:
> >
> > 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't
> > need to have special-case logic for that. But it also makes the change less
> > localized and more "impactful" - who knows what kind of skullduggery we have
> > been getting away with unnoticed.
> >
> > I haven't written the code up yet, but I don't think it'd be all that much
> > code to add valgrind support to localbuf.
>
> It would be the right thing long-term, and it's not a big deal if it causes
> some false positives initially. So if you're leaning that way, that's good.

It was easy enough. I saw one related failure: FlushRelationBuffers() didn't pin temporary buffers before flushing them. Pinning the buffers fixed that.

I don't think it's a real problem to not pin the local buffer during FlushRelationBuffers(), at least not today. But it seems unnecessarily odd to not pin it.

I wish valgrind had a way to mark the buffer as inaccessible and then accessible again, without losing the defined-ness information...

Greetings,

Andres Freund
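A sketch of what the localbuf.c side of that might look like, mirroring bufmgr.c's existing treatment of shared buffers; the stub functions stand in for the real pin/unpin paths, while LocalBufHdrGetBlock() and BLCKSZ are used as in localbuf.c:

#include <valgrind/memcheck.h>

/* On pin (once the buffer actually has storage), the block is fair game. */
static void
pin_local_buffer_sketch(BufferDesc *buf_hdr)
{
	VALGRIND_MAKE_MEM_DEFINED(LocalBufHdrGetBlock(buf_hdr), BLCKSZ);
}

/* After the last local pin is released, catch stray accesses. */
static void
unpin_local_buffer_sketch(BufferDesc *buf_hdr)
{
	VALGRIND_MAKE_MEM_NOACCESS(LocalBufHdrGetBlock(buf_hdr), BLCKSZ);
}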
On Fri, Apr 04, 2025 at 03:16:18PM -0400, Andres Freund wrote:
> On 2025-04-03 12:40:23 -0700, Noah Misch wrote:
> > On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:
> > In the general case, we could want client requests as follows:
> >
> > - If completor==definer and has not dropped pin:
> >   - Make DEFINED before verifying page. That's all. It might be cleaner to do
> >     this when first retrieving a return value from io_uring, since this just
> >     makes up for what Valgrind already does for readv().
>
> Yea, I think it's better to do that in io_uring. It's what I have done in the
> attached.
>
> > - If completor!=definer or has dropped pin:
> >   - Make NOACCESS in definer when definer cedes its own pin.
>
> That's the current behaviour for shared buffers, right?

Yes.

> >   - For io_method=worker, make UNDEFINED before starting readv(). It might be
> >     cleanest to do this when the worker first acts as the owner of the AIO
> >     subsystem pin, if that's a clear moment earlier than readv().
>
> Hm, what do we need this for?

At the time, we likely didn't need it:

- If the worker does its own PinBuffer*()+unpin, we don't need it. Those
  functions do the Valgrind client requests.

- If the worker relies on the AIO-subsystem-owned pin and does neither regular
  pin nor regular unpin, we don't need it. Buffers are always "defined".

- If the worker relies on the AIO-subsystem-owned pin to skip PinBuffer*() but
  uses regular unpin code, then the buffer may be NOACCESS. Then one would need
  this. But this would be questionable for other reasons.

Your proposed change to set NOACCESS in buffer_readv_complete_one() interacts with things further, making the UNDEFINED necessary.

> >   - Make DEFINED in completor before verifying page. It might be cleaner to do
> >     this when the completor first retrieves a return value from io_uring, since
> >     this just makes up for what Valgrind already does for readv().
>
> I think we can't rely on the marking done while retrieving the result from
> io_uring, as that might have happened in a different backend for a temp
> buffer. That'd only happen if we got io_uring events for *another* IO that
> involved a shared rel, but it can happen.

Good point. I think the VALGRIND_MAKE_MEM_DEFINED() in pgaio_uring_drain_locked() isn't currently needed at all. If completor-subxact==definer-subxact, PinBuffer() already did what Valgrind needs. Otherwise, buffer_readv_complete_one() does what Valgrind needs.

If that's right, it would still be nice to reach the right VALGRIND_MAKE_MEM_DEFINED() without involving bufmgr. That helps future, non-bufmgr AIO use cases. It's tricky to pick the right place for that VALGRIND_MAKE_MEM_DEFINED():

- pgaio_uring_drain_locked() is problematic, I think. In the localbuf case,
  the iovec base address is relevant only in the ioh-defining process. In the
  shmem completor!=definer case, this runs only in the completor.

- A complete_local callback solves those problems. However, if the
  AIO-defining subxact aborted, then we shouldn't set DEFINED at all, since
  the buffer mapping may have changed by the time of complete_local.

- Putting it in the place that would call pgaio_result_report(ERROR) if
  needed, e.g. ProcessReadBuffersResult(), solves the problem of the buffer
  mapping having moved. ProcessReadBuffersResult() doesn't even need this,
  since PinBuffer() already did it. Each future AIO use case will have a
  counterpart of ProcessReadBuffersResult() that consumes the result and
  proceeds with tasks that depend on the AIO. That's the place. Is that right?

I got this wrong a few times while trying to think through it, so I'm not too confident in the above.

> Subject: [PATCH v1 1/3] localbuf: Add Valgrind buffer access instrumentation

Ready for commit

> Subject: [PATCH v1 2/3] aio: Make AIO compatible with valgrind

See above about pgaio_uring_drain_locked().

> related code until it is pinned bu "user" code again. But it requires some

s/bu/by/

> + * Return the iovecand its length. Currently only expected to be used by

s/iovecand/iovec and/

> @@ -361,13 +405,16 @@ pgaio_uring_drain_locked(PgAioUringContext *context)
> for (int i = 0; i < ncqes; i++)
> {
> struct io_uring_cqe *cqe = cqes[i];
> + int32 res;
> PgAioHandle *ioh;
>
> ioh = io_uring_cqe_get_data(cqe);
> errcallback.arg = ioh;
> + res = cqe->res;
> +
> io_uring_cqe_seen(&context->io_uring_ring, cqe);
>
> - pgaio_io_process_completion(ioh, cqe->res);
> + pgaio_uring_io_process_completion(ioh, res);

I guess this is a distinct cleanup, done to avoid any suspicion of cqe being reused asynchronously after io_uring_cqe_seen(). Is that right?

> Subject: [PATCH v1 3/3] aio: Avoid spurious coverity warning

Ready for commit
Hi,

On 2025-04-04 14:18:02 -0700, Noah Misch wrote:
> On Fri, Apr 04, 2025 at 03:16:18PM -0400, Andres Freund wrote:
> > >   - Make DEFINED in completor before verifying page. It might be cleaner to do
> > >     this when the completor first retrieves a return value from io_uring, since
> > >     this just makes up for what Valgrind already does for readv().
> >
> > I think we can't rely on the marking done while retrieving the result from
> > io_uring, as that might have happened in a different backend for a temp
> > buffer. That'd only happen if we got io_uring events for *another* IO that
> > involved a shared rel, but it can happen.
>
> Good point. I think the VALGRIND_MAKE_MEM_DEFINED() in
> pgaio_uring_drain_locked() isn't currently needed at all. If
> completor-subxact==definer-subxact, PinBuffer() already did what Valgrind
> needs. Otherwise, buffer_readv_complete_one() does what Valgrind needs.

We did need it - but only because I bungled something in the earlier patch to add valgrind support.

The problem is that in PinLocalBuffer() there may not actually be any storage allocated for the buffer yet, so VALGRIND_MAKE_MEM_DEFINED() doesn't work. On the first use of the buffer the allocation happens a bit later, in GetLocalVictimBuffer(), namely during the call to GetLocalBufferStorage().

Not quite sure yet how to best deal with it. Moving the PinLocalBuffer() slightly later in GetLocalVictimBuffer() fixes the issue, but also doesn't really seem great.

> If that's right, it would still be nice to reach the right
> VALGRIND_MAKE_MEM_DEFINED() without involving bufmgr.

I think that would be possible if we didn't do VALGRIND_MAKE_MEM_NOACCESS() in UnpinBuffer()/UnpinLocalBuffer(). But with that, I don't see how we can avoid needing to re-mark the region as accessible?

> That helps future, non-bufmgr AIO use cases. It's tricky to pick the right
> place for that VALGRIND_MAKE_MEM_DEFINED():
>
> - pgaio_uring_drain_locked() is problematic, I think. In the localbuf case,
>   the iovec base address is relevant only in the ioh-defining process. In the
>   shmem completor!=definer case, this runs only in the completor.

You're right :(

> - A complete_local callback solves those problems. However, if the
>   AIO-defining subxact aborted, then we shouldn't set DEFINED at all, since
>   the buffer mapping may have changed by the time of complete_local.

I don't think that is possible, due to the aio subsystem owned pin?

> - Putting it in the place that would call pgaio_result_report(ERROR) if
>   needed, e.g. ProcessReadBuffersResult(), solves the problem of the buffer
>   mapping having moved. ProcessReadBuffersResult() doesn't even need this,
>   since PinBuffer() already did it. Each future AIO use case will have a
>   counterpart of ProcessReadBuffersResult() that consumes the result and
>   proceeds with tasks that depend on the AIO. That's the place.

I don't really follow - at the point something like ProcessReadBuffersResult() gets involved, we'll already have done the accesses that needed the memory to be accessible and defined?

I think the point about non-aio uses is a fair one, but I don't quite know how to best solve it right now, due to the local buffer issue you mentioned. I'd guess that we'd best put it somewhere like:

a) in pgaio_io_process_completion(), if definer==completor || !PGAIO_HF_REFERENCES_LOCAL

b) just before calling pgaio_io_call_complete_local(), if PGAIO_HF_REFERENCES_LOCAL

> > related code until it is pinned bu "user" code again. But it requires some
>
> s/bu/by/
>
> > + * Return the iovecand its length. Currently only expected to be used by
>
> s/iovecand/iovec and/

Fixed.

> > @@ -361,13 +405,16 @@ pgaio_uring_drain_locked(PgAioUringContext *context)
> > for (int i = 0; i < ncqes; i++)
> > {
> > struct io_uring_cqe *cqe = cqes[i];
> > + int32 res;
> > PgAioHandle *ioh;
> >
> > ioh = io_uring_cqe_get_data(cqe);
> > errcallback.arg = ioh;
> > + res = cqe->res;
> > +
> > io_uring_cqe_seen(&context->io_uring_ring, cqe);
> >
> > - pgaio_io_process_completion(ioh, cqe->res);
> > + pgaio_uring_io_process_completion(ioh, res);
>
> I guess this is a distinct cleanup, done to avoid any suspicion of cqe being
> reused asynchronously after io_uring_cqe_seen(). Is that right?

I don't think there is any such danger - there's no background thing processing entries on the ring (if there were, the ring would get corrupted) - but it seemed cleaner to do it that way when I introduced pgaio_uring_io_process_completion().

> > Subject: [PATCH v1 3/3] aio: Avoid spurious coverity warning
>
> Ready for commit

Thanks!

Greetings,

Andres Freund
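Spelling out the a)/b) placement from the last mail as a sketch: PGAIO_HF_REFERENCES_LOCAL is the flag named above, mark_handle_iovec_defined() is assumed to be a handle-taking wrapper around the iovec-marking helper sketched earlier, and the two predicate helpers are hypothetical stand-ins, not real aio.c functions.

/* a) in pgaio_io_process_completion(): */
static void
completion_valgrind_sketch(PgAioHandle *ioh)
{
	/* definer==completor, or no local buffers are referenced */
	if (ioh_definer_is_completor(ioh) || !ioh_references_local(ioh))
		mark_handle_iovec_defined(ioh);
}

/* b) in the defining backend, just before pgaio_io_call_complete_local(): */
static void
complete_local_valgrind_sketch(PgAioHandle *ioh)
{
	if (ioh_references_local(ioh))
		mark_handle_iovec_defined(ioh);
}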