Re: AIO v2.4

From: Andres Freund
Hi,

Attached is v2.4 of the AIO patchset.

Changes:

- Introduced "batchmode"; while not in batchmode, IOs get submitted immediately.

  Thomas didn't like how this worked previously, and while this was a
  surprisingly large amount of work, I agree that it looks better now.

  I vacillated a bunch on the naming. For now it's:

  extern void pgaio_enter_batchmode(void);
  extern void pgaio_exit_batchmode(void);

  I did adjust the README and wrote a reasonably long comment above
  pgaio_enter_batchmode():

https://github.com/anarazel/postgres/blob/a324870186ddff9a31b10472b790eb4e744c40b3/src/backend/storage/aio/aio.c#L931-L960
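
  To illustrate the shape of the API, a minimal usage sketch (hypothetical
  caller code, not from the patch; pgaio_submit_staged() is my assumed name
  for the submit call):

    pgaio_enter_batchmode();
    for (int i = 0; i < nios; i++)
        stage_one_io(i);            /* hypothetical: stages an IO, no submit */
    pgaio_submit_staged();          /* submit the whole batch at once */
    pgaio_exit_batchmode();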


- Batchmode needs to be exited in case of errors; for that:

  - a new pgaio_after_error() call has been added to all the relevant places

  - xact.c calls to aio have been (re-)added to check that there are no
    in-progress batches / unsubmitted IOs at the end of a transaction.

    Before that I had just removed at-eoxact "callbacks" :)

    This checking has holes though:
    https://postgr.es/m/upkkyhyuv6ultnejrutqcu657atw22kluh4lt2oidzxxtjqux3%40a4hdzamh4wzo

    Because this only means that we will fail to detect some buggy code, rather
    than misbehave for correct code, I think this may be ok for now.
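
    As a rough sketch (pgaio_after_error() is real, everything else below is
    illustrative):

      /* in the generic error-recovery paths: */
      pgaio_after_error();  /* exits batchmode if we errored out inside one */

      /* while xact.c now roughly checks, at end of xact: */
      Assert(!pgaio_in_batchmode());    /* hypothetical check function */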


- Renamed aio_init.h to aio_subsys.h

  The newly added pgaio_after_error() calls would have required including
  aio.h in a good bit more places that won't themselves issue AIO. That seemed
  wrong.  There already was an aio_init.h to avoid needing to include aio.h in
  places like ipci.c, but it seemed wrong to put pgaio_after_error() in
  aio_init.h.  So I renamed it to aio_subsys.h - not sure that's the best
  name, but I can live with it.


- Now that Thomas submitted the necessary read_stream.c improvements, the
  prior big TODO about one StartReadBuffers() call needing to start many IOs
  has been addressed.

  Thomas' thread: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com

  For now I've also included Thomas' patches in my queue, but they should get
  pushed independently.  Review comments specific to those patches probably
  are better put on the other thread.

  Thomas' patches also fix several issues that were addressed in my WIP
  adjustments to read_stream.c.  There are a few left, but it does look
  better.

  The included commits are 0003-0008.


- I rewrote the tests into a TAP test. That was exceedingly painful, partially
  due to TAP infrastructure bugs on Windows that would sometimes cause
  inscrutable failures, see
  https://www.postgresql.org/message-id/wmovm6xcbwh7twdtymxuboaoarbvwj2haasd3sikzlb3dkgz76%40n45rzycluzft

  I just pushed that fix earlier today.


- Added docs for new GUCs, moved them to a more appropriate section

  See also https://postgr.es/m/x3tlw2jk5gm3r3mv47hwrshffyw7halpczkfbk3peksxds7bvc%40lguk43z3bsyq


- If IO workers fail to reopen the file for an IO, the IO is now marked as
  failed. Previously we'd just hang.

  To test this I added an injection point that triggers the failure. I don't
  know how else this could be tested.
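
  For reference, the shape of that test hook (the point name here is made up;
  the attach call is from the injection_points test module):

    /* in the IO worker, where it reopens the file for an IO: */
    INJECTION_POINT("aio-worker-fail-reopen");

    /*
     * A test can then arm the point so that the reopen fails and the IO is
     * marked as failed:
     *
     *   SELECT injection_points_attach('aio-worker-fail-reopen', 'error');
     */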


- Added liburing dependency build documentation


- Added a check hook to ensure io_max_concurrency isn't set to 0 (-1 means
  auto-configure)


- Fixed an issue where, with io_method=sync, we'd issue fadvise calls when not
  appropriate; that was a consequence of my hacky read_stream.c changes.


- Renamed some of the aio<->bufmgr.c interface functions. I don't think
  they're quite perfect, but they're in later patches, so I don't want to
  focus too much on them right now.


- Comment improvements etc.


- Got rid of an unused wait event and renamed other wait events to make more
  sense.


- Previously the injection points were added as part of the test patch, I now
  moved them into the commits adding the code being tested. Was too annoying
  to edit otherwise.


Todo:

- there are a decent number of FIXMEs in later commits, related to
  ereport(LOG)s needing relpath() while in a critical section.  I did propose
  a solution to that yesterday:

  https://postgr.es/m/h3a7ftrxypgxbw6ukcrrkspjon5dlninedwb5udkrase3rgqvn%403cokde6btlrl

- A few more corner case tests for the interaction of multiple backends trying
  to do IO on overlapping buffers would be good.

- Our temp table test coverage is atrociously bad


Questions:

- The test module requires StartBufferIO() to be visible outside of bufmgr.c -
  I think that's ok, would be good to know if others agree.


I'm planning to push the first two commits soon, I think they're ok on their
own, even if nothing else were to go in.


Greetings,

Andres Freund


Re: AIO v2.4

From: Andres Freund
Hi,

On 2025-02-19 14:10:44 -0500, Andres Freund wrote:
> I'm planning to push the first two commits soon, I think they're ok on their
> own, even if nothing else were to go in.

I did that for the lwlock patch.

But I think I might not do the same for the "Ensure a resowner exists for all
paths that may perform AIO" patch. The paths for which we are missing
resowners all concern WAL writes - but it'll be a while before we get
AIO WAL writes.

It'd be fairly harmless to make this change now, but I found the justifying
code comments hard to phrase. E.g.:

--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -361,8 +361,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
     BaseInit();
 
     bootstrap_signals();
+
+    /* need a resowner for IO during BootStrapXLOG() */
+    CreateAuxProcessResourceOwner();
+
     BootStrapXLOG(bootstrap_data_checksum_version);
 
+    ReleaseAuxProcessResources(true);
+    CurrentResourceOwner = NULL;
+
     /*
      * To ensure that src/common/link-canary.c is linked into the backend, we
      * must call it from somewhere.  Here is as good as anywhere.

Given that there's no use of resowners inside BootStrapXLOG() today, and
won't be for the next few months, such a comment seems confusing?


Greetings,

Andres Freund



AIO v2.5

From: Andres Freund
Hi,

Attached is v2.5 of the AIO patchset.

Relative to 2.4 I:

- Committed some earlier commits. I ended up *not* committing the patch to
  create resowners in more backends (e.g. walsender), as that's not really a
  dependency for now.

  One of the more important things to get committed was in a separate thread:
  https://postgr.es/m/b6vveqz6r3wno66rho5lqi6z5kyhfgtvi3jcodyq5rlpp3cu44%40c6dsgf3z7yhs

  Now relpath() can be used for logging while in a critical section. That
  alone allowed removing most of the remaining FIXMEs.


- Split the md.c read/write patches; the write side is more complicated and
  isn't needed before write support arrives (much later in the queue, and very
  likely not in 18).

  The complicated bit about write support is needing to
  register_dirty_segment() after completion of the write. If
  RegisterSyncRequest() fails, the IO completer needs to open the file and
  sync itself, unfortunately PathNameOpenFile() allocates memory, which isn't
  ok while in a critical section (even though it'd not be detected, as it's
  using malloc()).
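
  Roughly, the problematic sequence in a write completion would be (sketch;
  this code doesn't exist yet):

    /* after the write completes, the segment has to be registered dirty */
    if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */))
    {
        /*
         * Queue full: we'd have to open the file and fsync it ourselves,
         * but PathNameOpenFile() allocates memory - not allowed in the
         * critical section completion callbacks run in (and not detected,
         * as the allocation goes through malloc()).
         */
    }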


- Reordered patches so that Thomas' read_stream work is after the basic AIO
  infrastructure patches; it has no dependency on the earlier patches.

  I think Thomas might have a newer version of some of these, but since
  they're not intended to be committed as part of this, I didn't spend the
  time to rebase onto the latest version.


- Added a small bit of data that can be provided to callbacks, which makes it
  a lot cleaner to transport information like ZERO_ON_ERROR.

  I also did s/shared_callbacks/callbacks/, as the prior name was outdated.


- Substantially expanded tests, most importantly generic temp file tests and
  AIO specific cross-backend tests

  As part of the expanded tests I also needed to export TerminateBufferIO(),
  as was, as previously mentioned, already done in an earlier version for
  StartBufferIO().  Nobody commented on that, so I think that's ok.

  I also renamed the tests away from the very inventively named tbl_a, tbl_b...


- Moved the commit to create resowners in more places to much later in the
  queue; it's not actually needed for bufmgr.c IO, and nothing needing it will
  land in 18


- Added a proper commit message for the main commit. I'd appreciate folks
  reading through it. I'm sure I forgot a lot of folks and a lot of things.


- Did a fair bit of comment polishing


- Addressed an XXX in the "aio infrastructure" commit suggesting that we might
  want to error out if a backend is waiting on its own unsubmitted IO. Noah
  argued for erroring out. I now made it so.


- Temporarily added a commit to increase the open-file limit on OpenBSD. I saw
  related errors without this patch too, but it fails more often with it. I
  already sent a separate email about this.


At this point I am not aware of anything significant left to do in the main
AIO commit, save some of the questions below.  There are a lot more potential
optimizations etc., but this is already a very complicated piece of work, so I
think they'll just have to wait for later.

There are a few things to clean up in the bufmgr.c commits; I don't yet quite
like the function naming, and there could be a bit less duplication. But I
don't think that needs to be resolved before the main commit.


Questions:

- My current thinking is that we'd set io_method = worker initially - so we
  actually get some coverage - and then decide whether to switch to
  io_method=sync by default for 18 sometime around beta1/2. Does that sound
  reasonable?


- We could reduce memory usage a tiny bit if we made the mapping between
  pgproc and per-backend-aio-state more complicated, i.e. not just indexed by
  ProcNumber. Right now IO workers have the per-backend AIO state, but don't
  actually need it.  I'm mildly inclined to think that the complexity isn't
  worth it, but I'm on the fence.
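
  I.e. right now the lookup is simply the following (struct/field names are
  illustrative):

    /* per-backend AIO state, indexed 1:1 by ProcNumber */
    PgAioPerBackend *state = &aio_ctl->backend_state[MyProcNumber];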


- Three of the commits in the series really are just precursor commits to
  their subsequent commits, which I found helpful for development and review,
  namely:

  - aio: Basic subsystem initialization
  - aio: Skeleton IO worker infrastructure
  - aio: Add liburing dependency

  Not sure if it's worth keeping these separate or whether they should just be
  merged with their "real commit".


- Thomas suggested renaming
  COMPLETED_IO->COMPLETED,
  COMPLETED_SHARED->TERMINATED_BY_COMPLETER,
  COMPLETED_LOCAL->TERMINATED_BY_SUBMITTER
  in
  https://www.postgresql.org/message-id/CA%2BhUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw%40mail.gmail.com

  While the other things in the email were commented upon by others and
  addressed in v2.4, the naming aspect wasn't further remarked upon by others.
  I'm not personally in love with the suggested names, but I could live with
  them.


- Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
  but all the ereport()s add a noticeable amount of overhead at high IO
  throughput (at multiple gigabytes/second), so that's probably not right
  forever.  I'd leave this on initially and then change it to default to off
  later.  I think that's ok?


- To allow io_workers to be PGC_SIGHUP, and to eventually allow
  automatically in-/decreasing the number of active workers, the max number
  of workers (32) is always allocated. That means we use more semaphores than
  before. I think that's ok, it's not 1995 anymore.  Alternatively we can add
  an "io_workers_max" GUC and probe for it in initdb.


- pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
  sure that's great?

  They could be an enum array or such too? That'd perhaps be a bit more
  extensible? OTOH, we don't currently use enums in the catalogs and arrays
  are somewhat annoying to conjure up from C.
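
  For reference, conjuring up a text[] from C would look roughly like this
  (the flag name is a placeholder):

    Datum       flagdatums[8];
    int         nflags = 0;
    ArrayType  *arr;

    if (flags & PGAIO_HF_SOME_FLAG)     /* placeholder flag */
        flagdatums[nflags++] = CStringGetTextDatum("SOME_FLAG");
    /* ... one branch per flag ... */

    arr = construct_array_builtin(flagdatums, nflags, TEXTOID);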


Todo:

- A few more passes over the main commit; I'm sure there are a few more
  inartful comments, odd formatting and such.

  - Check if there's a decent way to deduplicate pgaio_io_call_complete_shared() and
    pgaio_io_call_complete_local()


- Figure out how to deduplicate support for LockBufferForCleanup() in
  TerminateBufferIO().


- Documentation for pg_stat_aios.


- Check if the documentation for track_io_timing needs to be adjusted; after
  the bufmgr.c changes we only track waiting for an IO.


- Some of the test_aio code is specific to non-temp tables; it's probably
  worth generalizing it to deal with temp tables and invoking it for both.

Greetings,

Andres


Re: AIO v2.5

From: Jakub Wartak
On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:

> Attached is v2.5 of the AIO patchset.
[..]
Hi, Thanks for working on this!

> Questions:
>
> - My current thinking is that we'd set io_method = worker initially - so we
>   actually get some coverage - and then decide whether to switch to
>   io_method=sync by default for 18 sometime around beta1/2. Does that sound
>   reasonable?

IMHO, yes, good idea. Anyway, the final outcome will partially depend on
how many other stream consumers get committed, right?

> - Three of the commits in the series really are just precursor commits to
>   their subsequent commits, which I found helpful for development and review,
>   namely:
>
>   - aio: Basic subsystem initialization
>   - aio: Skeleton IO worker infrastructure
>   - aio: Add liburing dependency
>
>   Not sure if it's worth keeping these separate or whether they should just be
>   merged with their "real commit".

For me it was easier to read those when they are separate.

> - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
>   but all the ereport()s add a noticeable amount of overhead at high IO
>   throughput (at multiple gigabytes/second), so that's probably not right
>   forever.  I'd leave this on initially and then change it to default to off
>   later.  I think that's ok?

+1, hopefully nothing is recording/logging/running with
log_min_messages >= debug3, because only then does it start to be visible.

> - To allow io_workers to be PGC_SIGHUP, and to eventually allow
>   automatically in-/decreasing the number of active workers, the max number
>   of workers (32) is always allocated. That means we use more semaphores than
>   before. I think that's ok, it's not 1995 anymore.  Alternatively we can add
>   an "io_workers_max" GUC and probe for it in initdb.

Wouldn't that matter only on *BSDs?

BTW I somehow cannot imagine someone saturating >= 32 workers (if one
does, better to switch to uring anyway?), but I have a related
question about the fds held open by those workers.

> - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
>   sure that's great?
>
>   They could be an enum array or such too? That'd perhaps be a bit more
>   extensible? OTOH, we don't currently use enums in the catalogs and arrays
>   are somewhat annoying to conjure up from C.

s/pg_stat_aios/pg_aios/ ? :^) It looks good to me as it is. Anyway, it
is a debugging view - perhaps mark it as such in the docs - so there
is no stable API for it, and it shouldn't be queried by any software
anyway.

> - Documentation for pg_stat_aios.

pg_aios! :)

So, I've taken the aio-2 branch from your GitHub repo for a small ride
on legacy RHEL 8.7, with dm-flakey to inject I/O errors. This is more a
question: perhaps IO workers should auto-close fds on errors, or should
we use SIGUSR2 for it? The scenario is like this:

#dm-dust is not that available even on modern distros(not always
compiled), but flakey seemed to work on 4.18.x:
losetup /dev/loop0 /dd.img
mkfs.ext4 -j /dev/loop0
mkdir /flakey
mount /dev/loop0 /flakey # for now it will work
mkdir /flakey/tblspace
chown postgres /flakey/tblspace
chmod 0700 /flakey/tblspace
CREATE TABLESPACE test1 LOCATION '/flakey/tblspace';
CREATE TABLE t1fail (i int) TABLESPACE test1;
INSERT INTO t1fail SELECT generate_series(1, 1000000);  -- some data
pg_ctl stop
umount /flakey
echo "0 `blockdev --getsz /dev/loop0` flakey /dev/loop0 0 1 1" |
dmsetup create flakey # after 1s start throwing IO errors
mount /dev/mapper/flakey /flakey
#might even say: mount: /flakey: can't read superblock on /dev/mapper/flakey.
mount /dev/mapper/flakey /flakey
pg_ctl start

and then this will happen:

postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file
"pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file
"pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error
postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not read blocks 0..1 in file
"pg_tblspc/24579/PG_18_202503031/5/24586_fsm": Input/output error

postgres=# insert into t1fail select generate_series(1000001, 2000001);
ERROR:  could not open file
"pg_tblspc/24579/PG_18_202503031/5/24586_vm": Read-only file system

so usual stuff with kernel remounting it RO, but here's the dragon
with io_method=worker:

# mount -o remount,rw /flakey/
mount: /flakey: cannot remount /dev/mapper/flakey read-write, is
write-protected.
# umount /flakey # to fsck or just mount rw again
umount: /flakey: target is busy.
# lsof /flakey/
COMMAND     PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
postgres 103483 postgres   14u   REG  253,2 36249600   17
/flakey/tblspace/PG_18_202503031/5/24586
postgres 103484 postgres    6u   REG  253,2 36249600   17
/flakey/tblspace/PG_18_202503031/5/24586
postgres 103485 postgres    6u   REG  253,2 36249600   17
/flakey/tblspace/PG_18_202503031/5/24586

Those 10348[345] are IO workers; they still have open fds and there's
no way to close those without a restart -- well, without close()
injection probably via gdb.   pg_terminate_backend() on those won't
work. The only thing that works seems to be sending SIGUSR2, but is
that safe [there could be some errors after pwrite()]? With
io_method=sync just quitting the backend of course works. Not sure
what your thoughts are because any other bgworker could be having open
fds there. It's a very minor thing. Otherwise that outage of a separate
tablespace (rarely used) would potentially cause an inability to fsck
there and lower the availability of the DB (due to the potential restart
required). I'm thinking especially of scenarios where lots of schemas
are used with lots of tablespaces, OR where temp_tablespace is employed
for some dedicated (fast/furious/faulty) device. So I'm hoping SIGUSR2
is enough, right (4231f4059e5e54d78c56b904f30a5873da88e163 seems to be
doing it anyway)?

BTW: while at it, I've tried amcheck/pg_surgery for 1 min and they
both seem to work.

-J.



Re: AIO v2.5

From: Andres Freund
Hi,

On 2025-03-06 12:36:43 +0100, Jakub Wartak wrote:
> On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Questions:
> >
> > - My current thinking is that we'd set io_method = worker initially - so we
> >   actually get some coverage - and then decide whether to switch to
> >   io_method=sync by default for 18 sometime around beta1/2. Does that sound
> >   reasonable?
>
> IMHO, yes, good idea. Anyway, the final outcome will partially depend on
> how many other stream consumers get committed, right?

I think it's more whether we find cases where it performs substantially worse
with the read stream users that exist.  The behaviour for non-read-stream IO
shouldn't change.


> > - To allow io_workers to be PGC_SIGHUP, and to eventually allow
> >   automatically in-/decreasing the number of active workers, the max number
> >   of workers (32) is always allocated. That means we use more semaphores than
> >   before. I think that's ok, it's not 1995 anymore.  Alternatively we can add
> >   an "io_workers_max" GUC and probe for it in initdb.
>
> Wouldn't that matter only on *BSDs?

Yea, NetBSD and OpenBSD only, I think.


> > - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
> >   sure that's great?
> >
> >   They could be an enum array or such too? That'd perhaps be a bit more
> >   extensible? OTOH, we don't currently use enums in the catalogs and arrays
> >   are somewhat annoying to conjure up from C.
>
> s/pg_stat_aios/pg_aios/ ? :^)

Ooops, yes.


> It looks good to me as it is.
> Anyway it
> is a debugging view - perhaps mark it as such in the docs - so there
> is no stable API for that and shouldn't be queried by any software
> anyway.

Cool


> > - Documentation for pg_stat_aios.
>
> pg_aios! :)
>
> So, I've taken the aio-2 branch from your GitHub repo for a small ride
> on legacy RHEL 8.7, with dm-flakey to inject I/O errors. This is more a
> question: perhaps IO workers should auto-close fds on errors, or should
> we use SIGUSR2 for it? The scenario is like this:

When you say "auto-close", you mean that one IO error should trigger *all*
workers to close their FDs?


> so usual stuff with kernel remounting it RO, but here's the dragon
> with io_method=worker:
>
> # mount -o remount,rw /flakey/
> mount: /flakey: cannot remount /dev/mapper/flakey read-write, is
> write-protected.
> # umount /flakey # to fsck or just mount rw again
> umount: /flakey: target is busy.
> # lsof /flakey/
> COMMAND     PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
> postgres 103483 postgres   14u   REG  253,2 36249600   17
> /flakey/tblspace/PG_18_202503031/5/24586
> postgres 103484 postgres    6u   REG  253,2 36249600   17
> /flakey/tblspace/PG_18_202503031/5/24586
> postgres 103485 postgres    6u   REG  253,2 36249600   17
> /flakey/tblspace/PG_18_202503031/5/24586
>
> Those 10348[345] are IO workers; they still have open fds and there's
> no way to close those without a restart -- well, without close()
> injection probably via gdb.

The same is already true with bgwriter, checkpointer etc?


> pg_terminate_backend() on those won't work. The only thing that works seems
> to be sending SIGUSR2

Sending SIGINT works.


> , but is that safe [there could be some errors after pwrite()]?

Could you expand on that?


> With
> io_method=sync just quitting the backend of course works. Not sure
> what your thoughts are because any other bgworker could be having open
> fds there. It's a very minor thing. Otherwise that outage of a separate
> tablespace (rarely used) would potentially cause an inability to fsck
> there and lower the availability of the DB (due to the potential restart
> required).

I think a crash-restart is the only valid thing to get out of a scenario like
that, independent of AIO:

- If there had been any writes we need to perform crash recovery anyway, to
  recreate those writes
- If there just were reads, it's good to restart as well, as otherwise there
  might be pages in the buffer pool that don't exist on disk anymore, due to
  the errors.

Greetings,

Andres Freund



Re: AIO v2.5

From: Robert Haas
On Tue, Mar 4, 2025 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
>   sure that's great?

I don't like the name. Pluralizing abbreviations is weird, and it's
even weirder when the abbreviation is not one that is universally
known. Maybe just drop the "s".

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: AIO v2.5

From: Andres Freund
Hi,

On 2025-03-06 10:33:33 -0500, Robert Haas wrote:
> On Tue, Mar 4, 2025 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> > - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
> >   sure that's great?
> 
> I don't like the name.

I don't think it changes anything, but as Jakub pointed out, I thinko'd the
name in the email you're responding to, it's pg_aios, not pg_stat_aios.

It shows the currently in-flight IOs, not accumulated statistics about them,
hence no _stat_.

I don't like the name either; IIRC I asked for suggestions elsewhere in the
thread, not a lot was forthcoming, so I left it at pg_aios.


> Pluralizing abbreviations is weird, and it's even weirder when the
> abbreviation is not one that is universally known. Maybe just drop the "s".

I went with plural because that's what we have in other views showing the
"current" state:
- pg_cursors
- pg_file_settings
- pg_prepared_statements
- pg_prepared_xacts
- pg_replication_slots
- pg_locks
- ...

But you're right that those aren't abbreviations.

Greetings,

Andres Freund



Re: AIO v2.5

From: Jakub Wartak
On Thu, Mar 6, 2025 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:

> On 2025-03-06 12:36:43 +0100, Jakub Wartak wrote:
> > On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > Questions:
> > >
> > > - My current thinking is that we'd set io_method = worker initially - so we
> > >   actually get some coverage - and then decide whether to switch to
> > >   io_method=sync by default for 18 sometime around beta1/2. Does that sound
> > >   reasonable?
> >
> > IMHO, yes, good idea. Anyway, the final outcome will partially depend on
> > how many other stream consumers get committed, right?
>
> I think it's more whether we find cases where it performs substantially worse
> with the read stream users that exist.  The behaviour for non-read-stream IO
> shouldn't change.

OK, so in order to get the full picture for v18beta this would mean
$thread + the following ones?:
- Use read streams in autoprewarm
- BitmapHeapScan table AM violation removal (and use streaming read API)
- Index Prefetching (it seems it has stalled?)

or is there something more planned? (I'm asking what to apply on top
of AIO to minimize number of potential test runs which seem to take
lots of time, so to do it all in one go)

> > So, I've taken the aio-2 branch from your GitHub repo for a small ride
> > on legacy RHEL 8.7, with dm-flakey to inject I/O errors. This is more a
> > question: perhaps IO workers should auto-close fds on errors, or should
> > we use SIGUSR2 for it? The scenario is like this:
>
> When you say "auto-close", you mean that one IO error should trigger *all*
> workers to close their FDs?

Yeah, I somehow was thinking about such a thing, but after you bolded
that "*all*", my question sounds much more stupid than it did
yesterday. Sorry for asking a stupid question :)

> The same is already true with bgwriter, checkpointer etc?

Yeah.. I was kind of looking for a way of getting "higher
availability" in the presence of partial IO (tablespace) errors.

> > pg_terminate_backend() on those won't work. The only thing that works seems
> > to be sending SIGUSR2
>
> Sending SIGINT works.

Ugh, ok, it looks like I've been overthinking that, cool.

> > , but is that safe [there could be some errors after pwrite()]?
>
> Could you expand on that?

It is pure speculation on my side: well, I'm always concerned about
leaving something out there without cleanup after errors and then
re-using it for something else much later, especially on edge cases
like NFS or FUSE. In the backend we could maintain some state, but
io_workers are shared across backends. E.g. some pwrite() failing on
NFS, we are not closing that fd, and then reusing it for something
else much later for a different backend (although AFAIK close() does
not guarantee anything, but e.g. it could be that some inode/path or
something was simply marked dangling - a fresh pair of
close()/open() could return an error, but here we would just keep on
pwrite()ing there?).

OK, only one question remains: does it make sense to try something like
pgbench on NFS (UDP, mountopt=hard,nointr) + intermittent iptables DROPs
from time to time, or is it not worth trying?

> > With
> > io_method=sync just quitting the backend of course works. Not sure
> > what your thoughts are because any other bgworker could be having open
> > fds there. It's a very minor thing. Otherwise that outage of a separate
> > tablespace (rarely used) would potentially cause an inability to fsck
> > there and lower the availability of the DB (due to the potential restart
> > required).
>
> I think a crash-restart is the only valid thing to get out of a scenario like
> that, independent of AIO:
>
> - If there had been any writes we need to perform crash recovery anyway, to
>   recreate those writes
> - If there just were reads, it's good to restart as well, as otherwise there
>   might be pages in the buffer pool that don't exist on disk anymore, due to
>   the errors.

OK, cool, thanks!

-J.



Re: AIO v2.5

From: Andres Freund
Hi,

On 2025-03-07 11:21:09 +0100, Jakub Wartak wrote:
> On Thu, Mar 6, 2025 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > On 2025-03-06 12:36:43 +0100, Jakub Wartak wrote:
> > > On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Questions:
> > > >
> > > > - My current thinking is that we'd set io_method = worker initially - so we
> > > >   actually get some coverage - and then decide whether to switch to
> > > >   io_method=sync by default for 18 sometime around beta1/2. Does that sound
> > > >   reasonable?
> > >
> > > IMHO, yes, good idea. Anyway, the final outcome will partially depend on
> > > how many other stream consumers get committed, right?
> >
> > I think it's more whether we find cases where it performs substantially worse
> > with the read stream users that exist.  The behaviour for non-read-stream IO
> > shouldn't change.
> 
> OK, so in order to get the full picture for v18beta this would mean
> $thread + the following ones?:
> - Use read streams in autoprewarm
> - BitmapHeapScan table AM violation removal (and use streaming read API)

Yep.


> - Index Prefetching (it seems it has stalled?)

I don't think there's any chance it'll be in 18. There's a good bit more work
needed before it can go in...


> or is there something more planned? (I'm asking what to apply on top
> of AIO to minimize number of potential test runs which seem to take
> lots of time, so to do it all in one go)

I think there may be some more (e.g. btree index vacuuming), but I don't think
they'll have *that* big an impact.


> > > So, I've taken the aio-2 branch from your GitHub repo for a small ride
> > > on legacy RHEL 8.7, with dm-flakey to inject I/O errors. This is more a
> > > question: perhaps IO workers should auto-close fds on errors, or should
> > > we use SIGUSR2 for it? The scenario is like this:
> >
> > When you say "auto-close", you mean that one IO error should trigger *all*
> > workers to close their FDs?
> 
> Yeah, I somehow was thinking about such a thing, but after you bolded
> that "*all*", my question sounds much more stupid than it did
> yesterday. Sorry for asking a stupid question :)

Don't worry about that :)


> > The same is already true with bgwriter, checkpointer etc?
> 
> Yeah.. I was kind of looking for a way of getting "higher
> availability" in the presence of partial IO (tablespace) errors.

I'm really doubtful that's all that worthwhile to pursue. IME the system is
pretty much hosed once this starts happening, and it's often made *worse* by
trying to limp along.


> OK, only one question remains: does it make sense to try something like
> pgbench on NFS (UDP, mountopt=hard,nointr) + intermittent iptables DROPs
> from time to time, or is it not worth trying?

I don't think it's particularly interesting. But then I'd *never* trust any
meaningful data to a PG running on NFS.


Greetings,

Andres Freund



Re: AIO v2.5

From: Andres Freund
Hi,

On 2025-03-06 11:53:41 -0500, Andres Freund wrote:
> On 2025-03-06 10:33:33 -0500, Robert Haas wrote:
> > On Tue, Mar 4, 2025 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
> > >   sure that's great?
> > 
> > I don't like the name.
> 
> I don't think it changes anything, but as Jakub pointed out, I thinko'd the
> name in the email you're responding to, it's pg_aios, not pg_stat_aios.
> 
> It shows the currently in-flight IOs, not accumulated statistics about them,
> hence no _stat_.
> 
> I don't like the name either; IIRC I asked for suggestions elsewhere in the
> thread, not a lot was forthcoming, so I left it at pg_aios.

What about pg_io_handles?

Greetings,

Andres Freund



Re: AIO v2.5

From: Andres Freund
Hi,

Tom, CCed you since you have worked the most on elog.c


On 2025-03-07 16:23:51 -0500, Andres Freund wrote:
> What about pg_io_handles?

While looking at the view I felt motivated to tackle the one FIXME in the
implementation of the view. Namely that the "error_desc" column wasn't
populated (the view did show that there was an error, but not what the error
was).

Which led me down a sad, sad rabbit hole, largely independent of AIO.


A bit of background:

For AIO, completion callbacks can signal errors (e.g. a page header failing
validation). That error can be logged in the callback and/or raised later,
e.g. by the query that issued the IO.

AIO callbacks happen in critical sections, which is required to be able to use
AIO for WAL (see README.md for more details).

Currently errors are logged/raised by ereport()s in functions that get passed
an elevel, pretty standard.

A few of the ereports() use errcode_for_file_access() to translate an errno to
an sqlerrcode.


Now on to the problem:

The result of an ereport() can't be put into a view, obviously. I didn't think
it'd be good if each kind of error needed to be implemented twice, once
with ereport() and once to just return a string to put in the view.


I tried a few things:

1) Use errsave() to allow delayed reporting of the error

I encountered a few problems:

- errsave() doesn't allow the log level to be specified, which means it can't
  directly be used to LOG if no context is specified.

  This could be worked around by always specifying the context, with
  ErrorSaveContext.details_wanted = true, and having generic code that changes
  the elevel to whatever is appropriate and then uses ThrowErrorData() to log
  the message.

- errsave_start() sets assoc_context to CurrentMemoryContext and
  errsave_finish() allocates an ErrorData copy in CurrentMemoryContext

  This makes naive use of this approach when logging in a critical section
  impossible. If no ErrorSaveContext is passed in, an ERROR will be raised,
  even if we just want to log.  If an ErrorSaveContext is used, we allocate
  memory in the caller's context, which isn't allowed in a critical section.

  The only way I saw to work around that was to switch to ErrorContext before
  calling errsave(). That's doable, as the logging is called from one function
  (pgaio_result_report()). That kinda works, but as a consequence we more than
  double the memory usage in ErrorContext, as errsave_finish() will palloc a
  new ErrorData and ThrowErrorData() copies that ErrorData and all its strings
  back to ErrorContext.
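
  For concreteness, 1) would look roughly like this (sketch; the message
  contents are illustrative):

    ErrorSaveContext escontext = {T_ErrorSaveContext};
    MemoryContext    oldcontext;

    escontext.details_wanted = true;

    /* avoid allocating in the caller's context inside a critical section */
    oldcontext = MemoryContextSwitchTo(ErrorContext);
    errsave((Node *) &escontext,
            errcode_for_file_access(),
            errmsg("could not read block %u in file \"%s\": %m",
                   blocknum, path));
    MemoryContextSwitchTo(oldcontext);

    if (escontext.error_occurred)
    {
        /* generic code picks the appropriate level, then logs */
        escontext.error_data->elevel = LOG;
        ThrowErrorData(escontext.error_data);
    }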


2) Have the error callback format the error using a helper function instead of
   using ereport()

Problems:

- errcode_for_file_access() would need to be reimplemented / split into a
  function translating an errno into an sqlerrcode without getting it from
  the error data stack

- emitting the log message in a critical section would require either doing
  the error formatting in ErrorContext or creating another context with
  reserved memory to do so.

- allowing DETAIL, HINT etc. to be specified basically requires a small
  reimplementation of the elog.c interface


3) Use pre_format_elog_string(), format_elog_string() similar to what guc.c
   does for check hooks, via GUC_check_errmsg(), GUC_check_errhint() ...

Problems:

- Requires to duplicate errcode_for_file_access() for similar reason as in 2)

- Not exactly pretty

- Somewhat gnarly, but doable, to make use of %m safe; the way it's done in
  guc.h afaict isn't safe:
  pre_format_elog_string() is called for each of
  GUC_check_{errmsg,errdetail,errhint}. As the global errno might get set
  during the format_elog_string(), it'll not be the right one during
  the next GUC_check_*.



4) Don't use ereport() directly, but instead put the errstart() in
   pgaio_result_report(), before calling the error description callback.

   When emitting a log message, call errfinish() after the callback. For the
   view, get the message out via CopyErrorData() and free the memory again
   using FlushErrorState().

Problems:

- Seems extremely hacky
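
To spell 4) out a bit (sketch; report_cb and for_view are simplifications):

    /* in pgaio_result_report(): start the error, let the callback fill it */
    if (errstart(elevel, TEXTDOMAIN))
    {
        report_cb(result);          /* adds errmsg(), errdetail(), ... */

        if (!for_view)
            errfinish(__FILE__, __LINE__, PG_FUNCNAME_MACRO);
        else
        {
            /* for the view: copy the message out, then drop the error */
            ErrorData  *edata = CopyErrorData();

            /* ... use edata->message ... */
            FlushErrorState();
        }
    }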


I implemented all of these, but don't really like any of them.


Unless somebody has a better idea or we agree that one of the above is
actually an acceptable approach, I'm inclined to simply remove the column
containing the description of the error. The window in which one could see an
IO with an error is rather short most of the time anyway and the error will
also be logged.

It's a bit annoying that adding the column later would require revising the
signature of the error reporting callback at that time, but I think that
degree of churn is acceptable.


The main reason I wanted to write this up is that it seems that we're just
lacking some infrastructure here.

Greetings,

Andres Freund



Re: AIO v2.5

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> While looking at the view I felt motivated to tackle the one FIXME in the
> implementation of the view. Namely that the "error_desc" column wasn't
> populated (the view did show that there was an error, but not what the error
> was).

> Which led me down a sad, sad rabbit hole, largely independent of AIO.

> ...

> The main reason I wanted to write this up is that it seems that we're just
> lacking some infrastructure here.

Maybe.  The mention of elog.c in the same breath with critical
sections is already enough to scare me; we surely daren't invoke
gettext() in a critical section, for instance.  I feel the most
we could hope for here is to report a constant string that would
not get translated till later, outside the critical section.
That seems less about infrastructure and more about how the AIO
error handling/reporting code is laid out.  In the meantime,
if leaving the error out of this view is enough to make the problem
go away, let's do that.

            regards, tom lane



Re: AIO v2.5

From: Andres Freund
Hi,

Attached is v2.6 of the AIO patchset.

Relative to 2.5 I:

- Improved the split between subsystem initialization and main AIO commit, as
  well as the one between worker infrastructure and io_method=worker

  Seemed worthwhile, as the only person voicing an opinion about squashing
  those commits was opposed.


- Added a lot more comments to aio.h/aio_internal.h. I think just about
  anything that should conceivably have a comment has one.


- Reordered fields in PgAioHandle to waste less due to padding


- Narrowed a few *count fields; they were 64 bit without ever being able to
  reach that


- Used aio_types.h more widely, instead of "manual" forward declarations. This
  required moving a few typedefs to aio_types.h


- Substantial commit message improvements.


- Removed the pg_aios.error_desc column, due to:
  https://postgr.es/m/qzxq6mqqozctlfcg2kg5744gmyubicvuehnp4a7up472thlvz2%40y5xqgd5wcwhw


- Reordered the commits slightly, to put the README just after the
  smgr.c/md.c/... support, as the README references those in the examples


- Stopped creating backend-local io_uring instances; that is vestigial for
  now. We likely will want to reintroduce them at some point (e.g. for network
  IO), but we can do that at that time.


- There were a lot of duplicated codepaths in bufmgr.c support for AIO due to
  temp tables. I added a few commits refactoring the temp buffers state
  management to look a lot more like the shared buffer code.

  I'm not sure that that's the best path, but they all seemed substantial
  improvements on their own.


- putting io_method in PG_TEST_INITDB_EXTRA_OPTS previously broke a test,
  because Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the options
  specified by ->extra. I now worked around that by appending the io method to
  a local PG_TEST_INITDB_EXTRA_OPTS, but brrr.


- The tracepoint for read completion omitted the fact that the read targeted
  a temp table, when it did.


- Fixed some duplicated function decls, due to a misresolved merge-conflict


Current state:

- 0001, 0002 - core AIO - IMO pretty much ready

- 0003, 0004 - IO worker - same

- 0005, 0006 - io_uring support - close, but we need to do something about
  set_max_fds(), which errors out spuriously in some cases

- 0007 - smgr/md/fd.c readv support - seems quite close, but might benefit from
  another pass through

- 0008 - README - I think it's good, but I'm probably not seeing the trees for
  the forest anymore

- 0009 - pg_aios view - naming not resolved, docs missing

- 0010 to 0014 - from another thread, just included here due to a dependency

- 0016 to 0020 - cleanups for temp buffers code - I just wrote these to clean
  up the code before making larger changes, needs review

- 0021 - keep BufferDesc refcount up to date for temp buffers - I think that's
  pretty much ready, but depends on earlier patches

- 0022 - bufmgr readv AIO support - some naming, some code duplication needs to
  be resolved, but otherwise quite close

- 0023 - use AIO in StartReadBuffers() - perhaps a bit of polishing needed

- 0024 - adjust read_stream.c for AIO - I think Thomas has a better patch for
  this in the works

- 0025 - tests for AIO - I think it's reasonable, unless somebody objects to
  exporting a few bufmgr.c functions to the test

- the rest: Not for 18

Greetings,

Andres Freund


Re: AIO v2.5

From: Melanie Plageman
On Mon, Mar 10, 2025 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> - 0016 to 0020 - cleanups for temp buffers code - I just wrote these to clean
>   up the code before making larger changes, needs review

This is a review of 0016-0020

Commit messages for 0017-0020 are thin. I assume you will beef them up
a bit before committing. Really, though, those matter much less than
0016, which is an actual bug (or pre-bug) fix. I called out the ones
where I think you should really consider adding more detail to the
commit message.

0016:

      * the case, write it out before reusing it!
      */
-    if (buf_state & BM_DIRTY)
+    if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
     {
+        uint32        buf_state = pg_atomic_read_u32(&bufHdr->state);

I don't love that you fetch in the if statement and inside the if
statement. You wouldn't normally do this, so it sticks out. I get that
you want to avoid having the problem this commit fixes again, but
maybe it is worth just fetching the buf_state above the if statement
and adding a comment that it could have changed so you must do that.
Anyway, I think your future patches make the local buf_state variable
in this function obsolete, so perhaps it doesn't matter.

Not related to this patch, but while reading this code, I noticed that
this line of code is really weird:
        LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
I actually don't understand what it is doing ... setting the result of
the macro to the result of GetLocalBufferStorage()? I haven't seen
anything like that before.

Otherwise, this patch LGTM.

0017:

+++ b/src/backend/storage/buffer/localbuf.c
@@ -56,6 +56,7 @@ static int    NLocalPinnedBuffers = 0;
 static Buffer GetLocalVictimBuffer(void);
+static void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced);

Technically this line is too long

+ * InvalidateLocalBuffer -- mark a local buffer invalid.
+ *
+ * If check_unreferenced is true, error out if the buffer is still
+ * used. Passing false is appropriate when redesignating the buffer instead
+ * dropping it.
+ *
+ * See also InvalidateBuffer().
+ */
+static void
+InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
+{

I was on the fence about the language "buffer is still used", since
this is about the ref count and not the usage count. If this is the
language used elsewhere perhaps it is fine.

I also was not sure what redesignate means here. If you mean to use
this function in the future in other contexts than eviction and
dropping buffers, fine. But otherwise, maybe just use a more obvious
word (like eviction).

0018:

The compiler now warns that buf_state is unused in GetLocalVictimBuffer().

@@ -4564,8 +4548,7 @@ FlushRelationBuffers(Relation rel)
                                         IOCONTEXT_NORMAL, IOOP_WRITE,
                                         io_start, 1, BLCKSZ);

-                buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
-                pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+                TerminateLocalBufferIO(bufHdr, true, 0);

FlushRelationBuffers() used to clear BM_JUST_DIRTIED, which it seems
like wouldn't have been applicable to local buffers before, but,
actually with async IO could perhaps happen in the future? Anyway,
TerminateLocalBufferIO() doesn't clear that flag, so you should call
that out if it was intentional.

@@ -5652,8 +5635,11 @@ TerminateBufferIO(BufferDesc *buf, bool
clear_dirty, uint32 set_flag_bits,
+    buf_state &= ~BM_IO_IN_PROGRESS;
+    buf_state &= ~BM_IO_ERROR;
-    buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);

Is it worth mentioning in the commit message that you made a cosmetic
change to TerminateBufferIO()?

0019:
LGTM

0020:
This commit message is probably too thin. I think you need to at
least say something about this being used by AIO in the future. Out of
context of this patch set, it will be confusing.

+/*
+ * Like StartBufferIO, but for local buffers
+ */
+bool
+StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
+{

I think you could use a comment about why nowait might be useful for
local buffers in the future. It wouldn't make sense with synchronous
I/O, so it feels a bit weird without any comment.

+    if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
+    {
+        /* someone else already did the I/O */
+        UnlockBufHdr(bufHdr, buf_state);
+        return false;
+    }

UnlockBufHdr() explicitly says it should not be called for local
buffers. I know that code is unreachable right now, but it doesn't
feel quite right. I'm not sure what the architecture of AIO local
buffers will be like, but if other processes can't access these
buffers, I don't know why you would need BM_LOCKED. And if you will, I
think you need to edit the UnlockBufHdr() comment.

@@ -1450,13 +1450,11 @@ static inline bool
 WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
 {
     if (BufferIsLocal(buffer))
     else
-        return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+        return StartBufferIO(GetBufferDescriptor(buffer - 1),
+                             true, nowait);

I'm not sure it is worth the diff in non-local buffer case to reflow
this. It is already confusing enough in this patch that you are adding
some code that is mostly unneeded.

- Melanie



Re: AIO v2.5

From: Andres Freund
Hi,

On 2025-03-11 11:31:18 -0400, Melanie Plageman wrote:
> On Mon, Mar 10, 2025 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > - 0016 to 0020 - cleanups for temp buffers code - I just wrote these to clean
> >   up the code before making larger changes, needs review
> 
> This is a review of 0016-0020
> 
> Commit messages for 0017-0020 are thin. I assume you will beef them up
> a bit before committing.

Yea. I wanted to get some feedback on whether these refactorings are a good
idea or not...


> Really, though, those matter much less than 0016 which is an actual bug (or
> pre-bug) fix. I called out the ones where I think you should really consider
> adding more detail to the commit message.
> 
> 0016:

Do you think we should backpatch that change? It's not really an active bug in
16+, but it's also not quite right. The other changes surely shouldn't be
backpatched...


>       * the case, write it out before reusing it!
>       */
> -    if (buf_state & BM_DIRTY)
> +    if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
>      {
> +        uint32        buf_state = pg_atomic_read_u32(&bufHdr->state);
> 
> I don't love that you fetch in the if statement and inside the if
> statement. You wouldn't normally do this, so it sticks out. I get that
> you want to avoid having the problem this commit fixes again, but
> maybe it is worth just fetching the buf_state above the if statement
> and adding a comment that it could have changed so you must do that.

It seems way too easy to introduce new similar breakages if the scope of
buf_state is that wide - yesterday I wasted 90min because I did just that in
another similar place. The narrower scope makes that much less likely to be a
problem.


> Anyway, I think your future patches make the local buf_state variable
> in this function obsolete, so perhaps it doesn't matter.

Leaving the defensive-programming aspect aside, it does seem like a better
intermediary state to me to have the local vars than to have to change more
lines when introducing FlushLocalBuffer() etc.


> Not related to this patch, but while reading this code, I noticed that
> this line of code is really weird:
>         LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
> I actually don't understand what it is doing ... setting the result of
> the macro to the result of GetLocalBufferStorage()? I haven't seen
> anything like that before.

Yes, that's what it's doing. LocalBufferBlockPointers() evaluates to a value
that can be used as an lvalue in an assignment.

Not exactly pretty...
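
For context, the definition in localbuf.c is (roughly):

    #define LocalBufHdrGetBlock(bufHdr) \
        LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]

I.e. the macro expands to an element of the LocalBufferBlockPointers array,
and an array element is a perfectly fine assignment target in C.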


> Otherwise, this patch LGTM.
> 
> 0017:
> 
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -56,6 +56,7 @@ static int    NLocalPinnedBuffers = 0;
>  static Buffer GetLocalVictimBuffer(void);
> +static void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced);
> 
> Technically this line is too long

Oh, do I love our line length limits. But, um, is it actually too long? It's
78 chars, which is exactly our limit, I think?


> + * InvalidateLocalBuffer -- mark a local buffer invalid.
> + *
> + * If check_unreferenced is true, error out if the buffer is still
> + * used. Passing false is appropriate when redesignating the buffer instead
> + * dropping it.
> + *
> + * See also InvalidateBuffer().
> + */
> +static void
> +InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
> +{
> 
> I was on the fence about the language "buffer is still used", since
> this is about the ref count and not the usage count. If this is the
> language used elsewhere perhaps it is fine.

I'll change it to "still pinned".


> I also was not sure what redesignate means here. If you mean to use
> this function in the future in other contexts than eviction and
> dropping buffers, fine. But otherwise, maybe just use a more obvious
> word (like eviction).

I was trying to reference changing the identity of the buffer as part of
buffer replacement, where we keep a pin on the buffer, as compared to the use
of
InvalidateLocalBuffer() in DropRelationAllLocalBuffers() /
DropRelationLocalBuffers().

/*
 * InvalidateLocalBuffer -- mark a local buffer invalid.
 *
 * If check_unreferenced is true, error out if the buffer is still
 * pinned. Passing false is appropriate when calling InvalidateLocalBuffer()
 * as part of changing the identity of a buffer, instead of just dropping the
 * buffer.
 *
 * See also InvalidateBuffer().
 */


> 0018:
> 
> Compiler now warns that buf_state is unused in GetLocalVictimBuffer().

Oops. Missed that because it was then removed in a later commit...


> @@ -4564,8 +4548,7 @@ FlushRelationBuffers(Relation rel)
>                                          IOCONTEXT_NORMAL, IOOP_WRITE,
>                                          io_start, 1, BLCKSZ);
> 
> -                buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
> -                pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
> +                TerminateLocalBufferIO(bufHdr, true, 0);
> 
> FlushRelationBuffers() used to clear BM_JUST_DIRTIED, which it seems
> like wouldn't have been applicable to local buffers before, but,
> actually with async IO could perhaps happen in the future? Anyway,
> TerminateLocalBufferIO() doesn't clear that flag, so you should call
> that out if it was intentional.

I think it'd be good to start using BM_JUST_DIRTIED, even if just to make the
code between local and shared buffers more similar. But that's better
done separately.

I don't know why FlushRelationBuffers() cleared it; it's never set at the moment.

I'll add a note to the commit message.


> @@ -5652,8 +5635,11 @@ TerminateBufferIO(BufferDesc *buf, bool
> clear_dirty, uint32 set_flag_bits,
> +    buf_state &= ~BM_IO_IN_PROGRESS;
> +    buf_state &= ~BM_IO_ERROR;
> -    buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
> 
> Is it worth mentioning in the commit message that you made a cosmetic
> change to TerminateBufferIO()?

Doesn't really seem worth calling out, but if you think it should, I will.


> 0020:
> This commit message is probably tooo thin. I think you need to at
> least say something about this being used by AIO in the future. Out of
> context of this patch set, it will be confusing.

Yep.


> +/*
> + * Like StartBufferIO, but for local buffers
> + */
> +bool
> +StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
> +{
> 
> I think you could use a comment about why nowait might be useful for
> local buffers in the future. It wouldn't make sense with synchronous
> I/O, so it feels a bit weird without any comment.

Hm, fair point. Another approach would be to defer adding the argument to a
later patch, it doesn't need to be added here.


> +    if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
> +    {
> +        /* someone else already did the I/O */
> +        UnlockBufHdr(bufHdr, buf_state);
> +        return false;
> +    }
> 
> UnlockBufHdr() explicitly says it should not be called for local
> buffers. I know that code is unreachable right now, but it doesn't
> feel quite right. I'm not sure what the architecture of AIO local
> buffers will be like, but if other processes can't access these
> buffers, I don't know why you would need BM_LOCKED. And if you will, I
> think you need to edit the UnlockBufHdr() comment.

You are right, this is a bug in my change. I started with a copy of
StartBufferIO() and whittled it down insufficiently. Thanks for catching that!

Wonder if we should add an assert against this to UnlockBufHdr()...
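
Something like this (sketch, assuming UnlockBufHdr()'s current shape):

    static inline void
    UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
    {
        /* local buffers don't use the buffer header spinlock */
        Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));

        pg_write_barrier();
        pg_atomic_write_u32(&desc->state, buf_state & (~BM_LOCKED));
    }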


> @@ -1450,13 +1450,11 @@ static inline bool
>  WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
>  {
>      if (BufferIsLocal(buffer))
>      else
> -        return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
> +        return StartBufferIO(GetBufferDescriptor(buffer - 1),
> +                             true, nowait);
> 
> I'm not sure it is worth the diff in non-local buffer case to reflow
> this. It is already confusing enough in this patch that you are adding
> some code that is mostly unneeded.

Heh, you're right. I had to add a line break in the StartLocalBufferIO() and
it looked wrong to have the two lines formatted differently :)


Thanks for the review!

Greetings,

Andres Freund



Re: AIO v2.5

From: Melanie Plageman
On Tue, Mar 11, 2025 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-03-11 11:31:18 -0400, Melanie Plageman wrote:
> > Commit messages for 0017-0020 are thin. I assume you will beef them up
> > a bit before committing.
>
> Yea. I wanted to get some feedback on whether these refactorings are a good
> idea or not...

I'd say yes, they seem like a good idea.

> > Really, though, those matter much less than 0016 which is an actual bug (or
> > pre-bug) fix. I called out the ones where I think you should really consider
> > adding more detail to the commit message.
> >
> > 0016:
>
> Do you think we should backpatch that change? It's not really an active bug in
> 16+, but it's also not quite right. The other changes surely shouldn't be
> backpatched...

I don't feel strongly about it. PinLocalBuffer() is called with
adjust_usagecount=false there, and we have loads of other places where things
would just not work if we changed the boolean flag passed in to a
function called by it (bgwriter and SyncOneBuffer() with
skip_recently_used come to mind).

On the other hand it's a straightforward fix that only needs to be
backpatched a couple versions, so it definitely doesn't hurt.

> > +++ b/src/backend/storage/buffer/localbuf.c
> > @@ -56,6 +56,7 @@ static int    NLocalPinnedBuffers = 0;
> >  static Buffer GetLocalVictimBuffer(void);
> > +static void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced);
> >
> > Technically this line is too long
>
> Oh, do I love our line length limits. But, um, is it actually too long? It's
> 78 chars, which is exactly our limit, I think?

Teccchnically it's 79, which is why it showed up for me with this
handy line from the committing wiki page

    git diff origin/master -- src/backend/storage/buffer/localbuf.c | \
        grep -E '^(\+|diff)' | sed 's/^+//' | expand -t4 | \
        awk "length > 78 || /^diff/"

But anyway, it doesn't really matter. I only mentioned it because I
noticed it visually looked long.

> > +    if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
> > +    {
> > +        /* someone else already did the I/O */
> > +        UnlockBufHdr(bufHdr, buf_state);
> > +        return false;
> > +    }
> >
> > UnlockBufHdr() explicitly says it should not be called for local
> > buffers. I know that code is unreachable right now, but it doesn't
> > feel quite right. I'm not sure what the architecture of AIO local
> > buffers will be like, but if other processes can't access these
> > buffers, I don't know why you would need BM_LOCKED. And if you will, I
> > think you need to edit the UnlockBufHdr() comment.
>
> You are right, this is a bug in my change. I started with a copy of
> StartBufferIO() and whittled it down insufficiently. Thanks for catching that!
>
> Wonder if we should add an assert against this to UnlockBufHdr()...

Yea, I think that makes sense.

- Melanie



Re: AIO v2.5

From
Noah Misch
Date:
On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > For non-sync IO methods, I gather it's essential that a process other than the
> > IO definer be scanning for incomplete IOs and completing them.

> > Otherwise, deadlocks like this would happen:
> 
> > backend1 locks blk1 for non-IO reasons
> > backend2 locks blk2, starts AIO write
> > backend1 waits for lock on blk2 for non-IO reasons
> > backend2 waits for lock on blk1 for non-IO reasons
> >
> > If that's right, in worker mode, the IO worker resolves that deadlock.  What
> > resolves it under io_uring?  Another process that happens to do
> > pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
> > make that happen systematically.
> 
> Yea, it's code that I haven't forward ported yet. I think basically
> LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
> immediately acquire the lock and if the buffer has IO going on.

I'm not finding that code in v2.6.  What function has it?


[I wrote a bunch of the subsequent comments against v2.5.  I may have missed
instances of v2.6 obsoleting them.]

On Tue, Mar 04, 2025 at 02:00:14PM -0500, Andres Freund wrote:
> Attached is v2.5 of the AIO patchset.

> - Added a proper commit message for the main commit. I'd appreciate folks
>   reading through it. I'm sure I forgot a lot of folks and a lot of things.

Commit message looks fine.

> At this point I am not aware of anything significant left to do in the main
> AIO commit, save some of the questions below.

That is a big milestone.

> Questions:
> 
> - My current thinking is that we'd set io_method = worker initially - so we
>   actually get some coverage - and then decide whether to switch to
>   io_method=sync by default for 18 sometime around beta1/2. Does that sound
>   reasonable?

Yes.

> - We could reduce memory usage a tiny bit if we made the mapping between
>   pgproc and per-backend-aio-state more complicated, i.e. not just indexed by
>   ProcNumber. Right now IO workers have the per-backend AIO state, but don't
>   actually need it.  I'm mildly inclined to think that the complexity isn't
>   worth it, but on the fence.

The max memory savings, for 32 IO workers, is like the difference between
max_connections=500 and max_connections=532, right?  If that's right, I
wouldn't bother in the foreseeable future.

> - Three of the commits in the series really are just precursor commits to
>   their subsequent commits, which I found helpful for development and review,
>   namely:
> 
>   - aio: Basic subsystem initialization
>   - aio: Skeleton IO worker infrastructure
>   - aio: Add liburing dependency
> 
>   Not sure if it's worth keeping these separate or whether they should just be
>   merged with their "real commit".

The split aided my review.  It's trivial to turn an unmerged stack of commits
into the merged equivalent, but unmerging is hard.

> - Thomas suggested renaming
>   COMPLETED_IO->COMPLETED,
>   COMPLETED_SHARED->TERMINATED_BY_COMPLETER,
>   COMPLETED_LOCAL->TERMINATED_BY_SUBMITTER
>   in
>   https://www.postgresql.org/message-id/CA%2BhUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw%40mail.gmail.com
> 
>   While the other things in the email were commented upon by others and
>   addressed in v2.4, the naming aspect wasn't further remarked upon by others.
>   I'm not personally in love with the suggested names, but I could live with
>   them.

I, too, could live with those.  None of these naming proposals bother me, and
I would not have raised the topic myself.  If I were changing it further, I'd
use these principles:

- use COMPLETED or TERMINATED, not both
- I like COMPLETED, because _complete_ works well in a function name.
  _terminate_ sounds more like an abnormal interruption.
- If one state name lacks a suffix, it should be the final state.

So probably one of:

{COMPLETED,TERMINATED,FINISHED,REAPED,DONE}_{KERN,RETURN,RETVAL,ERRNO}
{COMPLETED,TERMINATED,FINISHED,REAPED,DONE}_{SHMEM,SHARED}
{COMPLETED,TERMINATED,FINISHED,REAPED,DONE}{_SUBMITTER,}

If it were me picking today, I'd pick:

COMPLETED_RETURN
COMPLETED_SHMEM
COMPLETED

> - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
>   but all the ereport()s add a noticeable amount of overhead at high IO
>   throughput (at multiple gigabytes/second), so that's probably not right
>   forever.  I'd leave this on initially and then change it to default to off
>   later.  I think that's ok?

Sure.  Perhaps make it depend on USE_ASSERT_CHECKING later?
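
E.g. (sketch):

    #ifdef USE_ASSERT_CHECKING
    #define PGAIO_VERBOSE 1
    #else
    #define PGAIO_VERBOSE 0
    #endif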

> - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
>   automatically in/decrease active workers, the max number of workers (32) is
>   always allocated. That means we use more semaphores than before. I think
>   that's ok, it's not 1995 anymore.  Alternatively we can add a
>   "io_workers_max" GUC and probe for it in initdb.

Let's start as you have it.  If someone wants to make things perfect for
non-root BSD users, they can add the GUC later.  io_method=sync is a
sufficient backup plan indefinitely.

> - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
>   sure that's great?
> 
>   They could be an enum array or such too? That'd perhaps be a bit more
>   extensible? OTOH, we don't currently use enums in the catalogs and arrays
>   are somewhat annoying to conjure up from C.

An enum array does seem elegant and extensible, but it has the problems you
say.  (I would expect to lose time setting up pg_enum.oid values to not change
between releases.)  A possible compromise would be a text array like
heap_tuple_infomask_flags() does.  Overall, I'm not seeing a clear need to
change away from the bool columns.

> Todo:

> - Figure out how to deduplicate support for LockBufferForCleanup() in
>   TerminateBufferIO().

Yes, I agree there's an opportunity for a WakePinCountWaiter() or similar
subroutine.

> - Check if documentation for track_io_timing needs to be adjusted, after the
>   bufmgr.c changes we only track waiting for an IO.

Yes.


On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote:
> Attached is v2.6 of the AIO patchset.

> - 0005, 0006 - io_uring support - close, but we need to do something about
>   set_max_fds(), which errors out spuriously in some cases

What do we know about those cases?  I don't see a set_max_fds(); is that
set_max_safe_fds(), or something else?

> - 0025 - tests for AIO - I think it's reasonable, unless somebody objects to
>   exporting a few bufmgr.c functions to the test

I'll essentially never object to that.


> +     * AIO handles need be registered in critical sections and therefore
> +     * cannot use the normal ResoureElem mechanism.

s/ResoureElem/ResourceElem/

> +      <varlistentry id="guc-io-method" xreflabel="io_method">
> +       <term><varname>io_method</varname> (<type>enum</type>)
> +       <indexterm>
> +        <primary><varname>io_method</varname> configuration parameter</primary>
> +       </indexterm>
> +       </term>
> +       <listitem>
> +        <para>
> +         Selects the method for executing asynchronous I/O.
> +         Possible values are:
> +         <itemizedlist>
> +          <listitem>
> +           <para>
> +            <literal>sync</literal> (execute asynchronous I/O synchronously)

The part in parentheses reads like a contradiction to me.  How about phrasing
it like one of these:

  (execute I/O synchronously, even I/O eligible for asynchronous execution)
  (execute asynchronous-eligible I/O synchronously)
  (execute I/O synchronously, even when asynchronous execution was feasible)

> + * This could be in aio_internal.h, as it is not pubicly referenced, but

typo -> publicly

> + * On what is IO being performed.

End sentence with question mark, probably.

> +     * List of in-flight IOs. Also contains IOs that aren't strict speaking

s/strict/strictly/

> +    /*
> +     * Start executing passed in IOs.
> +     *
> +     * Will not be called if ->needs_synchronous_execution() returned true.
> +     *
> +     * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE.
> +     *

I recommend adding "Always called in a critical section." since at least
pgaio_worker_submit() subtly needs it.

> +     */
> +    int            (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);

> + * Each backend can only have one AIO handle that that has been "handed out"

s/that that/that/

> + * AIO, it typically will pass the handle to smgr., which will pass it on to

s/smgr.,/smgr.c,/ or just "smgr"

> +PgAioHandle *
> +pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
> +{
> +    if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
> +    {
> +        Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
> +        pgaio_submit_staged();

I'm seeing the "num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE" case uncovered in a
check-world coverage report.  I tried PGAIO_SUBMIT_BATCH_SIZE=2,
io_max_concurrency=1, and io_max_concurrency=64.  Do you already have a recipe
for reaching this case?

> +/*
> + * Stage IO for execution and, if necessary, submit it immediately.
> + *
> + * Should only be called from pgaio_io_prep_*().
> + */
> +void
> +pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
> +{

We've got closely-associated verbs "prepare", "prep", and "stage".  README.md
doesn't mention "stage".  Can one of the following two changes happen?

- README.md starts mentioning "stage" and how it differs from the others
- Code stops using "stage"

> +     * locallbacks just before reclaiming at multiple callsites.

s/locallbacks/local callbacks/

> + * Check if the the referenced IO completed, without blocking.

s/the the/the/

> + * Batch submission mode needs to explicitly ended with
> + * pgaio_exit_batchmode(), but it is allowed to throw errors, in which case
> + * error recovery will end the batch.

This sentence needs some grammar help, I think.  Maybe use:

 * End batch submission mode with pgaio_exit_batchmode().  (Throwing errors is
 * allowed; error recovery will end the batch.)
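
For reference, the pattern that comment describes is roughly this (sketch;
stage_one_read() is a hypothetical stand-in for the caller's staging code):

    pgaio_enter_batchmode();
    for (int i = 0; i < nblocks; i++)
        stage_one_read(i);      /* staged, not submitted immediately */
    /* throwing an ERROR anywhere above is fine; recovery ends the batch */
    pgaio_exit_batchmode();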

>  Size
>  AioShmemSize(void)
>  {
>      Size        sz = 0;
>  
> +    /*
> +     * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
> +     * However, if the DBA explicitly set wal_buffers = -1 in the config file,

s/wal_buffers/io_max_concurrency/

> +extern int    io_workers;

By the rule that GUC vars are PGDLLIMPORT, this should be PGDLLIMPORT.

> +static void
> +maybe_adjust_io_workers(void)

This also restarts workers that exit, so perhaps name it
start_io_workers_if_missing().

> +{
...
> +        /* Try to launch one. */
> +        child = StartChildProcess(B_IO_WORKER);
> +        if (child != NULL)
> +        {
> +            io_worker_children[id] = child;
> +            ++io_worker_count;
> +        }
> +        else
> +            break;                /* XXX try again soon? */

Can LaunchMissingBackgroundProcesses() become the sole caller of this
function, replacing the current mix of callers?  That would be more conducive
to promptly doing the right thing after launch failure.

> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
>          case B_CHECKPOINTER:
>              backendDesc = gettext_noop("checkpointer");
>              break;
> +        case B_IO_WORKER:
> +            backendDesc = "io worker";

Wrap in gettext_noop() like B_CHECKPOINTER does.

> +         Only has an effect if <xref linkend="guc-max-wal-senders"/> is set to
> +         <literal>worker</literal>.

s/guc-max-wal-senders/guc-io-method/

> + * of IOs, wakeups "fan out"; each woken IO worker can wake two more.  qXXX

s/qXXX/XXX/

> +            /*
> +             * It's very unlikely, but possible, that reopen fails. E.g. due
> +             * to memory allocations failing or file permissions changing or
> +             * such.  In that case we need to fail the IO.
> +             *
> +             * There's not really a good errno we can report here.
> +             */
> +            error_errno = ENOENT;

Agreed there's not a good errno, but let's use a fake errno that we're mighty
unlikely to confuse with an actual case of libc returning that errno.  Like
one of EBADF or EOWNERDEAD.

> +    for (int contextno = 0; contextno < TotalProcs; contextno++)
> +    {
> +        PgAioUringContext *context = &pgaio_uring_contexts[contextno];
> +        int            ret;
> +
> +        /*
> +         * XXX: Probably worth sharing the WQ between the different rings,
> +         * when supported by the kernel. Could also cause additional
> +         * contention, I guess?
> +         */
> +#if 0
> +        if (!AcquireExternalFD())
> +            elog(ERROR, "No external FD available");
> +#endif

Probably remove the "#if 0" or add a comment on why it's here.

> +        ret = io_uring_submit(uring_instance);
> +        pgstat_report_wait_end();
> +
> +        if (ret == -EINTR)
> +        {
> +            pgaio_debug(DEBUG3,
> +                        "aio method uring: submit EINTR, nios: %d",
> +                        num_staged_ios);
> +        }
> +        else if (ret < 0)
> +            elog(PANIC, "failed: %d/%s",
> +                 ret, strerror(-ret));

I still think (see 2024-09-16 review) EAGAIN should do the documented
recommendation instead of PANIC:

  EAGAIN The kernel was unable to allocate memory for the request, or
  otherwise ran out of resources to handle it. The application should wait for
  some completions and try again.

At a minimum, it deserves a comment like "We accept PANIC on memory exhaustion
here."

> +            pgstat_report_wait_end();
> +
> +            if (ret == -EINTR)
> +            {
> +                continue;
> +            }
> +            else if (ret != 0)
> +            {
> +                elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret));

I think errno isn't meaningful here, so %m doesn't belong.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2687,6 +2687,12 @@ include_dir 'conf.d'
>              <literal>worker</literal> (execute asynchronous I/O using worker processes)
>             </para>
>            </listitem>
> +          <listitem>
> +           <para>
> +            <literal>io_uring</literal> (execute asynchronous I/O using
> +            io_uring, if available)
> +           </para>
> +          </listitem>

Docs should eventually cover RLIMIT_MEMLOCK per
https://github.com/axboe/liburing "ulimit settings".  Maybe RLIMIT_NOFILE,
too.

> @@ -2498,6 +2529,12 @@ FilePathName(File file)
>  int
>  FileGetRawDesc(File file)
>  {
> +    int            returnCode;
> +
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
>      Assert(FileIsValid(file));
>      return VfdCache[file].fd;
>  }

What's the rationale for this function's change?

> +The main reason to want to use Direct IO are:

> +The main reason *not* to use Direct IO are:

x2 s/main reason/main reasons/

> +  and direct IO without O_DSYNC needs to issue a write and after the writes
> +  completion a cache cache flush, whereas O\_DIRECT + O\_DSYNC can use a

s/writes/write's/

> +  single FUA write).

I recommend including the acronym expansion: s/FUA/Force Unit Access (FUA)/

> +In an `EXEC_BACKEND` build backends executable code and other process local

s/backends/backends'/

> +state is not necessarily mapped to the same addresses in each process due to
> +ASLR. This means that the shared memory cannot contain pointer to callbacks.

s/pointer/pointers/

> +The "solution" to this the ability to associate multiple completion callbacks
> +with a handle. E.g. bufmgr.c can have a callback to update the BufferDesc
> +state and to verify the page and md.c. another callback to check if the IO
> +operation was successful.

One of these or similar:
s/md.c. another/md.c can have another/
s/md.c. /md.c /


I've got one high-level question that I felt could take too long to answer for
myself by code reading.  What's the cleanup story if process A does
elog(FATAL) with unfinished I/O?  Specifically:

- Suppose some other process B reuses the shared memory AIO data structures
  that pertained to process A.  After that, some process C completes the I/O
  in shmem.  Do we avoid confusing B by storing local callback data meant for
  A in shared memory now pertaining to B?

- Thinking more about this README paragraph:

    +In addition to completion, AIO callbacks also are called to "prepare" an
    +IO. This is, e.g., used to increase buffer reference counts to account for the
    +AIO subsystem referencing the buffer, which is required to handle the case
    +where the issuing backend errors out and releases its own pins while the IO is
    +still ongoing.

  Which function performs that reference count increase?  I'm not finding it
  today.  I wanted to look at how it ensures the issuing backend still exists
  as the function increases the reference count.


One later-patch item:

> +static PgAioResult
> +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed)
> +{
...
> +    TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum,
> +                                      tag.blockNum,
> +                                      tag.spcOid,
> +                                      tag.dbOid,
> +                                      tag.relNumber,
> +                                      INVALID_PROC_NUMBER,
> +                                      false);

I wondered about whether the buffer-read-done probe should happen in the
process that calls the complete_shared callback or in the process that did the
buffer-read-start probe.  When I see dtrace examples, they usually involve
explicitly naming each PID to trace.  Assuming that's indeed the norm, I think
the local callback would be the better place, so a given trace contains both
probes.  If it were reasonable to dtrace all current and future postmaster
kids, that would argue for putting the probe in the complete_shared callback.
Alternatively, one could argue for separate probes buffer-read-done-shmem
and buffer-read-done.

Thanks,
nm



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-11 12:41:08 -0700, Noah Misch wrote:
> On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> > On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > > For non-sync IO methods, I gather it's essential that a process other than the
> > > IO definer be scanning for incomplete IOs and completing them.
>
> > > Otherwise, deadlocks like this would happen:
> >
> > > backend1 locks blk1 for non-IO reasons
> > > backend2 locks blk2, starts AIO write
> > > backend1 waits for lock on blk2 for non-IO reasons
> > > backend2 waits for lock on blk1 for non-IO reasons
> > >
> > > If that's right, in worker mode, the IO worker resolves that deadlock.  What
> > > resolves it under io_uring?  Another process that happens to do
> > > pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
> > > make that happen systematically.
> >
> > Yea, it's code that I haven't forward ported yet. I think basically
> > LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
> > immediately acquire the lock and if the buffer has IO going on.
>
> I'm not finding that code in v2.6.  What function has it?

My local version now has it... Sorry, I was focusing on the earlier patches
until now.

What do we want to do for ConditionalLockBufferForCleanup() (I don't think
IsBufferCleanupOK() can matter)?  I suspect we should also make it wait for
the IO. See below:

Not for 18, but for full write support, we'll also need logic to wait for IO
in LockBuffer(BUFFER_LOCK_EXCLUSIVE) and answer the same question as for
ConditionalLockBufferForCleanup() for ConditionalLockBuffer().

It's not an issue with the current level of write support in the stack of
patches. But with v1 AIO, which had support for a lot more ways of doing
asynchronous writes, it turned out that not handling it in
ConditionalLockBuffer() triggers an endless loop. This can be
kind-of-reproduced today by just making ConditionalLockBuffer() always return
false - triggers a hang in the regression tests:

spginsert() loops around spgdoinsert() until it succeeds. spgdoinsert() locks
the child page with ConditionalLockBuffer() and gives up if it can't.
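
From memory, that loop is schematically (not the exact code):

    while (!spgdoinsert(index, &spgstate, ht_ctid, values, isnull))
    {
        /* immediately retries, without waiting for anything */
        MemoryContextReset(insertCtx);
        initSpGistState(&spgstate, index);
    }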

That seems like rather bad code in spgist, because, even without AIO, it'll
busy-loop until the buffer is unlocked. Which could take a while, given that
it'll conflict even with a share locker and thus synchronous writes.


Even if we fixed spgist, it seems rather likely that there's other code that
wouldn't tolerate "spurious" failures. Which leads me to think that causing
the IO to complete is probably the safest bet. Triggering IO completion never
requires acquiring new locks that could participate in a deadlock, so it'd be
safe.
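
I.e. the failure path would gain something like this (sketch only; the
in-flight check and the io_ref field are hypothetical stand-ins for whatever
the patch ends up using):

    /* in ConditionalLockBufferForCleanup(), before giving up: */
    if (pgaio_buffer_has_io_in_flight(bufHdr))  /* hypothetical check */
        pgaio_io_ref_wait(&bufHdr->io_ref);     /* completes the IO, takes no new locks */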



> > At this point I am not aware of anything significant left to do in the main
> > AIO commit, save some of the questions below.
>
> That is a big milestone.

Indeed!


> > - We could reduce memory usage a tiny bit if we made the mapping between
> >   pgproc and per-backend-aio-state more complicated, i.e. not just indexed by
> >   ProcNumber. Right now IO workers have the per-backend AIO state, but don't
> >   actually need it.  I'm mildly inclined to think that the complexity isn't
> >   worth it, but on the fence.
>
> The max memory savings, for 32 IO workers, is like the difference between
> max_connections=500 and max_connections=532, right?

Even less than that: Aux processes aren't always used as a multiplier in
places where max_connections etc are. E.g. max_locks_per_transaction is just
multiplied by MaxBackends, not MaxBackends+NUM_AUXILIARY_PROCS.


> If that's right, I wouldn't bother in the foreseeable future.

Cool.



> > - Three of the commits in the series really are just precursor commits to
> >   their subsequent commits, which I found helpful for development and review,
> >   namely:
> >
> >   - aio: Basic subsystem initialization
> >   - aio: Skeleton IO worker infrastructure
> >   - aio: Add liburing dependency
> >
> >   Not sure if it's worth keeping these separate or whether they should just be
> >   merged with their "real commit".
>
> The split aided my review.  It's trivial to turn an unmerged stack of commits
> into the merged equivalent, but unmerging is hard.

That's been the feedback so far, so I'll leave it split.



> > - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
> >   but all the ereport()s add a noticeable amount of overhead at high IO
> >   throughput (at multiple gigabytes/second), so that's probably not right
> >   forever.  I'd leave this on initially and then change it to default to off
> >   later.  I think that's ok?
>
> Sure.  Perhaps make it depend on USE_ASSERT_CHECKING later?

Yea, that makes sense.


> > - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
> >   automatically in/decrease active workers, the max number of workers (32) is
> >   always allocated. That means we use more semaphores than before. I think
> >   that's ok, it's not 1995 anymore.  Alternatively we can add a
> >   "io_workers_max" GUC and probe for it in initdb.
>
> Let's start as you have it.  If someone wants to make things perfect for
> non-root BSD users, they can add the GUC later.  io_method=sync is a
> sufficient backup plan indefinitely.

Cool.

I think we'll really need to do something about this for BSD users regardless
of AIO. Or maybe those OSs should fix something, but somehow I am not having
high hopes for an OS that claims POSIX-conforming unnamed semaphores on the
basis of a syscall that always returns EPERM... [1]


> > - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
> >   sure that's great?
> >
> >   They could be an enum array or such too? That'd perhaps be a bit more
> >   extensible? OTOH, we don't currently use enums in the catalogs and arrays
> >   are somewhat annoying to conjure up from C.
>
> An enum array does seem elegant and extensible, but it has the problems you
> say.  (I would expect to lose time setting up pg_enum.oid values to not change
> between releases.)  A possible compromise would be a text array like
> heap_tuple_infomask_flags() does.  Overall, I'm not seeing a clear need to
> change away from the bool columns.

Yea, I think that's where I ended up too. If we get a dozen flags we can
reconsider.



> > Todo:
>
> > - Figure out how to deduplicate support for LockBufferForCleanup() in
> >   TerminateBufferIO().
>
> Yes, I agree there's an opportunity for a WakePinCountWaiter() or similar
> subroutine.

Done.


> > - Check if documentation for track_io_timing needs to be adjusted, after the
> >   bufmgr.c changes we only track waiting for an IO.
>
> Yes.

The relevant sentences seem to be:

- "Enables timing of database I/O calls."

  s/calls/waits/

- "Time spent in {read,write,writeback,extend,fsync} operations"

  s/in/waiting for/

  Even though not all of these will use AIO, the "waiting for" formulation
  seems just as accurate.

- "Columns tracking I/O time will only be non-zero when <xref
  linkend="guc-track-io-timing"/> is enabled."

  s/time/wait time/


> On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote:
> > Attached is v2.6 of the AIO patchset.
>
> > - 0005, 0006 - io_uring support - close, but we need to do something about
> >   set_max_fds(), which errors out spuriously in some cases
>
> What do we know about those cases?  I don't see a set_max_fds(); is that
> set_max_safe_fds(), or something else?

Sorry, yes, set_max_safe_fds(). The problem basically is that with io_uring we
will have a large number of FDs already allocated by the time
set_max_safe_fds() is called. set_max_safe_fds() subtracts already_open from
max_files_per_process, which then allows only a few IOs - or even a negative
number.

I think we should redefine max_files_per_process to be about the number of
files each *backend* will additionally open.  Jelte was working on related
patches, see [2].
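
To spell out the computation at issue (abridged from set_max_safe_fds() in
fd.c):

    count_usable_fds(max_files_per_process, &usable_fds, &already_open);
    max_safe_fds = Min(usable_fds, max_files_per_process - already_open);

    /*
     * With io_uring, already_open includes the rings created at postmaster
     * start, so the subtraction can end up near zero, or below it.
     */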

> > +     * AIO handles need be registered in critical sections and therefore
> > +     * cannot use the normal ResoureElem mechanism.
>
> s/ResoureElem/ResourceElem/

Oops, fixed.


> > +      <varlistentry id="guc-io-method" xreflabel="io_method">
> > +       <term><varname>io_method</varname> (<type>enum</type>)
> > +       <indexterm>
> > +        <primary><varname>io_method</varname> configuration parameter</primary>
> > +       </indexterm>
> > +       </term>
> > +       <listitem>
> > +        <para>
> > +         Selects the method for executing asynchronous I/O.
> > +         Possible values are:
> > +         <itemizedlist>
> > +          <listitem>
> > +           <para>
> > +            <literal>sync</literal> (execute asynchronous I/O synchronously)
>
> The part in parentheses reads like a contradiction to me.

There's something to that...


> How about phrasing it like one of these:
>
>   (execute I/O synchronously, even I/O eligible for asynchronous execution)
>   (execute asynchronous-eligible I/O synchronously)
>   (execute I/O synchronously, even when asynchronous execution was feasible)

I like the second one best, adopted.


> [..]
> End sentence with question mark, probably.
> [..]
> s/strict/strictly/
> [..]
> I recommend adding "Always called in a critical section." since at least
> pgaio_worker_submit() subtly needs it.
> [..]
> s/that that/that/
> [..]
> s/smgr.,/smgr.c,/ or just "smgr"
> [..]
> s/locallbacks/local callbacks/
> [..]
> s/the the/the/

All adopted.


> > +PgAioHandle *
> > +pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
> > +{
> > +    if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
> > +    {
> > +        Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
> > +        pgaio_submit_staged();
>
> I'm seeing the "num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE" case uncovered in a
> check-world coverage report.  I tried PGAIO_SUBMIT_BATCH_SIZE=2,
> io_max_concurrency=1, and io_max_concurrency=64.  Do you already have a recipe
> for reaching this case?

With the default server settings it's hard to hit due to read_stream.c
limiting how much IO it issues:

1) The default io_combine_limit=16 makes reads larger, reducing the queue
   depth, at least for sequential scans

2) The default shared_buffers/max_connections settings limit the number of
   buffers that can be pinned to 86, which will only allow a small number of
   IOs due to 86/io_combine_limit = ~5

3) The default effective_io_concurrency only allows one IO in flight

Melanie has a patch to adjust effective_io_concurrency:
https://www.postgresql.org/message-id/CAAKRu_Z4ekRbfTacYYVrvu9xRqS6G4DMbZSbN_1usaVtj%2Bbv2w%40mail.gmail.com

If I increase shared_buffers and decrease io_combine_limit and put an
elog(PANIC) in that branch, it's rather quickly hit.
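
I.e. roughly this temporary tripwire (sketch), combined with a large
shared_buffers, io_combine_limit=1 and a raised effective_io_concurrency:

    if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
    {
        Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
        /* temporary, just to prove the branch is reached */
        elog(PANIC, "full batch: %d staged IOs",
             pgaio_my_backend->num_staged_ios);
        pgaio_submit_staged();
    }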



> > +/*
> > + * Stage IO for execution and, if necessary, submit it immediately.
> > + *
> > + * Should only be called from pgaio_io_prep_*().
> > + */
> > +void
> > +pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
> > +{
>
> We've got closely-associated verbs "prepare", "prep", and "stage".  README.md
> doesn't mention "stage".  Can one of the following two changes happen?
>
> - README.md starts mentioning "stage" and how it differs from the others
> - Code stops using "stage"

I'll try to add something to README.md. To me the sequence is prepare->stage.


> > + * Batch submission mode needs to explicitly ended with
> > + * pgaio_exit_batchmode(), but it is allowed to throw errors, in which case
> > + * error recovery will end the batch.
>
> This sentence needs some grammar help, I think.

Indeed.

> Maybe use:
>
>  * End batch submission mode with pgaio_exit_batchmode().  (Throwing errors is
>  * allowed; error recovery will end the batch.)

I like it.


> >  Size
> >  AioShmemSize(void)
> >  {
> >      Size        sz = 0;
> >
> > +    /*
> > +     * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
> > +     * However, if the DBA explicitly set wal_buffers = -1 in the config file,
>
> s/wal_buffers/io_max_concurrency/

Ooops.


> > +extern int    io_workers;
>
> By the rule that GUC vars are PGDLLIMPORT, this should be PGDLLIMPORT.

Indeed. I wish we had something finding violations of this automatically...


> > +static void
> > +maybe_adjust_io_workers(void)
>
> This also restarts workers that exit, so perhaps name it
> start_io_workers_if_missing().

But it also stops IO workers if necessary?


> > +{
> ...
> > +        /* Try to launch one. */
> > +        child = StartChildProcess(B_IO_WORKER);
> > +        if (child != NULL)
> > +        {
> > +            io_worker_children[id] = child;
> > +            ++io_worker_count;
> > +        }
> > +        else
> > +            break;                /* XXX try again soon? */
>
> Can LaunchMissingBackgroundProcesses() become the sole caller of this
> function, replacing the current mix of callers?  That would be more conducive
> to promptly doing the right thing after launch failure.

I'm not sure that'd be a good idea - right now IO workers are started before
the startup process, as the startup process might need to perform IO. If we
started them only later, in ServerLoop(), we'd potentially do a fair bit of
work, including starting checkpointer, bgwriter and bgworkers, before starting
IO workers.  That shouldn't actively break anything, but it would likely make
things slower.

I rather dislike the code around when we start what. Leaving AIO aside, during
a normal startup we start checkpointer and bgwriter before the startup
process. But during a crash restart we don't explicitly start them. Why make
things uniform when it could also be exciting :)


> > --- a/src/backend/utils/init/miscinit.c
> > +++ b/src/backend/utils/init/miscinit.c
> > @@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
> >          case B_CHECKPOINTER:
> >              backendDesc = gettext_noop("checkpointer");
> >              break;
> > +        case B_IO_WORKER:
> > +            backendDesc = "io worker";
>
> Wrap in gettext_noop() like B_CHECKPOINTER does.
>
> > +         Only has an effect if <xref linkend="guc-max-wal-senders"/> is set to
> > +         <literal>worker</literal>.
>
> s/guc-max-wal-senders/guc-io-method/
>
> > + * of IOs, wakeups "fan out"; each woken IO worker can wake two more.  qXXX
>
> s/qXXX/XXX/

All fixed.


> > +            /*
> > +             * It's very unlikely, but possible, that reopen fails. E.g. due
> > +             * to memory allocations failing or file permissions changing or
> > +             * such.  In that case we need to fail the IO.
> > +             *
> > +             * There's not really a good errno we can report here.
> > +             */
> > +            error_errno = ENOENT;
>
> Agreed there's not a good errno, but let's use a fake errno that we're mighty
> unlikely to confuse with an actual case of libc returning that errno.  Like
> one of EBADF or EOWNERDEAD.

Can we rely on that to be present on all platforms, including windows?


> > +    for (int contextno = 0; contextno < TotalProcs; contextno++)
> > +    {
> > +        PgAioUringContext *context = &pgaio_uring_contexts[contextno];
> > +        int            ret;
> > +
> > +        /*
> > +         * XXX: Probably worth sharing the WQ between the different rings,
> > +         * when supported by the kernel. Could also cause additional
> > +         * contention, I guess?
> > +         */
> > +#if 0
> > +        if (!AcquireExternalFD())
> > +            elog(ERROR, "No external FD available");
> > +#endif
>
> Probably remove the "#if 0" or add a comment on why it's here.

Will do. It was an attempt at dealing with the set_max_safe_fds() issue above,
but it turned out to not work at all, given how fd.c currently works.


> > +        ret = io_uring_submit(uring_instance);
> > +        pgstat_report_wait_end();
> > +
> > +        if (ret == -EINTR)
> > +        {
> > +            pgaio_debug(DEBUG3,
> > +                        "aio method uring: submit EINTR, nios: %d",
> > +                        num_staged_ios);
> > +        }
> > +        else if (ret < 0)
> > +            elog(PANIC, "failed: %d/%s",
> > +                 ret, strerror(-ret));
>
> I still think (see 2024-09-16 review) EAGAIN should do the documented
> recommendation instead of PANIC:
>
>   EAGAIN The kernel was unable to allocate memory for the request, or
>   otherwise ran out of resources to handle it. The application should wait for
>   some completions and try again.

I don't think this can be hit in a recoverable way. We'd likely just end up
with an untested path that quite possibly would be wrong.

What wait time would be appropriate?  What problems would it cause if we just
slept while holding critical lwlocks? I think it'd typically just delay the
crash-restart if we did, making it harder to recover from the problem.

Because we are careful to limit how many outstanding IO requests there are on
an io_uring instance, the kernel has to have run *severely* out of memory to
hit this.

I suspect it might currently be *impossible* to hit this due to ENOMEM,
because io_uring will fall back to allocating individual requests if the batch
allocation it normally does fails. My understanding is that for small
allocations the kernel will try to reclaim memory forever; only large ones can
fail.

Even if it were possible to hit, the likelihood that postgres can continue to
work ok if the kernel can't allocate ~250 bytes seems very low.

How about adding a dedicated error message for EAGAIN? IMO io_uring_enter()'s
meaning of EAGAIN is, uhm, unconventional, so a better error message than
strerror() might be good?


Proposed comment:
            /*
             * The io_uring_enter() manpage suggests that the appropriate
             * reaction to EAGAIN is:
             *
             * "The application should wait for some completions and try
             * again"
             *
             * However, it seems unlikely that that would help in our case, as
             * we apply a low limit to the number of outstanding IOs and thus
             * also outstanding completions, making it unlikely that we'd get
             * EAGAIN while the OS is in good working order.
             *
             * Additionally, it would be problematic to just wait here, our
             * caller might hold critical locks. It'd possibly lead to
             * delaying the crash-restart that seems likely to occur when the
             * kernel is under such heavy memory pressure.
             */




> > +            pgstat_report_wait_end();
> > +
> > +            if (ret == -EINTR)
> > +            {
> > +                continue;
> > +            }
> > +            else if (ret != 0)
> > +            {
> > +                elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret));
>
> I think errno isn't meaningful here, so %m doesn't belong.

You're right.  I wonder if we should make errno meaningful though (by setting
it); the elog.c machinery captures it, and I know that there are logging hooks
that utilize that fact.  That'd also avoid the need to use strerror() here.
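
I.e. something like:

    errno = -ret;               /* elog.c captures errno; hooks can see it */
    elog(PANIC, "io_uring submit failed: %m");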


> > --- a/doc/src/sgml/config.sgml
> > +++ b/doc/src/sgml/config.sgml
> > @@ -2687,6 +2687,12 @@ include_dir 'conf.d'
> >              <literal>worker</literal> (execute asynchronous I/O using worker processes)
> >             </para>
> >            </listitem>
> > +          <listitem>
> > +           <para>
> > +            <literal>io_uring</literal> (execute asynchronous I/O using
> > +            io_uring, if available)
> > +           </para>
> > +          </listitem>
>
> Docs should eventually cover RLIMIT_MEMLOCK per
> https://github.com/axboe/liburing "ulimit settings".

Given the way we currently use io_uring (i.e. no registered buffers), the
RLIMIT_MEMLOCK advice only applies to Linux <= 5.11.  I'm not sure that's
worth documenting?


> Maybe RLIMIT_NOFILE, too.

Yea, we probably need to. Depends a bit on where we go with [2] though.


>
> > @@ -2498,6 +2529,12 @@ FilePathName(File file)
> >  int
> >  FileGetRawDesc(File file)
> >  {
> > +    int            returnCode;
> > +
> > +    returnCode = FileAccess(file);
> > +    if (returnCode < 0)
> > +        return returnCode;
> > +
> >      Assert(FileIsValid(file));
> >      return VfdCache[file].fd;
> >  }
>
> What's the rationale for this function's change?

It flatly didn't work before.  I guess I can make that a separate commit.


> > +The main reason to want to use Direct IO are:
>
> > +The main reason *not* to use Direct IO are:
>
> x2 s/main reason/main reasons/
>
> > +  and direct IO without O_DSYNC needs to issue a write and after the writes
> > +  completion a cache cache flush, whereas O\_DIRECT + O\_DSYNC can use a
>
> s/writes/write's/
>
> > +  single FUA write).
>
> I recommend including the acronym expansion: s/FUA/Force Unit Access (FUA)/
>
> > +In an `EXEC_BACKEND` build backends executable code and other process local
>
> s/backends/backends'/
>
> > +state is not necessarily mapped to the same addresses in each process due to
> > +ASLR. This means that the shared memory cannot contain pointer to callbacks.
>
> s/pointer/pointers/
>
> > +The "solution" to this the ability to associate multiple completion callbacks
> > +with a handle. E.g. bufmgr.c can have a callback to update the BufferDesc
> > +state and to verify the page and md.c. another callback to check if the IO
> > +operation was successful.
>
> One of these or similar:
> s/md.c. another/md.c can have another/
> s/md.c. /md.c /

All applied.


> I've got one high-level question that I felt could take too long to answer for
> myself by code reading.  What's the cleanup story if process A does
> elog(FATAL) with unfinished I/O?  Specifically:

It's a good question. Luckily there's a relatively easy answer:

pgaio_shutdown() is registered via before_shmem_exit() in pgaio_init_backend()
and pgaio_shutdown() waits for all IOs to finish.
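
Schematically (callback signature per ipc.h; the loop body is abridged and the
helper name approximate):

    static void
    pgaio_shutdown(int code, Datum arg)
    {
        /* wait until this backend has no IO in flight anymore */
        while (!dclist_is_empty(&pgaio_my_backend->in_flight_ios))
            pgaio_io_wait_for_free();   /* approximate name */
    }

    void
    pgaio_init_backend(void)
    {
        /* runs before shmem is detached, on any exit, including FATAL */
        before_shmem_exit(pgaio_shutdown, 0);
    }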

The main reason this exists is that the AIO mechanisms in various OSs, at
least in some OS versions, don't like it if the issuing process exits while
the IO is in flight.  IIRC that was the case in v1 with posix_aio (which we
don't support in v2, and probably should never use) and I think also with
io_uring in some kernel versions.

Another reason is that those requests would show up in pg_aios (or whatever we
end up naming it) until they're reused, which doesn't seem great.


> - Suppose some other process B reuses the shared memory AIO data structures
>   that pertained to process A.  After that, some process C completes the I/O
>   in shmem.  Do we avoid confusing B by storing local callback data meant for
>   A in shared memory now pertaining to B?

This will, before pgaio_shutdown() gets involved, also be prevented by local
callbacks being cleared by resowner cleanup. We take care that the resowner
cleanup happens before process exit. That's important, because the
backend-local pointer could be invalidated by an ERROR.


> - Thinking more about this README paragraph:
>
>     +In addition to completion, AIO callbacks also are called to "prepare" an
>     +IO. This is, e.g., used to increase buffer reference counts to account for the
>     +AIO subsystem referencing the buffer, which is required to handle the case
>     +where the issuing backend errors out and releases its own pins while the IO is
>     +still ongoing.
>
>   Which function performs that reference count increase?  I'm not finding it
>   today.

Ugh, I just renamed the relevant functions in my local branch, while trying to
reduce the code duplication between shared and local buffers ;).

In <= v2.6 it's shared_buffer_stage_common() and local_buffer_readv_stage().

In v2.7-to-be it is buffer_stage_common(), which now supports both shared and
local buffers.


> I wanted to look at how it ensures the issuing backend still exists as the
> function increases the reference count.

The reference count is increased solely in the BufferDesc, *not* in the
backend-local pin tracking.  Earlier I had tracked the pin in BufferDesc for
shared buffers (as the pin needs to be released upon completion, which might
be in another backend), but in LocalRefCount[] for temp buffers.  But that
turned out to not work when the backend errors out, as it would make
CheckForLocalBufferLeaks() complain.
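
Schematically, for shared buffers the stage callback just does (abridged):

    buf_state = LockBufHdr(buf_hdr);
    buf_state += BUF_REFCOUNT_ONE;      /* AIO's pin, in the BufferDesc only */
    UnlockBufHdr(buf_hdr, buf_state);

    /*
     * PrivateRefCount/LocalRefCount are deliberately not touched, so the
     * issuer can error out and release its own pins without
     * CheckForLocalBufferLeaks() complaining.
     */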


>
> One later-patch item:
>
> > +static PgAioResult
> > +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed)
> > +{
> ...
> > +    TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum,
> > +                                      tag.blockNum,
> > +                                      tag.spcOid,
> > +                                      tag.dbOid,
> > +                                      tag.relNumber,
> > +                                      INVALID_PROC_NUMBER,
> > +                                      false);
>
> I wondered about whether the buffer-read-done probe should happen in the
> process that calls the complete_shared callback or in the process that did the
> buffer-read-start probe.

Yea, that's a good point. I should at least have added a comment pointing out
that it's a choice with pros and cons.

The reason I went for doing it in the completion callback is that it seemed
better to get the READ_DONE event as soon as possible, even if the issuer of
the IO is currently busy doing other things. The shared completion callback is
after all where the buffer state is updated for shared buffers.

But I think you have a point too.


> When I see dtrace examples, they usually involve explicitly naming each PID
> to trace

TBH, i've only ever used our tracepoints via perf and bpftrace, not dtrace
itself. For those it's easy to trace more than just a single pid and to
monitor system-wide. I don't really know enough about using dtrace itself.


> Assuming that's indeed the norm, I think the local callback would
> be the better place, so a given trace contains both probes.

Seems like a shame to add an extra indirect function call for a tracing
feature that afaict approximately nobody ever uses (IIRC we several times have
passed wrong things to tracepoints without that being noticed).


TBH, the tracepoints are so poorly documented and maintained that I was
tempted to suggest removing them a couple times.


This was an awesome review, thanks!

Andres Freund

[1] https://man.openbsd.org/sem_init.3#STANDARDS
[2] https://postgr.es/m/D80MHNSG4EET.6MSV5G9P130F%40jeltef.nl



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Mar 11, 2025 at 07:55:35PM -0400, Andres Freund wrote:
> On 2025-03-11 12:41:08 -0700, Noah Misch wrote:
> > On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> > > On 2024-09-16 07:43:49 -0700, Noah Misch wrote:

> What do we want to do for ConditionalLockBufferForCleanup() (I don't think
> IsBufferCleanupOK() can matter)?  I suspect we should also make it wait for
> the IO. See below:

I agree IsBufferCleanupOK() can't matter.  It asserts that the caller already
holds the exclusive buffer lock, and it doesn't loop or wait.

> [...] leads me to think that causing
> the IO to complete is probably the safest bet. Triggering IO completion never
> requires acquiring new locks that could participate in a deadlock, so it'd be
> safe.

Yes, that decision looks right to me.  I scanned the callers, and none of them
clearly prefers a different choice.  If we someday find one caller prefers a
false return over blocking on I/O completion, we can always introduce a new
ConditionalLock* variant for that.

> > > - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
> > >   automatically in/decrease active workers, the max number of workers (32) is
> > >   always allocated. That means we use more semaphores than before. I think
> > >   that's ok, it's not 1995 anymore.  Alternatively we can add a
> > >   "io_workers_max" GUC and probe for it in initdb.
> >
> > Let's start as you have it.  If someone wants to make things perfect for
> > non-root BSD users, they can add the GUC later.  io_method=sync is a
> > sufficient backup plan indefinitely.
> 
> Cool.
> 
> I think we'll really need to do something about this for BSD users regardless
> of AIO. Or maybe those OSs should fix something, but somehow I am not having
> high hopes for an OS that claims POSIX-conforming unnamed semaphores on the
> basis of a syscall that always returns EPERM... [1]

I won't mind a project making things better for non-root BSD users.  I do
think such a project should not block other projects making things better for
everything else (like $SUBJECT).

> > > - Check if documentation for track_io_timing needs to be adjusted, after the
> > >   bufmgr.c changes we only track waiting for an IO.
> >
> > Yes.
> 
> The relevant sentences seem to be:
> 
> - "Enables timing of database I/O calls."
> 
>   s/calls/waits/
> 
> - "Time spent in {read,write,writeback,extend,fsync} operations"
> 
>   s/in/waiting for/
> 
>   Even though not all of these will use AIO, the "waiting for" formulation
>   seems just as accurate.
> 
> - "Columns tracking I/O time will only be non-zero when <xref
>   linkend="guc-track-io-timing"/> is enabled."
> 
>   s/time/wait time/

Sounds good.

> > On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote:
> > > Attached is v2.6 of the AIO patchset.
> >
> > > - 0005, 0006 - io_uring support - close, but we need to do something about
> > >   set_max_fds(), which errors out spuriously in some cases
> >
> > What do we know about those cases?  I don't see a set_max_fds(); is that
> > set_max_safe_fds(), or something else?
> 
> Sorry, yes, set_max_safe_fds(). The problem basically is that with io_uring we
> will have a large number of FDs already allocated by the time
> set_max_safe_fds() is called. set_max_safe_fds() subtracts already_open from
> max_files_per_process, which then allows only a few IOs - or even a negative
> number.
> 
> I think we should redefine max_files_per_process to be about the number of
> files each *backend* will additionally open.  Jelte was working on related
> patches, see [2].

Got it.  max_files_per_process is a quaint setting, documented as follows (I
needed the reminder):

        If the kernel is enforcing
        a safe per-process limit, you don't need to worry about this setting.
        But on some platforms (notably, most BSD systems), the kernel will
        allow individual processes to open many more files than the system
        can actually support if many processes all try to open
        that many files. If you find yourself seeing <quote>Too many open
        files</quote> failures, try reducing this setting.

I could live with
v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but would lean
against it since it feels unduly novel to have a setting where we use the
postgresql.conf value to calculate a value that becomes the new SHOW-value of
the same setting.  Options I'd consider before that:

- Like you say, "redefine max_files_per_process to be about the number of
  files each *backend* will additionally open".  It will become normal that
  each backend's actual FD list length is max_files_per_process + MaxBackends
  if io_method=io_uring.  Outcome is not unlike
  v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
  v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't
  mutate max_files_per_process.  Benchmark results should not change beyond
  the inter-major-version noise level unless one sets io_method=io_uring.  I'm
  feeling best about this one, but I've not been thinking about it long.

- When building with io_uring, make the max_files_per_process default
  something like 10000 instead.  Disadvantages: FD usage grows even if you
  don't use io_uring.  Merely rebuilding with io_uring (not enabling it at
  runtime) will change benchmark results.  High MaxBackends still needs to
  override the value.

> > > +static void
> > > +maybe_adjust_io_workers(void)
> >
> > This also restarts workers that exit, so perhaps name it
> > start_io_workers_if_missing().
> 
> But it also stops IO workers if necessary?

Good point.  Maybe just add a comment like "start or stop IO workers to close
the gap between the running count and the configured count intent".

> > > +{
> > ...
> > > +        /* Try to launch one. */
> > > +        child = StartChildProcess(B_IO_WORKER);
> > > +        if (child != NULL)
> > > +        {
> > > +            io_worker_children[id] = child;
> > > +            ++io_worker_count;
> > > +        }
> > > +        else
> > > +            break;                /* XXX try again soon? */
> >
> > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > function, replacing the current mix of callers?  That would be more conducive
> > to promptly doing the right thing after launch failure.
> 
> I'm not sure that'd be a good idea - right now IO workers are started before
> the startup process, as the startup process might need to perform IO. If we
> started them only later, in ServerLoop(), we'd potentially do a fair bit of
> work, including starting checkpointer, bgwriter and bgworkers, before starting
> IO workers.  That shouldn't actively break anything, but it would likely make
> things slower.

I missed that.  How about keeping the two calls associated with PM_STARTUP but
replacing the assign_io_workers() and process_pm_child_exit() calls with one
in LaunchMissingBackgroundProcesses()?  In the event of a launch failure, I
think that would retry the launch quickly, as opposed to maybe-never.

> I rather dislike the code around when we start what. Leaving AIO aside, during
> a normal startup we start checkpointer and bgwriter before the startup
> process. But during a crash restart we don't explicitly start them. Why make
> things uniform when it could also be exciting :)

It's become some artisanal code! :)

> > > +            /*
> > > +             * It's very unlikely, but possible, that reopen fails. E.g. due
> > > +             * to memory allocations failing or file permissions changing or
> > > +             * such.  In that case we need to fail the IO.
> > > +             *
> > > +             * There's not really a good errno we can report here.
> > > +             */
> > > +            error_errno = ENOENT;
> >
> > Agreed there's not a good errno, but let's use a fake errno that we're mighty
> > unlikely to confuse with an actual case of libc returning that errno.  Like
> > one of EBADF or EOWNERDEAD.
> 
> Can we rely on that to be present on all platforms, including windows?

I expect EBADF is universal.  EBADF would be fine.

EOWNERDEAD is from 2006.
https://learn.microsoft.com/en-us/cpp/c-runtime-library/errno-constants?view=msvc-140
says VS2015 had EOWNERDEAD (the page doesn't have links for older Visual
Studio versions, so I consider them unknown).
https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi
lists some OSs not having it, the newest of which looks like NetBSD 9.3
(2022).  We could use it and add a #define for platforms lacking it.
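
E.g. (sketch; 130 is glibc's value, any otherwise-unused number would do):

    #ifndef EOWNERDEAD
    #define EOWNERDEAD 130
    #endif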

> > > +        ret = io_uring_submit(uring_instance);
> > > +        pgstat_report_wait_end();
> > > +
> > > +        if (ret == -EINTR)
> > > +        {
> > > +            pgaio_debug(DEBUG3,
> > > +                        "aio method uring: submit EINTR, nios: %d",
> > > +                        num_staged_ios);
> > > +        }
> > > +        else if (ret < 0)
> > > +            elog(PANIC, "failed: %d/%s",
> > > +                 ret, strerror(-ret));
> >
> > I still think (see 2024-09-16 review) EAGAIN should do the documented
> > recommendation instead of PANIC:
> >
> >   EAGAIN The kernel was unable to allocate memory for the request, or
> >   otherwise ran out of resources to handle it. The application should wait for
> >   some completions and try again.
> 
> I don't think this can be hit in a recoverable way. We'd likely just end up
> with an untested path that quite possibly would be wrong.
> 
> What wait time would be appropriate?  What problems would it cause if we just
> slept while holding critical lwlocks? I think it'd typically just delay the
> crash-restart if we did, making it harder to recover from the problem.

I might use 30s like pgwin32_open_handle(), but 30s wouldn't be principled.

> Because we are careful to limit how many outstanding IO requests there are on
> an io_uring instance, the kernel has to have run *severely* out of memory to
> hit this.
> 
> I suspect it might currently be *impossible* to hit this due to ENOMEM,
> because io_uring will fall back to allocating individual requests if the batch
> allocation it normally does fails. My understanding is that for small
> allocations the kernel will try to reclaim memory forever; only large ones can
> fail.
> 
> Even if it were possible to hit, the likelihood that postgres can continue to
> work ok if the kernel can't allocate ~250 bytes seems very low.
> 
> How about adding a dedicated error message for EAGAIN? IMO io_uring_enter()'s
> meaning of EAGAIN is, uhm, unconventional, so a better error message than
> strerror() might be good?

I'm fine with the present error message.

> Proposed comment:
>             /*
>              * The io_uring_enter() manpage suggests that the appropriate
>              * reaction to EAGAIN is:
>              *
>              * "The application should wait for some completions and try
>              * again"
>              *
>              * However, it seems unlikely that that would help in our case, as
>              * we apply a low limit to the number of outstanding IOs and thus
>              * also outstanding completions, making it unlikely that we'd get
>              * EAGAIN while the OS is in good working order.
>              *
>              * Additionally, it would be problematic to just wait here, our
>              * caller might hold critical locks. It'd possibly lead to
>              * delaying the crash-restart that seems likely to occur when the
>              * kernel is under such heavy memory pressure.
>              */

That works for me.  No retry needed, then.
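
Sketching the agreed-upon shape (the dedicated EAGAIN message wording here is
an assumption, not the actual patch text):

    ret = io_uring_submit(uring_instance);
    pgstat_report_wait_end();

    if (ret == -EINTR)
    {
        pgaio_debug(DEBUG3,
                    "aio method uring: submit EINTR, nios: %d",
                    num_staged_ios);
    }
    else if (ret < 0)
    {
        /*
         * Per the comment above: retrying on EAGAIN is unlikely to help, and
         * waiting here (possibly holding critical locks) would just delay
         * the crash-restart.
         */
        if (ret == -EAGAIN)
            elog(PANIC,
                 "io_uring submit failed: kernel out of resources (EAGAIN)");
        else
            elog(PANIC, "io_uring submit failed: %d/%s",
                 ret, strerror(-ret));
    }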

> > > +            pgstat_report_wait_end();
> > > +
> > > +            if (ret == -EINTR)
> > > +            {
> > > +                continue;
> > > +            }
> > > +            else if (ret != 0)
> > > +            {
> > > +                elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret));
> >
> > I think errno isn't meaningful here, so %m doesn't belong.
> 
> You're right.  I wonder if we should make errno meaningful though (by setting
> it); the elog.c machinery captures it, and I know that there are logging hooks
> that utilize that fact.  That'd also avoid the need to use strerror() here.

That's better still.
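
I.e. something like this (a sketch of the idea, not committed code):

    if (ret != 0)
    {
        errno = -ret;       /* let elog.c capture it, for %m and log hooks */
        elog(PANIC, "unexpected: %d: %m", ret);
    }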

> > > --- a/doc/src/sgml/config.sgml
> > > +++ b/doc/src/sgml/config.sgml
> > > @@ -2687,6 +2687,12 @@ include_dir 'conf.d'
> > >              <literal>worker</literal> (execute asynchronous I/O using worker processes)
> > >             </para>
> > >            </listitem>
> > > +          <listitem>
> > > +           <para>
> > > +            <literal>io_uring</literal> (execute asynchronous I/O using
> > > +            io_uring, if available)
> > > +           </para>
> > > +          </listitem>
> >
> > Docs should eventually cover RLIMIT_MEMLOCK per
> > https://github.com/axboe/liburing "ulimit settings".
> 
> The way we currently use io_uring (i.e. no registered buffers), the
> RLIMIT_MEMLOCK advice only applies to linux <= 5.11.  I'm not sure that's
> worth documenting?

Kernel 5.11 will be 4.5 years old by the time v18 is out.  Yeah, no need for
doc coverage of that.

> > One later-patch item:
> >
> > > +static PgAioResult
> > > +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed)
> > > +{
> > ...
> > > +    TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum,
> > > +                                      tag.blockNum,
> > > +                                      tag.spcOid,
> > > +                                      tag.dbOid,
> > > +                                      tag.relNumber,
> > > +                                      INVALID_PROC_NUMBER,
> > > +                                      false);
> >
> > I wondered about whether the buffer-read-done probe should happen in the
> > process that calls the complete_shared callback or in the process that did the
> > buffer-read-start probe.
> 
> Yea, that's a good point. I should at least have added a comment pointing out
> that it's a choice with pros and cons.
> 
> The reason I went for doing it in the completion callback is that it seemed
> better to get the READ_DONE event as soon as possible, even if the issuer of
> the IO is currently busy doing other things. The shared completion callback is
> after all where the buffer state is updated for shared buffers.
> 
> But I think you have a point too.
> 
> 
> > When I see dtrace examples, they usually involve explicitly naming each PID
> > to trace
> 
TBH, I've only ever used our tracepoints via perf and bpftrace, not dtrace
itself. With those it's easy to trace more than just a single pid and to
monitor system-wide. I don't really know enough about using dtrace itself.

Perhaps just a comment, then.

> > Assuming that's indeed the norm, I think the local callback would
> > be the better place, so a given trace contains both probes.
> 
> Seems like a shame to add an extra indirect function call.

Yep.

> This was an awesome review, thanks!

Glad it helped.

> [1] https://man.openbsd.org/sem_init.3#STANDARDS
> [2] https://postgr.es/m/D80MHNSG4EET.6MSV5G9P130F%40jeltef.nl



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-11 19:55:35 -0400, Andres Freund wrote:
> On 2025-03-11 12:41:08 -0700, Noah Misch wrote:
> > On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> > > On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > > > For non-sync IO methods, I gather it's essential that a process other than the
> > > > IO definer be scanning for incomplete IOs and completing them.
> >
> > > > Otherwise, deadlocks like this would happen:
> > >
> > > > backend1 locks blk1 for non-IO reasons
> > > > backend2 locks blk2, starts AIO write
> > > > backend1 waits for lock on blk2 for non-IO reasons
> > > > backend2 waits for lock on blk1 for non-IO reasons
> > > >
> > > > If that's right, in worker mode, the IO worker resolves that deadlock.  What
> > > > resolves it under io_uring?  Another process that happens to do
> > > > pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
> > > > make that happen systematically.
> > >
> > > Yea, it's code that I haven't forward ported yet. I think basically
> > > LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
> > > immediately acquire the lock and if the buffer has IO going on.
> >
> > I'm not finding that code in v2.6.  What function has it?
> 
> My local version now has it... Sorry, I was focusing on the earlier patches
> until now.

Looking more at my draft, I don't think it was race-free.  I had a race-free
way of doing it in the v1 patch (by making lwlocks extensible, so the check
for IO could happen between enqueueing on the lwlock wait queue and sleeping
on the semaphore), but that obviously requires that infrastructure.
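
For context, the v1 shape was roughly the following (a from-memory sketch;
LWLockQueueSelf()/LWLockDequeueSelf() are today lwlock.c-internal, and the
exact calls are assumptions):

    /*
     * Enqueue on the lwlock's wait queue *before* checking for in-flight IO,
     * so an IO completion can't slip in between the check and the semaphore
     * sleep.
     */
    LWLockQueueSelf(lock, mode);
    if (pgaio_wref_valid(&buf_hdr->io_wref))
    {
        LWLockDequeueSelf(lock);    /* don't sleep; complete the IO instead */
        pgaio_wref_wait(&buf_hdr->io_wref);
    }
    else
        PGSemaphoreLock(MyProc->sem);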

I want to focus on reads for now, so I'll add FIXMEs to the relevant places in
the AIO write support and focus on the rest of the patch.

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-11 20:57:43 -0700, Noah Misch wrote:
> > I think we'll really need to do something about this for BSD users regardless
> > of AIO. Or maybe those OSs should fix something, but somehow I am not having
> > high hopes for an OS that claims to have POSIX-conforming unnamed semaphores
> > due to having a syscall that always returns EPERM... [1].
> 
> I won't mind a project making things better for non-root BSD users.  I do
> think such a project should not block other projects making things better for
> everything else (like $SUBJECT).

Oh, I strongly agree.  The main reason I would like it to be addressed is that
I'm pretty tired of having to think about open/netbsd whenever we update some
default setting.


> > > On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote:
> > > > Attached is v2.6 of the AIO patchset.
> > >
> > > > - 0005, 0006 - io_uring support - close, but we need to do something about
> > > >   set_max_fds(), which errors out spuriously in some cases
> > >
> > > What do we know about those cases?  I don't see a set_max_fds(); is that
> > > set_max_safe_fds(), or something else?
> > 
> > Sorry, yes, set_max_safe_fds(). The problem basically is that with io_uring we
> > will have a large number of FDs already allocated by the time
> > set_max_safe_fds() is called. set_max_safe_fds() subtracts already_open from
> > max_files_per_process, leaving few (or even a negative number of) usable FDs.
> > 
> > I think we should redefine max_files_per_process to be about the number of
> > files each *backend* will additionally open.  Jelte was working on related
> > patches, see [2]
> 
> Got it.  max_files_per_process is a quaint setting, documented as follows (I
> needed the reminder):
> 
>         If the kernel is enforcing
>         a safe per-process limit, you don't need to worry about this setting.
>         But on some platforms (notably, most BSD systems), the kernel will
>         allow individual processes to open many more files than the system
>         can actually support if many processes all try to open
>         that many files. If you find yourself seeing <quote>Too many open
>         files</quote> failures, try reducing this setting.
> 
> I could live with
> v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but would lean
> against it since it feels unduly novel to have a setting where we use the
> postgresql.conf value to calculate a value that becomes the new SHOW-value of
> the same setting.

I think we may update some other GUCs, but not sure.


> Options I'd consider before that:

> - Like you say, "redefine max_files_per_process to be about the number of
>   files each *backend* will additionally open".  It will become normal that
>   each backend's actual FD list length is max_files_per_process + MaxBackends
>   if io_method=io_uring.  Outcome is not unlike
>   v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
>   v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't
>   mutate max_files_per_process.  Benchmark results should not change beyond
>   the inter-major-version noise level unless one sets io_method=io_uring.  I'm
>   feeling best about this one, but I've not been thinking about it long.

Yea, I think that's something probably worth doing separately from Jelte's
patch.  I do think that it'd be rather helpful to have jelte's patch to
increase NOFILE in addition though.
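
To make the arithmetic concrete (all numbers invented):

    /* toy numbers, io_method=io_uring: one ring fd per possible backend */
    int     MaxBackends = 1000;
    int     max_files_per_process = 1000;       /* the GUC */
    int     already_open = MaxBackends + 20;    /* ring fds + postmaster fds */

    /* today's set_max_safe_fds() arithmetic goes negative: */
    int     max_safe_fds = max_files_per_process - already_open;   /* -20 */

    /* redefined meaning: files each backend may open *in addition*, so the
     * actual per-backend fd list length becomes: */
    int     fds_per_backend = max_files_per_process + MaxBackends; /* 2000 */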


> > > > +static void
> > > > +maybe_adjust_io_workers(void)
> > >
> > > This also restarts workers that exit, so perhaps name it
> > > start_io_workers_if_missing().
> > 
> > But it also stops IO workers if necessary?
> 
> Good point.  Maybe just add a comment like "start or stop IO workers to close
> the gap between the running count and the configured count intent".

It's now
/*
 * Start or stop IO workers, to close the gap between the number of running
 * workers and the number of configured workers.  Used to respond to change of
 * the io_workers GUC (by increasing and decreasing the number of workers), as
 * well as workers terminating in response to errors (by starting
 * "replacement" workers).
 */
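
Shape-wise that makes the function roughly the following (a simplified
sketch; type and helper names are approximate, and worker slot assignment is
elided):

    static void
    maybe_adjust_io_workers(void)
    {
        /* start additional/"replacement" workers up to the configured count */
        while (io_worker_count < io_workers)
        {
            PMChild    *child = StartChildProcess(B_IO_WORKER);

            if (child == NULL)
                break;          /* launch failed; retried on a later call */
            io_worker_children[io_worker_count++] = child;
        }

        /* ... and tell excess workers to exit if io_workers was lowered */
        while (io_worker_count > io_workers)
            signal_child(io_worker_children[--io_worker_count], SIGUSR2);
    }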


> > > > +{
> > > ...
> > > > +        /* Try to launch one. */
> > > > +        child = StartChildProcess(B_IO_WORKER);
> > > > +        if (child != NULL)
> > > > +        {
> > > > +            io_worker_children[id] = child;
> > > > +            ++io_worker_count;
> > > > +        }
> > > > +        else
> > > > +            break;                /* XXX try again soon? */
> > >
> > > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > > function, replacing the current mix of callers?  That would be more conducive
> > > to promptly doing the right thing after launch failure.
> > 
> > I'm not sure that'd be a good idea - right now IO workers are started before
> > the startup process, as the startup process might need to perform IO. If we
> > started it only later in ServerLoop() we'd potentially do a fair bit of work,
> > including starting checkpointer, bgwriter, bgworkers before we started IO
> > workers.  That shouldn't actively break anything, but it would likely make
> > things slower.
> 
> I missed that.  How about keeping the two calls associated with PM_STARTUP but
> replacing the assign_io_workers() and process_pm_child_exit() calls with one
> in LaunchMissingBackgroundProcesses()?

I think replacing the call in assign_io_workers() is a good idea, that way we
don't need assign_io_workers().

Less convinced it's a good idea to do the same for process_pm_child_exit() -
if IO workers errored out we'll launch backends etc before we get to
LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but
seems a bit odd.

I think LaunchMissingBackgroundProcesses() should be split into one that
starts aux processes and one that starts bgworkers. The one maintaining aux
processes should be called before we start backends, the latter not.


> In the event of a launch failure, I think that would retry the launch
> quickly, as opposed to maybe-never.

That's a fair point.


> > > > +            /*
> > > > +             * It's very unlikely, but possible, that reopen fails. E.g. due
> > > > +             * to memory allocations failing or file permissions changing or
> > > > +             * such.  In that case we need to fail the IO.
> > > > +             *
> > > > +             * There's not really a good errno we can report here.
> > > > +             */
> > > > +            error_errno = ENOENT;
> > >
> > > Agreed there's not a good errno, but let's use a fake errno that we're mighty
> > > unlikely to confuse with an actual case of libc returning that errno.  Like
> > > one of EBADF or EOWNERDEAD.
> > 
> > Can we rely on that to be present on all platforms, including windows?
> 
> I expect EBADF is universal.  EBADF would be fine.

Hm, EBADF is actually an error that could happen for other reasons, and IMO
would be more confusing than ENOENT: the latter at least describes the issue
to a reasonable extent.

I'm not sure it's worth investing time in this - it really shouldn't happen,
and we probably have bigger problems than the error code if it does. But if we
do want to do something, I think I can see a way to report a dedicated error
message for this.


> EOWNERDEAD is from 2006.
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/errno-constants?view=msvc-140
> says VS2015 had EOWNERDEAD (the page doesn't have links for older Visual
> Studio versions, so I consider them unknown).

Oh, that's a larger list than I'd have thought.


> https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi
> lists some OSs not having it, the newest of which looks like NetBSD 9.3
> (2022).  We could use it and add a #define for platforms lacking it.

What would we define it as?  I guess we could just pick a high value, but...


Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

Attached is v2.7, with the following changes:

- Significantly deduplicated AIO related code bufmgr.c

  Previously the code for temp and shared buffers was duplicated to an
  uncomfortable degree. Now there is a common helper that implements the
  behaviour for both cases.

  The BM_PIN_COUNT_WAITER supporting code was also deduplicated, by
  introducing a helper function.
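
The helper pattern looks roughly like this (a sketch; the helper's real
signature shows up in the review further down, the wrapper names are
assumptions):

    /* always-inlined, so each wrapper specializes on the constant flags */
    static pg_attribute_always_inline void
    buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)
    {
        /* common pin/state/lock handover logic for all combinations */
    }

    static void
    shared_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
    {
        buffer_stage_common(ioh, /* is_write */ false, /* is_temp */ false);
    }

    static void
    local_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
    {
        buffer_stage_common(ioh, /* is_write */ false, /* is_temp */ true);
    }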


- Fixed typos / improved phrasing, per Noah's review


- Add comment explaining why retries for EAGAIN for io_uring_enter syscall
  failures don't seem to make sense, improve related error messages slightly


- Added a comment to aio.h explaining that aio_types.h might suffice for
  function declarations and aio_init.h for initialization related code.


- Added and expanded comments for PgAioHandleState, explaining the state
  machine in more detail.


- Updated README to mention the stage callback (instead of the outdated
  "prepare"), plus some other minor cleanups.


- Added a commit rephrasing track_io_timing related docs to talk about waits


- Added FIXME to method_uring.c about the set_max_safe_fds() issue. Depending
  on when/how that is resolved, the relevant commits can be reordered relative
  to the rest.


- Improved localbuf: patches and commit messages, as per Melanie's review


- Added FIXMEs to the bufmgr.c write support (only in later commit, unlikely
  to be realistic for 18) denoting that deadlock risk needs to be
  addressed. We probably need some lwlock.c improvements to make that
  race-free, otherwise I'd just have fixed this.


- Added a comment discussing the placement of the
  TRACE_POSTGRESQL_BUFFER_READ_DONE callback


- removed a few debug ereports() from the StartReadBuffers patch


Unresolved:

- Whether to continue starting new workers in process_pm_child_exit()


- What to name the view (currently pg_aios). I'm inclined to go for
  pg_io_handles right now.


- set_max_safe_fds() related issues for the io_uring backend


Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Antonin Houska
Date:
Andres Freund <andres@anarazel.de> wrote:

> Attached is v2.7, with the following changes:

Attached are a few proposals for minor comment fixes.

Besides that, it occurred to me when I was trying to get familiar with the
patch set (respectable work, btw) that an additional Assert() statement could
make sense:

diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index a9c351eb0dc..325688f0f23 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -413,6 +413,7 @@ pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
        bool            needs_synchronous;
 
        Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+       Assert(pgaio_my_backend->handed_out_io == ioh);
        Assert(pgaio_io_has_target(ioh));
 
        ioh->op = op;

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Attachment

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-13 11:53:03 +0100, Antonin Houska wrote:
> Attached are a few proposals for minor comment fixes.

Thanks, applied.


> Besides that, it occurred to me when I was trying to get familiar with the
> patch set (respectable work, btw) that an additional Assert() statement could
> make sense:

Yea, it does. I added it to another place as well.


Attached is v2.8 with the following changes:

- I wasn't happy with the way StartReadBuffers(), WaitReadBuffers() and
  AsyncReadBuffers() interacted. The io_method=sync path in particular was too
  cute by half, calling WaitReadBuffers() from within WaitReadBuffers().

  I think the new state is considerably better.

  Plenty of other smaller changes as part of that. One worth calling out is that
  ReadBuffersCanStartIO() now submits staged IO before actually blocking. Not
  the prettiest code, but I think it's ok.


- Added a function to assert the sanity of a ReadBuffersOperation

  While doing the refactoring for the prior point, I temporarily had a bug
  that returned buffers for which IO wasn't actually performed. Surprisingly
  the only assertion that triggered was when that buffer was read again by
  another operation, because it had been marked dirty, despite never being
  valid.

  Now there's a function that can be used to check that the buffers referenced
  by a ReadBuffersOperation are in a sane state.

  I guess this could be committed independently, but it'd not be entirely
  trivial to extract, so I'm currently leaning against doing that.


- Previously VacuumCostActive accounting happened after IO completion. But
  that doesn't seem quite right, as it'd allow submitting a lot of IO at
  once. It's now moved to AsyncReadBuffers()


- With io_method=sync or with worker and temp tables, smgrstartreadv() would
  actually execute the IO. But the time accounting was done entirely around
  pgaio_wref_wait(). Now it's done in both places.


- Rebased onto newer version of Thomas' read_stream.c changes

  With that newer version the integration with read stream for actually doing
  AIO is a bit simpler.  There's one FIXME in the patch, because I don't
  really understand what a comment is referring to.

  I also split out a more experimental patch to make more efficient use of
  batching in read stream, the heuristics are more complicated, and it works
  well enough without.


- I added a commit to clean up the buffer access accounting for the case that
  a buffer was read in concurrently. That IMO is somewhat bogus on master, and
  it seemed to get more bogus with AIO.


- Integrated Antonin Houska's fixes and Assert suggestion


- Added a patch to address the smgr.c/md.c interrupt issue (a problem in master), see
  https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl



I think the reasonable next steps are:

- Commit "localbuf: *" commits


- Commit temp table tests, likely after lowering the minimum temp_buffers setting


- Pursue a fix of the smgr interrupt issue on the referenced thread

  This can happen in parallel with AIO patches up to
  "aio: Implement support for reads in smgr/md/fd"


- Commit the core AIO infrastructure patch after doing a few more passes


- Commit IO worker support


- In parallel: Find a way to deal with the set_max_safe_fds() issue that we've
  been discussing on this thread recently. As that only affects io_uring, it
  doesn't have to block other patches going in.


- Do a round of review of the read_stream changes Thomas recently posted (and
  that are also included here)


- Try to get some more review for the bufmgr.c related changes. I've whacked
  them around a fair bit lately.


- Try to get Thomas to review my read_stream.c changes



Open items:

- The upstream BAS_BULKREAD is so small that throughput is substantially worse
  once a table reaches 1/4 shared_buffers. That patch in the patchset as-is is
  probably not good enough, although I am not sure about that


- The set_max_safe_fds() issue for io_uring


- Right now effective_io_concurrency cannot be set > 0 on Windows and other
  platforms that lack posix_fadvise. But with AIO we can read ahead without
  posix_fadvise().

  It'd not really make anything worse than today to not remove the limit, but
  it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
  to look around and see whether it would require anything other than doc
  changes.


Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-14 15:43:15 -0400, Andres Freund wrote:
> Open items:
>
> - The upstream BAS_BULKREAD is so small that throughput is substantially worse
>   once a table reaches 1/4 shared_buffers. That patch in the patchset as-is is
>   probably not good enough, although I am not sure about that
>
>
> - The set_max_safe_fds() issue for io_uring
>
>
> - Right now effective_io_concurrency cannot be set > 0 on Windows and other
>   platforms that lack posix_fadvise. But with AIO we can read ahead without
>   posix_fadvise().
>
>   It'd not really make anything worse than today to not remove the limit, but
>   it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
>   to look around and see whether it would require anything other than doc
>   changes.

A fourth, smaller, question:

- Should the docs for debug_io_direct be rephrased and if so, how?

  Without read-stream-AIO debug_io_direct=data has completely unusable
  performance if there's ever any data IO - and if there's no IO there's no
  point in using the option.

  Now there is a certain set of workloads where performance with
  debug_io_direct=data can be better than master, sometimes substantially
  so. But at the same time, without support for at least:

  - AIO writes for at least checkpointer, bgwriter

    doing one synchronous IO for each buffer is ... slow.


  - read-streamified index vacuuming


  And probably also:
  - AIO-ified writes for writes executed by backends, e.g. due to strategies

    Doing one synchronous IO for each buffer is ... slow. And e.g. with COPY
    we do a *lot* of those. OTOH, it could be fine if most modifications are
    done via INSERTs instead of COPY.


  - prefetching for non-BHS index accesses

    Without prefetching, a well correlated index-range scan will be orders of
    magnitude slower with DIO.


  - Anything bypassing shared_buffers, like RelationCopyStorage() or
    bulk_write.c, will be extremely slow

    The only saving grace is that these aren't all *that* common.


Due to those constraints I think it's pretty clear we can't remove the debug_
prefix at this time.

Perhaps it's worth going from

       <para>
        Currently this feature reduces performance, and is intended for
        developer testing only.
       </para>
to
       <para>
        Currently this feature reduces performance in many workloads, and is
        intended for testing only.
       </para>

I.e. qualify the downside with "many workloads" and widen the audience ever so
slightly?

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

Attached is v2.9 with the following, fairly small, changes:
- Rebased ontop of the additional committed read stream patches

- Committed the localbuf: refactorings (but not the change to expand
  refcounting of local buffers, that seems a bit more dependent on the rest)

- Committed the temp table test after some annoying debugging
  https://postgr.es/m/w5wr26ijzp7xz2qrxkt6dzvmmn2tn6tn5fp64y6gq5iuvg43hw%40v4guo6x776dq

- Some rephrasing and moving of comments in the first two commits

- There was a small bug in the smgr reopen call I found when reviewing: the
  PgAioOpData->read.fd field was referenced for both reads and writes, which
  failed to fail because reads and writes use a compatible struct layout.
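
I.e. the fix amounts to something like this (a sketch; enum and field names
follow the quoted PgAioOpData, the surrounding variables are assumptions):

    /* store the reopened fd via the union member matching the operation */
    switch (ioh->op)
    {
        case PGAIO_OP_READV:
            ioh->op_data.read.fd = newfd;
            break;
        case PGAIO_OP_WRITEV:
            ioh->op_data.write.fd = newfd;
            break;
        default:
            elog(ERROR, "operation has no file descriptor");
    }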


Unless I hear otherwise, I plan to commit the first two patches fairly soon,
followed by the worker support patches a few buildfarm cycles later.

I'm sure there's a bunch of stuff worth improving in the AIO infrastructure
and I can't imagine a project of this size not having bugs. But I think it's
in a state where that's better worked out in tree, with broader exposure and
testing.

Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Melanie Plageman
Date:
On Fri, Mar 14, 2025 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:
>
> Open items:
>
> - Right now effective_io_concurrency cannot be set > 0 on Windows and other
>   platforms that lack posix_fadvise. But with AIO we can read ahead without
>   posix_fadvise().
>
>   It'd not really make anything worse than today to not remove the limit, but
>   it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
>   to look around and see whether it would require anything other than doc
>   changes.

I've attached a patch that removes the limit for
effective_io_concurrency and maintenance_io_concurrency. I tested both
GUCs with fadvise manually disabled on my system and I think it is
working for those read stream users I tried (vacuum and BHS).

I checked around to make sure no one was using only the value of the
guc to guard prefetches, and it seems like we're safe.

The one thing I am wondering about with the docs is whether or not we
need to make it more clear that only a subset of the "simultaneous
I/O" behavior controlled by eic/mic is available if your system
doesn't have fadvise. I tried to do that a bit, but I avoided getting
into too many details.

- Melanie

Attachment

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

Attached is v2.10, with the following changes:

- committed core AIO infrastructure patch

  A few cosmetic changes


- committed io_method=worker

  Two non-cosmetic changes:

  - The shmem allocation functions over-estimated the amount of shared memory
    required.

  - pgaio_worker_shmem_init() should initialize up to MAX_IO_WORKERS, not just
    io_workers, since the latter is intentionally PGC_SIGHUP (found by Thomas)


- Bunch of typo fixes found by searching for repeated words

  Thomas found one and then I searched for more.


- Added Melanie's patch to allow effective_io_concurrency to be set on windows etc


- Fixed an outdated function reference (thanks to Bilal)


- Reordered patches to be a bit more in dependency order

  E.g. "bufmgr: Implement AIO read support" doesn't depend on Thomas' "buffer
  forwarding" patches and thus can be committed before those go in.


Next steps:

- Decide what to do about the smgr interrupt issue

  I guess it could be deferred, based on the argument it only matters with a
  sufficiently high debug level, but I don't feel comfortable with that.

  I think it'd be reasonable to just go with the patch I sent on the other
  thread (and included here).


- Get somebody to look at
  "bufmgr: Improve stats when buffer was read in concurrently"

  This arguably fixes a bug, or just weird behaviour, on master.


- Address the set_max_safe_fds() issue and once resolved, commit io_uring
  method

  Can happen concurrently with next steps


- Commit "aio: Implement support for reads in smgr/md/fd"


- Get somebody to do one more pass at bufmgr related commits, I think they
  could use a less in-the-weeds eye.

  That's the following commits:
  - localbuf: Track pincount in BufferDesc as well
  - bufmgr: Implement AIO read support
  - bufmgr: Use AIO in StartReadBuffers()
  - aio: Basic read_stream adjustments for real AIO


Questions / Unresolved:

- Write support isn't going to land in 18, but there is a tiny bit of code
  regarding writes in the code for bufmgr IO. I guess I could move that to a
  later commit?

  I'm inclined to leave it; the structure of the code only really makes sense
  knowing that it's going to be shared between reads & writes.


- pg_aios view name


Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Melanie Plageman
Date:
On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
> Attached is v2.10,

This is a review of 0008:     bufmgr: Implement AIO read support

I'm afraid it is more of a cosmetic review than a sign-off on the
patch's correctness, but perhaps it will help future readers who may
have the same questions I did.

In the commit message:
    bufmgr: Implement AIO read support

    This implements the following:
    - Addition of callbacks to maintain buffer state on completion of a readv
    - Addition of a wait reference to BufferDesc, to allow backends to wait for
      IOs
    - StartBufferIO(), WaitIO(), TerminateBufferIO() support for waiting AIO

I think it might be nice to say something about allowing backends to
complete IOs issued by other backends.

@@ -40,6 +41,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
    CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),

    CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
+
+   CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
+
+   CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
 #undef CALLBACK_ENTRY
 };

I personally can't quite figure out why the read and write callbacks
are defined differently than the stage, complete, and report
callbacks. I know there is a comment above PgAioHandleCallbackID about
something about this, but it didn't really clarify it for me. Maybe
you can put a block comment at the top of aio_callback.c? Or perhaps I
just need to study it more...

@@ -5482,10 +5503,19 @@ WaitIO(BufferDesc *buf)
+       if (pgaio_wref_valid(&iow))
+       {
+           pgaio_wref_wait(&iow);
+           ConditionVariablePrepareToSleep(cv);
+           continue;
+       }

I'd add some comment above this.  I reread it many times, and I still
only _think_ I understand what it does. I think the reason we need
ConditionVariablePrepareToSleep() again is because pgaio_io_wait() may
have called ConditionVariableCancelSleep() so we need to
ConditionVariablePrepareToSleep() again (it was done already at the
top of WaitIO())?

I'll admit I find the CV API quite confusing, so maybe I'm just
misunderstanding it.

Maybe worth mentioning in the commit message why WaitIO() has to
work differently for AIO than sync IO.

    /*
     * Support LockBufferForCleanup()
     *
     * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
     * Most of the time the current backend will hold another pin preventing
     * that from happening, but that's e.g. not the case when completing an IO
     * another backend started.
     */

I found this wording a bit confusing. what about this:

We may have just released the last pin other than the waiter's. In most cases,
this backend holds another pin on the buffer. But, if, for example, this
backend is completing an IO issued by another backend, it may be time to wake
the waiter.

/*
 * Helper for AIO staging callback for both reads and writes as well as temp
 * and shared buffers.
 */
static pg_attribute_always_inline void
buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)

I think buffer_stage_common() needs the function comment to explain
what unit it is operating on.
It is called "buffer_", singular, but then it loops through io_data,
which appears to contain multiple buffers.

        /*
         * Check that all the buffers are actually ones that could conceivably
         * be done in one IO, i.e. are sequential.
         */
        if (i == 0)
            first = buf_hdr->tag;
        else
        {
            Assert(buf_hdr->tag.relNumber == first.relNumber);
            Assert(buf_hdr->tag.blockNum == first.blockNum + i);
        }

So it is interesting to me that this validation is done at this level.
Enforcing sequentialness doesn't seem like it would be intrinsically
related to or required to stage IOs. And there isn't really anything
in this function that seems like it would require it either. Usually
an assert is pretty close to the thing it is protecting.

Oh and I think the end of the loop in buffer_stage_common() would look
nicer with a small refactor, with the resulting code looking like this:

        /* temp buffers don't use BM_IO_IN_PROGRESS */
        Assert(!is_temp || (buf_state & BM_IO_IN_PROGRESS));

        /* we better have ensured the buffer is present until now */
        Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);

        /*
         * Reflect that the buffer is now owned by the subsystem.
         *
         * For local buffers: This can't be done just in LocalRefCount as one
         * might initially think, as this backend could error out while AIO is
         * still in progress, releasing all the pins by the backend itself.
         */
        buf_state += BUF_REFCOUNT_ONE;
        buf_hdr->io_wref = io_ref;

        if (is_temp)
        {
            pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
            continue;
        }

        UnlockBufHdr(buf_hdr, buf_state);

        if (is_write)
        {
            LWLock       *content_lock;

            content_lock = BufferDescriptorGetContentLock(buf_hdr);

            Assert(LWLockHeldByMe(content_lock));

            /*
             * Lock is now owned by AIO subsystem.
             */
            LWLockDisown(content_lock);
        }

        /*
         * Stop tracking this buffer via the resowner - the AIO system now
         * keeps track.
         */
        ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer);
    }

In buffer_readv_complete(), this comment

    /*
     * Iterate over all the buffers affected by this IO and call appropriate
     * per-buffer completion function for each buffer.
     */

makes it sound like we might invoke different completion functions (like invoke
the completion callback), but that isn't what happens here.

        failed =
            prior_result.status == ARS_ERROR
            || prior_result.result <= buf_off;

Though not introduced in this commit, I will say that I find the ARS prefix not
particularly helpful. Though not as brief, something like AIO_RS_ERROR would
probably be more clear.

@@ -515,10 +517,25 @@ MarkLocalBufferDirty(Buffer buffer)
  * Like StartBufferIO, but for local buffers
  */
 bool
-StartLocalBufferIO(BufferDesc *bufHdr, bool forInput)
+StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
 {
-   uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);
+   uint32      buf_state;
+
+   /*
+    * The buffer could have IO in progress, e.g. when there are two scans of
+    * the same relation. Either wait for the other IO or return false.
+    */
+   if (pgaio_wref_valid(&bufHdr->io_wref))
+   {
+       PgAioWaitRef iow = bufHdr->io_wref;
+
+       if (nowait)
+           return false;
+
+       pgaio_wref_wait(&iow);
+   }

+   buf_state = pg_atomic_read_u32(&bufHdr->state);
    if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
    {
        /* someone else already did the I/O */

I'd move this comment ("someone else already did") outside of the if
statement so it kind of separates it into three clear cases:
1) the IO is in progress and you can wait on it if you want, 2) the IO
is completed by someone else (is this possible for local buffers
without AIO?) 3) you can start the IO
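
Laid out that way, the function might read like this (a sketch based on the
quoted diff; case 3's body is elided):

    bool
    StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
    {
        uint32      buf_state;

        /* 1) IO in progress: wait for it, or bail out if caller can't wait */
        if (pgaio_wref_valid(&bufHdr->io_wref))
        {
            PgAioWaitRef iow = bufHdr->io_wref;

            if (nowait)
                return false;
            pgaio_wref_wait(&iow);
        }

        /* 2) someone else already did the I/O */
        buf_state = pg_atomic_read_u32(&bufHdr->state);
        if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
            return false;

        /* 3) nobody else did or is doing it: caller gets to start the IO */
        /* ... mark the buffer, as the real function does ... */
        return true;
    }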

- Melanie



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-18 21:00:17 -0400, Melanie Plageman wrote:
> On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
> > Attached is v2.10,
> 
> This is a review of 0008:     bufmgr: Implement AIO read support
> 
> I'm afraid it is more of a cosmetic review than a sign-off on the
> patch's correctness, but perhaps it will help future readers who may
> have the same questions I did.

I think that's actually an important level of review. I'm, as odd as that
sounds, more confident about the architectural stuff than about
"understandability" etc.


> In the commit message:
>     bufmgr: Implement AIO read support
> 
>     This implements the following:
>     - Addition of callbacks to maintain buffer state on completion of a readv
>     - Addition of a wait reference to BufferDesc, to allow backends to wait for
>       IOs
>     - StartBufferIO(), WaitIO(), TerminateBufferIO() support for waiting AIO
> 
> I think it might be nice to say something about allowing backends to
> complete IOs issued by other backends.

Hm, I'd have said that's basically implied by the way AIO works (as outlined
in the added README.md), but I can think of a way to mention it here.


> @@ -40,6 +41,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
>     CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),
> 
>     CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
> +
> +   CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
> +
> +   CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
>  #undef CALLBACK_ENTRY
>  };
> 
> I personally can't quite figure out why the read and write callbacks
> are defined differently than the stage, complete, and report
> callbacks. I know there is a comment above PgAioHandleCallbackID about
> something about this, but it didn't really clarify it for me. Maybe
> you can put a block comment at the top of aio_callback.c? Or perhaps I
> just need to study it more...

They're not implemented differently - each PgAioHandleCallbacks entry (which
is what aio_handle_cbs contains, just with a name added) has stage,
complete and report callbacks.

E.g. for SHARED_BUFFER_READV you have a stage (to transfer the buffer pins to
the AIO subsystem), a shared completion (to verify the page, to remove
BM_IO_IN_PROGRESS and set BM_VALID/BM_IO_ERROR, as appropriate) and a report
callback (to report a page validation error).

Maybe more of the relevant types and functions should have been plural, but
then it becomes very awkward to talk about the separate registrations of
multiple callbacks (i.e. the set of callbacks for md.c and the set of
callbacks for bufmgr.c).


> @@ -5482,10 +5503,19 @@ WaitIO(BufferDesc *buf)
> +       if (pgaio_wref_valid(&iow))
> +       {
> +           pgaio_wref_wait(&iow);
> +           ConditionVariablePrepareToSleep(cv);
> +           continue;
> +       }
> 
> I'd add some comment above this.  I reread it many times, and I still
> only _think_ I understand what it does. I think the reason we need
> ConditionVariablePrepareToSleep() again is because pgaio_io_wait() may
> have called ConditionVariableCancelSleep() so we need to
> ConditionVariablePrepareToSleep() again (it was done already at the
> top of Wait())?

Oh, yes, that definitely needs a comment. I've been marinating in this for so
long that it seems obvious, but if I take a step back, it's not at all
obvious.

The issue is that pgaio_wref_wait() internally waits on a *different*
condition variable than the BufferDesc's CV.  The consequences of not doing
this would be fairly mild - the next ConditionVariableSleep would prepare to
sleep and return immediately - but it's unnecessary.
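
With a comment along those lines, the spot might read roughly like this
(suggested wording only, not the patch):

    if (pgaio_wref_valid(&iow))
    {
        /*
         * The IO is still tracked by the AIO subsystem, so wait via its wait
         * reference.  pgaio_wref_wait() sleeps on a different condition
         * variable than the BufferDesc's and cancels our prepared sleep,
         * hence the re-prepare before looping.  Skipping it would only cost
         * an extra sleep/wakeup round trip, but it's avoidable.
         */
        pgaio_wref_wait(&iow);
        ConditionVariablePrepareToSleep(cv);
        continue;
    }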


> Maybe worth mentioning in the commit message why WaitIO() has to
> work differently for AIO than sync IO.

K.


>     /*
>      * Support LockBufferForCleanup()
>      *
>      * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
>      * Most of the time the current backend will hold another pin preventing
>      * that from happening, but that's e.g. not the case when completing an IO
>      * another backend started.
>      */
> 
> I found this wording a bit confusing. what about this:
> 
> We may have just released the last pin other than the waiter's. In most cases,
> this backend holds another pin on the buffer. But, if, for example, this
> backend is completing an IO issued by another backend, it may be time to wake
> the waiter.

WFM.


> /*
>  * Helper for AIO staging callback for both reads and writes as well as temp
>  * and shared buffers.
>  */
> static pg_attribute_always_inline void
> buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)
> 
> I think buffer_stage_common() needs the function comment to explain
> what unit it is operating on.
> It is called "buffer_" singular but then it loops through io_data
> which appears to contain multiple buffers

Hm. Yea. Originally it was just for readv and was duplicated for writes. The
vectorized bit hinted at being for multiple buffers.


>         /*
>          * Check that all the buffers are actually ones that could conceivably
>          * be done in one IO, i.e. are sequential.
>          */
>         if (i == 0)
>             first = buf_hdr->tag;
>         else
>         {
>             Assert(buf_hdr->tag.relNumber == first.relNumber);
>             Assert(buf_hdr->tag.blockNum == first.blockNum + i);
>         }
> 
> So it is interesting to me that this validation is done at this level.
> Enforcing sequentialness doesn't seem like it would be intrinsically
> related to or required to stage IOs. And there isn't really anything
> in this function that seems like it would require it either. Usually
> an assert is pretty close to the thing it is protecting.

Staging is the last buffer-aware thing that happens before IO is actually
executed. If you were to do a readv() into *non* buffers that aren't for
sequential blocks, you would get bogus buffer pool contents, because obviously
it doesn't make sense to read data for block N+1 into the buffer for block N+3
or whatnot.

The assertions did find bugs during development, fwiw.


> Oh and I think the end of the loop in buffer_stage_common() would look
> nicer with a small refactor, with the resulting code looking like this:
> 
>         /* temp buffers don't use BM_IO_IN_PROGRESS */
>         Assert(!is_temp || (buf_state & BM_IO_IN_PROGRESS));
> 
>         /* we better have ensured the buffer is present until now */
>         Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
> 
>         /*
>          * Reflect that the buffer is now owned by the subsystem.
>          *
>          * For local buffers: This can't be done just in LocalRefCount as one
>          * might initially think, as this backend could error out while AIO is
>          * still in progress, releasing all the pins by the backend itself.
>          */
>         buf_state += BUF_REFCOUNT_ONE;
>         buf_hdr->io_wref = io_ref;
> 
>         if (is_temp)
>         {
>             pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
>             continue;
>         }
> 
>         UnlockBufHdr(buf_hdr, buf_state);
> 
>         if (is_write)
>         {
>             LWLock       *content_lock;
> 
>             content_lock = BufferDescriptorGetContentLock(buf_hdr);
> 
>             Assert(LWLockHeldByMe(content_lock));
> 
>             /*
>              * Lock is now owned by AIO subsystem.
>              */
>             LWLockDisown(content_lock);
>         }
> 
>         /*
>          * Stop tracking this buffer via the resowner - the AIO system now
>          * keeps track.
>          */
>         ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer);
>     }

I don't particularly like this; I'd like to make the logic for shared and
local buffers more similar over time. E.g. by also tracking local buffer IO
via resowner.


> In buffer_readv_complete(), this comment
> 
>     /*
>      * Iterate over all the buffers affected by this IO and call appropriate
>      * per-buffer completion function for each buffer.
>      */
> 
> makes it sound like we might invoke different completion functions (like invoke
> the completion callback), but that isn't what happens here.

Oops, that's how it used to work, but it doesn't anymore, because it ended up
with too much duplication.


>         failed =
>             prior_result.status == ARS_ERROR
>             || prior_result.result <= buf_off;
> 
> Though not introduced in this commit, I will say that I find the ARS prefix not
> particularly helpful. Though not as brief, something like AIO_RS_ERROR would
> probably be more clear.

Fair enough. I'd go for PGAIO_RS_ERROR etc though.
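
I.e. roughly the following (the member set is inferred from the ARS_* values
seen above and is an assumption, not final):

    typedef enum PgAioResultStatus
    {
        PGAIO_RS_UNKNOWN,
        PGAIO_RS_OK,
        PGAIO_RS_PARTIAL,
        PGAIO_RS_ERROR,
    } PgAioResultStatus;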


> @@ -515,10 +517,25 @@ MarkLocalBufferDirty(Buffer buffer)
>   * Like StartBufferIO, but for local buffers
>   */
>  bool
> -StartLocalBufferIO(BufferDesc *bufHdr, bool forInput)
> +StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
>  {
> -   uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);
> +   uint32      buf_state;
> +
> +   /*
> +    * The buffer could have IO in progress, e.g. when there are two scans of
> +    * the same relation. Either wait for the other IO or return false.
> +    */
> +   if (pgaio_wref_valid(&bufHdr->io_wref))
> +   {
> +       PgAioWaitRef iow = bufHdr->io_wref;
> +
> +       if (nowait)
> +           return false;
> +
> +       pgaio_wref_wait(&iow);
> +   }
> 
> +   buf_state = pg_atomic_read_u32(&bufHdr->state);
>     if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
>     {
>         /* someone else already did the I/O */
> 
> I'd move this comment ("someone else already did") outside of the if
> statement so it kind of separates it into three clear cases:

FWIW it's inside because that's how StartBufferIO's comment has been for a fair
while...


> 1) the IO is in progress and you can wait on it if you want,
> 2) the IO is completed by someone else (is this possible for local buffers
> without AIO?)

No, that's not possible without AIO.


> 3) you can start the IO

I'll give it a go.


Thanks for the review!


Greetings,

Andres Freund



Re: AIO v2.5

From
Melanie Plageman
Date:
On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
>
> Attached is v2.10

This is a review of 0002:  bufmgr: Improve stats when buffer is read
in concurrently

In the commit message, it might be worth distinguishing that
pg_stat_io and vacuum didn't double count reads; they under-counted
hits. pgBufferUsage and relation-level stats (pg_stat_all_tables etc)
overcounted reads and undercounted hits.

Quick example:
On master, if we try to read 7 blocks and 3 were hits and 2 were
completed by someone else, then
- pg_stat_io and VacuumCostBalance would record 3 hits and 2 reads,
which looks like 2 misses.
- pgBufferUsage would record 3 hits and 4 reads, which looks like 4 misses.
- pg_stat_all_tables would record 3 hits and 7 reads, which looks like 4 misses.

The correct accounting is 2 misses: 5 hits and 2 reads (or 7 reads and
5 hits for pg_stat_all_tables, which does the math later).

@@ -1463,8 +1450,13 @@ WaitReadBuffers(ReadBuffersOperation *operation)
        if (!WaitReadBuffersCanStartIO(buffers[i], false))
        {
            /*
-            * Report this as a 'hit' for this backend, even though it must
-            * have started out as a miss in PinBufferForBlock().
+            * Report and track this as a 'hit' for this backend, even though
+            * it must have started out as a miss in PinBufferForBlock().
+            *
+            * Some of the accesses would otherwise never be counted (e.g.
+            * pgBufferUsage) or counted as a miss (e.g.
+            * pgstat_count_buffer_hit(), as we always call
+            * pgstat_count_buffer_read()).
             */

I think this comment should be changed. It reads like something
written when discovering this problem and not like something useful in
the future. I think you can probably drop the whole second paragraph.

You could make it even more clear by mentioning that the other backend
will count it as a read.

Otherwise, LGTM


- Melanie



Re: AIO v2.5

From
Melanie Plageman
Date:
On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
>
> Attached is v2.10,

I noticed a few comments could be improved in  0011: bufmgr: Use AIO
in StartReadBuffers()

In WaitReadBuffers(), this comment is incomplete:

        /*
-        * Skip this block if someone else has already completed it.  If an
-        * I/O is already in progress in another backend, this will wait for
-        * the outcome: either done, or something went wrong and we will
-        * retry.
+        * If there is an IO associated with the operation, we may need to
+        * wait for it. It's possible for there to be no IO if
         */

In WaitReadBuffers(), too many thes

        /*
         * Most of the the the one IO we started will read in everything.  But
         * we need to deal with short reads and buffers not needing IO
         * anymore.
         */

In ReadBuffersCanStartIO()

+       /*
+        * Unfortunately a false returned StartBufferIO() doesn't allow to
+        * distinguish between the buffer already being valid and IO already
+        * being in progress. Since IO already being in progress is quite
+        * rare, this approach seems fine.
+        */

maybe reword "a false returned StartBufferIO()"

Above and in AsyncReadBuffers()

 * To support retries after short reads, the first operation->nblocks_done is
 * buffers are skipped.

can't quite understand this

+ * On return *nblocks_progres is updated to reflect the number of buffers
progress spelled wrong

     * A secondary benefit is that this would allows us to measure the time in
     * pgaio_io_acquire() without causing undue timer overhead in the common,
     * non-blocking, case.  However, currently the pgstats infrastructure
     * doesn't really allow that, as it a) asserts that an operation can't
     * have time without operations b) doesn't have an API to report
     * "accumulated" time.
     */

allows->allow

What would the time spent in pgaio_io_acquire() be reported as? Time
submitting IOs? Time waiting for a handle? And what is "accumulated"
time here? It seems like you just add the time to the running total
and that is already accumulated.

- Melanie



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-19 13:20:17 -0400, Melanie Plageman wrote:
> On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Attached is v2.10,
> 
> I noticed a few comments could be improved in  0011: bufmgr: Use AIO
> in StartReadBuffers()
> [...]

Yep.


> Above and in AsyncReadBuffers()
> 
>  * To support retries after short reads, the first operation->nblocks_done is
>  * buffers are skipped.
> 
> can't quite understand this

Heh, yea, it's easy to misunderstand. "short read" in the sense of a partial
read, i.e. a preadv() that only read some of the blocks, not all. I'm
replacing the "short" with partial.

(also removed the superfluous "is")



>      * A secondary benefit is that this would allows us to measure the time in
>      * pgaio_io_acquire() without causing undue timer overhead in the common,
>      * non-blocking, case.  However, currently the pgstats infrastructure
>      * doesn't really allow that, as it a) asserts that an operation can't
>      * have time without operations b) doesn't have an API to report
>      * "accumulated" time.
>      */
> 
> allows->allow
> 
> What would the time spent in pgaio_io_acquire() be reported as?

I'd report it as additional time for the IO we're trying to start, as that
wait would otherwise not happen.


> And what is "accumulated" time here? It seems like you just add the time to
> the running total and that is already accumulated.

Afaict there currently is no way to report a time delta to
pgstat. pgstat_count_io_op_time() computes the time since
pgstat_prepare_io_time(). Due to the assertions that time cannot be reported
for an operation with a zero count, we can't just do
  pgstat_prepare_io_time(); ...; pgstat_count_io_op_time();
twice, with the first one passing cnt=0.

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> On 2025-03-11 20:57:43 -0700, Noah Misch wrote:
> > - Like you say, "redefine max_files_per_process to be about the number of
> >   files each *backend* will additionally open".  It will become normal that
> >   each backend's actual FD list length is max_files_per_process + MaxBackends
> >   if io_method=io_uring.  Outcome is not unlike
> >   v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
> >   v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't
> >   mutate max_files_per_process.  Benchmark results should not change beyond
> >   the inter-major-version noise level unless one sets io_method=io_uring.  I'm
> >   feeling best about this one, but I've not been thinking about it long.
> 
> Yea, I think that's something probably worth doing separately from Jelte's
> patch.  I do think that it'd be rather helpful to have jelte's patch to
> increase NOFILE in addition though.

Agreed.

> > > > > +static void
> > > > > +maybe_adjust_io_workers(void)
> > > >
> > > > This also restarts workers that exit, so perhaps name it
> > > > start_io_workers_if_missing().
> > > 
> > > But it also stops IO workers if necessary?
> > 
> > Good point.  Maybe just add a comment like "start or stop IO workers to close
> > the gap between the running count and the configured count intent".
> 
> It's now
> /*
>  * Start or stop IO workers, to close the gap between the number of running
>  * workers and the number of configured workers.  Used to respond to change of
>  * the io_workers GUC (by increasing and decreasing the number of workers), as
>  * well as workers terminating in response to errors (by starting
>  * "replacement" workers).
>  */

Excellent.

> > > > > +{
> > > > ...
> > > > > +        /* Try to launch one. */
> > > > > +        child = StartChildProcess(B_IO_WORKER);
> > > > > +        if (child != NULL)
> > > > > +        {
> > > > > +            io_worker_children[id] = child;
> > > > > +            ++io_worker_count;
> > > > > +        }
> > > > > +        else
> > > > > +            break;                /* XXX try again soon? */
> > > >
> > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > > > function, replacing the current mix of callers?  That would be more conducive
> > > > to promptly doing the right thing after launch failure.
> > > 
> > > I'm not sure that'd be a good idea - right now IO workers are started before
> > > the startup process, as the startup process might need to perform IO. If we
> > > started it only later in ServerLoop() we'd potentially do a fair bit of work,
> > > including starting checkpointer, bgwriter, bgworkers before we started IO
> > > workers.  That shouldn't actively break anything, but it would likely make
> > > things slower.
> > 
> > I missed that.  How about keeping the two calls associated with PM_STARTUP but
> > replacing the assign_io_workers() and process_pm_child_exit() calls with one
> > in LaunchMissingBackgroundProcesses()?
> 
> I think replacing the call in assign_io_workers() is a good idea, that way we
> don't need assign_io_workers().
> 
> Less convinced it's a good idea to do the same for process_pm_child_exit() -
> if IO workers errored out we'll launch backends etc before we get to
> LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but
> seems a bit odd.

Works for me.

> I think LaunchMissingBackgroundProcesses() should be split into one that
> starts aux processes and one that starts bgworkers. The one maintaining aux
> processes should be called before we start backends, the latter not.

That makes sense, though I've not thought about it much.

> > > > > +            /*
> > > > > +             * It's very unlikely, but possible, that reopen fails. E.g. due
> > > > > +             * to memory allocations failing or file permissions changing or
> > > > > +             * such.  In that case we need to fail the IO.
> > > > > +             *
> > > > > +             * There's not really a good errno we can report here.
> > > > > +             */
> > > > > +            error_errno = ENOENT;
> > > >
> > > > Agreed there's not a good errno, but let's use a fake errno that we're mighty
> > > > unlikely to confuse with an actual case of libc returning that errno.  Like
> > > > one of EBADF or EOWNERDEAD.
> > > 
> > > Can we rely on that to be present on all platforms, including windows?
> > 
> > I expect EBADF is universal.  EBADF would be fine.
> 
> Hm, EBADF is actually an error that could happen for other reasons, and IMO
> would be more confusing than ENOENT: the latter at least describes the issue
> to a reasonable extent.
> 
> I'm not sure it's worth investing time in this - it really shouldn't happen,
> and we probably have bigger problems than the error code if it does. But if we
> do want to do something, I think I can see a way to report a dedicated error
> message for this.

I agree it's not worth much investment.  Let's leave that one as-is.  We can
always change it further if the not-really-good errno shows up too much.

> > https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi
> > lists some OSs not having it, the newest of which looks like NetBSD 9.3
> > (2022).  We could use it and add a #define for platforms lacking it.
> 
> What would we define it as?  I guess we could just pick a high value, but...

Some second-best value, but I withdraw that idea.

On Wed, Mar 12, 2025 at 07:23:47PM -0400, Andres Freund wrote:
> Attached is v2.7, with the following changes:

> Unresolved:
> 
> - Whether to continue starting new workers in process_pm_child_exit()

I'm fine with that continuing.  It's hurting ~nothing.

> - What to name the view (currently pg_aios). I'm inclined to go for
>   pg_io_handles right now.

I like pg_aios mildly better than pg_io_handles, since "handle" sounds
implementation-centric.

On Fri, Mar 14, 2025 at 03:43:15PM -0400, Andres Freund wrote:
> Attached is v2.8 with the following changes:

> - In parallel: Find a way to deal with the set_max_safe_fds() issue that we've
>   been discussing on this thread recently. As that only affects io_uring, it
>   doesn't have to block other patches going in.

As above, I like the "redefine" option.

> - Right now effective_io_concurrency cannot be set > 0 on Windows and other
>   platforms that lack posix_fadvise. But with AIO we can read ahead without
>   posix_fadvise().
> 
>   It'd not really make anything worse than today to not remove the limit, but
>   it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
>   to look around and see whether it would require anything other than doc
>   changes.

Worth changing, but non-blocking.

On Fri, Mar 14, 2025 at 03:58:43PM -0400, Andres Freund wrote:
> - Should the docs for debug_io_direct be rephrased and if so, how?

> Perhaps it's worth going from
> 
>        <para>
>         Currently this feature reduces performance, and is intended for
>         developer testing only.
>        </para>
> to
>        <para>
>         Currently this feature reduces performance in many workloads, and is
>         intended for testing only.
>        </para>
> 
> I.e. qualify the downside with "many workloads" and widen the audience ever so
> slightly?

Yes, that's good.


Other than the smgr patch review sent on its own thread, I've not yet reviewed
any of these patches comprehensively.  Given the speed of change, I felt it
was time to flush comments buffered since 2025-03-11:

commit 0284401 wrote:
>     aio: Basic subsystem initialization

> @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
>           */
>          LWLockReleaseAll();
>          pgstat_report_wait_end();
> +        pgaio_error_cleanup();

AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio().  Is that
proper?  They do call pgaio_error_cleanup() as seen here, so the only loss is
some asserts.  (The load-bearing part does get done.)
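
(For context, the error-path shape in question, abridged from the diff above:
such processes release low-level resources directly in their sigsetjmp()
recovery block rather than via the AtEOXact_* machinery.)

```c
/* Abridged sketch of the aux-process error path under discussion. */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
    /* ... error reporting and other cleanup elided ... */
    LWLockReleaseAll();
    pgstat_report_wait_end();
    pgaio_error_cleanup();      /* the load-bearing AIO cleanup; what's
                                 * skipped is only the extra asserts
                                 * AtEOXact_Aio() would do */
}
```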

commit da72269 wrote:
>     aio: Add core asynchronous I/O infrastructure

> + * This could be in aio_internal.h, as it is not pubicly referenced, but

typo -> publicly

commit 55b454d wrote:
>     aio: Infrastructure for io_method=worker

> +        /* Try to launch one. */
> +        child = StartChildProcess(B_IO_WORKER);
> +        if (child != NULL)
> +        {
> +            io_worker_children[id] = child;
> +            ++io_worker_count;
> +        }
> +        else
> +            break;                /* XXX try again soon? */

I'd change the comment to something like one of:

  retry after DetermineSleepTime()
  next LaunchMissingBackgroundProcesses() will retry in <60s

On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> - Decide what to do about the smgr interrupt issue

Replied on that thread.  It's essentially ready.

> Questions / Unresolved:
> 
> - Write support isn't going to land in 18, but there is a tiny bit of code
>   regarding writes in the code for bufmgr IO. I guess I could move that to a
>   later commit?
> 
>   I'm inclined to leave it, the structure of the code only really makes
>   sense knowing that it's going to be shared between reads & writes.

Fine to leave it.

> - pg_aios view name

Covered above.

> Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support

Some comments about BM_IO_IN_PROGRESS may need updates.  This paragraph:

* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock).  The process doing a read or write sets the flag for the
duration, and processes that need to wait for it to be cleared sleep on a
condition variable.

And these individual lines from "git grep BM_IO_IN_PROGRESS":
         * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
     * only one process at a time can set the BM_IO_IN_PROGRESS bit.
     * only one process at a time can set the BM_IO_IN_PROGRESS bit.
 *    i.e at most one BM_IO_IN_PROGRESS bit is set per proc.

The last especially.  For the other three lines and the paragraph, the notion
of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
being the process "doing a read" becomes less significant when one process
starts the IO and another completes it.

> +        /* we better have ensured the buffer is present until now */
> +        Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);

I'd delete that comment; to me, the assertion alone is clearer.

> +            ereport(LOG,
> +                    (errcode(ERRCODE_DATA_CORRUPTED),
> +                     errmsg("invalid page in block %u of relation %s; zeroing out page",

This is changing level s/WARNING/LOG/.  That seems orthogonal to the patch's
goals; is it needed?  If so, I recommend splitting it out as a preliminary
patch, to highlight the behavior change for release notes.

> +/*
> + * Perform completion handling of a single AIO read. This read may cover
> + * multiple blocks / buffers.
> + *
> + * Shared between shared and local buffers, to reduce code duplication.
> + */
> +static pg_attribute_always_inline PgAioResult
> +buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result,
> +                      uint8 cb_data, bool is_temp)
> +{
> +    PgAioResult result = prior_result;
> +    PgAioTargetData *td = pgaio_io_get_target_data(ioh);
> +    uint64       *io_data;
> +    uint8        handle_data_len;
> +
> +    if (is_temp)
> +    {
> +        Assert(td->smgr.is_temp);
> +        Assert(pgaio_io_get_owner(ioh) == MyProcNumber);
> +    }
> +    else
> +        Assert(!td->smgr.is_temp);
> +
> +    /*
> +     * Iterate over all the buffers affected by this IO and call appropriate
> +     * per-buffer completion function for each buffer.
> +     */
> +    io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
> +    for (uint8 buf_off = 0; buf_off < handle_data_len; buf_off++)
> +    {
> +        Buffer        buf = io_data[buf_off];
> +        PgAioResult buf_result;
> +        bool        failed;
> +
> +        Assert(BufferIsValid(buf));
> +
> +        /*
> +         * If the entire failed on a lower-level, each buffer needs to be

Missing word, probably fix like:
s,entire failed on a lower-level,entire I/O failed on a lower level,

> +         * marked as failed. In case of a partial read, some buffers may be
> +         * ok.
> +         */
> +        failed =
> +            prior_result.status == ARS_ERROR
> +            || prior_result.result <= buf_off;

I didn't run an experiment to check the following, but I think this should be
s/<=/</.  Suppose we requested two blocks and read some amount of bytes
[1*BLCKSZ, 2*BLCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
should compute failed=false here, but buf_off==1 should compute failed=true.

I see this relies on md_readv_complete having converted "result" to blocks.
Was there some win from doing that as opposed to doing the division here?
Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
to follow, to me.

> +
> +        buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> +                                               is_temp);
> +
> +        /*
> +         * If there wasn't any prior error and the IO for this page failed in
> +         * some form, set the whole IO's to the page's result.

s/the IO for this page/page verification/
s/IO's/IO's result/

> +         */
> +        if (result.status != ARS_ERROR && buf_result.status != ARS_OK)
> +        {
> +            result = buf_result;
> +            pgaio_result_report(result, td, LOG);
> +        }
> +    }
> +
> +    return result;
> +}



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> > - Right now effective_io_concurrency cannot be set > 0 on Windows and other
> >   platforms that lack posix_fadvise. But with AIO we can read ahead without
> >   posix_fadvise().
> >
> >   It'd not really make anything worse than today to not remove the limit, but
> >   it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
> >   to look around and see whether it would require anything other than doc
> >   changes.
>
> Worth changing, but non-blocking.

Thankfully Melanie submitted a patch for that...


> Other than the smgr patch review sent on its own thread, I've not yet reviewed
> any of these patches comprehensively.  Given the speed of change, I felt it
> was time to flush comments buffered since 2025-03-11:

Thanks!


> commit 0284401 wrote:
> >     aio: Basic subsystem initialization
>
> > @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
> >           */
> >          LWLockReleaseAll();
> >          pgstat_report_wait_end();
> > +        pgaio_error_cleanup();
>
> AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
> WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio().  Is that
> proper?  They do call pgaio_error_cleanup() as seen here, so the only loss is
> some asserts.  (The load-bearing part does get done.)

I don't think it's particularly good that we use the AtEOXact_* functions in
the sigsetjmp blocks, that feels like a weird mixup of infrastructure to
me. So this was intentional.


> commit da72269 wrote:
> >     aio: Add core asynchronous I/O infrastructure
>
> > + * This could be in aio_internal.h, as it is not pubicly referenced, but
>
> typo -> publicly

/me has a red face.


> commit 55b454d wrote:
> >     aio: Infrastructure for io_method=worker
>
> > +        /* Try to launch one. */
> > +        child = StartChildProcess(B_IO_WORKER);
> > +        if (child != NULL)
> > +        {
> > +            io_worker_children[id] = child;
> > +            ++io_worker_count;
> > +        }
> > +        else
> > +            break;                /* XXX try again soon? */
>
> I'd change the comment to something like one of:
>
>   retry after DetermineSleepTime()
>   next LaunchMissingBackgroundProcesses() will retry in <60s

Hm, we retry more frequently than that if there are new connections...  Maybe
just "try again next time"?


> On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> > - Decide what to do about the smgr interrupt issue
>
> Replied on that thread.  It's essentially ready.

Cool, will reply there in a bit.


> > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support
>
> Some comments about BM_IO_IN_PROGRESS may need updates.  This paragraph:
>
> * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
> buffer to complete (and in releases before 14, it was accompanied by a
> per-buffer LWLock).  The process doing a read or write sets the flag for the
> duration, and processes that need to wait for it to be cleared sleep on a
> condition variable.

First draft:
* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock).  The process start a read or write sets the flag. When the
I/O is completed, be it by the process that initiated the I/O or by another
process, the flag is removed and the Buffer's condition variable is signalled.
Processes that need to wait for the I/O to complete can wait for asynchronous
I/O to using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be unset by
sleeping on the buffer's condition variable.


> And these individual lines from "git grep BM_IO_IN_PROGRESS":
>
>  *    i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
>
> The last especially.

Huh - yea.  This isn't a "new" issue, I think I missed this comment in 16's
12f3867f5534.  I think the comment can just be deleted?


>          * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
>      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
>      * only one process at a time can set the BM_IO_IN_PROGRESS bit.

> For the other three lines and the paragraph, the notion
> of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> being the process "doing a read" becomes less significant when one process
> starts the IO and another completes it.

Hm. I think they'd be ok as-is, but we can probably improve them. Maybe


     * Now it's safe to write buffer to disk. Note that no one else should
     * have been able to write it while we were busy with log flushing because
     * we got the exclusive right to perform I/O by setting the
     * BM_IO_IN_PROGRESS bit.



> > +        /* we better have ensured the buffer is present until now */
> > +        Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
>
> I'd delete that comment; to me, the assertion alone is clearer.

Ok.


> > +            ereport(LOG,
> > +                    (errcode(ERRCODE_DATA_CORRUPTED),
> > +                     errmsg("invalid page in block %u of relation %s; zeroing out page",
>
> This is changing level s/WARNING/LOG/.  That seems orthogonal to the patch's
> goals; is it needed?  If so, I recommend splitting it out as a preliminary
> patch, to highlight the behavior change for release notes.

No, it's not needed. I think I looked over the patch at some point and
considered the log-level wrong according to our guidelines and thought I'd
broken it.


> > +        /*
> > +         * If the entire failed on a lower-level, each buffer needs to be
>
> Missing word, probably fix like:
> s,entire failed on a lower-level,entire I/O failed on a lower level,


Yep.


> > +         * marked as failed. In case of a partial read, some buffers may be
> > +         * ok.
> > +         */
> > +        failed =
> > +            prior_result.status == ARS_ERROR
> > +            || prior_result.result <= buf_off;
>
> I didn't run an experiment to check the following, but I think this should be
> s/<=/</.  Suppose we requested two blocks and read some amount of bytes
> [1*BLCKSZ, 2*BLCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
> should compute failed=false here, but buf_off==1 should compute failed=true.

Huh, you might be right. I thought I wrote a test for this, I wonder why it
didn't catch the problem...


> I see this relies on md_readv_complete having converted "result" to blocks.
> Was there some win from doing that as opposed to doing the division here?
> Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> to follow, to me.

It seemed like that would be wrong layering - what if we had an smgr that
could store data in a compressed format? The raw read would be of a smaller
size. The smgr API deals in BlockNumbers, only the md.c layer should know
about bytes.
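
A minimal sketch of the conversion implied by that layering (illustration
only, not the committed md.c code):

```c
/*
 * Illustration only: md.c converts the kernel's raw byte result into
 * whole blocks before completion callbacks above smgr see it; a
 * partially read trailing block counts as not read.
 */
static int
bytes_to_blocks_sketch(ssize_t raw_bytes_read)
{
    if (raw_bytes_read <= 0)
        return 0;
    return (int) (raw_bytes_read / BLCKSZ);
}
```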


> > +
> > +        buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> > +                                               is_temp);
> > +
> > +        /*
> > +         * If there wasn't any prior error and the IO for this page failed in
> > +         * some form, set the whole IO's to the page's result.
>
> s/the IO for this page/page verification/
> s/IO's/IO's result/

Agreed.

Thanks for the review!

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote:
> On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> > commit 55b454d wrote:
> > >     aio: Infrastructure for io_method=worker
> >
> > > +        /* Try to launch one. */
> > > +        child = StartChildProcess(B_IO_WORKER);
> > > +        if (child != NULL)
> > > +        {
> > > +            io_worker_children[id] = child;
> > > +            ++io_worker_count;
> > > +        }
> > > +        else
> > > +            break;                /* XXX try again soon? */
> >
> > I'd change the comment to something like one of:
> >
> >   retry after DetermineSleepTime()
> >   next LaunchMissingBackgroundProcesses() will retry in <60s
> 
> Hm, we retry more frequently than that if there are new connections...  Maybe
> just "try again next time"?

Works for me.

> > On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> > > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support
> >
> > Some comments about BM_IO_IN_PROGRESS may need updates.  This paragraph:
> >
> > * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
> > buffer to complete (and in releases before 14, it was accompanied by a
> > per-buffer LWLock).  The process doing a read or write sets the flag for the
> > duration, and processes that need to wait for it to be cleared sleep on a
> > condition variable.
> 
> First draft:
> * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
> buffer to complete (and in releases before 14, it was accompanied by a
> per-buffer LWLock).  The process start a read or write sets the flag. When the
s/start/starting/
> I/O is completed, be it by the process that initiated the I/O or by another
> process, the flag is removed and the Buffer's condition variable is signalled.
> Processes that need to wait for the I/O to complete can wait for asynchronous
> I/O to using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be unset by
s/to using/by using/
> sleeping on the buffer's condition variable.

Sounds good.

> > And these individual lines from "git grep BM_IO_IN_PROGRESS":
> >
> >  *    i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
> >
> > The last especially.
> 
> Huh - yea.  This isn't a "new" issue, I think I missed this comment in 16's
> 12f3867f5534.  I think the comment can just be deleted?

Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine.

> >          * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
> >      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> >      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> 
> > For the other three lines and the paragraph, the notion
> > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> > being the process "doing a read" becomes less significant when one process
> > starts the IO and another completes it.
> 
> Hm. I think they'd be ok as-is, but we can probably improve them. Maybe

Looking again, I agree they're okay.

> 
>      * Now it's safe to write buffer to disk. Note that no one else should
>      * have been able to write it while we were busy with log flushing because
>      * we got the exclusive right to perform I/O by setting the
>      * BM_IO_IN_PROGRESS bit.

That's fine too.  Maybe s/perform/stage/ or s/perform/start/.

> > I see this relies on md_readv_complete having converted "result" to blocks.
> > Was there some win from doing that as opposed to doing the division here?
> > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> > to follow, to me.
> 
> It seemed like that would be wrong layering - what if we had an smgr that
> could store data in a compressed format? The raw read would be of a smaller
> size. The smgr API deals in BlockNumbers, only the md.c layer should know
> about bytes.

I hadn't thought of that.  That's a good reason.



Re: AIO v2.5

From
Jakub Wartak
Date:
On Tue, Mar 18, 2025 at 9:12 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Attached is v2.10, with the following changes:
>
> - committed core AIO infrastructure patch

Hi, yay, It's happening.jpg ;)

Some thoughts about 2.10-0004:
What do you think about adding, to the io_uring patch, info about the
need to ensure that the kernel.io_uring_disabled sysctl permits
io_uring? (Some distros might shut it down.) E.g. in
doc/src/sgml/config.sgml, after the io_method <listitems>..., there
could be

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
            <literal>io_uring</literal> (execute asynchronous I/O using
            io_uring, if available)
[..]
and then add something like:
+ "At present io_method=io_uring is supported only on Linux and
requires Linux's sysctl kernel.io_uring_disabled (if present) to be at
value 0 (enabled) or 1 (with kernel.io_uring_group set to PostgreSQL's
GID)."

Rationale: it seems that at least RHEL 9.x will have this knob present
(but e.g. RHEL 8.10 doesn't, even with kernel-ml 6.4.2, as this seems
to come with 6.6+; I also saw somewhere that somebody had issues with
this on a probably-backported kernel in Rocky 9.x). Further googling
found that MySQL can throw, when executed from podman/docker:
"mysqld: io_uring_queue_init() failed with ENOSYS: check seccomp
filters, and the kernel version (newer than 5.1 required)"

and this leaves two probable follow-up questions when adjusting this
sentence:
a. shouldn't we add some sentence about containers/namespaces/seccomp
needing to allow this?
b. and/or shouldn't we reference a minimum kernel version in the docs?
(this is somewhat wild: liburing could be installed and compiled
against, but the runtime kernel could still be < 5.1)

-J.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-19 18:17:37 -0400, Andres Freund wrote:
> On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> > > +         * marked as failed. In case of a partial read, some buffers may be
> > > +         * ok.
> > > +         */
> > > +        failed =
> > > +            prior_result.status == ARS_ERROR
> > > +            || prior_result.result <= buf_off;
> >
> > I didn't run an experiment to check the following, but I think this should be
> > s/<=/</.  Suppose we requested two blocks and read some amount of bytes
> > [1*BLCKSZ, 2*BLCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
> > should compute failed=false here, but buf_off==1 should compute failed=true.
> 
> Huh, you might be right. I thought I wrote a test for this, I wonder why it
> didn't catch the problem...

It was correct as-is. With result=1 you get precisely the result you describe
as the desired outcome, no?
   prior_result.result <= buf_off
   ->
   1 <= 0 -> failed = 0
   1 <= 1 -> failed = 1

but if it were < as you suggest:

   prior_result.result < buf_off
   ->
   1 < 0 -> failed = 0
   1 < 1 -> failed = 0

I.e. we would assume that the second buffer also completed.


What does concern me is that the existing tests do *not* catch the problem if
I turn "<=" into "<".  The second buffer in this case wrongly gets marked as
valid. We do retry the read (because bufmgr.c thinks only one block was read),
but find the buffer to already be valid.

The reason the test doesn't fail is the way I set up the "short read"
tests. The injection point runs after the IO completed and just modifies the
result. However, the actual buffer contents still got modified.


The easiest way around that seems to be to have the injection point actually
zero out the remaining memory. Not pretty, but it'd be harder to just submit
shortened IOs in multiple IO methods.  It'd be even better if we could
trivially use something like randomize_mem(), but it's only conditionally
compiled...
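
A minimal sketch of that fix (names hypothetical): once the injection point
has shortened the reported result to blocks_read, zero everything past the
successfully read region, so a buggy retry path can't get away with using
stale-but-plausible buffer contents.

```c
/* Names hypothetical; sketch of the zeroing idea only. */
static void
zero_past_short_read(char *buf, int nblocks, int blocks_read)
{
    memset(buf + (size_t) blocks_read * BLCKSZ, 0,
           (size_t) (nblocks - blocks_read) * BLCKSZ);
}
```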

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Thu, Mar 20, 2025 at 01:05:05PM -0400, Andres Freund wrote:
> On 2025-03-19 18:17:37 -0400, Andres Freund wrote:
> > On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> > > > +         * marked as failed. In case of a partial read, some buffers may be
> > > > +         * ok.
> > > > +         */
> > > > +        failed =
> > > > +            prior_result.status == ARS_ERROR
> > > > +            || prior_result.result <= buf_off;
> > >
> > > I didn't run an experiment to check the following, but I think this should be
> > > s/<=/</.  Suppose we requested two blocks and read some amount of bytes
> > > [1*BLCKSZ, 2*BLCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
> > > should compute failed=false here, but buf_off==1 should compute failed=true.
> > 
> > Huh, you might be right. I thought I wrote a test for this, I wonder why it
> > didn't catch the problem...
> 
> It was correct as-is. With result=1 you get precisely the result you describe
> as the desired outcome, no?
>    prior_result.result <= buf_off
>    ->
>    1 <= 0 -> failed = 0
>    1 <= 1 -> failed = 1
> 
> but if it were < as you suggest:
> 
>    prior_result.result < buf_off
>    ->
>    1 < 0 -> failed = 0
>    1 < 1 -> failed = 0
> 
> I.e. we would assume that the second buffer also completed.

That's right.  I see it now.  My mistake.

> What does concern me is that the existing tests do *not* catch the problem if
> I turn "<=" into "<".  The second buffer in this case wrongly gets marked as
> valid. We do retry the read (because bufmgr.c thinks only one block was read),
> but find the buffer to already be valid.
> 
> The reason the test doesn't fail is the way I set up the "short read"
> tests. The injection point runs after the IO completed and just modifies the
> result. However, the actual buffer contents still got modified.
> 
> 
> The easiest way around that seems to be to have the injection point actually
> zero out the remaining memory.

Sounds reasonable and sufficient.

FYI, I've resumed the comprehensive review.  That's still ongoing.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-19 18:11:18 -0700, Noah Misch wrote:
> On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote:
> > On 2025-03-19 14:25:30 -0700, Noah Misch wrote:

> > Hm, we retry more frequently than that if there are new connections...  Maybe
> > just "try again next time"?
> 
> Works for me.
> 

> > > And these individual lines from "git grep BM_IO_IN_PROGRESS":
> > >
> > >  *    i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
> > >
> > > The last especially.
> > 
> > Huh - yea.  This isn't a "new" issue, I think I missed this comment in 16's
> > 12f3867f5534.  I think the comment can just be deleted?
> 
> Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine.
> 
> > >          * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
> > >      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> > >      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> > 
> > > For the other three lines and the paragraph, the notion
> > > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> > > being the process "doing a read" becomes less significant when one process
> > > starts the IO and another completes it.
> > 
> > Hm. I think they'd be ok as-is, but we can probably improve them. Maybe
> 
> Looking again, I agree they're okay.
> 
> > 
> >      * Now it's safe to write buffer to disk. Note that no one else should
> >      * have been able to write it while we were busy with log flushing because
> >      * we got the exclusive right to perform I/O by setting the
> >      * BM_IO_IN_PROGRESS bit.
> 
> That's fine too.  Maybe s/perform/stage/ or s/perform/start/.

I put these comment changes into their own patch, as it seemed confusing to
change them as part of one of the already queued commits.


> > > I see this relies on md_readv_complete having converted "result" to blocks.
> > > Was there some win from doing that as opposed to doing the division here?
> > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> > > to follow, to me.
> > 
> > It seemed like that would be wrong layering - what if we had an smgr that
> > could store data in a compressed format? The raw read would be of a smaller
> > size. The smgr API deals in BlockNumbers, only the md.c layer should know
> > about bytes.
> 
> I hadn't thought of that.  That's a good reason.

I thought that was better documented, but alas, it wasn't. How about updating
the documentation of smgrstartreadv to the following:

/*
 * smgrstartreadv() -- asynchronous version of smgrreadv()
 *
 * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
 * `ioh` all parameters are the same as smgrreadv().
 *
 * Completion callbacks above smgr will be passed the result as the number of
 * successfully read blocks if the read [partially] succeeds. This maintains
 * the abstraction that smgr operates on the level of blocks, rather than
 * bytes.
 */


I briefly had a bug in test_aio's injection point that led to *increasing*
the number of bytes successfully read. That triggered an assertion failure in
bufmgr.c, but nothing closer to the problem.  Is it worth adding an assert against
that to md_readv_complete? Can't quite decide.
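
Something like the following is the shape such an assert could take (a sketch
with hypothetical names, not committed code):

```c
/*
 * Sketch: catch a result claiming more than was requested at the md
 * layer, rather than via a distant assertion failure in bufmgr.c.
 */
static inline void
check_read_result_sketch(ssize_t raw_result, size_t requested_bytes)
{
    Assert(raw_result <= (ssize_t) requested_bytes);
}
```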

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Thu, Mar 20, 2025 at 02:54:14PM -0400, Andres Freund wrote:
> On 2025-03-19 18:11:18 -0700, Noah Misch wrote:
> > On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote:
> > > On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> > > > I see this relies on md_readv_complete having converted "result" to blocks.
> > > > Was there some win from doing that as opposed to doing the division here?
> > > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> > > > to follow, to me.
> > > 
> > > It seemed like that would be wrong layering - what if we had an smgr that
> > > could store data in a compressed format? The raw read would be of a smaller
> > > size. The smgr API deals in BlockNumbers, only the md.c layer should know
> > > about bytes.
> > 
> > I hadn't thought of that.  That's a good reason.
> 
> I thought that was better documented, but alas, it wasn't. How about updating
> the documentation of smgrstartreadv to the following:
> 
> /*
>  * smgrstartreadv() -- asynchronous version of smgrreadv()
>  *
>  * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
>  * `ioh` all parameters are the same as smgrreadv().
>  *
>  * Completion callbacks above smgr will be passed the result as the number of
>  * successfully read blocks if the read [partially] succeeds. This maintains
>  * the abstraction that smgr operates on the level of blocks, rather than
>  * bytes.
>  */

That's good.  Possibly add "(Buffers for blocks not successfully read might
bear unspecified modifications, up to the full nblocks.)"

In a bit of over-thinking this, I wondered if shared_buffer_readv_complete
would be better named shared_buffer_smgrreadv_complete, to emphasize the
smgrreadv semantics.  PGAIO_HCB_SHARED_BUFFER_READV likewise.  But I tend to
think not.  smgrreadv() has no "result" concept, so the symmetry is limited.

> I briefly had a bug in test_aio's injection point that led to *increasing*
> the number of bytes successfully read. That triggered an assertion failure in
> bufmgr.c, but nothing closer to the problem.  Is it worth adding an assert against
> that to md_readv_complete? Can't quite decide.

I'd lean yes, if in doubt.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

Attached v2.11, with the following changes:


- Pushed the smgr interrupt change, as discussed on the dedicated thread


- Pushed "bufmgr: Improve stats when a buffer is read in concurrently"

  It was reviewed by Melanie and there didn't seem to be any reason to wait
  further.


- Addressed feedback from Melanie


- Addressed feedback from Noah


- Added a new commit: aio: Change prefix of PgAioResultStatus values to PGAIO_RS_

  As suggested/requested by Melanie. I think she's unfortunately right.


- Added a patch for some comment fixups for code that's either older or
  already pushed


- Added an error check for FileStartReadV() failing

  FileStartReadV() actually can fail, if the file can't be re-opened. I
  thought it'd be important for the error message to differ from the one
  that's issued for read actually failing, so I went with:

  "could not start reading blocks %u..%u in file \"%s\": %m"

  but I'm not sure how good that is.


- Added a new commit to redefine set_max_safe_fds() to not subtract
  already_open fds from max_files_per_process

  This prevents io_method=io_uring from failing when RLIMIT_NOFILE is high
  enough, but more than max_files_per_process io_uring instances need to be
  created.


- Improved error message if io_uring_queue_init() fails

  Added errhint()s for likely cases of failure.

  Added errcode().  I was tempted to use errcode_for_file_access(), but that
  doesn't support ENOSYS - perhaps I should add that instead?


- Disable io_uring method when using EXEC_BACKEND, they're not compatible

  I chose to do this with a define in aio.h, but I guess we could also do it
  at configure time? That seems more complicated though - how would we even
  know that EXEC_BACKEND is used on non-Windows?

  Not sure yet how to best disable testing io_uring in this case. We can't
  just query EXEC_BACKEND from pg_config.h unfortunately.  I guess making the
  initdb not fail and checking the error log would work, but that doesn't work
  nicely with Cluster.pm.


- Changed test_aio's short-read injection point to zero out the rest
  of the IO, otherwise some tests fail to fail even if a bug in retries of
  partial reads is introduced


- Improved method_io_uring.c includes a bit (no pgstat.h)



Questions:


- We only "look" at BM_IO_ERROR for writes, isn't that somewhat weird?

  See AbortBufferIO(Buffer buffer)

  It doesn't really matter for the patchset, but it just strikes me as an oddity.



Greetings,

Andres Freund


Re: AIO v2.5

From
Noah Misch
Date:
On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Attached v2.11, with the following changes:

> - Added an error check for FileStartReadV() failing
> 
>   FileStartReadV() actually can fail, if the file can't be re-opened. I
>   thought it'd be important for the error message to differ from the one
>   that's issued for read actually failing, so I went with:
> 
>   "could not start reading blocks %u..%u in file \"%s\": %m"
> 
>   but I'm not sure how good that is.

Message looks good.

> - Improved error message if io_uring_queue_init() fails
> 
>   Added errhint()s for likely cases of failure.
> 
>   Added errcode().  I was tempted to use errcode_for_file_access(), but that
>   doesn't support ENOSYS - perhaps I should add that instead?

Either way is fine with me.  ENOSYS -> ERRCODE_FEATURE_NOT_SUPPORTED is a good
general mapping to have in errcode_for_file_access(), but it's also not a
problem to keep it the way v2.11 has it.
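
For illustration, the mapping under discussion could look like this (a
standalone sketch, not the actual errcode_for_file_access() internals):

```c
/* Sketch only: choose an SQLSTATE for io_uring_queue_init() failure. */
static int
errcode_for_uring_init_sketch(int saved_errno)
{
    if (saved_errno == ENOSYS)
        return ERRCODE_FEATURE_NOT_SUPPORTED;
    return ERRCODE_IO_ERROR;    /* simplified fallback */
}
```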

> - Disable io_uring method when using EXEC_BACKEND, they're not compatible
> 
> >   I chose to do this with a define in aio.h, but I guess we could also do
> >   it at configure time? That seems more complicated though - how would we
> >   even know that EXEC_BACKEND is used on non-Windows?

Agreed, "make PROFILE=-DEXEC_BACKEND" is a valid way to get EXEC_BACKEND.

>   Not sure yet how to best disable testing io_uring in this case. We can't
>   just query EXEC_BACKEND from pg_config.h unfortunately.  I guess making the
>   initdb not fail and checking the error log would work, but that doesn't work
>   nicely with Cluster.pm.

How about "postgres -c io_method=io_uring -C <anything>":

--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -29,7 +29,13 @@ $node_worker->stop();
 # Test io_method=io_uring
 ###
 
-if ($ENV{with_liburing} eq 'yes')
+sub have_io_uring
+{
+    local %ENV = $node_worker->_get_env();  # any node works
+    return run_log [qw(postgres -c io_method=io_uring -C io_method)];
+}
+
+if (have_io_uring())
 {
     my $node_uring = create_node('io_uring');
     $node_uring->start();

> Questions:
> 
> 
> - We only "look" at BM_IO_ERROR for writes, isn't that somewhat weird?
> 
>   See AbortBufferIO(Buffer buffer)
> 
>   It doesn't really matter for the patchset, but it just strikes me as an oddity.

That caught my attention in an earlier review round, but I didn't find it
important enough to raise.  It's mildly unfortunate to be setting BM_IO_ERROR
for reads when the only thing BM_IO_ERROR drives is message "Multiple failures
--- write error might be permanent."  It's minor, so let's leave it that way
for the foreseeable future.

> Subject: [PATCH v2.11 01/27] aio, bufmgr: Comment fixes

Ready to commit, though other comment fixes might come up in later reviews.
One idea so far is to comment on valid states after some IoMethodOps
callbacks:

--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -310,6 +310,9 @@ typedef struct IoMethodOps
     /*
      * Start executing passed in IOs.
      *
+     * Shall advance state to PGAIO_HS_SUBMITTED.  (By the time this returns,
+     * other backends might have advanced the state further.)
+     *
      * Will not be called if ->needs_synchronous_execution() returned true.
      *
      * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE.
@@ -321,6 +324,12 @@ typedef struct IoMethodOps
     /*
      * Wait for the IO to complete. Optional.
      *
+     * On return, state shall be PGAIO_HS_COMPLETED_IO,
+     * PGAIO_HS_COMPLETED_SHARED or PGAIO_HS_COMPLETED_LOCAL.  (The callback
+     * need not change the state if it's already one of those.)  If state is
+     * PGAIO_HS_COMPLETED_IO, state will reach PGAIO_HS_COMPLETED_SHARED
+     * without further intervention.
+     *
      * If not provided, it needs to be guaranteed that the IO method calls
      * pgaio_io_process_completion() without further interaction by the
      * issuing backend.

> Subject: [PATCH v2.11 02/27] aio: Change prefix of PgAioResultStatus values to
>  PGAIO_RS_

Ready to commit

> Subject: [PATCH v2.11 03/27] Redefine max_files_per_process to control
>  additionally opened files

Ready to commit

> Subject: [PATCH v2.11 04/27] aio: Add liburing dependency

> --- a/meson.build
> +++ b/meson.build
> @@ -944,6 +944,18 @@ endif
>  
>  
>  
> +###############################################################
> +# Library: liburing
> +###############################################################
> +
> +liburingopt = get_option('liburing')
> +liburing = dependency('liburing', required: liburingopt)
> +if liburing.found()
> +  cdata.set('USE_LIBURING', 1)
> +endif

This is a different style from other deps; is it equivalent to our standard
style?  Example for lz4:

lz4opt = get_option('lz4')
if not lz4opt.disabled()
  lz4 = dependency('liblz4', required: false)
  # Unfortunately the dependency is named differently with cmake
  if not lz4.found() # combine with above once meson 0.60.0 is required
    lz4 = dependency('lz4', required: lz4opt,
                     method: 'cmake', modules: ['LZ4::lz4_shared'],
                    )
  endif

  if lz4.found()
    cdata.set('USE_LZ4', 1)
    cdata.set('HAVE_LIBLZ4', 1)
  endif

else
  lz4 = not_found_dep
endif

> --- a/configure.ac
> +++ b/configure.ac
> @@ -975,6 +975,14 @@ AC_SUBST(with_readline)
>  PGAC_ARG_BOOL(with, libedit-preferred, no,
>                [prefer BSD Libedit over GNU Readline])
>  
> +#
> +# liburing
> +#
> +AC_MSG_CHECKING([whether to build with liburing support])
> +PGAC_ARG_BOOL(with, liburing, no, [io_uring support, for asynchronous I/O],

Fourth arg generally starts with "build" for args like this.  I suggest "build
with io_uring support, for asynchronous I/O".  Comparable options:

  --with-llvm             build with LLVM based JIT support
  --with-tcl              build Tcl modules (PL/Tcl)
  --with-perl             build Perl modules (PL/Perl)
  --with-python           build Python modules (PL/Python)
  --with-gssapi           build with GSSAPI support
  --with-pam              build with PAM support
  --with-bsd-auth         build with BSD Authentication support
  --with-ldap             build with LDAP support
  --with-bonjour          build with Bonjour support
  --with-selinux          build with SELinux support
  --with-systemd          build with systemd support
  --with-libcurl          build with libcurl support
  --with-libxml           build with XML support
  --with-libxslt          use XSLT support when building contrib/xml2
  --with-lz4              build with LZ4 support
  --with-zstd             build with ZSTD support

> +              [AC_DEFINE([USE_LIBURING], 1, [Define to build with io_uring support. (--with-liburing)])])
> +AC_MSG_RESULT([$with_liburing])
> +AC_SUBST(with_liburing)
>  
>  #
>  # UUID library
> @@ -1463,6 +1471,9 @@ elif test "$with_uuid" = ossp ; then
>  fi
>  AC_SUBST(UUID_LIBS)
>  
> +if test "$with_liburing" = yes; then
> +  PKG_CHECK_MODULES(LIBURING, liburing)
> +fi

We usually put this right after the AC_MSG_CHECKING ... AC_SUBST block.  This
currently has unrelated stuff separating them.  Also, with the exception of
icu, we follow PKG_CHECK_MODULES uses by absorbing flags from pkg-config and
use AC_CHECK_LIB to add the actual "-l".  By not absorbing flags, I think a
liburing in a nonstandard location would require --with-libraries and
--with-includes, unlike the other PKG_CHECK_MODULES-based dependencies.  lz4
is a representative example of our standard:

```
AC_MSG_CHECKING([whether to build with LZ4 support])
PGAC_ARG_BOOL(with, lz4, no, [build with LZ4 support],
              [AC_DEFINE([USE_LZ4], 1, [Define to 1 to build with LZ4 support. (--with-lz4)])])
AC_MSG_RESULT([$with_lz4])
AC_SUBST(with_lz4)

if test "$with_lz4" = yes; then
  PKG_CHECK_MODULES(LZ4, liblz4)
  # We only care about -I, -D, and -L switches;
  # note that -llz4 will be added by AC_CHECK_LIB below.
  for pgac_option in $LZ4_CFLAGS; do
    case $pgac_option in
      -I*|-D*) CPPFLAGS="$CPPFLAGS $pgac_option";;
    esac
  done
  for pgac_option in $LZ4_LIBS; do
    case $pgac_option in
      -L*) LDFLAGS="$LDFLAGS $pgac_option";;
    esac
  done
fi

# ... later in file ...

if test "$with_lz4" = yes ; then
  AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])])
fi
```

I think it's okay to not use the AC_CHECK_LIB and rely on explicit
src/backend/Makefile code like you've done, but we shouldn't miss
CPPFLAGS/LDFLAGS (or should have a comment on why missing them is right).

> --- a/doc/src/sgml/installation.sgml
> +++ b/doc/src/sgml/installation.sgml

lz4 and other deps have a mention in <sect1 id="install-requirements">, in
addition to sections edited here.

> Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring

(Still reviewing this one.)



Re: AIO v2.5

From
Noah Misch
Date:
On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Attached v2.11

> Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring

Apart from some isolated cosmetic points, this is ready to commit:

> +            ereport(ERROR,
> +                    errcode(err),
> +                    errmsg("io_uring_queue_init failed: %m"),
> +                    hint != NULL ? errhint("%s", hint) : 0);

https://www.postgresql.org/docs/current/error-style-guide.html gives the example:

BAD:    open() failed: %m
BETTER: could not open file %s: %m

Hence, this errmsg should change, perhaps to:
"could not setup io_uring queues: %m".

> +        pgaio_debug_io(DEBUG3, ioh,
> +                       "wait_one io_gen: %llu, ref_gen: %llu, cycle %d",
> +                       (long long unsigned) ref_generation,
> +                       (long long unsigned) ioh->generation,

In the message string, io_gen appears before ref_gen.  In the subsequent args,
the order is swapped relative to the message string.

> --- a/src/backend/utils/activity/wait_event_names.txt
> +++ b/src/backend/utils/activity/wait_event_names.txt
> @@ -192,6 +192,8 @@ ABI_compatibility:
>  
>  Section: ClassName - WaitEventIO
>  
> +AIO_IO_URING_SUBMIT    "Waiting for IO submission via io_uring."
> +AIO_IO_URING_COMPLETION    "Waiting for IO completion via io_uring."
>  AIO_IO_COMPLETION    "Waiting for IO completion."

I'm wondering if there's an opportunity to enrich the last two wait event
names and/or descriptions.  The current descriptions suggest to me more
similarity than is actually there.  Inputs to the decision:

- AIO_IO_COMPLETION waits for an IO in PGAIO_HS_DEFINED, PGAIO_HS_STAGED, or
  PGAIO_HS_COMPLETED_IO to reach PGAIO_HS_COMPLETED_SHARED.  The three
  starting states are the states where some other backend owns the next
  action, so the current backend can only wait to be signaled.

- AIO_IO_URING_COMPLETION waits for the kernel to do enough so we can move
  from PGAIO_HS_SUBMITTED to PGAIO_HS_COMPLETED_IO.

Possible names and descriptions, based on PgAioHandleState enum names and
comments:

AIO_IO_URING_COMPLETED_IO    "Waiting for IO result via io_uring."
AIO_COMPLETED_SHARED    "Waiting for IO shared completion callback."

If "shared completion callback" is too internals-focused, perhaps this:

AIO_IO_URING_COMPLETED_IO    "Waiting for IO result via io_uring."
AIO_COMPLETED_SHARED    "Waiting for IO completion to update shared memory."

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2710,6 +2710,12 @@ include_dir 'conf.d'
>              <literal>worker</literal> (execute asynchronous I/O using worker processes)
>             </para>
>            </listitem>
> +          <listitem>
> +           <para>
> +            <literal>io_uring</literal> (execute asynchronous I/O using
> +            io_uring, if available)

I feel the "if available" doesn't quite fit, since we'll fail if unavailable.
Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux"
there to reduce surprise on other platforms.

> Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd

(Still reviewing this one.)



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-22 17:20:56 -0700, Noah Misch wrote:
> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> >   Not sure yet how to best disable testing io_uring in this case. We can't
> >   just query EXEC_BACKEND from pg_config.h unfortunately.  I guess making the
> >   initdb not fail and checking the error log would work, but that doesn't work
> >   nicely with Cluster.pm.
> 
> How about "postgres -c io_method=io_uring -C <anything>":
> 
> --- a/src/test/modules/test_aio/t/001_aio.pl
> +++ b/src/test/modules/test_aio/t/001_aio.pl
> @@ -29,7 +29,13 @@ $node_worker->stop();
>  # Test io_method=io_uring
>  ###
>  
> -if ($ENV{with_liburing} eq 'yes')
> +sub have_io_uring
> +{
> +    local %ENV = $node_worker->_get_env();  # any node works
> +    return run_log [qw(postgres -c io_method=io_uring -C io_method)];
> +}
> +
> +if (have_io_uring())
>  {
>      my $node_uring = create_node('io_uring');
>      $node_uring->start();

Yea, that's a good idea.

One thing that doesn't seem great is that it requires a prior node - what if
we do '-c io_method=invalid'? That would report the list of valid GUC options,
so we could just grep for io_uring.

It's too bad that postgres --describe-config
a) doesn't report the possible enum values
b) doesn't apply/validate -c options
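
Sketched out (untested; assumes the resulting error message, which lists the
valid enum values, goes to stderr):

```perl
sub have_io_uring
{
    # An invalid io_method makes "postgres -C" fail with an error whose
    # hint lists the valid values; grep that for io_uring.
    my ($stdout, $stderr) =
      run_command([qw(postgres -c io_method=invalid -C io_method)]);
    return $stderr =~ m/io_uring/;
}
```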


> > Subject: [PATCH v2.11 01/27] aio, bufmgr: Comment fixes
> 
> Ready to commit, though other comment fixes might come up in later reviews.

I'll reorder it to a bit later in the series, to accumulate a few more.


> One idea so far is to comment on valid states after some IoMethodOps
> callbacks:
> 
> --- a/src/include/storage/aio_internal.h
> +++ b/src/include/storage/aio_internal.h
> @@ -310,6 +310,9 @@ typedef struct IoMethodOps
>      /*
>       * Start executing passed in IOs.
>       *
> +     * Shall advance state to PGAIO_HS_SUBMITTED.  (By the time this returns,
> +     * other backends might have advanced the state further.)
> +     *
>       * Will not be called if ->needs_synchronous_execution() returned true.
>       *
>       * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE.
> @@ -321,6 +324,12 @@ typedef struct IoMethodOps
>      /*
>       * Wait for the IO to complete. Optional.
>       *
> +     * On return, state shall be PGAIO_HS_COMPLETED_IO,
> +     * PGAIO_HS_COMPLETED_SHARED or PGAIO_HS_COMPLETED_LOCAL.  (The callback
> +     * need not change the state if it's already one of those.)  If state is
> +     * PGAIO_HS_COMPLETED_IO, state will reach PGAIO_HS_COMPLETED_SHARED
> +     * without further intervention.
> +     *
>       * If not provided, it needs to be guaranteed that the IO method calls
>       * pgaio_io_process_completion() without further interaction by the
>       * issuing backend.

I think these are a good idea. I added those to the copy-edit patch, with a
few more tweaks:

@@ -315,6 +315,9 @@ typedef struct IoMethodOps
     /*
      * Start executing passed in IOs.
      *
+     * Shall advance state to at least PGAIO_HS_SUBMITTED.  (By the time this
+     * returns, other backends might have advanced the state further.)
+     *
      * Will not be called if ->needs_synchronous_execution() returned true.
      *
      * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE.
@@ -323,12 +326,24 @@ typedef struct IoMethodOps
      */
     int         (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
 
-    /*
+    /* ---
      * Wait for the IO to complete. Optional.
      *
+     * On return, state shall be one of
+     * - PGAIO_HS_COMPLETED_IO
+     * - PGAIO_HS_COMPLETED_SHARED
+     * - PGAIO_HS_COMPLETED_LOCAL
+     *
+     * The callback must not block if the handle is already in one of those
+     * states, or has been reused (see pgaio_io_was_recycled()).  If, on
+     * return, the state is PGAIO_HS_COMPLETED_IO, state will reach
+     * PGAIO_HS_COMPLETED_SHARED without further intervention by the IO
+     * method.
+     *
      * If not provided, it needs to be guaranteed that the IO method calls
      * pgaio_io_process_completion() without further interaction by the
      * issuing backend.
+     * ---
      */
     void        (*wait_one) (PgAioHandle *ioh,
                              uint64 ref_generation);



> > Subject: [PATCH v2.11 03/27] Redefine max_files_per_process to control
> >  additionally opened files
> 
> Ready to commit

Cool!


> > Subject: [PATCH v2.11 04/27] aio: Add liburing dependency
> 
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -944,6 +944,18 @@ endif
> >  
> >  
> >  
> > +###############################################################
> > +# Library: liburing
> > +###############################################################
> > +
> > +liburingopt = get_option('liburing')
> > +liburing = dependency('liburing', required: liburingopt)
> > +if liburing.found()
> > +  cdata.set('USE_LIBURING', 1)
> > +endif
> 
> This is a different style from other deps; is it equivalent to our standard
> style?

Yes - the only reason to be more complicated in the lz4 case is that we want
to fall back to other ways of looking up the dependency (primarily because of
Windows). But that's not required for liburing, which obviously is Linux-only.


> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -975,6 +975,14 @@ AC_SUBST(with_readline)
> >  PGAC_ARG_BOOL(with, libedit-preferred, no,
> >                [prefer BSD Libedit over GNU Readline])
> >  
> > +#
> > +# liburing
> > +#
> > +AC_MSG_CHECKING([whether to build with liburing support])
> > +PGAC_ARG_BOOL(with, liburing, no, [io_uring support, for asynchronous I/O],
> 
> Fourth arg generally starts with "build" for args like this.  I suggest "build
> with io_uring support, for asynchronous I/O".

WFM.


> > +              [AC_DEFINE([USE_LIBURING], 1, [Define to build with io_uring support. (--with-liburing)])])
> > +AC_MSG_RESULT([$with_liburing])
> > +AC_SUBST(with_liburing)
> >  
> >  #
> >  # UUID library
> > @@ -1463,6 +1471,9 @@ elif test "$with_uuid" = ossp ; then
> >  fi
> >  AC_SUBST(UUID_LIBS)
> >  
> > +if test "$with_liburing" = yes; then
> > +  PKG_CHECK_MODULES(LIBURING, liburing)
> > +fi
> 
> We usually put this right after the AC_MSG_CHECKING ... AC_SUBST block.

We don't really seem to do that for "dependency checks" in general, e.g.
PGAC_CHECK_PERL_CONFIGS, PGAC_CHECK_PYTHON_EMBED_SETUP, PGAC_CHECK_READLINE,
dependency dependent AC_CHECK_LIB calls, .. later in configure.ac than the
defnition of the option.  TBH, I've always struggled trying to discern what
the organizing principle of configure.ac is.

But you're right that the PKG_CHECK_MODULES calls are closer-by. And I'm happy
to move towards having the code for each dep all in one place, so moved.


A related thing: We seem to have no order of the $with_ checks that I can
discern. Should the liburing check be at a different place?


> This currently has unrelated stuff separating them.  Also, with the
> exception of icu, we follow PKG_CHECK_MODULES uses by absorbing flags from
> pkg-config and use AC_CHECK_LIB to add the actual "-l".

I think for liburing I was trying to follow ICU's example - injecting CFLAGS
and LIBS just in the parts of the build dir that need them.

For LIBS I think I did so:

diff --git a/src/backend/Makefile b/src/backend/Makefile
...
+# The backend conditionally needs libraries that most executables don't need.
+LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)

But ugh, for some reason I didn't do that for LIBURING_CFLAGS. In the v1.x
version of aio I had
aio:src/backend/storage/aio/Makefile:override CPPFLAGS += $(LIBURING_CFLAGS)

but somehow lost that somewhere along the way to v2.x


I think I like targeting where ${LIB}_LIBS and ${LIB}_CFLAGS are applied more
narrowly better than just adding to the global CFLAGS, CPPFLAGS, LDFLAGS.  I'm
somewhat inclined to add LIBURING_CFLAGS in src/backend rather than
src/backend/storage/aio/ though.

But I'm also willing to do it entirely differently.


> > --- a/doc/src/sgml/installation.sgml
> > +++ b/doc/src/sgml/installation.sgml
> 
> lz4 and other deps have a mention in <sect1 id="install-requirements">, in
> addition to sections edited here.

Good point.

Although once more I feel defeated by the ordering used :)

Hm, that list is rather incomplete. At least libxml, libxslt, selinux, curl,
uuid, systemd and bonjour aren't listed.

Not sure if it makes sense to add liburing, given that?

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Sun, Mar 23, 2025 at 11:11:53AM -0400, Andres Freund wrote:
> On 2025-03-22 17:20:56 -0700, Noah Misch wrote:
> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > >   Not sure yet how to best disable testing io_uring in this case. We can't
> > >   just query EXEC_BACKEND from pg_config.h unfortunately.  I guess making the
> > >   initdb not fail and checking the error log would work, but that doesn't work
> > >   nicely with Cluster.pm.
> > 
> > How about "postgres -c io_method=io_uring -C <anything>":
> > 
> > --- a/src/test/modules/test_aio/t/001_aio.pl
> > +++ b/src/test/modules/test_aio/t/001_aio.pl
> > @@ -29,7 +29,13 @@ $node_worker->stop();
> >  # Test io_method=io_uring
> >  ###
> >  
> > -if ($ENV{with_liburing} eq 'yes')
> > +sub have_io_uring
> > +{
> > +    local %ENV = $node_worker->_get_env();  # any node works
> > +    return run_log [qw(postgres -c io_method=io_uring -C io_method)];
> > +}
> > +
> > +if (have_io_uring())
> >  {
> >      my $node_uring = create_node('io_uring');
> >      $node_uring->start();
> 
> Yea, that's a good idea.
> 
> One thing that doesn't seem great is that it requires a prior node - what if
> we do '-c io_method=invalid'? That would report the list of valid GUC options,
> so we could just grep for io_uring.

Works for me.

> > One idea so far is to comment on valid states after some IoMethodOps
> > callbacks:

> I think these are a good idea. I added those to the copy-edit patch, with a
> few more tweaks:

The tweaks made it better.

> > > Subject: [PATCH v2.11 04/27] aio: Add liburing dependency

> > > +              [AC_DEFINE([USE_LIBURING], 1, [Define to build with io_uring support. (--with-liburing)])])
> > > +AC_MSG_RESULT([$with_liburing])
> > > +AC_SUBST(with_liburing)
> > >  
> > >  #
> > >  # UUID library
> > > @@ -1463,6 +1471,9 @@ elif test "$with_uuid" = ossp ; then
> > >  fi
> > >  AC_SUBST(UUID_LIBS)
> > >  
> > > +if test "$with_liburing" = yes; then
> > > +  PKG_CHECK_MODULES(LIBURING, liburing)
> > > +fi
> > 
> > We usually put this right after the AC_MSG_CHECKING ... AC_SUBST block.
> 
> We don't really seem to do that for "dependency checks" in general, e.g.
> PGAC_CHECK_PERL_CONFIGS, PGAC_CHECK_PYTHON_EMBED_SETUP, PGAC_CHECK_READLINE,
> dependency-dependent AC_CHECK_LIB calls, ... later in configure.ac than the
> definition of the option.

AC_CHECK_LIB stays far away, yes.

> But you're right that the PKG_CHECK_MODULES calls are closer-by. And I'm happy
> to move towards having the code for each dep all in one place, so moved.
> 
> 
> A related thing: we seem to have no discernible order for the $with_ checks.
> Should the liburing check be at a different place?

No opinion on that one.  It's fine.

> > This currently has unrelated stuff separating them.  Also, with the
> > exception of icu, we follow PKG_CHECK_MODULES uses by absorbing flags from
> > pkg-config and use AC_CHECK_LIB to add the actual "-l".
> 
> I think for liburing I was trying to follow ICU's example - injecting CFLAGS
> and LIBS just in the parts of the build dir that need them.
> 
> For LIBS I think I did so:
> 
> diff --git a/src/backend/Makefile b/src/backend/Makefile
> ...
> +# The backend conditionally needs libraries that most executables don't need.
> +LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)
> 
> But ugh, for some reason I didn't do that for LIBURING_CFLAGS. In the v1.x
> version of aio I had
> aio:src/backend/storage/aio/Makefile:override CPPFLAGS += $(LIBURING_CFLAGS)
> 
> but somehow lost that somewhere along the way to v2.x
> 
> 
> I think I prefer targeting where ${LIB}_LIBS and ${LIB}_CFLAGS are applied more
> narrowly over just adding to the global CFLAGS, CPPFLAGS, LDFLAGS.

Agreed.

> somewhat inclined to add LIBURING_CFLAGS in src/backend rather than
> src/backend/storage/aio/ though.
> 
> But I'm also willing to do it entirely differently.

The CPPFLAGS addition, located wherever makes sense, resolves that point.

> > > --- a/doc/src/sgml/installation.sgml
> > > +++ b/doc/src/sgml/installation.sgml
> > 
> > lz4 and other deps have a mention in <sect1 id="install-requirements">, in
> > addition to sections edited here.
> 
> Good point.
> 
> Although once more I feel defeated by the ordering used :)
> 
> Hm, that list is rather incomplete. At least libxml, libxslt, selinux, curl,
> uuid, systemd and bonjour aren't listed.
> 
> Not sure if it makes sense to add liburing, given that?

That's a lot of preexisting incompleteness.  I withdraw the point about <sect1
id="install-requirements">.


Unrelated to the above, another question about io_uring:

commit da722699 wrote:
> +/*
> + * Need to submit staged but not yet submitted IOs using the fd, otherwise
> + * the IO would end up targeting something bogus.
> + */
> +void
> +pgaio_closing_fd(int fd)

An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in
PGAIO_HS_COMPLETED_IO clearly doesn't block that close.  For io_method=worker,
closing in PGAIO_HS_SUBMITTED is okay.  For io_method=io_uring, is there a
reference about it being okay to close during PGAIO_HS_SUBMITTED?  I looked
awhile for an authoritative view on that, but I didn't find one.  If we can
rely on io_uring_submit() returning only after the kernel has given the
io_uring its own reference to all applicable file descriptors, I expect it's
okay to close the process's FD.  If the io_uring acquires its reference later
than that, I expect we shouldn't close before that later time.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-22 19:09:55 -0700, Noah Misch wrote:
> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > Attached v2.11
>
> > Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring
>
> Apart from some isolated cosmetic points, this is ready to commit:
>
> > +            ereport(ERROR,
> > +                    errcode(err),
> > +                    errmsg("io_uring_queue_init failed: %m"),
> > +                    hint != NULL ? errhint("%s", hint) : 0);
>
> https://www.postgresql.org/docs/current/error-style-guide.html gives the example:
>
> BAD:    open() failed: %m
> BETTER: could not open file %s: %m
>
> Hence, this errmsg should change, perhaps to:
> "could not setup io_uring queues: %m".

You're right. I didn't intentionally "violate" the policy, but I have to
admit I'm not a huge fan of that aspect: it obfuscates what actually failed,
forcing one to look at the code or strace to figure out the precise failure.

(Changed)
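
With your suggested wording the call now reads roughly (sketch):

ereport(ERROR,
        errcode(err),
        errmsg("could not setup io_uring queues: %m"),
        hint != NULL ? errhint("%s", hint) : 0);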


> > +        pgaio_debug_io(DEBUG3, ioh,
> > +                       "wait_one io_gen: %llu, ref_gen: %llu, cycle %d",
> > +                       (long long unsigned) ref_generation,
> > +                       (long long unsigned) ioh->generation,
>
> In the message string, io_gen appears before ref_gen.  In the subsequent args,
> the order is swapped relative to the message string.

Oops, you're right.
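
I.e. the two generation arguments just need to be swapped to match the format
string (trailing arguments elided, as in the quote above):

pgaio_debug_io(DEBUG3, ioh,
               "wait_one io_gen: %llu, ref_gen: %llu, cycle %d",
               (long long unsigned) ioh->generation,
               (long long unsigned) ref_generation,
               ...);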


> > --- a/src/backend/utils/activity/wait_event_names.txt
> > +++ b/src/backend/utils/activity/wait_event_names.txt
> > @@ -192,6 +192,8 @@ ABI_compatibility:
> >
> >  Section: ClassName - WaitEventIO
> >
> > +AIO_IO_URING_SUBMIT    "Waiting for IO submission via io_uring."
> > +AIO_IO_URING_COMPLETION    "Waiting for IO completion via io_uring."
> >  AIO_IO_COMPLETION    "Waiting for IO completion."
>
> I'm wondering if there's an opportunity to enrich the last two wait event
> names and/or descriptions.  The current descriptions suggest to me more
> similarity than is actually there.  Inputs to the decision:
>
> - AIO_IO_COMPLETION waits for an IO in PGAIO_HS_DEFINED, PGAIO_HS_STAGED, or
>   PGAIO_HS_COMPLETED_IO to reach PGAIO_HS_COMPLETED_SHARED.  The three
>   starting states are the states where some other backend owns the next
>   action, so the current backend can only wait to be signaled.
>
> - AIO_IO_URING_COMPLETION waits for the kernel to do enough so we can move
>   from PGAIO_HS_SUBMITTED to PGAIO_HS_COMPLETED_IO.
>
> Possible names and descriptions, based on PgAioHandleState enum names and
> comments:
>
> AIO_IO_URING_COMPLETED_IO    "Waiting for IO result via io_uring."
> AIO_COMPLETED_SHARED    "Waiting for IO shared completion callback."
>
> If "shared completion callback" is too internals-focused, perhaps this:
>
> AIO_IO_URING_COMPLETED_IO    "Waiting for IO result via io_uring."
> AIO_COMPLETED_SHARED    "Waiting for IO completion to update shared memory."

Hm, right now AIO_IO_COMPLETION also covers the actual "raw" execution of the
IO with io_method=worker/sync. For that AIO_COMPLETED_SHARED would be
inappropriate.

We could use a different wait event if we wait for an IO via CV in
PGAIO_HS_SUBMITTED, with a small refactoring of pgaio_io_wait().  But I'm not
sure that would get you that far - we don't broadcast the CV when
transitioning from PGAIO_HS_SUBMITTED -> PGAIO_HS_COMPLETED_IO, so the wait
event would stay the same, now wrong, until the shared callback
completes. Obviously waking everyone up just so they can use a different wait
event doesn't make sense.

A more minimal change would be to narrow AIO_IO_URING_COMPLETION to
"execution" or something like that, to hint at a separation between the raw IO
completing and the IO as a whole, including its callbacks, completing.


> > --- a/doc/src/sgml/config.sgml
> > +++ b/doc/src/sgml/config.sgml
> > @@ -2710,6 +2710,12 @@ include_dir 'conf.d'
> >              <literal>worker</literal> (execute asynchronous I/O using worker processes)
> >             </para>
> >            </listitem>
> > +          <listitem>
> > +           <para>
> > +            <literal>io_uring</literal> (execute asynchronous I/O using
> > +            io_uring, if available)
>
> I feel the "if available" doesn't quite fit, since we'll fail if unavailable.
> Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux"
> there to reduce surprise on other platforms.

You're right, the "if available" can be misunderstood. But not mentioning that
it's an optional dependency seems odd too. What about something like

           <para>
            <literal>io_uring</literal> (execute asynchronous I/O using
            io_uring, requires postgres to have been built with
            <link linkend="configure-option-with-liburing"><option>--with-liburing</option></link> /
            <link linkend="configure-with-liburing-meson"><option>-Dliburing</option></link>)
           </para>

Should the docs for --with-liburing/-Dliburing mention it's linux only? We
don't seem to do that for things like systemd (linux), selinux (linux) and
only kinda for bonjour (macos).

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Sun, Mar 23, 2025 at 11:57:48AM -0400, Andres Freund wrote:
> On 2025-03-22 19:09:55 -0700, Noah Misch wrote:
> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > > Attached v2.11
> >
> > > Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring

> > > --- a/src/backend/utils/activity/wait_event_names.txt
> > > +++ b/src/backend/utils/activity/wait_event_names.txt
> > > @@ -192,6 +192,8 @@ ABI_compatibility:
> > >
> > >  Section: ClassName - WaitEventIO
> > >
> > > +AIO_IO_URING_SUBMIT    "Waiting for IO submission via io_uring."
> > > +AIO_IO_URING_COMPLETION    "Waiting for IO completion via io_uring."
> > >  AIO_IO_COMPLETION    "Waiting for IO completion."
> >
> > I'm wondering if there's an opportunity to enrich the last two wait event
> > names and/or descriptions.  The current descriptions suggest to me more
> > similarity than is actually there.  Inputs to the decision:
> >
> > - AIO_IO_COMPLETION waits for an IO in PGAIO_HS_DEFINED, PGAIO_HS_STAGED, or
> >   PGAIO_HS_COMPLETED_IO to reach PGAIO_HS_COMPLETED_SHARED.  The three
> >   starting states are the states where some other backend owns the next
> >   action, so the current backend can only wait to be signaled.
> >
> > - AIO_IO_URING_COMPLETION waits for the kernel to do enough so we can move
> >   from PGAIO_HS_SUBMITTED to PGAIO_HS_COMPLETED_IO.
> >
> > Possible names and descriptions, based on PgAioHandleState enum names and
> > comments:
> >
> > AIO_IO_URING_COMPLETED_IO    "Waiting for IO result via io_uring."
> > AIO_COMPLETED_SHARED    "Waiting for IO shared completion callback."
> >
> > If "shared completion callback" is too internals-focused, perhaps this:
> >
> > AIO_IO_URING_COMPLETED_IO    "Waiting for IO result via io_uring."
> > AIO_COMPLETED_SHARED    "Waiting for IO completion to update shared memory."
> 
> Hm, right now AIO_IO_COMPLETION also covers the actual "raw" execution of the
> IO with io_method=worker/sync.

Right, it could start with the IO in PGAIO_HS_DEFINED and end with the IO in
PGAIO_HS_COMPLETED_SHARED.  So another part of the wait may be the definer
doing work before exiting batch mode.

> For that AIO_COMPLETED_SHARED would be
> inappropriate.

The concept I had in mind was "waiting to reach PGAIO_HS_COMPLETED_SHARED,
whatever obstacles that involves".

Another candidate description string:

AIO_COMPLETED_SHARED    "Waiting for another process to complete IO."

> We could use a different wait event if we wait for an IO via CV in
> PGAIO_HS_SUBMITTED, with a small refactoring of pgaio_io_wait().  But I'm not
> sure that would get you that far - we don't broadcast the CV when
> transitioning from PGAIO_HS_SUBMITTED -> PGAIO_HS_COMPLETED_IO, so the wait
> event would stay the same, now wrong, until the shared callback
> completes. Obviously waking everyone up just so they can use a different wait
> event doesn't make sense.

Agreed.  The mapping of code ranges to wait events seems fine to me.  I'm mainly
trying to optimize the wait event description strings to fit those code ranges.

> A more minimal change would be to narrow AIO_IO_URING_COMPLETION to
> "execution" or something like that, to hint at a separation between the raw IO
> completing and the IO as a whole, including its callbacks, completing.

Yes, that would work for me.

> > > --- a/doc/src/sgml/config.sgml
> > > +++ b/doc/src/sgml/config.sgml
> > > @@ -2710,6 +2710,12 @@ include_dir 'conf.d'
> > >              <literal>worker</literal> (execute asynchronous I/O using worker processes)
> > >             </para>
> > >            </listitem>
> > > +          <listitem>
> > > +           <para>
> > > +            <literal>io_uring</literal> (execute asynchronous I/O using
> > > +            io_uring, if available)
> >
> > I feel the "if available" doesn't quite fit, since we'll fail if unavailable.
> > Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux"
> > there to reduce surprise on other platforms.
> 
> You're right, the "if available" can be misunderstood. But not mentioning that
> it's an optional dependency seems odd too. What about something like
> 
>            <para>
>             <literal>io_uring</literal> (execute asynchronous I/O using
>             io_uring, requires postgres to have been built with
>             <link linkend="configure-option-with-liburing"><option>--with-liburing</option></link> /
>             <link linkend="configure-with-liburing-meson"><option>-Dliburing</option></link>)
>            </para>

I'd change s/postgres to have been built/a build with/ since the SGML docs
don't use the term "postgres" that way.  Otherwise, that works for me.

> Should the docs for --with-liburing/-Dliburing mention it's linux only? We
> don't seem to do that for things like systemd (linux), selinux (linux) and
> only kinda for bonjour (macos).

No need, I think.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-23 08:55:29 -0700, Noah Misch wrote:
> On Sun, Mar 23, 2025 at 11:11:53AM -0400, Andres Freund wrote:
> Unrelated to the above, another question about io_uring:
> 
> commit da722699 wrote:
> > +/*
> > + * Need to submit staged but not yet submitted IOs using the fd, otherwise
> > + * the IO would end up targeting something bogus.
> > + */
> > +void
> > +pgaio_closing_fd(int fd)
> 
> An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in
> PGAIO_HS_COMPLETED_IO clearly doesn't block that close.  For io_method=worker,
> closing in PGAIO_HS_SUBMITTED is okay.  For io_method=io_uring, is there a
> reference about it being okay to close during PGAIO_HS_SUBMITTED?  I looked
> awhile for an authoritative view on that, but I didn't find one.  If we can
> rely on io_uring_submit() returning only after the kernel has given the
> io_uring its own reference to all applicable file descriptors, I expect it's
> okay to close the process's FD.  If the io_uring acquires its reference later
> than that, I expect we shouldn't close before that later time.

I'm fairly sure io_uring has its own reference for the file descriptor by the
time io_uring_enter() returns [1].  What io_uring does *not* reliably tolerate
is the issuing process *exiting* before the IO completes, even if there are
other processes attached to the same io_uring instance.
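
For illustration, a minimal standalone liburing program (not postgres code;
error handling mostly elided, any readable file works) that relies on exactly
that property - the read completes even though we close our fd right after
io_uring_submit():

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int
main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char        buf[4096];
    int         fd;

    if (io_uring_queue_init(4, &ring, 0) < 0)
        return 1;

    fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    /* the kernel takes its own file reference during submission */
    io_uring_submit(&ring);

    /* closing our fd afterwards does not cancel the submitted read */
    close(fd);

    if (io_uring_wait_cqe(&ring, &cqe) == 0)
    {
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}

(build with -luring)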

AIO v1 had a posix_aio backend, which, on several platforms, did *not*
tolerate the FD being closed before the IO completes. Because of that
IoMethodOps had a closing_fd callback, which posix_aio used to wait for the
IO's completion [2].


I've added a test case exercising this path for all io methods. But I can't
think of a way to reliably catch io_uring not actually holding a reference to
the fd - the IO will almost always complete too quickly for that. Still, it
seems better than not testing the path at all - it does at least catch
pgaio_closing_fd() not doing anything.

Greetings,

Andres Freund

[1] See
  https://github.com/torvalds/linux/blob/586de92313fcab8ed84ac5f78f4d2aae2db92c59/io_uring/io_uring.c#L1728
  called from
  https://github.com/torvalds/linux/blob/586de92313fcab8ed84ac5f78f4d2aae2db92c59/io_uring/io_uring.c#L2204
  called from
  https://github.com/torvalds/linux/blob/586de92313fcab8ed84ac5f78f4d2aae2db92c59/io_uring/io_uring.c#L3372
  in the io_uring_enter() syscall

[2]
https://github.com/anarazel/postgres/blob/a08cd717b5af4e51afb25ec86623973158a72ab9/src/backend/storage/aio/aio_posix.c#L738



Re: AIO v2.5

From
Noah Misch
Date:
commit 247ce06b wrote:
> +            pgaio_io_reopen(ioh);
> +
> +            /*
> +             * To be able to exercise the reopen-fails path, allow injection
> +             * points to trigger a failure at this point.
> +             */
> +            pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> +
> +            error_errno = 0;
> +            error_ioh = NULL;
> +
> +            /*
> +             * We don't expect this to ever fail with ERROR or FATAL, no need
> +             * to keep error_ioh set to the IO.
> +             * pgaio_io_perform_synchronously() contains a critical section to
> +             * ensure we don't accidentally fail.
> +             */
> +            pgaio_io_perform_synchronously(ioh);

A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
smgr_aio_reopen() stores.  Hence, I think smgrfd() should assert that
interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
HOLD_INTERRUPTS() should surround the above region of code.  It's likely hard
to reproduce a problem, because pgaio_io_call_inj() does nothing in many
builds, and pgaio_io_perform_synchronously() starts by entering a critical
section.
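
Concretely, I'm imagining roughly this shape for the quoted region (a sketch,
not a finished patch; names as in commit 247ce06b):

HOLD_INTERRUPTS();              /* interrupt processing could trigger
                                 * smgrreleaseall(), invalidating the fd */

pgaio_io_reopen(ioh);
pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");

error_errno = 0;
error_ioh = NULL;

pgaio_io_perform_synchronously(ioh);

RESUME_INTERRUPTS();

with smgrfd(), instead of doing its own HOLD_INTERRUPTS(), asserting:

Assert(!INTERRUPTS_CAN_BE_PROCESSED());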

On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Attached v2.11

> Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd

> +int
> +FileStartReadV(PgAioHandle *ioh, File file,
> +               int iovcnt, off_t offset,
> +               uint32 wait_event_info)
> +{
> +    int            returnCode;
> +    Vfd           *vfdP;
> +
> +    Assert(FileIsValid(file));
> +
> +    DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> +               file, VfdCache[file].fileName,
> +               (int64) offset,
> +               iovcnt));
> +
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    vfdP = &VfdCache[file];
> +
> +    pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);

FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode.  I didn't expect that from
functions so named.  The "start" verb sounds to me like unconditional
PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED.  I like
the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
PGAIO_HS_STAGED succinctly covers what to expect.  Hence, I recommend names
FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread.  How do
you see it?

> +/*
> + * AIO error reporting callback for mdstartreadv().
> + *
> + * Errors are encoded as follows:
> + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0

I recommend replacing "errno != 0" with either "that errno" or "errno ==
error_data".


> Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design

Ready for commit apart from some trivia:

> +if (ioret.result.status == PGAIO_RS_ERROR)
> +    pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);

I think ioret and aio_ret are supposed to be the same object.  If that's
right, change one of the names.  Likewise elsewhere in this file.

> +The central API piece for postgres' AIO abstraction are AIO handles. To
> +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
> +then "defined", i.e. associate an IO operation with the handle.

s/"defined"/"define" it/ or similar

> +The "solution" to this the ability to associate multiple completion callbacks

s/this the/this is the/


> Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well

> @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
>          Assert(refcount > 0);
>          if (refcount != 1)
>              return false;
> +
> +        /*
> +         * Check that the AIO subsystem doesn't have a pin. Likely not
> +         * possible today, but better safe than sorry.
> +         */
> +        bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> +        buf_state = pg_atomic_read_u32(&bufHdr->state);
> +        refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> +        Assert(refcount > 0);
> +        if (refcount != 1)
> +            return false;
> +

LockBufferForCleanup() should get code like this
ConditionalLockBufferForCleanup() code, either now or when "not possible
today" ends.  Currently, it just assumes all local buffers are
cleanup-lockable:

    /* Nobody else to wait for */
    if (BufferIsLocal(buffer))
        return;

> @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
>  
>      buf_state = pg_atomic_read_u32(&bufHdr->state);
>  
> -    if (check_unreferenced && LocalRefCount[bufid] != 0)
> +    /*
> +     * We need to test not just LocalRefCount[bufid] but also the BufferDesc
> +     * itself, as the latter is used to represent a pin by the AIO subsystem.
> +     * This can happen if AIO is initiated and then the query errors out.
> +     */
> +    if (check_unreferenced &&
> +        (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0))
>          elog(ERROR, "block %u of %s is still referenced (local %u)",

I didn't write a test to prove it, but I'm suspecting we'll reach the above
ERROR with this sequence:

  CREATE TEMP TABLE foo ...;
  [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
  DROP TABLE foo;

DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
the particular rel) before InvalidateLocalBuffer().  Or use something like the
logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
corresponding bufmgr code.  I think that bufmgr ERROR is unreachable, since
only a private refcnt triggers that bufmgr ERROR.  Is there something
preventing the localbuf error from being a problem?  (This wouldn't require
changes to the current patch; responsibility would fall in a bufmgr AIO
patch.)


> Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support

(Still reviewing this and later patches, but incidental observations follow.)

> +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> +                          bool failed, bool is_temp)
> +{
...
> +    PgAioResult result;
...
> +    result.status = PGAIO_RS_OK;
...
> +    return result;

gcc 14.2.0 -Werror gives me:

  bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized]

Zeroing the unset fields silenced it:

--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -7221,3 +7221,3 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
     char       *bufdata = BufferGetBlock(buffer);
-    PgAioResult result;
+    PgAioResult result = { .status = PGAIO_RS_OK };
     uint32        set_flag_bits;
@@ -7238,4 +7238,2 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
 
-    result.status = PGAIO_RS_OK;
-
     /* check for garbage data */


> Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO

> @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream)
>  static void
>  read_stream_look_ahead(ReadStream *stream)
>  {
> +    /*
> +     * Allow amortizing the cost of submitting IO over multiple IOs. This
> +     * requires that we don't do any operations that could lead to a deadlock
> +     * with staged-but-unsubmitted IO.
> +     */
> +    pgaio_enter_batchmode();

We call read_stream_get_block() while in batchmode, so the stream callback
needs to be ready for that.  A complicated case is
collect_corrupt_items_read_stream_next_block(), which may do its own buffer
I/O to read in a vmbuffer for VM_ALL_FROZEN().  That's feeling to me like a
recipe for corner cases reaching ERROR "starting batch while batch already in
progress".  Are there mitigating factors?


> Subject: [PATCH v2.11 17/27] aio: Add test_aio module

> +    # verify that page verification errors are detected even as part of a
> +    # shortened multi-block read (tbl_corr, block 1 is tbl_corred)

Is "tbl_corred" a typo of something?

> --- /dev/null
> +++ b/src/test/modules/test_aio/test_aio.c
> @@ -0,0 +1,657 @@
> +/*-------------------------------------------------------------------------
> + *
> + * delay_execution.c
> + *        Test module to allow delay between parsing and execution of a query.
> + *
> + * The delay is implemented by taking and immediately releasing a specified
> + * advisory lock.  If another process has previously taken that lock, the
> + * current process will be blocked until the lock is released; otherwise,
> + * there's no effect.  This allows an isolationtester script to reliably
> + * test behaviors where some specified action happens in another backend
> + * between parsing and execution of any desired query.
> + *
> + * Copyright (c) 2020-2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *      src/test/modules/delay_execution/delay_execution.c

Header comment is surviving from copy-paste of delay_execution.c.

> +     * Tor tests we don't want the resowner release preventing us from

s/Tor/For/



Re: AIO v2.5

From
Thomas Munro
Date:
On Mon, Mar 24, 2025 at 5:59 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-03-23 08:55:29 -0700, Noah Misch wrote:
> > An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in
> > PGAIO_HS_COMPLETED_IO clearly doesn't block that close.  For io_method=worker,
> > closing in PGAIO_HS_SUBMITTED is okay.  For io_method=io_uring, is there a
> > reference about it being okay to close during PGAIO_HS_SUBMITTED?  I looked
> > awhile for an authoritative view on that, but I didn't find one.  If we can
> > rely on io_uring_submit() returning only after the kernel has given the
> > io_uring its own reference to all applicable file descriptors, I expect it's
> > okay to close the process's FD.  If the io_uring acquires its reference later
> > than that, I expect we shouldn't close before that later time.
>
> I'm fairly sure io_uring has its own reference for the file descriptor by the
> time io_uring_enter() returns [1].  What io_uring does *not* reliably tolerate
> is the issuing process *exiting* before the IO completes, even if there are
> other processes attached to the same io_uring instance.

It is a bit strange that the documentation doesn't say that
explicitly.  You can sorta-maybe-kinda infer it from the fact that
io_uring didn't originally support cancelling requests at all, maybe a
small clue that it also didn't cancel them when you closed the fd :-)
The only sane alternative would seem to be that they keep running and
have their own reference to the *file* (not the fd), which is the
actual case, and might also be inferrable at a stretch from the
io_uring_register() documentation that says it reduces overheads with
a "long term reference" reducing "per-I/O overhead".  (The distant
third option/non-option is a sort of late/async binding fd as seen in
the Glibc user space POSIX AIO implementation, but that sort of
madness doesn't seem to be the sort of thing anyone working in the
kernel would entertain for a nanosecond...)  Anyway, there are also
public discussions involving Mr Axboe that discuss the fact that async
operations continue to run when the associated fd is closed, e.g. from
people who were surprised by that when porting stuff from other
systems, which might help fill in the documentation gap a teensy bit
if people want to see something outside the source code:

https://github.com/axboe/liburing/issues/568

> AIO v1 had a posix_aio backend, which, on several platforms, did *not*
> tolerate the FD being closed before the IO completes. Because of that
> IoMethodOps had a closing_fd callback, which posix_aio used to wait for the
> IO's completion [2].

Just for the record while remembering this stuff: Windows is another
system that took the cancel-on-close approach, so the Windows IOCP
proof-of-concept patches also used that AIO v1 callback and we'll have
to think about that again if/when we want to get that stuff
going on AIO v2.  I recall also speculating that it might be better to
teach the vfd system to pick another victim to close instead if an fd
was currently tied up with an asynchronous I/O for the benefit of
those cancel-on-close systems, hopefully without any happy-path
book-keeping.  But just submitting staged I/O is a nice and cheap
solution for now, without them in the picture.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-23 17:29:39 -0700, Noah Misch wrote:
> commit 247ce06b wrote:
> > +            pgaio_io_reopen(ioh);
> > +
> > +            /*
> > +             * To be able to exercise the reopen-fails path, allow injection
> > +             * points to trigger a failure at this point.
> > +             */
> > +            pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> > +
> > +            error_errno = 0;
> > +            error_ioh = NULL;
> > +
> > +            /*
> > +             * We don't expect this to ever fail with ERROR or FATAL, no need
> > +             * to keep error_ioh set to the IO.
> > +             * pgaio_io_perform_synchronously() contains a critical section to
> > +             * ensure we don't accidentally fail.
> > +             */
> > +            pgaio_io_perform_synchronously(ioh);
>
> A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
> smgr_aio_reopen() stores.  Hence, I think smgrfd() should assert that
> interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
> HOLD_INTERRUPTS() should surround the above region of code.  It's likely hard
> to reproduce a problem, because pgaio_io_call_inj() does nothing in many
> builds, and pgaio_io_perform_synchronously() starts by entering a critical
> section.

Hm, I guess you're right - it would be pretty bonkers for the injection to
process interrupts, but it's much better to clarify the code to make that not
an option.  Once doing that, it seemed a similar assertion in
pgaio_io_before_prep() would be appropriate.
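
Roughly (a sketch; INTERRUPTS_CAN_BE_PROCESSED() being the existing
miscadmin.h macro):

/*
 * Interrupt processing (e.g. a CHECK_FOR_INTERRUPTS()) could trigger
 * smgrreleaseall(), closing the fd the caller is about to associate with
 * this IO.  The caller therefore has to hold interrupts until the IO has
 * been staged or submitted.
 */
Assert(!INTERRUPTS_CAN_BE_PROCESSED());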



> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > Attached v2.11
>
> > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd
>
> > +int
> > +FileStartReadV(PgAioHandle *ioh, File file,
> > +               int iovcnt, off_t offset,
> > +               uint32 wait_event_info)
> > +{
> > +    int            returnCode;
> > +    Vfd           *vfdP;
> > +
> > +    Assert(FileIsValid(file));
> > +
> > +    DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> > +               file, VfdCache[file].fileName,
> > +               (int64) offset,
> > +               iovcnt));
> > +
> > +    returnCode = FileAccess(file);
> > +    if (returnCode < 0)
> > +        return returnCode;
> > +
> > +    vfdP = &VfdCache[file];
> > +
> > +    pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
>
> FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
> w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode.  I didn't expect that from
> functions so named.  The "start" verb sounds to me like unconditional
> PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED.  I like
> the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
> PGAIO_HS_STAGED succinctly covers what to expect.  Hence, I recommend names
> FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread.  How do
> you see it?

I have a surprisingly strong negative reaction to that proposed naming. To me
the staging is a distinct step that happens *after* the IO is fully
defined. Making all the layered calls that lead up to that named that way
would IMO be a bad idea.

However, I don't particularly like the *start* or *prep* names either; I've gone
back and forth on those a couple of times. I could see "begin" working uniformly
across those.


> > +/*
> > + * AIO error reporting callback for mdstartreadv().
> > + *
> > + * Errors are encoded as follows:
> > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0
>
> I recommend replacing "errno != 0" with either "that errno" or "errno ==
> error_data".

Done.


> > Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design
>
> Ready for commit apart from some trivia:

Great.


> > +if (ioret.result.status == PGAIO_RS_ERROR)
> > +    pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);
>
> I think ioret and aio_ret are supposed to be the same object.  If that's
> right, change one of the names.  Likewise elsewhere in this file.

You're right.
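
I.e. the README example will use one name consistently:

if (aio_ret.result.status == PGAIO_RS_ERROR)
    pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);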


> > +The central API piece for postgres' AIO abstraction are AIO handles. To
> > +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
> > +then "defined", i.e. associate an IO operation with the handle.
>
> s/"defined"/"define" it/ or similar
>
> > +The "solution" to this the ability to associate multiple completion callbacks
>
> s/this the/this is the/

Applied.


> > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
>
> > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
> >          Assert(refcount > 0);
> >          if (refcount != 1)
> >              return false;
> > +
> > +        /*
> > +         * Check that the AIO subsystem doesn't have a pin. Likely not
> > +         * possible today, but better safe than sorry.
> > +         */
> > +        bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> > +        buf_state = pg_atomic_read_u32(&bufHdr->state);
> > +        refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> > +        Assert(refcount > 0);
> > +        if (refcount != 1)
> > +            return false;
> > +
>
> LockBufferForCleanup() should get code like this
> ConditionalLockBufferForCleanup() code, either now or when "not possible
> today" ends.  Currently, it just assumes all local buffers are
> cleanup-lockable:
>
>     /* Nobody else to wait for */
>     if (BufferIsLocal(buffer))
>         return;

Kinda, yes, kinda no?  LockBufferForCleanup() assumes, even for shared
buffers, that the current backend can't be doing anything that conflicts with
acquiring a buffer pin - note that it doesn't check the backend local pincount
for shared buffers either.


LockBufferForCleanup() kind of has to make that assumption, because there's no
way to wait for yourself to release another pin, because obviously waiting in
LockBufferForCleanup() would prevent that from ever happening.

It's somewhat disheartening that the comments for LockBufferForCleanup() don't
mention that the caller somehow needs to ensure it isn't holding other pins on
the relation. Nor does LockBufferForCleanup() have any asserts checking how
many backend-local pins exist.


Leaving documentation / asserts aside, I think this is largely a safe
assumption given current callers. With one exception, it's all vacuum or
recovery related code - as vacuum can't run in a transaction, we can't
conflict with another pin by the same backend.

The one exception is heap_surgery.c - it doesn't quite seem safe: the
surrounding query (or another query with a cursor) could have a pin on the
target block. The most obvious fix would be to use CheckTableNotInUse(), but
that might break some reasonable uses. Or maybe it should just not use a
cleanup lock; it's not obvious to me why it uses one.  But tbh, I don't care
too much, given what heap_surgery is.


> > @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
> >
> >      buf_state = pg_atomic_read_u32(&bufHdr->state);
> >
> > -    if (check_unreferenced && LocalRefCount[bufid] != 0)
> > +    /*
> > +     * We need to test not just LocalRefCount[bufid] but also the BufferDesc
> > +     * itself, as the latter is used to represent a pin by the AIO subsystem.
> > +     * This can happen if AIO is initiated and then the query errors out.
> > +     */
> > +    if (check_unreferenced &&
> > +        (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0))
> >          elog(ERROR, "block %u of %s is still referenced (local %u)",
>
> I didn't write a test to prove it, but I'm suspecting we'll reach the above
> ERROR with this sequence:
>
>   CREATE TEMP TABLE foo ...;
>   [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
>   DROP TABLE foo;

That seems plausible.  I'll try to write a test after this email.


> DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
> think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> the particular rel) before InvalidateLocalBuffer().  Or use something like the
> logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> corresponding bufmgr code.

Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
me. It's going to be pretty rarely reached; waiting for all concurrent IO
seems unnecessarily heavyweight. I don't think it matters much today, but once
we do things like asynchronously writing back buffers or WAL, the situation
will be different.

I think this points to the comment above the WaitIO() in InvalidateBuffer()
needing a bit of adapting - an in-progress read can trigger the WaitIO as
well. Something like:

    /*
     * We assume the reason for it to be pinned is that either we were
     * asynchronously reading the page in before erroring out or someone else
     * is flushing the page out.  Wait for the IO to finish.  (This could be
     * an infinite loop if the refcount is messed up... it would be nice to
     * time out after awhile, but there seems no way to be sure how many loops
     * may be needed.  Note that if the other guy has pinned the buffer but
     * not yet done StartBufferIO, WaitIO will fall through and we'll
     * effectively be busy-looping here.)
     */



> > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > +                          bool failed, bool is_temp)
> > +{
> ...
> > +    PgAioResult result;
> ...
> > +    result.status = PGAIO_RS_OK;
> ...
> > +    return result;
>
> gcc 14.2.0 -Werror gives me:
>
>   bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized]

Gngngng.  Since when is it a bug for some fields of a struct to be
uninitialized, as long as they're not used?

Interestingly I don't see that warning, despite also using gcc 14.2.0.

I'll just move to your solution, but it seems odd.


> > Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO
>
> > @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream)
> >  static void
> >  read_stream_look_ahead(ReadStream *stream)
> >  {
> > +    /*
> > +     * Allow amortizing the cost of submitting IO over multiple IOs. This
> > +     * requires that we don't do any operations that could lead to a deadlock
> > +     * with staged-but-unsubmitted IO.
> > +     */
> > +    pgaio_enter_batchmode();
>
> We call read_stream_get_block() while in batchmode, so the stream callback
> needs to be ready for that.  A complicated case is
> collect_corrupt_items_read_stream_next_block(), which may do its own buffer
> I/O to read in a vmbuffer for VM_ALL_FROZEN().  That's feeling to me like a
> recipe for corner cases reaching ERROR "starting batch while batch already in
> progress".  Are there mitigating factors?

Ugh, yes, you're right.  heap_vac_scan_next_block() is also affected.

I don't think "starting batch while batch already in progress" is the real
issue though - it seems easy enough to avoid starting another batch inside,
partially because current cases seem unlikely to need to do batchable IO
inside. What worries me more is that code might block while there's
unsubmitted IO - which seems entirely plausible.


I can see a few approaches:

1) Declare that all read stream callbacks have to be careful and cope with
   batch mode

   I'm not sure how viable that is, not starting batches seems ok, but
   ensuring that the code doesn't block is a different story.


2) Have read stream users opt-in to batching

   Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough
   to implement and to add to the callsites where that's fine.


3) Teach read stream to "look ahead" far enough to determine all the blocks
   that could be issued in a batch outside of batchmode

   I think that's probably not a great idea, it'd lead us to looking further
   ahead than we really need to, which could increase "unfairness" in
   e.g. parallel sequential scan.


4) Just defer using batch mode for now

   It's a nice win with io_uring for random IO, e.g. from bitmap heap scans,
   but there's no need to immediately solve this.


I think regardless of what we go for, it's worth splitting
  "aio: Basic read_stream adjustments for real AIO"
into the actually basic parts (i.e. introducing sync_mode) from the not
actually so basic parts (i.e. batching).


I suspect that 2) would be the best approach. Only the read stream user knows
what it needs to do in the callback.


>
> > Subject: [PATCH v2.11 17/27] aio: Add test_aio module
>
> > +    # verify that page verification errors are detected even as part of a
> > +    # shortened multi-block read (tbl_corr, block 1 is tbl_corred)
>
> Is "tbl_corred" a typo of something?

I think that was a search&replace of the table name gone wrong. It was just
supposed to be "corrupted".


> > + *
> > + * IDENTIFICATION
> > + *      src/test/modules/delay_execution/delay_execution.c
>
> Header comment is surviving from copy-paste of delay_execution.c.

Oh, how I hate these pointless comments. Fixed.

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-24 11:43:47 -0400, Andres Freund wrote:
> > I didn't write a test to prove it, but I'm suspecting we'll reach the above
> > ERROR with this sequence:
> >
> >   CREATE TEMP TABLE foo ...;
> >   [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
> >   DROP TABLE foo;
> 
> That seems plausible.  I'll try to write a test after this email.

FWIW, a test did indeed confirm that.  Luckily:

> > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
> > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> > the particular rel) before InvalidateLocalBuffer().  Or use something like the
> > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> > corresponding bufmgr code.
> 
> Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
> me.

This did indeed resolve the issue.

I've extended the testsuite to test for that and a bunch more things. Working
on sending out a new version...


> > We call read_stream_get_block() while in batchmode, so the stream callback
> > needs to be ready for that.  A complicated case is
> > collect_corrupt_items_read_stream_next_block(), which may do its own buffer
> > I/O to read in a vmbuffer for VM_ALL_FROZEN().  That's feeling to me like a
> > recipe for corner cases reaching ERROR "starting batch while batch already in
> > progress".  Are there mitigating factors?
> 
> Ugh, yes, you're right.  heap_vac_scan_next_block() is also affected.
> 
> I don't think "starting batch while batch already in progress" is the real
> issue though - it seems easy enough to avoid starting another batch inside,
> partially because current cases seem unlikely to need to do batchable IO
> inside. What worries me more is that code might block while there's
> unsubmitted IO - which seems entirely plausible.
> 
> 
> I can see a few approaches:
> 
> 1) Declare that all read stream callbacks have to be careful and cope with
>    batch mode
> 
>    I'm not sure how viable that is, not starting batches seems ok, but
>    ensuring that the code doesn't block is a different story.
> 
> 
> 2) Have read stream users opt-in to batching
> 
>    Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough
>    to implement and to add to the callsites where that's fine.
> 
> 
> 3) Teach read stream to "look ahead" far enough to determine all the blocks
>    that could be issued in a batch outside of batchmode
> 
>    I think that's probably not a great idea, it'd lead us to looking further
>    ahead than we really need to, which could increase "unfairness" in
>    e.g. parallel sequential scan.
> 
> 
> 4) Just defer using batch mode for now
> 
>    It's a nice win with io_uring for random IO, e.g. from bitmap heap scans,
>    but there's no need to immediately solve this.
> 
> 
> I think regardless of what we go for, it's worth splitting
>   "aio: Basic read_stream adjustments for real AIO"
> into the actually basic parts (i.e. introducing sync_mode) from the not
> actually so basic parts (i.e. batching).
> 
> 
> I suspect that 2) would be the best approach. Only the read stream user knows
> what it needs to do in the callback.

I still think 2) would be the best option.

Writing a patch for that.

If a callback may sometimes need to block, it can still opt into
READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking.
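
For illustration, a hypothetical batchmode-aware callback could look like this
(sketch; MyScanState and the my_* helpers are made up):

static BlockNumber
my_next_block_cb(ReadStream *stream,
                 void *callback_private_data,
                 void *per_buffer_data)
{
    MyScanState *scan = callback_private_data;

    if (my_need_vm_page(scan))
    {
        /*
         * About to do IO / block on a lock, which could deadlock against
         * our own staged-but-unsubmitted IOs - submit them first.
         */
        pgaio_submit_staged();

        my_read_vm_page(scan);
    }

    return my_next_blockno(scan);
}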

The hardest part is to explain the flag. Here's my current attempt:

/* ---
 * Opt-in to using AIO batchmode.
 *
 * Submitting IO in larger batches can be more efficient than doing so
 * one-by-one, particularly for many small reads. It does, however, require
 * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
 * batching (cf. pgaio_enter_batchmode()). Basically, the callback may not:
 * a) block without first calling pgaio_submit_staged(), unless a
 *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
 *    never acquired in a nested fashion
 * b) directly or indirectly start another batch via pgaio_enter_batchmode()
 *
 * As this requires care and is nontrivial in some cases, batching is only
 * used with explicit opt-in.
 * ---
 */
#define READ_STREAM_USE_BATCHING 0x08
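
A call site whose callback is safe would then just pass the flag (sketch,
using the hypothetical callback from above and the existing
read_stream_begin_relation() signature):

stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE |
                                    READ_STREAM_USE_BATCHING,
                                    bstrategy,
                                    rel,
                                    MAIN_FORKNUM,
                                    my_next_block_cb,
                                    scan_state,
                                    0);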


Greetings,

Andres Freund



Re: AIO v2.5

From
Thomas Munro
Date:
On Tue, Mar 25, 2025 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
> If a callback may sometimes need to block, it can still opt into
> READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking.
>
> The hardest part is to explain the flag. Here's my current attempt:
>
> /* ---
>  * Opt-in to using AIO batchmode.
>  *
>  * Submitting IO in larger batches can be more efficient than doing so
>  * one-by-one, particularly for many small reads. It does, however, require
>  * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
>  * batching (cf. pgaio_enter_batchmode()). Basically, the callback may not:
>  * a) block without first calling pgaio_submit_staged(), unless a
>  *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
>  *    never acquired in a nested fashion
>  * b) directly or indirectly start another batch via pgaio_enter_batchmode()
>  *
>  * As this requires care and is nontrivial in some cases, batching is only
>  * used with explicit opt-in.
>  * ---
>  */
> #define READ_STREAM_USE_BATCHING 0x08

+1

I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE
would be better, to highlight that you are making a declaration about
a property of your callback, not just turning on an independent
go-fast feature... I fished those words out of the main (?)
description of this topic atop pgaio_enter_batchmode().  Just a
thought, IDK.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 13:07:49 +1300, Thomas Munro wrote:
> On Tue, Mar 25, 2025 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
> > #define READ_STREAM_USE_BATCHING 0x08
> 
> +1
> 
> I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE
> would be better, to highlight that you are making a declaration about
> a property of your callback, not just turning on an independent
> go-fast feature... I fished those words out of the main (?)
> description of this topic atop pgaio_enter_batchmode().  Just a
> thought, IDK.

The relevant lines are already very deeply indented, so I'm a bit wary of such
a long name.  I think we'd basically have to use a separate flags variable
everywhere, and that is annoying given that we follow C89 variable declaration
positions...

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Mon, Mar 24, 2025 at 11:43:47AM -0400, Andres Freund wrote:
> On 2025-03-23 17:29:39 -0700, Noah Misch wrote:
> > commit 247ce06b wrote:
> > > +            pgaio_io_reopen(ioh);
> > > +
> > > +            /*
> > > +             * To be able to exercise the reopen-fails path, allow injection
> > > +             * points to trigger a failure at this point.
> > > +             */
> > > +            pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> > > +
> > > +            error_errno = 0;
> > > +            error_ioh = NULL;
> > > +
> > > +            /*
> > > +             * We don't expect this to ever fail with ERROR or FATAL, no need
> > > +             * to keep error_ioh set to the IO.
> > > +             * pgaio_io_perform_synchronously() contains a critical section to
> > > +             * ensure we don't accidentally fail.
> > > +             */
> > > +            pgaio_io_perform_synchronously(ioh);
> >
> > A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
> > smgr_aio_reopen() stores.  Hence, I think smgrfd() should assert that
> > interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
> > HOLD_INTERRUPTS() should surround the above region of code.  It's likely hard
> > to reproduce a problem, because pgaio_io_call_inj() does nothing in many
> > builds, and pgaio_io_perform_synchronously() starts by entering a critical
> > section.
> 
> Hm, I guess you're right - it would be pretty bonkers for the injection to
> process interrupts, but it's much better to clarify the code to make that not
> an option.  Once doing that, it seemed a similar assertion in
> pgaio_io_before_prep() would be appropriate.

Agreed.  Following that line of thinking, the io_uring case needs to
HOLD_INTERRUPTS() (or hold smgrrelease() specifically) all the way from
pgaio_io_before_prep() to PGAIO_HS_SUBMITTED.  The fd has to stay valid until
io_uring_submit().

(We may be due for a test mode that does smgrreleaseall() at every
CHECK_FOR_INTERRUPTS()?)

> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd
> >
> > > +int
> > > +FileStartReadV(PgAioHandle *ioh, File file,
> > > +               int iovcnt, off_t offset,
> > > +               uint32 wait_event_info)
> > > +{
> > > +    int            returnCode;
> > > +    Vfd           *vfdP;
> > > +
> > > +    Assert(FileIsValid(file));
> > > +
> > > +    DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> > > +               file, VfdCache[file].fileName,
> > > +               (int64) offset,
> > > +               iovcnt));
> > > +
> > > +    returnCode = FileAccess(file);
> > > +    if (returnCode < 0)
> > > +        return returnCode;
> > > +
> > > +    vfdP = &VfdCache[file];
> > > +
> > > +    pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
> >
> > FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
> > w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode.  I didn't expect that from
> > functions so named.  The "start" verb sounds to me like unconditional
> > PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED.  I like
> > the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
> > PGAIO_HS_STAGED succinctly covers what to expect.  Hence, I recommend names
> > FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread.  How do
> > you see it?
> 
> I have a surprisingly strong negative reaction to that proposed naming. To me
> the staging is a distinct step that happens *after* the IO is fully
> defined. Making all the layered calls that lead up to that named that way
> would IMO be a bad idea.

As a general naming principle, I think the name of a function that advances
through multiple named steps should mention the last step.  Naming the
function after just a non-last step feels weird to me.  For example, serving a
meal consists of steps menu_define, mix_ingredients, and plate_food.  It would
be weird to me if a function called meal_menu_define() mixed ingredients or
plated food, but it's fine if meal_plate_food() does all three steps.  A
second strategy is to name both the first and last steps:
meal_define_menu_thru_plate_food() is fine apart from being long.  A third
strategy is to have meal_plate_food() assert that meal_mix_ingredients() has
been called.

I wouldn't mind "staging" as a distinct step, but I think today's API
boundaries hide the distinction.  PGAIO_HS_DEFINED is a temporary state during
a pgaio_io_stage() call, so the process that defines and stages the IO can
observe PGAIO_HS_DEFINED only while pgaio_io_stage() is on the stack.

The aforementioned "third strategy" could map to having distinct
smgrdefinereadv() and smgrstagereadv().  I don't know how well that would work
out overall.  I wouldn't be optimistic about that winning, but I mention it
for completeness.

> However, I don't particularly like the *start* or *prep* names either; I've gone
> back and forth on those a couple of times. I could see "begin" working uniformly
> across those.

For ease of new readers understanding things, I think it helps for the
functions that advance PgAioHandleState to have names that use words from
PgAioHandleState.  It's one less mapping to get into the reader's head.
"Begin", "Start" and "prep" are all outside that taxonomy, making the reader
learn how to map them to the taxonomy.  What reward does the reader get at the
end of that exercise?  I'm not seeing one, but please do tell me what I'm
missing here.

> > > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
> >
> > > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
> > >          Assert(refcount > 0);
> > >          if (refcount != 1)
> > >              return false;
> > > +
> > > +        /*
> > > +         * Check that the AIO subsystem doesn't have a pin. Likely not
> > > +         * possible today, but better safe than sorry.
> > > +         */
> > > +        bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> > > +        buf_state = pg_atomic_read_u32(&bufHdr->state);
> > > +        refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> > > +        Assert(refcount > 0);
> > > +        if (refcount != 1)
> > > +            return false;
> > > +
> >
> > LockBufferForCleanup() should get code like this
> > ConditionalLockBufferForCleanup() code, either now or when "not possible
> > today" ends.  Currently, it just assumes all local buffers are
> > cleanup-lockable:
> >
> >     /* Nobody else to wait for */
> >     if (BufferIsLocal(buffer))
> >         return;
> 
> Kinda, yes, kinda no?  LockBufferForCleanup() assumes, even for shared
> buffers, that the current backend can't be doing anything that conflicts with
> acquiring a buffer pin - note that it doesn't check the backend local pincount
> for shared buffers either.

It checks the local pincount via callee CheckBufferIsPinnedOnce().

As the patch stands, LockBufferForCleanup() can succeed when
ConditionalLockBufferForCleanup() would have returned false.  I'm not seeking
to raise the overall standard of *Cleanup() family of functions, but I am
trying to keep members of that family agreeing on the standard.

Like the comment, I expect it's academic today.  I expect it will stay
academic.  Anything that does a cleanup will start by reading the buffer,
which will resolve any refcnt the AIO subsystems holds for a read.  If there's
an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
that.  How about just removing the ConditionalLockBufferForCleanup() changes
or replacing them with a comment (like the present paragraph)?

> I think this points to the comment above the WaitIO() in InvalidateBuffer()
> needing a bit of adapting - an in-progress read can trigger the WaitIO as
> well. Something like:
> 
>     /*
>      * We assume the reason for it to be pinned is that either we were
>      * asynchronously reading the page in before erroring out or someone else
>      * is flushing the page out.  Wait for the IO to finish.  (This could be
>      * an infinite loop if the refcount is messed up... it would be nice to
>      * time out after awhile, but there seems no way to be sure how many loops
>      * may be needed.  Note that if the other guy has pinned the buffer but
>      * not yet done StartBufferIO, WaitIO will fall through and we'll
>      * effectively be busy-looping here.)
>      */

Agreed.

> > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > > +                          bool failed, bool is_temp)
> > > +{
> > ...
> > > +    PgAioResult result;
> > ...
> > > +    result.status = PGAIO_RS_OK;
> > ...
> > > +    return result;
> >
> > gcc 14.2.0 -Werror gives me:
> >
> >   bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized]
> 
> Gngngng.  Since when is it a bug for some fields of a struct to be
> uninitialized, as long as they're not used?
> 
> Interestingly I don't see that warning, despite also using gcc 14.2.0.

I badly neglected to mention my non-default flags:

CFLAGS='-O2 -fno-sanitize-recover=all -fsanitize=address,alignment,undefined --param=max-vartrack-size=150000000
-ftrivial-auto-var-init=pattern'
COPT=-Werror -Wno-error=array-bounds

Final CFLAGS, including the ones "configure" elects on its own:

configure: using CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security
-Wmissing-variable-declarations -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-format-truncation
-Wno-cast-function-type-strict -g -O2 -fno-sanitize-recover=all -fsanitize=address,alignment,undefined
--param=max-vartrack-size=150000000 -ftrivial-auto-var-init=pattern

(I use -Wno-error=array-bounds because the sanitizer options elicit a lot of
those warnings.  Today's master is free from maybe-uninitialized warnings in
this configuration, though.)

> I'll just move to your solution, but it seems odd.

Got it.
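
(For the record, a minimal sketch of one shape such a fix can take - fully
initializing the struct up front so no field is ever read uninitialized;
whether that's the exact form the patch ends up with doesn't matter much:

    PgAioResult result = {0};   /* quiets -Wmaybe-uninitialized */

    result.status = PGAIO_RS_OK;
    ...
    return result;
)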

> I think regardless of what we go for, it's worth splitting
>   "aio: Basic read_stream adjustments for real AIO"
> into the actually basic parts (i.e. introducing sync_mode) from the not
> actually so basic parts (i.e. batching).

Fair.

On Mon, Mar 24, 2025 at 06:55:22PM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-03-24 11:43:47 -0400, Andres Freund wrote:
> > > I didn't write a test to prove it, but I'm suspecting we'll reach the above
> > > ERROR with this sequence:
> > >
> > >   CREATE TEMP TABLE foo ...;
> > >   [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
> > >   DROP TABLE foo;
> > 
> > That seems plausible.  I'll try to write a test after this email.
> 
> FWIW, a test did indeed confirm that.  Luckily:
> 
> > > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
> > > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> > > the particular rel) before InvalidateLocalBuffer().  Or use something like the
> > > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> > > corresponding bufmgr code.
> > 
> > Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
> > me.
> 
> This did indeed resolve the issue.

I'm happy with that approach.

On Tue, Mar 25, 2025 at 01:07:49PM +1300, Thomas Munro wrote:
> On Tue, Mar 25, 2025 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
> > If a callback may sometimes need to block, it can still opt into
> > READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking.
> >
> > The hardest part is to explain the flag. Here's my current attempt:
> >
> > /* ---
> >  * Opt-in to using AIO batchmode.
> >  *
> >  * Submitting IO in larger batches can be more efficient than doing so
> >  * one-by-one, particularly for many small reads. It does, however, require
> >  * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> >  * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
> >  * a) block without first calling pgaio_submit_staged(), unless a
> >  *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> >  *    never acquired in a nested fashion
> >  * b) directly or indirectly start another batch pgaio_enter_batchmode()

I think a callback could still do:

  pgaio_exit_batchmode()
  ... arbitrary code that might reach pgaio_enter_batchmode() ...
  pgaio_enter_batchmode()
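
A batchmode-aware callback doing that could be sketched as follows
(vm_ensure_loaded() and next_block_number() are illustrative stand-ins):

static BlockNumber
my_block_cb(ReadStream *stream, void *callback_private_data,
            void *per_buffer_data)
{
    pgaio_exit_batchmode();     /* leave the stream's batch */
    vm_ensure_loaded();         /* may block or start its own batch */
    pgaio_enter_batchmode();    /* restore the invariant before returning */

    return next_block_number(callback_private_data);
}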

> >  *
> >  * As this requires care and is nontrivial in some cases, batching is only
> >  * used with explicit opt-in.
> >  * ---
> >  */
> > #define READ_STREAM_USE_BATCHING 0x08
> 
> +1

Agreed.  It's simple, and there's no loss of generality.

> I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE
> would be better, to highlight that you are making a declaration about
> a property of your callback, not just turning on an independent
> go-fast feature... I fished those words out of the main (?)
> description of this topic atop pgaio_enter_batchmode().  Just a
> thought, IDK.

Good points.  I lean toward your renaming suggestion, or shortening to
READ_STREAM_BATCHMODE_AWARE or READ_STREAM_BATCH_OK.  I'm also fine with the
original name, though.

Thanks,
nm



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

Attached v2.12, with the following changes:

- Pushed the max_files_per_process change

  I plan to look at what parts of Jelte's change is worth doing ontop.

  Thanks for the review Noah.


- Rebased over Thomas' commit of the remaining read stream changes

  Yay!


- Addressed Noah's review comments


- Added another test to test_aio/, to test that changing io_workers while
  running works, and that workers are restarted if terminated

  Written by Bilal


- Made InvalidateLocalBuffer wait for IO if necessary

  As reported / suggested by Noah


- Added tests for dropping tables with ongoing IO

  This failed, as Noah predicted, without the InvalidateLocalBuffer() change.


- Added a commit to explicitly hold interrupts in workers after
  pgaio_io_reopen()

  As suggested by Noah.


- Added a commit to fix a logic error around what gets passed to
  ioh->report_return - this led to temporary buffer validation errors not
  being reported

  Discovered while extending the tests, as noted in the next point.

  I could see a few different "formulations" of this change (e.g. the
  report_return stuff could be populated by pgaio_io_call_complete_local()
  instead), but I don't think it matters much.


- Add temporary table coverage to test_aio

  This required changing test_aio.c to cope with temporary tables as well.


- io_uring tests don't run anymore when built with EXEC_BACKEND and liburing
  enabled


- Split the read stream patch into two

  Noah, quite rightly, pointed out that it's not safe to use batching if the
  next-block callback may block (or start its own batch). The best idea seems
  to be to make users of read stream opt-in to batching.  I've done that in a
  patch that uses it where it seems safe, without doing extra work. See also the
  commit message.
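
  As an illustration, the opt-in is just a flag to the stream constructor
  (a sketch; the callback has to abide by the batchmode restrictions):

  stream = read_stream_begin_relation(READ_STREAM_USE_BATCHING,
                                      NULL,   /* no strategy */
                                      rel, MAIN_FORKNUM,
                                      block_range_read_stream_cb,
                                      &private_data, 0);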


- Added a commit to add I/O, Asynchronous I/O glossary and acronym entries


- Docs for pg_aios


- Renamed pg_aios.offset to off, to avoid use of a keyword


- Updated the io_uring wait event name while waiting for IOs to complete to
  AIO_IO_URING_COMPLETION and updated the description of AIO_IO_COMPLETION to
  "Waiting for another process to complete IO."

  I think this is a mix of different suggestions by Noah.


TODO:


- There are more tests in test_aio that should be expanded to run for temp
  tables as well, not just normal tables


- Add an explicit test for the checksum verification in the completion callback

  There is an existing test for testing an invalid page due to page header
  verification in test_aio, but not for checksum failures.

  I think it's indirectly covered (e.g. in amcheck), but seems better to test
  it explicitly.

  Wonder if it's worth adding some coverage for when checksums are disabled?
  Probably not necessary?


Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-23 09:32:48 -0700, Noah Misch wrote:
> Another candidate description string:
> 
> AIO_COMPLETED_SHARED    "Waiting for another process to complete IO."

I liked that one and adopted it.


> > A more minimal change would be to narrow AIO_IO_URING_COMPLETION to
> > "execution" or something like that, to hint at a separation between the raw IO
> > being completed and the IO, including the callbacks completing.
> 
> Yes, that would work for me.

I updated both the name and the description of this one to EXECUTION, but I'm
not sure I like it for the name...


> > > > --- a/doc/src/sgml/config.sgml
> > > > +++ b/doc/src/sgml/config.sgml
> > > > @@ -2710,6 +2710,12 @@ include_dir 'conf.d'
> > > >              <literal>worker</literal> (execute asynchronous I/O using worker processes)
> > > >             </para>
> > > >            </listitem>
> > > > +          <listitem>
> > > > +           <para>
> > > > +            <literal>io_uring</literal> (execute asynchronous I/O using
> > > > +            io_uring, if available)
> > >
> > > I feel the "if available" doesn't quite fit, since we'll fail if unavailable.
> > > Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux"
> > > there to reduce surprise on other platforms.
> > 
> > You're right, the if available can be misunderstood. But not mentioning that
> > it's an optional dependency seems odd too. What about something like
> > 
> >            <para>
> >             <literal>io_uring</literal> (execute asynchronous I/O using
> >             io_uring, requires postgres to have been built with
> >             <link linkend="configure-option-with-liburing"><option>--with-liburing</option></link> /
> >             <link linkend="configure-with-liburing-meson"><option>-Dliburing</option></link>)
> >            </para>
> 
> I'd change s/postgres to have been built/a build with/ since the SGML docs
> don't use the term "postgres" that way.  Otherwise, that works for me.

Went with that.

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support

[I checked that v2.12 doesn't invalidate these review comments, but I didn't
technically rebase the review onto v2.12's line numbers.]

>  static void
>  TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
> -                  bool forget_owner)
> +                  bool forget_owner, bool syncio)
>  {
>      uint32        buf_state;
>  
> @@ -5586,6 +5636,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
>      if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
>          buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
>  
> +    if (!syncio)
> +    {
> +        /* release ownership by the AIO subsystem */
> +        Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
> +        buf_state -= BUF_REFCOUNT_ONE;
> +        pgaio_wref_clear(&buf->io_wref);
> +    }

Looking at the callers:

ZeroAndLockBuffer[1083]        TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
ExtendBufferedRelShared[2869]  TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
FlushBuffer[4827]              TerminateBufferIO(buf, true, 0, true, true);
AbortBufferIO[6637]            TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
buffer_readv_complete_one[7279] TerminateBufferIO(buf_hdr, false, set_flag_bits, false, false);
buffer_writev_complete_one[7427] TerminateBufferIO(buf_hdr, clear_dirty, set_flag_bits, false, false);

I think we can improve on the "syncio" arg name.  The first two aren't doing
IO, and AbortBufferIO() may be cleaning up what would have been an AIO if it
hadn't failed early.  Perhaps name the arg "release_aio" and pass
release_aio=true instead of syncio=false (release_aio = !syncio).
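
Sketched against the callers above, that would read:

    TerminateBufferIO(bufHdr, false, BM_VALID, true, /* release_aio */ false);
    ...
    TerminateBufferIO(buf_hdr, false, set_flag_bits, false, /* release_aio */ true);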

> +         * about which buffers are target by IO can be hard to debug, making

s/target/targeted/

> +static pg_attribute_always_inline PgAioResult
> +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> +                          bool failed, bool is_temp)
> +{
...
> +        if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> +        {
> +            ereport(WARNING,
> +                    (errcode(ERRCODE_DATA_CORRUPTED),
> +                     errmsg("invalid page in block %u of relation %s; zeroing out page",

My earlier review requested s/LOG/WARNING/, but I wasn't thinking about this
in full depth.  In the !is_temp case, this runs in a complete_shared callback.
A process unrelated to the original IO may run this callback.  That's
unfortunate in two ways.  First, that other process's client gets an
unexpected WARNING.  The process getting the WARNING may not even have
zero_damaged_pages enabled.  Second, the client of the process that staged the
IO gets no message.

AIO ERROR-level messages handle this optimally.  We emit a LOG-level message
in the process that runs the complete_shared callback, and we arrange for the
ERROR-level message in the stager.  That would be ideal here: LOG in the
complete_shared runner, WARNING in the stager.

One could simplify things by forcing io_method=sync under ZERO_ON_ERROR ||
zero_damaged_pages, perhaps as a short-term approach.

Thoughts?



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> (We may be due for a test mode that does smgrreleaseall() at every
> CHECK_FOR_INTERRUPTS()?)

I suspect we are. I'm a bit afraid of even trying...

...

It's extremely slow - but at least the main regression as well as the aio tests pass!


> > I however don't particularly like the *start* or *prep* names, I've gone back
> > and forth on those a couple times. I could see "begin" work uniformly across
> > those.
> 
> For ease of new readers understanding things, I think it helps for the
> functions that advance PgAioHandleState to have names that use words from
> PgAioHandleState.  It's one less mapping to get into the reader's head.

Unfortunately for me it's kind of the opposite in this case, see below.


> "Begin", "Start" and "prep" are all outside that taxonomy, making the reader
> learn how to map them to the taxonomy.  What reward does the reader get at the
> end of that exercise?  I'm not seeing one, but please do tell me what I'm
> missing here.

Because the end state varies, depending on the number of previously staged
IOs, the IO method and whether batchmode is enabled, I think it's better if
the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
*not* aligned with an internal state name.  It will just mislead readers to
think that there's a deterministic mapping when that does not exist.

That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
naming that I just stopped seeing.

I'll try to think more about this, perhaps I can make myself see your POV
more.



> > > > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
> > > LockBufferForCleanup() should get code like this
> > > ConditionalLockBufferForCleanup() code, either now or when "not possible
> > > today" ends.  Currently, it just assumes all local buffers are
> > > cleanup-lockable:
> > >
> > >     /* Nobody else to wait for */
> > >     if (BufferIsLocal(buffer))
> > >         return;
> > 
> > Kinda, yes, kinda no?  LockBufferForCleanup() assumes, even for shared
> > buffers, that the current backend can't be doing anything that conflicts with
> > acquiring a buffer pin - note that it doesn't check the backend local pincount
> > for shared buffers either.
> 
> It checks the local pincount via callee CheckBufferIsPinnedOnce().

In exactly one of the callers :/


> As the patch stands, LockBufferForCleanup() can succeed when
> ConditionalLockBufferForCleanup() would have returned false.

That's already true today, right? In master ConditionalLockBufferForCleanup()
for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.
I think I agree with your suggestion further below, but independent of that, I
don't see how the current modification in the patch makes this any worse.

Historically this behaviour of LockBufferForCleanup() kinda somewhat makes
sense - the only place we use LockBufferForCleanup() is in a non-transactional
command, i.e. vacuum / index vacuum. So LockBufferForCleanup() turns out to
only be safe in that context.


> Like the comment, I expect it's academic today.  I expect it will stay
> academic.  Anything that does a cleanup will start by reading the buffer,
> which will resolve any refcount the AIO subsystem holds for a read.  If there's
> an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> that.  How about just removing the ConditionalLockBufferForCleanup() changes
> or replacing them with a comment (like the present paragraph)?

I think we'll need an expanded version of what I suggest once we have writes -
but as you say, it shouldn't matter as long as we only have reads. So I think
moving the relevant changes, with adjusted caveats, to the bufmgr: write
change makes sense.



> > > /* ---
> > >  * Opt-in to using AIO batchmode.
> > >  *
> > >  * Submitting IO in larger batches can be more efficient than doing so
> > >  * one-by-one, particularly for many small reads. It does, however, require
> > >  * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > >  * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
> > >  * a) block without first calling pgaio_submit_staged(), unless a
> > >  *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > >  *    never acquired in a nested fashion
> > >  * b) directly or indirectly start another batch pgaio_enter_batchmode()
> 
> I think a callback could still do:
> 
>   pgaio_exit_batchmode()
>   ... arbitrary code that might reach pgaio_enter_batchmode() ...
>   pgaio_enter_batchmode()

Yea - but I somehow doubt there are many cases where it makes sense to
deep-queue IOs within the callback. The cases I can think of are things like
ensuring the right VM buffer is in s_b.  But if it turns out to be necessary,
what you suggest would be an out.

Do you think it's worth mentioning the above workaround? I'm mildly inclined
not to.

If it turns out to be actually useful to do nested batching, we can change it
so that nested batching *is* allowed, that'd not be hard.



> > >  *
> > >  * As this requires care and is nontrivial in some cases, batching is only
> > >  * used with explicit opt-in.
> > >  * ---
> > >  */
> > > #define READ_STREAM_USE_BATCHING 0x08
> > 
> > +1
> 
> Agreed.  It's simple, and there's no loss of generality.
> 
> > I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE
> > would be better, to highlight that you are making a declaration about
> > a property of your callback, not just turning on an independent
> > go-fast feature... I fished those words out of the main (?)
> > description of this topic atop pgaio_enter_batchmode().  Just a
> > thought, IDK.
> 
> Good points.  I lean toward your renaming suggestion, or shortening to
> READ_STREAM_BATCHMODE_AWARE or READ_STREAM_BATCH_OK.  I'm also fine with the
> original name, though.

I'm ok with all of these. In order of preference:

1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
2) READ_STREAM_BATCHMODE_AWARE
3) READ_STREAM_CALLBACK_BATCHMODE_AWARE

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-24 19:20:37 -0700, Noah Misch wrote:
> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> >  static void
> >  TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
> > -                  bool forget_owner)
> > +                  bool forget_owner, bool syncio)
> > ...
> Looking at the callers:
> 
> ZeroAndLockBuffer[1083]        TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
> ExtendBufferedRelShared[2869]  TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
> FlushBuffer[4827]              TerminateBufferIO(buf, true, 0, true, true);
> AbortBufferIO[6637]            TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
> buffer_readv_complete_one[7279] TerminateBufferIO(buf_hdr, false, set_flag_bits, false, false);
> buffer_writev_complete_one[7427] TerminateBufferIO(buf_hdr, clear_dirty, set_flag_bits, false, false);
> 
> I think we can improve on the "syncio" arg name.  The first two aren't doing
> IO, and AbortBufferIO() may be cleaning up what would have been an AIO if it
> hadn't failed early.  Perhaps name the arg "release_aio" and pass
> release_aio=true instead of syncio=false (release_aio = !syncio).

Yes, I think that makes sense.  Will do that tomorrow.


> > +static pg_attribute_always_inline PgAioResult
> > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > +                          bool failed, bool is_temp)
> > +{
> ...
> > +        if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> > +        {
> > +            ereport(WARNING,
> > +                    (errcode(ERRCODE_DATA_CORRUPTED),
> > +                     errmsg("invalid page in block %u of relation %s; zeroing out page",
> 
> My earlier review requested s/LOG/WARNING/, but I wasn't thinking about this
> in full depth. In the !is_temp case, this runs in a complete_shared
> callback.  A process unrelated to the original IO may run this callback.
> That's unfortunate in two ways.  First, that other process's client gets an
> unexpected WARNING.  The process getting the WARNING may not even have
> zero_damaged_pages enabled.  Second, the client of the process that staged
> the IO gets no message.

Ah, right. That could be why I had flipped it. If so, shame on me for not
adding a comment...


> AIO ERROR-level messages handle this optimally.  We emit a LOG-level message
> in the process that runs the complete_shared callback, and we arrange for the
> ERROR-level message in the stager.  That would be ideal here: LOG in the
> complete_shared runner, WARNING in the stager.

We could obviously downgrade (crossgrade? A LOG is more severe than a WARNING
in some ways, but not others) the message when run in a different backend
fairly easily.  Still emitting a WARNING in the stager, however, is a bit more
tricky.

Before thinking more deeply about how we could emit WARNING in the stager:

Is it actually sane to use WARNING here? At least for ZERO_ON_ERROR that could
trigger a rather massive flood of messages to the client in a *normal*
situation. I'm thinking of something like an insert extending a relation some
time after an immediate restart and encountering a lot of FSM corruption (due
to its non-crash-safe-ness) during the search for free space and the
subsequent FSM vacuum.  It might be ok to LOG that, but sending a lot of
WARNINGs to the client seems not quite right.


If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
then could tell the stager to issue the WARNING. It would add a bit of
distributed cost, both to callbacks and users of AIO, but it might not be too
bad.
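
Roughly like this (a sketch; the surrounding values are illustrative):

typedef enum PgAioResultStatus
{
    PGAIO_RS_UNKNOWN,
    PGAIO_RS_OK,
    PGAIO_RS_WARN,      /* new: tells the stager to emit a WARNING */
    PGAIO_RS_PARTIAL,
    PGAIO_RS_ERROR,
} PgAioResultStatus;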


> One could simplify things by forcing io_method=sync under ZERO_ON_ERROR ||
> zero_damaged_pages, perhaps as a short-term approach.

Yea, that could work.  Perhaps even just for zero_damaged_pages, after
changing it so that ZERO_ON_ERROR always just LOGs.

Hm, it seems somewhat nasty to have rather different performance
characteristics when forced to use zero_damaged_pages to recover from a
problem. Imagine an instance that's configured to use DIO and then needs to
use zero_damaged_pages to recover from corruption...


/me adds writing a test for both ZERO_ON_ERROR and zero_damaged_pages to the
TODO.

Greetings,

Andres Freund



Re: AIO v2.5

From
Thomas Munro
Date:
On Tue, Mar 25, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> Attached v2.12, with the following changes:

Here's a tiny fixup to make io_concurrency=0 turn on
READ_BUFFERS_SYNCHRONOUSLY as mooted in a FIXME.  Without this, AIO
will still run at level 1 even if you asked for 0.  Feel free to
squash, or ignore and I'll push it later, whatever suits... (tested on
the tip of your public aio-2 branch).

Attachment

Re: AIO v2.5

From
Noah Misch
Date:
On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > (We may be due for a test mode that does smgrreleaseall() at every
> > CHECK_FOR_INTERRUPTS()?)
> 
> I suspect we are. I'm a bit afraid of even trying...
> 
> ...
> 
> It's extremely slow - but at least the main regression as well as the aio tests pass!

One less thing!

> > > I however don't particularly like the *start* or *prep* names, I've gone back
> > > and forth on those a couple times. I could see "begin" work uniformly across
> > > those.
> > 
> > For ease of new readers understanding things, I think it helps for the
> > functions that advance PgAioHandleState to have names that use words from
> > PgAioHandleState.  It's one less mapping to get into the reader's head.
> 
> Unfortunately for me it's kind of the opposite in this case, see below.
> 
> 
> > "Begin", "Start" and "prep" are all outside that taxonomy, making the reader
> > learn how to map them to the taxonomy.  What reward does the reader get at the
> > end of that exercise?  I'm not seeing one, but please do tell me what I'm
> > missing here.
> 
> Because the end state varies, depending on the number of previously staged
> IOs, the IO method and whether batchmode is enabled, I think it's better if
> the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> *not* aligned with an internal state name.  It will just mislead readers to
> think that there's a deterministic mapping when that does not exist.

That's fair.  Could we provide the mapping in a comment, something like the
following?

--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -34,5 +34,10 @@
  * linearly through all states.
  *
- * State changes should all go through pgaio_io_update_state().
+ * State changes should all go through pgaio_io_update_state().  Its callers
+ * use these naming conventions:
+ *
+ * - A "start" function (e.g. FileStartReadV()) moves an IO from
+ *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
+ *   PGAIO_HS_COMPLETED_LOCAL.
  */
 typedef enum PgAioHandleState

> That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> naming that I just stopped seeing.
> 
> I'll try to think more about this, perhaps I can make myself see your POV
> more.

> > As the patch stands, LockBufferForCleanup() can succeed when
> > ConditionalLockBufferForCleanup() would have returned false.
> 
> That's already true today, right? In master ConditionalLockBufferForCleanup()
> for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.

I'm finding a LocalRefCount check under LockBufferForCleanup:

LockBufferForCleanup(Buffer buffer)
{
...
    CheckBufferIsPinnedOnce(buffer);

CheckBufferIsPinnedOnce(Buffer buffer)
{
    if (BufferIsLocal(buffer))
    {
        if (LocalRefCount[-buffer - 1] != 1)
            elog(ERROR, "incorrect local pin count: %d",
                 LocalRefCount[-buffer - 1]);
    }
    else
    {
        if (GetPrivateRefCount(buffer) != 1)
            elog(ERROR, "incorrect local pin count: %d",
                 GetPrivateRefCount(buffer));
    }
}

> > Like the comment, I expect it's academic today.  I expect it will stay
> > academic.  Anything that does a cleanup will start by reading the buffer,
> > which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > or replacing them with a comment (like the present paragraph)?
> 
> I think we'll need an expanded version of what I suggest once we have writes -
> but as you say, it shouldn't matter as long as we only have reads. So I think
> moving the relevant changes, with adjusted caveats, to the bufmgr: write
> change makes sense.

Moving those changes works for me.  I'm not currently seeing the need under
writes, but that may get clearer upon reaching those patches.

> > > > /* ---
> > > >  * Opt-in to using AIO batchmode.
> > > >  *
> > > >  * Submitting IO in larger batches can be more efficient than doing so
> > > >  * one-by-one, particularly for many small reads. It does, however, require
> > > >  * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > > >  * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
> > > >  * a) block without first calling pgaio_submit_staged(), unless a
> > > >  *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > > >  *    never acquired in a nested fashion
> > > >  * b) directly or indirectly start another batch pgaio_enter_batchmode()
> > 
> > I think a callback could still do:
> > 
> >   pgaio_exit_batchmode()
> >   ... arbitrary code that might reach pgaio_enter_batchmode() ...
> >   pgaio_enter_batchmode()
> 
> Yea - but I somehow doubt there are many cases where it makes sense to
> deep-queue IOs within the callback. The cases I can think of are things like
> ensuring the right VM buffer is in s_b.  But if it turns out to be necessary,
> what you suggest would be an out.

I don't foresee a callback specifically wanting to batch, but callbacks might
call into other infrastructure that can elect to batch.  The exit+reenter
pattern would be better than adding no-batch options to other infrastructure.

> Do you think it's worth mentioning the above workaround? I'm mildly inclined
> not to.

Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
exit+reenter is banned.  Maybe "(b) start another batch (without first exiting
one)".  It's also fine as-is, though.

> If it turns out to be actually useful to do nested batching, we can change it
> so that nested batching *is* allowed, that'd not be hard.

Good point.

> I'm ok with all of these. In order of preference:
> 
> 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
> 2) READ_STREAM_BATCHMODE_AWARE
> 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE

Same for me.



Re: AIO v2.5

From
Noah Misch
Date:
On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> On 2025-03-24 19:20:37 -0700, Noah Misch wrote:
> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > > +static pg_attribute_always_inline PgAioResult
> > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > > +                          bool failed, bool is_temp)
> > > +{
> > ...
> > > +        if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> > > +        {
> > > +            ereport(WARNING,
> > > +                    (errcode(ERRCODE_DATA_CORRUPTED),
> > > +                     errmsg("invalid page in block %u of relation %s; zeroing out page",
> > 
> > My earlier review requested s/LOG/WARNING/, but I wasn't thinking about this
> > in full depth. In the !is_temp case, this runs in a complete_shared
> > callback.  A process unrelated to the original IO may run this callback.
> > That's unfortunate in two ways.  First, that other process's client gets an
> > unexpected WARNING.  The process getting the WARNING may not even have
> > zero_damaged_pages enabled.  Second, the client of the process that staged
> > the IO gets no message.
> 
> Ah, right. That could be why I had flipped it. If so, shame on me for not
> adding a comment...
> 
> 
> > AIO ERROR-level messages handle this optimally.  We emit a LOG-level message
> > in the process that runs the complete_shared callback, and we arrange for the
> > ERROR-level message in the stager.  That would be ideal here: LOG in the
> > complete_shared runner, WARNING in the stager.
> 
> We could obviously downgrade (crossgrade? A LOG is more severe than a WARNING
> in some ways, but not others) the message when run in a different backend
> fairly easily.  Still emitting a WARNING in the stager, however, is a bit more
> tricky.
> 
> Before thinking more deeply about how we could emit WARNING in the stager:
> 
> Is it actually sane to use WARNING here? At least for ZERO_ON_ERROR that could
> trigger a rather massive flood of messages to the client in a *normal*
> situation. I'm thinking of something like an insert extending a relation some
> time after an immediate restart and encountering a lot of FSM corruption (due
> to its non-crash-safe-ness) during the search for free space and the
> subsequent FSM vacuum.  It might be ok to LOG that, but sending a lot of
> WARNINGs to the client seems not quite right.

Orthogonal to AIO, I do think LOG (or even DEBUG1?) is better for
ZERO_ON_ERROR.  The ZERO_ON_ERROR case also should not use
ERRCODE_DATA_CORRUPTED.  (That errcode shouldn't appear for business as usual.
It should signify wrong or irretrievable query results, essentially.)

For zero_damaged_pages, WARNING seems at least defensible, and
ERRCODE_DATA_CORRUPTED is right.  It wouldn't be the worst thing to change
zero_damaged_pages to LOG and let the complete_shared runner log it, as long
as we release-note that.  It's superuser-only, and the superuser can learn to
check the log.  One typically should use zero_damaged_pages in one session at
a time, so the logs won't be too confusing.

Another thought on complete_shared running on other backends: I wonder if we
should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
other process" or similar, so people wonder less about how "SELECT FROM a" led
to a log message about IO on table "b".

> If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> then could tell the stager to issue the WARNING. It would add a bit of
> distributed cost, both to callbacks and users of AIO, but it might not be too
> bad.
> 
> 
> > One could simplify things by forcing io_method=sync under ZERO_ON_ERROR ||
> > zero_damaged_pages, perhaps as a short-term approach.
> 
> Yea, that could work.  Perhaps even just for zero_damaged_pages, after
> changing it so that ZERO_ON_ERROR always just LOGs.

Yes.

> Hm, it seems somewhat nasty to have rather different performance
> characteristics when forced to use zero_damaged_pages to recover from a
> problem. Imagine an instance that's configured to use DIO and then needs to
> use zero_damaged_pages to recover from corruption...

True.  I'd be willing to bet high-scale use of zero_damaged_pages is rare.  By
high scale, I mean something like reading a whole large table, as opposed to a
TID scan of the known-problematic range.  That said, people (including me)
expect the emergency tools to be good even if they're used rarely.  You're not
wrong to worry about it.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > (We may be due for a test mode that does smgrreleaseall() at every
> > > CHECK_FOR_INTERRUPTS()?)
> >
> > I suspect we are. I'm a bit afraid of even trying...
> >
> > ...
> >
> > It's extremely slow - but at least the main regression as well as the aio tests pass!
>
> One less thing!

Unfortunately I'm now doubting the thoroughness of my check - while I made
every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we
trigger it conditionally. E.g. elog(DEBUGN, ...) only executes a CFI if
log_min_messages <= DEBUGN...

I'll try that in a bit.


> > Because the end state varies, depending on the number of previously staged
> > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > *not* aligned with an internal state name.  It will just mislead readers to
> > think that there's a deterministic mapping when that does not exist.
>
> That's fair.  Could we provide the mapping in a comment, something like the
> following?

Yes!

I wonder if it should also be duplicated or referenced elsewhere, although I
am not sure where precisely.


> --- a/src/include/storage/aio_internal.h
> +++ b/src/include/storage/aio_internal.h
> @@ -34,5 +34,10 @@
>   * linearly through all states.
>   *
> - * State changes should all go through pgaio_io_update_state().
> + * State changes should all go through pgaio_io_update_state().  Its callers
> + * use these naming conventions:
> + *
> + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> + *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> + *   PGAIO_HS_COMPLETED_LOCAL.
>   */
>  typedef enum PgAioHandleState

One detail I'm not sure about: The above change is correct, but perhaps a bit
misleading, because we can actually go "back" to IDLE. Not sure how to best
phrase that though.
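
For reference, the progression in question, abridged:

    IDLE -> HANDED_OUT -> DEFINED -> STAGED -> SUBMITTED
      -> COMPLETED_IO -> COMPLETED_SHARED -> COMPLETED_LOCAL
      -> (handle reclaimed) -> IDLE again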


> > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> > naming that I just stopped seeing.

I assume you're on board with renaming _io_prep* to _io_start_*?


> > I'll try to think more about this, perhaps I can make myself see your POV
> > more.
>
> > > As the patch stands, LockBufferForCleanup() can succeed when
> > > ConditionalLockBufferForCleanup() would have returned false.
> >
> > That's already true today, right? In master ConditionalLockBufferForCleanup()
> > for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.
>
> I'm finding a LocalRefCount check under LockBufferForCleanup:

I guess I should have stopped looking at code / replying before my last email
last night... Not sure how I missed that.



> CheckBufferIsPinnedOnce(Buffer buffer)
> {
>     if (BufferIsLocal(buffer))
>     {
>         if (LocalRefCount[-buffer - 1] != 1)
>             elog(ERROR, "incorrect local pin count: %d",
>                  LocalRefCount[-buffer - 1]);
>     }
>     else
>     {
>         if (GetPrivateRefCount(buffer) != 1)
>             elog(ERROR, "incorrect local pin count: %d",
>                  GetPrivateRefCount(buffer));
>     }
> }

Pretty random orthogonal thought, that I was reminded of by the above code
snippet:

It sure seems we should at some point get rid of LocalRefCount[] and just use
the GetPrivateRefCount() infrastructure for both shared and local buffers.  I
don't think the GetPrivateRefCount() infrastructure cares about
local/non-local, leaving a few asserts aside.  If we do that, and start to use
BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of
differences between shared and local buffers would be a lot smaller.


> > > Like the comment, I expect it's academic today.  I expect it will stay
> > > academic.  Anything that does a cleanup will start by reading the buffer,
> > > > which will resolve any refcount the AIO subsystem holds for a read.  If there's
> > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > > or replacing them with a comment (like the present paragraph)?
> >
> > I think we'll need an expanded version of what I suggest once we have writes -
> > but as you say, it shouldn't matter as long as we only have reads. So I think
> > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > change makes sense.
>
> Moving those changes works for me.  I'm not currently seeing the need under
> writes, but that may get clearer upon reaching those patches.

FWIW, I don't think it's currently worth looking at the write side in detail;
there are enough required changes that it's not necessarily the best use of
your time at this point. At least:


- Write logic needs to be rebased on top of the patch series so that buffers
  can't get dirtied (e.g. by hint bits) while write IO is going on

  The performance impact of doing the memory copies is rather substantial, as
  on intel memory bandwidth is *the* IO bottleneck even just for the checksum
  computation, without a copy. That makes the memory copy for something like
  bounce buffers hurt really badly.

  And the memory usage of bounce buffers is also really concerning.

  And even without checksums, several filesystems *really* don't like buffers
  getting modified during DIO writes. Which I think would mean we ought to use
  bounce buffers for *all* writes, which would impose a *very* substantial
  overhead (basically removing the benefit of DMA happening off-cpu).


- Right now the sync.c integration with smgr.c/md.c isn't properly safe to use
  in a critical section

  The only reason it doesn't immediately fail is that it's reasonably rare
  that RegisterSyncRequest() fails *and* either:

  - smgropen()->hash_search(HASH_ENTER) decides to resize the hash table, even
    though the lookup is guaranteed to succeed for io_method=worker.

  - an io_method=uring completion is run in a different backend and smgropen()
    needs to build a new entry and thus needs to allocate memory

  For a bit I thought this could be worked around easily enough by not doing
  an smgropen() in mdsyncfiletag(), or adding a "fallible" smgropen() and
  instead just opening the file directly. That actually does kinda solve the
  problem, but only because the memory allocation in PathNameOpenFile()
  uses malloc(), not palloc(), and thus doesn't trigger the assertion against
  allocating memory in a critical section


- I think it requires new lwlock.c infrastructure (as v1 of aio had), to make
  LockBuffer(BUFFER_LOCK_EXCLUSIVE) etc wait in a concurrency safe manner for
  in-progress writes

  I can think of ways to solve this purely in bufmgr.c, but only in ways that
  would cause other problems (e.g. setting BM_IO_IN_PROGRESS before waiting
  for an exclusive lock) and/or would be expensive.


- My current set of patches doesn't implement bgwriter_flush_after,
  checkpointer_flush_after

  I think that's not too hard to do, it's mainly round tuits.


- temp_file_limit is not respected by aio writes

  I guess that could be ok if AIO writes are only used by checkpointer /
  bgwriter, but we need to figure out a way to deal with that. Perhaps by
  redesigning temp_file_limit; the current implementation seems like a rather
  substantial layering violation.


- Too much duplicated code, as there's the aio and non-aio write paths.  That
  might be ok for a bit.


I updated the commit messages of the relevant commits with the above, there
were abbreviated versions of most of the above, but not in enough detail for
anybody but me (and maybe not even that).



> > Do you think it's worth mentioning the above workaround? I'm mildly inclined
> > not to.
>
> Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
> exit+reenter is banned.  Maybe "(b) start another batch (without first exiting
> one)".  It's also fine as-is, though.

I updated it to:

 * b) start another batch (without first exiting batchmode and re-entering
 *    before returning)


> > I'm ok with all of these. In order of preference:
> >
> > 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
> > 2) READ_STREAM_BATCHMODE_AWARE
> > 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE
>
> Same for me.

For now I'll leave it at READ_STREAM_USE_BATCHING, but if Thomas has a
preference I'll go for whatever we have a majority for.

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 17:10:19 +1300, Thomas Munro wrote:
> On Tue, Mar 25, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> > Attached v2.12, with the following changes:
> 
> Here's a tiny fixup to make io_concurrency=0 turn on
> READ_BUFFERS_SYNCHRONOUSLY as mooted in a FIXME.  Without this, AIO
> will still run at level 1 even if you asked for 0.  Feel free to
> squash, or ignore and I'll push it later, whatever suits... (tested on
> the tip of your public aio-2 branch).

Thanks! I squashed it into "aio: Basic read_stream adjustments for real AIO"
and updated the commit message to account for that.

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 07:11:20 -0700, Noah Misch wrote:
> On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> > Is it actually sane to use WARNING here? At least for ZERO_ON_ERROR that could
> > trigger a rather massive flood of messages to the client in a *normal*
> > situation. I'm thinking of something like an insert extending a relation some
> > time after an immediate restart and encountering a lot of FSM corruption (due
> > to its non-crash-safe-ness) during the search for free space and the
> > subsequent FSM vacuum.  It might be ok to LOG that, but sending a lot of
> > WARNINGs to the client seems not quite right.
> 
> Orthogonal to AIO, I do think LOG (or even DEBUG1?) is better for
> ZERO_ON_ERROR.  The ZERO_ON_ERROR case also should not use
> ERRCODE_DATA_CORRUPTED.  (That errcode shouldn't appear for business as usual.
> It should signify wrong or irretrievable query results, essentially.)

I strongly agree on the errcode - basically makes it much harder to use the
errcode to trigger alerting. And we don't have any other way to do that...

I'm, obviously, positive on not using WARNING for ZERO_ON_ERROR. I'm neutral
on LOG vs DEBUG1, I can see arguments for either.


> For zero_damaged_pages, WARNING seems at least defensible, and
> ERRCODE_DATA_CORRUPTED is right.  It wouldn't be the worst thing to change
> zero_damaged_pages to LOG and let the complete_shared runner log it, as long
> as we release-note that.  It's superuser-only, and the superuser can learn to
> check the log.  One typically should use zero_damaged_pages in one session at
> a time, so the logs won't be too confusing.

It's obviously tempting to go for that, I'm somewhat undecided what the best
way is right now.  There might be a compromise, see below:


> > If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> > then could tell the stager to issue the WARNING. It would add a bit of
> > distributed cost, both to callbacks and users of AIO, but it might not be too
> > bad.

FWIW, I prototyped this, it's not hard.

But it can't replace the current WARNING with 100% fidelity: If we read 60
blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we
can't encode that many block offsets in a single PgAioResult, there's not enough
space, and enlarging it far enough doesn't seem to make sense either.
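
For reference, the packed result looks roughly like this (field widths
illustrative, not exact):

typedef struct PgAioResult
{
    uint32  id : 6;           /* which callback produced the result */
    uint32  status : 2;       /* PgAioResultStatus */
    uint32  error_data : 24;  /* callback-specific payload - far too small
                               * for dozens of block offsets */
    int32   result;
} PgAioResult;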


What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(),
with that warning saying that there were N zeroed blocks in a read from block
X to block Y, and a HINT saying that there are more details in the server log.



> Another thought on complete_shared running on other backends: I wonder if we
> should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
> other process" or similar, so people wonder less about how "SELECT FROM a" led
> to a log message about IO on table "b".

I've been wondering about that as well, and yes, we probably should.

I'd add the pid of the backend that started the IO to the message - although I
need to check whether we're trying to keep PIDs of other processes from
unprivileged users.

I think we probably should add a similar, but not equivalent, context in io
workers. Maybe "I/O worker executing I/O on behalf of process %d".
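
Both would use the usual ErrorContextCallback pattern, sketched here with
illustrative names:

static void
io_worker_error_cb(void *arg)
{
    errcontext("I/O worker executing I/O on behalf of process %d",
               *(int *) arg);
}

...
ErrorContextCallback errcallback;

errcallback.callback = io_worker_error_cb;
errcallback.arg = &stager_pid;
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;

/* execute the IO */

error_context_stack = errcallback.previous;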

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Mar 25, 2025 at 11:26:14AM -0400, Andres Freund wrote:
> On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > > (We may be due for a test mode that does smgrreleaseall() at every
> > > > CHECK_FOR_INTERRUPTS()?)
> > >
> > > I suspect we are. I'm a bit afraid of even trying...
> > >
> > > ...
> > >
> > > It's extremely slow - but at least the main regression as well as the aio tests pass!
> >
> > One less thing!
> 
> Unfortunately I'm now doubting the thoroughness of my check - while I made
> every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we
> trigger it conditionally. E.g. elog(DEBUGN, ...) only executes a CFI if
> log_min_messages <= DEBUGN...
> 
> I'll try that in a bit.

While having nagging thoughts that we might be releasing FDs before io_uring
gets them into kernel custody, I tried this hack to maximize FD turnover:

static void
ReleaseLruFiles(void)
{
#if 0
    while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
    {
        if (!ReleaseLruFile())
            break;
    }
#else
    while (ReleaseLruFile())
        ;
#endif
}

"make check" with default settings (io_method=worker) passes, but
io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
Here's the richer of the two diffs:

diff -U3 src/test/regress/expected/sanity_check.out src/test/regress/results/sanity_check.out
--- src/test/regress/expected/sanity_check.out  2024-10-24 12:43:25.741817594 -0700
+++ src/test/regress/results/sanity_check.out   2025-03-25 08:27:44.875151566 -0700
@@ -1,4 +1,7 @@
 VACUUM;
+ERROR:  index "pg_enum_oid_index" contains corrupted page at block 2
+HINT:  Please REINDEX it.
+CONTEXT:  while vacuuming index "pg_enum_oid_index" of relation "pg_catalog.pg_enum"
 --
 -- Sanity check: every system catalog that has OIDs should have
 -- a unique index on OID.  This ensures that the OIDs will be unique,
diff -U3 src/test/regress/expected/oidjoins.out src/test/regress/results/oidjoins.out
--- src/test/regress/expected/oidjoins.out      2023-07-06 19:58:07.686364439 -0700
+++ src/test/regress/results/oidjoins.out       2025-03-25 08:28:02.584335458 -0700
@@ -233,6 +233,8 @@
 NOTICE:  checking pg_policy {polrelid} => pg_class {oid}
 NOTICE:  checking pg_policy {polroles} => pg_authid {oid}
 NOTICE:  checking pg_default_acl {defaclrole} => pg_authid {oid}
+WARNING:  FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,5)",0)
+WARNING:  FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,7)",402654464)
 NOTICE:  checking pg_default_acl {defaclnamespace} => pg_namespace {oid}
 NOTICE:  checking pg_init_privs {classoid} => pg_class {oid}
 NOTICE:  checking pg_seclabel {classoid} => pg_class {oid}

> > > Because the end state varies, depending on the number of previously staged
> > > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > > *not* aligned with an internal state name.  It will just mislead readers to
> > > think that there's a deterministic mapping when that does not exist.
> >
> > That's fair.  Could we provide the mapping in a comment, something like the
> > following?
> 
> Yes!
> 
> I wonder if it should also be duplicated or referenced elsewhere, although I
> am not sure where precisely.

I considered the README.md also, but adding that wasn't an obvious win.

> > --- a/src/include/storage/aio_internal.h
> > +++ b/src/include/storage/aio_internal.h
> > @@ -34,5 +34,10 @@
> >   * linearly through all states.
> >   *
> > - * State changes should all go through pgaio_io_update_state().
> > + * State changes should all go through pgaio_io_update_state().  Its callers
> > + * use these naming conventions:
> > + *
> > + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> > + *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> > + *   PGAIO_HS_COMPLETED_LOCAL.
> >   */
> >  typedef enum PgAioHandleState
> 
> One detail I'm not sure about: The above change is correct, but perhaps a bit
> misleading, because we can actually go "back" to IDLE. Not sure how to best
> phrase that though.

Not sure either.  Maybe the above could change to "to PGAIO_HS_STAGED or any
subsequent state" and the comment at PGAIO_HS_STAGED could say like "Once in
this state, concurrent activity could move the IO all the way to
PGAIO_HS_COMPLETED_LOCAL and recycle it back to IDLE."

> > > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> > > naming that I just stopped seeing.
> 
> I assume you're on board with renaming _io_prep* to _io_start_*?

Yes.

> > > I'll try to think more about this, perhaps I can make myself see your POV
> > > more.

> > CheckBufferIsPinnedOnce(Buffer buffer)
> > {
> >     if (BufferIsLocal(buffer))
> >     {
> >         if (LocalRefCount[-buffer - 1] != 1)
> >             elog(ERROR, "incorrect local pin count: %d",
> >                  LocalRefCount[-buffer - 1]);
> >     }
> >     else
> >     {
> >         if (GetPrivateRefCount(buffer) != 1)
> >             elog(ERROR, "incorrect local pin count: %d",
> >                  GetPrivateRefCount(buffer));
> >     }
> > }
> 
> Pretty random orthogonal thought, that I was reminded of by the above code
> snippet:
> 
> It sure seems we should at some point get rid of LocalRefCount[] and just use
> the GetPrivateRefCount() infrastructure for both shared and local buffers.  I
> don't think the GetPrivateRefCount() infrastructure cares about
> local/non-local, leaving a few asserts aside.  If we do that, and start to use
> BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of
> differences between shared and local buffers would be a lot smaller.

That sounds promising.

> > > > Like the comment, I expect it's academic today.  I expect it will stay
> > > > academic.  Anything that does a cleanup will start by reading the buffer,
> > > > which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> > > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > > > or replacing them with a comment (like the present paragraph)?
> > >
> > > I think we'll need an expanded version of what I suggest once we have writes -
> > > but as you say, it shouldn't matter as long as we only have reads. So I think
> > > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > > change makes sense.
> >
> > Moving those changes works for me.  I'm not currently seeing the need under
> > writes, but that may get clearer upon reaching those patches.
> 
> FWIW, I don't think it's currently worth looking at the write side in detail,

Got it.  (I meant I didn't see a first-principles need, not that I had deduced
lack of need from a specific writes implementation.)

> > > Do you think it's worth mentioning the above workaround? I'm mildly inclined
> > > not to.
> >
> > Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
> > exit+reenter is banned.  Maybe "(b) start another batch (without first exiting
> > one)".  It's also fine as-is, though.
> 
> I updated it to:
> 
>  * b) start another batch (without first exiting batchmode and re-entering
>  *    before returning)

That's good.



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> On 2025-03-25 07:11:20 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> > > If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> > > then could tell the stager to issue the WARNING. It would add a bit of
> > > distributed cost, both to callbacks and users of AIO, but it might not be too
> > > bad.
> 
> FWIW, I prototyped this, it's not hard.
> 
> But it can't replace the current WARNING with 100% fidelity: If we read 60
> blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we
> can't encode that many block offsets in a single PgAioResult, there's not enough
> space, and enlarging it far enough doesn't seem to make sense either.
> 
> 
> What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(),
> with that warning saying that there were N zeroed blocks in a read from block
> X to block Y, and a HINT saying that there are more details in the server log.

Sounds fine.

> > Another thought on complete_shared running on other backends: I wonder if we
> > should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
> > other process" or similar, so people wonder less about how "SELECT FROM a" led
> > to a log message about IO on table "b".
> 
> I've been wondering about that as well, and yes, we probably should.
> 
> I'd add the pid of the backend that started the IO to the message - although
> need to check whether we're trying to keep PIDs of other processes from
> unprivileged users.

We don't.

> I think we probably should add a similar, but not equivalent, context in io
> workers. Maybe "I/O worker executing I/O on behalf of process %d".

Sounds good.
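
A sketch of the shape such a callback could take (names invented here;
ErrorContextCallback and errcontext() are the existing error-context
mechanism):

static void
shared_completion_errcontext(void *arg)
{
    PgAioHandle *ioh = (PgAioHandle *) arg;

    errcontext("completing I/O of other process %d", ioh->owner_pid);
}

...
ErrorContextCallback errcallback;

errcallback.callback = shared_completion_errcontext;
errcallback.arg = ioh;
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;

/* run completion callbacks for the foreign IO ... */

error_context_stack = errcallback.previous;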



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 08:58:08 -0700, Noah Misch wrote:
> While having nagging thoughts that we might be releasing FDs before io_uring
> gets them into kernel custody, I tried this hack to maximize FD turnover:
> 
> static void
> ReleaseLruFiles(void)
> {
> #if 0
>     while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
>     {
>         if (!ReleaseLruFile())
>             break;
>     }
> #else
>     while (ReleaseLruFile())
>         ;
> #endif
> }
> 
> "make check" with default settings (io_method=worker) passes, but
> io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
> runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
> Here's the richer of the two diffs:

Yikes. That's a very good catch.

I spent a bit of time debugging this. I think I see what's going on - it turns
out that the kernel does *not* open the FDs during io_uring_enter() if
IOSQE_ASYNC is specified [1].  Which we do add heuristically, in an attempt to
avoid a small but measurable slowdown for sequential scans that are fully
buffered (c.f. pgaio_uring_submit()).  If I disable that heuristic, your patch
above passes all tests here.
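
The failure mode can be seen outside postgres, too. A self-contained sketch
using plain liburing (error checking elided, "some_file" is a placeholder;
whether you actually hit -EBADF is timing-dependent):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int
main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char        buf[8192];
    int         fd = open("some_file", O_RDONLY);

    io_uring_queue_init(8, &ring, 0);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    /* force execution in the kernel workqueue, as our heuristic does */
    sqe->flags |= IOSQE_ASYNC;

    io_uring_submit(&ring);

    /* the FD is only resolved once the workqueue runs the request ... */
    close(fd);

    io_uring_wait_cqe(&ring, &cqe);
    printf("res: %d\n", cqe->res);  /* can be -EBADF rather than a length */
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}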


I don't know if that's an intentional or unintentional behavioral difference.

There are 2 1/2 ways around this:

1) Stop using IOSQE_ASYNC heuristic
2a) Wait for all in-flight IOs when any FD gets closed
2b) Wait for all in-flight IOs using FD when it gets closed

Given that we have clear evidence that io_uring doesn't completely support
closing FDs while IOs are in flight, be it a bug or intentional, it seems
clearly better to go for 2a or 2b.

Greetings,

Andres Freund


[1] Instead, files are opened when the queue entry is being worked on.
    Interestingly, that only happens when the IO is *explicitly*
    requested to be executed in the workqueue with IOSQE_ASYNC, not when it's
    put there because it couldn't be done in a non-blocking way.



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> On 2025-03-25 08:58:08 -0700, Noah Misch wrote:
> > While having nagging thoughts that we might be releasing FDs before io_uring
> > gets them into kernel custody, I tried this hack to maximize FD turnover:
> > 
> > static void
> > ReleaseLruFiles(void)
> > {
> > #if 0
> >     while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
> >     {
> >         if (!ReleaseLruFile())
> >             break;
> >     }
> > #else
> >     while (ReleaseLruFile())
> >         ;
> > #endif
> > }
> > 
> > "make check" with default settings (io_method=worker) passes, but
> > io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
> > runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
> > Here's the richer of the two diffs:
> 
> Yikes. That's a very good catch.
> 
> I spent a bit of time debugging this. I think I see what's going on - it turns
> out that the kernel does *not* open the FDs during io_uring_enter() if
> IOSQE_ASYNC is specified [1].  Which we do add heuristically, in an attempt to
> avoid a small but measurable slowdown for sequential scans that are fully
> buffered (c.f. pgaio_uring_submit()).  If I disable that heuristic, your patch
> above passes all tests here.

Same result here.  As an additional data point, I tried adding this so every
reopen gets a new FD number (leaks FDs wildly):

--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1304,5 +1304,5 @@ LruDelete(File file)
      * to leak the FD than to mess up our internal state.
      */
-    if (close(vfdP->fd) != 0)
+    if (dup2(2, vfdP->fd) != vfdP->fd)
         elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
              "could not close file \"%s\": %m", vfdP->fileName);

The same "make check" w/ TEMP_CONFIG io_method=io_uring passes with the
combination of that and the max-turnover change to ReleaseLruFiles().

> I don't know if that's an intentional or unintentional behavioral difference.
> 
> There are 2 1/2 ways around this:
> 
> 1) Stop using IOSQE_ASYNC heuristic
> 2a) Wait for all in-flight IOs when any FD gets closed
> 2b) Wait for all in-flight IOs using FD when it gets closed
> 
> Given that we have clear evidence that io_uring doesn't completely support
> closing FDs while IOs are in flight, be it a bug or intentional, it seems
> clearly better to go for 2a or 2b.

Agreed.  If a workload spends significant time on fd.c closing files, I
suspect that workload already won't have impressive benchmark numbers.
Performance-seeking workloads will already want to tune FD usage high enough
to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
clearly beats the other.  I'd try (2b) first but, if complicated, quickly
abandon it in favor of (2a).  What other considerations could be important?



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > I don't know if that's an intentional or unintentional behavioral difference.
> > 
> > There are 2 1/2 ways around this:
> > 
> > 1) Stop using IOSQE_ASYNC heuristic
> > 2a) Wait for all in-flight IOs when any FD gets closed
> > 2b) Wait for all in-flight IOs using FD when it gets closed
> > 
> > Given that we have clear evidence that io_uring doesn't completely support
> > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > clearly better to go for 2a or 2b.
> 
> Agreed.  If a workload spends significant time on fd.c closing files, I
> suspect that workload already won't have impressive benchmark numbers.
> Performance-seeking workloads will already want to tune FD usage high enough
> to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> abandon it in favor of (2a).  What other considerations could be important?

The only other consideration I can think of is whether this should happen for
all io_methods or not.

I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
it's a bit weird to have a bool in a struct called *Ops.
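
Concretely, something like this (field name invented for illustration):

typedef struct IoMethodOps
{
    /* ... existing callbacks ... */

    /*
     * If true, in-flight IOs referencing an FD have to be waited for
     * before fd.c may close that FD.
     */
    bool        wait_on_fd_before_close;
} IoMethodOps;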

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Mar 25, 2025 at 04:07:35PM -0400, Andres Freund wrote:
> On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> > On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > > There are 2 1/2 ways around this:
> > > 
> > > 1) Stop using IOSQE_ASYNC heuristic
> > > 2a) Wait for all in-flight IOs when any FD gets closed
> > > 2b) Wait for all in-flight IOs using FD when it gets closed
> > > 
> > > Given that we have clear evidence that io_uring doesn't completely support
> > > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > > clearly better to go for 2a or 2b.
> > 
> > Agreed.  If a workload spends significant time on fd.c closing files, I
> > suspect that workload already won't have impressive benchmark numbers.
> > Performance-seeking workloads will already want to tune FD usage high enough
> > to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> > clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> > abandon it in favor of (2a).  What other considerations could be important?
> 
> The only other consideration I can think of is whether this should happen for
> all io_methods or not.

Either way is fine, I think.

> I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
> it's a bit weird to have a bool in a struct called *Ops.

That wouldn't bother me.  IndexAmRoutine has many bools, and "Ops" is
basically a synonym of "Routine".



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 13:18:50 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 04:07:35PM -0400, Andres Freund wrote:
> > On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> > > On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > > > There are 2 1/2 ways around this:
> > > > 
> > > > 1) Stop using IOSQE_ASYNC heuristic
> > > > 2a) Wait for all in-flight IOs when any FD gets closed
> > > > 2b) Wait for all in-flight IOs using FD when it gets closed
> > > > 
> > > > Given that we have clear evidence that io_uring doesn't completely support
> > > > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > > > clearly better to go for 2a or 2b.
> > > 
> > > Agreed.  If a workload spends significant time on fd.c closing files, I
> > > suspect that workload already won't have impressive benchmark numbers.
> > > Performance-seeking workloads will already want to tune FD usage high enough
> > > to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> > > clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> > > abandon it in favor of (2a).  What other considerations could be important?
> > 
> > The only other consideration I can think of is whether this should happen for
> > all io_methods or not.
> 
> Either way is fine, I think.

Here's a draft incremental patch (attached as a .fixup to avoid triggering
cfbot) implementing 2b).


> > I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
> > it's a bit weird to have a bool in a struct called *Ops.
> 
> That wouldn't bother me.  IndexAmRoutine has many bools, and "Ops" is
> basically a synonym of "Routine".

Cool. Done that way.

The repeated-iteration approach taken in pgaio_closing_fd() isn't the
prettiest, but it's hard to imagine that ever being noticeable.
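
In rough outline (helper and field names here are illustrative, not the
verbatim fixup):

void
pgaio_closing_fd(int fd)
{
    if (!pgaio_method_ops->wait_on_fd_before_close)
        return;

    /*
     * Waiting on one IO can allow further IOs using the FD to be
     * submitted, so rescan the in-flight list from scratch after every
     * wait, until no in-flight IO references the FD anymore.
     */
    for (;;)
    {
        PgAioHandle *ioh = find_inflight_io_using_fd(fd);   /* illustrative */

        if (ioh == NULL)
            break;

        wait_for_io(ioh);                                   /* illustrative */
    }
}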


This survives a testrun where I use your torture patch and where I force all
IOs to use ASYNC. Previously that did not get very far.  I also did verify
that, if I allow a small number of FDs, we do not wrongly wait for all IOs.

Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Mar 25, 2025 at 04:56:53PM -0400, Andres Freund wrote:
> The repeated-iteration approach taken in pgaio_closing_fd() isn't the
> prettiest, but it's hard to imagine that ever being noticeable.

Yep.  I've reviewed the fixup code, and it looks all good.

> This survives a testrun where I use your torture patch and where I force all
> IOs to use ASYNC. Previously that did not get very far.  I also did verify
> that, if I allow a small number of FDs, we do not wrongly wait for all IOs.

I, too, see the test diffs gone.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 09:15:43 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> > FWIW, I prototyped this, it's not hard.
> > 
> > But it can't replace the current WARNING with 100% fidelity: If we read 60
> > blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we
> > can't encode that many block offsets in a single PgAioResult, there's not
> > enough space, and enlarging it far enough doesn't seem to make sense either.
> > 
> > 
> > What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(),
> > with that warning saying that there were N zeroed blocks in a read from block
> > X to block Y and a HINT saying that there are more details in the server log.

It should probably be DETAIL, not HINT...


> Sounds fine.

I got that working. To make it readable, it required changing the division of
labor between buffer_readv_complete() and buffer_readv_complete_one() a bit,
but I think it's actually easier to understand now.

Still need to beef up the test infrastructure a bit to make the multi-block
cases more easily testable.


Could use some input on the framing of the message/detail. Right now it's:

ERROR:  invalid page in block 8 of relation base/5/16417
DETAIL: Read of 8 blocks, starting at block 7, 1 other pages in the same read are invalid.

But that doesn't seem great. Maybe:

DETAIL: Read of blocks 7..14, 1 other pages in the same read were also invalid.

But that still isn't really a sentence.

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote:
> Attached v2.12, with the following changes:

> TODO:

>   Wonder if it's worth adding some coverage for when checksums are disabled?
>   Probably not necessary?

Probably not necessary, agreed.  Orthogonal to AIO, it's likely worth a CI
"SPECIAL" and/or buildfarm animal that runs all tests w/ checksums disabled.


> Subject: [PATCH v2.12 01/28] aio: Be more paranoid about interrupts

Ready for commit


> Subject: [PATCH v2.12 02/28] aio: Pass result of local callbacks to
>  ->report_return

Ready for commit w/ up to one cosmetic change:

> @@ -296,7 +299,9 @@ pgaio_io_call_complete_local(PgAioHandle *ioh)
>  
>      /*
>       * Note that we don't save the result in ioh->distilled_result, the local
> -     * callback's result should not ever matter to other waiters.
> +     * callback's result should not ever matter to other waiters. However, the
> +     * local backend does care, so we return the result as modified by local
> +     * callbacks, which then can be passed to ioh->report_return->result.
>       */
>      pgaio_debug_io(DEBUG3, ioh,
>                     "after local completion: distilled result: (status %s, id %u, error_data %d, result %d), raw_result:%d",

Should this debug message remove the word "distilled", since this commit
solidifies distilled_result as referring to the complete_shared result?


> Subject: [PATCH v2.12 03/28] aio: Add liburing dependency

Ready for commit


> Subject: [PATCH v2.12 04/28] aio: Add io_method=io_uring

Ready for commit w/ open_fd.fixup


> Subject: [PATCH v2.12 05/28] aio: Implement support for reads in smgr/md/fd

Ready for commit w/ up to two cosmetic changes:

> +/*
> + * AIO error reporting callback for mdstartreadv().
> + *
> + * Errors are encoded as follows:
> + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0

I recommend replacing "errno != 0" with either "that errno" or "errno ==
error_data".

Second, the aio_internal.h comment changes discussed in
postgr.es/m/20250325155808.f7.nmisch@google.com and earlier.


> Subject: [PATCH v2.12 06/28] aio: Add README.md explaining higher level design

Ready for commit

(This and the previous patch have three spots that would change with the
s/prep/start/ renames.  No opinion on whether to rename before or rename
after.)


> Subject: [PATCH v2.12 07/28] localbuf: Track pincount in BufferDesc as well

The plan here looks good:
postgr.es/m/dbeeaize47y7esifdrinpa2l7cqqb67k72exvuf3appyxywjnc@7bt76mozhcy2


> Subject: [PATCH v2.12 08/28] bufmgr: Implement AIO read support

See review here and later discussion:
postgr.es/m/20250325022037.91.nmisch@google.com


> Subject: [PATCH v2.12 09/28] bufmgr: Use AIO in StartReadBuffers()

Ready for commit after a batch of small things, all but one of which have no
implications beyond code cosmetics.  This is my first comprehensive review of
this patch.  I like the test coverage (by the end of the patch series).  For
anyone else following, I found "diff -w" helpful for the bufmgr.c changes.
That's because a key part is former WaitReadBuffers() code moving up an
indentation level to its home in new subroutine AsyncReadBuffers().

>      Assert(*nblocks == 1 || allow_forwarding);
>      Assert(*nblocks > 0);
>      Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
> +    Assert(*nblocks == 1 || allow_forwarding);

Duplicates the assert three lines back.

> +        nblocks = aio_ret->result.result;
> +
> +        elog(DEBUG3, "partial read, will retry");
> +
> +    }
> +    else if (aio_ret->result.status == PGAIO_RS_ERROR)
> +    {
> +        pgaio_result_report(aio_ret->result, &aio_ret->target_data, ERROR);
> +        nblocks = 0;            /* silence compiler */
> +    }
>  
>      Assert(nblocks > 0);
>      Assert(nblocks <= MAX_IO_COMBINE_LIMIT);
>  
> +    operation->nblocks_done += nblocks;

I struggled somewhat from the variety of "nblocks" variables: this local
nblocks, operation->nblocks, actual_nblocks, and *nblocks in/out parameters of
some functions.  No one of them is clearly wrong to use the name, and some of
these names are preexisting.  That said, if you see opportunities to push in
the direction of more-specific names, I'd welcome it.

For example, this local variable could become add_to_nblocks_done instead.

> +        AsyncReadBuffers(operation, &nblocks);

I suggest renaming s/nblocks/ignored_nblocks_progress/ here.

> +     * If we need to wait for IO before we can get a handle, submit already
> +     * staged IO first, so that other backends don't need to wait. There

s/already staged/already-staged/.  Normally I'd skip this as nitpicking, but I
misread this particular sentence twice, as "submit" being the subject that
"staged" something.  (It's still nitpicking, alas.)

>          /*
>           * How many neighboring-on-disk blocks can we scatter-read into other
>           * buffers at the same time?  In this case we don't wait if we see an
> -         * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
> +         * I/O already in progress.  We already set BM_IO_IN_PROGRESS for the
>           * head block, so we should get on with that I/O as soon as possible.
> -         * We'll come back to this block again, above.
> +         *
> +         * We'll come back to this block in the next call to
> +         * StartReadBuffers() -> AsyncReadBuffers().

Did this mean to say "WaitReadBuffers() -> AsyncReadBuffers()"?  I'm guessing
so, since WaitReadBuffers() is the one that loops.  It might be referring to
read_stream_start_pending_read()'s next StartReadBuffers(), though.

I think this could just delete the last sentence.  The function header comment
already mentions the possibility of reading a subset of the request.  This
spot doesn't need to detail how the higher layers come back to here.

> +        smgrstartreadv(ioh, operation->smgr, forknum,
> +                       blocknum + nblocks_done,
> +                       io_pages, io_buffers_len);
> +        pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
> +                                io_start, 1, *nblocks_progress * BLCKSZ);

We don't assign *nblocks_progress until lower in the function, so I think
"io_buffers_len" should replace "*nblocks_progress" here.  (This is my only
non-cosmetic comment on this patch.)


> Subject: [PATCH v2.12 10/28] aio: Basic read_stream adjustments for real AIO

(Still reviewing this and later patches, but incidental observations follow.)


> Subject: [PATCH v2.12 16/28] aio: Add test_aio module

> +use List::Util qw(sample);

sample() is new in 2020:
https://metacpan.org/release/PEVANS/Scalar-List-Utils-1.68/source/Changes#L100

Hence, I'd expect some buildfarm failures.  I'd try to use shuffle(), then
take the first N elements.

> +++ b/src/test/modules/test_aio/test_aio.c
> @@ -0,0 +1,712 @@
> +/*-------------------------------------------------------------------------
> + *
> + * delay_execution.c
> + *        Test module to allow delay between parsing and execution of a query.
> + *
> + * The delay is implemented by taking and immediately releasing a specified
> + * advisory lock.  If another process has previously taken that lock, the
> + * current process will be blocked until the lock is released; otherwise,
> + * there's no effect.  This allows an isolationtester script to reliably
> + * test behaviors where some specified action happens in another backend
> + * between parsing and execution of any desired query.
> + *
> + * Copyright (c) 2020-2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *      src/test/modules/test_aio/test_aio.c

To elaborate on my last review, the entire header comment was a copy from
delay_execution.c.  v2.12 fixes the IDENTIFICATION, but the rest needs
updates.

Thanks,
nm



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Mar 25, 2025 at 08:17:17PM -0400, Andres Freund wrote:
> On 2025-03-25 09:15:43 -0700, Noah Misch wrote:
> > On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> > > FWIW, I prototyped this, it's not hard.
> > > 
> > > But it can't replace the current WARNING with 100% fidelity: If we read 60
> > > blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we
> > > can't encode that many block offsets in a single PgAioResult, there's not
> > > enough space, and enlarging it far enough doesn't seem to make sense either.
> > > 
> > > 
> > > What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(),
> > > with that warning saying that there were N zeroed blocks in a read from block
> > > X to block Y and a HINT saying that there are more details in the server log.
> 
> It should probably be DETAIL, not HINT...

Either is fine with me.  I would go for HINT if referring to the server log,
given the precedent of errhint("See server log for query details.").  DETAIL
fits for block counts, though:

> Could use some input on the framing of the message/detail. Right now it's:
> 
> ERROR:  invalid page in block 8 of relation base/5/16417
> DETAIL: Read of 8 blocks, starting at block 7, 1 other pages in the same read are invalid.
> 
> But that doesn't seem great. Maybe:
> 
> DETAIL: Read of blocks 7..14, 1 other pages in the same read were also invalid.
> 
> But that still isn't really a sentence.

How about this for the multi-page case:

WARNING: zeroing out %u invalid pages among blocks %u..%u of relation %s
DETAIL:  Block %u held first invalid page.
HINT: See server log for the other %u invalid blocks.


For the one-page case, the old message can stay:

WARNING:  invalid page in block %u of relation %s; zeroing out page
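
Spelled out as an ereport() sketch (variable names invented):

ereport(WARNING,
        errcode(ERRCODE_DATA_CORRUPTED),
        errmsg("zeroing out %u invalid pages among blocks %u..%u of relation %s",
               invalid_count, first_block, last_block, rpath),
        errdetail("Block %u held first invalid page.", first_invalid_block),
        errhint("See server log for the other %u invalid blocks.",
                invalid_count - 1));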



Re: AIO v2.5

From
Noah Misch
Date:
I reviewed everything up to and including "[PATCH v2.12 17/28] aio, bufmgr:
Comment fixes", the last patch before write support.
postgr.es/m/20250326001915.bc.nmisch@google.com covered patches 1-9, and this
email covers patches 10-17.  All remaining review comments are minor, so I've
marked the commitfest entry Ready for Committer.  If there's anything you'd
like re-reviewed before you commit it, feel free to bring it to my attention.
Thanks for getting the feature to this stage!

On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote:
> Subject: [PATCH v2.12 10/28] aio: Basic read_stream adjustments for real AIO

> @@ -631,6 +637,9 @@ read_stream_begin_impl(int flags,
>       * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
>       * above.  If we had real asynchronous I/O we might need a slightly
>       * different definition.
> +     *
> +     * FIXME: Not sure what different definition we would need? I guess we
> +     * could add the READ_BUFFERS_SYNCHRONOUSLY flag automatically?

I think we don't need a different definition.  max_ios comes from
effective_io_concurrency and similar settings.  The above comment's definition
of max_ios=0 matches that GUC's documented behavior:

         The allowed range is
         <literal>1</literal> to <literal>1000</literal>, or
         <literal>0</literal> to disable issuance of asynchronous I/O requests.

I'll guess the comment meant that "advice disabled" is a no-op for AIO, so we
could reasonably argue to have effective_io_concurrency=0 distinguish itself
from effective_io_concurrency=1 in some different way for AIO.  Equally,
there's no hurry to use that freedom to distinguish them.


> Subject: [PATCH v2.12 11/28] read_stream: Introduce and use optional batchmode
>  support

> This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and
> uses it where appropriate.

I'd also use the new flag on the read_stream_begin_smgr_relation() call in
RelationCopyStorageUsingBuffer().  It uses block_range_read_stream_cb, and
other streams of that callback rightly use the flag.

> + * b) directly or indirectly start another batch pgaio_enter_batchmode()

Needs new wording from end of postgr.es/m/20250325155808.f7.nmisch@google.com


> Subject: [PATCH v2.12 12/28] docs: Reframe track_io_timing related docs as
>  wait time


> Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems

Consider also updating this comment to stop focusing on prefetch; I think
changing that aligns with the patch's other changes:

/*
 * How many buffers PrefetchBuffer callers should try to stay ahead of their
 * ReadBuffer calls by.  Zero means "never prefetch".  This value is only used
 * for buffers not belonging to tablespaces that have their
 * effective_io_concurrency parameter set.
 */
int            effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;

> -#io_combine_limit = 128kB        # usually 1-128 blocks (depends on OS)
> +#io_combine_limit = 128kB        # usually 1-32 blocks (depends on OS)

I think "usually 1-128" remains right given:

GUC_UNIT_BLOCKS
#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
#define PG_IOV_MAX Min(IOV_MAX, 128)

> -         On systems without prefetch advice support, attempting to configure
> -         any value other than <literal>0</literal> will error out.
> +         On systems with prefetch advice support,
> +         <varname>effective_io_concurrency</varname> also controls the prefetch distance.

Wrap the last line.


> Subject: [PATCH v2.12 14/28] docs: Add acronym and glossary entries for I/O
>  and AIO

> These could use a lot more polish.

To me, it's fine as-is.

> I did not actually reference the new entries yet, because I don't really
> understand what our policy for that is.

I haven't seen much of a policy on that.


> Subject: [PATCH v2.12 15/28] aio: Add pg_aios view
> +retry:
> +
> +        /*
> +         * There is no lock that could prevent the state of the IO from
> +         * advancing concurrently - and we don't want to introduce one, as that would
> +         * introduce atomics into a very common path. Instead we
> +         *
> +         * 1) Determine the state + generation of the IO.
> +         *
> +         * 2) Copy the IO to local memory.
> +         *
> +         * 3) Check if state or generation of the IO changed. If the state
> +         * changed, retry, if the generation changed don't display the IO.
> +         */
> +
> +        /* 1) from above */
> +        start_generation = live_ioh->generation;
> +        pg_read_barrier();

Based on the "really started after this function was called" and "no risk of a
livelock here" comments below, I think "retry:"  should be here.  We don't
want to livelock in the form of chasing ever-growing start_generation numbers.

> +        /*
> +         * The IO completed and a new one was started with the same ID. Don't
> +         * display it - it really started after this function was called.
> +         * There would be a risk of a livelock if we just retried endlessly, if IOs
> +         * complete very quickly.
> +         */
> +        if (live_ioh->generation != start_generation)
> +            continue;
> +
> +        /*
> +         * The IOs state changed while we were "rendering" it. Just start from

s/IOs/IO's/

> +         * scratch. There's no risk of a livelock here, as an IO has a limited
> +         * set of states it can be in, and state changes go only in a single
> +         * direction.
> +         */
> +        if (live_ioh->state != start_state)
> +            goto retry;
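
I.e., with the label moved up, the overall shape would be roughly
(simplified):

retry:
    /* 1) determine state + generation of the IO */
    start_generation = live_ioh->generation;
    pg_read_barrier();
    start_state = live_ioh->state;

    /* 2) copy the IO to local memory ... */

    /* 3) re-check generation and state */
    pg_read_barrier();
    if (live_ioh->generation != start_generation)
        continue;               /* handle was reused; don't display the IO */
    if (live_ioh->state != start_state)
        goto retry;             /* bounded, as states advance one way only */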

> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>target</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       What kind of object is the I/O targeting:
> +       <itemizedlist spacing="compact">
> +        <listitem>
> +         <para>
> +          <literal>smgr</literal>, I/O on postgres relations

s/postgres relations/relations/ since SGML docs don't use the term "postgres"
that way.


> Subject: [PATCH v2.12 16/28] aio: Add test_aio module

> --- a/src/test/modules/meson.build
> +++ b/src/test/modules/meson.build
> @@ -1,5 +1,6 @@
>  # Copyright (c) 2022-2025, PostgreSQL Global Development Group
>  
> +subdir('test_aio')
>  subdir('brin')

List is alphabetized; please preserve that.

> +++ b/src/test/modules/test_aio/Makefile
> @@ -0,0 +1,26 @@
> +# src/test/modules/delay_execution/Makefile

Update filename in comment.

> +++ b/src/test/modules/test_aio/meson.build
> @@ -0,0 +1,37 @@
> +# Copyright (c) 2022-2024, PostgreSQL Global Development Group

s/2024/2025/

> --- /dev/null
> +++ b/src/test/modules/test_aio/t/001_aio.pl

s/ {4}/\t/g on this file.  It's mostly \t now, with some exceptions.

> +    test_inject_worker('worker', $node_worker);

What do we expect to happen if autovacuum or checkpointer runs one of these
injection points?  I'm guessing it would at most make that process fail
without affecting the test outcome.  If so, that's fine.

> +        $waitfor,);

s/,//

> +    # normal handle use
> +    psql_like($io_method, $psql, "handle_get_release()",
> +        qq(SELECT handle_get_release()),
> +        qr/^$/, qr/^$/);
> +
> +    # should error out, API violation
> +    psql_like($io_method, $psql, "handle_get_twice()",
> +        qq(SELECT handle_get_release()),
> +        qr/^$/, qr/^$/);

Last two lines are a clone of the previous psql_like() call.  I guess this
wants to instead call handle_get_twice() and check for some stderr.

> +            "read_rel_block_ll() of $tblname page",

What does "_ll" stand for?

> +    # Issue IO without waiting for completion, then exit
> +    $psql_a->query_safe(
> +        qq(SELECT read_rel_block_ll('tbl_ok', 1, wait_complete=>false);));
> +    $psql_a->reconnect_and_clear();
> +
> +    # Check that another backend can read the relevant block
> +    psql_like(
> +        $io_method,
> +        $psql_b,
> +        "completing read started by exited backend",

I think the exiting backend's pgaio_shutdown() completed it.

> +sub test_inject

This deserves a brief comment on the behaviors being tested, like the previous
functions have.  It seems to be about short reads and hard failures like EIO.


> Subject: [PATCH v2.12 17/28] aio, bufmgr: Comment fixes



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-25 17:19:15 -0700, Noah Misch wrote:
> On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote:
> > @@ -296,7 +299,9 @@ pgaio_io_call_complete_local(PgAioHandle *ioh)
> >  
> >      /*
> >       * Note that we don't save the result in ioh->distilled_result, the local
> > -     * callback's result should not ever matter to other waiters.
> > +     * callback's result should not ever matter to other waiters. However, the
> > +     * local backend does care, so we return the result as modified by local
> > +     * callbacks, which then can be passed to ioh->report_return->result.
> >       */
> >      pgaio_debug_io(DEBUG3, ioh,
> >                     "after local completion: distilled result: (status %s, id %u, error_data %d, result %d), raw_result:%d",
> 
> Should this debug message remove the word "distilled", since this commit
> solidifies distilled_result as referring to the complete_shared result?

Good point, updated.


> > Subject: [PATCH v2.12 01/28] aio: Be more paranoid about interrupts
> Ready for commit
> > Subject: [PATCH v2.12 02/28] aio: Pass result of local callbacks to
> >  ->report_return
> 
> Ready for commit w/ up to one cosmetic change:
> 

And pushed. Together with the s/pgaio_io_prep_/s/pgaio_io_start_/ renaming
we've been discussing. Btw, I figured out the origin of that, I was just
mirroring the liburing API...

Thanks again for the reviews.


> > Subject: [PATCH v2.12 03/28] aio: Add liburing dependency
> 
> Ready for commit
> 
> 
> > Subject: [PATCH v2.12 04/28] aio: Add io_method=io_uring
> 
> Ready for commit w/ open_fd.fixup

Yay.  Planning to push those soon.


> > Subject: [PATCH v2.12 05/28] aio: Implement support for reads in smgr/md/fd
> 
> Ready for commit w/ up to two cosmetic changes:

Cool.


> > +/*
> > + * AIO error reporting callback for mdstartreadv().
> > + *
> > + * Errors are encoded as follows:
> > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0
> 
> I recommend replacing "errno != 0" with either "that errno" or "errno ==
> error_data".

Applied.


> Second, the aio_internal.h comment changes discussed in
> postgr.es/m/20250325155808.f7.nmisch@google.com and earlier.

Here's my current version of that:

 * Note that the externally visible functions to start IO
 * (e.g. FileStartReadV(), via pgaio_io_start_readv()) move an IO from
 * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
 * PGAIO_HS_COMPLETED_LOCAL (at which point the handle will be reused).

Does that work?

I think I'll push that as part of the comment updates patch instead of
"Implement support for reads in smgr/md/fd", unless you see a reason to do so
differently. I'd have done it in the patch to s/prep/start/, but then it would
reference functions that don't exist yet...


> > Subject: [PATCH v2.12 06/28] aio: Add README.md explaining higher level design
> 
> Ready for commit

Cool.

Comments in it reference PGAIO_HCB_SHARED_BUFFER_READV, so I'm inclined to
reorder it to after "bufmgr: Implement AIO read support".

There's also a small change in a new patch in the series (not yet sent), due
to the changes related to emitting WARNINGs about checksum failures to the
client connection.  I think that part is fine, but...


> (This and the previous patch have three spots that would change with the
> s/prep/start/ renames.  No opinion on whether to rename before or rename
> after.)

I thought it'd be better to do the renaming first.



> > Subject: [PATCH v2.12 07/28] localbuf: Track pincount in BufferDesc as well
> 
> The plan here looks good:
> postgr.es/m/dbeeaize47y7esifdrinpa2l7cqqb67k72exvuf3appyxywjnc@7bt76mozhcy2

> > Subject: [PATCH v2.12 08/28] bufmgr: Implement AIO read support
> 
> See review here and later discussion:
> postgr.es/m/20250325022037.91.nmisch@google.com

I'm working on a version with those addressed.


> > Subject: [PATCH v2.12 09/28] bufmgr: Use AIO in StartReadBuffers()
> 
> Ready for commit after a batch of small things, all but one of which have no
> implications beyond code cosmetics.

Yay.


> I like the test coverage (by the end of the patch series).

I'm really shocked just how bad our test coverage for a lot of this is today
:(


> For anyone else following, I found "diff -w" helpful for the bufmgr.c
> changes.  That's because a key part is former WaitReadBuffers() code moving
> up an indentation level to its home in new subroutine AsyncReadBuffers().

For reviewing changes that move stuff around a lot I find this rather helpful:
git diff --color-moved --color-moved-ws=ignore-space-change

That highlights removed code differently from moved code and, due to
ignore-space-change, considers code that changed only in whitespace to be
moved.


> >      Assert(*nblocks == 1 || allow_forwarding);
> >      Assert(*nblocks > 0);
> >      Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
> > +    Assert(*nblocks == 1 || allow_forwarding);
> 
> Duplicates the assert three lines back.

Ah, it was moved into ce1a75c4fea, which I didn't notice while rebasing...


> > +        nblocks = aio_ret->result.result;
> > +
> > +        elog(DEBUG3, "partial read, will retry");
> > +
> > +    }
> > +    else if (aio_ret->result.status == PGAIO_RS_ERROR)
> > +    {
> > +        pgaio_result_report(aio_ret->result, &aio_ret->target_data, ERROR);
> > +        nblocks = 0;            /* silence compiler */
> > +    }
> >  
> >      Assert(nblocks > 0);
> >      Assert(nblocks <= MAX_IO_COMBINE_LIMIT);
> >  
> > +    operation->nblocks_done += nblocks;
> 
> I struggled somewhat from the variety of "nblocks" variables: this local
> nblocks, operation->nblocks, actual_nblocks, and *nblocks in/out parameters of
> some functions.  No one of them is clearly wrong to use the name, and some of
> these names are preexisting.  That said, if you see opportunities to push in
> the direction of more-specific names, I'd welcome it.
> 
> For example, this local variable could become add_to_nblocks_done instead.

I named it "newly_read_blocks", hope that works?


> > +        AsyncReadBuffers(operation, &nblocks);
> 
> I suggest renaming s/nblocks/ignored_nblocks_progress/ here.

Adopted.


Unfortunately I didn't see a good way to reduce the number of the other
nblocks variables, as they are all, I think, preexisting.


> > +     * If we need to wait for IO before we can get a handle, submit already
> > +     * staged IO first, so that other backends don't need to wait. There
> 
> s/already staged/already-staged/.  Normally I'd skip this as nitpicking, but I
> misread this particular sentence twice, as "submit" being the subject that
> "staged" something.  (It's still nitpicking, alas.)

Makes sense - it doesn't help that it was at a linebreak...


> >          /*
> >           * How many neighboring-on-disk blocks can we scatter-read into other
> >           * buffers at the same time?  In this case we don't wait if we see an
> > -         * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
> > +         * I/O already in progress.  We already set BM_IO_IN_PROGRESS for the
> >           * head block, so we should get on with that I/O as soon as possible.
> > -         * We'll come back to this block again, above.
> > +         *
> > +         * We'll come back to this block in the next call to
> > +         * StartReadBuffers() -> AsyncReadBuffers().
> 
> Did this mean to say "WaitReadBuffers() -> AsyncReadBuffers()"?  I'm guessing
> so, since WaitReadBuffers() is the one that loops.  It might be referring to
> read_stream_start_pending_read()'s next StartReadBuffers(), though.

I was referring to the latter, as that is the more common case (it's pretty
easy to hit if you e.g. have multiple sequential scans on the same table
going).


> I think this could just delete the last sentence.  The function header comment
> already mentions the possibility of reading a subset of the request.  This
> spot doesn't need to detail how the higher layers come back to here.

Agreed.


> > +        smgrstartreadv(ioh, operation->smgr, forknum,
> > +                       blocknum + nblocks_done,
> > +                       io_pages, io_buffers_len);
> > +        pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
> > +                                io_start, 1, *nblocks_progress * BLCKSZ);
> 
> We don't assign *nblocks_progress until lower in the function, so I think
> "io_buffers_len" should replace "*nblocks_progress" here.  (This is my only
> non-cosmetic comment on this patch.)

Good catch!



> > Subject: [PATCH v2.12 16/28] aio: Add test_aio module
> 
> > +use List::Util qw(sample);
> 
> sample() is new in 2020:
> https://metacpan.org/release/PEVANS/Scalar-List-Utils-1.68/source/Changes#L100
> 
> Hence, I'd expect some buildfarm failures.  I'd try to use shuffle(), then
> take the first N elements.

Hah. Bilal's patch was using shuffle(). I wanted to reduce the number of
iterations and first did as you suggested, and then saw that there's a nicer
way...

Done that way again...


> > +++ b/src/test/modules/test_aio/test_aio.c
> > @@ -0,0 +1,712 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * delay_execution.c
> > + *        Test module to allow delay between parsing and execution of a query.
> > + *
> > + * The delay is implemented by taking and immediately releasing a specified
> > + * advisory lock.  If another process has previously taken that lock, the
> > + * current process will be blocked until the lock is released; otherwise,
> > + * there's no effect.  This allows an isolationtester script to reliably
> > + * test behaviors where some specified action happens in another backend
> > + * between parsing and execution of any desired query.
> > + *
> > + * Copyright (c) 2020-2025, PostgreSQL Global Development Group
> > + *
> > + * IDENTIFICATION
> > + *      src/test/modules/test_aio/test_aio.c
> 
> To elaborate on my last review, the entire header comment was a copy from
> delay_execution.c.  v2.12 fixes the IDENTIFICATION, but the rest needs
> updates.

I was really too tired that day... Embarrassing.

Greetings,

Andres Freund



Re: AIO v2.5

From
Thom Brown
Date:
On Tue, 25 Mar 2025 at 01:18, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Attached v2.12, with the following changes:

I took a quick gander through this just out of curiosity (yes, I know
I'm late), and found these show-stoppers:

v2.12-0015-aio-Add-pg_aios-view.patch:

+          <literal>ERROR</literal> mean the I/O failed with an error.

s/mean/means/


v2.12-0021-bufmgr-Implement-AIO-write-support.patch

+shared buffer lock still allows some modification, e.g., for hint bits(see

s/bits\(see/bits \(see)

+buffers that can be used as the source / target for IO. A bounce buffer be

s/be/can be/

Regards

Thom



Re: AIO v2.5

From
Noah Misch
Date:
On Wed, Mar 26, 2025 at 04:33:49PM -0400, Andres Freund wrote:
> On 2025-03-25 17:19:15 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote:

> > Second, the aio_internal.h comment changes discussed in
> > postgr.es/m/20250325155808.f7.nmisch@google.com and earlier.
> 
> Here's my current version of that:
> 
>  * Note that the externally visible functions to start IO
>  * (e.g. FileStartReadV(), via pgaio_io_start_readv()) move an IO from
>  * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
>  * PGAIO_HS_COMPLETED_LOCAL (at which point the handle will be reused).
> 
> Does that work?

Yes.

> I think I'll push that as part of the comment updates patch instead of
> "Implement support for reads in smgr/md/fd", unless you see a reason to do so
> differently. I'd have done it in the patch to s/prep/start/, but then it would
> reference functions that don't exist yet...

Agreed.

> > > Subject: [PATCH v2.12 06/28] aio: Add README.md explaining higher level design
> > 
> > Ready for commit
> 
> Cool.
> 
> Comments in it reference PGAIO_HCB_SHARED_BUFFER_READV, so I'm inclined to
> reorder it until after "bufmgr: Implement AIO read support".

Agreed.

> > For example, this local variable could become add_to_nblocks_done instead.
> 
> I named it "newly_read_blocks", hope that works?

Yes.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-26 11:31:02 -0700, Noah Misch wrote:
> I reviewed everything up to and including "[PATCH v2.12 17/28] aio, bufmgr:
> Comment fixes", the last patch before write support.

Thanks!


> postgr.es/m/20250326001915.bc.nmisch@google.com covered patches 1-9, and this
> email covers patches 10-17.  All remaining review comments are minor, so I've
> marked the commitfest entry Ready for Committer.  If there's anything you'd
> like re-reviewed before you commit it, feel free to bring it to my attention.
> Thanks for getting the feature to this stage!

As part of our discussion around the WARNING stuff I did make some changes,
it'd be good if you could look at those once I send them.  While I squashed
the rest of the changes (addressing review comments) into their base commits,
I left the error-reporting related bits and pieces in fixup commits, to make
that easier.


> On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote:
> > Subject: [PATCH v2.12 10/28] aio: Basic read_stream adjustments for real AIO
> 
> > @@ -631,6 +637,9 @@ read_stream_begin_impl(int flags,
> >       * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
> >       * above.  If we had real asynchronous I/O we might need a slightly
> >       * different definition.
> > +     *
> > +     * FIXME: Not sure what different definition we would need? I guess we
> > +     * could add the READ_BUFFERS_SYNCHRONOUSLY flag automatically?
> 
> I think we don't need a different definition.  max_ios comes from
> effective_io_concurrency and similar settings.  The above comment's definition
> of max_ios=0 matches that GUC's documented behavior:
> 
>          The allowed range is
>          <literal>1</literal> to <literal>1000</literal>, or
>          <literal>0</literal> to disable issuance of asynchronous I/O requests.
> 
> I'll guess the comment meant that "advice disabled" is a no-op for AIO, so we
> could reasonably argue to have effective_io_concurrency=0 distinguish itself
> from effective_io_concurrency=1 in some different way for AIO.  Equally,
> there's no hurry to use that freedom to distinguish them.

Thomas has since provided an implementation of what he was thinking of when
writing that comment:
https://postgr.es/m/CA%2BhUKG%2B8SC2%3DAD3bC0Pn85aMXm-PE2JSFGhC%3DMFVJvNQLObZeA%40mail.gmail.com

I squashed that into "aio: Basic read_stream adjustments for real AIO".


> > Subject: [PATCH v2.12 11/28] read_stream: Introduce and use optional batchmode
> >  support
> 
> > This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and
> > uses it where appropriate.
> 
> I'd also use the new flag on the read_stream_begin_smgr_relation() call in
> RelationCopyStorageUsingBuffer().  It uses block_range_read_stream_cb, and
> other streams of that callback rightly use the flag.

Ah, yes. I had searched for all read_stream_begin_relation(), but not for
_smgr...


> > + * b) directly or indirectly start another batch pgaio_enter_batchmode()
> 
> Needs new wording from end of postgr.es/m/20250325155808.f7.nmisch@google.com

Locally it's that, just need to send out a new version...
 *
 * b) start another batch (without first exiting batchmode and re-entering
 *    before returning)


> > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems
> 
> Consider also updating this comment to stop focusing on prefetch; I think
> changing that aligns with the patch's other changes:
>
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of their
>  * ReadBuffer calls by.  Zero means "never prefetch".  This value is only used
>  * for buffers not belonging to tablespaces that have their
>  * effective_io_concurrency parameter set.
>  */
> int            effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;

Good point.  Although I suspect it might be worth adjusting this, and also the
config.sgml bit about effective_io_concurrency separately. That seems like it
might take an iteration or two.


> > -#io_combine_limit = 128kB        # usually 1-128 blocks (depends on OS)
> > +#io_combine_limit = 128kB        # usually 1-32 blocks (depends on OS)
> 
> I think "usually 1-128" remains right given:

> GUC_UNIT_BLOCKS
> #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
> #define PG_IOV_MAX Min(IOV_MAX, 128)

You're right.  I think I got this wrong when rebasing over conflicts due to
06fb5612c97.


> > -         On systems without prefetch advice support, attempting to configure
> > -         any value other than <literal>0</literal> will error out.
> > +         On systems with prefetch advice support,
> > +         <varname>effective_io_concurrency</varname> also controls the prefetch distance.
> 
> Wrap the last line.

Done.


> > Subject: [PATCH v2.12 14/28] docs: Add acronym and glossary entries for I/O
> >  and AIO
> 
> > These could use a lot more polish.
> 
> To me, it's fine as-is.

Cool.


> > I did not actually reference the new entries yet, because I don't really
> > understand what our policy for that is.
> 
> I haven't seen much of a policy on that.

That's sure what it looks like to me :/


> 
> > Subject: [PATCH v2.12 15/28] aio: Add pg_aios view
> > +retry:
> > +
> > +        /*
> > +         * There is no lock that could prevent the state of the IO from
> > +         * advancing concurrently - and we don't want to introduce one, as that would
> > +         * introduce atomics into a very common path. Instead we
> > +         *
> > +         * 1) Determine the state + generation of the IO.
> > +         *
> > +         * 2) Copy the IO to local memory.
> > +         *
> > +         * 3) Check if state or generation of the IO changed. If the state
> > +         * changed, retry, if the generation changed don't display the IO.
> > +         */
> > +
> > +        /* 1) from above */
> > +        start_generation = live_ioh->generation;
> > +        pg_read_barrier();
> 
> Based on the "really started after this function was called" and "no risk of a
> livelock here" comments below, I think "retry:"  should be here.  We don't
> want to livelock in the form of chasing ever-growing start_generation numbers.

You're right.


> > +         * scratch. There's no risk of a livelock here, as an IO has a limited
> > +         * set of states it can be in, and state changes go only in a single
> > +         * direction.
> > +         */
> > +        if (live_ioh->state != start_state)
> > +            goto retry;
> 
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>target</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       What kind of object is the I/O targeting:
> > +       <itemizedlist spacing="compact">
> > +        <listitem>
> > +         <para>
> > +          <literal>smgr</literal>, I/O on postgres relations
> 
> s/postgres relations/relations/ since SGML docs don't use the term "postgres"
> that way.

Not sure what I was even trying to express with "postgres relations" vs plain
"relations" here...

> 
> > Subject: [PATCH v2.12 16/28] aio: Add test_aio module
> 
> > --- a/src/test/modules/meson.build
> > +++ b/src/test/modules/meson.build
> > @@ -1,5 +1,6 @@
> >  # Copyright (c) 2022-2025, PostgreSQL Global Development Group
> >  
> > +subdir('test_aio')
> >  subdir('brin')
> 
> List is alphabetized; please preserve that.
> 
> > +++ b/src/test/modules/test_aio/Makefile
> > @@ -0,0 +1,26 @@
> > +# src/test/modules/delay_execution/Makefile
> 
> Update filename in comment.
>
> > +++ b/src/test/modules/test_aio/meson.build
> > @@ -0,0 +1,37 @@
> > +# Copyright (c) 2022-2024, PostgreSQL Global Development Group
> 
> s/2024/2025/


Done.


> > --- /dev/null
> > +++ b/src/test/modules/test_aio/t/001_aio.pl
> 
> s/ {4}/\t/g on this file.  It's mostly \t now, with some exceptions.

Huh. No idea how that happened.


> > +    test_inject_worker('worker', $node_worker);
> 
> What do we expect to happen if autovacuum or checkpointer runs one of these
> injection points?  I'm guessing it would at most make that process fail
> without affecting the test outcome.  If so, that's fine.

I disabled autovacuum on the relations, to prevent that.

I think checkpointer should behave as you describe, although I wonder if it
could confuse wait_for_log()-based checks - but even so, I think that would
at worst lead to a test missing a bug, in extremely rare circumstances.

I tried triggering that condition, but it's pretty hard to hit, even after
lowering checkpoint_timeout to 1s and looping in the tests.



> > +    # normal handle use
> > +    psql_like($io_method, $psql, "handle_get_release()",
> > +        qq(SELECT handle_get_release()),
> > +        qr/^$/, qr/^$/);
> > +
> > +    # should error out, API violation
> > +    psql_like($io_method, $psql, "handle_get_twice()",
> > +        qq(SELECT handle_get_release()),
> > +        qr/^$/, qr/^$/);
> 
> Last two lines are a clone of the previous psql_like() call.  I guess this
> wants to instead call handle_get_twice() and check for some stderr.

Indeed.


> > +            "read_rel_block_ll() of $tblname page",
> 
> What does "_ll" stand for?

"low level".  I added a C comment:

/*
 * A "low level" read. This does similar things to what
 * StartReadBuffers()/WaitReadBuffers() do, but provides more control (and
 * less sanity).
 */


> > +    # Issue IO without waiting for completion, then exit
> > +    $psql_a->query_safe(
> > +        qq(SELECT read_rel_block_ll('tbl_ok', 1, wait_complete=>false);));
> > +    $psql_a->reconnect_and_clear();
> > +
> > +    # Check that another backend can read the relevant block
> > +    psql_like(
> > +        $io_method,
> > +        $psql_b,
> > +        "completing read started by exited backend",
> 
> I think the exiting backend's pgaio_shutdown() completed it.

I wrote the test precisely to exercise that path, otherwise it's pretty hard
to reach. It does seem to reach the path reasonably reliably, although it's
much harder to catch that causing problems, as the IO is typically too fast.


> > +sub test_inject
> 
> This deserves a brief comment on the behaviors being tested, like the previous
> functions have.  It seems to be about short reads and hard failures like EIO.

Done.

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-26 21:20:47 +0000, Thom Brown wrote:
> I took a quick gander through this just out of curiosity (yes, I know
> I'm late), and found these show-stoppers:
> 
> v2.12-0015-aio-Add-pg_aios-view.patch:
> 
> +          <literal>ERROR</literal> mean the I/O failed with an error.
> 
> s/mean/means/
> 
> 
> v2.12-0021-bufmgr-Implement-AIO-write-support.patch
> 
> +shared buffer lock still allows some modification, e.g., for hint bits(see
> 
> s/bits\(see/bits \(see)
> 
> +buffers that can be used as the source / target for IO. A bounce buffer be
> 
> s/be/can be/

Thanks! Squashed into my local tree.

Greetings,

Andres Freund



Re: AIO v2.5

From
Thomas Munro
Date:
On Thu, Mar 27, 2025 at 10:41 AM Andres Freund <andres@anarazel.de> wrote:
> > > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems
> >
> > Consider also updating this comment to stop focusing on prefetch; I think
> > changing that aligns with the patch's other changes:
> >
> > /*
> >  * How many buffers PrefetchBuffer callers should try to stay ahead of their
> >  * ReadBuffer calls by.  Zero means "never prefetch".  This value is only used
> >  * for buffers not belonging to tablespaces that have their
> >  * effective_io_concurrency parameter set.
> >  */
> > int                   effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;
>
> Good point.  Although I suspect it might be worth adjusting this, and also the
> config.sgml bit about effective_io_concurrency separately. That seems like it
> might take an iteration or two.

+1 for rewriting that separately from this work on the code (I can
have a crack at that if you want).  For the comment, my suggestion
would be something like:

"Default limit on the level of concurrency that each I/O stream
(currently, ReadStream but in future other kinds of streams) can use.
Zero means that I/O is always performed synchronously, ie not
concurrently with query execution. This value can be overridden at the
tablespace level with the parameter of the same name. Note that
streams performing I/O not classified as single-session work respect
maintenance_io_concurrency instead."



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

Attached v2.13, with the following changes:

- Pushed a fair number of commits

  A lot of thanks goes to Noah's detailed reviews!


- As Noah pointed out, the zero_damaged_pages warning could be emitted in an
  io worker or another backend, but omitted in the backend that started the IO

  To address that:

  1) I added a new commit "aio: Add WARNING result status"
     (itself trivial)

  2) I changed buffer_readv_complete() to encode the warning/error in a more
     detailed way than before (was_zeroed, first_invalid_off, count_invalid)

     As part of that I put the encoding/decoding into a static inline; see
     the sketch after this list.

  3) Tracking the number of invalid buffers was awkward with
     buffer_readv_complete_one() returning a PgAioResult. Now it just
     reports whether it found an invalid page with an out argument.

  4) As discussed there now is a different error messages for the case of
     multiple invalid pages

     The code is a bit awkward in order to avoid code duplication; I'm curious
     whether that's seen as acceptable.  I could just duplicate the entire
     ereport() instead.

  5) The WARNING in the callback is now a LOG, as it will be sent to the
     client as a WARNING explicitly when the IO's results are processed

     I actually chose LOG_SERVER_ONLY - that seemed slightly better than just
     LOG? But not at all sure.

     There's a comment explaining this now too.


  Noah, I think this set of changes would benefit from another round of
  review. I left these changes in "squash-later: " commits, to make it easier
  to see / think about.
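
To illustrate 2), the encode/decode pair could look something like this (bit
widths and names here are illustrative, not the actual patch):

static inline uint32
buffer_readv_encode_error(bool was_zeroed, uint8 first_invalid_off,
                          uint8 count_invalid)
{
    return ((uint32) was_zeroed << 16) |
        ((uint32) first_invalid_off << 8) |
        (uint32) count_invalid;
}

static inline void
buffer_readv_decode_error(uint32 error_data, bool *was_zeroed,
                          uint8 *first_invalid_off, uint8 *count_invalid)
{
    *was_zeroed = (error_data >> 16) & 1;
    *first_invalid_off = (error_data >> 8) & 0xFF;
    *count_invalid = error_data & 0xFF;
}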


- Added a comment about the pgaio_result_report() in md_readv_complete(). I
  changed it to LOG_SERVER_ONLY as well, but I'm not at all sure about that.


- Previously the buffer completion callback checked zero_damaged_pages - but
  that's not right, the GUC hopefully is only set on a per-session basis

  I solved that by having AsyncReadBuffers() add ZERO_ON_ERROR to the flags if
  zero_damaged_pages is configured.

  Also added a comment explaining that we probably should eventually use a
  separate flag, so we can adjust the errcode etc differently.


- Explicit test for zero_damaged_pages and ZERO_ON_ERROR

  As part of that I made read_rel_block_ll() support reading multiple
  blocks. That makes it a lot easier to verify that we handle cases like a
  4-block read where 2,3 are invalid correctly.


- I removed the code that "localbuf: Track pincount in BufferDesc as well"
  added to ConditionalLockBufferForCleanup() and IsBufferCleanupOK() as discussed

  Right now the situations that the code was worried about don't exist yet, as we
  only support reads.

  I added a comment about not needing to worry about that yet to "bufmgr:
  Implement AIO read support". And then changed that comment to a FIXME in the
  write patches.


- Squashed Thomas' change to make io_concurrency=0 really not use AIO


- Lots of other review comments by Noah addressed


- Merged typo fixes by Thom Brown



TODO:


- There are more tests in test_aio that should be expanded to run for temp
  tables as well, not just normal tables


- Add an explicit test for the checksum verification in the completion callback

  There is an existing test for testing an invalid page due to page header
  verification in test_aio, but not for checksum failures.

  I think it's indirectly covered (e.g. in amcheck), but seems better to test
  it explicitly.


- Add error context callbacks for io worker and "foreign" IO completion

Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-26 21:07:40 -0400, Andres Freund wrote:
> TODO
> ...
> - Add an explicit test for the checksum verification in the completion callback
>
>   There is an existing test for testing an invalid page due to page header
>   verification in test_aio, but not for checksum failures.
>
>   I think it's indirectly covered (e.g. in amcheck), but seems better to test
>   it explicitly.

Ah, for crying out loud.  As it turns out, no, we do not have *ANY* tests for
this on the server side. Not a single one. I'm somewhat apoplectic:
data_checksums is a really complicated feature, which we just started *turning
on by default*, without a single test of the failure behaviour, when detecting
failures is the one thing the feature is supposed to do.


I now wrote some tests. And I both regret doing so (because it found problems,
which would have been apparent long ago, if the feature had come with *any*
tests, if I had gone the same way I could have just pushed stuff) and am glad
I did (because I dislike pushing broken stuff).

I have to admit, I was tempted to just ignore this issue and just not say
anything about tests for checksum failures anymore.


Problems:

1) PageIsVerifiedExtended() emits a WARNING; just like with ZERO_ON_ERROR, we
   don't want to emit it in a) io workers b) another backend if it completes
   the IO.

   This isn't hard to address, we can add PIV_LOG_LOG (or something like that)
   to emit it at a different log level and an out-parameter to trigger sending
   a warning / adjust the warning/error message we already emit once the
   issuer completes the IO.


2) With IO workers (and "foreign completors", in rare cases), the checksum
   failures would be attributed wrongly, as it reports all stats to
   MyDatabaseId

   As it turns out, this is already borked on master for shared relations,
   since pg_stat_database.checksum_failures has existed, see [1].

   This isn't too hard to fix, if we adjust the signature of
   PageIsVerifiedExtended() to pass in the database oid. But see also 3).


3) We can't pgstat_report_checksum_failure() during the completion callback,
   as it *sometimes* allocates memory

   Aside from the allocation-in-critical-section asserts, I think this is
   *extremely* unlikely to actually cause a problem in practice. But we don't
   want to rely on that, obviously.



Addressing the first two is pretty simple and/or needs to be done anyway,
since it's a currently existing bug, as discussed in [1].


Addressing 3) is not at all trivial.  Here's what I've thought of so far:


Approach I)

My first thoughts were around trying to make the relevant pgstat
infrastructure either not need to allocate memory, or handle memory
allocation failures gracefully.

Unfortunately that seems not really viable:

The most successful approach I tried was to report stats directly to the
dshash table, and only report stats if there's already an entry (which
there just about always will be, except for a very short period after stats
have been reset).

Unfortunately that fails because to access the shared memory with the stats
data we need to do dsa_get_address(), which can fail if the relevant dsm
segment wasn't already mapped in the current process (it allocates memory
in the process of mapping in the segment). There's no API to do that
without erroring out.

That aspect rules out a number of other approaches that sounded like they
could work - we e.g. could increase the refcount of the relevant pgstat
entry before issuing IO, ensuring that it's around by the time we need to
report. But that wouldn't get around the issue of needing to map in the dsm
segment.


Approach II)

Don't report the error in the completion callback.  The obvious place would be
to do it where we'll raise the warning/error in the issuing process.  The
big disadvantage is that that could lead to under-counting checksum
errors:

a) A read stream does 2+ concurrent reads for the same relation, and more than
one encounters checksum errors. When processing the results for the first
failed read, we raise an error and thus won't process the results of the
second+ reads with errors.

b) A read is started asynchronously, but before the backend gets around to
processing the result of the IO, it errors out during that other work
(possibly due to a cancellation). Because the backend never looked at the
results of the IOs, the checksum errors don't get accounted for.

b) doesn't overly bother me, but a) seems problematic.


Approach III)

Accumulate checksum errors in two backend local variables (one for database
specific errors, one for errors on shared relations), which will be flushed by
the backend that issued IO during the next pgstat_report_start().

Two disadvantages:

- Accumulation of errors will be delayed until the next
  pgstat_report_start(). That seems acceptable, after all we do so for a lot
  of other stats.

- We need to register a local callback for shared buffer reads, which don't
  need them today. That's a small bit of added overhead. It's a shame to do
  so for counters that approximately never get incremented.
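
A minimal sketch of this approach (variable and function names are made up;
only pgstat_report_checksum_failures_in_db() is an existing function):

    /* backend-local pending counts */
    static int	pending_checksum_failures_db;		/* MyDatabaseId */
    static int	pending_checksum_failures_shared;	/* shared relations */

    /* called from the local completion callback, safe in a critical section */
    static void
    remember_checksum_failure(bool shared_rel)
    {
        if (shared_rel)
            pending_checksum_failures_shared++;
        else
            pending_checksum_failures_db++;
    }

    /* called outside any critical section, e.g. from pgstat_report_start() */
    static void
    flush_checksum_failures(void)
    {
        if (pending_checksum_failures_db != 0)
            pgstat_report_checksum_failures_in_db(MyDatabaseId,
                                                  pending_checksum_failures_db);
        if (pending_checksum_failures_shared != 0)
            pgstat_report_checksum_failures_in_db(InvalidOid,
                                                  pending_checksum_failures_shared);
        pending_checksum_failures_db = 0;
        pending_checksum_failures_shared = 0;
    }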


Approach IV):

Embrace piercing abstractions / generic infrastructure and put two atomic
variables (one for shared relations, one for the backend's database) in some
backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and
update that in the completion callback. Flush that variable to the shared
stats in pgstat_report_start() or such.

This would avoid the need for the local completion callback, and would also
allow introducing a function to see the number of "unflushed" checksum errors. It
also doesn't require transporting the number of errors between the shared
callback and the local callback - but we might want to have that for the
error message anyway.
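
In (hypothetical) code form - these names are made up, not from any patch:

    /* per-backend fields in plain shared memory, e.g. in PGPROC: */
    pg_atomic_uint32	checksum_failures_db;		/* backend's database */
    pg_atomic_uint32	checksum_failures_shared;	/* shared relations */

    /* in the completion callback, possibly running in an io worker: */
    pg_atomic_fetch_add_u32(&definer_proc->checksum_failures_shared, 1);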

I wish the new-to-18 pgstat_backend() were designed in a way that made this
possible nicely. But unfortunately it puts the backend-specific data in
the dshash table / dynamic shared memory, rather than in a MaxBackends +
NUM_AUX sized array in plain shared memory. As explained in I), we can't
rely on having the entire array mapped.  Leaving the issue from this email
aside, that also adds a fair bit of overhead to other cases.



Does anybody have better ideas?


I think II), III) and IV) are all relatively simple to implement.

The most complicated bit is that a bit of bit-squeezing is necessary to fit
the number of checksum errors (in addition to the number of otherwise invalid
pages) into the available space for error data. It's doable. We could also
just increase the size of PgAioResult.

I've implemented II), but I'm not sure the disadvantages are acceptable.


Greetings,

Andres Freund

[1] https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg%40e4gt7npsohuz



Re: AIO v2.5

From
Noah Misch
Date:
On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> I now wrote some tests. And I both regret doing so (because it found problems,
> which would have been apparent long ago, if the feature had come with *any*
> tests, if I had gone the same way I could have just pushed stuff) and am glad
> I did (because I dislike pushing broken stuff).
> 
> I have to admit, I was tempted to just ignore this issue and just not say
> anything about tests for checksum failures anymore.

I don't blame you.

> 3) We can't pgstat_report_checksum_failure() during the completion callback,
>    as it *sometimes* allocates memory
> 
>    Aside from the allocation-in-critical-section asserts, I think this is
>    *extremely* unlikely to actually cause a problem in practice. But we don't
>    want to rely on that, obviously.

> Addressing 3) is not at all trivial.  Here's what I've thought of so far:
> 
> 
> Approach I)

> Unfortunately that fails because to access the shared memory with the stats
> data we need to do dsa_get_address()


> Approach II)
> 
> Don't report the error in the completion callback.  The obvious place would be
> to do it where we'll raise the warning/error in the issuing process.  The
> big disadvantage is that that could lead to under-counting checksum
> errors:
> 
> a) A read stream does 2+ concurrent reads for the same relation, and more than
> one encounters checksum errors. When processing the results for the first
> failed read, we raise an error and thus won't process the results of the
> second+ reads with errors.
> 
> b) A read is started asynchronously, but before the backend gets around to
> processing the result of the IO, it errors out during that other work
> (possibly due to a cancellation). Because the backend never looked at the
> results of the IOs, the checksum errors don't get accounted for.
> 
> b) doesn't overly bother me, but a) seems problematic.

While neither are great, I could live with both.  I guess I'm optimistic that
clusters experiencing checksum failures won't lose enough reports to these
loss sources to make the difference in whether monitoring catches them.  In
other words, a cluster will report N failures without these losses and N-K
after these losses.  If N is large enough for relevant monitoring to flag the
cluster appropriately, N-K will also be large enough.


> Approach III)
> 
> Accumulate checksum errors in two backend local variables (one for database
> specific errors, one for errors on shared relations), which will be flushed by
> the backend that issued IO during the next pgstat_report_start().
> 
> Two disadvantages:
> 
> - Accumulation of errors will be delayed until the next
>   pgstat_report_start(). That seems acceptable, after all we do so for a lot
>   of other stats.

Yep, acceptable.

> - We need to register a local callback for shared buffer reads, which don't
>   need them today. That's a small bit of added overhead. It's a shame to do
>   so for counters that approximately never get incremented.

Fair concern.  An idea is to let the complete_shared callback change the
callback list associated with the IO, so it could change
PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW.  The
latter would differ from the former only in having the extra local callback.
Could that help?  I think the only overhead is using more PGAIO_HCB numbers.
We currently reserve 256 (uint8), but one could imagine trying to pack into
fewer bits.  That said, this wouldn't paint us into a corner.  We could change
the approach later.

pgaio_io_call_complete_local() starts a critical section.  Is that a problem
for this approach?


> Approach IV):
> 
> Embrace piercing abstractions / generic infrastructure and put two atomic
> variables (one for shared relations, one for the backend's database) in some
> backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and
> update that in the completion callback. Flush that variable to the shared
> stats in pgstat_report_start() or such.

I could live with that.  I feel better about Approach III currently, though.
Overall, I'm feeling best about III long-term, but II may be the right
tactical choice.


> Does anybody have better ideas?

I think no, but here are some ideas I tossed around:

- Like your Approach III, but have the completing process store the count
  locally and flush it, instead of the staging process doing so.  Would need
  more than 2 slots, but we could have a fixed number of slots and just
  discard any reports that arrive with all slots full.  Reporting checksum
  failures in, say, 8 databases in quick succession probably tells the DBA
  there's "enough corruption to start worrying".  Missing the 9th database
  would be okay.

- Pre-warm the memory allocations and DSAs we could possibly need, so we can
  report those stats in critical sections, from the completing process.  Bad
  since there's an entry per database, hence no reasonable limit on how much
  memory a process might need to pre-warm.  We could even end up completing an
  IO for a database that didn't exist on entry to our critical section.

- Skip the checksum pgstats if we're completing in a critical section.
  Doesn't work since we _always_ make a critical section to complete I/O.

This email isn't as well-baked as I like, but the alternative was delaying it
24-48h depending on how other duties go over those hours.  My v2.13 review is
still in-progress, too.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-27 20:22:23 -0700, Noah Misch wrote:
> On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> > Don't report the error in the completion callback.  The obvious place would be
> > to do it where we'll raise the warning/error in the issuing process.  The
> > big disadvantage is that that could lead to under-counting checksum
> > errors:
> >
> > a) A read stream does 2+ concurrent reads for the same relation, and more than
> > one encounters checksum errors. When processing the results for the first
> > failed read, we raise an error and thus won't process the results of the
> > second+ reads with errors.
> >
> > b) A read is started asynchronously, but before the backend gets around to
> > processing the result of the IO, it errors out during that other work
> > (possibly due to a cancellation). Because the backend never looked at the
> > results of the IOs, the checksum errors don't get accounted for.
> >
> > b) doesn't overly bother me, but a) seems problematic.
>
> While neither are great, I could live with both.  I guess I'm optimistic that
> clusters experiencing checksum failures won't lose enough reports to these
> loss sources to make the difference in whether monitoring catches them.  In
> other words, a cluster will report N failures without these losses and N-K
> after these losses.  If N is large enough for relevant monitoring to flag the
> cluster appropriately, N-K will also be large enough.

That's true.


> > Approach III)
> >
> > Accumulate checksum errors in two backend local variables (one for database
> > specific errors, one for errors on shared relations), which will be flushed by
> > the backend that issued IO during the next pgstat_report_start().

FWIW, two variables turn out to not quite suffice - as I realized later, we
actually can issue IO on behalf of arbitrary databases, due to
ScanSourceDatabasePgClass() and RelationCopyStorageUsingBuffer().

That unfortunately makes it much harder to be able to guarantee that the
completor of an IO has the DSM segment for a pg_stat_database stats entry
mapped.


> > - We need to register a local callback for shared buffer reads, which don't
> >   need them today. That's a small bit of added overhead. It's a shame to do
> >   so for counters that approximately never get incremented.
>
> Fair concern.  An idea is to let the complete_shared callback change the
> callback list associated with the IO, so it could change
> PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW.  The
> latter would differ from the former only in having the extra local callback.
> Could that help?  I think the only overhead is using more PGAIO_HCB numbers.

I think changing the callback could work - I'll do some measurements in a
coffee or two, but I suspect the overhead is not worth being too worried about
for now.  There's a different aspect that worries me slightly more, see
further down.


> We currently reserve 256 (uint8), but one could imagine trying to pack into
> fewer bits.

Yea, my current local worktree reduces it to 6 bits for now, to make space for
keeping track of the number of checksum failures in error data (as part of
that it adds defines for the bit widths).  If that becomes an issue we can make
PgAioResult wider, but I suspect that won't be too soon.

One simplification that we could make is to only ever report one checksum
failure for each IO, even if N buffers failed - after all that's what HEAD
does (by virtue of throwing an error after the first). Then we'd not track the
number of checksum errors.


> That said, this wouldn't paint us into a corner.  We could change the
> approach later.

Indeed - I think we mainly need something that works for now.  I think medium
term the right fix here would be to make sure that the stats can be accounted
for with just an atomic increment somewhere.

We've had several discussions around having an in-memory datastructure for
every relation that currently has a buffer in shared_buffers, to store e.g. the
relation length and the sync requests. If we get that, I think Thomas has a
prototype, we can accumulate the number of checksum errors in there, for
example. It'd also allow us to address the biggest blocker for writes, namely
that RememberSyncRequest() could fail *after* IO completion.


> pgaio_io_call_complete_local() starts a critical section.  Is that a problem
> for this approach?

I think we can make it not a problem - I added a
pgstat_prepare_report_checksum_failure(dboid) that ensures the calling backend
has a reference to the relevant shared memory stats entry. If we make the rule
that it has to be called *before* starting buffered IO (i.e. in
AsyncReadBuffers()), we can be sure the stats reference still exists by the
time local completion runs (as there isn't a way to have the stats entry dropped
without dropping the database, which isn't possible while a) the database
still is connected to, for normal IO b) the CREATE DATABASE is still running).

Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a
local hashtable. That's more expensive than an indirect function call
(i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if
not we can apply a mini-cache for the current database, which is surely the
only thing that ever matters for performance.
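
In rough code form the rule would be (a sketch; exact signatures may differ
from what's in my tree):

    /* in AsyncReadBuffers(), before the IO can complete: */
    pgstat_prepare_report_checksum_failure(dboid);	/* may allocate */
    smgrstartreadv(ioh, operation->smgr, forknum, blocknum,
                   io_pages, io_buffers_len);

    /* later, in the local completion callback, in a critical section: */
    pgstat_report_checksum_failures_in_db(dboid, checkfail_count);
    /* no allocation needed here, thanks to the prepare call above */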


> > Approach IV):
> >
> > Embrace piercing abstractions / generic infrastructure and put two atomic
> > variables (one for shared relations, one for the backend's database) in some
> > backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and
> > update that in the completion callback. Flush that variable to the shared
> > stats in pgstat_report_start() or such.
>
> I could live with that.  I feel better about Approach III currently, though.
> Overall, I'm feeling best about III long-term, but II may be the right
> tactical choice.

I think it's easy to change between these approaches. Both require that we
encode the number of checksum failures in the result, which is where most of
the complexity lies (but still a rather surmountable amount of complexity).


> I think no, but here are some ideas I tossed around:
>
> - Like your Approach III, but have the completing process store the count
>   locally and flush it, instead of the staging process doing so.  Would need
>   more than 2 slots, but we could have a fixed number of slots and just
>   discard any reports that arrive with all slots full.  Reporting checksum
>   failures in, say, 8 databases in quick succession probably tells the DBA
>   there's "enough corruption to start worrying".  Missing the 9th database
>   would be okay.

Yea. I think that'd be an ok fallback, but if we can make III' work, it'd be
nicer.


> - Pre-warm the memory allocations and DSAs we could possibly need, so we can
>   report those stats in critical sections, from the completing process.  Bad
>   since there's an entry per database, hence no reasonable limit on how much
>   memory a process might need to pre-warm.  We could even end up completing an
>   IO for a database that didn't exist on entry to our critical section.

I experimented with this one - it works surprisingly well, because for IO
workers we could just do the pre-warming outside of the critical section, and
it's *exceedingly* rare that any other completor would ever need to complete
IO for another database than the current / a shared relation.

But it does leave a nasty edge case, that we'd just have to accept. I guess we
could just make it so that in that case stats aren't reported.

But it seems pretty ugly.


> This email isn't as well-baked as I like, but the alternative was delaying it
> 24-48h depending on how other duties go over those hours.  My v2.13 review is
> still in-progress, too.

It's appreciated!

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-28 08:54:42 -0400, Andres Freund wrote:
> One simplification that we could make is to only ever report one checksum
> failure for each IO, even if N buffers failed - after all that's what HEAD
> does (by virtue of throwing an error after the first). Then we'd not track the
> number of checksum errors.

Just after sending, I thought of another variation: Report the number of
*invalid* pages (which we already track) as checksum errors, if there was at
least one checksum error.

It's imo rather weird that we track checksum errors but we don't track invalid
page headers, despite the latter being an even worse indication of something
having gone wrong...

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-28 08:54:42 -0400, Andres Freund wrote:
> On 2025-03-27 20:22:23 -0700, Noah Misch wrote:
> > On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> > > - We need to register a local callback for shared buffer reads, which don't
> > >   need them today. That's a small bit of added overhead. It's a shame to do
> > >   so for counters that approximately never get incremented.
> >
> > Fair concern.  An idea is to let the complete_shared callback change the
> > callback list associated with the IO, so it could change
> > PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW.  The
> > latter would differ from the former only in having the extra local callback.
> > Could that help?  I think the only overhead is using more PGAIO_HCB numbers.
>
> I think changing the callback could work - I'll do some measurements in a
> coffee or two, but I suspect the overhead is not worth being too worried about
> for now.  There's a different aspect that worries me slightly more, see
> further down.
> ...
> Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a
> local hashtable. That's more expensive than an indirect function call
> (i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if
> not we can apply a mini-cache for the current database, which is surely the
> only thing that ever matters for performance.

I tried it and at ~30GB/s of read IO, with checksums disabled, I can't see a
difference of either having the unnecessary complete_local callback or having
the lookup in pgstat_prepare_report_checksum_failure(). In a profile there are
a few hits inside pgstat_get_entry_ref(), but not enough to matter.

Hence I think this isn't worth worrying about, at least for now. I think we
have far bigger fish to fry at this point than such a small performance
difference.

I've adjusted the comment above TRACE_POSTGRESQL_BUFFER_READ_DONE() to not
mention the overhead. I'm still inclined to think that it's better to call it
in the shared completion callback.


I also fixed support and added tests for ignore_checksum_failure, which also
needs to be determined at the start of the IO, not at completion.  Once
more there were no tests, of course.


I spent the last 6 hours on the stupid error/warning messages around this,
somewhat ridiculous.

The number of combinations is annoyingly large. It's e.g. plausible to use
ignore_checksum_failure=on and zero_damaged_pages=on at the same time for
recovery. The same buffer could both be ignored *and* zeroed. Or somebody
could use ignore_checksum_failure=on but then still encounter a page that is
invalid.

But I finally got to a point where the code ends up readable, without undue
duplication.  It would, leaving some nasty hack aside, require an
errhint_internal() - but I can't imagine a reason against introducing that,
given we have it for errmsg and errdetail.

Here's the relevant code:

    /*
     * Treat a read that had both zeroed buffers *and* ignored checksums as a
     * special case, it's too irregular to be emitted the same way as the other
     * cases.
     */
    if (zeroed_any && ignored_any)
    {
        Assert(zeroed_any && ignored_any);
        Assert(nblocks > 1);    /* same block can't be both zeroed and ignored */
        Assert(result.status != PGAIO_RS_ERROR);
        affected_count = zeroed_or_error_count;

        ereport(elevel,
                errcode(ERRCODE_DATA_CORRUPTED),
                errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s",
                       affected_count, checkfail_count, first, last, rpath.str),
                affected_count > 1 ?
                errdetail("Block %u held first zeroed page.",
                          first + first_off) : 0,
                errhint("See server log for details about the other %u invalid blocks.",
                        affected_count + checkfail_count - 1));
        return;
    }

    /*
     * The other messages are highly repetitive. To avoid duplicating a long
     * and complicated ereport(), gather the translated format strings
     * separately and then do one common ereport.
     */
    if (result.status == PGAIO_RS_ERROR)
    {
        Assert(!zeroed_any);    /* can't have invalid pages when zeroing them */
        affected_count = zeroed_or_error_count;
        msg_one = _("invalid page in block %u of relation %s");
        msg_mult = _("%u invalid pages among blocks %u..%u of relation %s");
        det_mult = _("Block %u held first invalid page.");
        hint_mult = _("See server log for the other %u invalid blocks.");
    }
    else if (zeroed_any && !ignored_any)
    {
        affected_count = zeroed_or_error_count;
        msg_one = _("invalid page in block %u of relation %s; zeroing out page");
        msg_mult = _("zeroing out %u invalid pages among blocks %u..%u of relation %s");
        det_mult = _("Block %u held first zeroed page.");
        hint_mult = _("See server log for the other %u zeroed blocks.");
    }
    else if (!zeroed_any && ignored_any)
    {
        affected_count = checkfail_count;
        msg_one = _("ignoring checksum failure in block %u of relation %s");
        msg_mult = _("ignoring %u checksum failures among blocks %u..%u of relation %s");
        det_mult = _("Block %u held first ignored page.");
        hint_mult = _("See server log for the other %u ignored blocks.");
    }
    else
        pg_unreachable();

    ereport(elevel,
            errcode(ERRCODE_DATA_CORRUPTED),
            affected_count == 1 ?
            errmsg_internal(msg_one, first + first_off, rpath.str) :
            errmsg_internal(msg_mult, affected_count, first, last, rpath.str),
            affected_count > 1 ? errdetail_internal(det_mult, first + first_off) : 0,
            affected_count > 1 ? errhint_internal(hint_mult, affected_count - 1) : 0);

Does that approach make sense?

What do you think about using
  "zeroing invalid page in block %u of relation %s"
instead of
  "invalid page in block %u of relation %s; zeroing out page"

I thought about instead translating "ignoring", "ignored", "zeroing",
"zeroed", etc separately, but I have doubts about how well that would actually
translate.

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote:
> The number of combinations is annoyingly large. It's e.g. plausible to use
> ignore_checksum_failure=on and zero_damaged_pages=on at the same time for
> recovery.

That's intricate indeed.

> But I finally got to a point where the code ends up readable, without undue
> duplication.  It would, leaving some nasty hack aside, require an
> errhint_internal() - but I can't imagine a reason against introducing that,
> given we have it for errmsg and errdetail.

Introducing that is fine.

> Here's the relevant code:
> 
>     /*
>      * Treat a read that had both zeroed buffers *and* ignored checksums as a
>      * special case, it's too irregular to be emitted the same way as the other
>      * cases.
>      */
>     if (zeroed_any && ignored_any)
>     {
>         Assert(zeroed_any && ignored_any);
>         Assert(nblocks > 1);    /* same block can't be both zeroed and ignored */
>         Assert(result.status != PGAIO_RS_ERROR);
>         affected_count = zeroed_or_error_count;
> 
>         ereport(elevel,
>                 errcode(ERRCODE_DATA_CORRUPTED),
>                 errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s",
>                        affected_count, checkfail_count, first, last, rpath.str),

Translation stumbles on this one, because each of the first two %u is
plural-sensitive.  I'd do one of:

- Call ereport() twice, once for zeroed pages and once for ignored checksums.
  Since elevel <= ERROR here, that doesn't lose the second call.

- s/pages/page(s)/ like msgid "There are %d other session(s) and %d prepared
  transaction(s) using the database."

- Something more like the style of VACUUM VERBOSE, e.g. "INTRO_TEXT: %u
  zeroed, %u checksums ignored".  I've not written INTRO_TEXT, and this
  doesn't really resolve pluralization.  Probably don't use this option.

>                 affected_count > 1 ?
>                 errdetail("Block %u held first zeroed page.",
>                           first + first_off) : 0,
>                 errhint("See server log for details about the other %u invalid blocks.",
>                         affected_count + checkfail_count - 1));
>         return;
>     }
> 
>     /*
>      * The other messages are highly repetitive. To avoid duplicating a long
>      * and complicated ereport(), gather the translated format strings
>      * separately and then do one common ereport.
>      */
>     if (result.status == PGAIO_RS_ERROR)
>     {
>         Assert(!zeroed_any);    /* can't have invalid pages when zeroing them */
>         affected_count = zeroed_or_error_count;
>         msg_one = _("invalid page in block %u of relation %s");
>         msg_mult = _("%u invalid pages among blocks %u..%u of relation %s");
>         det_mult = _("Block %u held first invalid page.");
>         hint_mult = _("See server log for the other %u invalid blocks.");

For each hint_mult, we would usually use ngettext() instead of _().  (Would be
errhint_plural() if not separated from its ereport().)  Alternatively,
s/blocks/block(s)/ is fine.
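
E.g. (sketch):

    hint_mult = ngettext("See server log for the other %u invalid block.",
                         "See server log for the other %u invalid blocks.",
                         affected_count - 1);
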

>     }
>     else if (zeroed_any && !ignored_any)
>     {
>         affected_count = zeroed_or_error_count;
>         msg_one = _("invalid page in block %u of relation %s; zeroing out page");
>         msg_mult = _("zeroing out %u invalid pages among blocks %u..%u of relation %s");
>         det_mult = _("Block %u held first zeroed page.");
>         hint_mult = _("See server log for the other %u zeroed blocks.");
>     }
>     else if (!zeroed_any && ignored_any)
>     {
>         affected_count = checkfail_count;
>         msg_one = _("ignoring checksum failure in block %u of relation %s");
>         msg_mult = _("ignoring %u checksum failures among blocks %u..%u of relation %s");
>         det_mult = _("Block %u held first ignored page.");
>         hint_mult = _("See server log for the other %u ignored blocks.");
>     }
>     else
>         pg_unreachable();
> 
>     ereport(elevel,
>             errcode(ERRCODE_DATA_CORRUPTED),
>             affected_count == 1 ?
>             errmsg_internal(msg_one, first + first_off, rpath.str) :
>             errmsg_internal(msg_mult, affected_count, first, last, rpath.str),
>             affected_count > 1 ? errdetail_internal(det_mult, first + first_off) : 0,
>             affected_count > 1 ? errhint_internal(hint_mult, affected_count - 1) : 0);
> 
> Does that approach make sense?

Yes.

> What do you think about using
>   "zeroing invalid page in block %u of relation %s"
> instead of
>   "invalid page in block %u of relation %s; zeroing out page"

I like the replacement.  It moves the important part to the front, and it's
shorter.

> I thought about instead translating "ignoring", "ignored", "zeroing",
> "zeroed", etc separately, but I have doubts about how well that would actually
> translate.

Agreed, I wouldn't have high hopes for that.  An approach like that would
probably need messages that separate the independently-translated part
grammatically, e.g.:

  /* last %s is translation of "ignore" or "zero-fill" */
  "invalid page in block %u of relation %s; resolved by method \"%s\""

(Again, I'm not recommending that.)



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-29 06:41:43 -0700, Noah Misch wrote:
> On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote:
> > But I finally got to a point where the code ends up readable, without undue
> > duplication.  It would, leaving some nasty hack aside, require an
> > errhint_internal() - but I can't imagine a reason against introducing that,
> > given we have it for errmsg and errdetail.
>
> Introducing that is fine.

Cool.
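
For reference, what I have in mind is a near-copy of errhint(), just without
translating the format string (a sketch, modulo elog.c boilerplate):

    int
    errhint_internal(const char *fmt,...)
    {
        ErrorData  *edata = &errordata[errordata_stack_depth];
        MemoryContext oldcontext;

        recursion_depth++;
        CHECK_STACK_DEPTH();
        oldcontext = MemoryContextSwitchTo(edata->assoc_context);

        /* last argument false: don't translate the caller's format string */
        EVALUATE_MESSAGE(edata->domain, hint, false, false);

        MemoryContextSwitchTo(oldcontext);
        recursion_depth--;
        return 0;
    }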


> > Here's the relevant code:
> >
> >     /*
> >      * Treat a read that had both zeroed buffers *and* ignored checksums as a
> >      * special case, it's too irregular to be emitted the same way as the other
> >      * cases.
> >      */
> >     if (zeroed_any && ignored_any)
> >     {
> >         Assert(zeroed_any && ignored_any);
> >         Assert(nblocks > 1);    /* same block can't be both zeroed and ignored */
> >         Assert(result.status != PGAIO_RS_ERROR);
> >         affected_count = zeroed_or_error_count;
> >
> >         ereport(elevel,
> >                 errcode(ERRCODE_DATA_CORRUPTED),
> >                 errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s",
> >                        affected_count, checkfail_count, first, last, rpath.str),
>
> Translation stumbles on this one, because each of the first two %u is
> plural-sensitive.

Fair. We don't generally seem to have been very careful around this in related
code, but there's no reason to just continue down that road when it's easy.

E.g. in md.c we unconditionally output "could not read blocks %u..%u in file \"%s\": %m"
even if it's just a single block...


> I'd do one of:
>
> - Call ereport() twice, once for zeroed pages and once for ignored checksums.
>   Since elevel <= ERROR here, that doesn't lose the second call.
>
> - s/pages/page(s)/ like msgid "There are %d other session(s) and %d prepared
>   transaction(s) using the database."

I think I like this better.
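
I.e. the combined message would become something like:

    ereport(elevel,
            errcode(ERRCODE_DATA_CORRUPTED),
            errmsg("zeroing %u page(s) and ignoring %u checksum failure(s) among blocks %u..%u of relation %s",
                   affected_count, checkfail_count, first, last, rpath.str),
            affected_count > 1 ?
            errdetail("Block %u held first zeroed page.",
                      first + first_off) : 0,
            errhint("See server log for details about the other %u invalid block(s).",
                    affected_count + checkfail_count - 1));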


> >     /*
> >      * The other messages are highly repetitive. To avoid duplicating a long
> >      * and complicated ereport(), gather the translated format strings
> >      * separately and then do one common ereport.
> >      */
> >     if (result.status == PGAIO_RS_ERROR)
> >     {
> >         Assert(!zeroed_any);    /* can't have invalid pages when zeroing them */
> >         affected_count = zeroed_or_error_count;
> >         msg_one = _("invalid page in block %u of relation %s");
> >         msg_mult = _("%u invalid pages among blocks %u..%u of relation %s");
> >         det_mult = _("Block %u held first invalid page.");
> >         hint_mult = _("See server log for the other %u invalid blocks.");
>
> For each hint_mult, we would usually use ngettext() instead of _().  (Would be
> errhint_plural() if not separated from its ereport().)  Alternatively,
> s/blocks/block(s)/ is fine.

I will go with the (s) here as well; this stuff is too rare to be worth having
pluralized messages imo.


> > Does that approach make sense?
>
> Yes.
> ...
> I like the replacement.  It moves the important part to the front, and it's
> shorter.

Cool, I squashed them with the relevant changes now.


Attached is v2.14:

Changes:

- Added a commit to fix stats attribution of checksum errors; previously the
  checksum errors detected in bufmgr.c/storage.c were always attributed to the
  current database

  This would have caused bigger issues with worker based IO, as IO workers
  aren't connected to databases.


- Added a commit to allow checksum error reports to happen in critical
  sections. For that pgstat_prepare_report_checksum_failure() has to be
  called in the same backend, ahead of the critical section in which the
  failure is reported.

  Other suggestions for the name welcome.


- Expanded on the idea in 13 to track the number of invalid buffers in the
  IO's result, by also tracking checksum errors. Combined with the previous
  point, this fixes the issue of an assert during checksum failure reporting
  outlined in:

  https://postgr.es/m/5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga%40ptwkinxqo3az

  This required being a bit more careful with space in the error data, to be
  able to squeeze in the number of checksum failures.


- The issuer's ignore_checksum_failure setting needs to be used when completing
  IO, not the completor's, particularly when using io_method=worker

  For that the access to ignore_checksum_failure had to be moved from
  PageIsVerified() to its callers.

  I added tests for ignore_checksum_failure, including its interplay with
  zero_damaged_pages.


- Deduplicated the error reporting in buffer_readv_report() somewhat by only
  having the selection of format strings be done in branches. I think this
  ends up a lot more readable than the huge ereport before.


- Added details about the changed error/warning logging to "bufmgr: Implement
  AIO read support"'s commit message.


- polished the commit to add PGAIO_RS_WARNING a bit, adding defines for the
  bit-widths of the PgAioResult portions and static asserts to verify them


- Squashed the changes that I had kept separately in v2.13; it was too hard to
  keep them separate while doing the above changes.

  I did make the encoding function cast the arguments to uint32 before
  shifting. I think that's implied by the C integer promotion rules, but it
  seemed fishy enough to not want to believe in that.

  I also added a StaticAssertStmt() to ensure we are only using the available
  bit space (see the sketch below this list).


- Added a test for a) checksum errors being detected b) CREATE DATABASE
  ... STRATEGY WAL_LOG

  The latter is interesting because it also provides test coverage for doing
  IO for objects in other databases.


- Removed an obsoleted inclusion of pg_trace.h in localbuf.c
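
To illustrate the bit-space accounting mentioned above (the 6 bit id width is
what I mentioned earlier; the other widths here are placeholders, not
necessarily the committed values):

    #define PGAIO_RESULT_ID_BITS		6
    #define PGAIO_RESULT_STATUS_BITS		3
    #define PGAIO_RESULT_ERROR_DATA_BITS	23

    StaticAssertStmt(PGAIO_RESULT_ID_BITS +
                     PGAIO_RESULT_STATUS_BITS +
                     PGAIO_RESULT_ERROR_DATA_BITS == 32,
                     "PgAioResult bitfield widths don't add up");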


TODO:

- I think the tests around zero_damaged_pages, ignore_checksum_failure should
  be expanded a bit more. There are two FIXMEs in the tests about that.

  At the moment there are two different test functions for zero_damaged_pages
  and ignore_checksum_failure; I'm not sure how good that is.

  I wanted to get this version out because I have to run some errands;
  otherwise I'd have implemented them first...


Next steps:

- push the checksums stats fix


- unless somebody sees a reason to not use LOG_SERVER_ONLY in
  "aio: Implement support for reads in smgr/md/fd", push that

  Besides that the only change since Noah's last review of that commit is an
  added comment.


- push acronym, glossary change


- push pg_aios view (depends a tiny bit on the smgr/md/fd change above)


- push "localbuf: Track pincount in BufferDesc as well" - I think I addressed
  all of Noah's review feedback


- address the above TODO



Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-29 10:48:10 -0400, Andres Freund wrote:
> Attached is v2.14:

FWIW, there was a last-minute change in the test that fails in one task on CI,
due to reading across the smaller segment size configured for one of the
runs. Doesn't quite seem worth posting a new version for.


> - push the checksums stats fix

Done.


> - unless somebody sees a reason to not use LOG_SERVER_ONLY in
>   "aio: Implement support for reads in smgr/md/fd", push that
>
>   Besides that the only change since Noah's last review of that commit is an
>   added comment.

Also done.  If we want to change the log level later, it's easy to do so.

I made some small changes since the version I had posted:
- I found one dangling reference to mdread() instead of mdreadv()
- I had accidentally squashed the fix to Noah's review comment about a comment
  above md_readv_report() to the wrong commit (smgr/md/fd.c write support)
- PGAIO_HCB_MD_WRITEV was added in "smgr/md/fd.c read support" instead of
  "smgr/md/fd.c write support"


> - push pg_aios view (depends a tiny bit on the smgr/md/fd change above)

I think I found an issue with this one - as it stands the view was viewable by
everyone. While it doesn't provide a *lot* of insight, it still seems a bit
too much for an unprivileged user to learn what part of a relation any other
user is currently reading.

There'd be two different ways to address that:
1) revoke view & function from public, grant to a limited role (presumably
   pg_read_all_stats)
2) copy pg_stat_activity's approach of using something like

   #define HAS_PGSTAT_PERMISSIONS(role)     (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || \
                                             has_privs_of_role(GetUserId(), role))

   on a per-IO basis.


Greetings,

Andres Freund



Re: AIO v2.5

From
Melanie Plageman
Date:
On Sat, Mar 29, 2025 at 2:25 PM Andres Freund <andres@anarazel.de> wrote:
>
> I think I found an issue with this one - as it stands the view was viewable by
> everyone. While it doesn't provide a *lot* of insight, it still seems a bit
> too much for an unprivileged user to learn what part of a relation any other
> user is currently reading.
>
> There'd be two different ways to address that:
> 1) revoke view & function from public, grant to a limited role (presumably
>    pg_read_all_stats)
> 2) copy pg_stat_activity's approach of using something like
>
>    #define HAS_PGSTAT_PERMISSIONS(role)  (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || \
>                                           has_privs_of_role(GetUserId(), role))
>
>    on a per-IO basis.

Is it easier to later change it to be more restrictive or less? If it
is easier to later lock it down more, then go with 2, otherwise go
with 1?

- Melanie



Re: AIO v2.5

From
Noah Misch
Date:
Flushing half-baked review comments before going offline for a few hours:

On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote:
> Attached v2.13, with the following changes:

>   5) The WARNING in the callback is now a LOG, as it will be sent to the
>      client as a WARNING explicitly when the IO's results are processed
> 
>      I actually chose LOG_SERVER_ONLY - that seemed slightly better than just
>      LOG? But not at all sure.

LOG_SERVER_ONLY and its synonym COMMERR look to be used for:

- ProcessLogMemoryContextInterrupt()
- messages before successful authentication
- protocol sync loss, where we'd fail to send a client message
- client already gone

The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT.
If a client has decided to set client_min_messages that high, the client might
be interested in the fact that it got side-tracked completing someone else's
IO.  On the other hand, almost none of those sidetrack events will produce
messages.  The main argument I'd envision for LOG_SERVER_ONLY is that we
consider the message content sensitive, but I don't see the message content as
materially sensitive.

Since you committed LOG_SERVER_ONLY, let's keep that decision.  My last draft
review discouraged it, but it doesn't matter.  pgaio_result_report() should
assert elevel != LOG to avoid future divergence.
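
I.e. add, near the top of pgaio_result_report(), something like:

    /* sketch: catch callers that should have used LOG_SERVER_ONLY */
    Assert(elevel != LOG);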

> - Previously the buffer completion callback checked zero_damaged_pages - but
>   that's not right, the GUC hopefully is only set on a per-session basis

Good catch.  I've now audited the complete_shared callbacks for other variable
references and actions not acceptable there.  I found nothing beyond what you
found by v2.14.

On Sat, Mar 29, 2025 at 10:48:10AM -0400, Andres Freund wrote:
> On 2025-03-29 06:41:43 -0700, Noah Misch wrote:
> > On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote:

> Subject: [PATCH v2.14 01/29] Fix mis-attribution of checksum failure stats to
>  the wrong database

I've skipped reviewing this patch, since it's already committed.  If it needs
post-commit review, let me know.

> Subject: [PATCH v2.14 02/29] aio: Implement support for reads in smgr/md/fd

> +        /*
> +         * Immediately log a message about the IO error, but only to the
> +         * server log. The reason to do so immediately is that the originator
> +         * might not process the query result immediately (because it is busy
> +         * doing another part of query processing) or at all (e.g. if it was
> +         * cancelled or errored out due to another IO also failing).  The
> +         * issuer of the IO will emit an ERROR when processing the IO's

s/issuer/definer/ please, to avoid proliferating synonyms.  Likewise two other
places in the patches.

> +/*
> + * smgrstartreadv() -- asynchronous version of smgrreadv()
> + *
> + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
> + * `ioh` all parameters are the same as smgrreadv().

I would add a comment starting with:

  Compared to smgrreadv(), more responsibilities fall on layers above smgr.
  Higher layers handle partial reads.  smgr will ereport(LOG_SERVER_ONLY) some
  problems, but higher layers are responsible for pgaio_result_report() to
  mirror that news to the user and (for ERROR) abort the (sub)transaction.

md_readv_complete() comment "issuer of the IO will emit an ERROR" says some of
that, but someone adding a smgrstartreadv() call is less likely to find it
there.

I say "comment starting with", because I think there's a remaining decision
about who owns the zeroing currently tied to smgrreadv().  An audit of
mdreadv() vs. AIO counterparts found this part of mdreadv():

            if (nbytes == 0)
            {
                /*
                 * We are at or past EOF, or we read a partial block at EOF.
                 * Normally this is an error; upper levels should never try to
                 * read a nonexistent block.  However, if zero_damaged_pages
                 * is ON or we are InRecovery, we should instead return zeroes
                 * without complaining.  This allows, for example, the case of
                 * trying to update a block that was later truncated away.
                 */
                if (zero_damaged_pages || InRecovery)
                {

I didn't write a test to prove its absence, but I'm not finding such code in
the AIO path.  I wondered if we could just Assert(!InRecovery), but adding
that to md_readv_complete() failed 001_stream_rep.pl with this stack:

ExceptionalCondition at assert.c:52
md_readv_complete at md.c:2043
pgaio_io_call_complete_shared at aio_callback.c:258
pgaio_io_process_completion at aio.c:515
pgaio_io_perform_synchronously at aio_io.c:148
pgaio_io_stage at aio.c:453
pgaio_io_start_readv at aio_io.c:87
FileStartReadV at fd.c:2243
mdstartreadv at md.c:1005
smgrstartreadv at smgr.c:757
AsyncReadBuffers at bufmgr.c:1938
StartReadBuffersImpl at bufmgr.c:1422
StartReadBuffer at bufmgr.c:1515
ReadBuffer_common at bufmgr.c:1246
ReadBufferExtended at bufmgr.c:818
vm_readbuf at visibilitymap.c:584
visibilitymap_pin at visibilitymap.c:203
heap_xlog_insert at heapam_xlog.c:450
heap_redo at heapam_xlog.c:1195
ApplyWalRecord at xlogrecovery.c:1995
PerformWalRecovery at xlogrecovery.c:1825
StartupXLOG at xlog.c:5895

If this is a real problem, fix options may include:

- Implement the InRecovery zeroing for real.
- Make the InRecovery case somehow use real mdreadv(), not
  pgaio_io_perform_synchronously() to use AIO APIs with synchronous AIO.  I'll
  guess this is harder than the previous option, though.

> Subject: [PATCH v2.14 04/29] aio: Add pg_aios view

> +        /*
> +         * There is no lock that could prevent the state of the IO to advance
> +         * concurrently - and we don't want to introduce one, as that would
> +         * introduce atomics into a very common path. Instead we
> +         *
> +         * 1) Determine the state + generation of the IO.
> +         *
> +         * 2) Copy the IO to local memory.
> +         *
> +         * 3) Check if state or generation of the IO changed. If the state
> +         * changed, retry, if the generation changed don't display the IO.
> +         */
> +
> +        /* 1) from above */
> +        start_generation = live_ioh->generation;
> +        pg_read_barrier();

I think "retry:" needs to be here, above start_state assignment.  Otherwise,
the "live_ioh->state != start_state" test will keep seeing a state mismatch.

> +        start_state = live_ioh->state;
> +
> +retry:
> +        if (start_state == PGAIO_HS_IDLE)
> +            continue;
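
I.e. (sketch of the suggested order):

        /* 1) from above */
        start_generation = live_ioh->generation;
        pg_read_barrier();

retry:
        start_state = live_ioh->state;

        if (start_state == PGAIO_HS_IDLE)
            continue;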


> Subject: [PATCH v2.14 05/29] localbuf: Track pincount in BufferDesc as well
> Subject: [PATCH v2.14 07/29] aio: Add WARNING result status
> Subject: [PATCH v2.14 08/29] pgstat: Allow checksum errors to be reported in
>  critical sections
> Subject: [PATCH v2.14 09/29] Add errhint_internal()

Ready for commit


> Subject: [PATCH v2.14 10/29] bufmgr: Implement AIO read support

> Buffer reads executed this infrastructure will report invalid page / checksum
> errors / warnings differently than before:

s/this/through this/

> +    *zeroed_or_error_count = rem_error & ((1 << 7) - 1);
> +    rem_error >>= 7;

These raw "7" are good places to use your new #define values.  Likewise in
buffer_readv_encode_error().

> + *   that was errored or zerored or, if no errors/zeroes, the first ignored

s/zerored/zeroed/

> +     * enough. If there is an error, the error is the integeresting offset,

typo "integeresting"

> +/*
> + * We need a backend-local completion callback for shared buffers, to be able
> + * to report checksum errors correctly. Unfortunately that can only safely
> + * happen if the reporting backend has previously called

Missing end of sentence.

> @@ -144,8 +144,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
>       */

There's an outdated comment ending here:

    /*
     * Throw a WARNING if the checksum fails, but only after we've checked for
     * the all-zeroes case.
     */

>      if (checksum_failure)
>      {
> -        if ((flags & PIV_LOG_WARNING) != 0)
> -            ereport(WARNING,
> +        if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0)
> +            ereport(flags & PIV_LOG_WARNING ? WARNING : LOG,


> Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control
>  ignore_checksum_failure
> Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers()
> Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design
> Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO
> Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode
>  support
> Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as
>  wait time
> Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems

Ready for commit


> Subject: [PATCH v2.14 18/29] aio: Add test_aio module

I didn't yet re-review the v2.13 or 2.14 changes to this one.  That's still in
my queue.  One thing I noticed anyway:

> +# Tests using injection points. Mostly to exercise had IO errors that are

s/had/hard/


On Sat, Mar 29, 2025 at 02:25:15PM -0400, Andres Freund wrote:
> On 2025-03-29 10:48:10 -0400, Andres Freund wrote:
> > Attached is v2.14:

> > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above)
> 
> I think I found an issue with this one - as it stands the view was viewable by
> everyone. While it doesn't provide a *lot* of insight, it still seems a bit
> too much for an unprivileged user to learn what part of a relation any other
> user is currently reading.
> 
> There'd be two different ways to address that:
> 1) revoke view & function from public, grant to a limited role (presumably
>    pg_read_all_stats)
> 2) copy pg_stat_activity's approach of using something like
> 
>    #define HAS_PGSTAT_PERMISSIONS(role)     (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || \
>                                              has_privs_of_role(GetUserId(), role))
> 
>    on a per-IO basis.

No strong opinion.  I'm not really worried about any of this information
leaking.  Nothing in pg_aios comes close to the sensitivity of
pg_stat_activity.query.  pg_stat_activity is mighty cautious, hiding even
stuff like wait_event_type that I wouldn't worry about.  Hence, another valid
choice is (3) change nothing.

Meanwhile, I see substantially less need to monitor your own IOs than to
monitor your own pg_stat_activity rows, and even your own IOs potentially
reveal things happening in other sessions, e.g. evicting buffers that others
read and you never read.  So restrictions wouldn't be too painful, and (1)
arguably helps privacy more than (2).

I'd likely go with (1) today.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-29 14:29:29 -0700, Noah Misch wrote:
> Flushing half-baked review comments before going offline for a few hours:
>
> On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote:
> > Attached v2.13, with the following changes:
>
> >   5) The WARNING in the callback is now a LOG, as it will be sent to the
> >      client as a WARNING explicitly when the IO's results are processed
> >
> >      I actually chose LOG_SERVER_ONLY - that seemed slightly better than just
> >      LOG? But not at all sure.
>
> LOG_SERVER_ONLY and its synonym COMMERR look to be used for:
>
> - ProcessLogMemoryContextInterrupt()
> - messages before successful authentication
> - protocol sync loss, where we'd fail to send a client message
> - client already gone
>
> The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT.
> If a client has decided to set client_min_messages that high, the client might
> be interested in the fact that it got side-tracked completing someone else's
> IO.  On the other hand, almost none of those sidetrack events will produce
> messages.  The main argument I'd envision for LOG_SERVER_ONLY is that we
> consider the message content sensitive, but I don't see the message content as
> materially sensitive.

I don't think it's sensitive - it just seems a bit silly to send the same
thing to the client twice, rather than routing the extra copy to the server
log only. I'm happy to change it to
LOG if you prefer. Your points below mean some comments need to be updated in
smgr/md.c anyway.


> > - Previously the buffer completion callback checked zero_damaged_pages - but
> >   that's not right, the GUC hopefully is only set on a per-session basis
>
> Good catch.  I've now audited the complete_shared callbacks for other variable
> references and actions not acceptable there.  I found nothing beyond what you
> found by v2.14.

I didn't find anything else either.



> > Subject: [PATCH v2.14 02/29] aio: Implement support for reads in smgr/md/fd
>
> > +        /*
> > +         * Immediately log a message about the IO error, but only to the
> > +         * server log. The reason to do so immediately is that the originator
> > +         * might not process the query result immediately (because it is busy
> > +         * doing another part of query processing) or at all (e.g. if it was
> > +         * cancelled or errored out due to another IO also failing).  The
> > +         * issuer of the IO will emit an ERROR when processing the IO's
>
> s/issuer/definer/ please, to avoid proliferating synonyms.  Likewise two other
> places in the patches.

Hm. Will do. Doesn't bother me personally, but happy to change it.


> > +/*
> > + * smgrstartreadv() -- asynchronous version of smgrreadv()
> > + *
> > + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
> > + * `ioh` all parameters are the same as smgrreadv().
>
> I would add a comment starting with:
>
>   Compared to smgrreadv(), more responsibilities fall on layers above smgr.
>   Higher layers handle partial reads.  smgr will ereport(LOG_SERVER_ONLY) some
>   problems, but higher layers are responsible for pgaio_result_report() to
>   mirror that news to the user and (for ERROR) abort the (sub)transaction.

Hm - if we document that in all the smgrstart* we'd end up with something like
that in a lot of places - but OTOH, this is the first one so far...



> I say "comment starting with", because I think there's a remaining decision
> about who owns the zeroing currently tied to smgrreadv().  An audit of
> mdreadv() vs. AIO counterparts found this part of mdreadv():
>
>             if (nbytes == 0)
>             {
>                 /*
>                  * We are at or past EOF, or we read a partial block at EOF.
>                  * Normally this is an error; upper levels should never try to
>                  * read a nonexistent block.  However, if zero_damaged_pages
>                  * is ON or we are InRecovery, we should instead return zeroes
>                  * without complaining.  This allows, for example, the case of
>                  * trying to update a block that was later truncated away.
>                  */
>                 if (zero_damaged_pages || InRecovery)
>                 {
>
> I didn't write a test to prove its absence, but I'm not finding such code in
> the AIO path.

Yes, there is no such codepath.

A while ago I had started a thread about whether the above codepath is
necessary, as the whole idea of putting a buffer into shared buffers that
doesn't exist on-disk is *extremely* ill conceived: it puts a buffer into
shared buffers that somehow wasn't readable on disk, *without* creating it on
disk. The problem is that mdnblocks() wouldn't know about that
only-in-memory part of the relation, and thus most parts of PG won't consider
that buffer to exist - it'd just be skipped in sequential scans etc., but then
it'd trigger errors when extending the relation ("unexpected data beyond
EOF"), etc.

I had planned to put in an error into mdreadv() at the time, but somehow lost
track of that - I kind of mentally put this issue into the "done" category :(

Based on my research, the InRecovery path is not reachable (most recovery
buffer reads go through XLogReadBufferExtended(), which extends files at that
layer, and the exceptions like VM/FSM have explicit code to extend the
relation, c.f. vm_readbuf()). It actually looks to me like it *never* was
reachable: the XLogReadBufferExtended() predecessors, back to the initial
addition of WAL to PG, had such an extension path, as did vm/fsm.

The zero_damaged_pages path hasn't reliably worked for a long time afaict,
because _mdfd_getseg() doesn't know about it (note we're not passing
EXTENSION_CREATE). So unless the buffer is just after the physical end of the
last segment, you'll just get an error at that point. To my knowledge we
haven't heard related complaints.
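
For reference, mdreadv() looks the segment up along these lines (paraphrased
from md.c, from memory), i.e. without any awareness of zero_damaged_pages:

    v = _mdfd_getseg(reln, forknum, blocknum, false,
                     EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);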

It makes some sense to have zero_damaged_pages for actually existing pages
reached from sequential / tid / COPY on the table level - after all that's the
only way you might get data out during data recovery. But those would never
reach this logic, as such scans rely on mdnblocks().  For index -> heap
fetches the option seems mainly dangerous, because that'll just create random
buffers in shared buffers that, as explained above, won't then be reached by
other scans.  And index scans during data recovery are not a good idea in the
first place, all that one should do in that situation is to dump out the data.


At the very least we need to add a comment about this though.  If we want to
implement it, it'd be easy enough, but that logic is so insane that I think we
shouldn't do it unless there is some *VERY* clear evidence that we need it.



> I wondered if we could just Assert(!InRecovery), but adding that to
> md_readv_complete() failed 001_stream_rep.pl with this stack:

I'd expect that to fail in a lot of paths:
XLogReadBufferExtended() ->
ReadBufferWithoutRelcache() ->
ReadBuffer_common() ->
StartReadBuffer()


> > Subject: [PATCH v2.14 04/29] aio: Add pg_aios view
>
> > +        /*
> > +         * There is no lock that could prevent the state of the IO to advance
> > +         * concurrently - and we don't want to introduce one, as that would
> > +         * introduce atomics into a very common path. Instead we
> > +         *
> > +         * 1) Determine the state + generation of the IO.
> > +         *
> > +         * 2) Copy the IO to local memory.
> > +         *
> > +         * 3) Check if state or generation of the IO changed. If the state
> > +         * changed, retry, if the generation changed don't display the IO.
> > +         */
> > +
> > +        /* 1) from above */
> > +        start_generation = live_ioh->generation;
> > +        pg_read_barrier();
>
> I think "retry:" needs to be here, above start_state assignment.  Otherwise,
> the "live_ioh->state != start_state" test will keep seeing a state mismatch.

Damn, you're right.
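
I.e. the display loop needs to look roughly like this (sketch only, local
names as in the quoted comment):

    retry:

        /* 1) determine state + generation of the IO */
        start_generation = live_ioh->generation;
        pg_read_barrier();
        start_state = live_ioh->state;

        /* 2) copy the IO to local memory */
        memcpy(&ioh_copy, live_ioh, sizeof(ioh_copy));
        pg_read_barrier();

        /* 3) if the IO was reused, skip it; if just the state advanced, retry */
        if (live_ioh->generation != start_generation)
            continue;
        if (live_ioh->state != start_state)
            goto retry;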


> > Subject: [PATCH v2.14 05/29] localbuf: Track pincount in BufferDesc as well
> > Subject: [PATCH v2.14 07/29] aio: Add WARNING result status
> > Subject: [PATCH v2.14 08/29] pgstat: Allow checksum errors to be reported in
> >  critical sections
> > Subject: [PATCH v2.14 09/29] Add errhint_internal()
>
> Ready for commit

Cool


> > Subject: [PATCH v2.14 10/29] bufmgr: Implement AIO read support
>
> > Buffer reads executed this infrastructure will report invalid page / checksum
> > errors / warnings differently than before:
>
> s/this/through this/

Fixed.


> > +    *zeroed_or_error_count = rem_error & ((1 << 7) - 1);
> > +    rem_error >>= 7;
>
> These raw "7" are good places to use your new #define values.  Likewise in
> buffer_readv_encode_error().

Which define value are you thinking of here? I don't think any of the ones I
added apply?  But I think you're right it'd be good to have some define for
it, at least locally.


> > + *   that was errored or zerored or, if no errors/zeroes, the first ignored
>
> s/zerored/zeroed/
>
> > +     * enough. If there is an error, the error is the integeresting offset,
>
> typo "integeresting"

:(. Fixed.


> > +/*
> > + * We need a backend-local completion callback for shared buffers, to be able
> > + * to report checksum errors correctly. Unfortunately that can only safely
> > + * happen if the reporting backend has previously called
>
> Missing end of sentence.

It's now:

/*
 * We need a backend-local completion callback for shared buffers, to be able
 * to report checksum errors correctly. Unfortunately that can only safely
 * happen if the reporting backend has previously called
 * pgstat_prepare_report_checksum_failure(), which we can only guarantee in
 * the backend that started the IO. Hence this callback.
 */


> > @@ -144,8 +144,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
> >       */
>
> There's an outdated comment ending here:
>
>     /*
>      * Throw a WARNING if the checksum fails, but only after we've checked for
>      * the all-zeroes case.
>      */

Updated to:
    /*
     * Throw a WARNING/LOG, as instructed by PIV_LOG_*, if the checksum fails,
     * but only after we've checked for the all-zeroes case.
     */

I found one more, the newly added comment about checksum_failure_p was still
talking about ignore_checksum_failure, but it should now be IGNORE_CHECKSUM_FAILURE.


> > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control
> >  ignore_checksum_failure
> > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers()
> > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design
> > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO
> > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode
> >  support
> > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as
> >  wait time
> > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems
>
> Ready for commit

Cool.



> > Subject: [PATCH v2.14 18/29] aio: Add test_aio module
>
> I didn't yet re-review the v2.13 or 2.14 changes to this one.  That's still in
> my queue.

That's good - I think some of the tests need to expand a bit more. Since
that's at the end of the dependency chain...


> One thing I noticed anyway:

> > +# Tests using injection points. Mostly to exercise had IO errors that are
>
> s/had/hard/

Fixed.


> On Sat, Mar 29, 2025 at 02:25:15PM -0400, Andres Freund wrote:
> > On 2025-03-29 10:48:10 -0400, Andres Freund wrote:
> > > Attached is v2.14:
>
> > > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above)
> >
> > I think I found an issue with this one - as it stands the view was viewable by
> > everyone. While it doesn't provide a *lot* of insight, it still seems a bit
> > too much for an unprivileged user to learn what part of a relation any other
> > user is currently reading.
> >
> > There'd be two different ways to address that:
> > 1) revoke view & function from public, grant to a limited role (presumably
> >    pg_read_all_stats)
> > 2) copy pg_stat_activity's approach of using something like
> >
> >    #define HAS_PGSTAT_PERMISSIONS(role)     (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
> >
> >    on a per-IO basis.
>
> No strong opinion.

Same.


> I'm not really worried about any of this information leaking.  Nothing in
> pg_aios comes close to the sensitivity of pg_stat_activity.query.
> pg_stat_activity is mighty cautious, hiding even stuff like wait_event_type
> that I wouldn't worry about.  Hence, another valid choice is (3) change
> nothing.

I'd also be on board with that.


> Meanwhile, I see substantially less need to monitor your own IOs than to
> monitor your own pg_stat_activity rows, and even your own IOs potentially
> reveal things happening in other sessions, e.g. evicting buffers that others
> read and you never read.  So restrictions wouldn't be too painful, and (1)
> arguably helps privacy more than (2).
>
> I'd likely go with (1) today.

Sounds good to me. It also has the advantage of being much easier to test than
(2).

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Sat, Mar 29, 2025 at 08:39:54PM -0400, Andres Freund wrote:
> On 2025-03-29 14:29:29 -0700, Noah Misch wrote:
> > On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote:

> > The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT.
> > If a client has decided to set client_min_messages that high, the client might
> > be interested in the fact that it got side-tracked completing someone else's
> > IO.  On the other hand, almost none of those sidetrack events will produce
> > messages.  The main argument I'd envision for LOG_SERVER_ONLY is that we
> > consider the message content sensitive, but I don't see the message content as
> > materially sensitive.
> 
> I don't think it's sensitive - it just seems a bit sillier to send the same
> thing to the client twice than to the server log.

Ah, that adds weight to the benefit of LOG_SERVER_ONLY.

> I'm happy to change it to
> LOG if you prefer. Your points below mean some comments need to be updated in
> smgr/md.c anyway.

Nah.

> > > +/*
> > > + * smgrstartreadv() -- asynchronous version of smgrreadv()
> > > + *
> > > + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
> > > + * `ioh` all parameters are the same as smgrreadv().
> >
> > I would add a comment starting with:
> >
> >   Compared to smgrreadv(), more responsibilities fall on layers above smgr.
> >   Higher layers handle partial reads.  smgr will ereport(LOG_SERVER_ONLY) some
> >   problems, but higher layers are responsible for pgaio_result_report() to
> >   mirror that news to the user and (for ERROR) abort the (sub)transaction.
> 
> Hm - if we document that in all the smgrstart* we'd end up with something like
> that in a lot of places - but OTOH, this is the first one so far...

Alternatively, to avoid duplication:

  See $PLACE for the tasks that the caller's layer owns, in contrast to smgr
  owning them for smgrreadv().

> > I say "comment starting with", because I think there's a remaining decision
> > about who owns the zeroing currently tied to smgrreadv().  An audit of
> > mdreadv() vs. AIO counterparts found this part of mdreadv():
> >
> >             if (nbytes == 0)
> >             {
> >                 /*
> >                  * We are at or past EOF, or we read a partial block at EOF.
> >                  * Normally this is an error; upper levels should never try to
> >                  * read a nonexistent block.  However, if zero_damaged_pages
> >                  * is ON or we are InRecovery, we should instead return zeroes
> >                  * without complaining.  This allows, for example, the case of
> >                  * trying to update a block that was later truncated away.
> >                  */
> >                 if (zero_damaged_pages || InRecovery)
> >                 {
> >
> > I didn't write a test to prove its absence, but I'm not finding such code in
> > the AIO path.
> 
> Yes, there is no such codepath
> 
> A while ago I had started a thread about whether the above codepath is
> necessary

postgr.es/m/3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd
which I had forgotten completely.

I've redone your audit, and I agree the InRecovery case is dead code.
In check-world, InRecovery reaches mdstartreadv() and mdreadv() only via
XLogReadBufferExtended(), vm_readbuf(), and fsm_readbuf().

The zero_damaged_pages case entails more of a judgment call about whether or
not its rule breakage eclipses its utility.  Fortunately, a disappointed
zero_damaged_pages user could work around that code's absence by stopping the
server and using "dd" to extend the segment with zeros.

> I had planned to put in an error into mdreadv() at the time, but somehow lost
> track of that - I kind of mentally put this issue into the "done" category :(

> At the very least we need to add a comment about this though.

I'm fine with either of:

1. Replace that mdreadv() code with an error.

2. Update comment in mdreadv() that we're phasing out this code due to the
   InRecovery case's dead-code status and the zero_damaged_pages problems; AIO
   intentionally doesn't replicate it.  Maybe add Assert(false).

I'd do (2) for v18, then do (1) first thing in v19 development.
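
I.e. for (2), roughly the following in mdreadv() (sketch; comment wording
illustrative only):

    if (nbytes == 0)
    {
        /*
         * The InRecovery case is dead code and the zero_damaged_pages
         * case hasn't worked reliably for a long time; AIO intentionally
         * doesn't replicate this path.  To be replaced with an error in
         * v19.
         */
        Assert(false);

        if (zero_damaged_pages || InRecovery)
        ...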

> > > +    *zeroed_or_error_count = rem_error & ((1 << 7) - 1);
> > > +    rem_error >>= 7;
> >
> > These raw "7" are good places to use your new #define values.  Likewise in
> > buffer_readv_encode_error().
> 
> Which define value are you thinking of here? I don't think any of the ones I
> added apply?  But I think you're right it'd be good to have some define for
> it, at least locally.

It was just my imagination.  Withdrawn.

> It's now:
> 
> /*
>  * We need a backend-local completion callback for shared buffers, to be able
>  * to report checksum errors correctly. Unfortunately that can only safely
>  * happen if the reporting backend has previously called
>  * pgstat_prepare_report_checksum_failure(), which we can only guarantee in
>  * the backend that started the IO. Hence this callback.
>  */

Sounds good.

> Updated to:
>     /*
>      * Throw a WARNING/LOG, as instructed by PIV_LOG_*, if the checksum fails,
>      * but only after we've checked for the all-zeroes case.
>      */
> 
> I found one more, the newly added comment about checksum_failure_p was still
> talking about ignore_checksum_failure, but it should now be IGNORE_CHECKSUM_FAILURE.

That works.

> > > Subject: [PATCH v2.14 18/29] aio: Add test_aio module
> >
> > I didn't yet re-review the v2.13 or 2.14 changes to this one.  That's still in
> > my queue.
> 
> That's good - I think some of the tests need to expand a bit more. Since
> that's at the end of the dependency chain...

Okay, I'll delay on re-reviewing that one.  When it's a good time, please put
the CF entry back in Needs Review.  The patches before it are all ready for
commit once the above points of this mail are addressed.



Re: AIO v2.5

From
Melanie Plageman
Date:
On Tue, Mar 25, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
>
> > Another thought on complete_shared running on other backends: I wonder if we
> > should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
> > other process" or similar, so people wonder less about how "SELECT FROM a" led
> > to a log message about IO on table "b".
>
> I've been wondering about that as well, and yes, we probably should.
>
> I'd add the pid of the backend that started the IO to the message - although
> need to check whether we're trying to keep PIDs of other processes from
> unprivileged users.
>
> I think we probably should add a similar, but not equivalent, context in io
> workers. Maybe "I/O worker executing I/O on behalf of process %d".

I think this has not yet been done. Attached patch is an attempt to
add error context for IO completions by another backend when using
io_uring and IO processing in general by an IO worker. It seems to
work -- that is, running the test_aio tests, you can see the context
in the logs.
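
The basic shape is the usual error context pattern, roughly like this (sketch
with made-up names; the actual patch differs in the details):

    static void
    aio_completion_error_callback(void *arg)
    {
        int32       owner_pid = *(int32 *) arg;

        errcontext("completing I/O of other process %d", owner_pid);
    }

    ...
    ErrorContextCallback errcallback;

    errcallback.callback = aio_completion_error_callback;
    errcallback.arg = &owner_pid;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* run the completion callbacks ... */

    error_context_stack = errcallback.previous;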

I'm not certain that I did this in the way you both were envisioning, though.

- Melanie

Attachment

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-27 10:52:10 +1300, Thomas Munro wrote:
> On Thu, Mar 27, 2025 at 10:41 AM Andres Freund <andres@anarazel.de> wrote:
> > > > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems
> > >
> > > Consider also updating this comment to stop focusing on prefetch; I think
> > > changing that aligns with the patch's other changes:
> > >
> > > /*
> > >  * How many buffers PrefetchBuffer callers should try to stay ahead of their
> > >  * ReadBuffer calls by.  Zero means "never prefetch".  This value is only used
> > >  * for buffers not belonging to tablespaces that have their
> > >  * effective_io_concurrency parameter set.
> > >  */
> > > int                   effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;
> >
> > Good point.  Although I suspect it might be worth adjusting this, and also the
> > config.sgml bit about effective_io_concurrency separately. That seems like it
> > might take an iteration or two.
> 
> +1 for rewriting that separately from this work on the code (I can
> have a crack at that if you want).

You taking a crack at that would be appreciated!

> For the comment, my suggestion would be something like:
> 
> "Default limit on the level of concurrency that each I/O stream
> (currently, ReadStream but in future other kinds of streams) can use.
> Zero means that I/O is always performed synchronously, ie not
> concurrently with query execution. This value can be overridden at the
> tablespace level with the parameter of the same name. Note that
> streams performing I/O not classified as single-session work respect
> maintenance_io_concurrency instead."

Generally sounds good. I do wonder if the last sentence could be made a bit
simpler; it took me a few seconds to parse "not classified as single-session".

Maybe "classified as performing work for multiple sessions respect
maintenance_io_concurrency instead."?

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-29 14:29:29 -0700, Noah Misch wrote:
> > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control
> >  ignore_checksum_failure
> > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers()
> > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO
> > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode
> >  support
> > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as
> >  wait time
> > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems
>
> Ready for commit

I've pushed most of these after some very light further editing.  Yay.  Thanks
a lot for all the reviews!

So far the buildfarm hasn't been complaining, but it's early days.


I didn't yet push

> > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design

because I want to integrate some language that could be referenced by
smgrstartreadv() (and more in the future), as we have been talking about.


Tomorrow I'll work on sending out a new version with the remaining patches. I
plan for that version to have:

- pg_aios view with the security checks (already done, trivial)

- a commit with updated language for smgrstartreadv(), probably referencing
  aio's README

- a change to mdreadv() around zero_damaged_pages, as Noah and I have been
  discussing

- updated tests, with the FIXMEs etc addressed

- a reviewed version of the errcontext callback patch that Melanie sent
  earlier today


Todo beyond that:

- The comment and docs updates we've been discussing in
  https://postgr.es/m/5fc6r4smanncsmqw7ib6s3uj6eoiqoioxbd5ibmhtqimcggtoe%40fyrok2gozsoq

- I think a good long search through the docs is in order; there probably are
  other things that should be updated, beyond concrete references to
  effective_io_concurrency etc.

- Whether we should do something, and if so what, about BAS_BULKREAD for
  18. Thomas may have some thoughts / code.

- Whether there's anything committable around Jelte's RLIMIT_NOFILE changes.


Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-03-30 19:46:57 -0400, Andres Freund wrote:
> I didn't yet push
>
> > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design
>
> because I want to integrate some language that could be referenced by
> smgrstartreadv() (and more in the future), as we have been talking about.

I tried a bunch of variations and none of them seemed great. So I ended up
with a lightly polished version of your suggested comment above
smgrstartreadv().  We can later see about generalizing it.


> Tomorrow I'll work on sending out a new version with the remaining patches. I
> plan for that version to have:

Got a bit distracted with $work stuff today, but here we go.


The updated version has all of that:

> - pg_aios view with the security checks (already done, trivial)
>
> - a commit with updated language for smgrstartreadv(), probably referencing
>   aio's README
>
> - a change to mdreadv() around zero_damaged_pages, as Noah and I have been
>   discussing
>
> - updated tests, with the FIXMEs etc addressed
>
> - a reviewed version of the errcontext callback patch that Melanie sent
>   earlier today

Although I didn't actually find anything in that, other than one unnecessary
change.

Greetings,

Andres Freund

Attachment

Re: AIO v2.5

From
Aleksander Alekseev
Date:
Hi Andres,

> > I didn't yet push
> >
> > > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design

I have several notes about 0003 / README.md:

1. I noticed that the use of "Postgres" and "postgres" is inconsistent.

2.

```
+pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);
```

Perhaps I'm a bit late here, but the name of the function is weird. It
registers a single callback, but the name is "_callbacks".

3. The use of "AIO Handle" and "AioHandle" is inconsistent.

4.

- pgaio_io_register_callbacks
- pgaio_io_set_handle_data_32

If I understand correctly one can register multiple callbacks per one
AIO Handle (right? ...). However I don't see an obvious way to match
handle data to the given callback. If all the callbacks get the same
handle data... well it's weird IMO, but we should explicitly say that.
On top of that we should probably explain in which order the callbacks
are going to be executed. If there are any guarantees in this respect
of course.

5. pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1)

Perhaps it's worth mentioning if `buffer` can be freed after the call
i.e. if it's stored by value or by reference. It's also worth
clarifying if the maximum number of buffers is limited or not.

6. It is worth clarifying if AIO allows reads and writes or only reads
at the moment. Perhaps it's also worth explicitly saying that AIO is
for disk IO only, not for network IO.

7. It is worth clarifying how many times the callbacks are called when
reading multiple buffers. Is it guaranteed that the callbacks are
called once, or does it somehow depend on the implementation, and also
what happens in the case where I/O succeeds partially.

8. I believe we should tell a bit more about the context in which the
callbacks are called. Particularly what happens to the memory contexts
and if I can allocate/free memory, can I throw ERRORs, can I create
new AIO Handles, is it expected that the callback should return
quickly, are the signals masked while the callback is executed, can I
use sockets, is it guaranteed that the callback is going to be called
in the same process (I guess so, but the text doesn't explicitly
promise that), etc.

9.

```
+Because acquisition of an IO handle
+[must always succeed](#io-can-be-started-in-critical-sections)
+and the number of AIO Handles
+[has to be limited](#state-for-aio-needs-to-live-in-shared-memory)
+AIO handles can be reused as soon as they have completed.
```

What does pgaio_io_acquire() do if we are out of AIO Handles? Since it
always succeeds I guess it should block the caller in this case, but
IMO we should say this explicitly.

10.

> > because I want to integrate some language that could be referenced by
> > smgrstartreadv() (and more in the future), as we have been talking about.
>
> I tried a bunch of variations and none of them seemed great. So I ended up
> with a lightly polished version of your suggested comment above
> smgrstartreadv().  We can later see about generalizing it.

IMO the problem here is that the README doesn't show the code that does IO
per se, and thus doesn't give the full picture of how AIO should be
used. Perhaps instead of referencing smgrstartreadv() it would be
better to provide a simple but complete example, one that opens a
binary file and reads 512 bytes from it at a given offset, for
instance.

-- 
Best regards,
Aleksander Alekseev



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-01 14:56:17 +0300, Aleksander Alekseev wrote:
> Hi Andres,
>
> > > I didn't yet push
> > >
> > > > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design
>
> I have several notes about 0003 / README.md:
>
> 1. I noticed that the use of "Postgres" and "postgres" is inconsistent.

:/

It probably should be consistent, but I have no idea which of the spellings we
should go for. Either looks ugly in some contexts.


> 2.
>
> ```
> +pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);
> ```
>
> Perhaps I'm a bit late here, but the name of the function is weird. It
> registers a single callback, but the name is "_callbacks".

It registers a *set* of callbacks (stage, complete_shared, complete_local,
report_error) on the handle.
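
I.e. what gets registered is roughly this struct (paraphrasing aio.h; the
function pointer typedef names may not be exact):

    typedef struct PgAioHandleCallbacks
    {
        PgAioHandleCallbackStage stage;
        PgAioHandleCallbackComplete complete_shared;
        PgAioHandleCallbackComplete complete_local;
        PgAioHandleCallbackReportError report_error;
    } PgAioHandleCallbacks;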



> 3. The use of "AIO Handle" and "AioHandle" is inconsistent.

This seems ok to me.


> 4.
>
> - pgaio_io_register_callbacks
> - pgaio_io_set_handle_data_32
>
> If I understand correctly one can register multiple callbacks per one
> AIO Handle (right? ...). However I don't see an obvious way to match
> handle data to the given callback. If all the callbacks get the same
> handle data... well it's weird IMO, but we should explicitly say that.

There is:

/*
 * Associate an array of data with the Handle. This is e.g. useful to
 * transport knowledge about which buffers a multi-block IO affects to
 * completion callbacks.
 *
 * Right now this can be done only once for each IO, even though multiple
 * callbacks can be registered. There aren't any known usecases requiring more
 * and the required amount of shared memory does add up, so it doesn't seem
 * worth multiplying memory usage by PGAIO_HANDLE_MAX_CALLBACKS.
 */


> On top of that we should probably explain in which order the callbacks
> are going to be executed. If there are any guarantees in this respect
> of course.

Alongside PgAioHandleCallbacks:
     *
     * The latest registered callback is called first. This allows
     * higher-level code to register callbacks that can rely on callbacks
     * registered by lower-level code to already have been executed.
     *


> 5. pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1)
>
> Perhaps it's worth mentioning if `buffer` can be freed after the call
> i.e. if it's stored by value or by reference.

By value.


> It's also worth clarifying if the maximum number of buffers is limited or
> not.

It's limited to PG_IOV_MAX, fwiw.


> 6. It is worth clarifying if AIO allows reads and writes or only reads
> at the moment.

We have patches for writes, I just ran out of time for 18. I'm not particularly
excited about adding stuff that then needs to be removed in 19.


> Perhaps it's also worth explicitly saying that AIO is for disk IO only, not
> for network IO.

I'd like to integrate network IO too.  I have a local prototype, fwiw.


> 7. It is worth clarifying how many times the callbacks are called when
> reading multiple buffers. Is it guaranteed that the callbacks are
> called once, or does it somehow depend on the implementation, and also
> what happens in the case where I/O succeeds partially.

The aio subsystem doesn't know anything about buffers.  Callbacks are executed
exactly once, with the exception of the error reporting callback, which could
be called multiple times.


> 8. I believe we should tell a bit more about the context in which the
> callbacks are called. Particularly what happens to the memory contexts
> and if I can allocate/free memory, can I throw ERRORs, can I create
> new AIO Handles, is it expected that the callback should return
> quickly, are the signals masked while the callback is executed, can I
> use sockets, is it guaranteed that the callback is going to be called
> in the same process (I guess so, but the text doesn't explicitly
> promise that), etc.

There is the following comment above pgaio_io_register_callbacks():

 * Note that callbacks are executed in critical sections.  This is necessary
 * to be able to execute IO in critical sections (consider e.g. WAL
 * logging). To perform AIO we first need to acquire a handle, which, if there
 * are no free handles, requires waiting for IOs to complete and to execute
 * their completion callbacks.
 *
 * Callbacks may be executed in the issuing backend but also in another
 * backend (because that backend is waiting for the IO) or in IO workers (if
 * io_method=worker is used).

And also a bunch of detail alongside struct PgAioHandleCallbacks.


> 9.
>
> ```
> +Because acquisition of an IO handle
> +[must always succeed](#io-can-be-started-in-critical-sections)
> +and the number of AIO Handles
> +[has to be limited](#state-for-aio-needs-to-live-in-shared-memory)
> +AIO handles can be reused as soon as they have completed.
> ```
>
> What does pgaio_io_acquire() do if we are out of AIO Handles? Since it
> always succeeds I guess it should block the caller in this case, but
> IMO we should say this explicitly.

That's documented above pgaio_io_acquire().


> 10.
>
> > > because I want to integrate some language that could be referenced by
> > > smgrstartreadv() (and more in the future), as we have been talking about.
> >
> > I tried a bunch of variations and none of them seemed great. So I ended up
> > with a lightly polished version of your suggested comment above
> > smgrstartreadv().  We can later see about generalizing it.
>
> IMO the problem here is that the README doesn't show the code that does IO
> per se, and thus doesn't give the full picture of how AIO should be
> used. Perhaps instead of referencing smgrstartreadv() it would be
> better to provide a simple but complete example, one that opens a
> binary file and reads 512 bytes from it at a given offset, for
> instance.

IMO the example is already long enough; if you want that level of detail, you
can just look at smgrstartreadv() etc. The idea behind explaining it at that
level is that this is basically what is required to use AIO in a new place,
whereas implementing AIO for a new target, or a new IO operation, requires a
bit more care.

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Mon, Mar 31, 2025 at 08:41:39PM -0400, Andres Freund wrote:
> updated version

All non-write patches (1-7) are ready for commit, though I have some cosmetic
recommendations below.  I've marked the commitfest entry Ready for Committer.

> +        # Check a page validity error in another block, to ensure we report
> +        # the correct block number
> +        $psql_a->query_safe(
> +            qq(
> +SELECT modify_rel_block('tbl_zero', 3, corrupt_header=>true);
> +));
> +        psql_like(
> +            $io_method,
> +            $psql_a,
> +            "$persistency: test zeroing of invalid block 3",
> +            qq(SELECT read_rel_block_ll('tbl_zero', 3, zero_on_error=>true);),
> +            qr/^$/,
> +            qr/^psql:<stdin>:\d+: WARNING:  invalid page in block 3 of relation base\/.*\/.*; zeroing out page$/
> +        );
> +
> +
> +        # Check a page validity error in another block, to ensure we report
> +        # the correct block number

This comment is a copy of the previous test's comment.  While the comment is
not false, consider changing it to:

        # Check one read reporting multiple invalid blocks.

> +        $psql_a->query_safe(
> +            qq(
> +SELECT modify_rel_block('tbl_zero', 2, corrupt_header=>true);
> +SELECT modify_rel_block('tbl_zero', 3, corrupt_header=>true);
> +));
> +        # First test error
> +        psql_like(
> +            $io_method,
> +            $psql_a,
> +            "$persistency: test reading of invalid block 2,3 in larger read",
> +            qq(SELECT read_rel_block_ll('tbl_zero', 1, nblocks=>4, zero_on_error=>false)),
> +            qr/^$/,
> +            qr/^psql:<stdin>:\d+: ERROR:  2 invalid pages among blocks 1..4 of relation base\/.*\/.*\nDETAIL:  Block 2 held first invalid page\.\nHINT:[^\n]+$/
> +        );
> +
> +        # Then test zeroing via ZERO_ON_ERROR flag
> +        psql_like(
> +            $io_method,
> +            $psql_a,
> +            "$persistency: test zeroing of invalid block 2,3 in larger read, ZERO_ON_ERROR",
> +            qq(SELECT read_rel_block_ll('tbl_zero', 1, nblocks=>4, zero_on_error=>true)),
> +            qr/^$/,
> +            qr/^psql:<stdin>:\d+: WARNING:  zeroing out 2 invalid pages among blocks 1..4 of relation base\/.*\/.*\nDETAIL:  Block 2 held first zeroed page\.\nHINT:[^\n]+$/
> +        );
> +
> +        # Then test zeroing vio zero_damaged_pages

s/vio/via/

> +# Verify checksum handling when creating database from an invalid database.
> +# This also serves as a minimal check that cross-database IO is handled
> +# reasonably.

To me, "invalid database" is a term of art from the message "cannot connect to
invalid database".  Hence, I would change "invalid database" to "database w/
invalid block" or similar, here and below.  (Alternatively, just delete "from
an invalid database".  It's clear from the context.)

> +    if (corrupt_checksum)
> +    {
> +        bool        successfully_corrupted = 0;
> +
> +        /*
> +         * Any single modification of the checksum could just end up being
> +         * valid again. To be sure
> +         */

Unfinished sentence.  That said, I'm not following why we'd need this loop.
If this test code were changing the input to the checksum, it's true that an
input bit flip might reach the same pd_checksum.  The test case is changing
pd_checksum, not the input bits.  I don't see how changing pd_checksum could
leave the page still valid.  There's only one valid pd_checksum value for a
given input page.

> +            /*
> +             * The underlying IO actually completed OK, and thus the "invalid"
> +             * portion of the IOV actually contains valid data. That can hide
> +             * a lot of problems, e.g. if we were to wrongly mark a buffer,
> +             * that wasn't read according to the shortened-read, IO as valid,
> +             * the contents would look valid and we might miss a bug.

Minimally s/read, IO/read IO,/ but I'd edit a bit further:

             * a lot of problems, e.g. if we were to wrongly mark-valid a
             * buffer that wasn't read according to the shortened-read IO, the
             * contents would look valid and we might miss a bug.

> Subject: [PATCH v2.15 05/18] md: Add comment & assert to buffer-zeroing path
>  in md[start]readv()

> The zero_damaged_pages path is incomplete, as as missing segments are not

s/as as/as/

> For now, put an Assert(false) comments documenting this choice into mdreadv()

s/comments/and comments/

> +                 * For PG 18, we are putting an Assert(false) in into
> +                 * mdreadv() (triggering failures in assertion-enabled builds,

s/in into/in/

> Subject: [PATCH v2.15 06/18] aio: comment polishing

> + * - Partial reads need to be handle by the caller re-issuing IO for the
> + *   unread blocks

s/handle/handled/

> Subject: [PATCH v2.15 07/18] aio: Add errcontext for processing I/Os for
>  another backend



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-01 08:11:59 -0700, Noah Misch wrote:
> On Mon, Mar 31, 2025 at 08:41:39PM -0400, Andres Freund wrote:
> > updated version
>
> All non-write patches (1-7) are ready for commit, though I have some cosmetic
> recommendations below.  I've marked the commitfest entry Ready for Committer.

Thanks!

I haven't yet pushed the changes, but will work on that in the afternoon.

I plan to afterwards close the CF entry and will eventually create a new one
for write support, although probably only rebasing onto
https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
and addressing some of the locking issues.

WRT the locking issues, I've been wondering whether we could make
LWLockWaitForVar() work for that purpose, but I doubt it's the right approach.
Probably better to get rid of the LWLock*Var functions and go for the approach
I had in v1, namely a version of LWLockAcquire() with a callback that gets
called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the
lock acquisition to abort.


> This comment is a copy of the previous test's comment.  While the comment is
> not false, consider changing it to:
>
>         # Check one read reporting multiple invalid blocks.

> > +        # Then test zeroing vio zero_damaged_pages
>
> s/vio/via/
>

These make sense.


> > +# Verify checksum handling when creating database from an invalid database.
> > +# This also serves as a minimal check that cross-database IO is handled
> > +# reasonably.
>
> To me, "invalid database" is a term of art from the message "cannot connect to
> invalid database".  Hence, I would change "invalid database" to "database w/
> invalid block" or similar, here and below.  (Alternatively, just delete "from
> an invalid database".  It's clear from the context.)

Yea, I agree, this is easy to misunderstand when stepping back. I went for "with
an invalid block".


> > +    if (corrupt_checksum)
> > +    {
> > +        bool        successfully_corrupted = 0;
> > +
> > +        /*
> > +         * Any single modification of the checksum could just end up being
> > +         * valid again. To be sure
> > +         */
>
> Unfinished sentence.

Oops. See below.


> That said, I'm not following why we'd need this loop.  If this test code
> were changing the input to the checksum, it's true that an input bit flip
> might reach the same pd_checksum.  The test case is changing pd_checksum,
> not the input bits.

We might be changing the input, due to the zero/corrupt_header options. Or we
might be called on a page that is *already* corrupted. I did encounter that
situation once while writing tests, where the tests only passed if I made the
+ 1 a + 2. Which was, uh, rather confusing and left me feeling like I was cursed
that day.


> I don't see how changing pd_checksum could leave the
> page still valid.  There's only one valid pd_checksum value for a given
> input page.

I updated the comment to:
        /*
         * Any single modification of the checksum could just end up being
         * valid again, due to e.g. corrupt_header changing the data in a way
         * that'd result in the "corrupted" checksum, or the checksum already
         * being invalid. Retry in that, unlikely, case.
         */


> > +            /*
> > +             * The underlying IO actually completed OK, and thus the "invalid"
> > +             * portion of the IOV actually contains valid data. That can hide
> > +             * a lot of problems, e.g. if we were to wrongly mark a buffer,
> > +             * that wasn't read according to the shortened-read, IO as valid,
> > +             * the contents would look valid and we might miss a bug.
>
> Minimally s/read, IO/read IO,/ but I'd edit a bit further:
>
>              * a lot of problems, e.g. if we were to wrongly mark-valid a
>              * buffer that wasn't read according to the shortened-read IO, the
>              * contents would look valid and we might miss a bug.

Adopted.


> > Subject: [PATCH v2.15 05/18] md: Add comment & assert to buffer-zeroing path
> >  in md[start]readv()
>
> > The zero_damaged_pages path is incomplete, as as missing segments are not
>
> s/as as/as/
>
> > For now, put an Assert(false) comments documenting this choice into mdreadv()
>
> s/comments/and comments/
>
> > +                 * For PG 18, we are putting an Assert(false) in into
> > +                 * mdreadv() (triggering failures in assertion-enabled builds,
>
> s/in into/in/

> > Subject: [PATCH v2.15 06/18] aio: comment polishing
>
> > + * - Partial reads need to be handle by the caller re-issuing IO for the
> > + *   unread blocks
>
> s/handle/handled/

All adopted.  I'm sorry that you had to see so much tiredness-enhanced
dyslexia :(.

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Apr 01, 2025 at 11:55:20AM -0400, Andres Freund wrote:
> On 2025-04-01 08:11:59 -0700, Noah Misch wrote:
> > On Mon, Mar 31, 2025 at 08:41:39PM -0400, Andres Freund wrote:

> I haven't yet pushed the changes, but will work on that in the afternoon.
> 
> I plan to afterwards close the CF entry and will eventually create a new one
> for write support, although probably only rebasing onto
> https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
> and addressing some of the locking issues.

Sounds good.

> WRT the locking issues, I've been wondering whether we could make
> LWLockWaitForVar() work for that purpose, but I doubt it's the right approach.
> Probably better to get rid of the LWLock*Var functions and go for the approach
> I had in v1, namely a version of LWLockAcquire() with a callback that gets
> called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the
> lock acquisition to abort.

What are the best thing(s) to read to understand the locking issues?

> > > +# Verify checksum handling when creating database from an invalid database.
> > > +# This also serves as a minimal check that cross-database IO is handled
> > > +# reasonably.
> >
> > To me, "invalid database" is a term of art from the message "cannot connect to
> > invalid database".  Hence, I would change "invalid database" to "database w/
> > invalid block" or similar, here and below.  (Alternatively, just delete "from
> > an invalid database".  It's clear from the context.)
> 
> Yea, I agree, this is easy to misunderstand when stepping back. I went for "with
> an invalid block".

Sounds good.

> > > +    if (corrupt_checksum)
> > > +    {
> > > +        bool        successfully_corrupted = 0;
> > > +
> > > +        /*
> > > +         * Any single modification of the checksum could just end up being
> > > +         * valid again. To be sure
> > > +         */
> >
> > Unfinished sentence.
> 
> Oops. See below.
> 
> 
> > That said, I'm not following why we'd need this loop.  If this test code
> > were changing the input to the checksum, it's true that an input bit flip
> > might reach the same pd_checksum.  The test case is changing pd_checksum,
> > not the input bits.
> 
> We might be changing the input, due to the zero/corrupt_header options. Or we
> might be called on a page that is *already* corrupted. I did encounter that
> situation once while writing tests, where the tests only passed if I made the
> + 1 a + 2. Which was, uh, rather confusing and left me feeling like I was cursed
> that day.

Got it.

> > I don't see how changing pd_checksum could leave the
> > page still valid.  There's only one valid pd_checksum value for a given
> > input page.
> 
> I updated the comment to:
>         /*
>          * Any single modification of the checksum could just end up being
>          * valid again, due to e.g. corrupt_header changing the data in a way
>          * that'd result in the "corrupted" checksum, or the checksum already
>          * being invalid. Retry in that, unlikely, case.
>          */

Works for me.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-01 09:07:27 -0700, Noah Misch wrote:
> On Tue, Apr 01, 2025 at 11:55:20AM -0400, Andres Freund wrote:
> > WRT the locking issues, I've been wondering whether we could make
> > LWLockWaitForVar() work for that purpose, but I doubt it's the right approach.
> > Probably better to get rid of the LWLock*Var functions and go for the approach
> > I had in v1, namely a version of LWLockAcquire() with a callback that gets
> > called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the
> > lock acquisition to abort.
>
> What are the best thing(s) to read to understand the locking issues?

Unfortunately I think it's our discussion from a few days/weeks ago.

The problem basically is that functions like LockBuffer(EXCLUSIVE) need to be able
to non-racily

a) wait for in-flight IOs
b) acquire the content lock

If you just do it naively like this:

    else if (mode == BUFFER_LOCK_EXCLUSIVE)
    {
        if (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS)
            WaitIO(buf);
        LWLockAcquire(content_lock, LW_EXCLUSIVE);
    }

you obviously could have another backend start new IO between the WaitIO() and
the LWLockAcquire().  If that other backend then doesn't consume the
completion of that IO, the current backend could end up endlessly waiting for
the IO.  I don't see a way to avoid that with narrow changes just to LockBuffer().


We need some infrastructure that allows us to avoid that issue.  One approach
could be to integrate more tightly with lwlock.c. If

1) anyone starting IO were to wake up all waiters for the LWLock

2) the waiting side checked that there is no IO in progress *after*
   LWLockQueueSelf(), but before PGSemaphoreLock()

the backend doing LockBuffer() would be guaranteed to get a chance to wait
for the IO, rather than the lwlock.
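
In pseudo-code, with LWLockAcquireWithCheck() being a hypothetical variant of
LWLockAcquire() that runs a callback between LWLockQueueSelf() and
PGSemaphoreLock() and aborts the acquisition if the callback fails:

    static bool
    buffer_lock_check(void *arg)
    {
        BufferDesc *buf = (BufferDesc *) arg;

        /* refuse to go to sleep on the lwlock while IO is in progress */
        return (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS) == 0;
    }

    while (!LWLockAcquireWithCheck(content_lock, LW_EXCLUSIVE,
                                   buffer_lock_check, buf))
    {
        /* there was IO in flight - wait for it, then retry the lock */
        WaitIO(buf);
    }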


But there might be better approaches.

I'm not really convinced that using generic lwlocks for buffer locking is the
best idea. There are just too many special things about buffers. E.g. we have
rather massive NUMA scalability issues due to the amount of lock traffic from
buffer header and content lock atomic operations, particularly on things like the
uppermost levels of a btree.  I've played with ideas like super-pinning and
locking btree root pages, which move all the overhead to the side that wants
to exclusively lock such a page - but that doesn't really make sense for
lwlocks in general.


Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-01 11:55:20 -0400, Andres Freund wrote:
> I haven't yet pushed the changes, but will work on that in the afternoon.

There are three different types of failures in the test_aio test so far:

1) TEMP_CONFIG

See https://postgr.es/m/zh5u22wbpcyfw2ddl3lsvmsxf4yvsrvgxqwwmfjddc4c2khsgp%40gfysyjsaelr5


2) Failure on at least some windows BF machines:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-04-01%2020%3A15%3A19
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-04-01%2019%3A03%3A07

Afaict the error is independent of AIO, instead just related to CREATE DATABASE
... STRATEGY wal_log failing on windows.  In contrast to dropdb(), which does

    /*
     * Force a checkpoint to make sure the checkpointer has received the
     * message sent by ForgetDatabaseSyncRequests.
     */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

    /* Close all smgr fds in all backends. */
    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

createdb_failure_callback() does no such thing.  But it's rather likely that
we, bgwriter, checkpointer (and now IO workers) have files open for the target
database.

Note that the test is failing even with "io_method=sync", which obviously
doesn't use IO workers, so it's not related to that.


It's probably not a good idea to blockingly request a checkpoint and a barrier
inside a PG_TRY/PG_ENSURE_ERROR_CLEANUP() though, so this would need a bit
more rearchitecting.


I think I'm just going to make the test more lenient by not insisting that the
error is the first thing on psql's stderr.


3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07

# +++ tap check in src/test/modules/test_aio +++

#   Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr'
#   at t/001_aio.pl line 318.
#                   'psql:<stdin>:4: ERROR:  starting batch while batch already in progress'
#     doesn't match '(?^:open AIO batch at end)'


The problem is basically that the test intentionally forgets to exit batchmode
- normally that would trigger an error at the end of the transaction, which
the test verifies.  However, with RELCACHE_FORCE_RELEASE and
CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and
erroring out because batchmode isn't allowed to be entered recursively.
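
The check that trips is, roughly, at the top of pgaio_enter_batchmode()
(sketch; the actual flag representation differs):

    void
    pgaio_enter_batchmode(void)
    {
        if (pgaio_my_backend->in_batchmode)
            elog(ERROR, "starting batch while batch already in progress");

        pgaio_my_backend->in_batchmode = true;
    }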


#0  pgaio_enter_batchmode () at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/aio.c:997
#1  0x000055ec847959bf in read_stream_look_ahead (stream=0x55ecbcfda098)
    at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:438
#2  0x000055ec84796514 in read_stream_next_buffer (stream=0x55ecbcfda098, per_buffer_data=0x0)
    at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:890
#3  0x000055ec8432520b in heap_fetch_next_buffer (scan=0x55ecbcfd1c00, dir=ForwardScanDirection)
    at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:679
#4  0x000055ec843259a4 in heapgettup_pagemode (scan=0x55ecbcfd1c00, dir=ForwardScanDirection, nkeys=1, key=0x55ecbcfd1620)
    at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1041
#5  0x000055ec843263ba in heap_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18)
    at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1420
#6  0x000055ec8434ebe5 in table_scan_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18)
    at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1041
#7  0x000055ec8434f786 in systable_getnext (sysscan=0x55ecbcfd8088)
    at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:541
#8  0x000055ec849c784a in SearchCatCacheMiss (cache=0x55ecbcf81000, nkeys=1, hashValue=3830081846, hashIndex=2, v1=403, v2=0, v3=0, v4=0)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1543
#9  0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcf81000, nkeys=1, v1=403, v2=0, v3=0, v4=0)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464
#10 0x000055ec849c73ec in SearchCatCache1 (cache=0x55ecbcf81000, v1=403)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332
#11 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=2, key1=403)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228
#12 0x000055ec849d8c78 in RelationInitIndexAccessInfo (relation=0x7f6a85901c20)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1456
#13 0x000055ec849d8471 in RelationBuildDesc (targetRelId=2703, insertIt=true)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1201
#14 0x000055ec849d9e9c in RelationIdGetRelation (relationId=2703)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2100
#15 0x000055ec842d219f in relation_open (relationId=2703, lockmode=1)
    at ../../../../../home/andres/src/postgresql/src/backend/access/common/relation.c:58
#16 0x000055ec8435043c in index_open (relationId=2703, lockmode=1)
    at ../../../../../home/andres/src/postgresql/src/backend/access/index/indexam.c:137
#17 0x000055ec8434f2f9 in systable_beginscan (heapRelation=0x7f6a859353a8, indexId=2703, indexOK=true, snapshot=0x0, nkeys=1, key=0x7ffc11aa7c90)
    at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:400
#18 0x000055ec849c782c in SearchCatCacheMiss (cache=0x55ecbcfa0e80, nkeys=1, hashValue=2659955452, hashIndex=60, v1=2278, v2=0, v3=0, v4=0)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1533
#19 0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcfa0e80, nkeys=1, v1=2278, v2=0, v3=0, v4=0)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464
#20 0x000055ec849c73ec in SearchCatCache1 (cache=0x55ecbcfa0e80, v1=2278)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332
#21 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=82, key1=2278)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228
#22 0x000055ec849d0375 in getTypeOutputInfo (type=2278, typOutput=0x55ecbcfd15d0, typIsVarlena=0x55ecbcfd15d8)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/lsyscache.c:2995
#23 0x000055ec842d1a57 in printtup_prepare_info (myState=0x55ecbcfcec00, typeinfo=0x55ecbcfd0588, numAttrs=1)
    at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:277
#24 0x000055ec842d1ba6 in printtup (slot=0x55ecbcfd0b28, self=0x55ecbcfcec00)
    at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:315
#25 0x000055ec84541f54 in ExecutePlan (queryDesc=0x55ecbced4290, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55ecbcfcec00)
    at ../../../../../home/andres/src/postgresql/src/backend/executor/execMain.c:1814


I don't really have a good idea how to deal with that yet.


Greetings,

Andres



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-01 17:47:51 -0400, Andres Freund wrote:
> 3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07
> 
> # +++ tap check in src/test/modules/test_aio +++
> 
> #   Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr'
> #   at t/001_aio.pl line 318.
> #                   'psql:<stdin>:4: ERROR:  starting batch while batch already in progress'
> #     doesn't match '(?^:open AIO batch at end)'
> 
> 
> The problem is basically that the test intentionally forgets to exit batchmode
> - normally that would trigger an error at the end of the transaction, which
> the test verifies.  However, with RELCACHE_FORCE_RELEASE and
> CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and
> erroring out because batchmode isn't allowed to be entered recursively.
> 
> 
> #0  pgaio_enter_batchmode () at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/aio.c:997
> #1  0x000055ec847959bf in read_stream_look_ahead (stream=0x55ecbcfda098)
>     at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:438
> #2  0x000055ec84796514 in read_stream_next_buffer (stream=0x55ecbcfda098, per_buffer_data=0x0)
>     at ../../../../../home/andres/src/postgresql/src/backend/storage/aio/read_stream.c:890
> #3  0x000055ec8432520b in heap_fetch_next_buffer (scan=0x55ecbcfd1c00, dir=ForwardScanDirection)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:679
> #4  0x000055ec843259a4 in heapgettup_pagemode (scan=0x55ecbcfd1c00, dir=ForwardScanDirection, nkeys=1, key=0x55ecbcfd1620)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1041
> #5  0x000055ec843263ba in heap_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam.c:1420
> #6  0x000055ec8434ebe5 in table_scan_getnextslot (sscan=0x55ecbcfd1c00, direction=ForwardScanDirection, slot=0x55ecbcfd0e18)
>     at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1041
> #7  0x000055ec8434f786 in systable_getnext (sysscan=0x55ecbcfd8088)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:541
> #8  0x000055ec849c784a in SearchCatCacheMiss (cache=0x55ecbcf81000, nkeys=1, hashValue=3830081846, hashIndex=2, v1=403, v2=0, v3=0, v4=0)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1543
> #9  0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcf81000, nkeys=1, v1=403, v2=0, v3=0, v4=0)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464
> #10 0x000055ec849c73ec in SearchCatCache1 (cache=0x55ecbcf81000, v1=403)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332
> #11 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=2, key1=403)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228
> #12 0x000055ec849d8c78 in RelationInitIndexAccessInfo (relation=0x7f6a85901c20)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1456
> #13 0x000055ec849d8471 in RelationBuildDesc (targetRelId=2703, insertIt=true)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:1201
> #14 0x000055ec849d9e9c in RelationIdGetRelation (relationId=2703)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2100
> #15 0x000055ec842d219f in relation_open (relationId=2703, lockmode=1)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/common/relation.c:58
> #16 0x000055ec8435043c in index_open (relationId=2703, lockmode=1)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/index/indexam.c:137
> #17 0x000055ec8434f2f9 in systable_beginscan (heapRelation=0x7f6a859353a8, indexId=2703, indexOK=true, snapshot=0x0, nkeys=1, key=0x7ffc11aa7c90)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/index/genam.c:400
> #18 0x000055ec849c782c in SearchCatCacheMiss (cache=0x55ecbcfa0e80, nkeys=1, hashValue=2659955452, hashIndex=60, v1=2278, v2=0, v3=0, v4=0)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1533
> #19 0x000055ec849c76f9 in SearchCatCacheInternal (cache=0x55ecbcfa0e80, nkeys=1, v1=2278, v2=0, v3=0, v4=0)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1464
> #20 0x000055ec849c73ec in SearchCatCache1 (cache=0x55ecbcfa0e80, v1=2278)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1332
> #21 0x000055ec849e5ae3 in SearchSysCache1 (cacheId=82, key1=2278)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:228
> #22 0x000055ec849d0375 in getTypeOutputInfo (type=2278, typOutput=0x55ecbcfd15d0, typIsVarlena=0x55ecbcfd15d8)
>     at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/lsyscache.c:2995
> #23 0x000055ec842d1a57 in printtup_prepare_info (myState=0x55ecbcfcec00, typeinfo=0x55ecbcfd0588, numAttrs=1)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:277
> #24 0x000055ec842d1ba6 in printtup (slot=0x55ecbcfd0b28, self=0x55ecbcfcec00)
>     at ../../../../../home/andres/src/postgresql/src/backend/access/common/printtup.c:315
> #25 0x000055ec84541f54 in ExecutePlan (queryDesc=0x55ecbced4290, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55ecbcfcec00)
>     at ../../../../../home/andres/src/postgresql/src/backend/executor/execMain.c:1814
> 
> 
> I don't really have a good idea how to deal with that yet.

Hm. Making the query something like

SELECT * FROM (VALUES (NULL), (batch_start()));

avoids the wrong output, because the type lookup happens for the first row
already. But that's pretty magical and probably fragile.

Greetings,

Andres Freund



Re: AIO v2.5

From
Noah Misch
Date:
On Tue, Apr 01, 2025 at 06:25:28PM -0400, Andres Freund wrote:
> On 2025-04-01 17:47:51 -0400, Andres Freund wrote:
> > 3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07
> > 
> > # +++ tap check in src/test/modules/test_aio +++
> > 
> > #   Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr'
> > #   at t/001_aio.pl line 318.
> > #                   'psql:<stdin>:4: ERROR:  starting batch while batch already in progress'
> > #     doesn't match '(?^:open AIO batch at end)'
> > 
> > 
> > The problem is basically that the test intentionally forgets to exit batchmode
> > - normally that would trigger an error at the end of the transaction, which
> > the test verifies.  However, with RELCACHE_FORCE_RELEASE and
> > CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and
> > erroring out because batchmode isn't allowed to be entered recursively.

> > I don't really have a good idea how to deal with that yet.
> 
> Hm. Making the query something like
> 
> SELECT * FROM (VALUES (NULL), (batch_start()));
> 
> avoids the wrong output, because the type lookup happens for the first row
> already. But that's pretty magical and probably fragile.

Hmm.  Some options:

a. VALUES() trick above.  For test code, it's hard to argue with something
   that seems to solve it in practice.

b. Encapsulate the test in a PROCEDURE, so perhaps less happens between the
   batch_start() and the procedure-managed COMMIT.  Maybe less fragile than
   (a), maybe more fragile.

c. Move RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE to be
   GUC-controlled, like how CLOBBER_CACHE_ALWAYS changed into the
   debug_discard_caches GUC.  Then disable them for relevant parts of
   test_aio.  This feels best long-term, but it's bigger.  I also wanted this
   in syscache-update-pruned.spec[1].

d. Have test_aio deduce whether these are set, probably by observing memory
   contexts or DEBUG messages.  Maybe have every postmaster startup print a
   DEBUG message about these settings being enabled.  Skip relevant parts of
   test_aio.  This sounds messy.

Each of those feels defensible to me.  I'd probably do (a) or (b) to start.


[1] For that spec, an alternative expected output sufficed.  Incidentally,
I'll soon fix that spec flaking on valgrind/skink.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

I've pushed fixes for 1) and 2) and am working on 3).


On 2025-04-01 17:13:24 -0700, Noah Misch wrote:
> On Tue, Apr 01, 2025 at 06:25:28PM -0400, Andres Freund wrote:
> > On 2025-04-01 17:47:51 -0400, Andres Freund wrote:
> > > 3) Some subtests fail if RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE are defined:
> > > 
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-04-01%2019%3A23%3A07
> > > 
> > > # +++ tap check in src/test/modules/test_aio +++
> > > 
> > > #   Failed test 'worker: batch_start() leak & cleanup in implicit xact: expected stderr'
> > > #   at t/001_aio.pl line 318.
> > > #                   'psql:<stdin>:4: ERROR:  starting batch while batch already in progress'
> > > #     doesn't match '(?^:open AIO batch at end)'
> > > 
> > > 
> > > The problem is basically that the test intentionally forgets to exit batchmode
> > > - normally that would trigger an error at the end of the transaction, which
> > > the test verifies.  However, with RELCACHE_FORCE_RELEASE and
> > > CATCACHE_FORCE_RELEASE defined, we get other code entering batchmode and
> > > erroring out because batchmode isn't allowed to be entered recursively.
> 
> > > I don't really have a good idea how to deal with that yet.
> > 
> > Hm. Making the query something like
> > 
> > SELECT * FROM (VALUES (NULL), (batch_start()));
> > 
> > avoids the wrong output, because the type lookup happens for the first row
> > already. But that's pretty magical and probably fragile.
> 
> Hmm.  Some options:
> 
> a. VALUES() trick above.  For test code, it's hard to argue with something
>    that seems to solve it in practice.

I think I'll go for a slightly nicer version of that, namely
  SELECT WHERE batch_start() IS NULL
I think that ends up the least verbose of the ideas we've been discussing.


> c. Move RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE to be
>    GUC-controlled, like how CLOBBER_CACHE_ALWAYS changed into the
>    debug_discard_caches GUC.  Then disable them for relevant parts of
>    test_aio.  This feels best long-term, but it's bigger.  I also wanted this
>    in syscache-update-pruned.spec[1].

Yea, that'd probably be a good thing medium-term.

Greetings,

Andres Freund



Re: AIO v2.5

From
Ranier Vilela
Date:
Hi.

On Wed, Apr 2, 2025 at 08:58, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> I've pushed fixes for 1) and 2) and am working on 3).

Coverity has one report about this.

CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
13. uninit_use_in_call: Using uninitialized value result_one. Field result_one.result is uninitialized when calling pgaio_result_report.


Below is not a fix, but a suggestion:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1c37d7dfe2..b0f9ce452c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
  else
  result->status = PGAIO_RS_WARNING;
 
+ result->result = 0;
+
  /*
  * The encoding is complicated enough to warrant cross-checking it against
  * the decode function.
@@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
  /* Check for garbage data. */
  if (!failed)
  {
- PgAioResult result_one;
-
  if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
  failed_checksum))
  {
@@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
  */
  if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
  {
+ PgAioResult result_one;
+
  buffer_readv_encode_error(&result_one, is_temp,
   *zeroed_buffer,
   *ignored_checksum,
 

1. I couldn't find the correct value to initialize the *result* field.
2. result_one's scope can be reduced.

best regards,
Ranier Vilela

Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-01 17:47:51 -0400, Andres Freund wrote:
> There are three different types of failures in the test_aio test so far:

And a fourth, visible after I enabled liburing support for skink.

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-04-03%2007%3A06%3A19&stg=pg_upgrade-check
(ignore the pg_upgrade and oauth failures, they're independent, I've raised
them separately)

4a)

2025-04-03 10:58:32.978 UTC [2486740][client backend][3/6:0] LOG:  short read injection point called, is enabled: 0
==2486740== VALGRINDERROR-BEGIN
==2486740== Invalid read of size 2
==2486740==    at 0x59C8AC: PageIsNew (bufpage.h:237)
==2486740==    by 0x59C8AC: PageIsVerified (bufpage.c:108)
==2486740==    by 0x567870: buffer_readv_complete_one (bufmgr.c:6873)
==2486740==    by 0x567870: buffer_readv_complete (bufmgr.c:6996)
==2486740==    by 0x567870: shared_buffer_readv_complete (bufmgr.c:7153)
==2486740==    by 0x55DDB2: pgaio_io_call_complete_shared (aio_callback.c:256)
==2486740==    by 0x55D6F1: pgaio_io_process_completion (aio.c:512)
==2486740==    by 0x55F53A: pgaio_uring_drain_locked (method_io_uring.c:370)
==2486740==    by 0x55F7B8: pgaio_uring_wait_one (method_io_uring.c:449)
==2486740==    by 0x55C702: pgaio_io_wait (aio.c:587)
==2486740==    by 0x55C8B0: pgaio_wref_wait (aio.c:900)
==2486740==    by 0x8639240: read_rel_block_ll (test_aio.c:440)
==2486740==    by 0x3B915C: ExecInterpExpr (execExprInterp.c:953)
==2486740==    by 0x3B4E4E: ExecInterpExprStillValid (execExprInterp.c:2299)
==2486740==    by 0x3F7E97: ExecEvalExprNoReturn (executor.h:445)
==2486740==  Address 0x8fa400e is in a rw- anonymous segment
==2486740==
==2486740== VALGRINDERROR-END

The reason for this is that the test unpins the buffer (from the backend's
view) before waiting for the IO. Even though the AIO subsystem still holds a
pin, UnpinBufferNoOwner() marked the buffer as inaccessible:
        /*
         * Mark buffer non-accessible to Valgrind.
         *
         * Note that the buffer may have already been marked non-accessible
         * within access method code that enforces that buffers are only
         * accessed while a buffer lock is held.
         */
        VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);


I think to fix this we need to mark buffers as accessible around the
PageIsVerified() call in buffer_readv_complete_one(), IFF they're not pinned
by the backend.  Unfortunately, this is complicated by the fact that local
buffers do not have valgrind integration :(, so we should only do that for
shared buffers, as otherwise a local buffer would stay inaccessible the next
time it is pinned.
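
In code, the shape I have in mind is roughly the following (just a sketch;
BufferIsPinnedByMe() is a made-up stand-in for however the "pinned by this
backend" test ends up being spelled, and the surrounding error handling is
elided):

    /* only for shared buffers; local buffers lack valgrind integration */
    if (!is_temp && !BufferIsPinnedByMe(buffer))
        VALGRIND_MAKE_MEM_DEFINED(bufdata, BLCKSZ);

    if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
                        failed_checksum))
        *buffer_invalid = true;

    /* re-hide the buffer if this backend doesn't have it pinned */
    if (!is_temp && !BufferIsPinnedByMe(buffer))
        VALGRIND_MAKE_MEM_NOACCESS(bufdata, BLCKSZ);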


4b)

That's not all though, after getting past this failure, I see uninitialized
memory errors for reads into temporary buffers:

==3334031== VALGRINDERROR-BEGIN
==3334031== Conditional jump or move depends on uninitialised value(s)
==3334031==    at 0xD7C859: PageIsVerified (bufpage.c:108)
==3334031==    by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876)
==3334031==    by 0xD385D1: buffer_readv_complete (bufmgr.c:7002)
==3334031==    by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210)
==3334031==    by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306)
==3334031==    by 0xD24720: pgaio_io_reclaim (aio.c:644)
==3334031==    by 0xD24400: pgaio_io_process_completion (aio.c:521)
==3334031==    by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382)
==3334031==    by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461)
==3334031==    by 0xD245E0: pgaio_io_wait (aio.c:587)
==3334031==    by 0xD24FFE: pgaio_wref_wait (aio.c:900)
==3334031==    by 0xD2F471: WaitReadBuffers (bufmgr.c:1695)
==3334031==    by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898)
==3334031==    by 0x8B4861: heap_fetch_next_buffer (heapam.c:654)
==3334031==    by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016)
==3334031==    by 0x8B594F: heap_getnextslot (heapam.c:1375)
==3334031==    by 0xB28AA4: table_scan_getnextslot (tableam.h:1031)
==3334031==    by 0xB29177: SeqNext (nodeSeqscan.c:81)
==3334031==    by 0xB28F75: ExecScanFetch (execScan.h:126)
==3334031==    by 0xB28FDD: ExecScanExtended (execScan.h:170)


The reason for this one is, I think, that valgrind doesn't understand io_uring
sufficiently. Which isn't surprising: io_uring's nature as an in-memory queue
of commands makes it hard for tools like valgrind and rr to intercept.

The best fix for that one would, I think, be to have method_io_uring() iterate
over the IOV and mark the relevant regions as defined?  That does fix the
issue at least and does seem to make sense?   Not quite sure if we should mark
the entire IOV as defined or just the portion that was actually read - the
latter is additional fiddly code, and it's not clear it's likely to be helpful?
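
Concretely, something like this when draining completions (a sketch - the
accessor returning the handle's iovec is an assumption, spelled
pgaio_io_get_iovec() here):

    struct iovec *iov;
    int          iovcnt = pgaio_io_get_iovec(ioh, &iov);    /* assumed accessor */

    /* mark everything the kernel may have written into as defined */
    for (int i = 0; i < iovcnt; i++)
        VALGRIND_MAKE_MEM_DEFINED(iov[i].iov_base, iov[i].iov_len);

Marking only the portion actually read would instead walk the iovec until
cqe->res bytes are accounted for.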


4c)

Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after
PageIsVerified() causes the *next* read into the same buffer in an IO worker
to fail:

==3339904== Syscall param pread64(buf) points to unaddressable byte(s)
==3339904==    at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64)
==3339904==    by 0x5B3B6AC: __syscall_cancel (cancellation.c:75)
==3339904==    by 0x5B93C83: pread (pread64.c:25)
==3339904==    by 0xD274F4: pg_preadv (pg_iovec.h:56)
==3339904==    by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137)
==3339904==    by 0xD2A6D7: IoWorkerMain (method_worker.c:538)
==3339904==    by 0xC91E26: postmaster_child_launch (launch_backend.c:290)
==3339904==    by 0xC99594: StartChildProcess (postmaster.c:3972)
==3339904==    by 0xC99EE3: maybe_adjust_io_workers (postmaster.c:4403)
==3339904==    by 0xC958A8: PostmasterMain (postmaster.c:1381)
==3339904==    by 0xB69622: main (main.c:227)
==3339904==  Address 0x7f936d386000 is in a rw- anonymous segment

Because, from the view of the IO worker, that memory is still marked NOACCESS,
even though it has since been marked accessible in the backend.


We could address this by conditioning the VALGRIND_MAKE_MEM_NOACCESS() on not
being in an IO worker, but it seems better to instead explicitly mark the
region accessible in the worker, before executing the IO.

In a first hack, I did that in pgaio_io_perform_synchronously(), but that is
likely too broad.  I don't think the same scenario exists when IOs are
executed synchronously in the foreground.
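
The worker-side version would look about the same, just before performing a
read (again a sketch, with the same assumed accessor as above):

    struct iovec *iov;
    int          iovcnt = pgaio_io_get_iovec(ioh, &iov);    /* assumed accessor */

    /* make the target region accessible again, but don't claim it's defined -
     * the read is about to overwrite it */
    for (int i = 0; i < iovcnt; i++)
        VALGRIND_MAKE_MEM_UNDEFINED(iov[i].iov_base, iov[i].iov_len);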


Questions:

1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't
   need to have special-case logic for that. But it also makes the change less
   localized and more "impactful", who knows what kind of skullduggery we have
   been getting away with unnoticed.

   I haven't written the code up yet, but I don't think it'd be all that much
   code to add valgrind support to localbuf.

2) Any better ideas to handle the above issues than what I outlined?

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote:
> On Wed, Apr 2, 2025 at 08:58, Andres Freund <andres@anarazel.de> wrote:
> 
> > Hi,
> >
> > I've pushed fixes for 1) and 2) and am working on 3).
> >
> Coverity has one report about this.
> 
> CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
> 13. uninit_use_in_call: Using uninitialized value result_one. Field
> result_one.result is uninitialized when calling pgaio_result_report.

Isn't this a rather silly thing to warn about for coverity?  The field isn't
used in pgaio_result_report().  It can't be a particularly rare thing to have
struct fields that aren't always used?


> Below is not a fix, but a suggestion:
> 
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 1c37d7dfe2..b0f9ce452c 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
>   else
>   result->status = PGAIO_RS_WARNING;
> 
> + result->result = 0;
> +

That'd be completely wrong - and the tests indeed fail if you do that. The
read might succeed with a warning (e.g. due to zero_damaged_pages) in which
case the result still carries important information about how many blocks were
successfully read.


>   /*
>   * The encoding is complicated enough to warrant cross-checking it against
>   * the decode function.
> @@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> buf_off, Buffer buffer,
>   /* Check for garbage data. */
>   if (!failed)
>   {
> - PgAioResult result_one;
> -
>   if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
>   failed_checksum))
>   {
> @@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> buf_off, Buffer buffer,
>   */
>   if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
>   {
> + PgAioResult result_one;
> +
>   buffer_readv_encode_error(&result_one, is_temp,
>    *zeroed_buffer,
>    *ignored_checksum,
> 
> 
> 1. I couldn't find the correct value to initialize the *result* field.

It is not accessed in this path.  I guess we can just zero-initialize
result_one to shut up coverity.


> 2. result_one's scope can be reduced.

True.


Greetings,

Andres Freund



Re: AIO v2.5

From
Ranier Vilela
Date:


On Thu, Apr 3, 2025 at 15:35, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote:
> > On Wed, Apr 2, 2025 at 08:58, Andres Freund <andres@anarazel.de> wrote:
> >
> > > Hi,
> > >
> > > I've pushed fixes for 1) and 2) and am working on 3).
> > >
> > Coverity has one report about this.
> >
> > CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
> > 13. uninit_use_in_call: Using uninitialized value result_one. Field
> > result_one.result is uninitialized when calling pgaio_result_report.
>
> Isn't this a rather silly thing to warn about for coverity?

Personally, I consider every warning to be important.

> The field isn't
> used in pgaio_result_report().  It can't be a particularly rare thing to have
> struct fields that aren't always used?

Always considered a risk, someone may start using it.

> > Below is not a fix, but a suggestion:
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c
> > b/src/backend/storage/buffer/bufmgr.c
> > index 1c37d7dfe2..b0f9ce452c 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
> >   else
> >   result->status = PGAIO_RS_WARNING;
> >
> > + result->result = 0;
> > +
>
> That'd be completely wrong - and the tests indeed fail if you do that. The
> read might succeed with a warning (e.g. due to zero_damaged_pages) in which
> case the result still carries important information about how many blocks were
> successfully read.

That's exactly why it's not a patch.

> >   /*
> >   * The encoding is complicated enough to warrant cross-checking it against
> >   * the decode function.
> > @@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> > buf_off, Buffer buffer,
> >   /* Check for garbage data. */
> >   if (!failed)
> >   {
> > - PgAioResult result_one;
> > -
> >   if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
> >   failed_checksum))
> >   {
> > @@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> > buf_off, Buffer buffer,
> >   */
> >   if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
> >   {
> > + PgAioResult result_one;
> > +
> >   buffer_readv_encode_error(&result_one, is_temp,
> >    *zeroed_buffer,
> >    *ignored_checksum,
> >
> >
> > 1. I couldn't find the correct value to initialize the *result* field.
>
> It is not accessed in this path.  I guess we can just zero-initialize
> result_one to shut up coverity.

Very good.

> > 2. result_one's scope can be reduced.
>
> True.

Ok.

best regards,
Ranier Vilela

Re: AIO v2.5

From
Noah Misch
Date:
On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:
> 4b)
> 
> That's not all though, after getting past this failure, I see uninitialized
> memory errors for reads into temporary buffers:
> 
> ==3334031== VALGRINDERROR-BEGIN
> ==3334031== Conditional jump or move depends on uninitialised value(s)
> ==3334031==    at 0xD7C859: PageIsVerified (bufpage.c:108)
> ==3334031==    by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876)
> ==3334031==    by 0xD385D1: buffer_readv_complete (bufmgr.c:7002)
> ==3334031==    by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210)
> ==3334031==    by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306)
> ==3334031==    by 0xD24720: pgaio_io_reclaim (aio.c:644)
> ==3334031==    by 0xD24400: pgaio_io_process_completion (aio.c:521)
> ==3334031==    by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382)
> ==3334031==    by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461)
> ==3334031==    by 0xD245E0: pgaio_io_wait (aio.c:587)
> ==3334031==    by 0xD24FFE: pgaio_wref_wait (aio.c:900)
> ==3334031==    by 0xD2F471: WaitReadBuffers (bufmgr.c:1695)
> ==3334031==    by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898)
> ==3334031==    by 0x8B4861: heap_fetch_next_buffer (heapam.c:654)
> ==3334031==    by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016)
> ==3334031==    by 0x8B594F: heap_getnextslot (heapam.c:1375)
> ==3334031==    by 0xB28AA4: table_scan_getnextslot (tableam.h:1031)
> ==3334031==    by 0xB29177: SeqNext (nodeSeqscan.c:81)
> ==3334031==    by 0xB28F75: ExecScanFetch (execScan.h:126)
> ==3334031==    by 0xB28FDD: ExecScanExtended (execScan.h:170)
> 
> 
> The reason for this one is, I think, that valgrind doesn't understand io_uring
> sufficiently. Which isn't surprising, io_uring's nature of an in-memory queue
> of commands is somewhat hard to intercept by tools like valgrind and rr.
> 
> The best fix for that one would, I think, be to have method_io_uring() iterate
> over the IOV and mark the relevant regions as defined?  That does fix the
> issue at least and does seem to make sense?

Makes sense.  Valgrind knows that read() makes its target bytes "defined".  It
probably doesn't have an io_uring equivalent for that.

I expect we only need this for local buffers, and it's unclear to me how the
fix for (4a) didn't fix this.  Before bufmgr Valgrind integration (1e0dfd1 of
2020-07) there was no explicit handling of shared_buffers.  I suspect that
worked because the initial mmap() of shared memory was considered "defined"
(zeros), and steps like PageAddItem() copy only defined bytes into buffers.
Hence, shared_buffers remained defined without explicit Valgrind client
requests.  This example uses local buffers.  Storage for those comes from
MemoryContextAlloc() in GetLocalBufferStorage().  That memory starts
undefined, but it becomes defined at PageInit() or read().  Hence, I expected
the fix for (4a) to make the buffer defined after io_uring read.  What makes
the outcome different?

In the general case, we could want client requests as follows:

- If completor==definer and has not dropped pin:
  - Make defined before verifying page.  That's all.  It might be cleaner to
    do this when first retrieving a return value from io_uring, since this
    just makes up for what Valgrind already does for readv().

- If completor!=definer or has dropped pin:
  - Make NOACCESS in definer when definer cedes its own pin.
  - For io_method=worker, make UNDEFINED before starting readv().  It might be
    cleanest to do this when the worker first acts as the owner of the AIO
    subsystem pin, if that's a clear moment earlier than readv().
  - Make DEFINED in completor before verifying page.  It might be cleaner to
    do this when the completor first retrieves a return value from io_uring,
    since this just makes up for what Valgrind already does for readv().
  - Make NOACCESS in completor after verifying page.  Similarly, it might be
    cleaner to do this when the completor releases the AIO subsystem pin.
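
For anyone not fluent in memcheck's client requests, a tiny standalone program
(not Postgres code) showing the state transitions the list above relies on:

#include <stdlib.h>
#include <valgrind/memcheck.h>

int
main(void)
{
	char	   *buf = malloc(8192);			/* accessible, contents undefined */

	VALGRIND_MAKE_MEM_NOACCESS(buf, 8192);	/* "unpinned": any access is reported */
	VALGRIND_MAKE_MEM_UNDEFINED(buf, 8192); /* "pinned" again, contents untrusted */
	VALGRIND_MAKE_MEM_DEFINED(buf, 8192);	/* "read completed": contents trusted */

	free(buf);
	return 0;
}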

> Not quite sure if we should mark
> > the entire IOV as defined or just the portion that was actually read - the
> latter is additional fiddly code, and it's not clear it's likely to be helpful?

Seems fine to do the simpler way if that saves fiddly code.

> 4c)
> 
> Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after
> PageIsVerified() causes the *next* read into the same buffer in an IO worker
> to fail:
> 
> ==3339904== Syscall param pread64(buf) points to unaddressable byte(s)
> ==3339904==    at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64)
> ==3339904==    by 0x5B3B6AC: __syscall_cancel (cancellation.c:75)
> ==3339904==    by 0x5B93C83: pread (pread64.c:25)
> ==3339904==    by 0xD274F4: pg_preadv (pg_iovec.h:56)
> ==3339904==    by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137)
> ==3339904==    by 0xD2A6D7: IoWorkerMain (method_worker.c:538)
> ==3339904==    by 0xC91E26: postmaster_child_launch (launch_backend.c:290)
> ==3339904==    by 0xC99594: StartChildProcess (postmaster.c:3972)
> ==3339904==    by 0xC99EE3: maybe_adjust_io_workers (postmaster.c:4403)
> ==3339904==    by 0xC958A8: PostmasterMain (postmaster.c:1381)
> ==3339904==    by 0xB69622: main (main.c:227)
> ==3339904==  Address 0x7f936d386000 is in a rw- anonymous segment
> 
> Because, from the view of the IO worker, that memory is still marked NOACCESS,
> even though it has since been marked accessible in the backend.
> 
> 
> We could address this by conditioning the VALGRIND_MAKE_MEM_NOACCESS() on not
> being in an IO worker, but it seems better to instead explicitly mark the
> region accessible in the worker, before executing the IO.

Sounds good.  Since the definer gave the AIO subsystem a pin on the worker's
behalf, it's like the worker is doing an implicit pin and explicit unpin.

> In a first hack, I did that in pgaio_io_perform_synchronously(), but that is
> likely too broad.  I don't think the same scenario exists when IOs are
> executed synchronously in the foreground.
> 
> 
> Questions:
> 
> 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't
>    need to have special-case logic for that. But it also makes the change less
>    localized and more "impactful", who knows what kind of skullduggery we have
>    been getting away with unnoticed.
> 
>    I haven't written the code up yet, but I don't think it'd be all that much
>    code to add valgrind support to localbuf.

It would be the right thing long-term, and it's not a big deal if it causes
some false positives initially.  So if you're leaning that way, that's good.

> 2) Any better ideas to handle the above issues than what I outlined?

Not here, unless the discussion under (4b) differs usefully from what you
planned.



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-03 16:16:50 -0300, Ranier Vilela wrote:
> Em qui., 3 de abr. de 2025 às 15:35, Andres Freund <andres@anarazel.de>
> escreveu:> > On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote:
> > > Em qua., 2 de abr. de 2025 às 08:58, Andres Freund <andres@anarazel.de>
> > > escreveu:
> > >
> > > > Hi,
> > > >
> > > > I've pushed fixes for 1) and 2) and am working on 3).
> > > >
> > > Coverity has one report about this.
> > >
> > > CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
> > > 13. uninit_use_in_call: Using uninitialized value result_one. Field
> > > result_one.result is uninitialized when calling pgaio_result_report.
> >
> > Isn't this a rather silly thing to warn about for coverity?
> 
> Personally, I consider every warning to be important.

If the warning is wrong, then it's not helpful. Warning quality really
matters.

Zero-initializing everything *REDUCES* what static analysis and sanitizers can
do. The analyzer/sanitizer can't tell that you just silenced a warning by
zero-initializing something that shouldn't be accessed. If later there is an
access, the zero is probably the wrong value, but no tool can tell you,
because you did initialize it after all.
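
A toy standalone example of what I mean (nothing Postgres-specific; run it
under valgrind at -O0):

#include <stdio.h>

int
main(void)
{
	int			uninit;		/* memcheck reports any use of this */
	int			zeroed = 0; /* "fixed": silent, even if 0 is the wrong value */

	if (uninit)				/* conditional jump depends on uninitialised value */
		printf("uninit\n");
	if (zeroed)				/* no report - the bug is now invisible to the tool */
		printf("zeroed\n");
	return 0;
}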

> 
> >   The field isn't
> > used in pgaio_result_report().  It can't be a particularly rare thing to
> > have
> > struct fields that aren't always used?
> >
> Always considered a risk, someone may start using it.

That makes it worse! E.g. valgrind won't raise errors about it anymore.

Greetings,

Andres Freund



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

Sorry for the slow work on this. The cycle times are humongous due to
valgrind being so slow...


On 2025-04-03 12:40:23 -0700, Noah Misch wrote:
> On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:
> > The best fix for that one would, I think, be to have method_io_uring() iterate
> > over the IOV and mark the relevant regions as defined?  That does fix the
> > issue at least and does seem to make sense?
> 
> Makes sense.  Valgrind knows that read() makes its target bytes "defined".  It
> probably doesn't have an io_uring equivalent for that.

Correct - and I think it would be nontrivial to add, because there's no easy
syscall to intercept...


> I expect we only need this for local buffers, and it's unclear to me how the
> fix for (4a) didn't fix this.

At that time I didn't apply the fix in 4a) to local buffers, because local
buffers, in HEAD, don't have the valgrind integration. Without that, marking
the buffer as NOACCESS would cause all sorts of issues, because it'd be
considered inaccessible even after pinning.  As you analyzed, that then ends
up considered undefined due to the MemoryContextAlloc().


> In the general case, we could want client requests as follows:
> 
> - If completor==definer and has not dropped pin:
>   - Make defined before verifying page.  That's all.  It might be cleaner to
>     do this when first retrieving a return value from io_uring, since this
>     just makes up for what Valgrind already does for readv().

Yea, I think it's better to do that in io_uring. It's what I have done in the
attached.


> - If completor!=definer or has dropped pin:
>   - Make NOACCESS in definer when definer cedes its own pin.

That's the current behaviour for shared buffers, right?


>   - For io_method=worker, make UNDEFINED before starting readv().  It might be
>     cleanest to do this when the worker first acts as the owner of the AIO
>     subsystem pin, if that's a clear moment earlier than readv().

Hm, what do we need this for?


>   - Make DEFINED in completor before verifying page.  It might be cleaner to
>     do this when the completor first retrieves a return value from io_uring,
>     since this just makes up for what Valgrind already does for readv().

I think we can't rely on the marking during retrieving it from io_uring, as
that might have happened in a different backend for a temp buffer. That'd only
happen if we got io_uring events for *another* IO that involved a shared rel,
but it can happen.



> > Not quite sure if we should mark
> > the entire IOV as defined or just the portion that was actually read - the
> > latter is additional fiddly code, and it's not clear it's likely to be helpful?
> 
> Seems fine to do the simpler way if that saves fiddly code.

Can't quite decide, it's just at the border of what I consider too
fiddly... See the change to method_io_uring.c in the attached patch.


> > Questions:
> > 
> > 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't
> >    need to have special-case logic for that. But it also makes the change less
> >    localized and more "impactful", who knows what kind of skullduggery we have
> >    been getting away with unnoticed.
> > 
> >    I haven't written the code up yet, but I don't think it'd be all that much
> >    code to add valgrind support to localbuf.
> 
> It would be the right thing long-term, and it's not a big deal if it causes
> some false positives initially.  So if you're leaning that way, that's good.

It was easy enough.
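
The guts of it mirror what bufmgr.c has done for shared buffers since
1e0dfd1. Roughly, as a sketch (assuming the buffer's storage is already
allocated):

    /* when a local buffer's pin count goes 0 -> 1, e.g. in PinLocalBuffer() */
    VALGRIND_MAKE_MEM_DEFINED(LocalBufHdrGetBlock(buf_hdr), BLCKSZ);

    /* and when it drops back to 0 on unpin */
    VALGRIND_MAKE_MEM_NOACCESS(LocalBufHdrGetBlock(buf_hdr), BLCKSZ);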

I saw one related failure: FlushRelationBuffers() didn't pin temporary buffers
before flushing them. Pinning the buffers fixed that.

I don't think it's a real problem to not pin the local buffer during
FlushRelationBuffers(), at least not today. But it seems unnecessarily odd to
not pin it.
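
The fix is just bracketing the existing write-out with a pin, roughly (a
sketch of the shape, not the exact hunk):

    PinLocalBuffer(bufHdr, false);
    /* ... existing code writing the local buffer out ... */
    UnpinLocalBuffer(BufferDescriptorGetBuffer(bufHdr));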


I wish valgrind had a way to mark the buffer as inaccessible and then
accessible again, without losing the defined-ness information...


Greetings,

Andres Freund


Re: AIO v2.5

From
Noah Misch
Date:
On Fri, Apr 04, 2025 at 03:16:18PM -0400, Andres Freund wrote:
> On 2025-04-03 12:40:23 -0700, Noah Misch wrote:
> > On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:

> > In the general case, we could want client requests as follows:
> > 
> > - If completor==definer and has not dropped pin:
> >   - Make defined before verifying page.  That's all.  It might be cleaner to
> >     do this when first retrieving a return value from io_uring, since this
> >     just makes up for what Valgrind already does for readv().
> 
> Yea, I think it's better to do that in io_uring. It's what I have done in the
> attached.
> 
> 
> > - If completor!=definer or has dropped pin:
> >   - Make NOACCESS in definer when definer cedes its own pin.
> 
> That's the current behaviour for shared buffers, right?

Yes.

> >   - For io_method=worker, make UNDEFINED before starting readv().  It might be
> >     cleanest to do this when the worker first acts as the owner of the AIO
> >     subsystem pin, if that's a clear moment earlier than readv().
> 
> Hm, what do we need this for?

At the time, we likely didn't need it:

- If the worker does its own PinBuffer*()+unpin, we don't need it.  Those
  functions do the Valgrind client requests.
- If the worker relies on the AIO-subsystem-owned pin and does neither regular
  pin nor regular unpin, we don't need it.  Buffers are always "defined".
- If the worker relies on the AIO-subsystem-owned pin to skip PinBuffer*() but
  uses regular unpin code, then the buffer may be NOACCESS.  Then one would
  need this.  But this would be questionable for other reasons.

Your proposed change to set NOACCESS in buffer_readv_complete_one() interacts
with things further, making the UNDEFINED necessary.

> >   - Make DEFINED in completor before verifying page.  It might be cleaner to
> >     do this when the completor first retrieves a return value from io_uring,
> >     since this just makes up for what Valgrind already does for readv().
> 
> I think we can't rely on the marking during retrieving it from io_uring, as
> that might have happened in a different backend for a temp buffer. That'd only
> happen if we got io_uring events for *another* IO that involved a shared rel,
> but it can happen.

Good point.  I think the VALGRIND_MAKE_MEM_DEFINED() in
pgaio_uring_drain_locked() isn't currently needed at all.  If
completor-subxact==definer-subxact, PinBuffer() already did what Valgrind
needs.  Otherwise, buffer_readv_complete_one() does what Valgrind needs.

If that's right, it would still be nice to reach the right
VALGRIND_MAKE_MEM_DEFINED() without involving bufmgr.  That helps future,
non-bufmgr AIO use cases.  It's tricky to pick the right place for that
VALGRIND_MAKE_MEM_DEFINED():

- pgaio_uring_drain_locked() is problematic, I think.  In the localbuf case,
  the iovec base address is relevant only in the ioh-defining process.  In the
  shmem completor!=definer case, this runs only in the completor.

- A complete_local callback solves those problems.  However, if the
  AIO-defining subxact aborted, then we shouldn't set DEFINED at all, since
  the buffer mapping may have changed by the time of complete_local.

- Putting it in the place that would call pgaio_result_report(ERROR) if
  needed, e.g. ProcessReadBuffersResult(), solves the problem of the buffer
  mapping having moved.  ProcessReadBuffersResult() doesn't even need this,
  since PinBuffer() already did it.  Each future AIO use case will have a
  counterpart of ProcessReadBuffersResult() that consumes the result and
  proceeds with tasks that depend on the AIO.  That's the place.

Is that right?  I got this wrong a few times while trying to think through it,
so I'm not too confident in the above.

> > > Not quite sure if we should mark
> > > the entire IOV as defined or just the portion that was actually read - the
> > > latter is additional fiddly code, and it's not clear it's likely to be helpful?
> > 
> > Seems fine to do the simpler way if that saves fiddly code.
> 
> Can't quite decide, it's just at the border of what I consider too
> fiddly... See the change to method_io_uring.c in the attached patch.

It is at the border, as you say, but I'd tend to keep it.


> Subject: [PATCH v1 1/3] localbuf: Add Valgrind buffer access instrumentation

Ready for commit


> Subject: [PATCH v1 2/3] aio: Make AIO compatible with valgrind

See above about pgaio_uring_drain_locked().

> related code until it is pinned bu "user" code again. But it requires some

s/bu/by/

> + * Return the iovecand its length. Currently only expected to be used by

s/iovecand/iovec and/

> @@ -361,13 +405,16 @@ pgaio_uring_drain_locked(PgAioUringContext *context)
>          for (int i = 0; i < ncqes; i++)
>          {
>              struct io_uring_cqe *cqe = cqes[i];
> +            int32        res;
>              PgAioHandle *ioh;
>  
>              ioh = io_uring_cqe_get_data(cqe);
>              errcallback.arg = ioh;
> +            res = cqe->res;
> +
>              io_uring_cqe_seen(&context->io_uring_ring, cqe);
>  
> -            pgaio_io_process_completion(ioh, cqe->res);
> +            pgaio_uring_io_process_completion(ioh, res);

I guess this is a distinct cleanup, done to avoid any suspicion of cqe being
reused asynchronously after io_uring_cqe_seen().  Is that right?


> Subject: [PATCH v1 3/3] aio: Avoid spurious coverity warning

Ready for commit



Re: AIO v2.5

From
Andres Freund
Date:
Hi,

On 2025-04-04 14:18:02 -0700, Noah Misch wrote:
> On Fri, Apr 04, 2025 at 03:16:18PM -0400, Andres Freund wrote:
> > >   - Make DEFINED in completor before verifying page.  It might be cleaner to
> > >     do this when the completor first retrieves a return value from io_uring,
> > >     since this just makes up for what Valgrind already does for readv().
> > 
> > I think we can't rely on the marking during retrieving it from io_uring, as
> > that might have happened in a different backend for a temp buffer. That'd only
> > happen if we got io_uring events for *another* IO that involved a shared rel,
> > but it can happen.
> 
> Good point.  I think the VALGRIND_MAKE_MEM_DEFINED() in
> pgaio_uring_drain_locked() isn't currently needed at all.  If
> completor-subxact==definer-subxact, PinBuffer() already did what Valgrind
> needs.  Otherwise, buffer_readv_complete_one() does what Valgrind needs.

We did need it - but only because I bungled something in the earlier patch to
add valgrind support.  The problem is that in PinLocalBuffer() there may not
actually be any storage allocated for the buffer yet, so
VALGRIND_MAKE_MEM_DEFINED() doesn't work. In the first use of the buffer the
allocation happens a bit later, in GetLocalVictimBuffer(), namely during the
call to GetLocalBufferStorage().

Not quite sure yet how to best deal with it.  Putting the PinLocalBuffer()
slightly later into GetLocalVictimBuffer() fixes the issue, but also doesn't
really seem great.


> If that's right, it would still be nice to reach the right
> VALGRIND_MAKE_MEM_DEFINED() without involving bufmgr.

I think that would be possible if we didn't do VALGRIND_MAKE_MEM_NOACCESS() in
UnpinBuffer()/UnpinLocalBuffer(). But with that I don't see how we can avoid
needing to remark the region as accessible?


> That helps future, non-bufmgr AIO use cases.  It's tricky to pick the right
> place for that VALGRIND_MAKE_MEM_DEFINED():

> - pgaio_uring_drain_locked() is problematic, I think.  In the localbuf case,
>   the iovec base address is relevant only in the ioh-defining process.  In the
>   shmem completor!=definer case, this runs only in the completor.

You're right :(


> - A complete_local callback solves those problems.  However, if the
>   AIO-defining subxact aborted, then we shouldn't set DEFINED at all, since
>   the buffer mapping may have changed by the time of complete_local.

I don't think that is possible, due to the aio subsystem owned pin?


> - Putting it in the place that would call pgaio_result_report(ERROR) if
>   needed, e.g. ProcessReadBuffersResult(), solves the problem of the buffer
>   mapping having moved.  ProcessReadBuffersResult() doesn't even need this,
>   since PinBuffer() already did it.  Each future AIO use case will have a
>   counterpart of ProcessReadBuffersResult() that consumes the result and
>   proceeds with tasks that depend on the AIO.  That's the place.

I don't really follow - at the point something like ProcessReadBuffersResult()
gets involved, we'll already have done the accesses that needed the memory to
be accessible and defined?


I think the point about non-aio uses is a fair one, but I don't quite know how
to best solve it right now, due to the local buffer issue you mentioned. I'd
guess that we'd best put it (sketch below)
a) in pgaio_io_process_completion(), if definer==completor || !PGAIO_HF_REFERENCES_LOCAL
b) just before calling pgaio_io_call_complete_local(), if PGAIO_HF_REFERENCES_LOCAL
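
I.e. something like this (a sketch; the field and helper spellings are
assumptions):

    /* pgaio_io_mark_target_defined() is a made-up helper that would do the
     * iovec VALGRIND_MAKE_MEM_DEFINED() loop */
    if (ioh->owner_procno == MyProcNumber ||
        !(ioh->flags & PGAIO_HF_REFERENCES_LOCAL))
        pgaio_io_mark_target_defined(ioh);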



> > related code until it is pinned bu "user" code again. But it requires some
> 
> s/bu/by/
> 
> > + * Return the iovecand its length. Currently only expected to be used by
> 
> s/iovecand/iovec and/

Fixed.


> > @@ -361,13 +405,16 @@ pgaio_uring_drain_locked(PgAioUringContext *context)
> >          for (int i = 0; i < ncqes; i++)
> >          {
> >              struct io_uring_cqe *cqe = cqes[i];
> > +            int32        res;
> >              PgAioHandle *ioh;
> >  
> >              ioh = io_uring_cqe_get_data(cqe);
> >              errcallback.arg = ioh;
> > +            res = cqe->res;
> > +
> >              io_uring_cqe_seen(&context->io_uring_ring, cqe);
> >  
> > -            pgaio_io_process_completion(ioh, cqe->res);
> > +            pgaio_uring_io_process_completion(ioh, res);
> 
> I guess this is a distinct cleanup, done to avoid any suspicion of cqe being
> reused asynchronously after io_uring_cqe_seen().  Is that right?

I don't think there is any such danger - there's no background thing
processing entries on the ring; if there were, the ring would get corrupted.
But it seemed cleaner to do it that way when I introduced
pgaio_uring_io_process_completion().


> > Subject: [PATCH v1 3/3] aio: Avoid spurious coverity warning
> 
> Ready for commit

Thanks!


Greetings,

Andres Freund