AIO v2.5 - Mailing list pgsql-hackers

From Andres Freund
Subject AIO v2.5
Date
Msg-id vz5i2x2wkjjpp6l2z4g4l3umpxktrxsejgdwefi7n6kzw666h3@rnhlvlpadig7
Whole thread Raw
In response to Re: AIO v2.4  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

Attached is v2.5 of the AIO patchset.

Relative to 2.4 I:

- Committed some earlier commits. I ended up *not* committing the patch to
  create resowners in more backends (e.g. walsender), as that's not really a
  dependency for now.

  One of the more important things to get committed was in a separate thread:
  https://postgr.es/m/b6vveqz6r3wno66rho5lqi6z5kyhfgtvi3jcodyq5rlpp3cu44%40c6dsgf3z7yhs

  Now relpath() can be used for logging while in a critical section. That
  alone allowed to remove most of the remaining FIXMEs.


- Split md.c read/write patches, the write side is more complicated and isn't
  needed before write support arrives (much later in the queue and very likely
  not for 18).

  The complicated bit about write support is needing to
  register_dirty_segment() after completion of the write. If
  RegisterSyncRequest() fails, the IO completer needs to open the file and
  sync itself, unfortunately PathNameOpenFile() allocates memory, which isn't
  ok while in a critical section (even though it'd not be detected, as it's
  using malloc()).


- Reordered patches so that Thomas' read_stream work is after the basic AIO
  infrastructure patches, there's no dependency to the earlier patches

  I think Thomas might have a newer version of some of these, but since
  they're not intended to be committed as part of this, I didn't spend the
  time to rebase to the last version.


- Added a small bit of data that can be provided to callbacks, that makes it a
  lot cleaner to transport information like ZERO_ON_ERROR.

  I also did s/shared_callbacks/callbacks/, as the prior name was outdated.


- Substantially expanded tests, most importantly generic temp file tests and
  AIO specific cross-backend tests

  As part of the expanded tests I also needed to export TerminateBufferIO(),
  like, as previously mentioned, already done in an earlier version for
  StartBufferIO().  Nobody commented on that, so I think that's ok.

  I also renamed the tests away from the very inventively named tbl_a, tbl_b...


- Moved the commit to create resownern in more places to much later in the
  queue, it's not actually needed for bufmgr.c IO, and nothing needing it will
  land in 18


- Added a proper commit message fo the main commit. I'd appreciate folks
  reading through it. I'm sure I forgot a lot of folks and a lot of things.


- Did a fair bit of of comment polishing


- Addressed an XXX in the "aio infrastructure" commit suggesting that we might
  want to error out if a backend is waiting on is own unsubmitted IO. Noah
  argued for erroring out. I now made it so.


- Temporarily added a commit to increase open-file limit on openbsd. I saw
  related errors without this patch too, but it fails more often with. I
  already sent a separate email about this.


At this point I am not aware of anything significant left to do in the main
AIO commit, safe some of the questions below.  There's a lot more potential
optimizations etc, but this is already a very complicated piece of work, so I
think they just has to wait for later.

There are a few things to clean up in the bufmgr.c commits, I don't yet quite
like the function naming and there could be a bit less duplication. But I
don't think that needs to be resolved before the main commit.


Questions:

- My current thinking is that we'd set io_method = worker initially - so we
  actually get some coverage - and then decide whether to switch to
  io_method=sync by default for 18 sometime around beta1/2. Does that sound
  reasonable?


- We could reduce memory usage a tiny bit if we made the mapping between
  pgproc and per-backend-aio-state more complicated, i.e. not just indexed by
  ProcNumber. Right now IO workers have the per-backend AIO state, but don't
  actually need it.  I'm mildly inclined to think that the complexity isn't
  worth it, but on the fence.


- Three of the commits in the series really are just precursor commits to
  their subsequent commits, which I found helpful for development and review,
  namely:

  - aio: Basic subsystem initialization
  - aio: Skeleton IO worker infrastructure
  - aio: Add liburing dependency

  Not sure if it's worth keeping these separate or whether they should just be
  merged with their "real commit".


- Thomas suggested renaming
  COMPLETED_IO->COMPLETED,
  COMPLETED_SHARED->TERMINATED_BY_COMPLETER,
  COMPLETED_SHARED->TERMINATED_BY_SUBMITTER
  in
  https://www.postgresql.org/message-id/CA%2BhUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw%40mail.gmail.com

  While the other things in the email were commented upon by others and
  addressed in v2.4, the naming aspect wasn't further remarked upon by others.
  I'm not personally in love with the suggested names, but I could live with
  them.


- Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
  but all the ereport()s add a noticeable amount of overhead at high IO
  throughput (at multiple gigabytes/second), so that's probably not right
  forever.  I'd leave this on initially and then change it to default to off
  later.  I think that's ok?


- To allow io_workers to be PGC_SIGHUP, and to eventually allow to
  automatically in/decrease active workers, the max number of workers (32) is
  always allocated. That means we use more semaphores than before. I think
  that's ok, it's not 1995 anymore.  Alternatively we can add a
  "io_workers_max" GUC and probe for it in initdb.


- pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
  sure that's great?

  They could be an enum array or such too? That'd perhaps be a bit more
  extensible? OTOH, we don't currently use enums in the catalogs and arrays
  are somewhat annoying to conjure up from C.


Todo:

- A few more passes over the main commit, I'm sure there's a few more inartful
  comments, odd formatting and such.

  - Check if there's a decent way to deduplicate pgaio_io_call_complete_shared() and
    pgaio_io_call_complete_local()


- Figure out how to deduplicate support for LockBufferForCleanup() in
  TerminateBufferIO().


- Documentation for pg_stat_aios.


- Check if documentation for track_io_timing needs to be adjusted, after the
  bufmgr.c changes we only track waiting for an IO.


- Some of the test_aio code is specific to non-temp tables, it probably is
  worth generalizing to deal with temp tables and invoke them for both.

Greetings,

Andres

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Next
From: Andres Freund
Date:
Subject: Re: scalability bottlenecks with (many) partitions (and more)