Thread: Re: AIO v2.0

Re: AIO v2.0

From: Jakub Wartak

On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> Attached is the next version of the patchset. (..)

Hi Andres,

Thank you for your admirable persistence on this. Please do not take the following as criticism; it is just a set of questions about patchset v2.1, which I finally got a little time to play with:

0. Doesn't io_uring_get_sqe() in pgaio_uring_submit() (v2.1-0011-aio-Add-io_uring-method.patch) need a return value check? Otherwise we'll never know that the SQ is full; perhaps at least such a check should be made with an Assert() - see the sketch below. (I understand that right now we only allow up to io_uring_queue_init(io_max_concurrency) entries, but what happens if:
a. a previous io_uring_submit() failed for some reason and we do not have free space in the SQ?
b. (hypothetical) someday someone tries to make PG multithreaded and the code starts using just one big queue, still without checking io_uring_get_sqe()?)
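
A rough sketch of the kind of check I have in mind (liburing's io_uring_get_sqe() returns NULL when the SQ ring is full; the ring variable name and the PANIC severity are just placeholders on my side):

struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(uring_ring);
if (sqe == NULL)
{
    /*
     * Should be unreachable as long as we never stage more IOs than the
     * ring was sized for, but better to fail loudly than to silently
     * lose an IO.
     */
    elog(PANIC, "io_uring submission queue is unexpectedly full");
}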

1. In [0] you wrote that io_uring consumes a high number of FDs (dangerously close to RLIMIT_NOFILE). I can attest that there are many customers using extremely high max_connections (4k-5k, but there are outliers with 10k in the wild too) - so they won't even start - and I have one doubt about the user-friendliness impact of this. I'm quite certain it's going to be the same as with pgbouncer, where one is forced to tweak the OS (systemd/pam/limits.conf/etc.), but PG is in a better position because it tries to preallocate and then close() a lot of FDs, so that's safer at runtime. IMVHO, even if we consume, say, > 30% of the FDs just for io_uring, max_files_per_process loses its spirit a little bit and PG is going to start losing efficiency too, due to frequent open()/close() calls because the fd cache is too small. Tomas also complained about this some time ago in [1].

So maybe it would be good to introduce a couple of sanity checks too (even after setting a higher limit); a rough sketch follows after this list:
- issue a FATAL if io_method = io_uring && max_connections is close to getrlimit(RLIMIT_NOFILE)
- issue a WARNING if io_method = io_uring && we wouldn't have even a real 1k FDs free for handling relation FDs (detect something bad like: getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)
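
Something along these lines, perhaps (just a rough sketch on my end - the thresholds, wording and the place it would be called from are made up, and I'm quoting the MaxConnections / max_files_per_process globals from memory):

#include <sys/resource.h>

static void
check_fd_headroom(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
        return;

    /* io_uring needs roughly one extra FD per backend for its ring */
    if ((rlim_t) MaxConnections >= rlim.rlim_cur)
        ereport(FATAL,
                errmsg("RLIMIT_NOFILE soft limit (%ld) is too low for max_connections (%d) with io_method=io_uring",
                       (long) rlim.rlim_cur, MaxConnections));

    /* keep at least ~1k FDs of headroom for relation files */
    if (rlim.rlim_cur <= (rlim_t) (MaxConnections + max_files_per_process))
        ereport(WARNING,
                errmsg("very little file descriptor headroom is left for relation files with io_method=io_uring"));
}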

2. In pgaio_uring_postmaster_child_init_local() there is "io_uring_queue_init(32, ...)" - why 32? :) And there is also a separate io_uring_queue_init(io_max_concurrency), which seems to be derived from AioChooseMaxConcurrency(), which can go up to 64?

3. I find having two GUCs with nearly the same name (effective_io_concurrency, io_max_concurrency) confusing. It is clear from the io_uring perspective what io_max_concurrency is all about, but I bet also having effective_io_concurrency in the mix is going to be a little confusing for users (well, it is for me). Maybe that README.md could elaborate a little bit on the relation between those two? Or do you plan to remove io_max_concurrency and bind it to effective_io_concurrency in the future? To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too, while the earlier-mentioned AioChooseMaxConcurrency() goes up to just 64.

4. While we are at it, shouldn't the patch rename debug_io_direct to simply io_direct, so that the GUCs are consistent in terms of naming?

5. It appears that pg_stat_io.reads is not refreshed until the query finishes. While running a query for minutes with this patchset, I got:
              now              |  reads   | read_time
-------------------------------+----------+-----------
 2024-11-15 12:09:09.151631+00 | 15004271 |         0
[..]
 2024-11-15 12:10:25.241175+00 | 15004271 |         0
 2024-11-15 12:10:26.241179+00 | 15004271 |         0
 2024-11-15 12:10:27.241139+00 | 18250913 |         0
 
Or is that how it is supposed to work? Also, pg_stat_io.read_time would be somewhat vague with io_uring/worker, so maybe zero is fine here (?). Otherwise we would have to measure the time spent waiting alone, but that would cost more instructions for calculating IO times...

6. After playing with some basic measurements - which went fine - I wanted to test simple PostGIS, even just with sequential scans, to look for compatibility issues (AFAIR Thomas Munro indicated at PGConf.EU that it's a good testing point). But before that I tried to see what TOAST performance alone looks like with AIO+DIO (debug_io_direct=data). One issue I have found is that DIO seems to be unusable until somebody teaches TOAST to use read streams - is that correct? Maybe I'm doing something wrong, but I haven't seen any TOAST <-> read streams topic:

-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
  master    101897.823 ms (01:41.898)
  AIO        99758.399 ms (01:39.758)
  AIO+DIO   191479.079 ms (03:11.479)

The hot path was detoast_attr() -> toast_fetch_datum() -> heap_fetch_toast_slice() -> systable_getnext_ordered() -> index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() -> ReadBufferExtended() -> AIO code.

The difference is that on cold caches DIO gets a 2x slowdown; with a clean s_b and so on:
* getting normal heap data via seqscan: up to 285MB/s
* but TOAST maxes out at 132MB/s when using io_uring+DIO

Not about the patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------

7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints on how to inspect the real I/O calls being issued, to review whether the code is making sensible calls? There's no strace for io_uring - do you stick to DEBUG3, or is using some bpftrace / xfsslower the best way to go?

8. Not sure if it helps, but I've somehow managed to hit the supposedly impossible situation you describe in pgaio_uring_submit() ("ret != num_staged_ios"), though I had to push the urings really hard into using futexes, and I may well have made some coding error for that to occur too [3]. As it stands in the patch from my thread it was not handled: /* FIXME: fix ret != submitted ?! seems like bug?! */ (but I hit that code path pretty often with a 6.10.x kernel).

9. Please let me know what the current line of thinking about this patchset is: is it intended to be committed in v18? As a debug feature or as a non-debug feature? (That is, which of the IO methods should be scrutinized the most, as it is going to be the new default - sync or worker?)

10. At this point, does it even make sense to give pwritev2() with RWF_ATOMIC an experimental try? (That flag is already out in the open, but XFS is apparently only going to cover it with 6.12.x; I could try with some -rcX.)

-J.

p.s. I hope I did not ask stupid questions or miss anything.


Re: AIO v2.0

From: Andres Freund

Hi,

Sorry for losing track of your message for this long; I saw it just now
because I was working on posting a new version.


On 2024-11-18 13:19:58 +0100, Jakub Wartak wrote:
>  On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
> Thank you for your admirable persistence on this. Please do not take the
> following as criticism; it is just a set of questions about patchset v2.1,
> which I finally got a little time to play with:
>
> 0. Doesn't io_uring_get_sqe() in pgaio_uring_submit()
> (v2.1-0011-aio-Add-io_uring-method.patch) need a return value check?

Yea, it shouldn't ever happen, but it's worth adding a check.


> Otherwise we'll never know that the SQ is full; perhaps at least such a
> check should be made with an Assert()? (I understand that right now we only
> allow up to io_uring_queue_init(io_max_concurrency) entries, but what
> happens if:
> a. a previous io_uring_submit() failed for some reason and we do not have
> free space in the SQ?

We'd have PANICed at that failure :)


> b. (hypothetical) someday someone will try to make PG multithreaded and the
> code starts using just one big queue, still without checking for
> io_uring_get_sqe()?

That'd not make sense - you'd still want to use separate rings, to avoid
contention.


> 1. In [0] you wrote that io_uring consumes a high number of FDs (dangerously
> close to RLIMIT_NOFILE). I can attest that there are many customers using
> extremely high max_connections (4k-5k, but there are outliers with 10k in
> the wild too) - so they won't even start - and I have one doubt about the
> user-friendliness impact of this. I'm quite certain it's going to be the
> same as with pgbouncer, where one is forced to tweak the OS
> (systemd/pam/limits.conf/etc.), but PG is in a better position because it
> tries to preallocate and then close() a lot of FDs, so that's safer at
> runtime. IMVHO, even if we consume, say, > 30% of the FDs just for io_uring,
> max_files_per_process loses its spirit a little bit and PG is going to start
> losing efficiency too, due to frequent open()/close() calls because the fd
> cache is too small. Tomas also complained about this some time ago in [1].

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

In most distros the soft ulimit is set to something like 1024, but the hard
limit is much higher.  The reason for that is that some applications try to
close all fds between 0 and RLIMIT_NOFILE - which takes a long time if
RLIMIT_NOFILE is high. By setting only the soft limit to a low value any
application needing higher limits can just opt into using more FDs.

On several of my machines the hard limit is 1073741816.
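
Roughly what I have in mind, as a sketch only (the name is made up, and what the cap should be and where this gets called from are obviously up for discussion):

#include <sys/resource.h>

static void
RaiseOpenFileLimit(rlim_t desired)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
        return;

    /* nothing to do if the soft limit is already high enough */
    if (rlim.rlim_cur >= desired)
        return;

    /* raise the soft limit, but never beyond the hard limit */
    rlim.rlim_cur = desired < rlim.rlim_max ? desired : rlim.rlim_max;
    (void) setrlimit(RLIMIT_NOFILE, &rlim);
}

I.e. only the soft limit changes; the hard limit configured by the DBA / distro stays authoritative.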



> So maybe it would be good to introduce a couple of sanity checks too (even
> after setting a higher limit):
> - issue a FATAL if io_method = io_uring && max_connections is close to
> getrlimit(RLIMIT_NOFILE)
> - issue a WARNING if io_method = io_uring && we wouldn't have even a real
> 1k FDs free for handling relation FDs (detect something bad like:
> getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)

Probably still worth adding something like this, even if we were to do what I
am suggesting above.


> 2. In pgaio_uring_postmaster_child_init_local() there is
> "io_uring_queue_init(32, ...)" - why 32? :) And there is also a separate
> io_uring_queue_init(io_max_concurrency), which seems to be derived from
> AioChooseMaxConcurrency(), which can go up to 64?

Yea, that's probably not right.


> 3. I find having two GUCs with nearly the same name
> (effective_io_concurrency, io_max_concurrency) confusing. It is clear from
> the io_uring perspective what io_max_concurrency is all about, but I bet
> also having effective_io_concurrency in the mix is going to be a little
> confusing for users (well, it is for me). Maybe that README.md could
> elaborate a little bit on the relation between those two? Or do you plan to
> remove io_max_concurrency and bind it to effective_io_concurrency in the
> future?

io_max_concurrency is a hard maximum that needs to be set at server start,
because it requires allocating shared memory. Whereas effective_io_concurrency
can be changed on a per-session and per-tablespace
basis. I.e. io_max_concurrency is a hard upper limit for an entire backend,
whereas effective_io_concurrency controls how much one scan (or whatever does
prefetching) can issue.



> To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too, while
> the earlier-mentioned AioChooseMaxConcurrency() goes up to just 64

Yea, that should probably be disambiguated.


> 4. While we are at it, shouldn't the patch rename debug_io_direct to simply
> io_direct, so that the GUCs are consistent in terms of naming?

I used to have a patch like that in the series and it was a pain to
rebase...

I'm also not sure this is quite enough to make debug_io_direct production
ready, even if just considering io_direct=data. Without streaming read use in
heap + index VACUUM, RelationCopyStorage() and a few other places, the
performance consequences of using direct IO can be, um, surprising.



> 5. It appears that pg_stat_io.reads is not refreshed until the query
> finishes. While running a query for minutes with this patchset, I got:
>               now              |  reads   | read_time
> -------------------------------+----------+-----------
>  2024-11-15 12:09:09.151631+00 | 15004271 |         0
> [..]
>  2024-11-15 12:10:25.241175+00 | 15004271 |         0
>  2024-11-15 12:10:26.241179+00 | 15004271 |         0
>  2024-11-15 12:10:27.241139+00 | 18250913 |         0
>
> Or is that how it is supposed to work?

Currently the patch has a FIXME to add some IO statistics (I think I raised
that somewhere in this thread, too). It's not clear to me what IO time ought
to mean. I suspect the least bad answer is what you suggest:

> Also, pg_stat_io.read_time would be somewhat vague with io_uring/worker, so
> maybe zero is fine here (?). Otherwise we would have to measure the time
> spent waiting alone, but that would cost more instructions for calculating
> IO times...

I.e. we should track the amount of time spent waiting for IOs.

I don't think tracking time in worker or such would make much sense, that'd
often end up with reporting more IO time than a query took.


> 6. After playing with some basic measurements - which went fine - I wanted
> to test simple PostGIS, even just with sequential scans, to look for
> compatibility issues (AFAIR Thomas Munro indicated at PGConf.EU that it's a
> good testing point). But before that I tried to see what TOAST performance
> alone looks like with AIO+DIO (debug_io_direct=data).

It's worth noting that with the last posted version you needed to increase
effective_io_concurrency to something very high to see sensible
performance.

That's due to the way read_stream_begin_impl() limited the number of buffers
pinned to effective_io_concurrency * 4 - which, due to io_combine_limit, ends
up allowing only a single IO in flight in case of sequential blocks until
effective_io_concurrency is set to 8 or such (with the default
io_combine_limit of 16 blocks, effective_io_concurrency = 1 caps the stream at
4 pinned buffers, i.e. not even one full-sized combined IO; only at 8 do the
32 allowed pins cover two 16-block IOs).  I've adjusted that to some degree
now, but I think that might need a bit more sophistication.



> One issue I have found is that DIO seems to be unusable until somebody
> teaches TOAST to use read streams - is that correct? Maybe I'm doing
> something wrong, but I haven't seen any TOAST <-> read streams topic:

Hm, I suspect that a read stream won't help a whole lot in many toast
cases. Unless you have particularly long toast datums, the time is going to be
dominated by the random accesses, as each toast datum is looked up in a
non-predictable way.

Generally, using DIO requires tuning shared buffers much more aggressively
than not using DIO; no amount of stream use will change that. Of course we
should try to reduce that "downside"...


I'm not sure whether the best way to do prefetching of toast chunks would be to
rely on more generalized index->table prefetching support, or to have dedicated
code.


> -- 12MB table , 25GB toast
> create table t (id bigint, t text storage external);
> insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
> from generate_series(1,200000) s(i);
> set max_parallel_workers_per_gather=0;
> \timing
> -- with cold caches: empty s_b, echo 3 > drop_caches
> select sum(length(t)) from t;
>   master    101897.823 ms (01:41.898)
>   AIO        99758.399 ms (01:39.758)
>   AIO+DIO   191479.079 ms (03:11.479)
>
> The hot path was detoast_attr() -> toast_fetch_datum() ->
> heap_fetch_toast_slice() -> systable_getnext_ordered() ->
> index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
> ReadBufferExtended() -> AIO code.
>
> The difference is that on cold caches DIO gets a 2x slowdown; with a clean
> s_b and so on:
> * getting normal heap data via seqscan: up to 285MB/s
> * but TOAST maxes out at 132MB/s when using io_uring+DIO

I started loading the data to try this out myself :).



> Not about the patch itself, but questions about related stack functionality:
> ----------------------------------------------------------------------------------------------------
>
>
> 7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
> on how to inspect the real I/O calls being issued, to review whether the
> code is making sensible calls? There's no strace for io_uring - do you stick
> to DEBUG3, or is using some bpftrace / xfsslower the best way to go?

I think we still want something like it, but I don't think it needs to be in
the initial commits.

There are kernel events that you can track using e.g. perf. Particularly
useful are
io_uring:io_uring_submit_req
io_uring:io_uring_complete



> 8. Not sure if it helps, but I've somehow managed to hit the supposedly
> impossible situation you describe in pgaio_uring_submit() ("ret !=
> num_staged_ios"), though I had to push the urings really hard into using
> futexes, and I may well have made some coding error for that to occur too
> [3]. As it stands in the patch from my thread it was not handled: /* FIXME:
> fix ret != submitted ?! seems like bug?! */ (but I hit that code path pretty
> often with a 6.10.x kernel).

I think you can hit that if you don't take care to limit the number of IOs
being submitted at once or if you're not consuming completions. If the
completion queue is full enough the kernel at some point won't allow more IOs
to be submitted.
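
FWIW, if we wanted to handle that case rather than treat it as can't-happen, it would look roughly like this (illustrative only - the variable names don't match the patch and pgaio_uring_drain_completions() is a made-up placeholder for reaping the CQ):

int remaining = num_staged_ios;

while (remaining > 0)
{
    int ret = io_uring_submit(uring_ring);

    if (ret == -EINTR)
        continue;
    if (ret < 0)
        elog(PANIC, "io_uring_submit() failed: %s", strerror(-ret));

    remaining -= ret;

    if (remaining > 0)
    {
        /*
         * The kernel accepted only part of the batch, most likely because
         * the completion queue is (nearly) full. Reap some completions
         * before retrying; unsubmitted SQEs stay queued in the SQ ring and
         * are picked up by the next io_uring_submit() call.
         */
        pgaio_uring_drain_completions();
    }
}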


> 9. Please let me know what the current line of thinking about this patchset
> is: is it intended to be committed in v18?

I'd love to get some of it into 18.  I don't quite know whether we can make it
happen and to what extent.


> As a debug feature or as a non-debug feature? (That is, which of the IO
> methods should be scrutinized the most, as it is going to be the new
> default - sync or worker?)

I'd say initially worker, with a beta 1 or 2 checklist item to revise it.



> 10. At this point, does it even make sense to give pwritev2() with
> RWF_ATOMIC an experimental try? (That flag is already out in the open, but
> XFS is apparently only going to cover it with 6.12.x; I could try with some
> -rcX.)

I don't think that's worth doing right now. There are too many dependencies and
it's going to be a while till the kernel support for that is widespread enough
to matter.

There's also the issue that, to my knowledge, outside of cloud environments
there's pretty much no hardware that actually reports power-fail atomicity
sizes bigger than a sector.


> p.s. I hope I did not ask stupid questions or miss anything.

You did not!

Greetings,

Andres Freund



Re: AIO v2.0

From: Tom Lane

Andres Freund <andres@anarazel.de> writes:
> My current thoughts around this are that we should generally, independent of
> io_uring, increase the FD limit ourselves.

I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources.  If ENFILE
weren't a thing, it'd be all right, but that is a thing.  We have no
business trying to consume resources the DBA didn't tell us we could.

            regards, tom lane



Re: AIO v2.0

From: Andres Freund

Hi,

On 2024-12-19 17:34:29 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > My current thoughts around this are that we should generally, independent of
> > io_uring, increase the FD limit ourselves.
>
> I'm seriously down on that, because it amounts to an assumption that
> we own the machine and can appropriate all its resources.  If ENFILE
> weren't a thing, it'd be all right, but that is a thing.  We have no
> business trying to consume resources the DBA didn't tell us we could.

Arguably the configuration *did* tell us, by having a higher hard limit...

I'm not saying that we should increase the limit without a bound or without a
configuration option, btw.

As I had mentioned, the problem with relying on increasing the soft limit is
that it's not generally sensible to do so, because it causes a bunch of
binaries to be weirdly slow.

Another reason to not increase the soft rlimit is that doing so can break
programs relying on select().

But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?

Greetings,

Andres Freund



Re: AIO v2.0

From: Jelte Fennema-Nio

On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
> Arguably the configuration *did* tell us, by having a higher hard limit...
> <snip>
> But opting into a higher rlimit, while obviously adhering to the hard limit
> and perhaps some other config knob, seems fine?

Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:

https://0pointer.net/blog/file-descriptor-limits.html



Re: AIO v2.0

From: Andres Freund

Hi,

On 2024-12-20 18:27:13 +0100, Jelte Fennema-Nio wrote:
> On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
> > Arguably the configuration *did* tell us, by having a higher hard limit...
> > <snip>
> > But opting into a higher rlimit, while obviously adhering to the hard limit
> > and perhaps some other config knob, seems fine?
> 
> Yes, totally fine. That's exactly the reasoning why the hard limit is
> so much larger than the soft limit by default on systems with systemd:
> 
> https://0pointer.net/blog/file-descriptor-limits.html

Good link.

This isn't just relevant for using io_uring:

There obviously are several people working on threaded postgres. Even if we
didn't duplicate fd.c file descriptors between threads (we probably will, at
least initially), the client connection FDs alone will mean that we have a lot
more FDs open. Due to the select() issue the soft limit won't be increased
beyond 1024, and requiring everyone to add a 'ulimit -n $somehighnumber' before
starting postgres on Linux doesn't help anyone.

Greetings,

Andres Freund