Thread: Re: RFC: Allow EXPLAIN to Output Page Fault Information
torikoshia <torikoshia@oss.nttdata.com> writes:
> I have attached a PoC patch that modifies EXPLAIN to include page fault
> information during both the planning and execution phases of a query.

Surely these numbers would be too unstable to be worth anything.

			regards, tom lane
On 2024-12-25 00:52, Tom Lane wrote:
> torikoshia <torikoshia@oss.nttdata.com> writes:
>> I have attached a PoC patch that modifies EXPLAIN to include page fault
>> information during both the planning and execution phases of a query.
>
> Surely these numbers would be too unstable to be worth anything.

Thanks for your comment!

Hmm, I didn't know these numbers are unstable. If there are any reasons,
I'd like to know about them.

I would like to explore alternative methods for measuring resource usage,
but I am not aware of other approaches.
(IIUC pg_stat_kcache[1], which is said to provide information about
filesystem-layer I/O usage, also gets its metrics from getrusage(2).)

[1] https://github.com/powa-team/pg_stat_kcache

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
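For reference, the interface being discussed here is getrusage(2): a process
can read its own page-fault and block-I/O counters out of struct rusage. A
minimal standalone example of that call (ordinary POSIX/Linux usage, not code
from the patch):

#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
	{
		perror("getrusage");
		return 1;
	}

	/* page faults: minor (served from memory) vs. major (required disk I/O) */
	printf("minflt=%ld majflt=%ld\n", ru.ru_minflt, ru.ru_majflt);

	/* block input/output performed on behalf of this process */
	printf("inblock=%ld oublock=%ld\n", ru.ru_inblock, ru.ru_oublock);

	return 0;
}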
On Thu, 26 Dec 2024 13:15:59 +0900 torikoshia <torikoshia@oss.nttdata.com> wrote:
> On 2024-12-25 00:52, Tom Lane wrote:
> > torikoshia <torikoshia@oss.nttdata.com> writes:
> >> I have attached a PoC patch that modifies EXPLAIN to include page fault
> >> information during both the planning and execution phases of a query.
> >
> > Surely these numbers would be too unstable to be worth anything.
>
> Thanks for your comment!
>
> Hmm, I didn't know these are unstable. If there are any reasons, I'd
> like to know about them.
>
> I would like to explore alternative methods for measuring resource
> usage, but I am not aware of other approaches.
> (IIUC pg_stat_kcache[1], which is said to provide information about
> filesystem layer I/O usage also gets metrics from getrusage(2))

What I'd really like to see is a column added to pg_stat_database called
blks_read_majflts. It would be great if we could calculate a cache hit
ratio that took OS major page faults into account.

Yes, this could also be done in pg_stat_kcache, but why not do it right in
pg_stat_database? I think it would be pretty widely appreciated and used.

-Jeremy
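The calculation Jeremy describes could then look roughly like this, assuming
a hypothetical blks_read_majflts counter next to the existing
blks_hit/blks_read columns (illustrative only; no such column exists today):

/*
 * Hypothetical: blks_hit and blks_read mirror the pg_stat_database
 * counters, blks_read_majflts is the proposed column.
 */
double
os_aware_hit_ratio(long blks_hit, long blks_read, long blks_read_majflts)
{
	/* reads that did not major-fault were presumably served from OS cache */
	long		os_cache_hits = blks_read - blks_read_majflts;
	long		total = blks_hit + blks_read;

	if (total == 0)
		return 1.0;
	return (double) (blks_hit + os_cache_hits) / total;
}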
On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> torikoshia <torikoshia@oss.nttdata.com> writes:
>> I have attached a PoC patch that modifies EXPLAIN to include page fault
>> information during both the planning and execution phases of a query.
>
> Surely these numbers would be too unstable to be worth anything.

What makes you think that? I'd expect them to be similarly stable to the
numbers we get for BUFFERS. i.e. Sure, they won't be completely stable, but
I expect them to be quite helpful when debugging perf issues, because large
numbers indicate that the query is disk-bound and small numbers indicate
that it is not.

These numbers seem especially useful for setups where shared_buffers is
significantly smaller than the total memory available to the system. In
those cases the output from BUFFERS might give the impression that you're
disk-bound, but if your working set still fits into the OS cache then the
number of page faults is likely still low, thus telling you that the
numbers that you get back from BUFFERS are not as big of a problem as they
might seem.
On Fri, 27 Dec 2024 15:15:40 +0100 "Jelte Fennema-Nio" <postgres@jeltef.nl> wrote:
> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> > torikoshia <torikoshia@oss.nttdata.com> writes:
> >> I have attached a PoC patch that modifies EXPLAIN to include page
> >> fault information during both the planning and execution phases of
> >> a query.
> >
> > Surely these numbers would be too unstable to be worth anything.
>
> What makes you think that? I'd expect them to be similarly stable to
> the numbers we get for BUFFERS. i.e. Sure they won't be completely
> stable, but I expect them to be quite helpful when debugging perf
> issues, because large numbers indicate that the query is disk-bound
> and small numbers indicate that it is not.
>
> These numbers seem especially useful for setups where shared_buffers
> is significantly smaller than the total memory available to the
> system. In those cases the output from BUFFERS might give the
> impression that you're disk-bound, but if your working set still
> fits into OS cache then the number of page faults is likely still
> low. Thus telling you that the numbers that you get back from BUFFERS
> are not as big of a problem as they might seem.

We'd probably need to combine both pg_buffercache_evict() and
/proc/sys/vm/drop_caches to get stable numbers - which is something I have
done in the past for testing.

Another thought would be splitting out the IO timing information into two
values - IO timing for reads that triggered major faults, versus IO timing
for reads that did not.

And system views like pg_stat_database seem worth considering too.

-Jeremy
On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> > torikoshia <torikoshia@oss.nttdata.com> writes:
> > > I have attached a PoC patch that modifies EXPLAIN to include page
> > > fault information during both the planning and execution phases of a
> > > query.
> >
> > Surely these numbers would be too unstable to be worth anything.
>
> What makes you think that? I'd expect them to be similarly stable to the
> numbers we get for BUFFERS. i.e. Sure they won't be completely stable,
> but I expect them to be quite helpful when debugging perf issues,
> because large numbers indicate that the query is disk-bound and small
> numbers indicate that it is not.
>
> These numbers seem especially useful for setups where shared_buffers is
> significantly smaller than the total memory available to the system. In
> those cases the output from BUFFERS might give the impression that
> you're disk-bound, but if your working set still fits into OS cache then
> the number of page faults is likely still low. Thus telling you that the
> numbers that you get back from BUFFERS are not as big of a problem as
> they might seem.

I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.
Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
>> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
>>> torikoshia <torikoshia@oss.nttdata.com> writes:
>>>> I have attached a PoC patch that modifies EXPLAIN to include page
>>>> fault information during both the planning and execution phases of a
>>>> query.

> I certainly would love to see storage I/O numbers as distinct from
> kernel read I/O numbers.

Me too, but I think it is 100% wishful thinking to imagine that page fault
counts match up with that. Maybe there are filesystems where a read that we
request maps one-to-one with a subsequent page fault, but it hardly seems
likely to me that that's universal. Also, you can't tell page faults for
reading program code apart from those for data, and you won't get any
information at all about writes.

			regards, tom lane
On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> I certainly would love to see storage I/O numbers as distinct from
>> kernel read I/O numbers.
>
> Me too, but I think it is 100% wishful thinking to imagine that
> page fault counts match up with that.

Okay I played around with this patch a bit, in hopes of proving you wrong.
But I now agree with you. I cannot seem to get any numbers out of this that
make sense. The major page fault numbers are always zero, even after
running:

echo 1 > /proc/sys/vm/drop_caches

If Torikoshi has a way to get some more useful insights from this patch,
I'm quite interested in the steps he took (I might very well have missed
something obvious).

**However, I think the general direction has merit**: Changing this patch
to use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
is 0 when everything is in page cache, and it is very high when stuff is
not. I was only hacking around and basically did this:

s/ru_minflt/ru_inblock/g
s/ru_majflt/ru_oublock/g

Obviously more is needed. We'd probably want to show these numbers in
useful units like MB or something. Also, maybe there's some better way of
getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).

One other thing that I noticed when playing around with this, which would
need to be addressed: Parallel workers need to pass these values to the
main process somehow, otherwise the IO from those processes gets lost.

For the record, the queries I used to test this patch were:

create table t_big(a int, b text);
insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;
explain (ANALYZE, PAGEFAULTS) select max(a), max(b) from t_big;
explain (analyze, PAGEFAULTS) insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;

And then seeing if there was any difference in the explain analyze output
after running the following (as root):

echo 1 > /proc/sys/vm/drop_caches
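The behavior described above is easy to reproduce outside PostgreSQL: a small
standalone program can read a file twice and print the ru_inblock delta for
each pass. After dropping the OS caches, the first pass shows a large delta
and the second pass roughly zero (the file path is just an example argument;
this is an illustration, not part of the patch):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

static long
inblock_now(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
	{
		perror("getrusage");
		exit(1);
	}
	return ru.ru_inblock;
}

static void
read_whole_file(const char *path)
{
	char		buf[1 << 16];
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		perror("open");
		exit(1);
	}
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
}

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "/tmp/testfile";
	long		before;

	before = inblock_now();
	read_whole_file(path);		/* cold pass, if caches were dropped */
	printf("first read:  inblock delta = %ld\n", inblock_now() - before);

	before = inblock_now();
	read_whole_file(path);		/* warm pass: data is now in the page cache */
	printf("second read: inblock delta = %ld\n", inblock_now() - before);

	return 0;
}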
Hi,

On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote:
>
> > **However, I think the general direction has merit**: Changing this
> > patch to
> > use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> > is 0 when everything is in page cache, and it is very high when stuff is
> > not. I was only hacking around and basically did this:
> >
> > s/ru_minflt/ru_inblock/g
> > s/ru_majflt/ru_oublock/g
> > [...]
> > Also, maybe there's some better way
> > of getting read/write numbers for the current process than
> > ru_inblock/ru_oublock (but this one seems to work at least reasonably
> > well).
>
> Maybe, but as far as using getrusage(), ru_inblock and ru_oublock seem the
> best.

FWIW those are the counters we've been using in pg_stat_kcache / powa to
compute disk IO and the "postgres shared buffers hit VS OS cache hit VS
disk read" hit ratio for many years, and we didn't get any problem report
for that.
On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
> On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
> Updated the PoC patch to calculate them by KB:
>
> =# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
>                            QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on pgbench_accounts  (cost=0.00..263935.35 rows=10000035
> width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
>    Buffers: shared hit=2587 read=161348
>  Planning Time: 0.367 ms
>  Execution:
>    Storage I/O: read=1291856 KB write=0 KB
>  Execution Time: 4353.253 ms
> (6 rows)
>

The core functionality works well in my opinion. I think it makes sense to
spend the effort to move this from PoC quality to something committable.
Below are some of the things that are necessary to do that, after an
initial pass over the code (and trying it out):

>> One other thing that I noticed when playing around with this, which
>> would need to be addressed: Parallel workers need to pass these values
>> to the main process somehow, otherwise the IO from those processes gets
>> lost.
>
> Yes.
> I haven't developed it yet but I believe we can pass them like
> buffer/WAL usage.

1. Yeah, makes sense to do this the same way as we do for buffers. Let's
do this now.

> +	if (es->storageio)
> +	{
> +		getrusage(RUSAGE_SELF, &rusage);
> +
> +		storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
> +		storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
> +
> +		if (es->format == EXPLAIN_FORMAT_TEXT)
> +		{
> +			ExplainIndentText(es);
> +			appendStringInfoString(es->str, "Execution:\n");
> +			es->indent++;
> +		}
> +		show_storageio(es, &storageio);
> +
> +		if (es->format == EXPLAIN_FORMAT_TEXT)
> +			es->indent--;
> +		ExplainCloseGroup("Execution", "Execution", true, es);
> +	}

2. The current code always shows "Execution: " in the explain analyze
output, even if no storageio is done. I think this should use
peek_storageio() to check if any storageio was done and only show the
"Execution: " line if that is the case.

3. FORMAT JSON seems to be broken with this patch.
With the following simple query:

explain (ANALYZE, STORAGEIO, FORMAT JSON) select max(a), max(b) from t_big;

I get this assert failure:

TRAP: failed Assert("es->indent == 0"), File: "../src/backend/commands/explain.c", Line: 375, PID: 199034
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExceptionalCondition+0x74)[0x5ad72872b464]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExplainQuery+0x75b)[0x5ad7283c87bb]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(standard_ProcessUtility+0x595)[0x5ad7285e97f5]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4daadf)[0x5ad7285e7adf]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4dafc4)[0x5ad7285e7fc4]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PortalRun+0x32d)[0x5ad7285e834d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4d70a2)[0x5ad7285e40a2]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostgresMain+0x16e9)[0x5ad7285e5b39]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(BackendMain+0x5f)[0x5ad7285e02df]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(postmaster_child_launch+0xe1)[0x5ad72853cde1]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x433758)[0x5ad728540758]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostmasterMain+0xddd)[0x5ad72854223d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(main+0x1d0)[0x5ad72828b600]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x714aa222a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x714aa222a28b]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(_start+0x25)[0x5ad72828bc05]

FORMAT JSON should obviously not crash the server, but apart from that it
should also actually show this new data in the json output.

4. I think this setting should be enabled by default for ANALYZE, just
like BUFFERS is now since c2a4078e[1].

5. I'm wondering whether this option deserves its own EXPLAIN option, or
if it should simply be made part of BUFFERS.

6. Windows compilation is currently failing on the CFbot. Looking at the
output, that's because rusage does not contain these fields there. I think
you'll need some #ifdefs.

7. The result of getrusage() should be checked for errors and we should
report the error (even though it's very unexpected to ever fail).

8. This needs docs.

> +		appendStringInfo(es->str, " read=%ld KB", (long) usage->inblock / 2);
> +		appendStringInfo(es->str, " write=%ld KB", (long) usage->outblock / 2);

9. I think this division by 2 could use some explanation in a comment. I
understand that you're doing this because linux divides its original
byte counts by 512[2] and your additional factor of 2 gets that to
1024 bytes. But that's not clear immediately from the code.
I'm also not convinced that 512 is the blocksize, or if this logic is
even correct on every platform. I'm wondering if maybe we should
simply show the blocks after all.

[1]: https://github.com/postgres/postgres/commit/c2a4078ebad71999dd451ae7d4358be3c9290b07
[2]: https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/task_io_accounting_ops.h#L16-L23
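Regarding point 9: the kernel header linked in [2] accounts I/O in 512-byte
units, which is where the factor of 2 comes from. A comment plus helper along
these lines would make the conversion explicit (a sketch only, not the actual
patch code, and the unit is Linux-specific):

/*
 * On Linux, ru_inblock/ru_oublock are reported in 512-byte units, so two of
 * them make one kilobyte.  Other platforms may use a different unit, which
 * is an argument for simply printing raw block counts instead.
 */
#define RUSAGE_BLOCK_SIZE 512	/* bytes per ru_inblock/ru_oublock unit */

static long
rusage_blocks_to_kb(long blocks)
{
	return blocks * RUSAGE_BLOCK_SIZE / 1024;	/* equivalent to blocks / 2 */
}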
On Tue, Jan 7, 2025 at 5:09 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
>> On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
>> Updated the PoC patch to calculate them by KB:
>>
>> =# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
>>                            QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------------------------------
>>  Seq Scan on pgbench_accounts  (cost=0.00..263935.35 rows=10000035
>> width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
>>    Buffers: shared hit=2587 read=161348
>>  Planning Time: 0.367 ms
>>  Execution:
>>    Storage I/O: read=1291856 KB write=0 KB
>>  Execution Time: 4353.253 ms
>> (6 rows)
>>
>
> The core functionality works well in my opinion. I think it makes sense
> to spend the effort to move this from PoC quality to something
> committable.
Thanks for the comment and review!
If there are no other opinions, I will make a patch based on the direction of the current PoC patch.
> 4. I think this setting should be enabled by default for ANALYZE, just
> like BUFFERS is now since c2a4078e[1].
>
> 5. I'm wondering whether this option deserves its own EXPLAIN option, or
> if it should simply be made part of BUFFERS.
Although this is not information about PostgreSQL buffers, I now feel that
making this addition part of BUFFERS is attractive, since people using
BUFFERS would also sometimes want to know about the storage I/O.

Since BUFFERS is now ON by default with EXPLAIN ANALYZE, I am concerned
about the performance impact. However, if getrusage() is called only twice
(once at the start and once at the end) for the planning phase, the
execution phase, and each parallel worker, I believe the impact is
relatively small (see the sketch after this message).
> 9. I think this division by 2 could use some explanation in a comment. I
> understand that you're doing this because linux divides its original
> byte counts by 512[2] and your additional factor of 2 gets that to
> 1024 bytes. But that's not clear immediately from the code.
> I'm also not convinced that 512 is the blocksize, or if this logic is
> even correct on every platform. I'm wondering if maybe we should
> simply show the blocks after all.
Maybe so. I'll look into this and then decide the unit.
For the other comments, I plan to address them as you suggested.
Regards,
Atsushi Torikoshi
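Below is a minimal sketch of the start/end pattern discussed above, with a
hypothetical StorageIoUsage counter that a parallel worker could hand back to
the leader in the same way buffer/WAL usage is accumulated today. All names
here are made up for illustration; they are not from the patch:

#include <sys/resource.h>

/* Hypothetical accounting struct; not part of the actual patch. */
typedef struct StorageIoUsage
{
	long		inblock;		/* 512-byte units read, per getrusage() */
	long		outblock;		/* 512-byte units written */
} StorageIoUsage;

/* Snapshot the current process counters, e.g. at the start of a phase. */
static void
StorageIoUsageStart(StorageIoUsage *start)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	start->inblock = ru.ru_inblock;
	start->outblock = ru.ru_oublock;
}

/*
 * Add the delta since 'start' into 'dst'.  A parallel worker would instead
 * copy its delta into shared memory for the leader to sum up, as is done
 * for buffer/WAL usage.
 */
static void
StorageIoUsageAccumDiff(StorageIoUsage *dst, const StorageIoUsage *start)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	dst->inblock += ru.ru_inblock - start->inblock;
	dst->outblock += ru.ru_oublock - start->outblock;
}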
On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I'll play around with it a bit next week.

Okay, I played around with it and couldn't find any issues. I marked the
patch as "ready for committer" in the commitfest app[1], given that all
feedback in my previous email was very minor.

[1]: https://commitfest.postgresql.org/52/5526/
Hi,

On 2025-02-09 12:51:40 +0100, Jelte Fennema-Nio wrote:
> On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > I'll play around with it a bit next week.
>
> Okay, I played around with it and couldn't find any issues. I marked
> the patch as "ready for committer" in the commitfest app[1], given
> that all feedback in my previous email was very minor.

I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...

Greetings,

Andres
Andres Freund <andres@anarazel.de> writes:
> I'm somewhat against this patch, as it's fairly fundamentally incompatible
> with AIO. There's no real way to get information in this manner if the IO
> isn't executed synchronously in process context...

Even without looking ahead to AIO, there's bgwriter, walwriter, and
checkpointer processes that all take I/O load away from foreground
processes. I don't really believe that this will produce useful numbers.

			regards, tom lane
On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> I think it'll always be a subset of use. It doesn't make sense to use DIO
> for small databases or untuned databases. Or a system that's deliberately
> overcommitted.

Thanks, that's useful context.

> But this will also not work with AIO w/ Buffered IO. Which we hope to use
> much more commonly.

To be clear, here you mean worker based AIO right? Because it would work
with io_uring based AIO, right?

> If suddenly I have to reimplement something like this to work with worker
> based IO, it'll certainly take longer to get to AIO.

I totally understand. But in my opinion it would be completely fine to
decide that these new IO stats are simply not available for worker based
IO. Just like they're not available for Windows either with this patch. I
think it would be a shame to make perfect be the enemy of good here (as
often seems to happen with PG patches). I'd rather have this feature for
some setups, than for no setups at all.
Hi,

On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> > But this will also not work with AIO w/ Buffered IO. Which we hope to
> > use much more commonly.
>
> To be clear, here you mean worker based AIO right? Because it would
> work with io_uring based AIO, right?

I mostly meant worker based AIO, yes. I haven't checked how accurately
these are kept for io_uring. I would hope they are...

> > If suddenly I have to reimplement something like this to work with
> > worker based IO, it'll certainly take longer to get to AIO.
>
> I totally understand. But in my opinion it would be completely fine to
> decide that these new IO stats are simply not available for worker
> based IO. Just like they're not available for Windows either with this
> patch.

The thing is that you'd often get completely misleading stats. Some of the
IO will still be done by the backend itself, so there will be a non-zero
value. But it will be a significant undercount, because the asynchronously
executed IO won't be tracked (if worker mode is used).

Greetings,

Andres Freund
Hi,

On 2025-02-10 18:30:56 -0500, Andres Freund wrote:
> On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> > On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> > > But this will also not work with AIO w/ Buffered IO. Which we hope to
> > > use much more commonly.
> >
> > To be clear, here you mean worker based AIO right? Because it would
> > work with io_uring based AIO, right?
>
> I mostly meant worker based AIO, yes. I haven't checked how accurately
> these are kept for io_uring. I would hope they are...

It does look like it is tracked.

> > > If suddenly I have to reimplement something like this to work with
> > > worker based IO, it'll certainly take longer to get to AIO.
> >
> > I totally understand. But in my opinion it would be completely fine to
> > decide that these new IO stats are simply not available for worker
> > based IO. Just like they're not available for Windows either with this
> > patch.
>
> The thing is that you'd often get completely misleading stats. Some of the
> IO will still be done by the backend itself, so there will be a non-zero
> value. But it will be a significant undercount, because the asynchronously
> executed IO won't be tracked (if worker mode is used).

<clear cache>

postgres[985394][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ worker    │
└───────────┘
(1 row)

postgres[985394][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                       │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=17906.00..17906.01 rows=1 width=8) (actual time=199.494..199.494 rows=1 loops=1) │
│   Buffers: shared read=5406 │
│   I/O Timings: shared read=57.906 │
│   ->  Seq Scan on manyrows  (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.380..140.671 rows=1000000 loops=1) │
│         Buffers: shared read=5406 │
│         I/O Timings: shared read=57.906 │
│ Planning: │
│   Buffers: shared hit=41 read=12 │
│   Storage I/O: read=192 times write=0 times │
│ Planning Time: 1.869 ms │
│ Execution Time: 199.554 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

<clear cache>

postgres[1014152][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ io_uring  │
└───────────┘
(1 row)

postgres[1014152][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                      │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=17906.00..17906.01 rows=1 width=8) (actual time=111.591..111.593 rows=1 loops=1) │
│   Buffers: shared read=5406 │
│   I/O Timings: shared read=14.342 │
│   ->  Seq Scan on manyrows  (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.161..70.843 rows=1000000 loops=1) │
│         Buffers: shared read=5406 │
│         I/O Timings: shared read=14.342 │
│ Planning: │
│   Buffers: shared hit=41 read=12 │
│   Storage I/O: read=192 times write=0 times │
│ Planning Time: 1.768 ms │
│ Execution: │
│   Storage I/O: read=86496 times write=0 times │
│ Execution Time: 111.670 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Independent of this, it's probably not good that we're tracking shared
buffer hits after io combining, if I interpret this correctly... That looks
to be an issue in master, not just the AIO branch.

Greetings,

Andres Freund
On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres@anarazel.de> wrote:
> > I mostly meant worker based AIO, yes. I haven't checked how accurately
> > these are kept for io_uring. I would hope they are...
>
> It does look like it is tracked.

nice!

> > The thing is that you'd often get completely misleading stats. Some of
> > the IO will still be done by the backend itself, so there will be a
> > non-zero value. But it will be a significant undercount, because the
> > asynchronously executed IO won't be tracked (if worker mode is used).

Yeah, makes sense. Like I said, I would be completely fine with not showing
these numbers at all/setting them to 0 for setups where we cannot easily
get useful numbers (and this bgworker AIO would be one of those setups).

> Independent of this, it's probably not good that we're tracking shared
> buffer hits after io combining, if I interpret this correctly... That
> looks to be an issue in master, not just the AIO branch.

You mean that e.g. a combined IO for 20 blocks still counts only as 1
"shared read"? Yeah, that sounds like a bug.
Hi,

On 2025-02-11 09:59:43 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres@anarazel.de> wrote:
> > > The thing is that you'd often get completely misleading stats. Some of
> > > the IO will still be done by the backend itself, so there will be a
> > > non-zero value. But it will be a significant undercount, because the
> > > asynchronously executed IO won't be tracked (if worker mode is used).
>
> Yeah, makes sense. Like I said, I would be completely fine with not
> showing these numbers at all/setting them to 0 for setups where we
> cannot easily get useful numbers (and this bgworker AIO would be one
> of those setups).

Shrug. It means that it'll not work in what I hope will be the default
mechanism before long. I just can't get excited for that. In all likelihood
it'll result in bug reports that I'll then be on the hook to fix.

> > Independent of this, it's probably not good that we're tracking shared
> > buffer hits after io combining, if I interpret this correctly... That
> > looks to be an issue in master, not just the AIO branch.
>
> You mean that e.g. a combined IO for 20 blocks still counts only as 1
> "shared read"? Yeah, that sounds like a bug.

Yep.

Greetings,

Andres Freund
On Tue, 11 Feb 2025 at 16:36, Andres Freund <andres@anarazel.de> wrote:
> Shrug. It means that it'll not work in what I hope will be the default
> mechanism before long. I just can't get excited for that. In all
> likelihood it'll result in bug reports that I'll then be on the hook to
> fix.

My assumption was that io_uring would become the default mechanism on
Linux. And that bgworker based AIO would generally only be used by
platforms without similar functionality. Is that assumption incorrect?
Hi,

On 2025-02-11 17:00:36 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 16:36, Andres Freund <andres@anarazel.de> wrote:
> > Shrug. It means that it'll not work in what I hope will be the default
> > mechanism before long. I just can't get excited for that. In all
> > likelihood it'll result in bug reports that I'll then be on the hook to
> > fix.
>
> My assumption was that io_uring would become the default mechanism on
> Linux. And that bgworker based AIO would generally only be used by
> platforms without similar functionality. Is that assumption incorrect?

Yes, at least initially:

1) it's not enabled on the kernel level everywhere
2) it requires an optional build dependency
3) it requires tuning the file descriptor ulimit, unless we can convince
   Tom that it's ok to do that ourselves

Greetings,

Andres Freund
On Tue, 11 Feb 2025 at 17:19, Andres Freund <andres@anarazel.de> wrote:
> Yes, at least initially:

Ah, then I understand your point of view much better. Still I think we
could easily frame it as: If you enable io_uring, you also get these
additional fancy stats. Also afaict the items don't have to mean that

> 1) it's not enabled on the kernel level everywhere

Is that really a common thing to do? From a quick internet search, I can
"only" find that Google does this. (Google is definitely a big cloud
player, so I don't want to suggest that that is not important, but if
that's really the only one still the bulk of systems would have io_uring
support)

> 2) it requires an optional build dependency

What build dependency is this? In any case, can't we choose the default at
build time based on the available build dependencies? And if we cannot, I
think we could always add an "auto" default that would mean the best
available AIO implementation (where io_uring is better than bgworkers).

> 3) it requires tuning the file descriptor ulimit, unless we can convince
> Tom that it's ok to do that ourselves

I think we should just do this, given the reasoning in the blog[1] from
the systemd author I linked in the AIO thread. I agree that a
response/explicit approval from Tom would be nice though.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
Hi,

On 2025-02-11 18:45:13 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 17:19, Andres Freund <andres@anarazel.de> wrote:
> > Yes, at least initially:
>
> Ah, then I understand your point of view much better. Still I think we
> could easily frame it as: If you enable io_uring, you also get these
> additional fancy stats.
> Also afaict the items don't have to mean that
>
> > 1) it's not enabled on the kernel level everywhere
>
> Is that really a common thing to do? From a quick internet search, I
> can "only" find that Google does this. (Google is definitely a big
> cloud player, so I don't want to suggest that that is not important,
> but if that's really the only one still the bulk of systems would have
> io_uring support)

RHEL had it disabled for quite a while, not sure if that's still the case.

> > 2) it requires an optional build dependency
>
> What build dependency is this?

Liburing.

> In any case, can't we choose the default at build time based on the
> available build dependencies? And if we cannot, I think we could always
> add an "auto" default that would mean the best available AIO
> implementation (where io_uring is better than bgworkers).

We could, but because of 3) I don't want to do that right now.

> > 3) it requires tuning the file descriptor ulimit, unless we can convince
> > Tom that it's ok to do that ourselves
>
> I think we should just do this, given the reasoning in the blog[1]
> from the systemd author I linked in the AIO thread. I agree that a
> response/explicit approval from Tom would be nice though.

I think it's the right path, but that's a fight to fight after AIO has
been merged, not before.

Greetings,

Andres Freund