Thread: Re: RFC: Allow EXPLAIN to Output Page Fault Information

Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Tom Lane
Date:
torikoshia <torikoshia@oss.nttdata.com> writes:
> I have attached a PoC patch that modifies EXPLAIN to include page fault 
> information during both the planning and execution phases of a query. 

Surely these numbers would be too unstable to be worth anything.

            regards, tom lane



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
torikoshia
Date:
On 2024-12-25 00:52, Tom Lane wrote:
> torikoshia <torikoshia@oss.nttdata.com> writes:
>> I have attached a PoC patch that modifies EXPLAIN to include page 
>> fault
>> information during both the planning and execution phases of a query.
> 
> Surely these numbers would be too unstable to be worth anything.

Thanks for your comment!

Hmm, I wasn't aware these numbers would be unstable. If there are
reasons for that, I'd like to know about them.

I would like to explore alternative methods for measuring resource
usage, but I am not aware of any other approaches.
(IIUC pg_stat_kcache[1], which is said to provide information about
filesystem-layer I/O usage, also gets its metrics from getrusage(2).)


[1] https://github.com/powa-team/pg_stat_kcache

--
Regards,
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Jeremy Schneider
Date:
On Thu, 26 Dec 2024 13:15:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:

> On 2024-12-25 00:52, Tom Lane wrote:
> > torikoshia <torikoshia@oss.nttdata.com> writes:  
> >> I have attached a PoC patch that modifies EXPLAIN to include page 
> >> fault
> >> information during both the planning and execution phases of a
> >> query.  
> > 
> > Surely these numbers would be too unstable to be worth anything.  
> 
> Thanks for your comment!
> 
> Hmm, I didn't know these are unstable. If there are any reasons, I'd 
> like to know about them.
> 
> I would like to explore alternative methods for measuring resource 
> usage, but
> I am not aware of other approaches.
> (IIUC pg_stat_kcache[1], which is said to provide information about 
> filesystem layer I/O usage also gets metrics from getrusage(2))

What I'd really like to see is a column added to pg_stat_database
called blks_read_majflts.

It would be great if we could calculate a cache hit ratio that took OS
major page faults into account.

Yes, this could also be done in pg_stat_kcache, but why not do it right
in pg_stat_database? I think it would be pretty widely appreciated and
used.

-Jeremy




Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
"Jelte Fennema-Nio"
Date:
On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> torikoshia <torikoshia@oss.nttdata.com> writes:
>> I have attached a PoC patch that modifies EXPLAIN to include page fault
>> information during both the planning and execution phases of a query.
>
> Surely these numbers would be too unstable to be worth anything.

What makes you think that? I'd expect them to be similarly stable to
the numbers we get for BUFFERS. That is, sure, they won't be completely
stable, but I expect them to be quite helpful when debugging perf
issues, because large numbers indicate that the query is disk-bound and
small numbers indicate that it is not.

These numbers seem especially useful for setups where shared_buffers is
significantly smaller than the total memory available to the system. In
those cases the output from BUFFERS might give the impression that
you're disk-bound, but if your working set still fits into the OS cache
then the number of page faults is likely still low, telling you that
the numbers you get back from BUFFERS are not as big of a problem as
they might seem.



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Jeremy Schneider
Date:
On Fri, 27 Dec 2024 15:15:40 +0100
"Jelte Fennema-Nio" <postgres@jeltef.nl> wrote:

> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> > torikoshia <torikoshia@oss.nttdata.com> writes:  
> >> I have attached a PoC patch that modifies EXPLAIN to include page
> >> fault information during both the planning and execution phases of
> >> a query.   
> >
> > Surely these numbers would be too unstable to be worth anything.  
> 
> What makes you think that? I'd expect them to be similarly stable to
> the numbers we get for BUFFERS. i.e. Sure they won't be completely
> stable, but I expect them to be quite helpful when debugging perf
> issues, because large numbers indicate that the query is disk-bound
> and small numbers indicate that it is not.
> 
> These numbers seem especially useful for setups where shared_buffers
> is significantly smaller than the total memory available to the
> system. In those cases the output from BUFFERS might give the
> impression that you're disk-bound, but if your working set still
> fits into OS cache then the number of page faults is likely still
> low. Thus telling you that the numbers that you get back from BUFFERS
> are not as big of a problem as they might seem.

We'd probably need to combine both pg_buffercache_evict() and
/proc/sys/vm/drop_caches to get stable numbers - which is something I
have done in the past for testing.

Another thought would be splitting out the IO timing information into
two values - IO timing for reads that triggered major faults, versus
IO timing for reads that did not.

And system views like pg_stat_database seem worth considering too.

-Jeremy



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Bruce Momjian
Date:
On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> > torikoshia <torikoshia@oss.nttdata.com> writes:
> > > I have attached a PoC patch that modifies EXPLAIN to include page
> > > fault information during both the planning and execution phases of a
> > > query.
> > 
> > Surely these numbers would be too unstable to be worth anything.
> 
> What makes you think that? I'd expect them to be similarly stable to the
> numbers we get for BUFFERS. i.e. Sure they won't be completely stable,
> but I expect them to be quite helpful when debugging perf issues,
> because large numbers indicate that the query is disk-bound and small
> numbers indicate that it is not.
> 
> These numbers seem especially useful for setups where shared_buffers is
> significantly smaller than the total memory available to the system. In
> those cases the output from BUFFERS might give the impression that
> you're disk-bound, but if your working set still fits into OS cache then
> the number of page faults is likely still low. Thus telling you that the
> numbers that you get back from BUFFERS are not as big of a problem as
> they might seem.

I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.





Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
>> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
>>> torikoshia <torikoshia@oss.nttdata.com> writes:
>>>> I have attached a PoC patch that modifies EXPLAIN to include page
>>>> fault information during both the planning and execution phases of a
>>>> query.

> I certainly would love to see storage I/O numbers as distinct from
> kernel read I/O numbers.

Me too, but I think it is 100% wishful thinking to imagine that
page fault counts match up with that.  Maybe there are filesystems
where a read that we request maps one-to-one with a subsequent
page fault, but it hardly seems likely to me that that's
universal.  Also, you can't tell page faults for reading program
code apart from those for data, and you won't get any information
at all about writes.

            regards, tom lane



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
"Jelte Fennema-Nio"
Date:
On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> I certainly would love to see storage I/O numbers as distinct from
>> kernel read I/O numbers.
>
> Me too, but I think it is 100% wishful thinking to imagine that
> page fault counts match up with that.

Okay I played around with this patch a bit, in hopes of proving you
wrong. But I now agree with you. I cannot seem to get any numbers out of
this that make sense.

The major page fault numbers are always zero, even after running:

echo 1 > /proc/sys/vm/drop_caches

If Torikoshi-san has a way to get some more useful insights from this
patch, I'm quite interested in the steps he took (I might very well have
missed something obvious).

**However, I think the general direction has merit**: Changing this patch to
use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
is 0 when everything is in page cache, and it is very high when stuff is
not. I was only hacking around and basically did this:

s/ru_minflt/ru_inblock/g
s/ru_majflt/ru_oublock/g

Obviously more is needed. We'd probably want to show these numbers in
useful units like MB or something. Also, maybe there's some better way
of getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).

One other thing that I noticed when playing around with this, which
would need to be addressed: Parallel workers need to pass these values
to the main process somehow, otherwise the IO from those processes gets lost.

For the record, the queries I used to test this patch were:

create table t_big(a int, b text);
insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;
explain (ANALYZE, PAGEFAULTS) select max(a), max(b) from t_big;
explain (analyze, PAGEFAULTS) insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;

And then seeing if there was any difference in the explain analyze
output after running the following (as root):

echo 1 > /proc/sys/vm/drop_caches



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Julien Rouhaud
Date:
Hi,

On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote:
>
> > **However, I think the general direction has merit**: Changing this
> > patch to
> > use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> > is 0 when everything is in page cache, and it is very high when stuff is
> > not. I was only hacking around and basically did this:
> >
> > s/ru_minflt/ru_inblock/g
> > s/ru_majflt/ru_oublock/g
> > [...]
> >  Also, maybe there's some better way
> > of getting read/write numbers for the current process than
> > ru_inblock/ru_oublock (but this one seems to work at least reasonably
> > well).
>
> Maybe, but as far as using getrusage() goes, ru_inblock and ru_oublock seem
> the best.

FWIW those are the counters we've been using in pg_stat_kcache / powa for
many years to compute disk I/O and the "postgres shared buffers hit vs OS
cache hit vs disk read" hit ratio, and we haven't received any problem
reports about that.



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
"Jelte Fennema-Nio"
Date:
On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
> On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
> Updated the PoC patch to calculate them by KB:
>
>    =# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
>                                                              QUERY PLAN
>    ---------------------------------------------------------------------------------------------------------------------------------
>     Seq Scan on pgbench_accounts  (cost=0.00..263935.35 rows=10000035 width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
>       Buffers: shared hit=2587 read=161348
>     Planning Time: 0.367 ms
>     Execution:
>       Storage I/O: read=1291856 KB write=0 KB
>     Execution Time: 4353.253 ms
>    (6 rows)

The core functionality works well in my opinion. I think it makes sense
to spend the effort to move this from PoC quality to something
committable. Below are some of the things necessary to do that, based on
an initial pass over the code (and trying it out):

>> One other thing that I noticed when playing around with this, which
>> would need to be addressed: Parallel workers need to pass these values
>> to the main process somehow, otherwise the IO from those processes gets
>> lost.
>
> Yes.
> I haven't developed it yet but I believe we can pass them like
> buffer/WAL usage.

1. Yeah, makes sense to do this the same way as we do for buffers. Let's do
   this now.

> +        if (es->storageio)
> +        {
> +                getrusage(RUSAGE_SELF, &rusage);
> +
> +                storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
> +                storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
> +
> +                if (es->format == EXPLAIN_FORMAT_TEXT)
> +                {
> +                        ExplainIndentText(es);
> +                        appendStringInfoString(es->str, "Execution:\n");
> +                        es->indent++;
> +                }
> +                show_storageio(es, &storageio);
> +
> +                if (es->format == EXPLAIN_FORMAT_TEXT)
> +                        es->indent--;
> +                ExplainCloseGroup("Execution", "Execution", true, es);
> +        }

2. The current code always shows "Execution: " in the explain analyze
   output, even if no storageio is done. I think this should use
   peek_storageio() to check if any storageio was done and only show the
   "Execution: " line if that is the case.


3. FORMAT JSON seems to be broken with this patch. With the following
   simple query:

explain (ANALYZE, STORAGEIO, FORMAT JSON) select max(a), max(b) from t_big;

   I get this assert failure:

TRAP: failed Assert("es->indent == 0"), File: "../src/backend/commands/explain.c", Line: 375, PID: 199034
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExceptionalCondition+0x74)[0x5ad72872b464]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExplainQuery+0x75b)[0x5ad7283c87bb]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(standard_ProcessUtility+0x595)[0x5ad7285e97f5]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4daadf)[0x5ad7285e7adf]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4dafc4)[0x5ad7285e7fc4]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PortalRun+0x32d)[0x5ad7285e834d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4d70a2)[0x5ad7285e40a2]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostgresMain+0x16e9)[0x5ad7285e5b39]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(BackendMain+0x5f)[0x5ad7285e02df]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(postmaster_child_launch+0xe1)[0x5ad72853cde1]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x433758)[0x5ad728540758]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostmasterMain+0xddd)[0x5ad72854223d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(main+0x1d0)[0x5ad72828b600]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x714aa222a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x714aa222a28b]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(_start+0x25)[0x5ad72828bc05]

   FORMAT JSON should obviously not crash the server, but apart from
   that it should also actually show this new data in the json output.

4. I think this setting should be enabled by default for ANALYZE, just
   like BUFFERS is now since c2a4078e[1].

5. I'm wondering whether this option deserves its own EXPLAIN option, or
   if it should simply be made part of BUFFERS.

6. Windows compilation is currently failing on the CFbot. Looking at the
   output, that's because rusage does not contain these fields there. I
   think you'll need some #ifdefs.

7. The result of getrusage() should be checked for errors, and any error
   should be reported (even though it's very unexpected for it to ever
   fail).

8. This needs docs.


> +                appendStringInfo(es->str, " read=%ld KB", (long) usage->inblock / 2);
> +                appendStringInfo(es->str, " write=%ld KB", (long) usage->outblock / 2);

9. I think this division by 2 could use some explanation in a comment. I
   understand that you're doing it because Linux accounts these counters
   in 512-byte blocks[2], and your additional factor of 2 gets that to
   1024 bytes. But that's not immediately clear from the code.

   I'm also not convinced that 512 bytes is the block size on every
   platform, or even that this logic is correct everywhere. I'm wondering
   if maybe we should simply show the blocks after all.


[1]: https://github.com/postgres/postgres/commit/c2a4078ebad71999dd451ae7d4358be3c9290b07
[2]:
https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/task_io_accounting_ops.h#L16-L23



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Atsushi Torikoshi
Date:

On Mon, Jan 6, 2025 at 6:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,

On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote:
>
> > **However, I think the general direction has merit**: Changing this
> > patch to
> > use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> > is 0 when everything is in page cache, and it is very high when stuff is
> > not. I was only hacking around and basically did this:
> >
> > s/ru_minflt/ru_inblock/g
> > s/ru_majflt/ru_oublock/g
> > [...]
> >  Also, maybe there's some better way
> > of getting read/write numbers for the current process than
> > ru_inblock/ru_oublock (but this one seems to work at least reasonably
> > well).
>
> Maybe, but as far as using getrusage() goes, ru_inblock and ru_oublock seem
> the best.

FWIW those are the counters we've been using in pg_stat_kcache / powa for
many years to compute disk I/O and the "postgres shared buffers hit vs OS
cache hit vs disk read" hit ratio, and we haven't received any problem
reports about that.

Thanks for sharing the information!

On Tue, Jan 7, 2025 at 5:09 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
> On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
> Updated the PoC patch to calculate them by KB:
>
>    =# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
>                                                               QUERY PLAN
>   
> ---------------------------------------------------------------------------------------------------------------------------------
>     Seq Scan on pgbench_accounts  (cost=0.00..263935.35 rows=10000035
> width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
>       Buffers: shared hit=2587 read=161348
>     Planning Time: 0.367 ms
>     Execution:
>       Storage I/O: read=1291856 KB write=0 KB
>     Execution Time: 4353.253 ms
>    (6 rows)
>
>

The core functionality works well in my opinion. I think it makes sense
to spend the effort to move this from PoC quality to something
committable.

Thanks for the comment and review!
If there are no other opinions, I will make a patch based on the direction of the current PoC patch.
  
4. I think this setting should be enabled by default for ANALYZE, just
   like BUFFERS is now since c2a4078e[1].

5. I'm wondering whether this option deserves its own EXPLAIN option, or
   if it should simply be made part of BUFFERS.
 
Although this is not information about PostgreSQL buffers, I now feel that making this addition part of BUFFERS is attractive, since people using BUFFERS would also sometimes want to know about the storage I/O. 

Since BUFFERS is now ON by default with EXPLAIN ANALYZE, I am concerned about the performance impact.
However, if getrusage() is called just twice, once at the start and once at the end, for the planning phase, the execution phase, and each parallel worker, I believe the impact is relatively small.
 
9. I think this division by 2 could use some explanation in a comment. I
   understand that you're doing this because linux divides its original
   bytes using 512 bytes[2] and your additional factor of 2 gets that to
   1024 bytes. But that's not clear immediately from the code.

   I'm also not convinced that 512 is the blocksize if this logic is
   even correct on every platform. I'm wondering if maybe we should
   simply show the blocks after all.

Maybe so. I'll look into this and then decide on the unit.

For the other comments, I plan to address them as you suggested.

 
Regards,
Atsushi Torikoshi

Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Jelte Fennema-Nio
Date:
On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I'll play around with it a bit next week.

Okay, I played around with it and couldn't find any issues. I marked
the patch as "ready for committer" in the commitfest app[1], given
that all feedback in my previous email was very minor.

[1]: https://commitfest.postgresql.org/52/5526/



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Andres Freund
Date:
Hi,

On 2025-02-09 12:51:40 +0100, Jelte Fennema-Nio wrote:
> On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > I'll play around with it a bit next week.
> 
> Okay, I played around with it and couldn't find any issues. I marked
> the patch as "ready for committer" in the commitfest app[1], given
> that all feedback in my previous email was very minor.

I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...

Greetings,

Andres



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I'm somewhat against this patch, as it's fairly fundamentally incompatible
> with AIO. There's no real way to get information in this manner if the IO
> isn't executed synchronously in process context...

Even without looking ahead to AIO, there's bgwriter, walwriter, and
checkpointer processes that all take I/O load away from foreground
processes.  I don't really believe that this will produce useful
numbers.

            regards, tom lane



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Jelte Fennema-Nio
Date:
On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> I think it'll always be a subset of uses. It doesn't make sense to use DIO for
> small databases or untuned databases. Or a system that's deliberately
> overcommitted.

Thanks, that's useful context.

> But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> more commonly.

To be clear, here you mean worker based AIO right? Because it would
work with io_uring based AIO, right?

> If suddenly I have to reimplement something like this to work with worker
> based IO, it'll certainly take longer to get to AIO.

I totally understand. But in my opinion it would be completely fine to
decide that these new IO stats are simply not available for worker
based IO. Just like they're not available for Windows either with this
patch.

I think it would be a shame to make perfect be the enemy of good here
(as often seems to happen with PG patches). I'd rather have this
feature for some setups, than for no setups at all.



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Andres Freund
Date:
Hi,

On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> > But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> > more commonly.
> 
> To be clear, here you mean worker based AIO right? Because it would
> work with io_uring based AIO, right?

I mostly meant worker based AIO, yes. I haven't checked how accurately these
are kept for io_uring. I would hope they are...


> > If suddenly I have to reimplement something like this to work with worker
> > based IO, it'll certainly take longer to get to AIO.
> 
> I totally understand. But in my opinion it would be completely fine to
> decide that these new IO stats are simply not available for worker
> based IO. Just like they're not available for Windows either with this
> patch.

The thing is that you'd often get completely misleading stats. Some of the IO
will still be done by the backend itself, so there will be a non-zero
value. But it will be a significant undercount, because the asynchronously
executed IO won't be tracked (if worker mode is used).

Greetings,

Andres Freund



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Andres Freund
Date:
Hi,

On 2025-02-10 18:30:56 -0500, Andres Freund wrote:
> On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> > On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> > > But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> > > more commonly.
> >
> > To be clear, here you mean worker based AIO right? Because it would
> > work with io_uring based AIO, right?
>
> I mostly meant worker based AIO, yes. I haven't checked how accurately these
> are kept for io_uring. I would hope they are...

It does look like it is tracked.


> > > If suddenly I have to reimplement something like this to work with worker
> > > based IO, it'll certainly take longer to get to AIO.
> >
> > I totally understand. But in my opinion it would be completely fine to
> > decide that these new IO stats are simply not available for worker
> > based IO. Just like they're not available for Windows either with this
> > patch.
>
> The thing is that you'd often get completely misleading stats. Some of the IO
> will still be done by the backend itself, so there will be a non-zero
> value. But it will be a significant undercount, because the asynchronously
> executed IO won't be tracked (if worker mode is used).

<clear cache>

postgres[985394][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ worker    │
└───────────┘
(1 row)

postgres[985394][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
                                                       QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=17906.00..17906.01 rows=1 width=8) (actual time=199.494..199.494 rows=1 loops=1)
   Buffers: shared read=5406
   I/O Timings: shared read=57.906
   ->  Seq Scan on manyrows  (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.380..140.671 rows=1000000 loops=1)
         Buffers: shared read=5406
         I/O Timings: shared read=57.906
 Planning:
   Buffers: shared hit=41 read=12
   Storage I/O: read=192 times write=0 times
 Planning Time: 1.869 ms
 Execution Time: 199.554 ms


<clear cache>


postgres[1014152][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ io_uring  │
└───────────┘
(1 row)


postgres[1014152][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
                                                       QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=17906.00..17906.01 rows=1 width=8) (actual time=111.591..111.593 rows=1 loops=1)
   Buffers: shared read=5406
   I/O Timings: shared read=14.342
   ->  Seq Scan on manyrows  (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.161..70.843 rows=1000000 loops=1)
         Buffers: shared read=5406
         I/O Timings: shared read=14.342
 Planning:
   Buffers: shared hit=41 read=12
   Storage I/O: read=192 times write=0 times
 Planning Time: 1.768 ms
 Execution:
   Storage I/O: read=86496 times write=0 times
 Execution Time: 111.670 ms


Independent of this, it's probably not good that we're tracking shared
buffer hits after IO combining, if I interpret this correctly... That
looks to be an issue in master, not just the AIO branch.

Greetings,

Andres Freund



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Jelte Fennema-Nio
Date:
On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres@anarazel.de> wrote:
> > I mostly meant worker based AIO, yes. I haven't checked how accurately these
> > are kept for io_uring. I would hope they are...
>
> It does look like it is tracked.

nice!

> > The thing is that you'd often get completely misleading stats. Some of the IO
> > will still be done by the backend itself, so there will be a non-zero
> > value. But it will be a significant undercount, because the asynchronously
> > executed IO won't be tracked (if worker mode is used).

Yeah, makes sense. Like I said, I would be completely fine with not
showing these numbers at all/setting them to 0 for setups where we
cannot easily get useful numbers (and this bgworker AIO would be one
of those setups).

> Independent of this, it's probably not good that we're tracking shared
> buffer hits after IO combining, if I interpret this correctly... That looks to
> be an issue in master, not just the AIO branch.

You mean that e.g. a combined IO for 20 blocks still counts only as 1
"shared read"? Yeah, that sounds like a bug.



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Andres Freund
Date:
Hi,

On 2025-02-11 09:59:43 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres@anarazel.de> wrote:
> > > The thing is that you'd often get completely misleading stats. Some of the IO
> > > will still be done by the backend itself, so there will be a non-zero
> > > value. But it will be a significant undercount, because the asynchronously
> > > executed IO won't be tracked (if worker mode is used).
> 
> Yeah, makes sense. Like I said, I would be completely fine with not
> showing these numbers at all/setting them to 0 for setups where we
> cannot easily get useful numbers (and this bgworker AIO would be one
> of those setups).

Shrug. It means that it'll not work in what I hope will be the default
mechanism before long.  I just can't get excited for that. In all likelihood
it'll result in bug reports that I'll then be on the hook to fix.


> > Independent of this, it's probably not good that we're tracking shared
> > buffer hits after IO combining, if I interpret this correctly... That looks to
> > be an issue in master, not just the AIO branch.
> 
> You mean that e.g. a combined IO for 20 blocks still counts only as 1
> "shared read"? Yeah, that sounds like a bug.

Yep.

Greetings,

Andres Freund



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Jelte Fennema-Nio
Date:
On Tue, 11 Feb 2025 at 16:36, Andres Freund <andres@anarazel.de> wrote:
> Shrug. It means that it'll not work in what I hope will be the default
> mechanism before long.  I just can't get excited for that. In all likelihood
> it'll result in bug reports that I'll then be on the hook to fix.

My assumption was that io_uring would become the default mechanism on
Linux. And that bgworker based AIO would generally only be used by
platforms without similar functionality. Is that assumption incorrect?



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Andres Freund
Date:
Hi,

On 2025-02-11 17:00:36 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 16:36, Andres Freund <andres@anarazel.de> wrote:
> > Shrug. It means that it'll not work in what I hope will be the default
> > mechanism before long.  I just can't get excited for that. In all likelihood
> > it'll result in bug reports that I'll then be on the hook to fix.
> 
> My assumption was that io_uring would become the default mechanism on
> Linux. And that bgworker based AIO would generally only be used by
> platforms without similar functionality. Is that assumption incorrect?

Yes, at least initially:
1) it's not enabled on the kernel level everywhere
2) it requires an optional build dependency
3) it requires tuning the file descriptor ulimit, unless we can convince Tom
   that it's ok to do that ourselves
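Point 1 can at least be probed at runtime. A rough sketch (assumptions: the `kernel.io_uring_disabled` sysctl only exists on newer kernels, so the absence of the file does not by itself prove io_uring is unavailable; a real implementation would probe the syscall directly):

```python
# Rough probe for whether the kernel permits io_uring. Illustrative
# only; this checks the io_uring_disabled sysctl where present.
import os

def io_uring_probably_available():
    path = "/proc/sys/kernel/io_uring_disabled"
    try:
        with open(path) as f:
            # 0 = enabled, nonzero = restricted or fully disabled
            return int(f.read().strip()) == 0
    except FileNotFoundError:
        # Older kernel without the knob: assume available on Linux,
        # unavailable elsewhere.
        return os.uname().sysname == "Linux"

print(io_uring_probably_available())
```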

Greetings,

Andres Freund



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Jelte Fennema-Nio
Date:
On Tue, 11 Feb 2025 at 17:19, Andres Freund <andres@anarazel.de> wrote:
> Yes, at least initially:

Ah, then I understand your point of view much better. Still I think we
could easily frame it as: If you enable io_uring, you also get these
additional fancy stats.

Also, afaict these items don't have to remain blockers:

> 1) it's not enabled on the kernel level everywhere

Is that really a common thing to do? From a quick internet search, I
can "only" find that Google does this. (Google is definitely a big
cloud player, so I don't want to suggest that this is unimportant, but
if it's really the only one, the bulk of systems would still have
io_uring support.)

> 2) it requires an optional build dependency

What build dependency is this? In any case, can't we choose the
default at build time based on the available build dependencies? And
if we cannot, I think we could always add an "auto" default that would
mean the best available AIO implementation (where io_uring is better
than bgworkers).
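The "auto" idea could look something like this (a minimal sketch with invented names; the actual GUC machinery and capability checks would live in C):

```python
# Hypothetical resolution of an io_method = "auto" setting: prefer
# io_uring when both the build (liburing) and the kernel support it,
# otherwise fall back to the worker-based implementation.

def resolve_io_method(setting, have_liburing, kernel_supports_io_uring):
    if setting != "auto":
        return setting  # explicit user choice wins
    if have_liburing and kernel_supports_io_uring:
        return "io_uring"
    return "worker"

print(resolve_io_method("auto", True, True))    # io_uring
print(resolve_io_method("auto", True, False))   # worker
print(resolve_io_method("worker", True, True))  # worker
```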

> 3) it requires tuning the file descriptor ulimit, unless we can convince Tom
>    that it's ok to do that ourselves

I think we should just do this, given the reasoning in the blog[1]
from the systemd author I linked in the AIO thread. I agree that a
response/explicit approval from Tom would be nice though.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
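The approach the blog post advocates is simply to raise the soft RLIMIT_NOFILE to the hard limit at startup instead of asking administrators to tune ulimits by hand. A sketch in Python (PostgreSQL would do the equivalent in C via getrlimit/setrlimit; the blog's caveat about select()-based code not coping with high descriptor numbers applies):

```python
# Raise the soft file descriptor limit up to the hard limit, which an
# unprivileged process is always allowed to do.
import resource

def raise_nofile_soft_limit():
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = raise_nofile_soft_limit()
print(soft, hard)
```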



Re: RFC: Allow EXPLAIN to Output Page Fault Information

From
Andres Freund
Date:
Hi,

On 2025-02-11 18:45:13 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 17:19, Andres Freund <andres@anarazel.de> wrote:
> > Yes, at least initially:
> 
> Ah, then I understand your point of view much better. Still I think we
> could easily frame it as: If you enable io_uring, you also get these
> additional fancy stats.

> Also, afaict these items don't have to remain blockers:
> 
> > 1) it's not enabled on the kernel level everywhere
> 
> Is that really a common thing to do? From a quick internet search, I
> can "only" find that Google does this. (Google is definitely a big
> cloud player, so I don't want to suggest that this is unimportant, but
> if it's really the only one, the bulk of systems would still have
> io_uring support.)

RHEL had it disabled for quite a while, not sure if that's still the case.


> > 2) it requires an optional build dependency
> 
> What build dependency is this?

Liburing.


> In any case, can't we choose the default at build time based on the
> available build dependencies? And if we cannot, I think we could always add
> an "auto" default that would mean the best available AIO implementation
> (where io_uring is better than bgworkers).

We could, but because of 3) I don't want to do that right now.


> > 3) it requires tuning the file descriptor ulimit, unless we can convince Tom
> >    that it's ok to do that ourselves
> 
> I think we should just do this, given the reasoning in the blog[1]
> from the systemd author I linked in the AIO thread. I agree that a
> response/explicit approval from Tom would be nice though.

I think it's the right path, but that's a fight to fight after AIO has been
merged, not before.

Greetings,

Andres Freund