Thread: Re: AIO v2.3

Re: AIO v2.3

From: Andres Freund

Hi,

On 2025-02-06 11:50:04 +0100, Jakub Wartak wrote:
> Hi Andres, OK, so I've hastily launched AIO v2.3 (full, 29 patches)
> patchset probe run before going for short vacations and here results
> are attached*.

Thanks for doing that work!


> TLDR; in terms of SELECTs the master vs aioworkers looks very solid!

Phew! Weee! Yay.


> I was kind of afraid that additional IPC to separate processes would put
> workers at a disadvantage a little bit, but that's amazingly not true.

It's a measurable disadvantage, it's just more than counteracted by being able
to do IO asynchronously :).

It's possible to make it more visible by setting io_combine_limit = 1. If you
have a small shared_buffers setting with everything in the kernel cache, the
dispatch overhead starts to be noticeable above several GB/s. But that's ok, I
think.
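
For anyone who wants to reproduce that, a rough sketch (both GUCs can be set
per session; big_table is just a placeholder for a relation that fits in the
kernel cache but not in shared_buffers):

-- disable read combining so each block is dispatched as its own IO,
-- which makes the per-IO dispatch overhead visible on cached data
SET io_combine_limit = 1;
SET effective_io_concurrency = 32;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM big_table;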


> The intention of this effort was just to see if committing AIO with defaults
> as it stands is good enough to not cause basic regressions for users, and to
> me it looks like it is nearly finished :)).

That's really good to hear.  I think we can improve things a lot in the
future, but we gotta start somewhere...


> 1. not a single crash was observed, but those were pretty short runs
> 
> 2. my very time-limited data analysis thoughts
> - most of the time perf with aioworkers is identical (+/- 3%) to master,
> in most cases it is much BETTER

I assume s/most/some/ for the second most?


> - on parallel seqscans on "sata" with datasets bigger than the VFS cache
> ("big") and high e_io_c with high client counts (sigh!), it looks like it
> would be a user-noticeable big regression, but to me it's not a regression
> in itself; probably we are issuing way too many posix_fadvise()
> readaheads with diminishing returns. Just letting you know. Not sure
> it is worth introducing some global (shared across aioworkers) e_io_c
> limiter; I think not. I think it could also be some maintenance noise
> on that I/O device, but I have no isolated SATA RAID10 with like 8x
> HDDs at home to launch such a test to be absolutely sure.

I'm inclined to not introduce a global limit for now - it's pretty hard to
make that scale to fast IO devices, so you need a multi-level design, where
each backend can issue a few IOs without consulting the global limit and only
after that you need to get the right to issue even more IOs from the shared
"pool".

I think this is basically a configuration issue - configuring a high e_io_c
for a device that can't handle that and then loading it up with a lot of clients,
well, that'll not work out great.
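
If the slow array lives on its own tablespace, one rough way to express that
configuration (sketch only; the tablespace name sata_ts is made up) is to
lower e_io_c just for it:

-- hypothetical tablespace backed by the slow SATA RAID10
ALTER TABLESPACE sata_ts SET (effective_io_concurrency = 4);
-- or lower the instance-wide default instead
ALTER SYSTEM SET effective_io_concurrency = 4;
SELECT pg_reload_conf();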


> 3. with aioworkers it would be worth pointing out in the documentation that
> `iotop` won't be good enough to show which PID is doing I/O anymore.
> I often get questions like this: who is using the most I/O right now,
> because storage is fully saturated on a multi-use system? Not sure whether
> it would require a new view or not (pg_aios output seems to be more of an
> in-memory debug view that would have to be sampled aggressively, and
> pg_statio_all_tables shows the table, but not the PID -- same for
> pg_stat_io). IMHO if the docs were as simple as
> "In order to understand which processes (PIDs) are issuing lots of
> IOs, please check pg_stat_activity for *IO/AioCompletion* wait events"
> it should be good enough for a start.

pg_stat_get_backend_io() should allow answering that, albeit with the usual
weakness of our stats system, namely that the user has to diff two snapshots
themselves. It probably also has the weakness of not showing results for
queries before they've finished, although I think that's something we should
be able to improve without too much trouble (not in this release though, I
suspect).

I guess we could easily reference pg_stat_get_backend_io(), but a more
complete recipe isn't entirely trivial...
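
Something along those lines might be a starting point, with all the caveats
above (sketch only; it assumes the write_bytes column that
pg_stat_get_backend_io() exposes, and it inherits the weaknesses mentioned
above):

-- snapshot per-backend write volume
CREATE TEMP TABLE io_before AS
    SELECT a.pid, sum(b.write_bytes) AS write_bytes
    FROM pg_stat_activity a, pg_stat_get_backend_io(a.pid) AS b
    GROUP BY a.pid;

SELECT pg_sleep(10);

-- report how much each PID wrote during the interval
SELECT a.pid,
       round((sum(b.write_bytes) - io_before.write_bytes) / 1024.0 / 1024.0, 1)
           AS "MB written"
FROM pg_stat_activity a, pg_stat_get_backend_io(a.pid) AS b, io_before
WHERE io_before.pid = a.pid
GROUP BY a.pid, io_before.write_bytes
HAVING sum(b.write_bytes) > io_before.write_bytes;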


> Bench machine: it was intentionally much smaller hardware. Azure's
> Lsv2 L8s_v2 (1st gen EPYC/1s4c8t, with kernel 6.10.11+bpo-cloud-amd64
> and booted with mem=12GB that limited real usable RAM memory to just
> like ~8GB to stress I/O). liburing 2.9. Normal standard compile
> options were used without asserts (such as normal users would use).

Good - the asserts in the aio patches are a bit more noticeable than the ones
in master.


Thanks again for running these!


Greetings,

Andres Freund



Re: AIO v2.3

From: Jakub Wartak

On Tue, Feb 11, 2025 at 12:10 AM Andres Freund <andres@anarazel.de> wrote:

>> TLDR; in terms of SELECTs the master vs aioworkers looks very solid!

> Phew! Weee! Yay.

More good news: I've completed a full 24h pgbench run on the same
machine and it did not fail or report anything suspicious. FYI, the
patchset didn't apply cleanly anymore (it seems patches 1..6 are already
applied on master due to the checkpoint shutdown sequence changes), but
there was also a failed hunk in patch #12 yesterday:
[..]
patching file src/backend/postmaster/postmaster.c
Hunk #10 succeeded at 2960 (offset 14 lines).
Hunk #11 FAILED at 3047.
[..]
1 out of 15 hunks FAILED -- saving rejects to file
src/backend/postmaster/postmaster.c.rej

anyway, to get a clean apply I applied the patchset on master @
a5579a90af05814eb5dc2fd5f68ce803899d2504 (~ Jan 24) and used the below
asserted build:

meson setup build --reconfigure --prefix=/usr/pgsql18.aio --debug \
    -Dsegsize_blocks=13 -Dcassert=true
/usr/pgsql18.aio/bin/pgbench -i -s 500 --partitions=100 # ~8GB
/usr/pgsql18.aio/bin/pgbench -R 1500 -c 100 -j 4 -P 1 -T 86400

with some additional settings:
effective_io_concurrency = '4'
shared_buffers = '2GB'
max_connections = '1000'
archive_command = 'cp %p /dev/null'
archive_mode = 'on'
summarize_wal = 'on'
wal_summary_keep_time = '1h'
wal_compression = 'on'
wal_log_hints = 'on'
max_wal_size = '1GB'
shared_preload_libraries = 'pg_stat_statements'
huge_pages = 'off'
wal_receiver_status_interval = '1s'

so the above got a perfect run:
[..]
duration: 86400 s
number of transactions actually processed: 129615534
number of failed transactions: 0 (0.000%)
latency average = 5.332 ms
latency stddev = 24.107 ms
rate limit schedule lag: avg 0.748 (max 1992.517) ms
initial connection time = 124.472 ms
tps = 1500.179231 (without initial connection time)

> > I was kind of afraid that additional IPC to separate processes would put
> > workers at a disadvantage a little bit, but that's amazingly not true.
>
> It's a measurable disadvantage, it's just more than counteracted by being able
> to do IO asynchronously :).
>
> It's possible to make it more visible by setting io_combine_limit = 1. If you
> have a small shared_buffers setting with everything in the kernel cache, the
> dispatch overhead starts to be noticeable above several GB/s. But that's ok, I
> think.

Sure it is.

> > 2. my very time-limited data analysis thoughts
> > - most of the time perf with aioworkers is identical (+/- 3%) to master,
> > in most cases it is much BETTER
>
> I assume s/most/some/ for the second most?

Right, pardon my excited moment ;)

> > - on parallel seqscans on "sata" with datasets bigger than the VFS cache
> > ("big") and high e_io_c with high client counts (sigh!), it looks like it
> > would be a user-noticeable big regression, but to me it's not a regression
> > in itself; probably we are issuing way too many posix_fadvise()
> > readaheads with diminishing returns. Just letting you know. Not sure
> > it is worth introducing some global (shared across aioworkers) e_io_c
> > limiter; I think not. I think it could also be some maintenance noise
> > on that I/O device, but I have no isolated SATA RAID10 with like 8x
> > HDDs at home to launch such a test to be absolutely sure.
>
> I think this is basically a configuration issue - configuring a high e_io_c
> for a device that can't handle that and then loading it up with a lot of clients,
> well, that'll not work out great.

Sure. BTW, I'm also going to raise an idea about autotuning that e_io_c in
the related thread where everybody is complaining about it.

> > 3. with aioworkers it would be worth pointing out in the documentation that
> > `iotop` won't be good enough to show which PID is doing I/O anymore.
> > I often get questions like this: who is using the most I/O right now,
> > because storage is fully saturated on a multi-use system? Not sure whether
> > it would require a new view or not (pg_aios output seems to be more of an
> > in-memory debug view that would have to be sampled aggressively, and
> > pg_statio_all_tables shows the table, but not the PID -- same for
> > pg_stat_io). IMHO if the docs were as simple as
> > "In order to understand which processes (PIDs) are issuing lots of
> > IOs, please check pg_stat_activity for *IO/AioCompletion* wait events"
> > it should be good enough for a start.
>
> pg_stat_get_backend_io() should allow answering that, albeit with the usual
> weakness of our stats system, namely that the user has to diff two snapshots
> themselves. It probably also has the weakness of not showing results for
> queries before they've finished, although I think that's something we should
> be able to improve without too much trouble (not in this release though, I
> suspect).
>
> I guess we could easily reference pg_stat_get_backend_io(), but a more
> complete recipe isn't entirely trivial...

I was trying to come up with something that could be added to the docs,
but the below thing is too ugly, and as you stated the primary weakness
is that the query needs to finish before it is reflected:

-- crude per-PID write-rate probe: snapshot, wait 1s and clear the stats
-- cache, snapshot again, then diff per PID
WITH
    b AS (SELECT 0 AS step, pid, round(sum(write_bytes)/1024/1024) AS wMB,
                 NULL::void, NULL::void
          FROM pg_stat_activity, pg_stat_get_backend_io(pid) GROUP BY pid),
    flush AS (SELECT 0 AS step, 0, 0, pg_sleep(1), pg_stat_clear_snapshot()),
    e AS (SELECT 1 AS step, pid, round(sum(write_bytes)/1024/1024) AS wMB,
                 NULL::void, NULL::void
          FROM pg_stat_activity, pg_stat_get_backend_io(pid) GROUP BY pid),
    picture AS MATERIALIZED (
        SELECT * FROM b
        UNION ALL
        SELECT * FROM flush
        UNION ALL
        SELECT * FROM e
    )
SELECT * FROM (
    SELECT pid, wMB - LAG(wMB, 1) OVER (PARTITION BY pid ORDER BY step) AS "wMB/s"
    FROM picture
) WHERE "wMB/s" > 0;

\watch 1

-J.