Re: AIO v2.3 - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: AIO v2.3
Date:
Msg-id: CAKZiRmzpwCrg2cAeRGQowYaQt-C18samkkG3cX3E-JSR3oP=uw@mail.gmail.com
In response to: Re: AIO v2.3 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Feb 11, 2025 at 12:10 AM Andres Freund <andres@anarazel.de> wrote:

> > TLDR; in terms of SELECTs the master vs aioworkers looks very solid!
>
> Phew! Weee! Yay.

More good news: I've completed a full 24h pgbench run on the same machine
and it did not fail or report anything suspicious.

FYI, the patchset didn't apply anymore (it seems patches 1..6 are already
applied on master due to the checkpoint shutdown sequence), and yesterday
there was also a failed hunk in patch #12:

[..]
patching file src/backend/postmaster/postmaster.c
Hunk #10 succeeded at 2960 (offset 14 lines).
Hunk #11 FAILED at 3047.
[..]
1 out of 15 hunks FAILED -- saving rejects to file src/backend/postmaster/postmaster.c.rej

Anyway, to get a clean apply I used master @
a5579a90af05814eb5dc2fd5f68ce803899d2504 (~ Jan 24) with the asserted build
below:

meson setup build --reconfigure --prefix=/usr/pgsql18.aio --debug -Dsegsize_blocks=13 -Dcassert=true

/usr/pgsql18.aio/bin/pgbench -i -s 500 --partitions=100 # ~8GB
/usr/pgsql18.aio/bin/pgbench -R 1500 -c 100 -j 4 -P 1 -T 86400

with these additional settings:

effective_io_concurrency = '4'
shared_buffers = '2GB'
max_connections = '1000'
archive_command = 'cp %p /dev/null'
archive_mode = 'on'
summarize_wal = 'on'
wal_summary_keep_time = '1h'
wal_compression = 'on'
wal_log_hints = 'on'
max_wal_size = '1GB'
shared_preload_libraries = 'pg_stat_statements'
huge_pages = 'off'
wal_receiver_status_interval = '1s'

The above got a perfect run:

[..]
duration: 86400 s
number of transactions actually processed: 129615534
number of failed transactions: 0 (0.000%)
latency average = 5.332 ms
latency stddev = 24.107 ms
rate limit schedule lag: avg 0.748 (max 1992.517) ms
initial connection time = 124.472 ms
tps = 1500.179231 (without initial connection time)

> > I was kind of afraid that additional IPC to separate processes would put
> > workers at a disadvantage a little bit, but that's amazingly not true.
>
> It's a measurable disadvantage, it's just more than counteracted by being able
> to do IO asynchronously :).
>
> It's possible to make it more visible, by setting io_combine_limit = 1. If you
> have a small shared buffers with everything in the kernel cache, the dispatch
> overhead starts to be noticeable above several GB/s. But that's ok, I think.

Sure it is.

> > 2. my very limited (in terms of time) data analysis thoughts
> > - most of the time perf with aioworkers is identical (+/- 3%) to
> > the master, in most cases it is much BETTER
>
> I assume s/most/some/ for the second most?

Right, pardon my excited moment ;)

> > - on parallel seqscans "sata" with datasets bigger than VFS-cache
> > ("big") and high e_io_c with high client counts (sigh!), it looks like
> > a user-noticeable big regression, but to me it's not a regression in
> > itself; probably we are issuing way too many posix_fadvise()
> > readaheads with diminishing returns. Just letting you know. Not sure
> > it is worth introducing some global (shared aioworkers e_io_c)
> > limiter; I think not. It could also be some maintenance noise
> > on that I/O device, but I have no isolated SATA RAID10 with like 8x
> > HDDs at home to launch such a test to be absolutely sure.
>
> I think this is basically a configuration issue - configuring a high e_io_c
> for a device that can't handle that and then load it up with a lot of clients,
> well, that'll not work out great.

Sure. BTW, I'm also going to raise an idea about autotuning that e_io_c in
the related thread where everybody is complaining about it.
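Side note: for such mixed setups, e_io_c doesn't have to be raised
cluster-wide anyway; it can also be scoped per tablespace or per session.
A rough sketch (the tablespace name here is made up):

-- keep effective_io_concurrency low just for the slow SATA tablespace
ALTER TABLESPACE sata_big SET (effective_io_concurrency = 2);
-- or override it only for a single test session
SET effective_io_concurrency = 2;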
> > 3. with aioworkers it would be worth pointing out in the documentation
> > that `iotop` won't be good enough to show which PID is doing I/O anymore.
> > I often get questions like this: who is taking most of the I/O right
> > now, because storage is fully saturated on a multi-use system. Not sure
> > whether it would require a new view or not (pg_aios output seems to be
> > more like an in-memory debug view that would have to be sampled
> > aggressively, and pg_statio_all_tables shows the table well, but not the
> > PID -- same for pg_stat_io). IMHO, if the docs were as simple as
> > "In order to understand which processes (PIDs) are issuing lots of
> > IOs, please check pg_stat_activity for *IO/AioCompletion* wait events",
> > it should be good enough for a start.
>
> pg_stat_get_backend_io() should allow to answer that, albeit with the usual
> weakness of our stats system, namely that the user has to diff two snapshots
> themselves. It probably also has the weakness of not showing results for
> queries before they've finished, although I think that's something we should
> be able to improve without too much trouble (not in this release though, I
> suspect).
>
> I guess we could easily reference pg_stat_get_backend_io(), but a more
> complete recipe isn't entirely trivial...

I was trying to come up with something that could be added to the docs, but
the thing below is too ugly, and as you stated, its primary weakness is that
the query needs to finish before it is reflected:

WITH b AS (
        SELECT 0 AS step, pid, round(sum(write_bytes)/1024/1024) AS wMB,
               NULL::void, NULL::void
        FROM pg_stat_activity, pg_stat_get_backend_io(pid) GROUP BY pid),
    flush AS (
        SELECT 0 AS step, 0, 0, pg_sleep(1), pg_stat_clear_snapshot()),
    e AS (
        SELECT 1 AS step, pid, round(sum(write_bytes)/1024/1024) AS wMB,
               NULL::void, NULL::void
        FROM pg_stat_activity, pg_stat_get_backend_io(pid) GROUP BY pid),
    picture AS MATERIALIZED (
        SELECT * FROM b UNION ALL SELECT * FROM flush UNION ALL SELECT * FROM e)
SELECT * FROM (
    SELECT pid, wMB - LAG(wMB, 1) OVER (PARTITION BY pid ORDER BY step) AS "wMB/s"
    FROM picture
) WHERE "wMB/s" > 0; \watch 1

-J.
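PS. Just to illustrate the "simple docs" variant from point 3 above, the
pg_stat_activity sampling could look roughly like the sketch below. I'm
only guessing at the exact AIO-related wait event names, so treat the
LIKE pattern as illustrative:

-- who is waiting on (A)IO right now; run repeatedly, e.g. via \watch 1
SELECT pid, backend_type, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'IO' OR wait_event LIKE 'Aio%';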