Re: AIO v2.3 - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: AIO v2.3
Date:
Msg-id: CAKZiRmzpwCrg2cAeRGQowYaQt-C18samkkG3cX3E-JSR3oP=uw@mail.gmail.com
In response to: Re: AIO v2.3 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Feb 11, 2025 at 12:10 AM Andres Freund <andres@anarazel.de> wrote:

> > TLDR; in terms of SELECTs the master vs aioworkers looks very solid!
>
> Phew! Weee! Yay.

More good news: I've completed a full 24h pgbench run on the same machine
and it did not fail or report anything suspicious.

FYI, the patchset didn't apply anymore (it seems patches 1..6 are already
applied on master due to the checkpoint shutdown sequence), and yesterday
there was also a failed hunk in patch #12:

[..]
patching file src/backend/postmaster/postmaster.c
Hunk #10 succeeded at 2960 (offset 14 lines).
Hunk #11 FAILED at 3047.
[..]
1 out of 15 hunks FAILED -- saving rejects to file src/backend/postmaster/postmaster.c.rej

Anyway, to get a clean apply I used master @
a5579a90af05814eb5dc2fd5f68ce803899d2504 (~ Jan 24) with the asserted build
below:

meson setup build --reconfigure --prefix=/usr/pgsql18.aio --debug -Dsegsize_blocks=13 -Dcassert=true

/usr/pgsql18.aio/bin/pgbench -i -s 500 --partitions=100 # ~8GB
/usr/pgsql18.aio/bin/pgbench -R 1500 -c 100 -j 4 -P 1 -T 86400

with these additional settings:

effective_io_concurrency = '4'
shared_buffers = '2GB'
max_connections = '1000'
archive_command = 'cp %p /dev/null'
archive_mode = 'on'
summarize_wal = 'on'
wal_summary_keep_time = '1h'
wal_compression = 'on'
wal_log_hints = 'on'
max_wal_size = '1GB'
shared_preload_libraries = 'pg_stat_statements'
huge_pages = 'off'
wal_receiver_status_interval = '1s'

The above got a perfect run:

[..]
duration: 86400 s
number of transactions actually processed: 129615534
number of failed transactions: 0 (0.000%)
latency average = 5.332 ms
latency stddev = 24.107 ms
rate limit schedule lag: avg 0.748 (max 1992.517) ms
initial connection time = 124.472 ms
tps = 1500.179231 (without initial connection time)

> > I was kind of afraid that additional IPC to separate processes would put
> > workers at a disadvantage a little bit, but that's amazingly not true.
>
> It's a measurable disadvantage, it's just more than counteracted by being able
> to do IO asynchronously :).
>
> It's possible to make it more visible, by setting io_combine_limit = 1. If you
> have a small shared buffers with everything in the kernel cache, the dispatch
> overhead starts to be noticeable above several GB/s. But that's ok, I think.

Sure it is.

> > 2. my very limited (in terms of time) data analysis thoughts
> > - most of the time perf with aioworkers is identical (+/- 3%) to
> > the master, in most cases it is much BETTER
>
> I assume s/most/some/ for the second most?

Right, pardon my excited moment ;)

> > - on parallel seqscans "sata" with datasets bigger than VFS-cache
> > ("big") and high e_io_c with high client counts (sigh!), it looks like
> > a user-noticeable big regression, but to me it's not a regression in
> > itself; probably we are issuing way too many posix_fadvise()
> > readaheads with diminishing returns. Just letting you know. Not sure
> > it is worth introducing some global (shared aioworkers e_io_c)
> > limiter; I think not. It could also be some maintenance noise
> > on that I/O device, but I have no isolated SATA RAID10 with like 8x
> > HDDs at home to launch such a test to be absolutely sure.
>
> I think this is basically a configuration issue - configuring a high e_io_c
> for a device that can't handle that and then load it up with a lot of clients,
> well, that'll not work out great.

Sure. BTW, I'm also going to raise an idea about autotuning that e_io_c in
the related thread where everybody is complaining about it.
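Side note: for such mixed setups, e_io_c doesn't have to be raised
cluster-wide anyway; it can also be scoped per tablespace or per session.
A rough sketch (the tablespace name here is made up):

-- keep effective_io_concurrency low just for the slow SATA tablespace
ALTER TABLESPACE sata_big SET (effective_io_concurrency = 2);
-- or override it only for a single test session
SET effective_io_concurrency = 2;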
> > 3. with aioworkers it would be worth pointing out in the documentation
> > that `iotop` won't be good enough to show which PID is doing I/O anymore.
> > I often get questions like this: who is taking most of the I/O right
> > now, because storage is fully saturated on a multi-use system. Not sure
> > whether it would require a new view or not (pg_aios output seems to be
> > more like an in-memory debug view that would have to be sampled
> > aggressively, and pg_statio_all_tables shows the table well, but not the
> > PID -- same for pg_stat_io). IMHO, if the docs were as simple as
> > "In order to understand which processes (PIDs) are issuing lots of
> > IOs, please check pg_stat_activity for *IO/AioCompletion* wait events",
> > it should be good enough for a start.
>
> pg_stat_get_backend_io() should allow to answer that, albeit with the usual
> weakness of our stats system, namely that the user has to diff two snapshots
> themselves. It probably also has the weakness of not showing results for
> queries before they've finished, although I think that's something we should
> be able to improve without too much trouble (not in this release though, I
> suspect).
>
> I guess we could easily reference pg_stat_get_backend_io(), but a more
> complete recipe isn't entirely trivial...

I was trying to come up with something that could be added to the docs, but
the thing below is too ugly, and as you stated, its primary weakness is that
the query needs to finish before it is reflected:

WITH b AS (
        SELECT 0 AS step, pid, round(sum(write_bytes)/1024/1024) AS wMB,
               NULL::void, NULL::void
        FROM pg_stat_activity, pg_stat_get_backend_io(pid) GROUP BY pid),
    flush AS (
        SELECT 0 AS step, 0, 0, pg_sleep(1), pg_stat_clear_snapshot()),
    e AS (
        SELECT 1 AS step, pid, round(sum(write_bytes)/1024/1024) AS wMB,
               NULL::void, NULL::void
        FROM pg_stat_activity, pg_stat_get_backend_io(pid) GROUP BY pid),
    picture AS MATERIALIZED (
        SELECT * FROM b UNION ALL SELECT * FROM flush UNION ALL SELECT * FROM e)
SELECT * FROM (
    SELECT pid, wMB - LAG(wMB, 1) OVER (PARTITION BY pid ORDER BY step) AS "wMB/s"
    FROM picture
) WHERE "wMB/s" > 0; \watch 1

-J.
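PS. Just to illustrate the "simple docs" variant from point 3 above, the
pg_stat_activity sampling could look roughly like the sketch below. I'm
only guessing at the exact AIO-related wait event names, so treat the
LIKE pattern as illustrative:

-- who is waiting on (A)IO right now; run repeatedly, e.g. via \watch 1
SELECT pid, backend_type, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'IO' OR wait_event LIKE 'Aio%';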