Re: AIO v2.0 - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: AIO v2.0
Msg-id: exrjge7fo7hcqvmcfscbxti6vyzuyy7gs2wpjgmxpnvuvgrnud@mxhnya3f5oyp
In response to: Re: AIO v2.0 (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses: Re: AIO v2.0
List: pgsql-hackers
Hi,

On 2025-01-08 15:04:39 +0100, Jakub Wartak wrote:
> On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote:
> > I didn't think that pg_stat_* was quite the right namespace, given that it
> > shows not stats, but the currently ongoing IOs. I am going with pg_aios for
> > now, but I don't particularly like that.
>
> If you are looking for other proposals:
> * pg_aios_progress ? (to follow the pattern of pg_stat_copy|vacuum_progress?)
> * pg_debug_aios ?
> * pg_debug_io ?

I think pg_aios is better than those, if not by much. Seems others are ok
with that name too. And we can easily evolve it later.


> > I think we'll want a pg_stat_aio as well, tracking things like:
> >
> > - how often the queue to IO workers was full
> > - how many times we submitted IO to the kernel (<= #ios with io_uring)
> > - how many times we asked the kernel for events (<= #ios with io_uring)
> > - how many times we had to wait for in-flight IOs before issuing more IOs
>
> If I could dream of one thing, it would be the 99.9th percentile of IO
> response times in milliseconds for different classes of I/O traffic
> (read/write/flush). But it sounds like it would be very similar to
> pg_stat_io and would potentially have to be
> per-tablespace/IO-traffic(subject)-type too.

Yea, that's a significant project on its own. It's not that cheap to compute
reasonably accurate percentiles and we have no infrastructure for doing so
right now.


> AFAIU pg_stat_io has improper structure to have that there.

Hm, not obvious to me why? It might make the view a bit wide to add it as an
additional column, but otherwise I don't see a problem?


> BTW: before trying to even start to compile that AIO v2.2* and
> responding to the previous review, what are you most interested in
> hearing about, so that it adds some value?

Due to the rather limited "users" of AIO in the patchset, most benchmarks
aren't expected to show any meaningful gains. However, they shouldn't show
any significant regressions either (when not using direct IO). I think trying
to find regressions would be a rather valuable thing.

I'm tempted to collect a few of the reasonably-ready read stream conversions
into the patchset, to make the potential gains more visible. But I am not
sure it's a good investment of time right now.

One small regression I do know about is scans of large relations that are
bigger than shared buffers but do fit in the kernel page cache. The increase
of BAS_BULKREAD does cause a small slowdown - but without it we can never do
sufficient asynchronous IO. I think the slowdown is small enough to just
accept it, but it's worth qualifying that on a few machines.


> Any workload specific measurements? just general feedback, functionality
> gaps?

To see the benefits it'd be interesting to compare:

1) sequential scan performance with data not in shared buffers, using
   buffered IO
2) same, but using direct IO when testing the patch
3) checkpoint performance

In my experiments 1) gains a decent amount of performance in many cases, but
nothing overwhelming - sequential scans are easy for the kernel to read
ahead.
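Roughly, a setup for 2) could look like the sketch below - this assumes the
patchset's io_method / io_workers / io_max_concurrency GUCs plus the existing
debug_io_direct developer option, and big_table is just a placeholder:

    -- sketch: server-level settings; io_method, io_max_concurrency,
    -- max_worker_processes and debug_io_direct need a restart to take effect
    ALTER SYSTEM SET io_method = 'io_uring';          -- or 'worker'
    ALTER SYSTEM SET io_workers = 32;                 -- only used with io_method = 'worker'
    ALTER SYSTEM SET io_max_concurrency = 128;
    ALTER SYSTEM SET effective_io_concurrency = 1000;
    ALTER SYSTEM SET debug_io_direct = 'data';        -- direct IO for the patched runs
    ALTER SYSTEM SET max_worker_processes = 64;       -- must accommodate the parallel workers

    -- after a restart, force a high degree of parallelism for the scan
    ALTER TABLE big_table SET (parallel_workers = 32);
    SET max_parallel_workers = 32;
    SET max_parallel_workers_per_gather = 32;
    SELECT count(*) FROM big_table;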
I do see very significant gains for 2) - on a system with 10 striped NVMe
SSDs that each can do ~3.5 GB/s, I measured very parallel sequential scans
(I had to use ALTER TABLE to get a sufficient number of workers):

master:                 ~18 GB/s
patch, buffered:        ~20 GB/s
patch, direct, worker:  ~28 GB/s
patch, direct, uring:   ~35 GB/s

This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (it doesn't need to be that high, but it's what
I still have the numbers for). This was without data checksums enabled, as
otherwise the checksum code becomes a *huge* bottleneck.

I also see significant gains with 3), bigger when using direct IO. One
complicating factor when measuring 3) is that the first write to a block will
often be slower than subsequent writes, because the filesystem will need to
update some journaled metadata, presenting a bottleneck. Checkpoint
performance is also severely limited by data checksum computation, if
enabled - independent of this patchset.

One annoying thing when testing DIO is that right now VACUUM will be rather
slow if the data isn't already in s_b, as it isn't yet read-stream-ified.


> Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay
> to try the error handling routines?

Hm. I don't think that's going to work very well even on master. If the
filesystem fails there's not much that PG can do...


> Some kind of AIO <-> standby/recovery interactions?

I wouldn't expect anything there. I think Thomas somewhere has a patch that
read-stream-ifies recovery prefetching; once that's done it would be more
interesting.


> * - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! so
> let's officially recognize 2025 as the year of AIO in PG, as it was
> the 1st message :D

Hah, that was actually the opposite of what I intended :). I'd hoped to post
earlier, but jetlag had caught up with me...

Greetings,

Andres Freund