Re: AIO v2.5 - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: AIO v2.5 |
| Date | |
| Msg-id | cbc572c1-95cf-4407-94b2-e54521f55e28@vondra.me |
| In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
On 7/14/25 20:36, Andres Freund wrote:
> Hi,
>
> On 2025-07-11 23:03:53 +0200, Tomas Vondra wrote:
>> I've been running some benchmarks comparing the io_methods, to help with
>> resolving this PG18 open item. So here are some results, and my brief
>> analysis of them.
>
> Thanks for doing that!
>
>> The TL;DR version
>> -----------------
>>
>> * The "worker" method seems good, and I think we should keep it as the
>> default. We should probably think about increasing the number of workers
>> a bit; the current io_workers=3 seems to be too low and regresses in a
>> couple of tests.
>>
>> * The "sync" method seems OK too, but it's more of a conservative choice,
>> i.e. more weight on keeping the PG17 behavior / not causing regressions.
>> But I haven't seen regressions (with enough workers). And there are cases
>> where "worker" is much faster. It'd be a shame to throw away that benefit.
>>
>> * There might be bugs in "worker", simply because it has to deal with
>> multiple concurrent processes etc. But I guess we'll fix those just like
>> other bugs. I don't think it's a good argument against the "worker"
>> default.
>>
>> * All my tests were done on Linux and NVMe drives. It'd be good to do
>> similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan
>> to do some of that, but it'd be great to cover more cases. I can help
>> with getting my script running; a run takes ~1-2 days.
>
> FWIW, in my very limited tests on Windows, the benefit of worker was
> considerably bigger there, due to having much more minimal readahead from
> not having posix_fadvise...
>
>> The test also included PG17 for comparison, but I forgot PG18 enables
>> checksums by default. So the PG17 results are with checksums off, which in
>> some cases means PG17 seems a little bit faster. I'm re-running it with
>> checksums enabled on PG17, and that seems to eliminate the differences
>> (as expected).
> My sneaking suspicion is that, independent of AIO, we're not really ready
> for checksums defaulting to on.
>
>> Findings
>> --------
>>
>> I'm attaching only three PDFs with charts from the cold runs, to keep
>> the e-mail small (each PDF is ~100-200kB). Feel free to check the other
>> PDFs in the git repository, but it's all very similar and the attached
>> PDFs are quite representative.
>>
>> Some basic observations:
>>
>> a) index scans
>>
>> There's almost no difference for indexscans, i.e. the middle column in
>> the PDFs. There's a bit of variation on some of the cyclic/linear data
>> sets, but it seems more like random noise than a systemic difference.
>>
>> Which is not all that surprising, considering index scans don't really
>> use read_stream yet, so there's no prefetching etc.
>
> Indeed.
>
>> The "ryzen" results however demonstrate that 3 workers may be too low.
>> The timing spikes to ~3000ms (at ~1% selectivity), before quickly
>> dropping back to ~1000ms. The other datasets show a similar difference.
>> With 12 workers, there's no such problem.
>
> I don't really know what to do about that - for now we don't have a
> dynamic number of workers, and starting 12 workers on a tiny database
> doesn't really make sense... I suspect that on most hardware & queries it
> won't matter that much, but clearly, if you have high-iops hardware it
> might. I can perhaps see increasing the default to 5 or so, but after
> that... I guess we could try some autoconf formula based on the size of
> s_b or such? But that seems somewhat awkward too.

True. I don't have a great idea either. FWIW most of our defaults are very
conservative/low, and you have to bump them up on bigger machines anyway.
So having to bump one more GUC is not a big deal, and I don't think we have
to invent some magic formula for this one. Also, autoconf wouldn't even
know about the s_b size; it'd have to be something done at startup. We
could do some automated sizing if set to -1, perhaps.
But is s_b even a good value to tie this to? I doubt that.

>> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
>>
>> There's an interesting difference I noticed in the run with checksums
>> on PG17. The full PDF is available here:
>
> (there's a subsequent email about this, will reply there)
>
>> Conclusion
>> ----------
>>
>> That's all I have at the moment. I still think it makes sense to keep
>> io_method=worker, but bump io_workers a bit higher. Could we also add
>> some suggestions to the docs on how to pick a good value?
>
> .oO(/me ponders a troll patch to re-add a reference to the number of
> spindles in that tuning advice)

;-)

> I'm not sure what advice to give here. Maybe just to set it to a
> considerably larger number once not running on a tiny system? The
> incremental overhead of having an idle worker is rather small unless
> you're on a really tiny system...

Too bad the patch doesn't collect any stats about how utilized the workers
are :-( That'd make it a bit easier; we could even print something into the
log if the queues overflow "too often", similarly to max_wal_size when
checkpoints happen too often.

>> You might also run the benchmark on different hardware, and either
>> build/publish the plots somewhere, or just give me the CSV and I'll do
>> that. Better to find strange stuff / regressions now.
>
> Probably the most interesting thing would be some runs with cloud-ish
> storage (relatively high iops, very high latency)...

Yeah, I've started a test on a cloud VM. Will see in a day or two. And
another on FreeBSD, for good measure.

>> The repository also has branches with plots showing results with WIP
>> indexscan prefetching. (It's excluded from the PDFs I presented here.)
>
> Hm, I looked for those, but I couldn't quickly find any plots that
> include them. Would I have to generate the plots from a checkout of the
> repo?

No, the charts are there, you don't need to generate them.
Look into the "with-indexscan-prefetch-run2-17-checksums" branch. E.g. this
is the same "ryzen" plot I shared earlier, but with an "indexscan prefetch"
column:

https://github.com/tvondra/iomethod-tests/blob/with-indexscan-prefetch-run2-17-checksums/ryzen-rows-cold-32GB-16-unscaled.pdf

It might be better to look at the "scaled" charts, which make it easier to
compare the different scans (and the benefit of prefetching):

https://github.com/tvondra/iomethod-tests/blob/with-indexscan-prefetch-run2-17-checksums/ryzen-rows-cold-32GB-16-scaled.pdf

>> The conclusions are similar to what we found here - "worker" is good
>> with enough workers, and io_uring is good too. Sync has issues for some
>> of the data sets, but still helps a lot.
>
> Nice.
>
> Greetings,
>
> Andres Freund

regards

--
Tomas Vondra
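[Editorial note: to make the tuning discussed above concrete, a minimal postgresql.conf sketch follows. The specific value of io_workers is illustrative only; the thread suggests raising it above the default of 3 but does not settle on a number.]

```ini
# PostgreSQL 18 AIO settings discussed in this thread.
# io_method selects the I/O implementation; "worker" is the PG18 default,
# with "sync" and (on Linux) "io_uring" as alternatives.
io_method = worker

# Default is 3, which the "ryzen" benchmarks above suggest can be too low
# on high-iops storage; 12 avoided the latency spikes in those runs.
# Changing io_workers requires a server restart.
io_workers = 12
```

As with most conservative PostgreSQL defaults (shared_buffers, etc.), this is a GUC one would bump up when moving beyond a tiny system.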