Re: AIO v2.5 - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: AIO v2.5
Msg-id: brdaw5wke274lubirrl4v2k4qdacylvgwwqztifn7m27pkth3s@rh7wie47pfcp
In response to: Re: AIO v2.5 (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
Hi,

On 2025-07-11 23:03:53 +0200, Tomas Vondra wrote:
> I've been running some benchmarks comparing the io_methods, to help with
> resolving this PG18 open item. So here are some results, and my brief
> analysis of it.

Thanks for doing that!

> The TL;DR version
> -----------------
>
> * The "worker" method seems good, and I think we should keep it as a
>   default. We should probably think about increasing the number of workers
>   a bit, the current io_workers=3 seems to be too low and regresses in a
>   couple of tests.
>
> * The "sync" method seems OK too, but it's more of a conservative choice,
>   i.e. more weight for keeping the PG17 behavior / not causing regressions.
>   But I haven't seen that (with enough workers). And there are cases when
>   "worker" is much faster. It'd be a shame to throw away that benefit.
>
> * There might be bugs in "worker", simply because it has to deal with
>   multiple concurrent processes etc. But I guess we'll fix those just like
>   other bugs. I don't think it's a good argument against the "worker"
>   default.
>
> * All my tests were done on Linux and NVMe drives. It'd be good to do
>   similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan
>   to do some of that, but it'd be great to cover more cases. I can help
>   with getting my script running; a run takes ~1-2 days.

FWIW, in my very limited tests on Windows, the benefit of worker was
considerably bigger there, because readahead is much more minimal without
posix_fadvise...

> The test also included PG17 for comparison, but I forgot PG18 enabled
> checksums by default. So the PG17 results are with checksums off, which in
> some cases means PG17 seems a little bit faster. I'm re-running it with
> checksums enabled on PG17, and that seems to eliminate the differences
> (as expected).

My sneaking suspicion is that, independent of AIO, we're not really ready to
default checksums to on.

> Findings
> --------
>
> I'm attaching only three PDFs with charts from the cold runs, to keep
> the e-mail small (each PDF is ~100-200kB). Feel free to check the other
> PDFs in the git repository, but it's all very similar and the attached
> PDFs are quite representative.
>
> Some basic observations:
>
> a) index scans
>
> There's almost no difference for index scans, i.e. the middle column in
> the PDFs. There's a bit of variation on some of the cyclic/linear data
> sets, but it seems more like random noise than a systematic difference.
>
> Which is not all that surprising, considering index scans don't really
> use read_stream yet, so there's no prefetching etc.

Indeed.

> The "ryzen" results however demonstrate that 3 workers may be too low.
> The timing spikes to ~3000ms (at ~1% selectivity), before quickly
> dropping back to ~1000ms. The other datasets show a similar difference.
> With 12 workers, there's no such problem.

I don't really know what to do about that - for now we don't have a dynamic
number of workers, and starting 12 workers on a tiny database doesn't really
make sense...

I suspect that on most hardware and queries it won't matter that much, but
clearly, if you have high-IOPS hardware, it might.

I can perhaps see increasing the default to 5 or so, but after that... I
guess we could try some auto-configuration formula based on the size of
shared_buffers or such? But that seems somewhat awkward too.
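To make the "formula based on shared_buffers" idea a bit more concrete, here
is a minimal sketch, purely hypothetical (the function name, the constants,
and the scaling are invented for illustration, not taken from the patch or
from PostgreSQL), of how a default io_workers value could be derived from
shared_buffers and clamped to a sane range:

#include <stdio.h>

/*
 * Hypothetical illustration only: derive a default io_workers value from
 * shared_buffers (given here in MB).  The constants are made up; the point
 * is just "scale with shared_buffers, with a floor at the current default
 * and a modest cap".
 */
static int
suggested_io_workers(long shared_buffers_mb)
{
	long	workers = shared_buffers_mb / 2048;	/* ~1 worker per 2GB */

	if (workers < 3)
		workers = 3;		/* keep the current default as a floor */
	if (workers > 32)
		workers = 32;		/* idle workers are cheap, but cap it anyway */

	return (int) workers;
}

int
main(void)
{
	long	sizes[] = {128, 4096, 16384, 131072};	/* shared_buffers in MB */

	for (int i = 0; i < 4; i++)
		printf("shared_buffers=%ldMB -> io_workers=%d\n",
			   sizes[i], suggested_io_workers(sizes[i]));
	return 0;
}

Whether something along those lines would actually be better than simply
raising the static default is exactly the open question above; an explicit
io_workers setting would presumably still override any such heuristic.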
> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
>
> There's an interesting difference I noticed in the run with checksums on
> PG17. The full PDF is available here:

(there's a subsequent email about this, will reply there)

> Conclusion
> ----------
>
> That's all I have at the moment. I still think it makes sense to keep
> io_method=worker, but bump io_workers a bit higher. Could we also add
> some suggestions to the docs on how to pick a good value?

.oO(/me ponders a troll patch to re-add a reference to the number of spindles
in that tuning advice)

I'm not sure what advice to give here. Maybe just to set it to a considerably
larger number when not running on a tiny system? The incremental overhead of
having an idle worker is rather small unless you're on a really tiny system...

> You might also run the benchmark on different hardware, and either
> build/publish the plots somewhere, or just give me the CSV and I'll do
> that. Better to find strange stuff / regressions now.

Probably the most interesting thing would be some runs with cloud-ish storage
(relatively high IOPS, very high latency)...

> The repository also has branches with plots showing results with WIP
> indexscan prefetching. (It's excluded from the PDFs I presented here.)

Hm, I looked for those, but I couldn't quickly find any plots that include
them. Would I have to generate the plots from a checkout of the repo?

> The conclusions are similar to what we found here - "worker" is good
> with enough workers, and io_uring is good too. Sync has issues for some
> of the data sets, but still helps a lot.

Nice.

Greetings,

Andres Freund