From: Andres Freund
Subject: Re: AIO v2.5
Msg-id: brdaw5wke274lubirrl4v2k4qdacylvgwwqztifn7m27pkth3s@rh7wie47pfcp
In response to: Re: AIO v2.5 (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

Hi,

On 2025-07-11 23:03:53 +0200, Tomas Vondra wrote:
> I've been running some benchmarks comparing the io_methods, to help with
> resolving this PG18 open item. So here are some results, and my brief
> analysis of it.

Thanks for doing that!



> The TL;DR version
> -----------------
> 
> * The "worker" method seems good, and I think we should keep it as a
> default. We should probably think about increasing the number of workers
> a bit, the current io_workers=3 seems to be too low and regresses in a
> couple tests.
> 
> * The "sync" seems OK too, but it's more of a conservative choice, i.e.
> more weight for keeping the PG17 behavior / not causing regressions. But
> I haven't seen that (with enough workers). And there are cases when the
> "worker" is much faster. It'd be a shame to throw away that benefit.
> 
> * There might be bugs in "worker", simply because it has to deal with
> multiple concurrent processes etc. But I guess we'll fix those just like
> other bugs. I don't think it's a good argument against "worker" default.
> 
> * All my tests were done on Linux and NVMe drives. It'd be good to do
> similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan
> to do some of that, but it'd be great to cover more cases. I can help
> with getting my script running, a run takes ~1-2 days.

FWIW, in my very limited tests on windows, the benefit of worker was
considerably bigger there, due to windows having much more minimal readahead
and not having posix_fadvise...


> The test also included PG17 for comparison, but I forgot PG18 enabled
> checksums by default. So PG17 results are with checksums off, which in
> some cases means PG17 seems a little bit faster. I'm re-running it with
> checksums enabled on PG17, and that seems to eliminate the differences
> (as expected).

My sneaking suspicion is that, independent of AIO, we're not really ready for
checksums defaulting to on.


> Findings
> --------
> 
> I'm attaching only three PDFs with charts from the cold runs, to keep
> the e-mail small (each PDF is ~100-200kB). Feel free to check the other
> PDFs in the git repository, but it's all very similar and the attached
> PDFs are quite representative.
> 
> Some basic observations:
> 
> a) index scans
> 
> There's almost no difference for indexscans, i.e. the middle column in
> the PDFs. There's a bit of variation on some of the cyclic/linear data
> sets, but it seems more like random noise than a systemic difference.
>
> Which is not all that surprising, considering index scans don't really
> use read_stream yet, so there's no prefetching etc.

Indeed.


> The "ryzen" results however demonstrate that 3 workers may be too low.
> The timing spikes to ~3000ms (at ~1% selectivity), before quickly
> dropping back to ~1000ms. The other datasets show similar difference.
> With 12 workers, there's no such problem.

I don't really know what to do about that - for now we don't have dynamic
#workers, and starting 12 workers on a tiny database doesn't really make
sense...  I suspect that on most hardware & queries it won't matter that much,
but clearly, if you have high iops hardware it might.  I can perhaps see
increasing the default to 5 or so, but after that...  I guess we could try
some autoconf formula based on the size of s_b or such? But that seems
somewhat awkward too.
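
To make the "formula based on s_b" idea concrete, here is a minimal sketch of
what such a heuristic could look like. Everything in it is hypothetical: the
function name, the "one extra worker per doubling of shared_buffers above
128MB" rule and the cap of 32 are assumptions made up for illustration, not
something proposed or measured in this thread.

```python
import math


def suggested_io_workers(shared_buffers_mb: int, floor: int = 3, cap: int = 32) -> int:
    """Hypothetical heuristic for picking io_workers from shared_buffers.

    Assumption (not from the thread): start at the current default of 3 and
    add one worker for every doubling of shared_buffers above 128MB, capped
    at an assumed upper bound.
    """
    if shared_buffers_mb <= 0:
        raise ValueError("shared_buffers_mb must be positive")
    # Number of whole doublings above the 128MB baseline; never negative.
    doublings = max(0, int(math.log2(shared_buffers_mb / 128)))
    return min(cap, floor + doublings)


if __name__ == "__main__":
    for mb in (128, 1024, 16 * 1024):
        print(f"shared_buffers={mb}MB -> io_workers={suggested_io_workers(mb)}")
```

A tiny instance stays at 3 workers while a 16GB s_b ends up at 10, which also
shows the awkwardness: s_b says little about how many iops the storage can
actually deliver.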



> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
> 
> There's an interesting difference I noticed in the run with
> checksums on PG17. The full PDF is available here:

(there's a subsequent email about this, will reply there)


> Conclusion
> ----------
> 
> That's all I have at the moment. I still think it makes sense to keep
> io_method=worker, but bump up the io_workers a bit higher. Could we also
> add some suggestions how to pick a good value to the docs?

.oO(/me ponders a troll patch to re-add a reference to the number of spindles
in that tuning advice)

I'm not sure what advice to give here.  Maybe just to set it to a considerably
larger number when not running on a tiny system? The incremental overhead of
having an idle worker is rather small unless you're on a really tiny system...
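
For concreteness, on a not-tiny system that might look like the following in
postgresql.conf. The value 12 is simply the worker count from the runs quoted
above, used here as an example rather than a recommendation:

```
io_method = worker      # the PG18 default
io_workers = 12         # default is 3
```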


> You might also run the benchmark on different hardware, and either
> build/publish the plots somewhere, or just give me the CSV and I'll do
> that. Better to find strange stuff / regressions now.

Probably the most interesting thing would be some runs with cloud-ish storage
(relatively high iops, very high latency)...


> The repository also has branches with plots showing results with WIP
> indexscan prefetching. (It's excluded from the PDFs I presented here).

Hm, I looked for those, but I couldn't quickly find any plots that include
them.  Would I have to generate the plots from a checkout of the repo?


> The conclusions are similar to what we found here - "worker" is good
> with enough workers, io_uring is good too. Sync has issues for some of
> the data sets, but still helps a lot.

Nice.

Greetings,

Andres Freund


