Re: AIO v2.5 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: AIO v2.5
Msg-id cbc572c1-95cf-4407-94b2-e54521f55e28@vondra.me
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

On 7/14/25 20:36, Andres Freund wrote:
> Hi,
> 
> On 2025-07-11 23:03:53 +0200, Tomas Vondra wrote:
>> I've been running some benchmarks comparing the io_methods, to help with
>> resolving this PG18 open item. So here are some results, and my brief
>> analysis of it.
> 
> Thanks for doing that!
> 
> 
> 
>> The TL;DR version
>> -----------------
>>
>> * The "worker" method seems good, and I think we should keep it as the
>> default. We should probably think about increasing the number of workers
>> a bit; the current io_workers=3 seems too low and regresses in a couple
>> of tests.
>>
>> * The "sync" method seems OK too, but it's more of a conservative
>> choice, i.e. more weight on keeping the PG17 behavior / not causing
>> regressions. But I haven't seen such regressions (with enough workers).
>> And there are cases where "worker" is much faster. It'd be a shame to
>> throw away that benefit.
>>
>> * There might be bugs in "worker", simply because it has to deal with
>> multiple concurrent processes etc. But I guess we'll fix those just like
>> other bugs. I don't think it's a good argument against "worker" default.
>>
>> * All my tests were done on Linux and NVMe drives. It'd be good to do
>> similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan
>> to do some of that, but it'd be great to cover more cases. I can help
>> with getting my script running, a run takes ~1-2 days.
> 
> FWIW, in my very limited tests on Windows, the benefit of worker was
> considerably bigger there, because readahead is much more minimal
> without posix_fadvise...
> 
> 
>> The test also included PG17 for comparison, but I forgot PG18 enabled
>> checksums by default. So PG17 results are with checksums off, which in
>> some cases means PG17 seems a little bit faster. I'm re-running it with
>> checksums enabled on PG17, and that seems to eliminate the differences
>> (as expected).
> 
> My sneaking suspicion is that, independent of AIO, we're not really
> ready to default checksums to on.
> 
> 
>> Findings
>> --------
>>
>> I'm attaching only three PDFs with charts from the cold runs, to keep
>> the e-mail small (each PDF is ~100-200kB). Feel free to check the other
>> PDFs in the git repository, but it's all very similar and the attached
>> PDFs are quite representative.
>>
>> Some basic observations:
>>
>> a) index scans
>>
>> There's almost no difference for indexscans, i.e. the middle column in
>> the PDFs. There's a bit of variation on some of the cyclic/linear data
>> sets, but it seems more like random noise than a systemic difference.
>>
>> Which is not all that surprising, considering index scans don't really
>> use read_stream yet, so there's no prefetching etc.
> 
> Indeed.
> 
> 
>> The "ryzen" results however demonstrate that 3 workers may be too low.
>> The timing spikes to ~3000ms (at ~1% selectivity), before quickly
>> dropping back to ~1000ms. The other datasets show similar differences.
>> With 12 workers, there's no such problem.
> 
> I don't really know what to do about that - for now we don't have dynamic
> #workers, and starting 12 workers on a tiny database doesn't really make
> sense...  I suspect that on most hardware & queries it won't matter that much,
> but clearly, if you have high iops hardware it might.  I can perhaps see
> increasing the default to 5 or so, but after that...  I guess we could try
> some autoconf formula based on the size of s_b or such? But that seems
> somewhat awkward too.
> 

True. I don't have a great idea either. FWIW most of our defaults are
very conservative/low, and you have to bump them up on bigger machines
anyway. So having to bump one more GUC is not a big deal, and I don't
think we have to invent some magic formula for this one.

Also, autoconf wouldn't even know the s_b size; it'd have to be something
computed at startup. We could do some automated sizing when the GUC is
set to -1, perhaps. But is s_b even a good value to tie this to? I doubt
it.
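
For anyone wanting to experiment, these are plain GUCs, so a manual bump
looks something like this (the values are illustrative, not a
recommendation; io_method can only be changed with a restart, while
io_workers is, IIRC, reloadable):

```
# postgresql.conf (PG18) -- illustrative values, not a recommendation
io_method = worker    # changing this requires a server restart
io_workers = 12       # default is 3; the ryzen results suggest that's too low
```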

> 
> 
>>
>> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
>>
>> There's an interesting difference I noticed in the run with checksums
>> on PG17. The full PDF is available here:
> 
> (there's a subsequent email about this, will reply there)
> 
> 
>> Conclusion
>> ----------
>>
>> That's all I have at the moment. I still think it makes sense to keep
>> io_method=worker, but bump io_workers a bit higher. Could we also add
>> some suggestions to the docs on how to pick a good value?
> 
> .oO(/me ponders a troll patch to re-add a reference the number of spindles in
> that tuning advice)
>

;-)

> I'm not sure what advice to give here.  Maybe just to set it to a considerably
> larger number once not running on a tiny system? The incremental overhead of
> having an idle worker is rather small unless you're on a really tiny system...
> 

Too bad the patch doesn't collect any stats about how utilized the
workers are :-( That would make this easier; we could even print
something into the log if the queues overflow "too often", similar to
the max_wal_size hint when checkpoints happen too often.
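
As a strawman for the docs advice: a back-of-envelope lower bound on
io_workers falls out of Little's law, workers ≈ target IOPS × per-IO
latency. This is my own sketch, not anything from the patch, and the
function name and numbers are made up for illustration; it ignores
submission batching, per-worker queue depth, and everything else, so
treat it as a rough floor only:

```python
# Back-of-envelope io_workers floor via Little's law:
#   in-flight IOs ~= target_iops * per-IO latency (in seconds)
# Rough floor only -- ignores batching, per-worker queue depth, etc.

def estimate_io_workers(target_iops: float, io_latency_ms: float) -> int:
    in_flight = target_iops * (io_latency_ms / 1000.0)
    return max(1, round(in_flight))

# Fast NVMe, 100k IOPS at ~0.1 ms/IO -> ~10 IOs in flight:
print(estimate_io_workers(100_000, 0.1))   # -> 10
# Cloud-ish storage, 10k IOPS at ~1 ms -> also ~10 in flight:
print(estimate_io_workers(10_000, 1.0))    # -> 10
```

Interestingly, both the high-IOPS/low-latency and the cloud-ish
high-latency case land well above the current default of 3.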

> 
>> You might also run the benchmark on different hardware, and either
>> build/publish the plots somewhere, or just give me the CSV and I'll do
>> that. Better to find strange stuff / regressions now.
> 
> Probably the most interesting thing would be some runs with cloud-ish storage
> (relatively high iops, very high latency)...
> 

Yeah, I've started a test on a cloud VM. Will see in a day or two. And
another on FreeBSD, for good measure.

> 
>> The repository also has branches with plots showing results with WIP
>> indexscan prefetching. (It's excluded from the PDFs I presented here).
> 
> Hm, I looked for those, but I couldn't quickly find any plots that include
> them.  Would I have to generate the plots from a checkout of the repo?
> 

No, the charts are there, you don't need to generate them.

Look into the "with-indexscan-prefetch-run2-17-checksums" branch. E.g.
this is the same "ryzen" plot I shared earlier, but with "indexscan
prefetch" column:


https://github.com/tvondra/iomethod-tests/blob/with-indexscan-prefetch-run2-17-checksums/ryzen-rows-cold-32GB-16-unscaled.pdf

It might be better to look at the "scaled" charts, which make it easier
to compare the different scans (and the benefit of prefetching).


https://github.com/tvondra/iomethod-tests/blob/with-indexscan-prefetch-run2-17-checksums/ryzen-rows-cold-32GB-16-scaled.pdf

> 
>> The conclusions are similar to what we found here - "worker" is good
>> with enough workers, io_uring is good too. Sync has issues for some of
>> the data sets, but still helps a lot.
> 
> Nice.
> 
> Greetings,
> 
> Andres Freund


regards

-- 
Tomas Vondra



