Re: AIO v2.0 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.0
Date
Msg-id exrjge7fo7hcqvmcfscbxti6vyzuyy7gs2wpjgmxpnvuvgrnud@mxhnya3f5oyp
In response to Re: AIO v2.0  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: AIO v2.0
List pgsql-hackers
Hi,

On 2025-01-08 15:04:39 +0100, Jakub Wartak wrote:
> On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote:
> > I didn't think that pg_stat_* was quite the right namespace, given that it
> > shows not stats, but the currently ongoing IOs.  I am going with pg_aios for
> > now, but I don't particularly like that.
>
> If you are looking for other proposals:
> * pg_aios_progress ? (to follow pattern of pg_stat_copy|vacuum_progress?)
> * pg_debug_aios ?
> * pg_debug_io ?

I think pg_aios is better than those, if not by much.  Seems others are ok
with that name too. And we easily can evolve it later.


> > I think we'll want a pg_stat_aio as well, tracking things like:
> >
> > - how often the queue to IO workers was full
> > - how many times we submitted IO to the kernel (<= #ios with io_uring)
> > - how many times we asked the kernel for events (<= #ios with io_uring)
> > - how many times we had to wait for in-flight IOs before issuing more IOs
>
> If I could dream of one thing, it would be the 99.9th percentile of IO
> response times in milliseconds for different classes of I/O traffic
> (read/write/flush). But it sounds like it would be very similar to
> pg_stat_io and potentially would have to be
> per-tablespace/IO-traffic(subject)-type too.

Yea, that's a significant project on its own. It's not that cheap to compute
reasonably accurate percentiles and we have no infrastructure for doing so
right now.


> AFAIU pg_stat_io has improper structure to have that there.

Hm, not obvious to me why? It might make the view a bit wide to add it as an
additional column, but otherwise I don't see a problem?


> BTW: before trying to even start to compile that AIO v2.2* and
> responding to the previous review, what are you most interested in
> hearing about, so that it adds some value?

Due to the rather limited "users" of AIO in the patchset, I think most
benchmarks aren't expected to show any meaningful gains. However, they
shouldn't show any significant regressions either (when not using direct
IO). I think trying to find regressions would be a rather valuable thing.


I'm tempted to collect a few of the reasonably-ready read stream conversions
into the patchset, to make the potential gains more visible. But I am not sure
it's a good investment of time right now.


One small regression I do know about: scans of large relations that are
bigger than shared buffers but do fit in the kernel page cache. The increase
of the BAS_BULKREAD ring size does cause a small slowdown - but without it we
can never do sufficient asynchronous IO. I think the slowdown is small enough
to just accept that, but it's worth qualifying it on a few machines.
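
A rough sketch of how one might qualify that (table name and sizes are
placeholders; pick something bigger than shared_buffers but small enough to
stay in the kernel page cache):

  -- placeholder table, sized to exceed shared_buffers but fit in RAM
  CREATE TABLE bulkread_test AS
      SELECT g AS id, repeat('x', 200) AS filler
      FROM generate_series(1, 50000000) g;
  VACUUM ANALYZE bulkread_test;

  -- scan twice; the second run measures page-cache hits, not disk reads
  \timing on
  SELECT count(*) FROM bulkread_test;
  SELECT count(*) FROM bulkread_test;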


> Any workload specific measurements? just general feedback, functionality
> gaps?

To see the benefits it'd be interesting to compare:

1) sequential scan performance with data not in shared buffers, using buffered IO
2) same, but using direct IO when testing the patch
3) checkpoint performance


In my experiments 1) gains a decent amount of performance in many cases, but
nothing overwhelming - sequential scans are easy for the kernel to read ahead.

I do see very significant gains for 2). On a system with 10 striped NVMe SSDs
that each can do ~3.5 GB/s, I measured highly parallel sequential scans (I had
to use ALTER TABLE to get a sufficient number of workers):

master:                 ~18 GB/s
patch, buffered:        ~20 GB/s
patch, direct, worker:  ~28 GB/s
patch, direct, uring:   ~35 GB/s


This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).
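
For reference, roughly the knobs involved in reproducing that (GUC names as in
the patchset; the table name is a placeholder and most of these need a restart
to take effect):

  ALTER SYSTEM SET io_method = 'io_uring';           -- or 'worker'
  ALTER SYSTEM SET io_workers = 32;
  ALTER SYSTEM SET io_max_concurrency = 128;
  ALTER SYSTEM SET effective_io_concurrency = 1000;
  ALTER SYSTEM SET debug_io_direct = 'data';         -- only for the direct IO runs
  -- force more workers for the parallel seq scans than the planner would pick
  ALTER TABLE seqscan_test SET (parallel_workers = 32);
  SET max_parallel_workers_per_gather = 32;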


This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.


I also see significant gains with 3), bigger when using direct IO.  One
complicating factor when measuring 3) is that the first write to a block will
often be slower than subsequent writes, because the filesystem needs to update
some journaled metadata, presenting a bottleneck.

Checkpoint performance is also severely limited by data checksum computation
if enabled - independent of this patchset.
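
A crude way to time 3) (table name is a placeholder; doing two rounds sidesteps
the first-write effect mentioned above, since the second round mostly hits
already-written blocks):

  \timing on
  UPDATE ckpt_test SET filler = filler;   -- dirty a lot of buffers
  CHECKPOINT;
  UPDATE ckpt_test SET filler = filler;
  CHECKPOINT;                             -- avoids most of the fs metadata cost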


One annoying thing when testing DIO is that right now VACUUM will be rather
slow if the data isn't already in s_b, as it isn't yet read-stream-ified.
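
When benchmarking with DIO it can help to pre-load the relation into shared
buffers first, e.g. with the stock pg_prewarm extension (table name is a
placeholder):

  CREATE EXTENSION IF NOT EXISTS pg_prewarm;
  SELECT pg_prewarm('vacuum_test', 'buffer');
  VACUUM vacuum_test;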



> Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay
> to try the error handling routines?

Hm. I don't think that's going to work very well even on master. If the
filesystem fails there's not much that PG can do...


> Some kind of AIO <-> standby/recovery interactions?

I wouldn't expect anything there. I think Thomas has a patch somewhere that
read-stream-ifies recovery prefetching; once that's done it would be more
interesting.


> * - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! so
> let's officially recognize 2025 as the year of AIO in PG, as it
> was the 1st message :D

Hah, that was actually the opposite of what I intended :). I'd hoped to post
earlier, but jetlag had caught up with me...

Greetings,

Andres Freund


