Re: AIO / read stream heuristics adjustments for index prefetching - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: AIO / read stream heuristics adjustments for index prefetching
Date
Msg-id CAAKRu_b24CSa1ja8yEu+5PnTpWcpL0EFdRL1TwQye7VffvYjJA@mail.gmail.com
In response to Re: AIO / read stream heuristics adjustments for index prefetching  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO / read stream heuristics adjustments for index prefetching
List pgsql-hackers
On Fri, Apr 3, 2026 at 1:30 PM Andres Freund <andres@anarazel.de> wrote:
>
> > - why not remove the combine_distance requirement from the fast path
> > entry criteria (could save resume_combine_distance in the fast path
> > and restore it after a miss)
>
> Because entering the fast path prevents IO combining from happening. So it's
> absolutely crucial that we do *not* enter it while we would combine.

But if it is a buffer hit, obviously we can't do IO combining anyway,
or am I misunderstanding the fast path's common case?
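To make the suggestion above concrete, here is a minimal sketch of what saving and restoring the combine distance around the fast path might look like. This is purely illustrative: the struct fields and function names (`resume_combine_distance`, `fast_path_enter`, etc.) are hypothetical and do not reflect PostgreSQL's actual read_stream.c API.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the fast-path entry criteria being
 * discussed; names are illustrative, not PostgreSQL's actual code. */
typedef struct FastPathStream
{
	int		distance;					/* current readahead distance */
	int		combine_distance;			/* lookahead allowance for combining */
	int		resume_combine_distance;	/* saved value, restored after a miss */
	bool	in_fast_path;
} FastPathStream;

/*
 * Enter the all-cached fast path only on a buffer hit: a hit needs no IO,
 * so there is nothing to combine.  Instead of requiring combine_distance
 * to already be zero, save it and zero it, restoring it on the first miss.
 */
static bool
fast_path_enter(FastPathStream *stream, bool buffer_hit)
{
	if (!buffer_hit)
		return false;			/* actual IO: combining may still pay off */

	stream->resume_combine_distance = stream->combine_distance;
	stream->combine_distance = 0;
	stream->in_fast_path = true;
	return true;
}

static void
fast_path_exit_on_miss(FastPathStream *stream)
{
	stream->combine_distance = stream->resume_combine_distance;
	stream->in_fast_path = false;
}
```

The point of the sketch is only that a hit-only fast path and IO combining never apply to the same block, so the combine_distance requirement could in principle be replaced by a save/restore pair.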

> > > In my experiments the batching was primarily useful to allow to reduce the
> > > command submission & interrupt overhead when interacting with storage, i.e.
> > > when actually doing IO, not just copying from the page cache.
> >
> > It still seems to me that batching would help guarantee parallel
> > memcpys for workers when data is in the kernel buffer cache because if
> > the IO is quick, the same worker may pick up the next IO.
>
> When you say workers, do you mean for io_method=worker? Or the internal kernel
> threads of io_uring for async IOs?

I'm talking about io_method worker.

> > You mentioned that we don't want to read too far ahead (including for
> > a single combined IO) in part because:
> >
> > > The resowner and private refcount mechanisms take more CPU cycles if you have
> > > more buffers pinned
> >
> > But I don't see how either distance is responding to this or
> > self-calibrating with this in mind
>
> Using the minimal required distance to avoid needing to wait for IO completion
> is responding to that, no?  Without these patches we read ahead as far as
> possible, even if all the data is in the page cache, which makes this issue
> way worse (without these patches it's a major source of regressions in the
> index prefetching patch).

But we aren't using the minimal distance to avoid needing to wait for
IO completion. We are also using a higher distance to try to get IO
combining, to allow for async copying into the kernel buffer cache,
and so on. There are a lot of different considerations; it isn't just
two opposing forces. And I'd imagine that the relationship between the
number of buffers pinned and the CPU cycles required for
resowner/refcount bookkeeping isn't perfectly linear.

> > I think it is weird to have combine_distance only be relevant when
> > readahead_distance is low. You said:
> >
> > > We could, but I don't think there would be a benefit in doing so. In my mind,
> > > what combine_distance is used for, is to control how far to look ahead to
> > > allow for IO combining when the readahead_distance is too low to allow for IO
> > > combining.  But if pinned_buffers > 0, we already have another IO in flight,
> > > so the combine_distance mechanism doesn't have to kick in.
> >
> > But it seems like for a completely uncached workload bigger IOs is
> > still beneficial.
>
> Massively so - the storage layer getting too small IOs really hurts.
>
> But, as the code stands, we *do* end up with large IOs in that case, because
> we will not issue the IO until it is "complete". If we need to do actual IO,
> the readahead_distance will be larger, and allow multiple full sized IOs to be
> issued.
>
> /*
>  * We don't start the pending read just because we've hit the distance limit,
>  * preferring to give it another chance to grow to full io_combine_limit size
>  * once more buffers have been consumed.  But this is not desirable in all
>  * situations - see below.
>  */
> static inline bool
> read_stream_should_issue_now(ReadStream *stream)
> {
> ...
>         /*
>          * If we've already reached io_combine_limit, there's no chance of growing
>          * the read further.
>          */
>         if (pending_read_nblocks >= stream->io_combine_limit)
>                 return true;
>
>         /* same if capped not by io_combine_limit but combine_distance */
>         if (stream->combine_distance > 0 &&
>                 pending_read_nblocks >= stream->combine_distance)
>                 return true;
>
>         /*
>          * If we currently have no reads in flight or prepared, issue the IO once
>          * we are not looking ahead further. This ensures there's always at least
>          * one IO prepared.
>          */
>         if (stream->pinned_buffers == 0 &&
>                 !read_stream_should_look_ahead(stream))
>                 return true;
>
>         return false;
> }
>
> So, unless we are not reading ahead, we are not issuing IOs until they are
> fully sized (or can't be combined, obviously).
>
> Am I misunderstanding what you're talking about?

I'm not saying that we don't do IO combining at high distances. My
point is more that it is confusing that combine_distance controls how
far we look ahead when readahead_distance is low, but when
readahead_distance is high, it controls when we issue the IO rather
than how far we look ahead. I don't think we should change course now,
but I wanted to call out that this felt a little uncomfortable to me.
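The two roles can be seen side by side in a small sketch. This is a simplified, hypothetical model (the `Stream` struct and both functions are illustrative, not the actual read_stream.c code): when readahead_distance is low and nothing is pinned, combine_distance extends the lookahead; when readahead_distance is high, the same field instead caps how large the pending read grows before it is issued, mirroring the quoted `read_stream_should_issue_now()` check.

```c
#include <stdbool.h>

/* Illustrative model of the two roles combine_distance plays; names and
 * logic are simplified, not PostgreSQL's actual implementation. */
typedef struct Stream
{
	int		readahead_distance;
	int		combine_distance;
	int		io_combine_limit;
	int		pending_read_nblocks;
	int		pinned_buffers;
} Stream;

/* Role 1: with a low readahead_distance and no IO in flight,
 * combine_distance extends how far we look ahead, so the pending read
 * still has a chance to be combined. */
static int
lookahead_limit(const Stream *s)
{
	if (s->pinned_buffers == 0 &&
		s->readahead_distance < s->io_combine_limit)
		return s->readahead_distance + s->combine_distance;
	return s->readahead_distance;
}

/* Role 2: with a high readahead_distance, combine_distance instead caps
 * how large the pending read grows before it is issued. */
static bool
should_issue_now(const Stream *s)
{
	if (s->pending_read_nblocks >= s->io_combine_limit)
		return true;
	if (s->combine_distance > 0 &&
		s->pending_read_nblocks >= s->combine_distance)
		return true;
	return false;
}
```

The same field drives a lookahead bound in one regime and an issue trigger in the other, which is the asymmetry described above.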

- Melanie


