Re: AIO / read stream heuristics adjustments for index prefetching - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: AIO / read stream heuristics adjustments for index prefetching |
| Date | |
| Msg-id | dj6njk2kfy4ztfduptrk5olv43j2zhebhbz62ip2y7gyeddirm@s2ywu64xxftw |
| In response to | Re: AIO / read stream heuristics adjustments for index prefetching (Melanie Plageman <melanieplageman@gmail.com>) |
| Responses | Re: AIO / read stream heuristics adjustments for index prefetching |
| List | pgsql-hackers |
Hi,
On 2026-04-03 11:46:59 -0400, Melanie Plageman wrote:
> On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > What do you think about the updated patch to achieve that that I posted?
>
> here is some review of the earlier posted 0005 and 0006.
> concrete things:
> -------
> - I’d reorder stream->distance == 0 in read_stream_look_ahead() (and
> issue); I found myself asking why it wasn’t first
>
> - I agree with bilal that
> if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->max_pinned_buffers)
> return false;
> should be in the other commit: it isn’t required with the current
> code because of how distance is set, and it is more confusing here.
> Wasn’t it an assert before?
Will do.
> - perhaps the new heuristic for allowing further look ahead if we are
> building an IO and it isn’t big enough yet should be in its own commit
Isn't that the whole point of the commit? What else is there?
> - why have read_stream_pause() save combine_distance if it isn’t going
> to zero it out?
It should zero it out.
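I.e. something like this (just a sketch, using the resume_combine_distance name you bring up further down):

static void
read_stream_pause(ReadStream *stream)
{
    /* remember the combine distance so it can be restored on resume */
    stream->resume_combine_distance = stream->combine_distance;

    /* ... and zero it, so a paused stream doesn't hold back IO issuance */
    stream->combine_distance = 0;
}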
> - I think the comments you added to the read stream struct for
> combine_distance and readahead_distance should indicate units (i.e.
> blocks)
Will do.
> - why not remove the combine_distance requirement from the fast path
> entry criteria (could save resume_combine_distance in the fast path
> and restore it after a miss)
Because entering the fast path prevents IO combining from happening. So it's
absolutely crucial that we do *not* enter it while we would combine.
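Concretely, the entry check has to look something like this (a sketch, not necessarily the patch's exact condition):

/*
 * Sketch: only enter the fast path when there's no prospect of IO
 * combining, i.e. a single-block pending read and no active combine
 * window. Entering earlier would forfeit combining.
 */
if (stream->pending_read_nblocks == 1 &&
    stream->combine_distance == 0)
    stream->fast_path = true;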
> - can we move the fast path to where we found out we got a hit?
Yea, I think we should. Seems like a separate project though.
> - should_issue_now has a lot of overlap with should_read_ahead which
> makes it confusing to figure out how they are different
Yea. I'm not sure how to do it better though.
> but I think that is related to readahead_distance being in units of blocks
> and not IOs (which I talk about later)
I agree that we should change that, but it seems like a separate project to
me.
> - There's also some assorted typos and comment awkwardness that I
> assume you will clean up in the polish step
> (e.g. "if we have an the process" -> "if we are in the process",
> "oveflow" -> "overflow", The NB comment still says stream->distance ==
> 0 but should say stream->readahead_distance == 0)
Yea.
> some more ambiguous stuff:
> -------
> > In my experiments the batching was primarily useful to allow to reduce the
> > command submission & interrupt overhead when interacting with storage, i.e.
> > when actually doing IO, not just copying from the page cache.
>
> It still seems to me that batching would help guarantee parallel
> memcpys for workers when data is in the kernel buffer cache because if
> the IO is quick, the same worker may pick up the next IO.
When you say workers, do you mean for io_method=worker? Or the internal kernel
threads of io_uring for async IOs?
> ---
> You mentioned that we don't want to read too far ahead (including for
> a single combined IO) in part because:
>
> > The resowner and private refcount mechanisms take more CPU cycles if you have
> > more buffers pinned
>
> But I don't see how either distance is responding to this or
> self-calibrating with this in mind
Using the minimal required distance to avoid needing to wait for IO completion
is responding to that, no? Without these patches we read ahead as far as
possible, even if all the data is in the page cache, which makes this issue
way worse (without these patches it's a major source of regressions in the
index prefetching patch).
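Roughly, the feedback loop is this (a sketch; buffer_was_cached stands in for the actual hit detection):

/*
 * Sketch of the self-calibration: cached lookups pull the lookahead
 * distance back toward 1, keeping few buffers pinned (cheap for
 * resowner / private refcounts); actual IO grows it again to hide
 * latency.
 */
if (buffer_was_cached)
{
    if (stream->readahead_distance > 1)
        stream->readahead_distance--;
}
else
    stream->readahead_distance =
        Min(stream->readahead_distance * 2,
            stream->max_pinned_buffers);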
> ---
> >> I liked the idea of being more aggressive to do IO combining. What is
> >> the reason for gradually increasing combine_distance, is it to not do
> >> unnecessary IOs at the start?
> >
> > Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
> > hurt with index scan prefetching, because of query plans where we only consume
> > part of the index scan (e.g. a nested loop antijoin).
>
> You said we have to ramp up combine_distance because in index
> prefetching when we don’t need all the blocks, it can hurt. But if we
> ramp it up unconditionally (assuming we did IO), then I don’t see how
> this would solve the problem as it will ramp up regardless after just
> a few IOs anyway.
It limits the percentage of "wasted IO". So it does not entirely solve it - I
think that's impossible - but reduces it enough to be ok.
If the consumer stops consuming after e.g. 1 buffer, an immediate rampup would
lead to 15 of the 16 read-in blocks being wasted (assuming io_combine_limit
16). So only ~6% of the read-in blocks are used.
If instead the consumer stops only after already having ramped up to the max
combine distance, it'd have to have consumed a lot more buffers, as the
distance is increased only after consuming a buffer that required IO.
Think it's something like:
io of size 1, block 0
consume buffer #1 -> combine_distance 2
io of size 2, blocks 1-2
consume buffer #2 -> combine_distance 4
consume buffer #3
io of size 4, blocks 3-6
consume buffer #4 -> combine_distance 8
consume buffers #5-#7
io of size 8, blocks 7-14
consume buffer #8 -> combine_distance 16
consume buffers #9-#15
io of size 16, blocks 15-30
consume buffer #16 -> combine_distance can't be increased further
<reset>
The consumer consumed 16 buffers, we did IO for 31 blocks and wasted 15. So
~51% of the read-in blocks are used.
While 51% is not perfect, it is still a considerable improvement in worst-case
wastage. It's also less of a problem because more CPU was spent in the
consumer, given that 16 buffers had to be processed to get to this point.
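In code form, the ramp-up amounts to something like this (a sketch; consumed_buffer_required_io stands in for the actual tracking):

/*
 * Sketch: the combine window starts at 1 and doubles only when the
 * consumer uses a buffer that required actual IO, capped at
 * io_combine_limit. A consumer stopping early therefore wastes at
 * most roughly as many blocks as it already consumed.
 */
if (consumed_buffer_required_io)
    stream->combine_distance =
        Min(stream->combine_distance * 2,
            stream->io_combine_limit);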
> ---
> > I've been experimenting with going the other way, by having
> > read_stream_should_look_ahead() do a check like
>
> /*
>  * If we already have IO in flight, but are close enough to the
>  * distance limit that we would not start a fully sized IO, don't even
>  * start a pending read until later.
>  *
>  * This avoids calling the next block callback in cases where we
>  * would not start the pending read anyway. For some users the work to
>  * determine the next block is non-trivial, so we don't want to do so
>  * earlier than necessary.
>  *
>  * A secondary benefit of this is that some callers use parallel workers,
>  * each with their own read stream, to process a global list of blocks, and
>  * only calling the next block callback when ready to actually issue IO
>  * makes it more likely for one backend to get consecutive blocks.
>  */
> if (stream->pinned_buffers > 0 &&
>     stream->pending_read_nblocks == 0 &&
>     stream->pinned_buffers + stream->combine_distance >=
>     stream->readahead_distance)
>     return false;
>
> So, as you say, with, for example, a large io_combine_limit and small
> effective_io_concurrency, this would result in much lower IO
> concurrency than we want. But, I think this speaks to the central
> tension I see with the new combine_distance and readahead_distance: it
> seems like readahead_distance should now be in units of IO and
> combine_distance in units of blocks. If that were the case, this
> heuristic wouldn't have a downside.
>
> Obviously ramping readahead_distance up and down when it is in units
> of IO becomes much more fraught though.
I think we might want to move to that world, but I think that'll require a lot
more work to validate. In cases where you have a mix of hits and misses, with
the misses requiring actual IO, we'd end up with a lot less readahead than
today.
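For illustration, the accounting could look something like this (purely hypothetical field names, ignoring the ramping question):

/*
 * Sketch: stop looking further ahead once enough distinct IOs are in
 * flight, regardless of their size. With a mix of hits and misses this
 * budget is exhausted much sooner than a block-based distance - hence
 * the reduced-readahead concern above.
 */
if (stream->ios_in_flight >= stream->readahead_distance_ios)
    return false;    /* don't look ahead any further */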
> ---
> I think it is weird to have combine_distance only be relevant when
> readahead_distance is low. You said:
>
> > We could, but I don't think there would be a benefit in doing so. In my mind,
> > what combine_distance is used for, is to control how far to look ahead to
> > allow for IO combining when the readahead_distance is too low to allow for IO
> > combining. But if pinned_buffers > 0, we already have another IO in flight,
> > so the combine_distance mechanism doesn't have to kick in.
>
> But it seems like for a completely uncached workload bigger IOs are
> still beneficial.
Massively so - the storage layer getting too small IOs really hurts.
But, as the code stands, we *do* end up with large IOs in that case, because
we will not issue the IO until it is "complete". If we need to do actual IO,
the readahead_distance will be larger, and allow multiple full sized IOs to be
issued.
/*
 * We don't start the pending read just because we've hit the distance limit,
 * preferring to give it another chance to grow to full io_combine_limit size
 * once more buffers have been consumed. But this is not desirable in all
 * situations - see below.
 */
static inline bool
read_stream_should_issue_now(ReadStream *stream)
{
    ...

    /*
     * If we've already reached io_combine_limit, there's no chance of growing
     * the read further.
     */
    if (pending_read_nblocks >= stream->io_combine_limit)
        return true;

    /* same if capped not by io_combine_limit but combine_distance */
    if (stream->combine_distance > 0 &&
        pending_read_nblocks >= stream->combine_distance)
        return true;

    /*
     * If we currently have no reads in flight or prepared, issue the IO once
     * we are not looking ahead further. This ensures there's always at least
     * one IO prepared.
     */
    if (stream->pinned_buffers == 0 &&
        !read_stream_should_look_ahead(stream))
        return true;

    return false;
}
So, unless we have stopped looking ahead, we don't issue IOs until they are
fully sized (or can't be combined further, obviously).
Am I misunderstanding what you're talking about?
Greetings,
Andres Freund