Re: AIO / read stream heuristics adjustments for index prefetching - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: AIO / read stream heuristics adjustments for index prefetching |
| Date | |
| Msg-id | aafskrrvefnb6p7zg3xnzau3m2iywfwrxfcmqx5vr7673j4qio@hha6e55rolqd |
| In response to | Re: AIO / read stream heuristics adjustments for index prefetching (Nazir Bilal Yavuz <byavuz81@gmail.com>) |
| Responses | Re: AIO / read stream heuristics adjustments for index prefetching |
| List | pgsql-hackers |
Hi,
On 2026-04-02 15:30:26 +0300, Nazir Bilal Yavuz wrote:
> 0002: read_stream: Prevent distance from decaying too quickly
>
> + /*
> + * As we needed IO, prevent distance from being reduced within our
> + * maximum look-ahead window. This avoids having distance collapse too
> + * quickly in workloads where most of the required blocks are cached,
> + * but where the remaining IOs are a big enough factor to cause
> + * a substantial slowdown if executed synchronously.
> + *
> + * There are valid arguments for preventing decay for max_ios or for
> + * max_pinned_buffers. But the argument for max_pinned_buffers seems
> + * clearer - if we can't see any misses within the maximum look-ahead
> + * distance, we can't do any useful read-ahead.
> + */
> + stream->distance_decay_holdoff = stream->max_pinned_buffers;
>
> That is already committed but I have a question. Did you think about
> setting stream->distance_decay_holdoff to current stream->distance?
> This will also fix 'miss followed by a hit' (it won't fix a double
> miss followed by a hit, though).
I am fairly certain that would not work well, because the whole point of the
mechanism is to prevent the distance from staying too small when the distance
is low.
We want the distance to grow for cases like miss, hit{10}, miss, ... because
that will typically still be bottlenecked by IO latency. And something like
setting the decay holdoff to the current distance wouldn't work for that.
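To illustrate, the committed mechanism amounts to roughly the following (a
simplified sketch, not the actual read_stream.c control flow; "need_io" is a
stand-in for the miss/hit determination):

    if (need_io)
    {
        /* miss: grow the look-ahead window and arm the holdoff */
        stream->distance = Min(stream->distance * 2,
                               stream->max_pinned_buffers);
        stream->distance_decay_holdoff = stream->max_pinned_buffers;
    }
    else if (stream->distance_decay_holdoff > 0)
    {
        /* hit, but a miss is still within the look-ahead window */
        stream->distance_decay_holdoff--;
    }
    else if (stream->distance > 1)
    {
        /* hit with no recent miss: let the distance decay */
        stream->distance--;
    }

With the holdoff set to max_pinned_buffers rather than the current distance,
a pattern like miss, hit{10}, miss keeps the window open long enough for the
distance to continue growing.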
> 0003: aio: io_uring: Trigger async processing for large IOs
>
> There is a small optimization opportunity: we don't need to calculate
> io_size for DIO, since pgaio_uring_should_use_async() will always
> return false. Do you think it is worth implementing this? Other than
> that, LGTM.
I doubt it. Compared to actually doing IO, the cost of this is negligible.
> 0005: WIP: read_stream: Move logic about IO combining & issuing to helpers
>
> /* never pin more buffers than allowed */
> if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->max_pinned_buffers)
> return false;
>
> /*
> * Don't start more read-ahead if that'd put us over the distance limit
> * for doing read-ahead.
> */
> if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->distance)
> return false;
>
> AFAIK, stream->distance can't be higher than
> stream->max_pinned_buffers [1], so I think we can remove the first if
> block. If we are not sure about [1], stream->max_pinned_buffers will
> most likely be higher than stream->distance, so it would make sense
> to change the order of these checks.
>
> Aha, I understood this change after looking at 0006. It is a nitpick,
> but the 'if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->max_pinned_buffers)' block could be added in 0006 instead.
> Other than these, LGTM.
Hm. I guess that could make sense.
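Sketched out, and assuming (per [1]) that stream->distance never exceeds
stream->max_pinned_buffers, the two checks could collapse into a single
comparison against the tighter limit:

    /* sketch: Min() keeps both limits honored even if [1] doesn't hold */
    if (stream->pinned_buffers + stream->pending_read_nblocks >=
        Min(stream->distance, stream->max_pinned_buffers))
        return false;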
> 0006: WIP: read stream: Split decision about look ahead for AIO and combining
>
> I liked the idea of being more aggressive about IO combining. What is
> the reason for gradually increasing combine_distance? Is it to avoid
> doing unnecessary IOs at the start?
Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
hurt with index scan prefetching, because of query plans where we only consume
part of the index scan (e.g. a nested loop antijoin).
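For illustration, the ramp-up amounts to something like this (a sketch; the
actual increment policy in the WIP patch may differ, and "need_io" again is
a stand-in for the miss determination):

    /* sketch: grow combine_distance only once IO has proven necessary,
     * so a scan abandoned early never pays for speculative look-ahead */
    if (need_io && stream->combine_distance < stream->io_combine_limit)
        stream->combine_distance++;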
> + /*
> + * XXX: Should we actually reduce this at any time other than
> + * a reset? For now we have to, as this is also a condition
> + * for re-enabling fast_path.
> + */
> + if (stream->combine_distance > 1)
> + stream->combine_distance--;
>
> I don't think we need to reduce this other than at reset.
Hm. I go back and forth on that one :)
> + /*
> + * Allow looking further ahead if we are in the process of building a
> + * larger IO, the IO is not yet big enough and we don't yet have IO in
> + * flight. Note that this is allowed even if we are reaching the
> + * read-ahead limit (but not the buffer pin limit).
> + *
> + * This is important for cases where either effective_io_concurrency is
> + * low or we never need to wait for IO and thus are not increasing the
> + * distance. Without this we would end up with lots of small IOs.
> + */
> + if (stream->pending_read_nblocks > 0 &&
> + stream->pinned_buffers == 0 &&
> + stream->pending_read_nblocks < stream->combine_distance)
> + return true;
>
> Do we need to check 'stream->pinned_buffers == 0' here? I think we can
> continue to look ahead even though there are pinned buffers, as we
> already know 'stream->pinned_buffers + stream->pending_read_nblocks <
> stream->max_pinned_buffers'.
We could, but I don't think there would be a benefit in doing so. In my mind,
combine_distance controls how far to look ahead to allow for IO combining
when the readahead_distance is too low to permit it. But if pinned_buffers >
0, we already have another IO in flight, so the combine_distance mechanism
doesn't have to kick in.
I've been experimenting with going the other way, by having
read_stream_should_look_ahead() do a check like
/*
* If we already have IO in flight, but are close enough to the
* distance limit that we would not start a fully sized IO, don't even
* start a pending read until later.
*
* This avoids calling the next block callback in cases where we
* would not start the pending read anyway. For some users the work to
* determine the next block is non-trivial, so we don't want to do it
* earlier than necessary.
*
* A secondary benefit is that some callers use parallel workers, each
* with their own read stream, to process a global list of blocks; only
* calling the next block callback when ready to actually issue IO
* makes it more likely for one backend to get consecutive blocks.
*/
if (stream->pinned_buffers > 0 &&
stream->pending_read_nblocks == 0 &&
stream->pinned_buffers + stream->combine_distance >= stream->readahead_distance)
return false;
I do see that improve performance with some bitmap heap scan queries
(e.g. Tomas' "cyclic" ones). But I'm worried it'd hurt other queries when
using a low effective_io_concurrency. So that's probably something for later.
> + /* same if capped not by io_combine_limit but combine_distance */
> + if (stream->combine_distance > 0 &&
> + pending_read_nblocks >= stream->combine_distance)
> + return true;
>
> I don't think we need to check 'stream->combine_distance > 0'; it
> shouldn't be less than or equal to zero when
> 'stream->readahead_distance' is nonzero, right?
Yea. I guess we could just assert that.
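I.e. something like (sketch):

    /* readahead_distance > 0 should imply a usable combine_distance */
    Assert(stream->combine_distance > 0);

    /* same if capped not by io_combine_limit but combine_distance */
    if (pending_read_nblocks >= stream->combine_distance)
        return true;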
> + {
> + stream->readahead_distance = Min(max_pinned_buffers,
> + stream->io_combine_limit);
> + stream->combine_distance = stream->io_combine_limit;
> + }
>
> I think the stream->combine_distance should be equal to
> stream->readahead_distance, as we don't want to exceed
> max_pinned_buffers.
I don't think it could lead to exceeding max_pinned_buffers, due to the
/* never pin more buffers than allowed */
check in read_stream_should_look_ahead(), but there's no reason to rely on
that.
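I.e. (sketch):

    /* cap both by the pin limit instead of relying on that check */
    stream->readahead_distance = Min(max_pinned_buffers,
                                     stream->io_combine_limit);
    stream->combine_distance = stream->readahead_distance;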
Greetings,
Andres Freund