Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Streaming I/O, vectored I/O (WIP)
Date
Msg-id CA+hUKGKNC3G+B5sbRSfoid8AvA_mKmCmb65KBmq4AOnKax2YLg@mail.gmail.com
Whole thread Raw
In response to Re: Streaming I/O, vectored I/O (WIP)  (Nazir Bilal Yavuz <byavuz81@gmail.com>)
List pgsql-hackers
On Wed, Feb 7, 2024 at 11:54 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> 0001-Provide-vectored-variant-of-ReadBuffer:
>
> - Do we need to pass the hit variable to ReadBuffer_common()? I think
> it can be just declared in the ReadBuffer_common() now.

Right, thanks!  Done, in the version I'll post shortly.

> 0003-Provide-API-for-streaming-reads-of-relations:
>
> - Do we need to re-think about getting a victim buffer logic?
> StrategyGetBuffer() function errors if it can not find any unpinned
> buffers, this can be more common in the async world since we pin
> buffers before completing the read (while looking ahead).

Hmm, well that is what the pin limit machinery is supposed to be the
solution to.  It has always been possible to see that error if
shared_buffers is too small, so that your backends can't pin what they
need to make progress because there are too many other backends, the
whole buffer pool is pinned and there is nothing available to
steal/evict.  Here, sure, we pin more stuff per backend, but not more
than a "fair share", that is, Buffers / max backends, so it's not any
worse, is it?  Well maybe it's marginally worse in some case, for
example if a query that uses many streams has one pinned buffer per
stream (which we always allow) where before we'd have acquired and
released pins in a slightly different sequence or whatever, but there
is always going to be a minimum shared_buffers that will work at all
for a given workload and we aren't changing it by much if at all here.
If you're anywhere near that limit, your performance must be so bad
that it'd only be a toy setting anyway.  Does that sound reasonable?

Note that this isn't the first to use multi-pin logic or that limit
mechanism: that was the new extension code that shipped in 16.  This
will do that more often, though.

> - If the returned block from the callback is an invalid block,
> pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
> be cases like the returned block being an invalid block but we should
> continue to read after this invalid block?

Yeah, I think there will be, and I think we should do it with some
kind of reset/restart function.  I don't think we need it for the
current users so I haven't included it yet (there is a maybe-related
discussion about reset for another reasons, I think Melanie has an
idea about that), but I think something like that will useful for
future stuff like streaming recovery, where you can run out of WAL to
read but more will come via the network soon.

> - max_pinned_buffers and pinned_buffers_trigger variables are set in
> the initialization part (in the
> pg_streaming_read_buffer_alloc_internal() function) then they do not
> change. In some cases there could be no acquirable buffers to pin
> while initializing the pgsr (LimitAdditionalPins() set
> max_pinned_buffers to 1) but while the read is continuing there could
> be chances to create larger reads (other consecutive reads are
> finished while this read is continuing). Do you think that trying to
> reset max_pinned_buffers and pinned_buffers_trigger to have higher
> values after the initialization to have larger reads make sense?

That sounds hard!  You're right that in the execution of a query there
might well be cases like that (inner and outer scan of a hash join
don't actually run at the same time, likewise for various other plan
shapes), and something that would magically and dynamically balance
resource usage might be ideal, but I don't know where to begin.
Concretely, as long as your buffer pool is measured in gigabytes and
your max backends is measured in hundreds, the per backend pin limit
should actually be fairly hard to hit anyway, as it would be in the
thousands.  So I don't think it is as important as other resource
usage balance problems that we also don't attempt (memory, CPU, I/O
bandwidth).

> +        /* Is there a head range that we can't extend? */
> +        head_range = &pgsr->ranges[pgsr->head];
> +        if (head_range->nblocks > 0 &&
> +            (!need_complete ||
> +             !head_range->need_complete ||
> +             head_range->blocknum + head_range->nblocks != blocknum))
> +        {
> +            /* Yes, time to start building a new one. */
> +            head_range = pg_streaming_read_new_range(pgsr);
>
> - I think if both need_complete and head_range->need_complete are
> false, we can extend the head range regardless of the consecutiveness
> of the blocks.

Yeah, I think we can experiment with ideas like that.  Not done yet
but I'm thinking about it -- more shortly.

> 0006-Allow-streaming-reads-to-ramp-up-in-size:
>
> - ramp_up_pin_limit variable is declared as an int but we do not check
> the overflow while doubling it. This could be a problem in longer
> reads.

But it can't get above very high, because eventually it exceeds
max_pinned_buffers, which is anchored to the ground by various small
limits.



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Built-in CTYPE provider
Next
From: Thomas Munro
Date:
Subject: Re: Streaming I/O, vectored I/O (WIP)