Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: Streaming I/O, vectored I/O (WIP) |
| Date | |
| Msg-id | CA+hUKGKNC3G+B5sbRSfoid8AvA_mKmCmb65KBmq4AOnKax2YLg@mail.gmail.com |
| In response to | Re: Streaming I/O, vectored I/O (WIP) (Nazir Bilal Yavuz <byavuz81@gmail.com>) |
| List | pgsql-hackers |
On Wed, Feb 7, 2024 at 11:54 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

> 0001-Provide-vectored-variant-of-ReadBuffer:
>
> - Do we need to pass the hit variable to ReadBuffer_common()? I think
> it can be just declared in the ReadBuffer_common() now.

Right, thanks! Done, in the version I'll post shortly.

> 0003-Provide-API-for-streaming-reads-of-relations:
>
> - Do we need to re-think about getting a victim buffer logic?
> StrategyGetBuffer() function errors if it can not find any unpinned
> buffers, this can be more common in the async world since we pin
> buffers before completing the read (while looking ahead).

Hmm, well, that is what the pin limit machinery is supposed to solve. It has always been possible to see that error if shared_buffers is too small for your backends to pin what they need to make progress: too many other backends, the whole buffer pool pinned, and nothing available to steal/evict. Here, sure, we pin more per backend, but not more than a "fair share", that is, buffers / max backends, so it's not any worse, is it? Well, maybe it's marginally worse in some cases, for example if a query that uses many streams holds one pinned buffer per stream (which we always allow) where before we'd have acquired and released pins in a slightly different sequence, but there is always going to be a minimum shared_buffers that works at all for a given workload, and we aren't changing it by much here, if at all. If you're anywhere near that limit, your performance must be so bad that it'd only be a toy setting anyway. Does that sound reasonable?

Note that this isn't the first code to use multi-pin logic or that limit mechanism: the new relation extension code that shipped in 16 does that too. This will do it more often, though.

> - If the returned block from the callback is an invalid block,
> pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
> be cases like the returned block being an invalid block but we should
> continue to read after this invalid block?

Yeah, I think there will be, and I think we should handle that with some kind of reset/restart function. I don't think we need it for the current users, so I haven't included it yet (there is a maybe-related discussion about reset for another reason; I think Melanie has an idea about that), but I think something like that will be useful for future work such as streaming recovery, where you can run out of WAL to read but more will arrive over the network soon.

> - max_pinned_buffers and pinned_buffers_trigger variables are set in
> the initialization part (in the
> pg_streaming_read_buffer_alloc_internal() function) then they do not
> change. In some cases there could be no acquirable buffers to pin
> while initializing the pgsr (LimitAdditionalPins() set
> max_pinned_buffers to 1) but while the read is continuing there could
> be chances to create larger reads (other consecutive reads are
> finished while this read is continuing). Do you think that trying to
> reset max_pinned_buffers and pinned_buffers_trigger to have higher
> values after the initialization to have larger reads make sense?

That sounds hard! You're right that in the execution of a query there may well be cases like that (the inner and outer scans of a hash join don't actually run at the same time, and likewise for various other plan shapes), and something that would magically and dynamically balance resource usage might be ideal, but I don't know where to begin.
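To put rough numbers on that per-backend pin limit, here is a toy sketch of the "fair share" arithmetic. This is not the code in the patch set, and not how LimitAdditionalPins() is actually written; the names below are invented for illustration only.

```c
/*
 * Toy sketch only: each backend's look-ahead pins are capped at roughly
 * its slice of the buffer pool, so one streaming backend cannot starve
 * the others of evictable buffers.
 */
#include <stdio.h>

static int
fair_share_pin_limit(int nbuffers, int max_backends, int already_pinned)
{
	int		fair_share = nbuffers / max_backends;
	int		additional;

	if (fair_share < 1)
		fair_share = 1;			/* always allow at least one pin */

	additional = fair_share - already_pinned;
	return (additional > 0) ? additional : 0;
}

int
main(void)
{
	/* 16GB of 8KB buffers shared by 200 backends. */
	int		nbuffers = (16 * 1024 * 1024) / 8;	/* 2097152 buffers */
	int		max_backends = 200;

	printf("additional pins allowed: %d\n",
		   fair_share_pin_limit(nbuffers, max_backends, 0));
	return 0;
}
```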
Concretely, as long as your buffer pool is measured in gigabytes and your max backends is measured in hundreds, the per-backend pin limit should actually be fairly hard to hit anyway, as it would be in the thousands. So I don't think it is as important as the other resource usage balance problems that we also don't attempt to solve (memory, CPU, I/O bandwidth).

> + /* Is there a head range that we can't extend? */
> + head_range = &pgsr->ranges[pgsr->head];
> + if (head_range->nblocks > 0 &&
> +     (!need_complete ||
> +      !head_range->need_complete ||
> +      head_range->blocknum + head_range->nblocks != blocknum))
> + {
> +     /* Yes, time to start building a new one. */
> +     head_range = pg_streaming_read_new_range(pgsr);
>
> - I think if both need_complete and head_range->need_complete are
> false, we can extend the head range regardless of the consecutiveness
> of the blocks.

Yeah, I think we can experiment with ideas like that. Not done yet, but I'm thinking about it -- more shortly.

> 0006-Allow-streaming-reads-to-ramp-up-in-size:
>
> - ramp_up_pin_limit variable is declared as an int but we do not check
> the overflow while doubling it. This could be a problem in longer
> reads.

But it can't get very high, because eventually it exceeds max_pinned_buffers, which is anchored to the ground by various small limits.
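To spell that argument out, here is a toy model; it is not the 0006 code itself, just an illustration of why the unchecked doubling cannot run away once max_pinned_buffers is small.

```c
/*
 * Toy model only: ramp_up_pin_limit is doubled without an overflow check,
 * but the doubling only matters while the value is still below
 * max_pinned_buffers; once it passes that (small) number the ramp-up is
 * over, so the int never gets anywhere near overflow.
 */
#include <stdio.h>

int
main(void)
{
	int		max_pinned_buffers = 64;	/* already bounded by small limits */
	int		ramp_up_pin_limit = 1;

	while (ramp_up_pin_limit < max_pinned_buffers)
	{
		printf("this round may pin up to %d buffers\n", ramp_up_pin_limit);
		ramp_up_pin_limit *= 2;	/* unchecked doubling */
	}

	printf("doubling stopped at %d (max_pinned_buffers = %d)\n",
		   ramp_up_pin_limit, max_pinned_buffers);
	return 0;
}
```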