Hi,
On Wed, Aug 13, 2025 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 5:29 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > 1. The leading block is BM_VALID, so we choose to give it to you
> > > immediately and not look further (we could look further and return
> > > more than one consecutive BM_VALID block at a time, but this isn't
> > > implemented)
> >
> > I am curious about why this isn't implemented. It looks helpful.
> > Is there any blocking issue or trade-off for not doing so?
>
> stream->distance tends to be 1 for well cached data (it grows when IO
> is needed and it is worth looking far ahead, and shrinks when data is
> found in cache). So we won't currently have a distance > 1 for well
> cached data, so we wouldn't benefit much. That may not always be
> true: if the buffer mapping table is changed to a tree data structure
> (like the way kernels manage virtual memory pages associated with a
> file or other VM object), then we might have an incentive to look up
> lots of cached blocks at the same time, and then we might want to
> change the distance tuning algorithm, and then we might want
> StartReadBuffers() to be able to return more than one cached block at
> a time. We've focused mainly on I/O so far, and just tried not to
> make the well cached cases worse.
Understood. I came across Alexander's article on the topic. It appears
to require a major re-architecture, which could be challenging to land
in HEAD.
https://www.orioledb.com/blog/buffer-management
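
Just to check my understanding of the distance heuristic, here is a toy
model of how I read the ramp-up/decay behaviour (my paraphrase, not the
actual read_stream.c code; max_pinned just stands in for the real cap on
pinned buffers):

/* Toy model of the look-ahead distance heuristic as I understand it
 * (a paraphrase, not the actual read_stream.c code). */
#include <stdio.h>

int
main(void)
{
    int distance = 1;
    int max_pinned = 16;
    /* 1 = block needed I/O, 0 = block was found in cache */
    int pattern[] = {1, 1, 1, 0, 0, 0, 0, 0, 0, 0};
    int n = sizeof(pattern) / sizeof(pattern[0]);

    for (int i = 0; i < n; i++)
    {
        if (pattern[i])
        {
            /* I/O needed: grow quickly so reads can be combined/overlapped */
            distance *= 2;
            if (distance > max_pinned)
                distance = max_pinned;
        }
        else if (distance > 1)
        {
            /* cached: decay back toward 1 so a fully cached stream
             * doesn't pin buffers it doesn't need */
            distance--;
        }
        printf("block %2d: %-3s -> distance = %d\n",
               i, pattern[i] ? "I/O" : "hit", distance);
    }
    return 0;
}

If that model is right, it also shows why a fully cached stream settles
back to distance = 1, so there is currently little to gain from
returning more than one cached block per call.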
>
> > The format of this part is not aligned well in gmail, so I copied it into VS Code.
> > Is this layout right?
>
> Yes.
>
> > I found the second illustration somewhat hard to follow, especially
> > the "do nothing" trick and the movement of next_buffer_index in the second queue.
> > Maybe I need to read the corresponding code.
>
> Suppose you have pending_read_blocknum = 100, pending_read_nblocks = 5,
> next_buffer_index = 200. Now you decide to start that read, because
> the next block the caller wants can't be combined, so you call
> StartReadBuffers(blocknum = 100, *nblocks = 5, buffers =
> &buffers[200]). StartReadBuffers() pins 5 buffers and starts an IO,
> but finds a reason to split it at size 2. It sets *nblocks to 2, and
> returns. Now read_stream.c adds 2 to pending_read_blocknum, subtracts
> 2 from pending_read_nblocks, so that it represents the continuation of
> the previous read. It also adds 2 to next_buffer_index with modulo
> arithmetic (so it wraps around the queue), because that is the
> location of the buffers for the next read, and sets forwarded_buffers
> to 3, because there are 3 buffers already pinned (in that patch ^ it
> is renamed to pending_read_npinned). The three buffers themselves are
> already in the queue at the right position for StartReadBuffers() to
> receive them and know that it doesn't have to pin them again. What I
> was referring to with "doing nothing" is the way that if you keep
> sliding StartReadBuffers() call along the queue, the extra buffers it
> spits out become its input next time, without having to be copied
> anywhere. In that picture there are two forwarded buffers, labeled 9
> and A.
Got it, that's an elegant technique. Patch 2 also reads as a robustness
improvement now. :-) I'll try to map the high-level design onto the code
and review patch 2.
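
To check that I've followed the bookkeeping, the state in your example
evolves roughly like this (field names as in the patch, where
forwarded_buffers becomes pending_read_npinned):

Before the call:
  pending_read_blocknum = 100
  pending_read_nblocks  = 5
  next_buffer_index     = 200
  forwarded buffers     = 0

StartReadBuffers(blocknum = 100, *nblocks = 5, buffers = &buffers[200])
  splits at size 2, sets *nblocks = 2, and returns with buffers[202..204]
  still pinned but not part of the started read.

After the call:
  pending_read_blocknum = 102     (100 + 2)
  pending_read_nblocks  = 3       (5 - 2)
  next_buffer_index     = 202     ((200 + 2) modulo queue size)
  forwarded buffers     = 3

So the next StartReadBuffers() call begins at buffers[202] and finds the
three already-pinned buffers sitting there as its input, with no copying
needed.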
Thanks for the great guidance!
Best,
Xuneng