On Sat, Jul 19, 2025 at 11:23 PM Tomas Vondra <tomas@vondra.me> wrote:
> Thanks for the link. It seems I came up with an almost the same patch,
> with three minor differences:
>
> 1) There's another place that sets "distance = 0" in
> read_stream_next_buffer, so maybe this should preserve the distance too?
>
> 2) I suspect we need to preserve the distance at the beginning of
> read_stream_reset, like
>
> stream->reset_distance = Max(stream->reset_distance,
> stream->distance);
>
> because what if you call _reset before reaching the end of the stream?
>
> 3) Shouldn't it reset the reset_distance to 0 after restoring it?
Probably. Hmm... an earlier version of this code didn't use distance
== 0 to indicate end-of-stream, but instead had a separate internal
end_of_stream flag. If we brought that back and didn't clobber
distance, we wouldn't need this save-and-restore dance. It seemed
shorter and sweeter without it back then, before _reset() existed in
its present form, but I wonder if end_of_stream would be nicer than
having to add this kind of stuff, without measurable downsides.
> > There was also some discussion at the time about whether "reset so I
> > can rescan", and "reset so I can continue after a temporary stop"
> > should be different operations requiring different APIs. It now seems
> > like one operation is sufficient, but it should preserve the distance
> > as you showed and then let the algorithm learn about already-cached
> > data in the rescan case (if it is even true then, which is also
> > debatable since it depends on the size of the scan). So, I think we
> > should just go ahead and commit a patch like that.
>
> Not sure. To me it seems more like two distinct cases, but I'm not sure
> if it requires two distinct "operations" with distinct API. Perhaps a
> simple flag for the _reset() would be enough? It'd need to track the
> distance anyway, just in case.
>
> Consider for example a nested loop, which does a rescan every time the
> outer row changes. Is there a reason to believe the outer rows will need
> the same number of inner rows? Aren't those "distinct streams"? Maybe
> I'm thinking about this wrong, of course.
Good question. Yeah, your flag idea seems like a good way to avoid
baking opinion into this level. I wonder if it should be a bitmask
rather than a boolean, in case we think of more things that need to be
included or not when resetting.
> The thing that however concerns me is that what I observed was not the
> distance getting reset to 1, and then ramping up. Which should happen
> pretty quickly, thanks to the doubling. In my experiments it *never*
> ramped up again, it stayed at 1. I still don't quite understand why.
Huh. Will look into that on Monday.