Andres Freund <andres@anarazel.de> writes:
> On 2023-11-19 14:08:05 -0500, Tom Lane wrote:
>> So that results in not having to deconstruct most of the tuple,
>> whereas in the new code we do have to, thanks to b8d7f053c's
>> decision to batch all the variable-value-extraction work.
> Yea, I think we were aware at the time that this does have downsides - it's
> just that the worst-case downsides of *not* batching are much bigger than the
> worst-case downside of batching.
Agreed. Still ...
> We actually did add fastpaths for a few similar cases: ExecJustInnerVar() etc
> will just use slot_getattr(). These can be used when the result is just a
> single variable. However, the goal there was more to avoid "interpreter
> startup" overhead, rather than evaluation overhead.
Yeah. Also, if I'm reading the example correctly, Daniel's case
*does* involve fetching more than a single column --- but the other
columns are near the start of the tuple, so previously we didn't
have to deform very much of it.
> What if we instead load 8 bytes of the bitmap into a uint64 before entering
> the loop, and shift an "index" mask into the bitmap by one each iteration
> through the loop?
Meh. Seems like a micro-optimization that does nothing for the big-O
problem. One thing to think about is that I suspect "all the columns
are null" is just a simple test case, not very representative of
the real-world problem. In the real case, probably quite a few of
the leading columns are non-null, which would make Daniel's issue
even worse because slot_deform_tuple would have to do significantly
more work than it did before. Shaving cycles off the null-column
fast path would be proportionally less useful too.
It might well be that what you suggest is worth doing just to cut
the cost of slot_deform_tuple across the board, but I don't think
it's an answer to this complaint specifically.
regards, tom lane