Re: execExprInterp() questions / How to improve scalar array op expreval? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: execExprInterp() questions / How to improve scalar array op expreval?
Date
Msg-id 20200411180132.ccxokkyvdz7rgxll@alap3.anarazel.de
Whole thread Raw
In response to execExprInterp() questions / How to improve scalar array op expr eval?  (James Coleman <jtc331@gmail.com>)
Responses Re: execExprInterp() questions / How to improve scalar array op expr eval?  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: execExprInterp() questions / How to improve scalar array op expr eval?  (James Coleman <jtc331@gmail.com>)
List pgsql-hackers
Hi,


Tom, CCing you because of expanded datum question at bottom.


On 2020-04-11 08:58:46 -0400, James Coleman wrote:
> - Does the execExpr/execExprInterp framework allow a scalar array op
> to get an already expanded array (unless I'm missing something we
> can't easily lookup a given index in a flattened array)?

Well, I'm not quite sure what you're thinking of. If the input is
constant, then expression initialization can do just about everything it
wants. Including preprocessing the array into whatever form it wants.
But there's no logic for doing preprocessing whenever values change.


> - If not, is there a way in that framework to know if the array expr
> has stayed the same through multiple evaluations of the expression
> tree (i.e., so you could expand and sort it just once)?

No.


> Long version/background:
> 
> I've been looking at how a query like:
> select * from t where i in (<long array>);
> executes as a seq scan the execution time increases linearly with
> respect to the length of the array. That's because in
> execExprInterp.c's ExecEvalScalarArrayOp() we do a linear search
> through the array.

Is "<long array>" constant in the cases you're interested in? Because
it's a heck of a lot easier to add an optimization for that case, than
adding runtime tracking the array values by comparing the whole array
for equality with the last - the comparisons of the whole array could
easily end up adding more cost than what's being saved.


> I've been considering approaches to improve the seq scan case. We might:
> - At plan time rewrite as a hash join to a deduped array values
> relation (gross, but included here since that's an approach we can
> take rewriting the SQL itself as a proof of concept).
> - At execution time build a hash and lookup.
> - At execution time sort the array and binary search through it.
> 
> I've been working other last approach to see what results I might get
> (it seemed like the easiest to hack together). Putting that together
> left me with the questions mentioned above in the "short version".
> 
> Obviously if anyone has thoughts on the above approaches I'd be
> interested in that too.

If you're content with optimizing constant arrays, I'd go for detecting
that case in the T_ScalarArrayOpExpr case in ExecInitExprRec(), and
preprocessing the array into an optimized form. Probably with a separate
opcode for execution.


> Side question: when we do:
> arr = DatumGetArrayTypeP(*op->resvalue);
> in ExecEvalScalarArrayOp() is that going to be expensive each time
> through a seq scan? Or is it (likely) going to resolve to an already
> in-memory array and effectively be the cost of retrieving that
> pointer?

It Depends TM.  For the constant case it's likely going to be cheap-ish,
because it'll not be toasted. For the case where it's the return value
from a subquery or something, you cannot assume it won't change between
calls.

I think the worst case here is something like a nestloop, where the
inner side does foo IN (outer.column). If I recall the code correctly,
we'll currently end up detoasting the array value every single
iteration.

I wonder if it would be a good idea to change ExecEvalParamExec and
ExecEvalParamExtern to detoast the to-be-returned value. If we change
the value that's stored in econtext->ecxt_param_exec_vals /
econtext->ecxt_param_list_info, we'd avoid repeated detaosting.

It'd be nice if we somehow could make the expanded datum machinery work
here. I'm not quite seeing how though?

Crazy idea: I have a patch to make execExprInterp largely use
NullableDatum. Tom and I had theorized a while ago about adding
additional fields in the padding that currently exists in it. I wonder
if we could utilize a bit in there to allow to expand in-place?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Complete data erasure
Next
From: Tomas Vondra
Date:
Subject: Re: Complete data erasure