Re: execExprInterp() questions / How to improve scalar array op expr eval? - Mailing list pgsql-hackers

From James Coleman
Subject Re: execExprInterp() questions / How to improve scalar array op expr eval?
Date
Msg-id CAAaqYe9qm+vFtVFZbkihKhfTpFdBAdaoVH0fKrhxnXp-Wx7pKw@mail.gmail.com
Whole thread Raw
In response to Re: execExprInterp() questions / How to improve scalar array op expreval?  (Andres Freund <andres@anarazel.de>)
Responses Re: execExprInterp() questions / How to improve scalar array op expreval?
List pgsql-hackers
On Sat, Apr 11, 2020 at 2:01 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
>
> Tom, CCing you because of expanded datum question at bottom.
>
>
> On 2020-04-11 08:58:46 -0400, James Coleman wrote:
> > - Does the execExpr/execExprInterp framework allow a scalar array op
> > to get an already expanded array (unless I'm missing something we
> > can't easily lookup a given index in a flattened array)?
>
> Well, I'm not quite sure what you're thinking of. If the input is
> constant, then expression initialization can do just about everything it
> wants. Including preprocessing the array into whatever form it wants.
> But there's no logic for doing preprocessing whenever values change.
>
>
> > - If not, is there a way in that framework to know if the array expr
> > has stayed the same through multiple evaluations of the expression
> > tree (i.e., so you could expand and sort it just once)?
>
> No.

Ok. Seems like it'd be likely to be interesting (maybe in other places
too?) to know if:
- Something is actually a param that can change, and,
- When (perhaps by some kind of flag or counter) it has changed.

> > Long version/background:
> >
> > I've been looking at how a query like:
> > select * from t where i in (<long array>);
> > executes as a seq scan the execution time increases linearly with
> > respect to the length of the array. That's because in
> > execExprInterp.c's ExecEvalScalarArrayOp() we do a linear search
> > through the array.
>
> Is "<long array>" constant in the cases you're interested in? Because
> it's a heck of a lot easier to add an optimization for that case, than
> adding runtime tracking the array values by comparing the whole array
> for equality with the last - the comparisons of the whole array could
> easily end up adding more cost than what's being saved.

In the simplest case, yes, it's a constant, though it'd be obviously
better if it weren't limited to that. There are many cases where a
long array can come from a subplan but we can easily tell by looking
at the SQL that it will only ever execute once. An unimaginative case
is something like:
select * from t where a in (select i from generate_series(0,10000) n(i));

> > I've been considering approaches to improve the seq scan case. We might:
> > - At plan time rewrite as a hash join to a deduped array values
> > relation (gross, but included here since that's an approach we can
> > take rewriting the SQL itself as a proof of concept).
> > - At execution time build a hash and lookup.
> > - At execution time sort the array and binary search through it.
> >
> > I've been working other last approach to see what results I might get
> > (it seemed like the easiest to hack together). Putting that together
> > left me with the questions mentioned above in the "short version".
> >
> > Obviously if anyone has thoughts on the above approaches I'd be
> > interested in that too.
>
> If you're content with optimizing constant arrays, I'd go for detecting
> that case in the T_ScalarArrayOpExpr case in ExecInitExprRec(), and
> preprocessing the array into an optimized form. Probably with a separate
> opcode for execution.

At minimum constants are a good first place to try it out. Thanks for
the pointers.

> > Side question: when we do:
> > arr = DatumGetArrayTypeP(*op->resvalue);
> > in ExecEvalScalarArrayOp() is that going to be expensive each time
> > through a seq scan? Or is it (likely) going to resolve to an already
> > in-memory array and effectively be the cost of retrieving that
> > pointer?
>
> It Depends TM.  For the constant case it's likely going to be cheap-ish,
> because it'll not be toasted. For the case where it's the return value
> from a subquery or something, you cannot assume it won't change between
> calls.

Back to what I was saying earlier. Perhaps some kind of mechanism so
we can know that is a better place to start. Perhaps something from
the patch Tom referenced would help kickstart that. I'll take a look.

> I think the worst case here is something like a nestloop, where the
> inner side does foo IN (outer.column). If I recall the code correctly,
> we'll currently end up detoasting the array value every single
> iteration.

Ouch. Seems like that could be a significant cost in some queries?

> I wonder if it would be a good idea to change ExecEvalParamExec and
> ExecEvalParamExtern to detoast the to-be-returned value. If we change
> the value that's stored in econtext->ecxt_param_exec_vals /
> econtext->ecxt_param_list_info, we'd avoid repeated detaosting.
>
> It'd be nice if we somehow could make the expanded datum machinery work
> here. I'm not quite seeing how though?

I'm not yet familiar enough with it to comment.

> Crazy idea: I have a patch to make execExprInterp largely use
> NullableDatum. Tom and I had theorized a while ago about adding
> additional fields in the padding that currently exists in it. I wonder
> if we could utilize a bit in there to allow to expand in-place?

Effectively to store the pointers to, for example, the expanded array?

James



pgsql-hackers by date:

Previous
From: Corey Huinker
Date:
Subject: Re: Add A Glossary
Next
From: Andres Freund
Date:
Subject: Re: execExprInterp() questions / How to improve scalar array op expreval?