Re: execExprInterp() questions / How to improve scalar array op expr eval? - Mailing list pgsql-hackers
From | James Coleman |
---|---|
Subject | Re: execExprInterp() questions / How to improve scalar array op expr eval? |
Date | |
Msg-id | CAAaqYe9qm+vFtVFZbkihKhfTpFdBAdaoVH0fKrhxnXp-Wx7pKw@mail.gmail.com Whole thread Raw |
In response to | Re: execExprInterp() questions / How to improve scalar array op expreval? (Andres Freund <andres@anarazel.de>) |
Responses |
Re: execExprInterp() questions / How to improve scalar array op expreval?
|
List | pgsql-hackers |
On Sat, Apr 11, 2020 at 2:01 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > > Tom, CCing you because of expanded datum question at bottom. > > > On 2020-04-11 08:58:46 -0400, James Coleman wrote: > > - Does the execExpr/execExprInterp framework allow a scalar array op > > to get an already expanded array (unless I'm missing something we > > can't easily lookup a given index in a flattened array)? > > Well, I'm not quite sure what you're thinking of. If the input is > constant, then expression initialization can do just about everything it > wants. Including preprocessing the array into whatever form it wants. > But there's no logic for doing preprocessing whenever values change. > > > > - If not, is there a way in that framework to know if the array expr > > has stayed the same through multiple evaluations of the expression > > tree (i.e., so you could expand and sort it just once)? > > No. Ok. Seems like it'd be likely to be interesting (maybe in other places too?) to know if: - Something is actually a param that can change, and, - When (perhaps by some kind of flag or counter) it has changed. > > Long version/background: > > > > I've been looking at how a query like: > > select * from t where i in (<long array>); > > executes as a seq scan the execution time increases linearly with > > respect to the length of the array. That's because in > > execExprInterp.c's ExecEvalScalarArrayOp() we do a linear search > > through the array. > > Is "<long array>" constant in the cases you're interested in? Because > it's a heck of a lot easier to add an optimization for that case, than > adding runtime tracking the array values by comparing the whole array > for equality with the last - the comparisons of the whole array could > easily end up adding more cost than what's being saved. In the simplest case, yes, it's a constant, though it'd be obviously better if it weren't limited to that. There are many cases where a long array can come from a subplan but we can easily tell by looking at the SQL that it will only ever execute once. An unimaginative case is something like: select * from t where a in (select i from generate_series(0,10000) n(i)); > > I've been considering approaches to improve the seq scan case. We might: > > - At plan time rewrite as a hash join to a deduped array values > > relation (gross, but included here since that's an approach we can > > take rewriting the SQL itself as a proof of concept). > > - At execution time build a hash and lookup. > > - At execution time sort the array and binary search through it. > > > > I've been working other last approach to see what results I might get > > (it seemed like the easiest to hack together). Putting that together > > left me with the questions mentioned above in the "short version". > > > > Obviously if anyone has thoughts on the above approaches I'd be > > interested in that too. > > If you're content with optimizing constant arrays, I'd go for detecting > that case in the T_ScalarArrayOpExpr case in ExecInitExprRec(), and > preprocessing the array into an optimized form. Probably with a separate > opcode for execution. At minimum constants are a good first place to try it out. Thanks for the pointers. > > Side question: when we do: > > arr = DatumGetArrayTypeP(*op->resvalue); > > in ExecEvalScalarArrayOp() is that going to be expensive each time > > through a seq scan? Or is it (likely) going to resolve to an already > > in-memory array and effectively be the cost of retrieving that > > pointer? > > It Depends TM. For the constant case it's likely going to be cheap-ish, > because it'll not be toasted. For the case where it's the return value > from a subquery or something, you cannot assume it won't change between > calls. Back to what I was saying earlier. Perhaps some kind of mechanism so we can know that is a better place to start. Perhaps something from the patch Tom referenced would help kickstart that. I'll take a look. > I think the worst case here is something like a nestloop, where the > inner side does foo IN (outer.column). If I recall the code correctly, > we'll currently end up detoasting the array value every single > iteration. Ouch. Seems like that could be a significant cost in some queries? > I wonder if it would be a good idea to change ExecEvalParamExec and > ExecEvalParamExtern to detoast the to-be-returned value. If we change > the value that's stored in econtext->ecxt_param_exec_vals / > econtext->ecxt_param_list_info, we'd avoid repeated detaosting. > > It'd be nice if we somehow could make the expanded datum machinery work > here. I'm not quite seeing how though? I'm not yet familiar enough with it to comment. > Crazy idea: I have a patch to make execExprInterp largely use > NullableDatum. Tom and I had theorized a while ago about adding > additional fields in the padding that currently exists in it. I wonder > if we could utilize a bit in there to allow to expand in-place? Effectively to store the pointers to, for example, the expanded array? James
pgsql-hackers by date: