execExprInterp() questions / How to improve scalar array op expr eval? - Mailing list pgsql-hackers

From James Coleman
Subject execExprInterp() questions / How to improve scalar array op expr eval?
Date
Msg-id CAAaqYe-UQBba7sScrucDOyHb7cDoNbWf_rcLrOWeD4ikP3_qTQ@mail.gmail.com
Whole thread Raw
Responses Re: execExprInterp() questions / How to improve scalar array op expreval?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Short version:

In what I'm currently working on I had a few questions about arrays
and the execExpr/execExprInterp framework that didn't seem obviously
answered in the code or README.

- Does the execExpr/execExprInterp framework allow a scalar array op
to get an already expanded array (unless I'm missing something we
can't easily lookup a given index in a flattened array)?
- If not, is there a way in that framework to know if the array expr
has stayed the same through multiple evaluations of the expression
tree (i.e., so you could expand and sort it just once)?

Long version/background:

I've been looking at how a query like:
select * from t where i in (<long array>);
executes as a seq scan the execution time increases linearly with
respect to the length of the array. That's because in
execExprInterp.c's ExecEvalScalarArrayOp() we do a linear search
through the array.

In contrast, when setting up a btree scan with a similar saop, we
first sort the array, remove duplicates, and remove nulls. Of course
with btree scans this has other values, like allowing us to return
results in array order.

I've been considering approaches to improve the seq scan case. We might:
- At plan time rewrite as a hash join to a deduped array values
relation (gross, but included here since that's an approach we can
take rewriting the SQL itself as a proof of concept).
- At execution time build a hash and lookup.
- At execution time sort the array and binary search through it.

I've been working other last approach to see what results I might get
(it seemed like the easiest to hack together). Putting that together
left me with the questions mentioned above in the "short version".

Obviously if anyone has thoughts on the above approaches I'd be
interested in that too.

Side question: when we do:
arr = DatumGetArrayTypeP(*op->resvalue);
in ExecEvalScalarArrayOp() is that going to be expensive each time
through a seq scan? Or is it (likely) going to resolve to an already
in-memory array and effectively be the cost of retrieving that
pointer?

James



pgsql-hackers by date:

Previous
From: Jürgen Purtz
Date:
Subject: Re: Add A Glossary
Next
From: Julien Rouhaud
Date:
Subject: Re: WAL usage calculation patch