Thread: Optimize common expressions in projection evaluation

Optimize common expressions in projection evaluation

From
Peifeng Qiu
Date:
Hi hackers.

When a star(*) expands into multiple fields, our current
implementation is to generate multiple copies of the expression
and do FieldSelects. This is very inefficient because the same
expression get evaluated multiple times but actually we only need
to do it once. This is stated in ExpandRowReference().

For example:
CREATE TABLE tbl(c1 int, c2 int, c3 int);
CREATE TABLE src(v text);
CREATE FUNCTION expensive_func(input text) RETURNS t;
INSERT INTO tbl SELECT (expensive_func(v)).* FROM src;

This is effectively the same as:
INSERT INTO tbl SELECT (expensive_func(v)).c1,
(expensive_func(v)).c2, (expensive_func(v)).c3 FROM src;

In this form, expensive_func will be evaluated for every column in
tbl. If tbl has hundreds of columns we are in trouble. To partially
solve this issue, when doing projection in ExecBuildProjectionInfo,
instead of generating normal steps for FieldSelects one by one, we
can group them by the expression(arg of FieldSelect node). Then
evaluate the epxression once to get a HeapTuple, deform it into
fields, and then assign needed fields in one step. I've attached
patch that introduce EEOP_FIELD_MULTI_SELECT_ASSIGN for this.

With this patch, the following query should generate only one
NOTICE, instead of 3.

CREATE TYPE proj_type AS (a int, b int, c text);
CREATE OR REPLACE FUNCTION proj_type_func1(input text)
RETURNS proj_type AS $$
BEGIN
    RAISE NOTICE 'proj_type_func called';
    RETURN ROW(1, 2, input);
END
$$ IMMUTABLE LANGUAGE PLPGSQL;
CREATE TEMP TABLE stage_table(a text);
INSERT INTO stage_table VALUES('aaaa');
SELECT (proj_type_func1(a)).* FROM stage_table;

This patch is just proof of concept. Some unsolved questions I
can think of right now:
- Carry some information in FieldSelect from ExpandRowReference
to assist grouping?
- This can only handle FuncExpr as the root node of FieldSelect
arg. What about a more general expression?
- How to determine whether a common expression is safe to be
optimized this way? Any unexpcted side-effects?

Any thoughts on this approach?

Best regards,
Peifeng Qiu

Attachment

Re: Optimize common expressions in projection evaluation

From
"David G. Johnston"
Date:
On Fri, Dec 2, 2022 at 12:52 AM Peifeng Qiu <pgsql@qiupf.dev> wrote:
Hi hackers.

When a star(*) expands into multiple fields, our current
implementation is to generate multiple copies of the expression
and do FieldSelects. This is very inefficient because the same
expression get evaluated multiple times but actually we only need
to do it once.

And so we implemented the SQL Standard LATERAL and all was well.

Given both how long we didn't have lateral and didn't do something like this, and how long lateral has been around and this hasn't really come up, the need for this code seems not that great.  But as to the code itself I'm unable to properly judge.

David J.

Re: Optimize common expressions in projection evaluation

From
Peifeng Qiu
Date:
> the need for this code seems not that great.  But as to the code itself I'm unable to properly judge.
A simplified version of my use case is like this:
CREATE FOREIGN TABLE ft(rawdata json);
INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft;
We have a foreign data source that can emit json data in different
formats. We need different
convert_func to extract the actual fields out. The client know which
function to use, but as
the each json may have hundreds of columns, the performance is very poor.

Best regards,
Peifeng Qiu



Re: Optimize common expressions in projection evaluation

From
"David G. Johnston"
Date:
On Sun, Dec 4, 2022 at 9:00 PM Peifeng Qiu <pgsql@qiupf.dev> wrote:
> the need for this code seems not that great.  But as to the code itself I'm unable to properly judge.
A simplified version of my use case is like this:
CREATE FOREIGN TABLE ft(rawdata json);
INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft;


Which is properly written as the following, using lateral, which also avoids the problem you describe:

INSERT INTO tbl
SELECT func_call.*
FROM ft
JOIN LATERAL convert_func(ft.rawdata) AS func_call ON true;

David J.

Re: Optimize common expressions in projection evaluation

From
Tom Lane
Date:
Peifeng Qiu <pgsql@qiupf.dev> writes:
>> the need for this code seems not that great.  But as to the code itself I'm unable to properly judge.

> A simplified version of my use case is like this:
> CREATE FOREIGN TABLE ft(rawdata json);
> INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft;

It might be worth noting that the code as we got it from Berkeley
could do this scenario without multiple evaluations of convert_func().
Memory is foggy, but I believe it involved essentially a two-level
targetlist.  Unfortunately, the scheme was impossibly baroque and
buggy, so we eventually ripped it out altogether in favor of the
multiple-evaluation behavior you see today.  I think that commit
62e29fe2e might have been what ripped it out, but I'm not quite
sure.  It's about the right time-frame, anyway.

I mention this because trying to reverse-engineer this situation
in execExpr seems seriously ugly and inefficient, even assuming
you can make it non-buggy.  The right solution has to involve never
expanding foo().* into duplicate function calls in the first place,
which is the way it used to be.  Maybe if you dug around in those
twenty-year-old changes you could get some inspiration.

I tend to agree with David that LATERAL offers a good-enough
solution in most cases ... but it is annoying that we accept
this syntax and then pessimize it.

            regards, tom lane



Re: Optimize common expressions in projection evaluation

From
Pavel Stehule
Date:


po 5. 12. 2022 v 5:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Peifeng Qiu <pgsql@qiupf.dev> writes:
>> the need for this code seems not that great.  But as to the code itself I'm unable to properly judge.

> A simplified version of my use case is like this:
> CREATE FOREIGN TABLE ft(rawdata json);
> INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft;

It might be worth noting that the code as we got it from Berkeley
could do this scenario without multiple evaluations of convert_func().
Memory is foggy, but I believe it involved essentially a two-level
targetlist.  Unfortunately, the scheme was impossibly baroque and
buggy, so we eventually ripped it out altogether in favor of the
multiple-evaluation behavior you see today.  I think that commit
62e29fe2e might have been what ripped it out, but I'm not quite
sure.  It's about the right time-frame, anyway.

I mention this because trying to reverse-engineer this situation
in execExpr seems seriously ugly and inefficient, even assuming
you can make it non-buggy.  The right solution has to involve never
expanding foo().* into duplicate function calls in the first place,
which is the way it used to be.  Maybe if you dug around in those
twenty-year-old changes you could get some inspiration.

I tend to agree with David that LATERAL offers a good-enough
solution in most cases ... but it is annoying that we accept
this syntax and then pessimize it.

I agree, so there is a perfect solution like don't use .*, but on second hand any supported syntax should be optimized well or we should raise some warning about negative performance impact.

Today there are a lot of baroque technologies in the stack so it is hard to remember all good practices and it is hard for newbies to take this knowledge. We should reduce possible performance traps when it is possible.

Regards

Pavel


 

                        regards, tom lane


Re: Optimize common expressions in projection evaluation

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> po 5. 12. 2022 v 5:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
>> I tend to agree with David that LATERAL offers a good-enough
>> solution in most cases ... but it is annoying that we accept
>> this syntax and then pessimize it.

> I agree, so there is a perfect solution like don't use .*, but on second
> hand any supported syntax should be optimized well or we should raise some
> warning about negative performance impact.

Yeah.  I wonder if we could get the parser to automatically transform

    SELECT (foo(t.x)).* FROM tab t

into

    SELECT f.* FROM tab t, LATERAL foo(t.x) f

There are probably cases where this doesn't work or changes the
semantics subtly, but I suspect it could be made to work in
most cases of practical interest.

            regards, tom lane



Re: Optimize common expressions in projection evaluation

From
"David G. Johnston"
Date:
On Sun, Dec 4, 2022 at 9:37 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:

po 5. 12. 2022 v 5:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Peifeng Qiu <pgsql@qiupf.dev> writes:
>> the need for this code seems not that great.  But as to the code itself I'm unable to properly judge.

I mention this because trying to reverse-engineer this situation
in execExpr seems seriously ugly and inefficient, even assuming
you can make it non-buggy.  The right solution has to involve never
expanding foo().* into duplicate function calls in the first place,
which is the way it used to be.  Maybe if you dug around in those
twenty-year-old changes you could get some inspiration.

I tend to agree with David that LATERAL offers a good-enough
solution in most cases ... but it is annoying that we accept
this syntax and then pessimize it.

I agree, so there is a perfect solution like don't use .*, but on second hand any supported syntax should be optimized well or we should raise some warning about negative performance impact.


Yeah, I phrased that poorly.  I agree that having this problem solved would be beneficial to the project.  But, and this is probably a bit unfair to the OP, if Tom decided to implement LATERAL after years of hearing complaints I doubted this patch was going to be acceptable.  The benefit/cost ratio of fixing this in code just doesn't seem to be high enough to try at this point.  But I'd be happy to be proven wrong.  And I readily admit both a lack of knowledge, and that as time has passed maybe we've improved the foundations enough to implement something here.

Otherwise, we can maybe improve the documentation.  There isn't any way (that the project would accept anyway) to actually warn the user at runtime.  If we go that route we should probably just disallow the syntax outright.

David J.



Re: Optimize common expressions in projection evaluation

From
Peifeng Qiu
Date:
> Which is properly written as the following, using lateral, which also avoids the problem you describe:
>
> INSERT INTO tbl
> SELECT func_call.*
> FROM ft
> JOIN LATERAL convert_func(ft.rawdata) AS func_call ON true;

I didn't fully realize this syntax until you point out. Just try it out in
our case and this works well. I think My problem is mostly resolved
without the need of this patch.  Thanks!

It's still good to do something about the normal (func(v)).* syntax
if it's still considered legal. I will give a try to expanding it more
cleverly and see if we can avoid the duplicate evaluation issue.

Peifeng Qiu