Thread: Optimize common expressions in projection evaluation
Hi hackers. When a star(*) expands into multiple fields, our current implementation is to generate multiple copies of the expression and do FieldSelects. This is very inefficient because the same expression get evaluated multiple times but actually we only need to do it once. This is stated in ExpandRowReference(). For example: CREATE TABLE tbl(c1 int, c2 int, c3 int); CREATE TABLE src(v text); CREATE FUNCTION expensive_func(input text) RETURNS t; INSERT INTO tbl SELECT (expensive_func(v)).* FROM src; This is effectively the same as: INSERT INTO tbl SELECT (expensive_func(v)).c1, (expensive_func(v)).c2, (expensive_func(v)).c3 FROM src; In this form, expensive_func will be evaluated for every column in tbl. If tbl has hundreds of columns we are in trouble. To partially solve this issue, when doing projection in ExecBuildProjectionInfo, instead of generating normal steps for FieldSelects one by one, we can group them by the expression(arg of FieldSelect node). Then evaluate the epxression once to get a HeapTuple, deform it into fields, and then assign needed fields in one step. I've attached patch that introduce EEOP_FIELD_MULTI_SELECT_ASSIGN for this. With this patch, the following query should generate only one NOTICE, instead of 3. CREATE TYPE proj_type AS (a int, b int, c text); CREATE OR REPLACE FUNCTION proj_type_func1(input text) RETURNS proj_type AS $$ BEGIN RAISE NOTICE 'proj_type_func called'; RETURN ROW(1, 2, input); END $$ IMMUTABLE LANGUAGE PLPGSQL; CREATE TEMP TABLE stage_table(a text); INSERT INTO stage_table VALUES('aaaa'); SELECT (proj_type_func1(a)).* FROM stage_table; This patch is just proof of concept. Some unsolved questions I can think of right now: - Carry some information in FieldSelect from ExpandRowReference to assist grouping? - This can only handle FuncExpr as the root node of FieldSelect arg. What about a more general expression? - How to determine whether a common expression is safe to be optimized this way? Any unexpcted side-effects? Any thoughts on this approach? Best regards, Peifeng Qiu
Attachment
On Fri, Dec 2, 2022 at 12:52 AM Peifeng Qiu <pgsql@qiupf.dev> wrote:
Hi hackers.
When a star(*) expands into multiple fields, our current
implementation is to generate multiple copies of the expression
and do FieldSelects. This is very inefficient because the same
expression get evaluated multiple times but actually we only need
to do it once.
And so we implemented the SQL Standard LATERAL and all was well.
Given both how long we didn't have lateral and didn't do something like this, and how long lateral has been around and this hasn't really come up, the need for this code seems not that great. But as to the code itself I'm unable to properly judge.
David J.
> the need for this code seems not that great. But as to the code itself I'm unable to properly judge. A simplified version of my use case is like this: CREATE FOREIGN TABLE ft(rawdata json); INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft; We have a foreign data source that can emit json data in different formats. We need different convert_func to extract the actual fields out. The client know which function to use, but as the each json may have hundreds of columns, the performance is very poor. Best regards, Peifeng Qiu
On Sun, Dec 4, 2022 at 9:00 PM Peifeng Qiu <pgsql@qiupf.dev> wrote:
> the need for this code seems not that great. But as to the code itself I'm unable to properly judge.
A simplified version of my use case is like this:
CREATE FOREIGN TABLE ft(rawdata json);
INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft;
Which is properly written as the following, using lateral, which also avoids the problem you describe:
INSERT INTO tbl
SELECT func_call.*
FROM ft
JOIN LATERAL convert_func(ft.rawdata) AS func_call ON true;
David J.
Peifeng Qiu <pgsql@qiupf.dev> writes: >> the need for this code seems not that great. But as to the code itself I'm unable to properly judge. > A simplified version of my use case is like this: > CREATE FOREIGN TABLE ft(rawdata json); > INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft; It might be worth noting that the code as we got it from Berkeley could do this scenario without multiple evaluations of convert_func(). Memory is foggy, but I believe it involved essentially a two-level targetlist. Unfortunately, the scheme was impossibly baroque and buggy, so we eventually ripped it out altogether in favor of the multiple-evaluation behavior you see today. I think that commit 62e29fe2e might have been what ripped it out, but I'm not quite sure. It's about the right time-frame, anyway. I mention this because trying to reverse-engineer this situation in execExpr seems seriously ugly and inefficient, even assuming you can make it non-buggy. The right solution has to involve never expanding foo().* into duplicate function calls in the first place, which is the way it used to be. Maybe if you dug around in those twenty-year-old changes you could get some inspiration. I tend to agree with David that LATERAL offers a good-enough solution in most cases ... but it is annoying that we accept this syntax and then pessimize it. regards, tom lane
po 5. 12. 2022 v 5:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Peifeng Qiu <pgsql@qiupf.dev> writes:
>> the need for this code seems not that great. But as to the code itself I'm unable to properly judge.
> A simplified version of my use case is like this:
> CREATE FOREIGN TABLE ft(rawdata json);
> INSERT INTO tbl SELECT (convert_func(rawdata)).* FROM ft;
It might be worth noting that the code as we got it from Berkeley
could do this scenario without multiple evaluations of convert_func().
Memory is foggy, but I believe it involved essentially a two-level
targetlist. Unfortunately, the scheme was impossibly baroque and
buggy, so we eventually ripped it out altogether in favor of the
multiple-evaluation behavior you see today. I think that commit
62e29fe2e might have been what ripped it out, but I'm not quite
sure. It's about the right time-frame, anyway.
I mention this because trying to reverse-engineer this situation
in execExpr seems seriously ugly and inefficient, even assuming
you can make it non-buggy. The right solution has to involve never
expanding foo().* into duplicate function calls in the first place,
which is the way it used to be. Maybe if you dug around in those
twenty-year-old changes you could get some inspiration.
I tend to agree with David that LATERAL offers a good-enough
solution in most cases ... but it is annoying that we accept
this syntax and then pessimize it.
I agree, so there is a perfect solution like don't use .*, but on second hand any supported syntax should be optimized well or we should raise some warning about negative performance impact.
Today there are a lot of baroque technologies in the stack so it is hard to remember all good practices and it is hard for newbies to take this knowledge. We should reduce possible performance traps when it is possible.
Regards
Pavel
regards, tom lane
Pavel Stehule <pavel.stehule@gmail.com> writes: > po 5. 12. 2022 v 5:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal: >> I tend to agree with David that LATERAL offers a good-enough >> solution in most cases ... but it is annoying that we accept >> this syntax and then pessimize it. > I agree, so there is a perfect solution like don't use .*, but on second > hand any supported syntax should be optimized well or we should raise some > warning about negative performance impact. Yeah. I wonder if we could get the parser to automatically transform SELECT (foo(t.x)).* FROM tab t into SELECT f.* FROM tab t, LATERAL foo(t.x) f There are probably cases where this doesn't work or changes the semantics subtly, but I suspect it could be made to work in most cases of practical interest. regards, tom lane
On Sun, Dec 4, 2022 at 9:37 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
po 5. 12. 2022 v 5:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:Peifeng Qiu <pgsql@qiupf.dev> writes:
>> the need for this code seems not that great. But as to the code itself I'm unable to properly judge.
I mention this because trying to reverse-engineer this situation
in execExpr seems seriously ugly and inefficient, even assuming
you can make it non-buggy. The right solution has to involve never
expanding foo().* into duplicate function calls in the first place,
which is the way it used to be. Maybe if you dug around in those
twenty-year-old changes you could get some inspiration.
I tend to agree with David that LATERAL offers a good-enough
solution in most cases ... but it is annoying that we accept
this syntax and then pessimize it.I agree, so there is a perfect solution like don't use .*, but on second hand any supported syntax should be optimized well or we should raise some warning about negative performance impact.
Yeah, I phrased that poorly. I agree that having this problem solved would be beneficial to the project. But, and this is probably a bit unfair to the OP, if Tom decided to implement LATERAL after years of hearing complaints I doubted this patch was going to be acceptable. The benefit/cost ratio of fixing this in code just doesn't seem to be high enough to try at this point. But I'd be happy to be proven wrong. And I readily admit both a lack of knowledge, and that as time has passed maybe we've improved the foundations enough to implement something here.
Otherwise, we can maybe improve the documentation. There isn't any way (that the project would accept anyway) to actually warn the user at runtime. If we go that route we should probably just disallow the syntax outright.
David J.
> Which is properly written as the following, using lateral, which also avoids the problem you describe: > > INSERT INTO tbl > SELECT func_call.* > FROM ft > JOIN LATERAL convert_func(ft.rawdata) AS func_call ON true; I didn't fully realize this syntax until you point out. Just try it out in our case and this works well. I think My problem is mostly resolved without the need of this patch. Thanks! It's still good to do something about the normal (func(v)).* syntax if it's still considered legal. I will give a try to expanding it more cleverly and see if we can avoid the duplicate evaluation issue. Peifeng Qiu