Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
Date
Msg-id CA+TgmoboLaUy71iLKa2GPcnioWmNxhCFdcDj-F05sZOGgRO0NA@mail.gmail.com
Whole thread Raw
In response to Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)  (Andres Freund <andres@anarazel.de>)
Responses Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
List pgsql-hackers
On Tue, Aug 23, 2016 at 3:10 AM, Andres Freund <andres@anarazel.de> wrote:
> as noted in [1] I started hacking on removing the current implementation
> of SRFs in the targetlist (tSRFs henceforth). IM discussion brought the
> need for a description of the problem, need and approach to light.

Thanks for writing this up.

> 1) How to deal with the least-common-multiple behaviour of tSRFs. E.g.
> =# SELECT generate_series(1, 3), generate_series(1,2);
> returning
> ┌─────────────────┬─────────────────┐
> │ generate_series │ generate_series │
> ├─────────────────┼─────────────────┤
> │               1 │               1 │
> │               2 │               2 │
> │               3 │               1 │
> │               1 │               2 │
> │               2 │               1 │
> │               3 │               2 │
> └─────────────────┴─────────────────┘
> (6 rows)
> but
> =# SELECT generate_series(1, 3), generate_series(5,7);
> returning
> ┌─────────────────┬─────────────────┐
> │ generate_series │ generate_series │
> ├─────────────────┼─────────────────┤
> │               1 │               5 │
> │               2 │               6 │
> │               3 │               7 │
> └─────────────────┴─────────────────┘
>
> discussion in this thread came, according to my reading, to the
> conclusion that that behaviour is just confusing and that the ROWS FROM
> behaviour of
> =# SELECT * FROM ROWS FROM(generate_series(1, 3), generate_series(1,2));
> ┌─────────────────┬─────────────────┐
> │ generate_series │ generate_series │
> ├─────────────────┼─────────────────┤
> │               1 │               1 │
> │               2 │               2 │
> │               3 │          (null) │
> └─────────────────┴─────────────────┘
> (3 rows)
>
> makes more sense.  We also discussed erroring out if two SRFs  return
> differing amount of rows, but that seems not to be preferred so far. And
> we can easily add it if we want.

This all seems fine.  I don't think erroring out is an improvement.

> 2) A naive conversion to ROWS FROM, like in the example in the
>    introductory paragraph, can change the output, when implemented as a
>    join from ROWS FROM to the rest of the query, rather than the other
>    way round. E.g.
> =# EXPLAIN SELECT * FROM few, ROWS FROM(generate_series(1,10));
> ┌──────────────────────────────────────────────────────────────────────────────┐
> │                                  QUERY PLAN                                  │
> ├──────────────────────────────────────────────────────────────────────────────┤
> │ Nested Loop  (cost=0.00..36.03 rows=2000 width=8)                            │
> │   ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=4) │
> │   ->  Materialize  (cost=0.00..1.03 rows=2 width=4)                          │
> │         ->  Seq Scan on few  (cost=0.00..1.02 rows=2 width=4)                │
> └──────────────────────────────────────────────────────────────────────────────┘
> (4 rows)
> =# SELECT * FROM few, ROWS FROM(generate_series(1,3));
> ┌────┬─────────────────┐
> │ id │ generate_series │
> ├────┼─────────────────┤
> │  1 │               1 │
> │  2 │               1 │
> │  1 │               2 │
> │  2 │               2 │
> │  1 │               3 │
> │  2 │               3 │
> └────┴─────────────────┘
> (6 rows)
> surely isn't what was intended.  So the join order needs to be enforced.

In general, we've been skeptical about giving any guarantees about
result ordering.  Maybe this case is different and we should give some
guarantee here, but I don't think it's 100% obvious.

> 3) tSRFs are evaluated after GROUP BY, and window functions:
> =# SELECT generate_series(1, count(*)) FROM (VALUES(1),(2),(10)) f;
> ┌─────────────────┐
> │ generate_series │
> ├─────────────────┤
> │               1 │
> │               2 │
> │               3 │
> └─────────────────┘
> which means we have to push the "original" query into a subquery, with
> the ROWS FROM laterally referencing the subquery:
> SELECT generate_series FROM (SELECT count(*) FROM (VALUES(1),(2),(10)) f) s, ROWS FROM (generate_series(1,s.count));

Seems OK.

> 4) The evaluation order of tSRFs in combination with ORDER BY is a bit
>    confusing. Namely tSRFs are implemented after ORDER BY has been
>    evaluated, unless the ORDER BY references the SRF.
> E.g.
> =# SELECT few.id, generate_series FROM ROWS FROM(generate_series(1,3)),few ORDER BY few.id DESC;
> might return
> ┌────┬─────────────────┐
> │ id │ generate_series │
> ├────┼─────────────────┤
> │ 24 │               3 │
> │ 24 │               2 │
> │ 24 │               1 │
> ..
> instead of
> ┌────┬─────────────────┐
> │ id │ generate_series │
> ├────┼─────────────────┤
> │ 24 │               1 │
> │ 24 │               2 │
> │ 24 │               3 │
> as before.
>
> which means we'll sometimes have to push down the ORDER BY into the
> subquery (when not referencing tSRFs, so they're evaluated first),
> sometimes evaluate them on the outside (if tSRFs are referenced)

OK.

> 5) tSRFs can have tSRFs as argument, e.g.:
> =# SELECT generate_series(1, generate_series(1,3));
> ┌─────────────────┐
> │ generate_series │
> ├─────────────────┤
> │               1 │
> │               1 │
> │               2 │
> │               1 │
> │               2 │
> │               3 │
> └─────────────────┘
> that can quite easily be implemented by having the "nested" tSRF
> evaluate as a separate ROWS FROM expression.
>
> Which even allows us to implement the previously forbidden
> =# SELECT generate_series(generate_series(1,3), generate_series(2,4));
> ERROR:  0A000: functions and operators can take at most one set argument
>
> - not that I think that's of great value ;)

OK.

> 6) SETOF record type functions cannot directly be used in ROWS FROM() -
>    as ROWS FROM "expands" records returned by functions. When converting
>    something like
> CREATE OR REPLACE FUNCTION setof_record_sql() RETURNS SETOF record LANGUAGE sql AS $$SELECT 1 AS a, 2 AS b UNION ALL
SELECT1, 2;$$;
 
> SELECT setof_record_sql();
>    we don't have that available though.
>
> The best way to handle that seems to be to introduce the ability for
> ROWS FROM not to expand the record returned by a column.  I'm currently
> thinking that something like ROWS FROM(setof_record_sql() AS ()) would
> do the trick.  That'd also considerably simplify the handling of
> functions returning known composite types - my current POC patch
> generates a ROW(a,b,..) type expression for those.
>
> I'm open to better syntax suggestions.

I definitely agree that having some syntax to avoid row-expansion in
this case (and maybe in other cases) would be a good thing; I suspect
that would get a good bit of use.  I don't care much for that
particular choice of syntax, which seems fairly magical, but I'm not
sure what would be better.

> 7) ROWS FROM () / functions in the FROM list are currently signifcantly
>    slower than the equivalent in the target list (for SFRM_ValuePerCall
>    SRFs at least):
>
> =# COPY (SELECT generate_series(1,10000000)) TO '/dev/null';
> COPY 10000000
> Time: 1311.469 ms
> =# COPY (SELECT * FROM generate_series(1,10000000)) TO '/dev/null';
> LOG:  00000: temporary file: path "base/pgsql_tmp/pgsql_tmp702.0", size 140000000
> LOCATION:  FileClose, fd.c:1484
> COPY 10000000
> Time: 2173.282 ms
> for SRFM_Materialize SRFs there's no meaningufl difference:
> CREATE FUNCTION plpgsql_generate_series(bigint, bigint) RETURNS SETOF bigint LANGUAGE plpgsql AS $$BEGIN RETURN QUERY
SELECTgenerate_series($1, $2);END;$$;
 
>
> =# COPY (SELECT plpgsql_generate_series(1,10000000)) TO '/dev/null';
> LOG:  00000: temporary file: path "base/pgsql_tmp/pgsql_tmp702.2", size 180000000
> COPY 10000000
> Time: 3058.437 ms
>
> =# COPY (SELECT * FROM plpgsql_generate_series(1,10000000)) TO '/dev/null';LOG:  00000: temporary file: path
"base/pgsql_tmp/pgsql_tmp702.1",size 180000000
 
> COPY 10000000
> Time: 2964.661 ms
>
> that makes sense, because nodeFunctionscan.c, via
> ExecMakeTableFunctionResult, forces materialization of ValuePerCall
> SRFs.
>
> ISTM that we need should fix that by allowing ValuePerCall without
> materialization, as long as EXEC_FLAG_BACKWARD isn't required.

That sounds good.
> I've implemented ([2]) a prototype of this. My basic approach is:
>
> I) During parse-analysis, remember whether a query has any tSRFs
>    (Query->hasTargetSRF). That avoids doing a useless pass over the
>    query, if no tSRFs are present.
> II) At the beginning of subquery_planner(), before doing any work
>    operating on subqueries and such, implement SRFs if ->hasTargetSRF().
>    (unsrfify() in the POC)
> III) Unconditionally move the "current" query into a subquery. For that
>    do a mutator pass over the query, replacing Vars/Aggrefs/... in the
>    original targetlist with Var references to the new subquery.
>    (unsrfify_reference_subquery_mutator() in the POC)
> IV) Do a pass over the outer query's targetlist, and implement any tSRFs
>    using a ROWS FROM() RTE (or multiple ones in case of nested tSRFs).
>    (unsrfify_implement_srfs_mutator() in the POC)
>
> that seems to mostly work well.

I gather that III and IV are skipped if hasTargetSRF isn't set.

> The behaviour changes this implies are:
>
> a) Least-common-multiple behaviour, as in (1) above, is gone. I think
>    that's good.
>
> b) We currently allow tSRFs in UPDATE ... SET expressions. I don't
>    actually know what that's supposed to mean. I'm inclined
>    a;
> =# CREATE TABLE blarg AS SELECT 1::int a;
> SELECT 1
> =# UPDATE blarg SET a = generate_series(2,3);
> UPDATE 1
> =# SELECT * FROM blarg ;
> ┌───┐
> │ a │
> ├───┤
> │ 2 │
> └───┘
> I'm inclined to think that that's a bad idea, and should rather be
> forbidden.
>
> c) COALESCE/CASE have, so far, shortcut tSRF expansion. E.g.
>    SELECT id, COALESCE(1, generate_series(1,2)) FROM (VALUES(1),(2)) few(id);
>    returns only two rows, despite the generate_series().  But by
>    implementing the generate_series as a ROWS FROM, it'd return four.
>
> I think that's ok.

Those all sound OK.

> d) Not a problem with the patch per-se, but I'm doubful that that's ok:
> =# SELECT 1 ORDER BY generate_series(1, 10);
> returns 10 rows ;) - maybe we should forbid that?

OK by me.  I feel like this isn't the only case where the presence of
resjunk columns has user-visible effects, although I can't think of
another one right at the moment.  It seems like something to avoid,
though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: Surprising behaviour of \set AUTOCOMMIT ON
Next
From: Robert Haas
Date:
Subject: Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)