Re: Changed SRF in targetlist handling - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Changed SRF in targetlist handling
Date
Msg-id 20160801082346.nfp2g7mg74alifdc@alap3.anarazel.de
Whole thread Raw
In response to Re: Changed SRF in targetlist handling  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Changed SRF in targetlist handling  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 2016-05-25 16:55:23 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2016-05-25 15:20:03 -0400, Tom Lane wrote:
> >> We could certainly make a variant behavior in nodeFunctionscan.c that
> >> emulates that, if we feel that being exactly bug-compatible on the point
> >> is actually what we want.  I'm dubious about that though, not least
> >> because I don't think *anyone* actually believes that that behavior isn't
> >> broken.  Did you read my upthread message suggesting assorted compromise
> >> choices?
>
> > You mean https://www.postgresql.org/message-id/21076.1464034513@sss.pgh.pa.us ?
> > If so, yes.
>
> > If we go with rewriting this into LATERAL, I'd vote for 2.5 (trailed by
> > option 1), that'd keep most of the functionality, and would break
> > visibly rather than invisibly in the cases where not.
>
> 2.5 would be okay with me.
>
> > I guess you're not planning to work on this?
>
> Well, not right now, as it's clearly too late for 9.6.  I might hack on
> it later if nobody beats me to it.

FWIW, as it's blocking my plans for executor related rework (expression
evaluation, batch processing) I started to hack on this.

I've an implementation that

1) turns all targetlist SRF (tSRF from now on) into ROWS FROM  expressions. If there's tSRFs in the argument of a tSRF
thosebecomes  a separate, lateral, ROWS FROM expression.
 

2) If grouping/window functions are present, the entire query is wrapped  in a subquery RTE, except for the
set-returningfunction. All  referenced Var|Aggref|GroupingFunc|WindowFunc|Param nodes in the  original targetlist are
madeto reference that subquery, which gets a  TargetEntry for them.
 

3) If sortClause does *not* reference any tSRFs the sorting is evaluated  in a subquery, to preserve the output
orderingof SRFs in queries  like  SELECT id, generate_series(1,3) FROM (VALUES(1),(2)) d(id) ORDER BY id DESC;  if in
contrastsortClause does reference the tSRF output, it's  evaluated in the outer SRF.
 

this seems to generally work, and allows to remove considerable amounts
of code.

So far I have one problem without an easy solution: Historically queries
like
=# SELECT id, generate_series(1,2) FROM (VALUES(1),(2)) few(id);
┌────┬─────────────────┐
│ id │ generate_series │
├────┼─────────────────┤
│  1 │               1 │
│  1 │               2 │
│  2 │               1 │
│  2 │               2 │
└────┴─────────────────┘
have preserved the SRF output ordering. But by turning the SRF into a
ROWS FROM, there's no guarantee that the cross join between "few" and
generate_series(1,3) above is implemented in that order. I.e. we can get
something like
┌────┬─────────────────┐
│ id │ generate_series │
├────┼─────────────────┤
│  1 │               1 │
│  2 │               1 │
│  1 │               2 │
│  2 │               2 │
└────┴─────────────────┘
because it's implemented as
┌──────────────────────────────────────────────────────────────────────────────┐
│                                  QUERY PLAN                                  │
├──────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop  (cost=0.00..35.03 rows=2000 width=8)                            │
│   ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=4) │
│   ->  Materialize  (cost=0.00..0.04 rows=2 width=4)                          │
│         ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4)      │
└──────────────────────────────────────────────────────────────────────────────┘

I right now see no easy and nice-ish way to constrain that.


Besides that I'm structurally wondering whether turning the original
query into a subquery is the right thing to do. It requires some kind of
ugly munching of Query->*, and has the above problem. One alternative
would be to instead perform the necessary magic in grouping_planner(),
by "manually" adding nestloop joins before/after create_ordered_paths()
(depending on SRFs being referenced in the sort clause).  That'd create
plans we'd not have created so far, by layering NestLoop and
FunctionScan nodes above the normal query - that'd allow us to to easily
force the ordering of SRF evaluation.


If we go the subquery route, I'm wondering about where to tackle the
restructuring. So far I'm doing it very early in subquery_planner() -
otherwise the aggregation/sorting/... behaviour is easier to handle.
Perhaps doing it in standard_planner() itself would be better though.
An alternative approach would be to do this during parse-analysis, but I
think that might end up being confusing, because stored rules would
suddenly have a noticeably different structure, and it'd tie us a lot
more to the details of that transformation than I'd like.


Besides the removal of the least-common-multiple behaviour of tSRF queries,
there's some other consequences that using function scans have:
Previously if a tSRF was never evaluated, it didn't cause the number of
rows from being increased. E.g.
SELECT id, COALESCE(1, generate_series(1,2)) FROM (VALUES(1),(2)) few(id);
only produced two rows.  But using joins means that a simple
implementation of using ROWS FROM returns four rows.   We could try to
inject sufficient join conditions in that type of query, to prune down
the number of rows again, but I really don't want to go there - it's
kinda hard in the general case...


Comments?


Regards,


Andres



pgsql-hackers by date:

Previous
From: Shay Rojansky
Date:
Subject: Re: Slowness of extended protocol
Next
From: Fujii Masao
Date:
Subject: Re: pg_basebackup wish list