Thread: BUG #18305: Unexpected error: "WindowFunc not found in subplan target lists" triggered by subqueries

The following bug has been logged on the website:

Bug reference:      18305
Logged by:          Zuming Jiang
Email address:      zuming.jiang@inf.ethz.ch
PostgreSQL version: 16.1
Operating system:   Ubuntu 20.04
Description:

My fuzzer finds a bug in Postgres 17devel, which triggers an unexpected
error "ERROR: WindowFunc not found in subplan target lists".

--- Set up database ---
create table exeet_t3 (pkey int4);
create view exeet_t8 as 
select  
    ntile(exeet_subq_0.c_0) over () as c_0 
  from 
    (select (select pkey from exeet_t3 order by pkey limit 1 offset 6) as
c_0) as exeet_subq_0;

The fuzzer generates a test case:

--- Test case ---
select  
    1 as c_1
  from 
    exeet_t8 as exeet_ref_17
  where exeet_ref_17.c_0 < 0;

--- Expected behavior ---
The test case should not trigger any error.

--- Actual behavior ---
The test case trigger an error: 

ERROR:  WindowFunc not found in subplan target lists

--- Postgres version ---
Github commit: b0f0a9432d0b6f53634a96715f2666f6d4ea25a1
Version: PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc
(Ubuntu
9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit

--- Platform information ---
Platform: Ubuntu 20.04
Kernel: Linux 5.4.0-147-generic


PG Bug reporting form <noreply@postgresql.org> writes:
> My fuzzer finds a bug in Postgres 17devel, which triggers an unexpected
> error "ERROR: WindowFunc not found in subplan target lists".

> create table exeet_t3 (pkey int4);
> create view exeet_t8 as 
> select  
>     ntile(exeet_subq_0.c_0) over () as c_0 
>   from 
>     (select (select pkey from exeet_t3 order by pkey limit 1 offset 6) as
> c_0) as exeet_subq_0;

> select  
>     1 as c_1
>   from 
>     exeet_t8 as exeet_ref_17
>   where exeet_ref_17.c_0 < 0;
> ERROR:  WindowFunc not found in subplan target lists

Thanks for the report!  Bisecting shows that this broke at

456fa635a909ee36f73ca84d340521bd730f265f is the first bad commit
commit 456fa635a909ee36f73ca84d340521bd730f265f
Author: David Rowley <drowley@postgresql.org>
Date:   Fri Jan 27 16:08:41 2023 +1300

    Teach planner about more monotonic window functions
    
    9d9c02ccd introduced runConditions for window functions to allow
    monotonic window function evaluation to be made more efficient when the
    window function value went beyond some value that it would never go back
    from due to its monotonic nature.  That commit added prosupport functions
    to inform the planner that row_number(), rank(), dense_rank() and some
    forms of count(*) were monotonic.  Here we add support for ntile(),
    cume_dist() and percent_rank().

So I'm betting there's something wrong with the ntile support
function, but I've not dug deeper.

            regards, tom lane




On Mon, Jan 22, 2024 at 11:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> My fuzzer finds a bug in Postgres 17devel, which triggers an unexpected
> error "ERROR: WindowFunc not found in subplan target lists".

> create table exeet_t3 (pkey int4);
> create view exeet_t8 as
> select 
>     ntile(exeet_subq_0.c_0) over () as c_0
>   from
>     (select (select pkey from exeet_t3 order by pkey limit 1 offset 6) as
> c_0) as exeet_subq_0;

> select 
>     1 as c_1
>   from
>     exeet_t8 as exeet_ref_17
>   where exeet_ref_17.c_0 < 0;
> ERROR:  WindowFunc not found in subplan target lists

Thanks for the report!  Bisecting shows that this broke at

456fa635a909ee36f73ca84d340521bd730f265f is the first bad commit
commit 456fa635a909ee36f73ca84d340521bd730f265f
Author: David Rowley <drowley@postgresql.org>
Date:   Fri Jan 27 16:08:41 2023 +1300

    Teach planner about more monotonic window functions

    9d9c02ccd introduced runConditions for window functions to allow
    monotonic window function evaluation to be made more efficient when the
    window function value went beyond some value that it would never go back
    from due to its monotonic nature.  That commit added prosupport functions
    to inform the planner that row_number(), rank(), dense_rank() and some
    forms of count(*) were monotonic.  Here we add support for ntile(),
    cume_dist() and percent_rank().

So I'm betting there's something wrong with the ntile support
function, but I've not dug deeper.

Here is a simplified repro query:

select 1 from
  (select ntile(s1.x) over () as c
   from (select (select 1) as x) as s1) s
 where s.c = 1;
ERROR:  WindowFunc not found in subplan target lists

I think the problem is that in find_window_run_conditions() we fail to
check thoroughly whether the WindowFunc contains any potential subplan
nodes.  We do have the check:

    /* can't use it if there are subplans in the WindowFunc */
    if (contain_subplans((Node *) wfunc))
        return false;

... and the bug being reported here indicates that the current check is
insufficient, because a Var node inside the WindowFunc could potentially
belong to a subquery and reference a SubLink expression.

In the query above, the WindowFunc's argument, 's1.x', is actually an
EXPR_SUBLINK sublink.  After we've pulled up subquery 's1', the
arguments of the two WindowFuncs (one in the target list of subquery
's', the other in the WindowClause's runCondition of subquery 's') would
be replaced with SubLink nodes.  And then after we've converted SubLink
nodes to SubPlans, these two SubLink nodes would be replaced with
Params, but with different paramids.

Hmm, is there a way to detect in find_window_run_conditions() whether
the WindowFunc's argument references a SubLink node of a subquery that
hasn't been pulled up?

Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> Hmm, is there a way to detect in find_window_run_conditions() whether
> the WindowFunc's argument references a SubLink node of a subquery that
> hasn't been pulled up?

contain_subplans will recognize both SubLinks and SubPlans,
so there must be some other factor here.  (I didn't look yet.)

            regards, tom lane



Richard Guo <guofenglinux@gmail.com> writes:
> I think the problem is that in find_window_run_conditions() we fail to
> check thoroughly whether the WindowFunc contains any potential subplan
> nodes.  We do have the check:

>     /* can't use it if there are subplans in the WindowFunc */
>     if (contain_subplans((Node *) wfunc))
>         return false;

> ... and the bug being reported here indicates that the current check is
> insufficient, because a Var node inside the WindowFunc could potentially
> belong to a subquery and reference a SubLink expression.

I dug through this, and my conclusion is that the entire "run
condition" optimization is seriously broken, because the tree
manipulations here are many bricks shy of a load.  There are
two big problems:

1. The subquery hasn't yet been through any preprocessing.  Thus,
we've not yet done subquery pullup within it, which is why the
contain_subplans call fails to detect a problem: the WindowFunc's arg
is just a Var.  Later, pullup of the lower subquery will replace that
Var with a SubLink, which eventually gets replaced with a Param
referencing an initplan's output.  However, run condition optimization
caused us to put a second copy of the WindowFunc into the
runCondition, and that copy gets its own copy of the SubLink, which
eventually results in a different Param, leading to the observed
failure.  (BTW, it scares the heck out of me that we just link the
WindowFunc expr into wclause->runCondition without making a copy of
it.  That could be safe if the planner never scribbles on its input,
but that ain't so.)

2. We are pushing the "otherexpr" from the upper-level query
(which *has* been through preprocessing) into the unprocessed
child query.  This makes me acutely uncomfortable:
  A. Almost certainly, this fails if otherexpr contains a subplan.
  B. Since nothing is done about levelsup adjustment, it will
     fail if anything in that tree includes a levelsup field.
     Fortunately is_pseudo_constant_clause will reject Vars,
     but those aren't the only things with levelsup.  I wonder
     if it's possible to get an uplevel Aggref into a subquery's
     restrictlist, in particular.
  C. Again, failure to make a copy of the expression tree seems
     pretty dangerous.
  D. We're relying on repeat preprocessing of the otherexpr to
     do no harm.  Maybe that's OK, but it's not great.
The first three things are not hard to fix, but they're not being
taken care of today.

We could work around point 1 by refusing to pull up subqueries that
contain sublinks in their targetlists, but that would be a pretty big
change (and, probably, a pessimization of some queries).  I do not
consider run-condition optimization to justify that.  Moreover
I'm not sure that sublinks are the only thing that could get
mutated to a different state in the runCondition than in the main
tree.

I think the only real way to prevent problems from point 1 is to stop
making a copy of the WindowFunc expr.  We need some other way to refer
to the WindowFunc's value in the runCondition tree.  Maybe a generated
Param would serve?

            regards, tom lane



On Wed, 24 Jan 2024 at 08:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We could work around point 1 by refusing to pull up subqueries that
> contain sublinks in their targetlists, but that would be a pretty big
> change (and, probably, a pessimization of some queries).  I do not
> consider run-condition optimization to justify that.  Moreover
> I'm not sure that sublinks are the only thing that could get
> mutated to a different state in the runCondition than in the main
> tree.
>
> I think the only real way to prevent problems from point 1 is to stop
> making a copy of the WindowFunc expr.  We need some other way to refer
> to the WindowFunc's value in the runCondition tree.  Maybe a generated
> Param would serve?

I'm wondering if it was wrong to put the runCondition field in
WindowClause. Maybe it should be in WindowFunc instead...

Really the OpExpr that's created in find_window_run_conditions() with
the WindowFunc as an arg is there to show the Run Condition in
EXPLAIN. This is what's later stored in WindowAgg.runConditionOrig.
What we pass to ExecQual in nodeWindowAgg.c is a version of the OpExpr
with the WindowFunc replaced with a Var so that ExecQual properly
fetches the just-calculated WindowFunc return value from the slot.  We
don't want to leave the WindowFunc in the runCondition as ExecQual
would go and evaluate it.

If WindowFunc allowed a list of a new struct called WindowRunCondition
with fields "otherarg", "opno", "collation", "wfunc_left" then we
could construct the OpExpr later either in createplan.c or setrefs.c.
The EXPLAIN version of that OpExpr could have the WindowFunc and the
non-EXPLAIN version would have the Var.

Sounds a bit invasive for back branches, but wondering if we couldn't
just modify window_ntile_support() to reject any ntile args other than
Consts. count(*), row_number(), rank(), dense_rank(), percent_rank()
and percent_rank() all can't suffer from this issue as they don't have
an argument.  count(expr) would need to have something done to stop
the same issue from occurring. Maybe int8inc_support() could just set
req->monotonic = MONOTONICFUNC_NONE if the req->window_func has an
arg, effectively disabling the optimisation for count(expr).

David



On Thu, 25 Jan 2024 at 18:14, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 24 Jan 2024 at 08:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > We could work around point 1 by refusing to pull up subqueries that
> > contain sublinks in their targetlists, but that would be a pretty big
> > change (and, probably, a pessimization of some queries).  I do not
> > consider run-condition optimization to justify that.  Moreover
> > I'm not sure that sublinks are the only thing that could get
> > mutated to a different state in the runCondition than in the main
> > tree.
> >
> > I think the only real way to prevent problems from point 1 is to stop
> > making a copy of the WindowFunc expr.  We need some other way to refer
> > to the WindowFunc's value in the runCondition tree.  Maybe a generated
> > Param would serve?
>
> I'm wondering if it was wrong to put the runCondition field in
> WindowClause. Maybe it should be in WindowFunc instead...
>
> Really the OpExpr that's created in find_window_run_conditions() with
> the WindowFunc as an arg is there to show the Run Condition in
> EXPLAIN. This is what's later stored in WindowAgg.runConditionOrig.
> What we pass to ExecQual in nodeWindowAgg.c is a version of the OpExpr
> with the WindowFunc replaced with a Var so that ExecQual properly
> fetches the just-calculated WindowFunc return value from the slot.  We
> don't want to leave the WindowFunc in the runCondition as ExecQual
> would go and evaluate it.
>
> If WindowFunc allowed a list of a new struct called WindowRunCondition
> with fields "otherarg", "opno", "collation", "wfunc_left" then we
> could construct the OpExpr later either in createplan.c or setrefs.c.
> The EXPLAIN version of that OpExpr could have the WindowFunc and the
> non-EXPLAIN version would have the Var.

Just to assist the discussion here I've drafted a patch along the
lines of the above. See attached

If you think this idea has merit I can try and turn it into something
committable for master.

> Sounds a bit invasive for back branches, but wondering if we couldn't
> just modify window_ntile_support() to reject any ntile args other than
> Consts. count(*), row_number(), rank(), dense_rank(), percent_rank()
> and percent_rank() all can't suffer from this issue as they don't have
> an argument.  count(expr) would need to have something done to stop
> the same issue from occurring. Maybe int8inc_support() could just set
> req->monotonic = MONOTONICFUNC_NONE if the req->window_func has an
> arg, effectively disabling the optimisation for count(expr).

I'm still unsure about the fix for back branches but I'm open to other ideas.

David

Attachment

On Fri, Jan 26, 2024 at 8:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 25 Jan 2024 at 18:14, David Rowley <dgrowleyml@gmail.com> wrote:
> If WindowFunc allowed a list of a new struct called WindowRunCondition
> with fields "otherarg", "opno", "collation", "wfunc_left" then we
> could construct the OpExpr later either in createplan.c or setrefs.c.
> The EXPLAIN version of that OpExpr could have the WindowFunc and the
> non-EXPLAIN version would have the Var.

Just to assist the discussion here I've drafted a patch along the
lines of the above. See attached

If you think this idea has merit I can try and turn it into something
committable for master.

This idea seems reasonable to me.  Now the runCondition is constructed
in create_one_window_path(), where the subquery has been through
preprocessing and therefore the WindowFunc's arg has been replaced with
a Param due to the pullup of the lower subquery and the expansion of
SubLinks to SubPlans.  This fixes the problem reported here.

Thanks
Richard

On Thu, Jan 25, 2024 at 1:14 PM David Rowley <dgrowleyml@gmail.com> wrote:
Sounds a bit invasive for back branches, but wondering if we couldn't
just modify window_ntile_support() to reject any ntile args other than
Consts. count(*), row_number(), rank(), dense_rank(), percent_rank()
and percent_rank() all can't suffer from this issue as they don't have
an argument.  count(expr) would need to have something done to stop
the same issue from occurring. Maybe int8inc_support() could just set
req->monotonic = MONOTONICFUNC_NONE if the req->window_func has an
arg, effectively disabling the optimisation for count(expr).

You're right that count(expr) also suffers from this issue.

select 1 from
  (select count(s1.x) over () as c
   from (select (select 1) as x) as s1) s
 where s.c = 1;
ERROR:  WindowFunc not found in subplan target lists

For back branches, the idea of modifying window_ntile_support() and
int8inc_support() to reject any non-pseudoconstant args also seems
reasonable to me.  One thing I noticed is that sometimes it's not easy
to tell whether the arg is pseudoconstant or not in the support
functions, because a pseudoconstant is not necessarily being type of
Const.  For instance, count(1::text) is a CoerceViaIO, and
ntile(1.0::int) is a FuncExpr.  But these are very corner cases and I
think we can just ignore them.

Thanks
Richard
On Mon, 1 Apr 2024 at 21:34, akuluasan <akuluasan@163.com> wrote:
> I am using my tool to simplify  the SQL query. Can you please confirm if the simplification process helps you
diagnoseand locate bugs?
 

The root cause of this has already been established, so in this case,
the answer is "no".

However, since you wrote "bugs" rather than "this bug", for cases
where we've not found the root cause, the more simple the reproducer,
the better.

David



On Fri, 22 Mar 2024 at 23:47, Richard Guo <guofenglinux@gmail.com> wrote:
> For back branches, the idea of modifying window_ntile_support() and
> int8inc_support() to reject any non-pseudoconstant args also seems
> reasonable to me.  One thing I noticed is that sometimes it's not easy
> to tell whether the arg is pseudoconstant or not in the support
> functions, because a pseudoconstant is not necessarily being type of
> Const.  For instance, count(1::text) is a CoerceViaIO, and
> ntile(1.0::int) is a FuncExpr.  But these are very corner cases and I
> think we can just ignore them.

I don't think that needs anything special aside from constant folding.

I've attached a more complete version of the patch (0002) and another
patch which is what I'd proposed as a fix for the backbranches (0001).
Note quite a few tests needed to be adjusted because of disabling this
optimisation.

The 0002 patch will require a cat version bump as it adds a field to
WindowFunc.  Ideally, I'd be applying this fix to master, but I
imagine some people might feel we should delay applying a fix like
this until after we branch for v18. Happy to hear people's views on
that.

David

Attachment
David Rowley <dgrowleyml@gmail.com> writes:
> I've attached a more complete version of the patch (0002) and another
> patch which is what I'd proposed as a fix for the backbranches (0001).
> Note quite a few tests needed to be adjusted because of disabling this
> optimisation.

IIUC, 0002 is meant to be applied on top of 0001, but it reverses
quite a lot of 0001?  Please don't commit it like that, it'll just
add a lot of useless thrashing to "git blame" results.

It'd be easier to review this if you presented it as two independent
patches, one for HEAD and one for the back branches.

The fact that you had to use a cheesy "eval_const_expressions(NULL,
..." call in 0001 demonstrates that it was a mistake to not include
PlannerInfo in SupportRequestWFuncMonotonic, as every other planner-
invoked support request has.  I realize that we can't change that
in back branches, and that it's no longer immediately necessary
in HEAD either after 0002.  But let's learn from experience and
add it to the struct while we're here.

> The 0002 patch will require a cat version bump as it adds a field to
> WindowFunc.  Ideally, I'd be applying this fix to master, but I
> imagine some people might feel we should delay applying a fix like
> this until after we branch for v18. Happy to hear people's views on
> that.

catversion bumps are not a problem at this stage --- we will surely
do at least one more for assigned-OID renumbering.  I'd rather avoid
the git history thrashing implicit in treating 0002 as a follow-on
patch.

            regards, tom lane



On Fri, 26 Apr 2024 at 03:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It'd be easier to review this if you presented it as two independent
> patches, one for HEAD and one for the back branches.

The attached v2 is the same patch as earlier and is intended for <= v16.

> The fact that you had to use a cheesy "eval_const_expressions(NULL,
> ..." call in 0001 demonstrates that it was a mistake to not include
> PlannerInfo in SupportRequestWFuncMonotonic, as every other planner-
> invoked support request has.  I realize that we can't change that
> in back branches, and that it's no longer immediately necessary
> in HEAD either after 0002.  But let's learn from experience and
> add it to the struct while we're here.

The correct PlannerInfo to set here would be the one that the
WindowFunc belongs to.  The problem is that this code is called from
set_subquery_pathlist before the rel->subroot = subquery_planner. i.e
we've no PlannerInfo to set.  Maybe it's worth doing this for
SupportRequestOptimizeWindowClause as a separate patch.

> catversion bumps are not a problem at this stage

The attached v3 is a separate patch for v17 only.

David

Attachment
On Tue, 30 Apr 2024 at 09:39, David Rowley <dgrowleyml@gmail.com> wrote:
> The attached v2 is the same patch as earlier and is intended for <= v16.

I want to commit this one so it makes it into the next minor version
releases.  It's not very complex, so plan to push it later today.

I'll hold off with the v17 version for a while.

David



On Wed, 1 May 2024 at 11:30, David Rowley <dgrowleyml@gmail.com> wrote:
> I'll hold off with the v17 version for a while.

I've now pushed the fix for v17.

David