Thread: Remove restrictions in recursive query

Remove restrictions in recursive query

From
Renan Alves Fonseca
Date:
Hi,

I'm confused about what we should allow in a recursive query. For example, in the following query:

WITH RECURSIVE t1 AS ( SELECT 1 UNION  SELECT generate_series(2,3) FROM t1 ORDER BY  1 DESC) SELECT * FROM t1 ;

The parser attaches the "order by" clause to the "union" operator, and then we error out with the following message: "ORDER BY in a recursive query is not implemented"

The comment in the code (parser_cte.c:900) says "Disallow ORDER BY and similar decoration atop the UNION". Then, if we wrap the recursive clause around parentheses:

WITH RECURSIVE t1 AS ( SELECT 1 UNION  (SELECT generate_series(2,3) FROM t1 ORDER BY  1 DESC)) SELECT * FROM t1 ;

It works as expected. So, do we support the ORDER BY in a recursive query or not? If the answer is yes, I suggest one of the following modifications:
1. Change the error message to something like "ORDER BY at the top level of a recursive query is not implemented. HINT: wrap the respective statement around ()"
2. (preferred) Modify the parser or simply remove these checks to allow the first query.

If the answer is no, then there is a minor bug that allows us to bypass the check. Even though the ORDER BY happens inside the working table, I think it can be a useful feature combined with LIMIT and OFFSET.

There is a similar restriction regarding GROUP BY. But in this case, the error message is very clear and it is consistent with the comment in the code. I suggest removing this restriction as well in order to improve PostgreSQL's capabilities to process graph data. For example, counting the number of paths in a DAG can be computed more efficiently using an aggregation in each step. 

I don't know what the standard says about this, but it certainly does not allow DISTINCT ON in the recursive query, while PostgreSQL does support it. So, we could eventually skip the specs in this case to be more consistent since a DISTINCT ON has many similarities with a GROUP BY.

I did some tests, and it is enough to remove the check regarding the GROUP BY. The machinery to perform the GROUP BY in a recursive clause is already there.

Of course, if it is the case, I would be glad to send a patch.

Best regards,
Renan Fonseca

Re: Remove restrictions in recursive query

From
Robert Haas
Date:
On Thu, Mar 27, 2025 at 12:02 PM Renan Alves Fonseca
<renanfonseca@gmail.com> wrote:
> WITH RECURSIVE t1 AS ( SELECT 1 UNION  SELECT generate_series(2,3) FROM t1 ORDER BY  1 DESC) SELECT * FROM t1 ;
>
> The parser attaches the "order by" clause to the "union" operator, and then we error out with the following message:
"ORDERBY in a recursive query is not implemented" 
>
> The comment in the code (parser_cte.c:900) says "Disallow ORDER BY and similar decoration atop the UNION". Then, if
wewrap the recursive clause around parentheses: 
>
> WITH RECURSIVE t1 AS ( SELECT 1 UNION  (SELECT generate_series(2,3) FROM t1 ORDER BY  1 DESC)) SELECT * FROM t1 ;
>
> It works as expected. So, do we support the ORDER BY in a recursive query or not?

A recursive CTE effectively takes two queries that will be run as
arguments. For some reason, instead of choosing syntax like WITH
RECURSIVE t1 AS BASE_CASE (initial_query) RECURSIVE_CASE
(iterated_query), somebody chose WITH RECURSIVE t1 AS (initial_query
UNION iterated_query) which really doesn't make it very clear that we
need to be able to break it apart into two separate queries, one of
which will be run once and one of which will be iterated.

It's not a problem if UNION ALL is used within the initial_query and
it's not a problem if UNION ALL is used within the iterated_query. But
you can't apply ORDER BY to the result of the UNION, because the UNION
is kind of fake -- we're not running the UNION as a single query,
we're running the two halves separately, the first once and the second
as many times as needed.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Remove restrictions in recursive query

From
Robert Haas
Date:
On Thu, Mar 27, 2025 at 12:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It's not a problem if UNION ALL is used within the initial_query and
> it's not a problem if UNION ALL is used within the iterated_query. But
> you can't apply ORDER BY to the result of the UNION, because the UNION
> is kind of fake -- we're not running the UNION as a single query,
> we're running the two halves separately, the first once and the second
> as many times as needed.

Oops. The two UNION ALL references in the first part of this paragraph
should have said ORDER BY.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Remove restrictions in recursive query

From
"David G. Johnston"
Date:
On Thu, Mar 27, 2025 at 9:02 AM Renan Alves Fonseca <renanfonseca@gmail.com> wrote:
So, do we support the ORDER BY in a recursive query or not?

Apparently we do.  The "in" recurses.

If the answer is yes, I suggest one of the following modifications: 
1. Change the error message to something like "ORDER BY at the top level of a recursive query is not implemented. HINT: wrap the respective statement around ()"

The error message is now correct but I'd lose the hint.  It assumes a typo and that the query author meant to sort each iteration of the recursive term; when maybe they did wish for the final result of the completed recursive query to then be fed into a sort node and ordered, in which case the parentheses wouldn't help.

David J.

Re: Remove restrictions in recursive query

From
Renan Alves Fonseca
Date:
On Thu, Mar 27, 2025 at 5:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It's not a problem if UNION ALL is used within the initial_query and
> it's not a problem if UNION ALL is used within the iterated_query. But
> you can't apply ORDER BY to the result of the UNION, because the UNION
> is kind of fake -- we're not running the UNION as a single query,
> we're running the two halves separately, the first once and the second
> as many times as needed.

I understand that we can only apply ORDER BY separately in the
initial/iterated query. What disturbs me here is that the UNION
operator has associativity precedence over the ORDER BY only when
inside a recursive CTE. Consider the following query:

SELECT 1 UNION SELECT 1 GROUP BY 1;

It returns 2 rows. The GROUP BY clause attaches to the second
selectStmt without the need to add parenthesis. I would expect the
same syntax inside a recursive CTE.



Re: Remove restrictions in recursive query

From
"David G. Johnston"
Date:
On Thu, Mar 27, 2025 at 11:03 AM Renan Alves Fonseca <renanfonseca@gmail.com> wrote:
On Thu, Mar 27, 2025 at 5:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It's not a problem if UNION ALL is used within the initial_query and
> it's not a problem if UNION ALL is used within the iterated_query. But
> you can't apply ORDER BY to the result of the UNION, because the UNION
> is kind of fake -- we're not running the UNION as a single query,
> we're running the two halves separately, the first once and the second
> as many times as needed.

I understand that we can only apply ORDER BY separately in the
initial/iterated query. What disturbs me here is that the UNION
operator has associativity precedence over the ORDER BY only when
inside a recursive CTE. Consider the following query:

SELECT 1 UNION SELECT 1 GROUP BY 1;

It returns 2 rows. The GROUP BY clause attaches to the second
selectStmt without the need to add parenthesis. I would expect the
same syntax inside a recursive CTE.

There is distinct behavior between group by and order by here.  You seem to be mixing them up.

From Select:

select_statement is any SELECT statement without an ORDER BY, LIMIT, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE, or FOR KEY SHARE clause. (ORDER BY and LIMIT can be attached to a subexpression if it is enclosed in parentheses. Without parentheses, these clauses will be taken to apply to the result of the UNION, not to its right-hand input expression.)

This is the exact same parsing precedence order by is being given in the recursive CTE query situation presented earlier.

David J.

Re: Remove restrictions in recursive query

From
Renan Alves Fonseca
Date:
On Thu, Mar 27, 2025 at 7:10 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> There is distinct behavior between group by and order by here.  You seem to be mixing them up.
>
> From Select:
>
> select_statement is any SELECT statement without an ORDER BY, LIMIT, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE, or FOR
KEYSHARE clause. (ORDER BY and LIMIT can be attached to a subexpression if it is enclosed in parentheses. Without
parentheses,these clauses will be taken to apply to the result of the UNION, not to its right-hand input expression.) 
>
> This is the exact same parsing precedence order by is being given in the recursive CTE query situation presented
earlier.
>
> David J.
>

You're right. I'm really mixing these 2 here. Thanks for the clarification.

I'll assume that the silence about allowing GROUP BY means it is not a
great idea...



Re: Remove restrictions in recursive query

From
Robert Haas
Date:
On Thu, Mar 27, 2025 at 2:21 PM Renan Alves Fonseca
<renanfonseca@gmail.com> wrote:
> You're right. I'm really mixing these 2 here. Thanks for the clarification.

It looks like GROUP BY binds to the particular UNION branch but ORDER
BY binds to the UNION as a whole:

robert.haas=# select 2 union all select 1;
 ?column?
----------
        2
        1
(2 rows)

robert.haas=# select 2 union all select 1 order by 1;
 ?column?
----------
        1
        2
(2 rows)

robert.haas=# select 2 union all select 1 group by 1;
 ?column?
----------
        2
        1
(2 rows)

> I'll assume that the silence about allowing GROUP BY means it is not a
> great idea...

I don't think there's really anything to keep you from doing this --
just put the grouping operation where you refer to the recursive CTE,
instead of inside the recursive CTE itself. I think allowing it to
appear inside the recursive CTE would be rather confusing -- it's
probably best if the mandatory UNION operation is at the top level.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Remove restrictions in recursive query

From
Tom Lane
Date:
Renan Alves Fonseca <renanfonseca@gmail.com> writes:
> I'll assume that the silence about allowing GROUP BY means it is not a
> great idea...

Well, you can do grouping, ordering, or whatever else you want to the
result of the recursive WITH in the outer query level.  I don't see
any advantage in allowing an additional level of that inside the WITH.

            regards, tom lane



Re: Remove restrictions in recursive query

From
Renan Alves Fonseca
Date:
On Thu, Mar 27, 2025 at 7:32 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > I'll assume that the silence about allowing GROUP BY means it is not a
> > great idea...
>
> I don't think there's really anything to keep you from doing this --
> just put the grouping operation where you refer to the recursive CTE,
> instead of inside the recursive CTE itself. I think allowing it to
> appear inside the recursive CTE would be rather confusing -- it's
> probably best if the mandatory UNION operation is at the top level.
>

I don't think I was clear enough about the proposed improvements. I
hope this example will clarify.

Consider a table representing the edges of a directed acyclic graph.
The following lines will create an example of such a table.

-- create DAG example
\set output_degree 10
\set maxN 20000

create temp table dag(source,target) as
       (with recursive t1(source,target) as
             (select 1,1
           union
     select target, target * i from t1, generate_series(2,:output_degree +1) i
     where target*i< :maxN
     ) select * from t1 where source!=target );

Then, suppose we want to count the number of distinct paths from the
root (1) to all sink nodes. We use the following auxiliary table for
the sink nodes:

create temp table sink_nodes(node) as
       (select target from dag except select source from dag);

The solution supported in PostgreSQL is the following:

with recursive t1(node,path) as
     (select 1,array[1]
           union
     select dag.target, t1.path||dag.target
     from t1 join dag on t1.node=dag.source
    ) select count(*) from t1 join sink_nodes using (node) ;

This query enumerates every path and probably has something like
exponential complexity on the number of edges. It gives these results
in my laptop:
  count
---------
 1114163
(1 row)
Time: 8121.044 ms (00:08.121)

The solution using GROUP BY in the recursive query is the following:

with recursive t1(node,nb_path) as
     (select 1,1::numeric
           union all
      (select dag.target, sum(nb_path)
      from t1 join dag on t1.node=dag.source
      group by 1)
     ) select sum(nb_path) from t1 join sink_nodes using (node) ;

At every iteration step, we sum the number of paths that arrived at a
given node. That is another class of algorithm complexity. The first
query cannot scale to larger inputs, while this one can scale. Here is
the result:

   sum
---------
 1114163
(1 row)
Time: 27.123 ms

I hope this example made clear that allowing the GROUP BY in the
recursive clause significantly increases the capabilities of doing
analytics on graph data, and it is not only syntax sugar. Please let
me know if there is another way of handling this problem efficiently
without the proposed change.

Thanks again for your attention,
Renan



Re: Remove restrictions in recursive query

From
Tom Lane
Date:
Renan Alves Fonseca <renanfonseca@gmail.com> writes:
> The solution using GROUP BY in the recursive query is the following:

> with recursive t1(node,nb_path) as
>      (select 1,1::numeric
>            union all
>       (select dag.target, sum(nb_path)
>       from t1 join dag on t1.node=dag.source
>       group by 1)
>      ) select sum(nb_path) from t1 join sink_nodes using (node) ;

This is not about the GROUP BY; it's about the SUM().
If you try this example you get

regression=# with recursive t1(node,nb_path) as
     (select 1,1::numeric
           union all
      (select dag.target, sum(nb_path)
      from t1 join dag on t1.node=dag.source
      group by 1)
     ) select sum(nb_path) from t1 join sink_nodes using (node) ;
ERROR:  aggregate functions are not allowed in a recursive query's recursive term
LINE 4:       (select dag.target, sum(nb_path)
                                  ^

The code says that that restriction is from the SQL spec, and
it seems to be correct as of SQL:2021.  7.17 <query expression>
SR 3)j)ix)5)C) says

    C) WQEi shall not contain a <query specification> QS such that QS
       immediately contains a <table expression> TE that contains a
       <query name> referencing WQNX and either of the following is true:

      I) TE immediately contains a <having clause> that contains a
         <set function specification>.

      II) QS immediately contains a <select list> SL that contains
          either a <window function>, or a <set function
          specification>, or both.

(<set function specification> is spec-ese for "aggregate function
call")

I don't know the SQL committee's precise reasoning for this
restriction, but I suspect it's because the recursive query
expansion is not well-defined in the presence of an aggregate.
The spec has an interesting comment at the bottom of sub-rule ix:

    NOTE 310 — The restrictions insure that each WLEi, viewed as a
    transformation of the query names of the stratum, is monotonically
    increasing. According to Tarski’s fixed point theorem, this
    insures that there is a fixed point. The General Rules use
    Kleene’s fixed point theorem to define a sequence that converges
    to the minimal fixed point.

            regards, tom lane



Re: Remove restrictions in recursive query

From
Renan Alves Fonseca
Date:
On Thu, Mar 27, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Renan Alves Fonseca <renanfonseca@gmail.com> writes:
> > The solution using GROUP BY in the recursive query is the following:
>
> > with recursive t1(node,nb_path) as
> >      (select 1,1::numeric
> >            union all
> >       (select dag.target, sum(nb_path)
> >       from t1 join dag on t1.node=dag.source
> >       group by 1)
> >      ) select sum(nb_path) from t1 join sink_nodes using (node) ;
>
> This is not about the GROUP BY; it's about the SUM().

Cannot SUM() without GROUP BY, right? (I mean the SUM() in the recursive query)

> If you try this example you get
>
> regression=# with recursive t1(node,nb_path) as
>      (select 1,1::numeric
>            union all
>       (select dag.target, sum(nb_path)
>       from t1 join dag on t1.node=dag.source
>       group by 1)
>      ) select sum(nb_path) from t1 join sink_nodes using (node) ;
> ERROR:  aggregate functions are not allowed in a recursive query's recursive term
> LINE 4:       (select dag.target, sum(nb_path)
>                                   ^

Sorry, I forgot to mention that this query only works in my local hack
(I simply removed the check that raises this error message)

> The code says that that restriction is from the SQL spec, and
> it seems to be correct as of SQL:2021.  7.17 <query expression>
> SR 3)j)ix)5)C) says
>
>     C) WQEi shall not contain a <query specification> QS such that QS
>        immediately contains a <table expression> TE that contains a
>        <query name> referencing WQNX and either of the following is true:
>
>       I) TE immediately contains a <having clause> that contains a
>          <set function specification>.
>
>       II) QS immediately contains a <select list> SL that contains
>           either a <window function>, or a <set function
>           specification>, or both.
>
> (<set function specification> is spec-ese for "aggregate function
> call"

I suspected that this restriction came straight from the specs. I
understand that while the proposed solution can help in some specific
use cases, it is not enough to justify an exception to the spec.

> I don't know the SQL committee's precise reasoning for this
> restriction, but I suspect it's because the recursive query
> expansion is not well-defined in the presence of an aggregate.
> The spec has an interesting comment at the bottom of sub-rule ix:
>
>     NOTE 310 — The restrictions insure that each WLEi, viewed as a
>     transformation of the query names of the stratum, is monotonically
>     increasing. According to Tarski’s fixed point theorem, this
>     insures that there is a fixed point. The General Rules use
>     Kleene’s fixed point theorem to define a sequence that converges
>     to the minimal fixed point.
>
>                         regards, tom lane



Re: Remove restrictions in recursive query

From
Tom Lane
Date:
Renan Alves Fonseca <renanfonseca@gmail.com> writes:
> I suspected that this restriction came straight from the specs. I
> understand that while the proposed solution can help in some specific
> use cases, it is not enough to justify an exception to the spec.

Well, we extend the spec in lots of places.  I'd be okay with removing
this restriction if I were sure there were no bad consequences, but
it seems likely that there are some.  College math was way too long
ago for me to be sure about the "fixed-point" business ... but I think
what they may be on about is that rows produced by aggregation may not
be stable in the face of adding more and more rows via additions to
the recursion workspace.  In your example, I think it somewhat
accidentally doesn't matter: an underlying row might get picked up
and added to a dag.target-grouped row in one recursion step or
a different one, but then you sum all those sums in the top-level
query, so the top sum comes out the same regardless of just what sets
the intermediate grouped rows consisted of.  With a different query
construction though, that would matter.

            regards, tom lane



Re: Remove restrictions in recursive query

From
Renan Alves Fonseca
Date:
On Fri, Mar 28, 2025 at 1:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Well, we extend the spec in lots of places.  I'd be okay with removing
> this restriction if I were sure there were no bad consequences, but
> it seems likely that there are some.  College math was way too long
> ago for me to be sure about the "fixed-point" business ... but I think
> what they may be on about is that rows produced by aggregation may not
> be stable in the face of adding more and more rows via additions to

After a quick math review, I'm quite convinced that the "fixed-point"
business applies only to the UNION [DISTINCT] case, in which the
relations can be seen as sets. Recursion with UNION ALL is a
completely different operation. I hope they mention that in the
standard. Consider this very basic recursive query:

WITH RECURSIVE t1 AS ( SELECT 1 UNION ALL SELECT * FROM t1) SELECT * FROM t1;

It does not converge. When using recursive UNION ALL it is the user
responsibility to create the conditions for convergence.

> the recursion workspace.  In your example, I think it somewhat
> accidentally doesn't matter: an underlying row might get picked up
> and added to a dag.target-grouped row in one recursion step or
> a different one, but then you sum all those sums in the top-level
> query, so the top sum comes out the same regardless of just what sets
> the intermediate grouped rows consisted of.  With a different query
> construction though, that would matter.

Not exactly accidental. I'm aware that the first GROUP BY runs in the
scope of the working table, and I'm aware that I can compose the sum
aggregates. Not every aggregate can be composed like that. The
following is the same query without the inner GROUP BY. As you've
mentioned, the SUM is much faster than COUNT (but still it has the
same complexity).

with recursive t1(node,nb_path) as
     (select 1,1::numeric
           union all
      (select dag.target, 1
      from t1 join dag on t1.node=dag.source)
     ) select sum(nb_path) from t1 join sink_nodes using (node) ;

Even if PostgreSQL allowed us to use GROUP BY in the recursive query,
we would be driving the intermediate steps, which feels wrong in a
declarative paradigm. Ideally, we would write the query above and
PostgreSQL would generate the most optimized execution that we've seen
before. I've seen that there is some ongoing work on PARTIAL
AGGREGATES. It would be beautiful if the optimizer could push down the
aggregate into the recursive clause. Does it sound possible?

Regards,
Renan



Re: Remove restrictions in recursive query

From
Nico Williams
Date:
On Thu, Mar 27, 2025 at 12:37:53PM -0400, Robert Haas wrote:
> A recursive CTE effectively takes two queries that will be run as
> arguments. For some reason, instead of choosing syntax like WITH
> RECURSIVE t1 AS BASE_CASE (initial_query) RECURSIVE_CASE
> (iterated_query), somebody chose WITH RECURSIVE t1 AS (initial_query
> UNION iterated_query) which really doesn't make it very clear that we
> need to be able to break it apart into two separate queries, one of
> which will be run once and one of which will be iterated.

Perhaps because one could have multiple UNIONs.

> It's not a problem if UNION ALL is used within the initial_query and
> it's not a problem if UNION ALL is used within the iterated_query. But
> you can't apply ORDER BY to the result of the UNION, because the UNION
> is kind of fake -- we're not running the UNION as a single query,
> we're running the two halves separately, the first once and the second
> as many times as needed.

Even in the UNION ALL case there's still iteration.  Either the internal
table used to hold the CTE's rows has to be kept in order as rows are
added or it has to be sorted when the CTE query completes.

I see ORDER BY in a recursive query more as either a hint for creating
an internal index on the CTE, or a request to sort the CTE when the
recursive query completes (or as it goes even).

Something I wish were possible is to have indices on CTEs so they can
function more like temp tables that don't need to touch the pg_catalog.
Or maybe the optimizer can create indices on CTEs automatically based on
how the CTEs are used.

Speaking of CTEs as temp tables, GLOBAL TEMP tables would be nice.

Nico
--