Re: Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION" - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION"
Date
Msg-id 5717be79-8f7d-4039-815a-b6df22d3dc11@iki.fi
Whole thread Raw
In response to Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION"  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION"
List pgsql-hackers
On 21/05/2024 05:58, David Rowley wrote:
> Let this thread be for at least the coding portion of this or be my
> thread for this patch for the v18 cycle if the RMT rules in favour of
> keeping that code reverted for v17.
> 
> I've attached 2 patches.
> 
> 0001 is a simple revert of Tom's revert (7204f3591).
> 0002 fixes the issue reported by Hubert.
> 
> If anyone wants to have a look, I'd be grateful for that.  Tom did
> call for further review after this being the 4th issue reported for
> 66c0185a3.

My planner experience is a bit rusty, but I took a quick look. Looks 
generally OK to me. Some comments below:

> +    /* For for UNIONs (not UNION ALL), try sorting, if sorting is possible */

Duplicated word: "For for"

> /*
>  * build_setop_child_paths
>  *        Build paths for the set op child relation denoted by 'rel'.
>  *
>  * interesting_pathkeys: if not NIL, also include paths that suit these
>  * pathkeys, sorting any unsorted paths as required.
>  * *pNumGroups: if not NULL, we estimate the number of distinct groups
>  *        in the result, and store it there

The indentation on 'interesting_pathkeys' and '*pNumGroups' is inconsistent.

I have a vague feeling that this comment deserves to be longer. The 
function does a lot of things. How is 'child_tlist' different from 
rel->reltarget for example?

'interesting_pathkeys' is modified by the call to 
add_setop_child_rel_equivalences(): it adds members to the 
EquivalenceClasses of the pathkeys. Is that worth mentioning here, or is 
that obvious to someone who know more about the planner?

>         /*
>          * Create paths to suit final sort order required for setop_pathkeys.
>          * Here we'll sort the cheapest input path (if not sorted already) and
>          * incremental sort any paths which are partially sorted.
>          */
>         is_sorted = pathkeys_count_contained_in(setop_pathkeys,
>                                                 subpath->pathkeys,
>                                                 &presorted_keys);
> 
>         if (!is_sorted)
>         {

Maybe also mention that if it's already sorted, it's used as is.

BTW, could the same machinery be used for INTERSECT as well? There was a 
brief mention of that in the original thread, but I didn't understand 
the details. Not for v17, but I'm curious. I was wondering if 
build_setop_child_paths() should be named build_union_child_paths(), 
since it's only used with UNIONs, but I guess it could be used as is for 
INTERSECT too.


# Testing

postgres=# begin; create table foo as select i from generate_series(1, 
1000000) i; create index on foo (i); commit;
BEGIN
SELECT 1000000
CREATE INDEX
COMMIT
postgres=# set enable_seqscan=off;
SET
postgres=# explain (select 1 as i union select i from foo) order by i;
                                               QUERY PLAN 

------------------------------------------------------------------------------------------------------
  Unique  (cost=144370.89..149370.89 rows=1000001 width=4)
    ->  Sort  (cost=144370.89..146870.89 rows=1000001 width=4)
          Sort Key: (1)
          ->  Append  (cost=0.00..31038.44 rows=1000001 width=4)
                ->  Result  (cost=0.00..0.01 rows=1 width=4)
                ->  Index Only Scan using foo_i_idx on foo 
(cost=0.42..26038.42 rows=1000000 width=4)
(6 rows)

I'm disappointed it couldn't produce a MergeAppend plan. If you replace 
the "union" with "union all" you do get a MergeAppend.

Some more cases where I hoped for a MergeAppend:

postgres=# explain (select i, 'foo' from foo union select i, 'foo' from 
foo) order by 1;
                                                  QUERY PLAN 

-------------------------------------------------------------------------------------------------------------
  Unique  (cost=380767.54..395767.54 rows=2000000 width=36)
    ->  Sort  (cost=380767.54..385767.54 rows=2000000 width=36)
          Sort Key: foo.i, ('foo'::text)
          ->  Append  (cost=0.42..62076.85 rows=2000000 width=36)
                ->  Index Only Scan using foo_i_idx on foo 
(cost=0.42..26038.42 rows=1000000 width=36)
                ->  Index Only Scan using foo_i_idx on foo foo_1 
(cost=0.42..26038.42 rows=1000000 width=36)
(6 rows)


postgres=# explain (select 'foo', i from foo union select 'bar', i from 
foo) order by 1;
                                                  QUERY PLAN 

-------------------------------------------------------------------------------------------------------------
  Unique  (cost=380767.54..395767.54 rows=2000000 width=36)
    ->  Sort  (cost=380767.54..385767.54 rows=2000000 width=36)
          Sort Key: ('foo'::text), foo.i
          ->  Append  (cost=0.42..62076.85 rows=2000000 width=36)
                ->  Index Only Scan using foo_i_idx on foo 
(cost=0.42..26038.42 rows=1000000 width=36)
                ->  Index Only Scan using foo_i_idx on foo foo_1 
(cost=0.42..26038.42 rows=1000000 width=36)
(6 rows)


The following two queries are the same from the user's point of view, 
but one is written using WITH:

postgres=# explain (select i from foo union (select 1::int order by 1) 
union select i from foo) order by 1;
                                                  QUERY PLAN 

------------------------------------------------------------------------------------------------------------
  Unique  (cost=326083.66..336083.67 rows=2000001 width=4)
    ->  Sort  (cost=326083.66..331083.67 rows=2000001 width=4)
          Sort Key: foo.i
          ->  Append  (cost=0.42..62076.87 rows=2000001 width=4)
                ->  Index Only Scan using foo_i_idx on foo 
(cost=0.42..26038.42 rows=1000000 width=4)
                ->  Result  (cost=0.00..0.01 rows=1 width=4)
                ->  Index Only Scan using foo_i_idx on foo foo_1 
(cost=0.42..26038.42 rows=1000000 width=4)
(7 rows)

postgres=# explain with x (i) as (select 1::int order by 1)  (select i 
from foo union select i from x union select i from foo) order by 1;
                                               QUERY PLAN 

------------------------------------------------------------------------------------------------------
  Unique  (cost=0.89..82926.54 rows=2000001 width=4)
    ->  Merge Append  (cost=0.89..77926.54 rows=2000001 width=4)
          Sort Key: foo.i
          ->  Index Only Scan using foo_i_idx on foo 
(cost=0.42..26038.42 rows=1000000 width=4)
          ->  Sort  (cost=0.02..0.03 rows=1 width=4)
                Sort Key: (1)
                ->  Result  (cost=0.00..0.01 rows=1 width=4)
          ->  Index Only Scan using foo_i_idx on foo foo_1 
(cost=0.42..26038.42 rows=1000000 width=4)
(8 rows)

I would've expected a MergeAppend in both cases.


None of these test cases are broken as such, you just don't get the 
benefit of the optimization. I suspect they might all have the same root 
cause, as they all involve constants in the target list. I think that's 
a pretty common use case of UNION though.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: "Andrey M. Borodin"
Date:
Subject: Re: Injection points: preloading and runtime arguments
Next
From: Ranier Vilela
Date:
Subject: Re: CREATE DATABASE with filesystem cloning