Re: Todo: Teach planner to evaluate multiple windows in the optimal order - Mailing list pgsql-hackers

From Ankit Kumar Pandey
Subject Re: Todo: Teach planner to evaluate multiple windows in the optimal order
Date
Msg-id 2f9f632d-3297-09dd-1d01-af3e0961c446@gmail.com
Whole thread Raw
In response to Re: Todo: Teach planner to evaluate multiple windows in the optimal order  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Todo: Teach planner to evaluate multiple windows in the optimal order
List pgsql-hackers
> On 10/01/23 10:53, David Rowley wrote:

> the total cost is the same for both of these, but the execution time
> seems to vary quite a bit.

This is really weird, I tried it different ways (to rule out any issues 
due to

caching) and execution time varied in spite of having same cost.

> Maybe we should try and do this for DISTINCT queries if the
> distinct_pathkeys match the orderby_pathkeys. That seems a little less
> copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
> likely that the ORDER BY might opt to use the Unique path for DISTINCT
> since it'll already have the correct pathkeys.

> However, if the ORDER BY has fewer columns then it might be cheaper to Hash Aggregate and
> then sort all over again, especially so when the DISTINCT removes a
> large proportion of the rows.

Isn't order by pathkeys are always fewer than distinct pathkeys?

distinct pathkeys = order by pathkeys + window functions pathkeys

Again, I got your point which that it is okay to pushdown order by clause

if distinct is doing unique sort. But problem is (atleast from what I am 
facing),

distinct is not caring about pushed down sortkeys, it goes with hashagg 
or unique

with some other logic (mentioned below).


Consider following (with distinct clause restriction removed)

if (parse->distinctClause)
{
    List* distinct_pathkeys = make_pathkeys_for_sortclauses(root, parse->distinctClause, root->processed_tlist);
    if (!compare_pathkeys(distinct_pathkeys, orderby_pathkeys)==1) // distinct key > order by key
        skip = true; // this is used to skip order by pushdown


}

CASE #1:

explain (costs off) select distinct a,b, min(a) over (partition by a), sum (a) over (partition by a) from abcd order by
a,b;
                         QUERY PLAN
-----------------------------------------------------------
  Sort
    Sort Key: a, b
    ->  HashAggregate
          Group Key: a, b, min(a) OVER (?), sum(a) OVER (?)
          ->  WindowAgg
                ->  Sort
                      Sort Key: a, b
                      ->  Seq Scan on abcd
(8 rows)

explain (costs off) select distinct a,b,c, min(a) over (partition by a), sum (a) over (partition by a) from abcd order
bya,b,c;
 
                           QUERY PLAN
--------------------------------------------------------------
  Sort
    Sort Key: a, b, c
    ->  HashAggregate
          Group Key: a, b, c, min(a) OVER (?), sum(a) OVER (?)
          ->  WindowAgg
                ->  Sort
                      Sort Key: a, b, c
                      ->  Seq Scan on abcd
(8 rows)

No matter how many columns are pushed down, it does hashagg.

On the other hand:

CASE #2:

EXPLAIN (costs off) SELECT DISTINCT depname, empno, min(salary) OVER (PARTITION BY depname) depminsalary,sum(salary)
OVER(PARTITION BY depname) depsalary
 
FROM empsalary
ORDER BY depname, empno;
                                     QUERY PLAN
----------------------------------------------------------------------------------
  Unique
    ->  Sort
          Sort Key: depname, empno, (min(salary) OVER (?)), (sum(salary) OVER (?))
          ->  WindowAgg
                ->  Sort
                      Sort Key: depname, empno
                      ->  Seq Scan on empsalary
(7 rows)

EXPLAIN (costs off) SELECT DISTINCT depname, empno, enroll_date, min(salary) OVER (PARTITION BY depname)
depminsalary,sum(salary)OVER (PARTITION BY depname) depsalary
 
FROM empsalary
ORDER BY depname, empno, enroll_date;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
  Unique
    ->  Sort
          Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)), (sum(salary) OVER (?))
          ->  WindowAgg
                ->  Sort
                      Sort Key: depname, empno, enroll_date
                      ->  Seq Scan on empsalary
(7 rows)

It keeps doing Unique.

In both of the cases, compare_pathkeys(distinct_pathkeys, 
orderby_pathkeys) returns 1

Looking bit further, planner is choosing things correctly.

I could see cost of unique being higher in 1st case and lower in 2nd case.

But the point is, if sort for orderby is pushdown, shouldn't there be 
some discount on

cost of Unique sort (so that there is more possibility of it being 
favorable compared to HashAgg in certain cases)?

Again, cost of Unqiue node is taken as cost of sort node as it is, but

for HashAgg, new cost is being computed. If we do incremental sort here 
(for unique node),

as we have pushed down orderby's, unique cost could be reduced and our 
optimization could

be made worthwhile (I assume this is what you intended here) in case of 
distinct.

Eg:

EXPLAIN SELECT DISTINCT depname, empno, enroll_date, min(salary) OVER (PARTITION BY depname) depminsalary,sum(salary)
OVER(PARTITION BY depname) depsalary
 
FROM empsalary
ORDER BY depname, empno, enroll_date;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
  Unique  (cost=1.63..1.78 rows=10 width=56)
    ->  Sort  (cost=1.63..1.66 rows=10 width=56)
          Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)), (sum(salary) OVER (?))
          ->  WindowAgg  (cost=1.27..1.47 rows=10 width=56)
                ->  Sort  (cost=1.27..1.29 rows=10 width=48)
                      Sort Key: depname, empno, enroll_date
                      ->  Seq Scan on empsalary  (cost=0.00..1.10 rows=10 width=48)

depname, empno, enroll_date are presorted but still strict sorting is 
done on all columns.


Additionally,

> the total cost is the same for both of these, but the execution time
> seems to vary quite a bit.

Even if I pushdown one or two path keys, end result is same cost (which 
isn't helping).


-- 
Regards,
Ankit Kumar Pandey




pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Next
From: Pavel Luzanov
Date:
Subject: Re: psql: Add role's membership options to the \du+ command