Thread: Is postgres able to share sorts required by common partition window functions?

Is postgres able to share sorts required by common partition window functions?

From: Sebastien Arod
Hi all,

I'm trying to optimize the following query on postgres 11.6 (running on Aurora)
select distinct
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by c3) AS c3,
  first_value(c4) OVER (PARTITION BY c1 order by c4) AS c4
from  
  t;

 
From the explain plan (attached at the end of the email) I see that postgresql is doing several sorts: one per window function and one for the distinct, which seems OK.
However, since all the window functions are over the same partition, I would have expected postgresql to "share" a preliminary sort on c1 that would reduce the work of all the window functions, but it doesn't.
I even created an index on c1, hoping that postgresql would be able to use it to minimize the cost of the sorts, but I couldn't make it use it.

Is there something I am missing?

You can find below a script to set up a table and data to reproduce the issue, as well as the explain plan.

Setup Script
create table t(
  pk varchar(200) PRIMARY KEY,
  c1 varchar(200),
  c2 varchar(200),
  c3 varchar(200),
  c4 varchar(200)
);
create index i1 on t (c1);

insert into t
(pk, c1, c2, c3, c4 )
select
 generate_series::text pk,
 'Grp' ||(generate_series / 4)::text c1,
 generate_series::text c2,
 generate_series::text c3,
 generate_series::text c4
from generate_series(0, 1000000);

Explain Plan
Unique  (cost=808480.87..820980.88 rows=1000001 width=123) (actual time=7131.675..7781.082 rows=250001 loops=1)
  ->  Sort  (cost=808480.87..810980.87 rows=1000001 width=123) (actual time=7131.673..7603.926 rows=1000001 loops=1)
        Sort Key: c1, (first_value(c2) OVER (?)), (first_value(c3) OVER (?)), (first_value(c4) OVER (?))
        Sort Method: external merge  Disk: 59640kB
        ->  WindowAgg  (cost=558937.90..578937.92 rows=1000001 width=123) (actual time=5179.374..6268.937 rows=1000001 loops=1)
              ->  Sort  (cost=558937.90..561437.90 rows=1000001 width=91) (actual time=5179.355..5679.136 rows=1000001 loops=1)
                    Sort Key: c1, c4
                    Sort Method: external merge  Disk: 52912kB
                    ->  WindowAgg  (cost=336736.93..356736.95 rows=1000001 width=91) (actual time=3260.950..4389.116 rows=1000001 loops=1)
                          ->  Sort  (cost=336736.93..339236.93 rows=1000001 width=59) (actual time=3260.934..3778.385 rows=1000001 loops=1)
                                Sort Key: c1, c3
                                Sort Method: external merge  Disk: 46176kB
                                ->  WindowAgg  (cost=141877.96..161877.98 rows=1000001 width=59) (actual time=1444.692..2477.284 rows=1000001 loops=1)
                                      ->  Sort  (cost=141877.96..144377.96 rows=1000001 width=27) (actual time=1444.669..1906.993 rows=1000001 loops=1)
                                            Sort Key: c1, c2
                                            Sort Method: external merge  Disk: 39424kB
                                            ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=27) (actual time=0.011..177.815 rows=1000001 loops=1)
Planning Time: 0.214 ms
Execution Time: 7839.646 ms

Re: Is postgres able to share sorts required by common partition window functions?

From: Michael Lewis

Does this give the same result and do the optimization you want?

select
  c1,
  min(c2) AS c2,
  min(c3) AS c3,
  min(c4) AS c4
from  
  t
group by
  c1;

Re: Is postgres able to share sorts required by common partition window functions?

From: Sebastien Arod

Hi Michael,

I simplified the real query before posting it here and I now realize that I oversimplified things.

Unfortunately the real query cannot be re-written with a group by.

Some of the window functions are more complex, with ORDER BY clauses using expressions that involve multiple columns.

The following example would be more realistic:

select distinct
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2, c4) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by coalesce(c4, '000'), c3) AS c3
from  
  t;



On Mon, Jul 6, 2020 at 9:55 PM Michael Lewis <mlewis@entrata.com> wrote:
Does this give the same result and do the optimization you want?

select
  c1,
  min(c2) AS c2,
  min(c3) AS c3,
  min(c4) AS c4
from  
  t
group by
  c1;

Re: Is postgres able to share sorts required by common partition window functions?

From: Michael Lewis

Distinct is a great way to get quick results when writing quick & dirty queries, but I rarely see one perform better than a rewrite that avoids the need. It collects a ton of results, orders them, and throws away duplicates in the process. I don't love the idea of that extra work. Did you say you have an index on c1?

select
  c1,
  sub1.c2,
  sub2.c3
from  
  t
join lateral (select c2 from t as t1 where t1.c1 = t.c1 order by c2, c4 limit 1) as sub1 on true
join lateral (select c3 from t as t1 where t1.c1 = t.c1 order by coalesce(c4, '000'), c3 limit 1) as sub2 on true;


select
  c1,
  (select c2 from t as t1 where t1.c1 = t.c1 order by c2, c4 limit 1) AS c2,
  (select c3 from t as t1 where t1.c1 = t.c1 order by coalesce(c4, '000'), c3 limit 1) AS c3
from  
  t;

I don't know the data, but I assume there may be many rows with the same c1 value, so you would likely benefit from getting the distinct set of c1 values first, like below, as your FROM table.

(select c1 from t group by c1 ) AS t
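
Putting the pieces together, a rough untested sketch (grp is just an alias name I'm making up for the distinct-c1 set):

select
  grp.c1,
  sub1.c2,
  sub2.c3
from
  (select c1 from t group by c1) as grp  -- grp: one row per distinct c1 value
join lateral (select c2 from t where t.c1 = grp.c1 order by c2, c4 limit 1) as sub1 on true
join lateral (select c3 from t where t.c1 = grp.c1 order by coalesce(c4, '000'), c3 limit 1) as sub2 on true;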

Re: Is postgres able to share sorts required by common partition window functions?

From: "David G. Johnston"
On Monday, July 6, 2020, Michael Lewis <mlewis@entrata.com> wrote:
Did you say you have an index on c1?
[...]
I don't know the data, but I assume there may be many rows with the same c1 value, so then you would likely benefit from getting that distinct set first like below as your FROM table.

Re-reading the original email I see both the answer to your question and the data being queried.

David J.

Re: Is postgres able to share sorts required by common partition window functions?

From: "David G. Johnston"
On Monday, July 6, 2020, Sebastien Arod <sebastien.arod@gmail.com> wrote:
I would have expected postgresql to "share" a preliminary sort on c1 that would then be useful to reduce the work on all window functions but it doesn't. 

The plan shown does share - the output of one sort goes into another.  Subsequent sorts still have to happen but they should be faster as the first field is already grouped.  Doesn’t change the plan though.
 
I even created an index on c1 hoping that postgresql would be able to use it in order to minimize the cost of the sorts but I couldn't make it use it.

Use it how?  You are still evaluating 250k groups so doing anything piece-wise seems like an expected loss compared to sequential scan.

David J.
 
Re: Is postgres able to share sorts required by common partition window functions?

From: Sebastien Arod

Michael, David thanks for your quick replies.

@Michael
I initially dismissed writing this query using joins or subselects because the real query has about 80 columns, and I was afraid that having 80 joins/subselects would cause issues with postgresql, including the planner falling back to GEQO.
I'll test it anyway with real data and see how it behaves.
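
(For reference, the planner settings that govern this can be checked as below; the defaults noted in the comments are the stock PostgreSQL ones and may differ on Aurora.)

SHOW geqo_threshold;       -- GEQO takes over once the FROM list reaches this many items (default 12)
SHOW join_collapse_limit;  -- explicit JOINs stop being flattened/reordered beyond this many items (default 8)
SHOW from_collapse_limit;  -- subqueries stop being flattened beyond this many items (default 8)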

@David About Subsequent sorts
From my experiments the time of a subsequent sort is not reduced by the previous sort but is in fact higher.
In Query 1 below, the Sort node on key (c1, c2) takes ~1893 ms excluding child nodes.
In Query 2:
  • the Sort node on key (c1) takes ~1558 ms excluding child nodes
  • the Sort node on key (c1, c2) takes ~2964 ms excluding child nodes, which is higher than the time for the same sort key in Query 1, even though in Query 2 postgresql could have benefited from the preliminary sort on c1
I can't figure out why it would be slower.

Query1
explain analyze select
  first_value(c2) OVER (PARTITION BY c1 order by c2)
from t;

Explain Plan  Query1
WindowAgg  (cost=135042.46..155042.48 rows=1000001 width=47) (actual time=1555.005..2639.659 rows=1000001 loops=1)
  ->  Sort  (cost=135042.46..137542.46 rows=1000001 width=15) (actual time=1554.989..2070.462 rows=1000001 loops=1)
        Sort Key: c1, c2
        Sort Method: external merge  Disk: 25960kB
        ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=15) (actual time=0.013..177.037 rows=1000001 loops=1)
Planning Time: 0.115 ms
Execution Time: 2693.935 ms

Query2
explain analyze select
  first_value(c2) OVER (PARTITION BY c1),
  first_value(c2) OVER (PARTITION BY c1 order by c2)
from t;
Explain Plan Query2
WindowAgg  (cost=313730.43..333730.45 rows=1000001 width=79) (actual time=4692.257..5702.226 rows=1000001 loops=1)
  ->  Sort  (cost=313730.43..316230.43 rows=1000001 width=47) (actual time=4692.152..5174.928 rows=1000001 loops=1)
        Sort Key: c1, c2
        Sort Method: external merge  Disk: 32688kB
        ->  WindowAgg  (cost=135042.46..152542.48 rows=1000001 width=47) (actual time=1188.109..2210.845 rows=1000001 loops=1)
              ->  Sort  (cost=135042.46..137542.46 rows=1000001 width=15) (actual time=1188.099..1709.253 rows=1000001 loops=1)
                    Sort Key: c1
                    Sort Method: external merge  Disk: 25960kB
                    ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=15) (actual time=0.014..150.435 rows=1000001 loops=1)
Planning Time: 0.059 ms
Execution Time: 5756.591 ms


@David About using an index
You are correct that a simple index cannot be leveraged efficiently.
However, a covering index including all columns and starting with the partition column c1 could be used to avoid sorting on c1, right?

But in my tests it is not used. In the following example covering_index could be used to avoid the sort on c1, but it isn't:
Covering Index
create index covering_index on t (c1, c2, c3, c4);

Query
explain analyze select
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2, c4) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by coalesce(c4, '000'), c3) AS c3
from  
  t;


Plan
WindowAgg  (cost=417853.93..440353.95 rows=1000001 width=123) (actual time=3071.980..4184.030 rows=1000001 loops=1)
  ->  Sort  (cost=417853.93..420353.93 rows=1000001 width=91) (actual time=3071.962..3575.641 rows=1000001 loops=1)
        Sort Key: c1, (COALESCE(c4, '000'::character varying)), c3
        Sort Method: external merge  Disk: 52912kB
        ->  WindowAgg  (cost=193152.96..215652.98 rows=1000001 width=91) (actual time=1268.373..2271.463 rows=1000001 loops=1)
              ->  Sort  (cost=193152.96..195652.96 rows=1000001 width=59) (actual time=1268.357..1711.526 rows=1000001 loops=1)
                    Sort Key: c1, c2, c4
                    Sort Method: external merge  Disk: 46168kB
                    ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=59) (actual time=0.014..163.347 rows=1000001 loops=1)
Planning Time: 0.081 ms
Execution Time: 4250.367 ms

However, the index is used if one of the PARTITION BY + ORDER BY combinations matches the beginning of the index exactly. For instance, the query:
select
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by coalesce(c4, '000'), c3) AS c3
from  
  t;

uses covering_index to avoid the sort on (c1, c2).

On Tue, Jul 7, 2020 at 12:48 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Monday, July 6, 2020, Sebastien Arod <sebastien.arod@gmail.com> wrote:
I would have expected postgresql to "share" a preliminary sort on c1 that would then be useful to reduce the work on all window functions but it doesn't. 

The plan shown does share - the output of one sort goes into another.  Subsequent sorts still have to happen but they should be faster as the first field is already grouped.  Doesn’t change the plan though.
 
I even created an index on c1 hoping that postgresql would be able to use it in order to minimize the cost of the sorts but I couldn't make it use it.

Use it how?  You are still evaluating 250k groups so doing anything piece-wise seems like an expected loss compared to sequential scan.

David J.
 
Re: Is postgres able to share sorts required by common partition window functions?

From: Michael Lewis

On Monday, July 6, 2020, Michael Lewis <mlewis@entrata.com> wrote:
Did you say you have an index on c1?
[...]
I don't know the data, but I assume there may be many rows with the same c1 value, so then you would likely benefit from getting that distinct set first like below as your FROM table.

> Re-reading the original email I see both the answer to your question and the data being queried.
> David J.

Thanks David. I meant it as a rhetorical question, since yes of course there was an index. I also didn't trust the example to be true to real data in terms of c1 values distribution.


Re: Is postgres able to share sorts required by common partition window functions?

From: Michael Lewis

On Tue, Jul 7, 2020 at 9:01 AM Sebastien Arod <sebastien.arod@gmail.com> wrote:
Michael, David thanks for your quick replies.

@Michael
I initially dismissed writing this query using joins or subselects because the real query has about 80 columns and I was afraid that having 80 joins/subselect would cause issues with postgresql including planner that would fallback to GEQO.
I'll test it anyway with real data and see how it behaves.

Contrived and overly simplified examples often lead to uninformed, bad advice. I would not attempt joins unless the number of distinct c1 values is relatively small. It might go fine though, and depending on your query and the statistics on the table, perhaps setting join_collapse_limit = 1 would be prudent to constrain the planner to your desired plan and avoid any chance of the genetic optimizer getting involved.
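
Something along these lines, scoped to a single transaction (untested sketch):

BEGIN;
SET LOCAL join_collapse_limit = 1;  -- keep the join order exactly as written in the query
-- ... run the big lateral-join query here ...
COMMIT;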

        Sort Method: external merge  Disk: 52912kB
                    Sort Method: external merge  Disk: 46168kB

What is your work_mem set to? Would it be possible to set it higher (for this process) to avoid spilling to disk?
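
For example (the 256MB value is only a placeholder; pick something that fits the instance's memory and expected concurrency):

SHOW work_mem;                 -- stock PostgreSQL default is 4MB; Aurora defaults may differ
SET work_mem = '256MB';        -- session level
-- or scoped to a single transaction:
BEGIN;
SET LOCAL work_mem = '256MB';
-- ... run the query ...
COMMIT;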