Re: Optimizer on sort aggregate - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: Optimizer on sort aggregate
Date
Msg-id 20141018.083516.694943555601559481.t-ishii@sraoss.co.jp
Whole thread Raw
In response to Optimizer on sort aggregate  (Feng Tian <fengttt@gmail.com>)
Responses Re: Optimizer on sort aggregate
List pgsql-hackers
> The query,
> select count(distinct j) from t group by t, i;
> 
> runs for 35 seconds.  However, if I change the query to,
> select count(distinct j) from t group by i, t;  -- note switching the
> ordering
> select count(distinct j) from t group by decode(t, 'escape'), i; -- convert
> t to bytea
> 
> Run times are just about 5 and 6.5 seconds.  The reason is clear, compare a
> string with collation is slow, which is well understood by pg hackers.
> However, here, the sorting order is forced by the planner, not user.
> Planner can do the following optimizations,

Interesting. I got following result:

test=# explain analyze select count(distinct j) from t group by t, i;
   QUERY PLAN                                                        
 

--------------------------------------------------------------------------------------------------------------------------GroupAggregate
(cost=137519.84..157519.84 rows=1000000 width=22) (actual time=1332.937..2431.238 rows=1000000 loops=1)  Group Key: t,
i ->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual time=1332.922..1507.413 rows=1000000 loops=1)
  Sort Key: t, i        Sort Method: external merge  Disk: 33232kB        ->  Seq Scan on t  (cost=0.00..17352.00
rows=1000000width=22) (actual time=0.006..131.406 rows=1000000 loops=1)Planning time: 0.031 msExecution time: 2484.271
ms
(8 rows)

Time: 2484.520 ms

test=# explain analyze select count(distinct j) from t group by i, t;
   QUERY PLAN                                                        
 

--------------------------------------------------------------------------------------------------------------------------GroupAggregate
(cost=137519.84..157519.84 rows=1000000 width=22) (actual time=602.510..1632.087 rows=1000000 loops=1)  Group Key: i, t
->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual time=602.493..703.274 rows=1000000 loops=1)
SortKey: i, t        Sort Method: external sort  Disk: 33240kB        ->  Seq Scan on t  (cost=0.00..17352.00
rows=1000000width=22) (actual time=0.014..129.213 rows=1000000 loops=1)Planning time: 0.176 msExecution time: 1685.575
ms
(8 rows)

Time: 1687.641 ms

Not so big difference here (maybe because I use SSD) but there is
still about 50% difference in execution time. Note that I disable
locale support.

> 1. for the sort we generated for sort agg, planner can switch column
> ordering, put int before string,
> 2. for the sort we generated for sort agg, use bytea compare instead of
> string compare.
> 
> They will bring big improvement to this common query.   Is this something
> reasonable?

Not looking at sort and planner code closely, I guess planner does
not recognize the cost difference between "external merge" and
"external sort" because cost estimations for sort step in each plan
are exactly same (137519.84..140019.84). However I am not sure if we
should fix the planner or should fix our sort machinery since it is
possible that the sort machinery should not expose such a big
difference in the two sort methods.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



pgsql-hackers by date:

Previous
From: Marko Tiikkaja
Date:
Subject: Re: get_actual_variable_range vs idx_scan/idx_tup_fetch
Next
From: David Rowley
Date:
Subject: Re: Optimizer on sort aggregate