Re: Optimizer on sort aggregate - Mailing list pgsql-hackers

From Feng Tian
Subject Re: Optimizer on sort aggregate
Date
Msg-id CAFWGqntWU7+M_qm5ASyypvJbxpOCs5y8DHeA5sQWWSHs28c1Jw@mail.gmail.com
Whole thread Raw
In response to Re: Optimizer on sort aggregate  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Optimizer on sort aggregate
Re: Optimizer on sort aggregate
List pgsql-hackers
Hi, David,

Yes, switch sorting order would loose an interesting order so if user dictates order by t, i; planner need to resort to its cost model.   Estimating cardinality of groupby is a much bigger topic than this thread.

I feel sorting string as if it is bytea is particularly interesting.  I am aware Peter G's patch and I think it is great, but for this sort agg case, first, I believe it is still slower than sorting bytea, and second, Peter G's patch depends on data.   A common long prefix will make the patch less effective, which is probably not so uncommon (for example, URL with a domain prefix).  I don't see any downside of sort bytea, other than lost the interest ordering.  

Thanks,
Feng

   

On Fri, Oct 17, 2014 at 4:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian <fengttt@gmail.com> wrote:
Hi,

Consider the following queries.

create table t(i int, j int, k int, t text);
insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;

ftian=# explain select count(distinct j) from t group by t, i;
                               QUERY PLAN                              
------------------------------------------------------------------------
 GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
   ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
         Sort Key: t, i
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
(4 rows)


The query,
select count(distinct j) from t group by t, i;

runs for 35 seconds.  However, if I change the query to,
select count(distinct j) from t group by i, t;  -- note switching the ordering
select count(distinct j) from t group by decode(t, 'escape'), i; -- convert t to bytea

Run times are just about 5 and 6.5 seconds.  The reason is clear, compare a string with collation is slow, which is well understood by pg hackers.   However, here, the sorting order is forced by the planner, not user.   Planner can do the following optimizations,

1. for the sort we generated for sort agg, planner can switch column ordering, put int before string,
2. for the sort we generated for sort agg, use bytea compare instead of string compare.

They will bring big improvement to this common query.   Is this something reasonable? 


It seems like it might be worth looking into, but I think it's likely more complex than sorting the group by order into narrowest column first.

If the query was:

select count(distinct j) from t group by t, i order by t, i;

Then if that was  rewritten to make the group by i,t then the planner would then need to perform an additional sort on t,i to get the correct order by for the final result, which may or may not be faster, it would depend on how many groups there were to sort. Though I guess you could possibly just skip the optimisation if the next node up didn't need any specific ordering.

I also wonder if taking into account the column's width is not quite enough. For example if the 'i' column only had 1 distinct value, then performing the group by on 'i' first wouldn't help at all. Keep in mind that the columns could be much closer in width than in your example, e.g int and bigint. Perhaps you'd need to include some sort of heuristics to look at the number of distinct values and the width, and form some sort of weighted estimates about which column to put first.

Regards

David Rowley

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Hide 'Execution time' in EXPLAIN (COSTS OFF)
Next
From: Bruce Momjian
Date:
Subject: Re: Directory/File Access Permissions for COPY and Generic File Access Functions