Thread: group by of multiple columns

group by of multiple columns

From
Flyingfox Lee
Date:
I am doing a `group by` on a table with ~3 million rows. The query is simply `select A, B, C, D, E, count(1) from t group by A, B, C, D, E order by 6`; it takes ~3 minutes and returns ~500 rows. To speed this up, should I add a composite index on A, B, C, D, E, or are there parameters in postgresql.conf I can tweak? I am new to Postgres, and all the parameters in postgresql.conf are at their defaults.

Re: group by of multiple columns

From
Andreas Kretschmer
Date:
Flyingfox Lee <flyingfoxlee@gmail.com> wrote:

> I am doing a `group by` on a table with ~3 million rows. The query is simply
> `select A, B, C, D, E, count(1) from t group by A, B, C, D, E order by 6`;
> it takes ~3 minutes and returns ~500 rows. To speed this up, should I add a
> composite index on A, B, C, D, E, or are there parameters in postgresql.conf
> I can tweak? I am new to Postgres, and all the parameters in postgresql.conf
> are at their defaults.

The only thing you can do is run the query with explain analyse and see
how it works. You can tweak work_mem; a simple example (first with the
default work_mem, then with it lowered to 64kB, which forces a sort-based
plan that spills to disk, and then raised again):

test=# create table b (a int, b int, c int, d int);
CREATE TABLE
Time: 0,735 ms
test=*# insert into b select (random() * 1000)::int, (random()*1000)::int, (random() * 1000)::int, (random() * 1000)::int from generate_series(1,100000) s;
INSERT 0 100000
Time: 332,212 ms
test=*# explain analyse select a,b,c,d, count(1) from b group by a,b,c,d;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2695.53..2791.29 rows=9576 width=16) (actual time=126.904..191.598 rows=100000 loops=1)
   Group Key: a, b, c, d
   ->  Seq Scan on b  (cost=0.00..1498.57 rows=95757 width=16) (actual time=0.012..33.520 rows=100000 loops=1)
 Planning time: 0.095 ms
 Execution time: 214.584 ms
(5 rows)

Time: 215,121 ms
test=*# set work_mem to '64kB';
SET
Time: 0,109 ms
test=*# explain analyse select a,b,c,d, count(1) from b group by a,b,c,d;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=12697.07..14229.18 rows=9576 width=16) (actual time=206.603..388.892 rows=100000 loops=1)
   Group Key: a, b, c, d
   ->  Sort  (cost=12697.07..12936.46 rows=95757 width=16) (actual time=206.577..276.864 rows=100000 loops=1)
         Sort Key: a, b, c, d
         Sort Method: external merge  Disk: 2552kB
         ->  Seq Scan on b  (cost=0.00..1498.57 rows=95757 width=16) (actual time=0.014..33.412 rows=100000 loops=1)
 Planning time: 0.071 ms
 Execution time: 413.876 ms
(8 rows)

Time: 414,246 ms
test=*# set work_mem to '4MB';
SET
Time: 0,059 ms
test=*# explain analyse select a,b,c,d, count(1) from b group by a,b,c,d;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2695.53..2791.29 rows=9576 width=16) (actual time=129.093..194.711 rows=100000 loops=1)
   Group Key: a, b, c, d
   ->  Seq Scan on b  (cost=0.00..1498.57 rows=95757 width=16) (actual time=0.014..33.762 rows=100000 loops=1)
 Planning time: 0.067 ms
 Execution time: 219.694 ms
(5 rows)


So, if you see 'Sort Method: external merge  Disk: ...' in the plan, you should increase work_mem.
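As a concrete sketch (the 256MB below is only an illustration, not a recommendation; size it to your RAM and to how many sessions run such queries at once), you can raise work_mem for the current session and re-check the plan for your original query:

set work_mem to '256MB';  -- illustrative value only

-- re-check whether the sort/aggregate still spills to disk
explain analyse
select A, B, C, D, E, count(1)
from t
group by A, B, C, D, E
order by 6;

A session-level SET lasts only for the current connection. To make the setting persistent, put work_mem in postgresql.conf (and reload the server), or set it for a single role with ALTER ROLE ... SET work_mem = '...'.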


Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°


Re: group by of multiple columns

From
Flyingfox Lee
Date:

Thank you, this works like a charm.

On Sun, Jan 4, 2015 at 20:01, Andreas Kretschmer <akretschmer@spamfence.net> wrote:
> The only thing you can do is run the query with explain analyse and see
> how it works. [...]

