Thread: Performance issue: index not used on GROUP BY...

Performance issue: index not used on GROUP BY...

From
gmb
Date:
Hi all

I have the following table with 10+ million records:

create table ddetail (
  ddet_id serial,
  co_id integer,
  client_id integer,
  doc_no varchar,
  line_id integer,
  batch_no integer,
  amount numeric,
  ...,
  PRIMARY KEY (co_id, client_id, doc_no, line_id, ddet_id)
);

When doing the following query on this table, performance is really slow:

SELECT co_id, client_id, doc_no, line_id, sum(amount)
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id;

It seems as if the planner is not using the PRIMARY KEY as an index, which
is what I assumed it would do.

Can somebody please confirm whether queries using aggregate functions and
GROUP BY should use indexes?


Thanks in advance

gmb





Re: Performance issue: index not used on GROUP BY...

From
Victor Yegorov
Date:
2014-08-28 11:50 GMT+03:00 gmb <gmbouwer@gmail.com>:
It seems as if the planner is not using the PRIMARY KEY as an index, which
is what I assumed it would do.

Can you send `EXPLAIN (analyze, buffers)` for your query instead?
It'll show exactly what's going on.
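
For the query you posted, that would be:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT co_id, client_id, doc_no, line_id, sum(amount)
    FROM ddetail
    GROUP BY co_id, client_id, doc_no, line_id;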


--
Victor Y. Yegorov

Re: Performance issue: index not used on GROUP BY...

From
gmb
Date:

> Can you send `EXPLAIN (analyze, buffers)` for your query instead?
> It'll show exactly what's going on.

GroupAggregate  (cost=303425.31..339014.43 rows=136882 width=48) (actual
time=4708.181..6688.699 rows=287268 loops=1)
  Buffers: shared read=23899, temp read=30974 written=30974
  ->  Sort  (cost=303425.31..306847.34 rows=1368812 width=48) (actual
time=4708.170..5319.429 rows=1368744 loops=1)
        Sort Key: co_id, client_id, doc_no,
        Sort Method: external merge  Disk: 80304kB
        Buffers: shared read=23899, temp read=30974 written=30974
        ->  Seq Scan on ddetail  (cost=0.00..37587.12 rows=1368812 width=48)
(actual time=0.122..492.964 rows=1368744 loops=1)
              Buffers: shared read=23899
Total runtime: 6708.244 ms


My initial attempt was this (this is what I actually need):

SELECT co_id, client_id, doc_no, line_id, batch_no, sum(amount)
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

but I removed the batch_no column from the query because I thought it was
the cause of the problem (batch_no is not part of my PK).


Thanks





Re: Performance issue: index not used on GROUP BY...

From
Victor Yegorov
Date:
2014-08-28 12:08 GMT+03:00 gmb <gmbouwer@gmail.com>:
GroupAggregate  (cost=303425.31..339014.43 rows=136882 width=48) (actual
time=4708.181..6688.699 rows=287268 loops=1)
  Buffers: shared read=23899, temp read=30974 written=30974
  ->  Sort  (cost=303425.31..306847.34 rows=1368812 width=48) (actual
time=4708.170..5319.429 rows=1368744 loops=1)
        Sort Key: co_id, client_id, doc_no,
        Sort Method: external merge  Disk: 80304kB
        Buffers: shared read=23899, temp read=30974 written=30974
        ->  Seq Scan on ddetail  (cost=0.00..37587.12 rows=1368812 width=48)
(actual time=0.122..492.964 rows=1368744 loops=1)
              Buffers: shared read=23899
Total runtime: 6708.244 ms


My initial attempt was this (this is what I actually need):

SELECT co_id, client_id, doc_no, line_id, batch_no, sum(amount)
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

I think an index will be of no help here, as (1) you're reading the whole table anyway and (2) `amount` is not part of your index.
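
If you wanted an index to help this query at all, it would have to include
`amount` as well, something like this (untested sketch, name illustrative):

    -- hypothetical covering index; `amount` added as the trailing column
    CREATE INDEX ix_ddetail_covering
        ON ddetail (co_id, client_id, doc_no, line_id, batch_no, amount);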

Try to avoid the disk-based sort by increasing `work_mem` for your session; I think a value in the range of 120MB-150MB should work:

    SET work_mem TO '150MB';

Check `EXPLAIN` output after the change.
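
If you'd rather not leave the setting in place for the rest of the session,
`SET LOCAL` scopes the change to a single transaction:

    BEGIN;
    SET LOCAL work_mem TO '150MB';  -- reverts at COMMIT/ROLLBACK
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT co_id, client_id, doc_no, line_id, batch_no, sum(amount)
    FROM ddetail
    GROUP BY co_id, client_id, doc_no, line_id, batch_no;
    COMMIT;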

--
Victor Y. Yegorov

Re: Performance issue: index not used on GROUP BY...

From
Marti Raudsepp
Date:
On Thu, Aug 28, 2014 at 11:50 AM, gmb <gmbouwer@gmail.com> wrote:
> Can somebody please confirm whether queries using aggregate functions and
> GROUP BY should use indexes?

Yes, if the planner deems it faster than other approaches. It can make
wrong choices for many reasons, usually because planner tunables like
random_page_cost and effective_cache_size aren't set appropriately.

There's some advice here:
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Just for the purpose of testing, you could try "set enable_sort=false"
in your session and see if that makes it faster.
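
For example:

    SET enable_sort = false;  -- testing only
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT co_id, client_id, doc_no, line_id, sum(amount)
    FROM ddetail
    GROUP BY co_id, client_id, doc_no, line_id;
    RESET enable_sort;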

On Thu, Aug 28, 2014 at 12:08 PM, gmb <gmbouwer@gmail.com> wrote:
>         Sort Key: co_id, client_id, doc_no,

Something went missing from this line...

>         Sort Method: external merge  Disk: 80304kB

Depending on your hardware and workload, more work_mem may also help
queries avoid spilling sorts and hashes to disk. But beware: setting it
too high may result in your server running out of memory.

Regards,
Marti


Re: Performance issue: index not used on GROUP BY...

From
gmb
Date:
Thanks for these suggestions.

Unfortunately, I don't have a lot of memory available (65 connections,
work_mem = 64MB in pg conf).

>> I think an index will be of no help here, as (1) you're reading the whole
>> table anyway and (2) `amount` is not part of your index.

I did not think that the field being used in the aggregate function should
also be part of the index.
I'll try this and check the result.

My problem is that dropping / adding indexes on this table takes a LOT of
time, so I'm stuck with doing the tests using the indexes as is, or doing
the tests on a smaller dataset.

On the smaller dataset (1.5 million records in that table) the planner did
not take the index into account, even when I omitted the amount column:


CREATE INDEX ix_1
  ON ddetail
  USING btree
  (co_id, client_id, doc_no, line_id, batch_no);

SELECT co_id, client_id, doc_no, line_id, batch_no
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

HashAggregate  (cost=54695.74..56064.49 rows=136875 width=22)
  ->  Seq Scan on ddetail  (cost=0.00..37586.44 rows=1368744 width=22)

It still does a seq scan instead of an index scan.
I guess it is possible that at 1.4 million records a seq scan is simply
faster?
So I guess I'll have to try this on the 10 million row table and check the
result there.
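
Maybe I can at least force the index on the small dataset for comparison,
something along these lines:

    SET enable_seqscan = false;  -- testing only
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT co_id, client_id, doc_no, line_id, batch_no
    FROM ddetail
    GROUP BY co_id, client_id, doc_no, line_id, batch_no;
    RESET enable_seqscan;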




Re: Performance issue: index not used on GROUP BY...

From
Victor Yegorov
Date:
2014-08-28 14:29 GMT+03:00 gmb <gmbouwer@gmail.com>:
Unfortunately, I don't have a lot of memory available (65 connections,
work_mem = 64MB in pg conf).

You don't have to change the cluster-wide setting here.

You can issue a `SET` command from your client right before running the query; only your session will be affected.
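
For example:

    SET work_mem TO '150MB';  -- affects only this session
    SELECT co_id, client_id, doc_no, line_id, batch_no, sum(amount)
    FROM ddetail
    GROUP BY co_id, client_id, doc_no, line_id, batch_no;
    RESET work_mem;  -- back to the configured default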


--
Victor Y. Yegorov

Re: Performance issue: index not used on GROUP BY...

From
Josh Berkus
Date:
On 08/28/2014 01:50 AM, gmb wrote:
> Can somebody please confirm whether queries using aggregate functions and
> GROUP BY should use indexes?

Sometimes.  In your case, the index has one more column than the GROUP
BY, which makes it less likely that Postgres will use it (since
depending on the cardinality of ddet_id, it might actually be slower to
use the index).
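
For example, an index on just the grouping columns would be a more likely
candidate (an untested sketch, name illustrative):

    CREATE INDEX ix_ddetail_group
        ON ddetail (co_id, client_id, doc_no, line_id);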

In addition, other folks on this thread have already pointed out the
memory settings issues to you.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Performance issue: index not used on GROUP BY...

From
gmb
Date:
Thanks for the feedback, everybody.
I spent a couple of days trying to optimise this.
As mentioned, increased memory is not an option for me, as this query is
part of a report that can be run by any user on an ad hoc basis.
Allocating the required memory to any session on demand is not feasible in
this environment.

In the end, it seems to me that a more sustainable solution will be to
introduce an additional table carrying the summarised values, and to have
this type of query read from that table instead.
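
A rough sketch of what I have in mind (names are illustrative):

    -- pre-aggregated summary table, to be maintained by the application
    CREATE TABLE ddetail_summary AS
    SELECT co_id, client_id, doc_no, line_id, batch_no,
           sum(amount) AS amount
    FROM ddetail
    GROUP BY co_id, client_id, doc_no, line_id, batch_no;

    -- the report then reads the summary instead of aggregating 10+ million rows
    SELECT co_id, client_id, doc_no, line_id, batch_no, amount
    FROM ddetail_summary;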

Regards


