Thread: Performance issue: index not used on GROUP BY...
Hi all

I have the following table with 10+ million records:

create table ddetail (
  ddet_id serial,
  co_id integer,
  client_id integer,
  doc_no varchar,
  line_id integer,
  batch_no integer,
  amount NUMERIC,
  ...,
  PRIMARY KEY ( co_id, client_id, doc_no, line_id, ddet_id )
);

When doing the following query on this table, performance is really slow:

SELECT co_id, client_id, doc_no, line_id, sum( amount )
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id;

It seems as if the planner is not using the PRIMARY KEY as an index, which was
my assumption.
Can somebody please confirm whether aggregate functions such as GROUP BY
should use indexes?

Thanks in advance
gmb
2014-08-28 11:50 GMT+03:00 gmb <gmbouwer@gmail.com>:
> It seems as if the planner is not using the PRIMARY KEY as an index, which was
> my assumption.
Can you send `EXPLAIN (analyze, buffers)` for your query instead?
It'll show exactly what's going on.
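In other words, the query run under EXPLAIN with both options turned on, along these lines:

-- ANALYZE actually runs the query and reports real timings;
-- BUFFERS adds shared/temp block counts, which show how much hits disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT co_id, client_id, doc_no, line_id, sum( amount )
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id;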
--
Victor Y. Yegorov
> Can you send `EXPLAIN (analyze, buffers)` for your query instead?
> It'll show exactly what's going on.

GroupAggregate  (cost=303425.31..339014.43 rows=136882 width=48) (actual time=4708.181..6688.699 rows=287268 loops=1)
  Buffers: shared read=23899, temp read=30974 written=30974
  ->  Sort  (cost=303425.31..306847.34 rows=1368812 width=48) (actual time=4708.170..5319.429 rows=1368744 loops=1)
        Sort Key: co_id, client_id, doc_no,
        Sort Method: external merge  Disk: 80304kB
        Buffers: shared read=23899, temp read=30974 written=30974
        ->  Seq Scan on ddetail  (cost=0.00..37587.12 rows=1368812 width=48) (actual time=0.122..492.964 rows=1368744 loops=1)
              Buffers: shared read=23899
Total runtime: 6708.244 ms

My initial attempt was this (this is what I actually need):

SELECT co_id, client_id, doc_no, line_id, batch_no, sum( amount )
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

but I removed column batch_no from the query because I thought this was the cause of the problem (batch_no is not part of my PK).

Thanks
2014-08-28 12:08 GMT+03:00 gmb <gmbouwer@gmail.com>:
> GroupAggregate  (cost=303425.31..339014.43 rows=136882 width=48) (actual time=4708.181..6688.699 rows=287268 loops=1)
>   Buffers: shared read=23899, temp read=30974 written=30974
>   ->  Sort  (cost=303425.31..306847.34 rows=1368812 width=48) (actual time=4708.170..5319.429 rows=1368744 loops=1)
>         Sort Key: co_id, client_id, doc_no,
>         Sort Method: external merge  Disk: 80304kB
>         Buffers: shared read=23899, temp read=30974 written=30974
>         ->  Seq Scan on ddetail  (cost=0.00..37587.12 rows=1368812 width=48) (actual time=0.122..492.964 rows=1368744 loops=1)
>               Buffers: shared read=23899
> Total runtime: 6708.244 ms
>
> My initial attempt was this (this is what I actually need):
>
> SELECT co_id, client_id, doc_no, line_id, batch_no, sum( amount )
> FROM ddetail
> GROUP BY co_id, client_id, doc_no, line_id, batch_no;
I think an index will be of no help here, as (1) you're reading the whole table anyway and (2) `amount` is not part of your index.
Try to avoid the disk-based sort by increasing `work_mem` for your session; I think a value in the range of 120MB-150MB should work:
SET work_mem TO '150MB';
Check `EXPLAIN` output after the change.
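Put together, the test is just two statements in the same session (the 150MB figure is only a starting point; the value that actually avoids the disk sort may differ):

-- Session-level change only; the server-wide setting is not touched.
SET work_mem TO '150MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT co_id, client_id, doc_no, line_id, batch_no, sum( amount )
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

-- If the plan now reports "Sort Method: quicksort  Memory: ..." instead of
-- "external merge  Disk: ...", the sort fits in memory.
RESET work_mem;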
--
Victor Y. Yegorov
On Thu, Aug 28, 2014 at 11:50 AM, gmb <gmbouwer@gmail.com> wrote:
> Can somebody please confirm whether aggregate functions such as GROUP BY
> should use indexes?

Yes, if the planner deems it faster than other approaches. It can make wrong choices for many reasons, but usually that happens when planner tunables like random_page_cost and effective_cache_size aren't set appropriately. There's some advice here:

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Just for the purpose of testing, you could try "set enable_sort=false" in your session and see if that makes it faster.

On Thu, Aug 28, 2014 at 12:08 PM, gmb <gmbouwer@gmail.com> wrote:
> Sort Key: co_id, client_id, doc_no,

Something went missing from this line...

> Sort Method: external merge  Disk: 80304kB

Depending on your hardware and workload, more work_mem may also help queries avoid sorts and hashes that need to use disk. But beware: setting it too high may result in your server running out of memory.

Regards,
Marti
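A minimal sketch of that enable_sort test, assuming it is run in the same session as the report query (the setting only penalizes sort-based plans in the planner's cost estimates, it does not forbid them):

-- Testing only: discourage sort-based plans so the planner considers alternatives.
SET enable_sort = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT co_id, client_id, doc_no, line_id, sum( amount )
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id;

-- Put the planner back to normal afterwards.
RESET enable_sort;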
Thanks for these suggestions.

Unfortunately, I don't have a lot of memory available (65 connections, work_mem = 64MB in pg conf).

>> I think an index will be of no help here, as (1) you're reading the whole table
>> anyway and (2) `amount` is not part of your index.

I did not think that the field being used in the agg function should also be part of the index. I'll try this and check the result.

My problem is that dropping / adding indexes on this table takes a LOT of time, so I'm stuck with doing the tests using the indexes as is, or doing the tests on a smaller dataset. On the smaller dataset (1.5 mill records on that table) the planner did not take the index into account, even when I omit the amount column:

CREATE INDEX ix_1 ON ddetail USING btree (co_id, client_id, doc_no, line_id, batch_no);

SELECT co_id, client_id, doc_no, line_id, batch_no
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

HashAggregate  (cost=54695.74..56064.49 rows=136875 width=22)
  ->  Seq Scan on debfdetail  (cost=0.00..37586.44 rows=1368744 width=22)

It still does a seq scan instead of an index scan. I guess it is possible that on the 1.4 million records it is faster to do a seq scan? So I guess I'll have to try and do this on the 10 mill table and check the result there.
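For a test like this, one way to see whether the planner can use ix_1 at all, rather than simply preferring the cheaper seq scan, is to penalize seq scans for the session and compare the estimated costs (a sketch; the setting affects only the current session):

-- Testing only: makes sequential scans look very expensive to the planner.
SET enable_seqscan = off;

EXPLAIN
SELECT co_id, client_id, doc_no, line_id, batch_no
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

-- Compare the estimated cost of the index-based plan (if one appears)
-- with the ~54695 cost of the HashAggregate above, then restore the default.
RESET enable_seqscan;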
2014-08-28 14:29 GMT+03:00 gmb <gmbouwer@gmail.com>:
> Unfortunately, I don't have a lot of memory available (65 connections,
> work_mem = 64MB in pg conf).
You don't have to change cluster-wide settings here.
You can issue a `SET` command from your client right before running your query; only your session will be affected.
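For example (SET LOCAL is a still narrower variant that lasts only until the end of the enclosing transaction):

-- Affects only this session; the 64MB value in postgresql.conf stays in place.
SET work_mem TO '150MB';

-- Or, limited to a single transaction:
BEGIN;
SET LOCAL work_mem TO '150MB';
-- ... run the report query here ...
COMMIT;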
--
Victor Y. Yegorov
On 08/28/2014 01:50 AM, gmb wrote:
> Can somebody please confirm whether aggregate functions such as GROUP BY
> should use indexes?

Sometimes. In your case, the index has one more column than the GROUP BY, which makes it less likely that Postgres will use it (since, depending on the cardinality of ddet_id, it might actually be slower to use the index).

In addition, other folks on this thread have already pointed out the memory settings issues to you.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
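To make the comparison concrete, an index whose columns match the original GROUP BY exactly would look something like this (the index name is hypothetical, and whether the planner prefers it over a seq scan still depends on the cost settings discussed earlier):

-- Hypothetical: leading columns match the GROUP BY of the original query,
-- with no trailing ddet_id column. Building it on 10+ million rows takes time.
CREATE INDEX ix_ddetail_group_cols
    ON ddetail (co_id, client_id, doc_no, line_id);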
Thanks for the feedback, everybody.

I spent a couple of days trying to optimise this. As mentioned, increased memory is not an option for me, as this query is part of a report that can be run by any user on an ad hoc basis; allocating the required memory to any session on demand is not feasible in this environment.

In the end, it seems to me that a more sustainable solution will be to introduce an additional table that carries the summarized values, and to look up against that table in this type of scenario.

Regards
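A minimal sketch of that summary-table approach (all names are hypothetical; how the summary stays in sync with ddetail, whether via triggers, a scheduled refresh, or a materialized view on PostgreSQL 9.3+, is left open):

-- Hypothetical summary table populated from the detail table.
CREATE TABLE ddetail_summary AS
SELECT co_id, client_id, doc_no, line_id, batch_no, sum( amount ) AS amount
FROM ddetail
GROUP BY co_id, client_id, doc_no, line_id, batch_no;

-- Index on the lookup columns so the report can read it quickly.
CREATE INDEX ix_ddetail_summary
    ON ddetail_summary (co_id, client_id, doc_no, line_id, batch_no);

-- The report then queries ddetail_summary instead of aggregating ddetail each time.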