Slow performance with Group By

From: Erik Norvelle
Date:
Greetings all,

This question has probably been asked many times, but I was unable to
search the list archives, since the term "Group" matches thousands of
messages containing the names of user groups... so apologies if I'm
repeating!

Here's the problem: I have a table of 10,000,000 records called
"indethom", each record representing a word in the works of a
particular author.  Each record contains, among other columns, a
CHAR(5) column called "codelemm" holding the "lemma" code (i.e. which
word it is), and an integer column called "sectref" representing a
textual unit, i.e. a chapter or other division of a work (these are
numbered consecutively from 0 to around 50,000).  What I want to do is
find out how many times each word occurs in each textual unit (with no
row returned for textual units where a particular word doesn't appear).
I used a GROUP BY clause to group by "sectref", and then used
COUNT(codelemm) to sum up the occurrences.  The codelemm column also
had to be grouped on, in order to satisfy Postgres's requirements.
Here's the query as I have it:

 > create table matrix2.tuo as select codelemm, sectref, count(codelemm)
 >   from indethom group by codelemm, sectref;

And the explain results are as follows:

 > it=> explain select codelemm, sectref, count(codelemm) from indethom
 >       group by codelemm, sectref;
 >                                   QUERY PLAN
 > --------------------------------------------------------------------------------
 >  GroupAggregate  (cost=2339900.60..2444149.44 rows=1790528 width=13)
 >    ->  Sort  (cost=2339900.60..2364843.73 rows=9977252 width=13)
 >          Sort Key: codelemm, sectref
 >          ->  Seq Scan on indethom  (cost=0.00..455264.52 rows=9977252 width=13)

I have an index defined as follows:

 > create index indethom_clemm_sect_ndx on indethom
 >   using btree(codelemm, sectref);

I also performed an ANALYZE after creating the index.

I have the gut feeling that there's got to be a better way than a
sequential scan over 10,000,000 records, but I'll be darned if I can
find any way to improve things here.
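One thing I can try, purely as a diagnostic, is to disable sequential
scans for the session and see what the planner thinks an index scan
would cost (enable_seqscan is just a testing knob; I haven't verified
that this actually runs any faster):

 > set enable_seqscan = off;
 > explain select codelemm, sectref, count(codelemm)
 >   from indethom group by codelemm, sectref;
 > set enable_seqscan = on;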

Thanks for any help you all can offer!!

Erik Norvelle


Re: Slow performance with Group By

From: Tom Lane
Date:
Erik Norvelle <signups@norvelle.org> writes:
> it=> explain select codelemm, sectref, count(codelemm) from indethom
>       group by codelemm, sectref;
>                                   QUERY PLAN
> --------------------------------------------------------------------------------
>  GroupAggregate  (cost=2339900.60..2444149.44 rows=1790528 width=13)
>    ->  Sort  (cost=2339900.60..2364843.73 rows=9977252 width=13)
>          Sort Key: codelemm, sectref
>          ->  Seq Scan on indethom  (cost=0.00..455264.52 rows=9977252 width=13)

Actually the painful part of that is the sort.  If you bump up sort_mem
enough it will eventually switch over to a HashAggregate with no sort,
which may be a better plan if there aren't too many groups (is the
estimate of 1.79 million on the mark at all??).
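
As an illustrative sketch only (the number here is arbitrary; sort_mem
is measured in kilobytes and applies per sort), something like this in
the session should show whether the plan flips over:

	set sort_mem = 200000;	-- roughly 200MB, for this session only
	explain select codelemm, sectref, count(codelemm)
	  from indethom group by codelemm, sectref;

You can also check the real number of groups directly, e.g.

	select count(*) from
	  (select distinct codelemm, sectref from indethom) as g;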

            regards, tom lane