Slow performance with Group By - Mailing list pgsql-performance

From Erik Norvelle
Subject Slow performance with Group By
Msg-id B14B76F4-31E3-11D9-8120-000A9583BF06@norvelle.org
List pgsql-performance
Greetings all,

This question has probably been asked many times, but I was unable to
use the list archives to search, since the term "Group" matches
thousands of messages with the names of user groups in them... so
sorry if I'm repeating!

Here's the problem:  I have a table of 10,000,000 records called
"indethom", each record representing a word in the works of a
particular author.  Each record contains, among other columns, a
CHAR(5) column called "codelemm" holding the "lemma" code (i.e.,
identifying which word it is), and an integer column called "sectref"
representing a textual unit, i.e. a chapter or other division of a
work (these are numbered consecutively from 0 to around 50,000).  What
I want to do is find out how many times each word occurs in each
textual unit (with no row returned for textual units where a
particular word doesn't appear).  I used a GROUP BY clause to group by
"sectref", and then used the COUNT(codelemm) function to count up the
occurrences.  The codelemm column also had to be grouped on, in order
to satisfy Postgres's requirements.  Here's the query as I have it:

 > create table matrix2.tuo as select codelemm, sectref, count(codelemm)
from indethom group by codelemm, sectref;
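
For reference, the two relevant columns look roughly like this (just a
sketch; the real table has other columns I haven't shown):

 > create table indethom (
 >     codelemm char(5),  -- lemma code: which word it is
 >     sectref  integer   -- textual unit, numbered 0 to around 50,000
 >     -- (other columns omitted)
 > );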

And the explain results are as follows:

 >it=> explain select codelemm, sectref, count(codelemm) from indethom group by codelemm, sectref;
 >                                  QUERY PLAN
 >--------------------------------------------------------------------------------
 > GroupAggregate  (cost=2339900.60..2444149.44 rows=1790528 width=13)
 >   ->  Sort  (cost=2339900.60..2364843.73 rows=9977252 width=13)
 >         Sort Key: codelemm, sectref
 >         ->  Seq Scan on indethom  (cost=0.00..455264.52 rows=9977252 width=13)
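
One thing I haven't tried yet is giving the sort more memory for the
session, on the theory that a 10-million-row sort is spilling to disk.
Something like this, perhaps (assuming the 7.4-style sort_mem setting,
which takes kilobytes and which I gather is called work_mem in later
releases; the value here is just a guess):

 > set sort_mem = 131072;  -- 128 MB for this session only; illustrative value
 > create table matrix2.tuo as select codelemm, sectref, count(codelemm)
 > from indethom group by codelemm, sectref;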

I have an index defined as follows:

 > create index indethom_clemm_sect_ndx on indethom using
btree(codelemm, sectref);

I also performed an ANALYZE after creating the index.

I have the gut feeling that there's got to be a better way than a
sequential scan over 10,000,000 records, but I'll be darned if I can
find any way to improve things here.
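
As a diagnostic, I suppose I could force the planner off the sequential
scan for one session, just to see what it thinks an index-based plan
would cost (I haven't actually run this yet):

 > set enable_seqscan = off;
 > explain select codelemm, sectref, count(codelemm)
 > from indethom group by codelemm, sectref;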

Thanks for any help you all can offer!!

Erik Norvelle

