Slow performance with Group By - Mailing list pgsql-performance

From Erik Norvelle
Subject Slow performance with Group By
Msg-id B14B76F4-31E3-11D9-8120-000A9583BF06@norvelle.org
List pgsql-performance
Greetings all,

This question has probably been asked many times, but I was unable to
use the list archives to search, since the term "Group" matches
thousands of messages with the names of user groups in them... so
sorry if I'm repeating!

Here's the problem:  I have a table of 10,000,000 records called
"indethom", each record representing a word in the works of a
particular author.  Each record contains, among other columns, a
CHAR(5) column called "codelemm" holding the "lemma" code (i.e.,
identifying which word it is), and an integer column called "sectref"
representing a textual unit, i.e. a chapter or other division of a
work (these are numbered consecutively from 0 to around 50,000).  What
I want to do is find out how many times each word occurs in each
textual unit (with no row returned for textual units where a
particular word doesn't appear).  I used a GROUP BY clause to group by
"sectref", and then used the COUNT(codelemm) function to count up the
occurrences.  The codelemm column also had to be grouped on, in order
to satisfy Postgres's requirements.  Here's the query as I have it:

 > create table matrix2.tuo as select codelemm, sectref, count(codelemm)
from indethom group by codelemm, sectref;
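
For reference, the two relevant columns look roughly like this (just a
sketch; the real table has other columns I haven't shown):

 > create table indethom (
 >     codelemm char(5),  -- lemma code: which word it is
 >     sectref  integer   -- textual unit, numbered 0 to around 50,000
 >     -- (other columns omitted)
 > );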

And the explain results are as follows:

 >it=> explain select codelemm, sectref, count(codelemm) from indethom group by codelemm, sectref;
 >                                  QUERY PLAN
 >--------------------------------------------------------------------------------
 > GroupAggregate  (cost=2339900.60..2444149.44 rows=1790528 width=13)
 >   ->  Sort  (cost=2339900.60..2364843.73 rows=9977252 width=13)
 >         Sort Key: codelemm, sectref
 >         ->  Seq Scan on indethom  (cost=0.00..455264.52 rows=9977252 width=13)
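
One thing I haven't tried yet is giving the sort more memory for the
session, on the theory that a 10-million-row sort is spilling to disk.
Something like this, perhaps (assuming the 7.4-style sort_mem setting,
which takes kilobytes and which I gather is called work_mem in later
releases; the value here is just a guess):

 > set sort_mem = 131072;  -- 128 MB for this session only; illustrative value
 > create table matrix2.tuo as select codelemm, sectref, count(codelemm)
 > from indethom group by codelemm, sectref;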

I have an index defined as follows:

 > create index indethom_clemm_sect_ndx on indethom using
btree(codelemm, sectref);

I also performed an ANALYZE after creating the index.

I have the gut feeling that there's got to be a better way than a
sequential scan over 10,000,000 records, but I'll be darned if I can
find any way to improve things here.
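
As a diagnostic, I suppose I could force the planner off the sequential
scan for one session, just to see what it thinks an index-based plan
would cost (I haven't actually run this yet):

 > set enable_seqscan = off;
 > explain select codelemm, sectref, count(codelemm)
 > from indethom group by codelemm, sectref;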

Thanks for any help you all can offer!!

Erik Norvelle

