Re: slow group by query - Mailing list pgsql-sql

From Tom Lane
Subject Re: slow group by query
Date
Msg-id 21604.1037724848@sss.pgh.pa.us
Whole thread Raw
In response to Re: slow group by query  (Ellen Cyran <ellen@urban.csuohio.edu>)
Responses Re: slow group by query  (Ellen Cyran <ellen@urban.csuohio.edu>)
List pgsql-sql
Ellen Cyran <ellen@urban.csuohio.edu> writes:
> Here is the explain analyze:

> Group  (cost=637.18..696.52 rows=593 width=22) (actual time=982.67..67581.85 rows=435 loops=1)
>    ->  Sort  (cost=637.18..637.18 rows=5934 width=22) (actual time=833.27..844.78 rows=6571 loops=1)

Well, we don't have to read any further than that to see that all the
time is being spent in the final Group step --- and since grouping 6500
rows is surely not taking very long, the true cost must be in the
evaluation of the SELECT's output targetlist (which the estimated costs
ignore, since the planner can't do much to change it).  In other words,
what's hurting you are those subselects in the SELECT list.  You're
doing two thousand separate subselects (435*5) --- evidently they take
about 30 msec apiece, which isn't that bad by itself, but it adds up.

What you've basically got here is what Joe Celko calls a "crosstab
by subqueries" (_SQL For Smarties_, section 23.6.3).  You might want
to buy his book, which shows several other ways to do crosstab queries,
but the one that seems most directly useful is to GROUP BY and use
aggregate functions to extract the values you want in each crosstab
column:

SELECT msa_code, sic, own, ind_div,
max(case when year = '1975' then emp else null end) as emp1975,
max(case when year = '1976' then emp else null end) as emp1976,
max(case when year = '1977' then emp else null end) as emp1977,
max(case when year = '1978' then emp else null end) as emp1978,
max(case when year = '1990' then emp else null end) as emp1990
FROM tbl_bls_msa
WHERE msa_code in ('1680','1640','0040','0120','0080')
GROUP BY msa_code, sic, ind_div, own ;

If I understand your data schema correctly, only one row in a
msa_code, sic, ind_div, own group will have a particular year
value, so the case/max structure extracts that value, or gives
NULL if there's no such row.  (MIN would have worked as well;
with a numeric field you could use SUM too.)
        regards, tom lane


pgsql-sql by date:

Previous
From: Ellen Cyran
Date:
Subject: Re: slow group by query
Next
From: Luis Sousa
Date:
Subject: Re: Problems invoking psql. Help please.