Thread: query optimization help

query optimization help

From
sarlav kumar
Date:
Hi All,
 
I have the following query to generate a report grouped by "states".
 
SELECT distinct upper(cd.state) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d left JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND lower(cd.country) = 'us' AND date_part('year',d.time)= 2004 GROUP BY myst
ate ORDER BY mystate;

 mystate | total_amount | total_fee
---------+--------------+-----------
         |         3695 |         0
 AR      |         3000 |         0
 AZ      |         1399 |         0
 CA      |       113100 |      6242
 FL      |       121191 |      9796
 GA      |     34826876 |    478888
 GEORGIA |        57990 |   &nbs p;  3500
 IEIE    |       114000 |      4849
 MD      |        20000 |      1158
 MI      |       906447 |         0
 NY      |         8000 |       600
 PA      |         6200 |       375
 SC      |        25000 |       600
 TN      |      1443681 |      1124
         |        13300 |         0
(15 rows)
If you notice, my problem in this query is that the records for GA, GEORGIA appear separately. But what I want to do is  to have them combined to a single entry with their values summed up . Initially we had accepted both formats as input for the state field. Also, there are some invalid entries for the state field (like the "IEIE" and null values), which appear because the input for state was not validated initially. These entries have to be eliminated from the report.This query did not take a long time to complete, but did not meet the needs for the report.
 
So, the query was rewritten to the following query which takes nearly 7-8 mins to complete on our test database:
 
SELECT (SELECT DISTINCT pc.state FROM postalcode pc WHERE UPPER(cd.state) IN (pc.state, pc.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND lower(cd.country) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;
    mystate     | total_amount | total_fee
----------------+--------------+-----------
 ARIZONA        |         1399 |         0
 ARKANSAS       |         3000 |         0
 CALIFORNIA     |       113100 |      6242
 FLORIDA        |       121191 |      9796
 GEORGIA        |     34884866 |    482388
 MARYLAND       |        20000 |      1158
 MICHIGAN       |       906447 |         0
 NEW YORK       |         8000 |       600
 PENNSYLVANIA   |         6200 |       375
 SOUTH CAROLINA |        25000 |       600
 TENNESSEE      |      1443681 |      1124
                |       130995 |      4849
 
Here is the explain analyze of this query:
  QUERY PLAN                                                                                                     
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1226.57..1226.58 rows=1 width=38) (actual time=362355.58..362372.09 rows=12 loops=1)
   ->  Group  (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.54..362367.73 rows=2197 loops=1)
         ->  Sort  (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.53..362356.96 rows=2197 loops=1)
               Sort Key: (subplan)
               ->  Nested Loop  (cost=0.00..1226.56 rows=1 width=38) (actual time=166.11..362321.46 rows=2197 loops=1)
                     ->  Nested Loop  (cost=0.00..1220.53 rows=1 width=26) (actual time=1.68..361.32 rows=2115 loops=1)
                           ->  Seq Scan on customerdata cd  (cost=0.00..274.32 ro ws=31 width=10) (actual time=0.04..29.87 rows=3303 loops=1)
                                 Filter: (lower((country)::text) = 'us'::text)
                           ->  Index Scan using data_uid_idx on data d  (cost=0.00..30.08 rows=1 width=16) (actual time=0.04..0.09 rows=1 loops=3303)
                                 Index Cond: (d.uid = "outer".uid)
                                 Filter: (((what = 26) OR (what = 0) OR (what = 15)) AND ((flags = 1) OR (flags = 9) OR (flags = 10) OR (flags = 12)) AND (date_part('year'::text, "time") = 2004::double precision))
                     ->  Index Scan using merchant_purchase_data_idx on merchant_purchase mp  (cost=0.00..6.01 rows=1 width=12) (actual time=0.05..0.05 rows=1 loops=2115)
                           Index Cond: ("outer".id = mp.data_id)
                     SubPlan
                       ->  Unique  (cost=2237.12..2243.22 rows=122 width=13) (actual time=161.25..164.68 rows=1 loops=2197)
                             ->  Sort  (cost=2237.12..2240.17 rows=1220 width=13) (actual time=161.21..161.88 rows=1033 loops=2197)
                                   Sort Key: state
                                   ->  Seq Scan on postalcode pc  (cost=0.00..2174.56 rows=1220 width=13) (actual time=35.79..148.33 rows=1033 loops=2197)
                                         Filter: ((upper(($0)::text) = (state)::text) OR (upper(($0)::text) = (state_code)::text))
 Total runtime: 362372.57 msec
 
 
The postalcode table is used in the query to validate the states and to combine the entries like GA and GEORGIA.
\d postalcode
                                    Table "public.postalcode"
   Column   |         Type          |                         Modifiers                         
------------+-----------------------+------------------------------------------------------------
 id         | integer               | not null default nextval('public.postalcode_id_seq'::text)
 country    | character(2)          |
 state      | character varying(30) |
 zipcode    | character varying(20) |
 city       | character varying(50) |
 city_alias | character varying(20) |
 state_code | character varying(2)  |
Indexes: postalcode_country_key unique btree (country, state_code, zipcode),
         postalcode_state_code_idx btree (state_code),
         postalcode_state_idx btree (state)
 
The postalcode table has 70328 rows!
 
Can some one please help me optimize this query?
 
Thanks,
Saranya


Do you Yahoo!?
Yahoo! Mail - You care about security. So do we.

Re: query optimization help

From
"Merlin Moncure"
Date:
Please post in plaintext, not html where possible.
Your group by clause was 'myst'...was that supposed to be mystate?

Her is something you might try...use the original query form and create a function which resolves the state code from
theinput data...you are already doing that with upper. 

So,

create function get_state_code(text) returns char(2) as
$$
    select case when len($1) = 2
        then upper($1)
        else lookup_state_code($1)
        end;
$$
language sql stable;

lookup_state_code is a similar function which is boils down to a select from a lookup table.  Or, you could make a
giantcast statement (when GEORGIA then GA, etc).  and now your function becomes IMMUTABLE and should execute very fast.
Just make sure all the states are spelled correct in the original table via domain constraint. 

Merlin



-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of sarlav kumar
Sent: Friday, January 14, 2005 9:40 AM
To: pgsqlnovice; pgsqlperform
Subject: [PERFORM] query optimization help

Hi All,
 
I have the following query to generate a report grouped by "states".
 
SELECT distinct upper(cd.state) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) +
COALESCE(mp.seller_fee,0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d left JOIN customerdata cd ON d.uid =
cd.uidLEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND
lower(cd.country)= 'us' AND date_part('year',d.time)= 2004 GROUP BY myst 
ate ORDER BY mystate;


Re: query optimization help

From
sarlav kumar
Date:
Hi,
 
Thanks for the help. I actually got around with it by doing the following.
I created a temporary table:
 
create table statesnew as select distinct state,state_code from postalcode where lower(country)='us';
 
And then changed the query to :
 
SELECT (SELECT sn.state FROM statesnew sn WHERE UPPER(cd.state) IN (sn.state, sn.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND lower(cd.country
) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;
 
This worked well, as it reduced the number of entries it had to search from.
 
I am not sure how to use the function you have written. Can you give me pointers on that?
 
Thanks,
Saranya
 


Merlin Moncure <merlin.moncure@rcsonline.com> wrote:

Please post in plaintext, not html where possible.
Your group by clause was 'myst'...was that supposed to be mystate?

Yes, It is mystate. It continues on the next line:)


Her is something you might try...use the original query form and create a function which resolves the state code from the input data...you are already doing that with upper.

So,

create function get_state_code(text) returns char(2) as
$$
select case when len($1) = 2
then upper($1)
else lookup_state_code($1)
end;
$$
language sql stable;

lookup_state_code is a similar function which is boils down to a select from a lookup table. Or, you could make a giant cast statement (when GEORGIA then GA, etc). and now your function becomes IMMUTABLE and should execute very fast. Just make sure all the states are spelled correct in the original table via domain constraint.

Merlin



__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com