Thread: query optimization help

query optimization help

From

sarlav kumar

Date:

14 January 2005, 17:39:46

Hi All,

I have the following query to generate a report grouped by "states".

SELECT distinct upper(cd.state) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d left JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND lower(cd.country) = 'us' AND date_part('year',d.time)= 2004 GROUP BY myst
ate ORDER BY mystate;

mystate | total_amount | total_fee
---------+--------------+-----------
         |         3695 |         0
AR      |         3000 |         0
AZ      |         1399 |         0
CA      |       113100 |      6242
FL      |       121191 |      9796
GA      |     34826876 |    478888
GEORGIA |        57990 |   &nbs p; 3500
IEIE    |       114000 |      4849
MD      |        20000 |      1158
MI      |       906447 |         0
NY      |         8000 |       600
PA      |         6200 |       375
SC      |        25000 |       600
TN      |      1443681 |      1124
         |        13300 |         0
(15 rows)

If you notice, my problem in this query is that the records for GA, GEORGIA appear separately. But what I want to do is to have them combined to a single entry with their values summed up . Initially we had accepted both formats as input for the state field. Also, there are some invalid entries for the state field (like the "IEIE" and null values), which appear because the input for state was not validated initially. These entries have to be eliminated from the report.This query did not take a long time to complete, but did not meet the needs for the report.

So, the query was rewritten to the following query which takes nearly 7-8 mins to complete on our test database:

SELECT (SELECT DISTINCT pc.state FROM postalcode pc WHERE UPPER(cd.state) IN (pc.state, pc.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND lower(cd.country) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;
    mystate     | total_amount | total_fee
----------------+--------------+-----------
ARIZONA        |         1399 |         0
ARKANSAS       |         3000 |         0
CALIFORNIA     |       113100 |      6242
FLORIDA        |       121191 |      9796
GEORGIA        |     34884866 |    482388
MARYLAND       |        20000 |      1158
MICHIGAN       |       906447 |         0
NEW YORK       |         8000 |       600
PENNSYLVANIA   |         6200 |       375
SOUTH CAROLINA |        25000 |       600
TENNESSEE      |      1443681 |      1124
                |       130995 |      4849

Here is the explain analyze of this query:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1226.57..1226.58 rows=1 width=38) (actual time=362355.58..362372.09 rows=12 loops=1)
   -> Group (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.54..362367.73 rows=2197 loops=1)
         -> Sort (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.53..362356.96 rows=2197 loops=1)
               Sort Key: (subplan)
               -> Nested Loop (cost=0.00..1226.56 rows=1 width=38) (actual time=166.11..362321.46 rows=2197 loops=1)
                     -> Nested Loop (cost=0.00..1220.53 rows=1 width=26) (actual time=1.68..361.32 rows=2115 loops=1)
                           -> Seq Scan on customerdata cd (cost=0.00..274.32 ro ws=31 width=10) (actual time=0.04..29.87 rows=3303 loops=1)
                                 Filter: (lower((country)::text) = 'us'::text)
                           -> Index Scan using data_uid_idx on data d (cost=0.00..30.08 rows=1 width=16) (actual time=0.04..0.09 rows=1 loops=3303)
                                 Index Cond: (d.uid = "outer".uid)
                                 Filter: (((what = 26) OR (what = 0) OR (what = 15)) AND ((flags = 1) OR (flags = 9) OR (flags = 10) OR (flags = 12)) AND (date_part('year'::text, "time") = 2004::double precision))
                     -> Index Scan using merchant_purchase_data_idx on merchant_purchase mp (cost=0.00..6.01 rows=1 width=12) (actual time=0.05..0.05 rows=1 loops=2115)
                           Index Cond: ("outer".id = mp.data_id)
                     SubPlan
                       -> Unique (cost=2237.12..2243.22 rows=122 width=13) (actual time=161.25..164.68 rows=1 loops=2197)
                             -> Sort (cost=2237.12..2240.17 rows=1220 width=13) (actual time=161.21..161.88 rows=1033 loops=2197)
                                   Sort Key: state
                                   -> Seq Scan on postalcode pc (cost=0.00..2174.56 rows=1220 width=13) (actual time=35.79..148.33 rows=1033 loops=2197)
                                         Filter: ((upper(($0)::text) = (state)::text) OR (upper(($0)::text) = (state_code)::text))
Total runtime: 362372.57 msec

The postalcode table is used in the query to validate the states and to combine the entries like GA and GEORGIA.

The postalcode table has 70328 rows!

Can some one please help me optimize this query?

Thanks,

Saranya

Do you Yahoo!?
Yahoo! Mail - You care about security. So do we.

Re: query optimization help

From

"Merlin Moncure"

Date:

14 January 2005, 18:05:39

Please post in plaintext, not html where possible.
Your group by clause was 'myst'...was that supposed to be mystate?

Her is something you might try...use the original query form and create a function which resolves the state code from
theinput data...you are already doing that with upper. 

So,

create function get_state_code(text) returns char(2) as
$$
    select case when len($1) = 2
        then upper($1)
        else lookup_state_code($1)
        end;
$$
language sql stable;

lookup_state_code is a similar function which is boils down to a select from a lookup table.  Or, you could make a
giantcast statement (when GEORGIA then GA, etc).  and now your function becomes IMMUTABLE and should execute very fast.
Just make sure all the states are spelled correct in the original table via domain constraint. 

Merlin



-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of sarlav kumar
Sent: Friday, January 14, 2005 9:40 AM
To: pgsqlnovice; pgsqlperform
Subject: [PERFORM] query optimization help

Hi All,
 
I have the following query to generate a report grouped by "states".
 
SELECT distinct upper(cd.state) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) +
COALESCE(mp.seller_fee,0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d left JOIN customerdata cd ON d.uid =
cd.uidLEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND
lower(cd.country)= 'us' AND date_part('year',d.time)= 2004 GROUP BY myst 
ate ORDER BY mystate;

Re: query optimization help

From

sarlav kumar

Date:

14 January 2005, 18:27:22

Hi,

Thanks for the help. I actually got around with it by doing the following.

I created a temporary table:

create table statesnew as select distinct state,state_code from postalcode where lower(country)='us';

And then changed the query to :

SELECT (SELECT sn.state FROM statesnew sn WHERE UPPER(cd.state) IN (sn.state, sn.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0, 15) AND d.flags IN (1,9,10,12 ) AND lower(cd.country
) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;

This worked well, as it reduced the number of entries it had to search from.

I am not sure how to use the function you have written. Can you give me pointers on that?

Thanks,

Saranya

Merlin Moncure <merlin.moncure@rcsonline.com> wrote:

Please post in plaintext, not html where possible.
Your group by clause was 'myst'...was that supposed to be mystate?
Yes, It is mystate. It continues on the next line:)

Her is something you might try...use the original query form and create a function which resolves the state code from the input data...you are already doing that with upper.

So,

create function get_state_code(text) returns char(2) as
$$
select case when len($1) = 2
then upper($1)
else lookup_state_code($1)
end;
$$
language sql stable;

lookup_state_code is a similar function which is boils down to a select from a lookup table. Or, you could make a giant cast statement (when GEORGIA then GA, etc). and now your function becomes IMMUTABLE and should execute very fast. Just make sure all the states are spelled correct in the original table via domain constraint.

Merlin

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com