Thread: Query Optimization

Query Optimization

From: James G Wilkinson
All,

I hope that this is the right place to post.  I am relatively new to
PostgreSQL (less than a year of working with it) and am just starting
to delve into the issues of query optimization.  I have hunted around
the web for the basics of query optimization, but I have not had
much success interpreting the documents.  I have also been trying to
learn the basics of the EXPLAIN command, also without much success,
but I will keep trying.

Anyway, here is what the system reports on the following command:

EXPLAIN SELECT a.country_code, a.state_county_fips,
        icell, jcell, a.beld3_species_id, pollutant_code,
        SUM(b.ratio * d.emissions_factor * a.percent_ag *
        e.ag_fraction * 10000) as normalized_emissions
FROM "globals"."biogenic_beld3_data" a,
     "spatial"."tmpgrid" b,
     "globals"."biogenic_emissions_factors" d,
     "globals"."biogenic_beld3_ag_data" e
WHERE a.beld3_icell=b.b_icell AND
      a.beld3_jcell=b.b_jcell AND
      a.country_code=e.country_code AND
      a.state_county_fips=e.state_county_fips AND
      a.beld3_species_id=d.beld3_species_id AND
      a.ag_forest_records > 0 AND
      a.percent_ag > 0 AND d.emissions_factor > 0
GROUP BY a.country_code, a.state_county_fips, icell, jcell,
      a.beld3_species_id, pollutant_code
ORDER BY a.country_code, a.state_county_fips, icell, jcell,
      a.beld3_species_id, pollutant_code;

                                  QUERY PLAN
-----------------------------------------------------------------------------------------------
 GroupAggregate  (cost=65034.94..71110.50 rows=151889 width=73)
   ->  Sort  (cost=65034.94..65414.66 rows=151889 width=73)
         Sort Key: a.country_code, a.state_county_fips, b.icell, b.jcell,
                   a.beld3_species_id, d.pollutant_code
         ->  Hash Join  (cost=33749.64..37412.88 rows=151889 width=73)
               Hash Cond: ("outer".beld3_species_id = "inner".beld3_species_id)
               ->  Merge Join  (cost=33728.84..35303.61 rows=37972 width=56)
                     Merge Cond: ((("outer".country_code)::text = "inner"."?column8?") AND
                                  (("outer".state_county_fips)::text = "inner"."?column9?"))
                     ->  Index Scan using biogenic_beld3_ag_data_pk on biogenic_beld3_ag_data e
                           (cost=0.00..806.68 rows=20701 width=26)
                     ->  Sort  (cost=33728.84..33741.67 rows=5131 width=45)
                           Sort Key: (a.country_code)::text, (a.state_county_fips)::text
                           ->  Nested Loop  (cost=0.00..33412.65 rows=5131 width=45)
                                 ->  Seq Scan on biogenic_beld3_data a  (cost=0.00..3593.02 rows=5637 width=37)
                                       Filter: ((ag_forest_records > 0) AND (percent_ag > 0::numeric))
                                 ->  Index Scan using tmpgrid_pk on tmpgrid b  (cost=0.00..5.27 rows=1 width=24)
                                       Index Cond: ((b.b_icell = "outer".beld3_icell) AND
                                                    (b.b_jcell = "outer".beld3_jcell))
               ->  Hash  (cost=18.50..18.50 rows=920 width=21)
                     ->  Seq Scan on biogenic_emissions_factors d  (cost=0.00..18.50 rows=920 width=21)
                           Filter: (emissions_factor > 0::numeric)
(18 rows)


Firstly, I am frankly mystified about how to interpret all this.  If anyone
could point me to a document or two that would help me decipher it,
I would greatly appreciate it.

Secondly, I have figured out that SEQ SCANs are typically bad.  I am
concerned that a SEQ SCAN is being performed on 'biogenic_beld3_data'
which is the largest table in the query.  I would rather have a SEQ SCAN
be performed on 'tmpgrid' which contains the keys that subset the data
from 'biogenic_beld3_data.'  Is this naive on my part?

Thirdly, I have run EXPLAIN on other queries that report back a
GroupAggregate Cost=<low 300,000s> and that run in about 30 minutes
on my relatively high-end Linux machine.  But when I run this particular
query, it takes on the order of 90 minutes to complete.  Any thoughts
on why this happens would be appreciated.

Finally, if anyone can be so kind as to provide insight on how to better
optimize this query, I will, again, be deeply grateful.

Thanks in advance.

terrakit

Re: Query Optimization

From: John A Meinel
James G Wilkinson wrote:

> All,
>
...

> Firstly, I  am frankly mystified on how to interpret all this.  If anyone
> could point me to a document or two that will help me decipher this,
> I will greatly appreciate it.
>
I assume you have looked at:
http://www.postgresql.org/docs/8.0/static/performance-tips.html
and didn't find it helpful enough. I'm not really sure what help you are
asking for. Are you saying that this query is performing slowly and you
want to speed it up, or do you just want to understand how to interpret
the output of EXPLAIN?

> Secondly, I have figured out that SEQ SCANs are typically bad.  I am
> concerned that a SEQ SCAN is being performed on 'biogenic_beld3_data'
> which is the largest table in the query.  I would rather have a SEQ SCAN
> be performed on 'tmpgrid' which contains the keys that subset the data
> from 'biogenic_beld3_data.'  Is this naive on my part?

It depends how much data is being extracted. If you have 1,000,000 rows
and only need 10, then an index scan is wonderful. If you need 999,999,
then a sequential scan is much better (the break-even point is under 10%
of the table). From the EXPLAIN output, the planner thinks it is going
to need 5,637 rows from biogenic_beld3_data. What fraction is that of
the table's total?
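
A quick way to check that fraction directly (table and filter taken
straight from your query) is:

    -- total rows in the table
    SELECT count(*) FROM "globals"."biogenic_beld3_data";

    -- rows that survive the filter the planner is estimating
    SELECT count(*) FROM "globals"."biogenic_beld3_data"
    WHERE ag_forest_records > 0 AND percent_ag > 0;

If the second count is more than a few percent of the first, the
sequential scan is probably the right plan after all.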

The values at least look like you've run VACUUM ANALYZE. Have you tried
running EXPLAIN ANALYZE instead of just EXPLAIN? Then you can see
whether the planner's estimates are accurate.
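
For example (using a cut-down version of your join just to illustrate;
substitute your full query):

    EXPLAIN ANALYZE
    SELECT count(*)
    FROM "globals"."biogenic_beld3_data" a,
         "spatial"."tmpgrid" b
    WHERE a.beld3_icell = b.b_icell AND
          a.beld3_jcell = b.b_jcell;

EXPLAIN ANALYZE actually executes the query, so each node in the output
shows "actual time=... rows=..." next to the estimated "cost=... rows=...",
and you can spot the node where the estimate goes wrong.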

If you want some help to force it, you could try rewriting the join as a
subselect. Something like:

select * from biogenic_beld3_data b
where (b.beld3_icell, b.beld3_jcell) in
      (select b_icell, b_jcell from tmpgrid);

(Note that the subselect has to go against the table, tmpgrid, not the
index tmpgrid_pk, and since it returns many rows you need IN rather
than =.)
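
Another way to test whether the planner's choice is actually hurting you
(purely as a diagnostic; don't leave it set in production) is to disable
sequential scans for one session and compare timings:

    SET enable_seqscan = off;
    EXPLAIN ANALYZE SELECT ... ;  -- your full query here
    RESET enable_seqscan;

If the forced index-scan plan is not noticeably faster, the seq scan was
the right call after all.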

>
> Thirdly, I have run EXPLAIN on other queries that report back a
> GroupAggregate Cost=<low 300,000s> that runs in about 30 minutes
> on my relatively highend linux machine.  But when I run this particular
> query, it takes on the order of 90 minutes to complete.  Any thoughts
> on why this happens will be appreciated.
>
Remember that cost is measured in page fetches, not in seconds.
Probably it is just an issue of postgres mis-estimating the selectivity
of one of your predicates.
Also, you have a fairly complex SUM involving four multiplications over
an estimated 150,000 rows. While that doesn't seem like it should take
90 minutes, it isn't a trivial operation either.
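
If it does turn out that a selectivity estimate is badly off, one knob
to try (the column here is just the likely suspect from your filter) is
raising the statistics target for that column and re-analyzing:

    ALTER TABLE "globals"."biogenic_beld3_data"
      ALTER COLUMN percent_ag SET STATISTICS 500;
    ANALYZE "globals"."biogenic_beld3_data";

That makes ANALYZE sample the column more thoroughly, which often
tightens the row estimates the planner works from.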

> Finally, if anyone can be so kind as to provide insight on how to better
> optimize this query, I will, again, be deeply grateful.
>
> Thanks in advance.
>
> terrakit
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

John
=:->

