Query Optimization - Mailing list pgsql-performance

From James G Wilkinson
Subject Query Optimization
Date
Msg-id 422E3D81.3000407@alpinegeophysics.com
Whole thread Raw
Responses Re: Query Optimization
List pgsql-performance
All,

I hope that this is the right place to post.  I am relatively new to
PostgreSQL (i.e., < 1 year in coding) and am just starting to
delve into the issues of query optimization.  I have hunted around
the web for the basics of query optimization, but I have not had
much success in interpreting the documents.  I have also been
trying to learn the basics of the EXPLAIN command....also without
much success, but I will keep trying.

Anyway, here is what the system reports on the following command:

EXPLAIN SELECT a.country_code, a.state_county_fips,
        icell, jcell, a.beld3_species_id, pollutant_code,
        SUM(b.ratio * d.emissions_factor * a.percent_ag *
        e.ag_fraction * 10000) as normalized_emissions
FROM "globals"."biogenic_beld3_data" a,
     "spatial"."tmpgrid" b,
     "globals"."biogenic_emissions_factors" d,
     "globals"."biogenic_beld3_ag_data" e
WHERE a.beld3_icell=b.b_icell AND
      a.beld3_jcell=b.b_jcell AND
      a.country_code=e.country_code AND
      a.state_county_fips=e.state_county_fips AND
      a.beld3_species_id=d.beld3_species_id AND
      a.ag_forest_records > 0 AND
      a.percent_ag > 0 AND d.emissions_factor > 0
GROUP BY a.country_code, a.state_county_fips, icell, jcell,
      a.beld3_species_id, pollutant_code
ORDER BY a.country_code, a.state_county_fips, icell, jcell,
      a.beld3_species_id, pollutant_code;

                                  QUERY
PLAN
-----------------------------------------------------------------------------------------------
GroupAggregate  (cost=65034.94..71110.50 rows=151889 width=73)
->Sort  (cost=65034.94..65414.66 rows=151889 width=73)
    Sort Key: a.country_code, a.state_county_fips, b.icell, b.jcell,
              a.beld3_species_id, d.pollutant_code
    ->Hash Join  (cost=33749.64..37412.88 rows=151889 width=73)
        Hash Cond: ("outer".beld3_species_id = "inner".beld3_species_id)
        ->Merge Join  (cost=33728.84..35303.61 rows=37972 width=56)
            Merge Cond: ((("outer".country_code)::text =
"inner"."?column8?") AND
                         (("outer".state_county_fips)::text =
"inner"."?column9?"))
            ->Index Scan using biogenic_beld3_ag_data_pk on
biogenic_beld3_ag_data e
                    (cost=0.00..806.68 rows=20701 width=26)
            ->Sort  (cost=33728.84..33741.67 rows=5131 width=45)
                Sort Key: (a.country_code)::text,
(a.state_county_fips)::text
                ->Nested Loop  (cost=0.00..33412.65 rows=5131 width=45)
                    ->Seq Scan on biogenic_beld3_data a
(cost=0.00..3593.02 rows=5637 width=37)
                        Filter: ((ag_forest_records > 0) AND (percent_ag
 > 0::numeric))
                    ->Index Scan using tmpgrid_pk on tmpgrid b
(cost=0.00..5.27 rows=1 width=24)
                        Index Cond: ((b.b_icell = "outer".beld3_icell) AND
                                     (b.b_jcell = "outer".beld3_jcell))
        ->Hash (cost=18.50..18.50 rows=920 width=21)
            ->Seq Scan on biogenic_emissions_factors d
(cost=0.00..18.50 rows=920 width=21)
                Filter: (emissions_factor > 0::numeric)
(18 rows)


Firstly, I  am frankly mystified on how to interpret all this.  If anyone
could point me to a document or two that will help me decipher this,
I will greatly appreciate it.

Secondly, I have figured out that SEQ SCANs are typically bad.  I am
concerned that a SEQ SCAN is being performed on 'biogenic_beld3_data'
which is the largest table in the query.  I would rather have a SEQ SCAN
be performed on 'tmpgrid' which contains the keys that subset the data
from 'biogenic_beld3_data.'  Is this naive on my part?

Thirdly, I have run EXPLAIN on other queries that report back a
GroupAggregate Cost=<low 300,000s> that runs in about 30 minutes
on my relatively highend linux machine.  But when I run this particular
query, it takes on the order of 90 minutes to complete.  Any thoughts
on why this happens will be appreciated.

Finally, if anyone can be so kind as to provide insight on how to better
optimize this query, I will, again, be deeply grateful.

Thanks in advance.

terrakit

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: bad plan
Next
From: Josh Berkus
Date:
Subject: Re: Why would writes to pgsql_tmp bottleneck at 1mb/s?