Thread: Huge difference in query performance between 8.3 and 8.4 (possibly)

Huge difference in query performance between 8.3 and 8.4 (possibly)

From

Robin Houston

Date:

09 July 2009, 13:35:31

[ Attempting to resend, because it didn't seem to get through last time. ]

We have a query that runs very slowly on our 8.3 database. (I can't
tell you exactly how slowly, because it has never successfully run to
completion even when we left it running overnight.) On the 8.4
database on my laptop, it runs in about 90 seconds. Of course there
are several differences between the two instances, but I wonder
whether query planning improvements in 8.4 could essentially account
for it. (In case you're wondering, I ran 'vacuum full analyze' on the
slow one, and it made no discernible difference.) The two instances
have the same schema and data.
The query looks like this:
select first.feature_cvterm_id from feature_cvterm first
join feature_cvterm_dbxref first_fvd
on first.feature_cvterm_id = first_fvd.feature_cvterm_id
join dbxref first_withfrom_dbxref
on first_fvd.dbxref_id = first_withfrom_dbxref.dbxref_id
   , cv
   , cvterm evidence_type join cv evidence_type_cv
        on evidence_type.cv_id = evidence_type_cv.cv_id
   , feature_cvtermprop first_evidence
   , feature_cvterm second
left join feature_cvtermprop second_evidence
    on second_evidence.feature_cvterm_id = second.feature_cvterm_id
join feature_cvterm_dbxref second_fvd
on second.feature_cvterm_id = second_fvd.feature_cvterm_id
join dbxref second_withfrom_dbxref
on second_fvd.dbxref_id = second_withfrom_dbxref.dbxref_id
   , cvterm second_term
   , cvterm first_term
   , feature
where first.cvterm_id = first_term.cvterm_id
and first_evidence.feature_cvterm_id = first.feature_cvterm_id
and second_term.cv_id = cv.cv_id
and first_term.cv_id = cv.cv_id
and cv.name in (
      'biological_process'
    , 'molecular_function'
    , 'cellular_component'
)
and second.feature_id = feature.feature_id
and second.feature_id = first.feature_id
and first.cvterm_id = first_term.cvterm_id
and second.cvterm_id = second_term.cvterm_id
and second.pub_id = first.pub_id
and evidence_type.name = 'evidence'
and evidence_type_cv.name = 'genedb_misc'
and second_evidence.type_id = evidence_type.cvterm_id
and first_evidence.type_id = evidence_type.cvterm_id
and second.feature_cvterm_id > first.feature_cvterm_id
and first_withfrom_dbxref.accession = second_withfrom_dbxref.accession
and upper(first_evidence.value) = upper(second_evidence.value)
and first_term.name = second_term.name
;
(There's some fairly obvious room for improvement in this query as
written, but none of the changes I've tried have changed the overall
performance picture.)
The execution plan on the (slow) 8.3 server is:

 Nested Loop  (cost=44050.86..77140.03 rows=1 width=4)
   Join Filter: (second_term.cv_id = cv.cv_id)
   ->  Nested Loop  (cost=44050.86..77138.61 rows=1 width=12)
         Join Filter: ((first_term.cv_id = second_term.cv_id) AND
((first_term.name)::text = (second_term.name)::text))
         ->  Nested Loop  (cost=44050.86..77130.32 rows=1 width=56)
               ->  Nested Loop  (cost=44050.86..77122.65 rows=1 width=12)
                     Join Filter: (upper(second_evidence.value) =
upper(first_evidence.value))
                     ->  Nested Loop  (cost=44050.86..77114.32 rows=1 width=50)
                           Join Filter: ((second.feature_cvterm_id >
first.feature_cvterm_id) AND (second.feature_id = first.feature_id)
AND (second.pub_id = first.pub_id) AND
((second_withfrom_dbxref.accession)::text =
(first_withfrom_dbxref.accession)::text))
                           ->  Nested Loop  (cost=30794.26..42915.70
rows=1 width=69)
                                 ->  Hash Join
(cost=30794.26..42906.88 rows=1 width=65)
                                       Hash Cond:
(second_evidence.type_id = evidence_type.cvterm_id)
                                       ->  Hash Join
(cost=30784.59..42807.07 rows=24035 width=61)
                                             Hash Cond:
(second_fvd.dbxref_id = second_withfrom_dbxref.dbxref_id)
                                             ->  Hash Join
(cost=19044.44..28262.26 rows=24035 width=50)
                                                   Hash Cond:
(second_evidence.feature_cvterm_id = second.feature_cvterm_id)
                                                   ->  Seq Scan on
feature_cvtermprop second_evidence  (cost=0.00..4370.07 rows=223307
width=34)
                                                   ->  Hash
(cost=18169.19..18169.19 rows=47620 width=24)
                                                         ->  Hash Join
 (cost=1516.45..18169.19 rows=47620 width=24)
                                                               Hash
Cond: (second.feature_cvterm_id = second_fvd.feature_cvterm_id)
                                                               ->  Seq
Scan on feature_cvterm second  (cost=0.00..7243.27 rows=442427
width=16)
                                                               ->
Hash  (cost=734.20..734.20 rows=47620 width=8)

->  Seq Scan on feature_cvterm_dbxref second_fvd  (cost=0.00..734.20
rows=47620 width=8)
                                             ->  Hash
(cost=5838.29..5838.29 rows=321429 width=19)
                                                   ->  Seq Scan on
dbxref second_withfrom_dbxref  (cost=0.00..5838.29 rows=321429
width=19)
                                       ->  Hash  (cost=9.66..9.66
rows=1 width=4)
                                             ->  Nested Loop
(cost=0.00..9.66 rows=1 width=4)
                                                   Join Filter:
(evidence_type.cv_id = evidence_type_cv.cv_id)
                                                   ->  Seq Scan on cv
evidence_type_cv  (cost=0.00..1.35 rows=1 width=4)
                                                         Filter:
((name)::text = 'genedb_misc'::text)
                                                   ->  Index Scan
using cvterm_idx2 on cvterm evidence_type  (cost=0.00..8.29 rows=1
width=8)
                                                         Index Cond:
((evidence_type.name)::text = 'evidence'::text)
                                 ->  Index Scan using feature_pkey on
feature  (cost=0.00..8.81 rows=1 width=4)
                                       Index Cond: (feature.feature_id
= second.feature_id)
                           ->  Hash Join  (cost=13256.60..33246.22
rows=47620 width=35)
                                 Hash Cond: (first_fvd.dbxref_id =
first_withfrom_dbxref.dbxref_id)
                                 ->  Hash Join
(cost=1516.45..18169.19 rows=47620 width=24)
                                       Hash Cond:
(first.feature_cvterm_id = first_fvd.feature_cvterm_id)
                                       ->  Seq Scan on feature_cvterm
first  (cost=0.00..7243.27 rows=442427 width=16)
                                       ->  Hash  (cost=734.20..734.20
rows=47620 width=8)
                                             ->  Seq Scan on
feature_cvterm_dbxref first_fvd  (cost=0.00..734.20 rows=47620
width=8)
                                 ->  Hash  (cost=5838.29..5838.29
rows=321429 width=19)
                                       ->  Seq Scan on dbxref
first_withfrom_dbxref  (cost=0.00..5838.29 rows=321429 width=19)
                     ->  Index Scan using feature_cvtermprop_c1 on
feature_cvtermprop first_evidence  (cost=0.00..8.31 rows=1 width=34)
                           Index Cond:
((first_evidence.feature_cvterm_id = first.feature_cvterm_id) AND
(first_evidence.type_id = second_evidence.type_id))
               ->  Index Scan using cvterm_pkey on cvterm first_term
(cost=0.00..7.65 rows=1 width=52)
                     Index Cond: (first_term.cvterm_id = first.cvterm_id)
         ->  Index Scan using cvterm_pkey on cvterm second_term
(cost=0.00..8.28 rows=1 width=52)
               Index Cond: (second_term.cvterm_id = second.cvterm_id)
   ->  Seq Scan on cv  (cost=0.00..1.39 rows=3 width=4)
         Filter: ((cv.name)::text = ANY
('{biological_process,molecular_function,cellular_component}'::text[]))

and on the (fast) 8.4 server is:

 Nested Loop  (cost=63949.73..77767.20 rows=1 width=4)
   Join Filter: (second_term.cv_id = cv.cv_id)
   ->  Nested Loop  (cost=63949.73..77765.78 rows=1 width=12)
         Join Filter: (upper(second_evidence.value) =
upper(first_evidence.value))
         ->  Nested Loop  (cost=63949.73..77757.46 rows=1 width=51)
               Join Filter: ((first_term.cv_id = second_term.cv_id)
AND ((first_term.name)::text = (second_term.name)::text))
               ->  Nested Loop  (cost=63949.73..77749.17 rows=1 width=95)
                     ->  Nested Loop  (cost=63949.73..77741.55 rows=1 width=51)
                           ->  Hash Join  (cost=63949.73..77732.49
rows=1 width=59)
                                 Hash Cond: ((second.feature_id =
first.feature_id) AND (second.pub_id = first.pub_id) AND
((second_withfrom_dbxref.accession)::text =
(first_withfrom_dbxref.accession)::text))
                                 Join Filter:
(second.feature_cvterm_id > first.feature_cvterm_id)
                                 ->  Hash Join
(cost=30236.57..41303.13 rows=4607 width=66)
                                       Hash Cond:
(second_evidence.type_id = evidence_type.cvterm_id)
                                       ->  Hash Join
(cost=30226.90..41161.01 rows=23034 width=62)
                                             Hash Cond:
(second_fvd.dbxref_id = second_withfrom_dbxref.dbxref_id)
                                             ->  Hash Join
(cost=18735.19..27081.42 rows=23034 width=51)
                                                   Hash Cond:
(second_evidence.feature_cvterm_id = second.feature_cvterm_id)
                                                   ->  Seq Scan on
feature_cvtermprop second_evidence  (cost=0.00..3965.92 rows=210392
width=35)
                                                   ->  Hash
(cost=17861.31..17861.31 rows=47590 width=24)
                                                         ->  Hash Join
 (cost=1490.78..17861.31 rows=47590 width=24)
                                                               Hash
Cond: (second.feature_cvterm_id = second_fvd.feature_cvterm_id)
                                                               ->  Seq
Scan on feature_cvterm second  (cost=0.00..7115.82 rows=434682
width=16)
                                                               ->
Hash  (cost=709.90..709.90 rows=47590 width=8)

->  Seq Scan on feature_cvterm_dbxref second_fvd  (cost=0.00..709.90
rows=47590 width=8)
                                             ->  Hash
(cost=5743.87..5743.87 rows=321587 width=19)
                                                   ->  Seq Scan on
dbxref second_withfrom_dbxref  (cost=0.00..5743.87 rows=321587
width=19)
                                       ->  Hash  (cost=9.66..9.66
rows=1 width=4)
                                             ->  Nested Loop
(cost=0.00..9.66 rows=1 width=4)
                                                   Join Filter:
(evidence_type.cv_id = evidence_type_cv.cv_id)
                                                   ->  Seq Scan on cv
evidence_type_cv  (cost=0.00..1.35 rows=1 width=4)
                                                         Filter:
((name)::text = 'genedb_misc'::text)
                                                   ->  Index Scan
using cvterm_idx2 on cvterm evidence_type  (cost=0.00..8.29 rows=1
width=8)
                                                         Index Cond:
((evidence_type.name)::text = 'evidence'::text)
                                 ->  Hash  (cost=32531.33..32531.33
rows=47590 width=35)
                                       ->  Hash Join
(cost=12982.48..32531.33 rows=47590 width=35)
                                             Hash Cond:
(first_fvd.dbxref_id = first_withfrom_dbxref.dbxref_id)
                                             ->  Hash Join
(cost=1490.78..17861.31 rows=47590 width=24)
                                                   Hash Cond:
(first.feature_cvterm_id = first_fvd.feature_cvterm_id)
                                                   ->  Seq Scan on
feature_cvterm first  (cost=0.00..7115.82 rows=434682 width=16)
                                                   ->  Hash
(cost=709.90..709.90 rows=47590 width=8)
                                                         ->  Seq Scan
on feature_cvterm_dbxref first_fvd  (cost=0.00..709.90 rows=47590
width=8)
                                             ->  Hash
(cost=5743.87..5743.87 rows=321587 width=19)
                                                   ->  Seq Scan on
dbxref first_withfrom_dbxref  (cost=0.00..5743.87 rows=321587
width=19)
                           ->  Index Scan using feature_pkey on
feature  (cost=0.00..9.04 rows=1 width=4)
                                 Index Cond: (feature.feature_id =
second.feature_id)
                     ->  Index Scan using cvterm_pkey on cvterm
first_term  (cost=0.00..7.61 rows=1 width=52)
                           Index Cond: (first_term.cvterm_id = first.cvterm_id)
               ->  Index Scan using cvterm_pkey on cvterm second_term
(cost=0.00..8.27 rows=1 width=52)
                     Index Cond: (second_term.cvterm_id = second.cvterm_id)
         ->  Index Scan using feature_cvtermprop_c1 on
feature_cvtermprop first_evidence  (cost=0.00..8.30 rows=1 width=35)
               Index Cond: ((first_evidence.feature_cvterm_id =
first.feature_cvterm_id) AND (first_evidence.type_id =
second_evidence.type_id))
   ->  Seq Scan on cv  (cost=0.00..1.39 rows=3 width=4)
         Filter: ((cv.name)::text = ANY
('{biological_process,molecular_function,cellular_component}'::text[]))

Any insights would be much appreciated.
Thanks,
Robin

Re: Huge difference in query performance between 8.3 and 8.4 (possibly)

From

Alvaro Herrera

Date:

09 July 2009, 14:02:55

Robin Houston escribió:

> We have a query that runs very slowly on our 8.3 database. (I can't
> tell you exactly how slowly, because it has never successfully run to
> completion even when we left it running overnight.) On the 8.4
> database on my laptop, it runs in about 90 seconds. Of course there
> are several differences between the two instances, but I wonder
> whether query planning improvements in 8.4 could essentially account
> for it.

Of course.  Great news.  Congratulations.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Huge difference in query performance between 8.3 and 8.4 (possibly)

From

Tom Lane

Date:

09 July 2009, 14:09:24

Robin Houston <robin.houston@gmail.com> writes:
> We have a query that runs very slowly on our 8.3 database. (I can't
> tell you exactly how slowly, because it has never successfully run to
> completion even when we left it running overnight.) On the 8.4
> database on my laptop, it runs in about 90 seconds. Of course there
> are several differences between the two instances, but I wonder
> whether query planning improvements in 8.4 could essentially account
> for it.

Well, it's hard to be sure with only EXPLAIN and not EXPLAIN ANALYZE
output to look at; but I think the significant difference in these plans
is that 8.4 has chosen a hash instead of nestloop join for a couple of
the intermediate join levels.  Which is evidently because of a change
in the estimated size of the next join down:

                     ->  Nested Loop  (cost=44050.86..77114.32 rows=1 width=50)
                           Join Filter: ((second.feature_cvterm_id > first.feature_cvterm_id) AND (second.feature_id =
first.feature_id)AND (second.pub_id = first.pub_id) AND ((second_withfrom_dbxref.accession)::text =
(first_withfrom_dbxref.accession)::text))
                           ->  Nested Loop  (cost=30794.26..42915.70 rows=1 width=69)
                                 ->  Hash Join (cost=30794.26..42906.88 rows=1 width=65)
                                       Hash Cond: (second_evidence.type_id = evidence_type.cvterm_id)

versus

                           ->  Hash Join  (cost=63949.73..77732.49 rows=1 width=59)
                                 Hash Cond: ((second.feature_id = first.feature_id) AND (second.pub_id = first.pub_id)
AND((second_withfrom_dbxref.accession)::text = (first_withfrom_dbxref.accession)::text)) 
                                 Join Filter: (second.feature_cvterm_id > first.feature_cvterm_id)
                                 ->  Hash Join (cost=30236.57..41303.13 rows=4607 width=66)
                                       Hash Cond: (second_evidence.type_id = evidence_type.cvterm_id)

If the 8.4 rowcount estimate is accurate then it's not surprising that
the nestloop plan sucks --- it'd be re-executing the other arm of the
join 4600 or so times.

This could reflect improvements in the join size estimation code, or
maybe it's just a consequence of 8.4 using larger statistics targets
by default.  It's hard to be sure with so little information to go on.

            regards, tom lane

Re: Huge difference in query performance between 8.3 and 8.4 (possibly)

From

"Hartman, Matthew"

Date:

09 July 2009, 14:46:24

> From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Robin Houston
> Sent: Thursday, July 09, 2009 12:35 PM

> We have a query that runs very slowly on our 8.3 database. (I can't
> tell you exactly how slowly, because it has never successfully run to
> completion even when we left it running overnight.) On the 8.4
> database on my laptop, it runs in about 90 seconds.
> Any insights would be much appreciated.

I doubt this is the insight you're looking for, but that is the worst
query I have ever seen. It is difficult to understand exactly what it
returns. There are so many cross joins, outer joins, and inner joins
mixed up together, ugh.

Rather than trying to puzzle out why it is slow, rewrite it. It will be
faster than before on any version.

Matthew Hartman
Programmer/Analyst
Information Management, ICP
Kingston General Hospital