More index / search speed questions

From: "Mitch Vincent"
Date:
This is a bit long, sorry about that..

ipa=# explain select a.app_id, a.appcode, a.firstname, a.middlename,
a.lastname, a.state, a.degree1, a.d1date, a.degree2, a.d2date, a.salary,
a.skill1, a.skill2, a.skill3, a.objective, a.employer, a.sic1, a.sic2,
a.sic3, a.prefs1, a.prefs2, a.sells, a.profile from applicants as
a,resumes_fti as rf,profile_fti as pf where (( pf.string='engineer' ) OR (
rf.string='engineer' ) ) and a.app_id=rf.app_id and a.app_id=pf.app_id order
by (case when a.resubmitted > a.created then a.resubmitted else a.created
end) desc limit 10 offset 0 ;
NOTICE:  QUERY PLAN:

Sort  (cost=167674.25..167674.25 rows=69399 width=252)
  ->  Merge Join  (cost=2096.22..160747.54 rows=69399 width=252)
        ->  Merge Join  (cost=2096.22..13223.87 rows=300530 width=236)
              ->  Index Scan using profile_fti_id_index on profile_fti pf
(cost=0.00..7261.23 rows=300530 width=16)
              ->  Sort  (cost=2096.22..2096.22 rows=8784 width=220)
                    ->  Seq Scan on applicants a  (cost=0.00..1520.84
rows=8784 width=220)
        ->  Index Scan using resumes_fti_id_index on resumes_fti rf
(cost=0.00..103748.68 rows=2200898 width=16)

EXPLAIN

Nasty slow; it hasn't even finished yet (and it's been running for at least
15 minutes)..

The resumes_fti table is huge, no doubt (2200898 rows), and the profile_fti
table is pretty big too (300573 rows).. Indexes are on all the app_id
columns (in all tables), on rf.string and pf.string, and on a.resubmitted
and a.created.

The merge join is evidently the killer, and the seqscan on applicants
(10,000-ish rows) doesn't help..
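
One way to sanity-check that is to turn the seq scan off for the session and
see what plan the optimizer falls back to. A sketch, using the planner's
standard enable_* switches:

set enable_seqscan = off;
-- now re-run the explain above and compare the plan and its cost
set enable_seqscan = on;   -- restore the default when done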

I'm looking for any way to speed this up some.. If I search on the
applicants table too, the search is a lot faster:

ipa=# explain select a.app_id, a.appcode, a.firstname, a.middlename,
a.lastname, a.state, a.degree1, a.d1date, a.degree2, a.d2date, a.salary,
a.skill1, a.skill2, a.skill3, a.objective, a.employer, a.sic1, a.sic2,
a.sic3, a.prefs1, a.prefs2, a.sells, a.profile from applicants as
a,resumes_fti as rf,profile_fti as pf where ( ( a.skill1=123 OR a.skill2=123
OR a.skill3=123 )) AND (( pf.string='engineer' OR pf.string='process' ) OR
( rf.string='engineer' OR rf.string='process' ) ) and a.app_id=rf.app_id and
a.app_id=pf.app_id order by (case when a.resubmitted > a.created then
a.resubmitted else a.created end) desc limit 10 offset 0

NOTICE:  QUERY PLAN:

Sort  (cost=48181.12..48181.12 rows=1754 width=252)
  ->  Merge Join  (cost=33645.39..48086.60 rows=1754 width=252)
        ->  Index Scan using profile_fti_id_index on profile_fti pf
(cost=0.00..7261.23 rows=300530 width=16)
        ->  Sort  (cost=33645.39..33645.39 rows=18580 width=236)
              ->  Nested Loop  (cost=0.00..32327.91 rows=18580 width=236)
                    ->  Index Scan using applicants_skill1,
applicants_skill2, applicants_skill3 on applicants a  (cost=0.00..294.14
rows=111 width=220)
                    ->  Index Scan using resumes_fti_id_index on resumes_fti
rf  (cost=0.00..287.31 rows=84 width=16)

EXPLAIN

...That takes only a few seconds..

So, can anyone smack me around with a clue stick?

-Mitch


Re: More index / search speed questions

From: Tom Lane
Date:
"Mitch Vincent" <mitch@venux.net> writes:
> This is a bit long, sorry about that..

One good thing to ask yourself is always "do the planner's row-count
estimates have anything to do with reality?"
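
For instance, quick counts like these (using the table and column names from
your queries) show how many rows actually satisfy each string predicate,
which is the sort of number to hold up against the plan's estimates:

select count(*) from profile_fti where string = 'engineer';
select count(*) from resumes_fti where string = 'engineer';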

In this case the issue seems to be that the planner is using an
indexscan over the whole of resumes_fti --- there is no way to limit the
scan using app_id, so it must be using the index just as a way to order
the data for a mergejoin.

In your quicker example, the innermost nested loop is pulling out
potential app_id values from the applicants table and using each one
to perform a constrained indexscan on resumes_fti.  That's a great
strategy as long as you don't have very many hits in the applicants
table (else the repeated indexscan startup overhead kills you).  I don't
know if the planner's estimate of 111 hits is very accurate, but clearly
it's guessed right that the number of hits is not large, else you'd not
be happy with the performance of that plan ;-)
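
One way to find out is to count it directly, with just the skill predicate
from your faster query:

select count(*) from applicants
where skill1 = 123 or skill2 = 123 or skill3 = 123;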

In the slower case, the planner is estimating quite a few thousand
potential matches, and that leads it to use a mergejoin, which
may be relatively slow here but it won't fall apart completely when
there are many matches.  Since you're complaining, I guess that that
estimate was *not* accurate.  But what are the correct numbers?

Also, you might experiment with "set enable_mergejoin = OFF" to see what
sort of plan you get (probably a hashjoin) and what its performance is
like.
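
A sketch of what that looks like, in the same session as the query (the
setting only lasts for the session):

set enable_mergejoin = off;
-- re-run the explain (or the query itself) and compare
set enable_mergejoin = on;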

            regards, tom lane