Thread: Full text indexing preformance! (long)

Full text indexing preformance! (long)

From
"Mitch Vincent"
Date:
Well I have to say that I'm pretty impressed with PostgreSQL after this..
Still, I'd like some input from the experts as I may not be doing the best I
can..

I setup the full text indexing (as described in contrib/fulltextindex) and
see some amazing results but like I said earlier, I might be able to do
better.

I took Tom's advice and turned on like planning (as described in
contrib/likeplanning) and it made a world of difference by itself..

Here is a quick run-down of the table structure.

Table : resumes_fti (coorespnds to the cds-fti table in the example)

25370953 rows.

Table applicants_resumes (cooresponds to the cds table in the example)

Table applicants (63 fields)
11039 rows.

Table applicants_states (2 fields)
276255 rows


The most complex query I use is this :

select a.appcode, a.firstname, a.middlename, a.lastname, a.state, a.degree1,
a.d1date, a.degree2, a.d2date, a.salary, a.skill1, a.skill2, a.skill3,
a.objective, a.employer, a.sic1, a.sic2, a.sic3, a.prefs1, a.prefs2,
a.sells from applicants as a,applicants_states as s, applicants_resumes as
ar,
resumes_fti as rf where a.status = 'A' and s.rstate='AL' and
s.app_id=a.app_id and rf.string ~'engineer' and rf.id = ar.oid limit 10
offset 0

-- BUT - I forgot one crucial thing. To qualify the results from the
applicants_resume table bases on the applicants table (ie ar.app_id =
a.app_id) I did this and the query went from just over 3 seconds to over 25
seconds!

I changes the above query to

select a.appcode, a.firstname, a.middlename, a.lastname, a.state, a.degree1,
a.d1date, a.degree2, a.d2date, a.salary, a.skill1, a.skill2, a.skill3,
a.objective, a.employer, a.sic1, a.sic2, a.sic3, a.prefs1, a.prefs2,
a.sells from applicants as a,applicants_states as s, applicants_resumes as
ar,
resumes_fti as rf where a.status = 'A' and s.rstate='AL' and rf.string
~'engineer'
and rf.id = ar.oid and s.app_id=a.app_id and ar.app_id=a.app_id limit 10
offset 0
(Listed below again)

Hopefully it's just something else stupid I am doing and someone will beat
me with a clue stick.  All of this was done on a PostgreSQL 7.0 backend run
as "/usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/data -B 4096 -o '-S
16384'
-i >/usr/local/pgsql/postgres.log 2>&1&" on a  FreeBSD 4.0, Dual Celeron 600
box with an ATA/66 30 gig drive and 256 megs of RAM.

Here are some stats :

Without the extra condition :

NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..487951.02 rows=1644 width=204) ->  Nested Loop  (cost=0.00..478489.01 rows=2641 width=12)
->  Nested Loop  (cost=0.00..474081.63 rows=1 width=8)             ->  Seq Scan on resumes_fti rf
(cost=0.00..474076.91rows=1
 
width=4)             ->  Index Scan using resumes_oid_index on applicants_resumes
ar  (cost=0.00..4.70 rows=1 width=4)       ->  Index Scan using applicants_states_rstate on applicants_states s
(cost=0.00..4380.98 rows=2641 width=4) ->  Index Scan using applicants_app_id on applicants a  (cost=0.00..3.57
rows=1 width=192)

EXPLAIN


StartTransactionCommand
query: select a.appcode, a.firstname, a.middlename, a.lastname, a.state,
a.degree1,
a.d1date, a.degree2, a.d2date, a.salary, a.skill1, a.skill2, a.skill3,
a.objective, a.employer, a.sic1, a.sic2, a.sic3, a.prefs1, a.prefs2,
a.sells from applicants as a,applicants_states as s, applicants_resumes as
ar,
resumes_fti as rf where a.status = 'A' and s.rstate='AL' and rf.string
~'engineer'
and rf.id = ar.oid and s.app_id=a.app_id limit 10 offset 0;
ProcessQuery
! system usage stats:
!       3.386697 elapsed 2.599174 user 0.787057 system sec
!       [2.617929 user 0.797100 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       0/8330 [0/8569] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [3/3] messages rcvd/sent
!       0/43 [3/47] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:       4950 read,          0 written, buffer hit rate
= 20.03%
!       Local  blocks:          0 read,          0 written, buffer hit rate
= 0.00%
!       Direct blocks:          0 read,          0 written
CommitTransactionCommand
proc_exit(0)
shmem_exit(0)
exit(0)


With the extra condition :

NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..474194.76 rows=1 width=208) ->  Nested Loop  (cost=0.00..474085.21 rows=1 width=204)       ->
NestedLoop  (cost=0.00..474081.63 rows=1 width=12)             ->  Seq Scan on resumes_fti rf  (cost=0.00..474076.91
rows=1
width=4)             ->  Index Scan using resumes_oid_index on applicants_resumes
ar  (cost=0.00..4.70 rows=1 width=8)       ->  Index Scan using applicants_app_id on applicants a
(cost=0.00..3.57 rows=1 width=192) ->  Index Scan using applicants_states_app_id on applicants_states s
(cost=0.00..109.54 rows=1 width=4)

EXPLAIN



StartTransactionCommand
query: select a.appcode, a.firstname, a.middlename, a.lastname, a.state,
a.degree1,
a.d1date, a.degree2, a.d2date, a.salary, a.skill1, a.skill2, a.skill3,
a.objective, a.employer, a.sic1, a.sic2, a.sic3, a.prefs1, a.prefs2,
a.sells from applicants as a,applicants_states as s, applicants_resumes as
ar,
resumes_fti as rf where a.status = 'A' and s.rstate='AL' and rf.string
~'engineer'
and rf.id = ar.oid and s.app_id=a.app_id and ar.app_id=a.app_id limit 10
offset 0
ProcessQuery
! system usage stats:
!       25.503341 elapsed 18.564543 user 5.599631 system sec
!       [18.564543 user 5.627987 sys total]
!       2029/0 [2029/0] filesystem blocks in/out
!       0/8335 [0/8571] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [3/3] messages rcvd/sent
!       149/342 [152/346] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:      35118 read,          0 written, buffer hit rate
= 4.98%
!       Local  blocks:          0 read,          0 written, buffer hit rate
= 0.00%
!       Direct blocks:          0 read,          0 written
CommitTransactionCommand
proc_exit(0)
shmem_exit(0)
exit(0)

Sorry about the length. Thanks!

-Mitch




Re: Full text indexing preformance! (long)

From
Tom Lane
Date:
"Mitch Vincent" <mitch@venux.net> writes:
> Without the extra condition :
> 
> NOTICE:  QUERY PLAN:
> 
> Nested Loop  (cost=0.00..487951.02 rows=1644 width=204)
>   ->  Nested Loop  (cost=0.00..478489.01 rows=2641 width=12)
>         ->  Nested Loop  (cost=0.00..474081.63 rows=1 width=8)
>               ->  Seq Scan on resumes_fti rf  (cost=0.00..474076.91 rows=1
> width=4)
>               ->  Index Scan using resumes_oid_index on applicants_resumes
> ar  (cost=0.00..4.70 rows=1 width=4)
>         ->  Index Scan using applicants_states_rstate on applicants_states s
> (cost=0.00..4380.98 rows=2641 width=4)
>   ->  Index Scan using applicants_app_id on applicants a  (cost=0.00..3.57
> rows=1 width=192)
> 
> With the extra condition :
> 
> NOTICE:  QUERY PLAN:
> 
> Nested Loop  (cost=0.00..474194.76 rows=1 width=208)
>   ->  Nested Loop  (cost=0.00..474085.21 rows=1 width=204)
>         ->  Nested Loop  (cost=0.00..474081.63 rows=1 width=12)
>               ->  Seq Scan on resumes_fti rf  (cost=0.00..474076.91 rows=1
> width=4)
>               ->  Index Scan using resumes_oid_index on applicants_resumes
> ar  (cost=0.00..4.70 rows=1 width=8)
>         ->  Index Scan using applicants_app_id on applicants a
> (cost=0.00..3.57 rows=1 width=192)
>   ->  Index Scan using applicants_states_app_id on applicants_states s
> (cost=0.00..109.54 rows=1 width=4)

Odd.  The innermost join's the same in both plans, so that's not what's
causing the difference.  In the first case the next join is to
applicants_states using the "s.rstate='AL'" clause as a filter with the
applicants_states_rstate index.  The planner doesn't think that's gonna
be real selective (note the rows=2641) and based on prior discussion of
your database I'd agree --- don't you have lots of entries for AL?
Then it can at last join to applicants using "s.app_id=a.app_id".

In the second case it's seized on "ar.app_id=a.app_id" as a way to join
"applicants a" to the inner join using the applicants_app_id index.
This is not a bad idea at all if the a.app_id field is unique as it
seems to think (observe rows=1 there).  Then finally applicants_states
is joined on its app_id field.

Offhand I'd say that the second plan *ought* to be a lot quicker, and
I don't see why it's not.  Is applicants.app_id a unique key, or not?
You could investigate this by running just the partial selects (the
two or three inner tables with just the relevant WHERE clauses) to see
how many rows are returned at each step.

BTW, as far as I can see from this example you're still not using the
FTI stuff properly: you should be querying rf.string ~ '^engineer' so
that you get an indexscan over resumes_fti.  Without that, it seems
like you're not really getting any benefit from the FTI structure.
        regards, tom lane


Re: Full text indexing preformance! (long)

From
"Mitch Vincent"
Date:
Hi Tom, thanks for your reply..

I increased BLKSZ to 32k and re-compiled, then I imported all the resumes
(some of which I couldn't get before) and this problem completly
disappeared. The query is very fast now (.0.039792 seconds to be exact)..

One thing I did run into was this...
In my paging system I only have a need for 10 records at a time so I LIMIT
the query. The problem comes when I need to get a total of all the records
that matched the query (as a good search engine, I must tell people how many
records were found).. I can't count() and LIMIT in the same query, so I'm
forced to do 2 queries, one with count() and one without.

An example :

select a.appcode, a.firstname, a.middlename, a.lastname, a.state, a.degree1,
a.d1date, a.degree2, a.d2date, a.salary, a.skill1, a.skill2, a.skill3,
a.objective, a.employer, a.sic1, a.sic2, a.sic3, a.prefs1, a.prefs2, a.sells
from applicants as a,applicants_states as s, applicants_resumes as
ar,resumes_fti as rf where a.status = 'A' and lower(a.firstname) ~
lower('^a') and s.rstate='AL' and rf.string ~'^engineer' and
a.app_id=s.app_id and ar.app_id=a.app_id and rf.id=ar.oid limit 10 offset 0

Vs.

select count (a.app_id) as total from applicants as a,applicants_states as
s, applicants_resumes as ar,resumes_fti as rf where a.status = 'A' and
lower(a.firstname) ~ lower('^a') and s.rstate='AL' and rf.string
~'^engineer' and a.app_id=s.app_id and ar.app_id=a.app_id and rf.id=ar.oid

However the count() query has to go through the entire record set (which
makes sense) but it takes about 4 or 5  seconds.

The plan for the count() query.

NOTICE:  QUERY PLAN:

Aggregate  (cost=56.61..56.61 rows=1 width=20) ->  Nested Loop  (cost=0.00..56.61 rows=1 width=20)       ->  Nested
Loop (cost=0.00..10.74 rows=1 width=16)             ->  Nested Loop  (cost=0.00..8.59 rows=1 width=12)
->  Index Scan using resumes_fti_index on resumes_fti rf
 
(cost=0.00..4.97 rows=1 width=4)                   ->  Index Scan using applicants_resumes_index on
applicants_resumes ar  (cost=0.00..3.61 rows=1 width=8)             ->  Index Scan using applicants_app_id on
applicantsa
 
(cost=0.00..2.14 rows=1 width=4)       ->  Index Scan using applicants_states_app_id on applicants_states s
(cost=0.00..45.86 rows=1 width=4)

And the stats :

ProcessQuery
! system usage stats:
!       5.088647 elapsed 4.954981 user 0.125561 system sec
!       [4.976752 user 0.132817 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       0/4607 [0/4846] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [3/3] messages rcvd/sent
!       0/52 [3/57] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:          0 read,          0 written, buffer hit rate
= 100.00%
!       Local  blocks:          0 read,          0 written, buffer hit rate
= 0.00%
!       Direct blocks:          0 read,          0 written
CommitTransactionCommand
proc_exit(0)


The "0/4607 [0/4846] page faults/reclaims" area is greatly increased in this
query that from the other. Is that to be expected? Is there anything else I
can do to get the total number of records matched by the query and still use
LIMIT (I doubt it)?

If there isn't anything I can do, which looks to be the case here, I still
appreciate all the help you've given me..

I look forward to your response. Thanks!

-Mitch








Re: Full text indexing preformance! (long)

From
Tom Lane
Date:
"Mitch Vincent" <mitch@huntsvilleal.com> writes:
> The query is very fast now (.0.039792 seconds to be exact).

Cool ...

> In my paging system I only have a need for 10 records at a time so I LIMIT
> the query. The problem comes when I need to get a total of all the records
> that matched the query (as a good search engine, I must tell people how many
> records were found).. I can't count() and LIMIT in the same query, so I'm
> forced to do 2 queries, one with count() and one without.

Well, of course the whole *point* of LIMIT is that it stops short of
scanning the whole query result.  So I'm afraid you're kind of stuck
as far as the performance goes: you can't get a count() answer without
scanning the whole query.

I'm a little curious though: what is the typical count() result from
your queries?  The EXPLAIN outputs you show indicate that the planner
is only expecting about one row out now, but I have no idea how close
that is to the mark.  If it were really right, then there'd be no
difference in the performance of LIMIT and full queries, so I guess
it's not right; but how far off is it?
        regards, tom lane


Re: Full text indexing preformance! (long)

From
"Mitch Vincent"
Date:
> > Well, of course the whole *point* of LIMIT is that it stops short of
> > scanning the whole query result.  So I'm afraid you're kind of stuck
> > as far as the performance goes: you can't get a count() answer without
> > scanning the whole query.

Right, that's what I thought.

> > I'm a little curious though: what is the typical count() result from
> > your queries?  The EXPLAIN outputs you show indicate that the planner
> > is only expecting about one row out now, but I have no idea how close
> > that is to the mark.  If it were really right, then there'd be no
> > difference in the performance of LIMIT and full queries, so I guess
> > it's not right; but how far off is it?

Well, count does always return 1 row, though what's in that one row is as
varying as 0 to the number of records in the applicants database (about
11,000)..

Anyway, I thank you and appreciate your input..

-Mitch