Re: GiST index not used for ORDER BY? - Mailing list pgsql-general

From Max
Subject Re: GiST index not used for ORDER BY?
Date
Msg-id 6.2.1.2.0.20050127222524.02f697b8@mx1.to-the-max.net
Whole thread Raw
In response to Re: GiST index not used for ORDER BY?  (Oleg Bartunov <oleg@sai.msu.su>)
Responses Re: GiST index not used for ORDER BY?
List pgsql-general
Hi,

At 09:54 PM 1/27/2005, you wrote:
>On Thu, 27 Jan 2005, Max wrote:
>>I'm setting up a simple search engine using Tsearch2.
>>The basic idea is: a user enters a search query and a maximum of 1000
>>results is returned, sorted by date, newest first.
>>
>>At the moment the table holding the searchable data has 1.1 million entries.
>>It works great when the search only produces a few hundred results.
>>However when people search on a common word with 10.000+ results, there's
>>a performance problem.
 >>
>>CREATE TABLE posts_index
>>(
>>....
>>        startdate INT NOT NULL,
>>        idxFTI   tsvector,
>>....
>>);
>>CREATE INDEX idxFTI_idx2 ON posts_index USING gist(idxFTI,(-startdate));


>I assume you already vacuum your db.

Yes, I did vacuum analyze it. And he does use the first part of the index
(idxFTI), just not the second part (-startdate).

>  Hmm, seems you need to rewrite your query.
>EXPLAIN SELECT startdate, headline(subject,q) AS subject FROM ( SELECT
>startdate, subject from posts_index i,
>  to_tsquery('default', '_SEARCH_TERM_') AS q WHERE idxfti @@ q ORDER BY
>  (-i.startdate) LIMIT 1000) as foo;
>
>I bet your query will be much faster. In your query all founded tuples should
>be read from disk to calculate headline(), while in my query maximum 1000
>tuples will be read. So, performance gain could be noticeable, for example,
>if search returns 10,000 tuples, my query will be 10x faster than yours :)
>I think this is what you observed.

Thanks for your help, however headline() doesn't seem the problem.
Here's an EXPLAIN ANALYZE using your query and a common word as SEARCH_TERM:

------
QUERY PLAN

Subquery Scan foo  (cost=5368809.49..5368824.49 rows=1000 width=181)
(actual time=363455.642..363510.277 rows=1000 loops=1)
   ->  Limit  (cost=5368809.49..5368811.99 rows=1000 width=126) (actual
time=363454.387..363455.983 rows=1000 loops=1)
         ->  Sort  (cost=5368809.49..5372006.34 rows=1278741 width=126)
(actual time=363454.380..363455.471 rows=1000 loops=1)
               Sort Key: (- i.startdate)
               ->  Nested Loop  (cost=0.00..5118844.92 rows=1278741
width=126) (actual time=0.140..354003.773 rows=343974 loops=1)
                     ->  Function Scan on q  (cost=0.00..12.50 rows=1000
width=32) (actual time=0.015..0.018 rows=1 loops=1)
                     ->  Index Scan using idxfti_idx2 on posts_index
i  (cost=0.00..5099.65 rows=1279 width=253) (actual time=0.111..353068.267
rows=343974 loops=1)
                           Index Cond: (i.idxfti @@ "outer".q)
Total runtime: 363571.960 ms
----

It still seems to rather sort 343.974 rows and take over 5 minutes to
complete, than use the index for the date. While searching on less common
words takes less than a second.Omitting headline() completely doesn't
changes anything either.
So it must be something else.


Regards,
Max


pgsql-general by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: change encoding ?
Next
From: William Yu
Date:
Subject: Re: Splitting queries across servers