Thread: Full text search with ORDER BY performance issue
Hello,

I'm having a bit of an issue with full text search (using tsvectors) on
PostgreSQL 8.4. I have a rather large table (around 12M rows) and want to
run full text searches against one of its columns. Plain text search is
reasonably fast (I have a GIN index on the column in question,
"comment_tsv"), but it becomes unbearably slow when I order the results by
another field ("timestamp"). Here's an example query:

SELECT * FROM a
WHERE comment_tsv @@ plainto_tsquery('love')
ORDER BY timestamp DESC
LIMIT 24 OFFSET 0;

I asked in #postgresql and was told there are two possible plans for this
query: the first scans the BTREE index on "timestamp", reads rows in
order, and filters them with the text search condition; the second finds
all rows matching the text search using the GIN index and then sorts them
by "timestamp". This much I already knew -- in fact, I had to drop the
BTREE index on "timestamp" to keep the planner from choosing the first
plan, which is completely useless to me given how huge the table is
(suggestions on how to keep the planner from picking the "wrong" plan are
also appreciated).

Obviously, the second plan gets really slow when I search for common words
and the text search returns a lot of rows to be sorted.

I tried a GIN index on ("timestamp", "comment_tsv") (using btree_gin from
contrib), but it doesn't really help -- I was told on IRC this is because
GIN doesn't support ORDER BY; only BTREE can do that.

Here's a couple of queries:

archive=> explain analyze select * from a where comment_tsv @@
plainto_tsquery('love') order by timestamp desc limit 24 offset 0;

                                   QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=453248.73..453248.79 rows=24 width=281) (actual time=188441.047..188441.148 rows=24 loops=1)
   ->  Sort  (cost=453248.73..453882.82 rows=253635 width=281) (actual time=188441.043..188441.079 rows=24 loops=1)
         Sort Key: "timestamp"
         Sort Method:  top-N heapsort  Memory: 42kB
         ->  Bitmap Heap Scan on a  (cost=17782.16..446166.02 rows=253635 width=281) (actual time=2198.930..187948.050 rows=256378 loops=1)
               Recheck Cond: (comment_tsv @@ plainto_tsquery('love'::text))
               ->  Bitmap Index Scan on timestamp_comment_gin  (cost=0.00..17718.75 rows=253635 width=0) (actual time=2113.664..2113.664 rows=259828 loops=1)
                     Index Cond: (comment_tsv @@ plainto_tsquery('love'::text))
 Total runtime: 188442.617 ms
(9 rows)

archive=> explain analyze select * from a where comment_tsv @@
plainto_tsquery('love') limit 24 offset 0;

                                   QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..66.34 rows=24 width=281) (actual time=14.632..53.647 rows=24 loops=1)
   ->  Seq Scan on a  (cost=0.00..701071.49 rows=253635 width=281) (actual time=14.629..53.588 rows=24 loops=1)
         Filter: (comment_tsv @@ plainto_tsquery('love'::text))
 Total runtime: 53.731 ms
(4 rows)

The first one runs painfully slow.

Is there really no way to get efficient full text search results ordered
by a separate field? I'm open to all possibilities at this point.

Thanks.
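P.S. In case it matters, the combined index was created more or less like
this (a sketch from memory; on 8.4 btree_gin is loaded by running the
contrib SQL script, whose exact path depends on the installation, and I
believe multicolumn GIN indexes are themselves new in 8.4):

-- load the btree_gin operator classes from contrib first, e.g.:
--   psql archive -f .../contrib/btree_gin.sql
CREATE INDEX timestamp_comment_gin ON a USING gin ("timestamp", comment_tsv);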
Krade,

On Sat, 18 Jul 2009, Krade wrote:

> Here's a couple of queries:
>
> archive=> explain analyze select * from a where comment_tsv @@
> plainto_tsquery('love') order by timestamp desc limit 24 offset 0;
>
>                                    QUERY PLAN
> ------------------------------------------------------------------------------
>  Limit  (cost=453248.73..453248.79 rows=24 width=281) (actual time=188441.047..188441.148 rows=24 loops=1)
>    ->  Sort  (cost=453248.73..453882.82 rows=253635 width=281) (actual time=188441.043..188441.079 rows=24 loops=1)
>          Sort Key: "timestamp"
>          Sort Method:  top-N heapsort  Memory: 42kB
>          ->  Bitmap Heap Scan on a  (cost=17782.16..446166.02 rows=253635 width=281) (actual time=2198.930..187948.050 rows=256378 loops=1)
>                Recheck Cond: (comment_tsv @@ plainto_tsquery('love'::text))
>                ->  Bitmap Index Scan on timestamp_comment_gin  (cost=0.00..17718.75 rows=253635 width=0) (actual time=2113.664..2113.664 rows=259828 loops=1)
>                      Index Cond: (comment_tsv @@ plainto_tsquery('love'::text))
>  Total runtime: 188442.617 ms
> (9 rows)
>
> archive=> explain analyze select * from a where comment_tsv @@
> plainto_tsquery('love') limit 24 offset 0;
>
>                                    QUERY PLAN
> ------------------------------------------------------------------------------
>  Limit  (cost=0.00..66.34 rows=24 width=281) (actual time=14.632..53.647 rows=24 loops=1)
>    ->  Seq Scan on a  (cost=0.00..701071.49 rows=253635 width=281) (actual time=14.629..53.588 rows=24 loops=1)
>          Filter: (comment_tsv @@ plainto_tsquery('love'::text))
>  Total runtime: 53.731 ms
> (4 rows)
>
> The first one runs painfully slow.

Hmm, everything is already written in the explain :) In the first query,
253635 rows have to be read from disk and sorted, while in the second
query only 24 (random) rows are read from disk, so there is a difference
of four orders of magnitude, and in the worst case you should expect the
first query to take about 53*10^4 ms.

> Is there really no way to get efficient full text search results ordered
> by a separate field? I'm open to all possibilities at this point.
>
> Thanks.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Sun, Jul 19, 2009 at 12:07 AM, Krade <krade@krade.com> wrote:
> archive=> explain analyze select * from a where comment_tsv @@
> plainto_tsquery('love') order by timestamp desc limit 24 offset 0;

What happens if you make it:

select * from (
    select * from a where comment_tsv @@ plainto_tsquery('love')
) xx
order by xx.timestamp desc
limit 24 offset 0;

?
Hello, thanks for your replies.

On 7/20/2009 13:12, Oleg Bartunov wrote:
> Hmm, everything is already written in the explain :) In the first query,
> 253635 rows have to be read from disk and sorted, while in the second
> query only 24 (random) rows are read from disk, so there is a difference
> of four orders of magnitude, and in the worst case you should expect the
> first query to take about 53*10^4 ms.

Yes, I do realize the first query retrieves all the rows that match the
full text search and sorts them -- that's exactly what I wanted to avoid,
since I only want 24 results at a time. :) I was wondering if there was
some index combination I'm not aware of, some way of clustering the table,
or a different query that would get me the same results faster.

Well, to take advantage of the GIN index on ("timestamp", "comment_tsv"),
I suppose I could do something like this:

archive=> explain analyze select * from a where comment_tsv @@
plainto_tsquery('love') and timestamp > cast(floor(extract(epoch from
CURRENT_TIMESTAMP) - 864000) as integer) order by timestamp limit 24
offset 0;

                                   QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=17326.69..17326.75 rows=24 width=281) (actual time=3249.192..3249.287 rows=24 loops=1)
   ->  Sort  (cost=17326.69..17337.93 rows=4499 width=281) (actual time=3249.188..3249.221 rows=24 loops=1)
         Sort Key: "timestamp"
         Sort Method:  top-N heapsort  Memory: 39kB
         ->  Bitmap Heap Scan on a  (cost=408.80..17201.05 rows=4499 width=281) (actual time=3223.890..3240.484 rows=5525 loops=1)
               Recheck Cond: (("timestamp" > (floor((date_part('epoch'::text, now()) - 864000::double precision)))::integer) AND (comment_tsv @@ plainto_tsquery('love'::text)))
               ->  Bitmap Index Scan on timestamp_comment_gin  (cost=0.00..407.67 rows=4499 width=0) (actual time=3222.769..3222.769 rows=11242 loops=1)
                     Index Cond: (("timestamp" > (floor((date_part('epoch'::text, now()) - 864000::double precision)))::integer) AND (comment_tsv @@ plainto_tsquery('love'::text)))
 Total runtime: 3249.957 ms
(9 rows)

This only looks at the last 10 days (864000 seconds) and is considerably
faster. Not perfect, but a lot better. It requires a lot of application
logic, though: if the first query didn't return 24 results, I'd have to
reissue it with a larger time interval, and that degrades pretty fast. It
strikes me as a really dumb thing to do.

I'm hitting a brick wall here: I can't seem to provide reasonably
efficient full text search ordered by date rather than returning matches
in effectively random order.

On 7/20/2009 13:22, Marcin Stępnicki wrote:
> What happens if you make it:
>
> select * from (
>     select * from a where comment_tsv @@ plainto_tsquery('love')
> ) xx
> order by xx.timestamp desc
> limit 24 offset 0;
>
> ?

Same query plan, I'm afraid.
Krade <krade@krade.com> wrote:

> SELECT * FROM a WHERE comment_tsv @@ plainto_tsquery('love')
> ORDER BY timestamp DESC LIMIT 24 OFFSET 0;

Have you considered keeping rows "narrow" until you've identified your 24
rows? Something like:

SELECT * FROM a
WHERE id in
  (
    SELECT id FROM a
    WHERE comment_tsv @@ plainto_tsquery('love')
    ORDER BY timestamp DESC
    LIMIT 24 OFFSET 0
  )
ORDER BY timestamp DESC;

-Kevin
Hello,

On 7/20/2009 22:42, Kevin Grittner wrote:
> Have you considered keeping rows "narrow" until you've identified your
> 24 rows? Something like:
>
> SELECT * FROM a
> WHERE id in
>   (
>     SELECT id FROM a
>     WHERE comment_tsv @@ plainto_tsquery('love')
>     ORDER BY timestamp DESC
>     LIMIT 24 OFFSET 0
>   )
> ORDER BY timestamp DESC;

Good idea, but it doesn't really seem to do much -- the query times are
roughly the same (presumably because the subquery still has to fetch and
sort every match).
Krade wrote:
> SELECT * FROM a WHERE comment_tsv @@ plainto_tsquery('love') ORDER BY
> timestamp DESC LIMIT 24 OFFSET 0;

Have you tried making the full-text condition a subselect with "offset 0"
to stop the planner from reordering the plan? E.g.:

select *
from (
    select * from a where comment_tsv @@ plainto_tsquery('love')
    offset 0
) xx
order by timestamp DESC
limit 24
offset 0;

See http://blog.endpoint.com/2009/04/offset-0-ftw.html

--
-Devin
On 7/21/2009 2:13, Devin Ben-Hur wrote:
> Have you tried making the full-text condition a subselect with
> "offset 0" to stop the planner from reordering the plan? E.g.:
>
> select *
> from (
>     select * from a where comment_tsv @@ plainto_tsquery('love')
>     offset 0
> ) xx
> order by timestamp DESC
> limit 24
> offset 0;
>
> See http://blog.endpoint.com/2009/04/offset-0-ftw.html

Yes, that does force the planner to always pick the full text index first
rather than the timestamp index. I managed to force the same thing by
doing something a lot more drastic: I just dropped my timestamp index
altogether, since I never used it for anything else (I mentioned this in
my original post).

That comment did make me try re-adding it, though, and I was pretty
surprised: the planner now only does backward scans on the timestamp index
for very common words (turning multi-minute queries into very fast ones),
as opposed to trying to use the timestamp index for all queries. I wonder
whether this is due to planner tweaks in 8.4 or whether my statistics just
got balanced out.

I'm not entirely happy, because I still easily get minute-long queries on
common words where the planner chooses not to use the timestamp index. The
planner can't guess right all the time. But I think I might just do:

select * from a where comment_tsv @@ plainto_tsquery('query') and
timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000) as
integer) order by timestamp desc limit 24 offset 0;

And if I get fewer than 24 rows, issue the regular query:

select * from a where comment_tsv @@ plainto_tsquery('query') order by
timestamp desc limit 24 offset 0;

I pay the price of doing two queries when one might have done, and it
makes almost all queries about 200 ms slower, but in exchange it turns the
few very slow queries into quick ones.

Thanks for all the help.
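P.S. Here's roughly what that would look like wrapped in a PL/pgSQL
function, so the application only makes one call (an untested sketch: the
function name is mine, the 10-day window is hard-coded, and it probes with
a cheap count first so the fallback doesn't return duplicate rows):

CREATE OR REPLACE FUNCTION search_comments(q text)
RETURNS SETOF a AS $$
DECLARE
    cutoff integer := cast(floor(extract(epoch from CURRENT_TIMESTAMP)
                                 - 864000) as integer);
    n integer;
BEGIN
    -- cheap probe: are there at least 24 matches in the last 10 days?
    SELECT count(*) INTO n FROM (
        SELECT 1 FROM a
        WHERE comment_tsv @@ plainto_tsquery(q)
          AND timestamp > cutoff
        LIMIT 24
    ) s;

    IF n = 24 THEN
        -- common word: the windowed query is enough and stays fast
        RETURN QUERY
            SELECT * FROM a
            WHERE comment_tsv @@ plainto_tsquery(q)
              AND timestamp > cutoff
            ORDER BY timestamp DESC
            LIMIT 24;
    ELSE
        -- rare word: fall back to the unrestricted query
        RETURN QUERY
            SELECT * FROM a
            WHERE comment_tsv @@ plainto_tsquery(q)
            ORDER BY timestamp DESC
            LIMIT 24;
    END IF;
END;
$$ LANGUAGE plpgsql;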
On Mon, Jul 20, 2009 at 9:35 PM, Krade <krade@krade.com> wrote:
> But I think I might just do:
>
> select * from a where comment_tsv @@ plainto_tsquery('query') and
> timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000)
> as integer) order by timestamp desc limit 24 offset 0;
>
> And if I get fewer than 24 rows, issue the regular query:
>
> select * from a where comment_tsv @@ plainto_tsquery('query') order by
> timestamp desc limit 24 offset 0;

Couldn't you do the second query as a WITH query, then run another query
to limit that result to everything greater than now() - x days?
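Something like this is what I had in mind (untested; note that in 8.4 a
WITH query acts as an optimization fence, so the full text part runs to
completion before the time filter is applied):

WITH matches AS (
    SELECT * FROM a
    WHERE comment_tsv @@ plainto_tsquery('query')
)
SELECT * FROM matches
WHERE timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP)
                             - 864000) as integer)
ORDER BY timestamp DESC
LIMIT 24 OFFSET 0;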
On Jul 21, 6:06 am, scott.marl...@gmail.com (Scott Marlowe) wrote:
> On Mon, Jul 20, 2009 at 9:35 PM, Krade <kr...@krade.com> wrote:
> > But I think I might just do:
> >
> > select * from a where comment_tsv @@ plainto_tsquery('query') and
> > timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000)
> > as integer) order by timestamp desc limit 24 offset 0;
> >
> > And if I get fewer than 24 rows, issue the regular query:
> >
> > select * from a where comment_tsv @@ plainto_tsquery('query') order by
> > timestamp desc limit 24 offset 0;
>
> Couldn't you do the second query as a WITH query, then run another query
> to limit that result to everything greater than now() - x days?

Hi,

There is a problem with GIN and GiST indexes: they cannot be used for
ORDER BY. Maybe it would be a nice idea to ask Oleg to make it possible to
use the b-tree columns in GiST or GIN to drive the sort, but I have no
idea how difficult that would be to implement in the current GIN or GiST
structures. I think Oleg, or even Tom, would be the right people to ask :)
But even if it is possible, it will not be implemented until 8.5 at the
earliest, and that is a year away, so until then...

It is possible to split your table into several smaller ones, put them on
different machines, and spread your query across them with DBLINK. This
distributes the burden of sorting to several machines, each of which only
has to sort its own, smaller part. After you have your 24 ids from each
machine, you can merge them, sort again, and apply the limit. Large
offsets will still be problematic, but faster anyway in most reasonable
offset ranges. (Load balancing tools like pgpool can automate this task,
but I have no practical experience using them for that purpose.)

Yet another very interesting technology is Sphinx search
(http://www.sphinxsearch.com/). It can distribute data over several
machines automatically, but it will probably be too expensive to adopt (if
this is not your main task :)), since there are no standard automation
scripts, it does not support live updates (so you will always have a delay
of some minutes), and it is a standalone service that has to be
maintained, configured, and synchronized with your main database
separately (though you can use pl/python to access it from Postgres).

Good luck with your task :)

-- Valentine Gogichashvili
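P.S. A sketch of the DBLINK merge I described (the connection strings are
placeholders, it assumes an integer id column and a two-way split, and
dblink itself is in contrib):

SELECT id, ts
FROM (
    SELECT * FROM dblink('host=shard1 dbname=archive',
        'SELECT id, timestamp FROM a
         WHERE comment_tsv @@ plainto_tsquery(''love'')
         ORDER BY timestamp DESC LIMIT 24')
      AS t1(id integer, ts integer)
    UNION ALL
    SELECT * FROM dblink('host=shard2 dbname=archive',
        'SELECT id, timestamp FROM a
         WHERE comment_tsv @@ plainto_tsquery(''love'')
         ORDER BY timestamp DESC LIMIT 24')
      AS t2(id integer, ts integer)
) merged
ORDER BY ts DESC
LIMIT 24;

Then fetch the full rows for those 24 ids from the shards they live on.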
On Tue, 21 Jul 2009, valgog wrote:
> There is a problem with GIN and GiST indexes: they cannot be used for
> ORDER BY. Maybe it would be a nice idea to ask Oleg to make it possible
> to use the b-tree columns in GiST or GIN to drive the sort, but I have
> no idea how difficult that would be to implement in the current GIN or
> GiST structures. I think Oleg, or even Tom, would be the right people to
> ask :)

I can answer that one for GiST, having had a good look at it recently.
There is simply no useful ordering information in a GiST index that an
ORDER BY could use. The index structure is just too general, because it
has to cope with object types that have no well-defined order, or whose
"order" is not useful for indexing.

Matthew

--
A good programmer is one who looks both ways before crossing a one-way
street. Considering the quality and quantity of one-way streets in
Cambridge, it should be no surprise that there are so many good
programmers there.
On 7/21/2009 11:32, valgog wrote:
> Hi,
>
> There is a problem with GIN and GiST indexes: they cannot be used for
> ORDER BY. Maybe it would be a nice idea to ask Oleg to make it possible
> to use the b-tree columns in GiST or GIN to drive the sort, but I have
> no idea how difficult that would be to implement in the current GIN or
> GiST structures. I think Oleg, or even Tom, would be the right people to
> ask :) But even if it is possible, it will not be implemented until 8.5
> at the earliest, and that is a year away, so until then...

Unfortunately, it's not even just the lack of ORDER BY support: btree_gin
indexes seem to be broken under some circumstances, so I can't even use my
idea of limiting searches to the last 10 days.

See this:
http://pgsql.privatepaste.com/5219TutUMk

The first query gives bogus results; it's not using the index correctly.

timestamp_comment_gin is a GIN index on (timestamp, comment_tsv), and the
timestamp column is an integer. The queries work correctly if I drop the
index. Is this a bug in btree_gin?

> It is possible to split your table into several smaller ones, put them
> on different machines, and spread your query across them with DBLINK.
> [...]
>
> Yet another very interesting technology is Sphinx search
> (http://www.sphinxsearch.com/). It can distribute data over several
> machines automatically, but it will probably be too expensive to adopt
> (if this is not your main task :))
> [...]
>
> Good luck with your task :)

Yeah, I don't really have that sort of resources. This is a small hobby
project (i.e., no budget) that is growing a bit too large. I might just
have to do text searches without time ordering.

On 7/21/2009 5:06, Scott Marlowe wrote:
> Couldn't you do the second query as a WITH query, then run another query
> to limit that result to everything greater than now() - x days?

I suppose I could, but because of the issue described above I have no way
to run a fast query that does both a full text match and a < or > in the
same WHERE, so my original plan won't work. A separate BTREE timestamp
index obviously does nothing here.

And again, thank you for all the help.
On Tue, 21 Jul 2009, Krade wrote:

> Unfortunately, it's not even just the lack of ORDER BY support:
> btree_gin indexes seem to be broken under some circumstances, so I can't
> even use my idea of limiting searches to the last 10 days.
>
> See this:
> http://pgsql.privatepaste.com/5219TutUMk
>
> The first query gives bogus results; it's not using the index correctly.
>
> timestamp_comment_gin is a GIN index on (timestamp, comment_tsv), and
> the timestamp column is an integer. The queries work correctly if I drop
> the index. Is this a bug in btree_gin?

It'd be nice if you provided us with the data, so we can reproduce your
problem.

> [...]
>
> And again, thank you for all the help.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Mon, Jul 20, 2009 at 8:12 AM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>> Here's a couple of queries:
>>
>> archive=> explain analyze select * from a where comment_tsv @@
>> plainto_tsquery('love') order by timestamp desc limit 24 offset 0;
>>
>>                                    QUERY PLAN
>> ------------------------------------------------------------------------------
>>  Limit  (cost=453248.73..453248.79 rows=24 width=281) (actual time=188441.047..188441.148 rows=24 loops=1)
>>    ->  Sort  (cost=453248.73..453882.82 rows=253635 width=281) (actual time=188441.043..188441.079 rows=24 loops=1)
>>          Sort Key: "timestamp"
>>          Sort Method:  top-N heapsort  Memory: 42kB
>>          ->  Bitmap Heap Scan on a  (cost=17782.16..446166.02 rows=253635 width=281) (actual time=2198.930..187948.050 rows=256378 loops=1)
>>                Recheck Cond: (comment_tsv @@ plainto_tsquery('love'::text))
>>                ->  Bitmap Index Scan on timestamp_comment_gin  (cost=0.00..17718.75 rows=253635 width=0) (actual time=2113.664..2113.664 rows=259828 loops=1)
>>                      Index Cond: (comment_tsv @@ plainto_tsquery('love'::text))
>>  Total runtime: 188442.617 ms
>> (9 rows)
>>
>> archive=> explain analyze select * from a where comment_tsv @@
>> plainto_tsquery('love') limit 24 offset 0;
>>
>>                                    QUERY PLAN
>> ------------------------------------------------------------------------------
>>  Limit  (cost=0.00..66.34 rows=24 width=281) (actual time=14.632..53.647 rows=24 loops=1)
>>    ->  Seq Scan on a  (cost=0.00..701071.49 rows=253635 width=281) (actual time=14.629..53.588 rows=24 loops=1)
>>          Filter: (comment_tsv @@ plainto_tsquery('love'::text))
>>  Total runtime: 53.731 ms
>> (4 rows)
>>
>> The first one runs painfully slow.
>
> Hmm, everything is already written in the explain :) In the first query,
> 253635 rows have to be read from disk and sorted, while in the second
> query only 24 (random) rows are read from disk, so there is a difference
> of four orders of magnitude, and in the worst case you should expect the
> first query to take about 53*10^4 ms.

If "love" is an uncommon word, there's no help for queries of this type
being slow unless the GIN index can return the results in order. But if
"love" is a common word, then it would be faster to do an index scan by
timestamp on the base relation and then treat comment_tsv @@
plainto_tsquery('love') as a filter condition. Is this a selectivity
estimation bug?

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> If "love" is an uncommon word, there's no help for queries of this type
> being slow unless the GIN index can return the results in order. But if
> "love" is a common word, then it would be faster to do an index scan by
> timestamp on the base relation and then treat comment_tsv @@
> plainto_tsquery('love') as a filter condition. Is this a selectivity
> estimation bug?

Doesn't look like it: the estimated number of matches is 253635, the
actual is 259828, which is really astonishingly close considering what we
have to work with. It's not clear, though, what fraction of the total that
represents.

			regards, tom lane
> If "love" is an uncommon word, there's no help for queries of this type
> being slow unless the GIN index can return the results in order. But if
> "love" is a common word, then it would be faster to do an index scan by
> timestamp on the base relation and then treat comment_tsv @@
> plainto_tsquery('love') as a filter condition. Is this a selectivity
> estimation bug?

If you have really lots of documents to index (and that seems to be the
case), perhaps you should consider Xapian. It is very easy to use
(although, of course, tsearch integrated in Postgres is easier still,
since you have nothing to install), and it is *incredibly* fast.

In my tests (2 years ago), with many gigabytes of stuff to search through,
the differences became obvious once the data set was much bigger than RAM:

- Postgres' fulltext was 10-100x faster than MySQL fulltext on searches
  (lol) (and a lot faster still on INSERTs...)
- Xapian was 10-100x faster than Postgres' fulltext.

(On a small table that fits in RAM, the differences are small.)

Of course, Xapian is not Postgres when you talk about update
concurrency.......... (single writer => a background job updating the
fulltext index is needed; a simple Python script does the job)
On Wed, Jul 29, 2009 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> If "love" is an uncommon word, there's no help for queries of this type
>> being slow unless the GIN index can return the results in order. But if
>> "love" is a common word, then it would be faster to do an index scan by
>> timestamp on the base relation and then treat comment_tsv @@
>> plainto_tsquery('love') as a filter condition. Is this a selectivity
>> estimation bug?
>
> Doesn't look like it: the estimated number of matches is 253635, the
> actual is 259828, which is really astonishingly close considering what
> we have to work with. It's not clear, though, what fraction of the total
> that represents.

Hmm, good point. It seems like it would be useful to force the planner
into the other plan and get EXPLAIN ANALYZE output for it for comparison
purposes, but off the top of my head I don't know how to do that.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> Hmm, good point. It seems like it would be useful to force the planner
> into the other plan and get EXPLAIN ANALYZE output for it for comparison
> purposes, but off the top of my head I don't know how to do that.

The standard way is:

begin;
drop index index_you_dont_want_used;
explain problem-query;
rollback;

Ain't transactional DDL wonderful?

(If this is a production system, you do have to worry about the DROP
transiently locking the table; but if you put the above in a script rather
than doing it by hand, it should be fast enough not to be a big problem.)

			regards, tom lane
On Wed, Jul 29, 2009 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ain't transactional DDL wonderful?

Yes. :-)

...Robert
I hate to be "that guy" but, Is this is still an issue 5 years later?? I can't seem to get Gin/btree to use my ORDER BY column with a LIMIT no matter what I try. My best idea was to cluster the database by the ORDER BY column and then just hope the index returns them in the order in the table... -- View this message in context: http://postgresql.1045698.n5.nabble.com/Full-text-search-with-ORDER-BY-performance-issue-tp2074171p5813083.html Sent from the PostgreSQL - performance mailing list archive at Nabble.com.