Query Issue with full-text search - Mailing list pgsql-general

From Karl Denninger
Subject Query Issue with full-text search
Date
Msg-id 4BE5948E.3030004@denninger.net
Whole thread Raw
List pgsql-general
This may better-belong in pgsql-sql but since it deals with a function
as opposed to raw SQL syntax I am sticking it here....

Consider the following DBMS schema slice....


                                     Table "public.post"
  Column   |           Type           |
Modifiers
-----------+--------------------------+--------------------------------------------------------
subject   | text                     |
 message   | text                     |
 ordinal   | integer                  | not null default
nextval('post_ordinal_seq'::regclass)



Indexes:
    "post_pkey" PRIMARY KEY, btree (ordinal)
    "idx_message" gin (to_tsvector('english'::text, message))
    "idx_subject" gin (to_tsvector('english'::text, subject))

(there are a bunch more indices and columns in the table, but these are
the ones in question)

Now let's run a couple of queries on this:

ticker=# explain analyze select * from post where to_tsvector('english',
message) @@ to_tsquery('violence') order by modified desc limit
100;
QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2456.32 rows=100 width=433) (actual
time=266.703..3046.310 rows=100 loops=1)
   ->  Index Scan Backward using post_modified on post
(cost=0.00..240400.00 rows=9787 width=433) (actual
time=266.698..3045.920 rows=100 loops=1)
         Filter: (to_tsvector('english'::text, message) @@
to_tsquery('violence'::text))
 Total runtime: 3046.565 ms
(4 rows)

Ok, not too bad, considering that the table contains close to 2 million
rows - ~3 seconds is pretty good.

Now let's try something that's NOT in the database:

ticker=# explain analyze select * from post where to_tsvector('english',
message) @@ to_tsquery('hoseface') order by modified desc limit
100;
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.

QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2456.32 rows=100 width=433) (actual
time=303350.036..303350.036 rows=0 loops=1)
   ->  Index Scan Backward using post_modified on post
(cost=0.00..240400.00 rows=9787 width=433) (actual
time=303350.031..303350.031 rows=0 loops=1)
         Filter: (to_tsvector('english'::text, message) @@
to_tsquery('hoseface'::text))
 Total runtime: 303350.079 ms
(4 rows)

This is **UNBELIEVABLY** slow; indeed, it appears to have done a
sequential scan of the entire table!

Why?

One would think that if the index lookup fails it fails - and would fail
FAST, returning no rows.  It appears that this is not the case, and the
system actually goes in and tries to look up the query off the message
contents, IGNORING the index!

That's not good for what should be obvious reasons..... is the "gin"
index type screwed up in some form or fashion?

This behavior is relatively new.  I'm running 8.4.3 and this started
happening some time before that - I believe it was an issue in 8.4.2,
but I KNOW it was not a problem when I was running 8.3.  The confounding
factor is that the table has grown rapidly and as such "not happening"
before might be more due to the table size than the software release -
of that I cannot be certain.

The other possibility is that the "NOTICE" results in some sort of flag
being set that tells the query processor to ignore the index and perform
a sequential scan IF there's a failure to match.  If this is the case I
will then have to write something to go through and find the offending
item and remove it.

-- Karl

Attachment

pgsql-general by date:

Previous
From: Paulo moraes
Date:
Subject: Invitation to connect on LinkedIn
Next
From: Rick Yorgason
Date:
Subject: Database design confusing pg_restore, and misc pg_restore issues