Thread: Queryplan within FTS/GIN index -search.

Queryplan within FTS/GIN index -search.

From
Jesper Krogh
Date:
Hi

My indexing base is now up to 7.5m documents, I have raise statistics
target to 1000 for the tsvector column in order to make the
query-planner choose more correctly. That works excellent.

Table structure is still:
ftstest=# \d ftsbody
                               Table "public.ftsbody"
      Column      |   Type   |                      Modifiers

------------------+----------+------------------------------------------------------
 id               | integer  | not null default
nextval('ftsbody_id_seq'::regclass)
 body             | text     | not null default ''::text
 ftsbody_body_fts | tsvector |
Indexes:
    "ftsbody_body_md5" UNIQUE, btree (md5(body))
    "ftsbody_id_pri_idx" UNIQUE, btree (id)
    "ftsbody_tfs_idx" gin (ftsbody_body_fts)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON uniprot FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger('ftsbody_body_fts',
'pg_catalog.english', 'body')


I'm searching the gin-index for 1-5 terms, where all of them matches the
same document. TERM1 is unique by itself, TERM2 is a bit more common (52
rows), TERM3 more common, TERM4 close to all and TERM5 all records.

Just quering for a unique value and add in several values that match
everything makes the run-time go significantly up.

I somehow would expect the index-search to take advantage of the MCV's
informations in the statistics that sort of translate it into a search
and post-filtering (as PG's queryplanner usually does at the SQL-level).

                                                              QUERY PLAN


--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual time=3.726..3.729
rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual
time=3.722..3.723 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1
width=751) (actual time=3.700..3.702 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 &
TERM2'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx
(cost=0.00..100.42 rows=1 width=0) (actual time=3.683..3.683 rows=1 loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1
& TERM2'::text))
 Total runtime: 3.790 ms
(9 rows)

                                                                QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual
time=850.017..850.020 rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual
time=850.013..850.015 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1
width=751) (actual time=849.991..849.993 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 &
TERM2 & TERM3'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx
(cost=0.00..100.42 rows=1 width=0) (actual time=849.970..849.970 rows=1
loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1
& TERM2 & TERM3'::text))
 Total runtime: 850.084 ms
(9 rows)

                                                                 QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual
time=1152.065..1152.068 rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual
time=1152.061..1152.062 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1
width=751) (actual time=1152.039..1152.041 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 &
TERM2 & TERM3 & TERM4'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx
(cost=0.00..100.42 rows=1 width=0) (actual time=1152.020..1152.020
rows=1 loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1
& TERM2 & TERM3 & TERM4'::text))
 Total runtime: 1152.129 ms
(9 rows)

                                                                 QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual
time=1509.043..1509.046 rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual
time=1509.040..1509.040 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1
width=751) (actual time=1509.018..1509.020 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 &
TERM2 & TERM3 & TERM4 & TERM5'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx
(cost=0.00..100.42 rows=1 width=0) (actual time=1508.998..1508.998
rows=1 loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1
& TERM2 & TERM3 & TERM4 & TERM5'::text))
 Total runtime: 1509.109 ms
(9 rows)

Can (perhaps more readable) be found at http://krogh.cc/~jesper/test.out

Can this be optimized? (I cannot really prevent users from typing stuff
in that are common).

--
Jesper

Re: Queryplan within FTS/GIN index -search.

From
"Kevin Grittner"
Date:
Jesper Krogh <jesper@krogh.cc> wrote:

> I'm searching the gin-index for 1-5 terms, where all of them matches
> the same document. TERM1 is unique by itself, TERM2 is a bit more
> common (52 rows), TERM3 more common, TERM4 close to all and TERM5
> all records.

>                Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1
> & TERM2 & TERM3 & TERM4 & TERM5'::text))
>                ->  Bitmap Index Scan on ftsbody_tfs_idx
> (cost=0.00..100.42 rows=1 width=0) (actual time=1508.998..1508.998
> rows=1 loops=1)
>                      Index Cond: (ftsbody_body_fts @@
> to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
>  Total runtime: 1509.109 ms

> Can this be optimized? (I cannot really prevent users from typing
> stuff in that are common).

I've wondered that myself.  Perhaps a term which is ANDed with others
and is too common could be dropped from the Index Cond and just left
in the Recheck Cond?

-Kevin

Re: Queryplan within FTS/GIN index -search.

From
Jeff Davis
Date:
On Thu, 2009-10-22 at 18:28 +0200, Jesper Krogh wrote:
> I somehow would expect the index-search to take advantage of the MCV's
> informations in the statistics that sort of translate it into a search
> and post-filtering (as PG's queryplanner usually does at the SQL-level).

MCVs are full values that are found in columns or indexes -- you aren't
likely to have two entire documents that are exactly equal, so MCVs are
useless in your example.

I believe that stop words are a more common way of accomplishing what
you want to do, but they are slightly more limited: they won't be
checked at any level, and so are best used for truly common words like
"and". From your example, I assume that you still want the word checked,
but it's not selective enough to be usefully checked by the index.

In effect, what you want are words that aren't searched (or stored) in
the index, but are included in the tsvector (so the RECHECK still
works). That sounds like it would solve your problem and it would reduce
index size, improve update performance, etc. I don't know how difficult
it would be to implement, but it sounds reasonable to me.

The only disadvantage is that it's more metadata to manage -- all of the
existing data like dictionaries and stop words, plus this new "common
words". Also, it would mean that almost every match requires RECHECK. It
would be interesting to know how common a word needs to be before it's
better to leave it out of the index.

Regards,
    Jeff Davis


Re: Queryplan within FTS/GIN index -search.

From
Jesper Krogh
Date:
Jeff Davis wrote:
> On Thu, 2009-10-22 at 18:28 +0200, Jesper Krogh wrote:
>> I somehow would expect the index-search to take advantage of the MCV's
>> informations in the statistics that sort of translate it into a search
>> and post-filtering (as PG's queryplanner usually does at the SQL-level).
>
> MCVs are full values that are found in columns or indexes -- you aren't
> likely to have two entire documents that are exactly equal, so MCVs are
> useless in your example.

According to my testing, this is not the case and if it was the case,
the queryplanner most likely wouldn't be able to plan this query correct:
select id from ftstable where tsvectorcol @@ to_tsquery('commonterm')
order by id limit 10;
(into a index-scan on ID
and
select id from ftstable where tsvectorcol @@ to_tsquery('rareterm');
into a bitmap index scan on the tsvectorcol and a subsequent sort.

This is indeed information on individual terms from the statistics that
enable this.

> I believe that stop words are a more common way of accomplishing what
> you want to do, but they are slightly more limited: they won't be
> checked at any level, and so are best used for truly common words like
> "and". From your example, I assume that you still want the word checked,
> but it's not selective enough to be usefully checked by the index.

the terms are really common non-stop-words.

> In effect, what you want are words that aren't searched (or stored) in
> the index, but are included in the tsvector (so the RECHECK still
> works). That sounds like it would solve your problem and it would reduce
> index size, improve update performance, etc. I don't know how difficult
> it would be to implement, but it sounds reasonable to me.
>
> The only disadvantage is that it's more metadata to manage -- all of the
> existing data like dictionaries and stop words, plus this new "common
> words". Also, it would mean that almost every match requires RECHECK. It
> would be interesting to know how common a word needs to be before it's
> better to leave it out of the index.

That sounds like it could require an index rebuild if the distribution
changes?

That would be another plan to pursue, but the MCV is allready there
:
ftstest=# select * from ftsbody;
 id |                     body                     |
ftsbody_body_fts
----+----------------------------------------------+-------------------------------------------------
  1 | the cat is not a rat uniqueterm1 uniqueterm2 | 'cat':2 'rat':6
'uniqueterm1':7 'uniqueterm2':8
  2 | elephant uniqueterm1 uniqueterm2             | 'eleph':1
'uniqueterm1':2 'uniqueterm2':3
  3 | cannon uniqueterm1 uniqueterm2               | 'cannon':1
'uniqueterm1':2 'uniqueterm2':3
(3 rows)

ftstest=# select most_common_vals, most_common_freqs from pg_stats where
tablename = 'ftsbody' and attname = 'ftsbody_body_fts';
     most_common_vals      | most_common_freqs
---------------------------+-------------------
 {uniqueterm1,uniqueterm2} | {1,1,1,1}
(1 row)

And the query-planner uses this information.

--
Jesper.


Re: Queryplan within FTS/GIN index -search.

From
Jeff Davis
Date:
On Fri, 2009-10-23 at 07:18 +0200, Jesper Krogh wrote:
> This is indeed information on individual terms from the statistics that
> enable this.

My mistake, I didn't know it was that smart about it.

> > In effect, what you want are words that aren't searched (or stored) in
> > the index, but are included in the tsvector (so the RECHECK still
> > works). That sounds like it would solve your problem and it would reduce
> > index size, improve update performance, etc. I don't know how difficult
> > it would be to implement, but it sounds reasonable to me.


> That sounds like it could require an index rebuild if the distribution
> changes?

My thought was that the common words could be declared to be common the
same way stop words are. As long as words are only added to this list,
it should be OK.

> That would be another plan to pursue, but the MCV is allready there

The problem with MCVs is that the index search can never eliminate
documents because they don't contain a match, because it might contain a
match that was previously an MCV, but is no longer.

Also, MCVs are relatively few -- you only get ~1000 or so. There might
be a lot of common words you'd like to track.

Perhaps ANALYZE can automatically add the common words above some
frequency threshold to the list?

Regards,
    Jeff Davis


Re: Queryplan within FTS/GIN index -search.

From
jesper@krogh.cc
Date:
> On Fri, 2009-10-23 at 07:18 +0200, Jesper Krogh wrote:
>> > In effect, what you want are words that aren't searched (or stored) in
>> > the index, but are included in the tsvector (so the RECHECK still
>> > works). That sounds like it would solve your problem and it would
>> reduce
>> > index size, improve update performance, etc. I don't know how
>> difficult
< > it would be to implement, but it sounds reasonable to me.
>
>> That sounds like it could require an index rebuild if the distribution
>> changes?
>
> My thought was that the common words could be declared to be common the
> same way stop words are. As long as words are only added to this list,
> it should be OK.
>
>> That would be another plan to pursue, but the MCV is allready there
>
> The problem with MCVs is that the index search can never eliminate
> documents because they don't contain a match, because it might contain a
> match that was previously an MCV, but is no longer.

No, it definately has to go visit the index/table to confirm findings, but
that why I wrote Queryplan in the subject line, because this os only about
the strategy to pursue to obtain the results. And a strategy about
limiting the amout of results as early as possible (as PG usually does)
would be what I'd expect and MCV can help it guess on that.

Similar finding, rewrite the query: (now i took the extreme and made
"raretem" a spellingerror), so result is 0.

ftstest=# explain analyze select body from ftsbody where ftsbody_body_fts
@@ to_tsquery('commonterm & spellerror') limit 100;
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=132.63..188.89 rows=28 width=739) (actual
time=862.714..862.714 rows=0 loops=1)
   ->  Bitmap Heap Scan on ftsbody  (cost=132.63..188.89 rows=28
width=739) (actual time=862.711..862.711 rows=0 loops=1)
         Recheck Cond: (ftsbody_body_fts @@ to_tsquery('commonterm &
spellerror'::text))
         ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..132.62
rows=28 width=0) (actual time=862.702..862.702 rows=0 loops=1)
               Index Cond: (ftsbody_body_fts @@ to_tsquery('commonterm &
spellerror'::text))
 Total runtime: 862.771 ms
(6 rows)

ftstest=# explain analyze select body from ftsbody where ftsbody_body_fts
@@ to_tsquery('commonterm') and ftsbody_body_fts @@
to_tsquery('spellerror') limit 100;
                                                             QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=132.70..189.11 rows=28 width=739) (actual time=8.669..8.669
rows=0 loops=1)
   ->  Bitmap Heap Scan on ftsbody  (cost=132.70..189.11 rows=28
width=739) (actual time=8.665..8.665 rows=0 loops=1)
         Recheck Cond: ((ftsbody_body_fts @@
to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
to_tsquery('spellerror'::text)))
         ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..132.70
rows=28 width=0) (actual time=8.658..8.658 rows=0 loops=1)
               Index Cond: ((ftsbody_body_fts @@
to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
to_tsquery('spellerror'::text)))
 Total runtime: 8.724 ms
(6 rows)

So getting them with AND inbetween gives x100 better performance. All
queries are run on "hot disk" repeated 3-5 times and the number are from
the last run, so disk-read effects should be filtered away.

Shouldn't it somehow just do what it allready are capable of doing?

--
Jesper


Re: Queryplan within FTS/GIN index -search.

From
Richard Huxton
Date:
jesper@krogh.cc wrote:
>
> So getting them with AND inbetween gives x100 better performance. All
> queries are run on "hot disk" repeated 3-5 times and the number are from
> the last run, so disk-read effects should be filtered away.
>
> Shouldn't it somehow just do what it allready are capable of doing?

I'm guessing to_tsquery(...) will produce a tree of search terms (since
it allows for quite complex expressions). Presumably there's a standard
order it gets processed in too, so it should be possible to generate a
more or less efficient ordering.

That structure isn't exposed to the planner though, so it doesn't
benefit from any re-ordering the planner would normally do for normal
(exposed) AND/OR clauses.

Now, to_tsquery() can't re-order the search terms because it doesn't
know what column it's being compared against. In fact, it might not be a
simple column at all.

So - there would either need to be:
1. Some hooks from the planner to reach into the tsquery datatype.
2. A variant to_tsquery_with_sorting() which would take the column-name
or something and look up the stats to work against.

#1 is the better solution, but #2 might well be simpler to implement as
a work-around for now.
--
  Richard Huxton
  Archonet Ltd

Re: Queryplan within FTS/GIN index -search.

From
jesper@krogh.cc
Date:
> jesper@krogh.cc wrote:
>>
>> So getting them with AND inbetween gives x100 better performance. All
>> queries are run on "hot disk" repeated 3-5 times and the number are from
>> the last run, so disk-read effects should be filtered away.
>>
>> Shouldn't it somehow just do what it allready are capable of doing?
>
> I'm guessing to_tsquery(...) will produce a tree of search terms (since
> it allows for quite complex expressions). Presumably there's a standard
> order it gets processed in too, so it should be possible to generate a
> more or less efficient ordering.
>
> That structure isn't exposed to the planner though, so it doesn't
> benefit from any re-ordering the planner would normally do for normal
> (exposed) AND/OR clauses.
>
> Now, to_tsquery() can't re-order the search terms because it doesn't
> know what column it's being compared against. In fact, it might not be a
> simple column at all.

I cant follow this logic based on explain output, but I may have
misunderstood something. The only difference in these two query-plans is
that we have an additional or'd term in the to_tsquery().

What we see is that, the query-planner indeed has knowledge about changes
in the row estimates based on changes in the query to to_tsquery(). My
guess is that it is because to_tsquery actually parses the query and give
the estimates, now how can to_tsquery give estimates without having access
to the statistics for the column?

ftstest=# explain select id from ftsbody where ftsbody_body_fts @@
to_tsquery('reallyrare');
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on ftsbody  (cost=132.64..190.91 rows=29 width=4)
   Recheck Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare'::text))
   ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..132.63 rows=29
width=0)
         Index Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare'::text))
(4 rows)

ftstest=# explain select id from ftsbody where ftsbody_body_fts @@
to_tsquery('reallyrare | morerare');
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftsbody  (cost=164.86..279.26 rows=57 width=4)
   Recheck Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare |
morerare'::text))
   ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..164.84 rows=57
width=0)
         Index Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare |
morerare'::text))
(4 rows)

ftstest=# explain select id from ftsbody where ftsbody_body_fts @@
to_tsquery('reallyrare | reallycommon');
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on ftsbody  (cost=0.00..1023249.39 rows=5509293 width=4)
   Filter: (ftsbody_body_fts @@ to_tsquery('reallyrare |
reallycommon'::text))
(2 rows)


> 2. A variant to_tsquery_with_sorting() which would take the column-name
> or something and look up the stats to work against.

Does above not seem like its there allready?

(sorry.. looking at C-code from my point of view would set me a couple of
weeks back, so I have troble getting closer to the answer than
interpreting the output and guessing the rest).

--
Jesper


Re: Queryplan within FTS/GIN index -search.

From
Richard Huxton
Date:
jesper@krogh.cc wrote:
>> That structure isn't exposed to the planner though, so it doesn't
>> benefit from any re-ordering the planner would normally do for normal
>> (exposed) AND/OR clauses.
>>
>> Now, to_tsquery() can't re-order the search terms because it doesn't
>> know what column it's being compared against. In fact, it might not be a
>> simple column at all.
>
> I cant follow this logic based on explain output, but I may have
> misunderstood something. The only difference in these two query-plans is
> that we have an additional or'd term in the to_tsquery().

Hmm - I've had a poke through the source. I've slightly misled you...

> What we see is that, the query-planner indeed has knowledge about changes
> in the row estimates based on changes in the query to to_tsquery().

Yes, new in 8.4 - sorry, thought that hadn't made it in.

The two plan-nodes in question are in:
  backend/executor/nodeBitmapIndexscan.c
  backend/executor/nodeBitmapHeapscan.c
The actual tsearch stuff is in
  src/backend/utils/adt/ts*.c

It looks like TS_execute (tsvector_op.c) is the bit of code that handles
the tsquery tree. That uses a callback to actually check values
(checkcondition_gin). The gin_extract_tsquery function is presumably the
extractQuery function as described in the manuals (Ch 52).

So, I'm guessing you would want to do is generate a reduced query tree
for the indexscan (A & B & C => A if A is an uncommon word) and use the
full query tree for the heap check. Now, what isn't clear to me on first
glance is how to determine which phase of the bitmap scan we are in.

HTH

Just checking, because I don't think it's useful in this case. But, you
don know about "gin_fuzzy_search_limit"?

--
  Richard Huxton
  Archonet Ltd

Re: Queryplan within FTS/GIN index -search.

From
Jeff Davis
Date:
On Fri, 2009-10-23 at 09:26 +0100, Richard Huxton wrote:
> That structure isn't exposed to the planner though, so it doesn't
> benefit from any re-ordering the planner would normally do for normal
> (exposed) AND/OR clauses.

I don't think that explains it, because in the second plan you only see
a single index scan with two quals:

   Index Cond: ((ftsbody_body_fts @@
     to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
     to_tsquery('spellerror'::text)))

So it's entirely up to GIN how to execute that.

Regards,
    Jeff Davis


Re: Queryplan within FTS/GIN index -search.

From
Jeff Davis
Date:
On Fri, 2009-10-23 at 09:45 +0200, jesper@krogh.cc wrote:
> No, it definately has to go visit the index/table to confirm findings, but
> that why I wrote Queryplan in the subject line, because this os only about
> the strategy to pursue to obtain the results. And a strategy about
> limiting the amout of results as early as possible (as PG usually does)
> would be what I'd expect and MCV can help it guess on that.

I see what you're saying: you could still index the common terms like
normal, but just not look for anything in the index if it's an MCV. That
sounds reasonable, based on the numbers you provided.

>                Index Cond: (ftsbody_body_fts @@ to_tsquery('commonterm &
> spellerror'::text))
>  Total runtime: 862.771 ms
> (6 rows)

...

>                Index Cond: ((ftsbody_body_fts @@
> to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
> to_tsquery('spellerror'::text)))
>  Total runtime: 8.724 ms
> (6 rows)
>

Something seems strange here. Both are a single index scan, but having a
single complex search key is worse than having two simple search keys.

Perhaps the real problem is that there's a difference between these
cases at all? I don't see any reason why the first should be more
expensive than the second.

Regards,
    Jeff Davis


Re: Queryplan within FTS/GIN index -search.

From
Richard Huxton
Date:
Jeff Davis wrote:
> On Fri, 2009-10-23 at 09:26 +0100, Richard Huxton wrote:
>> That structure isn't exposed to the planner though, so it doesn't
>> benefit from any re-ordering the planner would normally do for normal
>> (exposed) AND/OR clauses.
>
> I don't think that explains it, because in the second plan you only see
> a single index scan with two quals:
>
>    Index Cond: ((ftsbody_body_fts @@
>      to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
>      to_tsquery('spellerror'::text)))
>
> So it's entirely up to GIN how to execute that.

http://www.postgresql.org/docs/8.4/static/gin-extensibility.html
Datum *extractQuery(...)
Returns an array of keys given a value to be queried; that is, query is
the value on the right-hand side of an indexable operator whose
left-hand side is the indexed column

So - that is presumably two separate arrays of keys being matched
against, and the AND means if the first fails it'll never check the second.

What I'm not sure about is if tsquery('commonterm & spellerror')
produces two sets of keys or if it just produces one.

--
  Richard Huxton
  Archonet Ltd

Re: Queryplan within FTS/GIN index -search.

From
Jesper Krogh
Date:
Jeff Davis wrote:
> On Fri, 2009-10-23 at 09:45 +0200, jesper@krogh.cc wrote:
>> No, it definately has to go visit the index/table to confirm findings, but
>> that why I wrote Queryplan in the subject line, because this os only about
>> the strategy to pursue to obtain the results. And a strategy about
>> limiting the amout of results as early as possible (as PG usually does)
>> would be what I'd expect and MCV can help it guess on that.
>
> I see what you're saying: you could still index the common terms like
> normal, but just not look for anything in the index if it's an MCV. That
> sounds reasonable, based on the numbers you provided.

I'm not sure if thats what I'm saying. If i should rephrase it then:
Given an AND operator (which translates into an intersection of the left
and right side), then it should go for the side with the least expected
results (SetLeast) and subsequent use the other expression for
processing only that set.

>>                Index Cond: (ftsbody_body_fts @@ to_tsquery('commonterm &
>> spellerror'::text))
>>  Total runtime: 862.771 ms
>> (6 rows)
>
> ...
>
>>                Index Cond: ((ftsbody_body_fts @@
>> to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
>> to_tsquery('spellerror'::text)))
>>  Total runtime: 8.724 ms
>> (6 rows)
>>
>
> Something seems strange here. Both are a single index scan, but having a
> single complex search key is worse than having two simple search keys.
>
> Perhaps the real problem is that there's a difference between these
> cases at all? I don't see any reason why the first should be more
> expensive than the second.



--
Jesper

Re: Queryplan within FTS/GIN index -search.

From
Jeff Davis
Date:
On Fri, 2009-10-23 at 17:27 +0100, Richard Huxton wrote:
> Returns an array of keys given a value to be queried; that is, query is
> the value on the right-hand side of an indexable operator whose
> left-hand side is the indexed column
>
> So - that is presumably two separate arrays of keys being matched
> against, and the AND means if the first fails it'll never check the second.

My point was that if it's only one index scan in both cases, then GIN
should have the same information in both cases, right? So why are they
being treated differently?

I must be missing something.

Regards,
    Jeff Davis


Re: Queryplan within FTS/GIN index -search.

From
Jesper Krogh
Date:
Hi.

I've now got a test-set that can reproduce the problem where the two
fully equivalent queries (
body_fts @@ to_tsquery("commonterm & nonexistingterm")
 and
body_fts @@ to_tsquery("coomonterm") AND body_fts @@
to_tsquery("nonexistingterm")

give a difference of x300 in execution time. (grows with
document-base-size).

this can now be reproduced using:

* http://krogh.cc/~jesper/fts-queryplan.pl  and
http://krogh.cc/~jesper/words.txt

It build up a table with 200.000 documents where "commonterm" exists in
all of them. "nonexistingterm" is in 0.

To get the query-planner get a "sane" query I need to do a:
ftstest# set enable_seqscan=off

Then:
 ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('nonexistingterm & commonterm');
                                                           QUERY PLAN


--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=5563.09..7230.93 rows=1000 width=4)
(actual time=30.861..30.861 rows=0 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('nonexistingterm &
commonterm'::text))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..5562.84
rows=1000 width=0) (actual time=30.856..30.856 rows=0 loops=1)
         Index Cond: (body_fts @@ to_tsquery('nonexistingterm &
commonterm'::text))
 Total runtime: 30.907 ms
(5 rows)

ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm');
                                                          QUERY PLAN


------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=5565.59..7238.43 rows=1000 width=4)
(actual time=0.059..0.059 rows=0 loops=1)
   Recheck Cond: ((body_fts @@ to_tsquery('nonexistingterm'::text)) AND
(body_fts @@ to_tsquery('commonterm'::text)))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..5565.34
rows=1000 width=0) (actual time=0.057..0.057 rows=0 loops=1)
         Index Cond: ((body_fts @@ to_tsquery('nonexistingterm'::text))
AND (body_fts @@ to_tsquery('commonterm'::text)))
 Total runtime: 0.111 ms
(5 rows)


Run repeatedly to get a full memory recident dataset.

In this situation the former query end up being 300x slower than the
latter allthough they are fully equivalent.



--
Jesper

Re: Queryplan within FTS/GIN index -search.

From
Tom Lane
Date:
Jesper Krogh <jesper@krogh.cc> writes:
> I've now got a test-set that can reproduce the problem where the two
> fully equivalent queries (
> body_fts @@ to_tsquery("commonterm & nonexistingterm")
>  and
> body_fts @@ to_tsquery("coomonterm") AND body_fts @@
> to_tsquery("nonexistingterm")
> give a difference of x300 in execution time. (grows with
> document-base-size).

I looked into this a bit.  It seems the reason the first is much slower
is that the AND nature of the query is not exposed to the GIN control
logic (ginget.c).  It has to fetch every index-entry combination that
involves any of the terms, which of course is going to be the whole
index in this case.  This is obvious when you realize that the control
logic doesn't know the difference between tsqueries "commonterm &
nonexistingterm" and "commonterm | nonexistingterm".  The API for
opclass extractQuery functions just isn't powerful enough to show that.

I think a possible solution to this could involve allowing extractQuery
to mark individual keys as "required" or "optional".  Then the control
logic could know not to bother with combinations that haven't got all
the "required" keys.  There might be other better answers though.

But having said that, this particular test case is far from compelling.
Any sane text search application is going to try to filter out
common words as stopwords; it's only the failure to do that that's
making this run slow.

            regards, tom lane

Re: Queryplan within FTS/GIN index -search.

From
Jesper Krogh
Date:
Tom Lane wrote:
> But having said that, this particular test case is far from compelling.
> Any sane text search application is going to try to filter out
> common words as stopwords; it's only the failure to do that that's
> making this run slow.

Below is tests-runs not with a "commonterm" but and 80% term and a 60%
term.

There are two issues in this, one is the way PG "blows up" when
searching for a stop-word (and it even performs excellent when searching
for a term in the complete doc-base):

ftstest=# select id from ftstest where body_fts @@
to_tsquery('commonterm') limit 10;
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

Time: 1.004 ms
ftstest=# select id from ftstest where body_fts @@ to_tsquery('the')
limit 10;
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored
 id
----
(0 rows)

Time: 0.587 ms

I can definetely effort the index-size for getting the first behavior to
my application. Stop words will first be really useful when searches for
them translates into full results not errors.

I also think you're trying to limit the scope of the problem more than
whats fair.

ftstest=# select id from ftstest where body_fts @@
to_tsquery('nonexistingterm & commonterm');
 id
----
(0 rows)

Time: 28.230 ms
ftstest=# select id from ftstest where body_fts @@
to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm');
 id
----
(0 rows)

Time: 0.930 ms
(so explain analyze is not a fair measurement .. it seems to make the
problem way worse). This is "only" x28
Time: 22.432 ms
ftstest=# select id from ftstest where body_fts @@
to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm80');
 id
----
(0 rows)

Time: 0.992 ms
ftstest=# select id from ftstest where body_fts @@
to_tsquery('nonexistingterm & commonterm80');
 id
----
(0 rows)

Time: 22.393 ms
ftstest=#
And for a 80% term .. x23

ftstest=# select id from ftstest where body_fts @@
to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm60');
 id
----
(0 rows)

Time: 0.954 ms
ftstest=# select id from ftstest where body_fts @@
to_tsquery('nonexistingterm & commonterm60');
 id
----
(0 rows)

Time: 17.006 ms

and x17

Just trying to say that the body of the problem isn't a discussion about
stop-words.

That being said, if you coin the term "stopword" to mean "any term that
exists in all or close to all documents" then the way it behaves when
searching for only one of them is a situation that we'll hit all the
time. (when dealing with user typed input).

Jesper
--
Jesper

Re: Queryplan within FTS/GIN index -search.

From
Greg Stark
Date:
On Fri, Oct 30, 2009 at 8:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But having said that, this particular test case is far from compelling.
> Any sane text search application is going to try to filter out
> common words as stopwords; it's only the failure to do that that's
> making this run slow.

Well it would be nice if that wasn't necessary. There are plenty of
applications where that isn't really an option. Consider searching for
phrases like "The The" or "The Office". The sanity of doing this is
purely a function of implementation quality and not of actual user
interface design.


--
greg

Re: Queryplan within FTS/GIN index -search.

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Any sane text search application is going to try to filter out
> common words as stopwords; it's only the failure to do that that's
> making this run slow.

Imagine a large table with a GIN index on a tsvector.  The user wants
a particular document, and is sure four words are in it.  One of them
only appears in 100 documents.  The other three each appear in about
a third of the documents.  Is it more sane to require the user to
wait for a table scan or to make them wade through 100 rows rather
than four?

I'd rather have the index used for the selective test, and apply the
remaining tests to the rows retrieved from the heap.

-Kevin

Re: Queryplan within FTS/GIN index -search.

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Any sane text search application is going to try to filter out
>> common words as stopwords; it's only the failure to do that that's
>> making this run slow.

> I'd rather have the index used for the selective test, and apply the
> remaining tests to the rows retrieved from the heap.

Uh, that was exactly my point.  Indexing common words is a waste.

            regards, tom lane

Re: Queryplan within FTS/GIN index -search.

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Any sane text search application is going to try to filter out
>>> common words as stopwords; it's only the failure to do that that's
>>> making this run slow.
>
>> I'd rather have the index used for the selective test, and apply
>> the remaining tests to the rows retrieved from the heap.
>
> Uh, that was exactly my point.  Indexing common words is a waste.

Perhaps I'm missing something.  My point was that there are words
which are too common to be useful for index searches, yet uncommon
enough to usefully limit the results.  These words could typically
benefit from tsearch2 style parsing and dictionaries; so declaring
them as stop words would be bad from a functional perspective, yet
searching an index for them would be bad from a performance
perspective.

One solution would be for the users to rigorously identify all of
these words, include them on one stop word list but not another,
include *two* tsvector columns in the table (with and without the
"iffy" words), index only the one with the larger stop word list, and
generate two tsquery values to search the two different columns.  Best
of both worlds.  Sort of.  The staff time to create and maintain such
a list would obviously be costly and writing the queries would be
error-prone.

Second best would be to somehow recognize the "iffy" words and exclude
them from the index and the index search phase, but apply the check
when the row is retrieved from the heap.  I really have a hard time
seeing how the conditional exclusion from the index could be
accomplished, though.  Next best would be to let them fall into the
index, but exclude top level ANDed values from the index search,
applying them only to the recheck when the row is read from the heap.
The seems, at least conceptually, like it could be done.

-Kevin

Re: Queryplan within FTS/GIN index -search.

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Perhaps I'm missing something.  My point was that there are words
> which are too common to be useful for index searches, yet uncommon
> enough to usefully limit the results.  These words could typically
> benefit from tsearch2 style parsing and dictionaries; so declaring
> them as stop words would be bad from a functional perspective, yet
> searching an index for them would be bad from a performance
> perspective.

Right, but the original complaint in this thread was that a GIN index is
slow about searching for very common terms.  The answer to that clearly
is to not index common terms, rather than worry about making the case
a bit faster.

It may well be that Jesper's identified a place where the GIN code could
be improved --- it seems like having the top-level search logic be more
aware of the AND/OR structure of queries would be useful.  But the
particular example shown here doesn't make a very good case for that,
because it's hard to tell how much of a penalty would be taken in more
realistic examples.

            regards, tom lane

Re: Queryplan within FTS/GIN index -search.

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The answer to that clearly is to not index common terms

My understanding is that we don't currently get statistics on how
common the terms in a tsvector column are until we ANALYZE the *index*
created from it.  Seems like sort of a Catch 22.  Also, if we exclude
words which are in the tsvector from the index on the tsvector, we
need to know what words were excluded so we know not to search on them
as well as forcing the recheck of the full tsquery (unless this always
happens already?).

> It may well be that Jesper's identified a place where the GIN code
> could be improved

My naive assumption has been that it would be possible to get an
improvement without touching the index logic, by changing this part of
the query plan:

                     Index Cond: (ftsbody_body_fts @@ to_tsquery
('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))

to something like this:

                     Index Cond: (ftsbody_body_fts @@ to_tsquery
('TERM1'::text))

and count on this doing the rest:

               Recheck Cond: (ftsbody_body_fts @@ to_tsquery
('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))

I'm wondering if anyone has ever confirmed that probing for the more
frequent term through the index is *ever* a win, versus using the
index for the most common of the top level AND conditions and doing
the rest on recheck.  That seems like a dangerous assumption from
which to start.

> But the particular example shown here doesn't make a very good case
> for that, because it's hard to tell how much of a penalty would be
> taken in more realistic examples.

Fair enough.  We're in the early stages of moving to tsearch2 and I
haven't run across this yet in practice.  If I do, I'll follow up.

-Kevin

Re: Queryplan within FTS/GIN index -search.

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> I'm wondering if anyone has ever confirmed that probing for the more
> frequent term through the index is *ever* a win, versus using the
> index for the most common of the top level AND conditions and doing
> the rest on recheck.

s/most/least/

-Kevin

Re: Queryplan within FTS/GIN index -search.

From
Jesper Krogh
Date:
Tom Lane wrote:
> It may well be that Jesper's identified a place where the GIN code could
> be improved --- it seems like having the top-level search logic be more
> aware of the AND/OR structure of queries would be useful.  But the
> particular example shown here doesn't make a very good case for that,
> because it's hard to tell how much of a penalty would be taken in more
> realistic examples.

With a term sitting in:
80% of the docs the penalty is: x23
60% of the docs the penalty is: x17
40% of the docs the penalty is: x13
of doing
vectorcol @@ ts_query('term & commonterm')
compared to
vectorcol @@ ts_query('term) and vectorcol @@ ts_query('commonterm');
where term is non-existing (or rare).

(in query execution performance on a fully memory recident dataset,
doing test with "drop_caches" and restart pg to simulate a dead disk the
numbers are a bit higher).

http://article.gmane.org/gmane.comp.db.postgresql.performance/22496/match=

Would you ever quantify a term sitting in 60-80% as a stop-word candidate?

I dont know if x13 in execution performance is worth hunting or there
are lower hanging fruits sitting in the fts-search-system.

This is essentially the penalty the user will get for adding a terms to
their search that rarely restricts the results.

In term of the usual "set theory" that databases work in, a search for a
stop-word translated into the full set. This is just not the case in
where it throws a warning and returns the empty set. This warning can be
caught by application code to produce the "correct" result to the users,
but just slightly more complex queries dont do this:

ftstest=# select id from ftstest where body_fts @@ to_tsquery('random |
the') limit 10;
 id
----
(0 rows)

Here I would have expected the same error.. I basically have to hook in
the complete stop-word dictionary in a FTS-preparser to give the user
the expected results or have I missed a feature somwhere?

My reason for not pushing "commonterms" into the stopword list is that
they actually perform excellent in PG.

Same body as usual, but commonterm99 is sitting in 99% of the documents.

ftstest=# set enable_seqscan=off;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('commonterm99');
                                                                QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=1051476.74..1107666.07 rows=197887
width=4) (actual time=51.036..121.348 rows=197951 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('commonterm99'::text))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..1051427.26
rows=197887 width=0) (actual time=49.602..49.602 rows=197951 loops=1)
         Index Cond: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 147.350 ms
(5 rows)

ftstest=# set enable_seqscan=on;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('commonterm99');
                                                    QUERY PLAN

------------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..56744.00 rows=197887 width=4) (actual
time=0.086..7134.384 rows=197951 loops=1)
   Filter: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 7194.182 ms
(3 rows)



So in order to get the result with a speedup of more than x50 I simply
cannot add these terms to the stop-words because then the first query
would resolve to an error and getting results would then be up to the
second query.

My bet is that doing a seq_scan will "never" be beneficial for this type
of query.

As far as I can see the only consequence of simply not remove stop-words
at all is a (fairly small) increase in index-size. It seems to me that
stop-words were invented when it was hard to get more than 2GB of memory
into a computer to get the index-size reduced to a size that better
could fit into memory. But nowadays it seems like the downsides are hard
to see?

Jesper
--
Jesper

Re: Queryplan within FTS/GIN index -search.

From
"Kevin Grittner"
Date:
I wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> But the particular example shown here doesn't make a very good case
>> for that, because it's hard to tell how much of a penalty would be
>> taken in more realistic examples.
>
> Fair enough.  We're in the early stages of moving to tsearch2 and I
> haven't run across this yet in practice.  If I do, I'll follow up.

We have a staging database which allowed some limited testing quickly.
While it's real production data, we haven't been gathering this type
of data long, so it's got relatively few rows; therefore, it wasn't
feasible to try any tests which would be disk-bound, so I primed the
cache for all of these, and they are all totally served from cache.
For various reasons which I'll omit unless asked, we do our text
searches through functions which take a "selection string", turn it
into a tsquery with a little extra massaging on our part, run the
query with a minimum ranking to return, and return a set of records
ordered by the ranking in descending sequence.

Under these conditions there is a slight performance gain in adding an
additional test which matches 1356 out of 1691 rows.  Not surprisingly
for a fully cached query set, timings were very consistent from run to
run.  While undoubtedly a little unusual in approach, this is
production software run against real-world data.  I confirmed that it
is using the GIN index on the tsvector for these runs.

By the way, the tsearch2 features have been received very well so
far.  One of the first reactions from most users is surprise at how
fast it is.  :-)  Anyway, our production results don't confirm the
issue shown with the artificial test data.


scca=> select count(*) from "DocThumbnail" where "text" is not null;
 count
-------
  1691
(1 row)

Time: 0.619 ms


scca=> select count(*) from (select "DocThumbnail_text_rank"('guardian
ad litem', 0.1)) x;
 count
-------
    41
(1 row)

Time: 19.394 ms


scca=> select count(*) from (select "DocThumbnail_text_rank"('guardian
ad litem attorney', 0.1)) x;
 count
-------
     4
(1 row)

Time: 16.434 ms


scca=> select count(*) from (select
"DocThumbnail_text_rank"('attorney', 0.1)) x;
 count
-------
  1356
(1 row)

Time: 415.056 ms


scca=> select count(*) from (select "DocThumbnail_text_rank"('guardian
ad litem party', 0.1)) x;
 count
-------
     2
(1 row)

Time: 16.290 ms


scca=> select count(*) from (select "DocThumbnail_text_rank"('party',
0.1)) x;
 count
-------
   935
(1 row)

Time: 386.941 ms


-Kevin