Thread: Queryplan within FTS/GIN index-search
Hi

My indexing base is now up to 7.5M documents. I have raised the
statistics target to 1000 for the tsvector column in order to make the
query planner choose more correctly, and that works excellently.

Table structure is still:

ftstest=# \d ftsbody
                               Table "public.ftsbody"
      Column      |   Type   |                      Modifiers
------------------+----------+------------------------------------------------------
 id               | integer  | not null default nextval('ftsbody_id_seq'::regclass)
 body             | text     | not null default ''::text
 ftsbody_body_fts | tsvector |
Indexes:
    "ftsbody_body_md5" UNIQUE, btree (md5(body))
    "ftsbody_id_pri_idx" UNIQUE, btree (id)
    "ftsbody_tfs_idx" gin (ftsbody_body_fts)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON uniprot FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger('ftsbody_body_fts', 'pg_catalog.english', 'body')

I'm searching the GIN index for 1-5 terms that all match the same
document. TERM1 is unique by itself, TERM2 is a bit more common (52
rows), TERM3 more common still, TERM4 matches close to all records and
TERM5 matches all of them. Just querying for the unique value and then
adding several values that match everything makes the run time go up
significantly.

I would somehow expect the index search to take advantage of the MCV
information in the statistics and sort of translate the query into a
search plus post-filtering (as PG's query planner usually does at the
SQL level).

                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual time=3.726..3.729 rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual time=3.722..3.723 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1 width=751) (actual time=3.700..3.702 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..100.42 rows=1 width=0) (actual time=3.683..3.683 rows=1 loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2'::text))
 Total runtime: 3.790 ms
(9 rows)

                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual time=850.017..850.020 rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual time=850.013..850.015 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1 width=751) (actual time=849.991..849.993 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..100.42 rows=1 width=0) (actual time=849.970..849.970 rows=1 loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3'::text))
 Total runtime: 850.084 ms
(9 rows)

                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual time=1152.065..1152.068 rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual time=1152.061..1152.062 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1 width=751) (actual time=1152.039..1152.041 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..100.42 rows=1 width=0) (actual time=1152.020..1152.020 rows=1 loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4'::text))
 Total runtime: 1152.129 ms
(9 rows)

                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.45..102.45 rows=1 width=751) (actual time=1509.043..1509.046 rows=1 loops=1)
   ->  Sort  (cost=102.45..102.45 rows=1 width=751) (actual time=1509.040..1509.040 rows=1 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 27kB
         ->  Bitmap Heap Scan on ftsbody  (cost=100.42..102.44 rows=1 width=751) (actual time=1509.018..1509.020 rows=1 loops=1)
               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
               ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..100.42 rows=1 width=0) (actual time=1508.998..1508.998 rows=1 loops=1)
                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
 Total runtime: 1509.109 ms
(9 rows)

A (perhaps more readable) version can be found at
http://krogh.cc/~jesper/test.out

Can this be optimized? (I cannot really prevent users from typing in
terms that are common.)

-- 
Jesper
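(For anyone reproducing this: raising the per-column statistics target
that Jesper mentions presumably looked something like the following --
a sketch only, assuming the table and column names above:)

ftstest=# ALTER TABLE ftsbody ALTER COLUMN ftsbody_body_fts SET STATISTICS 1000;
ftstest=# ANALYZE ftsbody;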
Jesper Krogh <jesper@krogh.cc> wrote:

> I'm searching the GIN index for 1-5 terms that all match the same
> document. TERM1 is unique by itself, TERM2 is a bit more common (52
> rows), TERM3 more common still, TERM4 matches close to all records
> and TERM5 matches all of them.

>               Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
>               ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..100.42 rows=1 width=0) (actual time=1508.998..1508.998 rows=1 loops=1)
>                     Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
>  Total runtime: 1509.109 ms

> Can this be optimized? (I cannot really prevent users from typing in
> terms that are common.)

I've wondered that myself. Perhaps a term which is ANDed with others
and is too common could be dropped from the Index Cond and just left in
the Recheck Cond?

-Kevin
On Thu, 2009-10-22 at 18:28 +0200, Jesper Krogh wrote:
> I would somehow expect the index search to take advantage of the MCV
> information in the statistics and sort of translate the query into a
> search plus post-filtering (as PG's query planner usually does at the
> SQL level).

MCVs are full values that are found in columns or indexes -- you aren't
likely to have two entire documents that are exactly equal, so MCVs are
useless in your example.

I believe that stop words are a more common way of accomplishing what
you want to do, but they are slightly more limited: they won't be
checked at any level, and so are best used for truly common words like
"and". From your example, I assume that you still want the word
checked, but it's not selective enough to be usefully checked by the
index.

In effect, what you want are words that aren't searched (or stored) in
the index, but are included in the tsvector (so the RECHECK still
works). That sounds like it would solve your problem, and it would
reduce index size, improve update performance, etc. I don't know how
difficult it would be to implement, but it sounds reasonable to me.

The only disadvantage is that it's more metadata to manage -- all of
the existing data like dictionaries and stop words, plus this new
"common words" list. Also, it would mean that almost every match
requires RECHECK. It would be interesting to know how common a word
needs to be before it's better to leave it out of the index.

Regards,
	Jeff Davis
Jeff Davis wrote:
> On Thu, 2009-10-22 at 18:28 +0200, Jesper Krogh wrote:
>> I would somehow expect the index search to take advantage of the MCV
>> information in the statistics and sort of translate the query into a
>> search plus post-filtering (as PG's query planner usually does at
>> the SQL level).
>
> MCVs are full values that are found in columns or indexes -- you
> aren't likely to have two entire documents that are exactly equal, so
> MCVs are useless in your example.

According to my testing this is not the case, and if it were, the query
planner most likely wouldn't be able to plan this query correctly:

  select id from ftstable where tsvectorcol @@ to_tsquery('commonterm')
  order by id limit 10;

(into an index scan on id), and

  select id from ftstable where tsvectorcol @@ to_tsquery('rareterm');

(into a bitmap index scan on the tsvectorcol and a subsequent sort).
It is indeed information on individual terms from the statistics that
enables this.

> I believe that stop words are a more common way of accomplishing what
> you want to do, but they are slightly more limited: they won't be
> checked at any level, and so are best used for truly common words
> like "and". From your example, I assume that you still want the word
> checked, but it's not selective enough to be usefully checked by the
> index.

The terms are really common non-stop-words.

> In effect, what you want are words that aren't searched (or stored)
> in the index, but are included in the tsvector (so the RECHECK still
> works). That sounds like it would solve your problem, and it would
> reduce index size, improve update performance, etc. I don't know how
> difficult it would be to implement, but it sounds reasonable to me.
>
> The only disadvantage is that it's more metadata to manage -- all of
> the existing data like dictionaries and stop words, plus this new
> "common words" list. Also, it would mean that almost every match
> requires RECHECK. It would be interesting to know how common a word
> needs to be before it's better to leave it out of the index.

That sounds like it could require an index rebuild if the distribution
changes?

That would be another plan to pursue, but the MCV information is
already there:

ftstest=# select * from ftsbody;
 id |                     body                     |                ftsbody_body_fts
----+----------------------------------------------+-------------------------------------------------
  1 | the cat is not a rat uniqueterm1 uniqueterm2 | 'cat':2 'rat':6 'uniqueterm1':7 'uniqueterm2':8
  2 | elephant uniqueterm1 uniqueterm2             | 'eleph':1 'uniqueterm1':2 'uniqueterm2':3
  3 | cannon uniqueterm1 uniqueterm2               | 'cannon':1 'uniqueterm1':2 'uniqueterm2':3
(3 rows)

ftstest=# select most_common_vals, most_common_freqs from pg_stats
ftstest-# where tablename = 'ftsbody' and attname = 'ftsbody_body_fts';
     most_common_vals      | most_common_freqs
---------------------------+-------------------
 {uniqueterm1,uniqueterm2} | {1,1,1,1}
(1 row)

And the query planner uses this information.

-- 
Jesper
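(A side note for anyone following along: the per-lexeme frequencies can
also be computed directly with ts_stat, which is handy for spotting the
overly common terms in the first place -- though it scans the whole
column, so it is expensive on a 7.5M-row table. A sketch, assuming the
table above:)

SELECT word, ndoc
FROM ts_stat('SELECT ftsbody_body_fts FROM ftsbody')
ORDER BY ndoc DESC
LIMIT 10;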
On Fri, 2009-10-23 at 07:18 +0200, Jesper Krogh wrote:
> It is indeed information on individual terms from the statistics that
> enables this.

My mistake, I didn't know it was that smart about it.

> > In effect, what you want are words that aren't searched (or stored)
> > in the index, but are included in the tsvector (so the RECHECK
> > still works). That sounds like it would solve your problem, and it
> > would reduce index size, improve update performance, etc. I don't
> > know how difficult it would be to implement, but it sounds
> > reasonable to me.

> That sounds like it could require an index rebuild if the
> distribution changes?

My thought was that the common words could be declared to be common the
same way stop words are. As long as words are only added to this list,
it should be OK.

> That would be another plan to pursue, but the MCV information is
> already there

The problem with MCVs is that the index search can never eliminate
documents on the grounds that they don't contain a match: a document
might contain a match that was previously an MCV but is no longer.
Also, MCVs are relatively few -- you only get ~1000 or so. There might
be a lot of common words you'd like to track.

Perhaps ANALYZE can automatically add the common words above some
frequency threshold to the list?

Regards,
	Jeff Davis
> On Fri, 2009-10-23 at 07:18 +0200, Jesper Krogh wrote:
>> > In effect, what you want are words that aren't searched (or
>> > stored) in the index, but are included in the tsvector (so the
>> > RECHECK still works). That sounds like it would solve your
>> > problem, and it would reduce index size, improve update
>> > performance, etc. I don't know how difficult it would be to
>> > implement, but it sounds reasonable to me.
>
>> That sounds like it could require an index rebuild if the
>> distribution changes?
>
> My thought was that the common words could be declared to be common
> the same way stop words are. As long as words are only added to this
> list, it should be OK.
>
>> That would be another plan to pursue, but the MCV information is
>> already there
>
> The problem with MCVs is that the index search can never eliminate
> documents on the grounds that they don't contain a match: a document
> might contain a match that was previously an MCV but is no longer.

No, it definitely has to go visit the index/table to confirm findings,
but that is why I wrote Queryplan in the subject line: this is only
about the strategy to pursue to obtain the results. A strategy that
limits the amount of results as early as possible (as PG usually does)
would be what I'd expect, and the MCVs can help it guess on that.

A similar finding comes from rewriting the query (here I took the
extreme case and made "rareterm" a spelling error, so the result is 0
rows):

ftstest=# explain analyze select body from ftsbody where ftsbody_body_fts @@ to_tsquery('commonterm & spellerror') limit 100;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=132.63..188.89 rows=28 width=739) (actual time=862.714..862.714 rows=0 loops=1)
   ->  Bitmap Heap Scan on ftsbody  (cost=132.63..188.89 rows=28 width=739) (actual time=862.711..862.711 rows=0 loops=1)
         Recheck Cond: (ftsbody_body_fts @@ to_tsquery('commonterm & spellerror'::text))
         ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..132.62 rows=28 width=0) (actual time=862.702..862.702 rows=0 loops=1)
               Index Cond: (ftsbody_body_fts @@ to_tsquery('commonterm & spellerror'::text))
 Total runtime: 862.771 ms
(6 rows)

ftstest=# explain analyze select body from ftsbody where ftsbody_body_fts @@ to_tsquery('commonterm') and ftsbody_body_fts @@ to_tsquery('spellerror') limit 100;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=132.70..189.11 rows=28 width=739) (actual time=8.669..8.669 rows=0 loops=1)
   ->  Bitmap Heap Scan on ftsbody  (cost=132.70..189.11 rows=28 width=739) (actual time=8.665..8.665 rows=0 loops=1)
         Recheck Cond: ((ftsbody_body_fts @@ to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@ to_tsquery('spellerror'::text)))
         ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..132.70 rows=28 width=0) (actual time=8.658..8.658 rows=0 loops=1)
               Index Cond: ((ftsbody_body_fts @@ to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@ to_tsquery('spellerror'::text)))
 Total runtime: 8.724 ms
(6 rows)

So getting them with AND in between gives x100 better performance. All
queries are run on "hot disk", repeated 3-5 times, and the numbers are
from the last run, so disk-read effects should be filtered away.

Shouldn't it somehow just do what it already is capable of doing?

-- 
Jesper
jesper@krogh.cc wrote:
>
> So getting them with AND in between gives x100 better performance.
> All queries are run on "hot disk", repeated 3-5 times, and the
> numbers are from the last run, so disk-read effects should be
> filtered away.
>
> Shouldn't it somehow just do what it already is capable of doing?

I'm guessing to_tsquery(...) will produce a tree of search terms (since
it allows for quite complex expressions). Presumably there's a standard
order it gets processed in too, so it should be possible to generate a
more or less efficient ordering.

That structure isn't exposed to the planner though, so it doesn't
benefit from any re-ordering the planner would normally do for normal
(exposed) AND/OR clauses.

Now, to_tsquery() can't re-order the search terms because it doesn't
know what column it's being compared against. In fact, it might not be
a simple column at all. So there would either need to be:

1. Some hooks from the planner to reach into the tsquery datatype.
2. A variant to_tsquery_with_sorting() which would take the column name
   or something and look up the stats to work against.

#1 is the better solution, but #2 might well be simpler to implement as
a workaround for now.

-- 
  Richard Huxton
  Archonet Ltd
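(Until something like #2 exists, the same idea can be approximated on
the application side. A hypothetical helper -- not a built-in, names
assumed from this thread, and it expects pre-sanitized single-term
inputs -- that emits one qual per term so GIN sees separate ANDed
search keys:)

CREATE OR REPLACE FUNCTION fts_and_search(terms text[])
RETURNS SETOF ftsbody AS $$
DECLARE
    sql text := 'SELECT * FROM ftsbody';
    i   int;
BEGIN
    -- build: WHERE col @@ to_tsquery('t1') AND col @@ to_tsquery('t2') ...
    FOR i IN 1 .. array_upper(terms, 1) LOOP
        sql := sql || CASE WHEN i = 1 THEN ' WHERE ' ELSE ' AND ' END
                   || 'ftsbody_body_fts @@ to_tsquery('
                   || quote_literal(terms[i]) || ')';
    END LOOP;
    RETURN QUERY EXECUTE sql;
END;
$$ LANGUAGE plpgsql;

-- usage: select id from fts_and_search(ARRAY['commonterm','spellerror']);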
> jesper@krogh.cc wrote:
>>
>> So getting them with AND in between gives x100 better performance.
>> All queries are run on "hot disk", repeated 3-5 times, and the
>> numbers are from the last run, so disk-read effects should be
>> filtered away.
>>
>> Shouldn't it somehow just do what it already is capable of doing?
>
> I'm guessing to_tsquery(...) will produce a tree of search terms
> (since it allows for quite complex expressions). Presumably there's a
> standard order it gets processed in too, so it should be possible to
> generate a more or less efficient ordering.
>
> That structure isn't exposed to the planner though, so it doesn't
> benefit from any re-ordering the planner would normally do for normal
> (exposed) AND/OR clauses.
>
> Now, to_tsquery() can't re-order the search terms because it doesn't
> know what column it's being compared against. In fact, it might not
> be a simple column at all.

I can't follow this logic based on the explain output, but I may have
misunderstood something. The only difference between these two query
plans is an additional OR'd term in the to_tsquery().

What we see is that the query planner indeed has knowledge about
changes in the row estimates based on changes in the query given to
to_tsquery(). My guess is that this is because to_tsquery actually
parses the query and gives the estimates; now, how can to_tsquery give
estimates without having access to the statistics for the column?

ftstest=# explain select id from ftsbody where ftsbody_body_fts @@ to_tsquery('reallyrare');
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on ftsbody  (cost=132.64..190.91 rows=29 width=4)
   Recheck Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare'::text))
   ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..132.63 rows=29 width=0)
         Index Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare'::text))
(4 rows)

ftstest=# explain select id from ftsbody where ftsbody_body_fts @@ to_tsquery('reallyrare | morerare');
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftsbody  (cost=164.86..279.26 rows=57 width=4)
   Recheck Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare | morerare'::text))
   ->  Bitmap Index Scan on ftsbody_tfs_idx  (cost=0.00..164.84 rows=57 width=0)
         Index Cond: (ftsbody_body_fts @@ to_tsquery('reallyrare | morerare'::text))
(4 rows)

ftstest=# explain select id from ftsbody where ftsbody_body_fts @@ to_tsquery('reallyrare | reallycommon');
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on ftsbody  (cost=0.00..1023249.39 rows=5509293 width=4)
   Filter: (ftsbody_body_fts @@ to_tsquery('reallyrare | reallycommon'::text))
(2 rows)

> 2. A variant to_tsquery_with_sorting() which would take the column
>    name or something and look up the stats to work against.

Doesn't the above suggest that this is already there? (Sorry -- looking
at C code from my point of view would set me a couple of weeks back, so
I have trouble getting closer to the answer than interpreting the
output and guessing the rest.)

-- 
Jesper
jesper@krogh.cc wrote:
>> That structure isn't exposed to the planner though, so it doesn't
>> benefit from any re-ordering the planner would normally do for
>> normal (exposed) AND/OR clauses.
>>
>> Now, to_tsquery() can't re-order the search terms because it doesn't
>> know what column it's being compared against. In fact, it might not
>> be a simple column at all.
>
> I can't follow this logic based on the explain output, but I may have
> misunderstood something. The only difference between these two query
> plans is an additional OR'd term in the to_tsquery().

Hmm - I've had a poke through the source. I've slightly misled you...

> What we see is that the query planner indeed has knowledge about
> changes in the row estimates based on changes in the query given to
> to_tsquery().

Yes, new in 8.4 - sorry, I thought that hadn't made it in.

The two plan nodes in question are in:
  backend/executor/nodeBitmapIndexscan.c
  backend/executor/nodeBitmapHeapscan.c

The actual tsearch stuff is in src/backend/utils/adt/ts*.c

It looks like TS_execute (tsvector_op.c) is the bit of code that
handles the tsquery tree. That uses a callback to actually check values
(checkcondition_gin). The gin_extract_tsquery function is presumably
the extractQuery function as described in the manuals (Ch 52).

So, I'm guessing what you would want to do is generate a reduced query
tree for the index scan (A & B & C => A, if A is an uncommon word) and
use the full query tree for the heap check. Now, what isn't clear to me
on first glance is how to determine which phase of the bitmap scan we
are in.

HTH

Just checking, because I don't think it's useful in this case, but you
do know about "gin_fuzzy_search_limit"?

-- 
  Richard Huxton
  Archonet Ltd
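(For completeness, since Richard mentions it: gin_fuzzy_search_limit is
a GUC that puts a soft upper bound on the number of rows a GIN index
scan returns. A usage sketch only -- and, as he says, probably not a
fix for the AND case discussed here:)

SET gin_fuzzy_search_limit = 10000;  -- 0 (the default) means no limit
SELECT id FROM ftsbody WHERE ftsbody_body_fts @@ to_tsquery('commonterm');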
On Fri, 2009-10-23 at 09:26 +0100, Richard Huxton wrote:
> That structure isn't exposed to the planner though, so it doesn't
> benefit from any re-ordering the planner would normally do for normal
> (exposed) AND/OR clauses.

I don't think that explains it, because in the second plan you only see
a single index scan with two quals:

   Index Cond: ((ftsbody_body_fts @@ to_tsquery('commonterm'::text))
   AND (ftsbody_body_fts @@ to_tsquery('spellerror'::text)))

So it's entirely up to GIN how to execute that.

Regards,
	Jeff Davis
On Fri, 2009-10-23 at 09:45 +0200, jesper@krogh.cc wrote:
> No, it definitely has to go visit the index/table to confirm
> findings, but that is why I wrote Queryplan in the subject line: this
> is only about the strategy to pursue to obtain the results. A
> strategy that limits the amount of results as early as possible (as
> PG usually does) would be what I'd expect, and the MCVs can help it
> guess on that.

I see what you're saying: you could still index the common terms like
normal, but just not look for anything in the index if it's an MCV.
That sounds reasonable, based on the numbers you provided.

>          Index Cond: (ftsbody_body_fts @@ to_tsquery('commonterm &
> spellerror'::text))
>  Total runtime: 862.771 ms
> (6 rows)

...

>          Index Cond: ((ftsbody_body_fts @@
> to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
> to_tsquery('spellerror'::text)))
>  Total runtime: 8.724 ms
> (6 rows)

Something seems strange here. Both are a single index scan, but having
a single complex search key is worse than having two simple search
keys.

Perhaps the real problem is that there's a difference between these
cases at all? I don't see any reason why the first should be more
expensive than the second.

Regards,
	Jeff Davis
Jeff Davis wrote:
> On Fri, 2009-10-23 at 09:26 +0100, Richard Huxton wrote:
>> That structure isn't exposed to the planner though, so it doesn't
>> benefit from any re-ordering the planner would normally do for
>> normal (exposed) AND/OR clauses.
>
> I don't think that explains it, because in the second plan you only
> see a single index scan with two quals:
>
>    Index Cond: ((ftsbody_body_fts @@ to_tsquery('commonterm'::text))
>    AND (ftsbody_body_fts @@ to_tsquery('spellerror'::text)))
>
> So it's entirely up to GIN how to execute that.

http://www.postgresql.org/docs/8.4/static/gin-extensibility.html

Datum *extractQuery(...)

"Returns an array of keys given a value to be queried; that is, query
is the value on the right-hand side of an indexable operator whose
left-hand side is the indexed column."

So - that is presumably two separate arrays of keys being matched
against, and the AND means if the first fails it'll never check the
second. What I'm not sure about is whether to_tsquery('commonterm &
spellerror') produces two sets of keys or just one.

-- 
  Richard Huxton
  Archonet Ltd
Jeff Davis wrote:
> On Fri, 2009-10-23 at 09:45 +0200, jesper@krogh.cc wrote:
>> No, it definitely has to go visit the index/table to confirm
>> findings, but that is why I wrote Queryplan in the subject line:
>> this is only about the strategy to pursue to obtain the results. A
>> strategy that limits the amount of results as early as possible (as
>> PG usually does) would be what I'd expect, and the MCVs can help it
>> guess on that.
>
> I see what you're saying: you could still index the common terms like
> normal, but just not look for anything in the index if it's an MCV.
> That sounds reasonable, based on the numbers you provided.

I'm not sure that's quite what I'm saying. If I should rephrase it:
given an AND operator (which translates into an intersection of the
left and right sides), it should go for the side with the least
expected results (the smallest set) and subsequently use the other
expression for processing only that set.

>>          Index Cond: (ftsbody_body_fts @@ to_tsquery('commonterm &
>> spellerror'::text))
>>  Total runtime: 862.771 ms
>> (6 rows)
>
> ...
>
>>          Index Cond: ((ftsbody_body_fts @@
>> to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
>> to_tsquery('spellerror'::text)))
>>  Total runtime: 8.724 ms
>> (6 rows)
>
> Something seems strange here. Both are a single index scan, but
> having a single complex search key is worse than having two simple
> search keys.
>
> Perhaps the real problem is that there's a difference between these
> cases at all? I don't see any reason why the first should be more
> expensive than the second.

-- 
Jesper
On Fri, 2009-10-23 at 17:27 +0100, Richard Huxton wrote:
> "Returns an array of keys given a value to be queried; that is, query
> is the value on the right-hand side of an indexable operator whose
> left-hand side is the indexed column."
>
> So - that is presumably two separate arrays of keys being matched
> against, and the AND means if the first fails it'll never check the
> second.

My point was that if it's only one index scan in both cases, then GIN
should have the same information in both cases, right? So why are they
being treated differently? I must be missing something.

Regards,
	Jeff Davis
Hi.

I've now got a test set that can reproduce the problem where the two
fully equivalent queries:

  body_fts @@ to_tsquery('commonterm & nonexistingterm')

and

  body_fts @@ to_tsquery('commonterm') AND body_fts @@ to_tsquery('nonexistingterm')

give a difference of x300 in execution time (growing with the size of
the document base).

This can now be reproduced using
http://krogh.cc/~jesper/fts-queryplan.pl and
http://krogh.cc/~jesper/words.txt

It builds up a table with 200,000 documents where "commonterm" exists
in all of them; "nonexistingterm" is in none. To make the query planner
pick a "sane" plan I need to do:

ftstest=# set enable_seqscan=off;

Then:

ftstest=# explain analyze select id from ftstest where body_fts @@ to_tsquery('nonexistingterm & commonterm');
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=5563.09..7230.93 rows=1000 width=4) (actual time=30.861..30.861 rows=0 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('nonexistingterm & commonterm'::text))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..5562.84 rows=1000 width=0) (actual time=30.856..30.856 rows=0 loops=1)
         Index Cond: (body_fts @@ to_tsquery('nonexistingterm & commonterm'::text))
 Total runtime: 30.907 ms
(5 rows)

ftstest=# explain analyze select id from ftstest where body_fts @@ to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm');
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=5565.59..7238.43 rows=1000 width=4) (actual time=0.059..0.059 rows=0 loops=1)
   Recheck Cond: ((body_fts @@ to_tsquery('nonexistingterm'::text)) AND (body_fts @@ to_tsquery('commonterm'::text)))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..5565.34 rows=1000 width=0) (actual time=0.057..0.057 rows=0 loops=1)
         Index Cond: ((body_fts @@ to_tsquery('nonexistingterm'::text)) AND (body_fts @@ to_tsquery('commonterm'::text)))
 Total runtime: 0.111 ms
(5 rows)

Run repeatedly to get a fully memory-resident dataset. In this
situation the former query ends up being 300x slower than the latter,
although they are fully equivalent.

-- 
Jesper
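(If you'd rather not run the Perl script, a rough SQL-only sketch of a
comparable corpus -- the random md5 filler is my own substitute for
words.txt, and the commonterm80/60/99 columns anticipate the later
tests in this thread; document sizes will differ from Jesper's:)

CREATE TABLE ftstest (id serial PRIMARY KEY, body text, body_fts tsvector);

-- "commonterm" in every row; "nonexistingterm" in none
INSERT INTO ftstest (body)
SELECT 'commonterm '
       || CASE WHEN random() < 0.99 THEN 'commonterm99 ' ELSE '' END
       || CASE WHEN random() < 0.80 THEN 'commonterm80 ' ELSE '' END
       || CASE WHEN random() < 0.60 THEN 'commonterm60 ' ELSE '' END
       || md5(random()::text)
FROM generate_series(1, 200000);

UPDATE ftstest SET body_fts = to_tsvector('english', body);
CREATE INDEX ftstest_gin_idx ON ftstest USING gin (body_fts);
ANALYZE ftstest;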
Jesper Krogh <jesper@krogh.cc> writes:
> I've now got a test set that can reproduce the problem where the two
> fully equivalent queries:
>   body_fts @@ to_tsquery('commonterm & nonexistingterm')
> and
>   body_fts @@ to_tsquery('commonterm') AND body_fts @@ to_tsquery('nonexistingterm')
> give a difference of x300 in execution time (growing with the size of
> the document base).

I looked into this a bit. It seems the reason the first is much slower
is that the AND nature of the query is not exposed to the GIN control
logic (ginget.c). It has to fetch every index-entry combination that
involves any of the terms, which of course is going to be the whole
index in this case. This is obvious when you realize that the control
logic doesn't know the difference between the tsqueries "commonterm &
nonexistingterm" and "commonterm | nonexistingterm". The API for
opclass extractQuery functions just isn't powerful enough to show that.

I think a possible solution to this could involve allowing extractQuery
to mark individual keys as "required" or "optional". Then the control
logic could know not to bother with combinations that haven't got all
the "required" keys. There might be other better answers though.

But having said that, this particular test case is far from compelling.
Any sane text search application is going to try to filter out common
words as stopwords; it's only the failure to do that that's making this
run slow.

			regards, tom lane
Tom Lane wrote:
> But having said that, this particular test case is far from
> compelling. Any sane text search application is going to try to
> filter out common words as stopwords; it's only the failure to do
> that that's making this run slow.

Below are test runs not with "commonterm" but with an 80% term and a
60% term. There are two issues in this; one is the way PG "blows up"
when searching for a stop word (even though it performs excellently
when searching for a term that is in the complete document base):

ftstest=# select id from ftstest where body_fts @@ to_tsquery('commonterm') limit 10;
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

Time: 1.004 ms

ftstest=# select id from ftstest where body_fts @@ to_tsquery('the') limit 10;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

Time: 0.587 ms

I can definitely afford the index size needed to get the first behavior
in my application. Stop words will only be really useful when searches
for them translate into full results, not errors.

I also think you're trying to limit the scope of the problem more than
is fair:

ftstest=# select id from ftstest where body_fts @@ to_tsquery('nonexistingterm & commonterm');
 id
----
(0 rows)

Time: 28.230 ms

ftstest=# select id from ftstest where body_fts @@ to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm');
 id
----
(0 rows)

Time: 0.930 ms

(So explain analyze is not a fair measurement; it seems to make the
problem way worse.) This is "only" x28.

ftstest=# select id from ftstest where body_fts @@ to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm80');
 id
----
(0 rows)

Time: 0.992 ms

ftstest=# select id from ftstest where body_fts @@ to_tsquery('nonexistingterm & commonterm80');
 id
----
(0 rows)

Time: 22.393 ms

And for an 80% term ... x23.

ftstest=# select id from ftstest where body_fts @@ to_tsquery('nonexistingterm') and body_fts @@ to_tsquery('commonterm60');
 id
----
(0 rows)

Time: 0.954 ms

ftstest=# select id from ftstest where body_fts @@ to_tsquery('nonexistingterm & commonterm60');
 id
----
(0 rows)

Time: 17.006 ms

And x17.

Just trying to say that the body of the problem isn't a discussion
about stop words. That being said, if you coin the term "stop word" to
mean "any term that exists in all or close to all documents", then the
way it behaves when searching for only one of them is a situation we'll
hit all the time (when dealing with user-typed input).

-- 
Jesper
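(One way to catch stop words application-side without maintaining a
separate list, relevant to Jesper's "FTS pre-parser" remark: ts_lexize
returns an empty array for a stop word and NULL for a word the
dictionary doesn't recognize, so a pre-parser can test each term before
building the tsquery. A sketch against the standard snowball
dictionary:)

SELECT ts_lexize('english_stem', 'the');   -- {}     stop word: drop it from the query
SELECT ts_lexize('english_stem', 'cats');  -- {cat}  normal word: keep the stemmed form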
On Fri, Oct 30, 2009 at 8:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But having said that, this particular test case is far from
> compelling. Any sane text search application is going to try to
> filter out common words as stopwords; it's only the failure to do
> that that's making this run slow.

Well, it would be nice if that wasn't necessary. There are plenty of
applications where it isn't really an option. Consider searching for
phrases like "The The" or "The Office". The sanity of doing this is
purely a function of implementation quality and not of actual user
interface design.

-- 
greg
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Any sane text search application is going to try to filter out
> common words as stopwords; it's only the failure to do that that's
> making this run slow.

Imagine a large table with a GIN index on a tsvector. The user wants a
particular document, and is sure four words are in it. One of them only
appears in 100 documents. The other three each appear in about a third
of the documents. Is it more sane to require the user to wait for a
table scan or to make them wade through 100 rows rather than four?

I'd rather have the index used for the selective test, and apply the
remaining tests to the rows retrieved from the heap.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Any sane text search application is going to try to filter out >> common words as stopwords; it's only the failure to do that that's >> making this run slow. > I'd rather have the index used for the selective test, and apply the > remaining tests to the rows retrieved from the heap. Uh, that was exactly my point. Indexing common words is a waste. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Any sane text search application is going to try to filter out
>>> common words as stopwords; it's only the failure to do that that's
>>> making this run slow.
>
>> I'd rather have the index used for the selective test, and apply
>> the remaining tests to the rows retrieved from the heap.
>
> Uh, that was exactly my point. Indexing common words is a waste.

Perhaps I'm missing something. My point was that there are words which
are too common to be useful for index searches, yet uncommon enough to
usefully limit the results. These words could typically benefit from
tsearch2-style parsing and dictionaries, so declaring them as stop
words would be bad from a functional perspective, yet searching an
index for them would be bad from a performance perspective.

One solution would be for the users to rigorously identify all of these
words, include them on one stop word list but not another, include
*two* tsvector columns in the table (with and without the "iffy"
words), index only the one with the larger stop word list, and generate
two tsquery values to search the two different columns. Best of both
worlds. Sort of. The staff time to create and maintain such a list
would obviously be costly, and writing the queries would be
error-prone.

Second best would be to somehow recognize the "iffy" words and exclude
them from the index and the index search phase, but apply the check
when the row is retrieved from the heap. I really have a hard time
seeing how the conditional exclusion from the index could be
accomplished, though.

Next best would be to let them fall into the index, but exclude
top-level ANDed values from the index search, applying them only to the
recheck when the row is read from the heap. That seems, at least
conceptually, like it could be done.

-Kevin
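(A sketch of what that two-column setup could look like -- everything
here is hypothetical: "english_iffy" would need a stop word file,
containing the standard English list plus the "iffy" words, installed
as $SHAREDIR/tsearch_data/english_iffy.stop:)

-- dictionary/configuration with the larger stop word list
CREATE TEXT SEARCH DICTIONARY english_iffy_stem
    (TEMPLATE = snowball, Language = english, StopWords = english_iffy);
CREATE TEXT SEARCH CONFIGURATION english_iffy (COPY = pg_catalog.english);
ALTER TEXT SEARCH CONFIGURATION english_iffy
    ALTER MAPPING FOR asciiword, word WITH english_iffy_stem;

-- second tsvector column; only this one gets the GIN index
ALTER TABLE ftsbody ADD COLUMN body_fts_indexed tsvector;
UPDATE ftsbody SET body_fts_indexed = to_tsvector('english_iffy', body);
CREATE INDEX ftsbody_tfs_idx2 ON ftsbody USING gin (body_fts_indexed);

-- search: selective terms through the index, iffy terms checked
-- against the full (unindexed) column
SELECT id FROM ftsbody
WHERE body_fts_indexed @@ to_tsquery('english_iffy', 'rareterm')
  AND ftsbody_body_fts @@ to_tsquery('english', 'rareterm & iffyterm');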
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Perhaps I'm missing something. My point was that there are words > which are too common to be useful for index searches, yet uncommon > enough to usefully limit the results. These words could typically > benefit from tsearch2 style parsing and dictionaries; so declaring > them as stop words would be bad from a functional perspective, yet > searching an index for them would be bad from a performance > perspective. Right, but the original complaint in this thread was that a GIN index is slow about searching for very common terms. The answer to that clearly is to not index common terms, rather than worry about making the case a bit faster. It may well be that Jesper's identified a place where the GIN code could be improved --- it seems like having the top-level search logic be more aware of the AND/OR structure of queries would be useful. But the particular example shown here doesn't make a very good case for that, because it's hard to tell how much of a penalty would be taken in more realistic examples. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The answer to that clearly is to not index common terms

My understanding is that we don't currently get statistics on how
common the terms in a tsvector column are until we ANALYZE the *index*
created from it. Seems like sort of a Catch-22. Also, if we exclude
words which are in the tsvector from the index on the tsvector, we need
to know what words were excluded so we know not to search on them, as
well as forcing the recheck of the full tsquery (unless this always
happens already?).

> It may well be that Jesper's identified a place where the GIN code
> could be improved

My naive assumption has been that it would be possible to get an
improvement without touching the index logic, by changing this part of
the query plan:

  Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))

to something like this:

  Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1'::text))

and counting on this to do the rest:

  Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))

I'm wondering if anyone has ever confirmed that probing for the more
frequent term through the index is *ever* a win, versus using the index
for the most common of the top-level AND conditions and doing the rest
on recheck. That seems like a dangerous assumption from which to start.

> But the particular example shown here doesn't make a very good case
> for that, because it's hard to tell how much of a penalty would be
> taken in more realistic examples.

Fair enough. We're in the early stages of moving to tsearch2 and I
haven't run across this yet in practice. If I do, I'll follow up.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > I'm wondering if anyone has ever confirmed that probing for the more > frequent term through the index is *ever* a win, versus using the > index for the most common of the top level AND conditions and doing > the rest on recheck. s/most/least/ -Kevin
Tom Lane wrote:
> It may well be that Jesper's identified a place where the GIN code
> could be improved --- it seems like having the top-level search logic
> be more aware of the AND/OR structure of queries would be useful. But
> the particular example shown here doesn't make a very good case for
> that, because it's hard to tell how much of a penalty would be taken
> in more realistic examples.

With a term sitting in:
  80% of the docs, the penalty is x23
  60% of the docs, the penalty is x17
  40% of the docs, the penalty is x13

for doing

  vectorcol @@ to_tsquery('term & commonterm')

compared to

  vectorcol @@ to_tsquery('term') and vectorcol @@ to_tsquery('commonterm');

where "term" is non-existing or rare. (These are query execution
numbers on a fully memory-resident dataset; doing the test with
"drop_caches" and a restart of PG to simulate a dead disk, the numbers
are a bit higher.)

http://article.gmane.org/gmane.comp.db.postgresql.performance/22496/match=

Would you ever classify a term sitting in 60-80% of the documents as a
stop-word candidate? I don't know whether x13 in execution performance
is worth hunting, or whether there are lower-hanging fruits sitting in
the fts-search system. This is essentially the penalty the user will
get for adding a term to their search that rarely restricts the
results.

In terms of the usual "set theory" that databases work in, a search for
a stop word translates into the full set. This is just not the case
here, where it throws a warning and returns the empty set. That warning
can be caught by application code to produce the "correct" result for
the users, but slightly more complex queries don't do this:

ftstest=# select id from ftstest where body_fts @@ to_tsquery('random | the') limit 10;
 id
----
(0 rows)

Here I would have expected the same error. I basically have to hook the
complete stop-word dictionary into an FTS pre-parser to give users the
expected results -- or have I missed a feature somewhere?

My reason for not pushing "commonterms" into the stop-word list is that
they actually perform excellently in PG. Same body as usual, but
commonterm99 is sitting in 99% of the documents:

ftstest=# set enable_seqscan=off;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@ to_tsquery('commonterm99');
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=1051476.74..1107666.07 rows=197887 width=4) (actual time=51.036..121.348 rows=197951 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('commonterm99'::text))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..1051427.26 rows=197887 width=0) (actual time=49.602..49.602 rows=197951 loops=1)
         Index Cond: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 147.350 ms
(5 rows)

ftstest=# set enable_seqscan=on;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@ to_tsquery('commonterm99');
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..56744.00 rows=197887 width=4) (actual time=0.086..7134.384 rows=197951 loops=1)
   Filter: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 7194.182 ms
(3 rows)

So in order to get the result with a speedup of more than x50, I simply
cannot add these terms to the stop words, because then the first query
would resolve to an error and getting results would be up to the second
query.

My bet is that doing a seq_scan will "never" be beneficial for this
type of query.

As far as I can see, the only consequence of simply not removing stop
words at all is a (fairly small) increase in index size. It seems to me
that stop words were invented back when it was hard to get more than
2GB of memory into a computer, to reduce the index to a size that could
better fit in memory. But nowadays the downsides seem hard to see?

-- 
Jesper
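(The size trade-off Jesper mentions is at least easy to measure; a
sketch, assuming the index name from his test script:)

SELECT pg_size_pretty(pg_relation_size('ftstest_gin_idx'));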
I wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But the particular example shown here doesn't make a very good case
>> for that, because it's hard to tell how much of a penalty would be
>> taken in more realistic examples.
>
> Fair enough. We're in the early stages of moving to tsearch2 and I
> haven't run across this yet in practice. If I do, I'll follow up.

We have a staging database which allowed some limited testing quickly.
While it's real production data, we haven't been gathering this type of
data long, so it's got relatively few rows; therefore, it wasn't
feasible to try any tests which would be disk-bound, so I primed the
cache for all of these, and they are all totally served from cache.

For various reasons which I'll omit unless asked, we do our text
searches through functions which take a "selection string", turn it
into a tsquery with a little extra massaging on our part, run the query
with a minimum ranking to return, and return a set of records ordered
by the ranking in descending sequence.

Under these conditions there is a slight performance gain in adding an
additional test which matches 1356 out of 1691 rows. Not surprisingly
for a fully cached query set, timings were very consistent from run to
run. While undoubtedly a little unusual in approach, this is production
software run against real-world data. I confirmed that it is using the
GIN index on the tsvector for these runs.

By the way, the tsearch2 features have been received very well so far.
One of the first reactions from most users is surprise at how fast it
is. :-)

Anyway, our production results don't confirm the issue shown with the
artificial test data.

scca=> select count(*) from "DocThumbnail" where "text" is not null;
 count
-------
  1691
(1 row)

Time: 0.619 ms

scca=> select count(*) from (select "DocThumbnail_text_rank"('guardian ad litem', 0.1)) x;
 count
-------
    41
(1 row)

Time: 19.394 ms

scca=> select count(*) from (select "DocThumbnail_text_rank"('guardian ad litem attorney', 0.1)) x;
 count
-------
     4
(1 row)

Time: 16.434 ms

scca=> select count(*) from (select "DocThumbnail_text_rank"('attorney', 0.1)) x;
 count
-------
  1356
(1 row)

Time: 415.056 ms

scca=> select count(*) from (select "DocThumbnail_text_rank"('guardian ad litem party', 0.1)) x;
 count
-------
     2
(1 row)

Time: 16.290 ms

scca=> select count(*) from (select "DocThumbnail_text_rank"('party', 0.1)) x;
 count
-------
   935
(1 row)

Time: 386.941 ms

-Kevin
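(A guess at the shape of such a wrapper function -- the real
"DocThumbnail_text_rank" is Kevin's own code; the function name here,
the tsvector column "text_fts", and the use of plainto_tsquery for the
"extra massaging" are all assumptions:)

CREATE OR REPLACE FUNCTION doc_thumbnail_text_rank(selection text, min_rank real)
RETURNS TABLE (id integer, rank real) AS $$
DECLARE
    q tsquery := plainto_tsquery(selection);  -- build tsquery from the selection string
BEGIN
    RETURN QUERY
        SELECT d.id, ts_rank(d.text_fts, q)
        FROM "DocThumbnail" d
        WHERE d.text_fts @@ q
          AND ts_rank(d.text_fts, q) >= min_rank
        ORDER BY 2 DESC;          -- highest-ranked documents first
END;
$$ LANGUAGE plpgsql STABLE;

-- usage: select count(*) from (select * from doc_thumbnail_text_rank('guardian ad litem', 0.1)) x;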