Thread: lexemes in prefix search going through dictionary modifications
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword dictionaries. This seems like a bug
to me.

db=# select to_tsquery('english', 's:*');
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 to_tsquery
------------

(1 row)

db=# select to_tsquery('simple', 's:*');
 to_tsquery
------------
 's':*
(1 row)

I also think that this is a mistake. It should only be highlighting "s".

db=# select ts_headline('sushant', to_tsquery('simple', 's:*'));
   ts_headline
----------------
 <b>sushant</b>

Thanks,
Sushant.

On Oct 25, 2011, at 17:26, Sushant Sinha wrote:
> I am currently using the prefix search feature in text search. I find
> that the prefix characters are treated the same as a normal lexeme and
> passed through stemming and stopword dictionaries. This seems like a bug
> to me.

Hm, I don't think so. If they don't pass through stopword dictionaries,
then queries containing stopwords will fail to find any rows, which is
probably not what one would expect.

Here's an example: a query for records containing the* and car*. The
@@ operator returns true because the stopword is removed from both the
tsvector and the tsquery. (The 'english' dictionary drops 'these' as a
stopword and stems 'cars' to 'car'. Both the tsvector and the query end
up being just 'car'.)

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('english', 'the:* & car:*');
 ?column?
----------
 t
(1 row)

Here's what happens when stopwords aren't removed from the query. (Now
the tsvector ends up being 'car', but the query is 'the:* & car:*'.)

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('simple', 'the:* & car:*');
 ?column?
----------
 f
(1 row)

best regards,
Florian Pflug

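The two queries above can be mimicked outside PostgreSQL with a toy model. The sketch below is only an illustration and not how tsearch is implemented: STOPWORDS, toy_stem, normalize and matches are hypothetical names, the stopword list is made up, and the stemmer merely strips a trailing "s". It suggests why the match succeeds when the dictionary is applied to both the document and the prefix query, and fails when the query keeps the stopword prefix.

```python
# Toy model of the example above; NOT PostgreSQL's implementation.
STOPWORDS = {"the", "these", "a", "an", "and"}  # assumed, tiny stopword list


def toy_stem(word):
    """Hypothetical stemmer: strip one trailing 's' (e.g. 'cars' -> 'car')."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def normalize(words, use_dictionary):
    """Return lexemes; with the dictionary on, drop stopwords and stem."""
    if not use_dictionary:
        return [w.lower() for w in words]
    return [toy_stem(w.lower()) for w in words if w.lower() not in STOPWORDS]


def matches(document, prefixes, query_uses_dictionary):
    """AND of prefix terms against a document indexed with the dictionary."""
    index = set(normalize(document.split(), use_dictionary=True))
    terms = normalize(prefixes, use_dictionary=query_uses_dictionary)
    if not terms:
        # Mirrors the thread: an empty query matches no row.
        return False
    return all(any(lex.startswith(t) for lex in index) for t in terms)


doc = "these cars"
# Query normalized like the document ('english' on both sides).
print(matches(doc, ["the", "car"], query_uses_dictionary=True))
# Query left untouched ('simple' on the query side).
print(matches(doc, ["the", "car"], query_uses_dictionary=False))
```

In this model the first call reports a match and the second does not, because the untouched query still demands a lexeme starting with "the" that the index no longer contains.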
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
> On Oct 25, 2011, at 17:26, Sushant Sinha wrote:
> > I am currently using the prefix search feature in text search. I find
> > that the prefix characters are treated the same as a normal lexeme and
> > passed through stemming and stopword dictionaries. This seems like a bug
> > to me.
>
> Hm, I don't think so. If they don't pass through stopword dictionaries,
> then queries containing stopwords will fail to find any rows, which is
> probably not what one would expect.

I think what you are calling a feature is really a bug. I am fairly sure
that when someone says to_tsquery('english', 's:*'), they are looking for
an entry that has a *non-stopword* word that starts with 's'. And
especially so in a text search configuration that eliminates stop words.

Does it even make sense to apply stemming, abbreviations, or synonyms to
a few letters? It will be so unpredictable.

-Sushant.

On Oct 25, 2011, at 18:47, Sushant Sinha wrote:
> On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
>> On Oct 25, 2011, at 17:26, Sushant Sinha wrote:
>>> I am currently using the prefix search feature in text search. I find
>>> that the prefix characters are treated the same as a normal lexeme and
>>> passed through stemming and stopword dictionaries. This seems like a bug
>>> to me.
>>
>> Hm, I don't think so. If they don't pass through stopword dictionaries,
>> then queries containing stopwords will fail to find any rows, which is
>> probably not what one would expect.
>
> I think what you are calling a feature is really a bug. I am fairly sure
> that when someone says to_tsquery('english', 's:*'), they are looking for
> an entry that has a *non-stopword* word that starts with 's'. And
> especially so in a text search configuration that eliminates stop words.

But the whole idea of removing stopwords from the query is that users
*don't* need to be aware of the precise list of stopwords. The way I see
it, stopwords are simply an optimization that helps reduce the size of
your fulltext index.

Assume, for example, that the postgres mailing list archive search used
tsearch (which I think it does, but I'm not sure). It'd then probably make
sense to add "postgres" to the list of stopwords, because it's bound to
appear in nearly every mail. But would you want searches which include
'postgres*' to turn up empty? Quite certainly not.

> Does it even make sense to apply stemming, abbreviations, or synonyms to
> a few letters? It will be so unpredictable.

That depends on the language. In German (my native tongue), one can
concatenate nouns to form new nouns. It's thus not entirely unreasonable
that one would want the prefix to be stemmed to its singular form before
being matched.

Also, suppose you're using a dictionary which corrects common typos. Who
says you wouldn't want that to be applied to prefix queries?

best regards,
Florian Pflug

On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
> Assume, for example, that the postgres mailing list archive search used
> tsearch (which I think it does, but I'm not sure). It'd then probably make
> sense to add "postgres" to the list of stopwords, because it's bound to
> appear in nearly every mail. But would you want searches which include
> 'postgres*' to turn up empty? Quite certainly not.

That improves recall for the "postgres:*" query but certainly doesn't help
other queries like "post:*". More importantly, it affects precision for
all queries like "a:*", "an:*", "and:*", "s:*", "t:*", "the:*", etc.
(When that is the only search term, it also affects recall, as no row
matches an empty tsquery.) Since stopwords tend to be short, this means
that prefix search for a few characters is meaningless. And I would argue
that this is exactly when prefix search matters most: when you know only
a few characters.

-Sushant.

I think there is a need to provide a prefix search that bypasses
dictionaries. If you folks think that there is some credibility to such a
need, then I can think about implementing it. How about an operator like
":#" that does this? The ":*" operator will continue to mean the same as
it does currently.

-Sushant.

On Tue, 2011-10-25 at 23:45 +0530, Sushant Sinha wrote:
> On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
>
> > Assume, for example, that the postgres mailing list archive search used
> > tsearch (which I think it does, but I'm not sure). It'd then probably make
> > sense to add "postgres" to the list of stopwords, because it's bound to
> > appear in nearly every mail. But would you want searches which include
> > 'postgres*' to turn up empty? Quite certainly not.
>
> That improves recall for the "postgres:*" query but certainly doesn't help
> other queries like "post:*". More importantly, it affects precision for
> all queries like "a:*", "an:*", "and:*", "s:*", "t:*", "the:*", etc.
> (When that is the only search term, it also affects recall, as no row
> matches an empty tsquery.) Since stopwords tend to be short, this means
> that prefix search for a few characters is meaningless. And I would argue
> that this is exactly when prefix search matters most: when you know only
> a few characters.
>
> -Sushant

Sushant Sinha <sushant354@gmail.com> writes:
> I think there is a need to provide prefix search to bypass
> dictionaries. If you folks think that there is some credibility to such a
> need then I can think about implementing it. How about an operator like
> ":#" that does this? The ":*" will continue to mean the same as
> currently.

I don't think that just turning off dictionaries for prefix searches is
going to do much of anything useful, because the lexemes in the index
are still going to have gone through normalization. Somehow we need to
identify which lexemes could match the prefix after accounting for the
fact that they've been through normalization.

An example: if the original word is "transferring", the lexeme (in the
english config) is just "transfer". If you search for "transferring:*"
and suppress dictionaries, you'll fail to get a match, which is simply
wrong. It's not a step forward to suppress some failure cases while
adding new ones.

Another point is that whatever we do about this really ought to be
inside the engine, not exposed in a form that makes users do their
queries differently.

regards, tom lane
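
The "transferring" case above can be sketched with a small toy model. This is an illustration under stated assumptions, not PostgreSQL code: TOY_DICTIONARY, to_lexeme, build_index and prefix_match are hypothetical names, and the hand-written mapping merely stands in for whatever the english configuration's stemmer does. The point it tries to show is that the index only holds normalized lexemes, so a query term that skips normalization may no longer be a prefix of anything stored.

```python
# Toy model of the "transferring" example; NOT PostgreSQL's implementation.
# Assumed mapping standing in for the english configuration's stemmer.
TOY_DICTIONARY = {"transferring": "transfer", "transferred": "transfer"}


def to_lexeme(word):
    """Normalize a word the way the (hypothetical) dictionary would."""
    word = word.lower()
    return TOY_DICTIONARY.get(word, word)


def build_index(text):
    """The index only ever stores normalized lexemes."""
    return {to_lexeme(w) for w in text.split()}


def prefix_match(index, prefix, bypass_dictionary):
    """Prefix search, optionally skipping normalization of the query term."""
    term = prefix.lower() if bypass_dictionary else to_lexeme(prefix)
    return any(lexeme.startswith(term) for lexeme in index)


index = build_index("Transferring files")
print(sorted(index))
# Query term normalized: 'transferring' becomes 'transfer' and matches.
print(prefix_match(index, "transferring", bypass_dictionary=False))
# Dictionary bypassed: 'transferring' is not a prefix of any stored lexeme.
print(prefix_match(index, "transferring", bypass_dictionary=True))
# A short prefix still matches with the dictionary bypassed.
print(prefix_match(index, "trans", bypass_dictionary=True))
```

In this model a short prefix such as "trans" still matches with the dictionary bypassed, while the full word does not, which is the kind of new failure case the message describes.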