Re: tsearch2: plainto_tsquery() with OR? - Mailing list pgsql-general
From | Mike Rylander |
---|---|
Subject | Re: tsearch2: plainto_tsquery() with OR? |
Date | |
Msg-id | b918cf3d0708191805t7a7b96d7n6d81fd63c948d37d@mail.gmail.com Whole thread Raw |
In response to | Re: tsearch2: plainto_tsquery() with OR? (cluster <skrald@amossen.dk>) |
List | pgsql-general |
On 8/9/07, cluster <skrald@amossen.dk> wrote: > Thanks for your response! Let me try to elaborate what I meant with my > original post. > > If R is the set of words in the tsvector for a given table row and S is > the set of keywords to search for (entered by e.g. a website user) I > would like to receive all rows for which the intersection between R and > S is nonempty. That is: The row should be return if just there is SOME > match. S does not necessarily need to be a subset of R. > You could just wrap up a simple query in an SQL function called plainto_or_tsquery or the like. CREATE OR REPLACE FUNCTION plainto_or_tsquery (TEXT) RETURNS tsquery AS $$ SELECT to_tsquery( regexp_replace( $1, E'[\\s\'|:&()!]+','|','g') ); $$ LANGUAGE SQL STRICT IMMUTABLE; Paste this into a PG database that has tsearch2 loaded (after creating the above function, of course): select rank_cd(to_tsvector('hi . there web 2.0'), plainto_or_tsquery('hello . web 2.0')), to_tsvector('hi . there web 2.0') @@ plainto_or_tsquery('hello . web 2.0') as matches; > Furthermore I would like a measure for how "nonempty" the intersection > is (we would call this measure "the rank"). > Example: > For R = "three big houses" and S = "three small houses" the rank should > be higher than for R = "three big houses" and S = "four small houses" as > the first case has two words in common while the second case has only one. Both rank() and rank_cd() work fine for all-ORed queries, full match or otherwise. The more "matchy", the better the rank. > > A version of plainto_tsquery() with a simple OR operator instead of AND > would solve this problem somewhat elegant: > 1) I can now use the conventional "tsvector @@ tsquery" syntax in my > WHERE clause as the "@@" operator will return true and thus include the > row in the result. Example: > select to_tsvector('simple', 'three small houses') > @@ 'four|big|houses'::tsquery; > would return "true". > Um... it does. forge=# select to_tsvector('simple', 'three small houses') @@ 'four|big|houses'::tsquery; ?column? ---------- t (1 row) > 2) The rank() of the @@ operator is automatically higher when there is a > good match. > Again, that's already the case. forge=# select rank(to_tsvector('hi . there web 2.0'), plainto_or_tsquery('hello . web 2.0')), rank(to_tsvector('hi . there web 2.0'), plainto_or_tsquery('hi . web 2.0')); rank | rank -----------+----------- 0.0405285 | 0.0607927 (1 row) The second is a better match; "hi" vs "hello" in the queries. > > An example where this OR-version of plainto_tsquery() could be useful is > for websites using tags. Each website entry is associated with some tags > and each user has defined some "tags of interest". The search should > then return all website entries where there is a match (not necessarily > complete) with the users tags of interest. Of course the best matching > entries should be displayed top most. > See above. Though, again, you'd need to put in a little work to make sure everything is completely protected. Probably less time than it's take you to discuss this so far, though. And you'd want to create a 2-param version that could accept the correct tsearch2 config. > > I find it important that this function is a part of tsearch2 itself as: > 1) The user can input arbitrary data. Also potentially harmful data if > they are not escaped right. That's not tsearch2's problem in particular. You should be using parameterized queries in your app (or applying the correct quoting functions) for all data, not just directly user supplied strings. All data is user supplied at some level. > 2) Special characters should be stripped in just the same way as > to_tsvector() does it. E.g. stripping the dot in "Hi . there" but > keeping it in "web 2.0". Only tsearch2 can do that in a clean consistent > way - it would be fairly messy if some thirdparty or especially some > website-developer-homecooked stripping functionality is used for this. > The simple example above uses to_tsquery to do that. Try it out, and if you improve it please feel free to share. On a more general note, IMO this should not be getting in the way of integrating tsearch2 into core. The example above shows how trivial it is to do the "simple" thing, but that's probably not going to be the "right" thing. Any time you find yourself forcing your user to is "x OR y OR z" to get a result is a time you should probably be augmenting your tsvectors (or tsquerys now, with query rewriting) with thesauri. BTW, Google /does/ just AND everything. They just don't tell you about everything they're ANDing. All this is, of course, disregarding the actual utility of an all-OR query (not much in practice, IME) and the speed of such queries on non-trivial datasets (not good -- essentially a scan per ORed component)). Anyway, I hope that helps... --miker
pgsql-general by date: