Re: tsearch2 keep throw-away characters - Mailing list pgsql-general

From Ivan Zolotukhin
Subject Re: tsearch2 keep throw-away characters
Msg-id 751e56400705192234t33abf55s44e2f3aa7c6746ac@mail.gmail.com
In response to tsearch2 keep throw-away characters  (Kimball <kbighorse@gmail.com>)
List pgsql-general
Hello,

Your problem is not about stop words; it's that the tsearch
parser treats the '+' and '#' symbols as lexemes of a blank type (use
the ts_debug() function to see this) and drops them without any further
processing. AFAIK, the typical solution is to rewrite your text,
and then your queries, to use auxiliary words like 'SYScpp' and
'SYScsharp', which will be included in tsvectors and indexed without
any problems. Usually you can do the replacements in the tsvector trigger
when indexing documents, and via query rewriting (in tsearch or your
application) when querying the database.

Trivial examples:

test=# select to_tsvector('english','I know how to code in SYScsharp, java and SYScpp');
                     to_tsvector
------------------------------------------------------
 'code':5 'java':8 'know':2 'syscpp':10 'syscsharp':7
(1 row)

and, sure:

test=# select 'I know how to code in SYScsharp, java and SYScpp' @@ 'SYScpp';
 ?column?
----------
 t
(1 row)
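
The replacement step itself, if done on the application side, could look
roughly like the Python sketch below. This is only an illustration: the
function name rewrite_terms and the REPLACEMENTS table are made up for this
example, and the same function would have to be applied both to the document
text before indexing and to the user's search terms before querying.

```python
import re

# Hypothetical mapping (an assumption for this sketch): terms the parser
# would strip down to 'c', mapped to placeholder words that survive parsing.
REPLACEMENTS = {
    "c++": "SYScpp",
    "c#": "SYScsharp",
}

# Match a term only when it stands alone, i.e. it is not glued to other
# word characters or to additional '+' / '#' symbols on either side.
_PATTERN = re.compile(
    r"(?<![\w+#])("
    + "|".join(re.escape(term) for term in REPLACEMENTS)
    + r")(?![\w+#])",
    re.IGNORECASE,
)


def rewrite_terms(text: str) -> str:
    """Replace each known term with its placeholder word."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], text)


if __name__ == "__main__":
    doc = "I know how to code in C#, java and C++"
    # prints: I know how to code in SYScsharp, java and SYScpp
    print(rewrite_terms(doc))
```

The rewritten string is what you would then hand to to_tsvector() when
indexing, and the rewritten search terms are what you would use to build
the query.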

There might be a more sophisticated solution, like preventing the parser
from treating '++' as blank lexemes, but Oleg will explain this much
better, as soon as he has time.

--
Regards,
Ivan


On 5/16/07, Kimball <kbighorse@gmail.com> wrote:
>
> postgres=# select to_tsvector('default','I know how to code in C#, java and
> C++');
>               to_tsvector
> -------------------------------------
>  'c':7,10 'code':5 'java':8 'know':2
>  (1 row)
>
> postgres=# select to_tsvector('simple','I know how to code in C#, java and
> C++');
>                                to_tsvector
> -------------------------------------------------------------------------
>  'c':7,10 'i':1 'in':6 'to':4 'and':9 'how':3 'code':5 'java':8 'know':2
> (1 row)
>
>
> I'd like to get lexemes/tokens 'c#' and 'c++' out of this query.  Everything
> I can find has to do with stop words.   How do I keep characters that
> tsearch throws out?  I've already tried 'c\#' and 'c\\#' etc, which don't
> work.
>
> Kimball
