Thread: tsearch2 problem

tsearch2 problem

From
"Jodok Batlogg"
Date:
we're using tsearch2 with the German dictionary
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
for full-text search.

the indexing is configured as follows:

CREATE TEXT SEARCH DICTIONARY public.german (
    TEMPLATE = ispell,
    DictFile = german,
    AffFile = german,
    StopWords = german
);

CREATE TEXT SEARCH CONFIGURATION public.default ( COPY = pg_catalog.german );

ALTER TEXT SEARCH CONFIGURATION public.default
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH public.german;

-------------------------

select * from ts_debug('default', 'hundshütte');
works as expected and creates the two lexemes "{hund,hütte}"

BUT

SELECT to_tsvector('default','lovely und bauarbeiter/in');
loses a lot of stuff:
"'bauarbeiter/in':2"

some more debugging shows:

SELECT * from ts_debug('default','lovely und bauarbeiter/in');

"asciiword";"Word, all ASCII";"lovely";"{german}";"german";""
"blank";"Space symbols";" ";"{}";"";""
"asciiword";"Word, all ASCII";"und";"{german}";"german";"{}"
"blank";"Space symbols";" ";"{}";"";""
"file";"File or path
name";"bauarbeiter/in";"{simple}";"simple";"{bauarbeiter/in}"

a) unknown words are just being dropped
b) words with slashes are interpreted as file paths and the first path
is being dropped.

any idea how we can fix this?

jodok

--
Jodok Batlogg, Vorstand

Lovely Systems AG
Telefon +43 5572 908060, Fax +43 5572 908060-77, Mobil +43 664 9636963
Schmelzhütterstraße 26a, 6850 Dornbirn, Austria

Sitz: Dornbirn, FB: Landesgericht Feldkirch, FN: 208859x, UID: ATU51736705
Aufsichtsratsvorsitzender: Christian Lutz
Vorstand: Jodok Batlogg, Manfred Schwendinger

Re: tsearch2 problem

From
Oleg Bartunov
Date:
Jodok,

you got what you defined. Please read the documentation.
In short, a word is not indexed if it is not recognized by any
dictionary in the stack of dictionaries. Put a stemming dictionary,
which recognizes everything, at the end of the stack.
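
For example, a rough (untested) sketch using the built-in Snowball
dictionary pg_catalog.german_stem as the catch-all at the end of the stack:

ALTER TEXT SEARCH CONFIGURATION public.default
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH public.german, pg_catalog.german_stem;

Anything the ispell dictionary does not recognize then falls through to
the stemmer and still gets indexed.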

Oleg

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2 problem

From
"Jodok Batlogg"
Date:
hi oleg,

thanks for your quick response.

2008/10/31 Oleg Bartunov <oleg@sai.msu.su>:
> Jodok,
>
> you got what you defined. Please read the documentation.
> In short, a word is not indexed if it is not recognized by any
> dictionary in the stack of dictionaries. Put a stemming dictionary,
> which recognizes everything, at the end of the stack.

can you point me to "the" documentation where i could find that? i
think i tried hard :)

however - problem a) is fixed. thanks :)
nevertheless i still have the problem that words with '/' are being
interpreted as file paths instead of words. any idea how i could tweak
this?

thanks

jodok

--
Jodok Batlogg, Vorstand

Lovely Systems AG
Telefon +43 5572 908060, Fax +43 5572 908060-77, Mobil +43 664 9636963
Schmelzhütterstraße 26a, 6850 Dornbirn, Austria

Sitz: Dornbirn, FB: Landesgericht Feldkirch, FN: 208859x, UID: ATU51736705
Aufsichtsratsvorsitzender: Christian Lutz
Vorstand: Jodok Batlogg, Manfred Schwendinger

Re: tsearch2 problem

From
Ivan Sergio Borgonovo
Date:
On Fri, 31 Oct 2008 13:10:20 +0300 (MSK)
Oleg Bartunov <oleg@sai.msu.su> wrote:

> Jodok,
>
> you got what you defined. Please read the documentation.
> In short, a word is not indexed if it is not recognized by any
> dictionary in the stack of dictionaries. Put a stemming dictionary,
> which recognizes everything, at the end of the stack.

Could you rephrase?
I have a similar situation whose real solution would be to have 2+
tsvectors (English and Italian), but that currently looks too costly to
implement.

I'd like to have "proper full support" for English so that e.g. it
recognizes plurals etc., and "acceptable" support for Italian so that if I
choose something that's not in the English dictionary... at least it is put
"as is" into the tsvector.

I've built the tsvectors along these lines:
setweight(
  to_tsvector('pg_catalog.english',
   coalesce(FilterCode(catalog_items.Code),'')
), 'A')

No setup of tsearch2 was done; I just installed it and started using the
to_tsvector, to_tsquery, and related functions.

If I run Italian words through the to_ts* functions they mostly remain as
they are, with some exceptions where there is some overlap with English.

So far it looks like an acceptable compromise, but I wouldn't like to have
surprises before I find the resources to actually do what should be done
(fully support the two languages).

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: tsearch2 problem

From
Oleg Bartunov
Date:
On Fri, 31 Oct 2008, Jodok Batlogg wrote:

> hi oleg,
>
> thanks for your quick response,
>
> 2008/10/31 Oleg Bartunov <oleg@sai.msu.su>:
>> Jodok,
>>
>> you got what you defined. Please read the documentation.
>> In short, a word is not indexed if it is not recognized by any
>> dictionary in the stack of dictionaries. Put a stemming dictionary,
>> which recognizes everything, at the end of the stack.
>
> can you point me to "the" documentation where i could find that? i
> think i tried hard :)

well, it's not really hard to find:
http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html

"A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type that
the parser can return, a separate list of dictionaries is specified by the
configuration. When a token of that type is found by the parser, each
dictionary in the list is consulted in turn, until some dictionary recognizes
it as a known word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or searched for.
The general rule for configuring a list of dictionaries is to place first
the most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple,
which recognizes everything."

>
> however - problem a) is fixed. thanks :)
> nevertheless i still have the problem that words with '/' are being
> interpreted as file paths instead of words. any idea how i could tweak
> this?

There are several ways:
1. Use your own parser.
2. Use encode/decode functions, which cheat the default parser. For example,
   encodeslash('aa/bb') -> aaxxxxxxbb. But then you should understand that a
   dictionary like ispell will not be able to recognize the encoded word.
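
Roughly, a sketch of option 2 (untested; encodeslash/decodeslash and the
'xxxxxx' marker are only placeholders, use a marker that cannot occur in
your data):

CREATE OR REPLACE FUNCTION encodeslash(text) RETURNS text AS
    $$ SELECT replace($1, '/', 'xxxxxx') $$
LANGUAGE sql IMMUTABLE;

CREATE OR REPLACE FUNCTION decodeslash(text) RETURNS text AS
    $$ SELECT replace($1, 'xxxxxx', '/') $$
LANGUAGE sql IMMUTABLE;

You then have to run both the documents and the queries through the same
wrapper, e.g. to_tsvector('default', encodeslash('lovely und bauarbeiter/in')).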


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2 problem

From
Oleg Bartunov
Date:
Sergio,

On Fri, 31 Oct 2008, Ivan Sergio Borgonovo wrote:

> On Fri, 31 Oct 2008 13:10:20 +0300 (MSK)
> Oleg Bartunov <oleg@sai.msu.su> wrote:
>
>> Jodok,
>>
>> you got what you defined. Please read the documentation.
>> In short, a word is not indexed if it is not recognized by any
>> dictionary in the stack of dictionaries. Put a stemming dictionary,
>> which recognizes everything, at the end of the stack.
>
> Could you rephrase?
> I have a similar situation whose real solution would be to have 2+
> tsvectors (English and Italian), but that currently looks too costly to
> implement.
>
> I'd like to have "proper full support" for English so that e.g. it
> recognizes plurals etc., and "acceptable" support for Italian so that if I
> choose something that's not in the English dictionary... at least it is put
> "as is" into the tsvector.

So, what's the problem? Create a custom configuration with a dictionary
stack like ispell_en, ispell_it, english_stem.
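
For example, a rough, untested sketch (it assumes you have already created
ispell dictionaries named ispell_en and ispell_it, and uses a made-up
configuration name en_it):

CREATE TEXT SEARCH CONFIGURATION public.en_it ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION public.en_it
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH ispell_en, ispell_it, pg_catalog.english_stem;

Words the English ispell dictionary knows stop there, Italian words are
caught by ispell_it, and anything else falls through to the English stemmer.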

Unfortunately, a stemmer is a very general dictionary, so you can't have two
stemmers in the stack. But you can always write your own dictionary, which
could call english_stem or italian_stem depending on the word. There are
several open-source language recognizers available, like textcat
(http://odur.let.rug.nl/~vannoord/TextCat/), or another implementation at
http://www.mnogosearch.org/guesser/

BTW, it could be a good contribution.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2 problem

From
John DeSoi
Date:
On Oct 31, 2008, at 6:30 AM, Jodok Batlogg wrote:

> nevertheless i still have the problem that words with '/' are being
> interpreted as file paths instead of words. any idea how i could tweak
> this?


The easiest solution I found was to replace '/' with a space before
parsing the text.
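
For instance, a quick sketch against the 'default' configuration from
earlier in the thread:

SELECT to_tsvector('default', replace('lovely und bauarbeiter/in', '/', ' '));

This way the parser sees 'bauarbeiter' and 'in' as ordinary words instead of
a single file-path token.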



John DeSoi, Ph.D.