Thread: Indexing unknown words with Tsearch2

Indexing unknown words with Tsearch2

From: Greg Maitrallain

Hi,

First of all, excuse my poor English :)

I'm working on a full-text database with tsearch2, which contains French
historical writings.
I'm using the fr_ispell dictionary that can be found here:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
(ispell-french.tar.gz
<http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/dicts/ispell/ispell-french.tar.gz>
- submitted by Max Jacob)
The database encoding is LATIN1.

The problem is that the writings contain many names of personalities, for
example Churchill (the database covers WWII). But when I try to search
for these names, nothing is found.

I tried many things, such as following this introduction:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
And I think the root of the problem is that no lexeme is found (I could
even say an empty lexeme is found).

With the default en_stem dictionary, I get this:

SELECT lexize('en_stem', 'churchill');
"{churchil}"

Then I try to add the French dictionary:

INSERT INTO pg_ts_dict
               (SELECT 'fr_ispell',
                       dict_init,
                       'DictFile="/home/.../french.dict",'
                       'AffFile="/home/.../french.aff",'
                       'StopFile="/home/.../french.stop"',
                       dict_lexize
                FROM pg_ts_dict
                WHERE dict_name = 'ispell_template');

And the result is:

SELECT lexize('fr_ispell', 'churchill');
""

My questions are:
- Is it OK to return an empty string for a word that is neither in the
dictionary nor in the stop words?
- Is there a way to get the word itself as a result when the word is
neither in the dictionary nor in the stop words?
- If yes, how?

I'm also interested in any other information you could give me.
Many thanks!

Greg Maitrallain.

Re: Indexing unknown words with Tsearch2

From: Tom Lane

Greg Maitrallain <greg.maitrallain@evodia.fr> writes:
> The problem is that the writings contain many names of personalities, for
> example Churchill (the database covers WWII). But when I try to search
> for these names, nothing is found.

> I tried many things, such as following this introduction:
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
> And I think the root of the problem is that no lexeme is found (I could
> even say an empty lexeme is found).

I think you've misconfigured your dictionary list.  You normally want to
use an ispell dictionary together with some other one, like a snowball
dictionary.  Using it by itself means exactly that only words known to
the dictionary will be indexed.
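
A minimal sketch of such a dictionary list, modeled on the pattern in the
tsearch2 introduction linked above. The configuration name 'default_french'
and the locale 'fr_FR' are hypothetical and must be adapted to the server.
Since en_stem is an English stemmer, the sketch falls back to the 'simple'
dictionary, which is assumed to be present in pg_ts_dict as in a stock
tsearch2 install; a French snowball dictionary, if one is installed, could
take its place. Only the Latin-word token types are mapped here. Words
containing accented characters may be reported under other token types,
which would need similar rows.

```sql
-- Hypothetical configuration name and locale; the locale should match
-- the server's setting (see SHOW lc_ctype).
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
       VALUES ('default_french', 'default', 'fr_FR');

-- Dictionaries are tried in order: fr_ispell first, then 'simple' for
-- any word that fr_ispell does not recognize.
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
       VALUES ('default_french', 'lword', '{fr_ispell,simple}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
       VALUES ('default_french', 'lhword', '{fr_ispell,simple}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
       VALUES ('default_french', 'lpart_hword', '{fr_ispell,simple}');

-- Check what gets indexed for a word that fr_ispell does not know.
SELECT to_tsvector('default_french', 'Churchill');
```

With a list like this, a name that fr_ispell does not know should still
reach the index through the second dictionary instead of being dropped.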

            regards, tom lane