Re: Use of ISpell dictionaries with tsearch2 - what is the point? - Mailing list pgsql-general

From Don Walker
Subject Re: Use of ISpell dictionaries with tsearch2 - what is the point?
Date
Msg-id 001e01c66d3a$ae392830$dbd849c6@donxp
In response to Re: Use of ISpell dictionaries with tsearch2 - what is  (Teodor Sigaev <teodor@sigaev.ru>)
Responses Re: Use of ISpell dictionaries with tsearch2 - what is
List pgsql-general
Are you saying that the English ISpell dictionary isn't particularly useful
for English text if you're using the English stemmer? One of the concerns
that I had about the use of ISpell on English text was that ISpell could
provide two or more alternatives for a single search term that would
increase the number of unique words and hurt performance. The examples I saw
all would have been reduced to a single stem by the English stemmer.

If I have to deal with a mix of English and French, would using a French
ISpell dictionary followed by an English stemmer be the best approach? If
I'm wrong about the use of English ISpell, then what would be the best
sequence, e.g. French ISpell, English ISpell, English stemmer?

-----Original Message-----
From: Teodor Sigaev [mailto:teodor@sigaev.ru]
Sent: May 1, 2006 10:31
To: Don Walker
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Use of ISpell dictionaries with tsearch2 - what is
the point?


> 1. If I am correct about this then what is the point of using the
> ISpell dictionary in the first place?

Yes. The main goal of any dictionary is to 'normalize' a lexeme, i.e. to
reduce it to its base form (for example, the infinitive). This is very
important for languages with variable word forms, such as French, Russian,
Norwegian, etc. So, if dictionaries are used, the user doesn't have to
think about the exact form of a word when searching.

There are two basic approaches for dictionaries: stemming and vocabulary
based. The first tries to remove the variable ending of a word; in tsearch2
these are the snowball dictionaries. The second is ispell: it tries to find
the word in a vocabulary, allowing for some grammatical changes.
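
The difference between the two approaches can be sketched roughly in Python.
This is a toy illustration only, not the Snowball algorithm and not the
ISpell file format: the suffix list, the vocabulary entries, and the function
names are all made up for the example. The fallback order (vocabulary first,
stemmer last) is an assumption that mirrors the "ISpell followed by a
stemmer" sequence asked about above.

```python
# Toy sketch: NOT Snowball and NOT a real ISpell dictionary.
# SUFFIXES and VOCAB are hypothetical data chosen for illustration.

SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Stemming approach: strip a variable ending; no vocabulary needed."""
    w = word.lower()
    for suf in SUFFIXES:
        # Keep at least three characters so very short words are not mangled.
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: -len(suf)]
    return w

# Vocabulary approach: map known inflected forms to a base form.
VOCAB = {"went": "go", "goes": "go", "mice": "mouse", "chevaux": "cheval"}

def toy_lookup(word):
    """Return the base form if the word is in the vocabulary, else None."""
    return VOCAB.get(word.lower())

def normalize(word):
    """Try the vocabulary first; fall back to the stemmer (assumed order)."""
    return toy_lookup(word) or toy_stem(word)
```

The stemmer handles any regular word without a word list but cannot relate
irregular forms such as "went" and "go"; the vocabulary handles those but
knows nothing about words it does not contain.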

>
> 2. Is there a solution for correcting spelling mistakes in the
> documents you index? I have seen the readme files for pg_trgm,
> http://www.sai.msu.su/~megera/postgres/gist/, which would allow me to
> suggest other terms for a query if the misspellings were common
> enough. I'd rather fix the problem at index time so that querying with
> the proper term would find any misspelled terms (within reason).

It's possible, but it may produce unpredictable search results. An
example off the top of my head (sorry, it's Russian):

horosho - good ('sh' in russian is one character)
herovo  - bad  ( slang )

horovo - where is the typo? In the second character or the 5th? If we correct
this to one or both variants, the user will get 'bad' when searching for 'good'...
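
The ambiguity can be checked with a plain edit-distance calculation. The
sketch below is a standard Levenshtein distance in Python, not anything that
tsearch2 or pg_trgm does internally. As an assumption for the example, the
single Russian character transliterated as 'sh' is written as 'S' so that it
counts as one character; the helper name candidates is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]

# 'S' stands in for the one Russian character written 'sh' in the example.
GOOD = "horoSo"   # horosho, 'good'
BAD = "herovo"    # herovo, 'bad' (slang)
TYPO = "horovo"   # the misspelled word

def candidates(word, vocabulary, max_dist=1):
    """Hypothetical helper: vocabulary words within max_dist edits of word."""
    return [v for v in vocabulary if edit_distance(word, v) <= max_dist]
```

Both words are exactly one substitution away from the misspelling, so
candidates(TYPO, [GOOD, BAD]) returns both, and a distance-based corrector
has no basis for preferring one over the other.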

> 2.1 Are there any canned synonym dictionaries available that deal with
> misspellings in English and/or French?
> 2.2 Are there any clever linguistic algorithms that can partly solve
> the same problem?

Ask linguists :).

