Thread: Use of ISpell dictionaries with tsearch2 - what is the point?

Use of ISpell dictionaries with tsearch2 - what is the point?

From: Don Walker
I'm new to using tsearch2 and am trying to understand why I would want to
use an ISpell dictionary before the stemming dictionary. I'd originally
hoped that ISpell would suggest corrections for misspelled words, as the
documents I will be indexing contain a lot of spelling mistakes. From what
I now understand, ISpell dictionaries only recognize properly spelled
words. This means that any misspelled word will be handled by the stemming
dictionary and usually just passed through as is.

1. If I am correct about this, then what is the point of using the ISpell
dictionary in the first place?

2. Is there a solution for correcting spelling mistakes in the documents you
index? I have seen the readme files for pg_trgm,
http://www.sai.msu.su/~megera/postgres/gist/, which would allow me to
suggest other terms for a query if the misspellings were common enough. I'd
rather fix the problem at index time so that querying with the proper term
would find any misspelled terms (within reason).
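
A minimal sketch of that pg_trgm approach, assuming the module is installed
and a hypothetical table "words" holding the distinct terms pulled from the
indexed documents (table and column names are made up for illustration):

    -- hypothetical table of distinct words from the indexed documents
    CREATE TABLE words (word text);
    CREATE INDEX words_trgm_idx ON words USING gist (word gist_trgm_ops);

    -- find candidate corrections for a (possibly misspelled) query term;
    -- '%' is pg_trgm's similarity operator, similarity() returns 0..1
    SELECT word, similarity(word, 'mispeling') AS sml
    FROM words
    WHERE word % 'mispeling'
    ORDER BY sml DESC, word;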


Re: Use of ISpell dictionaries with tsearch2 - what is the point?

From: Don Walker
I realized shortly after I sent this email that I could use a synonym
dictionary to solve problem #2. To construct it myself I'd have to determine
the common misspellings and create synonyms for them. So I have two more
questions:

2.1 Are there any canned synonym dictionaries available that deal with
misspellings in English and/or French?

2.2 Are there any clever linguistic algorithms that can partly solve the
same problem?
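
As a rough sketch of that synonym-dictionary idea (the file name, the
dictionary name 'syn_fix', and how the dictionary gets registered are all
assumptions that depend on the local tsearch2 installation):

    -- misspellings.syn: one "misspelling  correct-form" pair per line, e.g.
    --   langauge    language
    --   recieve     receive

    -- once a synonym dictionary pointing at that file is registered
    -- (here assumed to be named 'syn_fix'), it can be checked with lexize():
    SELECT lexize('syn_fix', 'langauge');  -- expected to return {language}

Placed at the front of the dictionary queue, such a dictionary would rewrite
the common misspellings to their correct form before the stemmer ever sees
them.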



Re: Use of ISpell dictionaries with tsearch2 - what is the point?

From: Teodor Sigaev
> 1. If I am correct about this then what is the point of using the ISpell
> dictionary in the first place?

Yes. The main goal of any dictionary is to normalize a lexeme, i.e. to
reduce it to its base form. That is very important for languages with
highly variable word forms, such as French, Russian, Norwegian, etc. If
dictionaries are used, the user doesn't have to think about the exact form
of a word when searching.

There are two basic approaches for dictionaries: stemming and
vocabulary-based. The first tries to remove the variable ending of a word;
in tsearch2 these are the snowball dictionaries. The second is ispell,
which tries to find the word in a vocabulary, allowing for some grammatical
changes.
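
A quick way to see the difference is tsearch2's lexize() function; 'en_stem'
is the snowball stemmer shipped with tsearch2, while 'en_ispell' stands for
an English ispell dictionary that has to be registered separately (both
names depend on the local configuration):

    -- stemming: the ending is stripped algorithmically
    SELECT lexize('en_stem', 'relations');    -- something like {relat}

    -- ispell: the word is looked up and its normal form(s) returned
    SELECT lexize('en_ispell', 'relations');  -- something like {relation}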

>
> 2. Is there a solution for correcting spelling mistakes in the documents you
> index? I have seen the readme files for pg_trgm,
> http://www.sai.msu.su/~megera/postgres/gist/, which would allow me to
> suggest other terms for a query if the misspellings were common enough. I'd
> rather fix the problem at index time so that querying with the proper term
> would find any misspelled terms (within reason).

It's possible, but it may produce unpredictable search results. An
off-the-top-of-my-head example (sorry, it's Russian):

horosho - good ('sh' is a single character in Russian)
herovo  - bad  (slang)

horovo - where is the typo, in the second character or the fifth? If we
correct it to one or both variants, the user will get 'bad' hits for the
search query 'good'...
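
Trigram similarity can quantify how close each candidate is, but it does not
resolve this kind of ambiguity; a quick pg_trgm check (the actual values are
not shown here and will depend on the data):

    -- both candidate corrections are about equally close to the typo,
    -- so neither can be chosen automatically with any confidence
    SELECT similarity('horovo', 'horosho') AS to_good,
           similarity('horovo', 'herovo')  AS to_bad;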

> 2.1 Are there any canned synonym dictionaries available that deal with
> misspellings in English and/or French?
> 2.2 Are there any clever linguistic algorithms that can partly solve
> the same problem?

Ask linguists :).

Re: Use of ISpell dictionaries with tsearch2 - what is the point?

From: Don Walker
Are you saying that the English ISpell dictionary isn't particularly useful
for English text if you're using the English stemmer? One of the concerns
that I had about the use of ISpell on English text was that ISpell could
provide two or more alternatives for a single search term that would
increase the number of unique words and hurt performance. The examples I saw
would all have been reduced to a single stem by the English stemmer.

If I have to deal with a mix of English and French would using a French
ISpell dictionary followed by an English stemmer be the best approach? If
I'm wrong about the use of English ISpell, then what would be the best
sequence, e.g. French ISpell, English ISpell, English stemmer?



Re: Use of ISpell dictionaries with tsearch2 - what is the point?

From: Oleg Bartunov
Don,

let me answer your original question,
"Use of ISpell dictionaries with tsearch2 - what is the point?".

The purpose of dictionaries in search engines is to help people search for
words without worrying about their different forms (declension,
inflection, ...). Dictionaries can be used to process queries as well as
during indexing. You may store the original form of a word and/or its stem.
The most complete index stores both variants and can provide exact search,
but at the cost of index size. Historically, since tsearch2 was based on
GiST storage, which is quite sensitive to the number of unique words, we
store only stems. This might change in the future, since we can now use an
inverted index with tsearch2.
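
A small illustration of the "only stems are stored" point, using tsearch2's
default configuration (the exact lexemes depend on the dictionaries
installed):

    -- the inflected forms collapse to a single stemmed lexeme
    SELECT to_tsvector('default', 'book books booked booking');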

An ISpell dictionary is an (open-source) way to find a word's stem(s);
their quality varies a lot between languages. We use the Russian ispell
dictionary and have found it rather useful. Of course, since a real
language is much more complex than ispell rules, there are errors, which
produce "noise" in search results. An ispell dictionary can return several
normal forms for one word; for example, 'booking' has two base forms -
booking and book.
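
For example, with lexize() (again, 'en_ispell' is an assumed name for a
registered English ispell dictionary):

    -- an ispell dictionary may return several normal forms for one word
    SELECT lexize('en_ispell', 'booking');  -- something like {booking,book}

    -- while the stemmer returns a single form
    SELECT lexize('en_stem', 'booking');    -- something like {book}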

Ispell dictionaries support many ways of word building, but they are
difficult to build and maintain. That's why various stemming algorithms
have become popular; read
http://snowball.tartarus.org/texts/introduction.html for a good
introduction. We chose the snowball stemmer since it's open source and
written/supported by the well-known Martin Porter.

For each lexeme class there is a configurable dictionary queue (in
pg_ts_cfgmap). A lexeme passes through this queue until it is recognized by
some dictionary (currently there is no way for a dictionary to recognize a
lexeme and still pass it on to the next one). It is sensible to begin with
very specific dictionaries (topic-related, synonym) and finish the queue
with the most general dictionary, such as 'simple' or a stemmer, which
recognizes everything :)
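
The queue for a given token type can be inspected directly; the
configuration name 'default' and the token alias 'lword' (ordinary Latin
words) are the usual tsearch2 ones, but may differ on a given installation:

    -- show which dictionaries ordinary words pass through, in order
    SELECT ts_name, tok_alias, dict_name
    FROM pg_ts_cfgmap
    WHERE ts_name = 'default' AND tok_alias = 'lword';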

The specific configuration depends very much on the language, the
availability of good dictionaries, and the goals of the search engine. The
snowball stemmer works well for English, since English word formation is
mostly suffix-oriented (I might be wrong here!), so in the absence of a
good ispell dictionary one could use just the snowball stemmer. On the
other hand, for Russian we have a good ispell dictionary, which is actively
developed and supported, and Russian word building is quite complex, so we
definitely recommend using the ispell dictionary before the snowball
stemmer.

It's quite difficult to index a mix of several languages that share common
characters, since there is no way to recognize the language. I'd definitely
warn you against using a stemmer anywhere except at the very end of the
queue, since it recognizes everything and no dictionaries after it will
ever be used. Hopefully, any useful text has only one main language. If,
for example, the main language is French and the second one is English,
I'd use: French Ispell, English Ispell, French stemmer.
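
Expressed against pg_ts_cfgmap, that recommendation might look roughly like
the following (the dictionary names and token aliases are assumptions for
illustration, and the ispell dictionaries would have to be registered
first):

    -- French ispell first, English ispell second, French stemmer as catch-all
    UPDATE pg_ts_cfgmap
       SET dict_name = '{fr_ispell,en_ispell,fr_stem}'
     WHERE ts_name = 'default'
       AND tok_alias IN ('lword', 'lhword', 'lpart_hword');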

Oleg

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83