Thread: Re: [OpenFTS-general] AW: tsearch2, ispell, utf-8 and german special characters

Re: [OpenFTS-general] AW: tsearch2, ispell, utf-8 and german special characters

From
"Markus Wollny"
Date:
Hi!

ts2test=# select * from ts_debug('Jeden Tag wird man ein bisschen weiser');
    ts_name     | tok_type | description |  token   |  dict_name  |  tsvector
----------------+----------+-------------+----------+-------------+------------
 default_german | lword    | Latin word  | Jeden    | {de_ispell} |
 default_german | lword    | Latin word  | Tag      | {de_ispell} | 'tag'
 default_german | lword    | Latin word  | wird     | {de_ispell} |
 default_german | lword    | Latin word  | man      | {de_ispell} |
 default_german | lword    | Latin word  | ein      | {de_ispell} | 'eint'
 default_german | lword    | Latin word  | bisschen | {de_ispell} | 'bisschen'
 default_german | lword    | Latin word  | weiser   | {de_ispell} | 'weise'
(7 rows)

cat german.stop|grep ^ein$
ein

'jeden', 'man', 'wird' and 'ein' are all in german.stop; the first three words are correctly recognized as stopwords,
whereas the last one is converted to 'eint', although 'ein' is a stopword, too. I still don't understand what exactly is
happening and whether I should be concerned by that sort of "wrong guess" - 'ein' is just converted to 'eint' every time,
no matter if it's in the stopwords file or not. On the other hand, as this applies to to_tsvector(), to_tsquery()
and lexize(), this behaviour would be consistent throughout tsearch2 - thus making any search containing 'ein' a little
bit fuzzier, but nonetheless still usable. It's still some sort of cosmetic bug, though, but I guess that's probably due
to German being somewhat less IT-friendly than English.

Kind regards

   Markus

-----Original Message-----
From:    Oleg Bartunov [mailto:oleg@sai.msu.su]
Sent:    Wed 7/21/2004 22:24
To:    Markus Wollny
Cc:    pgsql-general@postgresql.org; openfts-general@lists.sourceforge.net
Subject:    Re: AW: [OpenFTS-general] AW: [GENERAL] tsearch2, ispell, utf-8 and german special characters
On Wed, 21 Jul 2004, Markus Wollny wrote:

>
> Hi!
>
> > -----Original Message-----
> > From: openfts-general-admin@lists.sourceforge.net
> > [mailto:openfts-general-admin@lists.sourceforge.net] On
> > behalf of Markus Wollny
> > Sent: Wednesday, 21 July 2004 17:04
> > To: Oleg Bartunov
> > Cc: pgsql-general@postgresql.org;
> > openfts-general@lists.sourceforge.net
> > Subject: [OpenFTS-general] AW: [GENERAL] tsearch2, ispell,
> > utf-8 and german special characters
>
> > The issue with the unrecognized stop-word 'ein' which is
> > converted by to_tsvector to 'eint' remains however. Now
> > here's as much detail as I can provide:
> >
> > Ispell is Version  3.1.20 10/10/95, patch 1.
>
> I've just upgraded Ispell to the latest version (International Ispell Version 3.2.06 08/01/01), but that didn't help;
> by now I think it might be something to do with a German language peculiarity or with something in the German
> dictionary. In german.med, there is an entry

ispell itself isn't used in tsearch2, only its dict and aff files!

>
> eint/EGPVWX
>
> So the ts_vector output is just a bit like a wrong guess. Doesn't it evaluate the stopword-list first before doing
> the lookup in the Ispell-dictionary?

Yes. There is a very useful function for debugging that I always recommend:
ts_debug. See my notes (http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes)
for examples.



>
> Kind regards
>
>    Markus Wollny
>
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83




Re: [OpenFTS-general] AW: tsearch2, ispell, utf-8 and

From
Oleg Bartunov
Date:
Markus,

I was not quite correct - different dictionaries handle stop words in
different ways! Stemmers check the stop-word list before normalization, while
ispell checks it after normalization. So, in your case, you need 'eint' listed
in the stop-word list.
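
To make the ordering concrete, here is a rough toy model in Python. It is only a sketch, not tsearch2 code: the word lists, the NORMAL_FORMS mapping and both function names are made up for illustration, and the real dictionaries work from the dict/aff files rather than a lookup table.

```python
# Toy model only: these word lists stand in for german.stop and for the
# result of an ispell dict/aff lookup. They are assumptions, not real data.
STOP_WORDS = {"jeden", "wird", "man", "ein"}

# Hypothetical normalization table; unknown words fall back to themselves.
NORMAL_FORMS = {"ein": "eint", "weiser": "weise"}


def stemmer_style(token, stop_words):
    """Check the stop list first, then normalize (the stemmer order)."""
    word = token.lower()
    if word in stop_words:
        return None  # dropped as a stop word before normalization
    return NORMAL_FORMS.get(word, word)


def ispell_style(token, stop_words):
    """Normalize first, then check the stop list (the ispell order)."""
    lexeme = NORMAL_FORMS.get(token.lower(), token.lower())
    if lexeme in stop_words:
        return None  # dropped only if the normalized form is listed
    return lexeme


if __name__ == "__main__":
    print(stemmer_style("ein", STOP_WORDS))             # None
    print(ispell_style("ein", STOP_WORDS))              # eint
    print(ispell_style("ein", STOP_WORDS | {"eint"}))   # None
```

In this model 'Jeden', 'wird' and 'man' are still dropped in the ispell order because their normalized form is unchanged, while 'ein' slips through as 'eint' until 'eint' itself is added to the stop list.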


    Oleg

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83