Thread: Snowball and ispell in tsearch2

Snowball and ispell in tsearch2

From
Teodor Sigaev
Date:
We have received a lot of requests to include stemmers and ispell dictionaries
for all available languages in tsearch2. I understand that this would bring
tsearch2 closer to the end user. But the sources of the snowball stemmers are
about 800kb, and each ispell dictionary takes about 0.5-2M. All sizes are with
compression. I am afraid that is too big...

What are opinions?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Snowball and ispell in tsearch2

From
Markus Schiltknecht
Date:
Hello Teodor,

I've just recently implemented an advanced full-text search function on 
top of tsearch2. Searching through the manuals and websites to get the 
snowball stemmer and compile my own module took me way too long. I'd 
rather go fetch a cup of coffee during a 30 minute download...

That said, I don't necessarily mean that all stemmers must be included 
in CVS or such. It should just be simpler for the database administrator 
to install ispell or stemmer 'modules'. The ideal solution would 
be to provide packages for each language (in Debian, Fedora, etc.).

Perhaps we can put together the source code for all available language 
modules and provide scripts to fetch ispell data or to generate the 
snowball stemmers. A Debian package maintainer would have to fetch all 
the data to generate all language packages. Someone else might just want 
to download and compile a Norwegian snowball stemmer.
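
As a rough illustration of the two use cases above (a packager building everything, a single user building one language), a driver script could take either a list of languages or "all". This is only a sketch: the language list, the function name plan_modules and the dict_snowball_<language> module names are hypothetical and do not correspond to anything that exists in tsearch2 or Snowball.

```shell
#!/bin/sh
# Hypothetical sketch: none of these names exist in tsearch2 or Snowball.

# Languages the (imaginary) source bundle knows about.
KNOWN_LANGS="english german norwegian russian"

# Print one build step per requested language.
# "all" expands to every known language (the packager's case).
plan_modules() {
    if [ "$1" = "all" ]; then
        set -- $KNOWN_LANGS
    fi
    for lang in "$@"; do
        case " $KNOWN_LANGS " in
            *" $lang "*) echo "build dict_snowball_$lang" ;;
            *) echo "unknown language: $lang" >&2; return 1 ;;
        esac
    done
}
```

A package maintainer would call plan_modules all, while someone who only needs one stemmer would call plan_modules norwegian. A real script would replace the echo with the actual fetch and compile steps.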

I'd be willing to help with such a project. I have experience with 
tsearch2 as well as with Gentoo and Debian packaging. I can't help with 
RPM, though.

Regards

Markus



Re: Snowball and ispell in tsearch2

From
Teodor Sigaev
Date:
> 800kb, each ispell dictionaries will takes about 0.5-2M. All sizes are 
Sorry, withOUT compression...

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Snowball and ispell in tsearch2

From
John Jawed
Date:
OpenFTS ebuild: http://bugs.gentoo.org/show_bug.cgi?id=135859

It has a USE flag for the snowball stemmer. I can take care of
packaging for Gentoo if it will free up time for you to work on other
distros.

John

P.S. Upstream package size isn't, and shouldn't be, an issue; it should
be left to the packaging systems to fetch only what is needed.

On 6/7/06, Markus Schiltknecht <markus@bluegap.ch> wrote:

> That said, I don't necessarily mean that all stemmers must be included
> in CVS or such. It should just be simpler for the database administrator
> to install ispell or stemmer 'modules'. A non-plus-ultra solution would
> be to provide packages for each language (in debian or fedora, etc..).
>
> I'd be willing to help with such a project. I have experience with
> tsearch2 as well as with gentoo and debian packaging. I can't help with
> rpm, though.
>
> Regards
>
> Markus


Re: Snowball and ispell in tsearch2

From
Christopher Kings-Lynne
Date:
> We got a lot requests about including stemmers and ispell dictionaries 
> for all accessible languages into tsearch2. I understand that tsearch2 
> will be closer to end user. But sources of snowball stemmers  is about 
> 800kb, each ispell dictionaries will takes about 0.5-2M. All sizes are 
> sized with compression. I am afraid that is too big size...
> 
> What are opinions?

Maybe put it on pgFoundry?



Re: Snowball and ispell in tsearch2

From
Christopher Kings-Lynne
Date:
> Perhaps we can put together the source code for all languages modules 
> available and provide scripts to fetch ispell data or to generate the 
> snowball stemmers. A debian package maintainer would have to fetch all 
> the data to generate all language packages. Someone else might just want 
> to download and compile a norwegian snowball stemmer.
> 
> I'd be willing to help with such a project. I have experience with 
> tsearch2 as well as with gentoo and debian packaging. I can't help with 
> rpm, though.


I could help with a FreeBSD package I suppose.



Re: Snowball and ispell in tsearch2

From
Christopher Kings-Lynne
Date:
>> I'd be willing to help with such a project. I have experience with 
>> tsearch2 as well as with gentoo and debian packaging. I can't help 
>> with rpm, though.
> 
> I could help with a FreeBSD package I suppose.

Although I should probably finish up those damn GIN docs first :)



Re: Snowball and ispell in tsearch2

From
Teodor Sigaev
Date:
> Maybe putting it on pgFoundry?

Hmm, that's an option. We can create a project 'tsearch2_dict', and there I'll
place a contrib module that builds all the Snowball stemmers. Right now I'm
working on supporting OpenOffice's dictionaries in tsearch2, so it will be
simple to add them to the packaging system.

I suggest that somebody manage the packages or package builders for the
different packaging systems in the same CVS (sorry, I don't have any experience
with those systems).

BTW, it would be good if the packaging worked with a "maked" (already built)
postgres tree, something like:
% cd PGSQL/contrib/tsearch2
% make LANG=norwegian
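
A packaging script could wrap the two commands above roughly as follows. This is a minimal sketch under stated assumptions: LANG=<language> is only the interface proposed here, not an existing tsearch2 make variable, and the function name stemmer_make_cmd is made up. The check relies on the fact that configure writes src/Makefile.global into the postgres source tree. Note also that LANG is the standard locale environment variable, so a real implementation might prefer a different variable name.

```shell
#!/bin/sh
# Sketch only: LANG=<language> is the interface proposed above, not an
# existing tsearch2 make variable; stemmer_make_cmd is a hypothetical name.

# Print the make command for one language, after checking that the tree
# at $1 has at least been configured (configure writes src/Makefile.global).
stemmer_make_cmd() {
    pgsql=$1
    lang=$2
    if [ ! -f "$pgsql/src/Makefile.global" ]; then
        echo "$pgsql does not look like a configured postgres tree" >&2
        return 1
    fi
    echo "make -C $pgsql/contrib/tsearch2 LANG=$lang"
}
```

The function only prints the command, so a packager can inspect it before running it, for example with eval "$(stemmer_make_cmd PGSQL norwegian)".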


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Snowball and ispell in tsearch2

From
Teodor Sigaev
Date:
> I'll place contrib module which will make all Snowball stemmers. Right 
> now I'm working on supporting OpenOffice's dictionaries in tsearch2, so 
> it will be simple to add it to packaging system.

Done: http://archives.postgresql.org/pgsql-committers/2006-06/msg00112.php

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/