Thread: Snowball and ispell in tsearch2
We have received a lot of requests to include stemmers and ispell dictionaries for all available languages in tsearch2. I understand that this would bring tsearch2 closer to the end user. But the sources of the Snowball stemmers are about 800 kB, and each ispell dictionary takes about 0.5-2 MB. All sizes are given with compression. I am afraid that is too big...

What are your opinions?

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Hello Teodor,

I've just recently implemented an advanced full-text search function on top of tsearch2. Searching through the manuals and websites to get the Snowball stemmer and compile my own module took me way too long. I'd rather go fetch a cup of coffee during a 30 minute download...

That said, I don't necessarily mean that all stemmers must be included in CVS or such. It should just be simpler for the database administrator to install ispell or stemmer 'modules'. The ideal solution would be to provide packages for each language (in Debian or Fedora, etc.).

Perhaps we can put together the source code for all available language modules and provide scripts to fetch ispell data or to generate the Snowball stemmers. A Debian package maintainer would have to fetch all the data to generate all language packages. Someone else might just want to download and compile a Norwegian Snowball stemmer.

I'd be willing to help with such a project. I have experience with tsearch2 as well as with Gentoo and Debian packaging. I can't help with RPM, though.

Regards

Markus
> 800kb, each ispell dictionaries will takes about 0.5-2M. All sizes are

Sorry, withOUT compression...

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
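For a sense of scale, the figures quoted in this thread can be turned into a rough back-of-the-envelope estimate of the uncompressed footprint. This is only a sketch: the 800 kB and 0.5-2 MB values come from the messages above, while the helper name `bundle_kb`, the count of 15 dictionaries, and the convention 1 MB = 1024 kB are assumptions made for illustration.

```shell
#!/bin/sh
# Rough estimate of the uncompressed size of bundling everything, in kB.
# The per-item figures come from the thread; the helper name and the number
# of dictionaries are illustrative assumptions, not measurements.

SNOWBALL_KB=800          # Snowball stemmer sources, all languages together

# bundle_kb N PER_DICT_KB: Snowball sources plus N ispell dictionaries
bundle_kb() {
    echo $(( SNOWBALL_KB + $1 * $2 ))
}

LANGS=15                          # hypothetical number of ispell dictionaries
low=$(bundle_kb "$LANGS" 512)     # lower bound: 0.5 MB per dictionary
high=$(bundle_kb "$LANGS" 2048)   # upper bound: 2 MB per dictionary
echo "between ${low} kB and ${high} kB for ${LANGS} dictionaries"
```

Changing `LANGS` shows how quickly the dictionaries, rather than the stemmer sources, come to dominate the total.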
OpenFTS ebuild: http://bugs.gentoo.org/show_bug.cgi?id=135859

It has a USE flag for the Snowball stemmer. I can take care of packaging for Gentoo if it will free up time for you to work on other distros.

John

PS: upstream package size isn't, and shouldn't be, an issue; it should be left to the packaging systems to discretely fetch what is needed.

On 6/7/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
> That said, I don't necessarily mean that all stemmers must be included
> in CVS or such. It should just be simpler for the database administrator
> to install ispell or stemmer 'modules'. A non-plus-ultra solution would
> be to provide packages for each language (in debian or fedora, etc..).
>
> I'd be willing to help with such a project. I have experience with
> tsearch2 as well as with gentoo and debian packaging. I can't help with
> rpm, though.
> We got a lot requests about including stemmers and ispell dictionaries
> for all accessible languages into tsearch2. I understand that tsearch2
> will be closer to end user. But sources of snowball stemmers is about
> 800kb, each ispell dictionaries will takes about 0.5-2M. All sizes are
> sized with compression. I am afraid that is too big size...
>
> What are opinions?

Maybe putting it on pgFoundry?
> Perhaps we can put together the source code for all languages modules
> available and provide scripts to fetch ispell data or to generate the
> snowball stemmers. A debian package maintainer would have to fetch all
> the data to generate all language packages. Someone else might just want
> to download and compile a norwegian snowball stemmer.
>
> I'd be willing to help with such a project. I have experience with
> tsearch2 as well as with gentoo and debian packaging. I can't help with
> rpm, though.

I could help with a FreeBSD package I suppose.
>> I'd be willing to help with such a project. I have experience with
>> tsearch2 as well as with gentoo and debian packaging. I can't help
>> with rpm, though.
>
> I could help with a FreeBSD package I suppose.

Although I should probably finish up those damn GIN docs first :)
> Maybe putting it on pgFoundry?

Hmm, that's an option. We can create a project 'tsearch2_dict', and there I'll place a contrib module that builds all the Snowball stemmers. Right now I'm working on supporting OpenOffice's dictionaries in tsearch2, so it will be simple to add it to the packaging system.

I suggest that somebody manage the packages or package builders for the different packaging systems in the same CVS (sorry, I haven't any experience with those systems).

BTW, it would be good if the packaging worked with an already built ("maked") postgres, something like:

% cd PGSQL/contrib/tsearch2
% make LANG=norwegian

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
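As a sketch of how a packaging script might drive the per-language build suggested above, the loop below prints one command per requested language. It is a dry run only: `make LANG=<language>` is the interface proposed in this thread, not an existing tsearch2 make target, and the function name and the language list are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print one build command per requested language.
# "make LANG=<language>" is the interface proposed in the thread; it is
# assumed here and is not an existing tsearch2 make target.

print_build_cmds() {
    for lang in "$@"; do
        printf 'make LANG=%s\n' "$lang"
    done
}

# Hypothetical language selection for a packaging script
print_build_cmds norwegian german russian
```

A packager could review the printed commands before running them from the contrib directory.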
> I'll place contrib module which will make all Snowball stemmers. Right
> now I'm working on supporting OpenOffice's dictionaries in tsearch2, so
> it will be simple to add it to packaging system.

done, http://archives.postgresql.org/pgsql-committers/2006-06/msg00112.php

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/