Re: Tsearch2 Dutch snowball stemmer in PG8.1 - Mailing list pgsql-general
From | Oleg Bartunov |
---|---|
Subject | Re: Tsearch2 Dutch snowball stemmer in PG8.1 |
Date | |
Msg-id | Pine.LNX.4.64.0710031859151.3304@sn.sai.msu.ru Whole thread Raw |
In response to | Re: Tsearch2 Dutch snowball stemmer in PG8.1 (Alban Hertroys <a.hertroys@magproductions.nl>) |
List | pgsql-general |
On Wed, 3 Oct 2007, Alban Hertroys wrote: > Oleg Bartunov wrote: >> Alban, >> >> the documentation you're refereed on is for upcoming 8.3 release. >> For 8.1 and 8.2 you need to do all machinery by hand. It's not >> difficult, for example: > > Thanks Oleg. > I think I managed to do this right, although I had to google for some of > the files (we don't have ispell installed). > > You also seem to have mixed russion and english dictionaries in your > example, I'm not sure that was on purpose? yes, we index mixed content > > Anyway, I changed your example to use dutch dictionaries and locale > where I thought it applicable, and I got something working apparently. > Quite some guess work was involved, so I have a few questions left. > > The only odd thing is that to_tsvector('dutch', 'some dutch text') now > returns '|' for stop words... Could you packed your dictionary files and .sql, so we look on them in spare time. > > For example: > select to_tsvector('nederlands', 'De beste stuurlui staan aan wal'); > to_tsvector > ------------------------------------------------ > '|':1,5 'bes':2 'wal':6 'staan':4 'stuurlui':3 > > > A minor nit... You ended the script with a hidden commit (END;). I would > have preferred to experiment with the results a bit before commiting... this is up to you. It was just an example > > I mixed in a few questions below, if you could answer them please? > >> -- sample tsearch2 configuration for search.postgresql.org >> -- Creates configuration 'pg' - default, should match server's locale !!! >> -- Change 'ru_RU.UTF-8' >> >> begin; >> >> -- create special (default) configuration 'pg' >> update pg_ts_cfg set locale=NULL where locale = 'ru_RU.UTF-8'; > > I suppose this disables a possibly existing stemmer for that locale? no, it's just to have one (default) configuration 'pg' for locale 'ru_RU.UTF-8'. You can skip this. > >> insert into pg_ts_cfg values('pg','default','ru_RU.UTF8'); >> >> -- register 'pg_dict' dictionary using synonym template >> -- postgres pg >> -- pgsql pg >> -- postgresql pg >> insert into pg_ts_dict >> (select 'pg_dict',dict_init, >> '/usr/local/pgsql-dev/share/contrib/pg_dict.txt', >> dict_lexize, 'pg-specific dictionary' >> from pg_ts_dict >> where dict_name='synonym' >> ); >> >> -- register ispell dictionary, check paths and stop words >> -- I used iconv for english files, since there are some cyrillic stuff >> insert into pg_ts_dict >> (SELECT 'en_ispell', dict_init, >> 'DictFile="/usr/local/share/dicts/ispell/utf8/english-utf8.dict",' >> 'AffFile="/usr/local/share/dicts/ispell/utf8/english-utf8.aff",' >> 'StopFile="/usr/local/share/dicts/ispell/utf8/english-utf8.stop"', >> dict_lexize >> FROM pg_ts_dict >> WHERE dict_name = 'ispell_template' >> ); > > I actually use a .lat file here. I have no idea whether that's > compatible (but it appears to have worked). it's just filenames, no matter (for 8.1,8.2) > > I got my .lat and .aff files from: > http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Dutch-dicts You can use myspell dictionaries. > > My stop words file is from: > http://snowball.tartarus.org/algorithms/dutch/stop.txt > >> -- use the same stop-word list as 'en_ispell' dictionary >> UPDATE pg_ts_dict set dict_initoption='/usr/local/share/dicts/english.stop' >> where dict_name='en_stem'; > > Why change the stop words for the English dictionary? I skipped this > step. Is that right? I wanted to have the same list of stop words for ispell and snowball. > >> -- default token<->dicts mappings >> insert into pg_ts_cfgmap select 'pg', tok_alias, dict_name from >> public.pg_ts_cfgmap where ts_name='default'; >> >> -- modify mappings for latin words for configuration 'pg' >> update pg_ts_cfgmap set dict_name = '{pg_dict,en_ispell,en_stem}' >> where tok_alias in ( 'lword', 'lhword', 'lpart_hword' ) >> and ts_name = 'pg'; >> >> -- we won't index/search some tokens >> update pg_ts_cfgmap set dict_name = NULL >> --where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float','word') >> where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float') >> and ts_name = 'pg'; >> >> end; >> >> -- testing >> >> select * from ts_debug(' >> PostgreSQL, the highly scalable, SQL compliant, open source >> object-relational >> database management system, is now undergoing beta testing of the next >> version of our software: PostgreSQL 8.2. >> '); >> >> >> Oleg > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
pgsql-general by date: