Thread: Tsearch2 Dutch snowball stemmer in PG8.1
Hello, I'm trying to get a Dutch snowball stemmer in Postgres 8.1, but I can't find how to do that. I found CREATE FULLTEXT DICTIONARY commands in the tsearch2 docs on http://www.sai.msu.su/~megera/postgres/fts/doc/index.html, but these commands are apparently not available on PG8.1. I also found the tables pg_ts_(cfg|cfgmap|dict|parser), but I have no idea how to add a Dutch stemmer to those. I did find some references to stem.[ch] files that were suggested to compile into the postgres sources, but I cannot believe that's the right way to do this (besides that I don't have sufficient privileges to install such a version). So... How do I do this? The system involved is some version of Debian Linux (2.6 kernel); are there any packages for a Dutch stemmer maybe? I'm in a bit of a hurry too, as we're on a tight deadline :( Regards, -- Alban Hertroys a.hertroys@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
Alban, the documentation you're refereed on is for upcoming 8.3 release. For 8.1 and 8.2 you need to do all machinery by hand. It's not difficult, for example: -- sample tsearch2 configuration for search.postgresql.org -- Creates configuration 'pg' - default, should match server's locale !!! -- Change 'ru_RU.UTF-8' begin; -- create special (default) configuration 'pg' update pg_ts_cfg set locale=NULL where locale = 'ru_RU.UTF-8'; insert into pg_ts_cfg values('pg','default','ru_RU.UTF8'); -- register 'pg_dict' dictionary using synonym template -- postgres pg -- pgsql pg -- postgresql pg insert into pg_ts_dict (select 'pg_dict',dict_init, '/usr/local/pgsql-dev/share/contrib/pg_dict.txt', dict_lexize, 'pg-specific dictionary' from pg_ts_dict where dict_name='synonym' ); -- register ispell dictionary, check paths and stop words -- I used iconv for english files, since there are some cyrillic stuff insert into pg_ts_dict (SELECT 'en_ispell', dict_init, 'DictFile="/usr/local/share/dicts/ispell/utf8/english-utf8.dict",' 'AffFile="/usr/local/share/dicts/ispell/utf8/english-utf8.aff",' 'StopFile="/usr/local/share/dicts/ispell/utf8/english-utf8.stop"', dict_lexize FROM pg_ts_dict WHERE dict_name = 'ispell_template' ); -- use the same stop-word list as 'en_ispell' dictionary UPDATE pg_ts_dict set dict_initoption='/usr/local/share/dicts/english.stop' where dict_name='en_stem'; -- default token<->dicts mappings insert into pg_ts_cfgmap select 'pg', tok_alias, dict_name from public.pg_ts_cfgmap where ts_name='default'; -- modify mappings for latin words for configuration 'pg' update pg_ts_cfgmap set dict_name = '{pg_dict,en_ispell,en_stem}' where tok_alias in ( 'lword', 'lhword', 'lpart_hword' ) and ts_name = 'pg'; -- we won't index/search some tokens update pg_ts_cfgmap set dict_name = NULL --where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float','word') where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float') and ts_name = 'pg'; end; -- testing select * from ts_debug(' PostgreSQL, the highly scalable, SQL compliant, open source object-relational database management system, is now undergoing beta testing of the next version of our software: PostgreSQL 8.2. '); Oleg On Wed, 3 Oct 2007, Alban Hertroys wrote: > Hello, > > I'm trying to get a Dutch snowball stemmer in Postgres 8.1, but I can't > find how to do that. > > I found CREATE FULLTEXT DICTIONARY commands in the tsearch2 docs on > http://www.sai.msu.su/~megera/postgres/fts/doc/index.html, but these > commands are apparently not available on PG8.1. > > I also found the tables pg_ts_(cfg|cfgmap|dict|parser), but I have no > idea how to add a Dutch stemmer to those. > > I did find some references to stem.[ch] files that were suggested to > compile into the postgres sources, but I cannot believe that's the right > way to do this (besides that I don't have sufficient privileges to > install such a version). > > So... How do I do this? > > The system involved is some version of Debian Linux (2.6 kernel); are > there any packages for a Dutch stemmer maybe? > > I'm in a bit of a hurry too, as we're on a tight deadline :( > > Regards, > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > Alban, > > the documentation you're refereed on is for upcoming 8.3 release. > For 8.1 and 8.2 you need to do all machinery by hand. It's not > difficult, for example: Thanks Oleg. I think I managed to do this right, although I had to google for some of the files (we don't have ispell installed). You also seem to have mixed russion and english dictionaries in your example, I'm not sure that was on purpose? Anyway, I changed your example to use dutch dictionaries and locale where I thought it applicable, and I got something working apparently. Quite some guess work was involved, so I have a few questions left. The only odd thing is that to_tsvector('dutch', 'some dutch text') now returns '|' for stop words... For example: select to_tsvector('nederlands', 'De beste stuurlui staan aan wal'); to_tsvector ------------------------------------------------ '|':1,5 'bes':2 'wal':6 'staan':4 'stuurlui':3 A minor nit... You ended the script with a hidden commit (END;). I would have preferred to experiment with the results a bit before commiting... I mixed in a few questions below, if you could answer them please? > -- sample tsearch2 configuration for search.postgresql.org > -- Creates configuration 'pg' - default, should match server's locale !!! > -- Change 'ru_RU.UTF-8' > > begin; > > -- create special (default) configuration 'pg' > update pg_ts_cfg set locale=NULL where locale = 'ru_RU.UTF-8'; I suppose this disables a possibly existing stemmer for that locale? > insert into pg_ts_cfg values('pg','default','ru_RU.UTF8'); > > -- register 'pg_dict' dictionary using synonym template > -- postgres pg > -- pgsql pg > -- postgresql pg > insert into pg_ts_dict > (select 'pg_dict',dict_init, > '/usr/local/pgsql-dev/share/contrib/pg_dict.txt', > dict_lexize, 'pg-specific dictionary' > from pg_ts_dict > where dict_name='synonym' > ); > > -- register ispell dictionary, check paths and stop words > -- I used iconv for english files, since there are some cyrillic stuff > insert into pg_ts_dict > (SELECT 'en_ispell', dict_init, > 'DictFile="/usr/local/share/dicts/ispell/utf8/english-utf8.dict",' > 'AffFile="/usr/local/share/dicts/ispell/utf8/english-utf8.aff",' > 'StopFile="/usr/local/share/dicts/ispell/utf8/english-utf8.stop"', > dict_lexize > FROM pg_ts_dict > WHERE dict_name = 'ispell_template' > ); I actually use a .lat file here. I have no idea whether that's compatible (but it appears to have worked). I got my .lat and .aff files from: http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Dutch-dicts My stop words file is from: http://snowball.tartarus.org/algorithms/dutch/stop.txt > -- use the same stop-word list as 'en_ispell' dictionary > UPDATE pg_ts_dict set dict_initoption='/usr/local/share/dicts/english.stop' > where dict_name='en_stem'; Why change the stop words for the English dictionary? I skipped this step. Is that right? > -- default token<->dicts mappings > insert into pg_ts_cfgmap select 'pg', tok_alias, dict_name from > public.pg_ts_cfgmap where ts_name='default'; > > -- modify mappings for latin words for configuration 'pg' > update pg_ts_cfgmap set dict_name = '{pg_dict,en_ispell,en_stem}' > where tok_alias in ( 'lword', 'lhword', 'lpart_hword' ) > and ts_name = 'pg'; > > -- we won't index/search some tokens > update pg_ts_cfgmap set dict_name = NULL > --where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float','word') > where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float') > and ts_name = 'pg'; > > end; > > -- testing > > select * from ts_debug(' > PostgreSQL, the highly scalable, SQL compliant, open source > object-relational > database management system, is now undergoing beta testing of the next > version of our software: PostgreSQL 8.2. > '); > > > Oleg -- Alban Hertroys a.hertroys@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
Alban Hertroys wrote: > The only odd thing is that to_tsvector('dutch', 'some dutch text') now > returns '|' for stop words... > > For example: > select to_tsvector('nederlands', 'De beste stuurlui staan aan wal'); > to_tsvector > ------------------------------------------------ > '|':1,5 'bes':2 'wal':6 'staan':4 'stuurlui':3 I found the cause. The stop words list I found contained comments prefixed by '|' signs. Removing the contents and recreating the database solved the problem. Just updating the reference didn't seem to help... There's undoubtedly some cleaner way to replace the stop words list, but at the current stage of our project this was the simplest to achieve. -- Alban Hertroys a.hertroys@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
On Wed, 3 Oct 2007, Alban Hertroys wrote: > Oleg Bartunov wrote: >> Alban, >> >> the documentation you're refereed on is for upcoming 8.3 release. >> For 8.1 and 8.2 you need to do all machinery by hand. It's not >> difficult, for example: > > Thanks Oleg. > I think I managed to do this right, although I had to google for some of > the files (we don't have ispell installed). > > You also seem to have mixed russion and english dictionaries in your > example, I'm not sure that was on purpose? yes, we index mixed content > > Anyway, I changed your example to use dutch dictionaries and locale > where I thought it applicable, and I got something working apparently. > Quite some guess work was involved, so I have a few questions left. > > The only odd thing is that to_tsvector('dutch', 'some dutch text') now > returns '|' for stop words... Could you packed your dictionary files and .sql, so we look on them in spare time. > > For example: > select to_tsvector('nederlands', 'De beste stuurlui staan aan wal'); > to_tsvector > ------------------------------------------------ > '|':1,5 'bes':2 'wal':6 'staan':4 'stuurlui':3 > > > A minor nit... You ended the script with a hidden commit (END;). I would > have preferred to experiment with the results a bit before commiting... this is up to you. It was just an example > > I mixed in a few questions below, if you could answer them please? > >> -- sample tsearch2 configuration for search.postgresql.org >> -- Creates configuration 'pg' - default, should match server's locale !!! >> -- Change 'ru_RU.UTF-8' >> >> begin; >> >> -- create special (default) configuration 'pg' >> update pg_ts_cfg set locale=NULL where locale = 'ru_RU.UTF-8'; > > I suppose this disables a possibly existing stemmer for that locale? no, it's just to have one (default) configuration 'pg' for locale 'ru_RU.UTF-8'. You can skip this. > >> insert into pg_ts_cfg values('pg','default','ru_RU.UTF8'); >> >> -- register 'pg_dict' dictionary using synonym template >> -- postgres pg >> -- pgsql pg >> -- postgresql pg >> insert into pg_ts_dict >> (select 'pg_dict',dict_init, >> '/usr/local/pgsql-dev/share/contrib/pg_dict.txt', >> dict_lexize, 'pg-specific dictionary' >> from pg_ts_dict >> where dict_name='synonym' >> ); >> >> -- register ispell dictionary, check paths and stop words >> -- I used iconv for english files, since there are some cyrillic stuff >> insert into pg_ts_dict >> (SELECT 'en_ispell', dict_init, >> 'DictFile="/usr/local/share/dicts/ispell/utf8/english-utf8.dict",' >> 'AffFile="/usr/local/share/dicts/ispell/utf8/english-utf8.aff",' >> 'StopFile="/usr/local/share/dicts/ispell/utf8/english-utf8.stop"', >> dict_lexize >> FROM pg_ts_dict >> WHERE dict_name = 'ispell_template' >> ); > > I actually use a .lat file here. I have no idea whether that's > compatible (but it appears to have worked). it's just filenames, no matter (for 8.1,8.2) > > I got my .lat and .aff files from: > http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Dutch-dicts You can use myspell dictionaries. > > My stop words file is from: > http://snowball.tartarus.org/algorithms/dutch/stop.txt > >> -- use the same stop-word list as 'en_ispell' dictionary >> UPDATE pg_ts_dict set dict_initoption='/usr/local/share/dicts/english.stop' >> where dict_name='en_stem'; > > Why change the stop words for the English dictionary? I skipped this > step. Is that right? I wanted to have the same list of stop words for ispell and snowball. > >> -- default token<->dicts mappings >> insert into pg_ts_cfgmap select 'pg', tok_alias, dict_name from >> public.pg_ts_cfgmap where ts_name='default'; >> >> -- modify mappings for latin words for configuration 'pg' >> update pg_ts_cfgmap set dict_name = '{pg_dict,en_ispell,en_stem}' >> where tok_alias in ( 'lword', 'lhword', 'lpart_hword' ) >> and ts_name = 'pg'; >> >> -- we won't index/search some tokens >> update pg_ts_cfgmap set dict_name = NULL >> --where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float','word') >> where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float') >> and ts_name = 'pg'; >> >> end; >> >> -- testing >> >> select * from ts_debug(' >> PostgreSQL, the highly scalable, SQL compliant, open source >> object-relational >> database management system, is now undergoing beta testing of the next >> version of our software: PostgreSQL 8.2. >> '); >> >> >> Oleg > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Wed, 3 Oct 2007, Alban Hertroys wrote: > Alban Hertroys wrote: >> The only odd thing is that to_tsvector('dutch', 'some dutch text') now >> returns '|' for stop words... >> >> For example: >> select to_tsvector('nederlands', 'De beste stuurlui staan aan wal'); >> to_tsvector >> ------------------------------------------------ >> '|':1,5 'bes':2 'wal':6 'staan':4 'stuurlui':3 > > I found the cause. The stop words list I found contained comments > prefixed by '|' signs. Removing the contents and recreating the database > solved the problem. Just updating the reference didn't seem to help... you need to recreate tsvector field and index, after changing any dicts. > > There's undoubtedly some cleaner way to replace the stop words list, but > at the current stage of our project this was the simplest to achieve. > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83