Thread: questions about tsearch2 (for Czech language)
Hello,

I am trying tsearch2 in a Czech environment. It works fine, but I have two questions.

1. I have the words "se" and "ve" in my Czech stop-word list, but I still get these words in the result. Why? Is there a problem with my configuration?

tsearch2=# select * from ts_debug('jmenuji se Pavel Stěhule a bydlím ve Skalici.');
    ts_name    | tok_type | description |  token  |  dict_name  | tsvector
---------------+----------+-------------+---------+-------------+-----------
 default_czech | lword    | Latin word  | jmenuji | {cz_ispell} | 'jmenuji'
 default_czech | lword    | Latin word  | se      | {cz_ispell} | 'se'
 default_czech | lword    | Latin word  | Pavel   | {cz_ispell} | 'pavel'
 default_czech | word     | Word        | Stěhule | {cz_ispell} |
 default_czech | lword    | Latin word  | a       | {cz_ispell} |
 default_czech | word     | Word        | bydlím  | {cz_ispell} | 'bydlet'
 default_czech | lword    | Latin word  | ve      | {cz_ispell} | 've'
 default_czech | lword    | Latin word  | Skalici | {cz_ispell} | 'skalici'
(8 rows)

tsearch2=# select * from pg_ts_cfgmap where ts_name='default_czech';
    ts_name    |  tok_alias   |  dict_name
---------------+--------------+-------------
 default_czech | email        | {simple}
 default_czech | file         | {simple}
 default_czech | float        | {simple}
 default_czech | host         | {simple}
 default_czech | hword        | {cz_ispell}
 default_czech | int          | {simple}
 default_czech | lhword       | {cz_ispell}
 default_czech | lpart_hword  | {cz_ispell}
 default_czech | lword        | {cz_ispell}
 default_czech | nlhword      | {cz_ispell}
 default_czech | nlpart_hword | {cz_ispell}
 default_czech | nlword       | {cz_ispell}
 default_czech | part_hword   | {simple}
 default_czech | sfloat       | {simple}
 default_czech | uint         | {simple}
 default_czech | uri          | {simple}
 default_czech | url          | {simple}
 default_czech | version      | {simple}
 default_czech | word         | {cz_ispell}
(19 rows)

2. I use a small Czech dictionary. I need words that are not in the dictionary (Stěhule in my sample) to be kept rather than dropped. Can I set this somewhere? I tried adding the simple dictionary to the cfg map, but without success:

tsearch2=# select * from ts_debug('jmenuji se Pavel Stěhule a bydlím ve Skalici.');
    ts_name    | tok_type | description |  token  |     dict_name      | tsvector
---------------+----------+-------------+---------+--------------------+-----------
 default_czech | word     | Word        | Stěhule | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | a       | {cz_ispell,simple} |
 default_czech | word     | Word        | bydlím  | {cz_ispell,simple} | 'bydlet'

Thank you,
Pavel Stehule
On Mon, 22 Dec 2003, Pavel Stehule wrote:

> 1. I have the words "se" and "ve" in my Czech stop-word list, but I still
> get these words in the result. Why? Is there a problem with my configuration?

Did you specify the stop words in the dictionary configuration?

select * from pg_ts_dict;

> 2. I use a small Czech dictionary. I need words that are not in the
> dictionary (Stěhule in my sample) to be kept rather than dropped. Can I set
> this somewhere? I tried adding the simple dictionary to the cfg map, but
> without success.

An example, please! What do you mean by 'erase words'?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
> > result. Why? Is there a problem with my configuration?
>
> Did you specify the stop words in the dictionary configuration?
>
> select * from pg_ts_dict;

tsearch2=# select * from pg_ts_dict where dict_name ='cz_ispell';
-[ RECORD 1 ]---+------------------------------------------------------------
dict_name       | cz_ispell
dict_init       | 173405
dict_initoption | DictFile="/usr/lib/ispell/czech",AffFile="/usr/lib/ispell/czech.aff",StopFile="/usr/local/pgsql/share/contrib/czech.stop"
dict_lexize     | 173406
dict_comment    |

[postgres@usop root]$ cat /usr/local/pgsql/share/contrib/czech.stop|grep -e "^[sv]."
se
sem
si
svůj
ve
vám
váš
viz
vy

> > 2. I use a small Czech dictionary. I need words that are not in the
> > dictionary (Stěhule in my sample) to be kept rather than dropped.
>
> An example, please! What do you mean by 'erase words'?

If tsearch does not find a word in the dictionary, it drops the word from the result. Is that right? My surname, for example, is not in the dictionary, but I want to keep this word in the result (tsvector).

I use:

tsearch2=# select version();
                                                version
-------------------------------------------------------------------------------------------------------
 PostgreSQL 7.4RC2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3 20030715 (Red Hat Linux 3.3-14)
Pavel,

did you restart your psql session after modifying the tsearch2 configuration?

By the way, a Czech dictionary is available from http://lingucomponent.openoffice.org/download_dictionary.html
We have a utility to convert myspell dictionaries to ispell ones. It is included in 7.5 development. A patch for 7.4 can be downloaded from http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Also, historically, we use the openfts mailing list for discussion of tsearch2.

Regards,
Oleg
Oleg,

You are right. After a restart of the postmaster everything works fine.

tsearch2=# select to_tsvector('default_czech','Jmenuji se Pavel Stěhule');
            to_tsvector
------------------------------------
 'pavel':3 'stěhule':4 'jmenovat':1

Thank you very much,
Pavel Stehule
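A rough illustration of why the to_tsvector output 'pavel':3 'stěhule':4 'jmenovat':1 drops "se" but keeps "stěhule" once the token type is mapped to {cz_ispell,simple}: each token is offered to the dictionaries in order, and the first one that recognizes it decides the result. The Python below is only a toy model under that assumption, not tsearch2 code. The word list, the two-entry stop set, and all function names are made up for the example.

```python
# Toy model of a tsearch2-style dictionary chain; NOT the real implementation.
# Assumption: a dictionary returns None for an unknown word (try the next
# dictionary), an empty list for a stop word (drop the token), or lexemes.

STOP_WORDS = {"se", "ve"}  # two entries taken from the czech.stop listing
# Hypothetical stand-in for the ispell dictionary: word form -> base form.
ISPELL_FORMS = {"jmenuji": "jmenovat", "bydlím": "bydlet", "pavel": "pavel"}


def ispell_like(token):
    """Stand-in for cz_ispell: knows stop words and a few word forms."""
    word = token.lower()
    if word in STOP_WORDS:
        return []              # recognized as a stop word: discard
    if word in ISPELL_FORMS:
        return [ISPELL_FORMS[word]]
    return None                # unknown: let the next dictionary try


def simple_like(token):
    """Stand-in for the simple dictionary: accepts anything, lowercased."""
    return [token.lower()]


def lexize(token, chain):
    """Return the answer of the first dictionary that recognizes the token."""
    for dictionary in chain:
        result = dictionary(token)
        if result is not None:
            return result
    return []                  # nobody recognized it: the token is dropped


def to_vector(text, chain):
    """Map each lexeme to its word positions (stop words still use a position)."""
    vector = {}
    for position, token in enumerate(text.split(), start=1):
        for lexeme in lexize(token, chain):
            vector.setdefault(lexeme, []).append(position)
    return vector


if __name__ == "__main__":
    print(to_vector("Jmenuji se Pavel Stěhule", [ispell_like, simple_like]))
    # prints {'jmenovat': [1], 'pavel': [3], 'stěhule': [4]}
```

With only ispell_like in the chain, "Stěhule" is recognized by no dictionary and disappears, which matches the empty tsvector column for that token in the earlier ts_debug output.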
> You are right. After a restart of the postmaster everything works fine.

One comment: you don't need to restart the postmaster. It is enough to reconnect to PostgreSQL by exiting psql and starting it again. Every new connection creates a new child process of the postmaster.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
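The point about reconnecting can be pictured with a small model. It assumes, as the behaviour in this thread suggests but does not prove, that each backend process loads the dictionary settings the first time it needs them and then keeps its own copy. The classes and names below are invented for the illustration and are not part of PostgreSQL or tsearch2.

```python
# Toy model of per-connection configuration caching; names are hypothetical.
# Assumption: a session reads the stored settings once, on first use, and
# never looks at them again for the rest of its life.


class Server:
    """Stands in for the postmaster plus the stored dictionary settings."""

    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def connect(self):
        # Every connection gets its own fresh session (child process).
        return Session(self)


class Session:
    """Stands in for one backend serving one psql connection."""

    def __init__(self, server):
        self._server = server
        self._cached = None          # nothing loaded yet

    def stop_words(self):
        if self._cached is None:     # first use: take a private snapshot
            self._cached = frozenset(self._server.stop_words)
        return self._cached


server = Server(stop_words=[])
old_session = server.connect()
old_session.stop_words()                 # old session caches the empty list
server.stop_words.update({"se", "ve"})   # configuration is changed afterwards
new_session = server.connect()           # reconnect, no server restart needed

print("se" in old_session.stop_words())  # old session still uses its snapshot
print("se" in new_session.stop_words())  # new session sees the change
```

In this model a full server restart also works, but only because it forces every session to be recreated.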
On Tue, 23 Dec 2003, Teodor Sigaev wrote:

> One comment: you don't need to restart the postmaster. It is enough to
> reconnect to PostgreSQL by exiting psql and starting it again. Every new
> connection creates a new child process of the postmaster.

True, but I like hard solutions :->
"/etc/init.d/postgresql restart" is my top command. I am the only one working on this database, so I can use force.

Pavel