Thread: integrated tsearch has different results than tsearch2
Hello,

I am testing full-text search.

1. I am not able to use full-text search with latin2 encoding :( (I am missing a note in the docs about dictionaries being UTF-8 only.)

2. With hspell dictionaries (fresh copy from OpenOffice) I got different and wrong results.

Original (old) result:

ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
 ts_name       | tok_type | description | token     | dict_name          | tsvector
---------------+----------+-------------+-----------+--------------------+-------------
 default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
 default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
 default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
 default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
 default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
 default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
(7 rows)

New results:

postgres=# create Text search dictionary cspell(template=ispell, dictfile=czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# CREATE text search configuration cs (copy=english);
CREATE TEXT SEARCH CONFIGURATION
postgres=# alter text search configuration cs alter mapping for word, lword with cspell, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 Alias | Description   | Token     | Dictionaries    | Lexized token
-------+---------------+-----------+-----------------+---------------------
 word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
 blank | Space symbols |           | {}              |
 word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
 blank | Space symbols |           | {}              |
 word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
 blank | Space symbols |           | {}              |
 lword | Latin word    | se        | {cspell,simple} | cspell: {}
 blank | Space symbols |           | {}              |
 lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
 blank | Space symbols |           | {}              |
 word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
 blank | Space symbols |           | {}              |
 lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
(13 rows)

This query returned true in 8.2, and now it returns false:

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
 ?column?
----------
 f
(1 row)

Regards
Pavel Stehule

Pavel,

I can't read your posting. Can you use plain text format?

Oleg

On Mon, 3 Sep 2007, Pavel Stehule wrote:

> Hello
> I am testing fulltext.
> 1. I am not able use fulltext with latin2 encoding :( I missing note
> about only utf8 dictionaries in doc).
>
> 2. with hspell dictionaries (fresh copy from open office) I got
> different and wrong results.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

> 1. I am not able use fulltext with latin2 encoding :( I missing note
> about only utf8 dictionaries in doc).

You can use any server encoding, but the dictionary files must be in UTF-8; the dictionary will convert the UTF-8 files into the server encoding.

> 2. with hspell dictionaries (fresh copy from open office) I got
> different and wrong results.
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
>  ?column?
> ----------
>  f
> (1 row)

Please post the output of:

select ts_lexize('cspell','napil');
select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

2007/9/3, Teodor Sigaev <teodor@sigaev.ru>:
> Pls, output of:
> select ts_lexize('cspell','napil');
> select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

postgres=# select ts_lexize('cspell','napil');
 ts_lexize
-----------

(1 row)

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
 to_tsvector
-----------------------------------------------------------
 'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
(1 row)

There is a difference between the versions.

8.2.x:

postgres=# select lexize('cz_ispell','jablka');
 lexize
----------
 {jablko}
(1 row)

8.3:

postgres=# select ts_lexize('cspell','jablka');
 ts_lexize
-----------

(1 row)

postgres=# select ts_lexize('cspell','jablko');
 ts_lexize
-----------
 {jablko}
(1 row)

Pavel Stehule

Pavel Stehule wrote:
> There is difference
> 8.2.x
> postgres=# select lexize('cz_ispell','jablka');
>  lexize
> ----------
>  {jablko}
> (1 row)
> 8.3
> postgres=# select ts_lexize('cspell','jablka');
>  ts_lexize
> -----------
>
> (1 row)
> postgres=# select ts_lexize('cspell','jablko');
>  ts_lexize
> -----------
>  {jablko}
> (1 row)

Can you post a link to the ispell dictionary file you're using so I and others can reproduce that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

I used the dictionaries from the Fedora Core package

hunspell-cs-20060303-5.fc7.i386.rpm

and then converted them to UTF-8 with iconv.

Pavel

2007/9/4, Heikki Linnakangas <heikki@enterprisedb.com>:
> Can you post a link to the ispell dictionary file you're using so I and
> others can reproduce that?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com

Pavel Stehule wrote:
> I used dictionaries from fedora core packages
>
> hunspell-cs-20060303-5.fc7.i386.rpm
>
> then I converted it to utf8 with iconv

Ok, thanks.

Apparently it's a bug I introduced when I refactored spell.c to use the readline function for reading and recoding the input file. I didn't notice that some calls to STRNCMP used the non-lowercased version of the input line. Patch attached.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Index: src/backend/tsearch/spell.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tsearch/spell.c,v
retrieving revision 1.2
diff -c -r1.2 spell.c
*** src/backend/tsearch/spell.c 25 Aug 2007 00:03:59 -0000 1.2
--- src/backend/tsearch/spell.c 4 Sep 2007 12:31:55 -0000
***************
*** 733,739 ****
      while ((recoded = t_readline(affix)) != NULL)
      {
          pstr = lowerstr(recoded);
-         pfree(recoded);

          lineno++;

--- 733,738 ----
***************
*** 813,820 ****
              flag = (unsigned char) *s;
              goto nextline;
          }
!         if (STRNCMP(str, "COMPOUNDFLAG") == 0 || STRNCMP(str, "COMPOUNDMIN") == 0 ||
!             STRNCMP(str, "PFX") == 0 || STRNCMP(str, "SFX") == 0)
          {
              if (oldformat)
                  ereport(ERROR,
--- 812,819 ----
              flag = (unsigned char) *s;
              goto nextline;
          }
!         if (STRNCMP(recoded, "COMPOUNDFLAG") == 0 || STRNCMP(recoded, "COMPOUNDMIN") == 0 ||
!             STRNCMP(recoded, "PFX") == 0 || STRNCMP(recoded, "SFX") == 0)
          {
              if (oldformat)
                  ereport(ERROR,
***************
*** 834,839 ****
--- 833,839 ----
          NIAddAffix(Conf, flag, flagflags, mask, find, repl, suffixes ? FF_SUFFIX : FF_PREFIX);

      nextline:
+         pfree(recoded);
          pfree(pstr);
      }
      FreeFile(affix);

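The failure mode described in the message above can be reproduced outside PostgreSQL. What follows is a standalone sketch, not the code from spell.c: starts_with, lower_copy and is_affix_keyword are hypothetical helpers, and starts_with only stands in for the case-sensitive prefix test that the STRNCMP macro is assumed to perform. It illustrates why an uppercase keyword such as PFX or SFX can no longer match once the line has been lowercased, so the affix rules on that line would be skipped without any error.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-sensitive prefix test; a stand-in for STRNCMP (assumed behaviour). */
int
starts_with(const char *line, const char *keyword)
{
    return strncmp(line, keyword, strlen(keyword)) == 0;
}

/* Return a newly allocated ASCII-lowercased copy of s; the caller frees it. */
char *
lower_copy(const char *s)
{
    size_t len;
    size_t i;
    char  *out;

    assert(s != NULL);
    len = strlen(s);
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++)
        out[i] = (char) tolower((unsigned char) s[i]);
    out[len] = '\0';
    return out;
}

/* Does this affix-file line start with one of the uppercase keywords? */
int
is_affix_keyword(const char *line)
{
    return starts_with(line, "COMPOUNDFLAG") ||
           starts_with(line, "COMPOUNDMIN") ||
           starts_with(line, "PFX") ||
           starts_with(line, "SFX");
}
```

Calling is_affix_keyword on an unmodified line such as "SFX A 0 y ." returns nonzero, while calling it on the result of lower_copy for the same line returns 0. That is consistent with the patch comparing against recoded and deferring pfree(recoded) until the nextline label, since the non-lowercased buffer has to stay valid for the keyword tests.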
2007/9/4, Heikki Linnakangas <heikki@enterprisedb.com>:
> Pavel Stehule wrote:
> > I used dictionaries from fedora core packages
> >
> > hunspell-cs-20060303-5.fc7.i386.rpm
> >
> > then I converted it to utf8 with iconv
>
> Ok, thanks.
>
> Apparently it's a bug I introduced when I refactored spell.c to use the
> readline function for reading and recoding the input file. I didn't
> notice that some calls to STRNCMP used the non-lowercased version of the
> input line. Patch attached.

It works.

Thank you
Pavel