Thread: Tsearch + polish ispell + polish locale
<div class="Section1">
<p class="MsoNormal">Hi all,</p>
<p class="MsoNormal">I am experiencing a strange problem using tsearch with a Polish locale (initdb --locale pl_PL.iso88592) and a Polish ispell dictionary.</p>
<p class="MsoNormal">I have a PL/pgSQL function that creates a tsvector for a given record (it basically gets texts from various tables and builds one tsvector from them).</p>
<p class="MsoNormal">The function returns something like this:</p>
<p class="MsoNormal">RETURN setweight(to_tsvector(fname), ''A'')</p>
<p class="MsoNormal">|| setweight(to_tsvector(prov), ''C'')</p>
<p class="MsoNormal">[ … 15 more lines like above ... ]</p>
<p class="MsoNormal">|| setweight(to_tsvector(firm_rec.fax), ''A'')</p>
<p class="MsoNormal">;</p>
<p class="MsoNormal">After several calls to this function I get an error:</p>
<p class="MsoNormal">psql&gt; update some_table set fts_vect = record_to_tsvector(id) where id &lt; 40;</p>
<p class="MsoNormal">ERROR: Error in regis: [^ż]ać at pos 3</p>
<p class="MsoNormal">Any idea how I can fix this?</p>
<p class="MsoNormal">What is even stranger, the lower() function gets broken <b>after</b> this error occurs.</p>
<p class="MsoNormal">Before the error it correctly lowercases Polish letters; after the error it no longer does.</p>
<p class="MsoNormal">After reconnecting to the database everything works fine (until the next error…)</p>
<p class="MsoNormal">Regards,</p>
<p class="MsoNormal">Arek.</p>
</div>
> ERROR: Error in regis: [^ż]ać at pos 3
> Any idea how I can fix this?
> What is even stranger, the lower() function gets broken **after** this
> error occurs.
>
> Before the error it correctly lowercases Polish letters; after the error
> it no longer does.
>
> After reconnecting to the database everything works fine (until the next
> error…)

Which version do you use?

I just fixed a bug close to your problem in current CVS. Try the new version.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
>> After reconnecting to the database everything works fine (until the next
>> error...)
> Which version do you use?
>
> I just fixed a bug close to your problem in current CVS. Try the new version.

I am using version 8.1.5.
I will try and let you know...

Thanks for your answer,
Arek.
> I am using version 8.1.5

Oops, I worked on 8.2. Can you send me the ispell files (dict and affix)? And please put together a simple test suite that demonstrates the problem.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
I am using ispell files from OpenOffice (converted with my2ispell).
I also tried others (e.g. http://www.kurnik.pl/dictionary/) with the same result.

As for the test suite, I think it will take some time to prepare one. I will send it as soon as possible.

I think I will first try to port the locale fix into 8.1 and see how it works...

Thanks,
Arek.

-----Original Message-----
From: Teodor Sigaev [mailto:teodor@sigaev.ru]
Sent: Monday, November 20, 2006 3:21 PM
To: Staroń Arkadiusz
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tsearch + polish ispell + polish locale

> I am using version 8.1.5

Oops, I worked on 8.2. Can you send ispell files (dict and affix) to me? And make simple test suite to demonstrate the problem.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Hi Teodor,

Unfortunately I can't create a test suite... I tried to make one as simple as possible, but on a simple (small) database everything works fine.
I also cannot provide you with a mirror of my database, since it contains proprietary data...

I solved my problem by writing my own tolower() function and substituting it in the tsearch2 code.
On a database with the locale set to 'C' it works fine.

As far as I could debug the problem, I observed that with locale = 'C' RS_compile() is fed only strings that do not contain Polish letters.
With the locale set to 'pl_PL.iso88592', the strings passed to RS_compile() contain Polish letters.
I do not know how, but in some strange, random cases isalpha() stops returning true for Polish letters, and that is when RS_compile() returns the error.

I will try to compile and run my database on the CVS version of postgres, and let you know the results.
Is it safe to use version 8.2 over 8.1.5 database files?

BTW, when is the official 8.2 release expected?

Thanks for your time and engagement,
Arek.

PS. I have found a minor inconsistency in the regis.c code (CVS version).
The function is declared to return bool but returns an integer literal. See the snippet below:

    bool
    RS_execute(Regis * r, char *str)
    [...]
        if (len < r->nchar)
            return 0;
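The isalpha() behaviour described above depends on the process-wide LC_CTYPE setting. The following is a minimal C sketch of that dependency, not tsearch2 code: byte_is_alpha() and use_ctype_locale() are hypothetical helpers introduced only for illustration, and 0xBF is assumed to be the byte in question because it encodes 'ż' in ISO 8859-2. It also shows the cast to unsigned char that the <ctype.h> functions require for bytes above 0x7F.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/* Byte value of 'ż' in ISO 8859-2, the encoding named in the report. */
#define Z_DOT_LATIN2 0xBF

/*
 * Hypothetical helper: classify one byte of a single-byte-encoded string
 * using the current LC_CTYPE locale. The cast to unsigned char matters:
 * passing a negative char value (other than EOF) to isalpha() is undefined.
 */
static bool byte_is_alpha(char c)
{
    return isalpha((unsigned char) c) != 0;
}

/*
 * Hypothetical helper: try to switch LC_CTYPE to `name`.
 * Returns false if the locale is not available; in that case setlocale()
 * leaves the process locale unchanged.
 */
static bool use_ctype_locale(const char *name)
{
    assert(name != NULL);
    return setlocale(LC_CTYPE, name) != NULL;
}
```

After use_ctype_locale("C"), byte_is_alpha((char) Z_DOT_LATIN2) is false, because the "C" locale treats only the 26 ASCII letters (in both cases) as alphabetic. With a Latin-2 locale such as pl_PL.iso88592 installed and selected, the same call should return true. If LC_CTYPE were switched mid-session, the character class [^ż] would stop being recognised as letters, which would be consistent with the symptoms reported here, although the thread does not establish that this is the cause.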
> I solved my problem by writing my own tolower() function and substituting it in the tsearch2 code.
> On a database with the locale set to 'C' it works fine.
>
> As far as I could debug the problem, I observed that with locale = 'C' RS_compile() is fed only strings that do not contain Polish letters.
> With the locale set to 'pl_PL.iso88592', the strings passed to RS_compile() contain Polish letters.
> I do not know how, but in some strange, random cases isalpha() stops returning true for Polish letters, and that is when RS_compile() returns the error.

Hmm, very strange. Which OS do you use?
Please show the exact output of
# show lc_ctype;
# show lc_collate;
and your tsearch2 configuration.

> I will try to compile and run my database on the CVS version of postgres, and let you know the results.

OK.

> Is it safe to use version 8.2 over 8.1.5 database files?

No, that is impossible: the format of the database files has changed significantly.

> BTW, when is the official 8.2 release expected?

During 2006 :)

> Thanks for your time and engagement,
> Arek.
>
> PS. I have found a minor inconsistency in the regis.c code (CVS version).
> Return value type is not as it should be. See the snippet below...

Fixed.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Hi,

> > I do not know how, but in some strange, random cases isalpha()
> > stops returning true for Polish letters, and that is when RS_compile()
> > returns the error.
> Hmm, very strange. Which OS do you use?
> Please show the exact output of
> # show lc_ctype;
> # show lc_collate;
> and your tsearch2 configuration.

Linux 2.6.14.4-dl380

 lc_ctype
----------------
 pl_PL.iso88592

 lc_collate
----------------
 pl_PL.iso88592

The other interesting thing is that, although tolower() and isalpha() are broken, sorting of Polish letters works fine...

Tsearch2 is configured as follows:

INSERT INTO pg_ts_cfg (...) VALUES ('default_polish', 'default', 'pl_PL');

INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'url', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'host', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'sfloat', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'uri', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'int', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'float', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'email', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'word', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'hword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'nlword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'nlpart_hword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'part_hword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'nlhword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'file', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'uint', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'version', '{simple}');

INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'lhword', '{pl_ispell,simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'lpart_hword', '{pl_ispell,simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES ('default_polish', 'lword', '{pl_ispell,simple}');

INSERT INTO pg_ts_dict
  (SELECT 'pl_ispell', dict_init,
          'DictFile="/home/astaron/lib/ispell/polish.dic",'
          'AffFile="/home/astaron/lib/ispell/polish.aff",'
          'StopFile="/home/astaron/lib/ispell/polish.stop"',
          dict_lexize
   FROM pg_ts_dict
   WHERE dict_name = 'ispell_template');

If there is anything I can do to help you debug this issue (logs, tests, code changes...), please let me know.

For now I will run 8.2 and see if the problem persists...

Best regards,
Arek.
> INSERT INTO pg_ts_cfg (...) VALUES ('default_polish', 'default', 'pl_PL');

If you mark the locale as 'pl_PL.iso88592' instead of 'pl_PL', then tsearch2 will be able to find the configuration by itself.

> If there is anything I can do to help you debug
> this issue (logs, tests, code changes...), please let me know.
>
> For now I will run 8.2 and see if the problem persists...

Do the lower()/upper() functions work well in postgres?

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
> If you mark the locale as 'pl_PL.iso88592' instead of 'pl_PL', then tsearch2
> will be able to find the configuration by itself.

Good point. I forgot about this ;-)

> Do the lower()/upper() functions work well in postgres?

Until the regis error they work fine... then they get broken.
As a matter of fact, I wasn't able to determine what breaks them: postgres or tsearch...

Any idea how I can check it?

Regards,
Arek.
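One way to narrow this down, sketched here under the assumption that the process-wide LC_CTYPE is what changes (the thread does not confirm this), is to record the active locale name before the failing call and compare it afterwards. current_ctype() and ctype_is() are hypothetical helper names; the only library call relied on is setlocale() with a NULL locale argument, which queries the current setting without modifying it.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: copy the name of the current LC_CTYPE locale into
 * buf. setlocale(category, NULL) only queries; it does not change anything.
 * The name is copied because the returned pointer may be invalidated by a
 * later setlocale() call.
 */
static void current_ctype(char *buf, size_t buflen)
{
    const char *name = setlocale(LC_CTYPE, NULL);

    assert(buf != NULL && buflen > 0);
    strncpy(buf, name != NULL ? name : "", buflen - 1);
    buf[buflen - 1] = '\0';
}

/* Hypothetical helper: true if LC_CTYPE still has the expected name. */
static bool ctype_is(const char *expected)
{
    char now[128];

    current_ctype(now, sizeof now);
    return strcmp(now, expected) == 0;
}
```

Calling ctype_is() with the name recorded at backend start, once before and once after the statement that raises the regis error, would show whether the locale itself was switched. If the name is unchanged while isalpha() and tolower() still misbehave, the locale setting is not the culprit and something else has to be investigated.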
>> Do the lower()/upper() functions work well in postgres?
>
> Until the regis error they work fine... then they get broken.
> As a matter of fact, I wasn't able to determine what breaks them: postgres or tsearch...
>
> Any idea how I can check it?

It seems to me that it is memory corruption somewhere.

Try to compile postgres (and tsearch2 too) with
CFLAGS=-O0 ./configure --enable-cassert --enable-debug
and repeat the tests.

If you are using recent versions of Linux libc (later than 5.4.23) and GNU libc (2.x), then it will be useful to set the MALLOC_CHECK_ environment variable to 2 before starting postgres (man 3 malloc).

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
FYI,

The problem does NOT exist in 8.2beta3. I think it can be assumed that this was some locale-related issue...

Thanks for your help,
Arek.

> -----Original Message-----
> From: Teodor Sigaev [mailto:teodor@sigaev.ru]
> Sent: Wednesday, November 22, 2006 5:12 PM
> To: Staroń Arkadiusz
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Tsearch + polish ispell + polish locale
>
> >> Does lower()/upper() functions works well in postgres?
> >
> > Until regis error it works fine... then it gets broken.
> > As the matter of fact I wasn't able to determine who breaks it, is it
> > postgres or tsearch ...
> >
> > Any idea how can I check it ?
>
> It seems to me, it's a memory corruption somewhere.
>
> try to compile postgres (and tsearch2 too) with
> CFLAGS=-O0 ./configure --enable-cassert --enable-debug
> and repeat the tests
>
> If you are using recent versions of Linux libc (later than 5.4.23) and GNU
> libc (2.x) then it will be useful to set MALLOC_CHECK_ environment variable
> to 2 before starting postgres (man 3 malloc).
>
> --
> Teodor Sigaev E-mail: teodor@sigaev.ru
> WWW: http://www.sigaev.ru/