Thread: Tsearch + polish ispell + polish locale

Tsearch + polish ispell + polish locale

From
Date:
Hi all,

I am experiencing a strange problem using tsearch with a Polish locale (initdb --locale pl_PL.iso88592) and a Polish ispell dictionary.

I have a PL/pgSQL function that creates a tsvector for a given record (it basically gets texts from various tables and creates one tsvector).
The function returns something like this:

     RETURN  setweight(to_tsvector(fname),       ''A'')
                ||  setweight(to_tsvector(prov),                ''C'')
            [  … 15 more lines like above ... ]
                ||  setweight(to_tsvector(firm_rec.fax),        ''A'')
                ;

After several calls to this function I get an error:

psql> update some_table set fts_vect = record_to_tsvector(id) where id < 40;
ERROR:  Error in regis: [^ż]ać at pos 3

Any idea how I can fix this?

What is even more strange, the lower() function gets broken *after* this error occurs.
Before the error it correctly lowercases Polish letters, and after it does not lowercase them anymore.
After reconnecting to the database everything works fine (until the next error…)

Regards,
            Arek.

Re: Tsearch + polish ispell + polish locale

From
Teodor Sigaev
Date:
> ERROR:  Error in regis: [^ż]ać at pos 3
> Any idea how I can fix this?
> What is even more strange, the lower() function gets broken **after** this 
> error occurs.
> 
> Before the error it correctly lowercases Polish letters, and after it does 
> not lowercase them anymore.
> 
> After reconnecting to the database everything works fine (until the next 
> error…)

Which version do you use?

I just fixed a bug close to your problem in current CVS; try the new version.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch + polish ispell + polish locale

From
Date:
>>
>> After reconnecting to the database everything works fine (until the next
>> error...)

> Which version do you use?
>
> I just fixed a bug close to your problem in current CVS; try the new version.

I am using version 8.1.5

I will try and let you know...

Thanks for your answer,
Arek.


Re: Tsearch + polish ispell + polish locale

From
Teodor Sigaev
Date:
> I am using version 8.1.5  
Oops, I worked on 8.2.

Can you send me the ispell files (dict and affix)? And please make a simple test 
suite to demonstrate the problem.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch + polish ispell + polish locale

From
Date:
I am using ispell files from OpenOffice (converted with my2ispell).
I also tried others (e.g. http://www.kurnik.pl/dictionary/) with the same result.

As for the test suite, I think it will take some time to prepare one.
I will send one as soon as possible.

I think I will first try to port the locale fix to 8.1 and see how it works ...


Thanks,
Arek.

-----Original Message-----
From: Teodor Sigaev [mailto:teodor@sigaev.ru]
Sent: Monday, November 20, 2006 3:21 PM
To: Staroń Arkadiusz
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tsearch + polish ispell + polish locale


> I am using version 8.1.5
Oops, I worked on 8.2.

Can you send ispell files (dict and affix) to me? And make simple test suite to
demonstrate the problem.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/ 


Re: Tsearch + polish ispell + polish locale

From
Date:
Hi Teodor,

Unfortunately I can't create a test suite ...
I tried to make it as simple as possible, but on a simple (small) database everything works fine.
I also cannot provide you with a mirror of my database, since it contains proprietary data ...

I worked around my problem by writing my own tolower() function and substituting it throughout the tsearch2 code.
On a database with the locale set to 'C' it works fine.

As far as I could debug the problem, with locale = 'C' RS_compile() is fed only with strings that do not contain Polish letters.
With the locale set to 'pl_PL.iso88592', the strings passed to RS_compile() contain Polish letters.
I do not know how, but in some strange, random cases isalpha() stops returning true for Polish letters, and that is when RS_compile() returns an error.
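
For context on why isalpha() matters here: in C, isalpha() consults the process-wide LC_CTYPE locale, so the same byte can count as a letter under one locale and not under another. Below is a minimal sketch of that dependence. It is not taken from tsearch2; the helper name is_alpha_in_locale is made up for illustration, and 0xBF is the ISO-8859-2 byte for ż.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of PostgreSQL or tsearch2).
 * Switches LC_CTYPE to the locale called `name` and reports whether
 * byte `c` is alphabetic there: 1 = yes, 0 = no, -1 = locale unavailable.
 * The byte is taken as unsigned char because isalpha() requires a value
 * representable as unsigned char (or EOF).
 */
int is_alpha_in_locale(const char *name, unsigned char c)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;              /* locale not installed on this system */
    return isalpha(c) != 0;     /* normalize the nonzero result to 1 */
}
```

With the "C" locale the call returns 0 for 0xBF; with "pl_PL.iso88592" it should return 1 when that locale is installed, and -1 when it is not. If something reset LC_CTYPE in the middle of a session, isalpha() would start rejecting Polish letters in the same way, although nothing in this thread establishes that this is what happened.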

I will try to compile and run my database on the CVS version of postgres, and let you know the results.

Is it safe to use version 8.2 over 8.1.5 database files?

BTW, when is the official 8.2 release expected?

Thanks for your time and engagement,
Arek.

PS. BTW, I have found a minor inconsistency in the regis.c code (CVS version): the returned value does not match the declared return type, see the snippet below:

bool
RS_execute(Regis * r, char *str)
[...]
        if (len < r->nchar)
                return 0;




Re: Tsearch + polish ispell + polish locale

From
Teodor Sigaev
Date:
> I solved my problem by creating my own tolower() function and replace it over the tsearch2 code.
> On database with locale set to 'C' it works fine.
> 
> As far As I debugged the problem I could observe that with locale = 'C' RS_compile() is fed only with strings that
> does not contain polish letters.
>
> With locale set to 'pl_PL.iso88592' strings passed to PS_compile contain polish letters.
> I do not know how, but in some strange, random cases function isalpha() stops return true value for polish letters,
> and that is when RS_compile() returns error.
 
Hmm, very strange. Which OS do you use?
Please show the exact output of
# show lc_ctype;
# show lc_collate;
and your tsearch2 configuration.

> 
> I will try to compile and run my database on the CVS version of postgres, and let you know the results.
ok

> Is it safe to use 8.2 version over 8.1.5 database files ?
No, it's impossible: the format of the database files has changed significantly.

> 
> BTW. When the official 8.2 release is expected ?

During 2006 :)

> 
> Thanks for your time and engagement,
>     Arek.
> 
> PS. BTW I have found minor inconsistency in the regis.c code (CVS version)
>     Return value type is not as it should .. see snippet below...
fixed
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch + polish ispell + polish locale

From
Date:
Hi,

> > I do not know how, but in some strange, random cases function isalpha()
> stops return true value for polish letters, and that is when RS_compile()
> returns error.
> Hmm, very strange. Which OS do you use?
> Pls, show exact
> # show lc_ctype;
> # show lc_collate;
> and tsearch2 configuration

Linux 2.6.14.4-dl380

    lc_ctype
----------------
 pl_PL.iso88592

   lc_collate
----------------
 pl_PL.iso88592

The other interesting thing is that, although tolower() and isalpha() are broken, sorting Polish letters works fine ...

Tsearch2 is configured as follows:

INSERT INTO pg_ts_cfg (...) VALUES ('default_polish', 'default', 'pl_PL');

INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'url',    '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'host',   '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'sfloat', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'uri',    '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'int',    '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'float',  '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'email',  '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'word',   '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'hword',  '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'nlword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'nlpart_hword',   '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'part_hword',     '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'nlhword',    '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'file',       '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'uint',       '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'version',    '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'lhword',     '{pl_ispell,simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'lpart_hword','{pl_ispell,simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'lword',      '{pl_ispell,simple}');

INSERT INTO pg_ts_dict
  (SELECT 'pl_ispell',
          dict_init,
          'DictFile="/home/astaron/lib/ispell/polish.dic",'
          'AffFile="/home/astaron/lib/ispell/polish.aff",'
          'StopFile="/home/astaron/lib/ispell/polish.stop"',
          dict_lexize
   FROM pg_ts_dict
   WHERE dict_name = 'ispell_template');


If there is anything I can do to help you debug
this issue (logs, tests, code changes..), please let me know.

For now I will run 8.2 and see if the problem persists ...

Best regards,
Arek.


Re: Tsearch + polish ispell + polish locale

From
Teodor Sigaev
Date:
> INSERT INTO pg_ts_cfg (...) VALUES ('default_polish', 'default', 'pl_PL');

If you mark the locale as 'pl_PL.iso88592' instead of 'pl_PL', then tsearch2 will be 
able to find the configuration itself.

> If there is anything, I can do to help you to debug
> this issue (logs, tests, code changes..), please let me know.
> 
> As for now I will run 8.2 and see if the problem persists ...
Do the lower()/upper() functions work well in postgres?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch + polish ispell + polish locale

From
Date:
>
> If your mark locale as 'pl_PL.iso88592' instead of 'pl_PL' then tsearch2
> will be
> able to find configuration itself.

Good point.. I forgot about this ;-)

>
> Does lower()/upper() functions works well in postgres?

Until the regis error they work fine... then they get broken.
As a matter of fact, I wasn't able to determine what breaks them: postgres or tsearch ...

Any idea how I can check it?

Regards,
Arek.



Re: Tsearch + polish ispell + polish locale

From
Teodor Sigaev
Date:
>> Does lower()/upper() functions works well in postgres?
> 
> Until regis error it works fine... then it gets broken.
> As the matter of fact I wasn't able to determine who breaks it, is it postgres or tsearch ...
> 
> Any idea how can I check it ?

It looks to me like memory corruption somewhere.

Try compiling postgres (and tsearch2 too) with
CFLAGS=-O0 ./configure  --enable-cassert --enable-debug
and repeat the tests.

If you are using a recent version of Linux libc (later than 5.4.23) or GNU
libc (2.x), then it will be useful to set the MALLOC_CHECK_ environment variable to 2 
before starting postgres (man 3 malloc).

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch + polish ispell + polish locale

From
Date:
FYI,

The problem does NOT exist in 8.2beta3.
I think it can be assumed that this was some locale-related issue ...

Thanks for your help,
Arek.


> -----Original Message-----
> From: Teodor Sigaev [mailto:teodor@sigaev.ru]
> Sent: Wednesday, November 22, 2006 5:12 PM
> To: Staroń Arkadiusz
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Tsearch + polish ispell + polish locale
>
> >> Does lower()/upper() functions works well in postgres?
> >
> > Until regis error it works fine... then it gets broken.
> > As the matter of fact I wasn't able to determine who breaks it, is it
> postgres or tsearch ...
> >
> > Any idea how can I check it ?
>
> It seems to me, it's a memory corruption somewhere.
>
> try to compile postgres(and tsearch2 too) with
> CFLAGS=-O0 ./configure  --enable-cassert --enable-debug
> and repeats the tests
>
> If you are using recent  versions of Linux libc (later than 5.4.23) and
> GNU
> libc (2.x) then it will be useful to set MALLOC_CHECK_ enviroment variable
> to 2
> before starting postgres (man 3 malloc).
>
> --
> Teodor Sigaev                                   E-mail: teodor@sigaev.ru
>                                                     WWW:
> http://www.sigaev.ru/