Thread: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
From
albert.cieszkowski@cc.com.pl
Date:
The following bug has been logged on the website:

Bug reference: 6457
Logged by: Albert Cieszkowski
Email address: albert.cieszkowski@cc.com.pl
PostgreSQL version: 9.0.6
Operating system: CentOS 5.x
Description:

OS, base and client encoding UTF-8:

peimp=> select 'Świnoujście' ~* '\mŚwinoujście\M';
 ?column?
----------
 f
(1 row)

peimp=> select 'Świnoujście' ~* '\AŚwinoujście\Z';
 ?column?
----------
 t
(1 row)

but:

peimp=> select 'Mróz' ~* '\mmróZ\M';
 ?column?
----------
 t
(1 row)

peimp=> select 'Mróz' ~* '\AmróZ\Z';
 ?column?
----------
 t
(1 row)

I believe it is connected with bug #5766 and #3433.
Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
From
Tom Lane
Date:
albert.cieszkowski@cc.com.pl writes:
> OS, base and client encoding UTF-8:

What's your lc_collate/lc_ctype settings?

regards, tom lane
Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
From
Albert Cieszkowski
Date:
Hello Tom,

Every lc_x value is pl_PL.UTF8 (corresponding to the word's language). The database was created with --locale=pl_PL.UTF8. The OS (CentOS 5.x) uses en_US.UTF-8.

Best regards,
Albert Cieszkowski

On 2012-02-14 16:27, Tom Lane wrote:
> albert.cieszkowski@cc.com.pl writes:
>> OS, base and client encoding UTF-8:
>
> What's your lc_collate/lc_ctype settings?
>
> regards, tom lane
Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
From
Tom Lane
Date:
albert.cieszkowski@cc.com.pl writes:
> peimp=> select 'Świnoujście' ~* '\mŚwinoujście\M';
>  ?column?
> ----------
>  f
> (1 row)

Oh, I see the reason for this: the code in cclass() in regc_locale.c doesn't go further up than U+00FF, so no codes above that will be thought to be letters (or members of any other character class). Clearly we need to go further when we are dealing with UTF8. I'm not sure what a sane limit would be though.

(It would be nice if there were a more efficient way to get this information than laboriously iterating through all the possible character codes. It doesn't look like we're even trying to cache the results, ick.)

regards, tom lane
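The effect described above can be reproduced with a toy model. The sketch below is not PostgreSQL code: build_alpha_class and is_whole_word are hypothetical helpers, the letter test is Python's own str.isalpha rather than the server's locale functions, and the \m...\M check is reduced to "the first and last characters are letters". It only illustrates why a class enumerated up to U+00FF accepts 'Mróz' (ó is U+00F3) but rejects 'Świnoujście' (Ś is U+015A), and how raising the limit, here to U+FFFF as one possible choice, changes the result.

```python
def build_alpha_class(limit):
    """Enumerate every code point from 0 to `limit` (inclusive) and keep the letters.

    Hypothetical stand-in for a character class built by brute-force iteration;
    uses Python's Unicode tables, not any PostgreSQL locale routine.
    """
    return {chr(cp) for cp in range(limit + 1) if chr(cp).isalpha()}


def is_whole_word(text, alpha):
    """Toy version of matching '\\m<text>\\M' against <text> itself.

    The word-boundary escapes can only succeed if the first and last
    characters are recognised as word characters (letters only, here).
    """
    return bool(text) and text[0] in alpha and text[-1] in alpha


narrow = build_alpha_class(0xFF)    # the limit described in the message above
wide = build_alpha_class(0xFFFF)    # assumption: one candidate for a higher limit

for word in ("Świnoujście", "Mróz"):
    print(word, is_whole_word(word, narrow), is_whole_word(word, wide))
```

Building the set once and reusing it, as done with narrow and wide here, is also the simplest form of the caching the message wishes for.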
Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
From
Duncan Rance
Date:
On 14 Feb 2012, at 18:28, Tom Lane wrote:
> Oh, I see the reason for this: the code in cclass() in regc_locale.c
> doesn't go further up than U+00FF, so no codes above that will be
> thought to be letters (or members of any other character class).
> Clearly we need to go further when we are dealing with UTF8.
> I'm not sure what a sane limit would be though.

The Basic Multilingual Plane goes up to FFFF:
https://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Planes