Thread: Regexps vs. locale
This came up on irc:

postgres=# show lc_ctype;
  lc_ctype
-------------
 fr_FR.UTF-8

postgres=# show server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

postgres=# select E'\303\201' ILIKE E'\303\241';
 ?column?
----------
 t
(1 row)

postgres=# select E'\303\201' ~* E'\303\241';
 ?column?
----------
 f
(1 row)

Obviously, this happens because the locale support functions in
backend/regex/regc_locale.c are (presumably intentionally) crippled so
as not to support non-ascii chars, despite all the code there using
wide chars for everything otherwise.

Why is this? It does not appear to be a documented restriction.

--
Andrew (irc:RhodiumToad)
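For illustration, the two byte sequences in the queries above decode in UTF-8 to U+00C1 and U+00E1, the upper- and lower-case forms of the same accented letter. The following is a minimal C sketch of what an ASCII-only case fold does to such a pair. The function names are hypothetical and this is not the code in regc_locale.c; it only shows why a fold limited to A-Z would report the pair as different.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not PostgreSQL code: lower-cases ASCII letters
 * only.  Any code point outside 'A'..'Z' is returned unchanged.
 */
static unsigned int
ascii_only_lower(unsigned int cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + ('a' - 'A');
    return cp;
}

/* Case-insensitive comparison of two code points using the fold above. */
static bool
ascii_only_ci_equal(unsigned int a, unsigned int b)
{
    return ascii_only_lower(a) == ascii_only_lower(b);
}
```

With this fold, ascii_only_ci_equal('A', 'a') is true, while ascii_only_ci_equal(0x00C1, 0x00E1) is false, which mirrors the "f" result of the ~* query.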
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.

It's not so much intentional as that no one has gotten around to making
it work. The difficulty is that the wide-char codes we are using might
not match what the <wctype.h> functions expect, and it's unclear what we
could do to fix that.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 > Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
 >> Obviously, this happens because the locale support functions in
 >> backend/regex/regc_locale.c are (presumably intentionally)
 >> crippled so as not to support non-ascii chars, despite all the
 >> code there using wide chars for everything otherwise.

 Tom> It's not so much intentional as that no one has gotten around to
 Tom> making it work. The difficulty is that the wide-char codes we
 Tom> are using might not match what the <wctype.h> functions expect,
 Tom> and it's unclear what we could do to fix that.

Couldn't we follow the example of lower(), and convert the string to
wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?

This obviously requires that we have a matching lc_ctype for the
encoding, but we insist on that now anyway, no?

--
Andrew.
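The suggestion above can be sketched with nothing but the standard C library: convert both multibyte strings with mbstowcs() and compare them character by character through towlower(). The name mb_ci_equal and the fixed buffer size are assumptions made for illustration. This is not PostgreSQL code, and it only gives meaningful answers when the process's LC_CTYPE matches the encoding of the input, which is the requirement mentioned in the message.

```c
#include <locale.h>     /* setlocale(), for callers */
#include <stdlib.h>     /* mbstowcs(), size_t */
#include <wchar.h>
#include <wctype.h>     /* towlower() */

#define MAX_WCHARS 256  /* illustrative limit, not taken from the thread */

/*
 * Hypothetical helper: compare two multibyte strings ignoring case,
 * using the current LC_CTYPE locale.
 *
 * Returns 1 if equal, 0 if different, and -1 if either string cannot
 * be converted in the current locale or does not fit in the buffer.
 */
static int
mb_ci_equal(const char *a, const char *b)
{
    wchar_t wa[MAX_WCHARS];
    wchar_t wb[MAX_WCHARS];
    size_t  na = mbstowcs(wa, a, MAX_WCHARS);
    size_t  nb = mbstowcs(wb, b, MAX_WCHARS);

    if (na == (size_t) -1 || nb == (size_t) -1)
        return -1;              /* invalid sequence for this locale */
    if (na == MAX_WCHARS || nb == MAX_WCHARS)
        return -1;              /* input may have been truncated */
    if (na != nb)
        return 0;

    for (size_t i = 0; i < na; i++)
    {
        if (towlower((wint_t) wa[i]) != towlower((wint_t) wb[i]))
            return 0;
    }
    return 1;
}
```

A caller would first select a locale, for example with setlocale(LC_CTYPE, ""). Under a UTF-8 locale such as fr_FR.UTF-8, mb_ci_equal("\303\201", "\303\241") would be expected to return 1, since towlower() then operates on the wchar_t values that mbstowcs() produced for that same locale.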
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> It's not so much intentional as that no one has gotten around to
>  Tom> making it work. The difficulty is that the wide-char codes we
>  Tom> are using might not match what the <wctype.h> functions expect,
>  Tom> and it's unclear what we could do to fix that.

> Couldn't we follow the example of lower(), and convert the string to
> wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?

Possibly. I think we did not have the char2wchar() infrastructure when
the regexp stuff was last gone over, so it might be more practical to do
that now.

regards, tom lane
Added to TODO:

    Add ability to use case-insensitive regular expressions on
    multi-byte characters

      ILIKE already works with multi-byte characters

      * http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php

--
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com