Thread: Regexps vs. locale

Regexps vs. locale

From

Andrew Gierth

Date:

08 December 2008, 04:14:06

This came up on irc:

postgres=# show lc_ctype;
  lc_ctype
-------------
 fr_FR.UTF-8

postgres=# show server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

postgres=# select E'\303\201' ILIKE  E'\303\241';
 ?column?
----------
 t
(1 row)

postgres=# select E'\303\201' ~*  E'\303\241';
 ?column?
----------
 f
(1 row)

Obviously, this happens because the locale support functions in
backend/regex/regc_locale.c are (presumably intentionally) crippled so
as not to support non-ascii chars, despite all the code there using
wide chars for everything otherwise.

Why is this? It does not appear to be a documented restriction.
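
To make the symptom concrete, here is a minimal C sketch of case folding that handles only ASCII letters. It is a hypothetical illustration, not the actual regc_locale.c source; the names `ascii_only_tolower` and `ascii_only_ci_equal` are invented for this example.

```c
#include <wchar.h>

/*
 * Hypothetical helper: folds case for ASCII letters only.
 * Any other code point is returned unchanged, even if it has a
 * lower-case counterpart in the current locale.
 */
wchar_t
ascii_only_tolower(wchar_t c)
{
    if (c >= L'A' && c <= L'Z')
        return c + (L'a' - L'A');
    return c;
}

/*
 * Hypothetical helper: returns 1 if the two wide characters are equal
 * under the ASCII-only folding above, 0 otherwise.
 */
int
ascii_only_ci_equal(wchar_t a, wchar_t b)
{
    return ascii_only_tolower(a) == ascii_only_tolower(b);
}
```

With folding restricted this way, 'A' and 'a' compare equal, but U+00C1 and U+00E1 (the two characters in the queries above) do not, which would be consistent with ~* returning f.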

-- 
Andrew (irc:RhodiumToad)

Re: Regexps vs. locale

From

Tom Lane

Date:

08 December 2008, 09:19:05

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.

It's not so much intentional as that no one has gotten around to making
it work.  The difficulty is that the wide-char codes we are using might
not match what the <wctype.h> functions expect, and it's unclear what
we could do to fix that.
        regards, tom lane

Re: Regexps vs. locale

From

Andrew Gierth

Date:

08 December 2008, 13:56:34

>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
 > Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
 >> Obviously, this happens because the locale support functions in
 >> backend/regex/regc_locale.c are (presumably intentionally)
 >> crippled so as not to support non-ascii chars, despite all the
 >> code there using wide chars for everything otherwise.

 Tom> It's not so much intentional as that no one has gotten around to
 Tom> making it work.  The difficulty is that the wide-char codes we
 Tom> are using might not match what the <wctype.h> functions expect,
 Tom> and it's unclear what we could do to fix that.

Couldn't we follow the example of lower(), and convert the string to
wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?

This obviously requires that we have a matching lc_ctype for the
encoding, but we insist on that now anyway, no?
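
A minimal C sketch of that idea, assuming the process's LC_CTYPE matches the encoding of the input strings: convert both strings with mbstowcs() and compare them character by character after towlower(). The helper name `fold_equal` is hypothetical and this is not PostgreSQL code; it also relies on the POSIX behaviour of mbstowcs() with a NULL destination to obtain the length.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not a PostgreSQL function.
 * Returns 1 if a and b are equal ignoring case, 0 if they differ,
 * and -1 if either string is invalid in the current LC_CTYPE
 * encoding or memory cannot be allocated.
 */
int
fold_equal(const char *a, const char *b)
{
    /* POSIX: with a NULL destination, mbstowcs returns the length. */
    size_t      na = mbstowcs(NULL, a, 0);
    size_t      nb = mbstowcs(NULL, b, 0);
    wchar_t    *wa;
    wchar_t    *wb;
    size_t      i;
    int         result = 1;

    if (na == (size_t) -1 || nb == (size_t) -1)
        return -1;              /* invalid multibyte sequence */
    if (na != nb)
        return 0;               /* different number of characters */

    wa = malloc((na + 1) * sizeof(wchar_t));
    wb = malloc((nb + 1) * sizeof(wchar_t));
    if (wa == NULL || wb == NULL)
    {
        free(wa);
        free(wb);
        return -1;
    }

    mbstowcs(wa, a, na + 1);
    mbstowcs(wb, b, nb + 1);

    /* Compare using the locale's own wide-character case mapping. */
    for (i = 0; i < na; i++)
    {
        if (towlower((wint_t) wa[i]) != towlower((wint_t) wb[i]))
        {
            result = 0;
            break;
        }
    }

    free(wa);
    free(wb);
    return result;
}
```

After a successful setlocale(LC_CTYPE, "fr_FR.UTF-8"), fold_equal("\303\201", "\303\241") would be expected to report a match, since the wchar_t values then come from the same C library whose <wctype.h> functions interpret them.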

-- 
Andrew.

Re: Regexps vs. locale

From

Tom Lane

Date:

10 December 2008, 13:52:46

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> It's not so much intentional as that no one has gotten around to
>  Tom> making it work.  The difficulty is that the wide-char codes we
>  Tom> are using might not match what the <wctype.h> functions expect,
>  Tom> and it's unclear what we could do to fix that.

> Couldn't we follow the example of lower(), and convert the string to
> wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?

Possibly.  I think we did not have the char2wchar() infrastructure
when the regexp stuff was last gone over, so it might be more practical
to do that now.
        regards, tom lane

Re: Regexps vs. locale

From

Bruce Momjian

Date:

07 January 2009, 00:44:29

Added to TODO:
    Add ability to use case-insensitive regular expressions on multi-byte
    characters

    ILIKE already works with multi-byte characters

        * http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php
 

---------------------------------------------------------------------------

Andrew Gierth wrote:
> This came up on irc:
> 
> postgres=# show lc_ctype;
>   lc_ctype   
> -------------
>  fr_FR.UTF-8
> 
> postgres=# show server_encoding;
>  server_encoding 
> -----------------
>  UTF8
> (1 row)
> 
> postgres=# select E'\303\201' ILIKE  E'\303\241';
>  ?column? 
> ----------
>  t
> (1 row)
> 
> postgres=# select E'\303\201' ~*  E'\303\241';
>  ?column? 
> ----------
>  f
> (1 row)
> 
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.
> 
> Why is this? It does not appear to be a documented restriction.
> 
> -- 
> Andrew (irc:RhodiumToad)

-- 
 Bruce Momjian  <bruce@momjian.us>        http://momjian.us
 EnterpriseDB                             http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +