Thread: A thought about regex versus multibyte character sets

A thought about regex versus multibyte character sets

From
Tom Lane
Date:
We've had many complaints about the fact that the regex functions
are not bright about locale-dependent operations in multibyte character
sets, especially case-insensitive matching.  The reason for this, as
was discussed in this thread
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php
is that we'd need to use the <wctype.h> functions, but those expect
the platform's wchar_t representation, whereas the regex stuff works
on pg_wchar, which might have a different character set mapping.

I just spent a bit of time considering what we might do to fix this.
The idea mentioned in the above thread was to switch over to using
wchar_t in the regex code, but that seems to have a number of problems.
One showstopper is that on some platforms wchar_t is only 16 bits and
can't represent the full range of Unicode characters.  I don't want to
fix case-folding only to break regexes for other uses.
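
To make the 16-bit limitation concrete, here is a minimal sketch (not part of
any patch; the function name utf16_units is made up for illustration) of how a
code point above U+FFFF has to be split into two 16-bit units under UTF-16,
which is the usual encoding on platforms with a 16-bit wchar_t.  A single
wchar_t value on such a platform therefore cannot name one of those characters.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode code point cp as UTF-16 into out[] and
 * return the number of 16-bit units used (1 or 2).  The caller is
 * expected to pass a valid scalar value, i.e. not a surrogate.
 */
int
utf16_units(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);

    if (cp < 0x10000)
    {
        /* Fits in one unit; a 16-bit wchar_t can hold it directly. */
        out[0] = (uint16_t) cp;
        return 1;
    }

    /* Needs a surrogate pair; no single 16-bit value represents it. */
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For example, U+00E9 comes back as one unit, while U+1F600 comes back as the
pair 0xD83D 0xDE00.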

However, it strikes me that we might be overstating the size of the
mismatch between wchar_t and pg_wchar representations.  In particular,
for Unicode-based locales it seems virtually certain that every platform
would use Unicode code points for the wchar_t representation, and that
is also our representation in pg_wchar.
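
The assumption can be probed on a given platform with a small standalone
program.  The following is only a sketch: the function names and the list of
locale names to try are my own guesses, not anything from the PostgreSQL tree.
It decodes the UTF-8 bytes for U+00E9 with mbrtowc() and checks whether the
resulting wchar_t is 0xE9.  C99 also lets an implementation define
__STDC_ISO_10646__ to advertise that wchar_t values are ISO 10646 code points,
so the sketch reports that too.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Returns 1 if the implementation advertises ISO 10646 wchar_t values. */
int
platform_claims_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/*
 * Returns 1 if U+00E9 decodes to the wchar_t value 0xE9, 0 if it decodes
 * to something else, and -1 if nothing could be checked (no UTF-8 locale
 * from the guessed list is installed, or the decode failed).
 */
int
wchar_holds_code_points(void)
{
    /* Locale names are assumptions; adjust for the platform at hand. */
    static const char *candidates[] = {"C.UTF-8", "en_US.UTF-8", "en_US.utf8"};
    const char  e_acute[] = "\xC3\xA9";     /* U+00E9 encoded as UTF-8 */
    mbstate_t   state;
    wchar_t     wc = 0;
    size_t      i;
    int         found = 0;

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
    {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL)
        {
            found = 1;
            break;
        }
    }
    if (!found)
        return -1;

    memset(&state, 0, sizeof(state));
    if (mbrtowc(&wc, e_acute, 2, &state) != 2)
        return -1;

    return wc == (wchar_t) 0xE9;
}
```

A result of 0 from wchar_holds_code_points() on some platform would be a
counterexample to the assumption above.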

I therefore propose the following idea: if the database encoding is
UTF8, allow the regc_locale.c functions to call the <wctype.h>
functions, assuming that wchar_t and pg_wchar share the same
representation.  On platforms where wchar_t is only 16 bits, we can do
this up to U+FFFF and be stupid about code points above that.

I think this will solve at least 99% of the problem for a fairly small
amount of work.  It does not do anything for non-UTF8 multibyte
encodings, but so far as I can see the only such encodings are Far
Eastern ones, in which the present ASCII-only behavior is probably good
enough --- concepts like case don't apply to their non-ASCII characters
anyhow.  (Well, there's also MULE_INTERNAL, but I don't believe anyone
runs their DB in that.)

However, not being a native user of any non-ASCII character set, I might
be missing something big here.

Comments?
        regards, tom lane


Re: A thought about regex versus multibyte character sets

From
Tom Lane
Date:
I wrote:
> I therefore propose the following idea: if the database encoding is
> UTF8, allow the regc_locale.c functions to call the <wctype.h>
> functions, assuming that wchar_t and pg_wchar share the same
> representation.  On platforms where wchar_t is only 16 bits, we can do
> this up to U+FFFF and be stupid about code points above that.

Or to be concrete, how about the attached?  It seems to do what's
wanted, but I'm hardly the best-qualified person to test it.

            regards, tom lane

Index: src/backend/regex/regc_locale.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/regex/regc_locale.c,v
retrieving revision 1.9
diff -c -r1.9 regc_locale.c
*** src/backend/regex/regc_locale.c    14 Feb 2008 17:33:37 -0000    1.9
--- src/backend/regex/regc_locale.c    1 Dec 2009 03:04:29 -0000
***************
*** 349,415 ****
      }
  };

  /*
!  * some ctype functions with non-ascii-char guard
   */
  static int
  pg_wc_isdigit(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && isdigit((unsigned char) c));
  }

  static int
  pg_wc_isalpha(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && isalpha((unsigned char) c));
  }

  static int
  pg_wc_isalnum(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && isalnum((unsigned char) c));
  }

  static int
  pg_wc_isupper(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && isupper((unsigned char) c));
  }

  static int
  pg_wc_islower(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && islower((unsigned char) c));
  }

  static int
  pg_wc_isgraph(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && isgraph((unsigned char) c));
  }

  static int
  pg_wc_isprint(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && isprint((unsigned char) c));
  }

  static int
  pg_wc_ispunct(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && ispunct((unsigned char) c));
  }

  static int
  pg_wc_isspace(pg_wchar c)
  {
!     return (c >= 0 && c <= UCHAR_MAX && isspace((unsigned char) c));
  }

  static pg_wchar
  pg_wc_toupper(pg_wchar c)
  {
!     if (c >= 0 && c <= UCHAR_MAX)
          return toupper((unsigned char) c);
      return c;
  }
--- 349,500 ----
      }
  };

+
  /*
!  * ctype functions adapted to work on pg_wchar (a/k/a chr)
!  *
!  * When working in UTF8 encoding, we use the <wctype.h> functions if
!  * available.  This assumes that every platform uses Unicode codepoints
!  * directly as the wchar_t representation of Unicode.  On some platforms
!  * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
!  *
!  * In all other encodings, we use the <ctype.h> functions for pg_wchar
!  * values up to 255, and punt for values above that.  This is only 100%
!  * correct in single-byte encodings such as LATINn.  However, non-Unicode
!  * multibyte encodings are mostly Far Eastern character sets for which the
!  * properties being tested here aren't relevant for higher code values anyway.
!  *
!  * NB: the coding here assumes pg_wchar is an unsigned type.
   */
+
  static int
  pg_wc_isdigit(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswdigit((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && isdigit((unsigned char) c));
  }

  static int
  pg_wc_isalpha(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswalpha((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && isalpha((unsigned char) c));
  }

  static int
  pg_wc_isalnum(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswalnum((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && isalnum((unsigned char) c));
  }

  static int
  pg_wc_isupper(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswupper((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && isupper((unsigned char) c));
  }

  static int
  pg_wc_islower(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswlower((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && islower((unsigned char) c));
  }

  static int
  pg_wc_isgraph(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswgraph((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && isgraph((unsigned char) c));
  }

  static int
  pg_wc_isprint(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswprint((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && isprint((unsigned char) c));
  }

  static int
  pg_wc_ispunct(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswpunct((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && ispunct((unsigned char) c));
  }

  static int
  pg_wc_isspace(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return iswspace((wint_t) c);
!     }
! #endif
!     return (c <= (pg_wchar) UCHAR_MAX && isspace((unsigned char) c));
  }

  static pg_wchar
  pg_wc_toupper(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return towupper((wint_t) c);
!     }
! #endif
!     if (c <= (pg_wchar) UCHAR_MAX)
          return toupper((unsigned char) c);
      return c;
  }
***************
*** 417,423 ****
  static pg_wchar
  pg_wc_tolower(pg_wchar c)
  {
!     if (c >= 0 && c <= UCHAR_MAX)
          return tolower((unsigned char) c);
      return c;
  }
--- 502,515 ----
  static pg_wchar
  pg_wc_tolower(pg_wchar c)
  {
! #ifdef USE_WIDE_UPPER_LOWER
!     if (GetDatabaseEncoding() == PG_UTF8)
!     {
!         if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
!             return towlower((wint_t) c);
!     }
! #endif
!     if (c <= (pg_wchar) UCHAR_MAX)
          return tolower((unsigned char) c);
      return c;
  }
Index: src/include/regex/regcustom.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/regex/regcustom.h,v
retrieving revision 1.7
diff -c -r1.7 regcustom.h
*** src/include/regex/regcustom.h    14 Feb 2008 17:33:37 -0000    1.7
--- src/include/regex/regcustom.h    1 Dec 2009 03:04:29 -0000
***************
*** 34,39 ****
--- 34,50 ----
  #include <ctype.h>
  #include <limits.h>

+ /*
+  * towlower() and friends should be in <wctype.h>, but some pre-C99 systems
+  * declare them in <wchar.h>.
+  */
+ #ifdef HAVE_WCHAR_H
+ #include <wchar.h>
+ #endif
+ #ifdef HAVE_WCTYPE_H
+ #include <wctype.h>
+ #endif
+
  #include "mb/pg_wchar.h"



Re: A thought about regex versus multibyte character sets

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I just spent a bit of time considering what we might do to fix this.
> The idea mentioned in the above thread was to switch over to using
> wchar_t in the regex code, but that seems to have a number of problems.
> One showstopper is that on some platforms wchar_t is only 16 bits and
> can't represent the full range of Unicode characters.  I don't want to
> fix case-folding only to break regexes for other uses.

We have a TODO item about having a regex specific data type.  Would
implementing that solve this problem?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: A thought about regex versus multibyte character sets

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I just spent a bit of time considering what we might do to fix this.
>> The idea mentioned in the above thread was to switch over to using
>> wchar_t in the regex code, but that seems to have a number of problems.
>> One showstopper is that on some platforms wchar_t is only 16 bits and
>> can't represent the full range of Unicode characters.  I don't want to
>> fix case-folding only to break regexes for other uses.

> We have a TODO item about having a regex specific data type.  Would
> implementing that solve this problem?

No, not particularly --- the stumbling block here is really impedance
mismatch between our internal APIs and libc's standard locale support.
The TODO item that would fix it is implementing our own locale support;
but I ain't holding my breath for that one.

AFAIR the motivation for a regex data type was solely performance.
        regards, tom lane