A thought about regex versus multibyte character sets - Mailing list pgsql-hackers

From	Tom Lane
Subject	A thought about regex versus multibyte character sets
Date	November 30, 2009 14:15:28
Msg-id	7341.1259604906@sss.pgh.pa.us
Responses	Re: A thought about regex versus multibyte character sets Re: A thought about regex versus multibyte character sets
List	pgsql-hackers

We've had many complaints about the fact that the regex functions
are not bright about locale-dependent operations in multibyte character
sets, especially case-insensitive matching.  The reason for this, as
was discussed in this thread
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php
is that we'd need to use the <wctype.h> functions, but those expect
the platform's wchar_t representation, whereas the regex stuff works
on pg_wchar_t which might have a different character set mapping.

I just spent a bit of time considering what we might do to fix this.
The idea mentioned in the above thread was to switch over to using
wchar_t in the regex code, but that seems to have a number of problems.
One showstopper is that on some platforms wchar_t is only 16 bits and
can't represent the full range of Unicode characters.  I don't want to
fix case-folding only to break regexes for other uses.

However, it strikes me that we might be overstating the size of the
mismatch between wchar_t and pg_wchar_t representations.  In particular,
for Unicode-based locales it seems virtually certain that every platform
would use Unicode code points for the wchar_t representation, and that
is also our representation in pg_wchar_t.

I therefore propose the following idea: if the database encoding is
UTF8, allow the regc_locale.c functions to call the <wctype.h>
functions, assuming that wchar_t and pg_wchar_t share the same
representation.  On platforms where wchar_t is only 16 bits, we can do
this up to U+FFFF and be stupid about code points above that.
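
A rough sketch of what that might look like for the lower-casing case follows.
The names sketch_wchar, encoding_is_utf8 and sketch_tolower are made up for
illustration and are not the actual regc_locale.c code; the type is just a
stand-in for the regex code's wide-character type, and the flag stands in for
a check of the database encoding.  Only towlower() from <wctype.h> is real,
and its results depend on the process's LC_CTYPE setting.

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for the regex code's wide-character type. */
typedef unsigned int sketch_wchar;

/* Hypothetical flag; real code would ask what the database encoding is. */
static bool encoding_is_utf8 = true;

/*
 * Can this code point be handed to the <wctype.h> functions?  Where
 * wchar_t is only 16 bits, anything above U+FFFF cannot be represented.
 */
static bool
fits_in_wchar_t(sketch_wchar c)
{
    return sizeof(wchar_t) > 2 || c <= 0xFFFF;
}

/*
 * Lower-case one character.  Under UTF8 we assume the code point is
 * also a valid wchar_t value and let the platform's locale decide;
 * otherwise we fall back to ASCII-only folding.
 */
static sketch_wchar
sketch_tolower(sketch_wchar c)
{
    if (encoding_is_utf8 && fits_in_wchar_t(c))
        return (sketch_wchar) towlower((wint_t) c);

    /* Non-UTF8 encoding, or code point too large for wchar_t */
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}
```

The other character-class tests (iswalpha() and friends) would presumably
follow the same pattern: one range check, then either the <wctype.h> call
or the existing ASCII-only behavior.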

I think this will solve at least 99% of the problem for a fairly small
amount of work.  It does not do anything for non-UTF8 multibyte
encodings, but so far as I can see the only such encodings are Far
Eastern ones, in which the present ASCII-only behavior is probably good
enough --- concepts like case don't apply to their non-ASCII characters
anyhow.  (Well, there's also MULE_INTERNAL, but I don't believe anyone
runs their DB in that.)

However, not being a native user of any non-ASCII character set, I might
be missing something big here.

Comments?
        regards, tom lane
