Notes about fixing regexes and UTF-8 (yet again) - Mailing list pgsql-hackers

From Tom Lane
Subject Notes about fixing regexes and UTF-8 (yet again)
Date
Msg-id 24241.1329347196@sss.pgh.pa.us
Responses Re: Notes about fixing regexes and UTF-8 (yet again)  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
In bug #6457 it's pointed out that we *still* don't have full
functionality for locale-dependent regexp behavior with UTF8 encoding.
The reason is that there's old crufty code in regc_locale.c that only
considers character codes up to 255 when searching for characters that
should be considered "letters", "digits", etc.  We could fix that, for
some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
UTF8 encoding, but the time that would take is unappealing.  Especially
so considering that this code is executed afresh anytime we compile a
regex that requires locale knowledge.
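
To make the cost concrete, here is a rough sketch of the kind of probe
loop under discussion. It is not the code in regc_locale.c; the names
class_probe and probe_range are made up for illustration, and the only
library calls assumed are the standard <wctype.h> classifiers such as
iswalpha() and iswdigit(). The work is linear in max_chr, so raising the
limit from 255 to 0xFFFF means 256 times as many libc calls for each
character class a regex asks about.

```c
#include <assert.h>
#include <stddef.h>
#include <wctype.h>

/* Hypothetical predicate type; iswalpha, iswdigit, etc. all fit it. */
typedef int (*class_probe) (wint_t);

/*
 * Collect every code point in [0, max_chr] for which probe() is true.
 * At most cap entries are written to out (out may be NULL to just count).
 * Returns the total number of matches, which may exceed cap.
 */
static size_t
probe_range(class_probe probe, unsigned long max_chr,
            unsigned long *out, size_t cap)
{
    size_t        n = 0;
    unsigned long c;

    assert(probe != NULL);
    for (c = 0; c <= max_chr; c++)
    {
        if (probe((wint_t) c))
        {
            if (out != NULL && n < cap)
                out[n] = c;
            n++;
        }
    }
    return n;
}
```

Calling probe_range(iswalpha, 255, ...) corresponds to the current
behavior, and probe_range(iswalpha, 0xFFFF, ...) to the brute-force fix.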

I looked into the upstream Tcl code and observed that they deal with
this by having hard-wired tables of which Unicode code points are to be
considered letters etc.  The tables are directly traceable to the
Unicode standard (they provide a script to regenerate them from files
available from unicode.org).  Nonetheless, I do not find that approach
appealing, mainly because we'd be risking deviating from the libc locale
code's behavior within regexes when we follow it everywhere else.
It seems entirely likely to me that a particular locale setting might
consider only some of what Unicode says are letters to be letters.

However, we could possibly compromise by using Unicode-derived tables
as a guide to which code points are worth probing libc for.  That is,
assume that a UTF8-based locale will never claim that some code is a
letter that unicode.org doesn't think is a letter.  That would cut the
number of required probes by a pretty large factor.
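
A minimal sketch of that compromise follows. The range table is a small
hand-picked excerpt (basic Latin, part of Latin-1, basic Greek and
Cyrillic), not the real Unicode-derived data, which would be generated
from the unicode.org files. The names chr_range, alpha_candidates and
probe_candidates are hypothetical. libc still has the final say on each
code point; the table only limits which code points get asked about.

```c
#include <assert.h>
#include <stddef.h>
#include <wctype.h>

typedef int (*class_probe) (wint_t);

typedef struct
{
    unsigned long lo;           /* first code point in range */
    unsigned long hi;           /* last code point in range, inclusive */
} chr_range;

/* Illustrative and deliberately incomplete list of letter candidates. */
static const chr_range alpha_candidates[] = {
    {0x0041, 0x005A},           /* A-Z */
    {0x0061, 0x007A},           /* a-z */
    {0x00C0, 0x00D6},           /* Latin-1 letters, skipping U+00D7 */
    {0x00D8, 0x00F6},           /* Latin-1 letters, skipping U+00F7 */
    {0x0391, 0x03A1},           /* Greek capitals Alpha..Rho */
    {0x03A3, 0x03A9},           /* Greek capitals Sigma..Omega */
    {0x03B1, 0x03C9},           /* Greek small alpha..omega */
    {0x0410, 0x044F}            /* basic Cyrillic */
};

#define N_ALPHA_CANDIDATES \
    (sizeof(alpha_candidates) / sizeof(alpha_candidates[0]))

/*
 * Probe libc only for code points inside the candidate ranges.
 * Returns the number of code points libc accepted; if nprobes is not
 * NULL, it receives the number of libc calls that were made.
 */
static size_t
probe_candidates(class_probe probe, const chr_range *ranges,
                 size_t nranges, size_t *nprobes)
{
    size_t        matches = 0;
    size_t        probes = 0;
    size_t        i;
    unsigned long c;

    assert(probe != NULL && ranges != NULL);
    for (i = 0; i < nranges; i++)
    {
        for (c = ranges[i].lo; c <= ranges[i].hi; c++)
        {
            probes++;
            if (probe((wint_t) c))
                matches++;
        }
    }
    if (nprobes != NULL)
        *nprobes = probes;
    return matches;
}
```

With this excerpt, probe_candidates(iswalpha, alpha_candidates,
N_ALPHA_CANDIDATES, &n) makes a few hundred libc calls instead of 65536.
How many of them succeed depends on the active locale.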

The other thing that seems worth doing is to install some caching.
We could presumably assume that the behavior of iswupper() et al is
fixed for the duration of a database session, so that we only need to
run the probe loop once when first asked to create a cvec for a
particular category.
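
The caching idea could look roughly like the sketch below. Everything
here is an assumption made for illustration: the category ids, the
cclass_cache array, the get_cclass() function, and the plain array of
code points standing in for a real cvec. It also assumes one locale per
session, as stated above. The probe_loops_run counter exists only to
show that the loop runs once per category.

```c
#include <assert.h>
#include <stdlib.h>
#include <wctype.h>

typedef int (*class_probe) (wint_t);

/* Hypothetical category ids and their libc classifiers. */
enum { CC_ALPHA, CC_DIGIT, CC_UPPER, NUM_CCLASSES };

static const class_probe cclass_probe[NUM_CCLASSES] = {
    iswalpha, iswdigit, iswupper
};

typedef struct
{
    int            valid;       /* has this category been probed yet? */
    size_t         nchrs;       /* number of entries in chrs[] */
    unsigned long *chrs;        /* matching code points, ascending */
} cclass_cache_entry;

/* One slot per category; zero-initialized, so all start out invalid. */
static cclass_cache_entry cclass_cache[NUM_CCLASSES];
static unsigned long probe_loops_run;   /* demonstration only */

#define MAX_PROBE_CHR 0xFFFFUL

/*
 * Return the cached member list for a category, running the probe loop
 * only on the first request.  Returns NULL if memory allocation fails.
 */
static const cclass_cache_entry *
get_cclass(int cls)
{
    cclass_cache_entry *e;
    unsigned long       c;
    size_t              n = 0;

    assert(cls >= 0 && cls < NUM_CCLASSES);
    e = &cclass_cache[cls];
    if (e->valid)
        return e;               /* cache hit: no libc calls at all */

    e->chrs = malloc((MAX_PROBE_CHR + 1) * sizeof(unsigned long));
    if (e->chrs == NULL)
        return NULL;
    for (c = 0; c <= MAX_PROBE_CHR; c++)
    {
        if (cclass_probe[cls] ((wint_t) c))
            e->chrs[n++] = c;
    }
    e->nchrs = n;
    e->valid = 1;
    probe_loops_run++;
    return e;
}
```

The first get_cclass(CC_ALPHA) call pays for the full probe loop; later
calls in the same process return the same entry immediately. The entries
are never freed, on the assumption that they live as long as the session.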

Thoughts, better ideas?
        regards, tom lane

