Thread: Regex code versus Unicode chars beyond codepoint 255

Regex code versus Unicode chars beyond codepoint 255

From
Tom Lane
Date:
Bug #5766 points out that we're still not there yet in terms of having
sane behavior for locale-specific regex operations in Unicode encoding.
The reason it's not working is that regc_locale does this to expand
the set of characters that are considered to match [[:alnum:]] :
    /*
     * Now compute the character class contents.
     *
     * For the moment, assume that only char codes < 256 can be in these
     * classes.
     */
    ...
        case CC_ALNUM:
            cv = getcvec(v, UCHAR_MAX, 0);
            if (cv)
            {
                for (i = 0; i <= UCHAR_MAX; i++)
                {
                    if (pg_wc_isalnum((chr) i))
                        addchr(cv, (chr) i);
                }
            }
            break;

This is a leftover from when we weren't trying to behave sanely for
multibyte encodings.  Now that we are, it's clearly not good enough.
But iterating up to many thousands in this loop isn't too appetizing
from a performance standpoint.

I looked at the equivalent place in Tcl, and I see that what they're
currently doing is they have a hard-wired list of all the Unicode
code points that are classified as alnum, punct, etc.  We could
duplicate that (and use it only if encoding is UTF8), but it seems
kind of ugly, and it doesn't respect the idea that the locale setting
ought to control which characters are considered to be in each class.

Another possibility is to take those lists but apply iswpunct() and
friends to the values, including only code points that pass in the
finished set.  So what you get is the intersection of the Unicode list
and the locale behavior.
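
The intersection idea can be sketched roughly as below. This is only an illustration under stated assumptions: `cp_range`, `sample_alnum_ranges`, `class_probe`, `ascii_only_alnum` and `intersect_with_probe` are hypothetical names, not taken from regc_locale.c or from Tcl, and the range table is a tiny sample rather than a real Unicode list. The `probe` argument stands in for iswalnum() and friends.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical inclusive code point range; not a PostgreSQL or Tcl type. */
typedef struct
{
    wint_t lo;
    wint_t hi;
} cp_range;

/* Tiny illustrative sample only; a real table would cover all of Unicode. */
const cp_range sample_alnum_ranges[] = {
    {0x0030, 0x0039},   /* ASCII digits */
    {0x0041, 0x005A},   /* ASCII uppercase letters */
    {0x0061, 0x007A},   /* ASCII lowercase letters */
    {0x0391, 0x03A1},   /* Greek capital letters Alpha..Rho */
    {0x05D0, 0x05EA},   /* Hebrew letters Alef..Tav */
};

/* Same shape as iswalnum(), iswpunct(), etc. */
typedef int (*class_probe) (wint_t);

/* Hypothetical stand-in for a locale that only classifies ASCII. */
int
ascii_only_alnum(wint_t c)
{
    return (c >= 0x30 && c <= 0x39) ||
           (c >= 0x41 && c <= 0x5A) ||
           (c >= 0x61 && c <= 0x7A);
}

/*
 * Walk the candidate ranges and keep only code points the probe accepts.
 * Stores at most outcap code points in out[]; returns the total number
 * accepted, which may exceed outcap.
 */
size_t
intersect_with_probe(const cp_range *ranges, size_t nranges,
                     class_probe probe, wint_t *out, size_t outcap)
{
    size_t n = 0;

    assert(ranges != NULL && probe != NULL && out != NULL);

    for (size_t r = 0; r < nranges; r++)
    {
        for (wint_t c = ranges[r].lo; c <= ranges[r].hi; c++)
        {
            if (probe(c))
            {
                if (n < outcap)
                    out[n] = c;
                n++;
            }
        }
    }
    return n;
}
```

Passing iswalnum as the probe, after a call to setlocale(LC_CTYPE, ""), would give the locale-driven variant; what it accepts then depends on the platform and the active locale, so no particular result is claimed here.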

Some of the performance pressure could be taken off if we cached
the results instead of recomputing them every time a regex uses
the character classification; but I'm not sure how much that would
save.
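
For illustration, a minimal caching sketch might look like the following. It assumes one cache entry per (character class, locale) pair, which the caller is responsible for keeping apart; `class_cache`, `class_cache_fill` and `class_cache_free` are hypothetical names and not existing regex-code functions. The `scans` counter is there only to show that the expensive loop runs once.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Same shape as iswalnum(), iswdigit(), etc. */
typedef int (*class_probe) (wint_t);

/* Hypothetical cache entry for one character class in one locale. */
typedef struct
{
    wint_t        *members;     /* code points that passed the probe */
    size_t         nmembers;    /* number of entries in members[] */
    int            valid;       /* nonzero once members[] is filled */
    unsigned long  scans;       /* how many full scans were performed */
} class_cache;

/*
 * Fill the cache by probing code points 0..max_cp, unless it is already
 * valid.  Returns 1 on success (including a cache hit), 0 if out of memory.
 */
int
class_cache_fill(class_cache *cache, class_probe probe, wint_t max_cp)
{
    wint_t *buf;
    size_t  n = 0;

    assert(cache != NULL && probe != NULL);

    if (cache->valid)
        return 1;               /* cache hit: no rescan needed */

    buf = malloc(((size_t) max_cp + 1) * sizeof(wint_t));
    if (buf == NULL)
        return 0;

    /* Explicit break avoids wraparound if max_cp is the largest wint_t. */
    for (wint_t c = 0;; c++)
    {
        if (probe(c))
            buf[n++] = c;
        if (c == max_cp)
            break;
    }

    cache->members = buf;
    cache->nmembers = n;
    cache->valid = 1;
    cache->scans++;
    return 1;
}

/* Release the cached members, e.g. when the locale changes. */
void
class_cache_free(class_cache *cache)
{
    assert(cache != NULL);
    free(cache->members);
    cache->members = NULL;
    cache->nmembers = 0;
    cache->valid = 0;
}
```

How much this would save in practice is exactly the open question above; the sketch only shows that repeated uses of the same class need not repeat the scan.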

Thoughts?
        regards, tom lane


Re: Regex code versus Unicode chars beyond codepoint 255

From
David Smith
Date:
On 2010-11-24 at 15:56, Tom Lane wrote:

> Bug #5766 points out that we're still not there yet in terms of having
> sane behavior for locale-specific regex operations in Unicode
> encoding. The reason it's not working is that regc_locale does this to
> expand the set of characters that are considered to match [[:alnum:]]
> : <SNIP>

It would appear that nobody answered this email.

I am currently implementing a library system that needs to search by
whole word. I am using \m...\M regexes, and the database is UTF8-encoded,
with text in Hebrew, Greek, Arabic and various European character
sets. I need a way to do whole-word searches on this data, which
means either fixing alnum for UTF8 so that it includes all character
sets, or manually generating a list of all the characters and
reimplementing word-start/word-end matching in regex myself. I would
prefer to avoid the latter if at all possible!
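
To illustrate why \m and \M depend on the alnum set, here is a rough sketch. In this regex flavor a word character is an alnum character or an underscore, and \m matches where a word character is not preceded by one. The two classifiers below are hypothetical stand-ins (`narrow_alnum` knows only ASCII for simplicity, `wide_alnum` additionally knows the Hebrew letters Alef..Tav), and `count_word_starts` is an illustrative helper, not part of any regex library.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Same shape as iswalnum(). */
typedef int (*class_probe) (wint_t);

/* Hypothetical classifier that only knows ASCII letters and digits. */
int
narrow_alnum(wint_t c)
{
    return (c >= 0x30 && c <= 0x39) ||
           (c >= 0x41 && c <= 0x5A) ||
           (c >= 0x61 && c <= 0x7A);
}

/* Hypothetical classifier that also knows Hebrew letters Alef..Tav. */
int
wide_alnum(wint_t c)
{
    return narrow_alnum(c) || (c >= 0x05D0 && c <= 0x05EA);
}

/* A word character is an alnum character or an underscore. */
static int
is_word_char(wint_t c, class_probe alnum)
{
    return c == L'_' || alnum(c);
}

/*
 * Count the positions where \m would match: a word character that is
 * not preceded by another word character.
 */
size_t
count_word_starts(const wchar_t *s, class_probe alnum)
{
    size_t count = 0;
    int    prev = 0;

    assert(s != NULL && alnum != NULL);

    for (size_t i = 0; s[i] != L'\0'; i++)
    {
        int cur = is_word_char((wint_t) s[i], alnum);

        if (cur && !prev)
            count++;
        prev = cur;
    }
    return count;
}
```

With a string holding one Hebrew word followed by an ASCII word, the narrow classifier finds a single word start and the wide one finds two, which is the behavior difference described above.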

What is the current status regarding a full character list for alnum for
utf8, and is there anything I can do to help get it working?

Thanks,

David