Re: Notes about fixing regexes and UTF-8 (yet again) - Mailing list pgsql-hackers

From	Vik Reykja
Subject	Re: Notes about fixing regexes and UTF-8 (yet again)
Date	February 18, 2012 23:39:19
Msg-id	CALDgxVtk41fkTcF+24b1DbytbwD=kO+K-HGbMyOwjT45TRRkiQ@mail.gmail.com
In response to	Re: Notes about fixing regexes and UTF-8 (yet again) (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: Notes about fixing regexes and UTF-8 (yet again)
List	pgsql-hackers

On Sun, Feb 19, 2012 at 04:33, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps. The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.
>
> In the meantime, I still think the caching logic is worth having, and
> we could at least make some people happy if we selected a cutoff point
> somewhere between U+FF and U+FFFF. I don't have any strong ideas about
> what a good compromise cutoff would be. One possibility is U+7FF, which
> corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> if that corresponds to any significant dropoff in frequency of usage.
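To make the cutoff idea concrete, here is a minimal Python sketch, not the PostgreSQL regex code. It assumes a cutoff of U+7FF, uses a precomputed set as a stand-in for the statically constructed maps, and uses Python's str.isalpha() as a stand-in for a runtime iswalpha() call. The names CUTOFF, ALPHA_TABLE, is_alpha, and utf8_len are hypothetical.

```python
# Assumed cutoff: the largest code point that fits in 2-byte UTF-8.
CUTOFF = 0x7FF

# Stand-in for the statically built map: every alphabetic code point
# at or below the cutoff, computed once up front.
ALPHA_TABLE = frozenset(cp for cp in range(CUTOFF + 1) if chr(cp).isalpha())


def is_alpha(cp: int) -> bool:
    """Classify a code point: table lookup below the cutoff, runtime call above."""
    if cp <= CUTOFF:
        return cp in ALPHA_TABLE
    # Runtime fallback, analogous to calling iswalpha() in C.
    return chr(cp).isalpha()


def utf8_len(cp: int) -> int:
    """Number of bytes needed to encode a (non-surrogate) code point in UTF-8."""
    return len(chr(cp).encode("utf-8"))


if __name__ == "__main__":
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
        print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")
```

With this split, the size of the precomputed table depends only on the cutoff, while anything above it pays the cost of a runtime call on each lookup.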

The problem, of course, is that this probably depends quite a bit on
what language you happen to be using. For some languages, it won't
matter whether you cut it off at U+FF or U+7FF; while for others even
U+FFFF might not be enough. So I think this is one of those cases
where it's somewhat meaningless to talk about frequency of usage.
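As a rough illustration of that language dependence, the following Python sketch reports the smallest of the three cutoffs discussed above that covers every character in a sample string. The sample words and the helper name smallest_cutoff are assumptions chosen for illustration only, and a handful of words says nothing about real-world frequency of usage.

```python
# The three candidate cutoffs mentioned in the thread.
CUTOFFS = (0xFF, 0x7FF, 0xFFFF)


def smallest_cutoff(text: str):
    """Return the smallest candidate cutoff covering all of text, or None."""
    if not text:
        return CUTOFFS[0]
    highest = max(ord(ch) for ch in text)
    for cutoff in CUTOFFS:
        if highest <= cutoff:
            return cutoff
    return None  # at least one character lies beyond U+FFFF


SAMPLES = {
    "French": "fran\u00e7ais",                           # c-cedilla is U+00E7
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",  # Cyrillic, U+04xx
    "Japanese": "\u65e5\u672c\u8a9e",                    # CJK ideographs
    "Gothic": "\U00010330",                               # outside the BMP
}

if __name__ == "__main__":
    for name, word in SAMPLES.items():
        cutoff = smallest_cutoff(word)
        label = "beyond U+FFFF" if cutoff is None else f"U+{cutoff:04X}"
        print(f"{name}: {label}")
```

Text in a Latin-1 language is covered by any of the cutoffs, Cyrillic needs at least U+7FF, CJK text needs U+FFFF, and scripts outside the Basic Multilingual Plane are not covered by any of them.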

Does it make sense for regexps to have collations?
