Bruno Wolff III <bruno@wolff.to> writes:
> However I am wondering about my use of [[:graph:]] to match characters
> that have glyphs. I was not expecting there to be characters that have
> glyphs to not be in the graph class. In the short term I might want to
> change the way I am testing that.
[ looks into code... ] The [[:foo:]] notations only work up to Unicode
code point U+7FF at the moment, per this comment in regc_pg_locale.c:
* Decide how many character codes we ought to look through. For C locale
* there's no need to go further than 127. Otherwise, if the encoding is
* UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
* extend it as far as we'd like (say, 0xFFFF, the end of the Basic
* Multilingual Plane) without creating significant performance issues due
* to too many characters being fed through the colormap code. This will
* need redesign to fix reasonably, but at least for the moment we have
* all common European languages covered. Otherwise (not C, not UTF8) go
* up to 255. These limits are interrelated with restrictions discussed
* at the head of this file.
Unfortunately, these particular characters are U+2013 (EN DASH) and
U+2014 (EM DASH), both well past the U+7FF cutoff, so you lose.
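To make the cutoff concrete, here is a rough sketch of the rule that comment describes. It is written in Python for illustration only and is not the actual C code in regc_pg_locale.c; the function and parameter names are made up for this example.

```python
# Illustrative sketch only: mirrors the limits stated in the quoted comment.
# max_class_code_point and covered_by_class_scan are hypothetical names,
# not functions that exist in PostgreSQL.

def max_class_code_point(is_c_locale: bool, encoding: str) -> int:
    """Highest character code examined when expanding a [[:foo:]] class."""
    if is_c_locale:
        return 127        # C locale: no need to go further than ASCII
    if encoding == "UTF8":
        return 0x7FF      # arbitrary cutoff chosen for performance reasons
    return 255            # any other (non-C, non-UTF8) case


def covered_by_class_scan(ch: str, is_c_locale: bool = False,
                          encoding: str = "UTF8") -> bool:
    """True if the character's code point falls inside the scanned range."""
    return ord(ch) <= max_class_code_point(is_c_locale, encoding)


EN_DASH = "\u2013"
EM_DASH = "\u2014"

for name, ch in (("e-acute", "\u00e9"), ("EN DASH", EN_DASH), ("EM DASH", EM_DASH)):
    print(f"{name} U+{ord(ch):04X} scanned: {covered_by_class_scan(ch)}")
```

Under these assumptions, an accented Latin letter such as U+00E9 is inside the scanned range, while both dashes are outside it, which is why [[:graph:]] does not match them.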
Obviously there's room for improvement here, but so far nobody's been
motivated to work on it. Last discussion about it (AFAIR) was this
thread:
https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us
I'm not sure if any of the subsequent work on the regex engine would
make it any easier to fix than it seemed at the time.
regards, tom lane