Re: endash not a graphic character? - Mailing list pgsql-general

From Tom Lane
Subject Re: endash not a graphic character?
Msg-id 25067.1471803856@sss.pgh.pa.us
In response to Re: endash not a graphic character?  (Bruno Wolff III <bruno@wolff.to>)
Responses Re: endash not a graphic character?
List pgsql-general
Bruno Wolff III <bruno@wolff.to> writes:
> However I am wondering about my use of [[:graph:]] to match characters
> that have glyphs. I was not expecting there to be characters that have
> glyphs to not be in the graph class. In the short term I might want to
> change the way I am testing that.

[ looks into code... ]  The [[:foo:]] notations only work up to Unicode
code point U+7FF at the moment, per this comment in regc_pg_locale.c:

     * Decide how many character codes we ought to look through.  For C locale
     * there's no need to go further than 127.  Otherwise, if the encoding is
     * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
     * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
     * Multilingual Plane) without creating significant performance issues due
     * to too many characters being fed through the colormap code.  This will
     * need redesign to fix reasonably, but at least for the moment we have
     * all common European languages covered.  Otherwise (not C, not UTF8) go
     * up to 255.  These limits are interrelated with restrictions discussed
     * at the head of this file.

Unfortunately, these particular characters are U+2013 and U+2014 so you
lose.
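
To make the cutoff concrete, here is a minimal Python sketch of the rule that comment describes. It is not the PostgreSQL code: the function names and the locale/encoding parameters are hypothetical, and only the three limits (127, 0x7FF, 255) come from the comment itself.

```python
# Illustrative sketch only: mirrors the limits stated in the quoted comment.
# Function and parameter names are hypothetical, not PostgreSQL identifiers.

def max_classified_codepoint(is_c_locale: bool, encoding: str) -> int:
    """Highest code point that character-class lookups would examine."""
    if is_c_locale:
        return 127       # C locale: no need to go further than 127
    if encoding.upper() in ("UTF8", "UTF-8"):
        return 0x7FF     # UTF8: the "pretty arbitrary" cutoff
    return 255           # not C, not UTF8

def class_lookup_covers(ch: str, is_c_locale: bool = False,
                        encoding: str = "UTF8") -> bool:
    """True if ch falls inside the range that [[:foo:]] classes consider."""
    return ord(ch) <= max_classified_codepoint(is_c_locale, encoding)

EN_DASH = "\u2013"
EM_DASH = "\u2014"

for name, ch in (("e-acute", "\u00e9"), ("en dash", EN_DASH), ("em dash", EM_DASH)):
    print(f"{name} U+{ord(ch):04X} covered: {class_lookup_covers(ch)}")
```

Under these assumptions, U+00E9 sits below the limit, while U+2013 and U+2014 are well above it and so never get assigned to any class.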

Obviously there's room for improvement here, but so far nobody's been
motivated to work on it.  Last discussion about it (AFAIR) was this
thread:

https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us

I'm not sure if any of the subsequent work on the regex engine would
make it any easier to fix than it seemed at the time.

            regards, tom lane

