Thread: endash not a graphic character?

endash not a graphic character?

From: Bruno Wolff III
I was surprised to find that endash and emdash are not graphic characters
in en_US. I'm not sure whether this is correct behavior, a bug in Postgres,
or a bug in my OS's collation definitions.

For example:

Dash:
area=> select '-' ~ '[[:graph:]]' collate "en_US";
 ?column?
----------
 t
(1 row)

Endash:
area=> select '–' ~ '[[:graph:]]' collate "en_US";
 ?column?
----------
 f
(1 row)


Emdash:
area=> select '—' ~ '[[:graph:]]' collate "en_US";
 ?column?
----------
 f
(1 row)


Re: endash not a graphic character?

From: rob stone
Hello Bruno,
On Sat, 2016-08-20 at 14:04 -0500, Bruno Wolff III wrote:
> I was surprised to find that endash and emdash are not graphic characters
> in en_US. I'm not sure whether this is correct behavior, a bug in Postgres,
> or a bug in my OS's collation definitions.
>
> For example:
>
> Dash:
> area=> select '-' ~ '[[:graph:]]' collate "en_US";
>  ?column? 
> ----------
>  t
> (1 row)
>
> Endash:
> area=> select '–' ~ '[[:graph:]]' collate "en_US";
>  ?column? 
> ----------
>  f
> (1 row)
>
>
> Emdash:
> area=> select '—' ~ '[[:graph:]]' collate "en_US";
>  ?column? 
> ----------
>  f
> (1 row)
>
>



You can't use — (emdash) or – (endash)?
Or their hex equivalents. See the Unicode chart.

HTH,
rob


Re: endash not a graphic character?

From: Bruno Wolff III
On Sun, Aug 21, 2016 at 08:12:23 +1000,
  rob stone <floriparob@gmail.com> wrote:
>
>You can't use — (emdash) or – (endash)?
>Or their hex equivalents. See the Unicode chart.

I am not the source of the data, but I can special case them one way
or the other.

However I am wondering about my use of [[:graph:]] to match characters
that have glyphs. I was not expecting there to be characters that have
glyphs to not be in the graph class. In the short term I might want to
change the way I am testing that.

I should also try the equivalent test in perl to see if it is more likely
tied to the unicode implementation on my system or if it appears to be
Postgres specific.


Re: endash not a graphic character?

From: Bruno Wolff III
On Sun, Aug 21, 2016 at 08:12:23 +1000,
  rob stone <floriparob@gmail.com> wrote:
>
>You can't use — (emdash) or – (endash)?
>Or their hex equivalents. See the Unicode chart.

By the way, those aren't the correct codes. That only works if your
code treats iso-8859-1 code points as windows 1252 code points. That
may happen to work in many cases, but isn't a good thing to bet on.
(The first 256 Unicode code points match iso-8859-1, not windows 1252.)
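
A small Python sketch of the code page difference described above. It
assumes the codes under discussion were 150 and 151; the archived message
shows only the rendered characters, so that is a guess. The helper name
decode_both is made up for this illustration.

```python
import unicodedata


def decode_both(code):
    """Decode one byte value as windows-1252 and as iso-8859-1."""
    byte = bytes([code])
    return byte.decode("cp1252"), byte.decode("latin-1")


# Assumption: 150 and 151 are the numeric codes that were meant.
for code in (150, 151):
    as_cp1252, as_latin1 = decode_both(code)
    print(
        code,
        f"cp1252 -> U+{ord(as_cp1252):04X} ({unicodedata.name(as_cp1252)})",
        f"latin-1 -> U+{ord(as_latin1):04X} "
        f"(category {unicodedata.category(as_latin1)})",
    )
```

Under windows 1252 the two bytes are the dashes; under iso-8859-1 (and as
Unicode code points) they are C1 control characters, which is why relying
on them is fragile.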


Re: endash not a graphic character?

From: Bruno Wolff III
On Sun, Aug 21, 2016 at 12:30:21 -0500,
  Bruno Wolff III <bruno@wolff.to> wrote:
>
>I should also try the equivalent test in perl to see if it is more
>likely tied to the unicode implementation on my system or if it
>appears to be Postgres specific.

It looks like my locale may not be set the way I expect. I tried
testing in perl and initially I got results consistent with Postgres,
but when I added code to make sure perl was working in utf-8 mode I
started getting the expected results.

I would have expected that manually adding a collation to the queries
would work even if the default was not what I expected. So pointers
to what I am missing would still be appreciated.
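
A rough Python analogue of the Perl observation, offered as one plausible
reading rather than a diagnosis: if the UTF-8 bytes of an endash are
handled as three single-byte characters instead of one decoded character,
a per-character class test sees something quite different. The names
as_single_bytes, as_utf8 and categories are made up for this sketch.

```python
import unicodedata

raw = "\u2013".encode("utf-8")          # the three UTF-8 bytes of an endash

as_single_bytes = raw.decode("latin-1")  # each byte treated as one character
as_utf8 = raw.decode("utf-8")            # bytes decoded as a single character


def categories(text):
    """Return the Unicode general category of every character in text."""
    return [unicodedata.category(ch) for ch in text]


print(len(as_single_bytes), categories(as_single_bytes))
print(len(as_utf8), categories(as_utf8))
```

In the byte-wise view two of the three characters are control characters,
so a test that every character is graphic fails; in the decoded view the
single character is dash punctuation.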


Re: endash not a graphic character?

From: Tom Lane
Bruno Wolff III <bruno@wolff.to> writes:
> However I am wondering about my use of [[:graph:]] to match characters
> that have glyphs. I was not expecting there to be characters that have
> glyphs to not be in the graph class. In the short term I might want to
> change the way I am testing that.

[ looks into code... ]  The [[:foo:]] notations only work up to Unicode
code point U+7FF at the moment, per this comment in regc_pg_locale.c:

     * Decide how many character codes we ought to look through.  For C locale
     * there's no need to go further than 127.  Otherwise, if the encoding is
     * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
     * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
     * Multilingual Plane) without creating significant performance issues due
     * to too many characters being fed through the colormap code.  This will
     * need redesign to fix reasonably, but at least for the moment we have
     * all common European languages covered.  Otherwise (not C, not UTF8) go
     * up to 255.  These limits are interrelated with restrictions discussed
     * at the head of this file.

Unfortunately, these particular characters are U+2013 and U+2014 so you
lose.
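
To make the cutoff concrete, here is a minimal Python sketch that compares
the three characters from the original report against the 0x7FF limit
mentioned in the quoted comment. The constant and function names are made
up for this illustration; this is not how the regex code itself is
written.

```python
CLASS_CUTOFF = 0x7FF  # limit quoted from the comment in regc_pg_locale.c


def within_class_cutoff(ch):
    """True if the character's code point is at or below the cutoff."""
    return ord(ch) <= CLASS_CUTOFF


for name, ch in [("dash", "-"), ("endash", "\u2013"), ("emdash", "\u2014")]:
    print(f"{name}: U+{ord(ch):04X} within cutoff: {within_class_cutoff(ch)}")
```

Only the plain dash falls inside the range that the [[:foo:]] classes
cover for UTF8, which matches the t/f/f results in the first message.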

Obviously there's room for improvement here, but so far nobody's been
motivated to work on it.  Last discussion about it (AFAIR) was this
thread:

https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us

I'm not sure if any of the subsequent work on the regex engine would
make it any easier to fix than it seemed at the time.

            regards, tom lane


Re: endash not a graphic character?

From: Bruno Wolff III
On Sun, Aug 21, 2016 at 14:24:16 -0400,
  Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>Unfortunately, these particular characters are U+2013 and U+2014 so you
>lose.

Thanks for saving me some time, as it would have taken me quite a while
to figure that out.

I'll adjust the constraint so that good strings aren't rejected, which
was my immediate problem. I'm not that worried about bad strings getting
added, since the data also gets checked before trying to add it to
the database.
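
For the check done before the data reaches the database, something along
these lines might serve as a stand-in for [[:graph:]]. It is only an
approximation based on Unicode general categories (anything that is not
in an "Other" or "Separator" category counts as having a glyph), not the
POSIX or Postgres definition, and has_glyph and all_graphic are
hypothetical names.

```python
import unicodedata


def has_glyph(ch):
    """Approximate test: not a control/unassigned (C*) or separator (Z*)."""
    return unicodedata.category(ch)[0] not in ("C", "Z")


def all_graphic(text):
    """True if every character in text passes the approximate glyph test."""
    return all(has_glyph(ch) for ch in text)


for sample in ["a-b", "a\u2013b", "a\u2014b", "a b", "a\tb"]:
    print(repr(sample), all_graphic(sample))
```

Like the graph class, this rejects spaces and control characters, but it
accepts endash and emdash regardless of how high their code points are.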

>Obviously there's room for improvement here, but so far nobody's been
>motivated to work on it.  Last discussion about it (AFAIR) was this
>thread:

One thing I would suggest is documenting this limitation under:
https://www.postgresql.org/docs/9.6/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP

I might have missed it, but I did try reading that section to see if I was
doing something wrong before asking on the list. In particular I would
expect this limitation to be noted under:
9.7.3.6. Limits and Compatibility