Re: Unicode grapheme clusters - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: Unicode grapheme clusters
Date
Msg-id CAFj8pRDow1__QV9YbLp5DSCdeGR87hhAPphBE1NX7TWkRhFZ-Q@mail.gmail.com
In response to Unicode grapheme clusters  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Unicode grapheme clusters
List pgsql-hackers


On Thu, Jan 19, 2023 at 1:20, Bruce Momjian <bruce@momjian.us> wrote:
Just my luck, I had to dig into a two-"character" emoji that came to me
as part of a Google Calendar entry --- here it is:

        👩🏼‍⚕️🩺

                              libc
        Unicode     UTF8      len
        U+1F469  f0 9f 91 a9   2   woman
        U+1F3FC  f0 9f 8f bc   2   emoji modifier fitzpatrick type-3 (skin tone)
        U+200D   e2 80 8d      0   zero width joiner (ZWJ)
        U+2695   e2 9a 95      1   staff with snake
        U+FE0F   ef b8 8f      0   variation selector-16 (VS16) (previous character as emoji)
        U+1FA7A  f0 9f a9 ba   2   stethoscope

Now, in Debian 11 character apps like vi, I see:

  a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)

Display widths are in parentheses.  I also see '<200d>' in blue.

In current Firefox, I see a woman with a stethoscope around her neck,
and then a stethoscope.  Copying the Unicode string above into a browser
URL bar should show you the same thing, though it might be too small to
see.

For those looking for details on how these should be handled, see this
for an explanation of grapheme clusters that use things like skin tone
modifiers and zero-width joiners:

        https://tonsky.me/blog/emoji/

These comments explain the confusion of the term character:

        https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

and I think this comment summarizes it well:

        https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237

        This is by design. wcwidth() is utterly broken. Any terminal or terminal
        application that uses it is also utterly broken. Forget about emoji
        wcwidth() doesn't even work with combining characters, zero width
        joiners, flags, and a whole bunch of other things.

I decided to see how Postgres, without ICU, handles it:

        show lc_ctype;
          lc_ctype
        -------------
         en_US.UTF-8

        select octet_length('👩🏼‍⚕️🩺');
         octet_length
        --------------
                   21

        select character_length('👩🏼‍⚕️🩺');
         character_length
        ------------------
                        6

The octet_length() is verified as correct by counting the UTF8 bytes
above.  I think character_length() is correct if we consider the number
of Unicode characters, display and non-display.
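
The byte and code point counts above can be reproduced outside the
server.  What follows is only a minimal Python sketch, assuming Python
3.8 or later (for bytes.hex() with a separator).  The string is spelled
with escapes so each code point is visible, and the variable names are
illustrative, not anything from Postgres:

```python
# The six code points from the table above, written as escapes.
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

# One row per code point: the code point and its UTF-8 bytes.
rows = [f"U+{ord(ch):04X}  {ch.encode('utf-8').hex(' ')}" for ch in s]
for row in rows:
    print(row)

octets = len(s.encode("utf-8"))  # corresponds to octet_length()
code_points = len(s)             # corresponds to character_length() under UTF8

print(octets, code_points)
```

The per-code-point byte sequences should match the UTF8 column of the
table, and the two totals should match the SQL results shown above.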

I then started looking at how Postgres computes and uses _display_
width.  The display width, when processed properly, as Firefox does, is 4
(two double-wide displayed characters).  Based on the libc display
lengths above and the incorrectly displayed character widths in Debian 11,
it would be 7.
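
The two numbers can be derived side by side.  This is a rough Python
sketch, not the Postgres or libc implementation: the width table is
copied by hand from the "libc len" column above, the clustering rule is
a deliberately simplified stand-in for real grapheme segmentation (it
only knows about ZWJ, VS16, and the skin tone modifiers), and the
"every emoji cluster is double-wide" rule is an assumption that happens
to fit this string:

```python
# Widths copied from the "libc len" column of the table (not queried from libc).
LIBC_WIDTH = {0x1F469: 2, 0x1F3FC: 2, 0x200D: 0, 0x2695: 1, 0xFE0F: 0, 0x1FA7A: 2}

ZWJ = 0x200D   # zero width joiner
VS16 = 0xFE0F  # variation selector-16


def is_skin_tone(cp):
    """True for the Fitzpatrick emoji modifiers U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= cp <= 0x1F3FF


def naive_width(text):
    """Sum per-code-point widths, the way a wcwidth()-style loop would."""
    return sum(LIBC_WIDTH[ord(ch)] for ch in text)


def emoji_clusters(text):
    """Simplified clustering: modifiers, VS16 and ZWJ attach to the previous
    cluster, and the code point right after a ZWJ is joined as well."""
    clusters = []
    join_next = False
    for ch in text:
        cp = ord(ch)
        extends = cp in (ZWJ, VS16) or is_skin_tone(cp)
        if clusters and (extends or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = cp == ZWJ
    return clusters


def cluster_width(text):
    """Assumption: each emoji cluster is rendered double-wide."""
    return 2 * len(emoji_clusters(text))


s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"
print(naive_width(s), len(emoji_clusters(s)), cluster_width(s))
```

The naive sum is 2+2+0+1+0+2 = 7, while the simplified clustering finds
two clusters and so a width of 4.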

libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
the per-encoding width function stored in pg_wchar_table.dsplen --- for
UTF8, the function is pg_utf_dsplen().

There is no SQL API for display length, but PQdsplen() can be exercised
on a whole string by calling pg_wcswidth() from the gdb debugger:

        pg_wcswidth(const char *pwcs, size_t len, int encoding)
        UTF8 encoding == 6

        (gdb) print (int)pg_wcswidth("abcd", 4, 6)
        $8 = 4
        (gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
        $9 = 7

Here is the psql output:

        SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
         octet_length | ?column? | character_length
        --------------+----------+------------------
                   21 | 👩🏼‍⚕️🩺  |                6

More often called from psql are pg_wcssize() and pg_wcsformat(), which
also call PQdsplen().

I think the question is whether we want to report a string width that
assumes the display doesn't understand the more complex UTF8
controls/"characters" listed above.
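
To make the trade-off concrete, here is a small Python sketch of what
happens to column alignment when the computed width (7) is used for
padding but the display renders the string in 4 columns.  pad_cell() is
a hypothetical helper written for this illustration, not psql's actual
formatting code, and the two width constants are simply the numbers
quoted earlier in this message:

```python
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

COMPUTED_WIDTH = 7  # pg_wcswidth() result from the gdb session
RENDERED_WIDTH = 4  # width on a display that understands the clusters


def pad_cell(text, computed_width, column_width):
    """Right-pad text with spaces so its computed width fills the column."""
    return text + " " * max(column_width - computed_width, 0)


column_width = len("?column?")        # 8, the header is plain ASCII
cell = pad_cell(s, COMPUTED_WIDTH, column_width)

padding = len(cell) - len(s)          # spaces actually appended
on_screen = RENDERED_WIDTH + padding  # columns the cell occupies when rendered
shortfall = column_width - on_screen  # how far the column border shifts left

print(padding, on_screen, shortfall)
```

Under these assumptions only one padding space is added, so on a
cluster-aware display the cell occupies 5 columns instead of 8 and the
border after it lands 3 columns too far to the left.  On a display that
really does render 7 columns, the same padding lines up.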

tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
pg_wchar_table.dsplen.  p_isspecial() also has a small table of what it
calls "strange_letter".

Here is a report about Unicode variation selector and combining
characters from May, 2022:

    https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp

Is this something people want improved?

Surely it should be fixed. Unfortunately, none of the terminals I can use support it, so at this moment it may be premature to fix it, because the visual form will still be broken.

Regards

Pavel


--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

