Re: Unicode grapheme clusters - Mailing list pgsql-hackers

From	Pavel Stehule
Subject	Re: Unicode grapheme clusters
Date	January 19, 2023 13:44:57
Msg-id	CAFj8pRDow1__QV9YbLp5DSCdeGR87hhAPphBE1NX7TWkRhFZ-Q@mail.gmail.com
In response to	Unicode grapheme clusters (Bruce Momjian <bruce@momjian.us>)
Responses	Re: Unicode grapheme clusters
List	pgsql-hackers

On Thu, Jan 19, 2023 at 1:20, Bruce Momjian <bruce@momjian.us> wrote:

Just my luck, I had to dig into a two-"character" emoji that came to me
as part of a Google Calendar entry --- here it is:

👩🏼‍⚕️🩺

Unicode   UTF8          libc len   description
U+1F469   f0 9f 91 a9   2          woman
U+1F3FC   f0 9f 8f bc   2          emoji modifier fitzpatrick type-3 (skin tone)
U+200D    e2 80 8d      0          zero width joiner (ZWJ)
U+2695    e2 9a 95      1          staff with snake
U+FE0F    ef b8 8f      0          variation selector-16 (VS16) (previous character as emoji)
U+1FA7A   f0 9f a9 ba   2          stethoscope
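
For anyone who wants to reproduce the breakdown above without a hex dump, here is a minimal Python sketch. The helper name breakdown() is made up for illustration; the character names come from Python's bundled Unicode database, so their wording differs from the informal descriptions in the table.

```python
import unicodedata

# The string from the message, written with escapes so it survives copy/paste.
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

def breakdown(text):
    """Return (code point, UTF-8 bytes in hex, Unicode name) per code point."""
    rows = []
    for ch in text:
        utf8 = ch.encode("utf-8")
        # Fall back to a placeholder if this Python's Unicode data lacks a name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", utf8.hex(" "), name))
    return rows

for row in breakdown(s):
    print(*row, sep="  ")

# Total UTF-8 bytes and total code points in the string.
print("octets:", len(s.encode("utf-8")))
print("code points:", len(s))
```

The two totals are the byte count and the code point count of the string; neither says anything about display width.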

Now, in Debian 11 character apps like vi, I see:

a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)

Display widths are in parentheses. I also see '<200d>' in blue.

In current Firefox, I see a woman with a stethoscope around her neck,
and then a stethoscope. Copying the Unicode string above into a browser
URL bar should show you the same thing, though it might be too small to
see.

For those looking for details on how these should be handled, see this
for an explanation of grapheme clusters that use things like skin tone
modifiers and zero-width joiners:

https://tonsky.me/blog/emoji/

These comments explain the confusion of the term character:

https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

and I think this comment summarizes it well:

https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237

This is by design. wcwidth() is utterly broken. Any terminal or terminal
application that uses it is also utterly broken. Forget about emoji
wcwidth() doesn't even work with combining characters, zero width
joiners, flags, and a whole bunch of other things.

I decided to see how Postgres, without ICU, handles it:

show lc_ctype;
lc_ctype
-------------
en_US.UTF-8

select octet_length('👩🏼‍⚕️🩺');
octet_length
--------------
21

select character_length('👩🏼‍⚕️🩺');
character_length
------------------
6

The octet_length() is verified as correct by counting the UTF8 bytes
above. I think character_length() is correct if we consider the number
of Unicode characters, display and non-display.

I then started looking at how Postgres computes and uses _display_
width. The display width, when properly processed as Firefox does, is 4
(two double-wide displayed characters). Based on the libc display
lengths above and the incorrect displayed character lengths in Debian 11,
it would be 7.
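
To make the 7-versus-4 difference concrete, here is a rough Python sketch. The widths are copied from the libc column in the table above. The clusters() helper is a deliberately crude stand-in written for this one string, not the Unicode grapheme cluster algorithm (UAX #29), and taking a cluster's width from its first code point is an assumption, not how any particular terminal behaves.

```python
# Per-code-point widths as reported by libc in the table above.
LIBC_WIDTH = {
    0x1F469: 2,  # woman
    0x1F3FC: 2,  # skin tone modifier
    0x200D: 0,   # zero width joiner
    0x2695: 1,   # staff with snake
    0xFE0F: 0,   # variation selector-16
    0x1FA7A: 2,  # stethoscope
}

ZWJ = 0x200D
VS16 = 0xFE0F
SKIN_TONES = range(0x1F3FB, 0x1F400)  # Fitzpatrick modifiers U+1F3FB..U+1F3FF

s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

def naive_width(text):
    """Sum the width of every code point independently."""
    return sum(LIBC_WIDTH[ord(ch)] for ch in text)

def clusters(text):
    """Crude grouping (assumption, not UAX #29): ZWJ, VS16 and skin tone
    modifiers attach to the previous code point, and whatever follows a
    ZWJ is glued onto the same cluster."""
    out = []
    join_next = False
    for ch in text:
        cp = ord(ch)
        attaches = cp in (ZWJ, VS16) or cp in SKIN_TONES
        if out and (attaches or join_next):
            out[-1] += ch
        else:
            out.append(ch)
        join_next = (cp == ZWJ)
    return out

def clustered_width(text):
    """Assume each cluster is drawn with the width of its first code point."""
    return sum(LIBC_WIDTH[ord(c[0])] for c in clusters(text))

print(naive_width(s))       # 7
print(len(clusters(s)))     # 2
print(clustered_width(s))   # 4
```

The per-code-point sum reproduces the 7, and the grouped version reproduces the 4 that Firefox effectively shows.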

libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
the per-encoding width function stored in pg_wchar_table.dsplen --- for
UTF8, the function is pg_utf_dsplen().

There is no SQL API for display length, but PQdsplen() can be exercised
on a whole string by calling pg_wcswidth() from the gdb debugger:

pg_wcswidth(const char *pwcs, size_t len, int encoding)
UTF8 encoding == 6

(gdb) print (int)pg_wcswidth("abcd", 4, 6)
$8 = 4
(gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
$9 = 7

Here is the psql output:

SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
octet_length | ?column? | character_length
--------------+----------+------------------
21 | 👩🏼‍⚕️🩺 | 6

More often called from psql are pg_wcssize() and pg_wcsformat(), which
also call PQdsplen().

I think the question is whether we want to report a string width that
assumes the display doesn't understand the more complex UTF8
controls/"characters" listed above.

tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
pg_wchar_table.dsplen. p_isspecial() also has a small table of what it
calls "strange_letter".

Here is a report about Unicode variation selector and combining
characters from May, 2022:

https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp

Is this something people want improved?

Surely it should be fixed. Unfortunately, none of the terminals that I can use support it. So at this moment it may be premature to fix it, because the visual form will still be broken.

Regards

Pavel

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.
