Thread: collate not support Unicode Variation Selector

collate not support Unicode Variation Selector

From
荒井元成
Date:
Hi, I hope supported Unicode Variation Selector on collate.

D209007=# create table ivstesticu (
D209007(# moji text
D209007(# );
D209007=# create table ivstest (
D209007(# moji text  collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY D209007(# );
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D');
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D');
D209007=# select moji from ivstest where moji like '%' || U&'\+00304B' || '%';
-------------
㐶󠄁き
㐶き
(2 行)

expected
-------------
㐶き
(1 行)


Best regards,





RE: collate not support Unicode Variation Selector

From
荒井元成
Date:
Hi, I hope supported Unicode Variation Selector on collate.

I will resend it because there was a typo.

D209007=# create table ivstest ( moji text  collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY );
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D');
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D');
D209007=# select moji from ivstest where moji like '%' || U&'\+003436' || '%';
-------------
㐶󠄁き
㐶き
(2 行)

expected
-------------
㐶き
(1 行)


Best regards,







Re: collate not support Unicode Variation Selector

From
Thomas Munro
Date:
On Wed, Aug 3, 2022 at 12:09 PM 荒井元成 <n2029@ndensan.co.jp> wrote:
> D209007=# create table ivstest ( moji text  collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY );
> D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D');
> D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D');
> D209007=# select moji from ivstest where moji like '%' || U&'\+003436' || '%';
> -------------
> 㐶󠄁き
> 㐶き
> (2 行)
>
> expected
> -------------
> 㐶き
> (1 行)

So you want to match only strings that contain U&'\+003436' *not*
followed by a variation selector (as we also discussed at [1]).  I'm
pretty sure that everything in PostgreSQL considers variation
selectors to be separate characters.  Perhaps it is possible to write
a regular expression covering the variation selector ranges, something
like '\U00003436[^\U000E0100-\U000E010EF]'?

Here's an example using Latin characters that are easier for me, but
show approximately the same thing, since variation selectors are a bit
like "combining" characters:

postgres=# create table t (x text);
CREATE TABLE
postgres=# insert into t values ('e'), ('ef'), ('e' || U&'\0301');
INSERT 0 3
postgres=# select * from t;
 x
----
 e
 ef
 é
(3 rows)

postgres=# select * from t where x ~ 'e([^\u0300-\u036f]|$)';
 x
----
 e
 ef
(2 rows)

[1] https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp



Re: collate not support Unicode Variation Selector

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> So you want to match only strings that contain U&'\+003436' *not*
> followed by a variation selector (as we also discussed at [1]).  I'm
> pretty sure that everything in PostgreSQL considers variation
> selectors to be separate characters.

There might be something that doesn't, but LIKE certainly isn't it.
I don't believe plain LIKE is collation-aware at all, it just sees
characters to match or not match.  ILIKE is a little collation-aware,
but it's still not going to consider a combining sequence as one
character.  The same for the regex operators.

Maybe it would help if you run the strings through normalize() first?
I'm not sure if that can combine combining characters.

            regards, tom lane



Re: collate not support Unicode Variation Selector

From
Thomas Munro
Date:
On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe it would help if you run the strings through normalize() first?
> I'm not sure if that can combine combining characters.

I think the similarity between Latin combining characters and these
ideographic variations might end there.  I don't think there is a
single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.
This system is for controlling small differences in rendering for the
"same" character[1].  My computer doesn't even show the OP's example
glyphs as different (to my eyes, at least; I can see on a random
picture I found[2] that the one with the e0101 selector is supposed to
have a ... what do you call that ... a tiny gap :-)).

[1] http://www.unicode.org/reports/tr37/tr37-14.html
[2] https://glyphwiki.org/wiki/u3436



Re: collate not support Unicode Variation Selector

From
Kyotaro Horiguchi
Date:
At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Maybe it would help if you run the strings through normalize() first?
> > I'm not sure if that can combine combining characters.
> 
> I think the similarity between Latin combining characters and these
> ideographic variations might end there.  I don't think there is a
> single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.

Right. At least in Japanese texts, the two "character"s are the same
glyph.  In that sense the loss of variation selectors from a text
doesn't alter its meaning and doesn't hurt correctness at all.
Ideographic variation is useful in special cases where their
ideographic identity is crucial.

> This system is for controlling small differences in rendering for the
> "same" character[1].  My computer doesn't even show the OP's example
> glyphs as different (to my eyes, at least; I can see on a random
> picture I found[2] that the one with the e0101 selector is supposed to
> have a ... what do you call that ... a tiny gap :-)).

They need variation-aware fonts and application support to render.  So
when even *I* see the two characters on Excel (which I believe doesn't
have that support by default), they would look exactly same.  In that
sense, my opinion on the behavior is that all ideographic variations
rather should be treated as the same character in searching in general
context. In other words, text matching should just drop variation
selectors as the default behavior.

ICU:Collator [1] has the notion of "collation strength" and I saw in
an article that only Colator::IDENTICAL among five alternatives makes
distinction between ideographic variations of a glyph.

> [1] http://www.unicode.org/reports/tr37/tr37-14.html
> [2] https://glyphwiki.org/wiki/u3436

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

RE: collate not support Unicode Variation Selector

From
荒井元成
Date:
Thank you for your reply.

About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications.
It should be able to handle all characters.

regards.


-----Original Message-----
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Wednesday, August 3, 2022 3:26 PM
To: thomas.munro@gmail.com
Cc: tgl@sss.pgh.pa.us; n2029@ndensan.co.jp; pgsql-hackers@lists.postgresql.org
Subject: Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Maybe it would help if you run the strings through normalize() first?
> > I'm not sure if that can combine combining characters.
>
> I think the similarity between Latin combining characters and these
> ideographic variations might end there.  I don't think there is a
> single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.

Right. At least in Japanese texts, the two "character"s are the same glyph.  In that sense the loss of variation
selectorsfrom a text doesn't alter its meaning and doesn't hurt correctness at all. 
Ideographic variation is useful in special cases where their ideographic identity is crucial.

> This system is for controlling small differences in rendering for the
> "same" character[1].  My computer doesn't even show the OP's example
> glyphs as different (to my eyes, at least; I can see on a random
> picture I found[2] that the one with the e0101 selector is supposed to
> have a ... what do you call that ... a tiny gap :-)).

They need variation-aware fonts and application support to render.  So when even *I* see the two characters on Excel
(whichI believe doesn't have that support by default), they would look exactly same.  In that sense, my opinion on the
behavioris that all ideographic variations rather should be treated as the same character in searching in general
context.In other words, text matching should just drop variation selectors as the default behavior. 

ICU:Collator [1] has the notion of "collation strength" and I saw in an article that only Colator::IDENTICAL among five
alternativesmakes distinction between ideographic variations of a glyph. 

> [1] http://www.unicode.org/reports/tr37/tr37-14.html
> [2] https://glyphwiki.org/wiki/u3436

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center




Re: collate not support Unicode Variation Selector

From
Kyotaro Horiguchi
Date:
At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in 
> Thank you for your reply.
> 
> About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications. 
> It should be able to handle all characters.

Yeah, it is one of that fonts. But I didn't know that MS-Word can
*display* ideographic variations. But it is dissapoinging that input
requires to copy-paste from the Web..  Maybe that characters can be
input smoothly by using ATOK or alikes..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: collate not support Unicode Variation Selector

From
荒井元成
Date:
Thank you for your reply.

SQLServer supports Unicode Variation Selector, so I would like PostgreSQL to
support them as well.

Regards.

--
Motonari Arai

-----Original Message-----
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Thursday, August 4, 2022 5:24 PM
To: n2029@ndensan.co.jp
Cc: thomas.munro@gmail.com; tgl@sss.pgh.pa.us;
pgsql-hackers@lists.postgresql.org
Subject: Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in
> Thank you for your reply.
>
> About 60,000 characters are registered in the IPAmj Mincho font designated
by the national specifications.
> It should be able to handle all characters.

Yeah, it is one of that fonts. But I didn't know that MS-Word can
*display* ideographic variations. But it is dissapoinging that input
requires to copy-paste from the Web..  Maybe that characters can be input
smoothly by using ATOK or alikes..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center





Re: collate not support Unicode Variation Selector

From
Kyotaro Horiguchi
Date:
At Thu, 4 Aug 2022 19:01:33 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in 
> Thank you for your reply.
> 
> SQLServer supports Unicode Variation Selector, so I would like PostgreSQL to
> support them as well.

I studied the code a bit further, then found that simple comparison
can ignore selectors by using nondeterministic collation.

CREATE COLLATION col1 (provider=icu, locale='ja', deterministic=false);
SELECT (U&'\+003436' || U&'\+0E0101' || U&'\+00304D' collate col1) = U&'\+003436' || U&'\+00304D';
 ?column? 
----------
 t

However LIKE dislikes this.

> ERROR:  nondeterministic collations are not supported for LIKE

Deterministic collations assume text equality means bytewise
equality. So, the "problem" behavior is correct in a sense.  In that
sense, those functions that do not support nondeterministic collations
can be implemented without considering ICU, which leads to the
"problem" behavior.  ICU has regular expression function so LIKE might
be ableto be implemented using this.  If it is done, and if a
non-deterministic IVS-sensitive collation is available (I didin't find
how to get one..), LIKE would work as you expect.

But..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center