RE: collate not support Unicode Variation Selector - Mailing list pgsql-hackers

From	荒井元成
Subject	RE: collate not support Unicode Variation Selector
Date	August 3, 2022 11:12:53
Msg-id	00a501d8a729$fa021d90$ee0658b0$@ndensan.co.jp
In response to	Re: collate not support Unicode Variation Selector (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses	Re: collate not support Unicode Variation Selector
List	pgsql-hackers

Thank you for your reply.

About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications.
It should be able to handle all characters.

regards.

-----Original Message-----
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Wednesday, August 3, 2022 3:26 PM
To: thomas.munro@gmail.com
Cc: tgl@sss.pgh.pa.us; n2029@ndensan.co.jp; pgsql-hackers@lists.postgresql.org
Subject: Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Maybe it would help if you run the strings through normalize() first?
> > I'm not sure if that can combine combining characters.
>
> I think the similarity between Latin combining characters and these
> ideographic variations might end there.  I don't think there is a
> single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.

Right. At least in Japanese texts, the two "characters" are the same glyph.  In that sense, losing variation
selectors from a text doesn't alter its meaning and doesn't hurt correctness at all.
Ideographic variation is useful in special cases where the ideographic identity is crucial.
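
A minimal sketch of the normalization point above, assuming Python's standard unicodedata module rather than PostgreSQL's normalize(): NFC composes "e" plus a combining acute accent into a single code point, but it should leave U+3436 followed by U+E0101 as two code points, because no precomposed form exists. The variable names are illustrative only.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent (two code points).
decomposed_e = "e\u0301"
# The example from the thread: U+3436 followed by variation selector U+E0101.
ideograph_with_vs = "\u3436\U000E0101"

# NFC composes the Latin sequence into the single code point U+00E9.
composed_e = unicodedata.normalize("NFC", decomposed_e)
# NFC has nothing to compose here, so the sequence is expected to stay as is.
normalized_ivs = unicodedata.normalize("NFC", ideograph_with_vs)

print(len(decomposed_e), len(composed_e))
print(normalized_ivs == ideograph_with_vs)
```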

> This system is for controlling small differences in rendering for the
> "same" character[1].  My computer doesn't even show the OP's example
> glyphs as different (to my eyes, at least; I can see on a random
> picture I found[2] that the one with the e0101 selector is supposed to
> have a ... what do you call that ... a tiny gap :-)).

They need variation-aware fonts and application support to render.  So even when *I* view the two characters in Excel
(which I believe doesn't have that support by default), they look exactly the same.  In that sense, my opinion on the
behavior is that all ideographic variations should rather be treated as the same character when searching in a general
context.  In other words, text matching should just drop variation selectors as the default behavior.
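
One way to picture "drop variation selectors before matching" is the Python sketch below. It is only an illustration, not a proposal for how PostgreSQL should implement it. The helper names are hypothetical, and the code assumes that the relevant selectors are U+FE00..U+FE0F and U+E0100..U+E01EF (the latter range is the one used for ideographic variation sequences).

```python
import re

# Assumed selector ranges: U+FE00..U+FE0F (VS1-VS16) and
# U+E0100..U+E01EF (VS17-VS256, used by ideographic variation sequences).
_VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")


def strip_variation_selectors(text: str) -> str:
    """Return text with every variation selector in the assumed ranges removed."""
    return _VARIATION_SELECTORS.sub("", text)


def matches_ignoring_variations(a: str, b: str) -> bool:
    """Compare two strings after dropping variation selectors from both."""
    return strip_variation_selectors(a) == strip_variation_selectors(b)


# The thread's example: U+3436 with and without the U+E0101 selector.
print(matches_ignoring_variations("\u3436\U000E0101", "\u3436"))
```

Note that U+FE0F is also used in emoji presentation sequences, so a real implementation might want to restrict the stripping to the U+E0100..U+E01EF range.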

ICU's Collator [1] has the notion of "collation strength", and I saw in an article that only Collator::IDENTICAL among
the five alternatives distinguishes between ideographic variations of a glyph.

> [1] http://www.unicode.org/reports/tr37/tr37-14.html
> [2] https://glyphwiki.org/wiki/u3436

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
