Thread: unicode match normal forms

unicode match normal forms

From
hamann.w@t-online.de
Date:
Hi,

in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers
the decomposed form. Is there any way to make a semi-exact match that accepts both variants?
This question  is not about fulltext but about matching filenames across a network - I wish to avoid two
equally-looking
filenames.

Regards
Wolfgang






Re: unicode match normal forms

From
Gianni Ceccarelli
Date:
On 17 May 2021 13:27:40 -0000
hamann.w@t-online.de wrote:
> in unicode letter ä exists in two versions - linux and windows use a
> composite whereas macos prefers the decomposed form. Is there any way
> to make a semi-exact match that accepts both variants?

You should probably normalise the strings in whatever application code
handles the inserting. NFC is the "usually sensible" normal form to
use.

If you can't change the application code, you may use a trigger and
apply the `normalize(text[,form])→text` function to the values

https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.2

something vaguely like (totally untested!)::

  create function normalize_filename() returns trigger as $$
  begin
    new.filename := normalize(new.filename);
    return new;
  end;
  $$ language plpgsql;

  create trigger normalize_filename
  before insert or update
  on that_table
  for each row
  execute function normalize_filename();

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: unicode match normal forms

From
Matthias Apitz
Date:
El día lunes, mayo 17, 2021 a las 01:27:40p. m. -0000, hamann.w@t-online.de escribió:

> Hi,
> 
> in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers
> the decomposed form. Is there any way to make a semi-exact match that accepts both variants?
> This question  is not about fulltext but about matching filenames across a network - I wish to avoid two
equally-looking
> filenames.

There is only *one* codepoint for the German letter a Umlaut:
LATIN SMALL LETTER A WITH DIAERESI U+00E4

Said that, having such chars (non ASCII) in file names, I count as a bad
idea.

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
¡Con Cuba no te metas!  «»  Don't mess with Cuba!  «»  Leg Dich nicht mit Kuba an!
http://www.cubadebate.cu/noticias/2020/12/25/en-video-con-cuba-no-te-metas/



Re: unicode match normal forms

From
Gianni Ceccarelli
Date:
On 17 May 2021 13:27:40 -0000
hamann.w@t-online.de wrote:
> in unicode letter ä exists in two versions - linux and windows use a
> composite whereas macos prefers the decomposed form. Is there any
> way to make a semi-exact match that accepts both variants?

Actually, re-reading your request, you want to *find* both forms?

In which case, a function index may be more useful::

  create index filename_unique_normalized
  on that_table(normalize(filename));

then you can search::

  select *
  from that_table
  where normalize(filename)=?

If you want to make sure that no two rows contain "equally-looking"
filenames, you can use a unique index::

  create unique index filename_unique_normalized
  on that_table(normalize(filename));

(while we're on the topic of "equally-looking" characters, you may
want to look at https://en.wikipedia.org/wiki/Homoglyph and
https://www.unicode.org/reports/tr36/ )

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: unicode match normal forms

From
Gianni Ceccarelli
Date:
On Mon, 17 May 2021 15:45:00 +0200
Matthias Apitz <guru@unixarea.de> wrote:
> There is only *one* codepoint for the German letter a Umlaut:
> LATIN SMALL LETTER A WITH DIAERESI U+00E4

True. On the other hand, the sequence:

* U+0061 LATIN SMALL LETTER A
* U+0308 COMBINING DIAERESIS

will render exactly the same glyph. The two forms are closely related:
U+00E4 is in NFC (normalization form canonical composition), U+0061
U+0308 is in NFD (normalization form canonical decomposition).

See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

-- 
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: unicode match normal forms

From
"Daniel Verite"
Date:
    Hamann W wrote:

> in unicode letter ä exists in two versions - linux and windows use a
> composite whereas macos prefers
> the decomposed form. Is there any way to make a semi-exact match that
> accepts both variants?

Aside from normalizing the strings into the same normal form
before comparing, non-deterministic ICU collations will recognize them as
identical (they're "canonically equivalent" in Unicode terms)

For instance,

CREATE COLLATION nd (
   provider = 'icu',
   locale='',
   deterministic = false
);

SELECT
 nfc_form,
 nfd_form,
 nfc_form = nfd_form COLLATE nd AS equal1,
 nfc_form = nfd_form COLLATE "C" AS equal2 -- or any deterministic collation
FROM
  (VALUES
      (E'j\u00E4hrlich',
       E'j\u0061\u0308hrlich'))
  AS s(nfc_form, nfd_form);


 nfc_form | nfd_form | equal1 | equal2
----------+----------+--------+--------
 jährlich  | jährlich  | t    | f
(1 row)

Normalizing is available as a built-in function since Postgres 13 and
non-deterministic collations appeared in Postgres 12.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: unicode match normal forms

From
goldgraeber-werbetechnik@t-online.de
Date:
Hi Gianni,

many thanks for your detailed response.
It turned out that my postgresql installation is too old for normalize,so I will probably
a) use an external script to normalize existing data
b) change application code to normalize data before inserting or searching

Regards
Wolfgang
>> On 17 May 2021 13:27:40 -0000
>> hamann.w@t-online.de wrote:
>> > in unicode letter ä exists in two versions - linux and windows use a
>> > composite whereas macos prefers the decomposed form. Is there any way
>> > to make a semi-exact match that accepts both variants?
>> >> You should probably normalise the strings in whatever application code
>> handles the inserting. NFC is the "usually sensible" normal form to
>> use.
>> >> If you can't change the application code, you may use a trigger and
>> apply the `normalize(text[,form])→text` function to the values
>> >> https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.2
>> >> something vaguely like (totally untested!)::
>> >>   create function normalize_filename() returns trigger as $$
>>   begin
>>     new.filename := normalize(new.filename);
>>     return new;
>>   end;
>>   $$ language plpgsql;
>> >>   create trigger normalize_filename
>>   before insert or update
>>   on that_table
>>   for each row
>>   execute function normalize_filename();
>> >> -- >>     Dakkar - <Mobilis in mobile>
>>     GPG public key fingerprint = A071 E618 DD2C 5901 9574
>>                                  6FE2 40EA 9883 7519 3F88
>>                         key id = 0x75193F88
>> >> >>







Re: unicode match normal forms

From
goldgraeber-werbetechnik@t-online.de
Date:
>> El día lunes, mayo 17, 2021 a las 01:27:40p. m. -0000, hamann.w@t-online.de escribió:
>> >> > Hi,
>> > >> > in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers
>> > the decomposed form. Is there any way to make a semi-exact match that accepts both variants?
>> > This question  is not about fulltext but about matching filenames across a network - I wish to avoid two
equally-looking
>> > filenames.
>> >> There is only *one* codepoint for the German letter a Umlaut:
>> LATIN SMALL LETTER A WITH DIAERESI U+00E4
>>
Hi Matthias,

unfortunately there also is letter a with combining dieretic - and it is used by MacOS
The mac seems to prefer decomposed characters in other contexts as well, so in my
everyday job I used to have fun with product catalogues from a few companies.
Depending on the computer used for adding / editing a productthe relevant field could be
iso-latin-1, utf8 normal, or utf8 decomposed

>> Said that, having such chars (non ASCII) in file names, I count as a bad
>> idea.
I usually try to avoid whitespace and accented charactersin filenames, to be able to use ssh and scp
without much hassle, but I am not the user in this case.

Now, if I look at a music collection (stored as folders with mp3 files for the tracks), I would really prefer
"Einstürzende Neubauten" over Einstuerzende_Neubauten

Regards
Wolfgang

>>