Thread: unicode match normal forms
Hi, in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers the decomposed form. Is there any way to make a semi-exact match that accepts both variants? This question is not about fulltext but about matching filenames across a network - I wish to avoid two equally-looking filenames. Regards Wolfgang
On 17 May 2021 13:27:40 -0000 hamann.w@t-online.de wrote: > in unicode letter ä exists in two versions - linux and windows use a > composite whereas macos prefers the decomposed form. Is there any way > to make a semi-exact match that accepts both variants? You should probably normalise the strings in whatever application code handles the inserting. NFC is the "usually sensible" normal form to use. If you can't change the application code, you may use a trigger and apply the `normalize(text[,form])→text` function to the values https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.2 something vaguely like (totally untested!):: create function normalize_filename() returns trigger as $$ begin new.filename := normalize(new.filename); return new; end; $$ language plpgsql; create trigger normalize_filename before insert or update on that_table for each row execute function normalize_filename(); -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
El día lunes, mayo 17, 2021 a las 01:27:40p. m. -0000, hamann.w@t-online.de escribió: > Hi, > > in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers > the decomposed form. Is there any way to make a semi-exact match that accepts both variants? > This question is not about fulltext but about matching filenames across a network - I wish to avoid two equally-looking > filenames. There is only *one* codepoint for the German letter a Umlaut: LATIN SMALL LETTER A WITH DIAERESI U+00E4 Said that, having such chars (non ASCII) in file names, I count as a bad idea. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub ¡Con Cuba no te metas! «» Don't mess with Cuba! «» Leg Dich nicht mit Kuba an! http://www.cubadebate.cu/noticias/2020/12/25/en-video-con-cuba-no-te-metas/
On 17 May 2021 13:27:40 -0000 hamann.w@t-online.de wrote: > in unicode letter ä exists in two versions - linux and windows use a > composite whereas macos prefers the decomposed form. Is there any > way to make a semi-exact match that accepts both variants? Actually, re-reading your request, you want to *find* both forms? In which case, a function index may be more useful:: create index filename_unique_normalized on that_table(normalize(filename)); then you can search:: select * from that_table where normalize(filename)=? If you want to make sure that no two rows contain "equally-looking" filenames, you can use a unique index:: create unique index filename_unique_normalized on that_table(normalize(filename)); (while we're on the topic of "equally-looking" characters, you may want to look at https://en.wikipedia.org/wiki/Homoglyph and https://www.unicode.org/reports/tr36/ ) -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
On Mon, 17 May 2021 15:45:00 +0200 Matthias Apitz <guru@unixarea.de> wrote: > There is only *one* codepoint for the German letter a Umlaut: > LATIN SMALL LETTER A WITH DIAERESI U+00E4 True. On the other hand, the sequence: * U+0061 LATIN SMALL LETTER A * U+0308 COMBINING DIAERESIS will render exactly the same glyph. The two forms are closely related: U+00E4 is in NFC (normalization form canonical composition), U+0061 U+0308 is in NFD (normalization form canonical decomposition). See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
Hamann W wrote: > in unicode letter ä exists in two versions - linux and windows use a > composite whereas macos prefers > the decomposed form. Is there any way to make a semi-exact match that > accepts both variants? Aside from normalizing the strings into the same normal form before comparing, non-deterministic ICU collations will recognize them as identical (they're "canonically equivalent" in Unicode terms) For instance, CREATE COLLATION nd ( provider = 'icu', locale='', deterministic = false ); SELECT nfc_form, nfd_form, nfc_form = nfd_form COLLATE nd AS equal1, nfc_form = nfd_form COLLATE "C" AS equal2 -- or any deterministic collation FROM (VALUES (E'j\u00E4hrlich', E'j\u0061\u0308hrlich')) AS s(nfc_form, nfd_form); nfc_form | nfd_form | equal1 | equal2 ----------+----------+--------+-------- jährlich | jährlich | t | f (1 row) Normalizing is available as a built-in function since Postgres 13 and non-deterministic collations appeared in Postgres 12. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
Hi Gianni, many thanks for your detailed response. It turned out that my postgresql installation is too old for normalize,so I will probably a) use an external script to normalize existing data b) change application code to normalize data before inserting or searching Regards Wolfgang >> On 17 May 2021 13:27:40 -0000 >> hamann.w@t-online.de wrote: >> > in unicode letter ä exists in two versions - linux and windows use a >> > composite whereas macos prefers the decomposed form. Is there any way >> > to make a semi-exact match that accepts both variants? >> >> You should probably normalise the strings in whatever application code >> handles the inserting. NFC is the "usually sensible" normal form to >> use. >> >> If you can't change the application code, you may use a trigger and >> apply the `normalize(text[,form])→text` function to the values >> >> https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.2 >> >> something vaguely like (totally untested!):: >> >> create function normalize_filename() returns trigger as $$ >> begin >> new.filename := normalize(new.filename); >> return new; >> end; >> $$ language plpgsql; >> >> create trigger normalize_filename >> before insert or update >> on that_table >> for each row >> execute function normalize_filename(); >> >> -- >> Dakkar - <Mobilis in mobile> >> GPG public key fingerprint = A071 E618 DD2C 5901 9574 >> 6FE2 40EA 9883 7519 3F88 >> key id = 0x75193F88 >> >> >>
>> El día lunes, mayo 17, 2021 a las 01:27:40p. m. -0000, hamann.w@t-online.de escribió: >> >> > Hi, >> > >> > in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers >> > the decomposed form. Is there any way to make a semi-exact match that accepts both variants? >> > This question is not about fulltext but about matching filenames across a network - I wish to avoid two equally-looking >> > filenames. >> >> There is only *one* codepoint for the German letter a Umlaut: >> LATIN SMALL LETTER A WITH DIAERESI U+00E4 >> Hi Matthias, unfortunately there also is letter a with combining dieretic - and it is used by MacOS The mac seems to prefer decomposed characters in other contexts as well, so in my everyday job I used to have fun with product catalogues from a few companies. Depending on the computer used for adding / editing a productthe relevant field could be iso-latin-1, utf8 normal, or utf8 decomposed >> Said that, having such chars (non ASCII) in file names, I count as a bad >> idea. I usually try to avoid whitespace and accented charactersin filenames, to be able to use ssh and scp without much hassle, but I am not the user in this case. Now, if I look at a music collection (stored as folders with mp3 files for the tracks), I would really prefer "Einstürzende Neubauten" over Einstuerzende_Neubauten Regards Wolfgang >>