Re: [HACKERS] Extra Vietnamese unaccent rules - Mailing list pgsql-hackers

From Dang Minh Huong
Subject Re: [HACKERS] Extra Vietnamese unaccent rules
Date
Msg-id D367CC2F-5595-4370-827A-C439C0361979@gmail.com
In response to Re: [HACKERS] Extra Vietnamese unaccent rules  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: [HACKERS] Extra Vietnamese unaccent rules  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Hi,

I am interested in this thread.

On May 27, 2017, at 10:41, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Unicode has two ways to represent characters with accents: either with
composed codepoints like "é" or decomposed codepoints where you say
"e" and then "´".  The field "00E2 0301" is the decomposed form of
that character above.  Our job here is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line.  I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.
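
To make the point above concrete, here is a small sketch using Python's standard unicodedata module. The choice of U+1EA5 as the example character is an assumption based on the "00E2 0301" field mentioned above. It shows that the decomposition field of a doubly accented letter starts with another composed letter rather than a plain one.

```python
import unicodedata

# U+1EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE (a Vietnamese letter).
ch = "\u1ea5"

# One-level decomposition field, in the same form as in UnicodeData.txt.
field = unicodedata.decomposition(ch)
print(field)  # 00E2 0301

# The first code point of that field is itself a composed letter.
first = chr(int(field.split()[0], 16))
print(unicodedata.name(first))  # LATIN SMALL LETTER A WITH CIRCUMFLEX

# Only its own decomposition field reveals the plain letter "a" (U+0061).
first_field = unicodedata.decomposition(first)
print(first_field)  # 0061 0302
```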

Doesn't that depend on the NF operation you are working on? With a
canonical decomposition it seems to me that a character with two
accents can as well be decomposed with one character and two composing
character accents (NFKC does a canonical decomposition in one of its
steps).
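
As a rough illustration of this remark, full canonical decomposition (NFD) of the same example character, assumed here to be U+1EA5, yields one base letter followed by two combining marks. This uses Python's unicodedata module, not the PostgreSQL code.

```python
import unicodedata

ch = "\u1ea5"  # a with circumflex and acute

# NFD applies the canonical decomposition recursively.
nfd = unicodedata.normalize("NFD", ch)
codepoints = [f"{ord(c):04X}" for c in nfd]
print(codepoints)  # ['0061', '0302', '0301']
```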

You don't have to worry about decoding that line, it's all done in
that Python script.  The problem is just in the function
is_letter_with_marks().  Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks.  Also get_plain_letter
would need to be able to recurse to extract the "a".
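
A minimal sketch of the recursive check described above. The Codepoint class, the `table` lookup parameter and the helper functions are assumptions made for illustration. The real script's data structures and signatures may differ, so this is not the attached patch.

```python
from dataclasses import dataclass, field

@dataclass
class Codepoint:
    # Hypothetical stand-in for one parsed line of UnicodeData.txt.
    id: int
    general_category: str
    combining_ids: list = field(default_factory=list)

def is_plain_letter(cp):
    # A letter with no decomposition at all.
    return cp.general_category.startswith("L") and not cp.combining_ids

def is_mark(cp):
    return cp.general_category.startswith("M")

def is_letter_with_marks(cp, table):
    # Needs a base plus at least one combining mark.
    if not cp.general_category.startswith("L") or len(cp.combining_ids) < 2:
        return False
    if not all(is_mark(table[i]) for i in cp.combining_ids[1:]):
        return False
    base = table[cp.combining_ids[0]]
    # The base may be a plain letter or, recursively, a letter with marks.
    return is_plain_letter(base) or is_letter_with_marks(base, table)

def get_plain_letter(cp, table):
    # Follow the first decomposition element until a plain letter is reached.
    base = table[cp.combining_ids[0]]
    if is_plain_letter(base):
        return base
    return get_plain_letter(base, table)

# Tiny hand-built table covering U+1EA5 and everything it decomposes into.
table = {
    0x0061: Codepoint(0x0061, "Ll"),
    0x0302: Codepoint(0x0302, "Mn"),
    0x0301: Codepoint(0x0301, "Mn"),
    0x00E2: Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),
    0x1EA5: Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),
}

target = table[0x1EA5]
if is_letter_with_marks(target, table):
    print(chr(get_plain_letter(target, table).id))  # a
```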


Thanks for the report and for the explanation of Unicode.
I have attached a patch that follows Thomas's instructions. Could you please review it?

Actually, with the recent work that has been done with
unicode_norm_table.h which has been to transpose UnicodeData.txt into
user-friendly tables, shouldn't the python script of unaccent/ be
replaced by something that works on this table? This does a canonical
decomposition but just keeps the first characters with a class
ordering of 0. So we have basic APIs able to look at UnicodeData.txt
and let caller do decision making with the result returned.
--
Michael
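
The idea suggested above (canonical decomposition, then keeping only characters whose combining class is 0) can be sketched in Python as follows. This uses the standard unicodedata module as a stand-in for unicode_norm_table.h, and the function name strip_marks is hypothetical.

```python
import unicodedata

def strip_marks(text):
    # Canonically decompose, then keep only characters of combining class 0.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

print(strip_marks("Vi\u1ec7t Nam"))  # Viet Nam
```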

Thanks, I will read up on it.

---
Dang Minh Huong
Attachment
