Re: [HACKERS] Extra Vietnamese unaccent rules - Mailing list pgsql-hackers

From	Dang Minh Huong
Subject	Re: [HACKERS] Extra Vietnamese unaccent rules
Date	May 28, 2017 10:55:07
Msg-id	D367CC2F-5595-4370-827A-C439C0361979@gmail.com
In response to	Re: [HACKERS] Extra Vietnamese unaccent rules (Michael Paquier <michael.paquier@gmail.com>)
Responses	Re: [HACKERS] Extra Vietnamese unaccent rules
List	pgsql-hackers

Hi,

I am interested in this thread.

On May 27, 2017, at 10:41, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Unicode has two ways to represent characters with accents: either with
composed codepoints like "é" or decomposed codepoints where you say
"e" and then "´". The field "00E2 0301" is the decomposed form of
that character above. Our job here is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line. I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.

Doesn't that depend on the NF operation you are working on? With a
canonical decomposition it seems to me that a character with two
accents can as well be decomposed with one character and two composing
character accents (NFKC does a canonical decomposition in one of its
steps).
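To make the two views above concrete, here is a minimal sketch using Python's standard unicodedata module (an assumption for illustration; it is not the unaccent script itself). It shows the single-step decomposition field as it appears in UnicodeData.txt, and the full canonical decomposition (NFD), for U+1EA5:

```python
import unicodedata

ch = "\u1ea5"  # LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

# Single-step decomposition field, as stored in UnicodeData.txt:
# a composed letter with ONE accent, plus another accent.
one_step = unicodedata.decomposition(ch)

# Full canonical decomposition (NFD) applies the mappings recursively:
# one base letter followed by two combining marks.
full = [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

print(one_step)  # 00E2 0301
print(full)      # ['0061', '0302', '0301']
```

So both descriptions hold: the data file stores one level of decomposition per line, while a normalization form expands it all the way down.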

You don't have to worry about decoding that line, it's all done in
that Python script. The problem is just in the function
is_letter_with_marks(). Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks. Also get_plain_letter
would need to be able to recurse to extract the "a".
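A rough sketch of the recursion described above, not the attached patch. It assumes Python's unicodedata module in place of the script's own UnicodeData.txt parser; the helpers combining_ids(), is_plain_letter() and is_mark() are hypothetical stand-ins written for this example:

```python
import unicodedata

def combining_ids(ch):
    """Code points in the single-step canonical decomposition field of ch."""
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):  # no mapping, or a compatibility one
        return []
    return [chr(int(cp, 16)) for cp in field.split()]

def is_plain_letter(ch):
    """True for unaccented ASCII letters."""
    return ch.isascii() and ch.isalpha()

def is_mark(ch):
    """True for combining marks (general categories Mn, Me, Mc)."""
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")

def is_letter_with_marks(ch):
    """True if ch is a plain letter, or a letter with marks, plus more marks."""
    ids = combining_ids(ch)
    if len(ids) < 2:
        return False
    if not all(is_mark(c) for c in ids[1:]):
        return False
    # The recursive check: the first element may itself carry marks.
    return is_plain_letter(ids[0]) or is_letter_with_marks(ids[0])

def get_plain_letter(ch):
    """Recurse to the base letter; assumes is_letter_with_marks(ch) is true."""
    base = combining_ids(ch)[0]
    return base if is_plain_letter(base) else get_plain_letter(base)
```

With this, is_letter_with_marks("\u1ea5") is true and get_plain_letter("\u1ea5") yields "a", because the first element U+00E2 is itself resolved to "a" plus a circumflex.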

Thanks for the report and for the explanation about Unicode.

I have attached a patch that follows Thomas's instructions. Could you please review it?

Actually, with the recent work that has been done with
unicode_norm_table.h which has been to transpose UnicodeData.txt into
user-friendly tables, shouldn't the python script of unaccent/ be
replaced by something that works on this table? This does a canonical
decomposition but just keeps the first characters with a class
ordering of 0. So we have basic APIs able to look at UnicodeData.txt
and let caller do decision making with the result returned.
--
Michael
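The idea above (canonical decomposition, keeping only characters of combining class 0) can be sketched as follows. This uses Python's unicodedata as a stand-in for the tables in unicode_norm_table.h, so it only illustrates the approach and is not PostgreSQL code; strip_marks() is a hypothetical name:

```python
import unicodedata

def strip_marks(text):
    """Canonically decompose text, then keep only characters whose
    canonical combining class is 0 (i.e. drop the combining accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

print(strip_marks("\u1ea5"))  # a
```

Note that letters without a canonical decomposition, such as "đ" (U+0111), pass through unchanged, so explicit rules would still be needed for them.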

Thanks, I will study it.

---

Dang Minh Huong

Attachment

unaccent.patch
