Re: [HACKERS] Extra Vietnamese unaccent rules - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] Extra Vietnamese unaccent rules
Date
Msg-id CAEepm=0S_b04AjS-4acrjU+20FgamKwF5CiJz-cd=E4a1SOWMw@mail.gmail.com
In response to Re: [HACKERS] Extra Vietnamese unaccent rules  (Dang Minh Huong <kakalot49@gmail.com>)
Responses Re: [HACKERS] Extra Vietnamese unaccent rules  (Dang Minh Huong <kakalot49@gmail.com>)
Re: [HACKERS] Extra Vietnamese unaccent rules  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong <kakalot49@gmail.com> wrote:
> [Quoting Thomas]
>> You don't have to worry about decoding that line, it's all done in
>> that Python script.  The problem is just in the function
>> is_letter_with_marks().  Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
>> would need to be able to recurse to extract the "a".
>
> Thanks for the report and the explanation about Unicode.
> I have attached a patch following Thomas's instructions. Could you confirm it?

-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()" instead of
"or len(...) > 1"?  Your test might catch something that isn't based
on a 'letter' (according to is_plain_letter).  Otherwise this looks
pretty good to me.  Please add it to the next commitfest.
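
To make the suggestion concrete, here is a minimal, self-contained
sketch of the recursive check.  It is not the actual script: the
Codepoint record, the hand-built table and the exact function bodies
are assumptions for illustration, and only the function names come from
the discussion above.  The Greek entry is there to show a composed
codepoint whose base has more than one combining id but is not built on
a letter.

```python
from collections import namedtuple

# Hypothetical stand-in for the record the generator script keeps per codepoint.
Codepoint = namedtuple("Codepoint", ["id", "general_category", "combining_ids"])


def is_plain_letter(cp):
    """True for an unaccented ASCII letter."""
    return ord("a") <= cp.id <= ord("z") or ord("A") <= cp.id <= ord("Z")


def is_mark(cp):
    """True for a combining mark (Unicode general categories Mn, Me, Mc)."""
    return cp.general_category in ("Mn", "Me", "Mc")


def is_letter_with_marks(cp, table):
    """True if cp is a base followed by marks, where the base is either a
    plain letter or, recursively, itself a letter with marks."""
    if len(cp.combining_ids) < 2:
        return False
    if not all(is_mark(table[i]) for i in cp.combining_ids[1:]):
        return False
    base = table[cp.combining_ids[0]]
    return is_plain_letter(base) or is_letter_with_marks(base, table)


def get_plain_letter(cp, table):
    """Follow combining_ids[0] until a plain letter is reached.
    Only meaningful when is_letter_with_marks(cp, table) is true."""
    if is_plain_letter(cp):
        return cp
    return get_plain_letter(table[cp.combining_ids[0]], table)


# Tiny hand-built excerpt (an assumption, not parsed from UnicodeData.txt).
table = {
    0x0020: Codepoint(0x0020, "Zs", []),                # space
    0x0061: Codepoint(0x0061, "Ll", []),                # a
    0x0300: Codepoint(0x0300, "Mn", []),                # combining grave
    0x0301: Codepoint(0x0301, "Mn", []),                # combining acute
    0x0302: Codepoint(0x0302, "Mn", []),                # combining circumflex
    0x0308: Codepoint(0x0308, "Mn", []),                # combining diaeresis
    0x00A8: Codepoint(0x00A8, "Sk", [0x0020, 0x0308]),  # diaeresis sign
    0x00E2: Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),  # a with circumflex
    0x1EA5: Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),  # a with circumflex and acute
    0x1FED: Codepoint(0x1FED, "Sk", [0x00A8, 0x0300]),  # Greek dialytika and varia
}

vietnamese = table[0x1EA5]
greek_sign = table[0x1FED]
vietnamese_ok = is_letter_with_marks(vietnamese, table)
vietnamese_base = chr(get_plain_letter(vietnamese, table).id)
greek_ok = is_letter_with_marks(greek_sign, table)
```

In this sketch U+1EA5 is accepted and resolves to "a" through U+00E2,
while U+1FED is rejected because its chain of bases ends in a space
rather than a letter.  A bare "len(...) > 1" test on the base would have
accepted both.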

I expect that some users in Vietnam will consider this to be a bugfix,
which raises the question of whether to backpatch it.  Perhaps we
could consider fixing it for 10.  Then users of older versions could
grab the rules file from 10 to use with 9.whatever if they want to do
that and reindex their data as appropriate.

> [Quoting Michael]
>> Actually, with the recent work that has been done with
>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>> user-friendly tables, shouldn't the python script of unaccent/ be
>> replaced by something that works on this table? This does a canonical
>> decomposition but just keeps the first characters with a class
>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
>> and let caller do decision making with the result returned.
>
> Thanks, I will learn about it.

It seems like that could be useful for runtime use (I'm sure there is
a whole world of Unicode support we could add), but here we're only
trying to generate a mapping file to add to the source tree, so I'm
not sure how it's relevant.

-- 
Thomas Munro
http://www.enterprisedb.com


