Re: [HACKERS] Extra Vietnamese unaccent rules - Mailing list pgsql-hackers

From	Thomas Munro
Subject	Re: [HACKERS] Extra Vietnamese unaccent rules
Date	May 29, 2017 07:47:40
Msg-id	CAEepm=0S_b04AjS-4acrjU+20FgamKwF5CiJz-cd=E4a1SOWMw@mail.gmail.com
In response to	Re: [HACKERS] Extra Vietnamese unaccent rules (Dang Minh Huong <kakalot49@gmail.com>)
Responses	Re: [HACKERS] Extra Vietnamese unaccent rules (Dang Minh Huong <kakalot49@gmail.com>) Re: [HACKERS] Extra Vietnamese unaccent rules (Michael Paquier <michael.paquier@gmail.com>)
List	pgsql-hackers

On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong <kakalot49@gmail.com> wrote:
> [Quoting Thomas]
>> You don't have to worry about decoding that line, it's all done in
>> that Python script.  The problem is just in the function
>> is_letter_with_marks().  Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
>> would need to be able to recurse to extract the "a".
>
> Thanks for the report and the explanation of Unicode.
> I have attached a patch that follows Thomas's instructions. Could you confirm it?

-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
> 1"?  Your test might catch something that isn't based on a 'letter'
(according to is_plain_letter).  Otherwise this looks pretty good to
me.  Please add it to the next commitfest.
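
To illustrate the suggestion, here is a minimal, self-contained sketch of the
recursive check. It is not the code from the script in the source tree: the
Codepoint record, the hand-built TABLE, and the function bodies are simplified
assumptions made for this example. The three decompositions in TABLE are taken
from UnicodeData.txt (U+1EA5 decomposes to U+00E2 U+0301, and U+00E2
decomposes to U+0061 U+0302).

```python
from collections import namedtuple

# Assumed, simplified stand-in for the script's code point record.
Codepoint = namedtuple("Codepoint", ["id", "general_category", "combining_ids"])

# Hand-built excerpt of the table, keyed by code point.
TABLE = {cp.id: cp for cp in [
    Codepoint(0x0061, "Ll", []),                # a
    Codepoint(0x0301, "Mn", []),                # combining acute accent
    Codepoint(0x0302, "Mn", []),                # combining circumflex accent
    Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),  # a with circumflex
    Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),  # a with circumflex and acute
]}


def is_plain_letter(codepoint):
    """True for an unaccented ASCII letter."""
    return (ord("a") <= codepoint.id <= ord("z")) or \
           (ord("A") <= codepoint.id <= ord("Z"))


def is_mark(codepoint):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return codepoint.general_category in ("Mn", "Mc", "Me")


def is_letter_with_marks(codepoint, table):
    """True for a letter whose base is a plain letter or, recursively,
    another letter with marks, followed only by combining marks."""
    if not codepoint.general_category.startswith("L"):
        return False
    if len(codepoint.combining_ids) < 2:
        return False
    if not all(is_mark(table[i]) for i in codepoint.combining_ids[1:]):
        return False
    base = table[codepoint.combining_ids[0]]
    # The recursive call replaces the "len(...) > 1" test, so the base
    # must itself bottom out in a plain letter.
    return is_plain_letter(base) or is_letter_with_marks(base, table)


def get_plain_letter(codepoint, table):
    """Strip marks level by level until a plain letter remains."""
    if is_plain_letter(codepoint):
        return codepoint
    if is_letter_with_marks(codepoint, table):
        return get_plain_letter(table[codepoint.combining_ids[0]], table)
    raise ValueError("U+%04X is not based on a plain letter" % codepoint.id)


print(chr(get_plain_letter(TABLE[0x1EA5], TABLE).id))  # prints "a"
```

With the recursive test, a code point is accepted only if following
combining_ids[0] eventually reaches something that is_plain_letter accepts,
which is the property the "len(...) > 1" test does not guarantee.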

I expect that some users in Vietnam will consider this to be a bugfix,
which raises the question of whether to backpatch it.  Perhaps we
could consider fixing it for 10.  Then users of older versions could
grab the rules file from 10 to use with 9.whatever if they want to do
that and reindex their data as appropriate.

> [Quoting Michael]
>> Actually, with the recent work that has been done with
>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>> user-friendly tables, shouldn't the python script of unaccent/ be
>> replaced by something that works on this table? This does a canonical
>> decomposition but just keeps the first characters with a class
>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
>> and let caller do decision making with the result returned.
>
> Thanks, I will learn about it.

It seems like that could be useful for runtime use (I'm sure there is
a whole world of Unicode support we could add), but here we're only
trying to generate a mapping file to add to the source tree, so I'm
not sure how it's relevant.

-- 
Thomas Munro
http://www.enterprisedb.com
