
Re: [HACKERS] Extra Vietnamese unaccent rules - Mailing list pgsql-hackers

From	Thomas Munro
Subject	Re: [HACKERS] Extra Vietnamese unaccent rules
Date	May 27, 2017 00:19:37
Msg-id	CAEepm=39zN5tkbWPVUMifK9uk+rVkyEaXDs-y+DO2R+CtUUEBA@mail.gmail.com
In response to	Re: [HACKERS] Extra Vietnamese unaccent rules (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: [HACKERS] Extra Vietnamese unaccent rules
List	pgsql-hackers


On Sat, May 27, 2017 at 5:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Nguyen Le Hoang Kha <nlhkha@gmail.com> writes:
>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>> character. These unaccent rules are added to handle such cases (which are
>>> very common).
>
>> I can't see any reason not to add these --- any objections out there?
>
> Oh, wait a minute.  Patching unaccent.rules directly isn't the way
> to do this; that file is supposed to be generated by
> generate_unaccent_rules.py.  Can you see how to modify that script
> to produce these rules?

Looking at one example from this patch:

UTF8: <E1><BA><A5>
Codepoint: 1EA5
Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

In UnicodeData.txt it's this line:

1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
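
To make the relevant field easier to see, here is a small sketch that splits that line on ';' and pulls out the decomposition mapping (the sixth field in UnicodeData.txt). The helper name parse_decomposition is made up for illustration and is not taken from generate_unaccent_rules.py.

```python
# Illustrative only: parse one UnicodeData.txt record and extract its
# decomposition mapping. parse_decomposition is a hypothetical helper.
LINE = ("1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;"
        "Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4")


def parse_decomposition(line):
    """Return (codepoint, name, [decomposition codepoints]) for one record."""
    fields = line.split(";")
    codepoint = int(fields[0], 16)
    name = fields[1]
    # Field 5 holds the decomposition mapping as space-separated hex values.
    decomposition = [int(part, 16) for part in fields[5].split()]
    return codepoint, name, decomposition


codepoint, name, decomposition = parse_decomposition(LINE)
print(hex(codepoint), name, [hex(c) for c in decomposition])
```

The first element of the decomposition is 00E2, which is itself a precomposed character rather than a plain letter.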

The problem is that generate_unaccent_rules.py assumes that the
composing data is a plain letter followed by some number of
diacritical modifiers.  That's true for the characters with a single
accent, but in this multi-accent case it's a *composed* character 00E2
(LATIN SMALL LETTER A WITH CIRCUMFLEX) followed by a diacritical mark
0301 (COMBINING ACUTE ACCENT).  So we need to teach it to be recursive.
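
As a rough sketch of what "recursive" could mean here (not the actual change to generate_unaccent_rules.py), the following uses Python's standard unicodedata module instead of parsing UnicodeData.txt directly. The function names is_plain_letter, is_mark and base_letter are hypothetical, and the sketch assumes that only canonical decompositions of the form "letter followed by combining marks" should be unaccented.

```python
import unicodedata


def is_plain_letter(ch):
    """A letter with no decomposition of its own, e.g. 'a'."""
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.decomposition(ch) == "")


def is_mark(ch):
    """Any combining mark (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def base_letter(ch):
    """Follow decompositions until a plain letter is reached.

    Returns the plain letter, or None if the character does not reduce
    to "letter + combining marks" (assumption: such characters are skipped).
    """
    if is_plain_letter(ch):
        return ch
    fields = unicodedata.decomposition(ch).split()
    # No decomposition, or a compatibility one tagged like "<compat>".
    if not fields or fields[0].startswith("<"):
        return None
    chars = [chr(int(field, 16)) for field in fields]
    first, rest = chars[0], chars[1:]
    if not all(is_mark(c) for c in rest):
        return None
    # The first element may itself be a composed character (00E2 here),
    # so recurse rather than requiring it to be a plain letter.
    return base_letter(first)


print(base_letter("\u1ea5"))
```

For 1EA5 this walks 1EA5 -> 00E2 + 0301, then 00E2 -> 0061 + 0302, and stops at the plain letter 0061.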

-- 
Thomas Munro
http://www.enterprisedb.com
