Thread: [HACKERS] Extra Vietnamese unaccent rules

[HACKERS] Extra Vietnamese unaccent rules

From
Nguyen Le Hoang Kha
Date:
Most of the time in Vietnamese language, there are up to 2 accents in a character. These unaccent rules are added to handle such cases (which are very common).

Kha Nguyen | nlhkh@github
Attachment

Re: [HACKERS] Extra Vietnamese unaccent rules

From
Tom Lane
Date:
Nguyen Le Hoang Kha <nlhkha@gmail.com> writes:
> Most of the time in Vietnamese language, there are up to 2 accents in a
> character. These unaccent rules are added to handle such cases (which are
> very common).

I can't see any reason not to add these --- any objections out there?
        regards, tom lane



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Tom Lane
Date:
I wrote:
> Nguyen Le Hoang Kha <nlhkha@gmail.com> writes:
>> Most of the time in Vietnamese language, there are up to 2 accents in a
>> character. These unaccent rules are added to handle such cases (which are
>> very common).

> I can't see any reason not to add these --- any objections out there?

Oh, wait a minute.  Patching unaccent.rules directly isn't the way
to do this; that file is supposed to be generated by
generate_unaccent_rules.py.  Can you see how to modify that script
to produce these rules?
        regards, tom lane



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Thomas Munro
Date:
On Sat, May 27, 2017 at 5:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Nguyen Le Hoang Kha <nlhkha@gmail.com> writes:
>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>> character. These unaccent rules are added to handle such cases (which are
>>> very common).
>
>> I can't see any reason not to add these --- any objections out there?
>
> Oh, wait a minute.  Patching unaccent.rules directly isn't the way
> to do this; that file is supposed to be generated by
> generate_unaccent_rules.py.  Can you see how to modify that script
> to produce these rules?

Looking at one example from this patch:

UTF8: <E1><BA><A5>
Codepoint: 1EA5
Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

In UnicodData.txt it's this line:

1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
0301;;;;N;;;1EA4;;1EA4

The problem is that generate_unaccent_rules.py assumes that the
composing data is a plain letter followed by some number of
diacritical modifiers.  That's true for the characters with a single
accent, but in this multi-accent case it's *composed* character 00E2
(LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical marker 0301
(COMBINING ACCENT ACUTE).  So we need to teach it to be recursive.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Kha Nguyen
Date:
Could you explain to me what this line means:
“
1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
0301;;;;N;;;1EA4;;1EA4
“

If you could give me an example of adding a rule for “recursive” case, I can do the rest. I am not familiar with this
unaccentformat generation yet. 

Thanks
Kha

> On 26 May 2017, at 21.19, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Sat, May 27, 2017 at 5:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> Nguyen Le Hoang Kha <nlhkha@gmail.com> writes:
>>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>>> character. These unaccent rules are added to handle such cases (which are
>>>> very common).
>>
>>> I can't see any reason not to add these --- any objections out there?
>>
>> Oh, wait a minute.  Patching unaccent.rules directly isn't the way
>> to do this; that file is supposed to be generated by
>> generate_unaccent_rules.py.  Can you see how to modify that script
>> to produce these rules?
>
> Looking at one example from this patch:
>
> UTF8: <E1><BA><A5>
> Codepoint: 1EA5
> Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
>
> In UnicodData.txt it's this line:
>
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
> 0301;;;;N;;;1EA4;;1EA4
>
> The problem is that generate_unaccent_rules.py assumes that the
> composing data is a plain letter followed by some number of
> diacritical modifiers.  That's true for the characters with a single
> accent, but in this multi-accent case it's *composed* character 00E2
> (LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical marker 0301
> (COMBINING ACCENT ACUTE).  So we need to teach it to be recursive.
>
> --
> Thomas Munro
> http://www.enterprisedb.com




Re: [HACKERS] Extra Vietnamese unaccent rules

From
Thomas Munro
Date:
On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <nlhkha@gmail.com> wrote:
> Could you explain to me what this line means:
> “
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
> 0301;;;;N;;;1EA4;;1EA4
> “
>
> If you could give me an example of adding a rule for “recursive” case, I can do the rest. I am not familiar with this
unaccentformat generation yet. 

So contrib/unaccent/generate_unaccent_rules.py is a Python script that
takes UnicodeData.txt, a list of information about all Unicode
codepoints available at a URL that is shown in a comment, and
generates unaccent.rules.  The idea was to avoid having to change it
manually every time someone finds characters that should be in there
(as you have just done!) by doing it systematically.

Unicode has two ways to represent characters with accents: either with
composed codepoints like "é" or decomposed codepoints where you say
"e" and then "´".  The field "00E2 0301" is the decomposed form of
that character above.  Our job here is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line.  I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.

You don't have to worry about decoding that line, it's all done in
that Python script.  The problem is just in the function
is_letter_with_marks().  Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks.  Also get_plain_letter
would need to be able to recurse to extract the "a".

I hope that helps!

--
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Kha Nguyen
Date:
Does this mean that the python script has to be updated to be recursive too?

> On 27 May 2017, at 0.48, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <nlhkha@gmail.com> wrote:
>> Could you explain to me what this line means:
>> “
>> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
>> 0301;;;;N;;;1EA4;;1EA4
>> “
>>
>> If you could give me an example of adding a rule for “recursive” case, I can do the rest. I am not familiar with
thisunaccent format generation yet. 
>
> So contrib/unaccent/generate_unaccent_rules.py is a Python script that
> takes UnicodeData.txt, a list of information about all Unicode
> codepoints available at a URL that is shown in a comment, and
> generates unaccent.rules.  The idea was to avoid having to change it
> manually every time someone finds characters that should be in there
> (as you have just done!) by doing it systematically.
>
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´".  The field "00E2 0301" is the decomposed form of
> that character above.  Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line.  I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.
>
> You don't have to worry about decoding that line, it's all done in
> that Python script.  The problem is just in the function
> is_letter_with_marks().  Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
> would need to be able to recurse to extract the "a".
>
> I hope that helps!
>
> --
> Thomas Munro
> http://www.enterprisedb.com




Re: [HACKERS] Extra Vietnamese unaccent rules

From
Michael Paquier
Date:
On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´".  The field "00E2 0301" is the decomposed form of
> that character above.  Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line.  I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.

Doesn't that depend on the NF operation you are working on? With a
canonical decomposition it seems to me that a character with two
accents can as well be decomposed with one character and two composing
character accents (NFKC does a canonical decomposition in one of its
steps).

> You don't have to worry about decoding that line, it's all done in
> that Python script.  The problem is just in the function
> is_letter_with_marks().  Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
> would need to be able to recurse to extract the "a".

Actually, with the recent work that has been done with
unicode_norm_table.h which has been to transpose UnicodeData.txt into
user-friendly tables, shouldn't the python script of unaccent/ be
replaced by something that works on this table? This does a canonical
decomposition but just keeps the first characters with a class
ordering of 0. So we have basic APIs able to look at UnicodeData.txt
and let caller do decision making with the result returned.
--
Michael



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Dang Minh Huong
Date:
Hi,

I am interested in this thread.

On May 27, 29 Heisei, at 10:41, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Unicode has two ways to represent characters with accents: either with
composed codepoints like "é" or decomposed codepoints where you say
"e" and then "´".  The field "00E2 0301" is the decomposed form of
that character above.  Our job here is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line.  I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.

Doesn't that depend on the NF operation you are working on? With a
canonical decomposition it seems to me that a character with two
accents can as well be decomposed with one character and two composing
character accents (NFKC does a canonical decomposition in one of its
steps).

You don't have to worry about decoding that line, it's all done in
that Python script.  The problem is just in the function
is_letter_with_marks().  Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks.  Also get_plain_letter
would need to be able to recurse to extract the "a".


Thanks for reporting and lecture about unicode.
I attached a patch as the instruction from Thomas. Could you confirm it.

Actually, with the recent work that has been done with
unicode_norm_table.h which has been to transpose UnicodeData.txt into
user-friendly tables, shouldn't the python script of unaccent/ be
replaced by something that works on this table? This does a canonical
decomposition but just keeps the first characters with a class
ordering of 0. So we have basic APIs able to look at UnicodeData.txt
and let caller do decision making with the result returned.
--
Michael

Thanks, i will learning about it.

---
Dang Minh Huong
Attachment

Re: [HACKERS] Extra Vietnamese unaccent rules

From
Thomas Munro
Date:
On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong <kakalot49@gmail.com> wrote:
> [Quoting Thomas]
>> You don't have to worry about decoding that line, it's all done in
>> that Python script.  The problem is just in the function
>> is_letter_with_marks().  Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
>> would need to be able to recurse to extract the "a".
>
> Thanks for reporting and lecture about unicode.
> I attached a patch as the instruction from Thomas. Could you confirm it.

-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
> 1"?  Your test might catch something that isn't based on a 'letter'
(according to is_plain_letter).  Otherwise this looks pretty good to
me.  Please add it to the next commitfest.

I expect that some users in Vietnam will consider this to be a bugfix,
which raises the question of whether to backpatch it.  Perhaps we
could consider fixing it for 10.  Then users of older versions could
grab the rules file from 10 to use with 9.whatever if they want to do
that and reindex their data as appropriate.

> [Quoting Michael]
>> Actually, with the recent work that has been done with
>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>> user-friendly tables, shouldn't the python script of unaccent/ be
>> replaced by something that works on this table? This does a canonical
>> decomposition but just keeps the first characters with a class
>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
>> and let caller do decision making with the result returned.
>
> Thanks, i will learning about it.

It seems like that could be useful for runtime use (I'm sure there is
a whole world of Unicode support we could add), but here we're only
trying to generate a mapping file to add to the source tree, so I'm
not sure how it's relevant.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Dang Minh Huong
Date:

On May 29, 29 Heisei, at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong <kakalot49@gmail.com> wrote:
Thanks for reporting and lecture about unicode.
I attached a patch as the instruction from Thomas. Could you confirm it.

-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
1"?  Your test might catch something that isn't based on a 'letter'
(according to is_plain_letter).  Otherwise this looks pretty good to
me.  Please add it to the next commitfest.

Thanks for confirm, sir.
I will add it to the next CF soon.

I expect that some users in Vietnam will consider this to be a bugfix,
which raises the question of whether to backpatch it.  Perhaps we
could consider fixing it for 10.  Then users of older versions could
grab the rules file from 10 to use with 9.whatever if they want to do
that and reindex their data as appropriate.

I am also inclined to the fixing it for 10, because it will not affect to current users.
But do you want to back-patch to all supported versions Kha Nguyen?
# I would also want to note that, not only Vietnamese characters were missed to add from the rule list.


---
Thanks and best regards,
Dang Minh Huong

Re: [HACKERS] Extra Vietnamese unaccent rules

From
Dang Minh Huong
Date:

On May 30, 29 Heisei, at 00:22, Dang Minh Huong <kakalot49@gmail.com> wrote:


On May 29, 29 Heisei, at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong <kakalot49@gmail.com> wrote:
Thanks for reporting and lecture about unicode.
I attached a patch as the instruction from Thomas. Could you confirm it.

-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
1"?  Your test might catch something that isn't based on a 'letter'
(according to is_plain_letter).  Otherwise this looks pretty good to
me.  Please add it to the next commitfest.

Thanks for confirm, sir.
I will add it to the next CF soon.

Sorry for lately response. I attach the update patch.

---
Thanks and best regards,
Dang Minh Huong



Attachment

Re: [HACKERS] Extra Vietnamese unaccent rules

From
Michael Paquier
Date:
On Mon, May 29, 2017 at 10:47 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> [Quoting Michael]
>>> Actually, with the recent work that has been done with
>>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>>> user-friendly tables, shouldn't the python script of unaccent/ be
>>> replaced by something that works on this table? This does a canonical
>>> decomposition but just keeps the first characters with a class
>>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
>>> and let caller do decision making with the result returned.
>>
>> Thanks, i will learning about it.
>
> It seems like that could be useful for runtime use (I'm sure there is
> a whole world of Unicode support we could add), but here we're only
> trying to generate a mapping file to add to the source tree, so I'm
> not sure how it's relevant.

Yes, that's what I am coming at, but that would be really dictionnary
specific and that would be roughly to provide a fast-path equivalent
to the tsearch_readline* routines working on files. The addition of
new infrastructure may perhaps not be worth the performance gains.
Definitely for this fix there is no need to do anything more
complicated than tweaking the script to generate the rules.
-- 
Michael



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Dang Minh Huong
Date:
On Jun 4, 29 Heisei, at 00:48, Bruce Momjian <bruce@momjian.us> wrote:

On Sun, Jun  4, 2017 at 12:43:17AM +0900, Dang Minh Huong wrote:

On May 30, 29 Heisei, at 00:22, Dang Minh Huong <kakalot49@gmail.com> wrote:



On May 29, 29 Heisei, at 10:47, Thomas Munro <thomas.munro@enterprisedb.com <mailto:thomas.munro@enterprisedb.com>> wrote:

On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong <kakalot49@gmail.com <mailto:kakalot49@gmail.com>> wrote:
Thanks for reporting and lecture about unicode.
I attached a patch as the instruction from Thomas. Could you confirm it.

-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
1"?  Your test might catch something that isn't based on a 'letter'
(according to is_plain_letter).  Otherwise this looks pretty good to
me.  Please add it to the next commitfest.

Thanks for confirm, sir.
I will add it to the next CF soon.

Sorry for lately response. I attach the update patch.

Uh, there is no patch attached.


Sorry sir, reattach the patch.
I also added it to the next CF and set reviewers to Thomas Munro. Could you confirm for me.

---
Thanks and best regards,
Dang Minh Huong


Attachment

Re: [HACKERS] Extra Vietnamese unaccent rules

From
Bruce Momjian
Date:
On Wed, Jun  7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
> > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian <bruce@momjian.us> wrote:
> >>>> Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
> >>>>> 1"?  Your test might catch something that isn't based on a 'letter'
> >>>> (according to is_plain_letter).  Otherwise this looks pretty good to
> >>>> me.  Please add it to the next commitfest.
> >>> 
> >>> Thanks for confirm, sir.
> >>> I will add it to the next CF soon.
> >> 
> >> Sorry for lately response. I attach the update patch.
> > 
> > Uh, there is no patch attached.
> > 
> 
> Sorry sir, reattach the patch.
> I also added it to the next CF and set reviewers to Thomas Munro. Could you confirm for me.

There seems to be a problem.  I can't see a patch dated 2017-06-07 on
the commitfest page:
https://commitfest.postgresql.org/14/1161/

I added the thread but there was no change.  (I think the thread was
already present.)  It appears it is not seeing this patch as the latest
patch.

Does anyone know why this is happening?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Man Trieu
Date:
2017-06-07 0:31 GMT+09:00 Bruce Momjian <bruce@momjian.us>:
On Wed, Jun  7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
> > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian <bruce@momjian.us> wrote:
> >>>> Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
> >>>>> 1"?  Your test might catch something that isn't based on a 'letter'
> >>>> (according to is_plain_letter).  Otherwise this looks pretty good to
> >>>> me.  Please add it to the next commitfest.
> >>>
> >>> Thanks for confirm, sir.
> >>> I will add it to the next CF soon.
> >>
> >> Sorry for lately response. I attach the update patch.
> >
> > Uh, there is no patch attached.
> >
>
> Sorry sir, reattach the patch.
> I also added it to the next CF and set reviewers to Thomas Munro. Could you confirm for me.

There seems to be a problem.  I can't see a patch dated 2017-06-07 on
the commitfest page:

        https://commitfest.postgresql.org/14/1161/

I added the thread but there was no change.  (I think the thread was
already present.)  It appears it is not seeing this patch as the latest
patch.

Does anyone know why this is happening?


May be due to my Mac's mailer? Sorry but I try one more time to attach the patch by webmail.

---
Thanks and best regards,
Dang Minh Huong

Attachment

Re: [HACKERS] Extra Vietnamese unaccent rules

From
Bruce Momjian
Date:
On Wed, Jun  7, 2017 at 01:06:22AM +0900, Man Trieu wrote:
> 2017-06-07 0:31 GMT+09:00 Bruce Momjian <bruce@momjian.us>:
>     I added the thread but there was no change.  (I think the thread was
>     already present.)  It appears it is not seeing this patch as the latest
>     patch.
> 
>     Does anyone know why this is happening?
> 
> 
> 
> May be due to my Mac's mailer? Sorry but I try one more time to attach the
> patch by webmail.

It is getting weirder.  It has picked up my email report of a commitfest
problem as the latest patch (even though there was no patch), and your
second posting is not listed:
https://commitfest.postgresql.org/14/1161/

I think we need someone who knows the rules of how the commitfest finds
patches.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> There seems to be a problem.  I can't see a patch dated 2017-06-07 on
> the commitfest page:
>     https://commitfest.postgresql.org/14/1161/

It looks to me like the patch is buried inside a multipart/alternative
MIME section.  That's evidently causing our mail archives to miss its
presence.  The latest message does show as having an attachment in the
archives, but I think there's some delay before the CF app will notice.
        regards, tom lane



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Bruce Momjian
Date:
On Tue, Jun  6, 2017 at 12:15:13PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > There seems to be a problem.  I can't see a patch dated 2017-06-07 on
> > the commitfest page:
> >     https://commitfest.postgresql.org/14/1161/
> 
> It looks to me like the patch is buried inside a multipart/alternative
> MIME section.  That's evidently causing our mail archives to miss its
> presence.  The latest message does show as having an attachment in the
> archives, but I think there's some delay before the CF app will notice.

OK, I see had picked up my email as the lastest, not the latest patch. 
I see now the second patch email appears properly on the webpage, so we
are good:
https://commitfest.postgresql.org/14/1161/

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Michael Paquier
Date:
On Wed, Jun 7, 2017 at 1:06 AM, Man Trieu <man.trieu@gmail.com> wrote:
> 2017-06-07 0:31 GMT+09:00 Bruce Momjian <bruce@momjian.us>:
>>
>> On Wed, Jun  7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
>> > > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian <bruce@momjian.us> wrote:
>> > >>>> Shouldn't you use "or is_letter_with_marks()", instead of "or
>> > >>>> len(...)
>> > >>>>> 1"?  Your test might catch something that isn't based on a
>> > >>>>> 'letter'
>> > >>>> (according to is_plain_letter).  Otherwise this looks pretty good
>> > >>>> to
>> > >>>> me.  Please add it to the next commitfest.
>> > >>>
>> > >>> Thanks for confirm, sir.
>> > >>> I will add it to the next CF soon.
>> > >>
>> > >> Sorry for lately response. I attach the update patch.
>> > >
>> > > Uh, there is no patch attached.
>> > >
>> >
>> > Sorry sir, reattach the patch.
>> > I also added it to the next CF and set reviewers to Thomas Munro. Could
>> > you confirm for me.
>>
>> There seems to be a problem.  I can't see a patch dated 2017-06-07 on
>> the commitfest page:
>>
>>         https://commitfest.postgresql.org/14/1161/
>>
>> I added the thread but there was no change.  (I think the thread was
>> already present.)  It appears it is not seeing this patch as the latest
>> patch.
>>
>> Does anyone know why this is happening?
>
> May be due to my Mac's mailer? Sorry but I try one more time to attach the
> patch by webmail.

I have finally been able to look at this patch.
(Surprised to see that generate_unaccent_rules.py is inconsistent on
MacOS, runs fine on Linux).

 def get_plain_letter(codepoint, table):
     """Return the base codepoint without marks."""
     if is_letter_with_marks(codepoint, table):
-        return table[codepoint.combining_ids[0]]
+        if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+            # Recursive to find the plain letter
+            return get_plain_letter(table[codepoint.combining_ids[0]],table)
+        elif is_plain_letter(table[codepoint.combining_ids[0]]):
+            return table[codepoint.combining_ids[0]]
+        else:
+            return None
     elif is_plain_letter(codepoint):
         return codepoint
     else:
-        raise "mu"
+        return None
The code paths returning None should not be reached, so I would
suggest adding an assertion instead. Callers of get_plain_letter would
blow up on None, still that would make future debugging harder.

 def is_letter_with_marks(codepoint, table):
-    """Returns true for plain letters combined with one or more marks."""
+    """Returns true for letters combined with one or more marks."""
     # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
     return len(codepoint.combining_ids) > 1 and \
-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            is_letter_with_marks(table[codepoint.combining_ids[0]],table))
and \
            all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
This was already hard to follow, and this patch makes its harder. I
think that the thing should be refactored with multiple conditions.

             if is_letter_with_marks(codepoint, table):
-                charactersSet.add((codepoint.id,
+                if get_plain_letter(codepoint, table) <> None:
+                    charactersSet.add((codepoint.id,
This change is not necessary as a letter with marks is not a plain
character anyway.

Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Extra Vietnamese unaccent rules

From
Dang Minh Huong
Date:
On 2017/07/05 15:28, Michael Paquier wrote:
> I have finally been able to look at this patch.

Thanks for reviewing and the new version of the patch.
> (Surprised to see that generate_unaccent_rules.py is inconsistent on
> MacOS, runs fine on Linux).
>
>   def get_plain_letter(codepoint, table):
>       """Return the base codepoint without marks."""
>       if is_letter_with_marks(codepoint, table):
> -        return table[codepoint.combining_ids[0]]
> +        if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
> +            # Recursive to find the plain letter
> +            return get_plain_letter(table[codepoint.combining_ids[0]],table)
> +        elif is_plain_letter(table[codepoint.combining_ids[0]]):
> +            return table[codepoint.combining_ids[0]]
> +        else:
> +            return None
>       elif is_plain_letter(codepoint):
>           return codepoint
>       else:
> -        raise "mu"
> +        return None
> The code paths returning None should not be reached, so I would
> suggest adding an assertion instead. Callers of get_plain_letter would
> blow up on None, still that would make future debugging harder.
>
>   def is_letter_with_marks(codepoint, table):
> -    """Returns true for plain letters combined with one or more marks."""
> +    """Returns true for letters combined with one or more marks."""
>       # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
>       return len(codepoint.combining_ids) > 1 and \
> -           is_plain_letter(table[codepoint.combining_ids[0]]) and \
> +           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
> +            is_letter_with_marks(table[codepoint.combining_ids[0]],table))
> and \
>              all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
> This was already hard to follow, and this patch makes its harder. I
> think that the thing should be refactored with multiple conditions.
>
>               if is_letter_with_marks(codepoint, table):
> -                charactersSet.add((codepoint.id,
> +                if get_plain_letter(codepoint, table) <> None:
> +                    charactersSet.add((codepoint.id,
> This change is not necessary as a letter with marks is not a plain
> character anyway.
>
> Testing with characters having two accents, the results are produced
> as wanted. I am attaching an updated patch with all those
> simplifications. Thoughts?

Thanks, so pretty. The patch is fine to me.

---
Thanks and best regards,
Dang Minh Huong

---
このEメールはアバスト アンチウイルスによりウイルススキャンされています。
https://www.avast.com/antivirus




Re: [HACKERS] Extra Vietnamese unaccent rules

From
Tom Lane
Date:
Dang Minh Huong <kakalot49@gmail.com> writes:
> On 2017/07/05 15:28, Michael Paquier wrote:
>> (Surprised to see that generate_unaccent_rules.py is inconsistent on
>> MacOS, runs fine on Linux).

FWIW, I got identical results from running the script on current macOS
(Sierra) and Linux (RHEL6).

>> Testing with characters having two accents, the results are produced
>> as wanted. I am attaching an updated patch with all those
>> simplifications. Thoughts?

> Thanks, so pretty. The patch is fine to me.

Pushed into v11.  I'm not really qualified to review the Python coding
style, but I did fix a typo in a comment.

BTW, while this isn't a reason to delay this patch, I wonder whether
the regression test for unaccent is really adequate.  According to
https://coverage.postgresql.org/contrib/unaccent/unaccent.c.gcov.html
it isn't doing anything to check multicharacter source strings, and
what's considerably more disturbing, it isn't exercising the PG_CATCH
code that's meant to deal with characters outside the current database's
encoding.
        regards, tom lane



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Michael Paquier
Date:
On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pushed into v11.

Thanks.

> I'm not really qualified to review the Python coding
> style, but I did fix a typo in a comment.

No pythonist here, but a large confusing "if" condition without any
comments is better if split up and explained with comments if that can
help in clarifying what the code is doing in any language, so thanks
for keeping the code intact.

> BTW, while this isn't a reason to delay this patch, I wonder whether
> the regression test for unaccent is really adequate.  According to
> https://coverage.postgresql.org/contrib/unaccent/unaccent.c.gcov.html
> it isn't doing anything to check multicharacter source strings, and
> what's considerably more disturbing, it isn't exercising the PG_CATCH
> code that's meant to deal with characters outside the current database's
> encoding.

Yeah, that could be improved a bit.
-- 
Michael



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not really qualified to review the Python coding
>> style, but I did fix a typo in a comment.

> No pythonist here, but a large confusing "if" condition without any
> comments is better if split up and explained with comments if that can
> help in clarifying what the code is doing in any language, so thanks
> for keeping the code intact.

Certainly agreed on splitting up the logic into multiple statements.
I just meant that I don't know enough Python to know if there are
better ways to do these tests.  (It probably doesn't matter, since
performance of this script is not an issue, and it's not likely to
undergo a lot of further development either.)
        regards, tom lane



Re: [HACKERS] Extra Vietnamese unaccent rules

From
Dang Minh Huong
Date:
Thanks!


On 2017/08/17 11:56, Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm not really qualified to review the Python coding
>>> style, but I did fix a typo in a comment.
>> No pythonist here, but a large confusing "if" condition without any
>> comments is better if split up and explained with comments if that can
>> help in clarifying what the code is doing in any language, so thanks
>> for keeping the code intact.
> Certainly agreed on splitting up the logic into multiple statements.
> I just meant that I don't know enough Python to know if there are
> better ways to do these tests.  (It probably doesn't matter, since
> performance of this script is not an issue, and it's not likely to
> undergo a lot of further development either.)
>
>             regards, tom lane