Thread: BUG #15548: Unaccent does not remove combining diacritical characters

BUG #15548: Unaccent does not remove combining diacritical characters

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15548
Logged by:          Hugh Ranalli
Email address:      hugh@whtc.ca
PostgreSQL version: 11.1
Operating system:   Ubuntu 18.04
Description:

Apparently Unicode has two ways of accenting a character: as a separate code
point, which represents the base character and the accent, or as a
"combining diacritical mark"
(https://en.wikipedia.org/wiki/Combining_Diacritical_Marks), in which case
the mark applies itself to the preceding character. For example, A followed
by U+0300 displays À. However, unaccent is not removing these accents. 

SELECT unaccent(U&'A\0300'); should result in 'A', but instead results in
'À.' I'm running PostgreSQL 11.1, installed from the PostgreSQL
repositories. I've read bug report #13440, and have tried with both the
installed unaccent.rules as well as a new set generated by the
generate_unaccent_rules.py distributed with the 11.1 source code:
  wget http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt
  wget
https://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml
  python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
--latin-ascii-file  Latin-ASCII.xml > unaccent.rules

I see there have been some updates to generate_unaccent_rules.py to handle
Greek and Vietnamese characters, but neither of them seem to address this
issue. I'm happy to contribute a patch to handle these cases, but of course
wanted to make sure this is desired behaviour, or if I am misunderstanding
something somewhere.

Thank you,
Hugh Ranalli


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
"Daniel Verite"
Date:
    PG Bug reporting form wrote:

> Apparently Unicode has two ways of accenting a character: as a separate code
> point, which represents the base character and the accent, or as a
> "combining diacritical mark"
> (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence

In general, PostgreSQL leaves it to applications to normalize
Unicode strings so that they are all in the same canonical form,
either composed or decomposed.

> the mark applies itself to the preceding character. For example, A
> followed by U+0300 displays À. However, unaccent is not removing
> these accents.

Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
    unaccent(unicode_NFC(string))

Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
"Daniel Verite" <daniel@manitou-mail.org> writes:
>     PG Bug reporting form wrote:
>> ... For example, A
>> followed by U+0300 displays À. However, unaccent is not removing
>> these accents.

> Short of having the input normalized by the application, ISTM that the
> best solution would be to provide functions to do it in Postgres, so
> you'd just write for example:
>     unaccent(unicode_NFC(string))

That might be worthwhile, but it seems independent of this issue.

> Otherwise unaccent.rules can be customized. You may add replacements
> for letter+diacritical sequences that are missing for the languages
> you have to deal with. But doing it in general for all diacriticals
> multiplied by all base characters seems unrealistic.

Hm, I thought the OP's proposal was just to make unaccent drop 
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
"Daniel Verite"
Date:
    Tom Lane wrote:

> Hm, I thought the OP's proposal was just to make unaccent drop
> combining diacriticals independently of context, which'd avoid the
> combinatorial-growth problem.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

 "Alternatively, if only one character is given on a line, instances
 of that character are deleted; this is useful in languages where
 accents are represented by separate characters"

Incidentally we may want to improve this bit of doc to mention
explicitly the Unicode decomposed forms as a use case for
removing characters. In fact I wonder if that's not what it's
already trying to express, but confusing "languages" with "forms".


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:


On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel@manitou-mail.org wrote:
        Tom Lane wrote:

> Hm, I thought the OP's proposal was just to make unaccent drop
> combining diacriticals independently of context, which'd avoid the
> combinatorial-growth problem.

That's what I was thinking. Given that the accent is separate from the characters, simply dropping it should result in the correct unaccented character.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

 "Alternatively, if only one character is given on a line, instances
 of that character are deleted; this is useful in languages where
 accents are represented by separate characters"

Yes, I had read that in the docs, and that's the approach I planned to take. I'll go ahead and develop a patch, then.

Best wishes,
Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
I've attached a patch removes combining diacriticals. As with Latin and Greek letters, it uses ranges to restrict its activity. 

I have not submitted a patch for unaccent.rules, as it seems that a rules file generated from generate_unaccent_rules.py will actually remove a large number of rules (even before my changes), such as replacing the copyright symbol © with (C), as well as other accented characters. It's probably worth asking if the shipped unaccent.rules should correspond to what the shipped generation utility produces, or not. I was surprised to see that it didn't.

Please let me know if you see anything I need to change.

Best wishes,
Hugh

--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh@whtc.ca
c: +01-416-994-7957
w: www.whtc.ca


On Thu, 13 Dec 2018 at 13:50, Hugh Ranalli <hugh@whtc.ca> wrote:


On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel@manitou-mail.org wrote:
        Tom Lane wrote:

> Hm, I thought the OP's proposal was just to make unaccent drop
> combining diacriticals independently of context, which'd avoid the
> combinatorial-growth problem.

That's what I was thinking. Given that the accent is separate from the characters, simply dropping it should result in the correct unaccented character.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

 "Alternatively, if only one character is given on a line, instances
 of that character are deleted; this is useful in languages where
 accents are represented by separate characters"

Yes, I had read that in the docs, and that's the approach I planned to take. I'll go ahead and develop a patch, then.

Best wishes,
Hugh
Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Hugh Ranalli <hugh@whtc.ca> writes:
> I've attached a patch removes combining diacriticals. As with Latin and
> Greek letters, it uses ranges to restrict its activity.

Cool.  Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/

> I have not submitted a patch for unaccent.rules, as it seems that a rules
> file generated from generate_unaccent_rules.py will actually remove a large
> number of rules (even before my changes), such as replacing the copyright
> symbol © with (C), as well as other accented characters. It's probably
> worth asking if the shipped unaccent.rules should correspond to what the
> shipped generation utility produces, or not. I was surprised to see that it
> didn't.

Me too -- seems like that bears looking into.  Perhaps the script's
results are platform dependent -- what were you testing on?

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:


On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hugh Ranalli <hugh@whtc.ca> writes:
Cool.  Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/
Done.
 
Me too -- seems like that bears looking into.  Perhaps the script's
results are platform dependent -- what were you testing on?
I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think that's it. The program's decisions come from the two data files, the Unicode data set and the Latin-ASCII transliteration file. The script uses categories (ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category) to identify letters (and now combining marks) and if they are in range, performs a substitution. It then uses the transliteration file to find rules for particular character substitutions (for example, that file seems to handle the copyright symbol substitution). I don't see anything platform dependent in there. 

In looking more closely, I also see that script isn't generating ligatures, even though it should, because although the program can generate them, none of the ligatures are in the ranges defined in PLAIN_LETTER_RANGES, and so they are skipped.

This could probably be handled by adding the ligature ranges to the defined ranges. Symbol types could be added to the types it looks at, and perhaps the codepoint ranges collapsed into one list, as the IDs are unique across all categories. I don't think we'd want to just rely on ranges, as that could include control characters, punctuation, etc. 

There are a number of other characters that appear in unaccent.rules that aren't generated by the script. I've attached a diff of the output of generate_unaccent_rules (using the version before my changes, to simplify matters) and unaccent.rules. Unfortunately, I don't know how to interpret most of these characters.

I suppose it's valid to ask if changing © to (C) is even something an "unaccent" function should do. Given that it's in the existing rules file, should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Hugh Ranalli <hugh@whtc.ca> writes:
> On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Me too -- seems like that bears looking into.  Perhaps the script's
>> results are platform dependent -- what were you testing on?

> I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
> that's it. The program's decisions come from the two data files, the
> Unicode data set and the Latin-ASCII transliteration file. The script uses
> categories (
> ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
> to identify letters (and now combining marks) and if they are in range,
> performs a substitution. It then uses the transliteration file to find
> rules for particular character substitutions (for example, that file seems
> to handle the copyright symbol substitution). I don't see anything platform
> dependent in there.

Hm.  Something funny is going on here.  When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
I wrote:
> ... I get something that's bit-for-bit the same as what's in unaccent.rules.
> So there's clearly a platform difference between here and there.
> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> it on anything newer.

A few minutes later on a Fedora 28 box: python 2.7.15 also gives me the
expected results, while python 3.6.6 fails with "SyntaxError: invalid
syntax".

So updating that script to also work with python3 might be a worthwhile
TODO item.  But I'm at a loss to explain why you get different results.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:

On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm.  Something funny is going on here.  When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.
Well, that's embarrassing. When I looked I couldn't see anything that looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17. We use other versions of 2.7 on our production platforms. I'll take another look, and check the URLs I am using.

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:

On Sat, 15 Dec 2018 at 14:05, Hugh Ranalli <hugh@whtc.ca> wrote:
On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm.  Something funny is going on here.  When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.
Well, that's embarrassing. When I looked I couldn't see anything that looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17. We use other versions of 2.7 on our production platforms. I'll take another look, and check the URLs I am using.

The problem is that I downloaded the latest version of the Latin-ASCII transliteration file (r34 rather than the r28 specified in the URL). Over 3 years ago (in r29, of course) they changed the file format (https://unicode.org/cldr/trac/ticket/5873) so that parse_cldr_latin_ascii_transliterator loads an empty rules set. I'd be happy to either a) support both formats, or b), support just the newest and update the URL. Option b) is cleaner, and I can't imagine why anyone would want to use an older rule set (then again, struggling with Unicode always makes my head hurt; I am not an expert on it). Thoughts?

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Hugh Ranalli <hugh@whtc.ca> writes:
> The problem is that I downloaded the latest version of the Latin-ASCII
> transliteration file (r34 rather than the r28 specified in the URL). Over 3
> years ago (in r29, of course) they changed the file format (
> https://unicode.org/cldr/trac/ticket/5873) so that
> parse_cldr_latin_ascii_transliterator loads an empty rules set.

Ah-hah.

> I'd be
> happy to either a) support both formats, or b), support just the newest and
> update the URL. Option b) is cleaner, and I can't imagine why anyone would
> want to use an older rule set (then again, struggling with Unicode always
> makes my head hurt; I am not an expert on it). Thoughts?

(b) seems sufficient to me, but perhaps someone else has a different
opinion.

Whichever we do, I think it should be a separate patch from the feature
addition for combining diacriticals, just to keep the commit history
clear.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Thomas Munro
Date:
On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hugh Ranalli <hugh@whtc.ca> writes:
> > The problem is that I downloaded the latest version of the Latin-ASCII
> > transliteration file (r34 rather than the r28 specified in the URL). Over 3
> > years ago (in r29, of course) they changed the file format (
> > https://unicode.org/cldr/trac/ticket/5873) so that
> > parse_cldr_latin_ascii_transliterator loads an empty rules set.
>
> Ah-hah.
>
> > I'd be
> > happy to either a) support both formats, or b), support just the newest and
> > update the URL. Option b) is cleaner, and I can't imagine why anyone would
> > want to use an older rule set (then again, struggling with Unicode always
> > makes my head hurt; I am not an expert on it). Thoughts?
>
> (b) seems sufficient to me, but perhaps someone else has a different
> opinion.
>
> Whichever we do, I think it should be a separate patch from the feature
> addition for combining diacriticals, just to keep the commit history
> clear.

+1 for updating to the latest file from time to time.  After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT.  Those degree signs could possibly be
extracted from Unicode.txt (or we could just forget about them), and
then we could drop special_cases().

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:

On Sat, 15 Dec 2018 at 21:26, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
+1 for updating to the latest file from time to time.  After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT.  Those degree signs could possibly be
extracted from Unicode.txt (or we could just forget about them), and
then we could drop special_cases().
Well, when I modified the code to handle the new version of the transliteration file, I discovered that was sufficient to handle the old version as well. That's not the way things usually go, but I'll take it. ;-)

I've attached two patches, one to update generate_unaccent_rules.py, and another that updates unaccent.rules from the v34 transliteration file. I'll be happy to add these to the CF. Does anyone need to review them and give me approval before I do so?

Best wishes,
Hugh 

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Hugh Ranalli <hugh@whtc.ca> writes:
> I've attached two patches, one to update generate_unaccent_rules.py, and
> another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

> I'll
> be happy to add these to the CF. Does anyone need to review them and give
> me approval before I do so?

Nope.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hugh Ranalli <hugh@whtc.ca> writes:
> I've attached two patches, one to update generate_unaccent_rules.py, and
> another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.

Hugh
Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Thomas Munro
Date:
On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hugh Ranalli <hugh@whtc.ca> writes:
>> > I've attached two patches, one to update generate_unaccent_rules.py, and
>> > another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
>
> Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached
andwill add to CF. Let me know if you see anything amiss. 

+ʹ    '
+ʺ    "
+ʻ    '
+ʼ    '
+ʽ    '
+˂    <
+˃    >
+˄    ^
+ˆ    ^
+ˈ    '
+ˋ    `
+ː    :
+˖    +
+˗    -
+˜    ~

I don't think this is quite right.  Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all.  Here is my attempt to come up with a test case using
combining characters:

  select unaccent('un café crème s''il vous plaît');

It's not stripping the accents.  I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:

$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)

[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Thomas Munro
Date:
On Tue, Dec 18, 2018 at 3:05 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
> +ʹ    '
> +ʺ    "
> +ʻ    '
> +ʼ    '
> +ʽ    '
> +˂    <
> +˃    >
> +˄    ^
> +ˆ    ^
> +ˈ    '
> +ˋ    `
> +ː    :
> +˖    +
> +˗    -
> +˜    ~
>
> I don't think this is quite right.  Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all.  Here is my attempt to come up with a test case using
> combining characters:
>
>   select unaccent('un café crème s''il vous plaît');

Oh, I see now that that was just the v34 ASCII transliteration update,
and perhaps the diacritic stripping will be posted separately.

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Tue, Dec 18, 2018 at 03:05:00PM +1100, Thomas Munro wrote:
> I don't think this is quite right.  Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all.  Here is my attempt to come up with a test case using
> combining characters:
>
>   select unaccent('un café crème s''il vous plaît');
>
> It's not stripping the accents.  I've attached that in a file for
> reference so you can run it with psql -f x.sql, and you can see that
> it's using combining code points (code points 0301, 0300, 0302 which
> come out as cc81, cc80, cc82 in UTF-8) like so:

Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time?  That would be nice to check easily the extent of the
patches proposed on this thread.
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
> the same time?  That would be nice to check easily the extent of the
> patches proposed on this thread.

I wonder why unaccent.sql is set up to run its tests in KOI8 client
encoding rather than UTF8.  It doesn't seem like it's the business
of this test script to be verifying transcoding from KOI8 to UTF8
(and if it were meant to do that, it's a pretty incomplete test...).
But having it set up like that means that we can't directly add
such tests to unaccent.sql, because there are no combining diacritics
in the KOI8 character set.  We have two unattractive options:

* Change client encodings partway through unaccent.sql.  I think this
would be disastrous for editability of that file; no common tools
will understand the encoding change.

* Put the new test cases into a separate file with a different client
encoding.  This is workable, I suppose, but it seems pretty silly
when the tests are only a few queries apiece.

Another problem I've got with the current setup is that it seems
unlikely that many people's editors default to an assumption of
KOI8 encoding.  Mine guesses that these files are UTF8, and so
the test cases look perfectly insane.  They do make sense if
I transcode the files to UTF8, but I wonder why we're not shipping
them as UTF8 in the first place.

tl;dr: I think we should convert unaccent.sql and unaccent.out
to UTF8 encoding.  Then, adding more test cases for this patch
will be easy.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
> tl;dr: I think we should convert unaccent.sql and unaccent.out
> to UTF8 encoding.  Then, adding more test cases for this patch
> will be easy.

Do you think that we could also remove the non-ASCII characters from the
tests?  It would be easy enough to use E'\xNN' (utf8 hex) or such in
input, and show the output with bytea.  That's harder to read, still we
discussed about not using UTF-8 in the python script to allow folks with
simple terminals to touch the code the last time this was touched
(5e8d670) and the characters used could be documented as comments in the
tests.
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
>> tl;dr: I think we should convert unaccent.sql and unaccent.out
>> to UTF8 encoding.  Then, adding more test cases for this patch
>> will be easy.

> Do you think that we could also remove the non-ASCII characters from the
> tests?  It would be easy enough to use E'\xNN' (utf8 hex) or such in
> input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder
to verify by eyeball.  With the current setup --- other than the
uncommon-outside-Russia encoding choice --- you don't really need
to read or speak Russian to see that this:

SELECT unaccent('ёлка');
 unaccent 
----------
 елка
(1 row)

probably represents unaccent doing what it ought to.  If everything
is in hex then it's a lot harder.

Ten years ago I might've agreed with your point, but today it's
hard to believe that anyone who takes any interest at all in
unaccent's functionality would not have a UTF8-capable terminal.

> That's harder to read, still we
> discussed about not using UTF-8 in the python script to allow folks with
> simple terminals to touch the code the last time this was touched
> (5e8d670) and the characters used could be documented as comments in the
> tests.

Maybe I'm misremembering, but I thought that discussion was about the
code files.  I am still mistrustful of non-ASCII in our code files.
But for data and test files, we've been accepting UTF8 ever since the
text-search-in-core stuff landed.  Heck, unaccent.rules itself is UTF8.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Tue, Dec 18, 2018 at 01:23:57AM -0500, Tom Lane wrote:
> Maybe I'm misremembering, but I thought that discussion was about the
> code files.  I am still mistrustful of non-ASCII in our code files.

Yes, that was in generate_unaccent_rules.py:
https://www.postgresql.org/message-id/25859.1535076450@sss.pgh.pa.us

> But for data and test files, we've been accepting UTF8 ever since the
> text-search-in-core stuff landed.  Heck, unaccent.rules itself is UTF8.

Okay, fine by me.
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Mon, 17 Dec 2018 at 23:05, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
+ʹ    '
+ʺ    "
+ʻ    '
+ʼ    '
+ʽ    '
+˂    <
+˃    >
+˄    ^
+ˆ    ^
+ˈ    '
+ˋ    `
+ː    :
+˖    +
+˗    -
+˜    ~
These aren't the combining codepoints. They're new substitutions defined in r34 of the Latin-ASCII transliteration file. I had wondered about those, too, and did some testing.

I don't think this is quite right. 

However, you are correct that something isn't write. In testing why I was getting a different output, I had reverted to the generate_unaccent_rules.py BEFORE my changes. And then I applied my update for the transliteration file format to the reverted version. The patch for generate_unaccent_rules should still be good, but the generated rules file didn't include the combining diacriticals. In generating that, I want to double check some of the additions before re-submitting. 

 On Mon, 17 Dec 2018 at 23:57, Michael Paquier <michael@paquier.xyz> wrote:
Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time?  That would be nice to check easily the extent of the
patches proposed on this thread.

That makes sense. I'm happy to do that. Let me look at that file and see how extensive the other changes (encoding and removal of special characters would be).

Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
Okay, I've tried to separate everything cleanly. The patches are numbered in the order in which they should be applied. Each patch contains all the updates appropriate to that version (i.e., if the change would modify unaccent.rules, those changes are also in the patch):

01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible. The approach I have taken is "native" Python 3 compatibility with adjustments for Python 2. There's a marked block at the beginning of the file that can be removed whenever Python 2 support is dropped. I haven't followed the recommended practice of importing the "past" or "future" modules, as the changes are minimal, and these are just additional dependencies that need to be installed separately, which didn't seem to make sense for a utility script. This patch also updates sql/unaccent.sql to UTF-8 format. 

02 - Updates generate_unaccent_rules.py to work with all versions (I tested r28 and r34) of the Latin-ASCII transliteration file. It also updates unaccent.rules to have the output of the r34 transliteration file. This patch should work without the 01 patch.

03 - Updates generate_unaccent_rules.py to remove combining diacritical marks. It also updates unaccent.rules with the revised output, and adds tests to sql/unaccent.sql. It will not work or apply if the 01 patch is not applied. It should without the 02 patch.

When you look at unaccent.rules generated by the 03 version, there may appear to be blank lines. I've checked and they're not blank. They are characters which are only visible with other characters in front of them, at least in my editor.

I'll go update the CommitFest now. I hope I've covered everything; please let me know if there's anything I've missed.

Best wishes,
Hugh

Attachment

Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Thu, Dec 20, 2018 at 05:39:36PM -0500, Hugh Ranalli wrote:
> I'll go update the CommitFest now. I hope I've covered everything; please
> let me know if there's anything I've missed.

-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+#     (Ideally you should use the latest release).

I have begun playing with this patch set.  And for the note this URL
is incorrect.  Here is a more correct one:
https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml

And for the information it is possible to get the latest released
versions by browsing the code (see the tags release-*):
https://unicode.org/cldr/trac/browser/tags
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Thu, 27 Dec 2018 at 01:50, Michael Paquier <michael@paquier.xyz> wrote:
-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+#     (Ideally you should use the latest release).

I have begun playing with this patch set.  And for the note this URL
is incorrect.  Here is a more correct one:
https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml

And for the information it is possible to get the latest released
versions by browsing the code (see the tags release-*):
https://unicode.org/cldr/trac/browser/tags

Thank you. As I've said, I only pretend to be someone who knows something about Unicode. ;-) I'll update once we've determined there is no further feedback, so I'm not releasing too many changes, if that's okay. 

Hugh

Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Peter Eisentraut
Date:
On 20/12/2018 23:39, Hugh Ranalli wrote:
> 01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.

My opinion is that we should just convert the whole thing to Python 3
and be done.  This script is only run rarely, on a developer's machine,
so it's not unreasonable to expect Python 3 to be available.

The only other Python script I can find in the source is
src/test/locale/sort-test.py, which we should similarly convert.

> This patch also updates sql/unaccent.sql to UTF-8 format. 

I have committed that in the meantime.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 20/12/2018 23:39, Hugh Ranalli wrote:
> 01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.

My opinion is that we should just convert the whole thing to Python 3
and be done.  This script is only run rarely, on a developer's machine,
so it's not unreasonable to expect Python 3 to be available.

Well, this is definitely an edge case, but I am actually running the patched script from a complex application installer running a custom-compiled version of Python 2.7. The installer runs under the same Python instance as the application. I certainly could invoke Python 3 to run this script, it's just a little more work, so I'm happy to go with the team's decision. Just let me know.

Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Hugh Ranalli <hugh@whtc.ca> writes:
> On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut <
> peter.eisentraut@2ndquadrant.com> wrote:
>> My opinion is that we should just convert the whole thing to Python 3
>> and be done.  This script is only run rarely, on a developer's machine,
>> so it's not unreasonable to expect Python 3 to be available.

> Well, this is definitely an edge case, but I am actually running the
> patched script from a complex application installer running a
> custom-compiled version of Python 2.7. The installer runs under the same
> Python instance as the application. I certainly could invoke Python 3 to
> run this script, it's just a little more work, so I'm happy to go with the
> team's decision. Just let me know.

Seeing that supporting python 2 only adds a dozen lines of code,
I vote for retaining it for now.  It'd be appropriate to drop that when
python 3 is the overwhelmingly more-installed version, but AFAICT that
isn't the case yet.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Wed, Jan 02, 2019 at 02:32:32PM -0500, Tom Lane wrote:
> Seeing that supporting python 2 only adds a dozen lines of code,
> I vote for retaining it for now.  It'd be appropriate to drop that when
> python 3 is the overwhelmingly more-installed version, but AFAICT that
> isn't the case yet.

As a side note, if I recall correctly Python 2.7 will be EOL'd in
2020 by community, though I suspect that a couple of vendors will
still maintain compatibility for a couple of years in what they ship.
CentOS and RHEL enter in this category perhaps.  Like Peter, I would
vote for just maintaining support for Python 3 in this script, as any
modern development machines have it anyway, and not a lot of commits
involve it (I am counting 4 since 2015).
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Wed, 2 Jan 2019 at 20:15, Michael Paquier <michael@paquier.xyz> wrote:
As a side note, if I recall correctly Python 2.7 will be EOL'd in
2020 by community, though I suspect that a couple of vendors will
still maintain compatibility for a couple of years in what they ship.
CentOS and RHEL enter in this category perhaps.  Like Peter, I would
vote for just maintaining support for Python 3 in this script, as any
modern development machines have it anyway, and not a lot of commits
involve it (I am counting 4 since 2015).

I realise this is an incredibly minor component of the PostgreSQL infrastructure, but as I don't want to hold up reviewers, may I ask:
  • It seems we have two votes for Python 3 only, and one for Python 2/3. I lean toward Python 2/3 myself because: a) many distributions still ship with Python 2 as the default and b) it's a single code block that can easily be removed. If the decision is for Python 3, I'd like at least to add a check that catches this and prints a message, rather than leaving someone with a cryptic runtime error that makes them think the script is broken;
  • Michael Paquier, do you have any other comments? If not, I'll adjust the documentation to use the URLs you have indicated. If you are downloading via curl or wget, the URL I used is the proper one. It gives you the XML file, whereas the other saves the HTML interface, leading to errors if you try to run it. I'll also add this to the documentation.
Once I have clarification on these, I'll update the patches.

Thanks,
Hugh


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Alvaro Herrera
Date:
On 2019-Jan-03, Hugh Ranalli wrote:

> I realise this is an incredibly minor component of the PostgreSQL
> infrastructure, but as I don't want to hold up reviewers, may I ask:
> 
>    - It seems we have two votes for Python 3 only, and one for Python 2/3.
>    I lean toward Python 2/3 myself because: a) many distributions still ship
>    with Python 2 as the default and b) it's a single code block that can
>    easily be removed. If the decision is for Python 3, I'd like at least to
>    add a check that catches this and prints a message, rather than leaving
>    someone with a cryptic runtime error that makes them think the script is
>    broken;

I kinda agree with Peter that this is a fringe, rarely run program where
the python3 requirement is unlikely to be onerous, but since the 2/3
compatibility is so little code, I would opt for keeping it for the time
being.  We can remove it in a couple of years.

>    - Michael Paquier, do you have any other comments? If not, I'll adjust
>    the documentation to use the URLs you have indicated. If you are
>    downloading via curl or wget, the URL I used is the proper one. It gives
>    you the XML file, whereas the other saves the HTML interface, leading to
>    errors if you try to run it. I'll also add this to the documentation.

I think the point is that if the committee updates with a further
version of the file, how do you find the new version?  We need a URL
that's one step removed from the final file, so that we can see if we
need to update it.  Maybe we can provide both URLs for convenience.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I think the point is that if the committee updates with a further
> version of the file, how do you find the new version?  We need a URL
> that's one step removed from the final file, so that we can see if we
> need to update it.  Maybe we can provide both URLs for convenience.

+1.  Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.

            regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I think the point is that if the committee updates with a further
> version of the file, how do you find the new version?  We need a URL
> that's one step removed from the final file, so that we can see if we
> need to update it.  Maybe we can provide both URLs for convenience.

+1.  Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.
 
Yes, this is what I was thinking. I was integrating this into my installer, used the "new" URL provided to download the file, and spent several minutes wondering why the script was failing (and what I had broken in it), before realising what had happened.
 

Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Peter Eisentraut
Date:
On 03/01/2019 19:19, Alvaro Herrera wrote:
> I kinda agree with Peter that this is a fringe, rarely run program where
> the python3 requirement is unlikely to be onerous, but since the 2/3
> compatibility is so little code, I would opt for keeping it for the time
> being.  We can remove it in a couple of years.

OK, committed with the compat layer.  I also fixed up sort-test.py for
Python 3, so now everything in the source should support Python 3.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Thu, Jan 03, 2019 at 04:48:33PM -0500, Hugh Ranalli wrote:
> On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> I think the point is that if the committee updates with a further
>>> version of the file, how do you find the new version?  We need a URL
>>> that's one step removed from the final file, so that we can see if we
>>> need to update it.  Maybe we can provide both URLs for convenience.
>>
>> +1.  Could be phrased along the lines of "documents are at URL1,
>> currently synced with URL2" so that it's clear that URL2 should
>> be updated when we re-sync with a newer release.
>>
>
> Yes, this is what I was thinking. I was integrating this into my installer,
> used the "new" URL provided to download the file, and spent several minutes
> wondering why the script was failing (and what I had broken in it), before
> realising what had happened.

I think that we could just use the URLs I am mentioning here:
https://www.postgresql.org/message-id/20181227064958.GK2106@paquier.xyz

I haven't been able to finish what I wanted for the proposed patch set
yet, but what I was thinking about is to include:
1) The root URL where all the release folders are present
2) The full URL of the current Latin-ASCII.xml being used for the
generation, not as a URL pointing to the latest version, but as a URL
pointing to an exact version in time (I doubt that a released version
never changes in this tree, but who knows..).
3) The version used to generate the rules.
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Fri, 4 Jan 2019 at 08:00, Michael Paquier <michael@paquier.xyz> wrote:
I haven't been able to finish what I wanted for the proposed patch set
yet, but what I was thinking about is to include:
1) The root URL where all the release folders are present
2) The full URL of the current Latin-ASCII.xml being used for the
generation, not as a URL pointing to the latest version, but as a URL
pointing to an exact version in time (I doubt that a released version
never changes in this tree, but who knows..).
3) The version used to generate the rules.

Hi Michael, 
I think we're on the same page. I'll wait for you to finish your review and provide any further comments before I make any changes. 

Thanks,
Hugh

Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
Hi Hugh,

On Fri, Jan 04, 2019 at 11:29:42AM -0500, Hugh Ranalli wrote:
> I think we're on the same page. I'll wait for you to finish your review and
> provide any further comments before I make any changes.

I have been doing a bit more than a review by studying by myself the
new format and the old format, and the way we could do things in the
XML parsing part, and hacked the code by myself.  On top of the
incorrect URL for Latin-ASCII.xml, I have noticed as well that there
should be only one block transforms/transform/tRule in the source, so
I think that we should add an assertion on that as a sanity check.  I
have also changed the code to use splitlines(), which is more portable
across platforms, and added an extra regression test for the new
characters added to unaccent.rules.  This does not close this thread
but we can support the new format this way.  I have also documented
the way to browse the full set of releases for Latin-ASCII.xml, and
precisely which version has been used for this patch.

This does not close yet the part for diacritical characters, but
supporting the new format is a step into this direction.  What do
you think?
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Tue, 8 Jan 2019 at 22:53, Michael Paquier <michael@paquier.xyz> wrote:
I have been doing a bit more than a review by studying by myself the
new format and the old format, and the way we could do things in the
XML parsing part, and hacked the code by myself.  On top of the
incorrect URL for Latin-ASCII.xml, I have noticed as well that there
should be only one block transforms/transform/tRule in the source, so
I think that we should add an assertion on that as a sanity check.  I
have also changed the code to use splitlines(), which is more portable
across platforms, and added an extra regression test for the new
characters added to unaccent.rules.  This does not close this thread
but we can support the new format this way.  I have also documented
the way to browse the full set of releases for Latin-ASCII.xml, and
precisely which version has been used for this patch.

This does not close yet the part for diacritical characters, but
supporting the new format is a step into this direction.  What do
you think?
HI Michael,
Thank you for putting so much effort into this. I think that looks great. When I was doing this, I discovered that I could parse both pre- and post- r29 versions, so I went with that, but I agree that there's probably no good reason to do so.

And thank you for the information on splitlines; that's a method I've overlooked. .split('\n') should be identical, if python is, as usual, compiled with universal newlines support, but it's nice to have a method guaranteed to work in all instances. 

Best wishes,
Hugh

Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Wed, Jan 09, 2019 at 09:52:05PM -0500, Hugh Ranalli wrote:
> Thank you for putting so much effort into this. I think that looks great.
> When I was doing this, I discovered that I could parse both pre- and post-
> r29 versions, so I went with that, but I agree that there's probably no
> good reason to do so.

OK, committed then. I have yet to study yet the other part of the
proposal regarding diatritical characters.  Patch 3 has a conflict for
the regression tests, so a rebase would be needed.  That's not a big
deal though to resolve the conflict.  I am also a bit confused by the
newly-generated unaccent.rules.  Why nothing shows up for the second
column (around line 414 for example)?  Shouldn't we have mapping
characters?
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:

On Thu, 10 Jan 2019 at 01:09, Michael Paquier <michael@paquier.xyz> wrote:
OK, committed then. I have yet to study yet the other part of the
proposal regarding diatritical characters.  Patch 3 has a conflict for
the regression tests, so a rebase would be needed.  That's not a big
deal though to resolve the conflict.  I am also a bit confused by the
newly-generated unaccent.rules.  Why nothing shows up for the second
column (around line 414 for example)?  Shouldn't we have mapping
characters?

That concerned me, as well. I have confirmed the lines are not empty. If you open the file in a text editor (I'm using KDE's Kate), and insert a standard character at the beginning of one of those lines, the diacritic then appears, combined with the character you just entered. The only program I've found that wants to display them on their own is vi (and I only just thought of trying that).

From what I can tell, this is likely a font issue: 
Hugh

Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
Thomas,

On Thu, Jan 10, 2019 at 03:09:45PM +0900, Michael Paquier wrote:
> OK, committed then. I have yet to study yet the other part of the
> proposal regarding diatritical characters.  Patch 3 has a conflict for
> the regression tests, so a rebase would be needed.  That's not a big
> deal though to resolve the conflict.  I am also a bit confused by the
> newly-generated unaccent.rules.  Why nothing shows up for the second
> column (around line 414 for example)?  Shouldn't we have mapping
> characters?

You are registered as a reviewer and committer of the last patch of
this thread:
https://commitfest.postgresql.org/21/1924/

Are you planning to look at it or should I jump in?  I have not looked
at the patch status in depth yet.
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Thomas Munro
Date:
On Mon, Jan 28, 2019 at 7:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> You are registered as a reviewer and committer of the last patch of
> this thread:
> https://commitfest.postgresql.org/21/1924/
>
> Are you planning to look at it or should I jump in?  I have not looked
> at the patch status in depth yet.

Thanks for the reminder.  I looked at this a couple of weeks ago when
you ping me off-list, but I see we're still waiting for a rebase.
Hugh, can you please post a new patch?  The approach looks right to me
(simply replace the composing diacritics with nothing), so if you post
a new version I'll  double check with that test case I came up with
earlier, and then I'll be happy to commit it.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Mon, 28 Jan 2019 at 02:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Thanks for the reminder.  I looked at this a couple of weeks ago when
you ping me off-list, but I see we're still waiting for a rebase.
Hugh, can you please post a new patch?  The approach looks right to me
(simply replace the composing diacritics with nothing), so if you post
a new version I'll  double check with that test case I came up with
earlier, and then I'll be happy to commit it.

Hi Thomas,
My apologies; I hadn't realised I was supposed to do this. A rebased version of patch 03 is attached. Let me know if you have any questions or need any changes.

Best wishes,
Hugh
Attachment

Re: BUG #15548: Unaccent does not remove combining diacriticalcharacters

From
Michael Paquier
Date:
On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:
> My apologies; I hadn't realised I was supposed to do this. A rebased
> version of patch 03 is attached. Let me know if you have any questions or
> need any changes.

Moved to next CF.
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Thomas Munro
Date:
On Fri, Feb 1, 2019 at 2:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:
> > My apologies; I hadn't realised I was supposed to do this. A rebased
> > version of patch 03 is attached. Let me know if you have any questions or
> > need any changes.
>
> Moved to next CF.

I checked that the script generates identical output on my machine.
Committed.  Thanks!

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Thomas Munro
Date:
On Tue, Dec 3, 2019 at 9:57 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Hugh Ranalli <hugh@whtc.ca> writes:
> > > The problem is that I downloaded the latest version of the Latin-ASCII
> > > transliteration file (r34 rather than the r28 specified in the URL). Over 3
> > > years ago (in r29, of course) they changed the file format (
> > > https://unicode.org/cldr/trac/ticket/5873) so that
> > > parse_cldr_latin_ascii_transliterator loads an empty rules set.
> >
> > Ah-hah.
> >
> > > I'd be
> > > happy to either a) support both formats, or b), support just the newest and
> > > update the URL. Option b) is cleaner, and I can't imagine why anyone would
> > > want to use an older rule set (then again, struggling with Unicode always
> > > makes my head hurt; I am not an expert on it). Thoughts?
> >
> > (b) seems sufficient to me, but perhaps someone else has a different
> > opinion.
> >
> > Whichever we do, I think it should be a separate patch from the feature
> > addition for combining diacriticals, just to keep the commit history
> > clear.
>
> +1 for updating to the latest file from time to time.  After
> http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
> our special_cases() function will have just the two Cyrillic
> characters, which should almost certainly be handled by adding
> Cyrillic to the ranges we handle via the usual code path, and DEGREE
> CELSIUS and DEGREE FAHRENHEIT.  Those degree signs could possibly be
> extracted from Unicode.txt (or we could just forget about them), and
> then we could drop special_cases().

Aha, CLDR 36 included that change, so when we update we can drop a special case.