Thread: BUG #15548: Unaccent does not remove combining diacritical characters
From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference: 15548
Logged by: Hugh Ranalli
Email address: hugh@whtc.ca
PostgreSQL version: 11.1
Operating system: Ubuntu 18.04

Description:

Apparently Unicode has two ways of accenting a character: as a separate code point, which represents the base character and the accent, or as a "combining diacritical mark" (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks), in which case the mark applies itself to the preceding character. For example, A followed by U+0300 displays À. However, unaccent is not removing these accents. SELECT unaccent(U&'A\0300'); should result in 'A', but instead results in 'À'.

I'm running PostgreSQL 11.1, installed from the PostgreSQL repositories. I've read bug report #13440, and have tried with both the installed unaccent.rules as well as a new set generated by the generate_unaccent_rules.py distributed with the 11.1 source code:

wget http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt
wget https://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml
python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I see there have been some updates to generate_unaccent_rules.py to handle Greek and Vietnamese characters, but neither of them seems to address this issue. I'm happy to contribute a patch to handle these cases, but of course wanted to make sure this is desired behaviour, or whether I am misunderstanding something somewhere.

Thank you,
Hugh Ranalli
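The distinction between the two encodings can be reproduced outside PostgreSQL with Python's unicodedata module (an illustrative sketch, not part of the bug report):

```python
import unicodedata

# Precomposed form: a single code point, U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE).
composed = "\u00c0"
# Decomposed form: base letter 'A' followed by U+0300 (COMBINING GRAVE ACCENT).
decomposed = "A\u0300"

# Both render as "À", but they are different strings, which is why a
# rules file keyed only on precomposed characters misses the second form.
print(composed == decomposed)                    # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```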
PG Bug reporting form wrote:

> Apparently Unicode has two ways of accenting a character: as a separate
> code point, which represents the base character and the accent, or as a
> "combining diacritical mark"
> (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence

In general, PostgreSQL leaves it to applications to normalize Unicode strings so that they are all in the same canonical form, either composed or decomposed.

> the mark applies itself to the preceding character. For example, A
> followed by U+0300 displays À. However, unaccent is not removing
> these accents.

Short of having the input normalized by the application, ISTM that the best solution would be to provide functions to do it in Postgres, so you'd just write, for example:

unaccent(unicode_NFC(string))

Otherwise unaccent.rules can be customized. You may add replacements for letter+diacritical sequences that are missing for the languages you have to deal with. But doing it in general for all diacriticals multiplied by all base characters seems unrealistic.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes:
> PG Bug reporting form wrote:
>> ... For example, A
>> followed by U+0300 displays À. However, unaccent is not removing
>> these accents.

> Short of having the input normalized by the application, ISTM that the
> best solution would be to provide functions to do it in Postgres, so
> you'd just write for example:
> unaccent(unicode_NFC(string))

That might be worthwhile, but it seems independent of this issue.

> Otherwise unaccent.rules can be customized. You may add replacements
> for letter+diacritical sequences that are missing for the languages
> you have to deal with. But doing it in general for all diacriticals
> multiplied by all base characters seems unrealistic.

Hm, I thought the OP's proposal was just to make unaccent drop combining diacriticals independently of context, which'd avoid the combinatorial-growth problem.

			regards, tom lane
Tom Lane wrote:

> Hm, I thought the OP's proposal was just to make unaccent drop
> combining diacriticals independently of context, which'd avoid the
> combinatorial-growth problem.

In that case, this could be achieved by simply appending the diacriticals themselves to unaccent.rules, since replacement of a string by an empty string is already supported as a rule.

It doesn't seem like the current file has any of these, but from https://www.postgresql.org/docs/11/unaccent.html :

"Alternatively, if only one character is given on a line, instances of that character are deleted; this is useful in languages where accents are represented by separate characters"

Incidentally we may want to improve this bit of doc to mention explicitly the Unicode decomposed forms as a use case for removing characters. In fact I wonder if that's not what it's already trying to express, but confusing "languages" with "forms".

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
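The deletion-rule idea can be sketched as follows: a hypothetical illustration (not the actual patch) of generating single-character rule lines for the Combining Diacritical Marks block, where a line containing only one character tells unaccent to delete it:

```python
# Sketch: produce unaccent.rules-style deletion lines for the Combining
# Diacritical Marks block (U+0300..U+036F). Each line holds a single
# character, which the unaccent dictionary interprets as "delete it".
lines = [chr(cp) for cp in range(0x0300, 0x0370)]
print(len(lines))  # 112 candidate deletion rules
```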
On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel@manitou-mail.org> wrote:

> Tom Lane wrote:
>> Hm, I thought the OP's proposal was just to make unaccent drop
>> combining diacriticals independently of context, which'd avoid the
>> combinatorial-growth problem.

That's what I was thinking. Given that the accent is separate from the characters, simply dropping it should result in the correct unaccented character.

> In that case, this could be achieved by simply appending the
> diacriticals themselves to unaccent.rules, since replacement of a
> string by an empty string is already supported as a rule.
>
> It doesn't seem like the current file has any of these, but from
> https://www.postgresql.org/docs/11/unaccent.html :
>
> "Alternatively, if only one character is given on a line, instances
> of that character are deleted; this is useful in languages where
> accents are represented by separate characters"

Yes, I had read that in the docs, and that's the approach I planned to take. I'll go ahead and develop a patch, then.
Best wishes,
Hugh
I've attached a patch that removes combining diacriticals. As with Latin and Greek letters, it uses ranges to restrict its activity.
I have not submitted a patch for unaccent.rules, as it seems that a rules file generated from generate_unaccent_rules.py will actually remove a large number of rules (even before my changes), such as replacing the copyright symbol © with (C), as well as other accented characters. It's probably worth asking if the shipped unaccent.rules should correspond to what the shipped generation utility produces, or not. I was surprised to see that it didn't.
--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh@whtc.ca
c: +01-416-994-7957
w: www.whtc.ca
Please let me know if you see anything I need to change.
Best wishes,
Hugh
Attachment
Hugh Ranalli <hugh@whtc.ca> writes:
> I've attached a patch that removes combining diacriticals. As with Latin and
> Greek letters, it uses ranges to restrict its activity.

Cool. Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/

> I have not submitted a patch for unaccent.rules, as it seems that a rules
> file generated from generate_unaccent_rules.py will actually remove a large
> number of rules (even before my changes), such as replacing the copyright
> symbol © with (C), as well as other accented characters. It's probably
> worth asking if the shipped unaccent.rules should correspond to what the
> shipped generation utility produces, or not. I was surprised to see that it
> didn't.

Me too -- seems like that bears looking into. Perhaps the script's results are platform dependent -- what were you testing on?

			regards, tom lane
On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Cool. Please add it to the current CF so we don't forget about it:
> https://commitfest.postgresql.org/21/

Done.

> Me too -- seems like that bears looking into. Perhaps the script's
> results are platform dependent -- what were you testing on?
I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think that's it. The program's decisions come from the two data files, the Unicode data set and the Latin-ASCII transliteration file. The script uses categories (ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category) to identify letters (and now combining marks) and, if they are in range, performs a substitution. It then uses the transliteration file to find rules for particular character substitutions (for example, that file seems to handle the copyright symbol substitution). I don't see anything platform dependent in there.
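The category-based filtering described above can be sketched with Python's unicodedata module (an illustration of the technique, not the script's actual code):

```python
import unicodedata

# Sketch: keep only code points whose Unicode general category marks them
# as nonspacing combining marks ("Mn"), restricted here to the main
# Combining Diacritical Marks block, mirroring the range checks the
# generator script performs for letters.
def is_combining_mark(cp):
    return unicodedata.category(chr(cp)) == "Mn"

marks = [cp for cp in range(0x0300, 0x0370) if is_combining_mark(cp)]
print(hex(marks[0]))  # 0x300 (COMBINING GRAVE ACCENT)
```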
In looking more closely, I also see that the script isn't generating ligatures, even though it should: although the program can generate them, none of the ligatures are in the ranges defined in PLAIN_LETTER_RANGES, and so they are skipped.
This could probably be handled by adding the ligature ranges to the defined ranges. Symbol types could be added to the types it looks at, and perhaps the codepoint ranges collapsed into one list, as the IDs are unique across all categories. I don't think we'd want to just rely on ranges, as that could include control characters, punctuation, etc.
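For what it's worth, the compatibility decompositions that such an expansion could draw on are visible via Python's unicodedata (an illustrative aside, not the script's code):

```python
import unicodedata

# Sketch: ligatures such as U+FB01 (LATIN SMALL LIGATURE FI) carry a
# compatibility decomposition in UnicodeData.txt, which is how a generator
# could expand them to plain letters if their ranges were included.
print(unicodedata.decomposition("\ufb01"))  # <compat> 0066 0069  ("fi")
```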
There are a number of other characters that appear in unaccent.rules that aren't generated by the script. I've attached a diff of the output of generate_unaccent_rules (using the version before my changes, to simplify matters) and unaccent.rules. Unfortunately, I don't know how to interpret most of these characters.
I suppose it's valid to ask if changing © to (C) is even something an "unaccent" function should do. Given that it's in the existing rules file, should it be supported as an existing behaviour?
Sorry for more questions than answers. ;-)
Attachment
Hugh Ranalli <hugh@whtc.ca> writes:
> On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Me too -- seems like that bears looking into. Perhaps the script's
>> results are platform dependent -- what were you testing on?

> I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
> that's it. The program's decisions come from the two data files, the
> Unicode data set and the Latin-ASCII transliteration file. The script uses
> categories
> (ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
> to identify letters (and now combining marks) and if they are in range,
> performs a substitution. It then uses the transliteration file to find
> rules for particular character substitutions (for example, that file seems
> to handle the copyright symbol substitution). I don't see anything platform
> dependent in there.

Hm. Something funny is going on here. When I fetch the two reference files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules. So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried it on anything newer.

			regards, tom lane
I wrote:
> ... I get something that's bit-for-bit the same as what's in unaccent.rules.
> So there's clearly a platform difference between here and there.
> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> it on anything newer.

A few minutes later on a Fedora 28 box: python 2.7.15 also gives me the expected results, while python 3.6.6 fails with "SyntaxError: invalid syntax". So updating that script to also work with python3 might be a worthwhile TODO item. But I'm at a loss to explain why you get different results.

			regards, tom lane
On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hm. Something funny is going on here. When I fetch the two reference
> files from the URLs cited in the script, and do
>
> python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules
>
> I get something that's bit-for-bit the same as what's in unaccent.rules.
> So there's clearly a platform difference between here and there.
>
> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> it on anything newer.

Well, that's embarrassing. When I looked I couldn't see anything that looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17. We use other versions of 2.7 on our production platforms. I'll take another look, and check the URLs I am using.
The problem is that I downloaded the latest version of the Latin-ASCII transliteration file (r34 rather than the r28 specified in the URL). Over 3 years ago (in r29, of course) they changed the file format (https://unicode.org/cldr/trac/ticket/5873), so that parse_cldr_latin_ascii_transliterator loads an empty rules set. I'd be happy to either a) support both formats, or b) support just the newest and update the URL. Option b) is cleaner, and I can't imagine why anyone would want to use an older rule set (then again, struggling with Unicode always makes my head hurt; I am not an expert on it). Thoughts?
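For readers unfamiliar with the file being parsed, here is a hypothetical sketch of reading rules out of a CLDR transform document. The element names (transforms/transform/tRule) follow the CLDR layout, but the sample document is made up for illustration, and the rule syntax inside tRule is exactly what changed between releases, which is what broke the parser:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal sample in the spirit of Latin-ASCII.xml; the real
# file's rule syntax varies between CLDR releases.
sample = """<supplementalData>
  <transforms>
    <transform source="Latin" target="ASCII">
      <tRule>\u00a9 \u2192 '(C)' ;</tRule>
    </transform>
  </transforms>
</supplementalData>"""

root = ET.fromstring(sample)
# Collect the raw rule text; a real parser must then cope with the
# release-specific rule syntax.
rules = [r.text for r in root.iter("tRule")]
print(rules[0])
```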
Hugh Ranalli <hugh@whtc.ca> writes:
> The problem is that I downloaded the latest version of the Latin-ASCII
> transliteration file (r34 rather than the r28 specified in the URL). Over 3
> years ago (in r29, of course) they changed the file format
> (https://unicode.org/cldr/trac/ticket/5873) so that
> parse_cldr_latin_ascii_transliterator loads an empty rules set.

Ah-hah.

> I'd be happy to either a) support both formats, or b) support just the
> newest and update the URL. Option b) is cleaner, and I can't imagine why
> anyone would want to use an older rule set (then again, struggling with
> Unicode always makes my head hurt; I am not an expert on it). Thoughts?

(b) seems sufficient to me, but perhaps someone else has a different opinion.

Whichever we do, I think it should be a separate patch from the feature addition for combining diacriticals, just to keep the commit history clear.

			regards, tom lane
On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hugh Ranalli <hugh@whtc.ca> writes:
>> The problem is that I downloaded the latest version of the Latin-ASCII
>> transliteration file (r34 rather than the r28 specified in the URL). Over 3
>> years ago (in r29, of course) they changed the file format
>> (https://unicode.org/cldr/trac/ticket/5873) so that
>> parse_cldr_latin_ascii_transliterator loads an empty rules set.
>
> Ah-hah.
>
>> I'd be happy to either a) support both formats, or b) support just the
>> newest and update the URL. Option b) is cleaner, and I can't imagine why
>> anyone would want to use an older rule set (then again, struggling with
>> Unicode always makes my head hurt; I am not an expert on it). Thoughts?
>
> (b) seems sufficient to me, but perhaps someone else has a different
> opinion.
>
> Whichever we do, I think it should be a separate patch from the feature
> addition for combining diacriticals, just to keep the commit history
> clear.

+1 for updating to the latest file from time to time. After http://unicode.org/cldr/trac/ticket/11383 makes it into a new release, our special_cases() function will have just the two Cyrillic characters, which should almost certainly be handled by adding Cyrillic to the ranges we handle via the usual code path, and DEGREE CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be extracted from Unicode.txt (or we could just forget about them), and then we could drop special_cases().

--
Thomas Munro
http://www.enterprisedb.com
On Sat, 15 Dec 2018 at 21:26, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> +1 for updating to the latest file from time to time. After
> http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
> our special_cases() function will have just the two Cyrillic
> characters, which should almost certainly be handled by adding
> Cyrillic to the ranges we handle via the usual code path, and DEGREE
> CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
> extracted from Unicode.txt (or we could just forget about them), and
> then we could drop special_cases().
Well, when I modified the code to handle the new version of the transliteration file, I discovered that was sufficient to handle the old version as well. That's not the way things usually go, but I'll take it. ;-)
I've attached two patches, one to update generate_unaccent_rules.py, and another that updates unaccent.rules from the v34 transliteration file. I'll be happy to add these to the CF. Does anyone need to review them and give me approval before I do so?
Best wishes,
Hugh
Hugh Ranalli <hugh@whtc.ca> writes:
> I've attached two patches, one to update generate_unaccent_rules.py, and
> another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

> I'll be happy to add these to the CF. Does anyone need to review them and
> give me approval before I do so?

Nope.

			regards, tom lane
On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hugh Ranalli <hugh@whtc.ca> writes:
>> I've attached two patches, one to update generate_unaccent_rules.py, and
>> another that updates unaccent.rules from the v34 transliteration file.
>
> I think you forgot the patches?
Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.
Hugh
Attachment
On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hugh Ranalli <hugh@whtc.ca> writes:
>>> I've attached two patches, one to update generate_unaccent_rules.py, and
>>> another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
> Sigh, yes I did. That's what I get for trying to get this sent out before
> heading to an appointment. Patches attached and will add to CF. Let me
> know if you see anything amiss.

+ʹ '
+ʺ "
+ʻ '
+ʼ '
+ʽ '
+˂ <
+˃ >
+˄ ^
+ˆ ^
+ˈ '
+ˋ `
+ː :
+˖ +
+˗ -
+˜ ~

I don't think this is quite right. Those don't seem to be the combining codepoints[1], and in any case they are being replaced with ASCII characters, whereas I thought we wanted to replace them with nothing at all. Here is my attempt to come up with a test case using combining characters:

select unaccent('un café crème s''il vous plaît');

It's not stripping the accents. I've attached that in a file for reference so you can run it with psql -f x.sql, and you can see that it's using combining code points (code points 0301, 0300, 0302, which come out as cc81, cc80, cc82 in UTF-8) like so:

$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then when finished ":%!xxd -r", to turn vim into a hex editor.)

[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

--
Thomas Munro
http://www.enterprisedb.com
Attachment
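The combining code points in that hex dump can be checked independently (an illustrative sketch):

```python
# Sketch: the three combining code points in the test string each encode
# to a two-byte UTF-8 sequence, matching the bytes in the xxd dump above.
for cp in ("\u0301", "\u0300", "\u0302"):
    print(cp.encode("utf-8").hex())
# cc81
# cc80
# cc82
```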
On Tue, Dec 18, 2018 at 3:05 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
>
> I don't think this is quite right. Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all. Here is my attempt to come up with a test case using
> combining characters:
>
> select unaccent('un café crème s''il vous plaît');

Oh, I see now that that was just the v34 ASCII transliteration update, and perhaps the diacritic stripping will be posted separately.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Dec 18, 2018 at 03:05:00PM +1100, Thomas Munro wrote:
> I don't think this is quite right. Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all. Here is my attempt to come up with a test case using
> combining characters:
>
> select unaccent('un café crème s''il vous plaît');
>
> It's not stripping the accents. I've attached that in a file for
> reference so you can run it with psql -f x.sql, and you can see that
> it's using combining code points (code points 0301, 0300, 0302 which
> come out as cc81, cc80, cc82 in UTF-8) like so:

Could you also add some tests in contrib/unaccent/sql/unaccent.sql at the same time? That would be nice to check easily the extent of the patches proposed on this thread.

--
Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes:
> Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
> the same time? That would be nice to check easily the extent of the
> patches proposed on this thread.

I wonder why unaccent.sql is set up to run its tests in KOI8 client encoding rather than UTF8. It doesn't seem like it's the business of this test script to be verifying transcoding from KOI8 to UTF8 (and if it were meant to do that, it's a pretty incomplete test...). But having it set up like that means that we can't directly add such tests to unaccent.sql, because there are no combining diacritics in the KOI8 character set. We have two unattractive options:

* Change client encodings partway through unaccent.sql. I think this would be disastrous for editability of that file; no common tools will understand the encoding change.

* Put the new test cases into a separate file with a different client encoding. This is workable, I suppose, but it seems pretty silly when the tests are only a few queries apiece.

Another problem I've got with the current setup is that it seems unlikely that many people's editors default to an assumption of KOI8 encoding. Mine guesses that these files are UTF8, and so the test cases look perfectly insane. They do make sense if I transcode the files to UTF8, but I wonder why we're not shipping them as UTF8 in the first place.

tl;dr: I think we should convert unaccent.sql and unaccent.out to UTF8 encoding. Then, adding more test cases for this patch will be easy.

			regards, tom lane
On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
> tl;dr: I think we should convert unaccent.sql and unaccent.out
> to UTF8 encoding. Then, adding more test cases for this patch
> will be easy.

Do you think that we could also remove the non-ASCII characters from the tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in input, and show the output with bytea. That's harder to read, but we discussed not using UTF-8 in the python script to allow folks with simple terminals to touch the code the last time this was touched (5e8d670), and the characters used could be documented as comments in the tests.

--
Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
>> tl;dr: I think we should convert unaccent.sql and unaccent.out
>> to UTF8 encoding. Then, adding more test cases for this patch
>> will be easy.

> Do you think that we could also remove the non-ASCII characters from the
> tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in
> input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder to verify by eyeball. With the current setup --- other than the uncommon-outside-Russia encoding choice --- you don't really need to read or speak Russian to see that this:

SELECT unaccent('ёлка');
 unaccent
----------
 елка
(1 row)

probably represents unaccent doing what it ought to. If everything is in hex then it's a lot harder. Ten years ago I might've agreed with your point, but today it's hard to believe that anyone who takes any interest at all in unaccent's functionality would not have a UTF8-capable terminal.

> That's harder to read, still we
> discussed about not using UTF-8 in the python script to allow folks with
> simple terminals to touch the code the last time this was touched
> (5e8d670) and the characters used could be documented as comments in the
> tests.

Maybe I'm misremembering, but I thought that discussion was about the code files. I am still mistrustful of non-ASCII in our code files. But for data and test files, we've been accepting UTF8 ever since the text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

			regards, tom lane
On Tue, Dec 18, 2018 at 01:23:57AM -0500, Tom Lane wrote:
> Maybe I'm misremembering, but I thought that discussion was about the
> code files. I am still mistrustful of non-ASCII in our code files.

Yes, that was in generate_unaccent_rules.py:
https://www.postgresql.org/message-id/25859.1535076450@sss.pgh.pa.us

> But for data and test files, we've been accepting UTF8 ever since the
> text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

Okay, fine by me.

--
Michael
Attachment
On Mon, 17 Dec 2018 at 23:05, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> +ʹ '
> +ʺ "
> +ʻ '
> +ʼ '
> +ʽ '
> +˂ <
> +˃ >
> +˄ ^
> +ˆ ^
> +ˈ '
> +ˋ `
> +ː :
> +˖ +
> +˗ -
> +˜ ~

These aren't the combining codepoints. They're new substitutions defined in r34 of the Latin-ASCII transliteration file. I had wondered about those, too, and did some testing.

> I don't think this is quite right.

However, you are correct that something isn't right. In testing why I was getting a different output, I had reverted to the generate_unaccent_rules.py from BEFORE my changes, and then applied my update for the transliteration file format to the reverted version. The patch for generate_unaccent_rules.py should still be good, but the generated rules file didn't include the combining diacriticals. In regenerating it, I want to double-check some of the additions before re-submitting.
On Mon, 17 Dec 2018 at 23:57, Michael Paquier <michael@paquier.xyz> wrote:

> Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
> the same time? That would be nice to check easily the extent of the
> patches proposed on this thread.

That makes sense. I'm happy to do that. Let me look at that file and see how extensive the other changes (encoding and removal of special characters) would be.
Hugh
Okay, I've tried to separate everything cleanly. The patches are numbered in the order in which they should be applied. Each patch contains all the updates appropriate to that version (i.e., if the change would modify unaccent.rules, those changes are also in the patch):
01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible. The approach I have taken is "native" Python 3 compatibility with adjustments for Python 2. There's a marked block at the beginning of the file that can be removed whenever Python 2 support is dropped. I haven't followed the recommended practice of importing the "past" or "future" modules, as the changes are minimal, and these are just additional dependencies that need to be installed separately, which didn't seem to make sense for a utility script. This patch also updates sql/unaccent.sql to UTF-8 format.
02 - Updates generate_unaccent_rules.py to work with all versions (I tested r28 and r34) of the Latin-ASCII transliteration file. It also updates unaccent.rules to have the output of the r34 transliteration file. This patch should work without the 01 patch.
03 - Updates generate_unaccent_rules.py to remove combining diacritical marks. It also updates unaccent.rules with the revised output, and adds tests to sql/unaccent.sql. It will not work or apply if the 01 patch is not applied. It should work without the 02 patch.
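The patches themselves are attached rather than quoted, but a minimal example of the kind of removable Python 2 adjustment described in item 01 might look like this (hypothetical, not the actual patch):

```python
# Hypothetical sketch of a clearly-delimited, removable Python 2
# compatibility block; delete it once Python 2 support is dropped.
from __future__ import print_function
import sys

if sys.version_info[0] <= 2:
    # In Python 2, make chr() accept full Unicode code points as in Python 3.
    chr = unichr  # noqa: F821

print(chr(0x0300) == u"\u0300")  # True on both versions
```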
When you look at unaccent.rules generated by the 03 version, there may appear to be blank lines. I've checked and they're not blank. They are characters which are only visible with other characters in front of them, at least in my editor.
I'll go update the CommitFest now. I hope I've covered everything; please let me know if there's anything I've missed.
Best wishes,
Hugh
Attachment
On Thu, Dec 20, 2018 at 05:39:36PM -0500, Hugh Ranalli wrote:
> I'll go update the CommitFest now. I hope I've covered everything; please
> let me know if there's anything I've missed.

-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+# (Ideally you should use the latest release).

I have begun playing with this patch set. And for the note, this URL is incorrect. Here is a more correct one:
https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml

And for the information, it is possible to get the latest released versions by browsing the code (see the tags release-*):
https://unicode.org/cldr/trac/browser/tags

--
Michael
Attachment
On Thu, 27 Dec 2018 at 01:50, Michael Paquier <michael@paquier.xyz> wrote:

> -# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
> +# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
> +# (Ideally you should use the latest release).
>
> I have begun playing with this patch set. And for the note this URL
> is incorrect. Here is a more correct one:
> https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml
>
> And for the information it is possible to get the latest released
> versions by browsing the code (see the tags release-*):
> https://unicode.org/cldr/trac/browser/tags
Thank you. As I've said, I only pretend to be someone who knows something about Unicode. ;-) I'll update once we've determined there is no further feedback, so I'm not releasing too many changes, if that's okay.
Hugh
Re: BUG #15548: Unaccent does not remove combining diacritical characters
From: Peter Eisentraut
On 20/12/2018 23:39, Hugh Ranalli wrote:
> 01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.

My opinion is that we should just convert the whole thing to Python 3 and be done. This script is only run rarely, on a developer's machine, so it's not unreasonable to expect Python 3 to be available.

The only other Python script I can find in the source is src/test/locale/sort-test.py, which we should similarly convert.

> This patch also updates sql/unaccent.sql to UTF-8 format.

I have committed that in the meantime.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 20/12/2018 23:39, Hugh Ranalli wrote:
> 01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.
My opinion is that we should just convert the whole thing to Python 3
and be done. This script is only run rarely, on a developer's machine,
so it's not unreasonable to expect Python 3 to be available.
Well, this is definitely an edge case, but I am actually running the patched script from a complex application installer running a custom-compiled version of Python 2.7. The installer runs under the same Python instance as the application. I certainly could invoke Python 3 to run this script, it's just a little more work, so I'm happy to go with the team's decision. Just let me know.
Hugh
Hugh Ranalli <hugh@whtc.ca> writes:
> On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> My opinion is that we should just convert the whole thing to Python 3
>> and be done. This script is only run rarely, on a developer's machine,
>> so it's not unreasonable to expect Python 3 to be available.

> Well, this is definitely an edge case, but I am actually running the
> patched script from a complex application installer running a
> custom-compiled version of Python 2.7. The installer runs under the same
> Python instance as the application. I certainly could invoke Python 3 to
> run this script, it's just a little more work, so I'm happy to go with the
> team's decision. Just let me know.

Seeing that supporting python 2 only adds a dozen lines of code,
I vote for retaining it for now. It'd be appropriate to drop that when
python 3 is the overwhelmingly more-installed version, but AFAICT that
isn't the case yet.

regards, tom lane
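For illustration, the kind of small 2/3 compatibility shim being discussed often boils down to a few lines like these (a hypothetical sketch; the actual code committed to generate_unaccent_rules.py may differ):

```python
import sys

# On Python 2, chr() only covers code points below 256 and unichr()
# handles the full Unicode range; on Python 3, chr() already does both.
if sys.version_info[0] <= 2:
    get_char = unichr  # noqa: F821 -- only defined on Python 2
else:
    get_char = chr

combining_grave = get_char(0x0300)  # COMBINING GRAVE ACCENT
```

Under Python 3 the shim is a no-op alias; under Python 2 it keeps full-range code points working without touching the rest of the script.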
On Wed, Jan 02, 2019 at 02:32:32PM -0500, Tom Lane wrote:
> Seeing that supporting python 2 only adds a dozen lines of code,
> I vote for retaining it for now. It'd be appropriate to drop that when
> python 3 is the overwhelmingly more-installed version, but AFAICT that
> isn't the case yet.

As a side note, if I recall correctly Python 2.7 will be EOL'd in
2020 by community, though I suspect that a couple of vendors will
still maintain compatibility for a couple of years in what they ship.
CentOS and RHEL enter in this category perhaps. Like Peter, I would
vote for just maintaining support for Python 3 in this script, as any
modern development machines have it anyway, and not a lot of commits
involve it (I am counting 4 since 2015).
--
Michael
On Wed, 2 Jan 2019 at 20:15, Michael Paquier <michael@paquier.xyz> wrote:
As a side note, if I recall correctly Python 2.7 will be EOL'd in
2020 by community, though I suspect that a couple of vendors will
still maintain compatibility for a couple of years in what they ship.
CentOS and RHEL enter in this category perhaps. Like Peter, I would
vote for just maintaining support for Python 3 in this script, as any
modern development machines have it anyway, and not a lot of commits
involve it (I am counting 4 since 2015).
I realise this is an incredibly minor component of the PostgreSQL infrastructure, but as I don't want to hold up reviewers, may I ask:
- It seems we have two votes for Python 3 only, and one for Python 2/3. I lean toward Python 2/3 myself because: a) many distributions still ship with Python 2 as the default and b) it's a single code block that can easily be removed. If the decision is for Python 3, I'd like at least to add a check that catches this and prints a message, rather than leaving someone with a cryptic runtime error that makes them think the script is broken;
- Michael Paquier, do you have any other comments? If not, I'll adjust the documentation to use the URLs you have indicated. If you are downloading via curl or wget, the URL I used is the proper one. It gives you the XML file, whereas the other saves the HTML interface, leading to errors if you try to run it. I'll also add this to the documentation.
Once I have clarification on these, I'll update the patches.
Thanks,
Hugh
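The fail-fast check Hugh proposes — printing a clear message instead of leaving the user with a cryptic runtime error — could look something like this (a hypothetical sketch, not code from the actual patch):

```python
import sys

def check_python_version(min_major=3):
    # Fail fast with a readable message rather than letting the script
    # die later with a confusing syntax or runtime error.
    if sys.version_info[0] < min_major:
        sys.stderr.write("This script requires Python %d or newer.\n"
                         % min_major)
        sys.exit(1)

check_python_version()
```

Placed at the top of the script, this runs before any Python-3-only syntax is reached.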
On 2019-Jan-03, Hugh Ranalli wrote:
> I realise this is an incredibly minor component of the PostgreSQL
> infrastructure, but as I don't want to hold up reviewers, may I ask:
>
> - It seems we have two votes for Python 3 only, and one for Python 2/3.
> I lean toward Python 2/3 myself because: a) many distributions still ship
> with Python 2 as the default and b) it's a single code block that can
> easily be removed. If the decision is for Python 3, I'd like at least to
> add a check that catches this and prints a message, rather than leaving
> someone with a cryptic runtime error that makes them think the script is
> broken;

I kinda agree with Peter that this is a fringe, rarely run program where
the python3 requirement is unlikely to be onerous, but since the 2/3
compatibility is so little code, I would opt for keeping it for the time
being. We can remove it in a couple of years.

> - Michael Paquier, do you have any other comments? If not, I'll adjust
> the documentation to use the URLs you have indicated. If you are
> downloading via curl or wget, the URL I used is the proper one. It gives
> you the XML file, whereas the other saves the HTML interface, leading to
> errors if you try to run it. I'll also add this to the documentation.

I think the point is that if the committee updates with a further
version of the file, how do you find the new version? We need a URL
that's one step removed from the final file, so that we can see if we
need to update it. Maybe we can provide both URLs for convenience.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I think the point is that if the committee updates with a further
> version of the file, how do you find the new version? We need a URL
> that's one step removed from the final file, so that we can see if we
> need to update it. Maybe we can provide both URLs for convenience.

+1. Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.

regards, tom lane
On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I think the point is that if the committee updates with a further
> version of the file, how do you find the new version? We need a URL
> that's one step removed from the final file, so that we can see if we
> need to update it. Maybe we can provide both URLs for convenience.
+1. Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.
Yes, this is what I was thinking. I was integrating this into my installer, used the "new" URL provided to download the file, and spent several minutes wondering why the script was failing (and what I had broken in it), before realising what had happened.
Re: BUG #15548: Unaccent does not remove combining diacritical characters
From
Peter Eisentraut
Date:
On 03/01/2019 19:19, Alvaro Herrera wrote:
> I kinda agree with Peter that this is a fringe, rarely run program where
> the python3 requirement is unlikely to be onerous, but since the 2/3
> compatibility is so little code, I would opt for keeping it for the time
> being. We can remove it in a couple of years.

OK, committed with the compat layer.

I also fixed up sort-test.py for Python 3, so now everything in the
source should support Python 3.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 03, 2019 at 04:48:33PM -0500, Hugh Ranalli wrote:
> On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> I think the point is that if the committee updates with a further
>>> version of the file, how do you find the new version? We need a URL
>>> that's one step removed from the final file, so that we can see if we
>>> need to update it. Maybe we can provide both URLs for convenience.
>>
>> +1. Could be phrased along the lines of "documents are at URL1,
>> currently synced with URL2" so that it's clear that URL2 should
>> be updated when we re-sync with a newer release.
>
> Yes, this is what I was thinking. I was integrating this into my installer,
> used the "new" URL provided to download the file, and spent several minutes
> wondering why the script was failing (and what I had broken in it), before
> realising what had happened.

I think that we could just use the URLs I am mentioning here:
https://www.postgresql.org/message-id/20181227064958.GK2106@paquier.xyz

I haven't been able to finish what I wanted for the proposed patch set
yet, but what I was thinking about is to include:
1) The root URL where all the release folders are present.
2) The full URL of the current Latin-ASCII.xml being used for the
generation, not as a URL pointing to the latest version, but as a URL
pointing to an exact version in time (I doubt that a released version
ever changes in this tree, but who knows..).
3) The version used to generate the rules.
--
Michael
On Fri, 4 Jan 2019 at 08:00, Michael Paquier <michael@paquier.xyz> wrote:
I haven't been able to finish what I wanted for the proposed patch set
yet, but what I was thinking about is to include:
1) The root URL where all the release folders are present
2) The full URL of the current Latin-ASCII.xml being used for the
generation, not as a URL pointing to the latest version, but as a URL
pointing to an exact version in time (I doubt that a released version
ever changes in this tree, but who knows..).
3) The version used to generate the rules.
Hi Michael,
I think we're on the same page. I'll wait for you to finish your review and provide any further comments before I make any changes.
Thanks,
Hugh
Hi Hugh,

On Fri, Jan 04, 2019 at 11:29:42AM -0500, Hugh Ranalli wrote:
> I think we're on the same page. I'll wait for you to finish your review and
> provide any further comments before I make any changes.

I have been doing a bit more than a review by studying by myself the
new format and the old format, and the way we could do things in the
XML parsing part, and hacked the code by myself. On top of the
incorrect URL for Latin-ASCII.xml, I have noticed as well that there
should be only one block transforms/transform/tRule in the source, so
I think that we should add an assertion on that as a sanity check. I
have also changed the code to use splitlines(), which is more portable
across platforms, and added an extra regression test for the new
characters added to unaccent.rules. This does not close this thread
but we can support the new format this way. I have also documented
the way to browse the full set of releases for Latin-ASCII.xml, and
precisely which version has been used for this patch.

This does not close yet the part for diacritical characters, but
supporting the new format is a step into this direction. What do
you think?
--
Michael
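The sanity check Michael describes — exactly one transforms/transform/tRule block in the post-r29 file — could be sketched like this (a hypothetical illustration using a made-up minimal document; the real Latin-ASCII.xml and the script's actual parsing code differ):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a post-r29 Latin-ASCII.xml: all transliteration
# rules live in a single transforms/transform/tRule text block.
doc = """<supplementalData>
  <transforms>
    <transform source="Latin" target="ASCII">
      <tRule>\u00c0 \u2192 A ;</tRule>
    </transform>
  </transforms>
</supplementalData>"""

root = ET.fromstring(doc)
rules = root.findall("transforms/transform/tRule")
# Sanity check: the expected format has exactly one tRule block.
assert len(rules) == 1, "expected exactly one transforms/transform/tRule"

# splitlines() rather than split('\n'), for cross-platform line endings.
for line in rules[0].text.splitlines():
    print(line.strip())
```

If a future CLDR release were to split the rules across several tRule elements again, the assertion would fail loudly instead of silently producing an empty rules set.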
Hi Michael,

On Tue, 8 Jan 2019 at 22:53, Michael Paquier <michael@paquier.xyz> wrote:
I have been doing a bit more than a review by studying by myself the
new format and the old format, and the way we could do things in the
XML parsing part, and hacked the code by myself. On top of the
incorrect URL for Latin-ASCII.xml, I have noticed as well that there
should be only one block transforms/transform/tRule in the source, so
I think that we should add an assertion on that as a sanity check. I
have also changed the code to use splitlines(), which is more portable
across platforms, and added an extra regression test for the new
characters added to unaccent.rules. This does not close this thread
but we can support the new format this way. I have also documented
the way to browse the full set of releases for Latin-ASCII.xml, and
precisely which version has been used for this patch.
This does not close yet the part for diacritical characters, but
supporting the new format is a step into this direction. What do
you think?
Thank you for putting so much effort into this. I think that looks great. When I was doing this, I discovered that I could parse both pre- and post-r29 versions, so I went with that, but I agree that there's probably no good reason to do so.
And thank you for the information on splitlines; that's a method I've overlooked. .split('\n') should be identical, if python is, as usual, compiled with universal newlines support, but it's nice to have a method guaranteed to work in all instances.
Best wishes,
Hugh
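The splitlines() versus split('\n') difference discussed above shows up as soon as carriage returns or trailing newlines are involved (a small illustration):

```python
# splitlines() recognises \n, \r, and \r\n as line boundaries and drops
# a trailing newline; split("\n") splits only on \n and keeps the \r.
text = "a\r\nb\rc\nd\n"
print(text.splitlines())   # ['a', 'b', 'c', 'd']
print(text.split("\n"))    # ['a\r', 'b\rc', 'd', '']
```

For a file that is always read in universal-newlines mode the two behave the same, but splitlines() is the one guaranteed to work regardless of how the data arrived.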
On Wed, Jan 09, 2019 at 09:52:05PM -0500, Hugh Ranalli wrote:
> Thank you for putting so much effort into this. I think that looks great.
> When I was doing this, I discovered that I could parse both pre- and post-
> r29 versions, so I went with that, but I agree that there's probably no
> good reason to do so.

OK, committed then. I have yet to study the other part of the
proposal regarding diacritical characters. Patch 3 has a conflict for
the regression tests, so a rebase would be needed. That's not a big
deal though to resolve the conflict. I am also a bit confused by the
newly-generated unaccent.rules. Why does nothing show up for the second
column (around line 414 for example)? Shouldn't we have mapping
characters?
--
Michael
On Thu, 10 Jan 2019 at 01:09, Michael Paquier <michael@paquier.xyz> wrote:
OK, committed then. I have yet to study the other part of the
proposal regarding diacritical characters. Patch 3 has a conflict for
the regression tests, so a rebase would be needed. That's not a big
deal though to resolve the conflict. I am also a bit confused by the
newly-generated unaccent.rules. Why does nothing show up for the second
column (around line 414 for example)? Shouldn't we have mapping
characters?
That concerned me, as well. I have confirmed the lines are not empty. If you open the file in a text editor (I'm using KDE's Kate), and insert a standard character at the beginning of one of those lines, the diacritic then appears, combined with the character you just entered. The only program I've found that wants to display them on their own is vi (and I only just thought of trying that).
From what I can tell, this is likely a font issue:
Hugh
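One way to confirm that those "empty-looking" rule entries really contain characters is to inspect them with Python's unicodedata module (illustrative; any lone combining mark behaves this way):

```python
import unicodedata

# A lone combining mark has no glyph of its own, so many editors render
# it as nothing, or attach it to whatever character precedes it.
ch = "\u0300"
print(len(ch))                    # 1 -- the character is really there
print(unicodedata.name(ch))       # COMBINING GRAVE ACCENT
print(unicodedata.combining(ch))  # 230 -- nonzero combining class
```

This matches what Hugh observed: the second column is not empty, it just renders invisibly in most fonts and editors.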
Thomas,

On Thu, Jan 10, 2019 at 03:09:45PM +0900, Michael Paquier wrote:
> OK, committed then. I have yet to study the other part of the
> proposal regarding diacritical characters. Patch 3 has a conflict for
> the regression tests, so a rebase would be needed. That's not a big
> deal though to resolve the conflict. I am also a bit confused by the
> newly-generated unaccent.rules. Why does nothing show up for the second
> column (around line 414 for example)? Shouldn't we have mapping
> characters?

You are registered as a reviewer and committer of the last patch of
this thread:
https://commitfest.postgresql.org/21/1924/

Are you planning to look at it, or should I jump in? I have not looked
at the patch status in depth yet.
--
Michael
On Mon, Jan 28, 2019 at 7:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> You are registered as a reviewer and committer of the last patch of
> this thread:
> https://commitfest.postgresql.org/21/1924/
>
> Are you planning to look at it or should I jump in? I have not looked
> at the patch status in depth yet.

Thanks for the reminder. I looked at this a couple of weeks ago when
you pinged me off-list, but I see we're still waiting for a rebase.
Hugh, can you please post a new patch? The approach looks right to me
(simply replace the composing diacritics with nothing), so if you post
a new version I'll double-check with that test case I came up with
earlier, and then I'll be happy to commit it.

--
Thomas Munro http://www.enterprisedb.com
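The approach described here — simply replacing composing diacritics with nothing — can be sketched with Python's standard unicodedata module (an illustration of the idea, not the code from the patch itself):

```python
import unicodedata

def strip_diacritics(s):
    # Decompose to NFD so precomposed letters split into a base letter
    # plus combining marks, then drop every combining code point.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

print(strip_diacritics("A\u0300"))  # combining form -> "A"
print(strip_diacritics("\u00c0"))   # precomposed form -> "A"
```

Because of the NFD step, the same rule handles both the combining sequence from the original bug report and the precomposed À.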
On Mon, 28 Jan 2019 at 02:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Thanks for the reminder. I looked at this a couple of weeks ago when
you pinged me off-list, but I see we're still waiting for a rebase.
Hugh, can you please post a new patch? The approach looks right to me
(simply replace the composing diacritics with nothing), so if you post
a new version I'll double check with that test case I came up with
earlier, and then I'll be happy to commit it.
Hi Thomas,
My apologies; I hadn't realised I was supposed to do this. A rebased version of patch 03 is attached. Let me know if you have any questions or need any changes.
Best wishes,
Hugh
On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:
> My apologies; I hadn't realised I was supposed to do this. A rebased
> version of patch 03 is attached. Let me know if you have any questions or
> need any changes.

Moved to next CF.
--
Michael
On Fri, Feb 1, 2019 at 2:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:
> > My apologies; I hadn't realised I was supposed to do this. A rebased
> > version of patch 03 is attached. Let me know if you have any questions or
> > need any changes.
>
> Moved to next CF.

I checked that the script generates identical output on my machine.
Committed. Thanks!

--
Thomas Munro http://www.enterprisedb.com
On Tue, Dec 3, 2019 at 9:57 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Hugh Ranalli <hugh@whtc.ca> writes:
> > > The problem is that I downloaded the latest version of the Latin-ASCII
> > > transliteration file (r34 rather than the r28 specified in the URL).
> > > Over 3 years ago (in r29, of course) they changed the file format
> > > (https://unicode.org/cldr/trac/ticket/5873) so that
> > > parse_cldr_latin_ascii_transliterator loads an empty rules set.
> >
> > Ah-hah.
> >
> > > I'd be happy to either a) support both formats, or b), support just the
> > > newest and update the URL. Option b) is cleaner, and I can't imagine why
> > > anyone would want to use an older rule set (then again, struggling with
> > > Unicode always makes my head hurt; I am not an expert on it). Thoughts?
> >
> > (b) seems sufficient to me, but perhaps someone else has a different
> > opinion.
> >
> > Whichever we do, I think it should be a separate patch from the feature
> > addition for combining diacriticals, just to keep the commit history
> > clear.
>
> +1 for updating to the latest file from time to time. After
> http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
> our special_cases() function will have just the two Cyrillic
> characters, which should almost certainly be handled by adding
> Cyrillic to the ranges we handle via the usual code path, and DEGREE
> CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
> extracted from Unicode.txt (or we could just forget about them), and
> then we could drop special_cases().

Aha, CLDR 36 included that change, so when we update we can drop a
special case.