Re: BUG #15548: Unaccent does not remove combining diacritical characters - Mailing list pgsql-bugs

From	Hugh Ranalli
Subject	Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date	December 15, 2018 21:03:33
Msg-id	CAAhbUMMmXnj0YSD+fr5hSqeC+D6PAG+0kXJwMMhK2DCdwQVoxQ@mail.gmail.com
In response to	Re: BUG #15548: Unaccent does not remove combining diacritical characters (Hugh Ranalli <hugh@whtc.ca>)
Responses	Re: BUG #15548: Unaccent does not remove combining diacritical characters
List	pgsql-bugs

On Sat, 15 Dec 2018 at 14:05, Hugh Ranalli <hugh@whtc.ca> wrote:

On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.
Well, that's embarrassing. When I looked I couldn't see anything that looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17. We use other versions of 2.7 on our production platforms. I'll take another look, and check the URLs I am using.

The problem is that I downloaded the latest version of the Latin-ASCII transliteration file (r34 rather than the r28 specified in the URL). Over 3 years ago (in r29, of course) they changed the file format (https://unicode.org/cldr/trac/ticket/5873), so parse_cldr_latin_ascii_transliterator now loads an empty rule set. I'd be happy to either a) support both formats, or b) support just the newest and update the URL. Option b) is cleaner, and I can't imagine why anyone would want to use an older rule set (then again, struggling with Unicode always makes my head hurt; I am not an expert on it). Thoughts?
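
For illustration, a rough sketch of what option a) could look like. It assumes the older layout holds one rule per tRule element and the newer layout packs every rule into a single multi-line tRule block; that reading of the format change is an assumption, not something confirmed in this message. The function name parse_rules, the simplified RULE regex, and both sample documents are hypothetical and are not taken from generate_unaccent_rules.py or from the real Latin-ASCII.xml. It is written for Python 3, whereas the script discussed here is run under python2.

```python
import re
import xml.etree.ElementTree as ET

# Simplified rule shape "<char> → <replacement> ;" (an assumption, not the
# regex used by the real generator script).
RULE = re.compile(r"^(.) \u2192 '?(.+?)'? ;$")


def parse_rules(xml_text):
    """Return (source, replacement) pairs from either assumed layout."""
    root = ET.fromstring(xml_text)
    rules = []
    for t_rule in root.findall("./transforms/transform/tRule"):
        # Splitting on lines covers both cases: a one-rule element yields a
        # single line, a block element yields one line per rule.
        for line in (t_rule.text or "").splitlines():
            match = RULE.match(line.strip())
            if match:
                rules.append((match.group(1), match.group(2)))
    return rules


# Hypothetical miniature documents standing in for the two layouts.
OLD_STYLE = """<supplementalData><transforms><transform>
<tRule>Æ → AE ;</tRule>
<tRule>ß → ss ;</tRule>
</transform></transforms></supplementalData>"""

NEW_STYLE = """<supplementalData><transforms><transform>
<tRule><![CDATA[
# comment lines and blanks do not match RULE and are skipped
Æ → AE ;
ß → ss ;
]]></tRule>
</transform></transforms></supplementalData>"""

if __name__ == "__main__":
    print(parse_rules(OLD_STYLE))
    print(parse_rules(NEW_STYLE))
```

If the layouts really do differ only in this way, a single line-oriented loop would cover both without a version switch. An anchored regex applied to the whole multi-line block, by contrast, would match nothing, which would be consistent with the empty rule set described above.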
