Re: BUG #15548: Unaccent does not remove combining diacritical characters - Mailing list pgsql-bugs

From Hugh Ranalli
Subject Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date
Msg-id CAAhbUMN1n=ZVns-OeCbaVRYPS0oj7tTnmJrzw7Az-op4DHC+JA@mail.gmail.com
In response to Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #15548: Unaccent does not remove combining diacritical characters
List pgsql-bugs


On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hugh Ranalli <hugh@whtc.ca> writes:
> Cool.  Please add it to the current CF so we don't forget about it:
> https://commitfest.postgresql.org/21/

Done.

> Me too -- seems like that bears looking into.  Perhaps the script's
> results are platform dependent -- what were you testing on?

I'm on Linux Mint 17, which is based on Ubuntu 14.04, but I don't think that's it. The program's decisions come from two data files: the Unicode data set and the Latin-ASCII transliteration file. The script uses general categories (ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category) to identify letters (and now combining marks) and, if they fall within the defined ranges, performs a substitution. It then uses the transliteration file to find rules for particular character substitutions (for example, that file seems to handle the copyright symbol substitution). I don't see anything platform dependent in there.
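
To illustrate the category-plus-range check described above, here is a minimal sketch. It is not the code from generate_unaccent_rules: it uses Python's standard unicodedata module rather than parsing the Unicode data file, the helper names are made up for the example, and the ranges shown are illustrative and may differ from the script's real PLAIN_LETTER_RANGES.

```python
import unicodedata

# Illustrative ranges only (ASCII letters); the real script's
# PLAIN_LETTER_RANGES may contain different or additional ranges.
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def is_plain_letter(ch):
    """True if ch falls inside one of the plain-letter ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in PLAIN_LETTER_RANGES)


def is_mark(ch):
    """True if the general category is a mark (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def is_letter(ch):
    """True if the general category is a letter (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(ch).startswith("L")


def base_letter(ch):
    """Return the plain base letter of an accented letter, else None."""
    if not is_letter(ch):
        return None
    # Canonical decomposition: base character followed by combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    if marks and all(is_mark(m) for m in marks) and is_plain_letter(base):
        return base
    return None


for sample in ("\u00e9", "\u01d6", "\u00a9", "\u03ac"):
    print("U+%04X -> %r" % (ord(sample), base_letter(sample)))
```

Everything here depends only on the Unicode data and the range table, which is consistent with the output not varying by platform (apart from the Unicode version bundled with a given Python, in this sketch).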

Looking more closely, I also see that the script isn't generating ligatures, even though it should. The program can generate them, but none of the ligatures fall within the ranges defined in PLAIN_LETTER_RANGES, so they are skipped.
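
A small sketch of the mismatch, again using Python's unicodedata module rather than the script's own parsing. The range values and function names are assumptions made for illustration. The point is that a ligature's components are plain letters, while the ligature's own codepoint lies outside any range of plain letters.

```python
import unicodedata

# Illustrative ranges only; not necessarily the script's real values.
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def in_plain_ranges(codepoint):
    """True if the integer codepoint lies in one of the plain-letter ranges."""
    return any(lo <= codepoint <= hi for lo, hi in PLAIN_LETTER_RANGES)


def ligature_expansion(ch):
    """Return the plain letters a ligature expands to, else None."""
    if not unicodedata.category(ch).startswith("L"):
        return None
    # Compatibility decompositions look like "<compat> 0066 0069".
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0] != "<compat>":
        return None
    parts = [int(field, 16) for field in fields[1:]]
    if len(parts) > 1 and all(in_plain_ranges(p) for p in parts):
        return "".join(chr(p) for p in parts)
    return None


fi = "\ufb01"  # LATIN SMALL LIGATURE FI
print(ligature_expansion(fi))    # components are plain letters
print(in_plain_ranges(ord(fi)))  # the ligature codepoint itself is not in range
```

A range test applied to the ligature's own codepoint rejects it, even though the expansion is well defined.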

This could probably be handled by adding the ligature ranges to the defined ranges. Symbol types could be added to the types it looks at, and perhaps the codepoint ranges collapsed into one list, as the IDs are unique across all categories. I don't think we'd want to just rely on ranges, as that could include control characters, punctuation, etc. 
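
As a rough sketch of that idea (one collapsed list of codepoint ranges, combined with a category filter so that a broad range does not pull in control characters or punctuation), something like the following could work. The range values, the set of allowed categories, and the function names are all hypothetical, not taken from the script.

```python
import unicodedata

# Hypothetical single list of codepoint ranges, shared by all categories.
CODEPOINT_RANGES = (
    (0x0000, 0x00FF),  # Basic Latin and Latin-1, deliberately broad
    (0xFB00, 0xFB06),  # Latin ligatures in Alphabetic Presentation Forms
)

# Letters, marks and symbols; control (C*) and punctuation (P*) are excluded.
ALLOWED_CATEGORY_PREFIXES = ("L", "M", "S")


def in_ranges(ch):
    """True if ch lies in any of the collapsed codepoint ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in CODEPOINT_RANGES)


def is_candidate(ch):
    """True only if ch is in range AND has an allowed general category."""
    category = unicodedata.category(ch)
    return in_ranges(ch) and category.startswith(ALLOWED_CATEGORY_PREFIXES)


for sample in ("\u00e9", "\u00a9", "\ufb01", "\n", "!"):
    print("U+%04X %s %s" % (ord(sample), unicodedata.category(sample),
                            is_candidate(sample)))
```

With both tests in place, the newline and the exclamation mark are in range but rejected by category, while the accented letter, the copyright sign, and the ligature pass.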

There are a number of other characters that appear in unaccent.rules that aren't generated by the script. I've attached a diff of the output of generate_unaccent_rules (using the version before my changes, to simplify matters) and unaccent.rules. Unfortunately, I don't know how to interpret most of these characters.

I suppose it's valid to ask if changing © to (C) is even something an "unaccent" function should do. Given that it's in the existing rules file, should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)

