Re: BUG #15548: Unaccent does not remove combining diacritical characters - Mailing list pgsql-bugs

From Hugh Ranalli
Subject Re: BUG #15548: Unaccent does not remove combining diacritical characters
Msg-id CAAhbUMNyZ+PhNr_mQ=G161K0-hvbq13Tz2is9M3WK+yX9cQOCw@mail.gmail.com
In response to Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Hugh Ranalli <hugh@whtc.ca>)
Responses Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Michael Paquier <michael@paquier.xyz>)
Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-bugs
Okay, I've tried to separate everything cleanly. The patches are numbered in the order in which they should be applied. Each patch contains all the updates appropriate to that version (i.e., if the change would modify unaccent.rules, those changes are also in the patch):

01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible. The approach I have taken is "native" Python 3 compatibility with adjustments for Python 2. There's a marked block at the beginning of the file that can be removed whenever Python 2 support is dropped. I haven't followed the recommended practice of importing the "past" or "future" modules, as the changes are minimal, and these are just additional dependencies that need to be installed separately, which didn't seem to make sense for a utility script. This patch also updates sql/unaccent.sql to UTF-8 format. 

02 - Updates generate_unaccent_rules.py to work with all versions (I tested r28 and r34) of the Latin-ASCII transliteration file. It also updates unaccent.rules to have the output of the r34 transliteration file. This patch should work without the 01 patch.

03 - Updates generate_unaccent_rules.py to remove combining diacritical marks. It also updates unaccent.rules with the revised output, and adds tests to sql/unaccent.sql. It will not work or apply if the 01 patch is not applied. It should apply without the 02 patch.
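
To illustrate what removing combining diacritical marks means in practice, here is a minimal sketch in Python 3. It is not the code from the 03 patch: the helper names (is_combining_mark, strip_combining_marks, removal_rule) are hypothetical, and identifying marks by the Unicode general categories Mn and Me is an assumption made for the example. The only unaccent behavior relied on is that a rules line containing a single character causes that character to be deleted.

```python
import unicodedata


def is_combining_mark(ch):
    # Assumption for this sketch: treat general categories Mn (nonspacing
    # mark) and Me (enclosing mark) as "combining diacritical marks".
    return unicodedata.category(ch) in ("Mn", "Me")


def strip_combining_marks(text):
    # Drop every combining mark and keep all other characters unchanged.
    return "".join(ch for ch in text if not is_combining_mark(ch))


def removal_rule(ch):
    # In unaccent.rules, a line holding only a source character
    # (no replacement) means "delete this character".
    return ch + "\n"


if __name__ == "__main__":
    decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
    print(strip_combining_marks(decomposed))  # prints "e"
```

A precomposed character such as U+00E9 is a single code point and is left alone by this sketch; it only addresses the decomposed form, where the accent is a separate character.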

When you look at unaccent.rules generated by the 03 version, there may appear to be blank lines. I've checked and they're not blank. They are characters which are only visible with other characters in front of them, at least in my editor.
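
For anyone who wants to confirm that such lines are not actually blank, a short Python 3 sketch along these lines may help. The function names are hypothetical, and the Mn/Me category check is again an assumption rather than what the patch itself uses.

```python
import unicodedata


def describe_line(line):
    # List the code point and Unicode name of each character on the line,
    # ignoring the trailing line terminator.
    body = line.rstrip("\r\n")
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in body]


def looks_blank_but_is_not(line):
    # True when the line has content, but every character is a combining
    # mark, so an editor may render it as if it were empty.
    body = line.rstrip("\r\n")
    return len(body) > 0 and all(
        unicodedata.category(ch) in ("Mn", "Me") for ch in body
    )


if __name__ == "__main__":
    sample = "\u0301\n"  # a line holding only COMBINING ACUTE ACCENT
    print(looks_blank_but_is_not(sample))
    print(describe_line(sample))
```

Running describe_line over each line of unaccent.rules (opened with encoding="utf-8") would show which code point sits on every line that appears empty.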

I'll go update the CommitFest now. I hope I've covered everything; please let me know if there's anything I've missed.

Best wishes,
Hugh

Attachment
