Re: BUG #15548: Unaccent does not remove combining diacritical characters - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #15548: Unaccent does not remove combining diacritical characters
Msg-id CAEepm=0qb_nx-f8cACS1=1NdmCj-3D9zXFU+RJHsFbZEztcqjg@mail.gmail.com
In response to Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Hugh Ranalli <hugh@whtc.ca>)
Responses Re: BUG #15548: Unaccent does not remove combining diacritical characters
List pgsql-bugs
On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hugh Ranalli <hugh@whtc.ca> writes:
>> > I've attached two patches, one to update generate_unaccent_rules.py, and
>> > another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
>
> Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached
> and will add to CF. Let me know if you see anything amiss.

+ʹ    '
+ʺ    "
+ʻ    '
+ʼ    '
+ʽ    '
+˂    <
+˃    >
+˄    ^
+ˆ    ^
+ˈ    '
+ˋ    `
+ː    :
+˖    +
+˗    -
+˜    ~

I don't think this is quite right.  Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all.  Here is my attempt to come up with a test case using
combining characters:

  select unaccent('un café crème s''il vous plaît');

It's not stripping the accents.  I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:

$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)
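
For anyone who would rather not hand-edit hex, the same file contents can
probably be produced with a few lines of Python. This is only a sketch: it
assumes NFD normalization yields the same decomposed sequence as the file
above, and the variable names are made up for illustration.

```python
import unicodedata

# Precomposed form of the test sentence, written with escapes so the
# snippet's own encoding is unambiguous.
precomposed = "un caf\u00e9 cr\u00e8me s'il vous pla\u00eet"

# NFD splits each accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)

# The combining marks, in order of appearance, and their UTF-8 bytes.
marks = [c for c in decomposed if unicodedata.combining(c)]
mark_bytes = [c.encode("utf-8").hex() for c in marks]

# Rebuild the SQL file: the single quote is doubled for SQL, and the
# hexdump shows two trailing newlines.
sql = "select unaccent('" + decomposed.replace("'", "''") + "');\n\n"
sql_bytes = sql.encode("utf-8")

print([f"U+{ord(c):04X}" for c in marks])
print(mark_bytes)
print(len(sql_bytes))
```

Writing sql_bytes to a file should give something byte-identical to the
hexdump, which can then be run with psql -f as before.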

[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
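
To double-check the point about the patch's new rules, here is a rough
Python sketch that tests whether those characters fall in the Combining
Diacritical Marks block (U+0300 through U+036F). The code points are my
reading of the characters in the quoted diff, and strip_combining_marks is
a hypothetical helper showing the "replace with nothing" behaviour I had
in mind, not what unaccent does today.

```python
import unicodedata

# Code points of the characters on the left-hand side of the quoted rules
# (assumed from the diff above).
patch_chars = (
    "\u02b9\u02ba\u02bb\u02bc\u02bd\u02c2\u02c3\u02c4"
    "\u02c6\u02c8\u02cb\u02d0\u02d6\u02d7\u02dc"
)

# The Combining Diacritical Marks block.
COMBINING_BLOCK = range(0x0300, 0x0370)


def in_combining_block(ch):
    """True if ch lies in U+0300..U+036F."""
    return ord(ch) in COMBINING_BLOCK


def strip_combining_marks(text):
    """Drop every character in the combining block, replacing it with nothing."""
    return "".join(c for c in text if not in_combining_block(c))


for ch in patch_chars:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), in_combining_block(ch))

print(strip_combining_marks("un cafe\u0301 cre\u0300me s'il vous plai\u0302t"))
```

None of the patch characters should report True, whereas the three marks
in the test case are all inside the block.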

--
Thomas Munro
http://www.enterprisedb.com
