Re: BUG #15548: Unaccent does not remove combining diacritical characters - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #15548: Unaccent does not remove combining diacritical characters
Msg-id 11345.1545114237@sss.pgh.pa.us
In response to Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Michael Paquier <michael@paquier.xyz>)
Responses Re: BUG #15548: Unaccent does not remove combining diacritical characters
List pgsql-bugs
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
>> tl;dr: I think we should convert unaccent.sql and unaccent.out
>> to UTF8 encoding.  Then, adding more test cases for this patch
>> will be easy.

> Do you think that we could also remove the non-ASCII characters from the
> tests?  It would be easy enough to use E'\xNN' (utf8 hex) or such in
> input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder
to verify by eyeball.  With the current setup --- other than the
uncommon-outside-Russia encoding choice --- you don't really need
to read or speak Russian to see that this:

SELECT unaccent('ёлка');
 unaccent 
----------
 елка
(1 row)

probably represents unaccent doing what it ought to.  If everything
is in hex then it's a lot harder.
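
To make the readability argument concrete, here is a minimal sketch in Python (not part of the unaccent test suite, and not how unaccent itself works). It prints the same word in the hex-escaped form proposed above, then strips accents from it. The helper strip_marks is a hypothetical stand-in that uses Unicode decomposition; the real module works from unaccent.rules.

```python
import unicodedata

# The input word from the psql example above.
word = "ёлка"

# The same word as UTF-8 hex escapes, roughly what a hex-only test would contain.
hex_form = "".join(f"\\x{b:02x}" for b in word.encode("utf-8"))


def strip_marks(text: str) -> str:
    """Hypothetical stand-in for unaccent (an assumption, not PostgreSQL code):
    decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(hex_form)           # the hex-escaped spelling of the input
    print(strip_marks(word))  # the readable, accent-free spelling
```

Compared side by side, the hex line gives no visual hint of which character changed, while the literal form shows it at a glance.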

Ten years ago I might've agreed with your point, but today it's
hard to believe that anyone who takes any interest at all in
unaccent's functionality would not have a UTF8-capable terminal.

> That's harder to read; still, we
> discussed not using UTF-8 in the Python script, to allow folks with
> simple terminals to touch the code, the last time this was touched
> (5e8d670), and the characters used could be documented as comments in the
> tests.

Maybe I'm misremembering, but I thought that discussion was about the
code files.  I am still mistrustful of non-ASCII in our code files.
But for data and test files, we've been accepting UTF8 ever since the
text-search-in-core stuff landed.  Heck, unaccent.rules itself is UTF8.

            regards, tom lane

