Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
>> tl;dr: I think we should convert unaccent.sql and unaccent.out
>> to UTF8 encoding. Then, adding more test cases for this patch
>> will be easy.
> Do you think that we could also remove the non-ASCII characters from the
> tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in
> input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder
to verify by eyeball. With the current setup --- other than the
uncommon-outside-Russia encoding choice --- you don't really need
to read or speak Russian to see that this:

    SELECT unaccent('ёлка');
     unaccent
    ----------
     елка
    (1 row)

probably represents unaccent doing what it ought to. If everything
is in hex then it's a lot harder.
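
To make the hex point concrete, here is a minimal Python sketch, not part of the unaccent tests. The helper name utf8_hex is made up for this illustration. It prints the \xNN escapes that the input and the expected output above would turn into under the hex proposal.

```python
# Illustration only: what the readable test strings look like as UTF-8 hex.
# utf8_hex is a hypothetical helper, not anything in the PostgreSQL tree.

def utf8_hex(s: str) -> str:
    """Return the UTF-8 bytes of s written as a run of \\xNN escapes."""
    return "".join(f"\\x{b:02x}" for b in s.encode("utf-8"))

readable_in = "ёлка"    # test input, as in the SELECT above
readable_out = "елка"   # expected unaccent result, as in the output above

hex_in = utf8_hex(readable_in)
hex_out = utf8_hex(readable_out)

print(hex_in)   # \xd1\x91\xd0\xbb\xd0\xba\xd0\xb0
print(hex_out)  # \xd0\xb5\xd0\xbb\xd0\xba\xd0\xb0
```

The two strings differ only in their first two bytes (d1 91 versus d0 b5), which is easy to miss by eye, whereas ё versus е is visible at a glance.
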
Ten years ago I might've agreed with your point, but today it's
hard to believe that anyone who takes any interest at all in
unaccent's functionality would not have a UTF8-capable terminal.

> That's harder to read, but we
> discussed not using UTF-8 in the Python script, to allow folks with
> simple terminals to touch the code, the last time this was touched
> (5e8d670), and the characters used could be documented as comments in the
> tests.

Maybe I'm misremembering, but I thought that discussion was about the
code files. I am still mistrustful of non-ASCII in our code files.
But for data and test files, we've been accepting UTF8 ever since the
text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

regards, tom lane