On Tue, Dec 18, 2018 at 03:05:00PM +1100, Thomas Munro wrote:
> I don't think this is quite right. Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all. Here is my attempt to come up with a test case using
> combining characters:
>
> select unaccent('un café crème s''il vous plaît');
>
> It's not stripping the accents. I've attached that in a file for
> reference so you can run it with psql -f x.sql, and you can see that
> it's using combining code points (code points 0301, 0300, 0302 which
> come out as cc81, cc80, cc82 in UTF-8) like so:
Could you also add some tests to contrib/unaccent/sql/unaccent.sql at
the same time? That would make it easy to check the extent of the
patches proposed on this thread.
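
For anyone following along, here is a minimal Python sketch of what the
quoted test string contains. It only illustrates the code points and
their UTF-8 bytes mentioned above, plus the result one would expect if
the combining marks were replaced with nothing. It is not how unaccent
works, and the helper names utf8_hex and strip_combining are made up
for this example.

```python
import unicodedata

# The test string from the thread, written with explicit combining code
# points (U+0301 acute, U+0300 grave, U+0302 circumflex) following plain
# ASCII letters, instead of precomposed accented letters.
text = "un cafe\u0301 cre\u0300me s'il vous plai\u0302t"


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of a string as lowercase hex."""
    return ch.encode("utf-8").hex()


def strip_combining(s: str) -> str:
    """Drop every character with a non-zero canonical combining class."""
    return "".join(c for c in s if unicodedata.combining(c) == 0)


# Show how each combining mark is encoded in UTF-8.
for mark in ("\u0301", "\u0300", "\u0302"):
    print(f"U+{ord(mark):04X} -> {utf8_hex(mark)}")

# Show the string with the combining marks removed entirely.
print(strip_combining(text))
```

The last line prints the string with the marks removed and the base
letters left untouched, which is the behaviour Thomas describes as the
goal ("replace them with nothing at all").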
--
Michael