Re: BUG #15347: Unaccent for greek characters does not work - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #15347: Unaccent for greek characters does not work
Msg-id CAEepm=3a_5y+COG6AM0UFZXb4MQmxSdpQK3oGnri1kaP+Uqx5A@mail.gmail.com
In response to BUG #15347: Unaccent for greek characters does not work  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
On Thu, Aug 23, 2018 at 3:08 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      15347
> Logged by:          Tasos Maschalidis
> Email address:      tas.o.s@hotmail.com
> PostgreSQL version: 9.3.18
> Operating system:   Ubuntu 4.8.4
> Description:
>
> Call to unaccent function with greek characters does not return the greek
> characters without the accents as expected (not even just the few diacritics
> used in modern Greek).

Hello Tasos,

Right.  We generate the unaccent.rules file from the Unicode data file
using the Python script contrib/unaccent/generate_unaccent_rules.py in
the PostgreSQL source tree.  The script currently limits itself to
Latin characters here:

def is_plain_letter(codepoint):
    """Return true if codepoint represents a plain ASCII letter."""
    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z'))

I was not brave enough to support other kinds of characters, because I
can't read 'em and check if the results are garbage (if you remove the
diacritics from Klingon, it might change the meaning of any word into
a declaration of war for all I know).  If you know Python and would
like to have a go at modifying that script to support Greek, please
do!  Otherwise perhaps I could try to do it and you could review the
results.
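As a starting point for such a patch, here is a minimal sketch of what
widening the "plain letter" test to Greek might look like.  It is not the
code in generate_unaccent_rules.py: it takes a bare integer code point
rather than the script's codepoint object, it uses Python's standard
unicodedata module instead of parsing the Unicode data file, and the
strip_marks helper and the two Greek ranges are assumptions made for
illustration only.  A native speaker would still need to review the
output.

```python
import unicodedata


def is_plain_letter(codepoint_id):
    """Return True for a plain ASCII letter or an unaccented Greek letter.

    The Greek ranges are an assumption for this sketch, not what the
    real script does today.
    """
    return (ord('a') <= codepoint_id <= ord('z') or
            ord('A') <= codepoint_id <= ord('Z') or
            0x03B1 <= codepoint_id <= 0x03C9 or  # alpha..omega, incl. final sigma
            0x0391 <= codepoint_id <= 0x03A9)    # Alpha..Omega (U+03A2 is unassigned)


def strip_marks(char):
    """Hypothetical helper: drop combining marks from a single character.

    The character is returned unchanged unless what remains after
    canonical decomposition is exactly one plain letter.
    """
    decomposed = unicodedata.normalize('NFD', char)
    base = ''.join(c for c in decomposed if not unicodedata.combining(c))
    if len(base) == 1 and is_plain_letter(ord(base)):
        return base
    return char


if __name__ == '__main__':
    # Greek alpha with tonos, Greek iota with dialytika and tonos,
    # Latin e with acute, Cyrillic short i.
    for ch in ['\u03ac', '\u0390', '\u00e9', '\u0439']:
        print('U+%04X -> U+%04X' % (ord(ch), ord(strip_marks(ch))))
```

With these ranges, U+03AC (alpha with tonos) maps to U+03B1 (plain
alpha), while Cyrillic U+0439 is left alone because its base letter is
outside the accepted ranges.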

There is already a precedent: the script knows how to remove a diacritic
from at least one Cyrillic character.  I see no reason at all why we
shouldn't take a patch to support Greek or any other alphabet that a
native speaker can advise us on.

I think the chances of squeaking a change into PostgreSQL 11 are slim,
since it would require a special exception from the Release Management
Team at this point.  Failing that, it'd be for PostgreSQL 12.  We
don't usually back-patch unaccent.rules changes because they can
affect indexed data, and we don't want minor version upgrades to
break stuff.

[1] https://www.postgresql.org/message-id/CAEepm%3D1KRVinFtuDao4L%2BqSBh4T4k3z996EwD5-zgytu4Qa5Fw%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

