Re: BUG #15347: Unaccent for greek characters does not work - Mailing list pgsql-bugs

From Tasos Maschalidis
Subject Re: BUG #15347: Unaccent for greek characters does not work
Date
Msg-id VI1PR01MB38537EBD529FE5EE3FE9A5FEB5370@VI1PR01MB3853.eurprd01.prod.exchangelabs.com
In response to Re: BUG #15347: Unaccent for greek characters does not work  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: BUG #15347: Unaccent for greek characters does not work
List pgsql-bugs

Hi Thomas,

Your concerns are understandable, especially when Klingon is taken into consideration.

I am not familiar enough with Python to set up something that runs the script and checks the result, but I am more than willing to review the results! If you need any more input from me (as a native Greek speaker), please ask away!

If I understood correctly, would the method need to change to something like this to include the Greek characters?

    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z')) or \
           (codepoint.id >= ord('α') and codepoint.id <= ord('ω')) or \
           (codepoint.id >= ord('Α') and codepoint.id <= ord('Ω'))
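
The proposed check can be tried in isolation. The following is only a sketch: `Codepoint` is a hypothetical stand-in for whatever object the script passes in (only its `id` attribute is assumed), and `PLAIN_RANGES` is a name introduced here for illustration, not something from generate_unaccent_rules.py.

```python
from collections import namedtuple

# Hypothetical stand-in for the script's codepoint objects; only `id` is used.
Codepoint = namedtuple("Codepoint", ["id"])

# Inclusive code point ranges treated as "plain" letters (ASCII plus basic Greek).
PLAIN_RANGES = [
    (ord('a'), ord('z')),
    (ord('A'), ord('Z')),
    (ord('\u03b1'), ord('\u03c9')),  # Greek small alpha..omega, includes final sigma
    (ord('\u0391'), ord('\u03a9')),  # Greek capital alpha..omega (U+03A2 is unassigned)
]

def is_plain_letter(codepoint):
    """Return true if codepoint is a plain ASCII or basic Greek letter."""
    return any(lo <= codepoint.id <= hi for lo, hi in PLAIN_RANGES)

# Unaccented letters pass; accented ones (which live outside these ranges) do not.
print(is_plain_letter(Codepoint(ord('\u03b1'))))  # alpha
print(is_plain_letter(Codepoint(ord('\u03ac'))))  # alpha with tonos
```

The accented Greek letters used in modern monotonic spelling sit outside both ranges, so they are not mistaken for plain letters by this check.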

Thanks,

Tasos Maschalidis

PS: This gist shows what the results should look like once Greek characters are handled (lines 190-409).

From: Thomas Munro <thomas.munro@enterprisedb.com>
Sent: Thursday, August 23, 2018 8:22:21 AM
To: tas.o.s@hotmail.com; PostgreSQL mailing lists
Subject: Re: BUG #15347: Unaccent for greek characters does not work
 
On Thu, Aug 23, 2018 at 3:08 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      15347
> Logged by:          Tasos Maschalidis
> Email address:      tas.o.s@hotmail.com
> PostgreSQL version: 9.3.18
> Operating system:   Ubuntu 4.8.4
> Description:
>
> Call to unaccent function with greek characters does not return the greek
> characters without the accents as expected (not even just the few diacritics
> used in modern Greek).

Hello Tasos,

Right.  We generate the unaccent.rules file from the Unicode data file
using the Python script contrib/unaccent/generate_unaccent_rules.py in
the PostgreSQL source tree.  The script currently limits itself to
Latin characters here:

def is_plain_letter(codepoint):
    """Return true if codepoint represents a plain ASCII letter."""
    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z'))

I was not brave enough to support other kinds of characters, because I
can't read 'em and check if the results are garbage (if you remove the
diacritics from Klingon, it might change the meaning of any word into
a declaration of war for all I know).  If you know Python and would
like to have a go at modifying that script to support Greek, please
do!  Otherwise perhaps I could try to do it and you could review the
results.
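
For anyone who wants a rough preview of what Greek rules might look like before touching the real script, here is a minimal sketch. It does not use generate_unaccent_rules.py or the Unicode data file; it assumes Python 3 and relies on the standard `unicodedata` module instead. The function name `greek_unaccent_rules` is made up for this illustration, and the output still needs review by a native speaker.

```python
import unicodedata

def greek_unaccent_rules():
    """Yield (accented, base) pairs for precomposed letters in the Greek block."""
    for cp in range(0x0370, 0x0400):  # Greek and Coptic block
        ch = chr(cp)
        # NFD expands a precomposed letter into its base letter plus combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) < 2:
            continue  # unassigned, or nothing to strip
        base, marks = decomposed[0], decomposed[1:]
        is_plain_greek = ('\u03b1' <= base <= '\u03c9') or ('\u0391' <= base <= '\u03a9')
        # Keep only letters whose trailing code points are all combining marks.
        if is_plain_greek and all(unicodedata.combining(m) for m in marks):
            yield ch, base

rules = dict(greek_unaccent_rules())
for accented, base in sorted(rules.items()):
    # Tab-separated pairs, one per line, for eyeballing the mapping.
    print(accented + "\t" + base)
```

This only covers the basic Greek block; polytonic letters in the Greek Extended block would need the range widened, and whether stripping every mark is linguistically acceptable is exactly the kind of question a native speaker should answer.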

There is a precedent already that it knows how to remove a diacritic
from at least one Cyrillic character.  I think there is no reason at
all we shouldn't take a patch to support Greek or any other alphabet
that a native speaker can advise us on.

I think the chances of squeaking a change into PostgreSQL 11 are slim,
since it would require a special exception from the Release Management
Team at this point.  Failing that, it'd be for PostgreSQL 12.  We
don't usually back-patch unaccent.rules changes because they can
affect indexed data, and we don't want minor version upgrades to
break stuff.

[1] https://www.postgresql.org/message-id/CAEepm%3D1KRVinFtuDao4L%2BqSBh4T4k3z996EwD5-zgytu4Qa5Fw%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
