Re: BUG #15347: Unaccent for greek characters does not work - Mailing list pgsql-bugs
From | Tasos Maschalidis |
---|---|
Subject | Re: BUG #15347: Unaccent for greek characters does not work |
Date | |
Msg-id | VI1PR01MB38537EBD529FE5EE3FE9A5FEB5370@VI1PR01MB3853.eurprd01.prod.exchangelabs.com |
In response to | Re: BUG #15347: Unaccent for greek characters does not work (Thomas Munro <thomas.munro@enterprisedb.com>) |
Responses | Re: BUG #15347: Unaccent for greek characters does not work |
List | pgsql-bugs |
Hi Thomas,
Your concerns are understandable, especially when Klingon is taken into consideration.
I am not familiar enough with Python to set something up to run the script and check the result, but I am more than willing to review the results! If you need any more input on my part (as a native Greek speaker), please ask away!
If I understood correctly, to include the Greek characters, would the method need to change to this?
    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z')) or \
           (codepoint.id >= ord('α') and codepoint.id <= ord('ω')) or \
           (codepoint.id >= ord('Α') and codepoint.id <= ord('Ω'))
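For illustration only, here is a minimal self-contained sketch of that proposed check. The Codepoint namedtuple is a hypothetical stand-in for the object the script passes in (only its id attribute is assumed), and nothing here has been run against the real script. It shows that the two Greek ranges accept the unaccented letters, including final sigma, while the precomposed accented letters fall outside them.

```python
from collections import namedtuple

# Hypothetical stand-in for the script's codepoint object; only `.id` is assumed.
Codepoint = namedtuple("Codepoint", ["id"])


def is_plain_letter(codepoint):
    """Return True if codepoint is a plain ASCII letter or a plain Greek letter."""
    return (ord('a') <= codepoint.id <= ord('z')) or \
           (ord('A') <= codepoint.id <= ord('Z')) or \
           (ord('\u03b1') <= codepoint.id <= ord('\u03c9')) or \
           (ord('\u0391') <= codepoint.id <= ord('\u03a9'))  # alpha..omega, Alpha..Omega


# Plain letters (U+03C2 is final sigma, which sits inside the lowercase range).
plain = ["a", "Z", "\u03b1", "\u03c9", "\u03c2", "\u0391", "\u03a9"]
# Precomposed accented letters: alpha, omega, capital Alpha with tonos; iota with dialytika.
accented = ["\u03ac", "\u03ce", "\u0386", "\u03ca"]

for ch in plain + accented:
    print(ch, is_plain_letter(Codepoint(ord(ch))))
```

One detail worth knowing: the uppercase range also covers U+03A2, which is unassigned in Unicode, so it should not matter in practice.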
Thanks,
Tasos Maschalidis
PS: This gist shows what the results should look like for Greek characters (lines 190-409).
Sent: Thursday, August 23, 2018 8:22:21 AM
To: tas.o.s@hotmail.com; PostgreSQL mailing lists
Subject: Re: BUG #15347: Unaccent for greek characters does not work
<noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference: 15347
> Logged by: Tasos Maschalidis
> Email address: tas.o.s@hotmail.com
> PostgreSQL version: 9.3.18
> Operating system: Ubuntu 4.8.4
> Description:
>
> Call to unaccent function with greek characters does not return the greek
> characters without the accents as expected (not even just the few diacritics
> used in modern Greek).
Hello Tasos,
Right. We generate the unaccent.rules file from the Unicode data file
using the Python script contrib/unaccent/generate_unaccent_rules.py in
the PostgreSQL source tree. The script currently limits itself to
Latin characters here:
def is_plain_letter(codepoint):
    """Return true if codepoint represents a plain ASCII letter."""
    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z'))
I was not brave enough to support other kinds of characters, because I
can't read 'em and check if the results are garbage (if you remove the
diacritics from Klingon, it might change the meaning of any word into
a declaration of war for all I know). If you know Python and would
like to have a go at modifying that script to support Greek, please
do! Otherwise perhaps I could try to do it and you could review the
results.
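As a rough way to preview what Greek rules might look like for review, the sketch below strips diacritics using Python's standard unicodedata module. This is a swapped-in technique for illustration, not how generate_unaccent_rules.py works (that script reads the Unicode data file); the function name strip_marks and the sample list are assumptions made up for this example.

```python
import unicodedata


def strip_marks(ch):
    """Decompose ch canonically (NFD) and drop combining marks, keeping the base letter."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Precomposed accented letters used in modern (monotonic) Greek:
# lowercase with tonos, with dialytika, with both, then uppercase with tonos.
samples = (
    "\u03ac\u03ad\u03ae\u03af\u03cc\u03cd\u03ce"
    "\u03ca\u03cb\u0390\u03b0"
    "\u0386\u0388\u0389\u038a\u038c\u038e\u038f"
)

# Print candidate "accented <tab> unaccented" pairs for a native speaker to check.
for src in samples:
    print(f"{src}\t{strip_marks(src)}")
```

The output is only a list of candidate pairs; whether each mapping is linguistically right is exactly the part that needs a native speaker's review.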
There is a precedent already that it knows how to remove a diacritic
from at least one Cyrillic character. I think there is no reason at
all we shouldn't take a patch to support Greek or any other alphabet
that a native speaker can advise us on.
I think the chances of squeaking a change into PostgreSQL 11 are slim,
since it would require a special exception from the Release Management
Team at this point. Failing that, it'd be for PostgreSQL 12. We
don't usually back-patch unaccent.rules changes because they can
affect indexed data, and we don't want minor version upgrades to
break stuff.
[1] https://www.postgresql.org/message-id/CAEepm%3D1KRVinFtuDao4L%2BqSBh4T4k3z996EwD5-zgytu4Qa5Fw%40mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com