Thread: BUG #18362: unaccent rules and Old Greek text
The following bug has been logged on the website:

Bug reference: 18362
Logged by: Cees van Zeeland
Email address: cees.van.zeeland@freedom.nl
PostgreSQL version: 15.6
Operating system: Windows 11

Description:

I am using a PostgreSQL server 15.6-1 with UTF-8 and am struggling with the unaccent extension and "Old Greek" characters. To reproduce the behaviour I encountered, try this:

1. Create a table with one text field:

CREATE TABLE IF NOT EXISTS public.test
(
    entry text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT test_pkey PRIMARY KEY (entry)
);

2. Insert the following Greek words, whose vowels carry stress accents, or import the CSV file with the same items:

ἀνήρ (== man)
πέντε (== five)
γίγας (== giant)
γράφω (== write)
δύο (== two)
ἐγώ (== I)
θεός (== god)

3. Create the following view for searching:

CREATE OR REPLACE VIEW public.test_view AS
SELECT test.entry,
       COALESCE(array_to_string(ts_lexize('unaccent'::regdictionary,
                                          replace(test.entry, 'ς'::text, 'σ'::text)), ''::text),
                replace(test.entry, 'ς'::text, 'σ'::text)) AS search_entry
FROM test
ORDER BY test.entry;

4. Check whether it works:

SELECT entry, search_entry FROM public.test_view;

The result shows that not all diacritics are removed. When I search in unaccent.rules I see, around line 530, characters that look the same but are in fact different, e.g. Greek Small Letter Epsilon with Tonos versus Greek Small Letter Epsilon with Oxia. I found a discussion about this subject here: https://ibiblio.org/bgreek/forum/viewtopic.php?t=4170

So there are reasons to keep the current unaccent.rules as it is, but there are also reasons to add a few lines to it, e.g. after line 955, inserting the Greek vowels with Oxia. Please add:

ά α
έ ε
ή η
ί ι
ό ο
ύ υ
ώ ω

It would solve the problem and make searching through Old Greek texts a lot easier.

Thanks for your help,
Cees van Zeeland
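To make the reported behaviour concrete: the two accented forms really are distinct code points that merely render alike. A minimal Python sketch (not part of the original report) using only the standard unicodedata module demonstrates this:

# Hedged illustration of the report: the "tonos" and "oxia" precomposed
# letters are distinct code points, although most fonts draw them the same.
import unicodedata

tonos = "\u03AC"  # GREEK SMALL LETTER ALPHA WITH TONOS (mapped in unaccent.rules)
oxia = "\u1F71"   # GREEK SMALL LETTER ALPHA WITH OXIA (not mapped)

print(tonos == oxia)            # False: different code points
print(unicodedata.name(tonos))  # GREEK SMALL LETTER ALPHA WITH TONOS
print(unicodedata.name(oxia))   # GREEK SMALL LETTER ALPHA WITH OXIA

# Both are canonically equivalent: they NFD-normalize to the same sequence,
# alpha (U+03B1) followed by COMBINING ACUTE ACCENT (U+0301).
assert unicodedata.normalize("NFD", tonos) == unicodedata.normalize("NFD", oxia)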
On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> So there are reasons to keep the current unaccent.rules as it is, but
> there are also reasons to add a few lines to it, e.g. after line 955,
> inserting the Greek vowels with Oxia. Please add:
> ά α
> έ ε
> ή η
> ί ι
> ό ο
> ύ υ
> ώ ω

Hi,

We don't exactly maintain this list manually; we extract it from Unicode source data. Can you see what needs to be adjusted in here to achieve that goal?

https://github.com/postgres/postgres/blob/master/contrib/unaccent/generate_unaccent_rules.py

Perhaps a new range or something like that?
On Sun, Feb 25, 2024 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > So there are reasons to keep the current unaccent.rules as it is, but
> > there are also reasons to add a few lines to it, e.g. after line 955,
> > inserting the Greek vowels with Oxia. Please add:
> > ά α

Oh, I think I see it. "ά" is:

1F71;GREEK SMALL LETTER ALPHA WITH OXIA;Ll;0;L;03AC;;;;N;;;1FBB;;1FBB

The Python script is looking for combining sequences that add accents, but this one has just "03AC" in the decomposition field, so it's a kind of "simple" redirection that points here:

03AC;GREEK SMALL LETTER ALPHA WITH TONOS;Ll;0;L;03B1 0301;;;;N;GREEK SMALL LETTER ALPHA TONOS;;0386;;0386

That one has a normal-looking sequence that we can understand (α + an accent). If I tell the script to follow such "simple" redirections, I get over a thousand new mappings, including those. See attached. There is probably more correct terminology than what I'm using here...
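For readers following along, here is a rough, hedged sketch of the idea being described: recursively follow the canonical decomposition field until only a base letter plus combining marks remain. It uses Python's standard unicodedata module rather than the script's own UnicodeData.txt parser, and the function name is made up for illustration:

# A sketch (not the actual patch) of following "simple" redirections:
# unicodedata.decomposition() exposes the raw UnicodeData.txt decomposition
# field, e.g. "03AC" for U+1F71 and "03B1 0301" for U+03AC.
import unicodedata

def fully_decompose(codepoint):
    """Follow canonical decompositions recursively; stop at compat (<tag>) ones."""
    decomp = unicodedata.decomposition(chr(codepoint))
    if not decomp or decomp.startswith("<"):
        return [codepoint]
    result = []
    for field in decomp.split():
        result.extend(fully_decompose(int(field, 16)))
    return result

# U+1F71 (ALPHA WITH OXIA) -> U+03AC (ALPHA WITH TONOS) -> U+03B1 U+0301
print([hex(cp) for cp in fully_decompose(0x1F71)])  # ['0x3b1', '0x301']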
On Sun, Feb 25, 2024 at 04:21:36PM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> So there are reasons to keep the current unaccent.rules as it is, but
>> there are also reasons to add a few lines to it, e.g. after line 955,
>> inserting the Greek vowels with Oxia. Please add:
>> ά α
>> έ ε
>> ή η
>> ί ι
>> ό ο
>> ύ υ
>> ώ ω

Correct me if I'm wrong, of course, but reading a bit on the matter at [1], letters with Tonos or Oxia have actually been equivalent since 1986, and we only include the characters with Tonos in our unaccent.rules.

> We don't exactly maintain this list manually; we extract it from
> Unicode source data. Can you see what needs to be adjusted in here to
> achieve that goal?

See commits like e3dd7c06e627 or 59f47fb98dab for some references. Unfortunately, our policy has been not to backpatch any changes to the in-core rules file, though you can plug in your own file. That said, these additions sound like a natural addition seen from here.

> Perhaps a new range or something like that?

It seems to me that it is a bit more complicated than that, because UnicodeData.txt decomposes the characters with Oxia as characters with Tonos, and then the characters with Tonos are decomposed into the "base" alphabet characters + Tonos. We do a recursive lookup at the Unicode table in get_plain_letter() and is_letter_with_marks(), so it seems to me that we're not missing much, and I suspect that there should be no need for a new custom range of characters.

Cees, perhaps you would like to take a shot at that?

[1]: https://en.wikipedia.org/wiki/Greek_diacritics#Unicode
--
Michael
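For users stuck on released branches, Michael's "plug in your own file" remark suggests a workaround. The following is only a hedged sketch: the file name my_unaccent.rules, the target directory, and the chosen range are illustrative assumptions, not an official recipe.

# Sketch: generate a small supplementary rules file mapping the Greek vowels
# with varia/oxia (U+1F70..U+1F7D) to their plain base letters, suitable for
# appending to a copy of unaccent.rules under $SHAREDIR/tsearch_data.
import unicodedata

with open("my_unaccent.rules", "w", encoding="utf-8") as f:
    for cp in range(0x1F70, 0x1F7E):  # ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
        base = unicodedata.normalize("NFD", chr(cp))[0]  # strip the marks
        f.write(f"{chr(cp)}\t{base}\n")

# Then point a dictionary at the custom file (name given without extension):
#   CREATE TEXT SEARCH DICTIONARY my_unaccent
#       (TEMPLATE = unaccent, RULES = 'my_unaccent');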
On Mon, Feb 26, 2024 at 12:15:57PM +1300, Thomas Munro wrote:
> The Python script is looking for combining sequences that add accents,
> but this one has just "03AC" in the decomposition field, so it's
> a kind of "simple" redirection that points here:
>
> 03AC;GREEK SMALL LETTER ALPHA WITH TONOS;Ll;0;L;03B1 0301;;;;N;GREEK
> SMALL LETTER ALPHA TONOS;;0386;;0386
>
> That one has a normal-looking sequence that we can understand (α + an
> accent). If I tell the script to follow such "simple" redirections, I
> get over a thousand new mappings, including those. See attached.
> There is probably more correct terminology than what I'm using here...

Ah, you've beaten me to it. Yes, that's pretty much the impression I was getting while looking at the set of characters in UnicodeData.txt. I am not entirely sure whether what you are doing is the best way to do it, but the set of characters generated in unaccent.rules makes sense here. I am surprised to see that many, TBH.

Perhaps you should add a few characters from these series to unaccent.sql?
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Feb 26, 2024 at 12:15:57PM +1300, Thomas Munro wrote:
>> That one has a normal-looking sequence that we can understand (α + an
>> accent). If I tell the script to follow such "simple" redirections, I
>> get over a thousand new mappings, including those. See attached.
>> There is probably more correct terminology than what I'm using here...

> Ah, you've beaten me to it. Yes, that's pretty much the impression I
> was getting while looking at the set of characters in UnicodeData.txt.
> I am not entirely sure whether what you are doing is the best way to do
> it, but the set of characters generated in unaccent.rules makes sense
> here. I am surprised to see that many, TBH.

There are only about 1650 lines in our standard unaccent.rules file today. Are we concerned about adding so many more? I don't think the trie lookup logic would be slowed down any, but the time to load the rules file might take a hit.

regards, tom lane
Thomas Munro <thomas.munro@gmail.com> wrote:
> If I tell the script to follow such "simple" redirections, I
> get over a thousand new mappings, including those. See attached.
> There is probably more correct terminology than what I'm using here...

Michael Paquier <michael@paquier.xyz> wrote:
> UnicodeData.txt decomposes the characters with Oxia as characters with
> Tonos, and then the characters with Tonos are decomposed into the "base"
> alphabet characters + Tonos. We do a recursive lookup at the Unicode
> table in get_plain_letter() and is_letter_with_marks(), so it seems to
> me that we're not missing much, and I suspect that there should be no
> need for a new custom range of characters.
>
> Cees, perhaps you would like to take a shot at that?
>
> [1]: https://en.wikipedia.org/wiki/Greek_diacritics#Unicode
I'm not an expert, but computers obviously distinguish between the two versions of these characters.
We are talking about this series:
U+1F70 - U+1F7D: ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
Is it possible to filter or limit the redirection in the script to this range in some way?
~
Cees
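In case it helps frame Cees's question, a hypothetical range filter could look like the sketch below; the function name and the idea of gating redirections on a code-point range are illustrative only, not how generate_unaccent_rules.py is actually structured:

# Hypothetical sketch of limiting "simple" redirections to this one range.
GREEK_VARIA_OXIA = range(0x1F70, 0x1F7E)  # ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ

def follow_redirection(codepoint):
    """Only chase single-code-point decompositions inside the Greek range."""
    return codepoint in GREEK_VARIA_OXIA

print(follow_redirection(0x1F71))  # True: ά with oxia
print(follow_redirection(0x210C))  # False: ℌ would be left alone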
On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland <cees.van.zeeland@freedom.nl> wrote:
> I'm not an expert, but computers obviously distinguish between the two versions of these characters.
> We are talking about this series:
> U+1F70 - U+1F7D: ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
> Is it possible to filter or limit the redirection in the script to this range in some way?

Right, so to get this in we either need to decide that we're OK with adding that many characters, or figure out some systematic way to select just the ones we want. One hint that might be helpful if someone wants to investigate: I suspect that a lot of those mappings might be marked with <font>, which seems to be for code points for alternative renderings ("mathematical" bold, italic, fraktur, etc.), so perhaps we could filter them out that way without losing the oxia-marked characters, if that's the way it has to be.

I think all the relevant parts of the character database file are described here:

https://unicode.org/reports/tr44/#Property_Values

The file we're currently using is 15.1:

https://www.unicode.org/Public/15.1.0/ucd/UnicodeData.txt

I registered this thread as https://commitfest.postgresql.org/47/4873/ .
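The <font> hint can be checked directly against the Unicode data. As a hedged sketch (again using the standard unicodedata module, not the script itself), the tag at the start of the decomposition field distinguishes canonical redirections like the oxia letters from <font>-style compatibility variants:

# Sketch: classify a code point by its UnicodeData.txt decomposition tag.
import unicodedata

def decomposition_tag(codepoint):
    decomp = unicodedata.decomposition(chr(codepoint))
    if not decomp:
        return None                # no decomposition at all
    if decomp.startswith("<"):
        return decomp.split()[0]   # e.g. "<font>", "<fraction>"
    return "<canonical>"           # untagged = canonical mapping

print(decomposition_tag(0x1F71))   # <canonical>: ά with oxia -> ά with tonos
print(decomposition_tag(0x1D400))  # <font>: 𝐀 MATHEMATICAL BOLD CAPITAL A
print(decomposition_tag(0x210C))   # <font>: ℌ BLACK-LETTER CAPITAL H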
Hi Thomas,

I found:

https://www.unicode.org/Public/15.1.0/ucd/CompositionExclusions.txt

which might be useful for tackling the characters we are searching for. Hope this helps.

Cees

On 01/03/2024 02:53, Thomas Munro wrote:
> On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland
> <cees.van.zeeland@freedom.nl> wrote:
>> I'm not an expert, but computers obviously distinguish between the two versions of these characters.
>> We are talking about this series:
>> U+1F70 - U+1F7D: ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
>> Is it possible to filter or limit the redirection in the script to this range in some way?
> Right, so to get this in we either need to decide that we're OK with
> adding that many characters, or figure out some systematic way to
> select just the ones we want. One hint that might be helpful if
> someone wants to investigate: I suspect that a lot of those mappings
> might be marked with <font>, which seems to be for code points for
> alternative renderings ("mathematical" bold, italic, fraktur, etc.), so
> perhaps we could filter them out that way without losing the
> oxia-marked characters, if that's the way it has to be.
>
> I think all the relevant parts of the character database file are described here:
>
> https://unicode.org/reports/tr44/#Property_Values
>
> The file we're currently using is 15.1:
>
> https://www.unicode.org/Public/15.1.0/ucd/UnicodeData.txt
>
> I registered this thread as https://commitfest.postgresql.org/47/4873/ .
On Thu, Feb 29, 2024 at 8:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland
> <cees.van.zeeland@freedom.nl> wrote:
> > Is it possible to filter or limit the redirection in the script to this range in some way?
>
> Right, so to get this in we either need to decide that we're OK with
> adding that many characters, or figure out some systematic way to
> select just the ones we want.

There's a CommitFest entry for this thread at https://commitfest.postgresql.org/48/4873/ which is set to "Needs Review" — but does it really need review? There seem to be basically two related issues here. One is whether all of the mappings that the proposed change would create are correct; perhaps some of them are superfluous, or even wrong. The other is whether adding a lot of mappings is going to cause a problem with rule-file load times.

It's true that a reviewer could look into these questions, but I think what we need here is almost more like a volunteer to be a co-author or co-sponsor of the patch. I normally expect that when someone asks me to review a patch, they believe they've got something they already have good reasons to believe is correct, and what they want from me is to know whether I agree. It doesn't seem like this has progressed to that stage, so I rather wonder whether it ought to be evicted from the CommitFest entirely until it does. This perhaps sounds mean-spirited, but our CommitFest application contains an awful lot of things that aren't actually actionable, which makes it hard to find the things that are, so trying to improve that situation is the motivation here.

On the other hand, maybe this is actionable after all and we just need to make a decision and do something, so let's look at the practical questions.

1. The question of rule-file load times seems like something that anyone who could compile PostgreSQL with and without the patch applied could test in under an hour. They could then report the results they got, and people here could judge whether the resulting numbers are totally cool or very sad or something in between. Anyone willing to do that?

2. The question of which mappings we actually ought to be adding seems a lot harder, because it's not altogether clear what it means to "remove an accent". The proposed patch adds a whole lot of rules that turn tiny little characters into full-sized characters, and boldfaced and/or italicized and/or otherwise fancily printed characters into full-sized characters. Only a handful of the changes actually add rules that specifically *remove an accent*, but similar rules already exist, like turning ⅐ into the four-character sequence " 1/7", blocky-looking versions of each letter into standard versions, and ㍱ into the three-character sequence "hPa". So my naive guess would be that we want all of these rules, even though you would not guess from the unaccent documentation that it's supposed to do stuff like this.

But my knowledge of languages other than English is very limited, and I am not a user of unaccent and never have been, so I am reluctant to make grand pronouncements. Does anyone more knowledgeable want to opine?

--
Robert Haas
EDB: http://www.enterprisedb.com
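The existing examples Robert cites can be traced to compatibility decompositions with tags other than <font>; a quick, hedged check with Python's standard unicodedata module shows where they come from:

# The " 1/7" and "hPa" rules mirror Unicode compatibility decompositions.
import unicodedata

print(unicodedata.decomposition("\u2150"))  # ⅐: '<fraction> 0031 2044 0037'
print(unicodedata.decomposition("\u3371"))  # ㍱: '<square> 0068 0050 0061'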
On Tue, 2024-05-14 at 10:51 -0400, Robert Haas wrote:
> The question of which mappings we actually ought to be adding seems
> a lot harder, because it's not altogether clear what it means to
> "remove an accent". The proposed patch adds a whole lot of rules that
> turn tiny little characters into full-sized characters, and boldfaced
> and/or italicized and/or otherwise fancily printed characters into
> full-sized characters. Only a handful of the changes actually add
> rules that specifically *remove an accent*, but similar rules already
> exist, like turning ⅐ into the four-character sequence " 1/7",
> blocky-looking versions of each letter into standard versions, and ㍱
> into the three-character sequence "hPa". So my naive guess would be
> that we want all of these rules, even though you would not guess from
> the unaccent documentation that it's supposed to do stuff like this.
> But my knowledge of languages other than English is very limited, and
> I am not a user of unaccent and never have been, so I am reluctant to
> make grand pronouncements. Does anyone more knowledgeable want to
> opine?

I am not necessarily more knowledgeable, but I'll opine anyway.

As a German speaker, I wouldn't call the dieresis on "ü" an accent like the French é, è or ê, even though the current implementation of unaccent() turns it into a "u". And while most people would agree that the caret on â is an accent in the French language, I am not sure whether the same holds in Vietnamese. And I cannot see how ⅐ could be considered an accent...

Perhaps if we invented a function called convert_to_ascii() or similar instead of shoving all that into unaccent(), it would make more sense.

Yours,
Laurenz Albe
On 14.05.24 16:51, Robert Haas wrote:
> 1. The question of rule-file load times seems like something that
> anyone who could compile PostgreSQL with and without the patch applied
> could test in under an hour. They could then report the results they
> got, and people here could judge whether the resulting numbers are
> totally cool or very sad or something in between. Anyone willing to do
> that?

The rules are only loaded once, on first use, right? I tested with

date; for x in $(seq 1 1000); do psql -X -c "select unaccent('foobar')" -o /dev/null; done; date

and this had the same runtime (about 8 seconds here) with and without the patch.

Btw., with the patch I get

WARNING: duplicate source strings, first one will be used

so it will need some adjustments in how the rules are produced.
On 14.05.24 16:51, Robert Haas wrote:
> 2. The question of which mappings we actually ought to be adding seems
> a lot harder, because it's not altogether clear what it means to
> "remove an accent". The proposed patch adds a whole lot of rules that
> turn tiny little characters into full-sized characters, and boldfaced
> and/or italicized and/or otherwise fancily printed characters into
> full-sized characters. Only a handful of the changes actually add
> rules that specifically *remove an accent*, but similar rules already
> exist, like turning ⅐ into the four-character sequence " 1/7",
> blocky-looking versions of each letter into standard versions, and ㍱
> into the three-character sequence "hPa". So my naive guess would be
> that we want all of these rules, even though you would not guess from
> the unaccent documentation that it's supposed to do stuff like this.

unaccent actually does both accent removal and ligature expansion. (This is documented.) The cases you show above are ligature expansions. You can also run generate_unaccent_rules.py with --no-ligatures, and then you get a smaller list that indeed looks more like just accent removal. It does look like whatever it considers a ligature can produce some unintuitive results, though.
On Wed, May 15, 2024 at 2:45 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 14.05.24 16:51, Robert Haas wrote:
> The rules are only loaded once, on first use, right? I tested with
>
> date; for x in $(seq 1 1000); do psql -X -c "select unaccent('foobar')" -o /dev/null; done; date
>
> and this had the same runtime (about 8 seconds here) with and without
> the patch.

Cool. Sounds like that's not a problem.

> Btw., with the patch I get
>
> WARNING: duplicate source strings, first one will be used
>
> so it will need some adjustments in how the rules are produced.

OK. Does anyone want to look into that?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 15, 2024 at 3:01 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> unaccent actually does both accent removal and ligature expansion.
> (This is documented.) The cases you show above are ligature expansions.

Ah. Searching the unaccent documentation for the word "ligature", I found where this is mentioned. But at the top of the page, it says only that "unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes". Perhaps we should add a mention of ligature expansion there as well.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, May 16, 2024 at 1:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 15, 2024 at 2:45 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > and this had the same runtime (about 8 seconds here) with and without
> > the patch.
>
> Cool. Sounds like that's not a problem.

Thanks Peter for testing, and thanks Robert for kicking this thread.

> > Btw., with the patch I get
> >
> > WARNING: duplicate source strings, first one will be used
> >
> > so it will need some adjustments in how the rules are produced.
>
> OK. Does anyone want to look into that?

I think the problem is that the new "simple redirection" rule from the Unicode database produces some values that are also present in Latin-ASCII.xml. These are all tolerated as long as the "from" and "to" strings both match, because we uniquify them as pairs, but there is one pair where the "to" string is different, resulting in this clash:

ℌ x
ℌ H

I think the first line might actually be a bug in the CLDR data. I dunno, but this just doesn't look right:

ℌ → x ; # 210C;BLACK-LETTER CAPITAL H (compat)

And in the tests I now see that Michael had already figured that out! I've included a kludge to remove that. Someone should file a ticket with CLDR.
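A hedged sketch of the pair-level deduplication being described (illustrative names; the script's real bookkeeping differs) shows how agreeing pairs collapse silently while the one conflicting target surfaces:

# Sketch: merge mappings from two sources, tolerating exact duplicates and
# flagging conflicting targets for the same source string (first one wins).
def merge_mappings(pairs):
    merged = {}
    for src, trg in pairs:
        if src in merged and merged[src] != trg:
            print(f"conflict for {src!r}: keeping {merged[src]!r}, dropping {trg!r}")
            continue
        merged[src] = trg
    return merged

# ℌ -> "x" comes from Latin-ASCII.xml (the suspected CLDR bug); ℌ -> "H"
# comes from the Unicode decomposition.
print(merge_mappings([("\u210C", "x"), ("\u210C", "H")]))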
On 18.05.24 11:36, Thomas Munro wrote:
>>> WARNING: duplicate source strings, first one will be used
>>>
>>> so it will need some adjustments in how the rules are produced.
>>
>> OK. Does anyone want to look into that?
>
> I think the problem is that the new "simple redirection" rule from the
> Unicode database produces some values that are also present in
> Latin-ASCII.xml. These are all tolerated as long as the "from" and
> "to" strings both match, because we uniquify them as pairs, but there
> is one pair where the "to" string is different, resulting in this
> clash:
>
> ℌ x
> ℌ H
>
> I think the first line might actually be a bug in the CLDR data. I
> dunno, but this just doesn't look right:
>
> ℌ → x ; # 210C;BLACK-LETTER CAPITAL H (compat)
>
> And in the tests I now see that Michael had already figured that out!
> I've included a kludge to remove that. Someone should file a ticket with CLDR.

Done: https://unicode-org.atlassian.net/browse/CLDR-17656
On Sat, May 18, 2024 at 5:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> And in the tests I now see that Michael had already figured that out!
> I've included a kludge to remove that. Someone should file a ticket with CLDR.

I think you should update the comment that says "a mistake?" to instead link to the CLDR issue that Peter filed. Beyond that, I'm not sure this needs any other changes. I can't actually testify to the correctness of the Python code, but the results look sane, so hey, why not?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 22, 2024 at 12:47:37PM -0400, Robert Haas wrote:
> On Sat, May 18, 2024 at 5:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> And in the tests I now see that Michael had already figured that out!
>> I've included a kludge to remove that. Someone should file a ticket with CLDR.

That was some time ago... I was not sure back then how to handle that with the upstream data, so thanks for the bug report and the pointers. I'll try to remember that.

> I think you should update the comment that says "a mistake?" to
> instead link to the CLDR issue that Peter filed. Beyond that, I'm
> not sure this needs any other changes. I can't actually testify to the
> correctness of the Python code, but the results look sane, so hey, why
> not?

+1 for the comment refresh, and for keeping the test.

+        if src == "ℌ":
+            # a mistake?
+            continue

Perhaps this should use the code point rather than the non-ASCII character in the script.

Another thing would be to add some tests covering the new characters that get a conversion. Just a few of them in the new ranges, checking the recursive case with is_letter_with_marks(), would be fine.
--
Michael
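Michael's code-point suggestion might look like this hedged fragment (variable names invented for illustration; the actual commit differs):

# Sketch: skip the suspect Latin-ASCII.xml mapping by code point escape,
# so no raw non-ASCII glyph appears in the script source.
CLDR_MAPPINGS_TO_SKIP = {"\u210C"}  # BLACK-LETTER CAPITAL H, see CLDR-17656

mappings = [("\u210C", "x"), ("\u00E9", "e")]
for src, trg in mappings:
    if src in CLDR_MAPPINGS_TO_SKIP:
        continue  # drop the known-bad entry
    print(f"{src} -> {trg}")  # only "é -> e" survives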
On Thu, May 23, 2024 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote:
> Another thing would be to add some tests covering the new characters
> that get a conversion. Just a few of them in the new ranges, checking
> the recursive case with is_letter_with_marks(), would be fine.

Because the mappings that the script produces are going to be checked into the repository, I don't necessarily see that we also need separate tests for them. I'm not sure what problem that could realistically find.

--
Robert Haas
EDB: http://www.enterprisedb.com
I improved the comment, referenced the CLDR bug report, and replaced the funky character with \uNNNN notation, as requested. And pushed (master/18-to-be only). Ευχαριστώ! (Thank you!)
On Fri, Jul 05, 2024 at 03:41:05PM +1200, Thomas Munro wrote:
> I improved the comment, referenced the CLDR bug report, and replaced
> the funky character with \uNNNN notation, as requested. And pushed
> (master/18-to-be only). Ευχαριστώ! (Thank you!)

Thanks!
--
Michael