Thread: [PATCH] Completed unaccent dictionary with many missing characters
Current unaccent dictionary does not include many popular numeric symbols,
for example: "m²" -> "m2"
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On 28.04.22 18:50, Przemysław Sztoch wrote:
> Current unaccent dictionary does not include many popular numeric symbols,
> for example: "m²" -> "m2"

Seems reasonable. Can you explain what your patch does to achieve this?
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> On 28.04.22 18:50, Przemysław Sztoch wrote:
>> Current unaccent dictionary does not include many popular numeric symbols,
>> for example: "m²" -> "m2"
> Seems reasonable.

It kinda feels like this is outside the charter of an "unaccent" dictionary. I don't object to having these conversions available, but it seems like it ought to be a separate feature.

regards, tom lane
Peter Eisentraut wrote on 5/4/2022 5:17 PM:
> On 28.04.22 18:50, Przemysław Sztoch wrote:
>> Current unaccent dictionary does not include many popular numeric symbols,
>> for example: "m²" -> "m2"
> Seems reasonable. Can you explain what your patch does to achieve this?

I used an existing Python implementation of the generator.
It is based on the ready-made Unicode dictionary: src/common/unicode/UnicodeData.txt.
The current generator was filtering UnicodeData.txt too aggressively: the previous implementation focused only on selected character types, so I relaxed these conditions.
Browsing the unaccent.rules file is the easiest way to see how many missing characters have been added, and which ones.
For FTS, the addition of these characters is very much needed.
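The decomposition data that the generator mines from UnicodeData.txt can be approximated with Python's stdlib — a minimal sketch of the idea, not the actual generator code (`plain_form` is an illustrative name):

```python
import unicodedata

def plain_form(text: str) -> str:
    """Approximate an unaccent-style mapping: apply the Unicode
    compatibility decomposition (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(plain_form("m²"))    # m2
print(plain_form("café"))  # cafe
```

The real script works from the parsed decomposition field of UnicodeData.txt rather than NFKD, but the effect on characters like "²" (decomposition `<super> 0032`) is the same.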
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Tom Lane wrote on 5/4/2022 5:32 PM:
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>> On 28.04.22 18:50, Przemysław Sztoch wrote:
>>> Current unaccent dictionary does not include many popular numeric symbols,
>>> for example: "m²" -> "m2"
>> Seems reasonable.
> It kinda feels like this is outside the charter of an "unaccent"
> dictionary. I don't object to having these conversions available, but it
> seems like it ought to be a separate feature.

Tom, I disagree with you, because many similar numerical conversions are already taking place, e.g. 1/2, 1/4...
Today Unicode is ubiquitous and we use many more unusual characters; I have simply added the less common ones.
Therefore, the characters missing from unaccent.rules affect the correct operation of the FTS mechanisms.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Thu, May 05, 2022 at 09:44:15PM +0200, Przemysław Sztoch wrote:
> Tom, I disagree with you, because many similar numerical conversions are
> already taking place, e.g. 1/2, 1/4...

This part sounds like a valid argument to me. unaccent.rules already does the conversion of some mathematical signs, and the additions proposed in the patch don't look that weird to me. I agree with Peter and Przemysław that this is reasonable.
--
Michael
Michael Paquier wrote on 17.05.2022 09:11:
> This part sounds like a valid argument to me. unaccent.rules already does
> the conversion of some mathematical signs, and the additions proposed in
> the patch don't look that weird to me. I agree with Peter and Przemysław
> that this is reasonable.

Two fixes (bad comment and fixed Latin-ASCII.xml).
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Attachment
On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote:
> Two fixes (bad comment and fixed Latin-ASCII.xml).

     if codepoint.general_category.startswith('L') and \
-            len(codepoint.combining_ids) > 1:
+            len(codepoint.combining_ids) > 0:

So, this one checks for the case where a codepoint is within the letter category. As far as I can see this indeed adds a couple of characters, with a combination of Greek and Latin letters. So that looks fine.

+    elif codepoint.general_category.startswith('N') and \
+         len(codepoint.combining_ids) > 0 and \
+         args.noLigaturesExpansion is False and is_ligature(codepoint, table):
+        charactersSet.add((codepoint.id,
+                           "".join(chr(combining_codepoint.id)
+                                   for combining_codepoint
+                                   in get_plain_letters(codepoint, table))))

And this one is for the numerical part of the change. Do you actually need to apply is_ligature() here? I would have thought that this only applies to letters.

-    assert(False)
+    assert False, 'Codepoint U+%0.2X' % codepoint.id
[...]
-    assert(is_ligature(codepoint, table))
+    assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id

These two are a good idea for debugging.

-    return all(is_letter(table[i], table) for i in codepoint.combining_ids)
+    return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)

It looks like this makes the code weaker, as we would silently skip characters that are not part of the table rather than checking for them all the time?

While recreating unaccent.rules with your patch, I have noticed what looks like an error. An extra rule mapping U+210C (black-letter capital h) to "x" gets added on top of the existing one for "H", but the correct answer is the existing rule, not the one added by the patch.
--
Michael
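The selection condition under discussion can be sketched as a simplified, hypothetical reimplementation — it uses Python's stdlib unicodedata instead of the script's parsed UnicodeData.txt table, and `qualifies` is an illustrative name, not a function from generate_unaccent_rules.py:

```python
import unicodedata

def qualifies(ch: str) -> bool:
    """Rough sketch of which codepoints get a rule under the patched logic:
    letters (category L*) and, newly, numbers (category N*) qualify,
    provided the codepoint decomposes into at least one component."""
    category = unicodedata.category(ch)
    decomp = unicodedata.decomposition(ch)  # '' if no decomposition exists
    if not decomp:
        return False
    return category.startswith("L") or category.startswith("N")

print(qualifies("é"))  # letter with a combining accent -> True
print(qualifies("²"))  # number with decomposition <super> 0032 -> True
print(qualifies("a"))  # plain letter, no decomposition -> False
```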
Michael Paquier wrote on 20.06.2022 03:49:
> So, this one checks for the case where a codepoint is within the letter
> category. As far as I can see this indeed adds a couple of characters,
> with a combination of Greek and Latin letters. So that looks fine.

Previously, there were only multi-letter conversions. Now we also have single letters.

> And this one is for the numerical part of the change. Do you actually
> need to apply is_ligature() here? I would have thought that this only
> applies to letters.

But the ligature check is performed on combining_ids (the result of the translation), not on the base codepoint.
Without it, you will get assertion failures in get_plain_letters.
The idea is that we take translations that turn into normal letters; the other (strange) ones are rejected.
Maybe it could be done better. I didn't like it much either, but I couldn't do better.
In the end, I left it just like in the original script.
Note that the plain letter list (PLAIN_LETTER_RANGES) has now been expanded with numbers.

> It looks like this makes the code weaker, as we would silently skip
> characters that are not part of the table rather than checking for them
> all the time?

Unfortunately, there are entries in combining_ids that are not in the character table being used.
This protection is necessary so that there is no error; the unknown characters are simply omitted.

> While recreating unaccent.rules with your patch, I have noticed what
> looks like an error. An extra rule mapping U+210C (black-letter capital
> h) to "x" gets added on top of the existing one for "H", but the correct
> answer is the existing rule, not the one added by the patch.

The problem with the U+210C sign is that there are conflicting translations for it.
As the name "black-letter capital h" suggests, it should be converted to a capital H.
However, the current Latin-ASCII.xml suggests a conversion to "x".
I found an open discussion about this on the internet, with the suggestion that the Latin-ASCII.xml file should be corrected for this letter.
But I wouldn't expect Unicode to get the revised Latin-ASCII.xml into the official repo quickly.
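The conflict can be seen directly from the Unicode side with the stdlib: U+210C's own compatibility decomposition yields "H", whereas CLDR's Latin-ASCII.xml maps it to "x". A quick check:

```python
import unicodedata

blackletter_h = "\u210C"
print(unicodedata.name(blackletter_h))               # BLACK-LETTER CAPITAL H
print(unicodedata.decomposition(blackletter_h))      # <font> 0048
print(unicodedata.normalize("NFKD", blackletter_h))  # H
```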
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Mon, Jun 20, 2022 at 10:37:57AM +0200, Przemysław Sztoch wrote:
> But the ligature check is performed on combining_ids (the result of the
> translation), not on the base codepoint.
> Without it, you will get assertion failures in get_plain_letters.

Hmm. I am wondering if we could make the whole logic a bit more intuitive here. The loop that builds the set of mappings gets now much more complicated with the addition of the categories beginning by N for the numbers, and that's mostly the same set of checks as the ones applied for T.

> However, the current Latin-ASCII.xml suggests a conversion to "x".
> I found an open discussion about this on the internet, with the
> suggestion that the Latin-ASCII.xml file should be corrected for this
> letter.
> But I wouldn't expect Unicode to get the revised Latin-ASCII.xml into
> the official repo quickly.

Yeah, Latin-ASCII.xml is getting it wrong here, then. unaccent fetches the thing from this URL currently:
https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII.xml
Could it be better to handle that as an exception in generate_unaccent_rules.py, documenting why we are doing it this way? My concern is somebody re-running the script without noticing this exception, so that the set of rules would be blindly, and incorrectly, updated.
--
Michael
On Tue, Jun 21, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> Yeah, Latin-ASCII.xml is getting it wrong here, then. unaccent
> fetches the thing from this URL currently:
> https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII.xml

Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND RECORDING COPYRIGHT[1], so we could drop it from special_cases().

Hmm, is it possible to get rid of CYRILLIC CAPITAL LETTER IO and CYRILLIC SMALL LETTER IO by adding Cyrillic to PLAIN_LETTER_RANGES? That'd leave just DEGREE CELSIUS and DEGREE FAHRENHEIT. Not sure how to kill those last two special cases -- they should be directly replaced by their decomposition.

[1] https://unicode-org.atlassian.net/browse/CLDR-11383
Michael Paquier wrote on 21.06.2022 02:11:
> Hmm. I am wondering if we could make the whole logic a bit more
> intuitive here. The loop that builds the set of mappings gets now much
> more complicated with the addition of the categories beginning by N for
> the numbers.

I'm sorry, but I can't correct this condition.
I have tried, but there are further exceptions and errors.

> Could it be better to handle that as an exception in
> generate_unaccent_rules.py, documenting why we are doing it this way?

I replaced the Python set with a Python dictionary.
It resolves the problem with the duplicated entry.
I left the conversion to "x": it was like that before, and I leave it as it was.
The conversion to "x" is probably due to the phonetic interpretation of this sign.
If they correct the Latin-ASCII.xml file, it will change.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Thomas Munro wrote on 21.06.2022 02:53:
> Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND
> RECORDING COPYRIGHT[1], so we could drop it from special_cases().
> Hmm, is it possible to get rid of CYRILLIC CAPITAL LETTER IO and
> CYRILLIC SMALL LETTER IO by adding Cyrillic to PLAIN_LETTER_RANGES?

In patch v3, support for Cyrillic is added.
The special character function has been purged.
Added support for category So (Other Symbol); this category includes the characters from special_cases().
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Tue, Jun 21, 2022 at 03:41:48PM +0200, Przemysław Sztoch wrote:
> Thomas Munro wrote on 21.06.2022 02:53:
>> Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND
>> RECORDING COPYRIGHT[1], so we could drop it from special_cases().

Indeed.

>> Hmm, is it possible to get rid of CYRILLIC CAPITAL LETTER IO and
>> CYRILLIC SMALL LETTER IO by adding Cyrillic to PLAIN_LETTER_RANGES?

That's a good point. There are quite a few Cyrillic characters visibly missing a conversion.

>> That'd leave just DEGREE CELSIUS and DEGREE FAHRENHEIT. Not sure how
>> to kill those last two special cases -- they should be directly
>> replaced by their decomposition.
>>
>> [1] https://unicode-org.atlassian.net/browse/CLDR-11383
>
> In patch v3, support for Cyrillic is added.
> The special character function has been purged.
> Added support for category So (Other Symbol); this category includes
> the characters from special_cases().

I think that we'd better split v3 into more patches to keep each improvement isolated. The addition of Cyrillic characters to the range of letters and the removal of the sound recording copyright from the special cases can be done on their own, before considering the original case tackled by this thread.
--
Michael
Michael Paquier wrote on 23.06.2022 06:39:
> I think that we'd better split v3 into more patches to keep each
> improvement isolated. The addition of Cyrillic characters to the range
> of letters and the removal of the sound recording copyright from the
> special cases can be done on their own, before considering the original
> case tackled by this thread.

The only division that is probably possible is the one attached.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Thu, Jun 23, 2022 at 02:10:42PM +0200, Przemysław Sztoch wrote:
> The only division that is probably possible is the one attached.

Well, the addition of Cyrillic does not make the removal of SOUND RECORDING COPYRIGHT or the DEGREEs necessary; that implies the use of a dictionary when manipulating the set of codepoints. But that's me being too picky. Just to say that I am fine with what you are proposing here.

By the way, could you add a couple of regression tests for each patch, with a sample of the characters added? U+210C is a particularly sensitive case, as we should really make sure that it maps to what we want even if Latin-ASCII.xml tells a different story. This requires the addition of a couple of queries in unaccent.sql, with the expected output updated in unaccent.out.
--
Michael
Michael Paquier wrote on 6/28/2022 7:14 AM:
> By the way, could you add a couple of regression tests for each patch,
> with a sample of the characters added? U+210C is a particularly
> sensitive case, as we should really make sure that it maps to what we
> want even if Latin-ASCII.xml tells a different story.

Regression tests have been added.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote:
> Well, the addition of Cyrillic does not make the removal of SOUND
> RECORDING COPYRIGHT or the DEGREEs necessary; that implies the use of a
> dictionary when manipulating the set of codepoints. But that's me being
> too picky. Just to say that I am fine with what you are proposing here.

So, I have been looking at the change for Cyrillic letters, and are you sure that the range of codepoints [U+0410,U+044F] is right when it comes to considering all those letters as plain letters? There are a couple of characters that itch me a bit with this range:
- What of the letters CAPITAL SHORT I (U+0419) and SMALL SHORT I (U+0439)? Shouldn't U+0439 be translated to U+0438, and U+0419 to U+0418? That's what I get while looking at UnicodeData.txt, and it would mean that the range of plain letters should not include either of them.
- It seems like we are missing a couple of letters after U+044F, like U+0454, U+0456 or U+0455, just to name three of them.

I have extracted from 0001 and applied the parts about the regression tests for the degree signs, while adding two more for SOUND RECORDING COPYRIGHT (U+2117) and BLACK-LETTER CAPITAL H (U+210C), which is translated to 'x' while it should probably be 'H'.
--
Michael
Michael Paquier wrote on 7/5/2022 9:22 AM:
> - What of the letters CAPITAL SHORT I (U+0419) and SMALL SHORT I
>   (U+0439)? Shouldn't U+0439 be translated to U+0438, and U+0419 to
>   U+0418? That's what I get while looking at UnicodeData.txt, and it
>   would mean that the range of plain letters should not include either
>   of them.

1. It's good that you noticed it; I had missed it. But it doesn't affect the generated rule list.

> - It seems like we are missing a couple of letters after U+044F, like
>   U+0454, U+0456 or U+0455, just to name three of them.

2. I added a few more letters that are used in languages other than Russian, such as Belarusian or Ukrainian:

- (0x0410, 0x044f), # Cyrillic capital and small letters
+ (0x0402, 0x0402), # Cyrillic capital and small letters
+ (0x0404, 0x0406), #
+ (0x0408, 0x040b), #
+ (0x040f, 0x0418), #
+ (0x041a, 0x0438), #
+ (0x043a, 0x044f), #
+ (0x0452, 0x0452), #
+ (0x0454, 0x0456), #

I did not add more, because they probably concern older languages.
An alternative might be to rely entirely on Unicode decomposition ...
However, after the change, only one additional Ukrainian letter with an accent was added to the rules file.

> I have extracted from 0001 and applied the parts about the regression
> tests for the degree signs, while adding two more for SOUND RECORDING
> COPYRIGHT (U+2117) and BLACK-LETTER CAPITAL H (U+210C), which is
> translated to 'x' while it should probably be 'H'.

3. The matter is not that simple. When I change the priorities (i.e. Latin-ASCII.xml becomes less important than Unicode decomposition), then "U+33D7" changes not to pH but to PH.
In the end, I left it like it was before ...

If you decide what to do with point 3, I will correct it and send new patches.
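The effect of the narrowed ranges can be checked mechanically. This is a small stand-alone sketch (not code from the patch; the range list is copied from the diff above) verifying that U+0419/U+0439 — which decompose to И/и plus a breve — fall outside the plain-letter ranges, while И (U+0418) and Ukrainian є (U+0454) stay inside:

```python
# Ranges proposed in the patch for Cyrillic plain letters.
CYRILLIC_PLAIN_RANGES = [
    (0x0402, 0x0402), (0x0404, 0x0406), (0x0408, 0x040b),
    (0x040f, 0x0418), (0x041a, 0x0438), (0x043a, 0x044f),
    (0x0452, 0x0452), (0x0454, 0x0456),
]

def is_plain(cp: int) -> bool:
    """True if the codepoint is treated as a plain (undecomposable) letter."""
    return any(lo <= cp <= hi for lo, hi in CYRILLIC_PLAIN_RANGES)

print(is_plain(0x0418))  # И stays a plain letter -> True
print(is_plain(0x0419))  # Й decomposes to И + breve, so excluded -> False
print(is_plain(0x0439))  # й likewise -> False
print(is_plain(0x0454))  # Ukrainian є is included -> True
```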
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Dear Michael P.,

> 3. The matter is not that simple. When I change the priorities (i.e.
> Latin-ASCII.xml becomes less important than Unicode decomposition), then
> "U+33D7" changes not to pH but to PH.
> In the end, I left it like it was before ...
> If you decide what to do with point 3, I will correct it and send new
> patches.

What is your decision?

Option 1: We leave "x" as in Latin-ASCII.xml, and we also keep full compatibility with previous PostgreSQL versions. If they fix Latin-ASCII.xml at Unicode, it will fix itself.

Option 2: We give the entries in Latin-ASCII.xml a lower priority.

I would choose option 1.

P.S. I will be going on vacation, and it would be nice to close this patch soon. TIA.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Tue, Jul 05, 2022 at 09:24:49PM +0200, Przemysław Sztoch wrote:
> I did not add more, because they probably concern older languages.
> An alternative might be to rely entirely on Unicode decomposition ...
> However, after the change, only one additional Ukrainian letter with an
> accent was added to the rules file.

Hmm. I was wondering about the decomposition part, actually. How much would it simplify things if we treated the full range of the Cyrillic characters, aka from U+0400 to U+04FF, scanning all of them and building rules only where there are decompositions? Is it worth considering the Cyrillic Supplement as well, as of U+0500-U+052F?

I was also thinking about the regression tests, and as the unaccented characters are more spread out than for Latin and Greek, it could be a good thing to have complete coverage. We could, for example, use a query like this to check whether a character is treated properly:

SELECT chr(i.a) = unaccent(chr(i.a))
  FROM generate_series(1024, 1327) AS i(a);  -- range of Cyrillic

--
Michael
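The same coverage idea can be prototyped outside the database, for instance by scanning the Cyrillic block with Python's unicodedata to see which codepoints carry a decomposition at all. This is only an approximation of what the generator would emit (the real test goes through unaccent() in unaccent.sql):

```python
import unicodedata

# U+0400-U+04FF: Cyrillic block; U+0500-U+052F would be the Supplement.
decomposable = [
    cp for cp in range(0x0400, 0x0500)
    if unicodedata.decomposition(chr(cp))
]

# Й (U+0419) is among them: it decomposes to И + combining breve.
print(hex(0x0419) in map(hex, decomposable))
print(len(decomposable), "codepoints in the block have a decomposition")
```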
On Wed, 13 Jul 2022 at 19:13, Przemysław Sztoch <przemyslaw@sztoch.pl> wrote:
> Dear Michael P.,
> [...]
> What is your decision?
> [...]
> P.S. I will be going on vacation, and it would be nice to close this
> patch soon. TIA.

Hi

This entry was marked as "Needs review" in the CommitFest app, but cfbot reports the patch no longer applies.

We've marked it as "Waiting on Author". As CommitFest 2022-11 is currently underway, this would be an excellent time to update the patch.

Once you think the patchset is ready for review again, you (or any interested party) can move the patch entry forward by visiting

https://commitfest.postgresql.org/40/3631/

and changing the status to "Needs review".

Thanks

Ian Barwick
On Fri, 4 Nov 2022 at 04:59, Ian Lawrence Barwick <barwick@gmail.com> wrote:
> This entry was marked as "Needs review" in the CommitFest app, but cfbot
> reports the patch no longer applies.
> [...]
> and changing the status to "Needs review".

I was not sure whether you are planning to post an updated version of the patch. As the patch has been awaiting your attention since the last commitfest, please post an updated version soon, or update the commitfest entry accordingly. As CommitFest 2023-01 is currently underway, this would be an excellent time to update the patch.

Regards,
Vignesh
On Mon, 16 Jan 2023 at 20:07, vignesh C <vignesh21@gmail.com> wrote:
> I was not sure whether you are planning to post an updated version of
> the patch. As the patch has been awaiting your attention since the last
> commitfest, please post an updated version soon, or update the
> commitfest entry accordingly.

There have been no updates on this thread for some time, so this has been switched to Returned with Feedback.

Feel free to open it in the next commitfest if you plan to continue working on this.

Regards,
Vignesh