Re: [PATCH] Completed unaccent dictionary with many missing characters - Mailing list pgsql-hackers

From Przemysław Sztoch
Subject Re: [PATCH] Completed unaccent dictionary with many missing characters
Msg-id 4c9326a1-6554-262f-1f22-e636933086ed@sztoch.pl
In response to Re: [PATCH] Completed unaccent dictionary with many missing characters  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Michael Paquier wrote on 7/5/2022 9:22 AM:
On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote:
Well, the addition of cyrillic does not make necessary the removal of
SOUND RECORDING COPYRIGHT or the DEGREEs, that implies the use of a
dictionary when manipulating the set of codepoints, but that's me
being too picky.  Just to say that I am fine with what you are
proposing here.
So, I have been looking at the change for cyrillic letters, and are
you sure that the range of codepoints [U+0410,U+044f] is right when it
comes to consider all those letters as plain letters?  There are a
couple of characters that itch me a bit with this range:
- What of the letter CAPITAL SHORT I (U+0419) and SMALL SHORT I
(U+0439)?  Shouldn't U+0439 be translated to U+0438 and U+0419
translated to U+0418?  That's what I get while looking at
UnicodeData.txt, and it would mean that the range of plain letters
should not include both of them.
1. It's good that you noticed it. I missed it. But it doesn't affect the generated rule list.
- It seems like we are missing a couple of letters after U+044F, like
U+0454, U+0456 or U+0455 just to name three of them?
2. I added a few more letters that are used in languages other than Russian, such as Byelorussian and Ukrainian.

-                       (0x0410, 0x044f),      # Cyrillic capital and small letters
+                       (0x0402, 0x0402),      # Cyrillic capital and small letters
+                       (0x0404, 0x0406),      #
+                       (0x0408, 0x040b),      #
+                       (0x040f, 0x0418),      #
+                       (0x041a, 0x0438),      #
+                       (0x043a, 0x044f),      #
+                       (0x0452, 0x0452),      #
+                       (0x0454, 0x0456),      #

I did not add more, because the remaining letters are probably used only in older languages.
An alternative might be to rely entirely on Unicode decomposition ...
However, after this change, only one additional Ukrainian letter with an accent was added to the rule file.
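
To make the effect of the split ranges concrete, here is a minimal sketch in Python. The range list is copied from the diff above; the names CYRILLIC_PLAIN_RANGES, in_plain_ranges and base_of are hypothetical and are not taken from the patch or from the generator script. It only illustrates that U+0419 and U+0439 now fall outside the plain-letter ranges, and that their canonical decomposition in UnicodeData points at U+0418 and U+0438.

```python
import unicodedata

# Ranges copied from the diff above. The names below are illustrative
# assumptions, not identifiers from the patch itself.
CYRILLIC_PLAIN_RANGES = [
    (0x0402, 0x0402),
    (0x0404, 0x0406),
    (0x0408, 0x040b),
    (0x040f, 0x0418),
    (0x041a, 0x0438),
    (0x043a, 0x044f),
    (0x0452, 0x0452),
    (0x0454, 0x0456),
]


def in_plain_ranges(codepoint):
    """Return True if the code point lies inside one of the listed ranges."""
    return any(lo <= codepoint <= hi for lo, hi in CYRILLIC_PLAIN_RANGES)


def base_of(char):
    """Return the first code point of the canonical decomposition, or None.

    Compatibility decompositions (those tagged like "<font>") are ignored.
    """
    decomp = unicodedata.decomposition(char)
    if not decomp or decomp.startswith("<"):
        return None
    return int(decomp.split()[0], 16)


if __name__ == "__main__":
    for cp in (0x0418, 0x0419, 0x0438, 0x0439, 0x0454):
        base = base_of(chr(cp))
        print("U+%04X plain=%s base=%s" % (
            cp, in_plain_ranges(cp),
            "none" if base is None else "U+%04X" % base))
```

With this split, SHORT I is no longer treated as a plain letter, so it can be mapped to its base letter through decomposition instead.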

I have extracted from 0001 and applied the parts about the regression
tests for degree signs, while adding two more for SOUND RECORDING
COPYRIGHT (U+2117) and Black-Letter Capital H (U+210C) translated to
'x', while it should be probably 'H'.
3. The matter is not that simple. When I change the priorities (i.e. make Latin-ASCII.xml less important than Unicode decomposition),
then U+33D7 is no longer translated to "pH" but to "PH".
In the end, I left it as it was before ...
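
The trade-off in point 3 can be checked with a small sketch, assuming Python's standard unicodedata module as a stand-in for "Unicode decomposition". The helper name compat_decompose is hypothetical, and this says nothing about how the rule generator actually merges its two sources. It only shows what compatibility decomposition alone yields for the characters discussed here.

```python
import unicodedata


def compat_decompose(char):
    """Return the NFKD (compatibility) decomposition of a single character."""
    return unicodedata.normalize("NFKD", char)


if __name__ == "__main__":
    # U+33D7 SQUARE PH, U+210C BLACK-LETTER CAPITAL H,
    # U+2117 SOUND RECORDING COPYRIGHT
    for cp in (0x33D7, 0x210C, 0x2117):
        print("U+%04X -> %r" % (cp, compat_decompose(chr(cp))))
```

Decomposition gives "H" for U+210C but "PH" for U+33D7, and it leaves U+2117 unchanged, so preferring decomposition fixes one case and changes the other.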

Once you decide what to do about point 3, I will correct it and send new patches.

--
Przemysław Sztoch | Mobile +48 509 99 00 66
