Thread: [PATCH] Completed unaccent dictionary with many missing characters
Current unaccent dictionary does not include many popular numeric symbols,
for example: "m²" -> "m2"
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On 28.04.22 18:50, Przemysław Sztoch wrote:
> Current unaccent dictionary does not include many popular numeric symbols,
> for example: "m²" -> "m2"

Seems reasonable. Can you explain what your patch does to achieve this?
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> On 28.04.22 18:50, Przemysław Sztoch wrote:
>> Current unaccent dictionary does not include many popular numeric symbols,
>> for example: "m²" -> "m2"
> Seems reasonable.

It kinda feels like this is outside the charter of an "unaccent" dictionary. I don't object to having these conversions available, but it seems like it ought to be a separate feature.

regards, tom lane
Peter Eisentraut wrote on 5/4/2022 5:17 PM:
> On 28.04.22 18:50, Przemysław Sztoch wrote:
>> Current unaccent dictionary does not include many popular numeric symbols,
>> for example: "m²" -> "m2"
> Seems reasonable. Can you explain what your patch does to achieve this?

I used an existing Python implementation of the generator.
It is based on the ready-made Unicode dictionary: src/common/unicode/UnicodeData.txt.
The current generator was filtering UnicodeData.txt too aggressively: the previous implementation focused only on selected character types, so I relaxed these conditions.
Browsing the unaccent.rules file is the easiest way to see how many missing characters have been added, and which ones.
For FTS, the addition of these characters is very much needed.
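The decomposition data that the generator mines from UnicodeData.txt can be approximated with Python's stdlib — a minimal sketch of the idea, not the actual generator code (`plain_form` is an illustrative name):

```python
import unicodedata

def plain_form(text: str) -> str:
    """Approximate an unaccent-style mapping: apply the Unicode
    compatibility decomposition (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(plain_form("m²"))    # m2
print(plain_form("café"))  # cafe
```

The real script works from the parsed decomposition field of UnicodeData.txt rather than NFKD, but the effect on characters like "²" (decomposition `<super> 0032`) is the same.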
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Tom Lane wrote on 5/4/2022 5:32 PM:
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>> On 28.04.22 18:50, Przemysław Sztoch wrote:
>>> Current unaccent dictionary does not include many popular numeric symbols,
>>> for example: "m²" -> "m2"
>> Seems reasonable.
> It kinda feels like this is outside the charter of an "unaccent"
> dictionary. I don't object to having these conversions available, but it
> seems like it ought to be a separate feature.

Tom, I disagree with you, because many similar numerical conversions are already taking place, e.g. 1/2, 1/4...
Today Unicode is ubiquitous and we use many more unusual characters; I have simply added the less common ones.
Therefore, the characters missing from unaccent.rules affect the correct operation of the FTS mechanisms.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Thu, May 05, 2022 at 09:44:15PM +0200, Przemysław Sztoch wrote:
> Tom, I disagree with you, because many similar numerical conversions are
> already taking place, e.g. 1/2, 1/4...

This part sounds like a valid argument to me. unaccent.rules already does the conversion of some mathematical signs, and the additions proposed in the patch don't look that weird to me. I agree with Peter and Przemysław that this is reasonable.
--
Michael
Michael Paquier wrote on 17.05.2022 09:11:
> This part sounds like a valid argument to me. unaccent.rules already does
> the conversion of some mathematical signs, and the additions proposed in
> the patch don't look that weird to me. I agree with Peter and Przemysław
> that this is reasonable.

Two fixes (bad comment and fixed Latin-ASCII.xml).
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Attachment
On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote:
> Two fixes (bad comment and fixed Latin-ASCII.xml).

     if codepoint.general_category.startswith('L') and \
-            len(codepoint.combining_ids) > 1:
+            len(codepoint.combining_ids) > 0:

So, this one checks for the case where a codepoint is within the letter category. As far as I can see this indeed adds a couple of characters, with a combination of Greek and Latin letters. So that looks fine.

+    elif codepoint.general_category.startswith('N') and \
+         len(codepoint.combining_ids) > 0 and \
+         args.noLigaturesExpansion is False and is_ligature(codepoint, table):
+        charactersSet.add((codepoint.id,
+                           "".join(chr(combining_codepoint.id)
+                                   for combining_codepoint
+                                   in get_plain_letters(codepoint, table))))

And this one is for the numerical part of the change. Do you actually need to apply is_ligature() here? I would have thought that this only applies to letters.

-    assert(False)
+    assert False, 'Codepoint U+%0.2X' % codepoint.id
[...]
-    assert(is_ligature(codepoint, table))
+    assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id

These two are a good idea for debugging.

-    return all(is_letter(table[i], table) for i in codepoint.combining_ids)
+    return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)

It looks like this makes the code weaker, as we would silently skip characters that are not part of the table rather than checking for them all the time?

While recreating unaccent.rules with your patch, I have noticed what looks like an error. An extra rule mapping U+210C (black-letter capital h) to "x" gets added on top of the existing one for "H", but the correct answer is the existing rule, not the one added by the patch.
--
Michael
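The selection condition under discussion can be sketched as a simplified, hypothetical reimplementation — it uses Python's stdlib unicodedata instead of the script's parsed UnicodeData.txt table, and `qualifies` is an illustrative name, not a function from generate_unaccent_rules.py:

```python
import unicodedata

def qualifies(ch: str) -> bool:
    """Rough sketch of which codepoints get a rule under the patched logic:
    letters (category L*) and, newly, numbers (category N*) qualify,
    provided the codepoint decomposes into at least one component."""
    category = unicodedata.category(ch)
    decomp = unicodedata.decomposition(ch)  # '' if no decomposition exists
    if not decomp:
        return False
    return category.startswith("L") or category.startswith("N")

print(qualifies("é"))  # letter with a combining accent -> True
print(qualifies("²"))  # number with decomposition <super> 0032 -> True
print(qualifies("a"))  # plain letter, no decomposition -> False
```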
Michael Paquier wrote on 20.06.2022 03:49:
> So, this one checks for the case where a codepoint is within the letter
> category. As far as I can see this indeed adds a couple of characters,
> with a combination of Greek and Latin letters. So that looks fine.

Previously, there were only multi-letter conversions. Now we also have single letters.

> And this one is for the numerical part of the change. Do you actually
> need to apply is_ligature() here? I would have thought that this only
> applies to letters.

But the ligature check is performed on combining_ids (the result of the translation), not on the base codepoint.
Without it, you will get assertion failures in get_plain_letters.
The idea is that we take translations that turn into normal letters; the other (strange) ones are rejected.
Maybe it could be done better. I didn't like it much either, but I couldn't do better.
In the end, I left it just like in the original script.
Note that the plain letter list (PLAIN_LETTER_RANGES) has now been expanded with numbers.

> It looks like this makes the code weaker, as we would silently skip
> characters that are not part of the table rather than checking for them
> all the time?

Unfortunately, there are entries in combining_ids that are not in the character table being used.
This protection is necessary so that there is no error; the unknown characters are simply omitted.

> While recreating unaccent.rules with your patch, I have noticed what
> looks like an error. An extra rule mapping U+210C (black-letter capital
> h) to "x" gets added on top of the existing one for "H", but the correct
> answer is the existing rule, not the one added by the patch.

The problem with the U+210C sign is that there are conflicting translations for it.
As the name "black-letter capital h" suggests, it should be converted to a capital H.
However, the current Latin-ASCII.xml suggests a conversion to "x".
I found an open discussion about this on the internet, with the suggestion that the Latin-ASCII.xml file should be corrected for this letter.
But I wouldn't expect Unicode to get the revised Latin-ASCII.xml into the official repo quickly.
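The conflict can be seen directly from the Unicode side with the stdlib: U+210C's own compatibility decomposition yields "H", whereas CLDR's Latin-ASCII.xml maps it to "x". A quick check:

```python
import unicodedata

blackletter_h = "\u210C"
print(unicodedata.name(blackletter_h))               # BLACK-LETTER CAPITAL H
print(unicodedata.decomposition(blackletter_h))      # <font> 0048
print(unicodedata.normalize("NFKD", blackletter_h))  # H
```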
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Mon, Jun 20, 2022 at 10:37:57AM +0200, Przemysław Sztoch wrote:
> But the ligature check is performed on combining_ids (the result of the
> translation), not on the base codepoint.
> Without it, you will get assertion failures in get_plain_letters.

Hmm. I am wondering if we could make the whole logic a bit more intuitive here. The loop that builds the set of mappings gets now much more complicated with the addition of the categories beginning by N for the numbers, and that's mostly the same set of checks as the ones applied for T.

> However, the current Latin-ASCII.xml suggests a conversion to "x".
> I found an open discussion about this on the internet, with the
> suggestion that the Latin-ASCII.xml file should be corrected for this
> letter.
> But I wouldn't expect Unicode to get the revised Latin-ASCII.xml into
> the official repo quickly.

Yeah, Latin-ASCII.xml is getting it wrong here, then. unaccent fetches the thing from this URL currently:
https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII.xml
Could it be better to handle that as an exception in generate_unaccent_rules.py, documenting why we are doing it this way? My concern is somebody re-running the script without noticing this exception, so that the set of rules would be blindly, and incorrectly, updated.
--
Michael
On Tue, Jun 21, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> Yeah, Latin-ASCII.xml is getting it wrong here, then. unaccent
> fetches the thing from this URL currently:
> https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII.xml

Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND RECORDING COPYRIGHT[1], so we could drop it from special_cases().

Hmm, is it possible to get rid of CYRILLIC CAPITAL LETTER IO and CYRILLIC SMALL LETTER IO by adding Cyrillic to PLAIN_LETTER_RANGES? That'd leave just DEGREE CELSIUS and DEGREE FAHRENHEIT. Not sure how to kill those last two special cases -- they should be directly replaced by their decomposition.

[1] https://unicode-org.atlassian.net/browse/CLDR-11383
Michael Paquier wrote on 21.06.2022 02:11:
> Hmm. I am wondering if we could make the whole logic a bit more
> intuitive here. The loop that builds the set of mappings gets now much
> more complicated with the addition of the categories beginning by N for
> the numbers.

I'm sorry, but I can't correct this condition.
I have tried, but there are further exceptions and errors.

> Could it be better to handle that as an exception in
> generate_unaccent_rules.py, documenting why we are doing it this way?

I replaced the Python set with a Python dictionary.
It resolves the problem with the duplicated entry.
I left the conversion to "x": it was like that before, and I leave it as it was.
The conversion to "x" is probably due to the phonetic interpretation of this sign.
If they correct the Latin-ASCII.xml file, it will change.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Thomas Munro wrote on 21.06.2022 02:53:
> Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND
> RECORDING COPYRIGHT[1], so we could drop it from special_cases().
> Hmm, is it possible to get rid of CYRILLIC CAPITAL LETTER IO and
> CYRILLIC SMALL LETTER IO by adding Cyrillic to PLAIN_LETTER_RANGES?

In patch v3, support for Cyrillic is added.
The special character function has been purged.
Added support for category So (Other Symbol); this category includes the characters from special_cases().
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Tue, Jun 21, 2022 at 03:41:48PM +0200, Przemysław Sztoch wrote:
> Thomas Munro wrote on 21.06.2022 02:53:
>> Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND
>> RECORDING COPYRIGHT[1], so we could drop it from special_cases().

Indeed.

>> Hmm, is it possible to get rid of CYRILLIC CAPITAL LETTER IO and
>> CYRILLIC SMALL LETTER IO by adding Cyrillic to PLAIN_LETTER_RANGES?

That's a good point. There are quite a few Cyrillic characters visibly missing a conversion.

>> That'd leave just DEGREE CELSIUS and DEGREE FAHRENHEIT. Not sure how
>> to kill those last two special cases -- they should be directly
>> replaced by their decomposition.
>>
>> [1] https://unicode-org.atlassian.net/browse/CLDR-11383
>
> In patch v3, support for Cyrillic is added.
> The special character function has been purged.
> Added support for category So (Other Symbol); this category includes
> the characters from special_cases().

I think that we'd better split v3 into more patches to keep each improvement isolated. The addition of Cyrillic characters to the range of letters and the removal of the sound recording copyright from the special cases can be done on their own, before considering the original case tackled by this thread.
--
Michael
Michael Paquier wrote on 23.06.2022 06:39:
> I think that we'd better split v3 into more patches to keep each
> improvement isolated. The addition of Cyrillic characters to the range
> of letters and the removal of the sound recording copyright from the
> special cases can be done on their own, before considering the original
> case tackled by this thread.

The only division that is probably possible is the one attached.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Thu, Jun 23, 2022 at 02:10:42PM +0200, Przemysław Sztoch wrote:
> The only division that is probably possible is the one attached.

Well, the addition of Cyrillic does not make the removal of SOUND RECORDING COPYRIGHT or the DEGREEs necessary; that implies the use of a dictionary when manipulating the set of codepoints. But that's me being too picky. Just to say that I am fine with what you are proposing here.

By the way, could you add a couple of regression tests for each patch, with a sample of the characters added? U+210C is a particularly sensitive case, as we should really make sure that it maps to what we want even if Latin-ASCII.xml tells a different story. This requires the addition of a couple of queries in unaccent.sql, with the expected output updated in unaccent.out.
--
Michael
Michael Paquier wrote on 6/28/2022 7:14 AM:
> By the way, could you add a couple of regression tests for each patch,
> with a sample of the characters added? U+210C is a particularly
> sensitive case, as we should really make sure that it maps to what we
> want even if Latin-ASCII.xml tells a different story.

Regression tests have been added.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote:
> Well, the addition of Cyrillic does not make the removal of SOUND
> RECORDING COPYRIGHT or the DEGREEs necessary; that implies the use of a
> dictionary when manipulating the set of codepoints. But that's me being
> too picky. Just to say that I am fine with what you are proposing here.

So, I have been looking at the change for Cyrillic letters, and are you sure that the range of codepoints [U+0410,U+044F] is right when it comes to considering all those letters as plain letters? There are a couple of characters that itch me a bit with this range:
- What of the letters CAPITAL SHORT I (U+0419) and SMALL SHORT I (U+0439)? Shouldn't U+0439 be translated to U+0438, and U+0419 to U+0418? That's what I get while looking at UnicodeData.txt, and it would mean that the range of plain letters should not include either of them.
- It seems like we are missing a couple of letters after U+044F, like U+0454, U+0456 or U+0455, just to name three of them.

I have extracted from 0001 and applied the parts about the regression tests for the degree signs, while adding two more for SOUND RECORDING COPYRIGHT (U+2117) and BLACK-LETTER CAPITAL H (U+210C), which is translated to 'x' while it should probably be 'H'.
--
Michael
Michael Paquier wrote on 7/5/2022 9:22 AM:
> - What of the letters CAPITAL SHORT I (U+0419) and SMALL SHORT I
>   (U+0439)? Shouldn't U+0439 be translated to U+0438, and U+0419 to
>   U+0418? That's what I get while looking at UnicodeData.txt, and it
>   would mean that the range of plain letters should not include either
>   of them.

1. It's good that you noticed it; I had missed it. But it doesn't affect the generated rule list.

> - It seems like we are missing a couple of letters after U+044F, like
>   U+0454, U+0456 or U+0455, just to name three of them.

2. I added a few more letters that are used in languages other than Russian, such as Belarusian or Ukrainian:

- (0x0410, 0x044f), # Cyrillic capital and small letters
+ (0x0402, 0x0402), # Cyrillic capital and small letters
+ (0x0404, 0x0406), #
+ (0x0408, 0x040b), #
+ (0x040f, 0x0418), #
+ (0x041a, 0x0438), #
+ (0x043a, 0x044f), #
+ (0x0452, 0x0452), #
+ (0x0454, 0x0456), #

I did not add more, because they probably concern older languages.
An alternative might be to rely entirely on Unicode decomposition ...
However, after the change, only one additional Ukrainian letter with an accent was added to the rules file.

> I have extracted from 0001 and applied the parts about the regression
> tests for the degree signs, while adding two more for SOUND RECORDING
> COPYRIGHT (U+2117) and BLACK-LETTER CAPITAL H (U+210C), which is
> translated to 'x' while it should probably be 'H'.

3. The matter is not that simple. When I change the priorities (i.e. Latin-ASCII.xml becomes less important than Unicode decomposition), then "U+33D7" changes not to pH but to PH.
In the end, I left it like it was before ...

If you decide what to do with point 3, I will correct it and send new patches.
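The effect of the narrowed ranges can be checked mechanically. This is a small stand-alone sketch (not code from the patch; the range list is copied from the diff above) verifying that U+0419/U+0439 — which decompose to И/и plus a breve — fall outside the plain-letter ranges, while И (U+0418) and Ukrainian є (U+0454) stay inside:

```python
# Ranges proposed in the patch for Cyrillic plain letters.
CYRILLIC_PLAIN_RANGES = [
    (0x0402, 0x0402), (0x0404, 0x0406), (0x0408, 0x040b),
    (0x040f, 0x0418), (0x041a, 0x0438), (0x043a, 0x044f),
    (0x0452, 0x0452), (0x0454, 0x0456),
]

def is_plain(cp: int) -> bool:
    """True if the codepoint is treated as a plain (undecomposable) letter."""
    return any(lo <= cp <= hi for lo, hi in CYRILLIC_PLAIN_RANGES)

print(is_plain(0x0418))  # И stays a plain letter -> True
print(is_plain(0x0419))  # Й decomposes to И + breve, so excluded -> False
print(is_plain(0x0439))  # й likewise -> False
print(is_plain(0x0454))  # Ukrainian є is included -> True
```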
--
Przemysław Sztoch | Mobile +48 509 99 00 66
Dear Michael P.,

> 3. The matter is not that simple. When I change the priorities (i.e.
> Latin-ASCII.xml becomes less important than Unicode decomposition), then
> "U+33D7" changes not to pH but to PH.
> In the end, I left it like it was before ...
> If you decide what to do with point 3, I will correct it and send new
> patches.

What is your decision?

Option 1: We leave "x" as in Latin-ASCII.xml, and we also keep full compatibility with previous PostgreSQL versions. If they fix Latin-ASCII.xml at Unicode, it will fix itself.

Option 2: We give the entries in Latin-ASCII.xml a lower priority.

I would choose option 1.

P.S. I will be going on vacation, and it would be nice to close this patch soon. TIA.
--
Przemysław Sztoch | Mobile +48 509 99 00 66
On Tue, Jul 05, 2022 at 09:24:49PM +0200, Przemysław Sztoch wrote:
> I did not add more, because they probably concern older languages.
> An alternative might be to rely entirely on Unicode decomposition ...
> However, after the change, only one additional Ukrainian letter with an
> accent was added to the rules file.

Hmm. I was wondering about the decomposition part, actually. How much would it simplify things if we treated the full range of the Cyrillic characters, aka from U+0400 to U+04FF, scanning all of them and building rules only where there are decompositions? Is it worth considering the Cyrillic Supplement as well, as of U+0500-U+052F?

I was also thinking about the regression tests, and as the unaccented characters are more spread out than for Latin and Greek, it could be a good thing to have complete coverage. We could, for example, use a query like this to check whether a character is treated properly:

SELECT chr(i.a) = unaccent(chr(i.a))
  FROM generate_series(1024, 1327) AS i(a);  -- range of Cyrillic

--
Michael
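The same coverage idea can be prototyped outside the database, for instance by scanning the Cyrillic block with Python's unicodedata to see which codepoints carry a decomposition at all. This is only an approximation of what the generator would emit (the real test goes through unaccent() in unaccent.sql):

```python
import unicodedata

# U+0400-U+04FF: Cyrillic block; U+0500-U+052F would be the Supplement.
decomposable = [
    cp for cp in range(0x0400, 0x0500)
    if unicodedata.decomposition(chr(cp))
]

# Й (U+0419) is among them: it decomposes to И + combining breve.
print(hex(0x0419) in map(hex, decomposable))
print(len(decomposable), "codepoints in the block have a decomposition")
```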
On Wed, 13 Jul 2022 at 19:13, Przemysław Sztoch <przemyslaw@sztoch.pl> wrote:
> Dear Michael P.,
> [...]
> What is your decision?
> [...]
> P.S. I will be going on vacation, and it would be nice to close this
> patch soon. TIA.

Hi

This entry was marked as "Needs review" in the CommitFest app, but cfbot reports the patch no longer applies.

We've marked it as "Waiting on Author". As CommitFest 2022-11 is currently underway, this would be an excellent time to update the patch.

Once you think the patchset is ready for review again, you (or any interested party) can move the patch entry forward by visiting

https://commitfest.postgresql.org/40/3631/

and changing the status to "Needs review".

Thanks

Ian Barwick
On Fri, 4 Nov 2022 at 04:59, Ian Lawrence Barwick <barwick@gmail.com> wrote:
> This entry was marked as "Needs review" in the CommitFest app, but cfbot
> reports the patch no longer applies.
> [...]
> and changing the status to "Needs review".

I was not sure whether you are planning to post an updated version of the patch. As the patch has been awaiting your attention since the last commitfest, please post an updated version soon, or update the commitfest entry accordingly. As CommitFest 2023-01 is currently underway, this would be an excellent time to update the patch.

Regards,
Vignesh
On Mon, 16 Jan 2023 at 20:07, vignesh C <vignesh21@gmail.com> wrote:
> I was not sure whether you are planning to post an updated version of
> the patch. As the patch has been awaiting your attention since the last
> commitfest, please post an updated version soon, or update the
> commitfest entry accordingly.

There have been no updates on this thread for some time, so this has been switched to Returned with Feedback.

Feel free to open it in the next commitfest if you plan to continue working on this.

Regards,
Vignesh