Re: [PATCH] Completed unaccent dictionary with many missing characters - Mailing list pgsql-hackers

From	Michael Paquier
Subject	Re: [PATCH] Completed unaccent dictionary with many missing characters
Date	June 20, 2022 04:49:37
Msg-id	Yq/SMQqOAK9w0nJZ@paquier.xyz
In response to	Re: [PATCH] Completed unaccent dictionary with many missing characters (Przemysław Sztoch <przemyslaw@sztoch.pl>)
Responses	Re: [PATCH] Completed unaccent dictionary with many missing characters
List	pgsql-hackers

On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote:
> Two fixes (bad comment and fixed Latin-ASCII.xml).

         if codepoint.general_category.startswith('L') and \
-           len(codepoint.combining_ids) > 1:
+           len(codepoint.combining_ids) > 0:
So, this one checks for the case where a codepoint is within the
letter category.  As far as I can see this indeed adds a couple of
characters, with a combination of Greek and Latin letters.  So that
looks fine.
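
To see which kind of character the relaxed condition picks up, here is a
minimal sketch using Python's standard unicodedata module rather than the
script's own table.  It assumes that combining_ids corresponds to the code
points listed in the UnicodeData decomposition field with any leading <tag>
dropped; decomposition_ids() and matches() are hypothetical helpers, not
functions from the patch.

```python
import unicodedata


def decomposition_ids(ch):
    """Code points in ch's decomposition field, ignoring a leading <tag>."""
    fields = unicodedata.decomposition(ch).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]


def matches(ch, minimum):
    """Letter category and more than `minimum` decomposition code points."""
    return (unicodedata.category(ch).startswith("L")
            and len(decomposition_ids(ch)) > minimum)


# U+00E9 (e with acute), U+210C (black-letter capital H),
# U+2126 (ohm sign), and plain "e" which has no decomposition.
samples = ["\u00e9", "\u210c", "\u2126", "e"]

old = [c for c in samples if matches(c, 1)]  # condition before the patch
new = [c for c in samples if matches(c, 0)]  # condition after the patch

print([hex(ord(c)) for c in old])
print([hex(ord(c)) for c in new])
```

Letters whose decomposition is a single code point, such as U+210C and
U+2126, only pass the new "> 0" test.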

+        elif codepoint.general_category.startswith('N') and \
+           len(codepoint.combining_ids) > 0 and \
+           args.noLigaturesExpansion is False and is_ligature(codepoint, table):
+            charactersSet.add((codepoint.id,
+                               "".join(chr(combining_codepoint.id)
+                                       for combining_codepoint
+                                       in get_plain_letters(codepoint, table))))
And this one is for the numerical part of the change.  Do you actually
need to apply is_ligature() here?  I would have thought that this only
applies to letters.
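
As a rough illustration of why a letter-oriented check may not fit here,
the sketch below lists the general categories of the components of two
number-category codepoints.  It uses the standard unicodedata module, not
the script's table, and component_categories() is a hypothetical helper.
Whether is_ligature() actually requires letter components is an assumption,
since its definition is not quoted in this message.

```python
import unicodedata


def component_categories(ch):
    """General categories of the code points ch decomposes into."""
    fields = unicodedata.decomposition(ch).split()
    ids = [int(f, 16) for f in fields if not f.startswith("<")]
    return [unicodedata.category(chr(i)) for i in ids]


# Both characters are in category "No" (other number).
half = component_categories("\u00bd")     # VULGAR FRACTION ONE HALF
circled = component_categories("\u2460")  # CIRCLED DIGIT ONE

print(half)
print(circled)
```

Neither decomposition contains a letter: the fraction decomposes into two
digits and a math symbol, the circled digit into a single digit.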

-    assert(False)
+    assert False, 'Codepoint U+%0.2X' % codepoint.id
[...]
-    assert(is_ligature(codepoint, table))
+    assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id
These two are a good idea for debugging.
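
For reference, a small sketch of what the message format produces.
describe() is a hypothetical wrapper around the format string used in the
patch, added only so the output can be inspected.

```python
def describe(codepoint_id):
    # Same format string as in the patched asserts.
    return 'Codepoint U+%0.2X' % codepoint_id


print(describe(0x210C))
print(describe(0xE9))

# The message shows up in the AssertionError when the check fails.
try:
    assert False, describe(0x210C)
except AssertionError as exc:
    message = str(exc)
```

Note that the ".2" precision only pads to two hex digits, so U+00E9 is
printed as "U+E9"; '%04X' would give the usual four-digit form, if that
matters for readability.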

-    return all(is_letter(table[i], table) for i in codepoint.combining_ids)
+    return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)
It looks like this makes the code weaker, as we would silently skip
characters that are not part of the table rather than checking for
them all the time?
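
The behavioral difference can be shown with a synthetic table.  Everything
below is a simplified stand-in: the Codepoint tuple, the is_letter() stub
and the code point values are made up for illustration and are not the
script's real definitions or real Unicode data.

```python
from collections import namedtuple

Codepoint = namedtuple("Codepoint",
                       ["id", "general_category", "combining_ids"])


def is_letter(codepoint, table):
    # Simplified stand-in for the script's letter check.
    return codepoint.general_category.startswith("L")


def all_letters_strict(codepoint, table):
    # Before the patch: a missing table entry raises KeyError.
    return all(is_letter(table[i], table) for i in codepoint.combining_ids)


def all_letters_lenient(codepoint, table):
    # After the patch: a missing table entry quietly yields False.
    return all(i in table and is_letter(table[i], table)
               for i in codepoint.combining_ids)


table = {0x0066: Codepoint(0x0066, "Ll", [])}
# Synthetic codepoint whose second component is absent from the table.
sample = Codepoint(0xE000, "Ll", [0x0066, 0x9999])

try:
    all_letters_strict(sample, table)
    strict_result = "no error"
except KeyError:
    strict_result = "KeyError"

lenient_result = all_letters_lenient(sample, table)
print(strict_result, lenient_result)
```

With the strict version an incomplete table is reported immediately, while
the lenient version just classifies the codepoint as not matching.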

While regenerating unaccent.rules with your patch, I noticed what
looks like an error.  An extra rule mapping U+210C (black-letter
capital H) to "x" gets added on top of the existing one mapping it to
"H", but the correct answer is the existing rule, not the one added by
the patch.
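
As an independent cross-check of the "H" mapping, the compatibility
decomposition of U+210C can be looked up with Python's standard
unicodedata module.  This is only a sanity check against the Unicode data
shipped with Python, not how the script derives its rules.

```python
import unicodedata

black_letter_h = "\u210c"

name = unicodedata.name(black_letter_h)
# NFKD applies compatibility decompositions such as <font> mappings.
folded = unicodedata.normalize("NFKD", black_letter_h)

print(name, "->", folded)
```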
--
Michael
