On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> I think you misunderstand Unicode normalization and equivalence. There
> is no standard Unicode `normalize()` that would cause the above equality
> predicate to be true. If you normalize to NFD (normal form decomposed)
> then a _prefix_ of those two strings will be equal, but that's clearly
> not what you're looking for.
From [1]:
"Unicode Normalization Forms are formally defined normalizations of
Unicode strings which make it possible to determine whether any two
Unicode strings are equivalent to each other. Depending on the
particular Unicode Normalization Form, that equivalence can either be a
canonical equivalence or a compatibility equivalence... A binary
comparison of the transformed strings will then determine equivalence."
NFC and NFD are based on Canonical Equivalence.
"Canonical equivalence is a fundamental equivalency between characters
or sequences of characters which represent the same abstract character,
and which when correctly displayed should always have the same visual
appearance and behavior."
Can you explain why NFC (the default form of normalization used by the
Postgres normalize() function), followed by memcmp(), is not the right
thing to use to determine Canonical Equivalence?
Or are you saying that Canonical Equivalence is not a useful thing to
test?
What do you mean about the "prefix"?
In Postgres today:
SELECT normalize(U&'\0061\0301', nfc)::bytea; -- \xc3a1
SELECT normalize(U&'\00E1', nfc)::bytea; -- \xc3a1
SELECT normalize(U&'\0061\0301', nfd)::bytea; -- \x61cc81
SELECT normalize(U&'\00E1', nfd)::bytea; -- \x61cc81
which looks useful to me, but I assume you are saying that it doesn't
generalize well to other cases?
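And the same mechanism works directly as an equality predicate, e.g.
(assuming a deterministic collation, where = falls back to a binary
comparison):
SELECT U&'\0061\0301' = U&'\00E1';                                 -- f
SELECT normalize(U&'\0061\0301', nfc) = normalize(U&'\00E1', nfc); -- t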
[1] https://unicode.org/reports/tr15/
> There are two ways to write 'á' in Unicode: one is pre-composed (one
> codepoint) and the other is decomposed (two codepoints in this specific
> case), and it would be nice to be able to preserve input form when
> storing strings but then still be able to index and match them
> form-insensitively (in the case of 'á' both equivalent representations
> should be considered equal, and for UNIQUE indexes they should be
> considered the same).
Sometimes preserving input differences is a good thing; other times
it's not, depending on the context. Almost any data type has some
aspects of the input that might not be preserved -- leading zeros in a
number, or whitespace in jsonb, etc.
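For example, both of these discard details of the input today:
SELECT '007'::numeric;        -- 7
SELECT '{"a"   : 1}'::jsonb;  -- {"a": 1}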
If text is stored as normalized with NFC, it could be frustrating if
the retrieved string has a different binary representation than the
source data. But it could also be frustrating to see two strings made
up of ordinary characters that display identically while the database
considers them unequal.
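To sketch both sides (the table names here are made up): a plain UNIQUE
constraint accepts both encodings of 'á' even though they display
identically, while normalizing for uniqueness collapses them:
CREATE TABLE t1 (s text UNIQUE);
INSERT INTO t1 VALUES (U&'\00E1');       -- precomposed 'á'
INSERT INTO t1 VALUES (U&'\0061\0301');  -- decomposed 'á': accepted
CREATE TABLE t2 (s text);
CREATE UNIQUE INDEX t2_s_nfc ON t2 ((normalize(s, nfc)));
INSERT INTO t2 VALUES (U&'\00E1');       -- stored as entered
INSERT INTO t2 VALUES (U&'\0061\0301');  -- rejected: same NFC key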
Regards,
Jeff Davis