Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From: Jeff Davis
Subject: Re: Pre-proposal: unicode normalized text
Msg-id: 23edd434b490f74a82b564b8027b6a52059f8de7.camel@j-davis.com
In response to: Re: Pre-proposal: unicode normalized text (Nico Williams <nico@cryptonector.com>)
List: pgsql-hackers
On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> I think you misunderstand Unicode normalization and equivalence.
> There is no standard Unicode `normalize()` that would cause the above
> equality predicate to be true.  If you normalize to NFD (normal form
> decomposed) then a _prefix_ of those two strings will be equal, but
> that's clearly not what you're looking for.

From [1]:

"Unicode Normalization Forms are formally defined normalizations of
Unicode strings which make it possible to determine whether any two
Unicode strings are equivalent to each other. Depending on the
particular Unicode Normalization Form, that equivalence can either be a
canonical equivalence or a compatibility equivalence... A binary
comparison of the transformed strings will then determine equivalence."

NFC and NFD are based on Canonical Equivalence.

"Canonical equivalence is a fundamental equivalency between characters
or sequences of characters which represent the same abstract character,
and which when correctly displayed should always have the same visual
appearance and behavior."

Can you explain why NFC (the default form of normalization used by the
Postgres normalize() function), followed by memcmp(), is not the right
thing to use to determine Canonical Equivalence?

Or are you saying that Canonical Equivalence is not a useful thing to
test?

What do you mean about the "prefix"?

In Postgres today:

  SELECT convert_to(normalize(U&'\0061\0301', nfc), 'UTF8'); -- \xc3a1
  SELECT convert_to(normalize(U&'\00E1', nfc), 'UTF8');      -- \xc3a1

  SELECT convert_to(normalize(U&'\0061\0301', nfd), 'UTF8'); -- \x61cc81
  SELECT convert_to(normalize(U&'\00E1', nfd), 'UTF8');      -- \x61cc81

which looks useful to me, but I assume you are saying that it doesn't
generalize well to other cases?
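
As a sketch of the comparison I have in mind (same assumptions as the
examples above, i.e. a UTF-8 database), the equality test can also be
written directly:

  SELECT normalize(U&'\0061\0301', nfc) = normalize(U&'\00E1', nfc); -- t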

[1] https://unicode.org/reports/tr15/

> There are two ways to write 'á' in Unicode: one is pre-composed (one
> codepoint) and the other is decomposed (two codepoints in this
> specific case), and it would be nice to be able to preserve input form
> when storing strings but then still be able to index and match them
> form-insensitively (in the case of 'á' both equivalent representations
> should be considered equal, and for UNIQUE indexes they should be
> considered the same).

Sometimes preserving input differences is a good thing, other times
it's not, depending on the context. Almost any data type has some
aspects of the input that might not be preserved -- leading zeros in a
number, or whitespace in jsonb, etc.
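
For instance (just an illustration, run in psql):

  SELECT '007'::numeric;        -- 7, leading zeros are not preserved
  SELECT '{"a":    1}'::jsonb;  -- {"a": 1}, whitespace is not preserved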

If text is stored as normalized with NFC, it could be frustrating if
the retrieved string has a different binary representation from the
source data. But it could also be frustrating when two strings made up
of ordinary characters look identical and yet the database considers
them unequal.
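
For what it's worth, one way to get form-insensitive matching and
uniqueness while still storing the original input would be an
expression index over normalize(). A rough sketch (table and column
names are just for illustration, and this assumes normalize() is
usable in an index expression):

  CREATE TABLE docs (body text);  -- hypothetical table, for illustration
  CREATE UNIQUE INDEX docs_body_nfc_idx ON docs ((normalize(body, nfc)));

  -- finds a row regardless of which representation was stored:
  SELECT * FROM docs WHERE normalize(body, nfc) = normalize(U&'\00E1', nfc);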

Regards,
    Jeff Davis