Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Jeff Davis
Subject Pre-proposal: unicode normalized text
Msg-id f30b58657ceb71d5be032decf4058d454cc1df74.camel@j-davis.com
List pgsql-hackers
One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:

  SELECT 'á' = 'á'; -- false

because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character with just a different representation, and if you normalize
them they are equal:

  SELECT normalize('á') = normalize('á'); -- true
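
As a rough illustration outside the database, the same effect can be
reproduced with Python's standard unicodedata module. This is only a
sketch: the variable names are made up for the example, and NFC is
assumed as the normalization form.

```python
import unicodedata

# Two representations of the same character (code points from the text above).
decomposed = "\u0061\u0301"  # 'a' followed by a combining acute accent
composed = "\u00e1"          # precomposed 'a with acute'

# A byte-wise comparison, which is what memcmp() does, sees different sequences.
raw_equal = decomposed.encode("utf-8") == composed.encode("utf-8")

# After normalizing both sides to NFC (assumed form), the bytes are identical.
nfc_equal = (
    unicodedata.normalize("NFC", decomposed).encode("utf-8")
    == unicodedata.normalize("NFC", composed).encode("utf-8")
)

print(raw_equal, nfc_equal)
```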

The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().

Unicode guarantees that "the results of normalizing a string on one
version will always be the same as normalizing it on any other version,
as long as the string contains only assigned characters according to
both versions"[1]. It also guarantees that it "will not reallocate,
remove, or reassign" characters[2]. That means that we can normalize in
a forward-compatible way as long as we don't allow the use of
unassigned code points.
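
A minimal sketch of that input rule, again in Python rather than in
the backend: has_unassigned() and utext_in() are hypothetical names,
NFC is an assumed choice of form, and "unassigned" is taken to mean
general category Cn in whatever Unicode version the Python build
ships (see unicodedata.unidata_version).

```python
import unicodedata


def has_unassigned(s: str) -> bool:
    """Return True if any code point in s has general category Cn
    (unassigned) in this Python build's Unicode database."""
    return any(unicodedata.category(ch) == "Cn" for ch in s)


def utext_in(s: str) -> str:
    """Hypothetical UTEXT input function: reject unassigned code points,
    then keep the NFC form so byte-wise comparison of stored values
    matches normalized equality."""
    if has_unassigned(s):
        raise ValueError("input contains an unassigned code point")
    return unicodedata.normalize("NFC", s)
```

With this, utext_in('a' + U+0301) and utext_in(U+00E1) produce the same
stored value, while a string containing a code point that is still
unassigned (U+0378 at the time of writing) is rejected instead of
being stored in a form that a later Unicode version might normalize
differently.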

I looked at the standard to see what it had to say, and it discusses
normalization, but a standard UCS string with an unassigned code point
is not an error. Without a data type to enforce the constraint that
there are no unassigned code points, we can't guarantee forward
compatibility. Some other systems support NVARCHAR, but I didn't see
any guarantee of normalization or blocking unassigned code points
there, either.

UTEXT benefits:
  * slightly better natural language semantics than TEXT with
deterministic collation
  * still deterministic=true
  * fast memcmp()-based comparisons
  * no breaking semantic changes as unicode evolves

TEXT allows unassigned code points, and generally returns the same byte
sequences that were originally entered; therefore UTEXT is not a
replacement for TEXT.

UTEXT could be built-in or it could be an extension or in contrib. If
an extension, we'd probably want to at least expose a function that can
detect unassigned code points, so that it's easy to be consistent with
the auto-generated unicode tables. I also notice that there already is
an unassigned code points table in saslprep.c, but it seems to be
frozen as of Unicode 3.2, and I'm not sure why.

Questions:

 * Would this be useful enough to justify a new data type? Would it be
unclear when to choose one versus the other?
 * Would cross-type comparisons between TEXT and UTEXT become a major
problem that would reduce the utility?
 * Should "some_utext_value = some_text_value" coerce the LHS to TEXT
or the RHS to UTEXT?
 * Other comments or am I missing something?

Regards,
    Jeff Davis


[1] https://unicode.org/reports/tr15/
[2] https://www.unicode.org/policies/stability_policy.html


