Pre-proposal: unicode normalized text - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Pre-proposal: unicode normalized text |
Date | |
Msg-id | f30b58657ceb71d5be032decf4058d454cc1df74.camel@j-davis.com |
List | pgsql-hackers |
One of the frustrations with using the "C" locale (or any deterministic locale) is that the following returns false:

  SELECT 'á' = 'á'; -- false

because those are the unicode sequences U&'\0061\0301' and U&'\00E1', respectively, so memcmp() returns non-zero. But it's really the same character with just a different representation, and if you normalize them they are equal:

  SELECT normalize('á') = normalize('á'); -- true

The idea is to have a new data type, say "UTEXT", that normalizes the input so that it can have an improved notion of equality while still using memcmp().

Unicode guarantees that "the results of normalizing a string on one version will always be the same as normalizing it on any other version, as long as the string contains only assigned characters according to both versions"[1]. It also guarantees that it "will not reallocate, remove, or reassign" characters[2]. That means that we can normalize in a forward-compatible way as long as we don't allow the use of unassigned code points.

I looked at the standard to see what it had to say, and it discusses normalization, but a standard UCS string with an unassigned code point is not an error. Without a data type to enforce the constraint that there are no unassigned code points, we can't guarantee forward compatibility. Some other systems support NVARCHAR, but I didn't see any guarantee of normalization or blocking unassigned code points there, either.

UTEXT benefits:

  * slightly better natural language semantics than TEXT with deterministic collation
  * still deterministic=true
  * fast memcmp()-based comparisons
  * no breaking semantic changes as unicode evolves

TEXT allows unassigned code points, and generally returns the same byte sequences that were originally entered; therefore UTEXT is not a replacement for TEXT.

UTEXT could be built-in or it could be an extension or in contrib.
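
To make the proposed input rule concrete, here is a minimal sketch in Python, outside the database. It is an illustration only, not PostgreSQL code: the names `utext_in` and `utext_eq` are hypothetical, NFC is assumed as the normalization form (the message does not name one), and "unassigned" is approximated by general category `Cn` in the Unicode tables bundled with the Python interpreter, which may differ from the tables PostgreSQL ships.

```python
import unicodedata


def utext_in(value: str) -> str:
    """Hypothetical input function for a UTEXT-like type.

    Rejects unassigned code points, then normalizes. NFC is an
    assumption here; the proposal does not specify a form.
    """
    for ch in value:
        # Category "Cn" covers code points that are unassigned in the
        # interpreter's Unicode tables (it also includes noncharacters).
        if unicodedata.category(ch) == "Cn":
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")
    return unicodedata.normalize("NFC", value)


def utext_eq(left: str, right: str) -> bool:
    """Byte-wise equality on stored values, analogous to memcmp()."""
    return utext_in(left).encode("utf-8") == utext_in(right).encode("utf-8")


decomposed = "a\u0301"   # U+0061 U+0301
precomposed = "\u00e1"   # U+00E1

# Raw comparison differs, as with TEXT under a deterministic collation.
raw_equal = decomposed.encode("utf-8") == precomposed.encode("utf-8")
# After the input function, a plain byte comparison is enough.
normalized_equal = utext_eq(decomposed, precomposed)
```

Because the check runs before the value is stored, every stored value contains only code points that were assigned at input time, which is the condition the Unicode stability guarantee quoted above relies on.
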
If it's an extension, we'd probably want to at least expose a function that can detect unassigned code points, so that it's easy to be consistent with the auto-generated unicode tables. I also notice that there already is an unassigned code points table in saslprep.c, but it seems to be frozen as of Unicode 3.2, and I'm not sure why.

Questions:

  * Would this be useful enough to justify a new data type? Would it be confusing to decide when to choose one versus the other?
  * Would cross-type comparisons between TEXT and UTEXT become a major problem that would reduce the utility?
  * Should "some_utext_value = some_text_value" coerce the LHS to TEXT or the RHS to UTEXT?
  * Other comments, or am I missing something?

Regards,
	Jeff Davis

[1] https://unicode.org/reports/tr15/
[2] https://www.unicode.org/policies/stability_policy.html