Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From	Nico Williams
Subject	Re: Pre-proposal: unicode normalized text
Date	October 4, 2023 17:23:41
Msg-id	ZR2fnTPDwPb0IamC@ubby21
In response to	Pre-proposal: unicode normalized text (Jeff Davis <pgsql@j-davis.com>)
Responses	Re: Pre-proposal: unicode normalized text
List	pgsql-hackers

On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().

A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.
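
As a rough illustration of the quoted idea (normalize on input so that a
byte-wise comparison, as memcmp() would do, gives a better notion of
equality), here is a minimal Python sketch.  The choice of NFC and the
helper name to_nfc_bytes are assumptions made for the example, not part
of the proposal.

```python
import unicodedata


def to_nfc_bytes(s: str) -> bytes:
    """Normalize to NFC (an assumed default form), then encode as UTF-8."""
    return unicodedata.normalize("NFC", s).encode("utf-8")


composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# Without normalization the two spellings differ byte for byte.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# After normalizing both sides, a plain byte comparison treats them as equal.
normalized_equal = to_nfc_bytes(composed) == to_nfc_bytes(decomposed)
```

Here raw_equal is False and normalized_equal is True, which is the
property a normalizing type would get for free from memcmp().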

UTF8 might be a better name for the new type, since it would denote the
transform (and would allow for UTF16 and UTF32 some day, though it's
doubtful those would ever happen).

But it's one thing to specify Unicode (and transform) in the type and
another to specify an NF to normalize to on insert or on lookup.

How about new column constraint keywords, such as NORMALIZE (meaning
normalize on insert) and NORMALIZED (meaning reject non-canonical form
text), with an optional parenthetical by which to specify a non-default
form?  (These would apply to TEXT as well when the default encoding for
the DB is UTF-8.)
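
To make the difference between the two keywords concrete, here is a
minimal Python sketch of the semantics, not of any PostgreSQL
implementation.  The function names, the NFC default, and the use of
ValueError to stand in for a constraint violation are all assumptions
made for illustration.

```python
import unicodedata

DEFAULT_FORM = "NFC"  # assumed default; the proposal does not name one


def normalize_on_insert(value: str, form: str = DEFAULT_FORM) -> str:
    """NORMALIZE-like behavior: silently store the normalized form."""
    return unicodedata.normalize(form, value)


def require_normalized(value: str, form: str = DEFAULT_FORM) -> str:
    """NORMALIZED-like behavior: reject text not already in the form."""
    if not unicodedata.is_normalized(form, value):
        raise ValueError(f"value is not in {form}")
    return value
```

With these definitions, normalize_on_insert("e\u0301") stores "\u00e9",
while require_normalized("e\u0301") raises instead of rewriting the
input.  Passing "NFD" as the second argument plays the role of the
optional parenthetical form.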

One could then ALTER TABLE to add this to existing tables.

This would also make it easier to add a form-preserving/form-insensitive
mode later if it turns out to be useful or necessary, maybe making it
the default for Unicode text in new tables.

> Questions:
> 
>  * Would this be useful enough to justify a new data type? Would it be
> confusing about when to choose one versus the other?

Yes.  See above.  I think I'd rather it be called UTF8, with its
normalization properties specified as column constraints.

>  * Would cross-type comparisons between TEXT and UTEXT become a major
> problem that would reduce the utility?

Maybe when the database's encoding is UTF-8 then UTEXT (or UTF8) can be
an alias of TEXT.

>  * Should "some_utext_value = some_text_value" coerce the LHS to TEXT
> or the RHS to UTEXT?

Ooh, this is nice!  If the TEXT is _not_ UTF-8 then it could be
converted to UTF-8.  So I think which is RHS and which is LHS doesn't
matter -- it's which is UTF-8, and if both are then the only thing left
to do is normalize, and for that I'd take the LHS' form if the LHS is
UTF-8, else the RHS'.
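
A rough Python sketch of that comparison rule, under stated
assumptions: TextValue is a hypothetical stand-in for a value that
carries its bytes, its encoding, and the normalization form of its
column, and none of these names come from PostgreSQL.

```python
import codecs
import unicodedata
from typing import NamedTuple


class TextValue(NamedTuple):
    """Hypothetical model of a text value and its column properties."""
    data: bytes
    encoding: str        # Python codec name, e.g. "utf-8" or "latin-1"
    form: str = "NFC"    # normalization form tied to the value's column


def is_utf8(value: TextValue) -> bool:
    # codecs.lookup() canonicalizes spellings such as "UTF8" to "utf-8".
    return codecs.lookup(value.encoding).name == "utf-8"


def text_equal(lhs: TextValue, rhs: TextValue) -> bool:
    # Take the LHS form if the LHS is UTF-8, else the RHS form.
    form = lhs.form if is_utf8(lhs) else rhs.form

    def key(value: TextValue) -> bytes:
        # Convert to Unicode, normalize, and compare as UTF-8 bytes.
        text = value.data.decode(value.encoding)
        return unicodedata.normalize(form, text).encode("utf-8")

    return key(lhs) == key(rhs)
```

For example, a decomposed UTF-8 "e\u0301" and a Latin-1 "\u00e9"
compare equal in either operand order, because both sides are brought
to the same encoding and the same form before the byte comparison.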

Nico