Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Pre-proposal: unicode normalized text
Date
Msg-id a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com
In response to Re: Pre-proposal: unicode normalized text  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Pre-proposal: unicode normalized text
List pgsql-hackers
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.
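
As a rough illustration only (this is not the attached patch, which
lives in C inside Postgres), the kind of check such an option implies
can be sketched in Python. The sketch assumes "unassigned" means general
category Cn as judged by the Unicode version bundled with the Python
interpreter; strict_unicode_input is a hypothetical name standing in for
the input-time rejection.

```python
import unicodedata


def unicode_assigned(text: str) -> bool:
    """Return True if every code point in text is assigned.

    Assumption: "unassigned" is approximated as general category Cn,
    according to the Unicode version bundled with this Python
    (see unicodedata.unidata_version).
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


def strict_unicode_input(text: str) -> str:
    """Hypothetical input-time check: reject the value instead of storing it."""
    if not unicode_assigned(text):
        raise ValueError("text contains unassigned code points")
    return text
```

Note that under this definition private-use and surrogate code points
count as assigned, since their category is not Cn.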

Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though the collation order of assigned code points can
change, too, so limiting yourself to assigned code points is only a
mitigation.)
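
The normalization guarantee can be read as a two-part condition: the
text contains only assigned code points, and it is already in the chosen
normalization form. A minimal Python sketch of that condition follows,
assuming category Cn marks unassigned code points and using the Unicode
version bundled with Python (unicodedata.is_normalized needs Python 3.8
or later). The function name is made up for the example.

```python
import unicodedata


def normalization_is_stable(text: str, form: str = "NFC") -> bool:
    """Return True if text is normalized and uses only assigned code points.

    Under the Unicode stability policy, such a string should stay
    normalized in later Unicode versions. Assumption: category Cn is
    used as the test for "unassigned".
    """
    all_assigned = all(unicodedata.category(ch) != "Cn" for ch in text)
    return all_assigned and unicodedata.is_normalized(form, text)
```

A string that fails only the second part can be fixed by normalizing it;
a string that fails the first part has no such guarantee until the code
point is assigned.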

I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.

One wrinkle is that the set of assigned code points as Postgres sees it
may not match what a collation provider sees, because the two may be
built against different Unicode versions. That's not great -- perhaps we
could check that code points are considered assigned by *both* Postgres
and ICU. I don't know if there's a way to tell if libc considers a code
point to be assigned.
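
The version mismatch is easy to reproduce outside Postgres. Python ships
two Unicode databases: the current one (the unicodedata module itself)
and a frozen Unicode 3.2.0 copy (unicodedata.ucd_3_2_0). In the sketch
below the old database stands in, purely as an assumption for
illustration, for a collation provider built against an older Unicode
version; it does not call ICU or libc.

```python
import unicodedata

# Stand-in for a provider on an older Unicode version (assumption for
# illustration; this is Python's frozen Unicode 3.2.0 database).
old_ucd = unicodedata.ucd_3_2_0


def assigned_in(ucd, text: str) -> bool:
    """Return True if ucd considers every code point in text assigned."""
    return all(ucd.category(ch) != "Cn" for ch in text)


def assigned_by_both(text: str) -> bool:
    """Accept text only if both Unicode databases consider it assigned."""
    return assigned_in(unicodedata, text) and assigned_in(old_ucd, text)
```

For example, U+1F600 is assigned in the current database but not in
3.2.0, so assigned_by_both rejects it even though a check against the
newer version alone would accept it.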

Regards,
    Jeff Davis

[1]
https://www.unicode.org/policies/stability_policy.html#Normalization


