Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Nico Williams
Subject Re: Pre-proposal: unicode normalized text
Date
Msg-id ZRx2VcsWomTBcE+L@ubby21
In response to Re: Pre-proposal: unicode normalized text  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Pre-proposal: unicode normalized text
List pgsql-hackers
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:
> On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> > I think you misunderstand Unicode normalization and equivalence. 
> > There is no standard Unicode `normalize()` that would cause the
> > above equality predicate to be true.  If you normalize to NFD
> > (normal form decomposed) then a _prefix_ of those two strings will
> > be equal, but that's clearly not what you're looking for.

Ugh, my client is not displaying 'á' correctly, so I misunderstood your
post.

> From [1]:

Here's what you wrote in your post:

| [...] But it's really the same
| character with just a different representation, and if you normalize
| them they are equal:
|
|  SELECT normalize('á') = normalize('á'); -- true

but my client is not displaying 'á' correctly!  (It displays as 'a' but
it should display as 'á'.)

Bah.  So I'd (mis)interpreted you as saying that normalize('a') should
equal normalize('á').  Please disregard that part of my reply.
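
For the record, here's the same point spelled out with explicit Unicode
escapes, so it doesn't depend on how any client renders the character
(just a sketch; it assumes a UTF8 server encoding, which normalize()
requires):

  -- precomposed U+00E1 vs. decomposed U+0061 U+0301: same character, two encodings
  SELECT normalize(U&'\00E1') = normalize(U&'\0061\0301');  -- true
  -- plain 'a' (U+0061) is a different character, so this stays false
  SELECT normalize(U&'\0061') = normalize(U&'\00E1');       -- false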

> > There are two ways to write 'á' in Unicode: one is pre-composed (one
> > codepoint) and the other is decomposed (two codepoints in this
> > specific case), and it would be nice to be able to preserve input
> > form when storing strings but then still be able to index and match
> > them form-insensitively (in the case of 'á' both equivalent
> > representations should be considered equal, and for UNIQUE indexes
> > they should be considered the same).
> 
> Sometimes preserving input differences is a good thing, other times
> it's not, depending on the context. Almost any data type has some
> aspects of the input that might not be preserved -- leading zeros in a
> number, or whitespace in jsonb, etc.

Almost every Latin-script input method out there produces precomposed
characters, so effectively they produce NFC.  I'm not sure whether the
same is true for, e.g., Hangul (Korean) and various other scripts.

But there are things out there that produce NFD.  Famously, Apple's HFS+
uses NFD (or something very close to it).  So if you copy text that got
normalized to NFD and paste it into contexts where normalization isn't
done, you might start wanting to alter those contexts to either
normalize or be form-preserving/form-insensitive.  Sometimes you don't
get to normalize, so you have to pick form-preserving/form-insensitive
behavior.
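
To illustrate (assuming a UTF8 database and the default, deterministic
collation, where equality is byte-wise):

  -- NFD text (e.g., pasted from an HFS+ filename) vs. NFC text typed directly
  SELECT U&'\0061\0301' = U&'\00E1';                  -- false: the bytes differ
  SELECT U&'\0061\0301' IS NFC NORMALIZED;            -- false: it's in NFD
  SELECT normalize(U&'\0061\0301', NFC) = U&'\00E1';  -- true: normalize one side first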

> If text is stored as normalized with NFC, it could be frustrating if
> the retrieved string has a different binary representation than the
> source data. But it could also be frustrating to look at two strings
> made up of ordinary characters that look identical and for the database
> to consider them unequal.

Exactly.  If you have such a case you might like the option to make your
database form-preserving and form-insensitive.  That means that indices
need to normalize strings, while tables store the strings in whatever
form they were given.

ZFS (filesystems are a bit like databases) does just that!
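
Here's a rough sketch of how you can approximate that in SQL today,
with a made-up table just for illustration (normalize() is immutable,
so it should be usable in an expression index):

  -- hypothetical table; rows keep whatever form the client sent
  CREATE TABLE docs (title text);

  -- uniqueness and index lookups go through the normalized form only
  CREATE UNIQUE INDEX docs_title_nfc_key ON docs ((normalize(title, NFC)));

  -- form-insensitive lookup that can use the index
  SELECT * FROM docs WHERE normalize(title, NFC) = normalize(U&'\0061\0301', NFC);

Not the same as built-in form-insensitive behavior, but it keeps the
stored form intact while making comparisons form-insensitive.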

Nico
-- 


