Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Nico Williams
Subject Re: Pre-proposal: unicode normalized text
Msg-id ZUQvZ2HQIQqG3U8Z@ubby21
In response to Re: Pre-proposal: unicode normalized text  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers

On Wed, Oct 04, 2023 at 01:15:03PM -0700, Jeff Davis wrote:
> > The fact that there are multiple types of normalization and multiple
> > notions of equality doesn't make this easier.

And then there's text that isn't normalized to any of them.

> NFC is really the only one that makes sense.

Yes.

Most input modes produce NFC, though there may be scripts (like Hangul)
where input modes might produce NFD, so I wouldn't say NFC is universal.

Unfortunately, HFS+ stores file names in (a variant of) NFD, so NFD can
leak into other places naturally enough through OS X.
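
To make the NFC/NFD difference concrete, here is a small sketch using
Python's standard unicodedata module.  The sample strings are my own
choice for illustration, not anything from this thread.  It shows that
the same text has different code point sequences (and so different
bytes) in the two forms, which is exactly what a plain memcmp() would
trip over.

```python
import unicodedata

# Same visible text, two normalization forms.
cafe_nfc = unicodedata.normalize("NFC", "caf\u00e9")   # precomposed e-acute
cafe_nfd = unicodedata.normalize("NFD", "caf\u00e9")   # 'e' + U+0301 combining acute

# One Hangul syllable (U+D55C) decomposes into three conjoining jamo in NFD.
han_nfc = "\ud55c"
han_nfd = unicodedata.normalize("NFD", han_nfc)

# The forms differ code point by code point, hence byte by byte in UTF-8.
nfc_bytes = cafe_nfc.encode("utf-8")   # 5 bytes
nfd_bytes = cafe_nfd.encode("utf-8")   # 6 bytes
bytewise_equal = nfc_bytes == nfd_bytes   # False

# Normalizing either form to NFC gets back to a single representation.
round_trip = unicodedata.normalize("NFC", cafe_nfd) == cafe_nfc   # True
```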

> I believe that having a kind of text data type where it's stored in NFC
> and compared with memcmp() would be a good place for many users to be --
> probably most users. It's got all the performance and stability
> benefits of memcmp(), with slightly richer semantics. It's less likely
> that someone malicious can confuse the database by using different
> representations of the same character.
> 
> The problem is that it's not universally better for everyone: there are
> certainly users who would prefer that the codepoints they send to the
> database are preserved exactly, and also users who would like to be
> able to use unassigned code points.

The alternative is form insensitivity, where you compare strings as
equal even if they aren't memcmp()-equal, as long as they are equal when
normalized.  This can be made fast, though not as fast as memcmp().
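
A minimal sketch of such a comparison, in Python for brevity.  The
function name form_insensitive_eq is hypothetical, and this is not how
ZFS or any database actually implements it.  The exact-match check
stands in for the memcmp() fast path; only strings that differ pay for
normalization.

```python
import unicodedata

def form_insensitive_eq(a: str, b: str) -> bool:
    """Treat two strings as equal if they match after normalization."""
    # Fast path: identical code point sequences, analogous to memcmp() == 0.
    if a == b:
        return True
    # Slow path: normalize both sides to one form and compare again.
    # NFC is used here, but any single form would do for equality.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

same_text = form_insensitive_eq("caf\u00e9", "cafe\u0301")   # True: NFC vs NFD input
different_text = form_insensitive_eq("cafe", "caf\u00e9")    # False: really different
```

This sketch normalizes whole strings on a mismatch.  A tuned version
would presumably normalize only from the first differing position
onward, but that is an assumption on my part.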

The problem with form insensitivity is that you might have to implement
it in numerous places.  In ZFS there are only a few, but in a database
every index type, for example, will need to hook in form insensitivity.
If so, that complexity would be a good argument to just normalize.
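
As a rough illustration of that trade-off, here is a toy sketch in
Python.  FormInsensitiveIndex and nfc are hypothetical names, and a dict
stands in for a real index.  With form insensitivity, every code path
that touches a key has to remember to normalize; with normalize-on-input,
the index itself can stay a plain exact-match structure.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

class FormInsensitiveIndex:
    """Toy index that must hook normalization into every key operation."""

    def __init__(self) -> None:
        self._slots: dict[str, int] = {}

    def put(self, key: str, value: int) -> None:
        self._slots[nfc(key)] = value      # hook 1: insert path

    def get(self, key: str):
        return self._slots.get(nfc(key))   # hook 2: lookup path

# Option A: form-insensitive index, arbitrary forms in and out.
fi_index = FormInsensitiveIndex()
fi_index.put("caf\u00e9", 1)                 # stored from NFC input
found_a = fi_index.get("cafe\u0301")         # looked up with NFD input

# Option B: normalize once at the input boundary, then use an
# ordinary exact-match structure with no special hooks.
plain_index: dict[str, int] = {}
stored_key = nfc("cafe\u0301")               # normalized when the value enters
plain_index[stored_key] = 1
found_b = plain_index.get(nfc("caf\u00e9"))  # query values are normalized on input too
```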

Nico