On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
> In the slow path you only normalize the _current character_, so you
> only need enough buffer space for that.

That's a clear win for UTF-8 data. Also, if there are no changes, then
you can just return the input buffer and not bother allocating an
output buffer.
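
To illustrate the "no changes, return the input" idea, here is a rough
sketch in Python rather than C. The name normalize_if_needed is made up
for this example, and a Python str is only a stand-in for a C buffer;
this is not how Postgres or ICU implement it.

```python
import unicodedata


def normalize_if_needed(s: str, form: str = "NFC") -> str:
    """Return s itself when it is already normalized (hypothetical helper)."""
    # Fast path 1: pure ASCII is unchanged by every normalization form.
    if s.isascii():
        return s
    # Fast path 2: already in the requested form, so hand back the input
    # object instead of building a new string.
    if unicodedata.is_normalized(form, s):
        return s
    # Slow path: only now do we pay for a new string.
    return unicodedata.normalize(form, s)


already_nfc = "caf\u00e9"      # precomposed e-acute
decomposed = "cafe\u0301"      # 'e' followed by a combining acute accent

same_object = normalize_if_needed(already_nfc) is already_nfc
recomposed = normalize_if_needed(decomposed)
```

Here same_object is true because the input was already NFC, and
recomposed equals the precomposed spelling.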

> The really nice thing about form-insensitive/form-preserving
> functionality is that it is form-preserving rather than normalizing
> on create and lookup, and that makes the fast-path described above
> feasible. Whereas if you normalize on create and lookup you have to
> heap allocate enough space for each string normalized.

Non-deterministic ICU collations are already insensitive to most
normalization differences. Some differences are not handled when they
involve too many combining marks, but you can use the "full
normalization" option to address that. See:
https://www.postgresql.org/docs/current/collation.html#ICU-COLLATION-SETTINGS-TABLE

> The other nice thing is that f-i/f-p behavior is a lot less
> surprising to users -- the input methods they use don't matter.

Postgres is already form-preserving; it does not auto-normalize. (I
have suggested that we might want to offer something like that, but
that would be a user choice.)

Currently, the non-deterministic collations (which offer
form-insensitivity) are not available at the database level, so you
have to explicitly specify the COLLATE clause on a column or query. In
other words, Postgres is not form-insensitive by default, though there
is work to make that possible.

> What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
> something very close to NFD) but input methods (even on OS X)
> invariably produce NFC (well, something close to it), at least for
> Latin scripts. This means that on OS X if you use filenames from
> directory listings those will be NFD while user inputs will be NFC,
> and so you can't just memcmp()/strcmp() those -- you have to
> normalize _or_ use form-insensitive string comparison, but nothing
> did that 20 years ago. Thus doing the form-insensitivity in the
> filesystem seemed best, and if you do that you can be
> form-preserving to enable the optimization described above.

Databases have concerns similar to a filesystem's in this respect.
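
The NFD-vs-NFC mismatch is easy to reproduce. The Python sketch below
uses a made-up helper, form_insensitive_equal, and it normalizes whole
strings in its slow path, which is simpler than (and not the same as)
the per-character approach described upthread. It only covers canonical
equivalence, not anything a collation would add on top.

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring NFC/NFD differences (hypothetical)."""
    # Fast path: identical code points, the memcmp()-style answer is right.
    if a == b:
        return True
    # Fast path: ASCII has a single form, so a mismatch is a real mismatch.
    if a.isascii() and b.isascii():
        return False
    # Slow path (simplified): normalize both sides in full and compare.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


typed_name = "caf\u00e9.txt"                             # NFC, as typed
listed_name = unicodedata.normalize("NFD", typed_name)   # NFD, as listed

naive_match = typed_name == listed_name
insensitive_match = form_insensitive_equal(typed_name, listed_name)
```

The naive comparison reports a mismatch for what the user sees as the
same name, while the form-insensitive one matches, and neither string
is rewritten in the process.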

Regards,
Jeff Davis