Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Nico Williams
Subject Re: Pre-proposal: unicode normalized text
Date
Msg-id ZUQouavj48HfD1aK@ubby21
In response to Re: Pre-proposal: unicode normalized text  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Oct 04, 2023 at 01:16:22PM -0400, Robert Haas wrote:
> There's a very popular commercial database where, or so I have been
> led to believe, any byte sequence at all is accepted when you try to
> put values into the database. [...]

In other circles we call this "just-use-8".

ZFS, for example, has an option that controls whether filenames must be
valid UTF-8; when it is off, ZFS will accept any garbage (other than
ASCII NUL and /, for obvious reasons).

For filesystems the situation is a bit dire because:

 - strings at the system call boundary have never been tagged with a
   codeset (in the beginning there was only ASCII),
 - there has never been a standard codeset to use at the system call
   boundary,
 - there have been multiple codesets in use for decades,

so filesystems have to be prepared to be tolerant of garbage, at least
until only Unicode is left (UTF-16 on Windows filesystems, UTF-8 for
most others).

This is another reason that ZFS has form-insensitive/form-preserving
behavior: if you want to use non-UTF-8 filenames then names or
substrings thereof that look like valid UTF-8 won't accidentally be
broken by normalization.
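
To make "form-insensitive/form-preserving" concrete, here is a minimal
sketch in Python. It is not how ZFS implements it; the Directory class,
the lookup_key helper, and the choice of NFC are all assumptions made for
illustration. Names are compared by a normalized key, but the bytes given
at creation time are stored untouched, and names that are not valid UTF-8
are compared byte-for-byte.

```python
import unicodedata


def lookup_key(name: bytes) -> bytes:
    """Key used only for comparisons; the stored name is never rewritten.

    Hypothetical helper: NFC is an assumed normalization form.
    """
    try:
        text = name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to exact byte comparison.
        return name
    return unicodedata.normalize("NFC", text).encode("utf-8")


class Directory:
    """Toy directory: form-insensitive lookups, form-preserving storage."""

    def __init__(self):
        # lookup key -> (name exactly as created, value)
        self._entries = {}

    def create(self, name: bytes, value):
        key = lookup_key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = (name, value)

    def lookup(self, name: bytes):
        return self._entries[lookup_key(name)][1]

    def listing(self):
        # Names come back byte-for-byte as they were created.
        return [stored for stored, _ in self._entries.values()]
```

A name created in decomposed form (b"e\xcc\x81.txt") can be looked up in
precomposed form (b"\xc3\xa9.txt"), yet listing() still returns the
original bytes, so a name in some other codeset that merely happens to
look like UTF-8 is never altered.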

If PG never tagged strings with codesets on the wire then PG has the
same problem, especially since there are multiple implementations of the
PG wire protocol.

So I can see why a "popular database" might want to take this approach.

In the longer run, though, either move to supporting only UTF-8, or
allow multiple text types each with a codeset specified in its type.

> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned, or aren't yet known to the collation
> library, that's another way for data loading to fail. Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yes.  See points about ZFS.  I do think ZFS struck a good balance.

PG could take the ZFS approach and add functions for use in CHECK
constraints that enforce valid UTF-8, valid Unicode (no use of
unassigned codepoints, no use of private use codepoints not configured
into the database), etc.
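
As a rough sketch of what such enforcement functions might check (in
Python rather than as PG functions; the names is_valid_utf8 and
is_assigned_unicode and the allowed_private_use parameter are
hypothetical), assuming that "assigned" means whatever the Unicode
version bundled with the runtime's unicodedata module knows about:

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes are a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def is_assigned_unicode(text: str, allowed_private_use=frozenset()) -> bool:
    """True if no code point is unassigned or an unconfigured private-use one.

    The result depends on the Unicode version that unicodedata ships with,
    which is the same "not yet known to the library" problem raised above.
    """
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cn":
            # Unassigned in this library's Unicode version.
            return False
        if category == "Co" and ch not in allowed_private_use:
            # Private use, and not configured as allowed.
            return False
    return True
```

A CHECK-style constraint would then simply refuse a value for which
either function returns False, which is exactly the kind of data loading
failure described in the quoted text.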

Coming back to the "just-use-8" thing, a database could have a text type
where the codeset is not specified, one or more text types where the
codeset is specified, manual or automatic codeset conversions, and
whatever enforcement functions make sense, provided that the type
information is not lost at the edges.
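
A toy model of that arrangement, again in Python and with made-up names
(TaggedText, convert, assume), might look like the following. The codeset
travels with the value; a value with no codeset can be stored and
returned as-is but cannot be converted until someone manually asserts
what it is.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedText:
    """Bytes plus an optional codeset tag (hypothetical illustration)."""

    raw: bytes
    codeset: Optional[str] = None  # None models "codeset not specified"

    def convert(self, target: str) -> "TaggedText":
        """Conversion is only possible when the source codeset is known."""
        if self.codeset is None:
            raise ValueError("cannot convert text with an unspecified codeset")
        text = self.raw.decode(self.codeset)
        return TaggedText(text.encode(target), target)


def assume(value: TaggedText, codeset: str) -> TaggedText:
    """Manual step: assert a codeset for a value, validating the bytes."""
    value.raw.decode(codeset)  # raises UnicodeDecodeError if they do not fit
    return TaggedText(value.raw, codeset)
```

The point of the sketch is only that the tag has to survive every hop;
once it is dropped, all that is left is the untagged case.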

> > Whether we ever get to a core data type -- and more importantly,
> > whether anyone uses it -- I'm not sure.
> 
> Same here.

A TEXTutf8 type (whatever name you want to give it) could be useful as a
way to a) opt into heavier enforcement w/o having to write CHECK
constraints, and b) document intent, all provided that the type is not
lost on the wire nor in memory.

Support for other codesets is less important.

Nico