Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers
From | Nico Williams |
---|---|
Subject | Re: Pre-proposal: unicode normalized text |
Date | |
Msg-id | ZUQouavj48HfD1aK@ubby21 |
In response to | Re: Pre-proposal: unicode normalized text (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On Wed, Oct 04, 2023 at 01:16:22PM -0400, Robert Haas wrote:
> There's a very popular commercial database where, or so I have been
> led to believe, any byte sequence at all is accepted when you try to
> put values into the database. [...]

In other circles we call this "just-use-8".

ZFS, for example, has an option to require that filenames be valid UTF-8. If that option is off, it will accept any garbage (other than ASCII NUL and /, for obvious reasons).

For filesystems the situation is a bit dire because:

 - strings at the system call boundary have never been tagged with a codeset (in the beginning there was only ASCII),
 - there has never been a standard codeset to use at the system call boundary,
 - there have been multiple codesets in use for decades,

so filesystems have to be prepared to tolerate garbage, at least until only Unicode is left (UTF-16 on Windows filesystems, UTF-8 for most others).

This is another reason that ZFS has form-insensitive/form-preserving behavior: if you want to use non-UTF-8 filenames, then names or substrings thereof that look like valid UTF-8 won't accidentally be broken by normalization.

If PG never tagged strings with codesets on the wire then PG has the same problem, especially since there are multiple implementations of the PG wire protocol.

So I can see why a "popular database" might want to take this approach.

For the longer run though, either move to supporting only UTF-8, or allow multiple text types, each with a codeset specified in its type.

> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned, or aren't yet known to the collation
> library, that's another way for data loading to fail. Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yes. See points about ZFS.

I do think ZFS struck a good balance.
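To make the form-insensitive/form-preserving idea above concrete, here is a minimal Python sketch. It is not ZFS code: the `Directory` class, the `lookup_key` helper, and the choice of NFC as the comparison form are all assumptions made for illustration. Names are stored exactly as given, normalization is applied only to the comparison key, and names that are not valid UTF-8 are compared as raw bytes.

```python
import unicodedata
from typing import Dict, Optional


def lookup_key(name: bytes) -> bytes:
    """Return the key used for comparisons; the stored name is never altered.

    Hypothetical helper: NFC is an arbitrary choice of comparison form.
    """
    try:
        text = name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8: compare the raw bytes as-is.
        return name
    return unicodedata.normalize("NFC", text).encode("utf-8")


class Directory:
    """Hypothetical directory: form-preserving storage, form-insensitive lookup."""

    def __init__(self) -> None:
        # Maps comparison key -> name exactly as the creator supplied it.
        self._entries: Dict[bytes, bytes] = {}

    def create(self, name: bytes) -> None:
        key = lookup_key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name  # keep the caller's bytes untouched

    def find(self, name: bytes) -> Optional[bytes]:
        """Return the stored name that matches, or None if there is none."""
        return self._entries.get(lookup_key(name))
```

Because the stored bytes are never rewritten, a name in some legacy codeset that happens to decode as UTF-8 still comes back byte-for-byte as it was created; normalization only affects which names are considered equal.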
PG could take the ZFS approach and add functions for use in CHECK constraints that enforce valid UTF-8, valid Unicode (no use of unassigned codepoints, no use of private use codepoints not configured into the database), etc.

Coming back to the "just-use-8" thing, a database could have a text type where the codeset is not specified, one or more text types where the codeset is specified, manual or automatic codeset conversions, and whatever enforcement functions make sense. Provided that the type information is not lost at the edges.

> > Whether we ever get to a core data type -- and more importantly,
> > whether anyone uses it -- I'm not sure.
>
> Same here.

A TEXTutf8 type (whatever name you want to give it) could be useful as a way to a) opt into heavier enforcement w/o having to write CHECK constraints, and b) document intent, all provided that the type is not lost on the wire nor in memory.

Support for other codesets is less important.

Nico
--
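As a rough illustration of the kind of enforcement predicates suggested above for use in CHECK constraints, here is a Python sketch. The function names and the `allowed_private_use` parameter are hypothetical, not an existing PostgreSQL API, and "assigned" here means assigned in the Unicode version bundled with the running Python, which is the same version-skew caveat raised in the thread.

```python
import unicodedata
from typing import FrozenSet


def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def is_assigned_unicode(
    text: str, allowed_private_use: FrozenSet[str] = frozenset()
) -> bool:
    """True if every code point is assigned and any private use code point
    is in the explicitly configured set (a hypothetical per-database setting).
    """
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cn":
            # Unassigned in this Unicode version (this also covers noncharacters).
            return False
        if category == "Co" and ch not in allowed_private_use:
            # Private use code point that was not configured as allowed.
            return False
    return True
```

A column-level check would presumably apply the second predicate only after the first has passed, so that decoding is known to succeed.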