Thread: Pre-proposal: unicode normalized text

Pre-proposal: unicode normalized text

From: Jeff Davis
One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:

  SELECT 'á' = 'á'; -- false

because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character with just a different representation, and if you normalize
them they are equal:

  SELECT normalize('á') = normalize('á'); -- true

The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().
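
For illustration only, here is a rough approximation that is already
possible today with a domain (a domain can only reject input that is
not normalized; the point of UTEXT would be to normalize on input
instead of rejecting, and the names below are just placeholders):

  CREATE DOMAIN utext_approx AS TEXT
    CHECK (VALUE IS NFC NORMALIZED);

  -- callers would normalize explicitly before storing:
  --   INSERT INTO t(s) VALUES (normalize(U&'\0061\0301', NFC));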

Unicode guarantees that "the results of normalizing a string on one
version will always be the same as normalizing it on any other version,
as long as the string contains only assigned characters according to
both versions"[1]. It also guarantees that it "will not reallocate,
remove, or reassign" characters[2]. That means that we can normalize in
a forward-compatible way as long as we don't allow the use of
unassigned code points.

I looked at the standard to see what it had to say, and it discusses
normalization, but a standard UCS string with an unassigned code point
is not an error. Without a data type to enforce the constraint that
there are no unassigned code points, we can't guarantee forward
compatibility. Some other systems support NVARCHAR, but I didn't see
any guarantee of normalization or blocking unassigned code points
there, either.

UTEXT benefits:
  * slightly better natural language semantics than TEXT with
deterministic collation
  * still deterministic=true
  * fast memcmp()-based comparisons
  * no breaking semantic changes as unicode evolves

TEXT allows unassigned code points, and generally returns the same byte
sequences that were originally entered; therefore UTEXT is not a
replacement for TEXT.

UTEXT could be built-in or it could be an extension or in contrib. If
an extension, we'd probably want to at least expose a function that can
detect unassigned code points, so that it's easy to be consistent with
the auto-generated unicode tables. I also notice that there already is
an unassigned code points table in saslprep.c, but it seems to be
frozen as of Unicode 3.2, and I'm not sure why.

Questions:

 * Would this be useful enough to justify a new data type? Would it be
confusing to know when to choose one versus the other?
 * Would cross-type comparisons between TEXT and UTEXT become a major
problem that would reduce the utility?
 * Should "some_utext_value = some_text_value" coerce the LHS to TEXT
or the RHS to UTEXT?
 * Other comments or am I missing something?

Regards,
    Jeff Davis


[1] https://unicode.org/reports/tr15/
[2] https://www.unicode.org/policies/stability_policy.html



Re: Pre-proposal: unicode normalized text

From: Peter Eisentraut
On 13.09.23 00:47, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().

I think a new type like this would obviously be suboptimal because it's 
nonstandard and most people wouldn't use it.

I think a better direction here would be to work toward making 
nondeterministic collations usable on the global/database level and then 
encouraging users to use those.

It's also not clear which way the performance tradeoffs would fall.

Nondeterministic collations are obviously going to be slower, but by how 
much?  People have accepted moving from C locale to "real" locales 
because they needed those semantics.  Would it be any worse moving from 
real locales to "even realer" locales?

On the other hand, a utext type would either require a large set of its 
own functions and operators, or you would have to inject text-to-utext 
casts in places, which would also introduce overhead.



Re: Pre-proposal: unicode normalized text

From: Robert Haas
On Mon, Oct 2, 2023 at 3:42 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> I think a better direction here would be to work toward making
> nondeterministic collations usable on the global/database level and then
> encouraging users to use those.

It seems to me that this overlooks one of the major points of Jeff's
proposal, which is that we don't reject text input that contains
unassigned code points. That decision turns out to be really painful.
Here, Jeff mentions normalization, but I think it's a major issue with
collation support. If new code points are added, users can put them
into the database before they are known to the collation library, and
then when they become known to the collation library the sort order
changes and indexes break. Would we endorse a proposal to make
pg_catalog.text with encoding UTF-8 reject code points that aren't yet
known to the collation library? To do so would be to tighten things up
considerably from where they stand today, and the way things stand
today is already rigid enough to cause problems for some users. But if
we're not willing to do that then I find it easy to understand why
Jeff wants an alternative type that does.

Now, there is still the question of whether such a data type would
properly belong in core or even contrib rather than being an
out-of-core project. It's not obvious to me that such a data type
would get enough traction that we'd want it to be part of PostgreSQL
itself. But at the same time I can certainly understand why Jeff finds
the status quo problematic.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> One of the frustrations with using the "C" locale (or any deterministic
> locale) is that the following returns false:
> 
>   SELECT 'á' = 'á'; -- false
> 
> because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
> respectively, so memcmp() returns non-zero. But it's really the same
> character with just a different representation, and if you normalize
> them they are equal:
> 
>   SELECT normalize('á') = normalize('á'); -- true

I think you misunderstand Unicode normalization and equivalence.  There
is no standard Unicode `normalize()` that would cause the above equality
predicate to be true.  If you normalize to NFD (normal form decomposed)
then a _prefix_ of those two strings will be equal, but that's clearly
not what you're looking for.

PostgreSQL already has Unicode normalization support, though it would be
nice to also have form-insensitive indexing and equality predicates.

There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints in this specific
case), and it would be nice to be able to preserve input form when
storing strings but then still be able to index and match them
form-insensitively (in the case of 'á' both equivalent representations
should be considered equal, and for UNIQUE indexes they should be
considered the same).

You could also have functions that perform lossy normalization in the
sort of way that soundex does, such as first normalizing to NFD then
dropping all combining codepoints which then could allow 'á' to be eq to
'a'.  But this would not be a Unicode normalization function.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> I think you misunderstand Unicode normalization and equivalence. 
> There
> is no standard Unicode `normalize()` that would cause the above
> equality
> predicate to be true.  If you normalize to NFD (normal form
> decomposed)
> then a _prefix_ of those two strings will be equal, but that's
> clearly
> not what you're looking for.

From [1]:

"Unicode Normalization Forms are formally defined normalizations of
Unicode strings which make it possible to determine whether any two
Unicode strings are equivalent to each other. Depending on the
particular Unicode Normalization Form, that equivalence can either be a
canonical equivalence or a compatibility equivalence... A binary
comparison of the transformed strings will then determine equivalence."

NFC and NFD are based on Canonical Equivalence.

"Canonical equivalence is a fundamental equivalency between characters
or sequences of characters which represent the same abstract character,
and which when correctly displayed should always have the same visual
appearance and behavior."

Can you explain why NFC (the default form of normalization used by the
postgres normalize() function), followed by memcmp(), is not the right
thing to use to determine Canonical Equivalence?

Or are you saying that Canonical Equivalence is not a useful thing to
test?

What do you mean about the "prefix"?

In Postgres today:

  SELECT normalize(U&'\0061\0301', nfc)::bytea; -- \xc3a1
  SELECT normalize(U&'\00E1', nfc)::bytea; -- \xc3a1

  SELECT normalize(U&'\0061\0301', nfd)::bytea; -- \x61cc81
  SELECT normalize(U&'\00E1', nfd)::bytea; -- \x61cc81

which looks useful to me, but I assume you are saying that it doesn't
generalize well to other cases?

[1] https://unicode.org/reports/tr15/

> There are two ways to write 'á' in Unicode: one is pre-composed (one
> codepoint) and the other is decomposed (two codepoints in this
> specific
> case), and it would be nice to be able to preserve input form when
> storing strings but then still be able to index and match them
> form-insensitively (in the case of 'á' both equivalent
> representations
> should be considered equal, and for UNIQUE indexes they should be
> considered the same).

Sometimes preserving input differences is a good thing, other times
it's not, depending on the context. Almost any data type has some
aspects of the input that might not be preserved -- leading zeros in a
number, or whitespace in jsonb, etc.

If text is stored as normalized with NFC, it could be frustrating if
the retrieved string has a different binary representation than the
source data. But it could also be frustrating to look at two strings
made up of ordinary characters that look identical and for the database
to consider them unequal.

Regards,
    Jeff Davis





Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Yeah, because we lose forward-compatibility of some useful operations.

> Here, Jeff mentions normalization, but I think it's a major issue
> with
> collation support. If new code points are added, users can put them
> into the database before they are known to the collation library, and
> then when they become known to the collation library the sort order
> changes and indexes break.

The collation version number may reflect the change in understanding
about assigned code points that may affect collation -- though I'd like
to understand whether this is guaranteed or not.

Regardless, given that (a) we don't have a good story for migrating to
new collation versions; and (b) it would be painful to rebuild indexes
even if we did; then you are right that it's a problem.

>  Would we endorse a proposal to make
> pg_catalog.text with encoding UTF-8 reject code points that aren't
> yet
> known to the collation library? To do so would be tighten things up
> considerably from where they stand today, and the way things stand
> today is already rigid enough to cause problems for some users.

What problems exist today due to the rigidity of text?

I assume you mean because we reject invalid byte sequences? Yeah, I'm
sure that causes a problem for some (especially migrations), but it's
difficult for me to imagine a database working well with no rules at
all for the basic data types.

> Now, there is still the question of whether such a data type would
> properly belong in core or even contrib rather than being an
> out-of-core project. It's not obvious to me that such a data type
> would get enough traction that we'd want it to be part of PostgreSQL
> itself.

At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.

I also think we should expose some SQL functions to check for
unassigned code points. That sounds useful, especially since we already
expose normalization functions.

One could easily imagine a domain with CHECK(NOT
contains_unassigned(a)). Or an extension with a data type that uses the
internal functions.
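
Spelled out as a sketch (contains_unassigned() here is still
hypothetical, of course, and the domain names are just placeholders):

  CREATE DOMAIN assigned_text AS TEXT
    CHECK (NOT contains_unassigned(VALUE));

  -- or combined with the normalization idea:
  CREATE DOMAIN stable_text AS TEXT
    CHECK (NOT contains_unassigned(VALUE) AND VALUE IS NFC NORMALIZED);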

Whether we ever get to a core data type -- and more importantly,
whether anyone uses it -- I'm not sure.

>  But at the same time I can certainly understand why Jeff finds
> the status quo problematic.

Yeah, I am looking for a better compromise between:

  * everything is memcmp() and 'á' sometimes doesn't equal 'á'
(depending on code point sequence)
  * everything is constantly changing, indexes break, and text
comparisons are slow

A stable idea of unicode normalization based on using only assigned
code points is very tempting.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:
> On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> > I think you misunderstand Unicode normalization and equivalence. 
> > There is no standard Unicode `normalize()` that would cause the
> > above equality predicate to be true.  If you normalize to NFD
> > (normal form decomposed) then a _prefix_ of those two strings will
> > be equal, but that's clearly not what you're looking for.

Ugh, my client is not displaying 'a' correctly, thus I misunderstood your
post.

> From [1]:

Here's what you wrote in your post:

| [...] But it's really the same
| character with just a different representation, and if you normalize
| them they are equal:
|
|  SELECT normalize('á') = normalize('á'); -- true

but my client is not displaying 'a' correctly!  (It displays like 'a' but
it should display like 'á'.)

Bah.  So I'd (mis)interpreted you as saying that normalize('a') should
equal normalize('á').  Please disregard that part of my reply.

> > There are two ways to write 'á' in Unicode: one is pre-composed (one
> > codepoint) and the other is decomposed (two codepoints in this
> > specific case), and it would be nice to be able to preserve input
> > form when storing strings but then still be able to index and match
> > them form-insensitively (in the case of 'á' both equivalent
> > representations should be considered equal, and for UNIQUE indexes
> > they should be considered the same).
> 
> Sometimes preserving input differences is a good thing, other times
> it's not, depending on the context. Almost any data type has some
> aspects of the input that might not be preserved -- leading zeros in a
> number, or whitespace in jsonb, etc.

Almost every Latin input mode out there produces precomposed characters
and so they effectively produce NFC.  I'm not sure if the same is true
for, e.g., Hangul (Korean) and various other scripts.

But there are things out there that produce NFD.  Famously Apple's HFS+
uses NFD (or something very close to NFD).  So if you cut-n-paste things
that got normalized to NFD and paste them into contexts where
normalization isn't done, then you might start wanting to alter those
contexts to either normalize or be form-preserving/form-insensitive.
Sometimes you don't get to normalize, so you have to pick form-
preserving/form-insensitive behavior.

> If text is stored as normalized with NFC, it could be frustrating if
> the retrieved string has a different binary representation than the
> source data. But it could also be frustrating to look at two strings
> made up of ordinary characters that look identical and for the database
> to consider them unequal.

Exactly.  If you have such a case you might like the option to make your
database form-preserving and form-insensitive.  That means that indices
need to normalize strings, but tables need to store unnormalized
strings.

ZFS (filesystems are a bit like databases) does just that!
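
In Postgres terms, a rough sketch of that split using only what exists
today (an expression index rather than a new index type; the table
keeps whatever form was entered):

  CREATE TABLE t (s TEXT);
  CREATE UNIQUE INDEX t_s_nfc_idx ON t (normalize(s, NFC));

  -- form-insensitive lookups go through the same expression:
  SELECT * FROM t WHERE normalize(s, NFC) = normalize('á', NFC);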

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> Ugh, my client is not displaying 'a' correctly

Ugh. Is that an argument in favor of normalization or against?

I've also noticed that some fonts render the same character a bit
differently depending on the constituent code points. For instance, if
the accent is its own code point, it seems to be more prominent than if
a single code point represents both the base character and the accent.
That seems to be a violation of canonical equivalence, but I can
understand why that might be useful.

>
> Almost every Latin input mode out there produces precomposed
> characters
> and so they effectively produce NFC.

The problem is not the normal case, the problem will be things like
obscure input methods, some kind of software that's being too clever,
or some kind of malicious user trying to confuse the database.

>
> That means that indices
> need to normalize strings, but tables need to store unnormalized
> strings.

That's an interesting idea. Would the equality operator normalize
first, or are you saying that the index would need to recheck the
results?

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Mon, 2023-10-02 at 10:47 +0200, Peter Eisentraut wrote:
> I think a better direction here would be to work toward making
> nondeterministic collations usable on the global/database level and
> then
> encouraging users to use those.
>
> It's also not clear which way the performance tradeoffs would fall.
>
> Nondeterministic collations are obviously going to be slower, but by
> how
> much?  People have accepted moving from C locale to "real" locales
> because they needed those semantics.  Would it be any worse moving
> from
> real locales to "even realer" locales?

If you normalize first, then you can get some semantic improvements
without giving up on the stability and performance of memcmp(). That
seems like a win with zero costs in terms of stability or performance
(except perhaps some extra text->utext casts).

Going to a "real" locale gives more semantic benefits but at a very
high cost: depending on a collation provider library, dealing with
collation changes, and performance costs. While supporting the use of
nondeterministic collations at the database level may be a good idea,
it's not helping to reach the compromise that I'm trying to reach in
this thread.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote:
> On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> > Ugh, my client is not displaying 'a' correctly
> 
> Ugh. Is that an argument in favor of normalization or against?

Heheh, well, it's an argument in favor of more software getting this
right (darn it).

It's also an argument for building a time machine so HFS+ can just
always have used NFC.  But the existence of UTF-16 is proof that time
machines don't exist (or that only bad actors have them).

> I've also noticed that some fonts render the same character a bit
> differently depending on the constituent code points. For instance, if
> the accent is its own code point, it seems to be more prominent than if
> a single code point represents both the base character and the accent.
> That seems to be a violation of canonical equivalence, but I can
> understand why that might be useful.

Yes, that happens.  Did you know that the ASCII character set was
designed with overstrike in mind for typing of accented Latin
characters?  Unicode combining sequences are kinda like that, but more
complex.

Yes, the idea really was that you could write a<BS>' (or '<BS>a) to get á.
That's how people did it with typewriters anyways.

> > Almost every Latin input mode out there produces precomposed
> > characters and so they effectively produce NFC.
> 
> The problem is not the normal case, the problem will be things like
> obscure input methods, some kind of software that's being too clever,
> or some kind of malicious user trying to confuse the database.

_HFS+ enters the chat_

> > That means that indices
> > need to normalize strings, but tables need to store unnormalized
> > strings.
> 
> That's an interesting idea. Would the equality operator normalize
> first, or are you saying that the index would need to recheck the
> results?

You can optimize this to avoid having to normalize first.  Most strings
are not equal, and they tend to differ early.  And most strings will
likely be ASCII-mostly or in the same form anyways.  So you can just
walk a cursor down each string looking at two bytes, and if they are
both ASCII then you move each cursor forward by one byte, and if they
are not both ASCII then you take a slow path where you normalize one
grapheme cluster at each cursor (if necessary) and compare that.  (ZFS
does this.)

You can also assume ASCII-mostly, load as many bits of each string
(padding as needed) as will fit in SIMD registers, compare and check
that they're all ASCII, and if not then jump to the slow path.

You can also normalize one grapheme cluster at a time when hashing
(e.g., for hash indices), thus avoiding a large allocation if the string
is large.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Robert Haas
On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I assume you mean because we reject invalid byte sequences? Yeah, I'm
> sure that causes a problem for some (especially migrations), but it's
> difficult for me to imagine a database working well with no rules at
> all for the basic data types.

There's a very popular commercial database where, or so I have been
led to believe, any byte sequence at all is accepted when you try to
put values into the database. The rumors I've heard -- I have not
played with it myself -- are that when you try to do anything, byte
sequences that are not valid in the configured encoding are treated as
single-byte characters or something of that sort. So like if you had
UTF-8 as the encoding and the first byte of the string is something
that can only appear as a continuation byte in UTF-8, I think that
byte is just treated as a separate character. I don't quite know how
you make all of the operations work that way, but it seems like
they've come up with a somewhat-consistent set of principles that are
applied across the board. Very different from the PG philosophy, of
course. And I'm not saying it's better. But it does eliminate the
problem of being unable to load data into the database, because in
such a model there's no such thing as invalidly-encoded data. Instead,
an encoding like UTF-8 is effectively extended so that every byte
sequence represents *something*. Whether that something is what you
wanted is another story.

At any rate, if we were to go in the direction of rejecting code
points that aren't yet assigned, or aren't yet known to the collation
library, that's another way for data loading to fail. Which feels like
very defensible behavior, but not what everyone wants, or is used to.

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

That's a good idea.

> I also think we should expose some SQL functions to check for
> unassigned code points. That sounds useful, especially since we already
> expose normalization functions.

That's a good idea, too.

> One could easily imagine a domain with CHECK(NOT
> contains_unassigned(a)). Or an extension with a data type that uses the
> internal functions.

Yeah.

> Whether we ever get to a core data type -- and more importantly,
> whether anyone uses it -- I'm not sure.

Same here.

> Yeah, I am looking for a better compromise between:
>
>   * everything is memcmp() and 'á' sometimes doesn't equal 'á'
> (depending on code point sequence)
>   * everything is constantly changing, indexes break, and text
> comparisons are slow
>
> A stable idea of unicode normalization based on using only assigned
> code points is very tempting.

The fact that there are multiple types of normalization and multiple
notions of equality doesn't make this easier.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().

A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.

Maybe UTF8 might be a better name for the new type, since it would
denote the transform (and would allow for UTF16 and UTF32 some day,
though it's doubtful those would ever happen).

But it's one thing to specify Unicode (and transform) in the type and
another to specify an NF to normalize to on insert or on lookup.

How about new column constraint keywords, such as NORMALIZE (meaning
normalize on insert) and NORMALIZED (meaning reject non-canonical form
text), with an optional parenthetical by which to specify a non-default
form?  (These would apply to TEXT as well when the default encoding for
the DB is UTF-8.)

One could then ALTER TABLE to add this to existing tables.
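
Strawman syntax, purely hypothetical, just to show the shape of it:

  CREATE TABLE msgs (
    subject TEXT NORMALIZED,        -- reject input not already in NFC
    body    TEXT NORMALIZE (NFD)    -- normalize to NFD on insert/update
  );

  ALTER TABLE msgs
    ALTER COLUMN subject SET NORMALIZED;   -- also hypothetical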

This would also make it easier to add a form-preserving/form-insensitive
mode later if it turns out to be useful or necessary, maybe making it
the default for Unicode text in new tables.

> Questions:
> 
>  * Would this be useful enough to justify a new data type? Would it be
> confusing to know when to choose one versus the other?

Yes.  See above.  I think I'd rather have it be called UTF8, and the
normalization properties of it to be specified as column constraints.

>  * Would cross-type comparisons between TEXT and UTEXT become a major
> problem that would reduce the utility?

Maybe when the database's encoding is UTF8 then UTEXT (or UTF8) can be
an alias of TEXT.

>  * Should "some_utext_value = some_text_value" coerce the LHS to TEXT
> or the RHS to UTEXT?

Ooh, this is nice!  If the TEXT is _not_ UTF-8 then it could be
converted to UTF-8.  So I think which is RHS and which is LHS doesn't
matter -- it's which is UTF-8, and if both are then the only thing left
to do is normalize, and for that I'd take the LHS' form if the LHS is
UTF-8, else the RHS'.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Robert Haas
On Wed, Oct 4, 2023 at 1:27 PM Nico Williams <nico@cryptonector.com> wrote:
> A UTEXT type would be helpful for specifying that the text must be
> Unicode (in which transform?) even if the character data encoding for
> the database is not UTF-8.

That's actually pretty thorny ... because right now client_encoding
specifies the encoding to be used for all data sent to the client. So
would we convert the data from UTF8 to the selected client encoding?
Or what?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From: Chapman Flack
On 2023-10-04 13:47, Robert Haas wrote:
> On Wed, Oct 4, 2023 at 1:27 PM Nico Williams <nico@cryptonector.com> 
> wrote:
>> A UTEXT type would be helpful for specifying that the text must be
>> Unicode (in which transform?) even if the character data encoding for
>> the database is not UTF-8.
> 
> That's actually pretty thorny ... because right now client_encoding
> specifies the encoding to be used for all data sent to the client. So
> would we convert the data from UTF8 to the selected client encoding?

The SQL standard would have me able to:

CREATE TABLE foo (
   a CHARACTER VARYING CHARACTER SET UTF8,
   b CHARACTER VARYING CHARACTER SET LATIN1
)

and so on, and write character literals like

_UTF8'Hello, world!' and _LATIN1'Hello, world!'

and have those columns and data types independently contain what
they can contain, without constraints imposed by one overall
database encoding.

Obviously, we're far from being able to do that. But should it
become desirable to get closer, would it be worthwhile to also
try to follow how the standard would have it look?

Clearly, part of the job would involve making the wire protocol
able to transmit binary values and identify their encodings.

Regards,
-Chap



Re: Pre-proposal: unicode normalized text

From: Robert Haas
On Wed, Oct 4, 2023 at 2:02 PM Chapman Flack <chap@anastigmatix.net> wrote:
> Clearly, part of the job would involve making the wire protocol
> able to transmit binary values and identify their encodings.

Right. Which unfortunately is moving the goal posts into the
stratosphere compared to any other work mentioned so far. I agree it
would be great. But not if you want concrete progress any time soon.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From: Isaac Morland
On Wed, 4 Oct 2023 at 14:05, Chapman Flack <chap@anastigmatix.net> wrote:
> On 2023-10-04 13:47, Robert Haas wrote:
>
> The SQL standard would have me able to:
>
> CREATE TABLE foo (
>    a CHARACTER VARYING CHARACTER SET UTF8,
>    b CHARACTER VARYING CHARACTER SET LATIN1
> )
>
> and so on, and write character literals like
>
> _UTF8'Hello, world!' and _LATIN1'Hello, world!'
>
> and have those columns and data types independently contain what
> they can contain, without constraints imposed by one overall
> database encoding.
>
> Obviously, we're far from being able to do that. But should it
> become desirable to get closer, would it be worthwhile to also
> try to follow how the standard would have it look?
>
> Clearly, part of the job would involve making the wire protocol
> able to transmit binary values and identify their encodings.

I would go in the other direction (note: I’m ignoring all backward compatibility considerations related to the current design of Postgres).

Always store only UTF-8 in the database, and send only UTF-8 on the wire protocol. If we still want to have a concept of "client encoding", have the client libpq take care of translating the bytes between the bytes used by the caller and the bytes sent on the wire.

Note that you could still define columns as you say, but the character set specification would effectively act simply as a CHECK constraint on the characters allowed, essentially CHECK (column_name ~ '^[...all characters in encoding...]*$'). We don't allow different on-disk representations of dates or other data types, except when we really need to, and then we have multiple data types (e.g. int vs. float) rather than different ways of storing the same datatype.

What about characters not in UTF-8? If a character is important enough for us to worry about in Postgres, it’s important enough to get a U+ number from the Unicode Consortium, which automatically puts it in UTF-8. In the modern context, "plain text" means "UTF-8 encoded text", as far as I'm concerned.

Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
> any byte sequence at all is accepted when you try to
> put values into the database.

We support SQL_ASCII, which allows something similar.

> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned, or aren't yet known to the collation
> library, that's another way for data loading to fail.

A failure during data loading is either a feature or a bug, depending
on whether you are the one loading the data or the one trying to make
sense of it later ;-)

>  Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yeah, there are many reasons someone might want to accept unassigned
code points. An obvious one is if their application is on a newer
version of unicode where the codepoint *is* assigned.

>
> The fact that there are multiple types of normalization and multiple
> notions of equality doesn't make this easier.

NFC is really the only one that makes sense.

NFD is semantically the same as NFC, but expanded into a larger
representation. NFKC/NFKD are based on a more relaxed notion of
equality -- kind of like non-deterministic collations. These other
forms might make sense in certain cases, but not general use.

I believe that having a kind of text data type where it's stored in NFC
and compared with memcmp() would be a good place for many users to be --
probably most users. It's got all the performance and stability
benefits of memcmp(), with slightly richer semantics. It's less likely
that someone malicious can confuse the database by using different
representations of the same character.

The problem is that it's not universally better for everyone: there are
certainly users who would prefer that the codepoints they send to the
database are preserved exactly, and also users who would like to be
able to use unassigned code points.

Regards,
    Jeff Davis





Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
> The SQL standard would have me able to:
>
> CREATE TABLE foo (
>    a CHARACTER VARYING CHARACTER SET UTF8,
>    b CHARACTER VARYING CHARACTER SET LATIN1
> )
>
> and so on, and write character literals like
>
> _UTF8'Hello, world!' and _LATIN1'Hello, world!'

Is there a use case for that? UTF-8 is able to encode any unicode code
point, it's relatively compact, and it's backwards-compatible with 7-
bit ASCII. If you have a variety of text data in your system (and in
many cases even if not), then UTF-8 seems like the right solution.

Text data encoded 17 different ways requires a lot of bookkeeping in
the type system, and it also requires injecting a bunch of fallible
transcoding operators around just to compare strings.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Wed, Oct 04, 2023 at 01:38:15PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
> > The SQL standard would have me able to:
> > 
> > [...]
> > _UTF8'Hello, world!' and _LATIN1'Hello, world!'
> 
> Is there a use case for that? UTF-8 is able to encode any unicode code
> point, it's relatively compact, and it's backwards-compatible with 7-
> bit ASCII. If you have a variety of text data in your system (and in
> many cases even if not), then UTF-8 seems like the right solution.
> 
> Text data encoded 17 different ways requires a lot of bookkeeping in
> the type system, and it also requires injecting a bunch of fallible
> transcoding operators around just to compare strings.

Better that than TEXT blobs w/ the encoding given by the `CREATE
DATABASE` or `initdb` default!

It'd be a lot _less_ fragile to have all text tagged with an encoding
(indirectly, via its type which then denotes the encoding).

That would be a lot of work, but starting with just a UTF-8 text type
would be an improvement.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Chapman Flack
On 2023-10-04 16:38, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
>> The SQL standard would have me able to:
>> 
>> CREATE TABLE foo (
>>    a CHARACTER VARYING CHARACTER SET UTF8,
>>    b CHARACTER VARYING CHARACTER SET LATIN1
>> )
>> 
>> and so on
> 
> Is there a use case for that? UTF-8 is able to encode any unicode code
> point, it's relatively compact, and it's backwards-compatible with 7-
> bit ASCII. If you have a variety of text data in your system (and in
> many cases even if not), then UTF-8 seems like the right solution.

Well, for what reason does anybody run PG now with the encoding set
to anything besides UTF-8? I don't really have my finger on that pulse.
Could it be that it bloats common strings in their local script, and
with enough of those to store, it could matter to use the local
encoding that stores them more economically?

Also, while any Unicode transfer format can encode any Unicode code
point, I'm unsure whether it's yet the case that {any Unicode code
point} is a superset of every character repertoire associated with
every non-Unicode encoding.

The cheap glaring counterexample is SQL_ASCII. Half those code points
are *nobody knows what Unicode character* (or even *whether*). I'm not
insisting that's a good thing, but it is a thing.

It might be a very tidy future to say all text is Unicode and all
server encodings are UTF-8, but I'm not sure it wouldn't still
be a good step on the way to be able to store some things in
their own encodings. We have JSON and XML now, two data types
that are *formally defined* to accept any Unicode content, and
we hedge and mumble and say (well, as long as it goes in the
server encoding) and that makes me sad. Things like that should
be easy to handle even without declaring UTF-8 as a server-wide
encoding ... they already are their own distinct data types, and
could conceivably know their own encodings.

But there again, it's possible that going with unconditional
UTF-8 for JSON or XML documents could, in some regions, bloat them.

Regards,
-Chap



Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote:
> Always store only UTF-8 in the database

What problem does that solve? I don't see our encoding support as a big
source of problems, given that database-wide UTF-8 already works fine.
In fact, some postgres features only work with UTF-8.

I agree that we shouldn't add a bunch of bookkeeping and type system
support for per-column encodings without a clear use case, because that
would have a cost. But right now it's just a database-wide thing.

I don't see encodings as a major area to solve problems or innovate. At
the end of the day, encodings have little semantic significance, and
therefore limited upside and limited downside. Collations and
normalization get more interesting, but those are happening at a higher
layer than the encoding.


> What about characters not in UTF-8?

Honestly I'm not clear on this topic. Are the "private use" areas in
unicode enough to cover use cases for characters not recognized by
unicode? Which encodings in postgres can represent characters that
can't be automatically transcoded (without failure) to unicode?

Obviously if we have some kind of unicode-based type, it would only
work with encodings that are a subset of unicode.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Wed, Oct 04, 2023 at 05:32:50PM -0400, Chapman Flack wrote:
> Well, for what reason does anybody run PG now with the encoding set
> to anything besides UTF-8? I don't really have my finger on that pulse.

Because they still have databases that didn't use UTF-8 10 or 20 years
ago that they haven't migrated to UTF-8?

It's harder to think of why one might _want_ to store text in any
encoding other than UTF-8 for _new_ databases.

Though too there's no reason that it should be impossible other than
lack of developer interest: as long as text is tagged with its encoding,
it should be possible to store text in any number of encodings.

> Could it be that it bloats common strings in their local script, and
> with enough of those to store, it could matter to use the local
> encoding that stores them more economically?

UTF-8 bloat is not likely worth the trouble.  UTF-8 is only clearly
bloaty when compared to encodings with 1-byte code units, like
ISO-8859-*.  For CJK UTF-8 is not much more bloaty than native
non-Unicode encodings like SHIFT_JIS.

UTF-8 is not much bloatier than UTF-16 in general either.

Bloat is not really a good reason to avoid Unicode or any specific TF.

> Also, while any Unicode transfer format can encode any Unicode code
> point, I'm unsure whether it's yet the case that {any Unicode code
> point} is a superset of every character repertoire associated with
> every non-Unicode encoding.

It's not always been the case that Unicode is a strict superset of all
currently-in-use human scripts.  Making Unicode a strict superset of all
currently-in-use human scripts seems to be the Unicode Consortium's aim.

I think you're asking why not just use UTF-8 for everything, all the
time.  It's a fair question.  I don't have a reason to answer in the
negative (maybe someone else does).  But that doesn't mean that one
couldn't want to store text in many encodings (e.g., for historical
reasons).

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote:
> Better that than TEXT blobs w/ the encoding given by the `CREATE
> DATABASE` or `initdb` default!

From an engineering perspective, yes, per-column encodings would be
more flexible. But I still don't understand who exactly would use that,
and why.

It would take an awful lot of effort to implement and make the code
more complex, so we'd really need to see some serious demand for that.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Wed, Oct 04, 2023 at 04:01:26PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote:
> > Better that than TEXT blobs w/ the encoding given by the `CREATE
> > DATABASE` or `initdb` default!
> 
> From an engineering perspective, yes, per-column encodings would be
> more flexible. But I still don't understand who exactly would use that,
> and why.

Say you have a bunch of text files in different encodings for reasons
(historical).  And now say you want to store them in a database so you
can index them and search them.  Sure, you could use a filesystem, but
you want an RDBMS.  Well, the answer to this is "convert all those files
to UTF-8".

> It would take an awful lot of effort to implement and make the code
> more complex, so we'd really need to see some serious demand for that.

Yes, it's better to just use UTF-8.

The DB could implement conversions to/from other codesets and encodings
for clients that insist on it.  Why would clients insist anyways?
Better to do the conversions at the clients.

In the middle its best to just have Unicode, and specifically UTF-8,
then push all conversions to the edges of the system.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Isaac Morland
On Wed, 4 Oct 2023 at 17:37, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote:
> > Always store only UTF-8 in the database
>
> What problem does that solve? I don't see our encoding support as a big
> source of problems, given that database-wide UTF-8 already works fine.
> In fact, some postgres features only work with UTF-8.

My idea is in the context of a suggestion that we support specifying the encoding per column. I don't mean to suggest eliminating the ability to set a server-wide encoding, although I doubt there is any use case for using anything other than UTF-8 except for an old database that hasn’t been converted yet.

I see no reason to write different strings using different encodings in the data files, depending on what column they belong to. The various text types are all abstract data types which store sequences of characters (not bytes); if one wants bytes, then one has to encode them. Of course, if one wants UTF-8 bytes, then the encoding is, under the covers, the identity function, but conceptually it is still taking the characters stored in the database and converting them to bytes according to a specific encoding.

By contrast, although I don’t see it as a top-priority use case, I can imagine somebody wanting to restrict the characters stored in a particular column to characters that can be encoded in a particular encoding. That is what "CHARACTER SET LATIN1" and so on should mean. 
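
As a sketch of that idea with existing tools (convert_to() raises an error for characters that cannot be represented in the target encoding rather than returning false, so the rejection shows up as a conversion error instead of a check violation; the domain name is just illustrative):

  CREATE DOMAIN latin1_text AS TEXT
    CHECK (convert_to(VALUE, 'LATIN1') IS NOT NULL);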

> > What about characters not in UTF-8?
>
> Honestly I'm not clear on this topic. Are the "private use" areas in
> unicode enough to cover use cases for characters not recognized by
> unicode? Which encodings in postgres can represent characters that
> can't be automatically transcoded (without failure) to unicode?

Here I’m just anticipating a hypothetical objection, “what about characters that can’t be represented in UTF-8?” to my suggestion to always use UTF-8 and I’m saying we shouldn’t care about them. I believe the answers to your questions in this paragraph are “yes”, and “none”.

Re: Pre-proposal: unicode normalized text

From: Robert Haas
On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland <isaac.morland@gmail.com> wrote:
>> > What about characters not in UTF-8?
>>
>> Honestly I'm not clear on this topic. Are the "private use" areas in
>> unicode enough to cover use cases for characters not recognized by
>> unicode? Which encodings in postgres can represent characters that
>> can't be automatically transcoded (without failure) to unicode?
>
> Here I’m just anticipating a hypothetical objection, “what about characters that can’t be represented in UTF-8?” to my
> suggestion to always use UTF-8 and I’m saying we shouldn’t care about them. I believe the answers to your questions in
> this paragraph are “yes”, and “none”.

Years ago, I remember SJIS being cited as an example of an encoding
that had characters which weren't part of Unicode. I don't know
whether this is still a live issue.

But I do think that sometimes users are reluctant to perform encoding
conversions on the data that they have. Sometimes they're not
completely certain what encoding their data is in, and sometimes
they're worried that the encoding conversion might fail or produce
wrong answers. In theory, if your existing data is validly encoded and
you know what encoding it's in and it's easily mapped onto UTF-8,
there's no problem. You can just transcode it and be done. But a lot
of times the reality is a lot messier than that.

Which gives me some sympathy with the idea of wanting multiple
character sets within a database. Such a feature exists in some other
database systems and is, presumably, useful to some people. On the
other hand, to do that in PostgreSQL, we'd need to propagate the
character set/encoding information into all of the places that
currently get the typmod and collation, and that is not a small number
of places. It's a lot of infrastructure for the project to carry
around for a feature that's probably only going to continue to become
less relevant.

I suppose you never know, though. Maybe the Unicode consortium will
explode in a tornado of fiery rage and there will be dueling standards
making war over the proper way of representing an emoji of a dog
eating broccoli for decades to come. In that case, our hypothetical
multi-character-set feature might seem prescient.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From: Isaac Morland
On Thu, 5 Oct 2023 at 07:32, Robert Haas <robertmhaas@gmail.com> wrote:
 
> But I do think that sometimes users are reluctant to perform encoding
> conversions on the data that they have. Sometimes they're not
> completely certain what encoding their data is in, and sometimes
> they're worried that the encoding conversion might fail or produce
> wrong answers. In theory, if your existing data is validly encoded and
> you know what encoding it's in and it's easily mapped onto UTF-8,
> there's no problem. You can just transcode it and be done. But a lot
> of times the reality is a lot messier than that.

In the case you describe, the users don’t have text at all; they have bytes, and a vague belief about what encoding the bytes might be in and therefore what characters they are intended to represent. The correct way to store that in the database is using bytea. Text types should be for when you know what characters you want to store. In this scenario, the implementation detail of what encoding the database uses internally to write the data on the disk doesn't matter, any more than it matters to a casual user how a table is stored on disk.

Similarly, I don't believe we have a "YMD" data type which stores year, month, and day, without being specific as to whether it's Gregorian or Julian; if you have that situation, make a 3-tuple type or do something else. "Date" is for when you actually know what day you want to record.

Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Thu, 2023-10-05 at 07:31 -0400, Robert Haas wrote:
> It's a lot of infrastructure for the project to carry
> around for a feature that's probably only going to continue to become
> less relevant.

Agreed, at least until we understand the set of users per-column
encoding is important to. I acknowledge that the presence of per-column
encoding in the standard is some kind of signal there, but not enough
by itself to justify something so invasive.

> I suppose you never know, though.

On balance I think it's better to keep the code clean enough that we
can adapt to whatever unanticipated things happen in the future; rather
than to make the code very complicated trying to anticipate everything,
and then being completely unable to adapt it when something
unanticipated happens anyway.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Thu, Oct 05, 2023 at 07:31:54AM -0400, Robert Haas wrote:
> [...] On the other hand, to do that in PostgreSQL, we'd need to
> propagate the character set/encoding information into all of the
> places that currently get the typmod and collation, and that is not a
> small number of places. It's a lot of infrastructure for the project
> to carry around for a feature that's probably only going to continue
> to become less relevant.

Text+encoding can be just like bytea with a one- or two-byte prefix
indicating what codeset+encoding it's in.  That'd be how to encode
such text values on the wire, though on disk the column's type should
indicate the codeset+encoding, so no need to add a prefix to the value.

Complexity would creep in around when and whether to perform automatic
conversions.  The easy answer would be "never, on the server side", but
on the client side it might be useful to convert to/from the locale's
codeset+encoding when displaying to the user or accepting user input.

If there's no automatic server-side codeset/encoding conversions then
the server-side cost of supporting non-UTF-8 text should not be too high
dev-wise -- it's just (famous last words) a generic text type
parameterized by codeset+encoding type.  There would not even be a hard
need for functions for conversions, though there would be demand for
them.

But I agree that if there's no need, there's no need.  UTF-8 is great,
and if only all PG users would just switch then there's not much more to
do.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From: Jeff Davis
On Thu, 2023-10-05 at 09:10 -0400, Isaac Morland wrote:
> In the case you describe, the users don’t have text at all; they have
> bytes, and a vague belief about what encoding the bytes might be in
> and therefore what characters they are intended to represent. The
> correct way to store that in the database is using bytea.

I wouldn't be so absolute. It's text data to the user, and is
presumably working fine for them now, and if they switched to bytea
today then 'foo' would show up as '\x666f6f' in psql.

The point is that this is a somewhat messy problem because there's so
much software out there that treats byte strings and textual data
interchangably. Rust goes the extra mile to organize all of this, and
it ends up with:

  * String -- always UTF-8, never NUL-terminated
  * CString -- NUL-terminated byte sequence with no internal NULs
  * OsString[3] -- needed to make a Path[4], which is needed to open a
file[5]
  * Vec<u8> -- any byte sequence

and I suppose we could work towards offering better support for these
different types, the casts between them, and delivering them in a form
the client can understand. But I wouldn't describe it as a solved
problem with one "correct" solution.

One takeaway from this discussion is that it would be useful to provide
more flexibility in how values are represented to the client in a more
general way. In addition to encoding, representational issues have come
up with binary formats, bytea, extra_float_digits, etc.

The collection of books by CJ Date & Hugh Darwen, et al. (sorry I don't
remember exactly which books), made the theoretical case for explicitly
distinguishing values from representations at the language level. We're
starting to see that representational issues can't be satisfied with a
few special cases and hacks -- it's worth thinking about a general
solution to that problem. There was also a lot of relevant discussion
about how to think about overlapping domains (e.g. ASCII is valid in
any of these text domains).

>  Text types should be for when you know what characters you want to
> store. In this scenario, the implementation detail of what encoding
> the database uses internally to write the data on the disk doesn't
> matter, any more than it matters to a casual user how a table is
> stored on disk.

Perhaps the user and application do know, and there's some kind of
subtlety that we're missing, or some historical artefact that we're not
accounting for, and that somehow makes UTF-8 unsuitable. Surely there
are applications that treat certain byte sequences in non-standard
ways, and perhaps not all of those byte sequences can be reproduced by
transcoding from UTF-8 to the client_encoding. In any case, I would
want to understand in detail why a user thinks UTF8 is not good enough
before I make too strong of a statement here.

Even the terminal font that I use renders some "identical" unicode
characters slightly differently depending on the code points from which
they are composed. I believe that's an intentional convenience to make
it more apparent why the "diff" command (or other byte-based tool) is
showing a difference between two textually identical strings, but it's
also a violation of unicode. (This is another reason why normalization
might not be for everyone, but I believe it's still good in typical
cases.)

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From: Tom Lane
Nico Williams <nico@cryptonector.com> writes:
> Text+encoding can be just like bytea with a one- or two-byte prefix
> indicating what codeset+encoding it's in.  That'd be how to encode
> such text values on the wire, though on disk the column's type should
> indicate the codeset+encoding, so no need to add a prefix to the value.

The precedent of BOMs (byte order marks) suggests strongly that
such a solution would be horrible to use.

            regards, tom lane



Re: Pre-proposal: unicode normalized text

From: Nico Williams
On Thu, Oct 05, 2023 at 03:49:37PM -0400, Tom Lane wrote:
> Nico Williams <nico@cryptonector.com> writes:
> > Text+encoding can be just like bytea with a one- or two-byte prefix
> > indicating what codeset+encoding it's in.  That'd be how to encode
> > such text values on the wire, though on disk the column's type should
> > indicate the codeset+encoding, so no need to add a prefix to the value.
> 
> The precedent of BOMs (byte order marks) suggests strongly that
> such a solution would be horrible to use.

This is just how you encode the type of the string.  You have any number
of options.  The point is that already PG can encode binary data, so if
how to encode text of disparate encodings on the wire, building on top
of the encoding of bytea is an option.



Re: Pre-proposal: unicode normalized text

From: Peter Eisentraut
On 03.10.23 21:54, Jeff Davis wrote:
>> Here, Jeff mentions normalization, but I think it's a major issue
>> with
>> collation support. If new code points are added, users can put them
>> into the database before they are known to the collation library, and
>> then when they become known to the collation library the sort order
>> changes and indexes break.
> 
> The collation version number may reflect the change in understanding
> about assigned code points that may affect collation -- though I'd like
> to understand whether this is guaranteed or not.

This is correct.  The collation version number produced by ICU contains 
the UCA version, which is effectively the Unicode version (14.0, 15.0, 
etc.).  Since new code point assignments can only come from new Unicode 
versions, a new assigned code point will always result in a different 
collation version.

For example, with ICU 70 / CLDR 40 / Unicode 14:

select collversion from pg_collation where collname = 'unicode';
= 153.112

With ICU 72 / CLDR 42 / Unicode 15:
= 153.120
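
To compare the recorded version with what the installed ICU currently
reports, something like this also works (the values shown will of course
depend on the local ICU build):

select collname, collversion, pg_collation_actual_version(oid)
from pg_collation where collname = 'unicode';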

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

If you want to be rigid about it, you also need to consider whether the 
Unicode version used by the ICU library in use matches the one used by 
the in-core tables.




Re: Pre-proposal: unicode normalized text

From
Peter Eisentraut
Date:
On 05.10.23 19:30, Jeff Davis wrote:
> Agreed, at least until we understand the set of users per-column
> encoding is important to. I acknowledge that the presence of per-column
> encoding in the standard is some kind of signal there, but not enough
> by itself to justify something so invasive.

The per-column encoding support in SQL is clearly a legacy feature from 
before Unicode.  If one were to write something like SQL today, one 
would most likely just specify, "everything is Unicode".




Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote:
> If you want to be rigid about it, you also need to consider whether
> the
> Unicode version used by the ICU library in use matches the one used
> by
> the in-core tables.

What problem are you concerned about here? I thought about it and I
didn't see an obvious issue.

If the ICU unicode version is ahead of the Postgres unicode version,
and no unassigned code points are used according to the Postgres
version, then there's no problem.

And in the other direction, there might be some code points that are
assigned according to the postgres unicode version but unassigned
according to the ICU version. But that would be tracked by the
collation version as you pointed out earlier, so upgrading ICU would be
like any other ICU upgrade (with the same risks). Right?

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Robert Haas
Date:
On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote:
> Text+encoding can be just like bytea with a one- or two-byte prefix
> indicating what codeset+encoding it's in.  That'd be how to encode
> such text values on the wire, though on disk the column's type should
> indicate the codeset+encoding, so no need to add a prefix to the value.

Well, that would be making the encoding a per-value property, rather
than a per-column property like collation as I proposed. I can't see
that working out very nicely, because collations are
encoding-specific. It wouldn't make any sense if the column collation
were en_US.UTF8 or ko_KR.eucKR or en_CA.ISO8859-1 (just to pick a few
values that are legal on my machine) while data stored in the column
was from a whole bunch of different encodings, at most one of which
could be the one to which the column's collation applied. That would
end up meaning, for example, that such a column was very hard to sort.

For that and other reasons, I suspect that the utility of storing data
from a variety of different encodings in the same database column is
quite limited. What I think people really want is a whole column in
some encoding that isn't the normal one for that database. That's not
to say we should add such a feature, but if we do, I think it should
be that, not a different encoding for every individual value.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From
Nico Williams
Date:
On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote:
> > Text+encoding can be just like bytea with a one- or two-byte prefix
> > indicating what codeset+encoding it's in.  That'd be how to encode
> > such text values on the wire, though on disk the column's type should
> > indicate the codeset+encoding, so no need to add a prefix to the value.
> 
> Well, that would be making the encoding a per-value property, rather
> than a per-column property like collation as I proposed. I can't see

On-disk it would be just a property of the type, not part of the value.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Thu, 2023-10-05 at 14:52 -0500, Nico Williams wrote:
> This is just how you encode the type of the string.  You have any
> number
> of options.  The point is that PG can already encode binary data, so
> if the question is
> how to encode text of disparate encodings on the wire, building on
> top
> of the encoding of bytea is an option.

There's another significant discussion going on here:

https://www.postgresql.org/message-id/CA+TgmoZ8r8xb_73WzKHGb00cV3tpHV_U0RHuzzMFKvLepdu2Jw@mail.gmail.com

about how to handle binary formats better, so it's not clear to me that
it's a great precedent to expand upon. At least not yet.

I think it would be interesting to think more generally about these
representational issues in a way that accounts for binary formats,
extra_float_digits, client_encoding, etc. But I see that as more of an
issue with how the client expects to receive the data -- nobody has
presented a reason in this thread that we need per-column encodings on
the server.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Robert Haas
Date:
On Fri, Oct 6, 2023 at 1:38 PM Nico Williams <nico@cryptonector.com> wrote:
> On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote:
> > > Text+encoding can be just like bytea with a one- or two-byte prefix
> > > indicating what codeset+encoding it's in.  That'd be how to encode
> > > such text values on the wire, though on disk the column's type should
> > > indicate the codeset+encoding, so no need to add a prefix to the value.
> >
> > Well, that would be making the encoding a per-value property, rather
> > than a per-column property like collation as I proposed. I can't see
>
> On-disk it would be just a property of the type, not part of the value.

I mean, that's not how it works.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From
Nico Williams
Date:
On Fri, Oct 06, 2023 at 02:17:32PM -0400, Robert Haas wrote:
> On Fri, Oct 6, 2023 at 1:38 PM Nico Williams <nico@cryptonector.com> wrote:
> > On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> > > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote:
> > > > Text+encoding can be just like bytea with a one- or two-byte prefix
> > > > indicating what codeset+encoding it's in.  That'd be how to encode
> > > > such text values on the wire, though on disk the column's type should
> > > > indicate the codeset+encoding, so no need to add a prefix to the value.
> > >
> > > Well, that would be making the encoding a per-value property, rather
> > > than a per-column property like collation as I proposed. I can't see
> >
> > On-disk it would be just a property of the type, not part of the value.
> 
> I mean, that's not how it works.

Sure, because TEXT in PG doesn't have codeset+encoding as part of it --
it's whatever the database's encoding is.  Collation can and should be a
property of a column, since for Unicode it wouldn't be reasonable to
make that part of the type.  But codeset+encoding should really be a
property of the type if PG were to support more than one.  IMO.



Re: Pre-proposal: unicode normalized text

From
Robert Haas
Date:
On Fri, Oct 6, 2023 at 2:25 PM Nico Williams <nico@cryptonector.com> wrote:
> > > > Well, that would be making the encoding a per-value property, rather
> > > > than a per-column property like collation as I proposed. I can't see
> > >
> > > On-disk it would be just a property of the type, not part of the value.
> >
> > I mean, that's not how it works.
>
> Sure, because TEXT in PG doesn't have codeset+encoding as part of it --
> it's whatever the database's encoding is.  Collation can and should be a
> property of a column, since for Unicode it wouldn't be reasonable to
> make that part of the type.  But codeset+encoding should really be a
> property of the type if PG were to support more than one.  IMO.

No, what I mean is, you can't just be like "oh, the varlena will be
different in memory than on disk" as if that were no big deal.

I agree that, as an alternative to encoding being a column property,
it could instead be completely a type property, meaning that if you
want to store, say, LATIN1 text in your UTF-8 database, you first
create a latin1text data type and then use it, rather than, as in the
model I proposed, creating a text column and then applying a setting
like ENCODING latin1 to it. I think that there might be some problems
with that model, but it could also have some benefits. If someone were
going to make a run at implementing this, they might want to consider
both designs and evaluate the tradeoffs.
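
Just to make the two designs concrete -- both of these use hypothetical
syntax; neither an ENCODING column option nor a latin1text type exists
today:

  -- encoding as a column property (hypothetical syntax)
  CREATE TABLE msgs (body text ENCODING latin1);

  -- encoding as a type property (hypothetical type)
  CREATE TABLE msgs (body latin1text);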

But, even if we were all convinced that this kind of feature was good
to add, I think it would almost certainly be wrong to invent new
varlena features along the way.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> What I think people really want is a whole column in
> some encoding that isn't the normal one for that database.

Do people really want that? I'd be curious to know why.

A lot of modern projects are simply declaring UTF-8 to be the "one true
way". I am not suggesting that we do that, but it seems odd to go in
the opposite direction and have greater flexibility for many encodings.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Isaac Morland
Date:
On Fri, 6 Oct 2023 at 15:07, Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> What I think people really want is a whole column in
> some encoding that isn't the normal one for that database.

Do people really want that? I'd be curious to know why.

A lot of modern projects are simply declaring UTF-8 to be the "one true
way". I am not suggesting that we do that, but it seems odd to go in
the opposite direction and have greater flexibility for many encodings.

And even if they want it, we can give it to them when we send/accept the data from the client; just because they want to store ISO-8859-1 doesn't mean the actual bytes on the disk need to be that. And by "client" maybe I mean the client end of the network connection, and maybe I mean the program that is calling in to libpq.

If they try to submit data that cannot possibly be encoded in the stated encoding because the bytes they submit don't correspond to any string in that encoding, then that is unambiguously an error, just as trying to put February 30 in a date column is an error.
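
To sketch what that boundary behaviour looks like with functions that exist today (the literal bytes below are only an illustration):

  -- transcoding works when the bytes are well-formed in the stated encoding
  SELECT convert_from(convert_to('résumé', 'LATIN1'), 'LATIN1');

  -- and bytes that aren't valid in that encoding are an error,
  -- just like the impossible date
  SELECT convert_from('\xc3'::bytea, 'UTF8');  -- ERROR: invalid byte sequence
  SELECT '2024-02-30'::date;                   -- ERROR: out of range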

Is there a single other data type where anybody is even discussing letting the client tell us how to write the data on disk?

Re: Pre-proposal: unicode normalized text

From
Matthias van de Meent
Date:


On Fri, 6 Oct 2023, 21:08 Jeff Davis, <pgsql@j-davis.com> wrote:
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> What I think people really want is a whole column in
> some encoding that isn't the normal one for that database.

Do people really want that? I'd be curious to know why.

One reason someone would like this is because a database cluster may have been initialized with something like --no-locale (thus getting defaulted to LC_COLLATE=C, which is desired behaviour and gets fast strcmp operations for indexing, and an SQL_ASCII encoding, which is not exactly expected but can be sufficient for some workloads), but now that the data has grown they want to use en_US.utf8 collations in some of their new and modern tables' fields?
Or, a user wants to maintain literal translation tables, where different encodings would need to be used for different languages to cover the full script when Unicode might not cover the full character set yet.
Additionally, I'd imagine specialized encodings like Shift_JIS could be more space-efficient than UTF-8 for e.g. Japanese text, which might be useful for someone who wants to be a bit more frugal with storage when they know text is guaranteed to be in some encoding's native language: compression can do the same work, but also adds significant overhead.

I've certainly experienced situations where I forgot to explicitly include the encoding in initdb --no-locale and then only much later noticed that my big data load is useless due to an inability to create UTF-8 collated indexes.
I often use --no-locale to make string indexing fast (locales/collation are not often important to my workload) and to block any environment variables from being carried over into the installation. An ability to set or update the encoding of columns would help reduce the pain: I would no longer have to re-initialize the database or cluster from 0.
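
For what it's worth, the settings a cluster ended up with are at least easy to inspect after the fact, e.g.:

  SELECT datname, pg_encoding_to_char(encoding) AS encoding,
         datcollate, datctype
  FROM pg_database;

but by then the data is already loaded, so that doesn't remove the pain.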

Kind regards,

Matthias van de Meent

Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
> > At minimum I think we need to have some internal functions to check
> > for
> > unassigned code points. That belongs in core, because we generate
> > the
> > unicode tables from a specific version.
>
> That's a good idea.

Patch attached.

I added a new perl script to parse UnicodeData.txt and generate a
lookup table (of ranges, which can be binary-searched).

The C entry point does the same thing as u_charType(), and I also
matched the enum numeric values for convenience. I didn't use
u_charType() because I don't think this kind of unicode functionality
should depend on ICU, and I think it should match other Postgres
Unicode functionality.

Strictly speaking, I only needed to know whether it's unassigned or
not, not the general category. But it seemed easy enough to return the
general category, and it will be easier to create other potentially-
useful functions on top of this.

The tests do require ICU though, because I compare with the results of
u_charType().

Regards,
    Jeff Davis


Attachment

Re: Pre-proposal: unicode normalized text

From
Robert Haas
Date:
On Fri, Oct 6, 2023 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> > What I think people really want is a whole column in
> > some encoding that isn't the normal one for that database.
>
> Do people really want that? I'd be curious to know why.

Because it's a feature that exists in other products and so having it
eases migrations and/or replication of data between systems.

I'm not saying that there are a lot of people who want this, any more.
I think there used to be more interest in it. But the point of the
comment was that people who want multiple character set support want
it as a per-column property, not a per-value property. I've never
heard of anyone wanting to store text blobs in multiple distinct
character sets in the same column. But I have heard of people wanting
text blobs in multiple distinct character sets in the same database,
each one in its own column.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From
Peter Eisentraut
Date:
On 07.10.23 03:18, Jeff Davis wrote:
> On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
>>> At minimum I think we need to have some internal functions to check
>>> for
>>> unassigned code points. That belongs in core, because we generate
>>> the
>>> unicode tables from a specific version.
>> That's a good idea.
> Patch attached.

Can you restate what this is supposed to be for?  This thread appears to 
have morphed from "let's normalize everything" to "let's check for 
unassigned code points", but I'm not sure what we are aiming for now.




Re: Pre-proposal: unicode normalized text

From
Peter Eisentraut
Date:
On 06.10.23 19:22, Jeff Davis wrote:
> On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote:
>> If you want to be rigid about it, you also need to consider whether
>> the
>> Unicode version used by the ICU library in use matches the one used
>> by
>> the in-core tables.
> What problem are you concerned about here? I thought about it and I
> didn't see an obvious issue.
> 
> If the ICU unicode version is ahead of the Postgres unicode version,
> and no unassigned code points are used according to the Postgres
> version, then there's no problem.
> 
> And in the other direction, there might be some code points that are
> assigned according to the postgres unicode version but unassigned
> according to the ICU version. But that would be tracked by the
> collation version as you pointed out earlier, so upgrading ICU would be
> like any other ICU upgrade (with the same risks). Right?

It might be alright in this particular combination of circumstances. 
But in general if we rely on these tables for correctness (e.g., check 
that a string is normalized before passing it to a function that 
requires it to be normalized), we would need to consider this.  The 
correct fix would then probably be to not use our own tables but use 
some ICU function to achieve the desired task.




Re: Pre-proposal: unicode normalized text

From
Robert Haas
Date:
On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> Can you restate what this is supposed to be for?  This thread appears to
> have morphed from "let's normalize everything" to "let's check for
> unassigned code points", but I'm not sure what we are aiming for now.

Jeff can say what he wants it for, but one obvious application would
be to have the ability to add a CHECK constraint that forbids
inserting unassigned code points into your database, which would be
useful if you're worried about forward-compatibility with collation
definitions that might be extended to cover those code points in the
future. Another application would be to find data already in your
database that has this potential problem.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Tue, 2023-10-10 at 10:02 -0400, Robert Haas wrote:
> On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut
> <peter@eisentraut.org> wrote:
> > Can you restate what this is supposed to be for?  This thread
> > appears to
> > have morphed from "let's normalize everything" to "let's check for
> > unassigned code points", but I'm not sure what we are aiming for
> > now.

It was a "pre-proposal", so yes, the goalposts have moved a bit. Right
now I'm aiming to get some primitives in place that will be useful by
themselves, but also that we can potentially build on.

Attached is a new version of the patch which introduces some SQL
functions as well:

  * unicode_is_valid(text): returns true if all codepoints are
assigned, false otherwise
  * unicode_version(): version of unicode Postgres is built with
  * icu_unicode_version(): version of Unicode ICU is built with

I'm not 100% clear on the consequences of differences between the PG
unicode version and the ICU unicode version, but because normalization
uses the Postgres version of Unicode, I believe the Postgres version of
Unicode should also be available to determine whether a code point is
assigned or not.

We may also find it interesting to use the PG Unicode tables for regex
character classification. This is just an idea and we can discuss
whether that makes sense or not, but having the primitives in place
seems like a good idea regardless.

> Jeff can say what he wants it for, but one obvious application would
> be to have the ability to add a CHECK constraint that forbids
> inserting unassigned code points into your database, which would be
> useful if you're worried about forward-compatibility with collation
> definitions that might be extended to cover those code points in the
> future. Another application would be to find data already in your
> database that has this potential problem.

Exactly. Avoiding unassigned code points also allows you to be forward-
compatible with normalization.
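
To make that concrete, usage would look something like this (function
names as in the attached patch; the table and column are placeholders):

  SELECT unicode_version(), icu_unicode_version();

  -- forbid unassigned code points in a particular column
  ALTER TABLE docs ADD CONSTRAINT docs_assigned_only
    CHECK (unicode_is_valid(body));

  -- or find existing values that already use unassigned code points
  SELECT ctid FROM docs WHERE NOT unicode_is_valid(body);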

Regards,
    Jeff Davis



Attachment

Re: Pre-proposal: unicode normalized text

From
Peter Eisentraut
Date:
On 10.10.23 16:02, Robert Haas wrote:
> On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> Can you restate what this is supposed to be for?  This thread appears to
>> have morphed from "let's normalize everything" to "let's check for
>> unassigned code points", but I'm not sure what we are aiming for now.
> 
> Jeff can say what he wants it for, but one obvious application would
> be to have the ability to add a CHECK constraint that forbids
> inserting unassigned code points into your database, which would be
> useful if you're worried about forward-compatibility with collation
> definitions that might be extended to cover those code points in the
> future.

I don't see how this would really work in practice.  Whether your data 
has unassigned code points or not, when the collations are updated to 
the next Unicode version, the collations will have a new version number, 
and so you need to run the refresh procedure in any case.
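
(The refresh procedure being, roughly: rebuild whatever depends on the
collation and then record the new version, e.g.

  REINDEX INDEX some_text_idx;                    -- for each affected index
  ALTER COLLATION "unicode" REFRESH VERSION;
  ALTER DATABASE mydb REFRESH COLLATION VERSION;  -- or per database

with placeholder object names.)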




Re: Pre-proposal: unicode normalized text

From
Peter Eisentraut
Date:
On 11.10.23 03:08, Jeff Davis wrote:
>    * unicode_is_valid(text): returns true if all codepoints are
> assigned, false otherwise

We need to be careful about precise terminology.  "Valid" has a defined 
meaning for Unicode.  A byte sequence can be valid or not as UTF-8.  But 
a string containing unassigned code points is not not-"valid" as Unicode.

>    * unicode_version(): version of unicode Postgres is built with
>    * icu_unicode_version(): version of Unicode ICU is built with

This seems easy enough, but it's not clear what users would actually do 
with that.




Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> On 11.10.23 03:08, Jeff Davis wrote:
> >    * unicode_is_valid(text): returns true if all codepoints are
> > assigned, false otherwise
>
> We need to be careful about precise terminology.  "Valid" has a
> defined
> meaning for Unicode.  A byte sequence can be valid or not as UTF-8. 
> But
> a string containing unassigned code points is not not-"valid" as
> Unicode.

Agreed. Perhaps "unicode_assigned()" is better?

> >    * unicode_version(): version of unicode Postgres is built with
> >    * icu_unicode_version(): version of Unicode ICU is built with
>
> This seems easy enough, but it's not clear what users would actually
> do
> with that.

Just there to make it visible. If it affects the semantics (which it
does currently for normalization) it seems wise to have some way to
access the version.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Wed, 2023-10-11 at 08:51 +0200, Peter Eisentraut wrote:
> I don't see how this would really work in practice.  Whether your
> data
> has unassigned code points or not, when the collations are updated to
> the next Unicode version, the collations will have a new version
> number,
> and so you need to run the refresh procedure in any case.

Even with a version number, we don't provide a great refresh procedure
or document how it should be done. In practice, avoiding unassigned
code points might mitigate some kinds of problems, especially for glibc
which has a very coarse version number.

In any case, a CHECK constraint to avoid unassigned code points has
utility to be forward-compatible with normalization, and also might
just be a good sanity check.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> We need to be careful about precise terminology.  "Valid" has a
> defined
> meaning for Unicode.  A byte sequence can be valid or not as UTF-8. 
> But
> a string containing unassigned code points is not not-"valid" as
> Unicode.

New patch attached, function name is "unicode_assigned".

I believe the patch has utility as-is, but I've been brainstorming a
few more ideas that could build on it:

* Add a per-database option to enforce only storing assigned unicode
code points.

* (More radical) Add a per-database option to normalize all text in
NFC.

* Do character classification in Unicode rather than relying on
glibc/ICU. This would affect regex character classes, etc., but not
affect upper/lower/initcap nor collation. I did some experiments and
the General Category doesn't change a lot: a total of 197 characters
changed their General Category since Unicode 6.0.0, and only 5 since
Unicode 11.0.0. I'm not quite sure how to expose this, but it seems like a
nicer way to handle it than tying it into the collation provider.

Regards,
    Jeff Davis


Attachment

Re: Pre-proposal: unicode normalized text

From
"Daniel Verite"
Date:
    Jeff Davis wrote:

> I believe the patch has utility as-is, but I've been brainstorming a
> few more ideas that could build on it:
>
> * Add a per-database option to enforce only storing assigned unicode
> code points.

There's a problem in the fact that the set of assigned code points is
expanding with every Unicode release, which happens about every year.

If we had this option in Postgres 11 released in 2018 it would use
Unicode 11, and in 2023 this feature would reject thousands of code
points that have been assigned since then.

Aside from that, aborting a transaction because there's an
unassigned code point in a string feels like doing too much,
too late.
The programs that want to filter out unwanted code points
do it before they hit the database, client-side.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Pre-proposal: unicode normalized text

From
Robert Haas
Date:
On Tue, Oct 17, 2023 at 11:07 AM Daniel Verite <daniel@manitou-mail.org> wrote:
> There's a problem in the fact that the set of assigned code points is
> expanding with every Unicode release, which happens about every year.
>
> If we had this option in Postgres 11 released in 2018 it would use
> Unicode 11, and in 2023 this feature would reject thousands of code
> points that have been assigned since then.

Are code points assigned from a gapless sequence? That is, is the
implementation of codepoint_is_assigned(char) just 'codepoint <
SOME_VALUE' and SOME_VALUE increases over time?

If so, we could consider having a function that lets you specify the
bound as an input parameter. But whether anyone would use it, or know
how to set that input parameter, is questionable. The real issue here
is whether you can figure out which of the code points that you could
put into the database already have collation definitions.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From
Isaac Morland
Date:
On Tue, 17 Oct 2023 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote:
 
Are code points assigned from a gapless sequence? That is, is the
implementation of codepoint_is_assigned(char) just 'codepoint <
SOME_VALUE' and SOME_VALUE increases over time?

Not even close. Code points are organized in blocks, e.g. for mathematical symbols or Ethiopic script. Sometimes new blocks are added, sometimes new characters are added to existing blocks. Where they go is a combination of convenience, history, and planning. 

Re: Pre-proposal: unicode normalized text

From
Robert Haas
Date:
On Tue, Oct 17, 2023 at 11:38 AM Isaac Morland <isaac.morland@gmail.com> wrote:
> On Tue, 17 Oct 2023 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote:
>> Are code points assigned from a gapless sequence? That is, is the
>> implementation of codepoint_is_assigned(char) just 'codepoint <
>> SOME_VALUE' and SOME_VALUE increases over time?
>
> Not even close. Code points are organized in blocks, e.g. for mathematical symbols or Ethiopic script. Sometimes new
> blocks are added, sometimes new characters are added to existing blocks. Where they go is a combination of convenience,
> history, and planning.

Ah. Good to know.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Tue, 2023-10-17 at 17:07 +0200, Daniel Verite wrote:
> There's a problem in the fact that the set of assigned code points is
> expanding with every Unicode release, which happens about every year.
>
> If we had this option in Postgres 11 released in 2018 it would use
> Unicode 11, and in 2023 this feature would reject thousands of code
> points that have been assigned since then.

That wouldn't be good for everyone, but might it be good for some
users?

We already expose normalization functions. If users are depending on
normalization, and they have unassigned code points in their system,
that will break when we update Unicode. By restricting themselves to
assigned code points, normalization is guaranteed to be forward-
compatible.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Mon, 2023-10-16 at 20:32 -0700, Jeff Davis wrote:
> On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> > We need to be careful about precise terminology.  "Valid" has a
> > defined
> > meaning for Unicode.  A byte sequence can be valid or not as UTF-
> > 8. 
> > But
> > a string containing unassigned code points is not not-"valid" as
> > Unicode.
>
> New patch attached, function name is "unicode_assigned".

I plan to commit something like v3 early next week unless someone else
has additional comments or I missed a concern.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Thomas Munro
Date:
bowerbird and hammerkop didn't like commit a02b37fc.  They're still
using the old 3rd build system that is not tested by CI.  It's due for
removal in the 17 cycle IIUC but in the meantime I guess the new
codegen script needs to be invoked by something under src/tools/msvc?

  varlena.obj : error LNK2019: unresolved external symbol
unicode_category referenced in function unicode_assigned
[H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj]



Re: Pre-proposal: unicode normalized text

From
Nico Williams
Date:
On Fri, Oct 06, 2023 at 02:37:06PM -0400, Robert Haas wrote:
> > Sure, because TEXT in PG doesn't have codeset+encoding as part of it --
> > it's whatever the database's encoding is.  Collation can and should be a
> > property of a column, since for Unicode it wouldn't be reasonable to
> > make that part of the type.  But codeset+encoding should really be a
> > property of the type if PG were to support more than one.  IMO.
> 
> No, what I mean is, you can't just be like "oh, the varlena will be
> different in memory than on disk" as if that were no big deal.

It would have to be the same in memory as on disk, indeed, but you might
need new types in C as well for that.

> I agree that, as an alternative to encoding being a column property,
> it could instead be completely a type property, meaning that if you
> want to store, say, LATIN1 text in your UTF-8 database, you first
> create a latin1text data type and then use it, rather than, as in the
> model I proposed, creating a text column and then applying a setting
> like ENCODING latin1 to it. I think that there might be some problems

Yes, that was the idea.

> with that model, but it could also have some benefits. [...]

Mainly, I think, the question is whether you want PG to do automatic
codeset conversions (ugly and problematic) or not, e.g. when using text
functions.

Automatic codeset conversions are problematic because a) it can be lossy
(so what to do when it is?) and b) automatic type conversions can be
surprising.
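
Today's explicit conversions simply error out when a character has no
equivalent, which is exactly the question an automatic conversion would
have to answer somehow:

  SELECT convert_to('漢字', 'LATIN1');
  -- ERROR: character ... has no equivalent in encoding "LATIN1"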

Ultimately the client would have to do its own codeset conversions, if
it wants them, or treat text in codesets other than its local one as
blobs and leave it for a higher app layer to deal with.

I wouldn't want to propose automatic codeset conversions.  If you'd want
that then you might as well declare it has to all be UTF-8 and say no to
any other codesets.

> But, even if we were all convinced that this kind of feature was good
> to add, I think it would almost certainly be wrong to invent new
> varlena features along the way.

Yes.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From
Nico Williams
Date:
On Wed, Oct 04, 2023 at 01:16:22PM -0400, Robert Haas wrote:
> There's a very popular commercial database where, or so I have been
> led to believe, any byte sequence at all is accepted when you try to
> put values into the database. [...]

In other circles we call this "just-use-8".

ZFS, for example, has an option to require that filenames be valid
UTF-8; if that option is off it will accept any garbage (other than
ASCII NUL and /, for obvious reasons).

For filesystems the situation is a bit dire because:

 - strings at the system call boundary have never been tagged with a
   codeset (in the beginning there was only ASCII)
 - there has never been a standard codeset to use at the system call
   boundary, 
 - there have been multiple codesets in use for decades

so filesystems have to be prepared to be tolerant of garbage, at least
until only Unicode is left (UTF-16 on Windows filesystems, UTF-8 for
most others).

This is another reason that ZFS has form-insensitive/form-preserving
behavior: if you want to use non-UTF-8 filenames then names or
substrings thereof that look like valid UTF-8 won't accidentally be
broken by normalization.

If PG never tagged strings with codesets on the wire then PG has the
same problem, especially since there's multiple implementations of the
PG wire protocol.

So I can see why a "popular database" might want to take this approach.

For the longer run though, either move to supporting only UTF-8, or
allow multiple text types each with a codeset specified in its type.

> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned, or aren't yet known to the collation
> library, that's another way for data loading to fail. Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yes.  See points about ZFS.  I do think ZFS struck a good balance.

PG could take the ZFS approach and add functions for use in CHECK
constraints that enforce valid UTF-8, valid Unicode (no use of
unassigned codepoints, no use of private use codepoints not configured
into the database), etc.

Coming back to the "just-use-8" thing, a database could have a text type
where the codeset is not specified, one or more text types where the
codeset is specified, manual or automatic codeset conversions, and
whatever enforcement functions make sense.  Provided that the type
information is not lost at the edges.

> > Whether we ever get to a core data type -- and more importantly,
> > whether anyone uses it -- I'm not sure.
> 
> Same here.

A TEXTutf8 type (whatever name you want to give it) could be useful as a
way to a) opt into heavier enforcement w/o having to write CHECK
constraints, b) documentation of intent, all provided that the type is
not lost on the wire nor in memory.

Support for other codesets is less important.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From
Nico Williams
Date:
On Tue, Oct 17, 2023 at 05:07:40PM +0200, Daniel Verite wrote:
> > * Add a per-database option to enforce only storing assigned unicode
> > code points.
> 
> There's a problem in the fact that the set of assigned code points is
> expanding with every Unicode release, which happens about every year.
> 
> If we had this option in Postgres 11 released in 2018 it would use
> Unicode 11, and in 2023 this feature would reject thousands of code
> points that have been assigned since then.

Yes, and that's desirable if PG were to normalize text as Jeff proposes,
since then PG wouldn't know how to normalize text containing codepoints
assigned after that.  At that point to use those codepoints you'd have
to upgrade PG -- not too unreasonable.

Nico
-- 



Re: Pre-proposal: unicode normalized text

From
Nico Williams
Date:
On Wed, Oct 04, 2023 at 01:15:03PM -0700, Jeff Davis wrote:
> > The fact that there are multiple types of normalization and multiple
> > notions of equality doesn't make this easier.

And then there's text that isn't normalized to any of them.

> NFC is really the only one that makes sense.

Yes.

Most input modes produce NFC, though there may be scripts (like Hangul)
where input modes might produce NFD, so I wouldn't say NFC is universal.

Unfortunately HFS+ uses NFD so NFD can leak into places naturally enough
through OS X.

> I believe that having a kind of text data type where it's stored in NFC
> and compared with memcmp() would be a good place for many users to be -
> - probably most users. It's got all the performance and stability
> benefits of memcmp(), with slightly richer semantics. It's less likely
> that someone malicious can confuse the database by using different
> representations of the same character.
> 
> The problem is that it's not universally better for everyone: there are
> certainly users who would prefer that the codepoints they send to the
> database are preserved exactly, and also users who would like to be
> able to use unassigned code points.

The alternative is form-insensitivity, where you compare strings as
equal even if they aren't memcmp()-equal, as long as they are equal when
normalized.  This can be made fast, though not as fast as memcmp().

The problem with form insensitivity is that you might have to implement
it in numerous places.  In ZFS there's only a few, but in a database
every index type, for example, will need to hook in form insensitivity.
If so then that complexity would be a good argument to just normalize.
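
One middle ground that works today, without touching the index AMs, is
to index and compare the NFC form explicitly (a sketch; it assumes
normalize() is usable in an index expression):

  CREATE INDEX ON t ((normalize(col, NFC)));
  SELECT * FROM t WHERE normalize(col, NFC) = normalize('some input', NFC);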

Nico
-- 



Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote:
> bowerbird and hammerkop didn't like commit a02b37fc.  They're still
> using the old 3rd build system that is not tested by CI.  It's due
> for
> removal in the 17 cycle IIUC but in the meantime I guess the new
> codegen script needs to be invoked by something under src/tools/msvc?
>
>   varlena.obj : error LNK2019: unresolved external symbol
> unicode_category referenced in function unicode_assigned
> [H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj]

I think I just need to add unicode_category.c to @pgcommonallfiles in
Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it
unless someone has a better suggestion.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
David Rowley
Date:
On Fri, 3 Nov 2023 at 20:49, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote:
> > bowerbird and hammerkop didn't like commit a02b37fc.  They're still
> > using the old 3rd build system that is not tested by CI.  It's due
> > for
> > removal in the 17 cycle IIUC but in the meantime I guess the new
> > codegen script needs to be invoked by something under src/tools/msvc?
> >
> >   varlena.obj : error LNK2019: unresolved external symbol
> > unicode_category referenced in function unicode_assigned
> > [H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj]
>
> I think I just need to add unicode_category.c to @pgcommonallfiles in
> Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it
> unless someone has a better suggestion.

(I didn't realise this was being discussed.)

Thomas mentioned this to me earlier today. After looking I also
concluded that unicode_category.c needed to be added to
@pgcommonallfiles. After looking at the time, I didn't expect you to
be around so opted just to push that to fix the MSVC buildfarm
members.

Sorry for the duplicate effort and/or stepping on your toes.

David



Re: Pre-proposal: unicode normalized text

From
John Naylor
Date:
On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> I plan to commit something like v3 early next week unless someone else
> has additional comments or I missed a concern.

Hi Jeff, is the CF entry titled "Unicode character general category
functions" ready to be marked committed?



Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Fri, 2023-11-03 at 21:01 +1300, David Rowley wrote:
> Thomas mentioned this to me earlier today. After looking I also
> concluded that unicode_category.c needed to be added to
> @pgcommonallfiles. After looking at the time, I didn't expect you to
> be around so opted just to push that to fix the MSVC buildfarm
> members.
>
> Sorry for the duplicate effort and/or stepping on your toes.

Thank you, no apology necessary.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Fri, 2023-11-03 at 17:11 +0700, John Naylor wrote:
> On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis <pgsql@j-davis.com> wrote:
> >
> > I plan to commit something like v3 early next week unless someone
> > else
> > has additional comments or I missed a concern.
>
> Hi Jeff, is the CF entry titled "Unicode character general category
> functions" ready to be marked committed?

Done, thank you.

Regards,
    Jeff Davis




Re: Pre-proposal: unicode normalized text

From
Phil Krylov
Date:
On 2023-10-04 23:32, Chapman Flack wrote:
> Well, for what reason does anybody run PG now with the encoding set
> to anything besides UTF-8? I don't really have my finger on that pulse.
> Could it be that it bloats common strings in their local script, and
> with enough of those to store, it could matter to use the local
> encoding that stores them more economically?

I do use CP1251 for storing some data which is coming in as XMLs in 
CP1251, and thus definitely fits. In UTF-8, that data would take exactly 
2x the size on disk (before compression, and pglz/lz4 won't help much 
with that).
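
For illustration, with the built-in conversion functions (for Cyrillic
letters the ratio is exactly 2x):

  SELECT octet_length(convert_to('Привет', 'WIN1251'));  -- 6
  SELECT octet_length(convert_to('Привет', 'UTF8'));     -- 12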

-- Ph.



Re: Pre-proposal: unicode normalized text

From
Thomas Munro
Date:
On Fri, Nov 3, 2023 at 9:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 3 Nov 2023 at 20:49, Jeff Davis <pgsql@j-davis.com> wrote:
> > On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote:
> > > bowerbird and hammerkop didn't like commit a02b37fc.  They're still
> > > using the old 3rd build system that is not tested by CI.  It's due
> > > for
> > > removal in the 17 cycle IIUC but in the meantime I guess the new
> > > codegen script needs to be invoked by something under src/tools/msvc?
> > >
> > >   varlena.obj : error LNK2019: unresolved external symbol
> > > unicode_category referenced in function unicode_assigned
> > > [H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj]
> >
> > I think I just need to add unicode_category.c to @pgcommonallfiles in
> > Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it
> > unless someone has a better suggestion.
>
> (I didn't realise this was being discussed.)
>
> Thomas mentioned this to me earlier today. After looking I also
> concluded that unicode_category.c needed to be added to
> @pgcommonallfiles. After looking at the time, I didn't expect you to
> be around so opted just to push that to fix the MSVC buildfarm
> members.

Shouldn't it be added unconditionally near unicode_norm.c?  It looks
like it was accidentally made conditional on openssl, which might
explain why it worked for David but not for bowerbird.



Re: Pre-proposal: unicode normalized text

From
David Rowley
Date:
On Sat, 4 Nov 2023 at 10:57, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Nov 3, 2023 at 9:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > On Fri, 3 Nov 2023 at 20:49, Jeff Davis <pgsql@j-davis.com> wrote:
> > > I think I just need to add unicode_category.c to @pgcommonallfiles in
> > > Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it
> > > unless someone has a better suggestion.
> >
> > Thomas mentioned this to me earlier today. After looking I also
> > concluded that unicode_category.c needed to be added to
> > @pgcommonallfiles. After looking at the time, I didn't expect you to
> > be around so opted just to push that to fix the MSVC buildfarm
> > members.
>
> Shouldn't it be added unconditionally near unicode_norm.c?  It looks
> like it was accidentally made conditional on openssl, which might
> explain why it worked for David but not for bowerbird.

Well, I did that one pretty poorly :-(

I've just pushed a fix for that.  Thanks.

David



Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.

Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though assigned code points can change, too, so limiting
yourself to assigned code points is only a mitigation.)

I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.
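
In other words, something like this (syntax as in the attached patch;
the table and column names are placeholders):

  ALTER DATABASE appdb STRICT_UNICODE FALSE;

  ALTER TABLE users ADD CONSTRAINT users_name_assigned
    CHECK (unicode_assigned(name));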

There's some weirdness that the set of assigned code points as Postgres
sees it may not match what a collation provider sees due to differing
Unicode versions. That's not great -- perhaps we could check that code
points are considered assigned by *both* Postgres and ICU. I don't know
if there's a way to tell if libc considers a code point to be assigned.

Regards,
    Jeff Davis

[1]
https://www.unicode.org/policies/stability_policy.html#Normalization


Attachment

Re: Pre-proposal: unicode normalized text

From
Jeff Davis
Date:
On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote:
> Attached is an implementation of a per-database option STRICT_UNICODE
> which enforces the use of assigned code points only.

The CF app doesn't seem to point at the latest patch:

https://www.postgresql.org/message-id/a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com

which is perhaps why nobody has looked at it yet.

But in any case, I'm OK if this gets bumped to 18. I still think it's a
good feature, but some of the value will come later in v18 anyway, when
I plan to propose support for case folding. Case folding is a version
of lowercasing with compatibility guarantees when you only use assigned
code points.

Regards,
    Jeff Davis