Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Jeff Davis
Msg-id 3941663a8e2f185d6acbbbc4f172c41dd3cfb6fe.camel@j-davis.com
In response to Re: Pre-proposal: unicode normalized text (Robert Haas <robertmhaas@gmail.com>)
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Yeah, because we lose forward-compatibility of some useful operations.

> Here, Jeff mentions normalization, but I think it's a major issue
> with
> collation support. If new code points are added, users can put them
> into the database before they are known to the collation library, and
> then when they become known to the collation library the sort order
> changes and indexes break.

The collation version number may reflect changes in which code points
are assigned, insofar as those changes affect collation -- though I'd
like to understand whether that is guaranteed or not.

Regardless, given that (a) we don't have a good story for migrating to
new collation versions, and (b) it would be painful to rebuild indexes
even if we did, you are right that it's a problem.

>  Would we endorse a proposal to make
> pg_catalog.text with encoding UTF-8 reject code points that aren't
> yet
> known to the collation library? To do so would be tighten things up
> considerably from where they stand today, and the way things stand
> today is already rigid enough to cause problems for some users.

What problems exist today due to the rigidity of text?

I assume you mean because we reject invalid byte sequences? Yeah, I'm
sure that causes a problem for some (especially migrations), but it's
difficult for me to imagine a database working well with no rules at
all for the basic data types.
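As an analogy for the rule being discussed (Python here, not the server
code path): 0xFF can never appear in well-formed UTF-8, so a strict
decoder rejects it rather than storing garbage -- which is the behavior
migrations sometimes trip over.

```python
# Sketch: strict UTF-8 validation rejects byte sequences that cannot
# be decoded, analogous to text input validation for a UTF-8 database.
try:
    b"\xff".decode("utf-8")   # 0xFF is never a valid UTF-8 byte
    print("accepted")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```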

> Now, there is still the question of whether such a data type would
> properly belong in core or even contrib rather than being an
> out-of-core project. It's not obvious to me that such a data type
> would get enough traction that we'd want it to be part of PostgreSQL
> itself.

At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.

I also think we should expose some SQL functions to check for
unassigned code points. That sounds useful, especially since we already
expose normalization functions.

One could easily imagine a domain with CHECK(NOT
contains_unassigned(a)). Or an extension with a data type that uses the
internal functions.
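For illustration, a check like that is easy to sketch outside the
server: Python's unicodedata module is pinned to one Unicode version
(the same "tables generated from a specific version" situation), and
unassigned code points carry general category 'Cn'. The function name
contains_unassigned below is the hypothetical one from above, not an
existing API.

```python
import unicodedata

def contains_unassigned(s: str) -> bool:
    """Return True if any code point in s is unassigned ('Cn') in the
    Unicode version bundled with this Python build. Note that 'Cn'
    also covers noncharacters such as U+FFFE."""
    return any(unicodedata.category(ch) == "Cn" for ch in s)

print(contains_unassigned("hello"))    # False
print(contains_unassigned("\u0378"))   # True: U+0378 is unassigned
```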

Whether we ever get to a core data type -- and more importantly,
whether anyone uses it -- I'm not sure.

>  But at the same time I can certainly understand why Jeff finds
> the status quo problematic.

Yeah, I am looking for a better compromise between:

  * everything is memcmp() and 'á' sometimes doesn't equal 'á'
(depending on code point sequence)
  * everything is constantly changing, indexes break, and text
comparisons are slow

A stable idea of unicode normalization based on using only assigned
code points is very tempting.
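The first bullet is easy to demonstrate; a minimal sketch using
Python's unicodedata (the same normalization the SQL-level functions
expose):

```python
import unicodedata

precomposed = "\u00e1"     # 'á' as one code point (U+00E1)
combining = "a\u0301"      # 'a' + combining acute accent (U+0301)

# memcmp()-style comparison: different code point sequences differ.
print(precomposed == combining)   # False

# After NFC normalization the two representations compare equal.
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", combining))   # True
```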

Regards,
    Jeff Davis



