Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers
From: Jeff Davis
Subject: Re: Pre-proposal: unicode normalized text
Msg-id: 3941663a8e2f185d6acbbbc4f172c41dd3cfb6fe.camel@j-davis.com
In response to: Re: Pre-proposal: unicode normalized text (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Pre-proposal: unicode normalized text; Re: Pre-proposal: unicode normalized text
List: pgsql-hackers
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Yeah, because we lose forward-compatibility of some useful operations.

> Here, Jeff mentions normalization, but I think it's a major issue
> with collation support. If new code points are added, users can put
> them into the database before they are known to the collation
> library, and then when they become known to the collation library the
> sort order changes and indexes break.

The collation version number may reflect the change in understanding about assigned code points that may affect collation -- though I'd like to understand whether this is guaranteed or not.

Regardless, given that (a) we don't have a good story for migrating to new collation versions; and (b) it would be painful to rebuild indexes even if we did; then you are right that it's a problem.

> Would we endorse a proposal to make
> pg_catalog.text with encoding UTF-8 reject code points that aren't
> yet known to the collation library? To do so would be tighten things
> up considerably from where they stand today, and the way things stand
> today is already rigid enough to cause problems for some users.

What problems exist today due to the rigidity of text? I assume you mean because we reject invalid byte sequences? Yeah, I'm sure that causes a problem for some (especially migrations), but it's difficult for me to imagine a database working well with no rules at all for the basic data types.

> Now, there is still the question of whether such a data type would
> properly belong in core or even contrib rather than being an
> out-of-core project. It's not obvious to me that such a data type
> would get enough traction that we'd want it to be part of PostgreSQL
> itself.
At minimum I think we need to have some internal functions to check for unassigned code points. That belongs in core, because we generate the unicode tables from a specific version.

I also think we should expose some SQL functions to check for unassigned code points. That sounds useful, especially since we already expose normalization functions. One could easily imagine a domain with CHECK(NOT contains_unassigned(a)). Or an extension with a data type that uses the internal functions.

Whether we ever get to a core data type -- and more importantly, whether anyone uses it -- I'm not sure.

> But at the same time I can certainly understand why Jeff finds
> the status quo problematic.

Yeah, I am looking for a better compromise between:

* everything is memcmp() and 'á' sometimes doesn't equal 'á' (depending on code point sequence)
* everything is constantly changing, indexes break, and text comparisons are slow

A stable idea of unicode normalization based on using only assigned code points is very tempting.

Regards,
	Jeff Davis
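[Editor's illustration, not part of the original message: the two ideas above -- a check for unassigned code points (the hypothetical `contains_unassigned` name is taken from the email) and normalization-dependent equality of 'á' -- can be sketched in Python, whose `unicodedata` module is likewise generated from one specific Unicode version.]

```python
import unicodedata

def contains_unassigned(s: str) -> bool:
    """True if any code point in s is unassigned ('Cn', general category
    for unassigned) in the Unicode version bundled with this Python."""
    return any(unicodedata.category(ch) == "Cn" for ch in s)

# U+00E1 is assigned; U+E0080 is unassigned (just past the Tag block).
print(contains_unassigned("\u00e1"))       # False
print(contains_unassigned("\U000E0080"))   # True

# The memcmp() problem: composed vs. decomposed 'á' are distinct code
# point sequences, so byte-wise equality fails until NFC normalization.
composed = "\u00e1"       # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"    # 'a' + COMBINING ACUTE ACCENT
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Restricting a type to assigned code points makes NFC stable across Unicode upgrades, since normalization of already-assigned code points does not change between versions.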