Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: Pre-proposal: unicode normalized text
Date
Msg-id c3efcb28-5285-d668-a835-84d5ffb73721@eisentraut.org
In response to Re: Pre-proposal: unicode normalized text  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Pre-proposal: unicode normalized text
List pgsql-hackers
On 03.10.23 21:54, Jeff Davis wrote:
>> Here, Jeff mentions normalization, but I think it's a major issue
>> with collation support. If new code points are added, users can put
>> them into the database before they are known to the collation
>> library, and then when they become known to the collation library
>> the sort order changes and indexes break.
> 
> The collation version number may reflect the change in understanding
> about assigned code points that may affect collation -- though I'd like
> to understand whether this is guaranteed or not.

This is correct.  The collation version number produced by ICU contains 
the UCA version, which is effectively the Unicode version (14.0, 15.0, 
etc.).  Since new code point assignments can only come from new Unicode 
versions, a new assigned code point will always result in a different 
collation version.

For example, with ICU 70 / CLDR 40 / Unicode 14:

select collversion from pg_collation where collname = 'unicode';
= 153.112

With ICU 72 / CLDR 42 / Unicode 15, the same query returns:
= 153.120
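
As a rough sketch of how such a change could be spotted on a running
system, the version recorded in the catalog can be compared with the
one the library currently reports. This assumes ICU-provided
collations (collprovider = 'i') and uses pg_collation_actual_version();
the column aliases are only illustrative:

```sql
-- List ICU collations whose recorded version differs from the
-- version reported by the ICU library currently loaded.
SELECT collname,
       collversion                      AS recorded_version,
       pg_collation_actual_version(oid) AS actual_version
FROM pg_collation
WHERE collprovider = 'i'
  AND collversion IS DISTINCT FROM pg_collation_actual_version(oid);
```

An empty result would suggest that the recorded and actual versions
agree for every ICU collation in the current database.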

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

If you want to be rigid about it, you also need to consider whether the 
Unicode version used by the ICU library in use matches the one used by 
the in-core tables.
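
A minimal sketch of such a check, assuming a server that provides
unicode_version() and icu_unicode_version() (to my knowledge these
exist in PostgreSQL 17 and later, not in the releases current at the
time of this message); the column aliases are illustrative:

```sql
-- Compare the Unicode version of the in-core tables with the one
-- used by the linked ICU library (NULL if built without ICU).
SELECT unicode_version()     AS core_unicode,
       icu_unicode_version() AS icu_unicode,
       unicode_version() IS NOT DISTINCT FROM icu_unicode_version()
                             AS versions_match;
```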
