Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers

From Joe Conway
Subject Re: Update Unicode data to Unicode 16.0.0
Date
Msg-id f0bd0304-97b8-4a55-bf16-d1a7feb948e3@joeconway.com
List pgsql-hackers
On 11/11/24 01:27, Peter Eisentraut wrote:
> Here is the patch to update the Unicode data to version 16.0.0.
> 
> Normally, this would have been routine, but a few months ago there was
> some debate about how this should be handled. [0]  AFAICT, the consensus
> was to go ahead with it, but I just wanted to notify it here to be clear.
> 
> [0]:
> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

I ran a check and found that this patch changes the upper casing of
some characters. Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
  LOCALE_PROVIDER builtin
  BUILTIN_LOCALE 'C.UTF-8'
  TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------


8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------


8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So the lower() output is unchanged, but both upper() and initcap()
produce different results with the patch applied, unless I am missing
something.
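
A possible way to narrow this down to the individual strings that
changed (an untested sketch; the output file names are made up, and
\copy has to stay on a single line in psql):

8<-------------
-- run once on master, then again on master+patch with a different
-- file name, e.g. 'case-patched.csv'
\copy (SELECT strings, upper(strings), initcap(strings) FROM unsorted_table ORDER BY 1) TO 'case-master.csv' (FORMAT csv)
8<-------------

Running diff on the two files should then list only the rows whose
upper() or initcap() output differs between the two builds. The
ORDER BY uses the database's C.UTF-8 code point ordering, so the row
order should be the same in both files.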

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


