Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers

From Palle Girgensohn
Subject Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date
Msg-id A4DB6CD4-F4CC-4C48-A9DC-DCBDCBD51186@pingpong.net
In response to Re: Implementing full UTF-8 support (aka supporting 0x00)  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
> On 4 Aug 2016, at 02:40, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
>
>     https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.


https://github.com/girgen/postgres/


in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.
In this latest version, I've removed support for encodings other than UTF-8, mostly since I don't know how to test
them, but also because I see little point in supporting anything else when using ICU.



I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from regression tests, when
running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out    2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out    2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false 
  -------
!  f
  (1 row)
  
  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false 
  -------
!  t
  (1 row)
  
  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true 
  ------
!  t
  (1 row)
  
  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true 
  ------
!  f
  (1 row)
  
  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas whether one is more correct than the
other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or
upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?
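
To make the lower(both) versus upper(both) question concrete, here is a minimal Python sketch. It is not how PostgreSQL or ICU implements ~*. The helpers tr_lower, tr_upper and icontains are hypothetical names introduced only for illustration, and the Turkish rules are reduced to the dotted/dotless i pairs (I/ı and İ/i). Python's own str.lower and str.upper stand in for locale-independent case mapping.

```python
# Illustration only: hand-written Turkish case rules versus Python's
# locale-independent str.lower / str.upper. All helper names are hypothetical.

def tr_lower(s: str) -> str:
    """Lowercase using the Turkish pairs: I -> ı (dotless), İ -> i."""
    return s.replace("I", "ı").replace("İ", "i").lower()

def tr_upper(s: str) -> str:
    """Uppercase using the Turkish pairs: i -> İ (dotted), ı -> I."""
    return s.replace("i", "İ").replace("ı", "I").upper()

def icontains(haystack: str, needle: str, fold) -> bool:
    """Case-insensitive substring test: fold both sides, then compare."""
    return fold(needle) in fold(haystack)

if __name__ == "__main__":
    cases = [("Türkiye", "KI"), ("bıt", "BIT")]
    folds = [
        ("lower(both), locale-independent", str.lower),
        ("upper(both), locale-independent", str.upper),
        ("lower(both), Turkish rules", tr_lower),
        ("upper(both), Turkish rules", tr_upper),
    ]
    for text, pattern in cases:
        for label, fold in folds:
            print(f"{text!r} ~* {pattern!r} via {label}: "
                  f"{icontains(text, pattern, fold)}")
```

In this sketch the two Turkish folds agree with each other and with the expected test output, while the locale-independent folds disagree on 'bıt' versus 'BIT': lowercasing keeps ı and i distinct, whereas uppercasing maps both to I. Whether that is what actually happens in the ICU build would still need to be checked.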


As Thomas points out, using ucol_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle



[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html

