Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers

From	Palle Girgensohn
Subject	Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date	August 10, 2016 20:42:07
Msg-id	A4DB6CD4-F4CC-4C48-A9DC-DCBDCBD51186@pingpong.net
In response to	Re: Implementing full UTF-8 support (aka supporting 0x00) (Bruce Momjian <bruce@momjian.us>)
Responses	Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
List	pgsql-hackers

> 4 aug. 2016 kl. 02:40 skrev Bruce Momjian <bruce@momjian.us>:
>
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
>
>     https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.

https://github.com/girgen/postgres/

in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.
In this latest version, I've removed support for encodings other than UTF-8, mostly because I don't know how to test
them, but also because I see little point in supporting anything else with ICU.

I have one question for someone with knowledge of Turkish (Devrim?). This is the diff from the regression tests when
running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out    2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out    2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false 
  -------
!  f
  (1 row)
  
  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false 
  -------
!  t
  (1 row)
  
  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true 
  ------
!  t
  (1 row)
  
  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true 
  ------
!  f
  (1 row)
  
  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas whether one is more correct than
the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or
upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?
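
To make the lower(both) versus upper(both) question concrete, here is a minimal sketch in Python. It is not what the patch or the regex engine actually does. It uses plain substring tests in place of ~*, Python's locale-independent str.lower()/str.upper(), and a hand-written tr_lower() helper that is purely hypothetical and only approximates the Turkish I/ı and İ/i mappings.

```python
# Sketch only: substring checks stand in for the ~* regex operator.

def match_lower(text, pattern):
    """Case-insensitive match by lowercasing both sides (locale-independent)."""
    return pattern.lower() in text.lower()

def match_upper(text, pattern):
    """Case-insensitive match by uppercasing both sides (locale-independent)."""
    return pattern.upper() in text.upper()

def tr_lower(s):
    """Hypothetical helper: Turkish-style lowercasing (I -> ı, İ -> i)."""
    return s.replace("I", "ı").replace("İ", "i").lower()

def match_tr_lower(text, pattern):
    """Case-insensitive match by Turkish-style lowercasing of both sides."""
    return tr_lower(pattern) in tr_lower(text)

if __name__ == "__main__":
    for text, pattern in [("Türkiye", "KI"), ("bıt", "BIT")]:
        print(text, pattern,
              "lower:", match_lower(text, pattern),
              "upper:", match_upper(text, pattern),
              "tr_lower:", match_tr_lower(text, pattern))
```

In this sketch the Turkish-style lowercasing reproduces the expected (Linux locale) answers for both queries, locale-independent lowercasing reproduces the answers in the ICU results, and locale-independent uppercasing matches in both cases. That is only suggestive; it says nothing definite about which code path either implementation takes.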

As Thomas points out, using ucoll_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle

[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html
