Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers
From | Palle Girgensohn |
---|---|
Subject | Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) |
Date | |
Msg-id | A4DB6CD4-F4CC-4C48-A9DC-DCBDCBD51186@pingpong.net |
In response to | Re: Implementing full UTF-8 support (aka supporting 0x00) (Bruce Momjian <bruce@momjian.us>) |
Responses | Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) (Peter Geoghegan <pg@heroku.com>) |
List | pgsql-hackers |
> On 4 Aug 2016, at 02:40, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September :-)
>
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.

https://github.com/girgen/postgres/

in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for encodings other than UTF-8, mostly since I don't know how to test them, but also because I see little point in supporting anything else using ICU.

I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from the regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8
$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out 2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out 2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas if one is more correct than the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?

As Thomas points out, using ucol_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle

[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html
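The lower(both) versus upper(both) question can be made concrete with a small sketch. It is only an illustration, not how PostgreSQL or ICU implement `~*`: substring containment stands in for the regex match, and the helper names (`tr_lower`, `tr_upper`, `contains_ci`) and the two mapping tables are assumptions made up for this example. The tables cover only the Turkish dotted and dotless i; Python's built-in `str.lower` and `str.upper` stand in for locale-independent case mapping.

```python
# Hypothetical Turkish case tables, limited to the dotted/dotless i pairs.
TR_LOWER = {"I": "ı", "İ": "i"}
TR_UPPER = {"i": "İ", "ı": "I"}


def tr_lower(s):
    """Lowercase with Turkish rules for I/İ, default rules otherwise."""
    return "".join(TR_LOWER.get(ch, ch.lower()) for ch in s)


def tr_upper(s):
    """Uppercase with Turkish rules for i/ı, default rules otherwise."""
    return "".join(TR_UPPER.get(ch, ch.upper()) for ch in s)


def contains_ci(text, pattern, fold):
    """Case-insensitive containment: fold both sides, then compare."""
    return fold(pattern) in fold(text)


if __name__ == "__main__":
    folds = [
        ("turkish lower", tr_lower),
        ("turkish upper", tr_upper),
        ("default lower", str.lower),
        ("default upper", str.upper),
    ]
    # The two pairs that differ in the regression diff.
    for text, pattern in [("Türkiye", "KI"), ("bıt", "BIT")]:
        for name, fold in folds:
            print(text, pattern, name, contains_ci(text, pattern, fold))
```

With these tables, the Turkish rules give the same answer in both fold directions (no match for 'Türkiye'/'KI', a match for 'bıt'/'BIT'), which is what the expected output for "tr_TR" contains. The default rules match 'Türkiye'/'KI' either way, but for 'bıt'/'BIT' the answer does depend on direction: lowering both sides fails to match, while uppercasing both sides matches, because 'ı' uppercases to 'I' but 'I' lowercases to 'i'.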