Re: Patch for collation using ICU - Mailing list pgsql-hackers

From Palle Girgensohn
Subject Re: Patch for collation using ICU
Date
Msg-id 966F3D72826AD115FCE40984@palle.girgensohn.se
Whole thread Raw
In response to Re: Patch for collation using ICU  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Patch for collation using ICU  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers

--On fredag, maj 06, 2005 22.57.59 -0400 Bruce Momjian 
<pgman@candle.pha.pa.us> wrote:

>
> Is this patch ready for application?
>
>     http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.d
> iff.gz
>
> The web site is:
>
>     http://people.freebsd.org/~girgen/postgresql-icu/readme.html

I don't think so, not quite. I have not had any positive reports from linux 
users, this is only tested in a FreeBSD environment. I'd say it needs some 
more testing.

Also, apparently, ICU is installed by default in many linux distributions, 
and usually it is version 2.8. Some linux users have asked me if there are 
plans for a patch that works with ICU 2.8. That's probably a good idea. IBM 
and the ICU folks seem to consider 3.2 to be the stable version, older 
versions are hard to find on their sites, but most linux distributers seem 
to consider it too bleeding edge, even gentoo. I don't know why they don't 
agree.

> I do have a few questions:
>
> Why don't you use the lc_ctype_is_c() part of this test?
>
>      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Um, well, I didn't think about that. :)  What would be the locale in this 
case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide 
encoding, indeed. Then the strings will be handled like byte-wide chars. 
Yeah, it's a bug. I'll fix it! Thanks.

> Why is so much code added, for example, in lower()?  The existing
> multibyte code is much smaller, and lots of code is added in other
> places too.

ICU uses UTF-16 internally, so all strings must be converted from the 
database encoding to UTF-16. Since that means the strings need to be 
copied, I took the same approach as in varlena.c:varstr_cmp(), where small 
strings use the heap and only larger strings use a palloc. Comments in 
varstr_cmp about performance made me use that approach.

Also, in the latest patch, I also added checks and logging for *every* 
status returned from ICU. I hope this will help debugging on debian, where 
previous version didn't work. That excessive status checking is hardly be 
necessary once the stuff is better tested.

I think the string copying and heap/palloc choices stands for most of the 
code bloat, together with the excessive status checking and logging.


> Why do you need to add a mapping of encoding names from iana to our
> names?

This was already answered by John Hansen... There's an old thread here 
about the choice of the name "UNICODE" to describe an encoding, which it 
doesn't. There's half a dozen unicode based encodings... UTF-8 is used by 
postgresql, that would have been a better name... Similarly for most other 
encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL 
can't provide it, so a mapping table is required.

I use this patch in production on one FreeBSD 4.10 server at the moment. 
With the latest version, I've had no problems. Logging is swithed on for 
now, and it shows no signs of ICU complaining. I'd like more reports on 
Linux, though.

/Palle

>
> -------------------------------------------------------------------------
> --
>
> Palle Girgensohn wrote:
>> Hi!
>>
>> I've put together a patch for using IBM's ICU package for collation.
>>
>> If your OS does not have full support for collation ur
>> uppercase/lowercase  in multibyte locales, this might be useful. If you
>> are using a multibyte  character encoding in your database and want
>> collation, i.e. order by, and  also lower(), upper() and initcap() to
>> work properly, this patch will do  just that.
>>
>> This patch is needed for FreeBSD, since this OS has no support for
>> collation of for example unicode locales (that is, wcscoll(3) does not
>> do  what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK
>> the  patch is *not* necessary for Linux, although IBM claims ICU
>> collation to be  about twice as fast as glibc for simple western locales.
>>
>> It adds a configure switch, `--with-icu', which will set up the code to
>> use  ICU instead of wchar_t and wcscoll.
>>
>> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
>> seems to run well. I've not had the time to do any comparative
>> performance  tests yet, but it seems it is at least not slower than
>> using LATIN1 with  sv_SE.ISO8859-1 locale, perhaps even faster.
>>
>> I'd be delighted if some more experienced postgresql hackers would
>> review  this stuff. The patch is pretty compact, so it's fast reading :)
>> I'm  planning to add this patch as an option (tagged "experimental") to
>> FreeBSD's postgresql port. Any ideas about whether this is a good idea
>> or  not?
>>
>> Any thoughts or ideas are welcome!
>>
>> Cheers,
>> Palle
>>
>> Patch at:
>> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.
>> diff>
>>
>> ICU at sourceforge: <http://icu.sf.net/>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 7: don't forget to increase your free space map settings
>>
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania
> 19073






pgsql-hackers by date:

Previous
From: Palle Girgensohn
Date:
Subject: Re: Patch for collation using ICU
Next
From: "John Hansen"
Date:
Subject: Re: Patch for collation using ICU