Re: Patch for collation using ICU - Mailing list pgsql-hackers

From Palle Girgensohn
Subject Re: Patch for collation using ICU
Msg-id 0B537F6953FA3B724B5761A7@palle.girgensohn.se
In response to Re: Patch for collation using ICU  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Patch for collation using ICU
List pgsql-hackers

--On Saturday, May 07, 2005 09.52.59 -0400 Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

> Palle Girgensohn wrote:
>> >> Also, apparently, ICU is installed by default in many Linux
>> >> distributions, and usually it is version 2.8. Some Linux users have
>> >> asked me if there are plans for a patch that works with ICU 2.8.
>> >> That's probably a good idea. IBM and the ICU folks seem to consider
>> >> 3.2 to be the stable version, and older versions are hard to find on
>> >> their sites, but most Linux distributors seem to consider it too
>> >> bleeding edge, even Gentoo. I don't know why they don't agree.
>> >
>> > Good point.  Why would linux folks need ICU?  Doesn't their OS support
>> > encodings natively?  I am particularly excited about this for OSs that
>> > don't have such encodings, like UTF8 support for Win32.
>> >
>> > Because ICU will not be used unless enabled by configure, it seems we
>> > are fine with only supporting the newest version.  Do Linux users need
>> > to use ICU for any reason?
>>
>>
>> There are corner cases where it is impossible to upper/lowercase one
>> character at a time. For example:
>>
>> -- without ICU
>>  select upper('Eßer');
>>  upper
>> -------
>>  EßER
>> (1 row)
>>
>> -- with ICU
>> select upper('Eßer');
>>  upper
>> -------
>>  ESSER
>> (1 row)
>>
>> This is because in the standard postgres implementation, upper/lower is
>> done one character at a time. A proper upper/lower cannot do it that
>> way. Another known example is Turkish, where an ? (?) should look
>> different depending on whether it is an initial letter or not. This
>> fails in standard PostgreSQL on all platforms.
>
> Uh, where do you see that?  Our code has:
>
>         workspace = texttowcs(string);
>
>         for (i = 0; workspace[i] != 0; i++)
>             workspace[i] = towupper(workspace[i]);

As you see, the loop runs towupper on one character at a time. It cannot
consider whether the letter is the initial, as required in Turkish, and it
cannot really convert one character into two ('ß' -> 'SS').
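
To make the limitation concrete, here is a minimal sketch in plain C. It is
not PostgreSQL code and not the ICU patch: upper_per_char and upper_full are
hypothetical names, and upper_full handles only the single special case
'ß' -> "SS" rather than real locale rules. It only shows that an in-place
per-character loop can never change the string length, while a string-level
routine with a separate output buffer can.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Per-character uppercasing, shaped like the quoted loop: the result
 * always has exactly as many characters as the input. */
static void upper_per_char(wchar_t *s)
{
    for (size_t i = 0; s[i] != L'\0'; i++)
        s[i] = (wchar_t) towupper((wint_t) s[i]);
}

/* Hypothetical string-level uppercasing. Only U+00DF (sharp s) is
 * special-cased here; everything else still goes through towupper.
 * Returns the number of characters written (excluding the terminator),
 * or (size_t) -1 if dst is too small. */
static size_t upper_full(const wchar_t *src, wchar_t *dst, size_t dstcap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dstcap == 0)
        return (size_t) -1;

    for (size_t i = 0; src[i] != L'\0'; i++)
    {
        if (src[i] == 0x00DF)
        {
            /* one input character becomes two output characters */
            if (n + 2 >= dstcap)
                return (size_t) -1;
            dst[n++] = L'S';
            dst[n++] = L'S';
        }
        else
        {
            if (n + 1 >= dstcap)
                return (size_t) -1;
            dst[n++] = (wchar_t) towupper((wint_t) src[i]);
        }
    }
    dst[n] = L'\0';
    return n;
}
```

With the four-character input 'E', U+00DF, 'e', 'r', upper_per_char leaves
U+00DF in place and keeps the length at four, whereas upper_full writes the
five characters "ESSER" into a sufficiently large buffer.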

>
>         result = wcstotext(workspace, i);
>
>
>> >> Also, in the latest patch, I also added checks and logging for *every*
>> >> status returned from ICU. I hope this will help debugging on Debian,
>> >> where the previous version didn't work. That excessive status checking
>> >> will hardly be necessary once the stuff is better tested.
>> >>
>> >> I think the string copying and heap/palloc choices account for most of
>> >> the code bloat, together with the excessive status checking and
>> >> logging.
>> >
>> > OK, move that into some common functions and I think it will be better.
>>
>> Best way for upper/lower/initcap is probably to use a function
>> pointer...  uhh...
>
> Uh, I don't think so.  Just send pointers to the function and let
> the function allocate the memory, and another function to free them, or
> something like that.  I can probably do it if you want.

I'll check it out, it seems simple enough.
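
One possible shape for such a pair of helpers is sketched below. This is an
assumption about what the suggestion could look like, not code from the
patch: casebuf_alloc and casebuf_free are hypothetical names, malloc/free
stand in for palloc/pfree so the example compiles on its own, and the
"factor" parameter (worst-case output characters per input character) is
invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: allocate an output buffer large enough to hold
 * `factor` output characters per input character plus a terminator.
 * The caller passes pointers; the helper fills in *out and *outcap.
 * Returns 0 on success, -1 on overflow or allocation failure. */
static int casebuf_alloc(const wchar_t *src, size_t factor,
                         wchar_t **out, size_t *outcap)
{
    size_t len;

    assert(src != NULL && out != NULL && outcap != NULL && factor > 0);

    len = wcslen(src);
    /* guard against size_t overflow in len * factor + 1 */
    if (len > (SIZE_MAX / sizeof(wchar_t) - 1) / factor)
        return -1;

    *outcap = len * factor + 1;
    *out = malloc(*outcap * sizeof(wchar_t));
    if (*out == NULL)
    {
        *outcap = 0;
        return -1;
    }
    (*out)[0] = L'\0';
    return 0;
}

/* Hypothetical counterpart: release the buffer and reset the pointer so
 * a second call is harmless. */
static void casebuf_free(wchar_t **buf)
{
    if (buf != NULL && *buf != NULL)
    {
        free(*buf);
        *buf = NULL;
    }
}
```

Each of upper/lower/initcap could then call the same allocate and free pair
instead of repeating the copying and error handling inline.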

>> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
>> > that help?
>>
>> I'm aware of that. It might help for unicode, but there are a bunch of
>> other encodings. IANA has decided that utf-8 has *no* aliases, hence
>> only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
>> forgiving, I don't remember/know, but I think we need the mappings,
>> unfortunately.
>
> OK.  I guess I am just confused why the native implementations are OK.

They're OK since they understand that UNICODE (or UTF8) is really utf-8.
Problem is the strings used to describe them are not understood by ICU.

BTW, the pg_enc2iananame_tbl is only used *from* internal representation
*to* IANA, not the other way around. Maybe that fact lowers the rate of
confusion? ;-)
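
For readers who have not seen the patch, a one-way lookup of this kind can
be sketched as follows. The table contents, the EncNameMap type and the
pg_to_iana function are hypothetical and are not the actual
pg_enc2iananame_tbl; only the direction (internal name in, IANA name out)
is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: internal encoding name and the IANA
 * charset name to hand to a library that expects IANA names. */
typedef struct
{
    const char *pg_name;
    const char *iana_name;
} EncNameMap;

/* Illustrative entries only; the real table may differ. */
static const EncNameMap enc_name_map[] = {
    {"UNICODE", "UTF-8"},
    {"UTF8", "UTF-8"},
    {"LATIN1", "ISO-8859-1"},
};

/* One-way lookup: internal name -> IANA name, or NULL if unknown.
 * There is deliberately no reverse function. */
static const char *pg_to_iana(const char *pg_name)
{
    size_t count = sizeof(enc_name_map) / sizeof(enc_name_map[0]);

    assert(pg_name != NULL);
    for (size_t i = 0; i < count; i++)
    {
        if (strcmp(enc_name_map[i].pg_name, pg_name) == 0)
            return enc_name_map[i].iana_name;
    }
    return NULL;
}
```

Because the lookup only goes in one direction, two internal names can map
to the same IANA name without any ambiguity.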

/Palle


