Re: Patch for collation using ICU - Mailing list pgsql-hackers

From Palle Girgensohn
Subject Re: Patch for collation using ICU
Date
Msg-id 9CC5F10245B7612103CC03F6@palle.girgensohn.se
Whole thread Raw
In response to Re: Patch for collation using ICU  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Patch for collation using ICU  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers

--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

> Palle Girgensohn wrote:
>> >
>> > Is this patch ready for application?
>>
>> I don't think so, not quite. I have not had any positive reports from
>> linux  users, this is only tested in a FreeBSD environment. I'd say it
>> needs some  more testing.
>
> OK.

John Hansen just reported that it does work on linux. fine!


>> Also, apparently, ICU is installed by default in many linux
>> distributions,  and usually it is version 2.8. Some linux users have
>> asked me if there are  plans for a patch that works with ICU 2.8. That's
>> probably a good idea. IBM  and the ICU folks seem to consider 3.2 to be
>> the stable version, older  versions are hard to find on their sites, but
>> most linux distributers seem  to consider it too bleeding edge, even
>> gentoo. I don't know why they don't  agree.
>
> Good point.  Why would linux folks need ICU?  Doesn't their OS support
> encodings natively?  I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version.  Do Linux users need
> to use ICU for any reason?


There are corner cases where it is impossible to upper/lowercase one
character at the time. for example:

-- without ICUselect upper('Eßer');upper
-------EßER
(1 row)

-- with ICU
select upper('Eßer');upper
-------ESSER
(1 rad)

This is because in the standard postgres implementation, upper/lower is
done one character at the time. A proper upper/lower cannot do it that way.
Other known example is in Turkish, where an Ì (?) should look different
whether it is an initial letter or not. This fails in standard postgresql
for all platforms.

>> > I do have a few questions:
>> >
>> > Why don't you use the lc_ctype_is_c() part of this test?
>> >
>> >      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>>
>> Um, well, I didn't think about that. :)  What would be the locale in
>> this  case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a
>> wide  encoding, indeed. Then the strings will be handled like byte-wide
>> chars.  Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optmization, and it fixes a problem
> with some OSs that have processing problems with UTF8 when the locale is
> supposed to be turned off, like in "C".  I realize ICU might be fine
> with it but the optimization still is an issue.

Well, the results are quite different, depending on whether ICU is used or
not. See separate mail.


>> > Why is so much code added, for example, in lower()?  The existing
>> > multibyte code is much smaller, and lots of code is added in other
>> > places too.
>>
>> ICU uses UTF-16 internally, so all strings must be converted from the
>> database encoding to UTF-16. Since that means the strings need to be
>> copied, I took the same approach as in varlena.c:varstr_cmp(), where
>> small  strings use the heap and only larger strings use a palloc.
>> Comments in  varstr_cmp about performance made me use that approach.
>
> Oh, interesting.   I think you need to create new functions that
> factor out that common code so the patch is smaller and easier to
> maintain.

Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me...


>> Also, in the latest patch, I also added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on debian,
>> where  previous version didn't work. That excessive status checking is
>> hardly be  necessary once the stuff is better tested.
>>
>> I think the string copying and heap/palloc choices stands for most of
>> the  code bloat, together with the excessive status checking and logging.
>
> OK, move that into some common functions and I think it will be better.

Best way for upper/lower/initcap is probably to use a function pointer...
uhh...


>> > Why do you need to add a mapping of encoding names from iana to our
>> > names?
>>
>> This was already answered by John Hansen... There's an old thread here
>> about the choice of the name "UNICODE" to describe an encoding, which it
>> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
>> by  postgresql, that would have been a better name... Similarly for most
>> other  encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
>> PostgreSQL  can't provide it, so a mapping table is required.
>
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
> help?

I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence only
utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
fogiving, I don't remember/know, but I think we need the mappings,
unfortunately.


>> I use this patch in production on one FreeBSD 4.10 server at the moment.
>> With the latest version, I've had no problems. Logging is swithed on for
>> now, and it shows no signs of ICU complaining. I'd like more reports on
>> Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should have
> feature freeze on July 1.

That shouldn't be a problem.

/Palle


>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania
> 19073






pgsql-hackers by date:

Previous
From: "John Hansen"
Date:
Subject: Re: Patch for collation using ICU
Next
From: "John Hansen"
Date:
Subject: Re: Patch for collation using ICU