Re: Patch for collation using ICU - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Patch for collation using ICU
Date
Msg-id 200505071352.j47DqxK28575@candle.pha.pa.us
In response to Re: Patch for collation using ICU  (Palle Girgensohn <girgen@pingpong.net>)
Responses Re: Patch for collation using ICU
List pgsql-hackers
Palle Girgensohn wrote:
> >> Also, apparently, ICU is installed by default in many Linux
> >> distributions, and usually it is version 2.8. Some Linux users have
> >> asked me if there are plans for a patch that works with ICU 2.8. That's
> >> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
> >> the stable version, and older versions are hard to find on their sites,
> >> but most Linux distributors seem to consider it too bleeding edge, even
> >> Gentoo. I don't know why they don't agree.
> >
> > Good point.  Why would linux folks need ICU?  Doesn't their OS support
> > encodings natively?  I am particularly excited about this for OSs that
> > don't have such encodings, like UTF8 support for Win32.
> >
> > Because ICU will not be used unless enabled by configure, it seems we
> > are fine with only supporting the newest version.  Do Linux users need
> > to use ICU for any reason?
> 
> 
> There are corner cases where it is impossible to upper/lowercase one 
> character at a time. For example:
> 
> -- without ICU
> select upper('Eßer');
>  upper
> -------
>  EßER
> (1 row)
> 
> -- with ICU
> select upper('Eßer');
>  upper
> -------
>  ESSER
> (1 row)
> 
> This is because in the standard PostgreSQL implementation, upper/lower is 
> done one character at a time. A proper upper/lower cannot do it that way. 
> Another known example is Turkish, where an ? (?) should look different 
> depending on whether it is an initial letter or not. This fails in standard
> PostgreSQL on all platforms.

Uh, where do you see that?  Our code has:

        workspace = texttowcs(string);

        for (i = 0; workspace[i] != 0; i++)
            workspace[i] = towupper(workspace[i]);

        result = wcstotext(workspace, i);
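
The loop quoted above maps each wide character to exactly one wide character, so the result always has the same length as the input. A minimal sketch of that property, using plain wchar_t arrays instead of the backend's texttowcs()/wcstotext() (the helper name upper_in_place is hypothetical and not part of PostgreSQL):

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not PostgreSQL code: upper-case a wide string in
 * place, one wchar_t at a time, the same way the quoted loop does.
 * Returns the number of characters processed.
 */
static size_t
upper_in_place(wchar_t *ws)
{
    size_t      i;

    assert(ws != NULL);

    /* One-to-one mapping: slot i is overwritten by towupper() of slot i. */
    for (i = 0; ws[i] != L'\0'; i++)
        ws[i] = (wchar_t) towupper((wint_t) ws[i]);

    return i;
}
```

Because each slot is overwritten in place, a four-character input such as L"E\u00dfer" can only ever produce four characters. An expansion to the five-character "ESSER" needs a string-level API, which is what the ICU patch is reported to provide.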


> >> Also, in the latest patch, I added checks and logging for *every*
> >> status returned from ICU. I hope this will help debugging on Debian,
> >> where the previous version didn't work. That excessive status checking
> >> will hardly be necessary once the stuff is better tested.
> >>
> >> I think the string copying and heap/palloc choices account for most of
> >> the code bloat, together with the excessive status checking and logging.
> >
> > OK, move that into some common functions and I think it will be better.
> 
> Best way for upper/lower/initcap is probably to use a function pointer... 
> uhh...

Uh, I don't think so.  Just send pointers to the function and let
the function allocate the memory, and another function to free them, or
something like that.  I can probably do it if you want.
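
One way to read the suggestion above is a pair of helpers: the caller passes pointers, one function allocates the working buffer, and a second function frees it. This is only a sketch under assumptions: the names workbuf_alloc and workbuf_free are hypothetical, and it uses malloc()/free() so that it stands alone, whereas backend code would presumably use palloc()/pfree().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: allocate a working copy of src and hand it back
 * through the caller's pointers.  Returns 0 on success, -1 on failure.
 */
static int
workbuf_alloc(const char *src, char **buf, size_t *len)
{
    size_t      n;
    char       *p;

    assert(src != NULL && buf != NULL && len != NULL);

    n = strlen(src);
    p = malloc(n + 1);          /* assumption: malloc stands in for palloc */
    if (p == NULL)
        return -1;

    memcpy(p, src, n + 1);      /* copy including the terminating NUL */
    *buf = p;
    *len = n;
    return 0;
}

/* Hypothetical counterpart: release the buffer and clear the pointer. */
static void
workbuf_free(char **buf)
{
    assert(buf != NULL);
    free(*buf);
    *buf = NULL;
}
```

With such a pair, upper, lower and initcap could share the allocation and cleanup paths and differ only in the conversion step between the two calls.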

> >> > Why do you need to add a mapping of encoding names from iana to our
> >> > names?
> >>
> >> This was already answered by John Hansen... There's an old thread here
> >> about the choice of the name "UNICODE" to describe an encoding, which it
> >> doesn't. There are half a dozen Unicode-based encodings... UTF-8 is used
> >> by PostgreSQL, so that would have been a better name... Similarly for
> >> most other encodings, really. ICU expects a setlocale(3) string (i.e.
> >> IANA). PostgreSQL can't provide it, so a mapping table is required.
> >
> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
> > help?
> 
> I'm aware of that. It might help for Unicode, but there are a bunch of 
> other encodings. IANA has decided that utf-8 has *no* aliases, hence only 
> utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is 
> forgiving, I don't remember/know, but I think we need the mappings, 
> unfortunately.

OK.  I guess I am just confused why the native implementations are OK.
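
The mapping table discussed above could look roughly like the following. This is an illustrative sketch, not the table from the patch: the struct, the function name pg_to_iana, and the choice of entries are assumptions, and only a handful of well-known name pairs are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pair: PostgreSQL encoding name and its IANA charset name. */
struct enc_map
{
    const char *pg_name;
    const char *iana_name;
};

/* Illustrative subset only; a real table would cover every encoding. */
static const struct enc_map enc_table[] = {
    {"UNICODE", "UTF-8"},       /* old name, deprecated in favor of UTF8 */
    {"UTF8", "UTF-8"},
    {"LATIN1", "ISO-8859-1"},
    {"LATIN2", "ISO-8859-2"},
    {"EUC_JP", "EUC-JP"},
};

/* Return the IANA name for a PostgreSQL encoding name, or NULL if unknown. */
static const char *
pg_to_iana(const char *pg_name)
{
    size_t      i;

    assert(pg_name != NULL);

    for (i = 0; i < sizeof(enc_table) / sizeof(enc_table[0]); i++)
    {
        if (strcmp(enc_table[i].pg_name, pg_name) == 0)
            return enc_table[i].iana_name;
    }
    return NULL;
}
```

The lookup is an exact, case-sensitive match to keep the sketch portable. Returning NULL for unknown names leaves the caller to decide whether to fall back to the non-ICU code path or raise an error.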

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 

