Re: Patch for collation using ICU - Mailing list pgsql-hackers
From | Palle Girgensohn |
---|---|
Subject | Re: Patch for collation using ICU |
Date | |
Msg-id | 0B537F6953FA3B724B5761A7@palle.girgensohn.se |
In response to | Re: Patch for collation using ICU (Bruce Momjian <pgman@candle.pha.pa.us>) |
Responses |
Re: Patch for collation using ICU
|
List | pgsql-hackers |
--On Saturday, May 07, 2005 09:52:59 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> Palle Girgensohn wrote:
>> >> Also, apparently, ICU is installed by default in many linux
>> >> distributions, and usually it is version 2.8. Some linux users have
>> >> asked me if there are plans for a patch that works with ICU 2.8.
>> >> That's probably a good idea. IBM and the ICU folks seem to consider
>> >> 3.2 to be the stable version, older versions are hard to find on
>> >> their sites, but most linux distributors seem to consider it too
>> >> bleeding edge, even gentoo. I don't know why they don't agree.
>> >
>> > Good point. Why would linux folks need ICU? Doesn't their OS support
>> > encodings natively? I am particularly excited about this for OSs that
>> > don't have such encodings, like UTF8 support for Win32.
>> >
>> > Because ICU will not be used unless enabled by configure, it seems we
>> > are fine with only supporting the newest version. Do Linux users need
>> > to use ICU for any reason?
>>
>>
>> There are corner cases where it is impossible to upper/lowercase one
>> character at a time. For example:
>>
>> -- without ICU
>> select upper('Eßer');
>>  upper
>> -------
>>  EßER
>> (1 row)
>>
>> -- with ICU
>> select upper('Eßer');
>>  upper
>> -------
>>  ESSER
>> (1 row)
>>
>> This is because in the standard postgres implementation, upper/lower is
>> done one character at a time. A proper upper/lower cannot do it that
>> way. Another known example is in Turkish, where an ? (?) should look
>> different whether it is an initial letter or not. This fails in
>> standard postgresql for all platforms.
>
> Uh, where do you see that? Our code has:
>
>     workspace = texttowcs(string);
>
>     for (i = 0; workspace[i] != 0; i++)
>         workspace[i] = towupper(workspace[i]);

As you see, the loop runs towupper for one character at a time. It cannot
consider whether the letter is the initial, as required in Turkish, and it
cannot really convert one character into two ('ß' -> 'SS').

>
>     result = wcstotext(workspace, i);
>
>
>> >> Also, in the latest patch, I also added checks and logging for *every*
>> >> status returned from ICU. I hope this will help debugging on debian,
>> >> where previous version didn't work. That excessive status checking
>> >> will hardly be necessary once the stuff is better tested.
>> >>
>> >> I think the string copying and heap/palloc choices account for most of
>> >> the code bloat, together with the excessive status checking and
>> >> logging.
>> >
>> > OK, move that into some common functions and I think it will be better.
>>
>> Best way for upper/lower/initcap is probably to use a function
>> pointer... uhh...
>
> Uh, I don't think so. Just send pointers to the function and let
> the function allocate the memory, and another function to free them, or
> something like that. I can probably do it if you want.

I'll check it out, it seems simple enough.

>> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does
>> > that help?
>>
>> I'm aware of that. It might help for unicode, but there are a bunch of
>> other encodings. IANA has decided that utf-8 has *no* aliases, hence
>> only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
>> forgiving, I don't remember/know, but I think we need the mappings,
>> unfortunately.
>
> OK. I guess I am just confused why the native implementations are OK.

They're OK since they understand that UNICODE (or UTF8) is really utf-8.
The problem is that the strings used to describe them are not understood by
ICU. BTW, the pg_enc2iananame_tbl is only used *from* internal
representation *to* IANA, not the other way around. Maybe that fact lowers
the rate of confusion? ;-)

/Palle
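
To make the 'ß' -> 'SS' point concrete, here is a minimal sketch in C. It is not the PostgreSQL code and not the ICU API: simple_upper(), upper_per_char() and upper_full() are hypothetical names, and the mapping is hard-coded for ASCII plus U+00DF so that the result does not depend on the locale. It only illustrates that an in-place, one-character-at-a-time loop cannot change the string length, while a string-level routine with a separate output buffer can.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical one-to-one mapping: ASCII letters only.
 * U+00DF (sharp s) has no single-character uppercase form here,
 * so it is returned unchanged. */
static wchar_t simple_upper(wchar_t c)
{
    if (c >= L'a' && c <= L'z')
        return (wchar_t) (c - L'a' + L'A');
    return c;
}

/* Same shape as the loop quoted above: each character is replaced in
 * place, so the output always has the same length as the input. */
void upper_per_char(wchar_t *s)
{
    for (size_t i = 0; s[i] != L'\0'; i++)
        s[i] = simple_upper(s[i]);
}

/* String-level mapping: writes into a separate buffer so that one input
 * character may expand to two. Returns the number of characters written
 * (excluding the terminator), or (size_t) -1 if dst is too small. */
size_t upper_full(const wchar_t *src, wchar_t *dst, size_t dstlen)
{
    size_t out = 0;

    if (dstlen == 0)
        return (size_t) -1;

    for (size_t i = 0; src[i] != L'\0'; i++)
    {
        if (src[i] == 0x00DF)
        {
            /* need room for "SS" plus the terminator */
            if (out + 2 >= dstlen)
                return (size_t) -1;
            dst[out++] = L'S';
            dst[out++] = L'S';
        }
        else
        {
            /* need room for one character plus the terminator */
            if (out + 1 >= dstlen)
                return (size_t) -1;
            dst[out++] = simple_upper(src[i]);
        }
    }
    dst[out] = L'\0';
    return out;
}
```

With the four-character input 'Eßer', upper_per_char() leaves the sharp s in place and the length at four, whereas upper_full() produces the five-character 'ESSER' and therefore needs a destination buffer of at least six wide characters. A context-sensitive rule such as the Turkish one mentioned above would additionally need locale information, which this sketch does not attempt.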