Re: Patch for collation using ICU - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Patch for collation using ICU |
Date | |
Msg-id | 200505071352.j47DqxK28575@candle.pha.pa.us |
In response to | Re: Patch for collation using ICU (Palle Girgensohn <girgen@pingpong.net>) |
Responses | Re: Patch for collation using ICU |
List | pgsql-hackers |
Palle Girgensohn wrote:
> >> Also, apparently, ICU is installed by default in many linux
> >> distributions, and usually it is version 2.8. Some linux users have
> >> asked me if there are plans for a patch that works with ICU 2.8. That's
> >> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
> >> the stable version, older versions are hard to find on their sites, but
> >> most linux distributors seem to consider it too bleeding edge, even
> >> gentoo. I don't know why they don't agree.
> >
> > Good point. Why would linux folks need ICU? Doesn't their OS support
> > encodings natively? I am particularly excited about this for OSs that
> > don't have such encodings, like UTF8 support for Win32.
> >
> > Because ICU will not be used unless enabled by configure, it seems we
> > are fine with only supporting the newest version. Do Linux users need
> > to use ICU for any reason?
>
> There are corner cases where it is impossible to upper/lowercase one
> character at a time. For example:
>
> -- without ICU
> select upper('Eßer');
>  upper
> -------
>  EßER
> (1 row)
>
> -- with ICU
> select upper('Eßer');
>  upper
> -------
>  ESSER
> (1 rad)
>
> This is because in the standard postgres implementation, upper/lower is
> done one character at a time. A proper upper/lower cannot do it that way.
> Another known example is in Turkish, where an ? (?) should look different
> depending on whether it is an initial letter or not. This fails in
> standard postgresql on all platforms.

Uh, where do you see that? Our code has:

    workspace = texttowcs(string);

    for (i = 0; workspace[i] != 0; i++)
        workspace[i] = towupper(workspace[i]);

    result = wcstotext(workspace, i);

> >> Also, in the latest patch, I also added checks and logging for *every*
> >> status returned from ICU. I hope this will help debugging on debian,
> >> where the previous version didn't work. That excessive status checking
> >> will hardly be necessary once the stuff is better tested.
> >>
> >> I think the string copying and heap/palloc choices account for most of
> >> the code bloat, together with the excessive status checking and logging.
> >
> > OK, move that into some common functions and I think it will be better.
>
> Best way for upper/lower/initcap is probably to use a function pointer...
> uhh...

Uh, I don't think so. Just send pointers to the function and let
the function allocate the memory, and another function to free them, or
something like that. I can probably do it if you want.

> >> > Why do you need to add a mapping of encoding names from iana to our
> >> > names?
> >>
> >> This was already answered by John Hansen... There's an old thread here
> >> about the choice of the name "UNICODE" to describe an encoding, which it
> >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
> >> by postgresql, that would have been a better name... Similarly for most
> >> other encodings, really. ICU expects a setlocale(3) string (i.e. IANA).
> >> PostgreSQL can't provide it, so a mapping table is required.
> >
> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does that
> > help?
>
> I'm aware of that. It might help for unicode, but there are a bunch of
> other encodings. IANA has decided that utf-8 has *no* aliases, hence only
> utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
> forgiving, I don't remember/know, but I think we need the mappings,
> unfortunately.

OK. I guess I am just confused why the native implementations are OK.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
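
A note on the upper-casing question raised in this message: the quoted loop calls towupper() once per wide character, so its result always has exactly as many characters as its input, and it cannot turn a four-character word into the five-character 'ESSER'. The sketch below contrasts that loop with a hypothetical full mapping that hard-codes a single rule (U+00DF becomes "SS"). The names upper_simple and upper_full_sketch are assumptions made for this illustration; they are neither PostgreSQL nor ICU functions. The sketch also assumes that wchar_t holds Unicode code points.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Per-character upper-casing, in the style of the loop quoted above:
 * one wide character in, one wide character out, so the length of the
 * result always equals the length of the input.
 * "out" must have room for the input length plus a terminator.
 */
static size_t upper_simple(const wchar_t *in, wchar_t *out)
{
    size_t i;

    for (i = 0; in[i] != 0; i++)
        out[i] = (wchar_t) towupper((wint_t) in[i]);
    out[i] = 0;
    return i;               /* number of characters written */
}

/*
 * Hypothetical full mapping with one hard-coded rule, for illustration
 * only: U+00DF (sharp s) expands to "SS". A real implementation would
 * take its rules from a library rather than from a special case.
 * "out" must have room for twice the input length plus a terminator.
 */
static size_t upper_full_sketch(const wchar_t *in, wchar_t *out)
{
    size_t i, j = 0;

    for (i = 0; in[i] != 0; i++)
    {
        if (in[i] == 0x00DF)
        {
            out[j++] = L'S';
            out[j++] = L'S';
        }
        else
            out[j++] = (wchar_t) towupper((wint_t) in[i]);
    }
    out[j] = 0;
    return j;               /* may be larger than the input length */
}
```

For the input E, U+00DF, e, r the first function reports a length of 4 and the second a length of 5. Because a full mapping can grow the string, its caller has to size the output buffer for that growth.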