Re: Patch for collation using ICU - Mailing list pgsql-hackers
From | Palle Girgensohn |
---|---|
Subject | Re: Patch for collation using ICU |
Date | |
Msg-id | 9CC5F10245B7612103CC03F6@palle.girgensohn.se Whole thread Raw |
In response to | Re: Patch for collation using ICU (Bruce Momjian <pgman@candle.pha.pa.us>) |
Responses |
Re: Patch for collation using ICU
|
List | pgsql-hackers |
--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Palle Girgensohn wrote: >> > >> > Is this patch ready for application? >> >> I don't think so, not quite. I have not had any positive reports from >> linux users, this is only tested in a FreeBSD environment. I'd say it >> needs some more testing. > > OK. John Hansen just reported that it does work on linux. fine! >> Also, apparently, ICU is installed by default in many linux >> distributions, and usually it is version 2.8. Some linux users have >> asked me if there are plans for a patch that works with ICU 2.8. That's >> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be >> the stable version, older versions are hard to find on their sites, but >> most linux distributers seem to consider it too bleeding edge, even >> gentoo. I don't know why they don't agree. > > Good point. Why would linux folks need ICU? Doesn't their OS support > encodings natively? I am particularly excited about this for OSs that > don't have such encodings, like UTF8 support for Win32. > > Because ICU will not be used unless enabled by configure, it seems we > are fine with only supporting the newest version. Do Linux users need > to use ICU for any reason? There are corner cases where it is impossible to upper/lowercase one character at the time. for example: -- without ICUselect upper('Eßer');upper -------EßER (1 row) -- with ICU select upper('Eßer');upper -------ESSER (1 rad) This is because in the standard postgres implementation, upper/lower is done one character at the time. A proper upper/lower cannot do it that way. Other known example is in Turkish, where an Ì (?) should look different whether it is an initial letter or not. This fails in standard postgresql for all platforms. >> > I do have a few questions: >> > >> > Why don't you use the lc_ctype_is_c() part of this test? >> > >> > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) >> >> Um, well, I didn't think about that. :) What would be the locale in >> this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a >> wide encoding, indeed. Then the strings will be handled like byte-wide >> chars. Yeah, it's a bug. I'll fix it! Thanks. > > The additional test is more of an optmization, and it fixes a problem > with some OSs that have processing problems with UTF8 when the locale is > supposed to be turned off, like in "C". I realize ICU might be fine > with it but the optimization still is an issue. Well, the results are quite different, depending on whether ICU is used or not. See separate mail. >> > Why is so much code added, for example, in lower()? The existing >> > multibyte code is much smaller, and lots of code is added in other >> > places too. >> >> ICU uses UTF-16 internally, so all strings must be converted from the >> database encoding to UTF-16. Since that means the strings need to be >> copied, I took the same approach as in varlena.c:varstr_cmp(), where >> small strings use the heap and only larger strings use a palloc. >> Comments in varstr_cmp about performance made me use that approach. > > Oh, interesting. I think you need to create new functions that > factor out that common code so the patch is smaller and easier to > maintain. Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me... >> Also, in the latest patch, I also added checks and logging for *every* >> status returned from ICU. I hope this will help debugging on debian, >> where previous version didn't work. That excessive status checking is >> hardly be necessary once the stuff is better tested. >> >> I think the string copying and heap/palloc choices stands for most of >> the code bloat, together with the excessive status checking and logging. > > OK, move that into some common functions and I think it will be better. Best way for upper/lower/initcap is probably to use a function pointer... uhh... >> > Why do you need to add a mapping of encoding names from iana to our >> > names? >> >> This was already answered by John Hansen... There's an old thread here >> about the choice of the name "UNICODE" to describe an encoding, which it >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used >> by postgresql, that would have been a better name... Similarly for most >> other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). >> PostgreSQL can't provide it, so a mapping table is required. > > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that > help? I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. >> I use this patch in production on one FreeBSD 4.10 server at the moment. >> With the latest version, I've had no problems. Logging is swithed on for >> now, and it shows no signs of ICU complaining. I'd like more reports on >> Linux, though. > > OK, I certainly would like this all done for 8.1 which should have > feature freeze on July 1. That shouldn't be a problem. /Palle > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania > 19073
pgsql-hackers by date: