Thread: Proper Unicode support
Hello,

are there any plans to make PostgreSQL properly support Unicode with respect to language-specific collations and upper/lower case handling?

AFAIK, currently the codepoints are sorted in their numerical order. I've searched the source code and could not find the actual place where this is done. I've seen executor/nodeSort.c and utils/tuplesort.c. AFAIU, they are generic sorting routines.

Where is the actual code for (rudimentary) Unicode collation? Also, where are the UPPER()/LOWER() functions handled?

Thanks,
--alexm
Alexey Mahotkin writes:
> AFAIK, currently the codepoints are sorted in their numerical order. I've
> searched the source code and could not find the actual place where this is
> done. I've seen executor/nodeSort.c and utils/tuplesort.c. AFAIU, they
> are generic sorting routines.

PostgreSQL uses the operating system's locale routines for this. So the sort order depends on choosing a locale that can deal with Unicode.

--
Peter Eisentraut peter_e@gmx.net
On Mon, 11 Aug 2003, Peter Eisentraut wrote:
> Alexey Mahotkin writes:
>
> > AFAIK, currently the codepoints are sorted in their numerical order. I've
> > searched the source code and could not find the actual place where this is
> > done. I've seen executor/nodeSort.c and utils/tuplesort.c. AFAIU, they
> > are generic sorting routines.
>
> PostgreSQL uses the operating system's locale routines for this. So the
> sort order depends on choosing a locale that can deal with Unicode.

sort order works, but upper/lower are broken.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Alexey Mahotkin <alexm@hsys.msk.ru> writes:
> Where is the actual code for (rudimentary) Unicode collation?

strcoll() and friends, in libc. If you aren't happy with the sorting and case translation behavior, you've selected the wrong locale.

regards, tom lane
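The point that collation is delegated to strcoll() in libc can be illustrated with a small standalone C sketch. This is not PostgreSQL source code: `cmp_collate` and `sort_with_locale` are hypothetical names made up for the illustration, and which locale names exist (for example a UTF-8 one) is an assumption about the host system. The sketch only shows that the same strings sort differently depending on the LC_COLLATE locale selected with setlocale().

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: orders two C strings by the current LC_COLLATE rules. */
static int cmp_collate(const void *a, const void *b)
{
    const char *sa = *(const char *const *) a;
    const char *sb = *(const char *const *) b;

    return strcoll(sa, sb);
}

/*
 * Hypothetical helper (not a PostgreSQL function): select a collation
 * locale by name and sort the array of strings in place.
 * Returns 0 on success, -1 if the requested locale is not available.
 */
int sort_with_locale(const char **items, size_t n, const char *locale_name)
{
    assert(items != NULL && locale_name != NULL);

    /* setlocale() returns NULL when the locale is not installed. */
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;

    qsort(items, n, sizeof items[0], cmp_collate);
    return 0;
}
```

In the "C" locale strcoll() compares like strcmp(), so strings come out in byte-value order, which matches the "numerical order" behaviour described at the start of the thread. Passing a language-specific locale name instead, if one is installed, lets libc apply that language's collation rules.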
Oleg Bartunov wrote on Mon, 11.08.2003 at 11:52:
> On Mon, 11 Aug 2003, Peter Eisentraut wrote:
>
> > Alexey Mahotkin writes:
> >
> > > AFAIK, currently the codepoints are sorted in their numerical order. I've
> > > searched the source code and could not find the actual place where this is
> > > done. I've seen executor/nodeSort.c and utils/tuplesort.c. AFAIU, they
> > > are generic sorting routines.
> >
> > PostgreSQL uses the operating system's locale routines for this. So the
> > sort order depends on choosing a locale that can deal with Unicode.
>
> sort order works, but upper/lower are broken.

I think that the original MB/Unicode support was made for the Japanese language/characters, and AFAIK they don't even have the concept (problem) of upper/lower case.

A question to the core - are there any plans to rectify this for less fortunate languages/charsets?

Will the ASCII-speaking core tolerate the potential loss of performance from locale-aware upper/lower?

Will this be considered a feature or a bugfix (i.e. should we attempt to fix it for 7.4)?

---------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> A question to the core - are there any plans to rectify this for less
> fortunate languages/charsets?

I'm not planning to fix it personally, if that's what you mean. I agree somebody should do something about it. Possibly the C99 <wctype.h> routines can be used (where available) in place of <ctype.h>.

> Will the ASCII-speaking core tolerate the potential loss of performance
> from locale-aware upper/lower ?

Depends how big a loss we're talking about, doesn't it? Still, we tolerated multibyte-always ... which I'd think would be a bigger hit. BTW, I'd be at least as concerned about maintaining code readability as about performance.

> Will this be considered a feature or a bugfix (i.e. should we attempt to
> fix it for 7.4) ?

I think it is probably too large a change to consider for 7.4 at this point ... especially since it's been broken since day one and no one has shown any particular urgency about fixing it before. It's not too soon to start on work for 7.5 though. Keep in mind that Bruce is still muttering about a short 7.5 cycle, if the Win32 port work comes through.

regards, tom lane
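As a rough illustration of the <wctype.h> suggestion, the following C sketch upper-cases a multibyte string by converting it to wide characters, applying towupper() to each one, and converting back. It is only a sketch under stated assumptions, not a proposed patch: `mb_upper` is a hypothetical name, the caller is assumed to have already selected a suitable LC_CTYPE locale with setlocale(), and the quality of the case mapping depends entirely on the platform's libc.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not a PostgreSQL function): upper-case the
 * multibyte string src according to the current LC_CTYPE locale and
 * store the NUL-terminated result in dst (capacity dstsize bytes).
 * Returns 0 on success, -1 on invalid input, allocation failure, or
 * when the result does not fit in dst.
 */
int mb_upper(const char *src, char *dst, size_t dstsize)
{
    assert(src != NULL && dst != NULL);

    /* A string never has more wide characters than bytes. */
    size_t   len = strlen(src);
    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    int      rc = -1;

    if (wide == NULL)
        return -1;

    size_t nwide = mbstowcs(wide, src, len + 1);

    if (nwide != (size_t) -1)
    {
        /* Case-map one wide character at a time. */
        for (size_t i = 0; i < nwide; i++)
            wide[i] = (wchar_t) towupper((wint_t) wide[i]);

        /* Worst case: every wide character needs MB_CUR_MAX bytes. */
        size_t cap = nwide * MB_CUR_MAX + 1;
        char  *tmp = malloc(cap);

        if (tmp != NULL)
        {
            size_t nbytes = wcstombs(tmp, wide, cap);

            if (nbytes != (size_t) -1 && nbytes < dstsize)
            {
                memcpy(dst, tmp, nbytes + 1);   /* include the NUL */
                rc = 0;
            }
            free(tmp);
        }
    }

    free(wide);
    return rc;
}
```

A caller would typically run setlocale(LC_CTYPE, "") (or name a specific locale) once before using the helper. The per-call allocations and two conversions hint at the performance and readability costs raised in the message above; a real implementation would need to weigh those.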