Thread: Proper Unicode support

Proper Unicode support

From
Alexey Mahotkin
Date:
Hello,

are there any plans to make PostgreSQL properly support Unicode wrt
language-specific collations and upper/lower case handling?


AFAIK, currently the codepoints are sorted in their numerical order.  I've
searched the source code and could not find the actual place where this is
done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
are generic sorting routines.  

Where is the actual code for (rudimentary) Unicode collation?  Also, where
are the UPPER()/LOWER() functions being handled?


Thanks,

--alexm


Re: Proper Unicode support

From
Peter Eisentraut
Date:
Alexey Mahotkin writes:

> AFAIK, currently the codepoints are sorted in their numerical order.  I've
> searched the source code and could not find the actual place where this is
> done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
> are generic sorting routines.

PostgreSQL uses the operating system's locale routines for this.  So the
sort order depends on choosing a locale that can deal with Unicode.

-- 
Peter Eisentraut   peter_e@gmx.net


Re: Proper Unicode support

From
Oleg Bartunov
Date:
On Mon, 11 Aug 2003, Peter Eisentraut wrote:

> Alexey Mahotkin writes:
>
> > AFAIK, currently the codepoints are sorted in their numerical order.  I've
> > searched the source code and could not find the actual place where this is
> > done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
> > are generic sorting routines.
>
> PostgreSQL uses the operating system's locale routines for this.  So the
> sort order depends on choosing a locale that can deal with Unicode.
>

Sort order works, but upper/lower are broken.

Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Proper Unicode support

From
Tom Lane
Date:
Alexey Mahotkin <alexm@hsys.msk.ru> writes:
> Where is the actual code for (rudimentary) Unicode collation?

strcoll() and friends, in libc.  If you aren't happy with the sorting
and case translation behavior, you've selected the wrong locale.
        regards, tom lane
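
To illustrate the point about strcoll() and locale choice, here is a minimal sketch. It is not code from PostgreSQL or from this thread, and the helper name collate_in_locale is made up for the example. It compares two strings under a named locale and then restores the previous LC_COLLATE setting. In the "C" locale, strcoll() behaves like strcmp(), i.e. plain byte order.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): compare a and b with strcoll()
 * under the locale named by locale_name.
 *
 * Returns 0 and stores -1, 0 or +1 in *result on success.
 * Returns -1 if the requested locale cannot be selected.
 */
int collate_in_locale(const char *locale_name, const char *a, const char *b,
                      int *result)
{
    const char *current;
    char *saved = NULL;
    int rc = -1;

    assert(locale_name != NULL && a != NULL && b != NULL && result != NULL);

    /* setlocale() may reuse its return buffer, so keep a private copy. */
    current = setlocale(LC_COLLATE, NULL);
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    if (setlocale(LC_COLLATE, locale_name) != NULL) {
        int cmp = strcoll(a, b);

        /* Normalize to the sign only; the magnitude is unspecified. */
        *result = (cmp > 0) - (cmp < 0);
        rc = 0;
    }

    /* Put the previous collation locale back, if we managed to save it. */
    if (saved != NULL) {
        setlocale(LC_COLLATE, saved);
        free(saved);
    }
    return rc;
}
```

With "C" as the locale, comparing "B" against "a" gives a negative result because 'B' precedes 'a' in byte order. Under a language locale such as en_US.UTF-8 (assuming it is installed on the system) the result would typically be the opposite, which is why the choice of locale decides the sort order.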


Re: Proper Unicode support

From
Hannu Krosing
Date:
Oleg Bartunov wrote on Mon, 11.08.2003 at 11:52:
> On Mon, 11 Aug 2003, Peter Eisentraut wrote:
> 
> > Alexey Mahotkin writes:
> >
> > > AFAIK, currently the codepoints are sorted in their numerical order.  I've
> > > searched the source code and could not find the actual place where this is
> > > done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
> > > are generic sorting routines.
> >
> > PostgreSQL uses the operating system's locale routines for this.  So the
> > sort order depends on choosing a locale that can deal with Unicode.
> >
> 
> sort order works, but upper/lower are broken.

I think that the original MB/Unicode support was made for the Japanese
language/characters, and AFAIK they don't even have the concept
(problem) of upper/lower case.

A question to the core - are there any plans to rectify this for less
fortunate languages/charsets?

Will the ASCII-speaking core tolerate the potential loss of performance
from locale-aware upper/lower?

Will this be considered a feature or a bugfix (i.e. should we attempt to
fix it for 7.4)?

---------------
Hannu



Re: Proper Unicode support

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> A question to the core - are there any plans to rectify this for less
> fortunate languages/charsets?

I'm not planning to fix it personally, if that's what you mean.  I agree
somebody should do something about it.  Possibly the C99 <wctype.h>
routines can be used (where available) in place of <ctype.h>.
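
A rough sketch of what the <wctype.h> approach could look like follows. It is only an illustration under stated assumptions, not PostgreSQL code and not a proposal made in this thread; the helper name mb_toupper is hypothetical. The idea is to convert the multibyte string to wide characters, apply towupper() to each one, and convert back, instead of calling toupper() on individual bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not PostgreSQL code): upper-case the multibyte
 * string src into dst (capacity dstsize bytes, including the terminator),
 * using the current LC_CTYPE locale.
 *
 * Returns the number of bytes written, excluding the terminator, or
 * (size_t) -1 on an invalid byte sequence, an unconvertible character,
 * allocation failure, or a dst buffer that is too small. On failure the
 * contents of dst are unspecified.
 */
size_t mb_toupper(const char *src, char *dst, size_t dstsize)
{
    size_t maxwide, nwide, nbytes, i;
    wchar_t *wide;

    assert(src != NULL && dst != NULL);

    /* A string never has more wide characters than bytes. */
    maxwide = strlen(src) + 1;
    wide = malloc(maxwide * sizeof *wide);
    if (wide == NULL)
        return (size_t) -1;

    /* Decode according to the encoding of the current locale. */
    nwide = mbstowcs(wide, src, maxwide);
    if (nwide == (size_t) -1) {
        free(wide);
        return (size_t) -1;
    }

    /* Case-map whole characters rather than single bytes. */
    for (i = 0; i < nwide; i++)
        wide[i] = (wchar_t) towupper((wint_t) wide[i]);

    /* Encode back; a result equal to dstsize means no room for the NUL. */
    nbytes = wcstombs(dst, wide, dstsize);
    free(wide);
    if (nbytes == (size_t) -1 || nbytes >= dstsize)
        return (size_t) -1;
    return nbytes;
}
```

The caller is assumed to have selected a suitable locale first, for example with setlocale(LC_CTYPE, ""). In the default "C" locale only ASCII letters can be relied on to convert. Note also that towupper() maps one character to one character, so language rules that change the number of characters are outside what this sketch can handle.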

> Will the ASCII-speaking core tolerate the potential loss of performance
> from locale-aware upper/lower?

Depends how big a loss we're talking about, doesn't it?  Still, we
tolerated multibyte-always ... which I'd think would be a bigger hit.
BTW, I'd be at least as concerned about maintaining code readability as
about performance.

> Will this be considered a feature or a bugfix (i.e. should we attempt to
> fix it for 7.4)?

I think it is probably too large a change to consider for 7.4 at this
point ... especially since it's been broken since day one and no one
has shown any particular urgency about fixing it before.

It's not too soon to start on work for 7.5 though.  Keep in mind that
Bruce is still muttering about a short 7.5 cycle, if the Win32 port work
comes through.
        regards, tom lane