Thread: Proper Unicode support

Proper Unicode support

From

Alexey Mahotkin

Date:

10 August 2003, 19:05:44

Hello,

are there any plans to make PostgreSQL properly support Unicode with respect
to language-specific collations and upper/lower case handling?


AFAIK, currently the codepoints are sorted in their numerical order.  I've
searched the source code and could not find the actual place where this is
done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
are generic sorting routines.  

Where is the actual code for (rudimentary) Unicode collation?  Also, where
are the UPPER()/LOWER() functions being handled?


Thanks,

--alexm

Re: Proper Unicode support

From

Peter Eisentraut

Date:

11 August 2003, 05:37:46

Alexey Mahotkin writes:

> AFAIK, currently the codepoints are sorted in their numerical order.  I've
> searched the source code and could not find the actual place where this is
> done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
> are generic sorting routines.

PostgreSQL uses the operating system's locale routines for this.  So the
sort order depends on choosing a locale that can deal with Unicode.

-- 
Peter Eisentraut   peter_e@gmx.net

Re: Proper Unicode support

From

Oleg Bartunov

Date:

11 August 2003, 05:54:29

On Mon, 11 Aug 2003, Peter Eisentraut wrote:

> Alexey Mahotkin writes:
>
> > AFAIK, currently the codepoints are sorted in their numerical order.  I've
> > searched the source code and could not find the actual place where this is
> > done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
> > are generic sorting routines.
>
> PostgreSQL uses the operating system's locale routines for this.  So the
> sort order depends on choosing a locale that can deal with Unicode.
>

sort order works, but upper/lower are broken.

Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Proper Unicode support

From

Tom Lane

Date:

11 August 2003, 11:25:58

Alexey Mahotkin <alexm@hsys.msk.ru> writes:
> Where is the actual code for (rudimentary) Unicode collation?

strcoll() and friends, in libc.  If you aren't happy with the sorting
and case translation behavior, you've selected the wrong locale.
        regards, tom lane
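
A minimal sketch of the point about strcoll(), written in Python rather than C: the sort order comes from the C library's locale data, not from the sorting code itself. Python's locale.strcoll wraps the libc routine. The helper name sort_with_locale and the sample words are hypothetical, and this is an illustration, not PostgreSQL's code.

```python
import functools
import locale


def sort_with_locale(words, locale_name):
    """Sort words using the collation rules of the given OS locale."""
    # Remember the current LC_COLLATE setting so it can be restored afterwards.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        # Raises locale.Error if the operating system lacks this locale.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # locale.strcoll compares two strings under the active LC_COLLATE.
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    words = ["b", "a", "C"]
    # The "C" locale always exists and compares by code value.
    print(sort_with_locale(words, "C"))
    try:
        # Assumption: this locale name may or may not be installed.
        print(sort_with_locale(words, "en_US.UTF-8"))
    except locale.Error:
        print("en_US.UTF-8 is not installed on this system")
```

Under the "C" locale the comparison follows code values, so "C" sorts before "a". What a language-specific locale produces depends entirely on the locale definitions the operating system ships.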

Re: Proper Unicode support

From

Hannu Krosing

Date:

12 August 2003, 19:25:36

Oleg Bartunov wrote on Mon, 11.08.2003 at 11:52:
> On Mon, 11 Aug 2003, Peter Eisentraut wrote:
> 
> > Alexey Mahotkin writes:
> >
> > > AFAIK, currently the codepoints are sorted in their numerical order.  I've
> > > searched the source code and could not find the actual place where this is
> > > done.  I've seen executor/nodeSort.c and utils/tuplesort.c.  AFAIU, they
> > > are generic sorting routines.
> >
> > PostgreSQL uses the operating system's locale routines for this.  So the
> > sort order depends on choosing a locale that can deal with Unicode.
> >
> 
> sort order works, but upper/lower are broken.

I think the original MB/Unicode support was written for the Japanese
language and its characters, and AFAIK Japanese doesn't even have the
concept (problem) of upper/lower case.

A question to the core - are there any plans to rectify this for less
fortunate languages/charsets?

Will the ASCII-speaking core tolerate the potential loss of performance
from locale-aware upper/lower ?

Will this be considered a feature or a bugfix (i.e. should we attempt to
fix it for 7.4) ?

---------------
Hannu

Re: Proper Unicode support

From

Tom Lane

Date:

12 August 2003, 19:59:21

Hannu Krosing <hannu@tm.ee> writes:
> A question to the core - are there any plans to rectify this for less
> fortunate languages/charsets?

I'm not planning to fix it personally, if that's what you mean.  I agree
somebody should do something about it.  Possibly the C99 <wctype.h>
routines can be used (where available) in place of <ctype.h>.

> Will the ASCII-speaking core tolerate the potential loss of performance
> from locale-aware upper/lower ?

Depends how big a loss we're talking about, doesn't it?  Still, we
tolerated multibyte-always ... which I'd think would be a bigger hit.
BTW, I'd be at least as concerned about maintaining code readability as
about performance.

> Will this be considered a feature or a bugfix (i.e. should we attempt to
> fix it for 7.4) ?

I think it is probably too large a change to consider for 7.4 at this
point ... especially since it's been broken since day one and no one
has shown any particular urgency about fixing it before.

It's not too soon to start on work for 7.5 though.  Keep in mind that
Bruce is still muttering about a short 7.5 cycle, if the Win32 port work
comes through.
        regards, tom lane
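
A rough illustration of why byte-oriented <ctype.h>-style case mapping falls short on multibyte text, and what a character-oriented routine changes. It is a sketch in Python, not C: bytes.upper() stands in for calling toupper() on each byte in the "C" locale, and str.upper() stands in for a wide-character routine. Note that str.upper() uses Python's Unicode tables rather than the OS locale, so it is only an analogy for the <wctype.h> functions. The function names and the sample string are hypothetical.

```python
def bytewise_upper(text, encoding="utf-8"):
    """Upper-case one byte at a time; only ASCII letters are touched."""
    # bytes.upper() maps only the ASCII range a-z, much like toupper()
    # applied to each byte of a multibyte string in the "C" locale.
    return text.encode(encoding).upper().decode(encoding)


def characterwise_upper(text):
    """Upper-case whole characters using Python's Unicode tables."""
    return text.upper()


if __name__ == "__main__":
    sample = "abc привет"
    print(bytewise_upper(sample))       # Cyrillic letters stay lower case
    print(characterwise_upper(sample))  # every letter is upper-cased
```

The byte-wise version leaves the Cyrillic letters alone because none of their UTF-8 bytes fall in the ASCII a-z range, which matches the "upper/lower are broken" symptom reported earlier in the thread.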