Thread: confused with encodings

confused with encodings

From
Oleg Bartunov
Date:
Tatsuo,

recently I tried to understand why I can't get sorting to work properly
with Cyrillic characters in a UTF8 database. I figured out the
reason for my confusion - I thought I could specify different encodings
for different databases and that these encodings would be used in text
operations (sort, upper, lower), not just for conversion.
But actually, only one encoding matters for text operations - the one
specified with the 'initdb' command! Is that true?

If so, it's a big issue :)

After I created a separate storage area for Unicode (initdb -E utf8) and
restarted the postmaster, 'order by' worked, but the
upper() and lower() functions still fail.
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: confused with encodings

From
Tatsuo Ishii
Date:
> Tatsuo,
> 
> recently I tried to understand why I can't get sorting to work properly
> with Cyrillic characters in a UTF8 database. I figured out the
> reason for my confusion - I thought I could specify different encodings
> for different databases and that these encodings would be used in text
> operations (sort, upper, lower), not just for conversion.
> But actually, only one encoding matters for text operations - the one
> specified with the 'initdb' command! Is that true?
> 
> If so, it's a big issue :)
> 
> After I created a separate storage area for Unicode (initdb -E utf8) and
> restarted the postmaster, 'order by' worked, but the
> upper() and lower() functions still fail.

[I assume you enabled locale support.]

Don't ask me. These are locale support problems.
--
Tatsuo Ishii


Re: confused with encodings

From
Dennis Björklund
Date:
On Mon, 16 Jun 2003, Oleg Bartunov wrote:

> I thought I could specify different encodings
> for different databases and these encodings will be used in text operations
> (sort, upper,lower), not just for conversion.

An encoding does not imply any sort order. UTF-8 can be used to store 
strings in many languages, each having different sort order (and other 
properties). It's the locale that determines these things.

It would be nice to be able to set the locale per database, or even per 
column.
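
To illustrate the distinction, here is a minimal C sketch (not code from PostgreSQL; the helper names are made up for this example). It sorts strings with strcoll(), which consults the current LC_COLLATE locale, so the same UTF-8 bytes can come out in a different order depending on which locale the process has selected with setlocale():

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: orders two strings by the current LC_COLLATE rules. */
static int
compare_by_locale(const void *a, const void *b)
{
    const char *left = *(const char *const *) a;
    const char *right = *(const char *const *) b;

    /* strcoll() uses the locale's collation; strcmp() would compare raw bytes. */
    return strcoll(left, right);
}

/* Hypothetical helper: sort an array of C strings in place, locale-aware. */
void
sort_strings_by_locale(const char **items, size_t count)
{
    qsort(items, count, sizeof(items[0]), compare_by_locale);
}
```

A caller would normally run setlocale(LC_COLLATE, "") (or name a specific locale) once before sorting; without that call the process stays in the "C" locale and strcoll() behaves like a plain byte comparison.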

-- 
/Dennis



Re: confused with encodings

From
Oleg Bartunov
Date:
On Tue, 17 Jun 2003, Tatsuo Ishii wrote:

> > Tatsuo,
> >
> > recently I tried to understand why I can't get sorting to work properly
> > with Cyrillic characters in a UTF8 database. I figured out the
> > reason for my confusion - I thought I could specify different encodings
> > for different databases and that these encodings would be used in text
> > operations (sort, upper, lower), not just for conversion.
> > But actually, only one encoding matters for text operations - the one
> > specified with the 'initdb' command! Is that true?
> >
> > If so, it's a big issue :)
> >
> > After I created a separate storage area for Unicode (initdb -E utf8) and
> > restarted the postmaster, 'order by' worked, but the
> > upper() and lower() functions still fail.
>
> [I assume you enabled locale support.]

Isn't it enabled by default?

>
> Don't ask me. These are locale support problems.

Sorry, I just wanted to understand where I got confused.
You're right, UTF-8 locale support in glibc is broken.
I tested a simple C program with glibc 2.2.5 and 2.3.1 on a
Linux system, and the toupper and tolower functions are broken.

BTW, did you try the libutf8 library?
http://www.haible.de/bruno/packages-libutf8.html


> --
> Tatsuo Ishii
>
Regards,    Oleg


Re: confused with encodings

From
Tatsuo Ishii
Date:
> > [I assume you enabled locale support.]
> 
> Isn't it enabled by default?

It can be turned off by using the --no-locale option with initdb.

> > Don't ask me. These are locale support problems.
> 
> Sorry, I just wanted to understand where I got confused.
> You're right, UTF-8 locale support in glibc is broken.
> I tested a simple C program with glibc 2.2.5 and 2.3.1 on a
> Linux system, and the toupper and tolower functions are broken.
> 
> BTW, did you try the libutf8 library?
> http://www.haible.de/bruno/packages-libutf8.html

No. BTW, upper() will never work even if glibc works fine with UTF-8. See
the code fragment below (utils/adt/oracle_compat.c):

    char       *ptr;
:
:
    while (m-- > 0)
    {
        *ptr = toupper((unsigned char) *ptr);
        ptr++;
    }

Apparently this is not multibyte aware...
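
For comparison, a multibyte-aware version could go through wide characters. The following is only a rough sketch under stated assumptions, not the actual PostgreSQL code or a proposed patch: mb_upper is a hypothetical helper, it relies on the POSIX behavior of mbstowcs()/wcstombs() with a NULL destination to measure lengths, and it only helps if the C library's towupper() works for the selected locale.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not PostgreSQL code): upper-case a multibyte string
 * according to the current LC_CTYPE locale. Returns a malloc'd string that
 * the caller must free, or NULL on invalid input or allocation failure.
 */
char *
mb_upper(const char *src)
{
    size_t      nchars;
    size_t      nbytes;
    size_t      i;
    wchar_t    *wide;
    char       *dst;

    /* Count the characters (not bytes) in the multibyte input. */
    nchars = mbstowcs(NULL, src, 0);
    if (nchars == (size_t) -1)
        return NULL;            /* invalid sequence for this locale */

    wide = malloc((nchars + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;
    mbstowcs(wide, src, nchars + 1);

    /* Convert one whole character at a time instead of one byte. */
    for (i = 0; i < nchars; i++)
        wide[i] = (wchar_t) towupper((wint_t) wide[i]);

    /* Measure the multibyte result, then convert back. */
    nbytes = wcstombs(NULL, wide, 0);
    if (nbytes == (size_t) -1)
    {
        free(wide);
        return NULL;
    }
    dst = malloc(nbytes + 1);
    if (dst != NULL)
        wcstombs(dst, wide, nbytes + 1);

    free(wide);
    return dst;
}
```

The caller is assumed to have run setlocale(LC_CTYPE, "") or selected a specific UTF-8 locale beforehand; in the default "C" locale only ASCII letters are converted.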
--
Tatsuo Ishii


Re: confused with encodings

From
Oleg Bartunov
Date:
On Tue, 17 Jun 2003, Tatsuo Ishii wrote:

> > > [I assume you enabled locale support.]
> >
> > Isn't it enabled by default?
>
> It can be turned off by using the --no-locale option with initdb.
>

What's the benefit of this for the non-ASCII world?

> > > Don't ask me. These are locale support problems.
> >
> > Sorry, I just wanted to understand where I got confused.
> > You're right, UTF-8 locale support in glibc is broken.
> > I tested a simple C program with glibc 2.2.5 and 2.3.1 on a
> > Linux system, and the toupper and tolower functions are broken.
> >
> > BTW, did you try the libutf8 library?
> > http://www.haible.de/bruno/packages-libutf8.html
>
> No. BTW, upper() will never work even if glibc works fine with UTF-8. See
> the code fragment below (utils/adt/oracle_compat.c):
>
>     char       *ptr;
> :
> :
>     while (m-- > 0)
>     {
>         *ptr = toupper((unsigned char) *ptr);
>         ptr++;
>     }
>
> Apparently this is not multibyte aware...

I see. I hope someone is looking at making PostgreSQL Unicode-compatible.



> --
> Tatsuo Ishii
>
Regards,    Oleg


Re: confused with encodings

From
Peter Eisentraut
Date:
Oleg Bartunov writes:

> I thought I could specify different encodings for different databases
> and that these encodings would be used in text operations (sort, upper,
> lower), not just for conversion. But actually, only one encoding matters
> for text operations - the one specified with the 'initdb' command! Is that
> true?

Absolutely not, but you may find that in order to allow locale-dependent
operations (namely sort, which depends on LC_COLLATE, and upper and lower,
which depend on LC_CTYPE) in UTF8, you need a locale that supports that,
namely the xx_XX.utf8 kind.  So realistically, you are kind
of stuck with one encoding for the entire cluster.
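
One way to check whether a given locale matches UTF-8 is to ask the C library for the locale's character set. This is a small sketch assuming a POSIX system with nl_langinfo(); locale_uses_utf8 is a hypothetical helper name, and the exact locale names available (for example ru_RU.utf8) vary by system:

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the named locale can be selected and
 * its character set is reported as "UTF-8", 0 otherwise.
 * Side effect: on success, LC_CTYPE stays set to locale_name.
 */
int
locale_uses_utf8(const char *locale_name)
{
    const char *codeset;

    /* setlocale() returns NULL if the locale is not installed. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;

    /* CODESET names the character encoding of the current LC_CTYPE locale. */
    codeset = nl_langinfo(CODESET);
    return strcmp(codeset, "UTF-8") == 0;
}
```

Calling it with "C" reports 0 on typical systems, which is why a cluster initialized without a UTF-8 locale cannot get correct upper()/lower() results for UTF-8 data from the C library.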

-- 
Peter Eisentraut   peter_e@gmx.net