Thread: Latin2 and Unicode problems

Latin2 and Unicode problems

From
Grzegorz Mucha
Date:
I still haven't got rid of the problem with the ISO 8859-2 charset and
Postgres. While I can use the database with a correct Polish locale (by
using only --enable-locale), it is still impossible (in 7.1) to use Unicode
encoding with Polish characters.

I have tried every possible combination of compile/init parameters and it
doesn't work either - there are problems with the upper/lower functions and
collation order.

Any other suggestions?

Re: Latin2 and Unicode problems

From
Grzegorz Mucha
Date:
Well, I didn't state it correctly, then. When I use iso8859-2, Postgres is
compiled with --enable-locale only (though compiling it with UNICODE
support, then creating database with ISO encoding works fine - sorting,
upper/lower case conversion).

However, when creating a db with Unicode, no matter whether I use
ISO8859-2 or Unicode client encoding, the db doesn't get it right - functions
such as upper/lower give unpredictable results (the result of
select upper('some-polish-chars') even contains some three-byte Unicode
chars). I quite frequently get the following message:

utf_to_latin: could not convert UTF-8 (0xc3a3) ignored
(the Unicode char code varies...)

--
Grzegorz Mucha <mucher@tigana.pl> ICQ #91619595, tel.(502)261417
----------------------------------------------------------------
Quidquid id est, timeo Danaos et dona ferentes.
                    Wergiliusz, "Eneida"

Re: Re: Latin2 and Unicode problems

From
Tatsuo Ishii
Date:
> Well, I didn't state it correctly, then. When I use iso8859-2, Postgres is
> compiled with --enable-locale only (though compiling it with UNICODE
> support, then creating database with ISO encoding works fine - sorting,
> upper/lower case conversion).

I'm confused. Did you enable the locale support only?

Then why do you see the following errors:

> However, when creating a db with Unicode, no matter whether I use
> ISO8859-2 or Unicode client encoding, the db doesn't get it right - functions
> such as upper/lower give unpredictable results (the result of
> select upper('some-polish-chars') even contains some three-byte Unicode
> chars). I quite frequently get the following message:
>
> utf_to_latin: could not convert UTF-8 (0xc3a3) ignored
> (the Unicode char code varies...)

This kind of error message should appear only when the Unicode
support is enabled. So I assume both locale support AND Unicode support
are enabled...

That's because the locale support (--enable-locale) does not take the
Unicode support into account. (That's not the locale support's fault,
since it was developed before the Unicode support appeared.) When you
create a Unicode database, everything is represented in the UTF-8
encoding. However, the locale support thinks that it is ISO 8859-2 (in
your case) and tries to do the case conversion using the ISO 8859-2
locale. As a result, you see invalid UTF-8 sequences.
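
The mechanism described above can be sketched in a few lines. This is only
an illustration in Python, not PostgreSQL code: the helper names
latin2_toupper and bytewise_upper are hypothetical, and Python's iso8859_2
codec is assumed to stand in for what C's toupper() would do under an
ISO 8859-2 locale. It applies a single-byte case conversion to UTF-8 data,
which is presumably how a sequence like the 0xc3a3 in the error message
comes about.

```python
def latin2_toupper(byte: int) -> int:
    """Hypothetical stand-in for C toupper() under an ISO 8859-2 locale."""
    ch = bytes([byte]).decode("iso8859_2")
    try:
        up = ch.upper().encode("iso8859_2")
    except UnicodeEncodeError:
        return byte  # uppercase form not in ISO 8859-2: leave the byte alone
    # Keep the byte unchanged if uppercasing expands it (e.g. 'ß' -> 'SS').
    return up[0] if len(up) == 1 else byte


def bytewise_upper(data: bytes) -> bytes:
    """Uppercase each byte separately, ignoring that the data is UTF-8."""
    return bytes(latin2_toupper(b) for b in data)


utf8 = "ó".encode("utf-8")       # two bytes: 0xc3 0xb3
mangled = bytewise_upper(utf8)   # 0xb3 is 'ł' in ISO 8859-2, so it becomes 0xa3 ('Ł')
print(mangled.hex())             # c3a3
```

Here 0xc3a3 happens to be valid UTF-8 for 'ã', a character that ISO 8859-2
does not contain, so converting the result back for an ISO 8859-2 client
fails. Other inputs can produce byte sequences that are not valid UTF-8 at
all.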

Does it match your situation?
--
Tatsuo Ishii

Re: Re: Latin2 and Unicode problems

From
Grzegorz Mucha
Date:
> I'm confused. Did you enable the locale support only?

Sorry then. Well, I tested two cases:
- pg compiled with only locale enabled, and iso8859-2: works
- pg compiled with locale and Unicode enabled: Unicode conversion doesn't
work as it should.

> That's because the locale support (--enable-locale) does not take the
> Unicode support into account. (That's not the locale support's fault,
> since it was developed before the Unicode support appeared.) When you
> create a Unicode database, everything is represented in the UTF-8
> encoding. However, the locale support thinks that it is ISO 8859-2 (in
> your case) and tries to do the case conversion using the ISO 8859-2
> locale. As a result, you see invalid UTF-8 sequences.
>
> Does it match your situation?

Actually, that may be it. I stopped getting the messages after compiling
without locale support, but with Unicode. But sorting and case conversion
still don't work. The only option I can think of would
be to somehow set the system locale to pl_PL.UTF-8 (I don't even know if
such a locale exists). Please let me know if there is another way to do it.
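
Whether a given system offers a pl_PL.UTF-8 locale can be checked with a
short script. The sketch below is an illustration only: the helper name
locale_available is hypothetical, and it merely asks the C library whether
the locale name is accepted; it says nothing about whether the PostgreSQL
backend would then sort or case-convert UTF-8 data correctly.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if the C library accepts this locale name for collation."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_COLLATE, saved)


print(locale_available("pl_PL.UTF-8"))
```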


Re: Re: Latin2 and Unicode problems

From
Tatsuo Ishii
Date:
> Sorry then. Well, I tested two cases:
> - pg compiled with only locale enabled, and iso8859-2: works
> - pg compiled with locale and Unicode enabled: Unicode conversion doesn't
> work as it should.
>
> > That's because the locale support (--enable-locale) does not take the
> > Unicode support into account. (That's not the locale support's fault,
> > since it was developed before the Unicode support appeared.) When you
> > create a Unicode database, everything is represented in the UTF-8
> > encoding. However, the locale support thinks that it is ISO 8859-2 (in
> > your case) and tries to do the case conversion using the ISO 8859-2
> > locale. As a result, you see invalid UTF-8 sequences.
> >
> > Does it match your situation?
>
> Actually, that may be it. I stopped getting the messages after compiling
> without locale support, but with Unicode. But sorting and case conversion
> still don't work. The only option I can think of would
> be to somehow set the system locale to pl_PL.UTF-8 (I don't even know if
> such a locale exists). Please let me know if there is another way to do it.

I understand your problem. Another way I could think of is modifying
the PostgreSQL backend so that it converts UTF-8 to ISO 8859-2 before calling
strcoll(), toupper() or tolower(). This might be terribly slow,
though.
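
The idea can be sketched as follows. This is an illustration in Python
under stated assumptions, not a patch for the backend: the names
latin2_toupper and upper_via_latin2 are hypothetical, and Python's
iso8859_2 codec is assumed to stand in for C's toupper() under an
ISO 8859-2 locale.

```python
def latin2_toupper(byte: int) -> int:
    """Hypothetical stand-in for C toupper() under an ISO 8859-2 locale."""
    ch = bytes([byte]).decode("iso8859_2")
    try:
        up = ch.upper().encode("iso8859_2")
    except UnicodeEncodeError:
        return byte  # uppercase form not in ISO 8859-2: leave the byte alone
    # Keep the byte unchanged if uppercasing expands it (e.g. 'ß' -> 'SS').
    return up[0] if len(up) == 1 else byte


def upper_via_latin2(utf8: bytes) -> bytes:
    """Convert UTF-8 to ISO 8859-2, uppercase per byte, convert back."""
    latin2 = utf8.decode("utf-8").encode("iso8859_2")   # UTF-8 -> ISO 8859-2
    upper = bytes(latin2_toupper(b) for b in latin2)    # single-byte case map
    return upper.decode("iso8859_2").encode("utf-8")    # ISO 8859-2 -> UTF-8


print(upper_via_latin2("żółw".encode("utf-8")).decode("utf-8"))  # ŻÓŁW
```

Every call pays for two encoding conversions on top of the case mapping,
which is where the slowdown would come from. In this sketch, input
containing characters outside ISO 8859-2 raises UnicodeEncodeError.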

BTW, if you use only ISO 8859-2, then why do you need to store data as
UTF-8 in the database?
--
Tatsuo Ishii