Thread: Unicode support problem

Unicode support problem

From
"Jatinder Sangha"
Date:
Hi all,

I'm having a problem with unicode support in postgres under linux.

The issue is that I am copying lots of data from an MS SQL Server
database via java/jdbc running on Windows XP over to a postgres database
running on linux. I've setup the postgres database as follows:

LANG=C
initdb  -E UNICODE
createdb -E UNICODE

Then, during the transfer, when my program tries to send a string
containing the character 0xF6 (Latin small letter o with diaeresis),
I get a JDBC exception and the following error in the server log:
ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.

I have tried setting locale/lc_ctype to C, POSIX, iso_8859_1, all kinds
of things, and nothing seems to fix it.


If I setup the database as follows:
LANG=C
initdb -E iso8859_1
createdb -E iso8859_1

Then it appears to work OK - but I then get an error with character 0xE2
(Latin small letter a with circumflex):
ERROR:  could not convert UTF-8 character 0x00e2 to ISO8859-1

Does anyone know how to do this correctly?

This is my environment:
LINUX: DEBIAN 3.0, KERNEL 2.4 running on a 2CPU PC.
Postgres: 8.0.1 built from source, no changes to anything, running on
the linux box.
JDBC driver: postgresql-8.0-310.jdbc3.jar
Java JVM (Sun) 1.4.2_02 on Windows XP SP2.


If I run postgres on the Windows XP machine (configured for UNICODE as
above), then I don't have any errors at all. This only happens on the
linux box.

Any help in fixing this would be greatly appreciated.
Thanks,
--Jatinder Sangha
Coalition Development


Re: Unicode support problem

From
Tom Lane
Date:
"Jatinder Sangha" <js@coalitiondev.com> writes:
> I've setup the postgres database as follows:

> LANG=C
> initdb  -E UNICODE
> createdb -E UNICODE

> I have tried setting locale/lc_ctype to C, POSIX, iso_8859_1, all kinds
> of things, and nothing seems to fix it.

You can't just pick random combinations of locale and database encoding.
Any given locale setting implies a character set encoding, and you have
to use that same encoding as the database encoding; at least if you want
encoding-dependent operations such as upper()/lower() to work.  The
locale you want for Unicode (UTF8) may be named something like
"en_US.utf8".  Try "locale -a" to get a list of supported locales.

            regards, tom lane
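
A minimal sketch of narrowing "locale -a" output down to UTF-8
candidates, along the lines suggested above. The sample list is made up
for illustration and not taken from any real machine, and the helper
names codeset and utf8_locales are hypothetical. It assumes locale names
follow the common language[_TERRITORY][.codeset][@modifier] pattern.

```python
import re


def codeset(locale_name):
    """Return the normalized codeset part of a locale name, or None."""
    # Assumed pattern: language[_TERRITORY][.codeset][@modifier]
    match = re.match(r"^[^.@]+\.([^@]+)", locale_name)
    if not match:
        return None  # names such as "C" or "en_US" carry no explicit codeset
    # Lower-case and drop punctuation so "UTF-8" and "utf8" compare equal.
    return re.sub(r"[^a-z0-9]", "", match.group(1).lower())


def utf8_locales(names):
    """Keep only the locale names whose codeset is UTF-8."""
    return [name for name in names if codeset(name) == "utf8"]


# Illustrative stand-in for the lines printed by "locale -a".
sample = ["C", "POSIX", "en_US", "en_US.iso88591",
          "en_US.utf8", "de_DE.UTF-8", "de_DE@euro"]

candidates = utf8_locales(sample)
print(candidates)
```

On a real system the names would come from running "locale -a"; any
locale that survives the filter is a candidate to pair with a UNICODE
(UTF8) database encoding.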

Re: Unicode support problem

From
Tatsuo Ishii
Date:
> If I setup the database as follows:
> LANG=C
> initdb -E iso8859_1
> createdb -E iso8859_1
>
> Then it appears to work OK - but I then get an error with character 0xE2
> (Latin small letter a with circumflex):
> ERROR:  could not convert UTF-8 character 0x00e2 to ISO8859-1

The error message says it all. You are trying to convert a UTF-8
character whose encoding starts with byte 0xe2 to ISO-8859-1, and no
such character exists in ISO-8859-1. Every ISO-8859-1 character is
encoded in UTF-8 with a first byte below 0xe0. Perhaps your data
contains ISO-8859-2 characters, or characters from some other set
that is not ISO-8859-1?
--
Tatsuo Ishii
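
The byte-range argument above can be checked with a short sketch. It
uses only Python's built-in codecs. The choice of U+2019 (right single
quotation mark) as the example character is an assumption made for
illustration; the thread does not say which character was actually in
the data.

```python
# Collect the first UTF-8 byte of every ISO-8859-1 code point (U+0000..U+00FF).
lead_bytes = {chr(cp).encode("utf-8")[0] for cp in range(0x100)}
max_lead = max(lead_bytes)
print(hex(max_lead))  # 0xc3, so no ISO-8859-1 character starts with 0xe2

# "a with circumflex" (U+00E2) is two bytes in UTF-8, starting with 0xc3.
a_circumflex_utf8 = "\u00e2".encode("utf-8")
print(a_circumflex_utf8.hex())  # c3a2

# Assumed example: U+2019 is one character whose UTF-8 form starts with 0xe2.
quote = "\u2019"
quote_utf8 = quote.encode("utf-8")
print(quote_utf8.hex())  # e28099

# It has no ISO-8859-1 equivalent, so the conversion fails.
try:
    quote.encode("iso-8859-1")
    convertible = True
except UnicodeEncodeError:
    convertible = False
print(convertible)  # False
```

A UTF-8 sequence that starts with 0xe2 is three bytes long and encodes a
code point well above U+00FF, which is why it cannot be stored in an
ISO-8859-1 database.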

Re: Unicode support problem

From
"Jatinder Sangha"
Date:
Hi Tom,

Thanks for the reply. Yes, creating the en_US.utf8 locale and using
it fixed all of my problems.

Thanks,
--Jatinder

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 24 February 2005 17:11
To: Jatinder Sangha
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Unicode support problem