Thread: Encoding Issue with UNICODE

Encoding Issue with UNICODE

From
fritz-bayer@web.de (Fritz Bayer)
Date:
Hello,

I'm using PostgreSQL 7.2.1. According to the following output, data in
my database is encoded as Unicode. Server and client communication
seems to use Unicode as well:

woody=# select version();
version
---------------------------------------------------------------
PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4
(1 row)

woody=# select getdatabaseencoding();
getdatabaseencoding
---------------------
UNICODE
(1 row)

woody=# show client_encoding;
NOTICE:  Current client encoding is 'UNICODE'
SHOW VARIABLE

I have a Java program which writes words containing German umlauts
like äöü into the database. As you probably know, those characters
belong to the ISO-8859-1 character set.

In my Java web application those umlauts (äöü) are displayed correctly.
So they are actually stored correctly in the database.

However, when I use PostgreSQL's psql client, those characters are
displayed incorrectly.

For example the city name "münchen" gets displayed as "mÃ¼nchen". Not
so in my web application. There the city name in the HTML code appears
correctly as "münchen".

So why is psql not displaying the Unicode characters correctly? Or
could it be that my xterm cannot handle Unicode characters? But since
ü is also in LATIN1 (ISO 8859-1), I would expect that this should not be a
problem.

Can somebody help me out here? Should I create the databases as LATIN1
instead of UNICODE? And how can I convert my current databases into
LATIN1 ones? They should be compatible, because the only special
characters I use are äöü, which exist in LATIN1 as well.

Fritz

Re: Encoding Issue with UNICODE

From
"Daniel Verite"
Date:
    Fritz Bayer wrote:

> I have a Java program which writes words containing German umlauts
> like äöü into the database. As you probably know, those characters
> belong to the ISO-8859-1 character set.
>
> In my Java web application those umlauts (äöü) are displayed correctly.
> So they are actually stored correctly in the database.
>
> However, when I use PostgreSQL's psql client, those characters are
> displayed incorrectly.
>
> For example the city name "münchen" gets displayed as "mÃ¼nchen". Not
> so in my web application. There the city name in the HTML code appears
> correctly as "münchen".
>
> So why is psql not displaying the Unicode characters correctly? Or
> could it be that my xterm cannot handle Unicode characters?

From your description it really looks like the latter. You can issue
\encoding latin1
inside psql,

or you can set the PGCLIENTENCODING environment variable to latin1
before launching psql on terminals that are not Unicode-aware.
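
The garbling described above can be reproduced outside psql. What follows is only a minimal Python sketch, not anything taken from psql itself: it assumes the server sends UTF-8 bytes and the terminal interprets every byte as LATIN1. The helper name `as_seen_on_latin1_terminal` is made up for illustration.

```python
def as_seen_on_latin1_terminal(text: str) -> str:
    """Encode text as UTF-8 (what a UNICODE client encoding delivers) and
    decode the bytes as LATIN1 (what a non-Unicode terminal assumes)."""
    return text.encode("utf-8").decode("latin-1")


city = "münchen"

# UTF-8 stores ü as the two bytes C3 BC; a LATIN1 terminal shows them
# as two separate characters.
print(as_seen_on_latin1_terminal(city))  # mÃ¼nchen

# With the client encoding set to latin1 the text is transcoded first,
# so ü arrives as the single byte FC and is displayed correctly.
print(city.encode("latin-1").decode("latin-1"))  # münchen
```

Only the non-ASCII letter is affected, because ASCII characters have the same single-byte representation in UTF-8 and LATIN1.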

> Can somebody help me out here? Should I create the databases as LATIN1
> instead of UNICODE? And how can I transform my current databases into
> LATIN1 ones? They should be compatible, because all characters I use
> are only äöü, which are downward compatible.

But then you'll have trouble with your Java app if you do that. Java works with
Unicode strings, so it makes sense to have the db contents in Unicode as well.
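
Whether existing data would survive a conversion to LATIN1 depends on every stored character having a LATIN1 equivalent. A rough Python sketch of that check follows; the helper `fits_latin1` and the second sample string are hypothetical and not from this thread.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded as LATIN1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("münchen"))  # True: ü exists in LATIN1
print(fits_latin1("Łódź"))     # False: Ł and ź do not exist in LATIN1
```

A UNICODE database accepts both strings, which is one reason to keep it if the application might ever store text beyond äöü.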

--
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org


Re: Encoding Issue with UNICODE

From
"Magnus Naeslund(t)"
Date:
Fritz Bayer wrote:
> Hello,
>
> I'm using PostgreSQL 7.2.1. According to the following output, data in
> my database is encoded as Unicode. Server and client communication
> seems to use Unicode as well:
>
> woody=# select version();
> version
> ---------------------------------------------------------------
> PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4
> (1 row)
>
> woody=# select getdatabaseencoding();
> getdatabaseencoding
> ---------------------
> UNICODE
> (1 row)
>
> woody=# show client_encoding;
> NOTICE:  Current client encoding is 'UNICODE'
> SHOW VARIABLE
>
> I have a Java program which writes words containing German umlauts
> like äöü into the database. As you probably know, those characters
> belong to the ISO-8859-1 character set.
>
> In my Java web application those umlauts (äöü) are displayed correctly.
> So they are actually stored correctly in the database.
>

I know I had to set the charSet option in the connection URL to get
stuff working once:

"jdbc:postgresql://server/database?charSet=LATIN1"

Maybe that would work for UNICODE?

Regards,
Magnus




Re: Encoding Issue with UNICODE

From
fritz-bayer@web.de (Fritz Bayer)
Date:
mag@fbab.net ("Magnus Naeslund(t)") wrote in message news:<425AAE6D.6080008@fbab.net>...
> Fritz Bayer wrote:
> > Hello,
> >
> > I'm using PostgreSQL 7.2.1. According to the following output, data in
> > my database is encoded as Unicode. Server and client communication
> > seems to use Unicode as well:
> >
> > woody=# select version();
> > version
> > ---------------------------------------------------------------
> > PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4
> > (1 row)
> >
> > woody=# select getdatabaseencoding();
> > getdatabaseencoding
> > ---------------------
> > UNICODE
> > (1 row)
> >
> > woody=# show client_encoding;
> > NOTICE:  Current client encoding is 'UNICODE'
> > SHOW VARIABLE
> >
> > I have a Java program which writes words containing German umlauts
> > like äöü into the database. As you probably know, those characters
> > belong to the ISO-8859-1 character set.
> >
> > In my Java web application those umlauts (äöü) are displayed correctly.
> > So they are actually stored correctly in the database.
> >
>
> I know I had to set the charSet option in the connection URL to get
> stuff working once:
>
> "jdbc:postgresql://server/database?charSet=LATIN1"
>
> Maybe that would work for UNICODE?
>

As far as I have heard, the charSet property is ignored by the JDBC
drivers. However, somebody patched them and introduced this property.

> Regards,
> Magnus

Re: Encoding Issue with UNICODE

From
fritz-bayer@web.de (Fritz Bayer)
Date:
daniel@manitou-mail.org ("Daniel Verite") wrote in message news:<20050411035003.3592776@localhost>...
> Fritz Bayer wrote:
>
> > I have a Java program which writes words containing German umlauts
> > like äöü into the database. As you probably know, those characters
> > belong to the ISO-8859-1 character set.
> >
> > In my Java web application those umlauts (äöü) are displayed correctly.
> > So they are actually stored correctly in the database.
> >
> > However, when I use PostgreSQL's psql client, those characters are
> > displayed incorrectly.
> >
> > For example the city name "münchen" gets displayed as "mÃ¼nchen". Not
> > so in my web application. There the city name in the HTML code appears
> > correctly as "münchen".
> >
> > So why is psql not displaying the Unicode characters correctly? Or
> > could it be that my xterm cannot handle Unicode characters?
>
> From your description it really looks like the latter. You can issue
> \encoding latin1
> inside psql
>

Thanks for your help. Now I understand. It's true, somehow my terminal
does not handle Unicode characters.

After I entered "\encoding latin1" as you suggested, everything works
fine. So the answer is that without it, the raw Unicode-encoded
characters get displayed.

But in which encoding? I guess UTF-8 or UTF-16...

But why does that fail only for äüö? Shouldn't any other letter
encoded in UTF-16 also fail?

I mean Unicode itself is 16 bits long. So "münchen" should expand to 14
characters. But only ü expands to two characters.

> or you can also set the PGCLIENTENCODING environment variable to latin1
> before launching psql on non-unicode aware terminals.
>
> > Can somebody help me out here? Should I create the databases as LATIN1
> > instead of UNICODE? And how can I transform my current databases into
> > LATIN1 ones? They should be compatible, because all characters I use
> > are only äöü, which are downward compatible.
>
> But then you'll have trouble with your java app if you do that. Java works with
> unicode strings, so it makes sense to have the db contents in unicode as well.

No, that's OK. Java communicates with PostgreSQL using Unicode only. That's
why it also worked...

Re: Encoding Issue with UNICODE

From
John DeSoi
Date:
On Apr 12, 2005, at 6:39 AM, Fritz Bayer wrote:

> But in which encoding? I guess UTF-8 or UTF-16...
>
> But why does that fail only for äüö? Shouldn't any other letter
> encoded in UTF-16 also fail?
>
> I mean Unicode itself is 16 bits long. So "münchen" should expand to 14
> characters. But only ü expands to two characters.


PostgreSQL only supports utf-8. There has been discussion of using a
label other than "unicode" to make this more apparent.
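
The difference between the two guesses is easy to see by counting bytes. This is just an illustrative Python sketch, with variable names chosen here, comparing the UTF-8 and UTF-16 forms of the city name from the thread:

```python
city = "münchen"

utf8 = city.encode("utf-8")
utf16 = city.encode("utf-16-be")  # big-endian, without a byte order mark

print(len(city))   # 7 characters
print(len(utf8))   # 8 bytes: only ü needs two bytes
print(len(utf16))  # 14 bytes: each of these characters needs two bytes

# Byte-by-byte view of the UTF-8 form; ü is the pair c3 bc.
print(" ".join(f"{b:02x}" for b in utf8))  # 6d c3 bc 6e 63 68 65 6e
```

So the single extra character seen in psql matches UTF-8, while UTF-16 would have produced the 14 units Fritz expected.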


John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


Re: Encoding Issue with UNICODE

From
Stephane Bortzmeyer
Date:
On Tue, Apr 12, 2005 at 03:39:45AM -0700,
 Fritz Bayer <fritz-bayer@web.de> wrote
 a message of 53 lines which said:

> I mean unicode itself is 16 bit long.

This is completely false. Unicode itself is just a table and, since it
contains more than 100,000 characters, you cannot index them with 16
bits.

Unicode has various encodings, some fixed-size, like UTF-32, some not.

> So "münchen" should expand to 14 characters. But only ü expands to
> two characters.

Perfectly normal with UTF-8, where the size of a Unicode character
is not fixed.
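
As a small illustration of that variable size, here is a Python sketch; the sample characters are picked here and are not from the thread. It prints how many bytes UTF-8 uses for characters from different ranges of the Unicode table.

```python
samples = ["a", "ü", "€", "😀"]

for ch in samples:
    code_point = ord(ch)
    width = len(ch.encode("utf-8"))
    print(f"U+{code_point:04X} takes {width} byte(s) in UTF-8")

# Code points above U+FFFF cannot be indexed with 16 bits at all.
print(ord("😀") > 0xFFFF)  # True
```

The widths come out as 1, 2, 3 and 4 bytes respectively, which is why plain ASCII letters look fine on a LATIN1 terminal and only characters such as ü are garbled.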