Thread: Unicode confusion

Unicode confusion

From
"Chris Palmer"
Date:
Hello,

As you can see, at:

http://www.nodewarrior.org/chris/unicode.png

I am very confused. My database is set to Unicode encoding:

> psql -p 9000 -l
               List of databases
   Name    |         Owner          | Encoding
-----------+------------------------+-----------
 japanese  | GENEEDINC+chris.palmer | EUC_JP
 template0 | GENEEDINC+chris.palmer | SQL_ASCII
 template1 | GENEEDINC+chris.palmer | SQL_ASCII
 test      | GENEEDINC+chris.palmer | SQL_ASCII
 unicode   | GENEEDINC+chris.palmer | UNICODE

I'm using Pg 7.3.2 and pg73jdbc3.jar.

According to *The Java Programming Language, Third Edition* (p. 138), "...you can use the escape sequence \uxxxx to
encode Unicode characters, where each x is a hexadecimal digit...". Therefore, shouldn't I see "262f 0b87" in the hex
editor? It seems I'm not getting the same stuff out that I am putting in. psql is not much help; it just shows wacky
characters (4 of them: "â¯à®").

Am I doing something wrong? Does something need to be set in the database or in the JDBC Connection object? Or am I
just a confused monkey?

Thanks in advance for any clues!


--
Chris Palmer    Systems Programmer    GeneEd


Re: Unicode confusion

From
Ian Barwick
Date:
On Saturday 10 May 2003 01:47, Chris Palmer wrote:
> Hello,
(...)
> According to *The Java Programming Language, Third Edition* (p. 138),
> "...you can use the escape sequence \uxxxx to encode Unicode characters,
> where each x is a hexadecimal digit...". Therefore, shouldn't I see "262f
> 0b87" in the hex editor? It seems I'm not getting the same stuff out that I
> am putting in. psql is not much help; it just shows wacky characters (4 of
> them: "â¯à®").
>
> Am I doing something wrong? Does something need to be set in the database
> or in the JDBC Connection object? Or am I just a confused monkey?

If it's any help, your code should work as expected. The hex data you see
(3F3F0A) is two question marks and an \n; I would guess Java is not able to
display the unicode characters in your environment and is replacing them with
'?'.
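
A minimal sketch of that substitution, assuming a modern JDK (java.nio.charset.StandardCharsets did not exist in 2003) and assuming the output charset in question behaves like US-ASCII, which the thread does not confirm. The class and method names are hypothetical. Encoding the test string into a charset that cannot represent it yields exactly the bytes seen in the hex editor:

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Render bytes as lowercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the two test characters plus a newline as US-ASCII.
    // Characters the charset cannot map are replaced with '?' (0x3f).
    static String asciiHex() {
        String s = "\u262f\u0b87\n";
        return toHex(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(asciiHex()); // prints 3f3f0a
    }
}
```

String.getBytes(Charset) replaces unmappable characters silently rather than throwing, which is why no error appears at the point where the data is lost.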

PostgreSQL stores Unicode internally as UTF-8, so if you view the
data with psql in a non-unicode-environment, you will probably be
seeing the UTF-8 byte values expressed in whatever 8 bit characters
your terminal uses.
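
To illustrate, here is a small sketch (hypothetical class name, modern JDK assumed, and ISO-8859-1 assumed as the terminal's 8-bit character set) that encodes the two test characters as UTF-8 and then misreads those bytes one byte per character:

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Render bytes as lowercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode UTF-8 bytes as if they were ISO-8859-1,
    // the way a Latin-1 terminal would interpret them.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "\u262f\u0b87";
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8))); // prints e298afe0ae87
        System.out.println(misread(s).length());                       // prints 6
    }
}
```

Of the six misread characters, two (0x98 and 0x87) are invisible control codes in ISO-8859-1, which would leave four visible ones: â, ¯, à and ®.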

Ian Barwick
barwick@gmx.net


Re: Unicode confusion

From
"Chris Palmer"
Date:
Ian Barwick writes:

> If it's any help, your code should work as expected. The hex
> data you see (3F3F0A) is two question marks and an \n; I would
> guess Java is not able to display the unicode characters in your
> environment and is replacing them with '?'.

What part of my environment are you referring to? It's not the terminal emulator (which Java has no knowledge of). Java
does Unicode (in fact, there is no other choice). Is there some locale setting I can use? Is there a parameter I can use
with the Pg JDBC driver or Connection object?

Thanks for your response.


--
Chris Palmer    Systems Programmer    GeneEd


Re: Unicode confusion

From
Ian Barwick
Date:
On Monday 12 May 2003 20:49, Chris Palmer wrote:
> Ian Barwick writes:
> > If it's any help, your code should work as expected. The hex
> > data you see (3F3F0A) is two question marks and an \n; I would
> > guess Java is not able to display the unicode characters in your
> > environment and is replacing them with '?'.
>
> What part of my environment are you referring to? It's not the terminal
> emulator (which Java has no knowledge of). Java does Unicode (in fact,
> there is no other choice). Is there some locale setting I can use? Is there
> a parameter I can use with the Pg JDBC driver or Connection object?

OK, put it another way: Java is not able to or does not want to print
the specified Unicode characters (the Yin / Yang symbol and something
squiggly IIRC) to your STDOUT. This is nothing to do with the JDBC connection.
I presume Java looks at your locale setting, maybe Google knows the
answer ;-).
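
One way to check what the JVM has picked up is to print its default charset. This is only a sketch with a hypothetical class name; how the default is derived differs between JDK versions and platforms (older JVMs typically take it from the locale, and recent ones may default to UTF-8 regardless), so treat the output as a diagnostic, not a guarantee:

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Report the charset the JVM uses when none is given explicitly,
    // together with the file.encoding system property (may be null).
    static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported charset cannot represent the characters being printed, a System.out built on it will substitute '?' for them.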

Using a UTF-8 capable terminal (I use mlterm or konsole, no idea what options
there are in Windows) the characters retrieved from Postgres and which are now
in Java's internal Unicode encoding (which one I don't recall) can be
displayed by converting them into UTF-8.


Ian Barwick
barwick@gmx.net


Re: Unicode confusion

From
"Chris Palmer"
Date:
Ian Barwick writes:

> OK, put it another way: Java is not able to or does not want to print
> the specified Unicode characters (the Yin / Yang symbol and something
> squiggly IIRC) to your STDOUT.

In the example I gave in my first post, stdout was a file. Shouldn't Java just write out bytes without trying to get
smart? Especially if stdout is not a tty? But see below.

> This is nothing to do with the
> JDBC connection.
> I presume Java looks at your locale setting, maybe Google knows the
> answer ;-).

Yes, I'm searching now.

> Using a UTF-8 capable terminal (I use mlterm or konsole, no
> idea what options
> there are in Windows) the characters retrieved from Postgres
> and which are now
> in Java's internal Unicode encoding (which one I don't recall) can be
> displayed by converting them into UTF-8.

===
ps = new PrintStream(System.out, true, "UTF-8");
...
// this line might look strange to you if your mailer shows it differently than mine does:
s.executeUpdate("INSERT INTO test (chug) VALUES ('¤ä´©¬O¬°¤FÅý')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('testing')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('\u262f\u0b87')");
...
ps.println(rs.getString("chug"));
===

I'm no Java expert, so if that's not a good way to get UTF-8-encoded output, please let me know. When I try it, I get:

===
> java Noodle > goo
> cat goo
¤ä´©¬O¬°¤FÃ
ý
testing
â¯à®
===

I installed KDE on our Linux machine (the one running Java and Pg) and got similar results using konsole. (Fwiw I
am using PuTTY on Windows to connect to Linux).

===
¤ä´©¬O¬°¤FÃý
testing
â¯à®
===

Note the lack of the newline in the middle of the first result.

In either case, konsole or PuTTY, I am not getting back what I put in (the first s.executeUpdate(...), above).

In psql, the result of "select * from test" looks the same as it does when output by the Noodle Java program.

Fwiw, I do have the encoding of this database set to UNICODE:

===
> psql -p 9000 -l
               List of databases
   Name    |         Owner          | Encoding
-----------+------------------------+-----------
 japanese  | GENEEDINC+chris.palmer | EUC_JP
 template0 | GENEEDINC+chris.palmer | SQL_ASCII
 template1 | GENEEDINC+chris.palmer | SQL_ASCII
 test      | GENEEDINC+chris.palmer | SQL_ASCII
 unicode   | GENEEDINC+chris.palmer | UNICODE
===


I am much more confused now than I ever have been. :)


--
Chris Palmer    Systems Programmer    GeneEd


Re: Unicode confusion

From
Ian Barwick
Date:
On Tuesday 13 May 2003 00:35, Chris Palmer wrote:
(...)
> ===
> ps = new PrintStream(System.out, true, "UTF-8");
> ...
> // this line might look strange to you if your mailer shows it differently
> // than mine does:
> s.executeUpdate("INSERT INTO test (chug) VALUES ('¤ä´©¬O¬°¤FÅý')");
> s.executeUpdate("INSERT INTO test (chug) VALUES ('testing')");
> s.executeUpdate("INSERT INTO test (chug) VALUES ('\u262f\u0b87')");
> ...
> ps.println(rs.getString("chug"));
> ===
>
> I'm no Java expert, so if that's not a good way to get UTF-8-encoded
> output, please let me know. When I try it, I get:
>
> ===
>
> > java Noodle > goo
> > cat goo
>
> ¤ä´©¬O¬°¤FÃ
> ý
> testing
> â¯à®
> ===
>
> I installed KDE on our Linux machine (the one running Java and Pg) and got
> the similar results using konsole. (Fwiw I am using PuTTY on Windows to
> connect to Linux).
>
> ===
> ¤ä´©¬O¬°¤FÃý
> testing
> â¯à®
> ===
>
> Note the lack of the newline in the middle of the first result.
>
> In either case, konsole or PuTTY, I am not getting back what I put in (the
> first s.executeUpdate(...), above).

Err, yes you are. Just encoded differently (UTF-8 vs. whatever Java
uses, I would guess UCS2 or UTF16). The bytes are now getting dumped to the
display, just the display does not know that they are UTF-8. Before starting
konsole you may need to set your locale. (No idea whether putty is Unicode
capable).
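
The "same characters, different bytes" point can be sketched as follows, assuming a modern JDK; the class name is hypothetical. The "262f 0b87" expected in the first post is what the UTF-16 (big-endian) encoding looks like, while UTF-8 encodes the same two characters as six different bytes, and decoding those bytes as UTF-8 gives back the original string:

```java
import java.nio.charset.StandardCharsets;

class EncodingCompareDemo {
    // Render bytes as lowercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode to UTF-8 and decode again; a lossless round trip
    // returns a string equal to the input.
    static String roundTripUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u262f\u0b87";
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // prints 262f0b87
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));    // prints e298afe0ae87
        System.out.println(roundTripUtf8(s).equals(s));                   // prints true
    }
}
```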

> In psql, the result of "select * from test" looks the same as it does when
> output by the Noodle Java program.
>
> Fwiw, I do have the encoding of this database set to UNICODE:

This is expected behaviour. Have you looked to see what encoding
Postgres uses to store Unicode?

Anyway, the obvious question is: have you tried printing the strings
you are currently passing through Postgres directly?
( ps.println("\u262f\u0b87"); ?) Do they appear any differently?
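
A sketch of that check, with the database taken out of the picture entirely. The class and method names are hypothetical, and a ByteArrayOutputStream stands in for stdout so the bytes can be inspected directly instead of being interpreted by a terminal:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class DirectPrintDemo {
    // Print a string through a UTF-8 PrintStream into a buffer
    // and return the resulting bytes as lowercase hex.
    static String printedHex(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            PrintStream ps = new PrintStream(buf, true, "UTF-8");
            ps.print(s); // print, not println, to keep the line separator out of the result
            ps.flush();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : buf.toByteArray()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(printedHex("\u262f\u0b87")); // prints e298afe0ae87
    }
}
```

If the bytes written this way match the bytes that come back via rs.getString(...), the data is surviving the trip through Postgres and only the display is misinterpreting it.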


Ian Barwick
barwick@gmx.net