Thread: Unicode confusion
Hello,

As you can see, at:

http://www.nodewarrior.org/chris/unicode.png

I am very confused. My database is set to Unicode encoding:

> psql -p 9000 -l
        List of databases
   Name    |         Owner          | Encoding
-----------+------------------------+-----------
 japanese  | GENEEDINC+chris.palmer | EUC_JP
 template0 | GENEEDINC+chris.palmer | SQL_ASCII
 template1 | GENEEDINC+chris.palmer | SQL_ASCII
 test      | GENEEDINC+chris.palmer | SQL_ASCII
 unicode   | GENEEDINC+chris.palmer | UNICODE

I'm using Pg 7.3.2 and pg73jdbc3.jar.

According to *The Java Programming Language, Third Edition* (p. 138), "...you can use the escape sequence \uxxxx to encode Unicode characters, where each x is a hexadecimal digit...". Therefore, shouldn't I see "262f 0b87" in the hex editor? It seems I'm not getting the same stuff out that I am putting in. psql is not much help; it just shows wacky characters (4 of them: "â¯à®").

Am I doing something wrong? Does something need to be set in the database or in the JDBC Connection object? Or am I just a confused monkey?

Thanks in advance for any clues!

--
Chris Palmer
Systems Programmer
GeneEd
On Saturday 10 May 2003 01:47, Chris Palmer wrote:
> Hello,
(...)
> According to *The Java Programming Language, Third Edition* (p. 138),
> "...you can use the escape sequence \uxxxx to encode Unicode characters,
> where each x is a hexadecimal digit...". Therefore, shouldn't I see "262f
> 0b87" in the hex editor? It seems I'm not getting the same stuff out that I
> am putting in. psql is not much help; it just shows wacky characters (4 of
> them: "â¯à®").
>
> Am I doing something wrong? Does something need to be set in the database
> or in the JDBC Connection object? Or am I just a confused monkey?

If it's any help, your code should work as expected. The hex data you see (3F3F0A) is two question marks and a \n; I would guess Java is not able to display the Unicode characters in your environment and is replacing them with '?'.

PostgreSQL stores Unicode internally as UTF-8, so if you view the data with psql in a non-Unicode environment, you will probably be seeing the UTF-8 byte values expressed in whatever 8-bit characters your terminal uses.

Ian Barwick
barwick@gmx.net
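The '?' substitution described above can be reproduced without a database. The following is a minimal sketch, not the poster's program: it assumes the output encoding in effect was something ASCII-like that cannot represent U+262F or U+0B87, and it uses US-ASCII to stand in for it. The class name QuestionMarkDemo and the helper hexOf are made up for illustration.

```java
import java.nio.charset.Charset;

class QuestionMarkDemo {
    // Encode text in the named charset and return the bytes as upper-case hex.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' (0x3F) for US-ASCII.
    static String hexOf(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The two characters from the thread, followed by a newline.
        String text = "\u262f\u0b87\n";
        System.out.println(hexOf(text, "US-ASCII")); // 3F3F0A
        System.out.println(hexOf(text, "UTF-8"));    // E298AFE0AE870A
    }
}
```

The first result matches the 3F3F0A seen in the hex editor. The second shows the UTF-8 bytes one would expect if the output stream were encoding as UTF-8; "262f 0b87" would only appear byte-for-byte in a big-endian UTF-16 dump.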
Ian Barwick writes:
> If it's any help, your code should work as expected. The hex
> data you see (3F3F0A) is two question marks and an \n; I would
> guess Java is not able to display the unicode characters in your
> environment and is replacing them with '?'.

What part of my environment are you referring to? It's not the terminal emulator (which Java has no knowledge of). Java does Unicode (in fact, there is no other choice). Is there some locale setting I can use? Is there a parameter I can use with the Pg JDBC driver or Connection object?

Thanks for your response.

--
Chris Palmer
Systems Programmer
GeneEd
On Monday 12 May 2003 20:49, Chris Palmer wrote:
> Ian Barwick writes:
> > If it's any help, your code should work as expected. The hex
> > data you see (3F3F0A) is two question marks and an \n; I would
> > guess Java is not able to display the unicode characters in your
> > environment and is replacing them with '?'.
>
> What part of my environment are you referring to? It's not the terminal
> emulator (which Java has no knowledge of). Java does Unicode (in fact,
> there is no other choice). Is there some locale setting I can use? Is there
> a parameter I can use with the Pg JDBC driver or Connection object?

OK, put it another way: Java is not able to or does not want to print the specified Unicode characters (the Yin / Yang symbol and something squiggly IIRC) to your STDOUT. This has nothing to do with the JDBC connection.

I presume Java looks at your locale setting; maybe Google knows the answer ;-).

Using a UTF-8 capable terminal (I use mlterm or konsole, no idea what options there are in Windows), the characters retrieved from Postgres, which are now in Java's internal Unicode encoding (which one I don't recall), can be displayed by converting them into UTF-8.

Ian Barwick
barwick@gmx.net
Ian Barwick writes:
> OK, put it another way: Java is not able to or does not want to print
> the specified Unicode characters (the Yin / Yang symbol and something
> squiggly IIRC) to your STDOUT.

In the example I gave in my first post, stdout was a file. Shouldn't Java just write out bytes without trying to get smart? Especially if stdout is not a tty? But see below.

> This is nothing to do with the
> JDBC connection.
> I presume Java looks at your locale setting, maybe Google knows the
> answer ;-).

Yes, I'm searching now.

> Using a UTF-8 capable terminal (I use mlterm or konsole, no
> idea what options
> there are in Windows) the characters retrieved from Postgres
> and which are now
> in Java's internal Unicode encoding (which one I don't recall) can be
> displayed by converting them into UTF-8.

===
ps = new PrintStream(System.out, true, "UTF-8");
...
// this line might look strange to you if your mailer shows it differently than mine does:
s.executeUpdate("INSERT INTO test (chug) VALUES ('¤ä´©¬O¬°¤FÅý')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('testing')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('\u262f\u0b87')");
...
ps.println(rs.getString("chug"));
===

I'm no Java expert, so if that's not a good way to get UTF-8-encoded output, please let me know. When I try it, I get:

===
> java Noodle > goo
> cat goo
¤ä´©¬O¬°¤Fà
ý
testing
â¯à®
===

I installed KDE on our Linux machine (the one running Java and Pg) and got similar results using konsole. (Fwiw I am using PuTTY on Windows to connect to Linux.)

===
¤ä´©¬O¬°¤FÃý
testing
â¯à®
===

Note the lack of the newline in the middle of the first result.

In either case, konsole or PuTTY, I am not getting back what I put in (the first s.executeUpdate(...), above).

In psql, the result of "select * from test" looks the same as it does when output by the Noodle Java program.

Fwiw, I do have the encoding of this database set to UNICODE:

===
> psql -p 9000 -l
        List of databases
   Name    |         Owner          | Encoding
-----------+------------------------+-----------
 japanese  | GENEEDINC+chris.palmer | EUC_JP
 template0 | GENEEDINC+chris.palmer | SQL_ASCII
 template1 | GENEEDINC+chris.palmer | SQL_ASCII
 test      | GENEEDINC+chris.palmer | SQL_ASCII
 unicode   | GENEEDINC+chris.palmer | UNICODE
===

I am much more confused now than I ever have been. :)

--
Chris Palmer
Systems Programmer
GeneEd
On Tuesday 13 May 2003 00:35, Chris Palmer wrote:
(...)
> ===
> ps = new PrintStream(System.out, true, "UTF-8");
> ...
> // this line might look strange to you if your mailer shows it differently
> // than mine does:
> s.executeUpdate("INSERT INTO test (chug) VALUES ('¤ä´©¬O¬°¤FÅý')");
> s.executeUpdate("INSERT INTO test (chug) VALUES ('testing')");
> s.executeUpdate("INSERT INTO test (chug) VALUES ('\u262f\u0b87')");
> ...
> ps.println(rs.getString("chug"));
> ===
>
> I'm no Java expert, so if that's not a good way to get UTF-8-encoded
> output, please let me know. When I try it, I get:
>
> ===
> > java Noodle > goo
> > cat goo
> ¤ä´©¬O¬°¤Fà
> ý
> testing
> â¯à®
> ===
>
> I installed KDE on our Linux machine (the one running Java and Pg) and got
> the similar results using konsole. (Fwiw I am using PuTTY on Windows to
> connect to Linux).
>
> ===
> ¤ä´©¬O¬°¤FÃý
> testing
> â¯à®
> ===
>
> Note the lack of the newline in the middle of the first result.
>
> In either case, konsole or PuTTY, I am not getting back what I put in (the
> first s.executeUpdate(...), above).

Err, yes you are. It is just encoded differently (UTF-8 vs. whatever Java uses internally; I would guess UCS-2 or UTF-16). The bytes are now getting dumped to the display, but the display does not know that they are UTF-8. Before starting konsole you may need to set your locale. (No idea whether PuTTY is Unicode capable.)

> In psql, the result of "select * from test" looks the same as it does when
> output by the Noodle Java program.
>
> Fwiw, I do have the encoding of this database set to UNICODE:

This is expected behaviour. Have you looked to see what encoding Postgres uses to store Unicode?

Anyway, the obvious question is: have you tried printing the strings you are currently passing through Postgres directly? ( ps.println("\u262f\u0b87"); ?) Do they appear any differently?

Ian Barwick
barwick@gmx.net
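The point that the bytes are correct and only the display misreads them can be checked in isolation. This is a rough sketch under one assumption not stated in the thread: that the terminal interprets incoming bytes as ISO-8859-1 (a Windows code page would give slightly different glyphs). The class name MojibakeDemo and both helper methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode as UTF-8, then decode those bytes as if they were ISO-8859-1,
    // which is what a non-UTF-8 display effectively does.
    static String misreadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    // Undo the misreading: recover the original bytes and decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u262f\u0b87";
        String garbled = misreadAsLatin1(original);

        // Six characters come out; two of them are C1 control codes.
        System.out.println(garbled.length());
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();

        // The round trip is lossless, so nothing was corrupted in storage.
        System.out.println(repair(garbled).equals(original));
    }
}
```

Under that assumption the six code points are U+00E2, U+0098, U+00AF, U+00E0, U+00AE and U+0087. The second and sixth are control codes with no visible glyph, which is consistent with the four visible characters "â¯à®" reported from psql and from the Java program.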