Thread: Problem with accessing Russian UTF database
I'm having real trouble with the jdbc driver for postgres... I just installed the latest version... I have a database, UTF8 encoded, which has data in Russian. I can view it beautifully using PGAdmin3 or any other ODBC connection. This is the connection string I'm using (from Tomcat 5.5) String URLdb = "jdbc:postgresql://127.0.0.1:5432/oldzautest?user=noe&password=genesis&charS et=UNICODE"; However, when I try to retrieve data from it, I get this garbage, where the questionmarks are the Russian text... <data> <db_content> <row> <contents content = "1" /> <contents content = "1" /> <contents content = "?????" /> <contents content = "????????" /> <contents content = "?????????" /> <contents content = "1965-03-10" /> <contents content = "1" /> </row> </db_content> </data> Interestingly enough, if I send Russian data to the database through the jdbc driver, it writes it into the DB as a bunch of garbage... but retrieves it beautifully as Russian text... I've set the client_encoding to UTF8 on the server... What am I doing wrong? What am I missing? I'd be thrilled to interact privately with someone who has solved what for now is a mystery to me. Thanks! Ron
Ronald Vyhmeister wrote: > I'm having real trouble with the jdbc driver for postgres... I just > installed the latest version... > > I have a database, UTF8 encoded, which has data in Russian. I can view it > beautifully using PGAdmin3 or any other ODBC connection. Perhaps these connections are not actually using UTF8 to interpret the data, but some other encoding - so while they appear to write encoded data then retrieve it OK, it's not actually what you think it is when interpreted as UTF8? > String URLdb = > "jdbc:postgresql://127.0.0.1:5432/oldzautest?user=noe&password=genesis&charS > et=UNICODE"; You should not need "charSet=UNICODE", though I don't think it'll break anything. > <data> > <db_content> > <row> > <contents content = "1" /> > <contents content = "1" /> > <contents content = "?????" /> > <contents content = "????????" /> > <contents content = "?????????" /> > <contents content = "1965-03-10" /> > <contents content = "1" /> > </row> > </db_content> > </data> Perhaps the problem is in the encoding you are using to write out that XML fragment? Or in whatever tool you are using to view it? > I've set the client_encoding to UTF8 on the server... What am I doing > wrong? What am I missing? I'd be thrilled to interact privately with > someone who has solved what for now is a mystery to me. You shouldn't need to touch client_encoding for JDBC to work (though other clients might need it). The JDBC driver forces client_encoding to UTF8 anyway on connection startup. It may be useful to examine the actual value of the characters in the String objects you are dealing with (i.e. print out (int)s.charAt(0) etc) to check they contain the unicode codepoints you were expecting. In general the driver "just works" with UTF-8 encoded databases. It's dealing in terms of Unicode strings internally, so the only transcoding that goes on is from UTF-8 to UTF-16, which is lossless. All the reported problems we've seen in the (recent) past with this configuration have been either problems with non-JDBC clients getting confused, or problems with how the resulting String was displayed to the user, or having non-unicode garbage stored in the database in the first place. -O
Ronald Vyhmeister wrote: > Locale locale = Locale.getDefault(); > locale = new Locale("ru", "RU"); The driver ignores locale so this won't actually be doing anything. > SQL = "update sys_people set middle_name='фывфывафыва' where > family_name='Pratt';"; I wouldn't rely on your JSP implementation / java compiler interpreting that string literal in the way that you assume. I suggest you construct your string with \uNNNN unicode escapes to be sure you're really compiling what you think you're compiling. Your mail headers claims a charset of "koi8-r" but I don't know what the default file encoding for your target system is; perhaps it is using ISO-8859-1 or similar, which might result in the above being interpreted as the accented characters you see in PgAdmin? Also, as I suggested earlier, try examining your strings character-by-character to check that they really contain the codepoints you think they contain. -O
Ronald Vyhmeister wrote: > Locale locale = Locale.getDefault(); > locale = new Locale("ru", "RU"); >The driver ignores locale so this won't actually be doing anything. > SQL = "update sys_people set middle_name='фывфывафыва' where > family_name='Pratt';"; >I wouldn't rely on your JSP implementation / java compiler interpreting >that string literal in the way that you assume. I suggest you construct >your string with \uNNNN unicode escapes to be sure you're really >compiling what you think you're compiling. Your mail headers claims a >charset of "koi8-r" but I don't know what the default file encoding for >your target system is; perhaps it is using ISO-8859-1 or similar, which >might result in the above being interpreted as the accented characters >you see in PgAdmin? Thank you! I think we're getting much closer to a solution... but here I'll need some help... I just added this line justafter creating the statement: out.print(SQL); and this is the text I got: update sys_people set middle_name='testing' where family_name='Smith'; update sys_people set middle_name='ôûâôûâàôûâà' where family_name='Pratt'; This matches the text I have in the DB perfectly... the question is now, how to get that to work right? As for the Unicodeescapes, how do I determine them? And why is it that when I read the "garbage" back from the DB it shows perfectRussian characters again? I've tried both jikes and the Sun javac compilers, as well as the gcj compiler, with JDK 1.6 EE (and the symptoms are thesame since I was working with language en_US.UTF-8 and JDK 1.5). Supposedly they all take their parameters from the environment(which I believe is set right)... For the server settings (Ubuntu 8.04 LTS), LANG=ru_RU.UTF-8 LANGUAGE=ru LC_CTYPE="ru_RU.UTF-8" LC_NUMERIC="ru_RU.UTF-8" LC_TIME="ru_RU.UTF-8" LC_COLLATE="ru_RU.UTF-8" LC_MONETARY="ru_RU.UTF-8" LC_MESSAGES="ru_RU.UTF-8" LC_PAPER="ru_RU.UTF-8" LC_NAME="ru_RU.UTF-8" LC_ADDRESS="ru_RU.UTF-8" LC_TELEPHONE="ru_RU.UTF-8" LC_MEASUREMENT="ru_RU.UTF-8" LC_IDENTIFICATION="ru_RU.UTF-8" LC_ALL= >Also, as I suggested earlier, try examining your strings >character-by-character to check that they really contain the codepoints >you think they contain. Right now, the string I'm entering was from the keyboard, set to Russian mode (and yes, I've tried it from Linux and Windows,and the results are the same). Thanks again for all the help, Ron
Ronald Vyhmeister wrote: > As for the Unicode escapes, how do I determine them? The syntax of a unicode escape in Java is \uNNNN where NNNN is the hex value of the Unicode codepoint you want to use. See http://unicode.org/charts/ to find the particular ones you need (e.g. Cyrillic is in http://unicode.org/charts/PDF/U0400.pdf) >> Also, as I suggested earlier, try examining your strings >> character-by-character to check that they really contain the codepoints >> you think they contain. > > Right now, the string I'm entering was from the keyboard, set to Russian mode (and yes, I've tried it from Linux and Windows,and the results are the same). What I mean is to do something like this: > String someString = /* whatever you want to inspect */; > char[] rawCharacters = someString.toCharArray(); > for (int i = 0; i < rawCharacters.length; ++i) > System.out.println("#" + i + " = " + Integer.toHexString((int)rawCharacters[i])); so that you can see exactly what the String really contains, not whatever the combination of your output encoding & your terminal encoding thinks it should look like. (Java strings are UCS-2/UTF-16 internally, which is a 1:1 mapping to Unicode codepoint values most of the time, so the above code prints out unicode codepoint values in hex) -O
Oliver and all, I see that there is a Java problem... I limited my code to this: <%@ page import="java.util.*" %> <% Locale locale = Locale.getDefault(); locale = new Locale("ru", "RU"); Locale.setDefault(locale); String someString = 'фывфывафыва'; char[] rawCharacters = someString.toCharArray(); for (int i = 0; i < rawCharacters.length; ++i) System.out.println("#" + i + " = " + Integer.toHexString((int)rawCharacters[i])); %> And tomcat cannot handle the string An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp Invalid character constant 1: <%@ page import="java.util.*" %> 2: <% 3: 4: Locale locale = Locale.getDefault(); 5: locale = new Locale("ru", "RU"); If I replace the String statement with this: String someString = 'ф'; The error changes to: Type mismatch: cannot convert from char to String So I thought I'd be smart and just define it as a char variable, but now it says: Invalid character constant So I've just done them one at a time for 5 characters: ф = f4 ы = fb в = e2 ц = f6 л = eb It would appear to me that whatever I'm entering is NOT Unicode... so how do I force Windows/Linux/Mac to use Unicode for input/output? I'm trying to deal with an inherited app... and I'm no java guru... Ron
Oliver and all, I see that there is a Java problem... I limited my code to this: <%@ page import="java.util.*" %> <% Locale locale = Locale.getDefault(); locale = new Locale("ru", "RU"); Locale.setDefault(locale); String someString = 'фывфывафыва'; char[] rawCharacters = someString.toCharArray(); for (int i = 0; i < rawCharacters.length; ++i) System.out.println("#" + i + " = " + Integer.toHexString((int)rawCharacters[i])); %> And tomcat cannot handle the string An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp Invalid character constant 1: <%@ page import="java.util.*" %> 2: <% 3: 4: Locale locale = Locale.getDefault(); 5: locale = new Locale("ru", "RU"); If I replace the String statement with this: String someString = 'ф'; The error changes to: Type mismatch: cannot convert from char to String So I thought I'd be smart and just define it as a char variable, but now it says: Invalid character constant So I've just done them one at a time for 5 characters: ф = f4 ы = fb в = e2 ц = f6 л = eb It would appear to me that whatever I'm entering is NOT Unicode... so how do I force Windows/Linux/Mac to use Unicode for input/output? I'm trying to deal with an inherited app... and I'm no java guru... Ron
Oliver Jowett <oliver 'at' opencloud.com> writes: > Ronald Vyhmeister wrote: > >> As for the Unicode escapes, how do I determine them? > > The syntax of a unicode escape in Java is \uNNNN where NNNN is the hex > value of the Unicode codepoint you want to use. > > See http://unicode.org/charts/ to find the particular ones you need > (e.g. Cyrillic is in http://unicode.org/charts/PDF/U0400.pdf) Or easier yet, use native2ascii! [gc@meuh ~] echo 'André is nice' | native2ascii Andr\u00e9 is nice -- Guillaume Cottenceau
"Ronald Vyhmeister" <rvyhmeister 'at' aiias.edu> writes: > Oliver and all, > > I see that there is a Java problem... I limited my code to this: It's not a Java problem. It is wrong to include most non ASCII characters in Java and relative sources. "The Java compiler and other Java tools can only process files that contain Latin-1 or Unicode-encoded (\udddd notation) characters." -- Guillaume Cottenceau
----- Original Message ----- From: "Guillaume Cottenceau" <gc@mnc.ch> To: "Ronald Vyhmeister" <rvyhmeister@aiias.edu> Cc: <pgsql-jdbc@postgresql.org> Sent: Wednesday, November 26, 2008 11:11 AM Subject: Re: [JDBC] Problem with accessing Russian UTF database > "Ronald Vyhmeister" <rvyhmeister 'at' aiias.edu> writes: > >> Oliver and all, >> >> I see that there is a Java problem... I limited my code to this: > > It's not a Java problem. It is wrong to include most non ASCII > characters in Java and relative sources. > > "The Java compiler and other Java tools can only process files > that contain Latin-1 or Unicode-encoded (\udddd notation) > characters." As far as I know you can use -encoding to specify the encoding to use for the source files, or have I got that wrong ? Tthat would of course require that you can change the options to the javac compiler. > > -- > Guillaume Cottenceau > > -- > Sent via pgsql-jdbc mailing list (pgsql-jdbc@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-jdbc
Ronald Vyhmeister wrote: > String someString = 'фывфывафыва'; > An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp > Invalid character constant That's just because you used single quotes '' not double quotes "" -O
Hi Ronald, This is the JSP problem, since you need to set the correct encoding on the JSP. <%@page pageEncoding="UTF-8"%> - something like this should help. I have a UTF8 coded database that handles multiple scripts(Cyrillic, Latin and Georgian) and PostgeSQL has not in 5 years shown a problem with any store/retrieve problems with UTF8 encoding. If you have any problems, this discussion should probably be directed to people that handle JSP and/or Java questions. Ronald Vyhmeister wrote: > Oliver and all, > > I see that there is a Java problem... I limited my code to this: > > <%@ page import="java.util.*" %> > <% > > Locale locale = Locale.getDefault(); > locale = new Locale("ru", "RU"); > Locale.setDefault(locale); > String someString = 'фывфывафыва'; > char[] rawCharacters = someString.toCharArray(); > for (int i = 0; i < rawCharacters.length; ++i) > System.out.println("#" + i + " = " + > Integer.toHexString((int)rawCharacters[i])); > %> > > And tomcat cannot handle the string > > An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp > Invalid character constant > 1: <%@ page import="java.util.*" %> > 2: <% > 3: > 4: Locale locale = Locale.getDefault(); > 5: locale = new Locale("ru", "RU"); > > > If I replace the String statement with this: > > String someString = 'ф'; > > The error changes to: > > Type mismatch: cannot convert from char to String > > So I thought I'd be smart and just define it as a char variable, but now it > says: > > Invalid character constant > > So I've just done them one at a time for 5 characters: > > ф = f4 > ы = fb > в = e2 > ц = f6 > л = eb > > It would appear to me that whatever I'm entering is NOT Unicode... so how do > I force Windows/Linux/Mac to use Unicode for input/output? I'm trying to > deal with an inherited app... and I'm no java guru... > > Ron > > >