Thread: Problem with accessing Russian UTF database

Problem with accessing Russian UTF database

From

"Ronald Vyhmeister"

Date:

25 November 2008, 17:27:42

I'm having real trouble with the jdbc driver for postgres... I just
installed the latest version...

I have a database, UTF8 encoded, which has data in Russian.  I can view it
beautifully using PGAdmin3 or any other ODBC connection.

This is the connection string I'm using (from Tomcat 5.5)

String URLdb =
"jdbc:postgresql://127.0.0.1:5432/oldzautest?user=noe&password=genesis&charS
et=UNICODE";

However, when I try to retrieve data from it, I get this garbage, where the
questionmarks are the Russian text...

<data>
        <db_content>
            <row>
                    <contents content = "1"  />
                    <contents content = "1"  />
                    <contents content = "?????"  />
                    <contents content = "????????"  />
                    <contents content = "?????????"  />
                    <contents content = "1965-03-10"  />
                    <contents content = "1"  />
            </row>
        </db_content>
</data>

Interestingly enough, if I send Russian data to the database through the
jdbc driver, it writes it into the DB as a bunch of garbage... but retrieves
it beautifully as Russian text...

I've set the client_encoding to UTF8 on the server...  What am I doing
wrong?  What am I missing?  I'd be thrilled to interact privately with
someone who has solved what for now is a mystery to me.

Thanks!

Ron

Re: Problem with accessing Russian UTF database

From

Oliver Jowett

Date:

25 November 2008, 19:10:07

Ronald Vyhmeister wrote:
> I'm having real trouble with the jdbc driver for postgres... I just
> installed the latest version...
>
> I have a database, UTF8 encoded, which has data in Russian.  I can view it
> beautifully using PGAdmin3 or any other ODBC connection.

Perhaps these connections are not actually using UTF8 to interpret the
data, but some other encoding - so while they appear to write encoded
data then retrieve it OK, it's not actually what you think it is when
interpreted as UTF8?

> String URLdb =
> "jdbc:postgresql://127.0.0.1:5432/oldzautest?user=noe&password=genesis&charS
> et=UNICODE";

You should not need "charSet=UNICODE", though I don't think it'll break
anything.

> <data>
>         <db_content>
>             <row>
>                     <contents content = "1"  />
>                     <contents content = "1"  />
>                     <contents content = "?????"  />
>                     <contents content = "????????"  />
>                     <contents content = "?????????"  />
>                     <contents content = "1965-03-10"  />
>                     <contents content = "1"  />
>             </row>
>         </db_content>
> </data>

Perhaps the problem is in the encoding you are using to write out that
XML fragment? Or in whatever tool you are using to view it?

> I've set the client_encoding to UTF8 on the server...  What am I doing
> wrong?  What am I missing?  I'd be thrilled to interact privately with
> someone who has solved what for now is a mystery to me.

You shouldn't need to touch client_encoding for JDBC to work (though
other clients might need it). The JDBC driver forces client_encoding to
UTF8 anyway on connection startup.

It may be useful to examine the actual value of the characters in the
String objects you are dealing with (i.e. print out (int)s.charAt(0)
etc) to check they contain the unicode codepoints you were expecting.

In general the driver "just works" with UTF-8 encoded databases. It's
dealing in terms of Unicode strings internally, so the only transcoding
that goes on is from UTF-8 to UTF-16, which is lossless. All the
reported problems we've seen in the (recent) past with this
configuration have been either problems with non-JDBC clients getting
confused, or problems with how the resulting String was displayed to the
user, or having non-unicode garbage stored in the database in the first
place.

-O

Re: Problem with accessing Russian UTF database

From

Oliver Jowett

Date:

25 November 2008, 19:37:55

Ronald Vyhmeister wrote:

>         Locale locale = Locale.getDefault();
>         locale = new Locale("ru", "RU");

The driver ignores locale so this won't actually be doing anything.

>     SQL = "update sys_people set middle_name='фывфывафыва' where
> family_name='Pratt';";

I wouldn't rely on your JSP implementation / java compiler interpreting
that string literal in the way that you assume. I suggest you construct
your string with \uNNNN unicode escapes to be sure you're really
compiling what you think you're compiling. Your mail headers claims a
charset of "koi8-r" but I don't know what the default file encoding for
your target system is; perhaps it is using ISO-8859-1 or similar, which
might result in the above being interpreted as the accented characters
you see in PgAdmin?

Also, as I suggested earlier, try examining your strings
character-by-character to check that they really contain the codepoints
you think they contain.

-O

Re: Problem with accessing Russian UTF database

From

"Ronald Vyhmeister"

Date:

25 November 2008, 20:04:51

Ronald Vyhmeister wrote:

>         Locale locale = Locale.getDefault();
>         locale = new Locale("ru", "RU");

>The driver ignores locale so this won't actually be doing anything.

>     SQL = "update sys_people set middle_name='фывфывафыва' where
> family_name='Pratt';";

>I wouldn't rely on your JSP implementation / java compiler interpreting
>that string literal in the way that you assume. I suggest you construct
>your string with \uNNNN unicode escapes to be sure you're really
>compiling what you think you're compiling. Your mail headers claims a
>charset of "koi8-r" but I don't know what the default file encoding for
>your target system is; perhaps it is using ISO-8859-1 or similar, which
>might result in the above being interpreted as the accented characters
>you see in PgAdmin?

Thank you!  I think we're getting much closer to a solution...  but here I'll need some help...  I just added this line
justafter creating the statement: 

out.print(SQL);

and this is the text I got:

update sys_people set middle_name='testing' where family_name='Smith';
update sys_people set middle_name='ôûâôûâàôûâà' where family_name='Pratt';

This matches the text I have in the DB perfectly...  the question is now, how to get that to work right?    As for the
Unicodeescapes, how do I determine them?  And why is it that when I read the "garbage" back from the DB it shows
perfectRussian characters again?   

I've tried both jikes and the Sun javac compilers, as well as the gcj compiler, with JDK 1.6 EE  (and the symptoms are
thesame since I was working with language en_US.UTF-8 and JDK 1.5).  Supposedly they all take their parameters from the
environment(which I believe is set right)... 

For the server settings (Ubuntu 8.04 LTS),

LANG=ru_RU.UTF-8
LANGUAGE=ru
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=

>Also, as I suggested earlier, try examining your strings
>character-by-character to check that they really contain the codepoints
>you think they contain.

Right now, the string I'm entering was from the keyboard, set to Russian mode (and yes, I've tried it from Linux and
Windows,and the results are the same). 

Thanks again for all the help,

Ron

Re: Problem with accessing Russian UTF database

From

Oliver Jowett

Date:

25 November 2008, 20:22:15

Ronald Vyhmeister wrote:

> As for the Unicode escapes, how do I determine them?

The syntax of a unicode escape in Java is \uNNNN where NNNN is the hex
value of the Unicode codepoint you want to use.

See http://unicode.org/charts/ to find the particular ones you need
(e.g. Cyrillic is in http://unicode.org/charts/PDF/U0400.pdf)

>> Also, as I suggested earlier, try examining your strings
>> character-by-character to check that they really contain the codepoints
>> you think they contain.
>
> Right now, the string I'm entering was from the keyboard, set to Russian mode (and yes, I've tried it from Linux and
Windows,and the results are the same). 

What I mean is to do something like this:

> String someString = /* whatever you want to inspect */;
> char[] rawCharacters = someString.toCharArray();
> for (int i = 0; i < rawCharacters.length; ++i)
>    System.out.println("#" + i + " = " + Integer.toHexString((int)rawCharacters[i]));

so that you can see exactly what the String really contains, not
whatever the combination of your output encoding & your terminal
encoding thinks it should look like. (Java strings are UCS-2/UTF-16
internally, which is a 1:1 mapping to Unicode codepoint values most of
the time, so the above code prints out unicode codepoint values in hex)

-O

Re: Problem with accessing Russian UTF database

From

"Ronald Vyhmeister"

Date:

26 November 2008, 03:13:28

Oliver and all,

I see that there is a Java problem...  I limited my code to this:

<%@ page import="java.util.*" %>
<%

    Locale locale = Locale.getDefault();
    locale = new Locale("ru", "RU");
    Locale.setDefault(locale);
    String someString =  'фывфывафыва';
    char[] rawCharacters = someString.toCharArray();
    for (int i = 0; i < rawCharacters.length; ++i)
        System.out.println("#" + i + " = " +
Integer.toHexString((int)rawCharacters[i]));
%>

And tomcat cannot handle the string

An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp
Invalid character constant
1: <%@ page import="java.util.*" %>
2: <%
3:
4:     Locale locale = Locale.getDefault();
5:     locale = new Locale("ru", "RU");


If I replace the String statement with this:

String someString =  'ф';

The error changes to:

Type mismatch: cannot convert from char to String

So I thought I'd be smart and just define it as a char variable, but now it
says:

Invalid character constant

So I've just done them one at a time for 5 characters:

ф = f4
ы = fb
в = e2
ц = f6
л = eb

It would appear to me that whatever I'm entering is NOT Unicode... so how do
I force Windows/Linux/Mac to use Unicode for input/output?  I'm trying to
deal with an inherited app... and I'm no java guru...

Ron

Re: Problem with accessing Russian UTF database

From

"Ronald Vyhmeister"

Date:

26 November 2008, 03:33:25

Oliver and all,

I see that there is a Java problem...  I limited my code to this:

<%@ page import="java.util.*" %>
<%

    Locale locale = Locale.getDefault();
    locale = new Locale("ru", "RU");
    Locale.setDefault(locale);
    String someString =  'фывфывафыва';
    char[] rawCharacters = someString.toCharArray();
    for (int i = 0; i < rawCharacters.length; ++i)
        System.out.println("#" + i + " = " +
Integer.toHexString((int)rawCharacters[i]));
%>

And tomcat cannot handle the string

An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp
Invalid character constant
1: <%@ page import="java.util.*" %>
2: <%
3:
4:     Locale locale = Locale.getDefault();
5:     locale = new Locale("ru", "RU");


If I replace the String statement with this:

String someString =  'ф';

The error changes to:

Type mismatch: cannot convert from char to String

So I thought I'd be smart and just define it as a char variable, but now it
says:

Invalid character constant

So I've just done them one at a time for 5 characters:

ф = f4
ы = fb
в = e2
ц = f6
л = eb

It would appear to me that whatever I'm entering is NOT Unicode... so how do
I force Windows/Linux/Mac to use Unicode for input/output?  I'm trying to
deal with an inherited app... and I'm no java guru...

Ron

Re: Problem with accessing Russian UTF database

From

Guillaume Cottenceau

Date:

26 November 2008, 06:09:54

Oliver Jowett <oliver 'at' opencloud.com> writes:

> Ronald Vyhmeister wrote:
>
>> As for the Unicode escapes, how do I determine them?
>
> The syntax of a unicode escape in Java is \uNNNN where NNNN is the hex
> value of the Unicode codepoint you want to use.
>
> See http://unicode.org/charts/ to find the particular ones you need
> (e.g. Cyrillic is in http://unicode.org/charts/PDF/U0400.pdf)

Or easier yet, use native2ascii!

[gc@meuh ~] echo 'André is nice' | native2ascii
Andr\u00e9 is nice

--
Guillaume Cottenceau

Re: Problem with accessing Russian UTF database

From

Guillaume Cottenceau

Date:

26 November 2008, 06:11:18

"Ronald Vyhmeister" <rvyhmeister 'at' aiias.edu> writes:

> Oliver and all,
>
> I see that there is a Java problem...  I limited my code to this:

It's not a Java problem. It is wrong to include most non ASCII
characters in Java and relative sources.

"The Java compiler and other Java tools can only process files
that contain Latin-1 or Unicode-encoded (\udddd notation)
characters."

--
Guillaume Cottenceau

Re: Problem with accessing Russian UTF database

From

"Mikael"

Date:

26 November 2008, 07:31:08

----- Original Message -----
From: "Guillaume Cottenceau" <gc@mnc.ch>
To: "Ronald Vyhmeister" <rvyhmeister@aiias.edu>
Cc: <pgsql-jdbc@postgresql.org>
Sent: Wednesday, November 26, 2008 11:11 AM
Subject: Re: [JDBC] Problem with accessing Russian UTF database


> "Ronald Vyhmeister" <rvyhmeister 'at' aiias.edu> writes:
>
>> Oliver and all,
>>
>> I see that there is a Java problem...  I limited my code to this:
>
> It's not a Java problem. It is wrong to include most non ASCII
> characters in Java and relative sources.
>
> "The Java compiler and other Java tools can only process files
> that contain Latin-1 or Unicode-encoded (\udddd notation)
> characters."

As far as I know you can use -encoding to specify the encoding
to use for the source files, or have I got that wrong ?

Tthat would of course require that
you can change the options to the javac compiler.

>
> --
> Guillaume Cottenceau
>
> --
> Sent via pgsql-jdbc mailing list (pgsql-jdbc@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-jdbc

Re: Problem with accessing Russian UTF database

From

Oliver Jowett

Date:

26 November 2008, 08:03:35

Ronald Vyhmeister wrote:

>     String someString =  'фывфывафыва';

> An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp
> Invalid character constant

That's just because you used single quotes '' not double quotes ""

-O

Re: Problem with accessing Russian UTF database

From

Alexander Panzhin

Date:

26 November 2008, 20:51:31

Hi Ronald,

      This is the JSP problem, since you need to set the correct
encoding on the JSP.
      <%@page pageEncoding="UTF-8"%> - something like this should help.
      I have a UTF8 coded database that handles multiple
scripts(Cyrillic, Latin and Georgian) and PostgeSQL has not in 5 years
shown a problem with any store/retrieve problems with UTF8 encoding. If
you have any problems, this discussion should probably be directed to
people that handle JSP and/or Java questions.

Ronald Vyhmeister wrote:
> Oliver and all,
>
> I see that there is a Java problem...  I limited my code to this:
>
> <%@ page import="java.util.*" %>
> <%
>
>     Locale locale = Locale.getDefault();
>     locale = new Locale("ru", "RU");
>     Locale.setDefault(locale);
>     String someString =  'фывфывафыва';
>     char[] rawCharacters = someString.toCharArray();
>     for (int i = 0; i < rawCharacters.length; ++i)
>         System.out.println("#" + i + " = " +
> Integer.toHexString((int)rawCharacters[i]));
> %>
>
> And tomcat cannot handle the string
>
> An error occurred at line: 2 in the jsp file: /iutus/unicode.jsp
> Invalid character constant
> 1: <%@ page import="java.util.*" %>
> 2: <%
> 3:
> 4:     Locale locale = Locale.getDefault();
> 5:     locale = new Locale("ru", "RU");
>
>
> If I replace the String statement with this:
>
> String someString =  'ф';
>
> The error changes to:
>
> Type mismatch: cannot convert from char to String
>
> So I thought I'd be smart and just define it as a char variable, but now it
> says:
>
> Invalid character constant
>
> So I've just done them one at a time for 5 characters:
>
> ф = f4
> ы = fb
> в = e2
> ц = f6
> л = eb
>
> It would appear to me that whatever I'm entering is NOT Unicode... so how do
> I force Windows/Linux/Mac to use Unicode for input/output?  I'm trying to
> deal with an inherited app... and I'm no java guru...
>
> Ron
>
>
>

Attachment

smime.p7s