Thread: Character Decoding Problems

Character Decoding Problems

From

Evan Tsue

Date:

12 August 2003, 02:38:32

Hi,

    I've been having problems decoding non-Latin characters using the
Postgres JDBC driver.  Here's the situation:  I'm using postgres 7.3.2
and I've created a test database using 'createdb -E UNICODE testdb' to
ensure that I really am using the UNICODE character set.  Using psql, I
created a table using the following command: 'CREATE TABLE messages
(message_uid SERIAL PRIMARY KEY, message_text VARCHAR(255))' to test
character encoding and decoding.  At that point, I inserted a message
that was in English.  I also inserted a message that was in Arabic.  I
did a select on that table using psql and the values came back
perfectly (I'm using MacOS X, so the characters are displayed
correctly).
    Next, I did a select on the same table via JDBC.  All I had the
program do was select on the table and print the results out to
standard output.  The message in English was displayed perfectly.
However, the message that was in Arabic was displayed as a series of
question marks and spaces.
    I eventually navigated my way through the JDBC driver source to find
that the problem is in the decodeUTF8 method in the
org.postgresql.core.Encoding class.  Apparently, it doesn't seem to be
working properly for non-Western characters.  I replaced the call to
that method with a call to the java.lang.String constructor and now
everything works perfectly.
    In addition to Arabic, I took a random sample of Chinese, Japanese,
Russian and Korean text and inserted it into the database.  Using the
original driver, I get the question marks.  But, when I used the String
constructor, everything comes out fine.
    Could someone please either fix the Encoding.decodeUTF8 method or
replace the call to that with a call to the String constructor?

Thanks,
Evan

Re: Character Decoding Problems

From

Date:

12 August 2003, 09:38:36

I can insert and retrieve Chinese text with PostgreSQL 7.2.2 successfully, with both operations going through JDBC.
It seems you inserted the text using psql and retrieved it using JDBC.


Re: Character Decoding Problems

From

Evan Tsue

Date:

12 August 2003, 11:19:54

Yes, it should work in 7.2.2.  The decodeUTF8 method wasn't introduced
until later.  From the comments in the code, it seems that it was added
for performance reasons.

Evan


Re: Character Decoding Problems

From

Barry Lind

Date:

12 August 2003, 18:38:48

Evan,

Can you provide a test case that demonstrates your problem?  Many people
are using the driver successfully with non-English characters, so I
don't think the problem is as you describe it.

thanks,
--Barry


Re: Character Decoding Problems

From

Evan Tsue

Date:

12 August 2003, 21:43:55

Barry,
Sure.  I'm not certain what the best way to do this is, but if this is
not sufficient, then we can try something else.  Here's the schema for
the table I created:

testdb=# \d messages
                                          Table "public.messages"
    Column    |          Type          |                             Modifiers
--------------+------------------------+-------------------------------------------------------------------
 message_uid  | integer                | not null default nextval('public.messages_message_uid_seq'::text)
 message_text | character varying(255) |
Indexes: messages_pkey primary key btree (message_uid)

I used this command to create that table:

CREATE TABLE messages (message_uid SERIAL PRIMARY KEY, message_text
VARCHAR(255));

The next thing I did from psql was this insert statement:

INSERT INTO messages (message_text) VALUES ('يرجى ادخال النص المراد ترجمته');

I hope that the Arabic text I have in there comes out right for you.
If not, let me know.

So, if I do a SELECT * FROM messages; in psql, everything comes out
fine.  Now, here's the Java code that I used to access this data:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;

public class LanguageTest {
    public static void main(String[] args) {
        Connection conn = null;
        PreparedStatement ps = null;
        ResultSet rs = null;

        // Load the PostgreSQL driver.
        try {
            Class.forName("org.postgresql.Driver");
        } catch (ClassNotFoundException e) {
            System.err.println("Unable to find the PostgreSQL JDBC driver.");
            System.exit(1);
        }

        try {
            // Set the connection properties.
            Properties info = new Properties();
            info.put("user", "test");
            info.put("password", "test");

            // Create a new connection.
            conn = DriverManager.getConnection(
                    "jdbc:postgresql://127.0.0.1:5432/testdb", info);

            // Prepare the SQL statement.
            ps = conn.prepareStatement("SELECT * FROM messages");

            // Execute the query.
            rs = ps.executeQuery();

            // Iterate through the results (works with a forward-only ResultSet).
            while (rs.next()) {
                int messageId = rs.getInt(1);
                String message = rs.getString(2);
                System.out.println("UID:" + messageId + " Message: " + message);
            }
        } catch (SQLException ex) {
            ex.printStackTrace();
        } finally {
            // Close the resources, skipping any that were never opened.
            try {
                if (rs != null) rs.close();
                if (ps != null) ps.close();
                if (conn != null) conn.close();
            } catch (SQLException e1) {
                // Ignored: nothing useful can be done if closing fails.
            }
        }
    }
}

Let me know what you think.

Evan



Re: Character Decoding Problems

From

"zy7111"

Date:

12 August 2003, 22:28:41

I use pg73jdbc3.jar as JDBC driver. It works fine.


Re: Character Decoding Problems

From

Evan Tsue

Date:

13 August 2003, 00:50:37

Ok, I've sat down with the problem a little bit more.  It now seems to
me that the decodeUTF8 method is doing the decoding correctly.  It places
the result of translating from UTF-8 to UTF-16 in the char[] l_cdata
variable.  It then creates a new String by calling

    new String(l_cdata, 0, j)

I believe that the variable j is the length of the filled-in portion of
the l_cdata array.  l_cdata is a class variable that is reused between
method calls (the decodeUTF8 method is synchronized).

This seems to be the problem.  I haven't figured out why yet.  I also
have the same problem when running on FreeBSD (using the FreeBSD 1.4 JVM).

Evan
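
For readers following along, the approach described in this message (decode UTF-8 into a reused char[] buffer, then build a String from the filled portion) can be sketched roughly as below.  This is not the driver's source: the class name Utf8DecodeSketch, the method decodeUtf8 and the buffer name cdata are made up for illustration, the loop handles only 1- to 3-byte sequences, and it assumes well-formed input.  It uses StandardCharsets, which needs a much newer JVM than the 1.4 one mentioned in the thread.

```java
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
    // Reusable output buffer (hypothetical stand-in for the l_cdata field
    // described in the message; not copied from the driver).
    private static char[] cdata = new char[64];

    // Decodes 1- to 3-byte UTF-8 sequences. Assumes well-formed input and
    // does not handle 4-byte sequences (characters outside the BMP).
    static synchronized String decodeUtf8(byte[] data, int offset, int length) {
        // A decoded string never has more chars than the input has bytes.
        if (cdata.length < length) {
            cdata = new char[length];
        }
        int i = offset;
        int end = offset + length;
        int j = 0; // number of chars written so far
        while (i < end) {
            int b = data[i++] & 0xFF;
            if (b < 0x80) {
                // Single byte: plain ASCII.
                cdata[j++] = (char) b;
            } else if (b < 0xE0) {
                // Two bytes: 110xxxxx 10xxxxxx.
                cdata[j++] = (char) (((b & 0x1F) << 6) | (data[i++] & 0x3F));
            } else {
                // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
                int b2 = data[i++] & 0x3F;
                int b3 = data[i++] & 0x3F;
                cdata[j++] = (char) (((b & 0x0F) << 12) | (b2 << 6) | b3);
            }
        }
        // Only the first j chars of the shared buffer belong to this call.
        return new String(cdata, 0, j);
    }

    public static void main(String[] args) {
        String original = "\u064A\u0631\u062C\u0649"; // a short Arabic word
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String viaLoop = decodeUtf8(utf8, 0, utf8.length);
        String viaConstructor = new String(utf8, StandardCharsets.UTF_8);

        // Compare the strings themselves rather than how a terminal shows them.
        System.out.println("loop matches original:        " + viaLoop.equals(original));
        System.out.println("constructor matches original: " + viaConstructor.equals(original));
    }
}
```

Comparing with equals() rather than by eye keeps the check independent of whatever encoding the console happens to use.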



Re: Character Decoding Problems

From

Evan Tsue

Date:

13 August 2003, 14:12:05

Ok, I think I've figured out the problem.  I retract my statement that
the decodeUTF8 method is incorrectly implemented.

I'm still not exactly sure what the problem is.  When I do a
getBytes("UTF16") on the string I get back from the JDBC query,
everything looks ok.  However, when I do getBytes() it seems to default
to some other encoding.  Does anyone know what the deal is with this?

The issue that still remains is why does the new String(...) constructor
work for me whereas the decodeUTF8 method does not?

Btw, thanks for everybody's help so far.

Evan
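
One possible explanation for the remaining question, offered as a guess and not confirmed anywhere in this thread: if the replacement call was new String(bytes) with no charset, and the JVM default is a single-byte encoding, then each UTF-8 byte becomes one char on the way in and the same byte again on the way out, so a UTF-8 terminal still shows the right text.  A correctly decoded String, by contrast, cannot be represented in that default encoding and comes out as question marks.  The sketch below illustrates this under the assumption that the default is ISO-8859-1; the class RoundTripSketch and the helper encodeLatin1 are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripSketch {
    // Encodes with ISO-8859-1, standing in for an assumed single-byte default
    // encoding. Characters it cannot represent are replaced with '?'.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String arabic = "\u064A\u0631\u062C\u0649";
        byte[] utf8 = arabic.getBytes(StandardCharsets.UTF_8);

        // Case 1: bytes decoded correctly, then written in the wrong encoding.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        byte[] lossy = encodeLatin1(decoded);
        System.out.println(new String(lossy, StandardCharsets.US_ASCII)); // prints ????

        // Case 2: bytes decoded with the wrong encoding, then written with
        // the same wrong encoding. The two mistakes cancel out.
        String misdecoded = new String(utf8, StandardCharsets.ISO_8859_1);
        byte[] roundTripped = encodeLatin1(misdecoded);
        System.out.println(Arrays.equals(roundTripped, utf8)); // prints true
    }
}
```

If this is what is happening, the String produced in case 2 only looks right on output; its length and its individual characters are wrong inside the program.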


Re: Character Decoding Problems

From

Barry Lind

Date:

13 August 2003, 15:26:05

Evan,

A call to getBytes() without specifying a character set will use the
default encoding for the JVM.  I think how the JVM determines its
default encoding is platform dependent.  In my environments the default
JVM encoding is LATIN1.

thanks,
--Barry
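
A small sketch of the point above, for anyone who wants to check their own environment: it prints the JVM default charset, compares getBytes() with an explicit getBytes(UTF_8), and writes text through a PrintStream whose encoding is stated rather than inherited.  The class DefaultEncodingSketch and the helper printAs are illustrative names only, and Charset.defaultCharset() and StandardCharsets need a newer JVM than the 1.4 one discussed in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultEncodingSketch {
    // Writes text through a PrintStream that uses the named encoding and
    // returns the bytes that reached the underlying stream.
    static byte[] printAs(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String message = "\u064A\u0631\u062C\u0649"; // a short Arabic word

        // What getBytes() with no argument will use on this machine.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // The two lengths differ whenever the default is not UTF-8.
        System.out.println("getBytes() length:      " + message.getBytes().length);
        System.out.println("getBytes(UTF_8) length: "
                + message.getBytes(StandardCharsets.UTF_8).length);

        // Bytes produced when the output encoding is chosen explicitly.
        System.out.println("UTF-8 bytes written:      " + printAs(message, "UTF-8").length);
        System.out.println("ISO-8859-1 bytes written: " + printAs(message, "ISO-8859-1").length);

        // Wrapping System.out makes the console encoding explicit as well.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(message);
    }
}
```

Whether the last line displays correctly still depends on the terminal itself expecting UTF-8.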


Evan Tsue wrote:
> Ok, I think I've figured out the problem.  I retract my statement that
> the decodeUTF8
> method is incorrectly implemented.
>
> I'm still not exactly sure what the problem is.  When I do a
> getBytes("UTF16")
> on the string I get back from the JDBC query, everything looks ok.
> However,
> when I do getBytes() it seems to default to some other encoding.  Does
> anyone
> know what the deal is with this?
>
> The issue that still remains is why does the new String(...) method work
> for
> me whereas the decodeUTF8 method does not?
>
> Btw, thanks for everybody's help so far.
>
> Evan
>
> On Tuesday, Aug 12, 2003, at 23:50 US/Eastern, Evan Tsue wrote:
>
>> Ok,  I've sat down with the problem a little bit more.  It now seems
>> to me that
>> the decodeUTF8 method is doing the encoding correctly.  It places the
>> result from translating from UTF-8 to UTF-16 in the char[] l_cdata
>> variable.
>> It then creates a new String by calling
>>
>>     new String(l_cdata, 0, j)
>>
>> I believe that the variable j is the length of the filled in portion
>> of the l_cdata
>> array.  l_cdata is a class variable that is reused between method calls
>> (the decodeUTF8 method is synchronized).
>>
>> This seems to be the problem.  I haven't figured out why yet.  I also
>> have the
>> same problem when running on FreeBSD (using the FreeBSD 1.4 JVM).
>>
>> Evan
>>
>>
>> On Tuesday, Aug 12, 2003, at 21:28 US/Eastern, zy7111 wrote:
>>
>>> I use pg73jdbc3.jar as JDBC driver. It works fine.
>>>
>>>> Yes, it should work in 7.2.2.  The decodeUTF8 method wasn't introduced
>>>> until later.  From the comments in the code, it seems that the reason
>>>> for its inclusion was for performance.
>>>>
>>>> Evan
>>>>
>>>> On Tuesday, Aug 12, 2003, at 08:34 US/Eastern, <zy7111@mail.china.com>
>>>> wrote:
>>>>
>>>>> I can insert and retrieve chinese into postgresql 7.2.2 successfully.
>>>>> Both operation through JDBC.
>>>>> It seems you insert text using psql and retrieve using JDBC.
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Evan Tsue" <evan@windsormgmt.com>
>>>>> To: <pgsql-jdbc@postgresql.org>
>>>>> Sent: Tuesday, August 12, 2003 1:38 PM
>>>>> Subject: [JDBC] Character Decoding Problems
>>>>>
>>>>>

Re: Character Decoding Problems

From

Date:

14 August 2003, 02:31:24

What is your client_encoding when you insert data into the database using psql?

----- Original Message ----- 
From: "Evan Tsue" <evan@windsormgmt.com>
To: <pgsql-jdbc@postgresql.org>
Sent: Thursday, August 14, 2003 1:11 AM
Subject: Re: [JDBC] Character Decoding Problems


Ok, I think I've figured out the problem.  I retract my statement that
the decodeUTF8 method is incorrectly implemented.

I'm still not exactly sure what the problem is.  When I do a
getBytes("UTF16") on the string I get back from the JDBC query,
everything looks ok.  However, when I do getBytes() it seems to default
to some other encoding.  Does anyone know what the deal is with this?
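
For reference, the no-argument getBytes() encodes with the JVM's platform
default charset, and characters that charset cannot represent are replaced
(with '?' for charsets such as US-ASCII).  Printing to standard output goes
through the same default encoding, which would be consistent with the
question marks reported earlier, although nothing in this thread confirms
that as the cause.  The sketch below is only an illustration: the class
name DefaultCharsetDemo and the helper encode are made up for this example,
and it uses java.nio.charset.StandardCharsets, which is newer than the 1.4
JVMs discussed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // Hypothetical helper: encode with an explicit charset so the result
    // does not depend on the platform default.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        // Four Arabic letters (seen, lam, alef, meem), written as escapes so
        // the source file encoding does not matter.
        String arabic = "\u0633\u0644\u0627\u0645";

        // getBytes() with no argument uses the platform default charset.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes():      " + Arrays.toString(arabic.getBytes()));

        // US-ASCII cannot represent Arabic, so every char becomes '?' (63).
        System.out.println("US-ASCII:        "
                + Arrays.toString(encode(arabic, StandardCharsets.US_ASCII)));

        // UTF-8 can represent every char, so the round trip is lossless.
        byte[] utf8 = encode(arabic, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip ok: "
                + arabic.equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

If the default charset printed on the first line is not a Unicode encoding,
passing an explicit charset to getBytes (or wrapping System.out in a
PrintStream with an explicit encoding) avoids the substitution.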

The issue that still remains is why the new String(...) constructor
works for me whereas the decodeUTF8 method does not.
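
One way to narrow this down is to run both decoding paths on the same bytes
outside the driver and compare the results.  The sketch below is not the
driver's code.  It is a hand-written decoder in the style described in the
quoted message further down (a reused char[] buffer, then a new String
built from its filled-in portion), and the names Utf8DecodeSketch, decode
and buffer are hypothetical.  It assumes well-formed UTF-8 and handles only
1- to 3-byte sequences.

```java
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
    // Hypothetical reusable buffer, shared between calls (hence synchronized).
    private static char[] buffer = new char[64];

    // Decodes well-formed UTF-8 sequences of 1 to 3 bytes by hand.
    // 4-byte sequences (outside the Basic Multilingual Plane) are rejected.
    static synchronized String decode(byte[] data, int offset, int length) {
        if (buffer.length < length) {
            // Decoded chars never outnumber input bytes.
            buffer = new char[length];
        }
        int i = offset;
        int end = offset + length;
        int j = 0; // number of chars filled in so far
        while (i < end) {
            int b = data[i++] & 0xFF;
            if (b < 0x80) {
                // 1 byte: 0xxxxxxx
                buffer[j++] = (char) b;
            } else if (b >= 0xC0 && b < 0xE0) {
                // 2 bytes: 110xxxxx 10xxxxxx
                buffer[j++] = (char) (((b & 0x1F) << 6) | (data[i++] & 0x3F));
            } else if (b >= 0xE0 && b < 0xF0) {
                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                int b2 = data[i++] & 0x3F;
                int b3 = data[i++] & 0x3F;
                buffer[j++] = (char) (((b & 0x0F) << 12) | (b2 << 6) | b3);
            } else {
                throw new IllegalArgumentException("unsupported UTF-8 lead byte: " + b);
            }
        }
        // Only the first j chars of the shared buffer belong to this result.
        return new String(buffer, 0, j);
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",
            "\u0633\u0644\u0627\u0645",             // Arabic
            "\u4e2d\u6587",                         // Chinese
            "\u043f\u0440\u0438\u0432\u0435\u0442"  // Russian
        };
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            String manual = decode(utf8, 0, utf8.length);
            String builtin = new String(utf8, StandardCharsets.UTF_8);
            System.out.println(manual.equals(builtin) && manual.equals(s));
        }
    }
}
```

If both paths agree on the same input bytes, as they should for well-formed
UTF-8, the difference seen in the driver more likely comes from what bytes
arrive or how the result is printed than from the decoding arithmetic.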

Btw, thanks for everybody's help so far.

Evan

On Tuesday, Aug 12, 2003, at 23:50 US/Eastern, Evan Tsue wrote:

> Ok, I've sat down with the problem a little bit more.  It now seems to
> me that the decodeUTF8 method is doing the decoding correctly.  It
> places the result of translating from UTF-8 to UTF-16 in the char[]
> l_cdata variable.  It then creates a new String by calling
>
> new String(l_cdata, 0, j)
>
> I believe that the variable j is the length of the filled-in portion of
> the l_cdata array.  l_cdata is a class variable that is reused between
> method calls (the decodeUTF8 method is synchronized).
>
> This seems to be the problem.  I haven't figured out why yet.  I also
> have the same problem when running on FreeBSD (using the FreeBSD 1.4
> JVM).
>
> Evan
>
>
> On Tuesday, Aug 12, 2003, at 21:28 US/Eastern, zy7111 wrote:
>
>> I use pg73jdbc3.jar as the JDBC driver. It works fine.
>>
>>> Yes, it should work in 7.2.2.  The decodeUTF8 method wasn't
>>> introduced until later.  From the comments in the code, it seems
>>> that it was included for performance reasons.
>>>
>>> Evan
>>>
>>> On Tuesday, Aug 12, 2003, at 08:34 US/Eastern,
>>> <zy7111@mail.china.com> wrote:
>>>
>>>> I can insert and retrieve Chinese text in PostgreSQL 7.2.2
>>>> successfully, with both operations going through JDBC.
>>>> It seems you insert text using psql and retrieve it using JDBC.