Thread: Charset encoding and accents

Charset encoding and accents

From

Davide Romanini

Date:

10 April 2003, 05:03:02

Hi,

I've posted this problem twice to the pgsql-jdbc list, but no one has
helped me solve it. I think this is a really serious problem in the JDBC
driver. I've tried different solutions with no result.

Well, let me explain the problem. I have a working database in
PostgreSQL. There's an application, written in M$ Access, that uses the
database through the ODBC driver with no problems. I'd like to access
the data from a Swing application through the JDBC driver.
On the server side the charset encoding is set to SQL_ASCII. That is not
a problem, because all the strings containing accented characters are
retrieved correctly by ODBC and also by the psql client.
But if I retrieve strings containing accents (like àòù) using JDBC, I
get into trouble because my accents get mangled. For example, the string
'La città di Forlì' is retrieved and displayed as 'La citt?di Forl?'!

I've dug into the problem a bit with the source code of the driver.
I noticed that when I call rs.getString(), the driver invokes (at a
certain point) the method org.postgresql.core.Encoding.decode(byte[]
encodedString, int offset, int length).
This method calls decodeUTF8 when the current encoding equals
"UTF-8". If the encoding is different, it simply returns a new
String(encodedString, offset, length, encoding).
Well, my database is SQL_ASCII, so the JDBC driver should return a new
String and not call decodeUTF8. But when I step through the source in
the debugger, the encoding is ALWAYS UTF-8! I've also tried setting
a parameter in my connection string:
jdbc:postgresql://localhost/prova?charSet=SQL_ASCII (I've tried a lot of
different encodings here). The encoding is always UTF-8.
Well, I thought, 'if the driver wants strings to be UNICODE, set the
server variable CLIENT_ENCODING to UNICODE'. No result! It doesn't change!
The only way to have my string displayed correctly is to comment out the
whole body of decodeUTF8 and make it return a new String(data). So I
think that if the encoding were correctly recognized as different from
UTF-8, the decode method would return the new String, which is the
correct behaviour in my case.
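
The symptom described above can be reproduced without a database. The
following is only a minimal sketch, not the driver's code: it assumes the
accented text was stored as ISO-8859-1 bytes (the thread does not say
which 8-bit encoding the clients actually used), it relies on
java.nio.charset.StandardCharsets from a modern JDK, and the class and
method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class AccentDecodeDemo {
    // "La città di Forlì", written with escapes so the source file
    // encoding does not matter.
    static final String ORIGINAL = "La citt\u00e0 di Forl\u00ec";

    // Assumption: the bytes an 8-bit (ISO-8859-1) client would have stored.
    static byte[] storedBytes() {
        return ORIGINAL.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Interprets the bytes as UTF-8, as a driver expecting UTF-8 would.
    static String decodeAsUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Interprets each byte as one ISO-8859-1 character.
    static String decodeAsLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = storedBytes();
        System.out.println("as UTF-8:      " + decodeAsUtf8(data));
        System.out.println("as ISO-8859-1: " + decodeAsLatin1(data));
    }
}
```

Decoding as ISO-8859-1 gives back the original text, while decoding the
same bytes as UTF-8 replaces the accented letters with substitution
characters, because a lone 0xE0 or 0xEC byte is not a complete UTF-8
sequence.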

Please don't tell me to change my database to UNICODE. I cannot do
that. And I do not WANT to do that. Why does the ODBC driver work fine
while the JDBC driver works only with UNICODE databases?? It's a bug and
should be fixed. If I were skilled enough I would fix the bug myself,
but I don't know much about the JDBC standard.

I hope you can answer with a solution. Really, the driver is simply
unusable for serious work with this bug.

The problem is not solved in the latest stable (version 7.3 build 109)
and development (version 7.4 build 204) releases of the driver.

Regards, Romaz
--
Davide Romanini

Re: Charset encoding and accents

From

Thomas O'Dowd

Date:

10 April 2003, 08:19:48

Davide,

ASCII implies 7-bit characters, which doesn't give you enough information
to store the accented characters that you are using. I'm confused as to
how they are being stored in the database at all if this is the case. I
presume they get stored because the 8th bit is there anyway by default,
but that shouldn't really be relied on, methinks.

Your database should probably be using LATIN1 (ISO-8859-1) or some other
8 bit encoding if you really want to store 8 bit information in it.

Anyway, try connecting with:

jdbc:postgresql://localhost/prova?charSet=LATIN1

This might well work for you. That said I haven't tried this nor dug
into the internals of the java driver in a while. I'll Cc the jdbc list.

Tom.

--
Thomas O'Dowd  - Got a keitai? Get Nooped!
tom@nooper.com - http://nooper.com

Re: Charset encoding and accents

From

Barry Lind

Date:

10 April 2003, 13:51:36

The charSet= option will no longer work with the 7.3 driver talking to a
7.3 server, since character set translation is now performed by the
server (for performance reasons) in that scenario.

The correct solution here is to convert the database to the proper
character set for the data it is storing.  SQL_ASCII is not a proper
character set for storing 8bit data.

--Barry

Re: Charset encoding and accents

From

Davide Romanini

Date:

11 April 2003, 10:41:50

Barry Lind wrote:

> The charSet= option will no longer work with the 7.3 driver talking to a
> 7.3 server, since character set translation is now performed by the
> server (for performance reasons) in that scenario.
>
> The correct solution here is to convert the database to the proper
> character set for the data it is storing.  SQL_ASCII is not a proper
> character set for storing 8bit data.
> 

I'm probably not being clear enough about the problem. I *cannot* change
the charset type. SQL_ASCII really *is* the proper character set for my
purposes, because I actually work with psql and the ODBC driver without
any problem. I repeat: psql and ODBC retrieve all data (with the accents)
correctly. Also, if I change org.postgresql.core.Encoding.java so that
the decodeUTF8 method simply returns a new String(data), JDBC retrieves
the data from my SQL_ASCII database correctly! So my question is: why
does JDBC call the decodeUTF8 method even when the string is surely
*not* a UTF-8 string? If JDBC could recognize that the string is *not*
a UTF-8 string, then it would simply return a new String, which is the
right thing to do.
It's obvious that if JDBC receives from the PostgreSQL server a byte
array representing a non-UTF8 string, and it calls a method that expects
a byte array representing a UTF8 string, then it is a *bug*, because for
non-UTF8 strings it must return a new String.

I hope I've been clear enough this time.

Honestly, I'm getting a bit frustrated with the problem, because I have
projects to do and it prevents me from doing them :-(

Greetings, Romaz
--
Davide Romanini

Re: Charset encoding and accents

From

Barry Lind

Date:

12 April 2003, 17:55:14

Davide Romanini wrote:
> Barry Lind wrote:
>
>> The charSet= option will no longer work with the 7.3 driver talking to
>> a 7.3 server, since character set translation is now performed by the
>> server (for performance reasons) in that scenario.
>>
>> The correct solution here is to convert the database to the proper
>> character set for the data it is storing.  SQL_ASCII is not a proper
>> character set for storing 8bit data.
>>
>
> I'm probably not being clear enough about the problem. I *cannot* change
> the charset type. SQL_ASCII really *is* the proper character set for my
> purposes, because I actually work with psql and the ODBC driver without
> any problem.

You were clear, however we disagree.  SQL_ASCII is *not* the proper
character set for your purposes.  The characters you are having problems
with do not exist in the SQL_ASCII character set.  The fact that psql
and ODBC work under this misconfiguration doesn't mean that the
configuration is correct.  Java deals with all characters internally in
Unicode, thus forcing a character set conversion.  So the code is
converting from SQL_ASCII to UTF8.  When it finds characters that are
not part of SQL_ASCII character set it doesn't know what to do with them
(are they LATIN1, LATIN5, LATIN? characters).
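
To illustrate the ambiguity, here is a small sketch showing that the same
two 8-bit values turn into different characters depending on which 8-bit
character set is assumed. The class name is made up for illustration,
ISO-8859-2 (LATIN2) is picked only as an example of another Latin
encoding, and the sketch assumes that charset is installed in the JDK
being used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AmbiguousBytesDemo {
    // The two 8-bit values behind the accents in "città" and "Forlì"
    // when the text is stored as ISO-8859-1 (an assumption).
    static final byte[] ACCENT_BYTES = { (byte) 0xE0, (byte) 0xEC };

    // Decodes the raw bytes under the given character set.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Assumed to be available; not guaranteed on every Java platform.
        Charset latin2 = Charset.forName("ISO-8859-2");
        System.out.println("LATIN1: " + decode(ACCENT_BYTES, latin1));
        System.out.println("LATIN2: " + decode(ACCENT_BYTES, latin2));
    }
}
```

Under LATIN1 the bytes are à and ì; under LATIN2 the same bytes are ŕ and
ě. With the database declared as SQL_ASCII, nothing records which
reading was intended.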

You state that you "*cannot* change" the character set.  Can you explain
why this is the case?

> I repeat: psql and ODBC retrieve all data (with the accents)
> correctly. Also, if I change org.postgresql.core.Encoding.java so that
> the decodeUTF8 method simply returns a new String(data), JDBC retrieves
> the data from my SQL_ASCII database correctly! So my question is: why
> does JDBC call the decodeUTF8 method even when the string is surely
> *not* a UTF-8 string?

If you were only storing SQL_ASCII characters it would be a UTF8 string
since SQL_ASCII is a subset of UTF8.  But since you are storing invalid
SQL_ASCII characters this is no longer true.

The logic is as follows:

1. The driver sets the CLIENT_ENCODING parameter to UNICODE, which
   instructs the server to convert from the character set of the
   database to UTF8.

2. The server then sends all data to the client encoded in UTF8.

3. The JDBC driver reads the UTF8 data and converts it to Java's
   internal Unicode representation.

The problem in all of this is that the server has decided as an
optimization that if the database character set is SQL_ASCII then no
conversion is necessary to UTF8 since SQL_ASCII is a proper subset of
UTF8.  However when characters that are not SQL_ASCII are stored in the
database (i.e. 8bit characters) then this optimization simply sends them
on to the client as if they were valid UTF8 characters (which they are
not).  So the client then tries to read what are supposed to be UTF8
characters and fails because it is receiving non UTF8 data even though
it asked the server to only send it UTF8 data.
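
The effect of that optimization can be checked in isolation. This is a
rough sketch under stated assumptions, not server or driver code: the
class name and helper are hypothetical, and the 8-bit bytes are assumed
to be ISO-8859-1. A strict UTF-8 decoder accepts pure 7-bit data
unchanged, which is why passing it through untouched is safe, but
rejects the raw 8-bit bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8ValidityCheck {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String accented = "La citt\u00e0 di Forl\u00ec";
        // 7-bit data: identical in ASCII and UTF-8.
        byte[] sevenBit = "La citta di Forli".getBytes(StandardCharsets.US_ASCII);
        // 8-bit data passed through without conversion (assumed ISO-8859-1).
        byte[] eightBit = accented.getBytes(StandardCharsets.ISO_8859_1);
        // What a real conversion to UTF-8 would have produced.
        byte[] realUtf8 = accented.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(sevenBit)); // true
        System.out.println(isValidUtf8(eightBit)); // false
        System.out.println(isValidUtf8(realUtf8)); // true
    }
}
```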

> If JDBC could recognize that the string is *not* a UTF-8 string, then
> it would simply return a new String, which is the right thing to do.
> It's obvious that if JDBC receives from the PostgreSQL server a byte
> array representing a non-UTF8 string, and it calls a method that expects
> a byte array representing a UTF8 string, then it is a *bug*, because for
> non-UTF8 strings it must return a new String.
>

As stated above the driver tells the server to send all data as UTF8,
but because of the optimization and the non-SQL_ASCII characters you are
storing that optimization results in non-UTF8 data being sent to the
client.

> I hope I've been clear enough this time.

As I said earlier, you were clear the first time.  I hope my response
has explained the issues in greater detail.

>
> Honestly, I'm getting a bit frustrated with the problem, because I have
> projects to do and it prevents me from doing them :-(

I understand that you are frustrated, but frankly I am frustrated too,
because I keep telling you what the solution to your problem is and you
keep ignoring it :-)

thanks,
--Barry