Thread: Charset encoding and accents
Hi,

I've posted this problem twice to the pgsql-jdbc user list, but no one helped me solve it. I think this is a really serious problem in the JDBC driver. I've tried different solutions with no result.

Let me explain the problem. I have a working PostgreSQL database. An application written in M$ Access uses the database through the ODBC driver with no problems. I want to access the data from a Swing application through the JDBC driver.
On the server side the character set encoding is SQL_ASCII. That is not a problem, because all the strings containing accented characters are retrieved correctly by ODBC and by the psql client.
But if I retrieve strings containing accents (like àòù) using JDBC, I get into trouble because my accents are mangled. For example, the string 'La città di Forlì' is retrieved and displayed as 'La citt?di Forl?'!

I've investigated the problem a bit in the driver's source code. When I call rs.getString(), the driver invokes (at a certain point) the method org.postgresql.core.Encoding.decode(byte[] encodedString, int offset, int length).
This method calls decodeUTF8 when the current encoding equals "UTF-8". If the encoding is different, it simply returns new String(encodedString, offset, length, encoding).
My database is SQL_ASCII, so the JDBC driver should return a new String and not call decodeUTF8. But when I step through the source in a debugger, the encoding is ALWAYS UTF-8! I've also tried setting a parameter in my connection string: jdbc:postgresql://localhost/prova?charSet=SQL_ASCII (I've tried a lot of different encodings here). The encoding is always UTF-8.
Then I thought, 'if the driver wants strings to be UNICODE, set the server variable CLIENT_ENCODING to UNICODE'. No result! It doesn't change!
The only way to have my string displayed correctly is to comment out all of decodeUTF8 and make it return a new String(data). So I think that if the encoding were correctly recognized as different from UTF-8, the decode method would return the new String, which is the correct behaviour in my case.

Please don't tell me to change my database to UNICODE. I cannot do that. And I do not WANT to do that. Why does the ODBC driver work fine while the JDBC driver works only with UNICODE databases?? It's a bug and should be corrected. If I were skilled enough I would fix the bug myself, but I don't know much about the JDBC standard.

I hope you can answer me with a solution. Really, the driver is simply unusable for serious work with this bug.

The problem is not solved by the latest stable (version 7.3 build 109) or development (version 7.4 build 204) releases of the driver.

Regards, Romaz
--
Davide Romanini
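The dispatch described in the message above can be sketched roughly as follows. This is a reconstruction from the description only, not the driver's source: the class name `DecodeDispatchSketch` is hypothetical, `decodeUTF8` is replaced by the JDK's own UTF-8 decoder as a stand-in, and the stored bytes are assumed to be single-byte ISO-8859-1 (the thread does not say which 8-bit encoding the Access client actually wrote).

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class DecodeDispatchSketch {
    // 'La città di Forlì', written with Unicode escapes so the example does
    // not depend on the encoding of the source file.
    static final String ORIGINAL = "La citt\u00e0 di Forl\u00ec";

    // Hypothetical stand-in for the driver's decodeUTF8: the JDK decoder,
    // which substitutes U+FFFD for malformed byte sequences.
    static String decodeUTF8(byte[] data, int offset, int length) {
        return new String(data, offset, length, StandardCharsets.UTF_8);
    }

    // Mirrors the dispatch described in the message, not the driver source.
    static String decode(byte[] data, int offset, int length, String encoding)
            throws UnsupportedEncodingException {
        if ("UTF-8".equals(encoding)) {
            return decodeUTF8(data, offset, length);
        }
        return new String(data, offset, length, encoding);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the stored bytes are single-byte ISO-8859-1.
        byte[] stored = ORIGINAL.getBytes(StandardCharsets.ISO_8859_1);
        // Decoded with the matching single-byte encoding: text is intact.
        System.out.println(decode(stored, 0, stored.length, "ISO-8859-1").equals(ORIGINAL));
        // Decoded as UTF-8: the accented bytes are malformed and get replaced.
        System.out.println(decode(stored, 0, stored.length, "UTF-8").equals(ORIGINAL));
    }
}
```

Under these assumptions the first comparison holds and the second does not, which matches the reported symptom: the same bytes are readable or mangled depending only on which encoding the decoder is told to use.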
Davide,

ASCII implies 7-bit characters, which doesn't have enough information to store the accented characters that you are using. I'm confused as to how they are being stored in the database at all if this is the case. I presume they get stored because the 8th bit is there anyway by default, but that shouldn't really be relied on, methinks.

Your database should probably be using LATIN1 (ISO-8859-1) or some other 8-bit encoding if you really want to store 8-bit information in it.

Anyway, try connecting with:

jdbc:postgresql://localhost/prova?charSet=LATIN1

This might well work for you. That said, I haven't tried this nor dug into the internals of the Java driver in a while. I'll Cc the jdbc list.

Tom.

On Thu, 2003-04-10 at 18:04, Davide Romanini wrote:
> [...]

--
Thomas O'Dowd - Got a keitai? Get Nooped!
tom@nooper.com - http://nooper.com
The charSet= option will no longer work with the 7.3 driver talking to a 7.3 server, since character set translation is now performed by the server (for performance reasons) in that scenario.

The correct solution here is to convert the database to the proper character set for the data it is storing. SQL_ASCII is not a proper character set for storing 8-bit data.

--Barry

Thomas O'Dowd wrote:
> [...]
Barry Lind wrote:
> The charSet= option will no longer work with the 7.3 driver talking to a
> 7.3 server, since character set translation is now performed by the
> server (for performance reasons) in that scenario.
>
> The correct solution here is to convert the database to the proper
> character set for the data it is storing. SQL_ASCII is not a proper
> character set for storing 8-bit data.

Probably I wasn't clear enough about the problem. I *cannot* change the charset type. SQL_ASCII really *is* the proper character set for my purposes, because I actually work using psql and the ODBC driver without any problem. I repeat: psql and ODBC retrieve all the data (with the accents) correctly. Also, if I change org.postgresql.core.Encoding.java so that the decodeUTF8 method simply returns a new String(data), JDBC retrieves the data from my SQL_ASCII database correctly! So my question is: why does JDBC call the decodeUTF8 method even when the string is surely *not* a UTF-8 string? If JDBC could recognize that the string is *not* a UTF-8 string, then it would simply return a new String, which is the right thing to do.
It's obvious that if JDBC receives from the PostgreSQL server a byte array representing a non-UTF-8 string, and it calls a method that expects a byte array representing a UTF-8 string, then it is a *bug*, because for non-UTF-8 strings it must return a new String.

I hope I've been clear enough this time.

Sincerely, I'm getting a bit frustrated by the problem, because I have projects to do and it prevents me from doing them :-(

Greetings, Romaz
--
Davide Romanini
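The idea of recognizing that a byte array is not UTF-8 can be illustrated with the JDK's strict decoder. This is only a sketch of such a check, not something the driver is known to do: the class `Utf8Check` and method `isValidUtf8` are hypothetical names, and the sample bytes are assumed to be ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, malformed input raises an exception instead of
            // being silently replaced.
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "La citt\u00e0 di Forl\u00ec"; // 'La città di Forlì'
        // Assumption: 8-bit data stored as ISO-8859-1.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A check like this can say that the bytes are not UTF-8, but it cannot say which 8-bit encoding they are in, and some 8-bit byte sequences happen to be well-formed UTF-8 by coincidence, so it would not by itself tell a client how to decode the data.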
Davide Romanini wrote:
> Barry Lind wrote:
>> [...]
>
> Probably I'm not enough clear about the problem. I *cannot* change
> charset type. SQL_ASCII really *is* the proper character set for my
> purposes, because I actually work using psql and ODBC driver without any
> problem.

You were clear; however, we disagree. SQL_ASCII is *not* the proper character set for your purposes. The characters you are having problems with do not exist in the SQL_ASCII character set. The fact that psql and ODBC work under this misconfiguration doesn't mean that the configuration is correct. Java deals with all characters internally in Unicode, thus forcing a character set conversion. So the code is converting from SQL_ASCII to UTF8. When it finds characters that are not part of the SQL_ASCII character set, it doesn't know what to do with them (are they LATIN1, LATIN5, LATIN? characters).

You state that you "*cannot* change" the character set. Can you explain why this is the case?

> I repeat: psql and ODBC retrieve all data (with the accents) in
> the correct manner. Also, if I change the
> org.postgresql.core.Encoding.java making the decodeUTF8 method to return
> simply a new String(data), JDBC retrieves the data from my SQL_ASCII
> database correctly! So my question is: why JDBC calls the decodeUTF8
> method also when the string is surely *not* an UTF-8 string?

If you were only storing SQL_ASCII characters it would be a UTF8 string, since SQL_ASCII is a subset of UTF8. But since you are storing invalid SQL_ASCII characters, this is no longer true.
The logic is as follows: The driver sets the CLIENT_ENCODING parameter to UNICODE, which instructs the server to convert from the character set of the database to UTF8. The server then sends all data to the client encoded in UTF8. The JDBC driver reads the UTF8 data and converts it to Java's internal Unicode representation.

The problem in all of this is that the server has decided, as an optimization, that if the database character set is SQL_ASCII then no conversion to UTF8 is necessary, since SQL_ASCII is a proper subset of UTF8. However, when characters that are not SQL_ASCII are stored in the database (i.e. 8-bit characters), this optimization simply sends them on to the client as if they were valid UTF8 characters (which they are not). So the client then tries to read what are supposed to be UTF8 characters and fails, because it is receiving non-UTF8 data even though it asked the server to send it only UTF8 data.

> If jdbc could recognize that the string is *not* an UTF-8 string, then it will
> simply return a new String that is the right thing to do.
> It's obvious that if JDBC receives from postgresql server a byte array
> representing a non-UTF8 string, and it calls a method that wants as a
> parameter a byte array representing an UTF8 string, then it is a *bug*,
> because for non-UTF8 strings it must return a new String.

As stated above, the driver tells the server to send all data as UTF8, but because of the optimization and the non-SQL_ASCII characters you are storing, non-UTF8 data ends up being sent to the client.

> I hope to be enough clear this time.

As I said earlier, you were clear the first time. I hope my response has explained the issues in greater detail.
> Sincerely, I'm getting a bit frustrated from the problem, because I've
> projects to do and it prevents me to do that projects :-(

I understand that you are frustrated, but frankly I am frustrated too, because I keep telling you what the solution to your problem is and you keep ignoring it :-)

thanks,
--Barry
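The flow Barry describes (server converts to UTF8 unless the database is SQL_ASCII, client always decodes UTF8) can be mimicked in a few lines of Java. This is a simulation under stated assumptions, not server or driver code: `EncodingFlowSketch`, `serverSend` and `clientReceive` are hypothetical names, the database encoding is passed as a Java charset name (so `ISO-8859-1` stands in for LATIN1), and the stored bytes are assumed to be ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingFlowSketch {
    // Hypothetical stand-in for the server when CLIENT_ENCODING is UNICODE.
    static byte[] serverSend(byte[] stored, String dbEncoding) {
        if ("SQL_ASCII".equals(dbEncoding)) {
            // The optimization: bytes are passed through unconverted, on the
            // assumption that they are 7-bit and therefore already UTF-8.
            return stored;
        }
        // Otherwise convert from the declared database encoding to UTF-8.
        Charset dbCharset = Charset.forName(dbEncoding);
        return new String(stored, dbCharset).getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical stand-in for the client: it always expects UTF-8.
    static String clientReceive(byte[] wire) {
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "La citt\u00e0 di Forl\u00ec"; // 'La città di Forlì'
        // Assumption: the 8-bit data on disk is ISO-8859-1.
        byte[] stored = original.getBytes(StandardCharsets.ISO_8859_1);

        // Database declared as SQL_ASCII: raw 8-bit bytes reach the client.
        System.out.println(clientReceive(serverSend(stored, "SQL_ASCII")).equals(original));
        // Database declared with its real encoding: conversion happens first.
        System.out.println(clientReceive(serverSend(stored, "ISO-8859-1")).equals(original));
    }
}
```

In this simulation the same stored bytes come back mangled when the database is declared SQL_ASCII and intact when it is declared with the encoding the data is really in, which is the argument for changing the database's declared character set rather than the client.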