Thread: JDBC driver is too fussy about Unicode
Hi, I have been trying to track down a bug in an application. This is a message broker for a chat application, written in Hibernate using PostgreSQL underneath. I was rather confused when I started getting decoding errors from data written into the database by the same application. Here's part of the stack trace from one such exception: org.hibernate.exception.DataException: could not load an entity: [com.pharaohsystems.txt2.db.TextRec#24413864] (...) Caused by: org.postgresql.util.PSQLException: Invalid character data was found. This is most likely caused by stored data containing characters that are invalid for the character set the database was created in. The most common example of this is storing 8bit data in a SQL_ASCII database. at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1891) at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:2190) I've gone in and looked at the underlying record. It contains the following character: http://www.fileformat.info/info/unicode/char/1f60a/index.htm Actually the exception above was generated after I went in and doctored the database so that the field in question contains only this particular character. I am confused because this is clearly a valid UTF-8 character. This can be easily seen by matching the UTF-8 character code shows on the page above, f09f988a, with the output from a simple query: txt2=# select text, encode (text::bytea, 'hex') from text where id = 24413864; text | encode ------+---------- 😊 | f09f988a (1 row) Thinking about this I have come to the conclusion that it is because of the Unicode version being used. This character is present in Unicode 6.0.0, but I am using Java 6, which apparently uses Unicode 4.0. My point is that this seems like an absurd restriction. Java allows me to use characters which aren't included in the Unicode specification it was built with. The postgresql driver allows me to store these values. But when I try and bring them out it gives me an error. This is a simple and well defined conversion. This code point exists in both UTF-8 (the encoding used by the database) and UTF-16 (the encoding used in Java). The conversion has clearly taken place to store this in the database. Why can't it pull it out? I can only think of two solutions (or workarounds). One is to perform the conversion myself. This is a pain because with a lot of user specified data these characters could get all over the place. The other is to filter out these characters. Neither of these are very appealing. Does anyone have any thoughts on this? I could of course be wrong and there could be something else going on here. Any ideas? Thanks, James
On 29 October 2011 08:50, James Pharaoh <james@phsys.co.uk> wrote: > org.hibernate.exception.DataException: could not load an entity: > [com.pharaohsystems.txt2.db.TextRec#24413864] > (...) > Caused by: org.postgresql.util.PSQLException: Invalid character data > was found. This is most likely caused by stored data containing > characters that are invalid for the character set the database was > created in. The most common example of this is storing 8bit data in a > SQL_ASCII database. > at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1891) > at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:2190) > txt2=# select text, encode (text::bytea, 'hex') from text where id = 24413864; > text | encode > ------+---------- > 😊 | f09f988a > (1 row) > Does anyone have any thoughts on this? I could of course be wrong and > there could be something else going on here. Any ideas? Can't reproduce it here; the attached testcase works fine for me (driver 9.1-901, server 8.4.9): oliver@flux:~/t$ java -classpath .:postgresql-9.1-901.jdbc4.jar TestEncoding jdbc:postgresql://localhost:5433/testdb test test 1: \ud83d\ude0a oliver@flux:~/t$ psql -c "select data, encode(data::bytea, 'hex') from foo" testdb data | encode ------+---------- 😊 | f09f988a (1 row) I've confirmed that the bytes on the wire are indeed F0 9F 98 8A; and the table has the same contents as your test above. I assume your database has a UNICODE encoding? Oliver
On 29 October 2011 13:17, Oliver Jowett <oliver@opencloud.com> wrote: > Can't reproduce it here; the attached testcase works fine for me Oops, testcase really attached this time. Oliver
Attachment
The test case works for me, and also on the server. A little investigation shows one big difference, the code I am using is somehow using a JDBC2 statement class, whereas adding some output to the test case when it runs successfully shows me using a JDBC4 one. Furthermore, the app in question has a page where I can enter queries in groovy. Trying a similar testcase there also works, but I am getting a JDBC3 statement class. I have a feeling this might be a big part of the problem but don't really know enough about what is going on underneath to be sure. I am using Hibernate 3.3.2 which is a little old. I wonder if it is somehow managing to force the driver to use JDBC2? How would you write a testcase to do this? Does the JDBC driver for PostgreSQL contain compatibility code for JDBC2 and could it be that there is a bug there? Of course, I could try and upgrade Hibernate as well. I'll look into that but the last time I had to do that it didn't work out so well. Thanks, James 2011/10/29 Oliver Jowett <oliver@opencloud.com>: > On 29 October 2011 13:17, Oliver Jowett <oliver@opencloud.com> wrote: > >> Can't reproduce it here; the attached testcase works fine for me > > Oops, testcase really attached this time. > > Oliver > -- James Pharaoh Pharaoh Systems Limited http://phsys.co.uk/contact
James Pharaoh wrote: > The test case works for me, and also on the server. A little > investigation shows one big difference, the code I am using is somehow > using a JDBC2 statement class, whereas adding some output to the test > case when it runs successfully shows me using a JDBC4 one. > > Furthermore, the app in question has a page where I can enter queries > in groovy. Trying a similar testcase there also works, but I am > getting a JDBC3 statement class. > > I have a feeling this might be a big part of the problem but don't > really know enough about what is going on underneath to be sure. I am > using Hibernate 3.3.2 which is a little old. > > I wonder if it is somehow managing to force the driver to use JDBC2? > How would you write a testcase to do this? > > Does the JDBC driver for PostgreSQL contain compatibility code for > JDBC2 and could it be that there is a bug there? > > Of course, I could try and upgrade Hibernate as well. I'll look into > that but the last time I had to do that it didn't work out so well. > > Thanks, > James > > 2011/10/29 Oliver Jowett<oliver@opencloud.com>: >> On 29 October 2011 13:17, Oliver Jowett<oliver@opencloud.com> wrote: >> >>> Can't reproduce it here; the attached testcase works fine for me >> >> Oops, testcase really attached this time. >> >> Oliver Oliver's test case on my setup fails in the insert because the database indicates the character 0xf09f9480 of encoding "UTF8" has no equivalent in "LATIN1". The original error given is thrown in the class indicated but the only thing really going on there is a new String creation with the encoding derived from the connection. What I would like to see is what that encoding is for Oliver's sucessful test case and then again on the app if possible. Not sure right now how to do that, but I would say that encoding is not being returned the same through the app. with hibernate as when the JDBC is ran directly. The only other thing maybe is the encodedString returned. See below: danap Abstractjdbc2ResultSet.getString() { ~ ~ Encoding encoding = connection.getEncoding(); try { return trimString(columnIndex, encoding.decode(this_row[columnIndex - 1])); } catch (IOException ioe) { throw new PSQLException(GT.tr("Invalid character data was found. This is most likely caused by stored data containing characters that are invalid for the character set the database was created in. The most common example of this is storing 8bit data in a SQL_ASCII database."), PSQLState.DATA_ERROR, ioe); } From Encoding: public String decode(byte[] encodedString) throws IOException { return decode(encodedString, 0, encodedString.length); } public String decode(byte[] encodedString, int offset, int length) throws IOException { if (encoding == null) return new String(encodedString, offset, length); --------> return new String(encodedString, offset, length, encoding); }
On 29 October 2011 18:21, James Pharaoh <james@phsys.co.uk> wrote: > The test case works for me, and also on the server. A little > investigation shows one big difference, the code I am using is somehow > using a JDBC2 statement class, whereas adding some output to the test > case when it runs successfully shows me using a JDBC4 one. > > Furthermore, the app in question has a page where I can enter queries > in groovy. Trying a similar testcase there also works, but I am > getting a JDBC3 statement class. Maybe you have two different copies of the driver floating around? Oliver
Well I wondered that originally. That's why I did the test with the groovy script. The page I use to run that is part of the same application and I am using the same dataSource which I pass to hibernate so I don't see how it can be a different driver. In any case, I've looked into the metadata and found that I am using an older version of the driver. My build scripts don't clean out the old version and so it was just picing one at random. I'm now using the latest PostgreSQL 9.1 JDBC4 (build 901). This actually does seem to have resolved the issue, more or less. I am seeing two invalid character symbols instead of an exception. This is good enough for this client ;-) Thanks for the help, James 2011/10/29 Oliver Jowett <oliver@opencloud.com>: > On 29 October 2011 18:21, James Pharaoh <james@phsys.co.uk> wrote: >> The test case works for me, and also on the server. A little >> investigation shows one big difference, the code I am using is somehow >> using a JDBC2 statement class, whereas adding some output to the test >> case when it runs successfully shows me using a JDBC4 one. >> >> Furthermore, the app in question has a page where I can enter queries >> in groovy. Trying a similar testcase there also works, but I am >> getting a JDBC3 statement class. > > Maybe you have two different copies of the driver floating around? > > Oliver > -- James Pharaoh Pharaoh Systems Limited http://phsys.co.uk/contact