Thread: JDBC driver is too fussy about Unicode

JDBC driver is too fussy about Unicode

From
James Pharaoh
Date:
Hi,

I have been trying to track down a bug in an application. This is a
message broker for a chat application, written in Hibernate using
PostgreSQL underneath. I was rather confused when I started getting
decoding errors from data written into the database by the same
application.

Here's part of the stack trace from one such exception:

org.hibernate.exception.DataException: could not load an entity:
[com.pharaohsystems.txt2.db.TextRec#24413864]
(...)
Caused by: org.postgresql.util.PSQLException: Invalid character data
was found.  This is most likely caused by stored data containing
characters that are invalid for the character set the database was
created in.  The most common example of this is storing 8bit data in a
SQL_ASCII database.
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1891)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:2190)

I've gone in and looked at the underlying record. It contains the
following character:

http://www.fileformat.info/info/unicode/char/1f60a/index.htm

Actually the exception above was generated after I went in and
doctored the database so that the field in question contains only this
particular character.

I am confused because this is clearly a valid UTF-8 character. This
can be easily seen by matching the UTF-8 character code shows on the
page above, f09f988a, with the output from a simple query:

txt2=# select text, encode (text::bytea, 'hex') from text where id = 24413864;
 text |  encode
------+----------
 😊    | f09f988a
(1 row)

Thinking about this I have come to the conclusion that it is because
of the Unicode version being used. This character is present in
Unicode 6.0.0, but I am using Java 6, which apparently uses Unicode
4.0.

My point is that this seems like an absurd restriction. Java allows me
to use characters which aren't included in the Unicode specification
it was built with. The postgresql driver allows me to store these
values. But when I try and bring them out it gives me an error.

This is a simple and well defined conversion. This code point exists
in both UTF-8 (the encoding used by the database) and UTF-16 (the
encoding used in Java). The conversion has clearly taken place to
store this in the database. Why can't it pull it out?

I can only think of two solutions (or workarounds). One is to perform
the conversion myself. This is a pain because with a lot of user
specified data these characters could get all over the place. The
other is to filter out these characters. Neither of these are very
appealing.

Does anyone have any thoughts on this? I could of course be wrong and
there could be something else going on here. Any ideas?

Thanks,
James

Re: JDBC driver is too fussy about Unicode

From
Oliver Jowett
Date:
On 29 October 2011 08:50, James Pharaoh <james@phsys.co.uk> wrote:

> org.hibernate.exception.DataException: could not load an entity:
> [com.pharaohsystems.txt2.db.TextRec#24413864]
> (...)
> Caused by: org.postgresql.util.PSQLException: Invalid character data
> was found.  This is most likely caused by stored data containing
> characters that are invalid for the character set the database was
> created in.  The most common example of this is storing 8bit data in a
> SQL_ASCII database.
>        at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1891)
>        at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:2190)

> txt2=# select text, encode (text::bytea, 'hex') from text where id = 24413864;
>  text |  encode
> ------+----------
>  😊    | f09f988a
> (1 row)

> Does anyone have any thoughts on this? I could of course be wrong and
> there could be something else going on here. Any ideas?

Can't reproduce it here; the attached testcase works fine for me
(driver 9.1-901, server 8.4.9):

oliver@flux:~/t$ java -classpath .:postgresql-9.1-901.jdbc4.jar
TestEncoding jdbc:postgresql://localhost:5433/testdb test test
1: \ud83d\ude0a
oliver@flux:~/t$ psql -c "select data, encode(data::bytea, 'hex') from
foo" testdb
 data |  encode
------+----------
 😊    | f09f988a
(1 row)

I've confirmed that the bytes on the wire are indeed F0 9F 98 8A; and
the table has the same contents as your test above.

I assume your database has a UNICODE encoding?

Oliver

Re: JDBC driver is too fussy about Unicode

From
Oliver Jowett
Date:
On 29 October 2011 13:17, Oliver Jowett <oliver@opencloud.com> wrote:

> Can't reproduce it here; the attached testcase works fine for me

Oops, testcase really attached this time.

Oliver

Attachment

Re: JDBC driver is too fussy about Unicode

From
James Pharaoh
Date:
The test case works for me, and also on the server. A little
investigation shows one big difference, the code I am using is somehow
using a JDBC2 statement class, whereas adding some output to the test
case when it runs successfully shows me using a JDBC4 one.

Furthermore, the app in question has a page where I can enter queries
in groovy. Trying a similar testcase there also works, but I am
getting a JDBC3 statement class.

I have a feeling this might be a big part of the problem but don't
really know enough about what is going on underneath to be sure. I am
using Hibernate 3.3.2 which is a little old.

I wonder if it is somehow managing to force the driver to use JDBC2?
How would you write a testcase to do this?

Does the JDBC driver for PostgreSQL contain compatibility code for
JDBC2 and could it be that there is a bug there?

Of course, I could try and upgrade Hibernate as well. I'll look into
that but the last time I had to do that it didn't work out so well.

Thanks,
James

2011/10/29 Oliver Jowett <oliver@opencloud.com>:
> On 29 October 2011 13:17, Oliver Jowett <oliver@opencloud.com> wrote:
>
>> Can't reproduce it here; the attached testcase works fine for me
>
> Oops, testcase really attached this time.
>
> Oliver
>



--
James Pharaoh
Pharaoh Systems Limited
http://phsys.co.uk/contact

Re: JDBC driver is too fussy about Unicode

From
dmp
Date:
James Pharaoh wrote:
 > The test case works for me, and also on the server. A little
 > investigation shows one big difference, the code I am using is somehow
 > using a JDBC2 statement class, whereas adding some output to the test
 > case when it runs successfully shows me using a JDBC4 one.
 >
 > Furthermore, the app in question has a page where I can enter queries
 > in groovy. Trying a similar testcase there also works, but I am
 > getting a JDBC3 statement class.
 >
 > I have a feeling this might be a big part of the problem but don't
 > really know enough about what is going on underneath to be sure. I am
 > using Hibernate 3.3.2 which is a little old.
 >
 > I wonder if it is somehow managing to force the driver to use JDBC2?
 > How would you write a testcase to do this?
 >
 > Does the JDBC driver for PostgreSQL contain compatibility code for
 > JDBC2 and could it be that there is a bug there?
 >
 > Of course, I could try and upgrade Hibernate as well. I'll look into
 > that but the last time I had to do that it didn't work out so well.
 >
 > Thanks,
 > James
 >
 > 2011/10/29 Oliver Jowett<oliver@opencloud.com>:
 >> On 29 October 2011 13:17, Oliver Jowett<oliver@opencloud.com>  wrote:
 >>
 >>> Can't reproduce it here; the attached testcase works fine for me
 >>
 >> Oops, testcase really attached this time.
 >>
 >> Oliver

Oliver's test case on my setup fails in the insert because the database
indicates the character 0xf09f9480 of encoding "UTF8" has no equivalent
in "LATIN1".

The original error given is thrown in the class indicated but the only
thing really going on there is a new String creation with the encoding
derived from the connection. What I would like to see is what that
encoding is for Oliver's sucessful test case and then again on the app
if possible. Not sure right now how to do that, but I would say that
encoding is not being returned the same through the app. with hibernate
as when the JDBC is ran directly.

The only other thing maybe is the encodedString returned. See below:

danap

Abstractjdbc2ResultSet.getString()
{
~
~
Encoding encoding = connection.getEncoding();
         try
         {
             return trimString(columnIndex, encoding.decode(this_row[columnIndex
- 1]));
         }
         catch (IOException ioe)
         {
             throw new PSQLException(GT.tr("Invalid character data was found.
This is most likely caused by stored data containing characters that are invalid
for the character set the database was created in.  The most common example of
this is storing 8bit data in a SQL_ASCII database."), PSQLState.DATA_ERROR, ioe);
         }

 From Encoding:

public String decode(byte[] encodedString) throws IOException
     {
         return decode(encodedString, 0, encodedString.length);
     }

public String decode(byte[] encodedString, int offset, int length) throws
IOException
     {
         if (encoding == null)
             return new String(encodedString, offset, length);

-------->        return new String(encodedString, offset, length, encoding);
     }


Re: JDBC driver is too fussy about Unicode

From
Oliver Jowett
Date:
On 29 October 2011 18:21, James Pharaoh <james@phsys.co.uk> wrote:
> The test case works for me, and also on the server. A little
> investigation shows one big difference, the code I am using is somehow
> using a JDBC2 statement class, whereas adding some output to the test
> case when it runs successfully shows me using a JDBC4 one.
>
> Furthermore, the app in question has a page where I can enter queries
> in groovy. Trying a similar testcase there also works, but I am
> getting a JDBC3 statement class.

Maybe you have two different copies of the driver floating around?

Oliver

Re: JDBC driver is too fussy about Unicode

From
James Pharaoh
Date:
Well I wondered that originally. That's why I did the test with the
groovy script. The page I use to run that is part of the same
application and I am using the same dataSource which I pass to
hibernate so I don't see how it can be a different driver.

In any case, I've looked into the metadata and found that I am using
an older version of the driver. My build scripts don't clean out the
old version and so it was just picing one at random. I'm now using the
latest PostgreSQL 9.1 JDBC4 (build 901).

This actually does seem to have resolved the issue, more or less. I am
seeing two invalid character symbols instead of an exception. This is
good enough for this client ;-)

Thanks for the help,
James

2011/10/29 Oliver Jowett <oliver@opencloud.com>:
> On 29 October 2011 18:21, James Pharaoh <james@phsys.co.uk> wrote:
>> The test case works for me, and also on the server. A little
>> investigation shows one big difference, the code I am using is somehow
>> using a JDBC2 statement class, whereas adding some output to the test
>> case when it runs successfully shows me using a JDBC4 one.
>>
>> Furthermore, the app in question has a page where I can enter queries
>> in groovy. Trying a similar testcase there also works, but I am
>> getting a JDBC3 statement class.
>
> Maybe you have two different copies of the driver floating around?
>
> Oliver
>



--
James Pharaoh
Pharaoh Systems Limited
http://phsys.co.uk/contact