On Wed, 6 Apr 2005, Chris Kratz wrote:
> Our production database was created with the default SQL_ASCII encoding. It
> appears that some of our users have entered characters into the system with
> characters above 127 (accented vowels, etc). None of the tools we use
> currently have had a problem with this behavior until recently, everything
> just worked.
>
> I was testing some reporting tools this past weekend and have been playing
> with Jasper reports[1] . Jasper reports is a Java based reporting tool that
> reads data from the database via JDBC. When I initially tried to generate
> reports, the jdbc connection would crash with the following message:
>
> org.postgresql.util.PSQLException: Invalid character data was found.
>
> Googling eventually turned up a message on the pgsql-jdbc list detailing the
> problem[2]. Basically, java cannot convert these characters above 127 into
> unicode which is required by java.
>
> After some more googling, I found that if I took a recent database dump and
> then ran it through iconv[3] and then created the database with a unicode
> encoding, everything worked.
>
> 1. Is there any way to do a iconv type translation inline in a sql statement?
> ie select translate(text_field, unicode) from table.... Btw, set
> client_encoding=UNICODE does not work in this situation. In fact the JDBC
> driver for postgres seems to do this automatically.
You can't do translation inline, how would a driver interpret the results
of SELECT translate(field1, unicode), translate(field2, latin1) ?
The driver does SET client_encoding which does work for all real server
encodings. The problem is that SQL_ASCII is not a real encoding. It
accepts any encoding and cannot do conversions to other encodings. Your
db right now could easily have a mix of encodings.
> 2. I'm really not sure I want to change the encoding of our main database to
> Unicode. Is there a performance loss when going to a UNICODE database
> encoding? What about sorts, etc. I'm really worried about unintended side
> effects of moving from SQL_ASCII to UNICODE.
You don't need to use unicode, but you must select another encoding. If
you'd like to stick with a single byte encoding perhaps LATIN1 would be
appropriate for you.
> 3. Is there any other way around this issue? Or are we living dangerously by
> trying to store non-ascii data in a database created as ascii encoded?
You are living dangerously.
> 4. Has anyone else gone through a conversion like this? Are there any
> gotchas we should look out for?
The gotchas here are to make sure your other client tools still work
against the new database.
> [3] iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall
I see your data really is LATIN1. Perhaps you should use that as your db
encoding. That should keep your existing client tools happy as well as
the JDBC driver.
Kris Jurka