Re: Problems with charsets, investigated... - Mailing list pgsql-jdbc

From Oliver Jowett
Subject Re: Problems with charsets, investigated...
Date
Msg-id 411400E6.8060704@opencloud.com
In response to Problems with charsets, investigated...  (Alexandre Aufrere <alexandre.aufrere@inet6.fr>)
List pgsql-jdbc
Alexandre Aufrere wrote:
> Hello,
>
> I am using Postgresql 7.4.2 and its JDBC drivers, straight out from a FC2,
> along with JDK 1.4.2 from Sun.
> I use the JDBC driver in a web app using Enhydra appserver. Java correctly
> sets its file.encoding property to the charset specified in the LANG
> environment variable. However, it appears that whatever i set this
> variable to, the JDBC driver seems to use UTF-8.

This is entirely intentional. See below.

> I have digged into the code, and seen that in the
> AbstractJdbc1Connection.java class, the encoding is always forced to
> "UNICODE" (therefore forcing UTF-8 on Java side).
> From that, i patched the code to correctly use the file.encoding system
> property to guess the charset.
>
> As i didn't dig very long, and as it seems from what i see in cvsweb at
> gborg that all this stuff could have changed deeply, i am not sure that
> this would be useful to you. However i downloaded the latest dev builds at
> jdbc.postgresql.org, and it seems the bad behaviour is still there.
>
> So, did i miss something somewhere ? Are you interested in that (frankly
> quite ugly) patch ?

This change doesn't make sense.

The internal representation of Java strings is UTF-16 always. So it
doesn't really matter whether you do:

   db encoding -> UTF-8 (done by the server)
   UTF-8 -> UTF-16 Java string (done trivially by the driver)

or:

   look up db encoding to know how to transcode
   db encoding -> UTF-16 Java string (done by the driver)

except that with the second option the driver has to do a lot more
(unnecessary) work. Either way, the DB data still has to be transcoded
into Unicode somewhere.
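
To illustrate the two paths, here is a minimal sketch (not driver code).
The class and method names are made up for illustration, the server-side
conversion is only simulated in Java, and ISO-8859-1 is just an assumed
example of a DB encoding. Both paths should end in the same Java string:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodePaths {
    // Path 1: server converts DB encoding -> UTF-8, driver decodes UTF-8.
    static String viaServerUtf8(byte[] dbBytes, Charset dbEncoding) {
        // Stand-in for the server's conversion (simulated here, an assumption).
        byte[] utf8OnWire = new String(dbBytes, dbEncoding).getBytes(StandardCharsets.UTF_8);
        // The driver's side of the job: a fixed UTF-8 decode.
        return new String(utf8OnWire, StandardCharsets.UTF_8);
    }

    // Path 2: driver must know the DB encoding and decode it itself.
    static String viaDriverLookup(byte[] dbBytes, Charset dbEncoding) {
        return new String(dbBytes, dbEncoding);
    }

    public static void main(String[] args) {
        // "caf\u00e9" as stored in an ISO-8859-1 database (assumed example data).
        byte[] stored = {0x63, 0x61, 0x66, (byte) 0xE9};
        String a = viaServerUtf8(stored, StandardCharsets.ISO_8859_1);
        String b = viaDriverLookup(stored, StandardCharsets.ISO_8859_1);
        System.out.println(a.equals(b)); // both paths give the same string
    }
}
```

The only difference is who needs to know the DB encoding: in path 1 the
driver only ever needs a UTF-8 decoder.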

Using file.encoding as a basis for which encoding to use is horribly
broken anyway -- what if that encoding does not match the actual DB
charset? Whatever transcoding happens really needs to be done based on
the actual DB encoding in use.
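
A small sketch of what goes wrong when the guess does not match. The
class name and sample data are hypothetical; the bytes stand for data
that is really UTF-8 in the database, decoded once with the right
charset and once with a charset guessed from the client's locale
(ISO-8859-1 here, as an assumed file.encoding value):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetGuess {
    // Decode raw bytes using whatever charset the caller assumes.
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        // The database actually holds UTF-8 (assumed for this example).
        byte[] fromDb = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(fromDb, StandardCharsets.UTF_8);
        // A guess taken from the client environment instead of the DB.
        String wrong = decode(fromDb, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 5: the 2-byte sequence became 2 chars
    }
}
```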

I'd suggest that your real problem is that you do not have your database
encoding set correctly. If server_encoding is correct, then the server
will do the correct transcoding to UNICODE and everything will be happy
-- you will get correctly formed Java strings and can then encode those
using whatever output encoding you like. If server_encoding is
SQL_ASCII, everything will break horribly as the server has no idea how
the raw data is actually encoded and can't transcode.
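
If you want to check this from the application side, something like the
following sketch may help. The class and method names are made up; it
assumes an already-open java.sql.Connection to a PostgreSQL server and
uses the SHOW server_encoding command to read the setting:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ServerEncodingCheck {
    // Ask the server which encoding the database was created with.
    static String serverEncoding(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW server_encoding")) {
            if (!rs.next()) {
                throw new SQLException("SHOW server_encoding returned no rows");
            }
            return rs.getString(1);
        }
    }

    // SQL_ASCII means the server does not know how the bytes are encoded,
    // so it cannot transcode them for the client.
    static boolean canTranscode(String serverEncoding) {
        return !"SQL_ASCII".equalsIgnoreCase(serverEncoding.trim());
    }

    public static void main(String[] args) {
        // No database needed for this part; it only exercises the check.
        System.out.println(canTranscode("SQL_ASCII")); // false
        System.out.println(canTranscode("UNICODE"));   // true
    }
}
```

With a live connection you would call canTranscode(serverEncoding(conn))
once at startup and refuse to continue, or at least log loudly, when it
returns false.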

If you're exclusively using JDBC to access the database, a UNICODE
database encoding is the right choice since it means the server does not
need to transcode at all when talking to JDBC. It's probably the right
choice even with mixed clients unless you have other clients that don't
understand client_encoding.

This is getting to be a FAQ -- I'm actually looking at disabling support
for JDBC access to SQL_ASCII databases entirely since it breaks so
unpredictably.

-O
