Re: Charset encoding patch to JDBC driver - Mailing list pgsql-jdbc

From Oliver Jowett
Subject Re: Charset encoding patch to JDBC driver
Msg-id 4239E942.5060608@opencloud.com
In response to Re: Charset encoding patch to JDBC driver  (Javier Yáñez <javier@cibal.es>)
List pgsql-jdbc
Javier Yáñez wrote:

> I think that this patch is necessary to solve some real-world
> problems. In my particular case I have to write a J2EE application
> that accesses an existing database. This database uses the SQL_ASCII
> encoding, and with the current version of pgjdbc, when the result of a
> query contains an 8-bit character (very common in Spanish), this
> error appears:

Indeed. You should be using a LATIN1 or UNICODE database encoding in
this case.

> I cannot tell my customer to change the database encoding, because
> other (non-Java) applications might stop working or show strange
> characters.

You can change the database encoding, then change the default
client_encoding for clients that do not set client_encoding themselves.
For example, convert your database to UNICODE or LATIN1 and set the
default client_encoding to LATIN1. JDBC will then explicitly set
client_encoding=UNICODE, and other clients will get LATIN1 data unless
they explicitly change client_encoding. This is how server_encoding /
client_encoding are *meant* to work: once you tell the database how its
text data is encoded, clients can choose what format they get the data
in, and the server does the transcoding work automatically.
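
As a rough illustration of what that transcoding amounts to, here is a
small Java sketch. It is not backend or driver code; the class and
method names are made up for the example, and ISO-8859-1 / UTF-8 stand
in for LATIN1 / UNICODE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {
    // Re-encode bytes from one known charset into another. This is only
    // possible because the source encoding is known, which is exactly
    // the information a SQL_ASCII database does not have.
    static byte[] transcode(byte[] stored, Charset serverEncoding, Charset clientEncoding) {
        String text = new String(stored, serverEncoding);
        return text.getBytes(clientEncoding);
    }

    public static void main(String[] args) {
        // Pretend the database stores this name as UTF-8.
        byte[] stored = "Y\u00e1\u00f1ez".getBytes(StandardCharsets.UTF_8);

        // A legacy client with client_encoding=LATIN1 gets one byte per character.
        byte[] forLegacyClient =
            transcode(stored, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // JDBC with client_encoding=UNICODE gets the UTF-8 bytes unchanged.
        byte[] forJdbc =
            transcode(stored, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println(forLegacyClient.length + " bytes as LATIN1, "
            + forJdbc.length + " bytes as UTF-8");
    }
}
```

Both clients see the same five characters; only the byte representation
on the wire differs.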

> On the other hand, I do not think that using the SQL_ASCII encoding is
> a database misconfiguration, or that storing 8-bit data in a SQL_ASCII
> database is incorrect. Other applications are using the same database
> through ODBC without problems.

Try SQL_ASCII + a multibyte encoding (UNICODE, anyone?) and you're in
for a world of hurt.

ODBC just pushes the question of interpretation into application code,
as I understand it (it does no encoding translation at all?).

The reason that I argue that this is a database misconfiguration is that
you are storing text data in the database and expecting the database to
interpret it *as text*. If you are using 8-bit characters with
SQL_ASCII, the database has to treat it as a bunch of bytes, not as
meaningful text. It's not too surprising that the JDBC driver then has
problems when it needs to interpret that alleged "text" as individual
characters, not bytes.
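
To make the failure concrete, here is a minimal Java sketch of what
happens when bytes that were really written as LATIN1 are read as if
they were UTF-8. This only demonstrates the decoding problem with the
standard java.nio.charset classes; it is not the driver's actual code
path, and the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8DecodeSketch {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "año" as a LATIN1 client would have stored it: 0x61 0xF1 0x6F.
        byte[] storedByLegacyClient = "a\u00f1o".getBytes(StandardCharsets.ISO_8859_1);
        // The same word as proper UTF-8: 0x61 0xC3 0xB1 0x6F.
        byte[] properUtf8 = "a\u00f1o".getBytes(StandardCharsets.UTF_8);

        System.out.println("LATIN1 bytes valid as UTF-8? " + isValidUtf8(storedByLegacyClient));
        System.out.println("UTF-8 bytes valid as UTF-8?  " + isValidUtf8(properUtf8));
    }
}
```

The lone 0xF1 byte starts a multibyte UTF-8 sequence that never
completes, so a strict decoder has no character to produce for it.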

If you want to store a bunch of bytes, use bytea. ResultSet.getBytes()
on bytea works just fine regardless of database encoding;
ResultSet.getString() on text types works just fine if you only use
7-bit characters, or if you set the database encoding correctly.

>> - it is missing changes to the v2 protocol path
>
> I have not tested it, but I think that the v2 protocol path already
> lets you choose the encoding.

From memory, this is only used for pre-7.2 servers which may not be
compiled with encoding support, so we have to manually supply an
encoding to use to interpret text data. Which is actually identical to
the SQL_ASCII case: the database doesn't have sufficient information to
do the raw data -> UNICODE translation, so the client has to be
configured to do it itself.

I think requiring this sort of configuration is a step backwards to the
poor encoding support of the pre-7.3 era. We really should be
encouraging people to move away from SQL_ASCII for anything other than
7-bit ASCII.

If there were some sort of simple SQL_ASCII -> UNICODE (or other
encoding) database conversion tool, would that be a viable alternative?
Then set the default client_encoding appropriately for your existing
non-Java clients.
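
The core of such a tool would be small. The sketch below is
hypothetical (no such tool exists in this thread, and the names are
invented): a human supplies the encoding the data was really written
in, and the bytes are strictly re-encoded as UTF-8, failing loudly
rather than guessing when the data does not match.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ConvertSketch {
    // Re-encode raw text bytes as UTF-8, given the encoding they were
    // actually written in. Throws if the bytes are not valid in that
    // encoding, so a wrong guess is caught instead of silently stored.
    static byte[] toUtf8(byte[] raw, Charset actualEncoding) {
        try {
            CharBuffer chars = actualEncoding.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return chars.toString().getBytes(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                "data is not valid " + actualEncoding.name(), e);
        }
    }

    public static void main(String[] args) {
        byte[] legacy = {0x61, (byte) 0xF1, 0x6F}; // "año" in LATIN1
        byte[] converted = toUtf8(legacy, StandardCharsets.ISO_8859_1);
        System.out.println(converted.length + " bytes after conversion");
    }
}
```

Note that every byte sequence is valid LATIN1, so the strict check only
helps for encodings that can reject input (UTF-8, for example); for
single-byte encodings someone still has to know which one was used.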

Personally I'd like to see one or more of (in rough order of severity):

- JDBC driver emits a warning when connecting to a SQL_ASCII DB
- Backend refuses to set client_encoding to anything but SQL_ASCII when
  server_encoding is SQL_ASCII (currently, it silently accepts other
  encodings, but does no translation -- which is what breaks the JDBC
  driver, as it's expecting to see UTF-8 when client_encoding=UNICODE)
- Backend refuses to accept 8-bit text into a SQL_ASCII database.

The last two seem unlikely to happen any time soon.
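
The first item could look roughly like this from the client side. This
is a sketch under assumptions, not the driver's code: the class, the
method names and the warning text are invented, and it assumes the
server answers SHOW server_encoding (older servers may not).

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class SqlAsciiWarningSketch {
    // Hypothetical helper: true when the reported server encoding is SQL_ASCII.
    static boolean isSqlAscii(String serverEncoding) {
        return serverEncoding != null
            && serverEncoding.trim().equalsIgnoreCase("SQL_ASCII");
    }

    // Asks the server for its encoding and returns a warning message,
    // or null when no warning is needed.
    static String checkEncoding(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW server_encoding")) {
            if (rs.next() && isSqlAscii(rs.getString(1))) {
                return "Database encoding is SQL_ASCII: the server performs no "
                    + "transcoding, so non-7-bit text may not decode correctly.";
            }
            return null;
        }
    }
}
```

An application can run the same check itself after connecting, which
at least makes the problem visible before the first getString() fails.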

-O
