Re: Charset encoding patch to JDBC driver - Mailing list pgsql-jdbc
From | Oliver Jowett |
---|---|
Subject | Re: Charset encoding patch to JDBC driver |
Date | |
Msg-id | 4239E942.5060608@opencloud.com |
In response to | Re: Charset encoding patch to JDBC driver (Javier Yáñez <javier@cibal.es>) |
List | pgsql-jdbc |
Javier Yáñez wrote:

> I think that this patch is necessary to resolve some problems of the
> real life. In my particular case I have to make a j2ee application to
> access a existing database. This database is SQL-ASCII encoding, with
> the actual version of pgjdbc when the result of a query contains a 8
> bits character (very common in Spanish) appears this error:

Indeed. You should be using a LATIN1 or UNICODE database encoding in
this case.

> I can not say to my customer that changes the database encoding
> because other applications (non-java) could not work or show strange
> characters.

You can change database encoding, then change the default
client_encoding for clients that do not set client_encoding themselves.

For example, translate your database to UNICODE or LATIN1. Set the
default client_encoding to LATIN1. Then JDBC will explicitly set
client_encoding=UNICODE, and other clients will get LATIN1 data unless
they explicitly change client_encoding.

This is how server_encoding / client_encoding are *meant* to work.. once
you tell the database how its text data is encoded, clients can choose
what format they get the data in and the server does the transcoding
work automatically.

> By other hand, I do not think that to use SQL-ASCII encoding is a
> database misconfiguration. I do not think that storing 8bit data in a
> SQL_ASCII database is incorrect. Others applications are using the same
> database with ODBC without problem.

Try SQL_ASCII + multibyte encoding (UNICODE anyone?) and you're in for a
world of hurt.. ODBC just pushes the question of interpretation into
application code, as I understand it (it does no encoding translation at
all?).

The reason that I argue that this is a database misconfiguration is that
you are storing text data in the database and expecting the database to
interpret it *as text*. If you are using 8-bit characters with
SQL_ASCII, the database has to treat it as a bunch of bytes, not as
meaningful text.
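To make the bytes-versus-text point concrete, here is a minimal sketch in plain Java with no database or driver involved. The class name `BytesVersusText` and its methods are made up for illustration; the byte values are simply what a LATIN1 client would send for the surname in this thread. The same five bytes read as text or as damage depending only on which encoding the reader assumes, and that assumption is exactly the information a SQL_ASCII database does not record.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of pgjdbc or PostgreSQL.
final class BytesVersusText {
    // Bytes a LATIN1 client would send for the surname in this thread:
    // 'Y', a-acute (0xE1), n-tilde (0xF1), 'e', 'z'.
    static final byte[] STORED = {0x59, (byte) 0xE1, (byte) 0xF1, 0x65, 0x7A};

    // Reader that assumes LATIN1: every byte maps to exactly one character.
    static String asLatin1() {
        return new String(STORED, StandardCharsets.ISO_8859_1);
    }

    // Reader that assumes UTF-8: 0xE1 and 0xF1 are not valid here, so
    // String substitutes U+FFFD (the replacement character) for them.
    static String asUtf8() {
        return new String(STORED, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("assumed LATIN1: " + asLatin1());
        System.out.println("assumed UTF-8:  " + asUtf8());
    }
}
```

Neither reader is wrong about the bytes; they disagree about the text, which is why the interpretation has to be fixed somewhere, preferably in the database encoding.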
It's not too surprising that the JDBC driver then has problems when it
needs to interpret that alleged "text" as individual characters, not
bytes.

If you want to store a bunch of bytes, use bytea. ResultSet.getBytes()
on bytea works just fine regardless of database encoding;
ResultSet.getString() on text types works just fine if you only use
7-bit characters, or if you set the database encoding correctly.

>> - it is missing changes to the v2 protocol path
>
> I have not proven it, but I think that the v2 protocol has the
> functionality of choose the encoding.

From memory this is only used for pre-7.2 servers which may not be
compiled with encoding support, so we have to manually supply an
encoding to use to interpret text data. Which is actually identical to
the SQL_ASCII case: the database doesn't have sufficient information to
do the raw data -> UNICODE translation, so the client has to be
configured to do it itself.

I think requiring this sort of configuration is a step backwards to the
poor encoding support of the pre-7.3 era. We really should be
encouraging people to move away from SQL_ASCII for anything other than
7-bit ASCII.

If there was some sort of simple SQL_ASCII -> UNICODE (or other
encoding) database conversion tool, would that be a viable alternative?
Then set the default client_encoding appropriately for your existing
non-Java clients.

Personally I'd like to see one or more of (in rough order of severity):

- JDBC driver emits a warning when connecting to a SQL_ASCII DB
- Backend refuses to set client_encoding to anything but SQL_ASCII when
  server_encoding is SQL_ASCII (currently, it silently accepts other
  encodings, but does no translation -- which is what breaks the JDBC
  driver, as it's expecting to see UTF-8 when client_encoding=UNICODE)
- Backend refuses to accept 8-bit text into a SQL_ASCII database.

The last two seem unlikely to happen any time soon..

-O
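The mismatch described in the second item above (a client expecting UTF-8 while the server passes stored bytes through untranslated) can be sketched with a strict decoder from the Java standard library. This is not the driver's code, and the sketch makes no claim about whether pgjdbc raises an error or substitutes characters at this point; `StrictDecode` and `decodeOrNull` are hypothetical names used only here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of pgjdbc.
final class StrictDecode {
    // Decodes raw bytes strictly; returns null when the bytes are not
    // valid in the given charset instead of silently substituting.
    static String decodeOrNull(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // LATIN1 bytes for the surname: 'Y', 0xE1, 0xF1, 'e', 'z'.
        byte[] untranslated = {0x59, (byte) 0xE1, (byte) 0xF1, 0x65, 0x7A};
        // What a transcoding server would send for the same text as UTF-8.
        byte[] transcoded = "Y\u00e1\u00f1ez".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeOrNull(untranslated, StandardCharsets.UTF_8));
        System.out.println(decodeOrNull(transcoded, StandardCharsets.UTF_8));
    }
}
```

The first call fails because 0xE1 opens a multi-byte UTF-8 sequence that 0xF1 cannot continue; the second succeeds because the bytes were really produced as UTF-8. Pure 7-bit data decodes identically either way, which matches the observation that getString() is fine as long as only 7-bit characters are stored.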