Re: MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?) - Mailing list pgsql-hackers
From | Barry Lind |
---|---|
Subject | Re: MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?) |
Date | |
Msg-id | 3AF861C6.9090705@xythos.com |
In response to | MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?) (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
Peter B. West wrote:

> I'm not entirely sure of the situation here, although I have been
> reading the thread as it has unwound. Given that I may not understand
> the whole situation, my *philosophical* preference is NOT to build in
> kludges which silently bypass the information which is being passed
> around.
>
> Initially, I was getting wound up about Latin1 imperialism, but I
> realised that, for SQL_ASCII encoding to work in 8-bit environments up
> to now, users must be working in homogeneous encoding environments,
> where 8 bits coming and going will always represent the same character.
> In that case it doesn't matter how the character is represented
> internally as long as the round-trip translation is consistent.
>
> How hard is it to change the single-byte character encoding of a
> database? If that is currently difficult, why not provide a one-off
> upgrade application which does just that, provided it is going from
> SQL_ASCII to a single-byte encoding?

It is currently not possible to change the character encoding of a
database once it has been created. You can specify a character encoding
for a newly created database only if multibyte is enabled. The code
hardcodes a value of 'SQL_ASCII' if multibyte is not enabled. How
difficult it would be to change this functionality is a question more
appropriately answered by others on the list (i.e. I don't know).

> Alternatively, add a compile switch that specifies an implicit 8-bit
> encoding in which 8-bit SQL_ASCII values are to be understood? I think
> that the first solution should be as easy to implement, and would be a
> lot cleaner.
>
> Peter

I agree that your first suggestion would be more desirable IMHO.

thanks,
--Barry

> Barry Lind wrote:
>
>> Tatsuo Ishii wrote:
>>
>>>>> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
>>>>> something similar when in fact it doesn't know what the encoding is
>>>>> (i.e. when not compiled with multibyte).
>>>>
>>> Is that ok for Java? I thought Java needs to know the encoding
>>> beforehand so that it could convert to/from Unicode.
>>
>> That is actually the original issue that started this thread. If you
>> want the full thread see the jdbc mail archive list. A user was
>> complaining that when running on a database without multibyte enabled,
>> that through psql he could insert and retrieve 8bit characters, but in
>> jdbc the 8bit characters were converted to ?'s.
>>
>> I then explained why this was happening (db returns SQL_ASCII as the db
>> character set when not compiled with multibyte) so that character set is
>> used to convert to unicode.
>>
>> Tom suggested that it would make more sense for jdbc to use LATIN1 when
>> the database reported SQL_ASCII so that most users will see 'correct'
>> behavior in a non multibyte database. Because currently you need to
>> enable multibyte support in order to use 8bit characters with jdbc.
>> Jdbc could easily be changed to treat SQL_ASCII as LATIN1, but I don't
>> feel that is an appropriate solution for the reasons outlined in this
>> thread (thus the suggestions for UNKNOWN, or the ability for the client
>> to determine if multibyte is enabled or not).
>>
>>>> I have a philosophical difference with this: basically, I think that
>>>> since SQL_ASCII is the default value, you probably ought to assume that
>>>> it's not too trustworthy. The software can *never* be said to KNOW what
>>>> the data encoding is; at most it knows what it's been told, and in the
>>>> case of a default it probably hasn't been told anything. I'd argue that
>>>> SQL_ASCII should be interpreted in the way you are saying "UNKNOWN"
>>>> ought to be: ie, it's an unspecified 8-bit encoding (and from there
>>>> it's not much of a jump to deciding to treat it as LATIN1, if you're
>>>> forced to do conversion to Unicode or whatever). Certainly, seeing
>>>> SQL_ASCII from the server is not license to throw away data, which is
>>>> what JDBC is doing now.
>>>>
>>>>> PS. Note that if multibyte is enabled, the functionality that is being
>>>>> complained about here in the jdbc client is apparently ok for the server
>>>>> to do. If you insert a value into a text column on a SQL_ASCII database
>>>>> with multibyte enabled and that value contains 8bit characters, those
>>>>> 8bit characters will be quietly replaced with a dummy character since
>>>>> they are invalid for the SQL_ASCII 7bit character set.
>>>>
>>>> I have not tried it, but if the backend does that then I'd argue that
>>>> that's a bug too.
>>>
>>>
>>> I suspect the JDBC driver is responsible for the problem Barry has
>>> reported (I have tried to reproduce the problem using psql without
>>> success).
>>>
>>> From interfaces/jdbc/org/postgresql/Connection.java:
>>>
>>>> if (dbEncoding.equals("SQL_ASCII")) {
>>>> dbEncoding = "ASCII";
>>>
>>>
>>> BTW, even if the backend behaves like that, I don't think it's a
>>> bug. Since SQL_ASCII is nothing more than an ascii encoding.
>>
>> I believe Tom's point is that if multibyte is not enabled this isn't
>> true, since SQL_ASCII then means whatever character set the client wants
>> to use against the server as the server really doesn't care what single
>> byte data is being inserted/selected from the database.
>>
>>>> To my mind, a MULTIBYTE backend operating in
>>>> SQL_ASCII encoding ought to behave the same as a non-MULTIBYTE backend:
>>>> transparent pass-through of characters with the high bit set. But I'm
>>>> not a multibyte guru. Comments anyone?
>>>
>>>
>>> If you expect that behavior, I think the encoding name 'UNKNOWN' or
>>> something like that seems more appropriate. (SQL_)ASCII is just an
>>> ascii IMHO.
>>
>> I agree.
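
For anyone following along, the effect discussed in this thread (8bit characters turning into ?'s when SQL_ASCII is mapped to Java's ASCII charset) can be reproduced without a database. The following is only a minimal sketch, not the driver's actual code: the class `EncodingRoundTrip` and the helpers `charsetFor` and `roundTrip` are hypothetical names invented for illustration, and the `latin1Fallback` flag merely stands in for the "treat SQL_ASCII as LATIN1" suggestion. It assumes the client data is single-byte and that the only question is which Java charset the reported encoding name is mapped to.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration only; hypothetical names, not taken from the JDBC driver. */
final class EncodingRoundTrip {

    // Hypothetical mapping from the server-reported encoding name to a Java charset.
    // latin1Fallback == false mirrors the quoted snippet (SQL_ASCII -> ASCII);
    // latin1Fallback == true mirrors the suggestion to treat SQL_ASCII as LATIN1.
    static Charset charsetFor(String dbEncoding, boolean latin1Fallback) {
        if (dbEncoding.equals("SQL_ASCII")) {
            return latin1Fallback ? StandardCharsets.ISO_8859_1
                                  : StandardCharsets.US_ASCII;
        }
        // Simplification: assumes any other name is one Java recognises.
        return Charset.forName(dbEncoding);
    }

    // Decode bytes as they would arrive from the server, then encode them
    // again as they would be sent back, both with the same charset.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs); // malformed bytes become U+FFFD
        return decoded.getBytes(cs);          // unmappable chars become the charset's replacement
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'c' 'a' 'f' followed by 0xE9 (e-acute in Latin-1), as an 8bit client would store it.
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 };

        System.out.println("original     : " + hex(raw));
        System.out.println("via ASCII    : " + hex(roundTrip(raw, charsetFor("SQL_ASCII", false))));
        System.out.println("via ISO8859-1: " + hex(roundTrip(raw, charsetFor("SQL_ASCII", true))));
    }
}
```

With the ASCII mapping the high-bit byte does not survive the round trip (it comes back as 3F, a question mark), while the ISO-8859-1 mapping passes every byte value through unchanged. That pass-through is lossless only at the byte level; whether the resulting Java characters are the ones the user intended still depends on the client really using Latin-1, which is the objection raised above to hardwiring LATIN1.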