Re: MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?) - Mailing list pgsql-hackers

From Barry Lind
Subject Re: MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?)
Date
Msg-id 3AF861C6.9090705@xythos.com
In response to MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

Peter B. West wrote:

> I'm not entirely sure of the situation here, although I have been
> reading the thread as it has unwound.  Given that I may not understand
> the whole situation, my *philosophical* preference is NOT to build in
> kludges which silently bypass the information which is being passed
> around.
> 
> Initially, I was getting wound up about Latin1 imperialism, but I
> realised that, for SQL_ASCII encoding to work in 8-bit environments up
> to now, users must be working in homogeneous encoding environments,
> where 8 bits coming and going will always represent the same character. 
> In that case it doesn't matter how the character is represented
> internally as long as the round-trip translation is consistent.
> 
> How hard is it to change the single-byte character encoding of a
> database?  If that is currently difficult, why not provide a one-off
> upgrade application which does just that, provided it is going from
> SQL_ASCII to a single-byte encoding?

It is currently not possible to change the character encoding of a 
database once created.  You can specify a character encoding for a newly 
created database only if multibyte is enabled.  The code hardcodes a 
value of 'SQL_ASCII' if multibyte is not enabled.  How difficult it would 
be to change this functionality is a question more appropriately 
answered by others on the list (i.e. I don't know).
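
To make the consequence for the jdbc client concrete, here is a minimal
sketch of the conversion problem discussed in this thread.  It is not the
driver's actual code: the class EncodingSketch, the helper charsetFor()
and its treatAsLatin1 flag are hypothetical names used only for
illustration, and the behavior shown is that of a current JDK.  It decodes
the same four bytes once as 7-bit ASCII (what the driver does today when
the server reports SQL_ASCII) and once as LATIN1 (Tom's suggestion).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingSketch {
    // Hypothetical helper: choose the Java charset used to decode bytes
    // from the server, given the encoding name the server reports.
    static Charset charsetFor(String dbEncoding, boolean treatAsLatin1) {
        if (dbEncoding.equals("SQL_ASCII")) {
            return treatAsLatin1 ? StandardCharsets.ISO_8859_1
                                 : StandardCharsets.US_ASCII;
        }
        // Simplification: server encoding names do not always match
        // Java charset names, so a real mapping table would be needed.
        return Charset.forName(dbEncoding);
    }

    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in LATIN1.
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 };

        // Strict 7-bit interpretation: the high-bit byte cannot be
        // decoded and is replaced, so the original data is lost.
        String asAscii = decode(raw, charsetFor("SQL_ASCII", false));
        System.out.println(asAscii.equals("caf\uFFFD"));

        // LATIN1 interpretation: every byte maps to a character and
        // encoding the string again yields the original bytes.
        String asLatin1 = decode(raw, charsetFor("SQL_ASCII", true));
        byte[] back = asLatin1.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(raw, back));
    }
}
```

Both lines print true.  The LATIN1 path round-trips any byte sequence, but
the characters it shows are only correct if the data really is LATIN1,
which is exactly what the server cannot tell us when multibyte is not
enabled.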

> 
> Alternatively, add a compile switch that specifies an implicit 8-bit
> encoding in which 8-bit SQL_ASCII values are to be understood?  I think
> that the first solution should be as easy to implement, and would be a
> lot cleaner.
> 
> Peter
> 
I agree that your first suggestion would be more desirable IMHO.

thanks,
--Barry

> 
> Barry Lind wrote:
> 
>> Tatsuo Ishii wrote:
>> 
>>>>> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
>>>>> something similar when in fact it doesn't know what the encoding is
>>>>> (i.e. when not compiled with multibyte).
>>>> 
>>> Is that ok for Java? I thought Java needs to know the encoding
>>> beforehand so that it could convert to/from Unicode.
>> 
>> That is actually the original issue that started this thread.  If you
>> want the full thread see the jdbc mail archive list.  A user was
>> complaining that when running on a database without multibyte enabled,
>> he could insert and retrieve 8bit characters through psql, but in
>> jdbc the 8bit characters were converted to ?'s.
>> 
>> I then explained why this was happening (db returns SQL_ASCII as the db
>> character set when not compiled with multibyte) so that character set is
>> used to convert to unicode.
>> 
>> Tom suggested that it would make more sense for jdbc to use LATIN1 when
>> the database reported SQL_ASCII so that most users will see 'correct'
>> behavior in a non multibyte database.  Because currently you need to
>> enable multibyte support in order to use 8bit characters with jdbc.
>> Jdbc could easily be changed to treat SQL_ASCII as LATIN1, but I don't
>> feel that is an appropriate solution for the reasons outlined in this
>> thread (thus the suggestions for UNKNOWN, or the ability for the client
>> to determine if multibyte is enabled or not).
>> 
>>>> I have a philosophical difference with this: basically, I think that
>>>> since SQL_ASCII is the default value, you probably ought to assume that
>>>> it's not too trustworthy.  The software can *never* be said to KNOW what
>>>> the data encoding is; at most it knows what it's been told, and in the
>>>> case of a default it probably hasn't been told anything.  I'd argue that
>>>> SQL_ASCII should be interpreted in the way you are saying "UNKNOWN"
>>>> ought to be: ie, it's an unspecified 8-bit encoding (and from there
>>>> it's not much of a jump to deciding to treat it as LATIN1, if you're
>>>> forced to do conversion to Unicode or whatever).  Certainly, seeing
>>>> SQL_ASCII from the server is not license to throw away data, which is
>>>> what JDBC is doing now.
>>>> 
>>>>> PS.  Note that if multibyte is enabled, the functionality that is being
>>>>> complained about here in the jdbc client is apparently ok for the server
>>>>> to do.  If you insert a value into a text column on a SQL_ASCII database
>>>>> with multibyte enabled and that value contains 8bit characters, those
>>>>> 8bit characters will be quietly replaced with a dummy character since
>>>>> they are invalid for the SQL_ASCII 7bit character set.
>>>> 
>>>> I have not tried it, but if the backend does that then I'd argue that
>>>> that's a bug too.
>>> 
>>> 
>>> I suspect the JDBC driver is responsible for the problem Barry has
>>> reported (I have tried to reproduce the problem using psql without
>>> success).
>>> 
>>> From interfaces/jdbc/org/postgresql/Connection.java:
>>> 
>>>>        if (dbEncoding.equals("SQL_ASCII")) {
>>>>          dbEncoding = "ASCII";
>>> 
>>> 
>>> BTW, even if the backend behaves like that, I don't think it's a
>>> bug. Since SQL_ASCII is nothing more than an ascii encoding.
>> 
>> I believe Tom's point is that if multibyte is not enabled this isn't
>> true, since SQL_ASCII then means whatever character set the client wants
>> to use against the server as the server really doesn't care what single
>> byte data is being inserted/selected from the database.
>> 
>>>> To my mind, a MULTIBYTE backend operating in
>>>> SQL_ASCII encoding ought to behave the same as a non-MULTIBYTE backend:
>>>> transparent pass-through of characters with the high bit set.  But I'm
>>>> not a multibyte guru.  Comments anyone?
>>> 
>>> 
>>> If you expect that behavior, I think the encoding name 'UNKNOWN' or
>>> something like that seems more appropriate. (SQL_)ASCII is just an
>>> ascii IMHO.
>> 
>> I agree.
> 


