Thread: MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?)

[ thread renamed and cross-posted to pghackers, since this isn't only
about JDBC anymore ]

Barry Lind <barry@xythos.com> writes:
> The basic issue I have is that the server is providing an API to the
> client to get the character encoding for the database and that API can
> report incorrect information to the client.

I don't have any objection to changing the system so that even a
non-MULTIBYTE server can store and return encoding settings.
(Presumably it should only accept encoding settings that correspond
to single-byte encodings.)  That can't happen before 7.2, however,
as the necessary changes are a bit larger than I'd care to shoehorn
into a 7.1.* release.

> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
> something similar when in fact it doesn't know what the encoding is
> (i.e. when not compiled with multibyte).

I have a philosophical difference with this: basically, I think that
since SQL_ASCII is the default value, you probably ought to assume that
it's not too trustworthy.  The software can *never* be said to KNOW what
the data encoding is; at most it knows what it's been told, and in the
case of a default it probably hasn't been told anything.  I'd argue that
SQL_ASCII should be interpreted in the way you are saying "UNKNOWN"
ought to be: ie, it's an unspecified 8-bit encoding (and from there
it's not much of a jump to deciding to treat it as LATIN1, if you're
forced to do conversion to Unicode or whatever).  Certainly, seeing
SQL_ASCII from the server is not license to throw away data, which is
what JDBC is doing now.
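
To make that concrete, here is a minimal standalone sketch (plain Java,
not the driver's actual code) of why LATIN1 is a safe fallback for an
unspecified 8-bit encoding: every byte maps to exactly one character
and round-trips losslessly, whereas decoding as ASCII throws the
high-bit data away.

import java.io.UnsupportedEncodingException;

public class Latin1RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Four bytes as they might come back from the server; 0xE9 is
        // e-acute in LATIN1 but is not a valid US-ASCII byte.
        byte[] fromServer = { 'c', 'a', 'f', (byte) 0xE9 };

        String asLatin1 = new String(fromServer, "ISO-8859-1");
        String asAscii  = new String(fromServer, "US-ASCII");

        // LATIN1 round-trips losslessly: the original byte comes back.
        byte[] back = asLatin1.getBytes("ISO-8859-1");
        System.out.println(back[3] == (byte) 0xE9);   // true
        // ASCII decoding replaced the byte with U+FFFD; the data is gone.
        System.out.println((int) asAscii.charAt(3));  // 65533
    }
}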

> PS.  Note that if multibyte is enabled, the functionality that is being
> complained about here in the jdbc client is apparently ok for the server
> to do.  If you insert a value into a text column on a SQL_ASCII database
> with multibyte enabled and that value contains 8bit characters, those
> 8bit characters will be quietly replaced with a dummy character since
> they are invalid for the SQL_ASCII 7bit character set.

I have not tried it, but if the backend does that then I'd argue that
that's a bug too.  To my mind, a MULTIBYTE backend operating in
SQL_ASCII encoding ought to behave the same as a non-MULTIBYTE backend:
transparent pass-through of characters with the high bit set.  But I'm
not a multibyte guru.  Comments anyone?

            regards, tom lane

> > Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
> > something similar when in fact it doesn't know what the encoding is
> > (i.e. when not compiled with multibyte).

Is that ok for Java? I thought Java needs to know the encoding
beforehand so that it could convert to/from Unicode.

> I have a philosophical difference with this: basically, I think that
> since SQL_ASCII is the default value, you probably ought to assume that
> it's not too trustworthy.  The software can *never* be said to KNOW what
> the data encoding is; at most it knows what it's been told, and in the
> case of a default it probably hasn't been told anything.  I'd argue that
> SQL_ASCII should be interpreted in the way you are saying "UNKNOWN"
> ought to be: ie, it's an unspecified 8-bit encoding (and from there
> it's not much of a jump to deciding to treat it as LATIN1, if you're
> forced to do conversion to Unicode or whatever).  Certainly, seeing
> SQL_ASCII from the server is not license to throw away data, which is
> what JDBC is doing now.
>
> > PS.  Note that if multibyte is enabled, the functionality that is being
> > complained about here in the jdbc client is apparently ok for the server
> > to do.  If you insert a value into a text column on a SQL_ASCII database
> > with multibyte enabled and that value contains 8bit characters, those
> > 8bit characters will be quietly replaced with a dummy character since
> > they are invalid for the SQL_ASCII 7bit character set.
>
> I have not tried it, but if the backend does that then I'd argue that
> that's a bug too.

I suspect the JDBC driver is responsible for the problem Barry has
reported (I have tried to reproduce the problem using psql without
success).

From interfaces/jdbc/org/postgresql/Connection.java:

>        if (dbEncoding.equals("SQL_ASCII")) {
>          dbEncoding = "ASCII";

BTW, even if the backend behaves like that, I don't think it's a
bug, since SQL_ASCII is nothing more than an ASCII encoding.
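
For what it's worth, here is a tiny standalone sketch (illustrative
only, not the driver source) of what that mapping does on the Java
side: once dbEncoding is "ASCII", any character outside 7-bit ASCII is
replaced with '?' when a string is encoded.

import java.io.UnsupportedEncodingException;

public class AsciiEncodeLoss {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "caf\u00E9";   // application data with one 8-bit character

        byte[] viaAscii  = value.getBytes("US-ASCII");    // unmappable char becomes '?'
        byte[] viaLatin1 = value.getBytes("ISO-8859-1");  // byte 0xE9 is preserved

        System.out.println((char) viaAscii[3]);                        // prints ?
        System.out.println(Integer.toHexString(viaLatin1[3] & 0xff));  // prints e9
    }
}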

> To my mind, a MULTIBYTE backend operating in
> SQL_ASCII encoding ought to behave the same as a non-MULTIBYTE backend:
> transparent pass-through of characters with the high bit set.  But I'm
> not a multibyte guru.  Comments anyone?

If you expect that behavior, I think the encoding name 'UNKNOWN' or
something like that seems more appropriate. (SQL_)ASCII is just
ASCII, IMHO.
--
Tatsuo Ishii


Tatsuo Ishii wrote:

>>> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
>>> something similar when in fact it doesn't know what the encoding is
>>> (i.e. when not compiled with multibyte).
>>
>
> Is that ok for Java? I thought Java needs to know the encoding
> beforehand so that it could convert to/from Unicode.

That is actually the original issue that started this thread.  If you
want the full thread, see the jdbc mail archive list.  A user was
complaining that when running on a database without multibyte enabled,
he could insert and retrieve 8-bit characters through psql, but in JDBC
the 8-bit characters were converted to ?'s.

I then explained why this was happening: the db returns SQL_ASCII as
the db character set when not compiled with multibyte, so that
character set is used to convert to Unicode.

Tom suggested that it would make more sense for JDBC to use LATIN1 when
the database reports SQL_ASCII, so that most users will see 'correct'
behavior in a non-multibyte database; currently you need to enable
multibyte support in order to use 8-bit characters with JDBC.  JDBC
could easily be changed to treat SQL_ASCII as LATIN1, but I don't feel
that is an appropriate solution, for the reasons outlined in this
thread (thus the suggestions for UNKNOWN, or the ability for the client
to determine whether multibyte is enabled).
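
To be clear about the size of the change I'm declining: it would amount
to mapping SQL_ASCII to the Java name for LATIN1 instead of "ASCII" in
the driver's encoding handling, roughly like this (a sketch only,
"ISO8859_1" being the Java spelling of LATIN1):

public class EncodingMapping {
    static String toJavaEncoding(String dbEncoding) {
        if (dbEncoding.equals("SQL_ASCII")) {
            return "ISO8859_1";   // instead of "ASCII", which the driver uses today
        }
        return dbEncoding;
    }
}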

>
>> I have a philosophical difference with this: basically, I think that
>> since SQL_ASCII is the default value, you probably ought to assume that
>> it's not too trustworthy.  The software can *never* be said to KNOW what
>> the data encoding is; at most it knows what it's been told, and in the
>> case of a default it probably hasn't been told anything.  I'd argue that
>> SQL_ASCII should be interpreted in the way you are saying "UNKNOWN"
>> ought to be: ie, it's an unspecified 8-bit encoding (and from there
>> it's not much of a jump to deciding to treat it as LATIN1, if you're
>> forced to do conversion to Unicode or whatever).  Certainly, seeing
>> SQL_ASCII from the server is not license to throw away data, which is
>> what JDBC is doing now.
>>
>>> PS.  Note that if multibyte is enabled, the functionality that is being
>>> complained about here in the jdbc client is apparently ok for the server
>>> to do.  If you insert a value into a text column on a SQL_ASCII database
>>> with multibyte enabled and that value contains 8bit characters, those
>>> 8bit characters will be quietly replaced with a dummy character since
>>> they are invalid for the SQL_ASCII 7bit character set.
>>
>> I have not tried it, but if the backend does that then I'd argue that
>> that's a bug too.
>
>
> I suspect the JDBC driver is responsible for the problem Barry has
> reported (I have tried to reproduce the problem using psql without
> success).
>
> From interfaces/jdbc/org/postgresql/Connection.java:
>
>>        if (dbEncoding.equals("SQL_ASCII")) {
>>          dbEncoding = "ASCII";
>
>
> BTW, even if the backend behaves like that, I don't think it's a
> bug, since SQL_ASCII is nothing more than an ASCII encoding.

I believe Tom's point is that if multibyte is not enabled this isn't
true, since SQL_ASCII then means whatever character set the client wants
to use against the server, as the server really doesn't care what
single-byte data is being inserted into or selected from the database.

>
>> To my mind, a MULTIBYTE backend operating in
>> SQL_ASCII encoding ought to behave the same as a non-MULTIBYTE backend:
>> transparent pass-through of characters with the high bit set.  But I'm
>> not a multibyte guru.  Comments anyone?
>
>
> If you expect that behavior, I think the encoding name 'UNKNOWN' or
> something like that seems more appropriate. (SQL_)ASCII is just
> ASCII, IMHO.

I agree.

>
> --
> Tatsuo Ishii
>
--Barry


> >>> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
> >>> something similar when in fact it doesn't know what the encoding is
> >>> (i.e. when not compiled with multibyte).
> >>
> >
> > Is that ok for Java? I thought Java needs to know the encoding
> > beforehand so that it could convert to/from Unicode.
>
> That is actually the original issue that started this thread.  If you
> want the full thread, see the jdbc mail archive list.  A user was
> complaining that when running on a database without multibyte enabled,
> he could insert and retrieve 8-bit characters through psql, but in JDBC
> the 8-bit characters were converted to ?'s.

Still I don't see what you want in the JDBC driver if PostgreSQL
returned "UNKNOWN" to indicate that the backend is not compiled with
MULTIBYTE. Do you want exactly the same behavior as the pre-7.1 driver,
i.e. reading data from the PostgreSQL backend, assume the Java client's
default encoding (set by locale or something else) and convert it to
UTF-8? If so, that would make sense to me...
--
Tatsuo Ishii


Tatsuo Ishii wrote:

>>>>> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
>>>>> something similar when in fact it doesn't know what the encoding is
>>>>> (i.e. when not compiled with multibyte).
>>>>
>>> Is that ok for Java? I thought Java needs to know the encoding
>>> beforehand so that it could convert to/from Unicode.
>>
>> That is actually the original issue that started this thread.  If you
>> want the full thread, see the jdbc mail archive list.  A user was
>> complaining that when running on a database without multibyte enabled,
>> he could insert and retrieve 8-bit characters through psql, but in JDBC
>> the 8-bit characters were converted to ?'s.
>
>
> Still I don't see what you want in the JDBC driver if PostgreSQL
> returned "UNKNOWN" to indicate that the backend is not compiled with
> MULTIBYTE. Do you want exactly the same behavior as the pre-7.1 driver,
> i.e. reading data from the PostgreSQL backend, assume the Java client's
> default encoding (set by locale or something else) and convert it to
> UTF-8? If so, that would make sense to me...

My suggestion would be that if the jdbc client were able to determine
that the server character set is UNKNOWN (i.e. no multibyte), it would
then use some appropriate default character set to perform conversions
to UCS2 (LATIN1 would probably make the most sense as a default).  The
jdbc driver would keep its existing behavior if the character set is
SQL_ASCII and multibyte is enabled (i.e. only support 7-bit characters,
just like the backend does).

Note that the user is always able to override the character set used for
conversion by setting the charSet property.
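
For example, something along these lines (made-up URL, database and
credentials; the value just has to be a charset name the JDK accepts)
forces LATIN1 conversion no matter what the backend reports:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class CharsetOverride {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");

        Properties props = new Properties();
        props.setProperty("user", "test");          // made-up credentials
        props.setProperty("password", "test");
        props.setProperty("charSet", "ISO-8859-1"); // override the reported encoding

        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", props);
        // ... all text conversions now treat the data as LATIN1 ...
        conn.close();
    }
}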

Tom also mentioned that it might be possible for the server to support
setting the character set for a database even when multibyte wasn't
enabled.  That would then allow clients like jdbc to get a value from
non-multibyte enabled servers that would be more meaningful than the
current SQL_ASCII.  If this were done, then the 'UNKNOWN' hack would
not be necessary.

thanks,
--Barry

>
> --
> Tatsuo Ishii
>



Peter B. West wrote:

> I'm not entirely sure of the situation here, although I have been
> reading the thread as it has unwound.  Given that I may not understand
> the whole situation, my *philosophical* preference is NOT to build in
> kludges which silently bypass the information which is being passed
> around.
> 
> Initially, I was getting wound up about Latin1 imperialism, but I
> realised that, for SQL_ASCII encoding to work in 8-bit environments up
> to now, users must be working in homogeneous encoding environments,
> where 8 bits coming and going will always represent the same character. 
> In that case it doesn't matter how the character is represented
> internally as long as the round-trip translation is consistent.
> 
> How hard is it to change the single-byte character encoding of a
> database?  If that is currently difficult, why not provide a one-off
> upgrade application which does just that, provided it is going from
> SQL_ASCII to a single-byte encoding?

It is currently not possible to change the character encoding of a 
database once it is created.  You can specify a character encoding for a 
newly created database only if multibyte is enabled; the code hardcodes 
a value of 'SQL_ASCII' if multibyte is not enabled.  How difficult it 
would be to change this is a question more appropriately answered by 
others on the list (i.e. I don't know).

> 
> Alternatively, add a compile switch that specifies an implicit 8-bit
> encoding in which 8-bit SQL_ASCII values are to be understood?  I think
> that the first solution should be as easy to implement, and would be a
> lot cleaner.
> 
> Peter
> 
I agree that your first suggestion would be more desirable IMHO.

thanks,
--Barry


> > Still I don't see what you want in the JDBC driver if PostgreSQL
> > returned "UNKNOWN" to indicate that the backend is not compiled with
> > MULTIBYTE. Do you want exactly the same behavior as the pre-7.1 driver,
> > i.e. reading data from the PostgreSQL backend, assume the Java client's
> > default encoding (set by locale or something else) and convert it to
> > UTF-8? If so, that would make sense to me...
>
> My suggestion would be that if the jdbc client were able to determine
> that the server character set is UNKNOWN (i.e. no multibyte), it would
> then use some appropriate default character set to perform conversions
> to UCS2 (LATIN1 would probably make the most sense as a default).  The
> jdbc driver would keep its existing behavior if the character set is
> SQL_ASCII and multibyte is enabled (i.e. only support 7-bit characters,
> just like the backend does).
>
> Note that the user is always able to override the character set used for
> conversion by setting the charSet property.

I see.  However I would say we could not change the current behavior
of the backend until 7.2 is out. It is our policy that we do not
add or change existing functionality while we are in a minor release
cycle.

What about doing something like this:

1. call pg_encoding_to_char(1)    (actually any number except 0 is ok)

2. if it returns "SQL_ASCII", then you could assume that MULTIBYTE is
not enabled.

This is pretty ugly, but should work.
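
In JDBC terms the probe could look roughly like this (a sketch only;
the mapping from PostgreSQL encoding names to Java charset names is
left out):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class MultibyteProbe {
    // Decide which Java charset the client should use for conversions.
    static String pickJavaCharset(Connection conn) throws SQLException {
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery(
            "SELECT getdatabaseencoding(), pg_encoding_to_char(1)");
        rs.next();
        String dbEncoding = rs.getString(1);
        String probe = rs.getString(2);   // any encoding number except 0 will do
        rs.close();
        st.close();

        if ("SQL_ASCII".equals(dbEncoding) && "SQL_ASCII".equals(probe)) {
            // Backend built without MULTIBYTE: the encoding is really unknown,
            // so fall back to a client-chosen default such as LATIN1.
            return "ISO-8859-1";
        }
        return dbEncoding;   // otherwise trust the reported name
    }
}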

> Tom also mentioned that it might be possible for the server to support
> setting the character set for a database even when multibyte wasn't
> enabled.  That would then allow clients like jdbc to get a value from
> non-multibyte enabled servers that would be more meaningful than the
> current SQL_ASCII.  If this were done, then the 'UNKNOWN' hack would
> not be necessary.

Tom's suggestion does not sound reasonable to me. If PostgreSQL is not
built with MULTIBYTE, then there is no such concept as an "encoding" in
PostgreSQL, because there is no code to handle encodings. Thus it would
be meaningless to assign an "encoding" to a database if MULTIBYTE is not
enabled.
--
Tatsuo Ishii

Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Tom also mentioned that it might be possible for the server to support
>> setting the character set for a database even when multibyte wasn't
>> enabled.  That would then allow clients like jdbc to get a value from
>> non-multibyte enabled servers that would be more meaningful than the
>> current SQL_ASCII.  If this were done, then the 'UNKNOWN' hack would
>> not be necessary.

> Tom's suggestion does not sound reasonable to me. If PostgreSQL is not
> built with MULTIBYTE, then there is no such concept as an "encoding" in
> PostgreSQL, because there is no code to handle encodings. Thus it would
> be meaningless to assign an "encoding" to a database if MULTIBYTE is not
> enabled.

Why?  Without the MULTIBYTE code, the backend cannot perform character
set translations --- but it's perfectly possible that someone might not
need translations.  A lot of European sites are probably very happy
as long as the server gives them back the same 8-bit characters they
stored.  But what they would like, if they have to deal with tools like
JDBC, is to *identify* what character set they are storing data in, so
that their data will be correctly translated to Unicode or whatever.
The obvious way to do that is to allow them to set the value that
getdatabaseencoding() will return.

Essentially, my point is that identifying the character set is useful
to support outside-the-database character set conversions, whether or
not we have compiled the code for inside-the-database conversions.
Moreover, the stored data certainly has some encoding, whether or not
the database contains code that knows enough to do anything useful about
the encoding.  So it's not "meaningless" to be able to store and report
an encoding value.
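
To sketch what I mean by an outside-the-database conversion (the
name-to-charset table is illustrative, not complete):

import java.io.UnsupportedEncodingException;

public class ClientSideConversion {
    // Map the encoding name the server *reports* to a Java charset name.
    static String javaCharsetFor(String pgEncoding) {
        if ("LATIN1".equals(pgEncoding))  return "ISO-8859-1";
        if ("UNICODE".equals(pgEncoding)) return "UTF-8";
        // Unspecified 8-bit data: fall back to a lossless one-byte encoding.
        return "ISO-8859-1";
    }

    // The client, not the backend, does the actual translation.
    static String decode(byte[] fromServer, String pgEncoding)
            throws UnsupportedEncodingException {
        return new String(fromServer, javaCharsetFor(pgEncoding));
    }
}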

I am not sure how much of the MULTIBYTE code would have to be activated
to allow this, but surely it's only a small fraction of the complete
feature.

            regards, tom lane