Thread: [Fwd: Patch for MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?)]]

The following patch for JDBC fixes an issue where the JDBC driver, running
against a non-multibyte database, loses 8-bit characters.  This patch will cause
the jdbc driver to ignore the encoding reported by the database when
multibyte isn't enabled and use the JVM default in that case.

thanks,
--Barry


-------- Original Message --------
Subject: Re: [HACKERS] MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug
with pgsql 7.1/jdbc and non-ascii (8-bit) chars?)
Date: Fri, 25 May 2001 17:12:09 -0700
From: Barry Lind
To: Tatsuo Ishii , tgl@sss.pgh.pa.us
References: <3AF74768.8060807@xythos.com>
<20010508110249R.t-ishii@sra.co.jp> <3AF78113.6080907@xythos.com>
<20010509102305C.t-ishii@sra.co.jp>



Tatsuo, Tom,

Since the two of you were the only two that seemed to care about this
thread, I am addressing you directly.  I want to come to some sort of
resolution.  Since it doesn't appear that anything is going to be
changed in the backend code in 7.2 to address the issue here, I will
submit the attached patch to the jdbc code.

This patch uses the function pg_encoding_to_char(1) to determine that
multibyte is not enabled on the server (as suggested by Tatsuo), and in
that case will use the default JVM character set to convert data from
the backend. This replaces the current behaviour, which forces
all data to 7-bit ASCII in the non-multibyte case because
getdatabaseencoding() always returns SQL_ASCII for non-multibyte databases.
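
To illustrate the data loss described above, here is a minimal sketch in
modern Java.  It is not part of the patch and not the driver's code: the
class name EightBitDecodeDemo and the sample bytes are made up for
illustration, and it uses java.nio.charset, which the JDKs of 2001 did not
have, so the driver's real conversion path differs.  It decodes the same
backend bytes once as 7-bit ASCII and once as LATIN1:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EightBitDecodeDemo {

    // Decode raw bytes received from the backend with the given charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Hypothetical column value written by a LATIN1 client:
        // 'c', 'a', 'f', then 0xE9 (the 8-bit character e-acute).
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // 0xE9 is not valid 7-bit ASCII, so it decodes to U+FFFD.
        String asAscii = decode(raw, StandardCharsets.US_ASCII);

        // In ISO-8859-1 every byte maps to a character; 0xE9 is U+00E9.
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // The JVM default depends on the platform and locale settings.
        String asDefault = decode(raw, Charset.defaultCharset());

        System.out.println("US-ASCII last char:   U+"
            + Integer.toHexString(asAscii.charAt(3)).toUpperCase());
        System.out.println("ISO-8859-1 last char: U+"
            + Integer.toHexString(asLatin1.charAt(3)).toUpperCase());
        System.out.println("JVM default (" + Charset.defaultCharset()
            + ") length: " + asDefault.length());
    }
}
```

With US-ASCII the 8-bit byte is replaced and the original character cannot
be recovered; with a single-byte 8-bit charset it survives.  What the JVM
default yields depends on the client machine, which is why the charSet
property remains available as an override.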

If I don't hear anything, I will go ahead and submit this patch.

thanks for your help on this issue.

--Barry


Tatsuo Ishii wrote:

>>> Still I don't see what you are wanting in the JDBC driver if
>>> PostgreSQL would return "UNKNOWN" indicating that the backend is not
>>> compiled with MULTIBYTE. Do you want exactly the same behavior as the
>>> driver prior to 7.1? I.e., when reading data from the PostgreSQL
>>> backend, assume its encoding is the Java client's default (set by
>>> locale or something else) and convert it to UTF-8. If so, that would make sense
>>> to me...
>>
>> My suggestion would be that if the jdbc client was able to determine if
>> the server character set was UNKNOWN (i.e. no multibyte) that it would
>> then use some appropriate default character set to perform conversions
>> to UCS2 (LATIN1 would probably make the most sense as a default).  The
>> jdbc driver would perform its existing behavior if the character set was
>> SQL_ASCII and multibyte was enabled (i.e. only support 7bit characters
>> just like the backend does).
>>
>> Note that the user is always able to override the character set used for
>> conversion by setting the charSet property.
>
>
> I see.  However I would say we could not change the current behavior
> of the backend until 7.2 is out. It is our policy that we do not
> add/change existing functionality while we are in the minor release
> cycle.
>
> What about doing like this:
>
> 1. call pg_encoding_to_char(1)    (actually any number except 0 is ok)
>
> 2. if it returns "SQL_ASCII", then you could assume that MULTIBYTE is
> not enabled.
>
> This is pretty ugly, but should work.
>
>> Tom also mentioned that it might be possible for the server to support
>> setting the character set for a database even when multibyte wasn't
>> enabled.  That would then allow clients like jdbc to get a value from
>> non-multibyte enabled servers that would be more meaningful than the
>> current SQL_ASCII.  If this were done, then the 'UNKNOWN' hack would
>> not be necessary.
>
>
> Tom's suggestion does not sound reasonable to me. If PostgreSQL is not
> built with MULTIBYTE, then it means there would be no such idea as
> "encoding" in PostgreSQL because there is no program to handle
> encodings. Thus it would be meaningless to assign an "encoding" to a
> database if MULTIBYTE is not enabled.
> --
> Tatsuo Ishii



*** ./org/postgresql/Connection.java.orig    Fri May 25 16:23:02 2001
--- ./org/postgresql/Connection.java    Fri May 25 16:26:55 2001
***************
*** 267,273 ****
        //
        firstWarning = null;

!       java.sql.ResultSet initrset = ExecSQL("set datestyle to 'ISO'; select getdatabaseencoding()");

        String dbEncoding = null;
        //retrieve DB properties
--- 267,274 ----
        //
        firstWarning = null;

!       java.sql.ResultSet initrset = ExecSQL("set datestyle to 'ISO'; " +
!         "select case when pg_encoding_to_char(1) = 'SQL_ASCII' then 'UNKNOWN' else getdatabaseencoding() end");

        String dbEncoding = null;
        //retrieve DB properties
***************
*** 319,324 ****
--- 320,330 ----

          } else if (dbEncoding.equals("WIN")) {
            dbEncoding = "Cp1252";
+         } else if (dbEncoding.equals("UNKNOWN")) {
+           //This isn't a multibyte database so we don't have an encoding to use
+           //We leave dbEncoding null which will cause the default encoding for the
+           //JVM to be used
+           dbEncoding = null;
          } else {
            dbEncoding = null;
          }
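
The decision the patch makes can be sketched outside the driver roughly as
follows.  This is an illustration only, not the driver's code:
EncodingChoiceSketch and chooseEncoding are hypothetical names, and the
"US-ASCII" mapping for SQL_ASCII is an assumption.  Only the probe query,
the "WIN" to "Cp1252" mapping, the UNKNOWN case, and the charSet property
come from this thread.

```java
import java.util.Properties;

public class EncodingChoiceSketch {

    // Probe query from the patch: a non-MULTIBYTE backend answers SQL_ASCII
    // for pg_encoding_to_char(1), so the query reports 'UNKNOWN' instead.
    public static final String PROBE_SQL =
        "select case when pg_encoding_to_char(1) = 'SQL_ASCII' "
        + "then 'UNKNOWN' else getdatabaseencoding() end";

    /**
     * Pick the Java encoding name for a reported server encoding.
     * Returns null when the JVM default encoding should be used.
     */
    public static String chooseEncoding(String reported, Properties info) {
        // The user can always override the choice with the charSet property.
        String override = info.getProperty("charSet");
        if (override != null) {
            return override;
        }
        if ("SQL_ASCII".equals(reported)) {
            // Assumption: a MULTIBYTE server with SQL_ASCII stays 7-bit only.
            return "US-ASCII";
        } else if ("WIN".equals(reported)) {
            return "Cp1252";
        } else if ("UNKNOWN".equals(reported)) {
            // Not a multibyte database: no encoding to use, so fall back
            // to the JVM default.
            return null;
        }
        // Anything unrecognised also falls back to the JVM default.
        return null;
    }

    public static void main(String[] args) {
        Properties none = new Properties();
        Properties latin1 = new Properties();
        latin1.setProperty("charSet", "ISO-8859-1");

        System.out.println(PROBE_SQL);
        System.out.println("UNKNOWN, no override: "
            + chooseEncoding("UNKNOWN", none));
        System.out.println("WIN, no override: "
            + chooseEncoding("WIN", none));
        System.out.println("UNKNOWN, charSet set: "
            + chooseEncoding("UNKNOWN", latin1));
    }
}
```

Returning null for UNKNOWN mirrors the patch, which leaves dbEncoding null
so that the JVM default is used for conversion.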



Your patch has been added to the PostgreSQL unapplied patches list at:

    http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Patch applied.  Thanks.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026