Thread: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Joseph Shraibman
Date:
java.lang.ArrayIndexOutOfBoundsException: 3
         at org.postgresql.core.Encoding.decodeUTF8(Encoding.java:253)
         at org.postgresql.core.Encoding.decode(Encoding.java:165)
         at org.postgresql.core.Encoding.decode(Encoding.java:181)
         at org.postgresql.jdbc1.AbstractJdbc1ResultSet.getString(AbstractJdbc1ResultSet.java:97)

The relevant code is:

        while (i < k) {
            z = data[i] & 0xFF;
            if (z < 0x80) {
                l_cdata[j++] = (char)data[i];
                i++;
            } else if (z >= 0xE0) {        // length == 3
                y = data[i+1] & 0xFF; //<<== THIS IS LINE 253
                x = data[i+2] & 0xFF;
                val = (z-0xE0)*pow2_12 + (y-0x80)*pow2_6 + (x-0x80);
                l_cdata[j++] = (char) val;
                i+= 3;
            } else {        // length == 2 (maybe add checking for length > 3, throw exception if it is


And in the method that calls that:

    if (encoding.equals("UTF-8")) {
        return decodeUTF8(encodedString, offset, length);
    }

The thing is, my database encoding is SQL_ASCII:

=> SELECT version(),  getdatabaseencoding() ;
                                                 version                                                 | getdatabaseencoding
---------------------------------------------------------------------------------------------------------+---------------------
 PostgreSQL 7.3.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2 20020903 (Red Hat Linux 8.0 3.2-7) | SQL_ASCII
(1 row)

... so why is it trying to decode the string as UTF-8?  I just upgraded this database from
7.2.3 yesterday.

--
Joseph Shraibman
joseph@xtenit.com
Increase signal to noise ratio.  http://xis.xtenit.com


Re: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Joseph Shraibman
Date:
BTW the string that caused this is 'Oné'



Re: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Barry Lind
Date:
Joseph,

The problem is that your database claims to be ASCII, but you are
storing non-ascii data in it.

As of 7.3, the JDBC driver has the server convert from the database
character set to UTF8.  The server then sends the data to the driver in
UTF8, and the driver decodes the UTF8 to Java unicode.

The conversion from ASCII to UTF8 is a no-op, since the 128 characters of
ASCII (values 0 - 127) map directly to the same values in UTF8.  However,
since you are storing non-ASCII data, the bytes with values from 128 - 255
just get passed from the server to the client without any additional
processing (since there aren't supposed to be any values in this range).
Then when the driver tries to convert to Java unicode, it can't,
because it has received an invalid UTF8 sequence.
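
To see why this surfaces as an ArrayIndexOutOfBoundsException rather than
a decoding error, here is a minimal sketch of the failure.  It is not the
driver's code: the class name, the method name and the 2-byte branch are
assumptions filled in from the standard UTF8 layout, and only the 3-byte
branch mirrors the loop quoted earlier in this thread.  In Latin1, 'Oné' is
the three bytes 0x4F 0x6E 0xE9, and 0xE9 is >= 0xE0, so the loop takes it
for the start of a 3-byte sequence and reads past the end of the array.

```java
import java.nio.charset.StandardCharsets;

public class NaiveUtf8Demo {
    // Hypothetical stand-in for the loop quoted in this thread. Only the
    // 3-byte branch follows the quoted code; the 2-byte branch is an
    // assumption based on the standard UTF-8 layout.
    static String naiveDecode(byte[] data, int offset, int length) {
        char[] cdata = new char[length];
        int i = offset, j = 0, k = offset + length;
        while (i < k) {
            int z = data[i] & 0xFF;
            if (z < 0x80) {                 // 1 byte: plain ASCII
                cdata[j++] = (char) z;
                i++;
            } else if (z >= 0xE0) {         // treated as a 3-byte sequence
                int y = data[i + 1] & 0xFF; // no bounds check, as in the quoted loop
                int x = data[i + 2] & 0xFF;
                cdata[j++] = (char) ((z - 0xE0) * 4096 + (y - 0x80) * 64 + (x - 0x80));
                i += 3;
            } else {                        // treated as a 2-byte sequence
                int y = data[i + 1] & 0xFF;
                cdata[j++] = (char) ((z - 0xC0) * 64 + (y - 0x80));
                i += 2;
            }
        }
        return new String(cdata, 0, j);
    }

    public static void main(String[] args) {
        byte[] utf8 = "On\u00e9".getBytes(StandardCharsets.UTF_8);         // 4F 6E C3 A9
        byte[] latin1 = "On\u00e9".getBytes(StandardCharsets.ISO_8859_1); // 4F 6E E9
        System.out.println(naiveDecode(utf8, 0, utf8.length));
        try {
            naiveDecode(latin1, 0, latin1.length);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("overrun: " + e.getMessage());
        }
    }
}
```

With the Latin1 bytes the loop reaches i == 2 and then reads data[i+1],
which is index 3 in a 3-byte array.  That is consistent with the "3" in
the reported exception.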

It seems that you are actually storing Latin1 data in this database and
thus the database character set should probably be Latin1.
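
One way to sanity-check the Latin1 guess is to decode the raw bytes
strictly in both charsets.  The sketch below uses only the standard
java.nio.charset API; the class and method names are made up for
illustration, and the byte values assume 'Oné' was stored as Latin1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    // Returns the decoded text, or null if the bytes are not valid in cs.
    static String strictDecode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT)
                     .decode(ByteBuffer.wrap(bytes))
                     .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] stored = {0x4F, 0x6E, (byte) 0xE9}; // assumed stored bytes of 'Oné'
        System.out.println("as UTF-8:  " + strictDecode(stored, StandardCharsets.UTF_8));
        System.out.println("as Latin1: " + strictDecode(stored, StandardCharsets.ISO_8859_1));
    }
}
```

Every byte sequence is valid ISO-8859-1, so a successful Latin1 decode only
shows that the guess is plausible, not that it is right.  The failed UTF8
decode is the more telling half of the check.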

In 7.2 it was possible to override the character set used by the driver,
however I don't think this works anymore when connecting to a 7.3
server.  .... looks at code .... Yes, the override is ignored if the
server is a 7.3 server.  You could hack at AbstractJdbc1Connection to
work around the issue, or just correctly set the database character set
to match the data that the database contains.

thanks,
--Barry




Re: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Joseph Shraibman
Date:
Well, this data was inserted into postgres through the JDBC driver in the first place.

So how come postgres itself didn't complain about non-ASCII data?  How do I change the
encoding?  And what will the side effects be?



Re: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Joseph Shraibman
Date:
Barry Lind wrote:
> Joseph,
>
> The problem is that your database claims to be ASCII, but you are
> storing non-ascii data in it.
>
> As of 7.3 the jdbc driver has the server convert from the database
> character set to UTF8.  Then send the data to the driver in UTF8 and the
> driver then decodes the UTF8 to java unicode.

I see this in my postgres log when I connect via jdbc:

LOG:  query: set datestyle to 'ISO'; select version(), case when pg_encoding_to_char(1) = 'SQL_ASCII' then 'UNKNOWN' else getdatabaseencoding() end;
LOG:  query: set client_encoding = 'UNICODE'; show autocommit

So if client_encoding is unicode why is the driver trying to convert from UTF8?


Re: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Barry Lind
Date:
Joseph,

In postgres UNICODE means utf8.

--Barry



Re: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Joseph Shraibman
Date:
Barry Lind wrote:
> Joseph,
>
> In postgres UNICODE means utf8.

Which differs from java unicode?

I notice there is no way to change a database's encoding.  If I just change the encoding
type in the pg_database to latin1 will there be data loss?



Re: [GENERAL] ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Barry Lind
Date:

Joseph Shraibman wrote:
>>
>> In postgres UNICODE means utf8.
>
>
> Which differs from java unicode?
>

Yes.  Unicode in Java is 16-bit characters (I think the term for this is
UCS-2), two bytes for each character, whereas UTF8 is a variable-length
encoding in which those characters are represented by 1, 2 or 3 bytes.
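
A small sketch of the difference, using only the standard library (the
class and method names are arbitrary): the same three-character string
occupies three 16-bit chars in Java, four bytes in UTF8 and three bytes in
Latin1.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as ISO-8859-1 (Latin1).
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String s = "On\u00e9"; // 'Oné', written with an escape to avoid source-encoding issues
        System.out.println("Java chars:   " + s.length());
        System.out.println("UTF-8 bytes:  " + utf8Length(s));
        System.out.println("Latin1 bytes: " + latin1Length(s));
        // U+20AC (euro sign) is one Java char but three UTF-8 bytes.
        System.out.println("UTF-8 bytes for U+20AC: " + utf8Length("\u20ac"));
    }
}
```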

> I notice there is no way to change a database's encoding.  If I just
> change the encoding type in the pg_database to latin1 will there be data
> loss?

The recommended way to do this would be to dump the contents of the
database, create a new database with the desired character set and then
import the data into that new database.  I don't know if changing
pg_database directly would work or not.

--Barry



Character Encoding WAS: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Joseph Shraibman
Date:
Barry Lind wrote:
>
>
> Joseph Shraibman wrote:

>> I notice there is no way to change a database's encoding.  If I just
>> change the encoding type in the pg_database to latin1 will there be
>> data loss?
>
>
> The recommended way to do this would be to dump the contents of the
> database, create a new database with the desired character set and then
> import the data into that new database.  I don't know if changing
> pg_database directly would work or not.
>
>
That didn't work. When I tried that Oné turned into Oné, which confuses me because I
thought my problem was that I was storing latin1 chars in a text field that was supposed
to only have the lower ascii bits.  Oh well, I guess it is dump/reload time.


Re: Character Encoding WAS: ArrayIndexOutOfBoundsException in Encoding.decodeUTF8()

From
Joseph Shraibman
Date:
Joseph Shraibman wrote:
> Barry Lind wrote:
>> Joseph Shraibman wrote:
>>> I notice there is no way to change a database's encoding.  If I just
>>> change the encoding type in the pg_database to latin1 will there be
>>> data loss?
>>
>>
>>
>> The recommended way to do this would be to dump the contents of the
>> database, create a new database with the desired character set and
>> then import the data into that new database.  I don't know if changing
>> pg_database directly would work or not.
>>
>>
> That didn't work.

Actually it did. My test data was flawed.  What didn't work is editing the dump to change
the type to unicode.