Thread: Accents bug ?

Accents bug ?

From
Denis Bucher
Date:
Hello !

BIG problem ;-)

When INSERTing a value like "Genève" it works, but when doing a SELECT the
fields are truncated when they contain accents:

"Genève" is received as "Gen"
"Thé froid" as "Th"
"Hosomaki végétarien" as "Hosomaki v"
and so on ...

Does anyone know if it is a bug in the driver or another problem?

Thanks a lot in advance for any help :-)

Denis Bucher
NiftyCom

P.S. Using latest version of JDBC, Java and postgresql 7.1


Re: Accents bug ?

From
"Dave Cramer"
Date:
Denis,

It sounds like an encoding problem. You can check the encoding of the db
by using \encoding in psql.

There is a section in the docs on this
http://www.postgresql.org/idocs/index.php?multibyte.html

Dave




Re: Accents bug ?

From
Denis Bucher
Date:
At 20:09 01.10.01 -0400, you wrote:

Hello !

>It sounds like an encoding problem. You can check the encoding of the db
>by using \encoding in psql.
>
>There is a section in the docs on this
>http://www.postgresql.org/idocs/index.php?multibyte.html

Yes, I read it, but it didn't help me much. I also tried this:

postgres@sashimi:~$ psql -l ekai
         List of databases
  Database  |  Owner   | Encoding
-----------+----------+-----------
  ekai      | postgres | UNICODE
  template0 | postgres | SQL_ASCII
  template1 | postgres | SQL_ASCII
(6 rows)

The problem is with database ekai... Accents, once again, are OK in INSERT
but not in SELECT, and only with JDBC, not with psql (psql works perfectly).

Denis Bucher
NiftyCom



Re: Accents bug ?

From
Rene Pijlman
Date:
On Mon, 1 Oct 2001 20:09:45 -0400, you wrote:
>It sounds like an encoding problem. You can check the encoding of the db
>by using \encoding in psql.
>
>There is a section in the docs on this
>http://www.postgresql.org/idocs/index.php?multibyte.html

And more on
http://lab.applinet.nl/postgresql-jdbc/#CharacterEncoding

I've heard of conversion problems before, but not of chunking
the data though.

Regards,
René Pijlman <rene@lab.applinet.nl>

Re: Accents bug ?

From
Denis Bucher
Date:
At 11:48 02.10.01 +0200, Rene Pijlman wrote:

Hello !

> >It sounds like an encoding problem. You can check the encoding of the db
> >by using \encoding in psql.
> >There is a section in the docs on this
> >http://www.postgresql.org/idocs/index.php?multibyte.html
>
>And more on
>http://lab.applinet.nl/postgresql-jdbc/#CharacterEncoding

Yes, very interesting, it says :
jdbc:postgresql://localhost/dbname?charSet=UTF-8&user=foo&password=bar

But it gives me an error using that connection string:
jdbc:postgresql://sashimi:5432/ekai?charSet=UNICODE&user=ekaitest&password=aaa

...when doing my executeQuery I get:
VendorError:  0
SQLState:     null
SQLException: postgresql.con.encoding

So, not better... or is it really "UNICODE" ?

>I've heard of conversion problems before, but not of chunking
>the data though.

Yes, that's the strangest part, I think :-)

Denis


Re: Accents bug ?

From
Knut Forkalsrud
Date:
dbucher@niftycom.com (Denis Bucher) writes:

> At 11:48 02.10.01 +0200, Rene Pijlman wrote:
>
> >I've heard of conversion problems before, but not of chunking
> >the data though.
>
> Yes, that's the most strange I think :-)

I have seen the same thing.  It happened to Norwegian characters like
æøå.  Strings read through the JDBC driver were truncated at the first
non-ASCII character.  I found that it worked using the jdbc7.0-1.1.jar
version of the driver though.  I don't remember all of the details
right now, but the database encoding was UNICODE and I had inserted
the data from an earlier database dump from a Latin-1 database.  The
dump file used for INSERTs is in the Latin-1 character set.

If I can reproduce the problem I will try the getBytes() method on the
result set to get a byte array back and verify if the bytes returned
are legal UTF-8 sequences.  But ... looking at the source ... that
doesn't appear to be supported by the driver(?).  Maybe I'll have to
try to compile the driver and put some debug statements in it.
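
The check described here, verifying that a byte array is a legal UTF-8 sequence, needs no driver support once the raw bytes are in hand. A minimal sketch using the JDK's strict charset decoder follows; the names `Utf8Check` and `isValidUtf8` are made up for illustration and are not part of the PostgreSQL JDBC driver.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the PostgreSQL JDBC driver.
class Utf8Check {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw)); // throws on the first bad byte
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "abæøå", written with escapes so the source file encoding does not matter.
        String text = "ab\u00e6\u00f8\u00e5";
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 bytes valid:   " + isValidUtf8(asUtf8));
        System.out.println("Latin-1 bytes valid: " + isValidUtf8(asLatin1));
    }
}
```

The Latin-1 bytes are rejected because 0xe6 announces a three-byte UTF-8 sequence but is not followed by continuation bytes.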

The database server was 7.1.2 and I think I reproduced it using a 7.1.3
backend as well.  Is there a good way to get some exact version
numbers from a precompiled driver jar file in case I find a good way
of reproducing the error and want to give a better bug report?
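
On the version question, one option is to ask the loaded driver itself: the standard `java.sql.Driver` interface exposes `getMajorVersion()` and `getMinorVersion()`. The sketch below lists whatever drivers are registered with `DriverManager`. Two caveats: these numbers may be coarser than a build number, and an older driver may need to be loaded explicitly (for example with `Class.forName`) before it shows up. `DriverVersions` is an illustrative name only.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper: reports the version numbers of all registered JDBC drivers.
class DriverVersions {
    static List<String> list() {
        List<String> out = new ArrayList<>();
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver d = drivers.nextElement();
            out.add(d.getClass().getName() + " "
                    + d.getMajorVersion() + "." + d.getMinorVersion());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints nothing if no JDBC driver is on the classpath.
        for (String line : list()) {
            System.out.println(line);
        }
    }
}
```

With an open connection, `conn.getMetaData().getDriverVersion()` is another standard call that returns a driver-defined version string; how detailed that string is depends on the driver.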

Pardon the bad excuse for a bug report, I just wanted to let Denis
know that he is not the only one who has seen this bug.

-Knut


Re: Accents bug ?

From
Knut Forkalsrud
Date:
This is a follow-up to my post a few days ago about the JDBC driver
chopping off strings at non-ASCII characters.  First a brief summary of
the problem:

1. I create a database specifying the encoding as UNICODE.

2. I invoke psql in a standard xterm (ISO-8859-1) and insert a few
   values.

3. I continue using psql and issue a SELECT to read the strings back.
   All appears well.

4. I try to do the same SELECT through JDBC and the strings are
   truncated at the non-ASCII characters.

The issue seems to be related to another interface or the back end.  I
downloaded the JDBC source from anonymous cvs and compiled in some
debug code in the ResultSet.getString() method to display the actual
byte codes returned from the server.  Here is the method with my added
code prefixed by + characters at the beginning of the line:

  public String getString(int columnIndex) throws SQLException
  {
    if (columnIndex < 1 || columnIndex > fields.length)
      throw new PSQLException("postgresql.res.colrange");

    wasNullFlag = (this_row[columnIndex - 1] == null);
    if(wasNullFlag)
      return null;

+    final char[] hexDigits = { '0', '1', '2', '3', '4', '5', '6', '7',
+                               '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };
+    int i;
+    byte[] rawText = this_row[columnIndex - 1];
+    System.out.print("-- Raw:");
+    for (i = 0; i < rawText.length; ++i) {
+        int ch = rawText[i] >= 0 ? rawText[i] : rawText[i] + 256;
+        System.out.print(" " + hexDigits[ch/16] + hexDigits[ch%16]);
+    }
+    System.out.println();

    Encoding encoding = connection.getEncoding();
    return encoding.decode(this_row[columnIndex - 1]);
  }


Before my actual query I saw two calls to getString():

        50 6f 73 74 67 72 65 53 51 4c 20 37 2e 31 2e 33 20 6f 6e 20 69
        36 38 36 2d 70 63 2d 6c 69 6e 75 78 2d 67 6e 75 2c 20 63 6f 6d
        70 69 6c 65 64 20 62 79 20 47 43 43 20 32 2e 39 36

which roughly translates to:

        PostgreSQL 7.1.3 on i686-pc-linux-gnu, compiled by GCC 2.96

and:

        55 4e 49 43 4f 44 45

which roughly translates to:

        UNICODE

I guess these are some initialization queries the driver does on
startup.  Then to my query.  I summarized the results in a table to
make it easier to follow.

Inserted     The JDBC driver byte buffer    JDBC string
-------------------------------------------------------
abcdefgh     61 62 63 64 65 66 67 68        abcdefgh
abæøå        61 62 e6 f8 e5                 ab
ab²          61 62 b2                       ab
ß÷¥£         df f7 a5 a3

The byte buffer seems to be in the ISO8859-1 character set and not
UTF8 as the UNICODE database encoding expects.  The error was probably
introduced during the INSERT.  I guess I should submit this as a bug
to the maintainers of the psql program.  But the "Bug report tool"
listed on the home page (http://www.ca.postgresql.org/bugs/index.php)
gives me a 404 Not found.  Well, I'll just have to try some other
servers.
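
The byte values in the table can be checked without a database. The sketch below takes the bytes reported for `abæøå`, prints them as hex, decodes them with a strict UTF-8 decoder that stops at the first malformed byte, and re-encodes them from ISO-8859-1 to UTF-8. Stopping at the first malformed byte yields the same `ab` prefix seen through JDBC, but whether the driver's decoder actually behaved this way is an assumption; the thread does not confirm it. `EncodingDemo` and its method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not driver code.
class EncodingDemo {
    // Formats bytes as space-separated lowercase hex, like the debug output above.
    static String hex(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decodes as UTF-8, keeping only the characters before the first malformed byte.
    static String utf8Prefix(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(raw.length);
        // The result is ignored on purpose: decoding stops at the first error.
        decoder.decode(ByteBuffer.wrap(raw), out, true);
        out.flip();
        return out.toString();
    }

    // Interprets the bytes as ISO-8859-1 and re-encodes the text as UTF-8.
    static byte[] latin1ToUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {0x61, 0x62, (byte) 0xe6, (byte) 0xf8, (byte) 0xe5};
        System.out.println("raw bytes:       " + hex(raw));
        System.out.println("strict UTF-8:    " + utf8Prefix(raw));
        System.out.println("as UTF-8 bytes:  " + hex(latin1ToUtf8(raw)));
    }
}
```

For these five bytes the strict UTF-8 prefix is `ab`, and the correct UTF-8 encoding of the same text would be `61 62 c3 a6 c3 b8 c3 a5`.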

-Knut

--
The early worm gets the bird.

Re: Accents bug ?

From
Denis Bucher
Date:
At 16:15 04.10.01 -0700, you wrote:

Hello !

Not reading the mailing-list very often I only answer now. I suppose I started
the thread about this bug...

>This is a follow-up to my post a few days ago about the JDBC driver
>chopping off strings at non-ASCII characters.  First a brief summary of
>the problem:
>[.......]

You have exactly the same problem as I have.
And my database is also UNICODE...

>I guess this is some initialization queries the driver does on
>startup.  Then to my query.  I summarized the results in a table to
>make it easier to follow.
>
>Inserted     The JDBC driver byte buffer    JDBC string
>-------------------------------------------------------
>abcdefgh     61 62 63 64 65 66 67 68        abcdefgh
>abæøå        61 62 e6 f8 e5                 ab
>ab²          61 62 b2                       ab
>ß÷¥£         df f7 a5 a3
>
>The byte buffer seems to be in the ISO8859-1 character set and not
>UTF8 as the UNICODE database encoding expects.  The error was probably
>introduced during the INSERT.

Well, are you sure? With *all* tools, whether psql or pg_dumpall,
everything seems right. Look at an extract of a dump:
6       6       PPN     Nigiri mixte    t       t
7       7       PNS     Nigiri saké     t       t
8       8       PNT     Nigiri maguro   t       t
9       9       PNM     Nigiri saké maguro      t       t
43      43      PSM     Petit Nigiri maguro     t       t
44      44      PSS     Petit Nigiri saké       t       t

As you can see it *works* for a dump.

>   I guess I should submit this as a bug
>to the maintainers of the psql program.

Are you sure the problem is in the INSERT ?

And even supposing you're right, isn't there a way to do something in the
JDBC driver?
Or is there a way to correct the database?

Denis


Re: Accents bug ?

From
Jean-Christophe ARNU
Date:
On 24 Oct 2001 17:51:56 +0200, Denis Bucher wrote:
> You have exactly the same problem as I have.
> And my database is also UNICODE...

> Well, are you sure? With *all* tools, whether psql or pg_dumpall,
> everything seems right. Look at an extract of a dump:
> 6       6       PPN     Nigiri mixte    t       t
> 7       7       PNS     Nigiri saké     t       t
> 8       8       PNT     Nigiri maguro   t       t
> 9       9       PNM     Nigiri saké maguro      t       t
> 43      43      PSM     Petit Nigiri maguro     t       t
> 44      44      PSS     Petit Nigiri saké       t       t

    Maybe you just have to force the charset when opening the connection,
like this:
    Connection conn =
        DriverManager.getConnection("jdbc:postgresql://myDataBase?charSet=iso-8859-1", uid, pw);



You can force charSet to different values. There is a full section
in the PostgreSQL docs on supported charsets.

    Hope this helps.
--
Jean-Christophe ARNU
s/w developer
Paratronic France

  To begin with, let us be broad-minded and travel by both sail and
  steam. If the closeness between the two communities becomes
  unbearable, there will always be time to organize apartheid.
  -+- PM in: Guide du Cabaliste Usenet - Bien séparer les enfilades -+-