Thread: Confused with db client encoding

Confused with db client encoding

From
Carlos Correia
Date:
Hi,

Here is the output of a psql session. Please note that the indentation
inconsistencies in the records containing non-ASCII chars are as output by
psql.

The db was created with LATIN9 and the console was run (on the same
machine) using UTF-8 (my system's default).

I was surprised to notice that setting the client to unicode (which is
what that console is using) garbled the localized chars, as I was
expecting the opposite.

On the other hand, when the same query is run from a Java app on the same
machine, the accented chars also appear garbled.

Have I misunderstood the manual? How can I get consistent behaviour?

It was tested on a Debian/unstable box, running PostgreSQL 7.4.5-3 and
Sun's JVM 1.4.2.

Thanks,

Carlos

psql session:
-----
mpb2-m16e=# \l
        List of databases
   Name    |  Owner   | Encoding
-----------+----------+----------
 mpb2-test | carlos   | LATIN9
 template0 | postgres | LATIN9
 template1 | postgres | LATIN9
(3 rows)

mpb2-m16e=# select tipo_doc_id, nome, descricao from tab_tipo_doc where tipo_doc_id < 100;
 tipo_doc_id |         nome         |               descricao
-------------+----------------------+---------------------------------------
           0 |                      | (documento desconhecido)
           1 | Encomenda            | Encomendas
           2 | Factura              | Facturas
           3 | Tx. Dinheiro         | Transacções a Dinheiro
          11 | Nota de Crédito     | Notas de Crédito
          12 | Nota de Débito      | Notas de Débito
          21 | G. Remessa           | Guia de Remessa
          91 | Saída Armazém      | Saídas de Armazém
          92 | Ent. Armazém        | Entradas em Armazém
           5 | Devolução          | Devoluções de Facturas/Tx. Dinheiro
          99 | Acerto Inv.          | Acerto de Inventário
          51 | O.T.                 | Ordens de Trabalho
(12 rows)

mpb2-m16e=# set client_encoding to unicode;
SET
mpb2-m16e=# select tipo_doc_id, nome, descricao from tab_tipo_doc where tipo_doc_id < 100;
 tipo_doc_id |         nome         |               descricao
-------------+----------------------+---------------------------------------
           0 |                      | (documento desconhecido)
           1 | Encomenda            | Encomendas
           2 | Factura              | Facturas
           3 | Tx. Dinheiro         | Transacções a Dinheiro
          11 | Nota de Crédito     | Notas de Crédito
          12 | Nota de Débito      | Notas de Débito
          21 | G. Remessa           | Guia de Remessa
          91 | Saída Armazém      | Saídas de Armazém
          92 | Ent. Armazém        | Entradas em Armazém
           5 | Devolução          | Devoluções de Facturas/Tx. Dinheiro
          99 | Acerto Inv.          | Acerto de Inventário
          51 | O.T.                 | Ordens de Trabalho
(12 rows)




Re: Confused with db client encoding

From
Ian Barwick
Date:
On Mon, 06 Sep 2004 00:02:24 +0100, Carlos Correia <carlos@m16e.com> wrote:
> Hi,
>
> Here is the output of a psql session. Please note that the indentation
> inconsistencies in the records containing non-ASCII chars are as output by
> psql.
>
> The db was created with LATIN9 and the console was run (on the same
> machine) using UTF-8 (my system's default).
>
> I was surprised to notice that setting the client to unicode (which is
> what that console is using) garbled the localized chars, as I was
> expecting the opposite.
>
> On the other hand, when the same query is run from a Java app on the same
> machine, the accented chars also appear garbled.

(...)
>            3 | Tx. Dinheiro         | Transacções a Dinheiro
>           11 | Nota de Crédito     | Notas de Crédito
>           12 | Nota de Débito      | Notas de Débito
>           21 | G. Remessa           | Guia de Remessa

It looks like this data was entered as UTF-8 while the client encoding
was LATIN9 (or whatever), meaning the two incoming bytes of each
accented character in UTF-8 were interpreted by the backend as two
individual characters in LATINx.
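
To make the suspected mechanism concrete, here is a minimal Python sketch of
what would happen to the bytes if that diagnosis is right. It only imitates
the conversions with Python's own codecs; the function names are made up for
illustration and nothing here talks to PostgreSQL. LATIN1 is used rather than
LATIN9 to match the test case.

```python
# Illustration only: imitates the suspected conversions with Python codecs.
# store_as_latin1 and read_as_unicode are hypothetical helper names.

def store_as_latin1(typed_text):
    """Text typed in a UTF-8 terminal, received while client_encoding is LATIN1."""
    wire_bytes = typed_text.encode("utf-8")   # the terminal sends UTF-8 bytes
    return wire_bytes.decode("latin-1")       # each byte is taken as one character

def read_as_unicode(stored_text):
    """Stored characters converted to UTF-8 for a client_encoding of UNICODE."""
    return stored_text.encode("utf-8")

stored = store_as_latin1("müller")
print(len("müller"), len(stored))   # one character has become two
print(read_as_unicode(stored))      # four bytes where the original "ü" had two
```

If this is what happened, a client that reads the raw bytes back without
conversion sees the original UTF-8 again, which would explain why the text
looks right but the column padding is off.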

Test case (session in a UTF-8 environment):

test=# CREATE DATABASE ctest encoding 'LATIN1';
CREATE DATABASE
test=# \c ctest;
You are now connected to database "ctest".
ctest=# CREATE TABLE coding (data TEXT);
CREATE TABLE
ctest=# SET client_encoding TO LATIN1;
SET
ctest=# INSERT INTO coding VALUES('müller');
INSERT 349960 1
ctest=# SELECT * FROM coding;
  data
---------
 müller
(1 row)

ctest=# SET client_encoding TO UNICODE;
SET
ctest=# SELECT * FROM coding;
  data
---------
 müller
(1 row)
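
If the stored values really are UTF-8 bytes that were taken for LATIN1
characters, the damage should be reversible in principle. The following is a
rough Python sketch of that reversal on a single string, again using only
Python's codecs; repair_double_encoded is a hypothetical helper name, and
running something like this against real data would need testing on a copy
first.

```python
# Illustration only: undoes the "UTF-8 bytes read as LATIN1" mix-up for one string.
# repair_double_encoded is a hypothetical helper, not part of any library.

def repair_double_encoded(stored_text):
    """Return the text the user most likely typed, or the input unchanged."""
    try:
        # Turn each stored character back into the single byte it came from,
        # then read the byte sequence as the UTF-8 it originally was.
        return stored_text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not a valid double-encoded value; leave it alone.
        return stored_text

print(repair_double_encoded("m\u00c3\u00bcller"))
print(repair_double_encoded("Factura"))
```

Values that were stored correctly fall through the except branch or, if they
are plain ASCII, come back unchanged, so applying the function twice does no
further harm in these cases.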

Ian Barwick