Re: UNICODE - Mailing list pgsql-hackers

From Marko Kreen
Subject Re: UNICODE
Date
Msg-id 20011028170945.A18241@l-t.ee
In response to Re: UNICODE  (Jean-Michel POURE <jm.poure@freesurf.fr>)
Responses Re: UNICODE
List pgsql-hackers
On Sun, Oct 28, 2001 at 02:34:49PM +0100, Jean-Michel POURE wrote:
> 
> >psql uses your input literally - so is your console/xterm in
> >UNICODE/UTF8?
> Client: \encoding returns 'UNICODE'.
> Server: \list show databases. All databases are UNICODE (except TEMPLATE0 
> and TEMPLATE1 which are ASCII of course). I use a Mandrake 8.1 distribution 
> and think my console is UNICODE.

You think?  Try this:
$ echo "accepté" | od -c

If your terminal is in UTF-8 you should get:
0000000   a   c   c   e   p   t 303 251  \n
0000011

If in ISO-8859-1:
0000000   a   c   c   e   p   t 351  \n
0000010

It may be in some other 8-bit encoding too, in which case the
octal value shown for 'é' may be different.
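
If you are not sure what your terminal sends, you can take it out of the picture. The following is only a sketch: it writes the bytes with explicit octal escapes instead of a typed 'é', so it shows what each encoding looks like in od regardless of the console setup. The variable names are made up for illustration.

```shell
# Emit the word with explicit octal escapes so the result does not depend
# on how the terminal or editor encodes a typed 'é'.
# LC_ALL=C keeps od from interpreting the bytes as multibyte characters.
utf8_bytes=$(printf 'accept\303\251\n' | LC_ALL=C od -An -c)
latin1_bytes=$(printf 'accept\351\n' | LC_ALL=C od -An -c)

# The UTF-8 line should end in "303 251 \n", the ISO-8859-1 line in "351 \n".
echo "UTF-8:      $utf8_bytes"
echo "ISO-8859-1: $latin1_bytes"
```

Whichever of the two lines matches the output of the echo test above is the encoding your terminal is really using.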

> >> As for me, I typed INSERT INTO source_content VALUES ('Permis de conduire
> >> accepté') in Psql.
> >As I said - psql does not do any conversion.
> The faulty query is: INSERT INTO test (source_content) VALUES ('Permis de 
> conduire accepté');

Hmm.  It may be a bug in the input routines.  You give PostgreSQL a
1-byte 'é', it expects a 2-byte character and overflows somewhere.  Can
you reproduce it on 7.1.3?  Maybe it's fixed there; I can't
reproduce it.

> I just can't believe that Psql is not UTF-8 compatible. It seems unreal as 
> Psql is PostgreSQL #1 helper application. Should I use PostgreSQL MULE 
> encoding to have automatic trans coding. What are the guidelines, I am 
> completely lost.

psql & pg_dump are fine.  Your problem is that you are not giving
psql and pg_exec/PHP UTF-8 strings, but some ISO-8859-* ones.

> >> Psql does not insert the data and I have to kill it manually. Can you
> >> reproduce this?
> >No.  If it hangs this is serious problem.  Or did you simply
> >forgot final ';' ?   It btw does not seem valid sql to me,
> >considering you previously provided table structure.
> Is it possible that my database is corrupted? I have used pg_dump several 
> times to dump data from production server to development servers and 
> conversely. Does pg_dump produce UTF8 output? What are the guidelines when 
> using UTF-8: forget psql and pg_dump?

As I said, psql & pg_dump are fine, they do not touch your data
when it passes through them.

It may be that your whole database is in latin1, since you
inserted strings in that encoding, not UTF-8.  Basically the
PostgreSQL server does not touch your data either; only its
comparison functions do not work, because the strings are not in
the encoding you say they are.

The solution is to dump your data, use the iconv utility
to convert it to UTF-8, and reload.
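
A rough sketch of the conversion step, assuming the data really is ISO-8859-1 (if it is some other 8-bit encoding, the -f argument must change). The file names, the sample line, and the database names in the comments are hypothetical; the sample file stands in for a real dump so the iconv part can be tried without a server.

```shell
# Scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Stand-in for one line of a dump file; \351 is 'é' in ISO-8859-1.
printf 'Permis de conduire accept\351\n' > "$workdir/dump_latin1.sql"

# Convert from ISO-8859-1 to UTF-8. The source encoding is an assumption
# and must match what the data actually is.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/dump_latin1.sql" > "$workdir/dump_utf8.sql"

# The 'é' should now occupy two bytes (303 251) instead of one (351).
LC_ALL=C od -c "$workdir/dump_utf8.sql"

# Against a real server the surrounding steps would look roughly like:
#   pg_dump mydb > dump_latin1.sql
#   psql newdb < dump_utf8.sql
```

Keep the original dump until you have checked the reloaded data, since converting with the wrong source encoding produces garbage without any error.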


To check whether this is the case, run:
$ psql -c "SELECT source_content FROM table where ..." \
    | od -c

Then check whether the accented characters are represented by
1 or 2 bytes.
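
If reading od output by eye is tedious, the check can be automated. This is a sketch that relies on iconv exiting with a non-zero status when its input is not valid in the declared source encoding; is_utf8 is a made-up helper name, not part of any tool.

```shell
# Succeed if the data on stdin is valid UTF-8.
# Converting UTF-8 to UTF-8 changes nothing, but iconv fails with a
# non-zero exit status when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# 303 251 is 'é' in UTF-8; a lone 351 is 'é' in ISO-8859-1.
printf 'accept\303\251\n' | is_utf8 && echo "valid UTF-8"
printf 'accept\351\n' | is_utf8 || echo "not UTF-8 (probably an 8-bit encoding)"
```

The output of the psql command above can be piped into is_utf8 in the same way. Note that pure ASCII also passes, so the test only says something when the selected rows contain accented characters.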

-- 
marko


