Thread: Pg 7.4 to 8.1 UTF problems

From: Mario Splivalo

Hello.

I have a database on PostgreSQL 7.4 that was created with -E UTF-8. I
need to transfer it to pg8.1.2.

First I dumped the database:

pg_dump -v -E UTF-8 the_database > the_database.psql

Then I created the database on pg8.1.2

createdb the_database -E UTF-8

Then, I wanted to insert the dump:

psql -f the_database.psql the_database

But, psql complained with:

psql:the_database.psql:86: ERROR:  invalid UTF-8 byte sequence detected
near byte 0x8d
CONTEXT:  COPY _netsms, line 1367, column text: "Padamo u pozu 69.uz
\uffffmas mog macana u usta.a ja tvoju ljepu picu"

Note that it says \uffff here. I couldn't copy/paste from gnome-terminal
directly into Evolution; I first had to paste it into gedit, and then
copy from there and paste here.

Nevertheless, here is what hexdump says:

mario@mike:~/work/lidija$ cat lidija.psql | grep "amo u pozu 69" | head -1 | hexdump -C
00000000  32 30 30 36 2d 30 31 2d  30 31 20 32 31 3a 33 30  |2006-01-01 21:30|
00000010  3a 30 39 2e 38 34 36 35  34 36 30 30 09 2b 33 38  |:09.84654600.+38|
00000020  35 39 38 32 39 37 35 38  32 09 50 61 64 61 6d 6f  |598297582.Padamo|
00000030  20 75 20 70 6f 7a 75 20  36 39 2e 75 7a 8d 6d 61  | u pozu 69.uz.ma|
00000040  73 20 6d 6f 67 20 6d 61  63 61 6e 61 20 75 20 75  |s mog macana u u|
00000050  73 74 61 2e 61 20 6a 61  20 74 76 6f 6a 75 20 6c  |sta.a ja tvoju l|
00000060  6a 65 70 75 20 70 69 63  75 0a                    |jepu picu.|
0000006a

From this it is visible that the troublesome character has hex code 8d,
which is a regular ASCII character.

If I create the database (on 8.1.2) with -E SQL_ASCII, the import goes
ok.

If I load the same dump with psql into a 7.4 server, everything works
fine. I haven't tried 8.0, because I don't have an 8.0 server available.

How can I import the database into 8.1.2 so that the database on 8.1.2
is still created with -E UTF8?

    Mike
--
"I can do it quick, I can do it cheap, I can do it well. Pick any two."

Mario Splivalo
msplival@jagor.srce.hr



Re: Pg 7.4 to 8.1 UTF problems

From: Josh O'Brien

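Run this conversion command (-c tells iconv to drop any bytes that are
not valid UTF-8, so the offending characters are discarded, not repaired):
time iconv -c -f UTF8 -t UTF8 -o newdump.sql olddump.sql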
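
For readers who want to see what that filtering amounts to, here is a minimal Python sketch of the same idea: decode as UTF-8 while silently skipping undecodable bytes, then re-encode. The function name strip_invalid_utf8 is made up for this illustration, and the sample bytes are offsets 0x30 to 0x3f of the hexdump in the original message. Note that the stray 0x8d simply disappears, so this loses data.

```python
def strip_invalid_utf8(raw: bytes) -> bytes:
    """Drop every byte that is not part of a valid UTF-8 sequence."""
    # errors="ignore" skips undecodable bytes instead of raising.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


# Bytes taken from the hexdump row at offset 0x30, including the stray 0x8d.
sample = bytes.fromhex("20 75 20 70 6f 7a 75 20 36 39 2e 75 7a 8d 6d 61")
cleaned = strip_invalid_utf8(sample)

print(sample)
print(cleaned)
```

Valid multi-byte characters pass through unchanged; only the bytes that cannot be decoded are removed.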

Josh O'Brien

Re: Pg 7.4 to 8.1 UTF problems

From: Tom Lane

Mario Splivalo <msplival@jagor.srce.hr> writes:
> psql:the_database.psql:86: ERROR:  invalid UTF-8 byte sequence detected
> near byte 0x8d
> CONTEXT:  COPY _netsms, line 1367, column text: "Padamo u pozu 69.uz
> \uffffmas mog macana u usta.a ja tvoju ljepu picu"

7.4's checking for valid UTF8 code sequences had some bugs, causing it
to accept data that is not valid UTF8.  8.1 has tightened that up.

> From this it is visible that the troublesome character has hex code 8d,
> which is a regular ASCII character.

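It is not ASCII (ASCII stops at 0x7F), and it is not legal UTF8 here
either: 0x8d is a continuation byte, which is only valid directly after
a multibyte lead byte, and in your data it follows a plain "z" (0x7a).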
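
As a quick check of that point, the following Python sketch feeds a fragment of the offending row to a strict UTF-8 decoder and looks at the bit pattern of 0x8d. The fragment is copied from the hexdump in the original message; the variable names are only for illustration.

```python
# Fragment of the failing row: "69.uz", the stray byte, then "mas mog".
line_tail = b"69.uz\x8dmas mog"

try:
    line_tail.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# ASCII covers 0x00..0x7F only, so 0x8d is outside it.
is_ascii = 0x8D < 0x80
# Bytes of the form 10xxxxxx are UTF-8 continuation bytes; they cannot start a character.
is_continuation = (0x8D & 0xC0) == 0x80

print(error)
print(is_ascii, is_continuation)
```

The decoder stops at the position of 0x8d because the byte before it is plain ASCII, so there is no multibyte sequence for it to continue.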

You need to decide whether this is bad data (and if so fix it), or
whether you misdetermined what the encoding of your data is (and if
so, change to the correct encoding declaration).

            regards, tom lane
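
To help with that decision, one option is to list every line of the dump that is not valid UTF-8 and look at how the bytes read under other encodings. The Python sketch below does that on an in-memory stand-in for the dump file. The helper name find_invalid_utf8_lines is hypothetical, and the candidate encodings (cp1250, iso8859_2) are only guesses for Central European text, not a diagnosis of this particular database.

```python
import io


def find_invalid_utf8_lines(stream):
    """Yield (line_number, byte_offset, raw_line) for lines that are not valid UTF-8."""
    for number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            yield number, exc.start, raw


# Stand-in for open("the_database.psql", "rb"); the second line carries the stray byte.
dump = io.BytesIO(b"first row\n69.uz\x8dmas mog\nthird row\n")
bad_lines = list(find_invalid_utf8_lines(dump))

for number, offset, raw in bad_lines:
    print(f"line {number}, offset {offset}: byte 0x{raw[offset]:02x}")
    # Show the line under each guessed encoding so a human can judge which looks right.
    for candidate in ("cp1250", "iso8859_2"):
        print(f"  as {candidate}: {raw.decode(candidate, errors='replace')!r}")
```

If one candidate encoding yields sensible text for all reported lines, the data was probably stored under the wrong encoding declaration; if none does, the rows are more likely bad data that needs fixing by hand.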