Re: UTF8 frustrations - Mailing list pgsql-general

From Albe Laurenz
Subject Re: UTF8 frustrations
Date
Msg-id D960CB61B694CF459DCFB4B0128514C2297F8E@exadv11.host.magwien.gv.at
Whole thread Raw
In response to UTF8 frustrations  (jesse.waters@gmail.com)
List pgsql-general
jesse.waters@gmail.com wrote:
>
> Here is the latest issue, to verify that the pg_dump works, I'm going
> to do dump and restore on the same host/cluster.
>
> Source:
>  DB_source:
>  Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
>  psql 8.2.4
> Destination:
>  same machine different db name
>
> echo $LANG
> en_US.UTF-8
>
> SET client_encoding = 'UTF8';
>
> Command used:
>   pg_dump -Fc srcdb > db.dump
>   pg_restore -d devdb db.dump
>
> Results, same error. Now I'm really concerned.
>
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 1625; 0 16680 TABLE
DATA logs watersj
> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xdf69
> HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding
> CONTEXT:  COPY logs, line 69238382
> WARNING: errors ignored on restore: 1

I can recreate this behaviour with 8.2.4 (UTF-8).

psql> CREATE TABLE test (id serial PRIMARY KEY, val text);

psql> INSERT INTO test (val) VALUES (E'\xdf\x69');

psql> \q

$ pg_dump -F c -f x.dmp -t test testdb

$ pg_restore -c -d testdb x.dmp

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1840; 0 45883 TABLE
DATA test laurenz
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xdf69
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  COPY test, line 1
WARNING: errors ignored on restore: 1

The problem is that in (at least) one record in your table
watersj.logs, there is a corrupt string.

Unfortunately (as demonstrated above) it is possible to enter
corrupt data into a PostgreSQL database, this is what must have
happened in your case.

I suggest that you identify and correct this string in the original
database, then everything should work fine.

You can extract the offending row from the dump, that should
help to identify it. 69238382 rows is a little unwieldy, but
tools like awk can help:

pg_restore db.dump | awk '/^COPY logs /,/^\\\.$/ { if (lineno==69238382)
print $0; ++lineno }'

I think there is the desire to fix problems like this in 8.3,
but I don't think that's done yet.

Yours,
Laurenz Albe

pgsql-general by date:

Previous
From: Ow Mun Heng
Date:
Subject: Re: Need suggestion on how best to update 3 million rows
Next
From: Richard Huxton
Date:
Subject: Re: Need suggestion on how best to update 3 million rows