Re: retry: converting ASCII to UTF-8 - Mailing list pgsql-general

From Albe Laurenz
Subject Re: retry: converting ASCII to UTF-8
Date
Msg-id D960CB61B694CF459DCFB4B0128514C2D2CE1C@exadv11.host.magwien.gv.at
Whole thread Raw
In response to retry: converting ASCII to UTF-8  (Tom Hart <tomhart@coopfed.org>)
List pgsql-general
Tom Hart wrote:
> Originally I set up the database to use ASCII encoding. However we
> have a large base of Spanish speaking members and services,
> and we need utf-8 encoding to maintain and support the extended character
> sets. In my naivety I thought it would be a relatively simple process
> to convert the db but I've found this to not be the case. I tried doing
> a dump and restore into a new database with the proper encoding, but
> pg_restore is getting hung up on one of the tables, our largest by far
> (~1gb, not huge I know). When I tried pg_restore from a command line
> (I was using pgAdmin, I know i'm a nub) I received this error.
>
> C:\Program Files\PostgreSQL\8.2\bin>pg_restore.exe -i -h 192.168.1.xxx -p 5432 -U foobar -d warehouse_utf8 -a -t
"transaction"-v "O:\foo\bar\pg_dump_transaction.backup" 
> pg_restore: connecting to database for restore
> Password:
> pg_restore: restoring data for table "transaction"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 1706; 0 17861 TABLE DATA transaction foobar
> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xc52f
> HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 
> CONTEXT:  COPY transaction, line 209487
> WARNING: errors ignored on restore: 1

You will have to figure out in which encoding the data really are.

SQL_ASCII will allow anything in, and the client is responsible for
feeding the right thing.

Somebody stored a 0xC5 in the database, which is not ASCII.
In WIN-1252, ISO8859-1, and ISO8859-15 that would be Å.
Could that be?

Once you find out the correct encoding, you can uye the -E switch of
pg_dump to set that encoding for your exported data.

If your clients entered consistent data, that should work.

If different clients used different encodings, you might end up
sorting it out manually...

Yours,
Laurenz Albe

pgsql-general by date:

Previous
From: hubert depesz lubaczewski
Date:
Subject: maximum amount of data to be written during checkpoint?
Next
From: Tomasz Ostrowski
Date:
Subject: Re: ascii to utf-8