Thread: ascii to utf-8

ascii to utf-8

From

Tom Hart

Date:

23 January 2008, 16:08:27

Hello everybody. I hope your week's going well so far.

I built our data mine in postgreSQL around 3 months ago and I've been
working with it since. Postgres is great and I'm really enjoying it, but
I've hit a bit of a hitch. Originally (and against pgAdmin's good
advice, duh!) I set up the database to use ASCII encoding. However we
have a large base of Spanish speaking members and services, and we need
utf-8 encoding to maintain and support the extended character sets. In
my naivety I thought it would be a relatively simple process to convert
the db but I've found this to not be the case. I tried doing a dump and
restore into a new database with the proper encoding, but pg_restore is
getting hung up on one of the tables, our largest by far (~1gb, not huge
I know). When I tried pg_restore from a command line (I was using
pgAdmin, I know i'm a nub) I received this error.

C:\Program Files\PostgreSQL\8.2\bin>pg_restore.exe -i -h 192.168.1.xxx
-p 5432 -U foobar -d warehouse_utf8 -a -t "transaction" -v
"O:\foo\bar\pg_dump_transaction.backup"
pg_restore: connecting to database for restore
Password:
pg_restore: restoring data for table "transaction"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1706; 0 17861 TABLE
DATA transaction foobar
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xc52f
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  COPY transaction, line 209487
WARNING: errors ignored on restore: 1

I remember reading somewhere recently that I could use iconv to convert
the ASCII encoded dump to UTF-8 encoding, but I'm currently on a windows
box, and a windows server, so is there an easier way to do this? Also I
was thinking perhaps it was possible to do an ETL type setup, where I
can SELECT from the ASCII db and INSERT into the UTF-8 db.

If you haven't gathered yet, I'm pretty in the dark regarding encoding
issues, especially when applied to pg, so any help here would be
appreciated.

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)

Re: ascii to utf-8

From

Tommy Gildseth

Date:

23 January 2008, 17:07:31

Tom Hart wrote:
> Hello everybody. I hope your week's going well so far.
>
> I built our data mine in postgreSQL around 3 months ago and I've been
> working with it since. Postgres is great and I'm really enjoying it,
> but I've hit a bit of a hitch. Originally (and against pgAdmin's good
> advice, duh!) I set up the database to use ASCII encoding. However we
> have a large base of Spanish speaking members and services, and we
> need utf-8
...snip snip
>
> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
> for encoding "UTF8": 0xc52f
> HINT:  This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".
> CONTEXT:  COPY transaction, line 209487
> WARNING: errors ignored on restore: 1
>

Try editing your dump-file and change the line which reads "SET
client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';"

--
Tommy Gildseth

Re: ascii to utf-8

From

Tom Hart

Date:

23 January 2008, 18:51:36

Tommy Gildseth wrote:
> Tom Hart wrote:
>> Hello everybody. I hope your week's going well so far.
>>
>> I built our data mine in postgreSQL around 3 months ago and I've been
>> working with it since. Postgres is great and I'm really enjoying it,
>> but I've hit a bit of a hitch. Originally (and against pgAdmin's good
>> advice, duh!) I set up the database to use ASCII encoding. However we
>> have a large base of Spanish speaking members and services, and we
>> need utf-8
> ...snip snip
>>
>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte
>> sequence for encoding "UTF8": 0xc52f
>> HINT:  This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
>> CONTEXT:  COPY transaction, line 209487
>> WARNING: errors ignored on restore: 1
>>
>
> Try editing your dump-file and change the line which reads "SET
> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';"
>
I tried making the changes you specified with notepad, wordpad, gVim,
vim and emacs and in each case pgAdmin (and pg_restore) complain about
the dump header being corrupted. This has been kind of a pain since the
file is ~ 65mb and it's difficult to load something that size into a
text editor. I also did a head > file, edited the file, and then did
head -n -10 >> file, but once again I had no success. Is there an easy
way of doing this, or perhaps a different way of solving the problem?

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)

Re: ascii to utf-8

From

Tom Hart

Date:

23 January 2008, 19:43:11

Tommy Gildseth wrote:
> Tom Hart wrote:
>> Hello everybody. I hope your week's going well so far.
>>
>> I built our data mine in postgreSQL around 3 months ago and I've been
>> working with it since. Postgres is great and I'm really enjoying it,
>> but I've hit a bit of a hitch. Originally (and against pgAdmin's good
>> advice, duh!) I set up the database to use ASCII encoding. However we
>> have a large base of Spanish speaking members and services, and we
>> need utf-8
> ...snip snip
>>
>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte
>> sequence for encoding "UTF8": 0xc52f
>> HINT:  This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
>> CONTEXT:  COPY transaction, line 209487
>> WARNING: errors ignored on restore: 1
>>
>
> Try editing your dump-file and change the line which reads "SET
> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';"
>
Ok, so I figured out that head -n -10 and tail -n +10 are not the same
thing, and I've got a decent file now. However when I try the restore I
get this

pg_restore.exe -i -h 192.168.1.xxx -p 5432 -U foobar -d warehouse_utf8
-a -t "transaction" -v "O:\foo\bar\fixed.backup"
pg_restore: [archiver] out of memory

Process returned exit code 1.

I tried upping some of the memory settings in postgresql.conf. The
server has ~2gb of RAM unused, and the file is ~65mb. Anybody have any
ideas?

Also, it's taking around an hour and a half for a message to go from my
computer to being posted on the list. Is there a problem with the
mailing list software?

Thanks again for any assistance you can give me.

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)

Re: ascii to utf-8

From

Tom Lane

Date:

23 January 2008, 20:26:14

Tom Hart <tomhart@coopfed.org> writes:
> Tommy Gildseth wrote:
>> Try editing your dump-file and change the line which reads "SET
>> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';"
>>
> I tried making the changes you specified with notepad, wordpad, gVim,
> vim and emacs and in each case pgAdmin (and pg_restore) complain about
> the dump header being corrupted.

You can't really manually edit a custom or tar-format archive.
What you'll need to do is use pg_restore to emit a plain SQL script
from the archive, then edit that, then load it via psql (NOT pg_restore).

            regards, tom lane

Re: ascii to utf-8

From

Tom Lane

Date:

23 January 2008, 20:26:47

Tom Hart <tomhart@coopfed.org> writes:
> Also, it's taking around an hour and a half for a message to go from my
> computer to being posted on the list. Is there a problem with the
> mailing list software?

Yeah, every so often the PG mail servers get kinda clogged up.
I pinged Marc about this instance already ...

            regards, tom lane

Re: ascii to utf-8

From

Tomasz Ostrowski

Date:

24 January 2008, 07:56:39

On Wed, 23 Jan 2008, Tom Hart wrote:

>>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
>>> for encoding "UTF8": 0xc52f
>>
>> Try editing your dump-file and change the line which reads "SET
>> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';"
>
> I tried making the changes you specified with notepad, wordpad, gVim, vim
> and emacs and in each case pgAdmin (and pg_restore) complain about the dump
> header being corrupted.

Try an "-E LATIN1" option in pg_dump. Do you at least know what is
the encoding of data in the database?

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
                                                      Winnie the Pooh

Re: ascii to utf-8

From

Tom Hart

Date:

25 January 2008, 16:55:36

Ok, that did it. Thank you for the help.

Tomasz Ostrowski wrote:
> On Wed, 23 Jan 2008, Tom Hart wrote:
>
>
>>>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
>>>> for encoding "UTF8": 0xc52f
>>>>
>>> Try editing your dump-file and change the line which reads "SET
>>> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';"
>>>
>> I tried making the changes you specified with notepad, wordpad, gVim, vim
>> and emacs and in each case pgAdmin (and pg_restore) complain about the dump
>> header being corrupted.
>>
>
> Try an "-E LATIN1" option in pg_dump. Do you at least know what is
> the encoding of data in the database?
>
> Regards
> Tometzky
>

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)