Thread: ascii to utf-8
Hello everybody. I hope your week's going well so far. I built our data mine in postgreSQL around 3 months ago and I've been working with it since. Postgres is great and I'm really enjoying it, but I've hit a bit of a hitch. Originally (and against pgAdmin's good advice, duh!) I set up the database to use ASCII encoding. However we have a large base of Spanish speaking members and services, and we need utf-8 encoding to maintain and support the extended character sets. In my naivety I thought it would be a relatively simple process to convert the db but I've found this to not be the case. I tried doing a dump and restore into a new database with the proper encoding, but pg_restore is getting hung up on one of the tables, our largest by far (~1gb, not huge I know). When I tried pg_restore from a command line (I was using pgAdmin, I know i'm a nub) I received this error. C:\Program Files\PostgreSQL\8.2\bin>pg_restore.exe -i -h 192.168.1.xxx -p 5432 -U foobar -d warehouse_utf8 -a -t "transaction" -v "O:\foo\bar\pg_dump_transaction.backup" pg_restore: connecting to database for restore Password: pg_restore: restoring data for table "transaction" pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 1706; 0 17861 TABLE DATA transaction foobar pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xc52f HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". CONTEXT: COPY transaction, line 209487 WARNING: errors ignored on restore: 1 I remember reading somewhere recently that I could use iconv to convert the ASCII encoded dump to UTF-8 encoding, but I'm currently on a windows box, and a windows server, so is there an easier way to do this? Also I was thinking perhaps it was possible to do an ETL type setup, where I can SELECT from the ASCII db and INSERT into the UTF-8 db. If you haven't gathered yet, I'm pretty in the dark regarding encoding issues, especially when applied to pg, so any help here would be appreciated. -- Tom Hart IT Specialist Cooperative Federal 723 Westcott St. Syracuse, NY 13210 (315) 471-1116 ext. 202 (315) 476-0567 (fax)
Tom Hart wrote: > Hello everybody. I hope your week's going well so far. > > I built our data mine in postgreSQL around 3 months ago and I've been > working with it since. Postgres is great and I'm really enjoying it, > but I've hit a bit of a hitch. Originally (and against pgAdmin's good > advice, duh!) I set up the database to use ASCII encoding. However we > have a large base of Spanish speaking members and services, and we > need utf-8 ...snip snip > > pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence > for encoding "UTF8": 0xc52f > HINT: This error can also happen if the byte sequence does not match > the encoding expected by the server, which is controlled by > "client_encoding". > CONTEXT: COPY transaction, line 209487 > WARNING: errors ignored on restore: 1 > Try editing your dump-file and change the line which reads "SET client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';" -- Tommy Gildseth
Tommy Gildseth wrote: > Tom Hart wrote: >> Hello everybody. I hope your week's going well so far. >> >> I built our data mine in postgreSQL around 3 months ago and I've been >> working with it since. Postgres is great and I'm really enjoying it, >> but I've hit a bit of a hitch. Originally (and against pgAdmin's good >> advice, duh!) I set up the database to use ASCII encoding. However we >> have a large base of Spanish speaking members and services, and we >> need utf-8 > ...snip snip >> >> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte >> sequence for encoding "UTF8": 0xc52f >> HINT: This error can also happen if the byte sequence does not match >> the encoding expected by the server, which is controlled by >> "client_encoding". >> CONTEXT: COPY transaction, line 209487 >> WARNING: errors ignored on restore: 1 >> > > Try editing your dump-file and change the line which reads "SET > client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';" > I tried making the changes you specified with notepad, wordpad, gVim, vim and emacs and in each case pgAdmin (and pg_restore) complain about the dump header being corrupted. This has been kind of a pain since the file is ~ 65mb and it's difficult to load something that size into a text editor. I also did a head > file, edited the file, and then did head -n -10 >> file, but once again I had no success. Is there an easy way of doing this, or perhaps a different way of solving the problem? -- Tom Hart IT Specialist Cooperative Federal 723 Westcott St. Syracuse, NY 13210 (315) 471-1116 ext. 202 (315) 476-0567 (fax)
Tommy Gildseth wrote: > Tom Hart wrote: >> Hello everybody. I hope your week's going well so far. >> >> I built our data mine in postgreSQL around 3 months ago and I've been >> working with it since. Postgres is great and I'm really enjoying it, >> but I've hit a bit of a hitch. Originally (and against pgAdmin's good >> advice, duh!) I set up the database to use ASCII encoding. However we >> have a large base of Spanish speaking members and services, and we >> need utf-8 > ...snip snip >> >> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte >> sequence for encoding "UTF8": 0xc52f >> HINT: This error can also happen if the byte sequence does not match >> the encoding expected by the server, which is controlled by >> "client_encoding". >> CONTEXT: COPY transaction, line 209487 >> WARNING: errors ignored on restore: 1 >> > > Try editing your dump-file and change the line which reads "SET > client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';" > Ok, so I figured out that head -n -10 and tail -n +10 are not the same thing, and I've got a decent file now. However when I try the restore I get this pg_restore.exe -i -h 192.168.1.xxx -p 5432 -U foobar -d warehouse_utf8 -a -t "transaction" -v "O:\foo\bar\fixed.backup" pg_restore: [archiver] out of memory Process returned exit code 1. I tried upping some of the memory settings in postgresql.conf. The server has ~2gb of RAM unused, and the file is ~65mb. Anybody have any ideas? Also, it's taking around an hour and a half for a message to go from my computer to being posted on the list. Is there a problem with the mailing list software? Thanks again for any assistance you can give me. -- Tom Hart IT Specialist Cooperative Federal 723 Westcott St. Syracuse, NY 13210 (315) 471-1116 ext. 202 (315) 476-0567 (fax)
Tom Hart <tomhart@coopfed.org> writes: > Tommy Gildseth wrote: >> Try editing your dump-file and change the line which reads "SET >> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';" >> > I tried making the changes you specified with notepad, wordpad, gVim, > vim and emacs and in each case pgAdmin (and pg_restore) complain about > the dump header being corrupted. You can't really manually edit a custom or tar-format archive. What you'll need to do is use pg_restore to emit a plain SQL script from the archive, then edit that, then load it via psql (NOT pg_restore). regards, tom lane
Tom Hart <tomhart@coopfed.org> writes: > Also, it's taking around an hour and a half for a message to go from my > computer to being posted on the list. Is there a problem with the > mailing list software? Yeah, every so often the PG mail servers get kinda clogged up. I pinged Marc about this instance already ... regards, tom lane
On Wed, 23 Jan 2008, Tom Hart wrote: >>> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence >>> for encoding "UTF8": 0xc52f >> >> Try editing your dump-file and change the line which reads "SET >> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';" > > I tried making the changes you specified with notepad, wordpad, gVim, vim > and emacs and in each case pgAdmin (and pg_restore) complain about the dump > header being corrupted. Try an "-E LATIN1" option in pg_dump. Do you at least know what is the encoding of data in the database? Regards Tometzky -- ...although Eating Honey was a very good thing to do, there was a moment just before you began to eat it which was better than when you were... Winnie the Pooh
Ok, that did it. Thank you for the help. Tomasz Ostrowski wrote: > On Wed, 23 Jan 2008, Tom Hart wrote: > > >>>> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence >>>> for encoding "UTF8": 0xc52f >>>> >>> Try editing your dump-file and change the line which reads "SET >>> client_encoding = 'SQL_ASCII';" to "SET client_encoding = 'LATIN1';" >>> >> I tried making the changes you specified with notepad, wordpad, gVim, vim >> and emacs and in each case pgAdmin (and pg_restore) complain about the dump >> header being corrupted. >> > > Try an "-E LATIN1" option in pg_dump. Do you at least know what is > the encoding of data in the database? > > Regards > Tometzky > -- Tom Hart IT Specialist Cooperative Federal 723 Westcott St. Syracuse, NY 13210 (315) 471-1116 ext. 202 (315) 476-0567 (fax)