Thread: Problem while restoring a database from SQL_ASCII to UTF-8
Hi,
I have a database in SQL_ASCII Encoding format from Postgresql-7.4.19. I have migrated to Postgresql-8.4.9 and to UTF-8 and want to restore this database in UTF-8 encoding. I am facing problem in restoring it as UTF-8. I have 29 tables in the database, out of which it is unable to restore 3 tables. It gives the following error:
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 83; 0 40257 TABLE DATA document postgres
pg_restore: [archiver (db)] COPY failed for table "document": ERROR: invalid byte sequence for encoding "UTF8": 0xe92034
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY document, line 15
pg_restore: [archiver (db)] Error from TOC entry 88; 0 40271 TABLE DATA attribute postgres
pg_restore: [archiver (db)] COPY failed for table "attribute": ERROR: invalid byte sequence for encoding "UTF8": 0xe90931
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY attribute, line 909
pg_restore: [archiver (db)] Error from TOC entry 90; 0 40277 TABLE DATA revision_state postgres
pg_restore: [archiver (db)] COPY failed for table "revision_state": ERROR: invalid byte sequence for encoding "UTF8": 0xe9e96e
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY revision_state, line 363
I have checked these three tables and I found that these tables contain special characters as values for some attributes for some records. I tried to use the procedure mentioned in the following link, which didn’t worked for me:
http://archives.postgresql.org/pgsql-admin/2004-07/msg00206.php
I also have blobs in one table in this database, but I don’t think/see any issues with them. Is there a way to restore the database from SQL_ASCII to UTF-8? Is there a way to restore the database as SQL_ASCII and then convert the Encoding to UTF-8? What is the consistency level? Or are there chances of getting garbage?
Kind Regards,
Manoj Agarwal
On 06/06/2012 04:30 AM, Manoj Agarwal wrote: > Hi, > > I have a database in SQL_ASCII Encoding format from Postgresql-7.4.19. I > have migrated to Postgresql-8.4.9 and to UTF-8 and want to restore this > database in UTF-8 encoding. I am facing problem in restoring it as > UTF-8. I have 29 tables in the database, out of which it is unable to > restore 3 tables. It gives the following error: > > > > I have checked these three tables and I found that these tables contain > special characters as values for some attributes for some records. I > tried to use the procedure mentioned in the following link, which didn’t > worked for me: > > http://archives.postgresql.org/pgsql-admin/2004-07/msg00206.php > > I also have blobs in one table in this database, but I don’t think/see > any issues with them. Is there a way to restore the database from > SQL_ASCII to UTF-8? Is there a way to restore the database as SQL_ASCII > and then convert the Encoding to UTF-8? What is the consistency level? > Or are there chances of getting garbage? SQL_ASCII is not an encoding, it is basically a way of saying encoding ignored. Given that, outside knowledge of the encoding used for data inserted into the database is needed to make the conversion to UTF8. In other words do you have any idea of what encoding was used by the clients that supplied data to the database? > > Kind Regards, > > Manoj Agarwal > -- Adrian Klaver adrian.klaver@gmail.com
On 6 June 2012 14:12, Adrian Klaver <adrian.klaver@gmail.com> wrote: > SQL_ASCII is not an encoding, it is basically a way of saying encoding > ignored. Given that, outside knowledge of the encoding used for data > inserted into the database is needed to make the conversion to UTF8. In > other words do you have any idea of what encoding was used by the clients > that supplied data to the database? You're going to have to manage this process carefully. I believe the best tool for the job is pgloader: http://pgfoundry.org/projects/pgloader/ It's certainly not inconceivable that this sort of tool could automagically deduce encoding (albeit often inconclusively) in much the same manor as modern web browsers attempt to when they have no alternative. I don't see much demand for that though. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Hi, The encoding that is used by the clients that supplied data to the database is ISO-8859-1. Kind Regards, Manoj Agarwal -----Original Message----- From: Adrian Klaver [mailto:adrian.klaver@gmail.com] Sent: Wednesday, June 06, 2012 6:43 PM To: Manoj Agarwal Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Problem while restoring a database from SQL_ASCII to UTF-8 On 06/06/2012 04:30 AM, Manoj Agarwal wrote: > Hi, > > I have a database in SQL_ASCII Encoding format from Postgresql-7.4.19. > I have migrated to Postgresql-8.4.9 and to UTF-8 and want to restore > this database in UTF-8 encoding. I am facing problem in restoring it > as UTF-8. I have 29 tables in the database, out of which it is unable > to restore 3 tables. It gives the following error: > > > > I have checked these three tables and I found that these tables > contain special characters as values for some attributes for some > records. I tried to use the procedure mentioned in the following link, > which didn't worked for me: > > http://archives.postgresql.org/pgsql-admin/2004-07/msg00206.php > > I also have blobs in one table in this database, but I don't think/see > any issues with them. Is there a way to restore the database from > SQL_ASCII to UTF-8? Is there a way to restore the database as > SQL_ASCII and then convert the Encoding to UTF-8? What is the consistency level? > Or are there chances of getting garbage? SQL_ASCII is not an encoding, it is basically a way of saying encoding ignored. Given that, outside knowledge of the encoding used for data inserted into the database is needed to make the conversion to UTF8. In other words do you have any idea of what encoding was used by the clients that supplied data to the database? > > Kind Regards, > > Manoj Agarwal > -- Adrian Klaver adrian.klaver@gmail.com
On 06/06/2012 08:30 PM, Manoj Agarwal wrote: > Hi, > > The encoding that is used by the clients that supplied data to the database > is ISO-8859-1. You said previously that you had 29 tables in the 7.4 database and three of those had errors when upgrading to the 8.4 database. At what state is the 8.4 database?: 1) None of the tables from the 7.4 database are in it. or 2) 26 tables are in it and a) The schema for the three problem tables are there but not the data. b) Neither the table schema or the data for the three tables is there. Where I am going with this is using client_encoding to temporarily set the client_encoding to ISO88591 while restoring the data for those tables/entire database. The easiest way I can think of to do that is for a whole database restore is set the client_encoding in postgresql.conf, do a pg_ctl reload, restore, unset the client_encoding, pg_ctl reload. This would entail excluding other clients from the db for the duration. There are other ways to do this depending on the answers above. > > Kind Regards, > Manoj Agarwal -- Adrian Klaver adrian.klaver@gmail.com .
On 06/06/2012 08:30 PM, Manoj Agarwal wrote: > Hi, > > The encoding that is used by the clients that supplied data to the database > is ISO-8859-1. I meant to ask in my previous post: 1) What are the special characters causing the problem? 2) What is the data type for the fields holding these characters? > > Kind Regards, > Manoj Agarwal -- Adrian Klaver adrian.klaver@gmail.com
Hi, I took a slightly different approach to fix the issue. There is a CPAN module called Encoding::FixLatin, that takes mixed encoding input and produces UTF-8 output. It can be downloaded from the following link: http://search.cpan.org/dist/Encoding-FixLatin/lib/Encoding/FixLatin.pm This module takes input which may contain characters in more than one encoding and makes a best effort to convert them all to UTF-8 output. I took the dump from Postgresql 7.4.19 database where encoding was SQL_ASCII. Then I created a new database in UTF-8 in Postgresql 8.4.9. Then I installed the above Perl module. After installing the module, I used the following command to restore SQL_ASCII database to UTF-8: pg_restore -O db_dump_sql_ascii | fix_latin | psql -d db_utf The conversion was done successfully without any errors. The database was restored as UTF-8. Kind Regards, Manoj Agarwal -----Original Message----- From: Adrian Klaver [mailto:adrian.klaver@gmail.com] Sent: Friday, June 08, 2012 2:19 AM To: Manoj Agarwal Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Problem while restoring a database from SQL_ASCII to UTF-8 On 06/06/2012 08:30 PM, Manoj Agarwal wrote: > Hi, > > The encoding that is used by the clients that supplied data to the > database is ISO-8859-1. I meant to ask in my previous post: 1) What are the special characters causing the problem? 2) What is the data type for the fields holding these characters? > > Kind Regards, > Manoj Agarwal -- Adrian Klaver adrian.klaver@gmail.com