Thread: Error while loading sql file

Error while loading sql file

From

Adarsh Sharma

Date:

26 December 2011, 03:21:39

Dear all,

I am facing a unique issue when I try to load an sql into a postgresql database :-

ERROR: invalid byte sequence for encoding "UTF8": 0xe32720
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
ERROR: invalid byte sequence for encoding "UTF8": 0xe32720
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: current transaction is aborted, commands ignored until end of transaction block

I faced an issue some days ago & I solved the issue by the below command :

( http://blog.e-shell.org/134 )

ERROR:  invalid byte sequence for encoding "UTF8": 0xf36e2020
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

cat backup.sql | recode iso-8859-1..u8 > backup.sql

But this time the byte sequence changes to Japanese , & I fail to solve the issue. Please let me know how to solve the issue as typing the error in Google shows only one link:

Thanks

Re: Error while loading sql file

From

Alban Hertroys

Date:

26 December 2011, 08:18:45

On 26 Dec 2011, at 8:22, Adarsh Sharma wrote:

> Dear all,
>
> I am facing a unique issue when I try to load an sql into a postgresql database :-

Actually, your issue isn't unique at all. You'll find it reoccurs on this list regularly, although perhaps less
frequentthese days. 

>  I faced an issue some days ago & I solved the issue by the below command :
> cat backup.sql | recode iso-8859-1..u8 > backup.sql

That command assumes that every string in the sql file is encoded as iso-8859-1 (unless it already is unicode).

> But this time the byte sequence changes to Japanese , & I fail to solve the issue. Please let me know how to solve
theissue as typing the error in Google shows only one link: 
> ( http://blog.e-shell.org/134 )

The above recode command works for the guys in the blog post you linked, as they were converting a database with
Spanishdata to UTF-8. They knew what encoding they were coming from. 

In your case, you have a mixed bag of encodings, going all the way from latin encodings to japanese.

I'm not sure what recode would do to data that's in a different encoding than the specified source encoding - I expect
thatit will just assume it's in the specified source encoding (it cannot know that this isn't the case for a particular
string)and attempt to convert it to UTF-8 _using that encoding_. 

Chances are you just converted valid data in a different encoding (than the source encoding you specified) into garbage
inUTF-8... I seem to recall that if recode runs into problems recoding a string to UTF-8 it will leave it untouched,
butthat will NOT happen in all cases. Sometimes it will succeed, even though the result has no meaning to a human. 

That's a nasty problem you ran into, I hope the archives provide the wisdom you need.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

Re: Error while loading sql file

From

Adarsh Sharma

Date:

27 December 2011, 01:31:31

Thanks for the Explaination,
I find it hard to determine the way to store data in different encodings to store in postgresql, below is the demo of some data :-

INSERT INTO conceptnet_frame VALUES(3884,'ja','{1}は{2}を持っている。',16,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame VALUES(3885,'ja','{1}は{2}と同じくらい大きい。',31,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame VALUES(3886,'ja','{1}は{2}と同じくらいの大きさである。',31,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame VALUES(3887,'ja','{1}は{2}と同じくらい小さい。',31,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame VALUES(3888,'ja','{1}の痛みの強度は、{2}と同じくらい。',29,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame VALUES(3889,'ja','{1}の痛み方は、{2}に似ている。',28,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame VALUES(3890,'ja','{1}は、{2}のインスタンスである。',14,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame VALUES(3891,'ja','{2}に成功したい時、{1}は一般的だ。',9,3,2140,NULL,NULL,NULL);

Below link explains all the things :-
http://www.depesz.com/index.php/2010/03/07/error-invalid-byte-sequence-for-encoding/

Above link shows the above encoding schemes is in utf16 format but postgresql-8.4 doesn't support it.
Is there any way to store data in different encoding in a utf-8 database.

Happy Holidays!

Alban Hertroys wrote:

On 26 Dec 2011, at 8:22, Adarsh Sharma wrote:

Dear all,

I am facing a unique issue when I try to load an sql into a postgresql database :-

Actually, your issue isn't unique at all. You'll find it reoccurs on this list regularly, although perhaps less frequent these days.

 I faced an issue some days ago & I solved the issue by the below command :
cat backup.sql | recode iso-8859-1..u8 > backup.sql

That command assumes that every string in the sql file is encoded as iso-8859-1 (unless it already is unicode).

But this time the byte sequence changes to Japanese , & I fail to solve the issue. Please let me know how to solve the issue as typing the error in Google shows only one link:
( http://blog.e-shell.org/134 )

The above recode command works for the guys in the blog post you linked, as they were converting a database with Spanish data to UTF-8. They knew what encoding they were coming from.

In your case, you have a mixed bag of encodings, going all the way from latin encodings to japanese.

I'm not sure what recode would do to data that's in a different encoding than the specified source encoding - I expect that it will just assume it's in the specified source encoding (it cannot know that this isn't the case for a particular string) and attempt to convert it to UTF-8 _using that encoding_.

Chances are you just converted valid data in a different encoding (than the source encoding you specified) into garbage in UTF-8... I seem to recall that if recode runs into problems recoding a string to UTF-8 it will leave it untouched, but that will NOT happen in all cases. Sometimes it will succeed, even though the result has no meaning to a human.

That's a nasty problem you ran into, I hope the archives provide the wisdom you need.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

Re: Error while loading sql file

From

John R Pierce

Date:

27 December 2011, 01:42:52

On 12/26/11 9:32 PM, Adarsh Sharma wrote:
> Above link shows the above encoding schemes is in utf16 format but
> postgresql-8.4 doesn't support it.
> Is there any way to store data in different encoding in a utf-8 database.

afaik, anything in UTF16 can be represented by UTF8, but the inverse is
not true.   you'd want to convert the data to the CLIENT_ENCODING,
whatever thats set to, before you send it to postgres.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast

Re: Error while loading sql file

From

Bèrto ëd Sèra

Date:

27 December 2011, 08:21:12

HI Ardash!

INSERT INTO conceptnet_frame VALUES(3884,'ja','{1}は{2}を持っている。',16,3,2140,NULL,NULL,NULL);
....

Can you still access the database that produced the dump? If so, you may want to produce a number of dumps for distinct language values. Japanese, in particular, is a very complicated issue, as Japanese change alphabet to produce the same effect Latin writers get by using bold and italic. So text searches are quite a nightmare, since DOG is written in different alphabets (Latin included), depending on what the author had in mind, in terms of underlining this specific word/expression.

If you could manage to insulate single languages things would definitely get less complex, as you can track down specific problems related to a single language (or maybe to an external source that inserted weird encodings). If the original DB is not available you could still filter the dump file for such values (although this is a very boring experience).

One of the good things with this approach (in case you need to do heavy text searches on your data) is that (as of PG 9.1) you can model structures like:

CREATE COLLATION "de_DE.utf8" (

LC_COLLATE = "de_DE.utf8",

LC_CTYPE = "de_DE.utf8"

);

CREATE COLLATION "en_GB.utf8" (

LC_COLLATE = "en_GB.utf8",

LC_CTYPE = "en_GB.utf8"

);

# NOTE!! all these locales must pre-exist on your box, you are simply importing them into PG with this CREATE phase.

CREATE TABLE conceptnet_frame_root (

id BIGINT NOT NULL PRIMARY KEY,

ISO693_1_code CHAR(2) NOT NULL

);

CREATE TABLE conceptnet_frame_en (

linguistic_content TEXT COLLATE "en_GB.utf8" NOT NULL

) INHERITS (conceptnet_frame_root);

CREATE TABLE conceptnet_frame_de (

linguistic_content TEXT COLLATE "de_DE.utf8" NOT NULL

) INHERITS (conceptnet_frame_root);

Then have a view pick up the entire linguistic bouquet, if needed. Sadly you cannot leave 'linguistic_content' in the root table, because you won't be able to use an alter table on it later, to alter the collation at single inherited table level, so you are responsible of ensuring type consistency on your own. It does add complication to the model, however, if you are into serious conflicts among different languages this is the best solution I could come up with, thus far.

You may want to have a look at:
http://www.postgresql.org/docs/9.1/static/collation.html

BTW, ISO 639-1 is a dangerous standard to use, if you are to make extensive language coverage, see

http://en.wikipedia.org/wiki/ISO_639

Hope this helps

Bèrto

--
==============================
If Pac-Man had affected us as kids, we'd all be running around in a darkened room munching pills and listening to repetitive music.

Re: Error while loading sql file

From

Adrian Klaver

Date:

27 December 2011, 10:40:59

On Monday, December 26, 2011 9:32:41 pm Adarsh Sharma wrote:
> Thanks for the Explaination,
> I find it hard to determine the way to store data in different encodings
> to store in postgresql, below is the demo of some data :-
>
> INSERT INTO conceptnet_frame
> VALUES(3884,'ja','{1}?{2}???????',16,3,2140,NULL,NULL,NULL);
> INSERT INTO conceptnet_frame
> VALUES(3885,'ja','{1}?{2}??????????',31,3,2140,NULL,NULL,NULL);

> Below link explains all the things :-
> http://www.depesz.com/index.php/2010/03/07/error-invalid-byte-sequence-for-
> encoding/
>
> Above link shows the above encoding schemes is in utf16 format but
> postgresql-8.4 doesn't support it.
> Is there any way to store data in different encoding in a utf-8 database.

Use iconv or recode to convert the UTF16 to UTF8 first.

>
> Happy Holidays!
>


--
Adrian Klaver
adrian.klaver@gmail.com