Thread: Migration error " invalid byte sequence for encoding "UTF8": 0xff " from mysql 5.5 to postgresql 9.1

When I do migration from Mysql to PostgreSQL:
firstly dump data from mysql in cmd(encoding is GBK) is WIN8:
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql --default-character-set=utf8 --skip-add-locks --compact --no-create-info --skip-quote-names -uroot -p test >dbdata.sql

then load data to postgresql in cmd(encoding is GBK) is WIN8:
psql -h localhost  -d test -U postgres <  dbdata.sql

I got the error:
ERROR:  invalid byte sequence for encoding "UTF8": 0xff

I checked in dbdata.sql using UltraEdit, data "0xff" really exists as followings:
蛾4?4抿\0xfF???±??x¤?
and I use UltraEdit menu-> view -> encoding page to get: 936 GBK.


DDL in Mysql 5.5 is:
CREATE TABLE `personpicture` (
  `ID` char(32) NOT NULL,
  `Picture` mediumblob,
...
  KEY `personId` (`PersonID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DDL in PostgreSQL 9.1 is:
create database test encoding 'utf8';
CREATE TABLE personpicture (
  ID char(32) NOT NULL,
  Picture BYTEA,
  ....
  PRIMARY KEY (ID)
);

The error is related to Picture mediumblob and BYTEA, which is stored pic binary data.
My Operationg system is WIN8 64bit.

I've set postgresql.conf before starting everything, but error still occurs:
backslash_quote = on
escape_string_warning = off
standard_conforming_strings = off

How to resolve it?
Thanks!

peng


sunpeng <bluevaley@gmail.com> wrote:

> load data to postgresql in cmd(encoding is GBK) is WIN8:

> psql -h localhost  -d test -U postgres <  dbdata.sql
>
> I got the error:
> ERROR:  invalid byte sequence for encoding "UTF8": 0xff

If the encoding is GBK then you will get errors (or incorrect
characters) if it is read as UTF8.  Try setting the environment
variable PGCLIENTENCODING.

http://www.postgresql.org/docs/9.1/static/app-psql.html

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error still occurs.
And i use the following cmd to dump mysql data:
mysql> select Picture from personpicture where id = 'F2931306D1EE44ca82394CD3BC2404D4'  into outfile "d:\\1.txt" ;
I got the ansi file, and use Ultraedit to see first 16 bytes:
FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5C
It's different from mysql workbench to see:
FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01


peng



On Tue, Jul 1, 2014 at 9:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
sunpeng <bluevaley@gmail.com> wrote:

> load data to postgresql in cmd(encoding is GBK) is WIN8:

> psql -h localhost  -d test -U postgres <  dbdata.sql
>
> I got the error:
> ERROR:  invalid byte sequence for encoding "UTF8": 0xff

If the encoding is GBK then you will get errors (or incorrect
characters) if it is read as UTF8.  Try setting the environment
variable PGCLIENTENCODING.

http://www.postgresql.org/docs/9.1/static/app-psql.html

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

sunpeng wrote:
>>> load data to postgresql in cmd(encoding is GBK) is WIN8:
>>> 
>>> psql -h localhost  -d test -U postgres <  dbdata.sql
>>>
>>> I got the error:
>>> ERROR:  invalid byte sequence for encoding "UTF8": 0xff

>> If the encoding is GBK then you will get errors (or incorrect
>> characters) if it is read as UTF8.  Try setting the environment
>> variable PGCLIENTENCODING.
>> 
>> http://www.postgresql.org/docs/9.1/static/app-psql.html

> I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error still occurs.
> And i use the following cmd to dump mysql data:
> mysql> select Picture from personpicture where id = 'F2931306D1EE44ca82394CD3BC2404D4'  into outfile
> "d:\\1.txt" ;
> I got the ansi file, and use Ultraedit to see first 16 bytes:
> FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5C
> 
> It's different from mysql workbench to see:
> FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01

Changing the terminal code page won't do anything, it's probably the data
that are in a different encoding.

I don't know enough about MySQL to know which encoding it uses when dumping data,
but the man page of "mysqldump" tells me:

  --set-charset
  Add SET NAMES default_character_set to the output. This option is enabled by default.

So is there a SET NAMES command in the dump? If yes, what is the argument?

You will have to tell PostgreSQL the encoding of the data.
As Kevin pointed out, you can do that by setting the environment variable
PGCLIENT ENCODING to the correct value.  Then PostgreSQL will convert the
data automatically.

Yours,
Laurenz Albe

Thank you, friend, I use  --hex-blob :
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql --default-character-set=utf8 --skip-add-locks --compact --no-create-info --skip-quote-names --hex-blob -uroot -p test videorecresult >dbdata.sql
to dump mysql data.
And replace blob data "0x...." into "E'\\xx....'" to load data into postgresql.




On Fri, Jul 4, 2014 at 3:27 PM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
sunpeng wrote:
>>> load data to postgresql in cmd(encoding is GBK) is WIN8:
>>>
>>> psql -h localhost  -d test -U postgres <  dbdata.sql
>>>
>>> I got the error:
>>> ERROR:  invalid byte sequence for encoding "UTF8": 0xff

>> If the encoding is GBK then you will get errors (or incorrect
>> characters) if it is read as UTF8.  Try setting the environment
>> variable PGCLIENTENCODING.
>>
>> http://www.postgresql.org/docs/9.1/static/app-psql.html

> I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error still occurs.
> And i use the following cmd to dump mysql data:
> mysql> select Picture from personpicture where id = 'F2931306D1EE44ca82394CD3BC2404D4'  into outfile
> "d:\\1.txt" ;
> I got the ansi file, and use Ultraedit to see first 16 bytes:
> FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5C
>
> It's different from mysql workbench to see:
> FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01

Changing the terminal code page won't do anything, it's probably the data
that are in a different encoding.

I don't know enough about MySQL to know which encoding it uses when dumping data,
but the man page of "mysqldump" tells me:

  --set-charset
  Add SET NAMES default_character_set to the output. This option is enabled by default.

So is there a SET NAMES command in the dump? If yes, what is the argument?

You will have to tell PostgreSQL the encoding of the data.
As Kevin pointed out, you can do that by setting the environment variable
PGCLIENT ENCODING to the correct value.  Then PostgreSQL will convert the
data automatically.

Yours,
Laurenz Albe

On 7/4/2014 2:12 AM, sunpeng wrote:
> Thank you, friend, I use  --hex-blob :
> mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql
> --default-character-set=utf8 --skip-add-locks --compact
> --no-create-info --skip-quote-names --hex-blob -uroot -p test
> videorecresult >dbdata.sql
> to dump mysql data.
> And replace blob data "0x...." into "E'\\xx....'" to load data into
> postgresql.

regardless of all that, 0xFF is not a valid UTF8 character code. perhaps
you should store the data in a postgres BYTEA, or at least use character
encoding SQLASCII (which is to say, no encoding, bytes in == bytes out.)


--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



John R Pierce wrote:
> On 7/4/2014 2:12 AM, sunpeng wrote:
>> Thank you, friend, I use  --hex-blob :
>> mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql
>> --default-character-set=utf8 --skip-add-locks --compact
>> --no-create-info --skip-quote-names --hex-blob -uroot -p test
>> videorecresult >dbdata.sql
>> to dump mysql data.
>> And replace blob data "0x...." into "E'\\xx....'" to load data into
>> postgresql.

> regardless of all that, 0xFF is not a valid UTF8 character code. perhaps
> you should store the data in a postgres BYTEA, or at least use character
> encoding SQLASCII (which is to say, no encoding, bytes in == bytes out.)

Exactly.
According to mysqldump's man page, the affected field must be
BINARY, VARBINARY, the BLOB types or BIT.
For these PostgreSQL's "bytea" would definitely be the correct data type,
and there wouldn't be any encoding problems with that.

Yours,
Laurenz Albe