Thread: UTF-8 data migration problem in Postgresql 7.2

UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

19 February 2002, 10:53:34

Dear all,

I tried to migrate a UTF-8 database with Japanese content from Postgresql
7.1.3 to Postgresql 7.2 unsucessfully.

I made a pg_dump of the 7.1.3 database and tried to upload it in 7.2.
Error message was "Invalid Unicode character sequence found".

I tried to upload data in a fresh PostgreSQL 7.1.3 database and everything
worked OK. The problem can be reproduced on PostgreSQL 7.2.

Do not hesitate to ask me for a pg_dump of the involved tables. I cannot
submit them on the list for privacy reasons. Thanks for your help.

Cheers,
Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

19 February 2002, 16:54:13

Le Mardi 19 Février 2002 17:03, Trond Eivind Glomsrød a écrit :
> Same here.
I am stuck with this UTF-8 migration problem.

Is it safe to connect to 2 separate databases (Left=7.1.3, Right=7.2) and to
migrate UTF-8 data using blind-native ADODB calls? I would say yes, but can
some hacker confirm? Otherwize I can wait a few days, but not more.

Any help my friends?

Cheers,
Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

20 February 2002, 02:38:47

Le Mercredi 20 Février 2002 01:50, Tatsuo Ishii a écrit :
> The error messages shows your data contains an invalid UTF-8 character
> sequence. The first byte (e8) expects the UTF-8 sequence consists of 3
> bytes, each of 8th bit on. However the third byte is 0x27, apparently
> it does not satisfy the requirement. Check your data.

Dear Tatsuo,

This data was created using pg_dump in PostgreSQL 7.1.3. The data load well
in PostgreSQL 7.1.3, but not into PostgreSQL 7.2.

Is this a bug or is the data faulty? How can I verify UTF-8 data already
stored into PostgreSQL 7.1.3?

Best regards,
Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

20 February 2002, 05:41:14

> Le Mercredi 20 Février 2002 01:50, Tatsuo Ishii a écrit :
> > The error messages shows your data contains an invalid UTF-8 character
> > sequence. The first byte (e8) expects the UTF-8 sequence consists of 3
> > bytes, each of 8th bit on. However the third byte is 0x27, apparently
> > it does not satisfy the requirement. Check your data.

Dear all,

Can anyone help me understand where the problem stems from.

1) Is this a data perversion problem (=wrong UTF-8 sequences entered)? Then,
how can I fix it? Why did 7.1.3 allow the input of such data?

2) The data was created with pg_dump. It loads without problem in 7.1.3. But
not  in 7.2. Why?

3) Ultimately, please tell me if I can use ADODB odbc calls to migrate UTF-8
data using VB. Are there special settings required? Should I wait for a fix
in 7.2 or should I go for VB right now?

I am stuck and cannot work anymore (my new code is on 7.2, the old data is on
7.1.3). Thank you for your help,

Cheers,
Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

20 February 2002, 10:11:17

Dear Tatsuo,

Thank you for this answer.

> Why don't you edit the output from pg_dump? It's just a text file.
I edited the output in Yudit. Nothing seems to be wrong.

> > Why did 7.1.3 allow the input of such data?
> It's just becuase no data validation checking was there before 7.2.
Data was entered by a Japanese translator using IE5.5, japanese fonts, Win2K,
Apache and Php with UTF-8 extensions. How can it be wrong? It displays well
under IE5.5. Are you sure it is not a PostgreSQL 7.1.3 pg_dump related bug?
Other users have the same problem when migrating to 7.2.

Furthermore, do you think I can migrate data using ODBC adodb calls?

Thank you for your help,
Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tom Lane

Date:

20 February 2002, 11:09:41

Jean-Michel POURE <jm.poure@freesurf.fr> writes:
> 1) Is this a data perversion problem (=wrong UTF-8 sequences entered)? Then,
> how can I fix it? Why did 7.1.3 allow the input of such data?

7.1 did not have validity checks for UTF sequences.  7.2 does.  This
does not seem like a bad change to me.

            regards, tom lane

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

20 February 2002, 11:39:43

Dear friends,

> Hard to believe it's a pg_dump bug.
You are right, Yudit shows an invalid UTF-8 character.

For example, record OID 20643 has invalid =E3=82 character. But, when
querying the PostgreSQL 7.1.3 database, I can see the missing character which
displays well in Apache/PHP.

It seems to be a pg_dump bug or a peculiar kind of UTF-8 (is there any
difference between Windows / Unix ?). There are hundreds of errors in the
complete dump. But the data displays always well.

How can I test the database further in psql or in PHP. I would like to query
record OID 20643 and see if E3/82 is there or not. Any idea how to proceed?

Thank you very much for your help.
Best regards,
Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tom Lane

Date:

20 February 2002, 12:16:55

Jean-Michel POURE <jm.poure@freesurf.fr> writes:
> You are right, Yudit shows an invalid UTF-8 character.

> For example, record OID 20643 has invalid =E3=82 character. But, when
> querying the PostgreSQL 7.1.3 database, I can see the missing character which
> displays well in Apache/PHP.

Does your Apache/PHP environment set any particular client encoding?

I am wondering if Apache is seeing a different translation of the
database contents than what pg_dump got.

BTW, it seems entirely likely that Windows apps might display something
for UTF-8 sequences that are invalid according to the published specs.
It'd be just like Microsoft to "extend" the standard...

            regards, tom lane

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

20 February 2002, 12:38:56

> Does your Apache/PHP environment set any particular client encoding?
> I am wondering if Apache is seeing a different translation of the
> database contents than what pg_dump got.
I recompiled PHP with UTF-8 internal encoding. So no transcoding is performed.

> BTW, it seems entirely likely that Windows apps might display something
> for UTF-8 sequences that are invalid according to the published specs.
> It'd be just like Microsoft to "extend" the standard...

It might be the case, I agree. Could it be a problem of surface (UCS-2
characters embedded in UTF-8)? PostgreSQL 7.3 should accept more UTF-8
characters anyway as I can display all Japanese characters. Many users may
have the same problem. So why filter?

Again, Ithe problem might also come from pg_dump. The database has French and
Dutch content. There are also errors in the Dutch file... Is there a way to
perform some testing?

Cheers,
Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Hiroshi Inoue

Date:

20 February 2002, 19:25:22

Jean-Michel POURE wrote:
>
> Dear friends,
>
> > Hard to believe it's a pg_dump bug.
> You are right, Yudit shows an invalid UTF-8 character.
>
> For example, record OID 20643 has invalid =E3=82 character. But, when
> querying the PostgreSQL 7.1.3 database, I can see the missing character which
> displays well in Apache/PHP.

Could you send me the image(jpeg, bmp etc format) which
corresponds to the line(OID 20643) ?

regards,
Hiroshi Inoue

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

21 February 2002, 04:13:56

Dear Tatsuo,

Thank you for your previous answer.

> o Were server/clien encodings UTF-8 for PostgreSQL?
Yes.

> o What are versions of these softwares? Especially of PHP? Is it a
> PHP4? if so, what version? What is the "Php with UTF-8 extensions"?
> I've never heard of it.
It is PHP 4.0.6 with :
--enable-mbstring : Enable mbstring functions. This option is required to use
mbstring functions.
--enable-mbstr-enc-trans : Enable HTTP input character encoding conversion
using mbstring conversion engine. If this feature is enabled, HTTP input
character encoding may be converted to mbstring.internal_encoding
automatically.

Now, some more information:
1) Dutch text was entered using IE5.5. It is not faulty.

2) Japanese text was entered using OpenOffice latest release (sorry, I said
IE5 but I was wrong), saved under UTF-8 and imported in PostgreSQL. Only
Japanese data has problems.

3) When opening a faulty Japanese record using Apache/IE5, the record is
displayed correctly. Each faulty character is replaced by a Japanese 30A7
gryph (looks like a French cross with two horizontal lines). What is this
gryph? Does it mean 'I don't know' in Japanese.

The record is saved correctly using this 30A1 gryph (then it looks like it is
fixed as I can dump it and import it in 7.2, but this is not a solution).

4) In PostgreSQL 7.1.3 original dump, there is only one faulty UTF-8
character repeated 700 times. If you open my file in Yudit, it is displayed
as =E3=82' Why is it always the same character everywhere? Maybe you could
have a look at my source file again. Sounds like a bug (Open Office or
PostgreSQL).

5) Surrogate pairs
I heard PostgreSQL did not support surrogate pairs. Is this a problem of
surrogate pair? Just my 0.02 cents, I know very little about UTF-8.

Any help appreciated,
Thanks, Jean-Michel POURE

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Patrice Hédé

Date:

21 February 2002, 14:40:49

Hi Jean-Michel,

I just started browsing this list again after a long absence...

* Jean-Michel POURE <jm.poure@freesurf.fr> [020221 18:39]:
> 5) Surrogate pairs
> I heard PostgreSQL did not support surrogate pairs. Is this a problem of
> surrogate pair? Just my 0.02 cents, I know very little about UTF-8.

Surrogate pairs only exist in UTF-16. They are used to access
characters which are not on the BMP.

UTF-8 has a different way to encode these characters. Encoding
surrogates in UTF-8 is invalid and should be rejected by any
application receiving a UTF-8 stream (actually, they used to be just
irregular, but starting with Unicode 3.2, they will be illegal).

Regarding your sequence E3/82/27, it cannot be valid under any scheme.
UTF-8 is done in a way that any subsequent byte is equal or above
0x80. For E3 in particular, the 3rd byte has to be between 80 and BF.

Anyway "UTF-8 encoded surrogates" can only start with ED, so that's
not your problem here.

Hope this helps.

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

21 February 2002, 14:43:40

> > o Were server/clien encodings UTF-8 for PostgreSQL?
> Yes.
>
> > o What are versions of these softwares? Especially of PHP? Is it a
> > PHP4? if so, what version? What is the "Php with UTF-8 extensions"?
> > I've never heard of it.
> It is PHP 4.0.6 with :
> --enable-mbstring : Enable mbstring functions. This option is required to use
> mbstring functions.
> --enable-mbstr-enc-trans : Enable HTTP input character encoding conversion
> using mbstring conversion engine. If this feature is enabled, HTTP input
> character encoding may be converted to mbstring.internal_encoding
> automatically.

Oh, that's a general functionality for handling multibyte characters,
not only for UTF-8. What are settings for mbstring in php.ini?
(entries begin with "mbstring.")

BTW, PHP4.0.6 is very buggy when used with PostgreSQL (random
crashes). I recomend to upgrade to 4.1.1.

> Now, some more information:
> 1) Dutch text was entered using IE5.5. It is not faulty.

I assume the web page's encoding was UTF-8.

> 2) Japanese text was entered using OpenOffice latest release (sorry, I said
> IE5 but I was wrong), saved under UTF-8 and imported in PostgreSQL. Only
> Japanese data has problems.

Can I take a look at the UTF-8 text generated by OpenOffice?

> 3) When opening a faulty Japanese record using Apache/IE5, the record is
> displayed correctly. Each faulty character is replaced by a Japanese 30A7
> gryph (looks like a French cross with two horizontal lines). What is this
> gryph? Does it mean 'I don't know' in Japanese.

What do you mean by "gryph"? Is 30A7 is an EUC-JP?

> The record is saved correctly using this 30A1 gryph (then it looks like it is
> fixed as I can dump it and import it in 7.2, but this is not a solution).

Again, what is "gryph"?

> 4) In PostgreSQL 7.1.3 original dump, there is only one faulty UTF-8
> character repeated 700 times. If you open my file in Yudit, it is displayed
> as =E3=82' Why is it always the same character everywhere? Maybe you could
> have a look at my source file again. Sounds like a bug (Open Office or
> PostgreSQL).
>
> 5) Surrogate pairs
> I heard PostgreSQL did not support surrogate pairs. Is this a problem of
> surrogate pair? Just my 0.02 cents, I know very little about UTF-8.

I don't think so.
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

teg@redhat.com (Trond Eivind Glomsrød)

Date:

21 February 2002, 15:13:39

Jean-Michel POURE <jm.poure@freesurf.fr> writes:

> Dear all,
>
> I tried to migrate a UTF-8 database with Japanese content from Postgresql
> 7.1.3 to Postgresql 7.2 unsucessfully.
>
> I made a pg_dump of the 7.1.3 database and tried to upload it in 7.2.
> Error message was "Invalid Unicode character sequence found".

Same here.

--
Trond Eivind Glomsrød
Red Hat, Inc.

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

21 February 2002, 15:18:28

> I tried to migrate a UTF-8 database with Japanese content from Postgresql
> 7.1.3 to Postgresql 7.2 unsucessfully.
>
> I made a pg_dump of the 7.1.3 database and tried to upload it in 7.2.
> Error message was "Invalid Unicode character sequence found".
>
> I tried to upload data in a fresh PostgreSQL 7.1.3 database and everything
> worked OK. The problem can be reproduced on PostgreSQL 7.2.
>
> Do not hesitate to ask me for a pg_dump of the involved tables. I cannot
> submit them on the list for privacy reasons. Thanks for your help.

I have received the test data and got:

psql:dump_error.sql:40: ERROR:  copy: line 1, Invalid UNICODE character sequence found (0xe38227)

The error messages shows your data contains an invalid UTF-8 character
sequence. The first byte (e8) expects the UTF-8 sequence consists of 3
bytes, each of 8th bit on. However the third byte is 0x27, apparently
it does not satisfy the requirement. Check your data.
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

21 February 2002, 15:20:07

> 1) Is this a data perversion problem (=wrong UTF-8 sequences entered)? Then,
> how can I fix it?

Why don't you edit the output from pg_dump? It's just a text file.

> Why did 7.1.3 allow the input of such data?

It's just becuase no data validation checking was there before 7.2.
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

21 February 2002, 15:21:46

> > Why don't you edit the output from pg_dump? It's just a text file.
> I edited the output in Yudit. Nothing seems to be wrong.

I don't know anything about Yuedit, but I guess it skips wrong UTF-8
sequences. I see an encoding converter called "lv" does something like
that. Maybe filtering the file using such kind of tools and save the
result to a file then input to psql works fine.

> > > Why did 7.1.3 allow the input of such data?
> > It's just becuase no data validation checking was there before 7.2.
> Data was entered by a Japanese translator using IE5.5, japanese fonts, Win2K,
> Apache and Php with UTF-8 extensions. How can it be wrong? It displays well
> under IE5.5.

I do not trust M$ products:-)

> Are you sure it is not a PostgreSQL 7.1.3 pg_dump related bug?
> Other users have the same problem when migrating to 7.2.

Hard to believe it's a pg_dump bug.

> Furthermore, do you think I can migrate data using ODBC adodb calls?

I'm totally unsure about ODBC. Hiroshi, what do you think?
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

21 February 2002, 15:28:39

> I recompiled PHP with UTF-8 internal encoding. So no transcoding is performed.

Recompiled PHP with UTF-8? That sounds new to me. As far as I know,
there is no such a compiling option in PHP.

> > BTW, it seems entirely likely that Windows apps might display something
> > for UTF-8 sequences that are invalid according to the published specs.
> > It'd be just like Microsoft to "extend" the standard...
>
> It might be the case, I agree. Could it be a problem of surface (UCS-2
> characters embedded in UTF-8)? PostgreSQL 7.3 should accept more UTF-8
> characters anyway as I can display all Japanese characters. Many users may
> have the same problem. So why filter?

You need to make sure that M$ or whatever define such kind of
"extention" before jumping to the conclusion.

I still hard to believe E3/82/27 is a valid UTF-8 character even in
M$'s extention...
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

21 February 2002, 15:32:27

> You are right, Yudit shows an invalid UTF-8 character.
>
> For example, record OID 20643 has invalid =E3=82 character. But, when
> querying the PostgreSQL 7.1.3 database, I can see the missing character which
> displays well in Apache/PHP.

How did you judge "displays well"? Can you show me the window capture?

The broken UTF-8 (E3/82/27) should be E3/82/92 by guessing from a
consistency and grammertical correctness as Japanese according to
Hiroshi (and I agree with him).
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

21 February 2002, 15:34:55

> Data was entered by a Japanese translator using IE5.5, japanese fonts, Win2K,
> Apache and Php with UTF-8 extensions. How can it be wrong? It displays well
> under IE5.5.

Can you be more specific on how did you enter the data?

o Were server/clien encodings UTF-8 for PostgreSQL?

o What are versions of these softwares? Especially of PHP? Is it a
PHP4? if so, what version? What is the "Php with UTF-8 extensions"?
I've never heard of it.
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

22 February 2002, 02:41:54

Le Jeudi 21 Février 2002 05:04, Tatsuo Ishii a écrit :
> o Were server/clien encodings UTF-8 for PostgreSQL?
Yes, I am absolutely sure. Everything from Web page, Apache to PostgreSQL.

> o What are versions of these softwares? Especially of PHP? Is it a
> PHP4? if so, what version? What is the "Php with UTF-8 extensions"?
> I've never heard of it.

PHP 4.06. PHP needs to be compiled with "--enable-mbstring" to allow the
parsing of UTF-8 files and strings. Internal settings set to UTF-8.

I guess the error might come from PostgreSQL string parsing functions:
1) new.target_content:= translate(new.target_content, chr(146), chr (39)) ;
2) substring(text, int4). Is substring multibyte safe?

Cheers,
Jean-Michel

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Jean-Michel POURE

Date:

22 February 2002, 06:40:03

Le Vendredi 22 Février 2002 10:05, Tatsuo Ishii a écrit :
> I believe 7.2's translate and substring are multibyte safe.
> Can you show us reproducable examples?

No, this was under 7.1.3. Good to know 7.2 is multibyte safe.
Thanks, Jean-Michel

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

22 February 2002, 08:40:23

> I guess the error might come from PostgreSQL string parsing functions:
> 1) new.target_content:= translate(new.target_content, chr(146), chr (39)) ;
> 2) substring(text, int4). Is substring multibyte safe?

I believe 7.2's translate and substring are multibyte safe.
Can you show us reproducable examples?
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 data migration problem in Postgresql 7.2

From

Tatsuo Ishii

Date:

22 February 2002, 08:40:54

> I guess the error might come from PostgreSQL string parsing functions:
> 1) new.target_content:= translate(new.target_content, chr(146), chr (39)) ;
> 2) substring(text, int4). Is substring multibyte safe?

Oh, I think I found the source of the problem. Apparently you did 1)
above in PostgreSQL 7.1.3, right?

7.1's translate is not multibyte aware, so it simply replaces every
occurence of chr(146) = 0x92 to chr(39) = 0x27 even if it's part of
the multibyte sequece.

The broken seqneuce was E3/82/27, which is apparently replaced by
translate() from E3/82/92.
--
Tatsuo Ishii