Thread: again: Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't work

again: Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't work

From
"Enke, Michael"
Date:
Hello,
I reported bug #943 (found in 7.3.2), and you checked in a change to guard against an integer overflow.
I have now upgraded to 7.3.3, and I'm not happy with the result.
The exact error I described is fixed, but I found new errors in the conversion UTF-8 <-> EUC_TW and BIG5:

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING:  copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING:  copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING:  copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING:  copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

for EUC_TW
WARNING:  copy: line 1, LocalToUtf: could not convert (0x8ea3c3b7) EUC_TW to UTF-8. Ignored
WARNING:  copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
WARNING:  copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
WARNING:  copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored

Copy out to file from table (UTF-8 data):
to BIG5
WARNING:  UtfToLocal: could not convert UTF-8 (0xe7a281). Ignored
WARNING:  UtfToLocal: could not convert UTF-8 (0xe98ab9). Ignored
WARNING:  UtfToLocal: could not convert UTF-8 (0xe8a38f). Ignored
WARNING:  UtfToLocal: could not convert UTF-8 (0xe7b2a7). Ignored

to EUC_TW is ok!

Regards,
Michael


Re: again: Bug #943: Server-Encoding from EUC_TW to

From
Tatsuo Ishii
Date:
> Hello,
> I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
> Now I upgraded to 7.3.3 and I'm not happy with this.
> The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:
> 
> Copy to table (DB has UTF-8 encoding) from file:
> for PGCLIENTENCODING=BIG5:
> WARNING:  copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
> WARNING:  copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> WARNING:  copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
> WARNING:  copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

I see no problem here. The only standard conversion map I could find
in on-line form so far (see the URL below) does not include entries for
0xf9d6 or above.

http://www.unicode.org/Public/UNIDATA/Unihan.txt

> for EUC_TW
> WARNING:  copy: line 1, LocalToUtf: could not convert (0x8ea3c3b7) EUC_TW to UTF-8. Ignored
> WARNING:  copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
> WARNING:  copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
> WARNING:  copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored

Hum. These seem to be CNS 11643-1993, plane 3. Currently PostgreSQL
supports only:

CNS 11643-1993, plane 0
CNS 11643-1993, plane 1
CNS 11643-1993, plane 2
CNS 11643-1993, plane 15

Would you like to have support for the rest of the CNS 11643-1993 planes:

CNS 11643-1993, plane 3
CNS 11643-1993, plane 4
CNS 11643-1993, plane 5
CNS 11643-1993, plane 6
CNS 11643-1993, plane 7

added for the upcoming 7.4?
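
The plane can be read directly off the byte sequences in the warnings. Below is a minimal sketch in Python, assuming the usual EUC-TW layout: 0x8E as the single-shift byte, then a plane byte counted from 0xA1, then two bytes each offset by 0xA0. The function name `euc_tw_position` is made up for illustration and is not part of PostgreSQL.

```python
def euc_tw_position(seq: bytes) -> tuple[int, int, int]:
    """Return (plane, row, column) for one EUC-TW multibyte character."""
    if len(seq) == 4 and seq[0] == 0x8E:
        # 4-byte form: single-shift byte, plane byte (0xA1 = plane 1), row, column
        plane = seq[1] - 0xA0
        row, col = seq[2] - 0xA0, seq[3] - 0xA0
    elif len(seq) == 2:
        # 2-byte form always addresses plane 1
        plane = 1
        row, col = seq[0] - 0xA0, seq[1] - 0xA0
    else:
        raise ValueError("not a 2- or 4-byte EUC-TW sequence")
    if not (1 <= plane <= 16 and 1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("byte out of range for EUC-TW")
    return plane, row, col


# The four sequences from the COPY warnings quoted above.
for hexbytes in ("8ea3c3b7", "8ea3cfd0", "8ea3c4ce", "8ea3bdfe"):
    plane, row, col = euc_tw_position(bytes.fromhex(hexbytes))
    print(f"0x{hexbytes}: plane {plane}, row {row}, column {col}")
```

All four sequences carry the plane byte 0xA3, which is why they land in plane 3.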

> Copy out to file from table (UTF-8 data):
> to BIG5
> WARNING:  UtfToLocal: could not convert UTF-8 (0xe7a281). Ignored
> WARNING:  UtfToLocal: could not convert UTF-8 (0xe98ab9). Ignored
> WARNING:  UtfToLocal: could not convert UTF-8 (0xe8a38f). Ignored
> WARNING:  UtfToLocal: could not convert UTF-8 (0xe7b2a7). Ignored
> 
> to EUC_TW is ok!

BIG5 and EUC_TW have different code points. So this is not very strange.
--
Tatsuo Ishii


Re: again: Bug #943: Server-Encoding from EUC_TW to UTF-8

From
Tatsuo Ishii
Date:
> > > Copy to table (DB has UTF-8 encoding) from file:
> > > for PGCLIENTENCODING=BIG5:
> > > WARNING:  copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
> > > WARNING:  copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> > > WARNING:  copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
> > > WARNING:  copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored
> > 
> > I see no problem here. The only standard conversion map I could found
> > on-line form so far (see below URL) does not include entries 0xf9d6 or
> > above.
> 
> Sorry, I do not know anything about conversion maps and CNS 11643-1993 planes.
> I only got a file in BIG5 encoding from Taiwan and found that it is not possible
> to load all text to postgresql 7.3.3.
> But it is possible to convert to UTF-8 with iconv tool from glibc (Linux).
> It would be good if next release supports todays BIG5.

I'm not looking forward to adding any conversion entries confirmed by
standards. Can someone explain to me the current status of the
conversion maps between BIG5 and Unicode? The only info I could find
so far is at www.unicode.org.
--
Tatsuo Ishii


Re: [BUGS] again: Bug #943: Server-Encoding from EUC_TW

From
Tatsuo Ishii
Date:
> > > > Copy to table (DB has UTF-8 encoding) from file:
> > > > for PGCLIENTENCODING=BIG5:
> > > > WARNING:  copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
> > > > WARNING:  copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> > > > WARNING:  copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
> > > > WARNING:  copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored
> > > 
> > > I see no problem here. The only standard conversion map I could found
> > > on-line form so far (see below URL) does not include entries 0xf9d6 or
> > > above.
> > 
> > Sorry, I do not know anything about conversion maps and CNS 11643-1993 planes.
> > I only got a file in BIG5 encoding from Taiwan and found that it is not possible
> > to load all text to postgresql 7.3.3.
> > But it is possible to convert to UTF-8 with iconv tool from glibc (Linux).
> > It would be good if next release supports todays BIG5.
> 
> I'm not looking forward to add any conversion entries confirmed by
> standards. Can some one explain me the current status of the

Oops. The above should read:

I'm not looking forward to adding any conversion entries NOT confirmed by
standards.

> conversion maps between BIG5 and Unicode? The only info I could found
> so far is in www.unicode.org.


Re: again: Bug #943: Server-Encoding from EUC_TW to UTF-8

From
Tatsuo Ishii
Date:
> > > I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
> > > Now I upgraded to 7.3.3 and I'm not happy with this.
> > > The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:
> > >
> > > Copy to table (DB has UTF-8 encoding) from file:
> > > for PGCLIENTENCODING=BIG5:
> > > WARNING:  copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
> > > WARNING:  copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> > > WARNING:  copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
> > > WARNING:  copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored
> > 
> > I see no problem here. The only standard conversion map I could found
> > on-line form so far (see below URL) does not include entries 0xf9d6 or
> > above.
> > 
> > http://www.unicode.org/Public/UNIDATA/Unihan.txt
> 
> 
> I found in this file:
> U+F9D7 in line 604519
> U+F9D8 in line 219540
> U+F9D6...U+F9DB in lines 730707...730766.

No. U+F9D6 denotes a *Unicode* code point, not a BIG5 code point.
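
The distinction can be checked with a few lines of Python. The hex values in the BIG5 warnings are raw BIG5 bytes, while a U+XXXX value in Unihan.txt is a Unicode code point that looks quite different once it is encoded. A small sketch using only the built-in UTF-8 codec:

```python
# The UTF-8 byte sequences reported by the UtfToLocal warnings.
for hexbytes in ("e7a281", "e98ab9", "e8a38f", "e7b2a7"):
    char = bytes.fromhex(hexbytes).decode("utf-8")
    print(f"UTF-8 0x{hexbytes} is code point U+{ord(char):04X}")

# The Unicode code point U+F9D6 encodes to three UTF-8 bytes, which are
# unrelated to the two BIG5 bytes 0xf9 0xd6 from the COPY warning.
utf8_of_f9d6 = "\uf9d6".encode("utf-8")
print("U+F9D6 in UTF-8:", utf8_of_f9d6.hex())
```

So a search for "F9D6" in Unihan.txt finds the Unicode character U+F9D6, not the character that BIG5 stores as the bytes 0xf9 0xd6.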

> 
> > > for EUC_TW
> > > WARNING:  copy: line 1, LocalToUtf: could not convert (0x8ea3c3b7) EUC_TW to UTF-8. Ignored
> > > WARNING:  copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
> > > WARNING:  copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
> > > WARNING:  copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored
> > 
> > Hum. These seem to be CNS 11643-1993, plane 3. Currently PostgreSQL
> > supports only:
> > 
> > CNS 11643-1993, plane 0
> > CNS 11643-1993, plane 1
> > CNS 11643-1993, plane 2
> > CNS 11643-1993, plane 15
> > 
> > Would you like to have support for rest of CNS 11643-1993 planes:
> > 
> > CNS 11643-1993, plane 3
> > CNS 11643-1993, plane 4
> > CNS 11643-1993, plane 5
> > CNS 11643-1993, plane 6
> > CNS 11643-1993, plane 7
> > 
> > support for upcoming 7.4?
> > 
> > > Copy out to file from table (UTF-8 data):
> > > to BIG5
> > > WARNING:  UtfToLocal: could not convert UTF-8 (0xe7a281). Ignored
> > > WARNING:  UtfToLocal: could not convert UTF-8 (0xe98ab9). Ignored
> > > WARNING:  UtfToLocal: could not convert UTF-8 (0xe8a38f). Ignored
> > > WARNING:  UtfToLocal: could not convert UTF-8 (0xe7b2a7). Ignored
> > >
> > > to EUC_TW is ok!
> > 
> > BIG5 and EUC_TW have different code points. So this is not very strange.
> 
> 
> But it is very strange that I can (for EUC_TW) copy to file without error but I can not copy from file without error.
> 
> Michael
> 


Re: again: Bug #943: Server-Encoding from EUC_TW to UTF-8

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
> 
> > Hello,
> > I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
> > Now I upgraded to 7.3.3 and I'm not happy with this.
> > The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:
> >
> > Copy to table (DB has UTF-8 encoding) from file:
> > for PGCLIENTENCODING=BIG5:
> > WARNING:  copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
> > WARNING:  copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> > WARNING:  copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
> > WARNING:  copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored
> 
> I see no problem here. The only standard conversion map I could found
> on-line form so far (see below URL) does not include entries 0xf9d6 or
> above.

Sorry, I do not know anything about conversion maps and CNS 11643-1993 planes.
I only got a file in BIG5 encoding from Taiwan and found that it is not possible
to load all of the text into PostgreSQL 7.3.3.
But it is possible to convert it to UTF-8 with the iconv tool from glibc (Linux).
It would be good if the next release supported today's BIG5.
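
The same kind of conversion can be sketched outside the database. The snippet below uses Python rather than iconv itself, and it assumes the file uses the Microsoft CP950 flavour of BIG5, which Python exposes as the `cp950` codec. The function names and file arguments are placeholders for illustration.

```python
from pathlib import Path


def big5_to_utf8(data: bytes, source_encoding: str = "cp950") -> bytes:
    """Re-encode BIG5-family bytes as UTF-8.

    Raises UnicodeDecodeError if a byte sequence has no mapping in the
    chosen source codec.
    """
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    # Hypothetical helper: read the BIG5 file and write a UTF-8 copy that
    # can then be loaded with a UTF-8 client encoding.
    Path(dst).write_bytes(big5_to_utf8(Path(src).read_bytes()))
```

With the data already in UTF-8, the load no longer depends on the server's BIG5 conversion table.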

Michael




Re: again: Bug #943: Server-Encoding from EUC_TW

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
> 
> > > > I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
> > > > Now I upgraded to 7.3.3 and I'm not happy with this.
> > > > The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:
> > > >
> > > > Copy to table (DB has UTF-8 encoding) from file:
> > > > for PGCLIENTENCODING=BIG5:
> > > > WARNING:  copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
> > > > WARNING:  copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> > > > WARNING:  copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
> > > > WARNING:  copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored
> > >
> > > I see no problem here. The only standard conversion map I could found
> > > on-line form so far (see below URL) does not include entries 0xf9d6 or
> > > above.
> > >
> > > http://www.unicode.org/Public/UNIDATA/Unihan.txt
> >
> >
> > I found in this file:
> > U+F9D7 in line 604519
> > U+F9D8 in line 219540
> > U+F9D6...U+F9DB in lines 730707...730766.
> 
> No. U+F9D6 means *Unicode* code point, not BIG5 code point.

Ok.
I have looked into my Linux box and found this in /usr/share/i18n/charmaps/BIG5.gz:
% Chinese charmap for BIG5 (CP950)
% version: 0.92
% Contact: Tung-Han Hsieh   <thhsieh@linux.org.tw>
%          Yuan-Chung Cheng <platin@ms31.hinet.net>
% Distribution and use is free, even for comercial purpose.
%
% This charmap is converted from:
%     ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
% ...

"My" characters are in there.
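
A quick way to see the difference between the two tables is to try both of Python's codecs on the byte pairs from the warnings. This is only a sketch: it shows what Python's `big5` and `cp950` codecs do, which is an assumption about Python's tables and says nothing about PostgreSQL's own conversion map. The helper `try_decode` is made up for illustration.

```python
def try_decode(raw: bytes, codec: str):
    """Return the decoded text, or None if the codec has no mapping."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# The BIG5 byte pairs from the COPY warnings.
for hexbytes in ("f9d6", "f9d7", "f9d8", "f9db"):
    raw = bytes.fromhex(hexbytes)
    for codec in ("big5", "cp950"):
        char = try_decode(raw, codec)
        result = f"U+{ord(char):04X}" if char else "no mapping"
        print(f"0x{hexbytes} via {codec}: {result}")
```

In the CP950 table these four byte pairs map to the same code points that the UTF-8 sequences in the copy-out warnings decode to.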

Don't you agree that it is strange that I can (for EUC_TW) copy "to" a file without error
but I cannot copy "from" a file without error?

Michael



Re: again: Bug #943: Server-Encoding from EUC_TW

From
Tatsuo Ishii
Date:
> I have looked into my Linux box and found this in /usr/share/i18n/charmaps/BIG5.gz:
> % Chinese charmap for BIG5 (CP950)
> % version: 0.92
> % Contact: Tung-Han Hsieh   <thhsieh@linux.org.tw>
> %          Yuan-Chung Cheng <platin@ms31.hinet.net>
> % Distribution and use is free, even for comercial purpose.
> %
> % This charmap is converted from:
> %     ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> % ...
> 
> There "my" characters are in.

That's Microsoft's definition, not a standard. I think there must be a
reason why the Unicode org. does not use it.

> Don't you agree that it is strange that I can (for EUC_TW) copy "to" file without error
> but I can not copy "from" file without error?

I'm not quite sure what you are saying. Are you complaining that (for
example) 0xe7a281 in UTF-8 does not convert to EUC_TW?

BTW, what do you think about the proposal below?

FYI, CNS 11643-1993 is the standard character set and EUC_TW is
one of its encodings. That means that, with the remaining planes
supported, your problem below would disappear.

> > > > WARNING:  copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
> > > > WARNING:  copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
> > > > WARNING:  copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored

> > > > Hum. These seem to be CNS 11643-1993, plane 3. Currently PostgreSQL
> > > > supports only:
> > > >
> > > > CNS 11643-1993, plane 0
> > > > CNS 11643-1993, plane 1
> > > > CNS 11643-1993, plane 2
> > > > CNS 11643-1993, plane 15
> > > >
> > > > Would you like to have support for rest of CNS 11643-1993 planes:
> > > >
> > > > CNS 11643-1993, plane 3
> > > > CNS 11643-1993, plane 4
> > > > CNS 11643-1993, plane 5
> > > > CNS 11643-1993, plane 6
> > > > CNS 11643-1993, plane 7
> > > >
> > > > support for upcoming 7.4?
--
Tatsuo Ishii


Re: again: Bug #943: Server-Encoding from

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
> 
> > I have looked into my Linux box and found this in /usr/share/i18n/charmaps/BIG5.gz:
> > % Chinese charmap for BIG5 (CP950)
> > % version: 0.92
> > % Contact: Tung-Han Hsieh   <thhsieh@linux.org.tw>
> > %          Yuan-Chung Cheng <platin@ms31.hinet.net>
> > % Distribution and use is free, even for comercial purpose.
> > %
> > % This charmap is converted from:
> > %     ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> > % ...
> >
> > There "my" characters are in.
> 
> That's a M$'s definition, not a standard. I think there should be a
> reason why the Unicode org. does not use it.

OK, I do not know the reason. But since glibc also uses it, couldn't you use it too?
I believe the glibc developers have thought about this a lot and came to the
conclusion to use this definition. Why not PostgreSQL?

> > Don't you agree that it is strange that I can (for EUC_TW) copy "to" file without error
> > but I can not copy "from" file without error?
> 
> I'm not quite sure what you are saying. Are you complaining that (for
> example) 0xe7a281 in UTF-8 does not convert to EUC_TW?

Yes, exactly, since this value comes from a "copy to" with PGCLIENTENCODING=EUC_TW.

> 
> BTW, what do you think about below?
> 
> FYI, CNS 11643-1993 is the standard character set and EUC_TW is the
> one of the encodings. That means your problem below will disappear.

Ok.

Regards,
Michael
