Thread: Re: [BUGS] Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't

Re: [BUGS] Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't

From
Tatsuo Ishii
Date:
It turned out to be a bug in the encoding conversion engine of
PostgreSQL. It failed to find the proper entry in an encoding
conversion table because of an integer overflow problem. Since only
the maps for EUC_TW have such huge code point values (for example
0x8eaee7aa), I believe the conversion failure occurs only with this
particular encoding. The included patches should solve the problem
(they are against PostgreSQL 7.3.2).
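
To illustrate the overflow, here is a minimal sketch. It is not the actual conv.c code: the function names are made up for illustration, and it assumes a 32-bit unsigned int. It compares the code from the reported error message (0xe5b5) with the large EUC_TW code mentioned above, once by subtraction and once by ordering.

```c
#include <assert.h>

/* Comparison as before the patch: the unsigned difference becomes the int result. */
static int compare_by_subtraction(unsigned int v1, unsigned int v2)
{
    return (int) (v1 - v2);
}

/* Comparison as in the patch: only the ordering decides the result. */
static int compare_by_ordering(unsigned int v1, unsigned int v2)
{
    return (v1 > v2) ? 1 : ((v1 == v2) ? 0 : -1);
}

/* Hypothetical helper: checks both comparisons on values from this thread. */
static void check_comparisons(void)
{
    unsigned int key = 0xe5b5U;        /* code from the reported error message */
    unsigned int entry = 0x8eaee7aaU;  /* large EUC_TW code mentioned above */

    /* key < entry, yet the wrapped difference (0x7151fe0b) is positive */
    assert(compare_by_subtraction(key, entry) > 0);

    /* the ordering-based version correctly reports key < entry */
    assert(compare_by_ordering(key, entry) < 0);
}
```

Calling check_comparisons() from a main() exercises both cases. With the subtraction version, a binary search would move in the wrong direction at such an entry and miss the key.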

BTW, I'm surprised to find this bug, since it has been there since the
7.2 days.

I'm going to commit the fix to both current and 7.3-stable trees.
--
Tatsuo Ishii

> Short Description
> Server-Encoding from EUC_TW to UTF-8 doesn't work
> 
> Long Description
> System: SuSE Linux 8.1, kernel 2.4.19, glibc 2.2.5/glibc-locale 2.2.5
> the same error on RedHat 7.3, kernel 2.4.20, glibc2.2.5
> postgresql version 7.3.2
> description: I loaded Chinese (TW) characters, encoded as UTF-8, into a
> database which has UTF-8 encoding with "copy table from 'original'" in psql. Ok.
> Then I exited psql and exported PGCLIENTENCODING=EUC_TW.
> I started psql and ran "copy table to 'file.EUC_TW'". Ok.
> If I convert this file to UTF-8 with iconv -f EUC-TW -t UTF-8 file.EUC_TW file.UTF-8
> then file.UTF-8 looks exactly the same as the original.
> That means PostgreSQL converts from UTF-8 to EUC_TW correctly.
> Now I load the exported file 'file.EUC_TW' back into the DB:
> "copy table2 from 'file.EUC_TW'"; I had not quit psql, so
> PGCLIENTENCODING is the same as for "copy to".
> Now I get an error telling me: "copy: line 1,  LocalToUtf: could not convert (0xe5b5) EUC_TW to UTF-8" ... and the
> characters are missing in table2
> 
> Sample Code
> UTF-8:
> 00000000: e795 b6e6 97a5 0ae5 959f e58b 95e4 b8ad
> 00000010: 2ce4 bd86 e69c 89e9 8caf e8aa a40a
> 
> EUC_TW as exported from PostgreSQL and not imported:
> 00000000: e5b5 c5ca 0ada f6d9 afc4 e32c c8fe c8b4
> 00000010: f2e3 eba8 0a

*** src/backend/utils/mb/conv.c.orig    2003-04-12 10:03:25.000000000 +0900
--- src/backend/utils/mb/conv.c    2003-04-12 10:16:04.000000000 +0900
***************
*** 313,319 ****

        v1 = *(unsigned int *) p1;
        v2 = ((pg_utf_to_local *) p2)->utf;
!       return (v1 - v2);
  }

  /*
--- 313,319 ----

        v1 = *(unsigned int *) p1;
        v2 = ((pg_utf_to_local *) p2)->utf;
!       return (v1 > v2)?1:((v1 == v2)?0:-1);
  }

  /*
***************
*** 328,334 ****

        v1 = *(unsigned int *) p1;
        v2 = ((pg_local_to_utf *) p2)->code;
!       return (v1 - v2);
  }

  /*
--- 328,334 ----

        v1 = *(unsigned int *) p1;
        v2 = ((pg_local_to_utf *) p2)->code;
!       return (v1 > v2)?1:((v1 == v2)?0:-1);
  }

  /*
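
For readers who want to try the patched comparison outside the backend, the following is a rough sketch of a table lookup with the standard bsearch(). Everything named demo_* is hypothetical, the struct is only a stand-in for the real map entry type, and the utf values are placeholders rather than real mappings. It assumes a 32-bit unsigned int and a table sorted by code in ascending unsigned order.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a local-to-UTF map entry; not the real struct. */
typedef struct
{
    unsigned int code;  /* local (e.g. EUC_TW) code */
    unsigned int utf;   /* placeholder value, not a real mapping */
} demo_map_entry;

/* Comparator in the style of the patched code: returns only the ordering. */
static int demo_compare(const void *p1, const void *p2)
{
    unsigned int v1 = *(const unsigned int *) p1;
    unsigned int v2 = ((const demo_map_entry *) p2)->code;

    return (v1 > v2) ? 1 : ((v1 == v2) ? 0 : -1);
}

/* Sorted by code; small and very large codes live in the same table. */
static const demo_map_entry demo_map[] = {
    {0x0000a1a1U, 1},
    {0x0000e5b5U, 2},
    {0x8ea1a1a1U, 3},
    {0x8ea2a1a1U, 4},
    {0x8eaee7aaU, 5},
};

/* Returns the matching entry, or NULL if the code is not in the table. */
static const demo_map_entry *demo_lookup(unsigned int code)
{
    return bsearch(&code, demo_map,
                   sizeof(demo_map) / sizeof(demo_map[0]),
                   sizeof(demo_map[0]), demo_compare);
}

/* Hypothetical self-check: both a small and a large code are found. */
static void demo_self_check(void)
{
    assert(demo_lookup(0xe5b5U) != NULL);
    assert(demo_lookup(0x8eaee7aaU) != NULL);
}
```

Because the comparator never subtracts, the result does not depend on how far apart the two codes are, so entries on both sides of the large codes stay reachable.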



Re: [BUGS] Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't work

From
"Enke, Michael"
Date:
I also tried BIG5 encoded data (Trad. Chinese for Taiwan) and got warnings:
WARNING:  copy: line 4586, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
...
Is this also solved with this fix?

Michael





Re: [BUGS] Bug #943: Server-Encoding from EUC_TW to

From
Tatsuo Ishii
Date:
> I also tried BIG5 encoded data (Trad. Chinese for Taiwan) and got warnings:
> WARNING:  copy: line 4586, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
> ...
> Is this also solved with this fix?

No. In your case it seems 0xf9d7 is not valid BIG5 data, since
there's no corresponding Unicode data for it.
--
Tatsuo Ishii