Thread: A Patch for MIC to EUC_TW code converting in mb support

A Patch for MIC to EUC_TW code converting in mb support

From
Chih-Chang Hsieh
Date:
============================================================================

POSTGRESQL BUG REPORT: MIC to EUC_TW code converting in mb support
============================================================================

System Configuration
---------------------
  Architecture (example: Intel Pentium)         :x86
  Operating System (example: Linux 2.0.26 ELF)  :Linux 2.2.x and FreeBSD 3.5R
  PostgreSQL version (example: PostgreSQL-7.0)  :PostgreSQL-7.0.2
  Compiler used (example:  gcc 2.8.0)           :egcs-2.91.66, gcc 2.7.3

A FULL description of the problem:
------------------------------------------------
In PostgreSQL mb (multi-byte) support, there is a bug in the code conversion
from MIC to EUC_TW. The original mic2euc_tw() in conv.c converts CNS 11643-1992
Plane 2 characters into the 2-byte EUC_TW encoding, but characters in
CNS 11643-1992 Plane 2 should be converted into the 4-byte EUC_TW encoding
instead.
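
To illustrate the byte layout in question, here is a minimal sketch in C. It
is not the conv.c code: euc_tw_encode() is a hypothetical helper, SS2 is
assumed to be the usual EUC single-shift byte 0x8e, and the plane range 1..7
is an assumption chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define SS2 0x8e                /* EUC single-shift 2 (assumed value) */

/*
 * Hypothetical helper: write the EUC_TW bytes for one CNS 11643 character.
 * plane is the CNS plane number; b1 and b2 are the two code bytes with the
 * high bit already set. Returns the number of bytes written: 2 for Plane 1,
 * 4 for Planes 2..7, and 0 for any other plane.
 */
static size_t
euc_tw_encode(int plane, unsigned char b1, unsigned char b2,
              unsigned char *out)
{
    assert(out != NULL);

    if (plane == 1)
    {
        /* Plane 1 uses the plain 2-byte form. */
        out[0] = b1;
        out[1] = b2;
        return 2;
    }
    if (plane >= 2 && plane <= 7)
    {
        /* Other planes need SS2 plus a plane byte: 0xa2 for Plane 2. */
        out[0] = SS2;
        out[1] = (unsigned char) (0xa0 + plane);
        out[2] = b1;
        out[3] = b2;
        return 4;
    }
    return 0;
}
```

Under these assumptions, euc_tw_encode(2, 0xa1, 0xa1, buf) writes the four
bytes 0x8e 0xa2 0xa1 0xa1, whereas the 2-byte form 0xa1 0xa1 would be read
back as a Plane 1 character.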

A way to repeat the problem:
----------------------------------------------------------------------
When you initdb with -E EUC_TW and set PGCLIENTENCODING to BIG5,
you will find all the characters in CNS 11643-1992 Plane 2 are
incorrectly stored or output.

This problem might be fixed by the solution in the attachment.

*** conv.c.orig    Sat May 20 21:12:26 2000
--- conv.c    Wed Nov  8 22:44:21 2000
***************
*** 906,913 ****
      {
          len -= pg_mic_mblen(mic++);

!         if (c1 == LC_CNS11643_1 || c1 == LC_CNS11643_2)
          {
              *p++ = *mic++;
              *p++ = *mic++;
          }
--- 906,920 ----
      {
          len -= pg_mic_mblen(mic++);

!         if (c1 == LC_CNS11643_1)
          {
+             *p++ = *mic++;
+             *p++ = *mic++;
+         }
+         else if (c1 == LC_CNS11643_2)
+         {
+             *p++ = SS2;
+             *p++ = 0xa2;
              *p++ = *mic++;
              *p++ = *mic++;
          }

Re: A Patch for MIC to EUC_TW code converting in mb support

From
Tatsuo Ishii
Date:
Thanks for pointing it out. Your fix seems correct.

BTW, I have found another bug in the EUC_TW support, at line 917 in conv.c:

            *p++ = c1 - LC_CNS11643_3 + 0xa3;

This should be:

            *p++ = *mic++ - LC_CNS11643_3 + 0xa3;

Otherwise, CNS 11643-1992 Plane 3 and above won't work. Could you test
it with characters from CNS 11643-1992 Plane 3 or above?
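
To make the reasoning concrete, here is a rough sketch of the per-character
logic with both fixes applied. It is a simplified stand-in, not the actual
mic2euc_tw(): mic_char_to_euc_tw() is a hypothetical name, and the constant
values are assumed from pg_wchar.h and should be checked against your tree.
The point is that for Plane 3 and above the leading byte c1 is only the
private-charset prefix (assumed 0x9d), so the plane has to be taken from the
byte that follows it. With those values, c1 - LC_CNS11643_3 + 0xa3 would
give 0x4a, which is not a valid plane byte.

```c
#include <assert.h>
#include <stddef.h>

/* Values assumed from PostgreSQL's pg_wchar.h; verify before relying on them. */
#define SS2            0x8e
#define LC_CNS11643_1  0x95
#define LC_CNS11643_2  0x96
#define LCPRV2_B       0x9d     /* prefix byte for 2-byte private charsets */
#define LC_CNS11643_3  0xf6
#define LC_CNS11643_7  0xfa

/*
 * Hypothetical helper: convert one MIC-encoded CNS 11643 character at mic
 * into EUC_TW bytes in out. Stores the number of input bytes consumed in
 * *used and returns the number of output bytes written. Returns 0 (and sets
 * *used to 0) if the leading bytes are not a CNS 11643 Plane 1..7 character.
 */
static size_t
mic_char_to_euc_tw(const unsigned char *mic, unsigned char *out, size_t *used)
{
    unsigned char c1;

    assert(mic != NULL && out != NULL && used != NULL);
    c1 = mic[0];

    if (c1 == LC_CNS11643_1)
    {
        /* Plane 1: plain 2-byte EUC_TW. */
        out[0] = mic[1];
        out[1] = mic[2];
        *used = 3;
        return 2;
    }
    if (c1 == LC_CNS11643_2)
    {
        /* Plane 2: 4-byte form with plane byte 0xa2. */
        out[0] = SS2;
        out[1] = 0xa2;
        out[2] = mic[1];
        out[3] = mic[2];
        *used = 3;
        return 4;
    }
    if (c1 == LCPRV2_B &&
        mic[1] >= LC_CNS11643_3 && mic[1] <= LC_CNS11643_7)
    {
        /* Planes 3..7: the plane comes from the second MIC byte, not c1. */
        out[0] = SS2;
        out[1] = (unsigned char) (mic[1] - LC_CNS11643_3 + 0xa3);
        out[2] = mic[2];
        out[3] = mic[3];
        *used = 4;
        return 4;
    }
    *used = 0;
    return 0;
}
```

For example, under these assumptions the MIC bytes 0x9d 0xf7 0xa1 0xa2
(a Plane 4 character) come out as 0x8e 0xa4 0xa1 0xa2.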

If they are ok, I will fix the current source and make a patch for
7.0.3 (I guess it's too late to back-patch the 7.0 tree).
--
Tatsuo Ishii

Re: A Patch for MIC to EUC_TW code converting in mb support

From
Bruce Momjian
Date:
Tatsuo, I assume these are all done in 7.1, right?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: A Patch for MIC to EUC_TW code converting in mb support

From
Tatsuo Ishii
Date:
> Tatsuo, I assume these are all done in 7.1, right?

Yes.
--
Tatsuo Ishii
