Re: Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport
Date
Msg-id 200011160601.BAA02070@candle.pha.pa.us
In response to Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport  (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses Re: Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport
List pgsql-hackers
Can someone tell me where we are on this?  Tatsuo, I think you said you
wanted to apply this fix.


> [Cced to hackers list]
> 
> > > BTW I have found another bug in the EUC_TW support. Line 917 in conv.c:
> > >
> > >                         *p++ = c1 - LC_CNS11643_3 + 0xa3;
> > >
> > > this should be:
> > >
> > >                         *p++ = *mic++ - LC_CNS11643_3 + 0xa3;
> > >
> > > Otherwise, CNS 11643-1992 Plane 3 or more won't work. Could you test
> > > it out with CNS 11643-1992 Plane 3 or more?
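[Editorial sketch, not part of the original message.] The fix above can be illustrated with a small standalone function. It assumes that a MIC character for Plane 3 or higher consists of a prefix byte, then a charset ID byte, then two code bytes, and that EUC_TW expects SS2 followed by a plane byte starting at 0xa1 for Plane 1. The constant values and the helper name `mic_plane3plus_to_euc_tw` are assumptions for illustration and should be checked against the actual mb headers; this is not the code in conv.c.

```c
#include <stddef.h>

/* Assumed values, not taken from this thread: verify against the
 * PostgreSQL multibyte headers before relying on them. */
#define SS2            0x8e  /* EUC single-shift 2 */
#define LCPRV2_B       0x9d  /* assumed MIC prefix for private 2-byte charsets */
#define LC_CNS11643_3  0xf6  /* assumed MIC charset ID for CNS 11643 Plane 3 */

/*
 * Hypothetical helper: convert one MIC character that starts with the
 * LCPRV2_B prefix into EUC_TW. "mic" points at the prefix byte and "out"
 * must have room for 4 bytes. Returns the number of bytes written, or 0
 * if the input does not start with the prefix.
 */
static size_t
mic_plane3plus_to_euc_tw(const unsigned char *mic, unsigned char *out)
{
    unsigned char  c1 = *mic++;   /* prefix byte, already consumed */
    unsigned char *p = out;

    if (c1 != LCPRV2_B)
        return 0;

    *p++ = SS2;
    /* The plane byte comes from the charset ID that follows the prefix,
     * not from c1 itself; using c1 here is the bug discussed above. */
    *p++ = (unsigned char) (*mic++ - LC_CNS11643_3 + 0xa3);
    *p++ = *mic++;                /* first code byte  */
    *p++ = *mic++;                /* second code byte */

    return (size_t) (p - out);
}
```

With the buggy form, the plane byte would be computed from the prefix, so every Plane 3 or higher character would get the same wrong second byte regardless of its plane.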
> > 
> > Thanks for your very quick reply!!
> 
> You are welcome.
> 
> > I think you are right, but I have not tested it, because the original
> > Big5 encoding does not contain characters in CNS 11643-1992 Plane 3.
> > But I will have a chance to test it: we are seeking support for Big5E
> > (an extended Big5 encoding) in PostgreSQL, though most people who use
> > PostgreSQL in Taiwan only care about the Big5 encoding.
> > 
> > Would you like to answer some mb related questions for me? I am a newbie :P
> > 
> > 1.) Because the 2nd byte of Big5 encoding overlaps with ASCII,
> >     such as '\' (this is very bad for many programs to work with Big5).
> 
> As long as the frontend side knows that the current client side encoding
> is Big5, this should be no problem, at least for libpq. It examines the
> first byte of each Big5 character. If it is greater than 0x7f, then it
> must be a double byte Hanji. So libpq reads 2 bytes in this case, no
> matter whether the second byte is '\'.
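[Editorial sketch, not part of the original message.] The lead-byte rule described above can be shown with a short scan loop. The helper name `big5_count_ascii_backslashes` is hypothetical and this is not libpq's code; it only assumes what the paragraph states, namely that any byte greater than 0x7f starts a two-byte character whose second byte must be skipped without interpretation.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count backslashes that are genuine ASCII
 * characters in a Big5 byte string, skipping the second byte of every
 * double-byte character even when that byte happens to be 0x5c ('\').
 */
static size_t
big5_count_ascii_backslashes(const unsigned char *s, size_t len)
{
    size_t i = 0;
    size_t count = 0;

    while (i < len)
    {
        if (s[i] > 0x7f)
        {
            /* Lead byte: consume it together with its trail byte. */
            i += 2;
            continue;
        }
        if (s[i] == '\\')
            count++;
        i++;
    }
    return count;
}
```

A scanner that looks at bytes one at a time, without this rule, would treat the trail byte 0x5c of a sequence such as 0xa5 0x5c as an escape character, which matches the symptom described in the question.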
> 
> >     For example: if we initdb -E MULE_INTERNAL first,
> >     SET CLIENT_ENCODING TO 'BIG5', and
> >     INSERT INTO some_table VALUES (..., 'the last byte of some Big5 char is backslash\',...),
> >     then we cannot successfully complete this SQL INSERT -- the prompt of psql
> >     changes but psql does not execute it. If we initdb -E with EUC_TW, it's OK.
> >     Is this a parsing problem? What's your suggestion for the solution?
> 
> Hmm. initdb -E MULE_INTERNAL should work as well. Let me dig into the
> problem. It would be nice if you could send me the Big5 data for
> testing by private mail.
> 
> BTW I would not recommend "SET CLIENT_ENCODING TO 'BIG5'" for making
> on-the-fly encoding changes, since this way the frontend side has no
> idea what the client encoding is. 7.0.x overcomes this problem by
> introducing a new \encoding command. For 6.5 or earlier I would
> recommend using the PGCLIENTENCODING environment variable.
> 
> > 2.) Is using MULE_INTERNAL faster than EUC_TW as backend encoding when
> >      PostgreSQL processes Big5 data? (It seems
> >      BIG5->big52mic()->mic2euc_tw()->EUC_TW needs 2 code conversion steps,
> >      but BIG5->big52mic()->EUC_TW only needs one, judging from the mb sources)
> 
> Yes. But the difference would be very small. The expensive part is a
> table look-up in big52mic.
> 
> BTW 7.1 will support automatic encoding conversion between Unicode
> (UTF-8) and Big5 (or EUC_TW). Try the snapshot if you like.
> 
> > 3.) Does PostgreSQL's ODBC driver support mb?
> 
> I don't think so. For Japanese (EUC_JP/SJIS) Kataoka has made patches
> to enable MB support in ODBC. It should not be very difficult to
> support EUC_TW/Big5, though I am not sure.
> --
> Tatsuo Ishii
> 


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 

