Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport
Date
Msg-id 20001114145434O.t-ishii@sra.co.jp
Responses Re: Re: [PATCHES] A Patch for MIC to EUC_TW code converting in mbsupport  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
[Cced to hackers list]

> > BTW I have found another bug with EUC_TW support. line 917 in conv.c:
> >
> >                         *p++ = c1 - LC_CNS11643_3 + 0xa3;
> >
> > this should be:
> >
> >                         *p++ = *mic++ - LC_CNS11643_3 + 0xa3;
> >
> > Otherwise, CNS 11643-1992 Plane 3 or more won't work. Could you test
> > it out with CNS 11643-1992 Plane 3 or more?
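
A minimal sketch of the corrected branch, written as a standalone function rather than the actual conv.c code. The function name and the constant values (SS2, LCPRV2_B, LC_CNS11643_3, LC_CNS11643_7) are assumptions recalled from pg_wchar.h and should be checked against the source tree. The point it illustrates is that the EUC_TW plane byte must come from the charset id in the second MULE byte, not from the leading byte held in c1.

```c
#include <stddef.h>

/* Assumed values, as recalled from pg_wchar.h; verify against your tree. */
#define SS2            0x8e  /* EUC single-shift 2 */
#define LCPRV2_B       0x9d  /* MULE leading byte for private 2-byte charsets */
#define LC_CNS11643_3  0xf6  /* charset id: CNS 11643-1992 Plane 3 */
#define LC_CNS11643_7  0xfa  /* charset id: CNS 11643-1992 Plane 7 */

/*
 * Hypothetical helper: convert one MULE internal character of
 * CNS 11643-1992 Plane 3..7 (LCPRV2_B, charset id, b1, b2) to
 * EUC_TW (SS2, plane byte, b1, b2).
 * Returns the number of bytes consumed from mic, or 0 if the input
 * is not such a character.
 */
static size_t
mic_cns_plane3to7_to_euc_tw(const unsigned char *mic, size_t len,
                            unsigned char *out)
{
    if (len < 4 || mic[0] != LCPRV2_B ||
        mic[1] < LC_CNS11643_3 || mic[1] > LC_CNS11643_7)
        return 0;

    out[0] = SS2;
    /* Plane byte is derived from the charset id (second byte), not c1. */
    out[1] = (unsigned char) (mic[1] - LC_CNS11643_3 + 0xa3);
    out[2] = mic[2];
    out[3] = mic[3];
    return 4;
}
```

Under these assumed values, using the leading byte instead (0x9d - 0xf6 + 0xa3) would give 0x4a, which is not a valid EUC_TW plane byte.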
> 
> Thanks for your very quick reply!!

You are welcome.

> I think you are right, but I have not tested it, because the original
> Big5 encoding does not contain characters in CNS 11643-1992 Plane 3.
> But I will have a chance to test it: we are seeking support for Big5E
> (an extended Big5 encoding) in PostgreSQL, though most people who use
> PostgreSQL in Taiwan only care about the Big5 encoding.
> 
> Would you like to answer some mb related questions for me? I am a newbie :P
> 
> 1.) The 2nd byte of a Big5 character can overlap with ASCII characters
>     such as '\' (this is very bad for many programs that work with Big5).

As long as the frontend side knows that the current client encoding is
Big5, this should be no problem, at least for libpq. It examines the
first byte of each Big5 character. If it is greater than 0x7f, it must
be the first byte of a double-byte Hanzi, so libpq reads 2 bytes in
this case, no matter whether the second byte is '\'.
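
The first-byte rule above can be sketched as follows. This is not libpq's actual code; the function names are hypothetical and only illustrate how a scanner that knows the encoding is Big5 can tell a real backslash from the second byte of a double-byte character.

```c
#include <stddef.h>

/*
 * Length in bytes of the Big5 character starting at s: a first byte
 * greater than 0x7f starts a double-byte character.
 */
static size_t
big5_char_len(const unsigned char *s)
{
    return (*s > 0x7f) ? 2 : 1;
}

/*
 * Count backslashes that are real single-byte characters, skipping
 * any 0x5c that is the second byte of a double-byte character.
 */
static size_t
count_real_backslashes(const unsigned char *s, size_t len)
{
    size_t i = 0;
    size_t n = 0;

    while (i < len)
    {
        size_t clen = big5_char_len(s + i);

        if (clen == 1 && s[i] == '\\')
            n++;
        i += clen;              /* step over the whole character */
    }
    return n;
}
```

For example, the Big5 byte pair 0xa5 0x5c is one double-byte character, so a scanner written this way does not treat its second byte as an escape character.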

>     For example: if we initdb -E MULE_INTERNAL first, then
>     SET CLIENT_ENCODING TO 'BIG5', and
>     INSERT INTO some_table VALUES (..., 'the last byte of some Big5 char is backslash\',...),
>     then we cannot complete this SQL INSERT: the psql prompt changes,
>     but psql does not execute it. If we initdb -E EUC_TW, it's OK.
>     Is this a parsing problem? What solution would you suggest?

Hmm. initdb -E MULE_INTERNAL should work as well. Let me dig into the
problem. It would be nice if you could send me the Big5 data for
testing by private mail.

BTW I would not recommend using "SET CLIENT_ENCODING TO 'BIG5'" to
change the encoding on the fly, since that way the frontend side has
no idea what the client encoding is. 7.0.x overcomes this problem by
introducing the new \encoding command. For 6.5 or earlier I would
recommend using the PGCLIENTENCODING environment variable.

> 2.) Is using MULE_INTERNAL faster than EUC_TW as the backend encoding when
>      PostgreSQL processes Big5 data? (It seems that
>      BIG5->big52mic()->mic2euc_tw()->EUC_TW needs 2 code conversion steps,
>      but BIG5->big52mic()->EUC_TW only needs one, judging from the mb sources.)

Yes. But the difference would be very small. The expensive part is a
table look-up in big52mic.

BTW 7.1 will support automatic encoding conversion between Unicode
(UTF-8) and Big5 (or EUC_TW). Try the snapshot if you like.

> 3.) Does PostgreSQL's ODBC driver support mb?

I don't think so. For Japanese (EUC_JP/SJIS), Kataoka has made patches
to enable MB support in ODBC. It should not be very difficult to
support EUC_TW/Big5 as well, though I am not sure.
--
Tatsuo Ishii

