Thread: Second byte of multibyte characters causing trouble

From: "Karen Ellrick"
I am using Perl CGI scripts with DBI to take data from a web interface and
from text files to put into my database, and I'm dealing with Japanese (i.e.
two-byte characters).  PostgreSQL is installed with multibyte enabled, but
somewhere in the communication chain from Perl to DBI to PostgreSQL,
something is trying to interpret multibyte text byte by byte, which is
causing trouble.  The one example discovered so far is that if the second
of the two bytes is 0x5c (ASCII "\"), it gets swallowed, and a ripple
effect through the following byte pairs ensues (at least if the byte after
the 0x5c isn't a valid character to follow \ in an escape sequence - if it
is, who knows what will happen!).  I fixed that one by replacing any \ in the
strings with "\\" to get a literal 0x5C byte past whatever is trying to
interpret it.  But I am wondering what other similar pitfalls I have to
watch out for, and I'm hoping others have ideas.  For example, is my SQL
insert or update statement going to choke if the second byte of one of the
characters is the same as ASCII for a single quote?  The possibilities are
endless, depending on what part of the process is doing the damage.  And
trying to test this stuff is like looking for a needle in a haystack - it's
not easy to figure out what Japanese characters have second bytes that would
have special meaning if interpreted as ASCII.
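For concreteness, the band-aid looks roughly like this (a minimal sketch -
$dbh is my DBI handle, $value is one of the incoming strings, and the
table and column names are invented):

# Double any backslash so a 0x5c second byte survives whatever is
# interpreting the string on its way to PostgreSQL.
(my $escaped = $value) =~ s/\\/\\\\/g;
$dbh->do("INSERT INTO customers (name) VALUES ('$escaped')");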

If someone knows how to set things up so that all text is guaranteed to go
through unscathed (make Perl or DBI multi-byte aware, or whatever - i.e. the
real fix), that would be ideal.  Otherwise, at least some ideas would be
welcome regarding what other bytes to write band-aid code for.  I know I'm
not the only one trying to use Perl to maintain PostgreSQL databases with
Japanese or Chinese text! :-)

Thanks in advance,
Karen

--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima  733-0834  Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------


Re: Second byte of multibyte characters causing trouble

From: David Emery
The usual way to deal with this is to convert the Japanese text from
Shift-JIS (which will almost always cause problems) to either EUC-JP or
UTF-8 before inserting it into the DB or otherwise messing with it. You
can then convert it back to Shift-JIS before sending it to the client. For
Perl there are some scripts/modules that make encoding conversion fairly
painless. Check CPAN for the Jcode.pm module and the jcode.pl scripts. I
think there may be some others available now too, so it may be worth
searching around a bit.
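For instance, something along these lines should round-trip a Shift-JIS
form value through an EUC-JP database (an untested sketch - the database
name, table, column, and form-field names are all placeholders):

use CGI;
use DBI;
use Jcode;

my $cgi = CGI->new;
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 });

# Shift-JIS from the browser -> EUC-JP before it goes near the DB.
my $euc = Jcode->new($cgi->param('body'), 'sjis')->euc;
$dbh->do('INSERT INTO memos (body) VALUES (?)', undef, $euc);

# EUC-JP out of the DB -> Shift-JIS on the way back to the client.
my ($body) = $dbh->selectrow_array(
    'SELECT body FROM memos ORDER BY id DESC LIMIT 1');
print $cgi->header('text/html; charset=Shift_JIS'),
      Jcode->new($body, 'euc')->sjis;

As a side benefit, the DBI placeholder (the ?) leaves quoting and escaping
to the driver, which should also take care of the stray-0x5c problem.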

Gambatte,
-dave



Re: Second byte of multibyte characters causing trouble

From: "Karen Ellrick"
> The usual way to deal with this is to convert the Japanese text from
> Shift-JIS (which will almost always cause problems) to either EUC-JP or
> UTF-8 before inserting it into the DB or otherwise messing with it. You
> can then convert it back to Shift-JIS before sending it to the client.

After reading this, I started thinking in terms of character sets and dug a
little more, and lo and behold, I discovered that our installation of
PostgreSQL was configured with "--enable-multibyte=EUC_JP".  No wonder I'm
having problems!  Okay, I'm convinced.

Now first I have to convert my existing data, which although sitting in a
database that expects EUC, is actually SJIS-based text.  I found the
following series of bash commands in a Japanese mailing list archive - does
it look like this will work for me?  (It looks scary to just drop the whole
database and hope that the .out file knows how to rebuild it with all the
indexes, sequences, users, etc. in place - should I be nervous?)
$ pg_dump -D dbname > db.out
$ dropdb dbname
$ createdb -E EUC_JP dbname
$ export PGCLIENTENCODING=SJIS
$ psql dbname < db.out
$ export PGCLIENTENCODING=EUC_JP

Regarding the user interface end, when I read the suggested solution of
using jcode to convert everything in and out of the database, I thought,
"That's tedious!  Why not just use EUC on the web pages, and the whole
system will be in sync?"  But that seems to be almost as tedious.  The
Windows-based editor I normally use to input the Japanese text portions of
my code (I do most of the work in vi on my Linux box, but I can't input the
Japanese that way) reads and writes in Shift-JIS unless I use pre- and
post-processing filters, and it seems that other Windows programs also favor
Shift-JIS.  I did a totally unofficial, very-small-data-sample survey of
Japanese web sites, and it seems that in general, sites that deal with
ordinary consumers (and likely are written on Microsoft machines) use
Shift-JIS (even ones that I figure must use databases, like search engines
and e-commerce), Linux-related sites use JIS, and PostgreSQL-related sites
use EUC.  I'm sure there's a grand story to explain how it got to be this
messy, but for right now, I guess we have to live with all these different
systems - apparently there is not one system that works nicely for all
things, or else the others would gradually become obsolete, right?

Before I add jcode function calls for every piece of data I get in or out of
the database, or convert all my web page text to EUC-JP (I haven't decided
yet which approach is more work, or more of a problem to maintain), are
there any other thoughts on this?  For example, does someone know of one of
the following: (a) a way to get the text-only console of a RedHat 6.1J box
to actually display Japanese characters (if so, I not only wouldn't have to
deal with the Windows box for editing, I could even read the output of
queries in psql!), or (b) a text editor for Windows that can be configured
to default to EUC, rather than having to remember to always select a filter
to convert to and from Shift-JIS?  Or on the flip side of the discussion,
can anyone imagine pitfalls associated with having a web site that is half
EUC (the PHP and Perl files that deal with the database) and half Shift-JIS
(the static HTML pages that are written by other people in who-knows-what
Windows-based tools)?

Thanks,
--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima  733-0834  Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------


Re: Second byte of multibyte characters causing trouble

From: Tatsuo Ishii
> Now first I have to convert my existing data, which although sitting in a
> database that expects EUC, is actually SJIS-based text.  I found the
> following series of bash commands in a Japanese mailing list archive - does
> it look like this will work for me?  (It looks scary to just drop the whole
> database and hope that the .out file knows how to rebuild it with all the
> indexes, sequences, users, etc. in place - should I be nervous?)
> $ pg_dump -D dbname > db.out
> $ dropdb dbname
> $ createdb -E EUC_JP dbname
> $ export PGCLIENTENCODING=SJIS
> $ psql dbname < db.out
> $ export PGCLIENTENCODING=EUC_JP

Yes, the above procedure should convert your (accidentally) SJIS-based
database to an EUC_JP database.
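
The trick is that the dump file still contains Shift-JIS bytes, so
restoring it with PGCLIENTENCODING=SJIS makes the backend convert
everything to EUC_JP as it loads. Afterwards you can verify the result,
and your Perl scripts can set the client encoding per connection instead
of relying on the environment variable. A sketch (assuming DBD::Pg; the
database name is a placeholder):

use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=dbname', '', '', { RaiseError => 1 });

# Check that the database really is EUC_JP now.
my ($enc) = $dbh->selectrow_array('SELECT getdatabaseencoding()');
print "database encoding: $enc\n";

# Have the backend convert to/from Shift-JIS for this connection only.
$dbh->do("SET CLIENT_ENCODING TO 'SJIS'");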

> Regarding the user interface end, when I read the suggested solution of
> using jcode to convert everything in and out of the database, I thought,
> "That's tedious!  Why not just use EUC on the web pages, and the whole
> system will be in sync?"  But that seems to be almost as tedious.  The
> Windows-based editor I normally use to input the Japanese text portions of
> my code (I do most of the work in vi on my Linux box, but I can't input the
> Japanese that way)

You can't input Japanese using vi? Why?

> reads and writes in Shift-JIS unless I use pre- and
> post-processing filters, and it seems that other Windows programs also favor
> Shift-JIS.

Why not emacs? It can read and write SJIS text directly.

> I did a totally unofficial, very-small-data-sample survey of
> Japanese web sites, and it seems that in general, sites that deal with
> ordinary consumers (and likely are written on Microsoft machines) use
> Shift-JIS (even ones that I figure must use databases, like search engines
> and e-commerce), Linux-related sites use JIS, and PostgreSQL-related sites
> use EUC.  I'm sure there's a grand story to explain how it got to be this
> messy, but for right now, I guess we have to live with all these different
> systems - apparently there is not one system that works nicely for all
> things, or else the others would gradually become obsolete, right?
>
> Before I add jcode function calls for every piece of data I get in or out of
> the database, or convert all my web page text to EUC-JP (I haven't decided
> yet which approach is more work, or more of a problem to maintain), are
> there any other thoughts on this?  For example, does someone know of one of
> the following: (a) a way to get the text-only console of a RedHat 6.1J box
> to actually display Japanese characters (if so, I not only wouldn't have to
> deal with the Windows box for editing, I could even read the output of
> queries in psql!),

Use "kon" command.

> or (b) a text editor for Windows that can be configured
> to default to EUC, rather than having to remember to always select a filter
> to convert to and from Shift-JIS?

Again, why not emacs?

> Or on the flip side of the discussion,
> can anyone imagine pitfalls associated with having a web site that is half
> EUC (the PHP and Perl files that deal with the database) and half Shift-JIS
> (the static HTML pages that are written by other people in who-knows-what
> Windows-based tools)?

Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
higher. It supports Japanese very well: it automatically guesses the input
charset and does the necessary conversion. This is very helpful. Also, I
recommend that you always use EUC-JP when writing PHP scripts.

Assuming you can read/write Japanese, I recommend you subscribe to the
PHP-users list (http://ns1.php.gr.jp/mailman/listinfo/php-users).
--
Tatsuo Ishii

Re: Second byte of multibyte characters causing trouble

From: "Karen Ellrick"
> Use "kon" command.
This was a wonderful tip - thank you, Ishii-san!  I didn't know about the
command, and it seems to be trying to do what it is designed to do.  It
doesn't display Shift-JIS correctly, but it does work for EUC.  Since I seem
to be moving in the direction of converting everything to EUC, that should
be okay.  But in vi, how do I input Japanese?  Is there a key combination
that does what IME does in Windoze?

I also noticed that vi is still not aware of the multi-byte characters - for
example, when moving around in the text, I have to type h or l twice to get
to the next character, and if I want to copy or delete characters I have to
pretend that there are twice as many.  Typing "x" just once (or an odd
number of times) is really entertaining - all the characters in a whole word
or sentence change to obscure kanji as kon tries to process the second byte
of one character and the first byte of the next as a character.  Is there a
way to make vi aware of multibyte characters?  (This is not an absolute
necessity, but would help.)

> Again, why not emacs?
I had never used it - in fact, it wasn't even installed on my system.  After
you seemed to be recommending it, I installed the "no X" version (I don't
have any graphical interfaces on these machines) and invoked it once to see
what it is like, but it looks like it would be miserable to learn to use
without a mouse, if it would work at all for some features (I had to dig
real deep in the docs to figure out key commands - they constantly refer to
the mouse).  It would be no problem to add a mouse to the server that
resides at my desk (well, maybe a bit of a desk space shortage...), but much
of my work is done through ssh to two other servers, and I doubt a mouse
would work in that environment - am I wrong?  (Zero experience with mice in
Linux!)

> Assuming you can read/write Japanese, I recommend you subscribe to the
> PHP-users list (http://ns1.php.gr.jp/mailman/listinfo/php-users).
I do read and write Japanese if I work hard enough at it (lots of copy/paste
to/from a software dictionary - I've lived in Japan 5 years), but reading
and contributing to a mailing list in Japanese could consume my whole work
day! :-)  That's why I have been using English lists.  But I know that most
of the people on the English lists (maybe everybody except Ishii-san!) don't
work with Japanese systems and can't answer questions about them, so when I
have future questions of this type, I probably should try the php.gr.jp
list.  Thanks for the link.

> Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
> higher. It supports Japanese very well: it automatically guesses the
> input charset and does the necessary conversion.
Input from where and conversion to what?  Do you mean data typed into forms?
(I had assumed that I control the charset used for form data by the
"charset" variable in the html header, but I haven't tested that theory!)
Or do you mean that if the text in echo statements is in a different charset
than the header (how could it even know?), it will convert it when sending
it out to the browser?  (That would be hard to believe, but wonderful if
it's true!)

I'm still unsure of what to do.  I was just about to take your advice and
switch all my PHP and Perl files to EUC, when I remembered that I have to
consider other people. After I get the PHP/Perl code working, the webmaster
cleans up my grammar and/or changes the wording to the way he wants it, and
he never uses Linux but only Windows-based editors, which as far as I know
all expect Shift-JIS.  Maybe I can train him to always open files with the
EUC->Shift-JIS preprocessor and save them with the Shift-JIS->EUC
postprocessor, but I suspect he won't be happy about it.  But if I can get
answers to the above questions, I may be closer to a decision on which
approach is better, all things considered.

Regards,
Karen

--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima  733-0834  Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------


Re: Second byte of multibyte characters causing trouble

From: Tatsuo Ishii
> > Use "kon" command.
> This was a wonderful tip - thank you, Ishii-san!  I didn't know about the
> command, and it seems to be trying to do what it is designed to do.  It
> doesn't display Shift-JIS correctly, but it does work for EUC.  Since I seem
> to be moving in the direction of converting everything to EUC, that should
> be okay.  But in vi, how do I input Japanese?  Is there a key combination
> that does what IME does in Windoze?

I've heard that there is a vi clone called "nvi-canna" that has an IME
called "canna". In nvi-canna, "set canna" + CTRL-O should initiate the
Japanese input mode.

> I also noticed that vi is still not aware of the multi-byte characters - for
> example, when moving around in the text, I have to type h or l twice to get
> to the next character, and if I want to copy or delete characters I have to
> pretend that there are twice as many.  Typing "x" just once (or an odd
> number of times) is really entertaining - all the characters in a whole word
> or sentence change to obscure kanji as kon tries to process the second byte
> of one character and the first byte of the next as a character.  Is there a
> way to make vi aware of multibyte characters?  (This is not an absolute
> necessity, but would help.)

Hmm, maybe a locale problem? In my case (I'm using Vine Linux) the correct
locale is "ja_JP.eucJP", but it may be different on RH6.1J - e.g. try
"export LANG=ja_JP.eucJP" before starting kon and vi.

> > Again, why not emacs?
> I had never used it - in fact, it wasn't even installed on my system.  After
> you seemed to be recommending it, I installed the "no X" version (I don't
> have any graphical interfaces on these machines) and invoked it once to see
> what it is like, but it looks like it would be miserable to learn to use
> without a mouse, if it would work at all for some features (I had to dig
> real deep in the docs to figure out key commands - they constantly refer to
> the mouse).  It would be no problem to add a mouse to the server that
> resides at my desk (well, maybe a bit of a desk space shortage...), but much
> of my work is done through ssh to two other servers, and I doubt a mouse
> would work in that environment - am I wrong?  (Zero experience with mice in
> Linux!)

I've never used a mouse with Emacs, even in the X environment.

> I do read and write Japanese if I work hard enough at it (lots of copy/paste
> to/from a software dictionary - I've lived in Japan 5 years), but reading
> and contributing to a mailing list in Japanese could consume my whole work
> day! :-)

I understand. The same can be said for me with English :-)

> > Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
> > higher. It supports Japanese very well: it automatically guesses the
> > input charset and does the necessary conversion.
> Input from where and conversion to what?  Do you mean data typed into forms?

Yes.

It seems most browsers use the same encoding as the page itself (you can
check a page's charset via the "properties" or similar menu in the
browser). PHP 4.0.6 is clever enough to automatically guess the charset of
data put into a form and to convert between that encoding and EUC-JP (the
recommended internal encoding of PHP).

In PHP 4.0.6's php.ini there are entries that control the encoding
handling:

mbstring.internal_encoding = EUC-JP
mbstring.http_input = auto
mbstring.http_output = SJIS

These read as:

o the internal encoding in PHP is EUC-JP (this is recommended)
o charsets of any input from forms etc. are automatically determined
o charsets for all final pages produced by PHP are SJIS

Interesting?:-)
--
Tatsuo Ishii