Thread: Second byte of multibyte characters causing trouble
I am using Perl CGI scripts with DBI to take data from a web interface and from text files to put into my database, and I'm dealing with Japanese (i.e. two-byte characters). PostgreSQL is installed with multibyte enabled, but somewhere in the communication chain from Perl to DBI to PostgreSQL, something is trying to interpret multibyte text byte by byte, which is causing trouble. The example that has been discovered so far is that if the second of the two bytes is 0x5c (in ASCII, "\"), it gets swallowed and a ripple effect of byte pairs ensues (at least if the byte after the 0x5c isn't a valid character to follow \ to make a metacharacter - if it is, who knows what will happen!). I fixed that one by replacing any \ in the strings with "\\" to get a literal 0x5c byte past whatever is trying to interpret it. But I am wondering what other similar pitfalls I have to watch out for, and I'm hoping others have ideas. For example, is my SQL insert or update statement going to choke if the second byte of one of the characters is the same as the ASCII code for a single quote? The possibilities are endless, depending on what part of the process is doing the damage. And trying to test this stuff is like looking for a needle in a haystack - it's not easy to figure out which Japanese characters have second bytes that would have special meaning if interpreted as ASCII.

If someone knows how to set things up so that all text is guaranteed to go through unscathed (make Perl or DBI multi-byte aware, or whatever - i.e. the real fix), that would be ideal. Otherwise, at least some ideas would be welcome regarding what other bytes to write band-aid code for. I know I'm not the only one trying to use Perl to maintain PostgreSQL databases with Japanese or Chinese text! :-)

Thanks in advance,
Karen

--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima 733-0834 Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------
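The band-aid escaping Karen describes can be sketched at the byte level (in Python here rather than her Perl; the function name is made up). The key point is that in Shift-JIS the trail byte of some characters is 0x5c, the ASCII backslash, so any escaping has to operate on raw bytes, not on decoded characters:

```python
# Sketch (hypothetical helper, not Karen's actual code): byte-level
# escaping of Shift-JIS data before embedding it in a SQL string literal.
def escape_sjis_for_sql(data: bytes) -> bytes:
    data = data.replace(b"\\", b"\\\\")  # double every 0x5c byte
    data = data.replace(b"'", b"''")     # double single quotes, per SQL
    return data

# 0x83 0x5c is the Shift-JIS encoding of katakana "ソ": its trail byte
# is a bare backslash, the classic troublemaker.
assert escape_sjis_for_sql(b"\x83\x5c") == b"\x83\x5c\x5c"
assert escape_sjis_for_sql(b"it's") == b"it''s"
```

As the replies below argue, though, escaping is treating the symptom; re-encoding the text so that no trail byte collides with an ASCII metacharacter is the more robust fix.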
The usual way to deal with this is to convert the Japanese text from Shift-JIS (which will almost always cause problems) to either EUC-JP or UTF-8 encoding before inserting it into the DB or otherwise messing with it. You can then convert it back to Shift-JIS before sending it to the client.

For Perl there are some scripts/modules to make encoding conversion fairly painless. Check on CPAN for the Jcode.pm module and the jcode.pl scripts. I think there may be some others available now too, so it may be worth searching around a bit.

Gambatte,
-dave

At 15:24 +0900 01.9.18, Karen Ellrick wrote:
>I am using Perl CGI scripts with DBI to take data from a web interface and
>from text files to put into my database, and I'm dealing with Japanese (i.e.
>two-byte characters). PostgreSQL is installed with multibyte enabled, but
>somewhere in the communication chain from Perl to DBI to PostgreSQL,
>something is trying to interpret multibyte text byte by byte, which is
>causing trouble. The example that has been discovered so far is that if the
>second of the two bytes is 0x5c (in ASCII, "\"), it gets swallowed and a
>ripple effect of byte pairs ensues (at least if the byte after the 0x5c
>isn't a valid character to follow \ to make a metacharacter - if it is, who
>knows what will happen!). I fixed that one by replacing any \ in the
>strings with "\\" to get a literal 0x5C byte past whatever is trying to
>interpret it. But I am wondering what other similar pitfalls I have to
>watch out for, and I'm hoping others have ideas. For example, is my SQL
>insert or update statement going to choke if the second byte of one of the
>characters is the same as ASCII for a single quote? The possibilities are
>endless, depending on what part of the process is doing the damage. And
>trying to test this stuff is like looking for a needle in a haystack - it's
>not easy to figure out what Japanese characters have second bytes that would
>have special meaning if interpreted as ASCII.
>
>If someone knows how to set things up so that all text is guaranteed to go
>through unscathed (make Perl or DBI multi-byte aware, or whatever - i.e. the
>real fix), that would be ideal. Otherwise, at least some ideas would be
>welcome regarding what other bytes to write bandaid code for. I know I'm
>not the only one trying to use Perl to maintain PostgreSQL databases with
>Japanese or Chinese text! :-)
>
>Thanks in advance,
>Karen
>
>--------------------------------
>Karen Ellrick
>S & C Technology, Inc.
>1-21-35 Kusatsu-shinmachi
>Hiroshima 733-0834 Japan
>(from U.S. 011-81, from Japan 0) 82-293-2838
>--------------------------------
>
>---------------------------(end of broadcast)---------------------------
>TIP 6: Have you searched our list archives?
>
>http://archives.postgresql.org
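The round trip Dave suggests (Shift-JIS at the client, EUC-JP inside the database) can be sketched as below. Python's codecs stand in here for Jcode.pm, whose exact API isn't shown in the thread:

```python
# Sketch of the conversion round trip: Shift-JIS in from the client,
# EUC-JP stored in the database, Shift-JIS back out to the client.
def sjis_to_euc(data: bytes) -> bytes:
    return data.decode("shift_jis").encode("euc_jp")

def euc_to_sjis(data: bytes) -> bytes:
    return data.decode("euc_jp").encode("shift_jis")

sjis = "日本語".encode("shift_jis")
assert euc_to_sjis(sjis_to_euc(sjis)) == sjis  # lossless round trip
```

In EUC-JP every byte of a multibyte character has its high bit set, so no trail byte can ever look like a backslash or quote to a byte-oriented layer - which is exactly why the conversion makes the 0x5c problem disappear.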
> The usual way to deal with this is to convert the J text from S-JIS (which
> will almost always cause problems) to either EUC-JP or UTF8 encoding
> before inserting it into the DB or otherwise messing with it. You can
> then convert it back to SJIS before sending it to the client.

After reading this, I started thinking in terms of character sets and dug a little more, and lo and behold, I discovered that our installation of PostgreSQL was configured with "--enable-multibyte=EUC_JP". No wonder I'm having problems! Okay, I'm convinced.

Now first I have to convert my existing data, which although sitting in a database that expects EUC, is actually SJIS-based text. I found the following series of bash commands in a Japanese mailing list archive - does it look like this will work for me? (It looks scary to just drop the whole database and hope that the .out file knows how to rebuild it with all the indexes, sequences, users, etc. in place - should I be nervous?)

$ pg_dump -D dbname > db.out
$ dropdb dbname
$ createdb -E EUC_JP dbname
$ export PGCLIENTENCODING=SJIS
$ psql dbname < db.out
$ export PGCLIENTENCODING=EUC_JP

Regarding the user interface end, when I read the suggested solution of using jcode to convert everything in and out of the database, I thought, "That's tedious! Why not just use EUC on the web pages, and the whole system will be in sync?" But that seems to be almost as tedious. The Windows-based editor I normally use to input the Japanese text portions of my code (I do most of the work in vi on my Linux box, but I can't input Japanese that way) reads and writes in Shift-JIS unless I use pre- and post-processing filters, and it seems that other Windows programs also favor Shift-JIS.
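The PGCLIENTENCODING trick above lets the server do the conversion during restore. The same effect can be had by re-encoding the dump file itself before feeding it to psql; a hypothetical sketch (filenames are examples, and it assumes the dump is plain Shift-JIS text with no binary data):

```python
# Sketch: convert a (mistakenly) Shift-JIS pg_dump output file to EUC-JP
# before restoring it, as an alternative to setting PGCLIENTENCODING.
def reencode_sjis_dump(src: str, dst: str) -> None:
    with open(src, "rb") as f:
        text = f.read().decode("shift_jis")   # fails loudly on bad bytes
    with open(dst, "wb") as f:
        f.write(text.encode("euc_jp"))
```

A decode error here would also be a useful early warning that the dump contains bytes that aren't valid Shift-JIS.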
I did a totally unofficial, very-small-data-sample survey of Japanese web sites, and it seems that in general, sites that deal with ordinary consumers (and likely are written on Microsoft machines) use Shift-JIS (even ones that I figure must use databases, like search engines and e-commerce), Linux-related sites use JIS, and PostgreSQL-related sites use EUC. I'm sure there's a grand story to explain how it got to be this messy, but for right now, I guess we have to live with all these different systems - apparently there is not one system that works nicely for all things, or else the others would gradually become obsolete, right?

Before I add jcode function calls for every piece of data I get in or out of the database, or convert all my web page text to EUC-JP (I haven't decided yet which approach is more work, or more of a problem to maintain), are there any other thoughts on this? For example, does someone know of one of the following: (a) a way to get the text-only console of a RedHat 6.1J box to actually display Japanese characters (if so, I not only wouldn't have to deal with the Windows box for editing, I could even read the output of queries in psql!), or (b) a text editor for Windows that can be configured to default to EUC, rather than having to remember to always select a filter to convert to and from Shift-JIS? Or on the flip side of the discussion, can anyone imagine pitfalls associated with having a web site that is half EUC (the PHP and Perl files that deal with the database) and half Shift-JIS (the static HTML pages that are written by other people in who-knows-what Windows-based tools)?

Thanks,

--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima 733-0834 Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------
> Now first I have to convert my existing data, which although sitting in a
> database that expects EUC, is actually SJIS-based text. I found the
> following series of bash commands in a Japanese mailing list archive - does
> it look like this will work for me? (It looks scary to just drop the whole
> database and hope that the .out file knows how to rebuild it with all the
> indexes, sequences, users, etc. in place - should I be nervous?)
> $ pg_dump -D dbname > db.out
> $ dropdb dbname
> $ createdb -E EUC_JP dbname
> $ export PGCLIENTENCODING=SJIS
> $ psql dbname < db.out
> $ export PGCLIENTENCODING=EUC_JP

Yes, the above procedure should convert your (mistakenly) SJIS-based database to an EUC_JP database.

> Regarding the user interface end, when I read the suggested solution of
> using jcode to convert everything in and out of the database, I thought,
> "That's tedious! Why not just use EUC on the web pages, and the whole
> system will be in sync?" But that seems to be almost as tedious. The
> Windows-based editor I normally use to input the Japanese text portions of
> my code (I do most of the work in vi on my Linux box, but I can't input the
> Japanese that way)

You can't input Japanese using vi? Why?

> reads and writes in Shift-JIS unless I use pre- and
> post-processing filters, and it seems that other Windows programs also
> favor Shift-JIS.

Why not emacs? It can read and write SJIS texts directly.

> I did a totally unofficial, very-small-data-sample survey of
> Japanese web sites, and it seems that in general, sites that deal with
> ordinary consumers (and likely are written on Microsoft machines) use
> Shift-JIS (even ones that I figure must use databases, like search engines
> and e-commerce), Linux-related sites use JIS, and PostgreSQL-related sites
> use EUC.
> I'm sure there's a grand story to explain how it got to be this
> messy, but for right now, I guess we have to live with all these different
> systems - apparently there is not one system that works nicely for all
> things, or else the others would gradually become obsolete, right?
>
> Before I add jcode function calls for every piece of data I get in or out of
> the database, or convert all my web page text to EUC-JP (I haven't decided
> yet which approach is more work, or more of a problem to maintain), are
> there any other thoughts on this? For example, does someone know of one of
> the following: (a) a way to get the text-only console of a RedHat 6.1J box
> to actually display Japanese characters (if so, I not only wouldn't have to
> deal with the Windows box for editing, I could even read the output of
> queries in psql!),

Use the "kon" command.

> or (b) a text editor for Windows that can be configured
> to default to EUC, rather than having to remember to always select a filter
> to convert to and from Shift-JIS?

Again, why not emacs?

> Or on the flip side of the discussion,
> can anyone imagine pitfalls associated with having a web site that is half
> EUC (the PHP and Perl files that deal with the database) and half Shift-JIS
> (the static HTML pages that are written by other people in who-knows-what
> Windows-based tools)?

Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or higher. It supports Japanese very well. It automatically guesses the input charset and does the necessary conversion. This is very helpful. Also, I recommend that you always use EUC-JP to write PHP scripts.

Assuming you can read/write Japanese, I recommend you subscribe to the PHP-users list (http://ns1.php.gr.jp/mailman/listinfo/php-users).
--
Tatsuo Ishii
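The automatic charset guessing Tatsuo mentions amounts to trying candidate Japanese encodings in a priority order and keeping the first clean decode. A simplified sketch (real mbstring detection uses more refined heuristics than this):

```python
# Rough sketch of "auto" Japanese charset detection: return the first
# candidate encoding that decodes the input bytes without error.
def guess_japanese_charset(data: bytes,
                           candidates=("iso2022_jp", "euc_jp", "shift_jis")):
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates fit

assert guess_japanese_charset("日本語".encode("euc_jp")) == "euc_jp"
assert guess_japanese_charset("日本語".encode("shift_jis")) == "shift_jis"
```

The ordering matters: ISO-2022-JP is purely 7-bit so it is cheap to rule out first, while some byte strings are valid in both EUC-JP and Shift-JIS, which is why any such detection is ultimately a guess.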
> Use the "kon" command.

This was a wonderful tip - thank you, Ishii-san! I didn't know about the command, and it seems to be trying to do what it is designed to do. It doesn't display Shift-JIS correctly, but it does work for EUC. Since I seem to be moving in the direction of converting everything to EUC, that should be okay. But in vi, how do I input Japanese? Is there a key combination that does what IME does in Windoze?

I also noticed that vi is still not aware of the multi-byte characters - for example, when moving around in the text, I have to type h or l twice to get to the next character, and if I want to copy or delete characters I have to pretend that there are twice as many. Typing "x" just once (or an odd number of times) is really entertaining - all the characters in a whole word or sentence change to obscure kanji as kon tries to process the second byte of one character and the first byte of the next as a character. Is there a way to make vi aware of multibyte characters? (This is not an absolute necessity, but would help.)

> Again, why not emacs?

I had never used it - in fact, it wasn't even installed on my system. After you seemed to be recommending it, I installed the "no X" version (I don't have any graphical interfaces on these machines) and invoked it once to see what it is like, but it looks like it would be miserable to learn to use without a mouse, if it would work at all for some features (I had to dig real deep in the docs to figure out key commands - they constantly refer to the mouse). It would be no problem to add a mouse to the server that resides at my desk (well, maybe a bit of a desk space shortage...), but much of my work is done through ssh to two other servers, and I doubt a mouse would work in that environment - am I wrong? (Zero experience with mice in Linux!)

> Assuming you can read/write Japanese, I recommend you subscribe to the
> PHP-users list (http://ns1.php.gr.jp/mailman/listinfo/php-users).
I do read and write Japanese if I work hard enough at it (lots of copy/paste to/from a software dictionary - I've lived in Japan 5 years), but reading and contributing to a mailing list in Japanese could consume my whole work day! :-) That's why I have been using English lists. But I know that most of the people on the English lists (maybe everybody except Ishii-san!) don't work with Japanese systems and can't answer questions about them, so when I have future questions of this type, I probably should try the php.gr.jp list. Thanks for the link.

> Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
> higher. It supports Japanese very well. It automatically guesses the input
> charset and does the necessary conversion.

Input from where, and conversion to what? Do you mean data typed into forms? (I had assumed that I control the charset used for form data by the "charset" variable in the HTML header, but I haven't tested that theory!) Or do you mean that if the text in echo statements is in a different charset than the header (how could it even know?), it will convert it when sending it out to the browser? (That would be hard to believe, but wonderful if it's true!)

I'm still unsure of what to do. I was just about to take your advice and switch all my PHP and Perl files to EUC, when I remembered that I have to consider other people. After I get the PHP/Perl code working, the webmaster cleans up my grammar and/or changes the wording to the way he wants it, and he never uses Linux but only Windows-based editors, which as far as I know all expect Shift-JIS. Maybe I can train him to always open files with the EUC-to-Shift-JIS preprocessor and save them with the Shift-JIS-to-EUC postprocessor, but I suspect he won't be happy about it. But if I can get answers to the above questions, I may be closer to a decision on which approach is better, all things considered.

Regards,
Karen

--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima 733-0834 Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------
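The vi symptoms Karen describes above (h, l, and x all working per byte, and a stray trail byte re-pairing with the next lead byte into "obscure kanji") follow directly from byte-oriented editing of a two-byte encoding, as a quick illustration shows:

```python
# Each kanji in EUC-JP is two bytes, so a byte-oriented editor sees
# twice as many "characters" as an encoding-aware one.
s = "日本語"
raw = s.encode("euc_jp")
assert len(raw) == 6   # what byte-oriented vi counts
assert len(s) == 3     # the actual character count
# Deleting a single byte leaves an unpaired trail byte, which then
# pairs with the following lead byte - the ripple effect in the thread.
```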
> > Use the "kon" command.
>
> This was a wonderful tip - thank you, Ishii-san! I didn't know about the
> command, and it seems to be trying to do what it is designed to do. It
> doesn't display Shift-JIS correctly, but it does work for EUC. Since I seem
> to be moving in the direction of converting everything to EUC, that should
> be okay. But in vi, how do I input Japanese? Is there a key combination
> that does what IME does in Windoze?

I've heard that there is a vi clone called "nvi-canna" with an IME called "canna". In nvi-canna, "set canna" + CTRL-O should initiate the Japanese input mode.

> I also noticed that vi is still not aware of the multi-byte characters - for
> example, when moving around in the text, I have to type h or l twice to get
> to the next character, and if I want to copy or delete characters I have to
> pretend that there are twice as many. Typing "x" just once (or an odd
> number of times) is really entertaining - all the characters in a whole word
> or sentence change to obscure kanji as kon tries to process the second byte
> of one character and the first byte of the next as a character. Is there a
> way to make vi aware of multibyte characters? (This is not an absolute
> necessity, but would help.)

Hmm. Maybe a locale problem? In my case (I'm using Vine Linux), the correct locale is "ja_JP.eucJP", but it may be different on RH6.2J.

> > Again, why not emacs?
>
> I had never used it - in fact, it wasn't even installed on my system. After
> you seemed to be recommending it, I installed the "no X" version (I don't
> have any graphical interfaces on these machines) and invoked it once to see
> what it is like, but it looks like it would be miserable to learn to use
> without a mouse, if it would work at all for some features (I had to dig
> real deep in the docs to figure out key commands - they constantly refer to
> the mouse). It would be no problem to add a mouse to the server that
> resides at my desk (well, maybe a bit of a desk space shortage...), but much
> of my work is done through ssh to two other servers, and I doubt a mouse
> would work in that environment - am I wrong? (Zero experience with mice in
> Linux!)

I've never used a mouse with Emacs, even in the X environment.

> I do read and write Japanese if I work hard enough at it (lots of copy/paste
> to/from a software dictionary - I've lived in Japan 5 years), but reading
> and contributing to a mailing list in Japanese could consume my whole work
> day! :-)

I understand. The same can be said for me with English :-)

> > Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
> > higher. It supports Japanese very well. It automatically guesses the input
> > charset and does the necessary conversion.
>
> Input from where, and conversion to what? Do you mean data typed into forms?

Yes. It seems most browsers use the same encoding as the one used for the page (you can view the page's charset using "properties" or whatever menu in the browser). PHP 4.0.6 is clever enough to automatically guess the charset of data put into a form, and to do a conversion between that encoding and EUC-JP (which is the recommended internal encoding of PHP). In PHP 4.0.6's php.ini there are entries to control the behavior of the encoding handling:

mbstring.internal_encoding = EUC-JP
mbstring.http_input = auto
mbstring.http_output = SJIS

These read as:

o the internal encoding in PHP is EUC-JP (this is recommended)
o charsets of any input from forms etc. are automatically determined
o charsets for all final pages produced by PHP are SJIS

Interesting? :-)
--
Tatsuo Ishii