Thread: Second byte of multibyte characters causing trouble

From: "Karen Ellrick"
I am using Perl CGI scripts with DBI to take data from a web interface and
from text files to put into my database, and I'm dealing with Japanese (i.e.
two-byte characters).  PostgreSQL is installed with multibyte enabled, but
somewhere in the communication chain from Perl to DBI to PostgreSQL,
something is trying to interpret multibyte text byte by byte, which is
causing trouble.  The one example discovered so far is that if the second
of the two bytes is 0x5c (ASCII "\"), it gets swallowed, and a ripple
effect through the following byte pairs ensues (at least if the byte after
the 0x5c isn't a valid character to follow \ in an escape sequence - if it
is, who knows what will happen!).  I fixed that one by replacing any \ in the
strings with "\\" to get a literal 0x5C byte past whatever is trying to
interpret it.  But I am wondering what other similar pitfalls I have to
watch out for, and I'm hoping others have ideas.  For example, is my SQL
insert or update statement going to choke if the second byte of one of the
characters is the same as ASCII for a single quote?  The possibilities are
endless, depending on what part of the process is doing the damage.  And
trying to test this stuff is like looking for a needle in a haystack - it's
not easy to figure out what Japanese characters have second bytes that would
have special meaning if interpreted as ASCII.
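For concreteness, the band-aid looks roughly like this (a minimal sketch -
$dbh is my DBI handle, $value is one of the incoming strings, and the
table and column names are invented):

# Double any backslash so a 0x5c second byte survives whatever is
# interpreting the string on its way to PostgreSQL.
(my $escaped = $value) =~ s/\\/\\\\/g;
$dbh->do("INSERT INTO customers (name) VALUES ('$escaped')");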

If someone knows how to set things up so that all text is guaranteed to go
through unscathed (make Perl or DBI multi-byte aware, or whatever - i.e. the
real fix), that would be ideal.  Otherwise, at least some ideas would be
welcome regarding what other bytes to write band-aid code for.  I know I'm
not the only one trying to use Perl to maintain PostgreSQL databases with
Japanese or Chinese text! :-)

Thanks in advance,
Karen

--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima  733-0834  Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------


Re: Second byte of multibyte characters causing trouble

From: David Emery
The usual way to deal with this is to convert the Japanese text from
Shift-JIS (which will almost always cause problems) to either EUC-JP or
UTF-8 before inserting it into the DB or otherwise messing with it. You
can then convert it back to Shift-JIS before sending it to the client. For
Perl there are some scripts/modules that make encoding conversion fairly
painless. Check CPAN for the Jcode.pm module and the jcode.pl scripts. I
think there may be some others available now too, so it may be worth
searching around a bit.
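For instance, something along these lines should round-trip a Shift-JIS
form value through an EUC-JP database (an untested sketch - the database
name, table, column, and form-field names are all placeholders):

use CGI;
use DBI;
use Jcode;

my $cgi = CGI->new;
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 });

# Shift-JIS from the browser -> EUC-JP before it goes near the DB.
my $euc = Jcode->new($cgi->param('body'), 'sjis')->euc;
$dbh->do('INSERT INTO memos (body) VALUES (?)', undef, $euc);

# EUC-JP out of the DB -> Shift-JIS on the way back to the client.
my ($body) = $dbh->selectrow_array(
    'SELECT body FROM memos ORDER BY id DESC LIMIT 1');
print $cgi->header('text/html; charset=Shift_JIS'),
      Jcode->new($body, 'euc')->sjis;

As a side benefit, the DBI placeholder (the ?) leaves quoting and escaping
to the driver, which should also take care of the stray-0x5c problem.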

Gambatte,
-dave



Re: Second byte of multibyte characters causing trouble

From: "Karen Ellrick"
> The usual way to deal with this is to convert the Japanese text from
> Shift-JIS (which will almost always cause problems) to either EUC-JP or
> UTF-8 before inserting it into the DB or otherwise messing with it. You
> can then convert it back to Shift-JIS before sending it to the client.

After reading this, I started thinking in terms of character sets and dug a
little more, and lo and behold, I discovered that our installation of
PostgreSQL was configured with "--enable-multibyte=EUC_JP".  No wonder I'm
having problems!  Okay, I'm convinced.

Now first I have to convert my existing data, which although sitting in a
database that expects EUC, is actually SJIS-based text.  I found the
following series of bash commands in a Japanese mailing list archive - does
it look like this will work for me?  (It looks scary to just drop the whole
database and hope that the .out file knows how to rebuild it with all the
indexes, sequences, users, etc. in place - should I be nervous?)
$ pg_dump -D dbname > db.out
$ dropdb dbname
$ createdb -E EUC_JP dbname
$ export PGCLIENTENCODING=SJIS
$ psql dbname < db.out
$ export PGCLIENTENCODING=EUC_JP

Regarding the user interface end, when I read the suggested solution of
using jcode to convert everything in and out of the database, I thought,
"That's tedious!  Why not just use EUC on the web pages, and the whole
system will be in sync?"  But that seems to be almost as tedious.  The
Windows-based editor I normally use to input the Japanese text portions of
my code (I do most of the work in vi on my Linux box, but I can't input the
Japanese that way) reads and writes in Shift-JIS unless I use pre- and
post-processing filters, and it seems that other Windows programs also favor
Shift-JIS.  I did a totally unofficial, very-small-data-sample survey of
Japanese web sites, and it seems that in general, sites that deal with
ordinary consumers (and likely are written on Microsoft machines) use
Shift-JIS (even ones that I figure must use databases, like search engines
and e-commerce), Linux-related sites use JIS, and PostgreSQL-related sites
use EUC.  I'm sure there's a grand story to explain how it got to be this
messy, but for right now, I guess we have to live with all these different
systems - apparently there is not one system that works nicely for all
things, or else the others would gradually become obsolete, right?

Before I add jcode function calls for every piece of data I get in or out of
the database, or convert all my web page text to EUC-JP (I haven't decided
yet which approach is more work, or more of a problem to maintain), are
there any other thoughts on this?  For example, does someone know of one of
the following: (a) a way to get the text-only console of a RedHat 6.1J box
to actually display Japanese characters (if so, I not only wouldn't have to
deal with the Windows box for editing, I could even read the output of
queries in psql!), or (b) a text editor for Windows that can be configured
to default to EUC, rather than having to remember to always select a filter
to convert to and from Shift-JIS?  Or on the flip side of the discussion,
can anyone imagine pitfalls associated with having a web site that is half
EUC (the PHP and Perl files that deal with the database) and half Shift-JIS
(the static HTML pages that are written by other people in who-knows-what
Windows-based tools)?

Thanks,
--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima  733-0834  Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------


Re: Second byte of multibyte characters causing trouble

From: Tatsuo Ishii
> Now first I have to convert my existing data, which although sitting in a
> database that expects EUC, is actually SJIS-based text.  I found the
> following series of bash commands in a Japanese mailing list archive - does
> it look like this will work for me?  (It looks scary to just drop the whole
> database and hope that the .out file knows how to rebuild it with all the
> indexes, sequences, users, etc. in place - should I be nervous?)
> $ pg_dump -D dbname > db.out
> $ dropdb dbname
> $ createdb -E EUC_JP dbname
> $ export PGCLIENTENCODING=SJIS
> $ psql dbname < db.out
> $ export PGCLIENTENCODING=EUC_JP

Yes, the above procedure should convert your (accidentally) SJIS-based
database to an EUC_JP database.
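
The trick is that the dump file still contains Shift-JIS bytes, so
restoring it with PGCLIENTENCODING=SJIS makes the backend convert
everything to EUC_JP as it loads. Afterwards you can verify the result,
and your Perl scripts can set the client encoding per connection instead
of relying on the environment variable. A sketch (assuming DBD::Pg; the
database name is a placeholder):

use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=dbname', '', '', { RaiseError => 1 });

# Check that the database really is EUC_JP now.
my ($enc) = $dbh->selectrow_array('SELECT getdatabaseencoding()');
print "database encoding: $enc\n";

# Have the backend convert to/from Shift-JIS for this connection only.
$dbh->do("SET CLIENT_ENCODING TO 'SJIS'");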

> Regarding the user interface end, when I read the suggested solution of
> using jcode to convert everything in and out of the database, I thought,
> "That's tedious!  Why not just use EUC on the web pages, and the whole
> system will be in sync?"  But that seems to be almost as tedious.  The
> Windows-based editor I normally use to input the Japanese text portions of
> my code (I do most of the work in vi on my Linux box, but I can't input the
> Japanese that way)

You can't input Japanese using vi? Why?

> reads and writes in Shift-JIS unless I use pre- and
> post-processing filters, and it seems that other Windows programs also favor
> Shift-JIS.

Why not emacs? It can read and write SJIS text directly.

> I did a totally unofficial, very-small-data-sample survey of
> Japanese web sites, and it seems that in general, sites that deal with
> ordinary consumers (and likely are written on Microsoft machines) use
> Shift-JIS (even ones that I figure must use databases, like search engines
> and e-commerce), Linux-related sites use JIS, and PostgreSQL-related sites
> use EUC.  I'm sure there's a grand story to explain how it got to be this
> messy, but for right now, I guess we have to live with all these different
> systems - apparently there is not one system that works nicely for all
> things, or else the others would gradually become obsolete, right?
>
> Before I add jcode function calls for every piece of data I get in or out of
> the database, or convert all my web page text to EUC-JP (I haven't decided
> yet which approach is more work, or more of a problem to maintain), are
> there any other thoughts on this?  For example, does someone know of one of
> the following: (a) a way to get the text-only console of a RedHat 6.1J box
> to actually display Japanese characters (if so, I not only wouldn't have to
> deal with the Windows box for editing, I could even read the output of
> queries in psql!),

Use "kon" command.

> or (b) a text editor for Windows that can be configured
> to default to EUC, rather than having to remember to always select a filter
> to convert to and from Shift-JIS?

Again, why not emacs?

> Or on the flip side of the discussion,
> can anyone imagine pitfalls associated with having a web site that is half
> EUC (the PHP and Perl files that deal with the database) and half Shift-JIS
> (the static HTML pages that are written by other people in who-knows-what
> Windows-based tools)?

Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
higher. It supports Japanese very well: it automatically guesses the input
charset and does the necessary conversion. This is very helpful. Also, I
recommend that you always use EUC-JP when writing PHP scripts.

Assuming you can read/write Japanese, I recommend you subscribe to the
PHP-users list (http://ns1.php.gr.jp/mailman/listinfo/php-users).
--
Tatsuo Ishii

Re: Second byte of multibyte characters causing trouble

From: "Karen Ellrick"
> Use "kon" command.
This was a wonderful tip - thank you, Ishii-san!  I didn't know about the
command, and it seems to be trying to do what it is designed to do.  It
doesn't display Shift-JIS correctly, but it does work for EUC.  Since I seem
to be moving in the direction of converting everything to EUC, that should
be okay.  But in vi, how do I input Japanese?  Is there a key combination
that does what IME does in Windoze?

I also noticed that vi is still not aware of the multi-byte characters - for
example, when moving around in the text, I have to type h or l twice to get
to the next character, and if I want to copy or delete characters I have to
pretend that there are twice as many.  Typing "x" just once (or an odd
number of times) is really entertaining - all the characters in a whole word
or sentence change to obscure kanji as kon tries to process the second byte
of one character and the first byte of the next as a character.  Is there a
way to make vi aware of multibyte characters?  (This is not an absolute
necessity, but would help.)

> Again, why not emacs?
I had never used it - in fact, it wasn't even installed on my system.  After
you seemed to be recommending it, I installed the "no X" version (I don't
have any graphical interfaces on these machines) and invoked it once to see
what it is like, but it looks like it would be miserable to learn to use
without a mouse, if it would work at all for some features (I had to dig
real deep in the docs to figure out key commands - they constantly refer to
the mouse).  It would be no problem to add a mouse to the server that
resides at my desk (well, maybe a bit of a desk space shortage...), but much
of my work is done through ssh to two other servers, and I doubt a mouse
would work in that environment - am I wrong?  (Zero experience with mice in
Linux!)

> Assuming you can read/write Japanese, I recommend you subscribe to the
> PHP-users list (http://ns1.php.gr.jp/mailman/listinfo/php-users).
I do read and write Japanese if I work hard enough at it (lots of copy/paste
to/from a software dictionary - I've lived in Japan 5 years), but reading
and contributing to a mailing list in Japanese could consume my whole work
day! :-)  That's why I have been using English lists.  But I know that most
of the people on the English lists (maybe everybody except Ishii-san!) don't
work with Japanese systems and can't answer questions about them, so when I
have future questions of this type, I probably should try the php.gr.jp
list.  Thanks for the link.

> Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
> higher. It supports Japanese very well: it automatically guesses the
> input charset and does the necessary conversion.
Input from where and conversion to what?  Do you mean data typed into forms?
(I had assumed that I control the charset used for form data by the
"charset" variable in the html header, but I haven't tested that theory!)
Or do you mean that if the text in echo statements is in a different charset
than the header (how could it even know?), it will convert it when sending
it out to the browser?  (That would be hard to believe, but wonderful if
it's true!)

I'm still unsure of what to do.  I was just about to take your advice and
switch all my PHP and Perl files to EUC, when I remembered that I have to
consider other people. After I get the PHP/Perl code working, the webmaster
cleans up my grammar and/or changes the wording to the way he wants it, and
he never uses Linux but only Windows-based editors, which as far as I know
all expect Shift-JIS.  Maybe I can train him to always open files with the
EUC->Shift-JIS preprocessor and save them with the Shift-JIS->EUC
postprocessor, but I suspect he won't be happy about it.  But if I can get
answers to the above questions, I may be closer to a decision on which
approach is better, all things considered.

Regards,
Karen

--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima  733-0834  Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------


Re: Second byte of multibyte characters causing trouble

From: Tatsuo Ishii
> > Use "kon" command.
> This was a wonderful tip - thank you, Ishii-san!  I didn't know about the
> command, and it seems to be trying to do what it is designed to do.  It
> doesn't display Shift-JIS correctly, but it does work for EUC.  Since I seem
> to be moving in the direction of converting everything to EUC, that should
> be okay.  But in vi, how do I input Japanese?  Is there a key combination
> that does what IME does in Windoze?

I've heard that there is a vi clone called "nvi-canna" that has an IME
called "canna". In nvi-canna, "set canna" + CTRL-O should initiate the
Japanese input mode.

> I also noticed that vi is still not aware of the multi-byte characters - for
> example, when moving around in the text, I have to type h or l twice to get
> to the next character, and if I want to copy or delete characters I have to
> pretend that there are twice as many.  Typing "x" just once (or an odd
> number of times) is really entertaining - all the characters in a whole word
> or sentence change to obscure kanji as kon tries to process the second byte
> of one character and the first byte of the next as a character.  Is there a
> way to make vi aware of multibyte characters?  (This is not an absolute
> necessity, but would help.)

Hmm, maybe a locale problem? In my case (I'm using Vine Linux) the correct
locale is "ja_JP.eucJP", but it may be different on RH6.1J - e.g. try
"export LANG=ja_JP.eucJP" before starting kon and vi.

> > Again, why not emacs?
> I had never used it - in fact, it wasn't even installed on my system.  After
> you seemed to be recommending it, I installed the "no X" version (I don't
> have any graphical interfaces on these machines) and invoked it once to see
> what it is like, but it looks like it would be miserable to learn to use
> without a mouse, if it would work at all for some features (I had to dig
> real deep in the docs to figure out key commands - they constantly refer to
> the mouse).  It would be no problem to add a mouse to the server that
> resides at my desk (well, maybe a bit of a desk space shortage...), but much
> of my work is done through ssh to two other servers, and I doubt a mouse
> would work in that environment - am I wrong?  (Zero experience with mice in
> Linux!)

I've never used a mouse with Emacs, even in the X environment.

> I do read and write Japanese if I work hard enough at it (lots of copy/paste
> to/from a software dictionary - I've lived in Japan 5 years), but reading
> and contributing to a mailing list in Japanese could consume my whole work
> day! :-)

I understand. The same can be said for me with English :-)

> > Are you using PHP? Then I strongly recommend upgrading to PHP 4.0.6 or
> > higher. It supports Japanese very well: it automatically guesses the
> > input charset and does the necessary conversion.
> Input from where and conversion to what?  Do you mean data typed into forms?

Yes.

It seems most browsers use the same encoding as the page itself (you can
check a page's charset via the "properties" or similar menu in the
browser). PHP 4.0.6 is clever enough to automatically guess the charset of
data put into a form and to convert between that encoding and EUC-JP (the
recommended internal encoding of PHP).

In PHP 4.0.6's php.ini there are entries that control the encoding
handling:

mbstring.internal_encoding = EUC-JP
mbstring.http_input = auto
mbstring.http_output = SJIS

These read as:

o the internal encoding in PHP is EUC-JP (this is recommended)
o charsets of any input from forms etc. are automatically determined
o charsets for all final pages produced by PHP are SJIS

Interesting?:-)
--
Tatsuo Ishii