Thread: UNICODE

UNICODE

From
"Per Aronsson"
Date:
Hi,

To enable localization of our new platform, we thought that saving all
character strings as UNICODE would be a good idea. Even if the front-end
(PHP) doesn't fully support UNICODE yet, we figured it's still good to have
the database in that format, for the future. We have not installed
mb_string.

We have created a UNICODE database and started experimenting with it
(PostgreSQL)
./configure --enable-multibyte
createdb -E UNICODE me-e

My question is: do you need to convert strings to UTF-8 before adding them
to the database, or is that done "automatically"?


Best regards,
Per Aronsson




Re: UNICODE

From
Tatsuo Ishii
Date:
> To enable localization of our new platform, we thought that saving all
> character strings as UNICODE would be a good idea. Even if the front-end
> (PHP) doesn't fully support UNICODE yet, we figured it's still good to have
> the database in that format, for the future. We have not installed
> mb_string.
>
> We have created a UNICODE database and started experimenting with it
> (PostgreSQL)
> ./configure --enable-multibyte
> createdb -E UNICODE me-e
>
> My question is: do you need to convert strings to UTF-8 before adding them
> to the database, or is that done "automatically"?

PostgreSQL 7.1 can do the conversion on the backend side. However, you
need to add the configure option "--enable-unicode-conversion". You also
need to tell the backend which encoding your applications use. In PHP4
you can do that with the pg_set_client_encoding function. If your PHP
installation does not have it, you can issue this SQL statement instead:

set client_encoding to 'encoding_name_in_your_PHP_application';
--
Tatsuo Ishii

Re: UNICODE

From
Tatsuo Ishii
Date:
> I'm also trying to write some Chinese data to a postgresql database.  I'm
> getting gibberish after it's written to the database.
>
> I recognize the problem is at the http request.

More details of what you found, please.

>  How do I retrieve double
> byte characters through http request using C/C++?  And how do I write it to the
> database?  And how do I tell it what kind of encoding to use?
--
Tatsuo Ishii

Re: UNICODE

From
Tatsuo Ishii
Date:
> I'm also trying to write some Chinese data to a postgresql database.  I'm
> getting gibberish after it's written to the database.
>
> I recognize the problem is at the http request.  How do I retrieve double
> byte characters through http request using C/C++? And how do I write it to the
> database?

Nothing special. Just read and write the bytes one by one.

> And how do I tell it what kind of encoding to use?

set client_encoding.
--
Tatsuo Ishii

Re: UNICODE

From
Tatsuo Ishii
Date:
Could you please not send me personal mail?
Let's share info among the people on the mailing list.
Anyway...

> I've tried that.  Still not writing the Chinese characters correctly.

I don't know which Chinese character set you are using, but your code
will not work if it is Big5, because the second byte of a Big5
character can have the same value as an ASCII character.
To learn more about character sets, see
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
for example.
--
Tatsuo Ishii

> Here is the code:
>
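>   contentTypeFromPost = getenv("CONTENT_TYPE");
>   contentTypeLength = getenv("CONTENT_LENGTH");
>   icontentLength = contentTypeLength ? atoi(contentTypeLength) : 0;
>
>     if((queryString = malloc(icontentLength + 1)) == NULL)
>     {
>       postMessage("Cannot allocate memory", 0);
>       return(0);
>     }
>     /* assumed step, not in the posted excerpt: read the POST body from stdin */
>     icontentLength = fread(queryString, 1, icontentLength, stdin);
>     queryString[icontentLength] = '\0';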
>     for(i=0; *queryString; i++)
>     {
>       splitword(items.Item, queryString, '&');
>       unescape_url(items.Item);
>       splitword(items.name, items.Item, '=');
>
>  // items.Item contains double byte characters
>  // However, when write to database I get unrecognizable data
>     }
>
> void splitword(uchar *out, uchar *in, uchar stop)
> {
>    int i, j;
>
>    while(*in == ' ') in++; /* skip past any spaces */
>
>    for(i = 0; in[i] && (in[i] != stop); i++)
>       out[i] = in[i];
>
>    out[i] = '\0'; /* terminate it */
>    if(in[i]) ++i; /* position past the stop */
>
>    while(in[i] == ' ') i++; /* skip past any spaces */
>
>    /* shift the rest of the in, copying up to and including the terminator */
>    for(j = 0; (in[j] = in[i]) != '\0'; j++, i++)
>       ;
> }
>
> uchar x2c(uchar *x)
> {
>    register uchar c;
>
>    /* note: (x & 0xdf) makes x upper case */
>    c  = (x[0] >= 'A' ? ((x[0] & 0xdf) - 'A') + 10 : (x[0] - '0'));
>    c *= 16;
>    c += (x[1] >= 'A' ? ((x[1] & 0xdf) - 'A') + 10 : (x[1] - '0'));
>    return(c);
> }
>
> void unescape_url(uchar *url)
> {
>    register int i, j;
>
>    for(i = 0, j = 0; url[j]; ++i, ++j)
>    {
>       if((url[i] = url[j]) == '%')
>       {
>          url[i] = x2c(&url[j + 1]);
>          j += 2;
>       }
>       else if (url[i] == '+')
>          url[i] = ' ';
>    }
>    url[i] = '\0';  /* terminate it at the new length */
> }
>
> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: Sunday, October 28, 2001 4:57 PM
> To: jklcom@mindspring.com
> Cc: pgsql-general@postgresql.org
> Subject: RE: [GENERAL] UNICODE
>
>
> > I'm also trying to write some Chinese data to a postgresql database.  I'm
> > getting gibberish after it's written to the database.
> >
> > I recognize the problem is at the http request.  How do I retrieve double
> > byte characters through http request using C/C++? And how do I write it to
> > the database?
>
> Nothing special. Just read/write one by one.
>
> > And how do I tell it what kind of encoding to use?
>
> set client_encoding.
> --
> Tatsuo Ishii
>