Re: [PATCHES] UNICODE characters above 0x10000

From: Oliver Jowett
Subject: Re: [PATCHES] UNICODE characters above 0x10000
Msg-id: 4115918A.1020405@opencloud.com
In response to: Re: [PATCHES] UNICODE characters above 0x10000 (Tatsuo Ishii <t-ishii@sra.co.jp>)
List: pgsql-hackers
Tatsuo Ishii wrote:
>>Tom Lane wrote:
>>
>>
>>>If I understood what I was reading, this would take several things:
>>>* Remove the "special UTF-8 check" in pg_verifymbstr;
>>>* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
>>>* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
>>>
>>>Are there any other places that would have to change?  Would this break
>>>anything?  The testing aspect is what's bothering me at the moment.
>>
>>Does this change what client_encoding = UNICODE might produce? The JDBC
>>driver will need some tweaking to handle this -- Java uses UTF-16
>>internally and I think some supplementary character (?) scheme for
>>values above 0xffff as of JDK 1.5.
>
>
> Java doesn't handle UCS above 0xffff? I didn't know that. As long as
> you only put data in/out via JDBC, it shouldn't be a problem. However,
> if other APIs put in such data, you will get into trouble...

Internally, Java strings are arrays of UTF-16 values. Before JDK 1.5,
all the string-manipulation library routines assumed that one code point
== one UTF-16 value, so code points above 0xffff could not be handled
correctly. The 1.5 libraries understand supplementary characters, which
represent each code point above 0xffff as a surrogate pair of two UTF-16
values. See
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
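
As a quick illustration using the JDK 1.5 additions (the class name here
is just for the example):

    public class SupplementaryDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF) is above 0xffff, so UTF-16
            // represents it as the surrogate pair 0xD834 0xDD1E.
            String s = new String(Character.toChars(0x1D11E)); // new in 1.5

            System.out.println(s.length());                      // 2 UTF-16 values
            System.out.println(s.codePointCount(0, s.length())); // 1 code point
            System.out.printf("%04X %04X%n",
                              (int) s.charAt(0), (int) s.charAt(1)); // D834 DD1E
        }
    }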

However, the JDBC driver needs to be taught how to translate between
UTF-8 representations of code points above 0xffff and pairs of UTF-16
values. Previously it didn't need to do anything, since the server never
used those high values. It's a minor thing.
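
As a sketch of the translation involved (a hypothetical helper, not the
actual driver code): a 4-byte UTF-8 sequence decodes to a code point in
the 0x10000-0x10FFFF range, which then splits into a surrogate pair:

    // Hypothetical sketch, not the actual driver code: decode one 4-byte
    // UTF-8 sequence starting at b[off] into a UTF-16 surrogate pair.
    static char[] utf8FourBytesToUtf16(byte[] b, int off) {
        // 4-byte UTF-8 layout: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
        int cp = ((b[off]     & 0x07) << 18)
               | ((b[off + 1] & 0x3F) << 12)
               | ((b[off + 2] & 0x3F) << 6)
               |  (b[off + 3] & 0x3F);

        // UTF-16: subtract 0x10000, then split the remaining 20 bits into
        // a high surrogate (0xD800 + top 10 bits) and a low surrogate
        // (0xDC00 + bottom 10 bits).
        int v = cp - 0x10000;
        return new char[] {
            (char) (0xD800 | (v >>> 10)),
            (char) (0xDC00 | (v & 0x3FF))
        };
    }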

-O
