Thread: Re: [HACKERS] UNICODE characters above 0x10000

Re: [HACKERS] UNICODE characters above 0x10000

From: "John Hansen"
Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.
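
For reference, the original ISO 10646 / RFC 2279 definition of UTF-8
maps lead bytes to sequence lengths as below (an illustrative sketch,
not PostgreSQL code; utf8_seq_len is a made-up name):

/*
 * Sequence length implied by a UTF-8 lead byte under the original
 * ISO 10646 / RFC 2279 scheme (up to 6 bytes, i.e. 31-bit values).
 * Returns -1 for bytes that cannot start a sequence.
 */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;     /* 0xxxxxxx: 7 bits                */
    if (b < 0xC0) return -1;    /* 10xxxxxx: continuation byte     */
    if (b < 0xE0) return 2;     /* 110xxxxx: up to 11 bits         */
    if (b < 0xF0) return 3;     /* 1110xxxx: up to 16 bits         */
    if (b < 0xF8) return 4;     /* 11110xxx: up to 21 bits         */
    if (b < 0xFC) return 5;     /* 111110xx: up to 26 bits         */
    if (b < 0xFE) return 6;     /* 1111110x: up to 31 bits         */
    return -1;                  /* 0xFE/0xFF are always invalid    */
}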

Regards,

John Hansen

-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: Saturday, August 07, 2004 8:09 PM
To: tgl@sss.pgh.pa.us
Cc: db@zigo.dhs.org; John Hansen; pgsql-hackers@postgresql.org;
pgsql-patches@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

> Dennis Bjorklund <db@zigo.dhs.org> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones, that is illegal and should be tested for and rejected. So the
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4
> > bytes (or 31 bits)).
>
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide
> internal characters (ie, 16-bit pg_wchar datatype width).  I believe
> that the regex library limitation here is gone, and that as far as
> that library is concerned we could assume a 32-bit internal character
> width.  The question at hand is whether we can support 32-bit
> characters or not --- and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype.  However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only uses
up to 0x0010FFFF, so 24 bits should be enough...
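
For illustration, the check this implies is trivial (a sketch only;
is_unicode_scalar is a made-up helper, not backend code):

#include <stdbool.h>

/*
 * True if cp is a code point Unicode can ever assign: within the
 * 17 planes (0x0000..0x10FFFF), excluding the UTF-16 surrogate
 * range 0xD800..0xDFFF, which never encodes a character.
 */
bool is_unicode_scalar(unsigned int cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
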
--
Tatsuo Ishii



Re: [HACKERS] UNICODE characters above 0x10000

From: Tatsuo Ishii
> Yes, but the specification allows for 6-byte sequences, or 32-bit
> characters.

UTF-8 is just an encoding specification, not a character set
specification. Unicode has only 17 planes of 256x256 code points in its
specification: 17 * 65,536 = 1,114,112 code points, so the highest
possible code point is 0x10FFFF.

> As Dennis pointed out, just because they're not used doesn't mean we
> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

We should expand it to 64 bits, since some day the specification might
change :-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that Unicode advocates once said that the
merit of Unicode was that it only required a 16-bit width. Now they say
they need surrogate pairs and 32-bit-wide chars...

Anyway, my point is: if the current Unicode specification only allows a
24-bit range, why do we need to allow usage beyond the specification?
--
Tatsuo Ishii

Re: [HACKERS] UNICODE characters above 0x10000

From: Dennis Bjorklund
On Sat, 7 Aug 2004, John Hansen wrote:

> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

There are areas reserved for private character sets.
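
(For the record: the Private Use Areas are U+E000..U+F8FF in the BMP,
plus essentially all of planes 15 and 16. A sketch of a membership
test; is_private_use is a made-up name:)

#include <stdbool.h>

/*
 * True if cp falls in one of Unicode's Private Use Areas; note
 * that all of them already fit below 0x10FFFF.
 */
bool is_private_use(unsigned int cp)
{
    return (cp >= 0xE000   && cp <= 0xF8FF)     /* BMP PUA         */
        || (cp >= 0xF0000  && cp <= 0xFFFFD)    /* plane 15, PUA-A */
        || (cp >= 0x100000 && cp <= 0x10FFFD);  /* plane 16, PUA-B */
}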

--
/Dennis Björklund


Re: [HACKERS] UNICODE characters above 0x10000

From: Dennis Bjorklund
On Sat, 7 Aug 2004, Tatsuo Ishii wrote:

> More seriously, Unicode is filled with tons of confusion and
> inconsistency IMO. Remember that Unicode advocates once said that the
> merit of Unicode was that it only required a 16-bit width. Now they say
> they need surrogate pairs and 32-bit-wide chars...
>
> Anyway, my point is: if the current Unicode specification only allows a
> 24-bit range, why do we need to allow usage beyond the specification?

Whatever problems they have had in the past, ISO 10646 formally defines
a 31-bit character set. Are you saying that applications should reject
strings that contain characters they do not recognize?

Is there a specific reason you want to restrict it to 24 bits? In
practice it does not matter much, since the high range is not used
today; I just don't know why you want the restriction.

--
/Dennis Björklund


Re: [HACKERS] UNICODE characters above 0x10000

From: Tom Lane
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
>> Anyway, my point is: if the current Unicode specification only allows a
>> 24-bit range, why do we need to allow usage beyond the specification?

> Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the
maximum encoded character length possible in the current encoding
(look for uses of pg_database_encoding_max_length).  Probably the only
one that's really significant for performance is text_substr(), but
that's enough to be an argument against setting maxmblen higher than
we have to.
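
(The pattern in question looks roughly like this; a sketch of the
idea only, not the actual text_substr() code, and alloc_for_chars is
a made-up name:)

#include "postgres.h"
#include "mb/pg_wchar.h"    /* pg_database_encoding_max_length() */

/*
 * Worst-case allocation for an n-character result is n times the
 * encoding's maximum bytes per character, so a larger maxmblen
 * (6 instead of 4) directly inflates every such palloc.
 */
static char *
alloc_for_chars(int n)
{
    int     maxlen = pg_database_encoding_max_length();

    return palloc(n * maxlen + 1);  /* +1 for the terminating NUL */
}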

It looks to me like supporting 4-byte UTF-8 characters would be enough
to handle the existing range of Unicode codepoints, and that is probably
as much as we want to do.
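
(As a sanity check, 4 bytes already reach the top of the range:
U+10FFFF encodes as F4 8F BF BF. A minimal encoder sketch, not
backend code; utf8_encode is a made-up name:)

/*
 * Minimal UTF-8 encoder covering the full Unicode range.  Returns
 * the number of bytes written (1..4), or 0 if cp is out of range.
 * Example: 0x10FFFF encodes as F4 8F BF BF, so 4 bytes suffice.
 */
int utf8_encode(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;       /* outside the 17 planes */
}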

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
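
(For the pg_utf_mblen item, the extension amounts to recognizing the
11110xxx lead-byte class. Roughly along these lines; the actual
function may differ in detail:)

/*
 * Sequence length from the lead byte, with the 4-byte case added.
 */
int pg_utf_mblen(const unsigned char *s)
{
    if ((*s & 0x80) == 0)
        return 1;           /* 0xxxxxxx: ASCII                */
    if ((*s & 0xE0) == 0xC0)
        return 2;           /* 110xxxxx                       */
    if ((*s & 0xF0) == 0xE0)
        return 3;           /* 1110xxxx                       */
    if ((*s & 0xF8) == 0xF0)
        return 4;           /* 11110xxx: the new 4-byte case  */
    return 1;               /* invalid lead byte; let the
                             * verifier reject it             */
}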

Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.

            regards, tom lane

Re: [HACKERS] UNICODE characters above 0x10000

From: Oliver Jowett
Tom Lane wrote:

> If I understood what I was reading, this would take several things:
> * Remove the "special UTF-8 check" in pg_verifymbstr;
> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
>
> Are there any other places that would have to change?  Would this break
> anything?  The testing aspect is what's bothering me at the moment.

Does this change what client_encoding = UNICODE might produce? The JDBC
driver will need some tweaking to handle this -- Java uses UTF-16
internally and I think some supplementary character (?) scheme for
values above 0xffff as of JDK 1.5.

-O

Re: [HACKERS] UNICODE characters above 0x10000

From: Tom Lane
Oliver Jowett <oliver@opencloud.com> writes:
> Does this change what client_encoding = UNICODE might produce? The JDBC
> driver will need some tweaking to handle this -- Java uses UTF-16
> internally and I think some supplementary character (?) scheme for
> values above 0xffff as of JDK 1.5.

You're not likely to get out anything you didn't put in, so I'm not sure
it matters.

            regards, tom lane

Re: [HACKERS] UNICODE characters above 0x10000

From: Tatsuo Ishii
> Tom Lane wrote:
>
> > If I understood what I was reading, this would take several things:
> > * Remove the "special UTF-8 check" in pg_verifymbstr;
> > * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
> > * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
> >
> > Are there any other places that would have to change?  Would this break
> > anything?  The testing aspect is what's bothering me at the moment.
>
> Does this change what client_encoding = UNICODE might produce? The JDBC
> driver will need some tweaking to handle this -- Java uses UTF-16
> internally and I think some supplementary character (?) scheme for
> values above 0xffff as of JDK 1.5.

Java doesn't handle UCS above 0xffff? I didn't know that. As long as
the data goes in and out through JDBC, it shouldn't be a problem.
However, if other APIs put in such data, you will get into trouble...
--
Tatsuo Ishii

Re: [HACKERS] UNICODE characters above 0x10000

From: Oliver Jowett
Tatsuo Ishii wrote:
>>Tom Lane wrote:
>>
>>
>>>If I understood what I was reading, this would take several things:
>>>* Remove the "special UTF-8 check" in pg_verifymbstr;
>>>* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
>>>* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
>>>
>>>Are there any other places that would have to change?  Would this break
>>>anything?  The testing aspect is what's bothering me at the moment.
>>
>>Does this change what client_encoding = UNICODE might produce? The JDBC
>>driver will need some tweaking to handle this -- Java uses UTF-16
>>internally and I think some supplementary character (?) scheme for
>>values above 0xffff as of JDK 1.5.
>
>
> Java doesn't handle UCS above 0xffff? I didn't know that. As long as
> the data goes in and out through JDBC, it shouldn't be a problem.
> However, if other APIs put in such data, you will get into trouble...

Internally, Java strings are arrays of UTF-16 values. Before JDK 1.5,
all the string-manipulation library routines assumed that one code point
== one UTF-16 value, so you couldn't represent values above 0xffff. The
1.5 libraries understand supplementary characters, represented as
surrogate pairs of two UTF-16 values per code point. See
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

However, the JDBC driver needs to be taught how to translate between
UTF-8 representations of code points above 0xffff and pairs of UTF-16
values. Previously it didn't need to do anything, since the server
didn't use those high values. It's a minor thing.
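
(The translation itself is mechanical. A sketch of the arithmetic, in
C for brevity rather than the driver's actual Java; to_surrogate_pair
is a made-up name:)

/*
 * Split a code point above 0xFFFF into a UTF-16 surrogate pair:
 * bias by 0x10000, then take the high and low 10-bit halves.
 * Example: U+10400 becomes D801 DC00.
 */
void to_surrogate_pair(unsigned int cp, unsigned short *hi, unsigned short *lo)
{
    unsigned int v = cp - 0x10000;  /* caller ensures cp > 0xFFFF */

    *hi = (unsigned short) (0xD800 + (v >> 10));    /* high surrogate */
    *lo = (unsigned short) (0xDC00 + (v & 0x3FF));  /* low surrogate  */
}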

-O