Thread: Re: [PATCHES] UNICODE characters above 0x10000
Yes, but the specification allows for 6-byte sequences, or 32-bit characters.
As Dennis pointed out, just because they're not used doesn't mean we should
not allow them to be stored, since there might be someone using the high
ranges for a private character set, which could very well be included in the
specification some day.

Regards,

John Hansen

-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: Saturday, August 07, 2004 8:09 PM
To: tgl@sss.pgh.pa.us
Cc: db@zigo.dhs.org; John Hansen; pgsql-hackers@postgresql.org; pgsql-patches@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

> Dennis Bjorklund <db@zigo.dhs.org> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones; that is illegal and should be tested for and rejected. So the
> > longest UTF-8 sequence is 6 bytes (and the longest character needs 4
> > bytes, or 31 bits).
>
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide internal
> characters (ie, 16-bit pg_wchar datatype width). I believe that the regex
> library limitation here is gone, and that as far as that library is
> concerned we could assume a 32-bit internal character width. The question
> at hand is whether we can support 32-bit characters or not --- and if not,
> what's the next bug to fix?

pg_wchar is already a 32-bit datatype. However, I doubt there's actually a
need for 32-bit-wide character sets. Even Unicode only uses up to 0x0010FFFF,
so 24 bits should be enough...
--
Tatsuo Ishii
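The 6-byte figure comes from the original RFC 2279 definition of UTF-8, in
which the lead byte encodes a sequence length of up to 6 bytes and therefore
up to 31 bits of code point; RFC 3629 later restricted UTF-8 to 4 bytes and
U+10FFFF. A rough sketch of that lead-byte dispatch, for illustration only
(the function name is made up and this is not PostgreSQL source):

    /*
     * Sequence length implied by a UTF-8 lead byte under the original
     * RFC 2279 definition (up to 6 bytes, i.e. 31-bit values).
     */
    static int
    utf8_rfc2279_len(unsigned char lead)
    {
        if ((lead & 0x80) == 0x00) return 1;    /*  7 bits */
        if ((lead & 0xe0) == 0xc0) return 2;    /* 11 bits */
        if ((lead & 0xf0) == 0xe0) return 3;    /* 16 bits */
        if ((lead & 0xf8) == 0xf0) return 4;    /* 21 bits */
        if ((lead & 0xfc) == 0xf8) return 5;    /* 26 bits */
        if ((lead & 0xfe) == 0xfc) return 6;    /* 31 bits */
        return -1;  /* continuation byte, or 0xFE/0xFF (7 or 8 leading ones) */
    }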
> Yes, but the specification allows for 6-byte sequences, or 32-bit
> characters.

UTF-8 is just an encoding specification, not a character set specification.
Unicode only has 17 256x256 planes in its specification.

> As Dennis pointed out, just because they're not used doesn't mean we
> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

We should expand it to 64 bits, then, since some day the specification might
be changed :-)

More seriously, Unicode is filled with tons of confusion and inconsistency,
IMO. Remember that Unicode advocates once said that the merit of Unicode was
that it only requires 16-bit width. Now they say they need surrogate pairs
and 32-bit-wide chars...

Anyway, my point is: if the current specification of Unicode only allows a
24-bit range, why do we need to allow usage outside the specification?
--
Tatsuo Ishii
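Spelling out the arithmetic behind the 17-plane figure: 17 planes of
256 x 256 = 65,536 code points each gives 17 * 65,536 = 1,114,112 code
points, so the highest code point is U+10FFFF, which fits in 21 bits. That
is why a 4-byte UTF-8 sequence (or a 24-bit internal field) already covers
everything Unicode currently defines.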
On Sat, 7 Aug 2004, John Hansen wrote:

> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund
On Sat, 7 Aug 2004, Tatsuo Ishii wrote:

> More seriously, Unicode is filled with tons of confusion and inconsistency,
> IMO. Remember that Unicode advocates once said that the merit of Unicode
> was that it only requires 16-bit width. Now they say they need surrogate
> pairs and 32-bit-wide chars...
>
> Anyway, my point is: if the current specification of Unicode only allows a
> 24-bit range, why do we need to allow usage outside the specification?

Whatever problems they have had in the past, ISO 10646 formally defines a
31-bit character set.

Are you saying that applications should reject strings containing characters
they do not recognize?

Is there a specific reason you want to restrict it to 24 bits? In practice it
does not matter much, since those values are not used today; I just don't
know why you want it.

--
/Dennis Björklund
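The private use areas Dennis mentions are fixed ranges in the Unicode
standard: U+E000..U+F8FF in the Basic Multilingual Plane, plus all of planes
15 and 16. A trivial check, shown here only as an illustration (the helper is
hypothetical, not part of PostgreSQL):

    /*
     * Is the code point in one of Unicode's private use areas?
     * U+E000..U+F8FF (BMP PUA), U+F0000..U+FFFFD (plane 15),
     * U+100000..U+10FFFD (plane 16).
     */
    static int
    is_private_use(unsigned int cp)
    {
        return (cp >= 0xE000   && cp <= 0xF8FF) ||
               (cp >= 0xF0000  && cp <= 0xFFFFD) ||
               (cp >= 0x100000 && cp <= 0x10FFFD);
    }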
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
>> Anyway, my point is: if the current specification of Unicode only allows
>> a 24-bit range, why do we need to allow usage outside the specification?

> Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the maximum
encoded character length possible in the current encoding (look for uses of
pg_database_encoding_max_length). Probably the only one that's really
significant for performance is text_substr(), but that's enough to be an
argument against setting maxmblen higher than we have to.

It looks to me like supporting 4-byte UTF-8 characters would be enough to
handle the existing range of Unicode codepoints, and that is probably as much
as we want to do.

If I understood what I was reading, this would take several things:

* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change? Would this break
anything? The testing aspect is what's bothering me at the moment.

			regards, tom lane
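The 4-byte case Tom lists for pg_utf2wchar_with_len is mechanical, since
pg_wchar is already 32 bits wide: the lead byte contributes 3 payload bits
and each of the three continuation bytes contributes 6, for 21 bits total. A
minimal sketch of the decode step, assuming valid input and simplified names
(this is not the actual PostgreSQL source):

    typedef unsigned int pg_wchar;      /* 32-bit internal character */

    /* Decode a valid 4-byte UTF-8 sequence into a code point. */
    static pg_wchar
    utf8_decode_4byte(const unsigned char *s)
    {
        return ((pg_wchar) (s[0] & 0x07) << 18) |
               ((pg_wchar) (s[1] & 0x3f) << 12) |
               ((pg_wchar) (s[2] & 0x3f) << 6)  |
               ((pg_wchar) (s[3] & 0x3f));
    }

For example, the bytes F0 9D 84 9E decode to U+1D11E (MUSICAL SYMBOL G CLEF),
a code point that needs exactly this 4-byte form.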
Tom Lane wrote:

> If I understood what I was reading, this would take several things:
> * Remove the "special UTF-8 check" in pg_verifymbstr;
> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
>
> Are there any other places that would have to change? Would this break
> anything? The testing aspect is what's bothering me at the moment.

Does this change what client_encoding = UNICODE might produce? The JDBC
driver will need some tweaking to handle this -- Java uses UTF-16 internally
and, I think, a supplementary-character scheme for values above 0xffff as of
JDK 1.5.

-O
Oliver Jowett <oliver@opencloud.com> writes:
> Does this change what client_encoding = UNICODE might produce? The JDBC
> driver will need some tweaking to handle this -- Java uses UTF-16
> internally and, I think, a supplementary-character scheme for values
> above 0xffff as of JDK 1.5.

You're not likely to get out anything you didn't put in, so I'm not sure it
matters.

			regards, tom lane
> Tom Lane wrote:
>
> > If I understood what I was reading, this would take several things:
> > * Remove the "special UTF-8 check" in pg_verifymbstr;
> > * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
> > * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
> >
> > Are there any other places that would have to change? Would this break
> > anything? The testing aspect is what's bothering me at the moment.
>
> Does this change what client_encoding = UNICODE might produce? The JDBC
> driver will need some tweaking to handle this -- Java uses UTF-16
> internally and, I think, a supplementary-character scheme for values
> above 0xffff as of JDK 1.5.

Java doesn't handle UCS above 0xffff? I didn't know that. As long as you put
data in and out via JDBC, it shouldn't be a problem. However, if other APIs
put in such data, you will get into trouble...
--
Tatsuo Ishii
Tatsuo Ishii wrote:

>> Tom Lane wrote:
>>
>>> If I understood what I was reading, this would take several things:
>>> * Remove the "special UTF-8 check" in pg_verifymbstr;
>>> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
>>> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
>>>
>>> Are there any other places that would have to change? Would this break
>>> anything? The testing aspect is what's bothering me at the moment.
>>
>> Does this change what client_encoding = UNICODE might produce? The JDBC
>> driver will need some tweaking to handle this -- Java uses UTF-16
>> internally and, I think, a supplementary-character scheme for values
>> above 0xffff as of JDK 1.5.
>
> Java doesn't handle UCS above 0xffff? I didn't know that. As long as you
> put data in and out via JDBC, it shouldn't be a problem. However, if other
> APIs put in such data, you will get into trouble...

Internally, Java strings are arrays of UTF-16 values. Before JDK 1.5, all the
string-manipulation library routines assumed that one code point == one
UTF-16 value, so you can't represent values above 0xffff. The 1.5 libraries
understand supplementary characters, i.e. multiple UTF-16 values per code
point. See http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

However, the JDBC driver needs to be taught how to translate between UTF-8
representations of code points above 0xffff and pairs of UTF-16 values.
Previously it didn't need to do anything, since the server didn't use those
high values. It's a minor thing.

-O
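For what it's worth, the translation the driver needs is the standard UTF-16
surrogate mapping: subtract 0x10000 from the code point, put the top 10 bits
in a high surrogate (0xD800..0xDBFF) and the bottom 10 bits in a low
surrogate (0xDC00..0xDFFF). A sketch of the arithmetic (shown in C for
brevity; the driver itself would of course do this in Java):

    /* Split a code point above 0xFFFF into a UTF-16 surrogate pair. */
    static void
    to_surrogate_pair(unsigned int cp, unsigned short *hi, unsigned short *lo)
    {
        cp -= 0x10000;                   /* now a 20-bit value      */
        *hi = 0xD800 | (cp >> 10);       /* high surrogate, 10 bits */
        *lo = 0xDC00 | (cp & 0x3FF);     /* low surrogate, 10 bits  */
    }

U+1D11E, for example, becomes the pair 0xD834 0xDD1E.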