Thread: Re: [HACKERS] UNICODE characters above 0x10000
My apologies for not reading the code properly. Attached is a patch using pg_utf_mblen() instead of an indexed table. It now also does bounds checks.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:
> Attached, as promised, small patch removing the limitation, adding
> correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

            regards, tom lane
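For readers following along, here is a minimal standalone sketch of the bounds-check idea being discussed (this is not the attached patch; utf8_seq_len() below is a made-up helper standing in for pg_utf_mblen()):

/*
 * Sketch only: never read continuation bytes past the end of the
 * input buffer, even when the first byte claims a longer sequence.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Declared length of a UTF-8 sequence, judged from its first byte. */
static int
utf8_seq_len(unsigned char first)
{
    if (first < 0x80)
        return 1;
    if ((first & 0xE0) == 0xC0)
        return 2;
    if ((first & 0xF0) == 0xE0)
        return 3;
    if ((first & 0xF8) == 0xF0)
        return 4;
    return -1;              /* invalid start byte, for this sketch */
}

/* True if every sequence fits inside the buffer and is well formed. */
static bool
utf8_check_bounds(const unsigned char *s, size_t len)
{
    const unsigned char *end = s + len;

    while (s < end)
    {
        int seq = utf8_seq_len(*s);

        if (seq < 0 || s + seq > end)   /* the bounds check itself */
            return false;
        for (int i = 1; i < seq; i++)
            if ((s[i] & 0xC0) != 0x80)  /* continuation bytes: 10xxxxxx */
                return false;
        s += seq;
    }
    return true;
}

int
main(void)
{
    /* a 3-byte sequence cut off after 2 bytes must be rejected */
    const unsigned char truncated[] = {0xE2, 0x82};

    printf("%s\n", utf8_check_bounds(truncated, sizeof(truncated))
           ? "ok" : "rejected");
    return 0;
}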
"John Hansen" <john@geeknet.com.au> writes: > My apologies for not reading the code properly. > Attached patch using pg_utf_mblen() instead of an indexed table. > It now also do bounds checks. I think you missed my point. If we don't need this limitation, the correct patch is simply to delete the whole check (ie, delete lines 827-836 of wchar.c, and for that matter we'd then not need the encoding local variable). What's really at stake here is whether anything else breaks if we do that. What else, if anything, assumes that UTF characters are not more than 2 bytes? Now it's entirely possible that the underlying support is a few bricks shy of a load --- for instance I see that pg_utf_mblen thinks there are no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not an expert on this stuff, so I don't know what the UTF8 spec actually says. But I do think you are fixing the code at the wrong level. regards, tom lane
On Sat, 7 Aug 2004, Tom Lane wrote:

> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

I can give some general info about utf-8. This is how it is encoded:

   character range       encoding
   -------------------   ---------
   00000000 - 0000007F:  0xxxxxxx
   00000080 - 000007FF:  110xxxxx 10xxxxxx
   00000800 - 0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
   00010000 - 001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   00200000 - 03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   04000000 - 7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1, then the number of leading ones gives
the length of the utf-8 sequence, and the rest of the bytes in the
sequence always start with 10 (this makes it possible to look anywhere in
the string and quickly find the start of a character).  This also means
that the start byte can never begin with 7 or 8 ones; that is illegal and
should be tested for and rejected.

So the longest utf-8 sequence is 6 bytes (and the longest character needs
4 bytes (or 31 bits)).

-- 
/Dennis Björklund
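A minimal sketch of the table above, for illustration only (this is not PostgreSQL's pg_utf_mblen): the number of leading one bits in the first byte gives the sequence length, and 0xFE/0xFF (7 or 8 leading ones) are rejected as illegal.

#include <stdio.h>

static int
utf8_len_from_first_byte(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    if ((b & 0xFC) == 0xF8) return 5;   /* 111110xx */
    if ((b & 0xFE) == 0xFC) return 6;   /* 1111110x */
    return -1;      /* 10xxxxxx continuation byte, or illegal 0xFE/0xFF */
}

int
main(void)
{
    /* first bytes of 1..6 byte sequences, then two values to reject */
    unsigned char probes[] = {0x41, 0xC3, 0xE2, 0xF0, 0xF8, 0xFC, 0xFE, 0x80};

    for (size_t i = 0; i < sizeof(probes); i++)
        printf("0x%02X -> %d\n", probes[i],
               utf8_len_from_first_byte(probes[i]));
    return 0;
}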
Oliver Elphick <olly@lfix.co.uk> writes:
> glibc provides various routines (mb...) for handling Unicode.  How many
> of our supported platforms don't have these?

Every one that doesn't use glibc.  Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).

            regards, tom lane
On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many
of our supported platforms don't have these?  If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

-- 
Oliver Elphick                                olly@lfix.co.uk
Isle of Wight                                 http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
     ========================================
"Be still before the LORD and wait patiently for him; do not fret when
 men succeed in their ways, when they carry out their wicked schemes."
                                                          Psalms 37:7
Dennis Bjorklund <db@zigo.dhs.org> writes:
> ... This also means that the start byte can never start with 7 or 8
> ones, that is illegal and should be tested for and rejected.  So the
> longest utf-8 sequence is 6 bytes (and the longest character needs 4
> bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width).  I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width.  The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

            regards, tom lane
On Sat, 7 Aug 2004, Tom Lane wrote:

> question at hand is whether we can support 32-bit characters or not ---
> and if not, what's the next bug to fix?

True, and that's hard to just give an answer to.  One could do some
simple testing, make sure regexps work, and then treat anything else
that might not work as bugs to be fixed later on when found.  The
alternative is to inspect all code paths that involve strings, not fun
at all :-)

My previous mail talked about utf-8 translation.  Not all characters
that can be formed using utf-8 are assigned by the unicode org.
However, the part that interprets the unicode strings is in the os, so
different os'es can give different results.  So I think pg should just
accept even 6-byte utf-8 sequences, even if some characters are not
currently assigned.

-- 
/Dennis Björklund
> Dennis Bjorklund <db@zigo.dhs.org> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones, that is illegal and should be tested for and rejected.  So the
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4
> > bytes (or 31 bits)).
>
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide internal
> characters (ie, 16-bit pg_wchar datatype width).  I believe that the
> regex library limitation here is gone, and that as far as that library
> is concerned we could assume a 32-bit internal character width.  The
> question at hand is whether we can support 32-bit characters or not ---
> and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype.  However, I doubt there's
actually a need for 32-bit-wide character sets.  Even Unicode only uses
up to 0x0010FFFF, so 24 bits should be enough...
--
Tatsuo Ishii
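To illustrate the point (this is a sketch, not PostgreSQL code; my_wchar below is a hypothetical stand-in for pg_wchar): the highest Unicode code point, 0x10FFFF, needs only 21 bits, so a 32-bit internal character type holds it with room to spare.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t my_wchar;      /* hypothetical stand-in for pg_wchar */

/* Decode one 4-byte UTF-8 sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
static my_wchar
decode_utf8_4(const unsigned char *s)
{
    return ((my_wchar) (s[0] & 0x07) << 18) |
           ((my_wchar) (s[1] & 0x3F) << 12) |
           ((my_wchar) (s[2] & 0x3F) << 6) |
           ((my_wchar) (s[3] & 0x3F));
}

static bool
in_unicode_range(my_wchar c)
{
    return c <= 0x10FFFF;       /* highest code point Unicode assigns */
}

int
main(void)
{
    /* U+10000, the first code point beyond the BMP, encoded in UTF-8 */
    const unsigned char s[] = {0xF0, 0x90, 0x80, 0x80};
    my_wchar    c = decode_utf8_4(s);

    printf("U+%04X, in range: %s\n", (unsigned) c,
           in_unicode_range(c) ? "yes" : "no");
    return 0;
}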
> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

Surely there are UTF-8 codes longer than 3 bytes.  I have a _vague_
recollection that you have to keep escaping and escaping to get up to
4 bytes for some Asian code points?

Chris
On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
> Oliver Elphick <olly@lfix.co.uk> writes:
> > glibc provides various routines (mb...) for handling Unicode.  How many
> > of our supported platforms don't have these?
>
> Every one that doesn't use glibc.  Don't bother proposing a glibc-only
> solution (and that's from someone who works for a glibc-only company;
> you don't even want to think about the push-back you'll get from other
> quarters).

No, that's not what I was proposing.  My suggestion was to use these
routines if they are sufficiently widely implemented, and our own
routines where standard ones are not available.  The man page for mblen
says "CONFORMING TO ISO/ANSI C, UNIX98".  Is glibc really the only C
library to conform?

If using the mb... routines isn't feasible, IBM's ICU library
(http://oss.software.ibm.com/icu/) is available under the X licence,
which is compatible with BSD as far as I can see.  Besides character
conversion, ICU can also do collation in various locales and encodings.

My point is, we shouldn't be writing a new set of routines to do half a
job if there are already libraries available to do all of it.

-- 
Oliver Elphick                                olly@lfix.co.uk
Isle of Wight                                 http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
     ========================================
"Be still before the LORD and wait patiently for him; do not fret when
 men succeed in their ways, when they carry out their wicked schemes."
                                                          Psalms 37:7
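A sketch of what this suggestion might look like using only the ISO C routines mblen()/MB_CUR_MAX (not a proposed patch; whether the locale name below exists is platform-dependent):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const char *s = "A\xC3\xA9\xE2\x82\xAC";    /* "A", "é", "€" in UTF-8 */

    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)  /* assumed locale name */
    {
        fprintf(stderr, "UTF-8 locale not available here\n");
        return 1;
    }

    mblen(NULL, 0);             /* reset conversion state, per ISO C */
    for (const char *p = s; *p != '\0';)
    {
        int n = mblen(p, MB_CUR_MAX);   /* library reports the length */

        if (n <= 0)
        {
            fprintf(stderr, "invalid multibyte sequence\n");
            return 1;
        }
        printf("character of %d byte(s)\n", n);
        p += n;
    }
    return 0;
}

On a system with that locale this prints lengths 1, 2, and 3 for the three characters, which is the kind of validation work the thread is otherwise reimplementing by hand.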