Thread: UNICODE characters above 0x10000
I've started work on a patch for this problem. Doing regression tests at present. I'll get back when done. Regards, John
Attached, as promised, is a small patch removing the limitation and adding correct utf8 validation. Regards, John
"John Hansen" <john@geeknet.com.au> writes: > Attached, as promised, small patch removing the limitation, adding > correct utf8 validation. Surely this is badly broken --- it will happily access data outside the bounds of the given string. Also, doesn't pg_mblen already know the length rules for UTF8? Why are you duplicating that knowledge? regards, tom lane
My apologies for not reading the code properly. Attached is a patch using pg_utf_mblen() instead of an indexed table. It now also does bounds checks. Regards, John Hansen
"John Hansen" <john@geeknet.com.au> writes: > My apologies for not reading the code properly. > Attached patch using pg_utf_mblen() instead of an indexed table. > It now also do bounds checks. I think you missed my point. If we don't need this limitation, the correct patch is simply to delete the whole check (ie, delete lines 827-836 of wchar.c, and for that matter we'd then not need the encoding local variable). What's really at stake here is whether anything else breaks if we do that. What else, if anything, assumes that UTF characters are not more than 2 bytes? Now it's entirely possible that the underlying support is a few bricks shy of a load --- for instance I see that pg_utf_mblen thinks there are no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not an expert on this stuff, so I don't know what the UTF8 spec actually says. But I do think you are fixing the code at the wrong level. regards, tom lane
Possibly, since I got it wrong once more. I'm about to give up, but attached is an updated patch. Regards, John Hansen -----Original Message----- From: Oliver Elphick [mailto:olly@lfix.co.uk] Subject: Re: [HACKERS] UNICODE characters above 0x10000 UTF-8 characters can be up to 6 bytes long: http://www.cl.cam.ac.uk/~mgk25/unicode.html
On Sat, 7 Aug 2004, Tom Lane wrote: > shy of a load --- for instance I see that pg_utf_mblen thinks there are > no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not > an expert on this stuff, so I don't know what the UTF8 spec actually > says. But I do think you are fixing the code at the wrong level. I can give some general info about utf-8. This is how it is encoded:

character            encoding
-------------------  ---------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1, then the number of ones gives the length of the utf-8 sequence, and the rest of the bytes in the sequence always start with 10 (this makes it possible to look anywhere in the string and quickly find the start of a character). This also means that the start byte can never begin with 7 or 8 ones; that is illegal and should be tested for and rejected. So the longest utf-8 sequence is 6 bytes (and the longest character needs 4 bytes, or 31 bits). -- /Dennis Björklund
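The rule above translates directly into C. As a sketch only (the helper name is illustrative, not anything from wchar.c):

/*
 * Sketch: sequence length implied by a UTF-8 lead byte, per the table
 * above.  Returns 0 for a byte that can never start a sequence
 * (continuation bytes 10xxxxxx, and the illegal bytes 0xFE/0xFF).
 */
static int
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;           /* 0xxxxxxx */
    if (lead < 0xC0)
        return 0;           /* 10xxxxxx: continuation, not a start byte */
    if (lead < 0xE0)
        return 2;           /* 110xxxxx */
    if (lead < 0xF0)
        return 3;           /* 1110xxxx */
    if (lead < 0xF8)
        return 4;           /* 11110xxx */
    if (lead < 0xFC)
        return 5;           /* 111110xx */
    if (lead < 0xFE)
        return 6;           /* 1111110x */
    return 0;               /* 0xFE and 0xFF never appear in UTF-8 */
}

This is the same rule pg_utf_mblen() is meant to apply, just extended past the 3-byte forms it currently handles.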
Oliver Elphick <olly@lfix.co.uk> writes: > glibc provides various routines (mb...) for handling Unicode. How many > of our supported platforms don't have these? Every one that doesn't use glibc. Don't bother proposing a glibc-only solution (and that's from someone who works for a glibc-only company; you don't even want to think about the push-back you'll get from other quarters). regards, tom lane
On Sat, 2004-08-07 at 06:06, Tom Lane wrote: > Now it's entirely possible that the underlying support is a few bricks > shy of a load --- for instance I see that pg_utf_mblen thinks there are > no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not > an expert on this stuff, so I don't know what the UTF8 spec actually > says. But I do think you are fixing the code at the wrong level. UTF-8 characters can be up to 6 bytes long: http://www.cl.cam.ac.uk/~mgk25/unicode.html glibc provides various routines (mb...) for handling Unicode. How many of our supported platforms don't have these? If there are still some that don't, wouldn't it be better to use the standard routines where they do exist? -- Oliver Elphick olly@lfix.co.uk Isle of Wight http://www.lfix.co.uk/oliver GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA ======================================== "Be still before the LORD and wait patiently for him; do not fret when men succeed in their ways, when they carry out their wicked schemes." Psalms 37:7
Ahh, but that's not the case. You cannot just delete the check, since not all combinations of bytes are valid UTF8. The bytes FE and FF never appear in a UTF8 byte sequence, for instance. UTF8 is more than two bytes wide, btw; up to 6 bytes are used to represent a UTF8 character, though the 5- and 6-byte forms are currently not in use. I didn't actually notice the difference in UTF8 width between my original patch and my last, so attached is an updated patch. Regards, John Hansen
Dennis Bjorklund <db@zigo.dhs.org> writes: > ... This also means that the start byte can never start with 7 or 8 > ones, that is illegal and should be tested for and rejected. So the > longest utf-8 sequence is 6 bytes (and the longest character needs 4 > bytes (or 31 bits)). Tatsuo would know more about this than me, but it looks from here like our coding was originally designed to support only 16-bit-wide internal characters (ie, 16-bit pg_wchar datatype width). I believe that the regex library limitation here is gone, and that as far as that library is concerned we could assume a 32-bit internal character width. The question at hand is whether we can support 32-bit characters or not --- and if not, what's the next bug to fix? regards, tom lane
On Sat, 7 Aug 2004, Tom Lane wrote: > question at hand is whether we can support 32-bit characters or not --- > and if not, what's the next bug to fix? True, and that's hard to just give an answer to. One could do some simple testing, make sure regexps work, and then treat anything else that might not work as bugs to be fixed later on when found. The alternative is to inspect all code paths that involve strings, which is not fun at all :-) My previous mail talked about utf-8 translation. Not all characters that can be formed using utf-8 are assigned by the Unicode consortium. However, the part that interprets the Unicode strings is in the OS, so different OSes can give different results. So I think pg should just accept even 6-byte utf-8 sequences, even if some characters are not currently assigned. -- /Dennis Björklund
This should do it. Regards, John Hansen
> Dennis Bjorklund <db@zigo.dhs.org> writes: > > ... This also means that the start byte can never start with 7 or 8 > > ones, that is illegal and should be tested for and rejected. So the > > longest utf-8 sequence is 6 bytes (and the longest character needs 4 > > bytes (or 31 bits)). > > Tatsuo would know more about this than me, but it looks from here like > our coding was originally designed to support only 16-bit-wide internal > characters (ie, 16-bit pg_wchar datatype width). I believe that the > regex library limitation here is gone, and that as far as that library > is concerned we could assume a 32-bit internal character width. The > question at hand is whether we can support 32-bit characters or not --- > and if not, what's the next bug to fix? pg_wchar is already a 32-bit datatype. However, I doubt there's actually a need for 32-bit-wide character sets. Even Unicode only uses up to 0x0010FFFF, so 24 bits should be enough... -- Tatsuo Ishii
> Now it's entirely possible that the underlying support is a few bricks > shy of a load --- for instance I see that pg_utf_mblen thinks there are > no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not > an expert on this stuff, so I don't know what the UTF8 spec actually > says. But I do think you are fixing the code at the wrong level. Surely there are UTF-8 codes that are at least 3 bytes. I have a _vague_ recollection that you have to keep escaping and escaping to get up to like 4 bytes for some Asian code points? Chris
4 actually; 0x10FFFF needs four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x10FFFF = 00010000 11111111 11111111
Fill in the blanks, starting from the bottom, and you get:
11110100 10001111 10111111 10111111
Regards, John Hansen
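The bit-packing above can be double-checked with a small stand-alone program; this is just an illustration of the layout, not part of any patch:

#include <stdio.h>

int
main(void)
{
    unsigned int  cp = 0x10FFFF;    /* highest Unicode code point */
    unsigned char out[4];

    /* 4-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);

    /* prints F4 8F BF BF, i.e. 11110100 10001111 10111111 10111111 */
    printf("%02X %02X %02X %02X\n", out[0], out[1], out[2], out[3]);
    return 0;
}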
On Sat, 2004-08-07 at 07:10, Tom Lane wrote: > Oliver Elphick <olly@lfix.co.uk> writes: > > glibc provides various routines (mb...) for handling Unicode. How many > > of our supported platforms don't have these? > > Every one that doesn't use glibc. Don't bother proposing a glibc-only > solution (and that's from someone who works for a glibc-only company; > you don't even want to think about the push-back you'll get from other > quarters). No, that's not what I was proposing. My suggestion was to use these routines if they are sufficiently widely implemented, and our own routines where standard ones are not available. The man page for mblen says "CONFORMING TO ISO/ANSI C, UNIX98" Is glibc really the only C library to conform? If using the mb... routines isn't feasible, IBM's ICU library (http://oss.software.ibm.com/icu/) is available under the X licence, which is compatible with BSD as far as I can see. Besides character conversion, ICU can also do collation in various locales and encodings. My point is, we shouldn't be writing a new set of routines to do half a job if there are already libraries available to do all of it. -- Oliver Elphick olly@lfix.co.uk
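For reference, a minimal example of the standard mblen() interface being discussed; the locale name here is an assumption, and whether such a locale is installed at all is precisely the portability question:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const char *s = "\xE2\x82\xAC";         /* U+20AC EURO SIGN in UTF-8 */

    /* assumes this UTF-8 locale is installed; not true on every platform */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return 1;

    mblen(NULL, 0);                         /* reset any shift state */
    printf("%d\n", mblen(s, 3));            /* prints 3 in a UTF-8 locale */
    return 0;
}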
> My point is, we shouldn't be writing a new set of routines to > do half a job if there are already libraries available to do > all of it. This sounds like a brilliant move, if anything. Kind Regards, John Hansen
"John Hansen" <john@geeknet.com.au> writes: > Ahh, but that's not the case. You cannot just delete the check, since > not all combinations of bytes are valid UTF8. UTF bytes FE & FF never > appear in a byte sequence for instance. Well, this is still working at the wrong level. The code that's in pg_verifymbstr is mainly intended to enforce the *system wide* assumption that multibyte characters must have the high bit set in every byte. (We do not support encodings without this property in the backend, because it breaks code that looks for ASCII characters ... such as the main parser/lexer ...) It's not really intended to check that the multibyte character is actually legal in its encoding. The "special UTF-8 check" was never more than a very quick-n-dirty hack that was in the wrong place to start with. We ought to be getting rid of it not institutionalizing it. If you want an exact encoding-specific check on the legitimacy of a multibyte sequence, I think the right way to do it is to add another function pointer to pg_wchar_table entries to let each encoding have its own check routine. Perhaps this could be defined so as to avoid a separate call to pg_mblen inside the loop, and thereby not add any new overhead. I'm thinking about an API something like int validate_mbchar(const unsigned char *str, int len) with result +N if a valid character N bytes long is present at *str, and -N if an invalid character is present at *str and it would be appropriate to display N bytes in the complaint. (N must be <= len in either case.) This would reduce the main loop of pg_verifymbstr to a call of this function and an error-case-handling block. regards, tom lane
> Well, this is still working at the wrong level. The code > that's in pg_verifymbstr is mainly intended to enforce the > *system wide* assumption that multibyte characters must have > the high bit set in every byte. (We do not support encodings > without this property in the backend, because it breaks code > that looks for ASCII characters ... such as the main > parser/lexer ...) It's not really intended to check that the > multibyte character is actually legal in its encoding. > Ok, point taken. > The "special UTF-8 check" was never more than a very > quick-n-dirty hack that was in the wrong place to start with. > We ought to be getting rid of it not institutionalizing it. > If you want an exact encoding-specific check on the > legitimacy of a multibyte sequence, I think the right way to > do it is to add another function pointer to pg_wchar_table > entries to let each encoding have its own check routine. > Perhaps this could be defined so as to avoid a separate call > to pg_mblen inside the loop, and thereby not add any new > overhead. I'm thinking about an API something like > > int validate_mbchar(const unsigned char *str, int len) > > with result +N if a valid character N bytes long is present > at *str, and -N if an invalid character is present at *str > and it would be appropriate to display N bytes in the complaint. > (N must be <= len in either case.) This would reduce the > main loop of pg_verifymbstr to a call of this function and an > error-case-handling block. > Sounds like a plan... > regards, tom lane > > Regards, John Hansen
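For illustration, a sketch of what such an encoding-specific routine might look like for UTF-8, assuming pg_utf_mblen() has first been taught the longer sequence lengths; the function name is hypothetical and this is not a proposed patch:

/*
 * Sketch only: a UTF-8 check with the signature proposed above.  Returns
 * +N for a valid N-byte character at *str, or -N if the first N bytes
 * should be reported as invalid (N <= len in either case).
 */
static int
pg_utf8_validchar(const unsigned char *str, int len)
{
    int     seq_len = pg_utf_mblen(str);    /* length implied by lead byte */
    int     i;

    if (seq_len > len)
        return -len;                        /* truncated at end of string */

    for (i = 1; i < seq_len; i++)
    {
        if ((str[i] & 0xC0) != 0x80)        /* continuations must be 10xxxxxx */
            return -(i + 1);                /* show lead byte through bad byte */
    }

    return seq_len;                         /* valid seq_len-byte character */
}

A real version would also have to reject 0xFE/0xFF and bare continuation bytes used as lead bytes, which is exactly the kind of per-encoding knowledge the proposed function-pointer approach would localize.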