Thread: Unicode problems on IRC
Hey guys,

The 'Unicode characters above 0x10000' issue keeps rearing its ugly head
in the IRC channel. I propose that it be fixed, even backported...

This is John Hansen's most recent patch to fix it:

http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php

And from what I can tell it was committed, then reverted because it
wasn't a "bug". It was going to go in for 8.1.

We on the channel are starting to think that it is in fact a bug. There
are people with legitimately UTF-8 encoded XML documents that they
cannot store in PostgreSQL. Apparently in the distant past, Unicode was
limited to 0x10000 code points, but it has since been extended.

Perhaps we can reopen this case...

Chris
Christopher Kings-Lynne wrote:
> Hey guys,
>
> The 'Unicode characters above 0x10000' issue keeps rearing its ugly head
> in the IRC channel. I propose that it be fixed, even backported...
>
> This is John Hansen's most recent patch to fix it:
>
> http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
>
> And from what I can tell it was committed, then reverted because it
> wasn't a "bug". It was going to go in for 8.1.
>
> We on the channel are starting to think that it is in fact a bug. There
> are people with legitimately UTF-8 encoded XML documents that they
> cannot store in PostgreSQL. Apparently in the distant past, Unicode was
> limited to 0x10000 code points, but it has since been extended.
>
> Perhaps we can reopen this case...

Uh, I thought we fixed this another way, by not using Unicode-aware
functions for upper/lower/initcap when the locale is "C" or "POSIX".
That is backpatched to 8.0.X. Does that not fix the problem reported?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
On 2005-04-09, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> Uh, I thought we fixed this another way, by not using Unicode-aware
> functions for upper/lower/initcap when the locale is "C" or "POSIX".
> That is backpatched to 8.0.X. Does that not fix the problem reported?

Unicode values over 0xFFFF are simply not accepted on input, so no, it
doesn't fix the problem. What do upper/lower/initcap have to do with it?

textin() unconditionally calls pg_verifymbstr, which in turn explicitly
checks for such values (if the encoding is UTF8) and throws ERROR if it
finds them.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
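For illustration, here is a minimal sketch of the behaviour Andrew describes
(this is not the actual pg_verifymbstr source, only the observable effect):
any UTF-8 lead byte of 0xF0 or above begins a 4-byte or longer sequence,
i.e. a code point at or above 0x10000, so a verifier that caps Unicode at
0xFFFF rejects the whole string as soon as it sees one.

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Sketch only: refuse any string containing a UTF-8 lead byte that
     * would begin a 4-, 5- or 6-byte sequence (code point >= 0x10000).
     * The real backend check lives in pg_verifymbstr and reports an ERROR.
     */
    bool
    accepts_under_0x10000(const unsigned char *s, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++)
        {
            if (s[i] >= 0xF0)   /* lead byte of a sequence above 0xFFFF */
                return false;
        }
        return true;
    }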
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian
> Sent: Sunday, April 10, 2005 8:18 AM
> To: Christopher Kings-Lynne
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Unicode problems on IRC
>
> Christopher Kings-Lynne wrote:
> > Hey guys,
> >
> > The 'Unicode characters above 0x10000' issue keeps rearing its ugly
> > head in the IRC channel. I propose that it be fixed, even backported...
> >
> > This is John Hansen's most recent patch to fix it:
> >
> > http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
> >
> > And from what I can tell it was committed, then reverted because it
> > wasn't a "bug". It was going to go in for 8.1.
> >
> > We on the channel are starting to think that it is in fact a bug.
> > There are people with legitimately UTF-8 encoded XML documents
> > that they cannot store in PostgreSQL. Apparently in the distant past,
> > Unicode was limited to 0x10000 code points, but it has since been
> > extended.
> >
> > Perhaps we can reopen this case...
>
> Uh, I thought we fixed this another way, by not using
> Unicode-aware functions for upper/lower/initcap when the
> locale is "C" or "POSIX".
> That is backpatched to 8.0.X. Does that not fix the problem reported?

No. As Andrew said, what this patch does is allow values > 0xFFFF and
at the same time validate the input to make sure it's valid UTF-8.

... John
"John Hansen" <john@geeknet.com.au> writes: >> That is backpatched to 8.0.X. Does that not fix the problem reported? > No, as andrew said, what this patch does, is allow values > 0xffff and > at the same time validates the input to make sure it's valid utf8. The impression I get is that most of the 'Unicode characters above 0x10000' reports we've seen did not come from people who actually needed more-than-16-bit Unicode codepoints, but from people who had screwed up their encoding settings and were trying to tell the backend that Latin1 was Unicode or some such. So I'm a bit worried that extending the backend support to full 32-bit Unicode will do more to mask encoding mistakes than it will do to create needed functionality. Not that I'm against adding the functionality. I'm just doubtful that the reports we've seen really indicate that we need it, or that adding it will cut down on the incidence of complaints :-( regards, tom lane
Tom Lane wrote:
> "John Hansen" <john@geeknet.com.au> writes:
> >> That is backpatched to 8.0.X. Does that not fix the problem reported?
>
> > No. As Andrew said, what this patch does is allow values > 0xFFFF and
> > at the same time validate the input to make sure it's valid UTF-8.
>
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such. So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.

Yes, that was my impression too. The upper/lower/initcap issue was that
some operating systems were testing Unicode values even if the locale
was set to C. That is fixed in 8.0.2, but I now see this is a different
problem.

> Not that I'm against adding the functionality. I'm just doubtful that
> the reports we've seen really indicate that we need it, or that adding
> it will cut down on the incidence of complaints :-(

Yea, that was my question too. I figured Japanese or Chinese users would
be using these longer values, and if they are fine, why are others
having problems? It would be great to find a test case that actually
shows the failure.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
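One concrete test case, as a sketch only (the character is picked purely as
an example): U+10400, DESERET CAPITAL LETTER LONG I, is a legal assigned
character above 0xFFFF. Its UTF-8 encoding is the four-byte sequence
F0 90 90 80, which the current verifier refuses even though the input is
valid UTF-8. The small program below just decodes the bytes by hand to
confirm the code point.

    #include <stdio.h>

    /*
     * U+10400 (DESERET CAPITAL LETTER LONG I) encodes in UTF-8 as the
     * four bytes F0 90 90 80.  A 4-byte sequence carries
     * 3 + 6 + 6 + 6 = 21 payload bits.
     */
    int main(void)
    {
        const unsigned char s[] = {0xF0, 0x90, 0x90, 0x80};

        unsigned int cp = ((s[0] & 0x07u) << 18) |
                          ((s[1] & 0x3Fu) << 12) |
                          ((s[2] & 0x3Fu) << 6)  |
                           (s[3] & 0x3Fu);

        printf("code point = U+%04X\n", cp);   /* prints U+10400 */
        return 0;
    }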
On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such. So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.

I think you will find that this impression is actually false. Or that at
the very least, _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-UTF-8 input mislabelled as UTF-8
while allowing the full range of Unicode codepoints. (The current check
will report the "characters above 0x10000" error even on input which is
blatantly not UTF-8 at all.)

One of UTF-8's nicest properties is that other encodings are almost
never also valid UTF-8. I did some tests on this myself some years ago,
feeding hundreds of thousands of short non-UTF-8 strings (taken from
Usenet subject lines in non-English-speaking hierarchies) into a UTF-8
decoder. The false accept rate was on the order of 0.01%, and going back
and re-checking my old data, _none_ of the incorrectly detected
sequences would have been interpreted as characters over 0xFFFF.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
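To make the structural point concrete, here is a rough sketch (not the
proposed patch, and deliberately ignoring overlong forms, surrogates and
the 0x10FFFF ceiling) of the check that trips up mislabelled Latin-1:
every byte >= 0x80 must be a lead byte followed by the right number of
0x80-0xBF continuation bytes, which random Latin-1 text almost never
satisfies.

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Structural UTF-8 check only: lead-byte classes and continuation
     * bytes.  Overlong forms, surrogates and the 0x10FFFF ceiling are
     * intentionally not handled here.
     */
    bool
    looks_like_utf8(const unsigned char *s, size_t len)
    {
        size_t i = 0;

        while (i < len)
        {
            unsigned char c = s[i++];
            int follow;

            if (c < 0x80)
                follow = 0;         /* plain ASCII */
            else if (c < 0xC0)
                return false;       /* stray continuation byte */
            else if (c < 0xE0)
                follow = 1;         /* 2-byte sequence */
            else if (c < 0xF0)
                follow = 2;         /* 3-byte sequence */
            else if (c < 0xF8)
                follow = 3;         /* 4-byte sequence */
            else
                return false;       /* 0xF8-0xFF: not a valid lead byte */

            while (follow-- > 0)
            {
                if (i >= len || (s[i] & 0xC0) != 0x80)
                    return false;
                i++;
            }
        }
        return true;
    }

For example, the Latin-1 bytes 63 61 66 E9 ("cafe" with an accented e) fail
this check because the 0xE9 would need two continuation bytes after it.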
Andrew - Supernews <andrew+nonews@supernews.com> writes:
> On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The impression I get is that most of the 'Unicode characters above
>> 0x10000' reports we've seen did not come from people who actually needed
>> more-than-16-bit Unicode codepoints, but from people who had screwed up
>> their encoding settings and were trying to tell the backend that Latin1
>> was Unicode or some such.

> I think you will find that this impression is actually false. Or that at
> the very least, _correct_ verification of UTF-8 sequences will still
> catch essentially all cases of non-UTF-8 input mislabelled as UTF-8
> while allowing the full range of Unicode codepoints.

Yeah? Cool. Does John's proposed patch do it "correctly"?

http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

			regards, tom lane
> On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>>> I think you will find that this impression is actually false. Or that at
>>> the very least, _correct_ verification of UTF-8 sequences will still
>>> catch essentially all cases of non-UTF-8 input mislabelled as UTF-8
>>> while allowing the full range of Unicode codepoints.
>>
>> Yeah? Cool. Does John's proposed patch do it "correctly"?
>>
>> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
>
> It looks correct to me. The only thing I think that code will let through
> incorrectly are encoded surrogates; those could be fixed by adding one
> line:
>
>      switch (*source) {
>          /* no fall-through in this inner switch */
>          case 0xE0: if (a < 0xA0) return false; break;
> +        case 0xED: if (a > 0x9F) return false; break;
>          case 0xF0: if (a < 0x90) return false; break;
>          case 0xF4: if (a > 0x8F) return false; break;
>
> (Accepting encoded surrogates in UTF-8 was always forbidden by most
> specifications that used UTF-8, though the Unicode specs originally were
> not absolute about it (but forbade generating them). Current Unicode
> specifications define those sequences as malformed. Surrogates are the
> code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
> characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
> such characters are encoded directly rather than via surrogate pairs.)

That's right, dunno how I missed that one. It looks correct to me, and is
in line with the code in ConvertUTF.c from unicode.org, on which I based
the patch, extended to support 6-byte UTF-8 characters.

... John
On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>> I think you will find that this impression is actually false. Or that at
>> the very least, _correct_ verification of UTF-8 sequences will still
>> catch essentially all cases of non-UTF-8 input mislabelled as UTF-8
>> while allowing the full range of Unicode codepoints.
>
> Yeah? Cool. Does John's proposed patch do it "correctly"?
>
> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

It looks correct to me. The only thing I think that code will let through
incorrectly are encoded surrogates; those could be fixed by adding one
line:

     switch (*source) {
         /* no fall-through in this inner switch */
         case 0xE0: if (a < 0xA0) return false; break;
+        case 0xED: if (a > 0x9F) return false; break;
         case 0xF0: if (a < 0x90) return false; break;
         case 0xF4: if (a > 0x8F) return false; break;

(Accepting encoded surrogates in UTF-8 was always forbidden by most
specifications that used UTF-8, though the Unicode specs originally were
not absolute about it (but forbade generating them). Current Unicode
specifications define those sequences as malformed. Surrogates are the
code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
such characters are encoded directly rather than via surrogate pairs.)

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
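To put that one-line fix in context, here is a small sketch of the
second-byte restrictions from the Unicode well-formed-UTF-8 table, which is
where the extra 0xED case belongs. The variable names mirror the roles they
play in the excerpt above ('lead' is the first byte of a multibyte sequence,
'a' the byte that follows it); this is not the patch itself.

    #include <stdbool.h>

    /*
     * Second-byte restrictions per the Unicode well-formed UTF-8 table.
     * Returns true if 'a' is an acceptable byte immediately after 'lead'.
     */
    bool
    second_byte_ok(unsigned char lead, unsigned char a)
    {
        if ((a & 0xC0) != 0x80)
            return false;                  /* not a continuation byte */

        switch (lead)
        {
            case 0xE0: return a >= 0xA0;   /* rejects overlong 3-byte forms  */
            case 0xED: return a <= 0x9F;   /* rejects encoded surrogates     */
            case 0xF0: return a >= 0x90;   /* rejects overlong 4-byte forms  */
            case 0xF4: return a <= 0x8F;   /* rejects code points > 0x10FFFF */
            default:   return true;        /* other leads: 0x80-0xBF is fine */
        }
    }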
Tom Lane wrote:
> Yeah? Cool. Does John's proposed patch do it "correctly"?
>
> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

Some comments on that patch:

Doesn't pg_utf2wchar_with_len need changes for the longer sequences?
UtfToLocal also appears to need changes.

If we support sequences >4 bytes (>U+10FFFF), then UtfToLocal/LocalToUtf
and the associated translation tables need a redesign as they currently
assume the sequence fits in an unsigned int. (IIRC, Unicode doesn't use
>U+10FFFF, but UTF-8 can encode it?)

-O
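A brief illustration of the representation constraint mentioned above, under
the assumption (taken from that comment, not verified against the source)
that the conversion tables key on the raw UTF-8 bytes packed into an
unsigned int. The helper name below is hypothetical, not a PostgreSQL
function: four bytes is the most that fits in 32 bits, so 5- and 6-byte
sequences would need a wider representation.

    /*
     * Hypothetical helper: pack a UTF-8 byte sequence into a 32-bit
     * unsigned int, the representation the translation tables are said
     * to assume.  Only sequences of up to 4 bytes fit.
     */
    unsigned int
    pack_utf8(const unsigned char *s, int len)   /* len must be 1..4 */
    {
        unsigned int v = 0;
        int i;

        for (i = 0; i < len; i++)
            v = (v << 8) | s[i];
        return v;               /* e.g. F0 90 90 80 packs to 0xF0909080 */
    }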
On 2005-04-10, "John Hansen" <john@geeknet.com.au> wrote:
> That's right, dunno how I missed that one. It looks correct to me, and
> is in line with the code in ConvertUTF.c from unicode.org, on which I
> based the patch, extended to support 6-byte UTF-8 characters.

Frankly, you should probably de-extend it back down to 4 bytes. That's
enough to encode the Unicode range of 0x000000 - 0x10FFFF, and enough
other stuff would break if anyone allocated a character outside that
range that I don't think it is worth worrying about. (Even the ISO
people have agreed to conform to that limitation.) Even if insanity
struck simultaneously at both standards bodies, 4 bytes is enough to go
to 0x1FFFFF, so there is still substantial slack.

(A number of other specifications based on UTF-8 have removed the 5- and
6-byte sequences too, so there is substantial precedent for this.)

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
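As a quick check of the arithmetic behind that claim (a sketch, nothing
more): an n-byte UTF-8 sequence carries (7 - n) + 6 * (n - 1) payload bits,
so a 4-byte sequence carries 3 + 6 + 6 + 6 = 21 bits, giving a ceiling of
0x1FFFFF, comfortably above the Unicode limit of 0x10FFFF.

    #include <stdio.h>

    /*
     * Payload bits per UTF-8 sequence length: (7 - n) bits in the lead
     * byte plus 6 bits in each of the (n - 1) continuation bytes.
     */
    int main(void)
    {
        int n;

        for (n = 2; n <= 4; n++)
        {
            int bits = (7 - n) + 6 * (n - 1);
            printf("%d-byte sequence: %2d bits, max U+%X\n",
                   n, bits, (1u << bits) - 1);
        }
        /* prints:
         *   2-byte sequence: 11 bits, max U+7FF
         *   3-byte sequence: 16 bits, max U+FFFF
         *   4-byte sequence: 21 bits, max U+1FFFFF
         */
        return 0;
    }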