Thread: Unicode problems on IRC

From:
Christopher Kings-Lynne
Date:

Hey guys,

The 'Unicode characters above 0x10000' issue keeps rearing its ugly head 
in the IRC channel.  I propose that it be fixed, even backported...

This is John Hansen's most recent patch to fix it:

http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php

And from what I can tell it was committed, then reverted because it 
wasn't a "bug".  It was going to go in for 8.1.

We on the channel are starting to think that it is in fact a bug.  There 
are are people with legitimately utf-8 encoded XML documents that they 
cannot store in PostgreSQL.  Apparently in the distant past, Unicode was 
limited to 0x10000, but then was extended.

Perhaps we can reopen this case...

Chris


From:
Bruce Momjian
Date:

Christopher Kings-Lynne wrote:
> Hey guys,
> 
> The 'Unicode characters above 0x10000' issue keeps rearing its ugly head 
> in the IRC channel.  I propose that it be fixed, even backported...
> 
> This is John Hansen's most recent patch to fix it:
> 
> http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
> 
> And from what I can tell it was committed, then reverted because it 
> wasn't a "bug".  It was going to go in for 8.1.
> 
> We on the channel are starting to think that it is in fact a bug.  There 
> are are people with legitimately utf-8 encoded XML documents that they 
> cannot store in PostgreSQL.  Apparently in the distant past, Unicode was 
> limited to 0x10000, but then was extended.
> 
> Perhaps we can reopen this case...

Uh, I thought we fixed this another way, buy not using Unicode-aware
functions for upper/lower/initcap when the locale is "C" or "POSIX". 
That is backpatched to 8.0.X.  Does that not fix the problem reported?

--  Bruce Momjian                        |  http://candle.pha.pa.us                |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


From:
Andrew - Supernews
Date:

On 2005-04-09, Bruce Momjian <> wrote:
> Uh, I thought we fixed this another way, buy not using Unicode-aware
> functions for upper/lower/initcap when the locale is "C" or "POSIX". 
> That is backpatched to 8.0.X.  Does that not fix the problem reported?

Unicode values over 0xFFFF are simply not accepted on input, so no, it
doesn't fix the problem. What do upper/lower/initcap have to do with it?

textin() unconditionally calls pg_verifymbstr, which in turn explicitly
checks for such values (if the encoding is UTF8) and throws ERROR if it
finds them.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From:
"John Hansen"
Date:


> -----Original Message-----
> From: 
> [mailto:] On Behalf Of Bruce Momjian
> Sent: Sunday, April 10, 2005 8:18 AM
> To: Christopher Kings-Lynne
> Cc: 
> Subject: Re: [HACKERS] Unicode problems on IRC
>
> Christopher Kings-Lynne wrote:
> > Hey guys,
> >
> > The 'Unicode characters above 0x10000' issue keeps rearing its ugly
> > head in the IRC channel.  I propose that it be fixed, even
> backported...
> >
> > This is John Hansen's most recent patch to fix it:
> >
> > http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
> >
> > And from what I can tell it was committed, then reverted because it
> > wasn't a "bug".  It was going to go in for 8.1.
> >
> > We on the channel are starting to think that it is in fact a bug.
> > There are are people with legitimately utf-8 encoded XML documents
> > that they cannot store in PostgreSQL.  Apparently in the
> distant past,
> > Unicode was limited to 0x10000, but then was extended.
> >
> > Perhaps we can reopen this case...
>
> Uh, I thought we fixed this another way, buy not using
> Unicode-aware functions for upper/lower/initcap when the
> locale is "C" or "POSIX".
> That is backpatched to 8.0.X.  Does that not fix the problem reported?

No, as andrew said, what this patch does, is allow values > 0xffff and
at the same time validates the input to make sure it's valid utf8.


... John
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>                  |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square,
> Pennsylvania 19073
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index
> scan if your
>       joining column's datatypes do not match
>
>


From:
Tom Lane
Date:

"John Hansen" <> writes:
>> That is backpatched to 8.0.X.  Does that not fix the problem reported?

> No, as andrew said, what this patch does, is allow values > 0xffff and
> at the same time validates the input to make sure it's valid utf8.

The impression I get is that most of the 'Unicode characters above
0x10000' reports we've seen did not come from people who actually needed
more-than-16-bit Unicode codepoints, but from people who had screwed up
their encoding settings and were trying to tell the backend that Latin1
was Unicode or some such.  So I'm a bit worried that extending the
backend support to full 32-bit Unicode will do more to mask encoding
mistakes than it will do to create needed functionality.

Not that I'm against adding the functionality.  I'm just doubtful that
the reports we've seen really indicate that we need it, or that adding
it will cut down on the incidence of complaints :-(
        regards, tom lane


From:
Bruce Momjian
Date:

Tom Lane wrote:
> "John Hansen" <> writes:
> >> That is backpatched to 8.0.X.  Does that not fix the problem reported?
> 
> > No, as andrew said, what this patch does, is allow values > 0xffff and
> > at the same time validates the input to make sure it's valid utf8.
> 
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such.  So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.

Yes, that was my impression too.

The upper/lower/initcap issue was that some operating systems were
testing unicode values even if the local was set to C.  That is fixed in
8.0.2, but I now see this is a different problem.

> Not that I'm against adding the functionality.  I'm just doubtful that
> the reports we've seen really indicate that we need it, or that adding
> it will cut down on the incidence of complaints :-(

Yea, that was my question too.  I figured Japan or Chinese would be
using these longer values, and if they are fine, why are others having
problems.  It would be great to find a test case that actually shows the
failure.

--  Bruce Momjian                        |  http://candle.pha.pa.us                |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


From:
Andrew - Supernews
Date:

On 2005-04-10, Tom Lane <> wrote:
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such.  So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.

I think you will find that this impression is actually false. Or that at
the very least, _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-utf-8 input mislabelled as utf-8
while allowing the full range of Unicode codepoints. (The current check
will report the "characters above 0x10000" error even on input which is
blatantly not utf-8 at all.)

One of UTF-8's nicest properties is that other encodings are almost never
also valid utf-8. I did some tests on this myself some years ago, feeding
hundreds of thousands of short non-utf-8 strings (taken from Usenet
subject lines in non-english-speaking hierarchies) into a utf-8 decoder.
The false accept rate was on the order of 0.01%, and going back and
re-checking my old data, _none_ of the incorrectly detected sequences
would have been interpreted as characters over 0xFFFF.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From:
Tom Lane
Date:

Andrew - Supernews <> writes:
> On 2005-04-10, Tom Lane <> wrote:
>> The impression I get is that most of the 'Unicode characters above
>> 0x10000' reports we've seen did not come from people who actually needed
>> more-than-16-bit Unicode codepoints, but from people who had screwed up
>> their encoding settings and were trying to tell the backend that Latin1
>> was Unicode or some such.

> I think you will find that this impression is actually false. Or that at
> the very least, _correct_ verification of UTF-8 sequences will still
> catch essentially all cases of non-utf-8 input mislabelled as utf-8
> while allowing the full range of Unicode codepoints.

Yeah?  Cool.  Does John's proposed patch do it "correctly"?

http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
        regards, tom lane


From:
"John Hansen"
Date:

>On 2005-04-10, Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot )
us> wrote:
>> Andrew - Supernews <andrew+nonews ( at ) supernews ( dot ) com>
writes:
>>> I think you will find that this impression is actually false. Or
that at
>>> the very least, _correct_ verification of UTF-8 sequences will still
>>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>>> while allowing the full range of Unicode codepoints.
>>
>> Yeah?  Cool.  Does John's proposed patch do it "correctly"?
>>
>> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
>
>It looks correct to me. The only thing I think that code will let
through
>incorrectly are encoded surrogates; those could be fixed by adding one
line:
>
>              switch (*source) {
>                      /* no fall-through in this inner switch */
>                      case 0xE0: if (a < 0xA0) return false; break;
>+                     case 0xED: if (a > 0x9F) return false; break;
>                      case 0xF0: if (a < 0x90) return false; break;
>                      case 0xF4: if (a > 0x8F) return false; break;
>

That's right, dono how I missed that one, but looks correct to me, and
is in line with the code in ConvertUTF.c from unicode.org, on which I
based the patch, extended to support 6 byte utf8 characters.

>(Accepting encoded surrogates in utf-8 was always forbidden by most
>specifications that used utf-8, though the Unicode specs originally
were
>not absolute about it (but forbade generating them). Current Unicode
>specifications define those sequences as malformed. Surrogates are the
>code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
>characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
>such characters are encoded directly rather than via surrogate pairs.)
>
>--
>Andrew, Supernews
>http://www.supernews.com - individual and corporate NNTP services

... John


From:
Andrew - Supernews
Date:

On 2005-04-10, Tom Lane <> wrote:
> Andrew - Supernews <> writes:
>> I think you will find that this impression is actually false. Or that at
>> the very least, _correct_ verification of UTF-8 sequences will still
>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>> while allowing the full range of Unicode codepoints.
>
> Yeah?  Cool.  Does John's proposed patch do it "correctly"?
>
> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

It looks correct to me. The only thing I think that code will let through
incorrectly are encoded surrogates; those could be fixed by adding one line:
             switch (*source) {                     /* no fall-through in this inner switch */                     case
0xE0:if (a < 0xA0) return false; break;
 
+                     case 0xED: if (a > 0x9F) return false; break;                     case 0xF0: if (a < 0x90) return
false;break;                     case 0xF4: if (a > 0x8F) return false; break;
 

(Accepting encoded surrogates in utf-8 was always forbidden by most
specifications that used utf-8, though the Unicode specs originally were
not absolute about it (but forbade generating them). Current Unicode
specifications define those sequences as malformed. Surrogates are the
code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
such characters are encoded directly rather than via surrogate pairs.)

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From:
Oliver Jowett
Date:

Tom Lane wrote:

> Yeah?  Cool.  Does John's proposed patch do it "correctly"?
> 
> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

Some comments on that patch:

Doesn't pg_utf2wchar_with_len need changes for the longer sequences?

UtfToLocal also appears to need changes.

If we support sequences >4 bytes (>U+10FFFF), then UtfToLocal/LocalToUtf
and the associated translation tables need a redesign as they currently
assume the sequence fits in an unsigned int. (IIRC, Unicode doesn't use
>U+10FFFF, but UTF-8 can encode it?)

-O


From:
Andrew - Supernews
Date:

On 2005-04-10, "John Hansen" <> wrote:
> That's right, dono how I missed that one, but looks correct to me, and
> is in line with the code in ConvertUTF.c from unicode.org, on which I
> based the patch, extended to support 6 byte utf8 characters.

Frankly, you should probably de-extend it back down to 4 bytes. That's
enough to encode the Unicode range of 0x000000 - 0x10FFFF, and enough
other stuff would break if anyone allocated a character outside that
range that I don't think it it worth worrying about. (Even the ISO
people have agreed to conform to that limitation.) Even if insanity
struck simultaneously at both standards bodies, 4 bytes is enough to
go to 0x1FFFFF so there is still substantial slack. (A number of other
specifications based on utf-8 have removed the 5 and 6 byte sequences
too, so there is substantial precedent for this.)

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services