Thread: Three-byte Unicode characters

From:
Bruce Momjian
Date:

[ This email to hackers from last night got lost so I am remailing.]

Tom Lane wrote:
> "John Hansen" <> writes:
> >> That is backpatched to 8.0.X.  Does that not fix the problem reported?
> 
> > No, as andrew said, what this patch does, is allow values > 0xffff and
> > at the same time validates the input to make sure it's valid utf8.
> 
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such.  So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.
> 
> Not that I'm against adding the functionality.  I'm just doubtful that
> the reports we've seen really indicate that we need it, or that adding
> it will cut down on the incidence of complaints :-(
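
A rough sketch of how the encoding mistake Tom describes could produce such a report. It assumes the check looks only at the bit pattern of the lead byte; the helper name is hypothetical and this is not the backend's actual code. Several common Latin1 letters sit in the 0xF0-0xF7 range, which UTF-8 reserves for the first byte of a four-byte sequence, i.e. a code point of 0x10000 or above:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, or 0 if it cannot start one.

    Classifies by bit pattern only; a strict validator would additionally
    reject 0xC0, 0xC1 and 0xF5-0xF7.
    """
    if lead_byte < 0x80:
        return 1
    if 0xC0 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF7:
        return 4
    return 0  # continuation byte (0x80-0xBF) or invalid lead (0xF8-0xFF)


# "ñ" in Latin1 is the single byte 0xF1.
latin1_bytes = "ñ".encode("latin-1")
implied_length = utf8_sequence_length(latin1_bytes[0])

# Treating the same bytes as UTF-8 fails, because the three
# continuation bytes that 0xF1 promises are missing.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    decoded_ok = False
else:
    decoded_ok = True

print(implied_length, decoded_ok)
```

Under that assumption, Latin1 text sent to a database declared as Unicode looks like it contains characters beyond 16 bits, even though the user never intended any.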

OK, I got on the IRC server and talked to folks who actually understand
this.  They say there are Chinese who are reporting this problem, so I
Googled and found this:
http://www.yale.edu/chinesemac/pages/charset_encoding.html#Unicode

See the paragraph with "Supplementary Ideographic Plane".  You will see
that paragraph says:
The Supplementary Ideographic Plane (SIP) currently contains 42,711
additional characters in "CJK Unified Ideographs Extension B"
(U+20000-2A6D6). The PDF chart for this is available at:
http://www.unicode.org/charts/PDF/U20000.pdf
 

I assume it is that U+20000-2A6D6 range that people are complaining
about.
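
As a quick check of why that range matters, here is a minimal sketch that uses Python's built-in UTF-8 encoder rather than anything from the backend. The helper name is made up for illustration. It shows that the endpoints of the range quoted above do not fit in 16 bits and take four bytes in UTF-8:

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


# First and last code points of CJK Unified Ideographs Extension B,
# as given in the paragraph quoted above.
for cp in (0x20000, 0x2A6D6):
    fits_16_bits = cp <= 0xFFFF
    print(f"U+{cp:X}: {utf8_length(cp)} bytes in UTF-8, fits in 16 bits: {fits_16_bits}")
```

Anything up to U+FFFF needs at most three bytes, so a system limited to 16-bit code points never has to handle the four-byte form.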

So, we do have a bug, and we are probably going to need to fix it in
8.0.X.

I apologize to people who reported this problem and I wasn't attentive
to the seriousness of it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
                                       |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


From:
Peter Eisentraut
Date:

Bruce Momjian wrote:
> So, we do have a bug, and we are probably going to need to fix it in
> 8.0.X.

This has never worked in all the years we have had Unicode 
functionality, so I don't understand why we have to rush to fix it now.  
Certainly, it ought to be fixed, but not in a minor release.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


From:
Tom Lane
Date:

Peter Eisentraut <> writes:
> Bruce Momjian wrote:
>> So, we do have a bug, and we are probably going to need to fix it in
>> 8.0.X.

> This has never worked in all the years we have had Unicode 
> functionality, so I don't understand why we have to rush to fix it now.  
> Certainly, it ought to be fixed, but not in a minor release.

The reasons why we rejected applying John's patch at the tail end
of the 8.0 cycle are still valid: it is a new feature and there
is nontrivial risk of introducing new bugs (more specifically,
exposing bits of the system that aren't prepared for more-than-16-bit
characters).

I'm fine with changing it in the 8.1 cycle, but I think a back-patch
would be folly. 
        regards, tom lane


From:
"Marc G. Fournier"
Date:

On Sun, 10 Apr 2005, Peter Eisentraut wrote:

> Bruce Momjian wrote:
>> So, we do have a bug, and we are probably going to need to fix it in
>> 8.0.X.
>
> This has never worked in all the years we have had Unicode
> functionality, so I don't understand why we have to rush to fix it now.
> Certainly, it ought to be fixed, but not in a minor release.

Agreed ... this is extending an existing feature to include a broader 
charset, not fixing a bug ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email:            Yahoo!: yscrappy              ICQ: 7615664