Re: Unicode problems on IRC - Mailing list pgsql-hackers

From	Andrew - Supernews
Subject	Re: Unicode problems on IRC
Date	April 10, 2005 19:18:08
Msg-id	slrnd5iren.2ilg.andrew+nonews@trinity.supernews.net
In response to	Re: Unicode problems on IRC ("John Hansen" <john@geeknet.com.au>)
List	pgsql-hackers

On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>> I think you will find that this impression is actually false. Or that at
>> the very least, _correct_ verification of UTF-8 sequences will still
>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>> while allowing the full range of Unicode codepoints.
>
> Yeah?  Cool.  Does John's proposed patch do it "correctly"?
>
> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

It looks correct to me. The only things I think that code will let through
incorrectly are encoded surrogates; those could be fixed by adding one line:

             switch (*source) {
                     /* no fall-through in this inner switch */
                     case 0xE0: if (a < 0xA0) return false; break;
+                    case 0xED: if (a > 0x9F) return false; break;
                     case 0xF0: if (a < 0x90) return false; break;
                     case 0xF4: if (a > 0x8F) return false; break;
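
To illustrate what the added line does, here is a minimal standalone sketch of the second-byte check. It is not taken from the patch: the function name second_byte_ok is hypothetical, and it assumes the caller has already established that the lead byte starts a multi-byte sequence (0xC2 - 0xF4).

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch under discussion.
 * Returns true if `a` is an acceptable second byte after lead byte `lead`.
 * Assumes `lead` is already known to be a multi-byte lead (0xC2..0xF4).
 */
static bool
second_byte_ok(unsigned char lead, unsigned char a)
{
    /* Every continuation byte must lie in 0x80..0xBF. */
    if (a < 0x80 || a > 0xBF)
        return false;

    switch (lead)
    {
        case 0xE0: return a >= 0xA0;    /* rejects overlong 3-byte forms */
        case 0xED: return a <= 0x9F;    /* rejects U+D800..U+DFFF (surrogates) */
        case 0xF0: return a >= 0x90;    /* rejects overlong 4-byte forms */
        case 0xF4: return a <= 0x8F;    /* rejects values above U+10FFFF */
        default:   return true;
    }
}
```

With this check, the sequence ED A0 80 (which would decode to 0xD800) is rejected at its second byte, while ED 9F BF (0xD7FF) is still accepted.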

(Most specifications that used UTF-8 always forbade accepting encoded
surrogates. The Unicode specs were originally not absolute about it,
though they did forbid generating them; current Unicode specifications
define those sequences as malformed. Surrogates are the code points
0xD800 - 0xDFFF, which UTF-16 uses to encode characters 0x10000 - 0x10FFFF
as pairs of 16-bit values; UTF-8 requires that such characters be encoded
directly rather than via surrogate pairs.)
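
The relationship between surrogate pairs and direct UTF-8 encoding can be sketched roughly as follows. The function names (is_surrogate, combine_surrogates, utf8_encode) are made up for illustration and do not come from the patch or from PostgreSQL; the encoder simply refuses surrogates and out-of-range values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for code points reserved for UTF-16 surrogates. */
static bool
is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Hypothetical helper: combine a UTF-16 high/low surrogate pair. */
static uint32_t
combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t) (hi - 0xD800) << 10) |
                      (uint32_t) (lo - 0xDC00));
}

/*
 * Hypothetical encoder: writes the UTF-8 form of cp into out and returns
 * its length, or 0 if cp is a surrogate or lies above 0x10FFFF.
 */
static int
utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (is_surrogate(cp) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}
```

For example, the UTF-16 pair D83D DE00 combines to 0x1F600, which this sketch encodes directly as the four bytes F0 9F 98 80 rather than as two three-byte sequences starting with ED.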

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
