Re: Unicode problems on IRC

From: Andrew - Supernews
Subject: Re: Unicode problems on IRC
Date: ,
Msg-id: slrnd5ifv4.2ilg.andrew+nonews@trinity.supernews.net
In response to: Re: Unicode problems on IRC  ("John Hansen")
Responses: Re: Unicode problems on IRC  (Tom Lane)
List: pgsql-hackers

On 2005-04-10, Tom Lane wrote:
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such.  So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.

I think you will find that this impression is actually false, or at the
very least that _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-UTF-8 input mislabelled as UTF-8
while allowing the full range of Unicode codepoints. (The current check
reports the "characters above 0x10000" error even on input that is
blatantly not UTF-8 at all.)
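
To make "correct verification" concrete, here is a minimal sketch in
Python of a strict check that accepts the full codepoint range up to
0x10FFFF but rejects stray lead bytes, bad continuation bytes, overlong
forms and surrogates. It is an illustration only, not the backend's
actual code; the function name utf8_sequence_error is made up for this
example.

```python
def utf8_sequence_error(data: bytes):
    """Return None if data is well-formed UTF-8, else (byte offset, reason)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:              # ASCII, single byte
            i += 1
            continue
        # The lead byte fixes the number of continuation bytes and the
        # smallest codepoint that legitimately needs that many bytes.
        if 0xC2 <= lead <= 0xDF:
            extra, cp, smallest = 1, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            extra, cp, smallest = 2, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            extra, cp, smallest = 3, lead & 0x07, 0x10000
        else:                        # 0x80-0xC1 and 0xF5-0xFF never start a sequence
            return (i, "invalid lead byte")
        if i + extra >= n:
            return (i, "truncated sequence")
        for k in range(1, extra + 1):
            b = data[i + k]
            if b & 0xC0 != 0x80:     # continuation bytes are 10xxxxxx
                return (i + k, "invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        if cp < smallest:
            return (i, "overlong encoding")
        if 0xD800 <= cp <= 0xDFFF:
            return (i, "surrogate codepoint")
        if cp > 0x10FFFF:
            return (i, "codepoint above 0x10FFFF")
        i += extra + 1
    return None
```

With a check of this shape, Latin-1 input such as b"caf\xe9" is
reported as a malformed sequence rather than as a character above
0x10000, while a genuine four-byte character still passes.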

One of UTF-8's nicest properties is that text in other encodings is
almost never also valid UTF-8. I did some tests on this myself some years
ago, feeding hundreds of thousands of short non-UTF-8 strings (taken from
Usenet subject lines in non-English-speaking hierarchies) into a UTF-8
decoder. The false accept rate was on the order of 0.01%, and going back
and re-checking my old data, _none_ of the incorrectly accepted sequences
would have been interpreted as characters over 0xFFFF.
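
For anyone who wants to repeat this kind of measurement, the following
is a rough sketch of the procedure in Python, using the language's
built-in strict UTF-8 decoder. The function name false_accepts and the
handful of sample strings are hypothetical stand-ins; they are not the
original test script or the Usenet data, and the counts they produce
say nothing about real-world rates.

```python
def false_accepts(samples):
    """Count how many non-UTF-8 byte strings a strict UTF-8 decoder accepts.

    Returns (tested, accepted, accepted_with_codepoints_over_0xFFFF).
    """
    tested = accepted = over_bmp = 0
    for raw in samples:
        if raw.isascii():
            continue                 # pure ASCII is valid either way; skip it
        tested += 1
        try:
            text = raw.decode("utf-8")   # strict error handling is the default
        except UnicodeDecodeError:
            continue                 # correctly rejected
        accepted += 1                # false accept: legacy bytes parsed as UTF-8
        if any(ord(ch) > 0xFFFF for ch in text):
            over_bmp += 1
    return tested, accepted, over_bmp


# Hypothetical sample input: short strings encoded in legacy charsets.
SAMPLES = [
    "G\u00fcnther".encode("latin-1"),
    "na\u00efve".encode("latin-1"),
    "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r"),
    "\u00c3\u00a9".encode("latin-1"),    # bytes C3 A9, also valid UTF-8 for U+00E9
    b"plain ascii",
]

tested, accepted, over_bmp = false_accepts(SAMPLES)
print(tested, accepted, over_bmp)
```

Of these samples only the C3 A9 pair gets through, and it decodes to a
codepoint well below 0xFFFF. A false accept needs every high byte to
fall into a valid lead/continuation pattern, which ordinary text in
single-byte charsets rarely does.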

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


