Re: Unicode string literals versus the world - Mailing list pgsql-hackers

From	Andrew Dunstan
Subject	Re: Unicode string literals versus the world
Date	April 16, 2009 15:34:36
Msg-id	49E75003.9050003@dunslane.net
In response to	Re: Unicode string literals versus the world (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers


Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
>> I'd never heard of UTF-16 surrogate pairs before this discussion and
>> hence didn't realise that it's valid to have a surrogate pair in place
>> of a single code point.  The docs say that <D800 DF02> corresponds to
>> U+10302, Python would appear to follow my intuitions in that:
>>
>>   ord(u'\uD800\uDF02')
>>
>> results in an error instead of giving back 66306, as I'd expect.  Is
>> this a bug in Python, my understanding, or something else?
>
> I might be wrong, but I think surrogate pairs are expressly forbidden in
> all representations other than UTF16/UCS2.  We definitely forbid them
> when validating UTF-8 strings --- that's per an RFC recommendation.
> It sounds like Python is doing the same.

You mustn't encode the surrogate, but it's up to us how we allow people 
to designate a given code point.
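
For reference, the arithmetic behind the <D800 DF02> example quoted above can be sketched in a few lines of Python 3. The function name combine_surrogates is made up for this illustration and is not part of any PostgreSQL or Python API; it only shows how a UTF-16 high/low surrogate pair designates a single code point.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it designates.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
print(hex(code_point), code_point)  # 0x10302 66306
```

The pair is only a way of naming U+10302; what gets encoded in UTF-8 is the code point itself, never the two surrogate values.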

Frankly, I think we shouldn't provide for using surrogates at all. I 
would prefer something like \uXXXX for BMP items and \UXXXXXXXX as the 
straight 32-bit designation of a higher code point.
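
A rough sketch of how such a scheme might behave, written in Python 3 purely as an illustration: the function name decode_unicode_escapes and the error messages are assumptions, and this is not how PostgreSQL's lexer is implemented. It accepts \uXXXX and \UXXXXXXXX, rejects any escape that names a surrogate, and for brevity ignores escaped backslashes.

```python
import re

# \u followed by exactly 4 hex digits, or \U followed by exactly 8.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the named characters.

    Hypothetical helper: surrogate values are refused outright instead
    of being paired up.
    """
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        code_point = int(hex_digits, 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate U+%04X is not allowed" % code_point)
        if code_point > 0x10FFFF:
            raise ValueError("U+%X is outside the Unicode range" % code_point)
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


print(decode_unicode_escapes(r"caf\u00E9"))          # BMP character
print(ord(decode_unicode_escapes(r"\U00010302")))    # 66306
```

Under these assumptions, an input such as \uD800\uDF02 raises an error, and the same character has to be written as \U00010302.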

cheers

andrew
