Re: Unicode string literals versus the world - Mailing list pgsql-hackers

From Sam Mason
Subject Re: Unicode string literals versus the world
Msg-id 20090416160808.GO12225@frubble.xen.chris-lamb.co.uk
In response to Re: Unicode string literals versus the world  (Marko Kreen <markokr@gmail.com>)
List pgsql-hackers
On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote:
> Which hints that you can as well enter the pairs directly: \uxx\uxx.
> If I were a language designer, I would not see any reason to disallow it.
> 
> And anyway, at least mono seems to support it:
> 
> using System;
> public class HelloWorld {
>     public static void Main() {
>         Console.WriteLine("<\uD800\uDF02>\n");
>     }
> }
> 
> It will output a single UTF8 character.  I think this should settle it.

I don't have any .NET stuff installed so I can't test this, but C# is
defined to use UTF-16 as its internal representation, so it would make
sense if the above gets treated as a single character internally.
However, if it used any other encoding, the above should be treated as
an error.
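
To make the arithmetic concrete, here is a minimal sketch of how a UTF-16
surrogate pair maps to a single code point.  It is written for Python 3,
and the function name combine_surrogates is made up for illustration; it
is not an API from C#, Python, or PostgreSQL.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the C# example above.
code_point = combine_surrogates(0xD800, 0xDF02)
print(hex(code_point))  # 0x10302
```

The result, U+10302, is the character used in the examples up-thread.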

> The de facto default for Postgres is stdstr=off.  Even if not, E'' strings
> are still better for various things, so it would be good if they also
> acquired unicode-capabilities.

OK, this seems independent of the U&'lit' discussion that started the
thread.  Note that PG already supports UTF8; if you want the character
I've been using in my examples up-thread, you can do:
 SELECT E'\xF0\x90\x8C\x82';

I suspect this only works when server_encoding is set to "utf8", which
can only be done at database creation time.  An alternative would be to
use the convert_from function, e.g.:
 SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8');

I've never had to do this, though, so there may be better options available.
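
As a cross-check on the byte sequence in the two queries above, this
short Python 3 sketch decodes the same four bytes as UTF-8 and confirms
they are a single code point.  It only illustrates the encoding; it says
nothing about how the server itself handles the literal.

```python
# The four bytes written as \xF0\x90\x8C\x82 in the SQL literals above.
utf8_bytes = b"\xF0\x90\x8C\x82"

text = utf8_bytes.decode("utf-8")
print(len(text), hex(ord(text)))  # 1 0x10302

# Round trip: encoding the character again gives the same bytes back.
round_trip = text.encode("utf-8")
```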

> Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
> that is - plain 16 or 32-bit values.  Seems your python is compiled with
> UCS2, not UCS4.

Cool, I didn't know that.  I believe mine is UCS4 as I can do:
 ord(u'\U00010302')

and I get 66306 back rather than an error.
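
A more direct way to tell the two builds apart is sys.maxunicode.  A
small sketch, assuming only the standard library: on a narrow (UCS2)
Python 2 build the value is 0xFFFF, on a wide (UCS4) build it is
0x10FFFF, and from Python 3.3 onwards it is always 0x10FFFF.

```python
import sys

# True on wide (UCS4) builds and on every Python 3.3+ interpreter.
is_wide = sys.maxunicode > 0xFFFF
print(is_wide)

if is_wide:
    # On a narrow build this literal would be stored as two code units
    # and ord() would raise TypeError, as the ord() docs describe.
    value = ord(u"\U00010302")
    print(value)  # 66306
```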

> As I understand, in UCS2 mode it simply takes surrogate
> values as-is.

UCS2 doesn't have surrogate pairs, or at least I believe it's considered
a bug if you don't get an error when you present it with one.

> From ord() docs:
> 
>   If a unicode argument is given and Python was built with UCS2 Unicode,
>   then the character’s code point must be in the range [0..65535]
>   inclusive; otherwise the string length is two, and a TypeError will
>   be raised.
> 
> So only in UCS4 mode it detects surrogates and converts them to internal
> representation.  (Which in Postgres case would be UTF8.)

I think you mean UTF-16 instead of UCS4; but otherwise, yes.

> Or perhaps it is partially UTF16 aware - eg. I/O routines do understand
> UTF16 but low-level string routines do not:
> 
>   print "<%s>" % u'\uD800\uDF02'
> 
> seems to handle it properly.

Yes, I get this as well.  It's all a bit weird, which is why I was
asking if "this a bug in Python, my understanding, or something else".

When I do:
 python <<EOF | hexdump -C
 print u"\uD800\uDF02"
 EOF

to see what it's doing, I get an error which I'm not expecting; hence I
think it's probably my understanding that's at fault.
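
For comparison only, here is a sketch of how a current Python 3
interpreter (3.4 or later is assumed) treats the same escaped pair.  This
is not the Python 2 setup used above, so it does not explain the error I
saw; it just shows that the two escapes stay two separate surrogates
until they are explicitly passed through UTF-16.

```python
# In Python 3 the two escapes are kept as two lone surrogates.
pair = "\uD800\uDF02"
print(len(pair))  # 2

# Lone surrogates are not valid in UTF-8, so a strict encode refuses them.
refused = False
try:
    pair.encode("utf-8")
except UnicodeEncodeError:
    refused = True
print(refused)  # True

# Passing the surrogates through UTF-16 joins them into one code point.
repaired = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(hex(ord(repaired)))  # 0x10302
```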

--  Sam  http://samason.me.uk/

