Re: Unicode string literals versus the world - Mailing list pgsql-hackers

From Marko Kreen
Subject Re: Unicode string literals versus the world
Date
Msg-id e51f66da0904160447m764b0ee9i925b45b00320d084@mail.gmail.com
In response to Re: Unicode string literals versus the world  (Sam Mason <sam@samason.me.uk>)
Responses Re: Unicode string literals versus the world  (Sam Mason <sam@samason.me.uk>)
List pgsql-hackers
On 4/16/09, Sam Mason <sam@samason.me.uk> wrote:
> On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
>  > On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > > Given Martijn's complaint about more-than-16-bit code points, I think
>  > >  the \u proposal is not mature enough to go into 8.4.  We can think
>  > >  about some version of that later, if there's enough interest.
>  >
>  > I think it would be a good idea. Basically we should pick one from
>  > a couple of pre-existing sane schemes.  Here is a quick summary
>  > of Python, Perl and Java:
>  >
>  > Python [1]:
>  >
>  >   \uXXXX         - 16-bit codepoint
>  >   \UXXXXXXXX     - 32-bit codepoint
>  >   \N{char-name}  - Character by name
>
>
> Microsoft have also gone this way in C#; named code points are not
>  supported, however.

And it also handles non-BMP codepoints with the \u escape in the same way:
 http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences

This makes it even more standard.
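
For illustration, here is a rough sketch of how a codepoint above
U+FFFF ends up as two 16-bit \u escapes (a UTF-16 surrogate pair).
The function name to_u_escapes is made up for this example and is
not part of any proposal in this thread.

```python
# Hypothetical helper, for illustration only: spell a Unicode code point
# using nothing but 16-bit \uXXXX escapes, the way UTF-16 would store it.

def to_u_escapes(code_point: int) -> str:
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters by themselves")

    # BMP code points fit into a single 16-bit escape.
    if code_point < 0x10000:
        return "\\u%04X" % code_point

    # Non-BMP: split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return "\\u%04X\\u%04X" % (high, low)


print(to_u_escapes(0x03BB))    # lowercase lambda, inside the BMP
print(to_u_escapes(0x1D11E))   # musical G clef, outside the BMP
```

The lambda stays a single escape, while the G clef (U+1D11E) comes
out as the pair \uD834\uDD1E.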

>  > Perl [2]:
>  >
>  >   \x{XXXX..}     - {} contains hexadecimal codepoint
>  >   \N{char-name}  - Unicode char name
>
>
> Looks OK, but the 'x' seems somewhat redundant.  Why not just:
>
>   \{xxxx}
>
>  This would be following the BitC[2] project, especially if it was more
>  like:
>
>   \{U+xxxx}
>
>  e.g.
>
>   \{U+03BB}
>
>  would be the lowercase lambda character.  Added appeal is in the fact
>  that this (i.e. U+03BB) is how the Unicode consortium spells code
>  points.

We already got yet-another-unique-way-of-escaping-unicode with U&.

Now let's also try to support an actual standard.

>  > Java [3]:
>  >
>  >   \uXXXX         - 16-bit codepoint
>
>
> AFAIK, Java isn't the best reference to choose; it assumed from an early
>  point in its design that Unicode characters were at most 16 bits and
>  hence had to switch its internal representation to UTF-16.  I don't
>  program much Java these days to know how it's all worked out, but it
>  would be interesting to hear from people who regularly have to deal with
>  characters outside the BMP (i.e. code points greater than 65535).

You did not read my mail carefully enough - Java, and also Python and C#,
already support non-BMP chars with '\u', all in exactly the same (UTF-16) way.
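
To make the UTF-16 point concrete, here is a minimal sketch of a
decoder for \uXXXX and \UXXXXXXXX escapes that joins a surrogate pair
into one codepoint.  The names (decode_escapes, _ESCAPE, _replace) are
invented for the example, and it deliberately ignores details a real
lexer would need, such as escaped backslashes.  It is not how any of
the languages above, or PostgreSQL, actually implement this.

```python
import re

# Illustrative only.  Alternatives are tried in order:
#   1. a high surrogate escape directly followed by a low surrogate escape
#   2. a single \uXXXX escape
#   3. a \UXXXXXXXX escape
_ESCAPE = re.compile(
    r"\\u([Dd][89ABab][0-9A-Fa-f]{2})\\u([Dd][C-Fc-f][0-9A-Fa-f]{2})"
    r"|\\u([0-9A-Fa-f]{4})"
    r"|\\U([0-9A-Fa-f]{8})"
)


def _replace(match):
    high, low, short, long_ = match.groups()
    if high is not None:
        # Combine the two 10-bit halves and add back the 0x10000 offset.
        cp = 0x10000 + ((int(high, 16) - 0xD800) << 10) + (int(low, 16) - 0xDC00)
    elif short is not None:
        cp = int(short, 16)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("unpaired surrogate escape")
    else:
        cp = int(long_, 16)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point in \\U escape")
    return chr(cp)


def decode_escapes(text: str) -> str:
    # Replace every recognised escape with the character it denotes.
    return _ESCAPE.sub(_replace, text)


# Both spellings of U+1D11E decode to the same single character.
print(decode_escapes(r"\uD834\uDD1E") == decode_escapes(r"\U0001D11E"))
```

So the 16-bit form loses nothing: the pair and the 32-bit escape
denote the same character, and only a lone surrogate is an error.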

-- 
marko

