On Fri, Jan 20, 2012 at 12:14 PM, David E. Wheeler <david@kineticode.com> wrote:
> On Jan 20, 2012, at 8:58 AM, Robert Haas wrote:
>
>> If, however,
>> we're not using UTF-8, we have to first turn \uXXXX into a Unicode
>> code point, then convert that to a character in the database encoding,
>> and then test for equality with the other character after that. I'm
>> not sure whether that's possible in general, how to do it, or how
>> efficient it is. Can you or anyone shed any light on that topic?
>
> If it’s like the XML example, it should always represent a Unicode code
> point, and *not* be converted to the other character set, no?

Well, you can pick which way you want to do the conversion.  If the
database encoding is SJIS, and there's an SJIS character in a string
that gets passed to json_in(), and another string passed to json_in()
contains \uXXXX, then any sort of canonicalization or equality testing
is going to need to convert the SJIS character to a Unicode code
point, or the Unicode code point to an SJIS character, to see whether
they match.
Err, actually, now that I think about it, that might be a problem:
what happens if we're trying to test two characters for equality and
the encoding conversion fails? We really just want to return false -
the strings are clearly not equal if either contains even one
character that can't be converted to the other encoding - so it's not
good if an error gets thrown in there anywhere.
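
The only way I can see to get return-false-instead-of-error semantics
with the existing machinery would be something like the following
(again just an untested sketch, and trapping the error this way is
heavy-handed and possibly not even safe here - a conversion path that
can report failure without throwing would be much nicer):

    /*
     * Sketch only: attempt the conversion, returning false rather than
     * letting a conversion error propagate.
     */
    static bool
    convert_or_give_up(unsigned char *src, int len,
                       int src_encoding, int dest_encoding,
                       unsigned char **result)
    {
        bool        ok = true;

        PG_TRY();
        {
            *result = pg_do_encoding_conversion(src, len,
                                                src_encoding, dest_encoding);
        }
        PG_CATCH();
        {
            FlushErrorState();
            *result = NULL;
            ok = false;
        }
        PG_END_TRY();

        return ok;
    }
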

> At any rate, since the JSON standard requires UTF-8, such distinctions
> having to do with alternate encodings are not likely to be covered, so
> I suspect we can do whatever we want here. It’s outside the spec.

I agree.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company