On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Second, what should we do when the database encoding isn't UTF8? I'm
> inclined to emit a \unnnn escape for any non-ASCII character (assuming it
> has a Unicode code point - are there any code points in the non-Unicode
> encodings that don't have Unicode equivalents?). The alternative would be to
> fail on non-ASCII characters, which might be ugly. Of course, anyone wanting
> to deal with JSON should be using UTF8 anyway, but we still have to deal
> with these things. What about SQL_ASCII? If there's a non-ASCII sequence
> there we really have no way of telling what it should be. There at least I
> think we should probably error out.
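For what it's worth, emitting the escape itself is mechanically simple,
including the UTF-16 surrogate pairs JSON (RFC 4627) requires for code
points above the BMP. A minimal standalone sketch -- escape_codepoint is
a hypothetical helper here, not anything in the backend:

    #include <stdio.h>

    /* Write one Unicode code point as a JSON \unnnn escape; code points
     * above U+FFFF become a UTF-16 surrogate pair. */
    static void
    escape_codepoint(FILE *out, unsigned int cp)
    {
        if (cp > 0xFFFF)
        {
            cp -= 0x10000;
            fprintf(out, "\\u%04X\\u%04X",
                    0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
        else
            fprintf(out, "\\u%04X", cp);
    }

    int
    main(void)
    {
        escape_codepoint(stdout, 0x00E9);   /* e-acute -> \u00E9 */
        escape_codepoint(stdout, 0x1D11E);  /* G clef -> \uD834\uDD1E */
        putchar('\n');
        return 0;
    }

The hard part is not producing the escape but deciding which code point
a byte sequence in a non-UTF8 server encoding maps to in the first place.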
I don't think there is a satisfying solution to this problem. Things
working against us:
* Some server encodings support characters that don't map to Unicode
characters (e.g. unused slots in Windows-1252). Thus, converting to
UTF-8 and back is lossy in general.
* We want a normalized representation for comparison. This will
involve a mixture of server-encoding and Unicode characters, unless the
encoding is UTF-8.
* We can't efficiently convert individual characters to and from
Unicode with the current API (the conversion routines operate on whole
buffers and typically allocate a fresh result per call).
* What do we do about \u0000? TEXT datums cannot contain NUL characters
(a concrete illustration follows this list).
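To make the \u0000 point concrete: the backend routinely passes text
values through NUL-terminated C strings, so an embedded NUL silently
truncates everything after it. A tiny standalone illustration:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* "ab\u0000cd" decoded naively into a buffer: the embedded NUL
         * hides the tail from C string handling. */
        char decoded[] = {'a', 'b', '\0', 'c', 'd', '\0'};

        printf("%zu\n", strlen(decoded));   /* prints 2, not 5 */
        return 0;
    }

Rejecting \u0000 up front avoids storing a value that would be silently
truncated later.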
I'd say just ban Unicode escapes and non-ASCII characters unless the
server encoding is UTF-8, and ban all \u0000 escapes. It's easy, and
whatever we support later will be a superset of this.
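A standalone approximation of that rule -- the encoding enum and
json_string_allowed are hypothetical stand-ins, not the actual patch,
and escape-context subtleties (e.g. a preceding backslash) are ignored:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for the backend's encoding identifiers. */
    typedef enum { ENC_SQL_ASCII, ENC_LATIN1, ENC_UTF8 } ServerEncoding;

    /* Proposed rule: outside UTF-8 databases, reject \u escapes and
     * any non-ASCII byte; reject \u0000 regardless of encoding. */
    static bool
    json_string_allowed(ServerEncoding enc, const unsigned char *s)
    {
        for (; *s; s++)
        {
            if (*s > 127 && enc != ENC_UTF8)
                return false;   /* non-ASCII in a non-UTF8 database */
            if (s[0] == '\\' && s[1] == 'u')
            {
                if (enc != ENC_UTF8)
                    return false;   /* \u escape in a non-UTF8 database */
                if (strncmp((const char *) s + 2, "0000", 4) == 0)
                    return false;   /* \u0000 banned everywhere */
            }
        }
        return true;
    }

    int
    main(void)
    {
        const unsigned char latin1[] = "caf\xe9";

        printf("%d\n", json_string_allowed(ENC_LATIN1, latin1));   /* 0 */
        printf("%d\n", json_string_allowed(ENC_UTF8,
                   (const unsigned char *) "\\u0000"));            /* 0 */
        printf("%d\n", json_string_allowed(ENC_UTF8,
                   (const unsigned char *) "\\u00e9"));            /* 1 */
        return 0;
    }

Everything accepted under this rule stays valid if the restriction is
relaxed later, which is what makes the superset argument work.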
Strategies for handling this situation have been discussed in prior
emails. This is where things got stuck last time.
- Joey