On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
> > Does support for alternative multi-byte encodings have something to do
> > with the Han unification controversy? I don't know terribly much about
> > this, so apologies if that's just wrong.
>
> There's a famous problem regarding conversion between Unicode and other
> encodings, such as Shift JIS.
>
> There is a lot of discussion on this. Here is one from Microsoft:
>
> http://support.microsoft.com/kb/170559/EN-US
Apart from Shift-JIS not being well defined (it's more a family of
encodings), it has the unusual feature of providing multiple ways to
encode the same character. This is not even a Han unification issue;
those have largely been addressed. For example, the square-root symbol
exists twice (0x8795 and 0x81E3), as do many other mathematical
symbols.

Here's the code page, which you can browse online:
http://msdn.microsoft.com/en-us/goglobal/cc305152
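
To make the duplicate concrete, here is a rough Python sketch. It
assumes Python's "cp932" codec follows the Microsoft code page linked
above; the variable names are mine. It decodes both byte sequences and
then checks which of them survives re-encoding:

```python
# The two byte sequences for the square-root symbol mentioned above.
candidates = [b"\x81\xe3", b"\x87\x95"]

# Decode each one; both are expected to come out as U+221A SQUARE ROOT.
decoded = [raw.decode("cp932") for raw in candidates]

# Re-encode and keep the sequences that do not survive the round trip.
# The encoder can only pick one byte sequence per character, so one of
# the two inputs is necessarily lost.
lossy = [raw for raw in candidates
         if raw.decode("cp932").encode("cp932") != raw]

for raw, ch in zip(candidates, decoded):
    print(raw.hex(), "->", f"U+{ord(ch):04X}", "->", ch.encode("cp932").hex())
```

Whichever sequence the encoder prefers, the other one cannot be
recovered from the Unicode text alone.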
This means that, to be round-trippable, Unicode would have to duplicate
those characters, but that would make it hard or impossible to
round-trip with any other character set that has those characters. No
easy solution here.

Something that has been done before [1] is to map the duplicates to the
Private Use Area of the Unicode space (U+E000-U+F8FF). It gives you
round-trip support at the expense of having to handle those characters
yourself. But since postgres doesn't do anything meaningful with
Unicode characters, this might be acceptable.
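
A minimal sketch of that mapping in Python, only to illustrate the
idea and not anything postgres does. The names PUA_BASE,
decode_roundtrip and encode_roundtrip are hypothetical, only the
square-root pair from above is handled, and the base code point is an
arbitrary assumption (a real table would have to avoid private-use
code points the codec already assigns to user-defined characters):

```python
# Hypothetical base for our own private-use assignments (assumption).
PUA_BASE = 0xF000

# Byte sequences that encode the same character (square root only here).
CANDIDATES = [b"\x81\xe3", b"\x87\x95"]

# Sequences the stock codec does not reproduce when re-encoding.
LOSSY = [raw for raw in CANDIDATES
         if raw.decode("cp932").encode("cp932") != raw]

# Give each lossy sequence its own private-use code point.
TO_PUA = {raw: chr(PUA_BASE + i) for i, raw in enumerate(LOSSY)}
FROM_PUA = {ch: raw for raw, ch in TO_PUA.items()}


def is_lead_byte(b: int) -> bool:
    """True if b starts a two-byte Shift-JIS (cp932) character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def decode_roundtrip(data: bytes) -> str:
    """Decode cp932 bytes, sending lossy duplicates to the private area."""
    out = []
    i = 0
    while i < len(data):
        width = 2 if is_lead_byte(data[i]) else 1
        chunk = data[i:i + width]
        out.append(TO_PUA.get(chunk) or chunk.decode("cp932"))
        i += width
    return "".join(out)


def encode_roundtrip(text: str) -> bytes:
    """Inverse of decode_roundtrip."""
    return b"".join(FROM_PUA.get(ch) or ch.encode("cp932") for ch in text)


data = b"x=\x87\x95y \x81\xe3"
restored = encode_roundtrip(decode_roundtrip(data))
plain = data.decode("cp932").encode("cp932")
```

Here "restored" should equal the input bytes, while "plain" (the stock
codec alone) should not. The cost is exactly the one described above:
the private-use characters mean nothing to anyone else, so sorting,
display and comparison with the real square-root sign are up to you.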
[1] Python does a similar trick (the surrogateescape error handler) to
handle filenames coming from disk in an unknown encoding:
http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding
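
For reference, a short sketch of what Python does there. Note that it
uses lone surrogates (U+DC80-U+DCFF) rather than the Private Use Area,
but the round-trip idea is the same. The file name is a made-up
example:

```python
# Latin-1 encoded name; the 0xE9 byte is not valid UTF-8.
raw = b"caf\xe9.txt"

# Undecodable bytes are smuggled through as lone surrogates.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
back = name.encode("utf-8", errors="surrogateescape")
```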
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer