Re: Proposal - Support for National Characters functionality - Mailing list pgsql-hackers

From Martijn van Oosterhout
Subject Re: Proposal - Support for National Characters functionality
Msg-id 20130716210727.GD28628@svana.org
In response to Re: Proposal - Support for National Characters functionality  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-hackers
On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
> > Does support for alternative multi-byte encodings have something to do
> > with the Han unification controversy? I don't know terribly much about
> > this, so apologies if that's just wrong.
>
> There's a famous problem regarding conversion between Unicode and other
> encodings, such as Shift Jis.
>
> There are lots of discussion on this. Here is the one from Microsoft:
>
> http://support.microsoft.com/kb/170559/EN-US

Apart from Shift-JIS not being well defined (it's more a family of
encodings), it has the unusual feature of providing multiple ways to
encode the same character.  This is not even a Han unification issue;
those have largely been addressed.  For example, the square-root symbol
exists twice (0x8795 and 0x81E3), as do many other mathematical
symbols.

Here's the code page which you can browse online:

http://msdn.microsoft.com/en-us/goglobal/cc305152

This means that, to be round-trippable, Unicode would have to duplicate
those characters, but that would make it hard or impossible to
round-trip with any other character set that has those characters.
No easy solution here.
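
To make the duplicate concrete, here is a small sketch using Python's
built-in "cp932" codec (Microsoft's Shift-JIS variant; choosing that
codec is my assumption, other members of the family may differ).  It
decodes the two byte sequences mentioned above and checks whether each
survives a decode/encode round trip:

```python
# The two byte sequences are the ones quoted in the message above.
NEC_SQRT = b"\x87\x95"   # second encoding of the square-root symbol
JIS_SQRT = b"\x81\xe3"   # first encoding of the square-root symbol

a = NEC_SQRT.decode("cp932")
b = JIS_SQRT.decode("cp932")

# Both byte sequences decode to the same Unicode character.
same_char = (a == b)

# Encoding a single character can only produce one byte sequence,
# so at most one of the two inputs can come back unchanged.
round_trips = [a.encode("cp932") == NEC_SQRT,
               b.encode("cp932") == JIS_SQRT]

print(same_char, round_trips)
```

Whichever form the encoder prefers, the other one is lost once the
data has passed through Unicode.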

Something that has been done before [1] is to map the duplicates to the
Private Use Area of the Unicode space (U+E000-U+F8FF).  It gives you
round-trip support at the expense of having to handle those characters
yourself.  But since postgres doesn't do anything meaningful with
Unicode characters this might be acceptable.
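
A rough sketch of that mapping idea, in Python rather than in the
backend's C, might look like the following.  Everything named here is
hypothetical: the function names, the table, and the choice of U+F000
as the private-use code point are made up for illustration, and only
the one duplicate from above is listed.

```python
# Hypothetical table: duplicate byte sequence -> private-use code point.
# U+F000 is an arbitrary choice made for this illustration only.
PUA_FOR_DUPLICATE = {b"\x87\x95": "\uf000"}
DUPLICATE_FOR_PUA = {v: k for k, v in PUA_FOR_DUPLICATE.items()}


def _is_lead_byte(byte):
    # Lead bytes of two-byte sequences in cp932.
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def decode_roundtrip(data):
    """Decode cp932 bytes, sending listed duplicates to private-use code points."""
    out = []
    i = 0
    while i < len(data):
        width = 2 if _is_lead_byte(data[i]) else 1
        chunk = data[i:i + width]
        out.append(PUA_FOR_DUPLICATE.get(chunk) or chunk.decode("cp932"))
        i += width
    return "".join(out)


def encode_roundtrip(text):
    """Inverse of decode_roundtrip: private-use code points become the duplicates."""
    return b"".join(DUPLICATE_FOR_PUA.get(ch) or ch.encode("cp932")
                    for ch in text)


data = b"\x87\x95abc"
text = decode_roundtrip(data)
restored = encode_roundtrip(text)
```

The cost is visible in "text": the first character is no longer a
square-root symbol as far as any Unicode-aware code is concerned, so
comparison, collation and display all have to special-case it.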

[1] Python does a similar trick to handle filenames coming from disk in
an unknown encoding:
http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding
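
For reference, a minimal example of the Python mechanism in [1].  Note
that it stores undecodable bytes as lone surrogates (U+DC80-U+DCFF)
rather than in the Private Use Area, so it is similar in spirit, not
identical.  The file name is made up.

```python
# A Latin-1 encoded name, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
```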

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer
