Hmm.
> I've noted that in PostgreSQL 7.2.1 some of the utf8 mappings
> of sjis and euc characters were different. One example that caught me out
> was the double width ~.
>
> '〜' (double byte/double width ~)
That's not really a tilde. It's referred to as a "wave dash", and is
usually used as such in most of what I've seen of word-processing/e-mail
type data. (Tilde is a combining character, is it not?)
> euc: 0xa1c1 -> 0xe3809c utf8
That's the Unicode wave dash.
> sjis: 0x8160 -> 0xefbd9e utf8
That's the Unicode full-width tilde.
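For anyone who wants to check those two byte sequences, here is a minimal sketch in Python. It only decodes the UTF-8 bytes quoted above and looks up the Unicode names; it says nothing about how PostgreSQL builds its tables, and the helper name describe is my own.

```python
import unicodedata

# The two UTF-8 byte sequences from the mappings quoted above.
utf8_from_euc = bytes.fromhex("e3809c")
utf8_from_sjis = bytes.fromhex("efbd9e")

def describe(utf8_bytes):
    """Return (code point, Unicode name) for one UTF-8 encoded character."""
    ch = utf8_bytes.decode("utf-8")
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

print(describe(utf8_from_euc))   # ('U+301C', 'WAVE DASH')
print(describe(utf8_from_sjis))  # ('U+FF5E', 'FULLWIDTH TILDE')
```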
Now, if I were going by the names, I would choose the Unicode wave dash
and map both of them to 0xe3809c.
But if I were to go by the intent of the full-width block, I'd go with
the latter, 0xefbd9e, but I'd still be wondering why the Unicode people
called it full-width tilde.
Hmm.
At any rate, mapping euc and sjis to the same code point should be
correct, since both are just numerical transforms of JIS with ASCII
squeezed in.
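To show what I mean by a numerical transform, here is a rough sketch in Python of the usual byte arithmetic for the two-byte kanji area. The function names are made up for illustration, and it ignores half-width katakana and vendor extensions entirely.

```python
def euc_to_jis(e1, e2):
    """EUC sets the high bit on each JIS byte, so clearing it recovers JIS."""
    return e1 & 0x7F, e2 & 0x7F

def jis_to_sjis(j1, j2):
    """Fold two JIS rows into one Shift JIS lead byte (j1, j2 in 0x21..0x7E)."""
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 <= 0x5E else 0xB0)
    if j1 & 1:
        # Odd rows land in the lower half of the trail-byte range,
        # skipping 0x7F.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1
    else:
        # Even rows land in the upper half.
        s2 = j2 + 0x7E
    return s1, s2

# The character in question: euc 0xa1c1 comes out as sjis 0x8160.
j1, j2 = euc_to_jis(0xA1, 0xC1)
s1, s2 = jis_to_sjis(j1, j2)
print(f"jis 0x{j1:02x}{j2:02x} -> sjis 0x{s1:02x}{s2:02x}")
```

Since both encodings point at the same JIS cell, there is no character-set reason for them to end up at different Unicode code points.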
> This caused me problems when a '〜' was loaded using euc and retrieved
> using sjis as there was no sjis mapping for 0xe3809c.
Another hmm. That's probably going to create surprises sometimes. Good
reason to have the source code open.
(Just thinking out loud.)
Anyway, thanks for the heads-up, Tom.
--
Joel Rees <joel@alpsgiken.gr.jp>