Thread: postgresql euc/sjis utf8 mappings

postgresql euc/sjis utf8 mappings

From: Thomas O'Dowd
Date:
Hi all,

I've noticed that in PostgreSQL 7.2.1 the utf8 mappings for some
characters differ between sjis and euc. One example that caught me out
was the double width ~.

'〜' (double byte/double width ~)
euc:  0xa1c1 -> 0xe3809c utf8
sjis: 0x8160 -> 0xefbd9e utf8

This caused me problems when a '〜' was loaded using euc and retrieved
using sjis as there was no sjis mapping for 0xe3809c.
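
To make the failure concrete, here is a minimal Python sketch of the round trip described above. The dictionaries are toy stand-ins that hold only the two entries quoted in this message; they are not the real PostgreSQL map files, and the function names are made up for illustration.

```python
# Toy conversion tables containing only the entries quoted above.
# These are illustrative stand-ins, not the actual PostgreSQL .map files.
EUC_TO_UTF8 = {0xA1C1: 0xE3809C}
SJIS_TO_UTF8 = {0x8160: 0xEFBD9E}


def invert(table):
    """Build the reverse (utf8 -> legacy encoding) table."""
    return {utf8: legacy for legacy, utf8 in table.items()}


def store_from_euc(euc_code):
    """Simulate loading a character sent by an euc client."""
    return EUC_TO_UTF8[euc_code]


def fetch_as_sjis(utf8_code, sjis_to_utf8):
    """Simulate retrieval by an sjis client; None means 'no mapping'."""
    return invert(sjis_to_utf8).get(utf8_code)


stored = store_from_euc(0xA1C1)
# With the 7.2.1-style tables the stored value has no sjis equivalent.
broken = fetch_as_sjis(stored, SJIS_TO_UTF8)

# If both tables agree on one utf8 value, the round trip succeeds.
# Using 0xe3809c here is only for the demo; per the message, cvs uses 0xefbd9e.
agreed = fetch_as_sjis(stored, {0x8160: 0xE3809C})

print(hex(stored), broken, hex(agreed))
```

Which utf8 value both tables share matters less for the round trip than the fact that they share one.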

I checked cvs, and in the new mb mapping files it seems to be fixed: both
euc and sjis now use 0xefbd9e. So I guess 7.3 will fix my problem when it
is released. Any idea when that will be, by the way?

My question is: can I just get the new map files from cvs, i.e.:
x euc_jp_to_utf8.map
x sjis_to_utf8.map
x utf8_to_euc_jp.map
x utf8_to_sjis.map

recompile 7.2.1 with no other changes (i.e. I just want the new mappings),
and expect everything to work? Or do I have to change some other
settings as well, or rerun some generator scripts?

Thanks,

Tom.
--
Thomas O'Dowd. - Nooping - http://nooper.com
tom@nooper.com - Testing - http://nooper.co.jp/labs


Re: postgresql euc/sjis utf8 mappings

From: Joel Rees
Date:
Hmm.

> I've noticed that in PostgreSQL 7.2.1 the utf8 mappings for some
> characters differ between sjis and euc. One example that caught me out
> was the double width ~.
>
> '〜' (double byte/double width ~)

That's not really a tilde. It's referred to as a "wave dash", and that is
how it is used in most of the word-processing and e-mail data I've seen.
(Tilde is a combining character, is it not?)

> euc:  0xa1c1 -> 0xe3809c utf8

That's the Unicode wave dash.

> sjis: 0x8160 -> 0xefbd9e utf8

That's the Unicode full-width tilde.
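
For anyone who wants to verify the two identifications above, a short Python sketch follows. It assumes only Python's standard unicodedata module, which is not part of the original discussion; the helper name describe is made up for illustration.

```python
import unicodedata


def describe(raw):
    """Decode a UTF-8 byte sequence holding one character and
    return its code point label and Unicode character name."""
    ch = raw.decode("utf-8")
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


# The two utf8 values quoted in this thread.
for raw in (b"\xe3\x80\x9c", b"\xef\xbd\x9e"):
    code_point, name = describe(raw)
    print(raw.hex(), code_point, name)

# prints:
# e3809c U+301C WAVE DASH
# efbd9e U+FF5E FULLWIDTH TILDE
```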

Now, if I were going by the names, I would choose the Unicode wave dash
for that mapping, and map both of them to 0xe3809c.

But if I were to go by the intent of the full-width block, I'd go with
the latter, 0xefbd9e, but I'd still be wondering why the Unicode people
called it full-width tilde.

Hmm.

At any rate, mapping euc and s-jis to the same code point should be
correct, since euc and s-jis are both just numerical transforms of JIS
with ASCII squeezed in.
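
The "numerical transform" can be sketched in a few lines of Python. This is the commonly described arithmetic for two-byte JIS X 0208 codes only (no half-width katakana, no vendor extensions), and the function names are made up for illustration. The wave dash sits at JIS 0x2141, which gives exactly the two byte pairs quoted in this thread.

```python
def jis_to_euc(j1, j2):
    """EUC-JP sets the high bit of each 7-bit JIS byte."""
    return j1 | 0x80, j2 | 0x80


def jis_to_sjis(j1, j2):
    """Shift_JIS packs two JIS rows into one lead byte."""
    if j1 % 2:              # odd row: trail bytes start at 0x40
        s2 = j2 + 0x1F
        if s2 >= 0x7F:      # skip 0x7F, which is never a trail byte
            s2 += 1
    else:                   # even row: trail bytes start at 0x9F
        s2 = j2 + 0x7E
    s1 = (j1 + 1) // 2 + 0x70
    if s1 >= 0xA0:          # jump over the single-byte katakana range
        s1 += 0x40
    return s1, s2


def as_hex(pair):
    """Format a (lead, trail) byte pair like 0xa1c1."""
    return "0x%02x%02x" % pair


# Wave dash: JIS X 0208 row 1, cell 33, i.e. bytes 0x21 0x41.
print(as_hex(jis_to_euc(0x21, 0x41)))   # 0xa1c1
print(as_hex(jis_to_sjis(0x21, 0x41)))  # 0x8160
```

Since both byte pairs are derived from the same JIS code, it follows that they denote the same character and would be expected to share one utf8 mapping.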

> This caused me problems when a '〜' was loaded using euc and retrieved
> using sjis as there was no sjis mapping for 0xe3809c.

Another hmm. That's probably going to create surprises sometimes. Good
reason to have the source code open.

(Just thinking out loud.)

Anyway, thanks for the heads-up, Tom.

--
Joel Rees <joel@alpsgiken.gr.jp>