Re: Do we still need MULE_INTERNAL? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Do we still need MULE_INTERNAL?
Date
Msg-id CA+hUKGK4ZvZYNRC_W10dT2W6TYBY24q=B-EfKpUL50v2E3U6_w@mail.gmail.com
In response to Re: Do we still need MULE_INTERNAL?  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Do we still need MULE_INTERNAL?
List pgsql-hackers
On Wed, Feb 11, 2026 at 7:52 PM Tatsuo Ishii <ishii@postgresql.org> wrote:
> Thank you for the report. I find it is quite useful, especially the
> Emacs 23 internal (new to me). I agree that MULE_INTERNAL has
> fulfilled its historic role.

Thanks Ishii-san and Tom.  Here's a patch.  Obviously it mostly just
deletes thousands of lines, but also: I had to preserve the encoding
number, so there's a hole in the table, and I had to think of a new
name for cyrillic_and_mic.c, so I went with cyrillic.c because it
handles 4 single-byte encodings and it wasn't clear how to fit it into
the existing x_and_y pattern (i.e. which two to highlight arbitrarily
in the name).

> > Since there are two encodings for kana characters and MULE's
> > superpower is to switch, I guess it depends how you chose to encode it
> > and what your ratio of kana to kanji is.
>
> The reason for 2 encodings in MULE for "kana" exist is, it's a nature
> of the character sets mule supports. In Japanese there are 2 types of
> "kana", one is "hiragana" and the other is "katakana". JIS X0208/0212
> includes both types of "kana", while JIS X0201 includes only
> "katakana". So why "katakana" appears on those two encodings? Katakana
> in JIS X0201 is often rendered on screen in half width comparing with
> JIS X 0208 and 0212. Some users find this beneficial.

Ah, right, I see.  And judging by Wikipedia's article on half-width
katakana, it sounds like any text that mixes katakana with hiragana
and kanji would probably not use the half-width forms anyway, so
perhaps 3 is a better guess.  In other words, MULE_INTERNAL databases
would probably not get bigger if reloaded as UTF-8.
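
The UTF-8 side of that estimate can be checked with a minimal Python 3
sketch.  The sample characters are illustrative picks, not taken from
the thread, and this measures UTF-8 only, not MULE_INTERNAL:

```python
# Illustrative samples (assumptions, not from the thread): one character
# from each script discussed above.
SAMPLES = {
    "half-width katakana": "\uff71",  # U+FF71 HALFWIDTH KATAKANA LETTER A
    "full-width katakana": "\u30a2",  # U+30A2 KATAKANA LETTER A
    "hiragana": "\u3042",             # U+3042 HIRAGANA LETTER A
    "kanji": "\u6f22",                # U+6F22, a CJK unified ideograph
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes used to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


sizes = {name: utf8_len(ch) for name, ch in SAMPLES.items()}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

All four samples sit between U+0800 and U+FFFF, so each takes the same
number of UTF-8 bytes regardless of display width.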

> > UTF8:                                3     3
>
> I thought some of JIS 2004 kanji are mapped to 4-byte UTF8 character.

Looks like it:

grep 'U+[0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F].*\[200[04]\]' \
./src/backend/utils/mb/Unicode/euc-jis-2004-std.txt

They are in "CJK Unified Ideographs Extension B" for "rare and
historic CJK ideographs", so I guess they wouldn't matter much, but in
any case we're talking about a hypothetical user moving from
MULE_INTERNAL, which *doesn't* have JIS 2004.  I think the older
standards are entirely in the basic plane, so only 1-3-byte UTF-8
sequences.
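
The basic-plane versus Extension B distinction can be sketched as
follows, assuming Python 3.  The helper name utf8_len_for_code_point is
hypothetical, and the sample code points are chosen only for
illustration:

```python
def utf8_len_for_code_point(code_point: int) -> int:
    """Bytes needed to encode one Unicode code point in UTF-8."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # remainder of the Basic Multilingual Plane
    return 4      # supplementary planes, including Extension B


# U+4E00 is the first code point of the basic CJK Unified Ideographs
# block; U+20000 is the first code point of Extension B.
for cp in (0x41, 0x4E00, 0xFFFD, 0x20000):
    # Cross-check the range rule against Python's own encoder.
    assert utf8_len_for_code_point(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_len_for_code_point(cp)} bytes")
```

The five-hex-digit pattern in the grep above corresponds to the final
branch: any code point at or above U+10000 needs a 4-byte sequence.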

. o O ( UTF-16 would probably be the ideal storage for CJK text if we
could do it... )
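
A rough illustration of that size difference, assuming Python 3 and an
arbitrary sample string that is not from the thread:

```python
# Sample text (an assumption for illustration): 8 characters, all in the
# Basic Multilingual Plane.
text = "日本語のテキスト"

utf8_bytes = len(text.encode("utf-8"))
# "utf-16-le" is used so that no byte order mark is counted.
utf16_bytes = len(text.encode("utf-16-le"))

print(f"{len(text)} chars: UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

The trade-off runs the other way for ASCII, which takes 1 byte per
character in UTF-8 and 2 in UTF-16.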
