Re: BUG #1091: Localization in EUC_TW Can't decode Big5 - Mailing list pgsql-bugs

From Tatsuo Ishii
Subject Re: BUG #1091: Localization in EUC_TW Can't decode Big5
Date
Msg-id 20040304.120945.71083122.t-ishii@sra.co.jp
In response to BUG #1091: Localization in EUC_TW Can't decode Big5 0xFA40--0xFEF0.  ("PostgreSQL Bugs List" <pgsql-bugs@postgresql.org>)
List pgsql-bugs
The problem with Big5 is that there's no well-established standard for
it. Here is an excerpt from the famous cjk.txt by Ken Lunde:

----------------------------------------------------------------
2.3.1: BIG FIVE

    The Big Five character set is composed of 94 rows of 157
characters each (the 157 characters of each row are encoded in an
initial group of 63 codes followed by the remaining 94 codes). The
following is a break-down of its contents:

o Row 1: 157 symbols
o Row 2: 157 symbols
o Row 3: 94 symbols
o Rows 4 through 38: 5,401 hanzi (Level 1 Hanzi; last is 38-63)
o Rows 41 through 89: 7,652 hanzi (Level 2 Hanzi; last is 89-116)

This forms what I consider to be the basic Big Five set. Actually, two
of the hanzi in Level 2 are duplicates, so there are actually only
7,650 unique hanzi in Level 2.
    There are two major extensions to Big Five. The first really
has no name, and can be considered part of the basic Big Five set as
specified above. It adds the following characters:

o Rows 38-39: 4 Japanese iteration marks, 83 hiragana, 86 katakana, 66
  uppercase and lowercase Cyrillic (Russian) alphabet, 10 circled
  digits, and 10 parenthesized digits

    The other extension was developed by a company called ETen
Information System in Taiwan, and is actually considered to be the
most widely used version of Big Five. It provides the following
extensions to Big Five (different from the above extension):

o Rows 38-40: 10 circled digits, 10 parenthesized digits, 10 lowercase
  Roman numerals, 25 classical radicals, 15 Japanese-specific symbols,
  83 hiragana, 86 katakana, 66 uppercase and lowercase Cyrillic
  (Russian) alphabet, 3 arrows, 10 radical-like hanzi elements, 40
  fraction-like digits, and 7 symbols
o Row 89: 7 hanzi, 33 double-lined line-drawing elements, and a black
  box

    It is *very* important to note that while these two extensions
have many common portions (in particular, hiragana, katakana, the
Cyrillic alphabet, and so on), they do not share the same code points
for such characters.
----------------------------------------------------------------
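
To make the row/cell layout in the excerpt concrete, here is a minimal
sketch that converts a two-byte Big5 code into its (row, cell) position.
It assumes row 1 corresponds to first byte 0xA1 and that cells are
numbered through the 63 low second-byte codes (0x40-0x7E) first, then the
94 high ones (0xA1-0xFE), using the byte ranges quoted in the bug report
below. The function name big5_row_cell is made up for illustration and is
not part of PostgreSQL.

```python
def big5_row_cell(code: int) -> tuple[int, int]:
    """Map a two-byte Big5 code (e.g. 0xFA40) to a 1-based (row, cell) pair."""
    first, second = code >> 8, code & 0xFF

    # Assumption: 94 rows, one per first-byte value 0xA1..0xFE.
    if not 0xA1 <= first <= 0xFE:
        raise ValueError(f"first byte 0x{first:02X} outside 0xA1-0xFE")
    row = first - 0xA0

    if 0x40 <= second <= 0x7E:
        # Initial group of 63 codes -> cells 1..63.
        cell = second - 0x40 + 1
    elif 0xA1 <= second <= 0xFE:
        # Remaining 94 codes -> cells 64..157.
        cell = second - 0xA1 + 64
    else:
        raise ValueError(f"second byte 0x{second:02X} outside valid ranges")
    return row, cell


if __name__ == "__main__":
    for code in (0xC67E, 0xF9D5, 0xFA40, 0xFEF0):
        print(f"0x{code:04X} -> row/cell {big5_row_cell(code)}")
```

Under these assumptions 0xC67E comes out as 38-63 and 0xF9D5 as 89-116,
matching the "last is" positions given in the excerpt for Level 1 and
Level 2. The reported range 0xFA40--0xFEF0 lands in rows 90 through 94,
which is beyond row 89, the last row the excerpt assigns any characters to.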

If someone is sure there's an existing standard for it, including
mappings between Big5 and EUC-TW and between Big5 and Unicode, and *also*
wishes to provide patches, I would welcome them. Meanwhile, you could
write your own mapping between Big5 and other encodings. See the
documentation for the CREATE CONVERSION command for more details.
--
Tatsuo Ishii

From: "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org>
Subject: [BUGS] BUG #1091: Localization in EUC_TW Can't decode Big5 0xFA40--0xFEF0.
Date: Wed,  3 Mar 2004 22:08:47 -0400 (AST)
Message-ID: <20040304020847.E10A2CF4D3A@www.postgresql.com>

>
> The following bug has been logged online:
>
> Bug reference:      1091
> Logged by:          yychen
>
> Email address:      yychen@mail.clhs.tyc.edu.tw
>
> PostgreSQL version: 7.4
>
> Operating system:   MS-WIN2000(Run With TAIWAN Big5)
>
> Description:        Localization in EUC_TW Can't decode Big5
> 0xFA40--0xFEF0.
>
> Details:
>
> In Localization:
>  DataBase
>  When I save a string (with Big5 0xFA40-0xFEF0) to a database (encoded as
> EUC_TW or UNICODE) and then read it back,
> PostgreSQL can't decode these characters.
> According to:  ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf.
> 3.3.4: BIG FIVE
>
>     Big Five is the encoding system used on machines that support
> MS-DOS or Windows, and also for Macintosh (such as the Chinese
> Language Kit or the fully-localized operating system).
>
>   Two-byte Standard Characters                  Encoding Ranges
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^^
>   first byte range                              0xA1-0xFE
>   second byte ranges                            0x40-0x7E, 0xA1-0xFE
>
>   One-byte Characters                           Encoding Range
>   ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
>   ASCII                                         0x21-0x7E
>
>     The encoding used on Macintosh is quite similar to the above,
> but has a slightly shortened two-byte range (second byte range up to
> 0xFC only) plus additional one-byte code points, namely 0x80
> (backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE
> ("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis"
> symbol: three dots).
>
