Re: BUG #1091: Localization in EUC_TW Can't decode Big5 - Mailing list pgsql-bugs
From | Tatsuo Ishii |
---|---|
Subject | Re: BUG #1091: Localization in EUC_TW Can't decode Big5 |
Date | |
Msg-id | 20040304.120945.71083122.t-ishii@sra.co.jp |
In response to | BUG #1091: Localization in EUC_TW Can't decode Big5 0xFA40--0xFEF0. ("PostgreSQL Bugs List" <pgsql-bugs@postgresql.org>) |
List | pgsql-bugs |
The problem with Big5 is that there's no well-established standard for it. Here is an excerpt from the famous cjk.txt by Ken Lunde:

----------------------------------------------------------------

2.3.1: BIG FIVE

The Big Five character set is composed of 94 rows of 157 characters each (the 157 characters of each row are encoded in an initial group of 63 codes followed by the remaining 94 codes). The following is a break-down of its contents:

  o Row 1: 157 symbols
  o Row 2: 157 symbols
  o Row 3: 94 symbols
  o Rows 4 through 38: 5,401 hanzi (Level 1 Hanzi; last is 38-63)
  o Rows 41 through 89: 7,652 hanzi (Level 2 Hanzi; last is 89-116)

This forms what I consider to be the basic Big Five set. Actually, two of the hanzi in Level 2 are duplicates, so there are actually only 7,650 unique hanzi in Level 2.

There are two major extensions to Big Five. The first really has no name, and can be considered part of the basic Big Five set as specified above. It adds the following characters:

  o Rows 38-39: 4 Japanese iteration marks, 83 hiragana, 86 katakana, 66 uppercase and lowercase Cyrillic (Russian) alphabet, 10 circled digits, and 10 parenthesized digits

The other extension was developed by a company called ETen Information System in Taiwan, and is actually considered to be the most widely used version of Big Five.
It provides the following extensions to Big Five (different from the above extension):

  o Rows 38-40: 10 circled digits, 10 parenthesized digits, 10 lowercase Roman numerals, 25 classical radicals, 15 Japanese-specific symbols, 83 hiragana, 86 katakana, 66 uppercase and lowercase Cyrillic (Russian) alphabet, 3 arrows, 10 radical-like hanzi elements, 40 fraction-like digits, and 7 symbols
  o Row 89: 7 hanzi, 33 double-lined line-drawing elements, and a black box

It is *very* important to note that while these two extensions have many common portions (in particular, hiragana, katakana, the Cyrillic alphabet, and so on), they do not share the same code points for such characters.

----------------------------------------------------------------

If someone is sure there's an existing standard for it, including mappings between Big5 and EUC-TW and between Big5 and Unicode, and *also* wishes to provide patches, I would welcome them.

Meanwhile, you could write your own mapping between Big5 and other encodings. See the CREATE CONVERSION command documentation for more details.

--
Tatsuo Ishii

From: "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org>
Subject: [BUGS] BUG #1091: Localization in EUC_TW Can't decode Big5 0xFA40--0xFEF0.
Date: Wed, 3 Mar 2004 22:08:47 -0400 (AST)
Message-ID: <20040304020847.E10A2CF4D3A@www.postgresql.com>

> The following bug has been logged online:
>
> Bug reference: 1091
> Logged by: yychen
> Email address: yychen@mail.clhs.tyc.edu.tw
> PostgreSQL version: 7.4
> Operating system: MS-WIN2000(Run With TAIWAN Big5)
> Description: Localization in EUC_TW Can't decode Big5 0xFA40--0xFEF0.
>
> Details:
>
> In Localization: DataBase
> When I save a string (with Big5 0xFA40-0xFEF0) to the database (encoding with
> EUC_TW or UNICODE) and then read it,
> PostgreSQL can't decode these.
> According to: ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf.
> 3.3.4: BIG FIVE
>
> Big Five is the encoding system used on machines that support
> MS-DOS or Windows, and also for Macintosh (such as the Chinese
> Language Kit or the fully-localized operating system).
>
>   Two-byte Standard Characters     Encoding Ranges
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^
>   first byte range                 0xA1-0xFE
>   second byte ranges               0x40-0x7E, 0xA1-0xFE
>
>   One-byte Characters              Encoding Range
>   ^^^^^^^^^^^^^^^^^^^              ^^^^^^^^^^^^^^
>   ASCII                            0x21-0x7E
>
> The encoding used on Macintosh is quite similar to the above,
> but has a slightly shortened two-byte range (second byte range up to
> 0xFC only) plus additional one-byte code points, namely 0x80
> (backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE
> ("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis"
> symbol: three dots).
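
For readers who want to relate the row-cell notation in the cjk.txt excerpt (for example "38-63") to the byte ranges quoted in the bug report, here is a minimal Python sketch. It assumes that row N maps to first byte 0xA0 + N and that the 157 cells of a row map to second bytes 0x40-0x7E (cells 1-63) followed by 0xA1-0xFE (cells 64-157), which is what the two quoted passages imply when read together. The function names `row_cell_to_big5` and `big5_to_row_cell` are made up for this illustration and are not part of PostgreSQL.

```python
def row_cell_to_big5(row: int, cell: int) -> int:
    """Map a 1-based (row, cell) pair to a two-byte Big5 code.

    Assumed layout: 94 rows (first byte 0xA1-0xFE), 157 cells per row
    (second byte 0x40-0x7E for cells 1-63, 0xA1-0xFE for cells 64-157).
    """
    if not (1 <= row <= 94 and 1 <= cell <= 157):
        raise ValueError(f"row-cell {row}-{cell} is out of range")
    first = 0xA0 + row
    if cell <= 63:
        second = 0x40 + (cell - 1)   # initial group of 63 codes
    else:
        second = 0xA1 + (cell - 64)  # remaining 94 codes
    return (first << 8) | second


def big5_to_row_cell(code: int) -> tuple[int, int]:
    """Inverse of row_cell_to_big5; rejects bytes outside the quoted ranges."""
    first, second = code >> 8, code & 0xFF
    if not 0xA1 <= first <= 0xFE:
        raise ValueError(f"first byte 0x{first:02X} is out of range")
    if 0x40 <= second <= 0x7E:
        cell = second - 0x40 + 1
    elif 0xA1 <= second <= 0xFE:
        cell = second - 0xA1 + 64
    else:
        raise ValueError(f"second byte 0x{second:02X} is out of range")
    return first - 0xA0, cell


# Last Level 1 and Level 2 hanzi, as given in the excerpt.
print(hex(row_cell_to_big5(38, 63)))    # 0xc67e
print(hex(row_cell_to_big5(89, 116)))   # 0xf9d5

# End points of the range named in the bug report.
print(big5_to_row_cell(0xFA40))         # (90, 1)
print(big5_to_row_cell(0xFEF0))         # (94, 143)
```

Under these assumptions, the reported range 0xFA40-0xFEF0 falls in rows 90 through 94, which none of the row lists in the excerpt (basic set or either extension) cover. That is only what the excerpt implies about the character set; it says nothing by itself about what PostgreSQL's conversion tables contain.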