Re: [PATCHES] Postgres-6.3.2 locale patch - Mailing list pgsql-hackers
From | t-ishii@sra.co.jp |
---|---|
Subject | Re: [PATCHES] Postgres-6.3.2 locale patch |
Date | |
Msg-id | 199806040523.OAA06173@srapc451.sra.co.jp |
In response to | Re: [PATCHES] Postgres-6.3.2 locale patch ("Thomas G. Lockhart" <lockhart@alumni.caltech.edu>) |
Responses | Re: [PATCHES] Postgres-6.3.2 locale patch |
List | pgsql-hackers |
>Hi. I'm looking for non-English-using Postgres hackers to participate in
>implementing NCHAR() and alternate character sets in Postgres. I think
>I've worked out how to do the implementation (not the details, just a
>strategy) so that multiple character sets will be allowed in a single
>database, additional character sets can be loaded at run-time, and so
>that everything will behave transparently.

Sounds like an interesting idea... But before going into the discussion, let me clarify what "character set" means.

A character set consists of some characters. One of the most famous character sets is ISO 646 (almost the same as ASCII). In western Europe, the ISO 8859 series of character sets is widely used. For example, ISO 8859-1 covers English, French, German etc., and ISO 8859-2 covers Albanian, Romanian etc. These are "single byte" character sets, and there is a one-to-many correspondence between a character set and languages.

Example 1:

    ISO 8859-1 <------> English, French, German

On the other hand, some Asian languages such as Japanese, Chinese, and Korean do not correspond to a single character set; rather, each corresponds to multiple character sets.

Example 2:

    ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese

(ASCII, JIS X0208, JIS X0201 and JIS X0212 are individual character sets.)

An "encoding" is a way to represent a set of character sets in computers. The set of character sets above is encoded in the EUC_JP encoding. I think SQL92 uses the term "character set" to mean an encoding.

>So, the initial questions:
>
>1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
>non-English applications? Do other databases use this SQL92 convention,
>or does it have difficulties?

As far as I know, there is no commercial RDBMS that supports the NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple encodings. The encoding for a database is defined when the database is created and cannot be changed at runtime. Clients can use a different encoding as long as it is a "subset" of the database's encoding. For example, an Oracle client can use ASCII if the database encoding is EUC_JP.

I think the idea of defining the "default" encoding for a database at database creation time is nice:

    create database with encoding EUC_JP;

If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user could use an encoding other than EUC_JP. That sounds very nice too.

>2) Would anyone be interested in helping to define the character sets
>and helping to test? I don't know the correct collation sequences and
>don't think they would display properly on my screen...

I would be able to help you with the Japanese part. For Chinese and Korean, I'm going to look for volunteers on the local PostgreSQL mailing list I'm running, if necessary.

>3) I'd like to implement the existing Cyrillic and EUC-jp character
>sets, and also some European languages (French and ??) which use the
>Latin-1 alphabet but might have different collation sequences. Any
>suggestions for candidates??

Collation sequences for EUC_JP? How nice that would be! One problem with collation sequences for multi-byte encodings is that the sequence might become huge. It seems you have a solution for that. Please let me know more details.

--
Tatsuo Ishii
t-ishii@sra.co.jp
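The distinction drawn in the message between character sets and an encoding can be illustrated with a small sketch. It uses Python's built-in `euc_jp` codec, which is my choice of tool and not something the message refers to; the sample characters and the `euc_jp_bytes` helper are likewise only illustrative assumptions. The point is that a single encoding (EUC_JP) carries characters from several character sets, using a different byte layout for each.

```python
# Sketch only: sample characters are illustrative picks, one per
# character set named in Example 2 of the message.
SAMPLES = [
    ("ASCII", "A"),
    ("JIS X 0208", "\u3042"),       # HIRAGANA LETTER A
    ("JIS X 0201 kana", "\uff71"),  # HALFWIDTH KATAKANA LETTER A
    ("JIS X 0212", "\u4e02"),       # a kanji that is not in JIS X 0208
]


def euc_jp_bytes(text):
    """Hypothetical helper: encode text with Python's euc_jp codec."""
    return text.encode("euc_jp")


for name, ch in SAMPLES:
    encoded = euc_jp_bytes(ch)
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    # ASCII stays one byte, JIS X 0208 takes two bytes, JIS X 0201 kana
    # is prefixed with 0x8e, and JIS X 0212 is prefixed with 0x8f.
    print(f"{name:16} U+{ord(ch):04X} -> {len(encoded)} byte(s): {hex_bytes}")
```

Under these assumptions, the varying byte lengths also hint at why a per-character collation table for such an encoding could grow large, as the message notes.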