Thread: Re: [PATCHES] Postgres-6.3.2 locale patch
Hi. I'm looking for non-English-using Postgres hackers to participate in implementing NCHAR() and alternate character sets in Postgres. I think I've worked out how to do the implementation (not the details, just a strategy) so that multiple character sets will be allowed in a single database, additional character sets can be loaded at run-time, and so that everything will behave transparently. I would propose to do this for v6.4 as user-defined packages (with compile-time parser support) on top of the existing USE_LOCALE and MB patches so that the existing compile-time options are not changed or damaged.

So, the initial questions:

1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for non-English applications? Do other databases use this SQL92 convention, or does it have difficulties?

2) Would anyone be interested in helping to define the character sets and helping to test? I don't know the correct collation sequences and don't think they would display properly on my screen...

3) I'd like to implement the existing Cyrillic and EUC-jp character sets, and also some European languages (French and ??) which use the Latin-1 alphabet but might have different collation sequences. Any suggestions for candidates??

- Tom
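For reference, the SQL92 spellings under discussion look roughly like this (a sketch only: the table, the character set name LATIN1, and the collation name FRENCH are placeholders, and none of this parses in the current Postgres):

    CREATE TABLE titles (
        id        INTEGER,
        -- NCHAR/NVARCHAR: the implementation-defined "national" character set
        title_n   NCHAR VARYING(80),
        -- an explicit character set, with an optional collation
        title_fr  CHARACTER VARYING(80) CHARACTER SET LATIN1 COLLATE FRENCH
    );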
Hi Tom,

> I would propose to do this for v6.4 as user-defined packages (with
> compile-time parser support) on top of the existing USE_LOCALE and MB
> patches so that the existing compile-time options are not changed or
> damaged.

Be careful: system locales may not be available on a given machine, even though you may need the locale information in Postgres. They may also be broken (which is in fact often the case), so don't depend on them.

> So, the initial questions:
>
> 1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
> non-English applications? Do other databases use this SQL92 convention,
> or does it have difficulties?

Don't know (yet).

> 2) Would anyone be interested in helping to define the character sets
> and helping to test? I don't know the correct collation sequences and
> don't think they would display properly on my screen...

I can help with French, Icelandic, German and Norwegian (though for the last two, I guess there are more appropriate people on this list :).

> 3) I'd like to implement the existing Cyrillic and EUC-jp character
> sets, and also some European languages (French and ??) which use the
> Latin-1 alphabet but might have different collation sequences. Any
> suggestions for candidates??

They all have different collation sequences as soon as accents are taken into account: an English system sorts all accented characters at the end. And of course, the sequences are different for each language :)

Patrice

PS: I'm sorry, Tom, I haven't been able to work on the FAQ for the past month :(( because I've been busy in my free time learning Norwegian! I will submit something very soon, I promise!

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
... Looking for a job in Iceland or in Norway !
Ingénieur informaticien - Computer engineer - Tölvufræðingur
----- http://www.idf.net/patrice/ ----------------------------------
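To make the point about accents concrete, here is a hypothetical illustration (the table, column, and words are made up, and nothing here runs against the current sources): with plain byte-wise comparison every word starting with an accented letter sorts after 'z', while a French collation treats 'é' as 'e' at the first level.

    -- sample Latin-1 data: 'élan', 'elite', 'zoo'
    SELECT mot FROM mots ORDER BY mot;
    -- byte-wise (strcmp-style) order:    elite, zoo, élan
    -- French collation (strcoll) order:  élan, elite, zoo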
>Hi. I'm looking for non-English-using Postgres hackers to participate in
>implementing NCHAR() and alternate character sets in Postgres. I think
>I've worked out how to do the implementation (not the details, just a
>strategy) so that multiple character sets will be allowed in a single
>database, additional character sets can be loaded at run-time, and so
>that everything will behave transparently.

Sounds like an interesting idea... But before going into the discussion, let me clarify what "character set" means. A character set consists of some characters. One of the most famous character sets is ISO 646 (almost the same as ASCII). In western Europe, the ISO 8859 series of character sets is widely used. For example, ISO 8859-1 covers English, French, German etc., and ISO 8859-2 covers Albanian, Romanian etc. These are "single byte" character sets, and there is a one-to-many correspondence between a character set and languages.

Example 1:
    ISO 8859-1 <------> English, French, German

On the other hand, some Asian languages such as Japanese, Chinese, and Korean do not correspond to a single character set; rather, each corresponds to multiple character sets.

Example 2:
    ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese
    (ASCII, JIS X0208, JIS X0201, JIS X0212 are individual character sets)

An "encoding" is a way to represent a set of character sets in computers. The above set of character sets is encoded in the EUC_JP encoding.

I think SQL92 uses the term "character set" to mean an encoding.

>So, the initial questions:
>
>1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
>non-English applications? Do other databases use this SQL92 convention,
>or does it have difficulties?

As far as I know, there is no commercial RDBMS that supports the NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple encodings. The encoding for a database is defined when the database is created and cannot be changed at runtime. Clients can use a different encoding as long as it is a "subset" of the database's encoding. For example, an Oracle client can use ASCII if the database encoding is EUC_JP.

I think the idea of the "default" encoding for a database being defined at database creation time is nice:

    create database with encoding EUC_JP;

If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user could then use an encoding other than EUC_JP. Sounds very nice too.

>2) Would anyone be interested in helping to define the character sets
>and helping to test? I don't know the correct collation sequences and
>don't think they would display properly on my screen...

I would be able to help you with the Japanese part. For Chinese and Korean, I will look for volunteers on the local PostgreSQL mailing list I run, if necessary.

>3) I'd like to implement the existing Cyrillic and EUC-jp character
>sets, and also some European languages (French and ??) which use the
>Latin-1 alphabet but might have different collation sequences. Any
>suggestions for candidates??

Collation sequences for EUC_JP? How nice that would be! One problem with collation sequences for multi-byte encodings is that the sequences can become huge. It seems you have a solution for that; please let me know more details.
--
Tatsuo Ishii
t-ishii@sra.co.jp
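As a sketch of the Oracle-style model described above (both statements, and all the names in them, are hypothetical; neither exists in Postgres today):

    -- the database encoding is fixed when the database is created:
    CREATE DATABASE jpdb WITH ENCODING 'EUC_JP';
    -- a client could then request any subset encoding of EUC_JP, e.g. plain ASCII:
    SET CLIENT ENCODING = 'ASCII';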
Hello!

On Wed, 3 Jun 1998, Thomas G. Lockhart wrote:
> Hi. I'm looking for non-English-using Postgres hackers to participate in
> implementing NCHAR() and alternate character sets in Postgres. I think
> I've worked out how to do the implementation (not the details, just a
> strategy) so that multiple character sets will be allowed in a single
> database, additional character sets can be loaded at run-time, and so
> that everything will behave transparently.

All this sounds nice, but I am afraid the job is not for me. Actually, I am very new to the Postgres and SQL world. I started to learn SQL 3 months ago; I started to play with Postgres 2 months ago. I started to hack the Postgres sources (around locale support) a little more than a month ago.

> 2) Would anyone be interested in helping to define the character sets
> and helping to test? I don't know the correct collation sequences and
> don't think they would display properly on my screen...

It would be nice to test it, provided that it doesn't break existing code. Our site is running hundreds of CGIs that rely on the current locale support in Postgres...

Oleg.
----
Oleg Broytmann  http://members.tripod.com/~phd2/  phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
On Thu, 4 Jun 1998 t-ishii@sra.co.jp wrote:

> >Hi. I'm looking for non-English-using Postgres hackers to participate in
> >implementing NCHAR() and alternate character sets in Postgres. I think
> >I've worked out how to do the implementation (not the details, just a
> >strategy) so that multiple character sets will be allowed in a single
> >database, additional character sets can be loaded at run-time, and so
> >that everything will behave transparently.
>
> Sounds like an interesting idea... But before going into the discussion,
> let me clarify what "character set" means. A character set consists of
> some characters. One of the most famous character sets is ISO 646
> (almost the same as ASCII). In western Europe, the ISO 8859 series of
> character sets is widely used. For example, ISO 8859-1 covers English,
> French, German etc., and ISO 8859-2 covers Albanian, Romanian etc.
> These are "single byte" character sets, and there is a one-to-many
> correspondence between a character set and languages.
>
> Example 1:
>     ISO 8859-1 <------> English, French, German
>
> On the other hand, some Asian languages such as Japanese, Chinese, and
> Korean do not correspond to a single character set; rather, each
> corresponds to multiple character sets.
>
> Example 2:
>     ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese
>     (ASCII, JIS X0208, JIS X0201, JIS X0212 are individual character sets)
>
> An "encoding" is a way to represent a set of character sets in
> computers. The above set of character sets is encoded in the EUC_JP
> encoding.
>
> I think SQL92 uses the term "character set" to mean an encoding.
>
> >So, the initial questions:
> >
> >1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
> >non-English applications? Do other databases use this SQL92 convention,
> >or does it have difficulties?
>
> As far as I know, there is no commercial RDBMS that supports the
> NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple
> encodings. The encoding for a database is defined when the database is
> created and cannot be changed at runtime. Clients can use a different
> encoding as long as it is a "subset" of the database's encoding. For
> example, an Oracle client can use ASCII if the database encoding is
> EUC_JP.

I tried the following databases on Linux, and none of them has this feature:

. MySql
. Solid
. Empress
. Kubl
. ADABAS D

I found only one, under M$-Windows, that implements this feature:

. OCELOT

I'm playing with it, but so far I don't understand its behavior. There is some interesting documentation about it in the OCELOT manual; if you want, I can send it to you.

> I think the idea of the "default" encoding for a database being defined
> at database creation time is nice:
>
>     create database with encoding EUC_JP;
>
> If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user could
> then use an encoding other than EUC_JP. Sounds very nice too.
>
> >2) Would anyone be interested in helping to define the character sets
> >and helping to test? I don't know the correct collation sequences and
> >don't think they would display properly on my screen...
>
> I would be able to help you with the Japanese part. For Chinese and
> Korean, I will look for volunteers on the local PostgreSQL mailing
> list I run, if necessary.

I may help with Italian, Spanish and Portuguese.

> >3) I'd like to implement the existing Cyrillic and EUC-jp character
> >sets, and also some European languages (French and ??) which use the
> >Latin-1 alphabet but might have different collation sequences. Any
> >suggestions for candidates??
>
> Collation sequences for EUC_JP? How nice that would be! One problem with
> collation sequences for multi-byte encodings is that the sequences can
> become huge. It seems you have a solution for that; please let me know
> more details.
> --
> Tatsuo Ishii
> t-ishii@sra.co.jp

Ciao, Jose'
> > Sounds like an interesting idea... But before going into the discussion,
> > let me clarify what "character set" means.
> > An "encoding" is a way to represent a set of character sets in computers.
> > I think SQL92 uses the term "character set" to mean an encoding.

I have found the SQL92 terminology confusing, because it does not seem to make the nice clear distinction between encoding and collation sequence which you have pointed out. I suppose that there can also be an issue of the visual appearance of an alphabet for different locales.

AFAIK, SQL92 uses the term "character set" to mean an encoding with an implicit collation sequence. SQL92 allows alternate collation sequences to be specified for a "character set" when that can be made meaningful.

I would propose to implement

    VARCHAR(length) WITH CHARACTER SET setname

as a type with a type name of, for example, "VARSETNAME". This type would have the comparison functions and operators which implement collation sequences. I would propose to implement

    VARCHAR(length) WITH CHARACTER SET setname COLLATION collname

as a type with a name of, for example, "VARCOLLNAME". For the EUC-jp encoding, "collname" could be "Korean" or "Japanese", so the type name would become "varkorean" or "varjapanese". I don't know for sure yet whether this is adequate, but other possibilities can be used if necessary.

When a database is created, a default character set/collation sequence can be specified for it; this would correspond to the NCHAR/NVARCHAR/NTEXT types. We could implement a

    SET NATIONAL CHARACTER SET = 'language';

command to determine the default character set for the session when NCHAR is used.

The SQL92 technique for specifying an encoding/collation sequence in a literal string is

    _language 'string'

so, for example, to specify a string in the French language (implying an encoding, collation, and representation?) you would use

    _FRENCH 'string'

> > I would be able to help you with the Japanese part. For Chinese and
> > Korean, I will look for volunteers on the local PostgreSQL mailing
> > list I run, if necessary.
>
> I may help with Italian, Spanish and Portuguese.

Great, and perhaps Oleg could help test with Cyrillic (I assume I can steal code from the existing "CYR_LOCALE" blocks in the Postgres backend).

> > Collation sequences for EUC_JP? How nice that would be! One problem with
> > collation sequences for multi-byte encodings is that the sequences can
> > become huge. It seems you have a solution for that; please let me know
> > more details.

Um, no, I just assume we can find a solution :/ I'd like to implement the infrastructure in the Postgres parser to allow multiple encodings/collations, and then see where we are.

As I mentioned, this would be done for v6.4 as a transparent add-on, so that existing capabilities are not touched or damaged. Implementing everything for some European languages (with the 1-byte Latin-1 encoding?) may be easiest, but the Asian languages might be more fun :)

- Tom
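Pulling the proposed pieces together, this is roughly what the SQL level might look like (a sketch only, using Tom's proposed spellings; the table, set, and collation names are placeholders, and none of this is implemented):

    -- a column with an explicit character set; the parser would fold this
    -- into an ordinary type name ("vareuc_jp") carrying its own comparison ops:
    CREATE TABLE book  (title VARCHAR(80) WITH CHARACTER SET euc_jp);

    -- character set plus an explicit collation -> type "varjapanese":
    CREATE TABLE book2 (title VARCHAR(80) WITH CHARACTER SET euc_jp COLLATION japanese);

    -- session default used when NCHAR/NVARCHAR/NTEXT is written:
    SET NATIONAL CHARACTER SET = 'japanese';

    -- SQL92-style introducer for a string literal in a given character set:
    SELECT _FRENCH 'élève';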
Hi!

On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
> Great, and perhaps Oleg could help test with Cyrillic (I assume I can
> steal code from the existing "CYR_LOCALE" blocks in the Postgres
> backend).

Before sending my patch to pgsql-patches I gave it out to a few testers here. It wouldn't be too hard to find testers for Cyrillic support, sure.

Oleg.
----
Oleg Broytmann  http://members.tripod.com/~phd2/  phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
>When a database is created, a default character set/collation sequence
>can be specified for it; this would correspond to the
>NCHAR/NVARCHAR/NTEXT types. We could implement a
>    SET NATIONAL CHARACTER SET = 'language';

In the current implementation of MB, the encoding used by the backend is determined at compile time. This time I would like to add more flexibility, so that the encoding can be specified when a database is created. I would like to add a new option to the CREATE DATABASE statement:

    CREATE DATABASE WITH ENCODING 'encoding';

I'm not sure whether this kind of thing is defined in the standard. Suggestions?
--
Tatsuo Ishii
t-ishii@sra.co.jp
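On the standards question: if memory serves, SQL92 does not define a CREATE DATABASE statement at all; the nearest construct attaches a default character set to a schema, so the SQL92-flavoured spelling of the same idea would be something like the sketch below (the schema and character set names are hypothetical):

    -- SQL92 <schema definition> with a schema-wide default character set:
    CREATE SCHEMA jpschema DEFAULT CHARACTER SET euc_jp;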