Thread: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
> > Perhaps the locale data can be used to gather this information?
>
> It's certainly there! Locale data contains all information about a
> specific character set and encoding. Is it possible to use it to
> create indices? It should be slower, but the benefits of using indices
> will cover the expense for non-US people. I didn't notice such
> behaviour in Informix and Oracle.

Informix has national character handling, and it is indexable in
Informix. But it is not done for the standard types char and varchar.
In Informix you use nchar and nvarchar, and have one locale defined per
database. In Oracle you have national characters, but access is not
indexable.

Actually the SQL standard has something to say about national char and
varchar. I think it is wrong that char and varchar change their behavior
in postgresql if you enable locale. A locale-sensitive column needs to
be specified as such in the create table statement, according to the
standard. I never enable locale.

Andreas
On Wed, 9 Jun 1999, ZEUGSWETTER Andreas IZ5 wrote:

> Date: Wed, 9 Jun 1999 19:16:14 +0200
> From: ZEUGSWETTER Andreas IZ5 <Andreas.Zeugswetter@telecom.at>
> To: 'Oleg Bartunov' <oleg@sai.msu.su>
> Cc: "'hackers@postgresql.org'" <hackers@postgresql.org>
> Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
>
> > > Perhaps the locale data can be used to gather this information?
> >
> > It's certainly there! Locale data contains all information about a
> > specific character set and encoding. Is it possible to use it to
> > create indices? It should be slower, but the benefits of using
> > indices will cover the expense for non-US people. I didn't notice
> > such behaviour in Informix and Oracle.
>
> Informix has national character handling, and it is indexable in
> Informix. But it is not done for the standard types char and varchar.
> In Informix you use nchar and nvarchar, and have one locale defined
> per database. In Oracle you have national characters, but access is
> not indexable.
>
> Actually the SQL standard has something to say about national char
> and varchar. I think it is wrong that char and varchar change their
> behavior in postgresql, if you enable locale. A locale-sensitive
> column needs to be specified as such in the create table statement
> according to the standard.

Thanks for the explanations. It would be great if I could specify, for
specific columns, whether locale is needed. For now I have to pay a
speed penalty just to enable locale for the only column of one
database!!! I always dream of being able to specify 'SET LOCALE TO
ON|OFF' on the fly. It's a somewhat separate problem, but probably
easier to implement.

Regards,
    Oleg

> I never enable locale.
>
> Andreas

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov wrote:
>
> Thanks for the explanations. It would be great if I could specify,
> for specific columns, whether locale is needed. For now I have to pay
> a speed penalty just to enable locale for the only column of one
> database!!!
> I always dream of being able to specify 'SET LOCALE TO ON|OFF' on the
> fly. It's a somewhat separate problem, but probably easier to
> implement.

Actually, the locale should be part of the field definition (like
length and precision currently are), with a default of ascii.

And it is also probably safest to maintain our own locale definitions
(as Oracle currently does), so that things don't break if somebody
changes the system ones.

-------------
Hannu
> Oleg Bartunov wrote:
> >
> > Thanks for the explanations. It would be great if I could specify,
> > for specific columns, whether locale is needed. For now I have to
> > pay a speed penalty just to enable locale for the only column of
> > one database!!!
> > I always dream of being able to specify 'SET LOCALE TO ON|OFF' on
> > the fly. It's a somewhat separate problem, but probably easier to
> > implement.
>
> Actually, the locale should be part of the field definition (like
> length and precision currently are), with a default of ascii.
>
> And it is also probably safest to maintain our own locale definitions
> (as Oracle currently does), so that things don't break if somebody
> changes the system ones.

Added to TODO:

    * Allow LOCALE on a per-column basis, default to ASCII

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
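For illustration, the TODO item might eventually surface as something
like the following. This is purely hypothetical syntax, modeled on
SQL92's COLLATE clause; the table, column names, and the locale string
are made up, and nothing like this is implemented in Postgres yet:

```sql
-- Hypothetical per-column locale, sketched after SQL92's COLLATE
-- clause; not valid in PostgreSQL 6.5.
CREATE TABLE people (
    id    int4,
    login varchar(32),                      -- plain ASCII: no strcoll()
    name  varchar(64) COLLATE "ru_RU.KOI8"  -- locale-aware comparison
);
```

Columns without a COLLATE clause would keep the fast byte-wise
comparison, which is exactly the per-column split Oleg is asking for.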
> Actually the SQL standard has something to say about national char
> and varchar. I think it is wrong that char and varchar change their
> behavior in postgresql, if you enable locale. A locale sensitive
> column needs to be specified as such in the create table statement
> according to the standard. I never enable locale.

I had some discussions on the list a while ago (6 months?) about this
topic, but never really got feedback from "locale-using" people that
NATIONAL CHARACTER and collation sequences are an acceptable solution.

istm that Postgres' extensibility would make this *very* easy to
implement and extend, and that then everyone would get the same
behavior from CHAR while being able to get the behaviors they need from
a variety of other character sets.

I do have an interest in implementing or helping with something, but
since I don't have to live with the consequences of the results (coming
from an ASCII country :) it seemed to be poor form to push it without
feedback from others...

                      - Thomas

--
Thomas Lockhart                     lockhart@alumni.caltech.edu
South Pasadena, California
On Thu, 10 Jun 1999, Thomas Lockhart wrote:

> I had some discussions on the list a while ago (6 months?) about this
> topic, but never really got feedback from "locale-using" people that
> NATIONAL CHARACTER and collation sequences are an acceptable solution.

What feedback do you want? I am pretty sure two Olegs on this list are
ready to continue the discussion.

> istm that Postgres' extensibility would make this *very* easy to
> implement and extend, and that then everyone would get the same
> behavior from CHAR while being able to get the behaviors they need
> from a variety of other character sets.
>
> I do have an interest in implementing or helping with something, but
> since I don't have to live with the consequences of the results
> (coming from an ASCII country :) it seemed to be poor form to push it
> without feedback from others...
>
>                       - Thomas

Oleg.
----
Oleg Broytmann    http://members.xoom.com/phd2/    phd2@earthling.net
        Programmers don't die, they just GOSUB without RETURN.
> What feedback do you want? I am pretty sure two Olegs on this list
> are ready to continue the discussion.

istm that the Russian and Japanese contingents could represent the
needs of multibyte and locale concerns very well. So, we should ask
ourselves some questions to see if we can make *progress* in evolving
our text handling, rather than just staying the same forever.

SQL92 suggests some specific text handling features to help with
non-ascii applications. "national character" is, afaik, the feature
which would hold an installation-wide local text type. "collations"
would allow other text types in the same installation, but SQL92 is a
bit fuzzier about how to make them work.

Would these mechanisms work for people? Or are they fundamentally
flawed or non-standard (it is from a standard, but I'm not sure who
implements it)?

                      - Thomas

--
Thomas Lockhart                     lockhart@alumni.caltech.edu
South Pasadena, California
> istm that the Russian and Japanese contingents could represent the
> needs of multibyte and locale concerns very well. So, we should ask
> ourselves some questions to see if we can make *progress* in evolving
> our text handling, rather than just staying the same forever.
>
> SQL92 suggests some specific text handling features to help with
> non-ascii applications. "national character" is, afaik, the feature
> which would hold an installation-wide local text type. "collations"
> would allow other text types in the same installation, but SQL92 is a
> bit fuzzier about how to make them work.
>
> Would these mechanisms work for people? Or are they so fundamentally
> flawed or non-standard (it is from a standard, but I'm not sure who
> implements it)?

In my opinion, introducing NCHAR is useful, at least for single-byte
codes. Although I'm not familiar with single-byte languages, I see
strong demand for NCHAR in recent discussions.

I don't mean it's useless for multibyte, however. As far as I know, the
locales for multibyte provided by any OS are totally broken, especially
in COLLATE (I seriously doubt the existing locale framework works for
multibyte). It would be nice to have our own locale data for COLLATE
somewhere in our system, like some commercial DBMSs do, or even better,
to allow user-defined collations (this is already in the standard).

I have a feeling that I'm going to implement CREATE CHARSET by slightly
modifying the multibyte support code, which currently only allows
predefined charsets. That will be the first step toward CREATE COLLATE,
NCHAR etc....
---
Tatsuo Ishii
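Tatsuo's proposed first step might look roughly like this. The syntax is
entirely hypothetical, extrapolated from the discussion and from the
SQL92 direction it points toward; the statement names and clauses are
illustrative only:

```sql
-- Hypothetical: register a charset at run time instead of relying on
-- the compiled-in list from the multibyte support code.
CREATE CHARSET koi8_r;

-- ...paving the way for user-defined collations later, e.g.:
-- CREATE COLLATION russian FOR koi8_r FROM EXTERNAL ('ru_RU');
```

The point of the sketch is the ordering: charsets become catalog
objects first, and collations and NCHAR can then be defined in terms of
them.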
On Thu, 10 Jun 1999, Thomas Lockhart wrote:

> istm that the Russian and Japanese contingents could represent the
> needs of multibyte and locale concerns very well. So, we should ask
> ourselves some questions to see if we can make *progress* in evolving
> our text handling, rather than just staying the same forever.

Ok, we are here.
And what are the pros and cons of NCHAR?

> SQL92 suggests some specific text handling features to help with
> non-ascii applications. "national character" is, afaik, the feature

What is the help?

> which would hold an installation-wide local text type. "collations"
> would allow other text types in the same installation, but SQL92 is a
> bit fuzzier about how to make them work.
>
> Would these mechanisms work for people? Or are they so fundamentally
> flawed or non-standard (it is from a standard, but I'm not sure who
> implements it)?
>
>                       - Thomas

Oleg.
----
Oleg Broytmann    http://members.xoom.com/phd2/    phd2@earthling.net
        Programmers don't die, they just GOSUB without RETURN.
On 11-Jun-99 Oleg Broytmann wrote:
> On Thu, 10 Jun 1999, Thomas Lockhart wrote:
>> istm that the Russian and Japanese contingents could represent the
>> needs of multibyte and locale concerns very well. So, we should ask
>> ourselves some questions to see if we can make *progress* in
>> evolving our text handling, rather than just staying the same
>> forever.
>
> Ok, we are here.
> And what are the pros and cons of NCHAR?
>
>> SQL92 suggests some specific text handling features to help with
>> non-ascii applications. "national character" is, afaik, the feature
>
> What is the help?
>
>> which would hold an installation-wide local text type. "collations"
>> would allow other text types in the same installation, but SQL92 is
>> a bit fuzzier about how to make them work.
>>
>> Would these mechanisms work for people? Or are they so fundamentally
>> flawed or non-standard (it is from a standard, but I'm not sure who
>> implements it)?

There are two different problems under the "locale" cover. The first is
national charset handling, mostly sorting; I'll try to find an official
recommendation for it. The other is encoding conversion - an unlimited
field for discussion.

IMHO, the conversion table has to be stored on a per-host basis,
somewhere near the HBA configuration. The sort table should be placed
near createdb. A custom sort table is also very attractive to me.
---
Dmitry Samersoff, dms@wplus.net, ICQ:3161705
http://devnull.wplus.net
* There will come soft rains ...
Hi!

On Fri, 11 Jun 1999, Dmitry Samersoff wrote:
> IMHO, conversion table have to be stored on per host basic,
> some where near from HBA.

Currently you can do this per host with the SQL command

    SET CLIENT_ENCODING TO '...';

or with the environment variable

    export PGCLIENTENCODING=...

> ---
> Dmitry Samersoff, dms@wplus.net, ICQ:3161705
> http://devnull.wplus.net
> * There will come soft rains ...

Oleg.
----
Oleg Broytmann    http://members.xoom.com/phd2/    phd2@earthling.net
        Programmers don't die, they just GOSUB without RETURN.
> > istm that the Russian and Japanese contingents could represent the
> > needs of multibyte and locale concerns very well. So, we should ask
> > ourselves some questions to see if we can make *progress* in
> > evolving our text handling, rather than just staying the same
> > forever.
> Ok, we are here.
> And what are the pros and cons of NCHAR?

I was hoping you would tell me! :)

> > SQL92 suggests some specific text handling features to help with
> > non-ascii applications.
> What is the help?

OK, SQL92 defines two kinds of native character sets: those we already
have (char, varchar) and those which can be locale customized (nchar,
national character varying, and others). char and varchar always
default to the "SQL" behavior (which I think corresponds to ascii
(called "SQL_TEXT"), but I didn't bother looking for the details).

So, at its simplest, there would be two sets of character types, with
char, varchar, etc. always the same on every system (just like Postgres
w/o multibyte or locale), and nchar, nvarchar, etc. configured as your
locale/multibyte environment would want.

However, there are many more features in SQL92 to deal with text
customization. I'll mention a few (well, most of them, but not in
detail):

o You can define a "character set" and, independently, a "collation".
  The syntax for the type definition is

    CHARACTER [ VARYING ] [ (length) ]
      [ CHARACTER SET your-character-set ]
      [ COLLATE your-collation-sequence ]

o You can specify a character type for string literals:

    _your-character-set 'literal string'

  e.g. _ESPANOL 'Que pasa?' (forgive my omission of a leading upside
  down question mark :) We already have some support for this in that
  character string literals can have a type specification (e.g.
  "DATETIME 'now'"), and presumably we can use the required underscore
  to convert the "_ESPANOL" to a character set and collation, all
  within the existing Postgres type system.

o You can specify collation behavior in a strange way:

    'Yo dude!' COLLATE collation-method

  (which we could convert in the parser to a function call).

o You can translate between character sets, *if* there is a reasonable
  mapping available:

    TRANSLATE(string USING method)

  and you can define translations in a vague way (since no one but
  Postgres implemented a type system back then):

    CREATE TRANSLATION translation
      FOR source-charset TO target-charset
      FROM { EXTERNAL('external-translation')
           | IDENTITY
           | existing-translation }
    DROP TRANSLATION translation

o You can convert character strings which have the same character
  "repertoire" from one form to the other:

    CONVERT(string USING conversion-method)

  (e.g. we could define a method "EBCDIC_TO_ASCII" once we have an
  "EBCDIC" character set).

o You can specify identifiers (column names, etc.) with a specific
  character set/collation by:

    _charset colname

  (e.g. _FRANCAIS Francais, where the second "c" is allowed to be
  "c-cedilla", a character in the French/latin character set; sorry I
  didn't type it).

> > Would these mechanisms work for people? Or are they so
> > fundamentally flawed or non-standard (it is from a standard, but
> > I'm not sure who implements it)?

Fully implementing these features (or a reasonable subset) would give
us more capabilities than we have now, and imho can be fit into our
existing type system. *Only* implementing NCHAR etc. gives us the
ability to carry SQL_TEXT and multibyte/locale types in the same
database, which may not be a huge benefit to those who never want to
mix them in the same installation. I don't know who those folks might
be, but Tatsuo and yourself probably do.

Comments?

                      - Thomas

--
Thomas Lockhart                     lockhart@alumni.caltech.edu
South Pasadena, California
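Pulling those pieces together, a session using the SQL92 machinery
might read as follows. This is a sketch of the standard's syntax only;
the character set, collation, and conversion names (LATIN1, FRENCH,
FRENCH_CI, LATIN1_TO_UNICODE) are made up for illustration, and none of
this parses in Postgres today:

```sql
-- Sketch of SQL92 character-set syntax; all names are illustrative.
CREATE TABLE messages (
    id   INTEGER,
    body CHARACTER VARYING (200)
         CHARACTER SET LATIN1
         COLLATE FRENCH
);

-- A typed string literal in that character set:
INSERT INTO messages VALUES (1, _LATIN1 'Quel dommage!');

-- Overriding the column's collation for one query:
SELECT body FROM messages ORDER BY body COLLATE FRENCH_CI;

-- Converting between forms-of-use of the same repertoire:
SELECT CONVERT (body USING LATIN1_TO_UNICODE) FROM messages;
```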
Hi!

On Fri, 11 Jun 1999, Thomas Lockhart wrote:
> > And what are the pros and cons of NCHAR?
> I was hoping you would tell me! :)

I can see only one advantage for NCHAR - those fields that aren't NCHAR
will not use strcoll() for comparison.

But I cannot remember one field in my database that does not contain
russian characters. Even my WWW logs contain them.

So in any case I am forced to make all my fields NCHAR, and this is
exactly what we have now - postgres compiled with --enable-locale makes
all char NCHAR.

>                       - Thomas

Oleg.
----
Oleg Broytmann    http://members.xoom.com/phd2/    phd2@earthling.net
        Programmers don't die, they just GOSUB without RETURN.
> I can see only one advantage for NCHAR - those fields that aren't
> NCHAR will not use strcoll() for comparison.
> But I cannot remember one field in my database that does not contain
> russian characters. Even my WWW logs contain them.
> So in any case I am forced to make all my fields NCHAR, and this is
> exactly what we have now - postgres compiled with --enable-locale
> makes all char NCHAR.

Yes, and that is how we got the implementation we have. Implementing
NCHAR is a step on the road toward having fully flexible character set
capabilities in a single database. By itself, NCHAR probably does not
offer tremendous advantages for anyone running a fully "localized"
database.

So some of the questions really might be:

1) Is implementing NCHAR, and reverting CHAR back to the SQL-standard
ascii-ish behavior, acceptable, or does it introduce fatal flaws for
implementers? e.g. do any third-party tools know about NCHAR? I would
assume that the odbc interface could just map NCHAR to CHAR if odbc
knows nothing about NCHAR...

2) Solving various problems for specific datasets will require new
specialized support routines. If this is true, then isn't the Postgres
type system the way to introduce these specialized capabilities?
Doesn't Unicode, for example, work well as a new data type, as opposed
to shoehorning it into all areas of the backend with #ifdefs?

3) Do the SQL92-defined features help us solve the problem, or do they
just get in the way? istm that they address some of the features we
would need, and have sufficient fuzz around the edges to allow a
successful implementation.

An example of what we could do would be to have both Russian/Cyrillic
and Japanese regression tests in the main regression suite, since they
could coexist with the other tests.

                      - Thomas

--
Thomas Lockhart                     lockhart@alumni.caltech.edu
South Pasadena, California
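As a sketch of the type-system route in point 2): a specialized text
type could, in principle, be wired up with the extensibility commands
Postgres already has, assuming the type itself has first been created
with CREATE TYPE. The type name, shared library path, and C function
below are placeholders, not shipped code:

```sql
-- Hypothetical: a locale-aware text type added through the existing
-- operator machinery rather than #ifdefs in the backend.  ru_text and
-- ru_text_lt() are placeholders for a user-supplied type and function.
CREATE FUNCTION ru_text_lt(ru_text, ru_text) RETURNS bool
    AS '/usr/local/pgsql/lib/ru_text.so' LANGUAGE 'c';

CREATE OPERATOR < (
    leftarg    = ru_text,
    rightarg   = ru_text,
    procedure  = ru_text_lt,
    commutator = >
);
```

With the full set of comparison operators and an operator class, such a
type could even be indexed, which is exactly what the #ifdef approach
makes hard to do per-column.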
Oleg Broytmann wrote:
>
> Hi!
>
> On Fri, 11 Jun 1999, Thomas Lockhart wrote:
> > > And what are the pros and cons of NCHAR?
> > I was hoping you would tell me! :)
>
> I can see only one advantage for NCHAR - those fields that aren't
> NCHAR will not use strcoll() for comparison.
> But I cannot remember one field in my database that does not contain
> russian characters. Even my WWW logs contain them.

What about the tables beginning with pg_? Are the system tables
currently affected by --enable-locale?

> So in any case I am forced to make all my fields NCHAR, and this is
> exactly what we have now - postgres compiled with --enable-locale
> makes all char NCHAR.

Well, the problem is that while I do occasionally need cyrillic chars,
I also need English, Estonian, Finnish/Swedish, Latvian and Lithuanian.
The only two of them that don't have overlapping character codes are
Russian (all chars >127) and English (all <128).

My current solution is to run without --enable-locale and do all the
sorting in the client. But it would often be useful to have
language-specific columns.

--------------------
Hannu
> Well, the problem is that while I do occasionally need cyrillic
> chars, I also need English, Estonian, Finnish/Swedish, Latvian and
> Lithuanian. The only two of them that don't have overlapping
> character codes are Russian (all chars >127) and English (all < 128)
> My current solution is to run without --enable-locale and do all the
> sorting in the client. But it would be often useful to have language
> specific columns.

Great! You're our representative for multi-charset usage :)

Perhaps you can think about and comment on the SQL92 features which
support this, and whether that kind of capability would suit your
needs. Of course, we would need to start getting specific about how to
map those generic features into a Postgres implementation...

btw, istm that we could make things somewhat backward compatible by
allowing the backend to be built with CHAR and NCHAR always mapped to
NCHAR.

                      - Thomas

--
Thomas Lockhart                     lockhart@alumni.caltech.edu
South Pasadena, California
> OK, SQL92 defines two kinds of native character sets: those we
> already have (char, varchar) and those which can be locale customized
> (nchar, national character varying, and others). char and varchar
> always default to the "SQL" behavior (which I think corresponds to
> ascii (called "SQL_TEXT") but I didn't bother looking for the
> details).

This seems to be a little bit different from the standard. First,
SQL_TEXT is not equal to ascii; it's a subset of ascii. Second, the
default charset for char and varchar might be implementation dependent,
not necessarily limited to SQL_TEXT. The only requirement is that the
charset must contain the repertoire SQL_TEXT has. I think any charset
I've ever seen, including ascii, satisfies the requirement. Third, the
standard says nothing about locale.
---
Tatsuo Ishii
> Well, the problem is that while I do occasionally need cyrillic
> chars, I also need English, Estonian, Finnish/Swedish, Latvian and
> Lithuanian.

Probably this is the first example ever to appear on this list of the
demand for an i18n database, which should be the future direction of
PostgreSQL, in my opinion. Currently MB has two encodings for this kind
of purpose: Unicode and mule-internal-code. Both of them allow mixed
languages, even in the same column. This might give you a partial
solution.

> The only two of them that don't have overlapping character codes are
> Russian (all chars >127) and English (all < 128)
>
> My current solution is to run without --enable-locale and do all the
> sorting in the client. But it would be often useful to have language
> specific columns.

This is another reason why we cannot rely on the locale mechanism
supported by the OS. Basically, locale is global data for the entire
process. If each column has a different language, we have to switch
locales. That would be inefficient and painful.
---
Tatsuo Ishii
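The partial solution Tatsuo mentions relies on the multibyte build's
per-database encoding, along these lines (encoding names as used by the
MB support; availability depends on how the server was configured, so
treat this as a sketch):

```sql
-- With multibyte support compiled in, a database carries an encoding
-- that can hold several languages at once:
CREATE DATABASE multilang WITH ENCODING = 'UNICODE';
-- or the mule internal code:
CREATE DATABASE multilang2 WITH ENCODING = 'MULE_INTERNAL';
```

Storage then accommodates mixed languages, but sorting remains
byte-wise; per-language collation is exactly the part still missing.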
> This seems to be a little bit different from the standard. First,
> SQL_TEXT is not equal to ascii. It's a subset of ascii.

Yes, sorry. I was lazy in my posting.

> Second, the default charset for char and varchar might be
> implementation dependent, not necessarily limited to SQL_TEXT. The
> only requirement is the charset must contain the repertoire SQL_TEXT
> has. I think any charsets including ascii I've ever seen satisfies
> the requirement.

Yow! I certainly misremembered the definition. Date and Darwen, 1997,
point out that the SQL implementation *must* support at least one
character set, SQL_TEXT, whose repertoire must contain:

1) Every character that is used in the SQL language itself (this is the
part I remembered), and

2) Every character that is included in *any other character set*
supported by the SQL implementation (Postgres).

This second requirement is presumably to enable text handling of
multiple character sets, but would seem to put severe restrictions on
how we would implement things. Or can it act only as a placeholder,
allowing us to define new character sets as different types in
Postgres? Otherwise, we would have to retrofit capabilities into
SQL_TEXT anytime we defined a new character set??

> Third, the standard says nothing about locale.

You are referring to the Unix-style system support for "locale"?
Certainly the NCHAR and character set support in SQL92 would qualify as
locale support in the generic sense...

Regards.

                      - Thomas

--
Thomas Lockhart                     lockhart@alumni.caltech.edu
South Pasadena, California
> > Second, the default charset for char and varchar might be
> > implementation dependent, not necessarily limited to SQL_TEXT. The
> > only requirement is the charset must contain the repertoire
> > SQL_TEXT has. I think any charsets including ascii I've ever seen
> > satisfies the requirement.
>
> Yow! I certainly misremembered the definition. Date and Darwen, 1997,
> point out that the SQL implementation *must* support at least one
> character set, SQL_TEXT, whose repertoire must contain:
>
> 1) Every character that is used in the SQL language itself (this is
> the part I remembered), and
>
> 2) Every character that is included in *any other character set*
> supported by the SQL implementation (Postgres).
>
> This second requirement is presumably to enable text handling of
> multiple character sets, but would seem to put severe restrictions on
> how we would implement things. Or can it act only as a placeholder,
> allowing us to define new character sets as different types in
> Postgres? Otherwise, we would have to retrofit capabilities into
> SQL_TEXT anytime we defined a new character set??

I don't think so. 2) can be read as:

    Any other character set must contain every character included in
    SQL_TEXT.

This seems extremely easy to implement. We could define SQL_TEXT to be
a subset of ASCII, and almost any character set contains the ASCII
chars. As a result, any character set satisfies the above, which is
logically the same as 2). No?

> > Third, the standard says nothing about locale.
>
> You are referring to the Unix-style system support for "locale"?

Yes.

> Certainly the NCHAR and character set support in SQL92 would qualify
> as locale support in the generic sense...
---
Tatsuo Ishii
> > SQL_TEXT ... repertoire must contain:
> > 1) Every character that is used in the SQL language itself (this is
> > the part I remembered), and
> > 2) Every character that is included in *any other character set*
> > supported by the SQL implementation (Postgres).
> > This second requirement is presumably to enable text handling of
> > multiple character sets, but would seem to put severe restrictions
> > on how we would implement things. Or can it act only as a
> > placeholder, allowing us to define new character sets as different
> > types in Postgres? Otherwise, we would have to retrofit
> > capabilities into SQL_TEXT anytime we defined a new character set??
> I don't think so. 2) can be read as:
> Any other character set must contain every character included in
> SQL_TEXT.

Here is the text from the July, 1992 SQL92 draft standard:

    The <implementation-defined character repertoire name> SQL_TEXT
    specifies the name of a character repertoire and implied
    form-of-use that can represent every character that is in
    <SQL language character> and all other characters that are in
    character sets supported by the implementation.

and later in the same doc:

    11) The character set named SQL_TEXT is an implementation-defined
    character set whose character repertoire is SQL_TEXT.

I'm reading this to say that SQL_TEXT must contain the union of all
characters in the character sets in the implementation, rather than an
intersection between that union and the characters required by the SQL
language itself.

But I'm not really sure what they mean by this, or whether it is a
problem or not. Clearly different character sets and collations can be
mixed only when that can preserve meaning, so saying that SQL_TEXT has
a repertoire which contains ASCII characters and Japanese characters
doesn't seem to help much. So istm that "SQL_TEXT" might be just a
container class for all characters in the installation, which still
doesn't make complete sense to me wrt a Postgres implementation.

                      - Tom

--
Thomas Lockhart                     lockhart@alumni.caltech.edu
South Pasadena, California