Thread: Turkish locale bug
Your name               : Sezai YILMAZ
Your email address      : sezaiy@ata.cs.hun.edu.tr

System Configuration
---------------------
  Architecture (example: Intel Pentium)         : AMD Duron
  Operating System (example: Linux 2.0.26 ELF)  : Linux 2.2.17 ELF
  PostgreSQL version (example: PostgreSQL-7.0)  : PostgreSQL-7.0.3
  Compiler used (example: gcc 2.8.0)            : gcc 2.95.3

Please enter a FULL description of your problem:
------------------------------------------------
Locale support for Turkish causes a problem. The problem is with the
character 'I' (the capital of the 9th letter of the English alphabet).
When 'I' is given to the tolower() function and the locale is set to
"tr_TR", it is downgraded to the special Turkish character 'ı' (dotless
i, which shows up as 'ý', "y acute", when the text is read as Latin-1),
not 'i'.

This causes the following problem:

With Turkish locale it is not possible to write SQL queries in
CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
does not match as SQL identifier.

Please describe a way to repeat the problem. Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------
When you set the "LC_ALL" environment variable to "tr_TR" this problem
happens.

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------
In file:

    [postgresqlsourcepath]/src/backend/parser/scan.l

This block uses the function tolower(), which is affected by the locale
settings of the shell which runs postmaster.

================================================================
{identifier}    {
                    int i;
                    ScanKeyword     *keyword;

                    for(i = 0; yytext[i]; i++)
                        if (isascii((unsigned char)yytext[i]) &&
                            isupper(yytext[i]))
                            yytext[i] = tolower(yytext[i]);
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
================================================================

I think it should be better to use another thing which does what
function tolower() does but only in English language. This should
stay in English locale. I think this will solve the problem.

    'a' - 'A' = 32

So we can use the following line instead of the last line marked in
above block.

    yytext[i] += 32;
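The reported behaviour can be checked outside the server with a small C helper. This is only a sketch: the function name `tolower_I_is_i` is hypothetical, and whether a locale called "tr_TR" is installed (and which character set it uses) depends on the system.

```c
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if tolower('I') yields 'i' under the
 * named locale, 0 if it yields something else, and -1 if the locale
 * could not be selected. The previous LC_CTYPE setting is restored.
 */
int tolower_I_is_i(const char *locale_name)
{
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved;
    int result;

    if (current == NULL)
        return -1;

    /* Copy the name: later setlocale() calls may overwrite the buffer. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        result = -1;                 /* locale not available here */
    else
        result = (tolower((unsigned char) 'I') == 'i') ? 1 : 0;

    setlocale(LC_CTYPE, saved);      /* put the old locale back */
    free(saved);
    return result;
}
```

Calling `tolower_I_is_i("C")` should give 1. On a system that shows the problem described above, `tolower_I_is_i("tr_TR")` would be expected to give 0, and -1 means the locale is simply not installed.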
Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> With Turkish locale it is not possible to write SQL queries in
> CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
> are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> does not match as SQL identifier.

Ugh.

>     for(i = 0; yytext[i]; i++)
>         if (isascii((unsigned char)yytext[i]) &&
>             isupper(yytext[i]))
>             yytext[i] = tolower(yytext[i]);

> I think it should be better to use another thing which does what
> function tolower() does but only in English language. This should
> stay in English locale. I think this will solve the problem.

> yytext[i] += 32;

Hm.  Several problems here:

(1) This solution would break in other locales where isupper() may
return TRUE for characters other than 'A'..'Z'.

(2) We could fix that by gutting the isascii/isupper test as well,
reducing it to "yytext[i] >= 'A' && yytext[i] <= 'Z'", but I'd prefer to
still be able to say that "identifiers fold to lower case" works for
whatever the local locale thinks is upper and lower case.  It would be
strange if identifier folding did not agree with the SQL lower()
function.

(3) I do not like the idea of hard-wiring knowledge of ASCII encoding
here, even if it's unlikely that anyone would ever try to run Postgres
on a non-ASCII-based system.

I see your problem, but I'm not sure of a solution that doesn't have bad
side-effects elsewhere.  Ideas anyone?

regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010219 20:31]:
>
> Hm.  Several problems here:
>
> (1) This solution would break in other locales where isupper() may
> return TRUE for characters other than 'A'..'Z'.
>
> (2) We could fix that by gutting the isascii/isupper test as well,
> reducing it to "yytext[i] >= 'A' && yytext[i] <= 'Z'", but I'd prefer to
> still be able to say that "identifiers fold to lower case" works for
> whatever the local locale thinks is upper and lower case.  It would be
> strange if identifier folding did not agree with the SQL lower()
> function.

What about EBCDIC (IBM MainFrame, I.E. Linux on S/390, Z/390).

EBCDIC has 3 different ranges that contain letters.

    X'C1'-X'C9' (A-I)
    X'D1'-X'D9' (J-R)
    X'E2'-X'E9' (S-Z)

and the *LOWER* case ones subtract X'40' (SPACE) to get there.

Plus Numbers are X'F0'-X'F9'.

This is from 5 year ago mainframe assembler memory....

>
> (3) I do not like the idea of hard-wiring knowledge of ASCII encoding
> here, even if it's unlikely that anyone would ever try to run Postgres
> on a non-ASCII-based system.

Not unlikely now.  See APACHE and other ports to now handle EBCDIC.

>
> I see your problem, but I'm not sure of a solution that doesn't have bad
> side-effects elsewhere.  Ideas anyone?
>

--
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812              E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Larry Rosenman <ler@lerctr.org> writes:
> What about EBCDIC (IBM MainFrame, I.E. Linux on S/390, Z/390).

Right, that was what I meant about not wanting to hardwire assumptions
about ASCII.

We could instead code it as

    if (isupper(ch))
        ch = ch + ('a' - 'A');

which I believe will work on EBCDIC as well as ASCII.  However, it still
breaks down if isupper() claims that anything besides 'A'..'Z' is
uppercase --- and the simple 'A' to 'Z' range check does *not* work in
EBCDIC.

It would be an interesting timewaster to try to get Postgres working on
an EBCDIC platform ;-).  I'm sure there are a lot of ASCII dependencies
lurking in the code that would need to be snuffed out.  However, that
doesn't mean that I'm eager to add another one here ...

regards, tom lane
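The two pitfalls named above (isupper() depending on the locale, and the 'A'..'Z' range not being contiguous in EBCDIC) can both be sidestepped by looking the character up in an explicit alphabet string. Nobody in the thread proposed this; it is a minimal sketch with a hypothetical function name, and it trades speed for not assuming anything about the encoding.

```c
#include <string.h>

/*
 * Hypothetical helper: fold only the 26 basic Latin capitals, using
 * neither isupper()/tolower() nor any assumption about character codes.
 * Every other character is returned unchanged.
 */
char fold_basic_latin(char ch)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    const char *hit;

    if (ch == '\0')
        return ch;              /* strchr() would match the terminator */

    hit = strchr(upper, ch);    /* position in the alphabet, if any */
    return hit != NULL ? lower[hit - upper] : ch;
}
```

Because the mapping is positional, `fold_basic_latin('I')` gives `'i'` whatever the current locale is, and a byte such as 0xDD is left alone.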
* Tom Lane <tgl@sss.pgh.pa.us> [010219 21:02]:
> Larry Rosenman <ler@lerctr.org> writes:
> > What about EBCDIC (IBM MainFrame, I.E. Linux on S/390, Z/390).
>
> Right, that was what I meant about not wanting to hardwire assumptions
> about ASCII.
>
> We could instead code it as
>
>     if (isupper(ch))
>         ch = ch + ('a' - 'A');

what about:

    if (isupper(ch) && isalpha(ch))
        ch = ch + ('a' - 'A');

?

or does that break somewhere?

LER

--
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812              E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Tom Lane wrote:
>
> Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > With Turkish locale it is not possible to write SQL queries in
> > CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
> > are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> > does not match as SQL identifier.
>
> Ugh.
<snip>

How about thinking in the other direction.... is it possible for
PostgreSQL to be able to recognise localised versions of SQL queries?

i.e. For a Turkish locale it associates "ınsert" with INSERT and
"unıon" with UNION.

Perhaps including this in the compilation stage (checking which locales
are installed on a system, or maybe which locales are specified
somewhere)?

Not sure what this would do to performance though, as having to do extra
SQL identifier matching might be a bit slow.

This would have the advantage of the present SQL queries out there
working.

Regards and best wishes,

Justin Clift
Database Administrator
Justin Clift <aa2@bigpond.net.au> writes:
> How about thinking in the other direction.... is it possible for
> PostgreSQL to be able to recognise localised versions of SQL queries?

> i.e. For a Turkish locale it associates "ınsert" with INSERT and
> "unıon" with UNION.

Hmm.  Wouldn't that mean that if someone actually wrote ınsert,
it would be taken as matching the INSERT keyword, not as an identifier?
If I understood Sezai correctly, that would surprise a Turkish user.
But if this behavior is OK then you might have a good answer.

regards, tom lane
Justin Clift wrote:
>
> Tom Lane wrote:
> >
> > Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > > With Turkish locale it is not possible to write SQL queries in
> > > CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
> > > are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> > > does not match as SQL identifier.
> >
> > Ugh.
> <snip>
>
> How about thinking in the other direction.... is it possible for
> PostgreSQL to be able to recognise localised versions of SQL queries?
>
> i.e. For a Turkish locale it associates "ınsert" with INSERT and
> "unıon" with UNION.

I don't have any opinion on how to solve this problem. But I don't
agree with this solution. SQL is naturally English. I am against SQL
being localized.

regards
-sezai
Tom Lane wrote:
>
> Justin Clift <aa2@bigpond.net.au> writes:
> > How about thinking in the other direction.... is it possible for
> > PostgreSQL to be able to recognise localised versions of SQL queries?
>
> > i.e. For a Turkish locale it associates "ınsert" with INSERT and
> > "unıon" with UNION.
>
> Hmm.  Wouldn't that mean that if someone actually wrote ınsert,
> it would be taken as matching the INSERT keyword, not as an identifier?
> If I understood Sezai correctly, that would surprise a Turkish user.
> But if this behavior is OK then you might have a good answer.

This solution is simple and clear. But it is not a good solution, I
think. I don't prefer "ınsert" to be understood as "INSERT" and "unıon"
as "UNION" in SQL keywords. I think this behaviour is not OK.

It should be better to write functions isalpha_en(), isupper_en() and
tolower_en() which actually behave with English locale. Then use these
functions in that block.

regards
-sezai

>
> regards, tom lane
Tom Lane wrote:
>
> Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > With Turkish locale it is not possible to write SQL queries in
> > CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
> > are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> > does not match as SQL identifier.
>
> Ugh.
>
> >     for(i = 0; yytext[i]; i++)
> >         if (isascii((unsigned char)yytext[i]) &&
> >             isupper(yytext[i]))
> >             yytext[i] = tolower(yytext[i]);
>
> > I think it should be better to use another thing which does what
> > function tolower() does but only in English language. This should
> > stay in English locale. I think this will solve the problem.
>
> > yytext[i] += 32;
>
> Hm.  Several problems here:
>
> (1) This solution would break in other locales where isupper() may
> return TRUE for characters other than 'A'..'Z'.
>
> (2) We could fix that by gutting the isascii/isupper test as well,
> reducing it to "yytext[i] >= 'A' && yytext[i] <= 'Z'", but I'd prefer to
> still be able to say that "identifiers fold to lower case" works for
> whatever the local locale thinks is upper and lower case.  It would be
> strange if identifier folding did not agree with the SQL lower()
> function.
>
> (3) I do not like the idea of hard-wiring knowledge of ASCII encoding
> here, even if it's unlikely that anyone would ever try to run Postgres
> on a non-ASCII-based system.
>
> I see your problem, but I'm not sure of a solution that doesn't have bad
> side-effects elsewhere.  Ideas anyone?
>
> regards, tom lane

You are right. What about this one?
================================================================
{identifier}    {
                    int i;
                    ScanKeyword     *keyword;

                    /* I think many platforms understands the
                       following and sets locale to 7-bit ASCII
                       character set (English) */
                    setlocale(LC_ALL, "C");

                    for(i = 0; yytext[i]; i++)
                        if (isascii((unsigned char)yytext[i]) &&
                            isupper(yytext[i]))
                            yytext[i] = tolower(yytext[i]);

                    /* This sets locale to default locale which
                       user prefer to use */
                    setlocale(LC_ALL, "");
================================================================

This works on my Linux box. But, I am not sure with other platforms.
What do you think about performance?

regards
-sezai
Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> You are right. What about this one?

>     setlocale(LC_ALL, "C");
>     for(i = 0; yytext[i]; i++)
>         if (isascii((unsigned char)yytext[i]) &&
>             isupper(yytext[i]))
>             yytext[i] = tolower(yytext[i]);
>     /* This sets locale to default locale which
>        user prefer to use */
>     setlocale(LC_ALL, "");

This isn't really better than "if (isupper(ch)) ch = ch + ('a' - 'A')".
It still breaks the existing locale-aware handling of identifier case,
which I believe is considered a good thing in all locales except C and
Turkish.  Another small problem is that setlocale() is moderately
expensive in most implementations, and we don't want to call it twice
for every identifier scanned.

I am starting to think that the only real solution is a special case
for Turkish users.  Perhaps use tolower() normally but have a compile-
time option to use a non-locale-aware method:

    #ifdef LOCALE_AWARE_IDENTIFIER_FOLDING
        if (isupper(yytext[i]))
            yytext[i] = tolower(yytext[i]);
    #else
        /* this assumes ASCII encoding... */
        if (yytext[i] >= 'A' && yytext[i] <= 'Z')
            yytext[i] += 'a' - 'A';
    #endif

and then document that you have to disable
LOCALE_AWARE_IDENTIFIER_FOLDING to use Turkish locale.

regards, tom lane
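For readers who want to try the compile-time switch outside the scanner, the fragment above can be wrapped in a standalone function. This is a sketch only: `fold_identifier` is a hypothetical name, the macro name is the one suggested in the message, and the cast to unsigned char is an addition here so that bytes above 0x7F are passed safely to the <ctype.h> functions.

```c
#include <ctype.h>

/*
 * Hypothetical wrapper around the proposed switch: folds a
 * NUL-terminated identifier to lower case in place.
 * Build with -DLOCALE_AWARE_IDENTIFIER_FOLDING to get the
 * locale-dependent behaviour; leave it undefined for ASCII-only folding.
 */
void fold_identifier(char *ident)
{
    char *p;

    for (p = ident; *p != '\0'; p++)
    {
        unsigned char ch = (unsigned char) *p;

#ifdef LOCALE_AWARE_IDENTIFIER_FOLDING
        /* Follows whatever the current LC_CTYPE locale says. */
        if (isupper(ch))
            *p = (char) tolower(ch);
#else
        /* This assumes an ASCII-based encoding. */
        if (ch >= 'A' && ch <= 'Z')
            *p = (char) (ch + ('a' - 'A'));
#endif
    }
}
```

With the macro undefined, "UNION" becomes "union" regardless of LC_ALL, which is the behaviour the Turkish-locale report asks for; with it defined, the result depends on the locale in effect at run time.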
Merhaba Sezai!

> I am starting to think that the only real solution is a special case
> for Turkish users.  Perhaps use tolower() normally but have a compile-
> time option to use a non-locale-aware method:

ISTM that this illustrates the tip of the locale iceberg as we think
about moving to a more "locale independent" strategy. Applying
locale-specific munging when scanning tokens prohibits a
context-sensitive interpretation of tokens, which we will need to fully
implement a reasonable set of (or reasonable interpretation of) SQL9x
character set and collation features.

Anyway, your proposal is just fine since we haven't decoupled these
things farther back in the server. But eventually we should hope to have
SQL_ASCII and other character sets enforced in context.

- Thomas
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Anyway, your proposal is just fine since we haven't decoupled these
> things farther back in the server. But eventually we should hope to have
> SQL_ASCII and other character sets enforced in context.

Now I'm confused.  Are you saying that we *should* treat identifier case
under ASCII rules only?  That seems like a step backwards to me, but
then I don't use any non-US locale myself...

regards, tom lane
Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> With Turkish locale it is not possible to write SQL queries in CAPITAL
> letters. SQL identifiers like "INSERT" and "UNION" first are
> downgraded to "ınsert" and "unıon". Then "ınsert" and
> "unıon" does not match as SQL identifier.

I believe this should now work correctly with the changes I just
committed.  If you have the time, please try it out --- you can get
current sources from our CVS server, or use a nightly snapshot dated
tomorrow or later, or use 7.1beta5 when it comes out (which should be
shortly).

regards, tom lane
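The thread does not show the committed change itself. One approach that would satisfy both concerns raised earlier (keywords must match under ASCII rules, while ordinary identifiers may still fold according to the locale) is to test for a keyword first and only then apply tolower(). The sketch below illustrates that idea and is not presented as the actual PostgreSQL code: the keyword table, `ascii_lower`, `is_keyword` and `scan_word` are all hypothetical names, and an ASCII-based encoding is assumed.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, tiny keyword table; a real scanner has many more. */
static const char *const keywords[] = { "insert", "select", "union" };

/* ASCII-only fold of one character (assumes an ASCII-based encoding). */
static char ascii_lower(char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? (char) (ch + ('a' - 'A')) : ch;
}

/* Returns 1 if word equals a keyword, ignoring case under ASCII rules. */
int is_keyword(const char *word)
{
    size_t k;

    for (k = 0; k < sizeof keywords / sizeof keywords[0]; k++)
    {
        const char *kw = keywords[k];
        size_t i = 0;

        while (word[i] != '\0' && kw[i] != '\0' &&
               ascii_lower(word[i]) == kw[i])
            i++;
        if (word[i] == '\0' && kw[i] == '\0')
            return 1;
    }
    return 0;
}

/*
 * Folds word in place and returns 1 for a keyword, 0 for an identifier.
 * Keywords never go through tolower(), so the locale cannot break them;
 * identifiers still follow the current locale's idea of lower case.
 */
int scan_word(char *word)
{
    char *p;

    if (is_keyword(word))
    {
        for (p = word; *p != '\0'; p++)
            *p = ascii_lower(*p);
        return 1;
    }
    for (p = word; *p != '\0'; p++)
    {
        unsigned char ch = (unsigned char) *p;

        if (isupper(ch))
            *p = (char) tolower(ch);
    }
    return 0;
}
```

Under this scheme "INSERT" is recognised as a keyword in any locale, while a word actually typed as "ınsert" does not match the table and stays an ordinary identifier.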
Tom Lane wrote:
>
> Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > With Turkish locale it is not possible to write SQL queries in CAPITAL
> > letters. SQL identifiers like "INSERT" and "UNION" first are
> > downgraded to "ınsert" and "unıon". Then "ınsert" and
> > "unıon" does not match as SQL identifier.
>
> I believe this should now work correctly with the changes I just
> committed.  If you have the time, please try it out --- you can get
> current sources from our CVS server, or use a nightly snapshot dated
> tomorrow or later, or use 7.1beta5 when it comes out (which should be
> shortly).
>
> regards, tom lane

I have tested it with the nightly snapshot dated 22 Feb 2001 and it is
working.

Thanks a lot.

regards
-sezai
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> (Just a follow up...)
> I haven't had time to review the spec on this, but my recollection is
> that the entire SQL language can be described using the SQL_ASCII
> character set. I would assume that this might include unquoted
> identifiers.

The keywords are all ASCII, but SQL99 appears to contemplate allowing
most of Unicode for unquoted identifiers.  See my later message.
(I've already committed the changes described therein, btw...)

regards, tom lane
> > Anyway, your proposal is just fine since we haven't decoupled these
> > things farther back in the server. But eventually we should hope to have
> > SQL_ASCII and other character sets enforced in context.
> Now I'm confused.  Are you saying that we *should* treat identifier case
> under ASCII rules only?  That seems like a step backwards to me, but
> then I don't use any non-US locale myself...

(Just a follow up...)

I haven't had time to review the spec on this, but my recollection is
that the entire SQL language can be described using the SQL_ASCII
character set. I would assume that this might include unquoted
identifiers.

I'd looked at much of this some time ago, but not recently, so my memory
might be faulty (for, um, not the first time :/ )

- Thomas