Thread: Turkish locale bug

Turkish locale bug

From

Sezai YILMAZ

Date:

19 February 2001, 06:47:33

Your name               : Sezai YILMAZ
Your email address      : sezaiy@ata.cs.hun.edu.tr


System Configuration
---------------------
  Architecture (example: Intel Pentium)         : AMD Duron

  Operating System (example: Linux 2.0.26 ELF)  : Linux 2.2.17 ELF

  PostgreSQL version (example: PostgreSQL-7.0):   PostgreSQL-7.0.3

  Compiler used (example:  gcc 2.8.0)           : gcc 2.95.3


Please enter a FULL description of your problem:
------------------------------------------------

Locale support for Turkish causes a problem. The problem is with the
character 'I' (the capital of the 9th letter of the English alphabet).
When 'I' is passed to the tolower() function and the locale is
set to "tr_TR", it is lowercased to the special Turkish character 'ı'
(dotless i, which shows up as "y acute" when the byte is read as
Latin-1), not 'i'. This causes the following problem:

With Turkish locale it is not possible to write SQL queries in
CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
does not match as SQL identifier.
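
The failure described above can be sketched with a toy keyword lookup. This is only an illustration, not the parser's real code: the table and the is_keyword() helper are hypothetical, and the sketch assumes an ISO-8859-9 encoding, where dotless i is the single byte 0xFD.

```c
#include <stddef.h>
#include <string.h>

/* Tiny stand-in for the parser's keyword table (illustrative only). */
static const char *const keywords[] = { "insert", "select", "union" };

/* Returns 1 if the already-folded word is in the table, else 0. */
int is_keyword(const char *folded)
{
    size_t i;

    for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++)
        if (strcmp(folded, keywords[i]) == 0)
            return 1;
    return 0;
}
```

A word folded to "insert" is found, while the byte string "\xFDnsert" (what "INSERT" becomes once 'I' is lowered to dotless i) is not, so the scanner would treat it as an ordinary identifier.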



Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

When you set "LC_ALL" environment variable to "tr_TR" this
problem happens.
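
A small helper along the following lines could be used to check what tolower() returns under a given locale, independently of PostgreSQL. It is a sketch under assumptions: the helper name is made up for illustration, and whether a locale called "tr_TR" is installed (and which encoding it uses) depends on the system.

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Returns tolower(ch) as computed under the named locale, or -1 if that
 * locale cannot be selected. The previous LC_CTYPE setting is restored. */
int tolower_in_locale(const char *locale_name, int ch)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    int result;

    /* Copy the current name: a later setlocale() call may overwrite it. */
    if (current == NULL || strlen(current) >= sizeof(saved))
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;              /* locale not available on this system */

    result = tolower((unsigned char) ch);
    setlocale(LC_CTYPE, saved);
    return result;
}
```

Comparing tolower_in_locale("C", 'I') with tolower_in_locale("tr_TR", 'I') on an affected system should show the difference reported here; the second call returns -1 if no such locale is installed.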



If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

In file:

[postgresqlsourcepath]/src/backend/parser/scan.l

The following block uses tolower(), which is affected by the locale
settings of the shell that runs the postmaster.

================================================================
{identifier}    {
                    int i;
                    ScanKeyword             *keyword;

                    for(i = 0; yytext[i]; i++)
                          if (isascii((unsigned char)yytext[i]) &&
                                isupper(yytext[i]))
                                yytext[i] = tolower(yytext[i]);
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
================================================================

I think it would be better to use something that does what
tolower() does, but only for English letters, whatever the locale
is. I think this will solve the problem.

'a' - 'A' = 32

So we can use the following line instead of the last line marked
in the block above.

yytext[i] += 32;

Re: Turkish locale bug

From

Tom Lane

Date:

19 February 2001, 21:30:33

Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> With Turkish locale it is not possible to write SQL queries in 
> CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first 
> are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> does not match as SQL identifier.

Ugh.

>                     for(i = 0; yytext[i]; i++)
>                           if (isascii((unsigned char)yytext[i]) &&
>                                 isupper(yytext[i]))
>                                 yytext[i] = tolower(yytext[i]);

> I think it should be better to use another thing which does what 
> function tolower() does but only in English language. This should
> stay in English locale. I think this will solve the problem.

> yytext[i] += 32;

Hm.  Several problems here:

(1) This solution would break in other locales where isupper() may
return TRUE for characters other than 'A'..'Z'.

(2) We could fix that by gutting the isascii/isupper test as well,
reducing it to "yytext[i] >= 'A' && yytext[i] <= 'Z'", but I'd prefer to
still be able to say that "identifiers fold to lower case" works for
whatever the local locale thinks is upper and lower case.  It would be
strange if identifier folding did not agree with the SQL lower()
function.

(3) I do not like the idea of hard-wiring knowledge of ASCII encoding
here, even if it's unlikely that anyone would ever try to run Postgres
on a non-ASCII-based system.

I see your problem, but I'm not sure of a solution that doesn't have bad
side-effects elsewhere.  Ideas anyone?
        regards, tom lane

Re: [HACKERS] Re: Turkish locale bug

From

Larry Rosenman

Date:

19 February 2001, 21:46:27

* Tom Lane <tgl@sss.pgh.pa.us> [010219 20:31]:
> 
> Hm.  Several problems here:
> 
> (1) This solution would break in other locales where isupper() may
> return TRUE for characters other than 'A'..'Z'.
> 
> (2) We could fix that by gutting the isascii/isupper test as well,
> reducing it to "yytext[i] >= 'A' && yytext[i] <= 'Z'", but I'd prefer to
> still be able to say that "identifiers fold to lower case" works for
> whatever the local locale thinks is upper and lower case.  It would be
> strange if identifier folding did not agree with the SQL lower()
> function.
What about EBCDIC (IBM MainFrame, I.E. Linux on S/390, Z/390). 

EBCDIC has 3 different ranges that contain letters.

X'C1'-X'C9' (A-I)
X'D1'-X'D9' (J-R)
X'E2'-X'E9' (S-Z)

and the *LOWER* case ones subtract X'40' (SPACE) to get there.

Plus Numbers are X'F0'- X'F9'. 

This is from my memory of mainframe assembler five years ago....
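
The three ranges listed above can be written down as plain numeric constants, which also shows why a single 'A'..'Z' range test is not enough on EBCDIC. This is a sketch only; the function names are hypothetical and the code points are the ones given in this message.

```c
/* EBCDIC upper-case letters occupy three separate runs. The values are
 * plain numbers, so this sketch behaves the same on an ASCII machine. */
int ebcdic_isupper(unsigned char c)
{
    return (c >= 0xC1 && c <= 0xC9) ||   /* A-I */
           (c >= 0xD1 && c <= 0xD9) ||   /* J-R */
           (c >= 0xE2 && c <= 0xE9);     /* S-Z */
}

/* Lower-case letters sit X'40' below their upper-case counterparts. */
unsigned char ebcdic_tolower(unsigned char c)
{
    return ebcdic_isupper(c) ? (unsigned char) (c - 0x40) : c;
}

/* A single-range test also accepts the gaps between the three runs. */
int naive_range_isupper(unsigned char c)
{
    return c >= 0xC1 && c <= 0xE9;
}
```

For example, X'CA' lies between 'I' and 'J': naive_range_isupper() accepts it, while ebcdic_isupper() does not.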
> 
> (3) I do not like the idea of hard-wiring knowledge of ASCII encoding
> here, even if it's unlikely that anyone would ever try to run Postgres
> on a non-ASCII-based system.
Not unlikely now.  See Apache and other ports that now handle EBCDIC.
> 
> I see your problem, but I'm not sure of a solution that doesn't have bad
> side-effects elsewhere.  Ideas anyone?
> 
-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

Re: [HACKERS] Re: Turkish locale bug

From

Tom Lane

Date:

19 February 2001, 22:06:00

Larry Rosenman <ler@lerctr.org> writes:
> What about EBCDIC (IBM MainFrame, I.E. Linux on S/390, Z/390). 

Right, that was what I meant about not wanting to hardwire assumptions
about ASCII.

We could instead code it as

    if (isupper(ch))
      ch = ch + ('a' - 'A');

which I believe will work on EBCDIC as well as ASCII.  However, it still
breaks down if isupper() claims that anything besides 'A'..'Z' is
uppercase --- and the simple 'A' to 'Z' range check does *not* work in
EBCDIC.

It would be an interesting timewaster to try to get Postgres working on
an EBCDIC platform ;-).  I'm sure there are a lot of ASCII dependencies
lurking in the code that would need to be snuffed out.  However, that
doesn't mean that I'm eager to add another one here ...
        regards, tom lane

Re: [HACKERS] Re: Turkish locale bug

From

Larry Rosenman

Date:

19 February 2001, 22:15:35

* Tom Lane <tgl@sss.pgh.pa.us> [010219 21:02]:
> Larry Rosenman <ler@lerctr.org> writes:
> > What about EBCDIC (IBM MainFrame, I.E. Linux on S/390, Z/390). 
> 
> Right, that was what I meant about not wanting to hardwire assumptions
> about ASCII.
> 
> We could instead code it as
> 
>     if (isupper(ch))
>       ch = ch + ('a' - 'A');
what about:

    if (isupper(ch) && isalpha(ch))
      ch = ch + ('a' - 'A');

? 

or does that break somewhere? 



LER
-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

Re: Turkish locale bug

From

Justin Clift

Date:

19 February 2001, 22:32:12

Tom Lane wrote:
> 
> Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > With Turkish locale it is not possible to write SQL queries in
> > CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
> > are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> > does not match as SQL identifier.
> 
> Ugh.
<snip>

How about thinking in the other direction... is it possible for
PostgreSQL to recognise localised versions of SQL queries?
i.e. for a Turkish locale it would associate "ınsert" with INSERT and
"unıon" with UNION.

Perhaps this could be set up at the compilation stage (checking which
locales are installed on a system, or maybe which locales are specified
somewhere)?

Not sure what this would do to performance though, as having to do extra
SQL identifier matching might be a bit slow.

This would have the advantage of the present SQL queries out there
working.

Regards and best wishes,

Justin Clift
Database Administrator

Re: Turkish locale bug

From

Tom Lane

Date:

19 February 2001, 22:38:11

Justin Clift <aa2@bigpond.net.au> writes:
> How about thinking in the other direction.... is it possible for
> PostgreSQL to be able to recognised localised versions of SQL queries?

>  i.e. For a Turkish locale it associates "ınsert" INSERT and "unıon"
> with UNION.

Hmm.  Wouldn't that mean that if someone actually wrote ınsert,
it would be taken as matching the INSERT keyword, not as an identifier?
If I understood Sezai correctly, that would surprise a Turkish user.
But if this behavior is OK then you might have a good answer.
        regards, tom lane

Re: Turkish locale bug

From

Sezai YILMAZ

Date:

20 February 2001, 03:43:40


Justin Clift wrote:
> 
> Tom Lane wrote:
> >
> > Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > > With Turkish locale it is not possible to write SQL queries in
> > > CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
> > > are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> > > does not match as SQL identifier.
> >
> > Ugh.
> <snip>
> 
> How about thinking in the other direction.... is it possible for
> PostgreSQL
> to be able to recognised localised versions of SQL queries?
> 
>  i.e. For a Turkish locale it associates "ınsert" INSERT and "unıon"
> with UNION.

I don't have an opinion on how to solve this problem, but
I don't agree with this solution. SQL is naturally English. I am
against localizing SQL.

regards
-sezai

Re: Re: Turkish locale bug

From

Sezai YILMAZ

Date:

20 February 2001, 03:57:40

Tom Lane wrote:
>
> Justin Clift <aa2@bigpond.net.au> writes:
> > How about thinking in the other direction.... is it possible for
> > PostgreSQL to be able to recognised localised versions of SQL queries?
>
> >  i.e. For a Turkish locale it associates "ınsert" INSERT and "unıon"
> > with UNION.
>
> Hmm.  Wouldn't that mean that if someone actually wrote ınsert,
> it would be taken as matching the INSERT keyword, not as an identifier?
> If I understood Sezai correctly, that would surprise a Turkish user.
> But if this behavior is OK then you might have a good answer.

This solution is simple and clear, but I don't think it is a good
one. I would not want "ınsert" to be understood as "INSERT" and
"unıon" as "UNION" among SQL keywords. I think this behaviour is not
OK.

It would be better to write functions isalpha_en(), isupper_en()
and tolower_en() that always behave as in the English locale, and then
use these functions in that block.
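
One possible shape for the three functions named above is sketched here. Only the names come from this message; the bodies are an assumption. The lookup-table approach is chosen so that the code depends neither on the current locale nor on the letters being contiguous in the character set.

```c
#include <string.h>

static const char upper_en[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char lower_en[] = "abcdefghijklmnopqrstuvwxyz";

/* ch is expected to be an unsigned char value, as with <ctype.h>. */
int isupper_en(int ch)
{
    /* strchr() would match the terminating NUL, so rule that out. */
    return ch != '\0' && strchr(upper_en, ch) != NULL;
}

int isalpha_en(int ch)
{
    return isupper_en(ch) ||
           (ch != '\0' && strchr(lower_en, ch) != NULL);
}

/* Maps 'A'..'Z' to 'a'..'z' and leaves every other value unchanged. */
int tolower_en(int ch)
{
    const char *p = (ch != '\0') ? strchr(upper_en, ch) : NULL;

    return p ? lower_en[p - upper_en] : ch;
}
```

With these, tolower_en('I') yields 'i' whatever LC_ALL is set to, and bytes outside A-Z (such as 0xFD) pass through untouched.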

regards
-sezai


Re: Turkish locale bug

From

Sezai YILMAZ

Date:

20 February 2001, 04:22:31


Tom Lane wrote:
> 
> Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > With Turkish locale it is not possible to write SQL queries in
> > CAPITAL letters. SQL identifiers like "INSERT" and "UNION" first
> > are downgraded to "ınsert" and "unıon". Then "ınsert" and "unıon"
> > does not match as SQL identifier.
> 
> Ugh.
> 
> >                     for(i = 0; yytext[i]; i++)
> >                           if (isascii((unsigned char)yytext[i]) &&
> >                                 isupper(yytext[i]))
> >                                 yytext[i] = tolower(yytext[i]);
> 
> > I think it should be better to use another thing which does what
> > function tolower() does but only in English language. This should
> > stay in English locale. I think this will solve the problem.
> 
> > yytext[i] += 32;
> 
> Hm.  Several problems here:
> 
> (1) This solution would break in other locales where isupper() may
> return TRUE for characters other than 'A'..'Z'.
> 
> (2) We could fix that by gutting the isascii/isupper test as well,
> reducing it to "yytext[i] >= 'A' && yytext[i] <= 'Z'", but I'd prefer to
> still be able to say that "identifiers fold to lower case" works for
> whatever the local locale thinks is upper and lower case.  It would be
> strange if identifier folding did not agree with the SQL lower()
> function.
> 
> (3) I do not like the idea of hard-wiring knowledge of ASCII encoding
> here, even if it's unlikely that anyone would ever try to run Postgres
> on a non-ASCII-based system.
> 
> I see your problem, but I'm not sure of a solution that doesn't have bad
> side-effects elsewhere.  Ideas anyone?
> 
>                         regards, tom lane

You are right. What about this one?

================================================================
{identifier}    {
                    int i;
                    ScanKeyword             *keyword;

                    /* I think many platforms understands the
                       following and sets locale to 7-bit ASCII
                       character set (English) */

                    setlocale(LC_ALL, "C");

                    for(i = 0; yytext[i]; i++)
                          if (isascii((unsigned char)yytext[i]) &&
                                isupper(yytext[i]))
                                yytext[i] = tolower(yytext[i]);

                    /* This sets locale to default locale which
                       user prefer to use */

                    setlocale(LC_ALL, "");
================================================================

This works on my Linux box, but I am not sure about other
platforms. What do you think about performance?

regards
-sezai

Re: Turkish locale bug

From

Tom Lane

Date:

20 February 2001, 11:10:07

Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> You are right. What about this one?

>             setlocale(LC_ALL, "C");  

>                     for(i = 0; yytext[i]; i++)
>                           if (isascii((unsigned char)yytext[i]) &&
>                                 isupper(yytext[i]))
>                                 yytext[i] = tolower(yytext[i]);

>                     /* This sets locale to default locale which 
>                        user prefer to use */

>             setlocale(LC_ALL, "");  

This isn't really better than "if (isupper(ch)) ch = ch + ('a' - 'A')".
It still breaks the existing locale-aware handling of identifier case,
which I believe is considered a good thing in all locales except C
and Turkish.  Another small problem is that setlocale() is moderately
expensive in most implementations, and we don't want to call it twice
for every identifier scanned.

I am starting to think that the only real solution is a special case
for Turkish users.  Perhaps use tolower() normally but have a compile-
time option to use a non-locale-aware method:

#ifdef LOCALE_AWARE_IDENTIFIER_FOLDING
                if (isupper(yytext[i]))
                    yytext[i] = tolower(yytext[i]);
#else
                /* this assumes ASCII encoding... */
                if (yytext[i] >= 'A' && yytext[i] <= 'Z')
                    yytext[i] += 'a' - 'A';
#endif

and then document that you have to disable
LOCALE_AWARE_IDENTIFIER_FOLDING to use Turkish locale.
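
As a standalone illustration of the compile-time switch proposed above, the two branches could be wrapped in a small function like this. The function name fold_identifier() is hypothetical, and this is a sketch of the proposal as written in this message, not necessarily what ends up in the tree.

```c
#include <ctype.h>

/* Folds an identifier to lower case in place. Build with
 * -DLOCALE_AWARE_IDENTIFIER_FOLDING to get locale-aware folding. */
void fold_identifier(char *ident)
{
    int i;

    for (i = 0; ident[i]; i++)
    {
#ifdef LOCALE_AWARE_IDENTIFIER_FOLDING
        unsigned char ch = (unsigned char) ident[i];

        if (isupper(ch))
            ident[i] = (char) tolower(ch);
#else
        /* this assumes ASCII encoding */
        if (ident[i] >= 'A' && ident[i] <= 'Z')
            ident[i] += 'a' - 'A';
#endif
    }
}
```

Built without the macro, "UNION" folds to "union" regardless of LC_ALL; built with it, the result follows whatever the current locale says about 'I'.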
        regards, tom lane

Re: Turkish locale bug

From

Thomas Lockhart

Date:

20 February 2001, 11:39:11

Merhaba Sezai!

> I am starting to think that the only real solution is a special case
> for Turkish users.  Perhaps use tolower() normally but have a compile-
> time option to use a non-locale-aware method:

istm that this illustrates the tip of the locale iceberg as we think
about moving to a more "locale independent" strategy. Applying
locale-specific munging when scanning tokens prohibits a
context-sensitive interpretation of tokens, which we will need to fully
implement a reasonable set of (or reasonable interpretation of) SQL9x
character set and collation features.

Anyway, your proposal is just fine since we haven't decoupled these
things farther back in the server. But eventually we should hope to have
SQL_ASCII and other character sets enforced in context.
                     - Thomas

Re: Turkish locale bug

From

Tom Lane

Date:

20 February 2001, 11:52:34

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Anyway, your proposal is just fine since we haven't decoupled these
> things farther back in the server. But eventually we should hope to have
> SQL_ASCII and other character sets enforced in context.

Now I'm confused.  Are you saying that we *should* treat identifier case
under ASCII rules only?  That seems like a step backwards to me, but
then I don't use any non-US locale myself...
        regards, tom lane

Re: Turkish locale bug

From

Tom Lane

Date:

21 February 2001, 14:14:00

Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> With Turkish locale it is not possible to write SQL queries in CAPITAL
> letters. SQL identifiers like "INSERT" and "UNION" first are
> downgraded to "ınsert" and "unıon". Then "ınsert" and
> "unıon" does not match as SQL identifier.

I believe this should now work correctly with the changes I just
committed.  If you have the time, please try it out --- you can get
current sources from our CVS server, or use a nightly snapshot dated
tomorrow or later, or use 7.1beta5 when it comes out (which should be
shortly).

            regards, tom lane

Re: Turkish locale bug

From

Sezai YILMAZ

Date:

23 February 2001, 03:02:36

Tom Lane wrote:
>
> Sezai YILMAZ <sezaiy@ata.cs.hun.edu.tr> writes:
> > With Turkish locale it is not possible to write SQL queries in CAPITAL
> > letters. SQL identifiers like "INSERT" and "UNION" first are
> > downgraded to "ınsert" and "unıon". Then "ınsert" and
> > "unıon" does not match as SQL identifier.
>
> I believe this should now work correctly with the changes I just
> committed.  If you have the time, please try it out --- you can get
> current sources from our CVS server, or use a nightly snapshot dated
> tomorrow or later, or use 7.1beta5 when it comes out (which should be
> shortly).
>
>                         regards, tom lane

I have tested it with the nightly snapshot dated 22 Feb 2001 and it is
working. Thanks a lot.

regards
-sezai

Re: Turkish locale bug

From

Tom Lane

Date:

23 February 2001, 13:13:27

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> (Just a follow up...)

> I haven't had time to review the spec on this, but my recollection is
> that the entire SQL language can be described using the SQL_ASCII
> character set. I would assume that this might include unquoted
> identifiers.

The keywords are all ASCII, but SQL99 appears to contemplate allowing
most of Unicode for unquoted identifiers.  See my later message.
(I've already committed the changes described therein, btw...)
        regards, tom lane

Re: Turkish locale bug

From

Thomas Lockhart

Date:

23 February 2001, 13:26:46

> > Anyway, your proposal is just fine since we haven't decoupled these
> > things farther back in the server. But eventually we should hope to have
> > SQL_ASCII and other character sets enforced in context.
> Now I'm confused.  Are you saying that we *should* treat identifier case
> under ASCII rules only?  That seems like a step backwards to me, but
> then I don't use any non-US locale myself...

(Just a follow up...)

I haven't had time to review the spec on this, but my recollection is
that the entire SQL language can be described using the SQL_ASCII
character set. I would assume that this might include unquoted
identifiers. I'd looked at much of this some time ago, but not recently
so my memory might be faulty (for, um, not the first time :/ ).
                   - Thomas