Thread: Re: 7.4.1 release status - Turkish Locale

Re: 7.4.1 release status - Turkish Locale

From
"Nicolai Tufar"
Date:
<div class="Section1"><pre><font color="black" face="Courier New" size="2"><span style="font-size:10.0pt">> <span
class="SpellE">We</span><span class="SpellE">might</span> <span class="SpellE">think</span> <span
class="SpellE">that</span><span class="SpellE">the</span> <span class="SpellE">Turkish</span>-<span
class="SpellE">locale</span>problem Devrim <span class="SpellE">Gunduz</span> <span class="SpellE">pointed</span> <span
class="SpellE">out</span></span></font></pre><pre><fontcolor="black" face="Courier New" size="2"><span
style="font-size:10.0pt">>is a <span class="SpellE">must</span>-<span class="SpellE">fix</span>, <span
class="SpellE">too</span>.<spanstyle="mso-spacerun:yes">  </span>But <span class="SpellE">I'm</span> not <span
class="SpellE">convinced</span>yet <span class="SpellE">what</span> <span class="SpellE">to</span> do <span
class="SpellE">about</span>it.</span></font></pre><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial"> </span></font><p class="MsoPlainText"><font face="Courier New" size="2"><span style="font-size:
10.0pt">Here is a <span class="SpellE">first</span> <span class="SpellE">try</span> <span class="SpellE">to</span>
<spanclass="SpellE">fix</span> <span class="SpellE">what</span> Devrim <span class="SpellE">Gunduz</span> <span
class="SpellE">talked</span><span class="SpellE">about</span>.<br /><br /><span class="SpellE">Please</span> be <span
class="SpellE">patient</span><span class="SpellE">with</span> <span class="SpellE">me</span> <span
class="SpellE">for</span>it is <span class="SpellE">the</span> <span class="SpellE">first</span> <span
class="SpellE">major</span><span class="SpellE">patch</span><br /> I <span class="SpellE">submit</span> <span
class="SpellE">and</span>I <span class="SpellE">realize</span> <span class="SpellE">that</span> I <span
class="SpellE">blatantly</span><span class="SpellE">violated</span> <span class="SpellE">many</span> <span
class="SpellE">rules</span><br/> of <span class="SpellE">good</span> <span class="SpellE">style</span> in <span
class="SpellE">PostgreSQL</span><span class="SpellE">source</span> <span class="SpellE">code</span>.<br /><br /><span
class="SpellE">First</span>,<span class="SpellE">about</span> <span class="SpellE">the</span> problem. <span
class="SpellE">Turkish</span><span class="SpellE">language</span> has <span class="SpellE">two</span> <span
class="SpellE">letters</span>"i".<br /><span class="SpellE">One</span> is <span class="SpellE">with</span> <span
class="SpellE">dot</span>on top <span class="SpellE">and</span> <span class="SpellE">the</span> <span
class="SpellE">other</span>is <span class="SpellE">without</span>. <span class="SpellE">Simply</span> as <span
class="SpellE">that</span>.<br/><span class="SpellE">The</span> <span class="SpellE">one</span> <span
class="SpellE">with</span><span class="SpellE">dot</span> has <span class="SpellE">the</span> <span
class="SpellE">dot</span><span class="SpellE">both</span> as <span class="SpellE">capital</span> <span
class="SpellE">and</span><span class="SpellE">lower</span>-<span class="SpellE">case</span> <span
class="SpellE">and</span><br/><span class="SpellE">the</span> <span class="SpellE">one</span> <span
class="SpellE">without</span><span class="SpellE">dot</span> has no <span class="SpellE">dot</span> in <span
class="SpellE">both</span><span class="SpellE">upper</span> <span class="SpellE">and</span> <span
class="SpellE">lower</span><span class="SpellE">case</span>...<br /> as <span class="SpellE">opposed</span> <span
class="SpellE">to</span><span class="SpellE">English</span> <span class="SpellE">where</span> "i" has a <span
class="SpellE">dot</span><span class="SpellE">when</span> <span class="SpellE">lower</span>-<span
class="SpellE">case</span><span class="SpellE">and</span><br /> has no <span class="SpellE">dot</span> <span
class="SpellE">when</span><span class="SpellE">upper</span>-<span class="SpellE">case</span>.<br /><br /> Problem <span
class="SpellE">arise</span><span class="SpellE">when</span> <span class="SpellE">PostgreSQL</span>, <span
class="SpellE">while</span><span class="SpellE">running</span> <span class="SpellE">with</span> "<span
class="SpellE">tr</span>_TR"<span class="SpellE">locale</span><br /><span class="SpellE">converts</span> <span
class="SpellE">to</span><span class="SpellE">lower</span>-<span class="SpellE">case</span> an <span
class="SpellE">identifier</span>as a <span class="SpellE">table</span>, an <span class="SpellE">index</span> <span
class="SpellE">or</span><br/> a <span class="SpellE">column</span> name. <span class="SpellE">If</span> it is <span
class="SpellE">written</span><span class="SpellE">with</span> <span class="SpellE">capital</span> "I", <span
class="SpellE">tolower</span>()<span class="SpellE">with</span><br /> 'I' as <span class="SpellE">argument</span> <span
class="SpellE">will</span><span class="SpellE">return</span> <span class="SpellE">Turkish</span> <span
class="SpellE">specific</span><span class="SpellE">character</span>: <br /> 'i'-<span
class="SpellE">without</span>-a-<spanclass="SpellE">dot</span> <span class="SpellE">what</span> I <span
class="SpellE">am</span><span class="SpellE">afraid</span> <span class="SpellE">will</span> not be <span
class="SpellE">shown</span><span class="SpellE">correctly</span><br /> in <span class="SpellE">your</span> e-mail <span
class="SpellE">readers</span>.<br/><br /><span class="SpellE">Let</span> <span class="SpellE">me</span> <span
class="SpellE">give</span><span class="SpellE">some</span> <span class="SpellE">examples</span>.<br /><br /><span
class="SpellE">initdb</span><span class="SpellE">script</span> <span class="SpellE">runs</span> <span
class="SpellE">apparently</span><span class="SpellE">innocent</span> <span class="SpellE">script</span> in file<br
/><spanclass="SpellE">src</span>/<span class="SpellE">backend</span>/<span class="SpellE">utils</span>/<span
class="SpellE">mb</span>/<spanclass="SpellE">conversion</span>_<span class="SpellE">procs</span>/<span
class="SpellE">conversion</span>_<spanclass="SpellE">create</span>.<span class="SpellE">sql</span><br /><span
class="SpellE">to</span><span class="SpellE">create</span> a <span class="SpellE">couple</span> of <span
class="SpellE">functions</span><span class="SpellE">whose</span> <span class="SpellE">only</span> <span
class="SpellE">fault</span><span class="SpellE">was</span><br /><span class="SpellE">to</span> <span
class="SpellE">declare</span>it <span class="SpellE">their</span> <span class="SpellE">return</span> <span
class="SpellE">parameters</span>as VOID. <span class="SpellE">Backend</span><br /><span class="SpellE">returns</span>
<spanclass="SpellE">error</span> <span class="SpellE">message</span> <span class="SpellE">that</span> <span
class="SpellE">type</span>"<span class="SpellE">vo</span> d" is not <span class="SpellE">found</span> <span
class="SpellE">and</span><br/><span class="SpellE">initdb</span> <span class="SpellE">fails</span>.<br /><br /> A <span
class="SpellE">nothing</span><span class="SpellE">suspecting</span> <span class="SpellE">novice</span> <span
class="SpellE">user</span><span class="SpellE">was</span> <span class="SpellE">excited</span> <span
class="SpellE">about</span><br/> SERIAL data <span class="SpellE">type</span> <span class="SpellE">he</span> <span
class="SpellE">was</span><span class="SpellE">tail</span> is <span class="SpellE">present</span> in <span
class="SpellE">PostgreSQL</span>.<br/><span class="SpellE">It</span> <span class="SpellE">took</span> us <span
class="SpellE">with</span>Devrim a <span class="SpellE">lot</span> of time <span class="SpellE">to</span> <span
class="SpellE">explain</span><span class="SpellE">why</span> <span class="SpellE">he</span><br /><span
class="SpellE">need</span><span class="SpellE">to</span> <span class="SpellE">type</span> SERIAL as <span
class="SpellE">SERiAL</span><span class="SpellE">for</span> <span class="SpellE">now</span> <span
class="SpellE">till</span>a <span class="SpellE">workaround</span><br /> is <span class="SpellE">developed</span>.<br
/><br/><span class="SpellE">Another</span> <span class="SpellE">case</span> <span class="SpellE">happened</span> <span
class="SpellE">with</span><span class="SpellE">me</span> <span class="SpellE">when</span> I <span
class="SpellE">wanted</span><span class="SpellE">to</span> restore<br /> a <span class="SpellE">pg</span>_<span
class="SpellE">dump</span><span class="SpellE">dump</span>. Restore <span class="SpellE">failed</span> <span
class="SpellE">because</span><span class="SpellE">script</span> <span class="SpellE">was</span> <span
class="SpellE">creating</span><br/><span class="SpellE">scripts</span> <span class="SpellE">that</span> <span
class="SpellE">belong</span><span class="SpellE">to</span> PUBLIC.<br /><br /><br /><span class="SpellE">For</span>
<spanclass="SpellE">the</span> <span class="SpellE">solution</span>, <span class="SpellE">after</span> <span
class="SpellE">some</span><span class="SpellE">research</span> <span class="SpellE">we</span> <span
class="SpellE">found</span><span class="SpellE">out</span> <span class="SpellE">that</span><br /><span
class="SpellE">offender</span>is <span class="SpellE">tolower</span>() <span class="SpellE">call</span> in <span
class="SpellE">src</span>/<spanclass="SpellE">backend</span>/<span class="SpellE">parser</span>/<span
class="SpellE">scan</span>.l<br/> in {<span class="SpellE">identifier</span>} <span class="SpellE">section</span>.
<spanclass="SpellE">tolower</span>() <span class="SpellE">works</span> <span class="SpellE">fine</span> <span
class="SpellE">with</span><span class="SpellE">any</span><br /><span class="SpellE">locale</span> <span
class="SpellE">and</span><span class="SpellE">with</span> <span class="SpellE">any</span> <span
class="SpellE">character</span><span class="SpellE">save</span> <span class="SpellE">for</span> <span
class="SpellE">the</span><span class="SpellE">Turkish</span> <span class="SpellE">locale</span><br /><span
class="SpellE">and</span><span class="SpellE">capital</span> 'I' <span class="SpellE">character</span>. <span
class="SpellE">So</span>,<span class="SpellE">the</span> <span class="SpellE">obvious</span> <span
class="SpellE">solution</span>is<br /><span class="SpellE">to</span> put a <span class="SpellE">check</span> <span
class="SpellE">for</span><span class="SpellE">Turkish</span> <span class="SpellE">locale</span> <span
class="SpellE">and</span>'I' <span class="SpellE">character</span>.<br /><span class="SpellE">Something</span> <span
class="SpellE">like</span><span class="SpellE">this</span>:<br /><br /><span class="SpellE">if</span>( <<span
class="SpellE">locale</span>is <span class="SpellE">Turkish</span>> && <span class="SpellE">ident</span>[i]
=='I' )<br /><span style="mso-tab-count:1">      </span><span class="SpellE">ident</span>[i] = 'i';<br /> else<br
/><spanstyle="mso-tab-count:1">      </span><span class="SpellE">ident</span>[i] = <span
class="SpellE">tolower</span>((<spanclass="SpellE">unsigned</span> <span class="SpellE">char</span>) <span
class="SpellE">ident</span>[i]);<br/><br /><span class="SpellE">Looks</span> <span class="SpellE">rather</span> <span
class="SpellE">simple</span>but <span class="SpellE">the</span> hard <span class="SpellE">part</span> <span
class="SpellE">was</span><span class="SpellE">to</span> <span class="SpellE">figure</span> <span
class="SpellE">out</span><br/><span class="SpellE">what</span> is <span class="SpellE">the</span> <span
class="SpellE">current</span><span class="SpellE">locale</span>. <span class="SpellE">To</span> do <span
class="SpellE">this</span>I <span class="SpellE">added</span><br /><br /><span class="SpellE">const</span> <span
class="SpellE">char</span>*<span class="SpellE">get</span>_<span class="SpellE">locale</span>_<span
class="SpellE">category</span>(<spanclass="SpellE">const</span> <span class="SpellE">char</span> *<span
class="SpellE">category</span>);<br/><br /><span class="SpellE">to</span> <span class="SpellE">src</span>/<span
class="SpellE">backend</span>/<spanclass="SpellE">utils</span>/<span class="SpellE">adt</span>/<span
class="SpellE">pg</span>_<spanclass="SpellE">locale</span>.c <span class="SpellE">that</span> <span
class="SpellE">would</span><span class="SpellE">return</span><br /><span class="SpellE">locale</span> <span
class="SpellE">identifier</span><span class="SpellE">for</span> <span class="SpellE">the</span> <span
class="SpellE">category</span><span class="SpellE">specified</span> <span class="SpellE">or</span> LC_ALL<br /><span
class="SpellE">if</span><span class="SpellE">category</span> is NULL. I <span class="SpellE">could</span> not <span
class="SpellE">find</span><span class="SpellE">any</span> <span class="SpellE">other</span> <span
class="SpellE">function</span><br/><span class="SpellE">that</span> <span class="SpellE">will</span> <span
class="SpellE">return</span><span class="SpellE">what</span> I <span class="SpellE">need</span>. <span
class="SpellE">Please</span><span class="SpellE">help</span> <span class="SpellE">me</span> <span
class="SpellE">to</span><span class="SpellE">find</span><br /><span class="SpellE">one</span> <span
class="SpellE">because</span>I <span class="SpellE">would</span> <span class="SpellE">hate</span> <span
class="SpellE">to</span><span class="SpellE">introduce</span> a <span class="SpellE">new</span> <span
class="SpellE">function</span>.<br/><br /> I <span class="SpellE">realize</span> <span class="SpellE">that</span>
{<spanclass="SpellE">identifier</span>} <span class="SpellE">section</span> is <span class="SpellE">very</span> <span
class="SpellE">performance</span><br/><span class="SpellE">critical</span> <span class="SpellE">so</span> I <span
class="SpellE">introduced</span>a global <span class="SpellE">variable</span><br /><br /><span
class="SpellE">static</span><span class="SpellE">int</span> <span class="SpellE">isturkishlocale</span> = -1;<br /><br
/>at <span class="SpellE">the</span> <span class="SpellE">beginning</span> of <span class="SpellE">src</span>/<span
class="SpellE">backend</span>/<spanclass="SpellE">parser</span>/<span class="SpellE">scan</span>.l<br /><span
class="SpellE">It</span>is set <span class="SpellE">to</span> -1 <span class="SpellE">when</span> not yet <span
class="SpellE">initialized</span>,0 <span class="SpellE">if</span><br /><span class="SpellE">locale</span> is not <span
class="SpellE">Turkish</span><span class="SpellE">and</span> 1 <span class="SpellE">if</span> <span
class="SpellE">locale</span>is <span class="SpellE">Turkish</span>.<br /><br /><span class="SpellE">It</span> <span
class="SpellE">might</span>not be <span class="SpellE">the</span> <span class="SpellE">way</span> it is <span
class="SpellE">usually</span>done in <span class="SpellE">PostgreSQL</span><br /><span class="SpellE">source</span>
<spanclass="SpellE">code</span>. <span class="SpellE">Could</span> <span class="SpellE">you</span> <span
class="SpellE">pleas</span><span class="SpellE">advise</span> <span class="SpellE">if</span> <span
class="SpellE">the</span>name I <span class="SpellE">chose</span><br /> is <span class="SpellE">appropriate</span>
<spanclass="SpellE">and</span> <span class="SpellE">whether</span> <span class="SpellE">there</span> is a <span
class="SpellE">more</span><span class="SpellE">appropriate</span><br /><span class="SpellE">place</span> <span
class="SpellE">to</span>put <span class="SpellE">declaration</span> <span class="SpellE">and</span> <span
class="SpellE">initialization</span>.<br/><br /><span class="SpellE">Best</span> <span
class="SpellE">regards</span>,<br/><span class="SpellE">Nicolai</span> <span class="SpellE">Tufar</span> & Devrim
<spanclass="SpellE">Gunduz</span></span></font><font face="Arial"><span style="font-family:Arial"></span></font><p
class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial"> </span></font></div>

Re: 7.4.1 release status - Turkish Locale

From
Tom Lane
Date:
"Nicolai Tufar" <ntufar@pisem.net> writes:
>> We might think that the Turkish-locale problem Devrim Gunduz pointed out
>> is a must-fix, too.  But I'm not convinced yet what to do about it.
> Here is a first try to fix what Devrim Gunduz talked about.

I still don't much like having a locale-specific wart in the parser
(and the code you give could not work anyway --- for starters, the
first argument of setlocale is not a pointer).

A possible compromise is to apply ASCII downcasing (same as in
keywords.c) for 7-bit-ASCII characters, and apply tolower() only
for character codes above 127.  In other words
unsigned char ch = (unsigned char) ident[i];
if (ch >= 'A' && ch <= 'Z')    ch += 'a' - 'A';else if (ch > 127 && isupper(ch))    ch = tolower(ch);ident[i] = (char)
ch;

In reasonably sane locales this will have the same effects as currently,
while in unsane locales it will ensure that basic-ASCII identifiers are
treated the way we want.

Comments?
        regards, tom lane


Re: 7.4.1 release status - Turkish Locale

From
"Nicolai Tufar"
Date:
> I still don't much like having a locale-specific wart in the parser
> (and the code you give could not work anyway --- for starters, the
> first argument of setlocale is not a pointer).

Aw, I see, my code broken. I got confused by locale_......_asign()
family
if functions. Sure, first argument need to be int. But as you said the
code
is a wart.

> A possible compromise is to apply ASCII downcasing (same as in
> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
> for character codes above 127.  In other words
>
>     unsigned char ch = (unsigned char) ident[i];
> 
>    if (ch >= 'A' && ch <= 'Z')
>        ch += 'a' - 'A';
>    else if (ch > 127 && isupper(ch))
>        ch = tolower(ch);
>    ident[i] = (char) ch;
>
> In reasonably sane locales this will have the same effects as
currently,
> while in unsane locales it will ensure that basic-ASCII identifiers
are
> treated the way we want.

If we go this way why not make a special case only and only for 'I'
Character and not all 7-bit ASCII:
unsigned char ch = (unsigned char) ident[i];
if(ch == (unsigned char)'I')    ch = 'i';else    ch = tolower(ch)ident[i] = (char) ch;    

Will it break any locales?

>
>            regards, tom lane

Regards
Nicolai




Re: 7.4.1 release status - Turkish Locale

From
Tom Lane
Date:
"Nicolai Tufar" <ntufar@pisem.net> writes:
>> A possible compromise is to apply ASCII downcasing (same as in
>> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
>> for character codes above 127.  In other words

> If we go this way why not make a special case only and only for 'I'
> Character and not all 7-bit ASCII:

It seems to me that that's too narrow a definition of the problem.
I think we should state our goal as "we don't want bizarre locale
definitions to interfere with downcasing of the basic ASCII letters".
If we put in a special case for 'I' we will fix the known problem
with Turkish, but what other strange locales might be out there?
And if we don't trust tolower() for 'I', why should we trust it
for 'A'-'Z'?

What it comes down to is that by training and experience, I always
expect that any bug might be just one example of a whole class of bugs.
You have to look for the related cases that might happen in future,
not only fix the case that's under your nose.
        regards, tom lane


Re: 7.4.1 release status - Turkish Locale

From
"Nicolai Tufar"
Date:
"Tom Lane" tgl@sss.pgh.pa.us wrote:
>
>"Nicolai Tufar" <ntufar@pisem.net> writes:
>>> A possible compromise is to apply ASCII downcasing (same as in
>>> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
>>> for character codes above 127.  In other words
>
>> If we go this way why not make a special case only and only for 'I'
>> Character and not all 7-bit ASCII:
>
> It seems to me that that's too narrow a definition of the problem.
> I think we should state our goal as "we don't want bizarre locale
> definitions to interfere with downcasing of the basic ASCII letters".
> If we put in a special case for 'I' we will fix the known problem
> with Turkish, but what other strange locales might be out there?
> And if we don't trust tolower() for 'I', why should we trust it
> for 'A'-'Z'?

To my knowledge no other locale have similar problems. At least nobody
complained so far while Turk users are rising their voices for many
years 
now. Let try and put this very special case, together with an extensive
explanation in comment and see if someone complains. And by the way,
national characters in table, column, index or function names is
something
that never happens in production databases.

As for 'A'-'Z'^, it was pointed to me that SQL99 standard states that
identifier names need to be downcasted in locale-dependent manner.

Would you like me to create a patch that would touch only 
src/backend/parser/scan.l, introduce a special case for 'I'
and include an explanation in comment?

> What it comes down to is that by training and experience, I always
> expect that any bug might be just one example of a whole class of
bugs.
> You have to look for the related cases that might happen in future,
> not only fix the case that's under your nose.
>
>            regards, tom lane

Regards,
Nicolai Tufar



Turkish Locale in Identifiers (contd.)

From
"Nicolai Tufar"
Date:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Nicolai Tufar" <ntufar@pisem.net> writes:
> >> A possible compromise is to apply ASCII downcasing (same as in
> >> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
> >> for character codes above 127.  In other words
> 
> > If we go this way why not make a special case only and only for 'I'
> > Character and not all 7-bit ASCII:
> 
> It seems to me that that's too narrow a definition of the problem.
> I think we should state our goal as "we don't want bizarre locale
> definitions to interfere with downcasing of the basic ASCII letters".
> If we put in a special case for 'I' we will fix the known problem
> with Turkish, but what other strange locales might be out there?
> And if we don't trust tolower() for 'I', why should we trust it
> for 'A'-'Z'?

Since nobody commented on the issue I may suggest a patch that
implements
'I' special case solution. 'A'-'Z' ASCII-only downcasting idea was
rejected 
before on basis of SQL99 compliance. I hope I would have more luck with
this
one. Because PostgreSQL just does not work with Turkish locale, and it
was
so since 7.4.0. initdb just chokes on VOID identifier and quits. Devrim
Gunduz will second me on this, I am sure.

With my knowledge of Russian, Arabic and -to some degree- Hebrew
encodings
I claim that this patch will not break them. If someone who uses far
eastern
Encodings would also check it, I think it would be pretty safe to apply
this patch to the source.

Thanks,
Nicolai Tufar



Re: Turkish Locale in Identifiers (contd.)

From
"Nicolai Tufar"
Date:
Oops, forgot the patch :)

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Nicolai Tufar
> Sent: Tuesday, February 03, 2004 9:31 PM
> To: pgsql-hackers@postgresql.org
> Cc: 'Tom Lane'; devrim@tdmsoft.com
> Subject: [HACKERS] Turkish Locale in Identifiers (contd.)
> 
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Nicolai Tufar" <ntufar@pisem.net> writes:
> > >> A possible compromise is to apply ASCII downcasing (same as in
> > >> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
> > >> for character codes above 127.  In other words
> >
> > > If we go this way why not make a special case only and only for
'I'
> > > Character and not all 7-bit ASCII:
> >
> > It seems to me that that's too narrow a definition of the problem.
> > I think we should state our goal as "we don't want bizarre locale
> > definitions to interfere with downcasing of the basic ASCII
letters".
> > If we put in a special case for 'I' we will fix the known problem
> > with Turkish, but what other strange locales might be out there?
> > And if we don't trust tolower() for 'I', why should we trust it
> > for 'A'-'Z'?
> 
> Since nobody commented on the issue I may suggest a patch that
> implements
> 'I' special case solution. 'A'-'Z' ASCII-only downcasting idea was
> rejected
> before on basis of SQL99 compliance. I hope I would have more luck
with
> this
> one. Because PostgreSQL just does not work with Turkish locale, and it
> was
> so since 7.4.0. initdb just chokes on VOID identifier and quits.
Devrim
> Gunduz will second me on this, I am sure.
> 
> With my knowledge of Russian, Arabic and -to some degree- Hebrew
> encodings
> I claim that this patch will not break them. If someone who uses far
> eastern
> Encodings would also check it, I think it would be pretty safe to
apply
> this patch to the source.
> 
> Thanks,
> Nicolai Tufar
> 
> 
> ---------------------------(end of
broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to
majordomo@postgresql.org)

Re: 7.4.1 release status - Turkish Locale

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> It seems to me that that's too narrow a definition of the problem.
> I think we should state our goal as "we don't want bizarre locale
> definitions to interfere with downcasing of the basic ASCII letters".
> If we put in a special case for 'I' we will fix the known problem
> with Turkish, but what other strange locales might be out there?
> And if we don't trust tolower() for 'I', why should we trust it
> for 'A'-'Z'?

But then wouldn't it be a little weird for Turkish table and column names to
treat "I and "Ý" (I think that's a dotted capital I) as equivalent to "i"
instead of "ý" "i" respectively. (I think that first one was a dotless i).

Perhaps what really ought to be happening is that the downcasing should be
done separately for keywords, or postponed until the point where it's checked
to see if it's a keyword. Then it could be done using an entirely
ascii-centric bit-twiddling implementation.

If it matches an SQL keyword after being downcased the old fashioned way, then
it's an SQL keyword. If not then the locale-aware tolower() would be
appropriate for tables, columns, etc.

But then perhaps that's unnecessarily complex.

-- 
greg



Re: 7.4.1 release status - Turkish Locale

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> If it matches an SQL keyword after being downcased the old fashioned way, then
> it's an SQL keyword. If not then the locale-aware tolower() would be
> appropriate for tables, columns, etc.

That's exactly what we do already.  The complaint was that the
locale-aware downcasing is broken (not to put it too finely) in Turkish
locales, leading to unexpected/unwanted results for identifiers that are
not keywords.  My own opinion is that the correct response is to fix the
Turkish locale tables, but I can see where that might be beyond the
skills of the average Postgres user.  Thus I thought a reasonable
compromise would be to override the locale for the handling of A-Z,
allowing it to determine what happens to high-bit-set characters only.
        regards, tom lane


Re: 7.4.1 release status - Turkish Locale

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > If it matches an SQL keyword after being downcased the old fashioned way, then
> > it's an SQL keyword. If not then the locale-aware tolower() would be
> > appropriate for tables, columns, etc.
> 
> That's exactly what we do already.  The complaint was that the
> locale-aware downcasing is broken (not to put it too finely) in Turkish
> locales, leading to unexpected/unwanted results for identifiers that are
> not keywords.  

But the example given was "SERIAL". "serial" is an English word, not a Turkish
word. It shouldn't really be subject to Turkish locale effects at all. Perhaps
"keyword" wasn't the right word in my message.

I'm wondering if he really expects all identifiers to be subject to this ascii
downcasing. Like, if he had a GÜNAYDIN column he might be surprised to when
günaydýn (where ý is the lowercase dotless i) says column "günaydýn" doesn't
exist.

Or is the real problem simply that both styles of i really ought to match all
the time, ie, that they should really be considered the same letter for
matches? I wonder if there are other locales where that's an issue.

-- 
greg



Re: 7.4.1 release status - Turkish Locale

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> But the example given was "SERIAL". "serial" is an English word, not a
> Turkish word. It shouldn't really be subject to Turkish locale effects
> at all.

SERIAL is not a keyword according to the grammar.  Neither are PUBLIC,
VOID, INT4, and numerous other examples.  It's not appropriate to try to
fix this by making them all keywords --- that will just create other
problems.  (And where do you draw the line, anyway?  Should every
identifier present in the default system catalogs become a keyword?)

> I'm wondering if he really expects all identifiers to be subject to
> this ascii downcasing.

Without doubt it isn't ideal, but if we don't do something then a lot of
stuff starting with initdb is broken.  We could perhaps work around the
problem by spelling everything in lower-case in all the commands we
issue, but I can't see that as an acceptable answer either.  We can't
expect to control all the SQL sent to a database.
        regards, tom lane


Re: 7.4.1 release status - Turkish Locale

From
"Nicolai Tufar"
Date:
Sorry for rising up old issue again but the problem still persists.
And database cluster is not being created with Turkish locale

If you have any doubts about how Turkish users will react to the fact
that both "I" and "I WITH DOT" will be treated as same character, rest
assured that this behavior is de-facto standard when it comes to file
names, 
identifiers and commands. Greg Stark and Devrim Gunduz will confirm
that, 
no doubt. Please review and apply this patch I send you for the third
time. You will not regret and many users will be grateful. Please note
that to my knowledge it will not break any other locales.


Best regards,
Nicolai

Re: 7.4.1 release status - Turkish Locale

From
Tom Lane
Date:
"Nicolai Tufar" <ntufar@pisem.net> writes:
> Sorry for rising up old issue again but the problem still persists.
> And database cluster is not being created with Turkish locale

I've committed the attached fix, which I believe will solve this
problem.  Could you test it?

(Patch is against 7.4 branch)

            regards, tom lane


Attachment

Re: 7.4.1 release status - Turkish Locale

From
"Nicolai Tufar"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> I've committed the attached fix, which I believe will solve this
> problem.  Could you test it?

Thank you very much for your effort and attention!

I am not sure I am testing the right version. I am testing the
one with REL7_4_STABLE, the one with downcase_truncate_identifier()
function added.

Under locale-ignorant FreeBSD it works fine.
But under Fedora Core 1 initdb it crashes under all
locales I tested -C, en_US, tr_TR with message given below.

I remember seeing this message before, when I messed up with downcasting
Functions. See, it is downcasting "ISO" and gets "ıso" in return. Could
Someone confirm the results I got?

Regards,
Nicolai

--------------------------------------------------------
fixing permissions on existing directory /pgdata... ok
creating directory /pgdata/base... ok
creating directory /pgdata/global... ok
creating directory /pgdata/pg_xlog... ok
creating directory /pgdata/pg_clog... ok
selecting default max_connections... 10
selecting default shared_buffers... 50
creating configuration files... ok
creating template1 database in /pgdata/base/1... FATAL:  XX000: failed
to initialize DateStyle to "ISO, MDY"
LOCATION:  InitializeGUCOptions, guc.c:1866

initdb: failed



Re: 7.4.1 release status - Turkish Locale

From
Tom Lane
Date:
"Nicolai Tufar" <ntufar@pisem.net> writes:
> Under locale-ignorant FreeBSD it works fine.
> But under Fedora Core 1 initdb it crashes under all
> locales I tested -C, en_US, tr_TR with message given below.

Hmm.  It seems that tr_TR has problems much more extensive than you've
indicated previously.  I was able to get through initdb with the attached
additional patch, but the regression tests fail in several places.
It looks to me like every use of strcasecmp in the backend has to be
questioned if we're going to make this work.  I'm starting to lean in
the direction of "tr_TR is hopelessly broken" again...
        regards, tom lane


*** src/backend/commands/variable.c~    Mon Jan 19 14:04:40 2004
--- src/backend/commands/variable.c    Fri Feb 20 23:16:16 2004
***************
*** 82,103 ****          /* Ugh. Somebody ought to write a table driven version -- mjl */ 
!         if (strcasecmp(tok, "ISO") == 0)         {             newDateStyle = USE_ISO_DATES;             scnt++;
  }
 
!         else if (strcasecmp(tok, "SQL") == 0)         {             newDateStyle = USE_SQL_DATES;             scnt++;
       }
 
!         else if (strncasecmp(tok, "POSTGRES", 8) == 0)         {             newDateStyle = USE_POSTGRES_DATES;
     scnt++;         }
 
!         else if (strcasecmp(tok, "GERMAN") == 0)         {             newDateStyle = USE_GERMAN_DATES;
scnt++;
--- 82,108 ----          /* Ugh. Somebody ought to write a table driven version -- mjl */ 
!         /*
!          * Note: SplitIdentifierString already downcased the input, so
!          * we needn't use strcasecmp here.
!          */
! 
!         if (strcmp(tok, "iso") == 0)         {             newDateStyle = USE_ISO_DATES;             scnt++;
}
!         else if (strcmp(tok, "sql") == 0)         {             newDateStyle = USE_SQL_DATES;             scnt++;
   }
 
!         else if (strncmp(tok, "postgres", 8) == 0)         {             newDateStyle = USE_POSTGRES_DATES;
 scnt++;         }
 
!         else if (strcmp(tok, "german") == 0)         {             newDateStyle = USE_GERMAN_DATES;
scnt++;
***************
*** 105,129 ****             if (ocnt == 0)                 newDateOrder = DATEORDER_DMY;         }
!         else if (strcasecmp(tok, "YMD") == 0)         {             newDateOrder = DATEORDER_YMD;             ocnt++;
       }
 
!         else if (strcasecmp(tok, "DMY") == 0 ||
!                  strncasecmp(tok, "EURO", 4) == 0)         {             newDateOrder = DATEORDER_DMY;
ocnt++;        }
 
!         else if (strcasecmp(tok, "MDY") == 0 ||
!                  strcasecmp(tok, "US") == 0 ||
!                  strncasecmp(tok, "NONEURO", 7) == 0)         {             newDateOrder = DATEORDER_MDY;
ocnt++;        }
 
!         else if (strcasecmp(tok, "DEFAULT") == 0)         {             /*              * Easiest way to get the
currentDEFAULT state is to fetch
 
--- 110,134 ----             if (ocnt == 0)                 newDateOrder = DATEORDER_DMY;         }
!         else if (strcmp(tok, "ymd") == 0)         {             newDateOrder = DATEORDER_YMD;             ocnt++;
   }
 
!         else if (strcmp(tok, "dmy") == 0 ||
!                  strncmp(tok, "euro", 4) == 0)         {             newDateOrder = DATEORDER_DMY;
ocnt++;        }
 
!         else if (strcmp(tok, "mdy") == 0 ||
!                  strcmp(tok, "us") == 0 ||
!                  strncmp(tok, "noneuro", 7) == 0)         {             newDateOrder = DATEORDER_MDY;
ocnt++;        }
 
!         else if (strcmp(tok, "default") == 0)         {             /*              * Easiest way to get the current
DEFAULTstate is to fetch
 
***************
*** 474,480 ****                 HasCTZSet = true;             }         }
!         else if (strcasecmp(value, "UNKNOWN") == 0)         {             /*              * UNKNOWN is the value
shownas the "default" for TimeZone in
 
--- 479,485 ----                 HasCTZSet = true;             }         }
!         else if (strcasecmp(value, "unknown") == 0)         {             /*              * UNKNOWN is the value
shownas the "default" for TimeZone in
 


Re: 7.4.1 release status - Turkish Locale

From
"Nicolai Tufar"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Hmm.  It seems that tr_TR has problems much more extensive than you've
> indicated previously.  I was able to get through initdb with the
attached
> additional patch, but the regression tests fail in several places.
> It looks to me like every use of strcasecmp in the backend has to be
> questioned if we're going to make this work.  I'm starting to lean in
> the direction of "tr_TR is hopelessly broken" again...

With this patch applied everything works fine. Thanks!
What do you plan to do? Apply it? And if apply, will
You apply both of the modifications to 7.5devel also?

Thanks again for your effort.

Best regards,
Nicolai




> 
>             regards, tom lane
> 
> 
> *** src/backend/commands/variable.c~    Mon Jan 19 14:04:40 2004
> --- src/backend/commands/variable.c    Fri Feb 20 23:16:16 2004
> ***************
> *** 82,103 ****
> 
>           /* Ugh. Somebody ought to write a table driven version
-- mjl
> */
> 
> !         if (strcasecmp(tok, "ISO") == 0)
>           {
>               newDateStyle = USE_ISO_DATES;
>               scnt++;
>           }
> !         else if (strcasecmp(tok, "SQL") == 0)
>           {
>               newDateStyle = USE_SQL_DATES;
>               scnt++;
>           }
> !         else if (strncasecmp(tok, "POSTGRES", 8) == 0)
>           {
>               newDateStyle = USE_POSTGRES_DATES;
>               scnt++;
>           }
> !         else if (strcasecmp(tok, "GERMAN") == 0)
>           {
>               newDateStyle = USE_GERMAN_DATES;
>               scnt++;
> --- 82,108 ----
> 
>           /* Ugh. Somebody ought to write a table driven version
-- mjl
> */
> 
> !         /*
> !          * Note: SplitIdentifierString already downcased the
input, so
> !          * we needn't use strcasecmp here.
> !          */
> !
> !         if (strcmp(tok, "iso") == 0)
>           {
>               newDateStyle = USE_ISO_DATES;
>               scnt++;
>           }
> !         else if (strcmp(tok, "sql") == 0)
>           {
>               newDateStyle = USE_SQL_DATES;
>               scnt++;
>           }
> !         else if (strncmp(tok, "postgres", 8) == 0)
>           {
>               newDateStyle = USE_POSTGRES_DATES;
>               scnt++;
>           }
> !         else if (strcmp(tok, "german") == 0)
>           {
>               newDateStyle = USE_GERMAN_DATES;
>               scnt++;
> ***************
> *** 105,129 ****
>               if (ocnt == 0)
>                   newDateOrder = DATEORDER_DMY;
>           }
> !         else if (strcasecmp(tok, "YMD") == 0)
>           {
>               newDateOrder = DATEORDER_YMD;
>               ocnt++;
>           }
> !         else if (strcasecmp(tok, "DMY") == 0 ||
> !                  strncasecmp(tok, "EURO", 4) == 0)
>           {
>               newDateOrder = DATEORDER_DMY;
>               ocnt++;
>           }
> !         else if (strcasecmp(tok, "MDY") == 0 ||
> !                  strcasecmp(tok, "US") == 0 ||
> !                  strncasecmp(tok, "NONEURO", 7) == 0)
>           {
>               newDateOrder = DATEORDER_MDY;
>               ocnt++;
>           }
> !         else if (strcasecmp(tok, "DEFAULT") == 0)
>           {
>               /*
>                * Easiest way to get the current DEFAULT state
is to
> fetch
> --- 110,134 ----
>               if (ocnt == 0)
>                   newDateOrder = DATEORDER_DMY;
>           }
> !         else if (strcmp(tok, "ymd") == 0)
>           {
>               newDateOrder = DATEORDER_YMD;
>               ocnt++;
>           }
> !         else if (strcmp(tok, "dmy") == 0 ||
> !                  strncmp(tok, "euro", 4) == 0)
>           {
>               newDateOrder = DATEORDER_DMY;
>               ocnt++;
>           }
> !         else if (strcmp(tok, "mdy") == 0 ||
> !                  strcmp(tok, "us") == 0 ||
> !                  strncmp(tok, "noneuro", 7) == 0)
>           {
>               newDateOrder = DATEORDER_MDY;
>               ocnt++;
>           }
> !         else if (strcmp(tok, "default") == 0)
>           {
>               /*
>                * Easiest way to get the current DEFAULT state
is to
> fetch
> ***************
> *** 474,480 ****
>                   HasCTZSet = true;
>               }
>           }
> !         else if (strcasecmp(value, "UNKNOWN") == 0)
>           {
>               /*
>                * UNKNOWN is the value shown as the "default"
for
> TimeZone in
> --- 479,485 ----
>                   HasCTZSet = true;
>               }
>           }
> !         else if (strcasecmp(value, "unknown") == 0)
>           {
>               /*
>                * UNKNOWN is the value shown as the "default"
for
> TimeZone in



Re: 7.4.1 release status - Turkish Locale

From
Tom Lane
Date:
"Nicolai Tufar" <ntufar@pisem.net> writes:
>> It looks to me like every use of strcasecmp in the backend has to be
>> questioned if we're going to make this work.  I'm starting to lean in
>> the direction of "tr_TR is hopelessly broken" again...

> With this patch applied everything works fine. Thanks!

Did you try running the regression tests under tr_TR locale?  It seems
a few bricks short of "fine" yet :-(
        regards, tom lane


Re: 7.4.1 release status - Turkish Locale

From
"Nicolai Tufar"
Date:
> -----Original Message-----
> From: Tom Lane 
> Did you try running the regression tests under tr_TR locale?  It seems
> a few bricks short of "fine" yet :-(

I run regression tests under tr_TR locale. To do this I hardcoded
Turkish locale in initdb in pg_regress.sh. Three tests failed, I
attached resulting diff. 

With days of the week, the same problem is with downcasting occurs. I
think it is not that crucial, but the rest of the differences in the
file seem to be important. I was not able to interpret them.

Thanks,
Nicolai







> 
>             regards, tom lane