Thread: Re: 7.4.1 release status - Turkish Locale
<div class="Section1"><pre><font color="black" face="Courier New" size="2"><span style="font-size:10.0pt">> <span class="SpellE">We</span><span class="SpellE">might</span> <span class="SpellE">think</span> <span class="SpellE">that</span><span class="SpellE">the</span> <span class="SpellE">Turkish</span>-<span class="SpellE">locale</span>problem Devrim <span class="SpellE">Gunduz</span> <span class="SpellE">pointed</span> <span class="SpellE">out</span></span></font></pre><pre><fontcolor="black" face="Courier New" size="2"><span style="font-size:10.0pt">>is a <span class="SpellE">must</span>-<span class="SpellE">fix</span>, <span class="SpellE">too</span>.<spanstyle="mso-spacerun:yes"> </span>But <span class="SpellE">I'm</span> not <span class="SpellE">convinced</span>yet <span class="SpellE">what</span> <span class="SpellE">to</span> do <span class="SpellE">about</span>it.</span></font></pre><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoPlainText"><font face="Courier New" size="2"><span style="font-size: 10.0pt">Here is a <span class="SpellE">first</span> <span class="SpellE">try</span> <span class="SpellE">to</span> <spanclass="SpellE">fix</span> <span class="SpellE">what</span> Devrim <span class="SpellE">Gunduz</span> <span class="SpellE">talked</span><span class="SpellE">about</span>.<br /><br /><span class="SpellE">Please</span> be <span class="SpellE">patient</span><span class="SpellE">with</span> <span class="SpellE">me</span> <span class="SpellE">for</span>it is <span class="SpellE">the</span> <span class="SpellE">first</span> <span class="SpellE">major</span><span class="SpellE">patch</span><br /> I <span class="SpellE">submit</span> <span class="SpellE">and</span>I <span class="SpellE">realize</span> <span class="SpellE">that</span> I <span class="SpellE">blatantly</span><span class="SpellE">violated</span> <span class="SpellE">many</span> <span class="SpellE">rules</span><br/> of <span class="SpellE">good</span> <span class="SpellE">style</span> in <span class="SpellE">PostgreSQL</span><span class="SpellE">source</span> <span class="SpellE">code</span>.<br /><br /><span class="SpellE">First</span>,<span class="SpellE">about</span> <span class="SpellE">the</span> problem. <span class="SpellE">Turkish</span><span class="SpellE">language</span> has <span class="SpellE">two</span> <span class="SpellE">letters</span>"i".<br /><span class="SpellE">One</span> is <span class="SpellE">with</span> <span class="SpellE">dot</span>on top <span class="SpellE">and</span> <span class="SpellE">the</span> <span class="SpellE">other</span>is <span class="SpellE">without</span>. <span class="SpellE">Simply</span> as <span class="SpellE">that</span>.<br/><span class="SpellE">The</span> <span class="SpellE">one</span> <span class="SpellE">with</span><span class="SpellE">dot</span> has <span class="SpellE">the</span> <span class="SpellE">dot</span><span class="SpellE">both</span> as <span class="SpellE">capital</span> <span class="SpellE">and</span><span class="SpellE">lower</span>-<span class="SpellE">case</span> <span class="SpellE">and</span><br/><span class="SpellE">the</span> <span class="SpellE">one</span> <span class="SpellE">without</span><span class="SpellE">dot</span> has no <span class="SpellE">dot</span> in <span class="SpellE">both</span><span class="SpellE">upper</span> <span class="SpellE">and</span> <span class="SpellE">lower</span><span class="SpellE">case</span>...<br /> as <span class="SpellE">opposed</span> <span class="SpellE">to</span><span class="SpellE">English</span> <span class="SpellE">where</span> "i" has a <span class="SpellE">dot</span><span class="SpellE">when</span> <span class="SpellE">lower</span>-<span class="SpellE">case</span><span class="SpellE">and</span><br /> has no <span class="SpellE">dot</span> <span class="SpellE">when</span><span class="SpellE">upper</span>-<span class="SpellE">case</span>.<br /><br /> Problem <span class="SpellE">arise</span><span class="SpellE">when</span> <span class="SpellE">PostgreSQL</span>, <span class="SpellE">while</span><span class="SpellE">running</span> <span class="SpellE">with</span> "<span class="SpellE">tr</span>_TR"<span class="SpellE">locale</span><br /><span class="SpellE">converts</span> <span class="SpellE">to</span><span class="SpellE">lower</span>-<span class="SpellE">case</span> an <span class="SpellE">identifier</span>as a <span class="SpellE">table</span>, an <span class="SpellE">index</span> <span class="SpellE">or</span><br/> a <span class="SpellE">column</span> name. <span class="SpellE">If</span> it is <span class="SpellE">written</span><span class="SpellE">with</span> <span class="SpellE">capital</span> "I", <span class="SpellE">tolower</span>()<span class="SpellE">with</span><br /> 'I' as <span class="SpellE">argument</span> <span class="SpellE">will</span><span class="SpellE">return</span> <span class="SpellE">Turkish</span> <span class="SpellE">specific</span><span class="SpellE">character</span>: <br /> 'i'-<span class="SpellE">without</span>-a-<spanclass="SpellE">dot</span> <span class="SpellE">what</span> I <span class="SpellE">am</span><span class="SpellE">afraid</span> <span class="SpellE">will</span> not be <span class="SpellE">shown</span><span class="SpellE">correctly</span><br /> in <span class="SpellE">your</span> e-mail <span class="SpellE">readers</span>.<br/><br /><span class="SpellE">Let</span> <span class="SpellE">me</span> <span class="SpellE">give</span><span class="SpellE">some</span> <span class="SpellE">examples</span>.<br /><br /><span class="SpellE">initdb</span><span class="SpellE">script</span> <span class="SpellE">runs</span> <span class="SpellE">apparently</span><span class="SpellE">innocent</span> <span class="SpellE">script</span> in file<br /><spanclass="SpellE">src</span>/<span class="SpellE">backend</span>/<span class="SpellE">utils</span>/<span class="SpellE">mb</span>/<spanclass="SpellE">conversion</span>_<span class="SpellE">procs</span>/<span class="SpellE">conversion</span>_<spanclass="SpellE">create</span>.<span class="SpellE">sql</span><br /><span class="SpellE">to</span><span class="SpellE">create</span> a <span class="SpellE">couple</span> of <span class="SpellE">functions</span><span class="SpellE">whose</span> <span class="SpellE">only</span> <span class="SpellE">fault</span><span class="SpellE">was</span><br /><span class="SpellE">to</span> <span class="SpellE">declare</span>it <span class="SpellE">their</span> <span class="SpellE">return</span> <span class="SpellE">parameters</span>as VOID. <span class="SpellE">Backend</span><br /><span class="SpellE">returns</span> <spanclass="SpellE">error</span> <span class="SpellE">message</span> <span class="SpellE">that</span> <span class="SpellE">type</span>"<span class="SpellE">vo</span> d" is not <span class="SpellE">found</span> <span class="SpellE">and</span><br/><span class="SpellE">initdb</span> <span class="SpellE">fails</span>.<br /><br /> A <span class="SpellE">nothing</span><span class="SpellE">suspecting</span> <span class="SpellE">novice</span> <span class="SpellE">user</span><span class="SpellE">was</span> <span class="SpellE">excited</span> <span class="SpellE">about</span><br/> SERIAL data <span class="SpellE">type</span> <span class="SpellE">he</span> <span class="SpellE">was</span><span class="SpellE">tail</span> is <span class="SpellE">present</span> in <span class="SpellE">PostgreSQL</span>.<br/><span class="SpellE">It</span> <span class="SpellE">took</span> us <span class="SpellE">with</span>Devrim a <span class="SpellE">lot</span> of time <span class="SpellE">to</span> <span class="SpellE">explain</span><span class="SpellE">why</span> <span class="SpellE">he</span><br /><span class="SpellE">need</span><span class="SpellE">to</span> <span class="SpellE">type</span> SERIAL as <span class="SpellE">SERiAL</span><span class="SpellE">for</span> <span class="SpellE">now</span> <span class="SpellE">till</span>a <span class="SpellE">workaround</span><br /> is <span class="SpellE">developed</span>.<br /><br/><span class="SpellE">Another</span> <span class="SpellE">case</span> <span class="SpellE">happened</span> <span class="SpellE">with</span><span class="SpellE">me</span> <span class="SpellE">when</span> I <span class="SpellE">wanted</span><span class="SpellE">to</span> restore<br /> a <span class="SpellE">pg</span>_<span class="SpellE">dump</span><span class="SpellE">dump</span>. Restore <span class="SpellE">failed</span> <span class="SpellE">because</span><span class="SpellE">script</span> <span class="SpellE">was</span> <span class="SpellE">creating</span><br/><span class="SpellE">scripts</span> <span class="SpellE">that</span> <span class="SpellE">belong</span><span class="SpellE">to</span> PUBLIC.<br /><br /><br /><span class="SpellE">For</span> <spanclass="SpellE">the</span> <span class="SpellE">solution</span>, <span class="SpellE">after</span> <span class="SpellE">some</span><span class="SpellE">research</span> <span class="SpellE">we</span> <span class="SpellE">found</span><span class="SpellE">out</span> <span class="SpellE">that</span><br /><span class="SpellE">offender</span>is <span class="SpellE">tolower</span>() <span class="SpellE">call</span> in <span class="SpellE">src</span>/<spanclass="SpellE">backend</span>/<span class="SpellE">parser</span>/<span class="SpellE">scan</span>.l<br/> in {<span class="SpellE">identifier</span>} <span class="SpellE">section</span>. <spanclass="SpellE">tolower</span>() <span class="SpellE">works</span> <span class="SpellE">fine</span> <span class="SpellE">with</span><span class="SpellE">any</span><br /><span class="SpellE">locale</span> <span class="SpellE">and</span><span class="SpellE">with</span> <span class="SpellE">any</span> <span class="SpellE">character</span><span class="SpellE">save</span> <span class="SpellE">for</span> <span class="SpellE">the</span><span class="SpellE">Turkish</span> <span class="SpellE">locale</span><br /><span class="SpellE">and</span><span class="SpellE">capital</span> 'I' <span class="SpellE">character</span>. <span class="SpellE">So</span>,<span class="SpellE">the</span> <span class="SpellE">obvious</span> <span class="SpellE">solution</span>is<br /><span class="SpellE">to</span> put a <span class="SpellE">check</span> <span class="SpellE">for</span><span class="SpellE">Turkish</span> <span class="SpellE">locale</span> <span class="SpellE">and</span>'I' <span class="SpellE">character</span>.<br /><span class="SpellE">Something</span> <span class="SpellE">like</span><span class="SpellE">this</span>:<br /><br /><span class="SpellE">if</span>( <<span class="SpellE">locale</span>is <span class="SpellE">Turkish</span>> && <span class="SpellE">ident</span>[i] =='I' )<br /><span style="mso-tab-count:1"> </span><span class="SpellE">ident</span>[i] = 'i';<br /> else<br /><spanstyle="mso-tab-count:1"> </span><span class="SpellE">ident</span>[i] = <span class="SpellE">tolower</span>((<spanclass="SpellE">unsigned</span> <span class="SpellE">char</span>) <span class="SpellE">ident</span>[i]);<br/><br /><span class="SpellE">Looks</span> <span class="SpellE">rather</span> <span class="SpellE">simple</span>but <span class="SpellE">the</span> hard <span class="SpellE">part</span> <span class="SpellE">was</span><span class="SpellE">to</span> <span class="SpellE">figure</span> <span class="SpellE">out</span><br/><span class="SpellE">what</span> is <span class="SpellE">the</span> <span class="SpellE">current</span><span class="SpellE">locale</span>. <span class="SpellE">To</span> do <span class="SpellE">this</span>I <span class="SpellE">added</span><br /><br /><span class="SpellE">const</span> <span class="SpellE">char</span>*<span class="SpellE">get</span>_<span class="SpellE">locale</span>_<span class="SpellE">category</span>(<spanclass="SpellE">const</span> <span class="SpellE">char</span> *<span class="SpellE">category</span>);<br/><br /><span class="SpellE">to</span> <span class="SpellE">src</span>/<span class="SpellE">backend</span>/<spanclass="SpellE">utils</span>/<span class="SpellE">adt</span>/<span class="SpellE">pg</span>_<spanclass="SpellE">locale</span>.c <span class="SpellE">that</span> <span class="SpellE">would</span><span class="SpellE">return</span><br /><span class="SpellE">locale</span> <span class="SpellE">identifier</span><span class="SpellE">for</span> <span class="SpellE">the</span> <span class="SpellE">category</span><span class="SpellE">specified</span> <span class="SpellE">or</span> LC_ALL<br /><span class="SpellE">if</span><span class="SpellE">category</span> is NULL. I <span class="SpellE">could</span> not <span class="SpellE">find</span><span class="SpellE">any</span> <span class="SpellE">other</span> <span class="SpellE">function</span><br/><span class="SpellE">that</span> <span class="SpellE">will</span> <span class="SpellE">return</span><span class="SpellE">what</span> I <span class="SpellE">need</span>. <span class="SpellE">Please</span><span class="SpellE">help</span> <span class="SpellE">me</span> <span class="SpellE">to</span><span class="SpellE">find</span><br /><span class="SpellE">one</span> <span class="SpellE">because</span>I <span class="SpellE">would</span> <span class="SpellE">hate</span> <span class="SpellE">to</span><span class="SpellE">introduce</span> a <span class="SpellE">new</span> <span class="SpellE">function</span>.<br/><br /> I <span class="SpellE">realize</span> <span class="SpellE">that</span> {<spanclass="SpellE">identifier</span>} <span class="SpellE">section</span> is <span class="SpellE">very</span> <span class="SpellE">performance</span><br/><span class="SpellE">critical</span> <span class="SpellE">so</span> I <span class="SpellE">introduced</span>a global <span class="SpellE">variable</span><br /><br /><span class="SpellE">static</span><span class="SpellE">int</span> <span class="SpellE">isturkishlocale</span> = -1;<br /><br />at <span class="SpellE">the</span> <span class="SpellE">beginning</span> of <span class="SpellE">src</span>/<span class="SpellE">backend</span>/<spanclass="SpellE">parser</span>/<span class="SpellE">scan</span>.l<br /><span class="SpellE">It</span>is set <span class="SpellE">to</span> -1 <span class="SpellE">when</span> not yet <span class="SpellE">initialized</span>,0 <span class="SpellE">if</span><br /><span class="SpellE">locale</span> is not <span class="SpellE">Turkish</span><span class="SpellE">and</span> 1 <span class="SpellE">if</span> <span class="SpellE">locale</span>is <span class="SpellE">Turkish</span>.<br /><br /><span class="SpellE">It</span> <span class="SpellE">might</span>not be <span class="SpellE">the</span> <span class="SpellE">way</span> it is <span class="SpellE">usually</span>done in <span class="SpellE">PostgreSQL</span><br /><span class="SpellE">source</span> <spanclass="SpellE">code</span>. <span class="SpellE">Could</span> <span class="SpellE">you</span> <span class="SpellE">pleas</span><span class="SpellE">advise</span> <span class="SpellE">if</span> <span class="SpellE">the</span>name I <span class="SpellE">chose</span><br /> is <span class="SpellE">appropriate</span> <spanclass="SpellE">and</span> <span class="SpellE">whether</span> <span class="SpellE">there</span> is a <span class="SpellE">more</span><span class="SpellE">appropriate</span><br /><span class="SpellE">place</span> <span class="SpellE">to</span>put <span class="SpellE">declaration</span> <span class="SpellE">and</span> <span class="SpellE">initialization</span>.<br/><br /><span class="SpellE">Best</span> <span class="SpellE">regards</span>,<br/><span class="SpellE">Nicolai</span> <span class="SpellE">Tufar</span> & Devrim <spanclass="SpellE">Gunduz</span></span></font><font face="Arial"><span style="font-family:Arial"></span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font></div>
"Nicolai Tufar" <ntufar@pisem.net> writes: >> We might think that the Turkish-locale problem Devrim Gunduz pointed out >> is a must-fix, too. But I'm not convinced yet what to do about it. > Here is a first try to fix what Devrim Gunduz talked about. I still don't much like having a locale-specific wart in the parser (and the code you give could not work anyway --- for starters, the first argument of setlocale is not a pointer). A possible compromise is to apply ASCII downcasing (same as in keywords.c) for 7-bit-ASCII characters, and apply tolower() only for character codes above 127. In other words unsigned char ch = (unsigned char) ident[i]; if (ch >= 'A' && ch <= 'Z') ch += 'a' - 'A';else if (ch > 127 && isupper(ch)) ch = tolower(ch);ident[i] = (char) ch; In reasonably sane locales this will have the same effects as currently, while in unsane locales it will ensure that basic-ASCII identifiers are treated the way we want. Comments? regards, tom lane
> I still don't much like having a locale-specific wart in the parser > (and the code you give could not work anyway --- for starters, the > first argument of setlocale is not a pointer). Aw, I see, my code broken. I got confused by locale_......_asign() family if functions. Sure, first argument need to be int. But as you said the code is a wart. > A possible compromise is to apply ASCII downcasing (same as in > keywords.c) for 7-bit-ASCII characters, and apply tolower() only > for character codes above 127. In other words > > unsigned char ch = (unsigned char) ident[i]; > > if (ch >= 'A' && ch <= 'Z') > ch += 'a' - 'A'; > else if (ch > 127 && isupper(ch)) > ch = tolower(ch); > ident[i] = (char) ch; > > In reasonably sane locales this will have the same effects as currently, > while in unsane locales it will ensure that basic-ASCII identifiers are > treated the way we want. If we go this way why not make a special case only and only for 'I' Character and not all 7-bit ASCII: unsigned char ch = (unsigned char) ident[i]; if(ch == (unsigned char)'I') ch = 'i';else ch = tolower(ch)ident[i] = (char) ch; Will it break any locales? > > regards, tom lane Regards Nicolai
"Nicolai Tufar" <ntufar@pisem.net> writes: >> A possible compromise is to apply ASCII downcasing (same as in >> keywords.c) for 7-bit-ASCII characters, and apply tolower() only >> for character codes above 127. In other words > If we go this way why not make a special case only and only for 'I' > Character and not all 7-bit ASCII: It seems to me that that's too narrow a definition of the problem. I think we should state our goal as "we don't want bizarre locale definitions to interfere with downcasing of the basic ASCII letters". If we put in a special case for 'I' we will fix the known problem with Turkish, but what other strange locales might be out there? And if we don't trust tolower() for 'I', why should we trust it for 'A'-'Z'? What it comes down to is that by training and experience, I always expect that any bug might be just one example of a whole class of bugs. You have to look for the related cases that might happen in future, not only fix the case that's under your nose. regards, tom lane
"Tom Lane" tgl@sss.pgh.pa.us wrote: > >"Nicolai Tufar" <ntufar@pisem.net> writes: >>> A possible compromise is to apply ASCII downcasing (same as in >>> keywords.c) for 7-bit-ASCII characters, and apply tolower() only >>> for character codes above 127. In other words > >> If we go this way why not make a special case only and only for 'I' >> Character and not all 7-bit ASCII: > > It seems to me that that's too narrow a definition of the problem. > I think we should state our goal as "we don't want bizarre locale > definitions to interfere with downcasing of the basic ASCII letters". > If we put in a special case for 'I' we will fix the known problem > with Turkish, but what other strange locales might be out there? > And if we don't trust tolower() for 'I', why should we trust it > for 'A'-'Z'? To my knowledge no other locale have similar problems. At least nobody complained so far while Turk users are rising their voices for many years now. Let try and put this very special case, together with an extensive explanation in comment and see if someone complains. And by the way, national characters in table, column, index or function names is something that never happens in production databases. As for 'A'-'Z'^, it was pointed to me that SQL99 standard states that identifier names need to be downcasted in locale-dependent manner. Would you like me to create a patch that would touch only src/backend/parser/scan.l, introduce a special case for 'I' and include an explanation in comment? > What it comes down to is that by training and experience, I always > expect that any bug might be just one example of a whole class of bugs. > You have to look for the related cases that might happen in future, > not only fix the case that's under your nose. > > regards, tom lane Regards, Nicolai Tufar
> Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Nicolai Tufar" <ntufar@pisem.net> writes: > >> A possible compromise is to apply ASCII downcasing (same as in > >> keywords.c) for 7-bit-ASCII characters, and apply tolower() only > >> for character codes above 127. In other words > > > If we go this way why not make a special case only and only for 'I' > > Character and not all 7-bit ASCII: > > It seems to me that that's too narrow a definition of the problem. > I think we should state our goal as "we don't want bizarre locale > definitions to interfere with downcasing of the basic ASCII letters". > If we put in a special case for 'I' we will fix the known problem > with Turkish, but what other strange locales might be out there? > And if we don't trust tolower() for 'I', why should we trust it > for 'A'-'Z'? Since nobody commented on the issue I may suggest a patch that implements 'I' special case solution. 'A'-'Z' ASCII-only downcasting idea was rejected before on basis of SQL99 compliance. I hope I would have more luck with this one. Because PostgreSQL just does not work with Turkish locale, and it was so since 7.4.0. initdb just chokes on VOID identifier and quits. Devrim Gunduz will second me on this, I am sure. With my knowledge of Russian, Arabic and -to some degree- Hebrew encodings I claim that this patch will not break them. If someone who uses far eastern Encodings would also check it, I think it would be pretty safe to apply this patch to the source. Thanks, Nicolai Tufar
Oops, forgot the patch :) > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > owner@postgresql.org] On Behalf Of Nicolai Tufar > Sent: Tuesday, February 03, 2004 9:31 PM > To: pgsql-hackers@postgresql.org > Cc: 'Tom Lane'; devrim@tdmsoft.com > Subject: [HACKERS] Turkish Locale in Identifiers (contd.) > > > Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Nicolai Tufar" <ntufar@pisem.net> writes: > > >> A possible compromise is to apply ASCII downcasing (same as in > > >> keywords.c) for 7-bit-ASCII characters, and apply tolower() only > > >> for character codes above 127. In other words > > > > > If we go this way why not make a special case only and only for 'I' > > > Character and not all 7-bit ASCII: > > > > It seems to me that that's too narrow a definition of the problem. > > I think we should state our goal as "we don't want bizarre locale > > definitions to interfere with downcasing of the basic ASCII letters". > > If we put in a special case for 'I' we will fix the known problem > > with Turkish, but what other strange locales might be out there? > > And if we don't trust tolower() for 'I', why should we trust it > > for 'A'-'Z'? > > Since nobody commented on the issue I may suggest a patch that > implements > 'I' special case solution. 'A'-'Z' ASCII-only downcasting idea was > rejected > before on basis of SQL99 compliance. I hope I would have more luck with > this > one. Because PostgreSQL just does not work with Turkish locale, and it > was > so since 7.4.0. initdb just chokes on VOID identifier and quits. Devrim > Gunduz will second me on this, I am sure. > > With my knowledge of Russian, Arabic and -to some degree- Hebrew > encodings > I claim that this patch will not break them. If someone who uses far > eastern > Encodings would also check it, I think it would be pretty safe to apply > this patch to the source. > > Thanks, > Nicolai Tufar > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
Tom Lane <tgl@sss.pgh.pa.us> writes: > It seems to me that that's too narrow a definition of the problem. > I think we should state our goal as "we don't want bizarre locale > definitions to interfere with downcasing of the basic ASCII letters". > If we put in a special case for 'I' we will fix the known problem > with Turkish, but what other strange locales might be out there? > And if we don't trust tolower() for 'I', why should we trust it > for 'A'-'Z'? But then wouldn't it be a little weird for Turkish table and column names to treat "I and "Ý" (I think that's a dotted capital I) as equivalent to "i" instead of "ý" "i" respectively. (I think that first one was a dotless i). Perhaps what really ought to be happening is that the downcasing should be done separately for keywords, or postponed until the point where it's checked to see if it's a keyword. Then it could be done using an entirely ascii-centric bit-twiddling implementation. If it matches an SQL keyword after being downcased the old fashioned way, then it's an SQL keyword. If not then the locale-aware tolower() would be appropriate for tables, columns, etc. But then perhaps that's unnecessarily complex. -- greg
Greg Stark <gsstark@mit.edu> writes: > If it matches an SQL keyword after being downcased the old fashioned way, then > it's an SQL keyword. If not then the locale-aware tolower() would be > appropriate for tables, columns, etc. That's exactly what we do already. The complaint was that the locale-aware downcasing is broken (not to put it too finely) in Turkish locales, leading to unexpected/unwanted results for identifiers that are not keywords. My own opinion is that the correct response is to fix the Turkish locale tables, but I can see where that might be beyond the skills of the average Postgres user. Thus I thought a reasonable compromise would be to override the locale for the handling of A-Z, allowing it to determine what happens to high-bit-set characters only. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > If it matches an SQL keyword after being downcased the old fashioned way, then > > it's an SQL keyword. If not then the locale-aware tolower() would be > > appropriate for tables, columns, etc. > > That's exactly what we do already. The complaint was that the > locale-aware downcasing is broken (not to put it too finely) in Turkish > locales, leading to unexpected/unwanted results for identifiers that are > not keywords. But the example given was "SERIAL". "serial" is an English word, not a Turkish word. It shouldn't really be subject to Turkish locale effects at all. Perhaps "keyword" wasn't the right word in my message. I'm wondering if he really expects all identifiers to be subject to this ascii downcasing. Like, if he had a GÜNAYDIN column he might be surprised to when günaydýn (where ý is the lowercase dotless i) says column "günaydýn" doesn't exist. Or is the real problem simply that both styles of i really ought to match all the time, ie, that they should really be considered the same letter for matches? I wonder if there are other locales where that's an issue. -- greg
Greg Stark <gsstark@mit.edu> writes: > But the example given was "SERIAL". "serial" is an English word, not a > Turkish word. It shouldn't really be subject to Turkish locale effects > at all. SERIAL is not a keyword according to the grammar. Neither are PUBLIC, VOID, INT4, and numerous other examples. It's not appropriate to try to fix this by making them all keywords --- that will just create other problems. (And where do you draw the line, anyway? Should every identifier present in the default system catalogs become a keyword?) > I'm wondering if he really expects all identifiers to be subject to > this ascii downcasing. Without doubt it isn't ideal, but if we don't do something then a lot of stuff starting with initdb is broken. We could perhaps work around the problem by spelling everything in lower-case in all the commands we issue, but I can't see that as an acceptable answer either. We can't expect to control all the SQL sent to a database. regards, tom lane
Sorry for rising up old issue again but the problem still persists. And database cluster is not being created with Turkish locale If you have any doubts about how Turkish users will react to the fact that both "I" and "I WITH DOT" will be treated as same character, rest assured that this behavior is de-facto standard when it comes to file names, identifiers and commands. Greg Stark and Devrim Gunduz will confirm that, no doubt. Please review and apply this patch I send you for the third time. You will not regret and many users will be grateful. Please note that to my knowledge it will not break any other locales. Best regards, Nicolai
"Nicolai Tufar" <ntufar@pisem.net> writes: > Sorry for rising up old issue again but the problem still persists. > And database cluster is not being created with Turkish locale I've committed the attached fix, which I believe will solve this problem. Could you test it? (Patch is against 7.4 branch) regards, tom lane
Attachment
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > > I've committed the attached fix, which I believe will solve this > problem. Could you test it? Thank you very much for your effort and attention! I am not sure I am testing the right version. I am testing the one with REL7_4_STABLE, the one with downcase_truncate_identifier() function added. Under locale-ignorant FreeBSD it works fine. But under Fedora Core 1 initdb it crashes under all locales I tested -C, en_US, tr_TR with message given below. I remember seeing this message before, when I messed up with downcasting Functions. See, it is downcasting "ISO" and gets "ıso" in return. Could Someone confirm the results I got? Regards, Nicolai -------------------------------------------------------- fixing permissions on existing directory /pgdata... ok creating directory /pgdata/base... ok creating directory /pgdata/global... ok creating directory /pgdata/pg_xlog... ok creating directory /pgdata/pg_clog... ok selecting default max_connections... 10 selecting default shared_buffers... 50 creating configuration files... ok creating template1 database in /pgdata/base/1... FATAL: XX000: failed to initialize DateStyle to "ISO, MDY" LOCATION: InitializeGUCOptions, guc.c:1866 initdb: failed
"Nicolai Tufar" <ntufar@pisem.net> writes: > Under locale-ignorant FreeBSD it works fine. > But under Fedora Core 1 initdb it crashes under all > locales I tested -C, en_US, tr_TR with message given below. Hmm. It seems that tr_TR has problems much more extensive than you've indicated previously. I was able to get through initdb with the attached additional patch, but the regression tests fail in several places. It looks to me like every use of strcasecmp in the backend has to be questioned if we're going to make this work. I'm starting to lean in the direction of "tr_TR is hopelessly broken" again... regards, tom lane *** src/backend/commands/variable.c~ Mon Jan 19 14:04:40 2004 --- src/backend/commands/variable.c Fri Feb 20 23:16:16 2004 *************** *** 82,103 **** /* Ugh. Somebody ought to write a table driven version -- mjl */ ! if (strcasecmp(tok, "ISO") == 0) { newDateStyle = USE_ISO_DATES; scnt++; } ! else if (strcasecmp(tok, "SQL") == 0) { newDateStyle = USE_SQL_DATES; scnt++; } ! else if (strncasecmp(tok, "POSTGRES", 8) == 0) { newDateStyle = USE_POSTGRES_DATES; scnt++; } ! else if (strcasecmp(tok, "GERMAN") == 0) { newDateStyle = USE_GERMAN_DATES; scnt++; --- 82,108 ---- /* Ugh. Somebody ought to write a table driven version -- mjl */ ! /* ! * Note: SplitIdentifierString already downcased the input, so ! * we needn't use strcasecmp here. ! */ ! ! if (strcmp(tok, "iso") == 0) { newDateStyle = USE_ISO_DATES; scnt++; } ! else if (strcmp(tok, "sql") == 0) { newDateStyle = USE_SQL_DATES; scnt++; } ! else if (strncmp(tok, "postgres", 8) == 0) { newDateStyle = USE_POSTGRES_DATES; scnt++; } ! else if (strcmp(tok, "german") == 0) { newDateStyle = USE_GERMAN_DATES; scnt++; *************** *** 105,129 **** if (ocnt == 0) newDateOrder = DATEORDER_DMY; } ! else if (strcasecmp(tok, "YMD") == 0) { newDateOrder = DATEORDER_YMD; ocnt++; } ! else if (strcasecmp(tok, "DMY") == 0 || ! strncasecmp(tok, "EURO", 4) == 0) { newDateOrder = DATEORDER_DMY; ocnt++; } ! else if (strcasecmp(tok, "MDY") == 0 || ! strcasecmp(tok, "US") == 0 || ! strncasecmp(tok, "NONEURO", 7) == 0) { newDateOrder = DATEORDER_MDY; ocnt++; } ! else if (strcasecmp(tok, "DEFAULT") == 0) { /* * Easiest way to get the currentDEFAULT state is to fetch --- 110,134 ---- if (ocnt == 0) newDateOrder = DATEORDER_DMY; } ! else if (strcmp(tok, "ymd") == 0) { newDateOrder = DATEORDER_YMD; ocnt++; } ! else if (strcmp(tok, "dmy") == 0 || ! strncmp(tok, "euro", 4) == 0) { newDateOrder = DATEORDER_DMY; ocnt++; } ! else if (strcmp(tok, "mdy") == 0 || ! strcmp(tok, "us") == 0 || ! strncmp(tok, "noneuro", 7) == 0) { newDateOrder = DATEORDER_MDY; ocnt++; } ! else if (strcmp(tok, "default") == 0) { /* * Easiest way to get the current DEFAULTstate is to fetch *************** *** 474,480 **** HasCTZSet = true; } } ! else if (strcasecmp(value, "UNKNOWN") == 0) { /* * UNKNOWN is the value shownas the "default" for TimeZone in --- 479,485 ---- HasCTZSet = true; } } ! else if (strcasecmp(value, "unknown") == 0) { /* * UNKNOWN is the value shownas the "default" for TimeZone in
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Hmm. It seems that tr_TR has problems much more extensive than you've > indicated previously. I was able to get through initdb with the attached > additional patch, but the regression tests fail in several places. > It looks to me like every use of strcasecmp in the backend has to be > questioned if we're going to make this work. I'm starting to lean in > the direction of "tr_TR is hopelessly broken" again... With this patch applied everything works fine. Thanks! What do you plan to do? Apply it? And if apply, will You apply both of the modifications to 7.5devel also? Thanks again for your effort. Best regards, Nicolai > > regards, tom lane > > > *** src/backend/commands/variable.c~ Mon Jan 19 14:04:40 2004 > --- src/backend/commands/variable.c Fri Feb 20 23:16:16 2004 > *************** > *** 82,103 **** > > /* Ugh. Somebody ought to write a table driven version -- mjl > */ > > ! if (strcasecmp(tok, "ISO") == 0) > { > newDateStyle = USE_ISO_DATES; > scnt++; > } > ! else if (strcasecmp(tok, "SQL") == 0) > { > newDateStyle = USE_SQL_DATES; > scnt++; > } > ! else if (strncasecmp(tok, "POSTGRES", 8) == 0) > { > newDateStyle = USE_POSTGRES_DATES; > scnt++; > } > ! else if (strcasecmp(tok, "GERMAN") == 0) > { > newDateStyle = USE_GERMAN_DATES; > scnt++; > --- 82,108 ---- > > /* Ugh. Somebody ought to write a table driven version -- mjl > */ > > ! /* > ! * Note: SplitIdentifierString already downcased the input, so > ! * we needn't use strcasecmp here. > ! */ > ! > ! if (strcmp(tok, "iso") == 0) > { > newDateStyle = USE_ISO_DATES; > scnt++; > } > ! else if (strcmp(tok, "sql") == 0) > { > newDateStyle = USE_SQL_DATES; > scnt++; > } > ! else if (strncmp(tok, "postgres", 8) == 0) > { > newDateStyle = USE_POSTGRES_DATES; > scnt++; > } > ! else if (strcmp(tok, "german") == 0) > { > newDateStyle = USE_GERMAN_DATES; > scnt++; > *************** > *** 105,129 **** > if (ocnt == 0) > newDateOrder = DATEORDER_DMY; > } > ! else if (strcasecmp(tok, "YMD") == 0) > { > newDateOrder = DATEORDER_YMD; > ocnt++; > } > ! else if (strcasecmp(tok, "DMY") == 0 || > ! strncasecmp(tok, "EURO", 4) == 0) > { > newDateOrder = DATEORDER_DMY; > ocnt++; > } > ! else if (strcasecmp(tok, "MDY") == 0 || > ! strcasecmp(tok, "US") == 0 || > ! strncasecmp(tok, "NONEURO", 7) == 0) > { > newDateOrder = DATEORDER_MDY; > ocnt++; > } > ! else if (strcasecmp(tok, "DEFAULT") == 0) > { > /* > * Easiest way to get the current DEFAULT state is to > fetch > --- 110,134 ---- > if (ocnt == 0) > newDateOrder = DATEORDER_DMY; > } > ! else if (strcmp(tok, "ymd") == 0) > { > newDateOrder = DATEORDER_YMD; > ocnt++; > } > ! else if (strcmp(tok, "dmy") == 0 || > ! strncmp(tok, "euro", 4) == 0) > { > newDateOrder = DATEORDER_DMY; > ocnt++; > } > ! else if (strcmp(tok, "mdy") == 0 || > ! strcmp(tok, "us") == 0 || > ! strncmp(tok, "noneuro", 7) == 0) > { > newDateOrder = DATEORDER_MDY; > ocnt++; > } > ! else if (strcmp(tok, "default") == 0) > { > /* > * Easiest way to get the current DEFAULT state is to > fetch > *************** > *** 474,480 **** > HasCTZSet = true; > } > } > ! else if (strcasecmp(value, "UNKNOWN") == 0) > { > /* > * UNKNOWN is the value shown as the "default" for > TimeZone in > --- 479,485 ---- > HasCTZSet = true; > } > } > ! else if (strcasecmp(value, "unknown") == 0) > { > /* > * UNKNOWN is the value shown as the "default" for > TimeZone in
"Nicolai Tufar" <ntufar@pisem.net> writes: >> It looks to me like every use of strcasecmp in the backend has to be >> questioned if we're going to make this work. I'm starting to lean in >> the direction of "tr_TR is hopelessly broken" again... > With this patch applied everything works fine. Thanks! Did you try running the regression tests under tr_TR locale? It seems a few bricks short of "fine" yet :-( regards, tom lane
> -----Original Message----- > From: Tom Lane > Did you try running the regression tests under tr_TR locale? It seems > a few bricks short of "fine" yet :-( I run regression tests under tr_TR locale. To do this I hardcoded Turkish locale in initdb in pg_regress.sh. Three tests failed, I attached resulting diff. With days of the week, the same problem is with downcasting occurs. I think it is not that crucial, but the rest of the differences in the file seem to be important. I was not able to interpret them. Thanks, Nicolai > > regards, tom lane