Thread: Locale-based identifier conversion and Turkish

Locale-based identifier conversion and Turkish

From: Devrim GUNDUZ
Date: 15 December 2003, 14:40:07

Hi,

A year ago Nicolai Tufar <ntufar@TDMSoft.com> submitted a patch to
change lower-case conversion of identifiers from locale-dependent to
ASCII in this thread:

http://archives.postgresql.org/pgsql-hackers/2002-11/msg01159.php

Tom Lane argued that the SQL99 standard requires identifier case
conversions to be done on the basis of Unicode upper/lower case
equivalencies, and that locale-based conversion is closer to that than
ASCII-only conversion. The patch was rejected.

Now, PostgreSQL 7.4 initdb fails if run with locale set to tr_TR:

----------------------------------------------------------------------------
[pgsql74@devrim backend]$ initdb -D /usr/local/pgsql/data --locale=tr_TR
The files belonging to this database system will be owned by user
"pgsql74".
This user must also own the server process.

The database cluster will be initialized with locale tr_TR.

fixing permissions on existing directory /usr/local/pgsql/data... ok
creating directory /usr/local/pgsql/data/base... ok
creating directory /usr/local/pgsql/data/global... ok
creating directory /usr/local/pgsql/data/pg_xlog... ok
creating directory /usr/local/pgsql/data/pg_clog... ok
selecting default max_connections... 100
selecting default shared_buffers... 1000
creating configuration files... ok
creating template1 database in /usr/local/pgsql/data/base/1... ok
initializing pg_shadow... ok
enabling unlimited row size for system tables... ok
initializing pg_depend... ok
creating system views... ok
loading pg_description... ok
creating conversions... NOTICE:  type "voıd" is not yet defined
DETAIL:  Creating a shell type definition.
ERROR:  type cstrıng does not exist

initdb: failed
[pgsql74@devrim backend]$
-----------------------------------------------------------------

The failure is caused by the following statement:
    CREATE OR REPLACE FUNCTION ascii_to_mic (INTEGER, INTEGER,
     CSTRING,CSTRING, INTEGER) RETURNS VOID AS
     '$libdir/ascii_and_mic', 'ascii_to_mic'
     LANGUAGE 'c' STRICT;

from the file share/conversion_create.sql.

As you can see "I" in "VOID" gets converted to i-dotless in conformance
to tr_TR Locale conversion rules, which is not an expected behaviour for
Turkish users who set their locale to tr_TR.
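
For illustration, the folding described above can be simulated as follows.
This is a Python sketch, not the backend's C code: Python's str.lower()
ignores the process locale, so the Turkish rule (I becomes ı, İ becomes i)
is written out by hand. The helper name tr_lower and the KNOWN_TYPES set are
assumptions made for the example.

```python
# Simulation only: Python's str.lower() does not consult the locale,
# so the tr_TR rule is spelled out explicitly.
TURKISH_LOWER = str.maketrans({"I": "ı", "İ": "i"})

# Hypothetical stand-in for the type names the catalog already knows.
KNOWN_TYPES = {"void", "cstring", "integer"}


def tr_lower(identifier: str) -> str:
    """Lower-case an identifier the way Turkish case rules would."""
    return identifier.translate(TURKISH_LOWER).lower()


for name in ("VOID", "CSTRING"):
    folded = tr_lower(name)
    # The folded name contains a dotless i, so the lookup misses.
    status = "found" if folded in KNOWN_TYPES else "not found"
    print(name, "->", folded, status)
```

Both names fold to spellings containing ı (voıd, cstrıng), which are the
names that appear in the initdb output above.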

Attached is a two-line patch that changes identifier name conversion in
backend/parser/scan.l from tolower() to a simple ASCII-based one. It
will solve the database creation problem but will apparently break
upper/lower case conversion of identifiers in national languages, such
as Russian or Korean.
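
The attached scan.l.diff is not reproduced in this archive, so the following
is only a sketch of what ASCII-only folding means, written in Python with a
hypothetical helper name ascii_lower. It shows both effects mentioned above:
VOID folds to void regardless of locale, while upper-case letters outside
A-Z are left untouched.

```python
def ascii_lower(identifier: str) -> str:
    """Fold only A-Z to a-z; every other character passes through unchanged."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in identifier
    )


print(ascii_lower("VOID"))     # ASCII letters are folded
print(ascii_lower("ТАБЛИЦА"))  # Cyrillic upper case is not folded
```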

So what shall be done? Would you like us to prepare a patch that will
change identifier case conversion behaviour only when the locale is set
to tr_TR?

Regards,
--
Devrim GUNDUZ
devrim@gunduz.org                  devrim.gunduz@linux.org.tr
                http://www.TDMSoft.com
                http://www.gunduz.org

Attachment

scan.l.diff

Re: Locale-based identifier conversion and Turkish

From: Tom Lane
Date: 15 December 2003, 16:14:48

Devrim GUNDUZ <devrim@gunduz.org> writes:
> Now, PostgreSQL 7.4 initdb fails if run with locale set to tr_TR:

Ugh :-(

> As you can see "I" in "VOID" gets converted to i-dotless in conformance
> to tr_TR Locale conversion rules, which is not an expected behaviour for
> Turkish users who set their locale to tr_TR.

Why is it not expected behavior?  The SQL99 spec has not changed from
what it said when we discussed this last year: case conversion of
identifiers is *not* to be done in plain-ASCII rules.  Admittedly it
says "unicode" and not "locale", but you'd have the same problem if we
had a full unicode implementation, no?

My feeling is that the best answer is to use lower case names to
reference identifiers in the initdb code.  This is not a real pleasant
prospect, but I can't see giving up on spec-compatible case folding ...
        regards, tom lane
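
A sketch of why lower-case names sidestep the problem: an identifier that is
already lower-case ASCII has nothing left to fold, so locale-aware and
ASCII-only folding agree on it. The Turkish rule is again simulated by hand
in Python; tr_lower, ascii_lower and folds_identically are hypothetical
helpers, not PostgreSQL functions.

```python
# Simulated tr_TR rule: I -> ı and İ -> i (Python's lower() ignores locale).
TURKISH_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def tr_lower(identifier: str) -> str:
    """Lower-case using the simulated Turkish rule."""
    return identifier.translate(TURKISH_LOWER).lower()


def ascii_lower(identifier: str) -> str:
    """Lower-case A-Z only."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in identifier
    )


def folds_identically(identifier: str) -> bool:
    """True when both folding rules give the same result."""
    return tr_lower(identifier) == ascii_lower(identifier)


for name in ("cstring", "CSTRING", "void", "VOID"):
    print(name, folds_identically(name))
```

Only the upper-case spellings fold differently under the two rules.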

conversion_create.sql

From: Michael Brusser
Date: 15 December 2003, 16:36:23

After upgrading to v 7.3.4 I noticed this error message in the database
setup log:
grep: can't open  <data_path>/conversion_create.sql

Turned out initdb is looking for conversion_create.sql.
We're not building this script and I may need to look into the build
process, but for now can someone tell me what it does and why we'd need it.

I guess it has to do with locale/data conversion, but there's so much stuff
in src/backend/utils/mb/conversion_procs that I wonder - do we need to know
which conversions we need to support, or should we build all of them to be
on the safe side?

Thanks,
Mike.

Re: Locale-based identifier conversion and Turkish

From: Devrim GUNDUZ
Date: 15 December 2003, 17:48:38

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

On Mon, 15 Dec 2003, Tom Lane wrote:

> > As you can see "I" in "VOID" gets converted to i-dotless in conformance
> > to tr_TR Locale conversion rules, which is not an expected behaviour for
> > Turkish users who set their locale to tr_TR.
> 
> Why is it not expected behavior?  The SQL99 spec has not changed from
> what it said when we discussed this last year: case conversion of
> identifiers is *not* to be done in plain-ASCII rules.  Admittedly it
> says "unicode" and not "locale", but you'd have the same problem if we
>  had a full unicode implementation, no?

I'm pretty sure that Nicolai will explain things better; so it's better to 
wait for his e-mail.

If PostgreSQL had a full unicode implementation, as you've said: Yes, we 
would experience the same problems.

We also have a problem while sorting result in a SELECT query, which seems 
a similar issue as this one. In Turkish, i-dotless comes just before i in 
alphabet; but PostgreSQL places i-dotless 'after' i, which is a false 
result. Nicolai will also report this bug, and explain why/how it happens.

> My feeling is that the best answer is to use lower case names to
> reference identifiers in the initdb code.  This is not a real pleasant
> prospect, but I can't see giving up on spec-compatible case folding ...

This is what we've also thought in here.

Regards,
- -- 
Devrim GUNDUZ
devrim@gunduz.org                  devrim.gunduz@linux.org.tr
                http://www.TDMSoft.com
                http://www.gunduz.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE/3iwVtl86P3SPfQ4RAl1VAJ9vmbEwq2pbAoPWY+lmVKSLhhabJQCgrzN9
zIRLFXS5WxjnZKbaO9XviA4=
=SDYx
-----END PGP SIGNATURE-----
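
The ordering complaint in the message above can be illustrated with a small
sketch. The thread does not say what causes the behaviour in PostgreSQL. The
Python example below only shows that comparing by code point puts ı (U+0131)
after i (U+0069), while a key built from the Turkish alphabet puts it before.
The word list and the helper turkish_key are made up for the example.

```python
# The 29 letters of the Turkish alphabet in dictionary order.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
RANK = {ch: pos for pos, ch in enumerate(TURKISH_ALPHABET)}

# Sample lower-case words, chosen only to exercise h, ı and i.
WORDS = ["ilk", "ılık", "hız"]


def turkish_key(word: str) -> list:
    """Sort key: rank of each letter in the Turkish alphabet.

    Characters outside the alphabet sort after all Turkish letters.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


print(sorted(WORDS))                   # code-point order: i before ı
print(sorted(WORDS, key=turkish_key))  # Turkish order: ı before i
```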

Re: conversion_create.sql

From: Tom Lane
Date: 15 December 2003, 19:10:27

Michael Brusser <michael@synchronicity.com> writes:
> Turned out initdb is looking for conversion_create.sql.
> We're not building this script and I may need to look into the build
> process, but for now can someone tell me what it does and why we'd
> need it.

Well, if you don't have it then there won't be any built-in conversions
defined in your database, which you might or might not care about.  I'd
be more concerned about what other parts of the build process might have
failed, if this one did.
        regards, tom lane