Thread: PostgreSQL implicitly double-quoting identifier name with umlaut

PostgreSQL implicitly double-quoting identifier name with umlaut

From
Michael Downey
Date:

PostgreSQL implicitly double-quoting identifier name with umlaut

 

I an understand this behavior, just trying to find documentation to confirm it.

 

Server and client encoding are UTF8.

 

One of our internal users, using our tools, added a column called Örtschaft. We anticipated it would be folded to lower case.

So we inserted our metadata for the column in our metadata with the name örtschaft. With the system query for metadata, we

ended up seeing query mismatches involving this column as we found the actual column name is Örtschaft

in the database.

 

I looked at the “SQL” of the table in pgadmin and saw this:

 

“CREATE TABLE IF NOT EXISTS map.mytable2

(

    oid bigint,

    name1 smallint,

    "Örtschaft" integer

)

 

TABLESPACE pg_default;

 

ALTER TABLE IF EXISTS map.mytable2

    OWNER to map;”

 

I thought our code might be adding the double quotes inadvertently, but the statement in the Postgres log had the alter table statement with

Örtschaft without double quotes.

 

After experimenting in SQL, I was able to find that in I could reproduce this in psql and Pgadmin sql.

 

Ex

In SQL:

alter table map.mytable2 add column Örtschaft7 integer;

 

PGADMIN interface output (I could not find the exact query statement to produce this)

CREATE TABLE IF NOT EXISTS map.mytable2

(

    oid bigint,

    name1 smallint,

    "Örtschaft" integer,

    ortschaft2 integer,

    "™rtschaft3" integer,

    "Örtschaft4" integer,

    "Örtschaft5" integer,

    "Örtschaft6" integer,

    "Örtschaft7" integer

)

 

TABLESPACE pg_default;

 

ALTER TABLE IF EXISTS map.mytable2

    OWNER to map;

 

This behavior seems like an exception to the lower-case folding behavior expectation.

 

Is this expected?

Is this documented someplace?

 

Thanks,

Michael

Michael Downey <mdowney@esri.com> writes:
> One of our internal users, using our tools, added a column called Örtschaft. We anticipated it would be folded to
lowercase. 
> So we inserted our metadata for the column in our metadata with the name örtschaft. With the system query for
metadata,we 
> ended up seeing query mismatches involving this column as we found the actual column name is Örtschaft
> in the database.

When working in UTF8 (or any multibyte encoding), PG's identifier
case-folding changes only ASCII letters.  I can't find anything in
our SGML docs about this, at least not where I'd expect it to be
documented.  The code is pretty clear about what it's doing though:

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't yet
     * have the infrastructure for.  Instead we use tolower() to provide a
     * locale-aware translation.  However, there are some locales where this
     * is not right either (eg, Turkish may do strange things with 'i' and
     * 'I').  Our current compromise is to use tolower() for characters with
     * the high bit set, as long as they aren't part of a multi-byte
     * character, and use an ASCII-only downcasing for 7-bit characters.
     */

These days the claim that no infrastructure is available is obsolete.
But I'm mighty hesitant to touch this behavior, because it'd almost
surely break peoples' apps.  We could do better on the documentation
front though.

            regards, tom lane



RE: PostgreSQL implicitly double-quoting identifier name with umlaut

From
Michael Downey
Date:
Thanks Tom!

I doubt we would want this changed. Our documentation and Dev team are on-board for the lower case folding and I really
wouldnot 
 
want it changed. We will make sure that we make a notice of this in our documentation as well.

Thank you
Michael

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us> 
Sent: Friday, September 13, 2024 4:20 PM
To: Michael Downey <mdowney@esri.com>
Cc: pgsql-sql@lists.postgresql.org
Subject: Re: PostgreSQL implicitly double-quoting identifier name with umlaut

Michael Downey <mdowney@esri.com> writes:
> One of our internal users, using our tools, added a column called Örtschaft. We anticipated it would be folded to
lowercase.
 
> So we inserted our metadata for the column in our metadata with the 
> name örtschaft. With the system query for metadata, we ended up seeing 
> query mismatches involving this column as we found the actual column name is Örtschaft in the database.

When working in UTF8 (or any multibyte encoding), PG's identifier case-folding changes only ASCII letters.  I can't
findanything in our SGML docs about this, at least not where I'd expect it to be documented.  The code is pretty clear
aboutwhat it's doing though:
 

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't yet
     * have the infrastructure for.  Instead we use tolower() to provide a
     * locale-aware translation.  However, there are some locales where this
     * is not right either (eg, Turkish may do strange things with 'i' and
     * 'I').  Our current compromise is to use tolower() for characters with
     * the high bit set, as long as they aren't part of a multi-byte
     * character, and use an ASCII-only downcasing for 7-bit characters.
     */

These days the claim that no infrastructure is available is obsolete.
But I'm mighty hesitant to touch this behavior, because it'd almost surely break peoples' apps.  We could do better on
thedocumentation front though.
 

            regards, tom lane