Thread: Folding of case of identifiers

Folding of case of identifiers

From
Niels Jespersen
Date:
Hello all

According to https://www.postgresql.org/docs/current/sql-syntax-lexical.html, "Key words and unquoted identifiers are
case insensitive." And "SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical
marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters,
underscores, digits (0-9), or dollar signs ($)."

So far so good. Non-Latin letters are included, which I take to also include the Danish letters æøå/ÆØÅ.

However, name folding is odd for these letters. Of these three create table statements, the first two succeed and the
last one does not (G and g are equivalent, Æ and æ are not).

create table æblegrød (a int, køn text);
create table ÆblegrØd (a int, køn text);
create table ÆbleGrØd (a int, køn text);

Can anyone explain the logic that governs this?

Regards Niels Jespersen






Re: Folding of case of identifiers

From
Tom Lane
Date:
Niels Jespersen <NJN@dst.dk> writes:
> According to https://www.postgresql.org/docs/current/sql-syntax-lexical.html, "Key words and unquoted identifiers are
> case insensitive." And "SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical
> marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters,
> underscores, digits (0-9), or dollar signs ($)."

> So far so good. Non-Latin letters are included, which I take to also include the Danish letters æøå/ÆØÅ.

> However, name folding is odd for these letters. Of these three create table statements, the first two succeed and the
> last one does not (G and g are equivalent, Æ and æ are not).

Whether non-ASCII characters get downcased is very context dependent.
You've not mentioned the database encoding or the locale (LC_CTYPE)
setting, but both of those are relevant.  Basically, in a single-byte
encoding we'll apply tolower() to identifier characters; but we don't
attempt to case-fold multi-byte characters at all.  This logic is pretty
hoary, dating from before Unicode became widespread, but I'd be hesitant
to change it now.

            regards, tom lane
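
The rule described in the reply can be illustrated with a small Python model. This is only a sketch of the behaviour as stated above, not PostgreSQL's actual code: the function name fold_identifier and its single_byte_encoding flag are hypothetical, and Python's str.lower() is used as a stand-in for the locale-dependent tolower() call, so real results in a single-byte encoding depend on LC_CTYPE.

```python
def fold_identifier(name: str, single_byte_encoding: bool = False) -> str:
    """Rough model of unquoted-identifier folding (hypothetical helper)."""
    folded = []
    for ch in name:
        if "A" <= ch <= "Z":
            # ASCII letters are always downcased.
            folded.append(ch.lower())
        elif single_byte_encoding and ord(ch) > 127:
            # Assumption: stand-in for tolower() under the database locale.
            folded.append(ch.lower())
        else:
            # In a multi-byte encoding, non-ASCII characters are left alone.
            folded.append(ch)
    return "".join(folded)


if __name__ == "__main__":
    names = ["æblegrød", "ÆblegrØd", "ÆbleGrØd"]
    for name in names:
        print(name, "->", fold_identifier(name))
```

Under this model, in a multi-byte encoding the first two names fold to different identifiers, while the third folds to the same identifier as the second (only G is downcased), which would match the reported failure of the third create table.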