Thread: Inaccurate documentation about identifiers

Inaccurate documentation about identifiers

From

Brennan Vincent

Date:

16 November 2022, 16:36:03

Hello,

The documentation states:

"SQL identifiers and key words must begin with a letter (a-z, but also letters
with diacritical marks and non-Latin letters) or an underscore (_). Subsequent
characters in an identifier or key word can be letters, underscores, digits
(0-9), or dollar signs ($)"

However, it seems that all non-ASCII characters are considered "letters" by the

Re: Inaccurate documentation about identifiers

From

Jeff Davis

Date:

17 November 2022, 22:12:39

On Wed, 2022-11-16 at 08:36 -0500, Brennan Vincent wrote:
> Hello,
>
> The documentation states:
>
> "SQL identifiers and key words must begin with a letter (a-z, but
> also letters
> with diacritical marks and non-Latin letters) or an underscore (_).
> Subsequent
> characters in an identifier or key word can be letters, underscores,
> digits
> (0-9), or dollar signs ($)"
>
> However, it seems that all non-ASCII characters are considered
> "letters"

You're correct: it seems to allow any byte with the high bit set;
including, for example, a zero-width space.

I don't think we want to change the documentation here, because that
would amount to a promise that we support such identifiers forever.

I also don't think we want to change the code, because it opens up
several problems and I'm not sure it's worth trying to solve them.

Is there any harm just leaving it as-is?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Inaccurate documentation about identifiers

From

Tom Lane

Date:

17 November 2022, 23:01:10

Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2022-11-16 at 08:36 -0500, Brennan Vincent wrote:
>> However, it seems that all non-ASCII characters are considered
>> "letters"

> You're correct: it seems to allow any byte with the high bit set;
> including, for example, a zero-width space.

Yes, see scan.l:

ident_start        [A-Za-z\200-\377_]
ident_cont        [A-Za-z\200-\377_0-9\$]

identifier        {ident_start}{ident_cont}*

> I don't think we want to change the documentation here, because that
> would amount to a promise that we support such identifiers forever.
> I also don't think we want to change the code, because it opens up
> several problems and I'm not sure it's worth trying to solve them.

Right.  IIRC, the SQL spec would have us allow only things that actually
are letters per Unicode or other relevant spec, but (1) that's rather
encoding-dependent and (2) the hit to parsing speed would likely be
non-negligible.  Still, we might do it someday if someone can find
a way around those concerns.  (Accepting whitespace, in particular,
is Not Great.)  I think benign neglect in the docs is the best path.

            regards, tom lane

Re: Inaccurate documentation about identifiers

From

raf

Date:

18 November 2022, 01:47:32

On Thu, Nov 17, 2022 at 03:01:10PM -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jeff Davis <pgsql@j-davis.com> writes:
> > On Wed, 2022-11-16 at 08:36 -0500, Brennan Vincent wrote:
> >> However, it seems that all non-ASCII characters are considered
> >> "letters"
> 
> > You're correct: it seems to allow any byte with the high bit set;
> > including, for example, a zero-width space.
> 
> Yes, see scan.l:
> 
> ident_start        [A-Za-z\200-\377_]
> ident_cont        [A-Za-z\200-\377_0-9\$]
> 
> identifier        {ident_start}{ident_cont}*
> 
> > I don't think we want to change the documentation here, because that
> > would amount to a promise that we support such identifiers forever.
> > I also don't think we want to change the code, because it opens up
> > several problems and I'm not sure it's worth trying to solve them.
> 
> Right.  IIRC, the SQL spec would have us allow only things that actually
> are letters per Unicode or other relevant spec, but (1) that's rather
> encoding-dependent and (2) the hit to parsing speed would likely be
> non-negligible.  Still, we might do it someday if someone can find
> a way around those concerns.  (Accepting whitespace, in particular,
> is Not Great.)  I think benign neglect in the docs is the best path.
> 
>             regards, tom lane

I think a lot of programming languages probably only use ASCII for
operators and whitespace.

I have a domain specific micro language that explicitly treats all
8-bit bytes as "letters" when parsing the names of things as a cheap
way to "support" ASCII-compatible encodings like UTF-8 and ISO-8859-*
(but it's useless for UTF-16, GB 18030, Big5, ...). The only way to
do it right would be to decode everything. But then you'd probably
lose the ability to include emojis in identifiers. I wonder if anyone's
doing that in postgresql. :-)

Does the SQL spec require accepting *only* real letters as letters,
or does it require accepting *at least* real letters as letters. :-)
Just a bit of wishful thinking.

cheers,
raf

Re: Inaccurate documentation about identifiers

From

Vik Fearing

Date:

26 November 2022, 02:19:23

On 11/17/22 23:47, raf wrote:
> On Thu, Nov 17, 2022 at 03:01:10PM -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
>> ident_start        [A-Za-z\200-\377_]
>> ident_cont        [A-Za-z\200-\377_0-9\$]
>>
>> identifier        {ident_start}{ident_cont}*
> 
> Does the SQL spec require accepting *only* real letters as letters,
> or does it require accepting *at least* real letters as letters. :-)
> Just a bit of wishful thinking.

<identifier body> ::=
   <identifier start> [ <identifier part>... ]

<identifier part> ::=
     <identifier start>
   | <identifier extend>

An <identifier start> is any character in the Unicode General Category 
classes “Lu”, “Ll”, “Lt”, “Lm”, “Lo”, or “Nl”.

An <identifier extend> is U+00B7, “Middle Dot”, or any character in the 
Unicode General Category classes “Mn”, “Mc”, “Nd”, or “Pc”.
-- 
Vik Fearing