Re: Inaccurate documentation about identifiers - Mailing list pgsql-bugs

From raf
Subject Re: Inaccurate documentation about identifiers
Date
Msg-id Y3a6BMoEzbcZ0rEy@raf.org
In response to Re: Inaccurate documentation about identifiers  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Inaccurate documentation about identifiers
List pgsql-bugs
On Thu, Nov 17, 2022 at 03:01:10PM -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jeff Davis <pgsql@j-davis.com> writes:
> > On Wed, 2022-11-16 at 08:36 -0500, Brennan Vincent wrote:
> >> However, it seems that all non-ASCII characters are considered
> >> "letters"
> 
> > You're correct: it seems to allow any byte with the high bit set;
> > including, for example, a zero-width space.
> 
> Yes, see scan.l:
> 
> ident_start        [A-Za-z\200-\377_]
> ident_cont        [A-Za-z\200-\377_0-9\$]
> 
> identifier        {ident_start}{ident_cont}*
> 
> > I don't think we want to change the documentation here, because that
> > would amount to a promise that we support such identifiers forever.
> > I also don't think we want to change the code, because it opens up
> > several problems and I'm not sure it's worth trying to solve them.
> 
> Right.  IIRC, the SQL spec would have us allow only things that actually
> are letters per Unicode or other relevant spec, but (1) that's rather
> encoding-dependent and (2) the hit to parsing speed would likely be
> non-negligible.  Still, we might do it someday if someone can find
> a way around those concerns.  (Accepting whitespace, in particular,
> is Not Great.)  I think benign neglect in the docs is the best path.
> 
>             regards, tom lane

I think a lot of programming languages probably only use ASCII for
operators and whitespace.

I have a domain-specific micro-language that explicitly treats every
byte with the high bit set as a "letter" when parsing the names of
things, as a cheap way to "support" ASCII-compatible encodings like
UTF-8 and ISO-8859-* (but it's useless for UTF-16, GB 18030, Big5, ...).
The only way to do it right would be to decode everything. But then
you'd probably lose the ability to include emojis in identifiers. I
wonder if anyone's doing that in PostgreSQL. :-)
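
For illustration only, here is a rough Python sketch of that byte-level
approach. The pattern mirrors the ident_start / ident_cont classes quoted
from scan.l above; the function name is_identifier is hypothetical, and
this is not how PostgreSQL itself is implemented (it uses flex, and real
parsing also involves quoting, keywords and length limits).

```python
import re

# Byte-level rule modelled on the scan.l classes quoted in this thread:
#   ident_start  [A-Za-z\200-\377_]
#   ident_cont   [A-Za-z\200-\377_0-9\$]
# Any byte with the high bit set (0x80-0xFF) counts as a "letter".
IDENT = re.compile(rb"[A-Za-z\x80-\xff_][A-Za-z\x80-\xff_0-9$]*")


def is_identifier(text: str, encoding: str = "utf-8") -> bool:
    """Return True if the encoded bytes of `text` match the rule in full.

    Hypothetical helper: it looks only at raw bytes and never decodes
    them, so it cannot tell a letter from an emoji or a zero-width space.
    """
    return IDENT.fullmatch(text.encode(encoding)) is not None


if __name__ == "__main__":
    samples = ["caf\u00e9", "a\u200bb", "\U0001F600", "1abc", "a b"]
    for s in samples:
        print(repr(s), is_identifier(s))
    # UTF-16 puts NUL bytes next to ASCII letters, so the trick fails.
    print("utf-16-le:", is_identifier("ab", encoding="utf-16-le"))
```

Because every byte of a multi-byte UTF-8 sequence has the high bit set,
accented letters, emojis and the zero-width space all pass, which is the
behaviour Jeff described.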

Does the SQL spec require accepting *only* real letters as letters,
or does it require accepting *at least* real letters as letters? :-)
Just a bit of wishful thinking.
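
To make the "only real letters" reading concrete, here is a hedged
Python sketch that works on decoded characters instead of bytes. It
assumes "letter" means Unicode general category L* (Lu, Ll, Lt, Lm, Lo),
which is my guess; nobody in this thread has quoted the spec's actual
definition. The name is_strict_identifier is hypothetical.

```python
import unicodedata


def _is_letter(ch: str) -> bool:
    # Assumption: a "real letter" is any character whose Unicode general
    # category starts with "L". Underscore is allowed as in scan.l.
    return ch == "_" or unicodedata.category(ch).startswith("L")


def _is_cont(ch: str) -> bool:
    # Continuation characters: letters plus ASCII digits and "$",
    # matching the extra characters in ident_cont.
    return _is_letter(ch) or ch in "0123456789$"


def is_strict_identifier(text: str) -> bool:
    """Return True if `text` is a letter followed by letters/digits/$.

    Hypothetical helper operating on an already-decoded str, so the
    caller must know the input encoding up front.
    """
    if not text:
        return False
    return _is_letter(text[0]) and all(_is_cont(ch) for ch in text[1:])


if __name__ == "__main__":
    for s in ["caf\u00e9", "a\u200bb", "\U0001F600", "x1$", "1x"]:
        print(repr(s), is_strict_identifier(s))
```

Under this stricter rule the zero-width space (category Cf) and emojis
(category So) are rejected, at the cost of decoding every character and
doing a category lookup, which is the parsing-speed concern Tom raised.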

cheers,
raf



