Re: Inaccurate documentation about identifiers - Mailing list pgsql-bugs

From	raf
Subject	Re: Inaccurate documentation about identifiers
Date	November 17, 2022 22:47:32
Msg-id	Y3a6BMoEzbcZ0rEy@raf.org
In response to	Re: Inaccurate documentation about identifiers (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Inaccurate documentation about identifiers
List	pgsql-bugs

On Thu, Nov 17, 2022 at 03:01:10PM -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jeff Davis <pgsql@j-davis.com> writes:
> > On Wed, 2022-11-16 at 08:36 -0500, Brennan Vincent wrote:
> >> However, it seems that all non-ASCII characters are considered
> >> "letters"
> 
> > You're correct: it seems to allow any byte with the high bit set;
> > including, for example, a zero-width space.
> 
> Yes, see scan.l:
> 
> ident_start        [A-Za-z\200-\377_]
> ident_cont        [A-Za-z\200-\377_0-9\$]
> 
> identifier        {ident_start}{ident_cont}*
> 
> > I don't think we want to change the documentation here, because that
> > would amount to a promise that we support such identifiers forever.
> > I also don't think we want to change the code, because it opens up
> > several problems and I'm not sure it's worth trying to solve them.
> 
> Right.  IIRC, the SQL spec would have us allow only things that actually
> are letters per Unicode or other relevant spec, but (1) that's rather
> encoding-dependent and (2) the hit to parsing speed would likely be
> non-negligible.  Still, we might do it someday if someone can find
> a way around those concerns.  (Accepting whitespace, in particular,
> is Not Great.)  I think benign neglect in the docs is the best path.
> 
>             regards, tom lane

I think a lot of programming languages probably only use ASCII for
operators and whitespace.

I have a domain-specific micro-language that explicitly treats all
bytes with the high bit set as "letters" when parsing the names of
things, as a cheap way to "support" ASCII-compatible encodings like
UTF-8 and ISO-8859-* (but it's useless for UTF-16, GB 18030, Big5, ...).
The only way to do it right would be to decode everything. But then
you'd probably lose the ability to include emojis in identifiers. I
wonder if anyone's doing that in postgresql. :-)
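
To make the difference concrete, here is a rough Python sketch (not
PostgreSQL's code, and not my micro-language either). The byte-level
pattern mirrors the ident_start/ident_cont classes quoted above; the
function names is_ident_bytewise and is_ident_decoded are made up for
illustration, and "real letter" is assumed to mean Unicode general
category L*, which may not be exactly what the SQL spec has in mind.

```python
import re
import unicodedata

# Byte-level classes mirroring the scan.l patterns quoted above:
# any byte with the high bit set (0x80-0xFF) counts as a "letter".
IDENT_BYTES = re.compile(rb"[A-Za-z\x80-\xff_][A-Za-z\x80-\xff_0-9$]*")


def is_ident_bytewise(name: str, encoding: str = "utf-8") -> bool:
    """Cheap check: encode, then match raw bytes without decoding."""
    return IDENT_BYTES.fullmatch(name.encode(encoding)) is not None


def _is_letter(ch: str) -> bool:
    # Assumption: a "real letter" is any code point in category L*.
    return ch == "_" or unicodedata.category(ch).startswith("L")


def is_ident_decoded(name: str) -> bool:
    """Stricter check: work on decoded code points and require letters."""
    if not name:
        return False
    if not _is_letter(name[0]):
        return False
    return all(_is_letter(ch) or ch in "0123456789$" for ch in name[1:])


if __name__ == "__main__":
    samples = ["caf\u00e9", "\u200b", "\U0001F600", "a$1", "1abc"]
    for s in samples:
        print(ascii(s), is_ident_bytewise(s), is_ident_decoded(s))
```

With UTF-8 input, the zero-width space (U+200B) and the emoji
(U+1F600) pass the byte-level check but fail the decoded one, while
"caf\u00e9" passes both. Passing encoding="utf-16-le" to
is_ident_bytewise rejects even plain "abc", because of the NUL bytes,
which is the "useless for UTF-16" part.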

Does the SQL spec require accepting *only* real letters as letters,
or does it require accepting *at least* real letters as letters? :-)
Just a bit of wishful thinking.

cheers,
raf
