On Thu, Nov 17, 2022 at 03:01:10PM -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Wed, 2022-11-16 at 08:36 -0500, Brennan Vincent wrote:
> >> However, it seems that all non-ASCII characters are considered
> >> "letters"
>
> > You're correct: it seems to allow any byte with the high bit set;
> > including, for example, a zero-width space.
>
> Yes, see scan.l:
>
> ident_start [A-Za-z\200-\377_]
> ident_cont [A-Za-z\200-\377_0-9\$]
>
> identifier {ident_start}{ident_cont}*
>
> > I don't think we want to change the documentation here, because that
> > would amount to a promise that we support such identifiers forever.
> > I also don't think we want to change the code, because it opens up
> > several problems and I'm not sure it's worth trying to solve them.
>
> Right. IIRC, the SQL spec would have us allow only things that actually
> are letters per Unicode or other relevant spec, but (1) that's rather
> encoding-dependent and (2) the hit to parsing speed would likely be
> non-negligible. Still, we might do it someday if someone can find
> a way around those concerns. (Accepting whitespace, in particular,
> is Not Great.) I think benign neglect in the docs is the best path.
>
> regards, tom lane
I think a lot of programming languages probably only use ASCII for
operators and whitespace.
I have a domain specific micro language that explicitly treats all
8-bit bytes as "letters" when parsing the names of things as a cheap
way to "support" ASCII-compatible encodings like UTF-8 and ISO-8859-*
(but it's useless for UTF-16, GB 18030, Big5, ...). The only way to
do it right would be to decode everything. But then you'd probably
lose the ability to include emojis in identifiers. I wonder if anyone's
doing that in postgresql. :-)
Does the SQL spec require accepting *only* real letters as letters,
or does it require accepting *at least* real letters as letters. :-)
Just a bit of wishful thinking.
cheers,
raf