Re: BUG #15273: Lexer bug with UESCAPE - Mailing list pgsql-bugs

From Andrew Gierth
Subject Re: BUG #15273: Lexer bug with UESCAPE
Msg-id 87bmbekq90.fsf@news-spur.riddles.org.uk
In response to Re: BUG #15273: Lexer bug with UESCAPE  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> Also, I'm going to push back on the claim that allowing comments
 Tom> there is required by the SQL spec. The relevant rules in SQL:2011
 Tom> are

 Tom> <Unicode character string literal> ::=
 Tom>   [ <introducer> <character set specification> ]
 Tom>       U <ampersand> <quote> [ <Unicode representation>... ] <quote>
 Tom>       [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
 Tom>       <Unicode escape specifier>

 Tom> <Unicode escape specifier> ::=
 Tom>   [ UESCAPE <quote> <Unicode escape character> <quote> ]

 Tom> I do not see any principled way of arguing that these rules
 Tom> require comments to be allowed adjacent to UESCAPE without also
 Tom> claiming that they must be allowed between, say, the initial 'U'
 Tom> and the ampersand.

These are the rules that (as far as I can see) apply to that case:

5.2 <token> and <separator>

<separator> ::=
  { <comment> | <white space> }...

  7) Any <token> may be followed by a <separator>.

5.3 <literal>

  11) In a <Unicode character string literal>, there shall be no
      <separator> between the "U" and the <ampersand> nor between the
      <ampersand> and the <quote>.

 Tom> The only place these rules allow a <separator> is between segments
 Tom> of a multiline literal. It looks to me like an extension that we
 Tom> even allow whitespace around UESCAPE.

I think this use of <separator> serves only to indicate that a
<separator> there is _required_, rather than optional as it usually is
after tokens, and that the special rule about requiring newlines also
applies only to that specific use of <separator>.

If the whole <Unicode character string literal> were regarded as a
single token, so that rule 5.2.7 above did not apply around the
UESCAPE, then there would be no reason to write rule 5.3.11 forbidding
separators within the U&' part.

(In the case of X'...', there's rule 5.2.5, which as I see it would
prevent a space after the X, but that rule explicitly does not apply to
the U& cases.)
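
As a rough illustration of the reading argued above, here is a minimal
Python sketch. The regular expressions and the accepts() helper are
hypothetical simplifications (no nested comments, no multi-segment
literals) and are neither the spec's grammar nor PostgreSQL's lexer.
They only encode two points: no <separator> inside the U&' prefix, and
an optional <separator>, comments included, around UESCAPE.

```python
import re

# Toy patterns, assumed for illustration only.
COMMENT = r"(?:--[^\n]*(?:\n|$)|/\*.*?\*/)"   # simple and bracketed comments
SEPARATOR = rf"(?:\s|{COMMENT})+"              # 5.2: { <comment> | <white space> }...
BODY = r"U&'(?:[^']|'')*'"                     # 5.3 rule 11: nothing between U, & and quote
UESCAPE = rf"UESCAPE(?:{SEPARATOR})?'[^']'"    # separator allowed after the UESCAPE token

# Under this reading (5.2 rule 7), a separator may also follow the closing quote.
LITERAL = re.compile(
    rf"{BODY}(?:(?:{SEPARATOR})?{UESCAPE})?",
    re.DOTALL | re.IGNORECASE,
)

def accepts(text: str) -> bool:
    """Return True if text is a whole literal under the toy grammar above."""
    return LITERAL.fullmatch(text) is not None
```

With these assumptions, accepts("U&'d!0061ta' /* c */ UESCAPE '!'") is
accepted while accepts("U &'abc'") is rejected, which mirrors the
distinction that rule 5.3.11 seems to draw.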

As a related issue, we don't allow comments within the <separator> that
splits a multiline literal, even though the spec certainly allows those
(arguably, since the spec defines that comments are equivalent to
newlines, "select 'foo' /**/ 'bar';" should be legal too).
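
The two readings of the <separator> between literal segments can be
contrasted with a small Python sketch. Both function names are
hypothetical, and the first is only a simplification of the
newline-only behaviour described in this message, not a model of the
actual scanner rules.

```python
import re

# Toy comment pattern, assumed for illustration (no nested comments).
COMMENT = r"(?:--[^\n]*(?:\n|$)|/\*.*?\*/)"

def splits_literal_newline_only(sep: str) -> bool:
    """Stricter reading: whitespace only, with at least one newline."""
    return re.fullmatch(r"\s*\n\s*", sep) is not None

def splits_literal_spec_reading(sep: str) -> bool:
    """Arguable spec reading: whitespace and comments, where a comment
    counts as a newline."""
    if re.fullmatch(rf"(?:\s|{COMMENT})+", sep, re.DOTALL) is None:
        return False
    has_comment = re.search(COMMENT, sep, re.DOTALL) is not None
    return "\n" in sep or has_comment
```

Under these assumptions the separator " /**/ " from the example above
is rejected by the first function and accepted by the second, while a
lone space is rejected by both.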

I've put up a summary of all these at
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#Lexing_of_string_literals_and_comments

(under the assumption that the whole issue is filed under WONTFIX at
least for the time being)

-- 
Andrew (irc:RhodiumToad)

