Thread: Re: Invalid "trailing junk" error message when non-English letters are used

Hi, Karina!

On Tue, 27 Aug 2024 at 19:06, Karina Litskevich <litskevichkarina@gmail.com> wrote:
Hi hackers,

When error "trailing junk after numeric literal" occurs at a number
followed by a symbol that is presented by more than one byte, that symbol
in the error message is not displayed correctly. Instead of that symbol
there is only its first byte. That makes the error message an invalid
UTF-8 (or whatever encoding is set). The whole log file where this error
message goes also becomes invalid. That could lead to problems with
reading logs. You can see an invalid message by trying "SELECT 123ä;".

Rejecting trailing junk after numeric literals was introduced in commit
2549f066 to prevent scanning a number immediately followed by an
identifier without whitespace as number and identifier. All the tokens
that made to catch such cases match a numeric literal and the next byte,
and that is where the problem comes from. I thought that it could be fixed
just by using tokens that match a numeric literal immediately followed by
an identifier, not only one byte. This also improves error messages in
cases with English letters. After these changes, for "SELECT 123abc;" the
error message will say that the error appeared at or near "123abc" instead
of "123a".

I've attached the patch. Are there any pitfalls I can't see? It just keeps
bothering me why wasn't it done from the beginning. Matching the whole
identifier after a numeric literal just seems more obvious to me than
matching its first byte.
 
I see the following compile time warnings:
scan.l:1062: warning, rule cannot be matched
scan.l:1066: warning, rule cannot be matched
scan.l:1070: warning, rule cannot be matched
pgc.l:1030: warning, rule cannot be matched
pgc.l:1033: warning, rule cannot be matched
pgc.l:1036: warning, rule cannot be matched
psqlscan.l:905: warning, rule cannot be matched
psqlscan.l:908: warning, rule cannot be matched
psqlscan.l:911: warning, rule cannot be matched

FWIW output of the whole string in the error message doesnt' look nice to me, but other places of code do this anyway e.g:
select ('1'||repeat('p',1000000))::integer;
This may be worth fixing.

Regards,
Pavel Borisov
Supabase
Karina Litskevich <litskevichkarina@gmail.com> writes:
> I see the two solutions here: either move the rule for decinteger_junk
> below the rules for hexinteger_junk, octinteger_junk and bininteger_junk,
> or just use a single rule decinteger_junk for all these cases, since the
> error message is the same anyway. I implemented the latter in the second
> version of the patch, also renamed this common rule to integer_junk.

That seems reasonable, but IMO this code was unacceptably
undercommented before and what you've done has made it worse.
We really need a comment block associated with the flex macros,
perhaps along the lines of

/*
 * An identifier immediately following a numeric literal is disallowed
 * because in some cases it's ambiguous what is meant: for example,
 * 0x1234 could be either a hexinteger or a decinteger "0" and an
 * identifier "x1234".  We can detect such problems by seeing if
 * integer_junk matches a longer substring than any of the XXXinteger
 * patterns.  (One "junk" pattern is sufficient because this will match
 * all the same strings we'd match with {hexinteger}{identifier} etc.)
 * Note that the rule for integer_junk must appear after the ones for
 * XXXinteger to make this work correctly.
 */

(Hmm, actually, is that last sentence true?  My flex is a bit rusty.)

param_junk really needs a similar comment, or maybe we could put
all the XXX_junk macros together and use one comment for all.

> Additionally, I noticed that this patch is going to change error messages
> in some cases, though I don't think it's a big deal. Example:
> Without patch:
> postgres=# select 0xyz;
> ERROR:  invalid hexadecimal integer at or near "0x"
> With patch:
> postgres=# select 0xyz;
> ERROR:  trailing junk after numeric literal at or near "0xyz"

That's sort of annoying, but I don't really see a better way,
or at least not one that's worth the effort.

>> FWIW output of the whole string in the error message doesnt' look nice to
>> me, but other places of code do this anyway e.g:
>> select ('1'||repeat('p',1000000))::integer;
>> This may be worth fixing.

I think this is nonsense: we are already in the habit of repeating the
whole failing query string in the STATEMENT field.  In any case it's
not something for this patch to worry about.

            regards, tom lane



Re: Invalid "trailing junk" error message when non-English letters are used

From
Karina Litskevich
Date:
On Thu, Sep 5, 2024 at 1:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Karina Litskevich <litskevichkarina@gmail.com> writes:
> > I see the two solutions here: either move the rule for decinteger_junk
> > below the rules for hexinteger_junk, octinteger_junk and bininteger_junk,
> > or just use a single rule decinteger_junk for all these cases, since the
> > error message is the same anyway. I implemented the latter in the second
> > version of the patch, also renamed this common rule to integer_junk.
>
> That seems reasonable, but IMO this code was unacceptably
> undercommented before and what you've done has made it worse.
> We really need a comment block associated with the flex macros,
> perhaps along the lines of
>
> /*
>  * An identifier immediately following a numeric literal is disallowed
>  * because in some cases it's ambiguous what is meant: for example,
>  * 0x1234 could be either a hexinteger or a decinteger "0" and an
>  * identifier "x1234".  We can detect such problems by seeing if
>  * integer_junk matches a longer substring than any of the XXXinteger
>  * patterns.  (One "junk" pattern is sufficient because this will match
>  * all the same strings we'd match with {hexinteger}{identifier} etc.)
>  * Note that the rule for integer_junk must appear after the ones for
>  * XXXinteger to make this work correctly.
>  */

Thank you, this piece of code definitely needs a comment.
 
> (Hmm, actually, is that last sentence true?  My flex is a bit rusty.)

Yes, the rule for integer_junk must appear after the ones for XXXinteger.
Here is a quote from https://ftp.gnu.org/old-gnu/Manuals/flex-2.5.4/html_mono/flex.html
"If it finds more than one match, it takes the one matching the most text
(...). If it finds two or more matches of the same length, the rule listed
first in the flex input file is chosen."
For example, 0x123 is matched by both integer_junk and hexinteger, and we
want the rule for hexinteger to be chosen, so we should place it before
the rule for integer_junk.
 
> param_junk really needs a similar comment, or maybe we could put
> all the XXX_junk macros together and use one comment for all.

In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.


Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
Karina Litskevich <litskevichkarina@gmail.com> writes:
> On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
> <litskevichkarina@gmail.com> wrote:
>> In v3 of the patch I grouped all the *_junk rules together and included
>> the suggested comment with a little added something.

> Oops, I forgot to attach the patch, here it is.

Pushed with a bit of further wordsmithing on the comment.

I left out the proposed new test case "SELECT 1ä;".  The trouble
with that is it'd introduce an encoding dependency into the test.
For example, it'd likely fail with some other error message in
a server encoding that lacks an equivalent to UTF8 "ä".  While
we have methods for coping with such cases, it requires some
pushups, and I didn't see the value.  The changes in existing
test case results are sufficient to show the patch does what
we want.

Also, while the bug exists in v15, the patch didn't apply at all.
I got lazy and just did the minimal s/ident_start/identifier/ change
in that branch, instead of back-patching the cosmetic aspects.

            regards, tom lane



Re: Invalid "trailing junk" error message when non-English letters are used

From
Pavel Borisov
Date:


On Thu, 5 Sept 2024 at 20:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Karina Litskevich <litskevichkarina@gmail.com> writes:
> On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
> <litskevichkarina@gmail.com> wrote:
>> In v3 of the patch I grouped all the *_junk rules together and included
>> the suggested comment with a little added something.

> Oops, I forgot to attach the patch, here it is.

Pushed with a bit of further wordsmithing on the comment.

I left out the proposed new test case "SELECT 1ä;".  The trouble
with that is it'd introduce an encoding dependency into the test.
For example, it'd likely fail with some other error message in
a server encoding that lacks an equivalent to UTF8 "ä".  While
we have methods for coping with such cases, it requires some
pushups, and I didn't see the value.  The changes in existing
test case results are sufficient to show the patch does what
we want.

Also, while the bug exists in v15, the patch didn't apply at all.
I got lazy and just did the minimal s/ident_start/identifier/ change
in that branch, instead of back-patching the cosmetic aspects.

Good! Thank you!
Pavel