Re: Underscores in numeric literals - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: Underscores in numeric literals
Date
Msg-id CAEZATCUJPJHMokBEm4cYmq0t8T5hXMiKz6Rkqj5+EWOWF-==3A@mail.gmail.com
Whole thread Raw
In response to Re: Underscores in numeric literals  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: Underscores in numeric literals
Re: Underscores in numeric literals
List pgsql-hackers
On Wed, 28 Dec 2022 at 14:28, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2022-12-27 Tu 09:55, Tom Lane wrote:
> > We already accept that numeric input is different from numeric
> > literals: you can't write Infinity or NaN in SQL without quotes.
> > So I don't see an argument that we have to allow this in numeric
> > input for consistency.
>
> That's almost the same, but not quite, ISTM. Those are things you can't
> say without quotes, but here unless I'm mistaken you'd be disallowing
> this style if you use quotes. I get the difficulties with input
> functions, but it seems like we'll be building lots of grounds for
> confusion.
>

Yeah, it's easy to see why something like 'NaN' needs quotes, but it
would be harder to explain why something like 1000_000 mustn't have
quotes, and couldn't be used as input to COPY.

My feeling is that we should try to make the datatype input functions
accept anything that is legal syntax as a numeric literal, even if the
reverse isn't always possible.

That said, I think it's very important to minimise any performance
hit, especially in the existing case of inputs with no underscores.

Looking at the patch's changes to pg_strtointNN(), I think there's
more that can be done to reduce that performance hit. As it stands,
every input character is checked to see if it's an underscore, and
then there's a new check at the end to ensure that the input string
doesn't have a trailing underscore. Both of those can be avoided by
rearranging things a little, as in the attached v2 patch.

In the v2 patch, each input character is only compared with underscore
if it's not a digit, so in the case of an input with no underscores or
trailing spaces, the new checks for underscores are never executed.

In addition, if an underscore is seen, it now checks that the next
character is a digit. This eliminates the possibility of two
underscores in a row, and also of a trailing underscore, and so there
is no need for the final check for trailing underscores.

Thus, if the input consists only of digits, it never has to test for
underscores at all, and the performance hit for this case is
minimised.

My other concern with this patch is that the responsibility for
handling underscores is distributed over a couple of different places.
I had the same concern about the non-decimal integer patch, but at the
time I couldn't see any way round it. Now that we have soft error
handling though, I think that there is a way to improve this,
centralising the logic for both underscore and non-decimal handling to
one place for each datatype, reducing code duplication and the chances
of bugs.

For example, make_const() in the T_Float case has gained new code to
parse both the sign and base-prefix of the input, duplicating the
logic in pg_strtointNN(). That can now be avoided by having it call
pg_strtoint64_safe() with an ErrorSaveContext, instead of strtoi64().
In the process, it would then gain the ability to handle underscores,
so they wouldn't need to be stripped off elsewhere.

Similarly, process_integer_literal() could be made to call
pg_strtoint32_safe() with an ErrorSaveContext instead of strtoint(),
and it then wouldn't need to strip off underscores, or be passed the
number's base, since pg_strtoint32_safe() would handle all of that.

In addition, I think that strip_underscores() could then go away if
numeric_in() were made to handle underscores.

Essentially then, that would move all responsibility for parsing
underscores and non-decimal integers to the datatype input functions,
or their support routines, rather than having it distributed.

Regards,
Dean

Attachment

pgsql-hackers by date:

Previous
From: Vladimir Churyukin
Date:
Subject: Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG
Next
From: Dean Rasheed
Date:
Subject: Re: Underscores in numeric literals