Thread: Non-decimal integer literals

Non-decimal integer literals

From

Peter Eisentraut

Date:

16 August 2021, 09:51:50

Here is a patch to add support for hexadecimal, octal, and binary 
integer literals:

     0x42E
     0o112
     0b100101

per SQL:202x draft.

This adds support in the lexer as well as in the integer type input 
functions.

Those core parts are straightforward enough, but there are a bunch of 
other places where integers are parsed, and one could consider in each 
case whether they should get the same treatment, for example the 
replication syntax lexer, or input function for oid, numeric, and 
int2vector.  There are also some opportunities to move some code around, 
for example scanint8() could be in numutils.c.  I have also looked with 
some suspicion at some details of the number lexing in ecpg, but haven't 
found anything I could break yet.  Suggestions are welcome.

Attachment

v1-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

John Naylor

Date:

16 August 2021, 15:32:37

On Mon, Aug 16, 2021 at 5:52 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> Here is a patch to add support for hexadecimal, octal, and binary
> integer literals:
>
> 0x42E
> 0o112
> 0b100101
>
> per SQL:202x draft.
>
> This adds support in the lexer as well as in the integer type input
> functions.

The one thing that jumped out at me on a cursory reading is the {integer} rule, which seems to be used nowhere except to call process_integer_literal, which must then inspect the token text to figure out what type of integer it is. Maybe consider 4 separate process_*_literal functions?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

07 September 2021, 11:13:10

On 16.08.21 17:32, John Naylor wrote:
> The one thing that jumped out at me on a cursory reading is 
> the {integer} rule, which seems to be used nowhere except to 
> call process_integer_literal, which must then inspect the token text to 
> figure out what type of integer it is. Maybe consider 4 separate 
> process_*_literal functions?

Agreed, that can be done in a simpler way.  Here is an updated patch.

Attachment

v2-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

Zhihong Yu

Date:

07 September 2021, 11:50:15

On Tue, Sep 7, 2021 at 4:13 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

On 16.08.21 17:32, John Naylor wrote:
> The one thing that jumped out at me on a cursory reading is
> the {integer} rule, which seems to be used nowhere except to
> call process_integer_literal, which must then inspect the token text to
> figure out what type of integer it is. Maybe consider 4 separate
> process_*_literal functions?

Agreed, that can be done in a simpler way. Here is an updated patch.

Hi,

Minor comment:

+SELECT int4 '0o112';

Maybe involve digits of up to 7 in the octal test case.

Thanks

Re: Non-decimal integer literals

From

Vik Fearing

Date:

08 September 2021, 12:44:45

On 8/16/21 11:51 AM, Peter Eisentraut wrote:
> Here is a patch to add support for hexadecimal, octal, and binary
> integer literals:
> 
>     0x42E
>     0o112
>     0b100101
> 
> per SQL:202x draft.

Is there any hope of adding the optional underscores?  I see a potential
problem there as SELECT 1_a; is currently parsed as SELECT 1 AS _a; when
it should be parsed as SELECT 1_ AS a; or perhaps even as an error since
0x1_a would be a valid number with no alias.

(The standard does not allow identifiers to begin with _ but we do...)
-- 
Vik Fearing

Re: Non-decimal integer literals

From

Tom Lane

Date:

08 September 2021, 13:14:34

Vik Fearing <vik@postgresfriends.org> writes:
> On 8/16/21 11:51 AM, Peter Eisentraut wrote:
>> Here is a patch to add support for hexadecimal, octal, and binary
>> integer literals:
>> 
>>     0x42E
>>     0o112
>>     0b100101
>> 
>> per SQL:202x draft.

> Is there any hope of adding the optional underscores?  I see a potential
> problem there as SELECT 1_a; is currently parsed as SELECT 1 AS _a; when
> it should be parsed as SELECT 1_ AS a; or perhaps even as an error since
> 0x1_a would be a valid number with no alias.

Even without that point, this patch *is* going to break valid queries,
because every one of those cases is a valid number-followed-by-identifier
today, e.g.

regression=# select 0x42e;
 x42e 
------
    0
(1 row)

AFAIR we've seen exactly zero field demand for this feature,
so I kind of wonder why we're in such a hurry to adopt something
that hasn't even made it past draft-standard status.

            regards, tom lane

Re: Non-decimal integer literals

From

Vik Fearing

Date:

09 September 2021, 14:08:04

On 9/8/21 3:14 PM, Tom Lane wrote:
> Vik Fearing <vik@postgresfriends.org> writes:
> 
>> Is there any hope of adding the optional underscores?  I see a potential
>> problem there as SELECT 1_a; is currently parsed as SELECT 1 AS _a; when
>> it should be parsed as SELECT 1_ AS a; or perhaps even as an error since
>> 0x1_a would be a valid number with no alias.
> 
> Even without that point, this patch *is* going to break valid queries,
> because every one of those cases is a valid number-followed-by-identifier
> today,

Ah, true that.  So if this does go in, we may as well add the
underscores at the same time.

> AFAIR we've seen exactly zero field demand for this feature,

I have often wanted something like this, even if I didn't bring it up on
this list.  I have had customers who have wanted this, too.  My response
has always been to show these exact problems to explain why it's not
possible, but if it's going to be in the standard then I favor doing it.

I have never really had a use for octal, but sometimes binary and hex
make things much clearer.  Having a grouping separator for large numbers
is even more useful.

> so I kind of wonder why we're in such a hurry to adopt something
> that hasn't even made it past draft-standard status.
I don't really see a hurry here.  I am fine with waiting until the draft
becomes final.
-- 
Vik Fearing

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

28 September 2021, 15:26:51

On 07.09.21 13:50, Zhihong Yu wrote:
>     On 16.08.21 17:32, John Naylor wrote:
>      > The one thing that jumped out at me on a cursory reading is
>      > the {integer} rule, which seems to be used nowhere except to
>      > call process_integer_literal, which must then inspect the token
>     text to
>      > figure out what type of integer it is. Maybe consider 4 separate
>      > process_*_literal functions?
> 
>     Agreed, that can be done in a simpler way.  Here is an updated patch.
> 
> Hi,
> Minor comment:
> 
> +SELECT int4 '0o112';
> 
> Maybe involve digits of up to 7 in the octal test case.

Good point, here is a lightly updated patch.

Attachment

v3-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

28 September 2021, 15:30:28

On 09.09.21 16:08, Vik Fearing wrote:
>> Even without that point, this patch *is* going to break valid queries,
>> because every one of those cases is a valid number-followed-by-identifier
>> today,
> 
> Ah, true that.  So if this does go in, we may as well add the
> underscores at the same time.

Yeah, looks like I'll need to look into the identifier lexing issues 
previously discussed.  I'll attack that during the next commit fest.

>> so I kind of wonder why we're in such a hurry to adopt something
>> that hasn't even made it past draft-standard status.
> I don't really see a hurry here.  I am fine with waiting until the draft
> becomes final.

Right, the point is to explore this now so that it can be ready when the 
standard is ready.

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

01 November 2021, 06:09:15

On 28.09.21 17:30, Peter Eisentraut wrote:
> On 09.09.21 16:08, Vik Fearing wrote:
>>> Even without that point, this patch *is* going to break valid queries,
>>> because every one of those cases is a valid 
>>> number-followed-by-identifier
>>> today,
>>
>> Ah, true that.  So if this does go in, we may as well add the
>> underscores at the same time.
> 
> Yeah, looks like I'll need to look into the identifier lexing issues 
> previously discussed.  I'll attack that during the next commit fest.

Here is an updated patch for this.  It's the previous patch polished a 
bit more, and it contains changes so that numeric literals reject 
trailing identifier parts without whitespace in between, as discussed. 
Maybe I should split that into incremental patches, but for now I only 
have the one.  I don't have a patch for the underscores in numeric 
literals yet.  It's in progress, but not ready.

Attachment

v4-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

25 November 2021, 13:17:52

On 01.11.21 07:09, Peter Eisentraut wrote:
> Here is an updated patch for this.  It's the previous patch polished a 
> bit more, and it contains changes so that numeric literals reject 
> trailing identifier parts without whitespace in between, as discussed. 
> Maybe I should split that into incremental patches, but for now I only 
> have the one.  I don't have a patch for the underscores in numeric 
> literals yet.  It's in progress, but not ready.

Here is a progressed version of this work, split into more incremental 
patches.  The first three patches are harmless code cleanups.  Patch 3 
has an interesting naming conflict, noted in the commit message; ideas 
welcome.  Patches 4 and 5 handle the rejection of trailing junk after 
numeric literals, as discussed.  I have expanded that compared to the v4 
patch to also cover non-integer literals.  It also comes with more tests 
now.  Patch 6 is the titular introduction of non-decimal integer 
literals, unchanged from before.

Attachment

Re: Non-decimal integer literals

From

Zhihong Yu

Date:

25 November 2021, 15:46:51

On Thu, Nov 25, 2021 at 5:18 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

On 01.11.21 07:09, Peter Eisentraut wrote:
> Here is an updated patch for this. It's the previous patch polished a
> bit more, and it contains changes so that numeric literals reject
> trailing identifier parts without whitespace in between, as discussed.
> Maybe I should split that into incremental patches, but for now I only
> have the one. I don't have a patch for the underscores in numeric
> literals yet. It's in progress, but not ready.

Here is a progressed version of this work, split into more incremental
patches. The first three patches are harmless code cleanups. Patch 3
has an interesting naming conflict, noted in the commit message; ideas
welcome. Patches 4 and 5 handle the rejection of trailing junk after
numeric literals, as discussed. I have expanded that compared to the v4
patch to also cover non-integer literals. It also comes with more tests
now. Patch 6 is the titular introduction of non-decimal integer
literals, unchanged from before.

Hi,

For patch 3,

+int64

+pg_strtoint64(const char *s)

How about naming the above function pg_scanint64()?

pg_strtoint64xx() can be named pg_strtoint64() - this would align with existing function:

pg_strtouint64(const char *str, char **endptr, int base)

Cheers

Re: Non-decimal integer literals

From

John Naylor

Date:

25 November 2021, 17:51:43

Hi Peter,

0001

-/* we no longer allow unary minus in numbers.
- * instead we pass it separately to parser. there it gets
- * coerced via doNegate() -- Leon aug 20 1999
+/*
+ * Numbers
+ *
+ * Unary minus is not part of a number here. Instead we pass it separately to
+ * parser, and there it gets coerced via doNegate().

If we're going to change the comment anyway, "the parser" sounds more natural. Aside from that, 0001 and 0002 can probably be pushed now, if you like. I don't have any good ideas about 0003 at the moment.

0005

--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
realfail1 ({integer}|{decimal})[Ee]
realfail2 ({integer}|{decimal})[Ee][-+]

+integer_junk {integer}{ident_start}
+decimal_junk {decimal}{ident_start}
+real_junk {real}{ident_start}

A comment might be good here to explain these are only in ECPG for consistency with the other scanners. Not really important, though.

0006

+{hexfail} {
+ yyerror("invalid hexadecimal integer");
+ }
+{octfail} {
+ yyerror("invalid octal integer");
}
-{decimal} {
+{binfail} {
+ yyerror("invalid binary integer");
+ }

It seems these could use SET_YYLLOC(), since the error cursor doesn't match other failure states:

+SELECT 0b;
+ERROR: invalid binary integer at or near "SELECT 0b"
+LINE 1: SELECT 0b;
+ ^
+SELECT 1b;
+ERROR: trailing junk after numeric literal at or near "1b"
+LINE 1: SELECT 1b;
+ ^

We might consider some tests for ECPG since lack of coverage has been a problem.

Also, I'm curious: how does the spec work as far as deciding the year of release, or feature-freezing of new items?
--
John Naylor
EDB: http://www.enterprisedb.com

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

01 December 2021, 15:40:19

On 25.11.21 16:46, Zhihong Yu wrote:
> For patch 3,
> 
> +int64
> +pg_strtoint64(const char *s)
> 
> How about naming the above function pg_scanint64()?
> pg_strtoint64xx() can be named pg_strtoint64() - this would align with 
> existing function:
> 
> pg_strtouint64(const char *str, char **endptr, int base)

That would be one way.  But the existing pg_strtointNN() functions are 
pretty widely used, so I would tend toward finding another name for the 
less used pg_strtouint64(), maybe pg_strtouint64x() ("extended").

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

01 December 2021, 15:47:09

On 25.11.21 18:51, John Naylor wrote:
> If we're going to change the comment anyway, "the parser" sounds more 
> natural. Aside from that, 0001 and 0002 can probably be pushed now, if 
> you like.

done

> --- a/src/interfaces/ecpg/preproc/pgc.l
> +++ b/src/interfaces/ecpg/preproc/pgc.l
> @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
>   realfail1 ({integer}|{decimal})[Ee]
>   realfail2 ({integer}|{decimal})[Ee][-+]
> 
> +integer_junk {integer}{ident_start}
> +decimal_junk {decimal}{ident_start}
> +real_junk {real}{ident_start}
> 
> A comment might be good here to explain these are only in ECPG for 
> consistency with the other scanners. Not really important, though.

Yeah, it's a bit weird that not all the symbols are used in ecpg.  I'll 
look into explaining this better.

> 0006
> 
> +{hexfail} {
> + yyerror("invalid hexadecimal integer");
> + }
> +{octfail} {
> + yyerror("invalid octal integer");
>    }
> -{decimal} {
> +{binfail} {
> + yyerror("invalid binary integer");
> + }
> 
> It seems these could use SET_YYLLOC(), since the error cursor doesn't 
> match other failure states:

ok

> We might consider some tests for ECPG since lack of coverage has been a 
> problem.

right

> Also, I'm curious: how does the spec work as far as deciding the year of 
> release, or feature-freezing of new items?

The schedule has recently been extended again, so the current plan is 
for SQL:202x with x=3, with feature freeze in mid-2022.

So the feature patches in this thread are in my mind now targeting 
PG15+1.  But the preparation work (up to v5-0005, and some other number 
parsing refactoring that I'm seeing) could be considered for PG15.

I'll move this to the next CF and come back with an updated patch set in 
a little while.

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

30 December 2021, 09:43:30

There has been some other refactoring going on, which made this patch 
set out of date.  So here is an update.

The old pg_strtouint64() has been removed, so there is no longer a 
naming concern with patch 0001.  That one should be good to go.

I also found that yet another way to parse integers in pg_atoi() has 
mostly faded away in utility, so I removed the last two callers and 
removed the function in 0002 and 0003.

The remaining patches are as before, with some of the review comments 
applied.  I still need to write some lexing unit tests for ecpg, which I 
haven't gotten to yet.  This affects patches 0004 and 0005.

As mentioned before, patches 0006 and 0007 are more feature previews at 
this point.


On 01.12.21 16:47, Peter Eisentraut wrote:
> On 25.11.21 18:51, John Naylor wrote:
>> If we're going to change the comment anyway, "the parser" sounds more 
>> natural. Aside from that, 0001 and 0002 can probably be pushed now, if 
>> you like.
> 
> done
> 
>> --- a/src/interfaces/ecpg/preproc/pgc.l
>> +++ b/src/interfaces/ecpg/preproc/pgc.l
>> @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
>>   realfail1 ({integer}|{decimal})[Ee]
>>   realfail2 ({integer}|{decimal})[Ee][-+]
>>
>> +integer_junk {integer}{ident_start}
>> +decimal_junk {decimal}{ident_start}
>> +real_junk {real}{ident_start}
>>
>> A comment might be good here to explain these are only in ECPG for 
>> consistency with the other scanners. Not really important, though.
> 
> Yeah, it's a bit weird that not all the symbols are used in ecpg.  I'll 
> look into explaining this better.
> 
>> 0006
>>
>> +{hexfail} {
>> + yyerror("invalid hexadecimal integer");
>> + }
>> +{octfail} {
>> + yyerror("invalid octal integer");
>>    }
>> -{decimal} {
>> +{binfail} {
>> + yyerror("invalid binary integer");
>> + }
>>
>> It seems these could use SET_YYLLOC(), since the error cursor doesn't 
>> match other failure states:
> 
> ok
> 
>> We might consider some tests for ECPG since lack of coverage has been 
>> a problem.
> 
> right
> 
>> Also, I'm curious: how does the spec work as far as deciding the year 
>> of release, or feature-freezing of new items?
> 
> The schedule has recently been extended again, so the current plan is 
> for SQL:202x with x=3, with feature freeze in mid-2022.
> 
> So the feature patches in this thread are in my mind now targeting 
> PG15+1.  But the preparation work (up to v5-0005, and some other number 
> parsing refactoring that I'm seeing) could be considered for PG15.
> 
> I'll move this to the next CF and come back with an updated patch set in 
> a little while.
> 
>

Attachment

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

13 January 2022, 13:42:42

Another modest update, because of the copyright year update preventing 
the previous patches from applying cleanly.

I also did a bit of work on the ecpg scanner so that it also handles 
some errors on par with the main scanner.

There is still no automated testing of this in ecpg, but I have a bunch 
of single-line test files that can provoke various errors.  I will keep 
these around and maybe put them into something more formal in the future.


On 30.12.21 10:43, Peter Eisentraut wrote:
> There has been some other refactoring going on, which made this patch 
> set out of date.  So here is an update.
> 
> The old pg_strtouint64() has been removed, so there is no longer a 
> naming concern with patch 0001.  That one should be good to go.
> 
> I also found that yet another way to parse integers in pg_atoi() has 
> mostly faded away in utility, so I removed the last two callers and 
> removed the function in 0002 and 0003.
> 
> The remaining patches are as before, with some of the review comments 
> applied.  I still need to write some lexing unit tests for ecpg, which I 
> haven't gotten to yet.  This affects patches 0004 and 0005.
> 
> As mentioned before, patches 0006 and 0007 are more feature previews at 
> this point.
> 
> 
> On 01.12.21 16:47, Peter Eisentraut wrote:
>> On 25.11.21 18:51, John Naylor wrote:
>>> If we're going to change the comment anyway, "the parser" sounds more 
>>> natural. Aside from that, 0001 and 0002 can probably be pushed now, 
>>> if you like.
>>
>> done
>>
>>> --- a/src/interfaces/ecpg/preproc/pgc.l
>>> +++ b/src/interfaces/ecpg/preproc/pgc.l
>>> @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
>>>   realfail1 ({integer}|{decimal})[Ee]
>>>   realfail2 ({integer}|{decimal})[Ee][-+]
>>>
>>> +integer_junk {integer}{ident_start}
>>> +decimal_junk {decimal}{ident_start}
>>> +real_junk {real}{ident_start}
>>>
>>> A comment might be good here to explain these are only in ECPG for 
>>> consistency with the other scanners. Not really important, though.
>>
>> Yeah, it's a bit weird that not all the symbols are used in ecpg.  
>> I'll look into explaining this better.
>>
>>> 0006
>>>
>>> +{hexfail} {
>>> + yyerror("invalid hexadecimal integer");
>>> + }
>>> +{octfail} {
>>> + yyerror("invalid octal integer");
>>>    }
>>> -{decimal} {
>>> +{binfail} {
>>> + yyerror("invalid binary integer");
>>> + }
>>>
>>> It seems these could use SET_YYLLOC(), since the error cursor doesn't 
>>> match other failure states:
>>
>> ok
>>
>>> We might consider some tests for ECPG since lack of coverage has been 
>>> a problem.
>>
>> right
>>
>>> Also, I'm curious: how does the spec work as far as deciding the year 
>>> of release, or feature-freezing of new items?
>>
>> The schedule has recently been extended again, so the current plan is 
>> for SQL:202x with x=3, with feature freeze in mid-2022.
>>
>> So the feature patches in this thread are in my mind now targeting 
>> PG15+1.  But the preparation work (up to v5-0005, and some other 
>> number parsing refactoring that I'm seeing) could be considered for PG15.
>>
>> I'll move this to the next CF and come back with an updated patch set 
>> in a little while.
>>
>>

Attachment

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

24 January 2022, 08:41:20

Rebased patch set

On 13.01.22 14:42, Peter Eisentraut wrote:
> Another modest update, because of the copyright year update preventing 
> the previous patches from applying cleanly.
> 
> I also did a bit of work on the ecpg scanner so that it also handles 
> some errors on par with the main scanner.
> 
> There is still no automated testing of this in ecpg, but I have a bunch 
> of single-line test files that can provoke various errors.  I will keep 
> these around and maybe put them into something more formal in the future.
> 
> 
> On 30.12.21 10:43, Peter Eisentraut wrote:
>> There has been some other refactoring going on, which made this patch 
>> set out of date.  So here is an update.
>>
>> The old pg_strtouint64() has been removed, so there is no longer a 
>> naming concern with patch 0001.  That one should be good to go.
>>
>> I also found that yet another way to parse integers in pg_atoi() has 
>> mostly faded away in utility, so I removed the last two callers and 
>> removed the function in 0002 and 0003.
>>
>> The remaining patches are as before, with some of the review comments 
>> applied.  I still need to write some lexing unit tests for ecpg, which 
>> I haven't gotten to yet.  This affects patches 0004 and 0005.
>>
>> As mentioned before, patches 0006 and 0007 are more feature previews 
>> at this point.
>>
>>
>> On 01.12.21 16:47, Peter Eisentraut wrote:
>>> On 25.11.21 18:51, John Naylor wrote:
>>>> If we're going to change the comment anyway, "the parser" sounds 
>>>> more natural. Aside from that, 0001 and 0002 can probably be pushed 
>>>> now, if you like.
>>>
>>> done
>>>
>>>> --- a/src/interfaces/ecpg/preproc/pgc.l
>>>> +++ b/src/interfaces/ecpg/preproc/pgc.l
>>>> @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
>>>>   realfail1 ({integer}|{decimal})[Ee]
>>>>   realfail2 ({integer}|{decimal})[Ee][-+]
>>>>
>>>> +integer_junk {integer}{ident_start}
>>>> +decimal_junk {decimal}{ident_start}
>>>> +real_junk {real}{ident_start}
>>>>
>>>> A comment might be good here to explain these are only in ECPG for 
>>>> consistency with the other scanners. Not really important, though.
>>>
>>> Yeah, it's a bit weird that not all the symbols are used in ecpg. 
>>> I'll look into explaining this better.
>>>
>>>> 0006
>>>>
>>>> +{hexfail} {
>>>> + yyerror("invalid hexadecimal integer");
>>>> + }
>>>> +{octfail} {
>>>> + yyerror("invalid octal integer");
>>>>    }
>>>> -{decimal} {
>>>> +{binfail} {
>>>> + yyerror("invalid binary integer");
>>>> + }
>>>>
>>>> It seems these could use SET_YYLLOC(), since the error cursor 
>>>> doesn't match other failure states:
>>>
>>> ok
>>>
>>>> We might consider some tests for ECPG since lack of coverage has 
>>>> been a problem.
>>>
>>> right
>>>
>>>> Also, I'm curious: how does the spec work as far as deciding the 
>>>> year of release, or feature-freezing of new items?
>>>
>>> The schedule has recently been extended again, so the current plan is 
>>> for SQL:202x with x=3, with feature freeze in mid-2022.
>>>
>>> So the feature patches in this thread are in my mind now targeting 
>>> PG15+1.  But the preparation work (up to v5-0005, and some other 
>>> number parsing refactoring that I'm seeing) could be considered for 
>>> PG15.
>>>
>>> I'll move this to the next CF and come back with an updated patch set 
>>> in a little while.
>>>
>>>

Attachment

Re: Non-decimal integer literals

From

Robert Haas

Date:

24 January 2022, 18:53:16

On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> Rebased patch set

What if someone finds this new behavior too permissive?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

25 January 2022, 10:34:17

On 24.01.22 19:53, Robert Haas wrote:
> On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> Rebased patch set
> 
> What if someone finds this new behavior too permissive?

Which part exactly?  There are several different changes proposed here.

Re: Non-decimal integer literals

From

Robert Haas

Date:

25 January 2022, 18:15:43

On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> On 24.01.22 19:53, Robert Haas wrote:
> > On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut
> > <peter.eisentraut@enterprisedb.com> wrote:
> >> Rebased patch set
> >
> > What if someone finds this new behavior too permissive?
>
> Which part exactly?  There are several different changes proposed here.

I was just going based on the description of the feature in your
original post. If someone is hoping that int4in() will accept only
^\d+$ then they will be disappointed by this patch.

Maybe nobody is hoping that, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Non-decimal integer literals

From

Alvaro Herrera

Date:

25 January 2022, 18:43:35

On 2022-Jan-24, Peter Eisentraut wrote:

> +decinteger        {decdigit}(_?{decdigit})*
> +hexinteger        0[xX](_?{hexdigit})+
> +octinteger        0[oO](_?{octdigit})+
> +bininteger        0[bB](_?{bindigit})+

I think there should be test cases for literals that these seemingly
strange expressions reject, which are a number with trailing _ (0x123_),
and one with consecutive underscores __ (0x12__34).

I like the idea of these literals.  I would have found them useful on
many occassions.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: Non-decimal integer literals

From

Tom Lane

Date:

26 January 2022, 00:02:38

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> Which part exactly?  There are several different changes proposed here.

> I was just going based on the description of the feature in your
> original post. If someone is hoping that int4in() will accept only
> ^\d+$ then they will be disappointed by this patch.

Maybe I misunderstood, but I thought this was about what you could
write as a SQL literal, not about the I/O behavior of the integer
types.  I'd be -0.1 on changing the latter.

            regards, tom lane

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

26 January 2022, 15:10:39

On 26.01.22 01:02, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut
>> <peter.eisentraut@enterprisedb.com> wrote:
>>> Which part exactly?  There are several different changes proposed here.
> 
>> I was just going based on the description of the feature in your
>> original post. If someone is hoping that int4in() will accept only
>> ^\d+$ then they will be disappointed by this patch.
> 
> Maybe I misunderstood, but I thought this was about what you could
> write as a SQL literal, not about the I/O behavior of the integer
> types.  I'd be -0.1 on changing the latter.

I think it would be strange if I/O routines would accept a different 
syntax from the literals.  Also, the behavior of a cast from string/text 
to a numeric type is usually defined in terms of what the literal syntax 
is, so they need to be aligned.

Re: Non-decimal integer literals

From

Andrew Dunstan

Date:

26 January 2022, 16:19:15

On 1/25/22 13:43, Alvaro Herrera wrote:
> On 2022-Jan-24, Peter Eisentraut wrote:
>
>> +decinteger        {decdigit}(_?{decdigit})*
>> +hexinteger        0[xX](_?{hexdigit})+
>> +octinteger        0[oO](_?{octdigit})+
>> +bininteger        0[bB](_?{bindigit})+
> I think there should be test cases for literals that these seemingly
> strange expressions reject, which are a number with trailing _ (0x123_),
> and one with consecutive underscores __ (0x12__34).
>
> I like the idea of these literals.  I would have found them useful on
> many occassions.


+1. I can't remember the number of times I have miscounted a long string
of digits in a literal.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Non-decimal integer literals

From

John Naylor

Date:

13 February 2022, 12:14:07

On Wed, Jan 26, 2022 at 10:10 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> [v8 patch]

0001-0004 seem pretty straightforward.

0005:

 {realfail1} {
- /*
- * throw back the [Ee], and figure out whether what
- * remains is an {integer} or {decimal}.
- */
- yyless(yyleng - 1);
  SET_YYLLOC();
- return process_integer_literal(yytext, yylval);
+ yyerror("trailing junk after numeric literal");
  }

realfail1 has been subsumed by integer_junk and decimal_junk, so that
pattern can be removed.

 <SQL>{
+/*
+ * Note that some trailing junk is valid in C (such as 100LL), so we contain
+ * this to SQL mode.
+ */

It seems Flex doesn't like C comments after the "%%", so this stanza
was indented in 0006. If these are to be committed separately, that
fix should happen here.

0006:

Minor nit -- the s/decimal/numeric/ change doesn't seem to have any
notational advantage, but it's not worse, either.

0007:

I've attached an addendum to restore the no-backtrack property.

Will the underscore syntax need treatment in the input routines as well?

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

v8-0007-addendum-restore-nobackup.txt

Re: Non-decimal integer literals

From

Christoph Berg

Date:

14 February 2022, 18:49:59

Re: Peter Eisentraut
> This adds support in the lexer as well as in the integer type input
> functions.
> 
> Those core parts are straightforward enough, but there are a bunch of other
> places where integers are parsed, and one could consider in each case
> whether they should get the same treatment, for example the replication
> syntax lexer, or input function for oid, numeric, and int2vector.

One thing I always found weird is that timeline IDs appear most
prominently as hex numbers in WAL filenames, but they are printed as
decimal in the log ("new timeline id nn"), and have to be specified as
decimal in recovery_target_timeline.

Perhaps both these could make use of 0xhex numbers as well.

Christoph

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

16 February 2022, 10:11:05

On 13.02.22 13:14, John Naylor wrote:
> On Wed, Jan 26, 2022 at 10:10 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> [v8 patch]
> 
> 0001-0004 seem pretty straightforward.

These have been committed.

> 
> 0005:
> 
>   {realfail1} {
> - /*
> - * throw back the [Ee], and figure out whether what
> - * remains is an {integer} or {decimal}.
> - */
> - yyless(yyleng - 1);
>    SET_YYLLOC();
> - return process_integer_literal(yytext, yylval);
> + yyerror("trailing junk after numeric literal");
>    }
> 
> realfail1 has been subsumed by integer_junk and decimal_junk, so that
> pattern can be removed.

Committed with that change.

I found that the JSON path lexer has the same trailing-junk issue.  I 
have researched the relevant ECMA standard and it explicitly points out 
that this is not allowed.  I will look into that separately.  I'm just 
pointing that out here because grepping for "realfail1" will still show 
a hit after this.

The remaining patches are material for PG16 at this point, and I will 
set the commit fest item to returned with feedback in the meantime.

> 0006:
> 
> Minor nit -- the s/decimal/numeric/ change doesn't seem to have any
> notational advantage, but it's not worse, either.

I did that because with the addition of non-decimal literals, the word 
"decimal" becomes ambiguous or misleading.  (It doesn't mean "uses 
decimal digits" but "has a decimal point".)  (Of course, "numeric" isn't 
entirely free of ambiguity, but there are only so many words available 
in this space. ;-) )

> 0007:
> 
> I've attached an addendum to restore the no-backtrack property.

Thanks, that is helpful.

> Will the underscore syntax need treatment in the input routines as well?

Yeah, various additional work is required for this.

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

10 October 2022, 14:17:23

On 16.02.22 11:11, Peter Eisentraut wrote:
> The remaining patches are material for PG16 at this point, and I will 
> set the commit fest item to returned with feedback in the meantime.

Time to continue with this.

Attached is a rebased and cleaned up patch for non-decimal integer 
literals.  (I don't include the underscores-in-numeric literals patch. 
I'm keeping that for later.)

Two open issues from my notes:

Technically, numeric_in() should be made aware of this, but that seems 
relatively complicated and maybe not necessary for the first iteration.

Taking another look around ecpg to see how this interacts with C-syntax 
integer literals.  I'm not aware of any particular issues, but it's 
understandably tricky.

Other than that, this seems pretty complete as a start.

Attachment

v9-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

Junwang Zhao

Date:

11 October 2022, 03:29:35

Hi Peter,

  /* process digits */
+ if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
+ {
+ ptr += 2;
+ while (*ptr && isxdigit((unsigned char) *ptr))
+ {
+ int8 digit = hexlookup[(unsigned char) *ptr];
+
+ if (unlikely(pg_mul_s16_overflow(tmp, 16, &tmp)) ||
+ unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
+
+ ptr++;
+ }
+ }
+ else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+ {
+ ptr += 2;
+
+ while (*ptr && (*ptr >= '0' && *ptr <= '7'))
+ {
+ int8 digit = (*ptr++ - '0');
+
+ if (unlikely(pg_mul_s16_overflow(tmp, 8, &tmp)) ||
+ unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
+ }
+ }
+ else if (ptr[0] == '0' && (ptr[1] == 'b' || ptr[1] == 'B'))
+ {
+ ptr += 2;
+
+ while (*ptr && (*ptr >= '0' && *ptr <= '1'))
+ {
+ int8 digit = (*ptr++ - '0');
+
+ if (unlikely(pg_mul_s16_overflow(tmp, 2, &tmp)) ||
+ unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
+ }
+ }
+ else
+ {
  while (*ptr && isdigit((unsigned char) *ptr))
  {
  int8 digit = (*ptr++ - '0');
@@ -128,6 +181,7 @@ pg_strtoint16(const char *s)
  unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
  goto out_of_range;
  }
+ }

What do you think if we move these code into a static inline function? like:

static inline char*
process_digits(char *ptr, int32 *result)
{
...
}

On Mon, Oct 10, 2022 at 10:17 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 16.02.22 11:11, Peter Eisentraut wrote:
> > The remaining patches are material for PG16 at this point, and I will
> > set the commit fest item to returned with feedback in the meantime.
>
> Time to continue with this.
>
> Attached is a rebased and cleaned up patch for non-decimal integer
> literals.  (I don't include the underscores-in-numeric literals patch.
> I'm keeping that for later.)
>
> Two open issues from my notes:
>
> Technically, numeric_in() should be made aware of this, but that seems
> relatively complicated and maybe not necessary for the first iteration.
>
> Taking another look around ecpg to see how this interacts with C-syntax
> integer literals.  I'm not aware of any particular issues, but it's
> understandably tricky.
>
> Other than that, this seems pretty complete as a start.



-- 
Regards
Junwang Zhao

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

11 October 2022, 08:59:30

On 11.10.22 05:29, Junwang Zhao wrote:
> What do you think if we move these code into a static inline function? like:
> 
> static inline char*
> process_digits(char *ptr, int32 *result)
> {
> ...
> }

How would you handle the different ways each branch checks for valid 
digits and computes the value of each digit?

Re: Non-decimal integer literals

From

Junwang Zhao

Date:

11 October 2022, 10:16:08

On Tue, Oct 11, 2022 at 4:59 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 11.10.22 05:29, Junwang Zhao wrote:
> > What do you think if we move these code into a static inline function? like:
> >
> > static inline char*
> > process_digits(char *ptr, int32 *result)
> > {
> > ...
> > }
>
> How would you handle the different ways each branch checks for valid
> digits and computes the value of each digit?
>

Didn't notice that, sorry for the noise ;(


-- 
Regards
Junwang Zhao

Re: Non-decimal integer literals

From

John Naylor

Date:

14 November 2022, 07:25:30

On Mon, Oct 10, 2022 at 9:17 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

> Taking another look around ecpg to see how this interacts with C-syntax
> integer literals. I'm not aware of any particular issues, but it's
> understandably tricky.

I can find no discussion in the archives about the commit that added "xch":

commit 6fb3c3f78fbb2296894424f6e3183d339915eac7
Author: Michael Meskes <meskes@postgresql.org>
Date: Fri Oct 15 19:02:08 1999 +0000

*** empty log message ***

...and I can't think of why bounds checking a C literal was done like this.

Regarding the patch, it looks good overall. My only suggestion would be to add a regression test for just below and just above overflow, at least for int2.

Minor nits:

- * Process {integer}. Note this will also do the right thing with {decimal},
+ * Process {*integer}. Note this will also do the right thing with {numeric},

I scratched my head for a while, thinking this was Flex syntax, until I realized my brain was supposed to do shell-globbing first, at which point I could see it was referring to multiple Flex rules. I'd try to rephrase.

+T661 Non-decimal integer literals YES SQL:202x draft

Is there an ETA yet?

Also, it's not this patch's job to do it, but there are at least a half dozen places that open-code turning a hex char into a number. It might be a good easy "todo item" to unify that.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

15 November 2022, 10:31:03

On 14.11.22 08:25, John Naylor wrote:
> Regarding the patch, it looks good overall. My only suggestion would be 
> to add a regression test for just below and just above overflow, at 
> least for int2.

ok

> Minor nits:
> 
> - * Process {integer}.  Note this will also do the right thing with 
> {decimal},
> + * Process {*integer}.  Note this will also do the right thing with 
> {numeric},
> 
> I scratched my head for a while, thinking this was Flex syntax, until I 
> realized my brain was supposed to do shell-globbing first, at which 
> point I could see it was referring to multiple Flex rules. I'd try to 
> rephrase.

ok

> +T661 Non-decimal integer literals YES SQL:202x draft
> 
> Is there an ETA yet?

March 2023

> Also, it's not this patch's job to do it, but there are at least a half 
> dozen places that open-code turning a hex char into a number. It might 
> be a good easy "todo item" to unify that.

right

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

22 November 2022, 13:36:45

On 15.11.22 11:31, Peter Eisentraut wrote:
> On 14.11.22 08:25, John Naylor wrote:
>> Regarding the patch, it looks good overall. My only suggestion would 
>> be to add a regression test for just below and just above overflow, at 
>> least for int2.
> 
> ok

This was a valuable suggestion, because this found some breakage.  In 
particular, the handling of grammar-level literals that overflow to 
"Float" was not correct.  (The radix prefix was simply stripped and 
forgotten.)  So I added a bunch more tests for this.  Here is a new patch.

Attachment

v10-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

John Naylor

Date:

23 November 2022, 04:35:17

On Tue, Nov 22, 2022 at 8:36 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 15.11.22 11:31, Peter Eisentraut wrote:
> > On 14.11.22 08:25, John Naylor wrote:
> >> Regarding the patch, it looks good overall. My only suggestion would
> >> be to add a regression test for just below and just above overflow, at
> >> least for int2.
> >
> > ok
>
> This was a valuable suggestion, because this found some breakage. In
> particular, the handling of grammar-level literals that overflow to
> "Float" was not correct. (The radix prefix was simply stripped and
> forgotten.) So I added a bunch more tests for this. Here is a new patch.

Looks good to me.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Non-decimal integer literals

From

David Rowley

Date:

23 November 2022, 08:54:25

On Wed, 23 Nov 2022 at 02:37, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> Here is a new patch.

This looks like quite an inefficient way to convert a hex string into an int64:

        while (*ptr && isxdigit((unsigned char) *ptr))
        {
            int8        digit = hexlookup[(unsigned char) *ptr];

            if (unlikely(pg_mul_s64_overflow(tmp, 16, &tmp)) ||
                unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
                goto out_of_range;

            ptr++;
        }

I wonder if you'd be better off with something like:

        while (*ptr && isxdigit((unsigned char) *ptr))
        {
            if (unlikely(tmp & UINT64CONST(0xF000000000000000)))
                goto out_of_range;

            tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++];
        }

Going by [1], clang will actually use multiplication by 16 to
implement the former. gcc is better and shifts left by 4, so likely
won't improve things for gcc.  It seems worth doing it this way for
anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.

David

[1] https://godbolt.org/z/jz6Th6jnM

Re: Non-decimal integer literals

From

David Rowley

Date:

23 November 2022, 08:56:15

On Wed, 23 Nov 2022 at 21:54, David Rowley <dgrowleyml@gmail.com> wrote:
> I wonder if you'd be better off with something like:
>
>         while (*ptr && isxdigit((unsigned char) *ptr))
>         {
>             if (unlikely(tmp & UINT64CONST(0xF000000000000000)))
>                 goto out_of_range;
>
>             tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++];
>         }

Here's a delta diff with it changed to work that way.

David

Attachment

more_efficient_hex_oct_and_binary_processing.diff

Re: Non-decimal integer literals

From

John Naylor

Date:

23 November 2022, 09:19:27

On Wed, Nov 23, 2022 at 3:54 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Going by [1], clang will actually use multiplication by 16 to
> implement the former. gcc is better and shifts left by 4, so likely
> won't improve things for gcc. It seems worth doing it this way for
> anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.

FWIW, gcc 12.2 generates an imul on my system when compiling in situ. I've found it useful to run godbolt locally* and load the entire PG file (nicer to read than plain objdump) -- compilers can make different decisions when going from isolated snippets to within full functions.

* clone from https://github.com/compiler-explorer/compiler-explorer

install npm 16

run "make" and when finished will show the localhost url

add the right flags, which in this case was

-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2 -I/path/to/srcdir/src/include -I/path/to/builddir/src/include -D_GNU_SOURCE

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Non-decimal integer literals

From

Dean Rasheed

Date:

23 November 2022, 16:25:50

On Tue, 22 Nov 2022 at 13:37, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 15.11.22 11:31, Peter Eisentraut wrote:
> > On 14.11.22 08:25, John Naylor wrote:
> >> Regarding the patch, it looks good overall. My only suggestion would
> >> be to add a regression test for just below and just above overflow, at
> >> least for int2.
> >
> This was a valuable suggestion, because this found some breakage.  In
> particular, the handling of grammar-level literals that overflow to
> "Float" was not correct.  (The radix prefix was simply stripped and
> forgotten.)  So I added a bunch more tests for this.  Here is a new patch.

Taking a quick look, I noticed that there are no tests for negative
values handled in the parser.

Giving that a spin shows that make_const() fails to correctly identify
the base of negative non-decimal integers in the T_Float case, causing
it to fall through to numeric_in() and fail:

SELECT -0x80000000;

ERROR:  invalid input syntax for type numeric: "-0x80000000"
               ^
Regards,
Dean

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

24 November 2022, 08:35:08

On 23.11.22 09:54, David Rowley wrote:
> On Wed, 23 Nov 2022 at 02:37, Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> Here is a new patch.
> 
> This looks like quite an inefficient way to convert a hex string into an int64:
> 
>          while (*ptr && isxdigit((unsigned char) *ptr))
>          {
>              int8        digit = hexlookup[(unsigned char) *ptr];
> 
>              if (unlikely(pg_mul_s64_overflow(tmp, 16, &tmp)) ||
>                  unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
>                  goto out_of_range;
> 
>              ptr++;
>          }
> 
> I wonder if you'd be better off with something like:
> 
>          while (*ptr && isxdigit((unsigned char) *ptr))
>          {
>              if (unlikely(tmp & UINT64CONST(0xF000000000000000)))
>                  goto out_of_range;
> 
>              tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++];
>          }
> 
> Going by [1], clang will actually use multiplication by 16 to
> implement the former. gcc is better and shifts left by 4, so likely
> won't improve things for gcc.  It seems worth doing it this way for
> anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.

My code follows the style used for parsing the decimal integers. 
Keeping that consistent is valuable I think.  I think the proposed 
change makes the code significantly harder to understand.  Also, what 
you are suggesting here would amount to an attempt to make parsing 
hexadecimal integers even faster than parsing decimal integers.  Is that 
useful?

Re: Non-decimal integer literals

From

David Rowley

Date:

24 November 2022, 09:13:39

On Thu, 24 Nov 2022 at 21:35, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> My code follows the style used for parsing the decimal integers.
> Keeping that consistent is valuable I think.  I think the proposed
> change makes the code significantly harder to understand.  Also, what
> you are suggesting here would amount to an attempt to make parsing
> hexadecimal integers even faster than parsing decimal integers.  Is that
> useful?

Isn't it being faster one of the major use cases for this feature?   I
remember many years ago and several jobs ago when working with SQL
Server being able to speed up importing data using hexadecimal
DATETIMEs. I can't think why else you might want to represent a
DATETIME as a hexstring, so I assumed this was a large part of the use
case for INTs in PostgreSQL. Are you telling me that better
performance is not something anyone will want out of this feature?

David

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

25 November 2022, 16:13:29

On 24.11.22 10:13, David Rowley wrote:
> On Thu, 24 Nov 2022 at 21:35, Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> My code follows the style used for parsing the decimal integers.
>> Keeping that consistent is valuable I think.  I think the proposed
>> change makes the code significantly harder to understand.  Also, what
>> you are suggesting here would amount to an attempt to make parsing
>> hexadecimal integers even faster than parsing decimal integers.  Is that
>> useful?
> 
> Isn't it being faster one of the major use cases for this feature?

Never thought about it that way.

> I
> remember many years ago and several jobs ago when working with SQL
> Server being able to speed up importing data using hexadecimal
> DATETIMEs. I can't think why else you might want to represent a
> DATETIME as a hexstring, so I assumed this was a large part of the use
> case for INTs in PostgreSQL. Are you telling me that better
> performance is not something anyone will want out of this feature?

This isn't about datetimes but about integers.

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

28 November 2022, 13:59:52

On 23.11.22 17:25, Dean Rasheed wrote:
> Taking a quick look, I noticed that there are no tests for negative
> values handled in the parser.
> 
> Giving that a spin shows that make_const() fails to correctly identify
> the base of negative non-decimal integers in the T_Float case, causing
> it to fall through to numeric_in() and fail:

Fixed in new patch.

Attachment

v11-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

David Rowley

Date:

28 November 2022, 23:43:33

On Sat, 26 Nov 2022 at 05:13, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 24.11.22 10:13, David Rowley wrote:
> > I
> > remember many years ago and several jobs ago when working with SQL
> > Server being able to speed up importing data using hexadecimal
> > DATETIMEs. I can't think why else you might want to represent a
> > DATETIME as a hexstring, so I assumed this was a large part of the use
> > case for INTs in PostgreSQL. Are you telling me that better
> > performance is not something anyone will want out of this feature?
>
> This isn't about datetimes but about integers.

I'm aware. My aim was to show that hex is commonly used as a more
efficient way of getting integer numbers in and out of computers.

Likely it's better for me to quantify this performance increase claim
with some actual performance results.

Here's master (@f0cd57f85) doing copy ab2 from '/tmp/ab.csv';

ab2 is a table with no indexes and just 2 int columns.

  16.55%  postgres          [.] CopyReadLine
   7.82%  postgres          [.] pg_strtoint32
   7.60%  postgres          [.] CopyReadAttributesText
   7.06%  postgres          [.] NextCopyFrom
   4.40%  postgres          [.] CopyFrom

The copy completes in 2512.5278 ms (average time over 10 runs)

Patching master with your v11 patch and copying in hex numbers instead
of decimal numbers shows:

  14.39%  postgres          [.] CopyReadLine
   8.60%  postgres          [.] pg_strtoint32
   6.95%  postgres          [.] NextCopyFrom
   6.79%  postgres          [.] CopyReadAttributesText
   4.81%  postgres          [.] CopyFrom

This shows that we're spending proportionally less time in
CopyReadLine() and proportionally more time in pg_strtoint32(). There
are probably two things going on there, CopyReadLine is likely faster
due to having to read fewer bytes and pg_strtoint32() is likely slower
due to additional branching and code size.

This (copy ab2 from '/tmp/abhex.csv') saw an average time of 2720.1387
ms over 10 runs.

Patching master with your v11 patch +
more_efficient_hex_oct_and_binary_processing.diff

  15.68%  postgres          [.] CopyReadLine
   7.75%  postgres          [.] NextCopyFrom
   7.73%  postgres          [.] pg_strtoint32
   6.25%  postgres          [.] CopyReadAttributesText
   4.76%  postgres          [.] CopyFrom

The average time to import the hex version of the csv file comes down
to 2385.7298 ms over 10 runs.

I didn't run any tests to see how much the performance of importing
the decimal representation slowed down from the v11 patch. I assume
there will be a small performance hit due to the extra processing done
in pg_strtoint32()

David

Re: Non-decimal integer literals

From

Dean Rasheed

Date:

29 November 2022, 10:11:00

On Wed, 23 Nov 2022 at 08:56, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 23 Nov 2022 at 21:54, David Rowley <dgrowleyml@gmail.com> wrote:
> > I wonder if you'd be better off with something like:
> >
> >         while (*ptr && isxdigit((unsigned char) *ptr))
> >         {
> >             if (unlikely(tmp & UINT64CONST(0xF000000000000000)))
> >                 goto out_of_range;
> >
> >             tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++];
> >         }
>
> Here's a delta diff with it changed to work that way.
>

This isn't correct, because those functions are meant to accumulate a
negative number in "tmp".

The overflow check can't just ignore the final digit either, so I'm
not sure how much this would end up saving once those issues are
fixed.

Regards,
Dean

Re: Non-decimal integer literals

From

David Rowley

Date:

29 November 2022, 20:16:55

On Tue, 29 Nov 2022 at 23:11, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> On Wed, 23 Nov 2022 at 08:56, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Wed, 23 Nov 2022 at 21:54, David Rowley <dgrowleyml@gmail.com> wrote:
> > > I wonder if you'd be better off with something like:
> > >
> > >         while (*ptr && isxdigit((unsigned char) *ptr))
> > >         {
> > >             if (unlikely(tmp & UINT64CONST(0xF000000000000000)))
> > >                 goto out_of_range;
> > >
> > >             tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++];
> > >         }
> >
> > Here's a delta diff with it changed to work that way.
> >
>
> This isn't correct, because those functions are meant to accumulate a
> negative number in "tmp".

Looks like I didn't quite look at that code closely enough.

To make that work we could just form the non-decimal versions in an
unsigned integer of the given size and then check if that's become
greater than -PG_INTXX_MIN after the loop.  We'd then just need to
convert it back to its negative form.

i.e:

uint64 tmp2 = 0;
ptr += 2;
while (*ptr && isxdigit((unsigned char) *ptr))
{
    if (unlikely(tmp2 & UINT64CONST(0xF000000000000000)))
        goto out_of_range;

    tmp2 = (tmp2 << 4) | hexlookup[(unsigned char) *ptr++];
}

if (tmp2 > -PG_INT64_MIN)
    goto out_of_range;
tmp = -((int64) tmp2);

David

Re: Non-decimal integer literals

From

David Rowley

Date:

29 November 2022, 20:22:25

On Tue, 29 Nov 2022 at 03:00, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> Fixed in new patch.

There seems to be a small bug in the pg_strtointXX functions in the
code that checks that there's at least 1 digit.  This causes 0x to be
a valid representation of zero.  That does not seem to be allowed by
the parser, so I think we should likely reject it in COPY too.

-- Does not work.
postgres=# select 0x + 1;
ERROR:  invalid hexadecimal integer at or near "0x"
LINE 1: select 0x + 1;

postgres=# create table a (a int);
CREATE TABLE

-- probably shouldn't work
postgres=# copy a from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 0x
>> \.
COPY 1

David

Re: Non-decimal integer literals

From

David Rowley

Date:

30 November 2022, 05:50:12

On Wed, 23 Nov 2022 at 22:19, John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Nov 23, 2022 at 3:54 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > Going by [1], clang will actually use multiplication by 16 to
> > implement the former. gcc is better and shifts left by 4, so likely
> > won't improve things for gcc.  It seems worth doing it this way for
> > anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.
>
> FWIW, gcc 12.2 generates an imul on my system when compiling in situ.

I spent a bit more time trying to figure out why the compiler does
imul instead of bit shifting and it just seems to be down to a
combination of signed-ness plus the overflow check. See [1]. Neither
of the two compilers I tested could use bit shifting with a signed
type when overflow checking is done, which is what we're doing in the
new code.

In clang 15, multiplication is done in both smultiply16 and
umultiply16. These both check for overflow. The versions without the
overflow checks both use bit shifting. With GCC, only smultiply16 does
multiplication. The other 3 variants all use bit shifting.

David

[1] https://godbolt.org/z/EG9jKMjq5

Re: Non-decimal integer literals

From

Dean Rasheed

Date:

30 November 2022, 11:34:07

On Wed, 30 Nov 2022 at 05:50, David Rowley <dgrowleyml@gmail.com> wrote:
>
> I spent a bit more time trying to figure out why the compiler does
> imul instead of bit shifting and it just seems to be down to a
> combination of signed-ness plus the overflow check. See [1]. Neither
> of the two compilers I tested could use bit shifting with a signed
> type when overflow checking is done, which is what we're doing in the
> new code.
>

Ah, interesting. That makes me think that it might be possible to get
some performance gains for all bases (including 10) by separating the
overflow check from the multiplication, and giving the compiler the
best chance to decide on the optimal way to do the multiplication. For
example, on my Intel box, GCC prefers a pair of LEA instructions over
an IMUL, to multiply by 10.

I like your previous idea of using an unsigned integer for the
accumulator, because then the overflow check in the loop doesn't need
to be exact, as long as an exact check is done later. That way, there
are fewer conditional branches in the loop, and the possibility for
the compiler to choose the fastest multiplication method. So something
like:

    // Accumulate positive value using unsigned int, with approximate
    // overflow check. If acc >= 1 - INT_MIN / 10, then acc * 10 is
    // sure to exceed -INT_MIN.
    unsigned int cutoff = 1 - INT_MIN / 10;
    unsigned int acc = 0;

    while (*ptr && isdigit((unsigned char) *ptr))
    {
        if (unlikely(acc >= cutoff))
            goto out_of_range;
        acc = acc * 10 + (*ptr - '0');
        ptr++;
    }

and similar for other bases, allowing the coding for all bases to be
kept similar.

I think it's probably best to consider this as a follow-on patch
though. It shouldn't delay getting the main feature committed.

Regards,
Dean

Re: Non-decimal integer literals

From

David Rowley

Date:

30 November 2022, 23:11:56

On Thu, 1 Dec 2022 at 00:34, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> So something
> like:
>
>     // Accumulate positive value using unsigned int, with approximate
>     // overflow check. If acc >= 1 - INT_MIN / 10, then acc * 10 is
>     // sure to exceed -INT_MIN.
>     unsigned int cutoff = 1 - INT_MIN / 10;
>     unsigned int acc = 0;
>
>     while (*ptr && isdigit((unsigned char) *ptr))
>     {
>         if (unlikely(acc >= cutoff))
>             goto out_of_range;
>         acc = acc * 10 + (*ptr - '0');
>         ptr++;
>     }
>
> and similar for other bases, allowing the coding for all bases to be
> kept similar.

Seems like a good idea to me. Couldn't the cutoff check just be "acc >
INT_MAX / 10"?

> I think it's probably best to consider this as a follow-on patch
> though. It shouldn't delay getting the main feature committed.

I agree that it should be a separate patch.  But thinking about what
Tom mentioned in [1], I had in mind this patch would need to wait
until the new standard is out so that we have a more genuine reason
for breaking existing queries.

I've drafted up a full patch for improving the current base-10 code,
so I'll go post that on another thread.

David

[1] https://postgr.es/m/3260805.1631106874@sss.pgh.pa.us

Re: Non-decimal integer literals

From

Tom Lane

Date:

30 November 2022, 23:36:47

David Rowley <dgrowleyml@gmail.com> writes:
> I agree that it should be a separate patch.  But thinking about what
> Tom mentioned in [1], I had in mind this patch would need to wait
> until the new standard is out so that we have a more genuine reason
> for breaking existing queries.

Well, we already broke them in v15: that example now gives

regression=# select 0x42e;
ERROR:  trailing junk after numeric literal at or near "0x"
LINE 1: select 0x42e;
               ^

So there's probably no compatibility reason not to drop the
other shoe.

            regards, tom lane

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

08 December 2022, 11:16:29

On 29.11.22 21:22, David Rowley wrote:
> There seems to be a small bug in the pg_strtointXX functions in the
> code that checks that there's at least 1 digit.  This causes 0x to be
> a valid representation of zero.  That does not seem to be allowed by
> the parser, so I think we should likely reject it in COPY too.
> -- probably shouldn't work
> postgres=# copy a from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
>>> 0x
>>> \.
> COPY 1

Fixed in new patch.  I moved the "require at least one digit" checks 
after the loops over the digits, to make it easier to write one check 
for all bases.

This patch is also incorporates your changes to the digit analysis 
algorithm.  I didn't check it carefully, but all the tests still pass. ;-)

Attachment

v12-0001-Non-decimal-integer-literals.patch

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

14 December 2022, 05:47:28

On 08.12.22 12:16, Peter Eisentraut wrote:
> On 29.11.22 21:22, David Rowley wrote:
>> There seems to be a small bug in the pg_strtointXX functions in the
>> code that checks that there's at least 1 digit.  This causes 0x to be
>> a valid representation of zero.  That does not seem to be allowed by
>> the parser, so I think we should likely reject it in COPY too.
>> -- probably shouldn't work
>> postgres=# copy a from stdin;
>> Enter data to be copied followed by a newline.
>> End with a backslash and a period on a line by itself, or an EOF signal.
>>>> 0x
>>>> \.
>> COPY 1
> 
> Fixed in new patch.  I moved the "require at least one digit" checks 
> after the loops over the digits, to make it easier to write one check 
> for all bases.
> 
> This patch is also incorporates your changes to the digit analysis 
> algorithm.  I didn't check it carefully, but all the tests still pass. ;-)

committed

Re: Non-decimal integer literals

From

Dean Rasheed

Date:

13 January 2023, 10:01:15

On Wed, 14 Dec 2022 at 05:47, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> committed

Now that we have this for integer types, I think it's worth doing for
numeric as well, since the parser will now pass such things through to
numeric_in() when they don't fit in an int64, and it seems plausible
that at least some people might use non-decimal integers beyond
INT64MIN/MAX. Also, without such support in numeric_in(), the feature
looks a little incomplete:

SELECT -0x8000000000000000;
       ?column?
----------------------
 -9223372036854775808
(1 row)

SELECT 0x8000000000000000;
ERROR:  invalid input syntax for type numeric: "0x8000000000000000"
LINE 1: select 0x8000000000000000;
               ^

One concern I had was what the performance would be like. I don't
really expect people to pass in the kinds of truly huge values that
numeric supports, but it can't be ruled out. So I gave it a go, to see
how hard it would be, and what the worst-case performance looks like.
(I included underscore-handling too, so that I could measure that at
the same time.)

The base-conversion algorithm is O(N^2), and the worst case before
overflow is with hex strings with around 108,000 digits, oct strings
with around 145,000 digits, or binary strings with around 435,000
digits. Each of those takes around 400ms to parse on my machine.
That's around the level at which I might consider adding
CHECK_FOR_INTERRUPTS()'s, but I think that it's probably not worth it,
given how unrealistic such huge inputs are in practice.

The other important thing is that this shouldn't impact the
performance when parsing regular decimal inputs. The bulk of the
non-decimal integer parsing is handled by a separate function, which
is called directly from numeric_in(), since non-decimal handling isn't
required at the set_var_from_str() level (used by the float4/8 ->
numeric conversion functions). I also re-arranged the numeric_in()
code somewhat, and was able to make substantial savings by reducing
the number of pg_strncasecmp() calls, and avoiding those calls
entirely for regular numbers that aren't NaN or Inf. Testing that with
COPY with a few million numbers of different sizes, I observed a
10-15% performance increase.

So I'm feeling quite good about the end result -- I set out hoping not
to make performance noticeably worse, but ended up making it
significantly better.

Regards,
Dean

Attachment

Re: Non-decimal integer literals

From

Peter Eisentraut

Date:

23 January 2023, 15:55:17

On 13.01.23 11:01, Dean Rasheed wrote:
> So I'm feeling quite good about the end result -- I set out hoping not
> to make performance noticeably worse, but ended up making it
> significantly better.

This is great!  How do you want to proceed?  You also posted an updated 
patch in the "underscores" thread and suggested some additional work 
there.  In which order should these be addressed, in your opinion?

Re: Non-decimal integer literals

From

Dean Rasheed

Date:

23 January 2023, 16:22:35

On Mon, 23 Jan 2023 at 15:55, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 13.01.23 11:01, Dean Rasheed wrote:
> > So I'm feeling quite good about the end result -- I set out hoping not
> > to make performance noticeably worse, but ended up making it
> > significantly better.
>
> This is great!  How do you want to proceed?  You also posted an updated
> patch in the "underscores" thread and suggested some additional work
> there.  In which order should these be addressed, in your opinion?
>

I think it makes most sense if I push 0001 now, and then merge 0002
into the underscores patch. I think at least one of the suggested
changes to the underscores patch required 0002 to work.

Regards,
Dean

Re: Non-decimal integer literals

From

"Joel Jacobson"

Date:

23 January 2023, 20:00:14

On Fri, Jan 13, 2023, at 07:01, Dean Rasheed wrote:
> Attachments:
> * 0001-Add-non-decimal-integer-support-to-type-numeric.patch

Nice! This also simplifies when dealing with non-negative integers represented as byte arrays,
common in e.g. cryptography code.

Before, one had to implement numeric_from_bytes(bytea) in plpgsql [1],
which can now be greatly simplified:

create function numeric_from_bytes(bytea) returns numeric language sql as $$
select ('0'||right($1::text,-1))::numeric
$$;

\timing
select numeric_from_bytes(('\x'||repeat('0123456789abcdef',1000))::bytea);
Time: 484.223 ms -- HEAD + plpgsql numeric_from_bytes()
Time: 19.790 ms -- 0001 + simplified numeric_from_bytes()

About 25x faster!

Would we want a built-in function for this?
To avoid the text casts, but also to improve user-friendliness,
since the improved solution is still a hack a user needing it has to someone come up with or find.
The topic "Convert hex in text representation to decimal number" is an old one on Stackoverflow [2],
posted 11 years ago, with a myriad of various hackis solutions, out of which one had a bug that I reported.
Many other modern languages seems to have this as a built-in or in stdlibs:
Python3:
classmethod int.from_bytes(bytes, byteorder='big', *, signed=False)
Rust:
pub const fn from_be_bytes(bytes: [u8; 8]) -> u64

/Joel

[1] https://gist.github.com/joelonsql/f54552db1f0fd6d9b3397d255e51f58a
[2] https://stackoverflow.com/questions/8316164/convert-hex-in-text-representation-to-decimal-number

Re: Non-decimal integer literals

From

Dean Rasheed

Date:

23 January 2023, 21:20:45

On Mon, 23 Jan 2023 at 20:00, Joel Jacobson <joel@compiler.org> wrote:
>
> Nice! This also simplifies when dealing with non-negative integers represented as byte arrays,
> common in e.g. cryptography code.
>

Ah, interesting. I hadn't thought of that use-case.

> create function numeric_from_bytes(bytea) returns numeric language sql as $$
> select ('0'||right($1::text,-1))::numeric
> $$;
>
> Would we want a built-in function for this?

Not sure. It does feel a bit niche. It's quite common in other
programming languages, but that doesn't mean that a lot of Postgres
users need it. Perhaps start a new thread to gauge people's interest?

Regards,
Dean

Re: Non-decimal integer literals

From

Ranier Vilela

Date:

24 January 2023, 00:46:49

On 13.01.23 11:01, Dean Rasheed wrote:
> So I'm feeling quite good about the end result -- I set out hoping not
> to make performance noticeably worse, but ended up making it

> significantly better.

Hi Dean, thanks for your work.

But since PG_RETURN_NULL, is a simple return,

now the "value" var is not leaked?

If not, sorry for the noise.

regards,

Ranier Vilela

Attachment

avoid_leak_value_numeric.patch

Re: Non-decimal integer literals

From

Dean Rasheed

Date:

24 January 2023, 10:24:09

On Tue, 24 Jan 2023 at 00:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> On 13.01.23 11:01, Dean Rasheed wrote:
> > So I'm feeling quite good about the end result -- I set out hoping not
> > to make performance noticeably worse, but ended up making it
> > significantly better.
> Hi Dean, thanks for your work.
>
> But since PG_RETURN_NULL, is a simple return,
> now the "value" var is not leaked?
>

That originates from a prior commit:

ccff2d20ed Convert a few datatype input functions to use "soft" error reporting.

and see also a bunch of follow-on commits for other input functions.

It will only return NULL if the input is invalid and escontext is
non-NULL. You only identified a fraction of the cases where that would
happen. If we really cared about not leaking memory for invalid
inputs, we'd have to look at every code path using ereturn()
(including lower-level functions, and not just in numeric.c). I think
that would be a waste of time, and counterproductive -- trying to
immediately free memory for all possible invalid inputs would likely
complicate a lot of code, and slow down parsing of valid inputs.
Better to leave it until the owning memory context is freed.

Regards,
Dean

Re: Non-decimal integer literals

From

Ranier Vilela

Date:

24 January 2023, 10:34:19

Em ter., 24 de jan. de 2023 às 07:24, Dean Rasheed <dean.a.rasheed@gmail.com> escreveu:

On Tue, 24 Jan 2023 at 00:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> On 13.01.23 11:01, Dean Rasheed wrote:
> > So I'm feeling quite good about the end result -- I set out hoping not
> > to make performance noticeably worse, but ended up making it
> > significantly better.
> Hi Dean, thanks for your work.
>
> But since PG_RETURN_NULL, is a simple return,
> now the "value" var is not leaked?
>

That originates from a prior commit:

ccff2d20ed Convert a few datatype input functions to use "soft" error reporting.

and see also a bunch of follow-on commits for other input functions.

It will only return NULL if the input is invalid and escontext is
non-NULL. You only identified a fraction of the cases where that would
happen. If we really cared about not leaking memory for invalid
inputs, we'd have to look at every code path using ereturn()
(including lower-level functions, and not just in numeric.c). I think
that would be a waste of time, and counterproductive -- trying to
immediately free memory for all possible invalid inputs would likely
complicate a lot of code, and slow down parsing of valid inputs.
Better to leave it until the owning memory context is freed.

Thank you for the explanation.

regards,

Ranier Vilela