Thread: Non-decimal integer literals
Here is a patch to add support for hexadecimal, octal, and binary integer literals: 0x42E 0o112 0b100101 per SQL:202x draft. This adds support in the lexer as well as in the integer type input functions. Those core parts are straightforward enough, but there are a bunch of other places where integers are parsed, and one could consider in each case whether they should get the same treatment, for example the replication syntax lexer, or input function for oid, numeric, and int2vector. There are also some opportunities to move some code around, for example scanint8() could be in numutils.c. I have also looked with some suspicion at some details of the number lexing in ecpg, but haven't found anything I could break yet. Suggestions are welcome.
Attachment
On Mon, Aug 16, 2021 at 5:52 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> Here is a patch to add support for hexadecimal, octal, and binary
> integer literals:
>
> 0x42E
> 0o112
> 0b100101
>
> per SQL:202x draft.
>
> This adds support in the lexer as well as in the integer type input
> functions.
The one thing that jumped out at me on a cursory reading is the {integer} rule, which seems to be used nowhere except to call process_integer_literal, which must then inspect the token text to figure out what type of integer it is. Maybe consider 4 separate process_*_literal functions?
--
John Naylor
EDB: http://www.enterprisedb.com
On 16.08.21 17:32, John Naylor wrote: > The one thing that jumped out at me on a cursory reading is > the {integer} rule, which seems to be used nowhere except to > call process_integer_literal, which must then inspect the token text to > figure out what type of integer it is. Maybe consider 4 separate > process_*_literal functions? Agreed, that can be done in a simpler way. Here is an updated patch.
Attachment
On Tue, Sep 7, 2021 at 4:13 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 16.08.21 17:32, John Naylor wrote:
> The one thing that jumped out at me on a cursory reading is
> the {integer} rule, which seems to be used nowhere except to
> call process_integer_literal, which must then inspect the token text to
> figure out what type of integer it is. Maybe consider 4 separate
> process_*_literal functions?
Agreed, that can be done in a simpler way. Here is an updated patch.
Hi,
Minor comment:
+SELECT int4 '0o112';
Maybe involve digits of up to 7 in the octal test case.
Thanks
On 8/16/21 11:51 AM, Peter Eisentraut wrote: > Here is a patch to add support for hexadecimal, octal, and binary > integer literals: > > 0x42E > 0o112 > 0b100101 > > per SQL:202x draft. Is there any hope of adding the optional underscores? I see a potential problem there as SELECT 1_a; is currently parsed as SELECT 1 AS _a; when it should be parsed as SELECT 1_ AS a; or perhaps even as an error since 0x1_a would be a valid number with no alias. (The standard does not allow identifiers to begin with _ but we do...) -- Vik Fearing
Vik Fearing <vik@postgresfriends.org> writes: > On 8/16/21 11:51 AM, Peter Eisentraut wrote: >> Here is a patch to add support for hexadecimal, octal, and binary >> integer literals: >> >> 0x42E >> 0o112 >> 0b100101 >> >> per SQL:202x draft. > Is there any hope of adding the optional underscores? I see a potential > problem there as SELECT 1_a; is currently parsed as SELECT 1 AS _a; when > it should be parsed as SELECT 1_ AS a; or perhaps even as an error since > 0x1_a would be a valid number with no alias. Even without that point, this patch *is* going to break valid queries, because every one of those cases is a valid number-followed-by-identifier today, e.g. regression=# select 0x42e; x42e ------ 0 (1 row) AFAIR we've seen exactly zero field demand for this feature, so I kind of wonder why we're in such a hurry to adopt something that hasn't even made it past draft-standard status. regards, tom lane
On 9/8/21 3:14 PM, Tom Lane wrote: > Vik Fearing <vik@postgresfriends.org> writes: > >> Is there any hope of adding the optional underscores? I see a potential >> problem there as SELECT 1_a; is currently parsed as SELECT 1 AS _a; when >> it should be parsed as SELECT 1_ AS a; or perhaps even as an error since >> 0x1_a would be a valid number with no alias. > > Even without that point, this patch *is* going to break valid queries, > because every one of those cases is a valid number-followed-by-identifier > today, Ah, true that. So if this does go in, we may as well add the underscores at the same time. > AFAIR we've seen exactly zero field demand for this feature, I have often wanted something like this, even if I didn't bring it up on this list. I have had customers who have wanted this, too. My response has always been to show these exact problems to explain why it's not possible, but if it's going to be in the standard then I favor doing it. I have never really had a use for octal, but sometimes binary and hex make things much clearer. Having a grouping separator for large numbers is even more useful. > so I kind of wonder why we're in such a hurry to adopt something > that hasn't even made it past draft-standard status. I don't really see a hurry here. I am fine with waiting until the draft becomes final. -- Vik Fearing
On 07.09.21 13:50, Zhihong Yu wrote: > On 16.08.21 17:32, John Naylor wrote: > > The one thing that jumped out at me on a cursory reading is > > the {integer} rule, which seems to be used nowhere except to > > call process_integer_literal, which must then inspect the token > text to > > figure out what type of integer it is. Maybe consider 4 separate > > process_*_literal functions? > > Agreed, that can be done in a simpler way. Here is an updated patch. > > Hi, > Minor comment: > > +SELECT int4 '0o112'; > > Maybe involve digits of up to 7 in the octal test case. Good point, here is a lightly updated patch.
Attachment
On 09.09.21 16:08, Vik Fearing wrote: >> Even without that point, this patch *is* going to break valid queries, >> because every one of those cases is a valid number-followed-by-identifier >> today, > > Ah, true that. So if this does go in, we may as well add the > underscores at the same time. Yeah, looks like I'll need to look into the identifier lexing issues previously discussed. I'll attack that during the next commit fest. >> so I kind of wonder why we're in such a hurry to adopt something >> that hasn't even made it past draft-standard status. > I don't really see a hurry here. I am fine with waiting until the draft > becomes final. Right, the point is to explore this now so that it can be ready when the standard is ready.
On 28.09.21 17:30, Peter Eisentraut wrote: > On 09.09.21 16:08, Vik Fearing wrote: >>> Even without that point, this patch *is* going to break valid queries, >>> because every one of those cases is a valid >>> number-followed-by-identifier >>> today, >> >> Ah, true that. So if this does go in, we may as well add the >> underscores at the same time. > > Yeah, looks like I'll need to look into the identifier lexing issues > previously discussed. I'll attack that during the next commit fest. Here is an updated patch for this. It's the previous patch polished a bit more, and it contains changes so that numeric literals reject trailing identifier parts without whitespace in between, as discussed. Maybe I should split that into incremental patches, but for now I only have the one. I don't have a patch for the underscores in numeric literals yet. It's in progress, but not ready.
Attachment
On 01.11.21 07:09, Peter Eisentraut wrote: > Here is an updated patch for this. It's the previous patch polished a > bit more, and it contains changes so that numeric literals reject > trailing identifier parts without whitespace in between, as discussed. > Maybe I should split that into incremental patches, but for now I only > have the one. I don't have a patch for the underscores in numeric > literals yet. It's in progress, but not ready. Here is a progressed version of this work, split into more incremental patches. The first three patches are harmless code cleanups. Patch 3 has an interesting naming conflict, noted in the commit message; ideas welcome. Patches 4 and 5 handle the rejection of trailing junk after numeric literals, as discussed. I have expanded that compared to the v4 patch to also cover non-integer literals. It also comes with more tests now. Patch 6 is the titular introduction of non-decimal integer literals, unchanged from before.
Attachment
On Thu, Nov 25, 2021 at 5:18 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 01.11.21 07:09, Peter Eisentraut wrote:
> Here is an updated patch for this. It's the previous patch polished a
> bit more, and it contains changes so that numeric literals reject
> trailing identifier parts without whitespace in between, as discussed.
> Maybe I should split that into incremental patches, but for now I only
> have the one. I don't have a patch for the underscores in numeric
> literals yet. It's in progress, but not ready.
Here is a progressed version of this work, split into more incremental
patches. The first three patches are harmless code cleanups. Patch 3
has an interesting naming conflict, noted in the commit message; ideas
welcome. Patches 4 and 5 handle the rejection of trailing junk after
numeric literals, as discussed. I have expanded that compared to the v4
patch to also cover non-integer literals. It also comes with more tests
now. Patch 6 is the titular introduction of non-decimal integer
literals, unchanged from before.
Hi,
For patch 3,
+pg_strtoint64(const char *s)
How about naming the above function pg_scanint64()?
pg_strtoint64xx() can be named pg_strtoint64() - this would align with existing function:
pg_strtouint64(const char *str, char **endptr, int base)
Cheers
Hi Peter,
0001
-/* we no longer allow unary minus in numbers.
- * instead we pass it separately to parser. there it gets
- * coerced via doNegate() -- Leon aug 20 1999
+/*
+ * Numbers
+ *
+ * Unary minus is not part of a number here. Instead we pass it separately to
+ * parser, and there it gets coerced via doNegate().
If we're going to change the comment anyway, "the parser" sounds more natural. Aside from that, 0001 and 0002 can probably be pushed now, if you like. I don't have any good ideas about 0003 at the moment.
0005
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
realfail1 ({integer}|{decimal})[Ee]
realfail2 ({integer}|{decimal})[Ee][-+]
+integer_junk {integer}{ident_start}
+decimal_junk {decimal}{ident_start}
+real_junk {real}{ident_start}
A comment might be good here to explain these are only in ECPG for consistency with the other scanners. Not really important, though.
0001
-/* we no longer allow unary minus in numbers.
- * instead we pass it separately to parser. there it gets
- * coerced via doNegate() -- Leon aug 20 1999
+/*
+ * Numbers
+ *
+ * Unary minus is not part of a number here. Instead we pass it separately to
+ * parser, and there it gets coerced via doNegate().
If we're going to change the comment anyway, "the parser" sounds more natural. Aside from that, 0001 and 0002 can probably be pushed now, if you like. I don't have any good ideas about 0003 at the moment.
0005
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
realfail1 ({integer}|{decimal})[Ee]
realfail2 ({integer}|{decimal})[Ee][-+]
+integer_junk {integer}{ident_start}
+decimal_junk {decimal}{ident_start}
+real_junk {real}{ident_start}
A comment might be good here to explain these are only in ECPG for consistency with the other scanners. Not really important, though.
0006
+{hexfail} {
+ yyerror("invalid hexadecimal integer");
+ }
+{octfail} {
+ yyerror("invalid octal integer");
}
-{decimal} {
+{binfail} {
+ yyerror("invalid binary integer");
+ }
It seems these could use SET_YYLLOC(), since the error cursor doesn't match other failure states:
+SELECT 0b;
+ERROR: invalid binary integer at or near "SELECT 0b"
+LINE 1: SELECT 0b;
+ ^
+SELECT 1b;
+ERROR: trailing junk after numeric literal at or near "1b"
+LINE 1: SELECT 1b;
+ ^
We might consider some tests for ECPG since lack of coverage has been a problem.
+{hexfail} {
+ yyerror("invalid hexadecimal integer");
+ }
+{octfail} {
+ yyerror("invalid octal integer");
}
-{decimal} {
+{binfail} {
+ yyerror("invalid binary integer");
+ }
It seems these could use SET_YYLLOC(), since the error cursor doesn't match other failure states:
+SELECT 0b;
+ERROR: invalid binary integer at or near "SELECT 0b"
+LINE 1: SELECT 0b;
+ ^
+SELECT 1b;
+ERROR: trailing junk after numeric literal at or near "1b"
+LINE 1: SELECT 1b;
+ ^
We might consider some tests for ECPG since lack of coverage has been a problem.
Also, I'm curious: how does the spec work as far as deciding the year of release, or feature-freezing of new items?
--
John Naylor
EDB: http://www.enterprisedb.com
--
John Naylor
EDB: http://www.enterprisedb.com
On 25.11.21 16:46, Zhihong Yu wrote: > For patch 3, > > +int64 > +pg_strtoint64(const char *s) > > How about naming the above function pg_scanint64()? > pg_strtoint64xx() can be named pg_strtoint64() - this would align with > existing function: > > pg_strtouint64(const char *str, char **endptr, int base) That would be one way. But the existing pg_strtointNN() functions are pretty widely used, so I would tend toward finding another name for the less used pg_strtouint64(), maybe pg_strtouint64x() ("extended").
On 25.11.21 18:51, John Naylor wrote: > If we're going to change the comment anyway, "the parser" sounds more > natural. Aside from that, 0001 and 0002 can probably be pushed now, if > you like. done > --- a/src/interfaces/ecpg/preproc/pgc.l > +++ b/src/interfaces/ecpg/preproc/pgc.l > @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+ > realfail1 ({integer}|{decimal})[Ee] > realfail2 ({integer}|{decimal})[Ee][-+] > > +integer_junk {integer}{ident_start} > +decimal_junk {decimal}{ident_start} > +real_junk {real}{ident_start} > > A comment might be good here to explain these are only in ECPG for > consistency with the other scanners. Not really important, though. Yeah, it's a bit weird that not all the symbols are used in ecpg. I'll look into explaining this better. > 0006 > > +{hexfail} { > + yyerror("invalid hexadecimal integer"); > + } > +{octfail} { > + yyerror("invalid octal integer"); > } > -{decimal} { > +{binfail} { > + yyerror("invalid binary integer"); > + } > > It seems these could use SET_YYLLOC(), since the error cursor doesn't > match other failure states: ok > We might consider some tests for ECPG since lack of coverage has been a > problem. right > Also, I'm curious: how does the spec work as far as deciding the year of > release, or feature-freezing of new items? The schedule has recently been extended again, so the current plan is for SQL:202x with x=3, with feature freeze in mid-2022. So the feature patches in this thread are in my mind now targeting PG15+1. But the preparation work (up to v5-0005, and some other number parsing refactoring that I'm seeing) could be considered for PG15. I'll move this to the next CF and come back with an updated patch set in a little while.
There has been some other refactoring going on, which made this patch set out of date. So here is an update. The old pg_strtouint64() has been removed, so there is no longer a naming concern with patch 0001. That one should be good to go. I also found that yet another way to parse integers in pg_atoi() has mostly faded away in utility, so I removed the last two callers and removed the function in 0002 and 0003. The remaining patches are as before, with some of the review comments applied. I still need to write some lexing unit tests for ecpg, which I haven't gotten to yet. This affects patches 0004 and 0005. As mentioned before, patches 0006 and 0007 are more feature previews at this point. On 01.12.21 16:47, Peter Eisentraut wrote: > On 25.11.21 18:51, John Naylor wrote: >> If we're going to change the comment anyway, "the parser" sounds more >> natural. Aside from that, 0001 and 0002 can probably be pushed now, if >> you like. > > done > >> --- a/src/interfaces/ecpg/preproc/pgc.l >> +++ b/src/interfaces/ecpg/preproc/pgc.l >> @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+ >> realfail1 ({integer}|{decimal})[Ee] >> realfail2 ({integer}|{decimal})[Ee][-+] >> >> +integer_junk {integer}{ident_start} >> +decimal_junk {decimal}{ident_start} >> +real_junk {real}{ident_start} >> >> A comment might be good here to explain these are only in ECPG for >> consistency with the other scanners. Not really important, though. > > Yeah, it's a bit weird that not all the symbols are used in ecpg. I'll > look into explaining this better. > >> 0006 >> >> +{hexfail} { >> + yyerror("invalid hexadecimal integer"); >> + } >> +{octfail} { >> + yyerror("invalid octal integer"); >> } >> -{decimal} { >> +{binfail} { >> + yyerror("invalid binary integer"); >> + } >> >> It seems these could use SET_YYLLOC(), since the error cursor doesn't >> match other failure states: > > ok > >> We might consider some tests for ECPG since lack of coverage has been >> a problem. > > right > >> Also, I'm curious: how does the spec work as far as deciding the year >> of release, or feature-freezing of new items? > > The schedule has recently been extended again, so the current plan is > for SQL:202x with x=3, with feature freeze in mid-2022. > > So the feature patches in this thread are in my mind now targeting > PG15+1. But the preparation work (up to v5-0005, and some other number > parsing refactoring that I'm seeing) could be considered for PG15. > > I'll move this to the next CF and come back with an updated patch set in > a little while. > >
Attachment
- v6-0001-Move-scanint8-to-numutils.c.patch
- v6-0002-Remove-one-use-of-pg_atoi.patch
- v6-0003-Remove-pg_atoi.patch
- v6-0004-Add-test-case-for-trailing-junk-after-numeric-lit.patch
- v6-0005-Reject-trailing-junk-after-numeric-literals.patch
- v6-0006-Non-decimal-integer-literals.patch
- v6-0007-WIP-Underscores-in-numeric-literals.patch
Another modest update, because of the copyright year update preventing the previous patches from applying cleanly. I also did a bit of work on the ecpg scanner so that it also handles some errors on par with the main scanner. There is still no automated testing of this in ecpg, but I have a bunch of single-line test files that can provoke various errors. I will keep these around and maybe put them into something more formal in the future. On 30.12.21 10:43, Peter Eisentraut wrote: > There has been some other refactoring going on, which made this patch > set out of date. So here is an update. > > The old pg_strtouint64() has been removed, so there is no longer a > naming concern with patch 0001. That one should be good to go. > > I also found that yet another way to parse integers in pg_atoi() has > mostly faded away in utility, so I removed the last two callers and > removed the function in 0002 and 0003. > > The remaining patches are as before, with some of the review comments > applied. I still need to write some lexing unit tests for ecpg, which I > haven't gotten to yet. This affects patches 0004 and 0005. > > As mentioned before, patches 0006 and 0007 are more feature previews at > this point. > > > On 01.12.21 16:47, Peter Eisentraut wrote: >> On 25.11.21 18:51, John Naylor wrote: >>> If we're going to change the comment anyway, "the parser" sounds more >>> natural. Aside from that, 0001 and 0002 can probably be pushed now, >>> if you like. >> >> done >> >>> --- a/src/interfaces/ecpg/preproc/pgc.l >>> +++ b/src/interfaces/ecpg/preproc/pgc.l >>> @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+ >>> realfail1 ({integer}|{decimal})[Ee] >>> realfail2 ({integer}|{decimal})[Ee][-+] >>> >>> +integer_junk {integer}{ident_start} >>> +decimal_junk {decimal}{ident_start} >>> +real_junk {real}{ident_start} >>> >>> A comment might be good here to explain these are only in ECPG for >>> consistency with the other scanners. Not really important, though. >> >> Yeah, it's a bit weird that not all the symbols are used in ecpg. >> I'll look into explaining this better. >> >>> 0006 >>> >>> +{hexfail} { >>> + yyerror("invalid hexadecimal integer"); >>> + } >>> +{octfail} { >>> + yyerror("invalid octal integer"); >>> } >>> -{decimal} { >>> +{binfail} { >>> + yyerror("invalid binary integer"); >>> + } >>> >>> It seems these could use SET_YYLLOC(), since the error cursor doesn't >>> match other failure states: >> >> ok >> >>> We might consider some tests for ECPG since lack of coverage has been >>> a problem. >> >> right >> >>> Also, I'm curious: how does the spec work as far as deciding the year >>> of release, or feature-freezing of new items? >> >> The schedule has recently been extended again, so the current plan is >> for SQL:202x with x=3, with feature freeze in mid-2022. >> >> So the feature patches in this thread are in my mind now targeting >> PG15+1. But the preparation work (up to v5-0005, and some other >> number parsing refactoring that I'm seeing) could be considered for PG15. >> >> I'll move this to the next CF and come back with an updated patch set >> in a little while. >> >>
Attachment
- v7-0001-Move-scanint8-to-numutils.c.patch
- v7-0002-Remove-one-use-of-pg_atoi.patch
- v7-0003-Remove-pg_atoi.patch
- v7-0004-Add-test-case-for-trailing-junk-after-numeric-lit.patch
- v7-0005-Reject-trailing-junk-after-numeric-literals.patch
- v7-0006-Non-decimal-integer-literals.patch
- v7-0007-WIP-Underscores-in-numeric-literals.patch
Rebased patch set On 13.01.22 14:42, Peter Eisentraut wrote: > Another modest update, because of the copyright year update preventing > the previous patches from applying cleanly. > > I also did a bit of work on the ecpg scanner so that it also handles > some errors on par with the main scanner. > > There is still no automated testing of this in ecpg, but I have a bunch > of single-line test files that can provoke various errors. I will keep > these around and maybe put them into something more formal in the future. > > > On 30.12.21 10:43, Peter Eisentraut wrote: >> There has been some other refactoring going on, which made this patch >> set out of date. So here is an update. >> >> The old pg_strtouint64() has been removed, so there is no longer a >> naming concern with patch 0001. That one should be good to go. >> >> I also found that yet another way to parse integers in pg_atoi() has >> mostly faded away in utility, so I removed the last two callers and >> removed the function in 0002 and 0003. >> >> The remaining patches are as before, with some of the review comments >> applied. I still need to write some lexing unit tests for ecpg, which >> I haven't gotten to yet. This affects patches 0004 and 0005. >> >> As mentioned before, patches 0006 and 0007 are more feature previews >> at this point. >> >> >> On 01.12.21 16:47, Peter Eisentraut wrote: >>> On 25.11.21 18:51, John Naylor wrote: >>>> If we're going to change the comment anyway, "the parser" sounds >>>> more natural. Aside from that, 0001 and 0002 can probably be pushed >>>> now, if you like. >>> >>> done >>> >>>> --- a/src/interfaces/ecpg/preproc/pgc.l >>>> +++ b/src/interfaces/ecpg/preproc/pgc.l >>>> @@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+ >>>> realfail1 ({integer}|{decimal})[Ee] >>>> realfail2 ({integer}|{decimal})[Ee][-+] >>>> >>>> +integer_junk {integer}{ident_start} >>>> +decimal_junk {decimal}{ident_start} >>>> +real_junk {real}{ident_start} >>>> >>>> A comment might be good here to explain these are only in ECPG for >>>> consistency with the other scanners. Not really important, though. >>> >>> Yeah, it's a bit weird that not all the symbols are used in ecpg. >>> I'll look into explaining this better. >>> >>>> 0006 >>>> >>>> +{hexfail} { >>>> + yyerror("invalid hexadecimal integer"); >>>> + } >>>> +{octfail} { >>>> + yyerror("invalid octal integer"); >>>> } >>>> -{decimal} { >>>> +{binfail} { >>>> + yyerror("invalid binary integer"); >>>> + } >>>> >>>> It seems these could use SET_YYLLOC(), since the error cursor >>>> doesn't match other failure states: >>> >>> ok >>> >>>> We might consider some tests for ECPG since lack of coverage has >>>> been a problem. >>> >>> right >>> >>>> Also, I'm curious: how does the spec work as far as deciding the >>>> year of release, or feature-freezing of new items? >>> >>> The schedule has recently been extended again, so the current plan is >>> for SQL:202x with x=3, with feature freeze in mid-2022. >>> >>> So the feature patches in this thread are in my mind now targeting >>> PG15+1. But the preparation work (up to v5-0005, and some other >>> number parsing refactoring that I'm seeing) could be considered for >>> PG15. >>> >>> I'll move this to the next CF and come back with an updated patch set >>> in a little while. >>> >>>
Attachment
- v8-0001-Move-scanint8-to-numutils.c.patch
- v8-0002-Remove-one-use-of-pg_atoi.patch
- v8-0003-Remove-pg_atoi.patch
- v8-0004-Add-test-case-for-trailing-junk-after-numeric-lit.patch
- v8-0005-Reject-trailing-junk-after-numeric-literals.patch
- v8-0006-Non-decimal-integer-literals.patch
- v8-0007-WIP-Underscores-in-numeric-literals.patch
On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > Rebased patch set What if someone finds this new behavior too permissive? -- Robert Haas EDB: http://www.enterprisedb.com
On 24.01.22 19:53, Robert Haas wrote: > On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> Rebased patch set > > What if someone finds this new behavior too permissive? Which part exactly? There are several different changes proposed here.
On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > On 24.01.22 19:53, Robert Haas wrote: > > On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut > > <peter.eisentraut@enterprisedb.com> wrote: > >> Rebased patch set > > > > What if someone finds this new behavior too permissive? > > Which part exactly? There are several different changes proposed here. I was just going based on the description of the feature in your original post. If someone is hoping that int4in() will accept only ^\d+$ then they will be disappointed by this patch. Maybe nobody is hoping that, though. -- Robert Haas EDB: http://www.enterprisedb.com
On 2022-Jan-24, Peter Eisentraut wrote: > +decinteger {decdigit}(_?{decdigit})* > +hexinteger 0[xX](_?{hexdigit})+ > +octinteger 0[oO](_?{octdigit})+ > +bininteger 0[bB](_?{bindigit})+ I think there should be test cases for literals that these seemingly strange expressions reject, which are a number with trailing _ (0x123_), and one with consecutive underscores __ (0x12__34). I like the idea of these literals. I would have found them useful on many occassions. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> Which part exactly? There are several different changes proposed here. > I was just going based on the description of the feature in your > original post. If someone is hoping that int4in() will accept only > ^\d+$ then they will be disappointed by this patch. Maybe I misunderstood, but I thought this was about what you could write as a SQL literal, not about the I/O behavior of the integer types. I'd be -0.1 on changing the latter. regards, tom lane
On 26.01.22 01:02, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut >> <peter.eisentraut@enterprisedb.com> wrote: >>> Which part exactly? There are several different changes proposed here. > >> I was just going based on the description of the feature in your >> original post. If someone is hoping that int4in() will accept only >> ^\d+$ then they will be disappointed by this patch. > > Maybe I misunderstood, but I thought this was about what you could > write as a SQL literal, not about the I/O behavior of the integer > types. I'd be -0.1 on changing the latter. I think it would be strange if I/O routines would accept a different syntax from the literals. Also, the behavior of a cast from string/text to a numeric type is usually defined in terms of what the literal syntax is, so they need to be aligned.
On 1/25/22 13:43, Alvaro Herrera wrote: > On 2022-Jan-24, Peter Eisentraut wrote: > >> +decinteger {decdigit}(_?{decdigit})* >> +hexinteger 0[xX](_?{hexdigit})+ >> +octinteger 0[oO](_?{octdigit})+ >> +bininteger 0[bB](_?{bindigit})+ > I think there should be test cases for literals that these seemingly > strange expressions reject, which are a number with trailing _ (0x123_), > and one with consecutive underscores __ (0x12__34). > > I like the idea of these literals. I would have found them useful on > many occassions. +1. I can't remember the number of times I have miscounted a long string of digits in a literal. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Jan 26, 2022 at 10:10 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > [v8 patch] 0001-0004 seem pretty straightforward. 0005: {realfail1} { - /* - * throw back the [Ee], and figure out whether what - * remains is an {integer} or {decimal}. - */ - yyless(yyleng - 1); SET_YYLLOC(); - return process_integer_literal(yytext, yylval); + yyerror("trailing junk after numeric literal"); } realfail1 has been subsumed by integer_junk and decimal_junk, so that pattern can be removed. <SQL>{ +/* + * Note that some trailing junk is valid in C (such as 100LL), so we contain + * this to SQL mode. + */ It seems Flex doesn't like C comments after the "%%", so this stanza was indented in 0006. If these are to be committed separately, that fix should happen here. 0006: Minor nit -- the s/decimal/numeric/ change doesn't seem to have any notational advantage, but it's not worse, either. 0007: I've attached an addendum to restore the no-backtrack property. Will the underscore syntax need treatment in the input routines as well? -- John Naylor EDB: http://www.enterprisedb.com
Attachment
Re: Peter Eisentraut > This adds support in the lexer as well as in the integer type input > functions. > > Those core parts are straightforward enough, but there are a bunch of other > places where integers are parsed, and one could consider in each case > whether they should get the same treatment, for example the replication > syntax lexer, or input function for oid, numeric, and int2vector. One thing I always found weird is that timeline IDs appear most prominently as hex numbers in WAL filenames, but they are printed as decimal in the log ("new timeline id nn"), and have to be specified as decimal in recovery_target_timeline. Perhaps both these could make use of 0xhex numbers as well. Christoph
On 13.02.22 13:14, John Naylor wrote: > On Wed, Jan 26, 2022 at 10:10 PM Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> [v8 patch] > > 0001-0004 seem pretty straightforward. These have been committed. > > 0005: > > {realfail1} { > - /* > - * throw back the [Ee], and figure out whether what > - * remains is an {integer} or {decimal}. > - */ > - yyless(yyleng - 1); > SET_YYLLOC(); > - return process_integer_literal(yytext, yylval); > + yyerror("trailing junk after numeric literal"); > } > > realfail1 has been subsumed by integer_junk and decimal_junk, so that > pattern can be removed. Committed with that change. I found that the JSON path lexer has the same trailing-junk issue. I have researched the relevant ECMA standard and it explicitly points out that this is not allowed. I will look into that separately. I'm just pointing that out here because grepping for "realfail1" will still show a hit after this. The remaining patches are material for PG16 at this point, and I will set the commit fest item to returned with feedback in the meantime. > 0006: > > Minor nit -- the s/decimal/numeric/ change doesn't seem to have any > notational advantage, but it's not worse, either. I did that because with the addition of non-decimal literals, the word "decimal" becomes ambiguous or misleading. (It doesn't mean "uses decimal digits" but "has a decimal point".) (Of course, "numeric" isn't entirely free of ambiguity, but there are only so many words available in this space. ;-) ) > 0007: > > I've attached an addendum to restore the no-backtrack property. Thanks, that is helpful. > Will the underscore syntax need treatment in the input routines as well? Yeah, various additional work is required for this.
On 16.02.22 11:11, Peter Eisentraut wrote: > The remaining patches are material for PG16 at this point, and I will > set the commit fest item to returned with feedback in the meantime. Time to continue with this. Attached is a rebased and cleaned up patch for non-decimal integer literals. (I don't include the underscores-in-numeric literals patch. I'm keeping that for later.) Two open issues from my notes: Technically, numeric_in() should be made aware of this, but that seems relatively complicated and maybe not necessary for the first iteration. Taking another look around ecpg to see how this interacts with C-syntax integer literals. I'm not aware of any particular issues, but it's understandably tricky. Other than that, this seems pretty complete as a start.
Attachment
Hi Peter, /* process digits */ + if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X')) + { + ptr += 2; + while (*ptr && isxdigit((unsigned char) *ptr)) + { + int8 digit = hexlookup[(unsigned char) *ptr]; + + if (unlikely(pg_mul_s16_overflow(tmp, 16, &tmp)) || + unlikely(pg_sub_s16_overflow(tmp, digit, &tmp))) + goto out_of_range; + + ptr++; + } + } + else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O')) + { + ptr += 2; + + while (*ptr && (*ptr >= '0' && *ptr <= '7')) + { + int8 digit = (*ptr++ - '0'); + + if (unlikely(pg_mul_s16_overflow(tmp, 8, &tmp)) || + unlikely(pg_sub_s16_overflow(tmp, digit, &tmp))) + goto out_of_range; + } + } + else if (ptr[0] == '0' && (ptr[1] == 'b' || ptr[1] == 'B')) + { + ptr += 2; + + while (*ptr && (*ptr >= '0' && *ptr <= '1')) + { + int8 digit = (*ptr++ - '0'); + + if (unlikely(pg_mul_s16_overflow(tmp, 2, &tmp)) || + unlikely(pg_sub_s16_overflow(tmp, digit, &tmp))) + goto out_of_range; + } + } + else + { while (*ptr && isdigit((unsigned char) *ptr)) { int8 digit = (*ptr++ - '0'); @@ -128,6 +181,7 @@ pg_strtoint16(const char *s) unlikely(pg_sub_s16_overflow(tmp, digit, &tmp))) goto out_of_range; } + } What do you think if we move these code into a static inline function? like: static inline char* process_digits(char *ptr, int32 *result) { ... } On Mon, Oct 10, 2022 at 10:17 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 16.02.22 11:11, Peter Eisentraut wrote: > > The remaining patches are material for PG16 at this point, and I will > > set the commit fest item to returned with feedback in the meantime. > > Time to continue with this. > > Attached is a rebased and cleaned up patch for non-decimal integer > literals. (I don't include the underscores-in-numeric literals patch. > I'm keeping that for later.) > > Two open issues from my notes: > > Technically, numeric_in() should be made aware of this, but that seems > relatively complicated and maybe not necessary for the first iteration. > > Taking another look around ecpg to see how this interacts with C-syntax > integer literals. I'm not aware of any particular issues, but it's > understandably tricky. > > Other than that, this seems pretty complete as a start. -- Regards Junwang Zhao
On 11.10.22 05:29, Junwang Zhao wrote: > What do you think if we move these code into a static inline function? like: > > static inline char* > process_digits(char *ptr, int32 *result) > { > ... > } How would you handle the different ways each branch checks for valid digits and computes the value of each digit?
On Tue, Oct 11, 2022 at 4:59 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 11.10.22 05:29, Junwang Zhao wrote: > > What do you think if we move these code into a static inline function? like: > > > > static inline char* > > process_digits(char *ptr, int32 *result) > > { > > ... > > } > > How would you handle the different ways each branch checks for valid > digits and computes the value of each digit? > Didn't notice that, sorry for the noise ;( -- Regards Junwang Zhao
On Mon, Oct 10, 2022 at 9:17 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> Taking another look around ecpg to see how this interacts with C-syntax
> integer literals. I'm not aware of any particular issues, but it's
> understandably tricky.
I can find no discussion in the archives about the commit that added "xch":
commit 6fb3c3f78fbb2296894424f6e3183d339915eac7
Author: Michael Meskes <meskes@postgresql.org>
Date: Fri Oct 15 19:02:08 1999 +0000
*** empty log message ***
...and I can't think of why bounds checking a C literal was done like this.
Regarding the patch, it looks good overall. My only suggestion would be to add a regression test for just below and just above overflow, at least for int2.
> Taking another look around ecpg to see how this interacts with C-syntax
> integer literals. I'm not aware of any particular issues, but it's
> understandably tricky.
I can find no discussion in the archives about the commit that added "xch":
commit 6fb3c3f78fbb2296894424f6e3183d339915eac7
Author: Michael Meskes <meskes@postgresql.org>
Date: Fri Oct 15 19:02:08 1999 +0000
*** empty log message ***
...and I can't think of why bounds checking a C literal was done like this.
Regarding the patch, it looks good overall. My only suggestion would be to add a regression test for just below and just above overflow, at least for int2.
Minor nits:
- * Process {integer}. Note this will also do the right thing with {decimal},
+ * Process {*integer}. Note this will also do the right thing with {numeric},
I scratched my head for a while, thinking this was Flex syntax, until I realized my brain was supposed to do shell-globbing first, at which point I could see it was referring to multiple Flex rules. I'd try to rephrase.
+T661 Non-decimal integer literals YES SQL:202x draft
Is there an ETA yet?
Also, it's not this patch's job to do it, but there are at least a half dozen places that open-code turning a hex char into a number. It might be a good easy "todo item" to unify that.
- * Process {integer}. Note this will also do the right thing with {decimal},
+ * Process {*integer}. Note this will also do the right thing with {numeric},
I scratched my head for a while, thinking this was Flex syntax, until I realized my brain was supposed to do shell-globbing first, at which point I could see it was referring to multiple Flex rules. I'd try to rephrase.
+T661 Non-decimal integer literals YES SQL:202x draft
Is there an ETA yet?
Also, it's not this patch's job to do it, but there are at least a half dozen places that open-code turning a hex char into a number. It might be a good easy "todo item" to unify that.
On 14.11.22 08:25, John Naylor wrote: > Regarding the patch, it looks good overall. My only suggestion would be > to add a regression test for just below and just above overflow, at > least for int2. ok > Minor nits: > > - * Process {integer}. Note this will also do the right thing with > {decimal}, > + * Process {*integer}. Note this will also do the right thing with > {numeric}, > > I scratched my head for a while, thinking this was Flex syntax, until I > realized my brain was supposed to do shell-globbing first, at which > point I could see it was referring to multiple Flex rules. I'd try to > rephrase. ok > +T661 Non-decimal integer literals YES SQL:202x draft > > Is there an ETA yet? March 2023 > Also, it's not this patch's job to do it, but there are at least a half > dozen places that open-code turning a hex char into a number. It might > be a good easy "todo item" to unify that. right
On 15.11.22 11:31, Peter Eisentraut wrote: > On 14.11.22 08:25, John Naylor wrote: >> Regarding the patch, it looks good overall. My only suggestion would >> be to add a regression test for just below and just above overflow, at >> least for int2. > > ok This was a valuable suggestion, because this found some breakage. In particular, the handling of grammar-level literals that overflow to "Float" was not correct. (The radix prefix was simply stripped and forgotten.) So I added a bunch more tests for this. Here is a new patch.
Attachment
On Tue, Nov 22, 2022 at 8:36 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 15.11.22 11:31, Peter Eisentraut wrote:
> > On 14.11.22 08:25, John Naylor wrote:
> >> Regarding the patch, it looks good overall. My only suggestion would
> >> be to add a regression test for just below and just above overflow, at
> >> least for int2.
> >
> > ok
>
> This was a valuable suggestion, because this found some breakage. In
> particular, the handling of grammar-level literals that overflow to
> "Float" was not correct. (The radix prefix was simply stripped and
> forgotten.) So I added a bunch more tests for this. Here is a new patch.
Looks good to me.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, 23 Nov 2022 at 02:37, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > Here is a new patch. This looks like quite an inefficient way to convert a hex string into an int64: while (*ptr && isxdigit((unsigned char) *ptr)) { int8 digit = hexlookup[(unsigned char) *ptr]; if (unlikely(pg_mul_s64_overflow(tmp, 16, &tmp)) || unlikely(pg_sub_s64_overflow(tmp, digit, &tmp))) goto out_of_range; ptr++; } I wonder if you'd be better off with something like: while (*ptr && isxdigit((unsigned char) *ptr)) { if (unlikely(tmp & UINT64CONST(0xF000000000000000))) goto out_of_range; tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++]; } Going by [1], clang will actually use multiplication by 16 to implement the former. gcc is better and shifts left by 4, so likely won't improve things for gcc. It seems worth doing it this way for anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway. David [1] https://godbolt.org/z/jz6Th6jnM
On Wed, 23 Nov 2022 at 21:54, David Rowley <dgrowleyml@gmail.com> wrote: > I wonder if you'd be better off with something like: > > while (*ptr && isxdigit((unsigned char) *ptr)) > { > if (unlikely(tmp & UINT64CONST(0xF000000000000000))) > goto out_of_range; > > tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++]; > } Here's a delta diff with it changed to work that way. David
Attachment
On Wed, Nov 23, 2022 at 3:54 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Going by [1], clang will actually use multiplication by 16 to
> implement the former. gcc is better and shifts left by 4, so likely
> won't improve things for gcc. It seems worth doing it this way for
> anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.
FWIW, gcc 12.2 generates an imul on my system when compiling in situ. I've found it useful to run godbolt locally* and load the entire PG file (nicer to read than plain objdump) -- compilers can make different decisions when going from isolated snippets to within full functions.
* clone from https://github.com/compiler-explorer/compiler-explorer
install npm 16
run "make" and when finished will show the localhost url
add the right flags, which in this case was
-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2 -I/path/to/srcdir/src/include -I/path/to/builddir/src/include -D_GNU_SOURCE
On Tue, 22 Nov 2022 at 13:37, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 15.11.22 11:31, Peter Eisentraut wrote: > > On 14.11.22 08:25, John Naylor wrote: > >> Regarding the patch, it looks good overall. My only suggestion would > >> be to add a regression test for just below and just above overflow, at > >> least for int2. > > > This was a valuable suggestion, because this found some breakage. In > particular, the handling of grammar-level literals that overflow to > "Float" was not correct. (The radix prefix was simply stripped and > forgotten.) So I added a bunch more tests for this. Here is a new patch. Taking a quick look, I noticed that there are no tests for negative values handled in the parser. Giving that a spin shows that make_const() fails to correctly identify the base of negative non-decimal integers in the T_Float case, causing it to fall through to numeric_in() and fail: SELECT -0x80000000; ERROR: invalid input syntax for type numeric: "-0x80000000" ^ Regards, Dean
On 23.11.22 09:54, David Rowley wrote: > On Wed, 23 Nov 2022 at 02:37, Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> Here is a new patch. > > This looks like quite an inefficient way to convert a hex string into an int64: > > while (*ptr && isxdigit((unsigned char) *ptr)) > { > int8 digit = hexlookup[(unsigned char) *ptr]; > > if (unlikely(pg_mul_s64_overflow(tmp, 16, &tmp)) || > unlikely(pg_sub_s64_overflow(tmp, digit, &tmp))) > goto out_of_range; > > ptr++; > } > > I wonder if you'd be better off with something like: > > while (*ptr && isxdigit((unsigned char) *ptr)) > { > if (unlikely(tmp & UINT64CONST(0xF000000000000000))) > goto out_of_range; > > tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++]; > } > > Going by [1], clang will actually use multiplication by 16 to > implement the former. gcc is better and shifts left by 4, so likely > won't improve things for gcc. It seems worth doing it this way for > anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway. My code follows the style used for parsing the decimal integers. Keeping that consistent is valuable I think. I think the proposed change makes the code significantly harder to understand. Also, what you are suggesting here would amount to an attempt to make parsing hexadecimal integers even faster than parsing decimal integers. Is that useful?
On Thu, 24 Nov 2022 at 21:35, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > My code follows the style used for parsing the decimal integers. > Keeping that consistent is valuable I think. I think the proposed > change makes the code significantly harder to understand. Also, what > you are suggesting here would amount to an attempt to make parsing > hexadecimal integers even faster than parsing decimal integers. Is that > useful? Isn't it being faster one of the major use cases for this feature? I remember many years ago and several jobs ago when working with SQL Server being able to speed up importing data using hexadecimal DATETIMEs. I can't think why else you might want to represent a DATETIME as a hexstring, so I assumed this was a large part of the use case for INTs in PostgreSQL. Are you telling me that better performance is not something anyone will want out of this feature? David
On 24.11.22 10:13, David Rowley wrote: > On Thu, 24 Nov 2022 at 21:35, Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> My code follows the style used for parsing the decimal integers. >> Keeping that consistent is valuable I think. I think the proposed >> change makes the code significantly harder to understand. Also, what >> you are suggesting here would amount to an attempt to make parsing >> hexadecimal integers even faster than parsing decimal integers. Is that >> useful? > > Isn't it being faster one of the major use cases for this feature? Never thought about it that way. > I > remember many years ago and several jobs ago when working with SQL > Server being able to speed up importing data using hexadecimal > DATETIMEs. I can't think why else you might want to represent a > DATETIME as a hexstring, so I assumed this was a large part of the use > case for INTs in PostgreSQL. Are you telling me that better > performance is not something anyone will want out of this feature? This isn't about datetimes but about integers.
On 23.11.22 17:25, Dean Rasheed wrote: > Taking a quick look, I noticed that there are no tests for negative > values handled in the parser. > > Giving that a spin shows that make_const() fails to correctly identify > the base of negative non-decimal integers in the T_Float case, causing > it to fall through to numeric_in() and fail: Fixed in new patch.
Attachment
On Sat, 26 Nov 2022 at 05:13, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 24.11.22 10:13, David Rowley wrote: > > I > > remember many years ago and several jobs ago when working with SQL > > Server being able to speed up importing data using hexadecimal > > DATETIMEs. I can't think why else you might want to represent a > > DATETIME as a hexstring, so I assumed this was a large part of the use > > case for INTs in PostgreSQL. Are you telling me that better > > performance is not something anyone will want out of this feature? > > This isn't about datetimes but about integers. I'm aware. My aim was to show that hex is commonly used as a more efficient way of getting integer numbers in and out of computers. Likely it's better for me to quantify this performance increase claim with some actual performance results. Here's master (@f0cd57f85) doing copy ab2 from '/tmp/ab.csv'; ab2 is a table with no indexes and just 2 int columns. 16.55% postgres [.] CopyReadLine 7.82% postgres [.] pg_strtoint32 7.60% postgres [.] CopyReadAttributesText 7.06% postgres [.] NextCopyFrom 4.40% postgres [.] CopyFrom The copy completes in 2512.5278 ms (average time over 10 runs) Patching master with your v11 patch and copying in hex numbers instead of decimal numbers shows: 14.39% postgres [.] CopyReadLine 8.60% postgres [.] pg_strtoint32 6.95% postgres [.] NextCopyFrom 6.79% postgres [.] CopyReadAttributesText 4.81% postgres [.] CopyFrom This shows that we're spending proportionally less time in CopyReadLine() and proportionally more time in pg_strtoint32(). There are probably two things going on there, CopyReadLine is likely faster due to having to read fewer bytes and pg_strtoint32() is likely slower due to additional branching and code size. This (copy ab2 from '/tmp/abhex.csv') saw an average time of 2720.1387 ms over 10 runs. Patching master with your v11 patch + more_efficient_hex_oct_and_binary_processing.diff 15.68% postgres [.] CopyReadLine 7.75% postgres [.] NextCopyFrom 7.73% postgres [.] pg_strtoint32 6.25% postgres [.] CopyReadAttributesText 4.76% postgres [.] CopyFrom The average time to import the hex version of the csv file comes down to 2385.7298 ms over 10 runs. I didn't run any tests to see how much the performance of importing the decimal representation slowed down from the v11 patch. I assume there will be a small performance hit due to the extra processing done in pg_strtoint32() David
On Wed, 23 Nov 2022 at 08:56, David Rowley <dgrowleyml@gmail.com> wrote: > > On Wed, 23 Nov 2022 at 21:54, David Rowley <dgrowleyml@gmail.com> wrote: > > I wonder if you'd be better off with something like: > > > > while (*ptr && isxdigit((unsigned char) *ptr)) > > { > > if (unlikely(tmp & UINT64CONST(0xF000000000000000))) > > goto out_of_range; > > > > tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++]; > > } > > Here's a delta diff with it changed to work that way. > This isn't correct, because those functions are meant to accumulate a negative number in "tmp". The overflow check can't just ignore the final digit either, so I'm not sure how much this would end up saving once those issues are fixed. Regards, Dean
On Tue, 29 Nov 2022 at 23:11, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > On Wed, 23 Nov 2022 at 08:56, David Rowley <dgrowleyml@gmail.com> wrote: > > > > On Wed, 23 Nov 2022 at 21:54, David Rowley <dgrowleyml@gmail.com> wrote: > > > I wonder if you'd be better off with something like: > > > > > > while (*ptr && isxdigit((unsigned char) *ptr)) > > > { > > > if (unlikely(tmp & UINT64CONST(0xF000000000000000))) > > > goto out_of_range; > > > > > > tmp = (tmp << 4) | hexlookup[(unsigned char) *ptr++]; > > > } > > > > Here's a delta diff with it changed to work that way. > > > > This isn't correct, because those functions are meant to accumulate a > negative number in "tmp". Looks like I didn't quite look at that code closely enough. To make that work we could just form the non-decimal versions in an unsigned integer of the given size and then check if that's become greater than -PG_INTXX_MIN after the loop. We'd then just need to convert it back to its negative form. i.e: uint64 tmp2 = 0; ptr += 2; while (*ptr && isxdigit((unsigned char) *ptr)) { if (unlikely(tmp2 & UINT64CONST(0xF000000000000000))) goto out_of_range; tmp2 = (tmp2 << 4) | hexlookup[(unsigned char) *ptr++]; } if (tmp2 > -PG_INT64_MIN) goto out_of_range; tmp = -((int64) tmp2); David
On Tue, 29 Nov 2022 at 03:00, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > Fixed in new patch. There seems to be a small bug in the pg_strtointXX functions in the code that checks that there's at least 1 digit. This causes 0x to be a valid representation of zero. That does not seem to be allowed by the parser, so I think we should likely reject it in COPY too. -- Does not work. postgres=# select 0x + 1; ERROR: invalid hexadecimal integer at or near "0x" LINE 1: select 0x + 1; postgres=# create table a (a int); CREATE TABLE -- probably shouldn't work postgres=# copy a from stdin; Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> 0x >> \. COPY 1 David
On Wed, 23 Nov 2022 at 22:19, John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Wed, Nov 23, 2022 at 3:54 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > > Going by [1], clang will actually use multiplication by 16 to > > implement the former. gcc is better and shifts left by 4, so likely > > won't improve things for gcc. It seems worth doing it this way for > > anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway. > > FWIW, gcc 12.2 generates an imul on my system when compiling in situ. I spent a bit more time trying to figure out why the compiler does imul instead of bit shifting and it just seems to be down to a combination of signed-ness plus the overflow check. See [1]. Neither of the two compilers I tested could use bit shifting with a signed type when overflow checking is done, which is what we're doing in the new code. In clang 15, multiplication is done in both smultiply16 and umultiply16. These both check for overflow. The versions without the overflow checks both use bit shifting. With GCC, only smultiply16 does multiplication. The other 3 variants all use bit shifting. David [1] https://godbolt.org/z/EG9jKMjq5
On Wed, 30 Nov 2022 at 05:50, David Rowley <dgrowleyml@gmail.com> wrote: > > I spent a bit more time trying to figure out why the compiler does > imul instead of bit shifting and it just seems to be down to a > combination of signed-ness plus the overflow check. See [1]. Neither > of the two compilers I tested could use bit shifting with a signed > type when overflow checking is done, which is what we're doing in the > new code. > Ah, interesting. That makes me think that it might be possible to get some performance gains for all bases (including 10) by separating the overflow check from the multiplication, and giving the compiler the best chance to decide on the optimal way to do the multiplication. For example, on my Intel box, GCC prefers a pair of LEA instructions over an IMUL, to multiply by 10. I like your previous idea of using an unsigned integer for the accumulator, because then the overflow check in the loop doesn't need to be exact, as long as an exact check is done later. That way, there are fewer conditional branches in the loop, and the possibility for the compiler to choose the fastest multiplication method. So something like: // Accumulate positive value using unsigned int, with approximate // overflow check. If acc >= 1 - INT_MIN / 10, then acc * 10 is // sure to exceed -INT_MIN. unsigned int cutoff = 1 - INT_MIN / 10; unsigned int acc = 0; while (*ptr && isdigit((unsigned char) *ptr)) { if (unlikely(acc >= cutoff)) goto out_of_range; acc = acc * 10 + (*ptr - '0'); ptr++; } and similar for other bases, allowing the coding for all bases to be kept similar. I think it's probably best to consider this as a follow-on patch though. It shouldn't delay getting the main feature committed. Regards, Dean
On Thu, 1 Dec 2022 at 00:34, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > So something > like: > > // Accumulate positive value using unsigned int, with approximate > // overflow check. If acc >= 1 - INT_MIN / 10, then acc * 10 is > // sure to exceed -INT_MIN. > unsigned int cutoff = 1 - INT_MIN / 10; > unsigned int acc = 0; > > while (*ptr && isdigit((unsigned char) *ptr)) > { > if (unlikely(acc >= cutoff)) > goto out_of_range; > acc = acc * 10 + (*ptr - '0'); > ptr++; > } > > and similar for other bases, allowing the coding for all bases to be > kept similar. Seems like a good idea to me. Couldn't the cutoff check just be "acc > INT_MAX / 10"? > I think it's probably best to consider this as a follow-on patch > though. It shouldn't delay getting the main feature committed. I agree that it should be a separate patch. But thinking about what Tom mentioned in [1], I had in mind this patch would need to wait until the new standard is out so that we have a more genuine reason for breaking existing queries. I've drafted up a full patch for improving the current base-10 code, so I'll go post that on another thread. David [1] https://postgr.es/m/3260805.1631106874@sss.pgh.pa.us
David Rowley <dgrowleyml@gmail.com> writes: > I agree that it should be a separate patch. But thinking about what > Tom mentioned in [1], I had in mind this patch would need to wait > until the new standard is out so that we have a more genuine reason > for breaking existing queries. Well, we already broke them in v15: that example now gives regression=# select 0x42e; ERROR: trailing junk after numeric literal at or near "0x" LINE 1: select 0x42e; ^ So there's probably no compatibility reason not to drop the other shoe. regards, tom lane
On 29.11.22 21:22, David Rowley wrote: > There seems to be a small bug in the pg_strtointXX functions in the > code that checks that there's at least 1 digit. This causes 0x to be > a valid representation of zero. That does not seem to be allowed by > the parser, so I think we should likely reject it in COPY too. > -- probably shouldn't work > postgres=# copy a from stdin; > Enter data to be copied followed by a newline. > End with a backslash and a period on a line by itself, or an EOF signal. >>> 0x >>> \. > COPY 1 Fixed in new patch. I moved the "require at least one digit" checks after the loops over the digits, to make it easier to write one check for all bases. This patch is also incorporates your changes to the digit analysis algorithm. I didn't check it carefully, but all the tests still pass. ;-)
Attachment
On 08.12.22 12:16, Peter Eisentraut wrote: > On 29.11.22 21:22, David Rowley wrote: >> There seems to be a small bug in the pg_strtointXX functions in the >> code that checks that there's at least 1 digit. This causes 0x to be >> a valid representation of zero. That does not seem to be allowed by >> the parser, so I think we should likely reject it in COPY too. >> -- probably shouldn't work >> postgres=# copy a from stdin; >> Enter data to be copied followed by a newline. >> End with a backslash and a period on a line by itself, or an EOF signal. >>>> 0x >>>> \. >> COPY 1 > > Fixed in new patch. I moved the "require at least one digit" checks > after the loops over the digits, to make it easier to write one check > for all bases. > > This patch is also incorporates your changes to the digit analysis > algorithm. I didn't check it carefully, but all the tests still pass. ;-) committed
On Wed, 14 Dec 2022 at 05:47, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > committed Now that we have this for integer types, I think it's worth doing for numeric as well, since the parser will now pass such things through to numeric_in() when they don't fit in an int64, and it seems plausible that at least some people might use non-decimal integers beyond INT64MIN/MAX. Also, without such support in numeric_in(), the feature looks a little incomplete: SELECT -0x8000000000000000; ?column? ---------------------- -9223372036854775808 (1 row) SELECT 0x8000000000000000; ERROR: invalid input syntax for type numeric: "0x8000000000000000" LINE 1: select 0x8000000000000000; ^ One concern I had was what the performance would be like. I don't really expect people to pass in the kinds of truly huge values that numeric supports, but it can't be ruled out. So I gave it a go, to see how hard it would be, and what the worst-case performance looks like. (I included underscore-handling too, so that I could measure that at the same time.) The base-conversion algorithm is O(N^2), and the worst case before overflow is with hex strings with around 108,000 digits, oct strings with around 145,000 digits, or binary strings with around 435,000 digits. Each of those takes around 400ms to parse on my machine. That's around the level at which I might consider adding CHECK_FOR_INTERRUPTS()'s, but I think that it's probably not worth it, given how unrealistic such huge inputs are in practice. The other important thing is that this shouldn't impact the performance when parsing regular decimal inputs. The bulk of the non-decimal integer parsing is handled by a separate function, which is called directly from numeric_in(), since non-decimal handling isn't required at the set_var_from_str() level (used by the float4/8 -> numeric conversion functions). I also re-arranged the numeric_in() code somewhat, and was able to make substantial savings by reducing the number of pg_strncasecmp() calls, and avoiding those calls entirely for regular numbers that aren't NaN or Inf. Testing that with COPY with a few million numbers of different sizes, I observed a 10-15% performance increase. So I'm feeling quite good about the end result -- I set out hoping not to make performance noticeably worse, but ended up making it significantly better. Regards, Dean
Attachment
On 13.01.23 11:01, Dean Rasheed wrote: > So I'm feeling quite good about the end result -- I set out hoping not > to make performance noticeably worse, but ended up making it > significantly better. This is great! How do you want to proceed? You also posted an updated patch in the "underscores" thread and suggested some additional work there. In which order should these be addressed, in your opinion?
On Mon, 23 Jan 2023 at 15:55, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 13.01.23 11:01, Dean Rasheed wrote: > > So I'm feeling quite good about the end result -- I set out hoping not > > to make performance noticeably worse, but ended up making it > > significantly better. > > This is great! How do you want to proceed? You also posted an updated > patch in the "underscores" thread and suggested some additional work > there. In which order should these be addressed, in your opinion? > I think it makes most sense if I push 0001 now, and then merge 0002 into the underscores patch. I think at least one of the suggested changes to the underscores patch required 0002 to work. Regards, Dean
On Fri, Jan 13, 2023, at 07:01, Dean Rasheed wrote: > Attachments: > * 0001-Add-non-decimal-integer-support-to-type-numeric.patch Nice! This also simplifies when dealing with non-negative integers represented as byte arrays, common in e.g. cryptography code. Before, one had to implement numeric_from_bytes(bytea) in plpgsql [1], which can now be greatly simplified: create function numeric_from_bytes(bytea) returns numeric language sql as $$ select ('0'||right($1::text,-1))::numeric $$; \timing select numeric_from_bytes(('\x'||repeat('0123456789abcdef',1000))::bytea); Time: 484.223 ms -- HEAD + plpgsql numeric_from_bytes() Time: 19.790 ms -- 0001 + simplified numeric_from_bytes() About 25x faster! Would we want a built-in function for this? To avoid the text casts, but also to improve user-friendliness, since the improved solution is still a hack a user needing it has to someone come up with or find. The topic "Convert hex in text representation to decimal number" is an old one on Stackoverflow [2], posted 11 years ago, with a myriad of various hackis solutions, out of which one had a bug that I reported. Many other modern languages seems to have this as a built-in or in stdlibs: Python3: classmethod int.from_bytes(bytes, byteorder='big', *, signed=False) Rust: pub const fn from_be_bytes(bytes: [u8; 8]) -> u64 /Joel [1] https://gist.github.com/joelonsql/f54552db1f0fd6d9b3397d255e51f58a [2] https://stackoverflow.com/questions/8316164/convert-hex-in-text-representation-to-decimal-number
On Mon, 23 Jan 2023 at 20:00, Joel Jacobson <joel@compiler.org> wrote: > > Nice! This also simplifies when dealing with non-negative integers represented as byte arrays, > common in e.g. cryptography code. > Ah, interesting. I hadn't thought of that use-case. > create function numeric_from_bytes(bytea) returns numeric language sql as $$ > select ('0'||right($1::text,-1))::numeric > $$; > > Would we want a built-in function for this? Not sure. It does feel a bit niche. It's quite common in other programming languages, but that doesn't mean that a lot of Postgres users need it. Perhaps start a new thread to gauge people's interest? Regards, Dean
On 13.01.23 11:01, Dean Rasheed wrote:
> So I'm feeling quite good about the end result -- I set out hoping not
> to make performance noticeably worse, but ended up making it
> So I'm feeling quite good about the end result -- I set out hoping not
> to make performance noticeably worse, but ended up making it
> significantly better.
Hi Dean, thanks for your work.
But since PG_RETURN_NULL, is a simple return,
now the "value" var is not leaked?
If not, sorry for the noise.
regards,
Ranier Vilela
Attachment
On Tue, 24 Jan 2023 at 00:47, Ranier Vilela <ranier.vf@gmail.com> wrote: > > On 13.01.23 11:01, Dean Rasheed wrote: > > So I'm feeling quite good about the end result -- I set out hoping not > > to make performance noticeably worse, but ended up making it > > significantly better. > Hi Dean, thanks for your work. > > But since PG_RETURN_NULL, is a simple return, > now the "value" var is not leaked? > That originates from a prior commit: ccff2d20ed Convert a few datatype input functions to use "soft" error reporting. and see also a bunch of follow-on commits for other input functions. It will only return NULL if the input is invalid and escontext is non-NULL. You only identified a fraction of the cases where that would happen. If we really cared about not leaking memory for invalid inputs, we'd have to look at every code path using ereturn() (including lower-level functions, and not just in numeric.c). I think that would be a waste of time, and counterproductive -- trying to immediately free memory for all possible invalid inputs would likely complicate a lot of code, and slow down parsing of valid inputs. Better to leave it until the owning memory context is freed. Regards, Dean
Em ter., 24 de jan. de 2023 às 07:24, Dean Rasheed <dean.a.rasheed@gmail.com> escreveu:
On Tue, 24 Jan 2023 at 00:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> On 13.01.23 11:01, Dean Rasheed wrote:
> > So I'm feeling quite good about the end result -- I set out hoping not
> > to make performance noticeably worse, but ended up making it
> > significantly better.
> Hi Dean, thanks for your work.
>
> But since PG_RETURN_NULL, is a simple return,
> now the "value" var is not leaked?
>
That originates from a prior commit:
ccff2d20ed Convert a few datatype input functions to use "soft" error reporting.
and see also a bunch of follow-on commits for other input functions.
It will only return NULL if the input is invalid and escontext is
non-NULL. You only identified a fraction of the cases where that would
happen. If we really cared about not leaking memory for invalid
inputs, we'd have to look at every code path using ereturn()
(including lower-level functions, and not just in numeric.c). I think
that would be a waste of time, and counterproductive -- trying to
immediately free memory for all possible invalid inputs would likely
complicate a lot of code, and slow down parsing of valid inputs.
Better to leave it until the owning memory context is freed.
Thank you for the explanation.
regards,
Ranier Vilela