Thread: refactoring - share str2*int64 functions

refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Hello Tom,

>> Yep, but ISTM that it is down to 32 bits,
>
> Only on 32-bit-long machines, which are a dwindling minority (except
> for Windows, which I don't really care about).
>
>> So the third short is now always 0. Hmmm. I'll propose another option over
>> the week-end.
>
> I suppose we could put pg_strtouint64 somewhere where pgbench can use it,
> but TBH I don't think it's worth the trouble.  The set of people using
> the --random-seed=int option at all is darn near empty, I suspect,
> and the documentation only says you can write an int there.

Although I agree it is not worth a lot of trouble, and even if I don't do 
Windows, I think it valuable that the behavior is the same on all 
platform. The attached match shares pg_str2*int64 functions between 
frontend and backend by moving them to "common/", which avoids some code 
duplication.

This is more refactoring, and it fixes the behavior change on 32 bit 
architectures.

-- 
Fabien.



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
As usual, it is better with the attachement. Sorry for the noise.

>>> Yep, but ISTM that it is down to 32 bits,
>> 
>> Only on 32-bit-long machines, which are a dwindling minority (except
>> for Windows, which I don't really care about).
>> 
>>> So the third short is now always 0. Hmmm. I'll propose another option over
>>> the week-end.
>> 
>> I suppose we could put pg_strtouint64 somewhere where pgbench can use it,
>> but TBH I don't think it's worth the trouble.  The set of people using
>> the --random-seed=int option at all is darn near empty, I suspect,
>> and the documentation only says you can write an int there.
>
> Although I agree it is not worth a lot of trouble, and even if I don't do 
> Windows, I think it valuable that the behavior is the same on all platform. 
> The attached match shares pg_str2*int64 functions between frontend and 
> backend by moving them to "common/", which avoids some code duplication.
>
> This is more refactoring, and it fixes the behavior change on 32 bit 
> architectures.
>
>

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
>> Although I agree it is not worth a lot of trouble, and even if I don't do 
>> Windows, I think it valuable that the behavior is the same on all platform. 
>> The attached match shares pg_str2*int64 functions between frontend and 
>> backend by moving them to "common/", which avoids some code duplication.
>> 
>> This is more refactoring, and it fixes the behavior change on 32 bit 
>> architectures.

V2 is a rebase.

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Thomas Munro
Date:
On Fri, May 24, 2019 at 3:23 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> >> Although I agree it is not worth a lot of trouble, and even if I don't do
> >> Windows, I think it valuable that the behavior is the same on all platform.
> >> The attached match shares pg_str2*int64 functions between frontend and
> >> backend by moving them to "common/", which avoids some code duplication.
> >>
> >> This is more refactoring, and it fixes the behavior change on 32 bit
> >> architectures.
>
> V2 is a rebase.

Hi Fabien,

Here's some semi-automated feedback, noted while going through
failures on cfbot.cputube.org.  You have a stray editor file
src/backend/parser/parse_node.c.~1~.  Something is failing to compile
while doing the temp-install in make check-world, which probably
indicates that some test or contrib module is using the interface you
changed?

-- 
Thomas Munro
https://enterprisedb.com



Re: refactoring - share str2*int64 functions

From
Thomas Munro
Date:
On Mon, Jul 8, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's some semi-automated feedback, noted while going through
> failures on cfbot.cputube.org.  You have a stray editor file
> src/backend/parser/parse_node.c.~1~.  Something is failing to compile
> while doing the temp-install in make check-world, which probably
> indicates that some test or contrib module is using the interface you
> changed?

Please disregard the the comment about the ".~1~" file, my mistake.
As for the check-world failure, it's here:

pg_stat_statements.c:1024:11: error: implicit declaration of function
'pg_strtouint64' is invalid in C99
[-Werror,-Wimplicit-function-declaration]
                        rows = pg_strtouint64(completionTag + 5, NULL, 10);
                               ^
Apparently it needs to include common/string.h.

-- 
Thomas Munro
https://enterprisedb.com



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Hello Thomas,

> pg_stat_statements.c:1024:11: error: implicit declaration of function
> 'pg_strtouint64' is invalid in C99
> [-Werror,-Wimplicit-function-declaration]
>                        rows = pg_strtouint64(completionTag + 5, NULL, 10);
>                               ^
> Apparently it needs to include common/string.h.

Yep, I gathered that as well, but did not act promptly on your help.
Thanks for it!

Here is the updated patch on which I checked "make check-world".

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Thomas Munro
Date:
On Sun, Jul 14, 2019 at 3:22 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Here is the updated patch on which I checked "make check-world".

Thanks!  So, we're moving pg_strtouint64() to a place where frontend
code can use it, and getting rid of some duplication.  I like it.  I
wanted this once before myself[1].

+extern bool pg_strtoint64(const char *str, bool errorOK, int64 *result);
+extern uint64 pg_strtouint64(const char *str, char **endptr, int base);

One of these things is not like the other.  Let's see... the int64
version is used only by pgbench and is being promoted to common where
it can be used by more code.  With a name like that, wouldn't it make
sense to bring it into line with the uint64 interface, and then move
pgbench's error reporting stuff back into pgbench?  The uint64 one
derives its shape from the family of standard functions like strtol()
so I think it wins.

[1] https://www.postgresql.org/message-id/CAEepm=2KeC8xDbEWgDTDObXGqPHFW4kcD7BZXR6NMfiHjjnKhQ@mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Hello Thomas,

> +extern bool pg_strtoint64(const char *str, bool errorOK, int64 *result);
> +extern uint64 pg_strtouint64(const char *str, char **endptr, int base);
>
> One of these things is not like the other.

Indeed.

I agree that it is unfortunate, and it was bothering me a little as well.

> Let's see... the int64 version is used only by pgbench and is being 
> promoted to common where it can be used by more code.

No and yes.

The pgbench code was a copy of server-side internal "scanint8", so it is 
used both by pgbench and the server-side handling of "int8", it is used 
significantly, taking advantage of its versatile error reporting feature 
on both sides.

> With a name like that, wouldn't it make sense to bring it into line with 
> the uint64 interface, and then move pgbench's error reporting stuff back 
> into pgbench?

That would need moving the server-side error handling as well, which I 
would not really be happy with.

> The uint64 one derives its shape from the family of standard functions 
> like strtol() so I think it wins.

Yep, it cannot be changed either.

I do not think that changing the error handling capability is appropriate, 
it is really a feature of the function. The function could try to use an 
internal pg_strtoint64 which would look like the other unsigned version, 
but then it would not differentiate the various error conditions (out of 
range vs syntax error).

The compromise I can offer is to change the name of the first one, say to 
"pg_scanint8" to reflect its former backend name. Attached a v4 which does 
a renaming so as to avoid the name similarity but signature difference. I 
also made both error messages identical.

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Thomas Munro
Date:
On Mon, Jul 15, 2019 at 5:08 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> The compromise I can offer is to change the name of the first one, say to
> "pg_scanint8" to reflect its former backend name. Attached a v4 which does
> a renaming so as to avoid the name similarity but signature difference. I
> also made both error messages identical.

Cool.  I'm not exactly sure when we should include 'pg_' in identifier
names.  It seems to be used for functions/macros that wrap or replace
something else with a similar name, like pg_pwrite(),
pg_attribute_noreturn(), ...  In this case it's just our own code that
we're moving, so I'm wondering if we should just call it scanint8().

If you agree, I think this is ready to commit.

-- 
Thomas Munro
https://enterprisedb.com



Re: refactoring - share str2*int64 functions

From
Alvaro Herrera
Date:
On 2019-Jul-16, Thomas Munro wrote:

> On Mon, Jul 15, 2019 at 5:08 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > The compromise I can offer is to change the name of the first one, say to
> > "pg_scanint8" to reflect its former backend name. Attached a v4 which does
> > a renaming so as to avoid the name similarity but signature difference. I
> > also made both error messages identical.
> 
> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
> names.  It seems to be used for functions/macros that wrap or replace
> something else with a similar name, like pg_pwrite(),
> pg_attribute_noreturn(), ...  In this case it's just our own code that
> we're moving, so I'm wondering if we should just call it scanint8().

Isn't it annoying that pg_strtouint64() has an implementation that
suggests that it ought to be in src/port?  The fact that the signatures
are so different suggests to me that we should indeed put them separate.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Tue, Jul 16, 2019 at 11:16:31AM +1200, Thomas Munro wrote:
> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
> names.  It seems to be used for functions/macros that wrap or replace
> something else with a similar name, like pg_pwrite(),
> pg_attribute_noreturn(), ...  In this case it's just our own code that
> we're moving, so I'm wondering if we should just call it scanint8().

FWIW, I was looking forward to putting my hands on this patch and try
to get it merged so as we can get rid of those duplications.  Here are
some comments.

+#ifdef FRONTEND
+       fprintf(stderr,
+               "invalid input syntax for type %s: \"%s\"\n",
"bigint", str);
+#else
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                errmsg("invalid input syntax for type %s: \"%s\"",
+                       "bigint", str)));
+#endif
Have you looked at using the wrapper pg_log_error() here?

+extern bool pg_scanint8(const char *str, bool errorOK, int64
*result);
+extern uint64 pg_strtouint64(const char *str, char **endptr, int
base);
Hmm.  With this patch we have strtoint and pg_strtouint64, which makes
the whole set inconsistent.

+
 #endif                         /* COMMON_STRING_H */
Noise diff..

numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
become rather inconsistent with inconsistent APIs for the manipulation
of int2 and int4 fields, and scanint8 is just a derivative of the same
logic.  We have two categories of routines here:
- The wrappers on top of strtol and strtoul & co, which are named
respectively strtoint and pg_strtouint64 with the patch.  The naming
part is inconsistent, and we only handle uint64 and int32.  We don't
need to worry about int64 and uint32 because they are not used?
That's fine by me, but at least let's have a consistent naming.
Prefixing the functions with pg_* is a better practice in my opinion
as we will unlikely run into conflicts this way.
- The str->integer conversion routines, which actually have very
similar characteristics to the strtol families as they remove trailing
whitespaces first, check for a sign, etc, except that they work only
on base 10.  And here we get into a state where pg_scanint8 should be
actually called pg_strtoint64, with an interface inconsistent with its
int32/int16 relatives now only in the backend.  Could we consider more
consolidation here please?  Like moving the whole set to src/common/?

> If you agree, I think this is ready to commit.

Thomas, are you planning to look at this patch as committer?  I had it
in my agenda, and was planning to look at it sooner than later.  Now
if you are on it, I won't step on your toes.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Thomas Munro
Date:
On Tue, Jul 16, 2019 at 7:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> Thomas, are you planning to look at this patch as committer?  I had it
> in my agenda, and was planning to look at it sooner than later.  Now
> if you are on it, I won't step on your toes.

Hi Michael, please go ahead, it's all yours.

-- 
Thomas Munro
https://enterprisedb.com



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Hello Thomas,

> On Mon, Jul 15, 2019 at 5:08 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> The compromise I can offer is to change the name of the first one, say to
>> "pg_scanint8" to reflect its former backend name. Attached a v4 which does
>> a renaming so as to avoid the name similarity but signature difference. I
>> also made both error messages identical.
>
> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
> names.  It seems to be used for functions/macros that wrap or replace
> something else with a similar name, like pg_pwrite(),
> pg_attribute_noreturn(), ...  In this case it's just our own code that
> we're moving, so I'm wondering if we should just call it scanint8().

I added the pg_ prefix as a poor man's namespace because the function can 
be used by external tools (eg contribs), so as to avoid potential name 
conflicts.

I agree that such conflicts are less probable if the name does not replace 
something existing.

> If you agree, I think this is ready to commit.

It can be removed, or not. So you do as you feel.

-- 
Fabien.



Re: refactoring - share str2*int64 functions

From
Robert Haas
Date:
On Jul 16, 2019, at 3:30 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

>> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
>> names.  It seems to be used for functions/macros that wrap or replace
>> something else with a similar name, like pg_pwrite(),
>> pg_attribute_noreturn(), ...  In this case it's just our own code that
>> we're moving, so I'm wondering if we should just call it scanint8().
>
> I added the pg_ prefix as a poor man's namespace because the function can be used by external tools (eg contribs), so
asto avoid potential name conflicts. 

Yeah, I think if we are going to expose it to front end code there is a good argument for some kind of prefix that
makesit sound PostgreSQL-related. 

...Robert



Re: refactoring - share str2*int64 functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Jul 16, 2019, at 3:30 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>>> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
>>> names.

>> I added the pg_ prefix as a poor man's namespace because the function can be used by external tools (eg contribs),
soas to avoid potential name conflicts. 

> Yeah, I think if we are going to expose it to front end code there is a good argument for some kind of prefix that
makesit sound PostgreSQL-related. 

Yeah, I'd tend to err in favor of including "pg_".  We might get away
without that as long as the name is never exposed to non-PG code, but
for stuff that's going into src/common/ or src/port/ I think that's
a risky assumption to make.

I'm also in agreement with Michael's comments in
<20190716071144.GF1439@paquier.xyz> that this would be a good time
to bring some consistency to the naming of related functions.

            regards, tom lane



Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-15 07:08:42 +0200, Fabien COELHO wrote:
> I do not think that changing the error handling capability is appropriate,
> it is really a feature of the function. The function could try to use an
> internal pg_strtoint64 which would look like the other unsigned version, but
> then it would not differentiate the various error conditions (out of range
> vs syntax error).

> The compromise I can offer is to change the name of the first one, say to
> "pg_scanint8" to reflect its former backend name. Attached a v4 which does a
> renaming so as to avoid the name similarity but signature difference. I also
> made both error messages identical.

I think the interface of that function is not that good, and the "scan"
in the name isn't great for discoverability (for one it's a different
naming than pg_strtoint16 etc), and the *8 meaning 64bit is confusing
enough in the backend, we definitely shouldn't extend that to frontend
code.

Referencing "bigint" and "input syntax" from frontend code imo doesn't
make a lot of sense. And int8in is the only caller that uses
errorOK=False anyway, so there's currently no need for the frontend
error strings afaict.

ISTM that something like

typedef enum StrToIntConversion
{
        STRTOINT_OK = 0,
        STRTOINT_SYNTAX_ERROR = 1,
        STRTOINT_OUT_OF_RANGE = 2
} StrToIntConversion;
StrToIntConversion pg_strtoint64(const char *str, int64 *result);

would make more sense.

There is the issue that there already is pg_strtoint16 and
pg_strtoint32, which do not have the option to not raise an error.  I'd
probably name the non-error throwing ones something like pg_strtointNN_e
(for extended, or error handling), and have pg_strtointNN wrappers that
just handle the errors, or reverse the naming (which'd cause a bit of
churn, but not that much).

That'd also make the code for pg_strtointNN a bit nicer, because we'd
not need the gotos anymore, they're just there to avoid redundant error
messages - which'd not be an issue if the error handling were just a
switch in a separate function. E.g.

int32
pg_strtoint32(const char *s)
{
    int32 result;

    switch (pg_strtoint32_e(s, &result))
    {
            case STRTOINT_OK:
                 return result;

            case STRTOINT_SYNTAX_ERROR:
                ereport(ERROR,
                        (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                         errmsg("value \"%s\" is out of range for type %s",
                                s, "integer")));

            case STRTOINT_OUT_OF_RANGE:
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                         errmsg("invalid input syntax for type %s: \"%s\"",
                                "integer", s)));

    }

    return 0;                   /* keep compiler quiet */
}

which does seem nicer than what we have right now.


Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-16 16:11:44 +0900, Michael Paquier wrote:

> numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
> become rather inconsistent with inconsistent APIs for the manipulation
> of int2 and int4 fields, and scanint8 is just a derivative of the same
> logic.  We have two categories of routines here:

> - The wrappers on top of strtol and strtoul & co, which are named
> respectively strtoint and pg_strtouint64 with the patch.  The naming
> part is inconsistent, and we only handle uint64 and int32.  We don't
> need to worry about int64 and uint32 because they are not used?
> That's fine by me, but at least let's have a consistent naming.

Yea, consistent naming seems like a strong requirement
here. Additionally I think we should just provide a consistent set
rather than what's needed just now. That'll just lead to people
inventing their own again down the line.


> Prefixing the functions with pg_* is a better practice in my opinion
> as we will unlikely run into conflicts this way.

+1


> - The str->integer conversion routines, which actually have very
> similar characteristics to the strtol families as they remove trailing
> whitespaces first, check for a sign, etc, except that they work only
> on base 10.

There's afaict neither a caller that needs the base argument at the
moment, nor one in the tree previously. I'd argue for just making
pg_strtouint64's API consistent.

I'd probably also just use the implementation we have for signed
integers (minus the relevant negation and overflow checks, obviously) -
it's a lot faster, and I think there's value in keeping the
implementations in sync.


Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Tue, Jul 16, 2019 at 01:18:38PM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-07-16 16:11:44 +0900, Michael Paquier wrote:
> Yea, consistent naming seems like a strong requirement
> here. Additionally I think we should just provide a consistent set
> rather than what's needed just now. That'll just lead to people
> inventing their own again down the line.

Agreed.  The first versions of pg_rewind in the tree have been using
copy_file_range(), which has been introduced in Linux.

> > - The str->integer conversion routines, which actually have very
> > similar characteristics to the strtol families as they remove trailing
> > whitespaces first, check for a sign, etc, except that they work only
> > on base 10.
>
> There's afaict neither a caller that needs the base argument at the
> moment, nor one in the tree previously. I'd argue for just making
> pg_strtouint64's API consistent.

Good point, indeed, this could be much more simplified.  I have not
paid attention at that part.

> I'd probably also just use the implementation we have for signed
> integers (minus the relevant negation and overflow checks, obviously) -
> it's a lot faster, and I think there's value in keeping the
> implementations in sync.

You mean that it is much faster than the set of wrappers for strtol
than we have?  Is that because we don't care about the base?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Tue, Jul 16, 2019 at 01:04:38PM -0700, Andres Freund wrote:
> There is the issue that there already is pg_strtoint16 and
> pg_strtoint32, which do not have the option to not raise an error.  I'd
> probably name the non-error throwing ones something like pg_strtointNN_e
> (for extended, or error handling), and have pg_strtointNN wrappers that
> just handle the errors, or reverse the naming (which'd cause a bit of
> churn, but not that much).
>
> That'd also make the code for pg_strtointNN a bit nicer, because we'd
> not need the gotos anymore, they're just there to avoid redundant error
> messages - which'd not be an issue if the error handling were just a
> switch in a separate function. E.g.

Agreed on that.  I am wondering if we should use a common wrapper for
all the internal functions taking in input a set of bits16 flags to
control its behavior and put all that into common/script.c:
- Issue an error.
- Check for signedness.
- Base length: 16, 32 or 64.
This would have the advantage to move the error string generation, the
trailing whitespace check, sign handling and such in a single place.
We could have the internal routine return uint64 which is casted
afterwards to a proper result depending on what we use.  (Perhaps
that's what you actually meant?)

I would also rather not touch the strtol wrappers that we have able to
handle the base.  There is nothing in the tree using it, but likely
there are extensions relying on it.  Switching all the existing
callers in the tree to the new routines sounds good to me of course.

Consolidating all that still needs more work, so for now I am
switching the patch as waiting on author.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

> FWIW, I was looking forward to putting my hands on this patch and try
> to get it merged so as we can get rid of those duplications.  Here are
> some comments.
>
> +#ifdef FRONTEND
> +       fprintf(stderr,
> +               "invalid input syntax for type %s: \"%s\"\n",
> "bigint", str);
> +#else
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> +                errmsg("invalid input syntax for type %s: \"%s\"",
> +                       "bigint", str)));
> +#endif
> Have you looked at using the wrapper pg_log_error() here?

I have not.

I have simply merged the two implementations (pgbench & backend) as they 
were.

> +extern bool pg_scanint8(const char *str, bool errorOK, int64 *result);
> +extern uint64 pg_strtouint64(const char *str, char **endptr, int base);

> Hmm.  With this patch we have strtoint and pg_strtouint64, which makes
> the whole set inconsistent.

I understand that you mean bits vs bytes? Indeed it can bite!

> +
> #endif                         /* COMMON_STRING_H */
> Noise diff..

Indeed.

> numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
> become rather inconsistent with inconsistent APIs for the manipulation
> of int2 and int4 fields, and scanint8 is just a derivative of the same
> logic.  We have two categories of routines here:

Yep, but the int2/4 functions are not used elsewhere.

> - The wrappers on top of strtol and strtoul & co, which are named
> respectively strtoint and pg_strtouint64 with the patch.  The naming
> part is inconsistent, and we only handle uint64 and int32.  We don't
> need to worry about int64 and uint32 because they are not used?

Indeed, it seems that they are not needed/used by client code, AFAICT.

> That's fine by me, but at least let's have a consistent naming.

Ok.

> Prefixing the functions with pg_* is a better practice in my opinion
> as we will unlikely run into conflicts this way.

Ok.

> - The str->integer conversion routines, which actually have very
> similar characteristics to the strtol families as they remove trailing
> whitespaces first, check for a sign, etc, except that they work only
> on base 10.  And here we get into a state where pg_scanint8 should be
> actually called pg_strtoint64,

I just removed that:-)

ISTM that the issue is that the error handling of these functions is 
pretty different.

> with an interface inconsistent with its int32/int16 relatives now only 
> in the backend.

We can, but I'm not at ease with changing the error handling approach.

> Could we consider more consolidation here please?  Like moving the whole 
> set to src/common/?

My initial plan was simply to remove direct code duplications between 
front-end and back-end, not to re-engineer the full set of string to int 
conversion functions:-)

On the re-engineering front: Given the various points on the thread, ISTM 
that there should probably be two behaviors for str to signed/unsigned 
int{16,32,64}, and having only one kind of signature for all types would 
be definitely better.

One low-level one that does the conversion or return an error.

Another higher-level one which possibly adds an error message (stderr for 
front-end, log for back-end).

One choice is whether there are two functions (the higher one calling the 
lower one and adding the messages) or just one with a boolean to trigger 
the messages. I do not have a strong opinion. Maybe one function would be 
simpler. Andres boolean-compatible enum return looks like a good option.

Overall, this leads to something like:

enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
   pg_strto{,u}int{8?,16,32,64}
     (const char * string, const TYPE * result, const bool verbose);

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Wed, Jul 17, 2019 at 07:55:39AM +0000, Fabien COELHO wrote:
>> numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
>> become rather inconsistent with inconsistent APIs for the manipulation
>> of int2 and int4 fields, and scanint8 is just a derivative of the same
>> logic.  We have two categories of routines here:
>
> Yep, but the int2/4 functions are not used elsewhere.

The worry is more about having people invent the same stuff all over
again.  If we can get a clean interface, that would ease adoption.
Hopefully.

> Overall, this leads to something like:
>
> enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
>   pg_strto{,u}int{8?,16,32,64}
>     (const char * string, const TYPE * result, const bool verbose);

Something like that.  "verbose" may mean "error_ok" though.  Not
having 6 times the same trailing whitespace checks and such would be
nice.

Actually, one thing which may be a problem is that we lack currently
the equivalents of pg_mul_s16_overflow and such for unsigned
integers.  The point of contention comes from pgbench's
set_random_seed() in this case as we can expect an unsigned seed as
the docs say.  But if we give up on the signedness of the random seed
which remains with 8 bytes, then we could let pg_strtouint64 as
backend-only and only worry about porting this set of APIs for signed
integers.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-17 12:18:19 +0900, Michael Paquier wrote:
> On Tue, Jul 16, 2019 at 01:04:38PM -0700, Andres Freund wrote:
> > There is the issue that there already is pg_strtoint16 and
> > pg_strtoint32, which do not have the option to not raise an error.  I'd
> > probably name the non-error throwing ones something like pg_strtointNN_e
> > (for extended, or error handling), and have pg_strtointNN wrappers that
> > just handle the errors, or reverse the naming (which'd cause a bit of
> > churn, but not that much).
> > 
> > That'd also make the code for pg_strtointNN a bit nicer, because we'd
> > not need the gotos anymore, they're just there to avoid redundant error
> > messages - which'd not be an issue if the error handling were just a
> > switch in a separate function. E.g.
> 
> Agreed on that.  I am wondering if we should use a common wrapper for
> all the internal functions taking in input a set of bits16 flags to
> control its behavior and put all that into common/script.c:
> - Issue an error.
> - Check for signedness.
> - Base length: 16, 32 or 64.

That'd be considerably slower, so I'm *strongly* against that. These
conversion routines are *really* hot in a number of workloads,
e.g. bulk-loading with COPY.  Check e.g.
https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de


> I would also rather not touch the strtol wrappers that we have able to
> handle the base.  There is nothing in the tree using it, but likely
> there are extensions relying on it.

I doubt it - it's not of that long-standing vintage (23a27b039d9,
2016-03-12), and if so they are very likely to use base 10. We shouldn't
keep some barely tested function around, just for the hypothetical
scenario that some extension uses it. Especially if that function is
considerably slower than the potential replacement.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-17 12:04:32 +0900, Michael Paquier wrote:
> On Tue, Jul 16, 2019 at 01:18:38PM -0700, Andres Freund wrote:
> > I'd probably also just use the implementation we have for signed
> > integers (minus the relevant negation and overflow checks, obviously) -
> > it's a lot faster, and I think there's value in keeping the
> > implementations in sync.
> 
> You mean that it is much faster than the set of wrappers for strtol
> than we have?  Is that because we don't care about the base?

Yes: https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de

Not caring about the base is one significant part, that removes a fair
bit of branches and more importantly allows the compiler to replace
divisions with much faster code (glibc tries to avoid the division too,
with lookup tables, but that's still expensive). Additionally there's
also some locale awareness in strtoll etc that we don't need. It's also
plainly not that well implemented at least in glibc and musl.

Having an implementation that reliably works the same across all
platforms is also advantageous.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-17 07:55:39 +0000, Fabien COELHO wrote:
> > - The str->integer conversion routines, which actually have very
> > similar characteristics to the strtol families as they remove trailing
> > whitespaces first, check for a sign, etc, except that they work only
> > on base 10.  And here we get into a state where pg_scanint8 should be
> > actually called pg_strtoint64,
> 
> I just removed that:-)

What do you mean by that?


> > with an interface inconsistent with its int32/int16 relatives now only
> > in the backend.
> 
> We can, but I'm not at ease with changing the error handling approach.

Why?


> > Could we consider more consolidation here please?  Like moving the whole
> > set to src/common/?
> 
> My initial plan was simply to remove direct code duplications between
> front-end and back-end, not to re-engineer the full set of string to int
> conversion functions:-)

Well, if you expose functions to more places - e.g. now the whole
frontend - it becomes more important that they're reasonably designed.


> On the re-engineering front: Given the various points on the thread, ISTM
> that there should probably be two behaviors for str to signed/unsigned
> int{16,32,64}, and having only one kind of signature for all types would be
> definitely better.

I don't understand why we'd want to have different behaviours for
signed/unsigned? Maybe I'm mis-understanding your sentence, and you just
mean that there should be one that throws, and one that returns an
errorcode?


> Another higher-level one which possibly adds an error message (stderr for
> front-end, log for back-end).

Is there actually any need for a non-backend one that has an
error-message?  I'm not convinced that in the frontend it's very useful
to have such a function that exits - in the backend we have a much more
complete way to handle that, including pointing to the right location in
the query strings etc.


> One choice is whether there are two functions (the higher one calling the
> lower one and adding the messages) or just one with a boolean to trigger the
> messages. I do not have a strong opinion. Maybe one function would be
> simpler. Andres boolean-compatible enum return looks like a good option.

The boolean makes the calling code harder to understand, the function
slower, and the code harder to grep.


> Overall, this leads to something like:
> 
> enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
>   pg_strto{,u}int{8?,16,32,64}
>     (const char * string, const TYPE * result, const bool verbose);

What's with hthe const for the result? I assume that was just a copy&pasto?

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-17 17:29:58 +0900, Michael Paquier wrote:
> Actually, one thing which may be a problem is that we lack currently
> the equivalents of pg_mul_s16_overflow and such for unsigned
> integers.

It's much simpler to implement them for unsigned than for signed,
because unsigned overflow is well-defined. So I'd not be particularly
worried about just adding them.  E.g. comparing the "slow" version of
pg_mul_s64_overflow() with an untested implementation of
pg_mul_u64_overflow():

pg_mul_s64_overflow:
    /*
     * Overflow can only happen if at least one value is outside the range
     * sqrt(min)..sqrt(max) so check that first as the division can be quite a
     * bit more expensive than the multiplication.
     *
     * Multiplying by 0 or 1 can't overflow of course and checking for 0
     * separately avoids any risk of dividing by 0.  Be careful about dividing
     * INT_MIN by -1 also, note reversing the a and b to ensure we're always
     * dividing it by a positive value.
     *
     */
    if ((a > PG_INT32_MAX || a < PG_INT32_MIN ||
         b > PG_INT32_MAX || b < PG_INT32_MIN) &&
        a != 0 && a != 1 && b != 0 && b != 1 &&
        ((a > 0 && b > 0 && a > PG_INT64_MAX / b) ||
         (a > 0 && b < 0 && b < PG_INT64_MIN / a) ||
         (a < 0 && b > 0 && a < PG_INT64_MIN / b) ||
         (a < 0 && b < 0 && a < PG_INT64_MAX / b)))
    {
        *result = 0x5EED;        /* to avoid spurious warnings */
        return true;
    }
    *result = a * b;
    return false;

pg_mul_s64_overflow:

        /*
         * Checking for unsigned overflow is simple, just check
         * if reversing the multiplication indicates that the
         * multiplication overflowed.
         */
        int64 res = a * b;
        if (a != 0 && b != res / a)
        {
        *result = 0x5EED;        /* to avoid spurious warnings */
        return true;
    }
    *result = res;
    return false;


The cases for addition/subtraction are even easier:
addition:
res = a + b;
if (res < a)
   /* overflow */

subtraction:
if (a < b)
   /* underflow */
res  = a - b;

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:

>>> - The str->integer conversion routines, which actually have very
>>> similar characteristics to the strtol families as they remove trailing
>>> whitespaces first, check for a sign, etc, except that they work only
>>> on base 10.  And here we get into a state where pg_scanint8 should be
>>> actually called pg_strtoint64,
>>
>> I just removed that:-)
>
> What do you mean by that?

That I renamed something from a previous patch version and someone is 
complaining that I did.

>>> with an interface inconsistent with its int32/int16 relatives now only
>>> in the backend.
>>
>> We can, but I'm not at ease with changing the error handling approach.
>
> Why?

If a function reports an error to log, it should keep on doing it, 
otherwise there would be a regression.

>>> Could we consider more consolidation here please?  Like moving the whole
>>> set to src/common/?
>>
>> My initial plan was simply to remove direct code duplications between
>> front-end and back-end, not to re-engineer the full set of string to int
>> conversion functions:-)
>
> Well, if you expose functions to more places - e.g. now the whole
> frontend - it becomes more important that they're reasonably designed.

I can somehow only agree with that. Note that the contraposite assumption 
that badly designed functions would be okay for backend seems doubtful.

>> On the re-engineering front: Given the various points on the thread, 
>> ISTM that there should probably be two behaviors for str to 
>> signed/unsigned int{16,32,64}, and having only one kind of signature 
>> for all types would be definitely better.
>
> I don't understand why we'd want to have different behaviours for
> signed/unsigned? Maybe I'm mis-understanding your sentence, and you just
> mean that there should be one that throws, and one that returns an
> errorcode?

Yep for the backend (if reporting an error generates a longjump), for the 
frontend there is no exception mechanism, so it is showing the error or 
not to stderr, and returning whether it was ok.

>> Another higher-level one which possibly adds an error message (stderr for
>> front-end, log for back-end).
>
> Is there actually any need for a non-backend one that has an
> error-message?

Pgbench uses it. If the function is shared and one is loging something, it 
looks ok to send to stderr for front-end?

>  I'm not convinced that in the frontend it's very useful to have such a 
> function that exits - in the backend we have a much more complete way to 
> handle that, including pointing to the right location in the query 
> strings etc.

Sure. There is not exit though, just messages to stderr and return false.

>> One choice is whether there are two functions (the higher one calling the
>> lower one and adding the messages) or just one with a boolean to trigger the
>> messages. I do not have a strong opinion. Maybe one function would be
>> simpler. Andres boolean-compatible enum return looks like a good option.
>
> The boolean makes the calling code harder to understand, the function
> slower,

Hmmm. So I understand that you would prefer 2 functions, one raw (fast) 
one and the other with the other with the better error reporting facity, 
and the user must chose the one they like. I'm fine with that as well.

> and the code harder to grep.

>> Overall, this leads to something like:
>>
>> enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
>>   pg_strto{,u}int{8?,16,32,64}
>>     (const char * string, const TYPE * result, const bool verbose);
>
> What's with hthe const for the result? I assume that was just a copy&pasto?

Yep. The pointer is constant, not the value pointed, maybe it should be 
"TYPE * const result" or something like that. Or no const at all on 
result.

-- 
Fabien.



Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-17 22:59:01 +0000, Fabien COELHO wrote:
> > > > with an interface inconsistent with its int32/int16 relatives now only
> > > > in the backend.
> > > 
> > > We can, but I'm not at ease with changing the error handling approach.
> > 
> > Why?
> 
> If a function reports an error to log, it should keep on doing it, otherwise
> there would be a regression.

Err, huh. Especially if we change the signature, I fail to see how it's
a regression if we change the behaviour.


> > > Another higher-level one which possibly adds an error message (stderr for
> > > front-end, log for back-end).
> > 
> > Is there actually any need for a non-backend one that has an
> > error-message?
> 
> Pgbench uses it. If the function is shared and one is loging something, it
> looks ok to send to stderr for front-end?

> >  I'm not convinced that in the frontend it's very useful to have such a
> > function that exits - in the backend we have a much more complete way to
> > handle that, including pointing to the right location in the query
> > strings etc.
> 
> Sure. There is not exit though, just messages to stderr and return false.

I think it's a seriously bad idea to have a function that returns
depending in the error case depending on whether it's frontend or
backend code.  We shouldn't do stuff like that, it just leads to bugs.


> > > One choice is whether there are two functions (the higher one calling the
> > > lower one and adding the messages) or just one with a boolean to trigger the
> > > messages. I do not have a strong opinion. Maybe one function would be
> > > simpler. Andres boolean-compatible enum return looks like a good option.
> > 
> > The boolean makes the calling code harder to understand, the function
> > slower,
> 
> Hmmm. So I understand that you would prefer 2 functions, one raw (fast) one
> and the other with the other with the better error reporting facity, and the
> user must chose the one they like. I'm fine with that as well.

Well, the one with error reporting would use the former.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Wed, Jul 17, 2019 at 11:14:28AM -0700, Andres Freund wrote:
> That'd be considerably slower, so I'm *strongly* against that. These
> conversion routines are *really* hot in a number of workloads,
> e.g. bulk-loading with COPY.  Check e.g.
> https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de

Thanks for the link.  That makes sense!  So stacking more function
calls could also be an issue.  Even if using static inline for the
inner wrapper?  That may sound like a stupid question but you have
likely more experience than me regarding that with profiling.

> I doubt it - it's not of that long-standing vintage (23a27b039d9,
> 2016-03-12), and if so they are very likely to use base 10. We shouldn't
> keep some barely tested function around, just for the hypothetical
> scenario that some extension uses it. Especially if that function is
> considerably slower than the potential replacement.

Okay, I won't fight hard on that either.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-18 09:28:28 +0900, Michael Paquier wrote:
> On Wed, Jul 17, 2019 at 11:14:28AM -0700, Andres Freund wrote:
> > That'd be considerably slower, so I'm *strongly* against that. These
> > conversion routines are *really* hot in a number of workloads,
> > e.g. bulk-loading with COPY.  Check e.g.
> > https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de
>
> Thanks for the link.  That makes sense!  So stacking more function
> calls could also be an issue.  Even if using static inline for the
> inner wrapper?  That may sound like a stupid question but you have
> likely more experience than me regarding that with profiling.

A static inline would be fine, depending on how you do that. I'm not
quite sure what you mean with "inner wrapper" - isn't a wrapper normally
outside?

I'd probably do something like

static inline int64
strtoint64(const char *str)
{
    int64 res;

    e = strtoint64_e(str, &res);
    if (likely(e == STRTOINT_OK))
        return res;
    else
    {
        report_strtoint_error(str, e, "int64");
        return 0; /* pacify compiler */
    }
}

and then have one non-inline report_strtoint_error() shared across the
various functions. Even leaving code-duplication aside, not having the
elog call itself in the inline function is nice, as that's quite a few
instructions.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Hello Andres,

>> If a function reports an error to log, it should keep on doing it, otherwise
>> there would be a regression.
>
> Err, huh. Especially if we change the signature, I fail to see how it's
> a regression if we change the behaviour.

ISTM that we do not understand one another, because I'm only trying to 
state the obvious. Currently error messages on overflow or syntax error 
are displayed. I just mean that such messages must keep on being emitted 
somehow otherwise there would be a regression.

   sql> SELECT INT8 '12345678901234567890';
   ERROR:  value "12345678901234567890" is out of range for type bigint
   LINE 1: SELECT INT8 '12345678901234567890';

>> Sure. There is not exit though, just messages to stderr and return false.
>
> I think it's a seriously bad idea to have a function that returns
> depending in the error case depending on whether it's frontend or
> backend code.  We shouldn't do stuff like that, it just leads to bugs.

Ok, you really want two functions and getting read of the errorOK boolean.
I got that. The scanint8 already exists with its ereport ERROR vs return 
based on a boolean, so the point is to get read of this kind of signature.

>>> The boolean makes the calling code harder to understand, the function
>>> slower,
>>
>> Hmmm. So I understand that you would prefer 2 functions, one raw (fast) one
>> and the other with the other with the better error reporting facity, and the
>> user must chose the one they like. I'm fine with that as well.
>
> Well, the one with error reporting would use the former.

Yep, possibly two functions, no boolean.

Having a common function will not escape the fact that there is no 
exception mechanism for front-end, and none seems desirable just for 
parsing ints. So if you want NOT to have a same function with return 
something vs raises an error, that would make three functions.

One basic int parsing in the fe/be common part.

   typedef enum { STRTOINT_OK, STRTOINT_OVERFLOW, STRTOINT_SYNTAX_ERROR }
     strtoint_error;

   strtoint_error pg_strtoint64(const char * str, int64 * result);

Then for backend, probably in backend somewhere:

   // raises an exception on errors
   void pg_strtoint64_log(const char * str, int64 * result) {
     switch (pg_strtoint64) {
       case STRTOINT_OK: return;
       case STRTOINT...: ereport(ERROR, ...);
     }
   }


And for frontend, only in frontend somewhere:

   // print to stderr and return false on error
   bool pg_strtoint64_err(const char * str, int64 * result);

I'm unhappy with the function names though, feel free to improve.

-- 
Fabien.



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Thu, Jul 18, 2019 at 07:57:41AM +0000, Fabien COELHO wrote:
> I'm unhappy with the function names though, feel free to improve.

I would have something rather close to what you are suggesting, still
not exactly that because we just don't care about the error strings
generated for the frontend.  And my bet is that each frontend would
like to have their own error message depending on the error case.

FWIW, I had a similar experience with pg_strong_random() not so long
ago, which required a backend-specific handling because the fallback
random implementation needed some tweaks at postmaster startup (that's
why we have an alias pg_backend_random in include/port.h).  So I would
recommend the following, roughly:
- One set of functions in src/port/ which return the status code for
the error handling, without any error reporting in it to avoid any
ifdef FRONTEND business, which have a generic name pg_strto[u]intXX
(XX = {16,32,64}).  And have all that in a new, separate file, say
strtoint.c?
- One set of functions for the backend, called pg_stro[u]intXX_backend
or pg_backend_stro[u]intXX which can take as extra argument error_ok,
calling the portions in src/port/, and move those functions in a new
file prefixed with "backend_" in src/backend/utils/misc/ with a name
consistent with the one in src/port/ (with the previous naming that
would be backend_strtoint.c)
- We also need the unsigned-specific equivalents of
pg_mul_s64_overflow and such, so I would suggest putting that in a new
header, simply uint.h.  If I finish by committing this stuff, I would
handle that in a separate commit.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-19 12:21:27 +0900, Michael Paquier wrote:
> On Thu, Jul 18, 2019 at 07:57:41AM +0000, Fabien COELHO wrote:
> > I'm unhappy with the function names though, feel free to improve.
> 
> I would have something rather close to what you are suggesting, still
> not exactly that because we just don't care about the error strings
> generated for the frontend.  And my bet is that each frontend would
> like to have their own error message depending on the error case.

Yea, the error messages pgbench is currently generating, for example,
don't make a lot of sense.

> FWIW, I had a similar experience with pg_strong_random() not so long
> ago, which required a backend-specific handling because the fallback
> random implementation needed some tweaks at postmaster startup (that's
> why we have an alias pg_backend_random in include/port.h).  So I would
> recommend the following, roughly:
> - One set of functions in src/port/ which return the status code for
> the error handling, without any error reporting in it to avoid any
> ifdef FRONTEND business, which have a generic name pg_strto[u]intXX
> (XX = {16,32,64}).  And have all that in a new, separate file, say
> strtoint.c?

Why not common? It's not a platform dependent bit. Could even be put
into the already existing string.c.


> - One set of functions for the backend, called pg_stro[u]intXX_backend
> or pg_backend_stro[u]intXX which can take as extra argument error_ok,
> calling the portions in src/port/, and move those functions in a new
> file prefixed with "backend_" in src/backend/utils/misc/ with a name
> consistent with the one in src/port/ (with the previous naming that
> would be backend_strtoint.c)

I'm not following. What would be the point of any of this?  The error_ok
bit is unnecessary, because the function is exactly the same as the
generic function.  And the backend_ prefix would be pretty darn weird,
given that that's already below src/backend.


> - We also need the unsigned-specific equivalents of
> pg_mul_s64_overflow and such, so I would suggest putting that in a new
> header, simply uint.h.  If I finish by committing this stuff, I would
> handle that in a separate commit.

Why not the same header? I fail to see what we'd gain by splitting it
up.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Thu, Jul 18, 2019 at 09:16:22PM -0700, Andres Freund wrote:
> On 2019-07-19 12:21:27 +0900, Michael Paquier wrote:
> Why not common? It's not a platform dependent bit. Could even be put
> into the already existing string.c.

That would be fine to me, it is not like this file is bloated now.

>> - One set of functions for the backend, called pg_stro[u]intXX_backend
>> or pg_backend_stro[u]intXX which can take as extra argument error_ok,
>> calling the portions in src/port/, and move those functions in a new
>> file prefixed with "backend_" in src/backend/utils/misc/ with a name
>> consistent with the one in src/port/ (with the previous naming that
>> would be backend_strtoint.c)
>
> I'm not following. What would be the point of any of this?  The error_ok
> bit is unnecessary, because the function is exactly the same as the
> generic function.  And the backend_ prefix would be pretty darn weird,
> given that that's already below src/backend.

Do you have a better idea of name for those functions?

>> - We also need the unsigned-specific equivalents of
>> pg_mul_s64_overflow and such, so I would suggest putting that in a new
>> header, simply uint.h.  If I finish by committing this stuff, I would
>> handle that in a separate commit.
>
> Why not the same header? I fail to see what we'd gain by splitting it
> up.

No objections to that at the end.

Fabien, are you planning to send an updated patch?  This stuff has
value.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

> Fabien, are you planning to send an updated patch?  This stuff has 
> value.

Yep, I will try for this week.

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-07-29 10:48:41 +0900, Michael Paquier wrote:
> On Thu, Jul 18, 2019 at 09:16:22PM -0700, Andres Freund wrote:
> >> - One set of functions for the backend, called pg_stro[u]intXX_backend
> >> or pg_backend_stro[u]intXX which can take as extra argument error_ok,
> >> calling the portions in src/port/, and move those functions in a new
> >> file prefixed with "backend_" in src/backend/utils/misc/ with a name
> >> consistent with the one in src/port/ (with the previous naming that
> >> would be backend_strtoint.c)
> > 
> > I'm not following. What would be the point of any of this?  The error_ok
> > bit is unnecessary, because the function is exactly the same as the
> > generic function.  And the backend_ prefix would be pretty darn weird,
> > given that that's already below src/backend.
> 
> Do you have a better idea of name for those functions?

I don't understand why they need any prefix. pg_strto[u]int{32,64}{,_checked}.
The unchecked variant would be for both frontend backend. The checked one
either for both frontend/backend, or just for backend. I also could live with
_raises, _throws or such instead of _checked. Implement all of them in one
file in /common/, possibly hidin gthe ones not currently implemented for the
frontend.

Imo if _checked is implemented for both frontend/backend they'd need
different error messages. In my opinion
out_of_range:
    if (!errorOK)
        fprintf(stderr,
                "value \"%s\" is out of range for type bigint\n", str);
    return false;

invalid_syntax:
    if (!errorOK)
        fprintf(stderr,
                "invalid input syntax for type bigint: \"%s\"\n", str);

is unsuitable for generic code. In fact, I'm doubtful that it's applicable for
any use, except for int8in(), which makes me think it better ought to use the
a non-checked variant, and include the errors directly.  If we still want to
have _checked - which is reasonable imo - it should refer to 64bit integers or somesuch.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Mon, Jul 29, 2019 at 07:04:09AM +0200, Fabien COELHO wrote:
> Bonjour Michaël,
> Yep, I will try for this week.

Please note that for now I have marked the patch as returned with
feedback as the CF is ending.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

>> Yep, I will try for this week.
>
> Please note that for now I have marked the patch as returned with
> feedback as the CF is ending.

Ok.

I have looked quickly at it, but I'm not sure that there is an agreement 
about what should be done precisely, so the feedback is not clearly 
actionable.

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Thu, Aug 01, 2019 at 09:00:41AM +0200, Fabien COELHO wrote:
> I have looked quickly at it, but I'm not sure that there is an agreement
> about what should be done precisely, so the feedback is not clearly
> actionable.

Per the latest trends, it seems that the input of Andres was kind of
the most interesting pieces.  If you don't have room for it, I would
not mind doing the legwork myself.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Michaël-san,

>> I have looked quickly at it, but I'm not sure that there is an agreement
>> about what should be done precisely, so the feedback is not clearly
>> actionable.
>
> Per the latest trends, it seems that the input of Andres was kind of
> the most interesting pieces.

Yes, definitely. I understood that we want in "string.h" something like
(just the spirit):

   typedef enum {
     STRTOINT_OK, STRTOINT_RANGE_ERROR, STRTOINT_SYNTAX_ERROR
   } strtoint_status;

   strtoint_status pg_strtoint64(const char * str, int64 * result);

However there is a contrary objective to have a unified interface,
but there also exists a:

   extern uint64 pg_strtouint64(const char *str, char **endptr, int base);

called 3 times, always with base == 10. We have a similar name but a 
totally different interface, so basically it would have to be replaced
by something like the first interface.

> If you don't have room for it, I would not mind doing the legwork 
> myself.

I think that it would be quick if the what is clear enough, so I can do 
it.

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Thu, Aug 01, 2019 at 11:34:34AM +0200, Fabien COELHO wrote:
> However there is a contrary objective to have a unified interface,
> but there also exists a:
>
>   extern uint64 pg_strtouint64(const char *str, char **endptr, int base);
>
> called 3 times, always with base == 10. We have a similar name but a totally
> different interface, so basically it would have to be replaced
> by something like the first interface.

My understanding on this one was to nuke the base argument and unify
the interface with our own, faster routines:
https://www.postgresql.org/message-id/20190716201838.rwrd7xzbrybq7dop%40alap3.anarazel.de
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
>>  extern uint64 pg_strtouint64(const char *str, char **endptr, int base);
>>
>> called 3 times, always with base == 10. We have a similar name but a totally
>> different interface, so basically it would have to be replaced
>> by something like the first interface.
>
> My understanding on this one was to nuke the base argument and unify
> the interface with our own, faster routines:
> https://www.postgresql.org/message-id/20190716201838.rwrd7xzbrybq7dop%40alap3.anarazel.de

Ok, so there is an agreement on reworking the unsigned function. I missed 
this bit.

So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and 
then possibly generalize to lower sizes, 32, 16, depending on what is 
actually needed.

-- 
Fabien.



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
Hi Fabien,

On Thu, Aug 01, 2019 at 04:48:35PM +0200, Fabien COELHO wrote:
> Ok, so there is an agreement on reworking the unsigned function. I missed
> this bit.
>
> So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and
> then possibly generalize to lower sizes, 32, 16, depending on what is
> actually needed.

I am interested in this patch, and the next commit fest is close by.
Are you working on an updated version?  If not, do you mind if I work
on it and post a new version by the beginning of next week based on
all the feedback gathered?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

>> So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and
>> then possibly generalize to lower sizes, 32, 16, depending on what is
>> actually needed.
>
> I am interested in this patch, and the next commit fest is close by.
> Are you working on an updated version?  If not, do you mind if I work
> on it and post a new version by the beginning of next week based on
> all the feedback gathered?

I have started to do something, and I can spend some time on that this 
week, but I'm pretty unclear about what exactly should be done.

The error returning stuff is simple enough, but I'm unclear about what to 
do with pg_uint64, which has a totally different signature. Should it be 
aligned?

-- 
Fabien Coelho - CRI, MINES ParisTech

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Mon, Aug 26, 2019 at 11:05:55AM +0200, Fabien COELHO wrote:
> I have started to do something, and I can spend some time on that this week,
> but I'm pretty unclear about what exactly should be done.

Thanks.

> The error returning stuff is simple enough, but I'm unclear about what to do
> with pg_uint64, which has a totally different signature. Should it be
> aligned?

I am not sure what you mean with aligned here.  If you mean
consistent, getting into a state where we have all functions for all
three sizes, signed and unsigned, would be nice.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

>> The error returning stuff is simple enough, but I'm unclear about what to do
>> with pg_uint64, which has a totally different signature. Should it be
>> aligned?
>
> I am not sure what you mean with aligned here.

I meant same signature.

> If you mean consistent, getting into a state where we have all functions 
> for all three sizes, signed and unsigned, would be nice.

Ok, I look into it.

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

>> So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and
>> then possibly generalize to lower sizes, 32, 16, depending on what is
>> actually needed.
>
> I am interested in this patch, and the next commit fest is close by.
> Are you working on an updated version?  If not, do you mind if I work
> on it and post a new version by the beginning of next week based on
> all the feedback gathered?

Here is an updated patch for the u?int64 conversion functions.

I have taken the liberty to optimize the existing int64 function by 
removing spurious tests. I have not created uint64 specific inlined 
overflow functions.

If it looks ok, a separate patch could address the 32 & 16 versions.

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Wed, Aug 28, 2019 at 08:51:29AM +0200, Fabien COELHO wrote:
> Here is an updated patch for the u?int64 conversion functions.

Thanks!

> I have taken the liberty to optimize the existing int64 function by removing
> spurious tests.

Which are?

> I have not created uint64 specific inlined overflow functions.

Why?  There is a comment below ;p

> If it looks ok, a separate patch could address the 32 & 16 versions.

I am surprised to see a negative diff actually just by doing that
(adding the 32 and 16 parts will add much more code of course).  At
quick glance, I think that this is on the right track.  Some comments
I have on the top of my mind:
- It would me good to have the unsigned equivalents of
pg_mul_s64_overflow, etc.  These are simple enough, and per the
feedback from Andres they could live in common/int.h.
- It is more consistent to use upper-case statuses in the enum
strtoint_status.  Could it be renamed to pg_strtoint_status?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Michaël,

>> I have taken the liberty to optimize the existing int64 function by removing
>> spurious tests.
>
> Which are?

  - *ptr && WHATEVER(*ptr)
   *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
   char but at the end. It might be debatable in some places, e.g. it is
   likely that there are no spaces in the string, but likely that there are
   more than one digit.

   If you want all/some *ptr added back, no problem.

  - isdigit repeated on if and following while, used if/do-while instead.

>> I have not created uint64 specific inlined overflow functions.
>
> Why?  There is a comment below ;p

See comment about comment below:-)

>> If it looks ok, a separate patch could address the 32 & 16 versions.
>
> I am surprised to see a negative diff

Is it? Long duplicate functions are factored out (this was my initial 
intent), one file is removed…

> actually just by doing that (adding the 32 and 16 parts will add much 
> more code of course).  At quick glance, I think that this is on the 
> right track.  Some comments I have on the top of my mind:

> - It would me good to have the unsigned equivalents of
> pg_mul_s64_overflow, etc.  These are simple enough,

Hmmm. Have you looked at the fallback cases when the corresponding 
builtins are not available?

I'm unsure of a reliable way to detect a generic unsigned int overflow 
without expensive dividing back and having to care about zero…

So I was pretty happy with my two discreet, small and efficient tests.

> and per the feedback from Andres they could live in common/int.h.

Could be, however "string.c" already contains a string to int conversion 
function, so I put them together. Probably this function should be 
removed in the end, though.

> - It is more consistent to use upper-case statuses in the enum
> strtoint_status.

I thought of that, but first enum I found used lower case, so it did not 
seem obvious that pg style was to use upper case. Indeed, some enum 
constants use upper cases.

>  Could it be renamed to pg_strtoint_status?

Sure. I also prefixed the enum constants for consistency.

Attached patch uses a prefix and uppers constants. Waiting for further 
input about other comments.

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Wed, Aug 28, 2019 at 09:50:44AM +0200, Fabien COELHO wrote:
>  - *ptr && WHATEVER(*ptr)
>   *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
>   char but at the end. It might be debatable in some places, e.g. it is
>   likely that there are no spaces in the string, but likely that there are
>   more than one digit.

Still this makes the checks less robust?

>   If you want all/some *ptr added back, no problem.
>
>  - isdigit repeated on if and following while, used if/do-while instead.

I see, you don't check twice if the first character is a digit this
way.

> Hmmm. Have you looked at the fallback cases when the corresponding builtins
> are not available?
>
> I'm unsure of a reliable way to detect a generic unsigned int overflow
> without expensive dividing back and having to care about zero…

Mr Freund has mentioned that here:
https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucmdmu@alap3.anarazel.de

> So I was pretty happy with my two discreet, small and efficient tests.

That's also a matter of code and interface consistency IMHO.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

>>  - *ptr && WHATEVER(*ptr)
>>   *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
>>   char but at the end. It might be debatable in some places, e.g. it is
>>   likely that there are no spaces in the string, but likely that there are
>>   more than one digit.
>
> Still this makes the checks less robust?

I do not see any downside, but maybe I lack imagination.

On my laptop isWHATEVER is implementd through an array mapping characters 
to a bitfield saying whether each char is WHATEVER, depending on the bit. 
This array is well defined for index 0 ('\0').

If an implementation is based on comparisons, for isdigit it would be:

    c >= '0' && c <= '9'

Then checking c != '\0' is redundant with c >= '0'.

Given the way the character checks are used in the function, we do not go 
beyond the end of the string because we only proceed further when a 
character is actually recognized, else we return.

So I cannot see any robustness issue, just a few cycles saved.

>> Hmmm. Have you looked at the fallback cases when the corresponding builtins
>> are not available?
>>
>> I'm unsure of a reliable way to detect a generic unsigned int overflow
>> without expensive dividing back and having to care about zero…
>
> Mr Freund has mentioned that here:
> https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucmdmu@alap3.anarazel.de

Yep, that is what I mean by expensive: there is an integer division, which 
is avoided if b is known to be 10, hence is not zero and the limit value 
can be checked directly on the input without having to perform a division 
each time.

>> So I was pretty happy with my two discreet, small and efficient tests.
>
> That's also a matter of code and interface consistency IMHO.

Possibly.

ISTM that part of the motivation is to reduce costs for heavily used 
conversions, eg on COPY. Function "scanf" is overly expensive because it 
has to interpret its format, and we have to check for overflows…

Anyway, if we assume that the builtins exist and rely on efficient 
hardware check, maybe we do not care about the fallback cases which would 
just be slow but never executed.

Note that all this is moot, as all instances of string to uint64 
conversion do not check for errors.

Attached v7 does create uint64 overflow inline functions. The stuff yet is 
not moved to "common/int.c", a file which does not exists, even if 
"common/int.h" does.

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Thu, Aug 29, 2019 at 08:14:54AM +0200, Fabien COELHO wrote:
> Attached v7 does create uint64 overflow inline functions. The stuff yet is
> not moved to "common/int.c", a file which does not exists, even if
> "common/int.h" does.

Thanks for the new version.  I have begun reviewing your patch, and
attached is a first cut that I would like to commit separately which
adds all the compatibility overflow routines to int.h for uint16,
uint32 and uint64 with all the fallback implementations (int128-based
method added as well if available).  I have also grouped at the top of
the file the comments about each routine's return policy to avoid
duplication.  For the fallback implementations of uint64 using int128,
I think that it is cleaner to use uint128 so as there is no need to
check after negative results for multiplications, and I have made the
various expressions consistent for each size.

Attached is a small module called "overflow" with various regression
tests that I used to check each implementation.  I don't propose that
for commit as I am not sure if that's worth the extra CPU, so let's
consider it as a toy for now.

What do you think?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Michaël,

> attached is a first cut that I would like to commit separately which 
> adds all the compatibility overflow routines to int.h for uint16, uint32 
> and uint64 with all the fallback implementations (int128-based method 
> added as well if available).  I have also grouped at the top of the file 
> the comments about each routine's return policy to avoid duplication. 
> For the fallback implementations of uint64 using int128, I think that it 
> is cleaner to use uint128 so as there is no need to check after negative 
> results for multiplications, and I have made the various expressions 
> consistent for each size.

Patch applies cleanly, compiles, "make check" ok, but the added functions 
are not used (yet).

I think that factoring out comments is a good move.

For symmetry and efficiency, ISTM that uint16 mul overflow could use 
uint32 and uint32 could use uint64, and the division-based method be 
dropped in these cases.

Maybe I would add a comment before each new section telling about the 
type, eg:

  /*
   * UINT16
   */
  add/sub/mul uint16 functions…

I would tend to commit working solutions per type rather than by 
installment, so that at least all added functions are actually used 
somewhere, but it does not matter much, really.

I was unsure that having int128 implies uint128 availability, so I did not 
use it.

The checking extension is funny, but ISTM that these checks should be (are 
already?) included in some standard sql test, which will test the macros 
from direct SQL operations:

   fabien=# SELECT INT2 '1234512434343';
   ERROR:  value "1234512434343" is out of range for type smallint

Well, a quick look at "src/test/regress/sql/int2.sql" suggests that
there the existing tests should be extended… :-(

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Fri, Aug 30, 2019 at 10:06:11AM +0200, Fabien COELHO wrote:
> Patch applies cleanly, compiles, "make check" ok, but the added functions
> are not used (yet).

Thanks.

> I think that factoring out comments is a good move.
>
> For symmetry and efficiency, ISTM that uint16 mul overflow could use uint32
> and uint32 could use uint64, and the division-based method be dropped in
> these cases.

Yes, the division would be worse than the other.  What do you think
about using the previous module I sent and measure how long it takes
to evaluate the overflows in some specific cases N times in loops?

> Maybe I would add a comment before each new section telling about the type,
> eg:
>
>  /*
>   * UINT16
>   */
>  add/sub/mul uint16 functions.

Let's do as you suggest here.

> I would tend to commit working solutions per type rather than by
> installment, so that at least all added functions are actually used
> somewhere, but it does not matter much, really.

I prefer by section, with testing dedicated to each part properly
done so as we can move to the next parts.

> I was unsure that having int128 implies uint128 availability, so I did not
> use it.

The recent Ryu-floating point work has begun using them (see f2s.c and
d2s.c).

> The checking extension is funny, but ISTM that these checks should be (are
> already?) included in some standard sql test, which will test the macros
> from direct SQL operations:

Sure.  But not for the unsigned part as there are no equivalent
in-core data types, still it is possible to trick things with signed
integer arguments.  I found my toy useful to check test all
implementations consistently.

>   fabien=# SELECT INT2 '1234512434343';
>   ERROR:  value "1234512434343" is out of range for type smallint
>
> Well, a quick look at "src/test/regress/sql/int2.sql" suggests that
> there the existing tests should be extended… :-(

We can tackle that separately.  -32768 is perfectly legal for
smallint, and the test is wrong here.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Michaël,

>> For symmetry and efficiency, ISTM that uint16 mul overflow could use uint32
>> and uint32 could use uint64, and the division-based method be dropped in
>> these cases.
>
> Yes, the division would be worse than the other.  What do you think
> about using the previous module I sent and measure how long it takes
> to evaluate the overflows in some specific cases N times in loops?

Given the overheads of the SQL interpreter, I'm unsure about what you 
would measure at the SQL level. You could just write a small standalone C 
program to test perf and accuracy. Maybe this is what you have in mind.

>> I would tend to commit working solutions per type rather than by
>> installment, so that at least all added functions are actually used
>> somewhere, but it does not matter much, really.
>
> I prefer by section, with testing dedicated to each part properly
> done so as we can move to the next parts.

This suggests that you will test twice: once when adding the inlined 
functions, once when calling from SQL.

>> The checking extension is funny, but ISTM that these checks should be (are
>> already?) included in some standard sql test, which will test the macros
>> from direct SQL operations:
>
> Sure.  But not for the unsigned part as there are no equivalent
> in-core data types,

Yep, it bothered me sometimes, but not enough to propose to add them.

> still it is possible to trick things with signed integer arguments.

Is it?

> I found my toy useful to check test all implementations consistently.

Ok.

>>   fabien=# SELECT INT2 '1234512434343';
>>   ERROR:  value "1234512434343" is out of range for type smallint
>>
>> Well, a quick look at "src/test/regress/sql/int2.sql" suggests that
>> there the existing tests should be extended… :-(
>
> We can tackle that separately.  -32768 is perfectly legal for
> smallint, and the test is wrong here.

Do you mean:

  sql> SELECT -32768::INT2;
  ERROR:  smallint out of range

This is not a negative constant but the reverse of a positive, which is 
indeed out of range, although the error message could help more.

  sql> SELECT (-32768)::INT2;
  -32768 # ok

  sql> SELECT INT2 '-32768';
  -32768 # ok

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Fri, Aug 30, 2019 at 04:50:21PM +0200, Fabien COELHO wrote:
> Given the overheads of the SQL interpreter, I'm unsure about what you would
> measure at the SQL level. You could just write a small standalone C program
> to test perf and accuracy. Maybe this is what you have in mind.

After a certain threshold, you can see the difference anyway by paying
once the overhead of the function.  See for example the updated module
attached that I used for my tests.

I have been testing the various implementations, and doing 2B
iterations leads to roughly the following with a non-assert, -O2
build using mul_u32:
- __builtin_sub_overflow => 5s
- cast to uint64 => 5.9s
- division => 8s
You are right as well that having symmetry with the signed methods is
much better.  In order to see the difference, you can just do that
with the extension attached, after of course hijacking int.h with some
undefs and recompiling the backend and the module:
select pg_overflow_check(10000, 10000, 2000000000, 'uint32', 'mul');

>> still it is possible to trick things with signed integer arguments.
>
> Is it?

If you pass -1 and then you can fall back to the maximum of each 16,
32 or 64 bits for the unsigned (see the regression tests I added with
the module).

> Do you mean:
>
>  sql> SELECT -32768::INT2;
>  ERROR:  smallint out of range

You are incorrect here, as the minus sign is ignored by the cast.
This works though:
=# SELECT (-32768)::INT2;
  int2
--------
 -32768
(1 row)

If you look at int2.sql, we do that:
-- largest and smallest values
INSERT INTO INT2_TBL(f1) VALUES ('32767');
INSERT INTO INT2_TBL(f1) VALUES ('-32767');
That's the part I mean is wrong, as the minimum is actually -32768,
but the test fails to consider that.  I'll go fix that after
double-checking other similar tests for int4 and int8.

Attached is an updated patch to complete the work for common/int.h,
with the following changes:
- Changed the multiplication methods for uint16 and uint32 to not be
division-based.  uint64 can use that only if int128 exists.
- Added comments on top of each sub-sections for the types checked.

Attached is also an updated version of the module I used to validate
this stuff.  Fabien, any thoughts?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

> You are right as well that having symmetry with the signed methods is 
> much better.  In order to see the difference, you can just do that with 
> the extension attached, after of course hijacking int.h with some undefs 
> and recompiling the backend and the module: select 
> pg_overflow_check(10000, 10000, 2000000000, 'uint32', 'mul');

Ok.

>>> still it is possible to trick things with signed integer arguments.
>>
>> Is it?
>
> If you pass -1 and then you can fall back to the maximum of each 16,
> 32 or 64 bits for the unsigned (see the regression tests I added with
> the module).

> Attached is also an updated version of the module I used to validate
> this stuff.  Fabien, any thoughts?

Patch apply cleanly, compiles, "make check" ok (although changes 
are untested).

I would put back unlikely() on overflow tests, as there are indeed 
unlikely to occur and it may help some compilers, and cannot be harmful. 
It also helps the code reader to know that these path are not expected to 
be taken often.

On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128 
are useful. The res < a or b > a tricks should suffice, just like for u16 
and u32 cases, and it may cost a little less anyway.

I would suggest keep the overflow extension as "contrib/overflow_test". 
For mul tests, I'd suggest not to try only min/max values like add/sub, 
but also "standard" multiplications that overflow or not. It would be good 
if "make check" could be made to work", for some reason it requires 
"installcheck".

I could not test performance directly, loops are optimized out by the 
compiler. I added "volatile" on input value declarations to work around 
that. On 2B iterations I got on my laptop:

  int16: mul = 2770 ms, add = 1830 ms, sub = 1826 ms
  int32: mul = 1838 ms, add = 1835 ms, sub = 1840 ms
  int64: mul = 1836 ms, add = 1834 ms, sub = 1833 ms

  uint16: mul = 3670 ms, add = 1830 ms, sub = 2148 ms
  uint32: mul = 2438 ms, add = 1834 ms, sub = 1831 ms
  uint64: mul = 2139 ms, add = 1841 ms, sub = 1882 ms

Why int16 mul, uint* mul and uint16 sub are bad is unclear.

With fallback code triggered with:

  #undef HAVE__BUILTIN_OP_OVERFLOW

  int16: mul = 1433 ms, add = 1424 ms, sub = 1254 ms
  int32: mul = 1433 ms, add = 1425 ms, sub = 1443 ms
  int64: mul = 1430 ms, add = 1429 ms, sub = 1441 ms

  uint16: mul = 1445 ms, add = 1291 ms, sub = 1265 ms
  uint32: mul = 1419 ms, add = 1434 ms, sub = 1493 ms
  uint64: mul = 1266 ms, add = 1430 ms, sub = 1440 ms

For some unclear reason, 4 tests are significantly faster.

Forcing further down fallback code with:

   #undef HAVE_INT128

  int64: mul = 1424 ms, add = 1429 ms, sub = 1440 ms
  uint64: mul = 24145 ms, add = 1434 ms, sub = 1435 ms

There is no doubt that dividing 64 bits integers is a very bad idea, at 
least on my architecture!

Note that checks depends on value, so actual performance may vary 
depending on actual val1 and val2 passed. I used 10000 10000 like your 
example.

These results are definitely depressing because the fallback code is 
nearly twice as fast as the builtin overflow detection version. For the 
record: gcc 7.4.0 on ubuntu 18.04 LTS. Not sure what to advise, relying on 
the builtin should be the better idea…

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Sun, Sep 01, 2019 at 01:57:06PM +0200, Fabien COELHO wrote:
> I would put back unlikely() on overflow tests, as there are indeed unlikely
> to occur and it may help some compilers, and cannot be harmful. It also
> helps the code reader to know that these path are not expected to be taken
> often.

Hm.  I don't agree about that part, and the original signed portions
don't do that.  I think that we should let the callers of the routines
decide if a problem is unlikely to happen or not as we do now.

> On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128
> are useful. The res < a or b > a tricks should suffice, just like for u16
> and u32 cases, and it may cost a little less anyway.

Actually, I agree and this is something I can see as well with some
extra measurements.  mul_u64 without int128 is twice slower, while
add_u64 and sub_u64 are 15~20% faster.

> I would suggest keep the overflow extension as "contrib/overflow_test". For
> mul tests, I'd suggest not to try only min/max values like add/sub, but also
> "standard" multiplications that overflow or not. It would be good if "make
> check" could be made to work", for some reason it requires "installcheck".

Any extensions out of core can only work with "installcheck", and
check is not supported (see pgxs.mk).  I am still not convinced that
this module is worth the extra cycles to justify its existence
though.

> There is no doubt that dividing 64 bits integers is a very bad idea, at
> least on my architecture!

That's surprising.  I cannot reproduce that.  Are you sure that you
didn't just undefine HAVE_INT128?  This would cause
HAVE__BUILTIN_OP_OVERFLOW to still be active in all the code paths.
Here are a couple of results from my side with this query, FWIW, and
some numbers for all the compile flags (-O2 used):
select pg_overflow_check(10000, 10000, 2000000000, 'XXX', 'XXX');
1) uint16:
1-1) mul:
- built-in: 5.5s
- fallback: 5.5s
1-2) sub:
- built-in: 5.3s
- fallback: 5.4s
1-3) add:
- built-in: 5.3s
- fallback: 6.2s
2) uint32:
2-1) mul:
- built-in: 5.1s
- fallback: 5.9s
2-2) sub:
- built-in: 5.2s
- fallback: 5.4s
2-3) add:
- built-in: 5.2s
- fallback: 6.2s
2) uint64:
2-1) mul:
- built-in: 5.1s
- fallback (with uint128): 8.0s
- fallback (without uint128): 18.1s
2-2) sub:
- built-in: 5.2s
- fallback (with uint128): 7.1s
- fallback (without uint128): 5.5s
2-3) add:
- built-in: 5.2s
- fallback (with uint128): 7.1s
- fallback (without uint128): 6.3s

So, the built-in option is always faster, and keeping the int128 path
if available for the multiplication makes sense, but not for the
subtraction and the addition.  I am wondering if we should review
further the signed part for add and sub, but I'd rather not touch it
in this patch.

> Note that checks depends on value, so actual performance may vary depending
> on actual val1 and val2 passed. I used 10000 10000 like your example.

Sure.  Still that offers helpful hints as we do the same operations
for all code paths the same number of times.

If you have done any changes on my previous patch, or if you have a
script to share I could use to attempt to reproduce your results, I
would be happy to do so.

So, do you have more comments?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Michaël,

>> I would put back unlikely() on overflow tests, as there are indeed unlikely
>> to occur and it may help some compilers, and cannot be harmful. It also
>> helps the code reader to know that these path are not expected to be taken
>> often.
>
> Hm.  I don't agree about that part, and the original signed portions
> don't do that.  I think that we should let the callers of the routines
> decide if a problem is unlikely to happen or not as we do now.

Hmmm. Maybe inlining propates them, but otherwise they make sense for a 
compiler perspective.

> I am still not convinced that this module is worth the extra cycles to 
> justify its existence though.

They allow to quickly do performance tests, for me it is useful to keep it 
around, but you are the committer, you do as you feel.

>> [...]
>> There is no doubt that dividing 64 bits integers is a very bad idea, at
>> least on my architecture!
>
> That's surprising.  I cannot reproduce that.

It seems to me that somehow you can, you have a 5 to 18 seconds drop 
below. There are actual reasons why some processors are more expensive 
than others, it is not just marketing:-)

> 2-1) mul:
> - built-in: 5.1s
> - fallback (with uint128): 8.0s
> - fallback (without uint128): 18.1s

> So, the built-in option is always faster, and keeping the int128 path
> if available for the multiplication makes sense, but not for the
> subtraction and the addition.

Yep.

> I am wondering if we should review further the signed part for add and 
> sub, but I'd rather not touch it in this patch.

The signed overflows are trickier even, I have not paid attention to the 
fallback code. I agree that it is better left untouched for know.

> If you have done any changes on my previous patch, or if you have a
> script to share I could use to attempt to reproduce your results, I
> would be happy to do so.

Hmmm. I did manual tests really. Attached a psql script replicating them.

   # with builtin overflow detection
   sh> psql < oc.sql
   NOTICE:  int 16 mul: 00:00:02.747269 # slow
   NOTICE:  int 16 add: 00:00:01.83281
   NOTICE:  int 16 sub: 00:00:01.8501
   NOTICE:  uint 16 mul: 00:00:03.68362 # slower
   NOTICE:  uint 16 add: 00:00:01.835294
   NOTICE:  uint 16 sub: 00:00:02.136895 # slow
   NOTICE:  int 32 mul: 00:00:01.828065
   NOTICE:  int 32 add: 00:00:01.840269
   NOTICE:  int 32 sub: 00:00:01.843557
   NOTICE:  uint 32 mul: 00:00:02.447052 # slow
   NOTICE:  uint 32 add: 00:00:01.849899
   NOTICE:  uint 32 sub: 00:00:01.840773
   NOTICE:  int 64 mul: 00:00:01.839051
   NOTICE:  int 64 add: 00:00:01.839065
   NOTICE:  int 64 sub: 00:00:01.838599
   NOTICE:  uint 64 mul: 00:00:02.161346 # slow
   NOTICE:  uint 64 add: 00:00:01.839404
   NOTICE:  uint 64 sub: 00:00:01.838549
   DO

> So, do you have more comments?

No other comments.

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Sun, Sep 01, 2019 at 08:07:06PM +0200, Fabien COELHO wrote:
> They allow to quickly do performance tests, for me it is useful to keep it
> around, but you are the committer, you do as you feel.

If there are more voices for having that in core, we could consider
it.  For now I have added that into my own plugin repository with all
the functions discussed on this thread:
https://github.com/michaelpq/pg_plugins/

> The signed overflows are trickier even, I have not paid attention to the
> fallback code. I agree that it is better left untouched for know.

Thanks.

> Hmmm. I did manual tests really. Attached a psql script replicating them.
>
>   # with builtin overflow detection
>   sh> psql < oc.sql
>   NOTICE:  int 16 mul: 00:00:02.747269 # slow
>   NOTICE:  int 16 add: 00:00:01.83281
>   NOTICE:  int 16 sub: 00:00:01.8501
>   NOTICE:  uint 16 mul: 00:00:03.68362 # slower
>   NOTICE:  uint 16 add: 00:00:01.835294
>   NOTICE:  uint 16 sub: 00:00:02.136895 # slow
>   NOTICE:  int 32 mul: 00:00:01.828065
>   NOTICE:  int 32 add: 00:00:01.840269
>   NOTICE:  int 32 sub: 00:00:01.843557
>   NOTICE:  uint 32 mul: 00:00:02.447052 # slow
>   NOTICE:  uint 32 add: 00:00:01.849899
>   NOTICE:  uint 32 sub: 00:00:01.840773
>   NOTICE:  int 64 mul: 00:00:01.839051
>   NOTICE:  int 64 add: 00:00:01.839065
>   NOTICE:  int 64 sub: 00:00:01.838599
>   NOTICE:  uint 64 mul: 00:00:02.161346 # slow
>   NOTICE:  uint 64 add: 00:00:01.839404
>   NOTICE:  uint 64 sub: 00:00:01.838549

Actually that's much faster than a single core on my debian SID with
gcc 9.2.1.

Here are more results from me:
           Built-in         undef Built-in
int16 mul  00:00:05.425207  00:00:05.634417
int16 add  00:00:05.389738  00:00:06.38885
int16 sub  00:00:05.446529  00:00:06.39569
uint16 mul 00:00:05.499066  00:00:05.541617
uint16 add 00:00:05.281622  00:00:06.252511
uint16 sub 00:00:05.366424  00:00:05.457148
int32 mul  00:00:05.121209  00:00:06.154989
int32 add  00:00:05.228722  00:00:06.344721
int32 sub  00:00:05.237594  00:00:06.323543
uint32 mul 00:00:05.126339  00:00:05.921738
uint32 add 00:00:05.212085  00:00:06.183031
uint32 sub 00:00:05.201884  00:00:05.363667
int64 mul  00:00:05.136129  00:00:06.148101
int64 add  00:00:05.200201  00:00:06.329091
int64 sub  00:00:05.218028  00:00:06.313114
uint64 mul 00:00:05.444733  00:00:08.089742
uint64 add 00:00:05.603978  00:00:06.377753
uint64 sub 00:00:05.544838  00:00:05.490873

This part has been committed, now let's move to the next parts of your
patch.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
> This part has been committed, now let's move to the next parts of your
> patch.

Attached a rebased version which implements the int64/uint64 stuff. It is 
basically the previous patch without the overflow inlined functions.

-- 
Fabien Coelho - CRI, MINES ParisTech
Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Tue, Sep 03, 2019 at 08:10:37PM +0200, Fabien COELHO wrote:
> Attached a rebased version which implements the int64/uint64 stuff. It is
> basically the previous patch without the overflow inlined functions.

-     if (!strtoint64(yytext, true, &yylval->ival))
+     if (unlikely(pg_strtoint64(yytext, &yylval->ival) != PG_STRTOINT_OK))
It seems to me that we should have unlikely() only within
pg_strtoint64(), pg_strtouint64(), etc.

-   /* skip leading spaces; cast is consistent with strtoint64 */
-   while (*ptr && isspace((unsigned char) *ptr))
+   /* skip leading spaces; cast is consistent with pg_strtoint64 */
+   while (isspace((unsigned char) *ptr))
        ptr++;
What do you think about splitting this part in two?  I would suggest
to do the refactoring in a first patch, and the consider all the
optimizations for the routines you have in mind afterwards.

I think that we don't actually need is_an_int() and str2int64() at all
in pgbench.c as we could just check for the return code of
pg_strtoint64() and switch to the double parsing only if we don't have
PG_STRTOINT_OK.

scanint8() only has one caller in the backend with your patch, and we
don't check after its return result in int8.c, so I would suggest to
move the error handling directly in int8in() and to remove scanint8().

I think that we should also introduce the [u]int[16|32] flavors and
expand them in the code in a single patch, in a way consistent with
what you have done for int64/uint64 as the state that we reach with
the patch is kind of weird as there are existing versions numutils.c.

Have you done some performance testing of the patch?  The COPY
bulkload is a good example of that:
https://www.postgresql.org/message-id/20190717181428.krqpmduejbqr4m6g%40alap3.anarazel.de
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-09-03 20:10:37 +0200, Fabien COELHO wrote:
> @@ -113,7 +113,7 @@ parse_output_parameters(List *options, uint32 *protocol_version,
>                           errmsg("conflicting or redundant options")));
>              protocol_version_given = true;
>
> -            if (!scanint8(strVal(defel->arg), true, &parsed))
> +            if (unlikely(pg_strtoint64(strVal(defel->arg), &parsed) != PG_STRTOINT_OK))
>                  ereport(ERROR,
>                          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>                           errmsg("invalid proto_version")));

Unexcited about adding unlikely() to any place that's not a hot path -
which this certainly is not.

But I also wonder if we shouldn't just put the branch likelihood of
pg_strtoint64 not failing into one central location. E.g. something like

static inline pg_strtoint_status
pg_strtoint64(const char *str, int64 *result)
{
        pg_strtoint_status status;

        status = pg_strtoint64_impl(str, result);

        likely(status == PG_STRTOINT_OK);

        return status;
}

I've not tested this, but IIRC that should be sufficient to propagate
that knowledge up to callers.


> +    if (likely(stat == PG_STRTOINT_OK))
> +        return true;
> +    else if (stat == PG_STRTOINT_RANGE_ERROR)
>      {
> -        /* could fail if input is most negative number */
> -        if (unlikely(tmp == PG_INT64_MIN))
> -            goto out_of_range;
> -        tmp = -tmp;
> -    }
> -
> -    *result = tmp;
> -    return true;
> -
> -out_of_range:
> -    if (!errorOK)
>          ereport(ERROR,
>                  (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>                   errmsg("value \"%s\" is out of range for type %s",
>                          str, "bigint")));
> -    return false;
> -
> -invalid_syntax:
> -    if (!errorOK)
> +        return false;
> +    }
> +    else if (stat == PG_STRTOINT_SYNTAX_ERROR)
> +    {
>          ereport(ERROR,
>                  (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>                   errmsg("invalid input syntax for type %s: \"%s\"",
>                          "bigint", str)));
> -    return false;
> +        return false;
> +    }
> +    else
> +        /* cannot get here */
> +        Assert(0);

I'd write this as a switch over the enum - that way we'll get a
compile-time warning if we're not handling a value.


> +static bool
> +str2int64(const char *str, int64 *val)
> +{
> +    pg_strtoint_status        stat = pg_strtoint64(str, val);
> +

I find it weird to have a wrapper that's named 'str2...' that then calls
'strto' to implement itself.


> +/*
> + * pg_strtoint64 -- convert a string to 64-bit integer
> + *
> + * The function returns if the conversion failed, or
> + * "*result" is set to the result.
> + */
> +pg_strtoint_status
> +pg_strtoint64(const char *str, int64 *result)
> +{
> +    const char *ptr = str;
> +    int64        tmp = 0;
> +    bool        neg = false;
> +
> +    /*
> +     * Do our own scan, rather than relying on sscanf which might be broken
> +     * for long long.
> +     *
> +     * As INT64_MIN can't be stored as a positive 64 bit integer, accumulate
> +     * value as a negative number.
> +     */
> +
> +    /* skip leading spaces */
> +    while (isspace((unsigned char) *ptr))
> +        ptr++;
> +
> +    /* handle sign */
> +    if (*ptr == '-')
> +    {
> +        ptr++;
> +        neg = true;
> +    }
> +    else if (*ptr == '+')
> +        ptr++;
> +
> +    /* require at least one digit */
> +    if (unlikely(!isdigit((unsigned char) *ptr)))
> +        return PG_STRTOINT_SYNTAX_ERROR;
> +
> +    /* process digits, we know that we have one ahead */
> +    do
> +    {
> +        int64        digit = (*ptr++ - '0');
> +
> +        if (unlikely(pg_mul_s64_overflow(tmp, 10, &tmp)) ||
> +            unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
> +            return PG_STRTOINT_RANGE_ERROR;
> +    }
> +    while (isdigit((unsigned char) *ptr));

Hm. If we're concerned about the cost of isdigit etc - and I think
that's reasonable, after looking at their implementation in glibc (it
performs a lookup in a global table to handle potential locale changes)
- I think we ought to just not use the provided isdigit, and probably
not isspace either.  This code effectively relies on the input being
ascii anyway, so we don't need any locale specific behaviour afaict
(we'd e.g. get wrong results if isdigit() returned true for something
other than the ascii chars).

To me the generated code looks considerably better if I use something
like

static inline bool
pg_isdigit(char c)
{
    return c >= '0' && c <= '9';
}

static inline bool
pg_isspace(char c)
{
    return c == ' ';
}

(if we were to actually go for this, we'd probably want to add some docs
that we don't expect EOF, or make the code safe against that).

I've not benchmarked that, but I'd be surprised if it didn't improve
matters.

And once coded using the above, there's no point in the do/while
conversion imo, as any compiler can trivially optimize redundant checks.


> +/*
> + * pg_strtouint64 -- convert a string to unsigned 64-bit integer
> + *
> + * The function returns if the conversion failed, or
> + * "*result" is set to the result.
> + */
> +pg_strtoint_status
> +pg_strtouint64(const char *str, uint64 *result)
> +{
> +    const char *ptr = str;
> +    uint64        tmp = 0;
> +
> +    /* skip leading spaces */
> +    while (isspace((unsigned char) *ptr))
> +        ptr++;
> +
> +    /* handle sign */
> +    if (*ptr == '+')
> +        ptr++;
> +    else if (unlikely(*ptr == '-'))
> +        return PG_STRTOINT_SYNTAX_ERROR;

Hm. Seems like that should return PG_STRTOINT_RANGE_ERROR?


> +typedef enum {

Please don't define anonyous types (the enum itself, which now is only
reachable via the typedef). Also, there's a missing newline here).


Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

>> Attached a rebased version which implements the int64/uint64 stuff. It is
>> basically the previous patch without the overflow inlined functions.
>
> -     if (!strtoint64(yytext, true, &yylval->ival))
> +     if (unlikely(pg_strtoint64(yytext, &yylval->ival) != PG_STRTOINT_OK))

> It seems to me that we should have unlikely() only within
> pg_strtoint64(), pg_strtouint64(), etc.

From a compiler perspective, the (un)likely tip is potentially useful on 
any test. We know when parsing a that it is very unlikely that the string 
conversion would fail, so we can tell that, so that the compiler knows 
which branch it should optimize first.

You can argue against that if the functions are inlined, because maybe the 
compiler would propagate the information, but for distinct functions 
compiled separately the information is useful at each level.

> -   /* skip leading spaces; cast is consistent with strtoint64 */
> -   while (*ptr && isspace((unsigned char) *ptr))

> +   /* skip leading spaces; cast is consistent with pg_strtoint64 */
> +   while (isspace((unsigned char) *ptr))
>        ptr++;

> What do you think about splitting this part in two?  I would suggest
> to do the refactoring in a first patch, and the consider all the
> optimizations for the routines you have in mind afterwards.

I would not bother.

> I think that we don't actually need is_an_int() and str2int64() at all
> in pgbench.c as we could just check for the return code of
> pg_strtoint64() and switch to the double parsing only if we don't have
> PG_STRTOINT_OK.

Yep, you are right, now that the conversion functions does not error out a 
message, its failure can be used as a test.

The version attached changes slightly the semantics, because on int 
overflows a double conversion will be attempted instead of erroring. I do 
not think that it is worth the effort of preserving the previous semantic 
of erroring.

> scanint8() only has one caller in the backend with your patch, and we
> don't check after its return result in int8.c, so I would suggest to
> move the error handling directly in int8in() and to remove scanint8().

Ok.

> I think that we should also introduce the [u]int[16|32] flavors and
> expand them in the code in a single patch, in a way consistent with
> what you have done for int64/uint64 as the state that we reach with
> the patch is kind of weird as there are existing versions numutils.c.

Before dealing with the 16/32 versions, which involve quite a significant 
amount of changes, I would want a clear message that "the 64 bit approach" 
is the model to follow.

Moreover, I'm unsure how to rename the existing pg_strtoint32 and others
which call ereport, if the name is used for the common error returning 
version.

> Have you done some performance testing of the patch?  The COPY
> bulkload is a good example of that:
> https://www.postgresql.org/message-id/20190717181428.krqpmduejbqr4m6g%40alap3.anarazel.de

I have done no such thing for now.

I would not expect any significant performance difference when loading 
int8 things because basically scanint8 has just been renamed pg_strtoint64 
and made global, and that is more or less all. It might be a little bit 
slower because possible the compiler cannot inline the conversion, but on 
the other hand, the *likely hints and removed tests may marginaly help 
performance. I think that the only way to test performance significantly 
would be to write a specific program that loops over a conversion.

-- 
Fabien.
Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Wed, Sep 04, 2019 at 02:08:39AM -0700, Andres Freund wrote:
>> +static bool
>> +str2int64(const char *str, int64 *val)
>> +{
>> +    pg_strtoint_status        stat = pg_strtoint64(str, val);
>> +
>
> I find it weird to have a wrapper that's named 'str2...' that then calls
> 'strto' to implement itself.

It happens that this wrapper in pgbench.c is not actually needed.

> Hm. If we're concerned about the cost of isdigit etc - and I think
> that's reasonable, after looking at their implementation in glibc (it
> performs a lookup in a global table to handle potential locale changes)
> - I think we ought to just not use the provided isdigit, and probably
> not isspace either.  This code effectively relies on the input being
> ascii anyway, so we don't need any locale specific behaviour afaict
> (we'd e.g. get wrong results if isdigit() returned true for something
> other than the ascii chars).

Yeah.  It seems to me that we have more optimizations that could come
in line here, and actually we have perhaps more refactoring at hand
with each one of the 6 functions we'd like to add at the end.  I had
in mind about first shaping the refactoring patch, consolidating all
the interfaces, and then evaluate the improvements we can come up with
as after the refactoring we'd need to update only common/string.c.

> I've not benchmarked that, but I'd be surprised if it didn't improve
> matters.

I think that you are right here, there is something to gain.  Looking
at their stuff this makes use of __isctype as told by ctype/ctype.h.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Wed, Sep 04, 2019 at 12:49:17PM +0200, Fabien COELHO wrote:
> From a compiler perspective, the (un)likely tip is potentially useful on any
> test. We know when parsing a that it is very unlikely that the string
> conversion would fail, so we can tell that, so that the compiler knows which
> branch it should optimize first.
>
> You can argue against that if the functions are inlined, because maybe the
> compiler would propagate the information, but for distinct functions
> compiled separately the information is useful at each level.

Hmm.  There has been an extra recent argument on the matter here:
https://www.postgresql.org/message-id/20190904090839.stp3madovtynq3px@alap3.anarazel.de

I am not sure that we should tackle that as part of the first
refactoring though, as what we'd want is first to put all the
interfaces in a single place we can deal with afterwards.

> Yep, you are right, now that the conversion functions does not error out a
> message, its failure can be used as a test.
>
> The version attached changes slightly the semantics, because on int
> overflows a double conversion will be attempted instead of erroring. I do
> not think that it is worth the effort of preserving the previous semantic of
> erroring.

Yes.  I would move things in this direction.  I may reconsider this
part again with more testing but switching from one to the other is
simple enough so let's keep the code as you suggest for now.

>> scanint8() only has one caller in the backend with your patch, and we
>> don't check after its return result in int8.c, so I would suggest to
>> move the error handling directly in int8in() and to remove scanint8().
>
> Ok.

As per the extra comments of upthread, this should use a switch
without a default.

> Before dealing with the 16/32 versions, which involve quite a significant
> amount of changes, I would want a clear message that "the 64 bit approach"
> is the model to follow.
>
> Moreover, I'm unsure how to rename the existing pg_strtoint32 and others
> which call ereport, if the name is used for the common error returning
> version.

Right, there was this part.  This brings also the point of having one
interface for the backend as all the error messages for the backend
are actually the same, with the most simple name being that:
pg_strtoint(value, size, error_ok).

This then calls all the sub-routines we have in src/common/.  There
were more suggestions here:
https://www.postgresql.org/message-id/20190729050539.d7mbjabcrlv7bxc3@alap3.anarazel.de

> I would not expect any significant performance difference when loading int8
> things because basically scanint8 has just been renamed pg_strtoint64 and
> made global, and that is more or less all. It might be a little bit slower
> because possible the compiler cannot inline the conversion, but on the other
> hand, the *likely hints and removed tests may marginaly help performance. I
> think that the only way to test performance significantly would be to write
> a specific program that loops over a conversion.

I would suspect a change for pg_strtouint64().
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Thu, Sep 05, 2019 at 03:52:48PM +0900, Michael Paquier wrote:
> Right, there was this part.  This brings also the point of having one
> interface for the backend as all the error messages for the backend
> are actually the same, with the most simple name being that:
> pg_strtoint(value, size, error_ok).

I have been looking at that for the last couple of days.  First, I
have consolidated all the error strings in a single routine like this
one, except that error_ok is not really needed if you take this
approach: callers that don't care about failures could just call the
set of routines in common/string.c and be done with it.

Attached is an update of my little module that I used to check that
the refactoring was done correctly (regression tests attached), it
also includes a couple of routines to check the performance difference
between one approach and the other, with focus on two things:
- Is pg_strtouint64 really improved?
- How much do we lose by moving to a common interface in the backend
with pg_strtoint?

The good news is that removing strtol from pg_strtouint64 really
improves the performance as already reported, and with one billion
calls in a tight loop you see a clear difference:
=# select pg_strtouint64_old_check('10000', 1000000000);
 pg_strtouint64_old_check
--------------------------
                    10000
(1 row)
Time: 15576.539 ms (00:15.577)
=# select pg_strtouint64_new_check('10000', 1000000000);
 pg_strtouint64_new_check
--------------------------
                    10000
(1 row)
Time: 8820.544 ms (00:08.821)

So the new implementation is more than 40% faster with this
micro-benchmark on my Debian box.

The bad news is that a pg_strtoint() wrapper is not a good idea:
=# select pg_strtoint_check('10000', 1000000000, 4);
 pg_strtoint_check
-------------------
             10000
(1 row)
Time: 11178.101 ms (00:11.178)
=# select pg_strtoint32_check('10000', 1000000000);
 pg_strtoint32_check
---------------------
               10000
(1 row)
Time: 9252.894 ms (00:09.253)
So trying to consolidate all error messages leads to a 15% hit with
this test, which sucks.

So, it seems to me that if we want to have a consolidation of
everything, we basically need to have the generic error messages for
the backend directly into common/string.c where the routines are
refactored, and I think that we should do the following based on the
patch attached:
- Just remove pg_strtoint(), and move the error generation logic in
common/string.c.
- Add an error_ok flag for all the pg_strto[u]int{16|32|64} routines,
which generates errors only for FRONTEND is not defined.
- Use pg_log_error() for the frontend errors.

If those errors are added everywhere, we would basically have no code
paths in the frontend or the backend (for the unsigned part only)
which use them yet.  Another possibility would be to ignore the
error_ok flag in those cases, but that's inconsistent.  My take would
be here to have a more generic error message, like that:
"value \"%s\" is out of range for [unsigned] {16|32|64}-bit integer"
"invalid input syntax for [unsigned] {16|32|64}-bit integer: \"%s\"\n"

I do not suggest to change the messages for the backend for signed
entries though, as these are linked to the data types used.

Attached is an update of my toy module, and an updated patch with what
I have done up to now.  This stuff already does a lot, so for now I
have not worked on the removal strtoint() in string.c yet.  We could
just do that with a follow-up patch and make it use the new conversion
routines once we are sure that they are in a correct shape.  As
strtoint() makes use of strtol(), switching to the new routines will
be much faster anyway...

Fabien, any thoughts?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-09-09 14:28:14 +0900, Michael Paquier wrote:
> On Thu, Sep 05, 2019 at 03:52:48PM +0900, Michael Paquier wrote:
> > Right, there was this part.  This brings also the point of having one
> > interface for the backend as all the error messages for the backend
> > are actually the same, with the most simple name being that:
> > pg_strtoint(value, size, error_ok).

I *VEHEMENTLY* oppose the introduction of any future pseudo-generic
routines taking the type width as a parameter. They're weakening the
type system and they're unnecessarily inefficient.


I don't really buy that saving a few copies of a strings is worth that
much. But if you really want to do that, the right approach imo would be
to move the error reporting into a separate function. I.e. something

void pg_attribute_noreturn()
pg_strtoint_error(pg_strtoint_status status, const char *input, const char *type)

that you'd call in small wrappers. Something like

static inline int32
pg_strtoint32_check(const char* s)
{
    int32 result;
    pg_strtoint_status status = pg_strtoint32(s, &result);

    if (unlikely(status == PG_STRTOINT_OK))
       pg_strtoint_error(status, s, "int32");
    return result;
}


> So, it seems to me that if we want to have a consolidation of
> everything, we basically need to have the generic error messages for
> the backend directly into common/string.c where the routines are
> refactored, and I think that we should do the following based on the
> patch attached:

> - Just remove pg_strtoint(), and move the error generation logic in
> common/string.c.

I'm not quite sure what you mean by moving the "error generation logic"?


> - Add an error_ok flag for all the pg_strto[u]int{16|32|64} routines,
> which generates errors only for FRONTEND is not defined.

I think this is a bad idea.


> - Use pg_log_error() for the frontend errors.
>
> If those errors are added everywhere, we would basically have no code
> paths in the frontend or the backend (for the unsigned part only)
> which use them yet.  Another possibility would be to ignore the
> error_ok flag in those cases, but that's inconsistent.

Yea, ignoring it would be terrible idea.

> diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
> index a9bd47d937..593a5ef54e 100644
> --- a/src/backend/libpq/pqmq.c
> +++ b/src/backend/libpq/pqmq.c
> @@ -286,10 +286,10 @@ pq_parse_errornotice(StringInfo msg, ErrorData *edata)
>                  edata->hint = pstrdup(value);
>                  break;
>              case PG_DIAG_STATEMENT_POSITION:
> -                edata->cursorpos = pg_strtoint32(value);
> +                edata->cursorpos = pg_strtoint(value, 4);
>                  break;

I'd be really upset if this type of change went in.



>  #include "fmgr.h"
>  #include "miscadmin.h"
> +#include "common/string.h"
>  #include "nodes/extensible.h"
>  #include "nodes/parsenodes.h"
>  #include "nodes/plannodes.h"
> @@ -80,7 +81,7 @@
>  #define READ_UINT64_FIELD(fldname) \
>      token = pg_strtok(&length);        /* skip :fldname */ \
>      token = pg_strtok(&length);        /* get field value */ \
> -    local_node->fldname = pg_strtouint64(token, NULL, 10)
> +    (void) pg_strtouint64(token, &local_node->fldname)

Seems like these actually could just ought to use the error-checked
variants. And I think it ought to change all of
READ_{INT,UINT,LONG,UINT64,OID}_FIELD, rather than just redirecting one
of them to the new routines.


>  static void pcb_error_callback(void *arg);
> @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location)
>
>          case T_Float:
>              /* could be an oversize integer as well as a float ... */
> -            if (scanint8(strVal(value), true, &val64))
> +            if (pg_strtoint64(strVal(value), &val64) == PG_STRTOINT_OK)
>              {
>                  /*
>                   * It might actually fit in int32. Probably only INT_MIN can

Not for this change, but we really ought to move away from this crappy
logic. It's really bonkers to have T_Float represent large integers and
floats.



> +/*
> + * pg_strtoint16
> + *
> + * Convert input string to a signed 16-bit integer.  Allows any number of
> + * leading or trailing whitespace characters.
> + *
> + * NB: Accumulate input as a negative number, to deal with two's complement
> + * representation of the most negative number, which can't be represented as a
> + * positive number.
> + *
> + * The function returns immediately if the conversion failed with a status
> + * value to let the caller handle the error.  On success, the result is
> + * stored in "*result".
> + */
> +pg_strtoint_status
> +pg_strtoint16(const char *s, int16 *result)
> +{
> +    const char *ptr = s;
> +    int16        tmp = 0;
> +    bool        neg = false;
> +
> +    /* skip leading spaces */
> +    while (likely(*ptr) && isspace((unsigned char) *ptr))
> +        ptr++;
> +
> +    /* handle sign */
> +    if (*ptr == '-')
> +    {
> +        ptr++;
> +        neg = true;
> +    }
> +    else if (*ptr == '+')
> +        ptr++;

> +    /* require at least one digit */
> +    if (unlikely(!isdigit((unsigned char) *ptr)))
> +        return PG_STRTOINT_SYNTAX_ERROR;

Wonder if there's an argument for moving this behaviour to someplace
else - in most cases we really don't expect whitespace, and checking for
it is unnecessary overhead.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Mon, Sep 09, 2019 at 03:17:38AM -0700, Andres Freund wrote:
> On 2019-09-09 14:28:14 +0900, Michael Paquier wrote:
> I *VEHEMENTLY* oppose the introduction of any future pseudo-generic
> routines taking the type width as a parameter. They're weakening the
> type system and they're unnecessarily inefficient.

I saw that, using the previous wrapper increases the time of one call
from roughly 0.9ms to 1.1ns.

> I don't really buy that saving a few copies of a strings is worth that
> much. But if you really want to do that, the right approach imo would be
> to move the error reporting into a separate function. I.e. something
>
> void pg_attribute_noreturn()
> pg_strtoint_error(pg_strtoint_status status, const char *input, const char *type)
>
> that you'd call in small wrappers. Something like

I am not completely sure if we should do that either anyway.  Another
approach would be to try to make the callers of the routines generate
their own error messages.  The errors we have now are really linked to
the data types we have in core for signed integers (smallint, int,
bigint).  In most cases do they really make sense (varlena.c, pqmq.c,
etc.)?  And for errors which should never happen we could just use
elog().  For the input functions of int2/4/8 we still need the
existing errors of course.

>> So, it seems to me that if we want to have a consolidation of
>> everything, we basically need to have the generic error messages for
>> the backend directly into common/string.c where the routines are
>> refactored, and I think that we should do the following based on the
>> patch attached:
>
>> - Just remove pg_strtoint(), and move the error generation logic in
>> common/string.c.
>
> I'm not quite sure what you mean by moving the "error generation logic"?

I was referring to the error messages we have on HEAD in scanint8()
and pg_strtoint16() for bad inputs and overflows.

> Seems like these actually could just ought to use the error-checked
> variants. And I think it ought to change all of
> READ_{INT,UINT,LONG,UINT64,OID}_FIELD, rather than just redirecting one

Right.

>>  static void pcb_error_callback(void *arg);
>> @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location)
>>
>>          case T_Float:
>>              /* could be an oversize integer as well as a float ... */
>> -            if (scanint8(strVal(value), true, &val64))
>> +            if (pg_strtoint64(strVal(value), &val64) == PG_STRTOINT_OK)
>>              {
>>                  /*
>>                   * It might actually fit in int32. Probably only INT_MIN can
>
> Not for this change, but we really ought to move away from this crappy
> logic. It's really bonkers to have T_Float represent large integers and
> floats.

I am not sure but what are you suggesting here?

>> +    /* skip leading spaces */
>> +    while (likely(*ptr) && isspace((unsigned char) *ptr))
>> +        ptr++;
>> +
>> +    /* handle sign */
>> +    if (*ptr == '-')
>> +    {
>> +        ptr++;
>> +        neg = true;
>> +    }
>> +    else if (*ptr == '+')
>> +        ptr++;
>
>> +    /* require at least one digit */
>> +    if (unlikely(!isdigit((unsigned char) *ptr)))
>> +        return PG_STRTOINT_SYNTAX_ERROR;
>
> Wonder if there's an argument for moving this behaviour to someplace
> else - in most cases we really don't expect whitespace, and checking for
> it is unnecessary overhead.

Not sure about that.  I would keep the scope of the patch simple as of
now, where we make sure that we have the right interface for
everything.  There are a couple of extra improvements which could be
done afterwards, and if we move everything in the same place that
should be easier to move on with more improvements.  Hopefully.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-09-09 20:57:46 +0900, Michael Paquier wrote:
> > I don't really buy that saving a few copies of a strings is worth that
> > much. But if you really want to do that, the right approach imo would be
> > to move the error reporting into a separate function. I.e. something
> > 
> > void pg_attribute_noreturn()
> > pg_strtoint_error(pg_strtoint_status status, const char *input, const char *type)
> > 
> > that you'd call in small wrappers. Something like
> 
> I am not completely sure if we should do that either anyway.  Another
> approach would be to try to make the callers of the routines generate
> their own error messages.  The errors we have now are really linked to
> the data types we have in core for signed integers (smallint, int,
> bigint).  In most cases do they really make sense (varlena.c, pqmq.c,
> etc.)?

I think there's plenty places that ought to use the checked functions,
even if they currently don't. And typically calling in the caller will
actually be slightly less efficient, than an out-of-line function like I
was proposing above, because it'll be in-line code for infrequent code.

But ISTM all of them ought to just use the C types, rather than the SQL
types however. Since in the above proposal the caller determines the
type names, if you want a different type - like the SQL input routines -
can just invoke pg_strtoint_error() themselves (or just have it open
coded).


> And for errors which should never happen we could just use
> elog().  For the input functions of int2/4/8 we still need the
> existing errors of course.

Right, there it makes sense to continue to refer the SQL level types.


> >> So, it seems to me that if we want to have a consolidation of
> >> everything, we basically need to have the generic error messages for
> >> the backend directly into common/string.c where the routines are
> >> refactored, and I think that we should do the following based on the
> >> patch attached:
> > 
> >> - Just remove pg_strtoint(), and move the error generation logic in
> >> common/string.c.
> > 
> > I'm not quite sure what you mean by moving the "error generation logic"?
> 
> I was referring to the error messages we have on HEAD in scanint8()
> and pg_strtoint16() for bad inputs and overflows.

Not the right direction imo.



> >>  static void pcb_error_callback(void *arg);
> >> @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location)
> >>
> >>          case T_Float:
> >>              /* could be an oversize integer as well as a float ... */
> >> -            if (scanint8(strVal(value), true, &val64))
> >> +            if (pg_strtoint64(strVal(value), &val64) == PG_STRTOINT_OK)
> >>              {
> >>                  /*
> >>                   * It might actually fit in int32. Probably only INT_MIN can
> > 
> > Not for this change, but we really ought to move away from this crappy
> > logic. It's really bonkers to have T_Float represent large integers and
> > floats.
> 
> I am not sure but what are you suggesting here?

I'm saying that we shouldn't need the whole logic of trying to parse the
string as an int, and then fail to float if it's not that. But that it's
not this patchset's task to fix this.


> >> +    /* skip leading spaces */
> >> +    while (likely(*ptr) && isspace((unsigned char) *ptr))
> >> +        ptr++;
> >> +
> >> +    /* handle sign */
> >> +    if (*ptr == '-')
> >> +    {
> >> +        ptr++;
> >> +        neg = true;
> >> +    }
> >> +    else if (*ptr == '+')
> >> +        ptr++;
> > 
> >> +    /* require at least one digit */
> >> +    if (unlikely(!isdigit((unsigned char) *ptr)))
> >> +        return PG_STRTOINT_SYNTAX_ERROR;
> > 
> > Wonder if there's an argument for moving this behaviour to someplace
> > else - in most cases we really don't expect whitespace, and checking for
> > it is unnecessary overhead.
> 
> Not sure about that.  I would keep the scope of the patch simple as of
> now, where we make sure that we have the right interface for
> everything.  There are a couple of extra improvements which could be
> done afterwards, and if we move everything in the same place that
> should be easier to move on with more improvements.  Hopefully.

The only reason for thinking about it now is that we'd then avoid
changing the API twice.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Mon, Sep 09, 2019 at 03:17:38AM -0700, Andres Freund wrote:
> On 2019-09-09 14:28:14 +0900, Michael Paquier wrote:
>> @@ -80,7 +81,7 @@
>>  #define READ_UINT64_FIELD(fldname) \
>>      token = pg_strtok(&length);        /* skip :fldname */ \
>>      token = pg_strtok(&length);        /* get field value */ \
>> -    local_node->fldname = pg_strtouint64(token, NULL, 10)
>> +    (void) pg_strtouint64(token, &local_node->fldname)
>
> Seems like these actually could just ought to use the error-checked
> variants. And I think it ought to change all of
> READ_{INT,UINT,LONG,UINT64,OID}_FIELD, rather than just redirecting one
> of them to the new routines.

Okay for these changes, except for READ_INT_FIELD where we have short
variables using it as well (for example StrategyNumber) so this
generates a justified warning.  I think that a correct solution
here would be to add a new READ_SHORT_FIELD which uses pg_strtoint16.
I am not adding that for now.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
> On 2019-09-09 20:57:46 +0900, Michael Paquier wrote:
> But ISTM all of them ought to just use the C types, rather than the SQL
> types however. Since in the above proposal the caller determines the
> type names, if you want a different type - like the SQL input routines -
> can just invoke pg_strtoint_error() themselves (or just have it open
> coded).

Yep, that was my line of thoughts.

>> And for errors which should never happen we could just use
>> elog().  For the input functions of int2/4/8 we still need the
>> existing errors of course.
>
> Right, there it makes sense to continue to refer the SQL level types.

Actually, I found your suggestion of using a noreturn function for the
error reporting to be a very clean alternative.  I didn't know though
that gcc is not able to detect that a function does not return if you
don't have a default in the switch for all the status codes.  And this
even if all the values of the enum for the switch are listed.

> I'm saying that we shouldn't need the whole logic of trying to parse the
> string as an int, and then fail to float if it's not that. But that it's
> not this patchset's task to fix this.

Ah, sure.  Agreed.

>> Not sure about that.  I would keep the scope of the patch simple as of
>> now, where we make sure that we have the right interface for
>> everything.  There are a couple of extra improvements which could be
>> done afterwards, and if we move everything in the same place that
>> should be easier to move on with more improvements.  Hopefully.
>
> The only reason for thinking about it now is that we'd then avoid
> changing the API twice.

What I think we would be looking for here is an extra argument for the
low-level routines to control the behavior of the function in an
extensible way, say a bits16 for a set of flags, with one flag to
ignore checks for trailing and leading whitespace.  This feels a bit
over-engineered though for this purpose.

Attached is an updated patch?  How does it look?  I have left the
parts of readfuncs.c for now as there are more issues behind that than
doing a single switch, short reads are one, long reads a second.  And
the patch already does a lot.  There could be also an argument for
having extra _check wrappers for the unsigned portions but these would
be mostly unused in the backend code, so I have left that out on
purpose.

After all that stuff, there are still some issues which need more
care, in short:
- the T_Float conversion.
- removal of strtoint()
- the part for readfuncs.c
--
Michael


Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Tue, Sep 10, 2019 at 12:05:25PM +0900, Michael Paquier wrote:
> Attached is an updated patch?  How does it look?  I have left the
> parts of readfuncs.c for now as there are more issues behind that than
> doing a single switch, short reads are one, long reads a second.  And
> the patch already does a lot.  There could be also an argument for
> having extra _check wrappers for the unsigned portions but these would
> be mostly unused in the backend code, so I have left that out on
> purpose.

I have looked at this patch again today after letting it aside a
couple of days, and I quite like the resulting shape of the routines.
Does anybody else have any comments?  Would it make sense to extend
more the string-to-int conversion routines with a set of control flags
to bypass the removal of leading and trailing whitespaces?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
On 2019-09-10 12:05:25 +0900, Michael Paquier wrote:
> On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
> > On 2019-09-09 20:57:46 +0900, Michael Paquier wrote:
> > But ISTM all of them ought to just use the C types, rather than the SQL
> > types however. Since in the above proposal the caller determines the
> > type names, if you want a different type - like the SQL input routines -
> > can just invoke pg_strtoint_error() themselves (or just have it open
> > coded).
> 
> Yep, that was my line of thoughts.
> 
> >> And for errors which should never happen we could just use
> >> elog().  For the input functions of int2/4/8 we still need the
> >> existing errors of course.
> > 
> > Right, there it makes sense to continue to refer the SQL level types.
> 
> Actually, I found your suggestion of using a noreturn function for the
> error reporting to be a very clean alternative.  I didn't know though
> that gcc is not able to detect that a function does not return if you
> don't have a default in the switch for all the status codes.  And this
> even if all the values of the enum for the switch are listed.

As I proposed they'd be in different translation units, so the compiler
wouldn't see the definition of the function, just the declaration.


> >> Not sure about that.  I would keep the scope of the patch simple as of
> >> now, where we make sure that we have the right interface for
> >> everything.  There are a couple of extra improvements which could be
> >> done afterwards, and if we move everything in the same place that
> >> should be easier to move on with more improvements.  Hopefully.
> > 
> > The only reason for thinking about it now is that we'd then avoid
> > changing the API twice.
> 
> What I think we would be looking for here is an extra argument for the
> low-level routines to control the behavior of the function in an
> extensible way, say a bits16 for a set of flags, with one flag to
> ignore checks for trailing and leading whitespace.

That'd probably be a bad idea, for performance reasons.



> Attached is an updated patch?  How does it look?  I have left the
> parts of readfuncs.c for now as there are more issues behind that than
> doing a single switch, short reads are one, long reads a second.

Hm? I don't know what you mean by those issues.


> And the patch already does a lot.  There could be also an argument for
> having extra _check wrappers for the unsigned portions but these would
> be mostly unused in the backend code, so I have left that out on
> purpose.

I'd value consistency higher here.




> diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
> index 2c0ae395ba..8e75d52b06 100644
> --- a/src/backend/executor/spi.c
> +++ b/src/backend/executor/spi.c
> @@ -21,6 +21,7 @@
>  #include "catalog/heap.h"
>  #include "catalog/pg_type.h"
>  #include "commands/trigger.h"
> +#include "common/string.h"
>  #include "executor/executor.h"
>  #include "executor/spi_priv.h"
>  #include "miscadmin.h"
> @@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
>                      CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
>  
>                      if (strncmp(completionTag, "SELECT ", 7) == 0)
> -                        _SPI_current->processed =
> -                            pg_strtouint64(completionTag + 7, NULL, 10);
> +                        (void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);

I'd just use the checked version here, seems like a good thing to check
for, and I can't imagine it matters performance wise.


> @@ -63,8 +63,16 @@ Datum
>  int2in(PG_FUNCTION_ARGS)
>  {
>      char       *num = PG_GETARG_CSTRING(0);
> +    int16        res;
> +    pg_strtoint_status status;
>  
> -    PG_RETURN_INT16(pg_strtoint16(num));
> +    /* Use a custom set of error messages here adapted to the data type */
> +    status = pg_strtoint16(num, &res);

I don't know what that comment is supposed to mean?

> +/*
> + * pg_strtoint64_check
> + *
> + * Convert input string to a signed 64-bit integer.
> + *
> + * This throws ereport() upon bad input format or overflow.
> + */
> +int64
> +pg_strtoint64_check(const char *s)
> +{
> +    int64        result;
> +    pg_strtoint_status status = pg_strtoint64(s, &result);
> +
> +    if (unlikely(status != PG_STRTOINT_OK))
> +        pg_strtoint_error(status, s, "int64");
> +    return result;
> +}

I think I'd just put these as inlines in the header.



Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Fri, Sep 13, 2019 at 06:38:31PM -0700, Andres Freund wrote:
> On 2019-09-10 12:05:25 +0900, Michael Paquier wrote:
>> On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
>> Attached is an updated patch?  How does it look?  I have left the
>> parts of readfuncs.c for now as there are more issues behind that than
>> doing a single switch, short reads are one, long reads a second.
>
> Hm? I don't know what you mean by those issues.

I think that we have more issues than it looks.  For example:
- Switching UINT to use pg_strtouint32() causes an incompatibility
issue compared to atoui().
- Switching INT to use pg_strtoint32() causes a set of warnings as for
example with AttrNumber:
72 |  (void) pg_strtoint32(token, &local_node->fldname)
   |                              ^~~~~~~~~~~~~~~~~~~~~
   |                              |
   |                              AttrNumber * {aka short int *}
And it is not like we should use a cast either, as we could hide real
issues.     Hence it seems to me that we need to have a new routine
definition for shorter integers and switch more flags to that.
- Switching LONG to use pg_strtoint64() leads to another set of
issues, particularly one could see an assertion failure related to Agg
nodes.  I am not sure either that we should use int64 here as the size
can be at least 32b.
- Switching OID to use pg_strtoint32 causes a failure with initdb.

So while I agree with you that a switch should be doable, there is a
large set of issues to ponder about here, and the patch already does a
lot, so I really think that we had better do a closer lookup at those
issues separately, once the basics are in place, and consider them if
they actually make sense.  There is much more than just doing a direct
switch in this area with the family of ato*() system calls.

>> And the patch already does a lot.  There could be also an argument for
>> having extra _check wrappers for the unsigned portions but these would
>> be mostly unused in the backend code, so I have left that out on
>> purpose.
>
> I'd value consistency higher here.

Okay, no objections to that.

>> @@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
>>                      CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
>>
>>                      if (strncmp(completionTag, "SELECT ", 7) == 0)
>> -                        _SPI_current->processed =
>> -                            pg_strtouint64(completionTag + 7, NULL, 10);
>> +                        (void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
>
> I'd just use the checked version here, seems like a good thing to check
> for, and I can't imagine it matters performance wise.

Yeah, makes sense.  I don't think it matters either for
pg_stat_statements in the same context.  So changed that part as
well.

>> @@ -63,8 +63,16 @@ Datum
>>  int2in(PG_FUNCTION_ARGS)
>>  {
>>      char       *num = PG_GETARG_CSTRING(0);
>> +    int16        res;
>> +    pg_strtoint_status status;
>>
>> -    PG_RETURN_INT16(pg_strtoint16(num));
>> +    /* Use a custom set of error messages here adapted to the data type */
>> +    status = pg_strtoint16(num, &res);
>
> I don't know what that comment is supposed to mean?

I mean here that the _check equivalent cannot be used as any error
messages generated need to be consistent with the SQL data type.  I
have updated the comment, does it look better now?

>> +/*
>> + * pg_strtoint64_check
>> + *
>> + * Convert input string to a signed 64-bit integer.
>> + *
>> + * This throws ereport() upon bad input format or overflow.
>> + */
>> +int64
>> +pg_strtoint64_check(const char *s)
>> +{
>> +    int64        result;
>> +    pg_strtoint_status status = pg_strtoint64(s, &result);
>> +
>> +    if (unlikely(status != PG_STRTOINT_OK))
>> +        pg_strtoint_error(status, s, "int64");
>> +    return result;
>> +}
>
> I think I'd just put these as inlines in the header.

I have not considered that.  This bloats a bit more builtins.h.  We
could live with that, or just move that into a separate header in
include/utils/, say int.h?  Even if common/int.h exists?

Attached is an updated patch.  Perhaps you have something else in
mind?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

> - Switching INT to use pg_strtoint32() causes a set of warnings as for
> example with AttrNumber:
> 72 |  (void) pg_strtoint32(token, &local_node->fldname)
>   |                              ^~~~~~~~~~~~~~~~~~~~~
>   |                              |
>   |                              AttrNumber * {aka short int *}

> And it is not like we should use a cast either, as we could hide real
> issues.

It should rather call pg_strtoint16? And possibly switch the "short int" 
declaration to int16?

About batch v14: applies cleanly and compiles without warnings. "make 
check" ok.

I do not think that "pg_strtoint_error" should be inlinable. The function 
is unlikely to be called, so it is not performance critical to inline it, 
and would enlarge the executable needlessly. However, the 
"pg_strto*_check" variants should be inlinable, as you have done.

About the code, on several instances of:

        /* skip leading spaces */
          while (likely(*ptr) && isspace((unsigned char) *ptr)) …

I would drop the "likely(*ptr)".

And on several instances of:

    !unlikely(isdigit((unsigned char) *ptr)))

ISTM that we want "unlikely(!isdigit((unsigned char) *ptr)))". Parsing 
!unlikely leads to false conclusion and a headache:-)

Otherwise, this batch of changes looks ok to me.

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Sat, Sep 14, 2019 at 10:24:10AM +0200, Fabien COELHO wrote:
> It should rather call pg_strtoint16? And possibly switch the "short int"
> declaration to int16?

Sure, but you get into other problems if using the 16-bit version for
some other fields, which is why it seems to me that we should add an
extra routine for shorts.  So for now I prefer discarding this part.

> I do not think that "pg_strtoint_error" should be inlinable. The function is
> unlikely to be called, so it is not performance critical to inline it, and
> would enlarge the executable needlessly. However, the "pg_strto*_check"
> variants should be inlinable, as you have done.

Makes sense.

> About the code, on several instances of:
>
>        /* skip leading spaces */
>          while (likely(*ptr) && isspace((unsigned char) *ptr)) …
>
> I would drop the "likely(*ptr)".

Right as well.  There were two places out of six with that pattern.

> And on several instances of:
>
>    !unlikely(isdigit((unsigned char) *ptr)))
>
> ISTM that we want "unlikely(!isdigit((unsigned char) *ptr)))". Parsing
> !unlikely leads to false conclusion and a headache:-)

That part was actually inconsistent with the rest.

> Otherwise, this batch of changes looks ok to me.

Thanks.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-09-14 15:02:36 +0900, Michael Paquier wrote:
> On Fri, Sep 13, 2019 at 06:38:31PM -0700, Andres Freund wrote:
> > On 2019-09-10 12:05:25 +0900, Michael Paquier wrote:
> >> On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
> >> Attached is an updated patch?  How does it look?  I have left the
> >> parts of readfuncs.c for now as there are more issues behind that than
> >> doing a single switch, short reads are one, long reads a second.
> > 
> > Hm? I don't know what you mean by those issues.
> 
> I think that we have more issues than it looks.  For example:
> - Switching UINT to use pg_strtouint32() causes an incompatibility
> issue compared to atoui().

"An incompatibility" is, uh, vague.


> - Switching INT to use pg_strtoint32() causes a set of warnings as for
> example with AttrNumber:
> 72 |  (void) pg_strtoint32(token, &local_node->fldname)
>    |                              ^~~~~~~~~~~~~~~~~~~~~
>    |                              |
>    |                              AttrNumber * {aka short int *}
> And it is not like we should use a cast either, as we could hide real
> issues.     Hence it seems to me that we need to have a new routine
> definition for shorter integers and switch more flags to that.

Yea.


> - Switching LONG to use pg_strtoint64() leads to another set of
> issues, particularly one could see an assertion failure related to Agg
> nodes.  I am not sure either that we should use int64 here as the size
> can be at least 32b.

That seems pretty clearly something that needs to be debugged before
applying this series. If there's such a failure, it indicates that
there's either a problem in this patchset, or a pre-existing problem in
readfuncs.


> - Switching OID to use pg_strtoint32 causes a failure with initdb.

Needs to be debugged too. Although I suspect this might just be that you
need to use unsigned variant.


> So while I agree with you that a switch should be doable, there is a
> large set of issues to ponder about here, and the patch already does a
> lot, so I really think that we had better do a closer lookup at those
> issues separately, once the basics are in place, and consider them if
> they actually make sense.  There is much more than just doing a direct
> switch in this area with the family of ato*() system calls.

I have no problme applying this separately, but I really don't think
it's wise to apply this before these problems have been debugged.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Fabien COELHO
Date:
Bonjour Michaël,

>> Otherwise, this batch of changes looks ok to me.
>
> Thanks.

About v15: applies cleanly, compiles, "make check" ok.

While re-reading the patch, there are bit of repetitions on pg_strtou?int* 
comments. I'm wondering whether it would make sense to write a global 
comments before each 3-type series to avoid that.

-- 
Fabien.

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Mon, Sep 16, 2019 at 10:08:19AM -0700, Andres Freund wrote:
> On 2019-09-14 15:02:36 +0900, Michael Paquier wrote:
>> - Switching OID to use pg_strtoint32 causes a failure with initdb.
>
> Needs to be debugged too. Although I suspect this might just be that you
> need to use unsigned variant.

No, that's not it.  My last message had a typo as I used the unsigned
variant.  Anyway, by switching the routines of readfuncs.c to use the
new _check wrappers it is easy to see what the actual issue is.  And
here we go with one example:
"FATAL:  invalid input syntax for type uint32: "12089 :relkind v"

So, the root of the problem is that the optimized conversion routines
would complain if the end of the string includes incorrect characters
when converting a node from text, which is not something that strtoXX
complains about.  So our wrappers atooid() and atoui() accept more
types of strings in input as they rely on the system's strtol().  And
that counts for the failures with UINT and OID.  atoi() also is more
flexible on that, which explains the failures for INT, as well as
atol() for LONG (this shows a failure in the regression tests, not at
initdb time though).

One may think that this restriction does not actually apply to
READ_UINT64_FIELD because the routine involves no other things than
queryId.  However once I enable -DWRITE_READ_PARSE_PLAN_TREES
-DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES in the builds,
queryId parsing also complains with the patch.  So except if we
redesign the node reads we are bound to keep around the wrapper of
strtoXX on HEAD called pg_strtouint64() to avoid an incompatibility
when parsing the 64-bit query ID.  We could keep that isolated in
readfuncs.c close to the declarations of atoui() and strtobool()
though.  This also points out that pg_strtouint64 of HEAD is
inconsistent with its signed relatives in the treatment of the input
string.

The code paths of the patch calling pg_strtouint64_check to parse
completion tags (spi.c and pg_stat_statements.c) should complain in
those cases as the format of the tags for SELECT and COPY is fixed.

In order to unify the parsing interface and put all the conversion
routines in a single place, I still think that the patch has value so
I would still keep it (with a fix for the queryId parsing of course),
but there is much more to it.

>> So while I agree with you that a switch should be doable, there is a
>> large set of issues to ponder about here, and the patch already does a
>> lot, so I really think that we had better do a closer lookup at those
>> issues separately, once the basics are in place, and consider them if
>> they actually make sense.  There is much more than just doing a direct
>> switch in this area with the family of ato*() system calls.
>
> I have no problem applying this separately, but I really don't think
> it's wise to apply this before these problems have been debugged.

Sure.  No problem with that line of reasoning.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote:
> The code paths of the patch calling pg_strtouint64_check to parse
> completion tags (spi.c and pg_stat_statements.c) should complain in
> those cases as the format of the tags for SELECT and COPY is fixed.
>
> In order to unify the parsing interface and put all the conversion
> routines in a single place, I still think that the patch has value so
> I would still keep it (with a fix for the queryId parsing of course),
> but there is much more to it.

Forgot to mention that another angle of attack would be of course to
add some control flags in the optimized parsing functions to make them
more permissive regarding the handling of the trailing characters, by
not considering as a syntax error the case where the last character is
not a zero-termination.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote:
> In order to unify the parsing interface and put all the conversion
> routines in a single place, I still think that the patch has value so
> I would still keep it (with a fix for the queryId parsing of course),
> but there is much more to it.

As of now, here is an updated patch which takes the path to not
complicate the refactored APIs and fixes the issue with queryID in
readfuncs.c.  Thoughts?
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Ashutosh Sharma
Date:
Is there any specific reason for hard coding the *base* of a number
representing the string in strtouint64(). I understand that currently
strtouint64() is being used just to convert an input string to decimal
unsigned value but what if we want it to be used for hexadecimal
values or may be some other values, in that case it can't be used.
Further, the function name is strtouint64() but the comments atop it's
definition says it's pg_strtouint64(). That needs to be corrected.

At few places, I could see that the function call to
pg_strtoint32_check() is followed by an error handling. Isn't that
already being done in pg_strtoint32_check function itself. For e.g. in
refint.c the function call to pg_strtoint32_check is followed by a if
condition that checks for an error which I assume shouldn't be there
as it is already being done by pg_strtoint32_check.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Wed, Sep 18, 2019 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote:
> > In order to unify the parsing interface and put all the conversion
> > routines in a single place, I still think that the patch has value so
> > I would still keep it (with a fix for the queryId parsing of course),
> > but there is much more to it.
>
> As of now, here is an updated patch which takes the path to not
> complicate the refactored APIs and fixes the issue with queryID in
> readfuncs.c.  Thoughts?
> --
> Michael



Re: refactoring - share str2*int64 functions

From
Andres Freund
Date:
Hi,

On 2019-10-04 14:27:44 +0530, Ashutosh Sharma wrote:
> Is there any specific reason for hard coding the *base* of a number
> representing the string in strtouint64(). I understand that currently
> strtouint64() is being used just to convert an input string to decimal
> unsigned value but what if we want it to be used for hexadecimal
> values or may be some other values, in that case it can't be used.

It's a lot slower if the base is variable, because the compiler cannot
replace the division by shifts.

Greetings,

Andres Freund



Re: refactoring - share str2*int64 functions

From
Ashutosh Sharma
Date:
On Fri, Oct 4, 2019 at 8:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-10-04 14:27:44 +0530, Ashutosh Sharma wrote:
> > Is there any specific reason for hard coding the *base* of a number
> > representing the string in strtouint64(). I understand that currently
> > strtouint64() is being used just to convert an input string to decimal
> > unsigned value but what if we want it to be used for hexadecimal
> > values or may be some other values, in that case it can't be used.
>
> It's a lot slower if the base is variable, because the compiler cannot
> replace the division by shifts.
>

Thanks Andres for the reply. I didn't know that the compiler won't be
able to replace division with shifts operator if the base is variable
and it's true that it would make the things a lot slower.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Fri, Oct 04, 2019 at 02:27:44PM +0530, Ashutosh Sharma wrote:
> Is there any specific reason for hard coding the *base* of a number
> representing the string in strtouint64(). I understand that currently
> strtouint64() is being used just to convert an input string to decimal
> unsigned value but what if we want it to be used for hexadecimal
> values or may be some other values, in that case it can't be used.
> Further, the function name is strtouint64() but the comments atop it's
> definition says it's pg_strtouint64(). That needs to be corrected.

Performance, as Andres has already stated upthread.  Moving away from
strtol gives roughly a 40% improvement with a call-to-call comparison:
https://www.postgresql.org/message-id/20190909052814.GA26605@paquier.xyz

> At few places, I could see that the function call to
> pg_strtoint32_check() is followed by an error handling. Isn't that
> already being done in pg_strtoint32_check function itself. For e.g. in
> refint.c the function call to pg_strtoint32_check is followed by a if
> condition that checks for an error which I assume shouldn't be there
> as it is already being done by pg_strtoint32_check.

pg_strtoint32_check is used for a signed integer, so it would complain
about incorrect input syntax, but not when the parsed integer is less
or equal than 0, which is what refint.c complains about.
--
Michael

Attachment

Re: refactoring - share str2*int64 functions

From
Michael Paquier
Date:
On Wed, Sep 18, 2019 at 10:13:20AM +0900, Michael Paquier wrote:
> As of now, here is an updated patch which takes the path to not
> complicate the refactored APIs and fixes the issue with queryID in
> readfuncs.c.  Thoughts?

For now, and seeing that there is little interest in it.  I have
marked the patch as returned with feedback in this CF.  The thread has
gone long, so if there is a new submission I would suggest using a new
thread with a fresh start point..  Not sure if I'll work on that or
not.
--
Michael

Attachment