Thread: Greatest Common Divisor
I recently came across the need for a gcd function (actually I needed lcm) and was surprised that we didn't have one. So here one is, using the basic Euclidean algorithm. I made one for smallint, integer, and bigint.

-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Attachment
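For illustration, the basic Euclidean algorithm the attached patch is built around can be sketched in C roughly as follows (a hypothetical helper, not the patch's actual code; it ignores the negative-input and INT_MIN edge cases debated below):

    #include <stdint.h>

    /*
     * Minimal Euclidean descent for int32: repeatedly replace (a, b)
     * with (b, a mod b) until b reaches zero.  Correct for nonnegative
     * inputs only; sign handling is discussed later in the thread.
     */
    static int32_t
    gcd32_sketch(int32_t a, int32_t b)
    {
        while (b != 0)
        {
            int32_t tmp = a % b;

            a = b;
            b = tmp;
        }
        return a;
    }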
Bonsoir Vik,

> I recently came across the need for a gcd function (actually I needed lcm) and was surprised that we didn't have one.

Why not.

> So here one is, using the basic Euclidean algorithm. I made one for smallint, integer, and bigint.

Should pg provide the LCM as well? Hmmm, probably not, too likely to overflow.

Should there be a NUMERIC version as well? I'd say maybe yes.

I'm wondering what it should do on N, 0 and 0, 0. Raise an error? Return 0? Return 1? Return N? There should be some logic and comments explaining it.

I'd test with INT_MIN and INT_MAX.

Given that there is no overflow risk with the Euclidean descent, would it make sense that the int2 version call the int4 implementation?

The C modulo operator (%) is a pain because it is not the positive remainder (2 % -3 == -1 vs 2 % 3 == 2, AFAICR). It does not seem that fixing the sign afterwards is the right thing to do. I'd rather turn both arguments positive before doing the descent. Which raises an issue with INT_MIN, by the way, which has no positive counterpart :-(

Also, the usual approach is to exchange the args so that the largest is first, because there may be a software emulation if the hardware does not implement modulo. At least it was the case with some SPARC processors 25 years ago :-)

See for instance (the int min value is probably not well handled): https://svn.cri.ensmp.fr/svn/linear/trunk/src/arithmetique/pgcd.c

Basically, welcome to arithmetic :-)

-- 
Fabien.
On 28/12/2019 19:15, Fabien COELHO wrote:
>
>> So here one is, using the basic Euclidean algorithm. I made one for smallint, integer, and bigint.
>
> Should pg provide the LCM as well? Hmmm, probably not, too likely to overflow.

I decided against it for that reason.

> Should there be a NUMERIC version as well? I'd say maybe yes.

I thought about that, too, but also decided against it for this patch.

> I'm wondering what it should do on N, 0 and 0, 0. Raise an error? Return 0? Return 1? Return N? There should be some logic and comments explaining it.

Well, gcd(N, 0) is N, and gcd(0, 0) is 0, so I don't see an issue here?

> I'd test with INT_MIN and INT_MAX.

Okay, I'll add tests for those, instead of the pretty much random values I have now.

> Given that there is no overflow risk with the Euclidean descent, would it make sense that the int2 version call the int4 implementation?

Meh.

> The C modulo operator (%) is a pain because it is not the positive remainder (2 % -3 == -1 vs 2 % 3 == 2, AFAICR).

This does not seem to be the case...

> It does not seem that fixing the sign afterwards is the right thing to do. I'd rather turn both arguments positive before doing the descent.

Why isn't it the right thing to do?

> Which raises an issue with INT_MIN, by the way, which has no positive counterpart :-(

That's an argument against abs-ing the input values. It's only an issue with gcd(INT_MIN, INT_MIN), which currently returns INT_MIN. Any other value with INT_MIN will be 1 or something lower than INT_MAX.

> Also, the usual approach is to exchange the args so that the largest is first, because there may be a software emulation if the hardware does not implement modulo. At least it was the case with some SPARC processors 25 years ago :-)

The args will exchange themselves.

Thanks for the review! Attached is a new patch that changes the regression tests based on your comments (and another comment that I got on irc to test gcd(b, a)).
Attachment
Bonjour Vik,

>> Should there be a NUMERIC version as well? I'd say maybe yes.
>
> I thought about that, too, but also decided against it for this patch.

Hmmm. ISTM that int functions are available for numeric?

>> I'm wondering what it should do on N, 0 and 0, 0. Raise an error? Return 0? Return 1? Return N? There should be some logic and comments explaining it.
>
> Well, gcd(N, 0) is N, and gcd(0, 0) is 0, so I don't see an issue here?

I think that there should be a comment.

>> I'd test with INT_MIN and INT_MAX.
>
> Okay, I'll add tests for those, instead of the pretty much random values I have now.

>> The C modulo operator (%) is a pain because it is not the positive remainder (2 % -3 == -1 vs 2 % 3 == 2, AFAICR).
>
> This does not seem to be the case...

Indeed, I tested quickly with python, but it has yet another behavior as shown above, what a laugh! So with C:

  2 % -3 == 2, -2 % 3 == -2

Note that AFAICS there is no integer i so that 3 * i - (-2) == -2.

>> It does not seem that fixing the sign afterwards is the right thing to do. I'd rather turn both arguments positive before doing the descent.
>
> Why isn't it the right thing to do?

Because I do not trust C modulo, as I had a lot of problems with it? :-)

If it works, fine, but it deserves a clear comment explaining why.

>> Which raises an issue with INT_MIN, by the way, which has no positive counterpart :-(
>
> That's an argument against abs-ing the input values. It's only an issue with gcd(INT_MIN, INT_MIN), which currently returns INT_MIN.

That should be an error instead, because -INT_MIN cannot be represented?

> Any other value with INT_MIN will be 1 or something lower than INT_MAX.

Looks ok.

>> Also, the usual approach is to exchange the args so that the largest is first, because there may be a software emulation if the hardware does not implement modulo. At least it was the case with some SPARC processors 25 years ago :-)
>
> The args will exchange themselves.

Yep, but after a possibly expensive software-emulated modulo operation?

-- 
Fabien.
On 29/12/2019 08:30, Fabien COELHO wrote:
>
>>> I'm wondering what it should do on N, 0 and 0, 0. Raise an error? Return 0? Return 1? Return N? There should be some logic and comments explaining it.
>>
>> Well, gcd(N, 0) is N, and gcd(0, 0) is 0, so I don't see an issue here?
>
> I think that there should be a comment.

Done.

>>> It does not seem that fixing the sign afterwards is the right thing to do. I'd rather turn both arguments positive before doing the descent.
>>
>> Why isn't it the right thing to do?
>
> Because I do not trust C modulo, as I had a lot of problems with it? :-)
>
> If it works, fine, but it deserves a clear comment explaining why.

Surely such a comment should be on the mod functions and not in this patch.

>>> Which raises an issue with INT_MIN, by the way, which has no positive counterpart :-(
>>
>> That's an argument against abs-ing the input values. It's only an issue with gcd(INT_MIN, INT_MIN), which currently returns INT_MIN.
>
> That should be an error instead, because -INT_MIN cannot be represented?

Why should it error? Is INT_MIN not a valid divisor of INT_MIN? I added a comment instead.

>>> Also, the usual approach is to exchange the args so that the largest is first, because there may be a software emulation if the hardware does not implement modulo. At least it was the case with some SPARC processors 25 years ago :-)
>>
>> The args will exchange themselves.
>
> Yep, but after a possibly expensive software-emulated modulo operation?

I'll just trust you on this. Swap added.

Thanks!

-- 
Vik
Attachment
On 12/29/19 02:30, Fabien COELHO wrote:
>>> The C modulo operator (%) is a pain because it is not the positive remainder (2 % -3 == -1 vs 2 % 3 == 2, AFAICR).
>>
>> This does not seem to be the case...
> ...
> Because I do not trust C modulo, as I had a lot of problems with it? :-)

If I recall correctly (and I'm traveling and away from those notes), the exact semantics of C's % with negative operands was left implementation-defined until, was it, C99?

So it might be ok to rely on the specified C99 behavior (whichever behavior that is, he wrote, notelessly) for PG 12 and later, where C99 is expected.

Regards,
-Chap
Hello,

>> Because I do not trust C modulo, as I had a lot of problems with it? :-)
>
> If I recall correctly (and I'm traveling and away from those notes), the exact semantics of C's % with negative operands was left implementation-defined until, was it, C99?

Indeed, my woes with C % started before that date :-)

By Googling the C99 spec, I found:

"When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded (aka truncation toward zero). If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a."

Let a = 2 and b = -3, then a/b == 0 (-0.666 truncated toward zero), then

  (a/b)*b + a%b == a  =>  0 * -3 + (2 % -3) == 2  =>  2 % -3 == 2

Then with a = -2 and b = 3, a/b == 0 (same as above), and the same reasoning leads to

  -2 % 3 == -2

Which is indeed what was produced with C, but not with Python.

The good news is that the absolute value of the modulo is the modulus in the usual sense, which is what is needed for the Euclidean descent and allows fixing the sign afterwards, as Vik was doing.

> So it might be ok to rely on the specified C99 behavior (whichever behavior that is, he wrote, notelessly) for PG 12 and later, where C99 is expected.

Yep, probably with a comment.

-- 
Fabien.
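A tiny standalone C program (an illustration, not code from the patch) confirms the truncation-toward-zero behavior derived above:

    #include <stdio.h>

    /*
     * C99 division truncates toward zero, so the remainder produced
     * by % takes the sign of the dividend.
     */
    int
    main(void)
    {
        printf("2 / -3 = %d, 2 %% -3 = %d\n", 2 / -3, 2 % -3);  /* 0 and 2 */
        printf("-2 / 3 = %d, -2 %% 3 = %d\n", -2 / 3, -2 % 3);  /* 0 and -2 */
        return 0;
    }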
Out of curiosity, what was the original use-case for this?

I'm not objecting to adding it, I'm just curious. In fact, I think that if we do add this, then we should probably add lcm() at the same time, since handling its overflow cases is sufficiently non-trivial to justify not requiring users to have to implement it themselves.

I don't like the INT_MIN handling though:

select gcd(-2147483648,0);
     gcd
-------------
 -2147483648
(1 row)

select gcd(-2147483648,-2147483648);
     gcd
-------------
 -2147483648
(1 row)

Normally gcd() returns a positive integer, and gcd(a,0) = gcd(a,a) = abs(a). But since abs(INT_MIN) cannot be represented as a 32-bit integer, both those cases should throw an integer-out-of-range error.

In addition, the following case should produce 1, but for me it produces an error. This is actually going to be platform-dependent as it is currently implemented (see the comments in int4div and int4mod):

select gcd(-2147483648,-1);
ERROR:  floating-point exception
DETAIL:  An invalid floating-point operation was signaled. This probably means an out-of-range result or an invalid operation, such as division by zero.

so there needs to be some special-case INT_MIN handling at the start to deal with these cases.

AFAIK, what gcd(0,0) should be is not well defined, but most common implementations seem to return 0 (I checked Matlab, python and Java's standard libraries). This seems like a reasonable extrapolation of the rule gcd(a,0) = gcd(0,a) = gcd(a,a) = abs(a), so I don't have a problem with doing the same here, but I think that it should be documented (e.g., see [1]), if for no other reason than users might expect it to be safe to divide by the result.

Regards,
Dean

[1] https://www.mathworks.com/help/matlab/ref/gcd.html
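One possible shape for the special-case handling Dean asks for, sketched in PostgreSQL-backend-style C (a hypothetical function; PG_INT32_MIN, ereport and friends are the real backend idioms, but this is not the patch's code):

    static int32
    gcd32_checked(int32 arg1, int32 arg2)
    {
        /*
         * gcd(INT_MIN, 0) and gcd(INT_MIN, INT_MIN) would be abs(INT_MIN),
         * which is not representable as an int32, so error out up front.
         */
        if ((arg1 == PG_INT32_MIN && (arg2 == 0 || arg2 == PG_INT32_MIN)) ||
            (arg2 == PG_INT32_MIN && arg1 == 0))
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("integer out of range")));

        /*
         * If either input is +/-1 the gcd is 1.  Short-circuiting this also
         * keeps the loop away from INT_MIN % -1, which traps on some
         * platforms even though the mathematical answer is well-defined.
         */
        if (arg1 == 1 || arg1 == -1 || arg2 == 1 || arg2 == -1)
            return 1;

        while (arg2 != 0)
        {
            int32       tmp = arg1 % arg2;

            arg1 = arg2;
            arg2 = tmp;
        }

        /* C99's % leaves the dividend's sign on the result; fix it here */
        return (arg1 < 0) ? -arg1 : arg1;
    }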
Greetings,

* Dean Rasheed (dean.a.rasheed@gmail.com) wrote:
> I'm not objecting to adding it, I'm just curious. In fact, I think that if we do add this, then we should probably add lcm() at the same time, since handling its overflow cases is sufficiently non-trivial to justify not requiring users to have to implement it themselves.

I tend to agree with this.

Thanks,

Stephen
Attachment
Stephen Frost <sfrost@snowman.net> writes:
> * Dean Rasheed (dean.a.rasheed@gmail.com) wrote:
>> I'm not objecting to adding it, I'm just curious. In fact, I think that if we do add this, then we should probably add lcm() at the same time, since handling its overflow cases is sufficiently non-trivial to justify not requiring users to have to implement it themselves.

> I tend to agree with this.

Does this impact the decision about whether we need a variant for numeric? I was leaning against that, primarily because (a) it'd introduce a set of questions about what to do with non-integral inputs, and (b) it'd make the patch quite a lot larger, I imagine. But a variant of lcm() that returns numeric would have much more resistance to overflow.

Maybe we could just define "lcm(bigint, bigint) returns numeric" and figure that that covers all cases, but it feels slightly weird. You couldn't do lcm(lcm(a,b),c) without casting. I guess that particular use-case could be addressed with "lcm(variadic bigint[]) returns numeric", but that's getting really odd.

			regards, tom lane
On Sat, Dec 28, 2019 at 12:15 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Bonsoir Vik,
>
>> I recently came across the need for a gcd function (actually I needed lcm) and was surprised that we didn't have one.
>
> Why not.

Proliferation of code in the public namespace; it can displace code that is written by others during the upgrade.

merlin
> Normally gcd() returns a positive integer, and gcd(a,0) = gcd(a,a) = abs(a). But since abs(INT_MIN) cannot be represented as a 32-bit integer, both those cases should throw an integer-out-of-range error.

I'm also in favor of that option, rather than returning a negative result.

About lcm(a, b): a / gcd(a, b) * b, at least if a & b are positive. If not, some thoughts are needed :-)

Returning a NUMERIC as suggested by Tom would solve the overflow problem by sending it back to the user, who has to cast. This looks ok to me.

Maybe we could provide "int4 lcm(int2, int2)" and "int8 lcm(int4, int4)", as ISTM that there cannot be overflows on those (eg for the latter: lcm <= a*b, a & b are 31 non-signed bits, 62 bits are needed, 63 are available).

-- 
Fabien.
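Fabien's formulation, dividing before multiplying so that the only step that can overflow is the final product, might look like this in backend-style C (a sketch assuming a gcd helper like the gcd32_checked() sketch above; pg_mul_s32_overflow() is PostgreSQL's real checked multiply from common/int.h):

    #include "common/int.h"         /* pg_mul_s32_overflow() */

    /*
     * lcm(a, b) = a / gcd(a, b) * b.  Sketch only: assumes a and b are
     * already nonnegative, per Fabien's caveat about positive inputs.
     */
    static int32
    lcm32_sketch(int32 a, int32 b)
    {
        int32       result;

        if (a == 0 || b == 0)
            return 0;               /* conventional: lcm with zero is zero */

        a = a / gcd32_checked(a, b);    /* exact division, cannot overflow */

        if (pg_mul_s32_overflow(a, b, &result))
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("integer out of range")));

        return result;
    }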
On 2020-01-02 15:50, Dean Rasheed wrote:
> Out of curiosity, what was the original use-case for this?

Yeah, I'm wondering, is this useful for any typical analytics or business application? Otherwise, abstract algebra functionality seems a bit out of scope.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2020-01-02 15:50, Dean Rasheed wrote:
>> Out of curiosity, what was the original use-case for this?

> Yeah, I'm wondering, is this useful for any typical analytics or business application? Otherwise, abstract algebra functionality seems a bit out of scope.

Nobody complained when we added sinh, cosh, tanh, asinh, acosh, atanh last year, so I'm feeling skeptical of claims that gcd should be out of scope.

Now, those functions were just exposing libc functionality, so there wasn't a lot of code to write. There might be a good argument that gcd isn't useful enough to justify the amount of code we'd have to add (especially if we allow it to scope-creep into needing to deal with "numeric" calculations). But I'm not on board with just dismissing it as uninteresting.

			regards, tom lane
On Fri, Jan 3, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now, those functions were just exposing libc functionality, so there wasn't a lot of code to write. There might be a good argument that gcd isn't useful enough to justify the amount of code we'd have to add (especially if we allow it to scope-creep into needing to deal with "numeric" calculations). But I'm not on board with just dismissing it as uninteresting.

Yeah. There's always the question with things like this as to whether we ought to push certain things into contrib modules that are not installed by default to avoid bloating the set of things built into the core server. But it's hard to know where to draw the line. There's no objective answer to the question of whether gcd() or sinh() is more useful to have in core; each is more useful to people who need that one but not the other, and trying to guess whether more or fewer people need gcd() than sinh() seems like a fool's errand.

Perhaps in retrospect we would be better off having a 'math' extension where a lot of this stuff lives, and people who want that extension can install it and others need not bother. But, to try to create that now and move things there would break upgrades for an exceedingly marginal benefit.

I don't really like the namespace pollution that comes with accepting feature requests like this, but it's hard to argue that it's a serious show-stopper or that the cure is any less bad than the disease. And I'm sure that I'd be much more likely to use gcd() or lcm() in a query than tanh()...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-Jan-03, Robert Haas wrote:
> On Fri, Jan 3, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, those functions were just exposing libc functionality, so there wasn't a lot of code to write. There might be a good argument that gcd isn't useful enough to justify the amount of code we'd have to add (especially if we allow it to scope-creep into needing to deal with "numeric" calculations). But I'm not on board with just dismissing it as uninteresting.
>
> Yeah. There's always the question with things like this as to whether we ought to push certain things into contrib modules that are not installed by default to avoid bloating the set of things built into the core server. But it's hard to know where to draw the line. There's no objective answer to the question of whether gcd() or sinh() is more useful to have in core;

The SQL standard's feature T622 requires trigonometric functions, while it doesn't list gcd() or anything of the sort, so there's that.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 3, 2020 at 10:24 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 3, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, those functions were just exposing libc functionality, so there wasn't a lot of code to write. There might be a good argument that gcd isn't useful enough to justify the amount of code we'd have to add (especially if we allow it to scope-creep into needing to deal with "numeric" calculations). But I'm not on board with just dismissing it as uninteresting.
>
> Yeah. There's always the question with things like this as to whether we ought to push certain things into contrib modules that are not installed by default to avoid bloating the set of things built into the core server. But it's hard to know where to draw the line.

Just stop doing it. It's very little extra work to package an item into an extension, and this protects your hapless users who might have implemented a function called gcd() that does something different. Ideally, the public namespace should contain (by default) only sql standard functions, with all non-standard material in an appropriate extension. Already released material is obviously problematic and needs more thought, but we ought to at least stop making the problem worse if possible.

merlin
On 02/01/2020 16:12, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> * Dean Rasheed (dean.a.rasheed@gmail.com) wrote:
>>> I'm not objecting to adding it, I'm just curious. In fact, I think that if we do add this, then we should probably add lcm() at the same time, since handling its overflow cases is sufficiently non-trivial to justify not requiring users to have to implement it themselves.
>> I tend to agree with this.
> Does this impact the decision about whether we need a variant for numeric? I was leaning against that, primarily because (a) it'd introduce a set of questions about what to do with non-integral inputs, and (b) it'd make the patch quite a lot larger, I imagine. But a variant of lcm() that returns numeric would have much more resistance to overflow.
>
> Maybe we could just define "lcm(bigint, bigint) returns numeric" and figure that that covers all cases, but it feels slightly weird. You couldn't do lcm(lcm(a,b),c) without casting. I guess that particular use-case could be addressed with "lcm(variadic bigint[]) returns numeric", but that's getting really odd.

Okay. Here is a version that should handle everyone's comments.

gcd() is now strictly positive, so INT_MIN is no longer a valid result.

I added an lcm() function. It returns the same type as its arguments, so overflow is possible. I made this choice because int2mul returns int2 (and same for its friends). One can just cast to a bigger integer if needed.

Because of that, I added a version of gcd() and lcm() for numeric. This was my first time working with numeric, so reviewers should pay extra attention there, please.

Patch attached.

-- 
Vik Fearing
Attachment
On Fri, Jan 3, 2020 at 1:10 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> Just stop doing it. It's very little extra work to package an item into an extension, and this protects your hapless users who might have implemented a function called gcd() that does something different. Ideally, the public namespace should contain (by default) only sql standard functions, with all non-standard material in an appropriate extension. Already released material is obviously problematic and needs more thought, but we ought to at least stop making the problem worse if possible.

There are counter-arguments to that, though. Maintaining a lot of extensions with only one or two functions in them is a nuisance. Having things installed by default is convenient for wanting to use them. Maintaining contrib code so that it works whether or not the SQL definitions have been updated via ALTER EXTENSION .. UPDATE takes some work and thought, and sometimes we screw it up.

I don't find any position on this topic to be without merit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/3/20 1:46 PM, Robert Haas wrote:
> On Fri, Jan 3, 2020 at 1:10 PM Merlin Moncure <mmoncure@gmail.com> wrote:
>> Just stop doing it. It's very little extra work to package an item into an extension, and this protects your hapless users who might have implemented a function called gcd() that does something different.
>> ...
> There are counter-arguments to that, though. Maintaining a lot of extensions with only one or two functions in them is a nuisance. Having things installed by default is convenient for wanting to use them. Maintaining contrib code so that it works whether or not the SQL definitions have been updated via ALTER EXTENSION .. UPDATE takes some work and thought, and sometimes we screw it up.

Is there a middle ground staring us in the face, where certain things could be added in core, but in a new schema like pg_math (pg_ !), so if you want them you put them on your search path or qualify them explicitly, and if you don't, you don't?

Regards,
-Chap
On Fri, Jan 3, 2020 at 1:57 PM Chapman Flack <chap@anastigmatix.net> wrote:
> Is there a middle ground staring us in the face, where certain things could be added in core, but in a new schema like pg_math (pg_ !), so if you want them you put them on your search path or qualify them explicitly, and if you don't, you don't?

I guess, but it seems like a patch whose mandate is to add one or two functions should not be burdened with inventing an entirely new way to do extensibility. Also, I'm not entirely sure that really addresses all the concerns. Part of my concern about continually adding new functions to core comes from the fact that it bloats the core code, and moving things to another schema does not help with that.

It does potentially help with the namespace pollution issue, but how much of an issue is that anyway? Unless you've set up an unusual search_path configuration, your own schemas probably precede pg_catalog in your search path, besides which it seems unlikely that many people have a gcd() function that does anything other than take the greatest common divisor.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Bonsoir Vik,

+int4gcd_internal(int32 arg1, int32 arg2)
+{
+	int32		swap;
+
+	/*
+	 * Put the greater value in arg1.
+	 * This would happen automatically in the loop below, but avoids an
+	 * expensive modulo simulation on some architectures.
+	 */
+	if (arg1 < arg2)
+	{
+		swap = arg1;
+		arg1 = arg2;
+		arg2 = swap;
+	}

The point of swapping is to avoid a possibly expensive modulo, but this should be done on absolute values, otherwise it may not achieve its purpose as stated by the comment?

> gcd() is now strictly positive, so INT_MIN is no longer a valid result.

Ok. I'm unsure about whether gcd(INT_MIN, 0) should error. Possibly 0 would be nicer?

-- 
Fabien.
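A sketch of what swapping on absolute values could look like, widening to 64 bits so that |INT32_MIN| is representable (a hypothetical helper illustrating Fabien's point, not the patch's code):

    #include <stdint.h>

    /*
     * Put the argument with the larger magnitude first, comparing in
     * 64 bits so the negation is safe even for INT32_MIN.
     */
    static void
    swap_by_magnitude(int32_t *a, int32_t *b)
    {
        int64_t     ma = (*a < 0) ? -(int64_t) *a : (int64_t) *a;
        int64_t     mb = (*b < 0) ? -(int64_t) *b : (int64_t) *b;

        if (ma < mb)
        {
            int32_t     tmp = *a;

            *a = *b;
            *b = tmp;
        }
    }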
On 1/3/20 2:11 PM, Robert Haas wrote:
> and moving things to another schema does not help with that. It does potentially help with the namespace pollution issue, but how much of an issue is that anyway? Unless you've set up an unusual search_path configuration, your own schemas probably precede pg_catalog in your search path, besides which it seems unlikely that many people have a gcd() function that does anything other than take the greatest common divisor.

As seen in this thread though, there can be edge cases of "take the greatest common divisor" that might not be identically treated in a thoroughly-reviewed addition to core as in someone's hastily-rolled local version.

Regards,
-Chap
On Fri, Jan 3, 2020 at 2:27 PM Chapman Flack <chap@anastigmatix.net> wrote:
> On 1/3/20 2:11 PM, Robert Haas wrote:
>> and moving things to another schema does not help with that. It does potentially help with the namespace pollution issue, but how much of an issue is that anyway? Unless you've set up an unusual search_path configuration, your own schemas probably precede pg_catalog in your search path, besides which it seems unlikely that many people have a gcd() function that does anything other than take the greatest common divisor.
>
> As seen in this thread though, there can be edge cases of "take the greatest common divisor" that might not be identically treated in a thoroughly-reviewed addition to core as in someone's hastily-rolled local version.

True, but because of the way search_path is typically set, they'd probably continue to get their own version anyway, so I'm not sure what the problem is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-01-03 16:22, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 2020-01-02 15:50, Dean Rasheed wrote:
>>> Out of curiosity, what was the original use-case for this?
>
>> Yeah, I'm wondering, is this useful for any typical analytics or business application? Otherwise, abstract algebra functionality seems a bit out of scope.
>
> Nobody complained when we added sinh, cosh, tanh, asinh, acosh, atanh last year, so I'm feeling skeptical of claims that gcd should be out of scope.

Geometry is generally in scope, though, for Postgres specifically and for databases in general.

Abstract algebra is not in scope, so far, and we still haven't been told the use case for this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 3, 2020 at 1:32 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 3, 2020 at 2:27 PM Chapman Flack <chap@anastigmatix.net> wrote:
>> As seen in this thread though, there can be edge cases of "take the greatest common divisor" that might not be identically treated in a thoroughly-reviewed addition to core as in someone's hastily-rolled local version.
>
> True, but because of the way search_path is typically set, they'd probably continue to get their own version anyway, so I'm not sure what the problem is.

Is that right? Default search_path is for pg_catalog to resolve before public. Lightly testing with a hand-rolled pg_advisory_lock implementation that raises a notice, my default database seemed to prefer the built-in function. Maybe I'm not following you.

> There are counter-arguments to that, though. Maintaining a lot of extensions with only one or two functions in them is a nuisance. Having things installed by default is convenient for wanting to use them. Maintaining contrib code so that it works whether or not the SQL definitions have been updated via ALTER EXTENSION .. UPDATE takes some work and thought, and sometimes we screw it up.

If the external contract changes (which seems likely for gcd), then I would much rather have the core team worry about this than force your users to worry about it, which is what putting the function in core would require them to do (if version < x call it this way, > y then that way, etc). This is exactly why we shouldn't be putting non-standard items in core (maybe excepting some pg_ prefixed administration functions).

Now, it's quite unfair to $OP to saddle his proposal and patch with the broader considerations of core/extension packaging, so if some kind of rational framework can be applied to the NEXT submission, or at least a discussion about this can start, those are all good options. But we need to start from somewhere, and moving forward with, "If it's not sql standard or prefixed with pg_, it ought not to be in pg_catalog" might be a good way to open the discussion.

merlin
On 1/3/20 3:09 PM, Peter Eisentraut wrote:
> Geometry is generally in scope, though, for Postgres specifically and for databases in general.
>
> Abstract algebra is not in scope, so far, and we still haven't been told the use case for this.

It's funny, I think I've used gcd and lcm in real life way more often than sinh and cosh, maybe even as often as sin and cos.

For example, how many times around will I have to go with this engine crankshaft to be able to confirm the painted links on the timing chain really do line up with the sprocket marks? (Need to count the sprocket teeth and the chain links.)

Or, if I'm cycling through two different-length tuple stores, how many times before the same tuples coincide again? That isn't a question I've yet had an occasion to face, but I don't have to squint real hard to imagine it arising in a database in some situation or other.

This is just me riffing, as of course I'm not the person who had such a pressing use case as to justify sitting down to write the patch.

Another funny thing: this message sent me googling just to indulge my own "is gcd more abstract algebra or number theory?" quibble*, and I ended up discovering there are more algorithms for it than the Euclidean one I remember. There's a binary one using only ands, subtractions, and shifts, asymptotically the same as Euclid but perhaps somewhat faster: https://en.wikipedia.org/wiki/Binary_GCD_algorithm

It looks fairly simple to code up, if not quite as short as Euclid. There's at least one specific to representations like numeric: https://en.wikipedia.org/wiki/Lehmer%27s_GCD_algorithm ... considerably more effort to implement though.

If there are crypto libraries we're already linking to for other reasons, like SSL, there could be good big-number gcd implementations already in there.

Regards,
-Chap

* maybe I've decided to call it number theory if the things being gcd'd are integers, abstract algebra if they belong to other commutative rings.
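For reference, the binary GCD algorithm Chapman links to can be sketched in C using only comparisons, subtraction, and shifts (an illustration for unsigned inputs, not something proposed for the patch):

    #include <stdint.h>

    /*
     * Binary (Stein's) GCD: replaces division with shifts and
     * subtraction.  gcd(0, v) = v and gcd(u, 0) = u by convention.
     */
    static uint32_t
    binary_gcd(uint32_t u, uint32_t v)
    {
        int     shift = 0;

        if (u == 0)
            return v;
        if (v == 0)
            return u;

        /* factor out common powers of two: gcd(2u, 2v) = 2 * gcd(u, v) */
        while (((u | v) & 1) == 0)
        {
            u >>= 1;
            v >>= 1;
            shift++;
        }

        while ((u & 1) == 0)    /* remaining factors of 2 in u are not common */
            u >>= 1;

        do
        {
            while ((v & 1) == 0)
                v >>= 1;
            if (u > v)          /* keep u <= v so the subtraction is safe */
            {
                uint32_t    tmp = u;

                u = v;
                v = tmp;
            }
            v -= u;             /* gcd(u, v) = gcd(u, v - u) */
        } while (v != 0);

        return u << shift;
    }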
On 2020-Jan-03, Merlin Moncure wrote:
> On Fri, Jan 3, 2020 at 1:32 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> True, but because of the way search_path is typically set, they'd probably continue to get their own version anyway, so I'm not sure what the problem is.
>
> Is that right? Default search_path is for pg_catalog to resolve before public. Lightly testing with a hand-rolled pg_advisory_lock implementation that raises a notice, my default database seemed to prefer the built-in function. Maybe I'm not following you.

Maybe a very simple solution is indeed to have a separate pg_math or pg_extra or whatever, which by default is *last* in the search_path. That would make a user's gcd() be chosen preferently, if one exists.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/3/20 4:10 PM, Alvaro Herrera wrote:
> Maybe a very simple solution is indeed to have a separate pg_math or pg_extra or whatever, which by default is *last* in the search_path. That would make a user's gcd() be chosen preferently, if one exists.

I'm liking the direction this is going.

Regards,
-Chap
On Fri, Jan 3, 2020 at 3:51 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> Is that right? Default search_path is for pg_catalog to resolve before public. Lightly testing with a hand-rolled pg_advisory_lock implementation that raises a notice, my default database seemed to prefer the built-in function. Maybe I'm not following you.

Nope, I'm just wrong. Sorry.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 3, 2020 at 4:11 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Maybe a very simple solution is indeed to have a separate pg_math or pg_extra or whatever, which by default is *last* in the search_path. That would make a user's gcd() be chosen preferently, if one exists.

Then every time we add a function, or anything else, we can bikeshed about whether it should go in pg_catalog or pg_extra!

FWIW, EnterpriseDB has something like this for Advanced Server, and it actually adds a fair amount of complexity, much of it around OverrideSearchPath. It's not unmanageable, but it's not trivial, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-Jan-03, Robert Haas wrote:
> On Fri, Jan 3, 2020 at 4:11 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Maybe a very simple solution is indeed to have a separate pg_math or pg_extra or whatever, which by default is *last* in the search_path. That would make a user's gcd() be chosen preferently, if one exists.
>
> Then every time we add a function, or anything else, we can bikeshed about whether it should go in pg_catalog or pg_extra!

Yeah, I was just thinking about that :-) I was thinking that all standard-mandated functions, as well as system functions, should be in pg_catalog; and otherwise stuff should not get in the user's way.

> FWIW, EnterpriseDB has something like this for Advanced Server, and it actually adds a fair amount of complexity, much of it around OverrideSearchPath. It's not unmanageable, but it's not trivial, either.

Oh, hmm. Okay.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/01/2020 20:14, Fabien COELHO wrote:
>
> +int4gcd_internal(int32 arg1, int32 arg2)
> +{
> +	int32		swap;
> +
> +	/*
> +	 * Put the greater value in arg1.
> +	 * This would happen automatically in the loop below, but avoids an
> +	 * expensive modulo simulation on some architectures.
> +	 */
> +	if (arg1 < arg2)
> +	{
> +		swap = arg1;
> +		arg1 = arg2;
> +		arg2 = swap;
> +	}
>
> The point of swapping is to avoid a possibly expensive modulo, but this should be done on absolute values, otherwise it may not achieve its purpose as stated by the comment?

Ah, true. How widespread are these architectures that need this special treatment? Is it really worth handling?

> I'm unsure about whether gcd(INT_MIN, 0) should error. Possibly 0 would be nicer?

What justification for that do you have?

-- 
Vik Fearing
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jan-03, Robert Haas wrote:
>> Then every time we add a function, or anything else, we can bikeshed about whether it should go in pg_catalog or pg_extra!

> Yeah, I was just thinking about that :-) I was thinking that all standard-mandated functions, as well as system functions, should be in pg_catalog; and otherwise stuff should not get in the user's way.

I think that ship sailed a long time ago, frankly.

Why is it that this particular proposal is such a problem that we need to redesign how we add features? There are currently 2977 rows in a default installation's pg_proc, with 2447 unique values of proname. Certainly at least a couple of thousand of them are not standard-mandated; despite which there are only 357 named 'pg_something'. gcd and/or lcm are not going to move the needle noticeably.

I'd also submit that just pushing a bunch of built-in stuff into a schema that's behind the users' schema instead of in front doesn't mean that all is magically better. There are still going to be the same issues that make CVE-2018-1058 such a problem, but now we get to have them in both directions, not just one:

* a system-supplied function in "pg_extra" could still capture a call away from a user-supplied one in an earlier schema, if it is a better match to the actual argument types;

* malicious users now have a much better chance to capture other people's calls to "pg_extra" functions, since they can just drop an exact match into public.

(BTW, I'm pretty sure we've had this conversation before. I definitely recall a proposal to try to move functions not meant for user consumption at all, such as index support functions, into a whole other schema that wouldn't be in the path period. It went nowhere, partly because those functions don't seem to be big problems in practice.)

			regards, tom lane
On 2020-01-03 18:00:01 -0500, Tom Lane wrote:
> I think that ship sailed a long time ago, frankly.
>
> Why is it that this particular proposal is such a problem that we need to redesign how we add features?
> [... rest of message quoted in full ...]

+1 to all of this.
Vik Fearing <vik.fearing@2ndquadrant.com> writes:
> On 03/01/2020 20:14, Fabien COELHO wrote:
>> The point of swapping is to avoid a possibly expensive modulo, but this should be done on absolute values, otherwise it may not achieve its purpose as stated by the comment?

> Ah, true. How widespread are these architectures that need this special treatment? Is it really worth handling?

On some older RISC architectures, integer division is really slow, like slower than floating-point. I'm not sure if that's true on any platform people still care about though. In recent years, CPU architects have been able to throw all the transistors they needed at such problems. On a machine with single-cycle divide, it's likely that the extra compare-and-branch is a net loss.

Might be worth checking it on ARM in particular, as being a RISC architecture that's still popular.

Also, if we end up having a "numeric" implementation, it absolutely is worth it for that, because there is nothing cheap about numeric_div. I'd be sort of inclined to have the swap in the other implementations just to keep the algorithms as much alike as possible.

			regards, tom lane
Hi,

On 2020-01-03 18:49:18 -0500, Tom Lane wrote:
> On some older RISC architectures, integer division is really slow, like slower than floating-point. I'm not sure if that's true on any platform people still care about though. In recent years, CPU architects have been able to throw all the transistors they needed at such problems. On a machine with single-cycle divide, it's likely that the extra compare-and-branch is a net loss.

Which architecture has single-cycle division? I think it's way above that, based on profiles I've seen. And Agner seems to back me up: https://www.agner.org/optimize/instruction_tables.pdf

That lists a 32/64-bit idiv with a latency of ~26/~42-95 cycles, even on a modern uarch like Skylake-X.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2020-01-03 18:49:18 -0500, Tom Lane wrote:
>> On a machine with single-cycle divide, it's likely that the extra compare-and-branch is a net loss.

> Which architecture has single-cycle division? I think it's way above that, based on profiles I've seen. And Agner seems to back me up: https://www.agner.org/optimize/instruction_tables.pdf
> That lists a 32/64-bit idiv with a latency of ~26/~42-95 cycles, even on a modern uarch like Skylake-X.

Huh. I figured Intel would have thrown sufficient transistors at that problem by now. But per that result, it's worth having the swap step even on CISC, never mind RISC.

			regards, tom lane
Vik Fearing <vik.fearing@2ndquadrant.com> writes:
> On 03/01/2020 20:14, Fabien COELHO wrote:
>> I'm unsure about whether gcd(INT_MIN, 0) should error. Possibly 0 would be nicer?

> What justification for that do you have?

Zero is the "correct" answer for that, isn't it, independently of overflow considerations? We should strive to give the correct answer if it's known and representable, rather than have arbitrary failure conditions. (IOW, we should throw errors only when the *result* is out of range or undefined, not just because the input is an edge case.)

			regards, tom lane
On 04/01/2020 00:49, Tom Lane wrote:
> Vik Fearing <vik.fearing@2ndquadrant.com> writes:
>> Ah, true. How widespread are these architectures that need this special treatment? Is it really worth handling?
> On some older RISC architectures, integer division is really slow, like slower than floating-point. I'm not sure if that's true on any platform people still care about though. In recent years, CPU architects have been able to throw all the transistors they needed at such problems. On a machine with single-cycle divide, it's likely that the extra compare-and-branch is a net loss.

OK.

> Might be worth checking it on ARM in particular, as being a RISC architecture that's still popular.

I don't know how I would check this.

> Also, if we end up having a "numeric" implementation, it absolutely is worth it for that, because there is nothing cheap about numeric_div.

The patch includes a numeric version, and I take care to short-circuit everything I can.

> I'd be sort of inclined to have the swap in the other implementations just to keep the algorithms as much alike as possible.

They can't quite be the same behavior because numeric doesn't have the unrepresentable -INT_MIN problem, and integers don't have NaN.

-- 
Vik Fearing
On 04/01/2020 01:21, Tom Lane wrote:
> Vik Fearing <vik.fearing@2ndquadrant.com> writes:
>> On 03/01/2020 20:14, Fabien COELHO wrote:
>>> I'm unsure about whether gcd(INT_MIN, 0) should error. Possibly 0 would be nicer?
>> What justification for that do you have?
> Zero is the "correct" answer for that, isn't it, independently of overflow considerations?

I would say not. The correct answer is INT_MIN, but we've decided a negative result is not desirable.

> We should strive to give the correct answer if it's known and representable, rather than have arbitrary failure conditions.

On that we fully agree.

> (IOW, we should throw errors only when the *result* is out of range or undefined, not just because the input is an edge case.)

That's what I do with the rest of it. INT_MIN is only an error if the result of the calculation is also INT_MIN.

-- 
Vik Fearing
Vik Fearing <vik.fearing@2ndquadrant.com> writes:
> On 04/01/2020 01:21, Tom Lane wrote:
>> Zero is the "correct" answer for that, isn't it, independently of overflow considerations?

> I would say not.

Oh, right, I was misremembering the identity gcd(a,0) as being 0 not a. Never mind that then.

> The correct answer is INT_MIN, but we've decided a negative result is not desirable.

Agreed. On the other hand, we could stave off overflow the same way we discussed for lcm: make it return int8. We're still stuck with the special case for INT64_MIN in gcd64, of course, so maybe that's just inconsistent rather than being worthwhile.

[ thinks for a bit... ] In practice, I imagine few people use gcd on negative values, so doing weird things with the datatype choices is probably not better than throwing an error for this case.

			regards, tom lane
On 04/01/2020 01:26, Vik Fearing wrote:
> On 04/01/2020 01:21, Tom Lane wrote:
>> Zero is the "correct" answer for that, isn't it, independently of overflow considerations?
>
> I would say not. The correct answer is INT_MIN, but we've decided a negative result is not desirable.

Wolfram Alpha agrees. https://www.wolframalpha.com/input/?i=gcd%28-9223372036854775808%2C0%29

-- 
Vik Fearing
On 03/01/2020 20:14, Fabien COELHO wrote:
>
> +int4gcd_internal(int32 arg1, int32 arg2)
> +{
> +	int32		swap;
> +
> +	/*
> +	 * Put the greater value in arg1.
> +	 * This would happen automatically in the loop below, but avoids an
> +	 * expensive modulo simulation on some architectures.
> +	 */
> +	if (arg1 < arg2)
> +	{
> +		swap = arg1;
> +		arg1 = arg2;
> +		arg2 = swap;
> +	}
>
> The point of swapping is to avoid a possibly expensive modulo, but this should be done on absolute values, otherwise it may not achieve its purpose as stated by the comment?

Here is an updated patch fixing that.

-- 
Vik Fearing
Attachment
Bonjour Vik,

>> The point of swapping is to avoid a possibly expensive modulo, but this should be done on absolute values, otherwise it may not achieve its purpose as stated by the comment?
>
> Ah, true. How widespread are these architectures that need this special treatment? Is it really worth handling?

Dunno. AFAICR it was with SPARC architectures 25 years ago.

Also, I do not like much relying on the subtleties of C99 % wrt negative numbers to have the algorithm work; I'd be much more at ease dealing with sign and special values at the beginning of the function and proceeding with positive numbers afterwards.

>> I'm unsure about whether gcd(INT_MIN, 0) should error. Possibly 0 would be nicer?
>
> What justification for that do you have?

ISTM that the current implementation has:

  \forall int4 n, n \neq MIN_INT4, \gcd(n, 0) = 0 ?

In which case applying the same rule for min int seems ok.

-- 
Fabien Coelho - CRI, MINES ParisTech
Hello Tom,

>> Which architecture has single-cycle division? I think it's way above that, based on profiles I've seen. And Agner seems to back me up: https://www.agner.org/optimize/instruction_tables.pdf
>> That lists a 32/64-bit idiv with a latency of ~26/~42-95 cycles, even on a modern uarch like Skylake-X.
>
> Huh. I figured Intel would have thrown sufficient transistors at that problem by now.

It is not just a problem of number of transistors: division is intrinsically iterative (with various kinds of iterations used in division algorithms), involving some level of guessing and other arithmetic, so the latency can only be bad, and the possibility of implementing that in 1 cycle at 3 GHz looks pretty remote.

-- 
Fabien.
On 04/01/2020 09:35, Fabien COELHO wrote:
>>> I'm unsure about whether gcd(INT_MIN, 0) should error. Possibly 0 would be nicer?
>>
>> What justification for that do you have?
>
> ISTM that the current implementation has:
>
>   \forall int4 n, n \neq MIN_INT4, \gcd(n, 0) = 0 ?
>
> In which case applying the same rule for min int seems ok.

No, gcd(n, 0) = n.

-- 
Vik Fearing
Bonjour Vik,

> Here is an updated patch fixing that.

As I said, welcome to arithmetic :-)

Patch v5 applies cleanly.

Doc: I'd consider using an example the result of which is 42 instead of 21, for obvious geek motivations. Possibly gcd(2142, 462) should be ok.

I got it wrong with my previous comment on gcd(int_min, 0). I'm okay with erroring on int_min.

Code comments: gcd(n, 0) = abs(n), not n?

About the code: add unlikely() where appropriate.

I'd deal with int_min and 0 at the beginning and then proceed with absoluting the values, rather than the dance around a1/arg1, a2/arg2, and other arg2 = -arg2, and arg1 = -arg1 anyway in various places, which does not make the code that easy to understand. Pseudo code could be:

  if ((a1 == min && (a2 == min || a2 == 0)) ||
      (a2 == min && a1 == 0))
      error;
  a1 = abs(a1), a2 = abs(a2);
  euclids;
  return;

Note: in the numeric code you abs the value; ISTM consistent to do it as well in the other implementations.

Would it make sense that NAN is returned on NUMERIC when the computation cannot be performed, eg on non-integer values?

Why the positive constraint on LCM(NUMERIC, NUMERIC)? Why not absoluting?

Tests: you can make LCM fail on much smaller values for int2/4/8; you do not need to start around max_int.

-- 
Fabien.
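For what it's worth, Fabien's "normalize first" pseudo code can be made safe against the abs(INT_MIN) problem by widening to 64 bits internally, as in this hypothetical int32 sketch (this is not how the patch does it; the patch keeps signed remainders and fixes the sign at the end):

    static int32
    gcd32_normalized(int32 arg1, int32 arg2)
    {
        int64       a1 = arg1;
        int64       a2 = arg2;

        /* the only unrepresentable result: the gcd would be |INT32_MIN| */
        if ((arg1 == PG_INT32_MIN && (arg2 == 0 || arg2 == PG_INT32_MIN)) ||
            (arg2 == PG_INT32_MIN && arg1 == 0))
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("integer out of range")));

        /* widening makes these negations safe even for INT32_MIN */
        if (a1 < 0)
            a1 = -a1;
        if (a2 < 0)
            a2 = -a2;

        while (a2 != 0)
        {
            int64       tmp = a1 % a2;

            a1 = a2;
            a2 = tmp;
        }

        return (int32) a1;      /* fits: the guard excluded |INT32_MIN| */
    }

The same trick obviously does not extend to the int8 version, which has no wider native integer type to borrow, so the INT64_MIN special case remains either way.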
On Thu, 2 Jan 2020 at 15:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Does this impact the decision about whether we need a variant for numeric? I was leaning against that, primarily because (a) it'd introduce a set of questions about what to do with non-integral inputs, and (b) it'd make the patch quite a lot larger, I imagine. But a variant of lcm() that returns numeric would have much more resistance to overflow.

Well, Vik has now provided a numeric implementation, and it doesn't appear to be too much code.

BTW, there is actually no need to restrict the inputs to integral values, because GCD is something that has a perfectly natural extension to floating point inputs (see for example [1]). Moreover, since we already have a mod(numeric, numeric) that works for arbitrary inputs, Euclid's algorithm just works. For example:

SELECT gcd(285, 7845);
 gcd
-----
  15

SELECT gcd(28.5, 7.845);
  gcd
-------
 0.015

Essentially, this is because gcd(a*10^n, b*10^n) = gcd(a, b) * 10^n, so you can think of it as pre-multiplying by a power of 10 large enough to make both inputs integers, and then dividing the result by that power of 10.

If it were more work to support non-integer inputs, I'd say that it's not worth the effort, but since it's actually less work to just allow it, then why not?

> Maybe we could just define "lcm(bigint, bigint) returns numeric" and figure that that covers all cases, but it feels slightly weird. You couldn't do lcm(lcm(a,b),c) without casting. I guess that particular use-case could be addressed with "lcm(variadic bigint[]) returns numeric", but that's getting really odd.

Having thought about that, I don't like defining these functions to return a different type than their inputs. I think most people tend to be working with a given type, and are used to having to move to a wider type if necessary. We don't, for example, define "mul(bigint, bigint) returns numeric". Also I don't think it really buys you all that much -- the problem with lcm(lcm(a,b),c) where bigint inputs produce a numeric output isn't just that you need to add casting; the lcm(a,b) result may not fit in a bigint, so the cast might fail. So really, this is just postponing the problem a bit, without really fixing it.

As for "lcm(variadic bigint[]) returns numeric", to implement that you'd need to use numeric computations internally, so I suspect its implementation would be at least as complex as lcm(numeric, numeric).

FWIW, looking for precedents elsewhere, I note that the C++ standard library defines these functions to return the same type as the inputs. To me, that seems more natural.

Regards,
Dean

[1] https://www.geeksforgeeks.org/program-find-gcd-floating-point-numbers/
On Sat, 4 Jan 2020 at 09:37, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> Well, Vik has now provided a numeric implementation, and it doesn't appear to be too much code.

BTW, I did a bit of research into the efficiency of Euclid's algorithm. It's actually quite interesting:

It turns out that the worst case is when the inputs are successive values from the Fibonacci sequence. In that case, since Fib(n)/Fib(n-1) = 1 remainder Fib(n-2), the algorithm will walk backwards through the whole sequence before terminating, and the result will always be 1.

For bigint inputs, the worst case is gcd(7540113804746346429, 4660046610375530309), which requires something like 90 divisions. Testing that, it's still sub-millisecond though, so I don't think there's any problem there.

OTOH, for numeric inputs, this could easily end up needing many thousands of divisions, and it's not hard to construct inputs that take minutes to compute, although this is admittedly with ridiculously large inputs (~10^130000), and AFAICS, the performance is OK with "normal" sized inputs.

Should we put a limit on the size of the inputs? I'm not sure exactly how that would work, but I think it would have to take into account the relative weights of the inputs rather than just the maximum weight. At the very least, I think we need a check for interrupts here (c.f. the numeric factorial function). Perhaps such a check is sufficient. It's not like there aren't lots of other ways to tie up the server.

There are apparently more efficient algorithms, but I think that should definitely be kept out of scope for this patch.

Regards,
Dean
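Dean's worst-case observation is easy to reproduce with a small standalone C program that counts the division steps (an illustration only; the two constants are the consecutive 64-bit Fibonacci numbers he mentions):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* consecutive Fibonacci numbers: Euclid's worst case in 64 bits */
        int64_t a = 7540113804746346429LL;
        int64_t b = 4660046610375530309LL;
        int     steps = 0;

        while (b != 0)
        {
            int64_t tmp = a % b;

            a = b;
            b = tmp;
            steps++;
        }
        /* walks back through the whole Fibonacci sequence, ending at 1 */
        printf("gcd = %lld after %d modulo steps\n", (long long) a, steps);
        return 0;
    }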
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> OTOH, for numeric inputs, this could easily end up needing many thousands of divisions, and it's not hard to construct inputs that take minutes to compute, although this is admittedly with ridiculously large inputs (~10^130000), and AFAICS, the performance is OK with "normal" sized inputs. Should we put a limit on the size of the inputs?

No, but a CHECK_FOR_INTERRUPTS in the loop would be well-advised, if there's not one already inside the called functions.

> There are apparently more efficient algorithms, but I think that should definitely be kept out of scope for this patch.

+1

			regards, tom lane
On 04/01/2020 10:34, Fabien COELHO wrote:
> Code comments: gcd(n, 0) = abs(n), not n?

OK, changed.

> About the code: add unlikely() where appropriate.

Any particular place in mind where I didn't already put it?

> I'd deal with int_min and 0 at the beginning and then proceed with absoluting the values, rather than the dance around a1/arg1, a2/arg2, and other arg2 = -arg2, and arg1 = -arg1 anyway in various places, which does not make the code that easy to understand. Pseudo code could be:
>
>   if ((a1 == min && (a2 == min || a2 == 0)) ||
>       (a2 == min && a1 == 0))
>       error;
>   a1 = abs(a1), a2 = abs(a2);
>   euclids;
>   return;

This would cause one of my tests to fail. Please stop suggesting it.

> Note: in the numeric code you abs the value; ISTM consistent to do it as well in the other implementations.

As noted in the comments, numeric does not have the INT_MIN problem.

> Would it make sense that NAN is returned on NUMERIC when the computation cannot be performed, eg on non-integer values?

I don't think so, no.

> Why the positive constraint on LCM(NUMERIC, NUMERIC)? Why not absoluting?

I didn't see a definition for negative inputs, but now I see one so I've lifted the restriction.

On 04/01/2020 10:37, Dean Rasheed wrote:
>
> BTW, there is actually no need to restrict the inputs to integral values, because GCD is something that has a perfectly natural extension to floating point inputs (see for example [1]). Moreover, since we already have a mod(numeric, numeric) that works for arbitrary inputs, Euclid's algorithm just works.
> [...]
> If it were more work to support non-integer inputs, I'd say that it's not worth the effort, but since it's actually less work to just allow it, then why not?

Okay, I allow that now, but I've still left it for lcm. I can't find anything anywhere that defines lcm for floating point (I do find it for fractions), and the result of abs(a*b)/gcd(a,b) certainly doesn't match "lowest" in the examples I tried.

On 04/01/2020 18:01, Tom Lane wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> OTOH, for numeric inputs, this could easily end up needing many thousands of divisions, and it's not hard to construct inputs that take minutes to compute, although this is admittedly with ridiculously large inputs (~10^130000), and AFAICS, the performance is OK with "normal" sized inputs. Should we put a limit on the size of the inputs?
> No, but a CHECK_FOR_INTERRUPTS in the loop would be well-advised, if there's not one already inside the called functions.

Good idea. Added.

-- 
Vik Fearing
Attachment
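[As a standalone illustration of the fix-the-sign-afterwards approach being debated (a sketch, not the patch): the Euclidean loop itself cannot overflow, because C99 defines % as truncating toward zero, so |a % b| < |b|. The only input pairs whose gcd is unrepresentable are (INT_MIN, 0), (0, INT_MIN) and (INT_MIN, INT_MIN), all of which would need 2^31 as the result; and INT_MIN % -1 can trap in hardware, so that divisor is guarded explicitly. int32_gcd_sketch is an illustrative name, not a function from the patch.]

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int32_t
    int32_gcd_sketch(int32_t arg1, int32_t arg2)
    {
        while (arg2 != 0)
        {
            /* INT32_MIN % -1 traps on many CPUs, though n % -1 is always 0 */
            int32_t r = (arg2 == -1) ? 0 : arg1 % arg2;

            arg1 = arg2;
            arg2 = r;                   /* C99: r has the sign of the dividend */
        }

        /* only (INT_MIN, 0), (0, INT_MIN) and (INT_MIN, INT_MIN) end up here */
        if (arg1 == INT32_MIN)
        {
            fprintf(stderr, "integer out of range\n");  /* ereport() in the backend */
            exit(EXIT_FAILURE);
        }

        return arg1 < 0 ? -arg1 : arg1; /* gcd(n, 0) = abs(n) */
    }

    int main(void)
    {
        printf("%d\n", int32_gcd_sketch(-54, 24));          /* 6 */
        printf("%d\n", int32_gcd_sketch(INT32_MIN, -1));    /* 1, no trap */
        return 0;
    }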
On Sat, 4 Jan 2020 at 17:55, Vik Fearing <vik.fearing@2ndquadrant.com> wrote: > On 04/01/2020 10:37, Dean Rasheed wrote: > > > > BTW, there is actually no need to restrict the inputs to integral > > values because GCD is something that has a perfectly natural extension > > to floating point inputs (see for example [1]). Moreover, since we > > already have a mod(numeric, numeric) that works for arbitrary inputs, > > Euclid's algorithm just works. > > [...] > > If it were more work to support non-integer inputs, I'd say that it's > > not worth the effort, but since it's actually less work to just allow > > it, then why not? > > > Okay, I allow that now, but I've still left it for lcm. I can't find > anything anywhere that defines lcm for floating point (I do find it for > fractions) and the result of abs(a*b)/gcd(a,b) certainly doesn't match > "lowest" in the examples I tried. > Here's another article on the subject: https://www.math-only-math.com/hcf-and-lcm-of-decimals.html It works because gcd(a*10^n, b*10^n) = gcd(a, b)*10^n, and therefore lcm(a*10^n, b*10^n) = lcm(a, b)*10^n, so the results will just have their decimal points shifted. For example: gcd(54, 24) = 6 lcm(54, 24) = 216 = 4*54 = 9*24 gcd(5.4, 2.4) = 0.6 lcm(5.4, 2.4) = 21.6 = 4*5.4 = 9*2.4 that is the lowest common integer multiple of the two decimal inputs. Regards, Dean
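[A quick standalone check of the scaling argument, working in tenths so the arithmetic stays integral; nothing here is from the patch, it just verifies the worked example above:]

    #include <stdio.h>

    static long gcd(long a, long b)
    {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }

    int main(void)
    {
        long a = 54, b = 24;            /* 5.4 and 2.4, scaled by 10 */
        long g = gcd(a, b);
        long l = a / g * b;             /* divide first to keep the intermediate small */

        printf("gcd = %.1f, lcm = %.1f\n", g / 10.0, l / 10.0);  /* 0.6, 21.6 */
        return 0;
    }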
Hello Vik,

>> Add unlikely() where appropriate.
>
> Any particular place in mind where I didn't already put it?

In GCD implementations, for instance:

   if (arg1 == PG_INT32_MIN)
   if (arg2 == 0 || arg2 == PG_INT32_MIN)

And possibly a "likely" on the while.

In LCM implementations, for instance:

   if (arg1 == 0 || arg2 == 0)
   if (arg1 == arg2)

The latter is partially redundant with the preceding case, BTW; it could be
managed inside that one, reducing the number of tests. Something like:

   if (arg1 == arg2)
     if (arg1 == MIN_INT)
       error
     else
       return abs(arg1)

I'm not sure why you want to deal with the a1 == a2 case separately; could
it not just work with the main code? If you want to deal with it
separately, then why not handle arg1 == -arg2 as well?

> Please stop suggesting it.

Fine, fine!

Tom also suggested aligning the implementations as much as possible, and I
do agree with him.

Also, I'd suggest adding a comment to explain that the precise C99 modulo
semantics are required to make the algorithm work, and that it may not
work with C89 semantics for instance.

>> Note: in the numeric code you abs the value, ISTM consistent to do it
>> as well in the other implementations.
>
> As noted in the comments, numeric does not have the INT_MIN problem.

Sure, but there are special cases at the beginning all the same: NAN,
INTEGRAL…

>> Would it make sense that NAN is returned on NUMERIC when the
>> computation cannot be performed, eg on non integer values?
>
> I don't think so, no.

Ok. Why? I do not have an opinion, but ISTM that there is a choice and it
should be explained. Could be consistency with other cases, whatever.

>> Why the positive constraint on LCM(NUMERIC, NUMERIC)? Why not absoluting?
>
> I didn't see a definition for negative inputs, but now I see one so I've
> lifted the restriction.

Good.

About tests: again, I'd check the LCM overflow on smaller values.

I'm not convinced by the handling of fractional numerics in gcd, eg:

+SELECT gcd(330.3::numeric, 462::numeric);
+ gcd
+-----
+ 0.3
+(1 row)

ISTM that the average user, including myself, would expect an integer
result from gcd. If this is kept, the documentation should be clear about
what it does and what it means, because the least one can say is that it
is surprising. Somehow I could have expected the arguments to be cast to
int, so that it would lead to 66. Python raises a TypeError, which I find
even better. I'd vote for erroring.

If this fractional gcd makes some sense and is desirable, then ISTM that
lcm(a,b) = a / gcd(a, b) * b should make as much sense and should be
allowed as well, for consistency.

-- Fabien.
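[On the C99 modulo point, a quick standalone demonstration of the truncating semantics the algorithm relies on; C99 pinned down the sign of %, while C89 left the rounding of negative integer division implementation-defined:]

    #include <stdio.h>

    int main(void)
    {
        /* C99 6.5.5: (a/b)*b + a%b == a, with division truncating toward
         * zero, so the remainder takes the sign of the dividend */
        printf(" 9 %%  5 = %d\n",  9 %  5);     /*  4 */
        printf("-9 %%  5 = %d\n", -9 %  5);     /* -4, not 1 */
        printf(" 9 %% -5 = %d\n",  9 % -5);     /*  4 */
        printf("-9 %% -5 = %d\n", -9 % -5);     /* -4 */
        return 0;
    }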
On 04/01/2020 20:08, Dean Rasheed wrote: > On Sat, 4 Jan 2020 at 17:55, Vik Fearing <vik.fearing@2ndquadrant.com> wrote: >> On 04/01/2020 10:37, Dean Rasheed wrote: >>> BTW, there is actually no need to restrict the inputs to integral >>> values because GCD is something that has a perfectly natural extension >>> to floating point inputs (see for example [1]). Moreover, since we >>> already have a mod(numeric, numeric) that works for arbitrary inputs, >>> Euclid's algorithm just works. >>> [...] >>> If it were more work to support non-integer inputs, I'd say that it's >>> not worth the effort, but since it's actually less work to just allow >>> it, then why not? >> >> Okay, I allow that now, but I've still left it for lcm. I can't find >> anything anywhere that defines lcm for floating point (I do find it for >> fractions) and the result of abs(a*b)/gcd(a,b) certainly doesn't match >> "lowest" in the examples I tried. >> > Here's another article on the subject: > https://www.math-only-math.com/hcf-and-lcm-of-decimals.html Yeah, my eyes weren't aligning the decimal points properly. Attached version frees up lcm to work on non-integrals. Thanks for your input! -- Vik Fearing
Attachment
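[On the lcm overflow question raised above, here is one way the integer computation could be arranged; a sketch under stated assumptions, not the patch. gcd32 and lcm32 are illustrative names, and __builtin_mul_overflow is the GCC/Clang builtin, standing in for the backend's pg_mul_s32_overflow().]

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int32_t
    gcd32(int32_t a, int32_t b)                 /* same loop as the gcd sketch above */
    {
        while (b != 0)
        {
            int32_t r = (b == -1) ? 0 : a % b;  /* avoid INT32_MIN % -1 trapping */
            a = b;
            b = r;
        }
        return a;                               /* sign not normalized here */
    }

    static int32_t
    lcm32(int32_t arg1, int32_t arg2)
    {
        int32_t g, q, result;

        if (arg1 == 0 || arg2 == 0)
            return 0;                           /* lcm(n, 0) = 0 by convention */

        g = gcd32(arg1, arg2);
        if (g == INT32_MIN)                     /* both inputs were INT32_MIN */
            goto overflow;
        if (g < 0)
            g = -g;

        q = arg1 / g;                           /* exact: g divides arg1 */

        /* q * arg2 can overflow even for small-ish inputs, e.g. lcm(2^30, 3) */
        if (__builtin_mul_overflow(q, arg2, &result) || result == INT32_MIN)
            goto overflow;

        return result < 0 ? -result : result;

    overflow:
        fprintf(stderr, "integer out of range\n");  /* ereport() in the backend */
        exit(EXIT_FAILURE);
    }

    int main(void)
    {
        printf("%d\n", lcm32(54, 24));          /* 216 */
        printf("%d\n", lcm32(-6, 4));           /* 12 */
        return 0;
    }

[Dividing by the gcd before multiplying keeps the intermediate value small, so the only overflow that needs checking is the final multiplication.]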
On Sat, Jan 4, 2020 at 4:21 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> In GCD implementations, for instance:
>
> if (arg1 == PG_INT32_MIN)
> if (arg2 == 0 || arg2 == PG_INT32_MIN)
>
> And possibly a "likely" on the while.

I don't think decorating the code with likely() and unlikely() all
over the place is a very good idea. Odds are good that we'll end up
with a bunch that are actually non-optimal, and nobody will ever
figure it out because it's hard to figure out. I have a hard time
believing that we're going to be much worse off if we just write the
code normally.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Robert,

>> if (arg1 == PG_INT32_MIN)
>> if (arg2 == 0 || arg2 == PG_INT32_MIN)
>>
>> And possibly a "likely" on the while.
>
> I don't think decorating the code with likely() and unlikely() all
> over the place is a very good idea.

> Odds are good that we'll end up with a bunch that are actually
> non-optimal, and nobody will ever figure it out because it's hard to
> figure out.

My 0.02€: I'd tend to disagree.

Modern pipelined processors can take advantage of speculative execution on
branches, so if you know which branch is more likely it can help.

Obviously if you get it wrong it does not, but for the above cases it
seems to me that they are rather straightforward.

It also provides some "this case is expected to be exceptional" semantics
to people reading the code.

> I have a hard time believing that we're going to be much
> worse off if we just write the code normally.

I think that your point applies to more general programming in postgres,
but this is not the context here.

For low-level arithmetic code like this one, with tests and loops
containing very few hardware instructions, I think that helping compiler
optimizations is a good idea.

Maybe in the "while" the compiler would assume that it is going to loop
anyway by default, so it may be less useful there.

-- Fabien.
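[For reference, the macros being discussed are thin wrappers over a GCC builtin, roughly as defined in PostgreSQL's c.h (paraphrased from memory; the non-GCC fallback simply drops the hint):]

    /* roughly as in src/include/c.h */
    #ifdef __GNUC__
    #define likely(x)       __builtin_expect((x) != 0, 1)
    #define unlikely(x)     __builtin_expect((x) != 0, 0)
    #else
    #define likely(x)       ((x) != 0)
    #define unlikely(x)     ((x) != 0)
    #endif

[With that definition, the suggestion above amounts to writing, e.g., if (unlikely(arg1 == PG_INT32_MIN)) around the branches expected to be exceptional.]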
On Mon, Jan 6, 2020 at 6:52 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Robert,
>
> >> if (arg1 == PG_INT32_MIN)
> >> if (arg2 == 0 || arg2 == PG_INT32_MIN)
> >>
> >> And possibly a "likely" on the while.
> >
> > I don't think decorating the code with likely() and unlikely() all
> > over the place is a very good idea.
>
> > Odds are good that we'll end up with a bunch that are actually
> > non-optimal, and nobody will ever figure it out because it's hard to
> > figure out.
>
> My 0.02€: I'd tend to disagree.
>
> Modern pipelined processors can take advantage of speculative execution on
> branches, so if you know which branch is more likely it can help.
>
> Obviously if you get it wrong it does not, but for the above cases it
> seems to me that they are rather straightforward.
>
> It also provides some "this case is expected to be exceptional" semantics
> to people reading the code.
>
> > I have a hard time believing that we're going to be much
> > worse off if we just write the code normally.
>
> I think that your point applies to more general programming in postgres,
> but this is not the context here.
>
> For low-level arithmetic code like this one, with tests and loops
> containing very few hardware instructions, I think that helping compiler
> optimizations is a good idea.

Do you have any performance data to back that up?

merlin
Hello Merlin,

>> For low-level arithmetic code like this one, with tests and loops
>> containing very few hardware instructions, I think that helping compiler
>> optimizations is a good idea.
>
> Do you have any performance data to back that up?

Yep.

A generic data point is the woes around the speculative-execution CVE
(aka Spectre) fixes and their impact on performance, which is measured in
percents, possibly tens of percent, when the feature is more or less
deactivated to mitigate the security issue:

https://www.nextplatform.com/2018/03/16/how-spectre-and-meltdown-mitigation-hits-xeon-performance/

Some data about the __builtin_expect compiler hints:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0479r0.html

Basically, they are talking about percents, up to tens in some cases,
which is consistent with the previous example.

As I said, helping the compiler is a good idea, and pg has been doing that
with the likely/unlikely macros for some time; there are over a hundred of
them, including in headers which get expanded ("logging.h", "float.h",
"simplehash.h", …), which is a good thing. I do not see any good reason to
stop doing that, especially in low-level arithmetic functions.

Now, I do not have specific data about the performance impact on a gcd
implementation. Mileage may vary depending on hardware, compiler, options
and other overheads.

-- Fabien.
Do we actually need the smallint versions of these functions? I would have thought that automatic casting would take care of any cases that need smallints, and I can't believe that there's any performance benefit to be had that's worth maintaining the extra code. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > Do we actually need the smallint versions of these functions? Doubt it. It'd be fairly hard even to call those, since e.g. "42" is an int not a smallint. regards, tom lane
On Tue, 7 Jan 2020 at 12:31, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dean Rasheed <dean.a.rasheed@gmail.com> writes: > > Do we actually need the smallint versions of these functions? > > Doubt it. It'd be fairly hard even to call those, since e.g. "42" > is an int not a smallint. > I see this has been marked RFC. I'll take it, and barring objections, I'll start by ripping out the smallint code. Regards, Dean
On 20/01/2020 08:44, Dean Rasheed wrote: > On Tue, 7 Jan 2020 at 12:31, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Dean Rasheed <dean.a.rasheed@gmail.com> writes: >>> Do we actually need the smallint versions of these functions? >> Doubt it. It'd be fairly hard even to call those, since e.g. "42" >> is an int not a smallint. >> > I see this has been marked RFC. I'll take it, Thanks! > and barring objections, > I'll start by ripping out the smallint code. No strong objection. -- Vik Fearing
Looking at the docs, I think it's worth going a little further than just saying what the acronyms stand for -- especially since the behaviour for zero inputs is an implementation choice (albeit the most common one). I propose the following: + <entry> + greatest common divisor — the largest positive number that + divides both inputs with no remainder; returns <literal>0</literal> if + both inputs are zero + </entry> and: + <entry> + least common multiple — the smallest strictly positive number + that is an integer multiple of both inputs; returns <literal>0</literal> + if either input is zero + </entry> (I have tried to be precise in my use of terms like "number" and "integer", to cover the different cases) Regards, Dean
On 20/01/2020 11:28, Dean Rasheed wrote: > Looking at the docs, I think it's worth going a little further than > just saying what the acronyms stand for -- especially since the > behaviour for zero inputs is an implementation choice (albeit the most > common one). I propose the following: > > + <entry> > + greatest common divisor — the largest positive number that > + divides both inputs with no remainder; returns <literal>0</literal> if > + both inputs are zero > + </entry> > > and: > > + <entry> > + least common multiple — the smallest strictly positive number > + that is an integer multiple of both inputs; returns > <literal>0</literal> > + if either input is zero > + </entry> > > (I have tried to be precise in my use of terms like "number" and > "integer", to cover the different cases) In that case should lcm be "...that is an integral multiple..." since the numeric version will return numeric? Other than that, I'm happy with this change. -- Vik Fearing
On 2020-Jan-20, Dean Rasheed wrote:

> + <entry>
> + greatest common divisor — the largest positive number that
> + divides both inputs with no remainder; returns <literal>0</literal> if
> + both inputs are zero
> + </entry>

Warning, severe OCD/bikeshedding ahead.

I don't know why, but this dash-semicolon sequence reads strangely to me
and looks out of place. I would use parens for the first phrase and
keep the semicolon, that is "greatest common divisor (the largest ...);
returns 0 if ..."

That seems more natural to me, and we're already using parens in other
description <entry>s.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 20 Jan 2020 at 18:52, Vik Fearing <vik.fearing@2ndquadrant.com> wrote: > > On 20/01/2020 11:28, Dean Rasheed wrote: > > > > + <entry> > > + least common multiple — the smallest strictly positive number > > + that is an integer multiple of both inputs; returns > > <literal>0</literal> > > + if either input is zero > > + </entry> > > In that case should lcm be "...that is an integral multiple..." since > the numeric version will return numeric? > So "integral multiple" instead of "integer multiple"? I think I'm more used to the latter, but I'm happy with either. Regards, Dean
On Mon, 20 Jan 2020 at 19:04, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2020-Jan-20, Dean Rasheed wrote: > > > + <entry> > > + greatest common divisor — the largest positive number that > > + divides both inputs with no remainder; returns <literal>0</literal> if > > + both inputs are zero > > + </entry> > > Warning, severe TOC/bikeshedding ahead. > > I don't know why, but this dash-semicolon sequence reads strange to me > and looks out of place. I would use parens for the first phrase and > keep the semicolon, that is "greatest common divisor (the largest ...); > returns 0 if ..." > > That seems more natural to me, and we're already using parens in other > description <entry>s. > Hmm, OK. I suppose that's more logical because then the bit in parens is the standard definition of gcd/lcm, and the part after the semicolon is the implementation choice for the special case not covered by the standard definition. Regards, Dean
On Mon, 20 Jan 2020 at 08:04, Vik Fearing <vik.fearing@2ndquadrant.com> wrote: > > On 20/01/2020 08:44, Dean Rasheed wrote: > >> > > I see this has been marked RFC. I'll take it, > Committed with some adjustments, mostly cosmetic but a couple more substantive: The code to guard against a floating point exception with inputs of (INT_MIN, -1) wasn't quite right because it actually just moved the problem so that it would fall over with inputs of (INT_MIN, +1). The convention in numeric.c is that the xxx_var() functions take *pointers* to their NumericVar arguments rather than copies, and they do not modify their inputs, as indicated by the use of "const". You might just have gotten away with what you were doing, but I think it was bad style and potentially unsafe -- for example, someone calling gcd_var() with a NumericVar that came from some other computation and having a non-null buf would risk having the buf freed in the copy, leaving the original NumericVar with a buf pointing to freed memory. Regards, Dean
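[For readers wondering why (INT_MIN, -1) needs a guard at all: the remainder itself is well-defined (it is 0), but on common hardware the division instruction computes quotient and remainder together, and INT_MIN / -1 = 2^31 does not fit in int32, so the operation traps. A minimal standalone illustration of the hazard and one safe formulation; this shows the general idea, not the committed guard:]

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        volatile int32_t a = INT32_MIN;     /* volatile defeats constant folding */
        volatile int32_t b = -1;
        int32_t r;

        /* r = a % b;       typically raises SIGFPE on x86 */

        r = (b == -1 || b == 1) ? 0 : a % b;    /* n % 1 and n % -1 are always 0 */
        printf("remainder = %d\n", r);
        return 0;
    }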
On 25/01/2020 15:18, Dean Rasheed wrote: > > Committed with some adjustments, mostly cosmetic but a couple more substantive: Thanks! > The code to guard against a floating point exception with inputs of > (INT_MIN, -1) wasn't quite right because it actually just moved the > problem so that it would fall over with inputs of (INT_MIN, +1). Good catch. > The convention in numeric.c is that the xxx_var() functions take > *pointers* to their NumericVar arguments rather than copies, and they > do not modify their inputs, as indicated by the use of "const". You > might just have gotten away with what you were doing, but I think it > was bad style and potentially unsafe -- for example, someone calling > gcd_var() with a NumericVar that came from some other computation and > having a non-null buf would risk having the buf freed in the copy, > leaving the original NumericVar with a buf pointing to freed memory. Thank you for taking the time to look closely at this. This was my first time dealing with "numeric" so I was bound to make some mistakes. -- Vik Fearing