Thread: NUMERIC private methods?
Folks, While noodling with some weighted statistics <https://github.com/davidfetter/weighted_stats>, I noticed I was having to jump through a lot of hoops because of all the private methods in numeric.c, especially NumericVar. Would there be some major objection to exposing NumericVar as an opaque blob? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 12/16/2014 08:34 AM, David Fetter wrote: > Folks, > > While noodling with some weighted statistics > <https://github.com/davidfetter/weighted_stats>, I noticed I was > having to jump through a lot of hoops because of all the private > methods in numeric.c, especially NumericVar. Would there be some > major objection to exposing NumericVar as an opaque blob? Hmm. You'd want to make add_var, mul_var etc. non-static? Looking at the weighed_stats code, this probably illustrates the hoops you had to jump through: > /* sqrt((n/(n-1)) * ((s0*s2 - s1*s1)/(s0*s0)) */ > > result > = DirectFunctionCall1( > numeric_sqrt, > DirectFunctionCall2( > numeric_mul, > DirectFunctionCall2( > numeric_div, > n_prime, > DirectFunctionCall2( > numeric_sub, > n_prime, > /* > * This rather convoluted way to compute the value > * 1 gives us a result which should have at least > * as big a decimal scale as s_2 does, which should > * guarantee that our result is as precise as the > * input... > */ > DirectFunctionCall2( > numeric_add, > DirectFunctionCall2( > numeric_sub, > state->s_2, > state->s_2 > ), > make_numeric(1) > ) > ) > ), > DirectFunctionCall2( > numeric_div, > DirectFunctionCall2( > numeric_sub, > DirectFunctionCall2( > numeric_mul, > state->s_0, > state->s_2 > ), > DirectFunctionCall2( > numeric_mul, > state->s_1, > state->s_1 > ) > ), > DirectFunctionCall2( > numeric_mul, > state->s_0, > state->s_0 > ) > ) > ) > ); As a start, it would help a lot to #define a few helper macros like: #define ADD(a, b) DirectFunctionCall2(numeric_add, a, b) #define MUL(a, b) DirectFunctionCall2(numeric_mul, a, b) in your extension. That would already make that a lot shorter. You might also be worrying about performance, though. The above snippet was from the aggregate's final function, which isn't performance critical, but you have some numeric operations in the transition function too. I wonder how big the impact really is, though. init_var_from_num and make_result look quite cheap, but certainly not free. - Heikki
>>>>> "Heikki" == Heikki Linnakangas <hlinnakangas@vmware.com> writes: Heikki> Looking at the weighed_stats code, this probably illustratesHeikki> the hoops you had to jump through: Actually that hoop-jumping expression is almost irrelevant. The part that hurts (and yes, it's performance that's at issue here, and not code aesthetics) is not being able to use NumericVar in the aggregate's transition variable, because that means that every computed intermediate value is palloc'd and pfree'd twice (once as the digit buffer of a NumericVar and again as a Numeric datum). -- Andrew (irc:RhodiumToad)
On Tue, Dec 16, 2014 at 09:01:47AM +0000, Andrew Gierth wrote: > >>>>> "Heikki" == Heikki Linnakangas <hlinnakangas@vmware.com> writes: > > Heikki> Looking at the weighed_stats code, this probably illustrates > Heikki> the hoops you had to jump through: > > Actually that hoop-jumping expression is almost irrelevant. Right. Not that it made things fun or easy (at least for me) to debug, as you can see by the git history. > The part that hurts (and yes, it's performance that's at issue here, > and not code aesthetics) is not being able to use NumericVar in the > aggregate's transition variable, because that means that every > computed intermediate value is palloc'd and pfree'd twice (once as > the digit buffer of a NumericVar and again as a Numeric datum). Yep. Performance of NUMERIC might yet get some of the love it deserves. Perhaps something along the lines of a 128-bit default structure with a promotion/demotion scheme for larger representations. Until then, those of us writing extensions are stuck with heaps of extra instructions in it that could easily be trimmed away to good effect. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 12/16/2014 08:34 AM, David Fetter wrote: >> While noodling with some weighted statistics >> <https://github.com/davidfetter/weighted_stats>, I noticed I was >> having to jump through a lot of hoops because of all the private >> methods in numeric.c, especially NumericVar. Would there be some >> major objection to exposing NumericVar as an opaque blob? > Hmm. You'd want to make add_var, mul_var etc. non-static? -1 for that. > Looking at the weighed_stats code, this probably illustrates the hoops > you had to jump through: >> /* sqrt((n/(n-1)) * ((s0*s2 - s1*s1)/(s0*s0)) */ If you're concerned about arithmetic performance, there is a very obvious fix here: use double. Is there some utterly compelling reason to use numeric, despite the fact that it's certain to be orders of magnitude slower? (It would still be orders of magnitude slower, no matter how much we were willing to destroy numeric.c's modularity boundary.) regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: >> Hmm. You'd want to make add_var, mul_var etc. non-static? Tom> -1 for that. possibly with more meaningful names. Tom> If you're concerned about arithmetic performance, there is aTom> very obvious fix here: use double. Independently of this specific example, the obvious issue there is that there are applications for which double is simply not acceptable. As it stands, no extension can use the numeric type in any non-trivial way without paying a large penalty for repeated pallocs and data copies. Given that the ability to write C extensions easily is one of pg's great strengths, this is a defect that should be corrected. Tom> (It would still be orders of magnitude slower, no matter howTom> much we were willing to destroy numeric.c's modularityTom>boundary.) There is no need to expose any details of NumericVar's implementation; it would suffice to provide an interface to allocate NumericVars, and access to the functions. -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> If you're concerned about arithmetic performance, there is a > Tom> very obvious fix here: use double. > Independently of this specific example, the obvious issue there is that > there are applications for which double is simply not acceptable. As the guy who last fooled with the numeric calculation algorithms in any major way, I'm painfully aware that numeric is not necessarily more accurate than double for anything more complicated than addition/subtraction/multiplication. The example that was shown upthread is pretty nearly a textbook case of something where I'd not believe that numeric offers any accuracy improvement without *very* careful investigation. In general, if your application is sufficiently arithmetic-intensive that you need to care about the speed of calculations, then it's not obvious which one to pick, and if I hear a knee-jerk "simply not acceptable" I'm simply going to conclude that you haven't actually studied the subject. > As it stands, no extension can use the numeric type in any non-trivial > way without paying a large penalty for repeated pallocs and data copies. > Given that the ability to write C extensions easily is one of pg's great > strengths, this is a defect that should be corrected. > Tom> (It would still be orders of magnitude slower, no matter how > Tom> much we were willing to destroy numeric.c's modularity > Tom> boundary.) > There is no need to expose any details of NumericVar's implementation; > it would suffice to provide an interface to allocate NumericVars, and > access to the functions. I call BS on that. Exposing NumericVar is *already* violating numeric.c's abstraction boundary. If somebody were to want to reimplement numeric for more speed, changing that data structure and/or the usage conventions for it would likely be one of the core things they need to do. As soon as we expose the functions you're asking for, that becomes impossible. regards, tom lane
On 12/18/14, 9:21 AM, Tom Lane wrote: >> As it stands, no extension can use the numeric type in any non-trivial >> >way without paying a large penalty for repeated pallocs and data copies. >> >Given that the ability to write C extensions easily is one of pg's great >> >strengths, this is a defect that should be corrected. If copying data/palloc is the root of numeric's performance problems then we need to address that, because it will providebenefit across the entire database. The pattern of (palloc; copy) is repeated throughout a large part of the codebase. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Thu, Dec 18, 2014 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Gierth <andrew@tao11.riddles.org.uk> writes: >> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: >> Tom> If you're concerned about arithmetic performance, there is a >> Tom> very obvious fix here: use double. > >> Independently of this specific example, the obvious issue there is that >> there are applications for which double is simply not acceptable. > > As the guy who last fooled with the numeric calculation algorithms in any > major way, I'm painfully aware that numeric is not necessarily more > accurate than double for anything more complicated than > addition/subtraction/multiplication. The example that was shown upthread > is pretty nearly a textbook case of something where I'd not believe that > numeric offers any accuracy improvement without *very* careful > investigation. In general, if your application is sufficiently > arithmetic-intensive that you need to care about the speed of calculations, > then it's not obvious which one to pick, and if I hear a knee-jerk "simply > not acceptable" I'm simply going to conclude that you haven't actually > studied the subject. I think that's ridiculous. You're basically arguing that numeric doesn't offer meaningful advantages over float8, which flies in the face of the fact that essentially every database application I've ever seen uses numeric and I'm not sure I've ever seen one using float8. Nearly all database users prefer to store quantities like currency units in a type that is guaranteed not to lose precision. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > I think that's ridiculous. You're basically arguing that numeric > doesn't offer meaningful advantages over float8, which flies in the > face of the fact that essentially every database application I've ever > seen uses numeric and I'm not sure I've ever seen one using float8. > Nearly all database users prefer to store quantities like currency > units in a type that is guaranteed not to lose precision. I think it's reasonable to expose NumericVar and the supporting function prototypes in, say, numeric_internal.h; normal applications that just want to operate on numerics as today can just include numeric.h, and continue to be at arms-length of the implementation details, while code that wants to optimize operations further can use numeric_internal.h and be very aware that they are subject to heavy breakage if we ever feel a need to change the internal API. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Dec 18, 2014 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> As the guy who last fooled with the numeric calculation algorithms in any >> major way, I'm painfully aware that numeric is not necessarily more >> accurate than double for anything more complicated than >> addition/subtraction/multiplication. The example that was shown upthread >> is pretty nearly a textbook case of something where I'd not believe that >> numeric offers any accuracy improvement without *very* careful >> investigation. > I think that's ridiculous. You're basically arguing that numeric > doesn't offer meaningful advantages over float8, which flies in the > face of the fact that essentially every database application I've ever > seen uses numeric and I'm not sure I've ever seen one using float8. > Nearly all database users prefer to store quantities like currency > units in a type that is guaranteed not to lose precision. If you're doing banking, you don't do anything except addition, subtraction, and multiplication. And that is what those users who want "guaranteed precision" are doing, and yeah numeric will make them happy. If you're doing any sort of higher math or statistics, I stand by my statement that you'd better think rather than just blindly assume that numeric is going to be better for you. A moment's fooling about finds this example, which is pretty relevant to the formula we started this thread with: regression=# select (1234::numeric/1235) * 1235; ?column? ---------------------------1234.00000000000000000100 (1 row) regression=# select (1234::float8/1235) * 1235; ?column? ---------- 1234 (1 row) What it boils down to is that numeric is great for storing given decimal inputs exactly, and it can do exact addition/subtraction/multiplication on those too, but as soon as you get into territory where the result is fundamentally inexact it is *not* promised to be better than float8. In fact, it's designed to be more or less the same as float8; see the comments in select_div_scale. We could probably improve on this if we were to redesign the algorithms around a concept of decimal floating-point, rather than decimal fixed-point as it is now. But I'm not sure how well that would comport with the SQL standard. And I'm very not sure that we could still do it once we'd tied one hand behind our backs by virtue of exporting a bunch of the internals as public API. regards, tom lane
On Thu, Dec 18, 2014 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > What it boils down to is that numeric is great for storing given decimal > inputs exactly, and it can do exact addition/subtraction/multiplication > on those too, but as soon as you get into territory where the result is > fundamentally inexact it is *not* promised to be better than float8. I think what it boils down to is that several people here (and I'll add my voice to the chorus) are saying, hey, numeric is really useful, and we'd like to be able to manipulate numerics without all the palloc and fmgr overhead, and your response appears to be to say, use float8. That may be the right answer in some cases, but certainly not in all. > We could probably improve on this if we were to redesign the algorithms > around a concept of decimal floating-point, rather than decimal > fixed-point as it is now. But I'm not sure how well that would comport > with the SQL standard. And I'm very not sure that we could still do it > once we'd tied one hand behind our backs by virtue of exporting a bunch > of the internals as public API. You make this argument every time somebody wants to drop static from a function or stick PGDLLIMPORT on a variable, and frankly I think it's pretty developer-hostile. I would much rather see us go ahead and do those things and if people later complain that we broke stuff, we'll go tell them to pound sand. That's what we said when people complained about relistemp -> relpersistence, and the number of people who are affected by a change in the system catalogs has got to be a good two orders of magnitude more than the number of people who are going to notice a change in the C API. Giving developers the ability to write extensions that *might* get broken by future changes is a lot better than not giving them that ability in the first place. There's only a problem for the core project if we think we're bound not to break the APIs in a future release, and I don't feel so bound. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I think what it boils down to is that several people here (and I'll > add my voice to the chorus) are saying, hey, numeric is really useful, > and we'd like to be able to manipulate numerics without all the palloc > and fmgr overhead, and your response appears to be to say, use float8. > That may be the right answer in some cases, but certainly not in all. Well, there are two components to what I'm saying. One is that the example David started with looks like it could use some better-informed consideration about which datatype to use. The other is that numeric leaves quite a lot to be desired still, and someday we might want to fix that, and that might require breaking the APIs you want to expose. > You make this argument every time somebody wants to drop static from a > function or stick PGDLLIMPORT on a variable, and frankly I think it's > pretty developer-hostile. I would much rather see us go ahead and do > those things and if people later complain that we broke stuff, we'll > go tell them to pound sand. Really. I will remember that next time you bitch about us changing some extension-visible behavior, which AFAIR you are usually one of the first to complain about. Anyway, if we do this, I concur with Alvaro's suggestion that the additional exposure be in a header file named something like numeric_private.h, so that there's less room for complaint when we change it. regards, tom lane
On Fri, Dec 19, 2014 at 10:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, there are two components to what I'm saying. One is that the > example David started with looks like it could use some better-informed > consideration about which datatype to use. The other is that numeric > leaves quite a lot to be desired still, and someday we might want to fix > that, and that might require breaking the APIs you want to expose. As I say, I'm OK with that. That file has been nearly-static for a really long time, so I don't expect a massive rewrite imminently. But if it happens, and it breaks things for people relying on those functions, tough luck for them. >> You make this argument every time somebody wants to drop static from a >> function or stick PGDLLIMPORT on a variable, and frankly I think it's >> pretty developer-hostile. I would much rather see us go ahead and do >> those things and if people later complain that we broke stuff, we'll >> go tell them to pound sand. > > Really. I will remember that next time you bitch about us changing some > extension-visible behavior, which AFAIR you are usually one of the first > to complain about. Really? I have bitched about what I see as nearly-pointless reorganizations of header files, because I personally place very low priority on speeding up incremental compiles, and most of those changes have no other benefit. But I can't remember ever bitching about function signature changes or changes to the types of global variables, or even the disappearance of previously-exposed types or global variables. And I daresay, given the nature of my responsibilities at EDB, such things have at least one order of magnitude more impact on me than they do on most extension maintainers. If it were up to me, I'd go stick PGDLLIMPORT on every global variable I could find; I estimate that would make things easier for me, say, five times as often as it would make them harder. The amount of time we've spent tracking down regressions on Windows because some loadable module of ours depended on a variable that wasn't PGDLLIMPORT'd is large - everything works fine on Linux, and sometimes survives cursory testing on Windows too, but then breaks in some obscure scenario. And there have also been cases where we've had to work around the absence of PGDLLIMPORT markings with ugly hacks and then that workaround code has turned out to be buggy. I would not argue for indiscriminately making global every function we have anywhere in the backend, but I would favor a policy of being significantly more liberal about it than we have been heretofore. > Anyway, if we do this, I concur with Alvaro's suggestion that the > additional exposure be in a header file named something like > numeric_private.h, so that there's less room for complaint when > we change it. Yes, I liked that proposal, too. Also, I'm not sure whether there's room to worry that making those function extern rather than static would defeat compiler optimizations that materially affect performance. That might just be paranoia on my part, but I've seen cases where it matters. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 18, 2014 at 11:51:37PM -0300, Alvaro Herrera wrote: > Robert Haas wrote: > > > I think that's ridiculous. You're basically arguing that numeric > > doesn't offer meaningful advantages over float8, which flies in > > the face of the fact that essentially every database application > > I've ever seen uses numeric and I'm not sure I've ever seen one > > using float8. Nearly all database users prefer to store > > quantities like currency units in a type that is guaranteed not to > > lose precision. > > I think it's reasonable to expose NumericVar and the supporting > function prototypes in, say, numeric_internal.h; normal applications > that just want to operate on numerics as today can just include > numeric.h, and continue to be at arms-length of the implementation > details, while code that wants to optimize operations further can > use numeric_internal.h and be very aware that they are subject to > heavy breakage if we ever feel a need to change the internal API. While nothing can prevent negligence and pilot error, making it clear by the name of the included header that breakable stuff is being used seems like an excellent way to proceed. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Dec 18, 2014 at 11:51:12PM -0500, Tom Lane wrote: > If you're doing any sort of higher math or statistics, I stand by my > statement that you'd better think rather than just blindly assume that > numeric is going to be better for you. A moment's fooling about finds > this example, which is pretty relevant to the formula we started this > thread with: > > regression=# select (1234::numeric/1235) * 1235; > ?column? > --------------------------- > 1234.00000000000000000100 > (1 row) > > regression=# select (1234::float8/1235) * 1235; > ?column? > ---------- > 1234 > (1 row) > > What it boils down to is that numeric is great for storing given decimal > inputs exactly, and it can do exact addition/subtraction/multiplication > on those too, but as soon as you get into territory where the result is > fundamentally inexact it is *not* promised to be better than float8. > In fact, it's designed to be more or less the same as float8; see the > comments in select_div_scale. Based on the analysis above, I have written the attached patch to the NUMERIC docs to mention this. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Attachment
>>>>> "Bruce" == Bruce Momjian <bruce@momjian.us> writes: > ! However, calculations on <type>numeric</type> values is very slow arithmetic ... is, but calculations ... are -- Andrew (irc:RhodiumToad)
On Sun, Mar 22, 2015 at 04:08:32AM +0000, Andrew Gierth wrote: > >>>>> "Bruce" == Bruce Momjian <bruce@momjian.us> writes: > > > ! However, calculations on <type>numeric</type> values is very slow > > arithmetic ... is, but calculations ... are Ah, good point. Fixed an applied. Thanks. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +