Thread: Floating point comparison inconsistencies of the geometric types
There are those macros defined for the built-in geometric types:

> #define EPSILON 1.0E-06
>
> #define FPzero(A) (fabs(A) <= EPSILON)
> #define FPeq(A,B) (fabs((A) - (B)) <= EPSILON)
> #define FPne(A,B) (fabs((A) - (B)) > EPSILON)
> #define FPlt(A,B) ((B) - (A) > EPSILON)
> #define FPle(A,B) ((A) - (B) <= EPSILON)
> #define FPgt(A,B) ((A) - (B) > EPSILON)
> #define FPge(A,B) ((B) - (A) <= EPSILON)

with this warning:

> * XXX These routines were not written by a numerical analyst.

Most of the geometric operators use those macros for comparison, but these do not:

* polygon << polygon
* polygon &< polygon
* polygon &> polygon
* polygon >> polygon
* polygon <<| polygon
* polygon &<| polygon
* polygon |&> polygon
* polygon |>> polygon
* box @> point
* point <@ box
* lseg <@ box
* circle @> point
* point <@ circle

This is really a bug that needs to be fixed one way or another. I think it is better to fix it by removing the macros altogether. I am not sure how useful they are in practice. I haven't seen anyone complaining about the above operators not using the macros, though people often complain about the operators that do use them and the problems they cause. The hackers evidently don't like the macros either; that should be why they are not used in the new operators.

What annoys me most about this situation is that the inconsistency blocks improving our indexes. Because of this, we had to rip point type support out of the BRIN inclusion operator class, which could have been useful to PostGIS. Fixing it has been discussed many times before [1][2][3][4] with no result. Here is my plan to fix the situation, covering the problems around it:

1) Reimplement some operators to avoid divisions

The attached patch does this for some of the line operators.

2) Use exact comparison everywhere except the operators where fuzzy comparison is certainly required

The attached patch changes all operators except some "lseg" operators. "lseg" stores two points on a line. Most of the operations done on it are lossy. I don't see a problem with treating them differently; those operators are very unlikely to be index supported.

3) Check numbers for underflow and overflow

I am thinking of using CHECKFLOATVAL from utils/adt/float.c, but it is not exposed outside at the moment. Would it be okay to create a new header file utils/float.h to increase code sharing between the float and geometric types?

4) Implement relative fuzzy comparison for the remaining operators

It is better to completely get rid of those macros while we are at it. I think we can get away with implementing just a single function for fuzzy equality, not ones for the other comparisons. I am inclined to put it in utils/adt/float.c. Is this a good idea?

Tom Lane commented on the function posted to the list [1] in 2002:

> Not like that. Perhaps use a fraction of the absolute value of the
> one with larger absolute value. As-is it's hard to tell how FLT_EPSILON
> is measured.

I cannot imagine what the function should look like. I would appreciate any guidance.

5) Implement default hash operator class for all geometric types

This would solve the most-complained-about problem of those types by allowing them to be used with DISTINCT and GROUP BY.

6) Implement default btree operator class at least for the point type

This would let the point type be used with ORDER BY. It is relatively straightforward to implement for the point type. Is it a good idea to somehow implement it for the other types?

7) Add GiST index support for missing cross-type operators

Currently only the contained-by operators are supported, using an out-of-range strategy number. I think we can make the operator classes much nicer by allowing truly cross-type operator families.

Comments?
[1] https://www.postgresql.org/message-id/flat/D90A5A6C612A39408103E6ECDD77B8290FD4E3@voyager.corporate.connx.com
[2] https://www.postgresql.org/message-id/flat/4A7C2C4B.5020508@netspace.net.au
[3] https://www.postgresql.org/message-id/flat/12549.1346111029@sss.pgh.pa.us
[4] https://www.postgresql.org/message-id/flat/20150512181307.GJ2523@alvh.no-ip.org
Attachment
On Fri, May 27, 2016 at 6:43 AM, Emre Hasegeli <emre@hasegeli.com> wrote:
> There are those macros defined for the built-in geometric types:
>
>> #define EPSILON 1.0E-06
>>
>> #define FPzero(A) (fabs(A) <= EPSILON)
>> #define FPeq(A,B) (fabs((A) - (B)) <= EPSILON)
>> #define FPne(A,B) (fabs((A) - (B)) > EPSILON)
>> #define FPlt(A,B) ((B) - (A) > EPSILON)
>> #define FPle(A,B) ((A) - (B) <= EPSILON)
>> #define FPgt(A,B) ((A) - (B) > EPSILON)
>> #define FPge(A,B) ((B) - (A) <= EPSILON)
>
> with this warning:
>
>> * XXX These routines were not written by a numerical analyst.

I agree that those macros look like a pile of suck. It's unclear to me what purpose they're trying to accomplish, but regardless of what it is, it's hard for me to believe that they are accomplishing it. Whether 1.0E-06 is a correct fuzz factor presumably depends greatly on the scale of the number; e.g. if the values are all like "3" it's probably fine, but if they are like "37142816124856" it's probably not enough fuzz, and if they are all like ".00000004" it's probably way too much fuzz.

Figuring out what to do about it is harder. Your proposal seems to be to remove them except where we need the fuzzy behavior, which doesn't sound unreasonable, but I don't personally understand why we need it in some places and not others. It would be good if some of the people who are more numerically inclined than I am (and hate floats less, but then that's everyone) could jump in here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 1, 2016 at 8:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 27, 2016 at 6:43 AM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> There are those macros defined for the built-in geometric types:
>>
>>> #define EPSILON 1.0E-06
>>>
>>> #define FPzero(A) (fabs(A) <= EPSILON)
>>> #define FPeq(A,B) (fabs((A) - (B)) <= EPSILON)
>>> #define FPne(A,B) (fabs((A) - (B)) > EPSILON)
>>> #define FPlt(A,B) ((B) - (A) > EPSILON)
>>> #define FPle(A,B) ((A) - (B) <= EPSILON)
>>> #define FPgt(A,B) ((A) - (B) > EPSILON)
>>> #define FPge(A,B) ((B) - (A) <= EPSILON)
>>
>> with this warning:
>>
>>> * XXX These routines were not written by a numerical analyst.
>
> I agree that those macros look like a pile of suck.

+1

> It's unclear to
> me what purpose they're trying to accomplish, but regardless of what
> it is, it's hard for me to believe that they are accomplishing it.
> Whether 1.0E-06 is a correct fuzz factor presumably depends greatly on
> the scale of the number; e.g. if the values are all like "3" it's
> probably fine but if they are like "37142816124856" it's probably not
> enough fuzz and if they are all like ".00000004" it's probably way too
> much fuzz.

Also, it's a pretty lame heuristic. It doesn't really eliminate *any* problems; it just aims to make them less frequent by shifting the edges around. In doing so, it creates whole new classes of problems:

test=# \set A '''(1.9999996,1.9999996),(1,1)''::box'
test=# \set B '''(2,2),(1,1)''::box'
test=# \set C '''(2.0000004,2.0000004),(1,1)''::box'
test=# select :A = :B, :B = :C, :A = :C;
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | t        | f
(1 row)

Is the benefit we get from the macros worth destroying the transitive property of the comparison operators on these types?

> Figuring out what to do about it is harder. Your proposal seems to be
> to remove them except where we need the fuzzy behavior, which doesn't
> sound unreasonable, but I don't personally understand why we need it
> in some places and not others.

+1

My first inclination is to remove those macros in version 10, but it would be good to hear from some people using these types on what the impact of that would be.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@gmail.com> writes:
> On Wed, Jun 1, 2016 at 8:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Figuring out what to do about it is harder. Your proposal seems to be
>> to remove them except where we need the fuzzy behavior, which doesn't
>> sound unreasonable, but I don't personally understand why we need it
>> in some places and not others.

> +1

> My first inclination is to remove those macros in version 10, but
> it would be good to hear from some people using these types on what
> the impact of that would be.

As I understand it, the key problem is that tests like "is point on line" would basically never succeed except in the most trivial cases, because of roundoff error. That's not very nice, and it might cascade to larger problems like object-containment tests failing unexpectedly. We would need to go through all the geometric operations and figure out where that kind of gotcha is significant and what we can do about it. Seems like a fair amount of work :-(. If somebody's willing to do that kind of investigation, then sure, but I don't think just blindly removing these macros is going to lead to anything good.

Also, I suppose this means that Robert promises not to make any of his usual complaints about breaking compatibility? Because we certainly would be.

			regards, tom lane
On 6/1/16 9:27 AM, Tom Lane wrote:
> As I understand it, the key problem is that tests like "is point on line"
> would basically never succeed except in the most trivial cases, because of
> roundoff error. That's not very nice, and it might cascade to larger
> problems like object-containment tests failing unexpectedly. We would
> need to go through all the geometric operations and figure out where that
> kind of gotcha is significant and what we can do about it. Seems like a
> fair amount of work :-(. If somebody's willing to do that kind of
> investigation, then sure, but I don't think just blindly removing these
> macros is going to lead to anything good.

I suspect another wrinkle here is that in the GIS world a single point can be represented in multiple reference/coordinate systems, and it would have different values in each of them. AIUI the transforms between those systems can be rather complicated if different projection methods are involved. I don't know if PostGIS depends on what these macros are doing or not. If it doesn't, perhaps it would be sufficient to lop off the last few bits of the significand. ISTM that'd be much better than what the macros currently do.

BTW, I suspect the macro values were chosen specifically for dealing with LAT/LONG.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Wed, Jun 1, 2016 at 7:52 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I suspect another wrinkle here is that in the GIS world a single point can
> be represented in multiple reference/coordinate systems, and it would have
> different values in each of them. AIUI the transforms between those systems
> can be rather complicated if different projection methods are involved. I
> don't know if PostGIS depends on what these macros are doing or not. If it
> doesn't, perhaps it would be sufficient to lop off the last few bits of the
> significand. ISTM that'd be much better than what the macros currently do.

We don't depend on these, we have our own :/

The real answer for a GIS system is to have an explicit tolerance parameter for calculations like distance/touching/containment, but unfortunately we didn't do that, so now we have our own compatibility/boil-the-ocean problem if we ever wanted/were funded to add one.

P.

> BTW, I suspect the macro values were chosen specifically for dealing with
> LAT/LONG.
On 06/01/2016 07:52 AM, Jim Nasby wrote:
> On 6/1/16 9:27 AM, Tom Lane wrote:
>> As I understand it, the key problem is that tests like "is point on line"
>> would basically never succeed except in the most trivial cases, because of
>> roundoff error. That's not very nice, and it might cascade to larger
>> problems like object-containment tests failing unexpectedly. We would
>> need to go through all the geometric operations and figure out where that
>> kind of gotcha is significant and what we can do about it. Seems like a
>> fair amount of work :-(. If somebody's willing to do that kind of
>> investigation, then sure, but I don't think just blindly removing these
>> macros is going to lead to anything good.
>
> I suspect another wrinkle here is that in the GIS world a single point
> can be represented in multiple reference/coordinate systems, and it
> would have different values in each of them. AIUI the transforms between
> those systems can be rather complicated if different projection methods
> are involved. I don't know if PostGIS depends on what these macros are
> doing or not. If it doesn't, perhaps it would be sufficient to lop off
> the last few bits of the significand. ISTM that'd be much better than
> what the macros currently do.

IIRC PostGIS uses a function from libgeos to do things like "point equals" (ST_Equals). I've never looked at that source, but would be unsurprised to find that it does something similar to this, although probably also more sophisticated.

(looks) yes -- ST_Equals calls GEOSEquals() after some sanity checking...

> BTW, I suspect the macro values were chosen specifically for dealing
> with LAT/LONG.

I think that is it exactly.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On 6/1/16 10:03 AM, Paul Ramsey wrote:
> We don't depend on these, we have our own :/
> The real answer for a GIS system is to have an explicit tolerance
> parameter for calculations like distance/touching/containment, but
> unfortunately we didn't do that so now we have our own
> compatibility/boil the ocean problem if we ever wanted/were funded to
> add one.

Well, it sounds like what's currently happening in Postgres is probably going to change, so how might we structure that to help PostGIS? Would simply lopping off the last few bits of the significand/mantissa work, or is that not enough when different GRSes are involved?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Wed, Jun 1, 2016 at 8:59 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 6/1/16 10:03 AM, Paul Ramsey wrote:
>> We don't depend on these, we have our own :/
>> The real answer for a GIS system is to have an explicit tolerance
>> parameter for calculations like distance/touching/containment, but
>> unfortunately we didn't do that so now we have our own
>> compatibility/boil the ocean problem if we ever wanted/were funded to
>> add one.
>
> Well, it sounds like what's currently happening in Postgres is probably
> going to change, so how might we structure that to help PostGIS? Would
> simply lopping off the last few bits of the significand/mantissa work, or
> is that not enough when different GRSes are involved?

PostGIS doesn't look at all at what the PgSQL geotypes do, so go forward w/o fear.

Tolerance in the geo world is more than vertex rounding, though; it's things like saying that when distance(pt,line) < epsilon then distance(pt,line) == 0, or similarly for shape touching, etc. One of the things people find annoying about PostGIS is that ST_Intersects(ST_Intersection(a, b), a) can come out as false (a derived point at a crossing of lines may not exactly intersect either of the input lines), which is a direct result of our use of exact math for the boolean intersects test.

Anyways, go forth and do whatever makes sense for PgSQL.

P
Paul Ramsey <pramsey@cleverelephant.ca> writes:
> One of the things people find annoying about PostGIS is that
> ST_Intersects(ST_Intersection(a, b), a) can come out as false (a
> derived point at a crossing of lines may not exactly intersect either
> of the input lines), which is a direct result of our use of exact math
> for the boolean intersects test.

That's an interesting comment, because it's more or less exactly the type of failure that we could expect to get if we remove fuzzy comparisons from the built-in types. How much of a problem is it in practice for PostGIS users? Do you have any plans to change it?

> Anyways, go forth and do whatever makes sense for PgSQL

I think we're trying to figure out what that is ...

			regards, tom lane
On Wed, Jun 1, 2016 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As I understand it, the key problem is that tests like "is point on line"
> would basically never succeed except in the most trivial cases, because of
> roundoff error. That's not very nice, and it might cascade to larger
> problems like object-containment tests failing unexpectedly. We would
> need to go through all the geometric operations and figure out where that
> kind of gotcha is significant and what we can do about it. Seems like a
> fair amount of work :-(. If somebody's willing to do that kind of
> investigation, then sure, but I don't think just blindly removing these
> macros is going to lead to anything good.

Yeah, it does seem to need some research.

> Also, I suppose this means that Robert promises not to make any of his
> usual complaints about breaking compatibility? Because we certainly
> would be.

Pot, meet Mr. Kettle! Obviously, the inconvenience caused by any backward incompatibility has to be balanced against the fact that the new behavior is presumably better. But I stridently object to the accusation that, of the two of us, I'm the one more concerned with backward compatibility. There may be some instances where I've had a more conservative judgement than you about breaking user-facing stuff, but you've blocked dozens of changes to the C API that would have enabled meaningful extension development on the grounds that somebody might complain when a future release changes the API! I think behavior changes that users will notice are of vastly greater significance than those which will only be observed by developers.

In this particular case, I think that the current behavior is pretty stupid, and that the built-in geometric types are barely used, possibly because they have stupid behavior. So I would be willing to bet on a well-thought-out change in this area coming out to a net positive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
My progress so far is attached as 2 patches. The first one introduces a header file for adt/float.c. The second one refactors the geometric operations.

I have removed the fuzzy comparison macros altogether. It is very hard to keep some of them and maintain consistency. Many operators that would benefit from fuzzy comparison are quite broken at the moment [1], anyway. I have also made the comparisons safe against NaN [2].

[1] https://www.postgresql.org/message-id/flat/CAE2gYzw_-z%3DV2kh8QqFjenu%3D8MJXzOP44wRW%3DAzzeamrmTT1%3DQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/28685.1468246504%40sss.pgh.pa.us
Attachment
On Mon, Jul 18, 2016 at 3:54 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
> My progress so far is attached as 2 patches. First one introduces an
> header file for adt/float.c. Second one refactors the geometric
> operations.

The first patch fails to apply due to bit-rot. That's easy enough to correct, but then it runs into warnings on make:

btree_gin.c: In function ‘leftmostvalue_float4’:
btree_gin.c:234:2: error: implicit declaration of function ‘get_float4_infinity’ [-Werror=implicit-function-declaration]
  return Float4GetDatum(-get_float4_infinity());
  ^
btree_gin.c: In function ‘leftmostvalue_float8’:
btree_gin.c:242:2: error: implicit declaration of function ‘get_float8_infinity’ [-Werror=implicit-function-declaration]
  return Float8GetDatum(-get_float8_infinity());
  ^

Please fix.

Something to consider before posting the new version -- should we change some of those macros to static inline (in the .h files) to avoid double-evaluation hazards? They might perform as well or even better that way, and it would remove a subtle programmer foot-gun.

Changing status to "Waiting on Author".

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> The first patch fails to apply due to bit-rot. That's easy enough
> to correct, but then it runs into warnings on make:

Rebased and fixed.

> Something to consider before posting new version -- should we
> change some of those macros to static inline (in the .h files) to
> avoid double-evaluation hazards? They might perform as well or
> even better that way, and remove a subtle programmer foot-gun.

One reason to keep them as macros is that they currently support both float4 and float8. Currently they look like this:

    FLOAT_EQ(val1, val2)
    FLOAT_PL(result, val1, val2)

I guess if we would use inline functions, they would look like:

    FLOAT8_EQ(val1, val2)
    result = FLOAT8_PL(val1, val2)

The result would need to be defined again in the function to be checked for overflow. I think this way would be more developer-friendly. I have gone to some trouble to allocate the result to be able to use the macros. Do you know if the performance of both would be the same? I am not much experienced in C. If you think that inline functions are better suited, I can rework the patches.
Attachment
On Sun, Sep 4, 2016 at 12:42 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> The first patch fails to apply due to bit-rot. That's easy enough
>> to correct, but then it runs into warnings on make:
>
> Rebased and fixed.

These patches apply and build on top of 5c609a74 with no problems, but `make check` finds differences per the attached. Please investigate why the regression tests are failing and what the appropriate response is.

>> Something to consider before posting new version -- should we
>> change some of those macros to static inline (in the .h files) to
>> avoid double-evaluation hazards? They might perform as well or
>> even better that way, and remove a subtle programmer foot-gun.
>
> One reason to keep them as macros is that they are currently
> supporting both float4 and float8. [...] Do you know if the
> performance of both would be the same?

In one case where I was concerned that a static inline definition would not perform as well as a macro, I went so far as to compare the compiled code for both at a number of call sites in an optimized build, and found that a reasonably current gcc generated *identical* code for both. Of course, this will not always be the case -- in particular for macros which require multiple evaluations of one or more parameters that are not simple literals. In such cases, the static inline often benchmarks faster, because the result of the potentially expensive expression can be stored on the stack or in a register and just referenced again.

> I am not much experienced in C. If you think that inline functions
> are better suited, I can rework the patches.

I suspect that they will be as fast or faster, and they eliminate the hazard of multiple evaluation, where a programmer might not be aware of the multiple evaluation or of some side-effect of an argument.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Fri, Sep 9, 2016 at 4:25 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> These patches apply and build on top of 5c609a74 with no problems,
> but `make check` finds differences per the attached. Please
> investigate why the regression tests are failing and what the
> appropriate response is.
>
> I suspect that they will be as fast or faster, and they eliminate
> the hazard of multiple evaluation, where a programmer might not be
> aware of the multiple evaluation or of some side-effect of an
> argument.

Emre, are you going to address the above? It would have to be Real Soon Now.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> Emre, are you going to address the above? It would have to be Real
> Soon Now.

Yes, I am working on it. I found more problems and replaced more algorithms. That took a lot of time. I will post the new version really soon. I wouldn't feel bad if you didn't have enough time to review it in this commitfest.
> These patches apply and build on top of 5c609a74 with no problems,
> but `make check` finds differences per the attached. Please
> investigate why the regression tests are failing and what the
> appropriate response is.

I fixed the first one and worked around the second with COLLATE "C". I have no idea how my changes caused this regression.

"select_views" test runs "SELECT name, #thepath FROM iexit ORDER BY 1, 2" and expects to get rows in this order:

> I- 580 Ramp | 8
> I- 580/I-680 Ramp | 2

With the collation on my laptop, this is actually true:

> regression=# select 'I- 580/I-680 Ramp' < 'I- 580 Ramp';
>  ?column?
> ----------
>  t
> (1 row)

However, on the Linux server I am testing, it is not:

> regression=# select 'I- 580 Ramp' < 'I- 580/I-680 Ramp';
>  ?column?
> ----------
>  f
> (1 row)

Do you know how it is not failing on the master?

> I suspect that they will be as fast or faster, and they eliminate
> the hazard of multiple evaluation, where a programmer might not be
> aware of the multiple evaluation or of some side-effect of an
> argument.

I reworked the patches to use inline functions and fixed the problems I found. The new versions are attached.
Attachment
On Wed, Sep 28, 2016 at 12:02 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> `make check` finds differences per the attached. Please
>> investigate why the regression tests are failing and what the
>> appropriate response is.
>
> I fixed the first one and worked around the second with COLLATE "C".
> I have no idea how my changes caused this regression.
>
> "select_views" test runs "SELECT name, #thepath FROM iexit ORDER BY 1,
> 2" and expects to get rows in this order:
>
>> I- 580 Ramp | 8
>> I- 580/I-680 Ramp | 2
>
> With the collation on my laptop, this is actually true:
>
>> regression=# select 'I- 580/I-680 Ramp' < 'I- 580 Ramp';
>>  ?column?
>> ----------
>>  t
>> (1 row)
>
> However, on the Linux server I am testing, it is not:
>
>> regression=# select 'I- 580 Ramp' < 'I- 580/I-680 Ramp';
>>  ?column?
>> ----------
>>  f
>> (1 row)
>
> Do you know how it is not failing on the master?

Well, those two results are not contradictory -- notice that you switched the order of the values in the comparison. I don't think you've really found the explanation yet.

>> [discussing inline static functions compared to macros for min()/max(), etc.]
>> I suspect that they will be as fast or faster, and they eliminate
>> the hazard of multiple evaluation, where a programmer might not be
>> aware of the multiple evaluation or of some side-effect of an
>> argument.
>
> I reworked the patches to use inline functions and fixed the
> problems I found. The new versions are attached.

Will take a look and post again.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 28, 2016 at 2:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote: > Will take a look and post again. I am moving this patch to the next CF. You'll be hearing from me sometime after this CF is closed. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Well, those two results are not contradictory -- notice that you
> switched the order of the values in the comparison. I don't think
> you've really found the explanation yet.

I am sorry, I copy-pasted the wrong example. "select_views" test runs:

> SELECT name, #thepath FROM iexit ORDER BY 1, 2

and expects to get rows in this order:

> I- 580 Ramp | 8
> I- 580/I-680 Ramp | 2

With the collation on my laptop, this is actually true:

> regression=# select 'I- 580 Ramp' < 'I- 580/I-680 Ramp';
>  ?column?
> ----------
>  t
> (1 row)

On the Linux server I am testing, it is not:

> regression=# select 'I- 580 Ramp' < 'I- 580/I-680 Ramp';
>  ?column?
> ----------
>  f
> (1 row)

The latter should be the case in your environment as well, as the test was also failing for you. This is not consistent with the expected test result. Do you know how this test can still pass on the master?
Hello,

At Thu, 29 Sep 2016 10:37:30 +0200, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzxijWKwJ-ZPD--QHM+SxMd+vL_81_3Xt0spnCbsqFH=Ug@mail.gmail.com>
>> regression=# select 'I- 580 Ramp' < 'I- 580/I-680 Ramp';
>>  ?column?
>> ----------
>>  t
>> (1 row)
>
> on the Linux server I am testing, it is not:
>
>> regression=# select 'I- 580 Ramp' < 'I- 580/I-680 Ramp';
>>  ?column?
>> ----------
>>  f
>> (1 row)
>
> The latter should be the case in your environment as well, as the test
> was also failing for you. This is not consistent with the expected test
> result. Do you know how this test can still pass on the master?

Perhaps you ran the test under an environment LC_COLLATE (or LANG) of something other than C. The reason for the result is collation. The following returns the expected result.

=# select 'I- 580 Ramp' < ('I- 580/I-680 Ramp' COLLATE "C");
 ?column?
----------
 t
(1 row)

For a reason I don't know, say, the en_US locale behaves in an unintuitive way.

regression=# select ' ' < ('/' COLLATE "en_US"), ' x' < ('/a' COLLATE "en_US");
 ?column? | ?column?
----------+----------
 t        | f
(1 row)

Returning to the issue, the following query should give you the expected result.

SELECT name, #thepath FROM iexit ORDER BY name COLLATE "C", 2;

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
> Returning to the issue, the following query should give you the
> expected result.
>
> SELECT name, #thepath FROM iexit ORDER BY name COLLATE "C", 2;

Yes, I have worked around it like this. What I couldn't understand is how my patch can cause this regression. How does it pass on master without the COLLATE?
Hello,

> > Returning to the issue, the following query should give you the
> > expected result.
> >
> > SELECT name, #thepath FROM iexit ORDER BY name COLLATE "C", 2;
>
> Yes, I have worked around it like this. What I couldn't understand is
> how my patch can cause this regression. How does it pass on master
> without the COLLATE?

Ah, sorry, I understand that *you* added the COLLATE. Reverting select_views.sql/out to master actually causes a regression error. The reason is that you forgot to edit the alternative expected file, which matches for the en_US locale. But the test is not about collation, and the only difference between the alternative expected files comes from the difference of collation for the query. "The workaround" seems to be the right way to do it. I rather recommend leaving the workaround as it is and removing select_views_1.out from the "expected" directory.

Aside from that, I'd like to comment on this patch on other points later.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,

> Aside from that, I'd like to comment on this patch on other points
> later.

The starting point of this patch is the fact that most, but not all, geometric operators use fuzzy comparison. Tom pointed out that the fixed fuzz factor is not reasonable, but it is hard to see how to modify it. We could remove the fuzz factor altogether, but I think we should also provide a means of doing similar things. At least "is a point on a line" might be useless for most cases without any fuzzing feature. (Nevertheless, it is a problem only when it is being used to do that:) If we can't find a reasonable policy on fuzzing operations, that would be proof that we shouldn't change the behavior.

The 0001 patch adds many FP comparison functions, each individually considering NaN. As a result the sort order logic involving NaN is scattered around the functions; on top of them you then implement a generic comparison function. It seems inside-out to me. Defining the ordering in one place, then doing comparison using it, seems more reasonable.

The 0002 patch replaces many native operators on floating point numbers by functions having sanity checks. This seems to be addressing Tom's comment. That is reasonable for comparison operators, but I don't think the same holds for replacing all arithmetic. For example, float8_div checks that:

- If both operands are not INF, the result should not be INF.
- If the dividend is not exactly zero, the result should not be zero.

The second assumption is wrong by itself. As an extreme case, 4.9e-324 / 1.7e+308 becomes exactly zero (by underflow). We cannot assert the result is wrong even though the dividend is not zero. The validity of the result varies according to its meaning. For the case of box_cn,

> center->x = float8_div(float8_pl(box->high.x, box->low.x), 2.0);
> center->y = float8_div(float8_pl(box->high.y, box->low.y), 2.0);

If the center somehow goes extremely near to the origin, it could result in a false error.
> =# select @@ box'(-8e-324, -8e-324), (4.9e-324, 4.9e-324)';
> ERROR: value out of range: underflow

I don't think this underflow is an error; it is actually a change of the current behavior without a reasonable reason. A more significant (and maybe unacceptable) side effect is that it changes the behavior of ordinary operators. I don't think this is acceptable. More consideration is needed.

> =# select ('-8e-324'::float8 + '4.9e-324'::float8) / 2.0;
> ERROR: value out of range: underflow

In regard to fuzzy operations, libgeos seems to have several types of this kind of feature. (I haven't looked closely into them.) Approaches other than reducing precision seem overkill or inapplicable for the PostgreSQL builtins. As Jim said, can we replace the fixed-scale fuzz factor by precision reduction? Maybe with a GUC variable (I hear someone's roaring..) to specify the amount, defaulting to fit the current assumption.

https://apt-browse.org/browse/ubuntu/trusty/universe/i386/libgeos++-dev/3.4.2-4ubuntu1/file/usr/include/geos/geom/BinaryOp.h

/*
 * Define this to use PrecisionReduction policy
 * in an attempt at by-passing binary operation
 * robustness problems (handles TopologyExceptions)
 */
...
/*
 * Define this to use TopologyPreserving simplification policy
 * in an attempt at by-passing binary operation
 * robustness problems (handles TopologyExceptions)
 */
(seems not activated)
...
/*
 * Use common bits removal policy.
 * If enabled, this would be tried /before/
 * Geometry snapping.
 */
...
/*
 * Use snapping policy
 */

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
> The reason is that you forgot to edit the alternative expected
> file, which matches for the en_US locale.

Now I understand. Thank you for the explanation.

> But the test is not about collation, and the only difference
> between the alternative expected files comes from the difference of
> collation for the query. "The workaround" seems to be the right
> way to do it. I rather recommend leaving the workaround as it is
> and removing select_views_1.out from the "expected" directory.

I will do so.
> We can remove the fuzz factor altogether but I think we also
> should provide a means of doing similar things. At least "is
> a point on a line" might be useless for most cases without any
> fuzzing feature. (Nevertheless, it is a problem only when it is
> being used to do that:) If we can't find a reasonable policy on
> fuzzing operations, that would be proof that we shouldn't
> change the behavior.

It was my initial idea to keep the fuzzy comparison behaviour in some places, but the more I got into it, the more I realised that it is almost impossible to get this right. Instead, I re-implemented some operators to keep precision as much as possible. The previous "is a point on a line" operator would *never* give a true result without the fuzzy comparison. The new implementation returns true when precision is not lost. I think this is a behaviour that people working with floating points are prepared to deal with. By the way, the "is a point on a line" operator is quite wrong with the fuzzy comparison at the moment [1].

> The 0001 patch adds many FP comparison functions, each individually
> considering NaN. As a result the sort order logic involving NaN
> is scattered around the functions; on top of them you then implement
> a generic comparison function. It seems inside-out to
> me. Defining the ordering in one place, then doing comparison using it,
> seems more reasonable.

I agree that it would be simpler to use the comparison function for implementing the other operators. I have done it the other way around to make them more optimised. They are called very often. I don't think checking the return code of the comparison function would be optimised the same way. I could keep the comparison functions as they are, but re-implement them using the others, to keep the documentation of NaN comparison in a single place.

> If the center somehow goes extremely near to the origin, it could
> result in a false error.
> >> =# select @@ box'(-8e-324, -8e-324), (4.9e-324, 4.9e-324)';
> >> ERROR: value out of range: underflow
> >
> > I don't think this underflow is an error; it is actually a
> > change of the current behavior without a reasonable reason. A more
> > significant (and maybe unacceptable) side effect is that it
> > changes the behavior of ordinary operators. I don't think this is
> > acceptable. More consideration is needed.
> >
> >> =# select ('-8e-324'::float8 + '4.9e-324'::float8) / 2.0;
> >> ERROR: value out of range: underflow

This is the current behaviour of the float datatype. My patch doesn't change that. This problem would probably also apply to multiplying very small values. I agree that this is not the ideal behaviour, though I am not sure if we should go in a different direction than the float datatypes. I think there is value in making the geometric types compatible with float. Users are going to mix them anyway. For example, users can calculate the center of a box manually, and be confused when the built-in operator behaves differently.

> In regard to fuzzy operations, libgeos seems to have several
> types of this kind of feature. (I haven't looked closely into
> them.) Approaches other than reducing precision seem overkill or
> inapplicable for the PostgreSQL builtins. As Jim said, can we replace
> the fixed-scale fuzz factor by precision reduction? Maybe with a
> GUC variable (I hear someone's roaring..) to specify the amount,
> defaulting to fit the current assumption.

I am disinclined to try to implement something complicated for the geometric types. I think they are mostly useful for two purposes: use cases simple enough that looking for better solutions is not worthwhile, and demonstrating our indexing capabilities. The inconsistencies harm both of those.

[1] https://www.postgresql.org/message-id/flat/CAE2gYzw_-z%3DV2kh8QqFjenu%3D8MJXzOP44wRW%3DAzzeamrmTT1%3DQ%40mail.gmail.com
Hello,

At Sun, 13 Nov 2016 22:41:01 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzymeQXGGmhU1Vc35DpugwfRd-QRK3BM-6TGg0rwHcDN_w@mail.gmail.com>
> > We can remove the fuzz factor altogether but I think we also
> > should provide a means of doing similar things. At least "is
> > a point on a line" might be useless for most cases without any
> > fuzzing feature. (Nevertheless, it is a problem only when it is
> > being used to do that:) If we can't find a reasonable policy on
> > fuzzing operations, that would be proof that we shouldn't
> > change the behavior.
>
> It was my initial idea to keep the fuzzy comparison behaviour in some
> places, but the more I got into it, the more I realised that it is almost
> impossible to get this right. Instead, I re-implemented some
> operators to keep precision as much as possible. The previous "is a
> point on a line" operator would *never* give a true result without
> the fuzzy comparison. The new implementation returns true when
> precision is not lost.

There's no way to accurately store numbers other than some special ones in a floating point format, where mantissa precision is limited. Even 0.1 is not stored accurately with a binary mantissa. (It would be concealed by machine rounding in most cases, though.) The same can be said of numeric: it cannot store 1/3 accurately, and doesn't even conceal the inaccuracy. Even if they could store numbers accurately, constants such as pi do not have finite precision anyway. As a result, we have a tradition that equality operators on real numbers should generally be avoided. Numerics are provided mainly for financial use, where finite-but-high-precision decimal arithmetic is performed.

> I think this is a behaviour that people working with floating points
> are prepared to deal with.

What way of dealing with it is in your mind? The problem hides behind the operators. To fix it, a user should rewrite an expression using more primitive operators.
For example, (line_a # point_a) should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in a more primitive way. I regard this as the operator # just becoming useless.

> By the way, the "is a point on a line" operator is quite wrong with
> the fuzzy comparison at the moment [1].

The bug is a problem separate from the epsilon behavior. It of course should be fixed, but in an "appropriate" way that may be defined in this discussion.

> > The 0001 patch adds many FP comparison functions, each individually
> > considering NaN. As a result the sort order logic involving NaN
> > is scattered around the functions; on top of them you then implement
> > a generic comparison function. It seems inside-out to
> > me. Defining the ordering in one place, then doing comparison using it,
> > seems more reasonable.
>
> I agree that it would be simpler to use the comparison function for
> implementing the other operators. I have done it the other way around
> to make them more optimised. They are called very often. I don't think
> checking the return code of the comparison function would be optimised
> the same way. I could keep the comparison functions as they are, but
> re-implement them using the others, to keep the documentation of NaN
> comparison in a single place.

Regarding optimization, at least gcc generates seemingly not-so-different code for the two. Both generally generate inline code directly calling isnan() and so on. Have you measured the performance of the two implementations (with -O2, without --enable-cassert)? This kind of hand-optimization gains legitimacy when we see a significant difference, according to the convention here.. I suppose.

At least the comment you dropped by the patch,

int
float4_cmp_internal(float4 a, float4 b)
{
-   /*
-    * We consider all NANs to be equal and larger than any non-NAN. This is
-    * somewhat arbitrary; the important thing is to have a consistent sort
-    * order.
-    */

seems very significant and should be kept anywhere relevant.
> > If the center somehow goes extremely near to the origin, it could
> > result in a false error.
> >
> >> =# select @@ box'(-8e-324, -8e-324), (4.9e-324, 4.9e-324)';
> >> ERROR: value out of range: underflow
> >
> > I don't think this underflow is an error; it is actually a
> > change of the current behavior without a reasonable reason. A more
> > significant (and maybe unacceptable) side effect is that it
> > changes the behavior of ordinary operators. I don't think this is
> > acceptable. More consideration is needed.
> >
> >> =# select ('-8e-324'::float8 + '4.9e-324'::float8) / 2.0;
> >> ERROR: value out of range: underflow
>
> This is the current behaviour of the float datatype. My patch doesn't
> change that.

Very sorry for the mistake! I somehow saw float8pl on master and float8_div with your patch. Please forget it.

> This problem would probably also apply to multiplying
> very small values. I agree that this is not the ideal behaviour,
> though I am not sure if we should go in a different direction than
> the float datatypes.

Yes, perhaps the behavior of geometric arithmetic should be considered differently from bare float calculation, but that is not the topic for now.

> I think there is value in making the geometric types compatible with
> float. Users are going to mix them anyway. For example, users can
> calculate the center of a box manually, and be confused when the
> built-in operator behaves differently.

That doesn't resolve the problem that some comparison operators are practically useless, and it means that we (maybe) should create another geometric data type based on numeric. But that would also be a crap, for the same reason as the float-based one. At a glance, with no-tolerance comparison, the following geometric operators seem hardly usable:

#, &&, ?#, ?-, ?|, ?-|, ?||, @>, <@, ~=

And the inclusive (le, ge) and exclusive (lt, gt) comparisons seem to lose their distinction in practice:
<<, >>, &<, &>, <<|, |>>, &<|, |&>, <^, >^

I searched the pgsql-general ML but couldn't find a complaint about the current behavior. Even though I'm not familiar with PostGIS, I found that it uses exactly the same EPSILON method as PostgreSQL. And libgeos also applies tolerance to comparison in several ways. From these facts, I think the users of geometric types require some kind of tolerance/fuzziness factor on comparison. If we had an apparent plan to use them for other than earth-scale(?) geometric usage, we could design what they should be. But without such a plan it is just a breakage of the current usage.

# By the way, 2D geometric functions are inaccurate for
# earth-scale spherical figures, but are we ignoring that?

> > In regard to fuzzy operations, libgeos seems to have several
> > types of this kind of feature. (I haven't looked closely into
> > them.) Approaches other than reducing precision seem overkill or
> > inapplicable for the PostgreSQL builtins. As Jim said, can we replace
> > the fixed-scale fuzz factor by precision reduction? Maybe with a
> > GUC variable (I hear someone's roaring..) to specify the amount,
> > defaulting to fit the current assumption.
>
> I am disinclined to try to implement something complicated for the
> geometric types. I think they are mostly useful for two purposes: use
> cases simple enough that looking for better solutions is not
> worthwhile, and demonstrating our indexing capabilities. The
> inconsistencies harm both of those.

What kind of (needless) complication are you referring to? The fuzziness seems to me essential for geometric comparisons to work practically. In addition, I don't think we are allowed to change the behavior in such an area of released versions time after time. I don't think index scans and tolerant comparison are contradictory. Could you give me a description of the indexing capabilities and the inconsistencies?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
> What way of dealing with it is in your mind? The problem hides
> behind the operators. To fix it, a user should rewrite an expression
> using more primitive operators. For example, (line_a # point_a)
> should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in
> a more primitive way. I regard this as the operator # just
> becoming useless.

Simple equations like this work well with the current algorithm:

> select '(0.1,0.1)'::point @ '(0,0),(1,1)'::line;

The operator does what you expect from it. Users can use something like you described to get fuzzy behaviour with an epsilon they choose.

> Regarding optimization, at least gcc generates seemingly not-so-different
> code for the two. Both generally generate inline code directly calling
> isnan() and so on. Have you measured the performance of the two
> implementations (with -O2, without --enable-cassert)? This kind of
> hand-optimization gains legitimacy when we see a significant
> difference, according to the convention here.. I suppose.

I tested it with this program:

> int main()
> {
>     double i, j;
>     int result = 0;
>
>     for (i = 0.1; i < 10000.0; i += 1.0)
>         for (j = 0.1; j < 10000.0; j += 1.0)
>             if (float8_lt(i, j))
>                 result = (result + 1) % 10;
>
>     return result;
> }

The one calling cmp() was noticeably slower.

./test1 0.74s user 0.00s system 99% cpu 0.748 total
./test2 0.89s user 0.00s system 99% cpu 0.897 total

This would probably not be very noticeable when calling SQL functions which do only a few comparisons, but it may be necessary to do many more comparisons in some places. I don't find the optimised versions less clear than calling cmp(). I can change it the other way, if you find that clearer.

> At least the comment you dropped by the patch,
>
> int
> float4_cmp_internal(float4 a, float4 b)
> {
> -   /*
> -    * We consider all NANs to be equal and larger than any non-NAN. This is
> -    * somewhat arbitrary; the important thing is to have a consistent sort
> -    * order.
> -    */
>
> seems very significant and should be kept anywhere relevant.

I will add it back in the next version.

> I searched the pgsql-general ML but couldn't find a complaint about
> the current behavior. Even though I'm not familiar with PostGIS, I
> found that it uses exactly the same EPSILON method as
> PostgreSQL.

Is it? I understood from Paul Ramsey's comment on this thread [1] that they don't.

> If we had an apparent plan to use them for other than
> earth-scale(?) geometric usage, we could design what they should
> be. But without such a plan it is just a breakage of the current
> usage.

We give no promises about the geometric types being useful at earth scale.

> What kind of (needless) complication are you referring to? The
> fuzziness seems to me essential for geometric comparisons to work
> practically. In addition, I don't think we are allowed to change
> the behavior in such an area of released versions time after time.

Even when it is a total mess?

> I don't think index scans and tolerant comparison are
> contradictory. Could you give me a description of the
> indexing capabilities and the inconsistencies?

The first problem is that some operators are not using the epsilon. This requires special treatment while developing index support for operators. I have tried to support point for BRIN using the box operators, and failed because of that.

Comparing differences with epsilon is not applicable the same way to every operator. Even with simple operators like "point in box", it covers different distances outside the box depending on where the point is. For example, "point <-> box < EPSILON" wouldn't be equivalent to "point <@ box" when the point is outside the corner of the box. Things get more complicated with lines. Because of these, we easily violate basic expectations of the operators:

> regression=# select '{1000,0.000001,0}'::line ?|| '{90000,0.00009,0}'::line;
>  ?column?
> ----------
>  f
> (1 row)
>
> regression=# select '{90000,0.00009,0}'::line ?|| '{1000,0.000001,0}'::line;
>  ?column?
> ----------
>  t
> (1 row)

Another problem is the lack of hash and btree operator classes. In my experience, the point datatype is by far the most used one. People often try to use it in DISTINCT, GROUP BY, or ORDER BY clauses and complain when it doesn't work. There are many complaints like this in the archives. If we get rid of the epsilon, we can easily add those operator classes.

[1] https://www.postgresql.org/message-id/CACowWR0DBEjCfBscKKumdRLJUkObjB7D%3Diw7-0_ZwSFJM9_gpw%40mail.gmail.com
Hello,

At Mon, 14 Nov 2016 11:41:52 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzyjvRbj4bQZRp-yBo73-77-LDH0WnwPtAtZRFKPxYQRgw@mail.gmail.com>
> > What way of dealing with it is in your mind? The problem hides
> > behind the operators. To fix it, a user should rewrite an expression
> > using more primitive operators. For example, (line_a # point_a)
> > should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in
> > a more primitive way. I regard this as the operator # just
> > becoming useless.
>
> Simple equations like this work well with the current algorithm:
>
> > select '(0.1,0.1)'::point @ '(0,0),(1,1)'::line;

That's too simple for this discussion. If that satisfied someone's requirement, a geometric type based on integers would do. Any trigonometric calculation (by anything other than right angles) would break it.

> The operator does what you expect from it. Users can use something
> like you described to get fuzzy behaviour with an epsilon they choose.
>
> > Regarding optimization, at least gcc generates seemingly
> > not-so-different code for the two. Both generally generate inline
> > code directly calling isnan() and so on. Have you measured the
> > performance of the two implementations (with -O2, without
> > --enable-cassert)? This kind of hand-optimization gains
> > legitimacy when we see a significant difference, according to
> > the convention here.. I suppose.
>
> I tested it with this program:
>
> > int main()
> > {
> >     double i, j;
> >     int result = 0;
> >
> >     for (i = 0.1; i < 10000.0; i += 1.0)
> >         for (j = 0.1; j < 10000.0; j += 1.0)
> >             if (float8_lt(i, j))
> >                 result = (result + 1) % 10;
> >
> >     return result;
> > }
>
> The one calling cmp() was noticeably slower.
> ./test1 0.74s user 0.00s system 99% cpu 0.748 total
> ./test2 0.89s user 0.00s system 99% cpu 0.897 total
>
> This would probably not be very noticeable when calling SQL functions
> which do only a few comparisons, but it may be necessary to do
> many more comparisons in some places. I don't find the optimised
> versions less clear than calling cmp(). I can change it the other
> way, if you find that clearer.

Thanks for the measurement. I had similar numbers (14:13) for code separate from Pg, but this difference is probably buried under other steps. The keep-it-simple principle says that this kind of optimization should be considered only after it is found to be problematic, I think.

> > At least the comment you dropped by the patch,
> >
> > int
> > float4_cmp_internal(float4 a, float4 b)
> > {
> > -   /*
> > -    * We consider all NANs to be equal and larger than any non-NAN. This is
> > -    * somewhat arbitrary; the important thing is to have a consistent sort
> > -    * order.
> > -    */
> >
> > seems very significant and should be kept anywhere relevant.
>
> I will add it back in the next version.
>
> > I searched the pgsql-general ML but couldn't find a complaint about
> > the current behavior. Even though I'm not familiar with PostGIS, I
> > found that it uses exactly the same EPSILON method as
> > PostgreSQL.
>
> Is it? I understood from Paul Ramsey's comment on this thread [1]
> that they don't.

I looked through the PostGIS repository myself, but found it only in box2d.c. Other objects have their own EPSILONs: EPSILON_SQLMM, DBL_EPSILON, FLT_EPSILON... Maybe that's all.

| * Copyright (C) 2008-2011 Paul Ramsey <pramsey@cleverelephant.ca>
| * Copyright (C) 2008 Mark Cave-Ayland <mark.cave-ayland@siriusit.co.uk>
|
| #ifndef EPSILON
| #define EPSILON 1.0E-06
| #endif
| #ifndef FPeq
| #define FPeq(A,B) (fabs((A) - (B)) <= EPSILON)
| #endif

I agree that we should have our own tolerance method, and it might be an explicit one. But in any case we need some kind of tolerance when comparing FP numbers.
Removing such tolerance as preparation for implementing another one is quite natural, but a released version should have some kind of tolerance.

> > If we had an apparent plan to use them for other than
> > earth-scale(?) geometric usage, we could design what they should
> > be. But without such a plan it is just a breakage of the current
> > usage.
>
> We give no promises about the geometric types being useful at earth scale.

So we shouldn't just remove it.

> > What kind of (needless) complication are you referring to? The
> > fuzziness seems to me essential for geometric comparisons to work
> > practically. In addition, I don't think we are allowed to change
> > the behavior in such an area of released versions time after time.
>
> Even when it is a total mess?

Yes. The seeming 'total mess' is intended for use with lon/lat values and probably gave useful results at the time of development. The mess is still better than something ruined.

> > I don't think index scans and tolerant comparison are
> > contradictory. Could you give me a description of the
> > indexing capabilities and the inconsistencies?
>
> The first problem is that some operators are not using the epsilon.
> This requires special treatment while developing index support for
> operators. I have tried to support point for BRIN using the box
> operators, and failed because of that.
>
> Comparing differences with epsilon is not applicable the same way to
> every operator. Even with simple operators like "point in box", it
> covers different distances outside the box depending on where the
> point is. For example, "point <-> box < EPSILON" wouldn't be
> equivalent to "point <@ box" when the point is outside the corner of
> the box. Things get more complicated with lines. Because of these,
> we easily violate basic expectations of the operators:

To keep that kind of integrity, we should think deeply about errors.
> > regression=# select '{1000,0.000001,0}'::line ?|| '{90000,0.00009,0}'::line;
> >  ?column?
> > ----------
> >  f
> > (1 row)
> >
> > regression=# select '{90000,0.00009,0}'::line ?|| '{1000,0.000001,0}'::line;
> >  ?column?
> > ----------
> >  t
> > (1 row)

This kind of thing also appears with exact comparison, in a more uncertain way. So we have the EPSILON so that the errors would be reasonable at least for lon/lat geometrics. The values above are far out of the scope of the EPSILON from the start, but exact comparison would be just unworkable.

> Another problem is the lack of hash and btree operator classes. In my
> experience, the point datatype is by far the most used one. People often
> try to use it in DISTINCT, GROUP BY, or ORDER BY clauses and complain
> when it doesn't work. There are many complaints like this in the
> archives. If we get rid of the epsilon, we can easily add those
> operator classes.

That's the fate of FP numbers. Btree is usable for such data since it is constructed using inequality operators, but hash is almost unsuitable without rounding and normalization because of the nature of floating point numbers. Range scans on btree will work fine, but on the other hand equality doesn't work even on a btree index, for the same reason as below. DISTINCT and GROUP BY are just impossible without somehow defining the equality of two points. Bare FP numbers don't fit, since generally they cannot match each other and should instead match based on some resolution, say 0.000001, 0.1, 1, or 1000, according to the objective. I think those who complain probably have different resolutions in their minds. Again, FP numbers generally cannot match each other exactly. You have to avoid that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
> To keep that kind of integrity, we should think deeply about errors.

My point is that I don't think we can keep integrity together with the fuzzy behaviour, or at least I don't have the skills to do that. I can just leave the more complicated operators like "is a point on a line" as they are, and only change the basic ones. Do you think a smaller patch like this would be acceptable?

> That's the fate of FP numbers. Btree is usable for such data since
> it is constructed using inequality operators, but hash is almost
> unsuitable without rounding and normalization because of the
> nature of floating point numbers. Range scans on btree will work
> fine, but on the other hand equality doesn't work even on a btree
> index, for the same reason as below.

Btree is very useful with floats. There is no problem with it.

> Again, FP numbers generally cannot match each other exactly. You
> have to avoid that.

Again, they do match very well. Floating point numbers are no magic. They are perfectly deterministic.
Hello,

# Sorry for the trash I emitted before..

Perhaps you don't assume that any calculations are applied to stored geo-type values. Please be patient in discussing this with me. (Sorry for my perhaps hard-to-read English.)

At Fri, 18 Nov 2016 10:58:27 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzxVxKNS7qU74UdHVZTmfXQjxMbFiXH5+16XFy90SRAbXA@mail.gmail.com>
> > To keep that kind of integrity, we should think deeply about errors.
>
> My point is that I don't think we can keep integrity together with the
> fuzzy behaviour, or at least I don't have the skills to do that. I
> can just leave the more complicated operators like "is a
> point on a line" as they are, and only change the basic ones. Do you
> think a smaller patch like this would be acceptable?

The size of the patch is not the matter. That it breaks the current behavior is the matter. It shows strange behavior for some cases, but it is designed to work reasonably for others.

> > That's the fate of FP numbers. Btree is usable for such data since
> > it is constructed using inequality operators, but hash is almost
> > unsuitable without rounding and normalization because of the
> > nature of floating point numbers. Range scans on btree will work
> > fine, but on the other hand equality doesn't work even on a btree
> > index, for the same reason as below.
>
> Btree is very useful with floats. There is no problem with it.

I don't mind btrees on bare floats. They will 'work' if error is properly considered, or if error can be ignored for the purpose. Point has some kind of physical context or expected behavior beyond that of bare floats, which is what the EPSILON implies.

> > Again, FP numbers generally cannot match each other exactly. You
> > have to avoid that.
>
> Again, they do match very well. Floating point numbers are no magic.
> They are perfectly deterministic.

No magic, yes, so they are always accompanied by error. If you store points, then search for *any of them*, it works fine. It's no magic.
If you obtained a point from any physical data source, say a GPS receiver, then searched for identical points in a table, it would be almost impossible to find "exactly the same" location. It's no magic. Once a calculation has been applied to them, it is no longer guaranteed to work the same way. It's also no magic.

CREATE TABLE p (i int, x float8, y float8);

CREATE OR REPLACE FUNCTION rotate_point (integer, float8) RETURNS void AS $$
  UPDATE p SET x = x * cos($2) - y * sin($2),
               y = x * sin($2) + y * cos($2)
  WHERE i = $1
$$ LANGUAGE SQL;

INSERT INTO p VALUES (0, 1.0, 1.0), (1, 1.0, 1.0);
SELECT rotate_point(0, pi() * 2.0 / 3.0);
SELECT rotate_point(0, pi() * 2.0 / 3.0);
SELECT rotate_point(0, pi() * 2.0 / 3.0);

SELECT * FROM p WHERE i = 0;
 i | x |         y
---+---+-------------------
 0 | 1 | 0.999999999999999

So:

SELECT p0.x = p1.x AND p0.y = p1.y FROM p p0, p p1 WHERE p0.i = 0 and p1.i = 1;
 ?column?
----------
 f
(1 row)

This just repeats a 1/3 rotation 3 times. Ideally p0 and p1 are identical, but this primitive operation generates a visible error that makes the comparison fail, even though the two points are expected to be identical from a geometric viewpoint. This is a part of the meaning of the EPSILON.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From: Kyotaro HORIGUCHI
Hello,

At Fri, 18 Nov 2016 10:58:27 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzxVxKNS7qU74UdHVZTmfXQjxMbFiXH5+16XFy90SRAbXA@mail.gmail.com>

> > To keep such kind of integrity, we should deeply consider about
> > errors.
>
> My point is that I don't think we can keep integrity together with the
> fuzzy behaviour, or at least I don't have the skills to do that. I
> can just leave the more complicated operators like "is a
> point on a line" as it is, and only change the basic ones. Do you
> think a smaller patch like this would be acceptable?

The size of the patch is not a problem. I regret that I haven't made your requirement clear. So, as the starting point of the new discussion, I briefly summarize the current implementation of geometric comparisons.

- Floating point comparisons for geometric types

Comparison related to geometric types is performed by the FPeq macro and similar ones defined in geo_decls.h. This intends to give tolerance to the comparisons.

        A
FPeq: |<=e-|-e=>| (<= means inclusive, e = epsilon = tolerance)
FPne: ->| e | e |<- (<- means exclusive)
FPlt: | e |<-
FPle: |<=e |
FPgt: ->| e |
FPge: | e=>|

These seem reasonable, ignoring the tolerance amount issue.

- Consistency between index and non-index scans.

GiST supports geometric types.

=# create table tpnt1(id int, p point);
=# insert into tpnt1 (select i + 200, point(i * 1.0e-6 / 100.0, i * 1.0e-6 / 100.0) from generate_series(-200, 200) as i);
=# create index on tpnt1 using gist (p);
=# set enable_seqscan to false;
=# set enable_bitmapscan to true;
=# select count(*) from tpnt1 where p ~= point(0, 0);
 201
=# select count(*) from tpnt1 where p << point(0, 0);
 100
=# set enable_seqscan to true;
=# set enable_bitmapscan to false;
=# select count(*) from tpnt1 where p ~= point(0, 0);
 201
=# select count(*) from tpnt1 where p << point(0, 0);
 100

At least for the point type, (bitmap) index scan is consistent with sequential scan.
I remember that the issue was "inconsistency between indexed and non-indexed scans over geometric types", but they seem consistent with each other.

You mentioned btree, which doesn't have a predefined opclass for geometric types. So the "inconsistency" may refer to the difference between geometric values and combinations of plain (first-class) values. But the two are different things, and they apparently use different operators (~= and = for equality), so that issue doesn't fit here. More specifically, "p ~= p0" and "x = x0 and y = y0" are completely different.

Could you let me (or any other guys on this ml) have a more precise description of the problem and/or what you want to do with this patch?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From: Emre Hasegeli
> - Floating point comparisons for geometric types
>
> Comparison related to geometric types is performed by the FPeq
> macro and similar ones defined in geo_decls.h. This intends to give
> tolerance to the comparisons.
>
>         A
> FPeq: |<=e-|-e=>| (<= means inclusive, e = epsilon = tolerance)
> FPne: ->| e | e |<- (<- means exclusive)
> FPlt: | e |<-
> FPle: |<=e |
> FPgt: ->| e |
> FPge: | e=>|
>
> These seem reasonable, ignoring the tolerance amount issue.

I tried to explain below why I don't find this method reasonable.

> At least for the point type, (bitmap) index scan is consistent
> with sequential scan. I remember that the issue was
> "inconsistency between indexed and non-indexed scans over
> geometric types" but they seem consistent with each other.

You can see in the Git history that it was a lot of trouble to make the indexes consistent with the operators, and this is limiting improvements to the indexing of the geometric types.

> You mentioned btree, which doesn't have a predefined opclass for
> geometric types. So the "inconsistency" may be mentioning the
> difference between geometric values and combinations of plain
> (first-class) values. But the two are different things and
> apparently using different operators (~= and = for equality) so
> the issue is not fit for this. More specifically, "p ~= p0" and
> "x = x0 and y = y0" are completely different.

Yes, the default btree operator class doesn't have to implement the standard equality and basic comparison operator symbols. We can implement it with different symbols.

> Could you let me (or any other guys on this ml) have more precise
> description on the problem and/or what you want to do with this
> patch?

I will try to itemize the current problems with the epsilon method:

= Operator Inconsistencies

My current patches address all of them.

- Only some operators are using the epsilon.

I included the list of the ones which don't use the epsilon in the initial post of this thread.
This makes the operators incompatible with each other. For example, you cannot say "if box_a @> box_b and box_b @> point then box_a @> box_b".

- The application of epsilon is implementation dependent.

Epsilon works between two numbers. There is not always a single way of implementing geometric operators. This is surprising to the users. For example, you cannot say "if box_a @> box_b then box_a <-> box_b <= epsilon". This is also creating a high barrier to developing those operators. Fixing bugs and performance regressions is very likely to change the results.

- Some operators are just wrong:

Line operators are not well maintained. The following give wrong results when neither A nor B of a line is exactly 1:

* line # line
* line <-> line
* point ## line
* point ## lseg

They are broken independently of the epsilon, though it is not easy to fix them without changing the results of the cases that currently work.

- Some operators are violating the commutative property.

For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a".

- Some operators are violating the transitive property.

For example, you cannot say "if point_a ~= point_b and point_b ~= point_c then point_a ~= point_c".

- The operators with epsilon are not compatible with the float operators.

This is also surprising for the users. For example, you cannot say "if point_a << point_b then point_a[0] < point_b[0]".

- The operators are not checking for underflow or overflow.
- NaN values are not treated the same as in the float datatype.

Those are independent problems, though checking them with epsilon would create an additional set of inconsistencies. Epsilon creates especially more problems around 0.

= Missing Bits on the Operator Classes

I want to address those in the future, if my patches are accepted.

- There is no GiST or SP-GiST support for cross-datatype operators other than containment.

We can develop nice cross-datatype operator families to support more operators.
It would have been easier if we could rely on some operators to implement others. Geometric types serve the purpose of demonstrating our indexing capabilities. We should make them good examples, not full of special cases with hidden bugs.

- There is no BRIN support for types other than the box.

The BRIN inclusion operator class is datatype agnostic. It uses some operators to implement others, which doesn't work for anything more than box vs box operators, because of the inconsistencies.

- GROUP BY and DISTINCT are not working.
- ORDER BY is not working.

There are regular user complaints about this on the mailing list. We could easily provide default hash and btree operator classes that would fix the problem, if we didn't have the epsilon.
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From: Kyotaro HORIGUCHI
Hello.

(I apologize in advance for possible inaccurate wording on maths, which might cause confusion..)

At Wed, 11 Jan 2017 16:37:59 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzzNufOZqh4HO3Od8urzamNSeX-OXJxfNkRcU3ex9RD8jw@mail.gmail.com>

> > - Floating point comparisons for geometric types
> >
> > Comparison related to geometric types is performed by the FPeq
> > macro and similar ones defined in geo_decls.h. This intends to give
> > tolerance to the comparisons.
> >
> >         A
> > FPeq: |<=e-|-e=>| (<= means inclusive, e = epsilon = tolerance)
> > FPne: ->| e | e |<- (<- means exclusive)
> > FPlt: | e |<-
> > FPle: |<=e |
> > FPgt: ->| e |
> > FPge: | e=>|
> >
> > These seem reasonable, ignoring the tolerance amount issue.
>
> I tried to explain below why I don't find this method reasonable.

Thank you very much for the explanation.

> > At least for the point type, (bitmap) index scan is consistent
> > with sequential scan. I remember that the issue was
> > "inconsistency between indexed and non-indexed scans over
> > geometric types" but they seem consistent with each other.
>
> You can see in the Git history that it was a lot of trouble to make
> the indexes consistent with the operators, and this is limiting
> improving indexing of the geometric types.

Yeah. geo_ops.c has the following comment from its first version in the current repository.

| * Relational operators for Points.
| * Since we do have a sense of coordinates being
| * "equal" to a given accuracy (point_vert, point_horiz),

So, we take it as a requirement for geometric types. All the difficulties come from this "nature".

> > You mentioned btree, which doesn't have a predefined opclass for
> > geometric types. So the "inconsistency" may be mentioning the
> > difference between geometric values and combinations of plain
> > (first-class) values. But the two are different things and
> > apparently using different operators (~= and = for equality) so
> > the issue is not fit for this.
> > More specifically, "p ~= p0" and
> > "x = x0 and y = y0" are completely different.
>
> Yes, the default btree operator class doesn't have to implement
> standard equality and basic comparison operator symbols. We can
> implement it with different symbols.

Perhaps I don't understand it. Many opclasses are defined for btree. But since (ordinary) btree handles only one-dimensional, totally-orderable sets, geometric types, even point, don't fit it. Even if we relaxed it by removing EPSILON, multidimensional data still wouldn't fit.

Still, we can implement an equality mechanism *over* multiple btree indexes, but it cannot be an opclass for btree.

> > Could you let me (or any other guys on this ml) have more precise
> > description on the problem and/or what you want to do with this
> > patch?
>
> I will try to itemize the current problems with the epsilon method:
>
> = Operator Inconsistencies
>
> My current patches address all of them.

No. They just remove the tolerance that is a "requirement" for coordinates of geometric types. I think we don't have enough reason to remove it. We could newly define geometric types with no tolerance as "exact_point" or something, but they still wouldn't fit btree.

> - Only some operators are using the epsilon.
>
> I included the list of the ones which don't use the epsilon on
> initial post of this thread. This makes the operators not compatible
> with each other. For example, you cannot say "if box_a @> box_b and
> box_b @> point then box_a @> box_b".

(It must be "then box_a @> point".)
Neither of those operators seems to use EPSILON, so the transitivity holds, but perhaps we should change them to a more appropriate shape, that is, to use EPSILON. Even having the tolerance, we can define the operators to keep the transitivity.

> - The application of epsilon is implementation dependent.
>
> Epsilon works between two numbers. There is not always a single way
> of implementing geometric operators. This is surprising to the users.
> For example, you cannot say "if box_a @> box_b then box_a <-> box_b <=
> epsilon".

Maybe it should be like "if box_a ~= box_b && box_b @< box_a then ..". I'm not sure off-hand. But I don't see that the relationship is significant.

> This is also creating a high barrier on developing those operators.
> Fixing bugs and performance regressions are very likely to change the
> results.

I agree with you. The effect of the EPSILON is quite arbitrary, and it is difficult to see its chemistry.

> - Some operators are just wrong:
>
> Line operators are not well maintained. The following are giving
> wrong results when neither A nor B of a line is exactly 1:
>
> * line # line
> * line <-> line
> * point ## line
> * point ## lseg

These are not comparison operators, so EPSILON is unrelated. And I agree that they need fixing.

> They are broken independently from the epsilon, though it is not easy
> to fix them without changing the results of the cases that are
> currently working.

Although I'm not sure they are using EPSILON, I think they don't need it.

> - Some operators are violating commutative property.
>
> For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a".

Whether EPSILON is introduced or not, commutativity cannot be assured for it because of calculation error, I suppose.

> - Some operators are violating transitive property.
>
> For example, you cannot say "if point_a ~= point_b and point_b ~=
> point_c then point_a ~= point_c".

It holds only in the ideal (or mathematical) world, or for the likes of integers (which are a strictly implemented mathematical world on a computer). Without the EPSILON, two points hardly ever match, because of floating point error, as I mentioned. I don't have evidence though (sorry); mere equality among three floating point numbers is not assured.

Sorry, I have no more time today, I'll continue tomorrow.

> - The operators with epsilon are not compatible with float operators.
>
> This is also surprising for the users.
> For example, you cannot say
> "if point_a << point_b then point_a[0] < point_b[0]".
>
> - The operators are not checking for underflow or overflow.
> - NaN values are not treated same as the float datatype.
>
> Those are independent problems, though checking them with epsilon
> would create additional set of inconsistencies. Epsilon creates
> especially more problems around 0.
>
> = Missing Bits on the Operator Classes
>
> I want to address those in the future, if my patches are accepted.
>
> - There is no GiST or SP-GiST support for cross datatype operators
> other than containment.
>
> We can develop nice cross-datatype operator families to support more
> operators. It would have been easier, if we could rely on some
> operators to implement others. Geometric types serve the purpose of
> demonstrating our indexing capabilities. We should make them good
> examples, not full of special cases with hidden bugs.
>
> - There is no BRIN support for types other than the box.
>
> BRIN inclusion operator class is datatype agnostic. It uses some
> operators to implement others which doesn't work for anything more the
> box vs box operators, because of the inconsistencies.
>
> - GROUP BY and DISTINCT are not working.
> - ORDER BY is not working.
>
> There are regular user complaints about this on the mailing list. We
> can easily provide default hash and btree operator classes that would
> fix the problem, if we haven't had the epsilon.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From: Kyotaro HORIGUCHI
Sorry, this is the continuation of the previous comment.

At Wed, 18 Jan 2017 17:36:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170118.173612.13506164.horiguchi.kyotaro@lab.ntt.co.jp>

> Hello.
>
> (I apologize in advance for possible inaccurate wording on maths,
> which might cause confusion..)
>
> At Wed, 11 Jan 2017 16:37:59 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzzNufOZqh4HO3Od8urzamNSeX-OXJxfNkRcU3ex9RD8jw@mail.gmail.com>
> > > - Floating point comparisons for geometric types
> > >
> > > Comparison related to geometric types is performed by the FPeq
> > > macro and similar ones defined in geo_decls.h. This intends to give
> > > tolerance to the comparisons.
> > >
> > >         A
> > > FPeq: |<=e-|-e=>| (<= means inclusive, e = epsilon = tolerance)
> > > FPne: ->| e | e |<- (<- means exclusive)
> > > FPlt: | e |<-
> > > FPle: |<=e |
> > > FPgt: ->| e |
> > > FPge: | e=>|
> > >
> > > These seem reasonable, ignoring the tolerance amount issue.
> >
> > I tried to explain below why I don't find this method reasonable.
>
> Thank you very much for the explanation.
>
> > > At least for the point type, (bitmap) index scan is consistent
> > > with sequential scan. I remember that the issue was
> > > "inconsistency between indexed and non-indexed scans over
> > > geometric types" but they seem consistent with each other.
> >
> > You can see in the Git history that it was a lot of trouble to make
> > the indexes consistent with the operators, and this is limiting
> > improving indexing of the geometric types.
>
> Yeah. geo_ops.c has the following comment from its first version
> in the current repository.
>
> | * Relational operators for Points.
> | * Since we do have a sense of coordinates being
> | * "equal" to a given accuracy (point_vert, point_horiz),
>
> So, we take it as a requirement for geometric types. All the
> difficulties come from the "nature".
>
> > > You mentioned btree, which doesn't have a predefined opclass for
> > > geometric types. So the "inconsistency" may be mentioning the
> > > difference between geometric values and combinations of plain
> > > (first-class) values. But the two are different things and
> > > apparently using different operators (~= and = for equality) so
> > > the issue is not fit for this. More specifically, "p ~= p0" and
> > > "x = x0 and y = y0" are completely different.
> >
> > Yes, the default btree operator class doesn't have to implement
> > standard equality and basic comparison operator symbols. We can
> > implement it with different symbols.
>
> Perhaps I don't understand it. Many opclasses are defined for
> btree. But since (ordinary) btree handles only one-dimensional,
> totally-orderable sets, geometric types, even point, don't fit
> it. Even if we relaxed it by removing EPSILON, multidimensional
> data still doesn't fit.
>
> Still, we can implement an equality mechanism *over* multiple btree
> indexes, but it cannot be an opclass for btree.
>
> > > Could you let me (or any other guys on this ml) have more precise
> > > description on the problem and/or what you want to do with this
> > > patch?
> >
> > I will try to itemize the current problems with the epsilon method:
> >
> > = Operator Inconsistencies
> >
> > My current patches address all of them.
>
> No. They just remove the tolerance that is a "requirement" for
> coordinates of geometric types. I think we don't have enough
> reason to remove it. We could newly define geometric types with
> no tolerance as "exact_point" or something, but they still wouldn't
> fit btree.
>
> > - Only some operators are using the epsilon.
> >
> > I included the list of the ones which don't use the epsilon on
> > initial post of this thread. This makes the operators not compatible
> > with each other. For example, you cannot say "if box_a @> box_b and
> > box_b @> point then box_a @> box_b".
>
> (It must be "then box_a @> point".)
> Neither of those operators
> seems to use EPSILON, so the transitivity holds, but perhaps we
> should change them to a more appropriate shape, that is, to use
> EPSILON. Even having the tolerance, we can define the operators
> to keep the transitivity.
>
> > - The application of epsilon is implementation dependent.
> >
> > Epsilon works between two numbers. There is not always a single way
> > of implementing geometric operators. This is surprising to the users.
> > For example, you cannot say "if box_a @> box_b then box_a <-> box_b <=
> > epsilon".
>
> Maybe it should be like "if box_a ~= box_b && box_b @< box_a then
> ..". I'm not sure off-hand. But I don't see that the relationship is
> significant.
>
> > This is also creating a high barrier on developing those operators.
> > Fixing bugs and performance regressions are very likely to change the
> > results.
>
> I agree with you. The effect of the EPSILON is quite arbitrary, and
> it is difficult to see its chemistry.
>
> > - Some operators are just wrong:
> >
> > Line operators are not well maintained. The following are giving
> > wrong results when neither A nor B of a line is exactly 1:
> >
> > * line # line
> > * line <-> line
> > * point ## line
> > * point ## lseg
>
> These are not comparison operators, so EPSILON is unrelated. And I
> agree that they need fixing.
>
> > They are broken independently from the epsilon, though it is not easy
> > to fix them without changing the results of the cases that are
> > currently working.
>
> Although I'm not sure they are using EPSILON, I think they don't
> need it.
>
> > - Some operators are violating commutative property.
> >
> > For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a".
>
> Whether EPSILON is introduced or not, commutativity cannot be
> assured for it because of calculation error, I suppose.
>
> > - Some operators are violating transitive property.
> >
> > For example, you cannot say "if point_a ~= point_b and point_b ~=
> > point_c then point_a ~= point_c".
>
> It holds only in the ideal (or mathematical) world, or for the likes
> of integers (which are a strictly implemented mathematical world on
> a computer). Without the EPSILON, two points hardly ever match,
> because of floating point error, as I mentioned. I don't have
> evidence though (sorry); mere equality among three floating point
> numbers is not assured.
>
> Sorry, I have no more time today, I'll continue tomorrow.

> - The operators with epsilon are not compatible with float operators.
>
> This is also surprising for the users. For example, you cannot say
> "if point_a << point_b then point_a[0] < point_b[0]".

It is because "is strictly left of" just doesn't mean their x-coordinates are in such a relationship. The difference might be surprising, but it is a specification (truly not a bug:).

> - The operators are not checking for underflow or overflow.
> - NaN values are not treated same as the float datatype.
>
> Those are independent problems, though checking them with epsilon
> would create additional set of inconsistencies. Epsilon creates
> especially more problems around 0.

I don't understand this precisely, but it seems to be reasonable.

> = Missing Bits on the Operator Classes
>
> I want to address those in the future, if my patches are accepted.
>
> - There is no GiST or SP-GiST support for cross datatype operators
> other than containment.
>
> We can develop nice cross-datatype operator families to support more
> operators. It would have been easier, if we could rely on some
> operators to implement others. Geometric types serve the purpose of
> demonstrating our indexing capabilities. We should make them good
> examples, not full of special cases with hidden bugs.
>
> - There is no BRIN support for types other than the box.
>
> BRIN inclusion operator class is datatype agnostic.
> It uses some
> operators to implement others which doesn't work for anything more the
> box vs box operators, because of the inconsistencies.
>
> - GROUP BY and DISTINCT are not working.
> - ORDER BY is not working.
>
> There are regular user complaints about this on the mailing list. We
> can easily provide default hash and btree operator classes that would
> fix the problem, if we haven't had the epsilon.

This final section seems to mention the application, but sorry, it still doesn't describe for me what application this patch is aiming to enable. But I can understand that you want to handle floating point numbers like integers. The epsilon is surely quite annoying for that purpose.

Ok, returning to the basis of the current geometric types, it is described as below. It is not perfect (so it needs amendment), but it is achieved to a certain degree.

1. For geometric types, equality needs to be defined with some tolerance. Concretely, an absolute value of 1E-06, which is quite arbitrary but would be suitable to ignore error in longitude and latitude.

2. Translative comparisons should consider the tolerance to make a result that does not contradict the equality.

3. It is not sure about other comparisons such as parallel and perpendicular, but they are hardly usable without any amount of tolerance.

On the other hand, some people complain about the following.

A. The geometric comparisons are not in relationships that are analogously deducible (or expected) from scalar comparisons.

B. The tolerance seems to prevent us from using indexes.

As a presupposition, we have the common knowledge, or principle, about floating point handling on a digital computer:

X. We shouldn't detect equality of two floating point numbers just using '=' for real usage. It hardly works for arbitrary numbers, including the results of any arithmetic, so it should incorporate some amount of tolerance, which differs among applications.
Sorry for some refrains from the past discussion, but the above X and 1-3 are the immovable basis in this area. (I believe it. But any objections are welcome.)

You might want to object that float has an equality operator, but that is because it is an atomic type, which has no specific purpose. On the contrary, geometric types are application types, which assume a specific purpose, and they should behave reasonably for the expected purpose.

That being said, if we want to, for example, use a btree index for each coordinate of geometric types, it is doable by normalizing the values before applying operators, instead of applying epsilon arbitrarily in operators.

Normalization may be done either on storing or on-the-fly just before usage. It depends on whether we want to preserve the precision of stored values. If we want to, normalization should be done just before comparisons, or on creating indexes.

I don't know much about normalization methods for floating point numbers, but libgeos might give some hints. And still I don't have an idea how to detect parallelism and perpendicularity based on the normalization.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From: Emre Hasegeli
I am responding to both of your emails together.

> Perhaps I don't understand it. Many opclass are defined for
> btree. But since (ordinary) btree handles only one-dimentional,
> totally-orderable sets, geometric types even point don't fit
> it. Even if we relaxed it by removing EPSILON, multidimentional
> data still doesn't fit.

Yes, multidimensional data doesn't fit into btree. Let's say we would have to introduce operators called *<, *<=, *>, *>= to support a btree opclass for point. I agree those operators would be useless, but a btree opclass is used for other purposes too. "ORDER BY point", merge joins, and btree index support for ~= would be useful.

Lack of ORDER BY, GROUP BY, and DISTINCT support is the major inconvenience with the geometric types. There are many complaints about this on the mailing list. Many frameworks and ORMs are not prepared to deal with getting an error when they use certain types in those clauses.

>> - Only some operators are using the epsilon.
>>
>> I included the list of the ones which don't use the epsilon on
>> initial post of this thread. This makes the operators not compatible
>> with each other. For example, you cannot say "if box_a @> box_b and
>> box_b @> point then box_a @> box_b".
>
> (It must be "then box_a @> point".)

Yes, I am sorry.

> Neither of those operators
> seems to use EPSILON, so the transitivity holds, but perhaps we
> should change them to a more appropriate shape, that is, to use
> EPSILON. Even having the tolerance, we can define the operators
> to keep the transitivity.

Yes, that would be another way. In my view, this would make things worse, because I believe the epsilon approach is completely broken. The operators which are not using the epsilon are the only ones people can sanely use to develop applications.

>> > - The application of epsilon is implementation dependent.
>> >
>> > Epsilon works between two numbers. There is not always a single way
>> > of implementing geometric operators.
>> > This is surprising to the users.
>> > For example, you cannot say "if box_a @> box_b then box_a <-> box_b <=
>> > epsilon".
>>
>> Maybe it should be like "if box_a ~= box_b && box_b @< box_a then
>> ..". I'm not sure off-hand. But I don't see the relationship is
>> significant.

What I meant was the behaviour being unclear even to people who know about the epsilon. If two boxes overlap each other, one who knows about EPSILON can expect the distance between them to be less than EPSILON, but this wouldn't be true.

>> - Some operators are violating commutative property.
>>
>> For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a".
>
> Whether EPSILON is introduced or not, commutativity cannot be
> assured for it because of calculation error, I suppose.

It can easily be assured by treating both sides of the operator the same. It is actually assured in my patch.

>> - Some operators are violating transitive property.
>>
>> For example, you cannot say "if point_a ~= point_b and point_b ~=
>> point_c then point_a ~= point_c".
>
> It holds only in the ideal (or mathematical) world, or for the likes
> of integers (which are a strictly implemented mathematical world on
> a computer). Without the EPSILON, two points hardly ever match,
> because of floating point error, as I mentioned. I don't have
> evidence though (sorry); mere equality among three floating point
> numbers is not assured.

Of course it is assured. Floating point numbers are deterministic.

>> - The operators with epsilon are not compatible with float operators.
>>
>> This is also surprising for the users. For example, you cannot say
>> "if point_a << point_b then point_a[0] < point_b[0]".
>
> It is because "is strictly left of" just doesn't mean their
> x-coordinates are in such a relationship. The difference might be
> surprising but is a specification (truly not a bug:).

Then what does it mean? Every operator with EPSILON is currently surprising in a different way.
People use databases to build applications. They often need to implement the same operations on the application side. It is very hard to be bug-compatible with our geometric types.

>> = Missing Bits on the Operator Classes

> This final section seems to mention the application but sorry, it
> still doesn't describe for me what application this patch is
> aiming to enable. But I can understand that you want to handle
> floating point numbers like integers. The epsilon is surely quite
> annoying for the purpose.

I will try to itemize what applications I am aiming to enable:

- Applications with very little GIS requirement

PostGIS can be too much work for an application in a different business domain that has only a small GIS requirement. The built-in geometric types are incomplete for supporting real world geography. Getting rid of the epsilon would make this worse. On the contrary, it would allow people to deal with the difficulties on their own.

- Applications with geometric objects

I believe people could make use of the geometric types in many different business areas, if they were in better shape. I am working for a gaming company. There is great overlap between the logic of many games and the geometric types. I could support using them more, if they weren't so buggy and impractical.

- Demonstrating indexes

We require indexing features to be committed with an example. Geometric types have served this purpose many times. Many operators were actually added to demonstrate indexes. These examples are also used by other projects like PostGIS. They can re-use part of the implementations, or the implementation as a starting point. I believe we should make those examples as clean as possible.

> Ok, returning to the basis of the current geometric types, it is
> described as below. It is not perfect (so needs amendment) but
> achieved to a certain degree.
>
> 1. For geometric types, equality needs to be defined with some
> tolerance.
> Concretely an absolute value of 1E-06, which is
> quite random but would be suitable to ignore error of
> longitude and latitude.

I don't agree with this requirement as it is put. Returning true for something like "point(0.000001, 0.000001) ~= point(0.000002, 0.000002)" is obviously wrong.

> 2. Translative comparisons should consider the tolerance to make
> a result that does not contradict the equality.

The current implementation fails to provide it:

> hasegeli=# select point(0.000001, 0.000001) ~= point(0.000002, 0.000002);
>  ?column?
> ----------
>  t
> (1 row)
>
> hasegeli=# select point(0.000001, 0.000001) << point(0.0000021, 0.0000021);
>  ?column?
> ----------
>  t
> (1 row)
>
> hasegeli=# select point(0.000002, 0.000002) << point(0.0000021, 0.0000021);
>  ?column?
> ----------
>  f
> (1 row)

> 3. It is not sure about other comparisons such as parallel and
> perpendicular, but they are hardly usable without any amount of
> tolerance.

I agree. I don't think they are usable with our tolerance either.

> On the other hand, some people complain about the following.

I believe I have listed enough complaints in my previous post.

> As a presupposition, we have the common knowledge, or principle,
> about floating point handling on a digital computer.
>
> X. We shouldn't detect equality of two floating point numbers just
> using '=' for real usage. It hardly works for arbitrary
> numbers including results of any arithmetic, so it should
> incorporate some amount of tolerance that differs among
> applications.
>
> Sorry for some refrains from the past discussion, but the above X
> and 1-3 are the immovable basis in this area. (I believe it. But
> any objections are welcome.)
>
> You might want to object that float has an equality operator, but
> it is because it is an atomic type, which has no specific
> purpose. On the contrary geometric types are application types,
> which assume a specific purpose and should behave
> reasonably against the expected purpose.

Are they really?
We have really simple geometric types which are not for any specific kind of application. We are not claiming that they are for earth scale or any other scale. I think we should aim at making them predictable, extensible, and as general purpose as possible to let people use them for their purposes. > That being said, if we want to, for example, use btree index for > each coordinate of geometric types, it is doable by normalizing > the values before applying operators, instead of applying epsilon > arbitrarily in operators. > > Normalization may be done either on storing or on-the-fly just > before usage. It depends on whether we want to preserve precision > of stored values. If we want, normalization should be done just > before comparisons, or on creating indexes. > > I don't know much about the normalization method for floating > point numbers but libgeos might give some hints. Me neither. I have been told that storing the minimum and maximum possible value is a good strategy for tolerance. Then all operators can return true, false, or maybe. This would be a much bigger and incompatible change that I don't think I can implement. We cannot come to an agreement. Geometric types are not well maintained. I am trying my best to put them into shape. Please help me do that. Let's try to find a way to leave them better than we found them.
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Kyotaro HORIGUCHI
Date:
Hello, Emre. At Wed, 25 Jan 2017 12:24:18 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzzLWCdYDPLi_UQXLSJ+pqc7g4WGNHsb8tHkmosKK+rO7g@mail.gmail.com> > I am responding to both of your emails together. > > > Perhaps I don't understand it. Many opclasses are defined for > > btree. But since (ordinary) btree handles only one-dimensional, > > totally-orderable sets, geometric types, even point, don't fit > > it. Even if we relaxed it by removing EPSILON, multidimensional > > data still doesn't fit. > > Yes, multidimensional data doesn't fit into btree. Let's say we would > have to introduce operators called *<, *<=, *>, *>= to support a btree > opclass for point. I agree those operators would be useless, but > a btree opclass is used for other purposes too. "ORDER BY point", > merge-joins, btree index support for ~= would be useful. Btree cannot be used unless some kind of total ordering is defined, but a hash index would be usable. > Lack of ORDER BY, GROUP BY, and DISTINCT support is the major > inconvenience about the geometric types. There are many complaints > about this on the mailing list. Many frameworks and ORMs are not > prepared to deal with getting an error when they use certain types in > those clauses. I'm not sure, but I don't see a "natural" (or agreeable-by-many-people) ordering of geometric types in general. Anyway it's quite application (not application program but the relationship with the real world) specific. > >> - Only some operators are using the epsilon. > >> > >> I included the list of the ones which don't use the epsilon on the > >> initial post of this thread. This makes the operators not compatible > >> with each other. For example, you cannot say "if box_a @> box_b and > >> box_b @> point then box_a @> box_b". > > > > (It must be "then box_a @> point".) > > Yes, I am sorry. 
> > > Both of the operators don't > > seem to use EPSILON so the transitivity holds, but perhaps we > > should change them to a more appropriate shape, that is, to use > > EPSILON. Even having the tolerance, we can define the operators > > to keep the transitivity. > > Yes, that would be another way. In my view, this would make things > worse, because I believe the epsilon approach is completely broken. The > operators which are not using the epsilon are the only ones people can > sanely use to develop applications. What we should not forget is that PostGIS does the same thing and it is widely used (I believe..). This means it is not broken, at least in a certain context. But it is a fact that it is also quite inconvenient to get performance from, say, indexes. > >> > - The application of epsilon is implementation dependent. > >> > > >> > Epsilon works between two numbers. There is not always a single way > >> > of implementing geometric operators. This is surprising to the users. > >> > For example, you cannot say "if box_a @> box_b then box_a <-> box_b <= > >> > epsilon". > >> > >> Maybe it should be like "if box_a ~= box_b && box_b @< box_a then > >> ..". I'm not sure off-hand. But I don't see the relationship is > >> significant. > > What I meant was the behaviour being unclear even to people who know > about the epsilon. If two boxes are overlapping with each other, one > who knows about EPSILON can expect the distance between them to be > less than EPSILON, but this wouldn't be true. Yeah, the EPSILON is merely a fuzz factor, so we cannot expect any kind of stable behavior for differences under that level. But on the analogy of comparisons of floating point numbers, I suppose that inequality comparison could be done without the tolerance. > >> - Some operators are violating the commutative property. > >> > >> For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a". 
> > > > Whether EPSILON is introduced or not, commutativity cannot be > > assured for it because of calculation error, I suppose. > > It can easily be assured by treating both sides of the operator the > same. It is actually assured in my patch. It surely holds for certain cases. But however many applicable cases we guess, we finally cannot prove that it works generally. Just three 1/3 rotations break it. > >> - Some operators are violating the transitive property. > >> > >> For example, you cannot say "if point_a ~= point_b and point_b ~= > >> point_c then point_a ~= point_c". > > > > It holds only in the ideal (or mathematical) world, or for similars > > of integers (which strictly implement the mathematical world on a > > computer). Without the EPSILON, two points hardly match because of > > floating point error, as I mentioned. I don't have evidence > > though (sorry); mere equality among three floating point numbers > > is not assured. > > Of course, it is assured. Floating point numbers are deterministic. Hmm, I have nothing more to say if you don't agree that floating point numbers involving any kind of arithmetic are hardly deterministic, especially without defining their usage. > >> - The operators with epsilon are not compatible with the float operators. > >> > >> This is also surprising for the users. For example, you cannot say > >> "if point_a << point_b then point_a[0] < point_b[0]". > > > > It is because "is strictly left of" just doesn't mean their > > x-coordinates are in such a relationship. The difference might be > > surprising but is a specification (truly not a bug:). > > Then what does that mean? Every operator with EPSILON is currently > surprising in a different way. People use databases to build > applications. They often need to implement the same operations on the > application side. It is very hard to be bug-compatible with our > geometric types. The world of the geometric types in PostgreSQL *is* built so. 
It is no different from the fact that a Windows client can produce different results from a PostgreSQL server on Linux for simple floating point arithmetic, or that even different binaries made by different versions of compilers on the same platform can. Relying on such coherency by accident is a bad policy. > >> = Missing Bits on the Operator Classes > > > This final section seems to mention the application but sorry, it > > still doesn't describe for me what application this patch is > > aiming to enable. But I can understand that you want to handle > > floating point numbers like integers. The epsilon is surely quite > > annoying for the purpose. > > I will try to itemise what applications I am aiming to enable: > > - Applications with very little GIS requirement > > PostGIS can be too much work for an application in a different > business domain that has only a small GIS requirement. The built-in > geometric types are incomplete for supporting real world geography. > Getting rid of the epsilon would make this worse. On the contrary, it would > allow people to deal with the difficulties on their own. > > - Applications with geometric objects > > I believe people could make use of the geometric types in many different > business areas, if they were in better shape. I am working for a > gaming company. There is great overlap between the logic of many > games and the geometric types. I could support using them more, if > they weren't so buggy and impractical. > > - Demonstrating indexes > > We require indexing features to be committed with an example. > Geometric types have served this purpose many times. Many operators > were actually added to demonstrate indexes. These examples are also > used by other projects like PostGIS. They can re-use parts of the > implementations, or use the implementation as a starting point. I believe > we should make those examples as clean as possible. > > > Ok, returning to the basis of the current geometric types, it is > > described as below. 
It is not perfect (so needs amendment) but > > achieved to a certain degree. > > > > 1. For geometric types, equality needs to be defined with some > > tolerance. Concretely an absolute value of 1E-06, which is > > quite random but would be suitable to ignore error of > > longitude and latitude. > > I don't agree with this requirement as it is put. Returning true to > something like "point(0.000001, 0.000001) ~= point(0.000002, > 0.000002)" is obviously wrong. No. It can be right. For example, it is right for an application where the precision of data is 0.1 or 1E-6. But it is wrong if the tolerance designed for its purpose is 1E-7. > > 2. Translative comparisons should consider the tolerance to make > > a result that does not contradict the equality. > > The current implementation fails to provide it: > > > hasegeli=# select point(0.000001, 0.000001) ~= point(0.000002, 0.000002); > > ?column? > > ---------- > > t > > (1 row) > > > > hasegeli=# select point(0.000001, 0.000001) << point(0.0000021, 0.0000021); > > ?column? > > ---------- > > t > > (1 row) > > > > hasegeli=# select point(0.000002, 0.000002) << point(0.0000021, 0.0000021); > > ?column? > > ---------- > > f > > (1 row) So this example without any concrete application is irrelevant. > > 3. It is not sure about other comparisons such as parallel and > > perpendicular, but it is hardly usable without any amount of > > tolerance. > > I agree. I don't think they are usable with our tolerance either. > > > On the other hand, some people complain about the following. > > I believe I have listed enough complaints on my previous post. > > > As a presupposition, we have the common knowledge, or principle, > > about floating point handling on a digital computer. > > > > X. We shouldn't detect equality of two floating point numbers just > > using '=' for real usage. 
It hardly works for arbitrary > > numbers including results of any arithmetic, so it should > > incorporate some amount of tolerance that differs among > > applications. > > > > Sorry for repeating some of the past discussion, but the above X > > and 1-3 are the immovable basis in this area. (I believe it. But > > any objections are welcome) > > > > You might want to object that float has an equality operator, but > > it is because it is an atomic type, which has no specific > > purpose. On the contrary, geometric types are application types, > > which assume a specific purpose, and they should behave > > reasonably against the expected purpose. > > Are they really? We have really simple geometric types which are not PostGIS or libgeos seems to prove it. They are designed exactly for this purpose and actually used. > for any specific kind of application. We are not claiming that they > are for earth scale or any other scale. I think we should aim at > making them predictable, extensible, and as general purpose as possible to > let people use them for their purposes. If a settable tolerance, preferably with a definable method, were introduced, they could be called generic geometric types. > > That being said, if we want to, for example, use btree index for > > each coordinate of geometric types, it is doable by normalizing > > the values before applying operators, instead of applying epsilon > > arbitrarily in operators. > > > > Normalization may be done either on storing or on-the-fly just > > before usage. It depends on whether we want to preserve precision > > of stored values. If we want, normalization should be done just > > before comparisons, or on creating indexes. > > > > I don't know much about the normalization method for floating > > point numbers but libgeos might give some hints. > > Me neither. I have been told that storing the minimum and maximum > possible value is a good strategy for tolerance. Then all operators > can return true, false, or maybe. 
This would be a much bigger and > incompatible change that I don't think I can implement. > > We cannot come to an agreement. Geometric types are not well > maintained. I am trying my best to put them into shape. Please > help me do that. Let's try to find a way to leave them better than we > found them. Although some operators have apparently wrong implementations and fixing them would be progress, it is not progress toward your objective if I don't agree to just removing the tolerance. # Sorry for the long detour.. So, the union of the two requirements seems to be having such a parameter as a GUC. Geometric types don't work without some amount of tolerance in many cases, and some people surely need the EPSILON, but the amount is quite arbitrary. On the other hand, some other people want no tolerance. This seems enough reason to make it variable. If there's no objection to the above, how about proceeding by the following steps? 1: Replace the fixed EPSILON with a GUC parameter 'geometric_tolerance' or such, with a default value of 1E-6. This should also have documentation. 2: Fix some currently broken operators to give sane results, considering geometric_tolerance. Once we get to this stage, setting geometric_tolerance to 0 would give the preferable behavior for you, and it is merely a concrete value for a specific application. No problem. After that, perhaps we may add some indexing method for geometric types not considering tolerance. Then someone may want to make them support geometric_tolerance. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Emre Hasegeli
Date:
> I'm not sure, but I don't see a "natural" (or > agreeable-by-many-people) ordering of geometric types in > general. Anyway it's quite application (not application program > but the relationship with the real world) specific. We can just define it for point as "ORDER BY point.x, point.y". > What we should not forget is that PostGIS does the same thing and > it is widely used (I believe..). This means it is not broken, at least > in a certain context. But it is a fact that it is also quite > inconvenient to get performance from, say, indexes. I understand from Paul Ramsey's email [1] on this thread that PostGIS doesn't currently have a tolerance. > Yeah, the EPSILON is merely a fuzz factor, so we cannot expect any > kind of stable behavior for differences under that level. But on > the analogy of comparisons of floating point numbers, I suppose > that inequality comparison could be done without the tolerance. What do you mean exactly? >> >> - Some operators are violating the commutative property. >> >> >> >> For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a". >> > >> > Whether EPSILON is introduced or not, commutativity cannot be >> > assured for it because of calculation error, I suppose. >> >> It can easily be assured by treating both sides of the operator the >> same. It is actually assured in my patch. > > It surely holds for certain cases. But however many applicable cases > we guess, we finally cannot prove that it works generally. Just > three 1/3 rotations break it. It is a different story. We cannot talk about the commutative property of a rotation function that we currently don't have, because it would be an asymmetrical operator. The parallel operator is currently marked as commutative. The planner is free to switch the sides of the operation. Therefore this is not only a surprise, but a bug. 
> Hmm, I have nothing more to say if you don't agree that floating > point numbers involving any kind of arithmetic are hardly > deterministic, especially without defining their usage. The floating point results are not random. There are certain guarantees. The transitive property of equality is one of them. Our aim should be to make things easier for our users by providing more guarantees, not breaking what is already there. > The world of the geometric types in PostgreSQL *is* built > so. It is no different from the fact that a Windows client can produce > different results from a PostgreSQL server on Linux for simple > floating point arithmetic, or that even different binaries made by > different versions of compilers on the same platform can. Relying > on such coherency by accident is a bad policy. Yes, the results are not portable. We should only rely on the results being stable on the same build. The epsilon doesn't cure this problem. It arguably makes it worse. > PostGIS or libgeos seems to prove it. They are designed exactly > for this purpose and actually used. Yes, PostGIS is a GIS application. We are not. The geometric types' name suggests to me that they are useful for general purposes. > So, the union of the two requirements seems to be having such > a parameter as a GUC. That sounds doable to me. We can use this opportunity to make all operators consistent. So the epsilon would apply to the ones that it currently does not. We can still add btree and hash opclasses, and make them give an error when this GUC is not 0. We can even make this or another GUC apply to floats, making the whole system more consistent. Though, I know the community is against behaviour-changing GUCs. I will not spend more time on this before I get positive feedback from others. [1] https://www.postgresql.org/message-id/CACowWR0DBEjCfBscKKumdRLJUkObjB7D%3Diw7-0_ZwSFJM9_gpw%40mail.gmail.com
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Kyotaro HORIGUCHI
Date:
Hello, At Thu, 26 Jan 2017 11:53:28 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzwYLcx-3ffToVm7JEEniYt1fU31y5BAikXzEqvCbQyTMg@mail.gmail.com> > > I'm not sure, but I don't see a "natural" (or > > agreeable-by-many-people) ordering of geometric types in > > general. Anyway it's quite application (not application program > > but the relationship with the real world) specific. > > We can just define it for point as "ORDER BY point.x, point.y". It's nonsense. Totally for convenience. Anyone can do so in their own application, but PostgreSQL cannot have it as a platform. > > What we should not forget is that PostGIS does the same thing and > > it is widely used (I believe..). This means it is not broken, at least > > in a certain context. But it is a fact that it is also quite > > inconvenient to get performance from, say, indexes. > > I understand from Paul Ramsey's email [1] on this thread that PostGIS > doesn't currently have a tolerance. Thank you for the pointer. (My memory is as small as an 8-bit CPU's.) Looking into the source of PostGIS 2.3.0 (maybe the latest), surely EPSILON is used in some places. FPeq and similar macros are also defined. liblwgeom_internal.h even defines another tolerance, EPSILON_SQLMM.. Paul> The real answer for a GIS system is to have an explicit tolerance Paul> parameter for calculations like distance/touching/containment, but Paul> unfortunately we didn't do that so now we have our own Paul> compatibility/boil the ocean problem if we ever wanted/were funded to Paul> add one. This doesn't seem to say that PostGIS doesn't have a fixed-amount tolerance. Although we don't necessarily need compatibility with PostGIS, it makes the problem rather complex, since we lose an apparent starting point for this. It would be good if we made the geometric comparators take explicit tolerances as Paul said, but maybe that is overdone. So I proposed the variable implicit tolerance. 
> > Yeah, the EPSILON is merely a fuzz factor, so we cannot expect any > > kind of stable behavior for differences under that level. But on > > the analogy of comparisons of floating point numbers, I suppose > > that inequality comparison could be done without the tolerance. > > What do you mean exactly? Sorry for poor wording. I'll try a different way. It means that comparison of numbers that contain a certain amount of error gives unstable results for differences small enough compared to their precision. > >> >> - Some operators are violating the commutative property. > >> >> > >> >> For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a". > >> > > >> > Whether EPSILON is introduced or not, commutativity cannot be > >> > assured for it because of calculation error, I suppose. > >> > >> It can easily be assured by treating both sides of the operator the > >> same. It is actually assured in my patch. > > > > It surely holds for certain cases. But however many applicable cases > > we guess, we finally cannot prove that it works generally. Just > > three 1/3 rotations break it. > > It is a different story. We cannot talk about the commutative property of > a rotation function that we currently don't have, because it would be an > asymmetrical operator. That's wrong. Any shapes represented by the geometric types are assumed to get any geometric operations such as translation, rotation and others. That is fundamental for geometric types. > The parallel operator is currently marked as commutative. The planner > is free to switch the sides of the operation. Therefore this is not > only a surprise, but a bug. Strictly it is not commutative, but assuming FP error or EPSILON, commutation among them would be acceptable. Having a larger tolerance, the difference from the commutated expression becomes relatively small for the type's domain. As far as I know, even summation of several floating point numbers is not guaranteed to yield strictly the same result when the operations are reordered. 
But the difference doesn't matter in most cases, and programmers write their programs so that such differences don't matter for the objective. This is basically the same thing as the case of the geometric types. > > Hmm, I have nothing more to say if you don't agree that floating > > point numbers involving any kind of arithmetic are hardly > > deterministic, especially without defining their usage. > > The floating point results are not random. There are certain > guarantees. The transitive property of equality is one of them. Our > aim should be to make things easier for our users by providing more > guarantees, not breaking what is already there. Sorry for repeating the same thing, but floating point numbers after arithmetic operations must be considered to have fuzziness or error (the same can occur even just after assignment). They must not be handled as exact numbers. Please study the handling of floating point numbers. If such strictness is required and no arithmetic is involved, it is proper to store them, say, in a string form. If such a behavior is required but you want to use floating points, maybe it is easier to create an extension conforming to such a specification, rather than changing the core behavior. > > The world of the geometric types in PostgreSQL *is* built > > so. It is no different from the fact that a Windows client can produce > > different results from a PostgreSQL server on Linux for simple > > floating point arithmetic, or that even different binaries made by > > different versions of compilers on the same platform can. Relying > > on such coherency by accident is a bad policy. > > Yes, the results are not portable. We should only rely on the results > being stable on the same build. The epsilon doesn't cure this > problem. It arguably makes it worse. Yes, EPSILON doesn't improve such cross-platform consistency. It is totally a different issue. Such exact consistency should not be expected regardless of EPSILON. > > PostGIS or libgeos seems to prove it. 
They are designed exactly > > for this purpose and actually used. > > Yes, PostGIS is a GIS application. We are not. The geometric types' name > suggests to me that they are useful for general purposes. > > > So, the union of the two requirements seems to be having such > > a parameter as a GUC. > > That sounds doable to me. We can use this opportunity to make all > operators consistent. So the epsilon would apply to the ones that it > currently does not. We can still add btree and hash opclasses, and make > them give an error when this GUC is not 0. We can even make this or > another GUC apply to floats, making the whole system more consistent. Maybe. > Though, I know the community is against behaviour-changing GUCs. I > will not spend more time on this before I get positive feedback from > others. That's too bad. I'm sorry that I wasn't very helpful.. > [1] https://www.postgresql.org/message-id/CACowWR0DBEjCfBscKKumdRLJUkObjB7D%3Diw7-0_ZwSFJM9_gpw%40mail.gmail.com regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Michael Paquier
Date:
On Thu, Jan 26, 2017 at 9:11 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> Though, I know the community is against behaviour-changing GUCs. I >> will not spend more time on this before I get positive feedback from >> others. > > That's too bad. I'm sorry that I wasn't very helpful.. You are making the discussion constructive by providing feedback in the way that you think makes sense; there's nothing bad in that IMO, quite the contrary to be honest. (I am just showing up here to tell that I have moved this patch to CF 2017-03.) -- Michael
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Robert Haas
Date:
On Thu, Jan 26, 2017 at 5:53 AM, Emre Hasegeli <emre@hasegeli.com> wrote: > Though, I know the community is against behaviour-changing GUCs. I > will not spend more time on this before I get positive feedback from > others. As if on cue, let me say that a behavior-changing GUC sounds like a terrible idea to me. It's almost never good when a GUC can cause the same queries to return different answers in different sessions, and even worse, it seems like the GUC might have the effect of letting us build indexes that are only valid for the value of the GUC with which they were built. Backing up a bit here, have we lost track of the problem that we're trying to solve? Tom gave his opinion here: https://www.postgresql.org/message-id/3895.1464791274@sss.pgh.pa.us But I don't see that the latest patch I can find does anything to fix that. Now maybe that's not the problem that Emre is trying to solve, but then it is not very clear to me what problem he IS trying to solve. And I think Kyotaro Horiguchi is trying to solve yet another problem which is again different. So IMHO the first task here is to agree on a clear statement of what we'd like to fix, and then, given a patch, we can judge whether it's fixed. Maybe I'm being dumb here and it's clear to you guys what the issues under discussion are. If so, apologies for that, but the thread has gotten too theoretical for me and I can't figure out what the top-level concern is any more. I believe we all agree these macros are bad, but there seems to be no agreement that I can discern on what would be better or why. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Emre Hasegeli
Date:
> Backing up a bit here, have we lost track of the problem that we're > trying to solve? Tom gave his opinion here: > > https://www.postgresql.org/message-id/3895.1464791274@sss.pgh.pa.us > > But I don't see that the latest patch I can find does anything to fix > that. This is what he wrote: > As I understand it, the key problem is that tests like "is point on line" > would basically never succeed except in the most trivial cases, because of > roundoff error. That's not very nice, and it might cascade to larger > problems like object-containment tests failing unexpectedly. We would > need to go through all the geometric operations and figure out where that > kind of gotcha is significant and what we can do about it. Seems like a > fair amount of work :-(. If somebody's willing to do that kind of > investigation, then sure, but I don't think just blindly removing these > macros is going to lead to anything good. I re-implemented the "is point on line" operator in my patch so that it would, at least, work in the most trivial cases. This is not very nice, but it is predictable. I have tried to prevent it from cascading to larger problems like object-containment tests failing unexpectedly. I have gone through all of the geometric operations and re-implemented the ones with similar problems, too. It is no work at all compared to the discussions we have to have :-). My initial idea was to keep the fuzzy behaviour for some operators like "is point on line", but the more I get into it, the less value I see in doing so. The family of operators like "is point on line" is currently badly broken: > postgres=# select '(1,0)'::point ## '{1,2,0}'::line; > ?column? > ---------- > (2,2) > (1 row) This makes me wonder if anybody is ever using those operators. In the end, I don't care about those operators. They are unlikely to be supported by indexes. I can simplify my patch to leave them as they are. 
> Now maybe that's not the problem that Emre is trying to solve, > but then it is not very clear to me what problem he IS trying to > solve. And I think Kyotaro Horiguchi is trying to solve yet another > problem which is again different. So IMHO the first task here is to > agree on a clear statement of what we'd like to fix, and then, given a > patch, we can judge whether it's fixed. I listed the problems I am trying to solve here: https://www.postgresql.org/message-id/CAE2gYzzNufOZqh4HO3Od8urzamNSeX-OXJxfNkRcU3ex9RD8jw%40mail.gmail.com > Maybe I'm being dumb here and it's clear to you guys what the issues > under discussion are. If so, apologies for that, but the thread has > gotten too theoretical for me and I can't figure out what the > top-level concern is any more. I believe we all agree these macros > are bad, but there seems to be no agreement that I can discern on what > would be better or why. We couldn't agree on how we should treat floating point numbers. I think we should treat them just like the "float" datatype.
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Robert Haas
Date:
On Tue, Jan 31, 2017 at 1:06 PM, Emre Hasegeli <emre@hasegeli.com> wrote: > This is what he wrote: > >> As I understand it, the key problem is that tests like "is point on line" >> would basically never succeed except in the most trivial cases, because of >> roundoff error. That's not very nice, and it might cascade to larger >> problems like object-containment tests failing unexpectedly. We would >> need to go through all the geometric operations and figure out where that >> kind of gotcha is significant and what we can do about it. Seems like a >> fair amount of work :-(. If somebody's willing to do that kind of >> investigation, then sure, but I don't think just blindly removing these >> macros is going to lead to anything good. > > I re-implemented the "is point on line" operator in my patch so that it > would, at least, work in the most trivial cases. Right. So you and he do not agree on the goal. He wants it to work in MORE than the most trivial cases, and you wrote it so that it will work in AT LEAST the most trivial cases. Those are not the same. > I listed the problems I am trying to solve here: > > https://www.postgresql.org/message-id/CAE2gYzzNufOZqh4HO3Od8urzamNSeX-OXJxfNkRcU3ex9RD8jw%40mail.gmail.com Those problems boil down to "let's get rid of the tolerance completely", and there's no consensus on that being a good thing to do. In particular, I can't remember anybody but you being unequivocally in favor of it. > We couldn't agree on how we should treat floating point numbers. I > think we should treat them just like the "float" datatype. Got it, but if other people don't agree then this is going nowhere. Personally, I've got my doubts about how sane it is to make anything work here with the built-in floating point types. The CPU is going to do some kind of black magic, which may differ between machines, and good luck getting consistent semantics that work everywhere. 
For example, it seems entirely plausible (with some implementation) that you might find that (1, 2) is on the line from (0, 0) to (2, 4) but that (1, 3) is not on the line from (0, 0) to (2, 6), because the slope 1/2 can be represented exactly in base 2 and the slope 1/3 cannot. If you try to fix that kind of thing by introducing a tolerance, then you might get it wrong when the line goes from (0, 0) to (1000000000, 2000000001).

Now, if we represented values using home-grown arbitrary-precision arithmetic - like numeric does - then it might be possible to get such cases right, especially if you supported exact rather than approximate representations of fractional quantities. Otherwise, I think there are inevitably going to be artifacts, and tinkering with this is just changing which artifacts we get without really solving the problem.

But I am notorious for my low opinion of floating-point arithmetic; maybe somebody who understands it better or likes it more can come up with something that's more clearly an improvement. Regardless, we have to have an agreement in order to commit anything here, and I don't see one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
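[Editor's note: the kind of implementation-dependent artifact described above can be sketched in a few lines. This is a hypothetical illustration, not code from the patch; Python floats are IEEE 754 doubles, the same representation as PostgreSQL's float8. The two test functions below are invented names for the two obvious ways to write a point-on-line check.]

```python
def on_line_slope(px, py, ax, ay, bx, by):
    # Naive form: compare y against slope * x. The division can produce
    # a slope that is not representable in base 2, so exact equality fails.
    slope = (by - ay) / (bx - ax)
    return py == ay + slope * (px - ax)

def on_line_cross(px, py, ax, ay, bx, by):
    # Cross-product form: no division, so fewer rounding steps.
    return (bx - ax) * (py - ay) == (by - ay) * (px - ax)

# (1, 2) on the line from (0, 0) to (2, 4): slope 2 is exact, both forms agree.
print(on_line_slope(1, 2, 0, 0, 2, 4))   # True
print(on_line_cross(1, 2, 0, 0, 2, 4))   # True

# (3, 0.3) lies on the line from (0, 0) to (10, 1) mathematically, but the
# slope 1/10 is not representable in base 2, so the two forms disagree:
print(on_line_slope(3, 0.3, 0, 0, 10, 1))  # False: 0.1 * 3 != 0.3
print(on_line_cross(3, 0.3, 0, 0, 10, 1))  # True: 10 * 0.3 rounds to exactly 3.0
```

The same mathematical question gets a different answer depending on which algebraically equivalent formula the operator happens to use - which is the "changing which artifacts we get" point above.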
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
From
Emre Hasegeli
Date:
> Got it, but if other people don't agree then this is going nowhere.

Yes. As I wrote, I don't particularly care about functions like "is point on line". I can prepare a patch to fix as many problems as possible around those operators by preserving the current epsilon.

I thought we were arguing about *all* operators. Having containment and placement operators consistent with each other is the primary thing I am trying to fix. Is removing epsilon from them acceptable?

We can also stay away from changing operators like "~=" to minimise the compatibility break, if we keep the epsilon in some places. We can instead document this operator as "close enough", and introduce another symbol for a real "the same" operator.

That said, there are some places where it is hard to decide whether to apply the epsilon. For example, we can keep the epsilon to check for two lines being parallel, but then should we return the intersection point, or not? Those issues may become clearer when I start working on it, if preserving epsilon for those operators is the way to go forward.
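[Editor's note: both dilemmas raised above - why "~=" cannot honestly be documented as "the same", and what epsilon-parallel lines do to intersection - can be demonstrated directly from the FP* macros quoted at the top of the thread. A minimal sketch; `fp_eq` is the Python equivalent of the FPeq macro, and the line coefficients are invented for illustration.]

```python
EPSILON = 1e-6  # the same tolerance as the FP* macros in geo_decls.h

def fp_eq(a, b):
    # Python rendering of: #define FPeq(A,B) (fabs((A) - (B)) <= EPSILON)
    return abs(a - b) <= EPSILON

# 1) Fuzzy equality is not transitive, so "~=" is at best "close enough",
#    never a true equivalence relation:
a, b, c = 0.0, 0.9e-6, 1.8e-6
print(fp_eq(a, b), fp_eq(b, c), fp_eq(a, c))  # True True False

# 2) Two lines whose slopes compare equal under EPSILON are "parallel"
#    to the fuzzy comparison, yet they still cross at a definite point:
m1, m2, b2 = 1.0, 1.0 + 5e-7, 1.0   # y = m1*x  and  y = m2*x + b2
print(fp_eq(m1, m2))                # True: fuzzily parallel
x = b2 / (m1 - m2)                  # the exact arithmetic still intersects
print(x)                            # a large negative x, roughly -2e6
```

So an epsilon-based "parallel" test and an exact intersection function give contradictory answers about the same pair of lines, which is exactly the decision the paragraph above leaves open.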
[HACKERS] Re: Floating point comparison inconsistencies of the geometric types
From
David Steele
Date:
On 2/1/17 6:36 AM, Emre Hasegeli wrote:
>> Got it, but if other people don't agree then this is going nowhere.
>
> Yes. As I wrote, I don't particularly care about functions like "is
> point on line". I can prepare a patch to fix as many problems as
> possible around those operators by preserving the current epsilon.
>
> I thought we were arguing about *all* operators. Having containment
> and placement operators consistent with each other is the primary
> thing I am trying to fix. Is removing epsilon from them acceptable?
>
> We can also stay away from changing operators like "~=" to minimise
> the compatibility break, if we keep the epsilon in some places. We can
> instead document this operator as "close enough", and introduce
> another symbol for a real "the same" operator.
>
> That said, there are some places where it is hard to decide whether to
> apply the epsilon. For example, we can keep the epsilon to check for
> two lines being parallel, but then should we return the intersection
> point, or not? Those issues may become clearer when I start
> working on it, if preserving epsilon for those operators is the way to
> go forward.
The current patches do not apply cleanly at cccbdde:

$ git apply ../other/0001-float-header-v03.patch
error: patch failed: contrib/btree_gist/btree_ts.c:1
error: contrib/btree_gist/btree_ts.c: patch does not apply
error: patch failed: contrib/postgres_fdw/postgres_fdw.c:26
error: contrib/postgres_fdw/postgres_fdw.c: patch does not apply
error: patch failed: src/backend/access/gist/gistutil.c:14
error: src/backend/access/gist/gistutil.c: patch does not apply
error: patch failed: src/backend/utils/adt/float.c:339
error: src/backend/utils/adt/float.c: patch does not apply
error: patch failed: src/backend/utils/adt/geo_ops.c:14
error: src/backend/utils/adt/geo_ops.c: patch does not apply
error: patch failed: src/backend/utils/misc/guc.c:68
error: src/backend/utils/misc/guc.c: patch does not apply
error: patch failed: src/include/utils/builtins.h:334
error: src/include/utils/builtins.h: patch does not apply

I don't believe this patch should be in the "Needs review" state anyway. There are clearly a number of issues that need work and agreement.

Given that this thread has been idle since the beginning of February and no resolution is likely for v10, I'm marking this submission "Returned with Feedback".

-- 
-David
david@pgmasters.net