Re: Floating point comparison inconsistencies of the geometric types - Mailing list pgsql-hackers
From | Emre Hasegeli |
---|---|
Subject | Re: Floating point comparison inconsistencies of the geometric types |
Date | |
Msg-id | CAE2gYzyjvRbj4bQZRp-yBo73-77-LDH0WnwPtAtZRFKPxYQRgw@mail.gmail.com |
In response to | Re: Floating point comparison inconsistencies of the geometric types (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses | Re: Floating point comparison inconsistencies of the geometric types |
List | pgsql-hackers |
> What way to deal with it is in your mind? The problem hides
> behind operators. To fix it a user should rewrite a expression
> using more primitive operators. For example, (line_a # point_a)
> should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in
> more primitive way. I regarded this that the operator # just
> become useless.

Simple expressions like this work well with the current algorithm:

> select '(0.1,0.1)'::point @ '(0,0),(1,1)'::line;

The operator does what you expect from it. Users can use something
like you described to get fuzzy behaviour with an epsilon they choose.

> Regarding optimization, at least gcc generates seemingly not so
> different code for the two. The both generally generates extended
> code directly calling isnan() and so. Have you measured the
> performance of the two implement (with -O2, without
> --enable-cassert)? This kind of hand-optimization gets
> legitimacy when we see a significant difference, according to
> the convention here.. I suppose.

I tested it with this program:

> int main()
> {
>     double  i,
>             j;
>     int     result = 0;
>
>     for (i = 0.1; i < 10000.0; i += 1.0)
>         for (j = 0.1; j < 10000.0; j += 1.0)
>             if (float8_lt(i, j))
>                 result = (result + 1) % 10;
>
>     return result;
> }

The one calling cmp() was noticeably slower.

./test1  0.74s user 0.00s system 99% cpu 0.748 total
./test2  0.89s user 0.00s system 99% cpu 0.897 total

This would probably not be very noticeable when calling SQL functions
which do only a few comparisons, but it may be necessary to do many
more comparisons in some places. I don't find the optimised versions
less clear than calling cmp(). I can change it the other way, if
you find it clearer.

> At least the comment you dropped by the patch,
>
>  int
>  float4_cmp_internal(float4 a, float4 b)
>  {
> -       /*
> -        * We consider all NANs to be equal and larger than any non-NAN. This is
> -        * somewhat arbitrary; the important thing is to have a consistent sort
> -        * order.
> -        */
>
> seems very significant and should be kept anywhere relevant.

I will add it back in the next version.

> I searched pgsql-general ML but couldn't find a complaint about
> the current behavior. Even though I'm not familiar with PostGIS, I
> found that it uses exactly the same EPSILON method with
> PostgreSQL.

Does it? I understood from Paul Ramsey's comment on this thread [1]
that they don't.

> If we had an apparent plan to use them for other than
> earth-scale(?) geometric usage, we could design what they should
> be. But without such a plan it is just a breakage of the current
> usage.

We give no promises about the geometric types being useful at earth
scale.

> About What kind of (needless) complication you are saying? The
> fuzziness seems to me essential for geometric comparisons to work
> practically. Addition to that, I don't think that we're not
> allowed to change the behavior in such area of released versions
> the time after time.

Even when it is a total mess?

> I don't think index scan and tolerant comparison are not
> contradicting. Could you let me have an description about the
> indexing capabilities and the inconsistencies?

The first problem is that some operators are not using the epsilon.
This requires special treatment while developing index support for
operators. I have tried to support point for BRIN using the box
operators, and failed because of that.

Comparing differences against an epsilon does not apply the same way
to every operator. Even with simple operators like "point in box", it
covers different distances outside the box depending on where the
point is. For example, "point <-> box < EPSILON" wouldn't be
equivalent to "point <@ box" when the point is just outside a corner
of the box. Things get more complicated with lines. Because of this,
we easily violate basic expectations of the operators:

> regression=# select '{1000,0.000001,0}'::line ?|| '{90000,0.00009,0}'::line;
>
>  ?column?
> ----------
>  f
> (1 row)
>
> regression=# select '{90000,0.00009,0}'::line ?|| '{1000,0.000001,0}'::line;
>  ?column?
> ----------
>  t
> (1 row)

Another problem is the lack of hash and btree operator classes. In my
experience, the point datatype is by far the most used one. People
often try to use it in DISTINCT, GROUP BY, or ORDER BY clauses and
complain when it doesn't work. There are many complaints like this in
the archives. If we get rid of the epsilon, we can easily add those
operator classes.

[1] https://www.postgresql.org/message-id/CACowWR0DBEjCfBscKKumdRLJUkObjB7D%3Diw7-0_ZwSFJM9_gpw%40mail.gmail.com