Re: Floating point comparison inconsistencies of the geometric types - Mailing list pgsql-hackers

From Emre Hasegeli
Subject Re: Floating point comparison inconsistencies of the geometric types
Msg-id CAE2gYzyjvRbj4bQZRp-yBo73-77-LDH0WnwPtAtZRFKPxYQRgw@mail.gmail.com
In response to Re: Floating point comparison inconsistencies of the geometric types  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: Floating point comparison inconsistencies of the geometric types
List pgsql-hackers
> What way of dealing with it do you have in mind? The problem hides
> behind operators. To fix it, a user would have to rewrite an expression
> using more primitive operators. For example, (line_a # point_a)
> should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in
> a more primitive way. I regard this as the operator # just
> becoming useless.

Simple equations like this work well with the current algorithm:

> select '(0.1,0.1)'::point @ '(0,0),(1,1)'::line;

The operator does what you expect from it.  Users can use something
like what you described to get fuzzy behaviour with an epsilon they choose.
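
As a rough illustration of a fuzzy check with a caller-chosen epsilon, here
is a minimal sketch in C.  The type Line2D and the function
point_near_line() are hypothetical names made up for this example; they are
not part of PostgreSQL or of the patch.  The line is assumed to be given as
A*x + B*y + C = 0, and squared quantities are compared so that no sqrt() is
needed.

```c
#include <assert.h>

/* Hypothetical line type for this sketch: A*x + B*y + C = 0. */
typedef struct
{
    double A, B, C;
} Line2D;

/*
 * Returns 1 if the point (x, y) lies within distance eps of the line,
 * 0 otherwise.  This is the test
 *     |A*x + B*y + C| / sqrt(A^2 + B^2) <= eps
 * with both sides squared.
 */
static int
point_near_line(double x, double y, const Line2D *l, double eps)
{
    double num = l->A * x + l->B * y + l->C;
    double norm2 = l->A * l->A + l->B * l->B;

    assert(eps >= 0.0);     /* the tolerance is chosen by the caller */
    assert(norm2 > 0.0);    /* A and B must not both be zero */

    return num * num <= eps * eps * norm2;
}
```

With eps = 0 this degenerates to an exact test, so the tolerance is a
decision of the caller rather than something built into the operator.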

> Regarding optimization, at least gcc generates seemingly not so
> different code for the two. Both generally generate extended
> code directly calling isnan() and so on. Have you measured the
> performance of the two implementations (with -O2, without
> --enable-cassert)?  This kind of hand-optimization gains
> legitimacy when we see a significant difference, according to
> the convention here, I suppose.

I tested it with this program:

> int main()
> {
>    double  i,
>            j;
>    int     result = 0;
>
>    for (i = 0.1; i < 10000.0; i += 1.0)
>        for (j = 0.1; j < 10000.0; j += 1.0)
>            if (float8_lt(i, j))
>                result = (result + 1) % 10;
>
>    return result;
> }

The one calling cmp() was noticeably slower.

./test1  0.74s user 0.00s system 99% cpu 0.748 total
./test2  0.89s user 0.00s system 99% cpu 0.897 total

The difference would probably not be noticeable when calling SQL functions
that do only a few comparisons, but some places may need to do many more
comparisons.  I don't find the optimised versions less clear than the ones
calling cmp().  I can change it the other way if you find that clearer.
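
For readers without the patch at hand, the two styles being compared can be
pictured roughly as follows.  This is only a sketch under assumptions: the
names cmp_sketch(), lt_via_cmp() and lt_direct() are hypothetical, the NaN
test uses x != x instead of isnan() to keep the example free of <math.h>,
and the ordering rule (all NaNs equal, and larger than any non-NaN) is the
one described in the float4_cmp_internal() comment quoted further down.

```c
#include <assert.h>

/* NaN is the only value that compares unequal to itself. */
static int
is_nan_sketch(double x)
{
    return x != x;
}

/* Three-way comparison: all NaNs are equal and larger than any non-NaN. */
static int
cmp_sketch(double a, double b)
{
    if (is_nan_sketch(a))
        return is_nan_sketch(b) ? 0 : 1;
    if (is_nan_sketch(b))
        return -1;
    if (a > b)
        return 1;
    if (a < b)
        return -1;
    return 0;
}

/* Style 1: less-than expressed through the three-way comparison. */
static int
lt_via_cmp(double a, double b)
{
    return cmp_sketch(a, b) < 0;
}

/*
 * Style 2: less-than written out directly.  "a < b" is false whenever
 * either side is NaN, so the only extra case is a non-NaN a against a
 * NaN b, which must count as "less than".
 */
static int
lt_direct(double a, double b)
{
    return a < b || (!is_nan_sketch(a) && is_nan_sketch(b));
}

/* Both styles are expected to agree for every pair of inputs. */
static void
check_agree(double a, double b)
{
    assert(lt_via_cmp(a, b) == lt_direct(a, b));
}
```

The sketch only shows that the two styles can be made to return the same
answers; it makes no claim about which one a compiler optimises better.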

> At least the comment your patch dropped,
>
>  int
>  float4_cmp_internal(float4 a, float4 b)
>  {
> -       /*
> -        * We consider all NANs to be equal and larger than any non-NAN. This is
> -        * somewhat arbitrary; the important thing is to have a consistent sort
> -        * order.
> -        */
>
> seems very significant and should be kept wherever relevant.

I will add it back in the next version.

> I searched the pgsql-general ML but couldn't find a complaint about
> the current behavior. Even though I'm not familiar with PostGIS, I
> found that it uses exactly the same EPSILON method as
> PostgreSQL.

Does it?  I understood from Paul Ramsey's comment on this thread [1]
that it doesn't.

> If we had an apparent plan to use them for other than
> earth-scale(?)  geometric usage, we could design what they should
> be. But without such a plan it is just a breakage of the current
> usage.

We make no promises about the geometric types being useful at earth scale.

> What kind of (needless) complication are you talking about? The
> fuzziness seems to me essential for geometric comparisons to work
> in practice. In addition to that, I don't think that we're
> allowed to change the behavior in such an area of released versions
> time after time.

Even when it is a total mess?

> I don't think index scans and tolerant comparison contradict
> each other. Could you give me a description of the
> indexing capabilities and the inconsistencies?

The first problem is that some operators do not use the epsilon.
This requires special treatment when developing index support for
operators.  I tried to support point for BRIN using the box
operators, and failed because of that.

Comparing differences against an epsilon does not apply the same way to
every operator.  Even with a simple operator like "point in box", the
tolerance covers different distances outside the box depending on where
the point is.  For example, "point <-> box < EPSILON" wouldn't be
equivalent to "point <@ box" when the point is just outside a corner of
the box.  Things get more complicated with lines.  Because of this,
we easily violate basic expectations of the operators:

> regression=# select '{1000,0.000001,0}'::line ?|| '{90000,0.00009,0}'::line;
>
> ?column?
> ----------
> f
> (1 row)
>
> regression=# select '{90000,0.00009,0}'::line ?|| '{1000,0.000001,0}'::line;
> ?column?
> ----------
> t
> (1 row)
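
One way such an order-dependent result can arise is sketched below.  The
shape of the check is an assumption made for illustration, not code quoted
from the PostgreSQL sources: it treats the B coefficient of the first line
as zero when its magnitude is at most 1.0E-06, and otherwise compares the
scaled A coefficients with the same tolerance.  All names (LineCoef,
parallel_sketch(), EPSILON_SKETCH and the helpers) are hypothetical.

```c
#include <assert.h>

#define EPSILON_SKETCH 1.0E-06

/* Hypothetical line type for this sketch: A*x + B*y + C = 0. */
typedef struct
{
    double A, B, C;
} LineCoef;

static double
abs_d(double v)
{
    return v < 0 ? -v : v;
}

/* Tolerant "is zero" and "is equal" tests. */
static int
fp_zero(double a)
{
    return abs_d(a) <= EPSILON_SKETCH;
}

static int
fp_eq(double a, double b)
{
    return abs_d(a - b) <= EPSILON_SKETCH;
}

/*
 * Assumed shape of a tolerant parallel check.  Only l1->B is tested
 * against zero first, so the result can depend on argument order.
 */
static int
parallel_sketch(const LineCoef *l1, const LineCoef *l2)
{
    if (fp_zero(l1->B))
        return fp_zero(l2->B);
    return fp_eq(l2->A, l1->A * (l2->B / l1->B));
}

/* The two lines from the example, checked in both orders. */
static void
check_asymmetry(void)
{
    LineCoef a = {1000.0, 0.000001, 0.0};
    LineCoef b = {90000.0, 0.00009, 0.0};

    assert(parallel_sketch(&a, &b) == 0);   /* a.B counts as zero, b.B does not */
    assert(parallel_sketch(&b, &a) == 1);   /* scaled A coefficients match */
}
```

In this sketch the first call stops at the zero test, because 0.000001
falls within the tolerance while 0.00009 does not, and the second call
goes on to compare the scaled coefficients.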

Another problem is the lack of hash and btree operator classes.  In my
experience, the point datatype is by far the most used one.  People often
try to use it in DISTINCT, GROUP BY, or ORDER BY clauses and complain
when it doesn't work.  There are many complaints like this in the
archives.  If we get rid of the epsilon, we can easily add those
operator classes.
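
Part of the difficulty is that hash and btree operator classes need an
equality that is transitive, and equality within a tolerance is not.  The
following is a minimal sketch of that property, using hypothetical helper
names (fuzzy_eq(), breaks_transitivity()) and a tolerance passed in by the
caller.

```c
#include <assert.h>

/* Tolerant equality: |a - b| <= eps. */
static int
fuzzy_eq(double a, double b, double eps)
{
    double d = a - b;

    assert(eps >= 0.0);
    if (d < 0)
        d = -d;
    return d <= eps;
}

/*
 * Returns 1 when a ~ b and b ~ c hold but a ~ c does not, that is, when
 * the three values show that the tolerant equality is not transitive.
 */
static int
breaks_transitivity(double a, double b, double c, double eps)
{
    return fuzzy_eq(a, b, eps) && fuzzy_eq(b, c, eps) && !fuzzy_eq(a, c, eps);
}
```

For instance, with eps = 1.0e-6 the values 0.0, 0.8e-6 and 1.6e-6 form
such a chain: each neighbouring pair is within the tolerance, but the
first and the last are not.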

[1] https://www.postgresql.org/message-id/CACowWR0DBEjCfBscKKumdRLJUkObjB7D%3Diw7-0_ZwSFJM9_gpw%40mail.gmail.com


