Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types - Mailing list pgsql-hackers

From Emre Hasegeli
Subject Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types
Date
Msg-id CAE2gYzzLWCdYDPLi_UQXLSJ+pqc7g4WGNHsb8tHkmosKK+rO7g@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types
List pgsql-hackers
I am responding both of your emails together.

> Perhaps I don't understand it. Many opclass are defined for
> btree. But since (ordinary) btree handles only one-dimentional,
> totally-orderable sets, geometric types even point don't fit
> it. Even if we relaxed it by removing EPSILON, multidimentional
> data still doesn't fit.

Yes, multidimentional data doesn't fit into btree.  Let's say we would
have to introduce operators called *<, *<=, *>, *>= to support btree
opclass for point.  I agree those operators would be useless, but
btree opclass is used for other purposes too.  "ORDER BY point",
merge-joins, btree index support for ~= would be useful.

Lack of ORDER BY, GROUP BY, and DISTINCT support is the major
inconvenience about the geometric types.  There are many complaints
about this on the mailing list.  Many frameworks and ORMs are not
prepared to deal with getting an error when they use certain types on
those clauses.

>> - Only some operators are using the epsilon.
>>
>> I included the list of the ones which doesn't use the epsilon on
>> initial post of this thread.  This makes the operators not compatible
>> with each other.  For example, you cannot say "if box_a @> box_b and
>> box_b @> point then box_a @> box_b".
>
> (It must be "then box_a @> point".)

Yes, I am sorry.

>  Both of the operators don't
> seem to use EPSILON so the transitivity holds, but perhaps we
> should change them to more appropriate shape, that is, to use
> EPSILON. Even having the tolerance, we can define the operators
> to keep the transitivity.

Yes, that would be another way.  In my view, this would make things
worse, because I believe epsilon approach is completely broken.  The
operators which are not using the epsilon are the only ones people can
sanely use to develop applications.

>> > - The application of epsilon is implementation dependent.
>> >
>> > Epsilon works between two numbers.  There is not always a single way
>> > of implementing geometric operators.  This is surprising to the users.
>> > For example, you cannot say "if box_a @> box_b then box_a <-> box_b <=
>> > epsilon".
>>
>> Maybe it should be like "if box_a ~= box_b && box_b @< box_a then
>> ..". I'm not sure off-hand. But I don't see the relationship is
>> significant.

What I meant was the behaviour being unclear to even people who knows
about the epsilon.  If two boxes are overlapping with each other, one
who knows about EPSILON can expect the distance between them to be
less than EPSILON, but this wouldn't be true.

>> - Some operators are violating commutative property.
>>
>> For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a".
>
> Whether EPSILON is introduced or not, commutativity cannot be
> assured for it from calculation error, I suppose.

It can easily be assured by treating both sides of the operator the
same.  It is actually assured on my patch.

>> - Some operators are violating transitive property.
>>
>> For example, you cannot say "if point_a ~= point_b and point_b ~=
>> point_c then point_a ~= point_c".
>
> It holds only in ideal (or mathematical) world, or for similars
> of integer (which are strictly implemented mathematica world on
> computer).  Without the EPSILON, two points hardly match by
> floating point error, as I mentioned. I don't have an evidence
> though (sorry), mere equality among three floating point numbers
> is not assured.

Of course, it is assured.  Floating point numbers are deterministic.

>> - The operators with epsilon are not compatible with float operators.
>>
>> This is also surprising for the users.  For example, you cannot say
>> "if point_a << point_b then point_a[0] < point_b[0]".
>
> It is because "is strictly left of" just doesn't mean their
> x-coordinates are in such a relationship. The difference might be
> surprising but is a specification (truely not a bug:).

Then what does that mean?  Every operator with EPSILON is currently
surprising in different way.  People use databases to build
applications.  They often need to implement same operations on the
application side.  It is very hard to be bug-compatible with our
geometric types.

>> = Missing Bits on the Operator Classes

> This final section seems to mention the application but sorry, it
> still don't describe for me what application that this patch
> aiming to enable. But I can understand that you want to handle
> floating point numbers like integers. The epsilon is surely quite
> annoying for the purpose.

I will try to itemise what applications I am aiming to enable:

- Applications with very little GIS requirement

PostGIS can be too much work for an application in s different
business domain but has a little requirement to GIS.  The built-in
geometric types are incomplete to support real world geography.
Getting rid of epsilon would make this worse. In contrary, it would
allow people to deal with the difficulties on their own.

- Applications with geometric objects

I believe people could make use the geometric types in many different
business areas, if they would be in better shape.  I am working for a
gaming company.  There is great overlap between the logic of many
games and the geometric types.  I could support using them more, if
they wouldn't be so buggy and impractical.

- Demonstrating indexes

We require indexing features to be committed with an example.
Geometric types have served this purpose many times.  Many operators
are actually added to demonstrate indexes.  These examples are also
used by other projects like PostGIS.  They can re-use part of the
implementations, or the implementation as a starting point.  I believe
we should make those examples as clean as possible.

> Ok, returning to the basis of the current geometric types, it is
> described as below. It is not perfect (so needs amendment) but
> achived to a certain degree.
>
> 1. For geometric types, equality needs to be defined with some
>    tolerance. Concretely an absolute value of 1E-06, which is
>    quite random but would be suitable to ignore error of
>    longitude and latitude.

I don't agree this requirement as it is put.  Returning true to
something like "point(0.000001, 0.000001) ~= point(0.000002,
0.000002)" is obviously wrong.

> 2. Translative comparisons should consider the tolerance to make
>    a result that does not contradict the equality.

The current implementation fails to provide it:

> hasegeli=# select point(0.000001, 0.000001) ~= point(0.000002, 0.000002);
> ?column?
> ----------
> t
> (1 row)
>
> hasegeli=# select point(0.000001, 0.000001) << point(0.0000021, 0.0000021);
> ?column?
> ----------
> t
> (1 row)
>
> hasegeli=# select point(0.000002, 0.000002) << point(0.0000021, 0.0000021);
> ?column?
> ----------
> f
> (1 row)

> 3. It is not sure about other comparsons such like parallel and
>    perpendicular, but it is hardly usable without any amount of
>    tolerance.

I agree.  I don't think they are useable with our tolerance either.

> On the other hand, some people complains about the following.

I believe have listed enough complaints on my previous post.

> As a presupposition, we have the common knowlegde, or principle,
> about floating point handling on a digital computer.
>
> X. We shouldn't detect equalty of two floating point numbers just
>    using '=' for real usage. It hardly works for arbitrary
>    numbers including results of any arithmetics, so it should
>    incorporate any amount of tolerance that differs among
>    applications.
>
> Sorry for some refrains from the past discussion, but the above X
> and 1-3 are the immovable basis in this area. (I belive it . But
> any objection are welcome)
>
> You might want to object that float has an equality operator, but
> it is because it is an atomic type, which has no specific
> purpose. On the contrary geometric types are application types,
> which is assumed specific purpose and it should behave
> reasonablly against the expected purpose.

Are they really?  We have really simple geometric types which are not
for any specific kind of application.  We are not claiming that they
are for earth scale or any other scale.  I think we should aim of
making them predictable, extensible, as general purpose as possible to
let people use them for their purpose.

> That being said, if we want to, for example, use btree index for
> each coordinate of geometric types, it is doable by normalizing
> the values before applying operators, instead of applying epsilon
> arbitrarily in operators.
>
> Normalization may be done both on storing or on-the-fly just
> befor usage. It depends on whether we want to preserve precision
> of stored values. If we want, normalization should be done just
> before comparisons, or on creating indexes.
>
> I don't know much about the normalization method for floating
> point numbers but libgeos might give some hints.

Me neither.  I have been told that storing the minimum and and maximum
possible value is a good strategy for tolerance.  Then all operators
can return true, false, or maybe.  This would be a much bigger and
incompatible change that I don't think I can implement.

We cannot come into an agreement.  Geometric types are not well
maintained.  I am trying my best to put them into a shape.  Please
help me do that.  Let's try to find a way to leave them better than we
found.



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: [HACKERS] Assignment of valid collation for SET operations onqueries with UNKNOWN types.
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Proposal : For Auto-Prewarm.