Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types
Date
Msg-id 20170124.191741.131563793.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types  (Emre Hasegeli <emre@hasegeli.com>)
List pgsql-hackers
Sorry, this is the continuation of the previous comment.

At Wed, 18 Jan 2017 17:36:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170118.173612.13506164.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello.
> 
> (I apologize in advance for possible inaccurate wording on maths,
> which might cause confusion..)
> 
> At Wed, 11 Jan 2017 16:37:59 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in
<CAE2gYzzNufOZqh4HO3Od8urzamNSeX-OXJxfNkRcU3ex9RD8jw@mail.gmail.com>
> > > - Floating point comparisons for gemetric types
> > >
> > >   Comparison related to geometric types is performed by FPeq
> > >   macro and similars defined in geo_decls.h. This intends to give
> > >   tolerance to the comparisons.
> > >
> > >                  A
> > >   FPeq:     |<=e-|-e=>|    (<= means inclusive, e = epsilon = tolerance)
> > >   FPne:   ->|  e | e  |<-  (<- means exclusive)
> > >   FPlt:          | e  |<-
> > >   FPle:     |<=e |
> > >   FPgt:   ->|  e |
> > >   FPge:          | e=>|
> > >
> > >   These seems reasonable ignoring the tolerance amount issue.
> > 
> > I tried to explain below why I don't find this method reasonable.
> 
> Thank you very much for the explanation.
> 
> > > At least for the point type, (bitmap) index scan is consistent
> > > with sequential scan.  I remember that the issue was
> > > "inconsistency between indexed and non-indexed scans over
> > > geometric types" but they seem consistent with each other.
> > 
> > You can see on the Git history that it was a lot of trouble to make
> > the indexes consistent with the operators, and this is limiting
> > improving indexing of the geometric types.
> 
> Yeah. geo_ops.c has the following comment from its first version
> in the current repository.
> 
> | *    Relational operators for Points.
> | *        Since we do have a sense of coordinates being
> | *        "equal" to a given accuracy (point_vert, point_horiz),
> 
> So, we take it as a requirement for geometric types. All the
> difficulties come from the "nature".
> 
> > > You mentioned b-tree, which don't have predefined opclass for
> > > geometric types. So the "inconsistency" may be mentioning the
> > > difference between geometric values and combinations of plain
> > > (first-class) values. But the two are different things and
> > > apparently using different operators (~= and = for equality) so
> > > the issue is not fit for this. More specifically, "p ~= p0" and
> > > "x = x0 and y = y0" are completely different.
> > 
> > Yes, the default btree operator class doesn't have to implement
> > standard equality and basic comparison operator symbols.  We can
> > implement it with different symbols.
> 
> Perhaps I don't understand it. Many opclass are defined for
> btree. But since (ordinary) btree handles only one-dimentional,
> totally-orderable sets, geometric types even point don't fit
> it. Even if we relaxed it by removing EPSILON, multidimentional
> data still doesn't fit.
> 
> Still we can implement equality mechanism *over* multiple btree
> indexes, but it cannot be a opclass for btree.
> 
> > > Could you let me (or any other guys on this ml) have more precise
> > > description on the problem and/or what you want to do with this
> > > patch?
> > 
> > I will try to itemize the current problems with the epsilon method:
> > 
> > = Operator Inconsistencies
> > 
> > My current patches address all of them.
> 
> No. It just removes tolerans that is a "requirement" for
> coordinates of geometric types. I think we don't have enough
> reason to remove it. We could newly define geometric types with
> no-tolerance as 'exact_point" or something but it still doesn't
> fit btree.
> 
> > - Only some operators are using the epsilon.
> > 
> > I included the list of the ones which doesn't use the epsilon on
> > initial post of this thread.  This makes the operators not compatible
> > with each other.  For example, you cannot say "if box_a @> box_b and
> > box_b @> point then box_a @> box_b".
> 
> (It must be "then box_a @> point".)  Both of the operators don't
> seem to use EPSILON so the transitivity holds, but perhaps we
> should change them to more appropriate shape, that is, to use
> EPSILON. Even having the tolerance, we can define the operators
> to keep the transitivity.
> 
> > - The application of epsilon is implementation dependent.
> > 
> > Epsilon works between two numbers.  There is not always a single way
> > of implementing geometric operators.  This is surprising to the users.
> > For example, you cannot say "if box_a @> box_b then box_a <-> box_b <=
> > epsilon".
> 
> Maybe it should be like "if box_a ~= box_b && box_b @< box_a then
> ..". I'm not sure off-hand. But I don't see the relationship is
> significant.
> 
> > This is also creating a high barrier on developing those operators.
> > Fixing bugs and performance regressions are very likely to change the
> > results.
> 
> I agree to you. The effect of the EPSILON is quite arbitrary and
> difficult to see their chemistry.
> 
> > - Some operators are just wrong:
> > 
> > Line operators are not well maintained.  The following are giving
> > wrong results when neither A or B of lines are not exactly 1:
> > 
> > * line # line
> > * line <-> line
> > * point ## line
> > * point ## lseg
> 
> These are not comparison operators so EPSILON is unrelated. And I
> agree that they needs fix.
> 
> > They are broken independently from the epsilon, though it is not easy
> > to fix them without changing the results of the cases that are
> > currently working.
> 
> Although I'm not sure they are using EPSILON, I think they don't
> need it.
> 
> > - Some operators are violating commutative property.
> > 
> > For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a".
> 
> Whether EPSILON is introduced or not, commutativity cannot be
> assured for it from calculation error, I suppose.
> 
> > - Some operators are violating transitive property.
> > 
> > For example, you cannot say "if point_a ~= point_b and point_b ~=
> > point_c then point_a ~= point_c".
> 
> It holds only in ideal (or mathematical) world, or for similars
> of integer (which are strictly implemented mathematica world on
> computer).  Without the EPSILON, two points hardly match by
> floating point error, as I mentioned. I don't have an evidence
> though (sorry), mere equality among three floating point numbers
> is not assured.
> 
> 
> Sorry, I have no more time today, I'll continue tomorrow.


> - The operators with epsilon are not compatible with float operators.
> 
> This is also surprising for the users.  For example, you cannot say
> "if point_a << point_b then point_a[0] < point_b[0]".

It is because "is strictly left of" just doesn't mean their
x-coordinates are in such a relationship. The difference might be
surprising but is a specification (truely not a bug:).

> - The operators are not checking for underflow or overflow.
> - NaN values are not treated same as the float datatype.
> 
> Those are independent problems, though checking them with epsilon
> would create additional set of inconsistencies.  Epsilon creates
> especially more problems around 0.

I dont' understand this precisely but it seems to be reasonable.

> = Missing Bits on the Operator Classes
> 
> I want to address those in the future, if my patches are accepted.
> 
> - There is no GiST or SP-GiST support for cross datatype operators
> other than containment.
> 
> We can develop nice cross-datatype operator families to support more
> operators.  It would have been easier, if we could rely on some
> operators to implement others.  Geometric types serve the purpose of
> demonstrating our indexing capabilities.  We should make them good
> examples, not full of special cases with hidden bugs.
> 
> - There is no BRIN support for types other than the box.
> 
> BRIN inclusion operator class is datatype agnostic.  It uses some
> operators to implement others which doesn't work for anything more the
> box vs box operators, because of the inconsistencies.
> 
> - GROUP BY and DISTINCT are not working.
> - ORDER BY is not working.
> 
> There are regular user complaints about this on the mailing list.  We
> can easily provide default hash and btree operator classes that would
> fix the problem, if we haven't had the epsilon.

This final section seems to mention the application but sorry, it
still don't describe for me what application that this patch
aiming to enable. But I can understand that you want to handle
floating point numbers like integers. The epsilon is surely quite
annoying for the purpose.


Ok, returning to the basis of the current geometric types, it is
described as below. It is not perfect (so needs amendment) but
achived to a certain degree.

1. For geometric types, equality needs to be defined with some  tolerance. Concretely an absolute value of 1E-06, which
is quite random but would be suitable to ignore error of  longitude and latitude.
 

2. Translative comparisons should consider the tolerance to make  a result that does not contradict the equality.

3. It is not sure about other comparsons such like parallel and  perpendicular, but it is hardly usable without any
amountof  tolerance.
 


On the other hand, some people complains about the following.

A. The geometric comparisons are not in relationships that  analogously deducible (or expected) from scalar
comparisons.

B. The tolerance seems prevents us from using indexes.


As a presupposition, we have the common knowlegde, or principle,
about floating point handling on a digital computer.

X. We shouldn't detect equalty of two floating point numbers just  using '=' for real usage. It hardly works for
arbitrary numbers including results of any arithmetics, so it should  incorporate any amount of tolerance that differs
among applications.
 

Sorry for some refrains from the past discussion, but the above X
and 1-3 are the immovable basis in this area. (I belive it . But
any objection are welcome)

You might want to object that float has an equality operator, but
it is because it is an atomic type, which has no specific
purpose. On the contrary geometric types are application types,
which is assumed specific purpose and it should behave
reasonablly against the expected purpose.


That being said, if we want to, for example, use btree index for
each coordinate of geometric types, it is doable by normalizing
the values before applying operators, instead of applying epsilon
arbitrarily in operators.

Normalization may be done both on storing or on-the-fly just
befor usage. It depends on whether we want to preserve precision
of stored values. If we want, normalization should be done just
before comparisons, or on creating indexes.

I don't know much about the normalization method for floating
point numbers but libgeos might give some hints.

And still I don't have an idea how to detect parallelity and
perpendicularity based on the normalization.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Failure in commit_ts tap tests
Next
From: Vladimir Rusinov
Date:
Subject: Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal