Thread: find close (duplicate) points + create index

find close (duplicate) points + create index

From

Elinor Medezinski

Date:

09 March 2004, 20:05:54

Hello,

I have a table with one column --
"CREATE TABLE pointtable (point POINT)".
I'm trying to find duplicate entries, where two entries are considered
duplicates if  they're within a radius of 1, meaning something like ~ "select
point from pointtable where distance between points <=1".
Obviously this is not SQL syntax. Is there a way to do this - search within a
column itself?

Also, I also tried to build an index on that column, but there's no operator
class for type point. How can I do that?

Thanks,
Elinor

Re: find close (duplicate) points + create index

From

Bruno Wolff III

Date:

10 March 2004, 00:13:59

On Thu, Mar 04, 2004 at 12:42:23 +0200,
  Elinor Medezinski <elinor@bellatrix.tau.ac.il> wrote:
> Hello,
>
> I have a table with one column --
> "CREATE TABLE pointtable (point POINT)".
> I'm trying to find duplicate entries, where two entries are considered
> duplicates if  they're within a radius of 1, meaning something like ~ "select
> point from pointtable where distance between points <=1".
> Obviously this is not SQL syntax. Is there a way to do this - search within a
> column itself?

Yes, you can join a table to itself and return records matching some
critera. This isn't going to be fast. I didn't see a distance between
two points function, but it is probably there somewhere. If not you
can check if one point is contained in a circle of radius 1 centered
at the other point. This is something that could use an index, though
it probably wouldn't help when checking the whole table. It probably
would speed up checking a single point for conflicts.

Re: find close (duplicate) points + create index

From

Tom Lane

Date:

10 March 2004, 01:28:47

Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes:
> I'm trying to find duplicate entries, where two entries are considered
> duplicates if  they're within a radius of 1, meaning something like
> "select point from pointtable where distance between points <=1".
> Obviously this is not SQL syntax.

Well, it is if you do a self-join:

    select * from pointtable a, pointtable b
    where distance(a.point, b.point) <= 1;

Postgres spells the "distance" operator as "<->", so this becomes

    select * from pointtable a, pointtable b
    where (a.point <-> b.point) <= 1;

Making it fast is a more difficult problem :-( ... if you write the
above query as-is then the system will sit there and compare each row of
pointtable to each other row, looking for pairs of rows that match the
where-clause.  Okay if you just have some thousands of rows, but on a
big table this will take longer than you want to wait.

> Also, I also tried to build an index on that column, but there's no operator
> class for type point. How can I do that?

A btree index on a point column would be quite useless, since btree
understands only a one-dimensional continuum with less-than, equal,
greater-than relationships.  But I think you might be able to do
something with an rtree index.  I'd look at making an rtree index on
the unit box around each point, and then using an "overlaps" test as
an indexable coarse filter before the exact distance check.

            regards, tom lane

Re: find close (duplicate) points + create index

From

Date:

10 March 2004, 08:27:50

On Wed, 10 Mar 2004, Tom Lane wrote:
> Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes:

> > I'm trying to find duplicate entries, where two entries are considered
> > duplicates if  they're within a radius of 1, meaning something like
> > "select point from pointtable where distance between points <=1".
> > Obviously this is not SQL syntax.
>
> Well, it is if you do a self-join:
>
>     select * from pointtable a, pointtable b
>     where distance(a.point, b.point) <= 1;
>
> Postgres spells the "distance" operator as "<->", so this becomes
>
>     select * from pointtable a, pointtable b
>     where (a.point <-> b.point) <= 1;
>
> Making it fast is a more difficult problem :-( ... if you write the
> above query as-is then the system will sit there and compare each row of
> pointtable to each other row, looking for pairs of rows that match the
> where-clause.  Okay if you just have some thousands of rows, but on a
> big table this will take longer than you want to wait.

I'm guessing distance is defined to be:
 sqrt( (x1-x0)^2 + (y1-y0)^2 )
or
 sqrt( (x1-x0)^2 + (y1-y0)^2 + (z1-z0)^2 )

You obviously need to find the difference in the x,y (and possibly
z if used) coordinates, as is multiplying that difference by
itself and summing them.  However, you don't need to take the
square root, which is often a computationally expensive thing.
Two points which are close in terms of distance, are also close in
terms of distance squared.

If you are only interested in points that are within a distance of
1 (or a distance squared of 1*1), then you can stop calculating
the distance when any of the differences is bigger than 1.  Since
you don't care, other than knowing it is too big, just stop the
calculation and return some number bigger than 1.  That should
help in speeding things up a bit.  How you do this in PostgreSQL,
I can't help you.  :-)

> > Also, I also tried to build an index on that column, but
> > there's no operator class for type point. How can I do that?
>
> A btree index on a point column would be quite useless, since btree
> understands only a one-dimensional continuum with less-than, equal,
> greater-than relationships.  But I think you might be able to do
> something with an rtree index.  I'd look at making an rtree index on
> the unit box around each point, and then using an "overlaps" test as
> an indexable coarse filter before the exact distance check.

You probably want to look a little at how GIS (Geographical
Information Systems) works.  There are lots of different
techniques they use to search and partition things.  Quadtree
and oct-tree come to mind.

Gord

Re: find close (duplicate) points + create index

From

Tom Lane

Date:

10 March 2004, 10:59:52

Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes:
> And then I found out that in postgres the only operator classes defined for
> rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works
> with points, only with type box and polygon. Therefore I also have to create
> an operator class.

No you don't.  What you want is a functional index built on a box or polygon
surrounding the point.  For instance, given

regression=# create table p1 (point_a point);
CREATE TABLE
regression=# create index p1i on p1 using rtree (box(point_a, point_a));
CREATE INDEX

you could do searches for points enclosed in a specific box like this:

regression=# explain select * from p1 where box(point_a, point_a) && '(0,1),(0,1)'::box;
                           QUERY PLAN
----------------------------------------------------------------
 Index Scan using p1i on p1  (cost=0.00..17.07 rows=5 width=16)
   Index Cond: (box(point_a, point_a) && '(0,1),(0,1)'::box)
(2 rows)

since box-overlap (&&) is one of the rtree-indexable operators.

The most useful way to solve your original problem seems to be

regression=# explain select * from p1 a, p1 b where
regression-# box(a.point_a, a.point_a) && box(circle(b.point_a,sqrt(2)))
regression-# and (a.point_a <-> b.point_a) <= 1;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.01..17220.00 rows=1667 width=32)
   Join Filter: (("inner".point_a <-> "outer".point_a) <= 1::double precision)
   ->  Seq Scan on p1 b  (cost=0.00..20.00 rows=1000 width=16)
   ->  Index Scan using p1i on p1 a  (cost=0.01..17.07 rows=5 width=16)
         Index Cond: (box(a.point_a, a.point_a) && box(circle("outer".point_a, 1.4142135623731::double precision)))
(5 rows)

The indexable condition finds "a" rows that are within the bounding box
of a circle surrounding the "b" row, and then we only need to apply the
exact distance check to those rows.

(If you're wondering about the sqrt(2), there's an oddity in the
built-in circle-to-box function: it divides the circle radius
by sqrt(2).  I think this is a bug and will propose changing it
for 7.5.)

            regards, tom lane

I am trying to send a message on the list for the last three hours

From

"Costin Manda"

Date:

10 March 2004, 11:15:30

What's going on? the same message I've sent 3 times! not once did it arrive

plperl doesn't work

From

"Costin Manda"

Date:

10 March 2004, 11:26:51

I am trying to install postgreSQL with perl support on a Slackware Linux
9.1. I have compiled perl with shared libraries, the rest of the settings
default. Then I configured postgres with --with-perl and it worked fine.
  But when I am executing 'make' I get an error: plperl no such file or
directory. I have tried everything. Please help!

Re: find close (duplicate) points + create index

From

Elinor Medezinski

Date:

13 March 2004, 15:49:01

You suggested:

>     select * from pointtable a, pointtable b
>     where (a.point <-> b.point) <= 1;

Thanks, Thats what I'll do.

> Making it fast is a more difficult problem :-( ... if you write the
> above query as-is then the system will sit there and compare each row of
> pointtable to each other row, looking for pairs of rows that match the
> where-clause.  Okay if you just have some thousands of rows, but on a
> big table this will take longer than you want to wait.

This query will only work on a few thousand lines, but I will have other
queries on type point that will require comparing tables with millions of
rows. Therefore I must learn how to build indexes on points.

> A btree index on a point column would be quite useless, since btree
> understands only a one-dimensional continuum with less-than, equal,
> greater-than relationships.  But I think you might be able to do
> something with an rtree index.

That much I know. I didn't find how I can use rtree to build an index on
points, seeing how when I tried the following:
"create INDEX Phot_point_a ON Phot USING RTREE (point_a);"
I got this error:
"ERROR:  data type point has no default operator class for access method
"rtree"
HINT:  You must specify an operator class for the index or define a default
operator class for the data type."

And then I found out that in postgres the only operator classes defined for
rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works
with points, only with type box and polygon. Therefore I also have to create
an operator class. I didn't understand how to do that. Do you know how?


> I'd look at making an rtree index on
> the unit box around each point, and then using an "overlaps" test as
> an indexable coarse filter before the exact distance check.

I didn't understand this either.

Thanks,
Elinor

Re: find close (duplicate) points + create index

From

Bruno Wolff III

Date:

13 March 2004, 20:10:59

On Wed, Mar 10, 2004 at 11:22:47 +0200,
  Elinor Medezinski <elinor@bellatrix.tau.ac.il> wrote:
>
> That much I know. I didn't find how I can use rtree to build an index on
> points, seeing how when I tried the following:
> "create INDEX Phot_point_a ON Phot USING RTREE (point_a);"
> I got this error:
> "ERROR:  data type point has no default operator class for access method
> "rtree"
> HINT:  You must specify an operator class for the index or define a default
> operator class for the data type."
>
> And then I found out that in postgres the only operator classes defined for
> rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works
> with points, only with type box and polygon. Therefore I also have to create
> an operator class. I didn't understand how to do that. Do you know how?

rtree indexes on points doesn't make sense since containment is the same
as equals. You want to use boxes. A point is a box with the same point
for both of the defining corners. When you do searches you use a box
that describes where you are looking and look for boxes (including points)
that are located in the search box.

You can also use the cube type and gist indexes to do the same thing.

Re: find close (duplicate) points + create index

From

Date:

14 March 2004, 13:27:49

On Wed, 10 Mar 2004, Elinor Medezinski wrote:

> > I'd look at making an rtree index on
> > the unit box around each point, and then using an "overlaps" test as
> > an indexable coarse filter before the exact distance check.
>
> I didn't understand this either.

Lets say a 2D point comes from GPS, and has coordinates (from some
particular zone) of (483121.534, 5124378.745).  The units are
meters, and so these coordinates are both specified to the nearest
millimeter (which is too many significant figures for a GPS
measurement without special equipment).

We can construct a small box around this point, by chopping off
the fractional meters, and either adding  or subtracting 1.  We
will also assume that any boxes we make, have sides parallel to
our axis.  So, the Upper Left point is (483120,5124379) and the
Lower Right point is (483122,5124377).

This isn't a unit box, it's a 4 unit box.  But, it requires less
care in constructing it (in my mind) than making the unit box.
But you can work with unit boxes too.  To check that are box
(point) of interest overlaps another box, you check to see that
the minimum X of this box is greater than the minimum X of the
other box and less than the maximum X of the other box.  Likewise
for the Y coordinates.  You are considering small boxes around
points, and all the boxes are squares.  The more general problem,
allows for N-sided polygons, where it is possible for the bounding
boxes to overlap, but for the one box to not be inside the other
box.

If the bounding boxes don't overlap, there is no sense calculating
a distance (in some cases) as the points/boxes are too far apart.

Gord

Re: find close (duplicate) points + create index

From

Elinor

Date:

16 February 2005, 18:57:52

Hi Tom,
Following your advice quite a long time ago, I built an index on my very large table called object:

myDB=# select count(*) from object;
count
---------
2797036
(1 row)
Time: 8354.25 ms

using RTREE INDEX on a column of type point:

myDB=# create index object_point_a on object using rtree (box(point_a,point_a));
CREATE INDEX
Time: 220546.54 ms

Then I tried the query you suggested to find points closer to each other than 1.5:
SELECT * FROM object a,object b
WHERE box(a.point_a,a.point_a)&&box(circle(b.point_a,sqrt(2)))
AND (a.point_a<->b.point_a)<1;

After many many minutes, it started giving these lines:
server sent data ("D" message) without prior row description ("T" message)
server sent data ("D" message) without prior row description ("T" message)
server sent data ("D" message) without prior row description ("T" message)
.....

Eventually I lost patience and killed it. I am pretty sure I tried it once, but perhaps the table wasn't as big.
Any suggestions?

Thanks,
Elinor

Elinor Medezinski <elinor ( at ) bellatrix ( dot ) tau ( dot ) ac ( dot ) il> writes: > And then I found out that in postgres the only operator classes defined for > rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works > with points, only with type box and polygon. Therefore I also have to create > an operator class. No you don't. What you want is a functional index built on a box or polygon surrounding the point. For instance, given regression=# create table p1 (point_a point); CREATE TABLE regression=# create index p1i on p1 using rtree (box(point_a, point_a)); CREATE INDEX you could do searches for points enclosed in a specific box like this: regression=# explain select * from p1 where box(point_a, point_a) && '(0,1),(0,1)'::box; QUERY PLAN ----------------------------------------------------------------Index Scan using p1i on p1 (cost=0.00..17.07 rows=5 width=16) Index Cond: (box(point_a, point_a) && '(0,1),(0,1)'::box) (2 rows) since box-overlap (&&) is one of the rtree-indexable operators. The most useful way to solve your original problem seems to be regression=# explain select * from p1 a, p1 b where regression-# box(a.point_a, a.point_a) && box(circle(b.point_a,sqrt(2))) regression-# and (a.point_a <-> b.point_a) <= 1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------Nested Loop (cost=0.01..17220.00 rows=1667 width=32) Join Filter: (("inner".point_a <-> "outer".point_a) <= 1::double precision) -> Seq Scan on p1 b (cost=0.00..20.00 rows=1000 width=16) -> Index Scan using p1i on p1 a (cost=0.01..17.07 rows=5 width=16) Index Cond: (box(a.point_a, a.point_a) && box(circle("outer".point_a, 1.4142135623731::double precision))) (5 rows) The indexable condition finds "a" rows that are within the bounding box of a circle surrounding the "b" row, and then we only need to apply the exact distance check to those rows. (If you're wondering about the sqrt(2), there's an oddity in the built-in circle-to-box function: it divides the circle radius by sqrt(2). I think this is a bug and will propose changing it for 7.5.) regards, tom lane

On Wed, 2004-03-10 at 07:27, Tom Lane wrote:

Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes:
> I'm trying to find duplicate entries, where two entries are considered 
> duplicates if  they're within a radius of 1, meaning something like
> "select point from pointtable where distance between points <=1".
> Obviously this is not SQL syntax.

Well, it is if you do a self-join:
select * from pointtable a, pointtable bwhere distance(a.point, b.point) <= 1;

Postgres spells the "distance" operator as "<->", so this becomes
select * from pointtable a, pointtable bwhere (a.point <-> b.point) <= 1;

Making it fast is a more difficult problem :-( ... if you write the
above query as-is then the system will sit there and compare each row of
pointtable to each other row, looking for pairs of rows that match the
where-clause.  Okay if you just have some thousands of rows, but on a
big table this will take longer than you want to wait.

> Also, I also tried to build an index on that column, but there's no operator 
> class for type point. How can I do that?

A btree index on a point column would be quite useless, since btree
understands only a one-dimensional continuum with less-than, equal,
greater-than relationships.  But I think you might be able to do
something with an rtree index.  I'd look at making an rtree index on
the unit box around each point, and then using an "overlaps" test as
an indexable coarse filter before the exact distance check.
		regards, tom lane+++++++++++++++++++++++++++++++++++++++++++This Mail Was Scanned By Mail-seCure Systemat the Tel-Aviv University CC.

Re: find close (duplicate) points + create index

From

Tom Lane

Date:

16 February 2005, 20:22:28

Elinor <elinor@wise1.tau.ac.il> writes:
>         After many many minutes, it started giving these lines:
>         server sent data ("D" message) without prior row description
>         ("T" message)

Up till fairly recently (8.0 or maybe 7.4), libpq would do that if it
ran out of memory to hold the query result.  I think you miswrote the
query and it's returning a huge number of rows.  If you actually want
to fetch a huge number of rows, try using a cursor ...

            regards, tom lane