Thread: find close (duplicate) points + create index
Hello, I have a table with one column -- "CREATE TABLE pointtable (point POINT)". I'm trying to find duplicate entries, where two entries are considered duplicates if they're within a radius of 1, meaning something like ~ "select point from pointtable where distance between points <=1". Obviously this is not SQL syntax. Is there a way to do this - search within a column itself? Also, I also tried to build an index on that column, but there's no operator class for type point. How can I do that? Thanks, Elinor
On Thu, Mar 04, 2004 at 12:42:23 +0200, Elinor Medezinski <elinor@bellatrix.tau.ac.il> wrote: > Hello, > > I have a table with one column -- > "CREATE TABLE pointtable (point POINT)". > I'm trying to find duplicate entries, where two entries are considered > duplicates if they're within a radius of 1, meaning something like ~ "select > point from pointtable where distance between points <=1". > Obviously this is not SQL syntax. Is there a way to do this - search within a > column itself? Yes, you can join a table to itself and return records matching some critera. This isn't going to be fast. I didn't see a distance between two points function, but it is probably there somewhere. If not you can check if one point is contained in a circle of radius 1 centered at the other point. This is something that could use an index, though it probably wouldn't help when checking the whole table. It probably would speed up checking a single point for conflicts.
Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes: > I'm trying to find duplicate entries, where two entries are considered > duplicates if they're within a radius of 1, meaning something like > "select point from pointtable where distance between points <=1". > Obviously this is not SQL syntax. Well, it is if you do a self-join: select * from pointtable a, pointtable b where distance(a.point, b.point) <= 1; Postgres spells the "distance" operator as "<->", so this becomes select * from pointtable a, pointtable b where (a.point <-> b.point) <= 1; Making it fast is a more difficult problem :-( ... if you write the above query as-is then the system will sit there and compare each row of pointtable to each other row, looking for pairs of rows that match the where-clause. Okay if you just have some thousands of rows, but on a big table this will take longer than you want to wait. > Also, I also tried to build an index on that column, but there's no operator > class for type point. How can I do that? A btree index on a point column would be quite useless, since btree understands only a one-dimensional continuum with less-than, equal, greater-than relationships. But I think you might be able to do something with an rtree index. I'd look at making an rtree index on the unit box around each point, and then using an "overlaps" test as an indexable coarse filter before the exact distance check. regards, tom lane
On Wed, 10 Mar 2004, Tom Lane wrote: > Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes: > > I'm trying to find duplicate entries, where two entries are considered > > duplicates if they're within a radius of 1, meaning something like > > "select point from pointtable where distance between points <=1". > > Obviously this is not SQL syntax. > > Well, it is if you do a self-join: > > select * from pointtable a, pointtable b > where distance(a.point, b.point) <= 1; > > Postgres spells the "distance" operator as "<->", so this becomes > > select * from pointtable a, pointtable b > where (a.point <-> b.point) <= 1; > > Making it fast is a more difficult problem :-( ... if you write the > above query as-is then the system will sit there and compare each row of > pointtable to each other row, looking for pairs of rows that match the > where-clause. Okay if you just have some thousands of rows, but on a > big table this will take longer than you want to wait. I'm guessing distance is defined to be: sqrt( (x1-x0)^2 + (y1-y0)^2 ) or sqrt( (x1-x0)^2 + (y1-y0)^2 + (z1-z0)^2 ) You obviously need to find the difference in the x,y (and possibly z if used) coordinates, as is multiplying that difference by itself and summing them. However, you don't need to take the square root, which is often a computationally expensive thing. Two points which are close in terms of distance, are also close in terms of distance squared. If you are only interested in points that are within a distance of 1 (or a distance squared of 1*1), then you can stop calculating the distance when any of the differences is bigger than 1. Since you don't care, other than knowing it is too big, just stop the calculation and return some number bigger than 1. That should help in speeding things up a bit. How you do this in PostgreSQL, I can't help you. :-) > > Also, I also tried to build an index on that column, but > > there's no operator class for type point. How can I do that? > > A btree index on a point column would be quite useless, since btree > understands only a one-dimensional continuum with less-than, equal, > greater-than relationships. But I think you might be able to do > something with an rtree index. I'd look at making an rtree index on > the unit box around each point, and then using an "overlaps" test as > an indexable coarse filter before the exact distance check. You probably want to look a little at how GIS (Geographical Information Systems) works. There are lots of different techniques they use to search and partition things. Quadtree and oct-tree come to mind. Gord
Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes: > And then I found out that in postgres the only operator classes defined for > rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works > with points, only with type box and polygon. Therefore I also have to create > an operator class. No you don't. What you want is a functional index built on a box or polygon surrounding the point. For instance, given regression=# create table p1 (point_a point); CREATE TABLE regression=# create index p1i on p1 using rtree (box(point_a, point_a)); CREATE INDEX you could do searches for points enclosed in a specific box like this: regression=# explain select * from p1 where box(point_a, point_a) && '(0,1),(0,1)'::box; QUERY PLAN ---------------------------------------------------------------- Index Scan using p1i on p1 (cost=0.00..17.07 rows=5 width=16) Index Cond: (box(point_a, point_a) && '(0,1),(0,1)'::box) (2 rows) since box-overlap (&&) is one of the rtree-indexable operators. The most useful way to solve your original problem seems to be regression=# explain select * from p1 a, p1 b where regression-# box(a.point_a, a.point_a) && box(circle(b.point_a,sqrt(2))) regression-# and (a.point_a <-> b.point_a) <= 1; QUERY PLAN -------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.01..17220.00 rows=1667 width=32) Join Filter: (("inner".point_a <-> "outer".point_a) <= 1::double precision) -> Seq Scan on p1 b (cost=0.00..20.00 rows=1000 width=16) -> Index Scan using p1i on p1 a (cost=0.01..17.07 rows=5 width=16) Index Cond: (box(a.point_a, a.point_a) && box(circle("outer".point_a, 1.4142135623731::double precision))) (5 rows) The indexable condition finds "a" rows that are within the bounding box of a circle surrounding the "b" row, and then we only need to apply the exact distance check to those rows. (If you're wondering about the sqrt(2), there's an oddity in the built-in circle-to-box function: it divides the circle radius by sqrt(2). I think this is a bug and will propose changing it for 7.5.) regards, tom lane
What's going on? the same message I've sent 3 times! not once did it arrive
I am trying to install postgreSQL with perl support on a Slackware Linux 9.1. I have compiled perl with shared libraries, the rest of the settings default. Then I configured postgres with --with-perl and it worked fine. But when I am executing 'make' I get an error: plperl no such file or directory. I have tried everything. Please help!
You suggested: > select * from pointtable a, pointtable b > where (a.point <-> b.point) <= 1; Thanks, Thats what I'll do. > Making it fast is a more difficult problem :-( ... if you write the > above query as-is then the system will sit there and compare each row of > pointtable to each other row, looking for pairs of rows that match the > where-clause. Okay if you just have some thousands of rows, but on a > big table this will take longer than you want to wait. This query will only work on a few thousand lines, but I will have other queries on type point that will require comparing tables with millions of rows. Therefore I must learn how to build indexes on points. > A btree index on a point column would be quite useless, since btree > understands only a one-dimensional continuum with less-than, equal, > greater-than relationships. But I think you might be able to do > something with an rtree index. That much I know. I didn't find how I can use rtree to build an index on points, seeing how when I tried the following: "create INDEX Phot_point_a ON Phot USING RTREE (point_a);" I got this error: "ERROR: data type point has no default operator class for access method "rtree" HINT: You must specify an operator class for the index or define a default operator class for the data type." And then I found out that in postgres the only operator classes defined for rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works with points, only with type box and polygon. Therefore I also have to create an operator class. I didn't understand how to do that. Do you know how? > I'd look at making an rtree index on > the unit box around each point, and then using an "overlaps" test as > an indexable coarse filter before the exact distance check. I didn't understand this either. Thanks, Elinor
On Wed, Mar 10, 2004 at 11:22:47 +0200, Elinor Medezinski <elinor@bellatrix.tau.ac.il> wrote: > > That much I know. I didn't find how I can use rtree to build an index on > points, seeing how when I tried the following: > "create INDEX Phot_point_a ON Phot USING RTREE (point_a);" > I got this error: > "ERROR: data type point has no default operator class for access method > "rtree" > HINT: You must specify an operator class for the index or define a default > operator class for the data type." > > And then I found out that in postgres the only operator classes defined for > rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works > with points, only with type box and polygon. Therefore I also have to create > an operator class. I didn't understand how to do that. Do you know how? rtree indexes on points doesn't make sense since containment is the same as equals. You want to use boxes. A point is a box with the same point for both of the defining corners. When you do searches you use a box that describes where you are looking and look for boxes (including points) that are located in the search box. You can also use the cube type and gist indexes to do the same thing.
On Wed, 10 Mar 2004, Elinor Medezinski wrote: > > I'd look at making an rtree index on > > the unit box around each point, and then using an "overlaps" test as > > an indexable coarse filter before the exact distance check. > > I didn't understand this either. Lets say a 2D point comes from GPS, and has coordinates (from some particular zone) of (483121.534, 5124378.745). The units are meters, and so these coordinates are both specified to the nearest millimeter (which is too many significant figures for a GPS measurement without special equipment). We can construct a small box around this point, by chopping off the fractional meters, and either adding or subtracting 1. We will also assume that any boxes we make, have sides parallel to our axis. So, the Upper Left point is (483120,5124379) and the Lower Right point is (483122,5124377). This isn't a unit box, it's a 4 unit box. But, it requires less care in constructing it (in my mind) than making the unit box. But you can work with unit boxes too. To check that are box (point) of interest overlaps another box, you check to see that the minimum X of this box is greater than the minimum X of the other box and less than the maximum X of the other box. Likewise for the Y coordinates. You are considering small boxes around points, and all the boxes are squares. The more general problem, allows for N-sided polygons, where it is possible for the bounding boxes to overlap, but for the one box to not be inside the other box. If the bounding boxes don't overlap, there is no sense calculating a distance (in some cases) as the points/boxes are too far apart. Gord
Hi Tom,On Wed, 2004-03-10 at 07:27, Tom Lane wrote:
Following your advice quite a long time ago, I built an index on my very large table called object:
myDB=# select count(*) from object;
count
---------
2797036
(1 row)
Time: 8354.25 ms
using RTREE INDEX on a column of type point:
myDB=# create index object_point_a on object using rtree (box(point_a,point_a));
CREATE INDEX
Time: 220546.54 ms
Then I tried the query you suggested to find points closer to each other than 1.5:
SELECT * FROM object a,object b
WHERE box(a.point_a,a.point_a)&&box(circle(b.point_a,sqrt(2)))
AND (a.point_a<->b.point_a)<1;
After many many minutes, it started giving these lines:
server sent data ("D" message) without prior row description ("T" message)
server sent data ("D" message) without prior row description ("T" message)
server sent data ("D" message) without prior row description ("T" message)
.....
Eventually I lost patience and killed it. I am pretty sure I tried it once, but perhaps the table wasn't as big.
Any suggestions?
Thanks,
Elinor
Elinor Medezinski <elinor ( at ) bellatrix ( dot ) tau ( dot ) ac ( dot ) il> writes:> And then I found out that in postgres the only operator classes defined for > rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works > with points, only with type box and polygon. Therefore I also have to create > an operator class. No you don't. What you want is a functional index built on a box or polygon surrounding the point. For instance, given regression=# create table p1 (point_a point); CREATE TABLE regression=# create index p1i on p1 using rtree (box(point_a, point_a)); CREATE INDEX you could do searches for points enclosed in a specific box like this: regression=# explain select * from p1 where box(point_a, point_a) && '(0,1),(0,1)'::box; QUERY PLAN ----------------------------------------------------------------Index Scan using p1i on p1 (cost=0.00..17.07 rows=5 width=16) Index Cond: (box(point_a, point_a) && '(0,1),(0,1)'::box) (2 rows) since box-overlap (&&) is one of the rtree-indexable operators. The most useful way to solve your original problem seems to be regression=# explain select * from p1 a, p1 b where regression-# box(a.point_a, a.point_a) && box(circle(b.point_a,sqrt(2))) regression-# and (a.point_a <-> b.point_a) <= 1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------Nested Loop (cost=0.01..17220.00 rows=1667 width=32) Join Filter: (("inner".point_a <-> "outer".point_a) <= 1::double precision) -> Seq Scan on p1 b (cost=0.00..20.00 rows=1000 width=16) -> Index Scan using p1i on p1 a (cost=0.01..17.07 rows=5 width=16) Index Cond: (box(a.point_a, a.point_a) && box(circle("outer".point_a, 1.4142135623731::double precision))) (5 rows) The indexable condition finds "a" rows that are within the bounding box of a circle surrounding the "b" row, and then we only need to apply the exact distance check to those rows. (If you're wondering about the sqrt(2), there's an oddity in the built-in circle-to-box function: it divides the circle radius by sqrt(2). I think this is a bug and will propose changing it for 7.5.) regards, tom lane
Elinor Medezinski <elinor@bellatrix.tau.ac.il> writes: > I'm trying to find duplicate entries, where two entries are considered > duplicates if they're within a radius of 1, meaning something like > "select point from pointtable where distance between points <=1". > Obviously this is not SQL syntax. Well, it is if you do a self-join: select * from pointtable a, pointtable bwhere distance(a.point, b.point) <= 1; Postgres spells the "distance" operator as "<->", so this becomes select * from pointtable a, pointtable bwhere (a.point <-> b.point) <= 1; Making it fast is a more difficult problem :-( ... if you write the above query as-is then the system will sit there and compare each row of pointtable to each other row, looking for pairs of rows that match the where-clause. Okay if you just have some thousands of rows, but on a big table this will take longer than you want to wait. > Also, I also tried to build an index on that column, but there's no operator > class for type point. How can I do that? A btree index on a point column would be quite useless, since btree understands only a one-dimensional continuum with less-than, equal, greater-than relationships. But I think you might be able to do something with an rtree index. I'd look at making an rtree index on the unit box around each point, and then using an "overlaps" test as an indexable coarse filter before the exact distance check. regards, tom lane+++++++++++++++++++++++++++++++++++++++++++This Mail Was Scanned By Mail-seCure Systemat the Tel-Aviv University CC.
Elinor <elinor@wise1.tau.ac.il> writes: > After many many minutes, it started giving these lines: > server sent data ("D" message) without prior row description > ("T" message) Up till fairly recently (8.0 or maybe 7.4), libpq would do that if it ran out of memory to hold the query result. I think you miswrote the query and it's returning a huge number of rows. If you actually want to fetch a huge number of rows, try using a cursor ... regards, tom lane