Thread: gist indexes for distance calculations

gist indexes for distance calculations

From

Marcelo Zabani

Date:

30 September 2010, 18:33:12

Hi everyone. I have a question, and it's well beyond me to even speculate about the inner workings of postgresql on this.

I have a "places" table, and a "coordinates" column, of type POINT.

If I want to find every place within, for example, a distance of 1 unit from an arbitrary point, I'll do:

CREATE INDEX ON places USING gist (circle(coordinates, 1));

And then I'll fetch the desired rows like this:

SELECT * FROM places WHERE circle(coordinates, 1) @> circle('(a,b)', 0);
(where (a,b) is an arbitrary point)

I'd like to know how this index works, though, as it seems to me the only way to have this kind of index to work is to calculate the distance of every point in a square of sides 2*1=2 units centered on (a, b).

So, am I wrong to think it works like that? If it does work like that, could I have instead two columns of type FLOAT (xcoordinate and ycoordinate) and create traditional b-tree indexes on both of these, and then do something like:

SELECT * FROM places WHERE xcoordinate >= (a-1) AND xcoordinate <= (a+1) AND ycoordinate >= (b-1) AND ycoordinate <= (b+1) And SQRT(POW(a-xcoordinate,2)+POW(b-ycoordinate,2))<=1;

If you can also pinpoint me to where I can find this sort of information (index utilization and planning, performance tuning), I'd be very grateful.
Thank you already,
Marcelo Zabani.

Re: gist indexes for distance calculations

From

Tom Lane

Date:

30 September 2010, 19:16:47

Marcelo Zabani <mzabani@gmail.com> writes:
> CREATE INDEX ON places USING gist (circle(coordinates, 1));

> I'd like to know how this index works, though, as it seems to me the only
> way to have this kind of index to work is to calculate the distance of every
> point in a square of sides 2*1=2 units centered on (a, b).

I believe it is a bounding-box based implementation; that is, each
non-leaf entry in the index stores the bounding box that covers all the
entries in the child page (plus its children if any).  So a lookup
descends to only those child pages that could possibly contain entries
overlapping the target circle.

> So, am I wrong to think it works like that? If it does work like that, could
> I have instead two columns of type FLOAT (xcoordinate and ycoordinate) and
> create traditional b-tree indexes on both of these, and then do something
> like:

> SELECT * FROM places WHERE xcoordinate >= (a-1) AND xcoordinate <= (a+1) AND
> ycoordinate >= (b-1) AND ycoordinate <= (b+1) And
> SQRT(POW(a-xcoordinate,2)+POW(b-ycoordinate,2))<=1;

Well, there's nothing stopping you from doing that, but search
performance is likely to be a whole lot worse than with an actual 2-D
index.  The reason is that it'll have to fetch index entries for
everything in the vertical strip between a-1 and a+1, as well as
everything in the horizontal strip between b-1 and b+1; most of which is
nowhere near the target.  If your circles are always very very small
this might work tolerably, but in most applications the amount of stuff
fetched soon gets out of hand.

            regards, tom lane

Re: gist indexes for distance calculations

From

Marcelo Zabani

Date:

01 October 2010, 02:50:22

So let me see if I understand, when searching for everyone within radius "r" of point (a,b), the gist index will be used like this:
* C is the circle centered on (a,b) of radius "r"

1. Traverse down the tree, starting at the root. Only go down nodes whose bounding-box has a non-empty intersection with the circle C (how fast is this verification?)
2. Once the leaves are reached, verify for everyone of them whether they're inside C or not, returning those that are.

If this really is how it happens, then I ask: What about boxes that intersect (does that happen)? What if the boxes aren't "nice" (boxes with a very small desired intersection with the circle and a very large quantity of unwanted rows).

Also, you've said that with b-tree indexes on two orthogonal coordinates (two columns) postgresql would need to check the ENTIRE vertical strip bounded by a-1 and a+1 and the ENTIRE horizontal strip bounded by b-1 and b+1 (two limitless, "infinite" rectangles)? This leads me to another question:

- Isn't there an equality check between all the rows returned by both indexes, and then the actual distance calculations are performed only for those returned by both indexes? What if I have many more indexes on my table, how are things done?

And if I may, one last question:
- Is box-bounding the index strategy used for all geometric operations with gist indexes?

Also, to Oleg:
I had personally tested the use of gist indexes (without limiting the number of returned rows, however) for more than 15 million rows (with their coordinates distributed a VERY LARGE area, sadly). The results were still impressive to me (although I didn't know what to expect, maximum running times of around 17ms seemed great to me!).
And sorry for sending this message to your personal email (my mistake).

Thanks a lot for all the help, if you can lead me to any docs/articles, I'll gladly read them.

Re: gist indexes for distance calculations

From

Karim Nassar

Date:

01 October 2010, 04:46:21

Thanks a lot for all the help, if you can lead me to any docs/articles, I'll gladly read them.

I found this: http://en.wikipedia.org/wiki/R-tree

Looks like what Tom was talking about, ja?

Karim

Re: gist indexes for distance calculations

From

Jesper Krogh

Date:

01 October 2010, 05:56:32

On 2010-09-30 20:33, Marcelo Zabani wrote:
> If you can also pinpoint me to where I can find this sort of information
> (index utilization and planning, performance tuning), I'd be very grateful.
> Thank you already,
>

Isn't this what the knngist patches are for?
https://commitfest.postgresql.org/action/patch_view?id=350

http://www.sai.msu.su/~megera/wiki/knngist


--
Jesper

Re: gist indexes for distance calculations

From

Merlin Moncure

Date:

01 October 2010, 16:04:28

On Thu, Sep 30, 2010 at 2:33 PM, Marcelo Zabani <mzabani@gmail.com> wrote:
> Hi everyone. I have a question, and it's well beyond me to even speculate
> about the inner workings of postgresql on this.
>
> I have a "places" table, and a "coordinates" column, of type POINT.
>
> If I want to find every place within, for example, a distance of 1 unit from
> an arbitrary point, I'll do:
>
> CREATE INDEX ON places USING gist (circle(coordinates, 1));
>
> And then I'll fetch the desired rows like this:
>
> SELECT * FROM places WHERE circle(coordinates, 1) @> circle('(a,b)', 0);
> (where (a,b) is an arbitrary point)
>
> I'd like to know how this index works, though, as it seems to me the only
> way to have this kind of index to work is to calculate the distance of every
> point in a square of sides 2*1=2 units centered on (a, b).
>
> So, am I wrong to think it works like that? If it does work like that, could
> I have instead two columns of type FLOAT (xcoordinate and ycoordinate) and
> create traditional b-tree indexes on both of these, and then do something
> like:
>
> SELECT * FROM places WHERE xcoordinate >= (a-1) AND xcoordinate <= (a+1) AND
> ycoordinate >= (b-1) AND ycoordinate <= (b+1) And
> SQRT(POW(a-xcoordinate,2)+POW(b-ycoordinate,2))<=1;
>
> If you can also pinpoint me to where I can find this sort of information
> (index utilization and planning, performance tuning), I'd be very grateful.

A quick heads up: It's possible, although it may not necessarily help,
to further reduce distance calcs by drawing an inner bounding box of
points that are confirmed good.  Your outer box is made by squaring
the circle on lat/lon projection -- you can also calculate the biggest
lat lon 'rectangle' that completely fits inside the circle, and play
with a query that looks something like this (pseudo sql):

select * from points where (point inside good box) or (point inside
possible box and dist(point, mypoint < n));

You get reduction of dist calcs at expense of second gist lookup.  You
can also, of course, do this on application side, but what's the fun
in that? :-).

merlin

Re: gist indexes for distance calculations

From

Marcelo Zabani

Date:

01 October 2010, 17:12:12

Thanks a lot everyone for all the info! It is all really helpful.

2010/10/1 Merlin Moncure <mmoncure@gmail.com>

On Thu, Sep 30, 2010 at 2:33 PM, Marcelo Zabani <mzabani@gmail.com> wrote:
> Hi everyone. I have a question, and it's well beyond me to even speculate
> about the inner workings of postgresql on this.
>
> I have a "places" table, and a "coordinates" column, of type POINT.
>
> If I want to find every place within, for example, a distance of 1 unit from
> an arbitrary point, I'll do:
>
> CREATE INDEX ON places USING gist (circle(coordinates, 1));
>
> And then I'll fetch the desired rows like this:
>
> SELECT * FROM places WHERE circle(coordinates, 1) @> circle('(a,b)', 0);
> (where (a,b) is an arbitrary point)
>
> I'd like to know how this index works, though, as it seems to me the only
> way to have this kind of index to work is to calculate the distance of every
> point in a square of sides 2*1=2 units centered on (a, b).
>
> So, am I wrong to think it works like that? If it does work like that, could
> I have instead two columns of type FLOAT (xcoordinate and ycoordinate) and
> create traditional b-tree indexes on both of these, and then do something
> like:
>
> SELECT * FROM places WHERE xcoordinate >= (a-1) AND xcoordinate <= (a+1) AND
> ycoordinate >= (b-1) AND ycoordinate <= (b+1) And
> SQRT(POW(a-xcoordinate,2)+POW(b-ycoordinate,2))<=1;
>
> If you can also pinpoint me to where I can find this sort of information
> (index utilization and planning, performance tuning), I'd be very grateful.

A quick heads up: It's possible, although it may not necessarily help,
to further reduce distance calcs by drawing an inner bounding box of
points that are confirmed good. Your outer box is made by squaring
the circle on lat/lon projection -- you can also calculate the biggest
lat lon 'rectangle' that completely fits inside the circle, and play
with a query that looks something like this (pseudo sql):

select * from points where (point inside good box) or (point inside
possible box and dist(point, mypoint < n));

You get reduction of dist calcs at expense of second gist lookup. You
can also, of course, do this on application side, but what's the fun
in that? :-).

merlin

--
Marcelo Zabani
(19) 9341-0221

Re: gist indexes for distance calculations

From

Robert Haas

Date:

12 October 2010, 00:52:32

On Fri, Oct 1, 2010 at 1:56 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> On 2010-09-30 20:33, Marcelo Zabani wrote:
>>
>> If you can also pinpoint me to where I can find this sort of information
>> (index utilization and planning, performance tuning), I'd be very
>> grateful.
>> Thank you already,
>>
>
> Isn't this what the knngist patches are for?
> https://commitfest.postgresql.org/action/patch_view?id=350
>
> http://www.sai.msu.su/~megera/wiki/knngist

Those are for when you want to order by distance; the OP is trying to
*filter* by distance, which is different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company