Thread: optimizing a geo_distance() proximity query

optimizing a geo_distance() proximity query

From
Mark Stosberg
Date:
Hello,

I'm using geo_distance() from contrib/earthdistance and would like to find a
way to speed up the geo distance calculation if possible. This is for a
proximity search: "Show me adoptable pets within 250 miles of this
zipcode".
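
Roughly, the query I'm trying to speed up is shaped like this (a
simplified sketch; the table and column names follow the ones that come
up later in this thread, and '46201' stands in for whatever zipcode the
visitor entered):

 SELECT p.*
   FROM pets p
   JOIN shelters s ON s.shelter_id = p.shelter_id
   JOIN zipcodes z ON z.zipcode = s.postal_code_for_joining
  WHERE geo_distance(z.lon_lat,
          (SELECT lon_lat FROM zipcodes WHERE zipcode = '46201')) <= 250;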

I've researched a number of approaches to this, but none seem as workable
as I would have hoped.

I read this claim [1] that Jobster used a lookup table of pre-calculated
distances between zipcodes...it took about 100 million rows.

1. http://bostonsteamer.livejournal.com/831325.html

I'd like to avoid that, but I think there's a sound concept in there: we
are repeatedly making a complex calculation with the same inputs, and the
outputs are always the same.

The zipdy project [2] used some interesting approaches, both similar to
the large table idea. One variation involved a PL routine that would
look up the result in a cache table. If no result was found, it would
compute the result and add it to the cache table. Besides eventually
creating millions of rows in the cache table, I tried this technique and
found it was much slower than using geo_distance() without a cache.
Another variation in the zipdy distribution just uses several smaller
cache tables, like one for "zipcodes 25 miles away" and one for
"zipcodes 50 miles away". Equally unattractive.

2. http://www.cryptnet.net/fsp/zipdy/
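
For illustration, the cache-table variation looked roughly like this (a
sketch with hypothetical table and function names; the real zipdy code
differs):

 CREATE TABLE zip_distance_cache (
     zip_from varchar(5),
     zip_to   varchar(5),
     miles    float8,
     PRIMARY KEY (zip_from, zip_to)
 );

 CREATE OR REPLACE FUNCTION cached_zip_distance(a varchar, b varchar)
 RETURNS float8 AS $$
 DECLARE
     d float8;
 BEGIN
     -- try the cache first
     SELECT miles INTO d FROM zip_distance_cache
      WHERE zip_from = a AND zip_to = b;
     IF NOT FOUND THEN
         -- fall back to geo_distance() and remember the answer
         SELECT geo_distance(z1.lon_lat, z2.lon_lat) INTO d
           FROM zipcodes z1, zipcodes z2
          WHERE z1.zipcode = a AND z2.zipcode = b;
         INSERT INTO zip_distance_cache VALUES (a, b, d);
     END IF;
     RETURN d;
 END;
 $$ LANGUAGE plpgsql;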

I looked at doing the calculation outside of PostgreSQL, and passing in
the resulting list of zipcodes in an explicit IN() list. This seemed
promising at first. Geo::Postalcode (Perl) could do the calculation in
5ms, which seemed to beat PostgreSQL. For a small proximity, I think
that combination might have performed better. However, some places have
close to 5,000 zipcodes within 250 miles. I tried putting /that/
resulting list into an explicit IN() clause, and it got much slower. :)
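
That attempt looked something like this (a sketch; in the slow case the
IN() list held thousands of zipcodes rather than three):

 -- zipcodes within the radius pre-computed in Perl with Geo::Postalcode
 SELECT p.*
   FROM pets p
   JOIN shelters s ON s.shelter_id = p.shelter_id
  WHERE s.postal_code_for_joining IN ('46201', '46202', '46203');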

I did find there are some possible optimizations that can be made to the
Haversine algorithm itself. As this post pointed out [3], we could
pre-convert the lat/lon pairs to radians, and also pre-compute their
sin() and cos() values. However, the person suggesting this approach
provided no benchmarks to show it was worth it, and I have no evidence
so far that it matters either.

3. http://www.voxclandestina.com/2006-09-01/optimizing-spatial-proximity-searches-in-sql/
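
If I did try it, it would look roughly like this (a sketch; I'm using the
spherical law of cosines form here, since that's what pre-computed sin()
and cos() values plug into most directly, with ~3963 miles as an
approximate Earth radius):

 ALTER TABLE zipcodes
     ADD COLUMN lat_rad float8,
     ADD COLUMN lon_rad float8,
     ADD COLUMN sin_lat float8,
     ADD COLUMN cos_lat float8;

 -- in my data model, lon_lat[1] is latitude and lon_lat[0] is longitude
 UPDATE zipcodes SET
     lat_rad = radians(lon_lat[1]),
     lon_rad = radians(lon_lat[0]),
     sin_lat = sin(radians(lon_lat[1])),
     cos_lat = cos(radians(lon_lat[1]));

 -- distance in miles between two pre-computed rows a and b:
 --   3963.0 * acos(a.sin_lat * b.sin_lat
 --               + a.cos_lat * b.cos_lat * cos(b.lon_rad - a.lon_rad))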

At this point I'm considering a couple of options:

A. Explore a way to add some in-memory caching or "memoizing" to
geo_distance(), so it would hold on to frequently computed values, but
without storing all the millions of possibilities.

B. Look at an alternate implementation. I suspect that given a small
enough radius and the relatively large size of zipcodes, a simpler
representation of the Earth's curvature could be used with sufficient
accuracy. Perhaps a cylinder, or even a flat projection... We currently
max out at 250 miles. (I just discussed this option with my wife, the
math teacher. :)

Advice from other people who have deployed high-performance proximity
searches with PostgreSQL would be appreciated!

   Mark

Re: optimizing a geo_distance() proximity query

From
Oleg Bartunov
Date:
Mark,

in astronomy we make extensive use of this kind of query, which we call
a radial search or cone search. There are several algorithms which
perform such queries efficiently using spherical coordinates.
Specifically, we use our Q3C algorithm, see
http://www.sai.msu.su/~megera/wiki/SkyPixelization for details,
which was designed for PostgreSQL and is freely available.

The paper is http://lnfm1.sai.msu.ru/~math/docs/adass_proceedings2005.pdf
Web site - http://q3c.sf.net/


Oleg

On Sat, 3 Feb 2007, Mark Stosberg wrote:

> Hello,
>
> I'm using geo_distance() from contrib/earthdistance and would like to find
> a way to speed up the geo distance calculation if possible. This is for a
> proximity search: "Show me adoptable pets within 250 miles of this
> zipcode".
>
> [...]

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: optimizing a geo_distance() proximity query

From
Bruno Wolff III
Date:
On Sat, Feb 03, 2007 at 14:00:26 -0500,
  Mark Stosberg <mark@summersault.com> wrote:
>
> I'm using geo_distance() from contrib/earthdistance and would like to find
> a way to speed up the geo distance calculation if possible. This is for a
> proximity search: "Show me adoptable pets within 250 miles of this
> zipcode".

If you are using the "cube" based part of the earth distance package,
then you can use gist indexes to speed those searches up. There are
functions for creating boxes that include all of the points within some
distance of a fixed point. This is lossy, so you need to recheck the
results if you don't want some points a bit farther away returned. Also
you would need to pick a point to be where the zip code is located,
rather than using area-based zip codes. However, if you have actual
addresses you could use the TIGER database to locate them instead of
just zip code locations.
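
The index and query end up shaped roughly like this (a sketch, assuming
plain latitude/longitude columns and a made-up center point; 16093.44 is
10 miles expressed in meters):

 CREATE INDEX zipcodes_earth_idx
     ON zipcodes USING gist (ll_to_earth(latitude, longitude));

 SELECT *
   FROM zipcodes
  WHERE earth_box(ll_to_earth(39.77, -86.15), 16093.44)
        @ ll_to_earth(latitude, longitude);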

Re: optimizing a geo_distance() proximity query

From
Mark Stosberg
Date:
Bruno Wolff III wrote:
> On Sat, Feb 03, 2007 at 14:00:26 -0500,
>   Mark Stosberg <mark@summersault.com> wrote:
>> I'm using geo_distance() from contrib/earthdistance and would like to find
>> a way to speed up the geo distance calculation if possible. This is for a
>> proximity search: "Show me adoptable pets within 250 miles of this
>> zipcode".
>
> If you are using the "cube" based part of the earth distance package,
> then you can use gist indexes to speed those searches up.

Thanks for the tip. Any idea what kind of improvement I can expect to
see, compared to using geo_distance()?

> There are functions for creating boxes that include all of the points some distance
> from a fixed point. This is lossy, so you need to recheck if you don't
> want some points a bit farther away returned. Also you would need to
> pick a point to be where the zip code is located, rather than using area
> based zip codes.

This is also interesting. Is this approach practical if I want to index
what's near each of about 40,000 US zipcodes, or is the approach mostly
useful if there are just a small number of fixed points to address?

I'm going to start installing the cube() and earth_distance() functions
today and see where I can get with the approach.

  Mark

Re: optimizing a geo_distance() proximity query

From
"Merlin Moncure"
Date:
On 2/5/07, Mark Stosberg <mark@summersault.com> wrote:
> Bruno Wolff III wrote:
> > On Sat, Feb 03, 2007 at 14:00:26 -0500,
> >   Mark Stosberg <mark@summersault.com> wrote:
> >> I'm using geo_distance() from contrib/earthdistance and would like to find
> >> a way to speed up the geo distance calculation if possible. This is for a
> >> proximity search: "Show me adoptable pets within 250 miles of this
> >> zipcode".
> >
> > If you are using the "cube" based part of the earth distance package,
> > then you can use gist indexes to speed those searches up.
>
> Thanks for the tip. Any idea what kind of improvement I can expect to
> see, compared to using geo_distance()?

a lot. be aware that gist takes longer to build than btree, but is very
fast to search.  Index search and filter to a box is basically an index
lookup (fast!). for mostly static datasets that involve a lot of
searching, gist is ideal.

keep in mind that the cube based gist search pulls out the smallest
lat/lon 'square' projected onto the earth which covers your circular
radius, so you have to do extra processing if you want exact matches.
(you can speed this up too, by doing an 'inner box' search and not
recomputing distance to those points)

merlin

Re: optimizing a geo_distance() proximity query

From
Bruno Wolff III
Date:
On Mon, Feb 05, 2007 at 14:47:25 -0500,
  Mark Stosberg <mark@summersault.com> wrote:
>
> This is also interesting. Is this approach practical if I want to index
> what's near each of about 40,000 US zipcodes, or is the approach mostly
> useful if there are just a small number of fixed points to address?

I think the answer depends on what your data model is. If you treat each
zip code as having a location at a single point, the earth distance stuff
should work. If you are trying to include the shape of each zip code in
your model and measure distances to the nearest point of zip codes, then
you will probably be better off using postgis.

Re: optimizing a geo_distance() proximity query (example and benchmark)

From
Mark Stosberg
Date:
Merlin Moncure wrote:
> On 2/5/07, Mark Stosberg <mark@summersault.com> wrote:
>> Bruno Wolff III wrote:
>> > On Sat, Feb 03, 2007 at 14:00:26 -0500,
>> >   Mark Stosberg <mark@summersault.com> wrote:
>> >> I'm using geo_distance() from contrib/earthdistance and would like
>> >> to find a way to speed up the geo distance calculation if possible.
>> >> This is for a proximity search: "Show me adoptable pets within 250
>> >> miles of this zipcode".
>> >
>> > If you are using the "cube" based part of the earth distance package,
>> > then you can use gist indexes to speed those searches up.
>>
>> Thanks for the tip. Any idea what kind of improvement I can expect to
>> see, compared to using geo_distance()?
>
> a lot. be aware that gist takes longer to build than btree, but very
> fast to search.  Index search and filter to box is basically an index
> lookup (fast!). for mostly static datasets that involve a lot of
> searching, gist is ideal.

The documentation in contrib/ didn't provide examples of how to create
the index or actually run the proximity search. Here's what I figured
out to do:

I added a new column as type 'cube':

 ALTER table zipcodes add column earth_coords cube;

Next I converted the old lat/lon data I had stored in a 'point'
type to the new format:

-- Make sure to get lat/lon in the right order for your data model!
 UPDATE zipcodes set earth_coords = ll_to_earth( lon_lat[1], lon_lat[0] );

Now I added a GIST index on the field:

 CREATE index earth_coords_idx on zipcodes using gist (earth_coords);

Finally, I was able to run a query, which I could see used the index (by
checking "EXPLAIN ANALYZE ..."):

 select * from zipcodes
 where earth_box('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
                 16093.44) @ earth_coords;

It's also notable that the units used are meters, not miles like
geo_distance(). That's what the magic number of "16093.44" is-- 10 miles
converted to meters.

When I benchmarked this query against the old geo_distance() variation,
it was about 200 times faster (~100ms vs .5ms).

However, my next step was to try a more "real world" query that involved
a more complex where clause and a couple of table joins. So far, that
result is coming out /slower/ with the new approach, even though the
index is being used. I believe this may be because of the additional
results found that are outside of the sphere, but inside the cube. This
causes additional rows that need processing in the joined tables.

Could someone post an example of how to further refine this so the
results more closely match what geo_distance() returns?

Any other indexing or optimization tips would be appreciated.

   Mark

Re: optimizing a geo_distance() proximity query (example and benchmark)

From
Bruno Wolff III
Date:
On Mon, Feb 05, 2007 at 18:01:05 -0500,
  Mark Stosberg <mark@summersault.com> wrote:
>
> It's also notable that the units used are meters, not miles like
> geo_distance(). That's what the magic number of "16093.44" is-- 10 miles
> converted to meters.

You can change the earth() function in earthdistance.sql before running it
to use a unit other than meters:

-- earth() returns the radius of the earth in meters. This is the only
-- place you need to change things for the cube base distance functions
-- in order to use different units (or a better value for the Earth's radius).

CREATE OR REPLACE FUNCTION earth() RETURNS float8
LANGUAGE 'sql' IMMUTABLE
AS 'SELECT ''6378168''::float8';
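
For example, to work in statute miles instead, it might look like this
(3963.19 being an approximate value for the Earth's radius in miles):

CREATE OR REPLACE FUNCTION earth() RETURNS float8
LANGUAGE 'sql' IMMUTABLE
AS 'SELECT ''3963.19''::float8';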

> However, my next step was to try a more "real world" query that involved
> a more complex where clause and a couple of table joins. So far, that
> result is coming out /slower/ with the new approach, even though the
> index is being used. I believe this may be because of the additional
> results found that are outside of the sphere, but inside the cube. This
> causes additional rows that need processing in the joined tables.

This is unlikely to be the cause. The ratio of the area of the bounding
box to the circle for small radii (small compared to the radius of the
earth, so that we can treat things as flat) is 4/pi = 1.27 (a square of
side 2r has area 4*r^2 versus pi*r^2 for the circle), which shouldn't
cause that much of a change.
It might be that you are getting a bad plan. The guess on the selectivity
of the gist constraint may not be very good.
Some people here may be able to tell you more if you show us explain
analyze output.

Re: optimizing a geo_distance() proximity query (example and benchmark)

From
"Merlin Moncure"
Date:
On 2/6/07, Mark Stosberg <mark@summersault.com> wrote:
> It's also notable that the units used are meters, not miles like
> geo_distance(). That's what the magic number of "16093.44" is-- 10 miles
> converted to meters.
>
> When I benchmarked this query against the old geo_distance() variation,
> it was about 200 times faster (~100ms vs .5ms).
>
> However, my next step was to try a more "real world" query that involved
> a more complex where clause and a couple of table joins. So far, that
> result is coming out /slower/ with the new approach, even though the
> index is being used. I believe this may be because of the additional
> results found that are outside of the sphere, but inside the cube. This
> causes additional rows that need processing in the joined tables.
>
> Could someone post an example of how to further refine this so the
> results more closely match what geo_distance() returns?

I agree with bruno...the extra time is probably not what you are
thinking...please post explain analyze results, etc.  However bruno's
ratio, while correct, does not tell the whole story because you have to
recheck the distance to every point in the returned set.

There is a small optimization you can make.  The query you wrote
automatically restricts the search to points inside a box that covers
the circle.  you can also accept, without a distance recheck, any point
inside the largest box that fits inside the circle:

select * from zipcodes
where
  earth_box('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
            inner_radius) @ earth_coords
  or
  (
    earth_box('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
              16093.44) @ earth_coords
    and
    geo_dist...
  );
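
one way to pick inner_radius is to convert the great-circle radius to a
chord with the gc_to_sec() helper from earthdistance and divide by
sqrt(3), so that even the corners of the inner box stay inside the
circle.  a sketch, using the cube-based earth_distance() for the recheck:

select * from zipcodes
where
  cube_enlarge('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
               gc_to_sec(16093.44) / sqrt(3), 3) @ earth_coords
  or
  (
    earth_box('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
              16093.44) @ earth_coords
    and
    earth_distance('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
                   earth_coords) <= 16093.44
  );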

you can also choose to omit the earth_coords column and calculate it
on the fly...there is no real performance hit for this but it does
make the sql a bit ugly.

merlin

Re: explain analyze output for review (was: optimizing a geo_distance()...)

From
Mark Stosberg
Date:
Bruno Wolff III wrote:
>
> Some people here may be able to tell you more if you show us explain
> analyze output.

Here is my explain analyze output. Some brief context of what's going
on. The goal is to find "Pets Near You".

We join the pets table to the shelters table to get a zipcode, and then
join the shelters table with "earth_distance" to get the coordinates of
the zipcode.  ( Is there any significant penalty for using a varchar vs
an int for a join? )

I've been investigating partial indexes for the pets table. It has about
300,000 rows, but only about 10 are "active", and those are the ones we
care about. Queries are also frequently made on males vs females, dogs
vs cats, or specific ages, and those specific cases seem like possible
candidates for partial indexes as well. I played with that approach
some, but had trouble coming up with anything that benchmarked faster.

If I'm reading the explain analyze output correctly, nearly all of the
time spent is related to the 'pets' table, but I can't see what to do
about it.

Help appreciated!

  Mark

Nested Loop  (cost=11.82..29.90 rows=1 width=0) (actual time=37.601..1910.787 rows=628 loops=1)
   ->  Nested Loop  (cost=6.68..20.73 rows=1 width=24) (actual time=35.525..166.547 rows=1727 loops=1)
         ->  Bitmap Heap Scan on pets  (cost=6.68..14.71 rows=1 width=4) (actual time=35.427..125.594 rows=1727 loops=1)
               Recheck Cond: (((sex)::text = 'f'::text) AND (species_id = 1))
               Filter: ((pet_state)::text = 'available'::text)
               ->  BitmapAnd  (cost=6.68..6.68 rows=2 width=0) (actual time=33.398..33.398 rows=0 loops=1)
                     ->  Bitmap Index Scan on pets_sex_idx  (cost=0.00..3.21 rows=347 width=0) (actual time=14.739..14.739 rows=35579 loops=1)
                           Index Cond: ((sex)::text = 'f'::text)
                     ->  Bitmap Index Scan on pet_species_id_idx  (cost=0.00..3.21 rows=347 width=0) (actual time=16.779..16.779 rows=48695 loops=1)
                           Index Cond: (species_id = 1)
         ->  Index Scan using shelters_pkey on shelters  (cost=0.00..6.01 rows=1 width=28) (actual time=0.012..0.014 rows=1 loops=1727)
               Index Cond: ("outer".shelter_id = shelters.shelter_id)
   ->  Bitmap Heap Scan on earth_distance  (cost=5.14..9.15 rows=1 width=9) (actual time=0.984..0.984 rows=0 loops=1727)
         Recheck Cond: ((cube_enlarge(('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube)::cube, 160930.130863421::double precision, 3) @ earth_distance.earth_coords) AND (("outer".postal_code_for_joining)::text = (earth_distance.zipcode)::text))
         ->  BitmapAnd  (cost=5.14..5.14 rows=1 width=0) (actual time=0.978..0.978 rows=0 loops=1727)
               ->  Bitmap Index Scan on earth_coords_idx  (cost=0.00..2.15 rows=42 width=0) (actual time=0.951..0.951 rows=1223 loops=1727)
                     Index Cond: (cube_enlarge(('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube)::cube, 160930.130863421::double precision, 3) @ earth_coords)
               ->  Bitmap Index Scan on earth_distance_zipcode_idx  (cost=0.00..2.74 rows=212 width=0) (actual time=0.015..0.015 rows=1 loops=1727)
                     Index Cond: (("outer".postal_code_for_joining)::text = (earth_distance.zipcode)::text)
 Total runtime: 1913.099 ms



Re: explain analyze output for review (was: optimizing a geo_distance()...)

From
"Adam Rich"
Date:
If I'm reading this correctly, 89% of the query time is spent
doing an index scan of earth_coords_idx.  Scanning pets is only
taking 6% of the total time.

Re: explain analyze output: vacuuming made a big difference.

From
Mark Stosberg
Date:
Mark Stosberg wrote:
>
> If I'm reading the explain analyze output correctly, nearly all of the
> time spent is related to the 'pets' table, but I can't see what to do
> about it.

Something about typing that message jarred my brain into thinking to try:

VACUUM FULL pets;
VACUUM ANALYZE pets;

Now the new cube-based calculation reliably benchmarks faster. The old
lat/lon system now benchmarks at 250ms, while the new cube-based code
benchmarks at 100ms, over a 50% savings!

That's good enough for me.

However, I'm still interested in advice on the other points I snuck into
my last message about joining with ints vs varchars and the best use of
partial indexes.

   Mark

Re: explain analyze output for review (was: optimizing a geo_distance()...)

From
Bruno Wolff III
Date:
On Tue, Feb 06, 2007 at 09:39:54 -0500,
  Mark Stosberg <mark@summersault.com> wrote:
>
> I've been investigating partial indexes for the pets table. It has about
> 300,000 rows, but only about 10 are "active", and those are the ones we
> care about. Queries are also frequently made on males vs females, dogs vs cats

It probably won't pay to make partial indexes on sex or species (at least
for the popular ones), as you aren't likely to save enough by eliminating only
half the cases to make up for maintaining another index. A partial index for
active rows probably does make sense.
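
Something along these lines might be worth trying (a sketch, using the
column names that show up in the explain analyze output earlier in the
thread):

CREATE INDEX pets_available_species_idx
    ON pets (species_id)
 WHERE pet_state = 'available';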

> or specific ages, and those specific cases seem like possible candidates for partial indexes
> as well. I played with that approach some, but had trouble coming up with anything that
> benchmarked faster.

tip: faster sorting for proximity queries by using cube_distance()

From
Mark Stosberg
Date:
Hello,

I wanted to share something else I learned in my proximity search work.
One of my requirements is to order results by the distance at which
matches are found from the center point.

When I did this using earth_distance(), the benefit of the earth_box()
technique over the old geo_distance() became minimal as I approached a
250 mile radius.

Switching to sorting by cube_distance() offered a huge benefit, allowing
the earth_distance() query to run in about 100ms vs 300ms for the
geo_distance() equivalent.
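
The ordering change itself is small.  A sketch (the center cube literal
is just an example, and 402336 is 250 miles expressed in meters):

 SELECT z.zipcode,
        cube_distance('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
                      z.earth_coords) AS approx_meters
   FROM zipcodes z
  WHERE earth_box('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube,
                  402336) @ z.earth_coords
  ORDER BY approx_meters;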

I checked the results that cube_distance() produced versus
earth_distance(). cube_distance() is always (not surprisingly) a little
smaller, since it measures the straight-line chord rather than the
distance along the Earth's surface, but the difference only seems to
grow to about a mile over a 250 mile radius. That's an acceptable margin
of error for this application, and may be for others as well.

   Mark