Thread: application of KNN code to US zipcode searches?
We perform over 1,000,000 searches each day for "adoptable shelter pets near your zipcode". We already have adequate performance for these searches using the "cube" contrib, but the new KNN work in 9.1 seemed like it might be a promising way to speed this up even further. I installed PostgreSQL 9.1 on my laptop to evaluate it, using this post as a reference: http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/ The first task was to translate a geo-spatial search to use the new KNN syntax. I'm most familiar with two approaches to geo-spatial searching with PostgreSQL. The first is the older "earthdistance" approach, using "point" types and the "<@>" operator. The second is the one we are using now, which uses a cube type, the "cube_distance()" and "earth_box()" method and a GIST index on the cube type. Immediately there is a hurdle in that KNN only appears to work with point types and the <-> operator, which does simple point-to-point distance, instead of the distance-around-the-earth. Still, I thought that could be enough of an approximation to test the waters. I started with some "real world" queries that involved some table joins, and when those failed to show improvement, I worked with some reduced-test-case queries. While I could confirm the new GIST index was being used on the point type, I couldn't get a query to benchmark better when it was invoked. I'm wondering if perhaps US zipcode searches aren't good use of this technology, perhaps because the data set is too small ( About 40,000 zipcodes ). Given that we can already do GIST-indexed searches with the cube type that provide good reasonable approximations for zipcode-radius searches, are others planning to eventually apply the KNN work to US zipcode searches? Sample EXPLAIN output and query times are below. Mark EXPLAIN ANALYZE SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius FROM zipcodes ; ------------------------------------------- Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.019..84.543 rows=41483 loops=1) Total runtime: 148.129 ms EXPLAIN ANALYZE SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat <-> '(-118.412426,34.096629)'; -------------------------------------------------- Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.451..141.590 rows=41483 loops=1) Order By: (lon_lat <-> '(-118.412426,34.096629)'::point) Total runtime: 206.392 ms
Mark Stosberg <mark@summersault.com> wrote: > Sample EXPLAIN output and query times are below. > Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) > (actual time=0.019..84.543 rows=41483 loops=1) > Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 > rows=41483 width=22) (actual time=0.451..141.590 rows=41483 > loops=1) I thought the benefit of KNN was that you could retrieve the rows in distance order, so that a query for the closest 20 locations (for example) would be very fast. I wouldn't have expected it to be helpful when you're selecting all the rows regardless of distance. -Kevin
> I thought the benefit of KNN was that you could retrieve the rows in > distance order, so that a query for the closest 20 locations (for > example) would be very fast. I wouldn't have expected it to be > helpful when you're selecting all the rows regardless of distance. Kevin, Thanks for the feedback. You are right that my "reduced test case" wasn't a good approximation. I added a limit, to simulate finding the 100 zipcodes closest to 90210. Below I compare 4 approaches to the same query: 1. Cube search 2. Earth Distance Search 3. Simple point distance (no index) 4. Simple point distance (KNN) Now KNN benchmarks to be almost 100x faster! That's very promising. Then there's only the issue that simple point distance is not expected to be a good enough approximation of earth-distances. Perhaps that can be solved by pre-computing coordinates based on the lat/long pairs.... much like the map projections used to present a curved surface on a flat map? Given that's OK to be be a few miles off, it seems we have some leeway here. Recommendations? Mark EXPLAIN ANALYZE SELECT zipcode, cube_distance( '(-2513120.64361786, -4645511.0460328, 3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius FROM zipcodes ORDER BY radius LIMIT 100; --------------------------------------------------------------- Limit (cost=2946.70..2946.95 rows=100 width=62) (actual time=167.650..168.064 rows=100 loops=1) -> Sort (cost=2946.70..3050.40 rows=41483 width=62) (actual time=167.644..167.829 rows=100 loops=1) Sort Key: ((cube_distance('(-2513120.64361786, -4645511.0460328, 3575538.9507084)'::cube, earth_coords) / 1609.344::double precision)) Sort Method: top-N heapsort Memory: 20kB -> Seq Scan on zipcodes (cost=0.00..1361.24 rows=41483 width=62) (actual time=0.030..90.807 rows=41483 loops=1) Total runtime: 168.300 ms ############################################################3 -- Using Earthdistance EXPLAIN ANALYZE SELECT zipcode, lon_lat <@> '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat <@> '(-118.412426,34.096629)' LIMIT 100; ------------------------------------------------------------ Limit (cost=2842.99..2843.24 rows=100 width=22) (actual time=187.995..188.451 rows=100 loops=1) -> Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual time=187.989..188.149 rows=100 loops=1) Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point)) Sort Method: top-N heapsort Memory: 20kB -> Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.033..108.203 rows=41483 loops=1) Total runtime: 188.660 ms ########################################## Using simple point distance, but with no Gist Index: EXPLAIN ANALYZE SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat <-> '(-118.412426,34.096629)' LIMIT 100; -------------------------------------------------------- Limit (cost=2842.99..2843.24 rows=100 width=22) (actual time=160.574..161.057 rows=100 loops=1) -> Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual time=160.568..160.691 rows=100 loops=1) Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point)) Sort Method: top-N heapsort Memory: 20kB -> Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.027..84.610 rows=41483 loops=1) Total runtime: 161.226 ms ######################################### -- Using KNN-GIST index EXPLAIN ANALYZE SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat <-> '(-118.412426,34.096629)' LIMIT 100; ------------------------------------------------------------------ Limit (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892 rows=100 loops=1) -> Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1) Order By: (lon_lat <-> '(-118.412426,34.096629)'::point) Total runtime: 2.198 ms
* Mark Stosberg (mark@summersault.com) wrote: > Recommendations? PostGIS, geometry columns, and UTM.. I'm not sure where they are wrt adding KNN support, but it's something they've been anxious to have for a while, so I expect support will come quickly. Thanks, Stephen
Attachment
> PostGIS, geometry columns, and UTM.. I'm not sure where they are wrt > adding KNN support, but it's something they've been anxious to have for > a while, so I expect support will come quickly. I've looked into this a little more. One approach seems to be to project the lat/long pairs on to a flat plane using the Albers projection (which would be a one-time calculation), and then the current KNN point/distance calculations could be used. Here's a Perl module that references the Albers projection (although it's not yet clear to me how to use it): http://search.cpan.org/dist/PDL/ And a Wikipedia page on various calculation possibilities: http://en.wikipedia.org/wiki/Geographical_distance#Flat-surface_formulae Further suggestions welcome. Thanks, Mark
On 17.02.2011 17:20, Mark Stosberg wrote: >> I thought the benefit of KNN was that you could retrieve the rows in >> distance order, so that a query for the closest 20 locations (for >> example) would be very fast. I wouldn't have expected it to be >> helpful when you're selecting all the rows regardless of distance. > > Kevin, > > Thanks for the feedback. You are right that my "reduced test case" > wasn't a good approximation. I added a limit, to simulate finding the > 100 zipcodes closest to 90210. > > Below I compare 4 approaches to the same query: > > 1. Cube search > 2. Earth Distance Search > 3. Simple point distance (no index) > 4. Simple point distance (KNN) > > Now KNN benchmarks to be almost 100x faster! That's very promising. > Then there's only the issue that simple point distance is not expected > to be a good enough approximation of earth-distances. Perhaps that can > be solved by pre-computing coordinates based on the lat/long pairs.... > much like the map projections used to present a curved surface on a flat > map? Given that's OK to be be a few miles off, it seems we have some > leeway here. > > Recommendations? The existing opclasses only support distance-to-a-point, but I believe the KNN gist code is flexible enough that it could be used for distance to the edge of a shape as well. Someone just needs to write the operators and support functions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
I tried again to use KNN for a real-world query, and I was able to get it to add an approximately 6x speed-up vs the cube search or earthdistance methods ( from 300 ms to 50ms ). I had to make some notable changes for the KNN index to be considered. - Of course, I had to switch to using basic point/distance calculation. As previously noted, this still needs more work to confirm the accuracy and get the "distance" reported in miles. - The query planner didn't like it when the "ORDER BY" referred to a column value instead of a static value, even when I believe it should know that the column value never changes. See this pseudo-query where we look-up the coordinates for 90210 once: EXPLAIN ANALYZE SELECT pets.pet_id, zipcodes.lon_lat <-> center.lon_lat AS radius FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS center, pets JOIN shelters USING (shelter_id) JOIN zipcodes USING (zipcode) ORDER BY postal_codes.lon_lat <-> center.lon_lat limit 1000; This didn't use the KNN index until I changed the "center.lon_lat" in the ORDER BY to an explicit point value. I'm not sure if that's expected, or something I should take up with -hackers. This could be worked around by doing a initial query to look-up this value, and then feed a static value into this query. That's not ideal, but the combination would still be faster. - I had to drop the part of the WHERE clause which restricted the results to shelters within 50 miles from the target zipcode. However, I could set the "LIMIT" so high that I could get back "enough" pets, and then the application could trim out the results. Or, perhaps I could push this query down into a sub-select, and let PostgreSQL do a second pass to throw out some of the results. In any case, with a real-world speed-up of 6x, this looks like it will be worth it to us to continue to investigate.
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > The existing opclasses only support distance-to-a-point, but I believe > the KNN gist code is flexible enough that it could be used for distance > to the edge of a shape as well. Someone just needs to write the > operators and support functions. The distance has to be exactly computable from the index entry, so you'd need to store the whole shape in the index, not just a bounding box. Not sure how practical that will be for complex shapes. regards, tom lane
Mark Stosberg <mark@summersault.com> writes: > - The query planner didn't like it when the "ORDER BY" referred to a > column value instead of a static value, even when I believe it should > know that the column value never changes. See this pseudo-query where > we look-up the coordinates for 90210 once: > EXPLAIN ANALYZE > SELECT pets.pet_id, > zipcodes.lon_lat <-> center.lon_lat AS radius > FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS > center, pets > JOIN shelters USING (shelter_id) > JOIN zipcodes USING (zipcode) > ORDER BY postal_codes.lon_lat <-> center.lon_lat limit 1000; As phrased, that's a join condition, so there's no way that an index on a single table can possibly satisfy it. You could probably convert it to a sub-select though: ORDER BY postal_codes.lon_lat <-> (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') limit 1000; regards, tom lane
Mark, we investigating pgsphere http://pgsphere.projects.postgresql.org/, if we could add KNN support. Oleg On Thu, 17 Feb 2011, Mark Stosberg wrote: > >> I thought the benefit of KNN was that you could retrieve the rows in >> distance order, so that a query for the closest 20 locations (for >> example) would be very fast. I wouldn't have expected it to be >> helpful when you're selecting all the rows regardless of distance. > > Kevin, > > Thanks for the feedback. You are right that my "reduced test case" > wasn't a good approximation. I added a limit, to simulate finding the > 100 zipcodes closest to 90210. > > Below I compare 4 approaches to the same query: > > 1. Cube search > 2. Earth Distance Search > 3. Simple point distance (no index) > 4. Simple point distance (KNN) > > Now KNN benchmarks to be almost 100x faster! That's very promising. > Then there's only the issue that simple point distance is not expected > to be a good enough approximation of earth-distances. Perhaps that can > be solved by pre-computing coordinates based on the lat/long pairs.... > much like the map projections used to present a curved surface on a flat > map? Given that's OK to be be a few miles off, it seems we have some > leeway here. > > Recommendations? > > Mark > > EXPLAIN ANALYZE > SELECT zipcode, > cube_distance( '(-2513120.64361786, -4645511.0460328, > 3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius > FROM zipcodes ORDER BY radius LIMIT 100; > > --------------------------------------------------------------- > Limit (cost=2946.70..2946.95 rows=100 width=62) (actual > time=167.650..168.064 rows=100 loops=1) > -> Sort (cost=2946.70..3050.40 rows=41483 width=62) (actual > time=167.644..167.829 rows=100 loops=1) > Sort Key: ((cube_distance('(-2513120.64361786, > -4645511.0460328, 3575538.9507084)'::cube, earth_coords) / > 1609.344::double precision)) > Sort Method: top-N heapsort Memory: 20kB > -> Seq Scan on zipcodes (cost=0.00..1361.24 rows=41483 > width=62) (actual time=0.030..90.807 rows=41483 loops=1) > Total runtime: 168.300 ms > > ############################################################3 > > -- Using Earthdistance > EXPLAIN ANALYZE SELECT zipcode, > lon_lat <@> '(-118.412426,34.096629)' As radius > FROM zipcodes > ORDER BY lon_lat <@> '(-118.412426,34.096629)' > LIMIT 100; > > ------------------------------------------------------------ > Limit (cost=2842.99..2843.24 rows=100 width=22) (actual > time=187.995..188.451 rows=100 loops=1) > -> Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual > time=187.989..188.149 rows=100 loops=1) > Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point)) > Sort Method: top-N heapsort Memory: 20kB > -> Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 > width=22) (actual time=0.033..108.203 rows=41483 loops=1) > Total runtime: 188.660 ms > > ########################################## > > Using simple point distance, but with no Gist Index: > > EXPLAIN ANALYZE SELECT zipcode, > lon_lat <-> '(-118.412426,34.096629)' As radius > FROM zipcodes > ORDER BY lon_lat <-> '(-118.412426,34.096629)' > LIMIT 100; > > -------------------------------------------------------- > Limit (cost=2842.99..2843.24 rows=100 width=22) (actual > time=160.574..161.057 rows=100 loops=1) > -> Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual > time=160.568..160.691 rows=100 loops=1) > Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point)) > Sort Method: top-N heapsort Memory: 20kB > -> Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 > width=22) (actual time=0.027..84.610 rows=41483 loops=1) > Total runtime: 161.226 ms > > ######################################### > > -- Using KNN-GIST index > EXPLAIN ANALYZE SELECT zipcode, > lon_lat <-> '(-118.412426,34.096629)' As radius > FROM zipcodes > ORDER BY lon_lat <-> '(-118.412426,34.096629)' > LIMIT 100; > ------------------------------------------------------------------ > Limit (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892 > rows=100 loops=1) > -> Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 > rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1) > Order By: (lon_lat <-> '(-118.412426,34.096629)'::point) > Total runtime: 2.198 ms > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On 02/17/2011 03:17 PM, Oleg Bartunov wrote: > Mark, > > we investigating pgsphere http://pgsphere.projects.postgresql.org/, if > we could add KNN support. Great, thanks Oleg. I'll be happy to test it when something is ready. Mark
On Thu, Feb 17, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Mark Stosberg <mark@summersault.com> writes: >> - The query planner didn't like it when the "ORDER BY" referred to a >> column value instead of a static value, even when I believe it should >> know that the column value never changes. See this pseudo-query where >> we look-up the coordinates for 90210 once: > >> EXPLAIN ANALYZE >> SELECT pets.pet_id, >> zipcodes.lon_lat <-> center.lon_lat AS radius >> FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS >> center, pets >> JOIN shelters USING (shelter_id) >> JOIN zipcodes USING (zipcode) >> ORDER BY postal_codes.lon_lat <-> center.lon_lat limit 1000; > > As phrased, that's a join condition, so there's no way that an index on > a single table can possibly satisfy it. You could probably convert it > to a sub-select though: > > ORDER BY postal_codes.lon_lat <-> (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') limit 1000; > > regards, tom lane Would pushing that subquery to a WITH clause be helpful at all?
Hello, I want to report that I have now solved the challenges I ran into using KNN for US zipcode searching. I've found the new approach to not only be viable, but to benchmark about 3x faster for our own real-world application than the previous approach we used, involving cube_distance() and earth_box(). Here's some details about my research so far. To evaluate it, I installed PostgreSQL 9.1 and a current PostGIS 2.0 snapshot (not yet released as stable). A primary challenge I had to solve was that KNN is designed for a slightly different problem than what I needed to solve. I need to answer the question: "What are all the objects that are in zipcodes with 50 miles of a given zipcode?" However, KNN only directly provides a performance boost to this variation: "What are the N nearest objects to this point?" Just adding a "WHERE clause" to check the 50 mile rule would erase the benefits of KNN, which works through an "ORDER BY" clause. I solved my issue by using a "WITH" clause that creates a pseudo-table called "nearby_zipcodes". In this example, I select all the zipcodes that are within 50 miles of the "47374" zipcode. The trick I've employed is that I've set the LIMIT value to 286-- exactly the number of zipcodes within 50 miles of 47374. My plan is to add another column to my "zipcodes" table for each of the small number distances I need to search. Then, when I load new zipcodes I can pre-compute how many zipcodes would be found at this distance. This have approach would not have worked without a "WITH" clause, or some equivalent, because the number of objects within the radius is not known, but the number of nearby zipcodes is fixed. This approach allows me to get the performance benefits of KNN, while also returning exactly those objects within 50 miles of my target zipcode, by JOINing on the "nearby_zipcodes" table: WITH nearby_zipcodes AS ( SELECT zipcode, st_distance_sphere(lonlat_point, (SELECT lonlat_point from zipcodes WHERE zipcode = '47374')) / 1609.344 as radius FROM zipcodes ORDER BY lonlat_point <-> (SELECT lonlat_point from zipcodes WHERE zipcode = '47374') LIMIT 286 ) SELECT ... You might also notice that "st_distance_sphere()" doesn't mean exactly the same thing as the "<->" operator. That's something I could refine going forward. That's what I've got so far. How could I improve this further? For reference, here are the key parts of the "zipcodes" table: # \d zipcodes Table "public.zipcodes" Column | Type | Modifiers --------------+-----------------------+----------- zipcode | character varying(5) | not null lonlat_point | geometry(Point,4326) | Indexes: "zipcodes_pkey" PRIMARY KEY, btree (zipcode) "lonlat_point_idx" gist (lonlat_point) Thanks for the peer review! Mark Stosberg