KNNGiST for knn-search - Mailing list pgsql-hackers

From Teodor Sigaev
Subject KNNGiST for knn-search
Date
Msg-id 4B0AC9E1.5050509@sigaev.ru
Whole thread Raw
Responses Re: KNNGiST for knn-search
Re: KNNGiST for knn-search
Re: KNNGiST for knn-search (WIP)
List pgsql-hackers
Hi there,

http://www.sigaev.ru/misc/knngist-0.11.tar.gz

we'd like to present contrib module for CVS HEAD, which contains implementation
of knn (k nearest neighbourhood) search in PostgreSQL, see README.knngist for
details. KNNGiST is an extension of existing GiST, which inherits most of
their code, so it's has recovery (WAL-logged)  and concurrency support.
Basically, it introduces a new distance-based priority queue tree traversal
algorithm (instead of depth-first in plain GiST), so index scan returns rows
sorted by closiness to a query. Notice, index returns all rows, so one should
use LIMIT clause to specify k (which is usually small) to get real benefits.
We get 300x times better performance on real-life data (about 1 mln points):

Module requires rbtree and point_ops patches applied.
(http://archives.postgresql.org/message-id/4B0A8DFA.7050009@sigaev.ru and
http://archives.postgresql.org/message-id/4B0A8F0F.3020308@sigaev.ru)

Old way:
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
order by dist asc LIMIT 10;

Time: 1024.242 ms

knn-search:
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
WHERE coordinates  >< '5.0,5.0'::point LIMIT 10;

Time: 3.158 ms


We didn't patch existing implementation of GiST for several reasons:

1. KNNGiST is about 5% slower than GiST on non-knn search queries, like
   contains or contained by, because of some overhead of new algorithm of
   tree traversal
2.  KNNGiST can't be used in  bitmap index scan, which destroys order of results,
   We don't know the way to forbid bitmap index scan only for knn queries.
   Current version of KNNGiST doesn't distinguish knn-search and usual search
   and postgres doesn't know about ordered output from KNNGiST.

We see several ways to add KNNGiST to PostgreSQL:

1. Patch existing GiST.
    Con - see problems above.
    Pro - any existing GiST opclasses will work with KNNGiST.

2. add KNNGIST as a contrib module like now.
    Con - there is no way in PostgreSQL to test other modules, which depends
          on KNNGiST. For example, it's easy to add support for trigrams, but
          then we add dependence on contrib/pg_trgm module.

3. add KNNGiST as separate access method into core of PostgreSQL.
    We think it the best way, since we don't touch existing GiST and opclasses,
    and could use KNNGiST in other contrib modules


Separate problem is query planning.

1. To use KNNGiST we need to use index scan ! To prevent seqscan on table,
    operator >< has extremely high cost.
2. To prevent bitmap index scan, KNNGiST doesn't have bitmap index scan
    interface at all (no amgetbitmap method).


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
knngist - contrib module for k-nearest neighbourhood search

Support: The Open Planning Project, Inc.

Module knngist provides:

- KNNGiST access method is an extension of GiST, which implements  k-nearest
 neighbourhood search
- Operator class for KNNGiST for data type points with capability of knn-search
- Operator class for KNNGiST for data type tsvector without capability of knn-search

KNNGiST is inherited from GiST and use the same write methods, so, that KNNGiST
has recovery (WAL-logged)  and concurrency support.

KNNGiST supports all queries executable by GiST, but with possible
performance loss.  KNNGiST keeps all features of GiST, such as multicolumn
indexes with support of any subset of the index's columns, indexing and
searching of NULL values.

The KNNGiST differs from GiST in:
- search traversal algorithm on tree. While GiST uses first-depth search
  algorithm of KNN version uses traversal priority queue
- consistent user-defined method should return distance of tuple to the query,
  instead of a boolean value as in GiST. However, KNNGiST can use GiST's
  consistent method for additional filtering of result or GiST-alike
  search,  but not for knn-search (for example, tsvector_ops).
- KNNGiST doesn't have amgetbitmap method, because of nature of knn-search.


consistent user-defined method for KNNGiST can return:

 - negative value, which means tuple doesn't match query
   (like false in GiST's consistent)
 - 0.0 means one of:
   - a zero distance (exact match)
   - a match for filtering clause, like a <@ or @> for point.
   KNNGist doesn't distinguish these two cases and relies on user-defined
   methods
 - positive value, which means the method returns distance. In this case
   keyRecheck should be false!, since it's impossible to make right order
   with lossy values.

Distance between tuple and query is calculated as a sum of all distances
(on all keys). Notice, that distance is a numerical (non-negative)
description of how tuple is different from a query and KNNGiST doesn't require,
that it should follow triangle rule.

Caveats:

Currently, it's impossible to specify the number of closest neighbourhood
points returned, use LIMIT clause for this. Index ALWAYS returns ALL rows
in the order of closiness to the given point, so it can be very slow
without  LIMIT clause.


The module also provides index support for k-nn search for points data type
using KNNGiST access method.

Operator:

point   >< point   - fake operator, which always returns TRUE

Indexed support for operators:

point    <<  point
point    >>  point
point    <^  point
point    >^  point
point    ~=  point
point   <@ box
box      @> point
point   <@ polygon
polygon  @> point
point   <@ circle
circle   @> point

Also, knngist provides support full-text search operator @@  for tsvector
data type.


Examples:

We use test database of POI (point of interests), which has 1034170 spots.

First, compare performance of traditional approach and k-nn search.

=#
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
order by dist asc LIMIT 10;
             coordinates             |       dist
-------------------------------------+------------------
 (3.57192993164062,6.51727240153665) | 2.08362656457647
 (3.49502563476562,6.49134782128243) | 2.11874164636854
 (3.4393,6.4473)                     | 2.12848814420001
 (3.31787109375,6.50089913799597)    | 2.25438592075067
 (2.6323,6.4779)                     | 2.79109148900569
 (2.66349792480469,6.53159856871478) | 2.79374947392946
 (1.84102535247803,6.27874198581057) |  3.4079762161672
 (1.2255,6.1228)                     | 3.93796014327215
 (1.22772216796875,6.15693947094637) | 3.94570513108469
 (9.6977,4.044503)                   | 4.79388775494473
(10 rows)

Time: 1024.242 ms

=#
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
WHERE coordinates  >< '5.0,5.0'::point LIMIT 10;
             coordinates             |       dist
-------------------------------------+------------------
 (3.57192993164062,6.51727240153665) | 2.08362656457647
 (3.49502563476562,6.49134782128243) | 2.11874164636854
 (3.4393,6.4473)                     | 2.12848814420001
 (3.31787109375,6.50089913799597)    | 2.25438592075067
 (2.6323,6.4779)                     | 2.79109148900569
 (2.66349792480469,6.53159856871478) | 2.79374947392946
 (1.84102535247803,6.27874198581057) |  3.4079762161672
 (1.2255,6.1228)                     | 3.93796014327215
 (1.22772216796875,6.15693947094637) | 3.94570513108469
 (9.6977,4.044503)                   | 4.79388775494473
(10 rows)

Time: 3.158 ms

This query demonstrates 300x perfomance gain due to k-nn search and the gain
will only increases with the growing of table size, both in number of points
and row length.


Find 10 most closest points to the Eiffel tower in Paris, which has 'mars' in
their address.

=# CREATE INDEX spots_idx ON spots USING knngist (coordinates, to_tsvector('french',address));
=# SELECT id, address,  (coordinates <-> '(2.29470491409302,48.858263472125)'::point) AS dist
FROM spots WHERE coordinates >< '(2.29470491409302,48.858263472125)'::point
AND to_tsvector('french',address) @@ to_tsquery('french','mars')  LIMIT 10;
   id    |                           address                           |         dist
---------+-------------------------------------------------------------+---------------------
  366096 | 1st Floor Tour Eiffel | Champs de Mars, Paris 75007, France | 2.32488941293945e-05
 4356328 | r Champ de Mars 75007 PARIS                                 |  0.00421854756964406
 5200167 | Champ De Mars 75007 Paris                                   |  0.00453564562587288
 9301676 | Champ de Mars, 75007 Paris,                                 |  0.00453564562587288
 2152213 | 16, ave Rapp, Champ de Mars, Tour Eiffel, Paris, France     |  0.00624152097590896
 1923818 | Champ de Mars Paris, France                                 |  0.00838214733539654
 5165953 | 39 Rue Champ De Mars Paris, France                          |  0.00874410234569529
 7395870 | 39 Rue Champ De Mars Paris, France                          |  0.00874410234569529
 4358671 | 32 Rue Champ De Mars Paris, France                          |  0.00876089659276339
 1923742 | 12 rue du Champ de Mars Paris, France                       |  0.00876764731845995
(10 rows)

Time: 7.859 ms

=# EXPLAIN (COSTS OFF)
SELECT id, address FROM spots WHERE coordinates >< '(2.29470491409302,48.858263472125)'::point
AND to_tsvector('french',address) @@ to_tsquery('french','mars')  LIMIT 10;

                            QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------
 Limit
   ->  Index Scan using spots_idx on spots
         Index Cond: ((coordinates >< '(2.29470491409302,48.858263472125)'::point) AND
(to_tsvector('french'::regconfig,address) @@ '''mar'''::tsquery)) 
(3 rows)

Plan of query is consists of only index scan.


Find 10 most closest points to the Eiffel tower from the 1st arrondissement of
Paris (Paris 1), which addresses contains 'place'. See exact PARIS_1 polygon
in the Appendix.

=# SELECT id, address,  (coordinates <-> '(2.29470491409302,48.858263472125)'::point) AS dist
FROM spots WHERE coordinates >< '(2.29470491409302,48.858263472125)'::point
AND coordinates <@  PARIS_1::polygon
AND  to_tsvector('french',address) @@ to_tsquery('french','place')  LIMIT 10;
   id    |                     address                     |        dist
---------+-------------------------------------------------+-------------------
 4832659 | 1, Place de la Concorde Paris, France           | 0.0295206872672182
  411437 | Place de la Concorde Paris, France              | 0.0302147996937845
  378340 | 1, Place de la Concorde Paris, France           | 0.0307854422609629
 4831330 | Place Maurice Barrs Paris, France               | 0.0325866682024178
  376250 | 1, Place de la Madeleine Paris, France          | 0.0331104655425048
  474301 | Place de la Madeleine, 75009 PARIS 9eme, France | 0.0345299306576103
 5344357 | 15, Place Vendme 75001 Paris                    | 0.0347800149417034
 1655967 | 23, Place Vendme Paris, France                  | 0.0349331087313895
 5189448 | 21 Place Vendome 75001 Paris                    | 0.0349739602806271
 1645248 | 17 Place Vendome Paris, France                  |  0.035054516779252
(10 rows)

Time: 26.061 ms

See other examples of usage in sql/knngist.sql file.


Appendix:

PARIS_1::polygon is

'((2.34115348312647,48.8657428523236),
(2.34118074365616,48.8657338586581),(2.34124889530144,48.8657158708278),
(2.34507895327157,48.8647984137371),(2.34664637351338,48.8644206007538),
(2.35092599260214,48.8633949969132),(2.35018963361274,48.8620911379807),
(2.35013508709759,48.86198323148),(2.34942603926653,48.860724328187),
(2.34934422651062,48.8605714606062),(2.348198918621,48.8585122428713),
(2.34818528403175,48.8584852660126),(2.34754450116443,48.8573432499501),
(2.34748998045578,48.8573072839752),(2.34727190749839,48.8572083834273),
(2.34716287044605,48.8571544366361),(2.34702657180341,48.8570735139172),
(2.34695841648506,48.8570015779684),(2.34678119248757,48.8567048330991),
(2.34607232885619,48.8555807999441),(2.34594964833223,48.8554099476641),
(2.34588149215383,48.8553110329086),(2.34584060107095,48.8552660723913),
(2.34457303943998,48.8540431508262),(2.34282882150308,48.8548525947756),
(2.34277431496994,48.8548885683273),(2.34265167596398,48.8549785014457),
(2.34191584194969,48.8556170173375),(2.34082568197206,48.8565523000639),
(2.34045773722682,48.8567681364201),(2.34036234349704,48.8568220954177),
(2.34033508714246,48.856822096164),(2.33752769217743,48.8583778791875),
(2.33658731599251,48.8585757177307),(2.33288025427342,48.8593400183869),
(2.32988178792963,48.8601222319092),(2.32982726892706,48.8601402138378),
(2.32978637897676,48.8601581966286),(2.3295001526867,48.8602570975859),
(2.32830070631995,48.8607246317809),(2.32828707545679,48.8607336234383),
(2.32704673093708,48.8611381921984),(2.32607897177943,48.8614708326283),
(2.32526114627639,48.8617135534289),(2.32520662603102,48.861722540429),
(2.32471592483796,48.8618753640429),(2.32091289641744,48.8630349455836),
(2.32087200086194,48.8630529251833),(2.32155322584936,48.8639073294346),
(2.32243884665493,48.8650405307146),(2.32252059466181,48.8651574466784),
(2.3232972488748,48.8661647291047),(2.32341988057605,48.8663266130246),
(2.32354251307602,48.8664884968134),(2.3235152464625,48.8665064789925),
(2.3245644587385,48.8679274505231),(2.32464621421731,48.8680533576953),
(2.32500049864077,48.8685749741703),(2.32530028589861,48.8690066566296),
(2.32513660273671,48.8694382902038),(2.32566821564096,48.8695282724636),
(2.32580452605167,48.8695552643448),(2.32581815771984,48.8695552657067),
(2.32797189268413,48.869924162148),(2.32798552779567,48.8699061778024),
(2.32993507578488,48.8685034538095),(2.33027588164409,48.8683505987317),
(2.33063031789587,48.8681887507006),(2.33169358579805,48.8679460035635),
(2.33170721702503,48.8679460042226),(2.33365651409609,48.8675054382284),
(2.33583750536962,48.8669929002209),(2.33585113633529,48.8669929003854),
(2.33735054165431,48.8666421923058),(2.34115348312647,48.8657428523236))'::polygon

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Re: [COMMITTERS] pgsql: Remove -w (--ignore-all-space) option from pg_regress's diff
Next
From: Caleb Cushing
Date:
Subject: Re: named generic constraints [feature request]