Thread: KNNGiST for knn-search

KNNGiST for knn-search

From
Teodor Sigaev
Date:
Hi there,

http://www.sigaev.ru/misc/knngist-0.11.tar.gz

we'd like to present contrib module for CVS HEAD, which contains implementation
of knn (k nearest neighbourhood) search in PostgreSQL, see README.knngist for
details. KNNGiST is an extension of existing GiST, which inherits most of
their code, so it's has recovery (WAL-logged)  and concurrency support.
Basically, it introduces a new distance-based priority queue tree traversal
algorithm (instead of depth-first in plain GiST), so index scan returns rows
sorted by closiness to a query. Notice, index returns all rows, so one should
use LIMIT clause to specify k (which is usually small) to get real benefits.
We get 300x times better performance on real-life data (about 1 mln points):

Module requires rbtree and point_ops patches applied.
(http://archives.postgresql.org/message-id/4B0A8DFA.7050009@sigaev.ru and
http://archives.postgresql.org/message-id/4B0A8F0F.3020308@sigaev.ru)

Old way:
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
order by dist asc LIMIT 10;

Time: 1024.242 ms

knn-search:
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
WHERE coordinates  >< '5.0,5.0'::point LIMIT 10;

Time: 3.158 ms


We didn't patch existing implementation of GiST for several reasons:

1. KNNGiST is about 5% slower than GiST on non-knn search queries, like
   contains or contained by, because of some overhead of new algorithm of
   tree traversal
2.  KNNGiST can't be used in  bitmap index scan, which destroys order of results,
   We don't know the way to forbid bitmap index scan only for knn queries.
   Current version of KNNGiST doesn't distinguish knn-search and usual search
   and postgres doesn't know about ordered output from KNNGiST.

We see several ways to add KNNGiST to PostgreSQL:

1. Patch existing GiST.
    Con - see problems above.
    Pro - any existing GiST opclasses will work with KNNGiST.

2. add KNNGIST as a contrib module like now.
    Con - there is no way in PostgreSQL to test other modules, which depends
          on KNNGiST. For example, it's easy to add support for trigrams, but
          then we add dependence on contrib/pg_trgm module.

3. add KNNGiST as separate access method into core of PostgreSQL.
    We think it the best way, since we don't touch existing GiST and opclasses,
    and could use KNNGiST in other contrib modules


Separate problem is query planning.

1. To use KNNGiST we need to use index scan ! To prevent seqscan on table,
    operator >< has extremely high cost.
2. To prevent bitmap index scan, KNNGiST doesn't have bitmap index scan
    interface at all (no amgetbitmap method).


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
knngist - contrib module for k-nearest neighbourhood search

Support: The Open Planning Project, Inc.

Module knngist provides:

- KNNGiST access method is an extension of GiST, which implements  k-nearest
 neighbourhood search
- Operator class for KNNGiST for data type points with capability of knn-search
- Operator class for KNNGiST for data type tsvector without capability of knn-search

KNNGiST is inherited from GiST and use the same write methods, so, that KNNGiST
has recovery (WAL-logged)  and concurrency support.

KNNGiST supports all queries executable by GiST, but with possible
performance loss.  KNNGiST keeps all features of GiST, such as multicolumn
indexes with support of any subset of the index's columns, indexing and
searching of NULL values.

The KNNGiST differs from GiST in:
- search traversal algorithm on tree. While GiST uses first-depth search
  algorithm of KNN version uses traversal priority queue
- consistent user-defined method should return distance of tuple to the query,
  instead of a boolean value as in GiST. However, KNNGiST can use GiST's
  consistent method for additional filtering of result or GiST-alike
  search,  but not for knn-search (for example, tsvector_ops).
- KNNGiST doesn't have amgetbitmap method, because of nature of knn-search.


consistent user-defined method for KNNGiST can return:

 - negative value, which means tuple doesn't match query
   (like false in GiST's consistent)
 - 0.0 means one of:
   - a zero distance (exact match)
   - a match for filtering clause, like a <@ or @> for point.
   KNNGist doesn't distinguish these two cases and relies on user-defined
   methods
 - positive value, which means the method returns distance. In this case
   keyRecheck should be false!, since it's impossible to make right order
   with lossy values.

Distance between tuple and query is calculated as a sum of all distances
(on all keys). Notice, that distance is a numerical (non-negative)
description of how tuple is different from a query and KNNGiST doesn't require,
that it should follow triangle rule.

Caveats:

Currently, it's impossible to specify the number of closest neighbourhood
points returned, use LIMIT clause for this. Index ALWAYS returns ALL rows
in the order of closiness to the given point, so it can be very slow
without  LIMIT clause.


The module also provides index support for k-nn search for points data type
using KNNGiST access method.

Operator:

point   >< point   - fake operator, which always returns TRUE

Indexed support for operators:

point    <<  point
point    >>  point
point    <^  point
point    >^  point
point    ~=  point
point   <@ box
box      @> point
point   <@ polygon
polygon  @> point
point   <@ circle
circle   @> point

Also, knngist provides support full-text search operator @@  for tsvector
data type.


Examples:

We use test database of POI (point of interests), which has 1034170 spots.

First, compare performance of traditional approach and k-nn search.

=#
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
order by dist asc LIMIT 10;
             coordinates             |       dist
-------------------------------------+------------------
 (3.57192993164062,6.51727240153665) | 2.08362656457647
 (3.49502563476562,6.49134782128243) | 2.11874164636854
 (3.4393,6.4473)                     | 2.12848814420001
 (3.31787109375,6.50089913799597)    | 2.25438592075067
 (2.6323,6.4779)                     | 2.79109148900569
 (2.66349792480469,6.53159856871478) | 2.79374947392946
 (1.84102535247803,6.27874198581057) |  3.4079762161672
 (1.2255,6.1228)                     | 3.93796014327215
 (1.22772216796875,6.15693947094637) | 3.94570513108469
 (9.6977,4.044503)                   | 4.79388775494473
(10 rows)

Time: 1024.242 ms

=#
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
WHERE coordinates  >< '5.0,5.0'::point LIMIT 10;
             coordinates             |       dist
-------------------------------------+------------------
 (3.57192993164062,6.51727240153665) | 2.08362656457647
 (3.49502563476562,6.49134782128243) | 2.11874164636854
 (3.4393,6.4473)                     | 2.12848814420001
 (3.31787109375,6.50089913799597)    | 2.25438592075067
 (2.6323,6.4779)                     | 2.79109148900569
 (2.66349792480469,6.53159856871478) | 2.79374947392946
 (1.84102535247803,6.27874198581057) |  3.4079762161672
 (1.2255,6.1228)                     | 3.93796014327215
 (1.22772216796875,6.15693947094637) | 3.94570513108469
 (9.6977,4.044503)                   | 4.79388775494473
(10 rows)

Time: 3.158 ms

This query demonstrates 300x perfomance gain due to k-nn search and the gain
will only increases with the growing of table size, both in number of points
and row length.


Find 10 most closest points to the Eiffel tower in Paris, which has 'mars' in
their address.

=# CREATE INDEX spots_idx ON spots USING knngist (coordinates, to_tsvector('french',address));
=# SELECT id, address,  (coordinates <-> '(2.29470491409302,48.858263472125)'::point) AS dist
FROM spots WHERE coordinates >< '(2.29470491409302,48.858263472125)'::point
AND to_tsvector('french',address) @@ to_tsquery('french','mars')  LIMIT 10;
   id    |                           address                           |         dist
---------+-------------------------------------------------------------+---------------------
  366096 | 1st Floor Tour Eiffel | Champs de Mars, Paris 75007, France | 2.32488941293945e-05
 4356328 | r Champ de Mars 75007 PARIS                                 |  0.00421854756964406
 5200167 | Champ De Mars 75007 Paris                                   |  0.00453564562587288
 9301676 | Champ de Mars, 75007 Paris,                                 |  0.00453564562587288
 2152213 | 16, ave Rapp, Champ de Mars, Tour Eiffel, Paris, France     |  0.00624152097590896
 1923818 | Champ de Mars Paris, France                                 |  0.00838214733539654
 5165953 | 39 Rue Champ De Mars Paris, France                          |  0.00874410234569529
 7395870 | 39 Rue Champ De Mars Paris, France                          |  0.00874410234569529
 4358671 | 32 Rue Champ De Mars Paris, France                          |  0.00876089659276339
 1923742 | 12 rue du Champ de Mars Paris, France                       |  0.00876764731845995
(10 rows)

Time: 7.859 ms

=# EXPLAIN (COSTS OFF)
SELECT id, address FROM spots WHERE coordinates >< '(2.29470491409302,48.858263472125)'::point
AND to_tsvector('french',address) @@ to_tsquery('french','mars')  LIMIT 10;

                            QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------
 Limit
   ->  Index Scan using spots_idx on spots
         Index Cond: ((coordinates >< '(2.29470491409302,48.858263472125)'::point) AND
(to_tsvector('french'::regconfig,address) @@ '''mar'''::tsquery)) 
(3 rows)

Plan of query is consists of only index scan.


Find 10 most closest points to the Eiffel tower from the 1st arrondissement of
Paris (Paris 1), which addresses contains 'place'. See exact PARIS_1 polygon
in the Appendix.

=# SELECT id, address,  (coordinates <-> '(2.29470491409302,48.858263472125)'::point) AS dist
FROM spots WHERE coordinates >< '(2.29470491409302,48.858263472125)'::point
AND coordinates <@  PARIS_1::polygon
AND  to_tsvector('french',address) @@ to_tsquery('french','place')  LIMIT 10;
   id    |                     address                     |        dist
---------+-------------------------------------------------+-------------------
 4832659 | 1, Place de la Concorde Paris, France           | 0.0295206872672182
  411437 | Place de la Concorde Paris, France              | 0.0302147996937845
  378340 | 1, Place de la Concorde Paris, France           | 0.0307854422609629
 4831330 | Place Maurice Barrs Paris, France               | 0.0325866682024178
  376250 | 1, Place de la Madeleine Paris, France          | 0.0331104655425048
  474301 | Place de la Madeleine, 75009 PARIS 9eme, France | 0.0345299306576103
 5344357 | 15, Place Vendme 75001 Paris                    | 0.0347800149417034
 1655967 | 23, Place Vendme Paris, France                  | 0.0349331087313895
 5189448 | 21 Place Vendome 75001 Paris                    | 0.0349739602806271
 1645248 | 17 Place Vendome Paris, France                  |  0.035054516779252
(10 rows)

Time: 26.061 ms

See other examples of usage in sql/knngist.sql file.


Appendix:

PARIS_1::polygon is

'((2.34115348312647,48.8657428523236),
(2.34118074365616,48.8657338586581),(2.34124889530144,48.8657158708278),
(2.34507895327157,48.8647984137371),(2.34664637351338,48.8644206007538),
(2.35092599260214,48.8633949969132),(2.35018963361274,48.8620911379807),
(2.35013508709759,48.86198323148),(2.34942603926653,48.860724328187),
(2.34934422651062,48.8605714606062),(2.348198918621,48.8585122428713),
(2.34818528403175,48.8584852660126),(2.34754450116443,48.8573432499501),
(2.34748998045578,48.8573072839752),(2.34727190749839,48.8572083834273),
(2.34716287044605,48.8571544366361),(2.34702657180341,48.8570735139172),
(2.34695841648506,48.8570015779684),(2.34678119248757,48.8567048330991),
(2.34607232885619,48.8555807999441),(2.34594964833223,48.8554099476641),
(2.34588149215383,48.8553110329086),(2.34584060107095,48.8552660723913),
(2.34457303943998,48.8540431508262),(2.34282882150308,48.8548525947756),
(2.34277431496994,48.8548885683273),(2.34265167596398,48.8549785014457),
(2.34191584194969,48.8556170173375),(2.34082568197206,48.8565523000639),
(2.34045773722682,48.8567681364201),(2.34036234349704,48.8568220954177),
(2.34033508714246,48.856822096164),(2.33752769217743,48.8583778791875),
(2.33658731599251,48.8585757177307),(2.33288025427342,48.8593400183869),
(2.32988178792963,48.8601222319092),(2.32982726892706,48.8601402138378),
(2.32978637897676,48.8601581966286),(2.3295001526867,48.8602570975859),
(2.32830070631995,48.8607246317809),(2.32828707545679,48.8607336234383),
(2.32704673093708,48.8611381921984),(2.32607897177943,48.8614708326283),
(2.32526114627639,48.8617135534289),(2.32520662603102,48.861722540429),
(2.32471592483796,48.8618753640429),(2.32091289641744,48.8630349455836),
(2.32087200086194,48.8630529251833),(2.32155322584936,48.8639073294346),
(2.32243884665493,48.8650405307146),(2.32252059466181,48.8651574466784),
(2.3232972488748,48.8661647291047),(2.32341988057605,48.8663266130246),
(2.32354251307602,48.8664884968134),(2.3235152464625,48.8665064789925),
(2.3245644587385,48.8679274505231),(2.32464621421731,48.8680533576953),
(2.32500049864077,48.8685749741703),(2.32530028589861,48.8690066566296),
(2.32513660273671,48.8694382902038),(2.32566821564096,48.8695282724636),
(2.32580452605167,48.8695552643448),(2.32581815771984,48.8695552657067),
(2.32797189268413,48.869924162148),(2.32798552779567,48.8699061778024),
(2.32993507578488,48.8685034538095),(2.33027588164409,48.8683505987317),
(2.33063031789587,48.8681887507006),(2.33169358579805,48.8679460035635),
(2.33170721702503,48.8679460042226),(2.33365651409609,48.8675054382284),
(2.33583750536962,48.8669929002209),(2.33585113633529,48.8669929003854),
(2.33735054165431,48.8666421923058),(2.34115348312647,48.8657428523236))'::polygon

Re: KNNGiST for knn-search

From
Heikki Linnakangas
Date:
Teodor Sigaev wrote:
> we'd like to present contrib module for CVS HEAD, which contains
> implementation of knn (k nearest neighbourhood) search in PostgreSQL,
> see README.knngist for
> details.

Cool!

> Old way:
> SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
> order by dist asc LIMIT 10;
> 
> Time: 1024.242 ms
> 
> knn-search:
> SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
> WHERE coordinates  >< '5.0,5.0'::point LIMIT 10;
> 
> Time: 3.158 ms

I think you'll need to work on that. A WHERE qual shouldn't imply a sort
order. You'll have to teach the planner how to use the index to speed up
a query in the first form.

> We didn't patch existing implementation of GiST for several reasons:
> 
> 1. KNNGiST is about 5% slower than GiST on non-knn search queries, like
>   contains or contained by, because of some overhead of new algorithm of
>   tree traversal

Is it possible to use the regular GiST traversal algorithm on a
KNNGiST-tree, when performing regular GiST searches that don't require a
particular order?

> 2.  KNNGiST can't be used in  bitmap index scan, which destroys order of
> results,
>   We don't know the way to forbid bitmap index scan only for knn queries.
>   Current version of KNNGiST doesn't distinguish knn-search and usual
> search
>   and postgres doesn't know about ordered output from KNNGiST.

Yeah, you really need to modify the planner to understand the ordering
and plan accordingly.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: KNNGiST for knn-search

From
Teodor Sigaev
Date:
> I think you'll need to work on that. A WHERE qual shouldn't imply a sort
> order. You'll have to teach the planner how to use the index to speed up
> a query in the first form.
Of course, right now it is a working prototype.

>> 1. KNNGiST is about 5% slower than GiST on non-knn search queries, like
>>   contains or contained by, because of some overhead of new algorithm of
>>   tree traversal
> 
> Is it possible to use the regular GiST traversal algorithm on a
> KNNGiST-tree, when performing regular GiST searches that don't require a
> particular order?
New algorithm works much more with memory for allocation/free to manage lists 
and it's a single reason of performance loss. Choosing of algorithm could not be 
done by consistent function, it should be done at least in amrescan method or 
even earlier - in planner.


> 
>> 2.  KNNGiST can't be used in  bitmap index scan, which destroys order of
>> results,
>>   We don't know the way to forbid bitmap index scan only for knn queries.
>>   Current version of KNNGiST doesn't distinguish knn-search and usual
>> search
>>   and postgres doesn't know about ordered output from KNNGiST.
> 
> Yeah, you really need to modify the planner to understand the ordering
> and plan accordingly.
Hmm, I thought about it, but still have no a good idea.
One idea:
SELECT p FROM pt WHERE p << '5.0,5.0'::point ORDER BY (p <-> '5.0,5.0'::point) 
DESC LIMIT 10;
And add <-> to opclass (but for now any indexable operation should return 
boolean type). Of course, KNNGiST should be modified to support not only 
k-nearest search but k-"farest" search and NULLS LAST/FIRST.

Not very convenient, because it's needed to look into expression of ORDER BY. 
And now you can specify p >< 'one point' AND p >< 'another point', but it's 
impossible to do that by ORDER BY clause.

Second idea with non-standard syntax.
SELECT ... ORDER BY PROXIMITY OF expression[, expression [..]] TO expression[, 
expression [..]] USING [operator [, operator [..]]
and operator is distance operator, i.e. it's not a member of btree opclass, but 
returns non-negative float8 value.

Without index it will be essentially the same as
ORDER BY expression operator expression[ + ..] DESC NULLS LAST


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: KNNGiST for knn-search

From
Heikki Linnakangas
Date:
Teodor Sigaev wrote:
>>> 1. KNNGiST is about 5% slower than GiST on non-knn search queries, like
>>>   contains or contained by, because of some overhead of new algorithm of
>>>   tree traversal
>>
>> Is it possible to use the regular GiST traversal algorithm on a
>> KNNGiST-tree, when performing regular GiST searches that don't require a
>> particular order?
> New algorithm works much more with memory for allocation/free to manage
> lists and it's a single reason of performance loss. Choosing of
> algorithm could not be done by consistent function, it should be done at
> least in amrescan method or even earlier - in planner.

Ok, that sounds good. The bottom line is that you can use the same
on-disk tree with both algorithms. No need for a separate indexam in
that case.

> One idea:
> SELECT p FROM pt WHERE p << '5.0,5.0'::point ORDER BY (p <->
> '5.0,5.0'::point) DESC LIMIT 10;
> And add <-> to opclass (but for now any indexable operation should
> return boolean type).

You really shouldn't need to have a WHERE clause.

> Of course, KNNGiST should be modified to support
> not only k-nearest search but k-"farest" search and NULLS LAST/FIRST.

Well, as long as the planner knows the capabilities of the indexam, it
can just fall back to a seqscan+sort if the query can't be sped up with
the index.

> And now you can specify p >< 'one point' AND p >< 'another
> point', but it's impossible to do that by ORDER BY clause.

Huh, what does that mean? Is it like "ORDER BY (min( p >< 'one point', p
>< 'another point')" ?

> Second idea with non-standard syntax.
> SELECT ... ORDER BY PROXIMITY OF expression[, expression [..]] TO
> expression[, expression [..]] USING [operator [, operator [..]]
> and operator is distance operator, i.e. it's not a member of btree
> opclass, but returns non-negative float8 value.
> 
> Without index it will be essentially the same as
> ORDER BY expression operator expression[ + ..] DESC NULLS LAST

We already have the syntax to represent the query, using ORDER BY. IMHO
we just need to teach the planner that when it sees a query like that,
it can use a GiST index to speed it up. A number of indexam and operator
class API changes are probably required, but it should be invisible to
the user.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: KNNGiST for knn-search

From
Simon Riggs
Date:
On Mon, 2009-11-23 at 20:44 +0300, Teodor Sigaev wrote:
> Old way:
> SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM
> spots
> order by dist asc LIMIT 10;
> 
> Time: 1024.242 ms
> 
> knn-search:
> SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM
> spots
> WHERE coordinates  >< '5.0,5.0'::point LIMIT 10;
> 
> Time: 3.158 ms
> 
> 
> We didn't patch existing implementation of GiST for several reasons:
> 
> 1. KNNGiST is about 5% slower than GiST on non-knn search queries,
> like
>    contains or contained by, because of some overhead of new algorithm
> of
>    tree traversal
> 2.  KNNGiST can't be used in  bitmap index scan, which destroys order
> of results,
>    We don't know the way to forbid bitmap index scan only for knn
> queries.
>    Current version of KNNGiST doesn't distinguish knn-search and usual
> search
>    and postgres doesn't know about ordered output from KNNGiST.

Sounds very cool.

Seems like you should look at the way sorted_path works in
query_planner().

If you have a query like this
 explain select col1 from s order by col1 limit 10;

then we currently understand that we should use an IndexScan for that.
We don't specifically exclude the bitmap scan, it's just that we know
that the results from the index are ordered and therefore the cost of
sorting the output need not be added. In the bitmap case the cost of the
sort must be added and that's enough to ensure we almost never do that.

I notice that a query like 
 explain select col1 from s order by abs(col1 - 5) limit 10;

is the one-dimensional equivalent of the type of query you're proposing
and that doesn't work either until you put an index on abs(col1 - 5),
then it just works, but only for k = 5.

Maybe you should look at the above query and see if there are any usable
similarities for the Knn index.

Part of your problem appears to be that cost_sort does not include
anything about the cost of the comparison operators for different
datatypes.

-- Simon Riggs           www.2ndQuadrant.com



Re: KNNGiST for knn-search (WIP)

From
Teodor Sigaev
Date:
Hi!

Contrib module is reworked as a patch for current GiST. Now GiST supports
KNN-search, the query looks like
SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
or
SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
Plans are:
EXPLAIN (COSTS OFF)
SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
                QUERY PLAN
-----------------------------------------
  Index Scan using gpointind on point_tbl
    Index Cond: (f1 <-> '(0,1)'::point)
EXPLAIN (COSTS OFF)
SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
                                   QUERY PLAN
------------------------------------------------------------------------------
  Index Scan using gpointind on point_tbl
    Index Cond: ((f1 <@ '(10,10),(-10,-10)'::box) AND (f1 <-> '(0,1)'::point))


pg_am now has new column amcanorderbyop (can order by operation), indexes with
this flag enabled can be used to speedup operations, which returns non-boolean
value, currently type of returned value should have default btree operator class
  to perform sort.

Planner (find_usable_indexes function, actually) could push pathkey expression
into restriction clauses of index. I'm not fully satisfied with this piece of
code, but, may be, someone has a better idea. I though about adding separate
indexorderquals in IndexPath structure...

Both GiST's get methods are optimized and there is no overhead, since gistrescan
method can choose what traversal algorithm to use using information about types
of values returned by operations. If at least one of them returns non-boolean
result, then KNN-search will be performed.

The only change in interface of supporting functions is: consistentFn function
could return float8 non-negative value and it's mandatory to perform KNN-search.
Old-style consistent functions are supported.

Patch contains (it still requires rbtree-0.5 and point_ops-0.4 patches):
- GiST changes
- changes in point_ops to support knn-search
- contrib/pg_trgm now has new operation <-> returns distance between texts. This
operation is supported in KNN-search
- contrib/btree_gist provides <-> and its support for GiST for types int2, int4,
int8, float4, float8, money, oid, interval, time, date, timestamp and timestamptz

TODO:
- selectivity of ordering operation should be 1.0
- current patch remove support of  IndexScanDesc->kill_prior_tuple, it's needed
to restore support if it will not create too big overhead
- documentation
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: KNNGiST for knn-search

From
Teodor Sigaev
Date:
>   explain select col1 from s order by abs(col1 - 5) limit 10;
> 
> is the one-dimensional equivalent of the type of query you're proposing
Exactly, it's already done in next version of patch :)

> and that doesn't work either until you put an index on abs(col1 - 5),
> then it just works, but only for k = 5.

BTW, it's possible to add this feature to plain btree by changing traversal 
algorithm, but I'm fill enough power to do it.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: KNNGiST for knn-search

From
Teodor Sigaev
Date:
> BTW, it's possible to add this feature to plain btree by changing 
> traversal algorithm, but I'm fill enough power to do it.

Sorry, I'm NOT fill enough power to do it.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: KNNGiST for knn-search

From
Teodor Sigaev
Date:
>> BTW, it's possible to add this feature to plain btree by changing 
>> traversal algorithm, but I'm fill enough power to do it.
> 
> Sorry, I'm NOT fill enough power to do it.

%-), I'm NOT FEEL enough power to do it.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: KNNGiST for knn-search (WIP)

From
Teodor Sigaev
Date:
> Planner (find_usable_indexes function, actually) could push pathkey
> expression into restriction clauses of index. I'm not fully satisfied
> with this piece of code, but, may be, someone has a better idea. I
> though about adding separate indexorderquals in IndexPath structure...

Done, IndexScan and IndexPath have separate field to store order clauses. That's
allowed to improve explain output:
#  EXPLAIN (COSTS OFF)
   SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <->
'0,1';
                    QUERY PLAN
------------------------------------------------
  Index Scan using gpointind on point_tbl
    Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
    Sort Cond: (f1 <-> '(0,1)'::point)
(3 rows)

We are waiting feedback to choose a way of planner support of knn-search.

Still TODO:
- cost of index scan
- Sort condition should not be ANDed in explain output
- current patch remove support of  IndexScanDesc->kill_prior_tuple, it's needed
to restore support if it will not create too big overhead
- documentation

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: KNNGiST for knn-search (WIP)

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Planner (find_usable_indexes function, actually) could push pathkey 
>> expression into restriction clauses of index. I'm not fully satisfied 
>> with this piece of code, but, may be, someone has a better idea. I 
>> though about adding separate indexorderquals in IndexPath structure...

> Done, IndexScan and IndexPath have separate field to store order clauses.

Why?  Isn't that redundant with the pathkey structures?  We generate
enough paths during a complex planning problem that I'm not in favor
of adding unnecessary structures to them.
        regards, tom lane


Re: KNNGiST for knn-search (WIP)

From
Teodor Sigaev
Date:
>> Done, IndexScan and IndexPath have separate field to store order clauses.
> 
> Why?  Isn't that redundant with the pathkey structures?  We generate
> enough paths during a complex planning problem that I'm not in favor
> of adding unnecessary structures to them.
I found two ways to add support of knn-seaech to planner
- 0.4 version: add sorting clauses to restrictclauses in find_usable_indexes,  and there is two issues:  -
find_usable_indexescould not be used to find indexes for index and bitmap    scans at once, because sorting clauses
willbe used in bitmap scan. Full    scan of index with knn-ordering on large index is much more expensive.  -
implied/refusedpredicate machinery is teached to ignore sorting clauses,    but it's not so obvious: it should ignore
operationreturning non-boolean    values
 
- 0.4.1 version: pull sort clauses separately and merge them with regular  clauses at create_indexscan_plan function.
That'ssolves problems above
 

Do you suggest to construct that clauses from pathkey representation? And append  constructed clauses to indexquals in
create_indexscan_plan?
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: KNNGiST for knn-search (WIP)

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> Do you suggest to construct that clauses from pathkey representation? And append 
>   constructed clauses to indexquals in create_indexscan_plan?

I'd probably have to read the patch to make any intelligent suggestions
... and right now I'm busy with patches that are already in the
commitfest.  But usually it's a good idea to postpone work to createplan
time if you can.
        regards, tom lane


Re: KNNGiST for knn-search (WIP)

From
Robert Haas
Date:
2009/11/26 Teodor Sigaev <teodor@sigaev.ru>:
> Hi!
>
> Contrib module is reworked as a patch for current GiST. Now GiST supports
> KNN-search, the query looks like
> SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
> or
> SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1
> <-> '0,1';
> Plans are:
> EXPLAIN (COSTS OFF)
> SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
>               QUERY PLAN
> -----------------------------------------
>  Index Scan using gpointind on point_tbl
>   Index Cond: (f1 <-> '(0,1)'::point)
> EXPLAIN (COSTS OFF)
> SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1
> <-> '0,1';
>                                  QUERY PLAN
> ------------------------------------------------------------------------------
>  Index Scan using gpointind on point_tbl
>   Index Cond: ((f1 <@ '(10,10),(-10,-10)'::box) AND (f1 <-> '(0,1)'::point))
>
>
> pg_am now has new column amcanorderbyop (can order by operation), indexes
> with this flag enabled can be used to speedup operations, which returns
> non-boolean value, currently type of returned value should have default
> btree operator class  to perform sort.
>
> Planner (find_usable_indexes function, actually) could push pathkey
> expression into restriction clauses of index. I'm not fully satisfied with
> this piece of code, but, may be, someone has a better idea. I though about
> adding separate indexorderquals in IndexPath structure...
>
> Both GiST's get methods are optimized and there is no overhead, since
> gistrescan method can choose what traversal algorithm to use using
> information about types of values returned by operations. If at least one of
> them returns non-boolean result, then KNN-search will be performed.
>
> The only change in interface of supporting functions is: consistentFn
> function could return float8 non-negative value and it's mandatory to
> perform KNN-search. Old-style consistent functions are supported.
>
> Patch contains (it still requires rbtree-0.5 and point_ops-0.4 patches):
> - GiST changes
> - changes in point_ops to support knn-search
> - contrib/pg_trgm now has new operation <-> returns distance between texts.
> This operation is supported in KNN-search
> - contrib/btree_gist provides <-> and its support for GiST for types int2,
> int4, int8, float4, float8, money, oid, interval, time, date, timestamp and
> timestamptz
>
> TODO:
> - selectivity of ordering operation should be 1.0
> - current patch remove support of  IndexScanDesc->kill_prior_tuple, it's
> needed to restore support if it will not create too big overhead
> - documentation

Based on the feedback provided on this patch so far, it looks like
some changes are probably needed, but it's not entirely clear whether
the feedback provided is sufficient to provide guidance on what
changes should be made.  It does also need to be updated to CVS HEAD,
as it no longer applies cleanly.

I tend to feel that we should probably target this for 8.6 rather than
8.5.  We are down to the last CommitFest, and while we don't have a
nailed-down criterion for what is "too big" for the last CommitFest of
a given release cycle, this is definitely a big, invasive patch.  This
patch weights in at over 2400 adds/removes, and it's not boilerplate
stuff like updates to pg_proc entries, but real, significant changes.
I'm worried that applying something like this late in the release
cycle is just not a good idea, especially given the fact that it
probably still needs significant revising.  However, I'm fairly
conservative by nature, so perhaps someone else will have a different
opinion, or maybe there is a way to restructure it so that the needed
changes are less invasive.

...Robert


Re: KNNGiST for knn-search (WIP)

From
Oleg Bartunov
Date:
Robert,

On Wed, 30 Dec 2009, Robert Haas wrote:

> Based on the feedback provided on this patch so far, it looks like
> some changes are probably needed, but it's not entirely clear whether
> the feedback provided is sufficient to provide guidance on what
> changes should be made.  It does also need to be updated to CVS HEAD,
> as it no longer applies cleanly.

this is not a problem.

>
> I tend to feel that we should probably target this for 8.6 rather than
> 8.5.  We are down to the last CommitFest, and while we don't have a
> nailed-down criterion for what is "too big" for the last CommitFest of
> a given release cycle, this is definitely a big, invasive patch.  This
> patch weights in at over 2400 adds/removes, and it's not boilerplate
> stuff like updates to pg_proc entries, but real, significant changes.
> I'm worried that applying something like this late in the release
> cycle is just not a good idea, especially given the fact that it
> probably still needs significant revising.  However, I'm fairly
> conservative by nature, so perhaps someone else will have a different
> opinion, or maybe there is a way to restructure it so that the needed
> changes are less invasive.

the patch adds new strategy of gist tree traverse and doesn't change old one,
so there is no risk to ruin old code. I'm all for  good conservatism, but
this is not the case, else we wouldn't have GiST at all. We are very
interested in the KNN to be in the 8.5 and we're ready to fix any issues.

From metodological point of view I don't quite understand how to measure
the value of development, I mean what'is a "big patch", "invasive patch".
Should we prefer cosmetic pathces, spelling fixes, etc ? Of course, they are
easy for refering, but people are waiting from us not just fixes, but new
features. For example, KNN-GiST is a big improvement for PostGIS community,
which is a big part of postgres users. Actually, it's PostGIS community, which
supported our work. Now, what we should say them ? The patch was too big and
invasive, so, sorry, wait one year more ? I think it's not good.

Robert, I'm not against you, it's your right to have your opinion. I address
this to other developers. It's important for us, since we have several
other patches ready, for example, long awaited phrase search 
(http://www.sai.msu.su/~megera/wiki/2009-08-12). We postponed it, since 
it was supposed that EDB will support it, but, hey, it wont. We did it for our
own. Teodor insist to submit it for 8.5, but I'm now begin to hesitate, what if
this patch will be also too big.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: KNNGiST for knn-search (WIP)

From
Teodor Sigaev
Date:
> changes should be made.  It does also need to be updated to CVS HEAD,
> as it no longer applies cleanly.

The reason was a point_ops patch, some OIDs become duplicated. Both attached
patches are synced with current CVS.
>
> I tend to feel that we should probably target this for 8.6 rather than
> 8.5.  We are down to the last CommitFest, and while we don't have a
> nailed-down criterion for what is "too big" for the last CommitFest of
> a given release cycle, this is definitely a big, invasive patch.  This

Is we really have rule to accept only small patches at last CommitFest? May be,
FixFest name is better for it? :)

Actually, it's easy to split patch to several ones:
- contrib/pg_trgm
- contrib/btree_gist
- knngist itself
- planner changes

And knngist depends on rbtree and point_ops patch, in summary 6 dependent
patches. Is it more comfortable?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: KNNGiST for knn-search (WIP)

From
Robert Haas
Date:
On Wed, Dec 30, 2009 at 9:20 AM, Oleg Bartunov <oleg@sai.msu.su> wrote:
> From metodological point of view I don't quite understand how to measure
> the value of development, I mean what'is a "big patch", "invasive patch".

I want to speak specifically to this question because I think it's a
good one.  Of course, I also want to make clear that I have nothing
against you or your patch and that it sounds like a really nice
feature.

From my point of view, what makes a patch invasive is the likelihood
that it might break something other than itself.  For example, your
patch touches the core planner code and the core GIST code, so it
seems possible that adding support for this feature might break
something else in one of those areas.  All things being equal, we
would prefer to take that risk at the beginning of a development cycle
rather than the end.  If your patch was the same size, but consisted
mostly of new code with very few changes to what is there now, it
might still be difficult to properly review and verify - but any bugs
we missed would likely affect only the NEW functionality, not any
EXISTING functionality.

Please understand that the previous paragraph is intended to be a
general statement about software development in general more than a
specific commentary on your particular patch.  Whether applying your
patch in particular will break anything is, of course, something
that's difficult to know until we do it and see what happens, and at
this point I haven't even reviewed it.  It's also possible that I'm
doing a poor job estimating the risk of breakage, and I certainly
welcome other opinions from other people in a position to make a
technical judgement on that point.  I might also have a different
opinion myself after I review the patch in more detail, so please do
post an updated version.

> Should we prefer cosmetic pathces, spelling fixes, etc ? Of course, they are
> easy for refering, but people are waiting from us not just fixes, but new
> features. For example, KNN-GiST is a big improvement for PostGIS community,
> which is a big part of postgres users. Actually, it's PostGIS community,
> which
> supported our work. Now, what we should say them ? The patch was too big and
> invasive, so, sorry, wait one year more ? I think it's not good.

Well, I understand your point, but there is obviously some deadline
for patches to be submitted for any particular release.  Clearly,
after the last CommitFest is over, that deadline is past.  However, we
have previously discussed having a policy that no new large patches
will be accepted for the last CommitFest that were not also submitted
for the second-to-last CommitFest.  Hopefully it's obvious that I have
no desire to keep cool new features away from the PostGIS community,
the PostgreSQL community, or anyone else, but we have to weigh that
against the desire to have a stable and bug-free release, and applying
big patches at the last minute makes that less likely.

As an example, the change to run the background writer during recovery
and the changes in semi/anti join planning for 8.4 have both resulted
in multiple bug reports.  The former was half the footprint of your
patch and applied at the very end of the release cycle; the latter was
slightly larger and applied in August 2008, so considerably earlier in
the cycle than this one could possibly be - and there were still
things we did not catch before release.

...Robert


Re: KNNGiST for knn-search (WIP)

From
Alvaro Herrera
Date:
Teodor Sigaev escribió:

> Actually, it's easy to split patch to several ones:
> - contrib/pg_trgm
> - contrib/btree_gist
> - knngist itself
> - planner changes

+1 on the split patches.  I wonder about the opr_sanity test change ...
why is it necessary?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: KNNGiST for knn-search (WIP)

From
Robert Haas
Date:
2009/12/30 Teodor Sigaev <teodor@sigaev.ru>:
>> changes should be made.  It does also need to be updated to CVS HEAD,
>> as it no longer applies cleanly.
>
> The reason was a point_ops patch, some OIDs become duplicated. Both attached
> patches are synced with current CVS.

Thanks!  I will take a look.

>> I tend to feel that we should probably target this for 8.6 rather than
>> 8.5.  We are down to the last CommitFest, and while we don't have a
>> nailed-down criterion for what is "too big" for the last CommitFest of
>> a given release cycle, this is definitely a big, invasive patch.  This
>
> Is we really have rule to accept only small patches at last CommitFest? May
> be, FixFest name is better for it? :)

See here and following for some of the previous discussion - which was
not unanimous on all points:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg00139.php

I think the intention is not to accept only bug fixes, but to limit
large features to those that have already been through a CommitFest or
two.

> Actually, it's easy to split patch to several ones:
> - contrib/pg_trgm
> - contrib/btree_gist
> - knngist itself
> - planner changes
>
> And knngist depends on rbtree and point_ops patch, in summary 6 dependent
> patches. Is it more comfortable?

I'm not sure. One of the problems with separating out contrib module
changes is that it tends to obscure the point of the changes to the
core code.  On the other hand if some of the core code changes can be
split out into an infrastructure patch that is of some independent
usefulness, that can certainly be worthwhile.  It's not obvious to me
without looking at this more than I have whether there is a possble
split that makes sense here; I will read your updated patch.

...Robert


Re: KNNGiST for knn-search (WIP)

From
Robert Haas
Date:
On Wed, Dec 30, 2009 at 12:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 2009/12/30 Teodor Sigaev <teodor@sigaev.ru>:
>>> changes should be made.  It does also need to be updated to CVS HEAD,
>>> as it no longer applies cleanly.
>>
>> The reason was a point_ops patch, some OIDs become duplicated. Both attached
>> patches are synced with current CVS.
>
> Thanks!  I will take a look.

OK, I'm confused.  First, there are two versions of the patch here, so
I'm not sure which one I'm supposed to be looking at.  Second, when I
attempt to apply either one, I get:

$ patch -p0 < ~/Download/builtin_knngist-0.5
patching file src/backend/access/gist/gistget.c
patching file src/backend/access/gist/gistproc.c
Reversed (or previously applied) patch detected!  Assume -R? [n]

...regardless of how I answer that question, it then goes on to apply
most of the rest of the patch successfully.

Help?

...Robert


Re: KNNGiST for knn-search (WIP)

From
Greg Stark
Date:
On Wed, Dec 30, 2009 at 4:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> From my point of view, what makes a patch invasive is the likelihood
> that it might break something other than itself.  For example, your
> patch touches the core planner code and the core GIST code, so it
> seems possible that adding support for this feature might break
> something else in one of those areas.

It doesn't seem obvious to me that this is a high-risk patch. It's
touching the planner which is tricky but it's not the kind of massive
overhaul that touches every module that HOT or HS were.  I'm really
glad HS got in before the end because lots of people with different
areas of expertise and different use cases are going to get to
exercise it in the time remaining. This patch I would expect
relatively few people to need to try it out before any oversights are
caught.

--
greg


Re: KNNGiST for knn-search (WIP)

From
Paul Ramsey
Date:
I'm sure whatever conclusion -hackers comes to in the end will be the
best for pgsql, and I'll be supportive. But until then, let me note
from the PostGIS point-of-view: sure would be great to get this in for
8.5 :)

P.

On Thu, Dec 31, 2009 at 4:26 AM, Greg Stark <gsstark@mit.edu> wrote:
> On Wed, Dec 30, 2009 at 4:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> From my point of view, what makes a patch invasive is the likelihood
>> that it might break something other than itself.  For example, your
>> patch touches the core planner code and the core GIST code, so it
>> seems possible that adding support for this feature might break
>> something else in one of those areas.
>
> It doesn't seem obvious to me that this is a high-risk patch. It's
> touching the planner which is tricky but it's not the kind of massive
> overhaul that touches every module that HOT or HS were.  I'm really
> glad HS got in before the end because lots of people with different
> areas of expertise and different use cases are going to get to
> exercise it in the time remaining. This patch I would expect
> relatively few people to need to try it out before any oversights are
> caught.
>
> --
> greg
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: KNNGiST for knn-search (WIP)

From
Robert Haas
Date:
On Mon, Jan 4, 2010 at 5:33 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
> I'm sure whatever conclusion -hackers comes to in the end will be the
> best for pgsql, and I'll be supportive. But until then, let me note
> from the PostGIS point-of-view: sure would be great to get this in for
> 8.5 :)

That's good to know.   The current status is that I've been waiting
for a patch that applies cleanly for 6 days, and we have 41 days left
until the end of the last CommitFest.  There's not much I can do to
move this along until I have a clean patch to work with.

...Robert


Re: KNNGiST for knn-search (WIP)

From
Oleg Bartunov
Date:
Robert,

On Mon, 4 Jan 2010, Robert Haas wrote:

> On Mon, Jan 4, 2010 at 5:33 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
>> I'm sure whatever conclusion -hackers comes to in the end will be the
>> best for pgsql, and I'll be supportive. But until then, let me note
>> from the PostGIS point-of-view: sure would be great to get this in for
>> 8.5 :)
>
> That's good to know.   The current status is that I've been waiting
> for a patch that applies cleanly for 6 days, and we have 41 days left
> until the end of the last CommitFest.  There's not much I can do to
> move this along until I have a clean patch to work with.

sorry, it's a long holiday in Russia, we'll be able to sync next week.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: KNNGiST for knn-search (WIP)

From
Teodor Sigaev
Date:
Changes:

- split patch to several ones
- sync with current CVS

Patch set is based on 0.5.1 version, difference between 0.5 and 0.6 should be
only in planner patch.

builtin_knngist_itself-0.6.gz  - patch to the gist itself
builtin_knngist_proc-0.6.gz - patch for support knnsearch in point_ops
builtin_knngist_planner-0.6.gz - planner patch to support knnearch
builtin_knngist_contrib_btree_gist-0.6.gz - patch for contrib/btree_gist module
                patch provides <-> operation for various scalar types which is
                exactly  abs(a - b) function
builtin_knngist_contrib_pg_trgm-0.6.gz - contrib/pg_trgm, like above,patch
                provides <-> distance between strings


Patch set sill requires rbtree patch and point_ops patch (with Robert's changes)

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: KNNGiST for knn-search (WIP)

From
Robert Haas
Date:
2010/1/12 Teodor Sigaev <teodor@sigaev.ru>:
> Changes:
>
> - split patch to several ones
> - sync with current CVS
>
> Patch set is based on 0.5.1 version, difference between 0.5 and 0.6 should
> be only in planner patch.
>
> builtin_knngist_itself-0.6.gz  - patch to the gist itself
> builtin_knngist_proc-0.6.gz - patch for support knnsearch in point_ops
> builtin_knngist_planner-0.6.gz - planner patch to support knnearch
> builtin_knngist_contrib_btree_gist-0.6.gz - patch for contrib/btree_gist
> module
>               patch provides <-> operation for various scalar types which is
>               exactly  abs(a - b) function
> builtin_knngist_contrib_pg_trgm-0.6.gz - contrib/pg_trgm, like above,patch
>               provides <-> distance between strings
>
>
> Patch set sill requires rbtree patch and point_ops patch (with Robert's
> changes)

Please update commitfest.postgresql.org.

...Robert