Re: [HACKERS] [PATCH] kNN for btree - Mailing list pgsql-hackers

From Nikita Glukhov
Subject Re: [HACKERS] [PATCH] kNN for btree
Date
Msg-id b656b581-0b35-6893-fb6e-3686bfa16d5d@postgrespro.ru
Whole thread Raw
In response to [HACKERS] [PATCH] kNN for btree  (Nikita Glukhov <n.gluhov@postgrespro.ru>)
Responses Re: [HACKERS] [PATCH] kNN for btree
List pgsql-hackers
Sorry for the broken formatting in my previous message.
Below is a corrected version of this message.


I'd like to present a series of patches that implements k-Nearest Neighbors
(kNN) search for btree, which can be used to speed up ORDER BY distance
queries like this:
SELECT * FROM events ORDER BY date <-> '2000-01-01'::date ASC LIMIT 100;

Now only GiST supports kNN, but kNN on btree can be emulated using
contrib/btree_gist.


Scanning algorithm
==================

Algorithm is very simple: we use bidirectional B-tree index scan starting at
the point from which we measure the distance (target point).  At each step,
we advance this scan in the direction that has the  nearest point.  But when
the target point does not fall into the scanned range, we don't even need to
use a bidirectional scan here --- we can use ordinary unidirectional scan
in the right direction.


Performance results
===================

Test database is taken from original kNN-GiST presentation (PGCon 2010).

Test query

SELECT * FROM events ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;

can be optimized to the next rather complicated UNION form,
which no longer requires kNN:

WITH    t1 AS (SELECT * FROM events WHERE date >= '1957-10-04'::date           ORDER BY date ASC  LIMIT k),    t2 AS
(SELECT* FROM events WHERE date <  '1957-10-04'::date           ORDER BY date DESC LIMIT k),    t  AS (SELECT * FROM t1
UNIONSELECT * FROM t2)
 
SELECT * FROM t ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;


In each cell of this table shown query execution time in milliseconds and
the number of accessed blocks:

     k  |  kNN-btree   |  kNN-GiST    |  Opt. query   |  Seq. scan        |              | (btree_gist) |  with UNION
| with sort
 
--------|--------------|--------------|---------------|------------      1 |  0.041     4 |  0.079     4 |   0.060
8|  41.1 1824     10 |  0.048     7 |  0.091     9 |   0.097    17 |  41.8 1824    100 |  0.107    47 |  0.192    52 |
0.342   104 |  42.3 1824   1000 |  0.735   573 |  0.913   650 |   2.970  1160 |  43.5 1824  10000 |  5.070  5622 |
6.240 6760 |  36.300 11031 |  54.1 1824 100000 | 49.600 51608 | 61.900 64194 | 295.100 94980 | 115.0 1824
 


As you can see, kNN-btree can be two times faster than kNN-GiST (btree_gist)
when k < 1000, but the number of blocks read is roughly the same.


Implementation details
======================

A brief description is given below for each of the patches:

1. Introduce amcanorderbyop() function

This patch transforms existing boolean AM property amcanorderbyop into a method
(function pointer).  This is necessary because, unlike GiST, kNN for btree
supports only a one ordering operator on the first index column and we need a
different pathkey matching logic for btree (there was a corresponding comment
in match_pathkeys_to_index()).  GiST-specific logic has been moved from
match_pathkeys_to_index() to gistcanorderbyop().

2. Extract substructure BTScanState from BTScanOpaque

This refactoring is necessary for bidirectional kNN-scan implementation.
Now, BTScanOpaque's substructure BTScanState containing only the fields
related to scan position is passed to some functions where the whole
BTScanOpaque was passed previously.

3. Extract get_index_column_opclass(), get_opclass_opfamily_and_input_type().

Extracted two simple common functions used in gistproperty() and
btproperty() (see the next patch).

4. Add kNN support to btree
  * Added additional optional BTScanState to BTScanOpaque for    bidirectional kNN scan.  * Implemented bidirectional
kNNscan.  * Implemented logic for selecting kNN strategy  * Implemented btcanorderbyop(), updated btproperty() and
btvalidate()

B-tree user interface functions have not been altered because ordering
operators are used directly.

5. Add distance operators for some types

These operators for integer, float, date, time, timestamp, interval, cash and
oid types have been copied from contrib/btree_gist and added to the existing
btree opclasses as ordering operators.  Their btree_gist duplicates are removed
in the next patch.

6. Remove duplicate distance operators from contrib/btree_gist.

References to their own distance operators in btree_gist opclasses are
replaced with references to the built-in operators and than duplicate
operators are dropped.  But if the user is using somewhere these operators,
upgrade of btree_gist from 1.3 to 1.4 would fail.

7. Add regression tests for btree kNN.

Tests were added only after the built-in distance operators were added.


-- 
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: [HACKERS] postgres_fdw bug in 9.6
Next
From: Michael Paquier
Date:
Subject: Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal