Re: index structure for 114-dimension vector - Mailing list pgsql-performance

From Oleg Bartunov
Subject Re: index structure for 114-dimension vector
Date
Msg-id Pine.LNX.4.64.0704270838230.12152@sn.sai.msu.ru
Whole thread Raw
In response to Re: index structure for 114-dimension vector  ("Alexander Staubo" <alex@purefiction.net>)
Responses Re: index structure for 114-dimension vector  (Andrew Lazarus <andrew@pillette.com>)
List pgsql-performance
On Fri, 27 Apr 2007, Alexander Staubo wrote:

> On 4/20/07, Andrew Lazarus <andrew@pillette.com> wrote:
>> I have a table with 2.5 million real[] arrays. (They are points in a
>> time series.) Given a new array X, I'd like to find, say, the 25
>> closest to X in some sense--for simplification, let's just say in the
>> usual vector norm. Speed is critical here, and everything I have tried
>> has been too slow.
>
> Let me chime in with the observation that this is a multidimensional
> nearest neighbour (reverse nearest neighbour and its close cousin,
> k-NN) that is well known in statistics, and particularly relevant to
> statistical learning and classification. Knowing the jargon might help
> you dig up efficient algorithms to mine your data; there are tons of
> fascinating papers available through Citeseer.
>
> In particular, I recommend the paper "Efficient k-NN Search on
> Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
> http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
> proposes an algorithm called BOND to drastically reduce the search
> space by probalistic means. They give an example using image
> histograms, but the algorithm applies to any multidimensional data.
> Briefly put, it points out that proximity comparison can be computed
> vertically, a few dimensions at a time, and entire subsets can be
> thrown away when it's apparent that they are below a statistically
> derived lower bound. The only gotcha is that the algorithm derives
> much of its performance from the assumption that your data is
> vertically decomposed, one table per dimension, otherwise the search
> effectively incurs a sequential scan of the entire dataset, and then
> you're pretty much back to square one.
>
> The most common approach to nearest neighbour search is to use a
> spatial data structure. The classic algorithm is the kd-tree
> (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
> tree, neither of which are available in PostgreSQL. If I remember
> correctly, R-trees have also been shown to be useful for high numbers
> of dimensions; with PostgreSQL you have R-trees and even better
> R-tree-equivalent support through GiST. I have no idea whether you can
> actually munge your integer vectors into something GiST can index and
> search, but it's a thought. (GiST, presumably, can also theoretically
> index kd-trees and other spatial trees.)

you're right, but currently  only theoretically due to interface restriction.
We have plan to improve it sometime. There was SP-GiST project, which
could be used for k-d-b tree,  see http://www.cs.purdue.edu/spgist/
I don't know if it works with 8.2 version. Also, it doesn't supports
concurrency and recovery

>
> Alexander.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>              http://archives.postgresql.org
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

pgsql-performance by date:

Previous
From: "Andres Retzlaff"
Date:
Subject: Usage up to 50% CPU
Next
From: Oleg Bartunov
Date:
Subject: Re: Simple query, 10 million records...MySQL ten times faster