Re: index structure for 114-dimension vector - Mailing list pgsql-performance
From | Oleg Bartunov |
---|---|
Subject | Re: index structure for 114-dimension vector |
Date | |
Msg-id | Pine.LNX.4.64.0704270838230.12152@sn.sai.msu.ru Whole thread Raw |
In response to | Re: index structure for 114-dimension vector ("Alexander Staubo" <alex@purefiction.net>) |
Responses |
Re: index structure for 114-dimension vector
|
List | pgsql-performance |
On Fri, 27 Apr 2007, Alexander Staubo wrote: > On 4/20/07, Andrew Lazarus <andrew@pillette.com> wrote: >> I have a table with 2.5 million real[] arrays. (They are points in a >> time series.) Given a new array X, I'd like to find, say, the 25 >> closest to X in some sense--for simplification, let's just say in the >> usual vector norm. Speed is critical here, and everything I have tried >> has been too slow. > > Let me chime in with the observation that this is a multidimensional > nearest neighbour (reverse nearest neighbour and its close cousin, > k-NN) that is well known in statistics, and particularly relevant to > statistical learning and classification. Knowing the jargon might help > you dig up efficient algorithms to mine your data; there are tons of > fascinating papers available through Citeseer. > > In particular, I recommend the paper "Efficient k-NN Search on > Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here: > http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It > proposes an algorithm called BOND to drastically reduce the search > space by probalistic means. They give an example using image > histograms, but the algorithm applies to any multidimensional data. > Briefly put, it points out that proximity comparison can be computed > vertically, a few dimensions at a time, and entire subsets can be > thrown away when it's apparent that they are below a statistically > derived lower bound. The only gotcha is that the algorithm derives > much of its performance from the assumption that your data is > vertically decomposed, one table per dimension, otherwise the search > effectively incurs a sequential scan of the entire dataset, and then > you're pretty much back to square one. > > The most common approach to nearest neighbour search is to use a > spatial data structure. The classic algorithm is the kd-tree > (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B > tree, neither of which are available in PostgreSQL. If I remember > correctly, R-trees have also been shown to be useful for high numbers > of dimensions; with PostgreSQL you have R-trees and even better > R-tree-equivalent support through GiST. I have no idea whether you can > actually munge your integer vectors into something GiST can index and > search, but it's a thought. (GiST, presumably, can also theoretically > index kd-trees and other spatial trees.) you're right, but currently only theoretically due to interface restriction. We have plan to improve it sometime. There was SP-GiST project, which could be used for k-d-b tree, see http://www.cs.purdue.edu/spgist/ I don't know if it works with 8.2 version. Also, it doesn't supports concurrency and recovery > > Alexander. > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
pgsql-performance by date: