Home > mailing lists

Re: index structure for 114-dimension vector - Mailing list pgsql-performance

From	C Storm
Subject	Re: index structure for 114-dimension vector
Date	April 26, 2007 20:58:45
Msg-id	1177350571.162300.296770@b75g2000hsg.googlegroups.com Whole thread Raw
In response to	index structure for 114-dimension vector (Andrew Lazarus <andrew@pillette.com>)
List	pgsql-performance

Tree view

On Apr 20, 12:07 pm, and...@pillette.com (Andrew Lazarus) wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usualvectornorm. Speed is critical here, and everything I have tried
> has been too slow.
>
> I imported the cube contrib package, and I tried creating an index on
> a cube of the last 6 elements, which are the most important. Then I
> tested the 2.5MM rows for being contained within a tolerance of the
> last 6 elements of X, +/- 0.1 in each coordinate, figuring that would
> be an indexed search (which I CLUSTERED on). I then ran the sort on
> this smaller set. The index was used, but it was still too slow. I
> also tried creating new columns with rounded int2 values of the last 6
> coordinates and made a multicolumn index.
>
> For each X the search is taking about 4-15 seconds which is above my
> target at least one order of magnitude. Absolute numbers are dependent
> on my hardware and settings, and some of this can be addressed with
> configuration tweaks, etc., but first I think I need to know the
> optimum data structure/indexing strategy.
>
> Is anyone on the list experienced with this sort of issue?
>
> Thanks.
> Andrew Lazarus  and...@pillette.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

Having worked in high dimensional spaces a lot in my career I think
you'll find that there are
mathematical limits in terms of speed.  In practical terms, a seq_scan
will be unavoidable since
on first approximation you are limited to doing an exhaustive search
in 101-dimensional space unless
you make approximations or dimensionality reductions of some kind.

Read up on the Curse of Dimensionality:  http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimension reduction techniques such as Singular
Value Decomposition,
Principal Components Analysis, etc.?

pgsql-performance by date:

From: Scott Marlowe
Date: 26 April 2007, 19:22:33
Subject: Re: not using indexes on large table

From: "Sergey Tsukinovsky"
Date: 26 April 2007, 20:59:01
Subject: Re: postgres: 100% CPU utilization

Re: index structure for 114-dimension vector - Mailing list pgsql-performance

Previous

Next