Thread: indexing array columns

indexing array columns

From
Rajarshi Guha
Date:
Hi, I have a table of about 10M rows. It has two columns A and B, where
A is a text field and B is a real[12] field.

Now when this table is queried it is usually of the form:

select A from theTable where sim(B, C) > 0.8

Now C will be a 12-element array specified by the user and the value 0.8
can be arbitrary. The function, sim(), is essentially a similarity
function which simply computes the inverse of the Manhattan distance
between the query array and the rows of the array column. Right now the
above query uses a seq scan.

Furthermore, the values of the individual array elements for any given
row can vary from 0 to infinity (but for most cases will be numbers less
than 1000)

My question is: how can I index the column B so that such queries are
fast.

I realize that my table is essentially a collection of 12-dimensional
points and that I could replace my similarity function with a distance
function.

Thus my query boils down to asking 'find me rows of the table that are
within X distance of my query'

I know that the GIS community deals with 2D points, but I'm not familiar
with this area and if I understand correctly, they use Euclidean
distances, where as I need Manhattan distances.

What type of indexing, available in Postgres could be used for my
problem? Would it require me to implement my own indexing scheme?

Any pointers would be appreciated

Thanks,

-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Every little picofarad has a nanohenry all its own.
-- Don Vonada



Re: indexing array columns

From
Martijn van Oosterhout
Date:
On Fri, Apr 13, 2007 at 06:09:50PM -0400, Rajarshi Guha wrote:
> Hi, I have a table of about 10M rows. It has two columns A and B, where
> A is a text field and B is a real[12] field.
>
> Now when this table is queried it is usually of the form:
>
> select A from theTable where sim(B, C) > 0.8

<snip>

> What type of indexing, available in Postgres could be used for my
> problem? Would it require me to implement my own indexing scheme?

PostgreSQL provides very flexible indexing mechanisms. By using GiST
you should be able to index the way you want. In contrib there a module
"cube" which does similar to what you want to 3D, extending it to 12D
shouldn't be too hard...

Have a nice day,

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: indexing array columns

From
Tom Lane
Date:
Rajarshi Guha <rguha@indiana.edu> writes:
> select A from theTable where sim(B, C) > 0.8

> I realize that my table is essentially a collection of 12-dimensional
> points and that I could replace my similarity function with a distance
> function.

> Thus my query boils down to asking 'find me rows of the table that are
> within X distance of my query'

> I know that the GIS community deals with 2D points, but I'm not familiar
> with this area and if I understand correctly, they use Euclidean
> distances, where as I need Manhattan distances.

> What type of indexing, available in Postgres could be used for my
> problem? Would it require me to implement my own indexing scheme?

AFAIK there's nothing "canned" that addresses this problem.  In
principle you could implement it as an operator class for GIST,
which is much less work than inventing your own index access method,
but still not exactly trivial.

In any case you would need to recast the queries so that the WHERE
clauses look like binary operators returning boolean.  I think the
most straightforward way would be to invent a concept of a 12-D
Manhattan sphere, and a point-contained-in-sphere operator, so that
your query becomes

    select ... where B <@ sphere(C, 0.8);

If you've got distance, volume, and contains/contained-in operations
for your datatype, then it should be fairly straightforward to adapt
one of the existing GIST "r-tree" opclasses.

            regards, tom lane

Re: indexing array columns

From
Teodor Sigaev
Date:
> you should be able to index the way you want. In contrib there a module
> "cube" which does similar to what you want to 3D, extending it to 12D
> shouldn't be too hard...

contrib/cube module implements N dimensional cube representation

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/