Thread: several questions about R-tree index

several questions about R-tree index

From

"TJ O'Donnell"

Date:

26 April 2005, 23:52:34

According to the manual at:
http://www.postgresql.org/docs/7.4/static/functions-geometry.html
" The PostgreSQL query planner will consider using an R-tree index whenever an indexed column is
involved in a comparison using one of these operators: <<, &<, &>, >>, @, ~=, && (Refer to Section
9.9 about the meaning of these operators.)"

Shouldn't the ~ (contains) operator be included also?
Isn't ~ the commutator of @ ?

I am considering using R-tree's for other-than-geometric purposes.  Here is the story:
I have string data (column smiles) which represents chemical structures.  Comparing strings
for equality works perfectly using an ordinary B-tree index.  But locating chemical
substructures requires a more elaborate (read time-consuming) match function.
I search for substructures like this:
Select count(smiles) from structures where matches(smiles, subsmiles);

where subsmiles is the search pattern desired.  This is not a fast search - about
15 seconds for 0.25 million rows.
To speed up this search I have a function (called fingerprint) which produces a bit string
representation of some common and uncommon substructures.  I have populated my table
with column fp = fingerprint(smiles)
So, this is about 5 times faster:
 Select count(smiles) from structures where (fp & fingerprint(subsmiles)) = fingerprint(subsmiles) & matches(smiles,
subsmiles);
The sequence scan using the first where-clause takes about 3 seconds and the final
matches on the resulting subset is insignificant when the subset is small.

But I think I might be able to do better (faster) using R-trees.
Bitstrings can be thought of as "containing" when one bitstring has all the
same bits set as another, even if it has other bits set too - this is the gist of the
first where-clause above.

Can I expect to be able to use R-tree's to do this?
Will I simply have to define a new datatype and three R-tree functions
(union, intersection and size).
Will the use of the ~ (contains) operator cause the planner to consider
using an index (this is my first question, way above)?

I hope someone has had the patience and interest to read this far.

Thanks,
TJ

Re: several questions about R-tree index

From

Andrew - Supernews

Date:

27 April 2005, 00:39:40

On 2005-04-26, "TJ O'Donnell" <tjo@acm.org> wrote:
> But I think I might be able to do better (faster) using R-trees.
> Bitstrings can be thought of as "containing" when one bitstring has all the
> same bits set as another, even if it has other bits set too - this is the
> gist of the first where-clause above.
>
> Can I expect to be able to use R-tree's to do this?

You may want to use GIST instead - it is more flexible.

> Will I simply have to define a new datatype and three R-tree functions
> (union, intersection and size).
> Will the use of the ~ (contains) operator cause the planner to consider
> using an index (this is my first question, way above)?

What the planner actually looks for is, having identified the specific
operator and types, whether an index opclass exists for them, and if so
whether an index using that opclass exists. The actual names of the
operators are irrelevent.

So for either rtree or GIST, all you need is to define your new datatype,
with its associated operators, and create an operator class for it with
appropriate support functions, and create indexes using that opclass.
Once all that is done, the planner will consider using them.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

Re: several questions about R-tree index

From

Tom Lane

Date:

27 April 2005, 04:12:47

"TJ O'Donnell" <tjo@acm.org> writes:
> Shouldn't the ~ (contains) operator be included also?
> Isn't ~ the commutator of @ ?

Yeah, it looks like the documentation is in error:

regression=# select amopopr::regoperator,amopstrategy from pg_amop where amopclaid in
regression-# (select oid from pg_opclass where opcamid = 402);      amopopr       | amopstrategy 
---------------------+--------------<<(box,box)         |            1&<(box,box)         |            2&&(box,box)
   |            3&>(box,box)         |            4>>(box,box)         |            5~=(box,box)         |
6~(box,box)         |            7@(box,box)          |            8<<(polygon,polygon) |
1&<(polygon,polygon)|            2&&(polygon,polygon) |            3&>(polygon,polygon) |
4>>(polygon,polygon)|            5~=(polygon,polygon) |            6~(polygon,polygon)  |
7@(polygon,polygon) |            8
 
(16 rows)

Of course, the thing that leaps out here is that there are only two
built-in opclasses for rtree.  Not exactly confidence building.

> I am considering using R-tree's for other-than-geometric purposes.

I concur with Andrew's suggestion to consider GIST.  GIST has its own
issues, but at least there are people looking at it/using it/working on it.
R-tree doesn't seem to have any user community that really cares.
        regards, tom lane