Thread: Help w/speeding up range queries?

Help w/speeding up range queries?

From
John Major
Date:
Hello-

#I am a biologist, and work with large datasets (tables with millions of
rows are common).
#These datasets often can be simplified as features with a name, and a
start and end position (ie:  a range along a number line.  GeneX is on
some chromosome from position 10->40)

I store  these features in tables that generally have the form:

SIMPLE_TABLE:
FeatureID(PrimaryKey) -- FeatureName(varchar) --
FeatureChromosomeName(varchar) -- StartPosition(int) -- EndPosition(int)

My problem is, I often need to execute searches of tables like these
which find "All features within a range".
Ie:  select FeatureID from SIMPLE_TABLE where FeatureChromosomeName like
'chrX' and StartPosition > 1000500 and EndPosition < 2000000;

This kind of query is VERY slow, and I've tried tinkering with indexes
to speed it up, but with little success.
Indexes on Chromosome help a little, but it I can't think of a way to
avoid full table scans for each of the position range queries.

Any advice on how I might be able to improve this situation would be
very helpful.

Thanks!
John

Re: Help w/speeding up range queries?

From
"Luke Lonergan"
Date:
John,

On 10/31/06 3:18 PM, "John Major" <major@cbio.mskcc.org> wrote:

> Any advice on how I might be able to improve this situation would be
> very helpful.

I think table partitioning is exactly what you need.

There's a basic capability in current Postgres to divide tables into parent
+ children, each of which have a constraint for the rows inside (in your
case chromosome).  When you query the parent, the planner will exclude child
tables outside of the predicate range.

- Luke



Re: Help w/speeding up range queries?

From
Weslee Bilodeau
Date:
John Major wrote:
> Hello-
>
> #I am a biologist, and work with large datasets (tables with millions of
> rows are common).
> #These datasets often can be simplified as features with a name, and a
> start and end position (ie:  a range along a number line.  GeneX is on
> some chromosome from position 10->40)
>
> I store  these features in tables that generally have the form:
>
> SIMPLE_TABLE:
> FeatureID(PrimaryKey) -- FeatureName(varchar) --
> FeatureChromosomeName(varchar) -- StartPosition(int) -- EndPosition(int)
>
> My problem is, I often need to execute searches of tables like these
> which find "All features within a range". Ie:  select FeatureID from
> SIMPLE_TABLE where FeatureChromosomeName like 'chrX' and StartPosition >
> 1000500 and EndPosition < 2000000;
>
> This kind of query is VERY slow, and I've tried tinkering with indexes
> to speed it up, but with little success.
> Indexes on Chromosome help a little, but it I can't think of a way to
> avoid full table scans for each of the position range queries.
>
> Any advice on how I might be able to improve this situation would be
> very helpful.

Basic question - What version, and what indexes do you have?

Have an EXPLAIN?

Something like -

CREATE INDEX index_name ON SIMPLE_TABLE ( FeatureChromosomeName
varchar_pattern_ops, StartPosition, EndPosition );

The varchar_pattern_ops being the "key" so LIKE can use an index.
Provided of course its LIKE 'something%' and not LIKE '%something'


>
> Thanks!
> John
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                http://www.postgresql.org/about/donate
>


Weslee

Re: Help w/speeding up range queries?

From
"Luke Lonergan"
Date:
Weslee,

On 10/31/06 3:57 PM, "Weslee Bilodeau"
<weslee.bilodeau@hypermediasystems.com> wrote:

> Basic question - What version, and what indexes do you have?

I'd expect the problem with this is that unless the indexed column is
correlated with the loading order of the rows over time, then the index will
refer to rows distributed non-sequentially on disk, in which case the index
can be worse than a sequential scan.

You can cluster the table on the index (don't use the "CLUSTER" command! Do
a CREATE TABLE AS SELECT .. ORDER BY instead!), but the index won't refer to
sequential table data when there's more data added.  What this does is
analogous to the partitioning option though, and you don't have the problem
of the table being de-clustered on the constraint column.

The problem with the current support for partitioning is that you have to
implement rules for inserts/updates/deletes so that you can do them to the
parent and they will be implemented on the children.  As a result,
partitioning is not transparent.  OTOH, it achieves great performance gains.

BTW - If you have a date column and your data is loaded in date order, then
an index is all that's necessary, you will get sequential access.

- Luke



Re: Help w/speeding up range queries?

From
Tom Lane
Date:
John Major <major@cbio.mskcc.org> writes:
> My problem is, I often need to execute searches of tables like these
> which find "All features within a range".
> Ie:  select FeatureID from SIMPLE_TABLE where FeatureChromosomeName like
> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;

A standard btree index is just going to suck for these types of queries;
you need something that's actually designed for spatial range queries.
You might look at the contrib/seg module --- if you can store your
ranges as "seg" datatype then the seg overlap operator expresses what
you need to do, and searches on an overlap operator can be handled well
by a GIST index.

Also, there's the PostGIS stuff, though it might be overkill for what
you want.

            regards, tom lane

Re: Help w/speeding up range queries?

From
"Luke Lonergan"
Date:
John,

On 10/31/06 8:29 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;
>
> Also, there's the PostGIS stuff, though it might be overkill for what
> you want.

Oops - I missed the point earlier.  Start and End are separate attributes so
this is like an unbounded window in a Start,End space.  PostGis provides
quadtree indexing would provide a terse TID list but you still have the
problem of how to ensure that the heap tuples being scanned are efficiently
retrieved, which would only happen if they are grouped similarly to the
retrieval pattern, right?

- Luke



Re: Help w/speeding up range queries?

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> Oops - I missed the point earlier.  Start and End are separate attributes so
> this is like an unbounded window in a Start,End space.  PostGis provides
> quadtree indexing would provide a terse TID list but you still have the
> problem of how to ensure that the heap tuples being scanned are efficiently
> retrieved, which would only happen if they are grouped similarly to the
> retrieval pattern, right?

Yeah, but I think that's a second-order problem compared to having an
index that's reasonably well matched to the query ...

            regards, tom lane

Re: Help w/speeding up range queries?

From
"Marcin Mank"
Date:
> Ie:  select FeatureID from SIMPLE_TABLE where FeatureChromosomeName like
> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;

How about ( this assumes that StartPosition <= EndPosition ):

select FeatureID
from SIMPLE_TABLE
where FeatureChromosomeName llike 'chrX'
and StartPosition > 1000500
and StartPosition < 2000000
and EndPosition > 1000500
and EndPosition < 2000000;


This at least should help the planner with estimating number of rows.

Also think twice when You assume that a query with ILIKE will use an index.
Read about varchar_pattern_ops.
Make an index on (FeatureChromosomeName,StartPosition) , and all should be
fine.

Greetings
Marcin


Re: Help w/speeding up range queries?

From
"Simon Riggs"
Date:
On Tue, 2006-10-31 at 18:18 -0500, John Major wrote:

> #I am a biologist, and work with large datasets (tables with millions of
> rows are common).
> #These datasets often can be simplified as features with a name, and a
> start and end position (ie:  a range along a number line.  GeneX is on
> some chromosome from position 10->40)

Do you know about www.biopostgres.org ?

I believe they provide some additional indexing mechanisms for just this
type of data.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Help w/speeding up range queries?

From
Jim Nasby
Date:
On Oct 31, 2006, at 8:29 PM, Tom Lane wrote:
> John Major <major@cbio.mskcc.org> writes:
>> My problem is, I often need to execute searches of tables like these
>> which find "All features within a range".
>> Ie:  select FeatureID from SIMPLE_TABLE where
>> FeatureChromosomeName like
>> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;
>
> A standard btree index is just going to suck for these types of
> queries;
> you need something that's actually designed for spatial range queries.
> You might look at the contrib/seg module --- if you can store your
> ranges as "seg" datatype then the seg overlap operator expresses what
> you need to do, and searches on an overlap operator can be handled
> well
> by a GIST index.
>
> Also, there's the PostGIS stuff, though it might be overkill for what
> you want.

Another possibility (think Tom has suggested in the past) is to
define Start and End as a box, and then use the geometric functions
built into plain PostgreSQL (though perhaps that's what he meant by
"PostGIS stuff").
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)