Thread: Help w/speeding up range queries?
Hello-

#I am a biologist, and work with large datasets (tables with millions of rows are common).
#These datasets often can be simplified as features with a name, and a start and end position (ie: a range along a number line. GeneX is on some chromosome from position 10->40).

I store these features in tables that generally have the form:

SIMPLE_TABLE:
FeatureID(PrimaryKey) -- FeatureName(varchar) -- FeatureChromosomeName(varchar) -- StartPosition(int) -- EndPosition(int)

My problem is, I often need to execute searches of tables like these which find "All features within a range".
Ie: select FeatureID from SIMPLE_TABLE where FeatureChromosomeName like 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;

This kind of query is VERY slow, and I've tried tinkering with indexes to speed it up, but with little success.
Indexes on Chromosome help a little, but I can't think of a way to avoid full table scans for each of the position range queries.

Any advice on how I might be able to improve this situation would be very helpful.

Thanks!
John
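For reference, a minimal sketch of the schema and query being described; the column widths and the serial primary key are assumptions, only the names come from the post:

CREATE TABLE SIMPLE_TABLE (
    FeatureID             serial PRIMARY KEY,
    FeatureName           varchar(100),
    FeatureChromosomeName varchar(32),
    StartPosition         int,
    EndPosition           int
);

-- The slow range query:
SELECT FeatureID
FROM SIMPLE_TABLE
WHERE FeatureChromosomeName LIKE 'chrX'
  AND StartPosition > 1000500
  AND EndPosition   < 2000000;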
John,

On 10/31/06 3:18 PM, "John Major" <major@cbio.mskcc.org> wrote:

> Any advice on how I might be able to improve this situation would be
> very helpful.

I think table partitioning is exactly what you need.

There's a basic capability in current Postgres to divide tables into parent + children, each of which has a constraint for the rows inside (in your case chromosome). When you query the parent, the planner will exclude child tables outside of the predicate range.

- Luke
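A rough sketch of that parent/child layout, assuming one child per chromosome; the table names are made up, and inserts still have to be routed to the correct child with rules or triggers (see the note further down the thread):

CREATE TABLE features (
    FeatureID             serial,
    FeatureName           varchar,
    FeatureChromosomeName varchar,
    StartPosition         int,
    EndPosition           int
);

-- One child per chromosome; the CHECK constraint is what lets the planner
-- skip children that cannot match the query.
CREATE TABLE features_chr1 (
    CHECK (FeatureChromosomeName = 'chr1')
) INHERITS (features);

CREATE TABLE features_chrX (
    CHECK (FeatureChromosomeName = 'chrX')
) INHERITS (features);

-- Constraint exclusion must be enabled for the planner to prune children.
SET constraint_exclusion = on;

-- Querying the parent now only scans features_chrX:
SELECT FeatureID
FROM features
WHERE FeatureChromosomeName = 'chrX'
  AND StartPosition > 1000500
  AND EndPosition   < 2000000;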
John Major wrote:
> Hello-
>
> #I am a biologist, and work with large datasets (tables with millions of
> rows are common).
> #These datasets often can be simplified as features with a name, and a
> start and end position (ie: a range along a number line. GeneX is on
> some chromosome from position 10->40)
>
> I store these features in tables that generally have the form:
>
> SIMPLE_TABLE:
> FeatureID(PrimaryKey) -- FeatureName(varchar) --
> FeatureChromosomeName(varchar) -- StartPosition(int) -- EndPosition(int)
>
> My problem is, I often need to execute searches of tables like these
> which find "All features within a range". Ie: select FeatureID from
> SIMPLE_TABLE where FeatureChromosomeName like 'chrX' and StartPosition >
> 1000500 and EndPosition < 2000000;
>
> This kind of query is VERY slow, and I've tried tinkering with indexes
> to speed it up, but with little success.
> Indexes on Chromosome help a little, but I can't think of a way to
> avoid full table scans for each of the position range queries.
>
> Any advice on how I might be able to improve this situation would be
> very helpful.

Basic question - What version, and what indexes do you have? Have an EXPLAIN?

Something like -

CREATE INDEX index_name ON SIMPLE_TABLE (
    FeatureChromosomeName varchar_pattern_ops,
    StartPosition,
    EndPosition
);

The varchar_pattern_ops being the "key" so LIKE can use an index.
Provided of course it's LIKE 'something%' and not LIKE '%something'.

Weslee
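A quick way to gather what's being asked for here (the query is the one from the original post):

SELECT version();

EXPLAIN ANALYZE
SELECT FeatureID
FROM SIMPLE_TABLE
WHERE FeatureChromosomeName LIKE 'chrX'
  AND StartPosition > 1000500
  AND EndPosition   < 2000000;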
Weslee,

On 10/31/06 3:57 PM, "Weslee Bilodeau" <weslee.bilodeau@hypermediasystems.com> wrote:

> Basic question - What version, and what indexes do you have?

I'd expect the problem with this is that unless the indexed column is correlated with the loading order of the rows over time, the index will refer to rows distributed non-sequentially on disk, in which case the index can be worse than a sequential scan.

You can cluster the table on the index (don't use the "CLUSTER" command! Do a CREATE TABLE AS SELECT ... ORDER BY instead!), but the index won't refer to sequential table data once more data is added.

What this does is analogous to the partitioning option though, and you don't have the problem of the table being de-clustered on the constraint column.

The problem with the current support for partitioning is that you have to implement rules for inserts/updates/deletes so that you can issue them against the parent and have them applied to the children. As a result, partitioning is not transparent. OTOH, it achieves great performance gains.

BTW - if you have a date column and your data is loaded in date order, then an index is all that's necessary; you will get sequential access.

- Luke
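A sketch of that rewrite-to-cluster approach, assuming ordering by chromosome and start position; the copy-table and index names are made up, and indexes and constraints have to be recreated on the copy:

BEGIN;

CREATE TABLE simple_table_ordered AS
    SELECT *
    FROM SIMPLE_TABLE
    ORDER BY FeatureChromosomeName, StartPosition;

DROP TABLE SIMPLE_TABLE;
ALTER TABLE simple_table_ordered RENAME TO SIMPLE_TABLE;

-- Recreate indexes (and any constraints) on the newly ordered copy.
CREATE INDEX simple_table_chrom_start_idx
    ON SIMPLE_TABLE (FeatureChromosomeName, StartPosition);

COMMIT;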
John Major <major@cbio.mskcc.org> writes:
> My problem is, I often need to execute searches of tables like these
> which find "All features within a range".
> Ie: select FeatureID from SIMPLE_TABLE where FeatureChromosomeName like
> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;

A standard btree index is just going to suck for these types of queries;
you need something that's actually designed for spatial range queries.
You might look at the contrib/seg module --- if you can store your ranges
as "seg" datatype then the seg overlap operator expresses what you need
to do, and searches on an overlap operator can be handled well by a GIST
index.

Also, there's the PostGIS stuff, though it might be overkill for what
you want.

			regards, tom lane
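A sketch of what that could look like with contrib/seg; the added column and index names are assumptions, and the module has to be installed first (by running its contrib SQL script in this era of Postgres):

-- Store each feature's range in a single seg column.
ALTER TABLE SIMPLE_TABLE ADD COLUMN position_range seg;

UPDATE SIMPLE_TABLE
   SET position_range = (StartPosition::text || ' .. ' || EndPosition::text)::seg;

-- GiST index so the overlap operator can use an index scan.
CREATE INDEX simple_table_seg_idx
    ON SIMPLE_TABLE USING gist (position_range);

-- "All features overlapping 1000500 .. 2000000 on chrX":
SELECT FeatureID
FROM SIMPLE_TABLE
WHERE FeatureChromosomeName = 'chrX'
  AND position_range && '1000500 .. 2000000'::seg;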
John,

On 10/31/06 8:29 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;
>
> Also, there's the PostGIS stuff, though it might be overkill for what
> you want.

Oops - I missed the point earlier. Start and End are separate attributes, so this is like an unbounded window in a Start,End space. PostGIS's quadtree indexing would provide a terse TID list, but you still have the problem of how to ensure that the heap tuples being scanned are efficiently retrieved, which would only happen if they are grouped similarly to the retrieval pattern, right?

- Luke
"Luke Lonergan" <llonergan@greenplum.com> writes: > Oops - I missed the point earlier. Start and End are separate attributes so > this is like an unbounded window in a Start,End space. PostGis provides > quadtree indexing would provide a terse TID list but you still have the > problem of how to ensure that the heap tuples being scanned are efficiently > retrieved, which would only happen if they are grouped similarly to the > retrieval pattern, right? Yeah, but I think that's a second-order problem compared to having an index that's reasonably well matched to the query ... regards, tom lane
> Ie: select FeatureID from SIMPLE_TABLE where FeatureChromosomeName like
> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;

How about (this assumes that StartPosition <= EndPosition):

select FeatureID from SIMPLE_TABLE
 where FeatureChromosomeName like 'chrX'
   and StartPosition > 1000500 and StartPosition < 2000000
   and EndPosition > 1000500 and EndPosition < 2000000;

This at least should help the planner with estimating the number of rows.

Also think twice before you assume that a query with ILIKE will use an index. Read about varchar_pattern_ops.

Make an index on (FeatureChromosomeName, StartPosition), and all should be fine.

Greetings
Marcin
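The index being suggested, written out (the index name is just an example):

CREATE INDEX simple_table_chrom_startpos_idx
    ON SIMPLE_TABLE (FeatureChromosomeName, StartPosition);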
On Tue, 2006-10-31 at 18:18 -0500, John Major wrote:
> #I am a biologist, and work with large datasets (tables with millions of
> rows are common).
> #These datasets often can be simplified as features with a name, and a
> start and end position (ie: a range along a number line. GeneX is on
> some chromosome from position 10->40)

Do you know about www.biopostgres.org?

I believe they provide some additional indexing mechanisms for just this type of data.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Oct 31, 2006, at 8:29 PM, Tom Lane wrote:
> John Major <major@cbio.mskcc.org> writes:
>> My problem is, I often need to execute searches of tables like these
>> which find "All features within a range".
>> Ie: select FeatureID from SIMPLE_TABLE where FeatureChromosomeName like
>> 'chrX' and StartPosition > 1000500 and EndPosition < 2000000;
>
> A standard btree index is just going to suck for these types of queries;
> you need something that's actually designed for spatial range queries.
> You might look at the contrib/seg module --- if you can store your
> ranges as "seg" datatype then the seg overlap operator expresses what
> you need to do, and searches on an overlap operator can be handled well
> by a GIST index.
>
> Also, there's the PostGIS stuff, though it might be overkill for what
> you want.

Another possibility (I think Tom has suggested it in the past) is to define Start and End as a box, and then use the geometric functions built into plain PostgreSQL (though perhaps that's what he meant by "PostGIS stuff").
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
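A sketch of that approach, representing each range as a degenerate box lying on a line (the added column and index names are assumptions):

-- Store each range as a flat box: (StartPosition, 0) to (EndPosition, 0).
ALTER TABLE SIMPLE_TABLE ADD COLUMN position_box box;

UPDATE SIMPLE_TABLE
   SET position_box = box(point(StartPosition, 0), point(EndPosition, 0));

-- rtree index on the box column (from 8.2 on, USING gist serves the same role).
CREATE INDEX simple_table_box_idx
    ON SIMPLE_TABLE USING rtree (position_box);

-- "All features overlapping 1000500 .. 2000000 on chrX", via the built-in
-- box overlap operator:
SELECT FeatureID
FROM SIMPLE_TABLE
WHERE FeatureChromosomeName = 'chrX'
  AND position_box && box(point(1000500, 0), point(2000000, 0));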