Thread: Performance indexing of a simple query
I have a table called 'jobs' with several million rows, and the only columns that are important to this discussion are 'start_time' and 'completion_time'. The sort of queries I want to execute (among others) are like: SELECT * FROM jobs WHERE completion_time > SOMEDATE AND start_time < SOMEDATE; In plain english: All the jobs that were running at SOMEDATE. The result of the query is on the order of 500 rows. I've got seperate indexes on 'start_time' and 'completion_time'. Now, if SOMEDATE is such that the number of rows with completion_time > SOMEDATE is small (say 10s of thousands), the query uses index scans and executes quickly. If not, the query uses sequential scans and is unacceptably slow (a couple of minutes). I've used EXPLAIN and EXPLAIN ANALYZE to confirm this. This makes perfect sense to me. I've played with some of the memory settings for PostgreSQL, but none has had a significant impact. Any ideas on how to structure the query or add/change indexes in such a way to improve its performance? In desperation, I tried using a subquery, but unsurprisingly it made no (positive) difference. I feel like there might be a way of using an index on both 'completion_time' and 'start_time', but can't put a temporal lobe on the details. Mark
Try CREATE INDEX start_complete ON jobs( start_time, completion_time ); Try also completion_time, start_time. One might work better than the other. Or, depending on your data, you might want to keep both. In 8.1 you'll be able to do bitmap-based index combination, which might allow making use of the seperate indexes. On Wed, Aug 24, 2005 at 02:43:51PM -0600, Mark Fox wrote: > I have a table called 'jobs' with several million rows, and the only > columns that are important to this discussion are 'start_time' and > 'completion_time'. > > The sort of queries I want to execute (among others) are like: > > SELECT * FROM jobs > WHERE completion_time > SOMEDATE AND start_time < SOMEDATE; > > In plain english: All the jobs that were running at SOMEDATE. The > result of the query is on the order of 500 rows. > > I've got seperate indexes on 'start_time' and 'completion_time'. > > Now, if SOMEDATE is such that the number of rows with completion_time > > SOMEDATE is small (say 10s of thousands), the query uses index scans > and executes quickly. If not, the query uses sequential scans and is > unacceptably slow (a couple of minutes). I've used EXPLAIN and > EXPLAIN ANALYZE to confirm this. This makes perfect sense to me. > > I've played with some of the memory settings for PostgreSQL, but none > has had a significant impact. > > Any ideas on how to structure the query or add/change indexes in such > a way to improve its performance? In desperation, I tried using a > subquery, but unsurprisingly it made no (positive) difference. I feel > like there might be a way of using an index on both 'completion_time' > and 'start_time', but can't put a temporal lobe on the details. > > > Mark > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com 512-569-9461
Mark Fox <mark.fox@gmail.com> writes: > The sort of queries I want to execute (among others) are like: > SELECT * FROM jobs > WHERE completion_time > SOMEDATE AND start_time < SOMEDATE; > In plain english: All the jobs that were running at SOMEDATE. AFAIK there is no good way to do this with btree indexes; the problem is that it's fundamentally a 2-dimensional query and btrees are 1-dimensional. There are various hacks you can try if you're willing to constrain the problem (eg, if you can assume some not-very-large maximum on the running time of jobs) but in full generality btrees are just the Wrong Thing. So what you want to look at is a non-btree index, ie, rtree or gist. For example, the contrib/seg data type could pretty directly be adapted to solve this problem, since it can index searches for overlapping line segments. The main drawback of these index types in existing releases is that they are bad on concurrent updates and don't have WAL support. Both those things are (allegedly) fixed for GIST in 8.1 ... are you interested in trying out 8.1beta? regards, tom lane
On Wed, Aug 24, 2005 at 07:42:00PM -0400, Tom Lane wrote: > Mark Fox <mark.fox@gmail.com> writes: > > The sort of queries I want to execute (among others) are like: > > SELECT * FROM jobs > > WHERE completion_time > SOMEDATE AND start_time < SOMEDATE; > > In plain english: All the jobs that were running at SOMEDATE. Uh, the plain english and the SQL don't match. That query will find every job that was NOT running at the time you said. > AFAIK there is no good way to do this with btree indexes; the problem > is that it's fundamentally a 2-dimensional query and btrees are > 1-dimensional. There are various hacks you can try if you're willing > to constrain the problem (eg, if you can assume some not-very-large > maximum on the running time of jobs) but in full generality btrees are > just the Wrong Thing. Ignoring the SQL and doing what the author actually wanted, wouldn't a bitmap combination of indexes work here? Or with an index on (start_time, completion_time), start an index scan at start_time = SOMEDATE and only include rows where completion_time < SOMEDATE. Of course if SOMEDATE is near the beginning of the table that wouldn't help. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Uh, the plain english and the SQL don't match. That query will find > every job that was NOT running at the time you said. No, I think it was right. But anyway it was just an example. > On Wed, Aug 24, 2005 at 07:42:00PM -0400, Tom Lane wrote: >> AFAIK there is no good way to do this with btree indexes; the problem >> is that it's fundamentally a 2-dimensional query and btrees are >> 1-dimensional. There are various hacks you can try if you're willing >> to constrain the problem (eg, if you can assume some not-very-large >> maximum on the running time of jobs) but in full generality btrees are >> just the Wrong Thing. > Ignoring the SQL and doing what the author actually wanted, wouldn't a > bitmap combination of indexes work here? > Or with an index on (start_time, completion_time), start an index scan > at start_time = SOMEDATE and only include rows where completion_time < > SOMEDATE. Of course if SOMEDATE is near the beginning of the table that > wouldn't help. The trouble with either of those is that you have to scan very large fractions of the index (if not indeed *all* of it) in order to get your answer; certainly you hit much more of the index than just the region containing matching rows. Btree just doesn't have a good way to answer this type of query. regards, tom lane