Thread: Performance indexing of a simple query

Performance indexing of a simple query

From: Mark Fox

I have a table called 'jobs' with several million rows, and the only
columns that are important to this discussion are 'start_time' and
'completion_time'.

The sort of queries I want to execute (among others) are like:

SELECT * FROM jobs
WHERE completion_time > SOMEDATE AND start_time < SOMEDATE;

In plain English:  All the jobs that were running at SOMEDATE.  The
result of the query is on the order of 500 rows.

I've got separate indexes on 'start_time' and 'completion_time'.

Now, if SOMEDATE is such that the number of rows with
completion_time > SOMEDATE is small (say tens of thousands), the query
uses index scans and executes quickly.  If not, the query uses
sequential scans and is unacceptably slow (a couple of minutes).  I've
used EXPLAIN and EXPLAIN ANALYZE to confirm this.  This makes perfect
sense to me.

I've played with some of the memory settings for PostgreSQL, but none
has had a significant impact.

Any ideas on how to structure the query or add/change indexes in such
a way to improve its performance?  In desperation, I tried using a
subquery, but unsurprisingly it made no (positive) difference.  I feel
like there might be a way of using an index on both 'completion_time'
and 'start_time', but can't put a temporal lobe on the details.


Mark

Re: Performance indexing of a simple query

From: "Jim C. Nasby"

Try

CREATE INDEX start_complete ON jobs( start_time, completion_time );

Try also completion_time, start_time. One might work better than the
other. Or, depending on your data, you might want to keep both.
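
Spelled out, the reversed variant and a way to compare the two might
look like the sketch below. The index name and the timestamp literal
(a stand-in for SOMEDATE) are made up for illustration.

```sql
-- Reversed column order; the index name is arbitrary.
CREATE INDEX complete_start ON jobs (completion_time, start_time);

-- Refresh planner statistics so the new index is costed sensibly.
ANALYZE jobs;

-- Check which plan is chosen and how long it actually takes.
EXPLAIN ANALYZE
SELECT * FROM jobs
WHERE completion_time > TIMESTAMP '2005-08-01 12:00:00'
  AND start_time < TIMESTAMP '2005-08-01 12:00:00';
```

Running the same EXPLAIN ANALYZE with only one of the two composite
indexes present should show which column order suits the data.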

In 8.1 you'll be able to do bitmap-based index combination, which might
allow making use of the separate indexes.

On Wed, Aug 24, 2005 at 02:43:51PM -0600, Mark Fox wrote:
> I have a table called 'jobs' with several million rows, and the only
> columns that are important to this discussion are 'start_time' and
> 'completion_time'.
>
> The sort of queries I want to execute (among others) are like:
>
> SELECT * FROM jobs
> WHERE completion_time > SOMEDATE AND start_time < SOMEDATE;
>
> In plain English:  All the jobs that were running at SOMEDATE.  The
> result of the query is on the order of 500 rows.
>
> I've got separate indexes on 'start_time' and 'completion_time'.
>
> Now, if SOMEDATE is such that the number of rows with
> completion_time > SOMEDATE is small (say tens of thousands), the query
> uses index scans and executes quickly.  If not, the query uses
> sequential scans and is unacceptably slow (a couple of minutes).  I've
> used EXPLAIN and EXPLAIN ANALYZE to confirm this.  This makes perfect
> sense to me.
>
> I've played with some of the memory settings for PostgreSQL, but none
> has had a significant impact.
>
> Any ideas on how to structure the query or add/change indexes in such
> a way to improve its performance?  In desperation, I tried using a
> subquery, but unsurprisingly it made no (positive) difference.  I feel
> like there might be a way of using an index on both 'completion_time'
> and 'start_time', but can't put a temporal lobe on the details.
>
>
> Mark

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com        512-569-9461

Re: Performance indexing of a simple query

From: Tom Lane

Mark Fox <mark.fox@gmail.com> writes:
> The sort of queries I want to execute (among others) are like:
> SELECT * FROM jobs
> WHERE completion_time > SOMEDATE AND start_time < SOMEDATE;
> In plain English:  All the jobs that were running at SOMEDATE.

AFAIK there is no good way to do this with btree indexes; the problem
is that it's fundamentally a 2-dimensional query and btrees are
1-dimensional.  There are various hacks you can try if you're willing
to constrain the problem (eg, if you can assume some not-very-large
maximum on the running time of jobs) but in full generality btrees are
just the Wrong Thing.
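
One way to read the constrained variant: if no job is assumed to run
longer than some bound, the start_time condition becomes a narrow range
that a plain btree index on start_time can serve. The 7-day bound and
the timestamp literal below are invented for illustration, not taken
from the thread.

```sql
-- Assumption (hypothetical): no job runs longer than 7 days.
-- A job running at SOMEDATE must then have started within the
-- 7 days before SOMEDATE, which bounds the btree scan on start_time.
SELECT * FROM jobs
WHERE start_time <  TIMESTAMP '2005-08-01 12:00:00'
  AND start_time >  TIMESTAMP '2005-08-01 12:00:00' - INTERVAL '7 days'
  AND completion_time > TIMESTAMP '2005-08-01 12:00:00';
```

Any job that does exceed the assumed bound is silently missed, so this
only works when the bound really holds for the data.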

So what you want to look at is a non-btree index, ie, rtree or gist.
For example, the contrib/seg data type could pretty directly be adapted
to solve this problem, since it can index searches for overlapping
line segments.
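
For readers on later releases: PostgreSQL 9.2 and newer ship built-in
range types that cover the same interval search without adapting
contrib/seg. This is not available in the 8.0/8.1 releases discussed
here. A minimal sketch, assuming both columns are of type timestamp
without time zone; the index name and the timestamp literal are made up.

```sql
-- Requires PostgreSQL 9.2+; use tstzrange for timestamp with time zone.
-- '()' makes both bounds exclusive, matching
--   start_time < SOMEDATE AND completion_time > SOMEDATE.
-- Caveats: tsrange() raises an error if start_time > completion_time,
-- and a NULL bound is treated as unbounded (unlike the original query,
-- where a NULL completion_time never matches).
CREATE INDEX jobs_running_range ON jobs
    USING gist (tsrange(start_time, completion_time, '()'));

-- The WHERE expression must match the indexed expression exactly.
SELECT * FROM jobs
WHERE tsrange(start_time, completion_time, '()')
      @> TIMESTAMP '2005-08-01 12:00:00';
```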

The main drawback of these index types in existing releases is that they
are bad on concurrent updates and don't have WAL support.  Both those
things are (allegedly) fixed for GIST in 8.1 ... are you interested in
trying out 8.1beta?

            regards, tom lane

Re: Performance indexing of a simple query

From: "Jim C. Nasby"

On Wed, Aug 24, 2005 at 07:42:00PM -0400, Tom Lane wrote:
> Mark Fox <mark.fox@gmail.com> writes:
> > The sort of queries I want to execute (among others) are like:
> > SELECT * FROM jobs
> > WHERE completion_time > SOMEDATE AND start_time < SOMEDATE;
> > In plain English:  All the jobs that were running at SOMEDATE.

Uh, the plain English and the SQL don't match. That query will find
every job that was NOT running at the time you said.

> AFAIK there is no good way to do this with btree indexes; the problem
> is that it's fundamentally a 2-dimensional query and btrees are
> 1-dimensional.  There are various hacks you can try if you're willing
> to constrain the problem (eg, if you can assume some not-very-large
> maximum on the running time of jobs) but in full generality btrees are
> just the Wrong Thing.

Ignoring the SQL and doing what the author actually wanted, wouldn't a
bitmap combination of indexes work here?

Or with an index on (start_time, completion_time), start an index scan
at start_time = SOMEDATE and only include rows where completion_time <
SOMEDATE. Of course if SOMEDATE is near the beginning of the table that
wouldn't help.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com        512-569-9461

Re: Performance indexing of a simple query

From: Tom Lane

"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Uh, the plain English and the SQL don't match. That query will find
> every job that was NOT running at the time you said.

No, I think it was right.  But anyway it was just an example.
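
A quick way to check this is to plug in one concrete job. The times
below are hypothetical: a job that started at 10:00 and completed at
12:00, tested against SOMEDATE = 11:00.

```sql
-- Hypothetical job: start_time 10:00, completion_time 12:00.
-- SOMEDATE is 11:00, when the job was running.
SELECT TIMESTAMP '2005-08-24 12:00' > TIMESTAMP '2005-08-24 11:00'  -- completion_time > SOMEDATE
   AND TIMESTAMP '2005-08-24 10:00' < TIMESTAMP '2005-08-24 11:00'  -- start_time < SOMEDATE
       AS matches;
```

Both comparisons hold, so the expression evaluates to true and the
running job is selected. A job that had already completed at 10:30
would fail the first comparison and be excluded.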

> On Wed, Aug 24, 2005 at 07:42:00PM -0400, Tom Lane wrote:
>> AFAIK there is no good way to do this with btree indexes; the problem
>> is that it's fundamentally a 2-dimensional query and btrees are
>> 1-dimensional.  There are various hacks you can try if you're willing
>> to constrain the problem (eg, if you can assume some not-very-large
>> maximum on the running time of jobs) but in full generality btrees are
>> just the Wrong Thing.

> Ignoring the SQL and doing what the author actually wanted, wouldn't a
> bitmap combination of indexes work here?

> Or with an index on (start_time, completion_time), start an index scan
> at start_time = SOMEDATE and only include rows where completion_time <
> SOMEDATE. Of course if SOMEDATE is near the beginning of the table that
> wouldn't help.

The trouble with either of those is that you have to scan very large
fractions of the index (if not indeed *all* of it) in order to get your
answer; certainly you hit much more of the index than just the region
containing matching rows.  Btree just doesn't have a good way to answer
this type of query.

            regards, tom lane