Thread: Btree indexes, large numbers and <= comparisons

Btree indexes, large numbers and <= comparisons

From
Toke Høiland-Jørgensen
Date:
I have a table with ~5 million rows containing ranges of large (8-digit)
numbers. The table has an int4 field for the range start and the range end,
and a field which is null if that particular range is expired, and has a
value otherwise.

I need to query this table to find a range containing a particular number,
e.g. a query might look like this:

SELECT * FROM table_name WHERE range_start <= 87654321 AND range_end >=
87654321 AND expired IS NULL

My problem is that when I run a query like the above, the query planner does a
sequential scan, even though i have an index on both the query columns
separately, as well as an index containing both columns. The indexes are
defined like this:

CREATE INDEX range_start_end_index ON table_name USING btree (range_start,
range_end) WHERE expired IS NULL
CREATE INDEX range_start_index ON table_name USING btree (range_start) WHERE
expired IS NULL
CREATE INDEX range_end_index ON table_name USING btree (range_end) WHERE
expired IS NULL

When I do a query for smaller numbers (7-digit and below, as far as I can
see), the query planner uses the index(es) and the query is instantaneous.
However, when I run a query like the above, the planner decides to do a
sequential scan of the entire table.

I realize this probably has something to do with the planner only searching
for the first part of the WHERE clause (i.e. range_start <= 87654321) and
deciding that this will probably yield so many rows that a sequential scan
will yield results that are just as good. However, the data is structured in
such a way that multiple ranges containing the same number (and which are not
expired) do not exist. So in reality there will be either 1 or 0 results for
a query like the above.

How do I make the query planner realize that using the index is a Good
Thing(tm)?

Any help will be greatly appreciated.

Regards,
-Toke

Re: Btree indexes, large numbers and <= comparisons

From
"Alejandro D. Burne"
Date:
2007/3/29, Toke Høiland-Jørgensen <toke@toke.dk>:
> I have a table with ~5 million rows containing ranges of large (8-digit)
> numbers. The table has an int4 field for the range start and the range end,
> and a field which is null if that particular range is expired, and has a
> value otherwise.
>
> I need to query this table to find a range containing a particular number,
> e.g. a query might look like this:
>
> SELECT * FROM table_name WHERE range_start <= 87654321 AND range_end >=
> 87654321 AND expired IS NULL
>
> My problem is that when I run a query like the above, the query planner does a
> sequential scan, even though i have an index on both the query columns
> separately, as well as an index containing both columns. The indexes are
> defined like this:
>
> CREATE INDEX range_start_end_index ON table_name USING btree (range_start,
> range_end) WHERE expired IS NULL
> CREATE INDEX range_start_index ON table_name USING btree (range_start) WHERE
> expired IS NULL
> CREATE INDEX range_end_index ON table_name USING btree (range_end) WHERE
> expired IS NULL
>
> When I do a query for smaller numbers (7-digit and below, as far as I can
> see), the query planner uses the index(es) and the query is instantaneous.
> However, when I run a query like the above, the planner decides to do a
> sequential scan of the entire table.
>
> I realize this probably has something to do with the planner only searching
> for the first part of the WHERE clause (i.e. range_start <= 87654321) and
> deciding that this will probably yield so many rows that a sequential scan
> will yield results that are just as good. However, the data is structured in
> such a way that multiple ranges containing the same number (and which are not
> expired) do not exist. So in reality there will be either 1 or 0 results for
> a query like the above.
>
> How do I make the query planner realize that using the index is a Good
> Thing(tm)?
>
> Any help will be greatly appreciated.
>
> Regards,
> -Toke

Can you send an explain analyze for that query?

Alejandro

Re: Btree indexes, large numbers and <= comparisons

From
Ron Johnson
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03/29/07 06:33, Alejandro D. Burne wrote:
> 2007/3/29, Toke Høiland-Jørgensen <toke@toke.dk>:
[snip]
>>
>> Any help will be greatly appreciated.
>>
>> Regards,
>> -Toke
>
> Can you send an explain analyze for that query?

And tell us how often you VACCUM & which version of PG you are running.

- --
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD4DBQFGC7qOS9HxQb37XmcRAl8eAKDeMRvbTR8gjHgf2RugrELXffQCVQCWM8co
zySEdLt3JYKAkFMa6oywOw==
=Rg8p
-----END PGP SIGNATURE-----

Re: Btree indexes, large numbers and <= comparisons

From
Toke Høiland-Jørgensen
Date:
On Thursday 29 March 2007 13:33, Alejandro D. Burne wrote:
> 2007/3/29, Toke Høiland-Jørgensen <toke@toke.dk>:
> > I have a table with ~5 million rows containing ranges of large (8-digit)
> > numbers. The table has an int4 field for the range start and the range
> > end, and a field which is null if that particular range is expired, and
> > has a value otherwise.
> >
> > I need to query this table to find a range containing a particular
> > number, e.g. a query might look like this:
> >
> > SELECT * FROM table_name WHERE range_start <= 87654321 AND range_end >=
> > 87654321 AND expired IS NULL
> >
> > My problem is that when I run a query like the above, the query planner
> > does a sequential scan, even though i have an index on both the query
> > columns separately, as well as an index containing both columns. The
> > indexes are defined like this:
> >
> > CREATE INDEX range_start_end_index ON table_name USING btree
> > (range_start, range_end) WHERE expired IS NULL
> > CREATE INDEX range_start_index ON table_name USING btree (range_start)
> > WHERE expired IS NULL
> > CREATE INDEX range_end_index ON table_name USING btree (range_end) WHERE
> > expired IS NULL
> >
> > When I do a query for smaller numbers (7-digit and below, as far as I can
> > see), the query planner uses the index(es) and the query is
> > instantaneous. However, when I run a query like the above, the planner
> > decides to do a sequential scan of the entire table.
> >
> > I realize this probably has something to do with the planner only
> > searching for the first part of the WHERE clause (i.e. range_start <=
> > 87654321) and deciding that this will probably yield so many rows that a
> > sequential scan will yield results that are just as good. However, the
> > data is structured in such a way that multiple ranges containing the same
> > number (and which are not expired) do not exist. So in reality there will
> > be either 1 or 0 results for a query like the above.
> >
> > How do I make the query planner realize that using the index is a Good
> > Thing(tm)?
> >
> > Any help will be greatly appreciated.
> >
> > Regards,
> > -Toke
>
> Can you send an explain analyze for that query?
>
> Alejandro
>

Sure:

=> EXPLAIN ANALYSE SELECT * FROM table_name WHERE range_start <= 87654321 AND
range_end >= 87654321 AND expired IS NULL;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on table_name  (cost=0.00..137046.73 rows=161972 width=109) (actual
time=3894.471..3894.471 rows=0 loops=1)
   Filter: ((range_start <= 87654321) AND (range_end >= 87654321) AND (expired
IS NULL))
 Total runtime: 3894.531 ms
(3 rows)

The same thing with a smaller number:

=> EXPLAIN ANALYSE SELECT * FROM table_name WHERE range_start <= 8765432 AND
range_end >= 8765432 AND expired IS NULL;
                                                             QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using range_start_index on table_name  (cost=0.00..800.61 rows=200
width=109) (actual time=26.839..26.839 rows=0 loops=1)
   Index Cond: (range_start <= 8765432)
   Filter: ((range_end >= 8765432) AND (expired IS NULL))
 Total runtime: 26.899 ms
(4 rows)

Thanks for your prompt reply :)

-Toke

Re: Btree indexes, large numbers and <= comparisons

From
Toke Høiland-Jørgensen
Date:
On Thursday 29 March 2007 15:09, Ron Johnson wrote:
> On 03/29/07 06:33, Alejandro D. Burne wrote:
> > 2007/3/29, Toke Høiland-Jørgensen <toke@toke.dk>:
>
> [snip]
>
> >> Any help will be greatly appreciated.
> >>
> >> Regards,
> >> -Toke
> >
> > Can you send an explain analyze for that query?
>
> And tell us how often you VACCUM & which version of PG you are running.

Ah, yes, sorry...forgot about that.

I've tried running both VACUUM ANALYSE and REINDEX to no avail.
Postgresql v. 7.4.16.

Re: Btree indexes, large numbers and <= comparisons

From
Tom Lane
Date:
Toke =?utf-8?q?H=C3=B8iland-J=C3=B8rgensen?= <toke@toke.dk> writes:
> I need to query this table to find a range containing a particular number,
> e.g. a query might look like this:

> SELECT * FROM table_name WHERE range_start <= 87654321 AND range_end >=
> 87654321 AND expired IS NULL

You can't usefully use a two-column btree index for this.  btree indexes
are not magic, they're just ordered lists, and if you think about where
the rows you want might fall in the sort order, you'll see that the two
given constraints aren't helpful for constraining the indexscan: it'd
have to scan every row up to range_start = 87654321, or every row after
range_end = 87654321, depending on which is the first index column.
(The btree lacks any way of using the fact that range_start <= range_end
or that they're probably close together.)

What you need is a different index type that's designed for this kind of
query.  The closest thing available in the stock Postgres distribution
is the contrib/seg module, which can handle overlap/intersection of line
segments as an indexable query on a GIST index.  You'd store line
segments representing your ranges in the index, and query using the
"overlaps" operator.  However the seg data type is probably not
immediately useful to you because it only stores float4 internally,
and you seem to want more precision than that.  You'd need to make a
modified flavor of seg that stores the endpoints with the same precision
your range endpoint columns have.

            regards, tom lane

Re: Btree indexes, large numbers and <= comparisons

From
Toke Høiland-Jørgensen
Date:
> You can't usefully use a two-column btree index for this.  btree indexes
> are not magic, they're just ordered lists, and if you think about where
> the rows you want might fall in the sort order, you'll see that the two
> given constraints aren't helpful for constraining the indexscan: it'd
> have to scan every row up to range_start = 87654321, or every row after
> range_end = 87654321, depending on which is the first index column.
> (The btree lacks any way of using the fact that range_start <= range_end
> or that they're probably close together.)
>
> What you need is a different index type that's designed for this kind of
> query.  The closest thing available in the stock Postgres distribution
> is the contrib/seg module, which can handle overlap/intersection of line
> segments as an indexable query on a GIST index.  You'd store line
> segments representing your ranges in the index, and query using the
> "overlaps" operator.  However the seg data type is probably not
> immediately useful to you because it only stores float4 internally,
> and you seem to want more precision than that.  You'd need to make a
> modified flavor of seg that stores the endpoints with the same precision
> your range endpoint columns have.
>
>             regards, tom lane

I'll look into it. Thank you :)

-Toke