Thread: seqscan instead of index scan

seqscan instead of index scan

From

Martin Sarsale

Date:

30 August 2004, 14:46:44

Dear all:

Im having a weird problem here. I have a table w/ ~180.000 rows and I
want to select those where c > 0 or d > 0 (there only a few of those on
the table)
I indexed columns c and d (separately) but this query used the slow
seqscan instead of the index scan:

select * from t where c<>0 or d<>0;

After playing some time, I noticed that if I change the "or" for an
"and", pg used the fast index scan (but the query w/ 'and' was not what
I as looking for).

Then, I thought I could do the following:
Creating an index with the expression (c+d) and selecting the rows where
c+d > 0:
select * from t where c + d <> 0;

Again, this used a seqscan. Asking in #postgresql in freenode, somebody
told me to try to disable seqscan (set enable_seqscan false) and
suprisingly, Pg started using the index scan and it was -fast-.

Now: I've no idea why it chooses to use a seq scan instead of the index
scan (yes, I've just vacuum analyzed the table before running the
query).

Some more info:
c and d are both bigint. I've tried the queries casting the constant (0)
to bigint but nothing changed.

Im using debian's pg 7.4.1-2.


Thanks in advance

Attachment

signature.asc

Re: seqscan instead of index scan

From

Bruno Wolff III

Date:

30 August 2004, 14:53:46

On Mon, Aug 30, 2004 at 14:46:37 -0300,
  Martin Sarsale <martin@emepe3.net> wrote:
> Dear all:
>
> Im having a weird problem here. I have a table w/ ~180.000 rows and I
> want to select those where c > 0 or d > 0 (there only a few of those on
> the table)
> I indexed columns c and d (separately) but this query used the slow
> seqscan instead of the index scan:

Postgres doesn't 'or' bitmaps derived from two indexes. You might have more
luck using a combined index.

>
> select * from t where c<>0 or d<>0;
>
> After playing some time, I noticed that if I change the "or" for an
> "and", pg used the fast index scan (but the query w/ 'and' was not what
> I as looking for).
>
> Then, I thought I could do the following:
> Creating an index with the expression (c+d) and selecting the rows where
> c+d > 0:
> select * from t where c + d <> 0;
>
> Again, this used a seqscan. Asking in #postgresql in freenode, somebody
> told me to try to disable seqscan (set enable_seqscan false) and
> suprisingly, Pg started using the index scan and it was -fast-.
>
> Now: I've no idea why it chooses to use a seq scan instead of the index
> scan (yes, I've just vacuum analyzed the table before running the
> query).
>
> Some more info:
> c and d are both bigint. I've tried the queries casting the constant (0)
> to bigint but nothing changed.
>
> Im using debian's pg 7.4.1-2.
>
>
> Thanks in advance
>

Re: seqscan instead of index scan

From

"Merlin Moncure"

Date:

30 August 2004, 15:06:57

> Im having a weird problem here. I have a table w/ ~180.000 rows and I
> want to select those where c > 0 or d > 0 (there only a few of those
on
> the table)
> I indexed columns c and d (separately) but this query used the slow
> seqscan instead of the index scan:

create function is_somethingable (ctype, dtype) returns boolean as
'
    return case when $1 > 0 and $2 > 0 then true else false end;
' language sql immutable;

create index t_idx on t(is_somethingable(c,d));

analyze t;

select * from t where is_somethingable(t.c, t.d) = true;

Merlin

Re: seqscan instead of index scan

From

Martin Sarsale

Date:

30 August 2004, 15:07:23

On Mon, 2004-08-30 at 15:02, Bruno Wolff III wrote:
> On Mon, Aug 30, 2004 at 14:46:37 -0300,

> > Im having a weird problem here. I have a table w/ ~180.000 rows and I
> > want to select those where c > 0 or d > 0 (there only a few of those on
> > the table)
> > I indexed columns c and d (separately) but this query used the slow
> > seqscan instead of the index scan:
>
> Postgres doesn't 'or' bitmaps derived from two indexes. You might have more
> luck using a combined index.

With combined index, you mean a multiple column index?
From
http://www.postgresql.org/docs/7.4/interactive/indexes-multicolumn.html

"Multicolumn indexes can only be used if the clauses involving the
indexed columns are joined with AND. For instance,

SELECT name FROM test2 WHERE major = constant OR minor = constant;

cannot make use of the index test2_mm_idx defined above to look up both
columns. (It can be used to look up only the major column, however.) "

But I need something like:

select * from t where c<>0 or d<>0;

Attachment

signature.asc

Re: seqscan instead of index scan

From

Martin Sarsale

Date:

30 August 2004, 15:17:37

On Mon, 2004-08-30 at 15:06, Merlin Moncure wrote:
> create function is_somethingable (ctype, dtype) returns boolean as

Thanks, but I would prefer a simpler solution.

I would like to know why this uses a seqscan instead of an index scan:

create index t_idx on t((c+d));
select * from t where c+d > 0;

Attachment

signature.asc

Re: seqscan instead of index scan

From

"Merlin Moncure"

Date:

30 August 2004, 15:29:31

> On Mon, 2004-08-30 at 15:06, Merlin Moncure wrote:
> > create function is_somethingable (ctype, dtype) returns boolean as
>
> Thanks, but I would prefer a simpler solution.
>
> I would like to know why this uses a seqscan instead of an index scan:
>
> create index t_idx on t((c+d));
> select * from t where c+d > 0;
>

hmmm, please define simple.

Using a functional index you can define an index around the way you
access the data.  There is no faster or better way to do it...this is a
mathematical truth, not a problem with the planner.  Why not use the
right tool for the job?  A boolean index is super-efficient both in disk
space and cache utilization.

Multiple column indexes are useless for 'or' combinations! (however they
are a huge win for 'and' combinations because you don't have to merge).

With an 'or' expression, the planner must use one index or the other, or
use both and merge the results.  When and what the planner uses is an
educated guess based on statistics.

Also, your function can be changed...why fill all your queries with
Boolean cruft when you can abstract it into the database and reap the
speed savings at the same time?  I think it's time to rethink the
concept of 'simple'.

Constructive criticism all,
Merlin

Re: seqscan instead of index scan

From

Bruno Wolff III

Date:

30 August 2004, 16:09:22

On Mon, Aug 30, 2004 at 15:07:15 -0300,
  Martin Sarsale <martin@emepe3.net> wrote:
>
> With combined index, you mean a multiple column index?
> From
> http://www.postgresql.org/docs/7.4/interactive/indexes-multicolumn.html

You are right, a multicolumn index doesn't help for 'or'.

Re: seqscan instead of index scan

From

Pierre-Frédéric Caillaud

Date:

30 August 2004, 16:37:36

>> create index t_idx on t((c+d));
>> select * from t where c+d > 0;

    Why not :

    select ((select * from t where c<>0::bigint) UNION (select * from t where
d<>0::bigint))
    group by whatever;

    or someting ?

Re: seqscan instead of index scan

From

Stephan Szabo

Date:

30 August 2004, 17:04:10

On Mon, 30 Aug 2004, Martin Sarsale wrote:

> On Mon, 2004-08-30 at 15:06, Merlin Moncure wrote:
> > create function is_somethingable (ctype, dtype) returns boolean as
>
> Thanks, but I would prefer a simpler solution.
>
> I would like to know why this uses a seqscan instead of an index scan:
>
> create index t_idx on t((c+d));
> select * from t where c+d > 0;

As a geuss, since 7.4 and earlier have no statistics on the distribution
of c+d it has to guess about how likely that is to be true and is probably
overestimating. 8.0beta might handle this better.

Re: seqscan instead of index scan

From

Greg Stark

Date:

30 August 2004, 17:36:40

Another option here is to use a partial index. You can index on some other
column -- perhaps the column you want the results ordered by where the where
clause is true.

Something like:

create index t_idx on t (name) where c>0 and d>0;

then any select with a matching where clause can use the index:

select * from t where c>0 and d>0 order by name

Could scan the index and not even have to sort on name.

--
greg

Re: seqscan instead of index scan

From

Tom Lane

Date:

30 August 2004, 17:48:11

Martin Sarsale <martin@emepe3.net> writes:
> I indexed columns c and d (separately) but this query used the slow
> seqscan instead of the index scan:

> select * from t where c<>0 or d<>0;

> After playing some time, I noticed that if I change the "or" for an
> "and", pg used the fast index scan (but the query w/ 'and' was not what
> I as looking for).

I don't think so.  <> is not an indexable operator --- it appears
nowhere in the index operator classes.  It would help if you showed us
*exactly* what you did instead of a not-very-accurate filtered version.
I'm going to assume that you meant > ...

> Now: I've no idea why it chooses to use a seq scan instead of the index
> scan (yes, I've just vacuum analyzed the table before running the
> query).

Because 7.4 doesn't have statistics about expression indexes, so it has
no idea that there are only a few rows with c+d > 0.

What I'd suggest is

    select * from t where c>0 union select * from t where d>0

with separate indexes on c and d.

Another possibility is a partial index on exactly the condition you
want:

    create index nonzero on t(c) where c>0 or d>0;

although I'm not certain if 7.4 has enough stats to recognize this as a win.

            regards, tom lane

Re: seqscan instead of index scan

From

Martin Sarsale

Date:

31 August 2004, 12:14:47

> Using a functional index you can define an index around the way you
> access the data.  There is no faster or better way to do it...this is a
> mathematical truth, not a problem with the planner.  Why not use the
> right tool for the job?  A boolean index is super-efficient both in disk
> space and cache utilization.

Thanks for your constructive criticism, you're absolutely right.

I had to modify your "return" for a "select":

create function rankeable (bigint, bigint) returns boolean as '
        select case when $1 > 0 or $2 > 0 then true else false end;'
language sql immutable;

and it works great.

Attachment

signature.asc

Re: seqscan instead of index scan

From

Chester Kustarz

Date:

01 September 2004, 12:47:37

On Mon, 30 Aug 2004, Martin Sarsale wrote:
> "Multicolumn indexes can only be used if the clauses involving the
> indexed columns are joined with AND. For instance,
>
> SELECT name FROM test2 WHERE major = constant OR minor = constant;

You can use DeMorgan's Theorem to transform an OR clause to an AND clause.

In general:
    A OR B <=> NOT ((NOT A) AND (NOT B))

So:

> But I need something like:
>
> select * from t where c<>0 or d<>0;

    select * from t where not (c=0 and d=0);

I haven't actually tried to see if postgresql would do anything
interesting after such a transformation.

Re: seqscan instead of index scan

From

"Merlin Moncure"

Date:

01 September 2004, 14:53:32

> On Mon, 30 Aug 2004, Martin Sarsale wrote:
> > "Multicolumn indexes can only be used if the clauses involving the
> > indexed columns are joined with AND. For instance,
> >
> > SELECT name FROM test2 WHERE major = constant OR minor = constant;
>
> You can use DeMorgan's Theorem to transform an OR clause to an AND
clause.
>
> In general:
>     A OR B <=> NOT ((NOT A) AND (NOT B))
>
> So:
>
> > But I need something like:
> >
> > select * from t where c<>0 or d<>0;
>
>     select * from t where not (c=0 and d=0);
>
> I haven't actually tried to see if postgresql would do anything
> interesting after such a transformation.

That made me really curious.  I ran a quick test and it turns out the
server used dm's theorem to convert the expression back to 'or' case.

Explain output (see below to set up the test case for this stmnt):
esp=# explain analyze select * from millions where not (value1 <> 500000
and value2 <> 200000);
                                                                QUERY
PLAN

------------------------------------------------------------------------
----------------------------
--------------------------------------
 Index Scan using millions_1_idx, millions_2_idx on millions
(cost=0.00..12.01 rows=2 width=8) (act
ual time=0.000..0.000 rows=2 loops=1)
   Index Cond: ((value1 = 500000) OR (value2 = 200000))
 Total runtime: 0.000 ms
(3 rows)

drop table tens;
drop table millions;

create table tens(value int);
create table millions(value1 int, value2 int);
insert into tens values (0);
insert into tens values (1);
insert into tens values (2);
insert into tens values (3);
insert into tens values (4);
insert into tens values (5);
insert into tens values (6);
insert into tens values (7);
insert into tens values (8);
insert into tens values (9);

insert into millions
 select ones.value +
       (tens.value * 10) +
       (hundreds.value * 100) +
       (thousands.value * 1000) +
       (tenthousands.value * 10000) +
       (hundredthousands.value * 100000)
   from tens ones,
        tens tens,
        tens hundreds,
        tens thousands,
        tens tenthousands,
        tens hundredthousands;

update millions set value2 = value1;

create index millions_idx1 on millions(value1);
create index millions_idx2 on millions(value2);
create index millions_idx12 on millions(value1, value2);
vacuum analyze millions;