Thread: Optimizing a condition based on an a very unequally distributed value.

Optimizing a condition based on an a very unequally distributed value.

From
"Nick Fankhauser"
Date:
Hi-

I have a field that is very rarely set to 'YES', but I need to filter my
results so that only rows where it is set to 'NO' appear.

Here is the distribution:


temp=#  select count(*) from case_data where case_impound = 'YES';
 count
-------
     1
(1 row)

temp=#  select count(*) from case_data where case_impound = 'NO';
 count
-------
 23768
(1 row)


Since I always test this field, I want to make sure an index is used, but
depending on what I look for, I get different query plans:



temp=# explain select count(*) from case_data where case_impound = 'NO';
NOTICE:  QUERY PLAN:

Aggregate  (cost=815.52..815.52 rows=1 width=0)
  ->  Seq Scan on case_data  (cost=0.00..756.10 rows=23768 width=0)

EXPLAIN
temp=# explain select count(*) from case_data where case_impound = 'YES';
NOTICE:  QUERY PLAN:

Aggregate  (cost=2.23..2.23 rows=1 width=0)
  ->  Index Scan using case_data_case_impound on case_data  (cost=0.00..2.22
rows=1 width=0)

EXPLAIN
temp=# explain select count(*) from case_data where case_impound != 'NO';
NOTICE:  QUERY PLAN:

Aggregate  (cost=756.10..756.10 rows=1 width=0)
  ->  Seq Scan on case_data  (cost=0.00..756.10 rows=1 width=0)

EXPLAIN
temp=# explain select count(*) from case_data where case_impound != 'YES';
NOTICE:  QUERY PLAN:

Aggregate  (cost=815.52..815.52 rows=1 width=0)
  ->  Seq Scan on case_data  (cost=0.00..756.10 rows=23768 width=0)



So my question in general is why does PGSQL opt to use the index when
looking for the single field row, and not use it when looking for the other
23768 rows?

More specifically is there a trick to make it use the index in the condition
that I want to test for, which could be either [ = 'NO' ] or [ != 'YES' ]?

Thanks!

-NickF



--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/


Re: Optimizing a condition based on an a very unequally

From
Stephan Szabo
Date:
On Wed, 13 Mar 2002, Nick Fankhauser wrote:

> So my question in general is why does PGSQL opt to use the index when
> looking for the single field row, and not use it when looking for the other
> 23768 rows?

Because in many cases the sequence scan will be faster than an index scan
when you're trying to get most of the rows of the table. Since the tuple
validity is in the table row and not the index, it's still got to check
the table which would involve lots of random reads into that file, so
you're doing even more disk operations and less efficient ones.


Re: Optimizing a condition based on an a very unequally distributed value.

From
"Nick Fankhauser"
Date:
Thanks for the answer! ...but now I have more questions:

Is it true that if I use this condition in a join that further constrains
the results, this might use the index?

Does PGSQL know that few or many rows will be returned because of data
gathered from the last vacuum?

What do you mean by "tuple validity?" can't this query be resolved entirely
based on the index?

-Nick
> -----Original Message Snips:-----

> > So my question in general is why does PGSQL opt to use the index when
> > looking for the single field row, and not use it when looking
> for the other
> > 23768 rows?
>
> Because in many cases the sequence scan will be faster than an index scan
> when you're trying to get most of the rows of the table. Since the tuple
> validity is in the table row and not the index, it's still got to check
> the table which would involve lots of random reads into that file, so
> you're doing even more disk operations and less efficient ones.
>


Re: Optimizing a condition based on an a very unequally

From
Stephan Szabo
Date:
On Wed, 13 Mar 2002, Nick Fankhauser wrote:

> Thanks for the answer! ...but now I have more questions:
>
> Is it true that if I use this condition in a join that further constrains
> the results, this might use the index?
If there were other conditions, it may do so.

> Does PGSQL know that few or many rows will be returned because of data
> gathered from the last vacuum?

I'd guess the last analyze (or vacuum analyze).

> What do you mean by "tuple validity?" can't this query be resolved entirely
> based on the index?

Not currently, because the index doesn't know if the row can be seen by
your transaction. There may be rows that are valid for some transactions
but not yours that also have index entries.