Thread: Indexing a Boolean or Null column?

Indexing a Boolean or Null column?

From
"D. Dante Lorenso"
Date:
I've been debating with a collegue who argues that indexing a
boolean column is a BAD idea and that is will actually slow
down queries.

My plan is to have a table with many rows sharing 'versions'
(version/archive/history) of data where the most current row
is the one where 'is_active' contains a true value.

If the table begins to look like this:

data_id(pk) | data_lookup_key | data_is_active | ...
------------+-----------------+----------------+--------
1           | banana          | false          | ...
2           | banana          | false          | ...
3           | banana          | false          | ...
4           | banana          | false          | ...
5           | banana          | false          | ...
6           | banana          | false          | ...
7           | banana          | false          | ...
8           | banana          | false          | ...
9           | banana          | true           | ...
10          | apple           | true           | ...
11          | pear            | false          | ...
12          | pear            | false          | ...
13          | pear            | false          | ...
14          | pear            | false          | ...
15          | pear            | false          | ...
...
1000000     | pear            | true           | ...

Will an index on the 'data_is_active' column be used or work
as I expect?  I'm assuming that I may have a million entries
sharing the same 'data_lookup_key' and I'll be using that to
search for the active version of the row.

    SELECT *
    FROM table
    WHERE data_lookup_key = 'pear'
    AND data_is_active IS TRUE;

Does it make sense to have an index on data_is_active?

Now, I've read that in some databases the index on a column that
has relatively even distribution of values over a small set of values
will not be efficient.

I bet this is in a FAQ somewhere.  Can you point me in the right
direction?

Dante





Re: Indexing a Boolean or Null column?

From
Tom Lane
Date:
"D. Dante Lorenso" <dante@lorenso.com> writes:
> Does it make sense to have an index on data_is_active?

Hard to say.  You weren't very clear about what fraction of the table
rows you expect to have data_is_active = true.  If that's a very small
fraction, then an index might be worthwhile.

However, I'd suggest using a partial index that merges the is_active
test with some other useful behavior.  For example, if this is a
common pattern:

>     SELECT *
>     FROM table
>     WHERE data_lookup_key = 'pear'
>     AND data_is_active IS TRUE;

then what you really want is

CREATE INDEX myindex ON table (data_lookup_key) WHERE data_is_active IS TRUE;

> I bet this is in a FAQ somewhere.  Can you point me in the right
> direction?

See the docs on partial indexes.

            regards, tom lane

Re: Indexing a Boolean or Null column?

From
Christopher Kings-Lynne
Date:
> Will an index on the 'data_is_active' column be used or work
> as I expect?  I'm assuming that I may have a million entries
> sharing the same 'data_lookup_key' and I'll be using that to
> search for the active version of the row.

An index just on a boolean column won't be 'selective enough'.  eg. The
index will only be able to choose 50% of the table - since it's faster
to do a full table scan in that case, the index won't get used.

A multi keyed index, however will work a bit better, eg an index over
(data_lookup_key, data_is_active).

That way, the index will first be able to find the correct key    (which is
nicely selective) and then will be able to halve the resulting search
space to get the active ones.

BTW, you shouldn't use 'banana', 'pear', etc as the data_lookup_key, you
should make another table like this:

id    name
1    banana
2    apple
3    pear

And then replace the data_lookup_key col with a column of integers that
is a foreign key to the names table - waaaaay faster to process.

Chris


Re: Indexing a Boolean or Null column?

From
"D. Dante Lorenso"
Date:
Christopher Kings-Lynne wrote:

 > > Will an index on the 'data_is_active' column be used or work
 > > as I expect?  I'm assuming that I may have a million entries
 > > sharing the same 'data_lookup_key' and I'll be using that to
 > > search for the active version of the row.

 > An index just on a boolean column won't be 'selective enough'.
 > eg. The index will only be able to choose 50% of the table -
 > since it's faster to do a full table scan in that case, the
 > index won't get used.

Ok, so ...evenly distributed data on small set of values forces
sequential scan since that's faster.  I expected that based on
what I've read so far.

 > A multi keyed index, however will work a bit better, eg an index
 > over (data_lookup_key, data_is_active).
 >
 > That way, the index will first be able to find the correct
 > key    (which is nicely selective) and then will be able to
 > halve the resulting ? search space to get the active ones.

I'm not using the 50% TRUE / 50% FALSE model.  My model will be
more like only ONE value IS TRUE for 'is_active' for each
'data_lookup_key' in my table.  All the rest are FALSE.  In
this case for 100 rows all having the same 'data_lookup_key'
we are looking at a 99% FALSE / 1% TRUE model ... and what I'll
be searching for is the ONE TRUE.

In this case, it WILL pay off to have the index on a boolean
column, yes?  Will I win my debate with my collegue? ;-)

I think Tom Lanes suggestion of partial indexes is what I need to
look into.

 > BTW, you shouldn't use 'banana', 'pear', etc as the data_lookup_key,
 > you should make another table like this: ... And then replace the
 > data_lookup_key col with a column of integers that is a foreign
 > key to the names table - waaaaay faster to process.

Gotcha, yeah, I'm targeting as close to 3NF as I get get.  Was just
trying to be generic for my example ... bad example, oops.

Dante


Re: Indexing a Boolean or Null column?

From
Christopher Kings-Lynne
Date:
> In this case, it WILL pay off to have the index on a boolean
> column, yes?  Will I win my debate with my collegue? ;-)
>
> I think Tom Lanes suggestion of partial indexes is what I need to
> look into.

Yes, given that it will be highly skewed towards false entries, Tom's
suggestion is perfect.

Chris


Re: Indexing a Boolean or Null column?

From
Christopher Kings-Lynne
Date:
> Ok, so ...evenly distributed data on small set of values forces
> sequential scan since that's faster.  I expected that based on
> what I've read so far.

Actually, it's more a case of that fetching an item via and index is
considered, say, four times slower than fetching something off a
sequential scan (sort of).  Hence, if you are selecting more than 25% of
the table, then a sequential scan will be faster, even though it has to
process more rows.

Chris


Re: Indexing a Boolean or Null column?

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> Ok, so ...evenly distributed data on small set of values forces
>> sequential scan since that's faster.  I expected that based on
>> what I've read so far.

> Actually, it's more a case of that fetching an item via and index is
> considered, say, four times slower than fetching something off a
> sequential scan (sort of).  Hence, if you are selecting more than 25% of
> the table, then a sequential scan will be faster, even though it has to
> process more rows.

Actually it's worse than that: if an indexscan is going to fetch more
than a few percent of the table, the planner will think it slower than
a sequential scan --- and usually it'll be right.  The four-to-one ratio
refers to the cost of fetching a whole page (8K) randomly versus
sequentially.  In a seqscan, you can examine all the rows on a page
(dozens to hundreds usually) for the price of one page fetch.  In an
indexscan, one page fetch might bring in just one row that you care
about.  So the breakeven point is a lot worse than 4:1.

There is constant debate about the values of these parameters; in
particular the 4:1 page fetch cost ratio breaks down if you are able
to cache a significant fraction of the table in RAM.  See the list
archives for details.  But it's certainly true that an indexscan has to
be a lot more selective than 25% before it's going to be a win over
a seqscan.  I'd say 1% to 5% is the right ballpark.

            regards, tom lane

Re: Indexing a Boolean or Null column?

From
Christopher Browne
Date:
After a long battle with technology, dante@lorenso.com ("D. Dante Lorenso"), an earthling, wrote:
> I've been debating with a collegue who argues that indexing a
> boolean column is a BAD idea and that is will actually slow
> down queries.

No, it would be expected to slow down inserts, but not likely queries.

> Will an index on the 'data_is_active' column be used or work
> as I expect?  I'm assuming that I may have a million entries
> sharing the same 'data_lookup_key' and I'll be using that to
> search for the active version of the row.

>     SELECT *
>     FROM table
>     WHERE data_lookup_key = 'pear'
>     AND data_is_active IS TRUE;
>
> Does it make sense to have an index on data_is_active?

Not really.

> Now, I've read that in some databases the index on a column that has
> relatively even distribution of values over a small set of values
> will not be efficient.

The problem is (and this is likely to be true for just about any
database system that is 'page-based,' which is just about any of them,
these days) that what happens, with the elements being so pervasive,
throughout the table, queries will be quite likely to hit nearly every
page of the table.

If you're hitting practically every page, then it is more efficient to
just walk thru the pages (Seq Scan) rather than to bother reading the
index.

The only improvement that could (in theory) be made is to cluster all
the "true" values onto one set of pages, and all the "false" ones onto
another set of pages, and have a special sort of index that knows
which pages are "true" and "false".  I _think_ that Oracle's notion of
"cluster tables" function rather like this; it is rather debatable
whether it would be worthwhile to do similar with PostgreSQL.

A way of 'clustering' with PostgreSQL might be to have two tables
  table_active
   and
  table_inactive
where a view, representing the 'join' of them, would throw in the
'data_is_active' value.  By clever use of some rules/triggers, you
could insert into the view, and have values get shuffled into the
appropriate table.

When doing a select on the view, if you asked for "data_is_active is
TRUE", the select would only draw data from table_inactive, or
vice-versa.

Unfortunately, sometimes the query optimizer may not be clever enough
when working with the resulting joins, though that may just be a
Simple Matter Of Programming to make it more clever in future versions.
:-)
--
output = reverse("gro.gultn" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/spreadsheets.html
Rules of  the Evil Overlord  #136. "If I  build a bomb, I  will simply
remember which wire to cut if  it has to be deactivated and make every
wire red." <http://www.eviloverlord.com/>