Thread: Indexing a Boolean or Null column?
I've been debating with a colleague who argues that indexing a boolean
column is a BAD idea and that it will actually slow down queries.

My plan is to have a table where many rows share 'versions'
(version/archive/history) of data, and the most current row is the one
where 'is_active' contains a true value. The table might begin to look
like this:

 data_id(pk) | data_lookup_key | data_is_active | ...
-------------+-----------------+----------------+-----
           1 | banana          | false          | ...
           2 | banana          | false          | ...
           3 | banana          | false          | ...
           4 | banana          | false          | ...
           5 | banana          | false          | ...
           6 | banana          | false          | ...
           7 | banana          | false          | ...
           8 | banana          | false          | ...
           9 | banana          | true           | ...
          10 | apple           | true           | ...
          11 | pear            | false          | ...
          12 | pear            | false          | ...
          13 | pear            | false          | ...
          14 | pear            | false          | ...
          15 | pear            | false          | ...
         ...
     1000000 | pear            | true           | ...

Will an index on the 'data_is_active' column be used or work as I
expect? I'm assuming that I may have a million entries sharing the same
'data_lookup_key' and I'll be using that to search for the active
version of the row.

SELECT *
FROM table
WHERE data_lookup_key = 'pear'
  AND data_is_active IS TRUE;

Does it make sense to have an index on data_is_active?

Now, I've read that in some databases an index on a column with a
relatively even distribution of values over a small set of values will
not be efficient. I bet this is in a FAQ somewhere. Can you point me in
the right direction?

Dante
"D. Dante Lorenso" <dante@lorenso.com> writes:
> Does it make sense to have an index on data_is_active?

Hard to say. You weren't very clear about what fraction of the table
rows you expect to have data_is_active = true. If that's a very small
fraction, then an index might be worthwhile.

However, I'd suggest using a partial index that merges the is_active
test with some other useful behavior. For example, if this is a common
pattern:

> SELECT *
> FROM table
> WHERE data_lookup_key = 'pear'
> AND data_is_active IS TRUE;

then what you really want is

CREATE INDEX myindex ON table (data_lookup_key)
  WHERE data_is_active IS TRUE;

> I bet this is in a FAQ somewhere. Can you point me in the right
> direction?

See the docs on partial indexes.

			regards, tom lane
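[Editor's note: Tom's partial index can be sketched end to end as below.
The table name data_versions is hypothetical (the thread's literal
"table" is a reserved word), but the index and query shapes follow his
suggestion.]

```sql
-- Minimal sketch of the schema from the thread, under an assumed name.
CREATE TABLE data_versions (
    data_id         serial  PRIMARY KEY,
    data_lookup_key text    NOT NULL,
    data_is_active  boolean NOT NULL DEFAULT false
);

-- The partial index stores entries only for rows satisfying the
-- predicate, so it stays small even with millions of inactive rows.
CREATE INDEX myindex ON data_versions (data_lookup_key)
    WHERE data_is_active IS TRUE;

-- The query must repeat (or imply) the index predicate for the
-- planner to consider using the partial index:
EXPLAIN SELECT *
FROM data_versions
WHERE data_lookup_key = 'pear'
  AND data_is_active IS TRUE;
```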
> Will an index on the 'data_is_active' column be used or work
> as I expect? I'm assuming that I may have a million entries
> sharing the same 'data_lookup_key' and I'll be using that to
> search for the active version of the row.

An index just on a boolean column won't be 'selective enough'. e.g. the
index will only be able to choose 50% of the table -- since it's faster
to do a full table scan in that case, the index won't get used.

A multi-keyed index, however, will work a bit better, e.g. an index
over (data_lookup_key, data_is_active). That way, the index will first
be able to find the correct key (which is nicely selective) and then
will be able to halve the resulting search space to get the active
ones.

BTW, you shouldn't use 'banana', 'pear', etc. as the data_lookup_key;
you should make another table like this:

id  name
1   banana
2   apple
3   pear

And then replace the data_lookup_key col with a column of integers that
is a foreign key to the names table -- waaaaay faster to process.

Chris
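[Editor's note: Chris's two suggestions, the composite index and the
lookup table, might be sketched as follows. The names data_versions and
data_names are illustrative, not from the thread.]

```sql
-- Composite index: the selective lookup key first, the boolean second.
CREATE INDEX idx_key_active
    ON data_versions (data_lookup_key, data_is_active);

-- Normalizing the lookup key into its own table:
CREATE TABLE data_names (
    id   serial PRIMARY KEY,
    name text   NOT NULL UNIQUE
);

INSERT INTO data_names (name) VALUES ('banana'), ('apple'), ('pear');

-- data_versions then carries a small integer foreign key instead of
-- repeating the string on every row:
ALTER TABLE data_versions
    ADD COLUMN name_id integer REFERENCES data_names (id);
```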
Christopher Kings-Lynne wrote:
> > Will an index on the 'data_is_active' column be used or work
> > as I expect? I'm assuming that I may have a million entries
> > sharing the same 'data_lookup_key' and I'll be using that to
> > search for the active version of the row.
>
> An index just on a boolean column won't be 'selective enough'.
> eg. The index will only be able to choose 50% of the table -
> since it's faster to do a full table scan in that case, the
> index won't get used.

Ok, so ... evenly distributed data on a small set of values forces a
sequential scan since that's faster. I expected that based on what I've
read so far.

> A multi keyed index, however will work a bit better, eg an index
> over (data_lookup_key, data_is_active).
>
> That way, the index will first be able to find the correct
> key (which is nicely selective) and then will be able to
> halve the resulting search space to get the active ones.

I'm not using the 50% TRUE / 50% FALSE model. My model will be more
like only ONE value IS TRUE for 'is_active' for each 'data_lookup_key'
in my table. All the rest are FALSE. In this case, for 100 rows all
having the same 'data_lookup_key', we are looking at a 99% FALSE / 1%
TRUE model ... and what I'll be searching for is the ONE TRUE.

In this case, it WILL pay off to have the index on a boolean column,
yes? Will I win my debate with my colleague? ;-)

I think Tom Lane's suggestion of partial indexes is what I need to look
into.

> BTW, you shouldn't use 'banana', 'pear', etc as the data_lookup_key,
> you should make another table like this: ... And then replace the
> data_lookup_key col with a column of integers that is a foreign
> key to the names table - waaaaay faster to process.

Gotcha, yeah, I'm targeting as close to 3NF as I can get. Was just
trying to be generic for my example ... bad example, oops.

Dante
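[Editor's note: Dante's "exactly one TRUE per lookup key" invariant can
itself be enforced by a partial index if it is declared UNIQUE -- a
sketch, again against the hypothetical data_versions table:]

```sql
-- Among rows where data_is_active is true, no two may share the same
-- data_lookup_key. Inactive rows are absent from the index entirely,
-- so any number of them may repeat a key.
CREATE UNIQUE INDEX one_active_per_key
    ON data_versions (data_lookup_key)
    WHERE data_is_active IS TRUE;

-- A second active 'pear' row now fails with a unique violation, and
-- the same index serves the "find the active row for this key" query.
```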
> In this case, it WILL pay off to have the index on a boolean
> column, yes? Will I win my debate with my colleague? ;-)
>
> I think Tom Lane's suggestion of partial indexes is what I need to
> look into.

Yes, given that it will be highly skewed towards false entries, Tom's
suggestion is perfect.

Chris
> Ok, so ... evenly distributed data on a small set of values forces
> a sequential scan since that's faster. I expected that based on
> what I've read so far.

Actually, it's more a case of fetching an item via an index being
considered, say, four times slower than fetching something off a
sequential scan (sort of). Hence, if you are selecting more than 25% of
the table, then a sequential scan will be faster, even though it has to
process more rows.

Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> Ok, so ... evenly distributed data on a small set of values forces
>> a sequential scan since that's faster. I expected that based on
>> what I've read so far.

> Actually, it's more a case of fetching an item via an index being
> considered, say, four times slower than fetching something off a
> sequential scan (sort of). Hence, if you are selecting more than 25%
> of the table, then a sequential scan will be faster, even though it
> has to process more rows.

Actually it's worse than that: if an indexscan is going to fetch more
than a few percent of the table, the planner will think it slower than
a sequential scan --- and usually it'll be right.

The four-to-one ratio refers to the cost of fetching a whole page (8K)
randomly versus sequentially. In a seqscan, you can examine all the
rows on a page (dozens to hundreds, usually) for the price of one page
fetch. In an indexscan, one page fetch might bring in just one row that
you care about. So the breakeven point is a lot worse than 4:1.

There is constant debate about the values of these parameters; in
particular, the 4:1 page-fetch cost ratio breaks down if you are able
to cache a significant fraction of the table in RAM. See the list
archives for details.

But it's certainly true that an indexscan has to be a lot more
selective than 25% before it's going to be a win over a seqscan. I'd
say 1% to 5% is the right ballpark.

			regards, tom lane
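[Editor's note: the 4:1 ratio Tom describes is the planner's
random_page_cost setting, whose default is 4. It can be inspected and
overridden per session; the sketch below assumes a PostgreSQL session.]

```sql
-- The planner's assumed cost of a random page fetch relative to a
-- sequential one. The default of 4 matches the 4:1 figure above.
SHOW random_page_cost;

-- On a heavily cached (or otherwise fast-random-access) system,
-- lowering it tells the planner random fetches are comparatively
-- cheap, so indexscans win at higher selectivities:
SET random_page_cost = 2;

-- Running EXPLAIN on the same query before and after the SET can show
-- the plan flip between Seq Scan and Index Scan.
```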
After a long battle with technology, dante@lorenso.com ("D. Dante
Lorenso"), an earthling, wrote:

> I've been debating with a colleague who argues that indexing a
> boolean column is a BAD idea and that it will actually slow
> down queries.

No, it would be expected to slow down inserts, but not likely queries.

> Will an index on the 'data_is_active' column be used or work
> as I expect? I'm assuming that I may have a million entries
> sharing the same 'data_lookup_key' and I'll be using that to
> search for the active version of the row.
>
> SELECT *
> FROM table
> WHERE data_lookup_key = 'pear'
> AND data_is_active IS TRUE;
>
> Does it make sense to have an index on data_is_active?

Not really.

> Now, I've read that in some databases the index on a column that has
> relatively even distribution of values over a small set of values
> will not be efficient.

The problem (and this is likely to be true for just about any database
system that is 'page-based,' which is just about any of them, these
days) is that with the values spread so pervasively throughout the
table, queries will be quite likely to hit nearly every page of the
table. If you're hitting practically every page, then it is more
efficient to just walk through the pages (Seq Scan) rather than bother
reading the index.

The only improvement that could (in theory) be made is to cluster all
the "true" values onto one set of pages and all the "false" ones onto
another set of pages, and have a special sort of index that knows which
pages are "true" and "false". I _think_ that Oracle's notion of
"cluster tables" functions rather like this; it is rather debatable
whether it would be worthwhile to do something similar with PostgreSQL.

A way of 'clustering' with PostgreSQL might be to have two tables,
table_active and table_inactive, where a view, representing the 'join'
of them, would throw in the 'data_is_active' value.
By clever use of some rules/triggers, you could insert into the view
and have values get shuffled into the appropriate table. When doing a
select on the view, if you asked for "data_is_active IS TRUE", the
select would only draw data from table_active, or vice-versa.

Unfortunately, sometimes the query optimizer may not be clever enough
when working with the resulting joins, though that may just be a Simple
Matter Of Programming to make it more clever in future versions. :-)
--
output = reverse("gro.gultn" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/spreadsheets.html
Rules of the Evil Overlord #136. "If I build a bomb, I will simply
remember which wire to cut if it has to be deactivated and make every
wire red." <http://www.eviloverlord.com/>
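[Editor's note: the split-table scheme might be sketched as below. All
object names are illustrative; the UNION view and the conditional
INSERT rule follow the description in the message, and a complete
version would also route inactive rows and handle state-flipping
UPDATEs.]

```sql
-- Two physical tables, one per activity state.
CREATE TABLE table_active   (data_id integer, data_lookup_key text);
CREATE TABLE table_inactive (data_id integer, data_lookup_key text);

-- A view that re-synthesizes the data_is_active column.
CREATE VIEW data_versions_v AS
    SELECT data_id, data_lookup_key, true  AS data_is_active
      FROM table_active
    UNION ALL
    SELECT data_id, data_lookup_key, false AS data_is_active
      FROM table_inactive;

-- A conditional rule routing active-row inserts on the view into the
-- appropriate physical table.
CREATE RULE route_active AS
    ON INSERT TO data_versions_v
    WHERE NEW.data_is_active
    DO INSTEAD
        INSERT INTO table_active (data_id, data_lookup_key)
        VALUES (NEW.data_id, NEW.data_lookup_key);
```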