Thread: What kind of index to use for many rows with few unique values?
Hi, I have a table with a column called "state". Each row can be in one of four states, let's call them 'new', 'pending', 'ok', and 'bad'. On average, about 95% of the rows will be 'bad', with the remaining 5% being in one of the other three states. If the table has 50K rows and I just want to pull out the 'ok' rows, I don't want to do a sequential scan. To pull out the 'bad' rows, obviously, sequential scan is fine. I've heard that a btree index performs badly in this situation. Is a hash index appropriate? I've heard bad things about hash indexes in PostgreSQL. Regards, David. Roaring Penguin Software Inc. | http://www.roaringpenguin.com GPG fingerprint: C523 771C 3710 0F54 B2D2 4B0D C6EF 6991 34AB 95BA GPG public key: http://www.roaringpenguin.com/dskoll-key-2002.txt ID: 34AB95BA
On Mon, Dec 02, 2002 at 05:10:00PM -0500, David F. Skoll wrote: > Hi, > > I have a table with a column called "state". Each row can be in one > of four states, let's call them 'new', 'pending', 'ok', and 'bad'. > On average, about 95% of the rows will be 'bad', with the remaining > 5% being in one of the other three states. If the table has 50K rows > and I just want to pull out the 'ok' rows, I don't want to do a sequential > scan. To pull out the 'bad' rows, obviously, sequential scan is fine. > > I've heard that a btree index performs badly in this situation. Is > a hash index appropriate? I've heard bad things about hash indexes in > PostgreSQL. create table states (id serial primary key, state varchar(10), t text ); create function fill_states(varchar, int) returns bool as ' begin for i in 1 .. $2 loop insert into states (state, t) values ($1, ''random''); end loop; return true; end; ' language plpgsql; select fill_states('ok',45000); select fill_states('bad',5000); select fill_states('warning',5000); analyze states; joel@joel=# explain select * from states where state='warning'; QUERY PLAN ------------------------------------------------------------------------------- Index Scan using state_idx on states (cost=0.00..1013.00 rows=5533 width=20) Index Cond: (state = 'warning'::character varying) (2 rows) joel@joel=# explain select * from states where state='ok'; QUERY PLAN -------------------------------------------------------------- Seq Scan on states (cost=0.00..1171.51 rows=44627 width=20) Filter: (state = 'ok'::character varying) (2 rows) Looks right to me: index scan for the less-common option, seqscan for the most common. Why don't you think this, as a btree, will work for you? -- Joel BURTON | joel@joelburton.com | joelburton.com | aim: wjoelburton Independent Knowledge Management Consultant
On Mon, 2 Dec 2002, Joel Burton wrote: > Looks right to me: index scan for the less-common option, seqscan for > the most common. Why don't you think this, as a btree, will work for > you? No, I'm sure a btree will work. However, won't the index be inefficient (i.e., very flat) if there are many entries with the same value? Or is this not a problem? -- David.
On 2 Dec 2002 at 17:10, David F. Skoll wrote: > I've heard that a btree index performs badly in this situation. As another poster has shown, it should be OK. I recently dealt with distributions simlar to your example. > Is a hash index appropriate? I've heard bad things about hash > indexes in PostgreSQL. No, don't use them. I didn't get any performance increase out of them. Does your experience show poor btree results? -- Dan Langille : http://www.langille.org/
David F. Skoll wrote: > No, I'm sure a btree will work. However, won't the index be > inefficient (i.e., very flat) if there are many entries with the same > value? Or is this not a problem? > If you're concerned, why not try a partial index? Joe
On 2 Dec 2002 at 15:15, Joe Conway wrote: > David F. Skoll wrote: > > No, I'm sure a btree will work. However, won't the index be > > inefficient (i.e., very flat) if there are many entries with the same > > value? Or is this not a problem? > > > > If you're concerned, why not try a partial index? FWIW, partial indexes didn't help for my distributions. But btrees were satisfactory. -- Dan Langille : http://www.langille.org/