Thread: What kind of index to use for many rows with few unique values?

What kind of index to use for many rows with few unique values?

From
"David F. Skoll"
Date:
Hi,

I have a table with a column called "state".  Each row can be in one
of four states, let's call them 'new', 'pending', 'ok', and 'bad'.
On average, about 95% of the rows will be 'bad', with the remaining
5% being in one of the other three states.  If the table has 50K rows
and I just want to pull out the 'ok' rows, I don't want to do a sequential
scan.  To pull out the 'bad' rows, obviously, sequential scan is fine.

I've heard that a btree index performs badly in this situation.  Is
a hash index appropriate?  I've heard bad things about hash indexes in
PostgreSQL.

Regards,

David.

Roaring Penguin Software Inc. | http://www.roaringpenguin.com
GPG fingerprint: C523 771C 3710 0F54 B2D2 4B0D C6EF 6991 34AB 95BA
GPG public key:  http://www.roaringpenguin.com/dskoll-key-2002.txt ID: 34AB95BA

Re: What kind of index to use for many rows with few unique values?

From
Joel Burton
Date:
On Mon, Dec 02, 2002 at 05:10:00PM -0500, David F. Skoll wrote:
> Hi,
>
> I have a table with a column called "state".  Each row can be in one
> of four states, let's call them 'new', 'pending', 'ok', and 'bad'.
> On average, about 95% of the rows will be 'bad', with the remaining
> 5% being in one of the other three states.  If the table has 50K rows
> and I just want to pull out the 'ok' rows, I don't want to do a sequential
> scan.  To pull out the 'bad' rows, obviously, sequential scan is fine.
>
> I've heard that a btree index performs badly in this situation.  Is
> a hash index appropriate?  I've heard bad things about hash indexes in
> PostgreSQL.

create table states (id serial primary key, state varchar(10), t text );

create function fill_states(varchar, int) returns bool as '
  begin for i in 1 .. $2
  loop
    insert into states (state, t) values ($1, ''random'');
  end loop;
  return true;
  end;

' language plpgsql;

select fill_states('ok',45000);
select fill_states('bad',5000);
select fill_states('warning',5000);

analyze states;



joel@joel=# explain select * from states where state='warning';
QUERY PLAN
-------------------------------------------------------------------------------
  Index Scan using state_idx on states (cost=0.00..1013.00 rows=5533 width=20)
    Index Cond: (state = 'warning'::character varying)
(2 rows)


joel@joel=# explain select * from states where state='ok';
QUERY PLAN
--------------------------------------------------------------
  Seq Scan on states  (cost=0.00..1171.51 rows=44627 width=20)
  Filter: (state = 'ok'::character varying)
(2 rows)


Looks right to me: index scan for the less-common option, seqscan for
the most common. Why don't you think this, as a btree, will work for
you?

--

Joel BURTON  |  joel@joelburton.com  |  joelburton.com  |  aim: wjoelburton
Independent Knowledge Management Consultant

Re: What kind of index to use for many rows with few unique

From
"David F. Skoll"
Date:
On Mon, 2 Dec 2002, Joel Burton wrote:

> Looks right to me: index scan for the less-common option, seqscan for
> the most common. Why don't you think this, as a btree, will work for
> you?

No, I'm sure a btree will work.  However, won't the index be
inefficient (i.e., very flat) if there are many entries with the same
value?  Or is this not a problem?

--
David.

Re: What kind of index to use for many rows with few unique values?

From
"Dan Langille"
Date:
On 2 Dec 2002 at 17:10, David F. Skoll wrote:

> I've heard that a btree index performs badly in this situation.

As another poster has shown, it should be OK.  I recently dealt with
distributions simlar to your example.

> Is a hash index appropriate?  I've heard bad things about hash
> indexes in PostgreSQL.

No, don't use them.  I didn't get any performance increase out of
them.  Does your experience show poor btree results?
--
Dan Langille : http://www.langille.org/


Re: What kind of index to use for many rows with few unique

From
Joe Conway
Date:
David F. Skoll wrote:
> No, I'm sure a btree will work.  However, won't the index be
> inefficient (i.e., very flat) if there are many entries with the same
> value?  Or is this not a problem?
>

If you're concerned, why not try a partial index?

Joe


Re: What kind of index to use for many rows with few unique

From
"Dan Langille"
Date:
On 2 Dec 2002 at 15:15, Joe Conway wrote:

> David F. Skoll wrote:
> > No, I'm sure a btree will work.  However, won't the index be
> > inefficient (i.e., very flat) if there are many entries with the same
> > value?  Or is this not a problem?
> >
>
> If you're concerned, why not try a partial index?

FWIW, partial indexes didn't help for my distributions.  But btrees
were satisfactory.
--
Dan Langille : http://www.langille.org/