Thread: Indices types, what to use. Btree, Hash, Gin or Gist

Indices types, what to use. Btree, Hash, Gin or Gist

From
Mohamed
Date:
Hi,

I have several fields that use to match with my queries. I am curious to what index types is best for what. Here is some examples that will help you understand.

Say I have a 1000 000 rows.

Speed is of the essence here, insertions and updates happens relatively less frequent than search. 

I want to match against a boolean field, that is, only true or false is possible. I am thinking Btree but not sure.. correct?

Match against a field which is an big in and contains the ids for the states of america (say 50 different values), a btree again?

Against an integer field? Using > < = and between

Date (no timezone), I would like to have this field presorted in desc order. How does that creation look like?

I understand the Hash is not recommended. When should I use the Gin index ?

Thanks in advance / Moe 






Re: Indices types, what to use. Btree, Hash, Gin or Gist

From
Gregory Stark
Date:
Mohamed <mohamed5432154321@gmail.com> writes:

> Hi,
> I have several fields that use to match with my queries. I am curious to
> what index types is best for what. Here is some examples that will help you
> understand.
>
> Say I have a 1000 000 rows.
>
> Speed is of the essence here, insertions and updates happens relatively less
> frequent than search.

Can you give examples of actual WHERE clauses? It's the combination of
restrictions that actually matters. Are there specific fields which will never
be used on their own, only in combination with others?

> I want to match against a boolean field, that is, only true or false is
> possible. I am thinking Btree but not sure.. correct?

No index is going to be particularly effective for boolean columns unless
they're very heavily skewed. You might find it useful to build separate
partial indexes on other keys for each value though.

> I understand the Hash is not recommended. When should I use the Gin index ?

GIN and GIST are used for fairly specialized purposes. Full text searching,
geometric data types, etc.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: Indices types, what to use. Btree, Hash, Gin or Gist

From
Kevin Murphy
Date:
Gregory Stark wrote:
> Mohamed <mohamed5432154321@gmail.com> writes:
>
>
>> I want to match against a boolean field, that is, only true or false is
>> possible. I am thinking Btree but not sure.. correct?
>>
>
> No index is going to be particularly effective for boolean columns unless
> they're very heavily skewed. You might find it useful to build separate
> partial indexes on other keys for each value though.
>
Gregory,

Just for the edification of the masses, can you show an example that
illustrates your point about partial indexes?

-Kevin


Re: Indices types, what to use. Btree, Hash, Gin or Gist

From
Scott Marlowe
Date:
On Sat, Jan 31, 2009 at 1:33 PM, Gregory Stark <stark@enterprisedb.com> wrote:
> No index is going to be particularly effective for boolean columns unless
> they're very heavily skewed. You might find it useful to build separate
> partial indexes on other keys for each value though.

Not entirely true.  If you've got a table where <1% of the rows are
one or the other, a partial index can be very useful on a bool.

Also, in the very odd case where you almost all the time select all
the true or all the false, a regular index can be useful for
reindexing the table on.

But yeah, if there's a relatively even mix of true and false, then a
bool index isn't much use.

Re: Indices types, what to use. Btree, Hash, Gin or Gist

From
Mohamed
Date:
My Gmail(bloody gmail!) has been auto-replying to the last messager (Scott) so I think we have been having a private discussion on this topic. Here is an update on our discussion.

ME : 

When it comes to the boolean, the content is about 70-30%. I find it strange though that an index on a 50-50% isn't that useful. With an index the DB can skip 50% of the table so it should be useful, but perhaps the intersection of sets is expensive for the DB?

Could an index in fact possibly slow down queries? Or will the DB ignore using the index in such cases?

Most of my matches contains simple matches, and I don't see the use of partial indexes, but I get the idea. 
You want to make a bigger difference between the set contained/matched against to get them "more" "unique".

I am now reading about fulltext search and its my next step. So I am a bit interested in the Gin/Gist. But I will revive this thread once I am more familiar with fulltext, currently reading up on the topic..


SCOTT :


> When it comes to the boolean, the content is about 70-30%. I find it strange
> though that an index on a 50-50% isn't that useful. With an index the DB can
> skip 50% of the table so it should be useful, but perhaps the intersection
> of sets is expensive for the DB?

If the values are randomly mixed, and you can fit at least a couple of
rows in each 8k block, then using an index on a 50/50 mix is a total
loser, because you're gonna have to read every single block anyway.
If you can fit 10 rows in a single block, then 10% of one value means
it's a loser too, because, again, you're gonna have to hit every block
anyway.

With the mix you list, 70/30, it means that if you can fit 3 rows in
one block, and they're randomly distributed, you'll have to hit every
block anyway, and an index on bool won't help.

Keep in mind random table accesses are about 4 to 10 times more
expensive than sequential scans, and you have to add in the random
access time of the index as well.

> Could an index in fact possibly slow down queries? Or will the DB ignore
> using the index in such cases?

The db should ignore it for select queries unless the statistics are
wrong.  However, indexes ALWAYS cost on insert / update / delete.

> I am now reading about fulltext search and its my next step. So I am a bit
> interested in the Gin/Gist. But I will revive this thread once I am more
> familiar with fulltext, currently reading up on the topic..

If you're searching a lot on text, full text search is a great way to go.

Be sure and look up the pg_stat type tables.  There's tons of useful
info in them about how your database is actually being accessed.
pg_stat_user_indexes and pg_stat_user_tables are both very useful.




ME:


Thanks, I am new to PostgreSQL and just an SQL scholar really. I am using pgAdmin now, is there a way of looking at those stats from there or is it just from the command line ? 

Would you say it's safe to index all columns that are searched for in a relation? I have indexed perhaps 10 columns (of 15) and some are like the boolean one. But I am thinking that they will only be used if the DB finds them useful so I am over-indexing.. is this ok? I fin d updates and insertions pretty fast anyway so I am not worried about that aspect unless I am wrong !? :O


SCOTT :

This is one of those "it really depends" types of questions.  If the
database is mostly read from, and the updates aren't slowed down too
much by the many indexes, then sure, go ahead and add the indexes.  It
won't generally slow down the select queries running all the time.
After a month or so check the pg_stat_user_indexes table to see which
non-unique indexes aren't being used and drop them.




Thank you Scott for your private help :)

/ Moe

Re: Indices types, what to use. Btree, Hash, Gin or Gist

From
Martijn van Oosterhout
Date:
On Sun, Feb 01, 2009 at 06:00:02PM +0100, Mohamed wrote:
> When it comes to the boolean, the content is about 70-30%. I find it strange
> though that an index on a 50-50% isn't that useful. With an index the DB can
> skip 50% of the table so it should be useful, but perhaps the intersection
> of sets is expensive for the DB?
> Could an index in fact possibly slow down queries? Or will the DB ignore
> using the index in such cases?

It's more complex than you suggest: the database cannot just skip 50%
of the table. The database reads or write blocks of data (8k) and each
such block will contain (in your example) 50% rows you are interested
in. So the database will have to read every block in the table anyway,
so you may as well not use the index at all.

Yes, the database will avoid using indexes if it decides they're a bad
idea.

Usually an index has to cut the number of blocks required by at least
90% before it becomes at all useful to use it. Indexes on booleans
rarely reach that kind of level.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: Indices types, what to use. Btree, Hash, Gin or Gist

From
Gregory Stark
Date:
Mohamed <mohamed5432154321@gmail.com> writes:

> My Gmail(bloody gmail!) has been auto-replying to the last messager (Scott)
> so I think we have been having a private discussion on this topic.

There is an option in the Google Labs tab to make "Reply All" the default
button -- of course then there's always a chance you'll make the opposite
mistake which can be a lot worse.

Earlier I suggested with a boolean column you could consider making it the
condition on a partial index with some other key. For example you could have

CREATE INDEX partial_age_male   on tab(age) WHERE gender = 'M';
CREATE INDEX partial_age_female on tab(age) WHERE gender = 'F';

Then if you always search on age with gender the optimizer can use the index
which only includes the records for the appropriate gender. It's basically a
"free" index key column since it doesn't actually have to store the extra
column.

Note that in this example if you were to search on just age it wouldn't be
able to use either of these indexes however. In theory it could use the
indexes if you search on just gender but it would be unlikely to for all the
same reasons as previously mentioned for regular indexes.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: Indices types, what to use. Btree, Hash, Gin or Gist

From
Mohamed
Date:
Yeah, but reply-all will still send private messages :O .. its strange because this is the only mailing list that gmail behaves like this with.. it must have to with how postgre sends messages out.

But anyways. Back to topic :) 

Yeah, I think that a partial index is something that would be smart. The problem is that sometimes I want to search for both genders (that is no match for that column at all) and my index will then not be used on the partial ones that includes that one which leads me to another thought. 
Does this mean that the DB will hit all the blocks anyways? Does that mean that as soon as I search for one column that is not a good index or not indexed at all, the other indexed fields become useless since the DB will have to go through all rows anyway?

Will I have to make a partial index and include the gender together with all other fields each ?

Here is my indexes as of now (in one of my relations).

  region index:'region_id_index'                 // Searched for or not included in query
            district index:'district_id_index'              // Searched for or not included in query
            category index:'category_id_index'        // Searched for alone or not included in query
            subCategory index:'sub_category_id_index' // Searched for or not included in query

            languageOfAd index:'language_of_ad_index'      // Searched for values 1,2 and sometimes don't search this field if all lang should be shown

            name index:'name_index'                     
            phoneNumber index:'phone_number_index'  // Always with name, I guess partial index could work here (often not searched for)
            email index:'email_index'                           // Always with name, I guess partial index could work here (often not searched for)
            price index:'price_index'                            // 
            typeOfAd index:'type_of_ad_index'            // 1,2,3,4       Always one of these in the query
            rentingPeriod index:'renting_period_index'      // if 3.4 then 1,7,30,365 if 1,2 then value is 0 (but not always used in query)
            time index:'time_index'               // Date with no timezone, newest first in index, read about it ? ordered index..?
            statusOfAd index:'status_of_ad_index'           // Always in the query, guess could be included in all indexes as partial



Thats only the index fields of this relation. Things that are searched for. I will create a Gin index on description also but thats for the fulltext that is coming together but is a bit of a struggle :)

/ Moe





On Sun, Feb 1, 2009 at 7:23 PM, Gregory Stark <stark@enterprisedb.com> wrote:
Mohamed <mohamed5432154321@gmail.com> writes:

> My Gmail(bloody gmail!) has been auto-replying to the last messager (Scott)
> so I think we have been having a private discussion on this topic.

There is an option in the Google Labs tab to make "Reply All" the default
button -- of course then there's always a chance you'll make the opposite
mistake which can be a lot worse.

Earlier I suggested with a boolean column you could consider making it the
condition on a partial index with some other key. For example you could have

CREATE INDEX partial_age_male   on tab(age) WHERE gender = 'M';
CREATE INDEX partial_age_female on tab(age) WHERE gender = 'F';

Then if you always search on age with gender the optimizer can use the index
which only includes the records for the appropriate gender. It's basically a
"free" index key column since it doesn't actually have to store the extra
column.

Note that in this example if you were to search on just age it wouldn't be
able to use either of these indexes however. In theory it could use the
indexes if you search on just gender but it would be unlikely to for all the
same reasons as previously mentioned for regular indexes.

--
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Ask me about EnterpriseDB's RemoteDBA services!