Thread: selecting from indexes

selecting from indexes

From
"Tim Joyce"
Date:
Hi,

I am trying to improve search times on a moderately large table (approx 1
GB).

I have noticed that clustering the data improves performance significantly,
but is a bit of a pain especially with dynamic data.

What I would like to do is select data direct from the index and not have to
go back to the table itself each time.

eg, my query is:

SELECT id FROM books WHERE category_key = 1471;

(this takes ages on a table not ordered by category_key even if I have an
index on category_key)

If I created an index:

CREATE INDEX books_category_id ON books(category_key,id);

and then run the above query,  it has no need to go to the books table to
retrieve the id, and should be fast.  But it appears that it still does
access the books table.

I have tried

SELECT id FROM books_category_id WHERE category_key = 1471;

but you can't select from an index :(

Another option would be to do the clustering using a view, but:

create view books_category as select id,category from books order by
category;
ERROR:  Order by and Distinct on views is not implemented.

Does anyone know when this will be implemented?

Has anyone got any better ideas, or shall I just do static clustering once
in a while?

Thanks for any advice.

Cheers

Tim Joyce




Re: [SQL] selecting from indexes

From
Tom Lane
Date:
"Tim Joyce" <tim@hoop.co.uk> writes:
> I am trying to improve search times on a moderately large table (approx 1
> GB).

> eg, my query is:

> SELECT id FROM books WHERE category_key = 1471;

> (this takes ages on a table not ordered by category_key even if I have an
> index on category_key)

This should *not* take a long time if you have an index on category_key.
What does EXPLAIN show as the query plan?  (I am wondering if maybe the
planner doesn't know the table is large, which it wouldn't if you've
never vacuumed it... in that case it might be picking a sequential scan
instead of using the index.)

Also, how many rows are actually selected by the above?


> If I created an index:

> CREATE INDEX books_category_id ON books(category_key,id);

> and then run the above query,  it has no need to go to the books table to
> retrieve the id,

Yes it does, because the index is only a hint.  The executor must still
fetch each tuple fingered by the index in order to find out whether the
tuples are valid (committed).  But that fetching should cost at most
one disk read per potentially-interesting tuple.

Adding id to the index as you show above would be counterproductive,
at least for this query.  It'd just inflate the size of the index
and thus require more I/O to scan the index.  A 2-column index is
only useful for queries where WHERE constrains both columns.

> Another option would be to do the clustering using a view, but:

A view doesn't provide any performance advantage, it's only a rule
for rewriting your query before it's executed.
        regards, tom lane


Re: [SQL] selecting from indexes

From
"Tim Joyce"
Date:
> "Tim Joyce" <tim@hoop.co.uk> writes:
> > I am trying to improve search times on a moderately large table (approx
1
> > GB).
>
> > eg, my query is:
>
> > SELECT id FROM books WHERE category_key = 1471;
>
> > (this takes ages on a table not ordered by category_key even if I have
an
> > index on category_key)
>
> This should *not* take a long time if you have an index on category_key.
> What does EXPLAIN show as the query plan?  (I am wondering if maybe the
> planner doesn't know the table is large, which it wouldn't if you've
> never vacuumed it... in that case it might be picking a sequential scan
> instead of using the index.)
>
> Also, how many rows are actually selected by the above?

the query above selects 294072 rows, which i obviously don't want to do, but
I do want to use the clause above in a query that involves a join.  eg

select id from books, book_words where book_words.word='happy' and
book_words.id = books.id and books.category_key=1471;

this query works fine if the books table (with 1,200,000 rows) is clustered
on category_key, but trundles on for ages (mainly accessing the disk) if
not.

>
>
> > If I created an index:
>
> > CREATE INDEX books_category_id ON books(category_key,id);
>
> > and then run the above query,  it has no need to go to the books table
to
> > retrieve the id,
>
> Yes it does, because the index is only a hint.  The executor must still
> fetch each tuple fingered by the index in order to find out whether the
> tuples are valid (committed).  But that fetching should cost at most
> one disk read per potentially-interesting tuple.

ok, and because i have 294072 potentially interesting tuples, it hits the
disk hard, and takes forever.

perhaps (in the above query) there is a way of directing postgres to only
access the books that are selected by the 'words' part of the query?  this
would mean about 1000 interesting book tuples which can then be checked for
good categories.  is there a way to do this?

thanks very much for your help.

timj

>
> regards, tom lane




Re: [SQL] selecting from indexes

From
Tom Lane
Date:
"Tim Joyce" <tim@hoop.co.uk> writes:
>>>> SELECT id FROM books WHERE category_key = 1471;
>>>> (this takes ages on a table not ordered by category_key even if I have
>>>> an index on category_key)

> the query above selects 294072 rows, which i obviously don't want to do, but
> I do want to use the clause above in a query that involves a join.  eg

Ah, I begin to understand.  With an index scan you're going to get
294072 probes into the table (maybe even more, if there are deleted rows
that match the category_key).  If the rows are scattered all over the
disk then that may actually take about 300k disk reads.  After you
cluster the table, the rows with the same category_key are all
contiguous in the table, so many fewer blocks have to be read to visit
them all.  That's why clustering helps here.

Since you're selecting about 1/4th of the table, this particular query
would probably be better off *not* using the index, but just doing a
sequential scan of the whole table :-(.  I assume most of your
categories are more selective than this one, though, so dropping the
category index entirely is probably not the answer.

> select id from books, book_words where book_words.word='happy' and
> book_words.id = books.id and books.category_key=1471;

If this is what you're really doing, I think what you actually want is
indexes on book_words.word and books.id.  That would allow book_words
to be searched on the word (hopefully giving a more selective result
than the category does), and then books would be probed using the id index.
id has unique values, right?

> perhaps (in the above query) there is a way of directing postgres to only
> access the books that are selected by the 'words' part of the query?

You might want to look at contrib/fulltextindex in the distribution for
ideas about indexing words.  fulltextindex might be overkill for your
needs, or maybe not, but you could probably adapt it for your purposes.
        regards, tom lane


Re: [SQL] selecting from indexes

From
"Tim Joyce"
Date:
> > the query above selects 294072 rows, which i obviously don't want to do,
but
> > I do want to use the clause above in a query that involves a join.  eg
>
> Ah, I begin to understand.  With an index scan you're going to get
> 294072 probes into the table (maybe even more, if there are deleted rows
> that match the category_key).  If the rows are scattered all over the
> disk then that may actually take about 300k disk reads.  After you
> cluster the table, the rows with the same category_key are all
> contiguous in the table, so many fewer blocks have to be read to visit
> them all.  That's why clustering helps here.

Indeed, and this lead me to see if there was a way of getting the ids
without hitting the books table, but it doesn't look like there is :(

>
> Since you're selecting about 1/4th of the table, this particular query
> would probably be better off *not* using the index, but just doing a
> sequential scan of the whole table :-(.  I assume most of your
> categories are more selective than this one, though, so dropping the
> category index entirely is probably not the answer.
>
> > select id from books, book_words where book_words.word='happy' and
> > book_words.id = books.id and books.category_key=1471;
>
> If this is what you're really doing, I think what you actually want is
> indexes on book_words.word and books.id.

I have indexes on both of these.

>That would allow book_words
> to be searched on the word (hopefully giving a more selective result
> than the category does), and then books would be probed using the id
index.
> id has unique values, right?

yep, but how can i use this subset to then select for category?  perhaps at
this stage, I should start to do things in the application code?

>
> > perhaps (in the above query) there is a way of directing postgres to
only
> > access the books that are selected by the 'words' part of the query?
>
> You might want to look at contrib/fulltextindex in the distribution for
> ideas about indexing words.  fulltextindex might be overkill for your
> needs, or maybe not, but you could probably adapt it for your purposes.

we have adapted this already (taking out the reg expression stuff so that it
is a bit quicker)

thanks for your help

timj

>
> regards, tom lane



Re: [SQL] selecting from indexes

From
Tom Lane
Date:
"Tim Joyce" <tim@hoop.co.uk> writes:
>>>> select id from books, book_words where book_words.word='happy' and
>>>> book_words.id = books.id and books.category_key=1471;
>> 
>> If this is what you're really doing, I think what you actually want is
>> indexes on book_words.word and books.id.

> I have indexes on both of these.

Hmm.  So the question is why the system is (apparently) doing an
indexscan on category_key rather than on id.  You didn't show us
the EXPLAIN output...
        regards, tom lane


Re: [SQL] selecting from indexes

From
"Tim Joyce"
Date:

> "Tim Joyce" <tim@hoop.co.uk> writes:
> >>>> select id from books, book_words where book_words.word='happy' and
> >>>> book_words.id = books.id and books.category_key=1471;
> >>
> >> If this is what you're really doing, I think what you actually want is
> >> indexes on book_words.word and books.id.
>
> > I have indexes on both of these.
>
> Hmm.  So the question is why the system is (apparently) doing an
> indexscan on category_key rather than on id.  You didn't show us
> the EXPLAIN output...

bloomsbury=> explain select id from books, books_title_words_idx t where
t.word=
'happy' and t.rec_id = books.id and books.category_key=1471;
NOTICE:  QUERY PLAN:

Hash Join  (cost=43709.86 rows=2817 width=8) ->  Index Scan using books_categories_idx on books  (cost=32855.63
rows=293573width=4) ->  Hash  (cost=733.03 rows=12221 width=4)       ->  Index Scan using books_title_words_idx_word
on
books_title_words_idxt  (cost=733.03 rows=12221 width=4)

EXPLAIN

thanks again

timj
>
> regards, tom lane