Thread: selecting from indexes
Hi, I am trying to improve search times on a moderately large table (approx 1 GB). I have noticed that clustering the data improves performance significantly, but is a bit of a pain especially with dynamic data. What I would like to do is select data direct from the index and not have to go back to the table itself each time. eg, my query is: SELECT id FROM books WHERE category_key = 1471; (this takes ages on a table not ordered by category_key even if I have an index on category_key) If I created an index: CREATE INDEX books_category_id ON books(category_key,id); and then run the above query, it has no need to go to the books table to retrieve the id, and should be fast. But it appears that it still does access the books table. I have tried SELECT id FROM books_category_id WHERE category_key = 1471; but you can't select from an index :( Another option would be to do the clustering using a view, but: create view books_category as select id,category from books order by category; ERROR: Order by and Distinct on views is not implemented. Does anyone know when this will be implemented? Has anyone got any better ideas, or shall I just do static clustering once in a while? Thanks for any advice. Cheers Tim Joyce
"Tim Joyce" <tim@hoop.co.uk> writes: > I am trying to improve search times on a moderately large table (approx 1 > GB). > eg, my query is: > SELECT id FROM books WHERE category_key = 1471; > (this takes ages on a table not ordered by category_key even if I have an > index on category_key) This should *not* take a long time if you have an index on category_key. What does EXPLAIN show as the query plan? (I am wondering if maybe the planner doesn't know the table is large, which it wouldn't if you've never vacuumed it... in that case it might be picking a sequential scan instead of using the index.) Also, how many rows are actually selected by the above? > If I created an index: > CREATE INDEX books_category_id ON books(category_key,id); > and then run the above query, it has no need to go to the books table to > retrieve the id, Yes it does, because the index is only a hint. The executor must still fetch each tuple fingered by the index in order to find out whether the tuples are valid (committed). But that fetching should cost at most one disk read per potentially-interesting tuple. Adding id to the index as you show above would be counterproductive, at least for this query. It'd just inflate the size of the index and thus require more I/O to scan the index. A 2-column index is only useful for queries where WHERE constrains both columns. > Another option would be to do the clustering using a view, but: A view doesn't provide any performance advantage, it's only a rule for rewriting your query before it's executed. regards, tom lane
> "Tim Joyce" <tim@hoop.co.uk> writes: > > I am trying to improve search times on a moderately large table (approx 1 > > GB). > > > eg, my query is: > > > SELECT id FROM books WHERE category_key = 1471; > > > (this takes ages on a table not ordered by category_key even if I have an > > index on category_key) > > This should *not* take a long time if you have an index on category_key. > What does EXPLAIN show as the query plan? (I am wondering if maybe the > planner doesn't know the table is large, which it wouldn't if you've > never vacuumed it... in that case it might be picking a sequential scan > instead of using the index.) > > Also, how many rows are actually selected by the above? the query above selects 294072 rows, which i obviously don't want to do, but I do want to use the clause above in a query that involves a join. eg select id from books, book_words where book_words.word='happy' and book_words.id = books.id and books.category_key=1471; this query works fine if the books table (with 1,200,000 rows) is clustered on category_key, but trundles on for ages (mainly accessing the disk) if not. > > > > If I created an index: > > > CREATE INDEX books_category_id ON books(category_key,id); > > > and then run the above query, it has no need to go to the books table to > > retrieve the id, > > Yes it does, because the index is only a hint. The executor must still > fetch each tuple fingered by the index in order to find out whether the > tuples are valid (committed). But that fetching should cost at most > one disk read per potentially-interesting tuple. ok, and because i have 294072 potentially interesting tuples, it hits the disk hard, and takes forever. perhaps (in the above query) there is a way of directing postgres to only access the books that are selected by the 'words' part of the query? this would mean about 1000 interesting book tuples which can then be checked for good categories. is there a way to do this? thanks very much for your help. timj > > regards, tom lane
"Tim Joyce" <tim@hoop.co.uk> writes: >>>> SELECT id FROM books WHERE category_key = 1471; >>>> (this takes ages on a table not ordered by category_key even if I have >>>> an index on category_key) > the query above selects 294072 rows, which i obviously don't want to do, but > I do want to use the clause above in a query that involves a join. eg Ah, I begin to understand. With an index scan you're going to get 294072 probes into the table (maybe even more, if there are deleted rows that match the category_key). If the rows are scattered all over the disk then that may actually take about 300k disk reads. After you cluster the table, the rows with the same category_key are all contiguous in the table, so many fewer blocks have to be read to visit them all. That's why clustering helps here. Since you're selecting about 1/4th of the table, this particular query would probably be better off *not* using the index, but just doing a sequential scan of the whole table :-(. I assume most of your categories are more selective than this one, though, so dropping the category index entirely is probably not the answer. > select id from books, book_words where book_words.word='happy' and > book_words.id = books.id and books.category_key=1471; If this is what you're really doing, I think what you actually want is indexes on book_words.word and books.id. That would allow book_words to be searched on the word (hopefully giving a more selective result than the category does), and then books would be probed using the id index. id has unique values, right? > perhaps (in the above query) there is a way of directing postgres to only > access the books that are selected by the 'words' part of the query? You might want to look at contrib/fulltextindex in the distribution for ideas about indexing words. fulltextindex might be overkill for your needs, or maybe not, but you could probably adapt it for your purposes. regards, tom lane
> > the query above selects 294072 rows, which i obviously don't want to do, but > > I do want to use the clause above in a query that involves a join. eg > > Ah, I begin to understand. With an index scan you're going to get > 294072 probes into the table (maybe even more, if there are deleted rows > that match the category_key). If the rows are scattered all over the > disk then that may actually take about 300k disk reads. After you > cluster the table, the rows with the same category_key are all > contiguous in the table, so many fewer blocks have to be read to visit > them all. That's why clustering helps here. Indeed, and this lead me to see if there was a way of getting the ids without hitting the books table, but it doesn't look like there is :( > > Since you're selecting about 1/4th of the table, this particular query > would probably be better off *not* using the index, but just doing a > sequential scan of the whole table :-(. I assume most of your > categories are more selective than this one, though, so dropping the > category index entirely is probably not the answer. > > > select id from books, book_words where book_words.word='happy' and > > book_words.id = books.id and books.category_key=1471; > > If this is what you're really doing, I think what you actually want is > indexes on book_words.word and books.id. I have indexes on both of these. >That would allow book_words > to be searched on the word (hopefully giving a more selective result > than the category does), and then books would be probed using the id index. > id has unique values, right? yep, but how can i use this subset to then select for category? perhaps at this stage, I should start to do things in the application code? > > > perhaps (in the above query) there is a way of directing postgres to only > > access the books that are selected by the 'words' part of the query? > > You might want to look at contrib/fulltextindex in the distribution for > ideas about indexing words. fulltextindex might be overkill for your > needs, or maybe not, but you could probably adapt it for your purposes. we have adapted this already (taking out the reg expression stuff so that it is a bit quicker) thanks for your help timj > > regards, tom lane
"Tim Joyce" <tim@hoop.co.uk> writes: >>>> select id from books, book_words where book_words.word='happy' and >>>> book_words.id = books.id and books.category_key=1471; >> >> If this is what you're really doing, I think what you actually want is >> indexes on book_words.word and books.id. > I have indexes on both of these. Hmm. So the question is why the system is (apparently) doing an indexscan on category_key rather than on id. You didn't show us the EXPLAIN output... regards, tom lane
> "Tim Joyce" <tim@hoop.co.uk> writes: > >>>> select id from books, book_words where book_words.word='happy' and > >>>> book_words.id = books.id and books.category_key=1471; > >> > >> If this is what you're really doing, I think what you actually want is > >> indexes on book_words.word and books.id. > > > I have indexes on both of these. > > Hmm. So the question is why the system is (apparently) doing an > indexscan on category_key rather than on id. You didn't show us > the EXPLAIN output... bloomsbury=> explain select id from books, books_title_words_idx t where t.word= 'happy' and t.rec_id = books.id and books.category_key=1471; NOTICE: QUERY PLAN: Hash Join (cost=43709.86 rows=2817 width=8) -> Index Scan using books_categories_idx on books (cost=32855.63 rows=293573width=4) -> Hash (cost=733.03 rows=12221 width=4) -> Index Scan using books_title_words_idx_word on books_title_words_idxt (cost=733.03 rows=12221 width=4) EXPLAIN thanks again timj > > regards, tom lane