Thread: Compound Indexes
I have a table with ten columns. My queries basically one column as the first WHERE condition, so an index on that column is certain. But the columns after that one vary depending on end-user's choice (this is a reporting application) and so does the sorting order. In MySQL world, I had sort_buffer in the config file, and I made a compound index with the columns most often used in these types of queries. So my index looked like: INDEX idx_trades(id, t_id, c_id, s_id, t_brief, created_on); This has five columns in it. While reading the pgsql documentation, I gather than anything beyond three columns offers diminishing benefits. My queries will look like these: SELECT * from trades where id = 99999 and c_id = 9999 ORDER by s_id; SELECT * from trades where id = 99999 and s_id = 99990 ORDER by created_on desc ; SELECT * from trades where id = 99999 and s_id = 99990 and t_brief ~* 'more|than|one|word' ORDER by created_on desc ; So my question: how does PGSQL optimize its sorts? If I were to index merely the columns that are most commonly used in the reporting WHERE clause, would that be ok? Some ofthese columns may be "TEXT" type -- how should I include these in the index (in MySQL, I included only the first 100 words in the index). TIA!
phoenix.kiula@gmail.com ("Phoenix Kiula") writes: > I have a table with ten columns. My queries basically one column as > the first WHERE condition, so an index on that column is certain. But > the columns after that one vary depending on end-user's choice (this > is a reporting application) and so does the sorting order. > > In MySQL world, I had sort_buffer in the config file, and I made a > compound index with the columns most often used in these types of > queries. So my index looked like: > > INDEX idx_trades(id, t_id, c_id, s_id, t_brief, created_on); > > This has five columns in it. While reading the pgsql documentation, I > gather than anything beyond three columns offers diminishing benefits. > > My queries will look like these: > > SELECT * from trades where id = 99999 > and c_id = 9999 > ORDER by s_id; > > SELECT * from trades where id = 99999 > and s_id = 99990 > ORDER by created_on desc ; > > SELECT * from trades where id = 99999 > and s_id = 99990 > and t_brief ~* 'more|than|one|word' > ORDER by created_on desc ; > > So my question: how does PGSQL optimize its sorts? If I were to index > merely the columns that are most commonly used in the reporting WHERE > clause, would that be ok? Some ofthese columns may be "TEXT" type -- > how should I include these in the index (in MySQL, I included only the > first 100 words in the index). If you have only these three sorts of queries, then I would speculate that the following indices *might* be useful: create idx1 on trades (id); create idx2 on trades (c_id); create idx3 on trades (s_id); create idx4 on trades (created_on); create idx5 on trades (created_on) where t_brief ~* 'more|than|one|word'; create idx6 on trades (id, s_id) where t_brief ~* 'more|than|one|word'; (I'm assuming with idx5 and idx6 that you were actually searching for 'more|than|one|word'; if what is searched for can vary, then idx5/idx6 are worthless.) You could try adding them all, and check out which of them are actually used by the query planner. And eventually drop out the irrelevant ones. PostgreSQL has a rather sophisticated query planner (pretty much "rocket science," compared to MySQL), and it is even possible that it would use multiple of those indices simultaneously for some of the queries. Which indexes, if any, it will use will vary from query to query based on the parameters in the query. You can determine the query plan by prefixing the query with the keyword "EXPLAIN." Suppose you add the above 6 indexes, you could get query plans via running the following: > EXPLAIN SELECT * from trades where id = 99999 > and c_id = 9999 > ORDER by s_id; > > EXPLAIN SELECT * from trades where id = 99999 > and s_id = 99990 > ORDER by created_on desc ; > > EXPLAIN SELECT * from trades where id = 99999 > and s_id = 99990 > and t_brief ~* 'more|than|one|word' > ORDER by created_on desc ; You may want to post the output to the list; learning to read query planner output is a bit of an art, and you won't necessarily make the right sense of the results on day #1... -- output = ("cbbrowne" "@" "cbbrowne.com") http://www3.sympatico.ca/cbbrowne/spreadsheets.html Consciousness - that annoying time between naps.
On 15/08/07, Chris Browne <cbbrowne@acm.org> wrote: > phoenix.kiula@gmail.com ("Phoenix Kiula") writes: > > I have a table with ten columns. My queries basically one column as > > the first WHERE condition, so an index on that column is certain. But > > the columns after that one vary depending on end-user's choice (this > > is a reporting application) and so does the sorting order. > > > > In MySQL world, I had sort_buffer in the config file, and I made a > > compound index with the columns most often used in these types of > > queries. So my index looked like: > > > > INDEX idx_trades(id, t_id, c_id, s_id, t_brief, created_on); > > > > This has five columns in it. While reading the pgsql documentation, I > > gather than anything beyond three columns offers diminishing benefits. > > > > My queries will look like these: > > > > SELECT * from trades where id = 99999 > > and c_id = 9999 > > ORDER by s_id; > > > > SELECT * from trades where id = 99999 > > and s_id = 99990 > > ORDER by created_on desc ; > > > > SELECT * from trades where id = 99999 > > and s_id = 99990 > > and t_brief ~* 'more|than|one|word' > > ORDER by created_on desc ; > > > > So my question: how does PGSQL optimize its sorts? If I were to index > > merely the columns that are most commonly used in the reporting WHERE > > clause, would that be ok? Some ofthese columns may be "TEXT" type -- > > how should I include these in the index (in MySQL, I included only the > > first 100 words in the index). > > If you have only these three sorts of queries, then I would speculate > that the following indices *might* be useful: > > create idx1 on trades (id); > create idx2 on trades (c_id); > create idx3 on trades (s_id); > create idx4 on trades (created_on); > create idx5 on trades (created_on) where t_brief ~* 'more|than|one|word'; > create idx6 on trades (id, s_id) where t_brief ~* 'more|than|one|word'; > > (I'm assuming with idx5 and idx6 that you were actually searching for > 'more|than|one|word'; if what is searched for can vary, then idx5/idx6 > are worthless.) > > You could try adding them all, and check out which of them are > actually used by the query planner. And eventually drop out the > irrelevant ones. > > PostgreSQL has a rather sophisticated query planner (pretty much > "rocket science," compared to MySQL), and it is even possible that it > would use multiple of those indices simultaneously for some of the > queries. Which indexes, if any, it will use will vary from query to > query based on the parameters in the query. Thank you so much! My only concern, probably a hangover from the MySQL world, is that if I have 5 of 6 indices, what would that do to INSERT and UPDATE performance if all these indices have to be updated? Is updating individual indices faster than one large compound index?
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes: > Thank you so much! My only concern, probably a hangover from the MySQL > world, is that if I have 5 of 6 indices, what would that do to INSERT > and UPDATE performance if all these indices have to be updated? Is > updating individual indices faster than one large compound index? Probably slower, but the compound index is enough more specialized that it's not as useful as the individual indexes. (I suspect that that compound index in MySQL didn't do you nearly as much good as you thought, either.) You should read http://www.postgresql.org/docs/8.2/static/indexes-bitmap-scans.html and nearby sections of the manual. regards, tom lane
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes: >> > SELECT * from trades where id = 99999 >> > and c_id = 9999 >> > ORDER by s_id; >> > >> > SELECT * from trades where id = 99999 >> > and s_id = 99990 >> > ORDER by created_on desc ; >> > >> > SELECT * from trades where id = 99999 >> > and s_id = 99990 >> > and t_brief ~* 'more|than|one|word' >> > ORDER by created_on desc ; Well I would start with testing: trades(id, c_id, s_id) trades(id, s_id, created_on) However you may (and actually probably will, i expect) find that the third column is not helping. That's especially true if the result of matching id and either c_id or s_id is always a small enough set of records that sorting them is quick (I would expect the point where an extra column in the index would start to save you anything to be somewhere around 100-1,000, possibly even as much as 10,000 or more). Note that in released versions getting an index which is useful for ORDER BY created_on *DESC* is actually quite difficult. So unless these queries are returning thousands of records I would suggest ignoring the ORDER BY clauses and just looking at the WHERE clauses. If id,s_id and id,c_id are selective enough to return only a few records I would actually expect you to end up with just trades(id, s_id) trades(id, c_id) You might also be able to build some kind of index to help the ~* clause. If you do a lot of queries like that and the id,s_id restriction isn't very selective you might look into tsearch2 which can index that type of query. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
> you do a lot of queries like that and the id,s_id restriction isn't very > selective you might look into tsearch2 which can index that type of query. > Thanks. Does tsearch2 come installed with 8.2.3? I am not techie enough to do all the compiling stuff so I'm hoping it does! How can I check?
"Phoenix Kiula" writes: >> I have a table with ten columns. My queries basically one column as >> the first WHERE condition, so an index on that column is certain. But >> the columns after that one vary depending on end-user's choice (this >> is a reporting application) and so does the sorting order. >> >> In MySQL world, I had sort_buffer in the config file, and I made a >> compound index with the columns most often used in these types of >> queries. So my index looked like: >> >> INDEX idx_trades(id, t_id, c_id, s_id, t_brief, created_on); >> >> This has five columns in it. While reading the pgsql documentation, I >> gather than anything beyond three columns offers diminishing benefits. >> >> My queries will look like these: >> >> SELECT * from trades where id = 99999 >> and c_id = 9999 >> ORDER by s_id; >> >> SELECT * from trades where id = 99999 >> and s_id = 99990 >> ORDER by created_on desc ; >> >> SELECT * from trades where id = 99999 >> and s_id = 99990 >> and t_brief ~* 'more|than|one|word' >> ORDER by created_on desc ; The documentation at <http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html> also points out that your multi-column index will not help much with the columns after "id", certainly not with the columns after "t_id" since "t_id" isn't involved in any of the conditions or ordering. > The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column thatdoes not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints oncolumns to the right of these columns are checked in the index, so they save visits to the table proper, but they do notreduce the portion of the index that has to be scanned. and goes on to say, > Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves spaceand time. The planner can make use of single-column indexes in combination (if your PG is recent enough). <http://www.postgresql.org/docs/8.2/interactive/indexes-bitmap-scans.html> > Sometimes multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the index-combinationfeature. The selectivity of each column is also relevant. If you have ten million rows with "s_id" values of only either 99990 or 99991, an index on "s_id" is not going to help much. -- Lew