Thread: Two Index Questions
Folks (esp. Tom, Stephan, and Bruce): I have two questions for my "Adventures in PostgreSQL" article reasearch: Multi-Column Indexes and GROUP BY: Q: If you group a table by multiple colums, e.g.SELECT t1.A, t1.B, t1.C, MAX(t1.G)FROM t1GROUP BY t1.A, t1.B, t1.C Thenwould a multi-column index on A, B, C be faster than seperate indexes on A, B and C? I've run a few tests, but I don't have enough data in the seperate tables to really get a feel for the difference. DESC Alpha Sort Indexes: Q: In PostgreSQL 7.0, there was an issue that indexes where never consulted for DESC alpha sorts. Has this been resolved? If so, does one need to create any special indexes to take advantage of indexes for DESC sorts? Thanks for your help! -- -Josh BerkusAglio Database SolutionsSan Francisco
Josh Berkus <josh@agliodbs.com> writes: > Q: If you group a table by multiple colums, e.g. > SELECT t1.A, t1.B, t1.C, MAX(t1.G) > FROM t1 > GROUP BY t1.A, t1.B, t1.C > Then would a multi-column index on A, B, C be faster than seperate indexes > on A, B and C? An indexscan can only use one index, so separate indexes would be completely worthless for this query. An index on A,B,C is potentially useful, but in most cases I think the planner will prefer an explicit sort anyway if there's no WHERE clause. > Q: In PostgreSQL 7.0, there was an issue that indexes where never consulted > for DESC alpha sorts. Has this been resolved? If so, does one need to > create any special indexes to take advantage of indexes for DESC sorts? Yes; no. regression=# create table foo (f1 text); CREATE TABLE regression=# create index fooi on foo(f1); CREATE INDEX regression=# explain select * from foo order by f1; QUERY PLAN ---------------------------------------------------------------------Index Scan using fooi on foo (cost=0.00..52.00 rows=1000width=32) (1 row) regression=# explain select * from foo order by f1 desc; QUERY PLAN ------------------------------------------------------------------------------Index Scan Backward using fooi on foo (cost=0.00..52.00rows=1000 width=32) (1 row) regards, tom lane
On Fri, Jul 19, 2002 at 09:33:59 -0700, Josh Berkus <josh@agliodbs.com> wrote: > Folks (esp. Tom, Stephan, and Bruce): > > I have two questions for my "Adventures in PostgreSQL" article reasearch: > > Multi-Column Indexes and GROUP BY: > Q: If you group a table by multiple colums, e.g. > SELECT t1.A, t1.B, t1.C, MAX(t1.G) > FROM t1 > GROUP BY t1.A, t1.B, t1.C > Then would a multi-column index on A, B, C be faster than seperate indexes > on A, B and C? I've run a few tests, but I don't have enough data in the > seperate tables to really get a feel for the difference. If there are lots of G entries for fixed As, Bs and Cs, then another option would be to have an index on all 4 tables and use a subquery with a limit 1 clause to get the row with the max G value for any A, B and C.
Bruno, > > I have two questions for my "Adventures in PostgreSQL" article reasearch: > > > > Multi-Column Indexes and GROUP BY: > > Q: If you group a table by multiple colums, e.g. > > SELECT t1.A, t1.B, t1.C, MAX(t1.G) > > FROM t1 > > GROUP BY t1.A, t1.B, t1.C > > Then would a multi-column index on A, B, C be faster than seperate indexes > > on A, B and C? I've run a few tests, but I don't have enough data in the > > seperate tables to really get a feel for the difference. > > If there are lots of G entries for fixed As, Bs and Cs, then another option > would be to have an index on all 4 tables and use a subquery with a limit 1 > clause to get the row with the max G value for any A, B and C. We're talking about only one table, here. Not four. I generally try to avoid using LIMIT, as it is a non-SQL92 extension. Also, LIMIT in subqueries might someday be disallowed as it interferes with the fundmentally unordered nature of subqueries. -- -Josh BerkusAglio Database SolutionsSan Francisco
On Fri, Jul 19, 2002 at 10:28:18 -0700, Josh Berkus <josh@agliodbs.com> wrote: > > Bruno, > > > > I have two questions for my "Adventures in PostgreSQL" article reasearch: > > > > > > Multi-Column Indexes and GROUP BY: > > > Q: If you group a table by multiple colums, e.g. > > > SELECT t1.A, t1.B, t1.C, MAX(t1.G) > > > FROM t1 > > > GROUP BY t1.A, t1.B, t1.C > > > Then would a multi-column index on A, B, C be faster than seperate > indexes > > > on A, B and C? I've run a few tests, but I don't have enough data in the > > > seperate tables to really get a feel for the difference. > > > > If there are lots of G entries for fixed As, Bs and Cs, then another option > > would be to have an index on all 4 tables and use a subquery with a limit 1 > > clause to get the row with the max G value for any A, B and C. > > We're talking about only one table, here. Not four. Typo. I meant to say columns. The issue is that max doesn't use an index, but if there are a lot of different values of G for a given A, B and C, it may be better to use an index then to search through the applicable rows to find the maximum. > I generally try to avoid using LIMIT, as it is a non-SQL92 extension. Also, > LIMIT in subqueries might someday be disallowed as it interferes with the > fundmentally unordered nature of subqueries. I can see not using it because it is nonstandard. I don't think it would be disabled for the reason you have given. The documentation makes it pretty clear that you need to use order by if you want a particular tuple.
Josh Berkus <josh@agliodbs.com> writes: > I generally try to avoid using LIMIT, as it is a non-SQL92 extension. Fair enough ... > Also, > LIMIT in subqueries might someday be disallowed as it interferes with the > fundmentally unordered nature of subqueries. I don't think so; we went out of our way to make it work, so we're unlikely to discard the feature. It's true that it does prevent some optimizations that might otherwise occur, but not using it is a sufficient answer to that. regards, tom lane
Bruno, Tom, > Typo. I meant to say columns. The issue is that max doesn't use an index, > but if there are a lot of different values of G for a given A, B and C, > it may be better to use an index then to search through the applicable > rows to find the maximum. That's odd ... you're correct. Tom, why doesn't MAX() use an index? I understand why indexes are generally useless for SUM(), AVG, and COUNT, but it seems that MAX() and MIN() should *always* use an index. -- -Josh BerkusAglio Database SolutionsSan Francisco
Josh Berkus wrote: > Bruno, Tom, > > > Typo. I meant to say columns. The issue is that max doesn't use an index, > > but if there are a lot of different values of G for a given A, B and C, > > it may be better to use an index then to search through the applicable > > rows to find the maximum. > > That's odd ... you're correct. Tom, why doesn't MAX() use an index? I > understand why indexes are generally useless for SUM(), AVG, and COUNT, but > it seems that MAX() and MIN() should *always* use an index. Index access methods don't know about aggregates, and our type system makes such linkage difficult. The FAQ does have: However, <SMALL>LIMIT</SMALL> combined with <SMALL>ORDER BY</SMALL> often will use an index because only a small portion of the table is returned. In fact, though MAX() andMIN() don't use indexes, it is possible to retrieve such values using an index with ORDER BY and LIMIT:<PRE> SELECT col FROM tab ORDER BY col [ DESC ] LIMIT 1</PRE> -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Josh Berkus <josh@agliodbs.com> writes: > Tom, why doesn't MAX() use an index? Don't tell me you haven't seen that discussed many times before :-( Because Postgres uses an extensible set of aggregate functions, we treat all aggregates as "black boxes": the implementation strategy is always to pass all the specified values through the aggregate. Special-casing MIN and MAX would be nice from the point of view of performance, but there's this little problem that our sets of datatypes and index types are also extensible. We'd need to devise some non-hard-wired way of identifying which aggregate functions are related to the sort orderings of what indexes. Finally, the transformation into an optimized form is just not that easy to do automatically in the general case --- it's easy enough if you write "SELECT max(col) FROM tab", but how about "SELECT max(col), min(col) FROM tab"? What if there are WHERE clauses (with or without constraints on col)? And the GROUP BY case that we started this discussion with is *very* nontrivial. This issue is on the TODO list, but given that the ORDER BY/LIMIT workaround is available (and offers more functionality than MAX/MIN anyway), I don't think it's a very high-priority problem. We've got plenty of TODO items for which there is no good workaround... regards, tom lane
Tom, Bruce, Bruno > Don't tell me you haven't seen that discussed many times before :-( Actually, I'll have to admint to not having read the relevant discussions. Honestly, on this list I usually only answer the questions I asked or the questions I'm going to respond to. Short-sighted, but I can't keep up otherwise. The explanation certainly makes sense. It just goes against the grain for me to use a non-SQL standard structure in major parts of my database schema. But I'll learn to live with it. -- -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco