Thread: optimizing queries and indexes...
i'm fairly new at this whole database design thing and my grasp of set theory is not what it was when i was in college lo these many years past. but i want to get a better idea of how to optimize sql statements. i have several friends that are DBA's by profession and work on oracle and/or ms sql server. they have all told me that while there are some general rules to follow that each database is different. for example, one thing that one of my friends said is: select X from big_table ,little_table Generally speaking, Oracle optimizes better when the smaller/reducing tables are on the bottom and the larger tablesare on the top. I believe SQLServer likes them in the opposite direction. and also: Generally speaking indexes should be built with column names in the order of higher cardinality. I frequentlyscrew this up because it runs counter to the way you think about building your joins. An example might be as follows: An index with: Company_Id (distinct count = 4) Dept_Id (distinct count = 40) Employee_ID (distinct count = 1000+) This index should probably be in the illogical order of: Employee_Id Dept_Id Company_Id so, i am hopeful that there is some sort of postgresql performance faq for queries. also, is there a general rule as to what is considered expensive when running explain on a query? i had one query running a join and calculation across two unindexed tables. the initial plan cost about 800 with a majority of the cost being taken up in a sequential scan of about 5000 rows costing 210 and then a hash join of 225 rows across the two tables costing 585. after creating indexes on the costs dropped about in half to 106 and 299 respectively. is 800 expensive? is 400 expensive? will the cost go up when the data baloons to 100,000 rows? many thanks! rjsjr
Robert, I'd advise you to buy a book, but frankly I don't know a good one on DB performance optimization. The DB optimizers I've met tend to guard their secrets closely. Suggestions, anyone? > i have several friends that are DBA's by profession and work on > oracle > and/or ms sql server. they have all told me that while there are some > general rules to follow that each database is different. Yup. They told ya right. > for example, one thing that one of my friends said is: > select X > from big_table > ,little_table > > Generally speaking, Oracle optimizes better > when the smaller/reducing tables are on the > bottom and the larger tables are on the top. > I believe SQLServer likes them in the opposite > direction. Generally speaking, in PG SQL it pays to leave the join order up to the optimizer as much as possible. Tom and Stephan have built a phenominal query optimizer, and you are much more likely to slow it down if you limit its choices. Also, for the implicit join style in Postgres the order in which you give tables is largely ignored by the optimizer. Order only matters in explicit joins. That being said, there are a few practices that can help: 1. When joining a limited subset of an exceptionally large table to several smaller tables, consider using a subselect for the large table. In some cases this will speed up query execution. Sometimes it won't. Example: 2. When doing several Inner (normal) Joins and several Outer Joins, do the inner joins first and the outer joins second, as the inner joins should limit the result set that is being matched for the outer joins. 3. All joins and where conditions, ideally, should be executed on indexed columns. 4. EXISTS and NOT EXISTS are almost always faster than IN and NOT IN for a sub-select in the WHERE clause. 5. Some Postgres-specific tips: a. VACUUM ANALYZE regularly! b. If you have indexes on tables with a large number of regular deletions, you will need to drop and re-create the indexes during periods of inactivity. This is a Postgres defect that will be fixed in ver. 7.3 or later. 6. Make sure to establish Foriegn Key constraints wherever appropriate. This will speed up joins on the constrained columns considerably, as the parser does not have to worry about unmatched rows. 7. Set Unique indexes on all unique columns. This also helps the parser. > and also: > Generally speaking indexes should be built > with column names in the order of higher > cardinality. I frequently screw this up > because it runs counter to the way you think > about building your joins. This may be true in Oracle (opinions?) but is is neither true in Postgres nor true in SQL Server (or Frontbase). If you build your indexes according to this rule you will be dissapointed in the results. The real rule is: Indexes should always be built according to the manner and order in which they will be queried. Example: If you have a database in which there is a unique set of EmployeeIDs for each Region, but not unique between Regions. Thus all joins to Employees join on both the regionID and the employeeID. Further, you force the user to select a region before s/he can look up an employee. In this case, you would want to establish your indexes as CREATE INDEX reg_emp_ix UNIQUE ( regionID, employeeID ) Create all your joins as: JOIN employees e ON (t.regionID = e.regionID AND t.employeeID = e.employeeID) And your WHERE clauses as: WHERE regionID = $region AND employeeID = $emp For all of these, order is immensely important. However, this index, while immensely useful for the above purposes, would be useless if there are more than 2 regions and you wanted to select on employeeID only for some reason. In that case, you would need to add a second index on employeeID alone, as the regionID, employeeID index would not be used in a query that filtered only by employeeID and ignored the regionID. One of my clients didn't know any of the above and established a number of indexes on tables > 3,000,000 records using columns in alphabetical order (!?!). They then had to call me, and re-ordering the index columns cut the delay in single-row queries (especially DELETE queries) by 80%. > so, i am hopeful that there is some sort of postgresql performance > faq > for queries. Somebody wanna re-organize the above with more examples? -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Attachment
On Sun, 9 Sep 2001, Josh Berkus wrote: > > i have several friends that are DBA's by profession and work on > > oracle > > and/or ms sql server. they have all told me that while there are some > > general rules to follow that each database is different. > > Yup. They told ya right. > > > for example, one thing that one of my friends said is: > > select X > > from big_table > > ,little_table > > > > Generally speaking, Oracle optimizes better > > when the smaller/reducing tables are on the > > bottom and the larger tables are on the top. > > I believe SQLServer likes them in the opposite > > direction. > > Generally speaking, in PG SQL it pays to leave the join order up to the > optimizer as much as possible. Tom and Stephan have built a phenominal Just wanted to jump in to say that I've actually had nothing to do with the optimizer, I just often field questions. :) > query optimizer, and you are much more likely to slow it down if you > limit its choices. Also, for the implicit join style in Postgres the > order in which you give tables is largely ignored by the optimizer. > Order only matters in explicit joins. This bit above is very important. If explain (see below) is showing you a plan you think is suboptimal, using explicit joins to force join order is sometimes a win. The downside is that if the situation changes, this may no longer be an optimization. > 4. EXISTS and NOT EXISTS are almost always faster than IN and NOT IN for > a sub-select in the WHERE clause. As a mention, a little info is given on this in the FAQ entry 4.23 with a simple example of a conversion from IN to EXISTS. > > 5. Some Postgres-specific tips: > a. VACUUM ANALYZE regularly! > b. If you have indexes on tables with a large number of regular > deletions, you will need to drop and re-create the indexes during > periods of inactivity. This is a Postgres defect that will be fixed in > ver. 7.3 or later. For b, I think you may also be able to use REINDEX to do that, but I'm not sure. > 6. Make sure to establish Foriegn Key constraints wherever appropriate. > This will speed up joins on the constrained columns considerably, as the > parser does not have to worry about unmatched rows. I don't actually think that the optimizer takes that into account (although I could be wrong). It's possible that it will in the future however. As a note however, foreign keys *do* make insert/update on the fk table and update/delete on the pk table a small bit more expensive. It's a balancing act, just like indexes. > The real rule is: > Indexes should always be built according to the manner and order in > which they will be queried. > > Example: If you have a database in which there is a unique set of > EmployeeIDs for each Region, but not unique between Regions. Thus all > joins to Employees join on both the regionID and the employeeID. > Further, you force the user to select a region before s/he can look up > an employee. In this case, you would want to establish your indexes as > CREATE INDEX reg_emp_ix UNIQUE ( regionID, employeeID ) > Create all your joins as: > JOIN employees e ON (t.regionID = e.regionID AND t.employeeID = > e.employeeID) > And your WHERE clauses as: > WHERE regionID = $region AND employeeID = $emp > For all of these, order is immensely important. > > However, this index, while immensely useful for the above purposes, > would be useless if there are more than 2 regions and you wanted to > select on employeeID only for some reason. In that case, you would need > to add a second index on employeeID alone, as the regionID, employeeID > index would not be used in a query that filtered only by employeeID and > ignored the regionID. The general rule for this is that AFAIK that for multi-column indexes the index can only be used up to the first missing column in its order of definition. Index (a,b,c) can be used for a query on a, a & b, a & b & c, or to a limited extent on a & c (it'll only look up a in the index however). Additional things (in no particular order):Explain is your friend. Whenever you have a query that you expect to do alot on your dataset (well, except straight inserts) use explain to get an idea of what the database thinks is necessary to do the query. It's simple: EXPLAIN <query>; and is not immediately understandable, but it's enough info to start getting help from the mailing lists. :)Sometimes a sequence scan across the heap will be faster in postgresthan an index scan. If a reasonable % of rows are going to be returned, the index scan will actually often be slower. Sometimes postgres gets it wrong, but you can test that by experimentation. Indexes are a double-edged sword. Too many indexes can be bad, especially ones that aren't likely to actually be used in queries since there is a cost involved on updates and inserts for updating the indexes. Aggregates like count(*) on a single table are not done via index information or stored information and will require a sequence scan. Often people expect count to be very cheap, but it isn't always. If you're using UNION but know that there will not be overlapping rows in the two queries, use UNION ALL instead to prevent needing to do the duplicate removal checking. At least on 7.1 and below, if you have a dummy value that is very common but doesn't really pass any information (like 'N/A' for example), consider using NULL instead. The optimizer statistics can often be thrown off-kilter by values that are much more common than the real data.
Stephan, Ah. Well, you answer optimization questions so often that I'd assumed that you had a hand in it. Is the optimizer all Tom and Bruce's work? > At least on 7.1 and below, if you have a dummy value that is very > common > but doesn't really pass any information (like 'N/A' for example), > consider > using NULL instead. The optimizer statistics can often be thrown > off-kilter by values that are much more common than the real data. Personally, I cannot reccomend this. There are a number of normalization problems with using NULL instead of 'N/A' or 0 or another "no" value. Some database theorists (Fabian Pascal & co.) even propose the elimination of NULL from the SQL spec on the grounds that it encourages bad DB design. As such, I have a hard time reccommending any course that involves adding *more* NULLs to the database, especially for a marginal query performance gain. To phrase it another way: Optimization problems cost you seconds. DB Design and normalization problems cost you *days*. -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Attachment
On Sun, 9 Sep 2001, Josh Berkus wrote: > Ah. Well, you answer optimization questions so often that I'd assumed > that you had a hand in it. Is the optimizer all Tom and Bruce's work? Mostly Tom I believe. > > At least on 7.1 and below, if you have a dummy value that is very > > common > > but doesn't really pass any information (like 'N/A' for example), > > consider > > using NULL instead. The optimizer statistics can often be thrown > > off-kilter by values that are much more common than the real data. > > Personally, I cannot reccomend this. There are a number of > normalization problems with using NULL instead of 'N/A' or 0 or another > "no" value. Some database theorists (Fabian Pascal & co.) even propose > the elimination of NULL from the SQL spec on the grounds that it > encourages bad DB design. As such, I have a hard time reccommending any > course that involves adding *more* NULLs to the database, especially for > a marginal query performance gain. NULLs are rather ugly, but the difference is often index scan vs sequence scan and on billion row tables that starts being those messages about, I do this query and it never comes back. This should become much less of an issue in 7.2 however with Tom's enhancing of the statistics generated.
[ Sorry for slow response, I've been out of town ] "Robert J. Sanford, Jr." <rsanford@nolimitsystems.com> writes: > [ some questions already ably answered by Josh and Stephan ] I just wanted to throw in one more tidbit: > for example, one thing that one of my friends said is: > select X > from big_table > ,little_table > Generally speaking, Oracle optimizes better > when the smaller/reducing tables are on the > bottom and the larger tables are on the top. > I believe SQLServer likes them in the opposite > direction. Postgres absolutely does not care: the optimizer will always consider both A-join-B and B-join-A orders for every join it has to do. As Stephan and Josh noted, you can constrain the join pairs the optimizer will consider if you use explicit-JOIN syntax --- but each pair will be considered in both directions. regards, tom lane
Tom, > [ Sorry for slow response, I've been out of town ] Taking a much-deserved vacation, hey? Any new job plans? > Postgres absolutely does not care: the optimizer will always consider > both A-join-B and B-join-A orders for every join it has to do. As > Stephan and Josh noted, you can constrain the join pairs the > optimizer > will consider if you use explicit-JOIN syntax --- but each pair will > be > considered in both directions. Fantastic! You may want to point out to unbelievers that MS SQL Server does not do this; if you fail to put your joins/where clauses in the *exact* order of the indecies in SQL Server, it ignores them and does a table scan. This is especially deadly because table scans are about 1/2 as fast in SQL Server as they are in Postgres. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco