Thread: Optimizer badness in 7.0 beta
Hello, I just downloaded the 7.0 beta to test it with my database to make sure there will be no unexpected problems when I upgrade my production site. I've run into a problem that I hope you can help me with. I dumped my 6.5.2 database and loaded it into 7.0. Lot's of queries are now taking much much longer. I have included the plans from one of the queries. In 7.0, the query takes 94 seconds compared to less than a second for it to run on 6.5.2. All of the data is exactly the same, the indexes are the same. I thought maybe the indexes had bad statistics, so I "vaccum analyze" both the 6.5.2 database and the 7.0 database and ran again on both just to be on the safe side. Still, same problem. I know that there were problems with IN clauses optimizing and the preferred method is to use an exists statement. However, I wouldn't expect this kind of change in performance. It does appear that 7.0 is trying to be smarter by using an index in the SubPlan, but for some reason it's being a hog. Some more information that may be useful, the table 'game' has about 1000 rows and the table game_developer has about 15000 rows. There is an index on game_developer(developer_id) Other than these types of queries, everything else seems to be working okay. I logged about 500 different queries that run against my database, removed the ones that exhibit the behaviour above and ran a little benchmark. The run times between 6.5.2 and 7.0.0, for the types of queries I'm running, are almost identical. I was hoping that the new improved optimizer would bring a great speed improvement, but I'm not seeing it. My guess is that most of the queries that I'm running are small and there's a fixed cost associated with running each one -- the actual work they perform is pretty small. Possibly more time is being spent optimizing the plan and is offsetting the improved execution time on smaller queries. -brian -- PG 7.0 -- NOTICE: QUERY PLAN: Sort (cost=383940.72..383940.72 rows=905 width=59) -> Seq Scan on game (cost=0.00..383896.28 rows=905 width=59) SubPlan -> Unique (cost=0.00..808.88 rows=0 width=4) -> Index Scan using game_developer_game_indexon game_developer (cost=0.00..808.87 rows=4 width=4) EXPLAIN -- PG 6.5.2 -- NOTICE: QUERY PLAN: Sort (cost=99.32 rows=872 width=59) -> Seq Scan on game (cost=99.32 rows=872 width=59) SubPlan -> Unique (cost=578.53 rows=2 width=4) -> Sort (cost=578.53 rows=2 width=4) -> Seq Scanon game_developer (cost=578.53 rows=2 width=4) EXPLAIN Query: select creation_timestamp,approved,moby_user_id,copyright_year,game_title,game_url,company_line,credits_complete,game_id from game where approved = 1 and game_id in ( select distinct game_id from game_developer where developer_id = 3) order by copyright_year desc,game_title; -- The world's most ambitious and comprehensive PC game database project. http://www.mobygames.com
This query can be rewritten as SELECT creation_timestamp, etc. FROM game, game_developer WHERE game.game_id = game_developer.game_id AND approved = 1 AND developer_id = 3 ORDER BY copyright_year desc, game_title The way you're writing it you're almost asking it to be slow. :) Of course that still doesn't explain why it's now 94sec versus formerly 1 but I'm sure Tom Lane will enlighten us all very soon. :) Brian Hirt writes: > select > creation_timestamp, [snip] > from > game > where > approved = 1 > and > game_id in ( > select > distinct game_id > from > game_developer > where > developer_id = 3) > order by > copyright_year desc, > game_title; -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
Peter, Actually, the query you supply will not work, I'll get duplicate rows because the relationship between game and game_developer is a one to many. Of course you had no way of knowing that from the information I supplied. I could throw a distinct in there to get the results, but that really feels like bad form because of the large amount of duplicate rows. In any case, the original query I supplied is generated SQL, created by a Database to Object persistance layer and cannot by design have multiple tables in the from clause, so restrictions to the table seleted from must be in the form of a qualifier. I realize that the query in question could be written better. My concern was the huge difference in performance between 6.5 and 7.0 on this type of query. Other people may be bitten by this one, so I wanted to bring it up. I've been able to easily to work around this problem, it just seems wrong that the difference in execution time is so far off from the previous release. I dont know too much about the PG internals, but when I used sybase, it would usually execute the sub-select independently and stuff the results into a temp table and then do another query, joining to the results of the sub-select. In a situation like this one, worst case without indexes you would get a table scan for the sub-select and then a merge join with a sequential scan on the temp table and a sequential scan on the other table (example below). Using that approach, with no indexes, the query still executes in a fraction of a second. It just seems that a query on tables as small as I'm describing should never take as long as it did. It seems like a problem with the optimizer, but if people are happy with currenty functionality that's fine with me also. -brian For Example: SELECT DISTINCT game_id INTO temp tmp_res FROM game_developer WHERE developer_id = 3 SELECT * FROM game, tmp_res WHERE game.game_id = tmp_res.game_id AND game.approved = 1 ORDER BY copyright_year desc, game_title On Sun, Mar 05, 2000 at 03:15:45PM +0100, Peter Eisentraut wrote: > This query can be rewritten as > > SELECT creation_timestamp, etc. > FROM game, game_developer > WHERE game.game_id = game_developer.game_id > AND approved = 1 AND developer_id = 3 > ORDER BY copyright_year desc, game_title > > The way you're writing it you're almost asking it to be slow. :) > > Of course that still doesn't explain why it's now 94sec versus formerly 1 > but I'm sure Tom Lane will enlighten us all very soon. :) > > > Brian Hirt writes: > > > select > > creation_timestamp, > [snip] > > from > > game > > where > > approved = 1 > > and > > game_id in ( > > select > > distinct game_id > > from > > game_developer > > where > > developer_id = 3) > > order by > > copyright_year desc, > > game_title; > > > -- > Peter Eisentraut Sernanders väg 10:115 > peter_e@gmx.net 75262 Uppsala > http://yi.org/peter-e/ Sweden > > > > ************ -- The world's most ambitious and comprehensive PC game database project. http://www.mobygames.com
On Sun, 5 Mar 2000, Brian Hirt wrote: > Actually, the query you supply will not work, I'll get duplicate > rows because the relationship between game and game_developer is > a one to many. Got me. I tried to read into it a little like 'one developer develops many games' but apparently it's the other way around. In that case you could use DISTINCT or maybe DISTINCT ON depending on the details. > I dont know too much about the PG internals, but when I used sybase, > it would usually execute the sub-select independently and stuff the > results into a temp table and then do another query, joining to the > results of the sub-select. Last time I checked PostgreSQL executes the subquery for each row. Apparently it must still be doing that and I do suspect that it is right in the overall sense because the subquery may have side effects. Consider SELECT * FROM t1 WHERE id IN (select nextval('my_sequence')) Of course this query makes absolutely no sense whatsoever but perhaps there are similar ones where it does. But I didn't mean to bash your query style, just pointing out a work-around that's commonly suggested. (People have been caught by this before.) > > SELECT creation_timestamp, etc. ^^ insert DISTINCT > > FROM game, game_developer > > WHERE game.game_id = game_developer.game_id > > AND approved = 1 AND developer_id = 3 > > ORDER BY copyright_year desc, game_title -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
Peter Eisentraut <e99re41@DoCS.UU.SE> writes: >> I dont know too much about the PG internals, but when I used sybase, >> it would usually execute the sub-select independently and stuff the >> results into a temp table and then do another query, joining to the >> results of the sub-select. > Last time I checked PostgreSQL executes the subquery for each row. > Apparently it must still be doing that It did up until last Wednesday. If Brian retries his example with current sources I think he'll see better performance. But I still want to poke into exactly why the indexscan implementation seems so much slower than the prior seqscan+sort implementation; that doesn't seem right. (And if it is right, why doesn't the optimizer realize it?) I'll get back to Brian on that. > and I do suspect that it is right > in the overall sense because the subquery may have side effects. Consider > SELECT * FROM t1 WHERE id IN (select nextval('my_sequence')) > Of course this query makes absolutely no sense whatsoever but perhaps > there are similar ones where it does. Interesting example. But since the tuples in t1 are not guaranteed to be scanned in any particular order, it seems to me that a query that has side-effects in WHERE inherently has undefined results. If we could detect side-effect-producing expressions (which we cannot, currently, and in general I suspect that problem is undecidable) I would argue that we ought to reject this query. I certainly don't want to constrain the optimizer by assuming that repeated executions of subqueries can't be optimized away. regards, tom lane
Brian Hirt <bhirt@mobygames.com> writes: > -- PG 7.0 -- > NOTICE: QUERY PLAN: > Sort (cost=383940.72..383940.72 rows=905 width=59) > -> Seq Scan on game (cost=0.00..383896.28 rows=905 width=59) > SubPlan > -> Unique (cost=0.00..808.88 rows=0 width=4) > -> Index Scan using game_developer_game_index on game_developer (cost=0.00..808.87 rows=4 width=4) There's something very strange about this query plan --- why is the estimated cost of the indexscan so high? If I do, say, regression=# explain select distinct * from tenk1 where unique1 < 3; NOTICE: QUERY PLAN: Unique (cost=3.22..3.34 rows=0 width=148) -> Sort (cost=3.22..3.22 rows=3 width=148) -> Index Scan using tenk1_unique1on tenk1 (cost=0.00..3.19 rows=3 width=148) The tenk1 table from the regression database is only 10K rows, versus 15K in your table, but still I'd expect costs not a heck of a lot higher than one page fetch per tuple retrieved. How is it coming up with a cost of 800 to retrieve 4 tuples? Could I trouble you for the exact declarations of the tables and indices involved here? Also, what plan do you get from 7.0 if you do set enable_indexscan = 'off'; before the EXPLAIN? regards, tom lane