Thread: Optimizer badness in 7.0 beta

Optimizer badness in 7.0 beta

From
Brian Hirt
Date:
Hello,

I just downloaded the 7.0 beta to test it with my database to make sure
there will be no unexpected problems when I upgrade my production site.
I've run into a problem that I hope you can help me with.  I dumped my 
6.5.2 database and loaded it into 7.0.  Lot's of queries are now taking 
much much longer.  I have included the plans from one of the queries.  
In 7.0, the query takes 94 seconds compared to less than a second for
it to run on 6.5.2.  All of the data is exactly the same, the indexes are the 
same.  I thought maybe the indexes had bad statistics, so I "vaccum analyze" 
both the 6.5.2 database and the 7.0 database and ran again on both just to 
be on the safe side.  Still, same problem.  I know that there were problems 
with IN clauses optimizing and the preferred method is to use an exists 
statement.  However, I wouldn't expect this kind of change in performance.  
It does appear that 7.0 is trying to be smarter by using an index in the 
SubPlan, but for some reason it's being a hog.

Some more information that may be useful, the table 'game' has about 1000
rows and the table game_developer has about 15000 rows.  There is an
index on game_developer(developer_id)  

Other than these types of queries, everything else seems to be working 
okay.  I logged about 500 different queries that run against my database,
removed the ones that exhibit the behaviour above and ran a little
benchmark.  The run times between 6.5.2 and 7.0.0, for the types of
queries I'm running, are almost identical.   I was hoping that the new
improved optimizer would bring a great speed improvement, but I'm not
seeing it.  My guess is that most of the queries that I'm running are
small and there's a fixed cost associated with running each one -- the
actual work they perform is pretty small.  Possibly more time is being
spent optimizing the plan and is offsetting the improved execution time
on smaller queries.

-brian


-- PG 7.0 --
NOTICE:  QUERY PLAN:

Sort  (cost=383940.72..383940.72 rows=905 width=59) ->  Seq Scan on game  (cost=0.00..383896.28 rows=905 width=59)
SubPlan         ->  Unique  (cost=0.00..808.88 rows=0 width=4)               ->  Index Scan using
game_developer_game_indexon game_developer  (cost=0.00..808.87 rows=4 width=4)
 

EXPLAIN

-- PG 6.5.2 --
NOTICE:  QUERY PLAN:

Sort  (cost=99.32 rows=872 width=59) ->  Seq Scan on game  (cost=99.32 rows=872 width=59)       SubPlan         ->
Unique (cost=578.53 rows=2 width=4)               ->  Sort  (cost=578.53 rows=2 width=4)                     ->  Seq
Scanon game_developer  (cost=578.53 rows=2 width=4)
 

EXPLAIN

Query:

select
creation_timestamp,approved,moby_user_id,copyright_year,game_title,game_url,company_line,credits_complete,game_id
 
from game 
where approved = 1 
and game_id in (    select         distinct game_id    from         game_developer    where         developer_id = 3) 
order by copyright_year desc,game_title;
-- 
The world's most ambitious and comprehensive PC game database project.
                     http://www.mobygames.com


Re: [HACKERS] Optimizer badness in 7.0 beta

From
Peter Eisentraut
Date:
This query can be rewritten as

SELECT creation_timestamp, etc. 
FROM game, game_developer
WHERE game.game_id = game_developer.game_id AND approved = 1 AND developer_id = 3
ORDER BY copyright_year desc, game_title

The way you're writing it you're almost asking it to be slow. :)

Of course that still doesn't explain why it's now 94sec versus formerly 1
but I'm sure Tom Lane will enlighten us all very soon. :)


Brian Hirt writes:

> select 
>     creation_timestamp,
[snip]
> from 
>     game 
> where 
>     approved = 1 
> and 
>     game_id in (
>         select 
>             distinct game_id
>         from 
>             game_developer
>         where 
>             developer_id = 3) 
> order by 
>     copyright_year desc,
>     game_title;


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden




Re: [HACKERS] Optimizer badness in 7.0 beta

From
Brian Hirt
Date:
Peter,

Actually, the query you supply will not work, I'll get duplicate 
rows because the relationship between game and game_developer is 
a one to many.  Of course you had no way of knowing that from the 
information I supplied.  I could throw a distinct in there to get 
the results, but that really feels like bad form because of the large
amount of duplicate rows.  In any case, the original query I supplied 
is generated SQL, created by a Database to Object persistance layer 
and cannot by design have multiple tables in the from clause, so 
restrictions to the table seleted from must be in the form of a 
qualifier.  

I realize that the query in question could be written better.  My 
concern was the huge difference in performance between 6.5 and 7.0 on
this type of query.  Other people may be bitten by this one, so I wanted
to bring it up.  I've been able to easily to work around this problem, 
it just seems wrong that the difference in execution time is so far 
off from the previous release.  

I dont know too much about the PG internals, but when I used sybase, 
it would usually execute the sub-select independently and stuff the 
results into a temp table and then do another query, joining to the 
results of the sub-select.  In a situation like this one, worst case 
without indexes you would get a table scan for the sub-select 
and then a merge join with a sequential scan on the temp table and a 
sequential scan on the other table (example below).  Using that
approach, with no indexes, the query still executes in a fraction of a 
second.  It just seems that a query on tables as small as I'm describing
should never take as long as it did.   It seems like a problem with
the optimizer, but if people are happy with currenty functionality that's 
fine with me also.

-brian

For Example:

SELECT DISTINCT game_id INTO temp tmp_res
FROM game_developer
WHERE developer_id = 3

SELECT *
FROM game, tmp_res
WHERE game.game_id = tmp_res.game_id AND game.approved = 1
ORDER BY copyright_year desc, game_title

On Sun, Mar 05, 2000 at 03:15:45PM +0100, Peter Eisentraut wrote:
> This query can be rewritten as
> 
> SELECT creation_timestamp, etc. 
> FROM game, game_developer
> WHERE game.game_id = game_developer.game_id
>   AND approved = 1 AND developer_id = 3
> ORDER BY copyright_year desc, game_title
> 
> The way you're writing it you're almost asking it to be slow. :)
> 
> Of course that still doesn't explain why it's now 94sec versus formerly 1
> but I'm sure Tom Lane will enlighten us all very soon. :)
> 
> 
> Brian Hirt writes:
> 
> > select 
> >     creation_timestamp,
> [snip]
> > from 
> >     game 
> > where 
> >     approved = 1 
> > and 
> >     game_id in (
> >         select 
> >             distinct game_id
> >         from 
> >             game_developer
> >         where 
> >             developer_id = 3) 
> > order by 
> >     copyright_year desc,
> >     game_title;
> 
> 
> -- 
> Peter Eisentraut                  Sernanders väg 10:115
> peter_e@gmx.net                   75262 Uppsala
> http://yi.org/peter-e/            Sweden
> 
> 
> 
> ************

-- 
The world's most ambitious and comprehensive PC game database project.
                     http://www.mobygames.com


Re: [HACKERS] Optimizer badness in 7.0 beta

From
Peter Eisentraut
Date:
On Sun, 5 Mar 2000, Brian Hirt wrote:

> Actually, the query you supply will not work, I'll get duplicate 
> rows because the relationship between game and game_developer is 
> a one to many.

Got me. I tried to read into it a little like 'one developer develops many
games' but apparently it's the other way around. In that case you could
use DISTINCT or maybe DISTINCT ON depending on the details.

> I dont know too much about the PG internals, but when I used sybase, 
> it would usually execute the sub-select independently and stuff the 
> results into a temp table and then do another query, joining to the 
> results of the sub-select.

Last time I checked PostgreSQL executes the subquery for each row.
Apparently it must still be doing that and I do suspect that it is right
in the overall sense because the subquery may have side effects. Consider

SELECT * FROM t1 WHERE id IN (select nextval('my_sequence'))

Of course this query makes absolutely no sense whatsoever but perhaps
there are similar ones where it does.

But I didn't mean to bash your query style, just pointing out a
work-around that's commonly suggested. (People have been caught by this
before.)

> > SELECT creation_timestamp, etc.          ^^ insert DISTINCT
> > FROM game, game_developer
> > WHERE game.game_id = game_developer.game_id
> >   AND approved = 1 AND developer_id = 3
> > ORDER BY copyright_year desc, game_title


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] Optimizer badness in 7.0 beta

From
Tom Lane
Date:
Peter Eisentraut <e99re41@DoCS.UU.SE> writes:
>> I dont know too much about the PG internals, but when I used sybase, 
>> it would usually execute the sub-select independently and stuff the 
>> results into a temp table and then do another query, joining to the 
>> results of the sub-select.

> Last time I checked PostgreSQL executes the subquery for each row.
> Apparently it must still be doing that

It did up until last Wednesday.  If Brian retries his example with
current sources I think he'll see better performance.  But I still
want to poke into exactly why the indexscan implementation seems so
much slower than the prior seqscan+sort implementation; that doesn't
seem right.  (And if it is right, why doesn't the optimizer realize it?)
I'll get back to Brian on that.

> and I do suspect that it is right
> in the overall sense because the subquery may have side effects. Consider

> SELECT * FROM t1 WHERE id IN (select nextval('my_sequence'))

> Of course this query makes absolutely no sense whatsoever but perhaps
> there are similar ones where it does.

Interesting example.  But since the tuples in t1 are not guaranteed to
be scanned in any particular order, it seems to me that a query that
has side-effects in WHERE inherently has undefined results.  If we could
detect side-effect-producing expressions (which we cannot, currently,
and in general I suspect that problem is undecidable) I would argue that
we ought to reject this query.  I certainly don't want to constrain the
optimizer by assuming that repeated executions of subqueries can't be
optimized away.
        regards, tom lane


Re: [HACKERS] Optimizer badness in 7.0 beta

From
Tom Lane
Date:
Brian Hirt <bhirt@mobygames.com> writes:
> -- PG 7.0 --
> NOTICE:  QUERY PLAN:

> Sort  (cost=383940.72..383940.72 rows=905 width=59)
>   ->  Seq Scan on game  (cost=0.00..383896.28 rows=905 width=59)
>         SubPlan
>           ->  Unique  (cost=0.00..808.88 rows=0 width=4)
>                 ->  Index Scan using game_developer_game_index on game_developer  (cost=0.00..808.87 rows=4 width=4)

There's something very strange about this query plan --- why is the
estimated cost of the indexscan so high?  If I do, say,

regression=# explain select distinct * from tenk1 where unique1 < 3;
NOTICE:  QUERY PLAN:

Unique  (cost=3.22..3.34 rows=0 width=148) ->  Sort  (cost=3.22..3.22 rows=3 width=148)       ->  Index Scan using
tenk1_unique1on tenk1  (cost=0.00..3.19 rows=3 width=148)
 

The tenk1 table from the regression database is only 10K rows, versus
15K in your table, but still I'd expect costs not a heck of a lot higher
than one page fetch per tuple retrieved.  How is it coming up with a
cost of 800 to retrieve 4 tuples?

Could I trouble you for the exact declarations of the tables and indices
involved here?  Also, what plan do you get from 7.0 if you do
set enable_indexscan = 'off';

before the EXPLAIN?
        regards, tom lane