Thread: Optimizer seems to be way off, why?

Optimizer seems to be way off, why?

From
Dirk Lutzebäck
Date:
Hi,

I do not under stand the following explain output (pgsql 8.0.3):

explain analyze
select b.e from b, d
where b.r=516081780 and b.c=513652057 and b.e=d.e;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual
time=0.213..2926.845 rows=324503 loops=1)
   ->  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1
width=4) (actual time=0.104..17.418 rows=3293 loops=1)
         Index Cond: (r = 516081780::oid)
         Filter: (c = 513652057::oid)
   ->  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140
width=4) (actual time=0.009..0.380 rows=99 loops=3293)
         Index Cond: ("outer".e = d.e)
 Total runtime: 3638.783 ms
(7 rows)

Why is the rows estimate for b_index and the nested loop 1? It is
actually 3293 and 324503.

I did VACUUM ANALYZE before and I also increased the STATISTICS TARGET
on b.e to 500. No change.

Here is the size of the tables:

select count(oid) from b;
 3532161

select count(oid) from b where r=516081780 and c=513652057;
  3293

select count(oid) from d;
 117270


Regards,

Dirk

Re: Optimizer seems to be way off, why?

From
Richard Huxton
Date:
Dirk Lutzebäck wrote:
> Hi,
>
> I do not under stand the following explain output (pgsql 8.0.3):
>
> explain analyze
> select b.e from b, d
> where b.r=516081780 and b.c=513652057 and b.e=d.e;
>
>                                                         QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------
>
> Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual
> time=0.213..2926.845 rows=324503 loops=1)
>   ->  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1 width=4)
> (actual time=0.104..17.418 rows=3293 loops=1)
>         Index Cond: (r = 516081780::oid)
>         Filter: (c = 513652057::oid)
>   ->  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140
> width=4) (actual time=0.009..0.380 rows=99 loops=3293)
>         Index Cond: ("outer".e = d.e)
> Total runtime: 3638.783 ms
> (7 rows)
>
> Why is the rows estimate for b_index and the nested loop 1? It is
> actually 3293 and 324503.

I'm guessing (and that's all it is) that b.r and b.c have a higher
correlation than the planner is expecting. That is, it expects the
b.c=... to reduce the number of matching rows much more than it is.

Try a query just on WHERE b.r=516081780 and see if it gets the estimate
right for that.

If it's a common query, it might be worth an index on (r,c)

--
   Richard Huxton
   Archonet Ltd


Re: Optimizer seems to be way off, why?

From
Dirk Lutzebäck
Date:
Richard Huxton wrote:
> Dirk Lutzebäck wrote:
>
>> Hi,
>>
>> I do not under stand the following explain output (pgsql 8.0.3):
>>
>> explain analyze
>> select b.e from b, d
>> where b.r=516081780 and b.c=513652057 and b.e=d.e;
>>
>>                                                         QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------------
>>
>> Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual
>> time=0.213..2926.845 rows=324503 loops=1)
>>   ->  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1
>> width=4) (actual time=0.104..17.418 rows=3293 loops=1)
>>         Index Cond: (r = 516081780::oid)
>>         Filter: (c = 513652057::oid)
>>   ->  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140
>> width=4) (actual time=0.009..0.380 rows=99 loops=3293)
>>         Index Cond: ("outer".e = d.e)
>> Total runtime: 3638.783 ms
>> (7 rows)
>>
>> Why is the rows estimate for b_index and the nested loop 1? It is
>> actually 3293 and 324503.
>
>
> I'm guessing (and that's all it is) that b.r and b.c have a higher
> correlation than the planner is expecting. That is, it expects the
> b.c=... to reduce the number of matching rows much more than it is.
>
> Try a query just on WHERE b.r=516081780 and see if it gets the estimate
> right for that.
>
> If it's a common query, it might be worth an index on (r,c)
>
> --
>   Richard Huxton
>   Archonet Ltd
>

Thanks Richard, dropping the join for b.c now gives better estimates (it
also uses a different index now) although not accurate (off by factor
10). This query is embedded in a larger query which now got a 1000 times
speed up (!) because I can drop b.c because it is redundant.

Though, why can't the planner see this correlation? I think somebody
said the planner does not know about multiple column correlations, does it?

Regards,

Dirk


Re: Optimizer seems to be way off, why?

From
John A Meinel
Date:
Dirk Lutzebäck wrote:
> Richard Huxton wrote:
>
>> Dirk Lutzebäck wrote:
>>
>>> Hi,
>>>
>>> I do not under stand the following explain output (pgsql 8.0.3):
>>>
>>> explain analyze
>>> select b.e from b, d
>>> where b.r=516081780 and b.c=513652057 and b.e=d.e;
>>>
>>>                                                         QUERY PLAN
>>> ----------------------------------------------------------------------------------------------------------------
>>>
>>> Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual
>>> time=0.213..2926.845 rows=324503 loops=1)
>>>   ->  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1
>>> width=4) (actual time=0.104..17.418 rows=3293 loops=1)
>>>         Index Cond: (r = 516081780::oid)
>>>         Filter: (c = 513652057::oid)
>>>   ->  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140
>>> width=4) (actual time=0.009..0.380 rows=99 loops=3293)
>>>         Index Cond: ("outer".e = d.e)
>>> Total runtime: 3638.783 ms
>>> (7 rows)
>>>
>>> Why is the rows estimate for b_index and the nested loop 1? It is
>>> actually 3293 and 324503.
>>
>>
>>
>> I'm guessing (and that's all it is) that b.r and b.c have a higher
>> correlation than the planner is expecting. That is, it expects the
>> b.c=... to reduce the number of matching rows much more than it is.
>>
>> Try a query just on WHERE b.r=516081780 and see if it gets the
>> estimate right for that.
>>
>> If it's a common query, it might be worth an index on (r,c)
>>
>> --
>>   Richard Huxton
>>   Archonet Ltd
>>
>
> Thanks Richard, dropping the join for b.c now gives better estimates (it
> also uses a different index now) although not accurate (off by factor
> 10). This query is embedded in a larger query which now got a 1000 times
> speed up (!) because I can drop b.c because it is redundant.

Well, part of the problem is that the poorly estimated row is not 'b.e'
but 'b.r', it expects to only find one row that matches, and instead
finds 3293 rows.

Now, that *could* be because it mis-estimates the selectivity of b.r & b.c.

It actually estimated the join with d approximately correctly. (It
thought that for each row it would find 140, and it averaged 99).

>
> Though, why can't the planner see this correlation? I think somebody
> said the planner does not know about multiple column correlations, does it?

The planner does not maintain cross-column statistics, so you are
correct. I believe it assumes distributions are independent. So that if
r=RRRRR is 10% selective, and c=CCCC is 20% selective, the total
selectivity of r=RRRR AND c=CCCC is 2%. I could be wrong on this, but I
think it is approximately correct.

Now if you created the index on b(r,c), then it would have a much better
idea of how selective that would be. At the very least, it could index
on (r,c) rather than indexing on (r) and filtering by (c).

Also, if you have very skewed data (where you have 1 value 100k times,
and 50 values only 10times each), the planner can overestimate the low
values, and underestimate the high one. (It uses random sampling, so it
kind of depends where the entries are.)

Have you tried increasing the statistics on b.r and or b.c? Do you have
an index on b.c or just b.r?

To see what the planner thinks, you might try:

EXPLAIN ANALYZE
select count(*) from b where r=516081780;

That would tell you how selective the planner thinks the r= is.
>
> Regards,
>
> Dirk
>
John
=:->


Attachment

Re: Optimizer seems to be way off, why?

From
Greg Stark
Date:
John A Meinel <john@arbash-meinel.com> writes:

> Now if you created the index on b(r,c), then it would have a much better
> idea of how selective that would be. At the very least, it could index
> on (r,c) rather than indexing on (r) and filtering by (c).

There has been some discussion of adding functionality like this but afaik no
version of Postgres actually does this yet.

Adding the index may still help though.



--
greg