Re: Optimizer seems to be way off, why? - Mailing list pgsql-performance

From John A Meinel
Subject Re: Optimizer seems to be way off, why?
Date
Msg-id 42DEBACF.2020405@arbash-meinel.com
In response to Re: Optimizer seems to be way off, why?  (Dirk Lutzebäck <lutzeb@aeccom.com>)
List pgsql-performance
Dirk Lutzebäck wrote:
> Richard Huxton wrote:
>
>> Dirk Lutzebäck wrote:
>>
>>> Hi,
>>>
>>> I do not understand the following explain output (pgsql 8.0.3):
>>>
>>> explain analyze
>>> select b.e from b, d
>>> where b.r=516081780 and b.c=513652057 and b.e=d.e;
>>>
>>>                                                         QUERY PLAN
>>> ----------------------------------------------------------------------------------------------------------------
>>>
>>> Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual
>>> time=0.213..2926.845 rows=324503 loops=1)
>>>   ->  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1
>>> width=4) (actual time=0.104..17.418 rows=3293 loops=1)
>>>         Index Cond: (r = 516081780::oid)
>>>         Filter: (c = 513652057::oid)
>>>   ->  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140
>>> width=4) (actual time=0.009..0.380 rows=99 loops=3293)
>>>         Index Cond: ("outer".e = d.e)
>>> Total runtime: 3638.783 ms
>>> (7 rows)
>>>
>>> Why is the rows estimate for b_index and the nested loop 1? It is
>>> actually 3293 and 324503.
>>
>>
>>
>> I'm guessing (and that's all it is) that b.r and b.c have a higher
>> correlation than the planner is expecting. That is, it expects the
>> b.c=... condition to reduce the number of matching rows much more
>> than it actually does.
>>
>> Try a query just on WHERE b.r=516081780 and see if it gets the
>> estimate right for that.
>>
>> If it's a common query, it might be worth an index on (r,c)
>>
>> --
>>   Richard Huxton
>>   Archonet Ltd
>>
>
> Thanks Richard, dropping the condition on b.c now gives better
> estimates (it also uses a different index now), although they are
> still off by a factor of 10. This query is embedded in a larger query,
> which now got a 1000-fold speedup (!) because b.c is redundant and I
> can drop it.

Well, part of the problem is that the poorly estimated row count is not
on 'b.e' but on 'b.r': the planner expects to find only one matching
row, and instead finds 3293.

Now, that *could* be because it mis-estimates the selectivity of b.r & b.c.

It actually estimated the join with d approximately correctly. (It
thought it would find 140 rows of d per outer row, and it actually
averaged 99.)

>
> Though, why can't the planner see this correlation? I think somebody
> said the planner does not know about multiple column correlations, does it?

The planner does not maintain cross-column statistics, so you are
correct. I believe it assumes the distributions are independent, so
that if r=RRRR is 10% selective and c=CCCC is 20% selective, the total
selectivity of r=RRRR AND c=CCCC is 10% * 20% = 2%. I could be wrong on
this, but I think it is approximately correct.
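
If you want to see the per-column numbers the planner is working from,
pg_stats is the standard view (table and column names taken from your
query):

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'b' AND attname IN ('r', 'c');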

Now if you created an index on b(r,c), the planner would have a much
better idea of how selective that combination is. At the very least, it
could scan the index on (r,c) rather than scanning the index on (r) and
filtering by (c).
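
For example (the index name here is just an illustration):

CREATE INDEX b_r_c_idx ON b (r, c);  -- name is arbitrary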

Also, if you have very skewed data (say 1 value that occurs 100k times,
and 50 values that occur only 10 times each), the planner can
overestimate the frequency of the rare values and underestimate that of
the common one. (ANALYZE uses random sampling, so it somewhat depends
on where the entries fall in the sample.)

Have you tried increasing the statistics target on b.r and/or b.c? Do
you have an index on b.c, or just on b.r?
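
If not, something along these lines might be worth trying (100 is just
an example target; the default is 10):

ALTER TABLE b ALTER COLUMN r SET STATISTICS 100;  -- example target
ALTER TABLE b ALTER COLUMN c SET STATISTICS 100;
ANALYZE b;  -- re-gather stats with the new target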

To see what the planner thinks, you might try:

EXPLAIN ANALYZE
select count(*) from b where r=516081780;

That would tell you how selective the planner thinks the r= condition is.
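
For comparison, the same check on c alone and on both columns together
(values copied from your original query) would show how far off the
independence assumption is:

EXPLAIN ANALYZE
select count(*) from b where c=513652057;

EXPLAIN ANALYZE
select count(*) from b where r=516081780 and c=513652057;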
>
> Regards,
>
> Dirk
>
John
=:->

