Re: Understanding bad estimate (related to FKs?) - Mailing list pgsql-performance

From Philip Semanchuk
Subject Re: Understanding bad estimate (related to FKs?)
Date
Msg-id 44E8D04E-A362-409B-AD2B-1D5860D01B61@americanefficient.com
Whole thread Raw
In response to Re: Understanding bad estimate (related to FKs?)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Understanding bad estimate (related to FKs?)
List pgsql-performance

> On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
>>
>>
>>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby@telsasoft.com>
>>> wrote:
>>>
>>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
>>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
>>>> <philip@americanefficient.com> wrote:
>>>>
>>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows,
>>>>>>> actual
>>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in
>>>>> this node.
>>>>>>>
>>>>>>> The first is Postgres' estimate. The condition for this index
>>>>>>> scan
>>>>> contains three expressions --
>>>>>>>
>>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
>>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
>>>>>
>>>>
>>>> Are the columns correlated?
>>>
>>> I guess it shouldn't matter, since the FKs should remove all but one
>>> of the conditions.
>>
>> Yes, I had the same expectation. I thought Postgres would calculate the
>> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
>> of ‘mike’, but since the frequency estimate is very accurate but the
>> planner’s estimate is not, there’s something else going on.
>>
>
> Well, this is quite a bit more complicated, I'm afraid :-( The clauses
> include parameters passed from the nodes above the index scan. So even
> if we had extended stats on the table, we couldn't use them as that
> requires (Var op Const) conditions. So this likely ends up with a
> product of estimates for each clause, and even then we can't use any
> particular value so we probably end up with something like 1/ndistinct
> or something like that. So if the values actually passed to the index
> scan are more common and/or if the columns are somehow correlated, it's
> not surprising we end up with an overestimate.

I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, five_uniform’s ndistinct = 26326, and
whiskey_mike’sndistinct = 3. The estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in the source
table,so we have:  

25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209

Hence the estimate of 7 rows returned.

It's interesting that five_uniform’s estimated ndistinct is low by > 50% (actual = 62958). Paradoxically, if I manually
setndistinct to the correct value of 62958, the estimate gets worse (3 rows instead of 7).  

Suggestions for fixing this are of course welcome. :-)

On a related topic, are there any in depth guides to the planner that I could read? I can (and have) read the source
codeand it’s been informative, but something higher level than the source code would help. 

Thanks so much
Philip









pgsql-performance by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Understanding bad estimate (related to FKs?)
Next
From: Guillaume Lelarge
Date:
Subject: Re: Understanding bad estimate (related to FKs?)