Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Use extended statistics to estimate (Var op Var) clauses
Date
Msg-id 6ef82d17-2d37-21ff-b00e-320927851b0e@enterprisedb.com
In response to Re: Use extended statistics to estimate (Var op Var) clauses  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers

On 8/11/21 2:08 PM, Dean Rasheed wrote:
> On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> So with the statistics, the estimate gets a bit worse. The reason is
>> fairly simple - if you look at the two parts of the OR clause, we get this:
>>
>>       clause                   actual        no stats    with stats
>>      ---------------------------------------------------------------
>>       (A < B and A <> A)            0          331667             1
>>       not (A < A)             1000000          333333        333333
>>
>> This shows that the first clause's estimate is clearly improved, while
>> (A < A) is estimated the same way as before - the clause references a
>> single Var, so it's considered "simple" and we ignore the MCV
>> selectivity, using just the simple_sel calculated by
>> clause_selectivity_ext.
>>
>> And the 333333 and 331667 just happen to be closer to the actual row
>> count. But that's mostly by luck, clearly.
>>
>> But now that I think about it, maybe the problem really is in how
>> statext_mcv_clauselist_selectivity treats this clause - the definition
>> of "simple" clauses as "has one attnum" was appropriate when only
>> clauses (Var op Const) were supported. But with (Var op Var) that's
>> probably not correct anymore.
>>
> 
> Hmm, interesting. Clearly the fact that the combined estimate without
> extended stats was better was just luck, based on its large
> overestimate of the first clause. But it's also true that a (Var op
> Var) clause should not be treated as simple, because "simple" in this
> context is meant to be for clauses that are likely to be better
> estimated with regular stats, whereas in this case, extended stats
> would almost certainly do better on the second clause.

I don't see why extended stats would do better on the second clause. I 
mean, if you have (A < A) then extended stats pretty much "collapse" 
into per-column stats - we could get almost the same estimate from a 
single-column MCV list, etc. The reason that does not happen is that we 
just treat it as a range clause and assign it the default 33% estimate.

But we could make that a bit smarter, and assign better estimates to 
those clauses:

   (A < A) => 0.0
   (A = A) => 1.0
   (A <= A) => 1.0

And that'd give us the same estimates, I think. Not sure that's worth 
it, because (A op A) clauses are probably very rare, OTOH it's cheap.

> 
> Perhaps the easiest way to identify simple clauses would be in
> statext_is_compatible_clause(), rather than the way it's done now,
> because it has the relevant information at hand, so it could be made
> to return an extra flag.
> 

Agreed, that seems like a better place to fix this.

> This feels like rather an artificial example though. Is there any real
> use for this sort of clause?
> 

True. It seems a bit artificial, which is understandable as it came from 
a synthetic test generating all possible clauses. OTOH, fixing it seems 
fairly cheap ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


