Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Msg-id: 6ef82d17-2d37-21ff-b00e-320927851b0e@enterprisedb.com
In response to: Re: Use extended statistics to estimate (Var op Var) clauses (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers
On 8/11/21 2:08 PM, Dean Rasheed wrote:
> On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> So with the statistics, the estimate gets a bit worse. The reason is
>> fairly simple - if you look at the two parts of the OR clause, we get this:
>>
>>     clause                   actual    no stats    with stats
>>     ---------------------------------------------------------
>>     (A < B and A <> A)            0      331667             1
>>     not (A < A)             1000000      333333        333333
>>
>> This clearly shows that the first clause is improved, while (A < A)
>> is estimated the same way as before, because the clause has a single
>> Var, so it's considered "simple" and we ignore the MCV selectivity
>> and just use the simple_sel calculated by clause_selectivity_ext.
>>
>> And the 333333 and 331667 just happen to be closer to the actual row
>> count. But that's mostly by luck, clearly.
>>
>> But now that I think about it, maybe the problem really is in how
>> statext_mcv_clauselist_selectivity treats this clause - the definition
>> of "simple" clauses as "has one attnum" was appropriate when only
>> (Var op Const) clauses were supported. But with (Var op Var) that's
>> probably not correct anymore.
>>
>
> Hmm, interesting. Clearly the fact that the combined estimate without
> extended stats was better was just luck, based on its large
> overestimate of the first clause. But it's also true that a (Var op
> Var) clause should not be treated as simple, because "simple" in this
> context is meant to be for clauses that are likely to be better
> estimated with regular stats, whereas in this case, extended stats
> would almost certainly do better on the second clause.

I don't see why extended stats would do better on the second clause. I
mean, if you have (A < A) then extended stats pretty much "collapse"
into per-column stats - we could get almost the same estimate from a
single-column MCV list, etc.
The reason why that does not happen is that we just treat it as a range
clause and assign it a default 33% estimate. But we could make that a
bit smarter, and assign better estimates to those clauses:

    (A < A)  => 0.0
    (A = A)  => 1.0
    (A <= A) => 1.0

And that'd give us the same estimates, I think. Not sure that's worth
it, because (A op A) clauses are probably very rare, OTOH it's cheap.

> Perhaps the easiest way to identify simple clauses would be in
> statext_is_compatible_clause(), rather than the way it's done now,
> because it has the relevant information at hand, so it could be made
> to return an extra flag.
>

Agreed, that seems like a better place to fix this.

> This feels like rather an artificial example though. Is there any real
> use for this sort of clause?
>

True. It seems a bit artificial, which is understandable as it came from
a synthetic test generating all possible clauses. OTOH, fixing it seems
fairly cheap ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company