Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Use extended statistics to estimate (Var op Var) clauses |
Date | |
Msg-id | 5e97a976-c8a1-f420-4088-58df178505dc@enterprisedb.com Whole thread Raw |
In response to | Re: Use extended statistics to estimate (Var op Var) clauses (Mark Dilger <mark.dilger@enterprisedb.com>) |
Responses |
Re: Use extended statistics to estimate (Var op Var) clauses
|
List | pgsql-hackers |
On 8/9/21 9:19 PM, Mark Dilger wrote: > > >> On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: >> >> Tomas Vondra >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> <0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch> > > Hi Tomas, > > I tested this patch against master looking for types of clauses that uniformly get worse with the patch applied. I foundsome. > > The tests are too large to attach, but the scripts that generate them are not. To perform the tests: > > git checkout master > perl ./gentest.pl > src/test/regress/sql/gentest.sql > cat /dev/null > src/test/regress/expected/gentest.out > echo "test: gentest" >> src/test/regress/parallel_schedule > ./configure && make && make check > cp src/test/regress/results/gentest.out src/test/regress/expected/gentest.out > patch -p 1 < 0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch > make check > cat src/test/regress/regression.diffs | perl ./check.pl > > This shows patterns of conditions that get worse, such as: > > better:0, worse:80: A < B and A <> A or not A < A > better:0, worse:80: A < B and not A <= A or A <= A > better:0, worse:80: A < B or A = A > better:0, worse:80: A < B or A = A or not A >= A > better:0, worse:80: A < B or A >= A > better:0, worse:80: A < B or A >= A and not A <> A > better:0, worse:80: A < B or not A < A > better:0, worse:80: A < B or not A <> A > better:0, worse:80: A < B or not A <> A or A <= A > better:0, worse:80: A < B or not A >= A or not A < A > > It seems things get worse when the conditions contain a column compared against itself. I suspect that is being handledincorrectly. > Thanks for this testing! I took a quick look, and I think this is mostly due to luck in how the (default) range estimates combine without and with extended statistics. Consider for example this simple example: create table t (a int, b int); insert into t select mod(i,10), mod(i,20) from generate_series(1,1000000) s(i); Without stats, the first clauses example is estimated like this: explain (timing off, analyze) select * from t where (A < B and A <> A) or not A < A; QUERY PLAN ---------------------------------------------------------- Seq Scan on t (cost=0.00..21925.00 rows=554444 width=8) (actual rows=1000000 loops=1) Filter: (((a < b) AND (a <> a)) OR (a >= a)) Planning Time: 0.054 ms Execution Time: 80.485 ms (4 rows) and with MCV on (a,b) it gets estimates like this: QUERY PLAN ---------------------------------------------------------- Seq Scan on t (cost=0.00..21925.00 rows=333333 width=8) (actual rows=1000000 loops=1) Filter: (((a < b) AND (a <> a)) OR (a >= a)) Planning Time: 0.152 ms Execution Time: 79.917 ms (4 rows) So with the statistics, the estimate gets a bit worse. The reason is fairly simple - if you look at the two parts of the OR clause, we get this: clause actual no stats with stats --------------------------------------------------------------- (A < B and A <> A) 0 331667 1 not (A < A) 1000000 333333 333333 This clearly shows that the first clause is clearly improved, while the (A < A) is estimated the same way, because the clause has a single Var so it's considered to be "simple" so we ignore the MCV selectivity and just use the simple_sel calculated by clause_selectivity_ext. And the 333333 and 331667 just happen to be closer to the actual row count. But that's mostly by luck, clearly. But now that I think about it, maybe the problem really is in how statext_mcv_clauselist_selectivity treats this clause - the definition of "simple" clauses as "has one attnum" was appropriate when only clauses (Var op Const) were supported. But with (Var op Var) that's probably not correct anymore. And indeed, commenting out the if condition on line 1933 (so ignoring simple_sel) and that does improve the estimates for this query. But perhaps I'm missing something, this needs more thought. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: