Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Use extended statistics to estimate (Var op Var) clauses
Date
Msg-id 5e97a976-c8a1-f420-4088-58df178505dc@enterprisedb.com
Whole thread Raw
In response to Re: Use extended statistics to estimate (Var op Var) clauses  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: Use extended statistics to estimate (Var op Var) clauses  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On 8/9/21 9:19 PM, Mark Dilger wrote:
> 
> 
>> On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Tomas Vondra
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>> <0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>
> 
> Hi Tomas,
> 
> I tested this patch against master looking for types of clauses that uniformly get worse with the patch applied.  I
foundsome.
 
> 
> The tests are too large to attach, but the scripts that generate them are not.  To perform the tests:
> 
>     git checkout master
>     perl ./gentest.pl > src/test/regress/sql/gentest.sql
>     cat /dev/null > src/test/regress/expected/gentest.out
>     echo "test: gentest" >> src/test/regress/parallel_schedule
>     ./configure && make && make check
>     cp src/test/regress/results/gentest.out src/test/regress/expected/gentest.out
>     patch -p 1 < 0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
>     make check
>     cat src/test/regress/regression.diffs | perl ./check.pl
> 
> This shows patterns of conditions that get worse, such as:
> 
> better:0, worse:80:  A < B and A <> A or not A < A
> better:0, worse:80:  A < B and not A <= A or A <= A
> better:0, worse:80:  A < B or A = A
> better:0, worse:80:  A < B or A = A or not A >= A
> better:0, worse:80:  A < B or A >= A
> better:0, worse:80:  A < B or A >= A and not A <> A
> better:0, worse:80:  A < B or not A < A
> better:0, worse:80:  A < B or not A <> A
> better:0, worse:80:  A < B or not A <> A or A <= A
> better:0, worse:80:  A < B or not A >= A or not A < A
> 
> It seems things get worse when the conditions contain a column compared against itself.  I suspect that is being
handledincorrectly.
 
>

Thanks for this testing!

I took a quick look, and I think this is mostly due to luck in how the 
(default) range estimates combine without and with extended statistics. 
Consider for example this simple example:

create table t (a int, b int);

insert into t select mod(i,10), mod(i,20)
                 from generate_series(1,1000000) s(i);

Without stats, the first clauses example is estimated like this:

     explain (timing off, analyze) select * from t
       where (A < B and A <> A) or not A < A;

                             QUERY PLAN
     ----------------------------------------------------------
      Seq Scan on t  (cost=0.00..21925.00 rows=554444 width=8)
                     (actual rows=1000000 loops=1)
        Filter: (((a < b) AND (a <> a)) OR (a >= a))
      Planning Time: 0.054 ms
      Execution Time: 80.485 ms
     (4 rows)

and with MCV on (a,b) it gets estimates like this:

                              QUERY PLAN 

     ----------------------------------------------------------
      Seq Scan on t  (cost=0.00..21925.00 rows=333333 width=8)
                     (actual rows=1000000 loops=1)
        Filter: (((a < b) AND (a <> a)) OR (a >= a))
      Planning Time: 0.152 ms
      Execution Time: 79.917 ms
     (4 rows)

So with the statistics, the estimate gets a bit worse. The reason is 
fairly simple - if you look at the two parts of the OR clause, we get this:

     clause                   actual        no stats    with stats
    ---------------------------------------------------------------
     (A < B and A <> A)            0          331667             1
     not (A < A)             1000000          333333        333333

This clearly shows that the first clause is clearly improved, while the 
(A < A) is estimated the same way, because the clause has a single Var 
so it's considered to be "simple" so we ignore the MCV selectivity and 
just use the simple_sel calculated by clause_selectivity_ext.

And the 333333 and 331667 just happen to be closer to the actual row 
count. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in how 
statext_mcv_clauselist_selectivity treats this clause - the definition 
of "simple" clauses as "has one attnum" was appropriate when only 
clauses (Var op Const) were supported. But with (Var op Var) that's 
probably not correct anymore.

And indeed, commenting out the if condition on line 1933 (so ignoring 
simple_sel) and that does improve the estimates for this query. But 
perhaps I'm missing something, this needs more thought.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Next
From: Thomas Munro
Date:
Subject: Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?