Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Msg-id: 9ef36aff-ca24-92b7-9f03-aa7ecfab36ee@enterprisedb.com
In response to: Re: Use extended statistics to estimate (Var op Var) clauses (Mark Dilger <mark.dilger@enterprisedb.com>)
List: pgsql-hackers
On 6/14/21 5:36 PM, Mark Dilger wrote:
> 
> 
>> On Jun 13, 2021, at 1:28 PM, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>> 
>> Here is a slightly updated version of the patch
> 
> Thanks for taking this up again!
> 
> Applying the new test cases from your patch, multiple estimates have
> gotten better.  That looks good.  I wrote a few extra test cases and
> saw no change, which is fine.  I was looking for regressions where
> the estimates are now worse than before.  Do you expect there to be
> any such cases?
> 

Not really. These clauses could not be estimated before; we generally
used default estimates for them. So

  WHERE a = b

would use 0.5%, while

  WHERE a < b

would use 33%, and so on. OTOH it depends on the accuracy of the
extended statistics - particularly the MCV list (what fraction of the
data it covers, etc.).
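
Roughly speaking, those defaults are the hardcoded fallbacks in
selfuncs.h (DEFAULT_EQ_SEL = 0.005, DEFAULT_INEQ_SEL = 1/3), so for the
1M-row table constructed below, with no usable statistics, you'd expect
something like:

    explain select * from t where a = b;   -- rows=5000 (0.5%)
    explain select * from t where a < b;   -- rows=333333 (33%)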

So it's possible that the default estimate is very accurate by chance,
while the MCV list represents only a tiny fraction of the data. Then
the new estimate could be worse. Consider for example this:

create table t (a int, b int);
insert into t select 100, 100 from generate_series(1,5000) s(i);
insert into t select i, i+1 from generate_series(1,995000) s(i);
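
A quick way to verify the fraction of (a = b) rows:

    select count(*) filter (where a = b) / count(*)::numeric from t;
    -- 0.00500...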

So exactly 0.5% of the rows have (a = b). Without extended stats the
estimate is perfect:

    explain analyze select * from t where a = b;

    Seq Scan on t  (cost=0.00..16925.00 rows=5000 width=8)
                   (actual time=0.064..159.928 rows=5000 loops=1)

while with statistics it gets worse:

    create statistics s (mcv) on a, b from t;
    analyze t;

    Seq Scan on t  (cost=0.00..16925.00 rows=9810 width=8)
                   (actual time=0.059..160.467 rows=5000 loops=1)
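
To see what the MCV list actually captured here, you can inspect it
with pg_mcv_list_items() (this is the query pattern from the docs; the
frequency vs. base_frequency columns show how much of the data each
item covers):

    select m.* from pg_statistic_ext
      join pg_statistic_ext_data on (oid = stxoid),
      pg_mcv_list_items(stxdmcv) m
    where stxname = 's';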

It's not terrible, although we could construct worse examples. But the
same issue applies to other clauses, not just to these new ones. And it
relies on the regular estimation producing a better estimate by chance.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


