Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 3fbbd68b-0b52-cb5a-d468-7e7280c33e6c@enterprisedb.com
Whole thread Raw
In response to PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> 4) apply the statistics
> 
>    This is the hard part, really, and the exact state of the support
>    depends on type of statistics.
> 
>    For ndistinct coefficients, it generally works. I'm sure there may be
>    bugs in estimate_num_groups, etc. but in principle it works.
> 
>    For MCV lists, it generally works too - you can define statistics on
>    the expressions and the estimates should improve. The main downside
>    here is that it requires at least two expressions, otherwise we can't
>    build/apply the extended statistics. So for example
> 
>       SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
> 
>    may be estimated "correctly", once you drop any of the conditions it
>    gets much worse as we don't have stats for individual expressions.
>    That's rather annoying - it does not break the extended MCV, but the
>    behavior will certainly cause confusion.
> 
>    For functional dependencies, the estimation does not work yet. Also,
>    the missing per-column statistics have bigger impact than on MCV,
>    because while MCV can work fine without it, the dependencies heavily
>    rely on the per-column estimates. We only apply "corrections" based
>    on the dependency degree, so we still need (good) per-column
>    estimates, which does not quite work with the expressions.
> 
> 
>    Of course, the lack of per-expression statistics may be somewhat
>    fixed by adding indexes on expressions, but that's kinda expensive.
> 

FWIW after re-reading [1], I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

   CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.

regards


[1]
https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Add important info about ANALYZE after create Functional Index
Next
From: Bharath Rupireddy
Date:
Subject: Re: Multi Inserts in CREATE TABLE AS - revived patch