Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 20210324062446.GA28335@telsasoft.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: PoC/WIP: Extended statistics on expressions  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
Most importantly, it looks like this forgets to update catalog documentation
for stxexprs and stxkind='e'

It seems like you're preferring to use pluralized "statistics" in a lot of
places that sound wrong to me.  For example:
> Currently the first statistics wins, which seems silly.
I can write more separately, but I think this is resolved and clarified if you
write "statistics object" and not just "statistics".  

> +       Name of schema containing table

I don't know about the nearby descriptions, but this one sounds too much like a
"schema-containing" table.  Say "Name of the schema which contains the table" ?

> +       Name of table

Say "name of table on which the extended statistics are defined"

> +       Name of extended statistics

"Name of the extended statistic object"

> +       Owner of the extended statistics

..object

> +       Expression the extended statistics is defined on

I think it should say "the extended statistic", or "the extended statistics
object".  Maybe "..on which the extended statistic is defined"

> +       of random access to the disk.  (This expression is null if the expression
> +       data type does not have a <literal><</literal> operator.)

expression's data type

> +   much-too-small row count estimate in the first two queries. Moreover, the

maybe say "dramatically underestimates the rowcount"

> +   planner has no information about relationship between the expressions, so it

the relationship

> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
> +   conditions are independent, and multiplies their selectivities together to
> +   arrive at a much-too-high group count estimate in the aggregate query.

severe overestimate ?

> +   This is further exacerbated by the lack of accurate statistics for the
> +   expressions, forcing the planner to use default ndistinct estimate for the

use *a default

> +   expression derived from ndistinct for the column. With such statistics, the
> +   planner recognizes that the conditions are correlated and arrives at much
> +   more accurate estimates.

are correlated comma

> +            if (type->lt_opr == InvalidOid)

These could be !OidIsValid

> +     * expressions. It's either expensive or very easy to defeat for
> +     * determined used, and there's no risk if we allow such statistics (the
> +     * statistics is useless, but harmless).

I think it's meant to say "for a determined user" ?

> +     * If there are no simply-referenced columns, give the statistics an auto
> +     * dependency on the whole table.  In most cases, this will be redundant,
> +     * but it might not be if the statistics expressions contain no Vars
> +     * (which might seem strange but possible).
> +     */
> +    if (!nattnums)
> +    {
> +        ObjectAddressSet(parentobject, RelationRelationId, relid);
> +        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
> +    }

Can this be unconditional ?

> +     * Translate the array of indexs to regular attnums for the dependency (we

sp: indexes

> +                     * Not found a matching expression, so we can simply skip

Found no matching expr

> +                /* if found a matching, */

matching ..

> +examine_attribute(Node *expr)

Maybe you should rename this to something distinct ?  So it's easy to add a
breakpoint there, for example.

> +    stats->anl_context = CurrentMemoryContext;    /* XXX should be using
> +                                                 * something else? */

> +        bool        nulls[Natts_pg_statistic];
...
> +         * Construct a new pg_statistic tuple
> +         */
> +        for (i = 0; i < Natts_pg_statistic; ++i)
> +        {
> +            nulls[i] = false;
> +        }

Shouldn't you just write nulls[Natts_pg_statistic] = {false};
or at least: memset(nulls, 0, sizeof(nulls));

> +                 * We don't store collations used to build the statistics, but
> +                 * we can use the collation for the attribute itself, as
> +                 * stored in varcollid. We do reset the statistics after a
> +                 * type change (including collation change), so this is OK. We
> +                 * may need to relax this after allowing extended statistics
> +                 * on expressions.

This text should be updated or removed ?

> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>          }
>  
>          /* print any extended statistics */
> -        if (pset.sversion >= 100000)
> +        if (pset.sversion >= 140000)
> +        {
> +            printfPQExpBuffer(&buf,
> +                              "SELECT oid, "
> +                              "stxrelid::pg_catalog.regclass, "
> +                              "stxnamespace::pg_catalog.regnamespace AS nsp, "
> +                              "stxname,\n"
> +                              "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
> +                              "  'd' = any(stxkind) AS ndist_enabled,\n"
> +                              "  'f' = any(stxkind) AS deps_enabled,\n"
> +                              "  'm' = any(stxkind) AS mcv_enabled,\n");
> +
> +            if (pset.sversion >= 130000)
> +                appendPQExpBufferStr(&buf, "  stxstattarget\n");
> +            else
> +                appendPQExpBufferStr(&buf, "  -1 AS stxstattarget\n");

 >= 130000 is fully determined by >= 14000 :)

> +     * type of the opclass, which is not interesting for our purposes.  (Note:
> +     * if we did anything with non-expression index columns, we'd need to

index is wrong ?

I mentioned a bunch of other references to "index" and "predicate" which are
still around:

On Thu, Jan 07, 2021 at 08:35:37PM -0600, Justin Pryzby wrote:
> There's some remaining copy/paste stuff from index expressions:
> 
> errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> left behind by evaluating the predicate or index expressions.
> Set up for predicate or expression evaluation
> Need an EState for evaluation of index expressions and
> partial-index predicates.  Create it in the per-index context to be
> Fetch function for analyzing index expressions.



pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: pg_amcheck contrib application
Next
From: "Drouvot, Bertrand"
Date:
Subject: Re: Minimal logical decoding on standbys