
From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id ccd0e47a-2326-ffdd-657b-95b79f41db50@enterprisedb.com
In response to Re: PoC/WIP: Extended statistics on expressions  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On 3/24/21 7:24 AM, Justin Pryzby wrote:
> Most importantly, it looks like this forgets to update catalog documentation
> for stxexprs and stxkind='e'
> 

Good catch.

> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
>> Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  
> 

OK "statistics object" seems better and more consistent.

>> +       Name of schema containing table
> 
> I don't know about the nearby descriptions, but this one sounds too much like a
> "schema-containing" table.  Say "Name of the schema which contains the table" ?
> 

I think the current spelling is OK / consistent with the other catalogs.

>> +       Name of table
> 
> Say "name of table on which the extended statistics are defined"
> 

I've used "Name of table the statistics object is defined on".

>> +       Name of extended statistics
> 
> "Name of the extended statistic object"
> 
>> +       Owner of the extended statistics
> 
> ..object
> 

OK

>> +       Expression the extended statistics is defined on
> 
> I think it should say "the extended statistic", or "the extended statistics
> object".  Maybe "..on which the extended statistic is defined"
> 

OK

>> +       of random access to the disk.  (This expression is null if the expression
>> +       data type does not have a <literal><</literal> operator.)
> 
> expression's data type
> 

OK

>> +   much-too-small row count estimate in the first two queries. Moreover, the
> 
> maybe say "dramatically underestimates the rowcount"
> 

I've changed this to "... results in a significant underestimate of row
count".

>> +   planner has no information about relationship between the expressions, so it
> 
> the relationship
> 

OK

>> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
>> +   conditions are independent, and multiplies their selectivities together to
>> +   arrive at a much-too-high group count estimate in the aggregate query.
> 
> severe overestimate ?
> 

OK

>> +   This is further exacerbated by the lack of accurate statistics for the
>> +   expressions, forcing the planner to use default ndistinct estimate for the
> 
> use *a default
> 

OK

>> +   expression derived from ndistinct for the column. With such statistics, the
>> +   planner recognizes that the conditions are correlated and arrives at much
>> +   more accurate estimates.
> 
> are correlated comma
> 

OK

>> +            if (type->lt_opr == InvalidOid)
> 
> These could be !OidIsValid
> 

Maybe, but it's like this already. I'll leave this alone and then
fix/backpatch separately.
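
For anyone skimming: OidIsValid() is just a macro around the comparison with
InvalidOid, so the two spellings test the same thing. A standalone sketch,
with the typedef and macros copied in only so it compiles outside the tree:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;       /* as in postgres_ext.h */
#define InvalidOid  ((Oid) 0)
#define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))

int
main(void)
{
    Oid     lt_opr = InvalidOid;    /* e.g. a type without a "<" operator */

    if (lt_opr == InvalidOid)       /* spelling currently in the patch */
        printf("no < operator (== InvalidOid)\n");

    if (!OidIsValid(lt_opr))        /* spelling suggested in the review */
        printf("no < operator (!OidIsValid)\n");

    return 0;
}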

>> +     * expressions. It's either expensive or very easy to defeat for
>> +     * determined used, and there's no risk if we allow such statistics (the
>> +     * statistics is useless, but harmless).
> 
> I think it's meant to say "for a determined user" ?
> 

Right.

>> +     * If there are no simply-referenced columns, give the statistics an auto
>> +     * dependency on the whole table.  In most cases, this will be redundant,
>> +     * but it might not be if the statistics expressions contain no Vars
>> +     * (which might seem strange but possible).
>> +     */
>> +    if (!nattnums)
>> +    {
>> +        ObjectAddressSet(parentobject, RelationRelationId, relid);
>> +        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
>> +    }
> 
> Can this be unconditional ?
> 

What would be the benefit? This behavior is copied from index_create, so
I'd prefer keeping it the same for consistency reasons. Presumably it's
like that for some reason (a bit of cargo cult programming, I know).
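
To spell out the pattern being copied (a sketch of the index_create-style
logic; it only compiles inside the backend tree, and the function name is
made up for this illustration): each simply-referenced column already gets
its own AUTO dependency, so the whole-table dependency is only a fallback
for Var-free expressions.

#include "postgres.h"

#include "access/attnum.h"
#include "catalog/dependency.h"
#include "catalog/objectaddress.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic_ext.h"

static void
record_stats_dependencies(Oid statoid, Oid relid,
                          const AttrNumber *attnums, int nattnums)
{
    ObjectAddress myself;
    ObjectAddress parentobject;

    ObjectAddressSet(myself, StatisticExtRelationId, statoid);

    /* one AUTO dependency per simply-referenced column */
    for (int i = 0; i < nattnums; i++)
    {
        ObjectAddressSubSet(parentobject, RelationRelationId,
                            relid, attnums[i]);
        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
    }

    /*
     * Fallback only: with no column references the loop above recorded
     * nothing, so depend on the table itself to make DROP TABLE remove
     * the statistics object.  Doing this unconditionally would mostly
     * just add a redundant pg_depend row.
     */
    if (nattnums == 0)
    {
        ObjectAddressSet(parentobject, RelationRelationId, relid);
        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
    }
}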

>> +     * Translate the array of indexs to regular attnums for the dependency (we
> 
> sp: indexes
> 

OK

>> +                     * Not found a matching expression, so we can simply skip
> 
> Found no matching expr
> 

OK

>> +                /* if found a matching, */
> 
> matching ..
> 

Matching dependency.

>> +examine_attribute(Node *expr)
> 
> Maybe you should rename this to something distinct ?  So it's easy to add a
> breakpoint there, for example.
> 

What would be a better name? It's not difficult to add a breakpoint
using a line number, for example.

>> +    stats->anl_context = CurrentMemoryContext;    /* XXX should be using
>> +                                                 * something else? */
> 
>> +        bool        nulls[Natts_pg_statistic];
> ...
>> +         * Construct a new pg_statistic tuple
>> +         */
>> +        for (i = 0; i < Natts_pg_statistic; ++i)
>> +        {
>> +            nulls[i] = false;
>> +        }
> 
> Shouldn't you just write nulls[Natts_pg_statistic] = {false};
> or at least: memset(nulls, 0, sizeof(nulls));
> 

Maybe, but it's a copy of what update_attstats() does, so I prefer
keeping it the same.
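
For the record, all three spellings end up with the same array contents.
A standalone illustration (Natts_pg_statistic is stubbed here; real code
takes it from the generated catalog headers):

#include <stdbool.h>
#include <string.h>

#define Natts_pg_statistic 31   /* stub for the illustration only */

int
main(void)
{
    /* 1. explicit loop, as in update_attstats() and the patch */
    bool        nulls_loop[Natts_pg_statistic];

    for (int i = 0; i < Natts_pg_statistic; i++)
        nulls_loop[i] = false;

    /* 2. aggregate initializer: remaining elements are zero-initialized */
    bool        nulls_init[Natts_pg_statistic] = {false};

    /* 3. memset, as suggested in the review */
    bool        nulls_memset[Natts_pg_statistic];

    memset(nulls_memset, 0, sizeof(nulls_memset));

    /* exit status 0 means all three arrays are identical */
    return memcmp(nulls_loop, nulls_memset, sizeof(nulls_loop)) ||
           memcmp(nulls_init, nulls_memset, sizeof(nulls_init));
}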

>> +                 * We don't store collations used to build the statistics, but
>> +                 * we can use the collation for the attribute itself, as
>> +                 * stored in varcollid. We do reset the statistics after a
>> +                 * type change (including collation change), so this is OK. We
>> +                 * may need to relax this after allowing extended statistics
>> +                 * on expressions.
> 
> This text should be updated or removed ?
> 

Yeah, the last sentence is obsolete. Updated.

>> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>>          }
>>  
>>          /* print any extended statistics */
>> -        if (pset.sversion >= 100000)
>> +        if (pset.sversion >= 140000)
>> +        {
>> +            printfPQExpBuffer(&buf,
>> +                              "SELECT oid, "
>> +                              "stxrelid::pg_catalog.regclass, "
>> +                              "stxnamespace::pg_catalog.regnamespace AS nsp, "
>> +                              "stxname,\n"
>> +                              "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
>> +                              "  'd' = any(stxkind) AS ndist_enabled,\n"
>> +                              "  'f' = any(stxkind) AS deps_enabled,\n"
>> +                              "  'm' = any(stxkind) AS mcv_enabled,\n");
>> +
>> +            if (pset.sversion >= 130000)
>> +                appendPQExpBufferStr(&buf, "  stxstattarget\n");
>> +            else
>> +                appendPQExpBufferStr(&buf, "  -1 AS stxstattarget\n");
> 
>  >= 130000 is fully determined by >= 14000 :)
> 

Ah, right.
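
So the hunk can be reduced to something like this (a sketch against the
quoted code, not the final patch):

if (pset.sversion >= 140000)
{
    printfPQExpBuffer(&buf,
                      "SELECT oid, "
                      "stxrelid::pg_catalog.regclass, "
                      "stxnamespace::pg_catalog.regnamespace AS nsp, "
                      "stxname,\n"
                      "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
                      "  'd' = any(stxkind) AS ndist_enabled,\n"
                      "  'f' = any(stxkind) AS deps_enabled,\n"
                      "  'm' = any(stxkind) AS mcv_enabled,\n");

    /* sversion >= 140000 already implies >= 130000, so no fallback needed */
    appendPQExpBufferStr(&buf, "  stxstattarget\n");
}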

>> +     * type of the opclass, which is not interesting for our purposes.  (Note:
>> +     * if we did anything with non-expression index columns, we'd need to
> 
> index is wrong ?
> 

Fixed

> I mentioned a bunch of other references to "index" and "predicate" which are
> still around:
> 

Whooops, sorry. Fixed.


I'll post a cleaned-up version of the patch addressing Dean's review
comments too.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


