Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: multivariate statistics (v19)
Date
Msg-id f9a642d7-3ff4-63d1-4fb2-9a260056d3bc@2ndquadrant.com
Whole thread Raw
In response to Re: multivariate statistics (v19)  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: multivariate statistics (v19)  (Petr Jelinek <petr@2ndquadrant.com>)
Re: multivariate statistics (v19)  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clauserel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.
>

I think there are two separate questions:

(a) Whether the query plan is "enriched" with information about 
statistics, or whether this information is ephemeral and available only 
in clausesel.c.

(b) Where exactly this enrichment happens.

Theoretically we might enrich the query plan (add nodes with info about 
the statistics), so that EXPLAIN gets the info, and it might still 
happen in clausesel.c.

>> 2) combining multiple statistics
>>
>> I think the ability to combine multivariate statistics (covering different
>> subsets of conditions) is important and useful, but I'm starting to think
>> that the current implementation may not be the correct one (which is why I
>> haven't written the SGML docs about this part of the patch series yet).
>>
>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>> estimating query:
>>
>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>
>> but that we only have two statistics (a,b) and (b,c). The current patch does
>> about this:
>>
>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>
>> i.e. it estimates the first two conditions using (a,b), and then estimates
>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
>> but it only works as long as the query contains conditions "connecting" the
>> two statistics. So if we remove the "b=2" condition from the query, this
>> stops working.
>
> This is trying to make the algorithm smarter than the user, which is
> something I'd think we could live without. In this case statistics on
> (a,c) or (a,b,c) are missing. And what if the user does not want to
> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>

I don't think so. Obviously, if you have statistics covering all the 
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the 
statistics and accuracy of the statistics. Let's say you have statistics 
on 8 columns, and you split each dimension twice to build a histogram - 
that's 256 buckets right there, and we only get ~50% selectivity in each 
dimension (the actual histogram building algorithm is more complex, but 
you get the idea).

I see this as probably the most interesting part of the patch, and quite 
useful. But we'll definitely get the single-statistics estimate first, 
no doubt about that.

> Patch 0001: there have been comments about that before, and you have
> put the checks on RestrictInfo in a couple of variables of
> pull_varnos_walker, so nothing to say from here.
>

I don't follow. Are you suggesting 0001 is a reasonable fix, or that 
there's a proposed solution?

> Patch 0002:
> +  <para>
> +   <command>CREATE STATISTICS</command> will create a new multivariate
> +   statistics on the table. The statistics will be created in the in the
> +   current database. The statistics will be owned by the user issuing
> +   the command.
> +  </para>
> s/in the/in the/.
>
> +  <para>
> +   Create table <structname>t1</> with two functionally dependent columns, i.e.
> +   knowledge of a value in the first column is sufficient for detemining the
> +   value in the other column. Then functional dependencies are built on those
> +   columns:
> s/detemining/determining/
>
> +  <para>
> +   If a schema name is given (for example, <literal>CREATE STATISTICS
> +   myschema.mystat ...</>) then the statistics is created in the specified
> +   schema.  Otherwise it is created in the current schema.  The name of
> +   the table must be distinct from the name of any other statistics in the
> +   same schema.
> +  </para>
> I would just assume that a statistics is located on the schema of the
> relation it depends on. So the thing that may be better to do is just:
> - Register the OID of the table a statistics depends on but not the schema.
> - Give up on those query extensions related to the schema.
> - Allow the same statistics name to be used for multiple tables.
> - Just fail if a statistics name is being reused on the table again.
> It may be better to complain about that even if the column list is
> different.
> - Register the dependency between the statistics and the table.

The idea is that the syntax should work even for statistics built on 
multiple tables, e.g. to provide better statistics for joins. That's why 
the schema may be specified (as each table might be in different 
schema), and so on.

>
> +ALTER STATISTICS <replaceable class="parameter">name</replaceable>
> OWNER TO { <replaceable class="PARAMETER">new_owner</replaceable> |
> CURRENT_USER | SESSION_USER }
> On the same line, is OWNER TO really necessary? I could have assumed
> that if a user is able to query the set of columns related to a
> statistics, he should have access to it.
>

Not sure, TBH. I think I've reused ALTER INDEX syntax, but now I see 
it's actually ignored with a warning.

> =# create statistics aa_a_b3 on aam (a, b) with (dependencies);
> ERROR:  23505: duplicate key value violates unique constraint
> "pg_mv_statistic_name_index"
> DETAIL:  Key (staname, stanamespace)=(aa_a_b3, 2200) already exists.
> SCHEMA NAME:  pg_catalog
> TABLE NAME:  pg_mv_statistic
> CONSTRAINT NAME:  pg_mv_statistic_name_index
> LOCATION:  _bt_check_unique, nbtinsert.c:433
> When creating a multivariate function with a name that already exists,
> this error message should be more friendly.

Yes, agreed.

>
> =# create table aa (a int, b int);
> CREATE TABLE
> =# create view aav as select * from aa;
> CREATE VIEW
> =# create statistics aab_v on aav (a, b) with (dependencies);
> CREATE STATISTICS
> Why do views and foreign tables support this command? This code also
> mentions that this case is not actually supported:
> +       /* multivariate stats are supported on tables and matviews */
> +       if (rel->rd_rel->relkind == RELKIND_RELATION ||
> +           rel->rd_rel->relkind == RELKIND_MATVIEW)
> +           tupdesc = RelationGetDescr(rel);
>
>  };

Yes, seems like a bug.

>
> +
>  /*
> Spurious noise in the patch.
>
> +   /* check that at least some statistics were requested */
> +   if (!build_dependencies)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_SYNTAX_ERROR),
> +                errmsg("no statistics type (dependencies) was requested")));
> So, WITH (dependencies) is mandatory in any case. Why not just
> dropping it from the first cut then?

Because the follow-up patches extend this to require at least one 
statistics type. So in 0004 it becomes
    if (!(build_dependencies || build_mcv))

and in 0005 it's
    if (!(build_dependencies || build_mcv || build_histogram))

We might drop it from 0002 (and assume build_dependencies=true), and 
then add the check in 0004. But it seems a bit pointless.

>
> pg_mv_stats shows only the attribute numbers of the columns it has
> stats on, I think that those should be the column names. [...after a
> while...], as it is mentioned here:
> + * TODO  Would be nice if this printed column names (instead of just attnums).

Yeah.

>
> Does this work properly with DDL deparsing? If yes, could it be
> possible to add tests in test_ddl_deparse? This is a new object type,
> so those look necessary I think.
>

I haven't done anything with DDL deparsing, so I think the answer is 
"no" and needs to be added to a TODO.

> Statistics definition reorder the columns by itself depending on their
> order. For example:
> create table aa (a int, b int);
> create statistics aas on aa(b, a) with (dependencies);
> \d aa
>     "public.aas" (dependencies) ON (a, b)
> As this defines a correlation between multiple columns, isn't it wrong
> to assume that (b, a) and (a, b) are always the same correlation? I
> don't recall such properties as being always commutative (old
> memories, I suck at stats in general). [...reading README...] So this
> is caused by the implementation limitations that only limit the
> analysis between interactions of two columns. Still it seems incorrect
> to reorder the user-visible portion.

I don't follow. If you talk about Pearson's correlation, that clearly 
does not depend on the order of columns - it's perfectly independent of 
that. If you talk about about correlation in the wider sense (i.e. 
arbitrary dependence between columns), that might depend - but I don't 
remember a single piece of the patch where this might be a problem.

Also, which README states that we can only analyze interactions between 
two columns? That's pretty clearly not the case - the patch should 
handle dependencies between more columns without any problems.

>
> The comment on top of get_relation_info needs to be updated to mention
> that mvstatlist gets fetched as well.
>
> +   while (HeapTupleIsValid(htup = systable_getnext(indscan)))
> +       /* TODO maybe include only already built statistics? */
> +       result = insert_ordered_oid(result, HeapTupleGetOid(htup));
> I haven't looked at the rest yet of the series yet, but I'd think that
> including the ones not built may be a good idea to let caller do
> itself more filtering. Of course this depends on the next series...
>

Probably, although the more I'm thinking about this the more I think 
I'll rework this along the lines of the foreign-key-estimation patch, 
i.e. preprocessing called from planmain.c (adding info to the query 
plan), estimation in clausesel.c etc. Which also affects this bit, 
because the foreign keys are also loaded elsewhere, IIRC.

> +typedef struct MVDependencyData
> +{
> +   int         nattributes;    /* number of attributes */
> +   int16       attributes[1];  /* attribute numbers */
> +} MVDependencyData;
> You need to look for FLEXIBLE_ARRAY_MEMBER here. Same for MVDependenciesData.
>
> +++ b/src/test/regress/serial_schedule
> @@ -167,3 +167,4 @@ test: with
>  test: xml
>  test: event_trigger
>  test: stats
> +test: mv_dependencies
> This test is not listed in parallel_schedule.
>
> s/Apllying/Applying/
>
> There is a lot of mumbo-jumbo regarding the way dependencies are
> stored with mainly serialize_mv_dependencies and
> deserialize_mv_dependencies that operates them from bytea/dep trees.
> That's not cool and not portable because pg_mv_statistic represents
> that as pure bytea. I would suggest creating a generic data type that
> does those operations, named like pg_dependency_tree and then use that
> in those new catalogs. pg_node_tree is a precedent of such a thing.
> New features could as well make use of this new data type of we are
> able to design that in a way generic enough, so that would be a base
> patch that the current 0002 applies on top of.

Interesting idea, haven't thought about that. So are you suggesting to 
add a data type for each statistics type (dependencies, MCV, histogram, 
...)?

>
> Regarding psql:
> - The new commands lack psql completion, that would ease the use of
> the new commands.
> - Would it make sense to have a backslash command to show the list of
> statistics?
>

Yeah, that's on the TODO.

> Congratulations. I just looked at 25% of the overall patch and my mind
> is already blown away, but I am catching up with the rest...
>

Thanks for looking.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Proposal for CSN based snapshots
Next
From: Petr Jelinek
Date:
Subject: Re: multivariate statistics (v19)