Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: multivariate statistics (v19)
Date
Msg-id CAB7nPqS2oxRJTfsNDSqhvBSZ6Fa2ibBX2O6LEHE5FBpXZZ9VRA@mail.gmail.com
Whole thread Raw
In response to Re: multivariate statistics (v19)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: multivariate statistics (v19)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> Patch 0001: there have been comments about that before, and you have
>> put the checks on RestrictInfo in a couple of variables of
>> pull_varnos_walker, so nothing to say from here.
>>
>
> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
> a proposed solution?

I think that's reasonable.

>> Patch 0002:
>> +  <para>
>> +   <command>CREATE STATISTICS</command> will create a new multivariate
>> +   statistics on the table. The statistics will be created in the in the
>> +   current database. The statistics will be owned by the user issuing
>> +   the command.
>> +  </para>
>> s/in the/in the/.
>>
>> +  <para>
>> +   Create table <structname>t1</> with two functionally dependent
>> columns, i.e.
>> +   knowledge of a value in the first column is sufficient for detemining
>> the
>> +   value in the other column. Then functional dependencies are built on
>> those
>> +   columns:
>> s/detemining/determining/
>>
>> +  <para>
>> +   If a schema name is given (for example, <literal>CREATE STATISTICS
>> +   myschema.mystat ...</>) then the statistics is created in the
>> specified
>> +   schema.  Otherwise it is created in the current schema.  The name of
>> +   the table must be distinct from the name of any other statistics in
>> the
>> +   same schema.
>> +  </para>
>> I would just assume that a statistics is located on the schema of the
>> relation it depends on. So the thing that may be better to do is just:
>> - Register the OID of the table a statistics depends on but not the
>> schema.
>> - Give up on those query extensions related to the schema.
>> - Allow the same statistics name to be used for multiple tables.
>> - Just fail if a statistics name is being reused on the table again.
>> It may be better to complain about that even if the column list is
>> different.
>> - Register the dependency between the statistics and the table.
>
> The idea is that the syntax should work even for statistics built on
> multiple tables, e.g. to provide better statistics for joins. That's why the
> schema may be specified (as each table might be in different schema), and so
> on.

So you mean that the same statistics could be shared between tables?
But as this is visibly not a concept introduced yet in this set of
patches, why not just cut it off for now to simplify the whole? If
there is no schema-related field in pg_mv_statistics we could still
add it later if it proves to be useful.

>> +
>>  /*
>> Spurious noise in the patch.
>>
>> +   /* check that at least some statistics were requested */
>> +   if (!build_dependencies)
>> +       ereport(ERROR,
>> +               (errcode(ERRCODE_SYNTAX_ERROR),
>> +                errmsg("no statistics type (dependencies) was
>> requested")));
>> So, WITH (dependencies) is mandatory in any case. Why not just
>> dropping it from the first cut then?
>
>
> Because the follow-up patches extend this to require at least one statistics
> type. So in 0004 it becomes
>
>     if (!(build_dependencies || build_mcv))
>
> and in 0005 it's
>
>     if (!(build_dependencies || build_mcv || build_histogram))
>
> We might drop it from 0002 (and assume build_dependencies=true), and then
> add the check in 0004. But it seems a bit pointless.

This is a complicated set of patches. I'd think that we should try to
simplify things as much as possible first, and the WITH clause is not
mandatory to have as of 0002.

>> Statistics definition reorder the columns by itself depending on their
>> order. For example:
>> create table aa (a int, b int);
>> create statistics aas on aa(b, a) with (dependencies);
>> \d aa
>>     "public.aas" (dependencies) ON (a, b)
>> As this defines a correlation between multiple columns, isn't it wrong
>> to assume that (b, a) and (a, b) are always the same correlation? I
>> don't recall such properties as being always commutative (old
>> memories, I suck at stats in general). [...reading README...] So this
>> is caused by the implementation limitations that only limit the
>> analysis between interactions of two columns. Still it seems incorrect
>> to reorder the user-visible portion.
>
> I don't follow. If you talk about Pearson's correlation, that clearly does
> not depend on the order of columns - it's perfectly independent of that. If
> you talk about about correlation in the wider sense (i.e. arbitrary
> dependence between columns), that might depend - but I don't remember a
> single piece of the patch where this might be a problem.

Yes, based on what is done in the patch that may not be a problem, but
I am wondering if this is not restricting things too much.

> Also, which README states that we can only analyze interactions between two
> columns? That's pretty clearly not the case - the patch should handle
> dependencies between more columns without any problems.

I have noticed that the patch evaluates all the set of permutations
possible using a column list, it seems to me though that say if we
have three columns (a,b,c) listed in a statistics, (a,b) => c and
(b,a) => c are two different things.

>> There is a lot of mumbo-jumbo regarding the way dependencies are
>> stored with mainly serialize_mv_dependencies and
>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>> That's not cool and not portable because pg_mv_statistic represents
>> that as pure bytea. I would suggest creating a generic data type that
>> does those operations, named like pg_dependency_tree and then use that
>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>> New features could as well make use of this new data type of we are
>> able to design that in a way generic enough, so that would be a base
>> patch that the current 0002 applies on top of.
>
>
> Interesting idea, haven't thought about that. So are you suggesting to add a
> data type for each statistics type (dependencies, MCV, histogram, ...)?

Yes that would be something like that, it would be actually perhaps
better to have one single data type, and be able to switch between
each model easily instead of putting byteas in the catalog.
-- 
Michael



pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: multivariate statistics (v19)
Next
From: Michael Paquier
Date:
Subject: Re: multivariate statistics (v19)