Re: multivariate statistics / patch v7 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: multivariate statistics / patch v7
Msg-id 55A51AF0.7020501@2ndquadrant.com
In response to Re: multivariate statistics / patch v7  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
Hi,

On 07/13/2015 10:51 AM, Kyotaro HORIGUCHI wrote:
>
> Ok, I understood the difference between what I thought and what you
> say. The code is actually conscious of OR clauses but it looks somewhat
> confused.

I'm not sure which part is confused by the OR clauses, but it's 
certainly possible. Initially it only handled AND clauses, and the 
support for OR clauses was added later, so it's possible some parts are 
not behaving correctly.

>
> Currently choosing mv stats in clauselist_selectivity can be
> outlined as following,
>
> 1. find_stats finds candidate mv stats containing *all*
>     attributes appearing in the whole clause, regardless of AND/OR
>     exprs, by walking the whole clause tree.
>
>     Perhaps this is a measure for early bailout.

Not entirely. The goal of find_stats() is to look up all stats on the 
'current' relation - it's coded the way it is because I had to deal with 
varRelid=0 cases, in which case I have to inspect the Var nodes. But 
maybe I got this wrong and there's a much simpler way to do that?

It is an early bailout in the sense that if there are no multivariate 
stats defined on the table, there's no point in doing any of the 
following steps, so we don't increase planning times for users not 
using multivariate stats.

> 2.1. Within every disjunction element, collect mv-related
>     attributes while checking whether all the leaf nodes (binop or
>     ifnull) are compatible, by (eventually) walking the whole
>     clause tree.

Generally, yes. The idea is to check whether there are clauses that 
might be estimated using multivariate statistics, and whether the 
clauses reference at least two different attributes. Imagine a query 
like this:
   SELECT * FROM t WHERE (a=1) AND (a>0) AND (a<100)

It makes no sense to process this using multivariate statistics, because 
all the Var nodes reference a single attribute.
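The "at least two distinct attributes" test can be sketched roughly like this. This is a simplified illustration, not the patch code: it stands in for PostgreSQL's Bitmapset of collected varattnos with a plain bitmask, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit i set means attribute number i is referenced by some clause;
 * this plays the role of the Bitmapset of varattnos the patch collects. */
static int
attr_count(uint64_t attrs)
{
    int n = 0;
    while (attrs)
    {
        attrs &= attrs - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Multivariate stats only pay off when the clauses reference at least
 * two distinct attributes - (a=1) AND (a>0) AND (a<100) never qualifies,
 * no matter how many clauses there are. */
static int
worth_mv_estimation(uint64_t attrs)
{
    return attr_count(attrs) >= 2;
}
```

So three clauses on column `a` alone collapse to a single set bit and are rejected, while `(a=1) AND (b=2)` sets two bits and passes.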

Similarly, the check is not just about the leaf nodes - to be able to 
estimate a clause at this point, we have to be able to process the whole 
tree, starting from the top-level clause. Although maybe that's no 
longer true, now that support for OR clauses was added ... I wonder 
whether there are other BoolExpr-like nodes that might make the tree 
incompatible with multivariate statistics (in the sense that the current 
implementation does not know how to handle them).

Also note that even though the clause may be "incompatible" at this 
level, it may get partially processed by multivariate statistics later. 
For example with a query:
   SELECT * FROM t WHERE (a=1 OR b=2 OR c ~* 'xyz') AND (q=1 OR r=4)

the first clause is "incompatible" because it contains the unsupported 
operator '~*', but it will eventually be processed as a BoolExpr node, and 
should be split into two parts - (a=1 OR b=2), which is compatible, and 
(c ~* 'xyz'), which is incompatible.

This split should happen in clauselist_selectivity_or(), and the other 
thing that may be interesting is that it uses (q=1 OR r=4) as a 
condition. So if there's a statistics built on (a,b,q,r) we'll compute 
conditional probability
    P(a=1,b=2 | q=1,r=4)
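The way the split parts get recombined can be sketched as follows. This is a toy model, not the patch code: it takes the selectivity of the multivariate-compatible part as already computed (in reality that comes from the histogram/MCV machinery) and folds each incompatible sub-clause back in using the standard independent-OR formula s = s1 + s2 - s1*s2.

```c
#include <assert.h>
#include <math.h>

/* One sub-clause of an OR: its individually estimated selectivity and a
 * flag saying whether the multivariate machinery could handle it.
 * (Illustrative struct, not from the patch.) */
typedef struct
{
    double sel;        /* selectivity of this sub-clause */
    int    mv_ok;      /* estimated as part of the mv-compatible group? */
} SubClause;

/* mv_part is the selectivity of the compatible group, e.g. (a=1 OR b=2);
 * the incompatible remainder, e.g. (c ~* 'xyz'), is OR-ed in assuming
 * independence. */
static double
or_selectivity(double mv_part, const SubClause *rest, int n)
{
    double s = mv_part;
    for (int i = 0; i < n; i++)
    {
        if (!rest[i].mv_ok)
            s = s + rest[i].sel - s * rest[i].sel;
    }
    return s;
}
```

For example, a compatible part estimated at 0.2 combined with one incompatible clause at 0.1 yields 0.2 + 0.1 - 0.02 = 0.28.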

> 2.2. Check if all the collected attributes are contained in
>     mv-stats columns.

No, I think you got this wrong. We do not check that *all* the 
attributes are contained in mvstats columns - we only need two such 
columns (then there's a chance that the multivariate statistics will get 
applied).

Anyway, both 2.1 and 2.2 are meant as a quick bailout, before doing the 
most expensive part, which is choose_mv_statistics(). Which is however 
missing in this list.

> 3. Finally, clauselist_mv_selectivity_histogram() (and others).
>
>     This function applies every OpExpr onto every attribute in
>     every histogram bucket and (tries to) perform the boolean
>     operations on the result bitmaps.

Yes, but this only happens after choose_mv_statistics(), because that's 
the code that decides which statistics will be used and in what order.

The list is also missing handling of the 'functional dependencies', so a 
complete list of steps would look like this:

1) find_stats - lookup stats on the current relation (from RelOptInfo)

2) apply functional dependencies
   a) check if there are equality clauses that may be reduced using
      functional dependencies, referencing at least two columns
   b) if yes, perform the clause reduction

3) apply MCV lists and histograms
   a) check if there are clauses 'compatible' with those types of
      statistics, again containing at least two columns
   b) if yes, use choose_mv_statistics() to decide which statistics to
      apply and in which order
   c) apply the selected histograms and MCV lists

4) estimate the remaining clauses using the regular statistics
   a) this is where clauselist_mv_selectivity_histogram and others
      are called

I tried to explain this in the comment before clauselist_selectivity(), 
but maybe it's not detailed enough / missing some important details.
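The control flow of those steps can be condensed into a skeleton like the one below. All names here are stand-ins for illustration (the real logic lives in clauselist_selectivity() and its helpers); the point is the ordering: early bailout first, then dependencies, then MCV/histograms, each gated on the two-column precondition.

```c
#include <assert.h>

/* Toy summary of what the planner knows before estimation; not a real
 * PostgreSQL struct. */
typedef struct
{
    int has_stats;      /* any mv stats defined on the relation? */
    int eq_clauses;     /* equality clauses on distinct columns */
    int mv_clauses;     /* mv-compatible clauses on distinct columns */
} PlanInfo;

enum
{
    STEP_NONE    = 0,
    STEP_FDEPS   = 1,   /* functional-dependency reduction */
    STEP_MVSTATS = 2    /* MCV lists / histograms */
};

static int
estimation_steps(const PlanInfo *info)
{
    int steps = STEP_NONE;

    /* 1) find_stats: with no mv stats on the table, bail out so users
     *    not using mv stats pay (almost) nothing at plan time */
    if (!info->has_stats)
        return steps;

    /* 2) functional dependencies need equality clauses on >= 2 columns */
    if (info->eq_clauses >= 2)
        steps |= STEP_FDEPS;

    /* 3) MCV lists / histograms need >= 2 compatible columns;
     *    choose_mv_statistics() would pick which stats to apply here */
    if (info->mv_clauses >= 2)
        steps |= STEP_MVSTATS;

    /* 4) everything left over falls back to regular per-column stats */
    return steps;
}
```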

>
> I have some comments on the implement and I also try to find the
> solution for them.
>
>
> 1. The flow above looks like it does very similar things repeatedly.

I worked hard to remove such code duplication, and believe all the 
current steps are necessary - for example 2(a) and 3(a) may seem 
similar, but it's really necessary to do that twice.

>
> 2. I believe what the current code does can be simplified.

Possibly.

>
> 3. As you mentioned in comments, some additional infrastructure
>     is needed.
>
> After all, I think what we should do after this are as follows,
> as the first step.

OK.

>
> - Add the means to judge the selectivity operator(?) by something
>    other than the oprrest of the op of an OpExpr. (You missed neqsel
>    already)

Yes, the way we use 'opno' to determine how to estimate the selectivity
is a bit awkward. It's inspired by the handling of range queries, and
having something better would be nice.

But I don't think this is the reason why I missed neqsel, and I don't
see this as a significant design issue at this point. But if we can come
up with a better solution, why not ...

>    I suppose one solution for this is adding oprmvstats taking
>    'm', 'h' and 'f' and their combinations. Or, for convenience, it
>    could be a fixed-length string like this.
>
>    oprname | oprmvstats
>    =       | 'mhf'
>    <>      | 'mhf'
>    <       | 'mh-'
>    >       | 'mh-'
>    >=      | 'mh-'
>    <=      | 'mh-'
>
>    This would make the code in clause_is_mv_compatible like this.
>
>    > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
>    > if (oprmvstats & types)
>    > {
>    >    *attnums = bms_add_member(*attnums, var->varattno);
>    >    return true;
>    > }
>    > return false;

So this only determines the compatibility of operators with respect to 
different types of statistics? How does that solve the neqsel case? It 
will probably decide the clause is compatible, but it will later fail at 
the actual estimation, no?
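For concreteness, the proposed oprmvstats string could translate into a bitmask like this. This is a sketch of HORIGUCHI-san's idea only; the macro and function names are made up here, and, as noted above, such a mask answers "which statistics types is this operator compatible with?" without by itself solving the neqsel estimation problem.

```c
#include <assert.h>

/* Statistics types an operator may be compatible with (hypothetical
 * names, mirroring the 'm'/'h'/'f' columns of the proposed table). */
#define MV_MCV   (1 << 0)   /* MCV lists */
#define MV_HIST  (1 << 1)   /* histograms */
#define MV_FDEP  (1 << 2)   /* functional dependencies */

/* Decode a fixed-length "mhf"-style string, e.g. "mhf" for '=' / '<>'
 * and "mh-" for the inequality operators, into a bitmask that
 * clause_is_mv_compatible could test with a simple AND. */
static int
mvstats_mask(const char *oprmvstats)
{
    int mask = 0;

    if (oprmvstats[0] == 'm') mask |= MV_MCV;
    if (oprmvstats[1] == 'h') mask |= MV_HIST;
    if (oprmvstats[2] == 'f') mask |= MV_FDEP;
    return mask;
}
```

The compatibility check then reduces to `(mvstats_mask(str) & types) != 0`, matching the quoted `if (oprmvstats & types)` pseudo-code.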

>
> - The current design just manages to work but it is too complicated
>    and has little affinity with the existing estimation
>    framework.

I respectfully disagree. I've strived to make it as close to the 
current implementation as possible - maybe it's possible to improve 
that, but I believe there's a natural difference between the two types 
of statistics. It may be somewhat simplified, but it will never be 
exactly the same.

>    I proposed separating the finding-stats phase and the
>    calculation phase but, after looking closely at the patch, I
>    would like to propose a transforming-RestrictInfo (and finding
>    mvstats) phase and a running-the-transformed-RestrictInfo
>    phase.

Those phases are already separated, as is illustrated by the steps 
explained above.

So technically we might place a hook either right after the find_stats() 
call, so that it's possible to process all the stats on the table, or 
maybe after the choose_mv_statistics() call, so that we only process the 
actually used stats.
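Either hook placement would follow PostgreSQL's usual function-pointer convention (compare planner_hook). A minimal sketch, with invented names:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: given a relation id, return (a stand-in for)
 * the set of statistics to consider. In the real thing this would take
 * PlannerInfo, the relation, and the attribute set. */
typedef int (*mvstats_hook_type) (int relid);

/* NULL by default; an extension could set this to override the lookup. */
static mvstats_hook_type get_mvstats_hook = NULL;

/* Stand-in for the built-in find_stats() lookup. */
static int
default_find_stats(int relid)
{
    return relid * 10;
}

/* The call site: consult the hook first, fall back to the default,
 * exactly the shape of the quoted transformClauseListForEstimation
 * pseudo-code. */
static int
lookup_stats(int relid)
{
    if (get_mvstats_hook != NULL)
        return get_mvstats_hook(relid);
    return default_find_stats(relid);
}
```

Placing it after find_stats() hands the hook the whole candidate set; placing it after choose_mv_statistics() only exposes the stats actually chosen.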

>
>    I think transforming RestrictInfo makes the situation
>    better. Since it needs different information, maybe it is
>    better to have a new struct, say, RestrictInfoForEstimate
>    (boo!). Then provide mvstatssel() to use in the new struct.
>    The rough looking of the code would be like below.
>
>    clauselist_selectivity()
>    {
>      ...
>      RestrictInfoForEstimate *esclause =
>        transformClauseListForEstimation(root, clauses, varRelid);
>      ...
>
>      return clause_selectivity(esclause);
>    }
>
>    clause_selectivity(RestrictInfoForEstimate *esclause)
>    {
>      if (IsA(clause, RestrictInfo))...
>      if (IsA(clause, RestrictInfoForEstimate))
>      {
>         RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
>         if (ecl->selfunc)
>         {
>            sx = ecl->selfunc(root, ecl);
>         }
>      }
>      if (IsA(clause, Var))...
>    }
>
>
>    transformClauseListForEstimation(...)
>    {
>      ...
>
>      relid = collect_mvstats_info(root, clause, &attlist);
>      if (!relid) return;
>      if (get_mvstats_hook)
>           mvstats = (*get_mvstats_hook) (root, relid, attset);
>      else
>           mvstats = find_mv_stats(root, relid, attset);
>    }
>    ...

So you'd transform the clause tree first, replacing parts of the tree 
(to be estimated by multivariate stats) by a new node type? That's an 
interesting idea, I think ...

I can't really say whether it's a good approach, though. Can you explain 
why you think it'd make the situation better?

The one benefit I can think of is being able to look at the processed 
tree and see which parts will be estimated using multivariate stats.

But we'd effectively have to do the same stuff (choosing the stats, 
...), and if we move this pre-processing before clauselist_selectivity 
(I assume that's the point), we'd end up repeating a lot of the code. Or 
maybe not, I'm not sure.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


