Re: multivariate statistics / patch v7 - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: multivariate statistics / patch v7
Date
Msg-id 20150716.205157.170797028.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: multivariate statistics / patch v7  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: multivariate statistics / patch v7
List pgsql-hackers
Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.

The code to find mv-applicable clause is moved out of the main
flow of clauselist_selectivity. As I said in the previous mail,
the new function transformRestrictInfoForEstimate (too bad name
but just for PoC:) scans clauselist and generates
RestrictStatsData struct which drives mv-aware selectivity
calculation. This struct isolates MV and non-MV estimation.

The struct RestrictStatData mainly consists of the following
three parts,
 - clause to be estimated by current logic (MV is not applicable) - clause to be estimated by MV-staistics. - list of
childRestrictStatDatas, which are to be run   recursively.
 

mvclause_selectivty() is the topmost function where mv stats
works. This structure effectively prevents main estimation flow
from being broken by modifying mvstats part. Although I haven't
measured but I'm positive the code is far reduced from yours.

I attached two patches to this message. The first one is to
rebase v7 patch to current(maybe) master and the second applies
the refactoring.

I'm a little anxious about performance but I think this makes the
process to apply mv-stats far clearer. Regtests for mvstats
succeeded asis except for fdep, which is not implememted in this
patch.

What do you think about this?

regards,


> Hi, Thanks for the detailed explaination. I misunderstood the
> code (more honest speaking, din't look so close there). Then I
> looked it closer.
> 
> 
> At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<559C76D4.2030805@2ndquadrant.com>
> > FWIW this was a stupid bug in update_match_bitmap_histogram(), which
> > initially handled only AND clauses, and thus assumed the "match" of a
> > bucket can only decrease. But for OR clauses this is exactly the
> > opposite (we assume no buckets match and add buckets matching at least
> > one of the clauses).
> > 
> > With this fixed, the estimates look like this:
> > 
> 
> > IMHO pretty accurate estimates - no issue with OR clauses.
> 
> Ok, I understood the diferrence between what I thought and what
> you say. The code is actually concious of OR clause but is looks
> somewhat confused.
> 
> Currently choosing mv stats in clauselist_selectivity can be
> outlined as following,
> 
> 1. find_stats finds candidate mv stats containing *all*
>    attributes appeared in the whole clauses regardless of and/or
>    exprs by walking whole the clause tree.
> 
>    Perhaps this is the measure to early bailout.
> 
> 2.1. Within every disjunction elements, collect mv-related
>    attributes while checking whether the all leaf nodes (binop or
>    ifnull) are compatible by (eventually) walking whole the
>    clause tree.
> 
> 2.2. Check if all the collected attribute are contained in
>    mv-stats columns.
> 
> 3. Finally, clauseset_mv_selectivity_histogram() (and others).
> 
>    This funciton applies every ExprOp onto every attribute in
>    every histogram backes and (tries to) make the boolean
>    operation of the result bitmaps.
> 
> I have some comments on the implement and I also try to find the
> solution for them.
> 
> 
> 1. The flow above looks doing  very similiar thins repeatedly.
> 
> 2. I believe what the current code does can be simplified.
> 
> 3. As you mentioned in comments, some additional infrastructure
>    needed.
> 
> After all, I think what we should do after this are as follows,
> as the first step.
> 
> - Add the means to judge the selectivity operator(?) by other
>   than oprrest of the op of ExprOp. (You missed neqsel already)
> 
>   I suppose one solution for this is adding oprmvstats taking
>   'm', 'h' and 'f' and their combinations. Or for the
>   convenience, it would be a fixed-length string like this.
> 
>   oprname | oprmvstats
>   =       | 'mhf'
>   <>      | 'mhf'
>   <       | 'mh-'
>   >       | 'mh-'
>   >=      | 'mh-'
>   <=      | 'mh-'
> 
>   This would make the code in clause_is_mv_compatible like this.
> 
>   > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
>   > if (oprmvstats & types)
>   > {
>   >    *attnums = bms_add_member(*attnums, var->varattno);
>   >    return true;
>   > }
>   > return false;
> 
> - Current design just manage to work but it is too complicated
>   and hardly have affinity with the existing estimation
>   framework. I proposed separation of finding stats phase and
>   calculation phase, but I would like to propose transforming
>   RestrictInfo(and finding mvstat) phase and running the
>   transformed RestrictInfo phase after looking close to the
>   patch.
> 
>   I think transforing RestrictInfo makes the situnation
>   better. Since it nedds different information, maybe it is
>   better to have new struct, say, RestrictInfoForEstimate
>   (boo!). Then provide mvstatssel() to use in the new struct.
>   The rough looking of the code would be like below. 
> 
>   clauselist_selectivity()
>   {
>     ...
>     RestrictInfoForEstmate *esclause =
>       transformClauseListForEstimation(root, clauses, varRelid);
>     ...
> 
>     return clause_selectivity(esclause):
>   }
> 
>   clause_selectivity(RestrictInfoForEstmate *esclause)
>   {
>     if (IsA(clause, RestrictInfo))...
>     if (IsA(clause, RestrictInfoForEstimate))
>     {
>        RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
>        if (ecl->selfunc)
>        {
>           sx = ecl->selfunc(root, ecl);
>        }
>     }
>     if (IsA(clause, Var))...
>   }
> 
>   
>   transformClauseListForEstimation(...)
>   {
>     ...
> 
>     relid = collect_mvstats_info(root, clause, &attlist);
>     if (!relid) return;
>     if (get_mvstats_hook)
>          mvstats = (*get_mvstats_hoook) (root, relid, attset);
>     else
>          mvstats = find_mv_stats(root, relid, attset))
>   }
>   ...
> 
> > I've pushed this to github [1] but I need to do some additional
> > fixes. I also had to remove some optimizations while fixing this, and
> > will have to reimplement those.
> > 
> > That's not to say that the handling of OR-clauses is perfectly
> > correct. After looking at clauselist_selectivity_or(), I believe it's
> > a bit broken and will need a bunch of fixes, as explained in the
> > FIXMEs I pushed to github.
> > 
> > [1] https://github.com/tvondra/postgres/tree/mvstats
> 
> I don't see whether it is doable or not, and I suppose you're
> unwilling to change the big picture, so I will consider the idea
> and will show you the result, if it turns out to be possible and
> promising.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

pgsql-hackers by date:

Previous
From: Sawada Masahiko
Date:
Subject: Re: Freeze avoidance of very large table.
Next
From: Petr Jelinek
Date:
Subject: Re: TABLESAMPLE patch is really in pretty sad shape