Re: multivariate statistics / patch v7 - Mailing list pgsql-hackers
| From | Kyotaro HORIGUCHI |
|---|---|
| Subject | Re: multivariate statistics / patch v7 |
| Date | |
| Msg-id | 20150716.205157.170797028.horiguchi.kyotaro@lab.ntt.co.jp Whole thread Raw |
| In response to | Re: multivariate statistics / patch v7 (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
| Responses |
Re: multivariate statistics / patch v7
|
| List | pgsql-hackers |
Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.
The code to find mv-applicable clause is moved out of the main
flow of clauselist_selectivity. As I said in the previous mail,
the new function transformRestrictInfoForEstimate (too bad name
but just for PoC:) scans clauselist and generates
RestrictStatsData struct which drives mv-aware selectivity
calculation. This struct isolates MV and non-MV estimation.
The struct RestrictStatData mainly consists of the following
three parts,
- clause to be estimated by current logic (MV is not applicable) - clause to be estimated by MV-staistics. - list of
childRestrictStatDatas, which are to be run recursively.
mvclause_selectivty() is the topmost function where mv stats
works. This structure effectively prevents main estimation flow
from being broken by modifying mvstats part. Although I haven't
measured but I'm positive the code is far reduced from yours.
I attached two patches to this message. The first one is to
rebase v7 patch to current(maybe) master and the second applies
the refactoring.
I'm a little anxious about performance but I think this makes the
process to apply mv-stats far clearer. Regtests for mvstats
succeeded asis except for fdep, which is not implememted in this
patch.
What do you think about this?
regards,
> Hi, Thanks for the detailed explaination. I misunderstood the
> code (more honest speaking, din't look so close there). Then I
> looked it closer.
>
>
> At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<559C76D4.2030805@2ndquadrant.com>
> > FWIW this was a stupid bug in update_match_bitmap_histogram(), which
> > initially handled only AND clauses, and thus assumed the "match" of a
> > bucket can only decrease. But for OR clauses this is exactly the
> > opposite (we assume no buckets match and add buckets matching at least
> > one of the clauses).
> >
> > With this fixed, the estimates look like this:
> >
>
> > IMHO pretty accurate estimates - no issue with OR clauses.
>
> Ok, I understood the diferrence between what I thought and what
> you say. The code is actually concious of OR clause but is looks
> somewhat confused.
>
> Currently choosing mv stats in clauselist_selectivity can be
> outlined as following,
>
> 1. find_stats finds candidate mv stats containing *all*
> attributes appeared in the whole clauses regardless of and/or
> exprs by walking whole the clause tree.
>
> Perhaps this is the measure to early bailout.
>
> 2.1. Within every disjunction elements, collect mv-related
> attributes while checking whether the all leaf nodes (binop or
> ifnull) are compatible by (eventually) walking whole the
> clause tree.
>
> 2.2. Check if all the collected attribute are contained in
> mv-stats columns.
>
> 3. Finally, clauseset_mv_selectivity_histogram() (and others).
>
> This funciton applies every ExprOp onto every attribute in
> every histogram backes and (tries to) make the boolean
> operation of the result bitmaps.
>
> I have some comments on the implement and I also try to find the
> solution for them.
>
>
> 1. The flow above looks doing very similiar thins repeatedly.
>
> 2. I believe what the current code does can be simplified.
>
> 3. As you mentioned in comments, some additional infrastructure
> needed.
>
> After all, I think what we should do after this are as follows,
> as the first step.
>
> - Add the means to judge the selectivity operator(?) by other
> than oprrest of the op of ExprOp. (You missed neqsel already)
>
> I suppose one solution for this is adding oprmvstats taking
> 'm', 'h' and 'f' and their combinations. Or for the
> convenience, it would be a fixed-length string like this.
>
> oprname | oprmvstats
> = | 'mhf'
> <> | 'mhf'
> < | 'mh-'
> > | 'mh-'
> >= | 'mh-'
> <= | 'mh-'
>
> This would make the code in clause_is_mv_compatible like this.
>
> > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
> > if (oprmvstats & types)
> > {
> > *attnums = bms_add_member(*attnums, var->varattno);
> > return true;
> > }
> > return false;
>
> - Current design just manage to work but it is too complicated
> and hardly have affinity with the existing estimation
> framework. I proposed separation of finding stats phase and
> calculation phase, but I would like to propose transforming
> RestrictInfo(and finding mvstat) phase and running the
> transformed RestrictInfo phase after looking close to the
> patch.
>
> I think transforing RestrictInfo makes the situnation
> better. Since it nedds different information, maybe it is
> better to have new struct, say, RestrictInfoForEstimate
> (boo!). Then provide mvstatssel() to use in the new struct.
> The rough looking of the code would be like below.
>
> clauselist_selectivity()
> {
> ...
> RestrictInfoForEstmate *esclause =
> transformClauseListForEstimation(root, clauses, varRelid);
> ...
>
> return clause_selectivity(esclause):
> }
>
> clause_selectivity(RestrictInfoForEstmate *esclause)
> {
> if (IsA(clause, RestrictInfo))...
> if (IsA(clause, RestrictInfoForEstimate))
> {
> RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
> if (ecl->selfunc)
> {
> sx = ecl->selfunc(root, ecl);
> }
> }
> if (IsA(clause, Var))...
> }
>
>
> transformClauseListForEstimation(...)
> {
> ...
>
> relid = collect_mvstats_info(root, clause, &attlist);
> if (!relid) return;
> if (get_mvstats_hook)
> mvstats = (*get_mvstats_hoook) (root, relid, attset);
> else
> mvstats = find_mv_stats(root, relid, attset))
> }
> ...
>
> > I've pushed this to github [1] but I need to do some additional
> > fixes. I also had to remove some optimizations while fixing this, and
> > will have to reimplement those.
> >
> > That's not to say that the handling of OR-clauses is perfectly
> > correct. After looking at clauselist_selectivity_or(), I believe it's
> > a bit broken and will need a bunch of fixes, as explained in the
> > FIXMEs I pushed to github.
> >
> > [1] https://github.com/tvondra/postgres/tree/mvstats
>
> I don't see whether it is doable or not, and I suppose you're
> unwilling to change the big picture, so I will consider the idea
> and will show you the result, if it turns out to be possible and
> promising.
--
Kyotaro Horiguchi
NTT Open Source Software Center
pgsql-hackers by date: