Re: multivariate statistics / patch v7 - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: multivariate statistics / patch v7
Date:
Msg-id:         55A51AF0.7020501@2ndquadrant.com
In response to: Re: multivariate statistics / patch v7 (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List:           pgsql-hackers
Hi,

On 07/13/2015 10:51 AM, Kyotaro HORIGUCHI wrote:
>
> Ok, I understood the difference between what I thought and what you
> say. The code is actually conscious of OR clauses, but it looks
> somewhat confused.

I'm not sure which part is confused by the OR clauses, but it's
certainly possible. Initially it only handled AND clauses, and support
for OR clauses was added later, so it's possible some parts do not
behave correctly.

> Currently, choosing mv stats in clauselist_selectivity can be
> outlined as follows:
>
> 1. find_stats finds candidate mv stats containing *all*
>    attributes appearing in the whole clause list, regardless of
>    and/or exprs, by walking the whole clause tree.
>
>    Perhaps this is a measure for early bailout.

Not entirely. The goal of find_stats() is to look up all stats on the
'current' relation - it's coded the way it is because I had to deal
with varRelid=0 cases, in which case I have to inspect the Var nodes.
But maybe I got this wrong and there's a much simpler way to do that?

It is an early bailout in the sense that if there are no multivariate
stats defined on the table, there's no point in doing any of the
following steps, so that we don't increase planning time for users not
using multivariate stats.

> 2.1. Within every disjunction element, collect mv-related
>      attributes while checking whether all the leaf nodes (binop or
>      ifnull) are compatible, by (eventually) walking the whole
>      clause tree.

Generally, yes. The idea is to check whether there are clauses that
might be estimated using multivariate statistics, and whether the
clauses reference at least two different attributes. Imagine a query
like this:

    SELECT * FROM t WHERE (a=1) AND (a>0) AND (a<100)

It makes no sense to process this using multivariate statistics,
because all the Var nodes reference a single attribute.
Similarly, the check is not just about the leaf nodes - to be able to
estimate a clause at this point, we have to be able to process the
whole tree, starting from the top-level clause. Although maybe that's
no longer true, now that support for OR clauses was added ... I wonder
whether there are other BoolExpr-like nodes that might make the tree
incompatible with multivariate statistics (in the sense that the
current implementation does not know how to handle them).

Also note that even though the clause may be "incompatible" at this
level, it may get partially processed by multivariate statistics
later. For example, with a query:

    SELECT * FROM t WHERE (a=1 OR b=2 OR c ~* 'xyz') AND (q=1 OR r=4)

the first clause is "incompatible" because it contains the unsupported
operator '~*', but it will eventually be processed as a BoolExpr node
and should be split into two parts - (a=1 OR b=2), which is compatible,
and (c ~* 'xyz'), which is incompatible. This split should happen in
clauselist_selectivity_or().

The other interesting thing is that it uses (q=1 OR r=4) as a
condition. So if there's a statistics built on (a,b,q,r), we'll compute
the conditional probability

    P(a=1,b=2 | q=1,r=4)

> 2.2. Check if all the collected attributes are contained in
>      mv-stats columns.

No, I think you got this wrong. We do not check that *all* the
attributes are contained in mvstats columns - we only need two such
columns (then there's a chance that the multivariate statistics will
get applied).

Anyway, both 2.1 and 2.2 are meant as a quick bailout before doing the
most expensive part, which is choose_mv_statistics(). That step is
however missing from this list.

> 3. Finally, clauseset_mv_selectivity_histogram() (and others).
>
>    This function applies every ExprOp onto every attribute in
>    every histogram bucket and (tries to) make the boolean
>    operation of the result bitmaps.
Yes, but this only happens after choose_mv_statistics(), because that's
the code that decides which statistics will be used and in what order.

The list is also missing the handling of 'functional dependencies', so
a complete list of steps would look like this:

1) find_stats - look up stats on the current relation (from RelOptInfo)

2) apply functional dependencies

   a) check if there are equality clauses that may be reduced using
      functional dependencies, referencing at least two columns

   b) if yes, perform the clause reduction

3) apply MCV lists and histograms

   a) check if there are clauses 'compatible' with those types of
      statistics, again containing at least two columns

   b) if yes, use choose_mv_statistics() to decide which statistics to
      apply and in which order

   c) apply the selected histograms and MCV lists

4) estimate the remaining clauses using the regular statistics

   a) this is where clauselist_mv_selectivity_histogram() and others
      are called

I tried to explain this in the comment before clauselist_selectivity(),
but maybe it's not detailed enough / missing some important details.

> I have some comments on the implementation, and I'll also try to
> find solutions for them.
>
> 1. The flow above looks like it is doing very similar things
>    repeatedly.

I worked hard to remove such code duplication, and I believe all the
current steps are necessary - for example 2(a) and 3(a) may seem
similar, but it's really necessary to do that twice.

> 2. I believe what the current code does can be simplified.

Possibly.

> 3. As you mentioned in comments, some additional infrastructure is
>    needed.
>
> After all, I think what we should do after this is as follows, as
> the first step.

OK.

> - Add a means to judge the selectivity operator(?) by something
>   other than the oprrest of the op of an ExprOp. (You missed neqsel
>   already.)

Yes, the way we use 'oprno' to determine how to estimate the
selectivity is a bit awkward. It's inspired by the handling of range
queries, and having something better would be nice.
But I don't think this is the reason why I missed neqsel, and I don't
see this as a significant design issue at this point. But if we can
come up with a better solution, why not ...

> I suppose one solution for this is adding oprmvstats, taking 'm',
> 'h' and 'f' and their combinations. Or, for convenience, it could be
> a fixed-length string like this:
>
>  oprname | oprmvstats
>  =       | 'mhf'
>  <>      | 'mhf'
>  <       | 'mh-'
>  >       | 'mh-'
>  >=      | 'mh-'
>  <=      | 'mh-'
>
> This would make the code in clause_is_mv_compatible look like this:
>
>   oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
>   if (oprmvstats & types)
>   {
>       *attnums = bms_add_member(*attnums, var->varattno);
>       return true;
>   }
>   return false;

So this only determines the compatibility of operators with respect to
different types of statistics? How does that solve the neqsel case? It
will probably decide the clause is compatible, but it will later fail
at the actual estimation, no?

> - The current design just manages to work, but it is too complicated
>   and hardly has affinity with the existing estimation framework.

I respectfully disagree. I've strived to make it as affine to the
current implementation as possible - maybe it's possible to improve
that, but I believe there's a natural difference between the two types
of statistics. It may be somewhat simplified, but it will never be
exactly the same.

> I proposed separating the finding-stats phase and the calculation
> phase, but after looking closely at the patch I would like to
> propose a phase transforming RestrictInfo (and finding mvstats) and
> a phase running the transformed RestrictInfo.

Those phases are already separated, as is illustrated by the steps
explained above. So technically we might place a hook either right
after the find_stats() call, so that it's possible to process all the
stats on the table, or maybe after the choose_mv_statistics() call, so
that we only process the actually used stats.
> I think transforming RestrictInfo makes the situation better. Since
> it needs different information, maybe it is better to have a new
> struct, say, RestrictInfoForEstimate (boo!). Then provide
> mvstatssel() to use in the new struct. A rough sketch of the code
> would look like below.
>
> clauselist_selectivity()
> {
>     ...
>     RestrictInfoForEstimate *esclause =
>         transformClauseListForEstimation(root, clauses, varRelid);
>     ...
>     return clause_selectivity(esclause);
> }
>
> clause_selectivity(RestrictInfoForEstimate *esclause)
> {
>     if (IsA(clause, RestrictInfo)) ...
>     if (IsA(clause, RestrictInfoForEstimate))
>     {
>         RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate *) clause;
>         if (ecl->selfunc)
>         {
>             sx = ecl->selfunc(root, ecl);
>         }
>     }
>     if (IsA(clause, Var)) ...
> }
>
> transformClauseListForEstimation(...)
> {
>     ...
>     relid = collect_mvstats_info(root, clause, &attlist);
>     if (!relid) return;
>     if (get_mvstats_hook)
>         mvstats = (*get_mvstats_hook) (root, relid, attset);
>     else
>         mvstats = find_mv_stats(root, relid, attset);
> }
> ...

So you'd transform the clause tree first, replacing parts of the tree
(to be estimated by multivariate stats) with a new node type? That's an
interesting idea, I think ... I can't really say whether it's a good
approach, though. Can you explain why you think it'd make the situation
better?

The one benefit I can think of is being able to look at the processed
tree and see which parts will be estimated using multivariate stats.
But we'd effectively have to do the same stuff (choosing the stats,
...), and if we move this pre-processing before
clauselist_selectivity() (I assume that's the point), we'd end up
repeating a lot of the code. Or maybe not, I'm not sure.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services