Re: BUG #15160: planner overestimates number of rows in join whenthere are more than 200 rows coming from CTE - Mailing list pgsql-hackers
From | Melanie Plageman |
---|---|
Subject | Re: BUG #15160: planner overestimates number of rows in join whenthere are more than 200 rows coming from CTE |
Date | |
Msg-id | 154232950503.1400.7562962199586713542.pgcf@coridan.postgresql.org Whole thread Raw |
Responses |
Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
|
List | pgsql-hackers |
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested This patch applies cleanly and works for the case described in the original email. All existing regression tests pass with the addition of the explain plan update included in the patch. I could not devise an example in which the previous method of calculating selectivity would have produced a better estimate. However, one question I have after thinking through the optimization is the following: This new selectivity calculation (targeted at semi-joins though in eqjoinsel) is: selectivity = Min(semi-join selectivity, ntuples inner * inner-join selectivity); For a join in which you do not have access to MCVs and assuming no NULLs, the inner-join selectivity used to compare to the calculated semi-join selectivity* is: selectivity = ntuples2 * 1 / max(nd1, nd2) = ntuples2 / max(nd1, nd2) if nd1 <= nd2: selectivity = ntuples2 / nd2 else: selectivity = ntuples2 / nd1 To summarize: Selectivity Type | if nd1 <= nd2 | if nd1 > nd2 | ----------------------------------|----------------|----------------- inner-join selectivity * ntuples2 | ntuples2 / nd2 | ntuples2 / nd1 | semi-join selectivity | 1 | nd2 / nd1 | Notice that ntuples2 >= nd2 so no matter what nd1 and nd2 are: inner-join selectivity * ntuples >= semi-join selectivity So, it seems like, unless you are missing NDVs of one of the sides of the join, inner join selectivity can never be less than semi-join selectivity. If this is true, why not use the default NDVs number in the semi-join selectivity calculation? * based on these summaries of the formulas for calculating selectivity inner-join selectivity when you don't have access to MCVs and assuming no NULLs is ~ 1 / max(nd1,nd2) if either nd1 or nd2 was default, nd[1,2] = min(ntuples[1,2], 200) semi-join selectivity when you don't have access to MCVs and assuming no NULLs is 0.5 when either nd1 or nd2 is default when neither are default if nd2 < 0 || nd1 <= nd2 1 else nd2/nd1 If there is a reason to keep the existing formula, then I have an additional question about the proposed selectivity calculation: selec = Min(selec, nd2 * selec_inner); When would it be incorrect to instead multiply by inner side NDVs? Besides the actual logic of the code added, Ekta and I did a code review and had some feedback on the structure and clarity of the code and comments. In the function eqjoinsel_semi, on line 2759 of the patched, rebased code, could you not move the else condition: uncertainfrac = 0.5; Up to the top of the if statement which starts on line 2663: if (have_mcvs1 && have_mcvs2 && OidIsValid(opfuncoid)) It seems like you already know and do not further modify the value of isdefault1 and isdefault2 and could exit faster before looping through all the MCVs in this case. For the function eqjoinsel_inner, why pass in vardata1 and vardata2, as they appear not to be used? Neither are the isdefault flags. This is in the existing code, however, I thought I would ask here: In eqjoinsel_semi, on line 2691 of the patched, rebased code, Why is this the min of the number of MCVs captured and the distinct values? It seems like if clamping resulted in an NDVs that is too low (i.e. impossibly low since the number of distinct values cannot be less than the number of MCVs), then you should bump it up to at least the number of MCVs: clamped_nvalues2 = Min(sslot2->nvalues, nd2); I also found the new comment added above the new selectivity calculation to be a little bit confusing: /* * We should never estimate the output of a semijoin to be more * rows than the equivalent inner join; it's obviously impossible * for that to happen. The former is N1 * Psemi while the latter * is N1 * N2 * Pinner, so we may clamp Psemi <= N2 * Pinner. * Doing this is worthwhile because of the shakier estimation * rules we use in eqjoinsel_semi, particularly in cases where it * has to punt entirely. */ selec = Min(selec, inner_rel->rows * selec_inner); After re-reading it several times, I understood what it was doing, however, it would be ideal if somehow the relationship between selectivity and cardinality were more clear. I don't have any great ideas for additional wording, however, maybe it would help to clarify that in order to clamp the cardinality correctly, we must clamp the selectivity using this formula. Basically, specify that we are not clamping the selectivity of semi-join to the selectivity of inner join, but, rather, that we are clamping the cardinality of semi-join to consider only the matching rows of the inner side (also, if that sentence is actually not correct, then some description that avoids confusion like that would be helpful). The new status of this patch is: Waiting on Author
pgsql-hackers by date: