Re: Improve join selectivity estimation using extended statistics - Mailing list pgsql-hackers

From Ibrar Ahmed
Subject Re: Improve join selectivity estimation using extended statistics
Date
Msg-id CALtqXTdiZ6Aum8xo8vJPcs+oXx5wfNMox3NBtW4yfYU8PGaT1Q@mail.gmail.com
Whole thread Raw
In response to Improve join selectivity estimation using extended statistics  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Improve join selectivity estimation using extended statistics  (Daniel Gustafsson <daniel@yesql.se>)
List pgsql-hackers


On Mon, Mar 15, 2021 at 8:42 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:


On 11.03.2021 03:47, Tomas Vondra wrote:
> Hi Konstantin,
>
> Thanks for working on this! Using extended statistics to improve join
> cardinality estimates was definitely on my radar, and this patch seems
> like a good start.
>
> I had two basic ideas about how we might improve join estimates:
>
> (a) use per-table extended statistics to estimate join conditions
>
> (b) invent multi-table extended statistics (requires inventing how to
> sample the tables in a coordinated way, etc.)
>
> This patch aims to do (a) which is perfectly reasonable - I think we can
> achieve significant improvements this way. I have some ideas about (b),
> but it seems harder and for a separate thread/patch.
>
>
> The patch includes some *very* interesting ideas, but I think it's does
> them too late and at the wrong level of abstraction. I mean that:
>
> 1) I don't think the code in clausesel.c should deal with extended
> statistics directly - it requires far too much knowledge about different
> types of extended stats, what clauses are supported by them, etc.
> Allowing stats on expressions will make this even worse.
>
> Better do that in extended_stats.c, like statext_clauselist_selectivity.
>
> 2) in clauselist_selectivity_ext, functional dependencies are applied in
> the part that processes remaining clauses, not estimated using extended
> statistics. That seems a bit confusing, and I suspect it may lead to
> issues - for example, it only processes the clauses incrementally, in a
> particular order. That probably affects the  result, because it affects
> which functional dependencies we can apply.
>
> In the example query that's not an issue, because it only has two Vars,
> so it either can't apply anything (with one Var) or it can apply
> everything (with two Vars). But with 3 or more Vars the order would
> certainly matter, so it's problematic.
>
>
> Moreover, it seems a bit strange that this considers dependencies only
> on the inner relation. Can't that lead to issues with different join
> orders producing different cardinality estimates?
>
>
> I think a better approach would be to either modify the existing block
> dealing with extended stats for a single relation to also handle join
> conditions. Or perhaps we should invent a separate block, dealing with
> *pairs* of relations? And it should deal with *all* join clauses for
> that pair of relations at once, not one by one.
>
> As for the exact implementation, I'd imagine we call overall logic to be
> something like (for clauses on two joined relations):
>
> - pick a subset of clauses with the same type of extended statistics on
> both sides (MCV, ndistinct, ...), repeat until we can't apply more
> statistics
>
> - estimate remaining clauses either using functional dependencies or in
> the regular (old) way
>
>
> As for how to use other types of extended statistics, I think eqjoinsel
> could serve as an inspiration. We should look for an MCV list and
> ndistinct stats on both sides of the join (possibly on some subset of
> clauses), and then do the same thing eqjoinsel does, just with multiple
> columns.
>
> Note: I'm not sure what to do when we find the stats only on one side.
> Perhaps we should assume the other side does not have correlations and
> use per-column statistics (seems reasonable), or maybe just not apply
> anything (seems a bit too harsh).
>
> Anyway, if there are some non-estimated clauses, we could try applying
> functional dependencies similarly to what this patch does. It's also
> consistent with statext_clauselist_selectivity - that also tries to
> apply MCV lists first, and only then we try functional dependencies.
>
>
> BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the
> selectivity function for restriction (non-join) clauses, so maybe we
> should be looking at oprjoin when dealing with joins? Not sure.
>
>
> One bit that I find *very* interesting is the calc_joinrel_size_estimate
> part, with this comment:
>
>    /*
>     * Try to take in account functional dependencies between attributes
>     * of clauses pushed-down to joined relations and retstrictlist
>     * clause. Right now we consider only case of restrictlist consists of
>     * one clause.
>     */
>
> If I understand the comment and the code after it, it essentially tries
> to apply extended statistics from both the join clauses and filters at
> the relation level. That is, with a query like
>
>      SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10
>
> we would be looking at statistics on t1(a,b), because we're interested
> in estimating conditional probability distribution
>
>     P(t1.a = a? | t1.b = 10)
>
> I think that's extremely interesting and powerful, because it allows us
> to "restrict" the multi-column MCV lists, we could probably estimate
> number of distinct "a" values in rows with "b=10" like:
>
>      ndistinct(a,b) / ndistinct(b)
>
> and do various interesting stuff like this.
>
> That will require some improvements to the extended statistics code (to
> allow passing a list of conditions), but that's quite doable. I think
> the code actually did something like that originally ;-)
>
>
> Obviously, none of this is achievable for PG14, as we're in the middle
> of the last CF. But if you're interested in working on this for PG15,
> I'd love to cooperate on that.
>
>
> regards
>
Hi Tomas,
Thank you for review of my patch.
My primary attention was to implement some kid of adaptive query
optimization based online_analyze extension and building extended
statistic on demand.
I have change clausesel.c because right now extended statistic is not
used for join selectivity estimation and manual or automatic adding such
statistic can help to
choose more efficient plan for queries with joins.
I agree wit you that it can be done in better way, handling more use cases.
I will be glad to cooperate with you in improving join selectivity
estimation using extended statistic.



The patch does not compile, and needs your attention.

https://cirrus-ci.com/task/6397726985289728

clausesel.c:74:28: error: too few arguments to function ‘choose_best_statistics’
StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
^~~~~~~~~~~~~~~~~~~~~~
In file included from clausesel.c:24:
../../../../src/include/statistics/statistics.h:123:26: note: declared here
exter


I am changing the status to "Waiting on Author".


--
Ibrar Ahmed

pgsql-hackers by date:

Previous
From: Ibrar Ahmed
Date:
Subject: Re: shared-memory based stats collector
Next
From: Dilip Kumar
Date:
Subject: Re: row filtering for logical replication