Re: Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

From: Konstantin Knizhnik
Subject: Re: Columns correlation and adaptive query optimization
Date:
Msg-id: 1220e592-d50d-c015-890c-67c2e8a4e001@postgrespro.ru
In response to: Re: Columns correlation and adaptive query optimization (Yugo NAGATA <nagata@sraoss.co.jp>)
List: pgsql-hackers
On 27.01.2021 8:45, Yugo NAGATA wrote:
> On Mon, 25 Jan 2021 16:27:25 +0300
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
>
>> Hello,
>>
>> Thank you for the review.
>> My answers are inline.
> Thank you for updating the patch and answering my questions.
>
>>> (2)
>>> If I understand correctly, your proposal consists of the following two features.
>>>
>>> 1. Add a feature to auto_explain that creates an extended statistics object
>>> automatically if the error in the estimated number of rows is large.
>>>
>>> 2. Improve row number estimation of join results by considering functional
>>> dependencies between vars in the join qual if the qual has more than one clause,
>>> and also functional dependencies between a var in the join qual and vars in quals
>>> of the inner/outer relation.
>>>
>>> As you said, these two parts are independent of each other, so one feature will work
>>> even if we don't assume the other. I wonder whether it would be better to split the
>>> patch again and register the parts to the commitfest separately.
>> I agree with you that these are two almost unrelated changes, although
>> without the clausesel patch the additional statistics can not improve query planning.
> I think extended statistics created by the auto_explain patch can improve query
> planning even without the clausesel patch.
> For example, suppose the following case:
>
> postgres=# create table t ( i int, j int);
> CREATE TABLE
> postgres=# insert into t select i/10, i/100 from generate_series(1,1000000) i;
> INSERT 0 1000000
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
>                                            QUERY PLAN
> -------------------------------------------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
>    Filter: ((i = 100) AND (j = 10))
>    Rows Removed by Filter: 999990
>  Planning Time: 0.199 ms
>  Execution Time: 97.327 ms
> (5 rows)
>
> After applying the auto_explain patch (without the clausesel patch) and issuing
> the query, additional statistics were created:
>
> postgres=# select * from t where i = 100 and j = 10;
> LOG:  Add statistics t_i_j
>
> Then, after analyze, the row estimation was improved:
>
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
>    Filter: ((i = 100) AND (j = 10))
>    Rows Removed by Filter: 999990
>  Planning Time: 0.124 ms
>  Execution Time: 95.383 ms
> (5 rows)
>
> So, I think the auto_explain patch is useful on its own as a tool
> to detect a gap between the estimate and reality and adjust the plan. Also,
> the clausesel patch would be useful without the auto_explain patch
> if appropriate statistics are created.
>
>> But I already have too many patches in the commitfest.
>> Maybe it will be enough to split this patch into two?
> Although we can discuss both of these patches in this thread,
> I wonder whether we have to commit them together.
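For reference, the statistics object the patch creates automatically (t_i_j in the log above) corresponds to what can be created by hand today with standard PostgreSQL commands; a minimal sketch, reusing the table and column names from the example above:

```sql
-- Create an extended statistics object on the correlated columns i and j;
-- without a kind list, all supported kinds are collected.
CREATE STATISTICS t_i_j ON i, j FROM t;

-- The object is only populated on the next ANALYZE.
ANALYZE t;

-- Inspect the created object and the kinds it covers.
SELECT stxname, stxkind FROM pg_statistic_ext WHERE stxname = 't_i_j';
```

The patch automates exactly this step when the actual/estimated row ratio of a plan node exceeds the configured threshold.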
>>> (3)
>>> +    DefineCustomBoolVariable("auto_explain.suggest_only",
>>> +                             "Do not create statistics but just record in WAL suggested create statistics statement.",
>>> +                             NULL,
>>> +                             &auto_explain_suggest_on
>>>
>>> To imply that this parameter is related to add_statistics_threshold, it seems
>>> better to me to use a more related name like add_statistics_suggest_only.
>>>
>>> Also, additional documentation for the new parameters is required.
>> Done.
> +
> +  <varlistentry>
> +   <term>
> +    <varname>auto_explain.add_statistics_threshold</varname> (<type>real</type>)
> +    <indexterm>
> +     <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
> +    </indexterm>
> +   </term>
> +   <listitem>
> +    <para>
> +     <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
> +     the actual/estimated #rows ratio that triggers creation of multicolumn statistics
> +     for the related columns. It can be used for adaptive query optimization.
> +     If there is a large gap between the real and estimated number of tuples for a
> +     concrete plan node, then multicolumn statistics are created for the involved
> +     attributes. A zero value (the default) disables implicit creation of multicolumn
> +     statistics.
> +    </para>
> +   </listitem>
>
> I wonder whether we need to say that this parameter has no effect unless log_analyze
> is enabled, and that statistics are created only when the execution time exceeds
> log_min_duration, if these behaviors are intentional.
>
> In addition, additional statistics are created only if #rows is over-estimated
> and not if it is under-estimated. Although that seems good as a criterion for creating
> multicolumn statistics, since extended statistics are usually useful for fixing
> over-estimation, I am not sure that we can ignore the under-estimation case entirely.
>
>>> (9)
>>> Currently, it only considers functional dependencies statistics. Can we also
>>> consider the multivariate MCV list, and is it useful?
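Putting the parameters discussed above together, the feature would be configured alongside the existing auto_explain settings; a hypothetical postgresql.conf fragment, assuming the parameter names from the patch (suggest_only vs. the proposed add_statistics_suggest_only is still under discussion):

```
# Load the extension and enable timing of executed plans;
# per the review, add_statistics_threshold has no effect without log_analyze,
# and statistics are created only for statements exceeding log_min_duration.
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '100ms'
auto_explain.log_analyze = on

# Create multicolumn statistics when the actual/estimated rows ratio exceeds 10.
auto_explain.add_statistics_threshold = 10.0

# Only log a suggested CREATE STATISTICS statement instead of executing it.
auto_explain.suggest_only = on
```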
>> Right now auto_explain creates statistics without an explicit specification
>> of the statistics kind.
>> According to the documentation, all supported statistics kinds should be
>> created in this case.
> Yes, auto_explain creates all kinds of extended statistics. However,
> IIUC, the clausesel patch uses only functional dependencies statistics for
> improving joins, so my question was about the possibility of considering MCV
> in the clausesel patch.
>
>>> (10)
>>> To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
>>> to use auto_explain for getting feedback from actual results. So, could auto_explain
>>> be an infrastructure for AQO in the future? Or do you have any plan or idea for
>>> building a built-in infrastructure for AQO?
>> Sorry, I do not have an answer to this question.
>> I just patched the auto_explain extension because it already does half of the
>> required work (analyzing expensive statements).
>> It can certainly be moved to a separate extension. In this case it would
>> partly duplicate the existing functionality and settings of auto_explain
>> (like the statement execution time threshold). I am not sure that this is good.
>> But on the other hand, this patch of mine makes the auto_explain extension
>> do some unexpected work...
> I think that auto_explain is an extension originally aimed at detecting
> and logging plans that take a long time, so it doesn't seem so unnatural to
> me to use it for improving such plans. Especially, the feature to find
> tunable points in executed plans seems useful.
>
>> Actually, the task of adaptive query optimization is much bigger.
>> We have a separate AQO extension which tries to use machine learning to
>> correctly adjust estimations.
>> This patch of mine is much simpler and uses an existing mechanism (extended
>> statistics) to improve estimations.
> Well, this patch provides a kind of AQO as an auto_explain feature, but this
> is independent of the AQO extension. Is that right?
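For context on point (9): the kinds collected by a statistics object can be restricted at creation time, so limiting auto-created objects to the kinds the clausesel patch actually consumes would be possible with standard syntax. A sketch, with the table and column names from the earlier example assumed:

```sql
-- Restrict the object to functional dependencies only, the kind the
-- clausesel patch currently uses for join estimation:
CREATE STATISTICS t_ij_dep (dependencies) ON i, j FROM t;

-- Or additionally collect the multivariate MCV list, which the planner
-- can already use for multi-clause scan selectivity:
CREATE STATISTICS t_ij_dep_mcv (dependencies, mcv) ON i, j FROM t;
```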
> Anyway, I'm interested in the AQO extension, so I'll look into this, too.
>
> Regards,
> Yugo Nagata

I have updated the documentation as you suggested and submitted the patch for the
auto_explain extension to the next commitfest.
I will create a separate thread for improving join selectivity estimation using
extended statistics.

> Well, this patch provides a kind of AQO as an auto_explain feature, but this
> is independent of the AQO extension. Is that right?

Yes. The basic idea is the same, but the approaches are different.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company