Re: Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: Columns correlation and adaptive query optimization
Date
Msg-id 1220e592-d50d-c015-890c-67c2e8a4e001@postgrespro.ru
Whole thread Raw
In response to Re: Columns correlation and adaptive query optimization  (Yugo NAGATA <nagata@sraoss.co.jp>)
List pgsql-hackers

On 27.01.2021 8:45, Yugo NAGATA wrote:
> On Mon, 25 Jan 2021 16:27:25 +0300
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
>
>> Hello,
>>
>> Thank you for review.
>> My answers are inside.
> Thank you for updating the patch and answering my questions.
>
>>> (2)
>>> If I understand correctly, your proposal consists of the following two features.
>>>
>>> 1. Add a feature to auto_explain that creates an extended statistic automatically
>>> if an error on estimated rows number is large.
>>>
>>> 2. Improve rows number estimation of join results by considering functional
>>> dependencies between vars in the join qual if the qual has more than one clauses,
>>> and also functional dependencies between a var in the join qual and vars in quals
>>> of the inner/outer relation.
>>>
>>> As you said, these two parts are independent each other, so one feature will work
>>> even if we don't assume the other.  I wonder it would be better to split the patch
>>> again, and register them to commitfest separately.
>> I agree with you that this are two almost unrelated changes, although
>> without clausesel patch additional statistic can not improve query planning.
> I think extended statistics created by the auto_explain patch can improve query
> planning even without the clausesel patch. For example, suppose the following case:
>
> postgres=# create table t ( i int, j int);
> CREATE TABLE
> postgres=# insert into t select i/10, i/100 from  generate_series(1,1000000) i;
> INSERT 0 1000000
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
>                                             QUERY PLAN
> -------------------------------------------------------------------------------------------------
>   Seq Scan on t  (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
>     Filter: ((i = 100) AND (j = 10))
>     Rows Removed by Filter: 999990
>   Planning Time: 0.199 ms
>   Execution Time: 97.327 ms
> (5 rows)
>
> After applying the auto_explain patch (without clausesel patch) and issuing the query,
> additional statistics were created.
>
> postgres=# select * from t where i = 100 and j = 10;
> LOG:  Add statistics t_i_j
>
> Then, after analyze, the row estimation was improved.
>
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
> postgres=# explain analyze select * from t where i = 100 and j = 10;
>                                              QUERY PLAN
> --------------------------------------------------------------------------------------------------
>   Seq Scan on t  (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
>     Filter: ((i = 100) AND (j = 10))
>     Rows Removed by Filter: 999990
>   Planning Time: 0.124 ms
>   Execution Time: 95.383 ms
> (5 rows)
>
> So, I think the auto_explain patch is useful with just that as a tool
> to detect a gap between estimate and real and adjust the plan. Also,
> the clausesel patch would be useful without the auto_explain patch
> if an appropriate statistics are created.
>
>> But I already have too many patches at commitfest.
>> May be it will be enough to spit this patch into two?
> Although we can discuss both of these patches in this thread,
> I wonder we don't have to commit them together.
>
>>> (3)
>>> +    DefineCustomBoolVariable("auto_explain.suggest_only",
>>> +                             "Do not create statistic but just record in WAL suggested create statistics
statement.",
>>> +                             NULL,
>>> +                             &auto_explain_suggest_on
>>>
>>> To imply that this parameter is involving to add_statistics_threshold, it seems
>>> better for me to use more related name like add_statistics_suggest_only.
>>>
>>> Also, additional documentations for new parameters are required.
>> Done.
> +
> +   <varlistentry>
> +    <term>
> +     <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
> +     <indexterm>
> +      <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
> +     </indexterm>
> +    </term>
> +    <listitem>
> +     <para>
> +       <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
> +       actual/estimated #rows ratio triggering creation of multicolumn statistic
> +       for the related columns. It can be used for adpative query optimization.
> +       If there is large gap between real and estimated number of tuples for the
> +       concrete plan node, then multicolumn statistic is created for involved
> +       attributes. Zero value (default) disables implicit creation of multicolumn statistic.
> +     </para>
> +    </listitem>
>
> I wonder we need to say that this parameter has no effect unless log_analyze
> is enabled and that statistics are created only when the excution time exceeds
> log_min_duration, if these behaviors are intentional.
>
> In addition, additional statistics are created only if #rows is over-estimated
> and not if it is under-estimated. Although it seems good as a criterion for creating
> multicolumn statistic since extended statisstic is usually useful to fix over-estimation,
> I am not sure if we don't have to consider under-estimation case at all.
>
>
>>> (9)
>>> Currently, it only consider functional dependencies statistics. Can we also
>>> consider multivariate MCV list, and is it useful?
>>
>> Right now auto_explain create statistic without explicit specification
>> of statistic kind.
>> According to the documentation all supported statistics kinds should be
>> created in this case:
> Yes, auto_explain creates all kinds of extended statistics. However,
> IIUC, the clausesel patch uses only functional dependencies statistics for
> improving join, so my question was about possibility to consider MCV in the
> clausesel patch.
>
>>> (10)
>>> To achieve adaptive query optimization  (AQO) in PostgreSQL, this patch proposes
>>> to use auto_explain for getting feedback from actual results. So, could auto_explain
>>> be a infrastructure of AQO in future? Or, do you have any plan or idea to make
>>> built-in infrastructure for AQO?
>> Sorry, I do not have answer for this question.
>> I just patched auto_explain extension because it is doing  half of the
>> required work (analyze  expensive statements).
>> It can be certainly moved to separate extension. In this case it will
>> party duplicate existed functionality and
>> settings of auto_explain (like statement execution time threshold). I am
>> not sure that it is good.
>> But from the other side, this my patch makes auto_explain extension to
>> do some unexpected work...
> I think that auto_explain is an extension originally for aiming to detect
> and log plans that take a long time, so it doesn't seem so unnatural for
> me to use this for improving such plans. Especially, the feature to find
> tunable points in executed plans seems useful.
>
>> Actually task of adaptive query optimization is much bigger.
>> We have separate AQO extension which tries to use machine learning to
>> correctly adjust estimations.
>> This my patch is much simpler and use existed mechanism (extended
>> statistics) to improve estimations.
> Well, this patch provide a kind of AQO as auto_explain feature, but this
> is independent of the AQO extension. Is it right?
> Anyway, I'm interested in the AQO extension, so I'll look into this, too.
>
> Regards,
> Yugo Nagata
>

I have updated documentation as you suggested and submit patch for 
auto_explain extension for the next commitfest.
I will create separate  thread for improving join selectivity estimation 
using extended statistics.

> Well, this patch provide a kind of AQO as auto_explain feature, but this
> is independent of the AQO extension. Is it right?
Yes. The basic idea is the same, but approaches are different.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

pgsql-hackers by date:

Previous
From: Anastasia Lubennikova
Date:
Subject: Re: pg_upgrade fails with non-standard ACL
Next
From: Tomas Vondra
Date:
Subject: Re: [PoC] Non-volatile WAL buffer