Re: Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: Columns correlation and adaptive query optimization
Date
Msg-id e0474c11-cafe-1535-813e-15dae6b48365@postgrespro.ru
Whole thread Raw
In response to Re: Columns correlation and adaptive query optimization  (Rafia Sabih <rafia.pghackers@gmail.com>)
List pgsql-hackers
Thank you very much for review.

On 25.03.2020 20:04, Rafia Sabih wrote:
>
> +static void
> +AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
> +
>
> This doesn't look like the right place for it, you might want to
> declare it with other functions in the starting of the file.
>
> Also, there is no description about any of the functions here,
> wouldn’t hurt having some more comments there.

Sorry, I will fix it.
Actually this patch contains of two independent parts:
first allows to use auto_explain extension to generate mutlicolumn 
statistic for variables used in clauses
for which selectivity estimation gives wrong result. It affects only 
auto_explain extension.

Second part allows to use multicolumn statistic for join selectivity 
estimation.
As far as I know extended statistic is now actively improved:

https://www.postgresql.org/message-id/flat/20200309000157.ig5tcrynvaqu4ixd%40development#bfbdf9c41c31ef92819dfc5ecde4a67c

I think that using extended statistic for join selectivity is very 
important and should also be addressed.
If my approach is on so good, I will be pleased for other suggestions.

>
> A few of more questions that cross my mind at this point,
>
> - have you tried measuring the extra cost we have to pay for this
> mores statistics , and also compare it with the benefit it gives in
> terms of accuracy.
Adding statistic not always leads to performance improvement but I never 
observed any performance degradation caused by presence of extended 
statistic.
Definitely we can manually create too many extended statistic entries 
for different subsets of columns.
And it certainly increase planning time because optimizer has to 
consider more alternatives.
But in practice I never noticed such slowdown.

> - I would also be interested in understanding if there are cases when
> adding this extra step doesn’t help and have you excluded them already
> or if some of them are easily identifiable at this stage...?

Unfortunately there are many cases when extended statistic can not help.
Either because optimizer is not able to use it (for example my patch 
consider only cases with strict equality comparison,
but if you use predicate like "a.x=b.x and  a.y in (1,2,3)"  then 
extended statistic for <x,y> can not be used.
Either because collected statistic itself is not precise enough , 
especially in case of data skews.


> - is there any limit  on the number of columns for which this will
> work, or should there be any such limit...?
>
Right now there is limit for maximal number of columns used in extended 
statistic: 8 columns.
But in practice I rarely see join predicates involving more than 3 columns.



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: Patch: to pass query string to pg_plan_query()
Next
From: Justin Pryzby
Date:
Subject: Re: error context for vacuum to include block number