Home > mailing lists

Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

From	Konstantin Knizhnik
Subject	Columns correlation and adaptive query optimization
Date	October 14, 2019 19:43:10
Msg-id	196f1e1a-5464-ed07-ab3c-0c9920564af7@postgrespro.ru Whole thread Raw
Responses	Re: Columns correlation and adaptive query optimization
List	pgsql-hackers

Tree view

Hi hackers,

Errors in selectivity estimations is one of the main reason of bad plans 
generation by Postgres optimizer.
Postgres estimates selectivity based on the collected statistic 
(histograms).
While it is able to more or less precisely estimated selectivity of 
simple predicate for particular table,
it is much more difficult to estimate selectivity for result of join of 
several tables and for complex predicate consisting of several 
conjuncts/disjuncts
accessing different columns.

Postgres is not able to take in account correlation between columns 
unless correspondent multicolumn statistic is explicitly created.
But even if such statistic is created, it can not be used in join 
selectivity estimation.

The problem with adjusting selectivity using machine learning based on 
the results of EXPLAIN ANALYZE was address in AQO project:

https://github.com/postgrespro/aqo

There are still many issues with proposed AQO approach (for example, it 
doesn't take in account concrete constant values).
We are going  to continue its improvement.

But here I wan to propose much simpler patch which allows two things:
1. Use extended statistic in estimation of join selectivity
2. Create on demand multicolumn statistic in auto_explain extension if 
there is larger gap between real and estimated number of tuples for the 
concrete plan node.


create table inner_tab(x integer, y integer);
create table outer_tab(pk integer primary key, x integer, y integer);
create index on inner_tab(x,y);
insert into outer_tab values (generate_series(1,100000), 
generate_series(1,100000), generate_series(1,100000)*10);
insert into inner_tab values (generate_series(1,1000000)/10, 
generate_series(1,1000000)/10*10);
analyze inner_tab;
analyze outer_tab;


Without this patch:
explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.72..16.77 rows=1 width=12)
    ->  Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31 
rows=1 width=12)
          Index Cond: (pk = 1)
    ->  Index Only Scan using inner_tab_x_y_idx on inner_tab 
(cost=0.42..8.45 rows=1 width=8)
          Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)


With this patch:

load 'auto_explain';
set auto_explain.log_min_duration=0;
set auto_explain.add_statistics_threshold=10.0;
set auto_explain.log_analyze=on;
select * from outer_tab join inner_tab using(x,y) where pk=1;
analyze inner_tab;
analyze outer_tab;

explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.72..32.79 rows=10 width=12)
    ->  Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31 
rows=1 width=12)
          Index Cond: (pk = 1)
    ->  Index Only Scan using inner_tab_x_y_idx on inner_tab 
(cost=0.42..24.38 rows=10 width=8)
          Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)


As you can see now estimation of join result is correct (10).

I attached two patches: one for using extended statistic for join 
selectivity estimation and another for auto_explain to implicitly add 
this extended statistic on demand.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

pgsql-hackers by date:

From: vignesh C
Date: 14 October 2019, 19:18:12
Subject: Re: Non-Active links being referred in our source code

From: Jeremy Finzel
Date: 14 October 2019, 22:42:51
Subject: Re: BRIN index which is much faster never chosen by planner

Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

Attachment

Previous

Next