Re: Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

From Yugo NAGATA
Subject Re: Columns correlation and adaptive query optimization
Date
Msg-id 20210121213048.b7f958ad5c4beb941a48a7dc@sraoss.co.jp
Whole thread Raw
In response to Re: Columns correlation and adaptive query optimization  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Columns correlation and adaptive query optimization
List pgsql-hackers
Hello,

On Thu, 26 Mar 2020 18:49:51 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

> Attached please find new version of the patch with more comments and 
> descriptions added.

Adaptive query optimization is very interesting feature for me, so I looked
into this patch.  Here are some comments and questions.

(1)
This patch needs rebase because clauselist_selectivity was modified to improve
estimation of OR clauses.

(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one clauses,
and also functional dependencies between a var in the join qual and vars in quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will work
even if we don't assume the other.  I wonder it would be better to split the patch
again, and register them to commitfest separately.

(3)
+    DefineCustomBoolVariable("auto_explain.suggest_only",
+                             "Do not create statistic but just record in WAL suggested create statistics statement.",
+                             NULL,
+                             &auto_explain_suggest_on

To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.

(4)
+            /*
+             * Prevent concurrent access to extended statistic table
+             */
+            stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+            slot = table_slot_create(stat_rel, NULL);
+            scan = table_beginscan_catalog(stat_rel, 2, entry);
(snip)
+            table_close(stat_rel, AccessExclusiveLock);
+        }

When I tested the auto_explain part, I got the following WARNING.

 WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 2)
 WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 1)
 WARNING:  relcache reference leak: relation "pg_statistic_ext" not closed
 WARNING:  TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still referenced
 WARNING:  Snapshot reference leak: Snapshot 0x55c332c10418 still referenced

To suppress this, I think we need table_endscan(scan) and
ExecDropSingleTupleTableSlot(slot) before finishing this function.

(6)
+                    elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name,
create_stat_stmt,rel_name);
 

We should use ereport instead of elog for log messages.

(7)
+                        double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+                        if (dep != 0.0)
+                        {
+                            s1 *= dep + (1 - dep) * s2;
+                            continue;
+                        }

I found the following comment of clauselist_apply_dependencies():

  * we actually combine selectivities using the formula
  *
  *      P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

so, is it not necessary using the same formula in this patch? That is, 

   s1 *= dep + (1-dep) * s2                   (if s1 <= s2)
  s1 *= dep * (s2/s1) + (1-dep) * s2   (otherwise) .

(8)
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{

The comment mentions join_vars but the actual argument name is clauses_vars,
so it needs unification.

(9)
Currently, it only consider functional dependencies statistics. Can we also
consider multivariate MCV list, and is it useful?

(10)
To achieve adaptive query optimization  (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could auto_explain
be a infrastructure of AQO in future? Or, do you have any plan or idea to make
built-in infrastructure for AQO?

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



pgsql-hackers by date:

Previous
From: Dmitry Dolgov
Date:
Subject: Re: [HACKERS] [PATCH] Generic type subscripting
Next
From: Jürgen Purtz
Date:
Subject: Re: Additional Chapter for Tutorial - arch-dev.sgml