pgsql: Improve estimation of OR clauses using extended statistics. - Mailing list pgsql-committers

From Dean Rasheed
Subject pgsql: Improve estimation of OR clauses using extended statistics.
Date
Msg-id E1kklVp-00019B-7I@gemulon.postgresql.org
Whole thread Raw
List pgsql-committers
Improve estimation of OR clauses using extended statistics.

Formerly we only applied extended statistics to an OR clause as part
of the clauselist_selectivity() code path for an OR clause appearing
in an implicitly-ANDed list of clauses. This meant that it could only
use extended statistics if all sub-clauses of the OR clause were
covered by a single extended statistics object.

Instead, teach clause_selectivity() how to apply extended statistics
to an OR clause by handling its ORed list of sub-clauses in a similar
manner to an implicitly-ANDed list of sub-clauses, but with different
combination rules. This allows one or more extended statistics objects
to be used to estimate all or part of the list of sub-clauses. Any
remaining sub-clauses are then treated as if they are independent.

Additionally, to avoid double-application of extended statistics, this
introduces "extended" versions of clause_selectivity() and
clauselist_selectivity(), which include an option to ignore extended
statistics. This replaces the old clauselist_selectivity_simple()
function which failed to completely ignore extended statistics when
called from the extended statistics code.

A known limitation of the current infrastructure is that an AND clause
under an OR clause is not treated as compatible with extended
statistics (because we don't build RestrictInfos for such sub-AND
clauses). Thus, for example, "(a=1 AND b=1) OR (a=2 AND b=2)" will
currently be treated as two independent AND clauses (each of which may
be estimated using extended statistics), but extended statistics will
not currently be used to account for any possible overlap between
those clauses. Improving that is left as a task for the future.

Original patch by Tomas Vondra, with additional improvements by me.

Discussion: https://postgr.es/m/20200113230008.g67iyk4cs3xbnjju@development

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/25a9e54d2db31b8031c2d8166114b187e8347098

Modified Files
--------------
src/backend/optimizer/path/clausesel.c           | 301 +++++++++++++++--------
src/backend/statistics/dependencies.c            |   4 +-
src/backend/statistics/extended_stats.c          | 217 +++++++++++-----
src/backend/statistics/mcv.c                     | 169 ++++++++++++-
src/include/optimizer/optimizer.h                |  18 +-
src/include/statistics/extended_stats_internal.h |  15 ++
src/include/statistics/statistics.h              |   3 +-
src/test/regress/expected/stats_ext.out          | 189 +++++++++++++-
src/test/regress/sql/stats_ext.sql               |  85 ++++++-
9 files changed, 798 insertions(+), 203 deletions(-)


pgsql-committers by date:

Previous
From: Michael Paquier
Date:
Subject: pgsql: Refactor CLUSTER and REINDEX grammar to use DefElem for option l
Next
From: Peter Eisentraut
Date:
Subject: pgsql: Small code simplifications