pgsql: Improve planner's cost estimation in the presence of semijoins. - Mailing list pgsql-committers

From Tom Lane
Subject pgsql: Improve planner's cost estimation in the presence of semijoins.
Date
Msg-id E1YVroA-0005Ji-0Y@gemulon.postgresql.org
Whole thread Raw
List pgsql-committers
Improve planner's cost estimation in the presence of semijoins.

If we have a semijoin, say
    SELECT * FROM x WHERE x1 IN (SELECT y1 FROM y)
and we're estimating the cost of a parameterized indexscan on x, the number
of repetitions of the indexscan should not be taken as the size of y; it'll
really only be the number of distinct values of y1, because the only valid
plan with y on the outside of a nestloop would require y to be unique-ified
before joining it to x.  Most of the time this doesn't make that much
difference, but sometimes it can lead to drastically underestimating the
cost of the indexscan and hence choosing a bad plan, as pointed out by
David Kubečka.

Fixing this is a bit difficult because parameterized indexscans are costed
out quite early in the planning process, before we have the information
that would be needed to call estimate_num_groups() and thereby estimate the
number of distinct values of the join column(s).  However we can move the
code that extracts a semijoin RHS's unique-ification columns, so that it's
done in initsplan.c rather than on-the-fly in create_unique_path().  That
shouldn't make any difference speed-wise and it's really a bit cleaner too.

The other bit of information we need is the size of the semijoin RHS,
which is easy if it's a single relation (we make those estimates before
considering indexscan costs) but problematic if it's a join relation.
The solution adopted here is just to use the product of the sizes of the
join component rels.  That will generally be an overestimate, but since
estimate_num_groups() only uses this input as a clamp, an overestimate
shouldn't hurt us too badly.  In any case we don't allow this new logic
to produce a value larger than we would have chosen before, so that at
worst an overestimate leaves us no wiser than we were before.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/b55722692ba0ceb934bb32bcddb562e2120f43dd

Modified Files
--------------
src/backend/nodes/copyfuncs.c          |    5 +-
src/backend/nodes/equalfuncs.c         |    5 +-
src/backend/nodes/outfuncs.c           |    5 +-
src/backend/optimizer/path/costsize.c  |   10 +-
src/backend/optimizer/path/indxpath.c  |  157 ++++++++++++++++++-----
src/backend/optimizer/path/joinrels.c  |    5 +-
src/backend/optimizer/plan/initsplan.c |  181 ++++++++++++++++++++++++++-
src/backend/optimizer/util/orclauses.c |    5 +-
src/backend/optimizer/util/pathnode.c  |  213 ++++++--------------------------
src/include/nodes/relation.h           |   23 ++--
10 files changed, 385 insertions(+), 224 deletions(-)


pgsql-committers by date:

Previous
From: Peter Eisentraut
Date:
Subject: pgsql: PL/Python: Fix regression tests for Python 3
Next
From: Tom Lane
Date:
Subject: pgsql: Fix old bug in get_loop_count().