Improving planner variable handling - Mailing list pgsql-hackers

From Tom Lane
Subject Improving planner variable handling
Date
Msg-id 19778.1208308860@sss.pgh.pa.us
Whole thread Raw
Responses Re: Improving planner variable handling  (Heikki Linnakangas <heikki@enterprisedb.com>)
Re: Improving planner variable handling  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Improving planner variable handling  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I've been thinking about how to improve the planner's poor handling of
variables in outer-join situations.  Here are some past examples for
motivation:

http://archives.postgresql.org/pgsql-hackers/2006-02/msg00154.php
http://archives.postgresql.org/pgsql-general/2008-03/msg01440.php

The reason why the planner acts so stupidly in these examples is that
we're still using a kluge solution for this old bug:
http://archives.postgresql.org/pgsql-bugs/2001-04/msg00223.php

The root of the problem is that the planner does not worry about computing
output expressions until the top of a plan tree.  All lower-level join
nodes are made to output simple lists of Vars referencing columns of the
base relations of the join.  We handle outer-join cases by forcing the
values of the Vars of the nullable side to null at the level of the join,
whenever there's no matching row in the nullable side.  If one of the base
relations of the join is a sub-SELECT whose output list includes
expressions that don't certainly go to null when the input variables are
forced to null, then we can't flatten that sub-SELECT, because flattening
the sub-SELECT means that the expression evaluations bubble to the top of
the plan tree and can produce non-null results when they shouldn't
(as happened in the above bug, before we realized that we had to prevent
flattening in this case).

Another problem with this approach is that depending on what level of the
plan tree you are thinking about, a Var nominally referencing tab.col
might really mean the value of tab.col, or it might mean "either tab.col
or NULL depending on what happened at some lower level of outer join".
Since we can't readily tell the difference, we have estimation errors
arising from failure to expect some NULLs (there have been recent
complaints about this), and we need some pretty ugly kluges in places like
EquivalenceClass processing to handle the risk that apparently identical
expressions might not really be equal.

I think the basic solution for this is that upper levels of the plan tree
should refer to the nullable output columns of an outer join using
"alias Vars" that name the join rel, not the underlying base relation,
even if there is a simple base-relation Var that the alias is tracking.
In the case involving a sub-SELECT, the alias Var would stand for whatever
output expression appears in the sub-SELECT.  We already have the concept
of these alias Vars, in fact --- that's exactly the representation emitted
by the parser.  But historically the planner has smashed aliases down to
their base Vars as early as possible (see flatten_join_alias_vars).
That has some advantages but I'm thinking it's outweighed by the
disadvantages.  I'd like to try leaving alias Vars as aliases all the
way through the planner, in any case where they might be semantically
different from their referent (ie, whenever there's a possible
force-to-null involved).

To make this work, we'd need to have the constructed plan tree compute the
alias Var from its referent expression at the lowest outer-join that can
null the alias Var.  The trick for the executor is to know when to force
the value to null instead of computing the expression.  I first thought
about marking entries of the join node's targetlist as to be forced to
null if left or right input row is null.  However, that fails if we want
the join node to compute some projection expressions on top of the raw
join output (as would certainly happen if it were the top node of the
tree, for example).  That could be handled by inserting another level of
plan node (ie, a Result) to do the projection, but that seems a pretty
ugly and inefficient solution.  What I have in mind instead is to insert a
new kind of expression node "ForceToNull" atop the referent expression,
with this node able to look at the EState (in the same way a regular Var
node would) to see if it should return a null instead of computing its
child expression.  Then expansion of an alias Var into a ForceToNull and
the underlying expression would work.

I'm envisioning keeping track of active alias Vars and their expansions in
a new list attached to the PlannerInfo "root" node.  This would provide a
place to record important information like which level of the join tree
such a Var needs to be evaluated at.

This is all pretty handwavy yet, but I don't think I'll be able to fill in
many more details until I try to code it.  I thought I'd put up this
summary to see if anyone can shoot holes in it at this level of detail ...
Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: DROP DATABASE vs patch to not remove files right away
Next
From: Tom Lane
Date:
Subject: Re: DROP DATABASE vs patch to not remove files right away