Re: Bad row estimation with indexed func returning bool - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Bad row estimation with indexed func returning bool
Date
Msg-id 6822.1441993722@sss.pgh.pa.us
Whole thread Raw
In response to Bad row estimation with indexed func returning bool  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: Bad row estimation with indexed func returning bool  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
> I faced a correlation problem on a query today and tried the usual trick
> consisting of using an functional index and rewriting the query to use it.

The core reason this isn't doing anything useful is that
clause_selectivity() is hard-wired to estimate the selectivity of a
top-level WHERE clause that is a function call as 0.3333333, no matter
what:
   else if (is_funcclause(clause))   {       /*        * This is not an operator, so we guess at the selectivity. THIS
ISA        * HACK TO GET V4 OUT THE DOOR.  FUNCS SHOULD BE ABLE TO HAVE        * SELECTIVITIES THEMSELVES.       -- JMH
7/9/92       */       s1 = (Selectivity) 0.3333333;   }
 

Adding per-function selectivity estimators, as Joe was presumably
envisioning, would be a sufficiently large amount of work that it's not
too surprising nobody's gotten around to it in twenty-three years.  (The
infrastructure maybe wouldn't be so bad, but where would the estimators
themselves come from, especially for user-defined functions?)

However, in the case at hand, the complaint basically is why aren't we
treating the boolean function expression like a boolean variable, and
looking to see if there are stats available for it, like this other
bit in clause_selectivity:
           /*            * A Var at the top of a clause must be a bool Var. This is            * equivalent to the
clausereln.attribute = 't', so we compute            * the selectivity as if that is what we have.            */
  s1 = restriction_selectivity(root,                                        BooleanEqualOperator,
                list_make2(var,                                                   makeBoolConst(true,
                                             false)),                                        InvalidOid,
                       varRelid);
 

Indeed you could argue that this ought to be the fallback behavior for
*any* unhandled case, not just function expressions.  Not sure if we'd
need to restrict it to single-relation expressions or not.

The implication of doing it like this would be that the default estimate
in the absence of any matching stats would be 0.5 (since eqsel defaults
to 1/ndistinct, and get_variable_numdistinct will report 2.0 for any
boolean-type expression it has no stats for).  That's not a huge change
from the existing 0.3333333 estimate, which seems pretty unprincipled
anyway ... but it would probably be enough to annoy people if we did it in
stable branches.  So I'd be inclined to propose changing this in HEAD and
maybe 9.5, but not further back.  (For non-function expressions, 0.5 is
the default already, so those would not change behavior.)

Comments?
        regards, tom lane



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: DBT-3 with SF=20 got failed
Next
From: Dean Rasheed
Date:
Subject: Re: RLS open items are vague and unactionable