Re: Bitmap scan is undercosted? - Mailing list pgsql-performance

From Tom Lane
Subject Re: Bitmap scan is undercosted?
Date
Msg-id 11630.1512342526@sss.pgh.pa.us
Whole thread Raw
In response to Re: Bitmap scan is undercosted?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
I wrote:
> I tried creating multiple-column statistics using the v10 facility for
> that:
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again.  But I've not
> looked closer yet.

After looking, I found that indeed dependency_is_compatible_clause()
rejects expressions like "flag" or "NOT flag", which it needn't since
those are fully equivalent to "flag = true" or "flag = false"
respectively.  Moreover there's nothing else in
dependencies_clauselist_selectivity that depends on the exact form of
the clause under test, only on the semantics that it's an equality
condition on some Var.  Hence I propose the attached patch, which
fixes the rowcount estimate for the example discussed in this thread:

create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
generate_series(1, 10000000) id;
create index i1 on aaa  (num);
create index i2 on aaa  (flag);
create statistics s1 on num, flag from aaa;
analyze aaa;

explain analyze select count(*) from aaa where num = 1 and flag = true;

Without patch:

 Aggregate  (cost=43236.73..43236.74 rows=1 width=8) (actual time=349.365..349.3
65 rows=1 loops=1)
   ->  Bitmap Heap Scan on aaa  (cost=20086.40..43212.94 rows=9514 width=0) (act
ual time=101.308..337.985 rows=100000 loops=1)
         Recheck Cond: (num = 1)
         Filter: flag
         Heap Blocks: exact=44248
         ->  BitmapAnd  (cost=20086.40..20086.40 rows=9514 width=0) (actual time
=92.214..92.214 rows=0 loops=1)
               ->  Bitmap Index Scan on i1  (cost=0.00..1776.43 rows=96000 width
=0) (actual time=17.236..17.236 rows=100000 loops=1)
                     Index Cond: (num = 1)
               ->  Bitmap Index Scan on i2  (cost=0.00..18304.96 rows=991003 wid
th=0) (actual time=72.168..72.168 rows=1000000 loops=1)
                     Index Cond: (flag = true)
 Planning time: 0.254 ms
 Execution time: 350.796 ms

With patch:

 Aggregate  (cost=43496.19..43496.20 rows=1 width=8) (actual time=359.195..359.1
95 rows=1 loops=1)
   ->  Bitmap Heap Scan on aaa  (cost=20129.64..43256.19 rows=96000 width=0) (ac
tual time=99.750..347.353 rows=100000 loops=1)
         Recheck Cond: (num = 1)
         Filter: flag
         Heap Blocks: exact=44248
         ->  BitmapAnd  (cost=20129.64..20129.64 rows=9514 width=0) (actual time
=90.671..90.671 rows=0 loops=1)
               ->  Bitmap Index Scan on i1  (cost=0.00..1776.43 rows=96000 width
=0) (actual time=16.946..16.946 rows=100000 loops=1)
                     Index Cond: (num = 1)
               ->  Bitmap Index Scan on i2  (cost=0.00..18304.96 rows=991003 wid
th=0) (actual time=70.898..70.898 rows=1000000 loops=1)
                     Index Cond: (flag = true)
 Planning time: 0.218 ms
 Execution time: 360.608 ms

That's the right overall rowcount estimate for the scan, given the stats
it's working from.  There's apparently still something wrong with bitmap
costing, since it's still estimating this as cheaper than the single-index
case --- noting the bogus rowcount estimate for the BitmapAnd, I suspect
that bitmap costing is taking shortcuts rather than using
clauselist_selectivity to estimate the overall selectivity of the bitmap
conditions.  But whatever is causing that, it's independent of this
deficiency.

In addition to the bugfix proper, I improved some comments, got rid of
a NumRelids() test that's redundant with the preceding bms_membership()
test, and fixed dependencies_clauselist_selectivity so that
estimatedclauses actually is a pure output argument as stated by its
API contract.

            regards, tom lane

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 9756fb8..ae0f304 100644
*** a/src/backend/statistics/dependencies.c
--- b/src/backend/statistics/dependencies.c
*************** pg_dependencies_send(PG_FUNCTION_ARGS)
*** 736,826 ****
   * dependency_is_compatible_clause
   *        Determines if the clause is compatible with functional dependencies
   *
!  * Only OpExprs with two arguments using an equality operator are supported.
!  * When returning True attnum is set to the attribute number of the Var within
!  * the supported clause.
!  *
!  * Currently we only support Var = Const, or Const = Var. It may be possible
!  * to expand on this later.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
      RestrictInfo *rinfo = (RestrictInfo *) clause;

      if (!IsA(rinfo, RestrictInfo))
          return false;

!     /* Pseudoconstants are not really interesting here. */
      if (rinfo->pseudoconstant)
          return false;

!     /* clauses referencing multiple varnos are incompatible */
      if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
          return false;

      if (is_opclause(rinfo->clause))
      {
          OpExpr       *expr = (OpExpr *) rinfo->clause;
-         Var           *var;
-         bool        varonleft = true;
-         bool        ok;

!         /* Only expressions with two arguments are considered compatible. */
          if (list_length(expr->args) != 2)
              return false;

!         /* see if it actually has the right */
!         ok = (NumRelids((Node *) expr) == 1) &&
!             (is_pseudo_constant_clause(lsecond(expr->args)) ||
!              (varonleft = false,
!               is_pseudo_constant_clause(linitial(expr->args))));
!
!         /* unsupported structure (two variables or so) */
!         if (!ok)
              return false;

          /*
!          * If it's not "=" operator, just ignore the clause, as it's not
           * compatible with functional dependencies.
           *
           * This uses the function for estimating selectivity, not the operator
           * directly (a bit awkward, but well ...).
           */
          if (get_oprrest(expr->opno) != F_EQSEL)
              return false;

!         var = (varonleft) ? linitial(expr->args) : lsecond(expr->args);
!
          /*
!          * We may ignore any RelabelType node above the operand.  (There won't
!          * be more than one, since eval_const_expressions() has been applied
!          * already.)
           */
!         if (IsA(var, RelabelType))
!             var = (Var *) ((RelabelType *) var)->arg;

!         /* We only support plain Vars for now */
!         if (!IsA(var, Var))
!             return false;

!         /* Ensure var is from the correct relation */
!         if (var->varno != relid)
!             return false;

!         /* we also better ensure the Var is from the current level */
!         if (var->varlevelsup > 0)
!             return false;

!         /* Also skip system attributes (we don't allow stats on those). */
!         if (!AttrNumberIsForUserDefinedAttr(var->varattno))
!             return false;

!         *attnum = var->varattno;
!         return true;
!     }

!     return false;
  }

  /*
--- 736,839 ----
   * dependency_is_compatible_clause
   *        Determines if the clause is compatible with functional dependencies
   *
!  * Only clauses that have the form of equality to a pseudoconstant, or can be
!  * interpreted that way, are currently accepted.  Furthermore the variable
!  * part of the clause must be a simple Var belonging to the specified
!  * relation, whose attribute number we return in *attnum on success.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
      RestrictInfo *rinfo = (RestrictInfo *) clause;
+     Var           *var;

      if (!IsA(rinfo, RestrictInfo))
          return false;

!     /* Pseudoconstants are not interesting (they couldn't contain a Var) */
      if (rinfo->pseudoconstant)
          return false;

!     /* Clauses referencing multiple, or no, varnos are incompatible */
      if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
          return false;

      if (is_opclause(rinfo->clause))
      {
+         /* If it's an opclause, check for Var = Const or Const = Var. */
          OpExpr       *expr = (OpExpr *) rinfo->clause;

!         /* Only expressions with two arguments are candidates. */
          if (list_length(expr->args) != 2)
              return false;

!         /* Make sure non-selected argument is a pseudoconstant. */
!         if (is_pseudo_constant_clause(lsecond(expr->args)))
!             var = linitial(expr->args);
!         else if (is_pseudo_constant_clause(linitial(expr->args)))
!             var = lsecond(expr->args);
!         else
              return false;

          /*
!          * If it's not an "=" operator, just ignore the clause, as it's not
           * compatible with functional dependencies.
           *
           * This uses the function for estimating selectivity, not the operator
           * directly (a bit awkward, but well ...).
+          *
+          * XXX this is pretty dubious; probably it'd be better to check btree
+          * or hash opclass membership, so as not to be fooled by custom
+          * selectivity functions, and to be more consistent with decisions
+          * elsewhere in the planner.
           */
          if (get_oprrest(expr->opno) != F_EQSEL)
              return false;

!         /* OK to proceed with checking "var" */
!     }
!     else if (not_clause((Node *) rinfo->clause))
!     {
          /*
!          * "NOT x" can be interpreted as "x = false", so get the argument and
!          * proceed with seeing if it's a suitable Var.
           */
!         var = (Var *) get_notclausearg(rinfo->clause);
!     }
!     else
!     {
!         /*
!          * A boolean expression "x" can be interpreted as "x = true", so
!          * proceed with seeing if it's a suitable Var.
!          */
!         var = (Var *) rinfo->clause;
!     }

!     /*
!      * We may ignore any RelabelType node above the operand.  (There won't be
!      * more than one, since eval_const_expressions has been applied already.)
!      */
!     if (IsA(var, RelabelType))
!         var = (Var *) ((RelabelType *) var)->arg;

!     /* We only support plain Vars for now */
!     if (!IsA(var, Var))
!         return false;

!     /* Ensure Var is from the correct relation */
!     if (var->varno != relid)
!         return false;

!     /* We also better ensure the Var is from the current level */
!     if (var->varlevelsup != 0)
!         return false;

!     /* Also ignore system attributes (we don't allow stats on those) */
!     if (!AttrNumberIsForUserDefinedAttr(var->varattno))
!         return false;

!     *attnum = var->varattno;
!     return true;
  }

  /*
*************** find_strongest_dependency(StatisticExtIn
*** 891,902 ****

  /*
   * dependencies_clauselist_selectivity
!  *        Return the estimated selectivity of the given clauses using
!  *        functional dependency statistics, or 1.0 if no useful functional
   *        dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of clauses that are included in the
   * estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
--- 904,915 ----

  /*
   * dependencies_clauselist_selectivity
!  *        Return the estimated selectivity of (a subset of) the given clauses
!  *        using functional dependency statistics, or 1.0 if no useful functional
   *        dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of each clause that is included in the
   * estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
*************** dependencies_clauselist_selectivity(Plan
*** 932,937 ****
--- 945,953 ----
      AttrNumber *list_attnums;
      int            listidx;

+     /* initialize output argument */
+     *estimatedclauses = NULL;
+
      /* check if there's any stats that might be useful for us. */
      if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
          return 1.0;

pgsql-performance by date:

Previous
From: Vitaliy Garnashevich
Date:
Subject: Re: Bitmap scan is undercosted?
Next
From: Jeff Janes
Date:
Subject: Re: Bitmap scan is undercosted? - boolean correlation