Thread: Uninterruptible long planning of a query with too many WHERE clauses
Hi hackers,

Recently one of our customers encountered a situation where the planning of a particular query takes too long (several minutes) and can't be interrupted by pg_terminate_backend(). The query and schema are attached (this is generated by Zabbix).

The reason for the slowness is that the run time of choose_bitmap_and() is quadratic in the number of WHERE clauses. It assigns unique ids to the clauses by putting them in a list and then doing a linear search with equal() to determine the position of each new clause.

Our first attempt to fix this was putting these clauses into an rbtree or dynahash. This improves the performance, but is not entirely correct. We don't have a comparison or hash function for nodes, so we have to hash or compare their string representation. But equality of nodeToString() output is not equivalent to equal(), because the string includes some fields that equal() ignores, such as token location. So we can't really compare the string value instead of using equal().

I settled on a simpler solution: limiting the number of clauses we try to uniquely identify. If there are too many, skip the smarter logic that requires comparing paths by clauses, and just return the cheapest input path from choose_bitmap_and(). The patch is attached.

I'd like to hear your thoughts on this. This is a valid query that freezes a backend with 100% CPU usage and no way to interrupt it, and I think we should fail more gracefully.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> writes:
> Recently one of our customers encountered a situation when the planning
> of a particular query takes too long (several minutes) and can't be
> interrupted by pg_terminate_backend(). The query and schema are attached
> (this is generated by Zabbix).

Ugh.  I hope they aren't expecting actually *good* performance on this sort of query.  Still, O(N^2) behavior isn't nice.

When I first saw your post, I thought maybe the problem was an unreasonable number of paths, but actually there are only two indexpaths being considered in choose_bitmap_and().  The problem is that one of them has 80000 quals attached to it, and the code that's sorting through the quals is O(N^2).

> Our first attempt to fix this was putting these clauses into an rbtree
> or dynahash. This improves the performance, but is not entirely correct.

... depends on how you do it ...

> I settled on a simpler solution: limiting the number of clauses we try
> to uniquely identify. If there are too many, skip the smarter logic that
> requires comparing paths by clauses, and just return the cheapest input
> path from choose_bitmap_and(). The patch is attached.

I think you have the right basic idea, but we don't have to completely lobotomize the bitmap-and search logic in order to cope with this.  This code is only trying to figure out which paths are potentially redundant, so for a path with too many quals, we can just deem it not-redundant, as attached.

A different line of thought is that using equal() to compare quals here is likely overkill: plain old pointer equality ought to be enough, since what we are looking for is different indexpaths derived from the same members of the relation's baserestrictinfo list.  In itself, such a change would not fix the O(N^2) problem, it'd just cut a constant factor off the big-O multiplier.  (A big constant factor, perhaps, but still just a constant factor.)
However, once we go to pointer equality as the definition, we could treat the pointer values as scalars and then use hashing or whatever on them.  But this would take a good deal of work, and I think it might be a net loss for typical not-very-large numbers of quals.  Also, I tried just quickly changing the equal() call to a pointer comparison, and it didn't seem to make much difference given that I'd already done the attached.  So my feeling is that possibly that'd be worth doing sometime in the future, but this particular example isn't offering a compelling reason to do it.

Another thought is that maybe we need a CHECK_FOR_INTERRUPTS call somewhere in here; but I'm not sure where would be a good place.  I'm not excited about sticking one into classify_index_clause_usage, but adding one up at the per-path loops would not help for this case.

            regards, tom lane

diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index f295558..7bf29a6 100644
*** a/src/backend/optimizer/path/indxpath.c
--- b/src/backend/optimizer/path/indxpath.c
*************** typedef struct
*** 67,72 ****
--- 67,73 ----
      List       *quals;          /* the WHERE clauses it uses */
      List       *preds;          /* predicates of its partial index(es) */
      Bitmapset  *clauseids;      /* quals+preds represented as a bitmapset */
+     bool        unclassifiable; /* has too many quals+preds to process? */
  } PathClauseUsage;

  /* Callback argument for ec_member_matches_indexcol */
*************** choose_bitmap_and(PlannerInfo *root, Rel
*** 1447,1455 ****
          Path       *ipath = (Path *) lfirst(l);

          pathinfo = classify_index_clause_usage(ipath, &clauselist);
          for (i = 0; i < npaths; i++)
          {
!             if (bms_equal(pathinfo->clauseids, pathinfoarray[i]->clauseids))
                  break;
          }
          if (i < npaths)
--- 1448,1465 ----
          Path       *ipath = (Path *) lfirst(l);

          pathinfo = classify_index_clause_usage(ipath, &clauselist);
+
+         /* If it's unclassifiable, treat it as distinct from all others */
+         if (pathinfo->unclassifiable)
+         {
+             pathinfoarray[npaths++] = pathinfo;
+             continue;
+         }
+
          for (i = 0; i < npaths; i++)
          {
!             if (!pathinfoarray[i]->unclassifiable &&
!                 bms_equal(pathinfo->clauseids, pathinfoarray[i]->clauseids))
                  break;
          }
          if (i < npaths)
*************** choose_bitmap_and(PlannerInfo *root, Rel
*** 1484,1489 ****
--- 1494,1503 ----
       * For each surviving index, consider it as an "AND group leader", and see
       * whether adding on any of the later indexes results in an AND path with
       * cheaper total cost than before.  Then take the cheapest AND group.
+      *
+      * Note: paths that are either clauseless or unclassifiable will have
+      * empty clauseids, so that they will not be rejected by the clauseids
+      * filter here, nor will they cause later paths to be rejected by it.
       */
      for (i = 0; i < npaths; i++)
      {
*************** classify_index_clause_usage(Path *path,
*** 1711,1716 ****
--- 1725,1745 ----
      result->preds = NIL;
      find_indexpath_quals(path, &result->quals, &result->preds);

+     /*
+      * Some machine-generated queries have outlandish numbers of qual clauses.
+      * To avoid getting into O(N^2) behavior even in this preliminary
+      * classification step, we want to limit the number of entries we can
+      * accumulate in *clauselist.  Treat any path with more than 100 quals +
+      * preds as unclassifiable, which will cause calling code to consider it
+      * distinct from all other paths.
+      */
+     if (list_length(result->quals) + list_length(result->preds) > 100)
+     {
+         result->clauseids = NULL;
+         result->unclassifiable = true;
+         return result;
+     }
+
      /* Build up a bitmapset representing the quals and preds */
      clauseids = NULL;
      foreach(lc, result->quals)
*************** classify_index_clause_usage(Path *path,
*** 1728,1733 ****
--- 1757,1763 ----
                         find_list_position(node, clauselist));
      }
      result->clauseids = clauseids;
+     result->unclassifiable = false;

      return result;
  }
On 11/11/18 07:38, Tom Lane wrote:
> I think you have the right basic idea, but we don't have to completely
> lobotomize the bitmap-and search logic in order to cope with this.
> This code is only trying to figure out which paths are potentially
> redundant, so for a path with too many quals, we can just deem it
> not-redundant, as attached.

Thanks for the patch, looks good to me.

> A different line of thought is that using equal() to compare quals
> here is likely overkill: plain old pointer equality ought to be enough,
> since what we are looking for is different indexpaths derived from the
> same members of the relation's baserestrictinfo list.

I didn't realize that we could just hash the pointers here; that simplifies things. But indeed it makes sense to use simpler logic for such extreme queries, because we won't get a good plan anyway.

> Another thought is that maybe we need a CHECK_FOR_INTERRUPTS call
> somewhere in here; but I'm not sure where would be a good place.
> I'm not excited about sticking one into classify_index_clause_usage,
> but adding one up at the per-path loops would not help for this case.

We added some interrupt checks as a quick fix for the client. In the long run, I think we don't have to add them, because normally planning a query is relatively fast, and unexpected slowdowns like this one can still happen in places where we don't process interrupts.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> writes:
> On 11/11/18 07:38, Tom Lane wrote:
>> I think you have the right basic idea, but we don't have to completely
>> lobotomize the bitmap-and search logic in order to cope with this.
>> This code is only trying to figure out which paths are potentially
>> redundant, so for a path with too many quals, we can just deem it
>> not-redundant, as attached.

> Thanks for the patch, looks good to me.

Pushed, thanks for reviewing.

            regards, tom lane