Thread: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

[HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Tom Lane
The planner doesn't currently worry about work_mem restrictions when
planning a hash join, figuring that the executor should be able to
subdivide the data arbitrarily finely by splitting buckets at runtime.
However there's a thread here:
https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com
exhibiting a case where a hash join was chosen even though a single
value accounts for three-quarters of the inner relation.  Bucket
splitting obviously can never separate multiple instances of the
same value, so this choice forced the executor to try to load
three-quarters of the (very large) inner relation into memory at once;
unsurprisingly, it failed.
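
To put very rough numbers on it (the figures here are invented for
illustration, not taken from that report): with, say, a 10M-row inner
relation, ~50-byte tuples, and one value covering 75% of the rows, the
bucket holding that value needs a few hundred megabytes no matter how
finely we batch, versus a default work_mem of 4MB:

-- illustrative arithmetic only; row count and tuple width are made up
select pg_size_pretty((10000000 * 0.75 * 50)::bigint) as bucket_bytes,
       current_setting('work_mem') as work_mem;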

To fix this, I think we need to discourage use of hash joins whenever
a single bucket is predicted to exceed work_mem, as in the attached
draft patch.  The patch results in changing from hash to merge join
in one regression test case, which is fine; that case only cares about
the join order not the types of the joins.

This might be overly aggressive, because it will pretty much shut off
any attempt to use hash joining on a large inner relation unless we
have statistics for it (and those stats are favorable).  But having
seen this example, I think we need to be worried.

I'm inclined to treat this as a bug and back-patch it, but I wonder if
anyone is concerned about possible plan destabilization in the back
branches.

            regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d01630f..9170b92 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2800,2805 ****
--- 2800,2806 ----
      double        hashjointuples;
      double        virtualbuckets;
      Selectivity innerbucketsize;
+     double        inner_bucket_rows;
      ListCell   *hcl;

      /* Mark the path with the correct row estimate */
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2894,2899 ****
--- 2895,2914 ----
      }

      /*
+      * If even a single bucket would exceed work_mem, we don't want to hash
+      * unless there is really no other alternative, so apply disable_cost.
+      * (Because estimate_hash_bucketsize's estimate is mostly driven by the
+      * MCV frequency, this condition suggests that the executor will be unable
+      * to drive the batch size below work_mem no matter how much it splits
+      * buckets: splitting cannot separate values that are equal.)
+      */
+     inner_bucket_rows = clamp_row_est(inner_path_rows * innerbucketsize);
+     if (relation_byte_size(inner_bucket_rows,
+                            inner_path->pathtarget->width) >
+         (work_mem * 1024L))
+         startup_cost += disable_cost;
+
+     /*
       * Compute cost of the hashquals and qpquals (other restriction clauses)
       * separately.
       */
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2964,2970 ****
           */
          startup_cost += hash_qual_cost.startup;
          run_cost += hash_qual_cost.per_tuple * outer_path_rows *
!             clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

          /*
           * Get approx # tuples passing the hashquals.  We use
--- 2979,2985 ----
           */
          startup_cost += hash_qual_cost.startup;
          run_cost += hash_qual_cost.per_tuple * outer_path_rows *
!             inner_bucket_rows * 0.5;

          /*
           * Get approx # tuples passing the hashquals.  We use
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4992048..b2270ca 100644
*** a/src/test/regress/expected/join.out
--- b/src/test/regress/expected/join.out
*************** where not exists (
*** 2429,2461 ****
                ) a1 on t3.c2 = a1.c1
    where t1.c1 = t2.c2
  );
!                        QUERY PLAN
! ---------------------------------------------------------
!  Hash Anti Join
!    Hash Cond: (t1.c1 = t2.c2)
!    ->  Seq Scan on tt4x t1
!    ->  Hash
!          ->  Merge Right Join
!                Merge Cond: (t5.c1 = t3.c2)
!                ->  Merge Join
!                      Merge Cond: (t4.c2 = t5.c1)
!                      ->  Sort
!                            Sort Key: t4.c2
!                            ->  Seq Scan on tt4x t4
!                      ->  Sort
!                            Sort Key: t5.c1
!                            ->  Seq Scan on tt4x t5
!                ->  Sort
!                      Sort Key: t3.c2
!                      ->  Merge Left Join
!                            Merge Cond: (t2.c3 = t3.c1)
                             ->  Sort
!                                  Sort Key: t2.c3
!                                  ->  Seq Scan on tt4x t2
                             ->  Sort
!                                  Sort Key: t3.c1
!                                  ->  Seq Scan on tt4x t3
! (24 rows)

  --
  -- regression test for problems of the sort depicted in bug #3494
--- 2429,2465 ----
                ) a1 on t3.c2 = a1.c1
    where t1.c1 = t2.c2
  );
!                           QUERY PLAN
! ---------------------------------------------------------------
!  Merge Anti Join
!    Merge Cond: (t1.c1 = t2.c2)
!    ->  Sort
!          Sort Key: t1.c1
!          ->  Seq Scan on tt4x t1
!    ->  Materialize
!          ->  Sort
!                Sort Key: t2.c2
!                ->  Merge Right Join
!                      Merge Cond: (t5.c1 = t3.c2)
!                      ->  Merge Join
!                            Merge Cond: (t4.c2 = t5.c1)
                             ->  Sort
!                                  Sort Key: t4.c2
!                                  ->  Seq Scan on tt4x t4
                             ->  Sort
!                                  Sort Key: t5.c1
!                                  ->  Seq Scan on tt4x t5
!                      ->  Sort
!                            Sort Key: t3.c2
!                            ->  Merge Left Join
!                                  Merge Cond: (t2.c3 = t3.c1)
!                                  ->  Sort
!                                        Sort Key: t2.c3
!                                        ->  Seq Scan on tt4x t2
!                                  ->  Sort
!                                        Sort Key: t3.c1
!                                        ->  Seq Scan on tt4x t3
! (28 rows)

  --
  -- regression test for problems of the sort depicted in bug #3494


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Robert Haas
On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The planner doesn't currently worry about work_mem restrictions when
> planning a hash join, figuring that the executor should be able to
> subdivide the data arbitrarily finely by splitting buckets at runtime.
> However there's a thread here:
> https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com
> exhibiting a case where a hash join was chosen even though a single
> value accounts for three-quarters of the inner relation.  Bucket
> splitting obviously can never separate multiple instances of the
> same value, so this choice forced the executor to try to load
> three-quarters of the (very large) inner relation into memory at once;
> unsurprisingly, it failed.
>
> To fix this, I think we need to discourage use of hash joins whenever
> a single bucket is predicted to exceed work_mem, as in the attached
> draft patch.  The patch results in changing from hash to merge join
> in one regression test case, which is fine; that case only cares about
> the join order not the types of the joins.
>
> This might be overly aggressive, because it will pretty much shut off
> any attempt to use hash joining on a large inner relation unless we
> have statistics for it (and those stats are favorable).  But having
> seen this example, I think we need to be worried.

I do think that's worrying, but on the other hand it seems like this
solution could disable many hash joins that would actually be fine.  I
don't think the largest ndistinct estimates we ever generate are very
large, and therefore this seems highly prone to worry even when
worrying isn't really justified.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Peter Geoghegan
On Thu, Feb 16, 2017 at 11:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I do think that's worrying, but on the other hand it seems like this
> solution could disable many hash joins that would actually be fine.  I
> don't think the largest ndistinct estimates we ever generate are very
> large, and therefore this seems highly prone to worry even when
> worrying isn't really justified.

+1. ndistinct has a general tendency to be wrong, owing to how ANALYZE
works; we see problems with it from time to time.


-- 
Peter Geoghegan



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This might be overly aggressive, because it will pretty much shut off
>> any attempt to use hash joining on a large inner relation unless we
>> have statistics for it (and those stats are favorable).  But having
>> seen this example, I think we need to be worried.

> I do think that's worrying, but on the other hand it seems like this
> solution could disable many hash joins that would actually be fine.  I
> don't think the largest ndistinct estimates we ever generate are very
> large, and therefore this seems highly prone to worry even when
> worrying isn't really justified.

I initially thought about driving the shutoff strictly from the estimate
of the MCV frequency, without involving the more general ndistinct
computation that estimate_hash_bucketsize does.  I'm not sure how much
that would do for your concern, but at least the MCV frequency doesn't
involve quite as much extrapolation as ndistinct.

There's a practical problem from final_cost_hashjoin's standpoint,
which is that it has noplace to cache the MCV frequency separately from
estimate_hash_bucketsize's output.  In HEAD we could just add some more
fields to RestrictInfo, but that would be an unacceptable ABI break in
the back branches.  Maybe we could get away with replacing the float8
bucketsize fields with two float4 fields --- it seems unlikely that we
need more than 6 digits of precision for these numbers, and I doubt any
extensions are touching the bucketsize fields.
        regards, tom lane



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Robert Haas
On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> This might be overly aggressive, because it will pretty much shut off
>>> any attempt to use hash joining on a large inner relation unless we
>>> have statistics for it (and those stats are favorable).  But having
>>> seen this example, I think we need to be worried.
>
>> I do think that's worrying, but on the other hand it seems like this
>> solution could disable many hash joins that would actually be fine.  I
>> don't think the largest ndistinct estimates we ever generate are very
>> large, and therefore this seems highly prone to worry even when
>> worrying isn't really justified.
>
> I initially thought about driving the shutoff strictly from the estimate
> of the MCV frequency, without involving the more general ndistinct
> computation that estimate_hash_bucketsize does.  I'm not sure how much
> that would do for your concern, but at least the MCV frequency doesn't
> involve quite as much extrapolation as ndistinct.

Hmm, so we could do something like: if the estimated frequency of the
least-common MCV is enough to make one bucket overflow work_mem, then
don't use a hash join?  That would still be prone to some error (in
both directions, really) but it seems less likely to spit out
completely stupid results than relying on ndistinct, which never gets
very big even in a 10TB table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I initially thought about driving the shutoff strictly from the estimate
>> of the MCV frequency, without involving the more general ndistinct
>> computation that estimate_hash_bucketsize does.  I'm not sure how much
>> that would do for your concern, but at least the MCV frequency doesn't
>> involve quite as much extrapolation as ndistinct.

> Hmm, so we could do something like: if the estimated frequency of the
> least-common MCV is enough to make one bucket overflow work_mem, then
> don't use a hash join?  That would still be prone to some error (in
> both directions, really) but it seems less likely to spit out
> completely stupid results than relying on ndistinct, which never gets
> very big even in a 10TB table.

No, it'd be the *most* common MCV, because we're concerned about the
worst-case (largest) bucket size.  But that's good, really, because the
highest MCV frequency will be the one we have most statistical
confidence in.  There's generally a whole lot of noise in the tail-end
MCV numbers.
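
(If you want to eyeball the numbers in question, they're the MCV
frequencies ANALYZE stores, visible through pg_stats; the table and
column names below are only placeholders.  most_common_freqs[1] is the
estimate we'd be keying off, and the entries further down the array are
the noisy tail-end ones.)

select most_common_vals, most_common_freqs
from pg_stats
where schemaname = 'public' and tablename = 'mytab' and attname = 'mycol';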

Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
stats.  That would be an expected case if the column is believed unique,
and it's probably a better fallback behavior when we simply don't have
stats.  With the ndistinct-based rule, we'd be shutting off hashjoin
almost always when we don't have stats.  Given how long it took us
to recognize this problem, that's probably the wrong default.
        regards, tom lane



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Robert Haas
On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I initially thought about driving the shutoff strictly from the estimate
>>> of the MCV frequency, without involving the more general ndistinct
>>> computation that estimate_hash_bucketsize does.  I'm not sure how much
>>> that would do for your concern, but at least the MCV frequency doesn't
>>> involve quite as much extrapolation as ndistinct.
>
>> Hmm, so we could do something like: if the estimated frequency of the
>> least-common MCV is enough to make one bucket overflow work_mem, then
>> don't use a hash join?  That would still be prone to some error (in
>> both directions, really) but it seems less likely to spit out
>> completely stupid results than relying on ndistinct, which never gets
>> very big even in a 10TB table.
>
> No, it'd be the *most* common MCV, because we're concerned about the
> worst-case (largest) bucket size.  But that's good, really, because the
> highest MCV frequency will be the one we have most statistical
> confidence in.  There's generally a whole lot of noise in the tail-end
> MCV numbers.

Oh, right.  That's reassuring, as it seems like it has a much better
chance of actually being right.

> Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
> stats.  That would be an expected case if the column is believed unique,
> and it's probably a better fallback behavior when we simply don't have
> stats.  With the ndistinct-based rule, we'd be shutting off hashjoin
> almost always when we don't have stats.  Given how long it took us
> to recognize this problem, that's probably the wrong default.

Right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> No, it'd be the *most* common MCV, because we're concerned about the
>> worst-case (largest) bucket size.  But that's good, really, because the
>> highest MCV frequency will be the one we have most statistical
>> confidence in.  There's generally a whole lot of noise in the tail-end
>> MCV numbers.

> Oh, right.  That's reassuring, as it seems like it has a much better
> chance of actually being right.

Here's a version that does it that way.  Unsurprisingly, it doesn't
cause any regression test changes, but you can confirm it's having an
effect with this test case:

create table tt(f1 int);
insert into tt select 1 from generate_series(1,1000000) g;
insert into tt select g from generate_series(1,1000000) g;
analyze tt;
explain select * from tt a natural join tt b;

Unpatched code will go for a hash join on this example.
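
(The query below isn't part of the patch; it just shows the statistic
the new check keys off.  With the data above, about half of tt.f1 is
the single value 1, so the leading entry of most_common_freqs should
come out near 0.5, and half of a 2M-row inner relation is well beyond
the default 4MB work_mem.)

select most_common_vals, most_common_freqs[1] as top_mcv_freq
from pg_stats
where tablename = 'tt' and attname = 'f1';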
 

For application to the back branches, we could do it just like this
(leaving the existing fields alone, and allowing sizeof(RestrictInfo)
to grow), or we could change the datatypes of the four fields involved
to float4 so that sizeof(RestrictInfo) stays the same.  I'm not entirely
sure which way is the more hazardous from an ABI standpoint.  If there
are any external modules doing their own palloc(sizeof(RestrictInfo))
calls, the first way would be bad, but really there shouldn't be; I'd
expect people to be using make_restrictinfo() and friends.  (Note that
palloc's power-of-2 padding wouldn't save us, because sizeof(RestrictInfo)
is currently exactly 128 on 32-bit machines in several of the back
branches.)  Conversely, if any non-core code is touching the bucketsize
fields, changing those field widths would break that; but that doesn't
seem all that likely either.  On balance I think I might go for the first
way, because it would remove doubt about whether reducing the precision
of the bucketsize estimates would cause any unexpected plan changes.

Or we could decide not to back-patch because the problem doesn't come
up often enough to justify taking any risk for it.  But given that we've
gotten one confirmed field report, I'm not voting that way.

            regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 05d8538..c2cdd79 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyRestrictInfo(const RestrictInfo *fr
*** 2070,2075 ****
--- 2070,2077 ----
      COPY_SCALAR_FIELD(hashjoinoperator);
      COPY_SCALAR_FIELD(left_bucketsize);
      COPY_SCALAR_FIELD(right_bucketsize);
+     COPY_SCALAR_FIELD(left_mcvfreq);
+     COPY_SCALAR_FIELD(right_mcvfreq);

      return newnode;
  }
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d01630f..f88c1c5 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2800,2805 ****
--- 2800,2806 ----
      double        hashjointuples;
      double        virtualbuckets;
      Selectivity innerbucketsize;
+     Selectivity innermcvfreq;
      ListCell   *hcl;

      /* Mark the path with the correct row estimate */
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2827,2835 ****
      virtualbuckets = (double) numbuckets *(double) numbatches;

      /*
!      * Determine bucketsize fraction for inner relation.  We use the smallest
!      * bucketsize estimated for any individual hashclause; this is undoubtedly
!      * conservative.
       *
       * BUT: if inner relation has been unique-ified, we can assume it's good
       * for hashing.  This is important both because it's the right answer, and
--- 2828,2836 ----
      virtualbuckets = (double) numbuckets *(double) numbatches;

      /*
!      * Determine bucketsize fraction and MCV frequency for the inner relation.
!      * We use the smallest bucketsize or MCV frequency estimated for any
!      * individual hashclause; this is undoubtedly conservative.
       *
       * BUT: if inner relation has been unique-ified, we can assume it's good
       * for hashing.  This is important both because it's the right answer, and
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2837,2850 ****
--- 2838,2856 ----
       * non-unique-ified paths.
       */
      if (IsA(inner_path, UniquePath))
+     {
          innerbucketsize = 1.0 / virtualbuckets;
+         innermcvfreq = 0.0;
+     }
      else
      {
          innerbucketsize = 1.0;
+         innermcvfreq = 1.0;
          foreach(hcl, hashclauses)
          {
              RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(hcl);
              Selectivity thisbucketsize;
+             Selectivity thismcvfreq;

              Assert(IsA(restrictinfo, RestrictInfo));

*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2853,2860 ****
               * is the inner side.
               *
               * Since we tend to visit the same clauses over and over when
!              * planning a large query, we cache the bucketsize estimate in the
!              * RestrictInfo node to avoid repeated lookups of statistics.
               */
              if (bms_is_subset(restrictinfo->right_relids,
                                inner_path->parent->relids))
--- 2859,2866 ----
               * is the inner side.
               *
               * Since we tend to visit the same clauses over and over when
!              * planning a large query, we cache the bucket stats estimates in
!              * the RestrictInfo node to avoid repeated lookups of statistics.
               */
              if (bms_is_subset(restrictinfo->right_relids,
                                inner_path->parent->relids))
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2864,2875 ****
                  if (thisbucketsize < 0)
                  {
                      /* not cached yet */
!                     thisbucketsize =
!                         estimate_hash_bucketsize(root,
                                             get_rightop(restrictinfo->clause),
!                                                  virtualbuckets);
                      restrictinfo->right_bucketsize = thisbucketsize;
                  }
              }
              else
              {
--- 2870,2884 ----
                  if (thisbucketsize < 0)
                  {
                      /* not cached yet */
!                     estimate_hash_bucket_stats(root,
                                             get_rightop(restrictinfo->clause),
!                                                virtualbuckets,
!                                                &thismcvfreq,
!                                                &thisbucketsize);
                      restrictinfo->right_bucketsize = thisbucketsize;
+                     restrictinfo->right_mcvfreq = thismcvfreq;
                  }
+                 thismcvfreq = restrictinfo->right_mcvfreq;
              }
              else
              {
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2880,2899 ****
                  if (thisbucketsize < 0)
                  {
                      /* not cached yet */
!                     thisbucketsize =
!                         estimate_hash_bucketsize(root,
                                              get_leftop(restrictinfo->clause),
!                                                  virtualbuckets);
                      restrictinfo->left_bucketsize = thisbucketsize;
                  }
              }

              if (innerbucketsize > thisbucketsize)
                  innerbucketsize = thisbucketsize;
          }
      }

      /*
       * Compute cost of the hashquals and qpquals (other restriction clauses)
       * separately.
       */
--- 2889,2926 ----
                  if (thisbucketsize < 0)
                  {
                      /* not cached yet */
!                     estimate_hash_bucket_stats(root,
                                              get_leftop(restrictinfo->clause),
!                                                virtualbuckets,
!                                                &thismcvfreq,
!                                                &thisbucketsize);
                      restrictinfo->left_bucketsize = thisbucketsize;
+                     restrictinfo->left_mcvfreq = thismcvfreq;
                  }
+                 thismcvfreq = restrictinfo->left_mcvfreq;
              }

              if (innerbucketsize > thisbucketsize)
                  innerbucketsize = thisbucketsize;
+             if (innermcvfreq > thismcvfreq)
+                 innermcvfreq = thismcvfreq;
          }
      }

      /*
+      * If the bucket holding the inner MCV would exceed work_mem, we don't
+      * want to hash unless there is really no other alternative, so apply
+      * disable_cost.  (The executor normally copes with excessive memory usage
+      * by splitting batches, but obviously it cannot separate equal values
+      * that way, so it will be unable to drive the batch size below work_mem
+      * when this is true.)
+      */
+     if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
+                            inner_path->pathtarget->width) >
+         (work_mem * 1024L))
+         startup_cost += disable_cost;
+
+     /*
       * Compute cost of the hashquals and qpquals (other restriction clauses)
       * separately.
       */
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 06e843d..f6ddb2e 100644
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** adjust_appendrel_attrs_mutator(Node *nod
*** 1961,1966 ****
--- 1961,1968 ----
          newinfo->scansel_cache = NIL;
          newinfo->left_bucketsize = -1;
          newinfo->right_bucketsize = -1;
+         newinfo->left_mcvfreq = -1;
+         newinfo->right_mcvfreq = -1;

          return (Node *) newinfo;
      }
diff --git a/src/backend/optimizer/util/restrictinfo.c b/src/backend/optimizer/util/restrictinfo.c
index 045b5cf..e7efe36 100644
*** a/src/backend/optimizer/util/restrictinfo.c
--- b/src/backend/optimizer/util/restrictinfo.c
*************** make_restrictinfo_internal(Expr *clause,
*** 199,204 ****
--- 199,206 ----

      restrictinfo->left_bucketsize = -1;
      restrictinfo->right_bucketsize = -1;
+     restrictinfo->left_mcvfreq = -1;
+     restrictinfo->right_mcvfreq = -1;

      return restrictinfo;
  }
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d14f0f9..816cb65 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3521,3529 ****
  }

  /*
!  * Estimate hash bucketsize fraction (ie, number of entries in a bucket
!  * divided by total tuples in relation) if the specified expression is used
!  * as a hash key.
   *
   * XXX This is really pretty bogus since we're effectively assuming that the
   * distribution of hash keys will be the same after applying restriction
--- 3521,3536 ----
  }

  /*
!  * Estimate hash bucket statistics when the specified expression is used
!  * as a hash key for the given number of buckets.
!  *
!  * This attempts to determine two values:
!  *
!  * 1. The frequency of the most common value of the expression (returns
!  * zero into *mcv_freq if we can't get that).
!  *
!  * 2. The "bucketsize fraction", ie, average number of entries in a bucket
!  * divided by total tuples in relation.
   *
   * XXX This is really pretty bogus since we're effectively assuming that the
   * distribution of hash keys will be the same after applying restriction
*************** estimate_num_groups(PlannerInfo *root, L
*** 3549,3563 ****
   * discourage use of a hash rather strongly if the inner relation is large,
   * which is what we want.  We do not want to hash unless we know that the
   * inner rel is well-dispersed (or the alternatives seem much worse).
   */
! Selectivity
! estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey, double nbuckets)
  {
      VariableStatData vardata;
      double        estfract,
                  ndistinct,
                  stanullfrac,
-                 mcvfreq,
                  avgfreq;
      bool        isdefault;
      float4       *numbers;
--- 3556,3577 ----
   * discourage use of a hash rather strongly if the inner relation is large,
   * which is what we want.  We do not want to hash unless we know that the
   * inner rel is well-dispersed (or the alternatives seem much worse).
+  *
+  * The caller should also check that the mcv_freq is not so large that the
+  * most common value would by itself require an impractically large bucket.
+  * In a hash join, the executor can split buckets if they get too big, but
+  * obviously that doesn't help for a bucket that contains many duplicates of
+  * the same value.
   */
! void
! estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
!                            Selectivity *mcv_freq,
!                            Selectivity *bucketsize_frac)
  {
      VariableStatData vardata;
      double        estfract,
                  ndistinct,
                  stanullfrac,
                  avgfreq;
      bool        isdefault;
      float4       *numbers;
*************** estimate_hash_bucketsize(PlannerInfo *ro
*** 3565,3578 ****

      examine_variable(root, hashkey, 0, &vardata);

      /* Get number of distinct values */
      ndistinct = get_variable_numdistinct(&vardata, &isdefault);

!     /* If ndistinct isn't real, punt and return 0.1, per comments above */
      if (isdefault)
      {
          ReleaseVariableStats(vardata);
!         return (Selectivity) 0.1;
      }

      /* Get fraction that are null */
--- 3579,3618 ----

      examine_variable(root, hashkey, 0, &vardata);

+     /* Look up the frequency of the most common value, if available */
+     *mcv_freq = 0.0;
+
+     if (HeapTupleIsValid(vardata.statsTuple))
+     {
+         if (get_attstatsslot(vardata.statsTuple,
+                              vardata.atttype, vardata.atttypmod,
+                              STATISTIC_KIND_MCV, InvalidOid,
+                              NULL,
+                              NULL, NULL,
+                              &numbers, &nnumbers))
+         {
+             /*
+              * The first MCV stat is for the most common value.
+              */
+             if (nnumbers > 0)
+                 *mcv_freq = numbers[0];
+             free_attstatsslot(vardata.atttype, NULL, 0,
+                               numbers, nnumbers);
+         }
+     }
+
      /* Get number of distinct values */
      ndistinct = get_variable_numdistinct(&vardata, &isdefault);

!     /*
!      * If ndistinct isn't real, punt.  We normally return 0.1, but if the
!      * mcv_freq is known to be even higher than that, use it instead.
!      */
      if (isdefault)
      {
+         *bucketsize_frac = (Selectivity) Max(0.1, *mcv_freq);
          ReleaseVariableStats(vardata);
!         return;
      }

      /* Get fraction that are null */
*************** estimate_hash_bucketsize(PlannerInfo *ro
*** 3614,3647 ****
          estfract = 1.0 / ndistinct;

      /*
-      * Look up the frequency of the most common value, if available.
-      */
-     mcvfreq = 0.0;
-
-     if (HeapTupleIsValid(vardata.statsTuple))
-     {
-         if (get_attstatsslot(vardata.statsTuple,
-                              vardata.atttype, vardata.atttypmod,
-                              STATISTIC_KIND_MCV, InvalidOid,
-                              NULL,
-                              NULL, NULL,
-                              &numbers, &nnumbers))
-         {
-             /*
-              * The first MCV stat is for the most common value.
-              */
-             if (nnumbers > 0)
-                 mcvfreq = numbers[0];
-             free_attstatsslot(vardata.atttype, NULL, 0,
-                               numbers, nnumbers);
-         }
-     }
-
-     /*
       * Adjust estimated bucketsize upward to account for skewed distribution.
       */
!     if (avgfreq > 0.0 && mcvfreq > avgfreq)
!         estfract *= mcvfreq / avgfreq;

      /*
       * Clamp bucketsize to sane range (the above adjustment could easily
--- 3654,3663 ----
          estfract = 1.0 / ndistinct;

      /*
       * Adjust estimated bucketsize upward to account for skewed distribution.
       */
!     if (avgfreq > 0.0 && *mcv_freq > avgfreq)
!         estfract *= *mcv_freq / avgfreq;

      /*
       * Clamp bucketsize to sane range (the above adjustment could easily
*************** estimate_hash_bucketsize(PlannerInfo *ro
*** 3653,3661 ****
      else if (estfract > 1.0)
          estfract = 1.0;

!     ReleaseVariableStats(vardata);

!     return (Selectivity) estfract;
  }


--- 3669,3677 ----
      else if (estfract > 1.0)
          estfract = 1.0;

!     *bucketsize_frac = (Selectivity) estfract;

!     ReleaseVariableStats(vardata);
  }


diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index f7ac6f6..2b2d950 100644
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct RestrictInfo
*** 1697,1702 ****
--- 1697,1704 ----
      /* cache space for hashclause processing; -1 if not yet set */
      Selectivity left_bucketsize;    /* avg bucketsize of left side */
      Selectivity right_bucketsize;        /* avg bucketsize of right side */
+     Selectivity left_mcvfreq;    /* most common val's freq on left side */
+     Selectivity right_mcvfreq;    /* most common val's freq on right side */
  } RestrictInfo;

  /*
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 9f9d2dc..03ed27b 100644
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,211 ****
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
                      double input_rows, List **pgset);

! extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
!                          double nbuckets);

  extern List *deconstruct_indexquals(IndexPath *path);
  extern void genericcostestimate(PlannerInfo *root, IndexPath *path,
--- 204,213 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
                      double input_rows, List **pgset);

! extern void estimate_hash_bucket_stats(PlannerInfo *root,
!                            Node *hashkey, double nbuckets,
!                            Selectivity *mcv_freq,
!                            Selectivity *bucketsize_frac);

  extern List *deconstruct_indexquals(IndexPath *path);
  extern void genericcostestimate(PlannerInfo *root, IndexPath *path,


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Thomas Munro
On Fri, Feb 17, 2017 at 11:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> No, it'd be the *most* common MCV, because we're concerned about the
>>> worst-case (largest) bucket size.  But that's good, really, because the
>>> highest MCV frequency will be the one we have most statistical
>>> confidence in.  There's generally a whole lot of noise in the tail-end
>>> MCV numbers.
>
>> Oh, right.  That's reassuring, as it seems like it has a much better
>> chance of actually being right.
>
> Here's a version that does it that way.  Unsurprisingly, it doesn't
> cause any regression test changes, but you can confirm it's having an
> effect with this test case:
>
> create table tt(f1 int);
> insert into tt select 1 from generate_series(1,1000000) g;
> insert into tt select g from generate_series(1,1000000) g;
> analyze tt;
> explain select * from tt a natural join tt b;
>
> Unpatched code will go for a hash join on this example.

+1

By strange coincidence, I was about to propose something along these
lines on theoretical grounds, having spent a bunch of time studying
the hash join code recently.  It makes a lot of sense to use
statistics to try to avoid the "fail" (ie fail to respect work_mem,
and maybe fail to complete: maybe better called "overflow" or
"explode") mode during planning.

I have been wondering about a couple of different worst case execution
strategies that would be better than throwing our hands up and
potentially exploding memory once we detect that further partitioning
is not going to help, if we still manage to reach that case despite
adding stats-based defences like this due to statistics being missing,
bad or confused by joins below us.

1.  We could probe the fraction of the hash table that we have managed
to load into work_mem so far and then rewind the outer batch and do it
again for the next work_mem-sized fraction of the inner batch and so
on.  For outer joins we'd need to scan for unmatched tuples after each
hash table refill.  If we detect this condition during the initial
hash build (as opposed to a later inner batch->hash table load), we'd
need to disable the so called 'hybrid' optimisation and fall back to
the so called 'Grace' hash join; that is, we'd need to pull in the
whole outer relation and write it out to batches before we even begin
probing batch 0, so that we have the ability to rewind outer batch 0
for another pass.  When doing extra passes of an outer batch file, we
have to make sure that we don't do the 'send this tuple to a future
batch' behaviour if the number of batches happens to have increased.
Modulo some details, and I may be missing something fundamental here
(maybe breaks in some semi/anti case?)...

2.  We could just abandon hash join for this batch.  "She cannae take
it, captain", so sort inner and outer batches and merge-join them
instead.  Same comment re switching to Grace hash join if this happens
while loading inner batch 0; we'll need a complete inner batch 0 and
outer batch 0, so we can't use the hybrid optimisation.

Obviously there are vanishing returns here as we add more defences
making it increasingly unlikely that we hit "fail" mode.  But it
bothers me that hash joins in general are not 100% guaranteed to be
able to complete unless you have infinite RAM.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Robert Haas
On Thu, Feb 16, 2017 at 8:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Obviously there are vanishing returns here as we add more defences
> making it increasingly unlikely that we hit "fail" mode.  But it
> bothers me that hash joins in general are not 100% guaranteed to be
> able to complete unless you have infinite RAM.

I think in practice most people are forced to set work_mem to such a
small percentage of their available RAM that actual RAM exhaustion is
quite rare.  The default value of 4MB is probably conservative even
for a Raspberry Pi, at least until the connection count spikes
unexpectedly, or until you have this problem:

https://www.postgresql.org/message-id/16161.1324414006@sss.pgh.pa.us

Most advice that I've seen for work_mem involves choosing values like
RAM / (4 * max_connections), which tends to result in much larger
values that are typically still very small compared to the
amount of RAM that's available at any given moment, because most of
the time you either don't have the maximum number of connections or
some of them are idle or not all of them are using plans that need any
work_mem.  Unfortunately, even with these very conservative settings,
one sometimes sees a machine get absolutely swamped by a large
activity spike at a time when all of the backends just so happen to be
running a query that uses 2 or 3 (or 180) copies of work_mem.[1]
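
(A worked example of that rule of thumb, with numbers invented purely
for illustration: 64GB of RAM and max_connections = 100 comes out to
roughly 160MB, which is still tiny next to total RAM yet plenty to
swamp the box if lots of backends each use a few copies of it at once.)

select pg_size_pretty((64::bigint * 1024 * 1024 * 1024) / (4 * 100))
       as example_work_mem;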

If I were going to try to do something about the problem of machines
running out of memory, I'd be inclined to look at the problem more
broadly than "hey, hash joins can exceed work_mem if certain bad
things happen" and instead think about "hey, work_mem is a stupid way
of deciding on a memory budget".  The intrinsic stupidity of work_mem
as an allocation system means that (1) it's perfectly possible to run
out of memory even if every node respects the memory budget and (2)
it's perfectly possible to drastically underutilize the memory you do
have even if some nodes fail to respect the memory budget.  Of course,
if we had a smarter system for deciding on the budget it would be more,
not less, important for nodes to respect the budget they were given, so
that's not really an argument against trying to plug the hole you're
complaining about here, just a doubt about how much it will improve
the user experience if that's the only thing you do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Or all of the connections just touch each of your 100,000
relations and the backend-local caches fill up and the whole system
falls over without using any work_mem at all.



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Tom Lane
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> I have been wondering about a couple of different worst case execution
> strategies that would be better than throwing our hands up and
> potentially exploding memory once we detect that further partitioning
> is not going to help, if we still manage to reach that case despite
> adding stats-based defences like this due to statistics being missing,
> bad or confused by joins below us.

Yeah, it would definitely be nice if we could constrain the runtime
space consumption better.

> 1.  We could probe the fraction of the hash table that we have managed
> to load into work_mem so far and then rewind the outer batch and do it
> again for the next work_mem-sized fraction of the inner batch and so
> on.  For outer joins we'd need to scan for unmatched tuples after each
> hash table refill.

I do not understand how that works for a left join?  You'd need to track
whether a given outer tuple has been matched in any one of the fractions
of the inner batch, so that when you're done with the batch you could know
which outer tuples need to be emitted null-extended.  Right now we only
need to track that state for the current outer tuple, but in a rescan
situation we'd have to remember it for each outer tuple in the batch.

Perhaps it could be done by treating the outer batch file as read/write
and incorporating a state flag in each tuple; or to reduce write volume,
maintaining a separate outer batch file parallel to the main one with just
a bool or even just a bit per outer tuple.  Seems messy though.
        regards, tom lane



Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

From: Thomas Munro
On Wed, Mar 8, 2017 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I have been wondering about a couple of different worst case execution
>> strategies that would be better than throwing our hands up and
>> potentially exploding memory once we detect that further partitioning
>> is not going to help, if we still manage to reach that case despite
>> adding stats-based defences like this due to statistics being missing,
>> bad or confused by joins below us.
>
> Yeah, it would definitely be nice if we could constrain the runtime
> space consumption better.
>
>> 1.  We could probe the fraction of the hash table that we have managed
>> to load into work_mem so far and then rewind the outer batch and do it
>> again for the next work_mem-sized fraction of the inner batch and so
>> on.  For outer joins we'd need to scan for unmatched tuples after each
>> hash table refill.
>
> I do not understand how that works for a left join?  You'd need to track
> whether a given outer tuple has been matched in any one of the fractions
> of the inner batch, so that when you're done with the batch you could know
> which outer tuples need to be emitted null-extended.  Right now we only
> need to track that state for the current outer tuple, but in a rescan
> situation we'd have to remember it for each outer tuple in the batch.
>
> Perhaps it could be done by treating the outer batch file as read/write
> and incorporating a state flag in each tuple; or to reduce write volume,
> maintaining a separate outer batch file parallel to the main one with just
> a bool or even just a bit per outer tuple.  Seems messy though.

Right.  Messy.  I think what I described may fall under the category
of "block nested loop".  It looks doable but not very appealing for
left joins, and performance seems not great, multiplying the probing
scans by the number of fragments.  Whether we actually care about
performance at all when we've reached this emergency state and are
primarily concerned with completing the query I'm not entirely sure.

Another idea would be to identify the offending bucket (how?) and
spill it to disk in its own file, and track it by pushing a special
control object with a distinguishing header flag into the hash table
(or a new overflow table, or extend the duties of the skew table,
or...).  We'd have to deal with the matched flags of spilled inner
tuples for right/full joins.  Matching is really per-key, not
per-tuple, so if there is a control object in memory for each of these
"overflow" buckets then perhaps it could hold the matched flag that
covers all tuples with each distinct key.  What I like about this is
that it doesn't change the join algorithm at all, it just bolts on a
per-bucket escape valve.  The changes might be quite localised, though
I know someone who probably wouldn't like an extra branch in
ExecScanHashBucket().

-- 
Thomas Munro
http://www.enterprisedb.com