[HACKERS] Avoiding OOM in a hash join with many duplicate inner keys - Mailing list pgsql-hackers

From Tom Lane
Subject [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
Date
Msg-id 32013.1487271761@sss.pgh.pa.us
Whole thread Raw
Responses Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
List pgsql-hackers
The planner doesn't currently worry about work_mem restrictions when
planning a hash join, figuring that the executor should be able to
subdivide the data arbitrarily finely by splitting buckets at runtime.
However there's a thread here:
https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com
exhibiting a case where a hash join was chosen even though a single
value accounts for three-quarters of the inner relation.  Bucket
splitting obviously can never separate multiple instances of the
same value, so this choice forced the executor to try to load
three-quarters of the (very large) inner relation into memory at once;
unsurprisingly, it failed.

To fix this, I think we need to discourage use of hash joins whenever
a single bucket is predicted to exceed work_mem, as in the attached
draft patch.  The patch results in changing from hash to merge join
in one regression test case, which is fine; that case only cares about
the join order not the types of the joins.

This might be overly aggressive, because it will pretty much shut off
any attempt to use hash joining on a large inner relation unless we
have statistics for it (and those stats are favorable).  But having
seen this example, I think we need to be worried.

I'm inclined to treat this as a bug and back-patch it, but I wonder if
anyone is concerned about possible plan destabilization in the back
branches.

            regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d01630f..9170b92 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2800,2805 ****
--- 2800,2806 ----
      double        hashjointuples;
      double        virtualbuckets;
      Selectivity innerbucketsize;
+     double        inner_bucket_rows;
      ListCell   *hcl;

      /* Mark the path with the correct row estimate */
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2894,2899 ****
--- 2895,2914 ----
      }

      /*
+      * If even a single bucket would exceed work_mem, we don't want to hash
+      * unless there is really no other alternative, so apply disable_cost.
+      * (Because estimate_hash_bucketsize's estimate is mostly driven by the
+      * MCV frequency, this condition suggests that the executor will be unable
+      * to drive the batch size below work_mem no matter how much it splits
+      * buckets: splitting cannot separate values that are equal.)
+      */
+     inner_bucket_rows = clamp_row_est(inner_path_rows * innerbucketsize);
+     if (relation_byte_size(inner_bucket_rows,
+                            inner_path->pathtarget->width) >
+         (work_mem * 1024L))
+         startup_cost += disable_cost;
+
+     /*
       * Compute cost of the hashquals and qpquals (other restriction clauses)
       * separately.
       */
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2964,2970 ****
           */
          startup_cost += hash_qual_cost.startup;
          run_cost += hash_qual_cost.per_tuple * outer_path_rows *
!             clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

          /*
           * Get approx # tuples passing the hashquals.  We use
--- 2979,2985 ----
           */
          startup_cost += hash_qual_cost.startup;
          run_cost += hash_qual_cost.per_tuple * outer_path_rows *
!             inner_bucket_rows * 0.5;

          /*
           * Get approx # tuples passing the hashquals.  We use
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4992048..b2270ca 100644
*** a/src/test/regress/expected/join.out
--- b/src/test/regress/expected/join.out
*************** where not exists (
*** 2429,2461 ****
                ) a1 on t3.c2 = a1.c1
    where t1.c1 = t2.c2
  );
!                        QUERY PLAN
! ---------------------------------------------------------
!  Hash Anti Join
!    Hash Cond: (t1.c1 = t2.c2)
!    ->  Seq Scan on tt4x t1
!    ->  Hash
!          ->  Merge Right Join
!                Merge Cond: (t5.c1 = t3.c2)
!                ->  Merge Join
!                      Merge Cond: (t4.c2 = t5.c1)
!                      ->  Sort
!                            Sort Key: t4.c2
!                            ->  Seq Scan on tt4x t4
!                      ->  Sort
!                            Sort Key: t5.c1
!                            ->  Seq Scan on tt4x t5
!                ->  Sort
!                      Sort Key: t3.c2
!                      ->  Merge Left Join
!                            Merge Cond: (t2.c3 = t3.c1)
                             ->  Sort
!                                  Sort Key: t2.c3
!                                  ->  Seq Scan on tt4x t2
                             ->  Sort
!                                  Sort Key: t3.c1
!                                  ->  Seq Scan on tt4x t3
! (24 rows)

  --
  -- regression test for problems of the sort depicted in bug #3494
--- 2429,2465 ----
                ) a1 on t3.c2 = a1.c1
    where t1.c1 = t2.c2
  );
!                           QUERY PLAN
! ---------------------------------------------------------------
!  Merge Anti Join
!    Merge Cond: (t1.c1 = t2.c2)
!    ->  Sort
!          Sort Key: t1.c1
!          ->  Seq Scan on tt4x t1
!    ->  Materialize
!          ->  Sort
!                Sort Key: t2.c2
!                ->  Merge Right Join
!                      Merge Cond: (t5.c1 = t3.c2)
!                      ->  Merge Join
!                            Merge Cond: (t4.c2 = t5.c1)
                             ->  Sort
!                                  Sort Key: t4.c2
!                                  ->  Seq Scan on tt4x t4
                             ->  Sort
!                                  Sort Key: t5.c1
!                                  ->  Seq Scan on tt4x t5
!                      ->  Sort
!                            Sort Key: t3.c2
!                            ->  Merge Left Join
!                                  Merge Cond: (t2.c3 = t3.c1)
!                                  ->  Sort
!                                        Sort Key: t2.c3
!                                        ->  Seq Scan on tt4x t2
!                                  ->  Sort
!                                        Sort Key: t3.c1
!                                        ->  Seq Scan on tt4x t3
! (28 rows)

  --
  -- regression test for problems of the sort depicted in bug #3494

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

Previous
From: Corey Huinker
Date:
Subject: Re: [HACKERS] COPY IN/BOTH vs. extended query mode
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys