Thread: HashJoin w/option to unique-ify inner rel

HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Thu, Feb 19, 2009 at 5:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@enterprisedb.com> writes:
>> It's tempting to have Hash cheat and just peek at the node beneath it
>> to see if it's a HashAggregate, in which case it could call a special
>> method to request the whole hash. But it would have to know that it's
>> just a plain uniquify and not implementing a GROUP BY.
>
> More to the point, it would have to check if it's unique-ifying on the
> same columns and with the same operators as the upper hash is using.
> If we were going to do something like this, making it a real option to
> the Hash node and teaching the planner about that would be *much*
> easier, and would also allow saner cost estimation.

I've been thinking about this some more and would like someone else to
check my work.

The goal here is to handle a semi-join implemented as a hash join by
unique-ifying the inner rel as we are building the hash table, rather
than as a separate step.  This is correct when (1) the inner rel is
exactly the RHS of the join [if it's more or less, then the strategy
of unique-ifying the inner path and then joining is not even legal; it
will potentially produce different results] and (2) the hash join can
be performed by hashing all of the join clauses of the inner rel [if
not, then the criteria for the unique-ify step don't match the
criteria for the join step, so the join is legal but the two
operations can't be done together].

We don't need to implement any new logic to test for (1).  The current
code calls add_paths_to_joinrel() with jointype JOIN_UNIQUE_INNER only
when we are contemplating a semi-join against exactly the RHS.  So
hash_inner_and_outer() can disregard the possibility of using this
optimization when it sees any other join type.

To verify (2), hash_inner_and_outer must check that (2a) min_lefthand
is a subset of the outer rel and (2b) hashing the inner path would be
a valid method of unique-ifying it.  If (2a) does not hold, then we're
performing only part of the special join at this level, so the set of
columns on which we're hashing for the join is smaller than the set of
columns on which we need the inner rel to be unique, and the two steps
can't be merged.  If (2b) does not hold, then one or more of the join
columns are mergeable but not hashable - so it's legal to
sort-and-unique-ify the inner rel, hash the results, and then join to
the outer rel, handling the non-hashable join clauses as qpquals, but
there's no way for the hash join itself to do the unique-ification.
But if (2a) and (2b) do both hold, then all of the join clauses from
the SpecialJoinInfo will be present in the joinrel's join clause list
(since we have all of the LHS relations), and all of them are hashable
(since we know all the ones create_unique_path extracted are hashable,
and it should be extracting the same ones that hash_inner_and_outer is
pulling).
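
To make the shape of that test concrete, here is a rough sketch of the
check I have in mind for hash_inner_and_outer (illustration only, not
patch code; hash_unique_inner_possible() and the flag below are invented
names standing in for "all of the semijoin's join clauses are hashable"
and "build the unique-ifying hash path"):

    /* Sketch only: bms_is_subset(), sjinfo and outerrel are real; the
     * helper and the flag are made-up names. */
    if (jointype == JOIN_UNIQUE_INNER &&
        bms_is_subset(sjinfo->min_lefthand, outerrel->relids) &&  /* (2a) */
        hash_unique_inner_possible(root, sjinfo))                 /* (2b) */
    {
        /* let the Hash node unique-ify while loading its hash table */
        consider_unique_inner_hashjoin = true;
    }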

Does anyone see a hole in this logic?

...Robert


Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Sun, Apr 12, 2009 at 12:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 19, 2009 at 5:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Greg Stark <stark@enterprisedb.com> writes:
>>> It's tempting to have Hash cheat and just peek at the node beneath it
>>> to see if it's a HashAggregate, in which case it could call a special
>>> method to request the whole hash. But it would have to know that it's
>>> just a plain uniquify and not implementing a GROUP BY.
>>
>> More to the point, it would have to check if it's unique-ifying on the
>> same columns and with the same operators as the upper hash is using.
>> If we were going to do something like this, making it a real option to
>> the Hash node and teaching the planner about that would be *much*
>> easier, and would also allow saner cost estimation.
>
> I've been thinking about this some more and would like someone else to
> check my work.
>
> The goal here is to handle a semi-join implemented as a hash join by
> unique-ifying the inner rel as we are building the hash table, rather
> than as a separate step. [long discussion of implementation methodology]
>
> Does anyone see a hole in this logic?

Upon further review, it appears that a big part of this problem is
that cost_hashjoin() doesn't understand that it needs to cost semi-joins
differently from inner or left joins.  The bogus logic looks to be
right here:
    /*
     * The number of tuple comparisons needed is the number of outer tuples
     * times the typical number of tuples in a hash bucket, which is the inner
     * relation size times its bucketsize fraction.  At each one, we need to
     * evaluate the hashjoin quals.  But actually, charging the full qual eval
     * cost at each tuple is pessimistic, since we don't evaluate the quals
     * unless the hash values match exactly.  For lack of a better idea, halve
     * the cost estimate to allow for that.
     */
    startup_cost += hash_qual_cost.startup;
    run_cost += hash_qual_cost.per_tuple *
        outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

Of course, when the join type is JOIN_SEMI, we're going to stop
looking after we find the first match, so this estimate is really far
off.

rhaas=# explain analyze select * from a where i in (select b.i from b);
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=174.72..2095.53 rows=10 width=4) (actual time=33.274..285.627 rows=10 loops=1)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.023..125.307 rows=100000 loops=1)
   ->  Hash  (cost=174.60..174.60 rows=10 width=4) (actual time=33.209..33.209 rows=10 loops=1)
         ->  HashAggregate  (cost=174.50..174.60 rows=10 width=4) (actual time=33.161..33.175 rows=10 loops=1)
               ->  Seq Scan on b  (cost=0.00..148.80 rows=10280 width=4) (actual time=0.015..15.304 rows=10280 loops=1)
 Total runtime: 285.743 ms
(7 rows)
rhaas=# set enable_hashagg to false; set enable_sort to false;
SET
SET
rhaas=# explain analyze select * from a where i in (select b.i from b);
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=277.30..130573.10 rows=10 width=4) (actual time=31.447..292.823 rows=10 loops=1)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.022..125.300 rows=100000 loops=1)
   ->  Hash  (cost=148.80..148.80 rows=10280 width=4) (actual time=31.392..31.392 rows=10280 loops=1)
         ->  Seq Scan on b  (cost=0.00..148.80 rows=10280 width=4) (actual time=0.014..14.958 rows=10280 loops=1)
 Total runtime: 293.154 ms
(6 rows)

The planner costs the semi-join as two orders of magnitude more
expensive than the hash-join-over-hash-aggregate, but in reality the
semi-join is only marginally slower.  The planner thinks we're going
to end up wasting a lot of time walking down long hash chains, and it's
just not true.  b contains lots of duplicates, but they all hash to the
same few buckets, and when an a-value hashes to one of those buckets it's
probably because we've got a match, and because it's a semi-join, finding
one match is a sufficient excuse to skip the rest of the bucket.
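
In other words, each semi-join probe boils down to something like this
(heavily simplified sketch, not the real nodeHashjoin.c code; the helper
names are made up):

    /* Heavily simplified sketch of one semi-join probe. */
    for (tuple = bucket_head; tuple != NULL; tuple = tuple->next)
    {
        if (tuple->hashvalue != outer_hashvalue)
            continue;               /* rejected by a cheap integer compare */

        if (hash_quals_match(tuple, outer_tuple))   /* made-up helper */
        {
            emit_join_result(outer_tuple);          /* made-up helper */
            break;              /* semi-join: first match ends the scan */
        }
    }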

...Robert


Re: HashJoin w/option to unique-ify inner rel

From
"Lawrence, Ramon"
Date:
> Upon further review, it appears that a big part of this problem is
> that cost_hashjoin() doesn't understand that it needs to cost semi-joins
> differently from inner or left joins.  The bogus logic looks to be
> right here:
>     startup_cost += hash_qual_cost.startup;
>     run_cost += hash_qual_cost.per_tuple *
>         outer_path_rows * clamp_row_est(inner_path_rows *
> innerbucketsize) * 0.5;
>
> Of course, when the join type is JOIN_SEMI, we're going to stop
> looking after we find the first match, so this estimate is really far
> off.

The 8.3 version of cost_hashjoin() had a line like this:

joininfactor = join_in_selectivity(&path->jpath, root);

and a cost function like this:

run_cost += hash_qual_cost.per_tuple *
    outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
    joininfactor * 0.5;

This compensated for IN joins being able to stop scanning a bucket once
a match is found.  You may consider something similar for a semi-join.
Having experimented with a lot of this code recently, there is some
potential for improvement on the bucket sizes, etc., but it is a
non-trivial problem.

I tested a similar query on TPCH 100M1Z on version 8.3:

select * from customer where c_custkey in (select o_custkey from orders)

and found that hash aggregate was marginally faster.  If you turn off
aggregation, it selects an IN hash join which is about 5% slower and the
planner is not too far off.  So, it would definitely be possible to
modify the cost function appropriately.

> >>> It's tempting to have Hash cheat and just peek at the node beneath it
> >>> to see if it's a HashAggregate, in which case it could call a special
> >>> method to request the whole hash. But it would have to know that it's
> >>> just a plain uniquify and not implementing a GROUP BY.

If HashAggregate is faster, then the question is can you make it better
by avoiding building the hash structure twice.  I haven't considered all
the possibilities, but the situation you have used as an example, an IN
query, seems workable.  Instead of translating to a hash
aggregate/hash/hash join query plan, it may be possible to create a
special hash join node that does uniquefy.

The benefit is that the planner knows about it (instead of changing the
execution plan), you can be more accurate on costs for the hash join,
and you can optimize by using only one hash table construction. A
challenge that must be dealt with is handling the multi-batch case.  It
appears that hash aggregate does not currently handle this, but I may be
mistaken.
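
As a strawman, the planner-visible piece might be nothing more than an
extra flag on the existing Hash plan node (sketch only; the field name is
invented, and the executor work, especially the multi-batch case, is where
the real effort would go):

    typedef struct Hash
    {
        Plan        plan;
        /* ... existing fields omitted ... */
        bool        unique_inner;   /* invented field: drop duplicate inner
                                     * tuples while loading the hash table */
    } Hash;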

--
Ramon Lawrence


Re: HashJoin w/option to unique-ify inner rel

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Upon further review, it appears that a big part of this problem is
> that cost_hashjoin() doesn't understand that it needs to cost semi-joins
> differently from inner or left joins.

Yeah, I have a note to look into that before 8.4 final.  The same is
true for nestloops: stopping after hitting one match to the current
outer can make a big difference, and that's not reflected in costsize.c
yet.  I'm not sure whether cost_mergejoin needs to care.
        regards, tom lane


Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Thu, Apr 16, 2009 at 7:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Upon further review, it appears that a big part of this problem is
>> that cost_hashjoin() doesn't understand that it needs to cost semi-joins
>> differently from inner or left joins.
>
> Yeah, I have a note to look into that before 8.4 final.  The same is
> true for nestloops: stopping after hitting one match to the current
> outer can make a big difference, and that's not reflected in costsize.c
> yet.  I'm not sure whether cost_mergejoin needs to care.

Cool.  It's worth noting that this is also a problem for anti-joins,
which can also stop after hitting one match.

For merge join, it looks like you might save something if there are
duplicates in both the inner and outer lists.  If the outer side
contains 1,1,1,1,1,2 and the inner side contains 1,1,1,2,2, then an
INNER JOIN or LEFT JOIN will rescan the first three tuples of the
inner side three times, whereas a SEMI or ANTI JOIN will scan the
first tuple three times and then the next two only once.  But I'm not
sure if we can estimate this accurately enough to matter.

...Robert


Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
> If HashAggregate is faster, then the question is can you make it better
> by avoiding building the hash structure twice.  I haven't considered all
> the possibilities, but the situation you have used as an example, an IN
> query, seems workable.  Instead of translating to a hash
> aggregate/hash/hash join query plan, it may be possible to create a
> special hash join node that does uniquefy.

Yeah, that's what I was looking at.  The problem is that unique-ify is
not free either - we have to invoke the appropriate comparison
operators for every tuple in the bucket for which the hash values
match exactly.  So, for example, if the input has K copies each of N
items, I'll need to do (K - 1) * N comparisons, assuming no hash
collisions.  In return, the number of tuples in each bucket will be
reduced by a factor of K, but that doesn't actually save very much,
because I can reject all of those with an integer comparison anyway,
again assuming no hash collisions, so it's pretty cheap.
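
To put rough numbers on the earlier example: b had 10280 rows covering
about 10 distinct values, so K is roughly 1028 and N is 10, which works
out to something like 10,270 extra comparison-operator calls during the
build - and the bucket entries they would have eliminated are ones that a
non-matching probe rejects with cheap integer comparisons and a matching
semi-join probe never looks past anyway.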

If the hash join was on track to go multi-batch, then unique-ifying it
on the fly makes a lot of sense...  otherwise, I'm not sure it's
really going to be a win.  Anyhow, further analysis needed...

...Robert


Re: HashJoin w/option to unique-ify inner rel

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Upon further review, it appears that a big part of this problem is
> that cost_hashjoin() doesn't understand that it needs to cost semi-joins
> differently from inner or left joins.
> ...
> The planner costs the semi-join as two orders of magnitude more
> expensive than the hash-join-over-hash-aggregate, but in reality the
> semi-join is only marginally slower.  The planner thinks we're going
> to end up wasting a lot of time walking down long hash chains, and it's
> just not true.  b contains lots of duplicates, but they all hash to the
> same few buckets, and when an a-value hashes to one of those buckets it's
> probably because we've got a match, and because it's a semi-join, finding
> one match is a sufficient excuse to skip the rest of the bucket.

I spent today poking at this problem, without much success.  I had
thought that the solution was basically to re-introduce something
comparable to the "joininfactor" correction that was used in 8.3 and
previous branches.  However, that attempt (attached for the archives)
crashed and burned miserably.  In the attached patch, we estimate the
number of outer tuples that have some match (as opposed to no match at
all) and the average number of matches for each tuple that has some
match.  These estimates seem to be pretty good, at least in the simple
test case I'm using.  However, it then combines the numbers into a
single work-savings factor averaged across all outer tuples:

+     /*
+      * For an outer-rel row that has no match, the executor will have to scan
+      * the whole input relation to prove that.  Remembering the meaning of
+      * jselec, we therefore estimate the average scan fraction over all
+      * outer-rel rows thus:
+      */
+     scanfrac = jselec * matchfrac + (1.0 - jselec) /* times 1.0 */ ;

and it turns out this is basically missing the point altogether.  The
above comment is correct in the case where we're doing a nestloop with
simple seqscan on the inner relation: we're going to have to do the
maximum amount of work for any unmatched outer tuple.  But in the
hashjoin case, as you noted, it's quite likely that an unmatched outer
will result in a probe into an empty hashbucket and thus expend minimum
rather than maximum work.

The case where we are doing a nestloop with inner indexscan (using the
join clause as index condition) also costs far less than this equation
suggests.  A no-match outer tuple will result in an indexscan that
selects no tuples, not one that selects lots of tuples we then have to
ignore.

BTW, the pre-8.4 coding basically applied the equivalent of "matchfrac"
for all tuples, thus getting something like the right answer for matched
tuples but being entirely bogus for unmatched ones.  However, in these
cases where unmatched tuples are cheap to process, this ended up being
a fairly reasonable estimate anyway.  But it was completely wrong for
cases where unmatched tuples are expensive, which is why I'd tried to
rip it out (we have had complaints about such cases in the past).

So it appears to me that instead of taking an average-case correction
as is done in this patch and the old coding, we have to explicitly model
the matched-tuple and unmatched-tuple cases separately.  For hashjoins,
what I think we should do is estimate the bucket scanning cost for
unmatched tuples assuming a perfectly uniform bucket distribution,
rather than the worst-case bucket size that's used in the current code
(and is a reasonable estimate for matched tuples).  This is to reflect
the fact that an unmatched outer tuple might or might not hit a
populated bucket, and which bucket it hits is probably uncorrelated with
the distribution of the inner tuples.

The problem seems to be a lot messier to solve for nestloop joins.
We have to detect whether the inner path is an indexscan using the
semijoin clause as an index qual --- if it is not, then the correct
thing is to model unmatched tuples as expensive.  If it is, then I think
we can take the cost of unmatched tuples as about equal to the cost of
retrieving a single row from an indexscan with matches (which we can get
by interpolating in the path's given costs).  The problem here is that the
current representation of these path structures doesn't make it cheap to
determine which joinclauses are being used as index quals.  Currently
that's okay because we only do it once at the end (in createplan.c) but
if we need the information during path searching it's got to be fairly
cheap to get.  So some rejiggering of the Path representation seems
called for.

I wouldn't blink at doing this during development phase, but it's a
larger change than I'd really like to be making during beta.  OTOH,
what we have got now is a definite regression from the pre-8.4 behavior,
especially in the cases where unmatched tuples are cheap.  Maybe there's
not much choice but to fix it now.

Comments?

            regards, tom lane

Index: costsize.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.207
diff -c -r1.207 costsize.c
*** costsize.c    17 Apr 2009 15:33:33 -0000    1.207
--- costsize.c    24 Apr 2009 22:44:36 -0000
***************
*** 119,124 ****
--- 119,126 ----
                 RestrictInfo *rinfo,
                 PathKey *pathkey);
  static bool cost_qual_eval_walker(Node *node, cost_qual_eval_context *context);
+ static Selectivity semi_join_fraction(JoinPath *path, PlannerInfo *root,
+                                       SpecialJoinInfo *sjinfo);
  static double approx_tuple_count(PlannerInfo *root, JoinPath *path,
                                   List *quals);
  static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
***************
*** 1399,1408 ****
--- 1401,1414 ----
      double        outer_path_rows = PATH_ROWS(outer_path);
      double        inner_path_rows = nestloop_inner_path_rows(inner_path);
      double        ntuples;
+     Selectivity semi_fraction;

      if (!enable_nestloop)
          startup_cost += disable_cost;

+     /* Estimate cost savings fraction for JOIN_SEMI/ANTI cases */
+     semi_fraction = semi_join_fraction(path, root, sjinfo);
+
      /* cost of source data */

      /*
***************
*** 1429,1440 ****
          run_cost += (outer_path_rows - 1) * inner_path->startup_cost;
      }
      run_cost += outer_path_rows *
!         (inner_path->total_cost - inner_path->startup_cost);

      /*
       * Compute number of tuples processed (not number emitted!)
       */
!     ntuples = outer_path_rows * inner_path_rows;

      /* CPU costs */
      cost_qual_eval(&restrict_qual_cost, path->joinrestrictinfo, root);
--- 1435,1446 ----
          run_cost += (outer_path_rows - 1) * inner_path->startup_cost;
      }
      run_cost += outer_path_rows *
!         (inner_path->total_cost - inner_path->startup_cost) * semi_fraction;

      /*
       * Compute number of tuples processed (not number emitted!)
       */
!     ntuples = outer_path_rows * inner_path_rows * semi_fraction;

      /* CPU costs */
      cost_qual_eval(&restrict_qual_cost, path->joinrestrictinfo, root);
***************
*** 1485,1490 ****
--- 1491,1497 ----
                  outerendsel,
                  innerstartsel,
                  innerendsel;
+     Selectivity semi_fraction;
      Path        sort_path;        /* dummy for result of cost_sort */

      /* Protect some assumptions below that rowcounts aren't zero */
***************
*** 1496,1501 ****
--- 1503,1511 ----
      if (!enable_mergejoin)
          startup_cost += disable_cost;

+     /* Estimate cost savings fraction for JOIN_SEMI/ANTI cases */
+     semi_fraction = semi_join_fraction(&path->jpath, root, sjinfo);
+
      /*
       * Compute cost of the mergequals and qpquals (other restriction clauses)
       * separately.
***************
*** 1718,1723 ****
--- 1728,1735 ----
       * The number of tuple comparisons needed is approximately number of outer
       * rows plus number of inner rows plus number of rescanned tuples (can we
       * refine this?).  At each one, we need to evaluate the mergejoin quals.
+      * NOTE: semi/anti join mode does not save any work here, so do NOT apply
+      * semi_fraction.
       */
      startup_cost += merge_qual_cost.startup;
      startup_cost += merge_qual_cost.per_tuple *
***************
*** 1730,1740 ****
       * For each tuple that gets through the mergejoin proper, we charge
       * cpu_tuple_cost plus the cost of evaluating additional restriction
       * clauses that are to be applied at the join.    (This is pessimistic since
!      * not all of the quals may get evaluated at each tuple.)
       */
      startup_cost += qp_qual_cost.startup;
      cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
!     run_cost += cpu_per_tuple * mergejointuples;

      path->jpath.path.startup_cost = startup_cost;
      path->jpath.path.total_cost = startup_cost + run_cost;
--- 1742,1753 ----
       * For each tuple that gets through the mergejoin proper, we charge
       * cpu_tuple_cost plus the cost of evaluating additional restriction
       * clauses that are to be applied at the join.    (This is pessimistic since
!      * not all of the quals may get evaluated at each tuple.)  Here we should
!      * apply semi_fraction, since mergejointuples doesn't account for that.
       */
      startup_cost += qp_qual_cost.startup;
      cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
!     run_cost += cpu_per_tuple * mergejointuples * semi_fraction;

      path->jpath.path.startup_cost = startup_cost;
      path->jpath.path.total_cost = startup_cost + run_cost;
***************
*** 1824,1834 ****
--- 1837,1851 ----
      int            num_skew_mcvs;
      double        virtualbuckets;
      Selectivity innerbucketsize;
+     Selectivity semi_fraction;
      ListCell   *hcl;

      if (!enable_hashjoin)
          startup_cost += disable_cost;

+     /* Estimate cost savings fraction for JOIN_SEMI/ANTI cases */
+     semi_fraction = semi_join_fraction(&path->jpath, root, sjinfo);
+
      /*
       * Compute cost of the hashquals and qpquals (other restriction clauses)
       * separately.
***************
*** 1972,1987 ****

      /*
       * The number of tuple comparisons needed is the number of outer tuples
!      * times the typical number of tuples in a hash bucket, which is the inner
!      * relation size times its bucketsize fraction.  At each one, we need to
!      * evaluate the hashjoin quals.  But actually, charging the full qual eval
!      * cost at each tuple is pessimistic, since we don't evaluate the quals
!      * unless the hash values match exactly.  For lack of a better idea, halve
!      * the cost estimate to allow for that.
       */
      startup_cost += hash_qual_cost.startup;
      run_cost += hash_qual_cost.per_tuple *
!         outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

      /*
       * For each tuple that gets through the hashjoin proper, we charge
--- 1989,2005 ----

      /*
       * The number of tuple comparisons needed is the number of outer tuples
!      * times the typical number of tuples in a hash bucket (which is the inner
!      * relation size times its bucketsize fraction) times the semi_fraction.
!      * At each tuple, we need to evaluate the hashjoin quals.  But actually,
!      * charging the full qual eval cost at each tuple is pessimistic, since we
!      * don't evaluate the quals unless the hash values match exactly.  For
!      * lack of a better idea, halve the cost estimate to allow for that.
       */
      startup_cost += hash_qual_cost.startup;
      run_cost += hash_qual_cost.per_tuple *
!         outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
!         semi_fraction * 0.5;

      /*
       * For each tuple that gets through the hashjoin proper, we charge
***************
*** 1991,1997 ****
       */
      startup_cost += qp_qual_cost.startup;
      cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
!     run_cost += cpu_per_tuple * hashjointuples;

      path->jpath.path.startup_cost = startup_cost;
      path->jpath.path.total_cost = startup_cost + run_cost;
--- 2009,2015 ----
       */
      startup_cost += qp_qual_cost.startup;
      cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
!     run_cost += cpu_per_tuple * hashjointuples * semi_fraction;

      path->jpath.path.startup_cost = startup_cost;
      path->jpath.path.total_cost = startup_cost + run_cost;
***************
*** 2321,2326 ****
--- 2339,2483 ----


  /*
+  * semi_join_fraction
+  *      Determines the fraction of the inner input that a SEMI or ANTI join
+  *      can be expected to scan.
+  *
+  * In a hash or nestloop SEMI/ANTI join, the executor will stop scanning
+  * inner rows as soon as it finds a match to the current outer row.
+  * We should therefore adjust some of the cost components for this effect.
+  * This function computes the fraction of the inner relation that can be
+  * expected to be scanned on average.  (In a hash join, this can be taken
+  * as the fraction of the selected hash bucket, not of the whole relation,
+  * but the calculation here is the same.)  Mergejoins only avoid evaluating
+  * otherquals, so the effect is much weaker for them, but it still exists.
+  *
+  * 'path' is already filled in except for the cost fields
+  * 'sjinfo' is extra info about the join for selectivity estimation
+  */
+ static Selectivity
+ semi_join_fraction(JoinPath *path, PlannerInfo *root, SpecialJoinInfo *sjinfo)
+ {
+     Selectivity scanfrac;
+     Selectivity matchfrac;
+     Selectivity avgmatch;
+     Selectivity jselec;
+     Selectivity nselec;
+     SpecialJoinInfo norm_sjinfo;
+     JoinType    jointype = path->jointype;
+     List       *joinquals;
+     ListCell   *l;
+
+     /* Return 1.0 whenever it's not JOIN_SEMI or JOIN_ANTI */
+     if (jointype != JOIN_SEMI && jointype != JOIN_ANTI)
+         return 1.0;
+
+     /*
+      * Note: it's annoying to repeat this selectivity estimation on each call,
+      * when the joinclause list will be the same for all path pairs
+      * implementing a given join.  clausesel.c will save us from the worst
+      * effects of this by caching at the RestrictInfo level; but perhaps it'd
+      * be worth finding a way to cache the results at a higher level.
+      */
+
+     /*
+      * In an ANTI join, we must ignore clauses that are "pushed down",
+      * since those won't affect the match logic.  In a SEMI join, we do not
+      * distinguish joinquals from "pushed down" quals, so just use the whole
+      * restrictinfo list.
+      */
+     if (jointype == JOIN_ANTI)
+     {
+         joinquals = NIL;
+         foreach(l, path->joinrestrictinfo)
+         {
+             RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+             Assert(IsA(rinfo, RestrictInfo));
+             if (!rinfo->is_pushed_down)
+                 joinquals = lappend(joinquals, rinfo);
+         }
+     }
+     else
+         joinquals = path->joinrestrictinfo;
+
+     /*
+      * Get the JOIN_SEMI or JOIN_ANTI selectivity of the join clauses.
+      */
+     jselec = clauselist_selectivity(root,
+                                     joinquals,
+                                     0,
+                                     jointype,
+                                     sjinfo);
+
+     /*
+      * Also get the normal inner-join selectivity of the join clauses.
+      */
+     norm_sjinfo.type = T_SpecialJoinInfo;
+     norm_sjinfo.min_lefthand = path->outerjoinpath->parent->relids;
+     norm_sjinfo.min_righthand = path->innerjoinpath->parent->relids;
+     norm_sjinfo.syn_lefthand = path->outerjoinpath->parent->relids;
+     norm_sjinfo.syn_righthand = path->innerjoinpath->parent->relids;
+     norm_sjinfo.jointype = JOIN_INNER;
+     /* we don't bother trying to make the remaining fields valid */
+     norm_sjinfo.lhs_strict = false;
+     norm_sjinfo.delay_upper_joins = false;
+     norm_sjinfo.join_quals = NIL;
+
+     nselec = clauselist_selectivity(root,
+                                     joinquals,
+                                     0,
+                                     JOIN_INNER,
+                                     &norm_sjinfo);
+
+     /* Avoid leaking a lot of ListCells */
+     if (jointype == JOIN_ANTI)
+         list_free(joinquals);
+
+     /*
+      * jselec can be interpreted as the fraction of outer-rel rows that have
+      * any matches (this is true for both SEMI and ANTI cases).  And nselec
+      * is the fraction of the Cartesian product that matches.  So, the
+      * average number of matches for each outer-rel row that has at least
+      * one match is nselec * inner_rows / jselec.
+      *
+      * Note: it is correct to use the inner rel's "rows" count here, not
+      * PATH_ROWS(), even if the inner path under consideration is an inner
+      * indexscan.  This is because we have included all the join clauses
+      * in the selectivity estimate, even ones used in an inner indexscan.
+      */
+     if (jselec > 0)                /* protect against zero divide */
+         avgmatch = nselec * path->innerjoinpath->parent->rows / jselec;
+     else
+         avgmatch = 0;
+
+     /*
+      * For an outer-rel row that has at least one match, we can expect the
+      * inner scan to stop after a fraction 1/(avgmatch+1) of the inner rows,
+      * if the matches are evenly distributed.  Since they probably aren't
+      * quite evenly distributed, we apply a fuzz factor of 2.0 to that
+      * fraction (but, of course, clamping at an upper limit of 1.0).
+      */
+     matchfrac = 2.0 / (avgmatch + 1.0);
+     matchfrac = Min(1.0, matchfrac);
+
+     /*
+      * For an outer-rel row that has no match, the executor will have to scan
+      * the whole input relation to prove that.  Remembering the meaning of
+      * jselec, we therefore estimate the average scan fraction over all
+      * outer-rel rows thus:
+      */
+     scanfrac = jselec * matchfrac + (1.0 - jselec) /* times 1.0 */ ;
+
+     /* clamp the result to sane range in case of roundoff errors */
+     scanfrac = Max(scanfrac, 0.0);
+     scanfrac = Min(scanfrac, 1.0);
+
+     return scanfrac;
+ }
+
+
+ /*
   * approx_tuple_count
   *        Quick-and-dirty estimation of the number of join rows passing
   *        a set of qual conditions.

Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Fri, Apr 24, 2009 at 8:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So it appears to me that instead of taking an average-case correction
> as is done in this patch and the old coding, we have to explicitly model
> the matched-tuple and unmatched-tuple cases separately.  For hashjoins,
> what I think we should do is estimate the bucket scanning cost for
> unmatched tuples assuming a perfectly uniform bucket distribution,
> rather than the worst-case bucket size that's used in the current code
> (and is a reasonable estimate for matched tuples).  This is to reflect
> the fact that an unmatched outer tuple might or might not hit a
> populated bucket, and which bucket it hits is probably uncorrelated with
> the distribution of the inner tuples.

I've been investigating this some more as well and have come to the
conclusion that the following code (and estimate_hash_bucketsize) are
pretty much wrong.

>        /*
>         * The number of tuple comparisons needed is the number of outer tuples
> !        * times the typical number of tuples in a hash bucket, which is the inner
> !        * relation size times its bucketsize fraction.  At each one, we need to
> !        * evaluate the hashjoin quals.  But actually, charging the full qual eval
> !        * cost at each tuple is pessimistic, since we don't evaluate the quals
> !        * unless the hash values match exactly.  For lack of a better idea, halve
> !        * the cost estimate to allow for that.
>         */
>        startup_cost += hash_qual_cost.startup;
>        run_cost += hash_qual_cost.per_tuple *
> !               outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

As far as I can tell, the focus on trying to estimate the number of
tuples per bucket is entirely misguided.  Supposing the relation is
mostly unique so that the values don't cluster too much, the right
answer is (of course) NTUP_PER_BUCKET.   estimate_hash_bucketsize()
backs into this number, but it places far too much importance on the
value.  If you increase NTUP_PER_BUCKET to, say, 50, the planner
concludes that hash joins are really, really expensive, and avoids
them like the plague, but in reality they're only a little slower.
Why?  Because the extra tuples that get thrown into the bucket
generally don't have the same hash value (or if they did, they would
have been in the bucket either way...) and get rejected with a simple
integer comparison, which is much cheaper than
hash_qual_cost.per_tuple.

It seems to me that what we should really be attempting to estimate
here is the likely number of times we're going to have to evaluate the
hash quals per tuple.  Your idea of estimating the matched and unmatched
cases separately is a lot better than anything that I had come up
with.  In the unmatched case, the cost of a hash probe is nearly zero,
regardless of whether the hash bucket is empty or not (because the
probability of a real hash collision is quite low, and nothing else
requires evaluating the hash quals).  In the matched case, we expect
to probe about inner_rows/inner_ndistinct tuples - unless it's a SEMI
or ANTI join, in which case we expect to probe exactly one row.

As a side note, the code in estimate_hash_bucketsize that scales up
the estimate of tuples per bucket by the ratio of the frequency of the
most common value looks pretty sketchy as well.  Consider a relation
which has 10000 values all distinct.  The frequency of each value is
0.0001.  Now we add a duplicate copy of one value, which makes it an
MCV with almost twice the frequency of any other value in the table,
so estimate_hash_bucketsize() essentially *doubles* its estimate of
how many items are in a bucket.  That doesn't make a lot of sense.  At
the very least, we ought not to do this scaling when the outer rel is
more or less distinct on the hash key; even if the inner rel has a
fair number of duplicates, things will still be pretty fast.

...Robert


Re: HashJoin w/option to unique-ify inner rel

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> As far as I can tell, the focus on trying to estimate the number of
> tuples per bucket is entirely misguided.  Supposing the relation is
> mostly unique so that the values don't cluster too much, the right
> answer is (of course) NTUP_PER_BUCKET.

But the entire point of that code is to arrive at a sane estimate
when the inner relation *isn't* mostly unique and *does* cluster.
So I think you're being much too hasty to conclude that it's wrong.

> Because the extra tuples that get thrown into the bucket
> generally don't have the same hash value (or if they did, they would
> have been in the bucket either way...) and get rejected with a simple
> integer comparison, which is much cheaper than
> hash_qual_cost.per_tuple.

Yeah, we are charging more than we ought to for bucket entries that can
be rejected on the basis of hashcode comparisons.  The difficulty is to
arrive at a reasonable guess of what fraction of the bucket entries will
be so rejected, versus those that will incur a comparison-function call.
I'm leery of assuming there are no hash collisions, which is what you
seem to be proposing.
        regards, tom lane


Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Fri, Apr 24, 2009 at 10:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> As far as I can tell, the focus on trying to estimate the number of
>> tuples per bucket is entirely misguided.  Supposing the relation is
>> mostly unique so that the values don't cluster too much, the right
>> answer is (of course) NTUP_PER_BUCKET.
>
> But the entire point of that code is to arrive at a sane estimate
> when the inner relation *isn't* mostly unique and *does* cluster.
> So I think you're being much too hasty to conclude that it's wrong.
>
>> Because the extra tuples that get thrown into the bucket
>> generally don't have the same hash value (or if they did, they would
>> have been in the bucket either way...) and get rejected with a simple
>> integer comparison, which is much cheaper than
>> hash_qual_cost.per_tuple.
>
> Yeah, we are charging more than we ought to for bucket entries that can
> be rejected on the basis of hashcode comparisons.  The difficulty is to
> arrive at a reasonable guess of what fraction of the bucket entries will
> be so rejected, versus those that will incur a comparison-function call.
> I'm leery of assuming there are no hash collisions, which is what you
> seem to be proposing.

Not really (though the chances for small relations are very low).

What bothers me is that estimate_hash_bucketsize() treats the
ndistinct < nbuckets and ndistinct > nbuckets cases as symmetrical,
and they really aren't.  When ndistinct < nbuckets, the values are all
going to cluster in a subset of the buckets, and we're not going to be
able to reject much of anything on the basis of hashcode comparisons,
because odds are good that there's only one distinct value in the
bucket, so every tuple we see will require a hash qual evaluation (and
they'll probably all return true).  On the other hand, when
ndistinct > nbuckets, we expect multiple distinct values per bucket, and
the high bits of the hash code will frequently be different, and so the
hashcode comparison will be pretty successful in tossing stuff out the
window.  So the problem is that the information we have here:

    if (ndistinct > nbuckets)
        estfract = 1.0 / nbuckets;
    else
        estfract = 1.0 / ndistinct;

...isn't available to cost_hashjoin().  If the lower half of this
branch is selected, chances are good that the hash quals will have to
be evaluated almost 100% of the time (so the 0.5 that cost_hashjoin
uses is an underestimate).  But if the upper half is selected, then
our odds of not needing to evaluate the hash quals are something like
1 - (ndistinct / nbuckets).

The thing to do, or so it seems to me, is to rewrite
estimate_hash_bucketsize() to return the number of hash qual
evaluations per probe rather than the fraction of items in a given
bucket; that's where you have the statistics to make that estimate.
It should be possible to factor in the possibility of hash collisions
as well.  If the hash function works totally excellently, the chance
of a hash collision for a single value will be 2^-x, where x is the
number of bits that are used neither to select the bucket nor to
select the batch.  You can multiply through by the estimated number of
entries in the bucket and then by some fudge factor.  This won't
matter much for small relations but it might for large ones,
especially for large settings of work_mem.  (If work_mem is set to a
relatively normal value, the cost of going multi-batch is likely to
blow the hash-join plan out of the water anyway... but Stephen Frost
was telling me at JDcon East that he sometimes sets it to something
like 8GB when he's the only user on his apparently-quite-awesome
hardware...)
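
In sketch form, the collision estimate I'm describing is just this
(illustration only, not PostgreSQL code; the function and parameter names
are made up):

    #include <math.h>

    /* Expected hash-code collisions per probe: bucket_bits/batch_bits are
     * the bits of the 32-bit hash consumed by bucket and batch selection,
     * fudge covers imperfect hashing. */
    static double
    expected_collisions_per_probe(double tuples_in_bucket,
                                  int bucket_bits, int batch_bits,
                                  double fudge)
    {
        int     free_bits = 32 - bucket_bits - batch_bits;

        return tuples_in_bucket * ldexp(1.0, -free_bits) * fudge;
    }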

...Robert


Re: HashJoin w/option to unique-ify inner rel

From
Grzegorz Jaskiewicz
Date:
On 25 Apr 2009, at 04:52, Robert Haas wrote:

> blow the hash-join plan out of the water anyway... but Stephen Frost
> was telling me at JDcon East that he sometimes sets it to something
> like 8GB when he's the only user on his apparently-quite-awesome
> hardware...)
For the record, because most queries have 5-6 joins here, I always set
it up to 32MB on production. We don't have more than 100-150
connections, so it plays well on a normal 32-bit machine with 4GB.

If what you wrote about hash-join is confirmed by others, then I am
pretty much +100 for fixing it.

(just my penny).

thanks.



Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Sat, Apr 25, 2009 at 6:42 AM, Grzegorz Jaskiewicz
<gj@pointblue.com.pl> wrote:
> On 25 Apr 2009, at 04:52, Robert Haas wrote:
>> blow the hash-join plan out of the water anyway... but Stephen Frost
>> was telling me at JDcon East that he sometimes sets it to something
>> like 8GB when he's the only user on his apparently-quite-awesome
>> hardware...)
>
> For the record, because most queries have 5-6 joins here, I always set it up
> to 32MB on production. We don't have more than 100-150 connections, so it
> plays well on a normal 32-bit machine with 4GB.
>
> If what you wrote about hash-join is confirmed by others, then I am pretty
> much +100 for fixing it.
>
> (just my penny).

You may find the attached patch interesting to play around with.  It
changes the NTUP_PER_BUCKET into a GUC called hash_load, and adds
EXPLAIN support to show the number of buckets and batches.  This is
just for experimentation: I'm not in favor of adding Yet Another Thing
for users to tune, but if you try it out, you will see (I think) that
changing hash_load has a dramatic effect on the estimated cost of a
hash join but a much less dramatic effect on the actual run-time.

...Robert

Attachment

Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Fri, Apr 24, 2009 at 10:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> As far as I can tell, the focus on trying to estimate the number of
>> tuples per bucket is entirely misguided.  Supposing the relation is
>> mostly unique so that the values don't cluster too much, the right
>> answer is (of course) NTUP_PER_BUCKET.
>
> But the entire point of that code is to arrive at a sane estimate
> when the inner relation *isn't* mostly unique and *does* cluster.
> So I think you're being much too hasty to conclude that it's wrong.

A few further thoughts on this...  the patch you attached computes
jselec (percentage of tuples with a match) and avgmatch (number of
matches per tuple with at least one match).  This seems like a really
good idea, but perhaps we should do it even when performing an
inner/left join, if that's possible.

Regardless of the join type, when there is no match, we need to scan
the entire bucket.  There's no reason to assume we'll hit more
populated buckets more often.  So, if there are 2^b virtual buckets
and n inner tuples, then the bucket we hit will on average contain
n/(2^b) tuples and the chances of a hash collision for any given tuple
in that bucket are about 2^-(32-b) multiplied by whatever fudge factor
you want to allow for imperfect hashing (call it f).  The total number
of hash collisions per unmatched tuple will be equal to the product of
the number of tuples and the chance of a collision per tuple, or in
other words n/(2^b) * 2^-(32-b) * f = n * 2^-32 * f.

For an inner/left join, when there is a match, we still need to scan
the entire bucket, but now we know we'll have to evaluate the hash
quals at least avgmatch times.  In addition, we risk having to
evaluate the hash quals for each tuple in the bucket that doesn't
match, if the hash values happen to collide.  In this case we can only
have hash collisions with the tuples we're not matching, so the number
of hash qual evaluations should be about (n - avgmatch) * 2^-32 * f +
avgmatch.

For a semi/anti join, when there is a match, we only need to scan
until we hit the first matching tuple.  If we let the number of
collisions, c, be (n - avgmatch) * 2^-32 * f, then the percentage of
the bucket we need to scan is about c / (c + avgmatch).  Since they
might not be randomly distributed within the bucket, multiply by 2 as
in your existing patch, but with a maximum of 1.0.  So the number of
hash qual evaluations will be:

MIN(1.0, 2.0 * c / (c + avgmatch) ) * (c + avgmatch) = MIN(c +
avgmatch, 2.0 * c)
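
Pulling the three cases together as code (illustration only; this is just
a literal transcription of the estimates above, not anything in the tree,
and all the names are made up):

    #include <math.h>

    /* n = inner tuples, avgmatch = matches per matched outer tuple,
     * f = fudge factor for imperfect hashing.  Each function returns the
     * expected number of hash qual evaluations per outer tuple. */

    static double
    evals_unmatched(double n, double f)
    {
        return n * ldexp(1.0, -32) * f;             /* n * 2^-32 * f */
    }

    static double
    evals_matched_inner_or_left(double n, double avgmatch, double f)
    {
        return (n - avgmatch) * ldexp(1.0, -32) * f + avgmatch;
    }

    static double
    evals_matched_semi_or_anti(double n, double avgmatch, double f)
    {
        double  c = (n - avgmatch) * ldexp(1.0, -32) * f;

        return fmin(c + avgmatch, 2.0 * c);         /* MIN(c+avgmatch, 2c) */
    }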

...Robert


Re: HashJoin w/option to unique-ify inner rel

From
Tom Lane
Date:
I wrote:
> ... So it appears to me that instead of taking an average-case correction
> as is done in this patch and the old coding, we have to explicitly model
> the matched-tuple and unmatched-tuple cases separately.

I've applied the attached patch that does things this way.  I did not do
anything about improving the detailed modeling of hash-bucket searching
as Robert suggested in some later messages.  I think that's probably
worth looking at, but it's a second-order consideration --- this patch
already seems to bring the estimates for semi/antijoins much closer
to reality.

I am a bit concerned about the extra time spent on repeated selectivity
estimates.  It might not matter too much since it's only done for semi
and anti joins which aren't that common.  It would be good though if
someone who has a lot of such joins could test CVS HEAD and see if
performance has gotten worse (Kevin?).  We could refactor things to
reduce the duplication of effort but I'd prefer to leave that sort of
thing to 8.5.

BTW, if you're reading the patch in detail, the changes outside
costsize.c are just refactoring to allow costsize.c to use some
code that was formerly buried in createplan.c.  The changes in
costsize.c use the same basic selectivity calculation as in my
patch of two weeks ago, but apply the results differently as per
our discussion.

            regards, tom lane

Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.207
diff -c -r1.207 costsize.c
*** src/backend/optimizer/path/costsize.c    17 Apr 2009 15:33:33 -0000    1.207
--- src/backend/optimizer/path/costsize.c    9 May 2009 22:45:48 -0000
***************
*** 71,76 ****
--- 71,77 ----
  #include "optimizer/pathnode.h"
  #include "optimizer/placeholder.h"
  #include "optimizer/planmain.h"
+ #include "optimizer/restrictinfo.h"
  #include "parser/parsetree.h"
  #include "utils/lsyscache.h"
  #include "utils/selfuncs.h"
***************
*** 119,124 ****
--- 120,130 ----
                 RestrictInfo *rinfo,
                 PathKey *pathkey);
  static bool cost_qual_eval_walker(Node *node, cost_qual_eval_context *context);
+ static bool adjust_semi_join(PlannerInfo *root, JoinPath *path,
+                  SpecialJoinInfo *sjinfo,
+                  Selectivity *outer_match_frac,
+                  Selectivity *match_count,
+                  bool *indexed_join_quals);
  static double approx_tuple_count(PlannerInfo *root, JoinPath *path,
                                   List *quals);
  static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
***************
*** 1394,1404 ****
--- 1400,1414 ----
      Path       *inner_path = path->innerjoinpath;
      Cost        startup_cost = 0;
      Cost        run_cost = 0;
+     Cost        inner_run_cost;
      Cost        cpu_per_tuple;
      QualCost    restrict_qual_cost;
      double        outer_path_rows = PATH_ROWS(outer_path);
      double        inner_path_rows = nestloop_inner_path_rows(inner_path);
      double        ntuples;
+     Selectivity    outer_match_frac;
+     Selectivity    match_count;
+     bool        indexed_join_quals;

      if (!enable_nestloop)
          startup_cost += disable_cost;
***************
*** 1428,1440 ****
           */
          run_cost += (outer_path_rows - 1) * inner_path->startup_cost;
      }
!     run_cost += outer_path_rows *
!         (inner_path->total_cost - inner_path->startup_cost);

!     /*
!      * Compute number of tuples processed (not number emitted!)
!      */
!     ntuples = outer_path_rows * inner_path_rows;

      /* CPU costs */
      cost_qual_eval(&restrict_qual_cost, path->joinrestrictinfo, root);
--- 1438,1503 ----
           */
          run_cost += (outer_path_rows - 1) * inner_path->startup_cost;
      }
!     inner_run_cost = inner_path->total_cost - inner_path->startup_cost;

!     if (adjust_semi_join(root, path, sjinfo,
!                          &outer_match_frac,
!                          &match_count,
!                          &indexed_join_quals))
!     {
!         double        outer_matched_rows;
!         Selectivity    inner_scan_frac;
!
!         /*
!          * SEMI or ANTI join: executor will stop after first match.
!          *
!          * For an outer-rel row that has at least one match, we can expect the
!          * inner scan to stop after a fraction 1/(match_count+1) of the inner
!          * rows, if the matches are evenly distributed.  Since they probably
!          * aren't quite evenly distributed, we apply a fuzz factor of 2.0 to
!          * that fraction.  (If we used a larger fuzz factor, we'd have to
!          * clamp inner_scan_frac to at most 1.0; but since match_count is at
!          * least 1, no such clamp is needed now.)
!          */
!         outer_matched_rows = rint(outer_path_rows * outer_match_frac);
!         inner_scan_frac = 2.0 / (match_count + 1.0);
!
!         /* Add inner run cost for outer tuples having matches */
!         run_cost += outer_matched_rows * inner_run_cost * inner_scan_frac;
!
!         /* Compute number of tuples processed (not number emitted!) */
!         ntuples = outer_matched_rows * inner_path_rows * inner_scan_frac;
!
!         /*
!          * For unmatched outer-rel rows, there are two cases.  If the inner
!          * path is an indexscan using all the joinquals as indexquals, then
!          * an unmatched row results in an indexscan returning no rows, which
!          * is probably quite cheap.  We estimate this case as the same cost
!          * to return the first tuple of a nonempty scan.  Otherwise, the
!          * executor will have to scan the whole inner rel; not so cheap.
!          */
!         if (indexed_join_quals)
!         {
!             run_cost += (outer_path_rows - outer_matched_rows) *
!                 inner_run_cost / inner_path_rows;
!             /* We won't be evaluating any quals at all for these rows */
!         }
!         else
!         {
!             run_cost += (outer_path_rows - outer_matched_rows) *
!                 inner_run_cost;
!             ntuples += (outer_path_rows - outer_matched_rows) *
!                 inner_path_rows;
!         }
!     }
!     else
!     {
!         /* Normal case; we'll scan whole input rel for each outer row */
!         run_cost += outer_path_rows * inner_run_cost;
!
!         /* Compute number of tuples processed (not number emitted!) */
!         ntuples = outer_path_rows * inner_path_rows;
!     }

      /* CPU costs */
      cost_qual_eval(&restrict_qual_cost, path->joinrestrictinfo, root);
***************
*** 1731,1736 ****
--- 1794,1802 ----
       * cpu_tuple_cost plus the cost of evaluating additional restriction
       * clauses that are to be applied at the join.    (This is pessimistic since
       * not all of the quals may get evaluated at each tuple.)
+      *
+      * Note: we could adjust for SEMI/ANTI joins skipping some qual evaluations
+      * here, but it's probably not worth the trouble.
       */
      startup_cost += qp_qual_cost.startup;
      cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
***************
*** 1824,1829 ****
--- 1890,1897 ----
      int            num_skew_mcvs;
      double        virtualbuckets;
      Selectivity innerbucketsize;
+     Selectivity    outer_match_frac;
+     Selectivity    match_count;
      ListCell   *hcl;

      if (!enable_hashjoin)
***************
*** 1838,1849 ****
      qp_qual_cost.startup -= hash_qual_cost.startup;
      qp_qual_cost.per_tuple -= hash_qual_cost.per_tuple;

-     /*
-      * Get approx # tuples passing the hashquals.  We use approx_tuple_count
-      * here because we need an estimate done with JOIN_INNER semantics.
-      */
-     hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
-
      /* cost of source data */
      startup_cost += outer_path->startup_cost;
      run_cost += outer_path->total_cost - outer_path->startup_cost;
--- 1906,1911 ----
***************
*** 1970,1987 ****

      /* CPU costs */

!     /*
!      * The number of tuple comparisons needed is the number of outer tuples
!      * times the typical number of tuples in a hash bucket, which is the inner
!      * relation size times its bucketsize fraction.  At each one, we need to
!      * evaluate the hashjoin quals.  But actually, charging the full qual eval
!      * cost at each tuple is pessimistic, since we don't evaluate the quals
!      * unless the hash values match exactly.  For lack of a better idea, halve
!      * the cost estimate to allow for that.
!      */
!     startup_cost += hash_qual_cost.startup;
!     run_cost += hash_qual_cost.per_tuple *
!         outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

      /*
       * For each tuple that gets through the hashjoin proper, we charge
--- 2032,2109 ----

      /* CPU costs */

!     if (adjust_semi_join(root, &path->jpath, sjinfo,
!                          &outer_match_frac,
!                          &match_count,
!                          NULL))
!     {
!         double        outer_matched_rows;
!         Selectivity    inner_scan_frac;
!
!         /*
!          * SEMI or ANTI join: executor will stop after first match.
!          *
!          * For an outer-rel row that has at least one match, we can expect the
!          * bucket scan to stop after a fraction 1/(match_count+1) of the
!          * bucket's rows, if the matches are evenly distributed.  Since they
!          * probably aren't quite evenly distributed, we apply a fuzz factor of
!          * 2.0 to that fraction.  (If we used a larger fuzz factor, we'd have
!          * to clamp inner_scan_frac to at most 1.0; but since match_count is
!          * at least 1, no such clamp is needed now.)
!          */
!         outer_matched_rows = rint(outer_path_rows * outer_match_frac);
!         inner_scan_frac = 2.0 / (match_count + 1.0);
!
!         startup_cost += hash_qual_cost.startup;
!         run_cost += hash_qual_cost.per_tuple * outer_matched_rows *
!             clamp_row_est(inner_path_rows * innerbucketsize * inner_scan_frac) * 0.5;
!
!         /*
!          * For unmatched outer-rel rows, the picture is quite a lot different.
!          * In the first place, there is no reason to assume that these rows
!          * preferentially hit heavily-populated buckets; instead assume they
!          * are uncorrelated with the inner distribution and so they see an
!          * average bucket size of inner_path_rows / virtualbuckets.  In the
!          * second place, it seems likely that they will have few if any
!          * exact hash-code matches and so very few of the tuples in the
!          * bucket will actually require eval of the hash quals.  We don't
!          * have any good way to estimate how many will, but for the moment
!          * assume that the effective cost per bucket entry is one-tenth what
!          * it is for matchable tuples.
!          */
!         run_cost += hash_qual_cost.per_tuple *
!             (outer_path_rows - outer_matched_rows) *
!             clamp_row_est(inner_path_rows / virtualbuckets) * 0.05;
!
!         /* Get # of tuples that will pass the basic join */
!         if (path->jpath.jointype == JOIN_SEMI)
!             hashjointuples = outer_matched_rows;
!         else
!             hashjointuples = outer_path_rows - outer_matched_rows;
!     }
!     else
!     {
!         /*
!          * The number of tuple comparisons needed is the number of outer
!          * tuples times the typical number of tuples in a hash bucket, which
!          * is the inner relation size times its bucketsize fraction.  At each
!          * one, we need to evaluate the hashjoin quals.  But actually,
!          * charging the full qual eval cost at each tuple is pessimistic,
!          * since we don't evaluate the quals unless the hash values match
!          * exactly.  For lack of a better idea, halve the cost estimate to
!          * allow for that.
!          */
!         startup_cost += hash_qual_cost.startup;
!         run_cost += hash_qual_cost.per_tuple * outer_path_rows *
!             clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
!
!         /*
!          * Get approx # tuples passing the hashquals.  We use
!          * approx_tuple_count here because we need an estimate done with
!          * JOIN_INNER semantics.
!          */
!         hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
!     }

      /*
       * For each tuple that gets through the hashjoin proper, we charge
***************
*** 2321,2326 ****
--- 2443,2598 ----


  /*
+  * adjust_semi_join
+  *      Estimate how much of the inner input a SEMI or ANTI join
+  *      can be expected to scan.
+  *
+  * In a hash or nestloop SEMI/ANTI join, the executor will stop scanning
+  * inner rows as soon as it finds a match to the current outer row.
+  * We should therefore adjust some of the cost components for this effect.
+  * This function computes some estimates needed for these adjustments.
+  *
+  * 'path' is already filled in except for the cost fields
+  * 'sjinfo' is extra info about the join for selectivity estimation
+  *
+  * Returns TRUE if this is a SEMI or ANTI join, FALSE if not.
+  *
+  * Output parameters (set only in TRUE-result case):
+  * *outer_match_frac is set to the fraction of the outer tuples that are
+  *        expected to have at least one match.
+  * *match_count is set to the average number of matches expected for
+  *        outer tuples that have at least one match.
+  * *indexed_join_quals is set to TRUE if all the joinquals are used as
+  *        inner index quals, FALSE if not.
+  *
+  * indexed_join_quals can be passed as NULL if that information is not
+  * relevant (it is only useful for the nestloop case).
+  */
+ static bool
+ adjust_semi_join(PlannerInfo *root, JoinPath *path, SpecialJoinInfo *sjinfo,
+                  Selectivity *outer_match_frac,
+                  Selectivity *match_count,
+                  bool *indexed_join_quals)
+ {
+     JoinType    jointype = path->jointype;
+     Selectivity jselec;
+     Selectivity nselec;
+     Selectivity avgmatch;
+     SpecialJoinInfo norm_sjinfo;
+     List       *joinquals;
+     ListCell   *l;
+
+     /* Fall out if it's not JOIN_SEMI or JOIN_ANTI */
+     if (jointype != JOIN_SEMI && jointype != JOIN_ANTI)
+         return false;
+
+     /*
+      * Note: it's annoying to repeat this selectivity estimation on each call,
+      * when the joinclause list will be the same for all path pairs
+      * implementing a given join.  clausesel.c will save us from the worst
+      * effects of this by caching at the RestrictInfo level; but perhaps it'd
+      * be worth finding a way to cache the results at a higher level.
+      */
+
+     /*
+      * In an ANTI join, we must ignore clauses that are "pushed down",
+      * since those won't affect the match logic.  In a SEMI join, we do not
+      * distinguish joinquals from "pushed down" quals, so just use the whole
+      * restrictinfo list.
+      */
+     if (jointype == JOIN_ANTI)
+     {
+         joinquals = NIL;
+         foreach(l, path->joinrestrictinfo)
+         {
+             RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+             Assert(IsA(rinfo, RestrictInfo));
+             if (!rinfo->is_pushed_down)
+                 joinquals = lappend(joinquals, rinfo);
+         }
+     }
+     else
+         joinquals = path->joinrestrictinfo;
+
+     /*
+      * Get the JOIN_SEMI or JOIN_ANTI selectivity of the join clauses.
+      */
+     jselec = clauselist_selectivity(root,
+                                     joinquals,
+                                     0,
+                                     jointype,
+                                     sjinfo);
+
+     /*
+      * Also get the normal inner-join selectivity of the join clauses.
+      */
+     norm_sjinfo.type = T_SpecialJoinInfo;
+     norm_sjinfo.min_lefthand = path->outerjoinpath->parent->relids;
+     norm_sjinfo.min_righthand = path->innerjoinpath->parent->relids;
+     norm_sjinfo.syn_lefthand = path->outerjoinpath->parent->relids;
+     norm_sjinfo.syn_righthand = path->innerjoinpath->parent->relids;
+     norm_sjinfo.jointype = JOIN_INNER;
+     /* we don't bother trying to make the remaining fields valid */
+     norm_sjinfo.lhs_strict = false;
+     norm_sjinfo.delay_upper_joins = false;
+     norm_sjinfo.join_quals = NIL;
+
+     nselec = clauselist_selectivity(root,
+                                     joinquals,
+                                     0,
+                                     JOIN_INNER,
+                                     &norm_sjinfo);
+
+     /* Avoid leaking a lot of ListCells */
+     if (jointype == JOIN_ANTI)
+         list_free(joinquals);
+
+     /*
+      * jselec can be interpreted as the fraction of outer-rel rows that have
+      * any matches (this is true for both SEMI and ANTI cases).  And nselec
+      * is the fraction of the Cartesian product that matches.  So, the
+      * average number of matches for each outer-rel row that has at least
+      * one match is nselec * inner_rows / jselec.
+      *
+      * Note: it is correct to use the inner rel's "rows" count here, not
+      * PATH_ROWS(), even if the inner path under consideration is an inner
+      * indexscan.  This is because we have included all the join clauses
+      * in the selectivity estimate, even ones used in an inner indexscan.
+      */
+     if (jselec > 0)                /* protect against zero divide */
+     {
+         avgmatch = nselec * path->innerjoinpath->parent->rows / jselec;
+         /* Clamp to sane range */
+         avgmatch = Max(1.0, avgmatch);
+     }
+     else
+         avgmatch = 1.0;
+
+     *outer_match_frac = jselec;
+     *match_count = avgmatch;
+
+     /*
+      * If requested, check whether the inner path uses all the joinquals
+      * as indexquals.  (If that's true, we can assume that an unmatched
+      * outer tuple is cheap to process, whereas otherwise it's probably
+      * expensive.)
+      */
+     if (indexed_join_quals)
+     {
+         List       *nrclauses;
+
+         nrclauses = select_nonredundant_join_clauses(root,
+                                                      path->joinrestrictinfo,
+                                                      path->innerjoinpath);
+         *indexed_join_quals = (nrclauses == NIL);
+     }
+
+     return true;
+ }
+
+
+ /*
   * approx_tuple_count
   *        Quick-and-dirty estimation of the number of join rows passing
   *        a set of qual conditions.
Index: src/backend/optimizer/plan/createplan.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/plan/createplan.c,v
retrieving revision 1.258
diff -c -r1.258 createplan.c
*** src/backend/optimizer/plan/createplan.c    19 Apr 2009 19:46:33 -0000    1.258
--- src/backend/optimizer/plan/createplan.c    9 May 2009 22:45:48 -0000
***************
*** 1562,1623 ****
      List       *otherclauses;
      NestLoop   *join_plan;

!     if (IsA(best_path->innerjoinpath, IndexPath))
!     {
!         /*
!          * An index is being used to reduce the number of tuples scanned in
!          * the inner relation.    If there are join clauses being used with the
!          * index, we may remove those join clauses from the list of clauses
!          * that have to be checked as qpquals at the join node.
!          *
!          * We can also remove any join clauses that are redundant with those
!          * being used in the index scan; this check is needed because
!          * find_eclass_clauses_for_index_join() may emit different clauses
!          * than generate_join_implied_equalities() did.
!          *
!          * We can skip this if the index path is an ordinary indexpath and not
!          * a special innerjoin path, since it then wouldn't be using any join
!          * clauses.
!          */
!         IndexPath  *innerpath = (IndexPath *) best_path->innerjoinpath;
!
!         if (innerpath->isjoininner)
!             joinrestrictclauses =
!                 select_nonredundant_join_clauses(root,
!                                                  joinrestrictclauses,
!                                                  innerpath->indexclauses);
!     }
!     else if (IsA(best_path->innerjoinpath, BitmapHeapPath))
!     {
!         /*
!          * Same deal for bitmapped index scans.
!          *
!          * Note: both here and above, we ignore any implicit index
!          * restrictions associated with the use of partial indexes.  This is
!          * OK because we're only trying to prove we can dispense with some
!          * join quals; failing to prove that doesn't result in an incorrect
!          * plan.  It is the right way to proceed because adding more quals to
!          * the stuff we got from the original query would just make it harder
!          * to detect duplication.  (Also, to change this we'd have to be wary
!          * of UPDATE/DELETE/SELECT FOR UPDATE target relations; see notes
!          * above about EvalPlanQual.)
!          */
!         BitmapHeapPath *innerpath = (BitmapHeapPath *) best_path->innerjoinpath;
!
!         if (innerpath->isjoininner)
!         {
!             List       *bitmapclauses;
!
!             bitmapclauses =
!                 make_restrictinfo_from_bitmapqual(innerpath->bitmapqual,
!                                                   true,
!                                                   false);
!             joinrestrictclauses =
!                 select_nonredundant_join_clauses(root,
!                                                  joinrestrictclauses,
!                                                  bitmapclauses);
!         }
!     }

      /* Sort join qual clauses into best execution order */
      joinrestrictclauses = order_qual_clauses(root, joinrestrictclauses);
--- 1562,1577 ----
      List       *otherclauses;
      NestLoop   *join_plan;

!     /*
!      * If the inner path is a nestloop inner indexscan, it might be using
!      * some of the join quals as index quals, in which case we don't have
!      * to check them again at the join node.  Remove any join quals that
!      * are redundant.
!      */
!     joinrestrictclauses =
!         select_nonredundant_join_clauses(root,
!                                          joinrestrictclauses,
!                                          best_path->innerjoinpath);

      /* Sort join qual clauses into best execution order */
      joinrestrictclauses = order_qual_clauses(root, joinrestrictclauses);
Index: src/backend/optimizer/util/restrictinfo.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/util/restrictinfo.c,v
retrieving revision 1.58
diff -c -r1.58 restrictinfo.c
*** src/backend/optimizer/util/restrictinfo.c    16 Apr 2009 20:42:16 -0000    1.58
--- src/backend/optimizer/util/restrictinfo.c    9 May 2009 22:45:48 -0000
***************
*** 35,42 ****
                         bool pseudoconstant,
                         Relids required_relids,
                         Relids nullable_relids);
! static bool join_clause_is_redundant(PlannerInfo *root,
!                          RestrictInfo *rinfo,
                           List *reference_list);


--- 35,43 ----
                         bool pseudoconstant,
                         Relids required_relids,
                         Relids nullable_relids);
! static List *select_nonredundant_join_list(List *restrictinfo_list,
!                               List *reference_list);
! static bool join_clause_is_redundant(RestrictInfo *rinfo,
                           List *reference_list);


***************
*** 545,570 ****
      }
  }

  /*
   * select_nonredundant_join_clauses
   *
   * Given a list of RestrictInfo clauses that are to be applied in a join,
!  * select the ones that are not redundant with any clause in the
!  * reference_list.    This is used only for nestloop-with-inner-indexscan
!  * joins: any clauses being checked by the index should be removed from
!  * the qpquals list.
   *
   * "Redundant" means either equal() or derived from the same EquivalenceClass.
   * We have to check the latter because indxqual.c may select different derived
   * clauses than were selected by generate_join_implied_equalities().
   *
!  * Note that we assume the given restrictinfo_list has already been checked
!  * for local redundancies, so we don't check again.
   */
  List *
  select_nonredundant_join_clauses(PlannerInfo *root,
                                   List *restrictinfo_list,
!                                  List *reference_list)
  {
      List       *result = NIL;
      ListCell   *item;
--- 546,636 ----
      }
  }

+
  /*
   * select_nonredundant_join_clauses
   *
   * Given a list of RestrictInfo clauses that are to be applied in a join,
!  * select the ones that are not redundant with any clause that's enforced
!  * by the inner_path.  This is used for nestloop joins, wherein any clause
!  * being used in an inner indexscan need not be checked again at the join.
   *
   * "Redundant" means either equal() or derived from the same EquivalenceClass.
   * We have to check the latter because indxqual.c may select different derived
   * clauses than were selected by generate_join_implied_equalities().
   *
!  * Note that we are *not* checking for local redundancies within the given
!  * restrictinfo_list; that should have been handled elsewhere.
   */
  List *
  select_nonredundant_join_clauses(PlannerInfo *root,
                                   List *restrictinfo_list,
!                                  Path *inner_path)
! {
!     if (IsA(inner_path, IndexPath))
!     {
!         /*
!          * Check the index quals to see if any of them are join clauses.
!          *
!          * We can skip this if the index path is an ordinary indexpath and not
!          * a special innerjoin path, since it then wouldn't be using any join
!          * clauses.
!          */
!         IndexPath  *innerpath = (IndexPath *) inner_path;
!
!         if (innerpath->isjoininner)
!             restrictinfo_list =
!                 select_nonredundant_join_list(restrictinfo_list,
!                                               innerpath->indexclauses);
!     }
!     else if (IsA(inner_path, BitmapHeapPath))
!     {
!         /*
!          * Same deal for bitmapped index scans.
!          *
!          * Note: both here and above, we ignore any implicit index
!          * restrictions associated with the use of partial indexes.  This is
!          * OK because we're only trying to prove we can dispense with some
!          * join quals; failing to prove that doesn't result in an incorrect
!          * plan.  It's quite unlikely that a join qual could be proven
!          * redundant by an index predicate anyway.  (Also, if we did manage
!          * to prove it, we'd have to have a special case for update targets;
!          * see notes about EvalPlanQual testing in create_indexscan_plan().)
!          */
!         BitmapHeapPath *innerpath = (BitmapHeapPath *) inner_path;
!
!         if (innerpath->isjoininner)
!         {
!             List       *bitmapclauses;
!
!             bitmapclauses =
!                 make_restrictinfo_from_bitmapqual(innerpath->bitmapqual,
!                                                   true,
!                                                   false);
!             restrictinfo_list =
!                 select_nonredundant_join_list(restrictinfo_list,
!                                               bitmapclauses);
!         }
!     }
!
!     /*
!      * XXX the inner path of a nestloop could also be an append relation
!      * whose elements use join quals.  However, they might each use different
!      * quals; we could only remove join quals that are enforced by all the
!      * appendrel members.  For the moment we don't bother to try.
!      */
!
!     return restrictinfo_list;
! }
!
! /*
!  * select_nonredundant_join_list
!  *        Select the members of restrictinfo_list that are not redundant with
!  *        any member of reference_list.  See above for more info.
!  */
! static List *
! select_nonredundant_join_list(List *restrictinfo_list,
!                               List *reference_list)
  {
      List       *result = NIL;
      ListCell   *item;
***************
*** 574,580 ****
          RestrictInfo *rinfo = (RestrictInfo *) lfirst(item);

          /* drop it if redundant with any reference clause */
!         if (join_clause_is_redundant(root, rinfo, reference_list))
              continue;

          /* otherwise, add it to result list */
--- 640,646 ----
          RestrictInfo *rinfo = (RestrictInfo *) lfirst(item);

          /* drop it if redundant with any reference clause */
!         if (join_clause_is_redundant(rinfo, reference_list))
              continue;

          /* otherwise, add it to result list */
***************
*** 589,596 ****
   *        Test whether rinfo is redundant with any clause in reference_list.
   */
  static bool
! join_clause_is_redundant(PlannerInfo *root,
!                          RestrictInfo *rinfo,
                           List *reference_list)
  {
      ListCell   *refitem;
--- 655,661 ----
   *        Test whether rinfo is redundant with any clause in reference_list.
   */
  static bool
! join_clause_is_redundant(RestrictInfo *rinfo,
                           List *reference_list)
  {
      ListCell   *refitem;
Index: src/include/optimizer/restrictinfo.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/optimizer/restrictinfo.h,v
retrieving revision 1.43
diff -c -r1.43 restrictinfo.h
*** src/include/optimizer/restrictinfo.h    16 Apr 2009 20:42:16 -0000    1.43
--- src/include/optimizer/restrictinfo.h    9 May 2009 22:45:48 -0000
***************
*** 39,44 ****
                              List **otherquals);
  extern List *select_nonredundant_join_clauses(PlannerInfo *root,
                                   List *restrictinfo_list,
!                                  List *reference_list);

  #endif   /* RESTRICTINFO_H */
--- 39,44 ----
                              List **otherquals);
  extern List *select_nonredundant_join_clauses(PlannerInfo *root,
                                   List *restrictinfo_list,
!                                  Path *inner_path);

  #endif   /* RESTRICTINFO_H */
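
For anyone reading along, here is a minimal standalone sketch (not part of the
patch) of the matched-versus-unmatched arithmetic that the hashed SEMI/ANTI
branch above performs.  Every input value is invented for illustration,
clamp_row_est() is a simplified stand-in for the planner's version, and the
hash_qual_cost.startup term is left out:

/* gcc semi_cost_sketch.c -lm -o semi_cost_sketch */
#include <math.h>
#include <stdio.h>

/* simplified stand-in for the planner's clamp_row_est() */
static double
clamp_row_est(double nrows)
{
    nrows = rint(nrows);
    return (nrows < 1.0) ? 1.0 : nrows;
}

int
main(void)
{
    /* hypothetical planner inputs */
    double  outer_path_rows = 100000.0;
    double  inner_path_rows = 50000.0;
    double  innerbucketsize = 0.001;    /* bucketsize fraction of inner rel */
    double  virtualbuckets = 16384.0;
    double  qual_cost_per_tuple = 0.0025;
    double  outer_match_frac = 0.6;     /* as produced by adjust_semi_join */
    double  match_count = 3.0;          /* as produced by adjust_semi_join */

    double  outer_matched_rows = rint(outer_path_rows * outer_match_frac);
    double  inner_scan_frac = 2.0 / (match_count + 1.0);

    /* matched outer rows stop partway through a typical (populated) bucket */
    double  matched_cost = qual_cost_per_tuple * outer_matched_rows *
        clamp_row_est(inner_path_rows * innerbucketsize * inner_scan_frac) * 0.5;

    /* unmatched outer rows see an average bucket and rarely match hash codes */
    double  unmatched_cost = qual_cost_per_tuple *
        (outer_path_rows - outer_matched_rows) *
        clamp_row_est(inner_path_rows / virtualbuckets) * 0.05;

    printf("matched: %g  unmatched: %g  total qual cost: %g\n",
           matched_cost, unmatched_cost, matched_cost + unmatched_cost);
    return 0;
}

With these made-up numbers the unmatched side contributes far less than the
matched side, which is the intended effect of modeling the two cases
separately.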

Re: HashJoin w/option to unique-ify inner rel

From
Robert Haas
Date:
On Sat, May 9, 2009 at 7:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> ... So it appears to me that instead of taking an average-case correction
>> as is done in this patch and the old coding, we have to explicitly model
>> the matched-tuple and unmatched-tuple cases separately.
>
> I've applied the attached patch that does things this way.  I did not do
> anything about improving the detailed modeling of hash-bucket searching
> as Robert suggested in some later messages.  I think that's probably
> worth looking at, but it's a second-order consideration --- this patch
> already seems to bring the estimates for semi/antijoins much closer
> to reality.

I'll take a look at this when I get a chance, but so far I've only been
working with synthetic test cases, so I share your hope that Kevin (or
someone else with complex queries against real data) will test it out.

> I am a bit concerned about the extra time spent on repeated selectivity
> estimates.  It might not matter too much since it's only done for semi
> and anti joins which aren't that common.  It would be good though if
> someone who has a lot of such joins could test CVS HEAD and see if
> performance has gotten worse (Kevin?).  We could refactor things to
> reduce the duplication of effort but I'd prefer to leave that sort of
> thing to 8.5.

Agreed.  I was worried about that when I wrote the emails to which you
refer above, but I don't know how else to get good estimates for all
the relevant cases.
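
For what it's worth, here is a rough standalone sketch (again, not taken from
the patch) of how adjust_semi_join combines the two selectivity estimates it
computes into its output parameters; the input numbers are made up purely for
illustration:

#include <stdio.h>

int
main(void)
{
    double  jselec = 0.4;           /* SEMI/ANTI selectivity of the joinquals */
    double  nselec = 0.00001;       /* plain inner-join selectivity */
    double  inner_rows = 200000.0;  /* inner rel's row estimate */
    double  avgmatch;

    /*
     * jselec is the fraction of outer rows with at least one match, and
     * nselec is the fraction of the Cartesian product that matches, so the
     * matches per matched outer row come out to nselec * inner_rows / jselec.
     */
    if (jselec > 0)
    {
        avgmatch = nselec * inner_rows / jselec;
        if (avgmatch < 1.0)
            avgmatch = 1.0;         /* clamp to a sane range */
    }
    else
        avgmatch = 1.0;

    printf("outer_match_frac = %g, match_count = %g\n", jselec, avgmatch);
    return 0;
}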

...Robert