Re: Erronous sort used in query plan - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Erronous sort used in query plan
Date
Msg-id 19448.1168221082@sss.pgh.pa.us
Whole thread Raw
In response to Re: Erronous sort used in query plan  (Shane Ambler <pgsql@007Marketing.com>)
List pgsql-hackers
Shane Ambler <pgsql@007Marketing.com> writes:
> Tom Lane wrote:
>> It seems clear to me that we ought not impose a bias unless the join
>> type is such that both directions of hashing are feasible.

> I think that the selected sort (or at least the merge join) is incorrect 
> - the column sorted (or both actually) is linking the current record in 
> pg_operator with the oid in the pg_proc - it will only return one row.

That hasn't got anything to do with it AFAICS.  The plan is not wrong,
it's just inefficient.

I dug around in old files and found out that the notion of
discriminating against hash joins with the larger rel on the inside
is ancient: in Postgres v4r2, the oldest tarball I have, cost_hashjoin
contains
   int outerpages = page_size (outersize,outerwidth);   int innerpages = page_size (innersize,innerwidth);
   if (outerpages < innerpages)      return _disable_cost_;

and while the code's been rewritten several times, the lineage is clear.

It seems to me now that I've thought about it that we should just get
rid of this bias entirely: it is demonstrably wrong if the join is not
an inner join, or if the larger relation has significantly better
dispersion across hash buckets.  To the extent that it's accomplishing
anything good at all, the behavior ought to be emergent from the rest of
the cost model.

I experimented a bit with the idea and found out that without the bias,
it was willing to flip over to larger-relation-on-the-inside in cases
where that actually made it a bit slower.  I interpret this as meaning
that we're undercharging for loading the hash table in comparison to
using it.  The cost model as it stands (but minus bias) effectively says
that for two relations that both fit into work_mem and have equally
well-dispersed hash keys, it's actually better to have the larger rel on
the inside!  The number of hash-function calculations is the same either
way (one for each tuple in the two rels), and the executor tries to hold
the number-of-tuples-per-hash-bucket constant at about 10 regardless of
relation size, so when you trace through the calculations you find the
only differential between the total-cost estimates for the two
mirror-image hash joins is the number of outer tuples for which
hashtable probes must be made; hence the fewer of them the better.
Experimentation shows this isn't so, however, which suggests that the
cost of inserting a tuple into the hashtable is a non-ignorable cost.
I got results that seemed to track reality better after I added
cpu_tuple_cost per hashtable entry and discounted the per-outer-tuple
cost a bit to compensate.

I also noticed that the cost estimate for having to do I/O in a batched
hashjoin had missed getting updated for the 8.2 seq_page_cost changes;
this is a plain ol' oversight on my part.

In short, then, I'm proposing the attached patch for HEAD and 8.2;
I'm not sure whether to change things further back.  Comments?
        regards, tom lane

Index: costsize.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.172
diff -c -r1.172 costsize.c
*** costsize.c    5 Jan 2007 22:19:31 -0000    1.172
--- costsize.c    8 Jan 2007 01:00:56 -0000
***************
*** 1498,1507 ****     double        hashjointuples;     double        outer_path_rows = PATH_ROWS(outer_path);
double       inner_path_rows = PATH_ROWS(inner_path);
 
-     double        outerbytes = relation_byte_size(outer_path_rows,
-                                                 outer_path->parent->width);
-     double        innerbytes = relation_byte_size(inner_path_rows,
-                                                 inner_path->parent->width);     int            num_hashclauses =
list_length(hashclauses);    int            numbuckets;     int            numbatches;
 
--- 1498,1503 ----
***************
*** 1538,1550 ****      /*      * Cost of computing hash function: must do it once per input tuple. We
!      * charge one cpu_operator_cost for each column's hash function.      *      * XXX when a hashclause is more
complexthan a single operator, we really      * should charge the extra eval costs of the left or right side, as      *
appropriate,here.  This seems more work than it's worth at the moment.      */
 
!     startup_cost += cpu_operator_cost * num_hashclauses * inner_path_rows;     run_cost += cpu_operator_cost *
num_hashclauses* outer_path_rows;      /* Get hash table size that executor would use for inner relation */
 
--- 1534,1549 ----      /*      * Cost of computing hash function: must do it once per input tuple. We
!      * charge one cpu_operator_cost for each column's hash function.  Also,
!      * tack on one cpu_tuple_cost per inner row, to model the costs of
!      * inserting the row into the hashtable.      *      * XXX when a hashclause is more complex than a single
operator,we really      * should charge the extra eval costs of the left or right side, as      * appropriate, here.
Thisseems more work than it's worth at the moment.      */
 
!     startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
!         * inner_path_rows;     run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;      /* Get hash
tablesize that executor would use for inner relation */
 
***************
*** 1624,1631 ****     /*      * If inner relation is too big then we will need to "batch" the join,      * which
implieswriting and reading most of the tuples to disk an extra
 
!      * time.  Charge one cost unit per page of I/O (correct since it should be
!      * nice and sequential...).  Writing the inner rel counts as startup cost,      * all the rest as run cost.
*/    if (numbatches > 1)
 
--- 1623,1630 ----     /*      * If inner relation is too big then we will need to "batch" the join,      * which
implieswriting and reading most of the tuples to disk an extra
 
!      * time.  Charge seq_page_cost per page, since the I/O should be nice and
!      * sequential.  Writing the inner rel counts as startup cost,      * all the rest as run cost.      */     if
(numbatches> 1)
 
***************
*** 1635,1642 ****         double        innerpages = page_size(inner_path_rows,
   inner_path->parent->width); 
 
!         startup_cost += innerpages;
!         run_cost += innerpages + 2 * outerpages;     }      /* CPU costs */
--- 1634,1641 ----         double        innerpages = page_size(inner_path_rows,
   inner_path->parent->width); 
 
!         startup_cost += seq_page_cost * innerpages;
!         run_cost += seq_page_cost * (innerpages + 2 * outerpages);     }      /* CPU costs */
***************
*** 1654,1667 ****      * The number of tuple comparisons needed is the number of outer tuples      * times the typical
numberof tuples in a hash bucket, which is the inner      * relation size times its bucketsize fraction.  At each one,
weneed to
 
!      * evaluate the hashjoin quals.  (Note: charging the full qual eval cost
!      * at each tuple is pessimistic, since we don't evaluate the quals unless
!      * the hash values match exactly.)      */     startup_cost += hash_qual_cost.startup;     run_cost +=
hash_qual_cost.per_tuple*         outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
 
!         joininfactor;      /*      * For each tuple that gets through the hashjoin proper, we charge
--- 1653,1667 ----      * The number of tuple comparisons needed is the number of outer tuples      * times the typical
numberof tuples in a hash bucket, which is the inner      * relation size times its bucketsize fraction.  At each one,
weneed to
 
!      * evaluate the hashjoin quals.  But actually, charging the full qual eval
!      * cost at each tuple is pessimistic, since we don't evaluate the quals
!      * unless the hash values match exactly.  For lack of a better idea, halve
!      * the cost estimate to allow for that.      */     startup_cost += hash_qual_cost.startup;     run_cost +=
hash_qual_cost.per_tuple*         outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
 
!         joininfactor * 0.5;      /*      * For each tuple that gets through the hashjoin proper, we charge
***************
*** 1673,1694 ****     cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;     run_cost += cpu_per_tuple *
hashjointuples* joininfactor; 
 
-     /*
-      * Bias against putting larger relation on inside.    We don't want an
-      * absolute prohibition, though, since larger relation might have better
-      * bucketsize --- and we can't trust the size estimates unreservedly,
-      * anyway.    Instead, inflate the run cost by the square root of the size
-      * ratio.  (Why square root?  No real good reason, but it seems
-      * reasonable...)
-      *
-      * Note: before 7.4 we implemented this by inflating startup cost; but if
-      * there's a disable_cost component in the input paths' startup cost, that
-      * unfairly penalizes the hash.  Probably it'd be better to keep track of
-      * disable penalty separately from cost.
-      */
-     if (innerbytes > outerbytes && outerbytes > 0)
-         run_cost *= sqrt(innerbytes / outerbytes);
-      path->jpath.path.startup_cost = startup_cost;     path->jpath.path.total_cost = startup_cost + run_cost; }
--- 1673,1678 ----


pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: --with-libxml build failures on OS X
Next
From: Tom Lane
Date:
Subject: Re: proposal: catch warnings