Re: Erronous sort used in query plan - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | Re: Erronous sort used in query plan
Date |
Msg-id | 19448.1168221082@sss.pgh.pa.us
In response to | Re: Erronous sort used in query plan (Shane Ambler <pgsql@007Marketing.com>)
List | pgsql-hackers
Shane Ambler <pgsql@007Marketing.com> writes:
> Tom Lane wrote:
>> It seems clear to me that we ought not impose a bias unless the join
>> type is such that both directions of hashing are feasible.

> I think that the selected sort (or at least the merge join) is incorrect
> - the column sorted (or both actually) is linking the current record in
> pg_operator with the oid in the pg_proc - it will only return one row.

That hasn't got anything to do with it AFAICS.  The plan is not wrong,
it's just inefficient.

I dug around in old files and found out that the notion of discriminating
against hash joins with the larger rel on the inside is ancient: in
Postgres v4r2, the oldest tarball I have, cost_hashjoin contains

    int outerpages = page_size (outersize,outerwidth);
    int innerpages = page_size (innersize,innerwidth);

    if (outerpages < innerpages)
        return _disable_cost_;

and while the code's been rewritten several times, the lineage is clear.

It seems to me, now that I've thought about it, that we should just get
rid of this bias entirely: it is demonstrably wrong if the join is not an
inner join, or if the larger relation has significantly better dispersion
across hash buckets.  To the extent that it's accomplishing anything good
at all, the behavior ought to be emergent from the rest of the cost model.

I experimented a bit with the idea and found out that without the bias,
the planner was willing to flip over to larger-relation-on-the-inside in
cases where that actually made it a bit slower.  I interpret this as
meaning that we're undercharging for loading the hash table in comparison
to using it.  The cost model as it stands (but minus bias) effectively
says that for two relations that both fit into work_mem and have equally
well-dispersed hash keys, it's actually better to have the larger rel on
the inside!  The number of hash-function calculations is the same either
way (one for each tuple in the two rels), and the executor tries to hold
the number of tuples per hash bucket constant at about 10 regardless of
relation size, so when you trace through the calculations you find that
the only differential between the total-cost estimates for the two
mirror-image hash joins is the number of outer tuples for which hashtable
probes must be made; hence the fewer of them the better.  Experimentation
shows this isn't so, however, which suggests that the cost of inserting a
tuple into the hashtable is a non-ignorable cost.  I got results that
seemed to track reality better after I added cpu_tuple_cost per hashtable
entry and discounted the per-outer-tuple cost a bit to compensate.

I also noticed that the cost estimate for having to do I/O in a batched
hashjoin had missed getting updated for the 8.2 seq_page_cost changes;
this is a plain ol' oversight on my part.
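To make that arithmetic concrete, here is a toy back-of-the-envelope
calculation (standalone C, not planner code).  It uses the stock
cpu_operator_cost = 0.0025 and cpu_tuple_cost = 0.01, assumes a single
hash clause and the executor's ~10-tuples-per-bucket target, and uses
made-up row counts; the input-path, batching, and per-output-tuple
charges, which are the same for both mirror images, are left out:

#include <stdio.h>

/*
 * Illustrative numbers only: stock cost parameters, one hash clause, and
 * the executor's target of about 10 tuples per bucket.  Terms that are
 * symmetric between the two mirror-image joins are omitted.
 */
#define CPU_OPERATOR_COST   0.0025
#define CPU_TUPLE_COST      0.01
#define TUPLES_PER_BUCKET   10.0

static double
hash_cpu_cost(double inner_rows, double outer_rows,
              double insert_charge, double qual_discount)
{
    /* hash-function calls: one per tuple on each side, same either way */
    double      hashing = CPU_OPERATOR_COST * (inner_rows + outer_rows);

    /* new: charge for loading each inner row into the hashtable */
    double      inserts = insert_charge * inner_rows;

    /* qual evaluations while probing: scales with the outer side */
    double      probes = CPU_OPERATOR_COST * outer_rows *
                         TUPLES_PER_BUCKET * qual_discount;

    return hashing + inserts + probes;
}

int
main(void)
{
    double      big = 1000000.0;
    double      small = 10000.0;

    /* old model: no insert charge, full qual cost at every bucket entry */
    printf("old: small inner %8.1f, big inner %8.1f\n",
           hash_cpu_cost(small, big, 0.0, 1.0),
           hash_cpu_cost(big, small, 0.0, 1.0));

    /* new model: cpu_tuple_cost per inner row, qual cost halved */
    printf("new: small inner %8.1f, big inner %8.1f\n",
           hash_cpu_cost(small, big, CPU_TUPLE_COST, 0.5),
           hash_cpu_cost(big, small, CPU_TUPLE_COST, 0.5));
    return 0;
}

Whether the preference actually flips in a given case depends on the row
counts and on the rest of the cost model; the point is just that the old
model's only asymmetric term scales with the outer relation, while the
new per-row insert charge scales with the inner one.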
In short, then, I'm proposing the attached patch for HEAD and 8.2; I'm
not sure whether to change things further back.  Comments?

			regards, tom lane

Index: costsize.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.172
diff -c -r1.172 costsize.c
*** costsize.c	5 Jan 2007 22:19:31 -0000	1.172
--- costsize.c	8 Jan 2007 01:00:56 -0000
***************
*** 1498,1507 ****
  	double		hashjointuples;
  	double		outer_path_rows = PATH_ROWS(outer_path);
  	double		inner_path_rows = PATH_ROWS(inner_path);
- 	double		outerbytes = relation_byte_size(outer_path_rows,
- 										  outer_path->parent->width);
- 	double		innerbytes = relation_byte_size(inner_path_rows,
- 										  inner_path->parent->width);
  	int			num_hashclauses = list_length(hashclauses);
  	int			numbuckets;
  	int			numbatches;
--- 1498,1503 ----
***************
*** 1538,1550 ****
  	/*
  	 * Cost of computing hash function: must do it once per input tuple. We
! 	 * charge one cpu_operator_cost for each column's hash function.
  	 *
  	 * XXX when a hashclause is more complex than a single operator, we really
  	 * should charge the extra eval costs of the left or right side, as
  	 * appropriate, here.  This seems more work than it's worth at the moment.
  	 */
! 	startup_cost += cpu_operator_cost * num_hashclauses * inner_path_rows;
  	run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

  	/* Get hash table size that executor would use for inner relation */
--- 1534,1549 ----
  	/*
  	 * Cost of computing hash function: must do it once per input tuple. We
! 	 * charge one cpu_operator_cost for each column's hash function.  Also,
! 	 * tack on one cpu_tuple_cost per inner row, to model the costs of
! 	 * inserting the row into the hashtable.
  	 *
  	 * XXX when a hashclause is more complex than a single operator, we really
  	 * should charge the extra eval costs of the left or right side, as
  	 * appropriate, here.  This seems more work than it's worth at the moment.
  	 */
! 	startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
! 		* inner_path_rows;
  	run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

  	/* Get hash table size that executor would use for inner relation */
***************
*** 1624,1631 ****
  	/*
  	 * If inner relation is too big then we will need to "batch" the join,
  	 * which implies writing and reading most of the tuples to disk an extra
! 	 * time.  Charge one cost unit per page of I/O (correct since it should be
! 	 * nice and sequential...).  Writing the inner rel counts as startup cost,
  	 * all the rest as run cost.
  	 */
  	if (numbatches > 1)
--- 1623,1630 ----
  	/*
  	 * If inner relation is too big then we will need to "batch" the join,
  	 * which implies writing and reading most of the tuples to disk an extra
! 	 * time.  Charge seq_page_cost per page, since the I/O should be nice and
! 	 * sequential.  Writing the inner rel counts as startup cost,
  	 * all the rest as run cost.
  	 */
  	if (numbatches > 1)
***************
*** 1635,1642 ****
  		double		innerpages = page_size(inner_path_rows,
  										   inner_path->parent->width);

! 		startup_cost += innerpages;
! 		run_cost += innerpages + 2 * outerpages;
  	}

  	/* CPU costs */
--- 1634,1641 ----
  		double		innerpages = page_size(inner_path_rows,
  										   inner_path->parent->width);

! 		startup_cost += seq_page_cost * innerpages;
! 		run_cost += seq_page_cost * (innerpages + 2 * outerpages);
  	}

  	/* CPU costs */
***************
*** 1654,1667 ****
  	 * The number of tuple comparisons needed is the number of outer tuples
  	 * times the typical number of tuples in a hash bucket, which is the inner
  	 * relation size times its bucketsize fraction.  At each one, we need to
! 	 * evaluate the hashjoin quals.  (Note: charging the full qual eval cost
! 	 * at each tuple is pessimistic, since we don't evaluate the quals unless
! 	 * the hash values match exactly.)
  	 */
  	startup_cost += hash_qual_cost.startup;
  	run_cost += hash_qual_cost.per_tuple * outer_path_rows *
  		clamp_row_est(inner_path_rows * innerbucketsize) *
! 		joininfactor;

  	/*
  	 * For each tuple that gets through the hashjoin proper, we charge
--- 1653,1667 ----
  	 * The number of tuple comparisons needed is the number of outer tuples
  	 * times the typical number of tuples in a hash bucket, which is the inner
  	 * relation size times its bucketsize fraction.  At each one, we need to
! 	 * evaluate the hashjoin quals.  But actually, charging the full qual eval
! 	 * cost at each tuple is pessimistic, since we don't evaluate the quals
! 	 * unless the hash values match exactly.  For lack of a better idea, halve
! 	 * the cost estimate to allow for that.
  	 */
  	startup_cost += hash_qual_cost.startup;
  	run_cost += hash_qual_cost.per_tuple * outer_path_rows *
  		clamp_row_est(inner_path_rows * innerbucketsize) *
! 		joininfactor * 0.5;

  	/*
  	 * For each tuple that gets through the hashjoin proper, we charge
***************
*** 1673,1694 ****
  	cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
  	run_cost += cpu_per_tuple * hashjointuples * joininfactor;

- 	/*
- 	 * Bias against putting larger relation on inside.  We don't want an
- 	 * absolute prohibition, though, since larger relation might have better
- 	 * bucketsize --- and we can't trust the size estimates unreservedly,
- 	 * anyway.  Instead, inflate the run cost by the square root of the size
- 	 * ratio.  (Why square root?  No real good reason, but it seems
- 	 * reasonable...)
- 	 *
- 	 * Note: before 7.4 we implemented this by inflating startup cost; but if
- 	 * there's a disable_cost component in the input paths' startup cost, that
- 	 * unfairly penalizes the hash.  Probably it'd be better to keep track of
- 	 * disable penalty separately from cost.
- 	 */
- 	if (innerbytes > outerbytes && outerbytes > 0)
- 		run_cost *= sqrt(innerbytes / outerbytes);
- 
  	path->jpath.path.startup_cost = startup_cost;
  	path->jpath.path.total_cost = startup_cost + run_cost;
  }
--- 1673,1678 ----