Thread: Potential Join Performance Issue
<div class="Section1"><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">PostgreSQL development community:</span></font><p class="MsoNormal"><font face="Arial" size="2"><spanstyle="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">Our research group has been using the PostgreSQL code base to test new join algorithms. During testing,we noticed that the planner is not pushing down projections to the outer relation in a hash join. Although thismakes sense for in-memory (1 batch) joins, for joins larger than memory (such as for TPC-H DSS), this causes the systemto perform significantly more disk I/Os when reading/writing batches of the outer relation.</span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">A simple solution is to add a single line of code to src\backend\optimizer\plan\createplan.c after line1771:</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New"">disuse_physical_tlist(outer_plan, best_path->jpath.outerjoinpath);</span></font><p class="MsoNormal"><fontface="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New""> </span></font><p class="MsoNormal"><font face="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New"">This will always force the projection on the outer relation.</span></font><p class="MsoNormal"><fontface="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New""> </span></font><p class="MsoNormal"><font face="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New"">A more complicated modification alternative is to add a state variable to allow the planner toknow how many batches the hash join expects and only push down the projection if it is greater than one. However, pushingthe projection on the outer relation is almost always the best choice as it eliminates unneeded attributes for operatorsabove the hash join in the plan and will be robust in the case of poor estimates.</span></font><p class="MsoNormal"><fontface="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New""> </span></font><p class="MsoNormal"><font face="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New"">We have been testing using TPC-H scale factor 1 GB. A sample query that demonstrates the behavioris:</span></font><p class="MsoNormal"><font face="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New""> </span></font><p class="MsoNormal"><font face="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New"">SELECT c_custkey, c_name, o_orderkey, o_orderdate</span></font><p class="MsoNormal"><font face="CourierNew" size="2"><span style="font-size:10.0pt; font-family:"Courier New"">FROM Customer, Orders</span></font><p class="MsoNormal"><font face="Courier New" size="2"><spanstyle="font-size:10.0pt; font-family:"Courier New"">WHERE c_custkey = o_custkey</span></font><p class="MsoNormal"><font face="Courier New" size="2"><spanstyle="font-size:10.0pt; font-family:"Courier New""> </span></font><p class="MsoNormal"><font face="Courier New" size="2"><span style="font-size:10.0pt; font-family:"Courier New"">Note that EXPLAIN on this query will indicate that the projection is performed on the outer relationeven though it is not done. We found the difference by modifying our code to track tuples and bytes output to disk,but it also can be detected by watching the size of the temporary files produced during the join.</span></font><fontface="Arial" size="2"><span style="font-size:10.0pt;font-family:Arial"></span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">Sincerely,</span></font><p class="MsoAutoSig"><font face="Times New Roman" size="3"><span style="font-size: 12.0pt"> </span></font><p class="MsoAutoSig"><font face="Times New Roman" size="3"><span style="font-size: 12.0pt">Dr. Ramon Lawrence</span></font><p class="MsoAutoSig"><font face="Times New Roman" size="3"><span style="font-size: 12.0pt">Assistant Professor, Department of Computer Science, University of British Columbia Okanagan</span></font><p class="MsoAutoSig"><fontface="Times New Roman" size="3"><span style="font-size: 12.0pt"><a href="http://people.ok.ubc.ca/rlawrenc/">http://people.ok.ubc.ca/rlawrenc/</a></span></font><p class="MsoAutoSig"><fontface="Times New Roman" size="3"><span lang="PT-BR" style="font-size:12.0pt">E-mail: </span><a href="mailto:ramon.lawrence@ubc.ca"><spanlang="PT-BR">ramon.lawrence@ubc.ca</span></a></font><span lang="PT-BR"></span><pclass="MsoNormal"><font face="Times New Roman" size="3"><span lang="PT-BR" style="font-size:12.0pt"> </span></font></div>
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes: > Our research group has been using the PostgreSQL code base to test new > join algorithms. During testing, we noticed that the planner is not > pushing down projections to the outer relation in a hash join. Although > this makes sense for in-memory (1 batch) joins, for joins larger than > memory (such as for TPC-H DSS), this causes the system to perform > significantly more disk I/Os when reading/writing batches of the outer > relation. Hm. The proposed patch seems a bit brute-force, since it loses the benefit of the physical-tlist optimization even if the relations are certainly too small to require batching. > A more complicated modification alternative is to add a state variable > to allow the planner to know how many batches the hash join expects and > only push down the projection if it is greater than one. However, > pushing the projection on the outer relation is almost always the best > choice as it eliminates unneeded attributes for operators above the hash > join in the plan and will be robust in the case of poor estimates. Nonetheless, I'm inclined to do it that way. The "robust in the case of poor estimates" argument doesn't convince me, because the incremental cost isn't *that* large if we get it wrong; and the other argument is just bogus because we don't do physical tlists at or above joins anyhow. regards, tom lane
On Tue, 2008-09-09 at 11:21 -0700, Lawrence, Ramon wrote: > Our research group has been using the PostgreSQL code base to test new > join algorithms. Sounds cool. I'm sure you'll come up with some good things. You might be interested in this also http://archives.postgresql.org/pgsql-hackers/2007-01/msg01600.php after which Greg Stark and I were investigating using alternate/compressed data structures to avoid the need to switch to multi-batch hash joins. If we knew we were dealing with nearly contiguous ranges of discrete values, we could store the missing values rather than the present values using an HRL encoded bitmap. Other ideas are possible also, I'm sure. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Attached is a patch that will disable the physical-tlist optimization for hash join if the number of batches is greater than 1. The patch and performance results were created by Michael Henderson (graduate student). To keep the changes simple, the update simply calls ExecChooseHashTableSize() in create_hashjoin_plan() to re-calculate the expected number of batches. This is more efficient and results in less code changes than modifying the HashPath struct to store the number of batches and updating that variable when costing (as cost_hashjoin() will be called many times during costing). We have also attached some performance results that show a dramatic effect when disabling the physical-tlist optimization for joins with more than one batch. I do not know the performance tradeoffs of using the physical-tlist optimization to avoid projection on the outer relation for joins with one batch. However, there is a potential huge penalty if the optimizer is wrong. If the optimizer suggests one batch, and on execution either due to poor estimates or data skew more than one batch is needed, then the join operator will perform considerably more I/Os on the outer relation that still contains the unnecessary attributes. An ideal solution would detect at execution time if the inner relation remained in memory (one batch) and decide to disable/enable the physical-tlist optimization on the outer relation accordingly. At this time, we are uncertain if this would be desirable or possible. Sincerely, Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan E-mail: ramon.lawrence@ubc.ca -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: September 9, 2008 6:47 PM To: Lawrence, Ramon Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Potential Join Performance Issue "Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes: > Our research group has been using the PostgreSQL code base to test new > join algorithms. During testing, we noticed that the planner is not > pushing down projections to the outer relation in a hash join. Although > this makes sense for in-memory (1 batch) joins, for joins larger than > memory (such as for TPC-H DSS), this causes the system to perform > significantly more disk I/Os when reading/writing batches of the outer > relation. Hm. The proposed patch seems a bit brute-force, since it loses the benefit of the physical-tlist optimization even if the relations are certainly too small to require batching. > A more complicated modification alternative is to add a state variable > to allow the planner to know how many batches the hash join expects and > only push down the projection if it is greater than one. However, > pushing the projection on the outer relation is almost always the best > choice as it eliminates unneeded attributes for operators above the hash > join in the plan and will be robust in the case of poor estimates. Nonetheless, I'm inclined to do it that way. The "robust in the case of poor estimates" argument doesn't convince me, because the incremental cost isn't *that* large if we get it wrong; and the other argument is just bogus because we don't do physical tlists at or above joins anyhow. regards, tom lane
Attachment
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes: > To keep the changes simple, the update simply calls > ExecChooseHashTableSize() in create_hashjoin_plan() to re-calculate the > expected number of batches. This is more efficient and results in less > code changes than modifying the HashPath struct to store the number of > batches and updating that variable when costing (as cost_hashjoin() will > be called many times during costing). I was intending to do it the other way, actually. An extra field in HashPath hardly costs anything. The other reason for it is that there are other possible uses for knowing whether a hash will be multi-batch. (For example, if we were prepared to tell the executor that it *must* keep the hash to one batch, we could assume that the sort order of the left input is preserved. I haven't looked into the risks/benefits of that too much, but it's been in the back of the mind for a long time.) > An ideal solution would detect at execution time if the inner relation > remained in memory (one batch) and decide to disable/enable the > physical-tlist optimization on the outer relation accordingly. At this > time, we are uncertain if this would be desirable or possible. That seems pretty infeasible really. Aside from changing plan node output tuple types on-the-fly, it would mean renumbering Vars in the join node to reference the outer relation's new output columns. The overhead of supporting that would be paid across-the-board in the executor whether or not anyone got any real benefit from it. I'd be more inclined to deal with the issue by trying to establish a "safety margin" in the estimate of whether the hash will go multi-batch. IOW we should disuse_physical_tlist if the hash is estimated to be close to but still within one batch. regards, tom lane
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > I was intending to do it the other way, actually. An extra field in > HashPath hardly costs anything. The other reason for it is that there > are other possible uses for knowing whether a hash will be multi-batch. > (For example, if we were prepared to tell the executor that it *must* > keep the hash to one batch, we could assume that the sort order of the > left input is preserved. I haven't looked into the risks/benefits of > that too much, but it's been in the back of the mind for a long time.) Having the number of batches in HashPath could be potentially useful for a variety of reasons. For our research, we have added an nbatch variable in both HashPath and HashJoin. Having it in HashJoin is useful as we modified EXPLAIN to output the number of batches. There are costs in putting an nbatch variable in HashPath as the system may set this variable potentially hundreds/thousands of times during costing and does not (currently) use it until you convert the chosen HashPath to a plan. > I'd be more inclined to deal with the issue by trying to establish a > "safety margin" in the estimate of whether the hash will go multi-batch. > IOW we should disuse_physical_tlist if the hash is estimated to be close > to but still within one batch. Our experiments with large TPC-H 1GB joins show that it is almost always better to not use physical_tlists if the number of batches is > 1. There is a noticeable (approximately 5-15%) improvement when using physical_tlists for in-memory joins. For batches of size 2, it sometimes can go either way depending how many attributes are projected out of the outer relation. Using physical_tlists may be better even for batches of size 2 if most of the attributes of the outer relation are kept. For a larger number of batches, the extra I/O cost significantly dominates over the physical_tlist optimization. Performance of multi-batch joins may improve 50% or more by disabling the optimization. It is possible to create a "safety margin" by having ExecChooseHashTableSize() return the value inner_rel_bytes/hash_table_bytes which represents the fraction of the memory available that the inner relation is expected to consume. You can then make decisions based on that. However, this is only as good as the inner relation size estimate and especially for large queries, the estimate may be quite inaccurate. A more robust solution could examine the "width" of the path and the "width" of the relation combined with the number of batches to see if projecting early would be worth it. It may be best to keep it simple and just use number of batches > 1 as a criteria and instead focus on examining issues with inaccurate join size estimates. -- Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan E-mail: ramon.lawrence@ubc.ca
Has this been completed? TODO item? --------------------------------------------------------------------------- Lawrence, Ramon wrote: > > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > > I was intending to do it the other way, actually. An extra field in > > HashPath hardly costs anything. The other reason for it is that there > > are other possible uses for knowing whether a hash will be > multi-batch. > > (For example, if we were prepared to tell the executor that it *must* > > keep the hash to one batch, we could assume that the sort order of the > > left input is preserved. I haven't looked into the risks/benefits of > > that too much, but it's been in the back of the mind for a long time.) > > Having the number of batches in HashPath could be potentially useful for > a variety of reasons. For our research, we have added an nbatch > variable in both HashPath and HashJoin. Having it in HashJoin is useful > as we modified EXPLAIN to output the number of batches. There are costs > in putting an nbatch variable in HashPath as the system may set this > variable potentially hundreds/thousands of times during costing and does > not (currently) use it until you convert the chosen HashPath to a plan. > > > I'd be more inclined to deal with the issue by trying to establish a > > "safety margin" in the estimate of whether the hash will go > multi-batch. > > IOW we should disuse_physical_tlist if the hash is estimated to be > close > > to but still within one batch. > > Our experiments with large TPC-H 1GB joins show that it is almost always > better to not use physical_tlists if the number of batches is > 1. > There is a noticeable (approximately 5-15%) improvement when using > physical_tlists for in-memory joins. For batches of size 2, it > sometimes can go either way depending how many attributes are projected > out of the outer relation. Using physical_tlists may be better even for > batches of size 2 if most of the attributes of the outer relation are > kept. For a larger number of batches, the extra I/O cost significantly > dominates over the physical_tlist optimization. Performance of > multi-batch joins may improve 50% or more by disabling the optimization. > > It is possible to create a "safety margin" by having > ExecChooseHashTableSize() return the value > inner_rel_bytes/hash_table_bytes which represents the fraction of the > memory available that the inner relation is expected to consume. You > can then make decisions based on that. However, this is only as good > as the inner relation size estimate and especially for large queries, > the estimate may be quite inaccurate. A more robust solution could > examine the "width" of the path and the "width" of the relation combined > with the number of batches to see if projecting early would be worth it. > It may be best to keep it simple and just use number of batches > 1 as a > criteria and instead focus on examining issues with inaccurate join size > estimates. > > -- > Dr. Ramon Lawrence > Assistant Professor, Department of Computer Science, University of > British Columbia Okanagan > E-mail: ramon.lawrence@ubc.ca > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> Has this been completed? TODO item? > > > I'd be more inclined to deal with the issue by trying to establish a > > > "safety margin" in the estimate of whether the hash will go > > multi-batch. > > > IOW we should disuse_physical_tlist if the hash is estimated to be > > close to but still within one batch. I do not know how this issue was resolved. It is an issue that is very important for multi-batch hash joins. The simplest resolution is to disable physical_tlist on the outer relation for hash joins of more than one batch. However, as discussed in the thread, more sophisticated solutions are also viable. -- Ramon Lawrence
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes: > Attached is a patch that will disable the physical-tlist optimization > for hash join if the number of batches is greater than 1. The patch and > performance results were created by Michael Henderson (graduate > student). I've applied the attached modified version of this patch. regards, tom lane Index: src/backend/nodes/outfuncs.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/nodes/outfuncs.c,v retrieving revision 1.355 diff -c -r1.355 outfuncs.c *** src/backend/nodes/outfuncs.c 21 Mar 2009 00:04:39 -0000 1.355 --- src/backend/nodes/outfuncs.c 26 Mar 2009 15:19:43 -0000 *************** *** 1448,1453 **** --- 1448,1454 ---- _outJoinPathInfo(str, (JoinPath *) node); WRITE_NODE_FIELD(path_hashclauses); + WRITE_INT_FIELD(num_batches); } static void Index: src/backend/optimizer/path/costsize.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v retrieving revision 1.205 diff -c -r1.205 costsize.c *** src/backend/optimizer/path/costsize.c 21 Mar 2009 00:04:39 -0000 1.205 --- src/backend/optimizer/path/costsize.c 26 Mar 2009 15:19:43 -0000 *************** *** 1880,1885 **** --- 1880,1887 ---- &numbatches, &num_skew_mcvs); virtualbuckets = (double) numbuckets *(double) numbatches; + /* mark the path with estimated # of batches */ + path->num_batches = numbatches; /* * Determine bucketsize fraction for inner relation. We use the smallest Index: src/backend/optimizer/plan/createplan.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/optimizer/plan/createplan.c,v retrieving revision 1.256 diff -c -r1.256 createplan.c *** src/backend/optimizer/plan/createplan.c 21 Mar 2009 00:04:39 -0000 1.256 --- src/backend/optimizer/plan/createplan.c 26 Mar 2009 15:19:44 -0000 *************** *** 1910,1915 **** --- 1910,1919 ---- /* We don't want any excess columns in the hashed tuples */ disuse_physical_tlist(inner_plan, best_path->jpath.innerjoinpath); + /* If we expect batching, suppress excess columns in outer tuples too */ + if (best_path->num_batches > 1) + disuse_physical_tlist(outer_plan, best_path->jpath.outerjoinpath); + /* * If there is a single join clause and we can identify the outer * variable as a simple column reference, supply its identity for Index: src/backend/optimizer/util/pathnode.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/optimizer/util/pathnode.c,v retrieving revision 1.150 diff -c -r1.150 pathnode.c *** src/backend/optimizer/util/pathnode.c 27 Feb 2009 00:06:27 -0000 1.150 --- src/backend/optimizer/util/pathnode.c 26 Mar 2009 15:19:44 -0000 *************** *** 1480,1488 **** pathnode->jpath.outerjoinpath = outer_path; pathnode->jpath.innerjoinpath = inner_path; pathnode->jpath.joinrestrictinfo = restrict_clauses; ! /* A hashjoin never has pathkeys, since its ordering is unpredictable */ pathnode->jpath.path.pathkeys = NIL; pathnode->path_hashclauses = hashclauses; cost_hashjoin(pathnode, root, sjinfo); --- 1480,1499 ---- pathnode->jpath.outerjoinpath = outer_path; pathnode->jpath.innerjoinpath = inner_path; pathnode->jpath.joinrestrictinfo = restrict_clauses; ! /* ! * A hashjoin never has pathkeys, since its output ordering is ! * unpredictable due to possible batching. XXX If the inner relation is ! * small enough, we could instruct the executor that it must not batch, ! * and then we could assume that the output inherits the outer relation's ! * ordering, which might save a sort step. However there is considerable ! * downside if our estimate of the inner relation size is badly off. ! * For the moment we don't risk it. (Note also that if we wanted to take ! * this seriously, joinpath.c would have to consider many more paths for ! * the outer rel than it does now.) ! */ pathnode->jpath.path.pathkeys = NIL; pathnode->path_hashclauses = hashclauses; + /* cost_hashjoin will fill in pathnode->num_batches */ cost_hashjoin(pathnode, root, sjinfo); Index: src/include/nodes/relation.h =================================================================== RCS file: /cvsroot/pgsql/src/include/nodes/relation.h,v retrieving revision 1.170 diff -c -r1.170 relation.h *** src/include/nodes/relation.h 5 Mar 2009 23:06:45 -0000 1.170 --- src/include/nodes/relation.h 26 Mar 2009 15:19:44 -0000 *************** *** 845,850 **** --- 845,851 ---- { JoinPath jpath; List *path_hashclauses; /* join clauses used for hashing */ + int num_batches; /* number of batches expected */ } HashPath; /*