Re: Potential Join Performance Issue - Mailing list pgsql-hackers

From: Lawrence, Ramon
Subject: Re: Potential Join Performance Issue
Msg-id: 6EEA43D22289484890D119821101B1DF2C1580@exchange20.mercury.ad.ubc.ca
In response to: Potential Join Performance Issue ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>)
Responses: Re: Potential Join Performance Issue; Re: Potential Join Performance Issue
List: pgsql-hackers
Attached is a patch that disables the physical-tlist optimization for a hash join if the number of batches is greater than 1. The patch and performance results were created by Michael Henderson (graduate student).

To keep the changes simple, the update calls ExecChooseHashTableSize() in create_hashjoin_plan() to re-calculate the expected number of batches. This is more efficient and requires fewer code changes than modifying the HashPath struct to store the number of batches and updating that variable during costing (as cost_hashjoin() will be called many times during costing).

We have also attached some performance results that show a dramatic effect when disabling the physical-tlist optimization for joins with more than one batch.

I do not know the performance tradeoffs of using the physical-tlist optimization to avoid projection on the outer relation for one-batch joins. However, there is a potentially huge penalty if the optimizer is wrong. If the optimizer expects one batch but, at execution time, more than one batch is needed (due to poor estimates or data skew), then the join operator will perform considerably more I/O on the outer relation, whose tuples still contain the unnecessary attributes.

An ideal solution would detect at execution time whether the inner relation remained in memory (one batch) and enable or disable the physical-tlist optimization on the outer relation accordingly. At this time, we are uncertain whether this would be desirable or possible.

Sincerely,

Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science
University of British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: September 9, 2008 6:47 PM
To: Lawrence, Ramon
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Potential Join Performance Issue

"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:
> Our research group has been using the PostgreSQL code base to test new
> join algorithms.
> During testing, we noticed that the planner is not pushing down
> projections to the outer relation in a hash join. Although this makes
> sense for in-memory (1 batch) joins, for joins larger than memory (such
> as for TPC-H DSS), this causes the system to perform significantly more
> disk I/Os when reading/writing batches of the outer relation.

Hm. The proposed patch seems a bit brute-force, since it loses the benefit of the physical-tlist optimization even if the relations are certainly too small to require batching.

> A more complicated modification alternative is to add a state variable
> to allow the planner to know how many batches the hash join expects and
> only push down the projection if it is greater than one. However,
> pushing the projection on the outer relation is almost always the best
> choice as it eliminates unneeded attributes for operators above the hash
> join in the plan and will be robust in the case of poor estimates.

Nonetheless, I'm inclined to do it that way. The "robust in the case of poor estimates" argument doesn't convince me, because the incremental cost isn't *that* large if we get it wrong; and the other argument is just bogus because we don't do physical tlists at or above joins anyhow.

			regards, tom lane